AVGen-Bench: A Task-Driven Benchmark for Multi-Granular Evaluation of Text-to-Audio-Video Generation
Ziwei Zhou, Zeyuan Lai, Rui Wang
et al.
Text-to-Audio-Video (T2AV) generation is rapidly becoming a core interface for media creation, yet its evaluation remains fragmented. Existing benchmarks largely assess audio and video in isolation or rely on coarse embedding similarity, failing to capture the fine-grained joint correctness required by realistic prompts. We introduce AVGen-Bench, a task-driven benchmark for T2AV generation featuring high-quality prompts across 11 real-world categories. To support comprehensive assessment, we propose a multi-granular evaluation framework that combines lightweight specialist models with Multimodal Large Language Models (MLLMs), enabling evaluation from perceptual quality to fine-grained semantic controllability. Our evaluation reveals a pronounced gap between strong audio-visual aesthetics and weak semantic reliability, including persistent failures in text rendering, speech coherence, physical reasoning, and a universal breakdown in musical pitch control. Code and benchmark resources are available at http://aka.ms/avgenbench.
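To make the multi-granular idea concrete, here is a minimal sketch of how specialist perceptual scores and MLLM-based semantic checks might be combined into one evaluation, purely as an illustration: the helper functions, weights, and score names below are hypothetical stand-ins, not the AVGen-Bench API.

```python
# Illustrative sketch only: stubbed specialist models and MLLM checklist,
# combined into perceptual / semantic / overall scores.
from statistics import mean

def specialist_scores(video_path: str, audio_path: str) -> dict:
    """Placeholder for lightweight perceptual models (A/V sync, aesthetics)."""
    return {"av_sync": 0.90, "visual_aesthetics": 0.80, "audio_quality": 0.85}

def mllm_checklist(prompt: str, video_path: str, audio_path: str) -> list[bool]:
    """Placeholder for MLLM yes/no probes derived from the prompt
    (text rendering, speech coherence, physical plausibility, pitch)."""
    return [True, True, False, False]

def evaluate(prompt: str, video_path: str, audio_path: str) -> dict:
    perceptual = mean(specialist_scores(video_path, audio_path).values())
    checks = mllm_checklist(prompt, video_path, audio_path)
    semantic = sum(checks) / len(checks)   # fraction of fine-grained checks passed
    return {"perceptual": perceptual, "semantic": semantic,
            "overall": 0.5 * perceptual + 0.5 * semantic}

print(evaluate("a violinist plays an ascending C-major scale", "clip.mp4", "clip.wav"))
```

A gap between a high perceptual score and a low semantic score in such a scheme mirrors the paper's central finding: strong aesthetics, weak fine-grained controllability.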
PaperBanana: Automating Academic Illustration for AI Scientists
Dawei Zhu, Rui Meng, Yale Song
et al.
Despite rapid advances in autonomous AI scientists powered by language models, generating publication-ready illustrations remains a labor-intensive bottleneck in the research workflow. To lift this burden, we introduce PaperBanana, an agentic framework for automated generation of publication-ready academic illustrations. Powered by state-of-the-art VLMs and image generation models, PaperBanana orchestrates specialized agents to retrieve references, plan content and style, render images, and iteratively refine via self-critique. To rigorously evaluate our framework, we introduce PaperBananaBench, comprising 292 test cases for methodology diagrams curated from NeurIPS 2025 publications, covering diverse research domains and illustration styles. Comprehensive experiments demonstrate that PaperBanana consistently outperforms leading baselines in faithfulness, conciseness, readability, and aesthetics. We further show that our method effectively extends to the generation of high-quality statistical plots. Collectively, PaperBanana paves the way for the automated generation of publication-ready illustrations.
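As a rough sketch of the agentic plan-render-critique loop the abstract describes (all functions below are illustrative stubs, not PaperBanana's actual agents or models):

```python
# Hedged sketch of a render-critique-refine loop; every component is a stub.
def plan(spec: str) -> str:
    return f"layout plan for: {spec}"            # VLM planner stub

def render(plan_text: str) -> bytes:
    return plan_text.encode()                    # image-generation stub

def critique(image: bytes, spec: str) -> tuple[float, str]:
    score = 0.95 if b"revised" in image else 0.6  # self-critique stub
    return score, "increase label font; align arrows"

def generate_figure(spec: str, max_rounds: int = 3, threshold: float = 0.9) -> bytes:
    plan_text = plan(spec)
    image = render(plan_text)
    for _ in range(max_rounds):
        score, feedback = critique(image, spec)
        if score >= threshold:
            break
        plan_text = f"revised {plan_text} | fix: {feedback}"  # iterative refinement
        image = render(plan_text)
    return image

print(generate_figure("two-stage encoder-decoder methodology diagram"))
```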
ChartOptimiser: Task-driven Optimisation of Chart Designs
Yao Wang, Jiarong Pan, Danqing Shi
et al.
Automated chart design has seen significant advancements with the emergence of Large Language Models (LLMs), which offer a practical solution for generating charts. However, LLMs frequently introduce potentially critical design failures, such as data manipulation and confabulation. While expert users can mitigate these issues through iterative prompt engineering, this process requires substantial design knowledge and effort, leaving a major barrier for the general public. In this paper, we present ChartOptimiser, an automated method for generating chart designs with fidelity, efficiency, and effectiveness. Given the inter-dependencies between individual design parameters, ChartOptimiser employs Bayesian optimisation to effectively search the chart design space, guided by a novel objective function grounded in four perceptual metrics. Our empirical evaluations on bar and pie charts demonstrate that ChartOptimiser eliminates iterative design loops, providing non-expert users with high-quality charts that outperform LLM-generated designs in chart clarity, task-solving ease, and visual aesthetics.
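To illustrate the general shape of such a search (not the paper's actual parameters or metrics), here is a minimal Bayesian-optimisation sketch using scikit-optimize, with toy proxies standing in for the four perceptual metrics:

```python
# Sketch: Bayesian optimisation over a toy chart-design space.
# The objective terms are simplified stand-ins, not the paper's formulation.
from skopt import gp_minimize
from skopt.space import Real

def perceptual_objective(params):
    bar_width, font_size, whitespace = params
    clarity = -abs(bar_width - 0.6)            # toy proxy metrics
    readability = -abs(font_size - 12) / 12
    data_ink = -whitespace * 0.5
    balance = -abs(whitespace - 0.2)
    return -(clarity + readability + data_ink + balance)  # minimise negative score

space = [Real(0.2, 0.9, name="bar_width"),
         Real(8, 20, name="font_size"),
         Real(0.05, 0.5, name="whitespace")]

result = gp_minimize(perceptual_objective, space, n_calls=25, random_state=0)
print("best design:", result.x, "score:", -result.fun)
```

Bayesian optimisation is a natural fit here because each objective evaluation (a perceptual assessment of a full chart rendering) is expensive relative to the surrogate model's update.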
Agentic Design Review System
Sayan Nag, K J Joseph, Koustava Goswami
et al.
Evaluating graphic designs involves assessing them along multiple facets, such as alignment, composition, aesthetics, and color choices. Evaluating designs holistically requires aggregating feedback from individual expert reviewers. Towards this, we propose an Agentic Design Review System (AgenticDRS), in which multiple agents collaboratively analyze a design, orchestrated by a meta-agent. A novel in-context exemplar selection approach based on graph matching and a unique prompt expansion method play a central role in making each agent design-aware. To evaluate this framework, we propose the DRS-BENCH benchmark. Thorough experimental evaluation against state-of-the-art baselines adapted to the problem setup, backed up by critical ablation experiments, demonstrates the efficacy of AgenticDRS in evaluating graphic designs and generating actionable feedback. We hope that this work will attract attention to this pragmatic, yet under-explored research direction.
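One way to picture the graph-matching exemplar-selection idea (an assumption-laden stand-in, not AgenticDRS's actual method) is to represent each design's layout as a small graph and pick the exemplar with the smallest graph edit distance:

```python
# Illustrative sketch: choose an in-context exemplar by layout-graph similarity.
import networkx as nx

def layout_graph(elements, edges):
    g = nx.Graph()
    g.add_nodes_from(elements)
    g.add_edges_from(edges)   # edge = spatial adjacency between design elements
    return g

query = layout_graph(["title", "hero", "cta"], [("title", "hero"), ("hero", "cta")])
exemplars = {
    "poster_a": layout_graph(["title", "image", "cta"],
                             [("title", "image"), ("image", "cta")]),
    "banner_b": layout_graph(["logo", "text"], [("logo", "text")]),
}

# Smaller edit distance = structurally closer layout.
best = min(exemplars, key=lambda k: nx.graph_edit_distance(query, exemplars[k]))
print("selected in-context exemplar:", best)
```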
Multi-Modal Language Models as Text-to-Image Model Evaluators
Jiahui Chen, Candace Ross, Reyhane Askari-Hemmat
et al.
The steady improvement of text-to-image (T2I) generative models leads to the gradual deprecation of automatic evaluation benchmarks that rely on static datasets, motivating researchers to seek alternative ways to measure T2I progress. In this paper, we explore the potential of multi-modal large language models (MLLMs) as evaluator agents that interact with a T2I model, with the objective of assessing prompt-generation consistency and image aesthetics. We present Multimodal Text-to-Image Eval (MT2IE), an evaluation framework that iteratively generates prompts for evaluation, scores the generated images, and matches the T2I evaluations of existing static benchmarks with a fraction of their prompts. Moreover, we show that MT2IE's prompt-generation consistency scores correlate more strongly with human judgment than scores previously introduced in the literature. MT2IE generates prompts that efficiently probe T2I model performance, producing the same relative T2I model rankings as existing benchmarks while using only 1/80th as many prompts.
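A minimal sketch of the iterative evaluator-agent loop, with every model call stubbed (the proposal, generation, and scoring functions below are hypothetical, not MT2IE's implementation):

```python
# Sketch: an MLLM proposes probe prompts, scores T2I outputs, and adapts.
import random

random.seed(0)

def mllm_propose_prompt(history):
    base = ["a red cube on a blue sphere", "three cats wearing hats",
            "a clock showing 7:15"]
    # Stub: escalate difficulty based on the previous probe.
    return random.choice(base) if not history else history[-1][0] + ", photorealistic"

def t2i_generate(prompt):
    return f"<image for '{prompt}'>"          # T2I model stub

def mllm_score(prompt, image):
    return random.uniform(0.4, 1.0)           # consistency + aesthetics judge stub

history = []
for step in range(5):                          # far fewer prompts than a static benchmark
    prompt = mllm_propose_prompt(history)
    image = t2i_generate(prompt)
    history.append((prompt, mllm_score(prompt, image)))

print("model score:", sum(s for _, s in history) / len(history))
```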
PixelArena: A benchmark for Pixel-Precision Visual Intelligence
Feng Liang, Sizhe Cheng, Chenqi Yi
et al.
Omni-modal models with multimodal input and output are emerging. However, benchmarking their multimodal generation, especially image generation, is challenging due to the subtleties of human preferences and model biases. Many image generation benchmarks focus on aesthetics rather than the fine-grained generation capabilities of these models, failing to evaluate their visual intelligence with objective metrics. In PixelArena, we propose using semantic segmentation tasks to objectively examine fine-grained generative intelligence with pixel precision. With our benchmark and experiments, we find that the latest Gemini 3 Pro Image exhibits emergent image generation capabilities, producing semantic masks with high fidelity in zero-shot settings and showcasing previously unseen visual intelligence and true generalization to new image generation tasks. We further investigate its results, compare them qualitatively and quantitatively with those of other models, and present failure cases. These findings not only signal exciting progress in the field but also offer insights for future research on dataset development, omni-modal model development, and metric design.
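Segmentation quality of this kind is typically measured with mean intersection-over-union; a self-contained sketch of that objective, pixel-precise metric (the benchmark's exact protocol may differ):

```python
# Sketch: mean IoU between a generated mask and the ground-truth segmentation.
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                     # ignore classes absent from both masks
            ious.append(inter / union)
    return float(np.mean(ious))

gt = np.zeros((64, 64), dtype=int); gt[16:48, 16:48] = 1    # toy square object
pred = np.zeros_like(gt);           pred[20:48, 16:44] = 1  # imperfect prediction
print(f"mIoU: {mean_iou(pred, gt, num_classes=2):.3f}")
```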
UltraFlux: Data-Model Co-Design for High-quality Native 4K Text-to-Image Generation across Diverse Aspect Ratios
Tian Ye, Song Fei, Lei Zhu
Diffusion transformers have recently delivered strong text-to-image generation around 1K resolution, but we show that extending them to native 4K across diverse aspect ratios exposes a tightly coupled failure mode spanning positional encoding, VAE compression, and optimization. Tackling any of these factors in isolation leaves substantial quality on the table. We therefore take a data-model co-design view and introduce UltraFlux, a Flux-based DiT trained natively at 4K on MultiAspect-4K-1M, a 1M-image 4K corpus with controlled multi-AR coverage, bilingual captions, and rich VLM/IQA metadata for resolution- and AR-aware sampling. On the model side, UltraFlux couples (i) Resonance 2D RoPE with YaRN for training-window-, frequency-, and AR-aware positional encoding at 4K; (ii) a simple, non-adversarial VAE post-training scheme that improves 4K reconstruction fidelity; (iii) an SNR-Aware Huber Wavelet objective that rebalances gradients across timesteps and frequency bands; and (iv) a Stage-wise Aesthetic Curriculum Learning strategy that concentrates high-aesthetic supervision on the high-noise steps governed by the model prior. Together, these components yield a stable, detail-preserving 4K DiT that generalizes across wide, square, and tall ARs. On the Aesthetic-Eval at 4096 benchmark and multi-AR 4K settings, UltraFlux consistently outperforms strong open-source baselines across fidelity, aesthetic, and alignment metrics, and, with an LLM prompt refiner, matches or surpasses the proprietary Seedream 4.0.
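The exact SNR-Aware Huber Wavelet objective is specific to the paper; the following PyTorch sketch shows only the general shape of SNR-weighted Huber regression for diffusion training, using a Min-SNR-style clamp as an assumed weighting (the wavelet decomposition is omitted):

```python
# Hedged sketch: SNR-weighted Huber loss for diffusion training.
import torch
import torch.nn.functional as F

def snr_aware_huber(pred, target, snr, gamma: float = 5.0, delta: float = 1.0):
    """pred/target: (B, C, H, W) model output vs. regression target.
    snr: (B,) per-sample signal-to-noise ratio at the sampled timestep."""
    per_pixel = F.huber_loss(pred, target, reduction="none", delta=delta)
    # Clamping the SNR keeps high-noise (low-SNR) steps from being drowned out.
    weight = torch.clamp(snr, max=gamma) / gamma
    return (weight.view(-1, 1, 1, 1) * per_pixel).mean()

pred = torch.randn(4, 3, 32, 32, requires_grad=True)
target = torch.randn(4, 3, 32, 32)
snr = torch.tensor([0.2, 1.5, 8.0, 40.0])   # one SNR per sampled timestep
loss = snr_aware_huber(pred, target, snr)
loss.backward()
print(loss.item())
```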
Synthesizing Artifact Dataset for Pixel-level Detection
Dennis Menn, Feng Liang, Diana Marculescu
Artifact detectors have been shown to enhance the performance of image-generative models by serving as reward models during fine-tuning. These detectors enable the generative model to improve overall output fidelity and aesthetics. However, training the artifact detector requires expensive pixel-level human annotations that specify the artifact regions, and the lack of annotated data limits detector performance. A naive pseudo-labeling approach, training a weak detector and using it to annotate unlabeled images, suffers from noisy labels and yields poor performance. To address this, we propose an artifact corruption pipeline that automatically injects artifacts into clean, high-quality synthetic images in predetermined regions, thereby producing pixel-level annotations without manual labeling. The proposed method enables training an artifact detector that achieves performance improvements of 13.2% for ConvNeXt and 3.7% for Swin-T over baseline approaches, as verified on human-labeled data. This work represents an initial step toward scalable pixel-level artifact annotation datasets that integrate world knowledge into artifact detection.
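A toy sketch of the corruption idea: distort a chosen region of a clean image and emit the pixel-level mask for free. The actual pipeline's artifact types are richer than this single local-noise example:

```python
# Sketch: inject a synthetic artifact and get its pixel-level label for free.
import numpy as np

def inject_artifact(image: np.ndarray, top: int, left: int, size: int, rng):
    """Returns (corrupted image, binary artifact mask)."""
    corrupted = image.copy()
    patch = corrupted[top:top + size, left:left + size]
    noise = rng.normal(0, 40, patch.shape)               # local distortion
    corrupted[top:top + size, left:left + size] = np.clip(patch + noise, 0, 255)
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    mask[top:top + size, left:left + size] = 1           # annotation, no human labeling
    return corrupted, mask

rng = np.random.default_rng(0)
clean = np.full((128, 128, 3), 200, dtype=np.uint8)
corrupted, mask = inject_artifact(clean, top=32, left=48, size=24, rng=rng)
print(corrupted.shape, mask.sum(), "labeled artifact pixels")
```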
Experiencing More-than-Human Through Human Augmentation
Botao 'Amber' Hu, Danlin Huang
The recent more-than-human turn in design calls for attentiveness to nonhuman beings. Yet, as Thomas Nagel's famous "What is it like to be a bat?" thought experiment highlights, human experience is constrained by our own sensorium and an irreducible gap in phenomenal access to nonhuman Umwelten. Grounded in eco-phenomenology and eco-somatics, this paper proposes Experiencing More-than-Human through Human Augmentation (MtHtHA, or ">HtH+"), a design approach that repurposes human augmentation technologies, typically aimed at enhancing human capabilities for human optimization, to create temporary, embodied, first-person experiences that modulate the human sensorium to approximate nonhuman sensory experiences, cultivating ecological awareness, empathy, and care across species boundaries. We articulate seven design principles, report five design cases (EchoVision: bat-like echolocation; FeltSight: star-nosed-mole tactile navigation; FungiSync: fungal network attunement; TentacUs: octopus-like distributed agency; City of Sparkles: urban data from an AI's perspective), and discuss implications for more-than-human aesthetics and design practice.
AI-Driven Multi-Stage Computer Vision System for Defect Detection in Laser-Engraved Industrial Nameplates
Adhish Anitha Vilasan, Stephan Jäger, Noah Klarmann
Automated defect detection in industrial manufacturing is essential for maintaining product quality and minimizing production errors. In air disc brake manufacturing, ensuring the precision of laser-engraved nameplates is crucial for accurate product identification and quality control. Engraving errors, such as misprints or missing characters, can compromise both aesthetics and functionality, leading to material waste and production delays. This paper presents a proof of concept for an AI-driven computer vision system that inspects and verifies laser-engraved nameplates, detecting defects in logos and alphanumeric strings. The system integrates object detection using YOLOv7, optical character recognition (OCR) with Tesseract, and anomaly detection through a residual variational autoencoder (ResVAE) along with other computer vision methods to enable comprehensive inspections at multiple stages. Experimental results demonstrate the system's effectiveness, achieving 91.33% accuracy and 100% recall, ensuring that defective nameplates are consistently detected and addressed. This solution highlights the potential of AI-driven visual inspection to enhance quality control, reduce manual inspection efforts, and improve overall manufacturing efficiency.
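A simplified sketch of the multi-stage check: OCR the engraved string and compare it to the expected one, gated by detection and anomaly stages. The YOLOv7 and ResVAE components are stubbed here since they require trained weights; only the pytesseract call reflects a real library API, and the file path and threshold are illustrative assumptions:

```python
# Sketch: multi-stage nameplate inspection with stubbed detection/anomaly stages.
# Running this requires a local image file and a Tesseract installation.
from PIL import Image
import pytesseract

def detect_nameplate(image):          # stand-in for the YOLOv7 detection stage
    return image                       # assume the crop is the full frame here

def anomaly_score(crop) -> float:      # stand-in for the ResVAE reconstruction error
    return 0.02

def inspect(path: str, expected_text: str) -> bool:
    crop = detect_nameplate(Image.open(path))
    ocr_text = pytesseract.image_to_string(crop).strip()
    text_ok = ocr_text.replace(" ", "") == expected_text.replace(" ", "")
    return text_ok and anomaly_score(crop) < 0.1   # pass only if all stages agree

print(inspect("nameplate.jpg", "AB-1234-XY"))
```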
A Narrative Scoping Review of Neuroaesthetics and Objective Understanding of Human Appearance
Bishara Atiyeh, MD, Paul Beaineh, MD, Edwin Chrabieh, MD
et al.
Background: Neuroaesthetics has made meaningful progress in unraveling the neurological mechanisms involved in the perception of various arts, landscapes, and architecture. Only recently have the neural processes of artistic valuation and judgment been investigated with regard to the aesthetics of the human body. The current narrative review summarizes recent advances in this field.
Methods: A literature review of the PubMed, MEDLINE, Embase, PsycInfo, and Web of Science databases was conducted. Cerebral areas involved in aesthetic experience and judgment are reviewed, and the biological neural basis of judgment processes is analyzed in the multidimensional preference space, with specific attention to visual perception and beauty appraisal of the human body.
Results: Twenty-nine experimental studies specifically related to the neurobiological basis of aesthetic appraisal and judgments regarding the human face and body were identified.
Conclusions: Aesthetic judgment emerges from the interaction among emotion–valuation, sensory–motor, and meaning–knowledge processes. There is objective evidence that human body beauty perception is shaped by natural selection and activates brain regions distinct from those activated by visual arts. It may still be early to grasp all the implications of neuroaesthetics, and much research is still needed. However, progress in the field holds promise for generating a more holistic understanding that will most probably have an impact on aesthetic interventions, which must aim to reproduce the characteristics of ideal mates. It also holds promise for objectively quantifying the outcomes of aesthetic interventions.
Long Take in Cinema: A Philosophical Analysis (Based on the Works of Andrei Tarkovsky)
Anastasia V. Kucherova
The aim of the article is to formulate philosophical approaches to interpreting the long take as an aesthetic tool in the cinema of Andrei Tarkovsky. The problematic of the study lies in the possibility of a more comprehensive analysis of the long take as an aesthetic and philosophical tool in his work. It seems especially important to consider this cinematographic technique as a practice endowed with various philosophical meanings, which can potentially enrich both existing studies of long takes and the analysis of Tarkovsky's films. The relevance of the article stems from the revival of interest in contemporary cinema in contemplative aesthetics and long takes as a counterweight to the rapid, clip-style editing of mass cinema (a tendency associated with the phenomenon of "slow cinema"), as well as from the fact that Tarkovsky's legacy continues to exert a significant influence on world cinema. The results of the study demonstrate that the long take in Tarkovsky's works represents a distinctive aesthetic and philosophical practice. Based on the material of his films, the relationship between shot duration and the ontological, existential, and phenomenological aspects of film language is revealed. It is argued that the long take can be considered a means of securing the "photographic authenticity" of cinema and enhancing its realism; this activates the viewer's thinking and memory, forming the basis for their independent interpretations of what they have seen. The long take can also be considered a way of involving viewers as participants in the sensory experience of the film, going beyond direct narration. In addition, it becomes an instrument that allows time to manifest itself in its natural flow, creating conditions for understanding its ontological nature. The long take contributes to blurring the boundaries between the real and the imaginary and to disclosing the transcendental, and it also creates conditions for an ethical dialogue with the viewer. The work is addressed to film scholars, philosophers, cultural researchers, and a wider audience interested in cinema and its philosophical aspects.
Debugging Defective Visualizations: Empirical Insights Informing a Human-AI Co-Debugging System
Shuyu Shen, Sirong Lu, Leixian Shen
et al.
Visualization authoring is an iterative process that requires users to adjust parameters to achieve the desired aesthetics. Due to its complexity, users often create defective visualizations and struggle to fix them. Many seek help on forums (e.g., Stack Overflow), while others turn to AI, yet little is known about the strengths and limitations of these approaches, or how they can be effectively combined. We analyze Vega-Lite debugging cases from Stack Overflow, categorizing the types of questions askers pose, evaluating human responses, and assessing AI performance. Guided by these findings, we design a human-AI co-debugging system that combines LLM-generated suggestions with forum knowledge. We evaluated this system in a user study on 36 unresolved problems, comparing it with forum answers and LLM baselines. Our results show that while forum contributors provide accurate but slow solutions and LLMs offer immediate but sometimes misaligned guidance, the hybrid system resolves 86% of cases, a higher rate than either approach alone.
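A rough sketch of the hybrid idea: pair an immediate LLM suggestion with a retrieved forum fix and keep whichever candidate validates. All components below are stubs; the study's actual system operates on real Vega-Lite specs and Stack Overflow threads:

```python
# Illustrative stub of a hybrid forum+LLM debugging loop for Vega-Lite specs.
import json

def llm_suggest_fix(spec, symptom):
    fixed = dict(spec); fixed["mark"] = "bar"          # stub: LLM-proposed patch
    return fixed

def retrieve_forum_fix(symptom):
    return {"mark": "bar",                             # stub: forum-derived fix
            "encoding": {"x": {"field": "a", "type": "nominal"}}}

def validates(spec):
    return "mark" in spec and "encoding" in spec       # stub schema check

broken = {"encoding": {"x": {"field": "a", "type": "nominal"}}}  # missing mark
for candidate in (retrieve_forum_fix("no bars render"),
                  llm_suggest_fix(broken, "no bars render")):
    if candidate and validates(candidate):
        print("accepted fix:", json.dumps(candidate))
        break
```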
A Diffusion-Based Approach for Simulating Forward-in-Time State-Dependent Speciation and Extinction Dynamics
Albert C. Soewongsono, Michael J. Landis
We establish a general framework using a diffusion approximation to simulate forward-in-time state counts or frequencies for cladogenetic state-dependent speciation-extinction (ClaSSE) models. We apply the framework to various two- and three-region geographic-state speciation-extinction (GeoSSE) models. We show that the species range state dynamics simulated under tree-based and diffusion-based processes are comparable. We derive a method to infer rate parameters that are compatible with given observed stationary state frequencies and obtain an analytical result to compute stationary state frequencies for a given set of rate parameters. We also describe a procedure to find the time to reach the stationary frequencies of a ClaSSE model using our diffusion-based approach, which we demonstrate using a worked example for a two-region GeoSSE model. Finally, we discuss how the diffusion framework can be applied to formalize relationships between evolutionary patterns and processes under state-dependent diversification scenarios.
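For intuition, a generic Euler–Maruyama sketch of simulating a state frequency forward in time under a diffusion approximation; the drift and diffusion functions below are toy placeholders, not the paper's derived ClaSSE/GeoSSE coefficients:

```python
# Sketch: forward-in-time frequency simulation via Euler-Maruyama.
import numpy as np

def simulate_frequency(x0, drift, diff, t_max=50.0, dt=0.01, seed=0):
    rng = np.random.default_rng(seed)
    steps = int(t_max / dt)
    x = np.empty(steps + 1); x[0] = x0
    for i in range(steps):
        dw = rng.normal(0.0, np.sqrt(dt))              # Brownian increment
        x[i + 1] = x[i] + drift(x[i]) * dt + diff(x[i]) * dw
        x[i + 1] = min(max(x[i + 1], 0.0), 1.0)        # frequencies stay in [0, 1]
    return x

# Toy coefficients: mean reversion toward 0.3 with frequency-dependent noise.
traj = simulate_frequency(0.8,
                          drift=lambda x: 0.5 * (0.3 - x),
                          diff=lambda x: 0.1 * np.sqrt(max(x * (1 - x), 0.0)))
print("approx. stationary frequency:", traj[-1000:].mean())
```

Averaging the tail of such a trajectory is one crude way to estimate a stationary frequency, echoing the paper's analytical treatment of stationary state frequencies.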
KITTEN: A Knowledge-Intensive Evaluation of Image Generation on Visual Entities
Hsin-Ping Huang, Xinyi Wang, Yonatan Bitton
et al.
Recent advances in text-to-image generation have improved the quality of synthesized images, but evaluations mainly focus on aesthetics or alignment with text prompts. Thus, it remains unclear whether these models can accurately represent a wide variety of realistic visual entities. To bridge this gap, we propose KITTEN, a benchmark for Knowledge-InTensive image generaTion on real-world ENtities. Using KITTEN, we conduct a systematic study of the latest text-to-image models and retrieval-augmented models, focusing on their ability to generate real-world visual entities such as landmarks and animals. Analysis using carefully designed human evaluations, automatic metrics, and MLLM evaluations shows that even advanced text-to-image models fail to generate accurate visual details of entities. While retrieval-augmented models improve entity fidelity by incorporating reference images, they tend to over-rely on them and struggle to create novel configurations of the entity in creative text prompts.
Artistic Intelligence: A Diffusion-Based Framework for High-Fidelity Landscape Painting Synthesis
Wanggong Yang, Yifei Zhao
Generating high-fidelity landscape paintings remains a challenging task that requires precise control over both structure and style. In this paper, we present LPGen, a novel diffusion-based model specifically designed for landscape painting generation. LPGen introduces a decoupled cross-attention mechanism that independently processes structural and stylistic features, effectively mimicking the layered approach of traditional painting techniques. LPGen also introduces a structural controller, a multi-scale encoder designed to control the layout of landscape paintings, striking a balance between aesthetics and composition. The model is pre-trained on a curated dataset of high-resolution landscape images, categorized by distinct artistic styles, and then fine-tuned to ensure detailed and consistent output. Through extensive evaluations, LPGen demonstrates superior performance in producing paintings that are not only structurally accurate but also stylistically coherent, surpassing current state-of-the-art models. This work advances AI-generated art and offers new avenues for exploring the intersection of technology and traditional artistic practices. Our code, dataset, and model weights will be publicly available.
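A hedged PyTorch sketch of what a decoupled cross-attention block can look like: separate attention branches for structure and style tokens whose outputs are fused residually. The dimensions, fusion rule, and scale parameter are illustrative assumptions, not LPGen's published architecture:

```python
# Sketch: two independent cross-attention branches (structure vs. style).
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.struct_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.style_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, struct_tokens, style_tokens, style_scale=1.0):
        s_out, _ = self.struct_attn(x, struct_tokens, struct_tokens)
        a_out, _ = self.style_attn(x, style_tokens, style_tokens)
        return x + s_out + style_scale * a_out   # residual fusion of both branches

x = torch.randn(2, 64, 256)          # latent image tokens
struct = torch.randn(2, 16, 256)     # layout/structure features
style = torch.randn(2, 16, 256)      # artistic-style features
print(DecoupledCrossAttention()(x, struct, style).shape)
```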
A Comparison of Kalpataru Motif Visualizations across Temples of the Ancient Mataram Kingdom, 8th–10th Centuries CE
Herliyana Rosalinda, Nurulfatmi Amzy, Edo Galasro Limbong
This study examines the diverse visualizations of the Kalpataru motif in ancient Mataram temples. As significant religious symbols, these temples embodied societal values and were constructed with meticulous attention to materials, architecture, and aesthetics. By chronologically comparing the Kalpataru motif in Hindu and Buddhist temples, this research explores its evolving meaning and form. The Kalpataru, representing purity and sacredness in Hindu-Buddhist thought, appears prominently in temples such as Borobudur, Mendut, Pawon, Prambanan, and Sojiwan. While its origins lie in Buddhist symbolism, associated with Siddhartha Gautama's enlightenment, the motif underwent significant transformation in Hindu temples, particularly Prambanan, where it became more decorative and incorporated animal figures. This study employs a historical methodology, encompassing heuristic, critical, interpretive, and historiographical stages, to analyze these changes and their implications for understanding ancient Mataram culture.
Retrospective analysis on prognosis of oral cancer patients according to surgical approaches for effective cancer ablation: swing approach versus visor approach
Yun-Ho Kim, Jae-Young Yang, Dong-Min Lee
et al.
Background: For the surgical treatment of oral cancer, it is sometimes necessary to expand intraoral access within the oral cavity. Two approaches are mainly used: the "swing approach", which involves splitting the lower lip and a temporary mandibular osteotomy, and the "visor approach", which splits neither the lower lip nor the mandible. This study analyzed postoperative outcomes such as complications, recurrence rate, and survival rate for these two approaches. The goal of this study is to evaluate the surgical outcomes of patients treated with these two approaches, to propose effective perioperative management for oral cancer surgery, and to compare the prognosis of oral cancer patients.
Materials and methods: From 2005 to 2020, 29 patients who underwent surgery at the Department of Oral and Maxillofacial Surgery of Pusan National University Dental Hospital for oral cancer lesions occurring in the mandible, floor of the mouth, and tongue were selected for the study. Based on the surgical approach used, a chart review was conducted of various prognostic clinical factors, such as the patients' sex and age, primary site, TNM stage, histopathologic grade, recurrence and metastasis, postoperative survival rate, adjuvant chemoradiation therapy, satisfaction with aesthetics/function/swallowing, length of hospital stay, tracheostomy and its duration, and neck dissection and its type. Statistical analysis was conducted using Fisher's exact test in SPSS 25.0 (SPSS Inc., Chicago, IL).
Results: There was no statistically significant difference between the two groups in clinical and pathological findings such as survival rate, the need for adjuvant therapies, and the local recurrence rate. Although better outcomes were observed in function, aesthetics, and postoperative complications in the visor approach group, the difference between the two groups was still not statistically significant. However, the duration of hospital stay was shorter in the visor approach group.
Conclusions: There was no statistically significant difference in clinical prognostic factors between the swing approach and the visor approach. Therefore, when choosing between the two approaches for the ablation of oral cancer, it is advisable to select the approach that provides easier access given the size and location of the lesion. The visor approach offered advantages in aesthetics and healing time.
Generating Animatable 3D Cartoon Faces from Single Portraits
Chuanyu Pan, Guowei Yang, Taijiang Mu
et al.
With the boom in virtual reality (VR) technology, there is a growing need for customized 3D avatars. However, traditional methods for 3D avatar modeling are either time-consuming or fail to retain similarity to the person being modeled. We present a novel framework to generate animatable 3D cartoon faces from a single portrait image. We first transfer an input real-world portrait to a stylized cartoon image with a StyleGAN. We then propose a two-stage reconstruction method to recover the 3D cartoon face with detailed texture, which first makes a coarse estimation based on template models and then refines the model by non-rigid deformation under landmark supervision. Finally, we propose a semantics-preserving face rigging method based on manually created templates and deformation transfer. Compared with prior art, qualitative and quantitative results show that our method achieves better accuracy, aesthetics, and similarity. Furthermore, we demonstrate the capability of real-time facial animation of our 3D model.
Optimisation-Based Multi-Modal Semantic Image Editing
Bowen Li, Yongxin Yang, Steven McDonagh
et al.
Image editing affords increased control over the aesthetics and content of generated images. Pre-existing works focus predominantly on text-based instructions to achieve desired image modifications, which limits edit precision and accuracy. In this work, we propose an inference-time editing optimisation designed to extend beyond textual edits and accommodate multiple types of editing instruction (e.g., spatial layout; pose, scribbles, edge maps). We disentangle the editing task into two competing subtasks, successful local image modification and global content-consistency preservation, each guided by a dedicated loss function. By allowing users to adjust the influence of each loss function, we build a flexible editing solution that can be tuned to user preferences. We evaluate our method using text, pose, and scribble edit conditions, and highlight our ability to achieve complex edits through both qualitative and quantitative experiments.
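A minimal sketch of the two-loss trade-off: optimise an edit latent against a local modification loss and a global consistency loss with user-tunable weights. The toy L2 terms, edit region, and weights below are illustrative stand-ins for the paper's actual guidance losses:

```python
# Sketch: inference-time optimisation balancing local edit vs. global consistency.
import torch

latent = torch.randn(1, 4, 32, 32, requires_grad=True)
original = latent.detach().clone()
edit_target = torch.zeros(1, 4, 8, 8)            # desired content in the edit region
opt = torch.optim.Adam([latent], lr=0.05)

w_edit, w_consistency = 1.0, 0.3                 # user-adjustable trade-off
for step in range(200):
    opt.zero_grad()
    region = latent[:, :, 12:20, 12:20]          # local edit subtask
    loss_edit = (region - edit_target).pow(2).mean()
    loss_global = (latent - original).pow(2).mean()   # preserve everything else
    (w_edit * loss_edit + w_consistency * loss_global).backward()
    opt.step()

print("final losses:", loss_edit.item(), loss_global.item())
```

Raising `w_consistency` biases the solution toward the unedited image, while raising `w_edit` prioritises the local modification, which is the flexibility the abstract describes.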