Images often communicate more than they literally depict: a set of tools can suggest an occupation and a cultural artifact can suggest a tradition. This kind of indirect visual reference, known as visual metonymy, invites viewers to recover a target concept via associated cues rather than explicit depiction. In this work, we present the first computational investigation of visual metonymy. We introduce a novel pipeline grounded in semiotic theory that leverages large language models and text-to-image models to generate metonymic visual representations. Using this framework, we construct ViMET, the first visual metonymy dataset, comprising 2,000 multiple-choice questions for evaluating the cognitive reasoning abilities of multimodal language models. Experimental results on our dataset reveal a significant gap between human performance (86.9%) and state-of-the-art vision-language models (65.9%), highlighting limitations in machines' ability to interpret indirect visual references. Our dataset is publicly available at: https://github.com/cincynlp/ViMET.
Zahra Babaiee, Peyman M. Kiasari, Daniela Rus, et al.
Recent advancements in multimodal large language models have driven breakthroughs in visual question answering. Yet a critical gap persists in `conceptualization': the ability to recognize and reason about the same concept despite variations in visual form, a basic capacity of human reasoning. To address this challenge, we introduce the Visual Graph Arena (VGA), a dataset featuring six graph-based tasks designed to evaluate and improve AI systems' capacity for visual abstraction. VGA uses diverse graph layouts (e.g., Kamada-Kawai vs. planar) to test reasoning independent of visual form. Experiments with state-of-the-art vision models and multimodal LLMs reveal a striking divide: humans achieved near-perfect accuracy across tasks, while models failed outright on isomorphism detection and showed only limited success in path and cycle tasks. We further identify behavioral anomalies suggesting pseudo-intelligent pattern matching rather than genuine understanding. These findings underscore fundamental limitations of current AI models in visual understanding. By isolating the challenge of representation-invariant reasoning, the VGA provides a framework to drive progress toward human-like conceptualization in AI visual models. The Visual Graph Arena is available at: \href{https://vga.csail.mit.edu/}{vga.csail.mit.edu}
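To make the layout-invariance idea concrete, here is a minimal sketch, assuming networkx and matplotlib (illustrative choices, not necessarily VGA's tooling), that renders one and the same graph under a Kamada-Kawai and a planar layout; any reasoning task defined on the graph is unchanged, only its visual form differs.

```python
# Render one graph under two layouts; the structure (and thus any reasoning
# task defined on it) is identical, only the drawing differs.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.wheel_graph(8)  # hypothetical example graph, not taken from VGA

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
for ax, (name, pos) in zip(axes, [("Kamada-Kawai", nx.kamada_kawai_layout(G)),
                                  ("Planar", nx.planar_layout(G))]):
    nx.draw(G, pos=pos, ax=ax, node_size=120)
    ax.set_title(name)
plt.savefig("same_graph_two_layouts.png")
```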
The architecture of multimodal large language models (MLLMs) commonly connects a vision encoder, often based on CLIP-ViT, to a large language model. While CLIP-ViT works well for capturing global image features, it struggles to model local relationships between adjacent patches, leading to weaker visual representations, which in turn limit the fine-grained understanding ability of MLLMs. To address this, we propose LLaVA-SP, which adds only six spatial visual tokens to the original visual tokens to enhance the visual representation. Our approach offers three key advantages: 1) We propose a novel Projector, which uses convolutional kernels to derive visual spatial tokens from ViT patch features, simulating two visual spatial ordering approaches: "from central region to global" and "from abstract to specific". A cross-attention mechanism is then applied to fuse fine-grained visual information, enriching the overall visual representation. 2) We present two model variants: LLaVA-SP-Cropping, which focuses on detail features through progressive cropping, and LLaVA-SP-Pooling, which captures global semantics through adaptive pooling, enabling the model to handle diverse visual understanding tasks. 3) Extensive experiments show that LLaVA-SP, fine-tuned with LoRA, achieves significant performance improvements across various multimodal benchmarks, outperforming the state-of-the-art LLaVA-1.5 model in multiple tasks with nearly identical inference latency. The code and models are available at https://github.com/CnFaker/LLaVA-SP.
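As a rough illustration of the kind of projector described above, the following PyTorch sketch derives a handful of extra spatial tokens from ViT patch features with strided convolutions and fuses them with the patch sequence via cross-attention. Shapes, kernel sizes, and module names here are assumptions for illustration, not the authors' LLaVA-SP implementation.

```python
# Minimal sketch: a few extra "spatial" tokens distilled from the ViT patch
# grid by strided convolutions, then fused with the patches via cross-attention.
import torch
import torch.nn as nn

class SpatialTokenProjector(nn.Module):
    def __init__(self, dim=1024, n_spatial_tokens=6):
        super().__init__()
        # Strided convs progressively shrink the 24x24 patch grid to a handful
        # of tokens (a coarse stand-in for the ordering described above).
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.AdaptiveAvgPool2d((n_spatial_tokens, 1)),
        )
        # Cross-attention: the spatial tokens query the full patch sequence.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, patch_feats):            # (B, 576, dim) from a 24x24 ViT grid
        B, N, D = patch_feats.shape
        side = int(N ** 0.5)
        grid = patch_feats.transpose(1, 2).reshape(B, D, side, side)
        spatial = self.conv(grid).flatten(2).transpose(1, 2)   # (B, 6, dim)
        fused, _ = self.cross_attn(spatial, patch_feats, patch_feats)
        return torch.cat([patch_feats, fused], dim=1)          # original + 6 extra tokens

feats = torch.randn(2, 576, 1024)
print(SpatialTokenProjector()(feats).shape)    # torch.Size([2, 582, 1024])
```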
This paper presents an artistic and technical investigation into the attention mechanisms of video diffusion transformers. Inspired by early video artists who manipulated analog video signals to create new visual aesthetics, this study proposes a method for extracting and visualizing cross-attention maps in generative video models. Built on the open-source Wan model, our tool provides an interpretable window into the temporal and spatial behavior of attention in text-to-video generation. Through exploratory probes and an artistic case study, we examine the potential of attention maps as both analytical tools and raw artistic material. This work contributes to the growing field of Explainable AI for the Arts (XAIxArts), inviting artists to reclaim the inner workings of AI as a creative medium.
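A minimal sketch of how such cross-attention maps could be captured in practice, using PyTorch forward hooks; the module naming ("cross_attn") and output convention are assumptions and would need adapting to the actual Wan codebase.

```python
# Capture cross-attention weights during a sampling pass via forward hooks.
import torch

def register_attention_hooks(model, store):
    handles = []
    for name, module in model.named_modules():
        if "cross_attn" in name:                       # assumed naming convention
            def hook(mod, inputs, output, name=name):
                # assumes the module returns (hidden_states, attn_weights)
                if isinstance(output, tuple) and len(output) > 1:
                    store.setdefault(name, []).append(output[1].detach().cpu())
            handles.append(module.register_forward_hook(hook))
    return handles

attn_store = {}
# handles = register_attention_hooks(wan_model, attn_store)  # hypothetical model handle
# ... run a text-to-video sampling loop, then average the stored maps per prompt
# token over timesteps and frames and save them as heatmaps for inspection.
```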
This research is part of the NeuroArtIfAct project, aimed at exploring the cognitive and emotional processes involved in perceiving Etruscan artifacts. This case study focuses on the Sarcophagus of the Spouses at the National Etruscan Museum in Rome. The study used AI and eye-tracking technology to analyze how viewers engage with the Etruscan Sarcophagus of the Spouses, revealing key patterns of visual attention and engagement. OpenAI's ChatGPT-4o was used in conjunction with Python to process the spreadsheets and data from the eye-tracking recordings. Results showed that viewers primarily focused on the central figures, especially the faces and hands, indicating a high level of interest in the human elements of the artifact. The longer fixation durations on these features suggest that viewers find them particularly engaging, likely due to their detailed craftsmanship and symbolic significance. The eye-tracking data also highlighted specific gaze patterns, such as diagonal scanning across the sarcophagus, which reflects the composition's ability to guide viewer attention strategically. These findings suggest that the artifact's design effectively captures and directs viewer attention, emphasizing its cultural and emotional significance.
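A minimal pandas sketch of the kind of spreadsheet processing described, aggregating fixation durations by area of interest; the file name and column names are assumptions, not the study's actual export format.

```python
# Aggregate fixation durations per area of interest (AOI) from an eye-tracking export.
import pandas as pd

fixations = pd.read_csv("eye_tracking_export.csv")   # hypothetical export file
summary = (fixations
           .groupby("aoi")["fixation_duration_ms"]
           .agg(total_ms="sum", mean_ms="mean", count="count")
           .sort_values("total_ms", ascending=False))
print(summary)   # regions like faces and hands would rank highest if the data match
```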
This paper proposes a comparative approach to images of the trade-union body in 1963 and 2017, in order to trace the continuities and shifts in the shaping of that masculine, organized corporality in two images from different historical moments: the 1960s and the second decade of the twenty-first century, with the aim of examining how that trade-union social body is delineated at each moment. To this end, we propose as our corpus the 1963 poster "¡Basta!" by the visual artist Ricardo Carpani (1930-1997) and the untitled 2017 photograph by the photographer Fernando de la Orden (1976-) taken for the newspaper Clarín.
In this paper, we discuss the opportunities and challenges of teaching a human-robot interaction course at an undergraduate liberal arts college. We provide a sample syllabus adapted from a previous version of a course.
Sergio Tascon-Morales, Pablo Márquez-Neila, Raphael Sznitman
With growing interest in recent years, medical visual question answering (Med-VQA) has rapidly evolved, with multimodal large language models (MLLMs) emerging as an alternative to classical model architectures. Specifically, their ability to add visual information to the input of pre-trained LLMs brings new capabilities for image interpretation. However, simple visual errors cast doubt on the actual visual understanding abilities of these models. To address this, region-based questions have been proposed as a means to assess and enhance actual visual understanding through compositional evaluation. To combine these two perspectives, this paper introduces targeted visual prompting to equip MLLMs with region-based questioning capabilities. By presenting the model with both the isolated region and the region in its context in a customized visual prompt, we show the effectiveness of our method across multiple datasets while comparing it to several baseline models. Our code and data are available at https://github.com/sergiotasconmorales/locvqallm.
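As a rough sketch of the "isolated region plus region in context" idea, the snippet below composes a single visual prompt from a cropped region and the full image with that region outlined; the layout and interface are illustrative assumptions, not the paper's exact implementation (see the linked repository for that).

```python
# Build a two-panel visual prompt: the isolated region and the region in context.
from PIL import Image, ImageDraw

def build_targeted_visual_prompt(image_path, box):
    """box = (left, top, right, bottom) for the region of interest."""
    image = Image.open(image_path).convert("RGB")
    crop = image.crop(box).resize(image.size)                        # isolated region
    context = image.copy()
    ImageDraw.Draw(context).rectangle(box, outline="red", width=4)   # region in context
    prompt = Image.new("RGB", (image.width * 2, image.height))
    prompt.paste(crop, (0, 0))
    prompt.paste(context, (image.width, 0))
    return prompt

# visual_prompt = build_targeted_visual_prompt("scan.png", (40, 60, 180, 200))
# The combined image would then be passed to the MLLM alongside the question.
```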
This text comments on Marcela Oliveira's article "Escrever a palavra fim; cem anos da morte de Proust", highlighting the aspects of the Deleuzian reading of the Recherche that Marcela mobilizes. In addition to revisiting the problem of the unity of a work of art in its relation to style, the importance and configuration of temporality in Proust, and the relation of signs to the temporal apprenticeship that is the very substance of the novel in question, I seek to indicate why philosophers of different orientations have taken an interest in Proust's literature, starting from the case of Deleuze.
In this essay, I make a case for transversal writing as a method for artistic research. I use transversal to describe a writing entity that is capable of shifting across media and disciplines, while remaining open for critical reflection. This method is akin to a metamodel, a model not confined to defining itself to the exclusion of other models. Rather, it serves as a heuristic tool for reflecting on how we proceed in an artistic research inquiry.
My inquiry is located in the fields of artist's film and video, fictocritical writing, and performance. I begin by briefly positioning artistic research as an intentional and original search for new knowledge, with the inherent value of radical openness. I introduce a metamodel for transversal writing, which encompasses but goes beyond the film essay tradition. The hybrid forms of essayistic film and fictocritical writing are shown to carry a common interest in radical openness, turning them into sites of new knowledge. Through a link with performative writing, the concept of the event completes my metaphor of writing from the milieu.
In conclusion, I propose a diagram and a narrative reflection on the making of the artwork seeds of degrowth, whose production is a site of my research.
Marloes Arts, Victor Garcia Satorras, Chin-Wei Huang, et al.
Coarse-grained (CG) molecular dynamics enables the study of biological processes at temporal and spatial scales that would be intractable at an atomistic resolution. However, accurately learning a CG force field remains a challenge. In this work, we leverage connections between score-based generative models, force fields and molecular dynamics to learn a CG force field without requiring any force inputs during training. Specifically, we train a diffusion generative model on protein structures from molecular dynamics simulations, and we show that its score function approximates a force field that can directly be used to simulate CG molecular dynamics. While having a vastly simplified training setup compared to previous work, we demonstrate that our approach leads to improved performance across several small- to medium-sized protein simulations, reproducing the CG equilibrium distribution, and preserving dynamics of all-atom simulations such as protein folding events.
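A schematic sketch of the central connection, assuming an overdamped Langevin integrator and a placeholder score network: the learned score is used directly as a (scaled) force to advance coarse-grained coordinates. Step size, noise scale, and the toy score are illustrative only.

```python
# Overdamped Langevin dynamics driven by a learned score acting as a force.
import torch

def langevin_simulate(score_fn, x0, n_steps=1000, step=1e-4, kT=1.0):
    """x0: (n_beads, 3) CG coordinates; score_fn(x) plays the role of force / kT."""
    x = x0.clone()
    traj = [x.clone()]
    for _ in range(n_steps):
        force = score_fn(x)                                   # score ~ force / kT
        noise = torch.randn_like(x)
        x = x + step * force + (2.0 * step * kT) ** 0.5 * noise
        traj.append(x.clone())
    return torch.stack(traj)

# Toy score: a harmonic well pulling beads toward the origin (illustration only).
toy_score = lambda x: -x
trajectory = langevin_simulate(toy_score, torch.randn(10, 3))
print(trajectory.shape)   # (n_steps + 1, 10, 3)
```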
Nick Bryan-Kinns, Corey Ford, Alan Chamberlain, et al.
This first international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 15th ACM Conference on Creativity and Cognition (C&C 2023).
A new class of tools, colloquially called generative AI, can produce high-quality artistic media for visual arts, concept art, music, fiction, literature, video, and animation. The generative capabilities of these tools are likely to fundamentally alter the creative processes by which creators formulate ideas and put them into production. As creativity is reimagined, so too may be many sectors of society. Understanding the impact of generative AI - and making policy decisions around it - requires new interdisciplinary scientific inquiry into culture, economics, law, algorithms, and the interaction of technology and creativity. We argue that generative AI is not the harbinger of art's demise, but rather is a new medium with its own distinct affordances. In this vein, we consider the impacts of this new medium on creators across four themes: aesthetics and culture, legal questions of ownership and credit, the future of creative work, and impacts on the contemporary media ecosystem. Across these themes, we highlight key research questions and directions to inform policy and beneficial uses of the technology.
Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds, and accurately recognizing such ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which uses visual information to help describe ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate the visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
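A minimal PyTorch sketch in the spirit of the described mechanism: audio features attend over video features, and a learned gate suppresses redundant visual context before a residual fusion. The dimensions and gating form are assumptions, not the paper's exact architecture.

```python
# Audio-queries-video cross-attention with a gated residual fusion.
import torch
import torch.nn as nn

class AudioVisualAttention(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, audio, video):        # audio: (B, Ta, D), video: (B, Tv, D)
        visual_ctx, _ = self.attn(audio, video, video)     # audio queries video
        g = self.gate(torch.cat([audio, visual_ctx], dim=-1))
        return audio + g * visual_ctx       # gated residual fusion

fused = AudioVisualAttention()(torch.randn(2, 50, 512), torch.randn(2, 30, 512))
print(fused.shape)   # torch.Size([2, 50, 512])
```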
Recently, attention-based Visual Question Answering (VQA) has achieved great success by utilizing the question to selectively attend to the visual areas related to the answer. Existing visual attention models are generally planar, i.e., different channels of the last conv-layer feature map of an image share the same weight. This conflicts with the attention mechanism because CNN features are naturally spatial and channel-wise. Also, visual attention models are usually conducted at the pixel level, which may cause region-discontinuity problems. In this paper, we propose a Cubic Visual Attention (CVA) model that applies novel channel and spatial attention on object regions to improve the VQA task. Specifically, instead of attending to pixels, we first take advantage of object proposal networks to generate a set of object candidates and extract their associated conv features. Then, we utilize the question to guide the channel attention and spatial attention calculation based on the conv-layer feature map. Finally, the attended visual features and the question are combined to infer the answer. We assess the performance of our proposed CVA on three public image QA datasets, including COCO-QA, VQA and Visual7W. Experimental results show that our proposed method significantly outperforms the state of the art.
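An illustrative PyTorch sketch of question-guided channel and spatial attention over object-region features, in the spirit of CVA; feature sizes and the fusion scheme are assumptions rather than the paper's exact model.

```python
# Question-guided channel attention (per feature channel) and spatial attention
# (per object region) over region-level conv features.
import torch
import torch.nn as nn

class QuestionGuidedCubicAttention(nn.Module):
    def __init__(self, vdim=2048, qdim=512):
        super().__init__()
        self.channel_scorer = nn.Linear(qdim + vdim, vdim)   # per-channel weights
        self.spatial_scorer = nn.Linear(qdim + vdim, 1)      # per-region weights

    def forward(self, regions, question):     # regions: (B, K, vdim), question: (B, qdim)
        B, K, _ = regions.shape
        q = question.unsqueeze(1).expand(-1, K, -1)
        joint = torch.cat([regions, q], dim=-1)
        ch = torch.sigmoid(self.channel_scorer(joint))         # channel attention
        sp = torch.softmax(self.spatial_scorer(joint), dim=1)  # spatial (region) attention
        return (regions * ch * sp).sum(dim=1)                  # attended feature: (B, vdim)

out = QuestionGuidedCubicAttention()(torch.randn(2, 36, 2048), torch.randn(2, 512))
print(out.shape)   # torch.Size([2, 2048])
```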
As a promotional audiovisual format, the music video is often synonymous with striking effects, intertextuality, and complex visual frames, functions for which it undoubtedly draws on whatever cutting-edge technology is available to it. This article analyzes the advantages it finds in integrating animated 3D models to develop unreal worlds as part of its mise-en-scène. From the earliest cases to the present, a series of examples is presented as a chronological sample of the evolution of 3D animation integrated into the music video, concluding with its contributions to the relationship between set design and narration and its suitability as a link to the referential frames used in the cinematic format.
Review of: Costa, Alexandre Rodrigues da (ed.). Corpos labirínticos: textos de Hans Bellmer. Translated by Alexandre Rodrigues da Costa. Rio de Janeiro: Gramma, 2019.