Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.
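To make the control flow concrete, here is a minimal sketch of selective test-time imagination under stated assumptions: the `mllm` and `world_model` objects and their `score_sufficiency` and `render_novel_view` methods are hypothetical placeholders for exposition, not AVIC's actual API.

```python
# Hypothetical sketch of adaptive test-time visual imagination.
# `mllm` stands in for a multimodal LLM, `world_model` for a generative
# world model; their interfaces are assumptions, not the paper's API.

def answer_with_adaptive_imagination(mllm, world_model, image, question,
                                     sufficiency_threshold=0.7,
                                     max_imagination_steps=3):
    """Answer a spatial question, imagining alternative views only when needed."""
    evidence = [image]
    for step in range(max_imagination_steps + 1):
        # Ask the model to judge whether current evidence suffices (0..1).
        sufficiency = mllm.score_sufficiency(evidence, question)
        if sufficiency >= sufficiency_threshold or step == max_imagination_steps:
            break  # evidence is sufficient, or the imagination budget is spent
        # Otherwise, imagine one additional viewpoint and add it as evidence.
        novel_view = world_model.render_novel_view(evidence, question)
        evidence.append(novel_view)
    return mllm.answer(evidence, question)
```

The threshold and step budget are the two knobs that trade accuracy against world-model calls, matching the selective-control framing above.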
The evolving nature of beauty in digital fine arts aesthetics signifies profound shifts within contemporary social and cultural contexts. Technological advancements have enabled artists to explore new mediums, creating works that reflect and respond to current societal issues. This study examines the paradigm shift in digital fine arts, focusing on the interpretation and manifestation of beauty in digital creations. It explores the interplay between digital aesthetics and societal realities, providing insights into the transformation of beauty in the digital age. An interdisciplinary approach is utilized, incorporating textual analysis of relevant literature, participatory observation at art exhibitions and online platforms, and a comprehensive literature review. These methods trace the evolution of beauty in digital fine arts and analyze how artists incorporate visual, auditory, and interactive elements. The research demonstrates that technological progress has significantly impacted the creation and perception of beauty in fine arts. This transformation includes broader definitions of aesthetics, integrating multisensory experiences and community participation. Digital fine arts now serve as a dynamic medium that reflects diverse social, political, and cultural themes. The transformation of beauty in this field underscores the complex relationships among art, technology, and society. Digital artworks provide innovative aesthetic experiences and mirror contemporary social dynamics. Despite concerns regarding the authenticity and preservation of digital art, its role in expanding artistic expression and fostering innovation is undeniable.
Summary
The article is oriented towards aesthetic-visual methods and approaches in research with children. The context is children's encounters with school concerts, where professional musicians perform music and present instruments. The study is tied to a pilot project and is grounded in arts-based research, with the aesthetic, the performative, the post-qualitative, and the material as pivots for creating actions, meaning, and knowledge. Using participant observation, video-recorded interviews with children, and children's drawings as concrete methods, the article unfolds children's encounters with school concerts as complex aesthetic-performative processes of becoming. The analyses show that, in their encounters with school concerts and through drawing, the children highlight material, musical, and human phenomena and relations, and that these hold value for the children. The article shows and argues that aesthetic-visual methods and post-qualitative perspectives contribute to a view of research, children, and school concerts as complex phenomena and moments that come into being through relational encounters with human and nonhuman forces.
Keywords
school concert, arts-based research, instruments, children's drawings, materiality
Abstract
This article turns towards aesthetic-visual methods and perspectives in research with children. The context is children’s encounters with school concerts (within the Danish compulsory school system) performed by professional musicians and their instruments. The study connects to a pilot project and takes an arts-based approach, with attention to aesthetic, performative, post-qualitative, and material ways of creating actions, meaning, and knowledge. Analysis of qualitative empirical material, such as field notes and interviews with children including their drawings, shows how encounters with school concerts are aesthetic-performative processes of becoming in which (intra)relations between instruments, music, musicians, and children are vital and vibrant.
Keywords
school concert, arts-based research, instruments, children’s drawings, materiality
The article investigates fashion film as a paratext to the materiality of fashion. It outlines methodology and creative practice in bridging traditional archives with contemporary digital storytelling as a means of knowledge transfer. Fashion heritage encompasses tangible and intangible assets alike, including clothes, design practices, cultural narratives, and brand identities. Although previous scholarship has commented on the marketing and aesthetics of fashion films, there is yet to be a comprehensive analysis of their potential as a viable medium for documenting and communicating fashion’s historical and cultural legacies. The research illustrates how documentary fashion films archive creative processes and brand histories by unpacking a series of case studies. It highlights the role of fashion film festivals as platforms that amplify heritage narratives while fostering discussions on sustainability, diversity, and innovation. The study further proposes the development of a dedicated online database for fashion films, offering a participatory and accessible resource to preserve and disseminate fashion heritage. The article aims to add to the discussion surrounding fashion archiving and preservation by arguing for fashion films both as aesthetic devices for representing culture and as integral storytelling and archiving tools that can preserve identity from one generation to another.
We introduce Visual Persona, a foundation model for text-to-image full-body human customization that, given a single in-the-wild human image, generates diverse images of the individual guided by text descriptions. Unlike prior methods that focus solely on preserving facial identity, our approach captures detailed full-body appearance, aligning with text descriptions for body structure and scene variations. Training this model requires large-scale paired human data, consisting of multiple images per individual with consistent full-body identities, which is notoriously difficult to obtain. To address this, we propose a data curation pipeline leveraging vision-language models to evaluate full-body appearance consistency, resulting in Visual Persona-500K, a dataset of 580k paired human images across 100k unique identities. For precise appearance transfer, we introduce a transformer encoder-decoder architecture adapted to a pre-trained text-to-image diffusion model, which decomposes the input image into distinct body regions, encodes these regions as local appearance features, and projects them into dense identity embeddings independently to condition the diffusion model for synthesizing customized images. Visual Persona consistently surpasses existing approaches, generating high-quality, customized images from in-the-wild inputs. Extensive ablation studies validate design choices, and we demonstrate the versatility of Visual Persona across various downstream tasks.
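As an illustration of region-wise identity conditioning, the sketch below encodes each body region independently and projects it into dense identity tokens. The module names, dimensions, and two-layer encoder are assumptions chosen for exposition, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Illustrative sketch of region-wise identity conditioning: each body region
# (e.g. face / torso / arms / legs crops) is encoded separately and projected
# independently into identity tokens that could condition a diffusion model
# via cross-attention. All sizes here are hypothetical.

class RegionIdentityEncoder(nn.Module):
    def __init__(self, num_regions=4, feat_dim=768, id_dim=1024):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                       batch_first=True),
            num_layers=2)
        # One projection per region, applied independently.
        self.proj = nn.ModuleList(
            [nn.Linear(feat_dim, id_dim) for _ in range(num_regions)])

    def forward(self, region_feats):
        # region_feats: (batch, num_regions, tokens, feat_dim)
        ids = []
        for r, proj in enumerate(self.proj):
            encoded = self.encoder(region_feats[:, r])  # local appearance
            ids.append(proj(encoded))                   # dense identity tokens
        return torch.cat(ids, dim=1)  # (batch, num_regions * tokens, id_dim)
```

Projecting each region with its own head, rather than pooling the whole body into one vector, is what keeps localized appearance details separable in the conditioning signal.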
The retinal afterimage is a widely known effect in the human visual system, which has been studied and used in the context of a number of major art movements. Therefore, when considering the general role of computation in the visual arts, the question arises whether this effect, too, may be induced using partly automated techniques. If so, it may become a computationally controllable ingredient of (interactive) visual art, and thus take its place among the many other aspects of visual perception that have already preceded it in this sense. The present moment provides additional inspiration to lay the groundwork for extending computer graphics in general with the retinal afterimage: historically, we are in a phase where some head-mounted stereoscopic AR/VR technologies now provide eye tracking by default, thereby allowing real-time monitoring of the processes of visual fixation that can induce the retinal afterimage. A logical starting point for general investigation is then shape display via the retinal afterimage, since shape recognition lends itself well to unambiguous reporting. Shape recognition, however, may also occur due to normal vision, which happens simultaneously. Carefully and rigorously excluding this possibility, we develop computational techniques enabling shape display exclusive to the retinal afterimage.
Pablo Acuaviva, Aram Davtyan, Mariam Hassan, et al.
Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
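A lightweight adapter of the general kind described above can be as simple as a residual bottleneck MLP inserted into a frozen pretrained backbone. The sketch below shows one standard formulation (the paper's exact adapter design is an assumption here); only the adapter's parameters would be trained per task.

```python
import torch
import torch.nn as nn

# A minimal bottleneck adapter: a small residual MLP added inside a frozen
# backbone, so task adaptation touches only these few parameters.
# The bottleneck width is an illustrative choice, not the paper's setting.

class BottleneckAdapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        # Zero-init the up-projection so the adapter starts as an identity map.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual update
```

Zero-initializing the up-projection means training begins exactly from the pretrained model's behavior, which is what makes such adapters a fair probe of the backbone's inductive biases.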
While the cultural phenomenon known as vaporwave is commonly traced back to 2009, its defining characteristics remain a subject of ongoing debate within both popular culture and academic circles. Various perspectives categorize it as a microgenre of electronic music, a meme, an art movement, a critique of capitalism, or even a manifestation of pure aesthetics. As such, vaporwave remains a complex and multifaceted topic for research.
In this paper, I explore the influence of post-internet culture on the formation of vaporwave and its aesthetics by analyzing the methods and strategies used to create what is recognized as vaporwave, rather than attempting to label or define it precisely. As a further step in this inquiry, I document an attempt to apply these methods and strategies, resulting in the publication of four music albums. This practice-based approach to analyzing vaporwave through creation and publication helps to better understand some core qualities and aesthetics of this art movement.
We propose a novel self-supervised approach for learning audio and visual representations from unlabeled videos, based on their correspondence. The approach uses an attention mechanism to learn the relative importance of convolutional features extracted at different resolutions from the audio and visual streams, and uses the attention features to encode the audio and visual input based on their correspondence. We evaluated the representations learned by the model on classifying audio-visual correlation as well as on recommending sound effects for visual scenes. Our results show that the representations generated by the attention model improve correlation accuracy by 18% and recommendation accuracy by 10% over the baseline on VGG-Sound, a public video dataset. Additionally, audio-visual representations learned by training the attention model with cross-modal contrastive learning further improve recommendation performance, based on our evaluation using VGG-Sound and a more challenging dataset consisting of gameplay video recordings.
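The cross-modal contrastive objective mentioned above is commonly instantiated as a symmetric InfoNCE loss; the sketch below shows that standard form (the paper's exact formulation and temperature value are assumptions). Audio and visual clips from the same video act as positives, all other pairings in the batch as negatives.

```python
import torch
import torch.nn.functional as F

# Standard symmetric InfoNCE loss for cross-modal contrastive learning.
# Matching audio/visual embeddings (same row index) are pulled together;
# every other pairing in the batch serves as a negative.

def cross_modal_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    audio_emb = F.normalize(audio_emb, dim=-1)    # (batch, dim)
    visual_emb = F.normalize(visual_emb, dim=-1)  # (batch, dim)
    logits = audio_emb @ visual_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    # Symmetric: audio-to-visual and visual-to-audio retrieval directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```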
Many artwork collections contain textual attributes that provide rich and contextualised descriptions of artworks. Visual grounding offers the potential for localising subjects from these descriptions in images; however, existing approaches are trained on natural images and generalise poorly to art. In this paper, we present CIGAr (Context-Infused GroundingDINO for Art), a visual grounding approach which utilises the artwork descriptions during training as context, thereby enabling visual grounding on art. In addition, we present a new dataset, Ukiyo-eVG, with manually annotated phrase-grounding annotations, and we set a new state-of-the-art for object detection on two artwork datasets.
Systematic compositionality, or the ability to adapt to novel situations by creating a mental model of the world using reusable pieces of knowledge, remains a significant challenge in machine learning. While there has been considerable progress in the language domain, efforts towards systematic visual imagination, or envisioning the dynamical implications of a visual observation, are in their infancy. We introduce the Systematic Visual Imagination Benchmark (SVIB), the first benchmark designed to address this problem head-on. SVIB offers a novel framework for a minimal world modeling problem, where models are evaluated based on their ability to generate one-step image-to-image transformations under latent world dynamics. The framework provides benefits such as the possibility of jointly optimizing for systematic perception and imagination, a range of difficulty levels, and the ability to control the fraction of possible factor combinations used during training. We provide a comprehensive evaluation of various baseline models on SVIB, offering insight into the current state-of-the-art in systematic visual imagination. We hope that this benchmark will help advance visual systematic compositionality.
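Conceptually, this kind of evaluation reduces to scoring one-step predictions against ground-truth next frames. The sketch below uses per-pixel MSE as a stand-in metric; the benchmark's actual scoring and data interfaces are assumptions here.

```python
import numpy as np

# Illustrative evaluation loop for one-step image-to-image world modeling:
# a model must map the observation at time t to the observation at t+1
# under the latent dynamics. The MSE metric is a simplifying stand-in.

def evaluate_one_step_imagination(model, dataset):
    errors = []
    for obs_t, obs_t1 in dataset:      # pairs of consecutive observations
        pred_t1 = model(obs_t)         # predicted next observation (array)
        errors.append(np.mean((pred_t1 - obs_t1) ** 2))  # per-pixel MSE
    return float(np.mean(errors))
```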
Christina Stoiber, Conny Walchshofer, Margit Pohl, et al.
Comprehending and exploring large and complex data is becoming increasingly important for users in a wide range of application domains. Still, non-experts in visual data analysis often have problems correctly reading and interpreting information from visualizations that are new to them. To support novices in learning how to use new digital technologies, the concept of onboarding has been successfully applied in other fields, and first approaches also exist in the visualization domain. However, empirical evidence on the effectiveness of such approaches is scarce. Therefore, we conducted three studies: 1) First, we explored the effect of visualization onboarding, using an interactive step-by-step guide, on user performance for four increasingly complex visualization techniques. We performed a between-subject experiment with 596 participants in total. The results showed no significant differences in answer correctness with and without onboarding. Furthermore, participants commented that no onboarding is needed for highly familiar visualization types. 2) Second, we performed another study with MTurk workers to assess whether user performance differs across onboarding types: interactive step-by-step guide, scrollytelling tutorial, and video tutorial. The study revealed that the video tutorial was ranked most positively on average, based on sentiment analysis, followed by the scrollytelling tutorial and the interactive step-by-step guide. 3) Third, in a study with students, we gathered data on users' experience with an in-situ scrollytelling approach for a VA tool. The results showed that they preferred scrollytelling over a tutorial integrated into the landing page. In summary, the in-situ scrollytelling approach works well for visualization onboarding, and a video tutorial can help to introduce interaction techniques.
The massive spread of visual content through the web and social media poses both challenges and opportunities. Tracking visually similar content is an important task for studying and analyzing social phenomena related to the spread of such content. In this paper, we address this need by building a dataset of social media images and evaluating visual near-duplicate retrieval methods based on image retrieval and several advanced visual feature extraction methods. We evaluate the methods using a large-scale dataset of images crawled from social media, together with manipulated versions we generated, and present promising results in terms of recall. We demonstrate the potential of this method in two case studies: one that shows the value of creating systems supporting manual content review, and another that demonstrates the usefulness of automatic large-scale data analysis.
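A minimal version of the retrieval pipeline described above: embed every image with a pretrained encoder and rank candidates by cosine similarity. The `embed` function and the similarity threshold below are placeholders, not the paper's exact configuration.

```python
import numpy as np

# Minimal near-duplicate retrieval sketch: a pretrained visual encoder
# (the `embed` callable, assumed to return a 1-D feature vector) maps
# images to embeddings; near-duplicates are the nearest neighbors above
# a cosine-similarity threshold.

def build_index(embed, images):
    feats = np.stack([embed(img) for img in images])
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def retrieve_near_duplicates(embed, query, index, top_k=10, threshold=0.9):
    q = embed(query)
    q = q / np.linalg.norm(q)
    sims = index @ q                       # cosine similarity to all images
    ranked = np.argsort(-sims)[:top_k]
    # Keep only candidates above the similarity threshold.
    return [(int(i), float(sims[i])) for i in ranked if sims[i] >= threshold]
```

The threshold trades recall against precision: lowering it catches more heavily manipulated versions at the cost of more false matches for manual review.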
My memories of Mario Ridolfi are infinite in number, all of them as clear, even in the smallest details, as those we preserve of our loved ones, of our teachers.
I can see him drawing Casa Lina and the Love knot in front of me; I always sat at the first table directly facing his desk.
He drew with a fine-pointed Pelikan fountain pen on ‘carta burro,’ a translucent tracing paper of minimal thickness used for architectural drawings, which allows you to superimpose successive ideas and correct them.
In his brown leather bag, Ridolfi kept his small bottle of black ink and a roll of this paper. I never saw him throw away any of the sheets of paper he had drawn on, not even the ones he rejected immediately because he was not satisfied with the solution.
And I was there, in front of him, learning to draw perspective views and observing him...
Daniel Seebacher, Maximilian T. Fischer, Rita Sevastjanova, et al.
Large-scale interaction networks of human communication are often modeled as complex graph structures, obscuring temporal patterns within individual conversations. To facilitate the understanding of such conversational dynamics, episodes with low or high communication activity as well as breaks in communication need to be detected to enable the identification of temporal interaction patterns. Traditional episode detection approaches are highly dependent on the choice of parameters, such as window size or binning resolution. In this paper, we present a novel technique for the identification of relevant episodes in bi-directional interaction sequences from abstract communication networks. We model communication as a continuous density function, allowing for a more robust segmentation into individual episodes and estimation of communication volume. Additionally, we define a tailored feature set to characterize conversational dynamics and enable a user-steered classification of communication behavior. We apply our technique to a real-world corpus of email data from a large European research institution. The results show that our technique allows users to effectively define, identify, and analyze relevant communication episodes.
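The core idea of modeling communication as a continuous density function can be sketched with a kernel density estimate over message timestamps, thresholded into contiguous episodes. The bandwidth handling and quantile threshold below are simplifying assumptions, not the paper's tuned technique.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Sketch of density-based episode segmentation: message timestamps are
# smoothed into a continuous activity estimate, and contiguous regions
# above a threshold become communication episodes.

def detect_episodes(timestamps, grid_points=1000, threshold_quantile=0.5):
    density = gaussian_kde(timestamps)          # continuous activity estimate
    grid = np.linspace(min(timestamps), max(timestamps), grid_points)
    activity = density(grid)
    active = activity > np.quantile(activity, threshold_quantile)
    episodes, start = [], None
    for i, flag in enumerate(active):
        if flag and start is None:
            start = grid[i]                     # episode begins
        elif not flag and start is not None:
            episodes.append((start, grid[i]))   # episode ends
            start = None
    if start is not None:
        episodes.append((start, grid[-1]))      # close a trailing episode
    return episodes
```

Because the density is continuous, the segmentation is less sensitive to the hard bin boundaries that make window- or binning-based approaches parameter-fragile.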
The expression of human art, and supposedly of sentient art in general, is modulated by the available rendition, receiving, and communication techniques. The components or instruments of these techniques ultimately exhibit a physical, and in particular quantum, layer, which in turn translates into physical and technological capacities to comprehend and utilize what is possible in our universe. In this sense, we can apply a sort of Church-Turing thesis to art, or at least to its rendition.
"Art is the lie that enables us to realize the truth." - Pablo Picasso. For centuries, humans have dedicated themselves to producing arts to convey their imagination. The advancement in technology and deep learning in particular, has caught the attention of many researchers trying to investigate whether art generation is possible by computers and algorithms. Using generative adversarial networks (GANs), applications such as synthesizing photorealistic human faces and creating captions automatically from images were realized. This survey takes a comprehensive look at the recent works using GANs for generating visual arts, music, and literary text. A performance comparison and description of the various GAN architecture are also presented. Finally, some of the key challenges in art generation using GANs are highlighted along with recommendations for future work.