Results for "Visual arts"

Showing 20 of ~3,385,418 results · from CrossRef, DOAJ, Semantic Scholar, arXiv

DOAJ Open Access 2026
Enrico Castellani and the 1972-1973 experiments: grid, decentralisation and transparency in Lavagna and Topologema

Michele Giuliano Cipolla

This article investigates works produced by Enrico Castellani in the early 1970s. In light of the theoretical positions Castellani developed in the 1960s and the following decades, it rereads two types of formal articulation executed between 1972 and 1973, Lavagna and the Topologema series, in which the genesis of an unprecedented speculative need on the author's part can be traced. Without ever neglecting Castellani's 'classic' work, his famous monochrome Superfici sensitised by a punctuation of nails, and indeed always developing the discourse with reference to that canonical work, the article shows how the explorations the artist carried out in almost the same months on canvas and slate investigate the formal opportunities offered by the constituent elements of painting. Point, line and surface are the foundation of a reflection Castellani had begun by the end of the 1950s, to which are now added the coordinates of the grid, decentralisation and transparency, values that in the 1970s the artist seems to recode within a new technical-theoretical research. Extending as far as the plastic outcome of this trend in a sculpture from 2015, the article records these experiments precisely as processes of testing and verifying the artist's primary idea, comparing his writings with the texts of international critics, such as Alloway and Krauss, who were simultaneously identifying similar paths in the United States.

arXiv Open Access 2026
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

Jiwan Kim, Kibum Kim, Wonjoong Kim et al.

Recently, visual token pruning has been studied to handle the vast number of visual tokens in Multimodal Large Language Models. However, we observe that while existing pruning methods perform reliably on simple visual understanding, they struggle to effectively generalize to complex visual reasoning tasks, a critical gap underexplored in previous studies. Through a systematic analysis, we identify Relevant Visual Information Shift (RVIS) during decoding as the primary failure driver. To address this, we propose Decoding-stage Shift-aware Token Pruning (DSTP), a training-free add-on framework that enables existing pruning methods to align visual tokens with shifting reasoning requirements during the decoding stage. Extensive experiments demonstrate that DSTP significantly mitigates performance degradation of pruning methods in complex reasoning tasks, while consistently yielding performance gains even across visual understanding benchmarks. Furthermore, DSTP demonstrates effectiveness across diverse state-of-the-art architectures, highlighting its generalizability and efficiency with minimal computational overhead.
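
The abstract describes DSTP only at a high level. As a minimal sketch of the general decoding-stage idea (ours, not the authors' implementation), visual tokens can be re-selected at every generation step so the kept set tracks shifting relevance; the `model` interface returning per-step cross-attention is a hypothetical assumption for illustration:

```python
# Illustrative sketch only -- not the paper's DSTP implementation.
import torch

def select_visual_tokens(cross_attn: torch.Tensor, keep: int) -> torch.Tensor:
    """cross_attn: (num_kept_tokens,) attention mass from the current
    decoding step onto each currently kept visual token."""
    return torch.topk(cross_attn, k=keep).indices

def decode_with_shift_aware_pruning(model, visual_tokens, prompt_ids,
                                    keep=64, max_new=32):
    """Hypothetical loop: `model` is assumed to return (logits, cross_attn)
    per call, with cross_attn over the kept visual tokens. This interface
    is our assumption, not a real MLLM API."""
    generated = prompt_ids
    kept = torch.arange(visual_tokens.size(0))      # begin with all tokens
    for _ in range(max_new):
        logits, cross_attn = model(visual_tokens[kept], generated)
        # Re-rank within the kept set at each step, so the selection can
        # follow the shifting relevance of the ongoing reasoning chain.
        k = min(keep, kept.numel())
        kept = kept[select_visual_tokens(cross_attn, k)]
        next_id = logits[-1].argmax().unsqueeze(0)  # greedy next token
        generated = torch.cat([generated, next_id])
    return generated
```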

en cs.CV
arXiv Open Access 2026
Visual Set Program Synthesizer

Zehua Cheng, Wei Dai, Wenhu Zhang et al.

A user pointing their phone at a supermarket shelf and asking "Which soda has the least sugar?" poses a difficult challenge for current visual AI assistants. Such queries require not only object recognition, but explicit set-based reasoning such as filtering, comparison, and aggregation. Standard end-to-end MLLMs often fail at these tasks because they lack an explicit mechanism for compositional logic. We propose treating visual reasoning as Visual Program Synthesis, where the model first generates a symbolic program that is executed by a separate engine grounded in visual scenes. We also introduce Set-VQA, a new benchmark designed specifically for evaluating set-based visual reasoning. Experiments show that our approach significantly outperforms state-of-the-art baselines on complex reasoning tasks, producing more systematic and transparent behavior while substantially improving answer accuracy. These results demonstrate that program-driven reasoning provides a principled alternative to black-box visual-language inference.
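
As a toy illustration of the program-synthesis idea (not the paper's actual DSL or execution engine, whose details the abstract does not give), a generated set program for the soda question might be executed like this:

```python
# Toy "visual set program" over detected objects, run by a plain engine.
from dataclasses import dataclass

@dataclass
class DetectedObject:          # stand-in for an upstream recognizer's output
    name: str
    attributes: dict

def run_program(scene, program):
    """Execute a tiny filter/compare/aggregate program on the scene."""
    result = scene
    for op, arg in program:
        if op == "filter":
            result = [o for o in result if arg(o)]
        elif op == "argmin":
            result = min(result, key=arg)
    return result

# "Which soda has the least sugar?" as a set program:
scene = [
    DetectedObject("ColaX", {"category": "soda", "sugar_g": 39}),
    DetectedObject("FizzLite", {"category": "soda", "sugar_g": 2}),
    DetectedObject("Water", {"category": "water", "sugar_g": 0}),
]
program = [
    ("filter", lambda o: o.attributes["category"] == "soda"),
    ("argmin", lambda o: o.attributes["sugar_g"]),
]
print(run_program(scene, program).name)   # -> FizzLite
```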

en cs.MM, cs.CL
DOAJ Open Access 2025
The ‘Muslim Vote’ Movement and Parliamentary Representation of Ethnic Minorities in the UK: A Critical Test for the Labour Party?

Donia Touihri-Mebarek

Over the past decades, the Labour Party has cemented strong popularity and political allegiance among ethnic minorities. However, since the election of the first four MPs with ethnic minority backgrounds in 1987, ethnic minorities' parliamentary representation has remained well below their share of the UK electorate. The 2024 general election constitutes a significant landmark in nearly redressing this imbalance. This article adopts a historical approach to determine the factors that have enabled ethnic minorities to be proportionally represented in the House of Commons. It first analyses the dichotomy between, on the one hand, ethnic minorities' strong alignment with the Labour Party and, on the other, their persistent under-representation in the House of Commons. It then explores how the Conservatives have successfully challenged the Labour Party's representation of ethnic diversity, making historic gains since the 2010 general election. Finally, in light of the 2024 general election results and focusing on the specific case of Muslims, the article questions Muslims' political loyalty to the Labour Party. It shows how alternative forms of political mobilisation, such as the Muslim Vote movement, could present a critical test for the Labour Party. Unless it shows more commitment to substantive minority representation, the Labour Party could lose its legacy in future elections.

DOAJ Open Access 2024
Sound Design in the Turkish TV Series and Cinema Sector: Post-Production Stage

Rahime Akikol, Bahar Atmaca Demir, Haluk Zülfikar

The present study examines the key factors determining the quality of sound design in the post-production stage, the importance of which is often overlooked in Turkish film and television. To this end, we evaluate the technical infrastructure used by the industry, the quantitative aspects and competencies of professionals, and their needs and the problems they encounter. The "snowball sampling" technique was used to reach prominent professionals, with whom in-depth interviews were conducted. The interview data were subjected to descriptive analysis, on the basis of which the strengths and weaknesses of Turkish TV and cinema were examined. Although employees in the Turkish sector are no different from their peers in other countries in terms of qualifications and technical abilities, the study reveals that the time and budget allocated to the visual aspects of film production in Türkiye considerably exceed those allocated to sound design, and that responsibility for this lies with directors and producers. The study is significant in contributing findings on the state of sound design in the sector to the academic literature.

Visual arts, Communication. Mass media
arXiv Open Access 2024
Learning to Unify Audio, Visual and Text for Audio-Enhanced Multilingual Visual Answer Localization

Zhibin Wen, Bin Li

The goal of Multilingual Visual Answer Localization (MVAL) is to locate a video segment that answers a given multilingual question. Existing methods either focus solely on visual modality or integrate visual and subtitle modalities. However, these methods neglect the audio modality in videos, consequently leading to incomplete input information and poor performance in the MVAL task. In this paper, we propose a unified Audio-Visual-Textual Span Localization (AVTSL) method that incorporates audio modality to augment both visual and textual representations for the MVAL task. Specifically, we integrate features from three modalities and develop three predictors, each tailored to the unique contributions of the fused modalities: an audio-visual predictor, a visual predictor, and a textual predictor. Each predictor generates predictions based on its respective modality. To maintain consistency across the predicted results, we introduce an Audio-Visual-Textual Consistency module. This module utilizes a Dynamic Triangular Loss (DTL) function, allowing each modality's predictor to dynamically learn from the others. This collaborative learning ensures that the model generates consistent and comprehensive answers. Extensive experiments show that our proposed method outperforms several state-of-the-art (SOTA) methods, which demonstrates the effectiveness of the audio modality.
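
The abstract does not specify the Dynamic Triangular Loss in detail. One plausible reading, sketched under our own assumptions about how the dynamic weighting works (not the authors' formulation), is a pairwise consistency term among the three predictors in which the currently more accurate one teaches the others:

```python
# Rough sketch of a triangular consistency term; our interpretation only.
import torch
import torch.nn.functional as F

def triangular_consistency(span_logits, sup_losses):
    """span_logits: list of 3 tensors of shape (T,), start-position logits
    from the audio-visual, visual, and textual predictors.
    sup_losses: list of 3 scalar tensors, each predictor's supervised
    loss; used here to decide who teaches whom (our assumption)."""
    loss = torch.zeros(())
    for i in range(3):
        for j in range(3):
            if i == j:
                continue
            # predictor j teaches predictor i when j currently fits better
            weight = torch.relu(sup_losses[i] - sup_losses[j]).detach()
            teacher = F.softmax(span_logits[j], dim=-1).detach()
            student = F.log_softmax(span_logits[i], dim=-1)
            loss = loss + weight * F.kl_div(student, teacher, reduction="sum")
    return loss
```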

en cs.MM, cs.AI
arXiv Open Access 2024
The Visualization JUDGE: Can Multimodal Foundation Models Guide Visualization Design Through Visual Perception?

Matthew Berger, Shusen Liu

Foundation models for vision and language are the basis of AI applications across numerous sectors of society. The success of these models stems from their ability to mimic human capabilities, namely visual perception in vision models and analytical reasoning in large language models. As visual perception and analysis are fundamental to data visualization, in this position paper we ask: how can we harness foundation models to advance progress in visualization design? Specifically, how can multimodal foundation models (MFMs) guide visualization design through visual perception? We approach these questions by investigating the effectiveness of MFMs for perceiving visualization and by formalizing the overall visualization design and optimization space. In particular, we argue that MFMs are best viewed as judges, equipped with the ability to critique visualizations and to suggest actions for improving them. We provide a deeper characterization of text-to-image generative models and multimodal large language models, organized by what these models provide as output and how that output can be used to guide design decisions. We hope that our perspective can inspire visualization researchers in how to approach MFMs for visualization design.

en cs.HC
DOAJ Open Access 2023
Com(ic)onic: Architecture in the Comics

Michela De Domenico, Paola Raffa, Fabio Testaì

The city and architecture are often the protagonists of drawn narratives, in which spatial connotation serves as an element of contextualization and recognizability, but also of the externalization of existential places. They appear in comics reshaped around real and fantastic ways of life that combine narrative invention and architectural representation. The research considers three large stylistic/expressive groupings of comics and then carries out a parallel analysis of three large cities, coinciding with the comics' contexts of origin, in order to identify new scenarios in the field of architectural representation and visual communication. In the narrative layout of comics, architecture assumes the role of an iconic figure, whose meaning is expressed through the separation and cohesion of the sign, and refers to a place understood as a defined space-time element.

Psychology, Visual arts
arXiv Open Access 2022
Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction

Xiang Chen, Ningyu Zhang, Lei Li et al.

Multimodal named entity recognition and relation extraction (MNER and MRE) is a fundamental and crucial branch of information extraction. However, existing approaches for MNER and MRE usually suffer from error sensitivity when irrelevant object images are incorporated into texts. To deal with these issues, we propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction, aiming to achieve more effective and robust performance. Specifically, we regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive prediction decisions. We further propose a dynamic gated aggregation strategy to obtain hierarchical multi-scale visual features as the visual prefix for fusion. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance. Code is available at https://github.com/zjunlp/HVPNeT.
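
A minimal sketch of the general visual-prefix idea follows; this is our illustration under assumed dimensions, and the released code at the link above is the authoritative implementation:

```python
# Gated multi-scale fusion projected into pseudo-tokens ("visual prefix")
# that can be prepended to the text sequence. Illustrative only.
import torch
import torch.nn as nn

class GatedVisualPrefix(nn.Module):
    def __init__(self, scales=3, vis_dim=2048, txt_dim=768, prefix_len=4):
        super().__init__()
        self.gate = nn.Linear(txt_dim, scales)       # dynamic gate over scales
        self.proj = nn.Linear(vis_dim, prefix_len * txt_dim)
        self.prefix_len, self.txt_dim = prefix_len, txt_dim

    def forward(self, multi_scale_vis, txt_query):
        # multi_scale_vis: (scales, vis_dim) pooled features per pyramid level
        # txt_query: (txt_dim,) pooled text representation steering the gate
        w = torch.softmax(self.gate(txt_query), dim=-1)       # (scales,)
        fused = (w.unsqueeze(-1) * multi_scale_vis).sum(0)    # (vis_dim,)
        return self.proj(fused).view(self.prefix_len, self.txt_dim)

prefix = GatedVisualPrefix()(torch.randn(3, 2048), torch.randn(768))
print(prefix.shape)   # torch.Size([4, 768]) -- prepend to text embeddings
```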

en cs.CL, cs.AI
arXiv Open Access 2022
Deep Learning-based Assessment of Hepatic Steatosis on chest CT

Zhongyi Zhang, Jakob Weiss, Jana Taron et al.

Purpose: Automatic methods are required for the early detection of hepatic steatosis to avoid progression to cirrhosis and cancer. Here, we developed a fully automated deep learning pipeline to quantify hepatic steatosis on non-contrast-enhanced chest computed tomography (CT) scans. Materials and Methods: We developed and evaluated our pipeline on chest CT images of 1,431 randomly selected National Lung Screening Trial (NLST) participants. A dataset of 451 CT scans with volumetric liver segmentations by expert readers was used for training a deep learning model. For testing, in an independent dataset of 980 CT scans, hepatic attenuation was manually measured by an expert reader on three cross-sectional images at different hepatic levels by selecting three circular regions of interest. Additionally, 100 randomly selected cases of the test set were volumetrically segmented by expert readers. Hepatic steatosis on the test set was defined as a mean hepatic attenuation below 40 Hounsfield units (HU). Spearman correlation was used to analyze liver fat quantification accuracy, and Cohen's kappa coefficient was calculated for the reliability of hepatic steatosis prediction. Results: Our pipeline demonstrated strong performance, achieving a mean Dice score of 0.970 for volumetric liver segmentation. The Spearman correlation between automated and expert reader liver fat measurements was 0.954 (P < 0.0001). Cohen's kappa coefficient was 0.875 for the automatic assessment of hepatic steatosis. Conclusion: We developed a fully automatic deep learning-based pipeline for the assessment of hepatic steatosis in chest CT images. By enabling fast and inexpensive screening for hepatic steatosis, our pipeline has the potential to help initiate preventive measures that avoid progression to cirrhosis and cancer.
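
The evaluation arithmetic stated in the abstract is straightforward to outline. A sketch with toy numbers (not the study's data) using standard SciPy and scikit-learn routines:

```python
# Our reconstruction of the described evaluation, not the study's code.
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import cohen_kappa_score

def mean_liver_hu(ct_volume: np.ndarray, liver_mask: np.ndarray) -> float:
    """ct_volume: HU values; liver_mask: boolean segmentation mask."""
    return float(ct_volume[liver_mask].mean())

def is_steatotic(mean_hu: float, threshold: float = 40.0) -> bool:
    return mean_hu < threshold   # the < 40 HU definition from the abstract

# Toy agreement check between automated and reader measurements:
auto_hu = np.array([35.2, 55.1, 38.9, 61.0])
reader_hu = np.array([34.8, 54.0, 41.2, 60.5])
rho, p = spearmanr(auto_hu, reader_hu)
kappa = cohen_kappa_score(auto_hu < 40, reader_hu < 40)
print(f"Spearman rho={rho:.3f} (p={p:.3g}), kappa={kappa:.3f}")
```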

en q-bio.QM
arXiv Open Access 2021
Dense Video Captioning Using Unsupervised Semantic Information

Valter Estevam, Rayson Laroca, Helio Pedrini et al.

We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events, and that these simple events are shared across several complex events. We first employ a clustering method to group representations, producing a visual codebook. Then, we learn a dense representation by encoding the co-occurrence probability matrix for the codebook entries. This representation improves performance on the dense video captioning task in a scenario with only visual features. For example, we replace the audio signal in the BMT method and produce temporal proposals with comparable performance. Furthermore, we concatenate the visual representation with our descriptor in a vanilla transformer method and achieve state-of-the-art performance on the captioning subtask among methods that explore only visual features, as well as competitive performance with multi-modal methods. Our code is available at https://github.com/valterlej/dvcusi.
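
A compact sketch of the two unsupervised steps the abstract names, clustering into a visual codebook and encoding a co-occurrence matrix, under our own choices of cluster count and window size (see the linked repository for the real implementation):

```python
# Illustrative codebook + co-occurrence descriptor; parameters are ours.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 128))        # stand-in visual features

K = 32
codes = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(features)

# Count co-occurrences of codebook entries within a sliding temporal window.
window = 5
cooc = np.zeros((K, K))
for t in range(len(codes) - window):
    span = codes[t:t + window]
    for a in span:
        for b in span:
            if a != b:
                cooc[a, b] += 1
cooc_prob = cooc / np.maximum(cooc.sum(axis=1, keepdims=True), 1)  # row-normalize
print(cooc_prob.shape)   # (32, 32): one dense descriptor per codebook entry
```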

arXiv Open Access 2021
Learning to Ground Visual Objects for Visual Dialog

Feilong Chen, Xiuyi Chen, Can Xu et al.

Visual dialog is challenging since it requires answering a series of coherent questions based on an understanding of the visual environment. How to ground related visual objects is one of the key problems. Previous studies utilize the question and history to attend to the image and achieve satisfactory performance; however, these methods are not sufficient to locate related visual objects without guidance. Inappropriate grounding of visual objects limits the performance of visual dialog models. In this paper, we propose a novel approach to Learn to Ground visual objects for visual dialog, which employs a grounding mechanism in which both prior and posterior distributions over visual objects are used to facilitate grounding. Specifically, a posterior distribution over visual objects is inferred from both context (history and questions) and answers, and it ensures appropriate grounding of visual objects during the training process. Meanwhile, a prior distribution, inferred from context only, is used to approximate the posterior distribution, so that appropriate visual objects can be grounded even without answers during inference. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate that our approach improves upon previous strong models in both generative and discriminative settings by a significant margin.
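
The prior/posterior scheme can be sketched as a simple KL term; this is our reading of the abstract, not the paper's code:

```python
# The posterior scores objects from context *and* answer; the prior sees
# context only and is pulled toward the posterior. Illustrative only.
import torch.nn.functional as F

def prior_posterior_kl(posterior_logits, prior_logits):
    """Both tensors: (num_objects,) scores over detected visual objects.
    Training minimizes this so the prior mimics the answer-aware
    posterior; at inference (no answer) only the prior is used."""
    posterior = F.softmax(posterior_logits, dim=-1).detach()
    return F.kl_div(F.log_softmax(prior_logits, dim=-1), posterior,
                    reduction="sum")
```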

en cs.CV, cs.CL
arXiv Open Access 2021
When Creators Meet the Metaverse: A Survey on Computational Arts

Lik-Hang Lee, Zijun Lin, Rui Hu et al.

The metaverse, an enormous virtual-physical cyberspace, has brought unprecedented opportunities for artists to blend every corner of our physical surroundings with digital creativity. This article conducts a comprehensive survey of computational arts, covering seven critical topics relevant to the metaverse and describing novel artworks in blended virtual-physical realities. The topics first cover the building elements of the metaverse, e.g., virtual scenes and characters and auditory and textual elements. Next, the survey reflects on several remarkable types of novel creations in the expanded horizons of metaverse cyberspace, such as immersive arts, robotic arts, and other user-centric approaches fuelling contemporary creative output. Finally, we propose several research agendas: democratising computational arts; digital privacy and safety for metaverse artists; ownership recognition for digital artworks; technological challenges; and so on. The survey also serves as introductory material for artists and metaverse technologists who wish to begin creating in the realm of surrealistic cyberspace.

en cs.CY, cs.AI
arXiv Open Access 2021
Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation

Feilong Chen, Fandong Meng, Xiuyi Chen et al.

Visual dialogue is a challenging task since it requires answering a series of coherent questions on the basis of an understanding of the visual environment. Previous studies focus on the implicit exploration of multimodal co-reference by implicitly attending to spatial image features or object-level image features, but neglect the importance of explicitly locating the objects in the visual content that are associated with entities in the textual content. Therefore, in this paper we propose a Multimodal Incremental Transformer with Visual Grounding, named MITVG, which consists of two key parts: visual grounding and a multimodal incremental transformer. Visual grounding aims to explicitly locate related objects in the image, guided by textual entities, which helps the model exclude visual content that does not need attention. On the basis of visual grounding, the multimodal incremental transformer encodes the multi-turn dialogue history combined with the visual scene step by step, following the order of the dialogue, and then generates a contextually and visually coherent response. Experimental results on the VisDial v0.9 and v1.0 datasets demonstrate the superiority of the proposed model, which achieves comparable performance.
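
As a toy illustration of incremental, visually grounded dialogue encoding (our sketch, not the MITVG architecture itself), each turn can attend to the grounded objects plus the history encoded so far:

```python
# Turn-by-turn encoding over grounded object features. Illustrative only.
import torch
import torch.nn as nn

d = 64
layer = nn.TransformerDecoderLayer(d_model=d, nhead=4, batch_first=True)
step = nn.TransformerDecoder(layer, num_layers=1)

grounded_objects = torch.randn(1, 5, d)     # objects located by grounding
context = torch.zeros(1, 0, d)              # dialogue history, grows per turn
for turn_embeds in [torch.randn(1, 7, d), torch.randn(1, 6, d)]:
    memory = torch.cat([grounded_objects, context], dim=1)
    encoded = step(tgt=turn_embeds, memory=memory)  # turn attends to scene+history
    context = torch.cat([context, encoded], dim=1)
print(context.shape)   # torch.Size([1, 13, 64]): grounded dialogue history
```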

en cs.CL, cs.CV
arXiv Open Access 2021
Visual Keyword Spotting with Attention

K R Prajwal, Liliane Momeni, Triantafyllos Afouras et al.

In this paper, we consider the task of spotting spoken keywords in silent video sequences -- also known as visual keyword spotting. To this end, we investigate Transformer-based models that ingest two streams, a visual encoding of the video and a phonetic encoding of the keyword, and output the temporal location of the keyword if present. Our contributions are as follows: (1) We propose a novel architecture, the Transpotter, that uses full cross-modal attention between the visual and phonetic streams; (2) We show through extensive evaluations that our model outperforms the prior state-of-the-art visual keyword spotting and lip reading methods on the challenging LRW, LRS2, LRS3 datasets by a large margin; (3) We demonstrate the ability of our model to spot words under the extreme conditions of isolated mouthings in sign language videos.
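
A minimal sketch of two-stream cross-modal attention for keyword spotting follows (illustrative of the general idea only; the Transpotter's actual architecture differs in depth and training):

```python
# Visual frames query the phonetic keyword encoding; a per-frame head
# scores where the keyword occurs. Dimensions are our assumptions.
import torch
import torch.nn as nn

d = 64
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)
head = nn.Linear(d, 1)                      # per-frame "keyword here?" score

video = torch.randn(1, 75, d)               # visual encoding of T=75 frames
phones = torch.randn(1, 9, d)               # phonetic encoding of the keyword

attended, _ = cross_attn(query=video, key=phones, value=phones)
frame_probs = torch.sigmoid(head(attended)).squeeze(-1)   # (1, 75)
print(frame_probs.argmax(dim=-1))           # most likely temporal location
```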

en cs.CV, cs.CL
DOAJ Open Access 2020
‘Smashing crockery’ : objets cassés, objets-déchets dans la poésie de Derek Mahon

Marion Naugrette-Fournier

Throughout his poetic œuvre the Northern Irish poet Derek Mahon has made the tabula rasa a true ars poetica of his very own, using discarded objects or even ‘breaking’ objects in good condition, in order to salvage them or recycle them by reintegrating them to a new system of objects of which he alone knows the rules. To achieve such a goal, in his poems Mahon has relentlessly re-envisioned the value of an object economically, metaphysically and poetically, by temporarily destroying it. This ‘tabula rasa’ phase is thus seen as almost essential to the process of re-evaluating the value of objects, which in turn is followed by a process of reconstructing them, implying the poet’s gathering of what Donald W. Winnicott names ‘transitional objects’. Mahon’s transitional objects, so to speak, act as new toys he has made his own, so that he could be free from any kind of attachment towards a community, and especially a Northern Irish one. Mahon’s poetry is thus a ‘Building Site’ of sorts, always in motion.

arXiv Open Access 2020
Geometric Structure Aided Visual Inertial Localization

Huaiyang Huang, Haoyang Ye, Jianhao Jiao et al.

Visual localization is an essential component of autonomous navigation. Existing approaches are based either on the visual structure from SLAM/SfM or on the geometric structure from dense mapping. To take advantage of both, in this work we present a complete visual inertial localization system based on a hybrid map representation, reducing computational cost and increasing positioning accuracy. Specifically, we propose two modules, for data association and batch optimization respectively. We develop an efficient data association module to associate map components with local features, which takes only 2 ms to generate temporal landmarks. For batch optimization, instead of using visual factors, we develop a module that estimates a pose prior from the instant localization results to constrain poses. Experimental results on the EuRoC MAV dataset demonstrate competitive performance compared to the state of the art. In particular, our system achieves an average position error of 1.7 cm with 100% recall. The timings show that the proposed modules reduce the computational cost by 20-30%. We will make our implementation open source at http://github.com/hyhuang1995/gmmloc.
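
The pose-prior idea can be illustrated with a toy residual (our sketch of the abstract's description, not the gmmloc code); in a least-squares back end, such a term constrains each pose toward the instant localization estimate instead of adding per-feature visual factors:

```python
# Whitened pose-prior residual for batch optimization. Illustrative only.
import numpy as np

def pose_prior_residual(pose_xyz: np.ndarray, prior_xyz: np.ndarray,
                        sigma: float = 0.05) -> np.ndarray:
    """Translation residual scaled by an assumed prior noise; rotation
    would be handled analogously on the manifold."""
    return (pose_xyz - prior_xyz) / sigma

r = pose_prior_residual(np.array([1.00, 2.01, 0.99]), np.array([1.0, 2.0, 1.0]))
print(0.5 * r @ r)   # this prior's contribution to the least-squares cost
```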

en cs.RO, cs.CV

Page 38 of 169,271