AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition
Shoufa Chen, Chongjian Ge, Zhan Tong
et al.
Pretraining Vision Transformers (ViTs) has achieved great success in visual recognition. A following scenario is to adapt a ViT to various image and video recognition tasks. The adaptation is challenging because of heavy computation and memory storage. Each model needs an independent and complete finetuning process to adapt to different tasks, which limits its transferability to different visual domains. To address this challenge, we propose an effective adaptation approach for Transformer, namely AdaptFormer, which can adapt the pre-trained ViTs into many different image and video tasks efficiently. It possesses several benefits more appealing than prior arts. Firstly, AdaptFormer introduces lightweight modules that only add less than 2% extra parameters to a ViT, while it is able to increase the ViT's transferability without updating its original pre-trained parameters, significantly outperforming the existing 100% fully fine-tuned models on action recognition benchmarks. Secondly, it can be plug-and-play in different Transformers and scalable to many visual tasks. Thirdly, extensive experiments on five image and video datasets show that AdaptFormer largely improves ViTs in the target domains. For example, when updating just 1.5% extra parameters, it achieves about 10% and 19% relative improvement compared to the fully fine-tuned models on Something-Something v2 and HMDB51, respectively. Code is available at https://github.com/ShoufaChen/AdaptFormer.
1022 citations
en
Computer Science
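To make the "less than 2% extra parameters" claim in the AdaptFormer abstract concrete, the following is a rough back-of-the-envelope sketch, not the authors' implementation. It assumes each Transformer block receives one bottleneck adapter (a down-projection and an up-projection, each with a bias); the abstract itself does not describe the module design. The width of 768, depth of 12, and backbone size of 86 million parameters are ViT-Base-like illustrative values, and the bottleneck width of 64 is a hypothetical choice.

```python
def adapter_params(dim: int, bottleneck: int) -> int:
    """Parameters of one assumed bottleneck adapter: two linear layers with biases."""
    down = dim * bottleneck + bottleneck  # down-projection weights + bias
    up = bottleneck * dim + dim           # up-projection weights + bias
    return down + up


def extra_param_fraction(dim: int, bottleneck: int, num_layers: int,
                         backbone_params: int) -> float:
    """Added adapter parameters as a fraction of the frozen backbone's parameters."""
    return num_layers * adapter_params(dim, bottleneck) / backbone_params


# Illustrative (assumed) ViT-Base-like configuration.
fraction = extra_param_fraction(dim=768, bottleneck=64, num_layers=12,
                                backbone_params=86_000_000)
print(f"extra parameters: {fraction:.2%}")
```

Under these assumptions the added parameters stay well below 2% of the backbone, which is consistent with the range quoted in the abstract.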
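Everything Was Played Out in an Instant: Documentary Photography and Visual Archives of Childhood in the Modern City / Todo se jugó en un instante: fotografía documental y archivos visuales de la infancia en la ciudad moderna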
Nicolás Stutzin, José Parra-Martínez, M. Elia Gutiérrez-Mozo
This article examines the role that photographic records of girls and boys playing in the street played in postwar architectural debates, and investigates how, until the arrival of color, black-and-white documentary photography helped forge various imaginaries linked to childhood. Within the period between the first playground opened by Robert Moses (1934) and the last one designed by Aldo van Eyck (1978), three paradigmatic cases are analyzed and compared: the playgrounds of New York, Amsterdam, and London, whose photographic archives were instrumental in the configuration, dissemination, and historiography of each project. The systematic study of municipal holdings, historical collections, and institutional repositories carried out in these three settings shows that such images, beyond documenting recreational spaces, articulated very different visions of childhood, architecture, and the city: legitimizing patriotic discourses in the United States; building collective memory in the Netherlands; and narrating the ephemeral, disruptive action of the English adventure playgrounds in the face of the logics of modern planning. Finally, this analysis makes it possible to situate that legacy with respect to contemporary challenges in representing childhood, in a context marked by post-photography and the progressive erosion of photojournalism as a critical record of the urban.
StoryMovie: A Dataset for Semantic Alignment of Visual Stories with Movie Scripts and Subtitles
Daniel Oliveira, David Martins de Matos
Visual storytelling models that correctly ground entities in images may still hallucinate semantic relationships, generating incorrect dialogue attribution, character interactions, or emotional states. We introduce StoryMovie, a dataset of 1,757 stories aligned with movie scripts and subtitles through LCS matching. Our alignment pipeline synchronizes screenplay dialogue with subtitle timestamps, enabling dialogue attribution by linking character names from scripts to temporal positions from subtitles. Using this aligned content, we generate stories that maintain visual grounding tags while incorporating authentic character names, dialogue, and relationship dynamics. We fine-tune Qwen Storyteller3 on this dataset, building on prior work in visual grounding and entity re-identification. Evaluation using DeepSeek V3 as judge shows that Storyteller3 achieves an 89.9% win rate against base Qwen2.5-VL 7B on subtitle alignment. Compared to Storyteller, trained without script grounding, Storyteller3 achieves 48.5% versus 38.0%, confirming that semantic alignment progressively improves dialogue attribution beyond visual grounding alone.
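The alignment step described in the StoryMovie abstract (matching screenplay dialogue to subtitle lines via LCS, then linking character names to subtitle timestamps) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes "LCS" means longest common subsequence, treats whole lines as sequence elements, and counts two lines as equal only when they match exactly after a simple normalization. The function names, the data layout, and the sample dialogue are all hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop punctuation so minor formatting differences still match."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).strip()


def lcs_align(script, subtitles):
    """Align (speaker, line) script entries with (start_seconds, line) subtitles.

    Returns (start_seconds, speaker, subtitle_line) triples for the lines that
    lie on a longest common subsequence of the two normalized line sequences.
    """
    a = [normalize(line) for _, line in script]
    b = [normalize(line) for _, line in subtitles]
    n, m = len(a), len(b)

    # dp[i][j] = LCS length of a[i:] and b[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if a[i] == b[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table to recover one optimal alignment.
    aligned, i, j = [], 0, 0
    while i < n and j < m:
        if a[i] == b[j]:
            aligned.append((subtitles[j][0], script[i][0], subtitles[j][1]))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # script line with no subtitle counterpart
        else:
            j += 1  # subtitle line with no script counterpart
    return aligned


# Hypothetical toy data: one script line is missing from the subtitles,
# and one subtitle is a sound cue with no script counterpart.
script = [("ANNA", "Where are you going?"), ("BEN", "Out."), ("ANNA", "Take the keys.")]
subtitles = [(12.0, "Where are you going?"), (14.5, "[door slams]"), (15.2, "Take the keys!")]
aligned = lcs_align(script, subtitles)
for start, speaker, line in aligned:
    print(f"{start:6.1f}s  {speaker}: {line}")
```

Exact matching after normalization is a deliberate simplification; a real pipeline would likely need fuzzy matching to cope with paraphrased or split subtitle lines.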
From the Palace to the Home / Del palacio al hogar
Marina Moguillansky
This article reconstructs the trajectory of the writer Manuel Puig's cinephilia, from his early years as a frequent moviegoer in the 1930s and 1940s to his later conversion into a collector of videocassettes during the 1980s. Through the analysis of his novels and personal correspondence, it explores the transformations in his relationship with cinema, framed by the technological and cultural changes of the twentieth century. The research sits at the intersection of the sociology of culture and the new cinema history, with an approach centered on the situated practices of spectators. It examines aspects such as the economic dimension of access to cinema in different cities, the experience of movie theaters as collective spaces, the impact of dubbing and subtitles on his formation as a cinephile, and the progressive individualization of audiovisual consumption with the advent of video. The study shows how Puig builds an intimate and reflective relationship with cinema that goes beyond entertainment and becomes a constant source of conversation, analysis, and memory, articulating affects, knowledge, and forms of personal archive.
Visual arts, Communication. Mass media
Childhood under the scope of time: the 20th-century child in the aesthetics of Neorealism and Soviet cinema / Детство под прицелом времени: ребенок ХХ века в киноэстетике неореализма и советском кинематографе
Dmitry Mikhaylyuk / Дмитрий Павлович Михайлюк, Amir Kader / Амир Святославович Кадер
The aim of the research is to identify the evolutionary scenario and the specifics of the transformation of the child's image in Russian art, primarily using Soviet cinema as an example, taking into account the experience of world art development and the connection between the "Sixtiers" and the aesthetics of neorealism. The article analyzes the causes and nature of significant changes in the image of the child in 20th-century art, considering large-scale social, cultural, and ideological transformations. One of the prominent movements that shaped original perceptions of childhood was Italian neorealism, in which the image of the "orphan child" became a symbol of the tragic consequences of historical cataclysms and socio-economic crises. Soviet art, from its very inception, actively used images of young characters. For instance, the films "Kino-glaz" (1924), "The Desperate Battalion" (1933), and "Golden Honey" (1928) presented children as active participants in social life. Children were often depicted as miniature adults, actively involved in labor and patriotic processes. The article shows how, in the 1960s, Russian art, influenced by the aesthetics of neorealism, began to re-evaluate the image of the child. The "Sixtiers" paid more attention to child psychology, themes of socialization, friendship, and growing up. The relevance of studying the image of the child in contemporary art is linked to the need to support and assist children in the process of socialization. Modern cinema about children and for children not only showcases young heroes but also transmits important moral and ethical lessons to them and to society as a whole. The research results have demonstrated that realistic art, which rejects fantastical elements, offers children an experience based on real-life situations, helping them develop critical thinking and emotional perception of the world. 
Visual arts, Arts in general
Have a Nice Day: A Case for Sunbathing
Igor Bragado, Miles Gertler
This graphic essay reflects on the concepts developed for Common Accounts’ installation Have a Nice Day, a solar canopy and artificial sun presented at MUDAC’s Solar Biennale in 2025. The essay intersperses images of the installation with transcripts of narration sourced from influencer testimonials and Georges Bataille’s “The Accursed Share,” staging a critical encounter between contemporary domestic therapeutic devices and Bataille’s theory of excess solar energy. Through this juxtaposition, “A Case for Sunbathing” examines how artificial sunlight has been theorized in popular consumer advertising and wellness-oriented social media for applications as diverse as cellular rehabilitation, anti-aging, and enhanced fertility. Taking Have a Nice Day as a conceptual scaffold, the paper follows how the installation troubles psycho-social associations with the sun in an age of climate anxiety, channeling them into sensible, energetic encounters in the gallery and tracing their entanglement with optimization, self-design, planetarity, technology, and daily life.
Regarding some Russian and regional exhibition projects of the 2020s in the art space of Kazan: a critical analysis
Irina F. Lobasheva, Ekaterina A. Fakhrazieva
The article analyzes the art space of Kazan as one of Russia’s cultural centers through the lens of contemporary exhibition projects initiated by museums, exhibition halls, and galleries. It addresses the organization of significant large-scale exhibitions in the 2020s, their scholarly and creative aspects, and their semantic resonance and broad social impact. The publication is accompanied by a historiographical review that focuses on key monographs, scholarly articles, online reviews, and interviews related to the historical study of the city’s cultural landmarks and their role in shaping the artistic environment of Kazan. Through selected projects, the publication surveys a range of current collective exhibitions as well as exhibitions of individual artists whose work is of particular interest. These exhibitions reveal the priority contemporary themes, the moods of artists and audiences, the latest approaches to exhibition design, and the main trends and directions in the city’s art scene. It is noted that, alongside permanent museum displays of classical visual art, the city successfully creates and develops projects by contemporary artists in various fields. It is this area, with its changes and progress, that particularly interested the authors, and the publication is the result of a collaboration between a teacher and a student. A more detailed analysis is devoted to the following exhibitions: “Noah’s Ark” (2023), with a comprehensive analysis of individual works by various artists, and two exhibitions of the “Kazan Time” project: Artists of the 1990s at the Contemporary Art Gallery of the Republic of Tatarstan (2025), featuring the creative individuality of such masters as Evgeny Golubtsov and Oleg Ivanov, and “Our Avant-Garde” at the Benois Wing of the State Russian Museum in St. Petersburg (2025), focusing on the phenomenon of the popularity of the ‘fathers’ of the Russian avant-garde. The article raises questions about the future development of the visual arts and the role of young artists in the 21st century. The modern development of the Kazan Art School and its role in the formation of Tatarstan’s visual arts are also discussed.
Ethnology. Social and cultural anthropology, Folklore
Explain Before You Answer: A Survey on Compositional Visual Reasoning
Fucai Ke, Joy Hsu, Zhixi Cai
et al.
Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently minted chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.
Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark
Haobo Yuan, Yueyi Sun, Yanwei Li
et al.
Recent advances in Multimodal Large Language Models (MLLMs) have significantly improved performance on tasks such as visual grounding and visual question answering. However, the reasoning processes of these models remain largely opaque; they typically output only final predictions without revealing the intermediate steps or fine-grained evidence (e.g., pixels, locations) that lead to the result. This contrasts with human intelligence, which naturally operates through a chain of visual reasoning. To address this limitation, we introduce the Visual Reasoning Tracer (VRT) task, which requires models to not only localize the target object but also explicitly predict the intermediate objects that form the reasoning path. To advance research in this area, we contribute: (1) VRT-Bench, a human-annotated benchmark for evaluating visual reasoning; (2) a new metric for assessing the quality of reasoning traces; and (3) VRT-80k, a large-scale dataset for reasoning model training. Our experiments reveal that while existing models often produce the correct final output, they struggle to ground their intermediate reasoning. In contrast, models trained on VRT-80k achieve substantial improvements in tracing the reasoning path.
Towards Visual Grounding: A Survey
Linhui Xiao, Xiaoshan Yang, Xiangyuan Lan
et al.
Visual Grounding, also known as Referring Expression Comprehension and Phrase Grounding, aims to ground the specific region(s) within the image(s) based on the given expression text. This task simulates the common referential relationships between visual and linguistic modalities, enabling machines to develop human-like multimodal comprehension capabilities. Consequently, it has extensive applications in various domains. However, since 2021, visual grounding has witnessed significant advancements, with emerging new concepts such as grounded pre-training, grounding multimodal LLMs, generalized visual grounding, and giga-pixel grounding, which have brought numerous new challenges. In this survey, we first examine the developmental history of visual grounding and provide an overview of essential background knowledge. We systematically track and summarize the advancements, and then meticulously define and organize the various settings to standardize future research and ensure a fair comparison. Additionally, we delve into numerous related datasets and applications, and highlight several advanced topics. Finally, we outline the challenges confronting visual grounding and propose valuable directions for future research, which may serve as inspiration for subsequent researchers. By extracting common technical details, this survey encompasses the representative work in each subtopic over the past decade. To the best of our knowledge, this paper represents the most comprehensive overview currently available in the field of visual grounding. This survey is designed to be suitable for both beginners and experienced researchers, serving as an invaluable resource for understanding key concepts and tracking the latest research developments. We keep tracing related work at https://github.com/linhuixiao/Awesome-Visual-Grounding.
Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?
Bowen Zhao, Leo Parker Dirac, Paulina Varshavskaya
Large vision-language models (VLMs) have become state-of-the-art for many computer vision tasks, with in-context learning (ICL) as a popular adaptation strategy for new ones. But can VLMs learn novel concepts purely from visual demonstrations, or are they limited to adapting to the output format of ICL examples? We propose a new benchmark we call Spatial Visual Ambiguity Tasks (SVAT) that challenges state-of-the-art VLMs to learn new visuospatial tasks in-context. We find that VLMs fail to do this zero-shot, and sometimes continue to fail after finetuning. However, adding simpler data to the training by curriculum learning leads to improved ICL performance.
Efficient Multimodal Large Language Models: A Survey
Yizhang Jin, Jian Li, Yexin Liu
et al.
In the past year, Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in tasks such as visual question answering, visual understanding and reasoning. However, the extensive model size and high training and inference costs have hindered the widespread application of MLLMs in academia and industry. Thus, studying efficient and lightweight MLLMs has enormous potential, especially in edge computing scenarios. In this survey, we provide a comprehensive and systematic review of the current state of efficient MLLMs. Specifically, we summarize the timeline of representative efficient MLLMs, research state of efficient structures and strategies, and the applications. Finally, we discuss the limitations of current efficient MLLM research and promising future directions. Please refer to our GitHub repository for more details: https://github.com/lijiannuist/Efficient-Multimodal-LLMs-Survey.
I Stalk Myself More than I Should
S()fia Braga
Today we find ourselves immersed in digital environments made available by centralised social media platforms on a daily basis. While these platforms did provide users expanded connectivity and visibility, they also confined the same user in an economic system focused on collection and commodification of personal data for profit, and in return used them as resources of free labour. In light of this analysis, is it possible to carry out an artistic practice within centralised social media platforms, therefore take an active part in them, while remaining critically engaged, in the attempt to highlight some of the structural dynamics and problems of these realities?
In this paper some fundamental aspects of the aforementioned channels will be discussed through the analysis of selected works and two methods utilised by the author to avoid the culture of interveillance.
Visual arts, History of the arts
A Tutorial on Visual Representations of Relational Queries
Wolfgang Gatterbauer
Query formulation is increasingly performed by systems that need to guess a user's intent (e.g. via spoken word interfaces). But how can a user know that the computational agent is returning answers to the "right" query? More generally, given that relational queries can become pretty complicated, how can we help users understand existing relational queries, whether human-generated or automatically generated? Now seems the right moment to revisit a topic that predates the birth of the relational model: developing visual metaphors that help users understand relational queries. This lecture-style tutorial surveys the key visual metaphors developed for visual representations of relational expressions. We will survey the history and state of the art of relationally complete diagrammatic representations of relational queries, discuss the key visual metaphors developed in over a century of investigating diagrammatic languages, and organize the landscape by mapping the visual alphabets they use to the syntax and semantics of Relational Algebra (RA) and Relational Calculus (RC).
ADVISE: ADaptive Feature Relevance and VISual Explanations for Convolutional Neural Networks
Mohammad Mahdi Dehshibi, Mona Ashtari-Majlan, Gereziher Adhane
et al.
To equip Convolutional Neural Networks (CNNs) with explainability, it is essential to interpret how opaque models take specific decisions, understand what causes the errors, improve the architecture design, and identify unethical biases in the classifiers. This paper introduces ADVISE, a new explainability method that quantifies and leverages the relevance of each unit of the feature map to provide better visual explanations. To this end, we propose using adaptive bandwidth kernel density estimation to assign a relevance score to each unit of the feature map with respect to the predicted class. We also propose an evaluation protocol to quantitatively assess the visual explainability of CNN models. We extensively evaluate our idea in the image classification task using AlexNet, VGG16, ResNet50, and Xception pretrained on ImageNet. We compare ADVISE with the state-of-the-art visual explainable methods and show that the proposed method outperforms competing approaches in quantifying feature-relevance and visual explainability while maintaining competitive time complexity. Our experiments further show that ADVISE fulfils the sensitivity and implementation independence axioms while passing the sanity checks. The implementation is accessible for reproducibility purposes on https://github.com/dehshibi/ADVISE.
Access to arts education in America: the availability of visual art, music, dance, and theater courses in U.S. high schools
Kenneth Elpus
The purpose of this study was to understand the contextual, school-level factors associated with the availability of arts education courses in the high schools of the United States. In the study, course offerings for a nationally representative sample of N = 940 high schools that were part of the National Center for Education Statistics High School Longitudinal Study of 2009 (HSLS) were analyzed to understand whether there were common school characteristics linked to the availability of visual art, music, dance, or theater courses. Results suggest that the strongest and most consistent school factor related to arts course availability was school size. As enrollment increased, so did the likelihood of offering any arts course or more than one of the arts disciplines. Traditional public schools had the greatest likelihood of offering arts education, followed by Catholic schools and non-Catholic private schools. Public charter high schools were the least likely to offer courses in the arts. The proportion of students eligible for free or reduced-price lunch was also associated with the probability of offering arts courses, with decreased arts availability at schools serving greater proportions of students eligible for the National School Lunch Program. Neither urbanicity nor region of the country was significantly associated with arts availability. The analyses reported in the article are both aggregated across all disciplines and disaggregated for each art form.
What Happens Next?
Omari 'Motion' Carter
No abstract available.
Visual arts, The performing arts. Show business
Watch the Cities Change / Regardez les villes changer
Michel Rautenberg
American visual sociology is poorly known in France. It is not to be confused with the French sociologie visuelle, which aims to report on the social world by means of images. One main concern of visual sociology is every visual aspect of social and material life. This paper does not claim to be a synthetic presentation; it bases its argument on the work of an American sociologist, Jerome Krase, who gives a pivotal role to looking at urban life and photographing it.
Visual arts, Arts in general
Selfie (Self-Portrait): The Emergence and Manifestation of Individuality in Contemporary Art / سِلفی (خود نگاره) ؛ ظهور و تجلی فردیت در هنر معاصر
مهدی خانکه, محمد شکری
In Cartesian subjectivity, the human being became the central axis of truth and the subject gained primacy. Although Descartes begins his analysis with the individual thinking "I", he ultimately arrives at a general conclusion about the human being, and the human individual is eclipsed. This situation persists in post-Cartesian systems of thought until, in the nineteenth century, with the rise of the individualist ideas of the existentialists, the species-based conception of the human being is challenged and attention turns to the human individual with all of his or her individual characteristics. In the realm of art, shortly before individualist currents took shape in the world of thought, artists emphasized the individual characteristics of the model through various methods of representation. The self-portraits of post-Renaissance artists are among the most important manifestations of this individual-centered approach. Moreover, with the invention of the camera, the change in the definition of art, and the recognition of photography as an artistic medium, the widespread enthusiasm for creating images based on individual characteristics also grew. With the growth of technology and the production of mobile phones equipped with cameras, this art became more pervasive than ever, and the phenomenon known as the selfie flourished as one of the prevalent manifestations of individual-centeredness. The main question of this article is whether the emerging phenomenon of the selfie can be examined philosophically from the standpoint of current individualism; in the course of the research, it also answers the subsidiary questions of how Cartesian subjectivity developed into individualism, why the selfie may be called art, and how human-centered artworks evolved into individual-centered ones. The approach of this article is the prevalence of individualism in contemporary thought, against which backdrop the selfie phenomenon is examined. By exploring the philosophical roots of the selfie, the article concludes that one of the reasons for the public's embrace of this emerging medium is the individualism current in the contemporary era, which derives from Cartesian subjectivity. The present study is qualitative, with data gathered through library research.
Learning Audio-Visual Dereverberation
Changan Chen, Wei Sun, David Harwath
et al.
Reverberation not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry, materials, and speaker location, all of which influence the precise reverberation effects. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed monaural sound and visual scene. In support of this new task, we develop a large-scale dataset SoundSpaces-Speech that uses realistic acoustic renderings of speech in real-world 3D scans of homes offering a variety of room acoustics. Demonstrating our approach on both simulated and real imagery for speech enhancement, speech recognition, and speaker identification, we show it achieves state-of-the-art performance and substantially improves over audio-only methods.