Results for "Motion pictures"
Showing 20 of ~2,222,369 results · from DOAJ, arXiv, Semantic Scholar, CrossRef
Yu-Han Shu, Toshiaki Tsuji, Sho Sakaino
Imitation learning (IL) enables robots to acquire human-like motion skills from demonstrations, but it still requires extensive high-quality data and retraining to handle complex or long-horizon tasks. To improve data efficiency and adaptability, this study proposes a hierarchical IL framework that integrates motion primitives with proportion-based motion synthesis. The proposed method employs a two-layer architecture, where the upper layer performs long-term planning, while a set of lower-layer models learn individual motion primitives, which are combined according to specific proportions. Three model variants are introduced to explore different trade-offs between learning flexibility, computational cost, and adaptability: a learning-based proportion model, a sampling-based proportion model, and a playback-based proportion model, which differ in how the proportions are determined and whether the upper layer is trainable. Through real-robot pick-and-place experiments, the proposed models successfully generated complex motions not included in the primitive set. The sampling-based and playback-based proportion models achieved more stable and adaptable motion generation than the standard hierarchical model, demonstrating the effectiveness of proportion-based motion integration for practical robot learning.
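The abstract above does not include code; as a minimal, hypothetical illustration of proportion-based motion synthesis, the sketch below blends two invented one-dimensional "primitives" with a proportion vector that, in the paper's framework, a higher-level model would output. All trajectory shapes and names here are assumptions for illustration only.

```python
import numpy as np

# Hypothetical illustration: two toy motion primitives (reach and lift),
# combined by a proportion vector. Everything here is invented; it only
# shows how a convex mixture of primitives can yield a new motion.
T = 50  # timesteps
t = np.linspace(0.0, 1.0, T)

reach = np.stack([t, np.zeros(T)], axis=1)   # move along x
lift = np.stack([np.zeros(T), t], axis=1)    # move along y
primitives = np.stack([reach, lift])         # shape (2, T, 2)

def synthesize(proportions, primitives):
    """Combine primitives as a proportion-weighted mixture."""
    w = np.asarray(proportions, dtype=float)
    w = w / w.sum()                          # normalize to proportions
    return np.tensordot(w, primitives, axes=1)  # shape (T, 2)

# A 70/30 blend produces a diagonal motion present in neither primitive.
motion = synthesize([0.7, 0.3], primitives)
print(motion[-1])  # final position: [0.7, 0.3]
```

The blend is a fixed convex combination per trajectory; the paper's three model variants differ in how such proportions are chosen (learned, sampled, or played back), not in the mixing itself.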
Bishoy Galoaa, Sarah Ostadabbas
We propose Track and Caption Any Motion (TCAM), a motion-centric framework for automatic video understanding that discovers and describes motion patterns without user queries. Understanding videos in challenging conditions like occlusion, camouflage, or rapid movement often depends more on motion dynamics than static appearance. TCAM autonomously observes a video, identifies multiple motion activities, and spatially grounds each natural language description to its corresponding trajectory through a motion-field attention mechanism. Our key insight is that motion patterns, when aligned with contrastive vision-language representations, provide powerful semantic signals for recognizing and describing actions. Through unified training that combines global video-text alignment with fine-grained spatial correspondence, TCAM enables query-free discovery of multiple motion expressions via multi-head cross-attention. On the MeViS benchmark, TCAM achieves 58.4% video-to-text retrieval, 64.9 J&F for spatial grounding, and discovers 4.8 relevant expressions per video with 84.7% precision, demonstrating strong cross-task generalization.
Hila Chefer, Uriel Singer, Amit Zohar et al.
Despite tremendous recent progress, generative video models still struggle to capture real-world motion, dynamics, and physics. We show that this limitation arises from the conventional pixel reconstruction objective, which biases models toward appearance fidelity at the expense of motion coherence. To address this, we introduce VideoJAM, a novel framework that instills an effective motion prior into video generators by encouraging the model to learn a joint appearance-motion representation. VideoJAM is composed of two complementary units. During training, we extend the objective to predict both the generated pixels and their corresponding motion from a single learned representation. During inference, we introduce Inner-Guidance, a mechanism that steers the generation toward coherent motion by leveraging the model's own evolving motion prediction as a dynamic guidance signal. Notably, our framework can be applied to any video model with minimal adaptations, requiring no modifications to the training data or scaling of the model. VideoJAM achieves state-of-the-art performance in motion coherence, surpassing highly competitive proprietary models while also enhancing the perceived visual quality of the generations. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation. Project website: https://hila-chefer.github.io/videojam-paper.github.io/
Meenakshi Gupta, Mingyuan Lei, Tat-Jen Cham et al.
This paper introduces a novel framework named D-LORD (Double Latent Optimization for Representation Disentanglement), which is designed for motion stylization (motion style transfer and motion retargeting). The primary objective of this framework is to separate the class and content information from a given motion sequence using a data-driven latent optimization approach. Here, class refers to person-specific style, such as a particular emotion or an individual's identity, while content relates to the style-agnostic aspect of an action, such as walking or jumping, as universally understood concepts. The key advantage of D-LORD is its ability to perform style transfer without needing paired motion data. Instead, it utilizes class and content labels during the latent optimization process. By disentangling the representation, the framework enables the transformation of one motion sequence's style into another's using Adaptive Instance Normalization. The proposed D-LORD framework is designed with a focus on generalization, allowing it to handle different class and content labels for various applications. Additionally, it can generate diverse motion sequences when specific class and content labels are provided. The framework's efficacy is demonstrated through experimentation on three datasets: the CMU XIA dataset for motion style transfer, and the MHAD and RRIS Ability datasets for motion retargeting. Notably, this paper presents the first generalized framework for motion style transfer and motion retargeting, showcasing its potential contributions in this area.
Yue Su, Xinyu Zhan, Hongjie Fang et al.
Inferring object motion representations from observations enhances the performance of robotic manipulation tasks. This paper introduces a new paradigm for robot imitation learning that generates action sequences by reasoning about object motion from visual observations. We propose MBA (Motion Before Action), a novel module that employs two cascaded diffusion processes for object motion generation and robot action generation under object motion guidance. MBA first predicts the future pose sequence of the object based on observations, then uses this sequence as a condition to guide robot action generation. Designed as a plug-and-play component, MBA can be flexibly integrated into existing robotic manipulation policies with diffusion action heads. Extensive experiments in both simulated and real-world environments demonstrate that our approach substantially improves the performance of existing policies across a wide range of manipulation tasks. Project page: https://selen-suyue.github.io/MBApage/
Yuan Wang, Di Huang, Yaqi Zhang et al.
Generating lifelike human motions from descriptive texts has attracted remarkable research focus in recent years, propelled by the emerging requirements of digital humans. Despite impressive advances, existing approaches are often constrained by limited control modalities and task specificity, and focus solely on body motion representations. In this paper, we present MotionGPT-2, a unified Large Motion-Language Model (LMLM) that addresses these limitations. MotionGPT-2 accommodates multiple motion-relevant tasks and supports multimodal control conditions through pre-trained Large Language Models (LLMs). It quantizes multimodal inputs, such as text and single-frame poses, into discrete, LLM-interpretable tokens, seamlessly integrating them into the LLM's vocabulary. These tokens are then organized into unified prompts, guiding the LLM to generate motion outputs through a pretraining-then-finetuning paradigm. We also show that the proposed MotionGPT-2 is highly adaptable to the challenging 3D holistic motion generation task, enabled by the innovative motion discretization framework, Part-Aware VQVAE, which ensures fine-grained representations of body and hand movements. Extensive experiments and visualizations validate the effectiveness of our method, demonstrating the adaptability of MotionGPT-2 across motion generation, motion captioning, and generalized motion completion tasks.
Kaifeng Zhao, Gen Li, Siyu Tang
Text-conditioned human motion generation, which allows for user interaction through natural language, has become increasingly popular. Existing methods typically generate short, isolated motions based on a single input sentence. However, human motions are continuous and can extend over long periods, carrying rich semantics. Creating long, complex motions that precisely respond to streams of text descriptions, particularly in an online and real-time setting, remains a significant challenge. Furthermore, incorporating spatial constraints into text-conditioned motion generation presents additional challenges, as it requires aligning the motion semantics specified by text descriptions with geometric information, such as goal locations and 3D scene geometry. To address these limitations, we propose DartControl, in short DART, a Diffusion-based Autoregressive motion primitive model for Real-time Text-driven motion control. Our model effectively learns a compact motion primitive space jointly conditioned on motion history and text inputs using latent diffusion models. By autoregressively generating motion primitives based on the preceding history and current text input, DART enables real-time, sequential motion generation driven by natural language descriptions. Additionally, the learned motion primitive space allows for precise spatial motion control, which we formulate either as a latent noise optimization problem or as a Markov decision process addressed through reinforcement learning. We present effective algorithms for both approaches, demonstrating our model's versatility and superior performance in various motion synthesis tasks. Experiments show our method outperforms existing baselines in motion realism, efficiency, and controllability. Video results are available on the project page: https://zkf1997.github.io/DART/.
Sarmad Idrees, Jongeun Choi, Seokman Sohn
To achieve seamless collaboration between robots and humans in a shared environment, accurately predicting future human movements is essential. Human motion prediction has traditionally been approached as a sequence prediction problem, leveraging historical human motion data to estimate future poses. Beginning with vanilla recurrent networks, the research community has investigated a variety of methods for learning human motion dynamics, encompassing graph-based and generative approaches. Despite these efforts, achieving accurate long-term predictions continues to be a significant challenge. In this regard, we present the Adversarial Motion Transformer (AdvMT), a novel model that integrates a transformer-based motion encoder and a temporal continuity discriminator. This combination captures spatial and temporal dependencies simultaneously within frames. With adversarial training, our method effectively reduces unwanted artifacts in predictions, thereby ensuring the learning of more realistic and fluid human motions. The evaluation results indicate that AdvMT greatly enhances the accuracy of long-term predictions while also delivering robust short-term predictions.
Tarik Cyril Amar
This article addresses a franchise of intelligence films in the former Communist East Germany. Under the general title Das unsichtbare Visier—The Invisible Visor—they were produced for television and very popular. In general, the Cold War East produced a rich array of its own intelligence heroes, which cannot be reduced to mere derivatives of Western models. Yet there were commonalities and interactions across Cold War divides. One of these common features that Visier shared with many intelligence films globally was the depiction of abroad as both an “invisible front” of dangers and temptations and an exciting realm of adventure and consumption. Visier could not but also be a fantasy about East German citizens encountering, withstanding, and also enjoying the dangers and temptations of the Western Cold War Other. This included their facing two peculiar challenges: a degree of international mobility unlike anything the vast majority of ordinary East Germans could experience and the West’s consumerism. Visier addressed both these issues through what we could describe as an essentially playful—and dis-playful—practical cosmopolitanism. A careful look reveals Visier as a rich artifact of Cold War popular culture, with complex messages. The image of the heroic East German agent included a running comment of compensatory wish fulfillment. Here were ideal East German citizens doing their duty and yet also getting a fair slice of the capitalist good life abroad that most of their compatriots could not reach. They also consistently punched above their weight vicariously for East Germany as a whole. Like Britain’s James Bond, these were agents of an at-best middling power doing major things in the world at large. And finally, perhaps most satisfyingly of all, they turned into gentle, benevolent guardian angels of hapless West German cousins, neatly reversing West Germany’s claims of superiority.
Marina Kaneti
What does cooperation between rival superpowers look like? Do global issues have the capacity to rise above the geopolitics of the day and trigger alignment between rival powers? This paper argues the Cold War joint space exploration program between the United States and USSR provides a lesson on the limits of cooperation. These limits, I suggest, are not only a matter of power preferences, institutional differences, material disincentives, or even a consequence of a tendency for free-riding. Rather, they are also the result of incompatible “common sense” perceptions. Cooperation, even if institutionally viable, as in the case of the joint space program, can be constrained due to a lack of popular endorsement and legitimacy. To develop the argument, I examine the aesthetics of cooperation rendered through widely circulated media images associated with space cooperation during the Cold War. I argue the Cold War imaginary can serve as both critique and inspiration for today’s attempts to legitimize cooperation on global issues such as climate change. It provides insights on the role of “common sense” perceptions and the ways in which they inform questions concerning universality, the role of affect, and the allure of competition.
António Baía Reis, Guida Mendes, Inês Rebanda Coelho et al.
Hyunjung Kim
This article discusses the artist, filmmaker, and writer Lana Lin’s “revisioning” of Audre Lorde’s The Cancer Journals (1980) to consider how the poetic encounter between two artists from different generations exemplifies the relational praxis that Lorde pursued throughout her life. In revealing the ways in which Lorde opens acoustic sensory spaces for contemporary readers and listeners of her work, I specifically focus on the political and aesthetic dynamic of recitation in Lin’s experimental documentary The Cancer Journals Revisited (2018). I trace how recitation becomes a coalitional practice of feminist chorus through which readers and listeners embody the sonorous space they co-create through and with Lorde. The article further looks into the ways Lin pairs recitation performance with visceral images of body parts on the screen to argue that the film’s visuals dwell in and extend the form of the chorus; which is to say, the film expands the meaning and practice of chorus beyond the realm of the sonic. The bodies on the screen extend the chorus, immersing the audience in a participatory rethinking of the experience of cancer (particularly breast cancer) not merely as an individual misfortune or an aesthetic concern. In adjoining textual, visual, and acoustic elements, Lin’s film provides a powerful transformative sensory environment, reverberating through the reciters and the audience.
Sihan Ma, Qiong Cao, Hongwei Yi et al.
Demystifying complex human-ground interactions is essential for accurate and realistic 3D human motion reconstruction from RGB videos, as it ensures consistency between the humans and the ground plane. Prior methods have modeled human-ground interactions either implicitly or in a sparse manner, often resulting in unrealistic and incorrect motions when faced with noise and uncertainty. In contrast, our approach explicitly represents these interactions in a dense and continuous manner. To this end, we propose a novel Ground-aware Motion Model for 3D Human Motion Reconstruction, named GraMMaR, which jointly learns the distribution of transitions in both pose and interaction between every joint and ground plane at each time step of a motion sequence. It is trained to explicitly promote consistency between the motion and distance change towards the ground. After training, we establish a joint optimization strategy that utilizes GraMMaR as a dual-prior, regularizing the optimization towards the space of plausible ground-aware motions. This leads to realistic and coherent motion reconstruction, irrespective of the assumed or learned ground plane. Through extensive evaluation on the AMASS and AIST++ datasets, our model demonstrates good generalization and discriminating abilities in challenging cases including complex and ambiguous human-ground interactions. The code will be available at https://github.com/xymsh/GraMMaR.
Ekkasit Pinyoanuntapong, Pu Wang, Minwoo Lee et al.
Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However, these models often suffer from a trade-off between real-time performance, high fidelity, and motion editability. To address this gap, we introduce MMM, a novel yet simple motion generation paradigm based on Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space, and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens, conditioned on the pre-computed text tokens. By attending to motion and text tokens in all directions, MMM explicitly captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. During inference, this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions, therefore simultaneously achieving high-fidelity and high-speed motion generation. In addition, MMM has innate motion editability. By simply placing mask tokens in the place that needs editing, MMM automatically fills the gaps while guaranteeing smooth transitions between editing and non-editing parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429), while offering advanced editing features such as body-part modification, motion in-betweening, and the synthesis of long motion sequences. In addition, MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. Our project page is available at \url{https://exitudio.github.io/MMM-page}.
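The parallel, iterative decoding described in the abstract above is an instance of confidence-based masked-token decoding. As a minimal sketch of that generic scheme (not MMM's actual model: the motion tokenizer and transformer are replaced here by an invented dummy scorer), each step commits the most confident fraction of still-masked tokens:

```python
import numpy as np

MASK = -1  # sentinel for a masked motion token

def dummy_predict(tokens, rng):
    """Stand-in for a masked motion transformer: returns a predicted
    codebook id and a confidence score for every position. Purely
    illustrative; a real model would condition on text tokens too."""
    preds = rng.integers(0, 512, size=len(tokens))  # fake codebook ids
    conf = rng.random(len(tokens))
    return preds, conf

def iterative_decode(length, steps, rng):
    """Confidence-based parallel decoding: at each step, commit the
    most confident still-masked tokens so all are filled in `steps`."""
    tokens = np.full(length, MASK)
    for s in range(steps):
        preds, conf = dummy_predict(tokens, rng)
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        # Unmask an increasing share of the remaining masked positions.
        k = max(1, int(np.ceil(masked.size / (steps - s))))
        commit = masked[np.argsort(-conf[masked])[:k]]
        tokens[commit] = preds[commit]
    return tokens

rng = np.random.default_rng(0)
out = iterative_decode(length=16, steps=4, rng=rng)
assert (out != MASK).all()  # all 16 positions filled in 4 parallel steps
```

Because several tokens are committed per step, the sequence is produced in a handful of forward passes rather than one pass per token, which is the source of the speedup the abstract reports.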
Oliver Kroener
In April 2020, at the height of the COVID-19 pandemic, the state of Florida classified professional wrestling (pro wrestling) alongside hospitals, law enforcement, and grocery stores as an essential service. The state’s decision made global headlines and was generally met with a mix of confusion and surprise. This article investigates the impact of the COVID-19 pandemic on contemporary television wrestling. After tracing the influence that the television medium and media convergence have had on pro wrestling, the article argues that the pandemic circumstances, coupled with the demands of television wrestling’s serialized storytelling, led to an evolution of a specific wrestling-match type: the cinematic match. Through a close textual analysis of wrestling programs that were produced without a live crowd during the COVID-19 crisis (e.g., WWE Friday Night Smackdown, Wrestlemania 36), this article examines how television wrestling—in an effort to appeal to its audience—produced a number of cinematic matches that combine distinct elements (e.g., editing, cinematography, sound) of genre filmmaking (e.g., horror, action) with the spectacular athleticism of pro wrestling.
Hanne Schelstraete
This article analyses Eric Rohmer’s film theory in the light of the Platonic triad of truth, beauty and goodness, as embodied by the aesthetic philosophy of Kant, Hegel and Schiller. Although his film theory shows affinity with Kant’s ideal of art as a form of natural beauty, I will argue that a broader look at Rohmer’s philosophical foundations is necessary. The point where Rohmer’s film theory deviates from Kant’s triadic philosophy is exactly the point where he approaches the aesthetics of Hegel and Schiller. Turning towards their idealistic philosophy creates a new understanding of Rohmer’s film theory. His ideas on cinema’s ontology and value mirror both Hegel’s emphasis on beauty as the artistic and immediate manifestation of trans-historical truth and Schiller’s belief in beauty as the condition for morality and freedom. Plato’s conceptual and intertwined triad of truth, beauty and goodness lies at the basis of his affinity to 18th-century aesthetic philosophy. This close affinity raises the question of the relation between film theory and philosophy, and Rohmer’s position in this debate. He enriches his own thoughts with those of the philosophers who inspired him, and enriches their legacy with his cinema-related conclusions.
Simone Rijavec
In quantum theory, the concept of time rests on shaky ground. One way to address this problem is to remove the usual background time parameter as a primitive entity and explain its emergence via correlations between physical systems. This approach was adopted by Page and Wootters (1983), who showed how time can emerge in a stationary quantum universe from the correlations between two of its subsystems, one of them acting as a clock for the other. In this work, I study the robustness of the Page-Wootters construction across different pictures, states of the universe and clock interactions, clarifying the role and the nature of the correlations between the subsystems of the universe. I start by showing how to formulate the Page-Wootters construction in the Heisenberg picture via a unitary change of basis. I consider both pure and mixed states of the universe and extend the analysis to include interactions between the clock and the other subsystem of the universe. The study reveals what kind of correlations are necessary for the construction to work. Interestingly, entanglement is not required as long as there are no interactions with the clock. The study also shows that these interactions can lead to a non-unitary evolution for some mixed states of the universe. In a simple two-level system, this aspect becomes relevant at scales where one would expect strong relativistic effects. At these scales, I also observe an inversion in the system's direction of time.
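The interaction-free core of the Page-Wootters construction described above admits a compact finite-dimensional sketch. In the toy example below (dimensions, Hamiltonian, and tick spacing are all illustrative choices, not from the paper), a globally static "history" state over a discrete N-tick clock and a qubit is built, and conditioning on the clock reading t recovers the Schrödinger-evolved system state:

```python
import numpy as np

# Minimal sketch of the interaction-free Page-Wootters mechanism with a
# discrete N-tick clock; all parameter choices here are illustrative.
N = 8     # clock dimension / number of ticks
dt = 0.1  # tick spacing

# System: a qubit evolving under H = sigma_x.
H = np.array([[0.0, 1.0], [1.0, 0.0]])
w, V = np.linalg.eigh(H)
U = V @ np.diag(np.exp(-1j * w * dt)) @ V.conj().T  # one-tick unitary

psi0 = np.array([1.0, 0.0], dtype=complex)

# History state |Psi> ∝ sum_t |t>_clock ⊗ U^t |psi0>_system.
# It is globally stationary; dynamics lives in clock-system correlations.
hist = np.zeros((N, 2), dtype=complex)
state = psi0
for t in range(N):
    hist[t] = state
    state = U @ state
hist /= np.linalg.norm(hist)

# Conditioning on the clock reading t recovers unitary evolution.
for t in range(N):
    cond = hist[t] / np.linalg.norm(hist[t])
    expected = np.linalg.matrix_power(U, t) @ psi0
    assert np.allclose(cond, expected)
```

The paper's analysis goes well beyond this baseline (Heisenberg picture, mixed states, clock-system interactions, and the resulting non-unitarity), none of which this sketch attempts to capture.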
Xiangkun Li, Guoqing Sun, Yifei Li
With the development of science and technology, virtual reality has pushed human-computer interaction to a new height, and the combination of the two is increasingly applied in military simulation, medical rehabilitation, game creation, and other fields. Action is the basis of human behavior, and human behavior and action analysis is an important research direction. Recognition based on behavior and action is convenient, intuitive, strongly interactive, and rich in expressive information, making it the first choice of many researchers for human behavior analysis. However, human motion and motion pictures are complex objects with many ambiguous factors that are difficult to express and process. Traditional motion recognition is usually based on two-dimensional color images, yet two-dimensional RGB images are vulnerable to background disturbance, lighting, environment, and other factors that interfere with human target detection. In recent years, more and more researchers have begun to use fuzzy mathematics theory to recognize human behaviors. In this work, plantar pressure data under different motion modes were collected through experiments and the current gait information was analyzed. Key gait events, including toe-off and heel touch, were identified through dynamic baseline monitoring. To reduce erroneous detections of key gait events, a screening window filters out repeated recognition events within a given period, which greatly improves recognition accuracy and provides important gait information for motion pattern recognition.
Similarity matching is performed on each template; the correct rate of motion feature extraction is 90.2% and the correct rate of motion pattern recognition is 96.3%, verifying the feasibility and effectiveness of human motion recognition based on fuzzy theory. It is hoped that this work provides processing techniques and application examples for artificial-intelligence recognition applications.
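The baseline-threshold gait-event detection with repeat filtering described in the abstract above can be sketched generically. The synthetic pressure signal, fixed baseline, and 300 ms refractory window below are all invented stand-ins (the paper's dynamic baseline and data are not reproduced): rising crossings act as heel strikes, falling crossings as toe-offs, and re-triggers inside the window are discarded:

```python
import numpy as np

# Illustrative sketch only: signal, baseline, and window are invented.
fs = 100                         # sampling rate (Hz)
t = np.arange(0, 4, 1 / fs)
# Synthetic plantar pressure: bursts during stance, noise elsewhere.
pressure = (np.sin(2 * np.pi * 1.0 * t).clip(min=0) ** 2
            + 0.02 * np.random.default_rng(0).standard_normal(t.size))

baseline = 0.1                   # stand-in for the dynamic baseline
refractory = int(0.3 * fs)       # ignore re-triggers within 300 ms

def detect_events(signal, thresh, refractory):
    """Rising crossings = heel strikes, falling crossings = toe-offs,
    with repeated detections inside the refractory window discarded."""
    above = signal > thresh
    rises = np.flatnonzero(~above[:-1] & above[1:]) + 1
    falls = np.flatnonzero(above[:-1] & ~above[1:]) + 1

    def debounce(idx):
        kept = []
        for i in idx:
            if not kept or i - kept[-1] >= refractory:
                kept.append(i)
        return np.array(kept)

    return debounce(rises), debounce(falls)

heel_strikes, toe_offs = detect_events(pressure, baseline, refractory)
print(len(heel_strikes), len(toe_offs))
```

The refractory window plays the role of the "screening window" in the abstract: noise chatter around the threshold produces clusters of crossings, and only the first crossing in each cluster survives.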
Page 27 of 111119