Human motion video generation has garnered significant research interest due to its broad applications, enabling innovations such as photorealistic singing heads and dynamic avatars that seamlessly dance to music. However, existing surveys in this field focus on individual methods and lack a comprehensive overview of the entire generative process. This paper addresses that gap with an in-depth survey of human motion video generation, encompassing more than ten sub-tasks and detailing the five key phases of the generation process: input, motion planning, motion video generation, refinement, and output. Notably, this is the first survey to discuss the potential of large language models in enhancing human motion video generation. Our survey reviews the latest developments and technological trends in human motion video generation across three primary modalities: vision, text, and audio. By covering over two hundred papers, we offer a thorough overview of the field and highlight milestone works that have driven significant technological breakthroughs. Our goal is to unveil the prospects of human motion video generation and to serve as a valuable resource for advancing the comprehensive application of digital humans. A complete list of the models examined in this survey is available in our repository: https://github.com/Winn1y/Awesome-Human-Motion-Video-Generation.
Myocardial motion tracking is important for assessing cardiac function and diagnosing cardiovascular diseases, for which cine cardiac magnetic resonance (CMR) has been established as the gold standard imaging modality. Many existing methods learn motion from single image pairs consisting of a reference frame and a randomly selected target frame from the cardiac cycle. However, these methods overlook the continuous nature of cardiac motion and often yield inconsistent and non-smooth motion estimations. In this work, we propose a novel Mamba-based cardiac motion tracking network (MCM) that explicitly incorporates the target image sequence from the cardiac cycle to achieve smooth and temporally consistent motion tracking. By developing a bi-directional Mamba block equipped with a bi-directional scanning mechanism, our method facilitates the estimation of plausible deformation fields. With our proposed motion decoder, which integrates motion information from frames adjacent to the target frame, our method further enhances temporal coherence. Moreover, by taking advantage of Mamba's structured state-space formulation, the proposed method learns the continuous dynamics of the myocardium from sequential images without increasing computational complexity. We evaluate the proposed method on two public datasets. The experimental results demonstrate that the proposed method quantitatively and qualitatively outperforms both conventional and state-of-the-art learning-based cardiac motion tracking methods. The code is available at https://github.com/yjh-0104/MCM.
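To make the sequence-modeling idea above concrete, the following is a minimal sketch of a bi-directional scan over per-frame features using a plain linear state-space recurrence. It illustrates the general pattern (a forward scan, a backward scan, and a fusion step) rather than MCM's actual selective Mamba block; the class name, dimensions, and fusion layer are all hypothetical.

```python
# Minimal sketch of a bi-directional state-space scan over per-frame
# features. This is a plain (non-selective) linear SSM recurrence,
# not the authors' implementation; all names and sizes are hypothetical.
import torch
import torch.nn as nn

class BiDirectionalSSM(nn.Module):
    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        # Shared recurrence parameters; a diagonal transition keeps the
        # scan O(T) in sequence length.
        self.A = nn.Parameter(torch.full((state,), -0.5))
        self.B = nn.Linear(dim, state, bias=False)
        self.C = nn.Linear(state, dim, bias=False)
        self.fuse = nn.Linear(2 * dim, dim)

    def scan(self, x):                      # x: (T, B, dim)
        h = torch.zeros(x.size(1), self.A.numel(), device=x.device)
        decay = torch.exp(self.A)           # stable per-step decay in (0, 1)
        outs = []
        for t in range(x.size(0)):
            h = decay * h + self.B(x[t])    # h_t = exp(A) * h_{t-1} + B x_t
            outs.append(self.C(h))          # y_t = C h_t
        return torch.stack(outs)

    def forward(self, x):                   # bi-directional: forward + reversed scan
        fwd = self.scan(x)
        bwd = self.scan(x.flip(0)).flip(0)
        return self.fuse(torch.cat([fwd, bwd], dim=-1))

# Example: 25 cine frames, batch of 2, 64-dim per-frame features.
feats = torch.randn(25, 2, 64)
print(BiDirectionalSSM(64)(feats).shape)   # torch.Size([25, 2, 64])
```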
Abdoljalil Addeh, G. Bruce Pike, M. Ethan MacDonald
Acquiring accurate external respiratory data during functional Magnetic Resonance Imaging (fMRI) is challenging, prompting the exploration of machine learning methods to estimate respiratory variation (RV) from fMRI data. Respiration induces head motion, including real and pseudo motion, which likely provides useful information about respiratory events. Recommended notch filters mitigate respiratory-induced motion artifacts, suggesting that a bandpass filter at the respiratory frequency band isolates respiratory-induced head motion. This study seeks to enhance the accuracy of RV estimation from resting-state BOLD-fMRI data by integrating estimated head motion parameters. Specifically, we aim to determine the impact of incorporating raw versus bandpass-filtered head motion parameters on RV reconstruction accuracy using one-dimensional convolutional neural networks (1D-CNNs). This approach addresses the limitations of traditional filtering techniques and leverages the potential of head motion data to provide a more robust estimation of respiratory-induced variations.
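As a rough illustration of the pipeline sketched above, the snippet below bandpass-filters head-motion traces at a nominal respiratory band and feeds them to a small 1D-CNN. The cutoff frequencies, TR, and network architecture are placeholder assumptions, not the study's actual settings.

```python
# Illustrative sketch: bandpass-filter head-motion parameters at an
# assumed respiratory band, then regress RV with a small 1D-CNN.
import numpy as np
import torch
import torch.nn as nn
from scipy.signal import butter, filtfilt

def bandpass(motion, fs, lo=0.1, hi=0.5):
    """Zero-phase Butterworth bandpass over each motion parameter."""
    b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, motion, axis=-1)

class RVNet(nn.Module):
    """Tiny 1D-CNN mapping 6 motion traces to an RV trace (placeholder shape)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(6, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(32, 1, kernel_size=1),
        )

    def forward(self, x):          # x: (batch, 6 params, timepoints)
        return self.net(x).squeeze(1)

fs = 1.0 / 0.8                     # sampling rate for a hypothetical TR of 0.8 s
motion = np.random.randn(6, 600)   # 6 rigid-body parameters, 600 volumes
filtered = bandpass(motion, fs)
rv_hat = RVNet()(torch.tensor(filtered[None], dtype=torch.float32))
print(rv_hat.shape)                # torch.Size([1, 600])
```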
Hsiang-Wei Huang, Cheng-Yen Yang, Wenhao Chai, et al.
In the field of multi-object tracking (MOT), traditional methods often rely on the Kalman filter for motion prediction, leveraging its strengths in linear motion scenarios. However, the inherent limitations of these methods become evident when confronted with the complex, nonlinear motions and occlusions prevalent in dynamic environments like sports and dance. This paper explores the possibility of replacing the Kalman filter with a learning-based motion model that effectively enhances tracking accuracy and adaptability beyond the constraints of Kalman filter-based trackers. Our proposed methods, MambaMOT and MambaMOT+, demonstrate advanced performance on challenging MOT datasets such as DanceTrack and SportsMOT, showcasing their ability to handle intricate, non-linear motion patterns and frequent occlusions more effectively than traditional methods.
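The contrast at the heart of this line of work can be sketched in a few lines: a constant-velocity Kalman predict step versus a learned sequence model that maps a track's recent bounding boxes to its next position. The LSTM below is a generic stand-in for illustration, not the MambaMOT architecture.

```python
# Two prediction styles: a constant-velocity Kalman predict step vs. a
# learned motion model over a track's bounding-box history.
import numpy as np
import torch
import torch.nn as nn

def kalman_predict(x, P, dt=1.0, q=1e-2):
    """Constant-velocity predict for state [cx, cy, vx, vy]."""
    F = np.eye(4)
    F[0, 2] = F[1, 3] = dt                 # position += velocity * dt
    Q = q * np.eye(4)                      # process noise
    return F @ x, F @ P @ F.T + Q

class LearnedMotionModel(nn.Module):
    """Predicts the next box center from the last k observed boxes."""
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(input_size=4, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, history):            # history: (batch, k, 4) boxes
        out, _ = self.rnn(history)
        return self.head(out[:, -1])       # (batch, 2) predicted center

x, P = np.array([5.0, 5.0, 1.0, 0.0]), np.eye(4)
print(kalman_predict(x, P)[0])             # linear extrapolation: [6. 5. 1. 0.]
print(LearnedMotionModel()(torch.randn(8, 10, 4)).shape)  # torch.Size([8, 2])
```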
Nayan Man Singh Pradhan, Patrick Frank, An Mo, et al.
Parallel robots are capable of high-speed manipulation and have become essential tools in industry. The proximal placement of their motors and the low weight of their end effectors make them ideal for generating highly dynamic motion. Therefore, parallel robots can be adopted for motion platform designs, as long as end-effector loads are low. Traditional motion platforms can be large and powerful enough to generate accelerations of multiple g, but these designs tend to be expensive and bulky. Similar but smaller motion platforms offer only a small working range, with reduced degrees of freedom (DoFs) and a limited payload. Here we seek a medium-sized, affordable parallel robot capable of powerful, high-speed 6-DoF motion in a comparably large workspace. This work explores the concept of a quadruped robot flipped upside-down, with the motion platform fixed between its feet. In particular, we exploit the high-power dynamic brushless actuation and the four-leg redundancy when moving the motion platform. We characterize the resulting motion platform by tracking sinusoidal and circular trajectories with varying loads. Dynamic motions in 6 DoFs up to 10 Hz and ~10 mm amplitude are possible when moving a mass of 300 grams. We demonstrate single-axis end-effector translations up to ~20 mm at 10 Hz for higher loads of 1.2 kg. The motion platform can be replicated easily with 3D printing and off-the-shelf components. All motion-platform hardware and the custom-written software required to replicate it are open-source.
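For reference, a single-axis sinusoidal reference trajectory of the kind used in the tracking experiments (10 Hz, ~10 mm amplitude) can be generated as follows; the 1 kHz control rate and the 6-DoF signal layout are assumptions for illustration.

```python
# Reference signal sketch: a single-axis sinusoid at 10 Hz with 10 mm
# amplitude, sampled for a hypothetical 1 kHz controller.
import numpy as np

def sinusoid_reference(freq_hz=10.0, amp_m=0.010, duration_s=2.0, rate_hz=1000.0):
    """6-DoF reference [x, y, z, roll, pitch, yaw]; only x is excited."""
    t = np.arange(0.0, duration_s, 1.0 / rate_hz)
    ref = np.zeros((t.size, 6))
    ref[:, 0] = amp_m * np.sin(2.0 * np.pi * freq_hz * t)   # x translation
    return t, ref

t, ref = sinusoid_reference()
print(ref[:, 0].max())   # peak displacement: 0.01 m (10 mm)
```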
This article explores the affinities between animation practice and experiments in perception by Gestalt psychologists. By drawing out a Gestalt style of seeing — a sensitivity to the visual forces that scaffold an image — we can better describe movements, figures, and spaces in animation. Although these affinities make Gestalt appropriate for discussing animation, they do not necessarily imply that animated films merely illustrate or independently verify Gestalt laws of perception. Rather, they suggest two branches of cultural practice sharing what philosopher of science Ian Hacking calls a ‘style of reasoning’: a regularized procedure whose consistent results form a basis for knowledge in a given culture. This article argues that Gestalt and animation are co-participants in the ‘culture of design’: a project of shaping sensory arrangements in order to shape populations, which began in the nineteenth century and has gained force through the present day. It is this culture of design, which includes the exploration of cinema as an art of graphic arrangement, that has become all-but-ubiquitous in the 21st century and has led to the ubiquity of animation.
Gabriele De Luca, Thomas J. Lampoltshammer, Johannes Scholz
A human is a thing that moves in space. Like all things that move in space, its motion can in principle be described by differential equations, as a set of functions mapping time to position (and velocity, acceleration, and so on). With inanimate objects, we can reliably predict trajectories using differential equations that account for up to the second-order time derivative of position, as is commonly done in analytical mechanics. With animate objects, though, and with humans in particular, we do not know the cardinality of the set of equations that define their trajectory. We may be tempted to think, for example, that because of their cognitive and behavioural complexity compared to, say, a rock, the motion of humans requires a more complex description than the one generally used for physical systems. In this paper, we examine a real-world dataset on human mobility, consider the information added by each (computed, but denoised) additional time derivative, and find the maximum order of derivative of the position that, for that particular dataset, cannot be expressed as a linear transformation of the previous ones. In this manner, we identify the dimensionality of a minimal model that correctly describes the observed trajectories. We find that every derivative of order higher than the acceleration is linearly dependent upon one of the previous time derivatives. This measure is robust against noise and against the choice of differentiation technique used to compute the time derivatives numerically from the measured position. This result imposes empirical constraints on the possible sets of differential equations that can be used to describe the kinematics of a moving human.
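The dependence test described above can be sketched directly: numerically differentiate a position signal, then regress each higher-order derivative on the lower-order ones and check the fit. The synthetic signal, smoothing, and differentiation scheme below are illustrative choices, not the paper's exact procedure.

```python
# Linear-dependence test for successive time derivatives of position.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 2000)
x = np.sin(t) + 0.05 * rng.standard_normal(t.size)   # toy 1-D trajectory

def smooth(y, w=25):
    """Moving-average smoothing to control noise amplification."""
    return np.convolve(y, np.ones(w) / w, mode="same")

# Successive central-difference derivatives of the position signal.
derivs = [x]
for _ in range(4):
    derivs.append(smooth(np.gradient(derivs[-1], t)))

# Regress derivative k on derivatives 0..k-1; a high R^2 means it adds
# no new (linearly independent) information.
sl = slice(100, -100)   # drop boundary samples affected by smoothing
for k in range(2, 5):
    A = np.stack([d[sl] for d in derivs[:k]], axis=1)
    target = derivs[k][sl]
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    print(f"order {k}: R^2 = {1.0 - resid.var() / target.var():.3f}")
```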
In this report, we present the 1st place solution for the motion prediction track in the 2022 Waymo Open Dataset Challenges. We propose a novel Motion Transformer framework for multimodal motion prediction, which introduces a small set of novel motion query pairs for generating better multimodal future trajectories by jointly performing intention localization and iterative motion refinement. A simple model-ensemble strategy with non-maximum suppression is adopted to further boost the final performance. Our approach achieves 1st place on the motion prediction leaderboard of the 2022 Waymo Open Dataset Challenges, outperforming other methods by remarkable margins. Code will be available at https://github.com/sshaoshuai/MTR.
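The ensemble step mentioned above relies on non-maximum suppression at the trajectory level, which can be sketched as follows: keep candidates in descending score order and suppress any whose endpoint lies too close to an already-kept trajectory. The endpoint criterion and threshold are illustrative; the challenge entry's exact rule may differ.

```python
# Trajectory-level non-maximum suppression over multimodal predictions.
import numpy as np

def trajectory_nms(trajs, scores, dist_thresh=2.0):
    """trajs: (N, T, 2) candidate futures; scores: (N,). Returns kept indices."""
    order = np.argsort(-scores)            # highest-scoring candidates first
    kept = []
    for i in order:
        end_i = trajs[i, -1]
        # Keep i only if its endpoint is far from all kept endpoints.
        if all(np.linalg.norm(end_i - trajs[j, -1]) > dist_thresh for j in kept):
            kept.append(i)
    return kept

trajs = np.cumsum(np.random.randn(50, 80, 2) * 0.3, axis=1)  # 50 candidates, 80 steps
scores = np.random.rand(50)
print(trajectory_nms(trajs, scores)[:6])   # indices of kept, score-ordered modes
```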
Motion, the most distinctive time-varying phenomenon in a video, is unique and critical to the development of video representation learning. In this paper, we ask: how important is motion, particularly for self-supervised video representation learning? To this end, we compose a duet of exploiting motion for data augmentation and for feature learning in the regime of contrastive learning. Specifically, we present a Motion-focused Contrastive Learning (MCL) method that takes this duet as its foundation. On one hand, MCL capitalizes on the optical flow of each frame in a video to temporally and spatially sample tubelets (i.e., sequences of associated frame patches across time) as data augmentations. On the other hand, MCL further aligns gradient maps of the convolutional layers to optical flow maps from spatial, temporal, and spatio-temporal perspectives, in order to ground motion information in feature learning. Extensive experiments conducted on an R(2+1)D backbone demonstrate the effectiveness of our MCL. On UCF101, a linear classifier trained on the representations learnt by MCL achieves 81.91% top-1 accuracy, outperforming ImageNet supervised pre-training by 6.78%. On Kinetics-400, MCL achieves 66.62% top-1 accuracy under the linear protocol. Code is available at https://github.com/YihengZhang-CV/MCL-Motion-Focused-Contrastive-Learning.
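The flow-guided tubelet sampling can be illustrated with a simplified version that extracts a fixed-location patch across time at the most motion-salient position; MCL's actual sampler associates patches across frames, so treat this as a schematic only.

```python
# Simplified flow-guided tubelet sampling: pick the spatial location
# with the largest time-averaged flow magnitude and crop it in every frame.
import numpy as np

def sample_tubelet(frames, flows, patch=32):
    """frames: (T, H, W, 3); flows: (T, H, W, 2) -> tubelet (T, patch, patch, 3)."""
    mag = np.linalg.norm(flows, axis=-1).mean(axis=0)     # time-averaged motion map
    half = patch // 2
    # Restrict centers so the patch stays inside the frame, then take
    # the most motion-salient location.
    valid = mag[half:-half, half:-half]
    cy, cx = np.unravel_index(valid.argmax(), valid.shape)
    cy, cx = cy + half, cx + half
    return frames[:, cy - half:cy + half, cx - half:cx + half]

frames = np.random.rand(16, 112, 112, 3)
flows = np.random.randn(16, 112, 112, 2)
print(sample_tubelet(frames, flows).shape)   # (16, 32, 32, 3)
```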
Onur Akgun, Kamil Canberk Atik, Mustafa Erdem, et al.
We investigate how to use predictive models to select appropriate motion planning strategies, based on perception uncertainty estimation, for agile unmanned aerial vehicle (UAV) navigation tasks. Although there is a variety of motion planning and perception algorithms for such tasks, the impact of perception uncertainty is not explicitly handled in many current motion planning algorithms, which leads to performance loss in real-life scenarios where measurements are often noisy due to external disturbances. We develop a novel framework for embedding perception uncertainty into high-level motion planning management, in order to select the best available motion planning approach for the currently estimated perception uncertainty. We estimate the uncertainty in visual inputs using a deep neural network (CovNet) that explicitly predicts the covariance of the current measurements. Next, we train a high-level machine learning model to predict the lowest-cost motion planning algorithm given the current covariance estimate and the UAV state. We demonstrate on both real-life data and drone racing simulations that our approach, named the uncertainty-driven motion planning switcher (UDS), yields the safest and fastest trajectories among the compared alternatives. Furthermore, we show that the developed approach learns how to trade off safety against agility by switching to motion planners that lead to more agile trajectories when the estimated covariance is high, and vice versa.
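A minimal sketch of the switching logic is given below: a stand-in for CovNet's covariance estimate is combined with the UAV state, and a classifier predicts which available planner will incur the lowest cost. The feature layout, planner set, and synthetic labels are placeholder assumptions, not the paper's setup.

```python
# Planner-switching sketch: classify the lowest-cost planner from an
# uncertainty estimate plus UAV state features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

PLANNERS = ["planner_a", "planner_b", "planner_c"]

# Hypothetical training set: features are [trace of predicted covariance,
# speed, distance to next gate]; the label is the index of the planner
# that achieved the lowest cost on that episode.
rng = np.random.default_rng(0)
X = rng.random((500, 3)) * [5.0, 15.0, 20.0]
costs = rng.random((500, 3)) + np.outer(X[:, 0], [0.3, 0.1, 0.0])  # toy cost model
y = costs.argmin(axis=1)

switcher = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def select_planner(cov_trace, speed, gate_dist):
    """Pick the planner predicted to be cheapest for the current state."""
    return PLANNERS[switcher.predict([[cov_trace, speed, gate_dist]])[0]]

print(select_planner(cov_trace=0.5, speed=12.0, gate_dist=6.0))
print(select_planner(cov_trace=4.5, speed=12.0, gate_dist=6.0))
```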
Let $(Z,κ)$ be a Walsh Brownian motion with spinning measure $κ$. Suppose $μ$ is a probability measure on $\mathbb{R}^n$. We characterize all the $κ$ such that $μ$ is a stopping distribution of $(Z,κ)$. If we further restrict the solution to be integrable, we show that there is only one choice of $κ$. We also generalize Vallois' embedding, and prove that it minimizes the expectation $\mathbb{E}[Ψ(L^Z_τ)]$ among all the admissible solutions $τ$, where $Ψ$ is a strictly convex function and $(L_t^Z)_{t \geq 0}$ is the local time of the Walsh Brownian motion at the origin.
Video salient object detection aims at discovering the most visually distinctive objects in a video. How to effectively take object motion into consideration during video salient object detection is a critical issue. Existing state-of-the-art methods either do not explicitly model and harvest motion cues or ignore spatial contexts within optical flow images. In this paper, we develop a multi-task motion-guided video salient object detection network, which learns to accomplish two sub-tasks using two sub-networks: one for salient object detection in still images and the other for motion saliency detection in optical flow images. We further introduce a series of novel motion-guided attention modules, which use the motion saliency sub-network to attend to and enhance the sub-network for still images. The two sub-networks learn to adapt to each other through end-to-end training. Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on a wide range of benchmarks. We hope our simple and effective approach will serve as a solid baseline and help ease future research in video salient object detection. Code and models will be made available.
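One common form of such a motion-guided attention module, in the spirit of what is described above, gates the still-image features with a per-pixel attention map derived from motion saliency and adds a residual connection; the sketch below shows this generic form, not necessarily the paper's exact design.

```python
# Generic motion-guided attention: a motion-saliency map gates the
# appearance features, with a residual path so appearance cues are
# enhanced rather than replaced.
import torch
import torch.nn as nn

class MotionGuidedAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # 1x1 conv turns the single-channel motion saliency map into a
        # per-pixel, per-channel attention gate.
        self.gate = nn.Sequential(nn.Conv2d(1, channels, kernel_size=1), nn.Sigmoid())

    def forward(self, appearance, motion_saliency):
        # appearance: (B, C, H, W); motion_saliency: (B, 1, H, W)
        attn = self.gate(motion_saliency)
        return appearance + appearance * attn   # residual enhancement

feat = torch.randn(2, 64, 56, 56)        # still-image branch features
sal = torch.rand(2, 1, 56, 56)           # motion saliency from the flow branch
print(MotionGuidedAttention(64)(feat, sal).shape)   # torch.Size([2, 64, 56, 56])
```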
We analyze the statistical physics of self-propelled particles from a general theoretical framework that properly describes the most salient characteristic of active motion, $persistence$, in arbitrary spatial dimensions. Such a framework allows the development of a Smoluchowski-like equation for the probability density of finding a particle at a given position and time, without assuming an explicit orientational dynamics of the self-propelling velocity, as Langevin-like equation-based models do. In addition, the Brownian motion due to thermal fluctuations and the active motion due to a general intrinsic persistent motion of the particle are taken into consideration on an equal footing. The persistence of motion is introduced in our formalism in the form of a \emph{two-time memory function}, $K(t,t^{\prime})$. We focus on the consequences when $K(t,t^{\prime})\sim (t/t^{\prime})^{-η}\exp[-Γ(t-t^{\prime})]$, with $Γ^{-1}$ the characteristic persistence time, and show that this form precisely describes a variety of active motion patterns characterized by $η$. We find analytical expressions for the experimentally obtainable intermediate scattering function, the time dependence of the mean-squared displacement, and the kurtosis.
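Since the memory kernel is given in closed form, it can be evaluated directly; the snippet below computes K(t, t') for a few arbitrary choices of eta and Gamma, purely as an illustration of the stated formula.

```python
# Direct numerical evaluation of the two-time memory kernel stated in
# the abstract: K(t, t') = (t/t')**(-eta) * exp(-Gamma * (t - t')).
# The parameter values below are arbitrary illustrative choices.
import numpy as np

def memory_kernel(t, t_prime, eta, gamma):
    return (t / t_prime) ** (-eta) * np.exp(-gamma * (t - t_prime))

t = np.linspace(1.0, 10.0, 5)
for eta in (0.0, 0.5, 1.0):
    # eta = 0 recovers a purely exponential (Ornstein-Uhlenbeck-like) decay.
    print(eta, memory_kernel(t, t_prime=1.0, eta=eta, gamma=0.2).round(4))
```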
In this essay, I clarify the relationship between propaganda and film censorship in Italy and France during and after the First World War through the lens of veterans' experience, analyzing these themes between 1915 and 1925. I propose not to consider Italian fiction films on the veterans' experience in a unified way: after two years of war, the situation worsened, and no film in this period had a happy ending. In France the situation was the inverse: during the war, brutality and death were shown openly, while after the Armistice of Compiègne, films about the war became less violent and the status of veterans improved.
This article seeks to entangle two current philosophic praxes: New Materialism and Sensory Ethnography. Jane Bennett has become one of New Materialism's most prominent proponents since the release of her now-seminal text, Vibrant Matter, in 2010. Because of the varied ground upon which New Materialism stands (it is often conflated with object-oriented ontology, post-humanism, and other general turns within nonhumanism), Bennett's work will be looked at idiosyncratically, then pushed into the realm of the cinematic via an analysis of the documentary Leviathan. Directed by Lucien Castaing-Taylor and Véréna Paravel, this film was among the first exemplary works to emerge from the Sensory Ethnography Lab, based at Harvard University. In striving for a revitalization of ethnographic film practices, the Lab attends to the same non-anthropocentric and non-discursive aspects of experience as the New Materialism of Jane Bennett. By placing these two contemporary camps in conversation, this article intends to reposition them both: New Materialism as a vehicle for the Sensory Ethnographic, and the SEL as an exhibition of the kind of world Bennett's philosophy envisages. The article concludes with an assessment of the political and eco-political critiques and ramifications surrounding these works.
Xunyu Lin, Victor Campos, Xavier Giro-i-Nieto, et al.
This paper introduces an unsupervised framework to extract semantically rich features for video representation. Inspired by how the human visual system groups objects based on motion cues, we propose a deep convolutional neural network that disentangles motion, foreground, and background information. The proposed architecture consists of a 3D convolutional feature encoder for blocks of 16 frames, which is trained on reconstruction tasks over the first and last frames of the sequence. A preliminary supervised experiment was conducted to verify the feasibility of the proposed method by training the model on a fraction of videos from the UCF-101 dataset, taking as ground truth the bounding boxes around the activity regions. Qualitative results indicate that the network can successfully segment foreground and background in videos as well as update the foreground appearance based on disentangled motion features. The benefits of these learned features are shown in a discriminative classification task, where initializing the network with the proposed pretraining method outperforms both random initialization and autoencoder pretraining. Our model and source code are publicly available at https://imatge-upc.github.io/unsupervised-2017-cvprw/.
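A toy version of the described architecture pattern (a 3D convolutional encoder over a 16-frame block with a 2D decoder reconstructing a single target frame) is sketched below; channel counts, strides, and layer depths are placeholders, not the paper's configuration.

```python
# Toy 3D-conv encoder over a 16-frame block, with a 2D decoder that
# reconstructs one target frame (e.g., the first or last of the block).
import torch
import torch.nn as nn

class BlockAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                 # input: (B, 3, 16, H, W)
            nn.Conv3d(3, 32, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(64, 64, kernel_size=3, stride=(4, 1, 1), padding=1), nn.ReLU(),
        )                                             # -> (B, 64, 1, H/4, W/4)
        self.decoder = nn.Sequential(                 # reconstruct one frame
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, clip):                          # clip: (B, 3, 16, H, W)
        z = self.encoder(clip).squeeze(2)             # drop the collapsed time axis
        return self.decoder(z)                        # (B, 3, H, W)

clip = torch.rand(2, 3, 16, 64, 64)
print(BlockAutoencoder()(clip).shape)                 # torch.Size([2, 3, 64, 64])
```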