Text-to-motion generation, which converts natural-language motion descriptions into coherent 3D human motion sequences, has attracted increasing attention in fields such as avatar animation and humanoid robot interaction. Though existing models have achieved impressive fidelity, they still suffer from two core limitations: (i) they treat motion periodicity and keyframe saliency as independent factors, overlooking their coupling and causing generation drift in long sequences; (ii) they are fragile to semantically equivalent paraphrases, where minor synonym substitutions distort textual embeddings, propagate through the decoder, and produce unstable or erroneous motions. In this work, we propose T2M Mamba to address these limitations by (i) introducing Periodicity-Saliency Aware Mamba, which uses novel algorithms for keyframe weight estimation via enhanced Density Peaks Clustering and motion periodicity estimation via FFT-accelerated autocorrelation to capture the coupled dynamics with minimal computational overhead, and (ii) constructing a Periodic Differential Cross-modal Alignment Module (PDCAM) for robust alignment of textual and motion embeddings. Extensive experiments on the HumanML3D and KIT-ML datasets confirm the effectiveness of our approach, achieving an FID of 0.068 and consistent gains on all other metrics.
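The abstract does not spell out the periodicity estimator, but FFT-accelerated autocorrelation is a standard construction (Wiener-Khinchin): compute the power spectrum, invert it to obtain the autocorrelation in O(n log n), and read off the dominant lag. A minimal NumPy sketch of this generic building block, where the zero-crossing heuristic and all names are our assumptions rather than the paper's code:

```python
import numpy as np

def estimate_period(signal):
    """Dominant period of a 1-D motion signal via FFT-accelerated
    autocorrelation (Wiener-Khinchin); a generic sketch, not the
    paper's implementation."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    n = len(x)
    nfft = 1 << (2 * n - 1).bit_length()   # zero-pad to avoid circular wrap
    spec = np.fft.rfft(x, nfft)
    acf = np.fft.irfft(spec * np.conj(spec), nfft)[:n]
    acf /= acf[0]                          # normalize so acf[0] == 1
    # Skip the trivial lag-0 peak: search after the first zero crossing.
    below = np.nonzero(acf < 0)[0]
    start = below[0] if len(below) else 1
    return start + int(np.argmax(acf[start:]))
```

For example, a sampled sine of period 25 frames yields an estimate of 25; per-joint estimates could then be aggregated across a motion sequence.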
In this paper, we address the challenge of generating temporally consistent videos with motion guidance. While many existing methods depend on additional control modules or inference-time fine-tuning, recent studies suggest that effective motion guidance is achievable without altering the model architecture or requiring extra training. Such approaches offer promising compatibility with various video generation foundation models. However, existing training-free methods often struggle to maintain consistent temporal coherence across frames or to follow guided motion accurately. In this work, we propose a simple yet effective solution that combines an initial-noise-based approach with a novel motion consistency loss, the latter being our key innovation. Specifically, we capture the inter-frame feature correlation patterns of intermediate features from a video diffusion model to represent the motion pattern of the reference video. We then design a motion consistency loss to maintain similar feature correlation patterns in the generated video, using the gradient of this loss in the latent space to guide the generation process for precise motion control. This approach improves temporal consistency across various motion control tasks while preserving the benefits of a training-free setup. Extensive experiments show that our method sets a new standard for efficient, temporally coherent video generation.
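The abstract describes the loss only at a high level; below is a plausible minimal PyTorch sketch of matching inter-frame feature correlations, where the cosine-similarity Gram matrix and MSE matching are our assumptions about how such a loss could be instantiated:

```python
import torch
import torch.nn.functional as F

def correlation_matrix(feats):
    """Inter-frame correlation pattern: cosine similarity between the
    flattened per-frame features of a video diffusion model.
    feats: (T, C, H, W) intermediate features for T frames."""
    flat = F.normalize(feats.flatten(1), dim=1)   # (T, C*H*W)
    return flat @ flat.T                          # (T, T) Gram matrix

def motion_consistency_loss(gen_feats, ref_feats):
    """Penalize deviation of the generated video's correlation pattern
    from the reference video's; its latent-space gradient steers sampling."""
    return F.mse_loss(correlation_matrix(gen_feats),
                      correlation_matrix(ref_feats).detach())
```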
Recent advances in video diffusion models have enabled the generation of high-quality videos. However, these videos still suffer from unrealistic deformations, semantic violations, and physical inconsistencies, largely rooted in the absence of 3D physical priors. To address these challenges, we propose an object-aware 4D human motion generation framework grounded in 3D Gaussian representations and motion diffusion priors. Given pre-generated 3D humans and objects, our method, Motion Score Distilled Interaction (MSDI), exploits the spatial and prompt-level semantic information in large language models (LLMs) together with motion priors through the proposed Motion Diffusion Score Distillation Sampling (MSDS). Combining MSDS with LLMs enables spatially aware motion optimization that distills score gradients from pre-trained motion diffusion models to refine human motion while respecting object and semantic constraints. Unlike prior methods that require joint training on limited interaction datasets, our zero-shot approach avoids retraining and generalizes to out-of-distribution object-aware human motions. Experiments demonstrate that our framework produces natural and physically plausible human motions that respect 3D spatial context, offering a scalable solution for realistic 4D generation.
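MSDS is described only at the level of distilling score gradients from a pre-trained motion diffusion model; the sketch below shows plain score distillation sampling applied to a motion tensor, under assumed model and schedule interfaces, and omits the LLM-derived spatial constraints the paper adds on top:

```python
import torch

def motion_sds_grad(motion, denoiser, text_emb, t, alphas_bar):
    """One generic SDS gradient for a motion sequence (hedged sketch).
    denoiser(noisy, t, text_emb) -> predicted noise; alphas_bar: schedule."""
    noise = torch.randn_like(motion)
    a, s = alphas_bar[t].sqrt(), (1 - alphas_bar[t]).sqrt()
    noisy = a * motion + s * noise                # forward diffusion at step t
    with torch.no_grad():
        eps_pred = denoiser(noisy, t, text_emb)   # frozen motion prior
    w = 1 - alphas_bar[t]                         # common SDS weighting
    return w * (eps_pred - noise)                 # gradient w.r.t. the motion
```

In an optimization loop, this quantity would be assigned as the motion tensor's gradient before a standard optimizer step.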
Aykut İşleyen, Abhidnya Kadu, René van de Molengraft
et al.
Safe, smooth, and optimal motion planning for nonholonomically constrained mobile robots and autonomous vehicles is essential for reliable, seamless, and efficient autonomy in the logistics, mobility, and service industries. In many such settings, nonholonomic robots such as unicycles with restricted motion require precise planning and control of both translational and orientational motion to approach specific locations in a designated orientation, for example when approaching charging, parking, and loading areas. In this paper, we introduce a new dual-headway unicycle pose control method that leverages an adaptively placed headway point in front of the unicycle pose and a tailway point behind the goal pose. In summary, the unicycle robot continuously follows its headway point, which chases the tailway point of the goal pose, and the asymptotic motion of the tailway point toward the goal position guides the unicycle robot to approach the goal location in the correct orientation. The simple and intuitive geometric construction of dual-headway unicycle pose control enables an explicit convex feedback motion prediction bound on the closed-loop unicycle trajectory for fast and accurate safety verification. We present an application of dual-headway unicycle control to optimal sampling-based motion planning around obstacles. In numerical simulations, we show that optimal unicycle motion planning using dual-headway translation and orientation distances significantly outperforms Euclidean translation and cosine orientation distances in generating smooth motion with minimal travel and turning effort.
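The dual-headway construction itself is the paper's contribution; for orientation, here is a minimal sketch of the standard single headway-point (offset-point) unicycle controller it builds on, in which the offset point can track any desired velocity because its input map is invertible. Names and gains are illustrative:

```python
import numpy as np

def headway_point_control(pose, target, d=0.3, k=1.0):
    """Drive a unicycle's headway point toward a target point.
    pose = (x, y, theta); d > 0 is the headway distance."""
    x, y, th = pose
    head = np.array([x + d * np.cos(th), y + d * np.sin(th)])
    v_des = k * (np.asarray(target) - head)        # desired headway velocity
    # Invertible map from (v, omega) to the headway-point velocity.
    J = np.array([[np.cos(th), -d * np.sin(th)],
                  [np.sin(th),  d * np.cos(th)]])
    v, omega = np.linalg.solve(J, v_des)
    return v, omega
```

In the dual-headway scheme, the target above would itself be a tailway point moving asymptotically toward the goal position, which is what aligns the final orientation.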
Claire Delplancke, Kris Thielemans, Matthias J. Ehrhardt
Motion correction aims to prevent motion artefacts caused, for example, by respiration, heartbeat, or head movements. In a preliminary step, the measured data is divided into gates corresponding to motion states, and displacement maps from a reference state to each motion state are estimated. A common technique for motion correction is the motion-compensated image reconstruction framework, where the displacement maps are integrated into the forward model corresponding to the gated data. For standard algorithms, the computational cost per iteration increases linearly with the number of gates. To accelerate the reconstruction, we propose a randomized and convergent algorithm whose per-iteration computational cost is constant in the number of gates. We show improved theoretical rates of convergence and observe the predicted speed-up on two synthetic datasets corresponding to rigid and non-rigid motion.
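The abstract identifies the key idea, namely a per-iteration cost that does not grow with the number of gates; the sketch below illustrates it with a plain stochastic-gradient step that samples one gate per iteration. The paper's actual convergent randomized algorithm is not specified here, and all interfaces are assumptions:

```python
import numpy as np

def one_gate_step(x, data, forward, adjoint, step, rng):
    """One randomized reconstruction step touching a single motion gate.
    forward[g] applies warp-to-gate-g followed by the acquisition model;
    adjoint[g] is its adjoint; data[g] is the gated measurement."""
    n_gates = len(data)
    g = rng.integers(n_gates)                    # sample one gate uniformly
    grad = adjoint[g](forward[g](x) - data[g])   # single-gate gradient
    return x - step * n_gates * grad             # rescaled: unbiased estimate
```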
Videos are more informative than images because they capture the dynamics of a scene. By representing motion in videos, we can capture dynamic activities. In this work, we introduce GPT-4-generated motion descriptions that capture fine-grained motion in activities and apply them to three action datasets. We evaluated several video-text models on the task of motion-description retrieval and found that they fall far behind human expert performance on two action datasets, raising the question of whether video-text models understand motion in videos. To address this, we introduce a method for improving motion understanding in video-text models by utilizing motion descriptions. The method proves effective on two action datasets for the motion-description retrieval task. The results draw attention to the need for quality captions with fine-grained motion information in existing datasets and demonstrate the effectiveness of the proposed pipeline for understanding fine-grained motion during video-text retrieval.
While experiencing natural beauty is a key appeal of the cinema and other moving-image media, academic film scholarship has rarely paid attention to it. In this article I will use the widespread motif of the gently rustling wind as a pars pro toto to make some general remarks about the experience of natural beauty in film. I will first note the firm place of the motif of the rustling wind in film-theoretical debates from the late 19th century until today. Then, I will propose three modes in which viewers can experience this beautiful motif in film. In the following section I will discuss how film's mediation modifies the experience of natural beauty. And in a final step I will explore the reciprocal relation between nature and film, and how one can enhance the appreciation of the other. Ultimately, I pursue two mutually imbricated goals. On the one hand, I aim to (re)connect the film-theoretical discourse about the wind in the trees to natural beauty. On the other hand, I use the motif of the gently rustling wind to say something more general about the aesthetic experience of natural beauty in film.
Oscar Dabrowski, Jean-Luc Falcone, Antoine Klauser
et al.
MRI, a widespread non-invasive medical imaging modality, is highly sensitive to patient motion. Despite many attempts over the years, motion correction remains a difficult problem, and there is no general method applicable to all situations. We propose a retrospective method for motion estimation and correction that tackles in-plane rigid-body motion, suited to classical 2D Spin-Echo scans of the brain, which are regularly used in clinical practice. Due to the sequential acquisition of k-space, motion artifacts are well localized. The method leverages the power of deep neural networks to estimate motion parameters in k-space and uses a model-based approach to restore degraded images while avoiding "hallucinations". A notable advantage is its ability to estimate motion occurring at high spatial frequencies without the need for a motion-free reference. The proposed method operates over the whole k-space dynamic range and is only moderately affected by the lower SNR of higher harmonics. As a proof of concept, we provide models trained with supervised learning on 600k motion simulations based on motion-free scans of 43 different subjects. Generalization performance was tested on simulations as well as in vivo. Qualitative and quantitative evaluations are presented for motion parameter estimation and image reconstruction. Experimental results show that our approach achieves good generalization performance on simulated data and in-vivo acquisitions. We provide a Python implementation at https://gitlab.unige.ch/Oscar.Dabrowski/sismik_mri/.
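As background for the model-based restoration step, recall that an in-plane translation at acquisition time multiplies the affected k-space lines by a linear phase ramp, which can be undone once per-line motion parameters are estimated. A minimal sketch of this translation-only correction under assumed FFT conventions (rotations would additionally require regridding):

```python
import numpy as np

def undo_translations(kspace, shifts):
    """Remove estimated in-plane translations line-by-line in 2D k-space.
    kspace: (n_pe, n_ro) Cartesian data, one motion estimate per PE line;
    shifts: (n_pe, 2) estimated (dx, dy) object displacements in pixels."""
    n_pe, n_ro = kspace.shape
    kx = np.fft.fftfreq(n_ro)                 # readout frequencies (cyc/px)
    ky = np.fft.fftfreq(n_pe)                 # phase-encode frequencies
    out = np.empty_like(kspace)
    for i in range(n_pe):
        dx, dy = shifts[i]
        # A shift by (dx, dy) imposes exp(-2j*pi*(kx*dx + ky*dy));
        # multiply by the conjugate ramp to undo it.
        out[i] = kspace[i] * np.exp(2j * np.pi * (kx * dx + ky[i] * dy))
    return out
```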
Nicola Messina, Jan Sedmidubsky, Fabrizio Falchi
et al.
Due to recent advances in pose-estimation methods, human motion can be extracted from ordinary video in the form of 3D skeleton sequences. Despite promising application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by recent progress in text-to-image/video matching, we experiment with two widely adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining quantitative metrics for assessing the quality of the retrieved motions, targeting the two recently introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available at https://github.com/mesnico/text-to-motion-retrieval.
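Divided space-time attention, as used by MoT, factorizes attention into a temporal pass per joint followed by a spatial pass per frame; here is a minimal PyTorch sketch of one such block, with layer sizes and normalization placement as our assumptions:

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Divided space-time attention over skeleton joints (hedged sketch):
    temporal attention across frames per joint, then spatial attention
    across joints per frame, as popularized by TimeSformer."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.space_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, frames, joints, dim)
        b, t, j, d = x.shape
        xt = x.permute(0, 2, 1, 3).reshape(b * j, t, d)   # attend over time
        xt = self.norm1(xt + self.time_attn(xt, xt, xt)[0])
        xs = xt.reshape(b, j, t, d).permute(0, 2, 1, 3).reshape(b * t, j, d)
        xs = self.norm2(xs + self.space_attn(xs, xs, xs)[0])  # over joints
        return xs.reshape(b, t, j, d)
```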
This work provides a deep-learning solution for the motion interpolation task. Previous studies solve it with geometric weight functions; other works propose neural networks for different problem settings that take consecutive pose sequences as input. Motion interpolation, however, is a more complex problem that takes isolated poses (e.g., only one start pose and one end pose) as input. When applied to motion interpolation, these deep learning methods perform poorly because, unlike the original geometric formulas, they do not leverage the flexible dependencies between interpolation frames. To realize this interpolation characteristic, we propose a novel framework, referred to as \emph{Shuffled AutoRegression}, which expands autoregression to generate in an arbitrary (shuffled) order and models arbitrary inter-frame dependencies as a directed acyclic graph. We further propose an approach to constructing a particular kind of dependency graph, with three stages assembled into an end-to-end spatial-temporal motion Transformer. Experimental results on one of the current largest datasets show that our model generates vivid and coherent motions from only one start frame and one end frame and outperforms competing methods by a large margin. The proposed model also extends to multi-keyframe motion interpolation and to interpolation tasks in other domains.
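The abstract leaves the dependency graph to the paper's three-stage construction; as an illustration of generation in a shuffled order with DAG dependencies, here is one simple (assumed, not the paper's) ordering in which each new frame depends on its two nearest already-generated neighbors:

```python
def bisection_order(start, end):
    """One way to build a shuffled generation order: recursively emit
    midpoints so each new frame depends on its two nearest generated
    neighbors (a simple DAG of inter-frame dependencies)."""
    order = []
    def split(lo, hi):
        if hi - lo < 2:
            return
        mid = (lo + hi) // 2
        order.append((mid, (lo, hi)))   # frame mid depends on frames lo, hi
        split(lo, mid)
        split(mid, hi)
    split(start, end)
    return order

# e.g. bisection_order(0, 8) -> [(4, (0, 8)), (2, (0, 4)), (1, (0, 2)), ...]
```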
We propose a novel framework, On-Demand MOtion Generation (ODMO), for generating realistic and diverse long-term 3D human motion sequences conditioned only on action types, with an additional capability for customization. ODMO improves on SOTA approaches on all traditional motion evaluation metrics across three public datasets (HumanAct12, UESTC, and MoCap). Furthermore, we provide both qualitative evaluations and quantitative metrics demonstrating several first-known customization capabilities afforded by our framework, including mode discovery, interpolation, and trajectory customization. These capabilities significantly widen the spectrum of potential applications of such motion generation models. The novel on-demand generative capabilities are enabled by innovations in both the encoder and decoder architectures: (i) Encoder: contrastive learning in a low-dimensional latent space creates a hierarchical embedding of motion sequences, where not only do the codes of different action types form distinct groups, but within an action type, codes of similar inherent patterns (motion styles) cluster together, making them readily discoverable; (ii) Decoder: a hierarchical decoding strategy reconstructs the motion trajectory first and then uses it to reconstruct the whole motion sequence, enabling effective trajectory control (see the sketch below). Our code is released on the GitHub page: https://github.com/roychowdhuryresearch/ODMO
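A minimal sketch of the hierarchical trajectory-then-pose decoding idea referenced above; the GRU backbone, dimensions, and names are our assumptions rather than ODMO's actual architecture:

```python
import torch
import torch.nn as nn

class HierarchicalMotionDecoder(nn.Module):
    """Two-stage decoding (hedged sketch): first reconstruct the root
    trajectory from the latent code, then reconstruct the full pose
    sequence conditioned on that trajectory, enabling trajectory control."""
    def __init__(self, latent_dim, pose_dim, hidden=256):
        super().__init__()
        self.traj_rnn = nn.GRU(latent_dim, hidden, batch_first=True)
        self.traj_head = nn.Linear(hidden, 3)             # root xyz per frame
        self.pose_rnn = nn.GRU(latent_dim + 3, hidden, batch_first=True)
        self.pose_head = nn.Linear(hidden, pose_dim)

    def forward(self, z, n_frames):
        zs = z.unsqueeze(1).expand(-1, n_frames, -1)      # broadcast latent
        traj = self.traj_head(self.traj_rnn(zs)[0])       # stage 1: trajectory
        pose_in = torch.cat([zs, traj], dim=-1)
        poses = self.pose_head(self.pose_rnn(pose_in)[0]) # stage 2: full pose
        return traj, poses
```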
In this article we seek to show how, in the series "The L Word", an important audiovisual production for human diversity, there is no standard lesbian, and no standard of gender, whether in works of fiction or in reality; that sexuality is much broader than mere framing within the masculine and feminine genders; and that on the scale between being a man and being a woman there are people with pronounced characteristics on each side, with more or less pronounced behaviors from both sides, without a defined pattern, because human beings can move across borders between near and distant territories, between the masculine and the feminine, without a stipulated dosage of one or the other, forming multiple genders out of these two.
Person Recognition based on Gait Model (PRGM) and motion features is a challenging and novel task, owing both to its applications and to critical issues such as human pose variation, human body occlusion, and camera view variation. In this project, a deep convolutional neural network (CNN) was modified and adapted for person recognition based on gait features, combined with an Image Augmentation (IA) technique. The adaptation aims to find the best values of the CNN parameters and hence the best CNN model. In addition to the parameter adaptation, the design of the CNN model itself was adapted to obtain the best structure; this design adaptation affected the types and number of layers in the CNN and the normalization between them. After choosing the best parameters and design, image augmentation was used to enlarge the training dataset with many copies of each image, boosting the number of distinct images available for training the deep learning algorithm. Tests were carried out on the well-known Market dataset, which contains sequential pictures of people in different gait states. Since the convolution expands each input image into many feature maps, the effective dataset size can grow by a factor of hundreds, turning the problem into a big-data issue. Results show that the adaptation improved the accuracy of person recognition using the gait model compared to the model without adaptation. The dataset also contains images of people carrying objects, and the IA technique made the model more robust to variations such as image dimensions (quality and resolution), rotations, and objects carried by persons. For recognition of 200 persons, validation accuracy was about 82% without IA and 96.23% with IA; for 800 persons, validation accuracy was 93.62% without IA.
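As an illustration of the IA technique's role, a minimal augmentation pipeline of the kind described, targeting the robustness factors the abstract lists (resolution, rotation, carried objects); the specific ops and parameters are our assumptions:

```python
from torchvision import transforms

# Assumed augmentation pipeline for person/gait crops (not the paper's).
gait_augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),                 # small rotations
    transforms.RandomResizedCrop((128, 64), scale=(0.8, 1.0)),  # resolution
    transforms.ColorJitter(brightness=0.2, contrast=0.2),  # image quality
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.3),   # crude stand-in for carried objects
])
```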
Computer engineering. Computer hardware, Information technology
We present a novel approach to generating collision-free trajectories for a robot operating in close proximity to a human obstacle in an occluded environment. Because the robot's self-occlusions can significantly reduce the accuracy of human motion prediction, we present a deep learning-based prediction algorithm. Our formulation uses CNNs and LSTMs, and we augment human-action datasets with synthetically generated occlusion information for training. We also present an occlusion-aware planner that uses our motion prediction algorithm to compute collision-free trajectories. We highlight the performance of the overall approach (HMPO) in complex scenarios and observe up to 68% improvement in motion prediction accuracy and a 38% improvement in the error distance between the ground-truth and predicted human joint positions.
In this article, I argue that by considering Ludwig Wittgenstein's methods, we can better understand and appreciate Jonas Mekas's diary films. Based on Wittgenstein's notion of “aesthetic puzzlement”, I identify the main confusions encountered by the viewer upon watching Mekas's films, such as: 1) fragmentation; 2) persistent repetition; and 3) the importance placed on the everyday. I discuss three films – Walden (1969), Lost Lost Lost (1976), and As I Was Moving Ahead Occasionally I Saw Brief Glimpses of Beauty (2000) – and demonstrate that the aesthetic puzzlements within them may be dissolved by looking at the format of Wittgenstein's Philosophical Investigations (1953). Mekas's lifelong interest in filming the most mundane and domestic scenes can be understood as a puzzlement in itself: why not just admire the ordinary whilst living in it? Wittgenstein's thought experiment in Culture and Value helps us understand the aesthetic puzzlement of Mekas's interest in filming, remembering and presenting an extensive array of everyday activities, and also explains why the viewer can find the most mundane and domestic activities in his films remarkable. Additionally, I discuss how Mekas's diary films may be regarded as coming close to Wittgenstein's aesthetic ideal of art as being able to represent “life itself”. I aim to show how Mekas's cinematic practice places extreme importance on ordinary acts and offers a mode of thinking which echoes Wittgenstein's own views on philosophy. I conclude with a discussion of “nomadism”, a notion that elucidates the peculiar form of the works of both Wittgenstein and Mekas.
We solve optimal stopping problems for an oscillating Brownian motion, i.e. a diffusion with positive piecewise constant volatility changing at the point $x=0$. Let $\sigma_1$ and $\sigma_2$ denote the volatilities on the negative and positive half-lines, respectively. Our main result is that the continuation region of the optimal stopping problem with reward $((1+x)^+)^2$ is disconnected if and only if $\sigma_1^2 < \sigma_2^2 < 2\sigma_1^2$. Based on the fact that the skew Brownian motion in natural scale is an oscillating Brownian motion, the obtained results are translated into corresponding results for the skew Brownian motion.
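For readers who want to explore the threshold $\sigma_1^2 < \sigma_2^2 < 2\sigma_1^2$ numerically, here is a minimal Euler-Maruyama simulator of an oscillating Brownian motion (our own illustration, not from the paper):

```python
import numpy as np

def simulate_obm(x0, sigma1, sigma2, T=1.0, n=10_000, rng=None):
    """Euler-Maruyama path of an oscillating Brownian motion:
    dX_t = sigma(X_t) dW_t, sigma = sigma1 for x < 0 and sigma2 for x >= 0."""
    rng = rng or np.random.default_rng()
    dt = T / n
    x = np.empty(n + 1)
    x[0] = x0
    dw = rng.standard_normal(n) * np.sqrt(dt)
    for i in range(n):
        sigma = sigma1 if x[i] < 0 else sigma2
        x[i + 1] = x[i] + sigma * dw[i]
    return x
```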
Integrating human factors into the motion planning system is of great significance for improving the acceptance of intelligent vehicles. Decomposing motion into primitives and then accurately and smoothly joining the motion primitives (MPs) is an essential issue in the motion planning system. The purpose of this paper is therefore to regenerate and join the learned MPs in the library. Applying a representation algorithm based on modified dynamic movement primitives (DMPs) and singular value decomposition (SVD), our method separates the basic shape parameters and the fine-tuning shape parameters from the same type of demonstration trajectories in the MP library. Moreover, we convert the MP joining problem into a re-representation problem and use the characteristics of the proposed representation algorithm to achieve an accurate and smooth transition. We demonstrate that the proposed method effectively reduces the number of shape-adjustment parameters when MPs are regenerated, without affecting the accuracy of the representation. We also show that the method smooths the velocity jump when MPs are connected and evaluate its effect on the accuracy of tracking the set target points. The results show that the proposed method not only improves the adjustment ability of a single MP in response to different motion planning requirements but also meets the basic requirements of MP joining in the generation of MP sequences.
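As background, a minimal rollout of a standard (unmodified) discrete DMP, the representation that the paper's modified DMPs build on; the gains and basis choices are conventional defaults, not the paper's:

```python
import numpy as np

def dmp_rollout(y0, g, weights, centers, widths, tau=1.0,
                alpha=25.0, beta=6.25, alpha_x=8.0, dt=0.01, T=1.0):
    """Roll out a standard discrete DMP (hedged sketch): a spring-damper
    pulled toward goal g, shaped by a learned RBF forcing term.
    weights/centers/widths: arrays of length K for the basis functions."""
    y, z, x = float(y0), 0.0, 1.0
    traj = [y]
    for _ in range(int(T / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)          # RBF activations
        f = x * (g - y0) * (psi @ weights) / (psi.sum() + 1e-10)
        z += dt / tau * (alpha * (beta * (g - y) - z) + f)  # transformation
        y += dt / tau * z
        x += dt / tau * (-alpha_x * x)                      # canonical system
        traj.append(y)
    return np.array(traj)
```

In the paper's setting, SVD over many demonstration trajectories would separate shared basic-shape parameters from per-trajectory fine-tuning parameters.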
A new procedure was recently proposed for constructing massless Type IIB vertex operators in the pure spinor formalism. Instead of expressing these closed string vertex operators as left-right products of open string vertex operators, it constructs them from the complex N=2 d=10 superfield whose lowest real and imaginary components are the dilaton and the Ramond-Ramond axion. These Type IIB vertex operators take a simple form in the -8 picture and are related to the usual vertex operators in the zero picture by acting with picture-raising operators. In this paper, we carry out this picture-raising procedure explicitly and confirm the proposal in a flat background. Work is in progress on confirming the proposal in an $AdS_5\times S^5$ background.
We present a method for measuring the Sun's motion using the proper motions of Galactic halo star streams. The method relies on the fact that the motion of the stars perpendicular to a stream from a low-mass progenitor is close to zero when viewed from a non-rotating frame at rest with respect to the Galaxy, and that the deviation from zero is due to the reflex motion of the observer. The procedure we implement here has the advantage of being independent of the Galactic mass distribution. We run a suite of simulations to test the algorithm we have developed, and find that we can recover the input Solar motion to good accuracy with data of the quality that will soon become available from the ESA/Gaia mission.
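The measurement principle reduces to linear least squares: stream members move along the stream, so any observed velocity component perpendicular to it is solar reflex. A minimal sketch under assumed geometry and unit conventions:

```python
import numpy as np

def fit_solar_motion(e_perp, v_perp_obs):
    """Least-squares solar motion from stream-perpendicular velocities
    (hedged sketch). For a thin stream from a low-mass progenitor, member
    stars move along the stream, so in the Galactic rest frame the velocity
    perpendicular to it vanishes; the observed perpendicular component is
    then pure reflex:  v_perp = -(e_perp . v_sun).
    e_perp: (n, 3) unit vectors perpendicular to the stream on the sky;
    v_perp_obs: (n,) tangential velocities along e_perp (e.g. 4.74*mu*d)."""
    v_sun, *_ = np.linalg.lstsq(-np.asarray(e_perp), v_perp_obs, rcond=None)
    return v_sun
```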