Tencent Unveils HY-Motion 1.0: Revolutionizing 3D Motion Generation with AI

Introducing HY-Motion 1.0

Tencent has recently launched HY-Motion 1.0, a groundbreaking model for 3D animation. Built on a Diffusion Transformer architecture with 1 billion parameters, it transforms text prompts into lifelike 3D human motions. Developers can access it on GitHub and Hugging Face, where it's available along with code, checkpoints, and a user-friendly Gradio interface.
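
For a concrete starting point, here is a minimal sketch of fetching the released files with the `huggingface_hub` client. The repository ID below is a placeholder, not confirmed from the release; check Tencent's Hugging Face page for the actual name.

```python
# Minimal sketch: download the released checkpoint with huggingface_hub.
# NOTE: the repo_id is a placeholder -- substitute the official repository
# name listed on Tencent's Hugging Face page.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="tencent/HY-Motion-1.0")  # placeholder id
print(f"Files downloaded to: {local_dir}")
```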

What Does HY-Motion 1.0 Offer?

This latest model series comprises two main variants: the standard HY-Motion-1.0, equipped with 1 billion parameters, and the more lightweight HY-Motion-1.0-Lite, with 460 million parameters. Both models generate animations from simple text inputs, producing output that can be integrated into various 3D animation pipelines. Whether you're into gaming or creating digital humans for cinematic projects, this technology offers versatile applications.

Data Sources and Processing Techniques

The training dataset for HY-Motion 1.0 is quite extensive, drawing on three primary sources: everyday human motion videos, motion capture data, and 3D animations used in game development. Initially, the team curated over 12 million high-quality video clips from the HunyuanVideo database. They employed shot boundary detection to identify distinct scenes and utilized human detection algorithms to extract clips featuring people. Following that, they implemented the GVHMR algorithm to reconstruct motion tracks compatible with the SMPL-X framework.
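
The curation flow reads naturally as a pipeline. The outline below is purely illustrative: the three helpers are hypothetical stand-ins for the shot-boundary, human-detection, and GVHMR components, stubbed here so the sketch runs end to end.

```python
# Illustrative outline of the video curation pipeline described above.
# detect_shots, contains_human, and run_gvhmr are hypothetical stand-ins
# for the actual shot-boundary, human-detection, and GVHMR tooling.
def detect_shots(clip):
    """Stand-in for shot boundary detection: yield per-scene segments."""
    yield clip

def contains_human(shot):
    """Stand-in for the human detection filter."""
    return True

def run_gvhmr(shot):
    """Stand-in for GVHMR: reconstruct an SMPL-X-compatible motion track."""
    return {"source": shot, "format": "SMPL-X"}

def curate(video_clips):
    """Keep human-centric shots and reconstruct a motion track for each."""
    tracks = []
    for clip in video_clips:
        for shot in detect_shots(clip):
            if contains_human(shot):
                tracks.append(run_gvhmr(shot))
    return tracks

print(curate(["clip_000.mp4"]))
```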

In addition to this, around 500 hours of motion sequences were added from motion capture sessions and 3D animation libraries. The combined collection then underwent a meticulous filtering process, eliminating duplicates, outliers, and artifacts that could compromise animation quality. After this rigorous selection, researchers ended up with a dataset containing over 3,000 hours of motion data, including 400 hours of high-quality 3D motions with verified captions.

Taxonomy of Motion

A thorough taxonomy categorizes the motion data into six main classes: Locomotion, Sports and Athletics, Fitness and Outdoor Activities, Daily Activities, Social Interactions, and Leisure and Game Character Actions. Each of these categories branches into over 200 detailed motion types, covering both simple and complex movements.
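
One natural way to encode a two-level taxonomy like this is a mapping from top-level class to motion types. The six class names below come from the article; the leaf entries are invented examples, not the release's actual labels.

```python
# Sketch of the two-level motion taxonomy: six top-level classes, each
# branching into finer motion types (over 200 in the real dataset).
# The leaf labels here are invented examples for illustration only.
MOTION_TAXONOMY = {
    "Locomotion": ["walk", "run", "crawl"],
    "Sports and Athletics": ["basketball shot", "high jump"],
    "Fitness and Outdoor Activities": ["push-up", "hiking"],
    "Daily Activities": ["cooking", "typing"],
    "Social Interactions": ["handshake", "wave"],
    "Leisure and Game Character Actions": ["dance", "sword swing"],
}

total_types = sum(len(v) for v in MOTION_TAXONOMY.values())
print(f"{len(MOTION_TAXONOMY)} classes, {total_types} example types")
```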

Understanding Motion Representation in HY-Motion 1.0

The model employs the SMPL-H skeleton, which consists of 22 body joints (excluding hands). Each frame is represented as a vector of 201 dimensions, encapsulating global translations, orientations, and local joint rotations. Notably, the representation omits velocities and foot contact labels, which enhances training performance and output quality. This representation aligns closely with industry standards for animation workflows, making it easier for developers to adapt it for their needs.
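
To make the per-frame layout concrete, here is a small sketch that splits one frame vector. The exact breakdown of the 201 dimensions is an assumption, not confirmed by the release: 3 global-translation dims plus a 9-dim rotation matrix per joint, with the root joint's rotation carrying the global orientation, works out to 3 + 22 × 9 = 201.

```python
import numpy as np

FRAME_DIM = 201   # per-frame feature size stated in the article
NUM_JOINTS = 22   # SMPL-H body joints (hands excluded)

# Assumed breakdown (not confirmed by the release): 3 translation dims
# plus a flattened 3x3 rotation matrix for each of the 22 joints.
assert 3 + NUM_JOINTS * 9 == FRAME_DIM

def split_frame(frame: np.ndarray):
    """Split one 201-dim frame vector into translation and joint rotations."""
    translation = frame[:3]                          # global root translation
    rotations = frame[3:].reshape(NUM_JOINTS, 3, 3)  # per-joint rotation matrices
    return translation, rotations

translation, rotations = split_frame(np.zeros(FRAME_DIM))
print(translation.shape, rotations.shape)  # (3,) (22, 3, 3)
```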

Innovative Architecture of HY-Motion DiT

At the heart of HY-Motion 1.0 is the hybrid HY-Motion DiT network. This architecture first processes motion latents and text tokens through dual-stream blocks, allowing for modality-specific attention and structure. Following this, it shifts to single-stream blocks, merging the two modalities into a cohesive sequence for deeper multimodal fusion.

For text conditioning, the model employs a dual encoder scheme that enhances the quality of instruction following. The Qwen3 model provides token-level embeddings, while a CLIP-L model supplies global text features. This setup ensures that motion tokens can effectively engage with text tokens while minimizing noise interference.
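
To make the dual-stream-to-single-stream flow concrete, here is a simplified PyTorch sketch. Block counts, widths, and attention details are assumptions for illustration; the global CLIP-L feature is noted in a comment but omitted for brevity, and none of this reflects the actual HY-Motion DiT implementation.

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Modality-specific processing: motion and text keep separate weights
    but attend over the concatenated joint sequence."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.motion_norm = nn.LayerNorm(dim)
        self.text_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.motion_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.text_mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, motion, text):
        # Joint attention over both modalities, separate residual MLPs.
        joint = torch.cat([self.motion_norm(motion), self.text_norm(text)], dim=1)
        attended, _ = self.attn(joint, joint, joint)
        m, t = attended.split([motion.shape[1], text.shape[1]], dim=1)
        return motion + m + self.motion_mlp(motion), text + t + self.text_mlp(text)

class SingleStreamBlock(nn.Module):
    """Deeper fusion: both modalities share one transformer block."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)

    def forward(self, seq):
        return self.block(seq)

# Token-level text embeddings (Qwen3-style) drive the text stream; a global
# CLIP-L-style feature could additionally modulate the blocks (omitted here).
dim = 256
motion_latents = torch.randn(1, 120, dim)   # 120 motion latent tokens
text_tokens = torch.randn(1, 32, dim)       # 32 token-level text embeddings

motion_latents, text_tokens = DualStreamBlock(dim)(motion_latents, text_tokens)
fused = SingleStreamBlock(dim)(torch.cat([motion_latents, text_tokens], dim=1))
print(fused.shape)  # torch.Size([1, 152, 256])
```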

Flow Matching and Training Strategies

HY-Motion 1.0 embraces a unique Flow Matching approach, diverging from traditional diffusion models. The training involves learning a velocity field that smoothly transitions from noisy inputs to real motion data. The model’s objective is based on minimizing the mean squared error between predicted and actual velocities, resulting in stable outputs even for lengthy sequences.
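
Here is a minimal sketch of one flow-matching training step, assuming the common rectified-flow formulation (the article does not specify the exact interpolation schedule): the model learns to predict the constant velocity of a straight path between real data and noise, trained with the mean-squared-error objective described above.

```python
import torch

def flow_matching_loss(model, x0, text_cond):
    """One flow-matching step under an assumed rectified-flow schedule:
    x_t = (1 - t) * x0 + t * x1, with target velocity x1 - x0.
    x0: real motion latents, shape (batch, frames, dim)."""
    x1 = torch.randn_like(x0)                    # pure-noise endpoint
    t = torch.rand(x0.shape[0], 1, 1)            # one timestep per sample
    x_t = (1 - t) * x0 + t * x1                  # point on the straight path
    v_target = x1 - x0                           # constant true velocity
    v_pred = model(x_t, t.flatten(), text_cond)  # predicted velocity field
    return torch.mean((v_pred - v_target) ** 2)  # MSE objective

# Smoke test with a dummy velocity predictor standing in for the DiT.
dummy = lambda x_t, t, cond: torch.zeros_like(x_t)
print(flow_matching_loss(dummy, torch.randn(4, 120, 201), text_cond=None).item())
```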

Additionally, a dedicated Duration Prediction and Prompt Rewrite module enhances the model's ability to follow instructions accurately. This module utilizes advanced training techniques to refine its understanding of user prompts, ensuring that the generated motions are not only realistic but also contextually appropriate.

Three-Stage Training Curriculum

Training HY-Motion 1.0 follows a structured, three-stage process. The first stage focuses on large-scale pretraining using the extensive 3,000-hour dataset to develop a broad understanding of motion. The second stage fine-tunes the model on a smaller 400-hour dataset, sharpening its accuracy regarding motion details. Finally, the third stage incorporates reinforcement learning, using curated human preference pairs to optimize instruction following and overall motion quality.
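
The curriculum can be summarized as a simple declarative schedule. The hour figures come from the article; the structure itself is illustrative, not from the release.

```python
# Declarative sketch of the three-stage curriculum described above.
# Hour figures are from the article; the schema is illustrative.
TRAINING_STAGES = [
    {"stage": 1, "goal": "large-scale pretraining", "data_hours": 3000},
    {"stage": 2, "goal": "high-quality fine-tuning", "data_hours": 400},
    {"stage": 3, "goal": "RL from human preference pairs", "data_hours": None},
]

for s in TRAINING_STAGES:
    hours = f"{s['data_hours']}h" if s["data_hours"] else "preference pairs"
    print(f"Stage {s['stage']}: {s['goal']} ({hours})")
```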

Performance Benchmarks and Insights

In terms of performance, HY-Motion 1.0 has been rigorously tested using a specially designed set of over 2,000 prompts. Human evaluators assessed both instruction adherence and motion quality, resulting in an average instruction-following score of 3.24, compared to lower scores for baseline text-to-motion systems. It's also noteworthy that motion quality peaked at an average score of 3.43, surpassing competitors.

Scaling experiments revealed that as the model size increased, its ability to follow instructions consistently improved. The 1 billion parameter model reached an average score of 3.34, while the smaller 460 million parameter version exhibited similar motion quality scores. These findings suggest that while larger models enhance instruction alignment, high-quality data curation primarily boosts realism.

Key Takeaways

  • HY-Motion 1.0 marks a significant innovation in the use of Diffusion Transformers for text-to-motion applications.
  • The model’s large parameter scale allows for elaborate and realistic motion generation from text prompts.
  • Its in-depth training dataset and advanced filtering techniques contribute to high-quality outputs.
  • Performance benchmarks indicate that HY-Motion 1.0 consistently outperforms existing models in both instruction following and motion quality.
  • Future applications may extend across various industries, including gaming, film, and interactive media.

FAQs

What is HY-Motion 1.0?

HY-Motion 1.0 is a text-to-3D human motion generation model developed by Tencent, making use of advanced Diffusion Transformer technology.

How can I access HY-Motion 1.0?

You can find HY-Motion 1.0 on GitHub and Hugging Face, where it’s available for download with relevant resources.

What kind of data was used to train this model?

The model was trained on a diverse set of over 3,000 hours of motion data, including human videos, motion capture, and 3D animation assets.

What are the applications of HY-Motion 1.0?

HY-Motion 1.0 can be applied in various fields such as gaming, film production, and even virtual reality experiences.

How does HY-Motion 1.0 compare to other models?

HY-Motion 1.0 outperforms many existing models in terms of instruction following and overall motion quality based on rigorous testing.
