This is the official repository for The Quest for Generalizable Motion Generation: Data, Model, and Evaluation.
The repo provides a unified framework for generalizable motion generation, including both modeling and evaluation:
- ViMoGen Model: A Diffusion Transformer for generalizable motion generation, supporting Text-to-Motion (T2M) and Text/Motion-to-Motion (TM2M).
- MBench Benchmark: A comprehensive evaluation benchmark that decomposes motion generation into nine dimensions across three pillars: Motion Generalization, Motion–Condition Consistency, and Motion Quality.
Together, ViMoGen and MBench enable end-to-end research on scalable and reliable motion generation.
Despite recent advances in 3D human motion generation (MoGen) on standard benchmarks, existing models still face a fundamental bottleneck in their generalization capability. In contrast, adjacent generative fields, most notably video generation (ViGen), have demonstrated remarkable generalization in modeling human behaviors, highlighting transferable insights that MoGen can leverage.
Motivated by this observation, we present ViMoGen, a comprehensive framework that systematically transfers knowledge from ViGen to MoGen across three key pillars: data, modeling, and evaluation.
- ViMoGen-228K Dataset: A large-scale dataset comprising 228,000 high-quality motion samples that integrates high-fidelity optical MoCap data with semantically annotated motions from web videos and synthesized samples.
- ViMoGen Model: A flow-matching-based diffusion transformer that unifies priors from MoCap data and ViGen models through gated multimodal conditioning.
- MBench Benchmark: A hierarchical benchmark designed for fine-grained evaluation across motion quality, prompt fidelity, and generalization ability.
- [2025-12-19] We have released the ViMoGen-DiT pretrained weights along with the core inference pipeline.
- [2025-12-18] We have released the ViMoGen-228K Dataset and MBench leaderboard.
- Inference Code: Core inference pipeline is released.
- Pretrained Weights: ViMoGen-DiT weights are available.
- Training System: Training code and ViMoGen-228K dataset release.
- Evaluation Suite: Complete MBench evaluation scripts and data.
- Motion-to-Motion Pipeline: Detailed guide and tools for custom reference motion preparation.
```bash
conda create -n vigen python=3.10 -y
conda activate vigen
```

Install PyTorch with CUDA support. We recommend PyTorch 2.4+ with CUDA 12.1:

```bash
conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=12.1 -c pytorch -c nvidia
```

Or via pip:

```bash
pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu121
```

Then install the remaining dependencies:

```bash
pip install -r requirements.txt
```

For better performance, install Flash Attention 2:

```bash
pip install flash-attn --no-build-isolation
```

PyTorch3D is needed for motion rendering and visualization:

```bash
# Option 1: Install from conda (recommended)
conda install pytorch3d -c pytorch3d

# Option 2: Install from source
pip install "git+https://github.com/facebookresearch/pytorch3d.git"
```

Body models are required for motion visualization and MBench evaluation:
- SMPL-X: Used for motion visualization and rendering
- SMPL: Used for MBench evaluation metrics (`Pose_Quality`, `Body_Penetration`, and VLM-based metrics)
Download SMPL-X from the official website:
- Register and download `SMPLX_python_v1.1.zip` (Python v1.1.0).
- Extract and place the model files in:

```
data/body_models/
└── smplx/
    ├── SMPLX_FEMALE.npz
    ├── SMPLX_MALE.npz
    └── SMPLX_NEUTRAL.npz
```
Download SMPL from the official website:
- Register and download the SMPL model (version 1.1.0 for Python).
- Extract and place `SMPL_NEUTRAL.pkl` in:

```
data/body_models/
└── smpl/
    └── SMPL_NEUTRAL.pkl
```
Note: We provide `smplx_root.pt` in `data/body_models/` for coordinate alignment.
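To verify the layout before moving on, here is a quick sanity check over the paths listed above (a minimal sketch; it only checks that the files exist):

```python
# Verify that the body model files are in the expected locations.
from pathlib import Path

expected = [
    "data/body_models/smplx/SMPLX_FEMALE.npz",
    "data/body_models/smplx/SMPLX_MALE.npz",
    "data/body_models/smplx/SMPLX_NEUTRAL.npz",
    "data/body_models/smpl/SMPL_NEUTRAL.pkl",
    "data/body_models/smplx_root.pt",
]

for path in expected:
    status = "ok" if Path(path).is_file() else "MISSING"
    print(f"[{status}] {path}")
```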
Download pretrained models and place them in the ./checkpoints/ directory:
| Model | Description | Download Link / Command |
|---|---|---|
| ViMoGen-DiT-1.3B | Main motion generation model | Google Drive (Save as ./checkpoints/model.pt) |
| Wan2.1-T2V-1.3B | Base text encoder weights and training initialization | huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./checkpoints/Wan2.1-T2V-1.3B |
For evaluation on MBench, you need to download and extract the benchmark data:
- Download `mbench.tar.gz` from Google Drive. This package includes:
  - Reference motions generated by Wan 2.2 and processed by CameraHMR.
  - T5 text embeddings for all prompts.
- Extract to the `./data/` directory:

```bash
tar -xzvf mbench.tar.gz -C ./data/
```
```
ViMoGen/
├── checkpoints/ # Model checkpoints
├── configs/ # Configuration files
│ ├── tm2m_train.yaml # Training config
│ ├── tm2m_infer.yaml # TM2M inference config
│ └── t2m_infer.yaml # T2M inference config
├── data/ # Data directory
│ ├── mbench/ # MBench benchmark data (Download required)
│ ├── meta_info/ # Metadata for training/testing
│ ├── body_models/ # SMPL-X/SMPL models and alignment files
│ └── ViMoGen-228K/ # Training dataset (Download required)
├── data_samples/ # Example data for quick start
├── datasets/ # Dataset loading utilities
├── models/ # Model definitions
│ └── transformer/ # DiT transformer models
├── mbench/ # MBench evaluation module
├── motion_gating/ # Motion quality gating utilities
├── motion_rep/ # Motion representation conversion tools
├── scripts/ # Shell scripts
├── trainer/ # Training utilities
├── parallel/ # Distributed training utilities
├── evaluate_mbench.py # MBench evaluation entry point
├── train_eval_vimogen.py # Main training/inference entry point
└── utils.py              # Common utilities
```
Generate motion from text prompts:
- Edit prompts: Modify `data_samples/example_archive.json` with your desired text prompts, or use our default prompts (see the inspection snippet after this list).
- Extract text embeddings:

  ```bash
  bash scripts/text_encoding_demo.sh
  ```

- Run inference:

  ```bash
  bash scripts/t2m_infer.sh
  ```
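Before editing the prompt file, you can inspect the default entries to see what fields are expected. A minimal sketch (it assumes nothing about the schema beyond the file being JSON):

```python
# Peek at data_samples/example_archive.json to see the expected prompt format.
import json

with open("data_samples/example_archive.json") as f:
    archive = json.load(f)

if isinstance(archive, list):
    print(f"{len(archive)} entries; first entry:")
    print(json.dumps(archive[0], indent=2))
else:
    print("Top-level keys:", list(archive.keys()))
```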
Generate motion conditioned on both text and reference motion:
- Prepare reference motion:
  - Option A: Use MBench Benchmark. We provide pre-processed MBench data with stored reference motions in `./data/mbench/` for immediate evaluation.
  - Option B: Custom Preparation. See Custom Motion Preparation below.
- Run inference:

  ```bash
  bash scripts/tm2m_infer.sh
  ```
To prepare custom reference motions for TM2M inference, we provide a complete pipeline covering:
- Generate Reference Video - Using text-to-video models (e.g., Wan 2.2, CogVideoX)
- Extract Motion from Video - Using HMR models (e.g., SMPLest-X, CameraHMR)
- Convert to Motion Representation - Transform to our 276-dim format
- Quality Gating - Determine `use_ref_motion` via VLM analysis and jitter metrics (see the sketch below)
📖 See motion_rep/CUSTOM_MOTION.md for the complete guide.
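For the jitter side of quality gating, a rough illustration of the kind of check involved is sketched below. This is a minimal placeholder (the function, threshold, and input path are hypothetical), not the metric shipped with the pipeline:

```python
# Hypothetical jitter score: mean magnitude of per-joint acceleration.
# Illustrative only; not the released pipeline's implementation.
import numpy as np

def jitter_score(joints: np.ndarray, fps: float = 30.0) -> float:
    """joints: array of shape (frames, 22, 3), in meters."""
    vel = np.diff(joints, axis=0) * fps   # per-frame velocity
    acc = np.diff(vel, axis=0) * fps      # per-frame acceleration
    return float(np.linalg.norm(acc, axis=-1).mean())

# Example gating decision with a made-up threshold.
joints = np.load("my_reference_motion.npy")   # hypothetical (frames, 22, 3) file
use_ref_motion = jitter_score(joints) < 50.0  # threshold is illustrative only
print("use_ref_motion =", use_ref_motion)
```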
- Download ViMoGen-228K Dataset from HuggingFace:

  The dataset is hosted on HuggingFace and contains:
  - `ViMoGen-228K.json`: Unified annotation file with all 228K samples
  - Split annotation files: `optical_mocap_data.json`, `in_the_wild_video_data.json`, `synthetic_video_data.json`
  - Motion files: `.pt` files organized in the `motions/` directory

  Download using `huggingface-cli`:

  ```bash
  huggingface-cli download wruisi/ViMoGen-228K --repo-type dataset --local-dir ./data/ViMoGen-228K
  ```

  Data Format:
  - Motion files (`.pt`) vary by source:
    - Visual MoCap (in-the-wild and synthetic videos): Dictionary with
      - `motion`: Tensor of shape `[#frames, 276]` (per-frame motion features)
      - `extrinsic`: Tensor of shape `[#frames, 9]` (camera extrinsics)
      - `intrinsic`: Tensor of shape `[3, 3]` (camera intrinsics)
    - Optical MoCap: Direct tensor of shape `[#frames, 276]` (pure motion, no camera info)
  - Each JSON entry contains: `id`, `subset`, `split`, `motion_text_annot`, `video_text_annot`, `motion_path`, and optionally `video_path`
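  To sanity-check a downloaded motion file, a small inspection script following the format above could look like this (the file path is illustrative):

  ```python
  # Inspect one downloaded motion file; handles both storage formats described above.
  import torch

  sample = torch.load("data/ViMoGen-228K/motions/<sample_id>.pt", map_location="cpu")  # illustrative path

  if isinstance(sample, dict):
      # Visual MoCap: dictionary with motion + camera parameters.
      print("motion:   ", tuple(sample["motion"].shape))     # [#frames, 276]
      print("extrinsic:", tuple(sample["extrinsic"].shape))  # [#frames, 9]
      print("intrinsic:", tuple(sample["intrinsic"].shape))  # [3, 3]
  else:
      # Optical MoCap: a bare [#frames, 276] tensor.
      print("motion:", tuple(sample.shape))
  ```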
- Prepare training data (add sample IDs and update paths):

  ```bash
  python scripts/prepare_training_data.py \
      --input_json ./data/ViMoGen-228K/ViMoGen-228K.json \
      --motion_root ./data/ViMoGen-228K \
      --output_dir ./data/meta_info \
      --skip_stats
  ```

  This script:
  - Adds a `sample_id` field to each entry
  - Prefixes `motion_path` with the data root directory
  - Outputs `./data/meta_info/ViMoGen-228K_train.json`

  Note: Use `--skip_stats` to skip mean/std computation since we provide pre-computed statistics in `./data/meta_info/`. Remove this flag if you want to recompute statistics from the full dataset.
- Extract text embeddings (requires GPU, takes several hours):

  ```bash
  bash scripts/text_encoding_train.sh
  ```

  This will:
  - Extract T5 embeddings for all text prompts
  - Save embeddings to `./data/ViMoGen-228K/text_embeddings/`
  - Update the JSON with embedding paths
Launch distributed training with 8 GPUs:
```bash
bash scripts/tm2m_train.sh
```

MBench is our hierarchical benchmark for evaluating motion generation across multiple dimensions.
Run inference on the MBench evaluation set (450 prompts):
```bash
bash scripts/tm2m_infer.sh
```

Convert generated motions to the format expected by MBench:
```bash
python scripts/organize_mbench_results.py \
    --input_dir exp/tm2m_infer_mbench/test_visualization/mbench_full/step00000001 \
    --output_dir exp/mbench_eval_input
```

This script:
- Collects all `motion_gen_condition_on_text.pt` or `motion_gen_condition_on_motion.pt` files
- Extracts 3D joints and applies coordinate transformation
- Saves results as `.npy` files in the expected format `(frames, 22, 3)` (see the verification snippet below)
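To confirm the conversion produced what the evaluator expects, you can spot-check the output. A minimal sketch, assuming the converted `.npy` files end up under `exp/mbench_eval_input/`:

```python
# Spot-check converted motions: each should be a float array of shape (frames, 22, 3).
import glob
import numpy as np

files = sorted(glob.glob("exp/mbench_eval_input/**/*.npy", recursive=True))
print(f"{len(files)} converted motion files found")

joints = np.load(files[0])
assert joints.ndim == 3 and joints.shape[1:] == (22, 3), f"unexpected shape {joints.shape}"
print("first file:", files[0], "shape:", joints.shape)
```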
Run the full MBench evaluation:
```bash
python evaluate_mbench.py \
    --evaluation_path exp/mbench_eval_input \
    --gemini_api_key "YOUR_GEMINI_API_KEY"
```

Command Options:
- `--evaluation_path`: Directory containing processed motion files
- `--output_path`: Output directory for results (default: `./evaluation_results/`)
- `--dimension`: Specific dimensions to evaluate (optional, evaluates all by default)
- `--gemini_api_key`: Required for VLM-based metrics
Note on Evaluation Time:
- Motion Quality metrics (`Jitter_Degree`, `Ground_Penetration`, `Foot_Floating`, `Foot_Sliding`, `Dynamic_Degree`) compute directly on 3D joints and are fast (seconds to minutes).
- Pose Quality metrics (`Body_Penetration`, `Pose_Quality`) require running SMPLify (inverse kinematics from joints to SMPL parameters) and take a moderate amount of time.
- VLM-based metrics (`Motion_Condition_Consistency`, `Motion_Generalizability`) require both SMPLify and video rendering, making them the most time-consuming (several hours for 200 samples).

Although our motion representation can directly export SMPL parameters, we use SMPLify from joints for fair comparison with other skeleton-only methods. To speed up evaluation, use `--dimension` to evaluate specific metric categories separately.
MBench evaluates across three categories:
| Category | Dimension | Description |
|---|---|---|
| Motion Quality | `Jitter_Degree` | Motion smoothness |
| | `Ground_Penetration` | Feet going through ground |
| | `Foot_Floating` | Feet floating above ground |
| | `Foot_Sliding` | Feet sliding during contact |
| | `Dynamic_Degree` | Motion dynamics/expressiveness |
| Pose Quality | `Body_Penetration` | Self-collision detection |
| | `Pose_Quality` | Pose naturalness (NRDF) |
| VLM-based | `Motion_Condition_Consistency` | Prompt–motion alignment |
| | `Motion_Generalizability` | Generalization to novel prompts |
To evaluate only specific dimensions:
```bash
# Motion quality only (fast, no rendering needed)
python evaluate_mbench.py \
    --evaluation_path exp/mbench_eval_input \
    --dimension Jitter_Degree Ground_Penetration Foot_Sliding

# VLM-based metrics only (requires Gemini API and video rendering)
python evaluate_mbench.py \
    --evaluation_path exp/mbench_eval_input \
    --dimension Motion_Condition_Consistency Motion_Generalizability \
    --gemini_api_key "YOUR_API_KEY"
```

Results are saved to `evaluation_results/`:
- `{name}_eval_results.json`: Aggregate metrics for each dimension
- `{name}_per_motion_results.json`: Per-sample detailed results
- `{name}_full_info.json`: Evaluation metadata
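The aggregate scores can then be read straight from the results JSON. A minimal sketch, assuming your run is named `mbench_eval_input` and that the file maps dimension names to scores (the exact layout may differ):

```python
# Print aggregate MBench scores from the results file.
import json

with open("evaluation_results/mbench_eval_input_eval_results.json") as f:
    results = json.load(f)

for dimension, value in results.items():
    print(f"{dimension}: {value}")
```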
Explore More SMPLCap Projects
- [TPAMI'25] SMPLest-X: An extended version of SMPLer-X with stronger foundation models.
- [ICML'25] ADHMR: A framework to align diffusion-based human mesh recovery methods via direct preference optimization.
- [ECCV'24] WHAC: World-grounded human pose and camera estimation from monocular videos.
- [CVPR'24] AiOS: An all-in-one-stage pipeline combining detection and 3D human reconstruction.
- [NeurIPS'23] SMPLer-X: Scaling up EHPS towards a family of generalist foundation models.
- [NeurIPS'23] RoboSMPLX: A framework to enhance the robustness of whole-body pose and shape estimation.
- [ICCV'23] Zolly: 3D human mesh reconstruction from perspective-distorted images.
- [arXiv'23] PointHPS: 3D HPS from point clouds captured in real-world settings.
- [NeurIPS'22] HMR-Benchmarks: A comprehensive benchmark of HPS datasets, backbones, and training strategies.
If you find this work useful, please cite our paper:
@article{lin2025questgeneralizablemotiongeneration,
title={The Quest for Generalizable Motion Generation: Data, Model, and Evaluation},
author={Jing Lin and Ruisi Wang and Junzhe Lu and Ziqi Huang and Guorui Song and Ailing Zeng and Xian Liu and Chen Wei and Wanqi Yin and Qingping Sun and Zhongang Cai and Lei Yang and Ziwei Liu},
year={2025},
journal={arXiv preprint arXiv:2510.26794},
}