Motus: RoboTwin 2.0 Fine-Tuned Checkpoint
Motus is a unified latent action world model that builds on existing pretrained models and rich, shareable motion information. It introduces a Mixture-of-Transformers (MoT) architecture to integrate three experts (understanding, action, and video generation) and adopts a UniDiffuser-style scheduler, enabling flexible switching between modeling modes (World Model, Vision-Language-Action Model, Inverse Dynamics Model, Video Generation Model, and Video-Action Joint Prediction Model). Motus further leverages optical flow to learn latent actions, extracting pixel-level "delta actions", and adopts a three-phase training pipeline with a six-layer data pyramid to enable large-scale action pretraining.
This checkpoint is fine-tuned on the RoboTwin 2.0 benchmark (50+ bimanual manipulation tasks).
Homepage | GitHub | arXiv | Feishu
Table of Contents
- Highlights
- Model Details
- Performance
- Hardware & Software Requirements
- Quickstart (Inference)
- Citation
Highlights
- 87.02% average success rate on RoboTwin 2.0 (randomized setting; ~15 points over X-VLA, ~45 points over π0.5)
- 50+ Manipulation Tasks: Trained on diverse bimanual manipulation scenarios
- Multi-Task Capable: Single model handles all 50+ tasks
- Ready for Deployment: Direct inference or further fine-tuning
Model Details
Architecture
| Component | Base Model | Parameters |
|---|---|---|
| VGM (Video Generation Model) | WAN 2.2 | ~5.00B |
| VLM (Vision-Language Model) | Qwen3-VL-2B | ~2.13B |
| Action Expert | - | ~641.5M |
| Understanding Expert | - | ~253.5M |
| Total | - | ~8B |
Training Details
- Base Checkpoint: motus-robotics/Motus (Stage 2 pretrained)
- Fine-Tuning Data: RoboTwin 2.0 (2,500 clean + 25,000 randomized demonstrations)
- Training Steps: 40k
Action Representation
- Control frequency: 30Hz (default)
- Action chunk size: 48 steps (default)
- Action dimension: 14 (bimanual: 7 per arm)
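As a concrete reading of these defaults, the sketch below splits one 48×14 action chunk into per-arm 7-DoF commands and replays it at 30 Hz, i.e. a 1.6 s horizon per chunk. It is illustrative only: `send_left_arm` and `send_right_arm` are hypothetical placeholders for whatever robot interface you use.

```python
import time
import numpy as np

CONTROL_HZ = 30    # default control frequency
CHUNK_SIZE = 48    # default chunk length -> 48 / 30 Hz = 1.6 s horizon
ACTION_DIM = 14    # bimanual: 7 DoF per arm

def replay_chunk(actions: np.ndarray, send_left_arm, send_right_arm):
    """Replay one predicted action chunk at the default control rate.

    `actions` is the [48, 14] array produced by the model; `send_left_arm`
    and `send_right_arm` are hypothetical callbacks into your robot driver.
    """
    assert actions.shape == (CHUNK_SIZE, ACTION_DIM)
    period = 1.0 / CONTROL_HZ
    for step in actions:
        left, right = step[:7], step[7:]   # 7-DoF command per arm
        send_left_arm(left)
        send_right_arm(right)
        time.sleep(period)  # naive pacing; a real loop would compensate for latency
```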
Performance
RoboTwin 2.0 Benchmark (50+ Tasks)
| Method | Clean | Randomized |
|---|---|---|
| π0.5 | 42.98% | 43.84% |
| X-VLA | 72.80% | 72.84% |
| Motus (Ours) | 88.66% | 87.02% |
Key improvements (randomized setting):
- ~15 percentage points over X-VLA
- ~45 percentage points over π0.5
Hardware & Software Requirements
| Mode | VRAM | Recommended GPU |
|---|---|---|
| Inference (with pre-encoded T5) | ~24 GB | RTX 5090 |
| Inference (without pre-encoded T5) | ~41 GB | A100 (40GB) / A100 (80GB) / H100 / B200 |
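One way to stay within the ~24 GB budget is to encode the language instruction once, offline, and pass the cached embedding at inference time (the `language_embeddings` argument in the Python API below). The sketch here uses Hugging Face `transformers`; the encoder name `google/umt5-xxl` and the saved-tensor format are assumptions rather than the repository's exact pipeline (WAN-family models typically ship a UMT5-XXL text encoder), so substitute the encoder bundled with your WAN 2.2 download if it differs.

```python
import torch
from transformers import AutoTokenizer, UMT5EncoderModel

# Assumed encoder; replace with the text encoder shipped alongside WAN 2.2 if different.
ENCODER_NAME = "google/umt5-xxl"

tokenizer = AutoTokenizer.from_pretrained(ENCODER_NAME)
encoder = UMT5EncoderModel.from_pretrained(ENCODER_NAME, torch_dtype=torch.bfloat16).to("cuda").eval()

instruction = "pick up the cube and place it on the right"
inputs = tokenizer(instruction, return_tensors="pt").to("cuda")

with torch.no_grad():
    # [1, seq_len, hidden] hidden states, cached to disk so the large text
    # encoder never has to share GPU memory with Motus at inference time.
    embeddings = encoder(**inputs).last_hidden_state

torch.save(embeddings.cpu(), "instruction_t5_embedding.pt")
```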
Quickstart (Inference)
RoboTwin 2.0 Simulation
cd inference/robotwin/Motus
# Single task evaluation
bash eval.sh place_dual_shoes ./pretrained_models/Motus_robotwin2
# Multi-task batch evaluation
bash auto_eval.sh
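If you prefer Python over auto_eval.sh for batch runs, a thin wrapper like the one below can drive eval.sh across several tasks. Only place_dual_shoes is taken from the example above; the other task names are placeholders to replace with tasks from your RoboTwin 2.0 installation.

```python
import subprocess

CKPT = "./pretrained_models/Motus_robotwin2"
# place_dual_shoes comes from the example above; the remaining names are
# placeholders for tasks from your RoboTwin 2.0 installation.
TASKS = ["place_dual_shoes", "task_name_2", "task_name_3"]

for task in TASKS:
    print(f"=== Evaluating {task} ===")
    subprocess.run(["bash", "eval.sh", task, CKPT],
                   cwd="inference/robotwin/Motus", check=True)
```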
Offline Inference (No Environment)
python inference/real_world/Motus/inference_example.py \
--model_config inference/real_world/Motus/utils/robotwin.yml \
--ckpt_dir ./pretrained_models/Motus_robotwin2 \
--wan_path /path/to/pretrained_models \
--image /path/to/input_frame.png \
--instruction "pick up the cube and place it on the right" \
--use_t5 \
--output result.png
Python API
import torch
import yaml
from models.motus import Motus, MotusConfig
# Load config
with open("configs/robotwin.yaml", "r") as f:
config = yaml.safe_load(f)
# Initialize model
model_config = MotusConfig(
wan_checkpoint_path=config['model']['wan']['checkpoint_path'],
vae_path=config['model']['wan']['vae_path'],
wan_config_path=config['model']['wan']['config_path'],
vlm_checkpoint_path=config['model']['vlm']['checkpoint_path'],
action_dim=14,
load_pretrained_backbones=False,
)
model = Motus(model_config).to("cuda").eval()
model.load_checkpoint("./pretrained_models/Motus_robotwin2", strict=False)
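# NOTE: frame_tensor (initial observation frame), state_tensor (robot proprioceptive
# state), t5_embeddings (pre-encoded language instruction), and vlm_inputs (processed
# VLM inputs) are placeholders that must be prepared by the caller before this point.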
# Inference
with torch.no_grad():
predicted_frames, predicted_actions = model.inference_step(
first_frame=frame_tensor,
state=state_tensor,
num_inference_steps=20,
language_embeddings=t5_embeddings,
vlm_inputs=[vlm_inputs],
)
# Action chunk: [1, 48, 14]
actions = predicted_actions.squeeze(0).cpu().numpy()
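To eyeball the world-model rollout alongside the action chunk, the predicted frames can be dumped to disk. This is only a sketch: the tensor layout and value range of `predicted_frames` are assumptions and may first require VAE decoding or rescaling.

```python
from torchvision.utils import save_image  # assumes torchvision is installed

# Assumption: predicted_frames is a float tensor in [0, 1] with shape
# [1, num_frames, C, H, W] (or [num_frames, C, H, W]); check inference_step's
# actual output layout before relying on this.
frames = predicted_frames.float().cpu()
if frames.dim() == 5:
    frames = frames.squeeze(0)
for i, frame in enumerate(frames):
    save_image(frame, f"predicted_frame_{i:03d}.png")
```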
Citation
@misc{motus2025,
title={Motus: A Unified Latent Action World Model},
  author={Hongzhe Bi and Hengkai Tan and Shenghao Xie and Zeyuan Wang and Shuhe Huang and Haitian Liu and Ruowen Zhao and Yao Feng and Chendong Xiang and Yinze Rong and Hongyan Zhao and Hanyu Liu and Zhizhong Su and Lei Ma and Hang Su and Jun Zhu},
year={2025},
eprint={2507.23523},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://motus-robotics.github.io/motus},
}