Motus: A Unified Latent Action World Model

1Tsinghua University   2Peking University   3Horizon Robotics
*Joint first authors · Project Lead

Framework Overview

Figure 1: Motus Architecture. Here, $a_t \dots a_{t+k}$ are actions, $z_t \dots z_{t+k}$ are latent actions, and $\tau_v$ and $\tau_a$ are the rectified flow timesteps for the video generation model and the action expert, respectively.

Motus is a unified latent action world model that leverages existing pretrained models and rich, shareable motion information. Motus introduces a Mixture-of-Transformers (MoT) architecture to integrate three experts (understanding, action, and video generation) and adopts a UniDiffuser-style scheduler to enable flexible switching between modeling modes (World Models, Vision-Language-Action Models, Inverse Dynamics Models, Video Generation Models, and Video-Action Joint Prediction Models). Motus further leverages optical flow to learn latent actions and adopts a three-stage training pipeline over a six-level data pyramid, thereby extracting pixel-level "delta actions" and enabling large-scale action pretraining.
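
To make the mode switching concrete, here is a minimal sketch (our illustration, assuming a UniDiffuser-style convention; not the authors' code) of how fixing or sampling the per-modality rectified-flow timesteps $\tau_v$ and $\tau_a$ selects a mode: a timestep of 0 treats that modality as clean conditioning, a timestep of 1 treats it as pure noise (effectively unused), and a sampled timestep makes it a generation target.

```python
import torch

def sample_flow_timesteps(mode: str, batch: int, device: str = "cpu"):
    """Pick per-modality rectified-flow timesteps for one mode.

    Assumed convention: t = 0 -> modality given clean (conditioning),
    t = 1 -> pure noise (uninformative), sampled t -> generation target.
    """
    tau_v = torch.rand(batch, device=device)  # video timestep
    tau_a = torch.rand(batch, device=device)  # (latent) action timestep
    if mode == "wm":       # world model: actions given, future video generated
        tau_a = torch.zeros(batch, device=device)
    elif mode == "vla":    # VLA: obs + language given, actions generated;
        tau_v = torch.ones(batch, device=device)   # future frames stay noise
    elif mode == "idm":    # inverse dynamics: video given, actions inferred
        tau_v = torch.zeros(batch, device=device)
    elif mode == "vgm":    # video generation only: action branch uninformative
        tau_a = torch.ones(batch, device=device)
    elif mode == "joint":  # joint video-action prediction: both sampled
        pass
    else:
        raise ValueError(f"unknown mode: {mode}")
    return tau_v, tau_a
```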

Latent Actions

Figure 2: The Latent Action VAE. Optical flow-based representation that bridges visual dynamics with control signals through a variational autoencoder architecture.

To leverage large-scale heterogeneous data, we introduce latent actions that encode motion directly from optical flow. DPFlow computes pixel-level displacements between frames, which are then compressed via a deep compression autoencoder (DC-AE) and a lightweight encoder, enabling the model to learn cross-embodiment motion priors from diverse video sources.
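
A minimal sketch of this latent-action pathway, assuming a standard VAE reparameterization (layer sizes and names are illustrative, not the authors' code; DPFlow is an external optical-flow estimator and appears here only through its output tensor):

```python
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """Compress a 2-channel optical-flow field into a Gaussian latent action."""

    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.conv = nn.Sequential(  # DC-AE-style convolutional downsampling
            nn.Conv2d(2, 64, kernel_size=4, stride=4), nn.SiLU(),
            nn.Conv2d(64, 128, kernel_size=4, stride=4), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.mu = nn.Linear(128, latent_dim)
        self.logvar = nn.Linear(128, latent_dim)

    def forward(self, flow: torch.Tensor):
        # flow: (B, 2, H, W) pixel displacements between frames t and t+1,
        # e.g. as produced by DPFlow
        h = self.conv(flow)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return z, mu, logvar
```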

Data Pyramid

Figure 3: The Embodied Data Pyramid. Six-level data hierarchy from web data (Level 1) to target-robot demonstrations (Level 6), progressively increasing in task relevance and quality.

Three-Stage Training Recipe

1. Stage 1 (VGM Training): trains the VGM with embodied data.
2. Stage 2 (Motus Pretraining): pretrains the Motus model with latent actions.
3. Stage 3 (Motus SFT): fine-tunes the Motus model on target-robot trajectories.

| Stage | Data | Training |
| --- | --- | --- |
| Pretrained Foundation Models | Level 1: Web Data | VGM and VLM |
| Stage 1 (VGM Training) | Level 2: Egocentric Human Videos; Level 3: Synthetic Data; Level 5: Multi-Robot Task Trajectory | Only VGM |
| Stage 2 (Motus Pretraining) | Level 2: Egocentric Human Videos; Level 3: Synthetic Data; Level 4: Task-agnostic Data; Level 5: Multi-Robot Task Trajectory | Motus (all 3 experts, with latent actions) |
| Stage 3 (Motus SFT) | Level 6: Target-Robot Task Trajectory | Motus (all 3 experts, with actions) |
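
A sketch of how this stage schedule could be written down as a training config; the level numbering and trainable modules follow the table above, while the keys and module names are illustrative assumptions, not the authors' code:

```python
# Stage -> data levels -> trainable experts, per the table above.
TRAINING_STAGES = {
    "stage1_vgm": {
        "data_levels": [2, 3, 5],   # human video, synthetic, multi-robot
        "trainable": ["video_generation"],          # only the VGM expert
        "action_supervision": None,
    },
    "stage2_pretrain": {
        "data_levels": [2, 3, 4, 5],                # adds task-agnostic data
        "trainable": ["understanding", "action", "video_generation"],
        "action_supervision": "latent",             # latent actions from flow
    },
    "stage3_sft": {
        "data_levels": [6],                         # target-robot demos
        "trainable": ["understanding", "action", "video_generation"],
        "action_supervision": "robot",              # real robot actions
    },
}
```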

Real-World Robot Experiments

- Grind Coffee Beans: grind coffee beans with the grinder.
- Get Water: get water from the water dispenser.
- Brew Coffee: brew coffee using the drip coffee machine.
- Water the Flowers: pour water from the kettle to the flowers.
- Bake Bread: use the oven to bake bread.
- Touch Keyboard: touch the keyboard for computer operation.

Performance Highlights

Real-World Experiments

| Task Description | $\pi_{0.5}$ | w/o Pretrain | Motus |
| --- | --- | --- | --- |
| AC-One | | | |
| Fold Towel | 4% | 1% | 14.5% |
| Brew Coffee using Coffee Maker | 0% | 0% | 62% |
| Get Water from Water Dispenser | 30% | 8% | 36% |
| Place Cube into Plate | 46% | 60% | 100% |
| Place Cube into Plate (OOD) | 28.1% | 18.8% | 75% |
| Grind Coffee Beans with Grinder | 8% | 0% | 92% |
| Pour Water from Kettle to Flowers | 5% | 5% | 65% |
| Touch Instructed Keyboard | 0% | 100% | 82.5% |
| Put Bread into Oven | 12% | 40% | 42% |
| Average | 14.79% | 25.86% | 63.22% |
| Agilex-Aloha-2 | | | |
| Fold Towel | 27.5% | 0% | 39% |
| Get Water from Water Dispenser | 62% | 8% | 96% |
| Pour Water from Kettle to Flowers | 45% | 40% | 47.5% |
| Touch Instructed Keyboard | 72.5% | 85% | 80% |
| Put Bread into Oven | 36% | 0% | 34% |
| Average | 48.60% | 26.60% | 59.30% |

RoboTwin 2.0 Simulation (Randomized)

| Simulation Task | $\pi_{0.5}$ | X-VLA | w/o Pretrain | Stage1 | Motus |
| --- | --- | --- | --- | --- | --- |
| Place Dual Shoes | 7% | 88% | 80% | 94% | 87% |
| Move Stapler Pad | 18% | 73% | 37% | 68% | 85% |
| Stack Blocks Two | 56% | 87% | 94% | 99% | 98% |
| Scan Object | 38% | 36% | 50% | 69% | 66% |
| Place Object Stand | 65% | 88% | 93% | 96% | 97% |
| Place Fan | 36% | 75% | 85% | 85% | 87% |
| Move Pillbottle Pad | 29% | 71% | 83% | 90% | 96% |
| Pick Dual Bottles | 6% | 36% | 68% | 17% | 90% |
| Blocks Ranking RGB | 35% | 83% | 88% | 98% | 97% |
| ... (50 tasks in total) | ... | ... | ... | ... | ... |
| Turn Switch | 6% | 61% | 60% | 64% | 78% |
| Pick Diverse Bottles | 3% | 36% | 62% | 18% | 91% |
| Place Bread Basket | 56% | 71% | 83% | 87% | 94% |
| Stack Blocks Three | 16% | 10% | 76% | 95% | 95% |
| Put Bottles Dustbin | 9% | 77% | 33% | 24% | 79% |
| Place Can Basket | 25% | 52% | 62% | 55% | 76% |
| Stamp Seal | 23% | 82% | 88% | 95% | 92% |
| Hanging Mug | 3% | 27% | 10% | 25% | 38% |
| Handover Block | 19% | 37% | 15% | 55% | 73% |
| Stack Bowls Three | 35% | 86% | 74% | 83% | 87% |
| Place Object Basket | 36% | 39% | 75% | 80% | 87% |
| Open Microwave | 37% | 71% | 82% | 84% | 91% |
| Average | 43.84% | 72.84% | 77.00% | 81.86% | 87.02% |

Citation

@misc{bi2025hrdt,
    title={H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation}, 
    author={Hongzhe Bi and Lingxuan Wu and Tianwei Lin and Hengkai Tan and Zhizhong Su and Hang Su and Jun Zhu},
    year={2025},
    eprint={2507.23523},
    archivePrefix={arXiv},
    primaryClass={cs.RO},
    url={https://embodiedfoundation.github.io/hrdt}, 
}