Motus: A Unified Latent Action World Model

¹Tsinghua University ²Shengshu ³Peking University ⁴Horizon Robotics
*Joint first authors †Joint project lead

Framework Overview

Motus Architecture

Figure 1: Motus Architecture. Here, $a_t, \dots, a_{t+k}$ are actions, $z_t, \dots, z_{t+k}$ are latent actions, and $\tau_v$ and $\tau_a$ are the rectified flow timesteps for the video generation model and the action expert, respectively.

Motus is a unified latent action world model that leverages existing pretrained models and rich, shareable motion information. It uses a Mixture-of-Transformers (MoT) architecture to integrate three experts (understanding, action, and video generation) and a UniDiffuser-style scheduler to switch flexibly between modeling modes: World Models, Vision-Language-Action Models, Inverse Dynamics Models, Video Generation Models, and Video-Action Joint Prediction Models. Motus learns latent actions from optical flow, which extracts a pixel-level "delta action" from video, and is trained with a three-stage recipe on a six-level data pyramid, which enables large-scale action pretraining.
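One way to picture the mode switching is as a choice of role for each modality, expressed through the two timesteps $\tau_v$ and $\tau_a$ from Figure 1. The sketch below is a minimal illustration, not the Motus implementation. It assumes a convention where timestep 1.0 is pure noise and 0.0 is clean data, and that a modality is conditioned on by fixing its timestep at 0.0 and marginalized out by fixing it at 1.0. The names `Role`, `Mode`, `MODES`, and `timesteps` are hypothetical, and "video" here means the future frames to be predicted.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    CONDITION = "condition"      # modality is given clean and used as input
    GENERATE = "generate"        # modality is denoised along the sampling schedule
    MARGINALIZE = "marginalize"  # modality is held at pure noise, i.e. ignored


@dataclass(frozen=True)
class Mode:
    video: Role
    action: Role


# Assumed mapping from the five modes named above to per-modality roles.
MODES = {
    "world_model": Mode(video=Role.GENERATE, action=Role.CONDITION),
    "vla": Mode(video=Role.MARGINALIZE, action=Role.GENERATE),
    "inverse_dynamics": Mode(video=Role.CONDITION, action=Role.GENERATE),
    "video_generation": Mode(video=Role.GENERATE, action=Role.MARGINALIZE),
    "joint_prediction": Mode(video=Role.GENERATE, action=Role.GENERATE),
}


def timestep(role: Role, tau: float) -> float:
    """Return the timestep for one expert at sampling progress `tau`.

    Assumed convention: 1.0 is pure noise, 0.0 is clean data.
    """
    if role is Role.CONDITION:
        return 0.0
    if role is Role.MARGINALIZE:
        return 1.0
    return tau


def timesteps(mode_name: str, tau: float) -> tuple[float, float]:
    """Return (tau_v, tau_a) for the given mode at sampling progress `tau`."""
    mode = MODES[mode_name]
    return timestep(mode.video, tau), timestep(mode.action, tau)
```

Under these assumptions, `timesteps("world_model", 0.5)` keeps the action timestep at 0.0 while the video timestep follows the schedule, and `timesteps("joint_prediction", 0.5)` moves both together.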

Latent Actions

Figure 2: The Latent Action VAE. An optical-flow-based representation that bridges visual dynamics and control signals through a variational autoencoder.

To leverage large-scale heterogeneous data, we introduce latent actions that encode motion directly from optical flow. DPFlow computes pixel-level displacements between frames, which are then compressed via a deep convolutional variational autoencoder (DC-AE) and a lightweight encoder, enabling the model to learn cross-embodiment motion priors from diverse video sources.
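As a rough illustration of what pixel-level displacements can look like as autoencoder input, the sketch below renders a tiny flow field as an RGB image, with hue encoding direction and brightness encoding magnitude. This rendering is an assumption made for illustration: the text above does not specify how the flow is formatted before compression, and `flow_to_rgb` and `max_magnitude` are hypothetical names. DPFlow and the DC-AE themselves are not reproduced here.

```python
import colorsys
import math


def flow_to_rgb(flow, max_magnitude):
    """Render a flow field as an RGB image.

    `flow` is a list of rows, each a list of (dx, dy) pixel displacements.
    Hue encodes the direction of motion and brightness encodes its magnitude,
    clipped at `max_magnitude`. Static pixels come out black.
    """
    image = []
    for row in flow:
        out_row = []
        for dx, dy in row:
            # Map the angle from (-pi, pi] onto a hue in [0, 1).
            hue = (math.atan2(dy, dx) / (2 * math.pi)) % 1.0
            value = min(math.hypot(dx, dy) / max_magnitude, 1.0)
            r, g, b = colorsys.hsv_to_rgb(hue, 1.0, value)
            out_row.append((round(r * 255), round(g * 255), round(b * 255)))
        image.append(out_row)
    return image


# A 1x2 flow field: one static pixel and one pixel moving right by one pixel.
example = flow_to_rgb([[(0.0, 0.0), (1.0, 0.0)]], max_magnitude=1.0)
```

Because such an image depends only on apparent motion and not on robot-specific action spaces, it hints at why a flow-based code could be learned from videos that carry no action labels.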

Data Pyramid

Figure 3: The Embodied Data Pyramid. Six-level data hierarchy from web data (Level 1) to target-robot demonstrations (Level 6), progressively increasing in task relevance and quality.

Three-Stage Training Recipe

Stage 1 (VGM Training)

Trains the VGM with embodied data.

→

Stage 2 (Motus Pretraining)

Pretrains the Motus model with latent actions.

→

Stage 3 (Motus SFT)

Fine-tunes the Motus model on target-robot trajectories.

Stage	Data	Trained Components
Pretrained Foundation Models	Level 1: Web Data	VGM and VLM
Stage 1 (VGM Training)	Level 2: Egocentric Human Videos; Level 3: Synthetic Data; Level 5: Multi-Robot Task Trajectory	Only VGM
Stage 2 (Motus Pretraining)	Level 2: Egocentric Human Videos; Level 3: Synthetic Data; Level 4: Task-agnostic Data; Level 5: Multi-Robot Task Trajectory	Motus (all 3 experts, with latent actions)
Stage 3 (Motus SFT)	Level 6: Target-Robot Task Trajectory	Motus (all 3 experts, with actions)
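The table above can also be read as a small configuration. The sketch below restates it as plain Python data so that questions such as "which stages see Level 4 data?" can be answered programmatically. The level names and stage contents come from the table; the structure and the names `LEVELS`, `RECIPE`, and `stages_using` are hypothetical and are not taken from any released Motus code.

```python
# Data levels of the embodied data pyramid, as listed in the table.
LEVELS = {
    1: "Web Data",
    2: "Egocentric Human Videos",
    3: "Synthetic Data",
    4: "Task-agnostic Data",
    5: "Multi-Robot Task Trajectory",
    6: "Target-Robot Task Trajectory",
}

# Each entry: (stage name, data levels used, what is trained).
RECIPE = [
    ("Pretrained Foundation Models", (1,), "VGM and VLM"),
    ("Stage 1 (VGM Training)", (2, 3, 5), "Only VGM"),
    ("Stage 2 (Motus Pretraining)", (2, 3, 4, 5),
     "Motus (all 3 experts, with latent actions)"),
    ("Stage 3 (Motus SFT)", (6,), "Motus (all 3 experts, with actions)"),
]


def stages_using(level: int) -> list[str]:
    """Return the names of all stages whose data includes the given level."""
    return [name for name, levels, _ in RECIPE if level in levels]
```

For example, `stages_using(4)` returns only Stage 2, which matches the table: task-agnostic data enters once latent actions are being trained.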

Real-World Robot Experiments

Fold T-shirt

"Fold Black T-shirt Neatly." -- 1x speed

Fold Towel

"Fold Pink Lettered Towel Neatly." -- 1x speed

Fold Towel

"Fold Bear Pattern Towel Neatly." -- 1x speed

Grind Coffee Beans

"Grind coffee beans with grinder." -- 4x speed

Get Water

"Get water from water dispenser." -- 4x speed

Brew Coffee

"Brew coffee using drip coffee machine." -- 4x speed

Water the Flowers

"Pour water from kettle to flowers." -- 4x speed

Bake Bread

"Use the oven to bake bread." -- 4x speed

Touch Keyboard

"Touch keyboard for computer operation." -- 8x speed

Performance Highlights

Real-World Experiments

Task Description	$\pi_{0.5}$	w/o Pretrain	Motus
AC-One
Fold Towel	4%	1%	14.5%
Brew Coffee using Coffee Maker	0%	0%	62%
Get Water from Water Dispenser	30%	8%	36%
Place Cube into Plate	46%	60%	100%
Place Cube into Plate (OOD)	28.1%	18.8%	75%
Grind Coffee Beans with Grinder	8%	0%	92%
Pour Water from Kettle to Flowers	5%	5%	65%
Touch Instructed Keyboard	0%	100%	82.5%
Put Bread into Oven	12%	40%	42%
Average	14.79%	25.86%	63.22%
Agilex-Aloha-2
Fold Towel	27.5%	0%	39%
Get Water from Water Dispenser	62%	8%	96%
Pour Water from Kettle to Flowers	45%	40%	47.5%
Touch Instructed Keyboard	72.5%	85%	80%
Put Bread into Oven	36%	0%	34%
Average	48.60%	26.60%	59.30%
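The "Average" rows are plain arithmetic means over the per-task success rates. As a worked check, the sketch below recomputes the Agilex-Aloha-2 averages from the five task rows in the table above. The names `ALOHA_RESULTS` and `column_means` are hypothetical; the numbers are copied from the table.

```python
# Success rates (%) on Agilex-Aloha-2, in column order:
# (pi_0.5, w/o Pretrain, Motus)
ALOHA_RESULTS = {
    "Fold Towel": (27.5, 0.0, 39.0),
    "Get Water from Water Dispenser": (62.0, 8.0, 96.0),
    "Pour Water from Kettle to Flowers": (45.0, 40.0, 47.5),
    "Touch Instructed Keyboard": (72.5, 85.0, 80.0),
    "Put Bread into Oven": (36.0, 0.0, 34.0),
}


def column_means(rows):
    """Return the per-column mean success rate, rounded to two decimals."""
    columns = zip(*rows.values())
    return tuple(round(sum(col) / len(col), 2) for col in columns)


means = column_means(ALOHA_RESULTS)
print(means)  # (48.6, 26.6, 59.3)
```

These values agree with the reported 48.60%, 26.60%, and 59.30%.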

RoboTwin 2.0 Simulation (Randomized)

Simulation Task	$\pi_{0.5}$	X-VLA	w/o Pretrain	Stage1	Motus
Place Dual Shoes	7%	88%	80%	94%	87%
Move Stapler Pad	18%	73%	37%	68%	85%
Stack Blocks Two	56%	87%	94%	99%	98%
Scan Object	38%	36%	50%	69%	66%
Place Object Stand	65%	88%	93%	96%	97%
Place Fan	36%	75%	85%	85%	87%
Move Pillbottle Pad	29%	71%	83%	90%	96%
Pick Dual Bottles	6%	36%	68%	17%	90%
Blocks Ranking RGB	35%	83%	88%	98%	97%
......(50 tasks)	...
Turn Switch	6%	61%	60%	64%	78%
Pick Diverse Bottles	3%	36%	62%	18%	91%
Place Bread Basket	56%	71%	83%	87%	94%
Stack Blocks Three	16%	10%	76%	95%	95%
Put Bottles Dustbin	9%	77%	33%	24%	79%
Place Can Basket	25%	52%	62%	55%	76%
Stamp Seal	23%	82%	88%	95%	92%
Hanging Mug	3%	27%	10%	25%	38%
Handover Block	19%	37%	15%	55%	73%
Stack Bowls Three	35%	86%	74%	83%	87%
Place Object Basket	36%	39%	75%	80%	87%
Open Microwave	37%	71%	82%	84%	91%
Average	43.84%	72.84%	77.00%	81.86%	87.02%

Citation

@misc{bi2025motusunifiedlatentaction,
  title={Motus: A Unified Latent Action World Model},
  author={Hongzhe Bi and Hengkai Tan and Shenghao Xie and Zeyuan Wang and Shuhe Huang and Haitian Liu and Ruowen Zhao and Yao Feng and Chendong Xiang and Yinze Rong and Hongyan Zhao and Hanyu Liu and Zhizhong Su and Lei Ma and Hang Su and Jun Zhu},
  year={2025},
  eprint={2512.13030},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.13030},
}