Unified Number-Free Text-to-Motion Generation Via Flow Matching
Abstract
Generative models excel at motion synthesis for a fixed number of agents but struggle to generalize to a variable number of agents. Trained on limited, domain-specific data, existing methods employ autoregressive models to generate motion recursively, suffering from inefficiency and error accumulation. We propose Unified Motion Flow (UMF), which consists of Pyramid Motion Flow (P-Flow) and Semi-Noise Motion Flow (S-Flow). UMF decomposes number-free motion generation into a single-pass motion prior generation stage and multi-pass reaction generation stages. Specifically, UMF utilizes a unified latent space to bridge the distribution gap between heterogeneous motion datasets, enabling effective unified training. For motion prior generation, P-Flow operates on hierarchical resolutions conditioned on the noise level, reducing computational overhead. For reaction generation, S-Flow learns a joint probabilistic path that adaptively balances reaction transformation and context reconstruction, alleviating error accumulation. Extensive results and user studies demonstrate UMF's effectiveness as a generalist model for multi-person motion generation from text. We will release the code.
Key Contributions
Figure 1. (a) Standard methods are restricted to a fixed number of agents. (b) Autoregressive methods decouple generation into a motion prior and subsequent reaction guided by a conditioning network. (c) Our UMF leverages a heterogeneous motion prior as the adaptive starting point of the reaction flow path, mitigating error accumulation.
🔗 Unified Motion Flow (UMF)
A generalist framework for number-free text-to-motion generation. UMF's core design unifies heterogeneous single-person (HumanML3D) and multi-person (InterHuman) datasets within a multi-token latent space.
🔺 Pyramid Motion Flow (P-Flow)
For efficient individual motion synthesis, P-Flow operates on hierarchical resolutions conditioned on the noise level, reducing the computational overhead of multi-token representations while maintaining high-fidelity generation.
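The idea of tying latent resolution to the noise level can be illustrated with a minimal PyTorch sketch. This is an assumption-laden illustration, not the paper's implementation: `velocity_net`, the switch threshold `t_switch`, and the downsampling factor `k` are hypothetical names chosen here for clarity.

```python
import torch
import torch.nn.functional as F

def pflow_step(latent, t, velocity_net, t_switch=0.5, k=4):
    """One velocity evaluation of a pyramid-style flow (illustrative sketch).

    At early timesteps (high noise), the latent is downsampled before the
    network call, so token count (and attention cost) drops to roughly 1/k;
    at later timesteps the full resolution is used for fidelity.
    `velocity_net`, `t_switch`, and `k` are hypothetical parameters.
    """
    if t < t_switch:
        # Coarse phase: predict velocity on a shorter latent sequence.
        coarse = F.interpolate(latent, scale_factor=1.0 / k, mode="linear")
        v = velocity_net(coarse, t)
        # Upsample the predicted velocity back to the full sequence length.
        v = F.interpolate(v, size=latent.shape[-1], mode="linear")
    else:
        # Fine phase: full-resolution prediction.
        v = velocity_net(latent, t)
    return v
```

The appeal of this schedule is that the expensive full-resolution passes are reserved for the low-noise timesteps, where fine detail actually matters.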
🌊 Semi-Noise Motion Flow (S-Flow)
For reaction and interaction synthesis, S-Flow learns a joint probabilistic path by balancing reaction transformation and context reconstruction, thereby alleviating error accumulation in autoregressive generation.
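One way to read "joint probabilistic path" is that the context agent's path starts near its (already generated) motion while the reaction path starts from pure noise, so one model simultaneously reconstructs context and synthesizes the reaction. The sketch below builds such a training pair in a rectified-flow style; `alpha` and all names are hypothetical, and the exact path construction in the paper may differ.

```python
import torch

def sflow_training_pair(context, reaction, t, alpha=0.7):
    """Build a semi-noise flow-matching pair (illustrative sketch).

    The context latent starts from a partially noised copy of itself
    ("semi-noise"), while the reaction latent starts from pure noise.
    The model thus jointly learns context reconstruction and
    reaction transformation. `alpha` (context signal level) is a
    hypothetical knob, not a parameter from the paper.
    """
    eps_c, eps_r = torch.randn_like(context), torch.randn_like(reaction)
    # Reaction path: pure noise -> reaction motion.
    x_r = (1 - t) * eps_r + t * reaction
    # Context path: semi-noised context -> clean context.
    start_c = alpha * context + (1 - alpha) * eps_c
    x_c = (1 - t) * start_c + t * context
    # Flow-matching velocity targets, dx/dt along each linear path.
    v_r = reaction - eps_r
    v_c = context - start_c
    return torch.cat([x_c, x_r], dim=1), torch.cat([v_c, v_r], dim=1)
```

Because the context path has far less noise to remove, the model can spend its capacity on the reaction while only lightly correcting the context, which is the intuition behind alleviating autoregressive error accumulation.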
🏆 State-of-the-Art Results
UMF achieves SOTA performance on multi-person generation benchmarks (FID 4.772 on InterHuman). A user study validates UMF's zero-shot generalization to unseen crowd scenarios (N > 2).
Method Overview
Figure 2. Overview of the UMF architecture. (A) Unified Motion VAE: Encodes heterogeneous motions (HumanML3D, InterHuman) into a regularized multi-token latent space, bridging domain gaps between datasets. (B) P-Flow: Synthesizes the individual motion prior hierarchically — processing low-resolution latents at early timesteps and full-resolution at later timesteps, reducing computation by ≈1/K. (C) S-Flow: Generates reactions by jointly learning context reconstruction and reaction transformation paths, alleviating error accumulation. Applied autoregressively for N > 2 agents.
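The single-pass prior plus multi-pass reaction decomposition described above can be sketched as a short rollout loop. The sampler callables here are placeholders standing in for P-Flow and S-Flow inference; their signatures are assumptions for illustration only.

```python
def generate_crowd(n_agents, text, sample_prior, sample_reaction):
    """Autoregressive rollout for N agents (illustrative structure).

    `sample_prior(text)` stands in for a single P-Flow pass producing the
    motion prior; `sample_reaction(text, context)` stands in for an S-Flow
    pass conditioned on all previously generated agents. Both signatures
    are hypothetical.
    """
    motions = [sample_prior(text)]          # single-pass motion prior
    for _ in range(n_agents - 1):           # multi-pass reaction stages
        motions.append(sample_reaction(text, motions))
    return motions
```

This structure makes the zero-shot N > 2 claim concrete: nothing in the loop fixes the agent count, so the same two trained flows extend to crowds by adding reaction passes.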
Quantitative Results
UMF substantially outperforms the generalist baseline FreeMotion, improving Top-3 R-Precision by 28% and reducing FID by 29%. Against specialist methods, UMF achieves the best FID score.
| Method | R Top-3 ↑ | FID ↓ | MM Dist ↓ | Diversity → |
|---|---|---|---|---|
| Ground Truth | 0.701 | 0.273 | 3.755 | 7.948 |
| InterGen | 0.624 | 5.918 | 5.108 | 7.387 |
| TIMotion | 0.724 | 5.433 | 3.775 | 8.032 |
| InterMask | 0.683 | 5.154 | 3.790 | 7.944 |
| FreeMotion | 0.544 | 6.740 | 3.848 | 7.828 |
| UMF (Ours) | 0.694 | 4.772 | 3.784 | 8.039 |
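The relative improvements over FreeMotion quoted above follow directly from the table:

```python
# Values taken from the results table above.
free_top3, umf_top3 = 0.544, 0.694
free_fid, umf_fid = 6.740, 4.772

r_gain = (umf_top3 - free_top3) / free_top3   # relative R-Precision gain
fid_drop = (free_fid - umf_fid) / free_fid    # relative FID reduction
print(f"{r_gain:.1%} Top-3 gain, {fid_drop:.1%} FID reduction")
```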
Qualitative Results
In-domain Motion Generation
Zero-Shot Multi-Agent Generation
BibTeX
@misc{huang2026unifiednumberfreetexttomotiongeneration,
title={Unified Number-Free Text-to-Motion Generation Via Flow Matching},
author={Guanhe Huang and Oya Celiktutan},
year={2026},
eprint={2603.27040},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2603.27040},
}