MDD: Masked Deconstructed Diffusion for 3D Human Motion Generation from Text

AIxVR 2025

Jia Chen, Fangze Liu, Yingying Wang.

McMaster University, Canada


We present MDD (Masked Deconstructed Diffusion), a framework for generating high-fidelity 3D human motions from textual descriptions. MDD employs a multi-stage Kinematic Chain Quantization (KCQ) that encodes motion sequences into a compact yet expressive codebook by capturing both local and global human kinematic structures. A Masked Diffusion Transformer (MDT) then uses this codebook to iteratively refine the motion sequence through masked token prediction and a deconstructed diffusion process. By aligning the token prediction with the denoising process, our method balances generation quality against computational efficiency. Extensive evaluations on multiple established benchmarks show that MDD consistently outperforms state-of-the-art methods in precision and semantic accuracy while achieving faster inference. We validate the generalizability of the generated motions in a virtual reality (VR) environment built in Unity3D, demonstrating the framework's effectiveness in VR applications.
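
To make the idea of iterative refinement through masked token prediction more concrete, here is a minimal sketch of generic confidence-based unmasking over a sequence of discrete codebook indices. It is an illustration under stated assumptions, not the MDD implementation: the predictor is a random stand-in for a trained, text-conditioned transformer, the cosine masking schedule is a common choice that the abstract does not specify, the diffusion component is not modeled at all, and the names `MASK`, `toy_predictor`, and `iterative_unmask` are hypothetical.

```python
import math
import random

MASK = -1  # hypothetical sentinel marking a position that is still masked


def toy_predictor(tokens, codebook_size, rng):
    """Stand-in for a trained model (assumption, not MDT).

    Returns one (token_index, confidence) pair per position. A real model
    would condition on the text prompt and on the already committed tokens.
    """
    return [(rng.randrange(codebook_size), rng.random()) for _ in tokens]


def iterative_unmask(seq_len, codebook_size, steps, seed=0):
    """Fill a fully masked token sequence over a fixed number of steps.

    Each step predicts every position, then commits the most confident
    predictions among the masked positions. A cosine schedule (assumed here)
    decides how many positions may stay masked after each step.
    """
    rng = random.Random(seed)
    tokens = [MASK] * seq_len
    for step in range(steps):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_predictor(tokens, codebook_size, rng)
        # Number of positions allowed to remain masked after this step;
        # this reaches 0 on the final step, so every position gets filled.
        keep_masked = int(seq_len * math.cos(math.pi / 2 * (step + 1) / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the highest-confidence predictions first.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens


if __name__ == "__main__":
    print(iterative_unmask(seq_len=16, codebook_size=512, steps=8))
```

In a pipeline of the kind the abstract describes, the resulting indices would be looked up in the learned motion codebook and decoded back into a pose sequence; that decoding step is omitted here because the abstract gives no details about it.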