MDD: Masked Deconstructed Diffusion for 3D Human Motion Generation from Text


We present MDD (Masked Deconstructed Diffusion), a novel framework for generating high-fidelity 3D human motions from textual descriptions. Our MDD framework employs a multi-stage Kinematic Chain Quantization (KCQ) that effectively encodes motion sequences into a compact yet expressive codebook by capturing both local and global human kinematic structures. This codebook is then leveraged by a Masked Diffusion Transformer (MDT), which iteratively refines the motion sequence through masked token prediction and a deconstructed diffusion process. By aligning the prediction with the denoising process, our method strikes an optimal balance between generation quality and computational efficiency. Extensive evaluations on multiple established benchmarks demonstrate that MDD consistently outperforms state-of-the-art methods in terms of precision and semantic accuracy, while achieving superior inference speed. The generalizability of our generated motions is validated in a virtual reality (VR) environment built in Unity3D, showcasing the effectiveness of our framework in VR applications.
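To make the decoding procedure more concrete, below is a minimal, hypothetical sketch of a confidence-based iterative masked token prediction loop of the kind the MDT performs over KCQ codebook tokens. All names (predict_logits, MASK_ID, the codebook size, sequence length, and the cosine re-masking schedule) are illustrative assumptions, not the released MDD implementation.

```python
# Illustrative sketch of iterative masked token prediction over a motion
# codebook. Hypothetical placeholders throughout; not the official MDD code.
import math
import torch
import torch.nn.functional as F

MASK_ID = 512          # hypothetical id reserved for the [MASK] token
CODEBOOK_SIZE = 512    # hypothetical number of motion codebook entries
SEQ_LEN = 49           # hypothetical number of motion tokens per sequence


def predict_logits(tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Stand-in for the transformer: returns logits over the codebook for
    every position, conditioned on the text embedding (random here)."""
    batch, length = tokens.shape
    return torch.randn(batch, length, CODEBOOK_SIZE)


@torch.no_grad()
def generate(text_emb: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Start fully masked; at each step predict all tokens, keep the most
    confident ones, and re-mask the rest on a shrinking cosine schedule."""
    tokens = torch.full((1, SEQ_LEN), MASK_ID, dtype=torch.long)
    for step in range(steps):
        logits = predict_logits(tokens, text_emb)
        probs = F.softmax(logits, dim=-1)
        confidence, sampled = probs.max(dim=-1)

        # Only positions that are still masked are candidates for update;
        # already-committed tokens get infinite confidence so they survive.
        still_masked = tokens == MASK_ID
        confidence = torch.where(still_masked, confidence,
                                 torch.full_like(confidence, float("inf")))

        # Cosine schedule: the fraction of tokens left masked shrinks each step.
        mask_ratio = math.cos((step + 1) / steps * math.pi / 2)
        num_masked = int(mask_ratio * SEQ_LEN)

        # Commit every prediction, then re-mask the least confident positions.
        tokens = torch.where(still_masked, sampled, tokens)
        if num_masked > 0:
            lowest = confidence.topk(num_masked, largest=False).indices
            tokens.scatter_(1, lowest, MASK_ID)
    return tokens


motion_tokens = generate(text_emb=torch.randn(1, 256))
print(motion_tokens.shape)  # (1, SEQ_LEN) tensor of codebook indices
```

In the actual framework, the refined token sequence would then be decoded back into a 3D motion through the learned codebook; the sketch only shows the masked-prediction loop itself.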

Approach Overview



Gallery of Generation

Comparisons


We qualitatively compare our method with MoMask, T2M-GPT, MLD, and MDM. Our approach achieves more precise motion generation. For example, in the first case, both MLD and MoMask fail to capture the detail "breaks into a running jump". The second case evaluates the ability to handle long prompts, where MDM and MLD exhibit missing actions or unnecessary turns, and MoMask and T2M-GPT also struggle to maintain the "walk straight" instruction. In the third case, which involves less intense movement, our method again generates more precise motion.