Move as You Say, Interact as You Can: Language-Guided Human Motion Generation with Scene Affordance

Zan Wang, Yixin Chen, Baoxiong Jia, Puhao Li, Jinlu Zhang, Jingze Zhang, Tengyu Liu, Yixin Zhu*, Wei Liang*, Siyuan Huang*

*Corresponding authors for this work

Research output: Contribution to journal › Conference article › peer-review

9 Citations (Scopus)

Abstract

Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language, 3D scenes, and human motion, and (ii) the generative models' intensive data requirements contrasted with the scarcity of comprehensive, high-quality, language-scene-motion datasets. To tackle these issues, we introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation. Our framework comprises an Affordance Diffusion Model (ADM) for predicting explicit affordance maps and an Affordance-to-Motion Diffusion Model (AMDM) for generating plausible human motions. By leveraging scene affordance maps, our method overcomes the difficulty of generating human motion under multimodal condition signals, especially when training with limited data lacking extensive language-scene-motion pairs. Our extensive experiments demonstrate that our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE. Additionally, we validate our model's exceptional generalization capabilities on a specially curated evaluation set featuring previously unseen descriptions and scenes.
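The abstract describes a two-stage pipeline: the ADM first grounds the language instruction in the 3D scene as an explicit per-point affordance map, and the AMDM then generates motion conditioned on that map rather than on the raw scene-language pairing. The sketch below illustrates only this data flow; the `DenoiserStub` class, all tensor shapes, the pooled conditioning features, and the simplified denoising loop are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the two-stage pipeline described in the abstract.
# All names, dimensions, and sampling details are hypothetical.
import torch
import torch.nn as nn

POSE_DIM, TEXT_DIM, NUM_FRAMES, NUM_POINTS = 66, 32, 60, 1024  # assumed sizes

class DenoiserStub(nn.Module):
    """Stand-in for either diffusion denoiser; a real model would use
    transformer / point-cloud backbones, not a single linear layer."""
    def __init__(self, x_dim, cond_dim):
        super().__init__()
        self.net = nn.Linear(x_dim + cond_dim + 1, x_dim)

    def denoise_step(self, x, cond, t):
        # One illustrative reverse step: predict a cleaner x from (x, cond, t).
        t_feat = torch.full((x.shape[0], 1), float(t))
        return self.net(torch.cat([x, cond.expand(x.shape[0], -1), t_feat], dim=-1))

def generate_motion(scene_points, text_emb, adm, amdm, steps=50):
    # Stage 1 (ADM): sample an explicit per-point affordance map from noise,
    # grounding the language instruction in the 3D scene.
    affordance = torch.randn(NUM_POINTS, 1)
    scene_feat = scene_points.mean(dim=0, keepdim=True)  # toy global scene feature
    cond1 = torch.cat([scene_feat, text_emb], dim=-1)
    for t in reversed(range(steps)):
        affordance = adm.denoise_step(affordance, cond1, t)

    # Stage 2 (AMDM): sample the motion sequence conditioned on language plus
    # the affordance map, instead of jointly on language, scene, and motion.
    motion = torch.randn(NUM_FRAMES, POSE_DIM)
    aff_feat = affordance.mean(dim=0, keepdim=True)  # toy pooled affordance feature
    cond2 = torch.cat([aff_feat, text_emb], dim=-1)
    for t in reversed(range(steps)):
        motion = amdm.denoise_step(motion, cond2, t)
    return motion

adm = DenoiserStub(x_dim=1, cond_dim=3 + TEXT_DIM)
amdm = DenoiserStub(x_dim=POSE_DIM, cond_dim=1 + TEXT_DIM)
motion = generate_motion(torch.randn(NUM_POINTS, 3), torch.randn(1, TEXT_DIM), adm, amdm)
print(motion.shape)  # torch.Size([60, 66])
```

The point mirrored here is that the second stage never consumes the scene-language pairing directly: it conditions on the intermediate affordance map, which is what the abstract credits with easing the multimodal conditioning problem under scarce language-scene-motion data.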

Original language: English
Pages (from-to): 433-444
Number of pages: 12
Journal: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Publication status: Published - 2024
Event: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Seattle, United States
Duration: 16 Jun 2024 - 22 Jun 2024
