TY - JOUR
T1 - Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
AU - Wang, Zan
AU - Chen, Yixin
AU - Jia, Baoxiong
AU - Li, Puhao
AU - Zhang, Jinlu
AU - Zhang, Jingze
AU - Liu, Tengyu
AU - Zhu, Yixin
AU - Liang, Wei
AU - Huang, Siyuan
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language, 3D scenes, and human motion, and (ii) the generative models' intensive data requirements contrasted with the scarcity of comprehensive, high-quality, language-scene-motion datasets. To tackle these issues, we introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation. Our framework comprises an Affordance Diffusion Model (ADM) for predicting explicit affordance maps and an Affordance-to-Motion Diffusion Model (AMDM) for generating plausible human motions. By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals, especially when training with limited data lacking extensive language-scene-motion pairs. Our extensive experiments demonstrate that our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE. Additionally, we validate our model's exceptional generalization capabilities on a specially curated evaluation set featuring previously unseen descriptions and scenes.
AB - Despite significant advancements in text-to-motion synthesis, generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language, 3D scenes, and human motion, and (ii) the generative models' intensive data requirements contrasted with the scarcity of comprehensive, high-quality, language-scene-motion datasets. To tackle these issues, we introduce a novel two-stage framework that employs scene affordance as an intermediate representation, effectively linking 3D scene grounding and conditional motion generation. Our framework comprises an Affordance Diffusion Model (ADM) for predicting explicit affordance maps and an Affordance-to-Motion Diffusion Model (AMDM) for generating plausible human motions. By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals, especially when training with limited data lacking extensive language-scene-motion pairs. Our extensive experiments demonstrate that our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE. Additionally, we validate our model's exceptional generalization capabilities on a specially curated evaluation set featuring previously unseen descriptions and scenes.
UR - http://www.scopus.com/inward/record.url?scp=85218419460&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.00049
DO - 10.1109/CVPR52733.2024.00049
M3 - Conference article
AN - SCOPUS:85218419460
SN - 1063-6919
SP - 433
EP - 444
JO - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
JF - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Y2 - 16 June 2024 through 22 June 2024
ER -