TY - GEN
T1 - SARGes
T2 - Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents, GENEA Workshop 2025
AU - Gao, Nan
AU - Bao, Yihua
AU - Weng, Dongdong
AU - Zhao, Jiayi
AU - Li, Jia
AU - Zhou, Yan
AU - Wan, Pengfei
N1 - Publisher Copyright:
© 2025 Owner/Author.
PY - 2025/10/26
Y1 - 2025/10/26
AB - Co-speech gesture generation enhances the realism of human-computer interaction through speech-synchronized gesture synthesis. However, generating semantically meaningful gestures remains a challenging problem. We propose SARGes, a novel framework that leverages large language models (LLMs) to construct an intent chain for parsing speech content and generating reliable semantic gesture labels, which subsequently guide the synthesis of meaningful co-speech gestures. First, we construct a comprehensive co-speech gesture ethogram and develop an LLM-based intent-chain reasoning mechanism that systematically decomposes gesture semantics into structured inference steps following the ethogram criteria, effectively guiding LLMs to infer context-aware gesture labels. We then construct a text-to-gesture-label dataset and train a lightweight gesture-label generation model, which guides the generation of credible and semantically coherent co-speech gestures. Experimental results show that SARGes achieves gesture-labeling performance comparable to GPT-4 in intent interpretation, with efficient single-pass inference (0.4 seconds), while significantly improving the semantic expressiveness of the generated gestures.
KW - co-speech gesture generation
KW - gesture ethogram
KW - intent chain
KW - large language models
UR - https://www.scopus.com/pages/publications/105023651251
U2 - 10.1145/3746268.3759436
DO - 10.1145/3746268.3759436
M3 - Conference contribution
AN - SCOPUS:105023651251
T3 - GENEA 2025 - Proceedings of the International Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents, co-located with MM 2025
SP - 13
EP - 21
BT - GENEA 2025 - Proceedings of the International Workshop on Generation and Evaluation of Non-verbal Behaviour for Embodied Agents, co-located with MM 2025
PB - Association for Computing Machinery, Inc
Y2 - 31 October 2025 through 31 October 2025
ER -