TY - JOUR
T1 - Multi-Modal Emotion Recognition With Graph Reinforcement Representation Network for Human-Robot Interaction
AU - Chen, Dan
AU - Liu, Zhen Tao
AU - She, Jinhua
AU - Hirota, Kaoru
AU - Kawata, Seiichi
N1 - Publisher Copyright: © 1999-2012 IEEE.
PY - 2026
Y1 - 2026
AB - Emotion recognition in conversation is essential for achieving intelligent human-robot interaction (HRI). Accurate emotional understanding allows robots to engage in more natural and context-aware interactions with users. A novel graph reinforcement representation architecture (RGATN) is proposed for emotion recognition in HRI. The RGATN analyzes multi-modal conversational information to infer users' emotional states during interaction. Specifically, the architecture integrates a residual graph network (Res-GN) and a cross-modal graph channel attention network (CM-GCA). The Res-GN is proposed to represent information efficiently by reducing redundancy in fully connected graphs while incorporating all potential connectivity relationships. To address inconsistencies in the quality of different modalities in HRI scenarios, the CM-GCA mechanism is presented. This mechanism preserves information from each modality while reconstructing and enhancing the overall graph representation by leveraging adaptive cross-modal channel attention. The proposed method improves emotion recognition accuracy and robustness in HRI. We evaluated the method on three benchmark datasets, IEMOCAP, MELD, and M³ED, achieving weighted F1-scores of 72.07%, 67.99%, and 53.16%, respectively. Additionally, preliminary application experiments conducted on a self-built database demonstrated a recognition accuracy of 78.75%. The results highlight the method's ability to adapt effectively to inconsistent modal quality, further confirming its effectiveness in real-world HRI scenarios.
KW - Multi-modal fusion
KW - cross-modal representation
KW - emotion recognition in conversation
KW - graph-based learning
KW - human-robot interaction
UR - https://www.scopus.com/pages/publications/105031646271
U2 - 10.1109/TMM.2026.3668695
DO - 10.1109/TMM.2026.3668695
M3 - Article
AN - SCOPUS:105031646271
SN - 1520-9210
JO - IEEE Transactions on Multimedia
JF - IEEE Transactions on Multimedia
ER -