A Multi-Level Alignment and Cross-Modal Unified Semantic Graph Refinement Network for Conversational Emotion Recognition

Xiaoheng Zhang; Weigang Cui; Bin Hu; Yang Li

doi:10.1109/TAFFC.2024.3354382

Research output: Contribution to journal › Article › peer-review

2 Citations (Scopus)

Abstract

Emotion recognition in conversation (ERC) based on multiple modalities has attracted enormous attention. However, most research simply concatenates multimodal representations, generally neglecting the impact of cross-modal correspondences and uncertain factors, which leads to cross-modal misalignment. Furthermore, recent methods consider only simple contextual features, commonly ignoring semantic clues and thus capturing semantic consistency insufficiently. To address these limitations, we propose a novel multi-level alignment and cross-modal unified semantic graph refinement network (MA-CMU-SGRNet) for the ERC task. Specifically, a multi-level alignment (MA) is first designed to bridge the gap between acoustic and lexical modalities, which can effectively contrast both the instance-level and prototype-level relationships, separating the multimodal features in the latent space. Second, a cross-modal uncertainty-aware unification (CMU) is adopted to generate a unified representation in joint space considering the ambiguity of emotion. Finally, a dual-encoding semantic graph refinement network (SGRNet) is investigated, which includes a syntactic encoder to aggregate information from near neighbors and a semantic encoder to focus on useful semantically close neighbors. Extensive experiments on three multimodal public datasets show the effectiveness of our proposed method compared with the state-of-the-art methods, indicating its potential application in conversational emotion recognition. The implementation code is available at https://github.com/zxiaohen/MA-CMU-SGRNet.

Original language	English
Pages (from-to)	1-13
Number of pages	13
Journal	IEEE Transactions on Affective Computing
DOIs	https://doi.org/10.1109/TAFFC.2024.3354382
Publication status	Accepted/In press - 2024
Externally published	Yes

Keywords

Context modeling
Emotion recognition
Self-supervised learning
Semantics
Syntactics
Task analysis
Uncertainty
cross-modal alignment
multimodal fusion
semantic refinement

Access to Document

10.1109/TAFFC.2024.3354382

Cite this

@article{4423db5e1da04a06a48344f68ab6878a,

title = "A Multi-Level Alignment and Cross-Modal Unified Semantic Graph Refinement Network for Conversational Emotion Recognition",

abstract = "Emotion recognition in conversation (ERC) based on multiple modalities has attracted enormous attention. However, most research simply concatenated multimodal representations, generally neglecting the impact of cross-modal correspondences and uncertain factors, and leading to the cross-modal misalignment problems. Furthermore, recent methods only considered simple contextual features, commonly ignoring semantic clues and resulting in an insufficient capture of the semantic consistency. To address these limitations, we propose a novel multi-level alignment and cross-modal unified semantic graph refinement network (MA-CMU-SGRNet) for ERC task. Specifically, a multi-level alignment (MA) is first designed to bridge the gap between acoustic and lexical modalities, which can effectively contrast both the instance-level and prototype-level relationships, separating the multimodal features in the latent space. Second, a cross-modal uncertainty-aware unification (CMU) is adopted to generate a unified representation in joint space considering the ambiguity of emotion. Finally, a dual-encoding semantic graph refinement network (SGRNet) is investigated, which includes a syntactic encoder to aggregate information from near neighbors and a semantic encoder to focus on useful semantically close neighbors. Extensive experiments on three multimodal public datasets show the effectiveness of our proposed method compared with the state-of-the-art methods, indicating its potential application in conversational emotion recognition. Implementation codes can be available at https://github.com/zxiaohen/MA-CMU-SGRNet.",

keywords = "Context modeling, Emotion recognition, Self-supervised learning, Semantics, Syntactics, Task analysis, Uncertainty, cross-modal alignment, multimodal fusion, semantic refinement",

author = "Xiaoheng Zhang and Weigang Cui and Bin Hu and Yang Li",

note = "Publisher Copyright: IEEE",

year = "2024",

doi = "10.1109/TAFFC.2024.3354382",

language = "English",

pages = "1--13",

journal = "IEEE Transactions on Affective Computing",

issn = "1949-3045",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - A Multi-Level Alignment and Cross-Modal Unified Semantic Graph Refinement Network for Conversational Emotion Recognition

AU - Zhang, Xiaoheng

AU - Cui, Weigang

AU - Hu, Bin

AU - Li, Yang

N1 - Publisher Copyright: IEEE

PY - 2024

Y1 - 2024

AB - Emotion recognition in conversation (ERC) based on multiple modalities has attracted enormous attention. However, most research simply concatenated multimodal representations, generally neglecting the impact of cross-modal correspondences and uncertain factors, and leading to the cross-modal misalignment problems. Furthermore, recent methods only considered simple contextual features, commonly ignoring semantic clues and resulting in an insufficient capture of the semantic consistency. To address these limitations, we propose a novel multi-level alignment and cross-modal unified semantic graph refinement network (MA-CMU-SGRNet) for ERC task. Specifically, a multi-level alignment (MA) is first designed to bridge the gap between acoustic and lexical modalities, which can effectively contrast both the instance-level and prototype-level relationships, separating the multimodal features in the latent space. Second, a cross-modal uncertainty-aware unification (CMU) is adopted to generate a unified representation in joint space considering the ambiguity of emotion. Finally, a dual-encoding semantic graph refinement network (SGRNet) is investigated, which includes a syntactic encoder to aggregate information from near neighbors and a semantic encoder to focus on useful semantically close neighbors. Extensive experiments on three multimodal public datasets show the effectiveness of our proposed method compared with the state-of-the-art methods, indicating its potential application in conversational emotion recognition. Implementation codes can be available at https://github.com/zxiaohen/MA-CMU-SGRNet.

KW - Context modeling

KW - Emotion recognition

KW - Self-supervised learning

KW - Semantics

KW - Syntactics

KW - Task analysis

KW - Uncertainty

KW - cross-modal alignment

KW - multimodal fusion

KW - semantic refinement

UR - http://www.scopus.com/inward/record.url?scp=85184307889&partnerID=8YFLogxK

U2 - 10.1109/TAFFC.2024.3354382

DO - 10.1109/TAFFC.2024.3354382

M3 - Article

AN - SCOPUS:85184307889

SN - 1949-3045

SP - 1

EP - 13

JO - IEEE Transactions on Affective Computing

JF - IEEE Transactions on Affective Computing

ER -
