TY - GEN
T1 - An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism
AU - Li, Yutong
AU - Wang, Juan
AU - Liu, Zhenyu
AU - Zhou, Li
AU - Zhang, Haibo
AU - Tang, Cheng
AU - Hu, Xiping
AU - Hu, Bin
N1 - Publisher Copyright:
© 2024, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
PY - 2024
Y1 - 2024
AB - Audio-visual multimodal depression detection has gained significant attention as a computer-aided detection tool due to its efficiency and convenience, and has shown promising performance. In this paper, we propose a cross-modal fusion network based on multi-head attention and residual structures (CMAFN) for depression recognition. CMAFN consists of three core modules: the Local Temporal Feature Extract Block (LTF), the Cross-Modal Fusion Block (CFB), and the Multi-Head Temporal Attention Block (MTB). The LTF module performs feature extraction and encodes temporal information for the audio and video modalities separately, while the CFB module facilitates complementary learning between the modalities. The MTB module accounts for the temporal influence of all modalities on each unimodal branch. With these three well-designed modules, CMAFN refines inter-modality complementarity and intra-modality temporal dependencies, achieving interaction between unimodal branches and an adaptive balance between modalities. Evaluation results on the widely used depression datasets AVEC2013 and AVEC2014 demonstrate that the proposed CMAFN method outperforms state-of-the-art approaches for depression recognition. These results highlight the potential of CMAFN as an effective tool for the early detection and diagnosis of depression.
KW - Automatic detection
KW - Depression
KW - Multi-modal fusion
KW - Multimodal depression detection
UR - http://www.scopus.com/inward/record.url?scp=85180752139&partnerID=8YFLogxK
U2 - 10.1007/978-981-99-8469-5_20
DO - 10.1007/978-981-99-8469-5_20
M3 - Conference contribution
AN - SCOPUS:85180752139
SN - 9789819984688
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 252
EP - 264
BT - Pattern Recognition and Computer Vision - 6th Chinese Conference, PRCV 2023, Proceedings
A2 - Liu, Qingshan
A2 - Wang, Hanzi
A2 - Ji, Rongrong
A2 - Ma, Zhanyu
A2 - Zheng, Weishi
A2 - Zha, Hongbin
A2 - Chen, Xilin
A2 - Wang, Liang
PB - Springer Science and Business Media Deutschland GmbH
T2 - 6th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2023
Y2 - 13 October 2023 through 15 October 2023
ER -