An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism

Yutong Li; Juan Wang; Zhenyu Liu; Li Zhou; Haibo Zhang; Cheng Tang; Xiping Hu; Bin Hu

doi:10.1007/978-981-99-8469-5_20

Corresponding author: Zhenyu Liu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Audio-visual based multimodal depression detection has gained significant attention due to its high efficiency and convenience as a computer-aided detection tool, resulting in promising performance. In this paper, we propose a cross-modal fusion network based on multi-head attention and residual structures (CMAFN) for depression recognition. CMAFN consists of three core modules: the Local Temporal Feature Extract Block (LTF), the Cross-Model Fusion Block (CFB), and the Multi-Head Temporal Attention Block (MTB). The LTF module performs feature extraction and encodes temporal information for audio and video modalities separately, while the CFB module facilitates complementary learning between the modalities. The MTB module accounts for the temporal influence of all modalities on each unimodal branch. With the incorporation of the three well-designed modules, CMAFN can refine the inter-modality complementarity and intra-modality temporal dependencies, achieving the interaction between unimodal branches and adaptive balance between modalities. Evaluation results on widely used depression datasets, AVEC2013 and AVEC2014, demonstrate that the proposed CMAFN method outperforms state-of-the-art approaches for depression recognition tasks. The results highlight the potential of CMAFN as an effective tool for the early detection and diagnosis of depression.
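The abstract does not give the internals of the CFB or MTB modules, so the following is only a minimal sketch of the general idea it names: multi-head attention from one modality's frame sequence to the other's, with a residual connection that preserves the unimodal signal. The function names (`cross_modal_block`, `attend`), the toy feature vectors, and the use of identity projections in place of learned query/key/value weights are all illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention for one head; inputs are lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def cross_modal_block(target, source, num_heads=2):
    """Multi-head cross-attention from `target` frames to `source` frames,
    followed by a residual connection back to `target`.

    Assumption: identity projections stand in for learned Q/K/V weights.
    """
    dim = len(target[0])
    if dim % num_heads != 0 or len(source[0]) != dim:
        raise ValueError("feature sizes must match and divide evenly into heads")
    head_dim = dim // num_heads
    fused = [[] for _ in target]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = [frame[lo:hi] for frame in target]
        kv = [frame[lo:hi] for frame in source]
        for row, head_out in zip(fused, attend(q, kv, kv)):
            row.extend(head_out)
    # Residual: keep the unimodal signal and add the cross-modal context.
    return [[t + f for t, f in zip(t_row, f_row)]
            for t_row, f_row in zip(target, fused)]


# Toy sequences (hypothetical values): 3 audio frames, 2 video frames, 4 features each.
audio = [[1.0, 0.0, 0.5, 0.5], [0.0, 1.0, 0.5, 0.5], [1.0, 1.0, 0.0, 0.0]]
video = [[0.2, 0.8, 1.0, 0.0], [0.9, 0.1, 0.0, 1.0]]
audio_fused = cross_modal_block(audio, video)  # audio branch attends to video
video_fused = cross_modal_block(video, audio)  # video branch attends to audio
```

Each branch keeps its own sequence length, so the two modalities do not need to be sampled at the same frame rate; only the feature size has to match in this sketch.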

Original language	English
Title of host publication	Pattern Recognition and Computer Vision - 6th Chinese Conference, PRCV 2023, Proceedings
Editors	Qingshan Liu, Hanzi Wang, Rongrong Ji, Zhanyu Ma, Weishi Zheng, Hongbin Zha, Xilin Chen, Liang Wang
Publisher	Springer Science and Business Media Deutschland GmbH
Pages	252-264
Number of pages	13
ISBN (Print)	9789819984688
DOIs	https://doi.org/10.1007/978-981-99-8469-5_20
Publication status	Published - 2024
Externally published	Yes
Event	6th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2023 - Xiamen, China Duration: 13 Oct 2023 → 15 Oct 2023

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	14429 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	6th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2023
Country/Territory	China
City	Xiamen
Period	13/10/23 → 15/10/23

Keywords

Automatic detection
Depression
Multi-modal fusion
Multimodal depression detection

Cite this

Li, Y., Wang, J., Liu, Z., Zhou, L., Zhang, H., Tang, C., Hu, X., & Hu, B. (2024). An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism. In Q. Liu, H. Wang, R. Ji, Z. Ma, W. Zheng, H. Zha, X. Chen, & L. Wang (Eds.), Pattern Recognition and Computer Vision - 6th Chinese Conference, PRCV 2023, Proceedings (pp. 252-264). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14429 LNCS). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-99-8469-5_20

Li, Yutong ; Wang, Juan ; Liu, Zhenyu et al. / An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism. Pattern Recognition and Computer Vision - 6th Chinese Conference, PRCV 2023, Proceedings. editor / Qingshan Liu ; Hanzi Wang ; Rongrong Ji ; Zhanyu Ma ; Weishi Zheng ; Hongbin Zha ; Xilin Chen ; Liang Wang. Springer Science and Business Media Deutschland GmbH, 2024. pp. 252-264 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{e4023df4bcc1490ca7fac466e28b5dc7,
  title = "An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism",
  abstract = "Audio-visual based multimodal depression detection has gained significant attention due to its high efficiency and convenience as a computer-aided detection tool, resulting in promising performance. In this paper, we propose a cross-modal fusion network based on multi-head attention and residual structures (CMAFN) for depression recognition. CMAFN consists of three core modules: the Local Temporal Feature Extract Block (LTF), the Cross-Model Fusion Block (CFB), and the Multi-Head Temporal Attention Block (MTB). The LTF module performs feature extraction and encodes temporal information for audio and video modalities separately, while the CFB module facilitates complementary learning between the modalities. The MTB module accounts for the temporal influence of all modalities on each unimodal branch. With the incorporation of the three well-designed modules, CMAFN can refine the inter-modality complementarity and intra-modality temporal dependencies, achieving the interaction between unimodal branches and adaptive balance between modalities. Evaluation results on widely used depression datasets, AVEC2013 and AVEC2014, demonstrate that the proposed CMAFN method outperforms state-of-the-art approaches for depression recognition tasks. The results highlight the potential of CMAFN as an effective tool for the early detection and diagnosis of depression.",
  keywords = "Automatic detection, Depression, Multi-modal fusion, Multimodal depression detection",
  author = "Yutong Li and Juan Wang and Zhenyu Liu and Li Zhou and Haibo Zhang and Cheng Tang and Xiping Hu and Bin Hu",
  note = "Publisher Copyright: {\textcopyright} 2024, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.; 6th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2023 ; Conference date: 13-10-2023 Through 15-10-2023",
  year = "2024",
  doi = "10.1007/978-981-99-8469-5_20",
  language = "English",
  isbn = "9789819984688",
  series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
  publisher = "Springer Science and Business Media Deutschland GmbH",
  pages = "252--264",
  editor = "Qingshan Liu and Hanzi Wang and Rongrong Ji and Zhanyu Ma and Weishi Zheng and Hongbin Zha and Xilin Chen and Liang Wang",
  booktitle = "Pattern Recognition and Computer Vision - 6th Chinese Conference, PRCV 2023, Proceedings",
  address = "Germany",
}

Li, Y, Wang, J, Liu, Z, Zhou, L, Zhang, H, Tang, C, Hu, X & Hu, B 2024, An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism. in Q Liu, H Wang, R Ji, Z Ma, W Zheng, H Zha, X Chen & L Wang (eds), Pattern Recognition and Computer Vision - 6th Chinese Conference, PRCV 2023, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 14429 LNCS, Springer Science and Business Media Deutschland GmbH, pp. 252-264, 6th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2023, Xiamen, China, 13/10/23. https://doi.org/10.1007/978-981-99-8469-5_20

An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism. / Li, Yutong; Wang, Juan; Liu, Zhenyu et al.
Pattern Recognition and Computer Vision - 6th Chinese Conference, PRCV 2023, Proceedings. ed. / Qingshan Liu; Hanzi Wang; Rongrong Ji; Zhanyu Ma; Weishi Zheng; Hongbin Zha; Xilin Chen; Liang Wang. Springer Science and Business Media Deutschland GmbH, 2024. p. 252-264 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 14429 LNCS).

TY - GEN
T1 - An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism
AU - Li, Yutong
AU - Wang, Juan
AU - Liu, Zhenyu
AU - Zhou, Li
AU - Zhang, Haibo
AU - Tang, Cheng
AU - Hu, Xiping
AU - Hu, Bin
PY - 2024
Y1 - 2024

N2 - Audio-visual based multimodal depression detection has gained significant attention due to its high efficiency and convenience as a computer-aided detection tool, resulting in promising performance. In this paper, we propose a cross-modal fusion network based on multi-head attention and residual structures (CMAFN) for depression recognition. CMAFN consists of three core modules: the Local Temporal Feature Extract Block (LTF), the Cross-Model Fusion Block (CFB), and the Multi-Head Temporal Attention Block (MTB). The LTF module performs feature extraction and encodes temporal information for audio and video modalities separately, while the CFB module facilitates complementary learning between the modalities. The MTB module accounts for the temporal influence of all modalities on each unimodal branch. With the incorporation of the three well-designed modules, CMAFN can refine the inter-modality complementarity and intra-modality temporal dependencies, achieving the interaction between unimodal branches and adaptive balance between modalities. Evaluation results on widely used depression datasets, AVEC2013 and AVEC2014, demonstrate that the proposed CMAFN method outperforms state-of-the-art approaches for depression recognition tasks. The results highlight the potential of CMAFN as an effective tool for the early detection and diagnosis of depression.

KW - Automatic detection
KW - Depression
KW - Multi-modal fusion
KW - Multimodal depression detection
UR - http://www.scopus.com/inward/record.url?scp=85180752139&partnerID=8YFLogxK
U2 - 10.1007/978-981-99-8469-5_20
DO - 10.1007/978-981-99-8469-5_20
M3 - Conference contribution
AN - SCOPUS:85180752139
SN - 9789819984688
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 252
EP - 264
BT - Pattern Recognition and Computer Vision - 6th Chinese Conference, PRCV 2023, Proceedings
A2 - Liu, Qingshan
A2 - Wang, Hanzi
A2 - Ji, Rongrong
A2 - Ma, Zhanyu
A2 - Zheng, Weishi
A2 - Zha, Hongbin
A2 - Chen, Xilin
A2 - Wang, Liang
PB - Springer Science and Business Media Deutschland GmbH
T2 - 6th Chinese Conference on Pattern Recognition and Computer Vision, PRCV 2023
Y2 - 13 October 2023 through 15 October 2023
ER -

Li Y, Wang J, Liu Z, Zhou L, Zhang H, Tang C et al. An Automatic Depression Detection Method with Cross-Modal Fusion Network and Multi-head Attention Mechanism. In Liu Q, Wang H, Ji R, Ma Z, Zheng W, Zha H, Chen X, Wang L, editors, Pattern Recognition and Computer Vision - 6th Chinese Conference, PRCV 2023, Proceedings. Springer Science and Business Media Deutschland GmbH. 2024. p. 252-264. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-981-99-8469-5_20