
Hierarchical Dynamics Aggregation Network for Speech-based Depression Detection

  • Li Zhou
  • Ling Li
  • Rushi Lan
  • Zhenyu Liu
  • Xiaonan Luo*
  • Bin Hu

*Corresponding author for this work

  • Guilin University of Electronic Technology
  • Lanzhou University

Research output: Contribution to journal, Article, peer-review

Abstract

Speech signals, owing to their non-invasive and low-cost advantages, have emerged as a pivotal modality for the objective assessment of depression. However, existing methods struggle to capture the hierarchical dynamic structures of speech, thereby constraining the discriminability of the representations. To address this issue, this paper proposes a Hierarchical Dynamics Aggregation Network (HDAN). Under the hierarchical dynamic modeling paradigm, the model first constructs context-aware first-order acoustic state representations. On this basis, a Multi-Scale Dynamics Extraction (MDE) module and a Dynamic Relation Network (DRN) are introduced to extract and aggregate speech dynamics across multiple temporal scales, forming a unified global dynamic representation. Then, a Dynamic Synergistic Memory (DSM) module is employed to align and enhance sample-level dynamics with learnable prototypes. Finally, a Mask-based Cross Fusion (MCF) module is used to adaptively fuse global dynamics and content semantics, obtaining a joint representation that accounts for both content and dynamics. Comparative experiments on the Androids Corpus and Clinical Dataset demonstrate that HDAN consistently outperforms multiple baseline models on various metrics, validating the effectiveness of HDAN. Meanwhile, ablation studies show that each submodule contributes positively to performance improvements, further supporting the rationality of its structural design.
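
The abstract does not give implementation details, so the following is only an illustrative sketch of two ideas it names: summarising how frame-level features change over several temporal scales, and blending a dynamics vector with a content vector through a mask. The function names, the choice of lags, the mean-absolute-difference pooling, and the sigmoid gate are all assumptions made for illustration; this is not the paper's MDE, DRN, DSM, or MCF module.

```python
import math

def multi_scale_dynamics(states, lags=(1, 2, 4)):
    """Mean absolute change of each feature at each lag, concatenated.

    states: list of frames, each a list of feature values (T frames, D features).
    Returns a flat list of length len(lags) * D.
    NOTE: hypothetical helper, not the paper's MDE module.
    """
    n_frames, n_feats = len(states), len(states[0])
    pooled = []
    for lag in lags:
        if not 0 < lag < n_frames:
            raise ValueError("each lag must be positive and smaller than the frame count")
        n_pairs = n_frames - lag
        for d in range(n_feats):
            # Sum the change in feature d between frames that are `lag` apart.
            total = sum(abs(states[t + lag][d] - states[t][d]) for t in range(n_pairs))
            pooled.append(total / n_pairs)  # average over time
    return pooled

def masked_fusion(dynamics, content, mask_logits):
    """Blend two equal-length vectors with an element-wise sigmoid mask.

    NOTE: hypothetical helper, not the paper's MCF module; in a trained
    model the mask logits would be learned rather than passed in by hand.
    """
    fused = []
    for dyn, con, logit in zip(dynamics, content, mask_logits):
        gate = 1.0 / (1.0 + math.exp(-logit))  # gate value in (0, 1)
        fused.append(gate * dyn + (1.0 - gate) * con)
    return fused

# Toy input: 6 frames, 2 features. Feature 0 rises by 1 per frame; feature 1 is constant.
frames = [[0, 5], [1, 5], [2, 5], [3, 5], [4, 5], [5, 5]]
dyn = multi_scale_dynamics(frames, lags=(1, 2))
fused = masked_fusion(dyn, [0.0] * len(dyn), [0.0] * len(dyn))
```

In this toy case the rising feature yields a larger value at the longer lag while the constant feature yields zero at both, which is the kind of scale-dependent signal a multi-scale summary is meant to expose. Zero mask logits give a gate of 0.5, so the fusion reduces to a plain average of the two vectors.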

Original language: English
Journal: IEEE Transactions on Affective Computing
DOIs
Publication status: Accepted/In press - 2026
Externally published: Yes

Keywords

  • Hierarchical dynamic modeling
  • Multi-scale dynamic feature extraction
  • Representation fusion
  • Speech-based depression detection
