TY - JOUR
T1 - Hierarchical Dynamics Aggregation Network for Speech-based Depression Detection
AU - Zhou, Li
AU - Li, Ling
AU - Lan, Rushi
AU - Liu, Zhenyu
AU - Luo, Xiaonan
AU - Hu, Bin
N1 - Publisher Copyright: © 2010-2012 IEEE.
PY - 2026
Y1 - 2026
AB - Speech signals, owing to their non-invasive and low-cost advantages, have emerged as a pivotal modality for the objective assessment of depression. However, existing methods struggle to capture the hierarchical dynamic structures of speech, thereby constraining the discriminability of the representations. To address this issue, this paper proposes a Hierarchical Dynamics Aggregation Network (HDAN). Under the hierarchical dynamic modeling paradigm, the model first constructs context-aware first-order acoustic state representations. On this basis, a Multi-Scale Dynamics Extraction (MDE) module and a Dynamic Relation Network (DRN) are introduced to extract and aggregate speech dynamics across multiple temporal scales, forming a unified global dynamic representation. Then, a Dynamic Synergistic Memory (DSM) module is employed to align and enhance sample-level dynamics with learnable prototypes. Finally, a Mask-based Cross Fusion (MCF) module is used to adaptively fuse global dynamics and content semantics, obtaining a joint representation that accounts for both content and dynamics. Comparative experiments on the Androids Corpus and Clinical Dataset demonstrate that HDAN consistently outperforms multiple baseline models on various metrics, validating the effectiveness of HDAN. Meanwhile, ablation studies show that each submodule contributes positively to performance improvements, further supporting the rationality of its structural design.
KW - Hierarchical dynamic modeling
KW - Multi-scale dynamic feature extraction
KW - Representation fusion
KW - Speech-based depression detection
UR - https://www.scopus.com/pages/publications/105036178838
U2 - 10.1109/TAFFC.2026.3684889
DO - 10.1109/TAFFC.2026.3684889
M3 - Article
AN - SCOPUS:105036178838
SN - 1949-3045
JO - IEEE Transactions on Affective Computing
JF - IEEE Transactions on Affective Computing
ER -