TY - JOUR
T1 - Towards Diverse and Efficient Audio Captioning via Diffusion Models
AU - Xu, Manjie
AU - Li, Chenxing
AU - Ren, Yong
AU - Tu, Xinyi
AU - Fu, Ruibo
AU - Liang, Wei
AU - Yu, Dong
N1 - Publisher Copyright: © 2025 International Speech Communication Association. All rights reserved.
PY - 2025
Y1 - 2025
N2 - We introduce Diffusion-based Audio Captioning (DAC), a non-autoregressive diffusion model tailored for diverse and efficient audio captioning. Although existing captioning models relying on language backbones have achieved remarkable success in various captioning tasks, their insufficient performance in terms of generation speed and diversity impedes progress in audio understanding and multimedia applications. Our diffusion-based framework offers unique advantages stemming from its inherent stochasticity and holistic context modeling in captioning. Through rigorous evaluation, we demonstrate that DAC not only achieves superior performance levels compared to existing benchmarks in caption quality, but also significantly outperforms them in terms of generation speed and diversity.
AB - We introduce Diffusion-based Audio Captioning (DAC), a non-autoregressive diffusion model tailored for diverse and efficient audio captioning. Although existing captioning models relying on language backbones have achieved remarkable success in various captioning tasks, their insufficient performance in terms of generation speed and diversity impedes progress in audio understanding and multimedia applications. Our diffusion-based framework offers unique advantages stemming from its inherent stochasticity and holistic context modeling in captioning. Through rigorous evaluation, we demonstrate that DAC not only achieves superior performance levels compared to existing benchmarks in caption quality, but also significantly outperforms them in terms of generation speed and diversity.
KW - audio captioning
KW - diffusion model
UR - https://www.scopus.com/pages/publications/105020046642
U2 - 10.21437/Interspeech.2025-79
DO - 10.21437/Interspeech.2025-79
M3 - Conference article
AN - SCOPUS:105020046642
SN - 2308-457X
SP - 191
EP - 195
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 26th Interspeech Conference 2025
Y2 - 17 August 2025 through 21 August 2025
ER -