TY - JOUR
T1 - IBACodec
T2 - End-to-end speech codec with intra-inter broad attention
AU - Yang, Xiaonan
AU - Zhou, Jinjie
AU - Yang, Deshan
AU - Wan, Yunwei
AU - Pan, Limin
AU - Luo, Senlin
N1 - Publisher Copyright:
© 2024 Elsevier Ltd
PY - 2025/5
Y1 - 2025/5
AB - Speech compression attempts to yield a compact bitstream that can represent a speech signal with minimal distortion by eliminating redundant information, which is increasingly challenging as the bitrate decreases. However, existing neural speech codecs do not fully exploit the information from previous speech sequences, and learning encoded features blindly leads to the ineffective removal of redundant information, resulting in suboptimal reconstruction quality. In this work, we propose an end-to-end speech codec with intra-inter broad attention, named IBACodec, that efficiently compresses speech across different types of datasets, including LibriTTS, LJSpeech, and more. By designing an intra-inter broad transformer that integrates multi-head attention networks and LSTM, our model captures broad attention with direct context awareness between the intra- and inter-frames of speech. Furthermore, we present a dual-branch conformer for channel-wise modeling to effectively eliminate redundant information. In subjective evaluations using speech at a 24 kHz sampling rate, IBACodec at 6.3 kbps is comparable to SoundStream at 9 kbps and better than Opus at 9 kbps, with about 30 % fewer bits. Objective experimental results show that IBACodec outperforms state-of-the-art codecs across a wide range of bitrates, with an average ViSQOL, LLR, and CEP improvement of up to 4.97 %, 38.94 %, and 25.39 %, respectively.
KW - Intra-inter
KW - Neural networks
KW - Speech coding
KW - Transformers
KW - VQ-VAE
UR - http://www.scopus.com/inward/record.url?scp=85210726020&partnerID=8YFLogxK
U2 - 10.1016/j.ipm.2024.103979
DO - 10.1016/j.ipm.2024.103979
M3 - Article
AN - SCOPUS:85210726020
SN - 0306-4573
VL - 62
JO - Information Processing and Management
JF - Information Processing and Management
IS - 3
M1 - 103979
ER -