IBACodec: End-to-end speech codec with intra-inter broad attention

Xiaonan Yang, Jinjie Zhou, Deshan Yang, Yunwei Wan, Limin Pan*, Senlin Luo

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Speech compression aims to produce a compact bitstream that represents a speech signal with minimal distortion by eliminating redundant information, a task that becomes increasingly challenging as the bitrate decreases. However, existing neural speech codecs do not fully exploit the information in preceding speech sequences, and learning encoded features blindly removes redundant information ineffectively, resulting in suboptimal reconstruction quality. In this work, we propose an end-to-end speech codec with intra-inter broad attention, named IBACodec, that efficiently compresses speech across datasets of different types, including LibriTTS, LJSpeech, and others. By designing an intra-inter broad transformer that integrates multi-head attention networks and an LSTM, our model captures broad attention with direct context awareness between the intra- and inter-frames of speech. Furthermore, we present a dual-branch conformer for channel-wise modeling to effectively eliminate redundant information. In subjective evaluations on speech at a 24 kHz sampling rate, IBACodec at 6.3 kbps is comparable to SoundStream at 9 kbps and better than Opus at 9 kbps, using about 30% fewer bits. Objective experimental results show that IBACodec outperforms state-of-the-art codecs across a wide range of bitrates, with average ViSQOL, LLR, and CEP improvements of up to 4.97%, 38.94%, and 25.39%, respectively.
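
The core architectural idea in the abstract, pairing per-frame (intra) self-attention with recurrent inter-frame context, can be sketched roughly as below. This is not the authors' implementation: the module name, dimensions, fusion scheme, and the residual/normalization layout are illustrative assumptions, using standard PyTorch multi-head attention and LSTM layers.

# Minimal sketch (assumed, not the paper's code) of a block that combines
# multi-head self-attention (intra-frame) with an LSTM (inter-frame context),
# loosely following the abstract's description of the intra-inter broad transformer.
import torch
import torch.nn as nn

class IntraInterBroadBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Intra path: self-attention over positions within the current sequence.
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Inter path: LSTM carries context forward from previous frames.
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, d_model) latent features from an encoder.
        attn_out, _ = self.attn(x, x, x)        # intra-frame attention
        x = self.norm1(x + attn_out)            # residual + layer norm
        lstm_out, _ = self.lstm(x)              # inter-frame recurrence
        return self.norm2(x + lstm_out)         # residual + layer norm

if __name__ == "__main__":
    feats = torch.randn(2, 100, 256)            # 2 utterances, 100 frames each
    print(IntraInterBroadBlock()(feats).shape)  # torch.Size([2, 100, 256])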

Original language: English
Article number: 103979
Journal: Information Processing and Management
Volume: 62
Issue number: 3
DOI: https://doi.org/10.1016/j.ipm.2024.103979
Publication status: Published - May 2025

Keywords

  • Intra-inter
  • Neural networks
  • Speech coding
  • Transformers
  • VQ-VAE

Cite this

Yang, X., Zhou, J., Yang, D., Wan, Y., Pan, L., & Luo, S. (2025). IBACodec: End-to-end speech codec with intra-inter broad attention. Information Processing and Management, 62(3), Article 103979. https://doi.org/10.1016/j.ipm.2024.103979