Learning audio sequence representations for acoustic event classification

Zixing Zhang; Ding Liu; Jing Han; Kun Qian; Björn W. Schuller

doi:10.1016/j.eswa.2021.115007

Corresponding author: Kun Qian

Research output: Contribution to journal › Article › peer-review

5 Citations (Scopus)

Abstract

Acoustic Event Classification (AEC) has become a significant task for machines to perceive the surrounding auditory scene. However, extracting effective representations that capture the underlying characteristics of acoustic events is still challenging. Previous methods mainly focused on designing the audio features in a ‘hand-crafted’ manner. Interestingly, data-learnt features have recently been reported to show better performance, but up to now these have only been considered at the frame level. In this article, we propose an unsupervised learning framework to learn a vector representation of an audio sequence for AEC. This framework consists of a Recurrent Neural Network (RNN) encoder and an RNN decoder, which respectively transform the variable-length audio sequence into a fixed-length vector and reconstruct the input sequence from the generated vector. After training the encoder-decoder, we feed the audio sequences to the encoder and take the learnt vectors as the audio sequence representations. Compared with previous methods, the proposed method can not only handle audio streams of arbitrary length, but also learn the salient information of the sequence. An extensive evaluation on a large acoustic event database shows that the learnt audio sequence representation yields a significant performance improvement over other state-of-the-art hand-crafted sequence features for AEC.
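
The encoder-decoder idea in the abstract can be illustrated with a minimal sketch. The code below is not the authors' model: the abstract does not state the recurrent cell type, layer sizes, or input features, so a plain tanh recurrence, random untrained weights, and the names TinyRecurrentAutoencoder, encode, decode, and reconstruction_error are all assumptions made for illustration. It only shows how sequences of different lengths map to vectors of one fixed size, and how a decoder can unroll such a vector back into a sequence whose reconstruction error a training procedure would minimise.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weight matrix (untrained, for illustration only)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class TinyRecurrentAutoencoder:
    """Plain tanh RNN encoder-decoder; a hypothetical stand-in, not the paper's model."""

    def __init__(self, feat_dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.enc_in = make_matrix(hidden_dim, feat_dim, rng)     # frame -> hidden
        self.enc_rec = make_matrix(hidden_dim, hidden_dim, rng)  # hidden -> hidden
        self.dec_rec = make_matrix(hidden_dim, hidden_dim, rng)  # hidden -> hidden
        self.dec_out = make_matrix(feat_dim, hidden_dim, rng)    # hidden -> frame

    def encode(self, frames):
        """Fold a variable-length list of frames into one fixed-length vector."""
        h = [0.0] * self.hidden_dim
        for x in frames:
            a = matvec(self.enc_in, x)
            b = matvec(self.enc_rec, h)
            h = [math.tanh(ai + bi) for ai, bi in zip(a, b)]
        return h

    def decode(self, vector, num_frames):
        """Unroll the decoder from the encoded vector to rebuild num_frames frames."""
        h = list(vector)
        frames = []
        for _ in range(num_frames):
            h = [math.tanh(v) for v in matvec(self.dec_rec, h)]
            frames.append(matvec(self.dec_out, h))
        return frames


def reconstruction_error(original, rebuilt):
    """Mean squared error between two equal-length frame sequences."""
    total = sum(
        (a - b) ** 2
        for frame_o, frame_r in zip(original, rebuilt)
        for a, b in zip(frame_o, frame_r)
    )
    return total / (len(original) * len(original[0]))


if __name__ == "__main__":
    model = TinyRecurrentAutoencoder(feat_dim=3, hidden_dim=4)
    short_clip = [[0.1, 0.2, 0.3]] * 5    # 5 frames of made-up features
    long_clip = [[0.3, 0.1, 0.0]] * 12    # 12 frames of made-up features
    vec_short = model.encode(short_clip)
    vec_long = model.encode(long_clip)
    # Both clips yield vectors of the same length despite different durations.
    print(len(vec_short), len(vec_long))
    # With untrained weights the error is arbitrary; training would reduce it.
    print(reconstruction_error(short_clip, model.decode(vec_short, len(short_clip))))
```

In a setup like the one the abstract describes, the weights would first be trained on unlabelled audio to reduce the reconstruction error, after which only the output of encode would be passed to a classifier.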

Original language	English
Article number	115007
Journal	Expert Systems with Applications
Volume	178
DOIs	https://doi.org/10.1016/j.eswa.2021.115007
Publication status	Published - 15 Sept 2021
Externally published	Yes

Keywords

Acoustic event classification
Audio sequence-to-vector
Computer audition
Deep learning
Machine learning
Recurrent autoencoder

Cite this

Zhang, Z., Liu, D., Han, J., Qian, K., & Schuller, B. W. (2021). Learning audio sequence representations for acoustic event classification. Expert Systems with Applications, 178, Article 115007. https://doi.org/10.1016/j.eswa.2021.115007

@article{75564c6f718c43348696e7c7c515bc7b,

title = "Learning audio sequence representations for acoustic event classification",

abstract = "Acoustic Event Classification (AEC) has become a significant task for machines to perceive the surrounding auditory scene. However, extracting effective representations that capture the underlying characteristics of the acoustic events is still challenging. Previous methods mainly focused on designing the audio features in a {\textquoteleft}hand-crafted{\textquoteright} manner. Interestingly, data-learnt features have been recently reported to show better performance. Up to now, these were only considered on the frame-level. In this article, we propose an unsupervised learning framework to learn a vector representation of an audio sequence for AEC. This framework consists of a Recurrent Neural Network (RNN) encoder and a RNN decoder, which respectively transforms the variable-length audio sequence into a fixed-length vector and reconstructs the input sequence on the generated vector. After training the encoder-decoder, we feed the audio sequences to the encoder and then take the learnt vectors as the audio sequence representations. Compared with previous methods, the proposed method can not only deal with the problem of arbitrary-lengths of audio streams, but also learn the salient information of the sequence. Extensive evaluation on a large-size acoustic event database is performed, and the empirical results demonstrate that the learnt audio sequence representation yields a significant performance improvement by a large margin compared with other state-of-the-art hand-crafted sequence features for AEC.",

keywords = "Acoustic event classification, Audio sequence-to-vector, Computer audition, Deep learning, Machine learning, Recurrent autoencoder",

author = "Zhang, Zixing and Liu, Ding and Han, Jing and Qian, Kun and Schuller, {Bj{\"o}rn W.}",

note = "Publisher Copyright: {\textcopyright} 2021 Elsevier Ltd",

year = "2021",

month = sep,

day = "15",

doi = "10.1016/j.eswa.2021.115007",

language = "English",

volume = "178",

journal = "Expert Systems with Applications",

issn = "0957-4174",

publisher = "Elsevier Ltd.",

}

TY - JOUR

T1 - Learning audio sequence representations for acoustic event classification

AU - Zhang, Zixing

AU - Liu, Ding

AU - Han, Jing

AU - Qian, Kun

AU - Schuller, Björn W.

PY - 2021/9/15

Y1 - 2021/9/15

AB - Acoustic Event Classification (AEC) has become a significant task for machines to perceive the surrounding auditory scene. However, extracting effective representations that capture the underlying characteristics of the acoustic events is still challenging. Previous methods mainly focused on designing the audio features in a ‘hand-crafted’ manner. Interestingly, data-learnt features have been recently reported to show better performance. Up to now, these were only considered on the frame-level. In this article, we propose an unsupervised learning framework to learn a vector representation of an audio sequence for AEC. This framework consists of a Recurrent Neural Network (RNN) encoder and a RNN decoder, which respectively transforms the variable-length audio sequence into a fixed-length vector and reconstructs the input sequence on the generated vector. After training the encoder-decoder, we feed the audio sequences to the encoder and then take the learnt vectors as the audio sequence representations. Compared with previous methods, the proposed method can not only deal with the problem of arbitrary-lengths of audio streams, but also learn the salient information of the sequence. Extensive evaluation on a large-size acoustic event database is performed, and the empirical results demonstrate that the learnt audio sequence representation yields a significant performance improvement by a large margin compared with other state-of-the-art hand-crafted sequence features for AEC.

KW - Acoustic event classification

KW - Audio sequence-to-vector

KW - Computer audition

KW - Deep learning

KW - Machine learning

KW - Recurrent autoencoder

UR - http://www.scopus.com/inward/record.url?scp=85104804190&partnerID=8YFLogxK

U2 - 10.1016/j.eswa.2021.115007

DO - 10.1016/j.eswa.2021.115007

M3 - Article

AN - SCOPUS:85104804190

SN - 0957-4174

VL - 178

JO - Expert Systems with Applications

JF - Expert Systems with Applications

M1 - 115007

ER -
