Audio self-supervised learning: A survey

Shuo Liu; Adria Mallol-Ragolta; Emilia Parada-Cabaleiro; Kun Qian; Xin Jing; Alexander Kathan; Bin Hu; Björn W. Schuller

doi:10.1016/j.patter.2022.100616

Corresponding author: Shuo Liu

School of Medical and Technology

Research output: Contribution to journal › Review article › peer-review

49 Citations (Scopus)

Abstract

Similar to humans’ cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) targets discovering general representations from large-scale data. This, through the use of pre-trained SSL models for downstream tasks, alleviates the need for human annotation, which is an expensive and time-consuming task. Its success in the fields of computer vision and natural language processing has prompted its recent adoption into the field of audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarize the empirical works that exploit the audio modality in multi-modal SSL frameworks and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out the future directions in the development of audio SSL.

Original language	English
Article number	100616
Journal	Patterns
Volume	3
Issue number	12
DOIs	https://doi.org/10.1016/j.patter.2022.100616
Publication status	Published - 9 Dec 2022

Keywords

DSML 2: Proof-of-concept: Data science output has been formulated, implemented, and tested for one domain/problem
audio and speech processing
multi-modal SSL
representation learning
self-supervised learning
unsupervised learning

Cite this

@article{d1b06e5c6ea74357833fe8364a87a4cb,

title = "Audio self-supervised learning: A survey",

abstract = "Similar to humans{\textquoteright} cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) targets discovering general representations from large-scale data. This, through the use of pre-trained SSL models for downstream tasks, alleviates the need for human annotation, which is an expensive and time-consuming task. Its success in the fields of computer vision and natural language processing have prompted its recent adoption into the field of audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarize the empirical works that exploit audio modality in multi-modal SSL frameworks and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out the future directions in the development of audio SSL.",

keywords = "DSML 2: Proof-of-concept: Data science output has been formulated, implemented, and tested for one domain/problem, audio and speech processing, multi-modal SSL, representation learning, self-supervised learning, unsupervised learning",

author = "Shuo Liu and Adria Mallol-Ragolta and Emilia Parada-Cabaleiro and Kun Qian and Xin Jing and Alexander Kathan and Bin Hu and Schuller, {Bj{\"o}rn W.}",

note = "Publisher Copyright: {\textcopyright} 2022 The Author(s)",

year = "2022",

month = dec,

day = "9",

doi = "10.1016/j.patter.2022.100616",

language = "English",

volume = "3",

journal = "Patterns",

issn = "2666-3899",

publisher = "Cell Press",

number = "12",

}

TY - JOUR

T1 - Audio self-supervised learning: A survey

AU - Liu, Shuo

AU - Mallol-Ragolta, Adria

AU - Parada-Cabaleiro, Emilia

AU - Qian, Kun

AU - Jing, Xin

AU - Kathan, Alexander

AU - Hu, Bin

AU - Schuller, Björn W.

PY - 2022/12/9

Y1 - 2022/12/9

AB - Similar to humans’ cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) targets discovering general representations from large-scale data. This, through the use of pre-trained SSL models for downstream tasks, alleviates the need for human annotation, which is an expensive and time-consuming task. Its success in the fields of computer vision and natural language processing have prompted its recent adoption into the field of audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarize the empirical works that exploit audio modality in multi-modal SSL frameworks and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out the future directions in the development of audio SSL.

KW - DSML 2: Proof-of-concept: Data science output has been formulated, implemented, and tested for one domain/problem

KW - audio and speech processing

KW - multi-modal SSL

KW - representation learning

KW - self-supervised learning

KW - unsupervised learning

UR - http://www.scopus.com/inward/record.url?scp=85145774176&partnerID=8YFLogxK

U2 - 10.1016/j.patter.2022.100616

DO - 10.1016/j.patter.2022.100616

M3 - Review article

AN - SCOPUS:85145774176

SN - 2666-3899

VL - 3

JO - Patterns

JF - Patterns

IS - 12

M1 - 100616

ER -
