Audio self-supervised learning: A survey

Shuo Liu*, Adria Mallol-Ragolta, Emilia Parada-Cabaleiro, Kun Qian, Xin Jing, Alexander Kathan, Bin Hu, Björn W. Schuller

*Corresponding author for this work

Research output: Contribution to journal › Review article › peer-review

49 Citations (Scopus)

Abstract

Similar to humans’ cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) aims to discover general representations from large-scale data. Using pre-trained SSL models for downstream tasks alleviates the need for human annotation, which is expensive and time-consuming. The success of SSL in computer vision and natural language processing has prompted its recent adoption in audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. We also summarize the empirical works that exploit the audio modality in multi-modal SSL frameworks, as well as the existing benchmarks suitable for evaluating the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out future directions in the development of audio SSL.

Original language: English
Article number: 100616
Journal: Patterns
Volume: 3
Issue number: 12
Publication status: Published - 9 Dec 2022

Keywords

  • DSML 2: Proof-of-concept: Data science output has been formulated, implemented, and tested for one domain/problem
  • audio and speech processing
  • multi-modal SSL
  • representation learning
  • self-supervised learning
  • unsupervised learning
