TY - JOUR
T1 - Audio self-supervised learning
T2 - A survey
AU - Liu, Shuo
AU - Mallol-Ragolta, Adria
AU - Parada-Cabaleiro, Emilia
AU - Qian, Kun
AU - Jing, Xin
AU - Kathan, Alexander
AU - Hu, Bin
AU - Schuller, Björn W.
N1 - Publisher Copyright: © 2022 The Author(s)
PY - 2022/12/9
Y1 - 2022/12/9
AB - Similar to humans’ cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) targets discovering general representations from large-scale data. This, through the use of pre-trained SSL models for downstream tasks, alleviates the need for human annotation, which is an expensive and time-consuming task. Its success in the fields of computer vision and natural language processing has prompted its recent adoption into the field of audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarize the empirical works that exploit the audio modality in multi-modal SSL frameworks and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out the future directions in the development of audio SSL.
KW - DSML 2: Proof-of-concept: Data science output has been formulated, implemented, and tested for one domain/problem
KW - audio and speech processing
KW - multi-modal SSL
KW - representation learning
KW - self-supervised learning
KW - unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=85145774176&partnerID=8YFLogxK
U2 - 10.1016/j.patter.2022.100616
DO - 10.1016/j.patter.2022.100616
M3 - Review article
AN - SCOPUS:85145774176
SN - 2666-3899
VL - 3
JO - Patterns
JF - Patterns
IS - 12
M1 - 100616
ER -