Audio self-supervised learning: A survey

Shuo Liu; Adria Mallol-Ragolta; Emilia Parada-Cabaleiro; Kun Qian; Xin Jing; Alexander Kathan; Bin Hu; Björn W. Schuller

doi:10.1016/j.patter.2022.100616

Shuo Liu^*, Adria Mallol-Ragolta, Emilia Parada-Cabaleiro, Kun Qian, Xin Jing, Alexander Kathan, Bin Hu, Björn W. Schuller

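^*Corresponding author for this work

School of Medical Technology

Research output: Contribution to journal › Review article › Peer-reviewed

46 citations (Scopus)

Abstract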

Similar to humans’ cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) targets discovering general representations from large-scale data. This, through the use of pre-trained SSL models for downstream tasks, alleviates the need for human annotation, which is an expensive and time-consuming task. Its success in the fields of computer vision and natural language processing have prompted its recent adoption into the field of audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarize the empirical works that exploit audio modality in multi-modal SSL frameworks and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out the future directions in the development of audio SSL.

Original language	English
Article number	100616
Journal	Patterns
Volume	3
Issue number	12
DOI	https://doi.org/10.1016/j.patter.2022.100616
Publication status	Published - 9 Dec 2022

Cite this

@article{d1b06e5c6ea74357833fe8364a87a4cb,
  title = "Audio self-supervised learning: A survey",
  abstract = "Similar to humans{\textquoteright} cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) targets discovering general representations from large-scale data. This, through the use of pre-trained SSL models for downstream tasks, alleviates the need for human annotation, which is an expensive and time-consuming task. Its success in the fields of computer vision and natural language processing have prompted its recent adoption into the field of audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarize the empirical works that exploit audio modality in multi-modal SSL frameworks and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out the future directions in the development of audio SSL.",
  keywords = "DSML 2: Proof-of-concept: Data science output has been formulated, implemented, and tested for one domain/problem, audio and speech processing, multi-modal SSL, representation learning, self-supervised learning, unsupervised learning",
  author = "Shuo Liu and Adria Mallol-Ragolta and Emilia Parada-Cabaleiro and Kun Qian and Xin Jing and Alexander Kathan and Bin Hu and Schuller, {Bj{\"o}rn W.}",
  note = "Publisher Copyright: {\textcopyright} 2022 The Author(s)",
  year = "2022",
  month = dec,
  day = "9",
  doi = "10.1016/j.patter.2022.100616",
  language = "English",
  volume = "3",
  journal = "Patterns",
  issn = "2666-3899",
  publisher = "Cell Press",
  number = "12",
}

TY - JOUR
T1 - Audio self-supervised learning
T2 - A survey
AU - Liu, Shuo
AU - Mallol-Ragolta, Adria
AU - Parada-Cabaleiro, Emilia
AU - Qian, Kun
AU - Jing, Xin
AU - Kathan, Alexander
AU - Hu, Bin
AU - Schuller, Björn W.
PY - 2022/12/9
Y1 - 2022/12/9
N2 - Similar to humans’ cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) targets discovering general representations from large-scale data. This, through the use of pre-trained SSL models for downstream tasks, alleviates the need for human annotation, which is an expensive and time-consuming task. Its success in the fields of computer vision and natural language processing have prompted its recent adoption into the field of audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarize the empirical works that exploit audio modality in multi-modal SSL frameworks and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out the future directions in the development of audio SSL.
AB - Similar to humans’ cognitive ability to generalize knowledge and skills, self-supervised learning (SSL) targets discovering general representations from large-scale data. This, through the use of pre-trained SSL models for downstream tasks, alleviates the need for human annotation, which is an expensive and time-consuming task. Its success in the fields of computer vision and natural language processing have prompted its recent adoption into the field of audio and speech processing. Comprehensive reviews summarizing the knowledge in audio SSL are currently missing. To fill this gap, we provide an overview of the SSL methods used for audio and speech processing applications. Herein, we also summarize the empirical works that exploit audio modality in multi-modal SSL frameworks and the existing suitable benchmarks to evaluate the power of SSL in the computer audition domain. Finally, we discuss some open problems and point out the future directions in the development of audio SSL.
KW - DSML 2: Proof-of-concept: Data science output has been formulated, implemented, and tested for one domain/problem
KW - audio and speech processing
KW - multi-modal SSL
KW - representation learning
KW - self-supervised learning
KW - unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=85145774176&partnerID=8YFLogxK
U2 - 10.1016/j.patter.2022.100616
DO - 10.1016/j.patter.2022.100616
M3 - Review article
AN - SCOPUS:85145774176
SN - 2666-3899
VL - 3
JO - Patterns
JF - Patterns
IS - 12
M1 - 100616
ER -
