STAA-Net: A Sparse and Transferable Adversarial Attack for Speech Emotion Recognition

Yi Chang; Zhao Ren; Zixing Zhang; Xin Jing; Kun Qian; Xi Shao; Bin Hu; Tanja Schultz; Bjorn W. Schuller

doi:10.1109/TAFFC.2024.3475729

Yi Chang^*, Zhao Ren^*, Zixing Zhang, Xin Jing, Kun Qian^*, Xi Shao, Bin Hu, Tanja Schultz, Bjorn W. Schuller

^* Corresponding author for this work

Research output: Contribution to journal › Article › Peer-reviewed

1 citation (Scopus)

Abstract

Speech contains rich information on the emotions of humans, and Speech Emotion Recognition (SER) has been an important topic in the area of human-computer interaction. The robustness of SER models is crucial, particularly in privacy-sensitive and reliability-demanding domains like private healthcare. Recently, the vulnerability of deep neural networks in the audio domain to adversarial attacks has become a popular area of research. However, prior works on adversarial attacks in the audio domain primarily rely on iterative gradient-based techniques, which are time-consuming and prone to overfitting the specific threat model. Furthermore, the exploration of sparse perturbations, which have the potential for better stealthiness, remains limited in the audio domain. To address these challenges, we propose a generator-based attack method to generate sparse and transferable adversarial examples to deceive SER models in an end-to-end and efficient manner. We evaluate our method on two widely-used SER datasets, Database of Elicited Mood in Speech (DEMoS) and Interactive Emotional dyadic MOtion CAPture (IEMOCAP), and demonstrate its ability to generate successful sparse adversarial examples in an efficient manner. Moreover, our generated adversarial examples exhibit model-agnostic transferability, enabling effective adversarial attacks on advanced victim models. The source code for this project is available at https://github.com/glam-imperial/STAA-Net-SER.

Original language	English
Journal	IEEE Transactions on Affective Computing
DOI	https://doi.org/10.1109/TAFFC.2024.3475729
Publication status	Accepted/In press - 2024

Cite this

Chang, Y., Ren, Z., Zhang, Z., Jing, X., Qian, K., Shao, X., Hu, B., Schultz, T., & Schuller, B. W. (Accepted/In press). STAA-Net: A Sparse and Transferable Adversarial Attack for Speech Emotion Recognition. IEEE Transactions on Affective Computing. https://doi.org/10.1109/TAFFC.2024.3475729

@article{b90245a6b68a48be979698ff39e9ea78,

title = "STAA-Net: A Sparse and Transferable Adversarial Attack for Speech Emotion Recognition",

abstract = "Speech contains rich information on the emotions of humans, and Speech Emotion Recognition (SER) has been an important topic in the area of human-computer interaction. The robustness of SER models is crucial, particularly in privacy-sensitive and reliability-demanding domains like private healthcare. Recently, the vulnerability of deep neural networks in the audio domain to adversarial attacks has become a popular area of research. However, prior works on adversarial attacks in the audio domain primarily rely on iterative gradient-based techniques, which are time-consuming and prone to overfitting the specific threat model. Furthermore, the exploration of sparse perturbations, which have the potential for better stealthiness, remains limited in the audio domain. To address these challenges, we propose a generator-based attack method to generate sparse and transferable adversarial examples to deceive SER models in an end-to-end and efficient manner. We evaluate our method on two widely-used SER datasets, Database of Elicited Mood in Speech (DEMoS) and Interactive Emotional dyadic MOtion CAPture (IEMOCAP), and demonstrate its ability to generate successful sparse adversarial examples in an efficient manner. Moreover, our generated adversarial examples exhibit model-agnostic transferability, enabling effective adversarial attacks on advanced victim models. The source code for this project is available at https://github.com/glam-imperial/STAA-Net-SER.",

keywords = "Adversarial attacks, efficiency, end-to-end, sparsity, speech emotion recognition, transferability",

author = "Yi Chang and Zhao Ren and Zixing Zhang and Xin Jing and Kun Qian and Xi Shao and Bin Hu and Tanja Schultz and Schuller, {Bjorn W.}",

note = "Publisher Copyright: {\textcopyright} 2010-2012 IEEE.",

year = "2024",

doi = "10.1109/TAFFC.2024.3475729",

language = "English",

journal = "IEEE Transactions on Affective Computing",

issn = "1949-3045",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - STAA-Net

T2 - A Sparse and Transferable Adversarial Attack for Speech Emotion Recognition

AU - Chang, Yi

AU - Ren, Zhao

AU - Zhang, Zixing

AU - Jing, Xin

AU - Qian, Kun

AU - Shao, Xi

AU - Hu, Bin

AU - Schultz, Tanja

AU - Schuller, Bjorn W.

PY - 2024

Y1 - 2024

N2 - Speech contains rich information on the emotions of humans, and Speech Emotion Recognition (SER) has been an important topic in the area of human-computer interaction. The robustness of SER models is crucial, particularly in privacy-sensitive and reliability-demanding domains like private healthcare. Recently, the vulnerability of deep neural networks in the audio domain to adversarial attacks has become a popular area of research. However, prior works on adversarial attacks in the audio domain primarily rely on iterative gradient-based techniques, which are time-consuming and prone to overfitting the specific threat model. Furthermore, the exploration of sparse perturbations, which have the potential for better stealthiness, remains limited in the audio domain. To address these challenges, we propose a generator-based attack method to generate sparse and transferable adversarial examples to deceive SER models in an end-to-end and efficient manner. We evaluate our method on two widely-used SER datasets, Database of Elicited Mood in Speech (DEMoS) and Interactive Emotional dyadic MOtion CAPture (IEMOCAP), and demonstrate its ability to generate successful sparse adversarial examples in an efficient manner. Moreover, our generated adversarial examples exhibit model-agnostic transferability, enabling effective adversarial attacks on advanced victim models. The source code for this project is available at https://github.com/glam-imperial/STAA-Net-SER.

KW - Adversarial attacks

KW - efficiency

KW - end-to-end

KW - sparsity

KW - speech emotion recognition

KW - transferability

UR - http://www.scopus.com/inward/record.url?scp=85207117290&partnerID=8YFLogxK

U2 - 10.1109/TAFFC.2024.3475729

DO - 10.1109/TAFFC.2024.3475729

M3 - Article

AN - SCOPUS:85207117290

SN - 1949-3045

JO - IEEE Transactions on Affective Computing

JF - IEEE Transactions on Affective Computing

ER -
