TY - JOUR
T1 - Interpretable adversarial example detection via high-level concept activation vector
AU - Li, Jiaxing
AU - Tan, Yu an
AU - Liu, Xinyu
AU - Meng, Weizhi
AU - Li, Yuanzhang
N1 - Publisher Copyright: © 2024
PY - 2025/3
Y1 - 2025/3
AB - Deep neural networks have achieved remarkable performance in many tasks. However, they are easily fooled by small perturbations added to the input. Such small perturbations to image data are usually imperceptible to humans. The uninterpretable nature of deep learning systems is considered to be one of the reasons why they are vulnerable to adversarial attacks. For enhanced trust and confidence, it is crucial for artificial intelligence systems to ensure transparency, reliability, and human comprehensibility in their decision-making processes as they gain wider acceptance among the general public. In this paper, we propose an approach for defending against adversarial attacks based on conceptually interpretable techniques. Our approach to model interpretation operates on high-level concepts rather than low-level pixel features. Our key finding is that adding small perturbations leads to large changes in the model's concept vector tests. Based on this, we design a single-image concept vector testing method for detecting adversarial examples. Our experiments on the ImageNet dataset show that our method can achieve an average accuracy of over 95%. We provide source code in the supplementary material.
KW - Adversarial defense
KW - Adversarial machine learning
KW - Concept activation vector
KW - Deep learning
KW - Model explainability
UR - http://www.scopus.com/inward/record.url?scp=85210536775&partnerID=8YFLogxK
U2 - 10.1016/j.cose.2024.104218
DO - 10.1016/j.cose.2024.104218
M3 - Article
AN - SCOPUS:85210536775
SN - 0167-4048
VL - 150
JO - Computers and Security
JF - Computers and Security
M1 - 104218
ER -