Interpretable adversarial example detection via high-level concept activation vector

Jiaxing Li, Yu an Tan, Xinyu Liu, Weizhi Meng, Yuanzhang Li*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Deep neural networks have achieved amazing performance in many tasks. However, they are easily fooled by small perturbations added to the input. Such small perturbations to image data are usually imperceptible to humans. The uninterpretable nature of deep learning systems is considered to be one of the reasons why they are vulnerable to adversarial attacks. For enhanced trust and confidence, it is crucial for artificial intelligence systems to ensure transparency, reliability, and human comprehensibility in their decision-making processes as they gain wider acceptance among the general public. In this paper, we propose an approach for defending against adversarial attacks based on conceptually interpretable techniques. Our approach to model interpretation is based on high-level concepts rather than low-level pixel features. Our key finding is that adding small perturbations leads to large changes in the model's concept vector tests. Based on this, we design a single-image concept vector testing method for detecting adversarial examples. Our experiments on the ImageNet dataset show that our method can achieve an average accuracy of over 95%. We provide source code in the supplementary material.
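
As a rough illustration of the concept-vector idea mentioned in the abstract, the following minimal sketch computes a concept activation vector (CAV) and a single-image concept sensitivity score. It is not the authors' detector: the activations are synthetic, the network head (`W1`, `w2`) is a hypothetical toy, and the CAV is taken as a normalized difference of class means, a simplification of the linear-classifier normal commonly used for CAVs. All names here are assumptions introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # hypothetical size of the chosen layer's activation vector

# Synthetic stand-ins for layer activations (assumption: in practice these
# would come from a trained network fed concept images and random images).
concept_direction = np.zeros(DIM)
concept_direction[0] = 1.0
concept_acts = rng.normal(size=(200, DIM)) + 3.0 * concept_direction
random_acts = rng.normal(size=(200, DIM))


def compute_cav(concept_acts, random_acts):
    """Unit vector pointing from random activations toward concept activations.

    Simplification: difference of class means rather than the normal of a
    trained linear classifier.
    """
    diff = concept_acts.mean(axis=0) - random_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


# Hypothetical toy head mapping layer activations to one class logit.
W1 = rng.normal(size=(4, DIM))
w2 = rng.normal(size=4)


def class_logit(acts):
    """Class logit of the toy head for one activation vector."""
    return float(w2 @ np.tanh(W1 @ acts))


def logit_gradient(acts):
    """Analytic gradient of the toy logit with respect to the activations."""
    hidden = np.tanh(W1 @ acts)
    return W1.T @ (w2 * (1.0 - hidden ** 2))


def concept_sensitivity(acts, cav):
    """Directional derivative of the class logit along the CAV for one image."""
    return float(logit_gradient(acts) @ cav)


cav = compute_cav(concept_acts, random_acts)
image_acts = rng.normal(size=DIM)  # activations of a single hypothetical image
score = concept_sensitivity(image_acts, cav)
```

Under these assumptions, a detector in the spirit of the abstract would compare such per-image scores for an input against those expected for clean images; how the paper actually performs that comparison is not specified here.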

Original language: English
Article number: 104218
Journal: Computers and Security
Volume: 150
Publication status: Published - Mar 2025

Keywords

  • Adversarial defense
  • Adversarial machine learning
  • Concept activation vector
  • Deep learning
  • Model explainability
