TY - GEN
T1 - Manipulating the Mind’s Eye
T2 - 40th AAAI Conference on Artificial Intelligence, AAAI 2026
AU - Zheng, Boshi
AU - Li, Yan
AU - Liu, Jiabin
N1 - Publisher Copyright:
© 2026, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2026
Y1 - 2026
N2 - The rise of Vision Transformers (ViTs) as cornerstone models in safety-critical applications like autonomous driving and medical diagnosis has shifted the focus from pure accuracy to verifiable trustworthiness. However, the very mechanisms used to explain these models, their internal attention maps, are themselves vulnerable. This creates a critical ”trust gap,” as the model’s apparent reasoning can be maliciously manipulated. To systematically investigate this vulnerability, we introduce A-SAGE (Attention-based Steering Adversarial Generation by Corrupting Explanations), a dual-objective attack framework that forces a model to misclassify an input while simultaneously corrupting its internal attention patterns to generate a misleading explanation. A-SAGE achieves this by optimizing a unified loss that combines a standard classification objective with two explanation-specific terms: an attention entropy loss to diffuse the model’s focus and an attention map distortion loss to steer the corrupted explanation towards a desired target. Our primary finding is A-SAGE’s exceptional black-box transferability. Using a CaiT-S as a white-box surrogate, adversarial examples generated with imperceptible perturbations achieve attack success rates of 79.4% on ViT-B, 49.7% on ResNet-50, and over 81.5% on other transformers (DeiT-B,TNT-S). Crucially, these successful attacks do not merely destroy the explanation; they generate a coherent but false attention map that deceptively ”justifies” the wrong prediction. These results reveal a systemic vulnerability in the core reasoning of modern foundation models, establishing A-SAGE as a critical benchmark for auditing the robustness of AI explainability.
AB - The rise of Vision Transformers (ViTs) as cornerstone models in safety-critical applications like autonomous driving and medical diagnosis has shifted the focus from pure accuracy to verifiable trustworthiness. However, the very mechanisms used to explain these models, their internal attention maps, are themselves vulnerable. This creates a critical ”trust gap,” as the model’s apparent reasoning can be maliciously manipulated. To systematically investigate this vulnerability, we introduce A-SAGE (Attention-based Steering Adversarial Generation by Corrupting Explanations), a dual-objective attack framework that forces a model to misclassify an input while simultaneously corrupting its internal attention patterns to generate a misleading explanation. A-SAGE achieves this by optimizing a unified loss that combines a standard classification objective with two explanation-specific terms: an attention entropy loss to diffuse the model’s focus and an attention map distortion loss to steer the corrupted explanation towards a desired target. Our primary finding is A-SAGE’s exceptional black-box transferability. Using a CaiT-S as a white-box surrogate, adversarial examples generated with imperceptible perturbations achieve attack success rates of 79.4% on ViT-B, 49.7% on ResNet-50, and over 81.5% on other transformers (DeiT-B,TNT-S). Crucially, these successful attacks do not merely destroy the explanation; they generate a coherent but false attention map that deceptively ”justifies” the wrong prediction. These results reveal a systemic vulnerability in the core reasoning of modern foundation models, establishing A-SAGE as a critical benchmark for auditing the robustness of AI explainability.
UR - https://www.scopus.com/pages/publications/105035030748
U2 - 10.1609/aaai.v40i16.38340
DO - 10.1609/aaai.v40i16.38340
M3 - Conference contribution
AN - SCOPUS:105035030748
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
SN - 9781577359067
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 13369
EP - 13377
BT - Proceedings of the AAAI Conference on Artificial Intelligence
A2 - Koenig, Sven
A2 - Jenkins, Chad
A2 - Taylor, Matthew E.
PB - Association for the Advancement of Artificial Intelligence
Y2 - 20 January 2026 through 27 January 2026
ER -