TY - JOUR
T1 - CLIP-MSM
T2 - 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
AU - Yang, Guoyuan
AU - Xue, Mufan
AU - Mao, Ziming
AU - Zheng, Haofang
AU - Xu, Jia
AU - Sheng, Dabin
AU - Sun, Ruotian
AU - Yang, Ruoqi
AU - Li, Xuesong
N1 - Publisher Copyright:
Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2025/4/11
Y1 - 2025/4/11
AB - Prior work employing deep neural networks (DNNs) with explainable techniques has identified human visual cortical selective representation to specific categories. However, constructing high-performing encoding models that accurately capture brain responses to coexisting multi-semantics remains elusive. Here, we used CLIP models combined with CLIP Dissection to establish a multi-semantic mapping framework (CLIP-MSM) for hypothesis-free analysis in human high-level visual cortex. First, we utilize CLIP models to construct voxel-wise encoding models for predicting visual cortical responses to natural scene images. Then, we apply CLIP Dissection and normalize the semantic mapping score to achieve the mapping of single brain voxels to multiple semantics. Our findings indicate that CLIP Dissection applied to DNNs modeling the human high-level visual cortex demonstrates better interpretability accuracy compared to Network Dissection. In addition, to demonstrate how our method enables fine-grained discovery in hypothesis-free analysis, we quantify the accuracy between CLIP-MSM’s reconstructed brain activation in response to categories of faces, bodies, places, words and food, and the ground truth of brain activation. We demonstrate that CLIP-MSM provides more accurate predictions of visual responses compared to CLIP Dissection. Our results have been validated using two large natural image datasets: the Natural Scenes Dataset (NSD) and the Natural Object Dataset (NOD).
UR - http://www.scopus.com/inward/record.url?scp=105003914482&partnerID=8YFLogxK
U2 - 10.1609/aaai.v39i9.32994
DO - 10.1609/aaai.v39i9.32994
M3 - Conference article
AN - SCOPUS:105003914482
SN - 2159-5399
VL - 39
SP - 9184
EP - 9192
JO - Proceedings of the AAAI Conference on Artificial Intelligence
JF - Proceedings of the AAAI Conference on Artificial Intelligence
IS - 9
Y2 - 25 February 2025 through 4 March 2025
ER -