TY - GEN
T1 - Unifying Vision-Language Models and Knowledge Graphs for Zero-Shot Counter-Intuitive Reasoning in Images
AU - Li, Hongxi
AU - Qi, Yayun
AU - Wu, Xinxiao
AU - Luo, Jiebo
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Counter-intuitive visual reasoning aims to recognize and interpret why weird, unusual, and uncanny image content violates common sense or lacks a basis in reality. It is intuitively appealing to employ pre-trained Vision-Language Models (VLMs) due to their impressive zero-shot capability for understanding and reasoning. However, the inherent gap between counter-intuitive images and pre-training images hinders VLMs from perceiving commonsense-violating entity relationships that are not seen during pre-training. To address this challenge, we propose a framework that unifies VLMs and knowledge graphs to explicitly incorporate commonsense knowledge into counter-intuitive reasoning. Starting with extracting fine-grained visual entities and their relationships using a VLM, we then reason about commonsense knowledge of the entities and relationships through knowledge graph completion, and finally interpret the deviations between the visual content and the commonsense knowledge using a pre-trained Large Language Model (LLM). Experiments on the visual dataset WHOOPS! demonstrate the effectiveness of our method.
AB - Counter-intuitive visual reasoning aims to recognize and interpret why weird, unusual, and uncanny image content violates common sense or lacks a basis in reality. It is intuitively appealing to employ pre-trained Vision-Language Models (VLMs) due to their impressive zero-shot capability for understanding and reasoning. However, the inherent gap between counter-intuitive images and pre-training images hinders VLMs from perceiving commonsense-violating entity relationships that are not seen during pre-training. To address this challenge, we propose a framework that unifies VLMs and knowledge graphs to explicitly incorporate commonsense knowledge into counter-intuitive reasoning. Starting with extracting fine-grained visual entities and their relationships using a VLM, we then reason about commonsense knowledge of the entities and relationships through knowledge graph completion, and finally interpret the deviations between the visual content and the commonsense knowledge using a pre-trained Large Language Model (LLM). Experiments on the visual dataset WHOOPS! demonstrate the effectiveness of our method.
KW - Counter-Intuitive Reasoning
KW - Knowledge Graph
KW - Vision-Language Model
UR - https://www.scopus.com/pages/publications/105013061781
U2 - 10.1109/CVIDL65390.2025.11085609
DO - 10.1109/CVIDL65390.2025.11085609
M3 - Conference contribution
AN - SCOPUS:105013061781
T3 - 2025 6th International Conference on Computer Vision, Image and Deep Learning, CVIDL 2025
SP - 861
EP - 866
BT - 2025 6th International Conference on Computer Vision, Image and Deep Learning, CVIDL 2025
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th International Conference on Computer Vision, Image and Deep Learning, CVIDL 2025
Y2 - 23 May 2025 through 25 May 2025
ER -