
Unifying Vision-Language Models and Knowledge Graphs for Zero-Shot Counter-Intuitive Reasoning in Images

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Peer-reviewed

Abstract

Counter-intuitive visual reasoning aims to recognize and interpret why weird, unusual, or uncanny image content violates common sense or lacks a basis in reality. Pre-trained Vision-Language Models (VLMs) are an intuitively appealing choice because of their impressive zero-shot capability for understanding and reasoning. However, the inherent gap between counter-intuitive images and pre-training images hinders VLMs from perceiving commonsense-violating entity relationships that were not seen during pre-training. To address this challenge, we propose a framework that unifies VLMs and knowledge graphs to explicitly incorporate commonsense knowledge into counter-intuitive reasoning. We first extract fine-grained visual entities and their relationships using a VLM, then infer commonsense knowledge about these entities and relationships through knowledge graph completion, and finally interpret the deviations between the visual content and the commonsense knowledge using a pre-trained Large Language Model (LLM). Experiments on the visual dataset WHOOPS! demonstrate the effectiveness of our method.
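The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the abstract does not name the models, knowledge graph, or prompts used, so `Triple`, `find_deviations`, `build_prompt`, `reason`, and the toy stand-ins (`toy_extract`, `toy_complete`, `toy_explain`, `TOY_KG`) are all hypothetical names, and the candle example is invented for illustration. The membership check used to detect a deviation is likewise an assumed simplification.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Triple:
    """A (head, relation, tail) fact, e.g. extracted from an image."""
    head: str
    relation: str
    tail: str


def find_deviations(
    visual: List[Triple],
    complete_tail: Callable[[str, str], List[str]],
) -> List[Tuple[Triple, List[str]]]:
    """Keep visual triples whose tail is not among the commonsense tails."""
    deviations = []
    for t in visual:
        expected = complete_tail(t.head, t.relation)
        # Skip triples for which the knowledge source offers no expectation.
        if expected and t.tail not in expected:
            deviations.append((t, expected))
    return deviations


def build_prompt(deviations: List[Tuple[Triple, List[str]]]) -> str:
    """Turn the deviations into a text prompt for a language model."""
    lines = ["Explain why the following image content is unusual:"]
    for t, expected in deviations:
        lines.append(
            f"- seen: {t.head} {t.relation} {t.tail}; "
            f"commonly expected: {', '.join(expected)}"
        )
    return "\n".join(lines)


def reason(image, extract_triples, complete_tail, explain) -> str:
    """Stage 1: extract triples. Stage 2: complete. Stage 3: explain."""
    visual = extract_triples(image)
    deviations = find_deviations(visual, complete_tail)
    if not deviations:
        return "No commonsense deviation found."
    return explain(build_prompt(deviations))


# --- Toy stand-ins (hypothetical; a real system would call actual models) ---
def toy_extract(image) -> List[Triple]:
    # Stand-in for a VLM; ignores the image and returns fixed triples.
    return [
        Triple("candle", "located in", "sealed bottle"),
        Triple("candle", "has state", "lit"),
    ]


TOY_KG = {
    ("candle", "located in"): ["table", "candle holder"],
    ("candle", "has state"): ["lit", "unlit"],
}


def toy_complete(head: str, relation: str) -> List[str]:
    # Stand-in for knowledge graph completion: a plain dictionary lookup.
    return TOY_KG.get((head, relation), [])


def toy_explain(prompt: str) -> str:
    # Stand-in for an LLM: simply echoes the prompt it receives.
    return prompt


if __name__ == "__main__":
    print(reason(None, toy_extract, toy_complete, toy_explain))
```

Passing the three stages in as callables keeps the sketch independent of any particular VLM, knowledge graph, or LLM; each stand-in could be swapped for a real component without changing `reason`.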

Original language: English
Title of host publication: 2025 6th International Conference on Computer Vision, Image and Deep Learning, CVIDL 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 861-866
Number of pages: 6
ISBN (electronic): 9798331523244
DOI
Publication status: Published - 2025
Event: 6th International Conference on Computer Vision, Image and Deep Learning, CVIDL 2025 - Ningbo, China
Duration: 23 May 2025 → 25 May 2025

Publication series

Name: 2025 6th International Conference on Computer Vision, Image and Deep Learning, CVIDL 2025

Conference

Conference: 6th International Conference on Computer Vision, Image and Deep Learning, CVIDL 2025
Country/Territory: China
City: Ningbo
Period: 23/05/25 → 25/05/25
