Exploring Grounding Abilities in Vision-Language Models through Contextual Perception

Wei Xu, Tianfei Zhou, Taoyuan Zhang, Jie Li, Peiyin Chen, Jia Pan, Xiaofeng Liu*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Vision-language models (VLMs) have demonstrated strong general capabilities and achieved great success in areas such as image understanding and reasoning. Visual prompts sharpen the focus of VLMs on designated regions, but their fine-grained grounding remains underdeveloped. Recent research has used the Set-of-Mark (SoM) approach to unleash the grounding capabilities of Generative Pre-trained Transformer-4 with Vision (GPT-4V), achieving significant benchmark performance. However, SoM still suffers from label offset and from the hallucinations of vision-language models, and the grounding ability of VLMs remains limited, making it challenging to handle complex scenarios in human-robot interaction. To address these limitations and provide more accurate, less hallucinatory results, we propose Contextual Set-of-Mark (ConSoM), a new SoM-based prompting mechanism that leverages dual-image inputs and the contextual semantic information of images. Experiments demonstrate that ConSoM has distinct advantages in visual grounding, improving over the baseline by 11% on the RefCOCOg dataset. Furthermore, we evaluate ConSoM's grounding abilities in five indoor scenarios, where it exhibits strong robustness in complex environments and under occlusion. We also introduce a scalable annotation method for pixel-level question-answering datasets. Its accuracy, scalability, and depth of world knowledge make ConSoM a highly effective approach for future human-robot interaction.
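The abstract describes ConSoM only at a high level: a SoM-style prompt built from a raw image, a marked copy of the same image, and contextual scene text. The sketch below is a minimal, hypothetical illustration of that dual-image prompting idea, not the paper's implementation; the `overlay_marks` and `build_dual_image_prompt` helpers, the message schema, and the example region centroids are all assumptions made for illustration.

```python
"""Illustrative sketch of a Set-of-Mark (SoM) style dual-image prompt.

NOT the authors' ConSoM implementation: the helpers, the message
structure, and the example regions are assumptions for illustration.
"""

from PIL import Image, ImageDraw


def overlay_marks(image: Image.Image, centroids: list[tuple[int, int]]) -> Image.Image:
    """Return a copy of `image` with numeric marks drawn at each region centroid."""
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for idx, (x, y) in enumerate(centroids, start=1):
        r = 12  # radius of the circular mark background
        draw.ellipse((x - r, y - r, x + r, y + r), fill="white", outline="black")
        draw.text((x - 4, y - 7), str(idx), fill="black")
    return marked


def build_dual_image_prompt(raw: Image.Image,
                            marked: Image.Image,
                            context: str,
                            question: str) -> dict:
    """Pack the raw image, the marked image, scene context, and the query
    into a single message payload for a hypothetical VLM chat endpoint."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": f"Scene context: {context}"},
            {"type": "image", "image": raw, "note": "original view"},
            {"type": "image", "image": marked, "note": "same view with numbered marks"},
            {"type": "text", "text": question + " Answer with the mark number."},
        ],
    }


if __name__ == "__main__":
    # Synthetic example: a blank 320x240 scene with two hypothetical regions.
    scene = Image.new("RGB", (320, 240), color="lightgray")
    regions = [(80, 120), (230, 90)]  # centroids from any segmenter (assumption)

    marked_scene = overlay_marks(scene, regions)
    prompt = build_dual_image_prompt(
        scene, marked_scene,
        context="A tabletop with a mug on the left and a book on the right.",
        question="Which mark lies on the mug?",
    )
    print([part["type"] for part in prompt["content"]])
```

Supplying both the unmarked and the marked view, plus a textual scene description, is one plausible way to read "dual-image inputs and contextual semantic information"; the actual ConSoM prompt format is defined in the paper itself.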

Original language: English
Journal: IEEE Transactions on Cognitive and Developmental Systems
Publication status: Accepted/In press - 2025

Keywords

  • human-robot interaction
  • large language model
  • prompt engineering
  • visual grounding
