TY - JOUR
T1 - Exploring Grounding Abilities in Vision-Language Models through Contextual Perception
AU - Xu, Wei
AU - Zhou, Tianfei
AU - Zhang, Taoyuan
AU - Li, Jie
AU - Chen, Peiyin
AU - Pan, Jia
AU - Liu, Xiaofeng
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2025
Y1 - 2025
N2 - Vision language models (VLMs) have demonstrated strong general capabilities and achieved great success in areas such as image understanding and reasoning. Visual prompts enhance the focus of VLMs on designated areas, but their fine-grained grounding abilities have not been fully developed. Recent research has used the Set-of-Mark (SoM) approach to unleash the grounding capabilities of Generative Pre-trained Transformer-4 with Vision (GPT-4V), achieving significant benchmark performance. However, SoM still suffers from label offset and VLM hallucination, and the grounding ability of VLMs remains limited, making it challenging to handle complex scenarios in human-robot interaction. To address these limitations and provide more accurate and less hallucinatory results, we propose Contextual Set-of-Mark (ConSoM), a new SoM-based prompting mechanism that leverages dual-image inputs and the contextual semantic information of images. Experiments demonstrate that ConSoM has distinct advantages in visual grounding, improving by 11% over the baseline on the RefCOCOg dataset. Furthermore, we evaluated ConSoM’s grounding abilities in five indoor scenarios, where it exhibited strong robustness in complex environments and under occlusion conditions. We also introduce a scalable annotation method for pixel-level question-answering datasets. The accuracy, scalability, and depth of world knowledge make ConSoM a highly effective approach for future human-robot interactions.
AB - Vision language models (VLMs) have demonstrated strong general capabilities and achieved great success in areas such as image understanding and reasoning. Visual prompts enhance the focus of VLMs on designated areas, but their fine-grained grounding abilities have not been fully developed. Recent research has used the Set-of-Mark (SoM) approach to unleash the grounding capabilities of Generative Pre-trained Transformer-4 with Vision (GPT-4V), achieving significant benchmark performance. However, SoM still suffers from label offset and VLM hallucination, and the grounding ability of VLMs remains limited, making it challenging to handle complex scenarios in human-robot interaction. To address these limitations and provide more accurate and less hallucinatory results, we propose Contextual Set-of-Mark (ConSoM), a new SoM-based prompting mechanism that leverages dual-image inputs and the contextual semantic information of images. Experiments demonstrate that ConSoM has distinct advantages in visual grounding, improving by 11% over the baseline on the RefCOCOg dataset. Furthermore, we evaluated ConSoM’s grounding abilities in five indoor scenarios, where it exhibited strong robustness in complex environments and under occlusion conditions. We also introduce a scalable annotation method for pixel-level question-answering datasets. The accuracy, scalability, and depth of world knowledge make ConSoM a highly effective approach for future human-robot interactions.
KW - human-robot interaction
KW - Large language model
KW - prompt engineering
KW - visual grounding
UR - http://www.scopus.com/inward/record.url?scp=105004695236&partnerID=8YFLogxK
U2 - 10.1109/TCDS.2025.3566649
DO - 10.1109/TCDS.2025.3566649
M3 - Article
AN - SCOPUS:105004695236
SN - 2379-8920
JO - IEEE Transactions on Cognitive and Developmental Systems
JF - IEEE Transactions on Cognitive and Developmental Systems
ER -