A Progressive Approach to Learn Global and Local Multi-view Features for 3D Visual Grounding

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The 3D visual grounding task aims to localize objects in point clouds based on natural language descriptions, playing a significant role in various domains such as autonomous driving and augmented reality. In this task, the view inconsistency between textual observation perspectives and point cloud viewpoints causes view confusion problems that hinder the model’s ability to accurately localize target objects. To address this issue, this paper proposes a progressive multi-view feature approach to supplement point cloud information from different perspectives, which includes sequential-form global multi-view features and vector-form local multi-view features. This method progressively learns multi-view point cloud features within the model while designing explicit interaction between object relative positions and textual descriptions to enhance the model’s comprehension of multimodal information. Furthermore, we introduce a selective state space model as the learning module for sequential global multi-view features, which improves model accuracy while reducing memory consumption and training time. Experimental results demonstrate that the proposed method achieves superior performance over existing state-of-the-art approaches on public datasets.

Original language: English
Title of host publication: Image and Graphics - 13th International Conference, ICIG 2025, Proceedings
Editors: Zhouchen Lin, Liang Wang, Yugang Jiang, Xuesong Wang, Shengcai Liao, Shiguang Shan, Risheng Liu, Jing Dong, Xin Yu
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 353-364
Number of pages: 12
ISBN (Print): 9789819533923
Publication status: Published - 2026
Event: 13th International Conference on Image and Graphics, ICIG 2025 - Xuzhou, China
Duration: 31 Oct 2025 to 2 Nov 2025

Publication series

Name: Lecture Notes in Computer Science
Volume: 16162 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 13th International Conference on Image and Graphics, ICIG 2025
Country/Territory: China
City: Xuzhou
Period: 31/10/25 to 2/11/25

Keywords

  • 3D visual grounding
  • Multi-modal learning
