TY - GEN
T1 - A Progressive Approach to Learn Global and Local Multi-view Features for 3D Visual Grounding
AU - Yang, Ken
AU - Zhao, Sanyuan
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2026.
PY - 2026
AB - The 3D visual grounding task aims to localize objects in point clouds from natural language descriptions and plays a significant role in domains such as autonomous driving and augmented reality. In this task, the inconsistency between the observation perspective described in the text and the viewpoint of the point cloud causes view confusion, which hinders the model’s ability to accurately localize target objects. To address this issue, this paper proposes a progressive multi-view feature approach that supplements point cloud information from different perspectives with sequential-form global multi-view features and vector-form local multi-view features. The method progressively learns multi-view point cloud features within the model and designs an explicit interaction between object relative positions and textual descriptions to enhance the model’s comprehension of multimodal information. Furthermore, we introduce a selective state space model as the learning module for the sequential global multi-view features, which improves accuracy while reducing memory consumption and training time. Experimental results demonstrate that the proposed method outperforms existing state-of-the-art approaches on public datasets.
KW - 3D visual grounding
KW - Multi-modal learning
UR - https://www.scopus.com/pages/publications/105022184012
DO - 10.1007/978-981-95-3393-0_29
M3 - Conference contribution
AN - SCOPUS:105022184012
SN - 9789819533923
T3 - Lecture Notes in Computer Science
SP - 353
EP - 364
BT - Image and Graphics - 13th International Conference, ICIG 2025, Proceedings
A2 - Lin, Zhouchen
A2 - Wang, Liang
A2 - Jiang, Yugang
A2 - Wang, Xuesong
A2 - Liao, Shengcai
A2 - Shan, Shiguang
A2 - Liu, Risheng
A2 - Dong, Jing
A2 - Yu, Xin
PB - Springer Science and Business Media Deutschland GmbH
T2 - 13th International Conference on Image and Graphics, ICIG 2025
Y2 - 31 October 2025 through 2 November 2025
ER -