Cross-Modal Match for Language Conditioned 3D Object Grounding

Yachao Zhang, Runze Hu, Ronghui Li, Yanyun Qu, Yuan Xie, Xiu Li*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-reviewed

Abstract

Language conditioned 3D object grounding aims to find the object in a 3D scene that a natural language description refers to, which mainly depends on matching between the visual and natural language modalities. Considerable improvements in grounding performance have been achieved by improving the multi-modal fusion mechanism or by bridging the gap between detection and matching. However, several mismatches are ignored, namely the mismatch between local visual representations and the global sentence representation, and the mismatch between the visual space and the corresponding label word space. In this paper, we propose cross-modal match for 3D grounding from the perspective of mitigating these mismatches. Specifically, to match local visual features with the global description sentence, we propose a BEV (bird's-eye-view) based global information embedding module. It projects multiple object proposal features into the BEV, and the relations among different objects are captured by a visual transformer, which can model both positions and features with long-range dependencies. To circumvent the mismatch between the feature spaces of different modalities, we propose cross-modal consistency learning. It applies cross-modal consistency constraints to convert the visual feature space into the label word feature space, which makes matching easier. In addition, we introduce a label distillation loss and a global distillation loss to drive the learning of these matches through distillation. We evaluate our method under mainstream evaluation settings on three datasets, and the results demonstrate the effectiveness of the proposed method.
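The BEV projection step mentioned in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the function name `proposals_to_bev`, the grid size, the scene bounds, and the choice to average features that land in the same cell are all assumptions made for illustration. It only shows how per-proposal features might be scattered onto a top-down grid, which a transformer could then consume as a sequence of cells.

```python
import numpy as np

def proposals_to_bev(centers, feats, bounds, grid=(8, 8)):
    """Scatter per-proposal features onto a bird's-eye-view grid.

    centers: (N, 3) proposal centers (x, y, z) in scene coordinates.
    feats:   (N, C) proposal features.
    bounds:  (x_min, x_max, y_min, y_max) ground-plane extent (assumed known).
    grid:    (H, W) number of BEV cells along y and x (illustrative choice).
    """
    x_min, x_max, y_min, y_max = bounds
    h, w = grid

    # Ignoring z is what turns the 3D layout into a top-down view.
    col = np.floor((centers[:, 0] - x_min) / (x_max - x_min) * w).astype(int)
    row = np.floor((centers[:, 1] - y_min) / (y_max - y_min) * h).astype(int)
    col = np.clip(col, 0, w - 1)
    row = np.clip(row, 0, h - 1)

    bev = np.zeros((h, w, feats.shape[1]), dtype=np.float64)
    count = np.zeros((h, w), dtype=np.int64)

    # np.add.at accumulates correctly when several proposals share a cell.
    np.add.at(bev, (row, col), feats)
    np.add.at(count, (row, col), 1)

    # Assumption: average the features of proposals that fall in the same cell.
    occupied = count > 0
    bev[occupied] /= count[occupied][:, None]
    return bev, count

# Three hypothetical proposals in a 4 m x 4 m scene; the first two share a cell.
centers = np.array([[0.5, 0.5, 1.0], [0.6, 0.4, 2.0], [3.5, 3.5, 0.0]])
feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
bev, count = proposals_to_bev(centers, feats, bounds=(0, 4, 0, 4), grid=(4, 4))
```

Flattening `bev` to shape `(H * W, C)` gives one token per cell, so position on the ground plane is carried by the token index. How the paper actually encodes positions and builds the transformer input is not stated in the abstract.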

Original language: English
Title of host publication: Technical Tracks 14
Editors: Michael Wooldridge, Jennifer Dy, Sriraam Natarajan
Publisher: Association for the Advancement of Artificial Intelligence
Pages: 7359-7367
Number of pages: 9
Edition: 7
ISBN (electronic): 1577358872, 9781577358879
DOI
Publication status: Published - 25 Mar 2024
Event: 38th AAAI Conference on Artificial Intelligence, AAAI 2024 - Vancouver, Canada
Duration: 20 Feb 2024 to 27 Feb 2024

Publication series

Name: Proceedings of the AAAI Conference on Artificial Intelligence
Number: 7
Volume: 38
ISSN (print): 2159-5399
ISSN (electronic): 2374-3468

Conference

Conference: 38th AAAI Conference on Artificial Intelligence, AAAI 2024
Country/Territory: Canada
City: Vancouver
Period: 20/02/24 to 27/02/24
