OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields With Fine-Grained Understanding

Yinan Deng, Jiahui Wang, Jingyu Zhao, Jianyu Dou, Yi Yang, Yufeng Yue*

*Corresponding author for this work

Research output: Contribution to journalArticlepeer-review

1 Citation (Scopus)

Abstract

In recent years, there has been a surge of interest in open-vocabulary 3D scene reconstruction facilitated by visual language models (VLMs), which showcase remarkable capabilities in open-set retrieval tasks. Although the semantic ambiguity of existing point-wise feature maps is alleviated by open-vocabulary mask segmenters for object-level understanding, effectively retaining fine-grained features within objects simultaneously remains challenging. To address these challenges, we introduce OpenObj, an innovative approach to build open-vocabulary object-level Neural Radiance Fields (NeRF) with fine-grained understanding. In essence, OpenObj establishes a robust framework for efficient and watertight scene modeling and comprehension at the object level. Specifically, we obtain cross-frame consistent instance-level masks for supervision through our two-stage mask clustering module. Moreover, by incorporating part-level features into the object NeRF models, OpenObj not only captures object-level instances but also preserves an understanding of their internal granularity. The results on multiple datasets demonstrate that OpenObj achieves superior performance in zero-shot segmentation and retrieval tasks. Additionally, OpenObj supports real-world robotics tasks at several levels, including global movement and local manipulation.

Original languageEnglish
Pages (from-to)652-659
Number of pages8
JournalIEEE Robotics and Automation Letters
Volume10
Issue number1
DOIs
Publication statusPublished - 2025

Keywords

  • Implicit mapping
  • object-level NeRF
  • open-vocabulary
  • representation

Fingerprint

Dive into the research topics of 'OpenObj: Open-Vocabulary Object-Level Neural Radiance Fields With Fine-Grained Understanding'. Together they form a unique fingerprint.

Cite this