Skip to main navigation Skip to search Skip to main content

UNeMo: Collaborative Visual-Language Reasoning and Navigation via a Multimodal World Model

  • Changxin Huang
  • , Lv Tang
  • , Zhaohuan Zhan
  • , Lisha Yu
  • , Runhao Zeng
  • , Zun Liu
  • , Zhengjie Wang
  • , Jianqiang Li*
  • *Corresponding author for this work
  • Shenzhen University
  • Shenzhen MSU-BIT University
  • Sun Yat-Sen University
  • Beijing Institute of Technology

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

Abstract

Vision-and-Language Navigation (VLN) requires agents to autonomously navigate complex environments via visual images and natural language instructions—remains highly challenging. Recent research on enhancing language-guided navigation reasoning using pre-trained large language models (LLMs) has shown promising prospects. However, the reasoning of such methods is limited to the linguistic modality, lacking visual reasoning capabilities. Moreover, existing reasoning modules are optimized separately from navigation policies, leading to incompatibility and potential conflicts in optimization objectives. To tackle these challenges, we introduce UNeMo, a novel framework designed for the collaborative optimization of visual state reasoning and navigational decision-making. It introduces a Multimodal World Model (MWM) that takes visual features, language instructions, and navigational actions as inputs to jointly predict subsequent visual states, enabling cross-modal reasoning. Via a Hierarchical Prediction-Feedback (HPN) mechanism, MWM collaborates with navigation policies: the first layer generates actions using current vision-and-language features; MWM then infers post-action visual states to guide the second layer’s fine-grained decisions. This forms a dynamic bidirectional promotion mechanism where MWM reasoning optimizes navigation policies, while policy decisions feedback to improve MWM’s reasoning accuracy. Experiments on R2R and REVERIE datasets show UNeMo outperforms state-of-the-art methods by 2.1% and 0.7% in navigation accuracy for unseen scenes, validating its effectiveness.

Original languageEnglish
Title of host publicationProceedings of the AAAI Conference on Artificial Intelligence
EditorsSven Koenig, Chad Jenkins, Matthew E. Taylor
PublisherAssociation for the Advancement of Artificial Intelligence
Pages18315
Number of pages1
Edition22
ISBN (Print)9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067, 9781577359067
DOIs
Publication statusPublished - 2026
Externally publishedYes
Event40th AAAI Conference on Artificial Intelligence, AAAI 2026 - Singapore, Singapore
Duration: 20 Jan 202627 Jan 2026

Publication series

NameProceedings of the AAAI Conference on Artificial Intelligence
Number22
Volume40
ISSN (Print)2159-5399
ISSN (Electronic)2374-3468

Conference

Conference40th AAAI Conference on Artificial Intelligence, AAAI 2026
Country/TerritorySingapore
CitySingapore
Period20/01/2627/01/26

Fingerprint

Dive into the research topics of 'UNeMo: Collaborative Visual-Language Reasoning and Navigation via a Multimodal World Model'. Together they form a unique fingerprint.

Cite this