TNCOA: Efficient Exploration via Observation-Action Constraint on Trajectory-Based Intrinsic Reward

  • Jingxiang Ma
  • Hongbin Ma*
  • Youzhi Zhang*

*Corresponding author for this work

Research output: Contribution to journal, Article, peer-review

Abstract

Efficient exploration is critical in handling sparse rewards and partial observability in deep reinforcement learning. However, most existing intrinsic reward methods based on novelty rely on single-step observations or Euclidean distances. These approaches struggle to capture trajectory-level novelty and often perform poorly in partially observable settings. Moreover, they typically ignore the role of actions in driving observation changes, as not all actions lead to meaningful state transitions. To overcome these limitations, we propose a trajectory-level novelty measure that estimates the novelty of a state by comparing current observations with past ones along the trajectory. To focus on meaningful exploration, we incorporate the mutual information between actions and trajectory novelty to filter out random fluctuations and retain only novelty caused by the agent's actions. Additionally, we introduce a first-visit constraint on observation–action pairs, rewarding only interactions that result in state transitions to enhance exploration efficiency. We conducted experiments in the MiniGrid-ObstructedMaze environment characterised by complex object interactions and sparse rewards. Results demonstrate that our method achieves state-of-the-art performance in convergence speed and average returns. Furthermore, it shows strong generalisation on high-dimensional Atari benchmarks and demonstrates robust performance in more challenging MiniGrid variants. Implementation code is available at: https://github.com/MurrayMa0816/TNCOA.
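The first-visit constraint on observation–action pairs described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that). The class name `FirstVisitGate`, the per-episode reset, the use of hashable observations as keys, and the equality test used to detect a state transition are all assumptions made for illustration. The trajectory-level novelty score is treated as an input, since the abstract does not specify how it is computed.

```python
from typing import Hashable, Set, Tuple


class FirstVisitGate:
    """Hypothetical helper: pays a novelty bonus at most once per
    (observation, action) pair, and only if the observation changed."""

    def __init__(self) -> None:
        # Pairs that have already been rewarded (assumed scope: one episode).
        self._seen: Set[Tuple[Hashable, Hashable]] = set()

    def reset(self) -> None:
        """Forget all visited pairs, e.g. at the start of an episode."""
        self._seen.clear()

    def reward(self, obs: Hashable, action: Hashable,
               next_obs: Hashable, novelty: float) -> float:
        """Return `novelty` on the first visit of (obs, action) that
        changes the observation; return 0.0 otherwise."""
        if next_obs == obs:
            # Assumption: an unchanged observation means no state transition.
            return 0.0
        key = (obs, action)
        if key in self._seen:
            return 0.0
        self._seen.add(key)
        return novelty


gate = FirstVisitGate()
r_first = gate.reward((0, 0), "forward", (0, 1), novelty=0.5)   # new pair, observation changed
r_repeat = gate.reward((0, 0), "forward", (0, 1), novelty=0.5)  # same pair again
r_noop = gate.reward((0, 1), "toggle", (0, 1), novelty=0.9)     # observation unchanged
```

In this sketch only the first call yields a non-zero bonus. Repeated pairs and actions that leave the observation unchanged are filtered out, which is the gating behaviour the abstract attributes to the constraint.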

Original language: English
Journal: CAAI Transactions on Intelligence Technology
Publication status: Accepted/In press - 2026

Keywords

  • artificial intelligence
  • decision making
  • intelligent systems
  • machine learning
