A multi-step on-policy deep reinforcement learning method assisted by off-policy policy evaluation

Huaqing Zhang, Hongbin Ma*, Bemnet Wondimagegnehu Mersha, Ying Jin

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

On-policy deep reinforcement learning (DRL) has the inherent advantage of using multi-step interaction data for policy learning. However, improving the sample efficiency of policy evaluation remains a challenge for on-policy DRL. We therefore propose a multi-step on-policy DRL method assisted by off-policy policy evaluation (abbreviated as MSOAO), which integrates on-policy and off-policy policy evaluation and represents a new type of DRL method. We propose a low-pass filtering algorithm for state-values that performs off-policy policy evaluation and efficiently assists on-policy policy evaluation. The filtered state-values and the multi-step interaction data are used as the input of the V-trace algorithm. The state-value function is then learned by simultaneously approximating the target state-values obtained from the V-trace output and the action-values of the current policy. The action-value function is learned by using the one-step bootstrapping algorithm to approximate the target action-values obtained from the V-trace output. Extensive evaluation results indicate that MSOAO outperformed state-of-the-art on-policy DRL algorithms, and that learning the state-value function and the action-value function simultaneously in MSOAO lets the two promote each other, thus improving the learning capability of the algorithm.
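The abstract does not specify the form of the low-pass filter or exactly which sequence of state-values it is applied to. As a hypothetical illustration only, the sketch below uses the simplest first-order low-pass filter, exponential smoothing, applied to successive estimates of the same states' values across updates. The function name `low_pass_filter` and the smoothing coefficient `beta` are assumptions, not MSOAO's published design.

```python
import numpy as np


def low_pass_filter(prev_filtered, new_estimate, beta=0.9):
    """First-order low-pass filter (exponential smoothing).

    Hypothetical stand-in for MSOAO's filter: each element blends the
    previous filtered state-value with a new, noisier estimate.
    A beta closer to 1 gives stronger smoothing (a lower cutoff frequency).
    """
    if not 0.0 <= beta < 1.0:
        raise ValueError("beta must be in [0, 1)")
    prev_filtered = np.asarray(prev_filtered, dtype=float)
    new_estimate = np.asarray(new_estimate, dtype=float)
    return beta * prev_filtered + (1.0 - beta) * new_estimate


# Usage: smooth noisy estimates of four states whose true value is 5.0.
rng = np.random.default_rng(0)
true_values = np.full(4, 5.0)
filtered = true_values + rng.normal(0.0, 1.0, size=4)  # first (noisy) estimate
for _ in range(200):
    noisy = true_values + rng.normal(0.0, 1.0, size=4)
    filtered = low_pass_filter(filtered, noisy, beta=0.9)
```

A larger `beta` lowers the cutoff frequency: the filtered values have lower variance but track genuine changes in the value estimates more slowly.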
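The V-trace step can be illustrated with the standard V-trace backward recursion introduced with IMPALA (Espeholt et al., 2018). In this sketch the `values` argument stands for the filtered state-values, and the one-step action-value target q_t = r_t + γ v_{t+1} follows IMPALA's convention; it is only an assumption about how MSOAO derives its target action-values. Episode terminations are not handled, and the function name and signature are hypothetical.

```python
import numpy as np


def vtrace_targets(rewards, values, bootstrap_value, log_rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace targets for one trajectory of length T (no terminal handling).

    rewards, values, log_rhos: arrays of shape [T]; values[t] = V(x_t),
    here assumed to be the filtered state-values; bootstrap_value = V(x_T).
    log_rhos[t] = log pi(a_t|x_t) - log mu(a_t|x_t) (current vs behavior policy).
    Returns (vs, q_targets): vs[t] is the V-trace state-value target and
    q_targets[t] = r_t + gamma * vs[t+1] is the one-step action-value target.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    ratios = np.exp(np.asarray(log_rhos, dtype=float))
    rhos = np.minimum(rho_bar, ratios)  # truncated importance weights for TD errors
    cs = np.minimum(c_bar, ratios)      # truncated trace coefficients
    next_values = np.append(values[1:], bootstrap_value)
    deltas = rhos * (rewards + gamma * next_values - values)

    # Backward recursion: v_t - V(x_t) = delta_t + gamma * c_t * (v_{t+1} - V(x_{t+1}))
    corrections = np.zeros_like(values)
    acc = 0.0
    for t in reversed(range(len(rewards))):
        acc = deltas[t] + gamma * cs[t] * acc
        corrections[t] = acc
    vs = values + corrections

    # One-step bootstrapped action-value targets from the V-trace output
    next_vs = np.append(vs[1:], bootstrap_value)
    q_targets = rewards + gamma * next_vs
    return vs, q_targets
```

When the behavior and current policies coincide (all log-ratios zero) and `rho_bar = c_bar = 1`, the V-trace target reduces to the ordinary n-step bootstrapped return; the truncation only matters once the policy has moved away from the one that collected the data.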
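One plausible reading of the two learning rules is sketched below for discrete actions: the state-value prediction is regressed toward both the V-trace state-value target and the current policy's expected action-value Σ_a π(a|s) Q(s,a), while Q(s_t, a_t) is regressed toward the V-trace action-value target. The weighting `lam`, the use of the policy-weighted expectation, and the function name `critic_losses` are assumptions; the abstract does not give the exact loss functions.

```python
import numpy as np


def critic_losses(v_pred, q_pred_all, actions, pi_probs,
                  vs_targets, q_targets, lam=0.5):
    """Hypothetical MSOAO-style critic losses for discrete actions.

    v_pred: [T] predicted V(x_t); q_pred_all: [T, A] predicted Q(x_t, .);
    actions: [T] integer actions taken; pi_probs: [T, A] current policy pi(.|x_t);
    vs_targets, q_targets: [T] V-trace state-value and action-value targets.
    lam: assumed weight on the term pulling V toward E_{a~pi} Q(x_t, a).
    In a trainable implementation each target (including E_pi Q) would be
    held fixed (stop-gradient) when differentiating the corresponding loss.
    """
    v_pred = np.asarray(v_pred, dtype=float)
    q_pred_all = np.asarray(q_pred_all, dtype=float)
    actions = np.asarray(actions, dtype=int)
    pi_probs = np.asarray(pi_probs, dtype=float)
    vs_targets = np.asarray(vs_targets, dtype=float)
    q_targets = np.asarray(q_targets, dtype=float)

    # Action-values of the current policy, averaged under pi: sum_a pi(a|x) Q(x, a)
    v_from_q = np.sum(pi_probs * q_pred_all, axis=1)
    value_loss = (np.mean((v_pred - vs_targets) ** 2)
                  + lam * np.mean((v_pred - v_from_q) ** 2))

    # Q(x_t, a_t) regressed toward the one-step bootstrapped V-trace target
    q_taken = q_pred_all[np.arange(len(actions)), actions]
    q_loss = np.mean((q_taken - q_targets) ** 2)
    return float(value_loss), float(q_loss)
```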

Original language: English
Journal: Applied Intelligence
Publication status: Accepted/In press - 2024

Keywords

  • Deep reinforcement learning
  • Low-pass filter
  • On-policy and off-policy
  • Policy evaluation
  • Policy gradient
