An efficient and lightweight off-policy actor–critic reinforcement learning framework

Huaqing Zhang, Hongbin Ma*, Xiaofei Zhang, Bemnet Wondimagegnehu Mersha, Li Wang, Ying Jin

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

In the framework of current off-policy actor–critic methods, the state–action pairs stored in the experience replay buffer (called historical behaviors) cannot be used to improve the policy, and the target-network and clipped double Q-learning techniques must be used to evaluate it. This framework limits the policy learning capability in complex environments and requires maintaining four critic networks. We therefore propose an efficient and lightweight off-policy actor–critic (EL-AC) framework. For policy improvement, an efficient off-policy likelihood-ratio policy gradient algorithm with historical behaviors reusing (PG-HBR) is proposed, which enables the agent to learn an approximately optimal policy from the historical behaviors. Moreover, a theoretically interpretable universal critic network is designed, which approximates the action-value and state-value functions simultaneously and thus yields the advantage function required by PG-HBR. For policy evaluation, we develop a low-pass filtering algorithm for target state-values and an adaptive control algorithm for overestimation bias, which together evaluate the policy efficiently and accurately using only one universal critic network. Extensive evaluation results indicate that EL-AC outperforms state-of-the-art off-policy actor–critic methods in terms of approximately optimal policy learning and neural network storage footprint, and that it is better suited to policy learning in complex environments.
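
Since the paper's equations are not reproduced on this page, the PyTorch sketch below is only a rough illustration of the four cooperating ideas the abstract names, under explicit assumptions: a likelihood-ratio policy gradient weighted by importance ratios over replayed (historical) state–action pairs, one critic with a shared trunk and two heads approximating Q(s, a) and V(s) so the advantage A = Q − V is available, a first-order low-pass filter applied to the target state-value level, and a shrinkage coefficient that damps the bootstrapped target when the critic looks optimistic. All names (UniversalCritic, el_ac_update, v_bar, tau_f, alpha, kappa) and the concrete filter and bias-control formulas are hypothetical stand-ins, not the authors' algorithms.

```python
# Hypothetical sketch of the EL-AC ideas described in the abstract.
# Not the authors' implementation; names and formulas are assumptions.
import torch
import torch.nn as nn
from torch.distributions import Normal

class UniversalCritic(nn.Module):
    """One network, shared trunk, two heads: Q(s, a) and V(s) together,
    so A(s, a) = Q - V needs no extra critic (assumed design)."""
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(s_dim, hidden), nn.ReLU())
        self.v_head = nn.Linear(hidden, 1)
        self.q_head = nn.Sequential(
            nn.Linear(hidden + a_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
    def forward(self, s, a):
        h = self.trunk(s)
        return self.q_head(torch.cat([h, a], dim=-1)), self.v_head(h)

class GaussianActor(nn.Module):
    def __init__(self, s_dim, a_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(s_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2 * a_dim))
    def dist(self, s):
        mu, log_std = self.net(s).chunk(2, dim=-1)
        return Normal(mu, log_std.clamp(-5.0, 2.0).exp())

def el_ac_update(actor, critic, opt_a, opt_c, batch, state,
                 gamma=0.99, tau_f=0.05, alpha=0.5, kappa=1.0):
    s, a, r, s2, logp_old = batch          # replayed "historical behaviors"

    # ---- policy evaluation (single critic, no target network) ----
    with torch.no_grad():
        q2, v2 = critic(s2, actor.dist(s2).sample())
        # Crude first-order low-pass filter on the target state-value level,
        # standing in for the paper's filtering of target state-values.
        state["v_bar"] = (1.0 - tau_f) * state["v_bar"] + tau_f * v2.mean().item()
        v_smooth = (1.0 - alpha) * state["v_bar"] + alpha * v2
        # Crude adaptive control of overestimation bias: shrink the target
        # more strongly when Q(s', a') runs above V(s').
        over = torch.relu(q2 - v2).mean().item()
        beta = 1.0 / (1.0 + kappa * over)
        target = r + gamma * beta * v_smooth
    q, v = critic(s, a)
    # Second term crudely ties V(s) to Q(s, a) under the policy; the paper
    # derives a principled joint objective for the universal critic.
    critic_loss = ((q - target) ** 2).mean() + ((v - q.detach()) ** 2).mean()
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    # ---- policy improvement: off-policy likelihood-ratio gradient ----
    logp = actor.dist(s).log_prob(a).sum(dim=-1, keepdim=True)
    with torch.no_grad():
        ratio = (logp - logp_old).exp().clamp(max=10.0)   # pi/mu weight
        q, v = critic(s, a)
        adv = q - v                   # advantage from the single critic
    actor_loss = -(ratio * adv * logp).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()
    return critic_loss.item(), actor_loss.item()

# Smoke test with random tensors standing in for a replay buffer.
s_dim, a_dim, n = 8, 2, 32
actor, critic = GaussianActor(s_dim, a_dim), UniversalCritic(s_dim, a_dim)
opt_a = torch.optim.Adam(actor.parameters(), lr=3e-4)
opt_c = torch.optim.Adam(critic.parameters(), lr=3e-4)
batch = (torch.randn(n, s_dim), torch.randn(n, a_dim), torch.randn(n, 1),
         torch.randn(n, s_dim), torch.zeros(n, 1))
print(el_ac_update(actor, critic, opt_a, opt_c, batch, state={"v_bar": 0.0}))
```

The shared-trunk critic is what makes such a sketch "lightweight": one network supplies Q, V, and hence the advantage, whereas clipped double Q-learning with target networks maintains four critic networks (two online, two target).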

Original language: English
Article number: 111814
Journal: Applied Soft Computing
Volume: 163
Publication status: Published - September 2024

Keywords

  • Actor–critic reinforcement learning
  • Adaptive overestimation bias control
  • Low-pass filter
  • Neural networks
  • Policy gradient
