An efficient and lightweight off-policy actor–critic reinforcement learning framework

Huaqing Zhang, Hongbin Ma*, Xiaofei Zhang, Bemnet Wondimagegnehu Mersha, Li Wang, Ying Jin

*Corresponding author of this work

Research output: Contribution to journal › Article › peer-review

Abstract

In the framework of current off-policy actor–critic methods, the state–action pairs stored in the experience replay buffer (called historical behaviors) cannot be used to improve the policy, and the target network and clipped double Q-learning techniques are needed to evaluate the policy. This framework limits policy learning capability in complex environments and requires maintaining four critic networks. We therefore propose an efficient and lightweight off-policy actor–critic (EL-AC) framework. For policy improvement, an efficient off-policy likelihood-ratio policy gradient algorithm with historical behaviors reusing (PG-HBR) is proposed, which enables the agent to learn an approximately optimal policy from the historical behaviors. Moreover, a theoretically interpretable universal critic network is designed; it approximates the action-value and state-value functions simultaneously, so that the advantage function required by PG-HBR can be obtained. For policy evaluation, we develop a low-pass filtering algorithm for the target state-values and an adaptive control algorithm for the overestimation bias, which together evaluate the policy efficiently and accurately using only one universal critic network. Extensive evaluation results indicate that EL-AC outperforms state-of-the-art off-policy actor–critic methods in terms of approximately optimal policy learning and neural-network storage footprint, and that it is better suited to policy learning in complex environments.
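The following is a minimal sketch, not the authors' implementation, of two ideas named in the abstract: a single "universal critic" that outputs both a state-value V(s) and an action-value Q(s, a), from which the advantage A(s, a) = Q(s, a) - V(s) used by a likelihood-ratio policy gradient can be formed, and an exponential (low-pass) smoothing of target state-values. The network architecture, layer sizes, and smoothing coefficient are illustrative assumptions.

```python
# Minimal sketch (assumed PyTorch implementation, not the paper's code):
# one critic network approximating both V(s) and Q(s, a).
import torch
import torch.nn as nn


class UniversalCritic(nn.Module):
    """Single critic with shared state encoder and two heads: V(s) and Q(s, a)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.state_encoder = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        # V(s) depends on the state encoding only.
        self.v_head = nn.Linear(hidden, 1)
        # Q(s, a) conditions the same state encoding on the action.
        self.q_head = nn.Sequential(
            nn.Linear(hidden + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor):
        h = self.state_encoder(state)
        v = self.v_head(h)
        q = self.q_head(torch.cat([h, action], dim=-1))
        return q, v


def advantage(critic: UniversalCritic, state, action):
    """Advantage estimate A(s, a) = Q(s, a) - V(s) from the single critic."""
    q, v = critic(state, action)
    return q - v


def low_pass_target_v(prev_target_v, new_v, beta: float = 0.9):
    """Exponentially smooth target state-values (a simple low-pass filter);
    beta is an illustrative smoothing coefficient."""
    return beta * prev_target_v + (1.0 - beta) * new_v
```

In this sketch a single forward pass yields both value estimates, so only one critic network has to be stored and updated, which is the storage saving the abstract contrasts with the four critic networks of clipped double Q-learning with target networks.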

Original language: English
Article number: 111814
Journal: Applied Soft Computing
Volume: 163
DOI
Publication status: Published - Sep 2024
