TY - JOUR
T1 - An efficient and lightweight off-policy actor–critic reinforcement learning framework
AU - Zhang, Huaqing
AU - Ma, Hongbin
AU - Zhang, Xiaofei
AU - Mersha, Bemnet Wondimagegnehu
AU - Wang, Li
AU - Jin, Ying
N1 - Publisher Copyright:
© 2024
PY - 2024/9
Y1 - 2024/9
N2 - In the framework of current off-policy actor–critic methods, the state–action pairs stored in the experience replay buffer (called historical behaviors) cannot be used to improve the policy, and the target network and clipped double Q-learning techniques must be used to evaluate the policy. This framework limits the policy-learning capability in complex environments and requires maintaining four critic networks. We therefore propose an efficient and lightweight off-policy actor–critic (EL-AC) framework. For policy improvement, we propose an efficient off-policy likelihood-ratio policy gradient algorithm with historical behaviors reusing (PG-HBR), which enables the agent to learn an approximately optimal policy from historical behaviors. Moreover, a theoretically interpretable universal critic network is designed; it approximates the action-value and state-value functions simultaneously, yielding the advantage function required by PG-HBR. For policy evaluation, we develop a low-pass filtering algorithm for target state-values and an adaptive overestimation-bias control algorithm, which together evaluate the policy efficiently and accurately using only one universal critic network. Extensive evaluation results indicate that EL-AC outperforms state-of-the-art off-policy actor–critic methods in terms of approximately optimal policy learning and neural-network storage space, making it better suited to policy learning in complex environments.
AB - In the framework of current off-policy actor–critic methods, the state–action pairs stored in the experience replay buffer (called historical behaviors) cannot be used to improve the policy, and the target network and clipped double Q-learning techniques must be used to evaluate the policy. This framework limits the policy-learning capability in complex environments and requires maintaining four critic networks. We therefore propose an efficient and lightweight off-policy actor–critic (EL-AC) framework. For policy improvement, we propose an efficient off-policy likelihood-ratio policy gradient algorithm with historical behaviors reusing (PG-HBR), which enables the agent to learn an approximately optimal policy from historical behaviors. Moreover, a theoretically interpretable universal critic network is designed; it approximates the action-value and state-value functions simultaneously, yielding the advantage function required by PG-HBR. For policy evaluation, we develop a low-pass filtering algorithm for target state-values and an adaptive overestimation-bias control algorithm, which together evaluate the policy efficiently and accurately using only one universal critic network. Extensive evaluation results indicate that EL-AC outperforms state-of-the-art off-policy actor–critic methods in terms of approximately optimal policy learning and neural-network storage space, making it better suited to policy learning in complex environments.
KW - Actor–critic reinforcement learning
KW - Adaptive overestimation bias control
KW - Low-pass filter
KW - Neural networks
KW - Policy gradient
UR - http://www.scopus.com/inward/record.url?scp=85196318214&partnerID=8YFLogxK
U2 - 10.1016/j.asoc.2024.111814
DO - 10.1016/j.asoc.2024.111814
M3 - Article
AN - SCOPUS:85196318214
SN - 1568-4946
VL - 163
JO - Applied Soft Computing
JF - Applied Soft Computing
M1 - 111814
ER -