TY - GEN
T1 - An Improved Off-Policy Actor-Critic Algorithm with Historical Behaviors Reusing for Robotic Control
AU - Zhang, Huaqing
AU - Ma, Hongbin
AU - Jin, Ying
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
AB - When a robot uses reinforcement learning (RL) to learn a behavior policy, the RL algorithm must learn a near-optimal policy model from limited interaction data. In this paper, we present an off-policy actor-critic deep RL algorithm based on the maximum entropy RL framework. In the policy improvement step, an off-policy likelihood-ratio policy gradient method is derived, in which actions are sampled simultaneously from the current policy model and from the experience replay buffer according to the sampled states, making full use of past experience. Moreover, we design a unified critic network that simultaneously approximates the state-value and action-value functions. On a range of continuous control benchmarks, our method outperforms the state-of-the-art soft actor-critic (SAC) algorithm in stability and asymptotic performance.
KW - A unified critic network
KW - Deep reinforcement learning
KW - Robotic control
UR - http://www.scopus.com/inward/record.url?scp=85136967210&partnerID=8YFLogxK
DO - 10.1007/978-3-031-13841-6_41
M3 - Conference contribution
AN - SCOPUS:85136967210
SN - 9783031138409
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 449
EP - 458
BT - Intelligent Robotics and Applications - 15th International Conference, ICIRA 2022, Proceedings
A2 - Liu, Honghai
A2 - Ren, Weihong
A2 - Yin, Zhouping
A2 - Liu, Lianqing
A2 - Jiang, Li
A2 - Gu, Guoying
A2 - Wu, Xinyu
PB - Springer Science and Business Media Deutschland GmbH
T2 - 15th International Conference on Intelligent Robotics and Applications, ICIRA 2022
Y2 - 1 August 2022 through 3 August 2022
ER -