TY - GEN
T1 - An Improved Off-Policy Actor-Critic Algorithm with Historical Behaviors Reusing for Robotic Control
AU - Zhang, Huaqing
AU - Ma, Hongbin
AU - Jin, Ying
N1 - Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
PY - 2022
Y1 - 2022
AB - When a robot uses reinforcement learning (RL) to learn a behavior policy, the RL algorithm must learn a near-optimal policy model from limited interaction data. In this paper, we present an off-policy actor-critic deep RL algorithm based on the maximum entropy RL framework. In the policy improvement step, an off-policy likelihood-ratio policy gradient method is derived, in which actions are sampled simultaneously from the current policy model and from the experience replay buffer according to the sampled states, making full use of past experience. Moreover, we design a unified critic network that simultaneously approximates the state-value and action-value functions. On a range of continuous control benchmarks, our method outperforms the state-of-the-art soft actor-critic (SAC) algorithm in stability and asymptotic performance.
KW - A unified critic network
KW - Deep reinforcement learning
KW - Robotic control
UR - http://www.scopus.com/inward/record.url?scp=85136967210&partnerID=8YFLogxK
DO - 10.1007/978-3-031-13841-6_41
M3 - Conference contribution
AN - SCOPUS:85136967210
SN - 9783031138409
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 449
EP - 458
BT - Intelligent Robotics and Applications - 15th International Conference, ICIRA 2022, Proceedings
A2 - Liu, Honghai
A2 - Ren, Weihong
A2 - Yin, Zhouping
A2 - Liu, Lianqing
A2 - Jiang, Li
A2 - Gu, Guoying
A2 - Wu, Xinyu
PB - Springer Science and Business Media Deutschland GmbH
T2 - 15th International Conference on Intelligent Robotics and Applications, ICIRA 2022
Y2 - 1 August 2022 through 3 August 2022
ER -