Policy iteration based Q-learning for linear nonzero-sum quadratic differential games

Xinxing Li, Zhihong Peng*, Li Liang, Wenzhong Zha

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

18 Citations (Scopus)

Abstract

In this paper, a policy iteration-based Q-learning algorithm is proposed to solve infinite-horizon linear nonzero-sum quadratic differential games with completely unknown dynamics. The Q-learning algorithm, which employs off-policy reinforcement learning (RL), can learn the Nash equilibrium and the corresponding value functions online, using data sets generated by behavior policies. First, we prove the equivalence between the proposed off-policy Q-learning algorithm and an offline policy iteration (PI) algorithm by selecting specific initial admissible policies that can be learned online. Then, the convergence of the off-policy Q-learning algorithm is proved under a mild rank condition that can easily be met by injecting appropriate probing noises into the behavior policies. The generated data sets can be reused throughout the learning process, which makes the algorithm computationally efficient. Simulation results demonstrate the effectiveness of the proposed Q-learning algorithm.
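The abstract gives no equations or code, so the following is only a minimal, illustrative sketch of the offline policy iteration baseline to which the proposed off-policy Q-learning algorithm is proved equivalent: starting from initial admissible (stabilizing) feedback gains, each iteration evaluates the current policies by solving coupled Lyapunov equations and then improves the gains. All system matrices, cost weights, and the specific update rule below are assumptions chosen for the example; the paper's actual algorithm is model-free and learns from data generated by behavior policies instead of using A, B1, and B2.

```python
# Illustrative sketch (not the paper's algorithm): model-based policy iteration
# for a two-player linear nonzero-sum quadratic differential game.
# Dynamics: dx/dt = A x + B1 u1 + B2 u2, with u_i = -K_i x.
# Player i cost: integral of x'Qi x + u1'Ri1 u1 + u2'Ri2 u2 dt.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Assumed example system (A is Hurwitz, so K1 = K2 = 0 is admissible).
A  = np.array([[0.0, 1.0], [-1.0, -1.0]])
B1 = np.array([[0.0], [1.0]])
B2 = np.array([[0.0], [0.5]])

# Assumed cost weights.
Q1, Q2   = np.eye(2), 2.0 * np.eye(2)
R11, R12 = np.eye(1), np.eye(1)
R21, R22 = np.eye(1), np.eye(1)

# Initial admissible (stabilizing) policies.
K1 = np.zeros((1, 2))
K2 = np.zeros((1, 2))

for _ in range(100):
    Ac = A - B1 @ K1 - B2 @ K2  # closed-loop matrix under current policies

    # Policy evaluation: solve Ac' Pi + Pi Ac + Qi + K1'Ri1 K1 + K2'Ri2 K2 = 0.
    P1 = solve_continuous_lyapunov(Ac.T, -(Q1 + K1.T @ R11 @ K1 + K2.T @ R12 @ K2))
    P2 = solve_continuous_lyapunov(Ac.T, -(Q2 + K1.T @ R21 @ K1 + K2.T @ R22 @ K2))

    # Policy improvement: Ki <- Rii^{-1} Bi' Pi.
    K1_new = np.linalg.solve(R11, B1.T @ P1)
    K2_new = np.linalg.solve(R22, B2.T @ P2)

    if max(np.max(np.abs(K1_new - K1)), np.max(np.abs(K2_new - K2))) < 1e-10:
        K1, K2 = K1_new, K2_new
        break
    K1, K2 = K1_new, K2_new

print("Approximate Nash feedback gains:")
print("K1 =", K1)
print("K2 =", K2)
```

In the paper's setting, the coupled Lyapunov-equation step above would be replaced by a least-squares solution of Q-function equations built from measured state and input data, which is what allows the dynamics to remain completely unknown.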

Original language: English
Article number: 52204
Journal: Science China Information Sciences
Volume: 62
Issue number: 5
Publication status: Published - 1 May 2019

Keywords

  • ADP
  • PI
  • Q-learning
  • RL
  • adaptive dynamic programming
  • linear nonzero-sum quadratic differential games
  • off-policy
  • policy iteration
  • reinforcement learning
