Abstract
In this paper, a policy iteration-based Q-learning algorithm is proposed to solve infinite-horizon linear nonzero-sum quadratic differential games with completely unknown dynamics. The Q-learning algorithm, which employs off-policy reinforcement learning (RL), can learn the Nash equilibrium and the corresponding value functions online from data sets generated by behavior policies. First, we prove the equivalence between the proposed off-policy Q-learning algorithm and an offline policy iteration (PI) algorithm by selecting specific initially admissible policies that can be learned online. Then, the convergence of the off-policy Q-learning algorithm is proved under a mild rank condition that can be easily satisfied by injecting appropriate probing noises into the behavior policies. The generated data sets can be reused throughout the learning process, which makes the algorithm computationally efficient. Simulation results demonstrate the effectiveness of the proposed Q-learning algorithm.
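The abstract states that the off-policy Q-learning scheme is equivalent to an offline PI algorithm. As a point of reference only, the sketch below illustrates what such a model-based offline PI iteration for a two-player linear-quadratic nonzero-sum game might look like: policy evaluation via coupled Lyapunov equations followed by gain updates. All matrices, the choice of an open-loop stable `A` (so that zero initial gains are admissible), and the stopping tolerance are illustrative assumptions and are not taken from the paper; the paper's actual contribution is the model-free, data-driven counterpart of this iteration.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative two-player LQ nonzero-sum game (hypothetical data, not from the paper):
#   dx/dt = A x + B1 u1 + B2 u2,  J_i = \int (x'Q_i x + u1'R_i1 u1 + u2'R_i2 u2) dt
A  = np.array([[-1.0, 0.5], [0.0, -2.0]])   # open-loop stable, so K1 = K2 = 0 is admissible
B1 = np.array([[1.0], [0.0]])
B2 = np.array([[0.0], [1.0]])
Q1, Q2   = np.eye(2), 2.0 * np.eye(2)
R11, R12 = np.array([[1.0]]), np.array([[0.5]])
R21, R22 = np.array([[0.5]]), np.array([[1.0]])

K1 = np.zeros((1, 2))
K2 = np.zeros((1, 2))

for it in range(50):
    Acl = A - B1 @ K1 - B2 @ K2
    # Policy evaluation: solve the coupled Lyapunov equations
    #   Acl' P_i + P_i Acl + Q_i + K1'R_i1 K1 + K2'R_i2 K2 = 0
    M1 = Q1 + K1.T @ R11 @ K1 + K2.T @ R12 @ K2
    M2 = Q2 + K1.T @ R21 @ K1 + K2.T @ R22 @ K2
    P1 = solve_continuous_lyapunov(Acl.T, -M1)
    P2 = solve_continuous_lyapunov(Acl.T, -M2)
    # Policy improvement: K_i = R_ii^{-1} B_i' P_i
    K1_new = np.linalg.solve(R11, B1.T @ P1)
    K2_new = np.linalg.solve(R22, B2.T @ P2)
    converged = max(np.linalg.norm(K1_new - K1),
                    np.linalg.norm(K2_new - K2)) < 1e-10
    K1, K2 = K1_new, K2_new
    if converged:
        break

print("Approximate Nash feedback gains:")
print("K1 =", K1)
print("K2 =", K2)
```

In the off-policy setting described in the abstract, the same evaluation/improvement structure is retained, but the coupled equations are solved from state and input data collected under behavior policies (with probing noise ensuring the rank condition), so no knowledge of A, B1, or B2 is required.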
Original language | English
---|---
Article number | 52204
Journal | Science China Information Sciences
Volume | 62
Issue | 5
DOI |
Publication status | Published - 1 May 2019