Abstract
In this paper, a policy iteration-based Q-learning algorithm is proposed to solve infinite horizon linear nonzero-sum quadratic differential games with completely unknown dynamics. The Q-learning algorithm, which employs off-policy reinforcement learning (RL), can learn the Nash equilibrium and the corresponding value functions online, using data sets generated by behavior policies. First, we prove the equivalence between the proposed off-policy Q-learning algorithm and an offline policy iteration (PI) algorithm by selecting specific initial admissible policies that can be learned online. Then, the convergence of the off-policy Q-learning algorithm is proved under a mild rank condition that can be easily satisfied by injecting appropriate probing noise into the behavior policies. The generated data sets can be reused throughout the learning process, which makes the algorithm computationally efficient. The simulation results demonstrate the effectiveness of the proposed Q-learning algorithm.
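For intuition, the sketch below shows the offline, model-based PI iteration for a two-player linear-quadratic nonzero-sum game, to which the paper's off-policy Q-learning algorithm is proven equivalent. It is only an illustration under assumed known matrices (A, B1, B2, and cost weights Q_i, R_ij); the paper's actual algorithm learns the same Nash solution from measured data without the model.

```python
# Minimal sketch of model-based policy iteration for a two-player LQ
# nonzero-sum game (NOT the paper's data-driven off-policy Q-learning).
# System and cost matrices here are illustrative assumptions.
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def offline_pi(A, B, Q, R, K, iters=50, tol=1e-8):
    """B = [B1, B2], Q = [Q1, Q2], R[i][j] = R_ij, K = [K1, K2] admissible initial gains."""
    n = len(B)
    P = [None] * n
    for _ in range(iters):
        # Closed-loop matrix under the current feedback policies u_i = -K_i x
        Ac = A - sum(B[j] @ K[j] for j in range(n))
        for i in range(n):
            # Policy evaluation: coupled Lyapunov equation
            # Ac' P_i + P_i Ac + Q_i + sum_j K_j' R_ij K_j = 0
            M = Q[i] + sum(K[j].T @ R[i][j] @ K[j] for j in range(n))
            P[i] = solve_continuous_lyapunov(Ac.T, -M)
        # Policy improvement: K_i <- R_ii^{-1} B_i' P_i
        K_new = [np.linalg.solve(R[i][i], B[i].T @ P[i]) for i in range(n)]
        if max(np.linalg.norm(K_new[i] - K[i]) for i in range(n)) < tol:
            return K_new, P
        K = K_new
    return K, P
```

In the paper's setting, the same fixed-point is reached by a Q-learning scheme that replaces the Lyapunov solves with least-squares identification from trajectory data collected under exploratory behavior policies.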
| Original language | English |
|---|---|
| Article number | 52204 |
| Journal | Science China Information Sciences |
| Volume | 62 |
| Issue number | 5 |
| DOIs | |
| Publication status | Published - 1 May 2019 |
Keywords
- ADP
- PI
- Q-learning
- RL
- adaptive dynamic programming
- linear nonzero-sum quadratic differential games
- off-policy
- policy iteration
- reinforcement learning