TY - JOUR
T1 - Finite-Time Analysis of Decentralized Temporal-Difference Learning with Linear Function Approximation
AU - Sun, Jun
AU - Wang, Gang
AU - Giannakis, Georgios B.
AU - Yang, Qinmin
AU - Yang, Zaiyue
N1 - Publisher Copyright:
Copyright © 2020 by the author(s)
PY - 2020
Y1 - 2020
N2 - Motivated by the emerging use of multi-agent reinforcement learning (MARL) in various engineering applications, we investigate the policy evaluation problem in a fully decentralized setting, using temporal-difference (TD) learning with linear function approximation to handle large state spaces in practice. The goal of a group of agents is to collaboratively learn the value function of a given policy from locally private rewards observed in a shared environment, through exchanging local estimates with neighbors. Despite the simplicity and widespread use of such decentralized TD learning algorithms, our theoretical understanding of them remains limited. Existing results were obtained based on i.i.d. data samples, or by imposing an 'additional' projection step to control the 'gradient' bias incurred by the Markovian observations. In this paper, we provide a finite-sample analysis of fully decentralized TD(0) learning under both i.i.d. and Markovian samples, and prove that all local estimates converge linearly to a neighborhood of the optimum. The resultant error bounds are the first of their type, in the sense that they hold under the most practical assumptions, which is made possible by means of a novel multi-step Lyapunov analysis.
AB - Motivated by the emerging use of multi-agent reinforcement learning (MARL) in various engineering applications, we investigate the policy evaluation problem in a fully decentralized setting, using temporal-difference (TD) learning with linear function approximation to handle large state spaces in practice. The goal of a group of agents is to collaboratively learn the value function of a given policy from locally private rewards observed in a shared environment, through exchanging local estimates with neighbors. Despite the simplicity and widespread use of such decentralized TD learning algorithms, our theoretical understanding of them remains limited. Existing results were obtained based on i.i.d. data samples, or by imposing an 'additional' projection step to control the 'gradient' bias incurred by the Markovian observations. In this paper, we provide a finite-sample analysis of fully decentralized TD(0) learning under both i.i.d. and Markovian samples, and prove that all local estimates converge linearly to a neighborhood of the optimum. The resultant error bounds are the first of their type, in the sense that they hold under the most practical assumptions, which is made possible by means of a novel multi-step Lyapunov analysis.
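N1 - The abstract describes fully decentralized TD(0) with linear function approximation, in which each agent mixes parameter estimates with its neighbors and then performs a local TD update using its privately observed reward. The sketch below is a minimal, hypothetical illustration of such a consensus-plus-TD(0) update; it is not the authors' code, and the synthetic MDP, ring communication topology, feature map, and step size are assumptions made purely for demonstration.

```python
# Minimal sketch (not the authors' implementation) of decentralized TD(0) with
# linear function approximation: agents combine neighbors' parameters via a
# doubly stochastic mixing matrix W, then take a local TD(0) step using their
# locally private rewards. All quantities below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS, N_STATES, DIM = 4, 10, 5
GAMMA, ALPHA, STEPS = 0.9, 0.05, 5000

# Transition matrix of the Markov chain induced by the fixed policy (assumed).
P = rng.random((N_STATES, N_STATES))
P /= P.sum(axis=1, keepdims=True)

# Feature map and locally private mean rewards (assumed).
Phi = rng.standard_normal((N_STATES, DIM))
R = rng.random((N_AGENTS, N_STATES))

# Doubly stochastic mixing matrix for a ring of agents (assumed topology).
W = np.zeros((N_AGENTS, N_AGENTS))
for i in range(N_AGENTS):
    W[i, i] = 0.5
    W[i, (i - 1) % N_AGENTS] = 0.25
    W[i, (i + 1) % N_AGENTS] = 0.25

theta = np.zeros((N_AGENTS, DIM))  # one parameter vector per agent
s = 0
for t in range(STEPS):
    s_next = rng.choice(N_STATES, p=P[s])           # Markovian sample
    phi, phi_next = Phi[s], Phi[s_next]
    mixed = W @ theta                                # consensus (combine) step
    for i in range(N_AGENTS):
        td_error = R[i, s] + GAMMA * phi_next @ theta[i] - phi @ theta[i]
        theta[i] = mixed[i] + ALPHA * td_error * phi  # local TD(0) (adapt) step
    s = s_next

# After mixing and TD updates, the local estimates should be near consensus.
print("max deviation from the average parameter:",
      np.abs(theta - theta.mean(axis=0)).max())
```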
UR - http://www.scopus.com/inward/record.url?scp=85161864433&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85161864433
SN - 2640-3498
VL - 108
SP - 4485
EP - 4495
JO - Proceedings of Machine Learning Research
JF - Proceedings of Machine Learning Research
T2 - 23rd International Conference on Artificial Intelligence and Statistics, AISTATS 2020
Y2 - 26 August 2020 through 28 August 2020
ER -