PROVABLE MEMORY EFFICIENT SELF-PLAY ALGORITHM FOR MODEL-FREE REINFORCEMENT LEARNING

Na Li, Yuchen Jiao, Hangguan Shan, Shefeng Yan

Research output: Contribution to conference › Paper › peer-review

Abstract

The thriving field of multi-agent reinforcement learning (MARL) studies how a group of interacting agents makes decisions autonomously in a shared dynamic environment. Existing theoretical studies in this area suffer from at least two of the following obstacles: memory inefficiency, heavy dependence of the sample complexity on the long horizon and the large state space, high computational complexity, non-Markov policies, non-Nash policies, and high burn-in cost. In this work, we take a step towards settling this problem by designing a model-free self-play algorithm, Memory-Efficient Nash Q-Learning (ME-Nash-QL), for two-player zero-sum Markov games, a specific setting of MARL. ME-Nash-QL is proven to enjoy the following merits. First, it can output an ε-approximate Nash policy with space complexity O(SABH) and sample complexity Õ(H⁴SAB/ε²), where S is the number of states, A and B are the numbers of actions of the two players, and H is the horizon length. It outperforms existing algorithms in terms of space complexity for tabular cases, and in terms of sample complexity for long horizons, i.e., when min{A, B} ≪ H². Second, ME-Nash-QL achieves the lowest computational complexity, O(T·poly(AB)), while preserving Markov policies, where T is the number of samples. Third, ME-Nash-QL achieves the best burn-in cost, O(SAB·poly(H)), whereas previous algorithms need a burn-in cost of at least O(S³AB·poly(H)) to attain the same level of sample complexity as ours.
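
As a rough illustration of where the per-sample poly(AB) cost in the O(T·poly(AB)) figure comes from, the sketch below shows the generic Nash-value subroutine used by self-play Nash Q-learning-style methods: at a visited state, the zero-sum matrix game defined by the current Q estimate is solved by a linear program. This is a minimal sketch, not the authors' ME-Nash-QL; the function name zero_sum_nash_value and the example payoff matrix are assumptions introduced here for illustration.

```python
# Minimal sketch (hypothetical helper, not ME-Nash-QL): compute the Nash value and
# the max-player's mixed strategy of a zero-sum matrix game Q of shape (A, B)
# via a linear program. One such solve per visited (state, step) pair is what
# makes each of the T samples cost poly(A*B) time.
import numpy as np
from scipy.optimize import linprog


def zero_sum_nash_value(Q):
    """Return (game value, max-player mixed strategy) for the zero-sum game Q."""
    A, B = Q.shape
    # Decision variables: [x_1, ..., x_A, v]; maximize v  <=>  minimize -v.
    c = np.zeros(A + 1)
    c[-1] = -1.0
    # For every column b:  v - sum_a x_a * Q[a, b] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    # Probability simplex constraint: sum_a x_a = 1, with 0 <= x_a <= 1.
    A_eq = np.hstack([np.ones((1, A)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * A + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:A]


if __name__ == "__main__":
    # Matching pennies: value 0, uniform strategies for both players.
    Q = np.array([[1.0, -1.0], [-1.0, 1.0]])
    value, strategy = zero_sum_nash_value(Q)
    print(value, strategy)
```

Since model-free self-play methods of this kind invoke such a matrix-game solver once per collected sample, the total runtime scales as the number of samples T times a poly(AB) per-solve cost, which is consistent with the computational complexity quoted in the abstract.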

Original language: English
Publication status: Published - 2024
Externally published: Yes
Event: 12th International Conference on Learning Representations, ICLR 2024 - Hybrid, Vienna, Austria
Duration: 7 May 2024 – 11 May 2024

Conference

Conference: 12th International Conference on Learning Representations, ICLR 2024
Country/Territory: Austria
City: Hybrid, Vienna
Period: 7/05/24 – 11/05/24
