TY - GEN
T1 - Multi-time scale hierarchical trust domain leads to the improvement of MAPPO algorithm
AU - Guo, Zhentao
AU - Sun, Licheng
AU - Zhao, Guiyu
AU - Wang, Tianhao
AU - Ding, Ao
AU - Ma, Hongbin
N1 - Publisher Copyright:
© 2024 Technical Committee on Control Theory, Chinese Association of Automation.
PY - 2024
Y1 - 2024
N2 - Multi-Agent Proximal Policy Optimization (MAPPO) is a widely used on-policy reinforcement learning algorithm, yet it is adopted far less often than off-policy algorithms in multi-agent environments. The existing MAPPO algorithm suffers from insufficient generalization ability, adaptability, and training stability when handling complex tasks. In this paper, we propose an improved trust-domain-guided MAPPO algorithm with a multi-time-scale hierarchical structure, which aims to cope with the hierarchical structure and multi-time-scale dynamics of tasks. The algorithm introduces a multi-time-scale hierarchical structure, together with trust domain constraints and L2 norm regularization, to prevent the policy instability caused by overly large updates. Finally, experimental verification on Decentralized Collective Assault (DCA) shows that our algorithm achieves significant improvements on various performance indicators, indicating better effectiveness and robustness on complex tasks.
AB - Multi-Agent Proximal Policy Optimization (MAPPO) is a widely used on-policy reinforcement learning algorithm, yet it is adopted far less often than off-policy algorithms in multi-agent environments. The existing MAPPO algorithm suffers from insufficient generalization ability, adaptability, and training stability when handling complex tasks. In this paper, we propose an improved trust-domain-guided MAPPO algorithm with a multi-time-scale hierarchical structure, which aims to cope with the hierarchical structure and multi-time-scale dynamics of tasks. The algorithm introduces a multi-time-scale hierarchical structure, together with trust domain constraints and L2 norm regularization, to prevent the policy instability caused by overly large updates. Finally, experimental verification on Decentralized Collective Assault (DCA) shows that our algorithm achieves significant improvements on various performance indicators, indicating better effectiveness and robustness on complex tasks.
KW - L2 norm regularization
KW - MAPPO
KW - Multi-time scale hierarchical structure
KW - Trust domain
UR - http://www.scopus.com/inward/record.url?scp=85205488262&partnerID=8YFLogxK
U2 - 10.23919/CCC63176.2024.10662712
DO - 10.23919/CCC63176.2024.10662712
M3 - Conference contribution
AN - SCOPUS:85205488262
T3 - Chinese Control Conference, CCC
SP - 6109
EP - 6114
BT - Proceedings of the 43rd Chinese Control Conference, CCC 2024
A2 - Na, Jing
A2 - Sun, Jian
PB - IEEE Computer Society
T2 - 43rd Chinese Control Conference, CCC 2024
Y2 - 28 July 2024 through 31 July 2024
ER -