An Efficient Multi-Agent Policy Self-Play Learning Method Aiming at Seize-Control Scenarios

Huaqing Zhang, Hongbin Ma*, Xiaofei Zhang, Li Wang, Minglei Han, Hui Chen, Ao Ding

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Aiming at the problem of multi-agent cooperative confrontation in seize-control scenarios, we design an efficient multi-agent policy self-play (EMAP-SP) learning method. First, a multi-agent centralized policy model is constructed to command the agents to perform tasks cooperatively. Considering that the policy being trained and its historical policies usually have poor exploration capability under incomplete information during self-play training, an intrinsic reward mechanism based on random network distillation (RND) is introduced into the self-play learning method. In addition, we propose a multi-step on-policy deep reinforcement learning (DRL) algorithm assisted by off-policy policy evaluation (MSOAO) to learn the best response policy in self-play. Compared with DRL algorithms commonly used in complex decision problems, MSOAO has a more efficient policy evaluation capability, and this efficient policy evaluation further improves policy learning. The effectiveness of EMAP-SP is fully verified in the MiaoSuan wargame simulation system, and the evaluation results show that EMAP-SP can learn a cooperative policy that effectively defeats the Blue side's knowledge-based policy under incomplete information. Moreover, the evaluation results in DRL benchmark environments also show that the best response policy learning algorithm, MSOAO, can help the agent learn approximately optimal policies.
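The abstract mentions an intrinsic reward mechanism based on random network distillation (RND) to improve exploration under incomplete information. The sketch below illustrates only the general RND idea (a fixed random target network, a trained predictor, and a prediction-error bonus), not the paper's own implementation; all network sizes and names are hypothetical.

```python
import torch
import torch.nn as nn

class RNDIntrinsicReward(nn.Module):
    """Minimal RND sketch: intrinsic reward = predictor's error against a fixed random target."""

    def __init__(self, obs_dim: int, feat_dim: int = 64):
        super().__init__()
        # Fixed, randomly initialized target network (never trained).
        self.target = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))
        for p in self.target.parameters():
            p.requires_grad_(False)
        # Predictor network trained to match the target's output.
        self.predictor = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, feat_dim))

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Prediction error is large for rarely visited observations,
        # so it can serve as an exploration bonus added to the extrinsic reward.
        with torch.no_grad():
            target_feat = self.target(obs)
        pred_feat = self.predictor(obs)
        return ((pred_feat - target_feat) ** 2).mean(dim=-1)


# Illustrative usage with made-up dimensions: the same error is both the
# intrinsic bonus during rollouts and the predictor's training loss.
rnd = RNDIntrinsicReward(obs_dim=32)
obs = torch.randn(8, 32)          # batch of observations
r_int = rnd(obs)                  # per-observation intrinsic bonus
loss = r_int.mean()               # predictor training objective
loss.backward()
```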

Original language: English
Pages (from-to): 987-1004
Number of pages: 18
Journal: Unmanned Systems
Volume: 13
Issue number: 4
DOIs
Publication status: Published - 1 Jul 2025
Externally published: Yes

Keywords

  • cooperative confrontation
  • deep reinforcement learning
  • policy evaluation
  • self-play
  • wargame
