Asynchronous hierarchical deep reinforcement learning with learnable reward shaping for distributed multi-UCAV air combat decision

  • Yifan Zheng
  • Bin Xin*
  • Jie Chen
  • Keming Jiao
  • Zhixin Zhao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

The complexity and high dynamics of the battlefield environment, together with the high-dimensional state and decision spaces, pose severe challenges for unmanned combat aerial vehicles (UCAVs) in cooperative autonomous air combat decision-making. This paper focuses on the many-to-many air combat maneuvering decision (MMACMD) problem in an environment with extremely limited communication. An asynchronous hierarchical deep reinforcement learning method with learnable reward shaping (AHDRL_LRS) is proposed. First, by introducing an asynchronous hierarchical reinforcement learning framework, the large-scale MMACMD problem is decomposed into smaller-scale subtasks to reduce the dimensionality of the decision spaces. Second, to achieve coordinated global task allocation in an environment with extremely limited communication, a learnable reward with embedded target intention (LRETI) is proposed. Through the LRETI, the target-selection intentions generated by the high-level policy are implicitly represented as learnable parameters in the situation reward function, which is used to train the low-level flight maneuver policy. Third, to dynamically characterize the topological correlations among the units of the UCAV swarm and to enhance the transferability and scalability of the decision-making model, a flexible target intention network (FTIN) structure based on the multi-head self-attention (MHSA) model is designed to represent the high-level policy; it can accept input features with variable-length sequences. Moreover, a graph learning-based critic network is adopted in the low-level policy model to address dynamic credit assignment. Finally, the effectiveness and superiority of the proposed AHDRL_LRS are validated through simulation experiments, comparing against baseline methods under variously initialized scenarios ranging from 6-vs-6 to 20-vs-20 scales.
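The abstract notes that an MHSA-based policy network can accept variable-length input sequences, which is what lets one trained model scale from 6-vs-6 to 20-vs-20 scenarios. The paper's FTIN architecture is not specified here; the sketch below (all function names, shapes, and weight layouts are assumptions for illustration, not the authors' implementation) shows why plain multi-head self-attention has this property: its parameters depend only on the per-unit feature dimension, not on the number of units.

```python
import numpy as np

def multi_head_self_attention(X, Wq, Wk, Wv, num_heads):
    """Single MHSA layer over a set of unit features.

    X: (n, d) array, one feature vector per UCAV/target; n may vary
    per scenario. Wq, Wk, Wv: (d, d) projection weights shared across
    all units, so the same weights work for any n.
    Returns an (n, d) array: one contextual embedding per unit.
    """
    n, d = X.shape
    dh = d // num_heads                       # per-head dimension
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(num_heads):
        q = Q[:, h * dh:(h + 1) * dh]
        k = K[:, h * dh:(h + 1) * dh]
        v = V[:, h * dh:(h + 1) * dh]
        scores = q @ k.T / np.sqrt(dh)        # (n, n) pairwise attention
        scores -= scores.max(axis=1, keepdims=True)   # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=1, keepdims=True)
        heads.append(w @ v)                   # (n, dh)
    return np.concatenate(heads, axis=1)      # (n, d)

# The same weights handle swarms of different sizes.
rng = np.random.default_rng(0)
d, H = 8, 2
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out_small = multi_head_self_attention(rng.standard_normal((6, d)), Wq, Wk, Wv, H)
out_large = multi_head_self_attention(rng.standard_normal((20, d)), Wq, Wk, Wv, H)
```

Because the output keeps one embedding per input unit, a high-level policy built on such a layer can emit a target-selection intention for each UCAV regardless of swarm size, which is the transferability property the abstract attributes to the FTIN.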

Original language: English
Article number: 1100303
Journal: Science China Technological Sciences
Volume: 69
Issue number: 1
DOIs
Publication status: Published - Jan 2026
Externally published: Yes

Keywords

  • distributed decision-making
  • hierarchical reinforcement learning
  • learnable reward shaping
  • many-to-many air combat
