TY - JOUR
T1 - Asynchronous hierarchical deep reinforcement learning with learnable reward shaping for distributed multi-UCAV air combat decision
AU - Zheng, Yifan
AU - Xin, Bin
AU - Chen, Jie
AU - Jiao, Keming
AU - Zhao, Zhixin
N1 - Publisher Copyright:
© Science China Press 2026.
PY - 2026/1
Y1 - 2026/1
N2 - The complexity of the battlefield environment, including its high dynamics, along with the high-dimensional state and decision spaces, poses severe challenges to unmanned combat aerial vehicles (UCAVs) in cooperative autonomous air combat decision-making. This paper focuses on the many-to-many air combat maneuvering decision (MMACMD) problem in an environment with extremely limited communication. An asynchronous hierarchical deep reinforcement learning method with learnable reward shaping (AHDRL_LRS) is proposed. First, by introducing an asynchronous hierarchical reinforcement learning framework, the large-scale MMACMD is decomposed into smaller-scale subtasks to reduce the dimensions of the decision spaces. Second, to achieve coordinated global task allocation under extremely limited communication, the learnable reward with embedded target intention (LRETI) is proposed. Through the LRETI, the target selection intentions generated by the high-level policy are implicitly represented as learnable parameters in the situation reward function, which is used to train the low-level flight maneuver policy. Third, to dynamically characterize the topological correlations among the units of the UCAV swarm and to enhance the transferability and scalability of the decision-making model, a flexible target intention network (FTIN) structure based on the multi-head self-attention (MHSA) model, which can accept input features with variable-length sequences, is designed to represent the high-level policy. Moreover, a graph learning-based critic network is adopted in the low-level policy model to address dynamic credit assignment. Finally, comparisons with baseline methods under scenarios with various initializations, from 6-vs-6 to 20-vs-20 scales, validate the effectiveness and superiority of the proposed AHDRL_LRS through simulation experiments.
AB - The complexity of the battlefield environment, including its high dynamics, along with the high-dimensional state and decision spaces, poses severe challenges to unmanned combat aerial vehicles (UCAVs) in cooperative autonomous air combat decision-making. This paper focuses on the many-to-many air combat maneuvering decision (MMACMD) problem in an environment with extremely limited communication. An asynchronous hierarchical deep reinforcement learning method with learnable reward shaping (AHDRL_LRS) is proposed. First, by introducing an asynchronous hierarchical reinforcement learning framework, the large-scale MMACMD is decomposed into smaller-scale subtasks to reduce the dimensions of the decision spaces. Second, to achieve coordinated global task allocation under extremely limited communication, the learnable reward with embedded target intention (LRETI) is proposed. Through the LRETI, the target selection intentions generated by the high-level policy are implicitly represented as learnable parameters in the situation reward function, which is used to train the low-level flight maneuver policy. Third, to dynamically characterize the topological correlations among the units of the UCAV swarm and to enhance the transferability and scalability of the decision-making model, a flexible target intention network (FTIN) structure based on the multi-head self-attention (MHSA) model, which can accept input features with variable-length sequences, is designed to represent the high-level policy. Moreover, a graph learning-based critic network is adopted in the low-level policy model to address dynamic credit assignment. Finally, comparisons with baseline methods under scenarios with various initializations, from 6-vs-6 to 20-vs-20 scales, validate the effectiveness and superiority of the proposed AHDRL_LRS through simulation experiments.
KW - distributed decision-making
KW - hierarchical reinforcement learning
KW - learnable reward shaping
KW - many-to-many air combat
UR - https://www.scopus.com/pages/publications/105027405680
U2 - 10.1007/s11431-025-3130-x
DO - 10.1007/s11431-025-3130-x
M3 - Article
AN - SCOPUS:105027405680
SN - 1674-7321
VL - 69
JO - Science China Technological Sciences
JF - Science China Technological Sciences
IS - 1
M1 - 1100303
ER -