TY - GEN
T1 - An Efficient Policy Gradient Algorithm with Historical Behaviors Reusing in Multi Agent System
AU - Ding, Ao
AU - Zhang, Huaqing
AU - Ma, Hongbin
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - In multi-agent reinforcement learning, the efficiency with which an algorithm samples historical experience trajectories is regarded as key to improving agent performance. To make full use of interaction data and improve the agents' sampling efficiency, this paper proposes an efficient multi-agent reinforcement learning algorithm. The proposed multi-agent policy gradient algorithm with historical behavior reusing (MAPG-HBR) accounts for the influence of historical behaviors on the policy during the policy improvement stage, so that the agents can learn an approximately optimal joint policy. To obtain the advantage functions used in MAPG-HBR with only one critic network, a theoretically interpretable twin universal critic network is proposed, which simultaneously estimates the action-value function, the state-value function, and the corresponding target value functions for Clipped Double Q-Learning. We compare the algorithm against several baselines in Waterworld and Multi-Agent MuJoCo, two widely used multi-agent benchmark environments. The results show that MAPG-HBR outperforms the other algorithms in these environments.
AB - In multi-agent reinforcement learning, the efficiency with which an algorithm samples historical experience trajectories is regarded as key to improving agent performance. To make full use of interaction data and improve the agents' sampling efficiency, this paper proposes an efficient multi-agent reinforcement learning algorithm. The proposed multi-agent policy gradient algorithm with historical behavior reusing (MAPG-HBR) accounts for the influence of historical behaviors on the policy during the policy improvement stage, so that the agents can learn an approximately optimal joint policy. To obtain the advantage functions used in MAPG-HBR with only one critic network, a theoretically interpretable twin universal critic network is proposed, which simultaneously estimates the action-value function, the state-value function, and the corresponding target value functions for Clipped Double Q-Learning. We compare the algorithm against several baselines in Waterworld and Multi-Agent MuJoCo, two widely used multi-agent benchmark environments. The results show that MAPG-HBR outperforms the other algorithms in these environments.
KW - Historical behaviors reusing
KW - Multi-agent system
KW - Policy gradient
KW - Sampling efficiency
UR - http://www.scopus.com/inward/record.url?scp=105001669121&partnerID=8YFLogxK
U2 - 10.1109/CSIS-IAC63491.2024.10919340
DO - 10.1109/CSIS-IAC63491.2024.10919340
M3 - Conference contribution
AN - SCOPUS:105001669121
T3 - 2024 International Annual Conference on Complex Systems and Intelligent Science, CSIS-IAC 2024
SP - 585
EP - 592
BT - 2024 International Annual Conference on Complex Systems and Intelligent Science, CSIS-IAC 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 International Annual Conference on Complex Systems and Intelligent Science, CSIS-IAC 2024
Y2 - 20 September 2024 through 22 September 2024
ER -