TY - JOUR
T1 - PRIME
T2 - Policy Representation Integration With Metavalue-Modulated Evolution in Multiagent Reinforcement Learning
AU - Sun, Licheng
AU - Ma, Hongbin
N1 - Publisher Copyright: © 2026 IEEE.
PY - 2026
Y1 - 2026
AB - Multiagent reinforcement learning (MARL) remains fundamentally challenged by partial observability, unstable value learning, and inefficient exploration - difficulties that intensify in high-dimensional robotic control and large-scale coordination scenarios. Meanwhile, the existing algorithms lack a mechanism to guide the improvement of long-term strategies. We propose policy representation integration for metavalue evolution (PRIME) in MARL that addresses these limitations through representation-asymmetric policy parameterization, metavalue-augmented optimization, and metavalue-modulated evolutionary search. PRIME constructs a shared nonlinear encoder with lightweight team-specific linear heads, providing a coherent latent policy manifold that supports both fine-grained robotic manipulation and large-population coordination. A learned metavalue function estimates the long-horizon utility of policy updates, whose gradients shape both actor learning and representation formation. In parallel, evolutionary operators - direction-aware crossover and metagradient-scaled low-rank mutation - enable globally diverse yet strategically targeted exploration in policy space. Evaluations on multiagent MuJoCo (MA-MuJoCo), DexHands dexterous manipulation, and the large-scale decentralized collective assault (DCA) benchmark demonstrate that PRIME achieves consistently superior performance, faster convergence, and stronger robustness than state-of-the-art baselines.
KW - Deep learning
KW - large-scale multiagent systems (LMASs)
KW - metavalue
KW - multiagent reinforcement learning (MARL)
KW - multirobot control
UR - https://www.scopus.com/pages/publications/105033255466
U2 - 10.1109/JIOT.2026.3674392
DO - 10.1109/JIOT.2026.3674392
M3 - Article
AN - SCOPUS:105033255466
SN - 2327-4662
VL - 13
SP - 24893
EP - 24911
JO - IEEE Internet of Things Journal
JF - IEEE Internet of Things Journal
IS - 11
ER -