TY - GEN
T1 - MDPo
T2 - 2024 International Conference on Generative Artificial Intelligence and Information Security, GAIIS 2024
AU - Liu, Chen
AU - Wang, Yizhuo
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/5/10
Y1 - 2024/5/10
N2 - Offline reinforcement learning aims to enable agents to derive effective policies for decision-making tasks from a pre-existing dataset. This learning paradigm has broad application prospects in areas with strict safety constraints, such as healthcare and robotic control. However, existing offline reinforcement learning algorithms often overlook the impact of the dataset's inherent multimodal distribution on policy optimization during training, which compromises model performance. To tackle this challenge, we introduce MDPo (Mixture Density Policy), a novel offline reinforcement learning algorithm based on a mixture density policy network. MDPo first trains the value function with an expectile regression loss. It then constructs the policy with a mixture density network and trains it under a distributional constraint, ultimately learning a high-quality policy model under the combined influence of the reward signal and the policy constraint. By leveraging mixture density networks, MDPo models the policy as a multimodal distribution, increasing its representational capacity so that it better fits the multimodal distribution of actions in the dataset, thereby stabilizing training and improving model performance. Experiments on the AntMaze tasks of the D4RL benchmark demonstrate that MDPo significantly outperforms existing state-of-the-art methods while also exhibiting improved training stability.
AB - Offline reinforcement learning aims to enable agents to derive effective policies for decision-making tasks from a pre-existing dataset. This learning paradigm has broad application prospects in areas with strict safety constraints, such as healthcare and robotic control. However, existing offline reinforcement learning algorithms often overlook the impact of the dataset's inherent multimodal distribution on policy optimization during training, which compromises model performance. To tackle this challenge, we introduce MDPo (Mixture Density Policy), a novel offline reinforcement learning algorithm based on a mixture density policy network. MDPo first trains the value function with an expectile regression loss. It then constructs the policy with a mixture density network and trains it under a distributional constraint, ultimately learning a high-quality policy model under the combined influence of the reward signal and the policy constraint. By leveraging mixture density networks, MDPo models the policy as a multimodal distribution, increasing its representational capacity so that it better fits the multimodal distribution of actions in the dataset, thereby stabilizing training and improving model performance. Experiments on the AntMaze tasks of the D4RL benchmark demonstrate that MDPo significantly outperforms existing state-of-the-art methods while also exhibiting improved training stability.
KW - Multimodal distribution
KW - Offline reinforcement learning
KW - Reinforcement learning
UR - http://www.scopus.com/inward/record.url?scp=85198954941&partnerID=8YFLogxK
U2 - 10.1145/3665348.3665372
DO - 10.1145/3665348.3665372
M3 - Conference contribution
AN - SCOPUS:85198954941
T3 - ACM International Conference Proceeding Series
SP - 132
EP - 137
BT - Proceedings of 2024 International Conference on Generative Artificial Intelligence and Information Security, GAIIS 2024
PB - Association for Computing Machinery
Y2 - 10 May 2024 through 12 May 2024
ER -
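
The abstract above describes a policy represented by a mixture density network trained against a distributional (dataset-likelihood) constraint. The sketch below is only an illustration of that general idea, assuming PyTorch; the class name MixtureDensityPolicy, the layer sizes, the number of mixture components, and the log-likelihood form of the constraint are assumptions for illustration, not the authors' implementation.

# Illustrative sketch (not from the paper): a Gaussian mixture density policy head
# of the kind the abstract describes, built from standard torch.distributions.
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal


class MixtureDensityPolicy(nn.Module):
    """Maps a state to a mixture-of-Gaussians distribution over actions,
    so the policy can represent multimodal action distributions."""

    def __init__(self, state_dim, action_dim, n_components=5, hidden=256):
        super().__init__()
        self.action_dim = action_dim
        self.n_components = n_components
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Mixture weights, means, and log-stds, one set per component.
        self.logits = nn.Linear(hidden, n_components)
        self.means = nn.Linear(hidden, n_components * action_dim)
        self.log_stds = nn.Linear(hidden, n_components * action_dim)

    def forward(self, state):
        h = self.trunk(state)
        logits = self.logits(h)
        means = self.means(h).view(-1, self.n_components, self.action_dim)
        log_stds = self.log_stds(h).view(-1, self.n_components, self.action_dim).clamp(-5.0, 2.0)
        components = Independent(Normal(means, log_stds.exp()), 1)
        return MixtureSameFamily(Categorical(logits=logits), components)


# One plausible form of a distributional constraint: maximize the log-likelihood
# of dataset actions under the mixture policy (a behavior-cloning-style term that
# would be combined with a value-weighted objective in an offline RL setup).
policy = MixtureDensityPolicy(state_dim=17, action_dim=6)
states = torch.randn(32, 17)          # placeholder batch standing in for dataset states
actions = torch.rand(32, 6) * 2 - 1   # placeholder batch standing in for dataset actions
constraint_loss = -policy(states).log_prob(actions).mean()
constraint_loss.backward()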