TY - JOUR
T1 - An offline actor-critic policy improvement algorithm with historical state-action pairs
AU - Zhang, Huaqing
AU - Zhang, Xiaofei
AU - Jiang, Jixiang
AU - Hao, Mingrui
AU - Ma, Hongbin
AU - Zhang, Ning
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2026.
PY - 2026/1
Y1 - 2026/1
N2 - Low state-action coverage (SACo) and abundant random behaviors in datasets pose significant challenges for offline reinforcement learning (RL). To mitigate the impact of dataset quality on offline RL, we propose an offline actor-critic policy improvement algorithm with historical state-action pairs (PIH). By applying a Box-Cox transformation to the logarithmic probabilities of dataset samples to obtain the offline policy gradient, PIH avoids the extrapolation errors that existing offline RL methods are prone to when learning policies from low-quality datasets. This approach enables policy evaluation and improvement to be performed efficiently and stably using the same state-action pairs in the dataset, even when the dataset contains abundant random behaviors or its SACo is low. To compute the advantage functions used in the offline policy gradient, a unified critic network is designed to jointly approximate the state-value and action-value functions, enhancing policy learning. Extensive experiments on datasets from six benchmark environments demonstrate that, when learning from datasets with low SACo and high randomness, only PIH learns policies efficiently and stably compared with state-of-the-art algorithms (CQL, TD3+BC, IQL, AWAC, etc.). Moreover, on the random-zero datasets, PIH achieves a 15.5% improvement in average return over the mean performance of the other algorithms.
AB - Low state-action coverage (SACo) and abundant random behaviors in datasets pose significant challenges for offline reinforcement learning (RL). To mitigate the impact of dataset quality on offline RL, we propose an offline actor-critic policy improvement algorithm with historical state-action pairs (PIH). By applying a Box-Cox transformation to the logarithmic probabilities of dataset samples to obtain the offline policy gradient, PIH avoids the extrapolation errors that existing offline RL methods are prone to when learning policies from low-quality datasets. This approach enables policy evaluation and improvement to be performed efficiently and stably using the same state-action pairs in the dataset, even when the dataset contains abundant random behaviors or its SACo is low. To compute the advantage functions used in the offline policy gradient, a unified critic network is designed to jointly approximate the state-value and action-value functions, enhancing policy learning. Extensive experiments on datasets from six benchmark environments demonstrate that, when learning from datasets with low SACo and high randomness, only PIH learns policies efficiently and stably compared with state-of-the-art algorithms (CQL, TD3+BC, IQL, AWAC, etc.). Moreover, on the random-zero datasets, PIH achieves a 15.5% improvement in average return over the mean performance of the other algorithms.
KW - Extrapolation errors
KW - Intelligent control
KW - Intelligent decision
KW - Offline reinforcement learning
KW - Policy gradient
UR - https://www.scopus.com/pages/publications/105027727562
U2 - 10.1007/s13042-025-02963-9
DO - 10.1007/s13042-025-02963-9
M3 - Article
AN - SCOPUS:105027727562
SN - 1868-8071
VL - 17
JO - International Journal of Machine Learning and Cybernetics
JF - International Journal of Machine Learning and Cybernetics
IS - 1
M1 - 8
ER -