An offline actor-critic policy improvement algorithm with historical state-action pairs

  • Huaqing Zhang*
  • Xiaofei Zhang
  • Jixiang Jiang
  • Mingrui Hao
  • Hongbin Ma
  • Ning Zhang

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Low state-action coverage (SACo) and abundant random behaviors in datasets pose significant challenges for offline reinforcement learning (RL). To mitigate the impact of dataset quality on offline RL, we propose an offline actor-critic policy improvement algorithm with historical state-action pairs (PIH). By applying a Box-Cox transformation to the logarithmic probabilities of dataset samples to obtain the offline policy gradient, PIH overcomes the tendency of existing offline RL methods to generate extrapolation errors when learning policies from low-quality datasets. This approach enables efficient and stable policy evaluation and improvement simultaneously using the same state-action pairs in the dataset, even when the dataset contains abundant random behaviors or its SACo is low. To calculate the advantage functions used in the offline policy gradient, a unified critic network is designed to jointly approximate state-value and action-value functions, enhancing policy learning. Extensive experiments on datasets from six benchmark environments demonstrate that, among state-of-the-art algorithms (CQL, TD3+BC, IQL, AWAC, etc.), only PIH learns policies efficiently and stably from datasets with low SACo and high randomness. Moreover, in evaluations on random-zero datasets, PIH achieves a 15.5% improvement in average return over the mean performance of the other algorithms.
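The paper's exact loss is not reproduced in this abstract, so the following is only a minimal sketch of the general idea: a Box-Cox transform (a power transform of a positive quantity) applied to action probabilities recovered from log-probabilities, weighted by advantage estimates. The function names `box_cox` and `pih_style_actor_loss`, the choice of lambda, and the use of probabilities (rather than some other positive statistic) as the transform's input are all illustrative assumptions, not the published PIH formulation.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a positive scalar x with parameter lam.

    Returns (x**lam - 1) / lam for lam != 0, and log(x) for lam == 0.
    The input must be strictly positive.
    """
    if lam == 0.0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def pih_style_actor_loss(log_probs, advantages, lam=0.5):
    """Toy surrogate actor loss (illustrative only, not the paper's loss).

    Recovers pi(a|s) from each log-probability, applies the Box-Cox
    transform, and weights the result by the advantage estimate, so
    positive-advantage actions are encouraged and negative-advantage
    actions discouraged.
    """
    loss = 0.0
    for lp, adv in zip(log_probs, advantages):
        prob = math.exp(lp)  # recover pi(a|s) in (0, 1], a positive input
        loss -= adv * box_cox(prob, lam)
    return loss / len(log_probs)
```

Note the transform's domain constraint: raw log-probabilities are negative, so any Box-Cox-based scheme must first map them to a positive quantity, as done here via `math.exp`.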

Original language: English
Article number: 8
Journal: International Journal of Machine Learning and Cybernetics
Volume: 17
Issue number: 1
DOIs
Publication status: Published - Jan 2026
Externally published: Yes

Keywords

  • Extrapolation errors
  • Intelligent control
  • Intelligent decision
  • Offline reinforcement learning
  • Policy gradient

