TY - JOUR
T1 - A zeroth-order stochastic implicit method for bilevel-structured actor-critic schemes
AU - Tao, Haochen
AU - Cui, Shisheng
AU - Li, Zhuo
AU - Sun, Jian
N1 - Publisher Copyright: © Science China Press 2025.
PY - 2025/5
Y1 - 2025/5
N2 - Reinforcement learning algorithms are central to the cognition and decision-making of embodied intelligent agents. A bilevel optimization (BO) modeling approach, along with a host of efficient BO algorithms, has been proven to be an effective means of addressing actor-critic (AC) policy optimization problems. In this work, based on a bilevel-structured AC problem model, an implicit zeroth-order stochastic algorithm is developed. A locally randomized spherical smoothing technique, which can be applied to nonsmooth nonconvex implicit AC formulations and avoid the closed-form lower-level mapping, is introduced. In the proposed zeroth-order scheme, the gradient of the implicit function can be approximated through inexact lower-level value estimations that are practically available. Under suitable assumptions, the algorithmic framework designed for the bilevel AC method is characterized by convergence guarantees under a fixed stepsize and smoothing parameter. Moreover, the proposed algorithm is equipped with the overall iteration complexity of $O(n^2 L_0^2 \tilde{L}_0^2 \epsilon^{-1})$. The convergence performance of the proposed algorithm is verified through numerical simulations.
AB - Reinforcement learning algorithms are central to the cognition and decision-making of embodied intelligent agents. A bilevel optimization (BO) modeling approach, along with a host of efficient BO algorithms, has been proven to be an effective means of addressing actor-critic (AC) policy optimization problems. In this work, based on a bilevel-structured AC problem model, an implicit zeroth-order stochastic algorithm is developed. A locally randomized spherical smoothing technique, which can be applied to nonsmooth nonconvex implicit AC formulations and avoid the closed-form lower-level mapping, is introduced. In the proposed zeroth-order scheme, the gradient of the implicit function can be approximated through inexact lower-level value estimations that are practically available. Under suitable assumptions, the algorithmic framework designed for the bilevel AC method is characterized by convergence guarantees under a fixed stepsize and smoothing parameter. Moreover, the proposed algorithm is equipped with the overall iteration complexity of $O(n^2 L_0^2 \tilde{L}_0^2 \epsilon^{-1})$. The convergence performance of the proposed algorithm is verified through numerical simulations.
KW - actor-critic
KW - bilevel optimization
KW - implicit programming
KW - stochastic approximation
KW - zeroth-order algorithm
UR - http://www.scopus.com/inward/record.url?scp=105003858922&partnerID=8YFLogxK
U2 - 10.1007/s11432-024-4397-7
DO - 10.1007/s11432-024-4397-7
M3 - Article
AN - SCOPUS:105003858922
SN - 1674-733X
VL - 68
JO - Science China Information Sciences
JF - Science China Information Sciences
IS - 5
M1 - 150204
ER -