TY - JOUR
T1 - AdaDerivative optimizer
T2 - Adapting step-sizes by the derivative term in past gradient information
AU - Zou, Weidong
AU - Xia, Yuanqing
AU - Cao, Weipeng
N1 - Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2023/3
Y1 - 2023/3
N2 - AdaBelief fully utilizes “belief” to iteratively update the parameters of deep neural networks. However, the reliability of the “belief” is determined by the gradient's prediction accuracy, and the key to this prediction accuracy is the selection of the smoothing parameter β1. AdaBelief also suffers from the overshoot problem, which occurs when the values of the parameters exceed the target values and cannot be corrected along the gradient direction. In this paper, we propose AdaDerivative to eliminate the overshoot problem of AdaBelief. The key to AdaDerivative is that the “belief” of AdaBelief is replaced by the exponential moving average (EMA) of the derivative term, which can be constructed as $(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}(g_i-g_{i-1})^2$ based on the past and current gradients. We validate the performance of AdaDerivative on a variety of tasks, including image classification, language modeling, node classification, image generation, and object detection. Extensive experimental results demonstrate that AdaDerivative achieves state-of-the-art performance.
AB - AdaBelief fully utilizes “belief” to iteratively update the parameters of deep neural networks. However, the reliability of the “belief” is determined by the gradient's prediction accuracy, and the key to this prediction accuracy is the selection of the smoothing parameter β1. AdaBelief also suffers from the overshoot problem, which occurs when the values of the parameters exceed the target values and cannot be corrected along the gradient direction. In this paper, we propose AdaDerivative to eliminate the overshoot problem of AdaBelief. The key to AdaDerivative is that the “belief” of AdaBelief is replaced by the exponential moving average (EMA) of the derivative term, which can be constructed as $(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}(g_i-g_{i-1})^2$ based on the past and current gradients. We validate the performance of AdaDerivative on a variety of tasks, including image classification, language modeling, node classification, image generation, and object detection. Extensive experimental results demonstrate that AdaDerivative achieves state-of-the-art performance.
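N1 - A minimal, hedged Python sketch of the update rule suggested by the abstract's EMA formula. This is not the authors' implementation; the function name, the bias correction, and the Adam-style step are assumptions added purely for illustration.

import numpy as np

def adaderivative_step(theta, g, g_prev, m, v, t, lr=1e-3,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    # First-moment EMA of the gradients (assumed Adam-style momentum).
    m = beta1 * m + (1 - beta1) * g
    # Denominator term from the abstract: EMA of squared gradient differences,
    # i.e. the recurrence form of (1 - beta2) * sum_i beta2**(t - i) * (g_i - g_{i-1})**2.
    v = beta2 * v + (1 - beta2) * (g - g_prev) ** 2
    # Bias correction (assumed by analogy with Adam/AdaBelief).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update using the derivative-term EMA in place of AdaBelief's "belief".
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v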
KW - Adam
KW - Deep neural networks
KW - Optimization algorithms
KW - Stochastic gradient descent
UR - http://www.scopus.com/inward/record.url?scp=85144604961&partnerID=8YFLogxK
U2 - 10.1016/j.engappai.2022.105755
DO - 10.1016/j.engappai.2022.105755
M3 - Article
AN - SCOPUS:85144604961
SN - 0952-1976
VL - 119
JO - Engineering Applications of Artificial Intelligence
JF - Engineering Applications of Artificial Intelligence
M1 - 105755
ER -