TY - JOUR
T1 - AdaDerivative optimizer
T2 - Adapting step-sizes by the derivative term in past gradient information
AU - Zou, Weidong
AU - Xia, Yuanqing
AU - Cao, Weipeng
N1 - Publisher Copyright:
© 2022 Elsevier Ltd
PY - 2023/3
Y1 - 2023/3
N2 - AdaBelief fully utilizes “belief” to iteratively update the parameters of deep neural networks. However, the reliability of the “belief” is determined by the gradient's prediction accuracy, and the key to this prediction accuracy is the selection of the smoothing parameter β1. AdaBelief also suffers from the overshoot problem, which occurs when the values of the parameters exceed the target values and cannot be corrected along the gradient direction. In this paper, we propose AdaDerivative to eliminate the overshoot problem of AdaBelief. The key to AdaDerivative is that the “belief” of AdaBelief is replaced by the exponential moving average (EMA) of the derivative term, which can be constructed as $(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}(g_i-g_{i-1})^2$ based on the past and current gradients. We validate the performance of AdaDerivative on a variety of tasks, including image classification, language modeling, node classification, image generation, and object detection. Extensive experimental results demonstrate that AdaDerivative achieves state-of-the-art performance.
AB - AdaBelief fully utilizes “belief” to iteratively update the parameters of deep neural networks. However, the reliability of the “belief” is determined by the gradient's prediction accuracy, and the key to this prediction accuracy is the selection of the smoothing parameter β1. AdaBelief also suffers from the overshoot problem, which occurs when the values of the parameters exceed the target values and cannot be corrected along the gradient direction. In this paper, we propose AdaDerivative to eliminate the overshoot problem of AdaBelief. The key to AdaDerivative is that the “belief” of AdaBelief is replaced by the exponential moving average (EMA) of the derivative term, which can be constructed as $(1-\beta_2)\sum_{i=1}^{t}\beta_2^{t-i}(g_i-g_{i-1})^2$ based on the past and current gradients. We validate the performance of AdaDerivative on a variety of tasks, including image classification, language modeling, node classification, image generation, and object detection. Extensive experimental results demonstrate that AdaDerivative achieves state-of-the-art performance.
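N1 - A minimal, hedged Python sketch of the update rule suggested by the abstract's EMA formula. This is not the authors' implementation; the function name, the bias correction, and the Adam-style step are assumptions added purely for illustration.

import numpy as np

def adaderivative_step(theta, g, g_prev, m, v, t, lr=1e-3,
                       beta1=0.9, beta2=0.999, eps=1e-8):
    # First-moment EMA of the gradients (assumed Adam-style momentum).
    m = beta1 * m + (1 - beta1) * g
    # Denominator term from the abstract: EMA of squared gradient differences,
    # i.e. the recurrence form of (1 - beta2) * sum_i beta2**(t - i) * (g_i - g_{i-1})**2.
    v = beta2 * v + (1 - beta2) * (g - g_prev) ** 2
    # Bias correction (assumed by analogy with Adam/AdaBelief).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Parameter update using the derivative-term EMA in place of AdaBelief's "belief".
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v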
KW - Adam
KW - Deep neural networks
KW - Optimization algorithms
KW - Stochastic gradient descent
UR - http://www.scopus.com/inward/record.url?scp=85144604961&partnerID=8YFLogxK
U2 - 10.1016/j.engappai.2022.105755
DO - 10.1016/j.engappai.2022.105755
M3 - Article
AN - SCOPUS:85144604961
SN - 0952-1976
VL - 119
JO - Engineering Applications of Artificial Intelligence
JF - Engineering Applications of Artificial Intelligence
M1 - 105755
ER -