TY - JOUR
T1 - AdaSwitch
T2 - Adapting switch from Adam to SGDM by exponential function
AU - Zou, Weidong
AU - Xia, Yuanqing
AU - Zhong, Bineng
AU - Cao, Weipeng
N1 - Publisher Copyright:
© 2025 Elsevier B.V.
PY - 2026/3
Y1 - 2026/3
N2 - Optimizers play a critical role in the training of deep neural networks (DNNs). Adam is known for its fast convergence, while Stochastic Gradient Descent with Momentum (SGDM) is valued for its strong generalization capability. However, both exhibit limitations: SGDM often suffers from slow convergence initially, whereas Adam tends to generalize poorly in later stages. To address these issues, we propose AdaSwitch, an optimizer that combines the strengths of both methods, achieving rapid convergence early in training and robust generalization later on. AdaSwitch employs a linear combination based on an exponential function to smoothly transition from Adam to SGDM by adjusting the DNN parameters. We also provide a theoretical convergence guarantee for non-convex settings. The core idea is to express the network parameters θ_t as θ_t = β₃^t θ_t^Adam + (1 − β₃^t) θ_t^SGDM, where β₃ ∈ (0, 1) is the base of the adaptive exponential function. Extensive experiments on various architectures and tasks demonstrate that AdaSwitch outperforms existing methods in image classification, image generation, node classification, and few-shot visual classification, delivering both fast convergence and strong generalization.
AB - Optimizers play a critical role in the training of deep neural networks (DNNs). Adam is known for its fast convergence, while Stochastic Gradient Descent with Momentum (SGDM) is valued for its strong generalization capability. However, both exhibit limitations: SGDM often suffers from slow convergence initially, whereas Adam tends to generalize poorly in later stages. To address these issues, we propose AdaSwitch, an optimizer that combines the strengths of both methods, achieving rapid convergence early in training and robust generalization later on. AdaSwitch employs a linear combination based on an exponential function to smoothly transition from Adam to SGDM by adjusting the DNN parameters. We also provide a theoretical convergence guarantee for non-convex settings. The core idea is to express the network parameters θ_t as θ_t = β₃^t θ_t^Adam + (1 − β₃^t) θ_t^SGDM, where β₃ ∈ (0, 1) is the base of the adaptive exponential function. Extensive experiments on various architectures and tasks demonstrate that AdaSwitch outperforms existing methods in image classification, image generation, node classification, and few-shot visual classification, delivering both fast convergence and strong generalization.
KW - Deep learning
KW - Fast convergence
KW - Optimizers
KW - Robust generalization
UR - https://www.scopus.com/pages/publications/105027423065
U2 - 10.1016/j.asoc.2025.114459
DO - 10.1016/j.asoc.2025.114459
M3 - Article
AN - SCOPUS:105027423065
SN - 1568-4946
VL - 189
JO - Applied Soft Computing
JF - Applied Soft Computing
M1 - 114459
ER -