TY - JOUR
T1 - Congestion-Aware Critical Gradient Scheduling for Distributed Machine Learning in Data Center Networks
AU - Guo, Zehua
AU - Wang, Jiayu
AU - Liu, Sen
AU - Ren, Jineng
AU - Xu, Yang
AU - Yao, Chao
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2023/7/1
Y1 - 2023/7/1
N2 - Distributed Machine Learning (DML) is proposed not only to accelerate the training of machine learning models but also to address the inability of a single node to handle large amounts of training data. It employs multiple computing nodes in a data center to work collaboratively in parallel, at the cost of high communication overhead. Gradient Compression (GC) is introduced to reduce this overhead by reducing the number of synchronized gradients among computing nodes. However, existing GC solutions suffer under varying network congestion. Specifically, when some computing nodes experience high network congestion, their gradient transmission can be significantly delayed, slowing down the entire training process. To solve this problem, we propose FLASH, a congestion-aware GC solution for DML. FLASH accelerates the training process by jointly considering the iterative approximation of machine learning and dynamic network congestion scenarios. It maintains good training performance by adaptively adjusting and scheduling the number of synchronized gradients among computing nodes. We evaluate the effectiveness of FLASH using AlexNet and ResNet18 under different network congestion scenarios. Simulation results show that, under the same number of training epochs, FLASH reduces training time by 22-71% while maintaining good accuracy and low loss, compared with the existing memory top-K GC solution.
AB - Distributed Machine Learning (DML) is proposed not only to accelerate the training of machine learning models but also to address the inability of a single node to handle large amounts of training data. It employs multiple computing nodes in a data center to work collaboratively in parallel, at the cost of high communication overhead. Gradient Compression (GC) is introduced to reduce this overhead by reducing the number of synchronized gradients among computing nodes. However, existing GC solutions suffer under varying network congestion. Specifically, when some computing nodes experience high network congestion, their gradient transmission can be significantly delayed, slowing down the entire training process. To solve this problem, we propose FLASH, a congestion-aware GC solution for DML. FLASH accelerates the training process by jointly considering the iterative approximation of machine learning and dynamic network congestion scenarios. It maintains good training performance by adaptively adjusting and scheduling the number of synchronized gradients among computing nodes. We evaluate the effectiveness of FLASH using AlexNet and ResNet18 under different network congestion scenarios. Simulation results show that, under the same number of training epochs, FLASH reduces training time by 22-71% while maintaining good accuracy and low loss, compared with the existing memory top-K GC solution.
KW - Distributed machine learning
KW - data center networks
KW - gradient scheduling
UR - http://www.scopus.com/inward/record.url?scp=85136062006&partnerID=8YFLogxK
U2 - 10.1109/TCC.2022.3197350
DO - 10.1109/TCC.2022.3197350
M3 - Article
AN - SCOPUS:85136062006
SN - 2168-7161
VL - 11
SP - 2296
EP - 2311
JO - IEEE Transactions on Cloud Computing
JF - IEEE Transactions on Cloud Computing
IS - 3
ER -