Congestion-Aware Critical Gradient Scheduling for Distributed Machine Learning in Data Center Networks

Zehua Guo*, Jiayu Wang, Sen Liu, Jineng Ren, Yang Xu, Chao Yao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Distributed Machine Learning (DML) is proposed not only to accelerate the training of machine learning models, but also to handle volumes of training data that exceed the capacity of a single machine. It employs multiple computing nodes in a data center that work collaboratively in parallel, at the cost of high communication overhead. Gradient Compression (GC) is introduced to reduce this communication overhead by reducing the number of gradients synchronized among computing nodes. However, existing GC solutions suffer from varying network congestion: when some computing nodes experience high network congestion, their gradient transmission can be significantly delayed, slowing down the entire training process. To solve this problem, we propose FLASH, a congestion-aware GC solution for DML. FLASH accelerates the training process by jointly considering the iterative approximation of machine learning and dynamic network congestion scenarios. It maintains good training performance by adaptively adjusting and scheduling the number of gradients synchronized among computing nodes. We evaluate the effectiveness of FLASH using AlexNet and ResNet18 under different network congestion scenarios. Simulation results show that, for the same number of training epochs, FLASH reduces training time by 22-71% while maintaining good accuracy and low loss, compared with the existing memory top-K GC solution.
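For context, the abstract compares against a memory top-K GC baseline. The sketch below (a minimal illustration, not the paper's FLASH implementation; the function name topk_compress and the NumPy-based setup are assumptions) shows the general idea of that baseline: each worker synchronizes only the k largest-magnitude gradient entries and folds the remainder into a local residual ("memory") that is added back in the next iteration, so no gradient information is permanently discarded.

```python
import numpy as np

def topk_compress(grad, k, residual):
    """Memory (error-feedback) top-K gradient compression sketch.

    grad:     local gradient tensor for the current iteration
    k:        number of entries to synchronize
    residual: accumulated gradient left over from previous iterations
    Returns the sparse gradient to synchronize and the updated residual.
    """
    corrected = grad + residual                        # apply accumulated memory
    flat = corrected.ravel()
    idx = np.argpartition(np.abs(flat), -k)[-k:]       # indices of the k largest magnitudes
    sparse = np.zeros_like(flat)
    sparse[idx] = flat[idx]                            # entries sent to other nodes
    new_residual = (flat - sparse).reshape(grad.shape) # memory carried to the next iteration
    return sparse.reshape(grad.shape), new_residual
```

Per the abstract, FLASH adapts the number of synchronized gradients (the role of k above) per node according to observed network congestion; the scheduling mechanism itself is described in the full paper and is not reproduced here.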

Original language: English
Pages (from-to): 2296-2311
Number of pages: 16
Journal: IEEE Transactions on Cloud Computing
Volume: 11
Issue number: 3
DOI
Publication status: Published - 1 Jul 2023
