TY - JOUR
T1 - Congestion-Aware Critical Gradient Scheduling for Distributed Machine Learning in Data Center Networks
AU - Guo, Zehua
AU - Wang, Jiayu
AU - Liu, Sen
AU - Ren, Jineng
AU - Xu, Yang
AU - Yao, Chao
N1 - Publisher Copyright:
© 2013 IEEE.
PY - 2023/7/1
Y1 - 2023/7/1
N2 - Distributed Machine Learning (DML) is proposed not only to accelerate the training of machine learning models but also to address the inability of a single node to handle large amounts of training data. It employs multiple computing nodes in a data center to work collaboratively in parallel, at the cost of high communication overhead. Gradient Compression (GC) is introduced to reduce this overhead by reducing the number of synchronized gradients among computing nodes. However, existing GC solutions suffer under varying network congestion. Specifically, when some computing nodes experience high network congestion, their gradient transmission can be significantly delayed, slowing down the entire training process. To solve this problem, we propose FLASH, a congestion-aware GC solution for DML. FLASH accelerates the training process by jointly considering the iterative approximation of machine learning and dynamic network congestion scenarios. It maintains good training performance by adaptively adjusting and scheduling the number of synchronized gradients among computing nodes. We evaluate the effectiveness of FLASH using AlexNet and ResNet18 under different network congestion scenarios. Simulation results show that, under the same number of training epochs, FLASH reduces training time by 22-71% while maintaining good accuracy and low loss, compared with the existing memory top-K GC solution.
AB - Distributed Machine Learning (DML) is proposed not only to accelerate the training of machine learning models but also to address the inability of a single node to handle large amounts of training data. It employs multiple computing nodes in a data center to work collaboratively in parallel, at the cost of high communication overhead. Gradient Compression (GC) is introduced to reduce this overhead by reducing the number of synchronized gradients among computing nodes. However, existing GC solutions suffer under varying network congestion. Specifically, when some computing nodes experience high network congestion, their gradient transmission can be significantly delayed, slowing down the entire training process. To solve this problem, we propose FLASH, a congestion-aware GC solution for DML. FLASH accelerates the training process by jointly considering the iterative approximation of machine learning and dynamic network congestion scenarios. It maintains good training performance by adaptively adjusting and scheduling the number of synchronized gradients among computing nodes. We evaluate the effectiveness of FLASH using AlexNet and ResNet18 under different network congestion scenarios. Simulation results show that, under the same number of training epochs, FLASH reduces training time by 22-71% while maintaining good accuracy and low loss, compared with the existing memory top-K GC solution.
KW - Distributed machine learning
KW - data center networks
KW - gradient scheduling
UR - http://www.scopus.com/inward/record.url?scp=85136062006&partnerID=8YFLogxK
U2 - 10.1109/TCC.2022.3197350
DO - 10.1109/TCC.2022.3197350
M3 - Article
AN - SCOPUS:85136062006
SN - 2168-7161
VL - 11
SP - 2296
EP - 2311
JO - IEEE Transactions on Cloud Computing
JF - IEEE Transactions on Cloud Computing
IS - 3
ER -