Congestion-Aware Critical Gradient Scheduling for Distributed Machine Learning in Data Center Networks

Zehua Guo*, Jiayu Wang, Sen Liu, Jineng Ren, Yang Xu, Chao Yao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

3 Citations (Scopus)

Abstract

Distributed Machine Learning (DML) is proposed not only to accelerate the training of machine learning models, but also to handle training datasets too large for a single machine. It employs multiple computing nodes in a data center to work collaboratively in parallel, at the cost of high communication overhead. Gradient Compression (GC) is introduced to reduce this overhead by reducing the number of gradients synchronized among computing nodes. However, existing GC solutions suffer from varying network congestion: when some computing nodes experience high congestion, their gradient transmission can be significantly delayed, slowing down the entire training process. To solve this problem, we propose FLASH, a congestion-aware GC solution for DML. FLASH accelerates training by jointly considering the iterative approximation of machine learning and dynamic network congestion. It maintains good training performance by adaptively adjusting and scheduling the number of gradients synchronized among computing nodes. We evaluate the effectiveness of FLASH using AlexNet and ResNet18 under different network congestion scenarios. Simulation results show that, for the same number of training epochs, FLASH reduces training time by 22-71% while maintaining good accuracy and low loss, compared with the existing memory top-K GC solution.
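The abstract describes memory top-K gradient compression with a congestion-adaptive number of synchronized gradients. The sketch below is illustrative only, not the paper's actual algorithm: the helper names (`compress_topk`, `adaptive_k`), the residual-memory formulation, and the assumption of a per-node congestion signal in [0, 1] are all hypothetical choices made for the example.

```python
# A minimal sketch (assumed, not FLASH's actual method) of congestion-aware
# memory top-k gradient compression: each worker folds unsent residuals into
# its gradient and scales how many entries it synchronizes by a congestion signal.
import numpy as np

def compress_topk(grad, k, memory):
    """Memory top-k: add residual memory, keep the k largest-magnitude entries."""
    acc = grad + memory                           # fold in previously unsent residuals
    idx = np.argpartition(np.abs(acc), -k)[-k:]   # indices of the k largest entries
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]                        # gradients to synchronize this round
    new_memory = acc - sparse                     # carry the rest to later rounds
    return sparse, new_memory

def adaptive_k(base_k, congestion, k_min):
    """Hypothetical schedule: synchronize fewer gradients when congestion is high.
    `congestion` in [0, 1] could come from link utilization or queue length."""
    return max(k_min, int(base_k * (1.0 - congestion)))

# Toy usage: one worker under low vs. high congestion.
rng = np.random.default_rng(0)
grad = rng.standard_normal(1000)
memory = np.zeros_like(grad)
for congestion in (0.1, 0.8):
    k = adaptive_k(base_k=100, congestion=congestion, k_min=10)
    sparse, memory = compress_topk(grad, k, memory)
    print(f"congestion={congestion:.1f} -> synchronizing {k} gradients")
```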

Original language: English
Pages (from-to): 2296-2311
Number of pages: 16
Journal: IEEE Transactions on Cloud Computing
Volume: 11
Issue number: 3
DOIs
Publication status: Published - 1 Jul 2023

Keywords

  • Distributed machine learning
  • data center networks
  • gradient scheduling
