Toward Determined Service for Distributed Machine Learning

Haowen Zhu, Minghao Ye, Zehua Guo*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Parameter Server (PS) is a typical Distributed Machine Learning (DML) enabler and is widely used in industry and academia. Existing works propose applying the emerging In-Network Aggregation (INA) technique to improve model training efficiency. However, existing INA systems may suffer from undetermined model training efficiency and service quality, because many gradient aggregation processes are still performed by the server under irrational gradient aggregation strategies. In this paper, we propose a Deterministic In-Network Aggregation (DINA) scheme that improves model training efficiency by enhancing the efficiency of INA utilization in DML. Our key observation is that worker sending rates can be further increased by reducing gradient packets' RTT. Based on this observation, DINA rationally selects the optimal global gradient aggregation switch depending on the switches' available memory, worker sending rates, and server processing capacity. Simulation results show that DINA can provide determined training, improving worker sending rates by 16%-87% and reducing network load by 28%-46.8% compared with existing solutions.
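The abstract names three inputs to the switch-selection decision (switch memory, worker sending rate, server processing capacity) but does not spell out the selection rule. The Python sketch below is only a hypothetical illustration of how such a choice might be made; the class, function, thresholds, and scoring rule are assumptions for illustration, not DINA's actual algorithm.

```python
# Hypothetical illustration only: not DINA's actual selection algorithm.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Switch:
    name: str
    free_memory_mb: float     # memory available for holding aggregation state
    rtt_to_workers_ms: float  # average gradient-packet RTT via this switch

def select_aggregation_switch(switches: List[Switch],
                              worker_rate_gbps: float,
                              server_capacity_gbps: float,
                              min_memory_mb: float = 64.0) -> Optional[Switch]:
    """Pick an in-network aggregation switch, or return None to aggregate at
    the parameter server.

    Assumed heuristic: if the server alone can absorb the aggregate worker
    traffic, keep aggregation at the server; otherwise choose the qualifying
    switch (enough free memory) that minimizes gradient-packet RTT, since a
    lower RTT lets workers send faster.
    """
    if worker_rate_gbps <= server_capacity_gbps:
        return None
    candidates = [s for s in switches if s.free_memory_mb >= min_memory_mb]
    if not candidates:
        return None
    return min(candidates, key=lambda s: s.rtt_to_workers_ms)

if __name__ == "__main__":
    fabric = [Switch("tor-1", free_memory_mb=128.0, rtt_to_workers_ms=0.05),
              Switch("agg-1", free_memory_mb=32.0, rtt_to_workers_ms=0.20)]
    chosen = select_aggregation_switch(fabric, worker_rate_gbps=40.0,
                                       server_capacity_gbps=25.0)
    print(chosen.name if chosen else "parameter server")
```

The fallback to the server when it has spare processing capacity is one plausible reading of "determined" service; the paper itself may weigh the three factors differently.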

Original language: English
Title of host publication: 2024 IEEE/ACM 32nd International Symposium on Quality of Service, IWQoS 2024
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350350128
DOIs
Publication status: Published - 2024
Event: 32nd IEEE/ACM International Symposium on Quality of Service, IWQoS 2024 - Guangzhou, China
Duration: 19 Jun 2024 - 21 Jun 2024

Publication series

Name: IEEE International Workshop on Quality of Service, IWQoS
ISSN (Print): 1548-615X

Conference

Conference: 32nd IEEE/ACM International Symposium on Quality of Service, IWQoS 2024
Country/Territory: China
City: Guangzhou
Period: 19/06/24 - 21/06/24
