TY - GEN
T1 - Toward Determined Service for Distributed Machine Learning
AU - Zhu, Haowen
AU - Ye, Minghao
AU - Guo, Zehua
N1 - Publisher Copyright: © 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Parameter Server (PS) is a typical Distributed Machine Learning (DML) enabler that is widely used in industry and academia. Existing works propose applying the emerging In-Network Aggregation (INA) technique to improve model training efficiency. However, existing INA systems may suffer from undetermined model training efficiency and service quality, because many gradient aggregation processes are still performed by the server under irrational gradient aggregation strategies. In this paper, we propose a Deterministic In-Network Aggregation (DINA) scheme that improves model training efficiency by making more efficient use of INA in DML. Our key observation is that worker sending rates can be further increased by reducing the RTT of gradient packets. Based on this observation, DINA rationally selects the optimal global gradient aggregation switch according to the switches' available memory, the worker sending rate, and the server processing capacity. Simulation results show that DINA can provide determined training, improving worker sending rates by 16%-87% and network load by 28%-46.8% compared with existing solutions.
UR - http://www.scopus.com/inward/record.url?scp=85206348836&partnerID=8YFLogxK
U2 - 10.1109/IWQoS61813.2024.10682933
DO - 10.1109/IWQoS61813.2024.10682933
M3 - Conference contribution
AN - SCOPUS:85206348836
T3 - IEEE International Workshop on Quality of Service, IWQoS
BT - 2024 IEEE/ACM 32nd International Symposium on Quality of Service, IWQoS 2024
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 32nd IEEE/ACM International Symposium on Quality of Service, IWQoS 2024
Y2 - 19 June 2024 through 21 June 2024
ER -