P4Com: Implementing P4-Based In-Network Data Reduction for Geo-Distributed Machine Learning

  • Boyuan Xiang
  • Cheng Chi*
  • Shuai Gao
  • Haonan Li
  • Xindi Hou
  *Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

Geo-Distributed Machine Learning (GDML) aims to enable datacenters located in different geographic regions to collaborate in training large-scale models. However, limited Wide Area Network (WAN) bandwidth restricts the performance of GDML systems. Existing solutions compromise model accuracy, while in-network computing-based approaches suffer from complex processing logic, high implementation overhead, and a high probability of gradient aggregation conflicts. This paper therefore proposes P4Com, a lightweight in-network data reduction mechanism based on the P4 programmable data plane. First, we design a network-layer identifier to represent gradient aggregation tasks. Then, we propose a register conflict avoidance mechanism to improve register utilization. Building on these, we design a lightweight data plane in the protocol-independent P4 language to support line-rate in-network gradient aggregation. Finally, we build a prototype system for validation, and experimental results show that P4Com significantly reduces the runtime latency of GDML systems.
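
To make the idea of in-network gradient aggregation concrete, the following is a minimal Python sketch of the general technique, not P4Com's actual data plane. The abstract does not specify packet formats or the conflict avoidance mechanism, so everything here is an assumption made for illustration: the `ToySwitch` class, the `task_id % num_slots` slot mapping, the fixed-point `SCALE`, and the choice to pass a colliding packet through unaggregated. The sketch only shows why aggregation reduces WAN traffic (N worker packets become one) and where a register conflict arises (two tasks mapping to the same slot).

```python
SCALE = 1 << 16  # assumed fixed-point scale; switch ALUs typically lack floating point


class ToySwitch:
    """Hypothetical software model of a switch with a small register array."""

    def __init__(self, num_slots, num_workers):
        # Each slot models one register entry: owning task, running sum, senders seen.
        self.slots = [None] * num_slots
        self.num_workers = num_workers

    def on_packet(self, task_id, worker_id, grads):
        """Process one gradient packet; return (action, payload)."""
        index = task_id % len(self.slots)  # assumed mapping from task identifier to slot
        slot = self.slots[index]
        if slot is None:
            # Free slot: claim it for this aggregation task.
            slot = {"task": task_id, "sum": [0] * len(grads), "seen": set()}
            self.slots[index] = slot
        elif slot["task"] != task_id:
            # Register conflict: another task holds the slot.
            # Placeholder policy only: forward the packet unaggregated.
            return ("forward", grads)
        if worker_id in slot["seen"]:
            return ("drop", None)  # duplicate (e.g. a retransmission)
        slot["seen"].add(worker_id)
        for i, g in enumerate(grads):
            slot["sum"][i] += round(g * SCALE)  # integer accumulation
        if len(slot["seen"]) < self.num_workers:
            return ("drop", None)  # absorbed into the register; nothing sent yet
        # All workers contributed: emit one aggregated packet and free the slot.
        result = [s / SCALE for s in slot["sum"]]
        self.slots[index] = None
        return ("emit", result)


switch = ToySwitch(num_slots=4, num_workers=2)
first = switch.on_packet(task_id=7, worker_id=0, grads=[0.5, -1.25])
collide = switch.on_packet(task_id=3, worker_id=0, grads=[1.0, 1.0])  # 3 % 4 == 7 % 4
second = switch.on_packet(task_id=7, worker_id=1, grads=[1.5, 0.25])
print(first, collide, second)
```

In this toy model the first packet is absorbed, the colliding task is forwarded without aggregation, and the second packet for task 7 triggers a single aggregated result. A real design would replace the pass-through policy with its own conflict handling and would run in the switch pipeline rather than in software.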

Original language: English
Title of host publication: 10th International Conference on Computer and Communication Systems, ICCCS 2025
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 8-13
Number of pages: 6
ISBN (Electronic): 9798331523145
DOIs
Publication status: Published - 2025
Externally published: Yes
Event: 10th International Conference on Computer and Communication Systems, ICCCS 2025 - Chengdu, China
Duration: 18 Apr 2025 to 21 Apr 2025

Publication series

Name: 10th International Conference on Computer and Communication Systems, ICCCS 2025

Conference

Conference: 10th International Conference on Computer and Communication Systems, ICCCS 2025
Country/Territory: China
City: Chengdu
Period: 18/04/25 to 21/04/25

Keywords

  • GDML
  • In-network computing
  • P4
  • Programmable data plane
