LinkSonar: A General and Fine-grained Approach for Failure Identification in Data Center Networks

  • Chao Wu
  • Tian Song*
  • Xuliang Zhang
  • Dashan Yin
  • Qi Meng
  • *Corresponding author for this work
  • Beijing Institute of Technology
  • Xiaohongshu
  • Meituan

Research output: Contribution to journal › Article › peer-review

Abstract

Commercial data center networks enable low-latency services, but switch and link failures threaten stability. Existing precise port-level failure identification methods are limited to specific topologies, while general ones localize only at the switch level. We present LinkSonar, a system based on semi-controllable probe paths that pinpoints failures at the port level. By leveraging tunneling and ERSPAN technologies, the entire network is decomposed into individual links for probing, thereby enabling balanced full-link coverage while minimizing dependence on topology structures. By modeling the probing process and introducing a failure inference algorithm, LinkSonar estimates link transmission success rates without requiring routing information, while mitigating cascading effects caused by concurrent faults and network noise. Comparative experiments with two state-of-the-art methods demonstrate that LinkSonar achieves superior localization accuracy under diverse load-balancing settings, while reducing manual troubleshooting efforts by more than 50%. Simulations across two representative topologies show that LinkSonar maintains F1-scores above 0.95 even under large-scale failure scenarios. Deployed for six months in a production data center with hundreds of multi-vendor switches, LinkSonar successfully detected over 200 failures, among which 13 correspond to silent packet drops or misconfigurations that are difficult to detect using Pingmesh-like approaches.
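
To make two ideas from the abstract concrete, per-link transmission success rates and the F1-score used in the evaluation, here is a minimal Python sketch. It is not LinkSonar's inference algorithm, which the abstract does not specify. It assumes probe results have already been attributed to individual links, and every name in it (`LinkProbeStats`, `flag_suspect_links`, the `threshold` and `min_probes` defaults) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class LinkProbeStats:
    """Probe counters for one link (hypothetical structure, not from the paper)."""
    sent: int
    received: int


def flag_suspect_links(stats_by_link, threshold=0.99, min_probes=100):
    """Return links whose observed success rate falls below `threshold`.

    Links with fewer than `min_probes` probes are skipped, because a rate
    computed from a handful of probes is too noisy to act on. Both defaults
    are illustrative assumptions.
    """
    suspects = []
    for link, stats in stats_by_link.items():
        if stats.sent < min_probes:
            continue
        if stats.received / stats.sent < threshold:
            suspects.append(link)
    return sorted(suspects)


def f1_score(predicted, actual):
    """Harmonic mean of precision and recall over sets of failed links."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Example: links are identified by (switch, port) pairs.
stats = {
    ("s1", "p1"): LinkProbeStats(sent=1000, received=1000),
    ("s1", "p2"): LinkProbeStats(sent=1000, received=900),
    ("s2", "p1"): LinkProbeStats(sent=10, received=0),  # too few probes to judge
}
suspects = flag_suspect_links(stats)
```

In this example only `("s1", "p2")` is flagged: its observed success rate of 0.9 is below the threshold, while `("s2", "p1")` is skipped for lack of samples. A simple threshold like this does not handle the concurrent faults and network noise that the abstract says the actual inference algorithm mitigates.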

Original language: English
Journal: IEEE Transactions on Networking
Publication status: Accepted/In press - 2026

Keywords

  • Fault detection
  • General DCN troubleshooting
  • Network monitoring
  • Network operation
