Abstract
Commercial data center networks enable low-latency services, but switch and link failures threaten their stability. Existing methods that identify failures precisely at the port level are limited to specific topologies, while general methods localize failures only at the switch level. We present LinkSonar, a system based on semi-controllable probe paths that pinpoints failures at the port level. Using tunneling and ERSPAN, LinkSonar decomposes the entire network into individual links for probing, which enables balanced full-link coverage while minimizing dependence on topology structure. By modeling the probing process and introducing a failure inference algorithm, LinkSonar estimates link transmission success rates without requiring routing information, and it mitigates the cascading effects of concurrent faults and network noise. Comparative experiments with two state-of-the-art methods show that LinkSonar achieves superior localization accuracy under diverse load-balancing settings while reducing manual troubleshooting effort by more than 50%. Simulations across two representative topologies show that LinkSonar maintains F1-scores above 0.95 even under large-scale failure scenarios. Deployed for six months in a production data center with hundreds of multi-vendor switches, LinkSonar detected over 200 failures, 13 of which were silent packet drops or misconfigurations that are difficult to detect with Pingmesh-like approaches.
| Original language | English |
|---|---|
| Journal | IEEE Transactions on Networking |
| Publication status | Accepted/In press - 2026 |
Keywords
- Fault detection
- General DCN troubleshooting
- Network monitoring
- Network operation
Title
LinkSonar: A General and Fine-grained Approach for Failure Identification in Data Center Networks