TY - JOUR
T1 - LinkSonar
T2 - A General and Fine-grained Approach for Failure Identification in Data Center Networks
AU - Wu, Chao
AU - Song, Tian
AU - Zhang, Xuliang
AU - Yin, Dashan
AU - Meng, Qi
N1 - Publisher Copyright: © 2025 IEEE.
PY - 2026
Y1 - 2026
N2 - Commercial data center networks enable low-latency services, but switch and link failures threaten stability. Existing precise port-level failure identification methods are limited to specific topologies, while general ones localize only at the switch level. We present LinkSonar, a system based on semi-controllable probe paths that pinpoints failures at the port level. By leveraging tunneling and ERSPAN technologies, the entire network is decomposed into individual links for probing, thereby enabling balanced full-link coverage while minimizing dependence on topology structures. By modeling the probing process and introducing a failure inference algorithm, LinkSonar estimates link transmission success rates without requiring routing information, while mitigating cascading effects caused by concurrent faults and network noise. Comparative experiments with two state-of-the-art methods demonstrate that LinkSonar achieves superior localization accuracy under diverse load-balancing settings, while reducing manual troubleshooting efforts by more than 50%. Simulations across two representative topologies show that LinkSonar maintains F1-scores above 0.95 even under large-scale failure scenarios. Deployed for six months in a production data center with hundreds of multi-vendor switches, LinkSonar successfully detected over 200 failures, among which 13 correspond to silent packet drops or misconfigurations that are difficult to detect using Pingmesh-like approaches.
KW - Fault detection
KW - General DCN troubleshooting
KW - Network monitoring
KW - Network operation
UR - https://www.scopus.com/pages/publications/105039955243
U2 - 10.1109/TON.2026.3696241
DO - 10.1109/TON.2026.3696241
M3 - Article
AN - SCOPUS:105039955243
SN - 2998-4157
JO - IEEE Transactions on Networking
JF - IEEE Transactions on Networking
ER -