
LinkSonar: A General and Fine-grained Approach for Failure Identification in Data Center Networks

  • Chao Wu
  • Tian Song*
  • Xuliang Zhang
  • Dashan Yin
  • Qi Meng
  • *Corresponding author for this work
  • Beijing Institute of Technology
  • Xiaohongshu
  • Meituan

Research output: Contribution to journal › Article › peer-review

Abstract

Commercial data center networks enable low-latency services, but switch and link failures threaten stability. Existing precise port-level failure identification methods are limited to specific topologies, while general ones localize only at the switch level. We present LinkSonar, a system based on semi-controllable probe paths that pinpoints failures at the port level. By leveraging tunneling and ERSPAN technologies, the entire network is decomposed into individual links for probing, thereby enabling balanced full-link coverage while minimizing dependence on topology structures. By modeling the probing process and introducing a failure inference algorithm, LinkSonar estimates link transmission success rates without requiring routing information, while mitigating cascading effects caused by concurrent faults and network noise. Comparative experiments with two state-of-the-art methods demonstrate that LinkSonar achieves superior localization accuracy under diverse load-balancing settings, while reducing manual troubleshooting efforts by more than 50%. Simulations across two representative topologies show that LinkSonar maintains F1-scores above 0.95 even under large-scale failure scenarios. Deployed for six months in a production data center with hundreds of multi-vendor switches, LinkSonar successfully detected over 200 failures, among which 13 correspond to silent packet drops or misconfigurations that are difficult to detect using Pingmesh-like approaches.
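As a rough illustration of what per-link success-rate estimation and an F1-score evaluation look like, the following minimal sketch aggregates probe counts per link, flags links whose delivery ratio falls below a threshold, and scores the result against a known set of failed links. It is not LinkSonar's inference algorithm, which the abstract does not specify. The function names, the link identifiers, the `(link_id, sent, received)` record shape, and the 0.99 threshold are all assumptions made for this example, and the sketch ignores the concurrent-fault and noise handling that the abstract describes.

```python
from collections import defaultdict


def estimate_link_success(probes):
    """Aggregate (link_id, sent, received) records into a per-link success rate.

    Hypothetical record shape: each record is one probing round on one link.
    """
    sent = defaultdict(int)
    received = defaultdict(int)
    for link, n_sent, n_received in probes:
        sent[link] += n_sent
        received[link] += n_received
    # Skip links with no probes sent to avoid division by zero.
    return {link: received[link] / sent[link] for link in sent if sent[link] > 0}


def flag_failed_links(rates, threshold=0.99):
    """Return links whose estimated success rate is below the assumed threshold."""
    return {link for link, rate in rates.items() if rate < threshold}


def f1_score(predicted, actual):
    """F1 = 2PR / (P + R) over sets of predicted and actual failed links."""
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


# Illustrative probe records (invented data, not from the paper).
probes = [
    ("sw1:p1-sw2:p3", 1000, 1000),
    ("sw1:p2-sw3:p1", 1000, 950),
    ("sw1:p2-sw3:p1", 1000, 940),
    ("sw2:p4-sw4:p1", 500, 499),
]
rates = estimate_link_success(probes)
flagged = flag_failed_links(rates)
score = f1_score(flagged, {"sw1:p2-sw3:p1"})
```

In this invented data set only the link "sw1:p2-sw3:p1" falls below the threshold (1890 of 2000 probes delivered), so the flagged set matches the assumed ground truth exactly. A fixed threshold is the simplest possible decision rule; a production system would need to account for probe paths that share links and for background loss.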

Original language: English
Journal: IEEE Transactions on Networking
DOI
Publication status: Accepted/In press - 2026
