TY - GEN
T1 - Fast outage analysis of large-scale production clouds with service correlation mining
AU - Wang, Yaohui
AU - Li, Guozheng
AU - Wang, Zijian
AU - Kang, Yu
AU - Zhou, Yangfan
AU - Zhang, Hongyu
AU - Gao, Feng
AU - Sun, Jeffrey
AU - Yang, Li
AU - Lee, Pochian
AU - Xu, Zhangwei
AU - Zhao, Pu
AU - Qiao, Bo
AU - Li, Liqun
AU - Zhang, Xu
AU - Lin, Qingwei
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021/5
Y1 - 2021/5
N2 - Cloud-based services are surging into popularity in recent years. However, outages, i.e., severe incidents that always impact multiple services, can dramatically affect user experience and incur severe economic losses. Locating the root-cause service, i.e., the service that contains the root cause of the outage, is a crucial step to mitigate the impact of the outage. In current industrial practice, this is generally performed in a bootstrap manner and largely depends on human efforts: the service that directly causes the outage is identified first, and the suspected root cause is traced back manually from service to service during diagnosis until the actual root cause is found. Unfortunately, production cloud systems typically contain a large number of interdependent services. Such a manual root cause analysis is often time-consuming and labor-intensive. In this work, we propose COT, the first outage triage approach that considers the global view of service correlations. COT mines the correlations among services from outage diagnosis data. After learning from historical outages, COT can infer the root cause of emerging ones accurately. We implement COT and evaluate it on a real-world dataset containing one year of data collected from Microsoft Azure, one of the representative cloud computing platforms in the world. Our experimental results show that COT can reach a triage accuracy of 82.1%-83.5%, which outperforms the state-of-the-art triage approach by 28.0%-29.7%.
AB - Cloud-based services are surging into popularity in recent years. However, outages, i.e., severe incidents that always impact multiple services, can dramatically affect user experience and incur severe economic losses. Locating the root-cause service, i.e., the service that contains the root cause of the outage, is a crucial step to mitigate the impact of the outage. In current industrial practice, this is generally performed in a bootstrap manner and largely depends on human efforts: the service that directly causes the outage is identified first, and the suspected root cause is traced back manually from service to service during diagnosis until the actual root cause is found. Unfortunately, production cloud systems typically contain a large number of interdependent services. Such a manual root cause analysis is often time-consuming and labor-intensive. In this work, we propose COT, the first outage triage approach that considers the global view of service correlations. COT mines the correlations among services from outage diagnosis data. After learning from historical outages, COT can infer the root cause of emerging ones accurately. We implement COT and evaluate it on a real-world dataset containing one year of data collected from Microsoft Azure, one of the representative cloud computing platforms in the world. Our experimental results show that COT can reach a triage accuracy of 82.1%-83.5%, which outperforms the state-of-the-art triage approach by 28.0%-29.7%.
KW - Cloud computing
KW - Machine learning
KW - Outage triage
KW - Root cause analysis
UR - http://www.scopus.com/inward/record.url?scp=85113440844&partnerID=8YFLogxK
U2 - 10.1109/ICSE43902.2021.00085
DO - 10.1109/ICSE43902.2021.00085
M3 - Conference contribution
AN - SCOPUS:85113440844
T3 - Proceedings - International Conference on Software Engineering
SP - 885
EP - 896
BT - Proceedings - 2021 IEEE/ACM 43rd International Conference on Software Engineering, ICSE 2021
PB - IEEE Computer Society
T2 - 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021
Y2 - 22 May 2021 through 30 May 2021
ER -