TY - JOUR
T1 - Algorithms for online fault tolerance server consolidation
AU - Li, Boyu
AU - Wu, Bin
AU - Shen, Meng
AU - Peng, Hao
AU - Li, Weisheng
AU - Zhang, Hong
AU - Gan, Jie
AU - Tian, Zhihong
AU - Xu, Guangquan
N1 - Publisher Copyright:
© 2024 Chongqing University of Posts and Telecommunications
PY - 2025/4
Y1 - 2025/4
N2 - We study a novel replication mechanism to ensure service continuity against multiple simultaneous server failures. In this mechanism, each item represents a computing task and is replicated into ξ+1 servers for some integer ξ≥1, with workloads specified by the amount of required resources. If one or more servers fail, the affected workloads can be redirected to other servers that host replicas associated with the same item, such that the service is not interrupted by the failure of up to ξ servers. This requires that any feasible assignment algorithm must reserve some capacity in each server to accommodate the workload redirected from potential failed servers without overloading, and determining the optimal method for reserving capacity becomes a key issue. Unlike existing algorithms that assume that no two servers share replicas of more than one item, we first formulate capacity reservation for a general arbitrary scenario. Due to the combinatorial nature of this problem, finding the optimal solution is difficult. To this end, we propose a Generalized and Simple Calculating Reserved Capacity (GSCRC) algorithm, with a time complexity only related to the number of items packed in the server. In conjunction with GSCRC, we propose a robust replica packing algorithm with capacity optimization (RobustPack), which aims to minimize the number of servers hosting replicas and tolerate multiple server failures. Through theoretical analysis and experimental evaluations, we show that the RobustPack algorithm can achieve better performance.
AB - We study a novel replication mechanism to ensure service continuity against multiple simultaneous server failures. In this mechanism, each item represents a computing task and is replicated into ξ+1 servers for some integer ξ≥1, with workloads specified by the amount of required resources. If one or more servers fail, the affected workloads can be redirected to other servers that host replicas associated with the same item, such that the service is not interrupted by the failure of up to ξ servers. This requires that any feasible assignment algorithm must reserve some capacity in each server to accommodate the workload redirected from potential failed servers without overloading, and determining the optimal method for reserving capacity becomes a key issue. Unlike existing algorithms that assume that no two servers share replicas of more than one item, we first formulate capacity reservation for a general arbitrary scenario. Due to the combinatorial nature of this problem, finding the optimal solution is difficult. To this end, we propose a Generalized and Simple Calculating Reserved Capacity (GSCRC) algorithm, with a time complexity only related to the number of items packed in the server. In conjunction with GSCRC, we propose a robust replica packing algorithm with capacity optimization (RobustPack), which aims to minimize the number of servers hosting replicas and tolerate multiple server failures. Through theoretical analysis and experimental evaluations, we show that the RobustPack algorithm can achieve better performance.
KW - Cloud computing
KW - Fault tolerance
KW - Replica
KW - Server consolidation
UR - http://www.scopus.com/inward/record.url?scp=105003270418&partnerID=8YFLogxK
U2 - 10.1016/j.dcan.2024.06.007
DO - 10.1016/j.dcan.2024.06.007
M3 - Article
AN - SCOPUS:105003270418
SN - 2468-5925
VL - 11
SP - 514
EP - 523
JO - Digital Communications and Networks
JF - Digital Communications and Networks
IS - 2
ER -