Algorithms for online fault tolerance server consolidation

Boyu Li, Bin Wu, Meng Shen, Hao Peng, Weisheng Li, Hong Zhang, Jie Gan, Zhihong Tian*, Guangquan Xu

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

We study a novel replication mechanism that ensures service continuity against multiple simultaneous server failures. In this mechanism, each item represents a computing task and is replicated on ξ+1 servers for some integer ξ≥1, with its workload specified by the amount of resources it requires. If one or more servers fail, the affected workloads can be redirected to other servers that host replicas of the same item, so that the service is not interrupted by the failure of up to ξ servers. Any feasible assignment algorithm must therefore reserve some capacity in each server to absorb the workload redirected from potentially failed servers without overloading, and determining how best to reserve that capacity becomes a key issue. Unlike existing algorithms, which assume that no two servers share replicas of more than one item, we first formulate capacity reservation for the general scenario, in which two servers may share replicas of any number of items. Due to the combinatorial nature of this problem, finding the optimal solution is difficult. To this end, we propose a Generalized and Simple Calculating Reserved Capacity (GSCRC) algorithm, whose time complexity depends only on the number of items packed in the server. In conjunction with GSCRC, we propose a robust replica packing algorithm with capacity optimization (RobustPack), which aims to minimize the number of servers hosting replicas while tolerating multiple server failures. Through theoretical analysis and experimental evaluations, we show that the RobustPack algorithm can achieve better performance.
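To make the capacity-reservation requirement concrete, the following is a minimal brute-force sketch, not the GSCRC algorithm itself, whose details are not given in the abstract. It assumes, purely for illustration, that an item's workload is shared evenly among its surviving replicas. The function names `worst_case_load` and `reserved_capacity` and the example data are hypothetical. The example deliberately lets two servers share replicas of two items, which is the general scenario the abstract refers to.

```python
from itertools import combinations


def worst_case_load(placement, loads, server, xi):
    """Largest load `server` can see when up to `xi` other servers fail.

    placement: dict mapping item -> set of servers hosting its replicas
    loads:     dict mapping item -> total workload of that item
    Assumption (illustrative only): an item's workload is split evenly
    among the replicas that are still alive.
    """
    all_servers = {s for hosts in placement.values() for s in hosts}
    others = sorted(all_servers - {server})
    worst = 0.0
    # Enumerate every failure set of size 0..xi among the other servers.
    for k in range(xi + 1):
        for failed in combinations(others, k):
            failed = set(failed)
            load = 0.0
            for item, hosts in placement.items():
                if server in hosts:
                    alive = hosts - failed  # always contains `server`
                    load += loads[item] / len(alive)
            worst = max(worst, load)
    return worst


def reserved_capacity(placement, loads, server, xi):
    """Extra capacity `server` must keep free beyond its failure-free load."""
    normal = worst_case_load(placement, loads, server, 0)
    return worst_case_load(placement, loads, server, xi) - normal


# Hypothetical instance with xi = 1, so every item has 2 replicas.
# Servers s1 and s2 share replicas of two items (a and c).
placement = {"a": {"s1", "s2"}, "b": {"s1", "s3"}, "c": {"s1", "s2"}}
loads = {"a": 4.0, "b": 2.0, "c": 6.0}

print(worst_case_load(placement, loads, "s1", 1))
print(reserved_capacity(placement, loads, "s1", 1))
```

In this instance the failure of s2 redirects the full workload of both a and c to s1, so the reservation for s1 is driven by the items it shares with s2 rather than by any single item. The enumeration grows combinatorially with the number of servers, which is the kind of cost that motivates a calculation depending only on the items packed in the server.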

Original language: English
Pages (from-to): 514-523
Number of pages: 10
Journal: Digital Communications and Networks
Volume: 11
Issue number: 2
Publication status: Published - Apr 2025
Externally published: Yes

Keywords

  • Cloud computing
  • Fault tolerance
  • Replica
  • Server consolidation
