TY - GEN
T1 - A fault tolerant self-scheduling scheme for parallel loops on shared memory systems
AU - Wang, Yizhuo
AU - Nicolau, Alexandru
AU - Cammarota, Rosario
AU - Veidenbaum, Alexander V.
PY - 2012
Y1 - 2012
N2 - As the number of cores per chip increases, significant speedup for many applications could be achieved by exploiting loop-level parallelism (LLP). Meanwhile, continued device scaling makes multicore/multiprocessor systems suffer from increased reliability problems. The scheduling scheme plays a key role in exploiting LLP. Among existing dynamic loop scheduling schemes, self-scheduling is the most commonly used. This paper presents FTSS, a fault-tolerant self-scheduling scheme that aims to execute parallel loops efficiently in the presence of hardware faults on shared memory systems. Our technique transforms a loop to ensure the correctness of the re-execution of loop iterations by buffering variables with anti-dependences, which makes it possible to design a fault-tolerant loop scheduling scheme without checkpointing. FTSS combines work-stealing with self-scheduling, and uses a bidirectional execution model when work is stolen from a faulty core. Experimental results show that FTSS achieves better load balancing than existing self-scheduling schemes. Compared with checkpoint/restart implementations that save a checkpoint before executing each chunk of iterations and restart the whole chunk running on a faulty core, FTSS exhibits better runtime performance. In addition, FTSS greatly outperforms existing self-scheduling schemes in terms of performance and stability in heavily loaded runtime environments.
AB - As the number of cores per chip increases, significant speedup for many applications could be achieved by exploiting loop-level parallelism (LLP). Meanwhile, continued device scaling makes multicore/multiprocessor systems suffer from increased reliability problems. The scheduling scheme plays a key role in exploiting LLP. Among existing dynamic loop scheduling schemes, self-scheduling is the most commonly used. This paper presents FTSS, a fault-tolerant self-scheduling scheme that aims to execute parallel loops efficiently in the presence of hardware faults on shared memory systems. Our technique transforms a loop to ensure the correctness of the re-execution of loop iterations by buffering variables with anti-dependences, which makes it possible to design a fault-tolerant loop scheduling scheme without checkpointing. FTSS combines work-stealing with self-scheduling, and uses a bidirectional execution model when work is stolen from a faulty core. Experimental results show that FTSS achieves better load balancing than existing self-scheduling schemes. Compared with checkpoint/restart implementations that save a checkpoint before executing each chunk of iterations and restart the whole chunk running on a faulty core, FTSS exhibits better runtime performance. In addition, FTSS greatly outperforms existing self-scheduling schemes in terms of performance and stability in heavily loaded runtime environments.
KW - fault tolerance
KW - loop scheduling
KW - multicore processors
KW - self-scheduling
UR - http://www.scopus.com/inward/record.url?scp=84880303303&partnerID=8YFLogxK
U2 - 10.1109/HiPC.2012.6507476
DO - 10.1109/HiPC.2012.6507476
M3 - Conference contribution
AN - SCOPUS:84880303303
SN - 9781467323703
T3 - 2012 19th International Conference on High Performance Computing, HiPC 2012
BT - 2012 19th International Conference on High Performance Computing, HiPC 2012
T2 - 2012 19th International Conference on High Performance Computing, HiPC 2012
Y2 - 18 December 2012 through 21 December 2012
ER -