A fault tolerant self-scheduling scheme for parallel loops on shared memory systems

Yizhuo Wang, Alexandru Nicolau, Rosario Cammarota, Alexander V. Veidenbaum

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

6 Citations (Scopus)

Abstract

As the number of cores per chip increases, many applications can achieve significant speedup by exploiting loop-level parallelism (LLP). Meanwhile, continued device scaling makes multicore and multiprocessor systems increasingly prone to reliability problems. The scheduling scheme plays a key role in exploiting LLP, and self-scheduling is the most commonly used of the existing dynamic loop scheduling schemes. This paper presents FTSS, a fault-tolerant self-scheduling scheme that aims to execute parallel loops efficiently in the presence of hardware faults on shared memory systems. Our technique transforms a loop so that loop iterations can be re-executed correctly: it buffers variables with anti-dependences, which makes it possible to design a fault-tolerant loop scheduling scheme without checkpointing. FTSS combines work-stealing with self-scheduling and uses a bidirectional execution model when work is stolen from a faulty core. Experimental results show that FTSS achieves better load balancing than existing self-scheduling schemes. Compared with checkpoint/restart implementations that save a checkpoint before executing each chunk of iterations and restart the whole chunk that was running on a faulty core, FTSS exhibits better runtime performance. In addition, FTSS greatly outperforms existing self-scheduling schemes in performance and stability in a heavily loaded runtime environment.
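The role of buffering anti-dependent variables can be illustrated with a minimal sketch. It is not the authors' implementation: the loop body `a[i] = a[i + 1] + 1`, the function names `run_chunk` and `execute_with_retry`, the `fail_at` fault simulation, and the choice to buffer the whole array rather than only the anti-dependent values are all assumptions made for illustration. The sketch shows only why re-executing a partially completed chunk is unsafe when iterations read values that later iterations overwrite, and why reading from a buffered copy makes re-execution safe.

```python
def run_chunk(dst, src, lo, hi, fail_at=None):
    """Execute iterations lo..hi-1 of the loop body dst[i] = src[i + 1] + 1.

    Raises RuntimeError just before iteration `fail_at` to simulate a
    hardware fault (hypothetical fault model, for illustration only).
    """
    for i in range(lo, hi):
        if i == fail_at:
            raise RuntimeError("simulated core fault")
        dst[i] = src[i + 1] + 1


def execute_with_retry(dst, src, lo, hi, fail_at):
    """Run a chunk; on a simulated fault, re-execute the chunk from its start."""
    try:
        run_chunk(dst, src, lo, hi, fail_at)
    except RuntimeError:
        run_chunk(dst, src, lo, hi)


data = [10, 20, 30, 40, 50]

# Fault-free reference result of the sequential loop over iterations 0..3.
expected = list(data)
run_chunk(expected, expected, 0, 4)

# In-place loop: iteration i reads a[i + 1], which iteration i + 1 later
# overwrites (an anti-dependence). Re-running iterations that already
# completed therefore reads clobbered values.
naive = list(data)
execute_with_retry(naive, naive, 0, 4, fail_at=2)

# Transformed loop: reads go to a buffered copy that is never written,
# so every iteration can be re-executed any number of times.
buffered = list(data)
snapshot = list(data)  # assumption: buffer everything, not just the needed values
execute_with_retry(buffered, snapshot, 0, 4, fail_at=2)

print("expected:", expected)
print("naive   :", naive)
print("buffered:", buffered)
```

In this sketch the naive re-execution diverges from the fault-free result, while the buffered version matches it. A real transformation would presumably buffer only the variables involved in anti-dependences to limit memory overhead.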

Original language: English
Title of host publication: 2012 19th International Conference on High Performance Computing, HiPC 2012
Publication status: Published - 2012
Event: 2012 19th International Conference on High Performance Computing, HiPC 2012 - Pune, India
Duration: 18 Dec 2012 → 21 Dec 2012

Publication series

Name: 2012 19th International Conference on High Performance Computing, HiPC 2012

Conference

Conference: 2012 19th International Conference on High Performance Computing, HiPC 2012
Country/Territory: India
City: Pune
Period: 18/12/12 → 21/12/12

Keywords

  • fault tolerance
  • loop scheduling
  • multicore processors
  • self-scheduling
