Fault tolerant scheduling for parallel loops on shared memory systems

Yizhuo Wang; Rosario Cammarota; Alexandru Nicolau

School of Computer Science and Technology

Research output: Contribution to journal › Article › peer-review

Abstract

Multicore and multiprocessor systems achieve significant speedup for many applications by exploiting loop-level parallelism, but they also suffer from growing reliability problems as device sizes continue to scale down. This paper addresses the reliability of loop-dominated applications, aiming to execute parallel loops efficiently in the presence of various types of hardware faults. We present a fault-tolerant work-stealing scheme that makes parallel loop execution resilient to such faults. The scheme applies a lightweight buffer-commit mechanism to ensure that re-executed loop iterations produce correct results. In addition, large failing chunks of loop iterations are split at runtime to improve load balancing, and a worker thread is discarded when faults occur on it frequently. We evaluated our techniques on a multi-socket multicore system using a set of loop-dominated benchmarks. The proposed scheme achieves minimal overhead for supporting fault tolerance, together with optimal load balancing.
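The three mechanisms named in the abstract (buffer-commit, splitting of failing chunks, and discarding a frequently failing worker) can be illustrated with a small sketch. This is not the authors' implementation: it is a sequential, round-robin simulation of work-stealing, and every name in it (`run_loop`, `HardwareFault`, `fault_at`, `min_split`, `max_faults`) is a hypothetical choice made for illustration. Fault detection is assumed to exist and is modeled by an injected exception.

```python
from collections import deque


class HardwareFault(Exception):
    """Stands in for a detected hardware fault (hypothetical detector)."""


def run_loop(n, body, num_workers=4, chunk=8, min_split=2, max_faults=3,
             fault_at=lambda worker, i, attempt: False):
    """Run body(i) for i in range(n); return (results, alive flags).

    fault_at(worker, i, attempt) is a hypothetical fault injector used
    only to exercise the recovery path.
    """
    out = [None] * n                          # committed, shared output
    queues = [deque() for _ in range(num_workers)]
    for k, lo in enumerate(range(0, n, chunk)):
        queues[k % num_workers].append((lo, min(lo + chunk, n)))
    faults = [0] * num_workers
    alive = [True] * num_workers
    attempts = {}                             # executions tried per iteration

    def next_chunk(w):
        if queues[w]:
            return queues[w].pop()            # own work: take from the tail
        for v in range(num_workers):          # else steal from a victim's head
            if v != w and queues[v]:
                return queues[v].popleft()
        return None

    while any(queues):
        progressed = False
        for w in range(num_workers):
            if not alive[w]:
                continue
            item = next_chunk(w)
            if item is None:
                continue
            progressed = True
            lo, hi = item
            buf = {}                          # private buffer for this chunk
            try:
                for i in range(lo, hi):
                    attempts[i] = attempts.get(i, 0) + 1
                    if fault_at(w, i, attempts[i]):
                        raise HardwareFault(i)
                    buf[i] = body(i)
            except HardwareFault:
                # Nothing was committed, so the chunk can be re-executed safely.
                faults[w] += 1
                if hi - lo > min_split:       # split a large failing chunk
                    mid = (lo + hi) // 2
                    queues[w].append((lo, mid))
                    queues[w].append((mid, hi))
                else:
                    queues[w].append((lo, hi))
                if faults[w] >= max_faults:   # discard a frequently failing worker
                    alive[w] = False
                continue
            for i, value in buf.items():      # commit only after a clean run
                out[i] = value
        if not progressed:
            raise RuntimeError("all workers were discarded")
    return out, alive
```

For example, `run_loop(32, lambda i: i * i, num_workers=2, fault_at=lambda w, i, a: w == 0)` models a worker 0 that faults on every chunk it touches: its chunks are split and requeued, it is discarded after `max_faults` faults, and worker 1 steals and completes the remaining iterations. A real implementation would run workers concurrently and would need synchronized queues and a real fault detector, both of which this sketch leaves out.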

Original language	English
Pages (from-to)	1937-1959
Number of pages	23
Journal	Journal of Information Science and Engineering
Volume	31
Issue number	6
Publication status	Published - Nov 2015

Keywords

Fault tolerance
Loop scheduling
Multicore and multiprocessor
Self-scheduling
Work-stealing

Cite this

Wang, Y., Cammarota, R., & Nicolau, A. (2015). Fault tolerant scheduling for parallel loops on shared memory systems. Journal of Information Science and Engineering, 31(6), 1937-1959.

@article{850d28d8ce694c5d992a2126fdaa9607,
  title = "Fault tolerant scheduling for parallel loops on shared memory systems",
  abstract = "While multicore/multiprocessor systems achieve significant speedup for many applications by exploiting loop level parallelism, they also suffer from increased reliability problems as a result of ever scaling device size. This paper addresses the reliability of loop dominated applications, aiming to execute parallel loops efficiently in the presence of various types of hardware faults. In this paper, we present a fault tolerant work-stealing scheme which makes parallel loop execution resilient to hardware faults. A lightweight buffer-commit mechanism is applied in the proposed scheme to ensure the correctness of the re-execution of loop iterations. In addition, we split large failing chunks of loop iterations at runtime to improve load balancing, and a worker thread is discarded when faults occur frequently on it. We evaluated our techniques on a multi-socket multicore system, using a set of loop dominated benchmarks. The proposed scheme achieves the minimum overhead of supporting fault tolerance and optimal load balancing.",
  keywords = "Fault tolerance, Loop scheduling, Multicore and multiprocessor, Self-scheduling, Work-stealing",
  author = "Yizhuo Wang and Rosario Cammarota and Alexandru Nicolau",
  year = "2015",
  month = nov,
  language = "English",
  volume = "31",
  pages = "1937--1959",
  journal = "Journal of Information Science and Engineering",
  issn = "1016-2364",
  publisher = "Institute of Information Science",
  number = "6",
}

TY - JOUR
T1 - Fault tolerant scheduling for parallel loops on shared memory systems
AU - Wang, Yizhuo
AU - Cammarota, Rosario
AU - Nicolau, Alexandru
PY - 2015/11
Y1 - 2015/11
N2 - While multicore/multiprocessor systems achieve significant speedup for many applications by exploiting loop level parallelism, they also suffer from increased reliability problems as a result of ever scaling device size. This paper addresses the reliability of loop dominated applications, aiming to execute parallel loops efficiently in the presence of various types of hardware faults. In this paper, we present a fault tolerant work-stealing scheme which makes parallel loop execution resilient to hardware faults. A lightweight buffer-commit mechanism is applied in the proposed scheme to ensure the correctness of the re-execution of loop iterations. In addition, we split large failing chunks of loop iterations at runtime to improve load balancing, and a worker thread is discarded when faults occur frequently on it. We evaluated our techniques on a multi-socket multicore system, using a set of loop dominated benchmarks. The proposed scheme achieves the minimum overhead of supporting fault tolerance and optimal load balancing.
KW - Fault tolerance
KW - Loop scheduling
KW - Multicore and multiprocessor
KW - Self-scheduling
KW - Work-stealing
UR - http://www.scopus.com/inward/record.url?scp=84947427443&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:84947427443
SN - 1016-2364
VL - 31
SP - 1937
EP - 1959
JO - Journal of Information Science and Engineering
JF - Journal of Information Science and Engineering
IS - 6
ER -
