A work-stealing scheduling framework supporting fault tolerance

Yizhuo Wang, Weixing Ji, Feng Shi, Qi Zuo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

8 Citations (Scopus)

Abstract

Fault tolerance and load balancing are critical for executing long-running parallel applications on multicore clusters. This paper addresses both by presenting a novel work-stealing task scheduling framework that supports hardware fault tolerance. In this framework, both transient and permanent faults are detected and recovered from at task granularity. We incorporate task-based fault detection and recovery mechanisms into a hierarchical work-stealing scheme to establish the framework. This framework provides low-overhead fault tolerance and optimal load balancing by fully exploiting task parallelism.
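
The general idea of pairing work stealing with task-level recovery can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not describe the detection mechanism or the hierarchy, so everything here is an assumption made for illustration. The scheduler is flat rather than hierarchical, it runs as a single-threaded round-robin simulation, a transient fault is modelled by a hypothetical `TransientFault` exception and handled by re-executing the task, and a permanent fault is modelled by a hypothetical `alive` flag on a worker whose queued tasks are then stolen by the remaining workers.

```python
from collections import deque


class TransientFault(Exception):
    """Hypothetical signal that a task's result cannot be trusted."""


class Worker:
    def __init__(self, wid):
        self.wid = wid
        self.alive = True      # set to False to model a permanent fault
        self.tasks = deque()   # owner works at the right end, thieves at the left


def run(workers, max_retries=3):
    """Single-threaded, round-robin simulation; returns {task: (value, worker id)}."""
    results = {}
    while any(w.tasks for w in workers):
        live = [w for w in workers if w.alive]
        if not live:
            raise RuntimeError("no live worker left to run the remaining tasks")
        for w in live:
            if w.tasks:
                name, fn = w.tasks.pop()            # own queue: newest task first
            else:
                victim = max(workers, key=lambda v: len(v.tasks))
                if not victim.tasks:
                    continue                        # nothing left to steal
                name, fn = victim.tasks.popleft()   # steal the victim's oldest task
            for _ in range(max_retries):
                try:
                    results[name] = (fn(), w.wid)
                    break
                except TransientFault:
                    continue                        # recover by re-executing the task
            else:
                raise RuntimeError(f"task {name!r} failed {max_retries} times")
    return results


def make_flaky(value, failures):
    """Build a task that raises TransientFault `failures` times, then succeeds."""
    state = {"left": failures}

    def task():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientFault()
        return value

    return task


def demo():
    w0, w1, w2 = Worker(0), Worker(1), Worker(2)
    w0.tasks.extend([("a", lambda: 1), ("b", make_flaky(2, 1)), ("c", lambda: 3)])
    w2.tasks.append(("d", lambda: 4))
    w2.alive = False   # permanent fault: the task queued on w2 must be stolen
    return run([w0, w1, w2])
```

Calling `demo()` runs four tasks on two live workers: the idle worker steals from the longest queue, the flaky task is retried after its simulated transient fault, and the task stranded on the failed worker is picked up by a thief. Recovery at task granularity means only the affected task is re-executed or moved, not the whole application.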

Original language: English
Title of host publication: Proceedings - Design, Automation and Test in Europe, DATE 2013
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 695-700
Number of pages: 6
ISBN (Print): 9783981537000
Publication status: Published - 2013
Event: 16th Design, Automation and Test in Europe Conference and Exhibition, DATE 2013 - Grenoble, France
Duration: 18 Mar 2013 to 22 Mar 2013

Publication series

Name: Proceedings - Design, Automation and Test in Europe, DATE
ISSN (Print): 1530-1591

Conference

Conference: 16th Design, Automation and Test in Europe Conference and Exhibition, DATE 2013
Country/Territory: France
City: Grenoble
Period: 18/03/13 to 22/03/13

Keywords

  • Cluster
  • Fault tolerance
  • Multicore
  • Work-stealing
