Abstract
The task-based parallel programming model has become the mainstream approach to improving the performance of parallel computer systems by exploiting task parallelism. This paper presents a novel task-based parallel programming model that supports hardware fault tolerance. The model incorporates fault tolerance mechanisms into task-based parallel programming, aiming to improve both system performance and reliability. It uses the task as the basic unit of scheduling, execution, fault detection, and recovery, and supports fault tolerance at the application level. A buffer-commit computation model handles transient faults, and an application-level diskless checkpointing technique handles permanent faults. A work-stealing scheduling scheme that supports fault tolerance provides dynamic load balancing. Experimental results show that the proposed model provides hardware fault tolerance with low performance overhead.
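The abstract does not spell out how the buffer-commit computation model works, so the following is only a minimal illustrative sketch of the general idea, not the paper's implementation. It assumes that a task writes its results to a private buffer, that the buffer is committed to shared state only when no fault was detected, and that a faulty execution is discarded and re-run. The names `TransientFault`, `run_with_buffer_commit`, and `make_flaky_task` are hypothetical, and fault detection is simulated by raising an exception.

```python
from typing import Callable, Dict


class TransientFault(Exception):
    """Hypothetical signal that a transient fault was detected during a task."""


Task = Callable[[Dict[str, int], Dict[str, int]], None]


def run_with_buffer_commit(task: Task, shared: Dict[str, int],
                           max_retries: int = 3) -> int:
    """Run `task` so that its writes reach `shared` only on a fault-free run.

    Returns the number of re-executions that were needed.
    """
    for attempt in range(max_retries + 1):
        buffer: Dict[str, int] = {}   # private write buffer for this attempt
        try:
            task(shared, buffer)      # task reads `shared`, writes to `buffer`
        except TransientFault:
            continue                  # discard the buffer and re-execute
        shared.update(buffer)         # commit: publish buffered writes
        return attempt
    raise RuntimeError("task did not complete within the retry limit")


def make_flaky_task(failures: int) -> Task:
    """Build a task that reports a simulated fault on its first `failures` runs."""
    remaining = {"n": failures}

    def task(shared: Dict[str, int], buffer: Dict[str, int]) -> None:
        buffer["sum"] = shared["a"] + shared["b"]
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise TransientFault()    # simulated detection, after the write

    return task


if __name__ == "__main__":
    state = {"a": 1, "b": 2}
    retries = run_with_buffer_commit(make_flaky_task(2), state)
    print(retries, state)
```

Because the task never writes to shared state directly, a faulty attempt leaves that state untouched, which is what makes re-execution at task granularity safe in this sketch.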
Original language | English |
---|---|
Pages (from-to) | 1789-1804 |
Number of pages | 16 |
Journal | Ruan Jian Xue Bao/Journal of Software |
Volume | 27 |
Issue number | 7 |
Publication status | Published - 1 Jul 2016 |
Keywords
- Fault tolerance
- Load balancing
- Parallel programming
- Task parallelism
- Work-stealing scheduling