
CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance

Shahzad, Faisal and Thies, Jonas and Kreutzer, Moritz and Zeiser, Thomas and Hager, Georg and Wellein, Gerhard (2018) CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance. IEEE Transactions on Parallel and Distributed Systems. DOI: 10.1109/TPDS.2018.2866794 ISBN 10459219 ISSN 1045-9219 (In Press)

Abstract

To use future generations of supercomputers efficiently, the High Performance Computing (HPC) community anticipates fault tolerance and power consumption as two of the prime challenges. Checkpoint/Restart (CR) has been and still is the most widely used technique to deal with hard failures. Application-level CR is the most efficient CR technique in terms of overhead, but it requires considerable implementation effort. This work presents the implementation of our C++ based library CRAFT (Checkpoint-Restart and Automatic Fault Tolerance), which serves two purposes. First, it provides an extendable library that significantly eases the implementation of application-level checkpointing. The most basic and frequently used checkpoint data types are already part of CRAFT and can be used directly out of the box. The library can be easily extended to add more data types. As a means of overhead reduction, the library offers a built-in asynchronous checkpointing mechanism and also supports the Scalable Checkpoint/Restart (SCR) library for node-level checkpointing. Second, CRAFT provides an easier interface for User-Level Failure Mitigation (ULFM) based dynamic process recovery, which significantly reduces the complexity and effort of the failure detection and communication recovery mechanisms. By using both functionalities together, applications can write application-level checkpoints and recover dynamically from process failures with very limited programming effort. This work presents the design and use of our library in detail. The associated overheads are thoroughly analyzed using several benchmarks.
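
To make the idea of application-level checkpointing concrete, the following is a minimal, generic C++ sketch. It does not use CRAFT and does not reflect CRAFT's actual interface; the names `State`, `write_checkpoint`, `read_checkpoint`, and `run` are hypothetical and chosen only for illustration. The application itself decides which data define its state, writes them periodically, and reloads them on restart.

```cpp
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical application state: an iteration counter and one accumulated value.
struct State {
    int iteration = 0;
    double sum = 0.0;
};

// Write the state to a temporary file, then rename it over the checkpoint.
// On POSIX systems the rename replaces the old file atomically, so a crash
// during the write does not leave a half-written checkpoint behind.
bool write_checkpoint(const std::string& path, const State& s) {
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) return false;
        out.write(reinterpret_cast<const char*>(&s), sizeof s);
        out.flush();
        if (!out) return false;
    }
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Load the state from the checkpoint; leave `s` untouched if none is usable.
bool read_checkpoint(const std::string& path, State& s) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    State loaded;
    in.read(reinterpret_cast<char*>(&loaded), sizeof loaded);
    if (in.gcount() != static_cast<std::streamsize>(sizeof loaded)) return false;
    s = loaded;
    return true;
}

// Iterate until `total` iterations are done or `stop_at` is reached (the
// latter simulates a failure), writing a checkpoint every `interval` steps.
State run(const std::string& path, int total, int interval, int stop_at) {
    State s;
    read_checkpoint(path, s);  // on a fresh start this fails and s stays zero
    while (s.iteration < total && s.iteration < stop_at) {
        s.sum += s.iteration;  // stand-in for the real computation
        ++s.iteration;
        if (s.iteration % interval == 0) write_checkpoint(path, s);
    }
    return s;
}
```

Calling `run` a second time with the same path resumes from the last checkpoint rather than from iteration zero, so only the work done since that checkpoint is repeated. A real application would additionally have to coordinate checkpoints across processes, which is the kind of effort the abstract says the library is meant to reduce.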

Item URL in elib: https://elib.dlr.de/114762/
Document Type: Article
Title: CRAFT: A library for easier application-level Checkpoint/Restart and Automatic Fault Tolerance
Authors:
Author | Institution or Email | ORCID iD
Shahzad, Faisal | Faisal.Shahzad (at) rrze.fau.de | UNSPECIFIED
Thies, Jonas | Jonas.Thies (at) dlr.de | UNSPECIFIED
Kreutzer, Moritz | Erlangen Regional Computing Center | UNSPECIFIED
Zeiser, Thomas | Erlangen Regional Computing Center | UNSPECIFIED
Hager, Georg | Erlangen Regional Computing Center | UNSPECIFIED
Wellein, Gerhard | Erlangen Regional Computing Center | UNSPECIFIED
Date: 2018
Journal or Publication Title: IEEE Transactions on Parallel and Distributed Systems
Refereed publication: Yes
Open Access: Yes
Gold Open Access: No
In SCOPUS: Yes
In ISI Web of Science: Yes
DOI: 10.1109/TPDS.2018.2866794
ISSN: 1045-9219
ISBN: 10459219
Status: In Press
Keywords: Application-level checkpoint/restart, automatic fault tolerance, User-Level Failure Mitigation (ULFM)
HGF - Research field: Aeronautics, Space and Transport
HGF - Program: Space
HGF - Program Themes: Space Technology
DLR - Research area: Space (Raumfahrt)
DLR - Program: R SY - Space Systems Technology (Technik für Raumfahrtsysteme)
DLR - Research theme (Project): R - SISTEC project (Vorhaben SISTEC)
Location: Köln-Porz
Institutes and Institutions: Institut of Simulation and Software Technology
Institut of Simulation and Software Technology > High Performance Computing
Deposited By: Thies, Jonas
Deposited On: 12 Jan 2018 11:19
Last Modified: 31 Jul 2019 20:12

Copyright © 2008-2017 German Aerospace Center (DLR). All rights reserved.