arXiv Open Access 2021

Checkpointing and Localized Recovery for Nested Fork-Join Programs

Claudia Fohry

Abstract

While checkpointing is typically combined with a restart of the whole application, localized recovery permits all but the affected processes to continue. In task-based cluster programming, for instance, the application can then finish on the intact nodes, and the lost tasks can be reassigned. This extended abstract proposes adapting a checkpointing and localized recovery technique, originally developed for independent tasks, to nested fork-join programs. We consider a Cilk-like work-stealing scheme with a work-first policy in a distributed-memory setting and describe the required algorithmic changes. The original technique has checkpointing overheads below 1% and negligible recovery costs; we expect the new algorithm to achieve similar performance.

Authors (1)

Claudia Fohry

Citation Format

Fohry, C. (2021). Checkpointing and Localized Recovery for Nested Fork-Join Programs. https://arxiv.org/abs/2102.12941

Journal Information
Publication Year
2021
Language
en
Database Source
arXiv
Access
Open Access ✓