We are a group of people from academia, industry, and government labs working to standardize the Checkpoint/Restart (C/R) APIs for supercomputing. Our goal is to make C/R tools usable on fast-changing HPC platforms with production workloads by working with HPC hardware and software vendors, C/R tool developers, application and library developers, HPC practitioners, and end users. We are developing a C/R interface standard and facilitating its adoption in the HPC community to help harness the full benefits of C/R, which go far beyond fault tolerance.
Kapil Arya, Azure Systems Research
Gene Cooperman, Northeastern University
Donglai Dai, X-ScaleSolutions Inc
Doug Fuller, Cornelis Networks
Rebecca Hartman-Baker, Berkeley Lab
Lena M. Lopatina, Los Alamos National Lab
Bogdan Nicolae, Argonne National Lab
Sarp Oral, Oak Ridge National Lab
Adrian Reber, Red Hat Inc
David Yat Sin, AMD, Inc
Andrey Vagin, Google Inc
Patrick Widener, Oak Ridge National Lab
Zhengji Zhao, Berkeley Lab
Tony Skjellum, University of Tennessee, Chattanooga
John Shalf, Lawrence Berkeley National Laboratory
Eric Roman, Lawrence Berkeley National Laboratory
Yves Robert, ENS Lyon
The C/R Standard Forum was formed in January 2022. Since then, the steering committee has been meeting biweekly to gather requirements for the C/R interface standard, and has collected input from 27 code teams so far through these meetings and the C/R Requirements Gathering Workshop held in July 2022. The committee is currently drafting the requirements documents from which the C/R interface specification will be extracted. The steering committee will release the first version of the C/R interface standard specification at SC22 (November 2022). A BoF session has been planned for this release.
The C/R Standard Forum is open to anyone interested. If you would like to participate in the C/R standardization effort, please contact Zhengji Zhao (ZZhao@lbl.gov) or any member of the steering committee.