Check-pointing

Check-pointing means saving the state of your work periodically, so that it can be resumed later.

Need for Check-pointing

Imagine your code is doing some heavy computations and has been running for hours without saving and suddenly, the computer crashes or shuts down. You are likely to lose your hours-long computational work and would need to start all over again. But if you keep saving your work every ten minutes, if something like above example happens, you will only lose maximum of ten minutes of your computaion.

Check-pointing your code is always a good practice, regardless you are running it on an HPC or not. On HPC systems, check-pointing becomes even more important. There could be several reasons a job could get aborted:

  • Job exceeds the time limit
  • Job exceeds the allocated memory
  • Preemption by other jobs
  • Node failure (rare, but possible)

In addition to the above job failure reasons, check-pointing is also good for debugging and monitoring your job progress.

How to check-point

You can make your code checkpoint-able by saving its state periodically, and looking for a state file when it starts. Here are some general steps you need to follow:

  1. Look for a state file (which contains a previously saved state).
  2. If such a state file is found, restore the state, otherwise start from the beginning.
  3. Periodically save the state (this could include the intermediate results or the variable values, etc.).

The way you would implement these would vary based on your code.