Abstract

  • Workflow systems on Kubernetes (such as Argo Workflows and Kubeflow Pipelines) provide step-level checkpointing by saving input/output artifacts at each step, but this mainly covers data transfer between pods.
  • The “intermediate state” of long-running tasks inside a pod is not preserved without separate checkpointing, leading to inefficiency as the entire task must be repeated from the beginning if the pod fails or restarts.
  • A checkpointing strategy for the intermediate state inside pods using external storage like Amazon S3 is necessary.
  • This article explains how to resume work from the last saved point in case of pod failure using S3-based checkpointing, and demonstrates the benefits in terms of time, resource savings, and increased success rate in iterative experiments such as Genetic Algorithms (GA).


The Need for Checkpoints in Kubernetes Jobs



When performing long-running tasks in a Kubernetes environment (e.g., large-scale data processing, machine learning training, high-performance geometric simulations), it is common for pods to fail or restart due to spot instance interruptions, out-of-memory kills, and other causes.

Kubernetes-based workflow systems (e.g., Argo Workflows, Kubeflow Pipelines, etc.) already provide robust support for saving the results of each step and passing them to the next step via input/output artifacts. However, this artifact-based storage mainly focuses on data transfer between workflow steps (i.e., pod units).

In other words, the “intermediate state” of long-running tasks inside a pod (e.g., in-memory iterative calculations, experiment progress, etc.) is not preserved without additional measures. If a pod fails or restarts in the middle, the ongoing work inside that pod must start over from the beginning.

Left unaddressed, the task restarts from scratch, wasting significant time and resources. Checkpointing is especially useful in retry situations: if a task fails or is unexpectedly interrupted, a checkpoint lets you recover quickly from the last saved point instead of repeating the entire process from the beginning.

Therefore, fine-grained checkpointing inside the pod needs to be handled separately. This article covers how to save and restore checkpoints inside a pod, and the effect of doing so.




Saving Checkpoints Using S3


There are several ways to save checkpoint data, but in cloud environments, using object storage such as Amazon S3 is one of the simplest and most scalable options.

  • Save intermediate results to S3 at certain stages (e.g., epoch, generation, etc.).
  • When the pod restarts, load the most recent checkpoint from S3 and resume work from there (a minimal sketch follows below).
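
Below is a minimal sketch of this pattern, assuming a hypothetical bucket name and one object per saved stage under a common key prefix (the GA example later in this article uses a single fixed key that is overwritten instead):

import boto3
import pickle

s3 = boto3.client('s3')
bucket = 'your-s3-bucket'       # hypothetical bucket name
prefix = 'checkpoints/my-job/'  # hypothetical key prefix, one object per stage

def save_stage(stage, state):
    # e.g., checkpoints/my-job/stage-000042.pkl; zero-padding keeps keys sortable
    key = f'{prefix}stage-{stage:06d}.pkl'
    s3.put_object(Bucket=bucket, Key=key, Body=pickle.dumps(state))

def load_latest_stage():
    # List the saved stages and pick the most recent one
    # (fine for a modest number of checkpoints; listing returns up to 1000 keys per call)
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
    keys = sorted(obj['Key'] for obj in resp.get('Contents', []))
    if not keys:
        return None  # no checkpoint yet: start from scratch
    obj = s3.get_object(Bucket=bucket, Key=keys[-1])
    return pickle.loads(obj['Body'].read())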

This approach has the following advantages:

  • Regardless of which node the pod runs on, you can always access the same checkpoint via S3
  • Ensures consistent results even after multiple restarts or retry situations
  • Improves reliability and efficiency of the job



Using Checkpoints in GA (Genetic Algorithm)


Especially for tasks like Genetic Algorithms (GA), which involve repeated calculations over many generations, saving a checkpoint at the end of each generation is very effective.

Even in situations where retry is needed (e.g., pod failure, network issues, etc.), if you have a checkpoint, you can immediately resume the experiment from the last saved generation. This greatly reduces wasted time and resources.


Information That Must Be Included in a GA Checkpoint

  • Current generation number (generation)
  • population (list or array of parameter sets to be passed to the next generation)
  • best solution (the best parameter set found so far)
  • best fitness (fitness value of the best solution)
  • (Optional) random seed, environment information, etc. for experiment reproducibility

All this information must be saved so that, even if the pod dies in the middle, the experiment can be resumed from exactly the same state.


Example Code (Python)

Below is example code for saving and loading GA checkpoints in S3; it includes all of the essential information listed above.

import boto3
import pickle

s3 = boto3.client('s3')
bucket = 'your-s3-bucket'
checkpoint_key = 'checkpoints/ga_generation.pkl'

def save_checkpoint(data, key=checkpoint_key):
    s3.put_object(Bucket=bucket, Key=key, Body=pickle.dumps(data))

def load_checkpoint(key=checkpoint_key):
    try:
        obj = s3.get_object(Bucket=bucket, Key=key)
        return pickle.loads(obj['Body'].read())
    except s3.exceptions.NoSuchKey:
        return None

# Usage example
MAX_GENERATION = 50  # total number of generations to run (used by the loop below)
generation_to_run = 0  # generation number to run this time
population = None
best_solution = None
best_fitness = None
random_seed = 42  # Example: save seed for reproducibility

# Load checkpoint
checkpoint = load_checkpoint()
if checkpoint:
    # Start from the next generation after the last successful one
    generation_to_run = checkpoint['generation'] + 1
    population = checkpoint['population']
    best_solution = checkpoint['best_solution']
    best_fitness = checkpoint['best_fitness']
    random_seed = checkpoint.get('random_seed', 42)

while generation_to_run < MAX_GENERATION:
    # ... GA operations for the current generation_to_run ...
    # Update population, best_solution, best_fitness

    # Save checkpoint after successfully completing the current generation
    save_checkpoint({
        'generation': generation_to_run, # Successfully completed generation number
        'population': population,  # Parameter set for the next generation
        'best_solution': best_solution,
        'best_fitness': best_fitness,
        'random_seed': random_seed
    })
    generation_to_run += 1


Note: For complete reproducibility, it is recommended to also save the random seed and environment information. (If the job always runs from the same Docker image, the environment information is usually unnecessary.) Add these to the checkpoint as needed.
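
As a minimal sketch of how the stored seed might be used on resume (the variable random_seed stands in for the value loaded from the checkpoint above):

import random

random_seed = 42  # in practice, the value loaded from the checkpoint above

# Re-seeding reproduces the random stream of a fresh run; for exact
# continuation of an interrupted run, checkpoint random.getstate()
# and restore it with random.setstate() instead.
random.seed(random_seed)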




Time and Success Rate Comparison Before and After Applying Checkpoints


As a concrete example, suppose a GA task takes about 10 seconds per generation and runs for 50 generations. Let's compare what happens when the pod fails and the task is interrupted:


Time Comparison



case 1 - no checkpoint

  • 1 ~ 30 (runs normally for 30 generations, then fails)
  • 1 ~ 25 (runs normally for 25 generations, then fails)
  • 1 ~ 50 (runs normally for 50 generations, finally succeeds)

Total generations run: 30 + 25 + 50 = 105 times

  • Total time: 105 × 10 seconds = 1,050 seconds (about 17.5 minutes)


case 2 - checkpoint

  • 1 ~ 30 (runs for 30 generations, fails before the generation-30 checkpoint is written)
  • 30 ~ 50 (resumes from the last checkpoint and runs 21 generations, finally succeeds)

Total generations run: 30 + 21 = 51 times

  • Total time: 51 × 10 seconds = 510 seconds (about 8.5 minutes)


Comparison and Effect

  • Restarting without checkpoint: 1,050 seconds
  • Restarting with checkpoint: 510 seconds
  • You can save more than half the time and resources.
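
The same comparison as a quick calculation, using the generation counts from the scenarios above:

GEN_TIME_SEC = 10  # assumed time per generation, as in the example above

# Generations actually executed in each scenario
no_checkpoint = [30, 25, 50]   # two failed attempts plus one full successful run
with_checkpoint = [30, 21]     # one failed attempt plus a resumed run

print(sum(no_checkpoint) * GEN_TIME_SEC)    # 1050 seconds (~17.5 minutes)
print(sum(with_checkpoint) * GEN_TIME_SEC)  # 510 seconds (~8.5 minutes)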

Improved Final Success Probability


With checkpointing, even if the pod fails multiple times, you can always resume from the last saved point. This not only saves time but also greatly increases the probability that the experiment eventually completes successfully.

For example, suppose each generation succeeds with a probability of 98%, and all 50 generations must be completed within at most 3 pod attempts. Let's compare the two cases:

case 1 - no checkpoint


# Failure probability for each attempt = 1 - 0.98 ** 50
>>> 1 - 0.98 ** 50
0.6358303199128832

# Probability of failing all three attempts
>>> round((1 - 0.98 ** 50) ** 3, 4)
0.2571

# Probability of succeeding at least once in three attempts
>>> round(1 - (1 - 0.98 ** 50) ** 3, 4)
0.7429

So, you have about a 74% chance of final success.

case 2 - checkpoint

# Probability that a single generation fails in all three attempts
>>> 0.02 ** 3
8e-06

# This is the probability that one particular generation fails in pod attempts 1, 2, and 3.
# In reality, retries happen at the pod level rather than per generation, but with
# pod-level retry plus checkpoint resume, each generation is effectively attempted up to 3 times.

# Probability that all 50 generations succeed within 3 attempts each
>>> round((1 - 8e-06) ** 50, 4)
0.9996


So, you have about a 99.96% chance of final success.
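
The same calculation can be packaged into two small helper functions, which makes it easy to try other per-generation success rates, generation counts, or attempt limits. This is only the approximate model described above, where pod-level retry plus checkpoint resume gives each generation up to the full number of attempts:

def success_without_checkpoint(p_gen, generations, attempts):
    # Each pod attempt must complete every generation in one go.
    p_attempt = p_gen ** generations
    return 1 - (1 - p_attempt) ** attempts

def success_with_checkpoint(p_gen, generations, attempts):
    # Approximate model: each generation effectively gets up to
    # `attempts` tries thanks to checkpoint resume.
    p_one_generation = 1 - (1 - p_gen) ** attempts
    return p_one_generation ** generations

print(success_without_checkpoint(0.98, 50, 3))  # ~0.74
print(success_with_checkpoint(0.98, 50, 3))     # ~0.9996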


Summary: Checkpointing is an important strategy that not only saves time and resources in retry situations but also increases the success rate of experiments.




Conclusion


Checkpointing using S3 in Kubernetes environments, especially for repetitive and long-running tasks (e.g., GA), greatly reduces resource waste due to pod restarts or retry situations. By saving a checkpoint at each generation, you can efficiently resume work from the interrupted point at any time. In the case of GA, saving all essential information such as population, best solution, best fitness, and random seed is necessary for complete recovery.