
How to Resolve OOMKilled Kubernetes Error (Exit Code 137)


The “OOMKilled” error, reported with exit code 137, is a serious issue. When a container uses more memory than it is allowed, the Linux Out-of-Memory (OOM) Killer terminates the process and Kubernetes marks the container as OOMKilled. This prevents memory exhaustion and keeps the cluster healthy.

In this article, we’ll explain the OOMKilled error, its importance in Kubernetes, and ways to reduce its impact.

Understanding OOMKilled

1. Memory Limit and Memory Request

In Kubernetes, managing container memory involves two main settings:

  1. Memory Limit:

    This sets the maximum memory a container can use. If the container exceeds this limit, its process is killed to prevent memory exhaustion and the container is given an OOMKilled status.

  2. Memory Request:

    This is the minimum memory a container needs to run properly. It helps Kubernetes allocate resources correctly. When scheduling pods, Kubernetes ensures nodes have enough memory to meet this request.
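As an illustration, here is a minimal container spec that sets both values; the pod name and image are placeholders, and the sizes match the example used later in this article:

apiVersion: v1
kind: Pod
metadata:
  name: memory-demo           # placeholder name
spec:
  containers:
  - name: app
    image: nginx              # placeholder image
    resources:
      requests:
        memory: "128Mi"       # the scheduler reserves at least this much
      limits:
        memory: "256Mi"       # exceeding this triggers an OOM kill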

2. Linux OOM Killer Mechanism

The OOMKilled status is not native to Kubernetes but relies on the Linux Kernel’s OOM Killer. Here’s how it works:

  • When a container consumes more memory than its limit, the Linux Kernel detects the memory pressure.
  • The OOM Killer identifies the process with the highest “oom_score” (a value indicating the process’s priority for termination).
  • The process with the highest oom_score gets terminated, freeing up memory for other processes.
  • Kubernetes captures this event and marks the pod as OOMKilled.
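You can see the kernel’s side of this on the affected node. The exact wording varies by kernel version, but the OOM kill is recorded in the kernel ring buffer and can usually be found with something like:

sudo dmesg -T | grep -iE "killed process|out of memory"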

3. Preventing Monopolization of Node Memory

It’s crucial to prevent a single container from monopolizing node memory. Here are some strategies:

  • Resource Limits: Set appropriate memory limits for containers, and avoid overcommitting memory, as it can lead to frequent OOMKilled events. A namespace-level LimitRange (see the sketch after this list) can enforce sensible defaults.
  • Monitoring and Alerts: Regularly monitor memory usage within pods. Implement alerts to detect sudden spikes or prolonged high memory consumption.
  • Tune Application Behavior: Optimize your application to use memory efficiently. Identify memory leaks or inefficient code paths.
  • Horizontal Scaling: Distribute workloads across multiple pods or nodes to avoid resource contention.
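To back the first of these strategies at the namespace level, a LimitRange can apply default requests and limits to containers that do not declare their own. Here is a minimal sketch; the namespace name and sizes are assumptions, not values from this article:

apiVersion: v1
kind: LimitRange
metadata:
  name: memory-defaults
  namespace: dev              # assumed namespace
spec:
  limits:
  - type: Container
    defaultRequest:
      memory: 128Mi           # used when a container sets no request
    default:
      memory: 256Mi           # used when a container sets no limit
    max:
      memory: 1Gi             # hard per-container cap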

Detecting OOMKilled Events

To identify OOMKilled pods using kubectl, follow these steps:

Step 1: Run the following command to list all pods in the current namespace.

kubectl get pods

Step 2: Look for the “STATUS” column in the output. If a pod has been OOMKilled, it will display “OOMKilled” as its status.
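The output looks something like this (the READY and AGE values are illustrative):

NAME                      READY   STATUS      RESTARTS   AGE
my-app-6f8d4c7b5f-7z9q4   0/1     OOMKilled   1          5m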

In this example, the pod (“my-app-6f8d4c7b5f-7z9q4”) was OOMKilled and has been restarted once.

Regularly checking pod statuses can help you detect OOMKilled events promptly and take the necessary steps to avoid them.
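Note that after a restart the STATUS column may already show Running again. In that case the reason is recorded in the container’s last state, which you can query directly:

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'

Alternatively, run kubectl describe pod <pod-name> and look for “Reason: OOMKilled” and “Exit Code: 137” under Last State.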

Resolving OOMKilled Issues

1. Evaluate and Adjust Memory Requests and Limits

To avoid OOMKilled events, configure memory requests and limits for your containers with care:

  • Memory Requests: Establish a realistic memory request that represents the minimum memory your application needs. This ensures Kubernetes allocates enough resources to the pod.
  • Memory Limits: Set a suitable memory limit for each container. Avoid setting the limit too close to the application’s actual peak memory usage, as this can result in frequent OOMKilled incidents.

Keep in mind that setting overly restrictive memory limits can lead to unnecessary pod restarts, affecting application availability.

Step 1: Check Current Memory Requests and Limits

kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources}'

Expected Output:

{
  "requests": {
    "memory": "128Mi"
  },
  "limits": {
    "memory": "256Mi"
  }
}

Step 2: Adjust Memory Requests and Limits

A running pod’s resource fields are immutable, so adjust them on the workload that owns the pod (for example, its Deployment) rather than on the live pod:

kubectl edit deployment <deployment-name>

This will open the configuration in your default text editor. Look for the resources section under the container spec and adjust the requests and limits values. Saving the change rolls out replacement pods with the new settings.

Step 3: Apply The Changes

If the pod is instead defined in a standalone manifest, update the file, then delete and recreate the pod:

kubectl delete pod <pod-name>
kubectl apply -f pod.yaml

Step 4: Verify The Changes

kubectl get pod <pod-name> -o jsonpath='{.spec.containers[*].resources}'
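If you edited the owning Deployment, remember that the replacement pod has a new name. Once the update has taken effect, the output should show the new values, for example (assuming the sizes were doubled):

{
  "requests": {
    "memory": "256Mi"
  },
  "limits": {
    "memory": "512Mi"
  }
}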

2. Debug Memory Spikes or Leaks

When faced with OOMKilled errors, investigate memory spikes or leaks within your application:

  • Monitoring Tools: Use monitoring tools like Prometheus, Grafana, or Kubernetes’ built-in metrics to track memory usage over time. Identify any sudden spikes or gradual increases in usage.
  • Heap Dumps and Profiling: Capture heap dumps and analyze memory profiles to identify memory-intensive components. Tools like pprof can assist in this analysis (see the sketch after this list).
  • Code Review: Examine your application code for inefficient memory usage patterns. Check for unclosed resources, unnecessary caching, or overly large data structures.
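For example, if the application is a Go service that exposes net/http/pprof on port 6060 (an assumption, not something this article’s example app necessarily does), you can pull a heap profile through a port-forward:

kubectl port-forward pod/<pod-name> 6060:6060
go tool pprof -top http://localhost:6060/debug/pprof/heap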

3. Complex Cases and Further Investigation

In some cases, OOMKilled problems can be complicated and need a closer look:

  • Kernel Tuning: Explore kernel-level tuning options to adjust the OOM Killer behavior. Modify the oom_score_adj values or consider using cgroups memory controllers.

Step 1: Find The Process ID

First, get the ID of the pod’s container:

kubectl get pod <pod-name> -o jsonpath='{.status.containerStatuses[0].containerID}' | cut -d'/' -f3
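This returns the container ID rather than the PID. On the node that runs the pod, resolve it to a host PID; on a containerd-based node (an assumption about your runtime), something like the following works:

sudo crictl inspect <container-id> | grep -i '"pid"'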

Step 2: Adjust oom_score_adj

For example, to set the oom_score_adj value to -100 for a process with PID 3274:

echo -100 | sudo tee /proc/3274/oom_score_adj

  • Node-Level Metrics: Monitor node-level metrics (such as system memory utilization) to identify resource bottlenecks.

Step 1: Install metrics-server (if not already installed)

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

Step 2: Get Node-Level Memory Usage

kubectl top nodes

  • Application Profiling: Use profiling tools to understand memory allocation patterns during specific operations or requests.

Linux Kernel’s OOM Killer Mechanism

1. OOMKilled and the Linux Kernel

The “OOMKilled” status isn’t a built-in feature of Kubernetes. Instead, it depends on the Linux Kernel’s Out-of-Memory (OOM) Killer. Here’s how it works:

  • Memory Pressure: When a node is running low on memory, the Linux Kernel detects it.
  • Process Termination: The OOM Killer finds the process with the highest “oom_score,” which indicates its priority for termination.
  • oom_score_adj Values: Each process also has an “oom_score_adj” value that adjusts its score. Higher values make a process more likely to be killed; negative values protect it.
  • OOMKilled Event: The process with the highest oom_score is terminated to free up memory. Kubernetes then marks the related pod as OOMKilled.

2. oom_score and oom_score_adj

  • oom_score: The oom_score is a number given to each process, showing how likely it is to be picked by the OOM Killer. Higher scores mean the process is more likely to be terminated.
  • oom_score_adj: The oom_score_adj is a value you can adjust to influence the oom_score. Changing this value helps control how vulnerable a process is to OOM termination.
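Both values can be read (and oom_score_adj written) through the proc filesystem; for a process with PID 3274, as in the earlier example:

cat /proc/3274/oom_score
cat /proc/3274/oom_score_adj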

3. Quality of Service (QoS) Classes

Kubernetes groups pods into three Quality of Service (QoS) classes based on their resource requests and limits:

  1. Guaranteed:

    • Pods where memory and CPU requests equal their limits.
    • oom_score_adj value: -997.
  2. Burstable:

    • Pods where memory or CPU requests are less than their limits.
    • oom_score_adj value: computed per pod from its memory request, roughly 1000 minus the request’s share of node memory, clamped to the range 2–999.
  3. BestEffort:

    • Pods with no memory or CPU requests.
    • oom_score_adj value: 1000.
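You can check which class Kubernetes assigned to a pod directly from its status:

kubectl get pod <pod-name> -o jsonpath='{.status.qosClass}'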
