Troubleshooting Kubernetes CrashLoopBackOff: A Comprehensive Guide to Diagnosis, Fixes, and Alerting
Fix Kubernetes CrashLoopBackOff errors fast. Learn root causes, debug with kubectl, and configure Prometheus Alertmanager for pod restart loops and IaC locks.
- CrashLoopBackOff is not a fatal error itself, but a Kubernetes state indicating a pod's container is repeatedly failing and restarting.
- Common root causes include application crashes (Exit Code 1), memory limits exceeded (OOMKilled - Exit Code 137), and misconfigured liveness probes.
- Use 'kubectl logs <pod> --previous' to retrieve the logs of the container instance that crashed before the current restart loop.
- Proactive monitoring with Prometheus and Alertmanager (via Slack, OpsGenie, or PagerDuty) is essential to detect CrashLoopBackOff states before users notice.
- Infrastructure-as-Code lockups (e.g., needing `terragrunt force-unlock`) can block deployment fixes, so clear state locks before pushing your manifest updates.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| kubectl describe pod | Initial triage, checking events, exit codes, and probe failures. | 1-2 mins | None |
| kubectl logs --previous | When the pod is restarting too fast to catch live logs. | 2-3 mins | None |
| Prometheus & Alertmanager | Proactive monitoring across entire clusters for automated Slack/OpsGenie alerts. | Setup: Hours | Low (Alert Fatigue) |
| Ephemeral Debug Containers | When 'crashloopbackoff no logs' occurs and you need shell access. | 5-10 mins | Medium (Requires K8s v1.25+) |
Understanding the CrashLoopBackOff Error
When working with Kubernetes, seeing a pod stuck in CrashLoopBackOff is a rite of passage for DevOps engineers. It is one of the most common, yet frustrating, errors you will encounter. But what does it actually mean?
CrashLoopBackOff is not the cause of the crash. It is a state. It means that the kubelet has tried to start your container, the container has failed and exited, and Kubernetes is now waiting for a "backoff" period before trying to restart it. The backoff period increases exponentially with each failure (10s, 20s, 40s, up to a cap of 5 minutes) to prevent a failing pod from consuming node resources in an infinite, rapid restart loop.
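The doubling schedule above can be sketched in a few lines of shell. This only prints the delays the kubelet would apply; it is not how the kubelet computes them internally:

```shell
# Print the restart backoff schedule: doubling from 10s, capped at 300s (5 min).
delay=10
for attempt in 1 2 3 4 5 6 7; do
  echo "restart attempt ${attempt}: wait ${delay}s"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```

Attempts 1 through 5 wait 10s, 20s, 40s, 80s, and 160s; every attempt after that waits the full 5 minutes, which is why a crash-looping pod can sit "idle" for minutes between restarts.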
The exact error message you will see in your Kubernetes events looks like this:
Warning BackOff kubelet Back-off restarting failed container
Step 1: Diagnose the Pod
The most critical phase of troubleshooting CrashLoopBackOff is gathering data. Because the container exits rapidly, standard log trailing often shows nothing.
1. Check the Pod Status and Events
Run `kubectl get pods -n <namespace>` to identify the failing pod (its STATUS column will read CrashLoopBackOff). Then describe the pod to look at the events and exit codes:
kubectl describe pod <pod-name> -n <namespace>
Scroll to the bottom of the output to the Events section. You are looking for the State and Last State of the container. Pay close attention to the Exit Code.
- Exit Code 1: General application error. The application panicked or threw a fatal error. Look at the application logs.
- Exit Code 137: OOMKilled. The container exceeded its memory limits and was killed by the Linux kernel.
- Exit Code 255: Usually an infrastructure issue, such as the node rebooting or severe underlying host errors.
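As a quick triage aid, the mapping above can be wrapped in a small shell helper. The hint wording is ours, not kubectl output, and the list is far from exhaustive:

```shell
# Map common container exit codes to a likely cause during triage.
# Codes above 128 mean "killed by signal (code - 128)".
explain_exit_code() {
  case "$1" in
    1)   echo "application error: check application logs" ;;
    137) echo "OOMKilled (128+9): raise memory limits or fix the leak" ;;
    139) echo "segmentation fault (128+11)" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    255) echo "infrastructure issue: node reboot or host failure" ;;
    *)   echo "unknown: consult kubectl describe output" ;;
  esac
}

explain_exit_code 137
```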
2. Fetching Previous Logs
A common complaint is that a crash-looping pod appears to have no logs. If you run `kubectl logs <pod-name>`, it may return nothing because the current container instance just started and hasn't written anything before crashing. Use the `--previous` (or `-p`) flag to get the logs of the container instance that actually crashed:
kubectl logs <pod-name> --previous -n <namespace>
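For multi-container pods you also need `-c <container>`. Here is a small wrapper function (a hypothetical convenience helper, not a kubectl subcommand) that bundles the flags:

```shell
# prevlogs: fetch the last 100 log lines from the previously crashed
# container instance. Arguments: namespace, pod, optional container name.
prevlogs() {
  if [ "$#" -lt 2 ]; then
    echo "usage: prevlogs <namespace> <pod> [container]" >&2
    return 1
  fi
  if [ -n "${3:-}" ]; then
    kubectl logs "$2" -n "$1" -c "$3" --previous --tail=100
  else
    kubectl logs "$2" -n "$1" --previous --tail=100
  fi
}

# Example against a live cluster (names are placeholders):
#   prevlogs <namespace> <pod> [container]
```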
Step 2: Proactive Alerting with Alertmanager
Relying on manual `kubectl get pods` checks is not sustainable. You need an automated way to detect when a pod enters a crash loop, and this is where Prometheus and Alertmanager come in. You can configure Prometheus to trigger an alert if a pod restarts too many times in a short window, and use Alertmanager to route that alert to your team.
Creating the Prometheus Alert
You need a PromQL query that detects restarting pods. The `kube_pod_container_status_restarts_total` metric (provided by kube-state-metrics) is perfect for this. If kube-state-metrics itself is crash looping, stabilize that deployment first: your restart metrics depend on it.
```yaml
groups:
  - name: kubernetes-apps
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod is stuck in CrashLoopBackOff. Check the previous container logs with kubectl logs --previous."
```
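The `rate()` expression above fires on any restart in the window; if that proves noisy, two hedged alternatives on the same kube-state-metrics data (thresholds are illustrative, tune them to your workloads) are:

```yaml
rules:
  # Fire only after more than 3 restarts in 15 minutes.
  - alert: PodCrashLoopingStrict
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m
  # Alert directly on the waiting reason reported by kube-state-metrics.
  - alert: PodInCrashLoopBackOff
    expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
    for: 10m
```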
Configuring Alertmanager
Once Prometheus fires the alert, Alertmanager routes it. Whether you run Alertmanager via the Prometheus Operator, as a standalone deployment, or inside an integrated stack such as Grafana Mimir or VictoriaMetrics, the configuration is similar.
Here is an example configuration with a Slack default receiver and an OpsGenie receiver for critical alerts:
```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s       # delay the first notification slightly to batch alerts
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-default'
  routes:
    - match:
        severity: critical
      receiver: 'opsgenie-critical'
receivers:
  - name: 'slack-default'
    slack_configs:
      - channel: '#alerts-k8s'
        send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}'
  - name: 'opsgenie-critical'
    opsgenie_configs:
      - api_key: 'YOUR_OPSGENIE_API_KEY'
        priority: 'P1'
```
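Before reloading Alertmanager, the config above can be linted locally with `amtool`, which ships with the Alertmanager release. A minimal sketch, assuming the config was saved as alertmanager.yml:

```shell
# Lint the Alertmanager config before reloading; skip gracefully when amtool
# is not installed. The filename alertmanager.yml is an assumption.
if command -v amtool >/dev/null 2>&1; then
  amtool check-config alertmanager.yml && checked=yes || checked=failed
else
  checked=skipped
  echo "amtool not found; skipping config lint" >&2
fi
echo "lint result: $checked"
```

Wiring this into CI catches routing and template typos before they silently drop alerts in production.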
Alertmanager is highly extensible. You can also route alerts to a Discord webhook, send emails (for example via SendGrid's SMTP relay), publish to AWS SNS, or integrate with Alerta and VictorOps (Splunk On-Call).
Step 3: Fixing Specific Component CrashLoopBackOffs
Sometimes, it is not your application, but the Kubernetes infrastructure itself that is failing. Let's look at common system pod failures:
1. aws-node / calico-node CrashLoopBackOff
If you are on EKS or using Calico, you might see the aws-node or calico-node DaemonSet pods in CrashLoopBackOff.
- Cause: Often caused by CNI misconfigurations, IPAM exhaustion (no available IP addresses in the subnet), or missing IAM roles for service accounts (IRSA).
- Fix: Check `kubectl logs -n kube-system daemonset/aws-node`. Verify your VPC subnets have free IPs. For Calico, check the calico-kube-controllers logs for RBAC permission errors.
2. coredns / nodelocaldns CrashLoopBackOff
A coredns CrashLoopBackOff typically happens when the DNS pod cannot reach the Kubernetes API server, or there is a loop in your upstream DNS configuration (e.g., /etc/resolv.conf pointing to itself).
- Fix: Check `kubectl logs -n kube-system deployment/coredns`. Look for "plugin/loop: Loop (127.0.0.1:53) detected". You may need to patch the CoreDNS ConfigMap to forward to a specific upstream resolver such as 8.8.8.8 instead of inheriting the host's looping configuration.
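To break a detected loop, the `forward` plugin in the Corefile (stored in the coredns ConfigMap in kube-system) can point at an explicit upstream instead of /etc/resolv.conf. A simplified sketch of the relevant stanza; your actual Corefile will carry more plugins and cluster-specific values:

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa
    forward . 8.8.8.8 8.8.4.4   # explicit upstream instead of /etc/resolv.conf
    cache 30
    loop
    reload
}
```

After editing the ConfigMap, the `reload` plugin picks up the change; otherwise restart the coredns deployment.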
3. Ingress Controllers
An ingress-nginx controller or AWS Load Balancer Controller CrashLoopBackOff often occurs due to invalid Ingress syntax, missing TLS secrets, or missing IAM permissions (for the AWS ALB controller).
- Fix: Validate your Ingress resources. Use `kubectl describe` on the controller pod to check whether it is failing health checks due to a webhook timeout.
Step 4: Deployment Blockers - Terragrunt Locks
You've diagnosed the code issue, you've built a new container image, and you are ready to deploy the fix via your IaC pipeline. But the pipeline fails with a state lock error!
If you use Terragrunt/Terraform, a previous pipeline run might have crashed, leaving the DynamoDB/GCS state locked. You will see an error like Error acquiring the state lock.
To push your Kubernetes fix, you must first clear the lock:
- Identify the Lock ID from the pipeline error output.
- Run `terragrunt force-unlock` with that ID:

terragrunt force-unlock <LOCK_ID>
Warning: Only run `terragrunt force-unlock` if you are 100% sure no other process is actively running `terraform apply` against that state file; otherwise you risk state corruption.
Summary of the Diagnostic Flow
When a pod enters CrashLoopBackOff:
1. Check the alert in Slack/OpsGenie.
2. Run `kubectl get pods` to confirm the state.
3. Run `kubectl describe pod` to find the Exit Code and exact error.
4. Run `kubectl logs --previous` to see application output.
5. Adjust memory limits, fix environment variables, or correct IAM roles.
6. Clear any IaC locks (`terragrunt force-unlock`) if deploying via Terraform.
7. Deploy the fix and monitor the pod status.
Bonus: Cluster-Wide Diagnostic Script
The following script finds every pod in CrashLoopBackOff across all namespaces and pulls the logs from each pod's previous instance:
```bash
#!/bin/bash
# Diagnostic script: find all pods in CrashLoopBackOff across all namespaces
# and fetch the last 50 lines of their previous crashed instance.
echo "Searching for pods in CrashLoopBackOff..."

# Note: a pod in CrashLoopBackOff usually still reports phase "Running", so
# filter on the container waiting reason rather than on pod phase.
CRASHING_PODS=$(kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.status.containerStatuses[]?.state.waiting.reason == "CrashLoopBackOff")
      | "\(.metadata.namespace) \(.metadata.name)"')

if [ -z "$CRASHING_PODS" ]; then
  echo "No CrashLoopBackOff pods found."
  exit 0
fi

echo "$CRASHING_PODS" | while read -r namespace pod; do
  echo "--------------------------------------------------"
  echo "Analyzing Pod: $pod in Namespace: $namespace"
  echo "--------------------------------------------------"
  # Show the crashed instance's exit code and reason
  kubectl describe pod "$pod" -n "$namespace" | grep -A 5 "Last State" || true
  printf '\n[Logs from previous container termination]\n'
  kubectl logs "$pod" -n "$namespace" --previous --tail=50 || echo "No previous logs available."
  printf '\n'
done
```
Error Medic Editorial
Error Medic Editorial is composed of senior Site Reliability Engineers and DevOps practitioners dedicated to solving complex Kubernetes, AWS, and CI/CD infrastructure challenges.