Troubleshooting Kubernetes CrashLoopBackOff: A Comprehensive Guide to Diagnosis, Fixes, and Alerting
Fix Kubernetes CrashLoopBackOff errors fast. Learn root causes, debug with kubectl, and configure Prometheus Alertmanager for pod restart loops and IaC locks.
- CrashLoopBackOff is not a fatal error itself, but a Kubernetes state indicating a pod's container is repeatedly failing and restarting.
- Common root causes include application crashes (Exit Code 1), memory limits exceeded (OOMKilled - Exit Code 137), and misconfigured liveness probes.
- Use 'kubectl logs <pod> --previous' to retrieve the logs of the container instance that crashed before the current restart loop.
- Proactive monitoring with Prometheus and Alertmanager (via Slack, OpsGenie, or PagerDuty) is essential to detect CrashLoopBackOff states before users notice.
- Infrastructure-as-Code lockups (e.g., needing `terragrunt force-unlock`) can block deployment fixes, so clear state locks before pushing your manifest updates.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| kubectl describe pod | Initial triage, checking events, exit codes, and probe failures. | 1-2 mins | None |
| kubectl logs --previous | When the pod is restarting too fast to catch live logs. | 2-3 mins | None |
| Prometheus & Alertmanager | Proactive monitoring across entire clusters for automated Slack/OpsGenie alerts. | Setup: Hours | Low (Alert Fatigue) |
| Ephemeral Debug Containers | When 'crashloopbackoff no logs' occurs and you need shell access. | 5-10 mins | Medium (Requires K8s v1.25+) |
Understanding the CrashLoopBackOff Error
When working with Kubernetes, seeing a pod stuck in CrashLoopBackOff is a rite of passage for DevOps engineers. It is one of the most common, yet frustrating, errors you will encounter. But what does it actually mean?
CrashLoopBackOff is not the cause of the crash. It is a state. It means that the kubelet has tried to start your container, the container has failed and exited, and Kubernetes is now waiting for a "backoff" period before trying to restart it. The backoff period increases exponentially with each failure (10s, 20s, 40s, up to a cap of 5 minutes) to prevent a failing pod from consuming node resources in an infinite, rapid restart loop.
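The doubling schedule above can be sketched in a few lines of shell. This only prints the delays the kubelet would apply; it is not how the kubelet computes them internally:

```shell
# Print the restart backoff schedule: doubling from 10s, capped at 300s (5 min).
delay=10
for attempt in 1 2 3 4 5 6 7; do
  echo "restart attempt ${attempt}: wait ${delay}s"
  delay=$(( delay * 2 ))
  if [ "$delay" -gt 300 ]; then delay=300; fi
done
```

Attempts 1 through 5 wait 10s, 20s, 40s, 80s, and 160s; every attempt after that waits the full 5 minutes, which is why a crash-looping pod can sit "idle" for minutes between restarts.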
The exact error message you will see in your Kubernetes events looks like this:
Warning BackOff kubelet Back-off restarting failed container
Step 1: Diagnose the Pod
The most critical phase of troubleshooting CrashLoopBackOff is gathering data. Because the container exits rapidly, standard log trailing often shows nothing.
1. Check the Pod Status and Events
Run `kubectl get pods -n <namespace>` to identify the failing pod (its STATUS column will read CrashLoopBackOff). Then describe the pod to look at the events and exit codes:
kubectl describe pod <pod-name> -n <namespace>
Scroll to the bottom of the output to the Events section. You are looking for the State and Last State of the container. Pay close attention to the Exit Code.
- Exit Code 1: General application error. The application panicked or threw a fatal error. Look at the application logs.
- Exit Code 137: OOMKilled. The container exceeded its memory limits and was killed by the Linux kernel.
- Exit Code 255: Usually an infrastructure issue, such as the node rebooting or severe underlying host errors.
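As a quick triage aid, the mapping above can be wrapped in a small shell helper. The hint wording is ours, not kubectl output, and the list is far from exhaustive:

```shell
# Map common container exit codes to a likely cause during triage.
# Codes above 128 mean "killed by signal (code - 128)".
explain_exit_code() {
  case "$1" in
    1)   echo "application error: check application logs" ;;
    137) echo "OOMKilled (128+9): raise memory limits or fix the leak" ;;
    139) echo "segmentation fault (128+11)" ;;
    143) echo "terminated by SIGTERM (128+15)" ;;
    255) echo "infrastructure issue: node reboot or host failure" ;;
    *)   echo "unknown: consult kubectl describe output" ;;
  esac
}

explain_exit_code 137
```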
2. Fetching Previous Logs
A common complaint is that a crash-looping pod appears to have no logs. If you run `kubectl logs <pod-name>`, it may return nothing because the current container instance just started and hasn't written anything before crashing. Use the `--previous` (or `-p`) flag to get the logs of the container instance that actually crashed:
kubectl logs <pod-name> --previous -n <namespace>
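For multi-container pods you also need `-c <container>`. Here is a small wrapper function (a hypothetical convenience helper, not a kubectl subcommand) that bundles the flags:

```shell
# prevlogs: fetch the last 100 log lines from the previously crashed
# container instance. Arguments: namespace, pod, optional container name.
prevlogs() {
  if [ "$#" -lt 2 ]; then
    echo "usage: prevlogs <namespace> <pod> [container]" >&2
    return 1
  fi
  if [ -n "${3:-}" ]; then
    kubectl logs "$2" -n "$1" -c "$3" --previous --tail=100
  else
    kubectl logs "$2" -n "$1" --previous --tail=100
  fi
}

# Example against a live cluster (names are placeholders):
#   prevlogs <namespace> <pod> [container]
```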
Step 2: Proactive Alerting with Alertmanager
Relying on manual `kubectl get pods` checks is not sustainable. You need an automated way to detect when a pod enters a crash loop, and this is where Prometheus and Alertmanager come in. You can configure Prometheus to trigger an alert if a pod restarts too many times in a short window, and use Alertmanager to route that alert to your team.
Creating the Prometheus Alert
You need a PromQL query that detects restarting pods. The `kube_pod_container_status_restarts_total` metric (provided by kube-state-metrics) is perfect for this. If kube-state-metrics itself is crash looping, stabilize that deployment first: your restart metrics depend on it.
```yaml
groups:
  - name: kubernetes-apps
    rules:
      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod is stuck in CrashLoopBackOff. Check the previous container logs with kubectl logs --previous."
```
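The `rate()` expression above fires on any restart in the window; if that proves noisy, two hedged alternatives on the same kube-state-metrics data (thresholds are illustrative, tune them to your workloads) are:

```yaml
rules:
  # Fire only after more than 3 restarts in 15 minutes.
  - alert: PodCrashLoopingStrict
    expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
    for: 5m
  # Alert directly on the waiting reason reported by kube-state-metrics.
  - alert: PodInCrashLoopBackOff
    expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
    for: 10m
```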
Configuring Alertmanager
Once Prometheus fires the alert, Alertmanager routes it. Whether you run Alertmanager via the Prometheus Operator, as a standalone deployment, or inside an integrated stack such as Grafana Mimir or VictoriaMetrics, the configuration is similar.
Here is an example configuration with a Slack default receiver and an OpsGenie receiver for critical alerts:
```yaml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s       # delay the first notification slightly to batch alerts
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'slack-default'
  routes:
    - match:
        severity: critical
      receiver: 'opsgenie-critical'
receivers:
  - name: 'slack-default'
    slack_configs:
      - channel: '#alerts-k8s'
        send_resolved: true
        title: '[{{ .Status | toUpper }}] {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}'
  - name: 'opsgenie-critical'
    opsgenie_configs:
      - api_key: 'YOUR_OPSGENIE_API_KEY'
        priority: 'P1'
```
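Before reloading Alertmanager, the config above can be linted locally with `amtool`, which ships with the Alertmanager release. A minimal sketch, assuming the config was saved as alertmanager.yml:

```shell
# Lint the Alertmanager config before reloading; skip gracefully when amtool
# is not installed. The filename alertmanager.yml is an assumption.
if command -v amtool >/dev/null 2>&1; then
  amtool check-config alertmanager.yml && checked=yes || checked=failed
else
  checked=skipped
  echo "amtool not found; skipping config lint" >&2
fi
echo "lint result: $checked"
```

Wiring this into CI catches routing and template typos before they silently drop alerts in production.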
Alertmanager is highly extensible. You can also route alerts to a Discord webhook, send emails (for example via SendGrid's SMTP relay), publish to AWS SNS, or integrate with Alerta and VictorOps (Splunk On-Call).
Step 3: Fixing Specific Component CrashLoopBackOffs
Sometimes, it is not your application, but the Kubernetes infrastructure itself that is failing. Let's look at common system pod failures:
1. aws-node / calico-node CrashLoopBackOff
If you are on EKS or using Calico, you might see the aws-node or calico-node DaemonSet pods in CrashLoopBackOff.
- Cause: Often caused by CNI misconfigurations, IPAM exhaustion (no available IP addresses in the subnet), or missing IAM roles for service accounts (IRSA).
- Fix: Check `kubectl logs -n kube-system daemonset/aws-node`. Verify your VPC subnets have free IPs. For Calico, check the calico-kube-controllers logs for RBAC permission errors.
2. coredns / nodelocaldns CrashLoopBackOff
A coredns CrashLoopBackOff typically happens when the DNS pod cannot reach the Kubernetes API server, or there is a loop in your upstream DNS configuration (e.g., /etc/resolv.conf pointing to itself).
- Fix: Check `kubectl logs -n kube-system deployment/coredns`. Look for "plugin/loop: Loop (127.0.0.1:53) detected". You may need to patch the CoreDNS ConfigMap to forward to a specific upstream resolver such as 8.8.8.8 instead of inheriting the host's looping configuration.
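To break a detected loop, the `forward` plugin in the Corefile (stored in the coredns ConfigMap in kube-system) can point at an explicit upstream instead of /etc/resolv.conf. A simplified sketch of the relevant stanza; your actual Corefile will carry more plugins and cluster-specific values:

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa
    forward . 8.8.8.8 8.8.4.4   # explicit upstream instead of /etc/resolv.conf
    cache 30
    loop
    reload
}
```

After editing the ConfigMap, the `reload` plugin picks up the change; otherwise restart the coredns deployment.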
3. Ingress Controllers
An ingress-nginx controller or AWS Load Balancer Controller CrashLoopBackOff often occurs due to invalid Ingress syntax, missing TLS secrets, or missing IAM permissions (for the AWS ALB controller).
- Fix: Validate your Ingress resources. Use `kubectl describe` on the controller pod to check whether it is failing health checks due to a webhook timeout.
Step 4: Deployment Blockers - Terragrunt Locks
You've diagnosed the code issue, you've built a new container image, and you are ready to deploy the fix via your IaC pipeline. But the pipeline fails with a state lock error!
If you use Terragrunt/Terraform, a previous pipeline run might have crashed, leaving the DynamoDB/GCS state locked. You will see an error like Error acquiring the state lock.
To push your Kubernetes fix, you must first clear the lock:
- Identify the Lock ID from the pipeline error output.
- Run `terragrunt force-unlock` with that ID:

terragrunt force-unlock <LOCK_ID>
Warning: Only run `terragrunt force-unlock` if you are 100% sure no other process is actively running `terraform apply` against that state file; otherwise you risk state corruption.
Summary of the Diagnostic Flow
When a pod enters CrashLoopBackOff:
1. Check the alert in Slack/OpsGenie.
2. Run `kubectl get pods` to confirm the state.
3. Run `kubectl describe pod` to find the Exit Code and exact error.
4. Run `kubectl logs --previous` to see application output.
5. Adjust memory limits, fix environment variables, or correct IAM roles.
6. Clear any IaC locks (`terragrunt force-unlock`) if deploying via Terraform.
7. Deploy the fix and monitor the pod status.
Bonus: Cluster-Wide Diagnostic Script
The following script finds every pod in CrashLoopBackOff across all namespaces and pulls the logs from each pod's previous instance:
```bash
#!/bin/bash
# Diagnostic script: find all pods in CrashLoopBackOff across all namespaces
# and fetch the last 50 lines of their previous crashed instance.
echo "Searching for pods in CrashLoopBackOff..."

# Note: a pod in CrashLoopBackOff usually still reports phase "Running", so
# filter on the container waiting reason rather than on pod phase.
CRASHING_PODS=$(kubectl get pods --all-namespaces -o json \
  | jq -r '.items[]
      | select(.status.containerStatuses[]?.state.waiting.reason == "CrashLoopBackOff")
      | "\(.metadata.namespace) \(.metadata.name)"')

if [ -z "$CRASHING_PODS" ]; then
  echo "No CrashLoopBackOff pods found."
  exit 0
fi

echo "$CRASHING_PODS" | while read -r namespace pod; do
  echo "--------------------------------------------------"
  echo "Analyzing Pod: $pod in Namespace: $namespace"
  echo "--------------------------------------------------"
  # Show the crashed instance's exit code and reason
  kubectl describe pod "$pod" -n "$namespace" | grep -A 5 "Last State" || true
  printf '\n[Logs from previous container termination]\n'
  kubectl logs "$pod" -n "$namespace" --previous --tail=50 || echo "No previous logs available."
  printf '\n'
done
```
Error Medic Editorial
Error Medic Editorial is composed of senior Site Reliability Engineers and DevOps practitioners dedicated to solving complex Kubernetes, AWS, and CI/CD infrastructure challenges.