How to Fix ImagePullBackOff and Evicted Pods in Kubernetes
Comprehensive guide to troubleshooting and fixing Kubernetes ImagePullBackOff, ErrImagePull, and Evicted pod statuses. Learn root causes and permanent fixes.
- ImagePullBackOff usually means the container image is missing, the tag is wrong, or registry authentication (ImagePullSecrets) is failing.
- Evicted pods are typically caused by node resource pressure, most commonly exhausted memory or ephemeral storage.
- Use 'kubectl describe pod <pod-name>' to identify the exact reason for ImagePullBackOff or Eviction.
- The 'cluster-autoscaler.kubernetes.io/safe-to-evict' annotation controls whether the Cluster Autoscaler can evict a pod during node scale-down.
- Clear evicted pods in bulk using 'kubectl delete pods --field-selector status.phase=Failed'.
| Method | When to Use | Effort | Risk |
|---|---|---|---|
| Verify Image Tag/Name | When 'kubectl describe' shows 'NotFound' or 'manifest unknown' | Low | Low |
| Create ImagePullSecret | When pulling from a private registry results in 'Unauthorized' or 'Access Denied' | Medium | Low |
| Increase Node Resources/Requests | When pods are Evicted due to Memory or Ephemeral Storage pressure | High | Medium |
| Add safe-to-evict Annotation | When Cluster Autoscaler refuses to scale down a node due to local storage pods | Low | Low |
Understanding ImagePullBackOff and Evicted Pods in Kubernetes
When managing a Kubernetes cluster, whether it's on Azure Kubernetes Service (AKS), Amazon EKS, Google GKE, or Docker Desktop, encountering pod lifecycle errors is inevitable. Two of the most common and disruptive statuses you will encounter are ImagePullBackOff (often preceded by ErrImagePull) and Evicted.
While they manifest differently, both indicate that Kubernetes cannot run your workload as requested. ImagePullBackOff is a failure at the container startup phase, whereas Evicted means a running pod was forcefully terminated by the kubelet to save the node from complete resource starvation.
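To see whether any workloads are currently stuck in these states, you can filter kubectl get pods output. The snippet below runs the filter against hardcoded sample output so it works standalone; in a real cluster, pipe kubectl get pods -A into the same grep.

```shell
# Filter pod listings for the failure statuses covered in this article.
# The here-doc stands in for live output; in a cluster, run:
#   kubectl get pods -A | grep -E 'ImagePullBackOff|ErrImagePull|Evicted'
grep -E 'ImagePullBackOff|ErrImagePull|Evicted' <<'EOF'
NAMESPACE   NAME       READY   STATUS             RESTARTS   AGE
default     web-7d4f   0/1     ImagePullBackOff   0          5m
default     api-9c2a   1/1     Running            0          2d
batch       job-x1     0/1     Evicted            0          1h
EOF
```

Only the two failing pods survive the filter; healthy Running pods are dropped.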
Diagnosing ImagePullBackOff and ErrImagePull
The ImagePullBackOff status means that Kubernetes tried to pull the container image specified in your pod manifest, failed, and is now backing off (delaying) further attempts. The initial failure state is ErrImagePull.
Root Causes of ImagePullBackOff
- Typo in the Image Name or Tag: The most common cause. If you specify nginx:latestt instead of nginx:latest, the container runtime cannot find the manifest.
- Private Registry Authentication: If you are using a private registry (like Azure Container Registry or AWS ECR) and haven't provided the correct credentials via an ImagePullSecret, the registry will reject the pull request with an Unauthorized error.
- Network Constraints: The worker node might not have outbound internet access or DNS resolution to reach the container registry.
- Rate Limiting: Docker Hub and other public registries impose rate limits. If your cluster shares a single NAT gateway IP, you might be hitting the toomanyrequests error.
Step 1: Diagnose the Pull Failure
To find out exactly why the image pull is failing, describe the pod:
kubectl describe pod <pod-name> -n <namespace>
Scroll to the Events section at the bottom. You will likely see something like:
Failed to pull image "myregistry.azurecr.io/my-app:v1": rpc error: code = Unknown desc = Error response from daemon: Get "https://myregistry.azurecr.io/v2/my-app/manifests/v1": unauthorized: authentication required
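The wording of the event message points directly at the fix. As a rough triage aid, the hypothetical helper below (not part of kubectl) pattern-matches the message text against the known failure substrings; the MSG value here is the example message from above.

```shell
# Hypothetical triage helper: map a pull-failure event message
# to its most likely fix by matching known substrings.
MSG='Failed to pull image "myregistry.azurecr.io/my-app:v1": unauthorized: authentication required'
case "$MSG" in
  *unauthorized*|*"access denied"*)
    echo "Fix: create an imagePullSecret and reference it in the pod spec" ;;
  *"manifest unknown"*|*"not found"*)
    echo "Fix: correct the image name or tag" ;;
  *toomanyrequests*)
    echo "Fix: registry rate limit hit; authenticate or mirror the image" ;;
  *)
    echo "Check node network and DNS access to the registry" ;;
esac
```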
Step 2: Fix ImagePullBackOff
- For Typos: Correct the deployment manifest and run kubectl apply -f deployment.yaml.
- For Private Registries: Create a secret containing your Docker credentials:
kubectl create secret docker-registry my-registry-secret \
--docker-server=myregistry.azurecr.io \
--docker-username=<your-username> \
--docker-password=<your-password> \
--docker-email=<your-email>
Then, add imagePullSecrets to your Pod spec:
spec:
  containers:
    - name: my-app
      image: myregistry.azurecr.io/my-app:v1
  imagePullSecrets:
    - name: my-registry-secret
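As an alternative to listing imagePullSecrets on every pod, you can attach the secret to the namespace's ServiceAccount, so every pod using that account pulls with those credentials automatically. A sketch, assuming the default ServiceAccount and the secret name created above:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: default
  namespace: my-namespace   # assumption: substitute your namespace
imagePullSecrets:
  - name: my-registry-secret
```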
Understanding Pod Eviction in Kubernetes
An Evicted pod status means the kubelet on a worker node terminated the pod. This is not a crash; it's a deliberate action taken by the node to preserve its own stability.
The Kubernetes Eviction Policy
Kubernetes monitors node resources heavily. If a node starts running out of critical, incompressible resources (like memory or disk space), it triggers the eviction policy.
Common eviction triggers include:
- MemoryPressure: The node is running out of RAM.
- DiskPressure / Ephemeral Storage: The node's root filesystem or the container runtime's image filesystem is full. Pods writing large amounts of data to emptyDir volumes or their local container filesystem without requesting ephemeral-storage limits are prime culprits.
- PIDPressure: Too many processes are running on the node.
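These pressure signals correspond to configurable kubelet eviction thresholds. A sketch of the relevant KubeletConfiguration fields, with illustrative values close to the upstream defaults (verify against your distribution's actual settings):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
evictionHard:
  memory.available: "100Mi"   # evict when free RAM drops below this
  nodefs.available: "10%"     # node root filesystem
  imagefs.available: "15%"    # container runtime image filesystem
  pid.available: "10%"        # free process IDs
```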
When you see a pod stuck in the Evicted state, it leaves behind a tombstone record. The pod itself is dead, but the API object remains so you can inspect the eviction logs and status.
Step 1: Diagnose Pod Eviction
Describe the evicted pod to see the exact reason:
kubectl describe pod <evicted-pod-name>
Look at the Status and Message fields. You'll often see something like:
Message: The node was low on resource: ephemeral-storage. Container my-app was using 50Gi, which exceeds its request of 0.
Step 2: Prevent Eviction
To prevent pods from getting evicted:
- Set Resource Requests and Limits: Always define requests and limits for CPU, memory, and importantly, ephemeral-storage.
- Optimize Logging: If an application logs excessively to stdout/stderr, those logs consume local disk space until log rotation occurs. Use log forwarding to offload them.
- Use Persistent Volumes: Don't use emptyDir for large datasets. Attach a PersistentVolumeClaim (PVC).
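Putting the first recommendation into practice, a container spec with explicit requests and limits, including ephemeral-storage, might look like this (values are illustrative; tune them to your workload):

```yaml
spec:
  containers:
    - name: my-app
      image: myregistry.azurecr.io/my-app:v1
      resources:
        requests:
          cpu: "250m"
          memory: "256Mi"
          ephemeral-storage: "1Gi"
        limits:
          memory: "512Mi"
          ephemeral-storage: "2Gi"
```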
The Role of cluster-autoscaler.kubernetes.io/safe-to-evict
Sometimes you want pods to be evicted, specifically when the Cluster Autoscaler is trying to scale down an underutilized node. By default, the autoscaler will not evict certain pods, such as those using local storage (emptyDir). This prevents the node from scaling down.
If your pod uses emptyDir strictly for temporary, non-critical cache and you want the autoscaler to feel free to terminate it to save cloud costs, add the following annotation to your pod spec:
metadata:
annotations:
"cluster-autoscaler.kubernetes.io/safe-to-evict": "true"
Conversely, if you have a critical pod that should never be randomly evicted during scale-down, you can set this to "false".
Cleaning Up Evicted Pods
Kubernetes does not automatically delete evicted pods immediately because it assumes you want to read their failure messages. Over time, these can clutter your dashboard and CLI output.
You can manually clean up evicted pods using a field selector. See the code block below for the exact command.
Quick Reference Commands
# 1. Diagnose ImagePullBackOff by checking pod events
kubectl describe pod <pod-name> -n <namespace>
# 2. Create an ImagePullSecret for a private registry
kubectl create secret docker-registry my-registry-key \
--docker-server=your-registry.com \
--docker-username=your-user \
--docker-password=your-pwd \
--docker-email=your-email@example.com
# 3. Find all Evicted pods across all namespaces
kubectl get pods --all-namespaces | grep Evicted
# 4. Clean up (delete) all Evicted pods in the current namespace
kubectl delete pods --field-selector status.phase=Failed
# 5. Clean up all Evicted pods in ALL namespaces
kubectl delete pods --all-namespaces --field-selector status.phase=Failed

Error Medic Editorial
Error Medic Editorial is a team of certified Kubernetes administrators and DevOps engineers dedicated to simplifying cloud-native troubleshooting and site reliability.
Sources
- https://kubernetes.io/docs/concepts/containers/images/#imagepullbackoff
- https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/
- https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
- https://stackoverflow.com/questions/32723111/how-to-remove-old-and-evicted-pods-in-kubernetes