Error Medic

Resolving 'AWS EKS Node Not Ready' Status: A Comprehensive Troubleshooting Guide

Fix AWS EKS Node Not Ready errors. Learn to diagnose kubelet failures, IAM aws-auth issues, VPC CNI IP exhaustion, and security group misconfigurations.

Key Takeaways
  • Check 'kubectl describe node' first to identify NetworkUnavailable, MemoryPressure, or PIDPressure conditions.
  • AWS VPC CNI (aws-node) failures due to subnet IP exhaustion are the most common cause of NotReady states in EKS.
  • Ensure the EC2 instance IAM role is correctly mapped in the kube-system 'aws-auth' ConfigMap.
  • Verify Security Groups allow bi-directional TCP 443 between the EKS Control Plane and worker nodes.
  • Review kubelet logs via journalctl to identify container runtime (containerd) or kubelet bootstrapping failures.
Diagnostic Approaches Compared
Method | When to Use | Time | Risk
kubectl describe node | Initial triage to check node conditions and recent Kubelet events | 1 min | None
Check Kubelet Logs (journalctl) | Node is reachable but Kubelet fails to register with the API server | 5 mins | Low
Verify aws-auth ConfigMap | Node EC2 instance is running but never joins the cluster at all | 3 mins | Low
Restart VPC CNI / aws-node | CNI config uninitialized or Pods stuck in ContainerCreating | 2 mins | Medium
Replace EKS Node Group | Unrecoverable state, corrupted AMI, or configuration drift | 15 mins | High

Understanding the Error

When operating Amazon Elastic Kubernetes Service (EKS), one of the most stressful alerts an SRE or DevOps engineer can receive is an aws eks node not ready state. In Kubernetes, the control plane continuously monitors the health of worker nodes. If the kubelet daemon running on a node stops reporting its status, or reports a degraded state, the API server marks the node's Ready condition as False or Unknown.

When a node transitions to NotReady, the Kubernetes scheduler stops placing new Pods on it. After a default eviction timeout (usually 5 minutes), the control plane will begin evicting existing Pods to reschedule them elsewhere. This can lead to cascading failures if cluster capacity is suddenly reduced.
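Before drilling into a single node, a quick triage pass shows how widespread the problem is (the node name is a placeholder; these commands assume working kubectl access to the cluster):

```shell
# List all nodes and their Ready status at a glance.
kubectl get nodes

# Show only nodes whose STATUS column is not "Ready".
kubectl get nodes --no-headers | awk '$2 != "Ready"'

# The control plane taints a failing node (e.g. node.kubernetes.io/unreachable);
# pods without a matching toleration are evicted after the 300s grace period.
kubectl get node <node-name> -o jsonpath='{.spec.taints}'
```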

Step 1: Diagnose the Node Conditions

Before logging into the AWS console, start with the Kubernetes API. Run the following command to get a detailed view of the failing node:

kubectl describe node <node-name>

Scroll down to the Conditions section. You are looking for several key indicators:

  • Ready: Will be False or Unknown.
  • NetworkUnavailable: If True, the node's network routes are not configured correctly (often a CNI issue).
  • MemoryPressure / DiskPressure / PIDPressure: If any of these are True, the node has exhausted its physical resources, causing the kubelet to fail defensively or the OS to invoke the OOM killer.

Look at the Events at the bottom of the output. Common error messages include:

  • PLEG is not healthy: pleg was last seen active 3m0s ago
  • network plugin is not ready: cni config uninitialized
  • NodeStatusUnknown: Kubelet stopped posting node status
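The same conditions and events can be pulled in a compact, script-friendly form (node name is a placeholder):

```shell
# Print each condition's type, status, and reason as a tab-separated table.
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'

# Warning events for this node only -- these usually explain the transition.
kubectl get events \
  --field-selector type=Warning,involvedObject.kind=Node,involvedObject.name=<node-name>
```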

Step 2: AWS VPC CNI and IP Exhaustion

In EKS, networking is handled by the Amazon VPC CNI plugin. A highly common reason for a node being NotReady is that the CNI plugin failed to initialize because the underlying AWS subnet has run out of available IP addresses.

The VPC CNI assigns a secondary IP address from the VPC to every Pod. If your subnet is exhausted, the aws-node daemonset pod on the worker node will crash or hang.

How to check:

  1. Check the aws-node pods in the kube-system namespace: kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
  2. Look for pods in a CrashLoopBackOff or Error state on the affected node.
  3. Check your AWS VPC Subnet available IPv4 addresses via the AWS Console or CLI: aws ec2 describe-subnets --subnet-ids <your-subnet-id> --query 'Subnets[*].AvailableIpAddressCount'

The Fix: Expand your subnet CIDR, move nodes to a different subnet, or enable prefix delegation (VPC CNI feature) to drastically increase the number of available IPs per EC2 Elastic Network Interface (ENI).
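To see why prefix delegation helps, the standard EKS max-pods arithmetic can be sketched in plain shell. The instance figures below are for t3.medium; verify your own instance type with the AWS max-pods calculator before relying on these numbers:

```shell
# Without prefix delegation, pod capacity is bounded by ENI IP slots:
#   max_pods = ENIs * (IPv4 addresses per ENI - 1) + 2
# t3.medium supports 3 ENIs with 6 IPv4 addresses each.
ENIS=3
IPS_PER_ENI=6

MAX_PODS=$(( ENIS * (IPS_PER_ENI - 1) + 2 ))
echo "max pods without prefix delegation: $MAX_PODS"

# With prefix delegation, each secondary slot holds a /28 prefix (16 IPs),
# so the addressable-pod ceiling rises dramatically (Kubernetes-side limits,
# such as the commonly recommended 110 pods per node, still apply).
PREFIX_CAPACITY=$(( ENIS * (IPS_PER_ENI - 1) * 16 + 2 ))
echo "addressable pods with prefix delegation: $PREFIX_CAPACITY"

# Prefix delegation is toggled on the aws-node daemonset:
# kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
```

For t3.medium this works out to 17 pods without prefix delegation, which is why small instance types hit scheduling limits long before CPU or memory run out.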

Step 3: Kubelet and Container Runtime Failures

If the network is fine, the kubelet or the container runtime (containerd or dockerd on older AMIs) might be failing. To investigate this, you must connect to the EC2 instance via AWS Systems Manager (SSM) Session Manager or SSH.

Once connected, check the kubelet logs:

journalctl -u kubelet -f

Common Kubelet Errors:

  1. Unauthorized / Forbidden: error: failed to run Kubelet: cannot create certificate signing request: Unauthorized
     This indicates an IAM issue. The EC2 instance's IAM role must be present in the aws-auth ConfigMap. Check the ConfigMap:
     kubectl get configmap aws-auth -n kube-system -o yaml
     Ensure the rolearn precisely matches the IAM role attached to the EC2 instance (do not include the instance profile path).

  2. Cgroup Driver Mismatch: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"
     If you are using custom AMIs, ensure your container runtime and kubelet are both configured to use systemd as the cgroup driver. EKS optimized AMIs default to systemd.

  3. PLEG Issues: If you see Pod Lifecycle Event Generator (PLEG) errors, the container runtime is likely deadlocked. Restart the runtime and kubelet:
     sudo systemctl restart containerd
     sudo systemctl restart kubelet
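For the aws-auth case above, eksctl can add the role mapping without hand-editing YAML. The cluster name, account ID, and role name below are illustrative placeholders:

```shell
# Use the role attached to the instance, NOT the instance profile ARN,
# and drop any path segment from the ARN.
ROLE_ARN="arn:aws:iam::111122223333:role/eks-node-role"

eksctl create iamidentitymapping \
  --cluster my-cluster \
  --arn "$ROLE_ARN" \
  --username "system:node:{{EC2PrivateDNSName}}" \
  --group system:bootstrappers \
  --group system:nodes
```

Editing the ConfigMap directly with kubectl edit works too, but eksctl avoids YAML indentation mistakes that can lock every node out of the cluster at once.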

Step 4: Security Groups and Network ACLs

For an EKS node to register as Ready, it must be able to communicate with the EKS Control Plane. EKS creates cross-account elastic network interfaces in your VPC to facilitate this.

Ensure your security groups allow:

  • Control Plane to Nodes: TCP port 443 (for webhook validations and executing commands like kubectl exec) and TCP port 10250 (for kubelet API).
  • Nodes to Control Plane: TCP port 443 to the EKS cluster endpoint.

If a misconfigured Terraform or CloudFormation deployment accidentally removed the ingress rules allowing the worker node security group to reach the cluster security group, the kubelet will silently fail to register, throwing connection timeouts in the journalctl logs.
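A quick way to audit the relevant rules from the CLI (cluster name is a placeholder):

```shell
# Fetch the cluster security group that EKS attaches to its cross-account ENIs.
CLUSTER_SG=$(aws eks describe-cluster --name my-cluster \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text)

# Confirm it permits 443 and 10250 between the control plane and nodes.
aws ec2 describe-security-groups --group-ids "$CLUSTER_SG" \
  --query 'SecurityGroups[0].IpPermissions[].[IpProtocol,FromPort,ToPort]'

# From the worker node itself, a raw TCP reachability check against the
# cluster endpoint host (no extra tooling required):
# timeout 5 bash -c '</dev/tcp/<endpoint-host>/443' && echo "443 reachable"
```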

Step 5: User Data and Bootstrap Scripts

If the node never joins the cluster after creation, check the EC2 User Data logs. When an EKS node boots, it runs a script (/etc/eks/bootstrap.sh) to configure the kubelet with the cluster's CA certificate and API endpoint.

Check the cloud-init logs on the instance:

cat /var/log/cloud-init-output.log

Look for errors related to downloading the EKS CA cert, connecting to the EKS endpoint, or syntax errors in any custom user data scripts you provided. If the bootstrap script fails, the kubelet is never started.
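On an Amazon Linux EKS-optimized AMI, the bootstrap can also be inspected and re-run by hand. The cluster name and endpoint below are illustrative placeholders:

```shell
# See whether the bootstrap script ran and what it logged.
grep -i bootstrap /var/log/cloud-init-output.log

# Re-run the bootstrap manually to surface errors interactively:
# sudo /etc/eks/bootstrap.sh my-cluster \
#   --apiserver-endpoint https://EXAMPLE.gr7.us-east-1.eks.amazonaws.com \
#   --b64-cluster-ca "$(aws eks describe-cluster --name my-cluster \
#       --query 'cluster.certificateAuthority.data' --output text)"
```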

Conclusion

Resolving an EKS node NotReady issue requires a systematic approach. Always start by reading the node conditions via kubectl. Isolate whether it is a resource exhaustion issue, a CNI/networking failure, or an IAM/authentication blockage. By systematically verifying subnet IPs, security groups, the aws-auth ConfigMap, and kubelet logs, you can quickly identify the root cause and restore cluster capacity.

Quick Diagnostic Script

# EKS Node Diagnostic Script
# Run this script to gather essential triage information for a NotReady node.

NODE_NAME="ip-10-0-1-123.ec2.internal"
NAMESPACE="kube-system"

echo "=== 1. Checking Node Conditions ==="
kubectl describe node $NODE_NAME | grep -A 10 "Conditions:"

echo -e "\n=== 2. Checking AWS VPC CNI Status ==="
kubectl get pods -n $NAMESPACE -l k8s-app=aws-node --field-selector spec.nodeName=$NODE_NAME -o wide

echo -e "\n=== 3. Checking for Recent Kubelet Events ==="
kubectl get events --field-selector involvedObject.name=$NODE_NAME --sort-by='.metadata.creationTimestamp'

echo -e "\n=== 4. Validating aws-auth ConfigMap ==="
kubectl get configmap aws-auth -n $NAMESPACE -o yaml | grep -A 10 "mapRoles"

# Note: To fetch kubelet logs from the node via AWS SSM:
# aws ssm start-session --target <instance-id> --document-name AWS-StartInteractiveCommand --parameters command="journalctl -u kubelet -n 100 --no-pager"

Error Medic Editorial

Error Medic Editorial is composed of Senior DevOps Engineers and SREs dedicated to providing actionable, real-world solutions for modern cloud infrastructure challenges.
