Error Medic

Resolving 'AWS EKS Node Not Ready' Status: A Comprehensive Troubleshooting Guide

Fix AWS EKS Node Not Ready errors. Learn to diagnose kubelet failures, IAM aws-auth issues, VPC CNI IP exhaustion, and security group misconfigurations.

Key Takeaways
  • Check 'kubectl describe node' first to identify NetworkUnavailable, MemoryPressure, or PIDPressure conditions.
  • AWS VPC CNI (aws-node) failures due to subnet IP exhaustion are the most common cause of NotReady states in EKS.
  • Ensure the EC2 instance IAM role is correctly mapped in the kube-system 'aws-auth' ConfigMap.
  • Verify Security Groups allow bi-directional TCP 443 between the EKS Control Plane and worker nodes.
  • Review kubelet logs via journalctl to identify container runtime (containerd) or kubelet bootstrapping failures.
Diagnostic Approaches Compared
Method | When to Use | Time | Risk
kubectl describe node | Initial triage to check node conditions and recent Kubelet events | 1 min | None
Check Kubelet Logs (journalctl) | Node is reachable but Kubelet fails to register with the API server | 5 mins | Low
Verify aws-auth ConfigMap | Node EC2 instance is running but never joins the cluster at all | 3 mins | Low
Restart VPC CNI / aws-node | CNI config uninitialized or Pods stuck in ContainerCreating | 2 mins | Medium
Replace EKS Node Group | Unrecoverable state, corrupted AMI, or configuration drift | 15 mins | High

Understanding the Error

When operating Amazon Elastic Kubernetes Service (EKS), one of the most stressful alerts an SRE or DevOps engineer can receive is an aws eks node not ready state. In Kubernetes, the control plane continuously monitors the health of worker nodes. If the kubelet daemon running on a node stops reporting its status, or reports a degraded state, the API server marks the node's Ready condition as False or Unknown.

When a node transitions to NotReady, the Kubernetes scheduler stops placing new Pods on it. After a default eviction timeout (usually 5 minutes), the control plane will begin evicting existing Pods to reschedule them elsewhere. This can lead to cascading failures if cluster capacity is suddenly reduced.
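Before drilling into a single node, a quick triage pass shows how widespread the problem is (the node name is a placeholder; these commands assume working kubectl access to the cluster):

```shell
# List all nodes and their Ready status at a glance.
kubectl get nodes

# Show only nodes whose STATUS column is not "Ready".
kubectl get nodes --no-headers | awk '$2 != "Ready"'

# The control plane taints a failing node (e.g. node.kubernetes.io/unreachable);
# pods without a matching toleration are evicted after the 300s grace period.
kubectl get node <node-name> -o jsonpath='{.spec.taints}'
```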

Step 1: Diagnose the Node Conditions

Before logging into the AWS console, start with the Kubernetes API. Run the following command to get a detailed view of the failing node:

kubectl describe node <node-name>

Scroll down to the Conditions section. You are looking for several key indicators:

  • Ready: Will be False or Unknown.
  • NetworkUnavailable: If True, the node's network routes are not configured correctly (often a CNI issue).
  • MemoryPressure / DiskPressure / PIDPressure: If any of these are True, the node has exhausted its physical resources, causing the kubelet to fail defensively or the OS to invoke the OOM killer.

Look at the Events at the bottom of the output. Common error messages include:

  • PLEG is not healthy: pleg was last seen active 3m0s ago
  • network plugin is not ready: cni config uninitialized
  • NodeStatusUnknown: Kubelet stopped posting node status
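The same conditions and events can be pulled in a compact, script-friendly form (node name is a placeholder):

```shell
# Print each condition's type, status, and reason as a tab-separated table.
kubectl get node <node-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'

# Warning events for this node only -- these usually explain the transition.
kubectl get events \
  --field-selector type=Warning,involvedObject.kind=Node,involvedObject.name=<node-name>
```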

Step 2: AWS VPC CNI and IP Exhaustion

In EKS, networking is handled by the Amazon VPC CNI plugin. A highly common reason for a node being NotReady is that the CNI plugin failed to initialize because the underlying AWS subnet has run out of available IP addresses.

The VPC CNI assigns a secondary IP address from the VPC to every Pod. If your subnet is exhausted, the aws-node daemonset pod on the worker node will crash or hang.

How to check:

  1. Check the aws-node pods in the kube-system namespace: kubectl get pods -n kube-system -l k8s-app=aws-node -o wide
  2. Look for pods in a CrashLoopBackOff or Error state on the affected node.
  3. Check your AWS VPC Subnet available IPv4 addresses via the AWS Console or CLI: aws ec2 describe-subnets --subnet-ids <your-subnet-id> --query 'Subnets[*].AvailableIpAddressCount'

The Fix: Expand your subnet CIDR, move nodes to a different subnet, or enable prefix delegation (VPC CNI feature) to drastically increase the number of available IPs per EC2 Elastic Network Interface (ENI).
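To see why prefix delegation helps, the standard EKS max-pods arithmetic can be sketched in plain shell. The instance figures below are for t3.medium; verify your own instance type with the AWS max-pods calculator before relying on these numbers:

```shell
# Without prefix delegation, pod capacity is bounded by ENI IP slots:
#   max_pods = ENIs * (IPv4 addresses per ENI - 1) + 2
# t3.medium supports 3 ENIs with 6 IPv4 addresses each.
ENIS=3
IPS_PER_ENI=6

MAX_PODS=$(( ENIS * (IPS_PER_ENI - 1) + 2 ))
echo "max pods without prefix delegation: $MAX_PODS"

# With prefix delegation, each secondary slot holds a /28 prefix (16 IPs),
# so the addressable-pod ceiling rises dramatically (Kubernetes-side limits,
# such as the commonly recommended 110 pods per node, still apply).
PREFIX_CAPACITY=$(( ENIS * (IPS_PER_ENI - 1) * 16 + 2 ))
echo "addressable pods with prefix delegation: $PREFIX_CAPACITY"

# Prefix delegation is toggled on the aws-node daemonset:
# kubectl set env daemonset aws-node -n kube-system ENABLE_PREFIX_DELEGATION=true
```

For t3.medium this works out to 17 pods without prefix delegation, which is why small instance types hit scheduling limits long before CPU or memory run out.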

Step 3: Kubelet and Container Runtime Failures

If the network is fine, the kubelet or the container runtime (containerd or dockerd on older AMIs) might be failing. To investigate this, you must connect to the EC2 instance via AWS Systems Manager (SSM) Session Manager or SSH.

Once connected, check the kubelet logs:

journalctl -u kubelet -f

Common Kubelet Errors:

  1. Unauthorized / Forbidden: error: failed to run Kubelet: cannot create certificate signing request: Unauthorized
     This indicates an IAM issue. The EC2 instance's IAM role must be present in the aws-auth ConfigMap. Check the ConfigMap:
     kubectl get configmap aws-auth -n kube-system -o yaml
     Ensure the rolearn precisely matches the IAM role attached to the EC2 instance (do not include the instance profile path).

  2. Cgroup Driver Mismatch: misconfiguration: kubelet cgroup driver: "systemd" is different from docker cgroup driver: "cgroupfs"
     If you are using custom AMIs, ensure your container runtime and kubelet are both configured to use systemd as the cgroup driver. EKS optimized AMIs default to systemd.

  3. PLEG Issues: If you see Pod Lifecycle Event Generator (PLEG) errors, the container runtime is likely deadlocked. Restart the runtime and kubelet:
     sudo systemctl restart containerd
     sudo systemctl restart kubelet
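For the aws-auth case above, eksctl can add the role mapping without hand-editing YAML. The cluster name, account ID, and role name below are illustrative placeholders:

```shell
# Use the role attached to the instance, NOT the instance profile ARN,
# and drop any path segment from the ARN.
ROLE_ARN="arn:aws:iam::111122223333:role/eks-node-role"

eksctl create iamidentitymapping \
  --cluster my-cluster \
  --arn "$ROLE_ARN" \
  --username "system:node:{{EC2PrivateDNSName}}" \
  --group system:bootstrappers \
  --group system:nodes
```

Editing the ConfigMap directly with kubectl edit works too, but eksctl avoids YAML indentation mistakes that can lock every node out of the cluster at once.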

Step 4: Security Groups and Network ACLs

For an EKS node to register as Ready, it must be able to communicate with the EKS Control Plane. EKS creates cross-account elastic network interfaces in your VPC to facilitate this.

Ensure your security groups allow:

  • Control Plane to Nodes: TCP port 443 (for webhook validations and executing commands like kubectl exec) and TCP port 10250 (for kubelet API).
  • Nodes to Control Plane: TCP port 443 to the EKS cluster endpoint.

If a misconfigured Terraform or CloudFormation deployment accidentally removed the ingress rules allowing the worker node security group to reach the cluster security group, the kubelet will silently fail to register, throwing connection timeouts in the journalctl logs.
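A quick way to audit the relevant rules from the CLI (cluster name is a placeholder):

```shell
# Fetch the cluster security group that EKS attaches to its cross-account ENIs.
CLUSTER_SG=$(aws eks describe-cluster --name my-cluster \
  --query 'cluster.resourcesVpcConfig.clusterSecurityGroupId' --output text)

# Confirm it permits 443 and 10250 between the control plane and nodes.
aws ec2 describe-security-groups --group-ids "$CLUSTER_SG" \
  --query 'SecurityGroups[0].IpPermissions[].[IpProtocol,FromPort,ToPort]'

# From the worker node itself, a raw TCP reachability check against the
# cluster endpoint host (no extra tooling required):
# timeout 5 bash -c '</dev/tcp/<endpoint-host>/443' && echo "443 reachable"
```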

Step 5: User Data and Bootstrap Scripts

If the node never joins the cluster after creation, check the EC2 User Data logs. When an EKS node boots, it runs a script (/etc/eks/bootstrap.sh) to configure the kubelet with the cluster's CA certificate and API endpoint.

Check the cloud-init logs on the instance:

cat /var/log/cloud-init-output.log

Look for errors related to downloading the EKS CA cert, connecting to the EKS endpoint, or syntax errors in any custom user data scripts you provided. If the bootstrap script fails, the kubelet is never started.
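On an Amazon Linux EKS-optimized AMI, the bootstrap can also be inspected and re-run by hand. The cluster name and endpoint below are illustrative placeholders:

```shell
# See whether the bootstrap script ran and what it logged.
grep -i bootstrap /var/log/cloud-init-output.log

# Re-run the bootstrap manually to surface errors interactively:
# sudo /etc/eks/bootstrap.sh my-cluster \
#   --apiserver-endpoint https://EXAMPLE.gr7.us-east-1.eks.amazonaws.com \
#   --b64-cluster-ca "$(aws eks describe-cluster --name my-cluster \
#       --query 'cluster.certificateAuthority.data' --output text)"
```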

Conclusion

Resolving an EKS node NotReady issue requires a systematic approach. Always start by reading the node conditions via kubectl. Isolate whether it is a resource exhaustion issue, a CNI/networking failure, or an IAM/authentication blockage. By systematically verifying subnet IPs, security groups, the aws-auth ConfigMap, and kubelet logs, you can quickly identify the root cause and restore cluster capacity.

Quick Diagnostic Script

# EKS Node Diagnostic Script
# Run this script to gather essential triage information for a NotReady node.

NODE_NAME="ip-10-0-1-123.ec2.internal"
NAMESPACE="kube-system"

echo "=== 1. Checking Node Conditions ==="
kubectl describe node $NODE_NAME | grep -A 10 "Conditions:"

echo -e "\n=== 2. Checking AWS VPC CNI Status ==="
kubectl get pods -n $NAMESPACE -l k8s-app=aws-node --field-selector spec.nodeName=$NODE_NAME -o wide

echo -e "\n=== 3. Checking for Recent Kubelet Events ==="
kubectl get events --field-selector involvedObject.name=$NODE_NAME --sort-by='.metadata.creationTimestamp'

echo -e "\n=== 4. Validating aws-auth ConfigMap ==="
kubectl get configmap aws-auth -n $NAMESPACE -o yaml | grep -A 10 "mapRoles"

# Note: To fetch kubelet logs from the node via AWS SSM:
# aws ssm start-session --target <instance-id> --document-name AWS-StartInteractiveCommand --parameters command="journalctl -u kubelet -n 100 --no-pager"

Error Medic Editorial

Error Medic Editorial is composed of Senior DevOps Engineers and SREs dedicated to providing actionable, real-world solutions for modern cloud infrastructure challenges.
