Troubleshooting AWS ECS Timeout Errors: ResourceInitializationError & 504 Gateways
Comprehensive guide to fixing AWS ECS timeout errors, including ResourceInitializationError, ALB 504 Gateway Timeouts, and failing health checks.
- Tasks stuck in PENDING usually indicate a missing NAT Gateway or VPC Endpoint preventing ECR image pulls.
- ALB health check timeouts often happen when application startup exceeds the ECS service's healthCheckGracePeriodSeconds.
- HTTP 504 Gateway Timeouts indicate the container took longer to respond than the ALB's configured idle timeout.
- Missing IAM permissions (Task Execution Role) for Secrets Manager or ECR will cause container initialization timeouts.
| Error Signature | Primary Root Cause | Diagnostic Tool | Recommended Fix |
|---|---|---|---|
| ResourceInitializationError | Missing VPC route to ECR/Secrets | VPC Reachability Analyzer | Add NAT Gateway or VPC Endpoints |
| Task failed ELB health checks | App boot exceeds grace period | CloudWatch Metrics / ECS Events | Increase healthCheckGracePeriodSeconds |
| HTTP 504 Gateway Timeout | Slow backend processing | ALB Access Logs | Increase ALB idle timeout or optimize DB queries |
| CannotPullContainerError | IAM or Network block | CloudTrail | Fix Task Execution Role permissions |
Understanding AWS ECS Timeout Errors
When working with Amazon Elastic Container Service (ECS), particularly on AWS Fargate, "timeout" is a symptom rather than a singular root cause. Because ECS is a deeply integrated service, a timeout can stem from networking constraints (VPC, Subnets, Security Groups), IAM permission boundaries, Load Balancer configurations, or the application layer itself.
As a DevOps engineer or SRE, your first step is categorizing the timeout. Did the task fail to start? Did it start but fail health checks? Or is it running fine, but clients are experiencing timeout errors?
Below, we break down the most common ECS timeout scenarios, the exact error messages you will encounter, and step-by-step resolution paths.
Scenario 1: The Provisioning Timeout (ResourceInitializationError)
The Symptom:
Your ECS task transitions to the PENDING state and stays there for several minutes before finally transitioning to STOPPED.
Exact Error Messages:
- ResourceInitializationError: unable to pull secrets or registry auth: execution resource missing
- CannotPullContainerError: inspect image has been retried 1 time(s): failed to resolve reference
- CannotPullContainerError: pull image manifest has been retried 1 time(s): error during connect: Get https://api.ecr... timeout
The Root Cause: The ECS agent (running on the underlying EC2 instance or Fargate fleet) must communicate with several AWS APIs to start your container: ECR (to pull the image), Secrets Manager/SSM (to pull environment variables), and CloudWatch (to configure logging). If it cannot reach these endpoints over the network, it will eventually time out and kill the task.
How to Fix It:
1. Check Subnet Routing (The #1 Culprit)
If your task is deployed in a Private Subnet, it has no direct route to the internet. To pull an image from ECR, it must either route through a NAT Gateway or use AWS PrivateLink (VPC Endpoints).
- Fix A (NAT Gateway): Ensure your private subnet's route table has a route for 0.0.0.0/0 pointing to a NAT Gateway located in a Public Subnet.
- Fix B (VPC Endpoints): If you are running in a disconnected VPC (no internet access), you MUST provision the following interface VPC Endpoints:
  - com.amazonaws.<region>.ecr.api
  - com.amazonaws.<region>.ecr.dkr
  - com.amazonaws.<region>.logs (for CloudWatch)
  - com.amazonaws.<region>.secretsmanager (if using Secrets)
- Crucial: You must also create a Gateway VPC Endpoint for S3 (com.amazonaws.<region>.s3), because ECR actually stores the image layers in S3.
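The endpoint provisioning can be sketched with the AWS CLI. This is a dry-run sketch: the commands are printed rather than executed, and every resource ID (`VPC_ID`, `SUBNETS`, `SG_ID`, `ROUTE_TABLE`) is a placeholder you must replace with your own.

```shell
# Dry-run sketch: prints the create-vpc-endpoint commands for a fully
# private VPC. Drop the leading `echo` in AWS to actually execute them.
AWS="echo aws"
REGION="us-east-1"
VPC_ID="vpc-0123456789abcdef0"            # placeholder
SUBNETS="subnet-0aaa1111 subnet-0bbb2222" # placeholder private subnets
SG_ID="sg-0ccc3333"                       # placeholder (must allow 443 in)
ROUTE_TABLE="rtb-0ddd4444"                # placeholder private route table

# Interface endpoints for ECR auth, ECR image pulls, CloudWatch Logs,
# and Secrets Manager
for svc in ecr.api ecr.dkr logs secretsmanager; do
  $AWS ec2 create-vpc-endpoint \
    --vpc-id "$VPC_ID" \
    --vpc-endpoint-type Interface \
    --service-name "com.amazonaws.${REGION}.${svc}" \
    --subnet-ids $SUBNETS \
    --security-group-ids "$SG_ID"
done

# Gateway endpoint for S3: ECR stores image layers in S3, so this one
# is mandatory even if you never touch S3 directly
$AWS ec2 create-vpc-endpoint \
  --vpc-id "$VPC_ID" \
  --vpc-endpoint-type Gateway \
  --service-name "com.amazonaws.${REGION}.s3" \
  --route-table-ids "$ROUTE_TABLE"
```

The `AWS="echo aws"` indirection keeps the sketch safe to run anywhere; set `AWS=aws` once the IDs are real.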
2. Check Security Groups
Ensure the Security Group attached to your ECS task allows outbound traffic (Egress) to 0.0.0.0/0 on port 443 (HTTPS). The ECS agent requires HTTPS to communicate with ECR and CloudWatch.
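You can inspect the egress rules from the CLI. A read-only, dry-run sketch (the security group ID is a placeholder for the one attached to your task's ENI):

```shell
# Dry-run sketch: prints the describe command; drop the leading `echo` to run it.
AWS="echo aws"
SG_ID="sg-0123456789abcdef0"  # placeholder: the SG on your task's ENI

# An egress rule with IpProtocol "-1" (all traffic), or tcp on port 443,
# destined for 0.0.0.0/0 is what the ECS agent needs to reach ECR and
# CloudWatch over HTTPS.
$AWS ec2 describe-security-groups \
  --group-ids "$SG_ID" \
  --query 'SecurityGroups[0].IpPermissionsEgress' \
  --output json
```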
3. Check Public IP Assignment
If your task is in a Public Subnet (using an Internet Gateway), Fargate tasks must have the Assign public IP setting set to ENABLED. Without a public IP, the task cannot route out through the Internet Gateway, resulting in a timeout.
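The current setting can be read back from the service definition. A dry-run sketch (cluster and service names are placeholders):

```shell
# Dry-run sketch: prints the describe command; drop the leading `echo` to run it.
AWS="echo aws"
CLUSTER="my-production-cluster"  # placeholder
SERVICE="my-web-service"         # placeholder

# Should print ENABLED for Fargate tasks in a public subnet; DISABLED
# here, combined with no NAT Gateway, means image pulls will time out.
$AWS ecs describe-services \
  --cluster "$CLUSTER" \
  --services "$SERVICE" \
  --query 'services[0].networkConfiguration.awsvpcConfiguration.assignPublicIp' \
  --output text
```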
Scenario 2: The Health Check Timeout (ALB Target Group Failure)
The Symptom:
The ECS task successfully pulls the image and enters the RUNNING state. However, 30 to 60 seconds later, it is terminated, and ECS attempts to start a new one. This loop continues indefinitely.
Exact Error Message:
Task failed ELB health checks in (target-group arn)
The Root Cause:
Your application framework (e.g., Spring Boot, Django, heavy Node.js apps) takes a significant amount of time to initialize, connect to the database, run migrations, and bind to the port. Meanwhile, the Application Load Balancer (ALB) begins sending health check pings immediately. If the container doesn't respond with a 200 OK within the configured time, the ALB marks the target as UNHEALTHY and instructs ECS to kill and replace the task.
How to Fix It:
1. Increase the Health Check Grace Period
In your ECS Service definition, locate the healthCheckGracePeriodSeconds parameter. This tells ECS to ignore failing load balancer health checks for a specified duration after the task enters the RUNNING state.
- Action: Increase this to 120 or 300 seconds, depending on your application's bootstrap time.
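The grace period can be changed without redeploying the task definition. A dry-run sketch (cluster and service names are placeholders; pick a value that covers your worst-case bootstrap time):

```shell
# Dry-run sketch: prints the update command; drop the leading `echo` to apply.
AWS="echo aws"
CLUSTER="my-production-cluster"  # placeholder
SERVICE="my-web-service"         # placeholder
GRACE_SECONDS=180                # cover your slowest observed startup

$AWS ecs update-service \
  --cluster "$CLUSTER" \
  --service "$SERVICE" \
  --health-check-grace-period-seconds "$GRACE_SECONDS"
```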
2. Tune the ALB Health Check Interval and Threshold
Go to the EC2 Console -> Target Groups -> Health Checks.
- Ensure the Timeout is reasonable (e.g., 5 seconds).
- Ensure the Interval gives the app time to breathe (e.g., 30 seconds).
- Check the Healthy threshold (e.g., 2) and Unhealthy threshold (e.g., 3).
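The same settings can be applied from the CLI. A dry-run sketch using the example values above (the target group ARN is a placeholder; note the interval must exceed the timeout):

```shell
# Dry-run sketch: prints the modify command; drop the leading `echo` to apply.
AWS="echo aws"
TG_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef"  # placeholder

$AWS elbv2 modify-target-group \
  --target-group-arn "$TG_ARN" \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 5 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3
```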
3. Verify Port Binding
A common "silent timeout" occurs when your application binds to 127.0.0.1 (localhost) instead of 0.0.0.0. The container will start, but the ALB health check packets arriving at the container's ENI will drop, leading to a timeout. Ensure your web server configuration binds to 0.0.0.0.
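A quick way to see the difference: only a wildcard bind is reachable through the task's ENI. The following self-contained check uses a throwaway Python socket as a stand-in for your web server; the framework flags in the comments (e.g., gunicorn's `--bind`) are illustrative, not exhaustive.

```shell
# Bind a throwaway socket the way your web server should: to 0.0.0.0,
# not 127.0.0.1. Framework equivalents include e.g.
#   gunicorn --bind 0.0.0.0:8080 app:app
#   app.listen(8080, "0.0.0.0")   # Node.js
python3 - <<'EOF'
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(("0.0.0.0", 0))      # wildcard: reachable from the ALB via the ENI
print(s.getsockname()[0])   # prints 0.0.0.0

s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s2.bind(("127.0.0.1", 0))   # loopback only: ALB health checks will time out
print(s2.getsockname()[0])  # prints 127.0.0.1

s.close()
s2.close()
EOF
```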
Scenario 3: The Client-Facing Timeout (HTTP 504 Gateway Timeout)
The Symptom:
The ECS tasks are stable and running. Health checks are passing. However, when users or API clients send certain requests to the Load Balancer, they receive an HTTP 504 error after exactly 60 seconds.
Exact Error Message:
HTTP 504 Gateway Timeout
The Root Cause: The ALB successfully forwarded the request to your ECS container, but the container failed to return an HTTP response before the ALB's configured Idle Timeout was reached. The default ALB idle timeout is 60 seconds.
How to Fix It:
1. Increase the ALB Idle Timeout
If your application legitimately requires more than 60 seconds to process a request (e.g., generating a massive PDF report, complex data processing), you must increase the ALB's idle timeout.
- Action: Go to EC2 Console -> Load Balancers -> Select your ALB -> Attributes -> Edit -> Set Idle timeout to your desired value (up to 4000 seconds).
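The same change can be made from the CLI via the `idle_timeout.timeout_seconds` load balancer attribute. A dry-run sketch (the ARN is a placeholder; 300 seconds is an example value for a slow report endpoint):

```shell
# Dry-run sketch: prints the attribute change; drop the leading `echo` to apply.
AWS="echo aws"
ALB_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/0123456789abcdef"  # placeholder

$AWS elbv2 modify-load-balancer-attributes \
  --load-balancer-arn "$ALB_ARN" \
  --attributes Key=idle_timeout.timeout_seconds,Value=300
```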
2. Investigate Application Bottlenecks
If the request shouldn't take 60 seconds, an infrastructure change won't fix the root problem. You need to look at application metrics:
- Are database connection pools exhausted, causing requests to queue?
- Are downstream third-party APIs timing out, cascading the delay to your container?
- Implement distributed tracing (e.g., AWS X-Ray or OpenTelemetry) to pinpoint exactly where the time is being spent inside the ECS task.
Scenario 4: The Container Stop Timeout
The Symptom:
When deploying a new version of your service, the old tasks hang in the DEPROVISIONING state for a long time before finally terminating.
Exact Error Message:
- ECS Event Log:
Stopped reason: Stop timeout
The Root Cause:
When ECS decides to stop a task (due to a deployment, scaling in, or failing health checks), it sends a SIGTERM signal to the container. The application is supposed to catch this signal, finish in-flight requests gracefully, and exit. If the application ignores SIGTERM, ECS waits for a specific duration (default 30 seconds) before sending a hard SIGKILL.
How to Fix It:
Ensure your application framework handles SIGTERM gracefully. If your application legitimately needs more time to drain long-running WebSocket connections or background jobs, you can configure the stopTimeout parameter in your ECS Task Definition (container definitions section) to extend the wait time up to 120 seconds.
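The SIGTERM handling above can be sketched as a minimal bash entrypoint. In this demo the background `sleep` stands in for your real server process, and the self-signal at the end exists only to exercise the graceful path:

```shell
# Minimal entrypoint sketch: trap SIGTERM, drain, and stop the app before
# ECS escalates to SIGKILL after stopTimeout expires.
GRACEFUL=0
shutdown() {
  echo "SIGTERM received, draining in-flight work..."
  kill "$APP_PID" 2>/dev/null   # forward the signal to the app process
  GRACEFUL=1                    # a real entrypoint would `exit 0` here
}
trap shutdown TERM

sleep 30 &                      # placeholder for the real server process
APP_PID=$!

( sleep 1; kill -TERM $$ ) &    # demo only: send ourselves SIGTERM after 1s

wait "$APP_PID"                 # returns early once the trapped signal fires
echo "shutdown complete, graceful=$GRACEFUL"
```

If your runtime wraps the app in a shell (`CMD ["sh", "-c", "..."]`), signals may never reach the app at all; prefer exec-form entrypoints or an explicit trap like the one above.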
Bonus: A Quick Diagnostic Script
```bash
# A bash script to quickly diagnose ECS task timeout reasons
# Requirements: AWS CLI configured with appropriate permissions
CLUSTER_NAME="my-production-cluster"

# 1. Find the most recently stopped task
TASK_ARN=$(aws ecs list-tasks \
  --cluster "$CLUSTER_NAME" \
  --desired-status STOPPED \
  --max-results 1 \
  --query 'taskArns[0]' \
  --output text)

if [ "$TASK_ARN" == "None" ]; then
  echo "No stopped tasks found in cluster $CLUSTER_NAME."
  exit 0
fi

echo "Analyzing stopped task: $TASK_ARN"

# 2. Extract the exact stop reason and container exit codes
aws ecs describe-tasks \
  --cluster "$CLUSTER_NAME" \
  --tasks "$TASK_ARN" \
  --query 'tasks[0].{StopReason: stoppedReason, ContainerReason: containers[0].reason, ExitCode: containers[0].exitCode}' \
  --output table

# 3. Check for specific timeout keywords
STOP_REASON=$(aws ecs describe-tasks --cluster "$CLUSTER_NAME" --tasks "$TASK_ARN" --query 'tasks[0].stoppedReason' --output text)

if [[ "$STOP_REASON" == *"ELB health checks"* ]]; then
  echo "[DIAGNOSIS]: Task failed ALB health checks. Consider increasing 'healthCheckGracePeriodSeconds' in your service."
elif [[ "$STOP_REASON" == *"ResourceInitializationError"* ]]; then
  echo "[DIAGNOSIS]: Networking/IAM failure. Check NAT Gateway, VPC Endpoints, or Task Execution Role permissions."
fi
```

Error Medic Editorial
Error Medic Editorial is a collective of senior Site Reliability Engineers and DevOps practitioners dedicated to solving complex cloud infrastructure issues and maintaining high-availability architectures.
Sources
- https://aws.amazon.com/premiumsupport/knowledge-center/ecs-fargate-tasks-pending-state/
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/troubleshoot-task-health-checks.html
- https://docs.aws.amazon.com/elasticloadbalancing/latest/application/application-load-balancers.html#connection-idle-timeout