Error Medic

Troubleshooting AWS ECS 502 Bad Gateway and Timeout Errors

Fix AWS ECS 502 Bad Gateway and timeout errors by diagnosing ALB-to-ECS connection issues, health check failures, and security group misconfigurations.

Key Takeaways
  • 502 Bad Gateway typically means the Application Load Balancer (ALB) cannot communicate with your ECS tasks.
  • Common root cause 1: Security Groups are blocking traffic between the ALB and the ECS container instances or Fargate ENIs.
  • Common root cause 2: The container application is crashing, not listening on 0.0.0.0, or listening on the wrong port.
  • Common root cause 3: ALB idle timeout is shorter than the application's processing time, leading to premature connection drops.
  • Quick fix summary: Verify Target Group health status, ensure the container listens on the mapped port across all interfaces, and check Security Group ingress rules from the ALB.

Fix Approaches Compared

Method | When to Use | Time | Risk
Update Security Groups | Target Group shows targets as 'Unhealthy' with connection timeouts. | 5 mins | Low
Fix Container Port/Host | Target Group shows 'Unhealthy' but SGs are correct; app works locally. | 15 mins | Medium
Adjust ALB Idle Timeout | Intermittent 502s during long-running requests or large uploads. | 5 mins | Low
Increase Task Resources | Tasks are being OOMKilled or thrashing CPU, causing unresponsiveness. | 10 mins | Medium

Understanding the AWS ECS 502 Bad Gateway Error

When deploying applications on Amazon Elastic Container Service (ECS), whether backed by EC2 or AWS Fargate, the architecture typically involves an Application Load Balancer (ALB) routing traffic to your containers. A 502 Bad Gateway error occurs when the ALB attempts to proxy a request to your ECS task but receives an invalid response, or no response at all, from the target container.

Unlike a 503 Service Unavailable, which often means no healthy targets are available, or a 504 Gateway Timeout, which indicates the target took too long to respond, a 502 in the AWS ecosystem usually points to a fundamental communication breakdown between the load balancer and the container.

Common Symptoms and Log Indicators

When this error occurs, you will likely see the following:

  • Users receive an HTTP 502 status code in their browser.
  • The ALB access logs show 502 in the elb_status_code field but often a dash (-) in the target_status_code field, meaning the target never returned a response.
  • Target Groups in the EC2 console show targets transitioning rapidly between Initial, Unhealthy, and Draining states.
  • ECS Service events show tasks repeatedly starting and stopping (task flapping).
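The access log symptom above can be checked with a small filter. This is a minimal sketch that assumes the ALB access logs have already been downloaded from S3 and decompressed to plain text, and that they use the standard space-separated layout in which field 9 is elb_status_code and field 10 is target_status_code. The function name filter_502s and the file name in the usage comment are hypothetical.

```bash
#!/bin/bash
# Print timestamp, target address, ELB status, and target status for every
# request the ALB answered with a 502.
# Assumed ALB access log layout: field 2 = time, field 5 = target:port,
# field 9 = elb_status_code, field 10 = target_status_code.
filter_502s() {
  awk '$9 == "502" { print $2, $5, $9, $10 }' "$@"
}

# Usage (hypothetical file name):
#   filter_502s alb-access.log
```

A dash in the last printed column suggests the target never produced a response, which points toward networking or a dead listener rather than an application-level error.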

Step 1: Diagnose the Target Group Health

The first step in any ECS 502 investigation is checking the ALB Target Group. The Target Group acts as the bridge between the load balancer and the ECS tasks.

  1. Navigate to the EC2 Console -> Target Groups.
  2. Select the Target Group associated with your ECS service.
  3. Click the Targets tab.

Look at the Status and Status details columns.

  • Health checks failed with these codes: [Connection refused]: The ALB reached the container, but nothing is listening on the expected port.
  • Health checks failed with these codes: [Request timed out]: The ALB cannot reach the container at all. This is almost always a Security Group or VPC routing issue.
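The same triage can be done from the command line, because aws elbv2 describe-target-health returns a TargetHealth.Reason code for each target. The sketch below maps a few of those reason codes to a first place to look. The function name explain_target_reason is hypothetical, and the hints are heuristics drawn from the guidance above rather than official AWS wording.

```bash
#!/bin/bash
# Map a TargetHealth.Reason code (as returned by
# `aws elbv2 describe-target-health`) to a suggested first check.
# The hint text is a heuristic, not official AWS text.
explain_target_reason() {
  case "$1" in
    Target.Timeout)
      echo "Health check timed out: check Security Group rules and VPC routing" ;;
    Target.FailedHealthChecks)
      echo "Health check could not complete: check that the app is running and listening on the mapped port" ;;
    Target.ResponseCodeMismatch)
      echo "App responded with an unexpected status: check the health check path and success codes" ;;
    Elb.InitialHealthChecking|Elb.RegistrationInProgress)
      echo "Target is still registering: wait for the first health checks to finish" ;;
    *)
      echo "Unrecognized reason: inspect the Description field for details" ;;
  esac
}

# Usage:
#   explain_target_reason Target.Timeout
```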

Step 2: Fix Security Group and VPC Misconfigurations

If the health checks are timing out, the ALB is blocked at the network level from reaching the ECS task.

For Fargate Tasks: Fargate tasks each get their own Elastic Network Interface (ENI). The Security Group attached to the ECS Service must allow inbound traffic on the container port from the ALB's Security Group.

  • Source: ALB Security Group ID (e.g., sg-0abcd1234)
  • Port Range: The specific port your container listens on (e.g., 8080)
  • Protocol: TCP

For EC2-backed ECS Tasks: If using bridge networking with dynamic port mapping, each container is assigned an ephemeral host port (typically in the range 32768-65535). The EC2 instance Security Group must allow inbound traffic from the ALB's Security Group on the entire ephemeral port range.
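Both rules described above (a single container port for Fargate, the ephemeral range for EC2 bridge mode) can be expressed with aws ec2 authorize-security-group-ingress. The sketch below is one way to do it, assuming the rule is granted by Security Group ID. The function name allow_alb_ingress and all IDs are placeholders, and DRY_RUN defaults to 1 so the command is only printed until you opt in.

```bash
#!/bin/bash
# Allow the ALB's Security Group to reach the task or instance Security Group
# on a TCP port range. Set DRY_RUN=0 to run the command instead of printing it.
# All Security Group IDs used with this function are placeholders.
allow_alb_ingress() {
  local target_sg="$1" alb_sg="$2" from_port="$3" to_port="$4"
  local perms="IpProtocol=tcp,FromPort=${from_port},ToPort=${to_port},UserIdGroupPairs=[{GroupId=${alb_sg}}]"
  local cmd=(aws ec2 authorize-security-group-ingress
    --group-id "$target_sg"
    --ip-permissions "$perms")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "${cmd[*]}"   # print only
  else
    "${cmd[@]}"        # actually call the AWS CLI
  fi
}

# Fargate: one container port.
#   allow_alb_ingress sg-0task5678 sg-0abcd1234 8080 8080
# EC2 bridge mode with dynamic port mapping: the ephemeral range.
#   allow_alb_ingress sg-0inst5678 sg-0abcd1234 32768 65535
```

Referencing the ALB's Security Group rather than a CIDR block keeps the rule valid even when the load balancer's IP addresses change.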

Step 3: Container Application Issues

If the Security Groups are correct, the issue usually lies within the container itself.

The '0.0.0.0' vs '127.0.0.1' Trap: A classic mistake is configuring the application (like a Node.js Express app, Python Flask/Django, or Go server) to listen on localhost or 127.0.0.1. Inside a Docker container, 127.0.0.1 refers only to the container's internal loopback interface. The ALB cannot reach it.

Fix: Ensure your application binds to 0.0.0.0.
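One way to confirm the bind address is to inspect the listening sockets from inside the container. The sketch below parses the output of ss -ltn, assuming ss is installed in the image and you have a shell in the container. The function name check_bind is hypothetical.

```bash
#!/bin/bash
# Report how a TCP port is bound, given `ss -ltn` output on stdin.
# Prints one of: all-interfaces, other, loopback-only, not-listening.
check_bind() {
  local port="$1"
  awk -v port="$port" '
    $1 == "LISTEN" && $4 ~ (":" port "$") {
      addr = $4
      sub(":" port "$", "", addr)   # strip the trailing :port
      if (addr == "0.0.0.0" || addr == "*" || addr == "[::]") all = 1
      else if (addr ~ /^127\./ || addr == "[::1]") loop = 1
      else other = 1
    }
    END {
      if (all) print "all-interfaces"
      else if (other) print "other"
      else if (loop) print "loopback-only"
      else print "not-listening"
    }'
}

# Usage, inside the container:
#   ss -ltn | check_bind 8080
```

A result of loopback-only matches the trap described above, while not-listening suggests the application is on a different port or has not started.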

Premature Task Exits (Crashing): If the application crashes immediately upon startup, the ALB will try to route traffic to it just as it dies, resulting in a 502. Check CloudWatch Logs for your ECS task to look for stack traces, missing environment variables, or database connection failures preventing startup.
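Alongside the logs, the container exit code reported by aws ecs describe-tasks often narrows down why a task stopped. The sketch below interprets a few common codes using the usual convention that a code above 128 means the process was terminated by signal (code minus 128). The function name explain_exit_code is hypothetical and the hints are starting points, not diagnoses.

```bash
#!/bin/bash
# Interpret a container exit code as reported in containers[].exitCode.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "clean exit: the process finished, but a server should normally keep running" ;;
    137) echo "SIGKILL (128+9): often an out-of-memory kill, check task memory limits" ;;
    139) echo "SIGSEGV (128+11): segmentation fault in the application or a native library" ;;
    143) echo "SIGTERM (128+15): stop was requested, for example during a deployment" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error: check the CloudWatch log stream for a stack trace"
      fi ;;
  esac
}

# Usage:
#   explain_exit_code 137
```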

Step 4: Addressing AWS ECS Timeout Errors

Sometimes, a 502 or 504 is intermittent, specifically occurring during heavy load or long-running requests.

ALB Idle Timeout: By default, the ALB has an idle timeout of 60 seconds. If your container takes 65 seconds to process a report generation request and sends no data in the meantime, the ALB will close the connection and return a 504 (or sometimes a 502 if the target closes the connection abruptly after the timeout). Raise the ALB's idle timeout attribute above your application's maximum expected response time.
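The idle timeout is a load balancer attribute named idle_timeout.timeout_seconds, and it can be changed with aws elbv2 modify-load-balancer-attributes. The sketch below wraps that call. The function name set_alb_idle_timeout and the ARN in the usage comment are placeholders, and DRY_RUN defaults to 1 so nothing is modified until you opt in.

```bash
#!/bin/bash
# Set the ALB idle timeout. DRY_RUN defaults to 1, so the command is printed
# rather than executed; set DRY_RUN=0 to apply it.
set_alb_idle_timeout() {
  local alb_arn="$1" seconds="$2"
  # ALB accepts idle timeouts from 1 to 4000 seconds.
  if ! [ "$seconds" -ge 1 ] 2>/dev/null || [ "$seconds" -gt 4000 ]; then
    echo "idle timeout must be an integer between 1 and 4000 seconds" >&2
    return 1
  fi
  local cmd=(aws elbv2 modify-load-balancer-attributes
    --load-balancer-arn "$alb_arn"
    --attributes "Key=idle_timeout.timeout_seconds,Value=${seconds}")
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "${cmd[*]}"
  else
    "${cmd[@]}"
  fi
}

# Usage (placeholder ARN):
#   set_alb_idle_timeout "arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188" 120
```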

Keep-Alive Headers: Ensure your application's HTTP keep-alive timeout is greater than the ALB's idle timeout. If the application closes an idle TCP connection just as the ALB reuses it to send a request, a 502 Bad Gateway will occur. For example, with the default 60-second ALB idle timeout in Node.js, you might need to set server.keepAliveTimeout = 65000; and server.headersTimeout = 66000; (both values are in milliseconds).
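The relationship above is easy to get backwards, so a small guard in a deployment script can help. The sketch below only compares two numbers in seconds. The function name keepalive_ok is hypothetical, and how you obtain the two values is up to you (the usage comment shows one possible way to read the ALB side with the AWS CLI).

```bash
#!/bin/bash
# Succeed only if the application's keep-alive timeout is strictly greater
# than the ALB idle timeout. Both arguments are whole seconds.
keepalive_ok() {
  local app_keepalive_s="$1" alb_idle_s="$2"
  [ "$app_keepalive_s" -gt "$alb_idle_s" ]
}

# Usage (placeholder ARN variable; 65 matches server.keepAliveTimeout = 65000):
#   alb_idle=$(aws elbv2 describe-load-balancer-attributes \
#     --load-balancer-arn "$ALB_ARN" \
#     --query "Attributes[?Key=='idle_timeout.timeout_seconds'].Value" \
#     --output text)
#   keepalive_ok 65 "$alb_idle" || echo "keep-alive timeout too low" >&2
```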

```bash
#!/bin/bash
# Diagnostic script: check ALB Target Group health, then show container exit
# codes and stop reasons for a recently stopped ECS task.
# Requires the AWS CLI and jq.
set -euo pipefail

# Placeholder values: replace with your own resources.
TARGET_GROUP_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/6d0ecf831eec9f09"
CLUSTER_NAME="my-ecs-cluster"
SERVICE_NAME="my-ecs-service"

echo "=== Checking Target Health ==="
aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN" \
  | jq '.TargetHealthDescriptions[] | {Id: .Target.Id, State: .TargetHealth.State, Reason: .TargetHealth.Reason, Description: .TargetHealth.Description}'

echo -e "\n=== Finding a Recently Stopped Task ==="
# "// empty" makes jq print nothing (instead of "null") when no task is returned.
LATEST_TASK=$(aws ecs list-tasks --cluster "$CLUSTER_NAME" --service-name "$SERVICE_NAME" \
  --desired-status STOPPED --max-items 1 | jq -r '.taskArns[0] // empty')

if [ -n "$LATEST_TASK" ]; then
  echo "Found stopped task: $LATEST_TASK"
  echo "Fetching container exit codes and stop reasons..."
  aws ecs describe-tasks --cluster "$CLUSTER_NAME" --tasks "$LATEST_TASK" \
    | jq '.tasks[0] | {StopReason: .stoppedReason, Containers: [.containers[] | {Name: .name, ExitCode: .exitCode, Reason: .reason}]}'
else
  echo "No recently stopped tasks found."
fi
```

Error Medic Editorial

Error Medic Editorial is a collective of senior Site Reliability Engineers and Cloud Architects dedicated to demystifying complex infrastructure issues and providing actionable, real-world solutions.
