
AWS ECS 502 Bad Gateway: Complete Troubleshooting Guide

Fix AWS ECS 502 Bad Gateway errors fast. Covers health check misconfig, security group blocks, port mismatches, and timeout issues with exact CLI commands.

Key Takeaways
  • Root cause #1: Target group health checks failing because the health check path returns non-2xx, the container port is wrong, or startup grace period is too short — causing ALB to mark all targets unhealthy and return 502 to clients.
  • Root cause #2: Security group misconfiguration blocking traffic between the ALB and ECS task ENIs — the task security group must explicitly allow inbound TCP on the container port from the ALB security group.
  • Root cause #3: ALB idle timeout (default 60 s) is shorter than your application's response time, or your app closes keepalive connections before ALB does, causing ALB to receive an invalid/empty response and emit 502.
  • Root cause #4: Container crashes mid-request due to OOM kills (exit code 137) or unhandled exceptions, returning no valid HTTP response to ALB.
  • Quick fix checklist: (1) run `aws elbv2 describe-target-health` to see unhealthy reason, (2) verify container port matches target group port, (3) confirm ALB SG is allowed inbound on task SG, (4) tail CloudWatch Logs for app errors, (5) raise ALB idle_timeout if seeing timeouts.
Fix Approaches Compared
Method | When to Use | Estimated Time | Risk
Fix health check path/port in target group | Targets show 'unhealthy' with HealthCheckFailed reason | 5 min | Low — no downtime, takes effect on next check
Update ECS task SG inbound rule to allow ALB SG | Targets show 'unhealthy' with Request timed out reason and no app logs | 5 min | Low — additive change
Correct containerPort in task definition and redeploy | Port mismatch between app, task def, and target group | 10–15 min | Medium — triggers rolling deployment
Increase ALB idle_timeout.timeout_seconds | 502s appear after long-running requests or during file uploads | 2 min | Low — live attribute change
Increase ECS task memory and redeploy | Tasks stopped with exit code 137 (OOM kill) | 15–20 min | Medium — rolling deployment required
Enable ALB access logs and set deregistration delay | Intermittent 502s only during deployments | 5 min | Low — observability + drain tuning
Switch to blue/green deployment via CodeDeploy | Frequent deployments causing brief 502 spikes | 30–60 min | Low after setup — eliminates deployment 502s

Understanding AWS ECS 502 Bad Gateway

A 502 Bad Gateway response is generated by your Application Load Balancer (ALB) — not by your container. It means the ALB successfully received the client request, selected a registered target (your ECS task), forwarded the request, but got back an invalid or empty response. The ALB is the gateway; your ECS task is the bad backend.

This is distinct from:

  • 504 Gateway Timeout: ALB forwarded the request but the backend took too long to respond.
  • 503 Service Unavailable: The target group has no registered targets at all. (If targets are registered but all unhealthy, the ALB fails open and routes to them anyway, which surfaces as 502s rather than 503s.)
  • 500 Internal Server Error: Your application returned a valid HTTP response with status 500.

In AWS ECS specifically, 502s almost always originate from one of five layers: health check configuration, network/security group rules, port mapping configuration, application behavior, or timeout mismatches.
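One quick way to confirm a 502 really originates at the ALB: ALB-generated error pages carry a `Server: awselb/2.0` response header, while responses that reached your container report your app server instead. A minimal sketch (the `ALB_DNS_NAME` variable and the failing path are placeholders):

```shell
# Classify an error response by its Server header.
# ALB-generated errors report "Server: awselb/2.0"; errors that came from
# the backend show your app server (nginx, gunicorn, express, ...).
is_alb_generated() {
  # reads raw HTTP response headers on stdin
  grep -qi '^server:[[:space:]]*awselb'
}

# Usage against a live ALB (ALB_DNS_NAME is a placeholder):
#   curl -sI "http://$ALB_DNS_NAME/failing-path" | is_alb_generated \
#     && echo "ALB-origin 502" || echo "backend-origin error"
```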


Step 1: Establish Scope and Frequency

Before diving into fixes, determine whether the 502 is constant, intermittent, or deployment-triggered. This alone eliminates half the candidate causes.

# Enable ALB access logging first if not already enabled
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn $ALB_ARN \
  --attributes Key=access_logs.s3.enabled,Value=true \
               Key=access_logs.s3.bucket,Value=your-logs-bucket \
               Key=access_logs.s3.prefix,Value=alb

If access logging is already enabled, note that ALB access logs are delivered to S3, not CloudWatch Logs, so Logs Insights cannot query them; use Athena for log-level analysis. For a quick count without Athena, CloudWatch metrics already separate ALB-generated 502s from backend errors:

# Count ALB-generated 502s in the last hour.
# HTTPCode_ELB_502_Count = the ALB itself produced the 502;
# HTTPCode_Target_5XX_Count = your app returned a 5xx response.
# The dimension value is the ALB's ARN suffix (app/<name>/<id>);
# app/your-alb/1234567890abcdef below is a placeholder.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_ELB_502_Count \
  --dimensions Name=LoadBalancer,Value=app/your-alb/1234567890abcdef \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Sum

Also check ECS service events immediately — they often tell the whole story:

aws ecs describe-services \
  --cluster $CLUSTER_NAME \
  --services $SERVICE_NAME \
  --query 'services[0].events[:15]' \
  --output table

Constant 502s → likely health check, security group, or port misconfiguration. Intermittent 502s → likely timeout mismatch, OOM kills, or app-level errors. Deployment-only 502s → draining connections from deregistered tasks.


Step 2: Check Target Group Health

This is the single most productive first check. An unhealthy target group with zero healthy targets will cause 100% of requests to return 502 or 503.

# Get target group ARN for your service
aws elbv2 describe-target-groups \
  --load-balancer-arn $ALB_ARN \
  --query 'TargetGroups[*].{Name:TargetGroupName,ARN:TargetGroupArn,Port:Port,Protocol:Protocol}' \
  --output table

# Check actual health status with reasons
aws elbv2 describe-target-health \
  --target-group-arn $TG_ARN \
  --query 'TargetHealthDescriptions[*].{Target:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Description:TargetHealth.Description}' \
  --output table

Common TargetHealth.Reason values and what they mean:

Reason | Meaning | Fix
Target.FailedHealthChecks | Health check path returning non-2xx | Fix /health endpoint or update matcher
Target.Timeout | Health check request timed out | Check SG rules, increase timeout
Target.NotRegistered | Task not yet registered | Wait for registration or check task status
Target.DeregistrationInProgress | Task draining | Normal during deployment
Elb.InitialHealthChecking | Just registered, still checking | Wait
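After applying a fix, you can poll until targets recover instead of re-running the command by hand. A small loop, assuming `TG_ARN` is set as above:

```shell
# Poll target health every 10 s until no target is left in a non-healthy state.
while true; do
  UNHEALTHY=$(aws elbv2 describe-target-health \
    --target-group-arn "$TG_ARN" \
    --query 'length(TargetHealthDescriptions[?TargetHealth.State!=`healthy`])' \
    --output text)
  if [ "$UNHEALTHY" = "0" ]; then
    echo "All targets healthy."
    break
  fi
  echo "$UNHEALTHY target(s) not yet healthy; rechecking in 10 s..."
  sleep 10
done
```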

Step 3: Verify Port Mapping Configuration

A port mismatch between your application, ECS task definition, and target group is a silent killer — everything looks configured but traffic never reaches the app correctly.

# Check container port mappings in task definition
aws ecs describe-task-definition \
  --task-definition $TASK_DEFINITION_NAME \
  --query 'taskDefinition.containerDefinitions[*].{Name:name,Ports:portMappings}'

# Check what port the target group is sending traffic to
aws elbv2 describe-target-groups \
  --target-group-arns $TG_ARN \
  --query 'TargetGroups[0].{Port:Port,HealthCheckPort:HealthCheckPort,HealthCheckPath:HealthCheckPath}'

For awsvpc network mode (Fargate, or the EC2 launch type with task-level networking), the target group's target type must be ip, and the container port must match the target group port exactly. For bridge mode, traffic is routed to the host port in the port mapping, so that is the port that must line up.

If using dynamic host port mapping (hostPort: 0, bridge mode only), the target group's target type must be instance; ECS then registers each task's ephemeral host port automatically.
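A quick cross-check between the two commands above can catch a mismatch in seconds. A sketch assuming `TASK_DEFINITION_NAME` and `TG_ARN` are set; the `check_ports` helper is just for illustration:

```shell
# Compare containerPort (task definition) against the target group's port.
check_ports() {
  # $1 = containerPort, $2 = target group port
  if [ "$1" = "$2" ]; then
    echo "OK: ports match ($1)"
  else
    echo "MISMATCH: containerPort=$1 vs target group port=$2"
  fi
}

# Usage against live AWS (awsvpc mode, first container's first mapping):
#   CONTAINER_PORT=$(aws ecs describe-task-definition \
#     --task-definition "$TASK_DEFINITION_NAME" \
#     --query 'taskDefinition.containerDefinitions[0].portMappings[0].containerPort' \
#     --output text)
#   TG_PORT=$(aws elbv2 describe-target-groups --target-group-arns "$TG_ARN" \
#     --query 'TargetGroups[0].Port' --output text)
#   check_ports "$CONTAINER_PORT" "$TG_PORT"
```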


Step 4: Audit Security Groups

This is the #1 cause of Target.Timeout health check failures. The ECS task's security group must allow inbound TCP traffic from the ALB's security group on the container port.

# Get the ECS service network config
aws ecs describe-services \
  --cluster $CLUSTER_NAME \
  --services $SERVICE_NAME \
  --query 'services[0].networkConfiguration.awsvpcConfiguration'

# Check inbound rules on the ECS task security group
aws ec2 describe-security-groups \
  --group-ids $ECS_TASK_SG_ID \
  --query 'SecurityGroups[0].IpPermissions'

# Get the ALB's security group
aws elbv2 describe-load-balancers \
  --load-balancer-arns $ALB_ARN \
  --query 'LoadBalancers[0].SecurityGroups'

Add the required inbound rule if missing:

aws ec2 authorize-security-group-ingress \
  --group-id $ECS_TASK_SG_ID \
  --protocol tcp \
  --port $CONTAINER_PORT \
  --source-group $ALB_SG_ID
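To confirm the rule took effect (or was already present), list which security groups the task SG references in its inbound rules; the ALB's SG ID should appear. A sketch using the variables from the commands above:

```shell
# List all SG-to-SG references in the task SG's inbound rules and look
# for the ALB SG among them (port ranges still need a manual glance).
aws ec2 describe-security-groups \
  --group-ids "$ECS_TASK_SG_ID" \
  --query 'SecurityGroups[0].IpPermissions[].UserIdGroupPairs[].GroupId' \
  --output text | tr '\t' '\n' | grep -x "$ALB_SG_ID" \
  && echo "ALB SG is referenced" \
  || echo "ALB SG NOT referenced; expect Target.Timeout health check failures"
```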

Step 5: Review Application Logs for Errors

If health checks pass but 502s persist, the app is starting and passing health checks but failing on real requests.

# Get the log stream for the running task
TASK_ID=$(aws ecs list-tasks \
  --cluster $CLUSTER_NAME \
  --service-name $SERVICE_NAME \
  --query 'taskArns[0]' \
  --output text | awk -F/ '{print $NF}')

# Tail logs (adjust log group and container name)
aws logs get-log-events \
  --log-group-name /ecs/$SERVICE_NAME \
  --log-stream-name ecs/$CONTAINER_NAME/$TASK_ID \
  --limit 200 \
  --query 'events[*].message' \
  --output text

Look for: unhandled exceptions, segfaults, connection refused errors, or crash loops.
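With AWS CLI v2, `aws logs tail` is usually faster than paging `get-log-events` by hand, and it can pre-filter for likely error lines:

```shell
# Live-tail the service's log group; the filter pattern uses CloudWatch
# Logs syntax, where "?term" means OR across terms.
aws logs tail "/ecs/$SERVICE_NAME" \
  --since 15m \
  --follow \
  --filter-pattern "?ERROR ?Exception ?panic"
```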


Step 6: Check for OOM Kills and Task Crashes

A container killed by the kernel OOM killer (exit code 137), or one that crashes with an unhandled exception mid-request, closes the connection without returning a valid HTTP response; the ALB sees a TCP RST or a truncated response and emits a 502.

# List recently stopped tasks
aws ecs list-tasks \
  --cluster $CLUSTER_NAME \
  --service-name $SERVICE_NAME \
  --desired-status STOPPED \
  --query 'taskArns[*]' \
  --output text

# Inspect stop reason and exit code
aws ecs describe-tasks \
  --cluster $CLUSTER_NAME \
  --tasks $STOPPED_TASK_ARN \
  --query 'tasks[0].{StoppedReason:stoppedReason,Containers:containers[*].{Name:name,ExitCode:exitCode,Reason:reason}}'

If exitCode is 137, the container was OOM killed. Increase the memory (hard limit) or memoryReservation in your task definition and redeploy.
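Scripting the memory bump looks roughly like this. A sketch assuming jq is installed and a Fargate-style task definition where task-level memory is a string; "1024" MiB and the file names are arbitrary examples:

```shell
# jq filter: raise task-level memory and strip the read-only fields that
# register-task-definition rejects. "1024" (MiB) is an example value;
# pick a size that is valid for your CPU setting.
BUMP_MEMORY='.memory = "1024"
  | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
        .compatibilities, .registeredAt, .registeredBy)'

# Usage against live AWS:
#   aws ecs describe-task-definition --task-definition "$TASK_DEFINITION_NAME" \
#     --query 'taskDefinition' > taskdef.json
#   jq "$BUMP_MEMORY" taskdef.json > taskdef-new.json
#   aws ecs register-task-definition --cli-input-json file://taskdef-new.json
#   aws ecs update-service --cluster "$CLUSTER_NAME" --service "$SERVICE_NAME" \
#     --task-definition "$TASK_DEFINITION_NAME"   # picks up the new revision
```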


Step 7: Fix Timeout Mismatches

AWS ECS timeout 502 errors often occur when:

  1. ALB idle timeout (default 60 seconds) expires before the backend responds.
  2. Your application server closes keepalive connections before ALB does — ALB reuses the connection, gets a RST, and emits 502.
# Increase ALB idle timeout (live change, no redeploy needed)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn $ALB_ARN \
  --attributes Key=idle_timeout.timeout_seconds,Value=120

# Verify current value
aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn $ALB_ARN \
  --query 'Attributes[?Key==`idle_timeout.timeout_seconds`]'

In your application, set the keepalive timeout to the ALB idle timeout plus a small buffer (roughly 5 seconds). For nginx: keepalive_timeout 125s;. For a Node.js HTTP server: server.keepAliveTimeout = 125000, and raise server.headersTimeout slightly above that (e.g. 126000) so header parsing does not time out first.


Step 8: Fix Health Check Configuration

If your app's health endpoint needs time to warm up, or the path/matcher is wrong:

aws elbv2 modify-target-group \
  --target-group-arn $TG_ARN \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200-299

# Also increase ECS service health check grace period for slow-starting apps
aws ecs update-service \
  --cluster $CLUSTER_NAME \
  --service $SERVICE_NAME \
  --health-check-grace-period-seconds 60

Step 9: Reduce Deployment-Induced 502s

During rolling updates, ECS deregisters old tasks while registering new ones. New requests stop being routed to a draining task, but in-flight requests can fail with 502 if the task stops before they complete. Fix this by:

  1. Tuning the deregistration delay: long enough to cover your longest request, but well under the 300 s default so deployments drain quickly.
  2. Ensuring new tasks are healthy before old ones are deregistered.
# Reduce deregistration delay (default 300s is too long)
aws elbv2 modify-target-group-attributes \
  --target-group-arn $TG_ARN \
  --attributes Key=deregistration_delay.timeout_seconds,Value=30

# Ensure rolling update keeps healthy capacity
aws ecs update-service \
  --cluster $CLUSTER_NAME \
  --service $SERVICE_NAME \
  --deployment-configuration minimumHealthyPercent=100,maximumPercent=200

For zero-downtime deployments, migrate to CodeDeploy blue/green which shifts traffic only after new tasks are fully healthy.

Full Diagnostic Script

#!/usr/bin/env bash
# AWS ECS 502 Bad Gateway - Full Diagnostic Script
# Usage: ECS_502_CLUSTER=my-cluster ECS_502_SERVICE=my-service bash diagnose-ecs-502.sh

set -euo pipefail

CLUSTER="${ECS_502_CLUSTER:-REPLACE_ME}"
SERVICE="${ECS_502_SERVICE:-REPLACE_ME}"
REGION="${AWS_REGION:-us-east-1}"

echo "=== [1] ECS Service Events (last 10) ==="
aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "$SERVICE" \
  --region "$REGION" \
  --query 'services[0].events[:10].[message]' \
  --output text

echo ""
echo "=== [2] Target Group Health ==="
TG_ARN=$(aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "$SERVICE" \
  --region "$REGION" \
  --query 'services[0].loadBalancers[0].targetGroupArn' \
  --output text)

if [ "$TG_ARN" = "None" ] || [ -z "$TG_ARN" ]; then
  echo "ERROR: No target group found for service $SERVICE"
else
  echo "Target Group: $TG_ARN"
  aws elbv2 describe-target-health \
    --target-group-arn "$TG_ARN" \
    --region "$REGION" \
    --query 'TargetHealthDescriptions[*].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Desc:TargetHealth.Description}' \
    --output table
fi

echo ""
echo "=== [3] Running Task ARNs ==="
TASK_ARNS=$(aws ecs list-tasks \
  --cluster "$CLUSTER" \
  --service-name "$SERVICE" \
  --desired-status RUNNING \
  --region "$REGION" \
  --query 'taskArns[*]' \
  --output text)
echo "$TASK_ARNS"

echo ""
echo "=== [4] Task Port Mappings ==="
TASK_DEF=$(aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "$SERVICE" \
  --region "$REGION" \
  --query 'services[0].taskDefinition' \
  --output text)
aws ecs describe-task-definition \
  --task-definition "$TASK_DEF" \
  --region "$REGION" \
  --query 'taskDefinition.containerDefinitions[*].{Container:name,Ports:portMappings}'

echo ""
echo "=== [5] Recent Stopped Tasks (exit codes) ==="
STOPPED=$(aws ecs list-tasks \
  --cluster "$CLUSTER" \
  --service-name "$SERVICE" \
  --desired-status STOPPED \
  --region "$REGION" \
  --query 'taskArns[:3]' \
  --output text)

if [ -n "$STOPPED" ]; then
  aws ecs describe-tasks \
    --cluster "$CLUSTER" \
    --tasks $STOPPED \
    --region "$REGION" \
    --query 'tasks[*].{StopReason:stoppedReason,Containers:containers[*].{Name:name,ExitCode:exitCode}}' \
    --output json
else
  echo "No recently stopped tasks found."
fi

echo ""
echo "=== [6] ALB Idle Timeout ==="
if [ -n "$TG_ARN" ] && [ "$TG_ARN" != "None" ]; then
  ALB_ARN=$(aws elbv2 describe-target-groups \
    --target-group-arns "$TG_ARN" \
    --region "$REGION" \
    --query 'TargetGroups[0].LoadBalancerArns[0]' \
    --output text)
  aws elbv2 describe-load-balancer-attributes \
    --load-balancer-arn "$ALB_ARN" \
    --region "$REGION" \
    --query 'Attributes[?Key==`idle_timeout.timeout_seconds`]' \
    --output table
fi

echo ""
echo "=== [7] Health Check Config ==="
if [ -n "$TG_ARN" ] && [ "$TG_ARN" != "None" ]; then
  aws elbv2 describe-target-groups \
    --target-group-arns "$TG_ARN" \
    --region "$REGION" \
    --query 'TargetGroups[0].{Path:HealthCheckPath,Port:HealthCheckPort,Interval:HealthCheckIntervalSeconds,Timeout:HealthCheckTimeoutSeconds,HealthyThreshold:HealthyThresholdCount,UnhealthyThreshold:UnhealthyThresholdCount,Matcher:Matcher.HttpCode}' \
    --output table
fi

echo ""
echo "=== DIAGNOSTIC COMPLETE ==="
echo "Next steps:"
echo "  - If targets UNHEALTHY with Timeout: check ECS task security group allows ALB SG on container port"
echo "  - If targets UNHEALTHY with HealthCheckFailed: fix health check path/port in target group"
echo "  - If exit code 137: increase task memory limit and redeploy"
echo "  - If all targets healthy: check app logs for errors and ALB idle timeout vs app response time"

Error Medic Editorial

The Error Medic Editorial team consists of senior DevOps engineers, SREs, and cloud architects with hands-on experience managing production workloads on AWS, GCP, and Azure. Our guides are written from real incident postmortems and production debugging sessions — not documentation rewrites. We specialize in AWS ECS, Kubernetes, observability tooling, and infrastructure-as-code best practices.
