AWS ECS 502 Bad Gateway: Complete Troubleshooting Guide
Fix AWS ECS 502 Bad Gateway errors fast. Covers health check misconfig, security group blocks, port mismatches, and timeout issues with exact CLI commands.
- Root cause #1: Target group health checks failing because the health check path returns non-2xx, the container port is wrong, or startup grace period is too short — causing ALB to mark all targets unhealthy and return 502 to clients.
- Root cause #2: Security group misconfiguration blocking traffic between the ALB and ECS task ENIs — the task security group must explicitly allow inbound TCP on the container port from the ALB security group.
- Root cause #3: ALB idle timeout (default 60 s) is shorter than your application's response time, or your app closes keepalive connections before ALB does, causing ALB to receive an invalid/empty response and emit 502.
- Root cause #4: Container crashes mid-request due to OOM kills (exit code 137) or unhandled exceptions, returning no valid HTTP response to ALB.
- Quick fix checklist: (1) run `aws elbv2 describe-target-health` to see the unhealthy reason, (2) verify the container port matches the target group port, (3) confirm the task SG allows inbound from the ALB SG on the container port, (4) tail CloudWatch Logs for app errors, (5) raise the ALB idle_timeout if you see timeouts.
| Method | When to Use | Estimated Time | Risk |
|---|---|---|---|
| Fix health check path/port in target group | Targets show 'unhealthy' with HealthCheckFailed reason | 5 min | Low — no downtime, takes effect on next check |
| Update ECS task SG inbound rule to allow ALB SG | Targets show 'unhealthy' with Request timed out reason and no app logs | 5 min | Low — additive change |
| Correct containerPort in task definition and redeploy | Port mismatch between app, task def, and target group | 10–15 min | Medium — triggers rolling deployment |
| Increase ALB idle_timeout.timeout_seconds | 502s appear after long-running requests or during file uploads | 2 min | Low — live attribute change |
| Increase ECS task memory and redeploy | Tasks stopped with exit code 137 (OOM kill) | 15–20 min | Medium — rolling deployment required |
| Enable ALB access logs and set deregistration delay | Intermittent 502s only during deployments | 5 min | Low — observability + drain tuning |
| Switch to blue/green deployment via CodeDeploy | Frequent deployments causing brief 502 spikes | 30–60 min | Low after setup — eliminates deployment 502s |
Understanding AWS ECS 502 Bad Gateway
A 502 Bad Gateway response is generated by your Application Load Balancer (ALB) — not by your container. It means the ALB successfully received the client request, selected a registered target (your ECS task), forwarded the request, but got back an invalid or empty response. The ALB is the gateway; your ECS task is the bad backend.
This is distinct from:
- 504 Gateway Timeout: ALB forwarded the request but the backend took too long to respond.
- 503 Service Unavailable: No healthy targets exist in the target group at all.
- 500 Internal Server Error: Your application returned a valid HTTP response with status 500.
In AWS ECS specifically, 502s almost always originate from one of five layers: health check configuration, network/security group rules, port mapping configuration, application behavior, or timeout mismatches.
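A quick way to tell which side produced the error: error pages generated by the ALB itself typically carry a `Server: awselb/2.0` response header, while responses that reached your app carry the app server's own header. A minimal sketch — the headers below are canned for illustration; in practice you would capture them with `curl -sI` against your own endpoint (hypothetical URL in the comment):

```shell
# Canned response headers for illustration; in practice capture them with:
#   curl -sI https://your-app.example.com/some-path   (hypothetical URL)
HEADERS='HTTP/1.1 502 Bad Gateway
Server: awselb/2.0
Content-Type: text/html'

# ALB-generated errors identify themselves with Server: awselb/2.0;
# responses that made it to your app carry the app server's header instead.
if printf '%s\n' "$HEADERS" | grep -qi '^Server: awselb'; then
  VERDICT="generated by the ALB (backend gave no valid response)"
else
  VERDICT="passed through from the application"
fi
echo "502 verdict: $VERDICT"
```

If the verdict is "passed through", you are debugging your application, not the load balancer.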
Step 1: Establish Scope and Frequency
Before diving into fixes, determine whether the 502 is constant, intermittent, or deployment-triggered. This alone eliminates half the candidate causes.
# Enable ALB access logging first if not already enabled
aws elbv2 modify-load-balancer-attributes \
--load-balancer-arn $ALB_ARN \
--attributes Key=access_logs.s3.enabled,Value=true \
Key=access_logs.s3.bucket,Value=your-logs-bucket \
Key=access_logs.s3.prefix,Value=alb
If you already have logs, query for 502 patterns. Note that the ALB writes access logs to S3; the CloudWatch Logs Insights query below assumes you forward them into a log group (otherwise, query the S3 logs directly with Athena):
# Count 502s by target IP in the last hour (CloudWatch Logs Insights)
aws logs start-query \
--log-group-name /aws/applicationloadbalancer/your-alb \
--start-time $(date -d '1 hour ago' +%s) \
--end-time $(date +%s) \
--query-string 'fields @timestamp, target_ip, request_url | filter elb_status_code=502 | stats count() by target_ip | sort count desc'
# start-query returns a queryId; fetch results with `aws logs get-query-results --query-id <queryId>`
Also check ECS service events immediately — they often tell the whole story:
aws ecs describe-services \
--cluster $CLUSTER_NAME \
--services $SERVICE_NAME \
--query 'services[0].events[:15]' \
--output table
Constant 502s → likely health check, security group, or port misconfiguration. Intermittent 502s → likely timeout mismatch, OOM kills, or app-level errors. Deployment-only 502s → draining connections from deregistered tasks.
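When reading raw access log lines, two fields matter most: field 9 is `elb_status_code` and field 10 is `target_status_code`. A `-` in field 10 on a 502 line means the target never returned a valid HTTP response at all. A small sketch over a canned sample line (real lines come from the S3 bucket configured above):

```shell
# Canned sample ALB access log line (real ones live in your S3 access-log bucket).
LOGLINE='http 2024-01-01T00:00:00.000000Z app/my-alb/abc123 203.0.113.10:54321 10.0.1.23:8080 0.001 -1 -1 502 - 120 400 "GET http://example.com:80/api HTTP/1.1" "curl/8.0" - -'

# Field 9 = elb_status_code, field 10 = target_status_code.
ELB_CODE=$(printf '%s\n' "$LOGLINE" | awk '{print $9}')
TARGET_CODE=$(printf '%s\n' "$LOGLINE" | awk '{print $10}')
echo "elb_status_code=$ELB_CODE target_status_code=$TARGET_CODE"

# A "-" target code on a 502 means the target never sent a valid HTTP
# response: connection refused/reset, crash mid-request, or premature close.
if [ "$ELB_CODE" = "502" ] && [ "$TARGET_CODE" = "-" ]; then
  echo "Target gave no valid response: check SGs, crashes, and keepalive timeouts"
fi
```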
Step 2: Check Target Group Health
This is the single most productive first check. With zero healthy targets, the ALB either returns 503 (no registered targets) or fails open and routes to the unhealthy targets anyway — so effectively every request fails with a 502 or 503.
# Get target group ARN for your service
aws elbv2 describe-target-groups \
--load-balancer-arn $ALB_ARN \
--query 'TargetGroups[*].{Name:TargetGroupName,ARN:TargetGroupArn,Port:Port,Protocol:Protocol}' \
--output table
# Check actual health status with reasons
aws elbv2 describe-target-health \
--target-group-arn $TG_ARN \
--query 'TargetHealthDescriptions[*].{Target:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Description:TargetHealth.Description}' \
--output table
Common TargetHealth.Reason values and what they mean:
| Reason | Meaning | Fix |
|---|---|---|
| `Target.FailedHealthChecks` | Health check path returning non-2xx | Fix /health endpoint or update matcher |
| `Target.Timeout` | Health check request timed out | Check SG rules, increase timeout |
| `Target.NotRegistered` | Task not yet registered | Wait for registration or check task status |
| `Target.DeregistrationInProgress` | Task draining | Normal during deployment |
| `Elb.InitialHealthChecking` | Just registered, still checking | Wait |
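When eyeballing the raw JSON from `describe-target-health`, a quick grep gives you the healthy/unhealthy split without jq. A sketch over a canned sample document (in practice, pipe the command's `--output json` into it):

```shell
# Sample describe-target-health output; normally produced by:
#   aws elbv2 describe-target-health --target-group-arn "$TG_ARN" --output json
HEALTH_JSON='{
  "TargetHealthDescriptions": [
    {"Target": {"Id": "10.0.1.23", "Port": 8080},
     "TargetHealth": {"State": "unhealthy", "Reason": "Target.Timeout"}},
    {"Target": {"Id": "10.0.1.45", "Port": 8080},
     "TargetHealth": {"State": "healthy"}}
  ]
}'

# Count healthy vs unhealthy targets; Target.Timeout almost always means
# the task security group is not letting the ALB's health checks in.
UNHEALTHY=$(printf '%s\n' "$HEALTH_JSON" | grep -c '"State": "unhealthy"')
HEALTHY=$(printf '%s\n' "$HEALTH_JSON" | grep -c '"State": "healthy"')
echo "healthy=$HEALTHY unhealthy=$UNHEALTHY"
```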
Step 3: Verify Port Mapping Configuration
A port mismatch between your application, ECS task definition, and target group is a silent killer — everything looks configured but traffic never reaches the app correctly.
# Check container port mappings in task definition
aws ecs describe-task-definition \
--task-definition $TASK_DEFINITION_NAME \
--query 'taskDefinition.containerDefinitions[*].{Name:name,Ports:portMappings}'
# Check what port the target group is sending traffic to
aws elbv2 describe-target-groups \
--target-group-arns $TG_ARN \
--query 'TargetGroups[0].{Port:Port,HealthCheckPort:HealthCheckPort,HealthCheckPath:HealthCheckPath}'
For awsvpc network mode (Fargate, or EC2 launch type with task-level networking), the container port must match the target group port exactly. For bridge mode, it is the host port in the port mapping that must match.
If you use dynamic host port mapping (hostPort: 0), the target group's target type must be `instance`; ECS then registers the ephemeral host port automatically.
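The two queries above give you both sides of the comparison; a tiny guard script makes the mismatch explicit. The values below are hypothetical — in practice capture them with the commented CLI calls:

```shell
# Hypothetical values; in practice capture them with the queries above, e.g.:
#   CONTAINER_PORT=$(aws ecs describe-task-definition --task-definition "$TASK_DEFINITION_NAME" \
#     --query 'taskDefinition.containerDefinitions[0].portMappings[0].containerPort' --output text)
#   TG_PORT=$(aws elbv2 describe-target-groups --target-group-arns "$TG_ARN" \
#     --query 'TargetGroups[0].Port' --output text)
CONTAINER_PORT=8080
TG_PORT=80

# In awsvpc mode these must agree, or traffic never reaches the app.
if [ "$CONTAINER_PORT" -eq "$TG_PORT" ]; then
  VERDICT="OK"
else
  VERDICT="MISMATCH"
fi
echo "$VERDICT: container port $CONTAINER_PORT vs target group port $TG_PORT"
```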
Step 4: Audit Security Groups
This is the #1 cause of Target.Timeout health check failures. The ECS task's security group must allow inbound TCP traffic from the ALB's security group on the container port.
# Get the ECS service network config
aws ecs describe-services \
--cluster $CLUSTER_NAME \
--services $SERVICE_NAME \
--query 'services[0].networkConfiguration.awsvpcConfiguration'
# Check inbound rules on the ECS task security group
aws ec2 describe-security-groups \
--group-ids $ECS_TASK_SG_ID \
--query 'SecurityGroups[0].IpPermissions'
# Get the ALB's security group
aws elbv2 describe-load-balancers \
--load-balancer-arns $ALB_ARN \
--query 'LoadBalancers[0].SecurityGroups'
Add the required inbound rule if missing:
aws ec2 authorize-security-group-ingress \
--group-id $ECS_TASK_SG_ID \
--protocol tcp \
--port $CONTAINER_PORT \
--source-group $ALB_SG_ID
Step 5: Review Application Logs for Errors
If health checks pass but 502s persist, the app is starting and passing health checks but failing on real requests.
# Get the log stream for the running task
TASK_ID=$(aws ecs list-tasks \
--cluster $CLUSTER_NAME \
--service-name $SERVICE_NAME \
--query 'taskArns[0]' \
--output text | awk -F/ '{print $NF}')
# Tail logs (adjust log group and container name)
aws logs get-log-events \
--log-group-name /ecs/$SERVICE_NAME \
--log-stream-name ecs/$CONTAINER_NAME/$TASK_ID \
--limit 200 \
--query 'events[*].message' \
--output text
Look for: unhandled exceptions, segfaults, connection refused errors, or crash loops.
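The crash signatures above are easy to surface with a single grep over the log output. A sketch on canned sample lines (in practice, pipe the `get-log-events` output from above into the same filter):

```shell
# Canned sample log lines; in practice pipe the get-log-events output here.
LOGS='2024-01-01T00:00:01 GET /api/users 200 12ms
2024-01-01T00:00:02 Error: connect ECONNREFUSED 127.0.0.1:5432
2024-01-01T00:00:03 UnhandledPromiseRejection: TypeError: x is undefined'

# Signatures that commonly precede a 502: refused/reset upstream connections,
# unhandled exceptions, segfaults, and OOM messages.
MATCHES=$(printf '%s\n' "$LOGS" | grep -Ec 'ECONNREFUSED|ECONNRESET|Unhandled|segfault|OutOfMemory|Killed')
echo "suspicious log lines: $MATCHES"
```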
Step 6: Check for OOM Kills and Task Crashes
A container killed by the OOM killer (exit code 137), or one that crashes with exit code 1 while serving a request, gives the ALB no valid HTTP response — the ALB typically sees a TCP RST and returns a 502.
# List recently stopped tasks
aws ecs list-tasks \
--cluster $CLUSTER_NAME \
--service-name $SERVICE_NAME \
--desired-status STOPPED \
--query 'taskArns[*]' \
--output text
# Inspect stop reason and exit code
aws ecs describe-tasks \
--cluster $CLUSTER_NAME \
--tasks $STOPPED_TASK_ARN \
--query 'tasks[0].{StoppedReason:stoppedReason,Containers:containers[*].{Name:name,ExitCode:exitCode,Reason:reason}}'
If exitCode is 137, the container was OOM killed. Increase the memory (hard limit) or memoryReservation in your task definition and redeploy.
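Exit codes above 128 mean "killed by signal (code minus 128)", which makes them easy to decode. A small helper summarizing the codes you will see most often on ECS:

```shell
# Decode common container exit codes; codes > 128 are 128 + signal number.
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit" ;;
    1)   echo "application error - check app logs" ;;
    137) echo "SIGKILL (128+9) - usually the OOM killer; raise task memory" ;;
    139) echo "SIGSEGV (128+11) - segmentation fault in the app" ;;
    143) echo "SIGTERM (128+15) - normal stop, e.g. deployment or scale-in" ;;
    *)   echo "see container runtime docs for code $1" ;;
  esac
}

explain_exit_code 137
```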
Step 7: Fix Timeout Mismatches
AWS ECS timeout 502 errors often occur when:
- ALB idle timeout (default 60 seconds) expires before the backend responds.
- Your application server closes keepalive connections before the ALB does — the ALB reuses the connection, receives an RST, and emits a 502.
# Increase ALB idle timeout (live change, no redeploy needed)
aws elbv2 modify-load-balancer-attributes \
--load-balancer-arn $ALB_ARN \
--attributes Key=idle_timeout.timeout_seconds,Value=120
# Verify current value
aws elbv2 describe-load-balancer-attributes \
--load-balancer-arn $ALB_ARN \
--query 'Attributes[?Key==`idle_timeout.timeout_seconds`]'
In your application, set the keepalive timeout to the ALB idle timeout + 5 seconds. For nginx: `keepalive_timeout 125s;`. For a Node.js HTTP server: `server.keepAliveTimeout = 125000;`.
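The rule of thumb is easy to script: app keepalive = ALB idle timeout plus a margin, so the ALB always closes idle connections first and never reuses one the app has already dropped. `ALB_IDLE` is hard-coded here; in practice it comes from the `describe-load-balancer-attributes` call above:

```shell
# ALB_IDLE would normally come from describe-load-balancer-attributes (above).
ALB_IDLE=120
MARGIN=5

# The app must keep idle connections open LONGER than the ALB does,
# so the ALB is always the side that closes first.
APP_KEEPALIVE=$((ALB_IDLE + MARGIN))

echo "nginx:   keepalive_timeout ${APP_KEEPALIVE}s;"
echo "node.js: server.keepAliveTimeout = $((APP_KEEPALIVE * 1000));"
```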
Step 8: Fix Health Check Configuration
If your app's health endpoint needs time to warm up, or the path/matcher is wrong:
aws elbv2 modify-target-group \
--target-group-arn $TG_ARN \
--health-check-path /health \
--health-check-interval-seconds 30 \
--health-check-timeout-seconds 10 \
--healthy-threshold-count 2 \
--unhealthy-threshold-count 3 \
--matcher HttpCode=200-299
# Also increase ECS service health check grace period for slow-starting apps
aws ecs update-service \
--cluster $CLUSTER_NAME \
--service $SERVICE_NAME \
--health-check-grace-period-seconds 60
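With the settings above, the worst-case timings are simple arithmetic, which helps sanity-check the grace period. The app startup time below is a hypothetical measured value for your own service:

```shell
# Values from the modify-target-group call above.
INTERVAL=30            # health-check-interval-seconds
UNHEALTHY_THRESHOLD=3  # consecutive failures before 'unhealthy'
HEALTHY_THRESHOLD=2    # consecutive passes before 'healthy'

# A failing task is marked unhealthy after roughly interval * threshold seconds;
# a new task needs roughly interval * healthy_threshold seconds to go healthy.
TIME_TO_UNHEALTHY=$((INTERVAL * UNHEALTHY_THRESHOLD))
TIME_TO_HEALTHY=$((INTERVAL * HEALTHY_THRESHOLD))
echo "time to unhealthy: ~${TIME_TO_UNHEALTHY}s; time to healthy: ~${TIME_TO_HEALTHY}s"

# The grace period should cover app startup PLUS the time needed to pass
# enough checks; otherwise ECS kills slow-starting tasks in a loop.
APP_STARTUP=30   # hypothetical measured startup time of your app
GRACE=$((APP_STARTUP + TIME_TO_HEALTHY))
echo "suggested --health-check-grace-period-seconds: at least $GRACE"
```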
Step 9: Reduce Deployment-Induced 502s
During rolling updates, ECS deregisters old tasks while registering new ones. Requests still in flight on a task that is stopped before draining completes fail with 502. Fix this by:
- Reducing deregistration delay so old tasks are drained quickly.
- Ensuring new tasks are healthy before old ones are deregistered.
# Reduce deregistration delay (default 300s is too long)
aws elbv2 modify-target-group-attributes \
--target-group-arn $TG_ARN \
--attributes Key=deregistration_delay.timeout_seconds,Value=30
# Ensure rolling update keeps healthy capacity
aws ecs update-service \
--cluster $CLUSTER_NAME \
--service $SERVICE_NAME \
--deployment-configuration minimumHealthyPercent=100,maximumPercent=200
For zero-downtime deployments, migrate to CodeDeploy blue/green, which shifts traffic to the new task set only after it is fully healthy.
Full Diagnostic Script
#!/usr/bin/env bash
# AWS ECS 502 Bad Gateway - Full Diagnostic Script
# Usage: ECS_502_CLUSTER=my-cluster ECS_502_SERVICE=my-service bash diagnose-ecs-502.sh
set -euo pipefail
CLUSTER="${ECS_502_CLUSTER:-REPLACE_ME}"
SERVICE="${ECS_502_SERVICE:-REPLACE_ME}"
REGION="${AWS_REGION:-us-east-1}"
echo "=== [1] ECS Service Events (last 10) ==="
aws ecs describe-services \
--cluster "$CLUSTER" \
--services "$SERVICE" \
--region "$REGION" \
--query 'services[0].events[:10].[message]' \
--output text
echo ""
echo "=== [2] Target Group Health ==="
TG_ARN=$(aws ecs describe-services \
--cluster "$CLUSTER" \
--services "$SERVICE" \
--region "$REGION" \
--query 'services[0].loadBalancers[0].targetGroupArn' \
--output text)
if [ "$TG_ARN" = "None" ] || [ -z "$TG_ARN" ]; then
echo "ERROR: No target group found for service $SERVICE"
else
echo "Target Group: $TG_ARN"
aws elbv2 describe-target-health \
--target-group-arn "$TG_ARN" \
--region "$REGION" \
--query 'TargetHealthDescriptions[*].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Desc:TargetHealth.Description}' \
--output table
fi
echo ""
echo "=== [3] Running Task ARNs ==="
TASK_ARNS=$(aws ecs list-tasks \
--cluster "$CLUSTER" \
--service-name "$SERVICE" \
--desired-status RUNNING \
--region "$REGION" \
--query 'taskArns[*]' \
--output text)
echo "$TASK_ARNS"
echo ""
echo "=== [4] Task Port Mappings ==="
TASK_DEF=$(aws ecs describe-services \
--cluster "$CLUSTER" \
--services "$SERVICE" \
--region "$REGION" \
--query 'services[0].taskDefinition' \
--output text)
aws ecs describe-task-definition \
--task-definition "$TASK_DEF" \
--region "$REGION" \
--query 'taskDefinition.containerDefinitions[*].{Container:name,Ports:portMappings}'
echo ""
echo "=== [5] Recent Stopped Tasks (exit codes) ==="
STOPPED=$(aws ecs list-tasks \
--cluster "$CLUSTER" \
--service-name "$SERVICE" \
--desired-status STOPPED \
--region "$REGION" \
--query 'taskArns[:3]' \
--output text)
if [ -n "$STOPPED" ]; then
aws ecs describe-tasks \
--cluster "$CLUSTER" \
--tasks $STOPPED \
--region "$REGION" \
--query 'tasks[*].{StopReason:stoppedReason,Containers:containers[*].{Name:name,ExitCode:exitCode}}' \
--output json
else
echo "No recently stopped tasks found."
fi
echo ""
echo "=== [6] ALB Idle Timeout ==="
if [ -n "$TG_ARN" ] && [ "$TG_ARN" != "None" ]; then
ALB_ARN=$(aws elbv2 describe-target-groups \
--target-group-arns "$TG_ARN" \
--region "$REGION" \
--query 'TargetGroups[0].LoadBalancerArns[0]' \
--output text)
aws elbv2 describe-load-balancer-attributes \
--load-balancer-arn "$ALB_ARN" \
--region "$REGION" \
--query 'Attributes[?Key==`idle_timeout.timeout_seconds`]' \
--output table
fi
echo ""
echo "=== [7] Health Check Config ==="
if [ -n "$TG_ARN" ] && [ "$TG_ARN" != "None" ]; then
aws elbv2 describe-target-groups \
--target-group-arns "$TG_ARN" \
--region "$REGION" \
--query 'TargetGroups[0].{Path:HealthCheckPath,Port:HealthCheckPort,Interval:HealthCheckIntervalSeconds,Timeout:HealthCheckTimeoutSeconds,HealthyThreshold:HealthyThresholdCount,UnhealthyThreshold:UnhealthyThresholdCount,Matcher:Matcher.HttpCode}' \
--output table
fi
echo ""
echo "=== DIAGNOSTIC COMPLETE ==="
echo "Next steps:"
echo " - If targets UNHEALTHY with Timeout: check ECS task security group allows ALB SG on container port"
echo " - If targets UNHEALTHY with HealthCheckFailed: fix health check path/port in target group"
echo " - If exit code 137: increase task memory limit and redeploy"
echo "  - If all targets healthy: check app logs for errors and ALB idle timeout vs app response time"
Error Medic Editorial
The Error Medic Editorial team consists of senior DevOps engineers, SREs, and cloud architects with hands-on experience managing production workloads on AWS, GCP, and Azure. Our guides are written from real incident postmortems and production debugging sessions — not documentation rewrites. We specialize in AWS ECS, Kubernetes, observability tooling, and infrastructure-as-code best practices.
Sources
- https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/troubleshooting.html
- https://repost.aws/knowledge-center/ecs-load-balancer-connection-error
- https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-load-balancing.html
- https://docs.aws.amazon.com/elasticloadbalancing/latest/application/target-group-health-checks.html