
AWS ECS 502 Bad Gateway: Complete Troubleshooting Guide

Fix AWS ECS 502 Bad Gateway errors fast. Covers health check misconfig, security group blocks, port mismatches, and timeout issues with exact CLI commands.

Key Takeaways
  • Root cause #1: Target group health checks failing because the health check path returns non-2xx, the container port is wrong, or startup grace period is too short — causing ALB to mark all targets unhealthy and return 502 to clients.
  • Root cause #2: Security group misconfiguration blocking traffic between the ALB and ECS task ENIs — the task security group must explicitly allow inbound TCP on the container port from the ALB security group.
  • Root cause #3: ALB idle timeout (default 60 s) is shorter than your application's response time, or your app closes keepalive connections before ALB does, causing ALB to receive an invalid/empty response and emit 502.
  • Root cause #4: Container crashes mid-request due to OOM kills (exit code 137) or unhandled exceptions, returning no valid HTTP response to ALB.
  • Quick fix checklist: (1) run `aws elbv2 describe-target-health` to see unhealthy reason, (2) verify container port matches target group port, (3) confirm ALB SG is allowed inbound on task SG, (4) tail CloudWatch Logs for app errors, (5) raise ALB idle_timeout if seeing timeouts.
Fix Approaches Compared
Method | When to Use | Estimated Time | Risk
Fix health check path/port in target group | Targets show 'unhealthy' with HealthCheckFailed reason | 5 min | Low — no downtime, takes effect on next check
Update ECS task SG inbound rule to allow ALB SG | Targets show 'unhealthy' with Request timed out reason and no app logs | 5 min | Low — additive change
Correct containerPort in task definition and redeploy | Port mismatch between app, task def, and target group | 10–15 min | Medium — triggers rolling deployment
Increase ALB idle_timeout.timeout_seconds | 502s appear after long-running requests or during file uploads | 2 min | Low — live attribute change
Increase ECS task memory and redeploy | Tasks stopped with exit code 137 (OOM kill) | 15–20 min | Medium — rolling deployment required
Enable ALB access logs and set deregistration delay | Intermittent 502s only during deployments | 5 min | Low — observability + drain tuning
Switch to blue/green deployment via CodeDeploy | Frequent deployments causing brief 502 spikes | 30–60 min | Low after setup — eliminates deployment 502s

Understanding AWS ECS 502 Bad Gateway

A 502 Bad Gateway response is generated by your Application Load Balancer (ALB) — not by your container. It means the ALB successfully received the client request, selected a registered target (your ECS task), forwarded the request, but got back an invalid or empty response. The ALB is the gateway; your ECS task is the bad backend.

This is distinct from:

  • 504 Gateway Timeout: ALB forwarded the request but the backend took too long to respond.
  • 503 Service Unavailable: The target group has no registered targets at all. (If targets are registered but all unhealthy, the ALB fails open and routes to them anyway, which surfaces as 502s rather than 503s.)
  • 500 Internal Server Error: Your application returned a valid HTTP response with status 500.

In AWS ECS specifically, 502s almost always originate from one of five layers: health check configuration, network/security group rules, port mapping configuration, application behavior, or timeout mismatches.
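One quick way to confirm a 502 really originates at the ALB: ALB-generated error pages carry a `Server: awselb/2.0` response header, while responses that reached your container report your app server instead. A minimal sketch (the `ALB_DNS_NAME` variable and the failing path are placeholders):

```shell
# Classify an error response by its Server header.
# ALB-generated errors report "Server: awselb/2.0"; errors that came from
# the backend show your app server (nginx, gunicorn, express, ...).
is_alb_generated() {
  # reads raw HTTP response headers on stdin
  grep -qi '^server:[[:space:]]*awselb'
}

# Usage against a live ALB (ALB_DNS_NAME is a placeholder):
#   curl -sI "http://$ALB_DNS_NAME/failing-path" | is_alb_generated \
#     && echo "ALB-origin 502" || echo "backend-origin error"
```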


Step 1: Establish Scope and Frequency

Before diving into fixes, determine whether the 502 is constant, intermittent, or deployment-triggered. This alone eliminates half the candidate causes.

# Enable ALB access logging first if not already enabled
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn $ALB_ARN \
  --attributes Key=access_logs.s3.enabled,Value=true \
               Key=access_logs.s3.bucket,Value=your-logs-bucket \
               Key=access_logs.s3.prefix,Value=alb

If access logging is already enabled, note that ALB access logs are delivered to S3, not CloudWatch Logs, so Logs Insights cannot query them; use Athena for log-level analysis. For a quick count without Athena, CloudWatch metrics already separate ALB-generated 502s from backend errors:

# Count ALB-generated 502s in the last hour.
# HTTPCode_ELB_502_Count = the ALB itself produced the 502;
# HTTPCode_Target_5XX_Count = your app returned a 5xx response.
# The dimension value is the ALB's ARN suffix (app/<name>/<id>);
# app/your-alb/1234567890abcdef below is a placeholder.
aws cloudwatch get-metric-statistics \
  --namespace AWS/ApplicationELB \
  --metric-name HTTPCode_ELB_502_Count \
  --dimensions Name=LoadBalancer,Value=app/your-alb/1234567890abcdef \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Sum

Also check ECS service events immediately — they often tell the whole story:

aws ecs describe-services \
  --cluster $CLUSTER_NAME \
  --services $SERVICE_NAME \
  --query 'services[0].events[:15]' \
  --output table

Constant 502s → likely health check, security group, or port misconfiguration. Intermittent 502s → likely timeout mismatch, OOM kills, or app-level errors. Deployment-only 502s → draining connections from deregistered tasks.


Step 2: Check Target Group Health

This is the single most productive first check. An unhealthy target group with zero healthy targets will cause 100% of requests to return 502 or 503.

# Get target group ARN for your service
aws elbv2 describe-target-groups \
  --load-balancer-arn $ALB_ARN \
  --query 'TargetGroups[*].{Name:TargetGroupName,ARN:TargetGroupArn,Port:Port,Protocol:Protocol}' \
  --output table

# Check actual health status with reasons
aws elbv2 describe-target-health \
  --target-group-arn $TG_ARN \
  --query 'TargetHealthDescriptions[*].{Target:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Description:TargetHealth.Description}' \
  --output table

Common TargetHealth.Reason values and what they mean:

Reason | Meaning | Fix
Target.FailedHealthChecks | Health check path returning non-2xx | Fix /health endpoint or update matcher
Target.Timeout | Health check request timed out | Check SG rules, increase timeout
Target.NotRegistered | Task not yet registered | Wait for registration or check task status
Target.DeregistrationInProgress | Task draining | Normal during deployment
Elb.InitialHealthChecking | Just registered, still checking | Wait
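After applying a fix, you can poll until targets recover instead of re-running the command by hand. A small loop, assuming `TG_ARN` is set as above:

```shell
# Poll target health every 10 s until no target is left in a non-healthy state.
while true; do
  UNHEALTHY=$(aws elbv2 describe-target-health \
    --target-group-arn "$TG_ARN" \
    --query 'length(TargetHealthDescriptions[?TargetHealth.State!=`healthy`])' \
    --output text)
  if [ "$UNHEALTHY" = "0" ]; then
    echo "All targets healthy."
    break
  fi
  echo "$UNHEALTHY target(s) not yet healthy; rechecking in 10 s..."
  sleep 10
done
```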

Step 3: Verify Port Mapping Configuration

A port mismatch between your application, ECS task definition, and target group is a silent killer — everything looks configured but traffic never reaches the app correctly.

# Check container port mappings in task definition
aws ecs describe-task-definition \
  --task-definition $TASK_DEFINITION_NAME \
  --query 'taskDefinition.containerDefinitions[*].{Name:name,Ports:portMappings}'

# Check what port the target group is sending traffic to
aws elbv2 describe-target-groups \
  --target-group-arns $TG_ARN \
  --query 'TargetGroups[0].{Port:Port,HealthCheckPort:HealthCheckPort,HealthCheckPath:HealthCheckPath}'

For awsvpc network mode (Fargate, or the EC2 launch type with task-level networking), the target group's target type must be ip, and the container port must match the target group port exactly. For bridge mode, traffic is routed to the host port in the port mapping, so that is the port that must line up.

If using dynamic host port mapping (hostPort: 0, bridge mode only), the target group's target type must be instance; ECS then registers each task's ephemeral host port automatically.
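A quick cross-check between the two commands above can catch a mismatch in seconds. A sketch assuming `TASK_DEFINITION_NAME` and `TG_ARN` are set; the `check_ports` helper is just for illustration:

```shell
# Compare containerPort (task definition) against the target group's port.
check_ports() {
  # $1 = containerPort, $2 = target group port
  if [ "$1" = "$2" ]; then
    echo "OK: ports match ($1)"
  else
    echo "MISMATCH: containerPort=$1 vs target group port=$2"
  fi
}

# Usage against live AWS (awsvpc mode, first container's first mapping):
#   CONTAINER_PORT=$(aws ecs describe-task-definition \
#     --task-definition "$TASK_DEFINITION_NAME" \
#     --query 'taskDefinition.containerDefinitions[0].portMappings[0].containerPort' \
#     --output text)
#   TG_PORT=$(aws elbv2 describe-target-groups --target-group-arns "$TG_ARN" \
#     --query 'TargetGroups[0].Port' --output text)
#   check_ports "$CONTAINER_PORT" "$TG_PORT"
```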


Step 4: Audit Security Groups

This is the #1 cause of Target.Timeout health check failures. The ECS task's security group must allow inbound TCP traffic from the ALB's security group on the container port.

# Get the ECS service network config
aws ecs describe-services \
  --cluster $CLUSTER_NAME \
  --services $SERVICE_NAME \
  --query 'services[0].networkConfiguration.awsvpcConfiguration'

# Check inbound rules on the ECS task security group
aws ec2 describe-security-groups \
  --group-ids $ECS_TASK_SG_ID \
  --query 'SecurityGroups[0].IpPermissions'

# Get the ALB's security group
aws elbv2 describe-load-balancers \
  --load-balancer-arns $ALB_ARN \
  --query 'LoadBalancers[0].SecurityGroups'

Add the required inbound rule if missing:

aws ec2 authorize-security-group-ingress \
  --group-id $ECS_TASK_SG_ID \
  --protocol tcp \
  --port $CONTAINER_PORT \
  --source-group $ALB_SG_ID
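To confirm the rule took effect (or was already present), list which security groups the task SG references in its inbound rules; the ALB's SG ID should appear. A sketch using the variables from the commands above:

```shell
# List all SG-to-SG references in the task SG's inbound rules and look
# for the ALB SG among them (port ranges still need a manual glance).
aws ec2 describe-security-groups \
  --group-ids "$ECS_TASK_SG_ID" \
  --query 'SecurityGroups[0].IpPermissions[].UserIdGroupPairs[].GroupId' \
  --output text | tr '\t' '\n' | grep -x "$ALB_SG_ID" \
  && echo "ALB SG is referenced" \
  || echo "ALB SG NOT referenced; expect Target.Timeout health check failures"
```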

Step 5: Review Application Logs for Errors

If health checks pass but 502s persist, the app is starting and passing health checks but failing on real requests.

# Get the log stream for the running task
TASK_ID=$(aws ecs list-tasks \
  --cluster $CLUSTER_NAME \
  --service-name $SERVICE_NAME \
  --query 'taskArns[0]' \
  --output text | awk -F/ '{print $NF}')

# Tail logs (adjust log group and container name)
aws logs get-log-events \
  --log-group-name /ecs/$SERVICE_NAME \
  --log-stream-name ecs/$CONTAINER_NAME/$TASK_ID \
  --limit 200 \
  --query 'events[*].message' \
  --output text

Look for: unhandled exceptions, segfaults, connection refused errors, or crash loops.
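With AWS CLI v2, `aws logs tail` is usually faster than paging `get-log-events` by hand, and it can pre-filter for likely error lines:

```shell
# Live-tail the service's log group; the filter pattern uses CloudWatch
# Logs syntax, where "?term" means OR across terms.
aws logs tail "/ecs/$SERVICE_NAME" \
  --since 15m \
  --follow \
  --filter-pattern "?ERROR ?Exception ?panic"
```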


Step 6: Check for OOM Kills and Task Crashes

A container killed by the kernel OOM killer (exit code 137), or one that crashes with an unhandled exception mid-request, closes the connection without returning a valid HTTP response; the ALB sees a TCP RST or a truncated response and emits a 502.

# List recently stopped tasks
aws ecs list-tasks \
  --cluster $CLUSTER_NAME \
  --service-name $SERVICE_NAME \
  --desired-status STOPPED \
  --query 'taskArns[*]' \
  --output text

# Inspect stop reason and exit code
aws ecs describe-tasks \
  --cluster $CLUSTER_NAME \
  --tasks $STOPPED_TASK_ARN \
  --query 'tasks[0].{StoppedReason:stoppedReason,Containers:containers[*].{Name:name,ExitCode:exitCode,Reason:reason}}'

If exitCode is 137, the container was OOM killed. Increase the memory (hard limit) or memoryReservation in your task definition and redeploy.
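Scripting the memory bump looks roughly like this. A sketch assuming jq is installed and a Fargate-style task definition where task-level memory is a string; "1024" MiB and the file names are arbitrary examples:

```shell
# jq filter: raise task-level memory and strip the read-only fields that
# register-task-definition rejects. "1024" (MiB) is an example value;
# pick a size that is valid for your CPU setting.
BUMP_MEMORY='.memory = "1024"
  | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
        .compatibilities, .registeredAt, .registeredBy)'

# Usage against live AWS:
#   aws ecs describe-task-definition --task-definition "$TASK_DEFINITION_NAME" \
#     --query 'taskDefinition' > taskdef.json
#   jq "$BUMP_MEMORY" taskdef.json > taskdef-new.json
#   aws ecs register-task-definition --cli-input-json file://taskdef-new.json
#   aws ecs update-service --cluster "$CLUSTER_NAME" --service "$SERVICE_NAME" \
#     --task-definition "$TASK_DEFINITION_NAME"   # picks up the new revision
```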


Step 7: Fix Timeout Mismatches

AWS ECS timeout 502 errors often occur when:

  1. ALB idle timeout (default 60 seconds) expires before the backend responds.
  2. Your application server closes keepalive connections before ALB does — ALB reuses the connection, gets a RST, and emits 502.
# Increase ALB idle timeout (live change, no redeploy needed)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn $ALB_ARN \
  --attributes Key=idle_timeout.timeout_seconds,Value=120

# Verify current value
aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn $ALB_ARN \
  --query 'Attributes[?Key==`idle_timeout.timeout_seconds`]'

In your application, set the keepalive timeout to the ALB idle timeout plus a small buffer (roughly 5 seconds). For nginx: keepalive_timeout 125s;. For a Node.js HTTP server: server.keepAliveTimeout = 125000, and raise server.headersTimeout slightly above that (e.g. 126000) so header parsing does not time out first.


Step 8: Fix Health Check Configuration

If your app's health endpoint needs time to warm up, or the path/matcher is wrong:

aws elbv2 modify-target-group \
  --target-group-arn $TG_ARN \
  --health-check-path /health \
  --health-check-interval-seconds 30 \
  --health-check-timeout-seconds 10 \
  --healthy-threshold-count 2 \
  --unhealthy-threshold-count 3 \
  --matcher HttpCode=200-299

# Also increase ECS service health check grace period for slow-starting apps
aws ecs update-service \
  --cluster $CLUSTER_NAME \
  --service $SERVICE_NAME \
  --health-check-grace-period-seconds 60

Step 9: Reduce Deployment-Induced 502s

During rolling updates, ECS deregisters old tasks while registering new ones. New requests stop being routed to a draining task, but in-flight requests can fail with 502 if the task stops before they complete. Fix this by:

  1. Tuning the deregistration delay: long enough to cover your longest request, but well under the 300 s default so deployments drain quickly.
  2. Ensuring new tasks are healthy before old ones are deregistered.
# Reduce deregistration delay (default 300s is too long)
aws elbv2 modify-target-group-attributes \
  --target-group-arn $TG_ARN \
  --attributes Key=deregistration_delay.timeout_seconds,Value=30

# Ensure rolling update keeps healthy capacity
aws ecs update-service \
  --cluster $CLUSTER_NAME \
  --service $SERVICE_NAME \
  --deployment-configuration minimumHealthyPercent=100,maximumPercent=200

For zero-downtime deployments, migrate to CodeDeploy blue/green which shifts traffic only after new tasks are fully healthy.

Full Diagnostic Script

#!/usr/bin/env bash
# AWS ECS 502 Bad Gateway - Full Diagnostic Script
# Usage: ECS_502_CLUSTER=my-cluster ECS_502_SERVICE=my-service bash diagnose-ecs-502.sh

set -euo pipefail

CLUSTER="${ECS_502_CLUSTER:-REPLACE_ME}"
SERVICE="${ECS_502_SERVICE:-REPLACE_ME}"
REGION="${AWS_REGION:-us-east-1}"

echo "=== [1] ECS Service Events (last 10) ==="
aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "$SERVICE" \
  --region "$REGION" \
  --query 'services[0].events[:10].[message]' \
  --output text

echo ""
echo "=== [2] Target Group Health ==="
TG_ARN=$(aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "$SERVICE" \
  --region "$REGION" \
  --query 'services[0].loadBalancers[0].targetGroupArn' \
  --output text)

if [ "$TG_ARN" = "None" ] || [ -z "$TG_ARN" ]; then
  echo "ERROR: No target group found for service $SERVICE"
else
  echo "Target Group: $TG_ARN"
  aws elbv2 describe-target-health \
    --target-group-arn "$TG_ARN" \
    --region "$REGION" \
    --query 'TargetHealthDescriptions[*].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Desc:TargetHealth.Description}' \
    --output table
fi

echo ""
echo "=== [3] Running Task ARNs ==="
TASK_ARNS=$(aws ecs list-tasks \
  --cluster "$CLUSTER" \
  --service-name "$SERVICE" \
  --desired-status RUNNING \
  --region "$REGION" \
  --query 'taskArns[*]' \
  --output text)
echo "$TASK_ARNS"

echo ""
echo "=== [4] Task Port Mappings ==="
TASK_DEF=$(aws ecs describe-services \
  --cluster "$CLUSTER" \
  --services "$SERVICE" \
  --region "$REGION" \
  --query 'services[0].taskDefinition' \
  --output text)
aws ecs describe-task-definition \
  --task-definition "$TASK_DEF" \
  --region "$REGION" \
  --query 'taskDefinition.containerDefinitions[*].{Container:name,Ports:portMappings}'

echo ""
echo "=== [5] Recent Stopped Tasks (exit codes) ==="
STOPPED=$(aws ecs list-tasks \
  --cluster "$CLUSTER" \
  --service-name "$SERVICE" \
  --desired-status STOPPED \
  --region "$REGION" \
  --query 'taskArns[:3]' \
  --output text)

if [ -n "$STOPPED" ]; then
  aws ecs describe-tasks \
    --cluster "$CLUSTER" \
    --tasks $STOPPED \
    --region "$REGION" \
    --query 'tasks[*].{StopReason:stoppedReason,Containers:containers[*].{Name:name,ExitCode:exitCode}}' \
    --output json
else
  echo "No recently stopped tasks found."
fi

echo ""
echo "=== [6] ALB Idle Timeout ==="
if [ -n "$TG_ARN" ] && [ "$TG_ARN" != "None" ]; then
  ALB_ARN=$(aws elbv2 describe-target-groups \
    --target-group-arns "$TG_ARN" \
    --region "$REGION" \
    --query 'TargetGroups[0].LoadBalancerArns[0]' \
    --output text)
  aws elbv2 describe-load-balancer-attributes \
    --load-balancer-arn "$ALB_ARN" \
    --region "$REGION" \
    --query 'Attributes[?Key==`idle_timeout.timeout_seconds`]' \
    --output table
fi

echo ""
echo "=== [7] Health Check Config ==="
if [ -n "$TG_ARN" ] && [ "$TG_ARN" != "None" ]; then
  aws elbv2 describe-target-groups \
    --target-group-arns "$TG_ARN" \
    --region "$REGION" \
    --query 'TargetGroups[0].{Path:HealthCheckPath,Port:HealthCheckPort,Interval:HealthCheckIntervalSeconds,Timeout:HealthCheckTimeoutSeconds,HealthyThreshold:HealthyThresholdCount,UnhealthyThreshold:UnhealthyThresholdCount,Matcher:Matcher.HttpCode}' \
    --output table
fi

echo ""
echo "=== DIAGNOSTIC COMPLETE ==="
echo "Next steps:"
echo "  - If targets UNHEALTHY with Timeout: check ECS task security group allows ALB SG on container port"
echo "  - If targets UNHEALTHY with HealthCheckFailed: fix health check path/port in target group"
echo "  - If exit code 137: increase task memory limit and redeploy"
echo "  - If all targets healthy: check app logs for errors and ALB idle timeout vs app response time"

Error Medic Editorial

The Error Medic Editorial team consists of senior DevOps engineers, SREs, and cloud architects with hands-on experience managing production workloads on AWS, GCP, and Azure. Our guides are written from real incident postmortems and production debugging sessions — not documentation rewrites. We specialize in AWS ECS, Kubernetes, observability tooling, and infrastructure-as-code best practices.
