
AWS ALB 502 Bad Gateway & 504 Gateway Timeout: Complete Troubleshooting Guide

Fix AWS ALB 502 Bad Gateway and 504 timeout errors fast. Covers target health, keep-alive mismatches, idle timeout tuning, and exact CLI diagnostic commands.

Key Takeaways
  • HTTP 502 means ALB received an invalid or malformed response from your backend target — the most common triggers are all-unhealthy target groups, keep-alive connection race conditions, and malformed HTTP responses from the application
  • HTTP 504 Gateway Timeout means the ALB idle timeout (default 60 s) expired before the target sent a complete response — increase the timeout or implement async patterns for long-running work
  • Enable ALB access logs immediately to get exact error_reason codes such as Target.ResponseCodeMismatch, Target.Timeout, and Target.ConnectionError before making any configuration changes; guessing without logs wastes hours
Fix Approaches for ALB 502 and 504 Compared
| Method | When to Use | Time to Apply | Risk |
|---|---|---|---|
| Enable ALB access logs | Always — first step before any fix | 2 min | None — read-only diagnostic |
| Increase idle timeout | 504 errors; backend needs >60 s to respond | 1 min CLI | Low — affects all connections on the ALB |
| Fix Nginx keep-alive settings | Intermittent 502 under load; error_reason Target.InvalidResponse | 5–15 min + deploy | Low — Nginx reload is zero-downtime |
| Fix Node.js keepAliveTimeout | Node/Express backends returning 502 under concurrent traffic | 5 min + deploy | Low |
| Tighten health check path/matcher | Unhealthy targets; TargetHealth.Reason FailedHealthChecks | 2 min | Medium — wrong matcher marks all hosts unhealthy |
| Add async job pattern | 504 errors on operations that cannot be optimized below timeout | Days | Low risk to ALB; requires app redesign |
| Fix Lambda response format | 502 on Lambda target groups; missing statusCode field | 10 min + deploy | Low |

Understanding AWS ALB 502 and 504 Errors

An Application Load Balancer sits between clients and your backend targets (EC2 instances, ECS containers, Lambda functions, or bare IP addresses). When clients receive HTTP 502 or 504, the problem is always between the ALB and your targets — not in the ALB infrastructure itself.

HTTP 502 Bad Gateway is returned when the ALB successfully established a TCP connection to a target but received a response it could not proxy: a malformed HTTP response, an abruptly closed connection, or no response at all from a target that immediately disconnected.

HTTP 504 Gateway Timeout is returned when the ALB connected to the target but the target failed to send a complete HTTP response within the ALB idle timeout window. The default idle timeout is 60 seconds and measures silence on the wire — not total wall-clock request time.

Both errors look identical to end users but have distinct root causes and fixes. The fastest way to tell which one you are dealing with — and why — is the ALB access logs.

Step 1: Enable ALB Access Logs and Read error_reason

Access logs are disabled by default. Enable them before doing anything else — every 502 and 504 line includes an error_reason field that names the exact failure mode.

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
  --attributes Key=access_logs.s3.enabled,Value=true \
              Key=access_logs.s3.bucket,Value=my-alb-logs-bucket \
              Key=access_logs.s3.prefix,Value=alb

Once logs arrive in S3, filter for 502 and 504 entries. Access logs are space-delimited with quoted compound fields; the error_reason field appears near the end of each line (the current log format appends a few classification fields after it).
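The quoted compound fields (request, user agent) mean a plain space split will misalign columns; a minimal quote-aware parser sketch in Python, assuming the currently documented field order (an assumption — older log formats have fewer trailing fields):

```python
import shlex

# ALB access-log field names in documented order (assumption: current
# format; older entries may lack the trailing classification fields).
ALB_FIELDS = [
    "type", "time", "elb", "client_port", "target_port",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol", "target_group_arn", "trace_id",
    "domain_name", "chosen_cert_arn", "matched_rule_priority",
    "request_creation_time", "actions_executed", "redirect_url",
    "error_reason", "target_port_list", "target_status_code_list",
    "classification", "classification_reason",
]

def parse_alb_log_line(line: str) -> dict:
    """Split one access-log line (quote-aware) into named fields."""
    return dict(zip(ALB_FIELDS, shlex.split(line)))

def gateway_errors(lines):
    """Yield (elb_status_code, error_reason) for 502/504 entries."""
    for line in lines:
        entry = parse_alb_log_line(line)
        if entry.get("elb_status_code") in ("502", "504"):
            yield entry["elb_status_code"], entry.get("error_reason", "-")
```

Feed it decompressed lines, e.g. from `gzip.open(path, "rt")`, and tally the second element of each pair.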

Common error_reason codes for 502 Bad Gateway:

| error_reason | Root Cause |
|---|---|
| Target.ResponseCodeMismatch | Health check returned a code outside the configured matcher range |
| Target.FailedHealthChecks | All targets unhealthy; ALB has nowhere to route |
| Target.InvalidResponse | Backend returned a malformed HTTP response (bad headers, wrong HTTP version) |
| Target.ConnectionError | ALB could not establish TCP to the target (security group, process not listening) |
| Target.Timeout | Target connected but did not return response headers within timeout |

Common codes for 504 Gateway Timeout:

| error_reason | Root Cause |
|---|---|
| Target.Timeout | Target exceeded the ALB idle timeout |
| Target.ConnectionRefused | Target port closed or firewall blocking ALB health-check or data path |

Step 2: Check Target Group Health

The most common cause of sustained 502 errors in production is a fully unhealthy target group. When every registered target is unhealthy, the ALB fails open and routes requests to all of them anyway — and if those targets genuinely cannot serve traffic, every request surfaces as 502 (a target group with no registered targets at all returns 503 instead).

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/def456 \
  --query 'TargetHealthDescriptions[*].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Desc:TargetHealth.Description}' \
  --output table

Key TargetHealth.Reason values:

  • Target.FailedHealthChecks — Your /health endpoint is not returning the expected HTTP status code, or the response is arriving after the health check timeout.
  • Target.NotRegistered — The target deregistered (ASG scale-in, manual removal). Re-register or verify your ASG lifecycle hooks.
  • Target.NotInUse — Target is in an Availability Zone the ALB is not enabled for. Enable the AZ on the ALB or enable cross-zone load balancing.
  • Elb.InternalError — AWS-side issue. Open a support case.

Verify your health check configuration matches what the application actually serves:

aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/def456 \
  --query 'TargetGroups[0].{Path:HealthCheckPath,Port:HealthCheckPort,Matcher:Matcher,Timeout:HealthCheckTimeoutSeconds,Interval:HealthCheckIntervalSeconds}'

If the application takes 8 seconds to start responding to health checks but HealthCheckTimeoutSeconds is 5, every target will be marked unhealthy within seconds of launch.
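How quickly a target flips state follows directly from the interval and threshold settings; a small sketch of that arithmetic (assuming consecutive failed or passed checks, which is how the thresholds are evaluated):

```python
def seconds_to_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Worst-case time for a failing target to be marked unhealthy:
    it must fail `unhealthy_threshold` consecutive checks, spaced
    `interval_s` apart."""
    return interval_s * unhealthy_threshold

def seconds_to_healthy(interval_s: int, healthy_threshold: int) -> int:
    """Time for a recovering target to be marked healthy again."""
    return interval_s * healthy_threshold

# With the ALB defaults (30 s interval, unhealthy threshold 2,
# healthy threshold 5): a broken target keeps receiving traffic for
# up to 60 s before being pulled, and a recovered one waits roughly
# 150 s before taking traffic again.
```

This is why a too-short HealthCheckTimeoutSeconds is so destructive: one slow startup phase is enough to burn through the unhealthy threshold.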

Step 3: Fix Keep-Alive Connection Race Conditions (502)

Intermittent 502 errors under load — especially errors that appear on roughly 1–5% of requests and cannot be reproduced on a single request — are usually caused by a keep-alive mismatch between ALB and the backend.

The race condition: ALB uses HTTP/1.1 persistent connections to targets and aggressively reuses them. If your backend closes the connection immediately after sending a response (sending Connection: close) at the exact moment ALB sends the next request on that connection, ALB receives a reset TCP connection mid-request and emits 502.

Nginx upstream fix:

upstream backend {
    server 127.0.0.1:8080;
    keepalive 32;           # pool of idle keep-alive connections to the app upstream
    keepalive_timeout 60s;  # how long idle upstream connections stay open (nginx >= 1.15.3)
}

server {
    # The ALB-facing keep-alive must exceed the ALB idle timeout (default 60 s)
    # so that the ALB, not Nginx, closes idle connections. Nginx's default is
    # 75 s; the race appears when this has been lowered below the ALB timeout.
    keepalive_timeout 75s;

    location / {
        proxy_pass         http://backend;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";  # strip hop-by-hop header to enable keep-alive
    }
}

Node.js / Express fix:

const app = require('express')();
const server = app.listen(3000);

// Both values MUST exceed the ALB idle timeout (default 60 s)
server.keepAliveTimeout = 65000;  // ms — time to keep idle connection open
server.headersTimeout   = 66000;  // ms — must be strictly greater than keepAliveTimeout

This is the most overlooked fix for Node.js services behind ALB and resolves the majority of intermittent 502 patterns.
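The rule generalizes to any backend: the target must never close an idle connection before the ALB does. A tiny config-check sketch of that invariant (function name is illustrative):

```python
def keep_alive_config_ok(alb_idle_ms: int,
                         keep_alive_timeout_ms: int,
                         headers_timeout_ms: int) -> bool:
    """True when the backend will never close an idle connection before
    the ALB does: backend keep-alive must exceed the ALB idle timeout,
    and headersTimeout must be strictly greater than keepAliveTimeout."""
    return (keep_alive_timeout_ms > alb_idle_ms
            and headers_timeout_ms > keep_alive_timeout_ms)

# The values from the Express example: 60 s ALB idle, 65 s / 66 s backend.
assert keep_alive_config_ok(60_000, 65_000, 66_000)

# Node's default keepAliveTimeout of 5 s violates the invariant, which is
# exactly why unmodified Node services throw intermittent 502s behind ALB.
assert not keep_alive_config_ok(60_000, 5_000, 60_000)
```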

Step 4: Increase the ALB Idle Timeout (504)

For HTTP 504 errors, start by measuring actual backend processing time from the target_processing_time field in access logs, then compare to your idle timeout setting.

# View current idle timeout
aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
  --query 'Attributes[?Key==`idle_timeout.timeout_seconds`]'

# Increase to 120 seconds
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
  --attributes Key=idle_timeout.timeout_seconds,Value=120

The maximum ALB idle timeout is 4000 seconds. However, simply increasing the timeout is a band-aid. For operations that genuinely take minutes, implement an asynchronous pattern: return HTTP 202 Accepted with a job ID immediately, and provide a polling endpoint the client can check. This eliminates timeout risk entirely regardless of processing duration.

For streaming responses: If your application can send partial data (chunked transfer encoding), each sent chunk resets the idle timer. This allows indefinitely long streaming responses without triggering 504.
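The 202-plus-polling pattern described above can be sketched with only the Python standard library (endpoint paths and the in-memory job store are illustrative, not a production design):

```python
import json
import threading
import time
import uuid
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

JOBS = {}  # job_id -> {"status": ..., "result": ...}; illustrative in-memory store

def run_job(job_id):
    time.sleep(0.5)  # stand-in for work that would exceed the ALB idle timeout
    JOBS[job_id] = {"status": "done", "result": 42}

class Handler(BaseHTTPRequestHandler):
    def _send_json(self, code, payload):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        if self.path != "/jobs":
            return self._send_json(404, {"error": "not found"})
        job_id = uuid.uuid4().hex
        JOBS[job_id] = {"status": "pending"}
        threading.Thread(target=run_job, args=(job_id,), daemon=True).start()
        # 202 Accepted goes back immediately, well inside any ALB timeout.
        self._send_json(202, {"job_id": job_id})

    def do_GET(self):
        job = JOBS.get(self.path.rsplit("/", 1)[-1])
        self._send_json(200 if job else 404, job or {"status": "unknown"})

    def log_message(self, *args):  # keep the sketch quiet
        pass
```

The client POSTs to /jobs, receives the job ID instantly, then polls GET /jobs/<id> until the status is "done" — the ALB never has to hold a long-running request open.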

Step 5: Verify Security Groups and Port Reachability

Security group misconfigurations cause Target.ConnectionError (502) or connection refusals. The ALB must have outbound access to targets and targets must allow inbound from the ALB — using security group references, not static IPs.

# Get the ALB's security group(s)
aws elbv2 describe-load-balancers \
  --load-balancer-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
  --query 'LoadBalancers[0].SecurityGroups'

# Get the target's security group(s) and check inbound rules
aws ec2 describe-security-groups \
  --group-ids sg-TARGET_SG_ID \
  --query 'SecurityGroups[0].IpPermissions'

The target security group inbound rules must reference the ALB security group ID (e.g., sg-xxxxxxxx) as the source — not the ALB's IP addresses, which change during scale events.
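To separate "a security group blocks the path" from "nothing is listening", run a plain TCP probe from a host inside the target's subnet (host and port below are placeholders):

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Attempt a TCP connect; True means something is listening and the
    network path (security groups, NACLs, local firewall) allows it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_reachable("10.0.1.7", 8080) from a bastion in the target's
# subnet — the IP and port here are illustrative.
```

A refused or timed-out probe from inside the VPC points at the process or host firewall; a probe that succeeds locally but fails from the ALB points at security group rules.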

Step 6: Lambda Target Group 502 Errors

When the target type is lambda, HTTP 502 has additional causes specific to Lambda:

  • Invalid response format: Lambda must return a JSON object with statusCode (integer), headers (object), and body (string). A missing or null statusCode causes ALB to return 502.
  • Payload size limit: ALB response payloads from Lambda are capped at 1 MB. Larger responses produce 502.
  • Lambda throttling: When Lambda concurrency is exhausted, ALB cannot invoke the function and returns 502. Check the Throttles metric in CloudWatch for the function.

Verify your Lambda handler returns the correct shape:

def handler(event, context):
    return {
        "statusCode": 200,              # required — must be an integer
        "statusDescription": "200 OK",
        "isBase64Encoded": False,
        "headers": {"Content-Type": "application/json"},
        "body": '{"status": "ok"}'
    }

Step 7: Monitor with CloudWatch Alarms

Once the immediate issue is resolved, set up proactive CloudWatch alarms so you catch regressions before users do:

aws cloudwatch put-metric-alarm \
  --alarm-name "ALB-502-Spike" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --statistic Sum \
  --period 60 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=LoadBalancer,Value=app/my-alb/abc123 \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts

Key metrics to track: HTTPCode_Target_5XX_Count, UnHealthyHostCount, TargetResponseTime (P99), and TargetConnectionErrorCount. A rising TargetResponseTime P99 is an early warning sign for 504 errors before they begin breaching the timeout threshold.

Complete Diagnostic Script

#!/usr/bin/env bash
# AWS ALB 502/504 Diagnostic Script
# Usage: ALB_ARN="arn:aws:..." TG_ARN="arn:aws:..." bash alb-debug.sh

set -euo pipefail
REGION="${AWS_DEFAULT_REGION:-us-east-1}"

echo "=== ALB Attributes ==="
aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn "$ALB_ARN" \
  --query 'Attributes[?contains(`["idle_timeout.timeout_seconds","access_logs.s3.enabled"]`, Key)]' \
  --output table

echo ""
echo "=== Target Group Health ==="
aws elbv2 describe-target-health \
  --target-group-arn "$TG_ARN" \
  --query 'TargetHealthDescriptions[*].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason}' \
  --output table

echo ""
echo "=== Health Check Configuration ==="
aws elbv2 describe-target-groups \
  --target-group-arns "$TG_ARN" \
  --query 'TargetGroups[0].{Path:HealthCheckPath,Port:HealthCheckPort,Matcher:Matcher.HttpCode,Timeout:HealthCheckTimeoutSeconds,Interval:HealthCheckIntervalSeconds,Threshold:HealthyThresholdCount}' \
  --output table

echo ""
echo "=== CloudWatch: 5xx counts last 30 min ==="
START=$(date -u -d '30 minutes ago' '+%Y-%m-%dT%H:%M:%SZ')  # GNU date; on BSD/macOS use: date -u -v-30M '+%Y-%m-%dT%H:%M:%SZ'
END=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
LB_SUFFIX=$(echo "$ALB_ARN" | sed 's|.*:loadbalancer/||')

for METRIC in HTTPCode_Target_5XX_Count HTTPCode_ELB_5XX_Count UnHealthyHostCount TargetConnectionErrorCount; do
  echo -n "$METRIC: "
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name "$METRIC" \
    --dimensions Name=LoadBalancer,Value="$LB_SUFFIX" \
    --start-time "$START" --end-time "$END" \
    --period 1800 --statistics Sum \
    --query 'Datapoints[0].Sum' \
    --output text
done

echo ""
echo "=== Recent access log 502/504 errors (requires jq + S3 log access) ==="
echo "Run: aws s3 cp s3://YOUR-BUCKET/alb-logs/$(date +%Y/%m/%d)/ . --recursive"
echo "Then: grep -E ' (502|504) ' *.log | grep -oE '(Target|Elb)\.[A-Za-z]+' | sort | uniq -c | sort -rn"

Error Medic Editorial

Error Medic Editorial is a team of senior SREs and cloud architects with combined experience operating high-traffic systems on AWS, GCP, and Azure. We write from production post-mortems, not documentation summaries.
