
AWS ALB 502 Bad Gateway & 504 Gateway Timeout: Complete Troubleshooting Guide

Fix AWS ALB 502 Bad Gateway and 504 timeout errors fast. Covers target health, keep-alive mismatches, idle timeout tuning, and exact CLI diagnostic commands.

Key Takeaways
  • HTTP 502 means ALB received an invalid or malformed response from your backend target — the most common triggers are all-unhealthy target groups, keep-alive connection race conditions, and malformed HTTP responses from the application
  • HTTP 504 Gateway Timeout means the ALB idle timeout (default 60 s) expired before the target sent a complete response — increase the timeout or implement async patterns for long-running work
  • Enable ALB access logs immediately to get exact error_reason codes such as Target.ResponseCodeMismatch, Target.Timeout, and Target.ConnectionError before making any configuration changes; guessing without logs wastes hours
Fix Approaches for ALB 502 and 504 Compared
| Method | When to Use | Time to Apply | Risk |
|---|---|---|---|
| Enable ALB access logs | Always — first step before any fix | 2 min | None — read-only diagnostic |
| Increase idle timeout | 504 errors; backend needs >60 s to respond | 1 min CLI | Low — affects all connections on the ALB |
| Fix Nginx keep-alive settings | Intermittent 502 under load; error_reason Target.InvalidResponse | 5–15 min + deploy | Low — Nginx reload is zero-downtime |
| Fix Node.js keepAliveTimeout | Node/Express backends returning 502 under concurrent traffic | 5 min + deploy | Low |
| Tighten health check path/matcher | Unhealthy targets; TargetHealth.Reason FailedHealthChecks | 2 min | Medium — wrong matcher marks all hosts unhealthy |
| Add async job pattern | 504 errors on operations that cannot be optimized below timeout | Days | Low risk to ALB; requires app redesign |
| Fix Lambda response format | 502 on Lambda target groups; missing statusCode field | 10 min + deploy | Low |

Understanding AWS ALB 502 and 504 Errors

An Application Load Balancer sits between clients and your backend targets (EC2 instances, ECS containers, Lambda functions, or bare IP addresses). When clients receive HTTP 502 or 504, the problem is always between the ALB and your targets — not in the ALB infrastructure itself.

HTTP 502 Bad Gateway is returned when the ALB successfully established a TCP connection to a target but received a response it could not proxy: a malformed HTTP response, an abruptly closed connection, or no response at all from a target that immediately disconnected.

HTTP 504 Gateway Timeout is returned when the ALB connected to the target but the target failed to send a complete HTTP response within the ALB idle timeout window. The default idle timeout is 60 seconds and measures silence on the wire — not total wall-clock request time.

Both errors look identical to end users but have distinct root causes and fixes. The fastest way to tell which one you are dealing with — and why — is the ALB access logs.

Step 1: Enable ALB Access Logs and Read error_reason

Access logs are disabled by default. Enable them before doing anything else — every 502 and 504 line includes an error_reason field that names the exact failure mode.

aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
  --attributes Key=access_logs.s3.enabled,Value=true \
              Key=access_logs.s3.bucket,Value=my-alb-logs-bucket \
              Key=access_logs.s3.prefix,Value=alb

Once logs arrive in S3, filter for 502 and 504 entries. Access logs are space-delimited with quoted compound fields; the error_reason field appears near the end of each line (the current log format appends a few classification fields after it).
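The quoted compound fields (request, user agent) mean a plain space split will misalign columns; a minimal quote-aware parser sketch in Python, assuming the currently documented field order (an assumption — older log formats have fewer trailing fields):

```python
import shlex

# ALB access-log field names in documented order (assumption: current
# format; older entries may lack the trailing classification fields).
ALB_FIELDS = [
    "type", "time", "elb", "client_port", "target_port",
    "request_processing_time", "target_processing_time",
    "response_processing_time", "elb_status_code", "target_status_code",
    "received_bytes", "sent_bytes", "request", "user_agent",
    "ssl_cipher", "ssl_protocol", "target_group_arn", "trace_id",
    "domain_name", "chosen_cert_arn", "matched_rule_priority",
    "request_creation_time", "actions_executed", "redirect_url",
    "error_reason", "target_port_list", "target_status_code_list",
    "classification", "classification_reason",
]

def parse_alb_log_line(line: str) -> dict:
    """Split one access-log line (quote-aware) into named fields."""
    return dict(zip(ALB_FIELDS, shlex.split(line)))

def gateway_errors(lines):
    """Yield (elb_status_code, error_reason) for 502/504 entries."""
    for line in lines:
        entry = parse_alb_log_line(line)
        if entry.get("elb_status_code") in ("502", "504"):
            yield entry["elb_status_code"], entry.get("error_reason", "-")
```

Feed it decompressed lines, e.g. from `gzip.open(path, "rt")`, and tally the second element of each pair.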

Common error_reason codes for 502 Bad Gateway:

| error_reason | Root Cause |
|---|---|
| Target.ResponseCodeMismatch | Health check returned a code outside the configured matcher range |
| Target.FailedHealthChecks | All targets unhealthy; ALB has nowhere to route |
| Target.InvalidResponse | Backend returned a malformed HTTP response (bad headers, wrong HTTP version) |
| Target.ConnectionError | ALB could not establish TCP to the target (security group, process not listening) |
| Target.Timeout | Target connected but did not return response headers within timeout |

Common codes for 504 Gateway Timeout:

| error_reason | Root Cause |
|---|---|
| Target.Timeout | Target exceeded the ALB idle timeout |
| Target.ConnectionRefused | Target port closed or firewall blocking ALB health-check or data path |

Step 2: Check Target Group Health

The most common cause of sustained 502 errors in production is a fully unhealthy target group. When every registered target is unhealthy, the ALB fails open and routes requests to all of them anyway — and if those targets genuinely cannot serve traffic, every request surfaces as 502 (a target group with no registered targets at all returns 503 instead).

aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/def456 \
  --query 'TargetHealthDescriptions[*].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason,Desc:TargetHealth.Description}' \
  --output table

Key TargetHealth.Reason values:

  • Target.FailedHealthChecks — Your /health endpoint is not returning the expected HTTP status code, or the response is arriving after the health check timeout.
  • Target.NotRegistered — The target deregistered (ASG scale-in, manual removal). Re-register or verify your ASG lifecycle hooks.
  • Target.NotInUse — Target is in an Availability Zone the ALB is not enabled for. Enable the AZ on the ALB or enable cross-zone load balancing.
  • Elb.InternalError — AWS-side issue. Open a support case.

Verify your health check configuration matches what the application actually serves:

aws elbv2 describe-target-groups \
  --target-group-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/def456 \
  --query 'TargetGroups[0].{Path:HealthCheckPath,Port:HealthCheckPort,Matcher:Matcher,Timeout:HealthCheckTimeoutSeconds,Interval:HealthCheckIntervalSeconds}'

If the application takes 8 seconds to start responding to health checks but HealthCheckTimeoutSeconds is 5, every target will be marked unhealthy within seconds of launch.
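How quickly a target flips state follows directly from the interval and threshold settings; a small sketch of that arithmetic (assuming consecutive failed or passed checks, which is how the thresholds are evaluated):

```python
def seconds_to_unhealthy(interval_s: int, unhealthy_threshold: int) -> int:
    """Worst-case time for a failing target to be marked unhealthy:
    it must fail `unhealthy_threshold` consecutive checks, spaced
    `interval_s` apart."""
    return interval_s * unhealthy_threshold

def seconds_to_healthy(interval_s: int, healthy_threshold: int) -> int:
    """Time for a recovering target to be marked healthy again."""
    return interval_s * healthy_threshold

# With the ALB defaults (30 s interval, unhealthy threshold 2,
# healthy threshold 5): a broken target keeps receiving traffic for
# up to 60 s before being pulled, and a recovered one waits roughly
# 150 s before taking traffic again.
```

This is why a too-short HealthCheckTimeoutSeconds is so destructive: one slow startup phase is enough to burn through the unhealthy threshold.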

Step 3: Fix Keep-Alive Connection Race Conditions (502)

Intermittent 502 errors under load — especially errors that appear on roughly 1–5% of requests and cannot be reproduced on a single request — are usually caused by a keep-alive mismatch between ALB and the backend.

The race condition: ALB uses HTTP/1.1 persistent connections to targets and aggressively reuses them. If your backend closes the connection immediately after sending a response (sending Connection: close) at the exact moment ALB sends the next request on that connection, ALB receives a reset TCP connection mid-request and emits 502.

Nginx upstream fix:

upstream backend {
    server 127.0.0.1:8080;
    keepalive 32;           # pool of idle keep-alive connections to the app upstream
    keepalive_timeout 60s;  # how long idle upstream connections stay open (nginx >= 1.15.3)
}

server {
    # The ALB-facing keep-alive must exceed the ALB idle timeout (default 60 s)
    # so that the ALB, not Nginx, closes idle connections. Nginx's default is
    # 75 s; the race appears when this has been lowered below the ALB timeout.
    keepalive_timeout 75s;

    location / {
        proxy_pass         http://backend;
        proxy_http_version 1.1;
        proxy_set_header   Connection "";  # strip hop-by-hop header to enable keep-alive
    }
}

Node.js / Express fix:

const app = require('express')();
const server = app.listen(3000);

// Both values MUST exceed the ALB idle timeout (default 60 s)
server.keepAliveTimeout = 65000;  // ms — time to keep idle connection open
server.headersTimeout   = 66000;  // ms — must be strictly greater than keepAliveTimeout

This is the most overlooked fix for Node.js services behind ALB and resolves the majority of intermittent 502 patterns.
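The rule generalizes to any backend: the target must never close an idle connection before the ALB does. A tiny config-check sketch of that invariant (function name is illustrative):

```python
def keep_alive_config_ok(alb_idle_ms: int,
                         keep_alive_timeout_ms: int,
                         headers_timeout_ms: int) -> bool:
    """True when the backend will never close an idle connection before
    the ALB does: backend keep-alive must exceed the ALB idle timeout,
    and headersTimeout must be strictly greater than keepAliveTimeout."""
    return (keep_alive_timeout_ms > alb_idle_ms
            and headers_timeout_ms > keep_alive_timeout_ms)

# The values from the Express example: 60 s ALB idle, 65 s / 66 s backend.
assert keep_alive_config_ok(60_000, 65_000, 66_000)

# Node's default keepAliveTimeout of 5 s violates the invariant, which is
# exactly why unmodified Node services throw intermittent 502s behind ALB.
assert not keep_alive_config_ok(60_000, 5_000, 60_000)
```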

Step 4: Increase the ALB Idle Timeout (504)

For HTTP 504 errors, start by measuring actual backend processing time from the target_processing_time field in access logs, then compare to your idle timeout setting.

# View current idle timeout
aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
  --query 'Attributes[?Key==`idle_timeout.timeout_seconds`]'

# Increase to 120 seconds
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
  --attributes Key=idle_timeout.timeout_seconds,Value=120

The maximum ALB idle timeout is 4000 seconds. However, simply increasing the timeout is a band-aid. For operations that genuinely take minutes, implement an asynchronous pattern: return HTTP 202 Accepted with a job ID immediately, and provide a polling endpoint the client can check. This eliminates timeout risk entirely regardless of processing duration.

For streaming responses: If your application can send partial data (chunked transfer encoding), each sent chunk resets the idle timer. This allows indefinitely long streaming responses without triggering 504.
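The 202-plus-polling pattern described above can be sketched with only the Python standard library (endpoint paths and the in-memory job store are illustrative, not a production design):

```python
import json
import threading
import time
import uuid
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

JOBS = {}  # job_id -> {"status": ..., "result": ...}; illustrative in-memory store

def run_job(job_id):
    time.sleep(0.5)  # stand-in for work that would exceed the ALB idle timeout
    JOBS[job_id] = {"status": "done", "result": 42}

class Handler(BaseHTTPRequestHandler):
    def _send_json(self, code, payload):
        body = json.dumps(payload).encode()
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        if self.path != "/jobs":
            return self._send_json(404, {"error": "not found"})
        job_id = uuid.uuid4().hex
        JOBS[job_id] = {"status": "pending"}
        threading.Thread(target=run_job, args=(job_id,), daemon=True).start()
        # 202 Accepted goes back immediately, well inside any ALB timeout.
        self._send_json(202, {"job_id": job_id})

    def do_GET(self):
        job = JOBS.get(self.path.rsplit("/", 1)[-1])
        self._send_json(200 if job else 404, job or {"status": "unknown"})

    def log_message(self, *args):  # keep the sketch quiet
        pass
```

The client POSTs to /jobs, receives the job ID instantly, then polls GET /jobs/<id> until the status is "done" — the ALB never has to hold a long-running request open.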

Step 5: Verify Security Groups and Port Reachability

Security group misconfigurations cause Target.ConnectionError (502) or connection refusals. The ALB must have outbound access to targets and targets must allow inbound from the ALB — using security group references, not static IPs.

# Get the ALB's security group(s)
aws elbv2 describe-load-balancers \
  --load-balancer-arns arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/abc123 \
  --query 'LoadBalancers[0].SecurityGroups'

# Get the target's security group(s) and check inbound rules
aws ec2 describe-security-groups \
  --group-ids sg-TARGET_SG_ID \
  --query 'SecurityGroups[0].IpPermissions'

The target security group inbound rules must reference the ALB security group ID (e.g., sg-xxxxxxxx) as the source — not the ALB's IP addresses, which change during scale events.
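To separate "a security group blocks the path" from "nothing is listening", run a plain TCP probe from a host inside the target's subnet (host and port below are placeholders):

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Attempt a TCP connect; True means something is listening and the
    network path (security groups, NACLs, local firewall) allows it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# e.g. port_reachable("10.0.1.7", 8080) from a bastion in the target's
# subnet — the IP and port here are illustrative.
```

A refused or timed-out probe from inside the VPC points at the process or host firewall; a probe that succeeds locally but fails from the ALB points at security group rules.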

Step 6: Lambda Target Group 502 Errors

When the target type is lambda, HTTP 502 has additional causes specific to Lambda:

  • Invalid response format: Lambda must return a JSON object with statusCode (integer), headers (object), and body (string). A missing or null statusCode causes ALB to return 502.
  • Payload size limit: ALB response payloads from Lambda are capped at 1 MB. Larger responses produce 502.
  • Lambda throttling: When Lambda concurrency is exhausted, ALB cannot invoke the function and returns 502. Check the Throttles metric in CloudWatch for the function.

Verify your Lambda handler returns the correct shape:

def handler(event, context):
    return {
        "statusCode": 200,              # required — must be an integer
        "statusDescription": "200 OK",
        "isBase64Encoded": False,
        "headers": {"Content-Type": "application/json"},
        "body": '{"status": "ok"}'
    }

Step 7: Monitor with CloudWatch Alarms

Once the immediate issue is resolved, set up proactive CloudWatch alarms so you catch regressions before users do:

aws cloudwatch put-metric-alarm \
  --alarm-name "ALB-502-Spike" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --statistic Sum \
  --period 60 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --dimensions Name=LoadBalancer,Value=app/my-alb/abc123 \
  --evaluation-periods 2 \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts

Key metrics to track: HTTPCode_Target_5XX_Count, UnHealthyHostCount, TargetResponseTime (P99), and TargetConnectionErrorCount. A rising TargetResponseTime P99 is an early warning sign for 504 errors before they begin breaching the timeout threshold.

Complete Diagnostic Script

#!/usr/bin/env bash
# AWS ALB 502/504 Diagnostic Script
# Usage: ALB_ARN="arn:aws:..." TG_ARN="arn:aws:..." bash alb-debug.sh

set -euo pipefail
REGION="${AWS_DEFAULT_REGION:-us-east-1}"

echo "=== ALB Attributes ==="
aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn "$ALB_ARN" \
  --query 'Attributes[?contains(`["idle_timeout.timeout_seconds","access_logs.s3.enabled"]`, Key)]' \
  --output table

echo ""
echo "=== Target Group Health ==="
aws elbv2 describe-target-health \
  --target-group-arn "$TG_ARN" \
  --query 'TargetHealthDescriptions[*].{ID:Target.Id,Port:Target.Port,State:TargetHealth.State,Reason:TargetHealth.Reason}' \
  --output table

echo ""
echo "=== Health Check Configuration ==="
aws elbv2 describe-target-groups \
  --target-group-arns "$TG_ARN" \
  --query 'TargetGroups[0].{Path:HealthCheckPath,Port:HealthCheckPort,Matcher:Matcher.HttpCode,Timeout:HealthCheckTimeoutSeconds,Interval:HealthCheckIntervalSeconds,Threshold:HealthyThresholdCount}' \
  --output table

echo ""
echo "=== CloudWatch: 5xx counts last 30 min ==="
START=$(date -u -d '30 minutes ago' '+%Y-%m-%dT%H:%M:%SZ')  # GNU date; on BSD/macOS use: date -u -v-30M '+%Y-%m-%dT%H:%M:%SZ'
END=$(date -u '+%Y-%m-%dT%H:%M:%SZ')
LB_SUFFIX=$(echo "$ALB_ARN" | sed 's|.*:loadbalancer/||')

for METRIC in HTTPCode_Target_5XX_Count HTTPCode_ELB_5XX_Count UnHealthyHostCount TargetConnectionErrorCount; do
  echo -n "$METRIC: "
  aws cloudwatch get-metric-statistics \
    --namespace AWS/ApplicationELB \
    --metric-name "$METRIC" \
    --dimensions Name=LoadBalancer,Value="$LB_SUFFIX" \
    --start-time "$START" --end-time "$END" \
    --period 1800 --statistics Sum \
    --query 'Datapoints[0].Sum' \
    --output text
done

echo ""
echo "=== Recent access log 502/504 errors (requires jq + S3 log access) ==="
echo "Run: aws s3 cp s3://YOUR-BUCKET/alb-logs/$(date +%Y/%m/%d)/ . --recursive"
echo "Then: grep -E ' (502|504) ' *.log | grep -oE '(Target|Elb)\.[A-Za-z]+' | sort | uniq -c | sort -rn"

Error Medic Editorial

Error Medic Editorial is a team of senior SREs and cloud architects with combined experience operating high-traffic systems on AWS, GCP, and Azure. We write from production post-mortems, not documentation summaries.
