Error Medic

Troubleshooting AWS ALB 502 Bad Gateway and 504 Gateway Timeout Errors

A comprehensive guide to fixing AWS ALB 502 Bad Gateway and 504 Gateway Timeout errors: root causes, diagnostic steps, and actionable fixes for your backend targets.

Key Takeaways
  • 502 Bad Gateway usually means the target actively closed the connection or returned a malformed response.
  • 504 Gateway Timeout means the target failed to respond before the ALB's idle timeout expired.
  • Keep-Alive timeout mismatches between the ALB and the backend target are the #1 cause of intermittent 502s.
  • Always check ALB access logs and target application logs (e.g., Nginx, Tomcat, Node.js) to pinpoint the exact failure point.
Common Root Causes and Fix Approaches Compared
| Error Code | Root Cause | Fix Approach | Resolution Time |
| --- | --- | --- | --- |
| 502 Bad Gateway | Target closed connection prematurely (keep-alive mismatch) | Increase the target's keep-alive timeout above the ALB's idle timeout | Fast (< 15 mins) |
| 502 Bad Gateway | Target returned a malformed HTTP response | Fix application code or web server configuration to return valid HTTP/1.1 headers | Medium (requires code/config change) |
| 504 Gateway Timeout | Target processing took longer than the ALB idle timeout | Optimize application performance or increase the ALB idle timeout | Medium to slow |
| 504 Gateway Timeout | Network ACL or security group blocking traffic from ALB to target | Update SG/NACL rules to allow traffic on the target port from the ALB's subnets | Fast (< 10 mins) |

Understanding ALB 502 and 504 Errors

When operating applications behind an AWS Application Load Balancer (ALB), encountering 502 Bad Gateway and 504 Gateway Timeout errors is a common rite of passage. While they look similar to an end user, they indicate fundamentally different interactions between the ALB and your backend targets (EC2 instances, ECS tasks, or Lambda functions).

The Difference: 502 vs. 504

  • 502 Bad Gateway: The ALB successfully established a connection with the target, but the target either returned a malformed response or, more commonly, closed the TCP connection before the ALB could send the request or read the response. From the ALB's perspective, the target misbehaved.
  • 504 Gateway Timeout: The ALB established a connection and sent the request, but the target failed to send a complete response before a configured timer expired. This is usually the ALB's idle timeout setting. The ALB essentially gave up waiting.

Deep Dive: Fixing 502 Bad Gateway

The most notorious cause of intermittent 502 errors is a mismatch in the TCP Keep-Alive timeout settings between the ALB and the backend web server (like Nginx, Apache, Node.js, or Tomcat).

The Keep-Alive Race Condition

By default, an ALB has an idle timeout of 60 seconds. It uses persistent connections (keep-alives) to communicate with backend targets to improve performance.

If your backend web server has a keep-alive timeout shorter than the ALB's idle timeout (Nginx's 75s default is usually safe against a 60s ALB, but Node.js defaults to just 5s), a race condition occurs:

  1. The ALB maintains an open idle connection to the target.
  2. The target's shorter keep-alive timer expires. The target decides to close the connection and sends a FIN packet.
  3. At the exact same millisecond, a new client request arrives at the ALB.
  4. The ALB routes this request down the connection it thinks is still open.
  5. The target receives a request on a connection it is in the process of closing. It responds with an RST (Reset) packet.
  6. The ALB receives the RST and serves a 502 Bad Gateway to the client.
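Before adjusting anything, it helps to confirm the ALB side of the equation. The current idle timeout is readable from the AWS CLI (the load balancer ARN below is a placeholder):

```shell
# Read the ALB's current idle timeout in seconds (ARN is a placeholder)
aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188 \
  --query 'Attributes[?Key==`idle_timeout.timeout_seconds`].Value' \
  --output text
```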

The Fix: Aligning Timeouts

Rule of thumb: The keep-alive timeout of your backend target must be greater than the idle timeout of your ALB.

If your ALB idle timeout is 60 seconds:

  • Nginx: Set keepalive_timeout 65; in nginx.conf.
  • Apache: Set KeepAliveTimeout 65 in httpd.conf.
  • Node.js: server.keepAliveTimeout = 65000; server.headersTimeout = 66000;

Deep Dive: Fixing 504 Gateway Timeout

A 504 error almost always means your backend is too slow, or the network is dropping packets silently.

Scenario 1: Slow Application Processing

If your application involves heavy database queries, calling external slow APIs, or complex computations, it might legitimately take longer than the default 60-second ALB idle timeout to return a response.

Diagnostic Steps:

  1. Check your application performance monitoring (APM) tools (Datadog, New Relic, AWS X-Ray).
  2. Look at the target_processing_time metric in CloudWatch for your ALB. If this value approaches your ALB's idle timeout before 504s occur, the application is the bottleneck.

The Fix:

  • Short-term: Increase the ALB's idle timeout (up to 4000 seconds) in the EC2 Console -> Load Balancers -> Attributes.
  • Long-term: Optimize your application code, add database indexes, or move long-running tasks to background queue workers (like SQS/Celery) instead of blocking the HTTP request.
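The short-term fix can also be applied from the AWS CLI rather than the console; the ARN and the new value below are placeholders:

```shell
# Raise the ALB idle timeout to 120 seconds (ARN is a placeholder)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188 \
  --attributes Key=idle_timeout.timeout_seconds,Value=120
```

Remember that if you raise the ALB idle timeout, the backend keep-alive timeouts discussed above must be raised with it to stay greater.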

Scenario 2: Silent Network Drops (Security Groups/NACLs)

If the ALB attempts to route traffic to a target, but a Security Group or Network ACL blocks the return traffic or drops the initial SYN packets without a rejection, the connection simply hangs until the ALB times out, resulting in a 504.

Diagnostic Steps:

  1. Verify the Target Group health checks. If health checks are failing with timeouts, network configuration is the likely culprit.
  2. Ensure the Target Security Group allows inbound traffic on the target port (e.g., 80, 8080) from the ALB's Security Group.
  3. Ensure the ephemeral ports (1024-65535) are allowed on the return path if using strict NACLs.
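The checks above can be run from the AWS CLI; the ARN, security group ID, and port are placeholders:

```shell
# 1. Check why targets are unhealthy (target group ARN is a placeholder)
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/73e2d6bc24d8a067

# 2. List the target security group's inbound rules for the target port
#    (group ID and port are placeholders) and confirm the ALB's security
#    group appears as an allowed source.
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[0].IpPermissions[?ToPort==`8080`]'
```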

Leveraging ALB Access Logs

ALB Access Logs are your source of truth. Enable them to stream to an S3 bucket. The logs contain specific fields that pinpoint the failure:

  • elb_status_code: The status the ALB returned to the client (e.g., 502, 504).
  • target_status_code: The status the target returned to the ALB. (If this is -, the target didn't return an HTTP response).
  • request_processing_time, target_processing_time, response_processing_time: If target_processing_time is -1, the ALB couldn't reach the target or the connection closed unexpectedly.

By querying these logs using Amazon Athena, you can isolate which specific targets are failing and exactly when the connection is breaking down.

The following query, assuming the standard alb_logs table definition from the AWS documentation, surfaces the most recent failures:

```sql
-- Example Amazon Athena query to find 502 and 504 errors in ALB access logs
-- (quote the status codes, e.g. '502', if your table defines elb_status_code as a string)
SELECT
  time,
  client_ip,
  elb_status_code,
  target_status_code,
  target_processing_time,
  request_url,
  target_port_list
FROM alb_logs
WHERE elb_status_code IN (502, 504)
ORDER BY time DESC
LIMIT 50;
```

Error Medic Editorial

The Error Medic Editorial team consists of senior DevOps engineers and Site Reliability Experts dedicated to demystifying complex cloud infrastructure issues.
