Error Medic

Troubleshooting AWS ALB 502 Bad Gateway and 504 Gateway Timeout Errors

A comprehensive guide to fixing AWS ALB 502 Bad Gateway and 504 Gateway Timeout errors: root causes, diagnostic steps, and actionable fixes for your backend targets.

Key Takeaways
  • 502 Bad Gateway usually means the target actively closed the connection or returned a malformed response.
  • 504 Gateway Timeout means the target failed to respond before the ALB's idle timeout expired.
  • Keep-Alive timeout mismatches between the ALB and the backend target are the #1 cause of intermittent 502s.
  • Always check ALB access logs and target application logs (e.g., Nginx, Tomcat, Node.js) to pinpoint the exact failure point.
Common Root Causes and Fix Approaches Compared
| Error Code | Root Cause | Fix Approach | Resolution Time |
| --- | --- | --- | --- |
| 502 Bad Gateway | Target closed connection prematurely (keep-alive mismatch) | Increase the target's keep-alive timeout above the ALB's idle timeout | Fast (< 15 mins) |
| 502 Bad Gateway | Target returned a malformed HTTP response | Fix application code or web server configuration to return valid HTTP/1.1 headers | Medium (requires code/config change) |
| 504 Gateway Timeout | Target processing took longer than the ALB idle timeout | Optimize application performance or increase the ALB idle timeout | Medium to slow |
| 504 Gateway Timeout | Network ACL or security group blocking traffic from ALB to target | Update SG/NACL rules to allow traffic on the target port from the ALB's subnets | Fast (< 10 mins) |

Understanding ALB 502 and 504 Errors

When operating applications behind an AWS Application Load Balancer (ALB), encountering 502 Bad Gateway and 504 Gateway Timeout errors is a common rite of passage. While they look similar to an end user, they indicate fundamentally different interactions between the ALB and your backend targets (EC2 instances, ECS tasks, or Lambda functions).

The Difference: 502 vs. 504

  • 502 Bad Gateway: The ALB successfully established a connection with the target, but the target either returned a malformed response or, more commonly, closed the TCP connection before the ALB could send the request or read the response. From the ALB's perspective, the target misbehaved.
  • 504 Gateway Timeout: The ALB established a connection and sent the request, but the target failed to send a complete response before a configured timer expired. This is usually the ALB's idle timeout setting. The ALB essentially gave up waiting.

Deep Dive: Fixing 502 Bad Gateway

The most notorious cause of intermittent 502 errors is a mismatch in the TCP Keep-Alive timeout settings between the ALB and the backend web server (like Nginx, Apache, Node.js, or Tomcat).

The Keep-Alive Race Condition

By default, an ALB has an idle timeout of 60 seconds. It uses persistent connections (keep-alives) to communicate with backend targets to improve performance.

If your backend web server has a keep-alive timeout shorter than the ALB's idle timeout (Nginx's 75s default is usually safe against a 60s ALB, but Node.js defaults to just 5s), a race condition occurs:

  1. The ALB maintains an open idle connection to the target.
  2. The target's shorter keep-alive timer expires. The target decides to close the connection and sends a FIN packet.
  3. At the exact same millisecond, a new client request arrives at the ALB.
  4. The ALB routes this request down the connection it thinks is still open.
  5. The target receives a request on a connection it is in the process of closing. It responds with an RST (Reset) packet.
  6. The ALB receives the RST and serves a 502 Bad Gateway to the client.
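Before adjusting anything, it helps to confirm the ALB side of the equation. The current idle timeout is readable from the AWS CLI (the load balancer ARN below is a placeholder):

```shell
# Read the ALB's current idle timeout in seconds (ARN is a placeholder)
aws elbv2 describe-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188 \
  --query 'Attributes[?Key==`idle_timeout.timeout_seconds`].Value' \
  --output text
```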

The Fix: Aligning Timeouts

Rule of thumb: The keep-alive timeout of your backend target must be greater than the idle timeout of your ALB.

If your ALB idle timeout is 60 seconds:

  • Nginx: Set keepalive_timeout 65; in nginx.conf.
  • Apache: Set KeepAliveTimeout 65 in httpd.conf.
  • Node.js: server.keepAliveTimeout = 65000; server.headersTimeout = 66000;

Deep Dive: Fixing 504 Gateway Timeout

A 504 error almost always means your backend is too slow, or the network is dropping packets silently.

Scenario 1: Slow Application Processing

If your application involves heavy database queries, calling external slow APIs, or complex computations, it might legitimately take longer than the default 60-second ALB idle timeout to return a response.

Diagnostic Steps:

  1. Check your application performance monitoring (APM) tools (Datadog, New Relic, AWS X-Ray).
  2. Look at the target_processing_time metric in CloudWatch for your ALB. If this value approaches your ALB's idle timeout before 504s occur, the application is the bottleneck.

The Fix:

  • Short-term: Increase the ALB's idle timeout (up to 4000 seconds) in the EC2 Console -> Load Balancers -> Attributes.
  • Long-term: Optimize your application code, add database indexes, or move long-running tasks to background queue workers (like SQS/Celery) instead of blocking the HTTP request.
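The short-term fix can also be applied from the AWS CLI rather than the console; the ARN and the new value below are placeholders:

```shell
# Raise the ALB idle timeout to 120 seconds (ARN is a placeholder)
aws elbv2 modify-load-balancer-attributes \
  --load-balancer-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/my-alb/50dc6c495c0c9188 \
  --attributes Key=idle_timeout.timeout_seconds,Value=120
```

Remember that if you raise the ALB idle timeout, the backend keep-alive timeouts discussed above must be raised with it to stay greater.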

Scenario 2: Silent Network Drops (Security Groups/NACLs)

If the ALB attempts to route traffic to a target, but a Security Group or Network ACL blocks the return traffic or drops the initial SYN packets without a rejection, the connection simply hangs until the ALB times out, resulting in a 504.

Diagnostic Steps:

  1. Verify the Target Group health checks. If health checks are failing with timeouts, network configuration is the likely culprit.
  2. Ensure the Target Security Group allows inbound traffic on the target port (e.g., 80, 8080) from the ALB's Security Group.
  3. Ensure the ephemeral ports (1024-65535) are allowed on the return path if using strict NACLs.
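The checks above can be run from the AWS CLI; the ARN, security group ID, and port are placeholders:

```shell
# 1. Check why targets are unhealthy (target group ARN is a placeholder)
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-targets/73e2d6bc24d8a067

# 2. List the target security group's inbound rules for the target port
#    (group ID and port are placeholders) and confirm the ALB's security
#    group appears as an allowed source.
aws ec2 describe-security-groups \
  --group-ids sg-0123456789abcdef0 \
  --query 'SecurityGroups[0].IpPermissions[?ToPort==`8080`]'
```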

Leveraging ALB Access Logs

ALB Access Logs are your source of truth. Enable them to stream to an S3 bucket. The logs contain specific fields that pinpoint the failure:

  • elb_status_code: The status the ALB returned to the client (e.g., 502, 504).
  • target_status_code: The status the target returned to the ALB. (If this is -, the target didn't return an HTTP response).
  • request_processing_time, target_processing_time, response_processing_time: If target_processing_time is -1, the ALB couldn't reach the target or the connection closed unexpectedly.

By querying these logs using Amazon Athena, you can isolate which specific targets are failing and exactly when the connection is breaking down.

The following query, assuming the standard alb_logs table definition from the AWS documentation, surfaces the most recent failures:

```sql
-- Example Amazon Athena query to find 502 and 504 errors in ALB access logs
-- (quote the status codes, e.g. '502', if your table defines elb_status_code as a string)
SELECT
  time,
  client_ip,
  elb_status_code,
  target_status_code,
  target_processing_time,
  request_url,
  target_port_list
FROM alb_logs
WHERE elb_status_code IN (502, 504)
ORDER BY time DESC
LIMIT 50;
```

Error Medic Editorial

The Error Medic Editorial team consists of senior DevOps engineers and Site Reliability Experts dedicated to demystifying complex cloud infrastructure issues.
