Error Medic

Troubleshooting Azure API Timeout Errors: 504 Gateway Timeout and 502 Bad Gateway Fixes

Resolve Azure API timeout errors (504/502) quickly. Learn how to diagnose Application Gateway, API Management, and App Service timeouts with actionable fixes.

Key Takeaways
  • Azure API timeouts typically manifest as HTTP 504 Gateway Timeout or HTTP 502 Bad Gateway errors across API Management (APIM), Application Gateway, or App Services.
  • Common root causes include backend processing delays exceeding the default APIM/App Gateway timeout limits, SNAT port exhaustion, or unoptimized database queries.
  • Quick fixes involve increasing the default timeout limits via Azure CLI/Portal, implementing asynchronous patterns for long-running operations, or scaling backend resources.
Azure API Timeout Fix Approaches Compared
| Method | When to Use | Time | Risk |
| --- | --- | --- | --- |
| Increase API Management Timeout (`forward-request`) | Backend predictably takes longer than 20 seconds for specific endpoints. | 5 mins | Low if isolated; High if applied globally (risks connection pool exhaustion) |
| Scale Up/Out Backend (App Service/AKS) | CPU/memory metrics show resource exhaustion causing request queuing. | 15 mins | Low (but incurs higher billing costs) |
| Implement Asynchronous Request-Reply Pattern (202 Accepted) | Long-running operations (>120 s) such as report generation or batch processing. | Days | Medium (requires client and backend architectural changes) |
| Resolve SNAT Port Exhaustion (NAT Gateway) | Outbound connections to databases or external APIs time out under load. | 30 mins | Low |

Understanding Azure API Timeout Errors

When working with Azure's ecosystem—whether routing through Azure API Management (APIM), Azure Application Gateway, Azure Front Door, or directly hitting an Azure App Service—one of the most frustrating interruptions is the dreaded API timeout. These typically surface to the client as either an HTTP 504 Gateway Timeout or an HTTP 502 Bad Gateway error.

The anatomy of an Azure timeout is tied directly to your infrastructure topology: every hop in the networking stack enforces its own timeout. Azure Load Balancer has a default idle timeout of 4 minutes. Azure App Service has a fixed request timeout of 230 seconds (a platform limit). Azure API Management (APIM) has a default forward-request timeout of 20 seconds. If your backend takes 30 seconds to respond, it may well process the request successfully, but APIM drops the connection at the 20-second mark and returns a 504 to the client.
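These stacked limits mean the tightest hop wins. A minimal Python sketch makes that concrete (the dictionary below simply restates the defaults above; nothing here calls Azure):

```python
# Default timeouts, in seconds, for each hop discussed above.
DEFAULT_TIMEOUTS = {
    "azure_load_balancer_idle": 240,  # 4-minute idle timeout
    "app_service_request": 230,
    "apim_forward_request": 20,
}

def effective_timeout(hops: dict) -> int:
    """The tightest limit in the chain decides when the client sees a 504."""
    return min(hops.values())

print(effective_timeout(DEFAULT_TIMEOUTS))  # prints 20: APIM cuts off first
```

Raising any single hop's timeout only helps if it was the minimum: a backend that needs 120 seconds still fails until the 20-second APIM limit is also raised.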

Common Error Messages

  • From APIM: { "statusCode": 504, "message": "Gateway Timeout" }
  • From Application Gateway: 504 Gateway Time-out - The server didn't respond in time.
  • From App Service (Docker): Container [Name] didn't respond to HTTP pings on port: 8080, failing site start. See container logs for debugging.
  • From IIS Application Request Routing (ARR): The specified CGI application encountered an error and the server terminated the process.

Step 1: Diagnose the Exact Bottleneck

Before modifying infrastructure, you must identify where the timeout is occurring. Is it the client dropping the connection? The API Gateway? Or the backend database locking up?

Using Azure Application Insights

If you have Application Insights enabled, navigate to the Performance or Failures blade. Look for the Dependency execution times. If your API is timing out, it's highly likely a downstream dependency (like an Azure SQL database query or a third-party REST API call) is dragging the response time down.

Query Log Analytics to find the exact request durations to pinpoint the tier causing the issue:

requests
| where resultCode == "504" or resultCode == "502"
| project timestamp, operation_Name, duration, resultCode, client_IP
| order by duration desc

Diagnosing SNAT Port Exhaustion

If your App Service makes numerous outbound calls (e.g., to an external API or database) without connection pooling, you may be hitting SNAT (Source Network Address Translation) port exhaustion. This prevents new outbound connections from being established, resulting in a timeout. In the portal, go to your App Service -> Diagnose and solve problems -> SNAT Port Exhaustion to verify whether port allocation has reached its maximum.

Step 2: Implement the Fix

Once you have identified the bottleneck, apply the corresponding fix. Be aware that increasing timeouts is often treating the symptom; optimizing backend performance treats the disease.

Fix A: Increase the APIM Timeout Policy

If your backend requires more than 20 seconds and it is a legitimate architectural requirement (for example, a legacy system that cannot be easily optimized), you can increase the timeout limit in the APIM policy using the <forward-request> element.

Navigate to APIM -> Select your API -> Design -> Inbound processing -> Code View:

<policies>
    <inbound>
        <base />
    </inbound>
    <backend>
        <!-- Increase timeout to 120 seconds -->
        <forward-request timeout="120" />
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>

Warning: Do not set this globally. Apply it only to the specific operations that require it to prevent blocking the APIM thread pool and causing cascading failures across your other APIs.

Fix B: Application Gateway Request Timeout

If you are using Azure Application Gateway, the default request routing timeout is 20 seconds. If your backend VMs or App Services take longer to compute the response, you must update the HTTP settings associated with your routing rules.

Navigate to Application Gateway -> HTTP settings -> Select your setting -> update the Request timeout (seconds) field to a higher value (e.g., 120). You can also do this via the Azure CLI or Terraform for infrastructure-as-code deployments.

Fix C: Addressing App Service 230-Second Limit

Azure App Service has a hard limit that cannot be raised: the Azure Load Balancer drops any connection that stays idle for 230 seconds. If a request takes longer than roughly 3.8 minutes, you cannot simply "increase the timeout"; you must fundamentally change the architecture.

The Asynchronous Request-Reply Pattern: Instead of holding the HTTP connection open while processing a massive file or report, refactor your API to return an HTTP 202 Accepted immediately, along with a Location header pointing to a status endpoint.

  1. Client sends POST /api/reports.
  2. API queues a background job (using Azure Service Bus, RabbitMQ, or Azure Storage Queues) and immediately returns 202 Accepted with Location: /api/reports/status/123.
  3. Worker (e.g., Azure Functions or a WebJob) processes the queue message independently.
  4. Client polls /api/reports/status/123 every few seconds until it returns 200 OK with the final payload or a download link.
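The four steps above can be sketched in-process in Python. This is a simplified stand-in, not production code: a dict replaces the durable queue and status store (Azure Service Bus or Storage Queues), a thread replaces the separate worker (an Azure Function or WebJob), and the route paths and result URL are illustrative.

```python
import threading
import time
import uuid

jobs = {}  # stand-in for a durable status store

def submit_report():
    """POST /api/reports: queue the job, return 202 Accepted immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"state": "queued", "result": None}
    threading.Thread(target=_worker, args=(job_id,), daemon=True).start()
    return 202, {"Location": f"/api/reports/status/{job_id}"}

def _worker(job_id):
    """Background worker: runs the long operation off the HTTP request path."""
    time.sleep(0.1)  # simulate slow report generation
    jobs[job_id] = {"state": "done", "result": "https://example.com/report.pdf"}

def poll_status(job_id):
    """GET /api/reports/status/{id}: 202 while running, 200 when finished."""
    job = jobs[job_id]
    if job["state"] == "done":
        return 200, {"result": job["result"]}
    return 202, {"state": job["state"]}

# Client flow: submit, then poll until the job completes.
status, headers = submit_report()
job_id = headers["Location"].rsplit("/", 1)[-1]
code, body = poll_status(job_id)
while code != 200:
    time.sleep(0.05)
    code, body = poll_status(job_id)
```

The key property is that the initial HTTP exchange finishes in milliseconds, so no gateway timeout in the chain is ever in play; only the cheap polling requests traverse the stack repeatedly.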

Fix D: Optimizing Backend Dependencies and Connection Pooling

If the timeout is caused by a slow database query or resource exhaustion, increasing the API timeout is merely a band-aid. Consider these optimizations:

  • Database Tuning: Check for missing database indexes or locking issues in Azure SQL.
  • Connection Pooling: Ensure your application uses singletons for HTTP Clients (HttpClient in .NET, requests.Session() in Python) to prevent socket starvation.
  • Asynchronous Code: Ensure you are using async/await throughout your entire application stack to prevent thread-blocking under high concurrent load.
  • Caching: Implement a caching layer using Azure Cache for Redis to serve frequently requested data in milliseconds rather than querying the primary database repeatedly.
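To illustrate the connection-pooling bullet, here is a minimal Python sketch of the singleton idea (the host name is hypothetical, and constructing the object does not open a socket): every caller shares one client per host, so the keep-alive socket, and with it the SNAT port, is reused rather than reallocated per request.

```python
import functools
import http.client

@functools.lru_cache(maxsize=None)
def get_connection(host):
    # One cached client per host. All callers receive the same instance,
    # so the underlying keep-alive socket is reused instead of a new
    # socket (and a new SNAT port) being consumed on every request.
    return http.client.HTTPSConnection(host, timeout=10)

first = get_connection("api.example.com")   # hypothetical host
second = get_connection("api.example.com")
assert first is second  # same pooled client, no extra SNAT port
```

In .NET the same idea is a single static HttpClient; in Python production code, a module-level requests.Session() serves the same purpose.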

Useful Azure CLI Commands

# Check current Application Gateway HTTP Settings timeout
az network application-gateway http-settings show \
  --gateway-name MyGateway \
  --resource-group MyResourceGroup \
  --name MyHttpSettings \
  --query "requestTimeout"

# Increase Application Gateway timeout to 120 seconds
az network application-gateway http-settings update \
  --gateway-name MyGateway \
  --resource-group MyResourceGroup \
  --name MyHttpSettings \
  --request-timeout 120

# Query Log Analytics via CLI to find 504 Timeout occurrences
az monitor log-analytics query \
  --workspace-id <your-workspace-id> \
  --analytics-query "requests | where resultCode == '504' | summarize count() by operation_Name"

Error Medic Editorial

A dedicated team of Senior Site Reliability Engineers and DevOps practitioners sharing hard-learned lessons on cloud infrastructure, debugging, and system architecture.
