Error Medic

Azure API Timeout: How to Diagnose and Fix 408/504 Timeout Errors

Fix Azure API timeout errors (408, 504, OperationTimedOut) by adjusting timeout settings, enabling retries, and optimizing long-running calls.

Key Takeaways
  • Azure API timeouts surface as HTTP 408, 504, or the exception message 'The operation timed out' / 'OperationTimedOut' and stem from four root causes: a client-side timeout that is too short, an Azure API Management (APIM) gateway timeout, a backend cold start, or a long-running operation that exceeds the roughly 230-second request limit imposed by the Azure front-end load balancer.
  • The load balancer in front of platform-hosted services such as Azure App Service and Azure Functions drops connections after 4 minutes (240 s) of idle time, and on those services the setting cannot be changed; in practice, any HTTP request that takes longer than about 230 s end-to-end is dropped before your backend responds. On a Load Balancer or Application Gateway that you manage yourself, the idle and request timeouts are configurable.
  • Quick fix summary: (1) set HttpClient.Timeout / Axios timeout to at least 100 s for synchronous calls; (2) raise the APIM policy timeout to match; (3) convert calls longer than 90 s to the async polling pattern (202 Accepted + Location header); (4) add an exponential-backoff retry policy with jitter for transient 429/503/504 responses.
Fix Approaches Compared
| Method | When to Use | Implementation Time | Risk |
|---|---|---|---|
| Raise client HttpClient timeout | Client times out before server responds; 408 on client side | < 15 min | Low – isolated to your client code |
| Raise APIM forward-request timeout | APIM policy returns 504 before backend finishes | 15–30 min | Low – scoped to one API/operation policy |
| Switch to async polling (202 + Location) | Operations regularly exceed 90 s (reports, exports, ML inference) | 2–8 h | Medium – requires API contract change |
| Add Polly retry with exponential backoff | Transient 429 / 503 / 504 bursts | 30–60 min | Low – provided only idempotent requests are retried |
| Enable APIM caching for repeated reads | Repeated identical GET calls timing out under load | 30–60 min | Low – stale-data risk on mutable resources |
| Scale out / warm up backend | Cold-start latency on Azure Functions consumption plan | 1–4 h | Low-Medium – cost increase, needs load testing |
| Move to Azure Durable Functions | Workflows that fan-out, aggregate, or run > 5 min | 1–3 days | Medium – architectural refactor |

Understanding Azure API Timeout Errors

When an Azure API call exceeds a time boundary, the failure can originate at several distinct layers, each producing a different error signature:

  • Client SDK / HttpClient – .NET throws TaskCanceledException with the message: The request was canceled due to the configured HttpClient.Timeout of 100 seconds elapsing. Node.js clients such as Axios fail with the error code ECONNABORTED instead.
  • Azure API Management gateway – returns HTTP 504 Gateway Timeout with body { "statusCode": 504, "message": "Origin server did not respond in time." }
  • Azure Load Balancer idle timeout – silently resets the TCP connection after 4 minutes of inactivity; the client sees a connection reset or SocketException.
  • Azure Resource Manager (ARM) polling – returns HTTP 202 Accepted immediately but the polling loop eventually times out with CloudException: OperationTimedOut.
  • Azure SQL / Cosmos DB – surfaces as SqlException: Timeout expired or RequestRateTooLargeException (429) which, if unretried, manifests as a logical timeout.

Understanding which layer fired is the mandatory first step before applying any fix.


Step 1: Identify the Timeout Layer

1a. Read the full exception chain. In .NET, always call exception.ToString() rather than .Message – the inner TaskCanceledException or SocketException reveals whether the cancellation token came from your code or the HTTP stack.
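
The same habit carries over to other stacks. As a rough Python analogue (the helper names `exception_chain`, `describe`, and `demo` are hypothetical, not part of any SDK), walking `__cause__` and `__context__` shows every wrapped exception instead of only the outermost message:

```python
def exception_chain(exc):
    """Return the exception and every wrapped cause, outermost first."""
    chain = []
    seen = set()
    while exc is not None and id(exc) not in seen:
        seen.add(id(exc))
        chain.append(exc)
        # Prefer the explicit cause ("raise ... from ..."), else the implicit context.
        exc = exc.__cause__ or exc.__context__
    return chain


def describe(exc):
    """One entry per layer, similar in spirit to .NET's exception.ToString()."""
    return " <- ".join(f"{type(e).__name__}: {e}" for e in exception_chain(exc))


def demo():
    """Wrap a low-level timeout in a generic error, then describe the chain."""
    try:
        try:
            raise TimeoutError("timed out")          # low-level network timeout
        except TimeoutError as inner:
            raise RuntimeError("API call failed") from inner
    except RuntimeError as outer:
        return describe(outer)
```

Reading only the outer message here would hide the fact that a timeout, rather than a logic error, caused the failure.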

1b. Check the HTTP status code. 408 = client or server explicitly signaled timeout. 504 = intermediate proxy (APIM, Application Gateway, or Azure Front Door) gave up. A connection-reset with no status code = TCP-layer idle timeout from the Load Balancer.
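
The mapping in 1b can be written down as a small lookup. This is only a sketch of the rule of thumb above; `classify_timeout` and its return strings are illustrative names, and real incidents may need more signals than a status code:

```python
def classify_timeout(status_code=None, connection_reset=False):
    """Map the observed symptom to the layer that most likely fired."""
    if status_code == 408:
        return "client or server signalled a request timeout"
    if status_code == 504:
        return "intermediate proxy (APIM, Application Gateway, Front Door)"
    if status_code is None and connection_reset:
        return "TCP idle timeout at the load balancer"
    return "not a timeout signature covered here"
```

For example, a reset connection with no status code is called as `classify_timeout(None, connection_reset=True)`.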

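1c. Pull APIM diagnostic logs. In the Azure portal go to API Management → APIs → [your API] → Test and inspect the trace, or rely on Application Insights if it is enabled on APIM. The following ARM call returns the API's Application Insights diagnostic settings so you can confirm logging is configured: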

GET https://management.azure.com/subscriptions/{sub}/resourceGroups/{rg}/providers/Microsoft.ApiManagement/service/{apim}/apis/{api}/diagnostics/applicationinsights?api-version=2022-08-01

Look for backend-duration in the trace. If it is close to your forward-request timeout value, the backend is the bottleneck, not your client.

1d. Check Azure Monitor / Application Insights. Run this KQL query in Log Analytics to find all requests that exceeded 30 seconds:

requests
| where timestamp > ago(1h)
| where duration > 30000
| project timestamp, name, resultCode, duration, cloud_RoleName
| order by duration desc
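
Once the query results are exported, a few lines of Python can show which role produces the slow requests. The sketch below assumes the rows arrive as a list of dicts with the same column names as the projection above (`duration` in milliseconds, `cloud_RoleName`); the function name is hypothetical:

```python
from collections import defaultdict


def slow_requests_by_role(rows, threshold_ms=30_000):
    """Group requests slower than threshold_ms by cloud_RoleName, worst first."""
    grouped = defaultdict(list)
    for row in rows:
        if row["duration"] > threshold_ms:
            grouped[row["cloud_RoleName"]].append(row["duration"])
    summary = [
        {"role": role, "count": len(durations), "max_ms": max(durations)}
        for role, durations in grouped.items()
    ]
    # Slowest role first, mirroring "order by duration desc" in the query.
    return sorted(summary, key=lambda s: s["max_ms"], reverse=True)
```

A role that dominates this list is a reasonable first place to look for the bottleneck.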

Step 2: Fix Client-Side Timeouts (C# / .NET)

The default HttpClient timeout is 100 seconds. For APIs that legitimately take longer, create a named client via IHttpClientFactory:

// Program.cs / Startup.cs
using System.Net;
using Polly;
using Polly.Extensions.Http;

builder.Services.AddHttpClient("AzureBackend", client =>
{
    client.BaseAddress = new Uri("https://myapi.azure-api.net");
    // Overall budget for one logical call, including every retry and wait below.
    client.Timeout = TimeSpan.FromSeconds(180);
})
.AddPolicyHandler(GetRetryPolicy());

static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy() =>
    HttpPolicyExtensions
        .HandleTransientHttpError()          // 5xx, 408, and network errors
        .OrResult(r => r.StatusCode == (HttpStatusCode)429)
        .WaitAndRetryAsync(
            retryCount: 4,
            sleepDurationProvider: attempt =>
                TimeSpan.FromSeconds(Math.Pow(2, attempt))   // 2, 4, 8, 16 s
                // Jitter of 0-499 ms; Random.Shared requires .NET 6 or later.
                + TimeSpan.FromMilliseconds(Random.Shared.Next(0, 500)));

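Important: When you need per-request control, pass a CancellationToken to the call itself (for example, from a CancellationTokenSource created with a timeout) rather than changing HttpClient.Timeout at runtime. The property cannot be modified once the client has sent its first request; attempting to do so throws InvalidOperationException.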
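
To see how much wall-clock time the retry schedule can consume, the sketch below reproduces the same delays in Python (2, 4, 8, 16 s plus up to 499 ms of jitter) and adds a worst-case estimate. The per-try timeout is a hypothetical parameter introduced for illustration; the policy shown above does not set one.

```python
import random


def backoff_delays(retry_count=4, max_jitter_ms=500, rng=None):
    """Delays in seconds for attempts 1..retry_count: 2^attempt plus jitter."""
    rng = rng or random.Random()
    return [
        2 ** attempt + rng.randint(0, max_jitter_ms - 1) / 1000
        for attempt in range(1, retry_count + 1)
    ]


def worst_case_seconds(per_try_timeout_s, retry_count=4, max_jitter_ms=500):
    """Upper bound on elapsed time: every try times out and every wait is maximal."""
    tries = retry_count + 1                                  # first call + retries
    waits = sum(2 ** a for a in range(1, retry_count + 1))   # 2 + 4 + 8 + 16
    return tries * per_try_timeout_s + waits + retry_count * max_jitter_ms / 1000
```

With a 30-second per-try timeout, `worst_case_seconds(30)` returns 182.0, which is already above a 180-second overall client timeout, so the last retry could be cut short.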


Step 3: Fix APIM Gateway Timeouts

In Azure API Management, the default forward-request timeout is 300 seconds (since API version 2021+), but older services default to 60 seconds. The forward-request policy is only valid in the backend section, so set the timeout there:

<!-- APIM Policy (API or Operation scope) -->
<policies>
  <inbound>
    <base />
  </inbound>
  <backend>
    <forward-request timeout="180" follow-redirects="true" />
  </backend>
  <outbound>
    <base />
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>

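Note that the timeout attribute is in seconds. According to the APIM policy reference, values above 240 seconds may not be honored because the underlying network infrastructure can drop idle connections after that time; this guide treats 230 seconds as the practical ceiling. If your operation needs longer than that, use the async pattern described in Step 4.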
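
When several timeouts stack up, it helps to check that they are ordered sensibly: the backend should finish before the gateway gives up, and the gateway should give up before the client does. The sketch below encodes that rule; the function name, the p99 input, and the 230-second default (the limit quoted in this guide) are assumptions for illustration.

```python
def check_timeout_chain(client_s, gateway_s, backend_p99_s, platform_limit_s=230):
    """Return a list of problems; an empty list means the chain is consistent."""
    problems = []
    if backend_p99_s >= gateway_s:
        problems.append("gateway timeout is not above backend p99 latency")
    if gateway_s >= client_s:
        problems.append("client timeout should exceed the gateway timeout")
    if gateway_s > platform_limit_s:
        problems.append("gateway timeout exceeds the platform limit; use async polling")
    return problems
```

For instance, `check_timeout_chain(190, 180, 120)` reports no problems, while a 100-second client in front of a 180-second gateway is flagged.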


Step 4: Convert Long-Running Operations to Async Polling

The Azure-standard pattern for operations > 90 seconds is the REST Long-Running Operations (LRO) specification:

  1. Client POSTs the request.
  2. Backend immediately returns 202 Accepted with a Location or Operation-Location header pointing to a status endpoint.
  3. Client polls the status endpoint (with exponential back-off) until it receives 200/201 with the final result or a terminal error.

import time

import requests


def start_operation(endpoint, payload, token):
    """POST the request; return the status URL on 202, or None if it completed synchronously."""
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    r = requests.post(endpoint, json=payload, headers=headers, timeout=30)
    r.raise_for_status()
    if r.status_code == 202:
        # Services use either header for the status URL, so accept both.
        return r.headers.get("Operation-Location") or r.headers.get("Location")
    return None  # synchronous completion


def poll_until_done(operation_url, token, max_wait=600):
    """Poll the status URL until a terminal status or until max_wait seconds pass."""
    headers = {"Authorization": f"Bearer {token}"}
    # A monotonic deadline also counts time spent inside the requests themselves.
    deadline = time.monotonic() + max_wait
    interval = 5.0
    while time.monotonic() < deadline:
        r = requests.get(operation_url, headers=headers, timeout=30)
        r.raise_for_status()
        body = r.json()
        status = str(body.get("status", "")).lower()
        if status in ("succeeded", "failed", "canceled"):
            return body
        time.sleep(interval)
        interval = min(interval * 1.5, 30)  # back-off up to 30 s
    raise TimeoutError(f"Operation did not complete within {max_wait}s")
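
On the server side, steps 1 and 2 amount to recording the operation, starting the work in the background, and answering immediately. The following is a minimal in-memory sketch of that idea using only the standard library; the `OperationStore` class, the `/operations/{id}` path, and the status values are assumptions for illustration, and a production service would persist state outside the process.

```python
import threading
import uuid


class OperationStore:
    """In-memory tracker for long-running operations (illustration only)."""

    def __init__(self):
        self._ops = {}
        self._threads = {}
        self._lock = threading.Lock()

    def submit(self, work, *args):
        """Start work in the background; return (202, headers) immediately."""
        op_id = str(uuid.uuid4())
        thread = threading.Thread(target=self._run, args=(op_id, work, args), daemon=True)
        with self._lock:
            self._ops[op_id] = {"status": "running"}
            self._threads[op_id] = thread
        thread.start()
        return 202, {"Operation-Location": f"/operations/{op_id}"}

    def _run(self, op_id, work, args):
        try:
            result = {"status": "succeeded", "result": work(*args)}
        except Exception as exc:  # report any failure as a terminal state
            result = {"status": "failed", "error": str(exc)}
        with self._lock:
            self._ops[op_id] = result

    def status(self, op_id):
        """What a GET on the status endpoint would return."""
        with self._lock:
            return dict(self._ops[op_id])

    def wait(self, op_id, timeout=None):
        """Block until the operation finishes (handy in tests)."""
        self._threads[op_id].join(timeout)
```

The status body uses the same `status` field that the polling client above checks, so the two halves fit together.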

Step 5: Fix Azure Function Cold-Start Timeouts

Azure Functions on the Consumption plan can take 5–15 seconds to cold-start. If your API call hits a cold instance, the cumulative latency often triggers client timeouts.

Options:

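  • Set "functionTimeout": "00:10:00" in host.json (max 10 min on Consumption, unlimited on Premium/Dedicated). HTTP-triggered functions are still bound by the roughly 230-second load balancer limit on the response, whatever this value is.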
  • Enable Always On (App Service Plan) or Pre-warmed instances (Premium plan) to eliminate cold starts.
  • Use Azure Front Door health probes to keep instances warm.
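
Before deploying, it can be worth checking that the functionTimeout in host.json fits the hosting plan. The sketch below parses the hh:mm:ss value and compares it with the plan ceilings stated in this guide; the function names and the plan keys are hypothetical, and other value formats are not handled.

```python
import json

# Plan ceilings as stated in this guide; None means no upper bound.
PLAN_MAX_SECONDS = {"consumption": 600, "premium": None, "dedicated": None}


def function_timeout_seconds(host_json_text):
    """Parse "hh:mm:ss" from host.json; return None if functionTimeout is absent."""
    value = json.loads(host_json_text).get("functionTimeout")
    if value is None:
        return None
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def exceeds_plan_limit(host_json_text, plan):
    """True when the configured timeout is above the ceiling for the given plan."""
    limit = PLAN_MAX_SECONDS[plan]
    timeout = function_timeout_seconds(host_json_text)
    return limit is not None and timeout is not None and timeout > limit
```

A value of "00:15:00" is flagged for the Consumption plan but accepted for Premium.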

Step 6: Verify the Fix in Staging

After applying changes, validate with a load test using Azure Load Testing or k6 before promoting to production:

# k6 smoke test – replace URL and token
k6 run --vus 10 --duration 60s - <<'EOF'
import http from 'k6/http';
import { check, sleep } from 'k6';

const TOKEN = __ENV.AZURE_TOKEN;
const BASE  = __ENV.API_BASE_URL;

export default function () {
  const res = http.post(`${BASE}/api/long-running`, JSON.stringify({input: 'test'}), {
    headers: { 'Authorization': `Bearer ${TOKEN}`, 'Content-Type': 'application/json' },
    timeout: '190s',
  });
  check(res, {
    'status is 200 or 202': (r) => r.status === 200 || r.status === 202,
    // Status 0 means k6 received no HTTP response, e.g. its own timeout fired.
    'no timeout':           (r) => r.status !== 0 && r.status !== 408 && r.status !== 504,
  });
  sleep(1);
}
EOF

#!/usr/bin/env bash
# Azure API Timeout Diagnostic Script
# Prerequisites: az CLI logged in, jq, curl
# Usage: APIM_NAME=mygw RG=mygroup API_ID=myapi bash diagnose-api-timeout.sh

set -euo pipefail

APIM_NAME="${APIM_NAME:?Set APIM_NAME}"
RG="${RG:?Set RG}"
API_ID="${API_ID:?Set API_ID}"
SUB=$(az account show --query id -o tsv)

echo "=== 1. Check APIM SKU, capacity, and provisioning state ==="
az apim show -n "$APIM_NAME" -g "$RG" \
  --query '{sku:sku.name, capacity:sku.capacity, provisioningState:provisioningState}' -o table

echo ""
echo "=== 2. Fetch backend policy for API $API_ID ==="
az rest --method GET \
  --url "https://management.azure.com/subscriptions/$SUB/resourceGroups/$RG/providers/Microsoft.ApiManagement/service/$APIM_NAME/apis/$API_ID/policies/policy?api-version=2022-08-01" \
  --query 'properties.value' -o tsv 2>/dev/null | grep -oP '(?<=forward-request timeout=")\d+' \
  && echo " seconds" || echo "forward-request timeout not explicitly set (check inherited policy)"

echo ""
echo "=== 3. Recent 504/408 errors from APIM in Azure Monitor (last 1h) ==="
az monitor log-analytics query \
  --workspace "$(az monitor log-analytics workspace list -g "$RG" --query '[0].customerId' -o tsv)" \
  --analytics-query "
    ApiManagementGatewayLogs
    | where TimeGenerated > ago(1h)
    | where ResponseCode in (408, 504)
    | project TimeGenerated, OperationId, BackendId, BackendResponseCode, DurationMs
    | order by DurationMs desc
    | limit 20" \
  --output table 2>/dev/null || echo "Log Analytics workspace not found or insufficient permissions"

echo ""
echo "=== 4. Check Function App timeout setting ==="
FUNC_APPS=$(az functionapp list -g "$RG" --query '[].name' -o tsv)
for FUNC in $FUNC_APPS; do
  SETTINGS=$(az functionapp config appsettings list -n "$FUNC" -g "$RG" -o json 2>/dev/null || echo '[]')
  # Helper: print the value of one app setting, or nothing if it is absent.
  setting() { jq -r --arg n "$1" '.[] | select(.name == $n) | .value' <<<"$SETTINGS"; }
  TIMEOUT=$(setting AzureFunctionsJobHost__functionTimeout)
  # Best effort: assumes a connection-string AzureWebJobsStorage and an Azure Files content share.
  ACCOUNT=$(setting AzureWebJobsStorage | grep -oP '(?<=AccountName=)[^;]+' || true)
  SHARE=$(setting WEBSITE_CONTENTSHARE)
  HOST_JSON=$(az storage file download --account-name "$ACCOUNT" --share-name "${SHARE:-$FUNC}" \
    --path site/wwwroot/host.json --dest /dev/stdout -o none 2>/dev/null | \
    jq -r '.functionTimeout // "not set"' 2>/dev/null || echo "could not read")
  echo "  Function App: $FUNC | AppSetting timeout: ${TIMEOUT:-default} | host.json: $HOST_JSON"
done

echo ""
echo "=== 5. Measure raw backend latency bypassing APIM ==="
BACKEND_URL="${BACKEND_URL:-}"
if [[ -n "$BACKEND_URL" ]]; then
  curl -o /dev/null -s -w \
    "DNS: %{time_namelookup}s | Connect: %{time_connect}s | TTFB: %{time_starttransfer}s | Total: %{time_total}s\n" \
    "$BACKEND_URL"
else
  echo "  Set BACKEND_URL env var to measure raw backend latency"
fi

echo ""
echo "=== Diagnostics complete ==="

Error Medic Editorial

Error Medic Editorial is a team of senior DevOps and SRE engineers with hands-on experience designing and operating production systems on Azure, AWS, and GCP. Our troubleshooting guides are built from real incident postmortems, not documentation summaries.
