Error Medic

Envoy 503 Service Unavailable: Complete Troubleshooting Guide

Fix Envoy 503 errors fast: diagnose no_healthy_upstream, circuit breaker trips, and health check failures using the Admin API and config tuning.

Key Takeaways
  • The most common cause is all upstream hosts being marked unhealthy or ejected by outlier detection, surfacing as 'no healthy upstream' in the response body or the UH response flag in Envoy access logs.
  • Circuit breaker overflow (UO response flag) triggers 503 under high load when max_connections, max_pending_requests, or max_requests thresholds are breached — increasing these limits or scaling the upstream service resolves the issue.
  • Quick fix: query the Admin API with 'curl http://localhost:9901/clusters' to identify ejected or unhealthy hosts, check non-zero outlier_detection.ejections_active counters, then tune outlier_detection thresholds or circuit_breakers.thresholds in the cluster config to match real traffic patterns.
Fix Approaches Compared
| Method | When to Use | Time | Risk |
|--------|-------------|------|------|
| Admin API inspection (curl :9901/clusters) | First step — confirm which hosts are unhealthy or ejected | < 2 min | None — read-only |
| Tune outlier_detection thresholds in cluster config | Hosts ejected too aggressively for transient errors | 10–30 min + redeploy | Low if thresholds set conservatively |
| Increase circuit_breakers.thresholds (max_pending_requests) | 503s appear only under load; UO response flag in access logs | 10–20 min + redeploy | Low–medium — higher limits may mask upstream capacity issues |
| Fix upstream health check endpoint or config path | health_check.failure counter is non-zero; backend returns non-2xx on health path | Varies | None — fixes the root cause directly |
| Re-sync xDS / Istio control plane | istioctl proxy-config endpoint shows 0 healthy or UNHEALTHY endpoints | 5–15 min | Low |
| Runtime override to relax outlier detection | Urgent fix when good hosts are over-ejected in production right now | < 5 min | Medium — temporarily disables automatic bad-host eviction |

Understanding Envoy 503 Service Unavailable

When Envoy returns a 503 Service Unavailable, the proxy itself — not your application — could not forward the request to any healthy upstream host. This distinction is critical: the 503 originates in the Envoy data plane, meaning your application code may be perfectly healthy while all traffic fails.

The response body typically contains one of these messages:

upstream connect error or disconnect/reset before headers. reset reason: connection_failure
no healthy upstream
upstream connect error or disconnect/reset before headers. reset reason: overflow

Envoy embeds a response flag in its access logs via the %RESPONSE_FLAGS% format token. Identifying this flag immediately narrows the root cause:

| Flag | Meaning |
|------|---------|
| UH | No healthy upstream host available |
| UF | Upstream connection failure (TCP refused or timed out) |
| UO | Upstream overflow — circuit breaker triggered |
| UT | Upstream request timeout exceeded |
| UC | Upstream connection terminated before response headers |
| URX | Maximum retry limit exceeded |
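A quick way to see which failure mode dominates is to tally 503s per response flag. The helper below is a minimal sketch that assumes Envoy's default access log format, where the status code is whitespace-separated field 5 and %RESPONSE_FLAGS% is field 6; adjust the field numbers for a custom log format.

```shell
# Tally 503 responses per %RESPONSE_FLAGS% value in an Envoy access log.
# Assumes the default access log format: the status code is field 5 and
# the response flags are field 6 when splitting on whitespace.
summarize_503_flags() {
  awk '$5 == 503 { count[$6]++ }
       END { for (f in count) printf "%d %s\n", count[f], f }' "$@" |
    sort -rn
}
```

Run it as summarize_503_flags /var/log/envoy/access.log: a tally dominated by UH points at health checks or outlier detection, UO at circuit breakers, and UF at network reachability.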

Step 1: Confirm Envoy Is Generating the 503

Before touching any configuration, verify the 503 is Envoy-generated and not a passthrough from the backend application.

# Check access log for Envoy response flags
tail -f /var/log/envoy/access.log | grep ' 503 '
# Look for a line like:
# [2026-02-23T10:14:22.003Z] "GET /api/v1/users HTTP/1.1" 503 UH 0 18 0 - ...
# The 'UH' flag confirms Envoy generated the 503 — no healthy upstream host

If the flag is a literal dash (-) and the x-envoy-upstream-service-time response header is present, the 503 came from your backend application and Envoy merely proxied it through. In that case, debug your application, not Envoy.
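This header check can be scripted. The sketch below (the function name is my own) reads response headers from stdin, for example from curl -sD - -o /dev/null <url>, and treats the presence of x-envoy-upstream-service-time as the hint that an upstream actually produced the response.

```shell
# Guess whether a 503 was generated by Envoy or merely proxied from the
# backend, based on response headers read from stdin. Envoy only adds
# x-envoy-upstream-service-time when an upstream host replied.
envoy_or_backend() {
  if grep -qi '^x-envoy-upstream-service-time:'; then
    echo "backend (Envoy proxied the response)"
  else
    echo "envoy (response generated locally)"
  fi
}
```

Example (gateway address is illustrative): curl -sD - -o /dev/null http://localhost:8080/api/v1/users | envoy_or_backend. Cross-check against the access log response flag, since the header alone is a strong hint rather than proof.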

Step 2: Query the Envoy Admin API

The Admin API (default port 9901) is the fastest diagnostic tool available. It requires no restarts and exposes real-time cluster and host state.

# List all clusters with health status
curl -s http://localhost:9901/clusters

# Filter for anything non-healthy
curl -s http://localhost:9901/clusters | grep -v 'health_flags::healthy' | grep 'health_flags'

Healthy output shows health_flags::healthy for every host. Problem indicators include:

  • health_flags::/failed_active_hc — active health check is failing
  • health_flags::/failed_eds_health — EDS (xDS control plane) marked host unhealthy
  • health_flags::/ejected_via_outlier_detection — outlier detection ejected this host
  • health_flags::/pending_dynamic_removal — host is being removed from the cluster
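These flags can also be pulled out programmatically. A small sketch, assuming the ::-delimited per-host line format that /clusters emits:

```shell
# Print "<cluster> <host> <flags>" for every host whose health_flags are not
# simply "healthy". Assumes /clusters lines shaped like
#   my_service::10.0.0.5:8080::health_flags::/failed_active_hc
unhealthy_hosts() {
  awk -F'::' '$3 == "health_flags" && $4 != "healthy" { print $1, $2, $4 }'
}
```

Usage: curl -s http://localhost:9901/clusters | unhealthy_hosts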

Step 3: Diagnose Health Check Failures

If hosts show /failed_active_hc, Envoy's active health checks are failing. Use the stats endpoint to quantify:

# Check health check statistics
curl -s http://localhost:9901/stats | grep 'health_check'

# Key counters:
# cluster.<name>.health_check.failure         -- cumulative failures
# cluster.<name>.health_check.success         -- cumulative successes
# cluster.<name>.health_check.network_failure -- TCP-level check failures
# cluster.<name>.membership_healthy           -- currently healthy hosts (must be > 0)
# cluster.<name>.membership_total             -- total configured hosts
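To turn those counters into an at-a-glance verdict, the sketch below scans /stats output and prints every cluster whose membership_healthy has dropped to zero. It strips only the fixed cluster. prefix and membership_* suffix, so cluster names containing dots (common in Istio) are handled.

```shell
# Read Envoy /stats output on stdin and print each cluster where
# membership_total > 0 but membership_healthy == 0, i.e. every configured
# host is considered unhealthy and requests will fail with UH.
clusters_with_no_healthy_hosts() {
  awk '
    /^cluster\..*\.membership_healthy: / {
      name = $1
      sub(/^cluster\./, "", name); sub(/\.membership_healthy:$/, "", name)
      healthy[name] = $2 + 0
    }
    /^cluster\..*\.membership_total: / {
      name = $1
      sub(/^cluster\./, "", name); sub(/\.membership_total:$/, "", name)
      total[name] = $2 + 0
    }
    END { for (c in total) if (total[c] > 0 && healthy[c] == 0) print c }'
}
```

Usage: curl -s http://localhost:9901/stats | clusters_with_no_healthy_hosts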

If membership_healthy is 0 while membership_total is greater than zero, Envoy considers every configured host unhealthy. Verify the upstream is reachable from Envoy's exact network context:

# In Kubernetes: exec into the sidecar proxy container
kubectl exec -it <pod-name> -c istio-proxy -- \
  curl -v http://<upstream-service>:<port>/healthz

# Test raw TCP reachability
kubectl exec -it <pod-name> -c istio-proxy -- \
  nc -zv <upstream-host> <port>

A frequent misconfiguration: the path in Envoy's health check config points to /health but the upstream only serves /healthz or /ping. Fix the cluster config:

# Cluster health_checks config (envoy.yaml or CDS)
health_checks:
  - timeout: 5s
    interval: 10s
    unhealthy_threshold: 3
    healthy_threshold: 2
    http_health_check:
      path: /healthz   # Must match the actual endpoint the backend serves
      expected_statuses:
        - start: 200   # Inclusive lower bound
          end: 300     # Exclusive upper bound -- end: 300 accepts every 2xx
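One subtlety worth remembering: expected_statuses uses half-open ranges, where start is inclusive and end is exclusive. A two-line helper makes the matching rule explicit:

```shell
# Mirror Envoy's expected_statuses matching: each range is half-open,
# [start, end), so accepting every 2xx requires start=200, end=300.
status_expected() {
  local code="$1" start="$2" end="$3"
  [ "$code" -ge "$start" ] && [ "$code" -lt "$end" ]
}
```

With end set to 299, a perfectly valid 299 response would fail the health check; with end: 300 the full 2xx class passes.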

Step 4: Diagnose Outlier Detection Ejections

Envoy's outlier detection automatically ejects hosts that return consecutive 5xx responses, gateway failures, or exhibit abnormally high latency. After a deployment spike or transient error burst, hosts may remain ejected even after the backend has fully recovered.

# Check for active ejections
curl -s http://localhost:9901/stats | grep 'outlier_detection'

# Watch for non-zero values:
# cluster.<name>.outlier_detection.ejections_active          -- currently ejected
# cluster.<name>.outlier_detection.ejections_consecutive_5xx
# cluster.<name>.outlier_detection.ejections_total

If ejections_active is greater than zero but the backends have since recovered, you can wait it out: ejected hosts are automatically re-admitted after base_ejection_time (default: 30 seconds) multiplied by the number of times the host has been ejected, capped at max_ejection_time (default: 300 seconds). For an immediate fix in production:

# Shorten base_ejection_time via runtime override (takes effect immediately;
# requires an admin_layer in Envoy's layered runtime config)
curl -X POST 'http://localhost:9901/runtime_modify?outlier_detection.base_ejection_time_ms=1000'
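Per the outlier detection documentation, the actual ejection duration is base_ejection_time multiplied by the number of times the host has been ejected, capped at max_ejection_time. The arithmetic, sketched with the defaults:

```shell
# Expected ejection duration (seconds) for the nth ejection of a host:
# base_ejection_time * n, capped at max_ejection_time.
# Envoy defaults: 30s base, 300s cap.
ejection_seconds() {
  local base="${1:-30}" n="${2:-1}" cap="${3:-300}"
  local t=$(( base * n ))
  if [ "$t" -gt "$cap" ]; then t="$cap"; fi
  echo "$t"
}
```

A host on its third ejection with defaults is out for 90 seconds, which explains why a flapping backend can appear to "stay down" long after it recovers.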

The permanent fix is tuning the outlier detection configuration to be less aggressive:

outlier_detection:
  consecutive_5xx: 10              # Default: 5; increase for noisy services
  consecutive_gateway_failure: 10
  interval: 30s                   # Evaluation window
  base_ejection_time: 15s         # Minimum ejection duration
  max_ejection_percent: 50        # Never eject more than half the cluster
  enforcing_consecutive_5xx: 100
  enforcing_consecutive_gateway_failure: 0

Step 5: Diagnose Circuit Breaker Overflow

If the response flag is UO, the circuit breaker is rejecting requests because the upstream connection pool is exhausted:

# Check circuit breaker overflow counters
curl -s http://localhost:9901/stats | grep -E 'upstream_(cx_overflow|rq_pending_overflow)'

# If upstream_rq_pending_overflow is incrementing,
# requests are queuing faster than the upstream can serve them.

Fix by increasing thresholds after verifying the upstream can actually handle the higher concurrency:

circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 1024
      max_pending_requests: 2048  # Increase for bursty workloads
      max_requests: 2048
      max_retries: 10
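To confirm which breaker is actually tripping, Envoy exposes per-priority gauges such as cluster.<name>.circuit_breakers.default.rq_pending_open that read 1 while the corresponding breaker is open. A sketch that filters them out of /stats:

```shell
# List circuit-breaker gauges currently tripped (value 1) in /stats output.
# Matches the per-priority *_open gauges: cx_open, rq_open,
# rq_pending_open, and rq_retry_open.
open_circuit_breakers() {
  awk -F': ' '$1 ~ /circuit_breakers\..*_open$/ && $2 + 0 == 1 { print $1 }'
}
```

Usage: curl -s http://localhost:9901/stats | open_circuit_breakers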

Step 6: Diagnose in Istio and xDS-Managed Deployments

In Istio, cluster and endpoint configuration is pushed by the Istiod control plane. Use istioctl to verify the sidecar proxy received correct configuration:

# Check clusters configured for a pod
istioctl proxy-config cluster <pod-name>.<namespace>

# Check endpoint health for a specific service
istioctl proxy-config endpoint <pod-name>.<namespace> \
  --cluster "outbound|80||my-service.default.svc.cluster.local"

# Verify control plane sync status
istioctl proxy-status <pod-name>.<namespace>

If endpoints show UNHEALTHY or the list is empty with zero healthy hosts, the problem is in the Kubernetes Endpoints object or the control plane — not Envoy's runtime state. Verify that the Service selector matches your pod labels and that pod readiness probes are passing.
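A selector/label mismatch is the most common reason the endpoint list is empty. The helper below is a hypothetical sketch: it checks that every comma-separated key=value pair in a Service selector also appears in a pod's label set (pairs containing commas or spaces are not handled).

```shell
# Check that every key=value pair in a Service selector (arg 1,
# comma-separated) also appears in the pod's labels (arg 2). A mismatch
# means the Endpoints object stays empty and Envoy sees zero hosts.
selector_matches() {
  local selector="$1" labels="$2" kv
  for kv in $(printf '%s' "$selector" | tr ',' ' '); do
    case ",$labels," in
      *",$kv,"*) ;;
      *) echo "MISMATCH: selector wants $kv"; return 1 ;;
    esac
  done
  echo "MATCH"
}
```

The key=value strings can be produced from kubectl output, for example with jq: kubectl get svc my-service -o json | jq -r '.spec.selector | to_entries | map("\(.key)=\(.value)") | join(",")' (service name is illustrative).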

Step 7: Enable Temporary Debug Logging

For intermittent or hard-to-reproduce 503 errors, temporarily increase Envoy log verbosity at runtime — no restart required:

# Enable debug logging for relevant subsystems
curl -X POST 'http://localhost:9901/logging?upstream=debug'
curl -X POST 'http://localhost:9901/logging?router=debug'
curl -X POST 'http://localhost:9901/logging?health_checker=debug'

# Reproduce the 503, capture logs, then reset to avoid log flood
curl -X POST 'http://localhost:9901/logging?upstream=warning'
curl -X POST 'http://localhost:9901/logging?router=warning'
curl -X POST 'http://localhost:9901/logging?health_checker=warning'

Debug logs for the upstream component show every connection attempt with the exact socket-level error — for example Connection refused on 10.0.0.5:8080 or i/o timeout after 5000ms — that triggers a host to be marked unhealthy and eventually ejected.
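When the debug capture is large, grouping the socket-level errors makes the dominant failure obvious. A minimal sketch that tallies the error phrases quoted above; extend the pattern list to match your own logs:

```shell
# Tally socket-level upstream errors in Envoy debug logs read from stdin,
# extracting only the error phrase so identical error kinds group together.
upstream_error_summary() {
  grep -oE 'Connection refused|i/o timeout|connection_failure|reset reason: [a-z_]+' |
    sort | uniq -c | sort -rn
}
```

Usage: upstream_error_summary < /tmp/envoy-debug.log (path is illustrative).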

All-in-One Diagnostic Script

#!/usr/bin/env bash
# ============================================================
# Envoy 503 Diagnostic Script
# Usage: bash envoy-503-diag.sh [admin_url] [cluster_filter]
# Example: bash envoy-503-diag.sh http://localhost:9901 my_service
# ============================================================

ADMIN_URL="${1:-http://localhost:9901}"
FILTER="${2:-.}"

echo "=== Envoy 503 Diagnostic Report ==="
echo "Target: $ADMIN_URL  |  Filter: $FILTER"
echo ""

# 0. Verify Admin API is reachable
if ! curl -sf "${ADMIN_URL}/ready" >/dev/null 2>&1; then
  echo "[FAIL] Admin API unreachable at ${ADMIN_URL}"
  echo "       Ensure Envoy is running and the admin listener is bound."
  exit 1
fi
echo "[OK] Admin API is reachable"
echo ""

# 1. Unhealthy or ejected hosts (the primary 503 cause)
echo "--- 1. Unhealthy / Ejected Hosts ---"
RESULT=$(curl -s "${ADMIN_URL}/clusters" |
  grep -v "health_flags::healthy" |
  grep "health_flags" |
  grep "$FILTER")
echo "${RESULT:-  (none -- all hosts healthy)}"
echo ""

# 2. Membership counts: healthy vs total
echo "--- 2. Cluster Membership (healthy vs total) ---"
curl -s "${ADMIN_URL}/stats" |
  grep -E "cluster\..*\.(membership_healthy|membership_total)" |
  grep "$FILTER" | sort
echo ""

# 3. Outlier detection ejections (non-zero only)
echo "--- 3. Active Outlier Ejections ---"
RESULT=$(curl -s "${ADMIN_URL}/stats" |
  grep "outlier_detection.ejections" |
  grep "$FILTER" |
  grep -v ": 0$")
echo "${RESULT:-  (none)}"
echo ""

# 4. Circuit breaker overflows (non-zero only)
echo "--- 4. Circuit Breaker Overflows ---"
RESULT=$(curl -s "${ADMIN_URL}/stats" |
  grep -E "upstream_(cx_overflow|rq_pending_overflow|rq_retry_overflow)" |
  grep "$FILTER" |
  grep -v ": 0$")
echo "${RESULT:-  (none)}"
echo ""

# 5. Health check failures (non-zero only)
echo "--- 5. Health Check Failures ---"
RESULT=$(curl -s "${ADMIN_URL}/stats" |
  grep -E "health_check\.(failure|network_failure)" |
  grep "$FILTER" |
  grep -v ": 0$")
echo "${RESULT:-  (none)}"
echo ""

# 6. Upstream TCP connection errors (non-zero only)
echo "--- 6. Upstream TCP Connection Errors ---"
RESULT=$(curl -s "${ADMIN_URL}/stats" |
  grep "upstream_cx_connect_fail" |
  grep "$FILTER" |
  grep -v ": 0$")
echo "${RESULT:-  (none)}"
echo ""

# 7. Istio-specific checks (if istioctl is available)
if command -v istioctl >/dev/null 2>&1; then
  echo "--- 7. Istio Proxy Status ---"
  istioctl proxy-status 2>/dev/null | head -20
  echo ""
fi

echo "=== Remediation Guide ==="
echo "UH flag (no healthy upstream) -> Check sections 1-3; tune outlier_detection"
echo "UO flag (circuit breaker)     -> Check section 4; increase max_pending_requests"
echo "UF flag (connection failure)  -> Check section 6; verify network path and port"
echo "0 healthy in section 2        -> Active health checks failing; fix path or backend"
echo ""
echo "Enable debug logging (no restart): curl -X POST '${ADMIN_URL}/logging?upstream=debug'"
echo "Disable after use:                 curl -X POST '${ADMIN_URL}/logging?upstream=warning'"
echo "Istio endpoint check:              istioctl proxy-config endpoint <pod>.<ns>"

Error Medic Editorial

Error Medic Editorial is a team of senior DevOps, SRE, and platform engineers with extensive hands-on experience operating Envoy-based service meshes at scale. Our troubleshooting guides are built from real production incidents across Kubernetes, Istio, AWS App Mesh, and bare-metal Envoy deployments, specializing in distributed systems observability, zero-downtime deployments, and traffic management patterns.
