Error Medic

Envoy 503 Service Unavailable: Complete Troubleshooting Guide

Fix Envoy 503 errors fast: diagnose no_healthy_upstream, circuit breaker trips, and health check failures using the Admin API and config tuning.

Key Takeaways
  • The most common cause is all upstream hosts being marked unhealthy or ejected by outlier detection, surfacing as 'no healthy upstream' in the response body or the UH response flag in Envoy access logs.
  • Circuit breaker overflow (UO response flag) triggers 503 under high load when max_connections, max_pending_requests, or max_requests thresholds are breached — increasing these limits or scaling the upstream service resolves the issue.
  • Quick fix: query the Admin API with 'curl http://localhost:9901/clusters' to identify ejected or unhealthy hosts, check non-zero outlier_detection.ejections_active counters, then tune outlier_detection thresholds or circuit_breakers.thresholds in the cluster config to match real traffic patterns.
Fix Approaches Compared
| Method | When to Use | Time | Risk |
|--------|-------------|------|------|
| Admin API inspection (curl :9901/clusters) | First step — confirm which hosts are unhealthy or ejected | < 2 min | None — read-only |
| Tune outlier_detection thresholds in cluster config | Hosts ejected too aggressively for transient errors | 10–30 min + redeploy | Low if thresholds set conservatively |
| Increase circuit_breakers.thresholds (max_pending_requests) | 503s appear only under load; UO response flag in access logs | 10–20 min + redeploy | Low–medium — higher limits may mask upstream capacity issues |
| Fix upstream health check endpoint or config path | health_check.failure counter is non-zero; backend returns non-2xx on health path | Varies | None — fixes the root cause directly |
| Re-sync xDS / Istio control plane | istioctl proxy-config endpoint shows 0 healthy or UNHEALTHY endpoints | 5–15 min | Low |
| Runtime override to relax outlier detection | Urgent fix when good hosts are over-ejected in production right now | < 5 min | Medium — temporarily disables automatic bad-host eviction |

Understanding Envoy 503 Service Unavailable

When Envoy returns a 503 Service Unavailable, the proxy itself — not your application — could not forward the request to any healthy upstream host. This distinction is critical: the 503 originates in the Envoy data plane, meaning your application code may be perfectly healthy while all traffic fails.

The response body typically contains one of these messages:

upstream connect error or disconnect/reset before headers. reset reason: connection_failure
no healthy upstream
upstream connect error or disconnect/reset before headers. reset reason: overflow

Envoy embeds a response flag in its access logs via the %RESPONSE_FLAGS% format token. Identifying this flag immediately narrows the root cause:

| Flag | Meaning |
|------|---------|
| UH | No healthy upstream host available |
| UF | Upstream connection failure (TCP refused or timed out) |
| UO | Upstream overflow — circuit breaker triggered |
| UT | Upstream request timeout exceeded |
| UC | Upstream connection terminated before response headers |
| URX | Maximum retry limit exceeded |
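A quick way to see which failure mode dominates is to tally 503s per response flag. The helper below is a minimal sketch that assumes Envoy's default access log format, where the status code is whitespace-separated field 5 and %RESPONSE_FLAGS% is field 6; adjust the field numbers for a custom log format.

```shell
# Tally 503 responses per %RESPONSE_FLAGS% value in an Envoy access log.
# Assumes the default access log format: the status code is field 5 and
# the response flags are field 6 when splitting on whitespace.
summarize_503_flags() {
  awk '$5 == 503 { count[$6]++ }
       END { for (f in count) printf "%d %s\n", count[f], f }' "$@" |
    sort -rn
}
```

Run it as summarize_503_flags /var/log/envoy/access.log: a tally dominated by UH points at health checks or outlier detection, UO at circuit breakers, and UF at network reachability.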

Step 1: Confirm Envoy Is Generating the 503

Before touching any configuration, verify the 503 is Envoy-generated and not a passthrough from the backend application.

# Check access log for Envoy response flags
tail -f /var/log/envoy/access.log | grep ' 503 '
# Look for a line like:
# [2026-02-23T10:14:22.003Z] "GET /api/v1/users HTTP/1.1" 503 UH 0 18 0 - ...
# The 'UH' flag confirms Envoy generated the 503 — no healthy upstream host

If the flag is a literal dash (-) and the x-envoy-upstream-service-time response header is present, the 503 came from your backend application and Envoy merely proxied it through. In that case, debug your application, not Envoy.
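This header check can be scripted. The sketch below (the function name is my own) reads response headers from stdin, for example from curl -sD - -o /dev/null <url>, and treats the presence of x-envoy-upstream-service-time as the hint that an upstream actually produced the response.

```shell
# Guess whether a 503 was generated by Envoy or merely proxied from the
# backend, based on response headers read from stdin. Envoy only adds
# x-envoy-upstream-service-time when an upstream host replied.
envoy_or_backend() {
  if grep -qi '^x-envoy-upstream-service-time:'; then
    echo "backend (Envoy proxied the response)"
  else
    echo "envoy (response generated locally)"
  fi
}
```

Example (gateway address is illustrative): curl -sD - -o /dev/null http://localhost:8080/api/v1/users | envoy_or_backend. Cross-check against the access log response flag, since the header alone is a strong hint rather than proof.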

Step 2: Query the Envoy Admin API

The Admin API (default port 9901) is the fastest diagnostic tool available. It requires no restarts and exposes real-time cluster and host state.

# List all clusters with health status
curl -s http://localhost:9901/clusters

# Filter for anything non-healthy
curl -s http://localhost:9901/clusters | grep -v 'health_flags::healthy' | grep 'health_flags'

Healthy output shows health_flags::healthy for every host. Problem indicators include:

  • health_flags::/failed_active_hc — active health check is failing
  • health_flags::/failed_eds_health — EDS (xDS control plane) marked host unhealthy
  • health_flags::/ejected_via_outlier_detection — outlier detection ejected this host
  • health_flags::/pending_dynamic_removal — host is being removed from the cluster
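These flags can also be pulled out programmatically. A small sketch, assuming the ::-delimited per-host line format that /clusters emits:

```shell
# Print "<cluster> <host> <flags>" for every host whose health_flags are not
# simply "healthy". Assumes /clusters lines shaped like
#   my_service::10.0.0.5:8080::health_flags::/failed_active_hc
unhealthy_hosts() {
  awk -F'::' '$3 == "health_flags" && $4 != "healthy" { print $1, $2, $4 }'
}
```

Usage: curl -s http://localhost:9901/clusters | unhealthy_hosts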

Step 3: Diagnose Health Check Failures

If hosts show /failed_active_hc, Envoy's active health checks are failing. Use the stats endpoint to quantify:

# Check health check statistics
curl -s http://localhost:9901/stats | grep 'health_check'

# Key counters:
# cluster.<name>.health_check.failure         -- cumulative failures
# cluster.<name>.health_check.success         -- cumulative successes
# cluster.<name>.health_check.network_failure -- TCP-level check failures
# cluster.<name>.membership_healthy           -- currently healthy hosts (must be > 0)
# cluster.<name>.membership_total             -- total configured hosts
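To turn those counters into an at-a-glance verdict, the sketch below scans /stats output and prints every cluster whose membership_healthy has dropped to zero. It strips only the fixed cluster. prefix and membership_* suffix, so cluster names containing dots (common in Istio) are handled.

```shell
# Read Envoy /stats output on stdin and print each cluster where
# membership_total > 0 but membership_healthy == 0, i.e. every configured
# host is considered unhealthy and requests will fail with UH.
clusters_with_no_healthy_hosts() {
  awk '
    /^cluster\..*\.membership_healthy: / {
      name = $1
      sub(/^cluster\./, "", name); sub(/\.membership_healthy:$/, "", name)
      healthy[name] = $2 + 0
    }
    /^cluster\..*\.membership_total: / {
      name = $1
      sub(/^cluster\./, "", name); sub(/\.membership_total:$/, "", name)
      total[name] = $2 + 0
    }
    END { for (c in total) if (total[c] > 0 && healthy[c] == 0) print c }'
}
```

Usage: curl -s http://localhost:9901/stats | clusters_with_no_healthy_hosts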

If membership_healthy is 0 while membership_total is greater than zero, Envoy considers every configured host unhealthy. Verify the upstream is reachable from Envoy's exact network context:

# In Kubernetes: exec into the sidecar proxy container
kubectl exec -it <pod-name> -c istio-proxy -- \
  curl -v http://<upstream-service>:<port>/healthz

# Test raw TCP reachability
kubectl exec -it <pod-name> -c istio-proxy -- \
  nc -zv <upstream-host> <port>

A frequent misconfiguration: the path in Envoy's health check config points to /health but the upstream only serves /healthz or /ping. Fix the cluster config:

# Cluster health_checks config (envoy.yaml or CDS)
health_checks:
  - timeout: 5s
    interval: 10s
    unhealthy_threshold: 3
    healthy_threshold: 2
    http_health_check:
      path: /healthz   # Must match the actual endpoint the backend serves
      expected_statuses:
        - start: 200   # Inclusive lower bound
          end: 300     # Exclusive upper bound -- end: 300 accepts every 2xx
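One subtlety worth remembering: expected_statuses uses half-open ranges, where start is inclusive and end is exclusive. A two-line helper makes the matching rule explicit:

```shell
# Mirror Envoy's expected_statuses matching: each range is half-open,
# [start, end), so accepting every 2xx requires start=200, end=300.
status_expected() {
  local code="$1" start="$2" end="$3"
  [ "$code" -ge "$start" ] && [ "$code" -lt "$end" ]
}
```

With end set to 299, a perfectly valid 299 response would fail the health check; with end: 300 the full 2xx class passes.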

Step 4: Diagnose Outlier Detection Ejections

Envoy's outlier detection automatically ejects hosts that return consecutive 5xx responses, gateway failures, or exhibit abnormally high latency. After a deployment spike or transient error burst, hosts may remain ejected even after the backend has fully recovered.

# Check for active ejections
curl -s http://localhost:9901/stats | grep 'outlier_detection'

# Watch for non-zero values:
# cluster.<name>.outlier_detection.ejections_active          -- currently ejected
# cluster.<name>.outlier_detection.ejections_consecutive_5xx
# cluster.<name>.outlier_detection.ejections_total

If ejections_active is greater than zero but the backends have since recovered, you can wait it out: ejected hosts are automatically re-admitted after base_ejection_time (default: 30 seconds) multiplied by the number of times the host has been ejected, capped at max_ejection_time (default: 300 seconds). For an immediate fix in production:

# Shorten base_ejection_time via runtime override (takes effect immediately;
# requires an admin_layer in Envoy's layered runtime config)
curl -X POST 'http://localhost:9901/runtime_modify?outlier_detection.base_ejection_time_ms=1000'
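Per the outlier detection documentation, the actual ejection duration is base_ejection_time multiplied by the number of times the host has been ejected, capped at max_ejection_time. The arithmetic, sketched with the defaults:

```shell
# Expected ejection duration (seconds) for the nth ejection of a host:
# base_ejection_time * n, capped at max_ejection_time.
# Envoy defaults: 30s base, 300s cap.
ejection_seconds() {
  local base="${1:-30}" n="${2:-1}" cap="${3:-300}"
  local t=$(( base * n ))
  if [ "$t" -gt "$cap" ]; then t="$cap"; fi
  echo "$t"
}
```

A host on its third ejection with defaults is out for 90 seconds, which explains why a flapping backend can appear to "stay down" long after it recovers.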

The permanent fix is tuning the outlier detection configuration to be less aggressive:

outlier_detection:
  consecutive_5xx: 10              # Default: 5; increase for noisy services
  consecutive_gateway_failure: 10
  interval: 30s                   # Evaluation window
  base_ejection_time: 15s         # Minimum ejection duration
  max_ejection_percent: 50        # Never eject more than half the cluster
  enforcing_consecutive_5xx: 100
  enforcing_consecutive_gateway_failure: 0

Step 5: Diagnose Circuit Breaker Overflow

If the response flag is UO, the circuit breaker is rejecting requests because the upstream connection pool is exhausted:

# Check circuit breaker overflow counters
curl -s http://localhost:9901/stats | grep -E 'upstream_(cx_overflow|rq_pending_overflow)'

# If upstream_rq_pending_overflow is incrementing,
# requests are queuing faster than the upstream can serve them.

Fix by increasing thresholds after verifying the upstream can actually handle the higher concurrency:

circuit_breakers:
  thresholds:
    - priority: DEFAULT
      max_connections: 1024
      max_pending_requests: 2048  # Increase for bursty workloads
      max_requests: 2048
      max_retries: 10
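To confirm which breaker is actually tripping, Envoy exposes per-priority gauges such as cluster.<name>.circuit_breakers.default.rq_pending_open that read 1 while the corresponding breaker is open. A sketch that filters them out of /stats:

```shell
# List circuit-breaker gauges currently tripped (value 1) in /stats output.
# Matches the per-priority *_open gauges: cx_open, rq_open,
# rq_pending_open, and rq_retry_open.
open_circuit_breakers() {
  awk -F': ' '$1 ~ /circuit_breakers\..*_open$/ && $2 + 0 == 1 { print $1 }'
}
```

Usage: curl -s http://localhost:9901/stats | open_circuit_breakers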

Step 6: Diagnose in Istio and xDS-Managed Deployments

In Istio, cluster and endpoint configuration is pushed by the Istiod control plane. Use istioctl to verify the sidecar proxy received correct configuration:

# Check clusters configured for a pod
istioctl proxy-config cluster <pod-name>.<namespace>

# Check endpoint health for a specific service
istioctl proxy-config endpoint <pod-name>.<namespace> \
  --cluster "outbound|80||my-service.default.svc.cluster.local"

# Verify control plane sync status
istioctl proxy-status <pod-name>.<namespace>

If endpoints show UNHEALTHY or the list is empty with zero healthy hosts, the problem is in the Kubernetes Endpoints object or the control plane — not Envoy's runtime state. Verify that the Service selector matches your pod labels and that pod readiness probes are passing.
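A selector/label mismatch is the most common reason the endpoint list is empty. The helper below is a hypothetical sketch: it checks that every comma-separated key=value pair in a Service selector also appears in a pod's label set (pairs containing commas or spaces are not handled).

```shell
# Check that every key=value pair in a Service selector (arg 1,
# comma-separated) also appears in the pod's labels (arg 2). A mismatch
# means the Endpoints object stays empty and Envoy sees zero hosts.
selector_matches() {
  local selector="$1" labels="$2" kv
  for kv in $(printf '%s' "$selector" | tr ',' ' '); do
    case ",$labels," in
      *",$kv,"*) ;;
      *) echo "MISMATCH: selector wants $kv"; return 1 ;;
    esac
  done
  echo "MATCH"
}
```

The key=value strings can be produced from kubectl output, for example with jq: kubectl get svc my-service -o json | jq -r '.spec.selector | to_entries | map("\(.key)=\(.value)") | join(",")' (service name is illustrative).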

Step 7: Enable Temporary Debug Logging

For intermittent or hard-to-reproduce 503 errors, temporarily increase Envoy log verbosity at runtime — no restart required:

# Enable debug logging for relevant subsystems
curl -X POST 'http://localhost:9901/logging?upstream=debug'
curl -X POST 'http://localhost:9901/logging?router=debug'
curl -X POST 'http://localhost:9901/logging?health_checker=debug'

# Reproduce the 503, capture logs, then reset to avoid log flood
curl -X POST 'http://localhost:9901/logging?upstream=warning'
curl -X POST 'http://localhost:9901/logging?router=warning'
curl -X POST 'http://localhost:9901/logging?health_checker=warning'

Debug logs for the upstream component show every connection attempt with the exact socket-level error — for example Connection refused on 10.0.0.5:8080 or i/o timeout after 5000ms — that triggers a host to be marked unhealthy and eventually ejected.
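When the debug capture is large, grouping the socket-level errors makes the dominant failure obvious. A minimal sketch that tallies the error phrases quoted above; extend the pattern list to match your own logs:

```shell
# Tally socket-level upstream errors in Envoy debug logs read from stdin,
# extracting only the error phrase so identical error kinds group together.
upstream_error_summary() {
  grep -oE 'Connection refused|i/o timeout|connection_failure|reset reason: [a-z_]+' |
    sort | uniq -c | sort -rn
}
```

Usage: upstream_error_summary < /tmp/envoy-debug.log (path is illustrative).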

All-in-One Diagnostic Script

#!/usr/bin/env bash
# ============================================================
# Envoy 503 Diagnostic Script
# Usage: bash envoy-503-diag.sh [admin_url] [cluster_filter]
# Example: bash envoy-503-diag.sh http://localhost:9901 my_service
# ============================================================

ADMIN_URL="${1:-http://localhost:9901}"
FILTER="${2:-.}"

echo "=== Envoy 503 Diagnostic Report ==="
echo "Target: $ADMIN_URL  |  Filter: $FILTER"
echo ""

# 0. Verify Admin API is reachable
if ! curl -sf "${ADMIN_URL}/ready" >/dev/null 2>&1; then
  echo "[FAIL] Admin API unreachable at ${ADMIN_URL}"
  echo "       Ensure Envoy is running and the admin listener is bound."
  exit 1
fi
echo "[OK] Admin API is reachable"
echo ""

# 1. Unhealthy or ejected hosts (the primary 503 cause)
echo "--- 1. Unhealthy / Ejected Hosts ---"
RESULT=$(curl -s "${ADMIN_URL}/clusters" |
  grep -v "health_flags::healthy" |
  grep "health_flags" |
  grep "$FILTER")
echo "${RESULT:-  (none -- all hosts healthy)}"
echo ""

# 2. Membership counts: healthy vs total
echo "--- 2. Cluster Membership (healthy vs total) ---"
curl -s "${ADMIN_URL}/stats" |
  grep -E "cluster\..*\.(membership_healthy|membership_total)" |
  grep "$FILTER" | sort
echo ""

# 3. Outlier detection ejections (non-zero only)
echo "--- 3. Active Outlier Ejections ---"
RESULT=$(curl -s "${ADMIN_URL}/stats" |
  grep "outlier_detection.ejections" |
  grep "$FILTER" |
  grep -v ": 0$")
echo "${RESULT:-  (none)}"
echo ""

# 4. Circuit breaker overflows (non-zero only)
echo "--- 4. Circuit Breaker Overflows ---"
RESULT=$(curl -s "${ADMIN_URL}/stats" |
  grep -E "upstream_(cx_overflow|rq_pending_overflow|rq_retry_overflow)" |
  grep "$FILTER" |
  grep -v ": 0$")
echo "${RESULT:-  (none)}"
echo ""

# 5. Health check failures (non-zero only)
echo "--- 5. Health Check Failures ---"
RESULT=$(curl -s "${ADMIN_URL}/stats" |
  grep -E "health_check\.(failure|network_failure)" |
  grep "$FILTER" |
  grep -v ": 0$")
echo "${RESULT:-  (none)}"
echo ""

# 6. Upstream TCP connection errors (non-zero only)
echo "--- 6. Upstream TCP Connection Errors ---"
RESULT=$(curl -s "${ADMIN_URL}/stats" |
  grep "upstream_cx_connect_fail" |
  grep "$FILTER" |
  grep -v ": 0$")
echo "${RESULT:-  (none)}"
echo ""

# 7. Istio-specific checks (if istioctl is available)
if command -v istioctl >/dev/null 2>&1; then
  echo "--- 7. Istio Proxy Status ---"
  istioctl proxy-status 2>/dev/null | head -20
  echo ""
fi

echo "=== Remediation Guide ==="
echo "UH flag (no healthy upstream) -> Check sections 1-3; tune outlier_detection"
echo "UO flag (circuit breaker)     -> Check section 4; increase max_pending_requests"
echo "UF flag (connection failure)  -> Check section 6; verify network path and port"
echo "0 healthy in section 2        -> Active health checks failing; fix path or backend"
echo ""
echo "Enable debug logging (no restart): curl -X POST '${ADMIN_URL}/logging?upstream=debug'"
echo "Disable after use:                 curl -X POST '${ADMIN_URL}/logging?upstream=warning'"
echo "Istio endpoint check:              istioctl proxy-config endpoint <pod>.<ns>"

Error Medic Editorial

Error Medic Editorial is a team of senior DevOps, SRE, and platform engineers with extensive hands-on experience operating Envoy-based service meshes at scale. Our troubleshooting guides are built from real production incidents across Kubernetes, Istio, AWS App Mesh, and bare-metal Envoy deployments, specializing in distributed systems observability, zero-downtime deployments, and traffic management patterns.
