Error Medic

Prometheus Connection Refused: Complete Troubleshooting Guide (CrashLoopBackOff, OOM, Permission Denied)

Fix Prometheus 'connection refused', CrashLoopBackOff, OOM kills, and permission denied errors with step-by-step commands and config examples.

Key Takeaways
  • Connection refused usually means Prometheus is not running, bound to the wrong address, or blocked by a firewall/NetworkPolicy — check `kubectl get pods` and `netstat -tlnp` first
  • CrashLoopBackOff and OOM kills are almost always caused by insufficient memory limits, misconfigured `--storage.tsdb.retention.time`/`--storage.tsdb.retention.size` flags, or a cardinality explosion from high-churn label sets
  • Permission denied errors on startup point to a volume mount with wrong UID/GID ownership — the Prometheus binary runs as UID 65534 (nobody) by default and cannot write to root-owned directories
  • Quick fix checklist: verify the process is up → check bind address → inspect resource limits → fix storage permissions → review scrape configs for label cardinality
Fix Approaches Compared
| Method | When to Use | Time | Risk |
| --- | --- | --- | --- |
| Restart pod / process | Transient crash, OOM after memory limit raised | < 2 min | Low — no config change |
| Increase memory limit | Repeated OOM kills shown in `kubectl describe pod` | 5–10 min | Low — rolling restart required |
| Fix `--web.listen-address` flag | Prometheus not binding to 0.0.0.0 or correct port | 5 min | Low |
| Fix storage volume permissions (chown/securityContext) | Permission denied on /prometheus data dir at startup | 5–10 min | Low — requires pod restart |
| Reduce label cardinality / add `metric_relabel_configs` | Cardinality explosion causing OOM or slow queries | 30–60 min | Medium — may drop series |
| Tune `--storage.tsdb.retention.size` | Disk full causing crashes or write errors | 5 min | Low |
| Add NetworkPolicy / firewall rule | Connection refused from external client or other pod | 10–20 min | Medium — affects network topology |
| Upgrade Prometheus version | Bug in older release causing crashes or timeouts | 20–40 min | Medium — test in staging first |

Understanding Prometheus Connection Errors

Prometheus exposes an HTTP API and UI on port 9090 by default. A connection refused error means the TCP handshake never completed: the target's kernel answered the SYN with a RST packet because nothing was listening on that port. This is distinct from a timeout (no response at all) and from a 401/403 (the process is up but rejects the request).
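
From the client side, curl's exit code separates the three cases; here is a sketch where the target URL is an assumption to adjust for your environment:

```shell
#!/bin/sh
# Classify the failure mode by curl's exit code (target URL is an assumption)
url="http://prometheus:9090/-/healthy"
rc=0
curl -sS --max-time 5 -o /dev/null "$url" 2>/dev/null || rc=$?
case "$rc" in
  0)  echo "HTTP response received - if it was 401/403, the process is up but rejecting you" ;;
  7)  echo "connection refused - nothing listening on that address:port" ;;
  28) echo "timeout - packets silently dropped (firewall or NetworkPolicy)" ;;
  *)  echo "curl failed with exit code $rc (e.g. 6 = DNS resolution failed)" ;;
esac
```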

Common exact errors you will see:

Get "http://prometheus:9090/api/v1/query": dial tcp 10.96.0.1:9090: connect: connection refused
ts=2024-01-15T10:23:45Z level=error msg="Opening storage failed" err="open /prometheus/queries.active: permission denied"
OOMKilled
Back-off restarting failed container prometheus in pod prometheus-0

Step 1: Determine Whether Prometheus Is Running

Kubernetes environments:

kubectl get pods -n monitoring -l app=prometheus
kubectl describe pod prometheus-0 -n monitoring   # look at Events and Last State
kubectl logs prometheus-0 -n monitoring --previous  # logs from crashed container

Look for these fields in kubectl describe pod:

  • Last State: Terminated, Reason: OOMKilled → memory problem
  • Last State: Terminated, Reason: Error with exit code 1 or 2 → config or permission error
  • Restart Count greater than 3 → CrashLoopBackOff pattern

Bare-metal / VM environments:

systemctl status prometheus
journalctl -u prometheus -n 100 --no-pager
ps aux | grep prometheus

Step 2: Diagnose Connection Refused

Once you confirm the process state, narrow down the cause:

# Is Prometheus actually listening on port 9090?
ss -tlnp | grep 9090
# or on older systems:
netstat -tlnp | grep 9090

# Can you reach it locally?
curl -v http://localhost:9090/-/healthy

# In Kubernetes — check Service selector matches pod labels:
kubectl get svc prometheus -n monitoring -o yaml
kubectl get endpoints prometheus -n monitoring
# If Endpoints shows <none>, the Service selector is wrong

If ss shows nothing on port 9090 but the process is running, check the --web.listen-address flag:

kubectl exec -it prometheus-0 -n monitoring -- /bin/prometheus --help 2>&1 | grep listen
# Then check actual flags the process was started with:
kubectl exec -it prometheus-0 -n monitoring -- cat /proc/1/cmdline | tr '\0' ' '

The flag must be --web.listen-address=0.0.0.0:9090 (or just :9090, which also binds every interface) for the service to be reachable from other pods. If it is 127.0.0.1:9090, only loopback traffic works.
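
The difference is easy to reproduce locally; this sketch uses python3's built-in http.server as a stand-in for Prometheus (port 19090 is arbitrary):

```shell
#!/bin/sh
# Bind a server to loopback only - the analogue of --web.listen-address=127.0.0.1:9090
python3 -m http.server 19090 --bind 127.0.0.1 >/dev/null 2>&1 &
SRV=$!
sleep 1

# Over loopback the port answers:
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:19090/   # prints 200

# From any other pod or host, the same port is "connection refused",
# because no socket is bound on the non-loopback interfaces.
kill "$SRV"
```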


Step 3: Fix CrashLoopBackOff

CrashLoopBackOff is not a root cause — it is Kubernetes backing off restarts of a container that keeps failing. Identify why it crashes:

# Get the last 200 lines from the crashed container
kubectl logs prometheus-0 -n monitoring --previous --tail=200

Scenario A — Bad configuration file:

ts=2024-01-15T10:23:45Z level=error msg="Error loading config" file=/etc/prometheus/prometheus.yml err="yaml: line 42: mapping values are not allowed in this context"

Validate the config before applying:

promtool check config /etc/prometheus/prometheus.yml

In Kubernetes, the ConfigMap is often the source. Edit it with kubectl edit configmap prometheus-config -n monitoring and look for YAML indentation errors.

Scenario B — Storage corruption:

level=error msg="Failed to open db" err="unexpected end of JSON input"

This usually points to a truncated meta.json or a corrupted write-ahead log (WAL); removing the WAL is the common recovery:

# List TSDB blocks
ls -lah /prometheus/
# Remove WAL if corrupted (data loss for in-flight samples only)
rm -rf /prometheus/wal
# Then restart Prometheus
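
If you want to preserve the evidence for later inspection, a hedged alternative to `rm -rf` is moving the WAL aside:

```shell
#!/bin/sh
# Move the WAL out of the way instead of deleting it, so it can be inspected later
mv /prometheus/wal "/prometheus/wal.corrupt.$(date +%s)"
# Prometheus recreates an empty WAL on the next start; only in-flight samples are lost
```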

Step 4: Fix OOM Killed

kubectl describe pod prometheus-0 -n monitoring | grep -A5 "Last State"
# Last State: Terminated
#   Reason: OOMKilled
#   Exit Code: 137

Exit code 137 = 128 + 9 (SIGKILL). The Linux OOM killer terminated the process.
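
The 128-plus-signal rule is easy to verify locally; this sketch has nothing Prometheus-specific in it:

```shell
#!/bin/sh
# A process killed by SIGKILL (signal 9) exits with 128 + 9 = 137,
# the same code kubectl reports for OOMKilled containers.
sh -c 'kill -9 $$'
echo "exit code: $?"    # prints: exit code: 137
```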

Option 1 — Raise the memory limit (fast):

# In your Deployment or StatefulSet spec:
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "4Gi"

Apply and wait for rollout: kubectl rollout status statefulset/prometheus -n monitoring

Option 2 — Reduce ingestion cardinality (sustainable):

High cardinality (millions of unique time series) is the most common cause of Prometheus OOM. Use the built-in TSDB status endpoint to find offenders:

curl http://localhost:9090/api/v1/status/tsdb | python3 -m json.tool | head -80

Look at headStats.numSeries. If it exceeds 1–2 million on a single Prometheus instance with 4 GiB RAM, you have a cardinality problem.
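
To see which metrics contribute the most series, the same endpoint's `seriesCountByMetricName` field can be ranked; a sketch assuming python3 is on the PATH:

```shell
#!/bin/sh
# Rank the top-10 series-count offenders reported by the TSDB status endpoint
curl -s http://localhost:9090/api/v1/status/tsdb | python3 -c '
import json, sys
stats = json.load(sys.stdin)["data"]["seriesCountByMetricName"]
for entry in sorted(stats, key=lambda e: -int(e["value"]))[:10]:
    print(entry["value"], entry["name"])
'
```

Metrics near the top with six or seven digit counts are the first candidates for relabeling.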

Drop high-cardinality labels with metric_relabel_configs:

# prometheus.yml scrape config
scrape_configs:
  - job_name: 'my-app'
    metric_relabel_configs:
      - regex: 'request_id'             # drop unique per-request labels
        action: labeldrop               # labeldrop matches label names; it takes no source_labels
      - source_labels: [__name__]
        regex: 'go_gc_.*'               # drop noisy Go runtime metrics
        action: drop

Option 3 — Tune retention:

# Limit retention by size instead of time
--storage.tsdb.retention.size=10GB
--storage.tsdb.retention.time=15d

Step 5: Fix Permission Denied

Prometheus runs as UID 65534 (nobody) by default. If the /prometheus data directory was created by root or another user, Prometheus cannot write to it.

ts=2024-01-15T10:23:45Z level=error caller=main.go:174 msg="Opening storage failed" err="open /prometheus/queries.active: permission denied"

Fix on bare metal:

chown -R 65534:65534 /var/lib/prometheus
# or if running as a dedicated user:
chown -R prometheus:prometheus /var/lib/prometheus
systemctl restart prometheus
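
Before the restart, a quick ownership check helps; the path below is the typical package default and an assumption here (if you run a dedicated prometheus user, compare against its UID instead):

```shell
#!/bin/sh
# Verify the data directory is owned by the UID Prometheus runs as (65534 by default)
dir="/var/lib/prometheus"
owner=$(stat -c '%u' "$dir")
if [ "$owner" = "65534" ]; then
  echo "ownership OK (UID $owner)"
else
  echo "owned by UID $owner - run: chown -R 65534:65534 $dir"
fi
```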

Fix in Kubernetes (preferred — use securityContext):

spec:
  securityContext:
    runAsUser: 65534
    runAsNonRoot: true
    fsGroup: 65534      # ensures mounted volumes are chowned to this GID
  containers:
    - name: prometheus
      # ...

The fsGroup field is the key — Kubernetes will chown the volume mount point to that GID on pod startup, so Prometheus can write to it without running as root.


Step 6: Fix Scrape Timeouts

If Prometheus is running but you see timeout errors in the UI or logs:

level=warn msg="Scrape failed" scrape_url="http://my-app:8080/metrics" err="context deadline exceeded"

Increase the per-job scrape timeout in prometheus.yml:

global:
  scrape_timeout: 10s   # default is 10s; increase if targets are slow

scrape_configs:
  - job_name: 'slow-app'
    scrape_timeout: 30s  # job-level override
    scrape_interval: 60s

Note: scrape_timeout must always be less than or equal to scrape_interval.


Step 7: Check NetworkPolicies and Firewalls

In hardened Kubernetes clusters, NetworkPolicies can silently block traffic:

kubectl get networkpolicies -n monitoring
# Test connectivity from another pod:
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -n default -- \
  curl -v http://prometheus.monitoring.svc.cluster.local:9090/-/healthy

If the curl pod gets connection refused but Prometheus is running, check if a NetworkPolicy is blocking ingress to port 9090 from the default namespace.
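
If that is the case, a minimal allow rule looks like the sketch below (the name and the wide-open namespace selector are assumptions; tighten `from` to match your actual clients):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-ui
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector: {}   # any namespace; restrict this in hardened clusters
      ports:
        - protocol: TCP
          port: 9090
```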

Complete Diagnostic Script

#!/usr/bin/env bash
# Prometheus Diagnostic Script
# Usage: Run this on the node or via kubectl exec

set -euo pipefail
NAMESPACE="monitoring"
POD=$(kubectl get pods -n "$NAMESPACE" -l app=prometheus -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")

echo "=== Pod Status ==="
if [ -n "$POD" ]; then
  kubectl get pod "$POD" -n "$NAMESPACE" -o wide
  kubectl describe pod "$POD" -n "$NAMESPACE" | grep -A10 "Conditions:\|Last State:\|Limits:\|Requests:\|Events:"
else
  # Bare-metal fallback
  systemctl status prometheus --no-pager || true
  ps aux | grep '[p]rometheus'
fi

echo ""
echo "=== Port Binding ==="
if [ -n "$POD" ]; then
  kubectl exec "$POD" -n "$NAMESPACE" -- ss -tlnp 2>/dev/null || \
    kubectl exec "$POD" -n "$NAMESPACE" -- netstat -tlnp 2>/dev/null || true
else
  ss -tlnp | grep 9090 || echo "Nothing listening on 9090"
fi

echo ""
echo "=== Health Check ==="
if [ -n "$POD" ]; then
  kubectl exec "$POD" -n "$NAMESPACE" -- wget -qO- http://localhost:9090/-/healthy 2>&1 || \
    echo "Health check FAILED"
else
  curl -sf http://localhost:9090/-/healthy && echo "OK" || echo "FAILED"
fi

echo ""
echo "=== Recent Logs ==="
if [ -n "$POD" ]; then
  kubectl logs "$POD" -n "$NAMESPACE" --previous --tail=50 2>/dev/null || \
    kubectl logs "$POD" -n "$NAMESPACE" --tail=50
else
  journalctl -u prometheus -n 50 --no-pager
fi

echo ""
echo "=== TSDB Status (cardinality) ==="
TSDB_URL="http://localhost:9090/api/v1/status/tsdb"
if [ -n "$POD" ]; then
  kubectl exec "$POD" -n "$NAMESPACE" -- wget -qO- "$TSDB_URL" 2>/dev/null | \
    python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print('Series:', d['headStats']['numSeries'], '| Chunks:', d['headStats']['numChunks'])" 2>/dev/null || true
else
  curl -sf "$TSDB_URL" | python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print('Series:', d['headStats']['numSeries'], '| Chunks:', d['headStats']['numChunks'])" 2>/dev/null || true
fi

echo ""
echo "=== Endpoints (Service wiring) ==="
if [ -n "$POD" ]; then
  kubectl get endpoints -n "$NAMESPACE" | grep -i prom || echo "No endpoints found"
fi

echo ""
echo "=== Storage Permissions ==="
if [ -n "$POD" ]; then
  kubectl exec "$POD" -n "$NAMESPACE" -- ls -lah /prometheus/ 2>/dev/null || true
else
  ls -lah /var/lib/prometheus/ 2>/dev/null || ls -lah /prometheus/ 2>/dev/null || true
fi

echo ""
echo "Diagnostic complete."

Error Medic Editorial

The Error Medic Editorial team consists of senior SREs and platform engineers with experience running Prometheus at scale across bare-metal, AWS EKS, GKE, and on-prem Kubernetes clusters. We write from production incidents, not documentation.
