Error Medic

Prometheus Connection Refused: Complete Troubleshooting Guide (CrashLoopBackOff, OOM, Permission Denied)

Fix Prometheus 'connection refused', CrashLoopBackOff, OOM kills, and permission denied errors with step-by-step commands and config examples.

Key Takeaways
  • Connection refused usually means Prometheus is not running, bound to the wrong address, or blocked by a firewall/NetworkPolicy — check `kubectl get pods` and `netstat -tlnp` first
  • CrashLoopBackOff and OOM kills are almost always caused by insufficient memory limits, misconfigured `--storage.tsdb.retention.time`/`--storage.tsdb.retention.size` flags, or a cardinality explosion from high-churn label sets
  • Permission denied errors on startup point to a volume mount with wrong UID/GID ownership — the Prometheus binary runs as UID 65534 (nobody) by default and cannot write to root-owned directories
  • Quick fix checklist: verify the process is up → check bind address → inspect resource limits → fix storage permissions → review scrape configs for label cardinality
Fix Approaches Compared
| Method | When to Use | Time | Risk |
| --- | --- | --- | --- |
| Restart pod / process | Transient crash, OOM after memory limit raised | < 2 min | Low — no config change |
| Increase memory limit | Repeated OOM kills shown in `kubectl describe pod` | 5–10 min | Low — rolling restart required |
| Fix `--web.listen-address` flag | Prometheus not binding to 0.0.0.0 or correct port | 5 min | Low |
| Fix storage volume permissions (chown/securityContext) | Permission denied on /prometheus data dir at startup | 5–10 min | Low — requires pod restart |
| Reduce label cardinality / add `metric_relabel_configs` | Cardinality explosion causing OOM or slow queries | 30–60 min | Medium — may drop series |
| Tune `--storage.tsdb.retention.size` | Disk full causing crashes or write errors | 5 min | Low |
| Add NetworkPolicy / firewall rule | Connection refused from external client or other pod | 10–20 min | Medium — affects network topology |
| Upgrade Prometheus version | Bug in older release causing crashes or timeouts | 20–40 min | Medium — test in staging first |

Understanding Prometheus Connection Errors

Prometheus exposes an HTTP API and UI on port 9090 by default. A connection refused error means the TCP handshake never completed: the target's kernel answered the SYN with a RST packet because nothing was listening on that port. This is distinct from a timeout (no response at all) and from a 401/403 (the process is up but rejects the request).
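
From the client side, curl's exit code separates the three cases; here is a sketch where the target URL is an assumption to adjust for your environment:

```shell
#!/bin/sh
# Classify the failure mode by curl's exit code (target URL is an assumption)
url="http://prometheus:9090/-/healthy"
rc=0
curl -sS --max-time 5 -o /dev/null "$url" 2>/dev/null || rc=$?
case "$rc" in
  0)  echo "HTTP response received - if it was 401/403, the process is up but rejecting you" ;;
  7)  echo "connection refused - nothing listening on that address:port" ;;
  28) echo "timeout - packets silently dropped (firewall or NetworkPolicy)" ;;
  *)  echo "curl failed with exit code $rc (e.g. 6 = DNS resolution failed)" ;;
esac
```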

Common exact errors you will see:

Get "http://prometheus:9090/api/v1/query": dial tcp 10.96.0.1:9090: connect: connection refused
ts=2024-01-15T10:23:45Z level=error msg="Opening storage failed" err="open /prometheus/queries.active: permission denied"
OOMKilled
Back-off restarting failed container prometheus in pod prometheus-0

Step 1: Determine Whether Prometheus Is Running

Kubernetes environments:

kubectl get pods -n monitoring -l app=prometheus
kubectl describe pod prometheus-0 -n monitoring   # look at Events and Last State
kubectl logs prometheus-0 -n monitoring --previous  # logs from crashed container

Look for these fields in kubectl describe pod:

  • Last State: Terminated, Reason: OOMKilled → memory problem
  • Last State: Terminated, Reason: Error with exit code 1 or 2 → config or permission error
  • Restart Count greater than 3 → CrashLoopBackOff pattern

Bare-metal / VM environments:

systemctl status prometheus
journalctl -u prometheus -n 100 --no-pager
ps aux | grep prometheus

Step 2: Diagnose Connection Refused

Once you confirm the process state, narrow down the cause:

# Is Prometheus actually listening on port 9090?
ss -tlnp | grep 9090
# or on older systems:
netstat -tlnp | grep 9090

# Can you reach it locally?
curl -v http://localhost:9090/-/healthy

# In Kubernetes — check Service selector matches pod labels:
kubectl get svc prometheus -n monitoring -o yaml
kubectl get endpoints prometheus -n monitoring
# If Endpoints shows <none>, the Service selector is wrong

If ss shows nothing on port 9090 but the process is running, check the --web.listen-address flag:

kubectl exec -it prometheus-0 -n monitoring -- /bin/prometheus --help 2>&1 | grep listen
# Then check actual flags the process was started with:
kubectl exec -it prometheus-0 -n monitoring -- cat /proc/1/cmdline | tr '\0' ' '

The flag must be --web.listen-address=0.0.0.0:9090 (or just :9090, which also binds every interface) for the service to be reachable from other pods. If it is 127.0.0.1:9090, only loopback traffic works.
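
The difference is easy to reproduce locally; this sketch uses python3's built-in http.server as a stand-in for Prometheus (port 19090 is arbitrary):

```shell
#!/bin/sh
# Bind a server to loopback only - the analogue of --web.listen-address=127.0.0.1:9090
python3 -m http.server 19090 --bind 127.0.0.1 >/dev/null 2>&1 &
SRV=$!
sleep 1

# Over loopback the port answers:
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:19090/   # prints 200

# From any other pod or host, the same port is "connection refused",
# because no socket is bound on the non-loopback interfaces.
kill "$SRV"
```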


Step 3: Fix CrashLoopBackOff

CrashLoopBackOff is not a root cause — it is Kubernetes backing off restarts of a container that keeps failing. Identify why it crashes:

# Get the last 200 lines from the crashed container
kubectl logs prometheus-0 -n monitoring --previous --tail=200

Scenario A — Bad configuration file:

ts=2024-01-15T10:23:45Z level=error msg="Error loading config" file=/etc/prometheus/prometheus.yml err="yaml: line 42: mapping values are not allowed in this context"

Validate the config before applying:

promtool check config /etc/prometheus/prometheus.yml

In Kubernetes, the ConfigMap is often the source. Edit it with kubectl edit configmap prometheus-config -n monitoring and look for YAML indentation errors.

Scenario B — Storage corruption:

level=error msg="Failed to open db" err="unexpected end of JSON input"

This usually points to a truncated meta.json or a corrupted write-ahead log (WAL); removing the WAL is the common recovery:

# List TSDB blocks
ls -lah /prometheus/
# Remove WAL if corrupted (data loss for in-flight samples only)
rm -rf /prometheus/wal
# Then restart Prometheus
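
If you want to preserve the evidence for later inspection, a hedged alternative to `rm -rf` is moving the WAL aside:

```shell
#!/bin/sh
# Move the WAL out of the way instead of deleting it, so it can be inspected later
mv /prometheus/wal "/prometheus/wal.corrupt.$(date +%s)"
# Prometheus recreates an empty WAL on the next start; only in-flight samples are lost
```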

Step 4: Fix OOM Killed

kubectl describe pod prometheus-0 -n monitoring | grep -A5 "Last State"
# Last State: Terminated
#   Reason: OOMKilled
#   Exit Code: 137

Exit code 137 = 128 + 9 (SIGKILL). The Linux OOM killer terminated the process.
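
The 128-plus-signal rule is easy to verify locally; this sketch has nothing Prometheus-specific in it:

```shell
#!/bin/sh
# A process killed by SIGKILL (signal 9) exits with 128 + 9 = 137,
# the same code kubectl reports for OOMKilled containers.
sh -c 'kill -9 $$'
echo "exit code: $?"    # prints: exit code: 137
```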

Option 1 — Raise the memory limit (fast):

# In your Deployment or StatefulSet spec:
resources:
  requests:
    memory: "2Gi"
  limits:
    memory: "4Gi"

Apply and wait for rollout: kubectl rollout status statefulset/prometheus -n monitoring

Option 2 — Reduce ingestion cardinality (sustainable):

High cardinality (millions of unique time series) is the most common cause of Prometheus OOM. Use the built-in TSDB status endpoint to find offenders:

curl http://localhost:9090/api/v1/status/tsdb | python3 -m json.tool | head -80

Look at headStats.numSeries. If it exceeds 1–2 million on a single Prometheus instance with 4 GiB RAM, you have a cardinality problem.
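
To see which metrics contribute the most series, the same endpoint's `seriesCountByMetricName` field can be ranked; a sketch assuming python3 is on the PATH:

```shell
#!/bin/sh
# Rank the top-10 series-count offenders reported by the TSDB status endpoint
curl -s http://localhost:9090/api/v1/status/tsdb | python3 -c '
import json, sys
stats = json.load(sys.stdin)["data"]["seriesCountByMetricName"]
for entry in sorted(stats, key=lambda e: -int(e["value"]))[:10]:
    print(entry["value"], entry["name"])
'
```

Metrics near the top with six or seven digit counts are the first candidates for relabeling.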

Drop high-cardinality labels with metric_relabel_configs:

# prometheus.yml scrape config
scrape_configs:
  - job_name: 'my-app'
    metric_relabel_configs:
      - regex: 'request_id'             # drop unique per-request labels
        action: labeldrop               # labeldrop matches label names; it takes no source_labels
      - source_labels: [__name__]
        regex: 'go_gc_.*'               # drop noisy Go runtime metrics
        action: drop

Option 3 — Tune retention:

# Limit retention by size instead of time
--storage.tsdb.retention.size=10GB
--storage.tsdb.retention.time=15d

Step 5: Fix Permission Denied

Prometheus runs as UID 65534 (nobody) by default. If the /prometheus data directory was created by root or another user, Prometheus cannot write to it.

ts=2024-01-15T10:23:45Z level=error caller=main.go:174 msg="Opening storage failed" err="open /prometheus/queries.active: permission denied"

Fix on bare metal:

chown -R 65534:65534 /var/lib/prometheus
# or if running as a dedicated user:
chown -R prometheus:prometheus /var/lib/prometheus
systemctl restart prometheus
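
Before the restart, a quick ownership check helps; the path below is the typical package default and an assumption here (if you run a dedicated prometheus user, compare against its UID instead):

```shell
#!/bin/sh
# Verify the data directory is owned by the UID Prometheus runs as (65534 by default)
dir="/var/lib/prometheus"
owner=$(stat -c '%u' "$dir")
if [ "$owner" = "65534" ]; then
  echo "ownership OK (UID $owner)"
else
  echo "owned by UID $owner - run: chown -R 65534:65534 $dir"
fi
```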

Fix in Kubernetes (preferred — use securityContext):

spec:
  securityContext:
    runAsUser: 65534
    runAsNonRoot: true
    fsGroup: 65534      # ensures mounted volumes are chowned to this GID
  containers:
    - name: prometheus
      # ...

The fsGroup field is the key — Kubernetes will chown the volume mount point to that GID on pod startup, so Prometheus can write to it without running as root.


Step 6: Fix Scrape Timeouts

If Prometheus is running but you see timeout errors in the UI or logs:

level=warn msg="Scrape failed" scrape_url="http://my-app:8080/metrics" err="context deadline exceeded"

Increase the per-job scrape timeout in prometheus.yml:

global:
  scrape_timeout: 10s   # default is 10s; increase if targets are slow

scrape_configs:
  - job_name: 'slow-app'
    scrape_timeout: 30s  # job-level override
    scrape_interval: 60s

Note: scrape_timeout must always be less than or equal to scrape_interval.


Step 7: Check NetworkPolicies and Firewalls

In hardened Kubernetes clusters, NetworkPolicies can silently block traffic:

kubectl get networkpolicies -n monitoring
# Test connectivity from another pod:
kubectl run -it --rm debug --image=curlimages/curl --restart=Never -n default -- \
  curl -v http://prometheus.monitoring.svc.cluster.local:9090/-/healthy

If the curl pod gets connection refused but Prometheus is running, check if a NetworkPolicy is blocking ingress to port 9090 from the default namespace.
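
If that is the case, a minimal allow rule looks like the sketch below (the name and the wide-open namespace selector are assumptions; tighten `from` to match your actual clients):

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-ui
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector: {}   # any namespace; restrict this in hardened clusters
      ports:
        - protocol: TCP
          port: 9090
```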

Complete Diagnostic Script

#!/usr/bin/env bash
# Prometheus Diagnostic Script
# Usage: Run this on the node or via kubectl exec

set -euo pipefail
NAMESPACE="monitoring"
POD=$(kubectl get pods -n "$NAMESPACE" -l app=prometheus -o jsonpath='{.items[0].metadata.name}' 2>/dev/null || echo "")

echo "=== Pod Status ==="
if [ -n "$POD" ]; then
  kubectl get pod "$POD" -n "$NAMESPACE" -o wide
  kubectl describe pod "$POD" -n "$NAMESPACE" | grep -A10 "Conditions:\|Last State:\|Limits:\|Requests:\|Events:"
else
  # Bare-metal fallback
  systemctl status prometheus --no-pager || true
  ps aux | grep '[p]rometheus'
fi

echo ""
echo "=== Port Binding ==="
if [ -n "$POD" ]; then
  kubectl exec "$POD" -n "$NAMESPACE" -- ss -tlnp 2>/dev/null || \
    kubectl exec "$POD" -n "$NAMESPACE" -- netstat -tlnp 2>/dev/null || true
else
  ss -tlnp | grep 9090 || echo "Nothing listening on 9090"
fi

echo ""
echo "=== Health Check ==="
if [ -n "$POD" ]; then
  kubectl exec "$POD" -n "$NAMESPACE" -- wget -qO- http://localhost:9090/-/healthy 2>&1 || \
    echo "Health check FAILED"
else
  curl -sf http://localhost:9090/-/healthy && echo "OK" || echo "FAILED"
fi

echo ""
echo "=== Recent Logs ==="
if [ -n "$POD" ]; then
  kubectl logs "$POD" -n "$NAMESPACE" --previous --tail=50 2>/dev/null || \
    kubectl logs "$POD" -n "$NAMESPACE" --tail=50
else
  journalctl -u prometheus -n 50 --no-pager
fi

echo ""
echo "=== TSDB Status (cardinality) ==="
TSDB_URL="http://localhost:9090/api/v1/status/tsdb"
if [ -n "$POD" ]; then
  kubectl exec "$POD" -n "$NAMESPACE" -- wget -qO- "$TSDB_URL" 2>/dev/null | \
    python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print('Series:', d['headStats']['numSeries'], '| Chunks:', d['headStats']['numChunks'])" 2>/dev/null || true
else
  curl -sf "$TSDB_URL" | python3 -c "import sys,json; d=json.load(sys.stdin)['data']; print('Series:', d['headStats']['numSeries'], '| Chunks:', d['headStats']['numChunks'])" 2>/dev/null || true
fi

echo ""
echo "=== Endpoints (Service wiring) ==="
if [ -n "$POD" ]; then
  kubectl get endpoints -n "$NAMESPACE" | grep -i prom || echo "No endpoints found"
fi

echo ""
echo "=== Storage Permissions ==="
if [ -n "$POD" ]; then
  kubectl exec "$POD" -n "$NAMESPACE" -- ls -lah /prometheus/ 2>/dev/null || true
else
  ls -lah /var/lib/prometheus/ 2>/dev/null || ls -lah /prometheus/ 2>/dev/null || true
fi

echo ""
echo "Diagnostic complete."

Error Medic Editorial

The Error Medic Editorial team consists of senior SREs and platform engineers with experience running Prometheus at scale across bare-metal, AWS EKS, GKE, and on-prem Kubernetes clusters. We write from production incidents, not documentation.
