Fixing 'Prometheus Connection Refused' and CrashLoopBackOff Errors
Diagnose and resolve Prometheus connection refused, CrashLoopBackOff, and OOMKilled errors. Learn how to fix permission denied and timeout issues in Kubernetes.
- Connection refused errors often stem from Prometheus crashing due to OOMKilled (Out of Memory) or misconfigured bind addresses.
- CrashLoopBackOff is frequently caused by invalid configuration files (prometheus.yml) or incorrect persistent volume permissions (Permission Denied).
- Timeouts usually indicate network policies blocking traffic or target endpoints being overwhelmed.
- Quick Fix: Check pod logs (`kubectl logs`) and describe the pod (`kubectl describe pod`) to immediately identify OOM kills or configuration parsing errors.
| Error Symptom | Common Root Cause | Resolution Strategy | Downtime Risk |
|---|---|---|---|
| Connection Refused | Process crashed or bound to localhost (127.0.0.1) instead of 0.0.0.0 | Update `--web.listen-address` flag or fix underlying crash | Medium |
| OOMKilled / CrashLoopBackOff | Insufficient memory limits or excessive metric cardinality | Increase memory requests/limits in deployment, optimize scrape configs | High |
| Permission Denied | Incorrect chown/chmod on Persistent Volume /data directory | Use initContainer to `chown -R 65534:65534 /prometheus` | Low |
| Target Scrape Timeout | Network latency, overloaded target, or strict NetworkPolicies | Increase `scrape_timeout`, verify NetworkPolicies allow egress | Low |
Understanding the 'Connection Refused' Error in Prometheus
When you encounter a connection refused error while trying to access the Prometheus web UI or when Grafana attempts to query Prometheus, it means the TCP handshake is failing. The operating system is actively rejecting the connection. In the context of Prometheus, especially running in Kubernetes, this almost always means the Prometheus process is either not running (it crashed), or it is running but listening on the wrong network interface.
Often, connection refused is merely a symptom of a deeper issue, such as the pod being stuck in a CrashLoopBackOff state due to OOMKilled (Out of Memory) events or Permission Denied errors on its storage volumes.
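To see what a refused connection looks like at the TCP level, you can attempt a connection to a port with no listener using bash's /dev/tcp pseudo-device (port 65533 is assumed to be closed on this machine):

```shell
# Attempt a TCP connection to a local port with no listener.
# The kernel answers the SYN with a RST, which the client sees as
# "connection refused". Port 65533 is assumed closed here.
if ! (exec 3<>/dev/tcp/127.0.0.1/65533) 2>/dev/null; then
  echo "connection refused: nothing is listening on 127.0.0.1:65533"
fi
```

This is exactly what Grafana or curl experiences when the Prometheus process has crashed or is bound to the wrong interface.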
Step 1: Diagnose the Pod State
The first step in any Kubernetes-based Prometheus troubleshooting is to check the state of the pod.
Run: kubectl get pods -n monitoring -l app=prometheus
If you see CrashLoopBackOff or Error, the process is dead, which explains the connection refusal.
Next, inspect the pod's history to see why it died:
kubectl describe pod <prometheus-pod-name> -n monitoring
Look at the Last State section. If you see Reason: OOMKilled, the OS kernel terminated Prometheus because it exceeded its container memory limit.
If the pod is Running, check the logs:
kubectl logs <prometheus-pod-name> -n monitoring
Step 2: Fixing OOMKilled (Out of Memory)
Prometheus keeps its most recent data (the TSDB head block) in memory, so its memory footprint scales directly with the number of active time series (cardinality) and the volume of samples ingested per scrape interval. When Prometheus crashes with OOMKilled, it will restart, enter CrashLoopBackOff, and clients will receive connection refused.
Resolution:
- Increase Memory Limits: Temporarily increase the memory requests and limits in your Prometheus Deployment or StatefulSet to get the system stable.
- Analyze Cardinality: Once running, port-forward to the Prometheus UI and navigate to Status -> TSDB Status. Look for metrics with massive cardinality. You may need to drop high-cardinality labels using `metric_relabel_configs`.
- Tweak WAL Settings: Ensure `--storage.tsdb.wal-compression` is enabled to reduce memory pressure during WAL replay.
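As a sketch of the relabeling approach, the fragment below drops a high-cardinality label at scrape time. The job name, target, and label are illustrative placeholders; substitute the offenders identified on the Status -> TSDB Status page:

```yaml
# prometheus.yml (fragment) -- drop a high-cardinality label before ingestion.
scrape_configs:
  - job_name: my-app              # hypothetical job name
    static_configs:
      - targets: ['my-app:8080']  # hypothetical target
    metric_relabel_configs:
      - action: labeldrop
        regex: request_id         # drop this label from all scraped series
```

Dropping a label merges all series that differed only by that label, so make sure the label carries no information you need before removing it.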
Step 3: Fixing 'Permission Denied' on Storage
Another common cause for a crashing Prometheus is an inability to write to its Persistent Volume Claim (PVC).
Error in logs: level=error ts=... caller=main.go:... err="opening storage failed: lock DB directory: open /prometheus/data/lock: permission denied"
Prometheus runs as the user nobody (UID 65534) by default in many Helm charts. If the underlying storage provisioner creates the volume with root ownership, Prometheus cannot write to it.
Resolution:
Implement an initContainer in your pod spec to change the ownership of the volume before Prometheus starts:
initContainers:
- name: volume-permissions
image: busybox:latest
command: ['sh', '-c', 'chown -R 65534:65534 /prometheus']
volumeMounts:
- name: prometheus-data
mountPath: /prometheus
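On clusters where running a root initContainer is undesirable (for example under a restricted Pod Security Standard), the same effect can often be achieved declaratively with a pod-level securityContext, which asks the kubelet to adjust group ownership of the volume at mount time. A minimal sketch:

```yaml
# Pod-level securityContext as an alternative to the chown initContainer.
securityContext:
  runAsUser: 65534    # nobody -- the default user in many Prometheus images
  runAsGroup: 65534
  fsGroup: 65534      # volume files become group-owned by GID 65534 on mount
```

Note that fsGroup handling depends on the volume plugin or CSI driver; some network filesystems ignore it, in which case the initContainer approach remains the fallback.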
Step 4: Resolving Network and Bind Address Issues
If the pod is running perfectly but you still get connection refused, verify the bind address.
By default, Prometheus listens on 0.0.0.0:9090. If a misconfiguration passed --web.listen-address=127.0.0.1:9090, it will only accept connections from inside its own container, causing external services (like an Ingress controller or Grafana) to fail.
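The fix is to bind to all interfaces; the fragment below shows the container args with the default made explicit:

```yaml
# Container args (fragment): bind to all interfaces so the Service,
# Ingress controller, and Grafana can reach the pod from outside
# the container's own network namespace.
args:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--web.listen-address=0.0.0.0:9090'   # not 127.0.0.1:9090
```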
Furthermore, if targets are showing as Down in Prometheus with context deadline exceeded or timeouts, verify your Kubernetes NetworkPolicies. Ensure that Prometheus is allowed egress to the target namespaces and ports, and that the targets allow ingress from the Prometheus namespace.
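As an illustration, a NetworkPolicy like the following allows Prometheus to scrape pods in a target namespace. The namespace name and port here are assumptions; adjust them to your environment (the `kubernetes.io/metadata.name` label is set automatically on namespaces by Kubernetes):

```yaml
# Hypothetical NetworkPolicy: allow ingress from the monitoring namespace
# to all pods in the "apps" namespace on the metrics port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: apps                  # assumed target namespace
spec:
  podSelector: {}                  # applies to all pods in "apps"
  policyTypes: ['Ingress']
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 8080               # assumed metrics port
```

If targets time out rather than refuse connections, also consider raising `scrape_timeout` in prometheus.yml; it must remain less than or equal to the `scrape_interval`.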
Quick Reference: Diagnostic Commands
# 1. Check pod status and restart count
kubectl get pods -n monitoring -l app=prometheus
# 2. Check for OOMKilled reasons in pod history
kubectl describe pod -l app=prometheus -n monitoring | grep -A 5 "Last State:"
# 3. Check logs for permission denied or config errors
kubectl logs -l app=prometheus -n monitoring --tail=100
# 4. Validate prometheus.yml syntax locally before deploying
promtool check config prometheus.yml
# 5. Check Kubernetes endpoints to ensure the Service is routing traffic
kubectl get endpoints -n monitoring prometheus-operated
Error Medic Editorial
Error Medic Editorial is composed of senior SREs and DevOps practitioners dedicated to providing actionable, code-first troubleshooting guides for cloud-native infrastructure.