Error Medic

How to Fix Prometheus Connection Refused, CrashLoopBackOff, and OOMKilled Errors

Fix Prometheus connection refused, CrashLoopBackOff, and OOM errors. SRE guide to diagnosing memory limits, TSDB corruption, permissions, and network timeouts.

Key Takeaways
  • Connection Refused usually means Prometheus is stuck in a CrashLoopBackOff, still recovering its Write-Ahead Log (WAL), or bound to the wrong network interface.
  • OOMKilled (Exit Code 137) is caused by high metric cardinality or insufficient memory limits. Fix by identifying top churn metrics and increasing RAM.
  • CrashLoopBackOff with 'Permission Denied' stems from mismatched Persistent Volume ownership in Kubernetes; solvable via securityContext.fsGroup.
  • TSDB/WAL corruption from ungraceful shutdowns prevents Prometheus from starting; recovery typically means deleting the corrupted WAL segments, at the cost of recent uncompacted data.
Prometheus Crash Fix Approaches Compared
| Method | When to Use | Time | Risk |
| --- | --- | --- | --- |
| Increase memory limits | Pod is OOMKilled (Exit Code 137) due to high cardinality | < 5 mins | Low |
| Set fsGroup in securityContext | Prometheus crashes with 'permission denied' on /prometheus/wal | < 5 mins | Low |
| Delete corrupted WAL | Stuck in a crash loop with 'repair corrupted WAL' errors | 5-10 mins | High (lose ~2h of recent metrics) |
| Change bind address to 0.0.0.0 | Process runs but external connections are refused | < 5 mins | Low |

Understanding the Error

As an SRE or DevOps engineer, seeing alerts fire for Prometheus connection refused or finding your monitoring stack stuck in a CrashLoopBackOff state is a high-stress scenario. When Prometheus goes down, your visibility into the rest of the infrastructure goes dark, blinding you to potential cascading failures.

The "connection refused" error typically manifests when you try to reach the Prometheus UI, API, or when Grafana attempts to query the datastore:

dial tcp 10.42.0.15:9090: connect: connection refused

This error is a symptom, not a root cause. It simply means no process is actively listening and accepting TCP connections on port 9090. To find the root cause, we have to look at why the Prometheus process failed to bind to the port, crashed abruptly, or is hanging during initialization. The most common culprits are Out of Memory (OOM) kills, Write-Ahead Log (WAL) corruption, and permission denied errors on the storage volume.
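A quick first step is to confirm whether the port is actually refusing connections or whether packets are being silently dropped (a firewall or NetworkPolicy drop usually looks like a hang, not an instant refusal). A minimal sketch using bash's built-in /dev/tcp, assuming bash and coreutils `timeout` are available in your debug environment:

```shell
# probe HOST PORT: distinguish an open port from a refused/filtered one.
# A refused connection fails instantly; a firewalled drop runs into the timeout.
probe() {
  local host="$1" port="$2"
  if timeout 2 bash -c "echo > /dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed-or-filtered"
  fi
}

# Example: check the Prometheus port from a pod on the cluster network
probe 127.0.0.1 9090
```

If this reports open but your client still fails, the problem is on the client's network path rather than the Prometheus process itself.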

Scenario 1: Prometheus OOM Killed (Out of Memory)

One of the most frequent reasons Prometheus crashes and subsequently refuses connections is being OOMKilled by the Linux kernel or the Kubernetes container runtime (Exit Code 137). Prometheus is a memory-intensive application because it stores active time series data in memory before periodically flushing it to disk in blocks.
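As a back-of-envelope check, in-memory usage scales roughly with the number of active series. The sketch below assumes ~8KiB per active series, which is a rough working figure of ours (real usage varies with label sizes and churn), not an official Prometheus number:

```shell
# Rough memory estimate: active_series * bytes_per_series.
# Get the live series count from Prometheus itself:
#   curl -s 'localhost:9090/api/v1/query?query=prometheus_tsdb_head_series'
estimate_mem_gib() {
  local series="$1"
  local bytes_per_series=8192   # assumption: ~8KiB per active series
  echo $(( series * bytes_per_series / 1024 / 1024 / 1024 ))
}

estimate_mem_gib 2000000   # 2M active series -> ~15 GiB
```

If the estimate lands near or above your container's memory limit, an OOM kill is the expected outcome, not an anomaly.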

The Symptoms: You will see the pod restart count increasing rapidly. Checking the pod state using kubectl describe pod <prometheus-pod> reveals:

State:          Waiting
Reason:         CrashLoopBackOff
Last State:     Terminated
Reason:         OOMKilled
Exit Code:      137

The Root Cause: Prometheus is consuming more memory than its configured limits. This is almost always caused by "high cardinality"—an explosion of unique label combinations. For example, if a developer mistakenly configures an application to expose a session_id, user_id, or source_ip as a metric label, Prometheus creates a brand new time series in memory for every single user session.

The Fix:

  1. Temporarily Increase Memory: Edit your Deployment, StatefulSet, or Helm chart values to double the memory requests and limits. This is a band-aid to get Prometheus back online so you can query it and find the actual issue.
  2. Identify High Cardinality Metrics: Once Prometheus is responsive, execute this PromQL query to find the worst offenders generating the most time series: topk(10, count by (__name__) ({__name__=~".+"}))
  3. Drop or Relabel: Modify your scrape_configs to drop these high-cardinality labels before ingestion using metric_relabel_configs, or work with the application developers to remove the dynamic labels from their instrumentation.
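As a sketch of step 3, here is a hedged metric_relabel_configs example; the job name, target, and the user_id label are illustrative stand-ins, not values from your config:

```yaml
scrape_configs:
  - job_name: my-app              # hypothetical job name
    static_configs:
      - targets: ['my-app:8080']  # hypothetical target
    metric_relabel_configs:
      # Drop the offending label entirely before ingestion
      - action: labeldrop
        regex: user_id
      # Or drop whole metrics matching a pattern
      - action: drop
        source_labels: [__name__]
        regex: 'session_duration_seconds.*'
```

Note that labeldrop can collapse previously distinct series into duplicates within a single scrape, which Prometheus rejects; if that happens, drop the whole metric or fix the instrumentation instead.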

Scenario 2: Permission Denied on Persistent Volumes

If Prometheus is failing to start entirely and crashing immediately upon initialization, check the container logs (kubectl logs prometheus-pod-0). You might encounter the classic permission denied crash:

level=error ts=2023-10-27T10:00:00.000Z caller=main.go:823 err="opening storage failed: mmap files, file: /prometheus/wal/0000001: permission denied"

The Root Cause: For security best practices, the Prometheus process runs as a specific non-root user (often UID 65534, nobody, or UID 1000). However, the Persistent Volume (PV) dynamically provisioned by your cloud provider (like AWS EBS or GCP Persistent Disk) might be formatted and mounted with root ownership. When the non-root Prometheus process attempts to initialize its Time Series Database (TSDB) or write to the WAL directory, the OS blocks it, resulting in a fatal permission denied error.
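You can confirm this mismatch directly. A small sketch that compares the volume's owner UID with the UID the process runs as (the /prometheus path is illustrative; run it inside the container or a debug pod mounted on the same PVC):

```shell
# check_ownership DIR: compare the directory's owner UID with the current UID.
check_ownership() {
  local dir="$1"
  local owner_uid proc_uid
  owner_uid=$(stat -c '%u' "$dir")   # GNU stat; use `stat -f %u` on BSD/macOS
  proc_uid=$(id -u)
  if [ "$owner_uid" = "$proc_uid" ]; then
    echo "ownership ok"
  else
    echo "UID mismatch: volume owned by $owner_uid, process runs as $proc_uid"
  fi
}

# e.g. inside the container: check_ownership /prometheus
```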

The Fix: In Kubernetes, leverage the securityContext feature to force the volume permissions to match the Prometheus user's Group ID. By setting fsGroup, the kubelet will automatically recursively change the ownership of the volume before starting the container.

securityContext:
  runAsUser: 65534
  runAsGroup: 65534
  fsGroup: 65534

Scenario 3: TSDB Corruption and WAL Replay Timeouts

Sometimes Prometheus is technically running (the pod shows Running), yet queries still time out or get connection refused for 10-30 minutes after a restart.

The Symptoms: Viewing the logs shows that Prometheus is stuck in a prolonged initialization phase replaying the Write-Ahead Log:

level=info ts=... caller=head.go:674 msg="Replaying WAL, this may take a while"

If the underlying node crashed unexpectedly, power was lost, or Prometheus was OOMKilled mid-write, the WAL might be corrupted. In this case, you will see a fatal error loop:

err="opening storage failed: repair corrupted WAL: cannot handle error: open /prometheus/wal/000234: no such file or directory"

The Root Cause: During startup, before binding to port 9090 and accepting queries, Prometheus must read the Write-Ahead Log from disk into memory to reconstruct its current state and avoid data loss. If the WAL is massive (due to slow disk I/O or an extremely high ingestion rate), Prometheus will refuse connections until the replay finishes. If the WAL is corrupted, the replay panics and the container crashes.

The Fix:

  1. Wait it Out: If you just see "Replaying WAL" without errors, let it finish. Do not forcefully kill the pod, or the replay process will have to start completely over.
  2. Fix WAL Corruption: Prometheus attempts to repair a damaged WAL automatically on startup. If it still crash-loops, note which segment file the error names (e.g. /prometheus/wal/000234) and move or delete just that segment (or the checkpoint referencing it) before restarting, rather than wiping the whole directory.
  3. The Nuclear Option: As a last resort, if getting monitoring back online is more critical than the last 1-2 hours of metrics, you can manually delete the corrupted WAL directory. Exec into an alpine debug pod mounted to the same PVC and run rm -rf /prometheus/wal/*. Prometheus will start fresh.
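Before reaching for the nuclear option, archive the WAL so the data is not gone forever if you later want to attempt recovery. A minimal sketch, assuming the data directory is mounted at a path you pass in (adjust to your PVC mount):

```shell
# wal_reset DATA_DIR: archive the WAL to a timestamped tarball, then clear it.
wal_reset() {
  local data_dir="$1"
  local backup="${data_dir}/wal-backup-$(date +%s).tar.gz"
  tar -czf "$backup" -C "$data_dir" wal   # keep evidence for later recovery
  rm -rf "${data_dir}/wal"                # Prometheus recreates it on startup
  mkdir -p "${data_dir}/wal"
  echo "WAL archived to $backup"
}

# e.g. from a debug pod mounted on the PVC: wal_reset /prometheus
```

The tarball lands next to the data directory, so make sure the volume has headroom before running this on a large WAL.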

Scenario 4: Network and Bind Address Issues

If the Prometheus logs show that it started successfully (level=info msg="Server is ready to receive web requests.") but curl still returns connection refused, the issue lies in network routing or process binding.

The Root Cause: Prometheus might be inadvertently configured to bind strictly to localhost (127.0.0.1) instead of all network interfaces (0.0.0.0). Alternatively, a Kubernetes NetworkPolicy, firewall rule, or AWS Security Group is silently dropping or refusing the packets before they even reach the container.

The Fix: Verify your startup arguments in the container spec. Ensure you are using: --web.listen-address="0.0.0.0:9090"

Next, verify your Kubernetes Network Policies. Ensure there is an ingress rule explicitly allowing TCP traffic on port 9090 from your Grafana pods, Ingress controllers, or specific namespaces.
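If a NetworkPolicy turns out to be the culprit, an allow rule along these lines is the usual shape; the Grafana pod label and namespace here are assumptions, so match them to your own deployment:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grafana-to-prometheus
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: grafana   # assumed label; check with `kubectl get pods --show-labels`
      ports:
        - protocol: TCP
          port: 9090
```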

Quick Diagnostic Commands

# Diagnostic script to check Prometheus memory, WAL size, and fix permissions

# 1. Check if the Prometheus pod is OOMKilled
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o json | jq '.items[].status.containerStatuses[] | select(.state.terminated.reason=="OOMKilled")'

# 2. Check the size of the WAL directory (if exec is possible)
kubectl exec -it -n monitoring prometheus-pod-0 -- sh -c 'du -sh /prometheus/wal'

# 3. Emergency manual WAL cleanup (Run via a debug pod mounted to the PVC if Prometheus is crashlooping)
# WARNING: This deletes recent uncompacted metrics data
# kubectl debug -it prometheus-pod-0 --image=busybox --target=prometheus
# rm -rf /prometheus/wal/*

# 4. Patching the Kubernetes StatefulSet to increase memory and fix permissions
#    (the fsGroup "add" op assumes spec.template.spec.securityContext already exists)
kubectl patch statefulset prometheus-stack -n monitoring --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "8Gi"},
  {"op": "add", "path": "/spec/template/spec/securityContext/fsGroup", "value": 65534}
]'

Error Medic Editorial

Error Medic Editorial is a collective of senior DevOps, SREs, and platform engineers dedicated to providing actionable, code-first troubleshooting guides for cloud-native infrastructure.
