Error Medic

How to Fix Prometheus Connection Refused, CrashLoopBackOff, and OOMKilled Errors

Fix Prometheus connection refused, CrashLoopBackOff, and OOM errors. SRE guide to diagnosing memory limits, TSDB corruption, permissions, and network timeouts.

Key Takeaways
  • Connection Refused usually means Prometheus is stuck in a CrashLoopBackOff, still recovering its Write-Ahead Log (WAL), or bound to the wrong network interface.
  • OOMKilled (Exit Code 137) is caused by high metric cardinality or insufficient memory limits. Fix by identifying top churn metrics and increasing RAM.
  • CrashLoopBackOff with 'Permission Denied' stems from mismatched Persistent Volume ownership in Kubernetes; solvable via securityContext.fsGroup.
  • TSDB/WAL corruption from ungraceful shutdowns prevents Prometheus from starting; recovery typically means deleting the corrupted WAL segments, at the cost of recent uncompacted data.
Prometheus Crash Fix Approaches Compared
| Method | When to Use | Time | Risk |
| --- | --- | --- | --- |
| Increase memory limits | Pod is OOMKilled (Exit Code 137) due to high cardinality | < 5 mins | Low |
| Set fsGroup in securityContext | Prometheus crashes with 'permission denied' on /prometheus/wal | < 5 mins | Low |
| Delete corrupted WAL | Stuck in a crash loop with 'repair corrupted WAL' errors | 5-10 mins | High (lose ~2h of recent metrics) |
| Change bind address to 0.0.0.0 | Process runs but external connections are refused | < 5 mins | Low |

Understanding the Error

As an SRE or DevOps engineer, seeing alerts fire for Prometheus connection refused or finding your monitoring stack stuck in a CrashLoopBackOff state is a high-stress scenario. When Prometheus goes down, your visibility into the rest of the infrastructure goes dark, blinding you to potential cascading failures.

The "connection refused" error typically manifests when you try to reach the Prometheus UI, API, or when Grafana attempts to query the datastore:

dial tcp 10.42.0.15:9090: connect: connection refused

This error is a symptom, not a root cause. It simply means no process is actively listening and accepting TCP connections on port 9090. To find the root cause, we have to look at why the Prometheus process failed to bind to the port, crashed abruptly, or is hanging during initialization. The most common culprits are Out of Memory (OOM) kills, Write-Ahead Log (WAL) corruption, and permission denied errors on the storage volume.
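A quick first step is to confirm whether the port is actually refusing connections or whether packets are being silently dropped (a firewall or NetworkPolicy drop usually looks like a hang, not an instant refusal). A minimal sketch using bash's built-in /dev/tcp, assuming bash and coreutils `timeout` are available in your debug environment:

```shell
# probe HOST PORT: distinguish an open port from a refused/filtered one.
# A refused connection fails instantly; a firewalled drop runs into the timeout.
probe() {
  local host="$1" port="$2"
  if timeout 2 bash -c "echo > /dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "open"
  else
    echo "closed-or-filtered"
  fi
}

# Example: check the Prometheus port from a pod on the cluster network
probe 127.0.0.1 9090
```

If this reports open but your client still fails, the problem is on the client's network path rather than the Prometheus process itself.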

Scenario 1: Prometheus OOM Killed (Out of Memory)

One of the most frequent reasons Prometheus crashes and subsequently refuses connections is being OOMKilled by the Linux kernel or the Kubernetes container runtime (Exit Code 137). Prometheus is a memory-intensive application because it stores active time series data in memory before periodically flushing it to disk in blocks.
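As a back-of-envelope check, in-memory usage scales roughly with the number of active series. The sketch below assumes ~8KiB per active series, which is a rough working figure of ours (real usage varies with label sizes and churn), not an official Prometheus number:

```shell
# Rough memory estimate: active_series * bytes_per_series.
# Get the live series count from Prometheus itself:
#   curl -s 'localhost:9090/api/v1/query?query=prometheus_tsdb_head_series'
estimate_mem_gib() {
  local series="$1"
  local bytes_per_series=8192   # assumption: ~8KiB per active series
  echo $(( series * bytes_per_series / 1024 / 1024 / 1024 ))
}

estimate_mem_gib 2000000   # 2M active series -> ~15 GiB
```

If the estimate lands near or above your container's memory limit, an OOM kill is the expected outcome, not an anomaly.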

The Symptoms: You will see the pod restart count increasing rapidly. Checking the pod state using kubectl describe pod <prometheus-pod> reveals:

State:          Waiting
Reason:         CrashLoopBackOff
Last State:     Terminated
Reason:         OOMKilled
Exit Code:      137

The Root Cause: Prometheus is consuming more memory than its configured limits. This is almost always caused by "high cardinality"—an explosion of unique label combinations. For example, if a developer mistakenly configures an application to expose a session_id, user_id, or source_ip as a metric label, Prometheus creates a brand new time series in memory for every single user session.

The Fix:

  1. Temporarily Increase Memory: Edit your Deployment, StatefulSet, or Helm chart values to double the memory requests and limits. This is a band-aid to get Prometheus back online so you can query it and find the actual issue.
  2. Identify High Cardinality Metrics: Once Prometheus is responsive, execute this PromQL query to find the worst offenders generating the most time series: topk(10, count by (__name__) ({__name__=~".+"}))
  3. Drop or Relabel: Modify your scrape_configs to drop these high-cardinality labels before ingestion using metric_relabel_configs, or work with the application developers to remove the dynamic labels from their instrumentation.
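As a sketch of step 3, here is a hedged metric_relabel_configs example; the job name, target, and the user_id label are illustrative stand-ins, not values from your config:

```yaml
scrape_configs:
  - job_name: my-app              # hypothetical job name
    static_configs:
      - targets: ['my-app:8080']  # hypothetical target
    metric_relabel_configs:
      # Drop the offending label entirely before ingestion
      - action: labeldrop
        regex: user_id
      # Or drop whole metrics matching a pattern
      - action: drop
        source_labels: [__name__]
        regex: 'session_duration_seconds.*'
```

Note that labeldrop can collapse previously distinct series into duplicates within a single scrape, which Prometheus rejects; if that happens, drop the whole metric or fix the instrumentation instead.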

Scenario 2: Permission Denied on Persistent Volumes

If Prometheus is failing to start entirely and crashing immediately upon initialization, check the container logs (kubectl logs prometheus-pod-0). You might encounter the classic permission denied crash:

level=error ts=2023-10-27T10:00:00.000Z caller=main.go:823 err="opening storage failed: mmap files, file: /prometheus/wal/0000001: permission denied"

The Root Cause: For security best practices, the Prometheus process runs as a specific non-root user (often UID 65534, nobody, or UID 1000). However, the Persistent Volume (PV) dynamically provisioned by your cloud provider (like AWS EBS or GCP Persistent Disk) might be formatted and mounted with root ownership. When the non-root Prometheus process attempts to initialize its Time Series Database (TSDB) or write to the WAL directory, the OS blocks it, resulting in a fatal permission denied error.
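You can confirm this mismatch directly. A small sketch that compares the volume's owner UID with the UID the process runs as (the /prometheus path is illustrative; run it inside the container or a debug pod mounted on the same PVC):

```shell
# check_ownership DIR: compare the directory's owner UID with the current UID.
check_ownership() {
  local dir="$1"
  local owner_uid proc_uid
  owner_uid=$(stat -c '%u' "$dir")   # GNU stat; use `stat -f %u` on BSD/macOS
  proc_uid=$(id -u)
  if [ "$owner_uid" = "$proc_uid" ]; then
    echo "ownership ok"
  else
    echo "UID mismatch: volume owned by $owner_uid, process runs as $proc_uid"
  fi
}

# e.g. inside the container: check_ownership /prometheus
```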

The Fix: In Kubernetes, leverage the securityContext feature to force the volume permissions to match the Prometheus user's Group ID. By setting fsGroup, the kubelet will automatically recursively change the ownership of the volume before starting the container.

securityContext:
  runAsUser: 65534
  runAsGroup: 65534
  fsGroup: 65534

Scenario 3: TSDB Corruption and WAL Replay Timeouts

Sometimes Prometheus is technically running (the pod shows Running), yet queries still time out or get connection refused for 10-30 minutes after a restart.

The Symptoms: Viewing the logs shows that Prometheus is stuck in a prolonged initialization phase replaying the Write-Ahead Log:

level=info ts=... caller=head.go:674 msg="Replaying WAL, this may take a while"

If the underlying node crashed unexpectedly, power was lost, or Prometheus was OOMKilled mid-write, the WAL might be corrupted. In this case, you will see a fatal error loop:

err="opening storage failed: repair corrupted WAL: cannot handle error: open /prometheus/wal/000234: no such file or directory"

The Root Cause: During startup, before binding to port 9090 and accepting queries, Prometheus must read the Write-Ahead Log from disk into memory to reconstruct its current state and avoid data loss. If the WAL is massive (due to slow disk I/O or an extremely high ingestion rate), Prometheus will refuse connections until the replay finishes. If the WAL is corrupted, the replay panics and the container crashes.

The Fix:

  1. Wait it Out: If you just see "Replaying WAL" without errors, let it finish. Do not forcefully kill the pod, or the replay process will have to start completely over.
  2. Fix WAL Corruption: Prometheus attempts to repair a damaged WAL automatically on startup. If it still crash-loops, note which segment file the error names (e.g. /prometheus/wal/000234) and move or delete just that segment (or the checkpoint referencing it) before restarting, rather than wiping the whole directory.
  3. The Nuclear Option: As a last resort, if getting monitoring back online is more critical than the last 1-2 hours of metrics, you can manually delete the corrupted WAL directory. Exec into an alpine debug pod mounted to the same PVC and run rm -rf /prometheus/wal/*. Prometheus will start fresh.
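Before reaching for the nuclear option, archive the WAL so the data is not gone forever if you later want to attempt recovery. A minimal sketch, assuming the data directory is mounted at a path you pass in (adjust to your PVC mount):

```shell
# wal_reset DATA_DIR: archive the WAL to a timestamped tarball, then clear it.
wal_reset() {
  local data_dir="$1"
  local backup="${data_dir}/wal-backup-$(date +%s).tar.gz"
  tar -czf "$backup" -C "$data_dir" wal   # keep evidence for later recovery
  rm -rf "${data_dir}/wal"                # Prometheus recreates it on startup
  mkdir -p "${data_dir}/wal"
  echo "WAL archived to $backup"
}

# e.g. from a debug pod mounted on the PVC: wal_reset /prometheus
```

The tarball lands next to the data directory, so make sure the volume has headroom before running this on a large WAL.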

Scenario 4: Network and Bind Address Issues

If the Prometheus logs show that it started successfully (level=info msg="Server is ready to receive web requests.") but curl still returns connection refused, the issue lies in network routing or process binding.

The Root Cause: Prometheus might be inadvertently configured to bind strictly to localhost (127.0.0.1) instead of all network interfaces (0.0.0.0). Alternatively, a Kubernetes NetworkPolicy, firewall rule, or AWS Security Group is silently dropping or refusing the packets before they even reach the container.

The Fix: Verify your startup arguments in the container spec. Ensure you are using: --web.listen-address="0.0.0.0:9090"

Next, verify your Kubernetes Network Policies. Ensure there is an ingress rule explicitly allowing TCP traffic on port 9090 from your Grafana pods, Ingress controllers, or specific namespaces.
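If a NetworkPolicy turns out to be the culprit, an allow rule along these lines is the usual shape; the Grafana pod label and namespace here are assumptions, so match them to your own deployment:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-grafana-to-prometheus
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: prometheus
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: grafana   # assumed label; check with `kubectl get pods --show-labels`
      ports:
        - protocol: TCP
          port: 9090
```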

Quick Diagnostic Commands

# Diagnostic script to check Prometheus memory, WAL size, and fix permissions

# 1. Check if the Prometheus pod is OOMKilled
kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o json | jq '.items[].status.containerStatuses[] | select(.state.terminated.reason=="OOMKilled")'

# 2. Check the size of the WAL directory (if exec is possible)
kubectl exec -it -n monitoring prometheus-pod-0 -- sh -c 'du -sh /prometheus/wal'

# 3. Emergency manual WAL cleanup (Run via a debug pod mounted to the PVC if Prometheus is crashlooping)
# WARNING: This deletes recent uncompacted metrics data
# kubectl debug -it prometheus-pod-0 --image=busybox --target=prometheus
# rm -rf /prometheus/wal/*

# 4. Patching the Kubernetes StatefulSet to increase memory and fix permissions
#    (the fsGroup "add" op assumes spec.template.spec.securityContext already exists)
kubectl patch statefulset prometheus-stack -n monitoring --type='json' -p='[
  {"op": "replace", "path": "/spec/template/spec/containers/0/resources/limits/memory", "value": "8Gi"},
  {"op": "add", "path": "/spec/template/spec/securityContext/fsGroup", "value": 65534}
]'

Error Medic Editorial

Error Medic Editorial is a collective of senior DevOps, SREs, and platform engineers dedicated to providing actionable, code-first troubleshooting guides for cloud-native infrastructure.
