Fixing 'Prometheus Connection Refused' and CrashLoopBackOff Errors
Diagnose and resolve Prometheus connection refused, CrashLoopBackOff, and OOMKilled errors. Learn how to fix permission denied and timeout issues in Kubernetes.
- Connection refused errors often stem from Prometheus crashing due to OOMKilled (Out of Memory) or misconfigured bind addresses.
- CrashLoopBackOff is frequently caused by invalid configuration files (prometheus.yml) or incorrect persistent volume permissions (Permission Denied).
- Timeouts usually indicate network policies blocking traffic or target endpoints being overwhelmed.
- Quick Fix: Check pod logs (`kubectl logs`) and describe the pod (`kubectl describe pod`) to immediately identify OOM kills or configuration parsing errors.
| Error Symptom | Common Root Cause | Resolution Strategy | Downtime Risk |
|---|---|---|---|
| Connection Refused | Process crashed or bound to localhost (127.0.0.1) instead of 0.0.0.0 | Update `--web.listen-address` flag or fix underlying crash | Medium |
| OOMKilled / CrashLoopBackOff | Insufficient memory limits or excessive metric cardinality | Increase memory requests/limits in deployment, optimize scrape configs | High |
| Permission Denied | Incorrect chown/chmod on Persistent Volume /data directory | Use initContainer to `chown -R 65534:65534 /prometheus` | Low |
| Target Scrape Timeout | Network latency, overloaded target, or strict NetworkPolicies | Increase `scrape_timeout`, verify NetworkPolicies allow egress | Low |
Understanding the 'Connection Refused' Error in Prometheus
When you encounter a connection refused error while trying to access the Prometheus web UI or when Grafana attempts to query Prometheus, it means the TCP handshake is failing. The operating system is actively rejecting the connection. In the context of Prometheus, especially running in Kubernetes, this almost always means the Prometheus process is either not running (it crashed), or it is running but listening on the wrong network interface.
Often, connection refused is merely a symptom of a deeper issue, such as the pod being stuck in a CrashLoopBackOff state due to OOMKilled (Out of Memory) events or Permission Denied errors on its storage volumes.
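To see what a refused connection looks like at the TCP level, you can attempt a connection to a port with no listener using bash's /dev/tcp pseudo-device (port 65533 is assumed to be closed on this machine):

```shell
# Attempt a TCP connection to a local port with no listener.
# The kernel answers the SYN with a RST, which the client sees as
# "connection refused". Port 65533 is assumed closed here.
if ! (exec 3<>/dev/tcp/127.0.0.1/65533) 2>/dev/null; then
  echo "connection refused: nothing is listening on 127.0.0.1:65533"
fi
```

This is exactly what Grafana or curl experiences when the Prometheus process has crashed or is bound to the wrong interface.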
Step 1: Diagnose the Pod State
The first step in any Kubernetes-based Prometheus troubleshooting is to check the state of the pod.
Run: kubectl get pods -n monitoring -l app=prometheus
If you see CrashLoopBackOff or Error, the process is dead, which explains the connection refusal.
Next, inspect the pod's history to see why it died:
kubectl describe pod <prometheus-pod-name> -n monitoring
Look at the Last State section. If you see Reason: OOMKilled, the OS kernel terminated Prometheus because it exceeded its container memory limit.
If the pod is Running, check the logs:
kubectl logs <prometheus-pod-name> -n monitoring
Step 2: Fixing OOMKilled (Out of Memory)
Prometheus keeps its most recent data (the TSDB head block) in memory, so its memory footprint scales directly with the number of active time series (cardinality) and the volume of samples ingested per scrape interval. When Prometheus crashes with OOMKilled, it will restart, enter CrashLoopBackOff, and clients will receive connection refused.
Resolution:
- Increase Memory Limits: Temporarily increase the memory requests and limits in your Prometheus Deployment or StatefulSet to get the system stable.
- Analyze Cardinality: Once running, port-forward to the Prometheus UI and navigate to Status -> TSDB Status. Look for metrics with massive cardinality. You may need to drop high-cardinality labels using `metric_relabel_configs`.
- Tweak WAL Settings: Ensure `--storage.tsdb.wal-compression` is enabled to reduce memory pressure during WAL replay.
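As a sketch of the relabeling approach, the fragment below drops a high-cardinality label at scrape time. The job name, target, and label are illustrative placeholders; substitute the offenders identified on the Status -> TSDB Status page:

```yaml
# prometheus.yml (fragment) -- drop a high-cardinality label before ingestion.
scrape_configs:
  - job_name: my-app              # hypothetical job name
    static_configs:
      - targets: ['my-app:8080']  # hypothetical target
    metric_relabel_configs:
      - action: labeldrop
        regex: request_id         # drop this label from all scraped series
```

Dropping a label merges all series that differed only by that label, so make sure the label carries no information you need before removing it.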
Step 3: Fixing 'Permission Denied' on Storage
Another common cause for a crashing Prometheus is an inability to write to its Persistent Volume Claim (PVC).
Error in logs: level=error ts=... caller=main.go:... err="opening storage failed: lock DB directory: open /prometheus/data/lock: permission denied"
Prometheus runs as the user nobody (UID 65534) by default in many Helm charts. If the underlying storage provisioner creates the volume with root ownership, Prometheus cannot write to it.
Resolution:
Implement an initContainer in your pod spec to change the ownership of the volume before Prometheus starts:
initContainers:
- name: volume-permissions
image: busybox:latest
command: ['sh', '-c', 'chown -R 65534:65534 /prometheus']
volumeMounts:
- name: prometheus-data
mountPath: /prometheus
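On clusters where running a root initContainer is undesirable (for example under a restricted Pod Security Standard), the same effect can often be achieved declaratively with a pod-level securityContext, which asks the kubelet to adjust group ownership of the volume at mount time. A minimal sketch:

```yaml
# Pod-level securityContext as an alternative to the chown initContainer.
securityContext:
  runAsUser: 65534    # nobody -- the default user in many Prometheus images
  runAsGroup: 65534
  fsGroup: 65534      # volume files become group-owned by GID 65534 on mount
```

Note that fsGroup handling depends on the volume plugin or CSI driver; some network filesystems ignore it, in which case the initContainer approach remains the fallback.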
Step 4: Resolving Network and Bind Address Issues
If the pod is running perfectly but you still get connection refused, verify the bind address.
By default, Prometheus listens on 0.0.0.0:9090. If a misconfiguration passed --web.listen-address=127.0.0.1:9090, it will only accept connections from inside its own container, causing external services (like an Ingress controller or Grafana) to fail.
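The fix is to bind to all interfaces; the fragment below shows the container args with the default made explicit:

```yaml
# Container args (fragment): bind to all interfaces so the Service,
# Ingress controller, and Grafana can reach the pod from outside
# the container's own network namespace.
args:
  - '--config.file=/etc/prometheus/prometheus.yml'
  - '--web.listen-address=0.0.0.0:9090'   # not 127.0.0.1:9090
```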
Furthermore, if targets are showing as Down in Prometheus with context deadline exceeded or timeouts, verify your Kubernetes NetworkPolicies. Ensure that Prometheus is allowed egress to the target namespaces and ports, and that the targets allow ingress from the Prometheus namespace.
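As an illustration, a NetworkPolicy like the following allows Prometheus to scrape pods in a target namespace. The namespace name and port here are assumptions; adjust them to your environment (the `kubernetes.io/metadata.name` label is set automatically on namespaces by Kubernetes):

```yaml
# Hypothetical NetworkPolicy: allow ingress from the monitoring namespace
# to all pods in the "apps" namespace on the metrics port 8080.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scrape
  namespace: apps                  # assumed target namespace
spec:
  podSelector: {}                  # applies to all pods in "apps"
  policyTypes: ['Ingress']
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 8080               # assumed metrics port
```

If targets time out rather than refuse connections, also consider raising `scrape_timeout` in prometheus.yml; it must remain less than or equal to the `scrape_interval`.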
Quick Reference: Diagnostic Commands
# 1. Check pod status and restart count
kubectl get pods -n monitoring -l app=prometheus
# 2. Check for OOMKilled reasons in pod history
kubectl describe pod -l app=prometheus -n monitoring | grep -A 5 "Last State:"
# 3. Check logs for permission denied or config errors
kubectl logs -l app=prometheus -n monitoring --tail=100
# 4. Validate prometheus.yml syntax locally before deploying
promtool check config prometheus.yml
# 5. Check Kubernetes endpoints to ensure the Service is routing traffic
kubectl get endpoints -n monitoring prometheus-operated
Error Medic Editorial
Error Medic Editorial is composed of senior SREs and DevOps practitioners dedicated to providing actionable, code-first troubleshooting guides for cloud-native infrastructure.