Why does Vault say 'connection refused' on port 8200?

This usually means the Vault process has crashed, the service is stopped, or it isn't bound to the expected network interface. Check the `listener` stanza in your `vault.hcl` config to ensure it binds to the correct IP, and verify the process is running using `systemctl status vault`.

How do I fix a Vault timeout during high load?

Vault timeouts often stem from storage backend constraints. Check the disk IOPS on your Raft volumes or CPU utilization on your Consul cluster. If you have an audit device configured (like writing logs to a file) and the disk is full, Vault will block all operations and requests will time out. Free up disk space immediately.

Vault is running but I get 'permission denied' for everything. Why?

If Vault has recently restarted, it is likely in a 'sealed' state. When sealed, Vault rejects almost all read/write operations, typically with a 503 error reporting that Vault is sealed. Run `vault status` to check the seal status and unseal it. If it is already unsealed, verify your token hasn't expired and has the correct policy attached.

What causes Vault to crash randomly with 'out of memory' (OOM)?

Vault caches secrets and tokens in RAM. If you have a high churn rate of dynamic secrets or tokens without proper Time-To-Live (TTL) limits, memory usage will balloon until the Linux kernel terminates the process. Check the kernel ring buffer (`dmesg -T | grep -i oom`) and tune your lease TTLs.

How do I recover a Raft cluster that has lost quorum?

If multiple nodes fail and Raft loses quorum, Vault cannot elect a leader. You must perform a manual Raft recovery by creating a `peers.json` file containing the details of the surviving nodes, placing it in the Raft data directory, and restarting the Vault service to force a new cluster configuration.
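As a rough sketch of the first step, the helper below builds a peers.json body from a list of surviving nodes. The id, address, and non_voter fields follow the format described in HashiCorp's integrated storage recovery documentation; the function name make_peers_json and the node IDs and addresses in the usage comment are hypothetical.

```shell
# Build a peers.json body from "node_id cluster_address" pairs on stdin.
# Lines with fewer than two fields are ignored.
make_peers_json() {
  awk '
    BEGIN { printf "[\n" }
    NF >= 2 {
      if (n++) printf ",\n"          # comma between entries, not after the last
      printf "  {\"id\": \"%s\", \"address\": \"%s\", \"non_voter\": false}", $1, $2
    }
    END { printf "\n]\n" }
  '
}

# Usage (hypothetical node IDs and cluster addresses):
#   printf 'node1 10.0.0.1:8201\nnode2 10.0.0.2:8201\n' | make_peers_json > peers.json
```

Each id should match the node_id configured in that node's storage "raft" stanza, and the address is the cluster address (port 8201 by default), not the API address.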

Troubleshooting HashiCorp Vault Crash: Fix "Connection Refused" and "Permission Denied" Errors

Vault Recovery Approaches Compared
Method	When to Use	Time	Risk
Manual Unseal	After a standard restart, planned maintenance, or minor crash.	5 mins	Low
Auto-unseal (AWS KMS/Transit)	Preventative measure for environments prone to frequent restarts.	1-2 hrs	Medium
Raft Peering Recovery (peers.json)	When the integrated storage backend loses quorum and Vault cannot elect a leader.	30-60 mins	High
TLS Certificate Rotation	When seeing 'remote error: tls: bad certificate' or other TLS handshake failures caused by an expired or mismatched certificate.	15 mins	Medium

Understanding the Error: Why Does Vault Crash?

HashiCorp Vault is designed to fail securely. When Vault encounters a critical error—whether it's a storage backend disruption, an out-of-memory (OOM) event, or a network partition—it prioritizes protecting the secrets it holds. This means Vault will often shut down or seal itself rather than operate in an unknown or compromised state. When this happens, DevOps engineers are typically greeted with a barrage of alerts indicating that Vault is not working.

The most common symptoms of a Vault crash or misconfiguration include:

Vault Connection Refused: The Vault process is not running, or it is not binding to the expected network interface.
Vault Timeout: The Vault API is reachable, but the backend storage is too slow to respond, so requests exceed their context deadline.
Vault Permission Denied: Vault is reachable, but the client lacks the necessary policy permissions, the token has expired, or Vault is currently in a sealed state (which blocks all data reads).

Let's break down each of these scenarios, diagnose the root causes, and apply the appropriate fixes.

Symptom 1: Vault Connection Refused

Suppose you run a command like vault status and receive the following error:

Error checking seal status: Get "https://127.0.0.1:8200/v1/sys/seal-status": dial tcp 127.0.0.1:8200: connect: connection refused

This means the OS network stack actively rejected the connection on port 8200. In most cases, no service is listening on that port at that IP address.

Step 1: Diagnose the Process

First, verify if the Vault process is actually running. On systemd-based Linux distributions, run:

systemctl status vault

If the service is dead or inactive, check the systemd logs to see why it crashed:

journalctl -u vault --no-pager | tail -n 100

Look for the following common crash indicators in the logs:

Out of Memory (OOM): kernel: Out of memory: Killed process 12345 (vault). Vault caches data in RAM. If you generate millions of dynamic secrets with long Time-To-Live (TTL) settings, Vault will consume all available memory and the Linux OOM Killer will terminate it.
Storage Backend Errors: core: failed to acquire lock: .... If Vault cannot communicate with its backend (Consul, etcd, PostgreSQL, or its internal Raft storage), it will panic and shut down to prevent split-brain scenarios or data corruption.
Configuration Parsing Errors: Error parsing /etc/vault.d/vault.hcl. A recent configuration change may have introduced a syntax error.
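The three indicators above can be scanned for in one pass. This is a minimal sketch that matches only the message fragments quoted in this section; the function name classify_vault_logs and the category labels are made up for illustration.

```shell
# Read log lines on stdin and print which crash categories they mention.
classify_vault_logs() {
  awk '
    /Out of memory: Killed process .*\(vault\)/ { print "oom" }
    /failed to acquire lock/                     { print "storage" }
    /Error parsing/                              { print "config" }
  ' | sort -u
}

# Usage:
#   journalctl -u vault --no-pager | tail -n 100 | classify_vault_logs
#   journalctl -k --no-pager | classify_vault_logs   # kernel messages, for OOM kills
```

OOM kills are logged by the kernel, so they may show up only in the kernel log (journalctl -k or dmesg) rather than in the unit's own log.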

Step 2: Diagnose Network Bindings

If systemctl status vault shows the service is running but you still get "connection refused", the issue is likely your listener configuration or a firewall.

Check your vault.hcl file. Ensure the listener block is binding to the correct interface:

listener "tcp" {
  address       = "0.0.0.0:8200"
  tls_cert_file = "/opt/vault/tls/tls.crt"
  tls_key_file  = "/opt/vault/tls/tls.key"
}

If address is set to 127.0.0.1:8200, Vault will only accept connections from localhost. External clients hitting your server's public or private IP will get a "connection refused".
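To confirm what the running process is actually bound to, you can inspect the listening sockets. The sketch below assumes the column layout printed by ss -ltn (State first, local address fourth); the function name listener_scope and its three output labels are hypothetical.

```shell
# Read `ss -ltn` output on stdin and report how the given port is bound.
listener_scope() {
  awk -v port="${1:-8200}" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")                 # port is the last ":" field
      if (parts[n] != port) next
      host = substr($4, 1, length($4) - length(parts[n]) - 1)
      if (host == "127.0.0.1" || host == "[::1]") loopback = 1
      else other = 1
    }
    END {
      if (other)         print "external"       # 0.0.0.0, *, [::], or a specific IP
      else if (loopback) print "loopback-only"
      else               print "not-listening"
    }
  '
}

# Usage:
#   ss -ltn | listener_scope 8200
```

A result of loopback-only matches the 127.0.0.1:8200 case described above, while not-listening points back to the process checks in Step 1.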

Symptom 2: Vault Not Working (Vault Timeout)

Sometimes, Vault doesn't outright refuse the connection, but commands simply hang and eventually fail with a timeout error:

Error reading secret/data/myapp: Get "https://vault.example.com:8200/v1/secret/data/myapp": net/http: request canceled (Client.Timeout exceeded while awaiting headers)

Step 1: Diagnose Storage Bottlenecks

Vault is highly dependent on its storage backend. If you are using Integrated Storage (Raft), Vault relies heavily on disk IOPS. If you are using Consul, it relies on network latency and Consul's CPU performance.

If Vault is timing out under load, a common cause is that the underlying storage is taking too long to persist writes or retrieve the secret.

Check the disk I/O on the Vault server:

iostat -x 1 10

Look at the %util column. If your disk is at or near 100% utilization, Vault cannot write its Raft logs fast enough. Consider moving the Raft data to faster SSD storage (e.g., AWS io2 EBS volumes).
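Reading %util by eye across ten samples is tedious, so a small filter can help. This sketch assumes the iostat -x layout in which each device table starts with a header line beginning with "Device" and containing a %util column; the function name busy_disks and the 90% default threshold are arbitrary choices for illustration.

```shell
# Read `iostat -x` output on stdin and print devices whose %util is at or
# above the threshold (first argument, default 90).
busy_disks() {
  awk -v limit="${1:-90}" '
    $1 ~ /^Device/ {                       # header row: locate the %util column
      for (i = 1; i <= NF; i++) if ($i == "%util") col = i
      next
    }
    col && NF >= col && $col + 0 >= limit { print $1, $col }
  '
}

# Usage:
#   iostat -x 1 10 | busy_disks 90
```

A device is printed once per sample in which it crosses the threshold, so a disk that shows up in most of the ten samples is consistently saturated.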

Step 2: Diagnose Audit Device Blocking

Vault operates on a fail-secure model regarding audit devices. If you have an audit device configured (like writing logs to a file or syslog) and Vault cannot write to that device (e.g., the disk is 100% full, or syslog is unresponsive), Vault will block all operations and time out.

Check your server's disk space:

df -h

If the partition holding /var/log/vault/audit.log is full, Vault will stop responding to API requests to ensure no untracked operations occur.
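For monitoring, the same check can be reduced to a single number. The sketch below assumes the POSIX output format of df -P (one header line, then one line per filesystem with the usage percentage in the fifth column); the function name audit_disk_usage is hypothetical, and /var/log/vault is the audit log directory used as the example in this section.

```shell
# Read `df -P <path>` output on stdin and print the usage percentage
# of the first listed filesystem as a bare number.
audit_disk_usage() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Usage: warn when the audit log partition is at least 90% full.
#   usage=$(df -P /var/log/vault | audit_disk_usage)
#   [ "$usage" -ge 90 ] && echo "audit log partition is ${usage}% full"
```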

Symptom 3: Vault Permission Denied

This is perhaps the most confusing error for developers. They have a token, they hit the API, and they get:

Error reading secret/data/app: Error making API request. Code: 403. Errors: * 1 error occurred: * permission denied

Cause 1: Vault is Sealed

When Vault restarts (after a crash, server reboot, or update), it starts in a sealed state. It knows where the encrypted data is, but it does not have the root key (formerly called the master key) needed to decrypt it. While Vault is sealed, almost all API endpoints fail, typically with 503 Service Unavailable, so rule this out before debugging tokens and policies.

Run vault status. Look for the Sealed field. If it says true, you must unseal the Vault. Gather your threshold of key holders and run vault operator unseal until the threshold is met.
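In scripts, the seal state can be read from the exit code of vault status instead of parsing its output: the CLI documents 0 for unsealed, 2 for sealed, and 1 for an error. The wrapper name describe_vault_status is hypothetical.

```shell
# Map the exit code of `vault status` to a short label.
describe_vault_status() {
  case "$1" in
    0) echo "unsealed" ;;
    2) echo "sealed" ;;
    *) echo "error" ;;     # 1 = error talking to Vault; anything else is unexpected
  esac
}

# Usage:
#   vault status >/dev/null 2>&1
#   describe_vault_status "$?"
```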

Cause 2: Expired Token

Every authentication token in Vault has a Time-To-Live (TTL). Once the TTL expires, the token is revoked. If an application is hardcoded with a token that has expired, it will receive permission denied errors.

To check the token's validity, run (if you still have access via a different, valid token):

vault token lookup <token_string>

If it returns an error, the token is invalid or expired. Applications should renew their tokens periodically before they expire using the POST /v1/auth/token/renew-self endpoint.
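One possible shape for that renewal logic is sketched below. It assumes the ttl and creation_ttl values (in seconds) reported by vault token lookup, and that VAULT_ADDR and VAULT_TOKEN are set in the environment. The function names and the "renew once less than a third of the lifetime remains" rule are illustrative choices, not a Vault recommendation.

```shell
# Succeed (exit 0) when the remaining TTL has dropped below one third
# of the token's original lifetime.
#   $1 = remaining ttl in seconds, $2 = creation_ttl in seconds
should_renew() {
  [ "$1" -lt $(( $2 / 3 )) ]
}

# Ask Vault to renew the calling token via the renew-self endpoint.
renew_self() {
  curl --silent --fail --request POST \
    --header "X-Vault-Token: ${VAULT_TOKEN}" \
    "${VAULT_ADDR}/v1/auth/token/renew-self"
}

# Usage (values taken from `vault token lookup`):
#   should_renew 900 3600 && renew_self
```

Renewal only works while the token is still valid and renewable, so the check has to run well before the TTL reaches zero.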

Cause 3: Policy Misconfiguration

If Vault is unsealed and the token is active, the most likely remaining cause is that the token's attached policies do not grant access to the requested path. Vault uses a default-deny architecture.

If your application needs to read secret/data/database/credentials, the token must have a policy attached that explicitly grants read capabilities to that exact path.

To debug this, check the policies attached to the token:

vault token lookup

Look at the policies array. Then, read those policies:

vault policy read <policy_name>

Ensure the path in the policy matches the API endpoint. Remember that for KV Version 2 secret engines, the API path requires /data/ to be inserted (e.g., the CLI path secret/database becomes the API path secret/data/database), and policies must be written against the API path. This is a very common cause of "permission denied" for new Vault users.
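The KV Version 2 path mapping is mechanical, so it can be illustrated with a small helper. The function name kv2_api_path is hypothetical; it only performs the string rewrite described above and does not contact Vault.

```shell
# Turn a KV v2 mount and a CLI-style key path into the API path
# that policies must reference.
kv2_api_path() {
  mount="${1%/}"   # drop a trailing slash from the mount
  key="${2#/}"     # drop a leading slash from the key path
  printf '%s/data/%s\n' "$mount" "$key"
}

# Usage:
#   kv2_api_path secret database    # prints secret/data/database
```

A policy for the example in this section would therefore need a path entry for secret/data/database/credentials rather than secret/database/credentials.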

Comprehensive Recovery Strategy

When faced with a total Vault outage resulting from a crash, follow this strict recovery path:

Halt Traffic: Remove the failing Vault node from the load balancer to prevent clients from experiencing hanging connections.
Inspect Logs: Tail the system logs (journalctl -u vault) to identify if the crash was due to OOM, configuration, or backend failure.
Fix the Root Cause:
- If OOM, increase server RAM and tune Vault's cache_size in vault.hcl.
- If backend failure (Raft), ensure all nodes in the cluster are online and can communicate over the cluster port (usually 8201).
Restart the Service: sudo systemctl restart vault
Unseal the Node: If not using auto-unseal, manually provide the unseal keys.
Verify Quorum: Run vault operator raft list-peers to ensure the node has rejoined the cluster and a leader is elected.
Restore Traffic: Add the node back to the load balancer.
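The quorum check in the list above can be scripted before traffic is restored. This sketch assumes the default table output of vault operator raft list-peers, with columns Node, Address, State, and Voter; the function name raft_leader and the node names in the test data are hypothetical.

```shell
# Read `vault operator raft list-peers` output on stdin and print the
# node ID of the current leader (prints nothing if there is none).
raft_leader() {
  awk '$3 == "leader" { print $1 }'
}

# Usage: only proceed when a leader has been elected.
#   leader=$(vault operator raft list-peers | raft_leader)
#   [ -n "$leader" ] && echo "leader is $leader"
```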