Error Medic

Troubleshooting Elasticsearch Timeout, OOM, and Connection Errors

Comprehensive SRE guide to diagnosing and fixing Elasticsearch timeout, connection refused, out of memory (OOM), disk full, and permission denied errors.

Key Takeaways
  • Timeouts are often caused by unoptimized queries causing long Garbage Collection (GC) pauses or thread pool exhaustion.
  • Out of Memory (OOM) crashes happen when the JVM heap is undersized, fielddata explodes, or circuit breakers fail to trip in time.
  • Disk full errors trigger flood-stage watermarks, placing indices into a strict read-only mode that must be manually reversed after clearing space.
  • Connection refused errors stem from network interface misconfigurations (network.host), firewall rules, or a completely crashed JVM process.
  • Permission denied errors typically occur after upgrades or restoring snapshots due to incorrect file ownership or missing Keystore RBAC privileges.
Fix Approaches Compared
Error Type                       | Primary Diagnostic            | Immediate Mitigation                           | Long-term Fix
elasticsearch timeout            | GET _nodes/stats/thread_pool  | Cancel long-running tasks via the _tasks API   | Optimize query structure, implement routing, scale data nodes
elasticsearch out of memory      | Check JVM heap in _cat/nodes  | Clear the fielddata cache (POST /_cache/clear) | Set heap to at most 50% of RAM (max ~31GB), use doc_values
elasticsearch disk full          | GET _cat/allocation?v         | Delete old indices or clear system logs        | Adjust high/flood watermarks, configure ILM policies
elasticsearch connection refused | netstat -tulpn | grep 9200    | Restart process, check elasticsearch.yml       | Bind network.host correctly, configure TLS properly

Understanding the Error

Elasticsearch is a highly distributed, resource-intensive search and analytics engine. When it operates normally, it is lightning fast. However, when resource limits are reached, network topologies change, or unoptimized queries are executed, the cluster can rapidly destabilize. The most common symptoms of cluster distress manifest as a variety of client-side and server-side errors, most notably the elasticsearch timeout.

Timeouts are rarely an isolated network issue; they are usually the canary in the coal mine indicating severe underlying resource contention. This guide, written from the perspective of a Site Reliability Engineer (SRE), will walk you through diagnosing and permanently resolving timeouts, as well as the closely related issues of elasticsearch connection refused, elasticsearch disk full, elasticsearch out of memory, and elasticsearch permission denied.

Scenario 1: Diagnosing and Fixing elasticsearch timeout

The most common error developers see is an ElasticsearchTimeoutException or a java.net.SocketTimeoutException on the client side. This happens when the Elasticsearch node takes too long to respond to a REST API request.

Step 1: Diagnose the Timeout Root Cause

Timeouts usually occur because the node's thread pools are exhausted, or the JVM is stuck in a "Stop-the-World" Garbage Collection (GC) pause. To determine which is happening, check the thread pools:

curl -X GET "localhost:9200/_cat/thread_pool/search?v&h=id,name,active,rejected,completed"

If you see the rejected count steadily increasing, your cluster is overwhelmed by too many concurrent requests. If the thread pools look fine, check the garbage collection metrics in the node stats.
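
If the thread pools look healthy, the node stats expose the GC picture directly. A quick check (standard _nodes/stats endpoint; the log path assumes the default package layout):

```shell
# Per-node JVM stats: look under jvm.gc.collectors for old-generation
# collection counts and cumulative pause time
curl -s -X GET "localhost:9200/_nodes/stats/jvm?pretty"

# Long pauses also surface in the server log via the [gc] logger
grep -h "\[gc\]" /var/log/elasticsearch/*.log | tail -n 20
```

A node stuck in "Stop-the-World" pauses shows steadily climbing old-generation collection time alongside stalled thread-pool throughput.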

Step 2: Mitigate the Timeout

If a massive, unoptimized query is hogging resources, you can find and cancel it using the Task Management API:

# Find long-running tasks
curl -X GET "localhost:9200/_tasks?detailed=true&actions=*search*"

# Cancel a specific rogue task
curl -X POST "localhost:9200/_tasks/<node_id>:12345/_cancel"

For a permanent fix, ensure your queries are optimized. Avoid heavy aggregations on large text fields, utilize filter context instead of query context for exact matches (to leverage caching), and ensure your cluster has enough data nodes to distribute the shard load.
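
As a sketch of the filter-context advice above (the index and field names are hypothetical), moving exact-match and range clauses under bool.filter lets Elasticsearch cache them and skip relevance scoring:

```shell
# Clauses in filter context are cacheable and carry no scoring cost
curl -s -X GET "localhost:9200/my-index/_search" \
  -H 'Content-Type: application/json' \
  -d '{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "status": "active" } },
        { "range": { "timestamp": { "gte": "now-1h" } } }
      ]
    }
  }
}'
```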

Scenario 2: Handling elasticsearch out of memory

An elasticsearch out of memory (OOM) error is catastrophic. The JVM crashes, the node drops out of the cluster, and the cluster state turns yellow or red. The Elasticsearch logs (/var/log/elasticsearch/<cluster-name>.log) will definitively show java.lang.OutOfMemoryError: Java heap space.

Step 1: Check JVM Heap Pressure

Before a node crashes, you will typically see high heap usage. You can monitor this actively:

curl -X GET "localhost:9200/_cat/nodes?v=true&h=name,heap.percent,ram.percent,cpu"

If heap.percent is consistently above 85-90%, you are at severe risk of an OOM crash.

Step 2: Fix and Prevent OOM
  1. Configure the Heap Size Correctly: Set the initial and maximum heap (-Xms/-Xmx) to the same value, no more than 50% of the available physical RAM, and never more than roughly 31GB (so Java can use compressed Object Pointers, or compressed oops). Edit /etc/elasticsearch/jvm.options:
    -Xms16g
    -Xmx16g
    
  2. Clear the Cache: If the node is struggling but hasn't crashed, clear the fielddata cache to buy time:
    curl -X POST "localhost:9200/_cache/clear?fielddata=true"
    
  3. Use Doc Values: Ensure that your mappings use doc_values for aggregations and sorting, which rely on the OS filesystem cache rather than the JVM heap.
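
To illustrate point 3 (the index and field names are hypothetical): explicit keyword and date mappings keep aggregations and sorting on doc_values, which live in the OS filesystem cache rather than on the JVM heap:

```shell
# keyword and date fields enable doc_values by default; aggregating on
# them avoids heap-resident fielddata entirely
curl -s -X PUT "localhost:9200/my-index" \
  -H 'Content-Type: application/json' \
  -d '{
  "mappings": {
    "properties": {
      "status":    { "type": "keyword" },
      "timestamp": { "type": "date" }
    }
  }
}'
```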

Scenario 3: Recovering from elasticsearch disk full

Elasticsearch has built-in safety mechanisms to prevent disks from filling up completely, which would corrupt indices. It uses "watermarks". When the flood-stage watermark (default 95%) is hit, you will experience an elasticsearch disk full scenario. The exact error usually looks like:

ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)]]

Step 1: Free Up Disk Space

The immediate action is to free up disk space on the affected nodes. Check allocation and disk usage:

curl -X GET "localhost:9200/_cat/allocation?v"

Delete older, unnecessary indices, or clear out standard system logs (/var/log/messages, old rotated logs) to drop the disk usage below the high watermark (default 90%).
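
A minimal sketch, assuming dated log indices (the index name is hypothetical): list indices oldest-first to pick deletion candidates, then delete:

```shell
# List indices sorted by creation date, oldest first
curl -s "localhost:9200/_cat/indices?v&h=index,store.size,creation.date.string&s=creation.date"

# Delete a hypothetical old index to drop usage below the high watermark
curl -s -X DELETE "localhost:9200/logs-2023.01.01"
```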

Step 2: Remove the Read-Only Block

Once space is cleared, remove the read-only block. On Elasticsearch 7.4 and later the block is released automatically when disk usage drops back below the high watermark; on older versions (and whenever the block lingers) you must remove it manually via the settings API:

curl -X PUT "localhost:9200/_all/_settings" \
  -H 'Content-Type: application/json' \
  -d '{
  "index.blocks.read_only_allow_delete": null
}'

To prevent this long-term, implement Index Lifecycle Management (ILM) policies to automatically roll over and delete older indices.
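
A hedged example of such a policy (the policy name logs-cleanup and the thresholds are illustrative; max_primary_shard_size requires Elasticsearch 7.13+):

```shell
# Roll over hot indices at 50GB or 30 days; delete them after 90 days
curl -s -X PUT "localhost:9200/_ilm/policy/logs-cleanup" \
  -H 'Content-Type: application/json' \
  -d '{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "30d" }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}'
```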

Scenario 4: Fixing elasticsearch connection refused

If you receive an elasticsearch connection refused error, it means the client cannot establish a TCP connection to the Elasticsearch port (default 9200 for HTTP, 9300 for transport).

Step 1: Verify the Process is Running

First, check if the JVM process is even running or if it crashed (perhaps due to an OOM error):

systemctl status elasticsearch

If it is running, check if it is listening on the correct interfaces:

sudo netstat -tulpn | grep 9200

Step 2: Fix Network Binding

By default, Elasticsearch binds only to localhost (127.0.0.1) for security reasons. If you are trying to connect from an external application server, the connection will be refused. Edit /etc/elasticsearch/elasticsearch.yml:

# Set to a specific IP, or 0.0.0.0 for all interfaces
network.host: 0.0.0.0
# If binding to a non-loopback address, you must configure discovery
discovery.seed_hosts: ["host1", "host2"]
cluster.initial_master_nodes: ["node-1", "node-2"]

Also, verify that firewalld, iptables, or AWS Security Groups are allowing inbound traffic on port 9200.
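
From the application server's side, a bare TCP probe helps distinguish "refused" (nothing listening, or the connection actively rejected) from "filtered" (a firewall silently dropping packets). A sketch using bash's built-in /dev/tcp (replace es-node.internal with your actual host):

```shell
# Succeeds only if something accepts the TCP handshake on port 9200
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/es-node.internal/9200' \
  && echo "port 9200 open" \
  || echo "port 9200 refused or filtered"
```

An immediate failure usually means connection refused (wrong bind address or a dead process); a full three-second hang before failing points at a firewall dropping packets.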

Scenario 5: Resolving elasticsearch permission denied

An elasticsearch permission denied error generally happens during startup, plugin installation, or snapshot/restore operations. The Elasticsearch service runs as the elasticsearch user, and it must have strict ownership over its data, log, and configuration directories.

Step 1: Fix File Ownership

If Elasticsearch fails to start and logs show AccessDeniedException, fix the ownership of the critical directories:

sudo chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
sudo chown -R elasticsearch:elasticsearch /var/log/elasticsearch
sudo chown -R elasticsearch:elasticsearch /etc/elasticsearch

Step 2: Keystore and Snapshot Permissions

If the permission denied error occurs when accessing the keystore (e.g., for AWS S3 repository plugins), ensure the keystore has the correct permissions:

sudo chmod 660 /etc/elasticsearch/elasticsearch.keystore

For snapshot repositories using a shared network file system (NFS), ensure the NFS mount allows the elasticsearch user (UID/GID) to write to the mounted directory across all nodes in the cluster.
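
A quick way to verify this (the mount point /mnt/es-snapshots is hypothetical; run on every node): compare the numeric UID/GID of the elasticsearch user against the mount's owner, since NFS matches permissions by number, not by name:

```shell
# UID and GID of the service account on this node
id elasticsearch

# Numeric owner of the snapshot mount; must match the UID above on all nodes
ls -ldn /mnt/es-snapshots
```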

Conclusion

Maintaining a healthy Elasticsearch cluster requires vigilant monitoring of resources. Timeouts and OOMs are symptoms of memory and CPU exhaustion; connection drops and disk full errors point to infrastructure limits; and permission errors are configuration oversights. By using the _cat APIs, carefully managing JVM heap, and establishing robust ILM policies, you can ensure your cluster remains resilient under heavy analytical and search workloads.

Quick Diagnostic Script

#!/bin/bash
# Elasticsearch SRE Diagnostic Script
# Run this on the Elasticsearch node to quickly triage timeouts and cluster health

ES_URL="http://localhost:9200"

echo "=== 1. Checking Process Status ==="
systemctl status elasticsearch --no-pager | grep Active

echo -e "\n=== 2. Cluster Health ==="
curl -s -X GET "$ES_URL/_cluster/health?pretty"

echo -e "\n=== 3. Node Resource Usage (Heap, CPU, Disk) ==="
curl -s -X GET "$ES_URL/_cat/nodes?v=true&h=name,cpu,ram.percent,heap.percent,disk.used_percent"

echo -e "\n=== 4. Thread Pool Rejections (Leading to Timeouts) ==="
curl -s -X GET "$ES_URL/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected"

echo -e "\n=== 5. Pending Tasks ==="
curl -s -X GET "$ES_URL/_cat/pending_tasks?v"

Error Medic Editorial

Error Medic Editorial is composed of Senior Site Reliability Engineers and DevOps architects dedicated to breaking down complex database and infrastructure failures into actionable, verified solutions.
