Error Medic

Fixing Elasticsearch Timeout and Connection Refused Errors

Comprehensive guide to resolving Elasticsearch timeout, connection refused, OOM, and disk full errors. Learn root causes, diagnostic commands, and permanent fixes.

Key Takeaways
  • High heap usage (OOM) or heavy garbage collection pauses often trigger timeouts.
  • A full disk (watermark exceeded) will cause nodes to block writes and potentially drop connections.
  • Misconfigured thread pools or network firewalls lead to 'connection refused' and dropped requests.
  • Check cluster health, logs, and resource metrics before adjusting timeouts or JVM settings.
Fix Approaches Compared
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Increase JVM Heap | OOM errors, frequent long GC pauses | 10 mins | Low (if RAM available) |
| Clear Disk Space / Add Nodes | Disk watermarks exceeded (read-only indices) | 30 mins | Low |
| Adjust Thread Pools | High bulk rejection rates, queued tasks | 15 mins | Medium (can overload node) |
| Increase Timeout Settings | Temporary network latency, heavy queries | 5 mins | Low (band-aid solution) |

Understanding Elasticsearch Timeouts and Connection Errors

Elasticsearch is a distributed system that relies heavily on network communication between nodes and on fast disk and memory access. When you encounter an elasticsearch timeout or elasticsearch connection refused error, it is rarely a simple network blip. More often, these errors are secondary symptoms of an underlying problem: resource exhaustion such as elasticsearch out of memory (OOM) or elasticsearch disk full, or a misconfiguration such as elasticsearch permission denied.

Typical error messages look like:

  • ElasticsearchTimeoutException: Timeout connecting to [node-1]
  • Connection refused: no further information
  • java.lang.OutOfMemoryError: Java heap space
  • ClusterBlockException: index [my-index] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
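These messages map fairly directly onto the root causes covered below. As a rough illustration, a triage script might classify them like this (this helper and its labels are hypothetical, not part of any Elasticsearch client):

```python
# Hypothetical triage helper: maps a raw error message to the likely
# root cause discussed in this guide. Not part of any official client.
def classify_es_error(message: str) -> str:
    m = message.lower()
    if "outofmemoryerror" in m or "java heap space" in m:
        return "jvm-heap-exhaustion"
    if "read-only / allow delete" in m or "forbidden/12" in m:
        return "disk-watermark-flood-stage"
    if "connection refused" in m:
        return "process-down-or-network"
    if "timeout" in m:
        return "timeout-check-gc-and-load"
    return "unknown"
```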

In this comprehensive guide, we will break down the root causes of these errors, how to accurately diagnose them using Elasticsearch's API and system tools, and how to apply permanent fixes.

Root Cause 1: JVM Heap Exhaustion (Out of Memory)

Elasticsearch runs on the Java Virtual Machine (JVM). If the JVM heap is too small for your workload, Elasticsearch will spend an excessive amount of time performing Garbage Collection (GC). During a 'stop-the-world' GC pause, the node cannot respond to ping requests from the master node or client requests. If the pause lasts longer than the configured timeout (default is usually 30 seconds), the node drops out of the cluster, or the client receives an elasticsearch timeout.

Eventually, if memory cannot be reclaimed, you will see an elasticsearch out of memory error in the logs, and the node will crash.
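One quick way to quantify GC pressure is to sample the old-generation collection time twice and compute what fraction of the wall-clock window was spent collecting. A minimal sketch (the 10% threshold is an illustrative assumption, not an official limit):

```python
# Estimate the share of a sampling window spent in old-gen GC, from two
# snapshots of jvm.gc.collectors.old.collection_time_in_millis
# (as reported by _nodes/stats/jvm).
def gc_overhead_percent(old_gc_ms_start: int, old_gc_ms_end: int, window_ms: int) -> float:
    return 100.0 * (old_gc_ms_end - old_gc_ms_start) / window_ms

overhead = gc_overhead_percent(12_000, 21_000, 60_000)  # 9 s of GC in a 60 s window
gc_is_unhealthy = overhead > 10.0  # illustrative threshold
```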

Root Cause 2: Disk Watermark Exceeded (Disk Full)

Elasticsearch monitors the disk usage of every node in the cluster against three watermark thresholds: low (default 85%), high (default 90%), and flood-stage (default 95%). When disk usage reaches the high watermark, Elasticsearch attempts to relocate shards away from the node. If usage hits the flood-stage watermark, Elasticsearch enforces a read-only block on every index that has one or more shards allocated on the node, to prevent the disk from filling up completely (elasticsearch disk full).

When indices are blocked, write requests will fail or hang, leading to client-side timeouts.
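The threshold logic can be sketched as follows, using the default watermarks above; real clusters may override them via the cluster.routing.allocation.disk.watermark.* settings:

```python
# Classify a node's disk usage against the default watermark thresholds
# (low 85%, high 90%, flood-stage 95%). Defaults may be overridden in
# cluster settings, so treat these numbers as illustrative.
def watermark_state(disk_percent: float, low=85, high=90, flood=95) -> str:
    if disk_percent >= flood:
        return "flood-stage: indices forced read-only"
    if disk_percent >= high:
        return "high: shards relocated away from node"
    if disk_percent >= low:
        return "low: no new shards allocated to node"
    return "ok"
```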

Root Cause 3: Thread Pool Rejections

Elasticsearch uses various thread pools (e.g., search, write, get) to manage concurrent operations. If the number of incoming requests exceeds the number of threads available plus the queue size, Elasticsearch will reject the requests. The client will receive a 429 Too Many Requests error, which some client libraries might interpret or surface as a connection drop or timeout if not handled correctly.

Root Cause 4: Network and Permissions

If you receive an elasticsearch connection refused immediately upon trying to connect, the Elasticsearch process might be down, listening on the wrong interface (e.g., localhost instead of the public IP), or a firewall is blocking the port (default 9200 for HTTP, 9300 for transport). Similarly, elasticsearch permission denied can occur if the user running the Elasticsearch process does not have read/write access to the data or log directories, causing the node to fail during startup.

Step 1: Diagnose the Issue

Before making any configuration changes, you must identify the exact bottleneck.

1. Check Cluster Health and Logs

The first step is always to check the cluster health and the Elasticsearch logs. The logs are usually located in /var/log/elasticsearch/<cluster-name>.log on Linux.

Run the following command to get a high-level overview of the cluster:

curl -X GET "localhost:9200/_cluster/health?pretty"

If the status is yellow, replica shards are unassigned; if it is red, at least one primary shard is unassigned. If the API itself times out, the node is likely overwhelmed.
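If you script your health checks, the response is plain JSON and easy to summarize. A minimal sketch with an illustrative payload (not from a live cluster):

```python
import json

# Sample _cluster/health response (illustrative, not from a live cluster).
health_json = '''{"cluster_name": "prod", "status": "yellow",
  "number_of_nodes": 3, "unassigned_shards": 4}'''

health = json.loads(health_json)
summary = (f"{health['cluster_name']}: {health['status']}, "
           f"{health['unassigned_shards']} unassigned shard(s)")
needs_attention = health["status"] in ("yellow", "red")
```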

2. Monitor JVM Memory and Garbage Collection

Check the node stats to see heap usage and GC activity:

curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"

Look at the jvm.mem.heap_used_percent. If it's consistently above 85%, you are at risk of long GC pauses and OOM errors. Also, check jvm.gc.collectors.old.collection_time_in_millis. A high value here indicates the node is spending too much time trying to free up memory.
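Walking the stats response to flag at-risk nodes is a few lines of scripting. A sketch mirroring the jvm.mem section of _nodes/stats/jvm (the sample data is illustrative):

```python
# Flag nodes whose heap_used_percent exceeds 85, the risk threshold
# discussed above. The nested structure mirrors _nodes/stats/jvm;
# sample values are illustrative.
nodes = {
    "node-1": {"jvm": {"mem": {"heap_used_percent": 91}}},
    "node-2": {"jvm": {"mem": {"heap_used_percent": 60}}},
}
at_risk = [name for name, stats in nodes.items()
           if stats["jvm"]["mem"]["heap_used_percent"] > 85]
```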

3. Verify Disk Usage and Watermarks

Check the disk allocation for all nodes:

curl -X GET "localhost:9200/_cat/allocation?v"

Pay attention to the disk.percent column. If any node is above 85-90%, you are hitting the disk watermarks. You can also check the cluster settings to see whether blocks have been applied:

curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true"
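Because _cat/allocation?v returns whitespace-separated columns, it is easy to post-process. A sketch that flags nodes at or above the 90% high watermark (the sample rows are illustrative):

```python
# Parse _cat/allocation?v-style output and flag nodes at or above the
# 90% high watermark. Sample rows below are illustrative.
cat_allocation = """\
shards disk.indices disk.used disk.avail disk.total disk.percent host     node
    40         50gb     182gb       18gb      200gb           91 10.0.0.1 node-1
    38         44gb     120gb       80gb      200gb           60 10.0.0.2 node-2
"""

rows = cat_allocation.strip().splitlines()
header = rows[0].split()
pct_col = header.index("disk.percent")
node_col = header.index("node")
over_watermark = [r.split()[node_col] for r in rows[1:]
                  if int(r.split()[pct_col]) >= 90]
```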

4. Inspect Thread Pools

Check if requests are being rejected due to full queues:

curl -X GET "localhost:9200/_cat/thread_pool?v&s=name"

Look at the rejected column for the write and search thread pools. An increasing number of rejections indicates the node cannot keep up with the workload.

Step 2: Implement Solutions

Based on your diagnosis, apply the appropriate fix.

Solution 1: Fixing Out of Memory and High GC

If your heap is consistently full, you need to increase the JVM heap size. The heap size is configured in the jvm.options file, typically located in /etc/elasticsearch/jvm.options.

  1. Open the file: sudo nano /etc/elasticsearch/jvm.options
  2. Locate the -Xms (minimum heap) and -Xmx (maximum heap) settings.
  3. Increase the values. Rule of thumb: set them to 50% of the machine's total RAM, but do not exceed roughly 31GB, the point at which the JVM loses compressed object pointers (the exact threshold varies by JVM).

Example: -Xms16g -Xmx16g

  4. Restart Elasticsearch: sudo systemctl restart elasticsearch

If you cannot increase the RAM, you must optimize your queries or reduce the number of shards (e.g., by reindexing or using the shrink API).
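The sizing rule above reduces to a one-liner; the 31GB cap here is a conservative stand-in for the compressed-oops threshold, which varies by JVM:

```python
# Rule of thumb from this guide: half of system RAM, capped below the
# compressed-oops threshold (~31 GB is a common conservative cap).
def recommended_heap_gb(total_ram_gb: int) -> int:
    return min(total_ram_gb // 2, 31)
```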

Solution 2: Resolving Disk Full Errors

If your nodes have hit the flood-stage watermark, Elasticsearch will put indices into a read-only state. You must free up disk space and then manually remove the block.

  1. Free up space: Delete old indices, add more storage to the machine, or add a new node to the cluster to allow shards to relocate.
  2. Remove the block: Once disk usage is below the high watermark (usually < 90%), run this command to lift the read-only restriction:

curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "index.blocks.read_only_allow_delete": null
}'

Solution 3: Fixing Connection Refused and Permission Denied

If the service isn't starting due to elasticsearch permission denied:

  1. Ensure the elasticsearch user owns the data and log directories:

sudo chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
sudo chown -R elasticsearch:elasticsearch /var/log/elasticsearch

If you get connection refused from a remote machine:

  1. Check elasticsearch.yml (/etc/elasticsearch/elasticsearch.yml).
  2. Ensure network.host is set correctly. By default, it binds to localhost. To allow external connections, set it to the machine's private IP or 0.0.0.0 (ensure your firewall secures port 9200!):

network.host: 0.0.0.0
  3. Restart the service.

Solution 4: Client-Side Timeout Adjustments

If the cluster is healthy but complex queries occasionally time out, you can increase the timeout settings in your Elasticsearch client. For example, in Python:

from elasticsearch import Elasticsearch

# elasticsearch-py 7.x syntax; in client 8.x, pass the URL as a string
# and use request_timeout instead of timeout.
es = Elasticsearch(
    [{'host': 'localhost', 'port': 9200}],
    timeout=60,
    max_retries=3,
    retry_on_timeout=True,
)

However, increasing timeouts is a band-aid. The permanent fix is always optimizing the cluster resources, query efficiency, or scaling out.
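If you do retry, space the retries out so a struggling cluster is not hammered further. A generic sketch of the pattern (this wrapper is an illustration, not an Elasticsearch client feature; the client's own retry_on_timeout covers the simple case):

```python
import time

# Generic retry-with-exponential-backoff wrapper (illustrative pattern,
# not an Elasticsearch client feature).
def with_backoff(fn, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on TimeoutError with exponential backoff."""
    for attempt in range(retries):
        try:
            return fn()
        except TimeoutError:
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```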

Conclusion

Troubleshooting Elasticsearch timeouts and connection errors requires a systematic approach. By checking cluster health, JVM metrics, and disk space, you can quickly identify whether the issue is network-related, resource-bound, or configuration-based. Proactive monitoring of heap usage and disk watermarks is essential to prevent these issues from impacting your production environment.

Quick Reference Commands

```bash
# Check cluster health and unassigned shards
curl -X GET "localhost:9200/_cluster/health?pretty"

# Check disk space allocation per node
curl -X GET "localhost:9200/_cat/allocation?v"

# Check thread pool rejections
curl -X GET "localhost:9200/_cat/thread_pool?v&s=name"

# Remove read-only block after fixing disk full issue
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "index.blocks.read_only_allow_delete": null
}'

# Fix permissions if Elasticsearch service fails to start
sudo chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
sudo chown -R elasticsearch:elasticsearch /var/log/elasticsearch
```

Error Medic Editorial

Error Medic Editorial is a team of Senior DevOps and SRE professionals dedicated to providing actionable, real-world solutions for modern infrastructure and database challenges.
