Error Medic

Fixing Elasticsearch Timeout and Connection Refused Errors

Comprehensive guide to resolving Elasticsearch timeout, connection refused, OOM, and disk full errors. Learn root causes, diagnostic commands, and permanent fixes.

Key Takeaways
  • High heap usage (OOM) or heavy garbage collection pauses often trigger timeouts.
  • A full disk (watermark exceeded) will cause nodes to block writes and potentially drop connections.
  • Misconfigured thread pools or network firewalls lead to 'connection refused' and dropped requests.
  • Check cluster health, logs, and resource metrics before adjusting timeouts or JVM settings.
Fix Approaches Compared
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Increase JVM Heap | OOM errors, frequent long GC pauses | 10 mins | Low (if RAM available) |
| Clear Disk Space / Add Nodes | Disk watermarks exceeded (read-only indices) | 30 mins | Low |
| Adjust Thread Pools | High bulk rejection rates, queued tasks | 15 mins | Medium (can overload node) |
| Increase Timeout Settings | Temporary network latency, heavy queries | 5 mins | Low (band-aid solution) |

Understanding Elasticsearch Timeouts and Connection Errors

Elasticsearch is a distributed system that relies heavily on network communication between nodes and on fast disk and memory access. When you encounter an elasticsearch timeout or elasticsearch connection refused error, it is rarely a simple network blip. More often, these errors are secondary symptoms of an underlying problem: resource exhaustion such as elasticsearch out of memory (OOM) or elasticsearch disk full, or a misconfiguration such as elasticsearch permission denied.

Typical error messages look like:

  • ElasticsearchTimeoutException: Timeout connecting to [node-1]
  • Connection refused: no further information
  • java.lang.OutOfMemoryError: Java heap space
  • ClusterBlockException: index [my-index] blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];
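These messages map fairly directly onto the root causes covered below. As a rough illustration, a triage script might classify them like this (this helper and its labels are hypothetical, not part of any Elasticsearch client):

```python
# Hypothetical triage helper: maps a raw error message to the likely
# root cause discussed in this guide. Not part of any official client.
def classify_es_error(message: str) -> str:
    m = message.lower()
    if "outofmemoryerror" in m or "java heap space" in m:
        return "jvm-heap-exhaustion"
    if "read-only / allow delete" in m or "forbidden/12" in m:
        return "disk-watermark-flood-stage"
    if "connection refused" in m:
        return "process-down-or-network"
    if "timeout" in m:
        return "timeout-check-gc-and-load"
    return "unknown"
```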

In this comprehensive guide, we will break down the root causes of these errors, how to accurately diagnose them using Elasticsearch's API and system tools, and how to apply permanent fixes.

Root Cause 1: JVM Heap Exhaustion (Out of Memory)

Elasticsearch runs on the Java Virtual Machine (JVM). If the JVM heap is too small for your workload, Elasticsearch will spend an excessive amount of time performing Garbage Collection (GC). During a 'stop-the-world' GC pause, the node cannot respond to ping requests from the master node or client requests. If the pause lasts longer than the configured timeout (default is usually 30 seconds), the node drops out of the cluster, or the client receives an elasticsearch timeout.

Eventually, if memory cannot be reclaimed, you will see an elasticsearch out of memory error in the logs, and the node will crash.
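One quick way to quantify GC pressure is to sample the old-generation collection time twice and compute what fraction of the wall-clock window was spent collecting. A minimal sketch (the 10% threshold is an illustrative assumption, not an official limit):

```python
# Estimate the share of a sampling window spent in old-gen GC, from two
# snapshots of jvm.gc.collectors.old.collection_time_in_millis
# (as reported by _nodes/stats/jvm).
def gc_overhead_percent(old_gc_ms_start: int, old_gc_ms_end: int, window_ms: int) -> float:
    return 100.0 * (old_gc_ms_end - old_gc_ms_start) / window_ms

overhead = gc_overhead_percent(12_000, 21_000, 60_000)  # 9 s of GC in a 60 s window
gc_is_unhealthy = overhead > 10.0  # illustrative threshold
```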

Root Cause 2: Disk Watermark Exceeded (Disk Full)

Elasticsearch monitors the disk usage of every node in the cluster against three watermark thresholds: low (default 85%), high (default 90%), and flood-stage (default 95%). When disk usage reaches the high watermark, Elasticsearch attempts to relocate shards away from the node. If usage hits the flood-stage watermark, Elasticsearch enforces a read-only block on every index that has one or more shards allocated on the node, to prevent the disk from filling up completely (elasticsearch disk full).

When indices are blocked, write requests will fail or hang, leading to client-side timeouts.
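The threshold logic can be sketched as follows, using the default watermarks above; real clusters may override them via the cluster.routing.allocation.disk.watermark.* settings:

```python
# Classify a node's disk usage against the default watermark thresholds
# (low 85%, high 90%, flood-stage 95%). Defaults may be overridden in
# cluster settings, so treat these numbers as illustrative.
def watermark_state(disk_percent: float, low=85, high=90, flood=95) -> str:
    if disk_percent >= flood:
        return "flood-stage: indices forced read-only"
    if disk_percent >= high:
        return "high: shards relocated away from node"
    if disk_percent >= low:
        return "low: no new shards allocated to node"
    return "ok"
```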

Root Cause 3: Thread Pool Rejections

Elasticsearch uses various thread pools (e.g., search, write, get) to manage concurrent operations. If the number of incoming requests exceeds the number of threads available plus the queue size, Elasticsearch will reject the requests. The client will receive a 429 Too Many Requests error, which some client libraries might interpret or surface as a connection drop or timeout if not handled correctly.

Root Cause 4: Network and Permissions

If you receive an elasticsearch connection refused immediately upon trying to connect, the Elasticsearch process might be down, listening on the wrong interface (e.g., localhost instead of the public IP), or a firewall is blocking the port (default 9200 for HTTP, 9300 for transport). Similarly, elasticsearch permission denied can occur if the user running the Elasticsearch process does not have read/write access to the data or log directories, causing the node to fail during startup.

Step 1: Diagnose the Issue

Before making any configuration changes, you must identify the exact bottleneck.

1. Check Cluster Health and Logs

The first step is always to check the cluster health and the Elasticsearch logs. The logs are usually located in /var/log/elasticsearch/<cluster-name>.log on Linux.

Run the following command to get a high-level overview of the cluster:

curl -X GET "localhost:9200/_cluster/health?pretty"

If the status is yellow, replica shards are unassigned; if it is red, at least one primary shard is unassigned. If the API itself times out, the node is likely overwhelmed.
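If you script your health checks, the response is plain JSON and easy to summarize. A minimal sketch with an illustrative payload (not from a live cluster):

```python
import json

# Sample _cluster/health response (illustrative, not from a live cluster).
health_json = '''{"cluster_name": "prod", "status": "yellow",
  "number_of_nodes": 3, "unassigned_shards": 4}'''

health = json.loads(health_json)
summary = (f"{health['cluster_name']}: {health['status']}, "
           f"{health['unassigned_shards']} unassigned shard(s)")
needs_attention = health["status"] in ("yellow", "red")
```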

2. Monitor JVM Memory and Garbage Collection

Check the node stats to see heap usage and GC activity:

curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"

Look at the jvm.mem.heap_used_percent. If it's consistently above 85%, you are at risk of long GC pauses and OOM errors. Also, check jvm.gc.collectors.old.collection_time_in_millis. A high value here indicates the node is spending too much time trying to free up memory.
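Walking the stats response to flag at-risk nodes is a few lines of scripting. A sketch mirroring the jvm.mem section of _nodes/stats/jvm (the sample data is illustrative):

```python
# Flag nodes whose heap_used_percent exceeds 85, the risk threshold
# discussed above. The nested structure mirrors _nodes/stats/jvm;
# sample values are illustrative.
nodes = {
    "node-1": {"jvm": {"mem": {"heap_used_percent": 91}}},
    "node-2": {"jvm": {"mem": {"heap_used_percent": 60}}},
}
at_risk = [name for name, stats in nodes.items()
           if stats["jvm"]["mem"]["heap_used_percent"] > 85]
```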

3. Verify Disk Usage and Watermarks

Check the disk allocation for all nodes:

curl -X GET "localhost:9200/_cat/allocation?v"

Pay attention to the disk.percent column. If any node is above 85-90%, you are hitting the disk watermarks. You can also check the cluster settings to see whether blocks have been applied:

curl -X GET "localhost:9200/_cluster/settings?include_defaults=true&flat_settings=true"
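Because _cat/allocation?v returns whitespace-separated columns, it is easy to post-process. A sketch that flags nodes at or above the 90% high watermark (the sample rows are illustrative):

```python
# Parse _cat/allocation?v-style output and flag nodes at or above the
# 90% high watermark. Sample rows below are illustrative.
cat_allocation = """\
shards disk.indices disk.used disk.avail disk.total disk.percent host     node
    40         50gb     182gb       18gb      200gb           91 10.0.0.1 node-1
    38         44gb     120gb       80gb      200gb           60 10.0.0.2 node-2
"""

rows = cat_allocation.strip().splitlines()
header = rows[0].split()
pct_col = header.index("disk.percent")
node_col = header.index("node")
over_watermark = [r.split()[node_col] for r in rows[1:]
                  if int(r.split()[pct_col]) >= 90]
```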

4. Inspect Thread Pools

Check if requests are being rejected due to full queues:

curl -X GET "localhost:9200/_cat/thread_pool?v&s=name"

Look at the rejected column for the write and search thread pools. An increasing number of rejections indicates the node cannot keep up with the workload.

Step 2: Implement Solutions

Based on your diagnosis, apply the appropriate fix.

Solution 1: Fixing Out of Memory and High GC

If your heap is consistently full, you need to increase the JVM heap size. The heap size is configured in the jvm.options file, typically located in /etc/elasticsearch/jvm.options.

  1. Open the file: sudo nano /etc/elasticsearch/jvm.options
  2. Locate the -Xms (minimum heap) and -Xmx (maximum heap) settings.
  3. Increase the values. Rule of thumb: set them to 50% of the machine's total RAM, but do not exceed roughly 31GB, the point at which the JVM loses compressed object pointers (the exact threshold varies by JVM).

Example: -Xms16g -Xmx16g

  4. Restart Elasticsearch: sudo systemctl restart elasticsearch

If you cannot increase the RAM, you must optimize your queries or reduce the number of shards (e.g., by reindexing or using the shrink API).
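The sizing rule above reduces to a one-liner; the 31GB cap here is a conservative stand-in for the compressed-oops threshold, which varies by JVM:

```python
# Rule of thumb from this guide: half of system RAM, capped below the
# compressed-oops threshold (~31 GB is a common conservative cap).
def recommended_heap_gb(total_ram_gb: int) -> int:
    return min(total_ram_gb // 2, 31)
```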

Solution 2: Resolving Disk Full Errors

If your nodes have hit the flood-stage watermark, Elasticsearch will put indices into a read-only state. You must free up disk space and then manually remove the block.

  1. Free up space: Delete old indices, add more storage to the machine, or add a new node to the cluster to allow shards to relocate.
  2. Remove the block: Once disk usage is below the high watermark (usually < 90%), run this command to lift the read-only restriction:

curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "index.blocks.read_only_allow_delete": null
}'

Solution 3: Fixing Connection Refused and Permission Denied

If the service isn't starting due to elasticsearch permission denied:

  1. Ensure the elasticsearch user owns the data and log directories:

sudo chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
sudo chown -R elasticsearch:elasticsearch /var/log/elasticsearch

If you get connection refused from a remote machine:

  1. Check elasticsearch.yml (/etc/elasticsearch/elasticsearch.yml).
  2. Ensure network.host is set correctly. By default, it binds to localhost. To allow external connections, set it to the machine's private IP or 0.0.0.0 (ensure your firewall secures port 9200!):

network.host: 0.0.0.0
  3. Restart the service.

Solution 4: Client-Side Timeout Adjustments

If the cluster is healthy but complex queries occasionally time out, you can increase the timeout settings in your Elasticsearch client. For example, in Python:

from elasticsearch import Elasticsearch

# elasticsearch-py 7.x syntax; in client 8.x, pass the URL as a string
# and use request_timeout instead of timeout.
es = Elasticsearch(
    [{'host': 'localhost', 'port': 9200}],
    timeout=60,
    max_retries=3,
    retry_on_timeout=True,
)

However, increasing timeouts is a band-aid. The permanent fix is always optimizing the cluster resources, query efficiency, or scaling out.
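If you do retry, space the retries out so a struggling cluster is not hammered further. A generic sketch of the pattern (this wrapper is an illustration, not an Elasticsearch client feature; the client's own retry_on_timeout covers the simple case):

```python
import time

# Generic retry-with-exponential-backoff wrapper (illustrative pattern,
# not an Elasticsearch client feature).
def with_backoff(fn, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on TimeoutError with exponential backoff."""
    for attempt in range(retries):
        try:
            return fn()
        except TimeoutError:
            if attempt == retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
```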

Conclusion

Troubleshooting Elasticsearch timeouts and connection errors requires a systematic approach. By checking cluster health, JVM metrics, and disk space, you can quickly identify whether the issue is network-related, resource-bound, or configuration-based. Proactive monitoring of heap usage and disk watermarks is essential to prevent these issues from impacting your production environment.

Quick Reference Commands

```bash
# Check cluster health and unassigned shards
curl -X GET "localhost:9200/_cluster/health?pretty"

# Check disk space allocation per node
curl -X GET "localhost:9200/_cat/allocation?v"

# Check thread pool rejections
curl -X GET "localhost:9200/_cat/thread_pool?v&s=name"

# Remove read-only block after fixing disk full issue
curl -X PUT "localhost:9200/_all/_settings" -H 'Content-Type: application/json' -d'
{
  "index.blocks.read_only_allow_delete": null
}'

# Fix permissions if Elasticsearch service fails to start
sudo chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
sudo chown -R elasticsearch:elasticsearch /var/log/elasticsearch
```

Error Medic Editorial

Error Medic Editorial is a team of Senior DevOps and SRE professionals dedicated to providing actionable, real-world solutions for modern infrastructure and database challenges.
