Elasticsearch API Timeout: How to Diagnose and Fix Connection, Request, and Bulk Timeout Errors
Fix Elasticsearch API timeouts by tuning socket/request timeout settings, adjusting thread pools, and scaling cluster resources. Step-by-step guide with commands.
- Elasticsearch API timeouts fall into three categories: connection timeouts (client cannot reach the node), request timeouts (node accepted the request but did not respond in time), and bulk/index timeouts (write operations exceed the configured deadline).
- The most common root causes are undersized thread pools, GC pressure causing JVM pauses, hot shards from uneven data distribution, and client-side timeout values that are too low for the operation being performed.
- Quick fixes: increase client request_timeout for long-running queries, raise search.default_search_timeout at the cluster level, add replicas to distribute read load, and monitor _cat/thread_pool and _nodes/hot_threads to identify the bottleneck before tuning.
| Method | When to Use | Time to Apply | Risk |
|---|---|---|---|
| Increase client request_timeout | Queries consistently finish but client gives up too early | < 5 min | Low — client-side only |
| Raise search.default_search_timeout | All searches time out cluster-wide, server is slow | 5 min | Low — reversible setting |
| Tune thread_pool.search.size | search threadpool queue filling up (_cat/thread_pool shows rejections) | 10 min | Medium — can starve other pools |
| Add replica shards | Hot primary shard handling all read traffic | 15–30 min | Low — online operation |
| Increase JVM heap (restart required) | GC pauses > 5 s visible in logs, heap usage > 85% | 30 min | High — requires rolling restart |
| Force-merge / reduce shard count | Too many small shards causing overhead on every request | Hours | Medium — CPU-intensive, do off-peak |
| Circuit breaker tuning | requests.breaker.total.limit too low, breaker trips before timeout | < 5 min | Medium — can allow OOM if set too high |
Understanding Elasticsearch API Timeouts
When your application or curl command hits Elasticsearch and receives no response within the deadline, you will see one of several error messages depending on the client library and where in the stack the failure occurs:
# Python elasticsearch-py
ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
RequestError: RequestError(408, 'request_timeout', 'Request timed out')
# Java High-Level REST Client
org.elasticsearch.client.ResponseException: method [POST], host [...], status line [HTTP/1.1 408 Request Timeout]
java.net.SocketTimeoutException: Read timed out
# curl
curl: (28) Operation timed out after 30000 milliseconds
# Kibana / Elasticsearch logs
[o.e.s.SearchService] [node-1] Search request timed out: [...] took [30001ms], timeout [30000ms]
These errors map to distinct failure modes that require different remediation paths.
Step 1: Identify the Timeout Category
Connection timeout — The TCP handshake never completed. The cluster is unreachable, a load balancer dropped the connection, or the node is down.
Socket / read timeout — The connection was established but the server did not send back a complete response before the deadline. This is the most common production scenario.
Bulk reject / queue full — The thread pool queue is full; the node returned a 429 Too Many Requests or an implicit timeout because threads were unavailable.
Run these diagnostic commands first to triage:
# Check cluster health
curl -s 'http://localhost:9200/_cluster/health?pretty'
# Inspect thread pool rejections (look for 'rejected' > 0)
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=name,active,queue,rejected,completed'
# Show hot threads on all nodes (identifies CPU-bound operations)
curl -s 'http://localhost:9200/_nodes/hot_threads'
# Check GC pause warnings in the main log (not the slow log)
grep 'overhead, spent' /var/log/elasticsearch/elasticsearch.log | tail -20
# Check cumulative query/fetch time per index (per-query detail requires the slow log)
curl -s 'http://localhost:9200/_cat/indices?v&h=index,search.fetch_time,search.query_time'
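The thread-pool check is the one most worth automating. A minimal Python sketch that scans `_cat/thread_pool` output for rejections; the sample text below is illustrative, not real cluster output:

```python
def find_rejections(cat_output: str) -> dict:
    """Parse `_cat/thread_pool?v&h=name,active,queue,rejected,completed`
    output and return {pool_name: rejected_count} for pools with rejections."""
    lines = cat_output.strip().splitlines()
    header = lines[0].split()
    name_i, rej_i = header.index("name"), header.index("rejected")
    result = {}
    for line in lines[1:]:
        cols = line.split()
        rejected = int(cols[rej_i])
        if rejected > 0:
            result[cols[name_i]] = rejected
    return result

# Illustrative sample output (not from a real cluster)
sample = """name   active queue rejected completed
search 5      120   37       992837
write  2      0     0        442210"""
print(find_rejections(sample))  # {'search': 37}
```

Any non-empty result here means Step 4 (thread pool exhaustion) applies before any client-side timeout tuning.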
Step 2: Fix Connection Timeouts
If the cluster health endpoint itself times out, the node is unreachable. Verify network connectivity and check whether the node process is alive:
# Verify the process is running
systemctl status elasticsearch
# Check bound address and port
curl -s 'http://localhost:9200'
# Test from application host (replace ES_HOST)
telnet ES_HOST 9200
nc -zv ES_HOST 9200
# Check firewall / security group rules
iptables -L -n | grep 9200
If nodes are reachable but the client still reports connection timeouts, raise the connect and socket timeouts in your client:
# Python elasticsearch-py v8
from elasticsearch import Elasticsearch
es = Elasticsearch(
"http://localhost:9200",
request_timeout=60, # socket read timeout
connections_per_node=10,
)
// Java REST Client
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200))
.setRequestConfigCallback(requestConfigBuilder ->
requestConfigBuilder
.setConnectTimeout(5000) // 5 s connect
.setSocketTimeout(60000)); // 60 s read
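Whichever client you use, transient read timeouts are often worth a bounded retry with backoff rather than an immediate failure. A minimal sketch of the pattern in plain Python, with a simulated flaky call standing in for the real client request:

```python
import time

def with_retries(call, max_retries=3, base_delay=0.1):
    """Invoke call(), retrying on TimeoutError with exponential backoff;
    re-raise after the final attempt."""
    for attempt in range(max_retries):
        try:
            return call()
        except TimeoutError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))

# Simulated flaky request: times out twice, then succeeds.
attempts = {"n": 0}
def flaky_search():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated read timeout")
    return {"hits": {"total": {"value": 42}}}

print(with_retries(flaky_search))  # {'hits': {'total': {'value': 42}}}
```

Cap the retries: unbounded retries against an overloaded cluster only deepen the queue backlog diagnosed in Step 4.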
Step 3: Fix Search / Query Timeouts
Per-request timeout — Pass timeout in the request body. This is the safest option because it only affects the current query:
POST /my-index/_search
{
"timeout": "30s",
"query": { "match_all": {} }
}
Cluster-wide default — Set a global deadline for all searches. Queries that exceed it return partial results rather than an error:
curl -X PUT 'http://localhost:9200/_cluster/settings' \
-H 'Content-Type: application/json' \
-d '{
"transient": {
"search.default_search_timeout": "30s"
}
}'
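When this cluster-wide deadline (or a per-request timeout) fires, Elasticsearch still returns HTTP 200 with the hits gathered so far and sets timed_out: true in the response body, so clients should check that flag explicitly. A sketch against an abbreviated, illustrative response:

```python
def hits_are_complete(response: dict) -> bool:
    """Return False when the search was cut off by a timeout,
    i.e. the hits in the response are partial."""
    return not response.get("timed_out", False)

# Abbreviated, illustrative response shape for a timed-out search
resp = {"took": 30001, "timed_out": True,
        "_shards": {"total": 5, "successful": 5, "failed": 0},
        "hits": {"total": {"value": 1200}, "hits": []}}
print(hits_are_complete(resp))  # False
```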
Slow query analysis — Enable the slow log to find which queries are responsible:
curl -X PUT 'http://localhost:9200/my-index/_settings' \
-H 'Content-Type: application/json' \
-d '{
"index.search.slowlog.threshold.query.warn": "5s",
"index.search.slowlog.threshold.fetch.warn": "1s",
"index.search.slowlog.level": "warn"
}'
Then tail /var/log/elasticsearch/*_index_search_slowlog.log.
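Slow log lines can be triaged programmatically. A sketch of a parser for a representative 7.x-style plaintext slow log line; the exact format varies by Elasticsearch version, so treat the regexes as a starting point:

```python
import re

# Representative 7.x-style slow log line (abbreviated, illustrative)
LINE = ('[2024-01-15T10:00:00,123][WARN ][i.s.s.query] [node-1] '
        '[my-index][0] took[12.3s], took_millis[12300], '
        'total_shards[5], source[{"query":{"match_all":{}}}]')

def parse_slowlog(line):
    """Extract the index name and query duration from a slow log line;
    return None if the line is not a slow log entry."""
    took = re.search(r"took_millis\[(\d+)\]", line)
    if not took:
        return None
    index = re.search(r"\[([\w.-]+)\]\[\d+\] took\[", line)
    return {"index": index.group(1) if index else None,
            "took_ms": int(took.group(1))}

print(parse_slowlog(LINE))  # {'index': 'my-index', 'took_ms': 12300}
```

Aggregating took_ms per index over an hour of slow log output quickly identifies the query patterns responsible for the timeouts.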
Step 4: Fix Thread Pool Exhaustion
If _cat/thread_pool shows rejected counts rising, the search or write thread pool is saturated. The safe fix is to reduce query complexity or add nodes. As a short-term measure you can increase the queue size (not the thread count, which is CPU-bound):
# elasticsearch.yml — increase search queue depth
thread_pool.search.queue_size: 2000
thread_pool.write.queue_size: 1000
Thread pool settings are static, so update elasticsearch.yml on each node and perform a rolling restart for the change to take effect.
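For context when sizing: the search pool's thread count defaults to int((allocated_processors * 3) / 2) + 1, which is why raising the thread count on an already CPU-bound node rarely helps. The arithmetic, as a quick sketch:

```python
def default_search_threads(allocated_processors: int) -> int:
    """Default search thread pool size per the thread pool docs:
    int((allocated_processors * 3) / 2) + 1."""
    return int(allocated_processors * 3 / 2) + 1

# Defaults scale with CPU count, not with query load
for cpus in (4, 8, 16, 32):
    print(f"{cpus} CPUs -> {default_search_threads(cpus)} search threads")
# 4 CPUs -> 7 search threads, ..., 32 CPUs -> 49 search threads
```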
Step 5: Fix JVM / GC-Induced Timeouts
Long GC pauses cause the JVM to stop all threads, making the node appear unresponsive. Signs: [gc][12345] overhead, spent [8.5s] collecting in the last [10s] in logs.
- Heap must be ≤ 50% of RAM and never exceed 32 GB (the compressed oops limit).
- Use G1GC (the default collector in ES 7+), configured via jvm.options or the ES_JAVA_OPTS environment variable.
- Reduce the field data cache if the fielddata circuit breaker trips frequently:
curl -X PUT 'http://localhost:9200/_cluster/settings' \
-H 'Content-Type: application/json' \
-d '{
"persistent": {
"indices.breaker.fielddata.limit": "40%",
"indices.breaker.request.limit": "40%",
"indices.breaker.total.limit": "70%"
}
}'
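The heap rules above reduce to a single formula: half the machine's RAM, capped below the compressed-oops threshold. A sketch (the 31 GB cap here is a conservative stand-in for the exact JVM-dependent cutoff):

```python
def recommended_heap_gb(ram_gb: float) -> float:
    """Half of RAM for the heap (the rest feeds the filesystem cache),
    capped at 31 GB to keep compressed oops enabled."""
    return min(ram_gb / 2, 31.0)

for ram in (16, 64, 128):
    heap = recommended_heap_gb(ram)
    print(f"{ram} GB RAM -> -Xms{heap:g}g -Xmx{heap:g}g")
# 16 GB RAM -> 8g heap; 64 GB and 128 GB both cap at 31g
```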
Step 6: Fix Bulk Indexing Timeouts
Bulk write timeouts (a bulk request exceeding the client's deadline) are usually caused by refresh pressure or oversized batches.
# Temporarily disable refresh during heavy indexing
curl -X PUT 'http://localhost:9200/my-index/_settings' \
-H 'Content-Type: application/json' \
-d '{"index.refresh_interval": "-1"}'
# Restore after indexing
curl -X PUT 'http://localhost:9200/my-index/_settings' \
-H 'Content-Type: application/json' \
-d '{"index.refresh_interval": "1s"}'
Reduce bulk batch size to 5–15 MB per request and target 1,000–5,000 documents per batch as a starting point, then tune based on _nodes/stats/indices/indexing metrics.
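Those limits can be enforced client-side before the bulk request is built. A sketch of a batcher that cuts a batch at whichever limit is hit first, document count or serialized payload size (the defaults here mirror the starting points above, not hard rules):

```python
import json

def batch_docs(docs, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield lists of documents, each capped at max_docs entries or
    roughly max_bytes of serialized JSON, whichever comes first."""
    batch, size = [], 0
    for doc in docs:
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        if batch and (len(batch) >= max_docs or size + doc_bytes > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_bytes
    if batch:
        yield batch

# 2,500 small docs split on the document-count limit
docs = [{"id": i, "msg": "x" * 50} for i in range(2500)]
batches = list(batch_docs(docs))
print([len(b) for b in batches])  # [1000, 1000, 500]
```

Start from these limits, then raise or lower them based on the _nodes/stats/indices/indexing metrics mentioned above.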
Step 7: Long-Term Prevention
- Enable slow logs on all indices at the warn threshold.
- Alert on _cat/thread_pool rejections: any rejected > 0 per minute is a leading indicator.
- Monitor GC pause time via _nodes/stats/jvm.
- Use index lifecycle management (ILM) to force-merge and roll over hot indices, keeping shard counts healthy.
- Distribute load with aliases pointing to multiple indices instead of querying one large index.
Full Diagnostic Script
The script below bundles the checks from the steps above into a single triage report:
#!/usr/bin/env bash
# elasticsearch-timeout-diag.sh
# Run this script against a node to produce a triage report.
ES_HOST="${ES_HOST:-localhost}"
ES_PORT="${ES_PORT:-9200}"
BASE="http://${ES_HOST}:${ES_PORT}"
echo "=== Cluster Health ==="
curl -sf "${BASE}/_cluster/health?pretty" || echo "UNREACHABLE"
echo ""
echo "=== Thread Pool (look for rejected > 0) ==="
curl -sf "${BASE}/_cat/thread_pool?v&h=name,active,queue,rejected,completed,queue_size"
echo ""
echo "=== Node JVM GC Stats ==="
curl -sf "${BASE}/_nodes/stats/jvm?pretty" | \
python3 -c "
import sys, json
d = json.load(sys.stdin)
for node, info in d['nodes'].items():
    name = info['name']
    for cname, cdata in info['jvm']['gc']['collectors'].items():
        print(f'{name} | {cname} | count={cdata[\"collection_count\"]} | time={cdata[\"collection_time_in_millis\"]}ms')
"
echo ""
echo "=== Hot Threads ==="
curl -sf "${BASE}/_nodes/hot_threads?threads=3"
echo ""
echo "=== Pending Tasks ==="
curl -sf "${BASE}/_cluster/pending_tasks?pretty"
echo ""
echo "=== Slow Indices (search query time > 60s cumulative) ==="
curl -sf "${BASE}/_cat/indices?v&h=index,search.query_time,search.query_total,search.fetch_time&time=ms" | \
awk 'NR==1 || $2 > 60000'
echo ""
echo "=== Circuit Breaker Status ==="
curl -sf "${BASE}/_nodes/stats/breaker?pretty" | \
python3 -c "
import sys, json
d = json.load(sys.stdin)
for node, info in d['nodes'].items():
    for bname, bdata in info['breakers'].items():
        pct = round(bdata['estimated_size_in_bytes'] /
                    max(bdata.get('limit_size_in_bytes', 0), 1) * 100, 1)
        print(f\"{info['name']} | {bname} | used={bdata['estimated_size']} / {bdata['limit_size']} ({pct}%) | tripped={bdata['tripped']}\")
"
echo ""
echo "=== Suggested Fixes ==="
REJECTED=$(curl -sf "${BASE}/_cat/thread_pool?h=rejected" | awk '{s+=$1} END{print s}')
if [ "${REJECTED}" -gt 0 ] 2>/dev/null; then
echo "[!] Thread pool rejections detected (${REJECTED} total). Consider increasing queue_size or adding nodes."
fi
TRIPPED=$(curl -sf "${BASE}/_nodes/stats/breaker?pretty" | grep -c '"tripped" : [^0]')
if [ "${TRIPPED}" -gt 0 ] 2>/dev/null; then
echo "[!] Circuit breaker has tripped. Review heap usage and breaker limits."
fi
echo "Done."
Error Medic Editorial
The Error Medic Editorial team is composed of senior DevOps engineers and SREs with experience operating large-scale Elasticsearch clusters in production. Our guides are derived from real incident postmortems and focus on actionable, command-level troubleshooting over theoretical explanations.
Sources
- https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-threadpool.html
- https://www.elastic.co/guide/en/elasticsearch/reference/current/search-your-data.html#search-timeout
- https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html
- https://www.elastic.co/guide/en/elasticsearch/client/python-api/current/getting-started-python.html
- https://github.com/elastic/elasticsearch/issues/43187
- https://stackoverflow.com/questions/22924300/elasticsearch-request-timeout-vs-socket-timeout