Error Medic

Elasticsearch API Timeout: How to Diagnose and Fix Connection, Request, and Bulk Timeout Errors

Fix Elasticsearch API timeouts by tuning socket/request timeout settings, adjusting thread pools, and scaling cluster resources. Step-by-step guide with commands.

Key Takeaways
  • Elasticsearch API timeouts fall into three categories: connection timeouts (client cannot reach the node), request timeouts (node accepted the request but did not respond in time), and bulk/index timeouts (write operations exceed the configured deadline).
  • The most common root causes are undersized thread pools, GC pressure causing JVM pauses, hot shards from uneven data distribution, and client-side timeout values that are too low for the operation being performed.
  • Quick fixes: increase client request_timeout for long-running queries, raise search.default_search_timeout at the cluster level, add replicas to distribute read load, and monitor _cat/thread_pool and _nodes/hot_threads to identify the bottleneck before tuning.
Fix Approaches Compared
Method | When to Use | Time to Apply | Risk
Increase client request_timeout | Queries consistently finish but client gives up too early | < 5 min | Low — client-side only
Raise search.default_search_timeout | All searches time out cluster-wide, server is slow | 5 min | Low — reversible setting
Tune thread_pool.search.size | Search thread pool queue filling up (_cat/thread_pool shows rejections) | 10 min | Medium — can starve other pools
Add replica shards | Hot primary shard handling all read traffic | 15–30 min | Low — online operation
Increase JVM heap (restart required) | GC pauses > 5 s visible in logs, heap usage > 85% | 30 min | High — requires rolling restart
Force-merge / reduce shard count | Too many small shards causing overhead on every request | Hours | Medium — CPU-intensive, do off-peak
Circuit breaker tuning | indices.breaker.total.limit too low, breaker trips before timeout | < 5 min | Medium — can allow OOM if set too high

Understanding Elasticsearch API Timeouts

When your application or curl command hits Elasticsearch and receives no response within the deadline, you will see one of several error messages depending on the client library and where in the stack the failure occurs:

# Python elasticsearch-py
ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
RequestError: RequestError(408, 'request_timeout', 'Request timed out')

# Java High-Level REST Client
org.elasticsearch.client.ResponseException: method [POST], host [...], status line [HTTP/1.1 408 Request Timeout]
java.net.SocketTimeoutException: Read timed out

# curl
curl: (28) Operation timed out after 30000 milliseconds

# Kibana / Elasticsearch logs
[o.e.s.SearchService] [node-1] Search request timed out: [...] took [30001ms], timeout [30000ms]

These errors map to distinct failure modes that require different remediation paths.


Step 1: Identify the Timeout Category

Connection timeout — The TCP handshake never completed. The cluster is unreachable, a load balancer dropped the connection, or the node is down.

Socket / read timeout — The connection was established but the server did not send back a complete response before the deadline. This is the most common production scenario.

Bulk reject / queue full — The thread pool queue is full; the node returned a 429 Too Many Requests or an implicit timeout because threads were unavailable.
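If you are handling these failures in application code, the triage above can be sketched as a small classifier. A minimal illustration (the function name and category labels are our own, not an Elasticsearch API):

```python
def classify_timeout(status=None, message=""):
    """Map an HTTP status and/or exception message to a timeout category."""
    msg = message.lower()
    if status == 429 or "rejected" in msg or "queue is full" in msg:
        return "bulk_reject"         # thread pool queue full
    if "connection refused" in msg or ("connect" in msg and "timed out" in msg):
        return "connection_timeout"  # TCP handshake never completed
    if status == 408 or "timed out" in msg:
        return "read_timeout"        # request accepted but no response in time
    return "unknown"
```

Routing errors through a classifier like this keeps the remediation paths below from being applied to the wrong failure mode.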

Run these diagnostic commands first to triage:

# Check cluster health
curl -s 'http://localhost:9200/_cluster/health?pretty'

# Inspect thread pool rejections (look for 'rejected' > 0)
curl -s 'http://localhost:9200/_cat/thread_pool?v&h=name,active,queue,rejected,completed'

# Show hot threads on all nodes (identifies CPU-bound operations)
curl -s 'http://localhost:9200/_nodes/hot_threads'

# Check for GC pressure in the main log (JvmGcMonitorService "overhead" lines)
grep -i 'gc.*overhead' /var/log/elasticsearch/elasticsearch.log | tail -20

# Show cumulative query/fetch time per index to spot hot indices
curl -s 'http://localhost:9200/_cat/indices?v&h=index,search.fetch_time,search.query_time&time=ms'

Step 2: Fix Connection Timeouts

If the cluster health endpoint itself times out, the node is unreachable. Verify network connectivity and check whether the node process is alive:

# Verify the process is running
systemctl status elasticsearch

# Check bound address and port
curl -s 'http://localhost:9200'

# Test from application host (replace ES_HOST)
telnet ES_HOST 9200
nc -zv ES_HOST 9200

# Check firewall / security group rules
iptables -L -n | grep 9200

If nodes are reachable but the client reports connection timeouts, raise the connect and read timeouts in your client configuration:

# Python elasticsearch-py v8
from elasticsearch import Elasticsearch
es = Elasticsearch(
    "http://localhost:9200",
    request_timeout=60,       # socket read timeout
    connections_per_node=10,
)

// Java REST Client
RestClientBuilder builder = RestClient.builder(new HttpHost("localhost", 9200))
    .setRequestConfigCallback(requestConfigBuilder ->
        requestConfigBuilder
            .setConnectTimeout(5000)   // 5 s connect
            .setSocketTimeout(60000)); // 60 s read
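Connection-level timeouts are often transient (a load balancer recycling a node, a brief network blip), so client code typically wraps calls in bounded retries with exponential backoff. A hedged sketch, with `fn` standing in for any client call:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, retry_on=(TimeoutError,)):
    """Call fn(), retrying on the given exceptions with exponential backoff;
    re-raise after the final attempt so the caller still sees the failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 1 s, 2 s, 4 s, ...
```

Usage would look like `with_retries(lambda: es.search(index="my-index", query=q))`, adjusting `retry_on` to the exception types your client library actually raises.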

Step 3: Fix Search / Query Timeouts

Per-request timeout — Pass timeout in the request body. This is the safest option because it only affects the current query; when the deadline is reached, Elasticsearch returns whatever partial results it has collected, with "timed_out": true in the response:

POST /my-index/_search
{
  "timeout": "30s",
  "query": { "match_all": {} }
}
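Because a timed-out search still returns HTTP 200 with partial results, client code should check the `timed_out` flag explicitly. A minimal guard (helper name is illustrative; the response fields follow the standard search API shape):

```python
def assert_complete(resp):
    """Raise if a search hit its timeout and returned partial results."""
    if resp.get("timed_out"):
        shards = resp.get("_shards", {})
        raise RuntimeError(
            f"search timed out: {shards.get('successful', '?')}/"
            f"{shards.get('total', '?')} shards responded"
        )
    return resp
```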

Cluster-wide default — Set a global deadline for all searches. Queries that exceed it return partial results rather than an error:

curl -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{
    "transient": {
      "search.default_search_timeout": "30s"
    }
  }'

Slow query analysis — Enable the slow log to find which queries are responsible:

curl -X PUT 'http://localhost:9200/my-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{
    "index.search.slowlog.threshold.query.warn": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.search.slowlog.level": "warn"
  }'

Then tail /var/log/elasticsearch/*_index_search_slowlog.log.
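Once the slow log is populated, the `took_millis[...]` field can be extracted to rank offenders. A rough parser sketch; the exact log layout varies across Elasticsearch versions, so treat the regex as a starting point to adapt:

```python
import re

# Matches the "[index][shard]" pair and the took_millis field in a
# search slow-log line; adjust for your ES version's log format.
SLOWLOG_RE = re.compile(r"\[([^\]]+)\]\[(\d+)\].*?took_millis\[(\d+)\]")

def slowest_queries(lines, top=5):
    """Return the top-N slow-log entries as (took_ms, index, shard) tuples."""
    hits = []
    for line in lines:
        m = SLOWLOG_RE.search(line)
        if m:
            hits.append((int(m.group(3)), m.group(1), int(m.group(2))))
    return sorted(hits, reverse=True)[:top]
```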


Step 4: Fix Thread Pool Exhaustion

If _cat/thread_pool shows rejected counts rising, the search or write thread pool is saturated. The safe fix is to reduce query complexity or add nodes. As a short-term measure you can increase the queue size (not the thread count, which is CPU-bound):

# elasticsearch.yml — increase search queue depth
thread_pool.search.queue_size: 2000
thread_pool.write.queue_size: 1000

Thread pool sizes and queue depths are static node settings, so a restart is required for the change to take effect; in production, apply it node by node as a rolling restart.
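The `_cat/thread_pool` output is plain columnar text, so detecting rejections can be automated for alerting. A small parser sketch over the header-plus-rows format produced by the `?v&h=...` query shown earlier:

```python
def pools_with_rejections(cat_output):
    """Parse `_cat/thread_pool?v&h=name,active,queue,rejected` text output
    and return (pool_name, rejected_count) for pools with rejections."""
    lines = cat_output.strip().splitlines()
    idx = {col: i for i, col in enumerate(lines[0].split())}  # header row
    bad = []
    for line in lines[1:]:
        cols = line.split()
        rejected = int(cols[idx["rejected"]])
        if rejected > 0:
            bad.append((cols[idx["name"]], rejected))
    return bad
```

Feeding this the curl output from Step 1 gives a machine-checkable signal for the "rejected > 0" alert suggested in Step 7.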


Step 5: Fix JVM / GC-Induced Timeouts

Long GC pauses cause the JVM to stop all threads, making the node appear unresponsive. Signs: [gc][12345] overhead, spent [8.5s] collecting in the last [10s] in logs.

  1. Heap must be ≤ 50% of RAM and never exceed 32 GB (compressed OOP limit).
  2. Use G1GC, the default with the bundled JDK since ES 7; set JVM flags in jvm.options (or the ES_JAVA_OPTS environment variable).
  3. Reduce field data cache if fielddata circuit breaker trips frequently:
curl -X PUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{
    "persistent": {
      "indices.breaker.fielddata.limit": "40%",
      "indices.breaker.request.limit": "40%",
      "indices.breaker.total.limit": "70%"
    }
  }'
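The heap rule of thumb above (half of RAM, below the compressed-oops threshold) is easy to encode when provisioning nodes. A trivial helper, using 31 GB as a conservative cap:

```python
def recommended_heap_gb(ram_gb):
    """Half of system RAM, capped at 31 GB to stay under the ~32 GB
    compressed-oops limit mentioned above."""
    return min(ram_gb // 2, 31)
```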

Step 6: Fix Bulk Indexing Timeouts

Bulk write timeouts (a bulk request exceeding its configured deadline) are usually caused by refresh pressure or oversized batches.

# Temporarily disable refresh during heavy indexing
curl -X PUT 'http://localhost:9200/my-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.refresh_interval": "-1"}'

# Restore after indexing
curl -X PUT 'http://localhost:9200/my-index/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index.refresh_interval": "1s"}'

Reduce bulk batch size to 5–15 MB per request and target 1,000–5,000 documents per batch as a starting point, then tune based on _nodes/stats/indices/indexing metrics.
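The batching guidance above can be enforced in the indexing client by chunking on both document count and serialized size. A sketch assuming documents arrive as pre-serialized JSON strings (adapt the bounds to your measured throughput):

```python
def chunk_bulk(docs, max_docs=5000, max_bytes=10 * 1024 * 1024):
    """Yield batches bounded by document count and total byte size,
    mirroring the 5-15 MB / 1,000-5,000 docs starting point above."""
    batch, size = [], 0
    for doc in docs:
        doc_bytes = len(doc.encode("utf-8"))
        # Start a new batch if adding this doc would breach either bound.
        if batch and (len(batch) >= max_docs or size + doc_bytes > max_bytes):
            yield batch
            batch, size = [], 0
        batch.append(doc)
        size += doc_bytes
    if batch:
        yield batch
```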


Step 7: Long-Term Prevention

  • Enable slow logs on all indices at warn threshold.
  • Alert on _cat/thread_pool rejections — any rejected > 0 per minute is a leading indicator.
  • Monitor GC pause time via _nodes/stats/jvm.
  • Use index lifecycle management (ILM) to force-merge and roll over hot indices, keeping shard counts healthy.
  • Distribute load with aliases pointing to multiple indices instead of querying one large index.

Diagnostic Script

The script below bundles the checks from this guide into a single triage report:
#!/usr/bin/env bash
# elasticsearch-timeout-diag.sh
# Run this script against a node to produce a triage report.

ES_HOST="${ES_HOST:-localhost}"
ES_PORT="${ES_PORT:-9200}"
BASE="http://${ES_HOST}:${ES_PORT}"

echo "=== Cluster Health ==="
curl -sf "${BASE}/_cluster/health?pretty" || echo "UNREACHABLE"

echo ""
echo "=== Thread Pool (look for rejected > 0) ==="
curl -sf "${BASE}/_cat/thread_pool?v&h=name,active,queue,rejected,completed,queue_size"

echo ""
echo "=== Node JVM GC Stats ==="
curl -sf "${BASE}/_nodes/stats/jvm?pretty" | \
  python3 -c "
import sys, json
d = json.load(sys.stdin)
for node, info in d['nodes'].items():
    gc = info['jvm']['gc']['collectors']
    name = info['name']
    for cname, cdata in gc.items():
        print(f'{name} | {cname} | count={cdata[\"collection_count\"]} | time_ms={cdata[\"collection_time_in_millis\"]}ms')
"

echo ""
echo "=== Hot Threads ==="
curl -sf "${BASE}/_nodes/hot_threads?threads=3"

echo ""
echo "=== Pending Tasks ==="
curl -sf "${BASE}/_cluster/pending_tasks?pretty"

echo ""
echo "=== Slow Indices (search query time > 60s cumulative) ==="
curl -sf "${BASE}/_cat/indices?v&h=index,search.query_time,search.query_total,search.fetch_time" | \
  awk 'NR==1 || $2 > 60000'

echo ""
echo "=== Circuit Breaker Status ==="
curl -sf "${BASE}/_nodes/stats/breaker?pretty" | \
  python3 -c "
import sys, json
d = json.load(sys.stdin)
for node, info in d['nodes'].items():
    for bname, bdata in info['breakers'].items():
        print(f\"{info['name']} | {bname} | used={bdata['estimated_size']} / {bdata['limit_size']} | tripped={bdata['tripped']}\")
"

echo ""
echo "=== Suggested Fixes ==="
REJECTED=$(curl -sf "${BASE}/_cat/thread_pool?h=rejected" | awk '{s+=$1} END{print s}')
if [ "${REJECTED}" -gt 0 ] 2>/dev/null; then
  echo "[!] Thread pool rejections detected (${REJECTED} total). Consider increasing queue_size or adding nodes."
fi
TRIPPED=$(curl -sf "${BASE}/_nodes/stats/breaker?pretty" | grep -c '"tripped" : [^0]')
if [ "${TRIPPED}" -gt 0 ] 2>/dev/null; then
  echo "[!] Circuit breaker has tripped. Review heap usage and breaker limits."
fi
echo "Done."

Error Medic Editorial

The Error Medic Editorial team is composed of senior DevOps engineers and SREs with experience operating large-scale Elasticsearch clusters in production. Our guides are derived from real incident postmortems and focus on actionable, command-level troubleshooting over theoretical explanations.
