Error Medic

Troubleshooting Elasticsearch API Timeouts: Fixing ReadTimeoutError and es_rejected_execution_exception

Diagnose and resolve Elasticsearch API timeout errors. Learn how to optimize queries, tune thread pools, fix GC pauses, and stabilize your cluster.

Key Takeaways
  • Unoptimized queries, such as deep pagination or massive aggregations on high-cardinality fields, are the most common cause of API timeouts.
  • High JVM Heap pressure resulting in 'Stop-The-World' Garbage Collection (GC) pauses can cause nodes to become unresponsive, triggering timeout exceptions in clients.
  • Thread pool rejections (specifically the 'search' and 'write' thread pools) indicate your cluster is overwhelmed and actively dropping requests.
  • Mismatch between Elasticsearch client timeouts, Load Balancer (e.g., AWS ALB, Nginx) timeouts, and actual query execution time often masks the true bottleneck.
Approaches to Resolving Elasticsearch Timeouts
Method | When to Use | Time to Implement | Risk Level
Increase Client Timeout | Immediate mitigation for intermittent spikes while investigating root causes. | Minutes | High (can mask underlying cluster instability and exhaust application threads)
Kill Long-Running Tasks | Emergency intervention when a rogue query (e.g., a heavy wildcard) is locking up the cluster. | Minutes | Low (only affects the canceled query; saves the cluster)
Optimize Queries & Pagination | Long-term fix for slow performance; transitioning from 'from/size' to 'search_after'. | Days/Weeks | Low (improves overall cluster health and application responsiveness)
Scale Out Data Nodes / Heap | When CPU/memory utilization is consistently near 100% despite query optimization. | Hours/Days | Medium (requires budget, infrastructure changes, and node rebalancing)

Understanding Elasticsearch API Timeouts

As a DevOps engineer or SRE, encountering an Elasticsearch API timeout is a stressful but common rite of passage. These timeouts rarely point to a single isolated failure; instead, they are usually a symptom of a broader systemic issue—resource exhaustion, unoptimized application logic, or architectural bottlenecks.

When an application attempts to interact with an Elasticsearch cluster, it relies on an HTTP client (e.g., Python's elasticsearch-py, Node.js @elastic/elasticsearch, or Java's RestHighLevelClient). If the cluster fails to return a response within the configured timeout, the client severs the connection and throws an exception.

You will typically see error messages in your application logs resembling the following:

  • Python: elasticsearch.exceptions.ConnectionTimeout: ConnectionTimeout caused by - ReadTimeoutError(HTTPConnectionPool(host='es-cluster.internal', port=9200): Read timed out. (read timeout=10))
  • Java: java.net.SocketTimeoutException: 30,000 milliseconds timeout on connection http-outgoing-0
  • Node.js: TimeoutError: Request timed out
  • Elasticsearch Logs: es_rejected_execution_exception: rejected execution of org.elasticsearch.transport.TransportService

To permanently resolve these issues, we must shift our focus from the application's client settings to the underlying cluster health and query performance. Let's break down the diagnostic process and the structural fixes required to stabilize your Elasticsearch infrastructure.

Step 1: Diagnose the Root Cause

Before changing configurations or scaling infrastructure, you must identify why the timeouts are occurring. Elasticsearch provides extensive APIs for introspection.

1. Check for Long-Running Tasks

If timeouts suddenly spike, a rogue query might be consuming all cluster resources. A common culprit is a massive aggregation or a deeply nested query executed by a data scientist or a poorly optimized microservice.

You can inspect currently executing tasks using the _cat/tasks API:

curl -X GET "localhost:9200/_cat/tasks?v&detailed=true" | grep "search"

For a more programmatic approach, use the Task Management API to find queries running longer than a specific threshold (e.g., 10 seconds):

curl -X GET "localhost:9200/_tasks?detailed=true&actions=*search*" | jq '.nodes[].tasks[] | select(.running_time_in_nanos > 10000000000)'
2. Analyze Thread Pool Rejections

Elasticsearch uses distinct thread pools to manage different types of operations (e.g., search, write, get). When a node receives more requests than it can process, requests are placed in a queue. If the queue fills up, Elasticsearch rejects the request, throwing an es_rejected_execution_exception. This almost always translates to an API timeout or 503 error on the client side.

Check your thread pool stats:

curl -X GET "localhost:9200/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected,completed"

A continuously incrementing rejected count on the search thread pool indicates that your queries are too slow, or your request volume exceeds the cluster's concurrency limits.
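Because rejected is a cumulative counter since node restart, the absolute number matters less than whether it is still growing. A minimal sketch that takes two samples and reports any delta (ES_HOST and the 10-second window are assumptions):

```shell
#!/bin/bash
# Compare two thread-pool snapshots and report counters that moved.
# ES_HOST and the sample interval are assumptions; adjust as needed.
ES_HOST="${ES_HOST:-http://localhost:9200}"

snapshot() {
  curl -s "${ES_HOST}/_cat/thread_pool/search,write?h=node_name,name,rejected" | sort
}

# Print lines from the second snapshot that differ from the first,
# i.e. node/pool pairs whose rejected counter has grown.
new_rejections() {
  diff <(echo "$1") <(echo "$2") | grep '^>' | sed 's/^> //'
}

before="$(snapshot)"
sleep 10
after="$(snapshot)"
delta="$(new_rejections "$before" "$after")"
if [ -n "$delta" ]; then
  echo "Rejections grew during the sample window:"
  echo "$delta"
else
  echo "No new rejections in the last 10s."
fi
```

If the delta is non-empty under normal traffic, the cluster is actively shedding load and client timeouts will follow.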

3. Monitor JVM Heap and GC Pauses

Elasticsearch runs on the Java Virtual Machine (JVM). As it processes queries, it allocates objects in the Heap. If the heap fills up, the JVM triggers a Garbage Collection (GC). Minor GCs are fast, but a major "Stop-The-World" GC pauses all application threads. If a node is paused for 15 seconds doing garbage collection, any client waiting for a response from that node will time out.

Examine the node stats for GC metrics:

curl -X GET "localhost:9200/_nodes/stats/jvm?pretty"

Look for high heap_used_percent (consistently over 85%) and long GC durations. If you see frequent long GC pauses, you have a memory pressure problem, likely caused by Fielddata usage, large aggregations, or deeply nested documents.
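The pretty-printed node stats are verbose; a small helper can reduce them to one line per node. A sketch, assuming jq is installed and a cluster is reachable at ES_HOST (field names follow the standard nodes-stats schema):

```shell
#!/bin/bash
# Summarize heap pressure and old-generation GC totals per node.
# ES_HOST is an assumption; requires jq.
ES_HOST="${ES_HOST:-http://localhost:9200}"

summarize_jvm() {
  # Columns: node name, heap %, old-gen GC count, cumulative old-gen GC time (ms)
  jq -r '.nodes[] | [
      .name,
      (.jvm.mem.heap_used_percent | tostring),
      (.jvm.gc.collectors.old.collection_count | tostring),
      (.jvm.gc.collectors.old.collection_time_in_millis | tostring)
    ] | join(" ")'
}

curl -s "${ES_HOST}/_nodes/stats/jvm" | summarize_jvm
```

A node showing 90+ heap percent alongside a rapidly climbing old-gen collection time is the classic GC-pause profile behind client-side ReadTimeoutError.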

4. Enable and Review Slow Logs

If resource utilization looks normal but clients still time out, individual queries are likely the problem. Enable the Search Slow Log to identify queries that take longer than your application's timeout threshold.

Dynamically update your index settings to log slow queries:

PUT /my-index-000001/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}

Review the logs in /var/log/elasticsearch/my-cluster_index_search_slowlog.log to find the exact JSON bodies of the offending queries.

Step 2: Implement Fixes

Once you have identified the bottleneck, you can apply targeted remediation strategies.

Immediate Mitigation: Canceling Rogue Tasks

If a specific query is bringing down the cluster, you can forcefully cancel it using the Task API.

Find the Task ID from the diagnostic steps above, then issue a cancel request:

curl -X POST "localhost:9200/_tasks/node_id:task_id/_cancel"
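The find and cancel steps can be combined: the task keys in the _tasks response are already in the node_id:task_number form the cancel endpoint expects. A sketch, assuming jq and ES_HOST, with a 10-second threshold as an illustrative cutoff — review before running, since this cancels live queries:

```shell
#!/bin/bash
# Find search tasks running longer than THRESHOLD_NS and cancel them.
# ES_HOST and the threshold are assumptions.
ES_HOST="${ES_HOST:-http://localhost:9200}"
THRESHOLD_NS=10000000000   # 10 seconds in nanoseconds

# Task keys in the _tasks response are already node_id:task_number.
long_task_ids() {
  jq -r --argjson t "$THRESHOLD_NS" \
    '.nodes[]?.tasks | to_entries[]
     | select(.value.running_time_in_nanos > $t) | .key'
}

curl -s "${ES_HOST}/_tasks?detailed=true&actions=*search*" | long_task_ids |
while read -r task_id; do
  echo "Cancelling ${task_id}"
  curl -s -X POST "${ES_HOST}/_tasks/${task_id}/_cancel" > /dev/null
done
```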
Query Optimization: Fixing Deep Pagination

A classic cause of API timeouts is "deep pagination" using the from and size parameters. If you request from: 10000 and size: 10, the coordinating node must fetch 10,010 documents from every shard, sort them all in memory, and then discard 10,000 of them to return the final 10. This requires massive CPU and memory overhead.

The Fix: Transition to search_after or the Point in Time (PIT) API for deep scrolling.

GET /my-index/_search
{
  "size": 10,
  "query": { "match": { "status": "active" } },
  "sort": [
    {"timestamp": "desc"},
    {"_id": "asc"}
  ],
  "search_after": [
    1629837493000,
    "doc-123"
  ]
}
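In practice, each page's last sort values seed the next request. A shell sketch of that loop (my-index, the match_all query, and ES_HOST are illustrative assumptions; requires jq):

```shell
#!/bin/bash
# Page through an index with search_after instead of from/size.
# my-index, the query, and ES_HOST are illustrative assumptions.
ES_HOST="${ES_HOST:-http://localhost:9200}"

search_after='null'
for page in 1 2 3; do   # fixed page count for the sketch
  if [ "$search_after" = "null" ]; then
    body='{"size":10,"query":{"match_all":{}},"sort":[{"timestamp":"desc"},{"_id":"asc"}]}'
  else
    body='{"size":10,"query":{"match_all":{}},"sort":[{"timestamp":"desc"},{"_id":"asc"}],"search_after":'"$search_after"'}'
  fi
  resp="$(curl -s -H 'Content-Type: application/json' \
    -X GET "${ES_HOST}/my-index/_search" -d "$body" || true)"
  # Sort values of the last hit seed the next page; an empty page ends the loop.
  search_after="$(echo "$resp" | jq -c '.hits.hits[-1].sort // empty')"
  [ -z "$search_after" ] && break
  echo "page ${page}: search_after=${search_after}"
done
```

Unlike from/size, each page costs the same no matter how deep you go, because shards only ever return documents sorting after the supplied values.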
Bounding Query Execution Time

By default, an Elasticsearch query will run until it completes, even if the client has already timed out and closed the connection. This leads to "zombie queries" consuming resources for no reason.

You should enforce a server-side timeout on all expensive queries by appending the timeout parameter to your request body. This tells Elasticsearch to return whatever partial results it has gathered, or abort the query, freeing up the thread pool.

GET /my-index/_search
{
  "timeout": "5s",
  "query": { ... }
}
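Note that hitting this server-side timeout does not produce an error: Elasticsearch returns HTTP 200 with "timed_out": true and whatever partial hits it gathered, so the caller must check that flag. A sketch of that check (ES_HOST and my-index are assumptions; requires jq):

```shell
#!/bin/bash
# Run a bounded query and detect partial results via the timed_out flag.
# ES_HOST and my-index are assumptions.
ES_HOST="${ES_HOST:-http://localhost:9200}"

resp="$(curl -s -H 'Content-Type: application/json' \
  -X GET "${ES_HOST}/my-index/_search" \
  -d '{"timeout":"5s","query":{"match_all":{}}}' || true)"

# jq -e derives its exit status from the boolean, so this works in an if.
if echo "$resp" | jq -e '.timed_out == true' > /dev/null; then
  echo "partial results: server-side timeout hit"
else
  echo "complete results"
fi
```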
Resolving High Heap Pressure and Thread Pool Rejections

If timeouts are caused by resource exhaustion (GC pauses or thread rejections), you must reduce the load on your nodes or scale the cluster.

  1. Reduce Shard Count: Too many small shards (the "over-sharding" problem) consume heap memory for cluster state and Lucene segments. Aim for shard sizes between 30GB and 50GB. Use the _shrink API or Index Lifecycle Management (ILM) to consolidate small indices.
  2. Optimize Mappings: Avoid using the text field type for exact match filtering or aggregations; use keyword instead. Disable dynamic mapping to prevent accidental explosion of mapped fields.
  3. Scale Out: If optimizations are exhausted and CPU/Heap remains pegged at 90%+, you must add more data nodes to the cluster. Elasticsearch scales horizontally very well. Adding nodes distributes the primary and replica shards, reducing the memory and CPU burden on any single JVM.
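The shrink step in point 1 takes a few preparatory calls: relocate a copy of every shard to one node, block writes, then shrink. A sketch of that sequence — logs-old (source), logs-old-shrunk (target), and shrink-node-1 are illustrative names, not drop-in values:

```shell
#!/bin/bash
# Consolidate an over-sharded index with the _shrink API.
# Index and node names below are illustrative assumptions.
ES_HOST="${ES_HOST:-http://localhost:9200}"

# 1. Move one copy of every shard to a single node and block writes.
curl -s -X PUT "${ES_HOST}/logs-old/_settings" \
  -H 'Content-Type: application/json' -d '{
    "index.routing.allocation.require._name": "shrink-node-1",
    "index.blocks.write": true
  }'

# 2. Wait until relocation finishes.
curl -s "${ES_HOST}/_cluster/health/logs-old?wait_for_no_relocating_shards=true&timeout=60s" > /dev/null

# 3. Shrink to a single primary shard.
curl -s -X POST "${ES_HOST}/logs-old/_shrink/logs-old-shrunk" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"index.number_of_shards": 1}}'
```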
Aligning Client and Proxy Timeouts

Finally, ensure your timeout stack is logically configured. If your application expects a response in 10 seconds, but your Nginx reverse proxy times out in 5 seconds, you will see 504 Gateway Timeouts before Elasticsearch even finishes processing.

A standard best practice is:

  • Elasticsearch Query Timeout (timeout in JSON): 8 seconds
  • Application Client Read Timeout: 10 seconds
  • Load Balancer / Proxy Timeout: 15 seconds

This cascading setup ensures that if a query runs too long, Elasticsearch terminates it gracefully, rather than the proxy ruthlessly severing the connection while the database continues to churn in the background.
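From the shell, curl's --max-time plays the role of the application client's read timeout, one layer above the server-side body timeout. A sketch of the cascade (ES_HOST and my-index are assumptions), plus a trivial sanity check on the ordering:

```shell
#!/bin/bash
# Client timeout (10s) deliberately sits above the server-side
# query timeout (8s), so Elasticsearch can terminate gracefully.
ES_HOST="${ES_HOST:-http://localhost:9200}"

curl -s --max-time 10 \
  -H 'Content-Type: application/json' \
  -X GET "${ES_HOST}/my-index/_search" \
  -d '{"timeout":"8s","query":{"match_all":{}}}' || true
# curl exits 28 if its 10s client timeout fires first; that means the
# layers below never got the chance to time out cleanly.

# Sanity-check the ordering: query timeout < client timeout < proxy timeout.
check_cascade() {
  [ "$1" -lt "$2" ] && [ "$2" -lt "$3" ]
}
check_cascade 8 10 15 && echo "timeout cascade OK"
```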

Putting It Together: A Combined Diagnostic Script

The diagnostic steps above can be run in sequence with a single script:

#!/bin/bash
# Elasticsearch Diagnostic Script: Identify Timeout Root Causes

ES_HOST="http://localhost:9200"

echo "=== 1. Checking Cluster Health ==="
curl -s -X GET "${ES_HOST}/_cluster/health?pretty"

echo -e "\n=== 2. Checking for Thread Pool Rejections (Search & Write) ==="
curl -s -X GET "${ES_HOST}/_cat/thread_pool/search,write?v&h=node_name,name,active,queue,rejected,completed"

echo -e "\n=== 3. Finding Tasks Running Longer Than 10 Seconds ==="
curl -s -X GET "${ES_HOST}/_tasks?detailed=true&actions=*search*" | \
  jq '.nodes[]?.tasks[]? | select(.running_time_in_nanos > 10000000000) | {node, action, running_time_seconds: (.running_time_in_nanos / 1000000000), description}'

echo -e "\n=== 4. Checking JVM Heap Pressure (Look for heap_used_percent > 85%) ==="
curl -s -X GET "${ES_HOST}/_nodes/stats/jvm?pretty" | grep "heap_used_percent"

# To cancel a rogue task, uncomment and replace TASK_ID:
# curl -X POST "${ES_HOST}/_tasks/node_id:task_id/_cancel"

Error Medic Editorial

The Error Medic Editorial team consists of senior SREs and DevOps engineers dedicated to providing actionable, code-first troubleshooting guides for distributed systems, databases, and cloud infrastructure.
