Overview
Slow queries are often caused by:
- Poorly written or expensive search queries.
- Poorly configured Elasticsearch clusters or indices.
- Saturated CPU, memory, disk, or network resources on the cluster.
- Periodic background processes, such as snapshots or segment merges, that consume cluster resources (CPU, memory, disk), leaving fewer resources available for search queries and causing them to run slowly.
- Segment merging reduces the number of segments and thereby improves search latency; however, merges can be expensive to perform, especially in environments with limited I/O.
As mentioned above, there are several potential reasons for slow queries, but in search-heavy systems the main causes are usually expensive search queries or a poorly configured Elasticsearch cluster or index. Effective use of search slow logs can dramatically reduce debugging and troubleshooting time.
Troubleshooting guide on how to use slow logs effectively
- Always define proper slow-log thresholds for your application's indices, with different log levels to make debugging faster. For example, more than 20ms is a good cutoff for TRACE logging, while more than 250ms should be logged as WARN. These thresholds suit real-time systems like e-commerce search and should be tuned based on the application's SLA (see the configuration sketch after this list).
- A search has two phases: the query phase and the fetch phase (more details can be found in Elasticsearch search explained). It's important to understand how these phases work and to set a proper threshold for each one.
- Slow logs are always specific to a shard, and this is where most people go wrong: they look at a slow-query log entry without understanding the full picture. More information on how Elasticsearch shards affect search performance can be found in the search latency troubleshooting guide.
- A slow-log entry includes the search type it was executed with. The default is query then fetch, but it can be DFS query then fetch, which provides better scoring at the cost of performance, so it's important to look out for these (an example request follows the list).
- Adding up the response times that the relevant shards report in their slow logs does not give the overall search time, because it excludes the time spent gathering results from all shards and fetching the top documents (the fetch phase).
- To solve the issue mentioned in the previous point, always have a trace log in your application that tracks the "took" field of the Elasticsearch response. This is the total time taken by a single Elasticsearch query across all of its relevant shards, including fetching the top results from them. The "took" field is therefore the correct indicator of a query's total time, covering the fan-out to all relevant shards as well as gathering and combining their results (a logging sketch follows the list).
- If you are dealing with a multi-tenant Elasticsearch cluster that hosts multiple indices, checking the slow logs of just one problematic index won't be sufficient, as slow queries on that index are sometimes caused by heavy searches on other indices. It's therefore better to look at slow searches across the entire cluster (or at least across the big indices) when issues start.
- Some examples of heavy search queries are regexp queries, prefix queries, heavy aggregations, match_all, queries with a very large size parameter, and deep-pagination queries. Filter the search slow logs for these patterns and see how such queries are performing (see the filtering sketch below).
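To make the threshold advice concrete, here is a minimal sketch of setting per-phase slow-log thresholds through the index settings API, using Python with the requests library. The host, index name, and exact threshold values are assumptions; tune them to your application's SLA.

```python
import requests

ES_HOST = "http://localhost:9200"   # assumption: local, unsecured cluster
INDEX = "products"                  # hypothetical index name

# Per-phase thresholds: the query phase and the fetch phase each get their
# own TRACE and WARN cutoffs (DEBUG and INFO levels are also available).
slowlog_settings = {
    "index.search.slowlog.threshold.query.trace": "20ms",
    "index.search.slowlog.threshold.query.warn": "250ms",
    "index.search.slowlog.threshold.fetch.trace": "20ms",
    "index.search.slowlog.threshold.fetch.warn": "250ms",
}

resp = requests.put(f"{ES_HOST}/{INDEX}/_settings", json=slowlog_settings)
resp.raise_for_status()
print(resp.json())   # {"acknowledged": true} on success
```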
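The search type mentioned above can be chosen per request via the search_type URL parameter. The sketch below contrasts the default query_then_fetch with dfs_query_then_fetch; the host, index name, and query are assumptions as before.

```python
import requests

ES_HOST = "http://localhost:9200"   # assumption: local, unsecured cluster
INDEX = "products"                  # hypothetical index name
query = {"query": {"match": {"title": "slow log"}}}

# Default search type: query_then_fetch.
default_resp = requests.post(f"{ES_HOST}/{INDEX}/_search", json=query)

# dfs_query_then_fetch first collects global term statistics for more accurate
# scoring, at the cost of an extra round trip to every shard.
dfs_resp = requests.post(
    f"{ES_HOST}/{INDEX}/_search",
    params={"search_type": "dfs_query_then_fetch"},
    json=query,
)
print(default_resp.json()["took"], dfs_resp.json()["took"])
```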
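For tracking the overall query time, here is a minimal sketch of logging the "took" field of the search response from application code. The host, index, query, and the 250ms WARN cutoff are illustrative assumptions, and the hits.total.value access assumes an Elasticsearch 7+ response format.

```python
import logging
import requests

logging.basicConfig(level=logging.DEBUG)
log = logging.getLogger("search")

ES_HOST = "http://localhost:9200"   # assumption: local, unsecured cluster
INDEX = "products"                  # hypothetical index name

query = {"query": {"match": {"title": "slow log"}}, "size": 10}
resp = requests.post(f"{ES_HOST}/{INDEX}/_search", json=query)
resp.raise_for_status()
body = resp.json()

# "took" is the total coordination time in milliseconds: fanning out to all
# relevant shards plus gathering and merging their results.
log.debug("search took=%sms hits=%s", body["took"], body["hits"]["total"]["value"])
if body["took"] > 250:   # mirrors the WARN threshold configured above
    log.warning("slow search: took=%sms query=%s", body["took"], query)
```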
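Finally, a rough sketch of filtering a slow-log file for the heavy query patterns listed above. The log path, the size/from cutoffs, and the assumption that the slow log contains the query source are all details to adapt to your own setup and log format.

```python
import re
from collections import Counter

# Hypothetical slow-log path; the actual location depends on your logging setup.
SLOWLOG_PATH = "/var/log/elasticsearch/my-cluster_index_search_slowlog.log"

# Query features that usually indicate expensive searches. The optional \\? and "?
# tolerate both plain and escaped JSON in the logged query source.
HEAVY_PATTERNS = {
    "regexp": re.compile(r"\bregexp\b"),
    "prefix": re.compile(r"\bprefix\b"),
    "aggregations": re.compile(r"\baggs\b|\baggregations\b"),
    "match_all": re.compile(r"\bmatch_all\b"),
    "large_size": re.compile(r'\bsize\\?"?\s*:\s*[1-9]\d{3,}'),        # size >= 1000
    "deep_pagination": re.compile(r'\bfrom\\?"?\s*:\s*[1-9]\d{3,}'),   # from >= 1000
}

counts = Counter()
with open(SLOWLOG_PATH) as fh:
    for line in fh:
        for name, pattern in HEAVY_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1

for name, count in counts.most_common():
    print(f"{name}: {count} slow-log entries")
```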