Introduction
How many times have you found yourself cluelessly staring at your struggling cluster because it suddenly became unresponsive? After making sure that this slow-down wasn’t caused by a massive data ingestion job hogging all your cluster resources, you might have started suspecting that the problem was a heavy, long-running search query. There are countless ways to write poor queries that can make your cluster suffer, but you can only fix the problem once you figure out which query is the source of your troubles.
In this article, we are going to review all the available options you can leverage to discover which queries are landing in your cluster: proxies, the task management API, audit logs, slow logs, and request tracing. Depending on your needs, some options might work better than others.
Proxies
An easy way to view which queries are being sent to your cluster is to add a proxy between your client and your Elasticsearch cluster and direct all your client requests to it. For instance, Nginx is a good candidate for the job, and there are plenty of online resources that explain how to record the request and response payloads into a log file that you can then inspect.
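As an illustration, a minimal sketch of such an Nginx setup could look like the snippet below; the listening port, log file path, and log format name are arbitrary choices here and would need to be adapted to your environment:

# hypothetical file: /etc/nginx/conf.d/es-proxy.conf
# log format capturing the request line and the request body
log_format es_queries '$remote_addr [$time_local] "$request" "$request_body"';

server {
    # clients send their requests to this port instead of hitting Elasticsearch directly
    listen 9201;

    location / {
        access_log /var/log/nginx/es-queries.log es_queries;
        # forward everything to the local Elasticsearch node
        proxy_pass http://localhost:9200;
    }
}

Keep in mind that `$request_body` is only populated when Nginx buffers the request body, which is the case for requests forwarded with `proxy_pass`.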
However, if you have many different components in your architecture that interact with your Elasticsearch cluster, it might not be easy or even feasible to funnel all the traffic through one or more proxies.
Moreover, if the prospect of adding one more hop into your request pipeline scares you because you don’t want to run the risk of increasing search latency, there are other solutions, which we are going to review next.
The task management API
Any single bit of work that Elasticsearch needs to execute is handled within a task. Depending on the amount of work to be carried out, a task can last anywhere from a few milliseconds to several seconds or minutes, and sometimes even hours for long reindexing jobs. A search query landing in Elasticsearch is first wrapped within a new transient search task and then executed. When the search is over and the results have been returned, the task is discarded.
While the search is running, it is possible to inspect the search task and the query it is currently executing using the task management API or the cat task management API (`_cat/tasks`), which provides a more condensed, tabular view of the ongoing tasks. If you are chasing a long-running query, the odds are that you can catch it by running one of those two APIs, as we can see in the code below. Note that, by default, the task management API returns all currently running tasks, but since we are only interested in search tasks, we filter the results using the `actions=*search` parameter.
GET _cat/tasks?v&h=timestamp,running_time,description&detailed&actions=*search

timestamp running_time description
07:22:35  53.2s        indices[my-index], search_type[QUERY_THEN_FETCH], source[{ "query": { "wildcard": { "first_name": { "value": "*a*" }}}}]
And there it is, our long-running search task has been identified! Now we can go into our code and tune it. Among other useful information, we can also see when the query started (i.e., the `timestamp` column) and for how long it has been running (i.e., the `running_time` column).
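If you prefer the full JSON output over the tabular one, the same filter works with the task management API itself, and the task ID returned in the response can even be used to cancel a runaway query. The node ID and task number in the sketch below are just placeholders:

GET _tasks?actions=*search&detailed

# cancel the offending task using the ID found in the response above
POST _tasks/oTUltX4IQMOUUVeiohTt8A:12345/_cancel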
The task management API is very handy if you are trying to identify long-running queries. However, if your goal is to find queries that are more short-lived, this approach becomes much harder: once the query has finished executing, the task is gone, and since nothing is logged or stored anywhere, there is no way to see what was running. Even if you check the running tasks regularly, you might miss your query. But don’t worry, there are other options.
Audit logs
The security auditing feature allows you to audit any security-related event happening in your cluster: authentication failures, authorization grants, security configuration changes, you name it! Some of those security events (e.g., `authentication_success`) can include the full HTTP request body. Since our goal is to log search queries, the first step is to enable audit logging, which is disabled by default.
At this point, it is worth noting that the audit log feature requires at least a Platinum license. However, you can still try this option by converting your basic license to a trial one, which gives you access to all features for 30 days.
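For the record, provided you haven’t used your trial yet, activating it is a single API call:

POST /_license/start_trial?acknowledge=true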
So, now let’s configure the audit logging feature to log our search queries in the security audit log file. You first need to modify the `elasticsearch.yml` configuration file on all your nodes, as shown below, and then restart them:
# enable audit logging
xpack.security.audit.enabled: true
# audit the authentication success event
xpack.security.audit.logfile.events.include: authentication_success
# enable the request body logging
xpack.security.audit.logfile.events.emit_request_body: true
Once your nodes are restarted, you will start seeing events being logged into the audit log file called `{cluster.name}_audit.json`. The next time a search query lands in your cluster, you’ll find an event such as the one shown below, containing the JSON-escaped search query in the `request.body` field:
{ "type": "audit", "timestamp": "2023-10-26T11:28:47,981+0200", "cluster.uuid": "8DcMshxkTYqKb4ypBpd3qQ", "node.id": "_a3fKiRqRIqOqHu9yKJSqg", "event.type": "rest", "event.action": "authentication_success", "authentication.type": "REALM", "user.name": "elastic", "user.realm": "reserved", "origin.type": "rest", "origin.address": "[::1]:59254", "realm": "reserved", "url.path": "/my-index/_search", "request.method": "GET", "request.body": "{ \"query\": { \"wildcard\": { \"first_name\": { \"value\": \"*a*\" }}}}", "request.id": "VPsvX6u_SxSfjUYZ7_RsCQ" }
That works out pretty well. However, if you don’t have any other security-related needs that require a license (e.g., SSO, Active Directory authentication, etc.), you shouldn’t have to buy one just for logging search queries. As we will see next, there are free features that allow you to do the same.
Slow logs
The next option is to turn on slow logs. When enabled, slow logs allow you to configure custom time thresholds to log slow searches at different log levels. For instance, in the code below, we configure Elasticsearch to log queries executing for more than ten seconds at the WARN level, queries executing for more than five seconds at the INFO level, and so on. It is also possible to set different thresholds for the query and fetch phases.
PUT /my-index/_settings { "index.search.slowlog.threshold.query.warn": "10s", "index.search.slowlog.threshold.query.info": "5s", "index.search.slowlog.threshold.query.debug": "2s", "index.search.slowlog.threshold.query.trace": "500ms", "index.search.slowlog.threshold.fetch.warn": "1s", "index.search.slowlog.threshold.fetch.info": "800ms", "index.search.slowlog.threshold.fetch.debug": "500ms", "index.search.slowlog.threshold.fetch.trace": "200ms" }
A useful trick to log all queries flowing into your cluster is to set the TRACE threshold for queries to zero milliseconds, so that every search query gets logged no matter how long it takes to execute. This can be achieved with the code below:
PUT /my-index/_settings { "index.search.slowlog.threshold.query.trace": "0ms" }
After running the above command, the next query that is going to be executed against `my-index` will be logged in the file called `logs/{cluster.name}_index_search_slowlog.json`, as shown in the code below:
{ "@timestamp": "2023-10-25T09:23:29.785Z", "log.level": "TRACE", "elasticsearch.slowlog.message": "[my-index][0]", "elasticsearch.slowlog.search_type": "QUERY_THEN_FETCH", "elasticsearch.slowlog.source": "{\\\"query\\\":{\\\"wildcard\\\":{\\\"first_name\\\":{\\\"wildcard\\\":\\\"*a*\\\",\\\"boost\\\":1.0}}}}", "elasticsearch.slowlog.stats": "[]", "elasticsearch.slowlog.took": "11034ms", "elasticsearch.slowlog.took_millis": 11034, "elasticsearch.slowlog.total_hits": "54 hits", "elasticsearch.slowlog.total_shards": 1, "ecs.version": "1.2.0", "service.name": "ES_ECS", "event.dataset": "elasticsearch.index_search_slowlog", "process.thread.name": "elasticsearch[iMac.local][search][T#5]", "log.logger": "index.search.slowlog.query", "trace.id": "0038f0681472a209d3b15bf2216207fa", "elasticsearch.cluster.uuid": "8DcMshxkTYqKb4ypBpd3qQ", "elasticsearch.node.id": "_a3fKiRqRIqOqHu9yKJSqg", "elasticsearch.node.name": "iMac.local", "elasticsearch.cluster.name": "elasticsearch" }
The JSON-escaped query can be found in the `elasticsearch.slowlog.source` field, and we can see that a bad wildcard query with a leading wildcard character is the cause of the slowness. Now that we have identified it, we can fix it, e.g., by using the free Opster Search Log Analyzer or the Search Profiler available inside Kibana.
It is worth noting that slow logs are logged at the shard level, so if your index is partitioned into more than one primary shard, you’ll see the query being logged once per shard, which might create a lot of noise. For that reason, you should be very careful when enabling slow logs in production, as they can generate a huge volume of logs depending on your cluster load and the number of indices and shards you have. Slow logs should be enabled temporarily to find the queries you’re searching for and disabled as soon as you have identified them.
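Turning slow logs off again is just a matter of resetting the thresholds; the sketch below resets the settings we changed earlier on `my-index` back to their defaults by setting them to `null`:

PUT /my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": null,
  "index.search.slowlog.threshold.query.info": null,
  "index.search.slowlog.threshold.query.debug": null,
  "index.search.slowlog.threshold.query.trace": null,
  "index.search.slowlog.threshold.fetch.warn": null,
  "index.search.slowlog.threshold.fetch.info": null,
  "index.search.slowlog.threshold.fetch.debug": null,
  "index.search.slowlog.threshold.fetch.trace": null
}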
So, let’s quickly recap all the options we have looked at so far. While easy to set up and configure, proxies increase the network distance between your client and your cluster by introducing another hop. The task management API only allows you to catch long-running queries, not short-lived ones. Moreover, it does not keep track of executed queries. Slow logs can be configured to log all queries, whether short or long, but each query will be logged at the shard level, which can induce a lot of noise if your queries are running over a lot of shards or if you are experiencing heavy traffic.
It would be absolutely fantastic if we could log any search query, but only once regardless of the number of shards, and ideally be very selective about what we want to log so as not to produce too much noise. Let’s head to the next option: request tracing.
Request tracing
All the previous options are useful and can greatly help in a multitude of scenarios, but request tracing, available since ES 7.12, is a killer feature because it allows you to literally log anything happening on the HTTP layer. That means any calls made by anyone to any REST endpoint! Since we are only interested in search queries, we are going to focus on the `_search` API endpoint.
Configuring the HTTP tracer to log search queries is as easy as pie, as can be seen in the code below, where we instruct the tracer to only log requests made to the `/my-index/_search*` API endpoint:
PUT _cluster/settings { "persistent" : { "logger.org.elasticsearch.http.HttpTracer" : "TRACE", "http.tracer.include" : "/my-index/_search*", "http.tracer.exclude" : "" } }
As soon as the command is run, request tracing is enabled and you will see your requests and responses being logged, as shown below:
[1350008][x-opaque-id][POST][/my-index/_search] received request from [Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:49923}] trace.id: 558ebc7c375dbea44635976ea440558c
set system property [es.insecure_network_trace_enabled] to [true] to enable HTTP body tracing
[1350008][x-opaque-id][OK][application/vnd.elasticsearch+json;compatible-with=8][1328] sent response to [Netty4HttpChannel{localAddress=/127.0.0.1:9200, remoteAddress=/127.0.0.1:49923}] success [true]
By default, the HTTP tracer logs a summary of any incoming request (line 1) and outgoing response (line 3). The summary includes a monotonically increasing request ID (i.e., 1350008) that can be used to correlate the request with its response. The value of the “X-Opaque-Id” HTTP header (if provided) also allows you to identify which application is sending the query (see the example after this paragraph). Then, we can see the HTTP method as well as the request path, which tells us which index is being queried. Also shown on the request line is the value of the “traceparent” HTTP header, which comes in handy for tracing requests across Elastic products. On the response line, we can see the HTTP status code (i.e., OK) as well as the content type and the length (i.e., 1328 bytes) of the response.
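Speaking of the “X-Opaque-Id” header, here is how a client could set it; the application name `billing-service` and the local endpoint are made-up values for this sketch:

curl -s -H "X-Opaque-Id: billing-service" \
     -H "Content-Type: application/json" \
     "http://localhost:9200/my-index/_search" \
     -d '{"query":{"wildcard":{"first_name":{"value":"*a*"}}}}'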
All this is very nice; however, line 2 indicates that body tracing is disabled by default. This is mainly because HTTP payloads may contain sensitive information such as credentials, keys, or any other business data that needs to remain secret. Let’s say we have cleared that security hurdle, since our goal is to log search queries; we now need to modify our configuration as instructed on line 2.
In development or testing, we can simply leverage the `ES_JAVA_OPTS` environment variable. However, when operating a production cluster, it is advised to create a custom JVM options file called `custom.options`, which needs to be located in the `jvm.options.d` folder and contain the following content:
-Des.insecure_network_trace_enabled=true
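For a quick experiment on a development node started from the Elasticsearch home directory, the environment variable approach mentioned above could look like this instead:

ES_JAVA_OPTS="-Des.insecure_network_trace_enabled=true" ./bin/elasticsearch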
Once the options file has been deployed on each node, the nodes need to be restarted for the change to take effect. At this point, all that remains to be done is to configure the logging level for the `HttpBodyTracer` class (line 5), as shown in the code below:
PUT _cluster/settings { "persistent" : { "logger.org.elasticsearch.http.HttpTracer" : "TRACE", "logger.org.elasticsearch.http.HttpBodyTracer" : "TRACE", "http.tracer.include" : "/my-index/_search*", "http.tracer.exclude" : "" } }
Once the cluster settings have been applied, Base64-encoded HTTP payloads will start to appear in your log file, as shown in the code below:
[1350008] request body (...): H4sIAAAAAAAA/6vmUlBQKixNLapUslKoBnKA3PLMnJTkxKIUuAhQLC2zqLgkPi8xNxVJFChelphTChJS0krUUoIK13LByFquWi4Ay9GwO2IAAAA=
[1350008] response body (...): H4sIAAAAAAAA/7WYXW+bMBSG7/srolyHyjaYQK+2rs1WVdkuUq1rpynyghO8AE4xrF/Kf59xkiqQ8jHXlZASDraPD+/j42Oej3q9fsb5st876dnOQN2y...
Since the payload is gzipped and encoded in Base64, we need to process the log lines to extract the JSON query and response for request 1350008. This can be easily done using the following set of piped Unix commands:
grep "\[1350008\] request body" elasticsearch.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
Running the above commands will show us the content of the JSON query we wanted to trace:
{ "query": { "wildcard": { "first_name": { "value": "*a*" } } } }
We can proceed the same way to discover the content of the JSON response, which can be useful when debugging queries. As query responses can sometimes get quite long, the tracer will chunk any payload heavier than 2KB into several lines, as shown in the code below:
[1350008] response body [part 1]: H4sIAAAAAAAA/+19W3MbOZbmu3+FQk+zE+U...
[1350008] response body [part 2]: m/uhLH4NF2Vph9pPCGW5QaQmCM52PBwbZ8X...
[1350008] response body [part 3]: 7BVXoODWnASPKgDaJ8GNKfQMGmHUh1Hjsh+...
[1350008] response body (gzip compressed, base64-encoded, and split into 3 parts on preceding log lines...)
Using the same set of Unix commands as before, but with a slightly different grep pattern, we can extract the response payload:
grep "\[1350008\] response body \[part" elasticsearch.log | sed -e 's/.*://' | base64 --decode | gzip --decompress
Running that command reveals the full JSON response in your terminal, as shown below:
{ "took" : 61, "timed_out" : false, "_shards" : { "total" : 100, "successful" : 100, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 54, "relation" : "eq" }, "max_score" : 1.0, "hits" : [ ...search hits omitted for brevity... ] } }
In summary, request tracing is a very handy feature that allows you to not only trace your search queries but also the search responses, which cannot be done with the task management API or slow logs.
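Since all of this is driven by dynamic cluster settings, switching the tracer off once you are done is just as easy; setting the values back to `null` restores the defaults:

PUT _cluster/settings
{
  "persistent" : {
    "logger.org.elasticsearch.http.HttpTracer" : null,
    "logger.org.elasticsearch.http.HttpBodyTracer" : null,
    "http.tracer.include" : null,
    "http.tracer.exclude" : null
  }
}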
Conclusion
Whether you are debugging queries or trying to find a rogue one that is bringing your cluster to its knees, this article has introduced you to a few options for visualizing the queries that are sent to your cluster.
Using proxies, you can inspect all the traffic between your client and your Elasticsearch cluster, but it might not always be feasible to introduce such a component into your existing architecture.
Leveraging the task management API can provide you with insights on long-running queries, but short-lived ones would be much more difficult to catch.
Enabling the audit logging feature allows you to log the request bodies for authentication success security events, but unless you already have a Platinum subscription for other security needs, it’s not worth buying one just for logging queries.
Tricking slow logs into logging all your queries can also be an option, but depending on your load and volume of data, it might generate a lot of noise and grow log files beyond reason.
Finally, request tracing can be configured to only trace the queries that you are interested in and can be enabled and disabled dynamically. However, the queries are not logged in plain text but require an additional post-processing step to be decoded and decompressed.
If your main focus is to log your search queries for whatever use case you might have (debugging, security checks, performance tuning, etc.), we are pretty sure that you’ll find at least one of these five options useful. If not, there is yet another option: create your own plugin 😉