Quick Links
- Step-by-step guide on how to use the Scroll API
- Using the Search API with pagination
- Using Search After
- Conclusion
Introduction
One of the common operations in Elasticsearch is retrieving all documents from an index. This article covers three techniques for doing so: the Scroll API, the Search API with pagination, and Search After. Each comes with examples and step-by-step instructions.
A step-by-step guide on how to use the Scroll API
The Scroll API is a useful tool for retrieving large numbers of documents from Elasticsearch efficiently. It is designed to bypass the deep pagination problem by creating a “snapshot” of the index at the time of the initial search request.
How to use the Scroll API
- Step 1:
Initiate a search request with the scroll parameter, which tells Elasticsearch how long it should keep the search context alive. For example, to retrieve all documents from an index named `my_index` in batches of 1,000, keeping the context alive for one minute at a time, you would use:
curl -X GET "localhost:9200/my_index/_search?scroll=1m" -H 'Content-Type: application/json' -d '{ "size": 1000, "query": { "match_all": {} } }'
- Step 2:
In the response, Elasticsearch returns the first batch of hits together with a scroll ID (the `_scroll_id` field) that you can use to retrieve the next batch of results.
- Step 3:
To get the next batch of documents, pass the scroll ID to the scroll endpoint:
curl -X GET "localhost:9200/_search/scroll" -H 'Content-Type: application/json' -d '{ "scroll": "1m", "scroll_id": "<scroll_id>" }'
- Step 4:
Repeat step 3, always using the scroll ID from the most recent response, until no more documents are returned.
- Step 5:
After all documents have been retrieved, clear the scroll context:
curl -X DELETE "localhost:9200/_search/scroll" -H 'Content-Type: application/json' -d '{ "scroll_id": "<scroll_id>" }'
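The five steps can also be driven from a small client-side loop. The following is a minimal Python sketch of that loop, not an official client: `scroll_all`, `http_transport`, the `Transport` type and `BASE_URL` are hypothetical names introduced for illustration, and the node is assumed to be the same unauthenticated `localhost:9200` used in the curl commands. It sends the bodies with `POST`, which Elasticsearch accepts for these endpoints as an alternative to `GET`.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

# Hypothetical transport signature: (method, path, body) -> parsed JSON response.
Transport = Callable[[str, str, Optional[dict]], dict]

BASE_URL = "http://localhost:9200"  # assumption: same local node as the curl examples


def http_transport(method: str, path: str, body: Optional[dict] = None) -> dict:
    """Send one JSON request to Elasticsearch using only the standard library."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def scroll_all(index: str, send: Transport = http_transport,
               size: int = 1000, keep_alive: str = "1m") -> Iterator[dict]:
    """Yield every hit in `index`, one scroll batch at a time."""
    # Step 1: open the scroll context with the initial search.
    response = send("POST", f"/{index}/_search?scroll={keep_alive}",
                    {"size": size, "query": {"match_all": {}}})
    scroll_id = response.get("_scroll_id")
    try:
        # Step 4: keep going until a batch comes back empty.
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            # Step 3: fetch the next batch with the most recent scroll ID.
            response = send("POST", "/_search/scroll",
                            {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        # Step 5: release the search context, even if the caller stops early.
        if scroll_id is not None:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

Calling `for hit in scroll_all("my_index"): ...` would then walk the whole index. Passing a different `send` callable makes it possible to add authentication or to exercise the loop without a running cluster.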
Using the Search API with pagination
Another method for retrieving all documents is to use the Search API with pagination. This method is simpler but can be less efficient for large data sets, and by default `from` + `size` cannot exceed 10,000, so only the first 10,000 hits are reachable. This limit can be raised through the `index.max_result_window` index setting, but raising it excessively leads to the deep pagination problem: each shard has to collect and sort `from` + `size` hits for every request, so memory and CPU costs grow with the page offset.
Here’s how to use the Search API with pagination:
1. Step 1
Use the `size` parameter to set the number of documents returned per page and `from` to specify the offset of the first one. For example:
curl -X GET "localhost:9200/my_index/_search" -H 'Content-Type: application/json' -d '{ "size": 1000, "from": 0, "query": { "match_all": {} } }'
2. Step 2
In the response, Elasticsearch returns the requested documents and the total number of hits.
3. Step 3
To get the next batch of documents, increment `from` by `size` and repeat the search request until a page comes back empty or `from` + `size` reaches the `index.max_result_window` limit.
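To make the `from`/`size` arithmetic concrete, here is a small Python sketch that only computes the sequence of request bodies and sends nothing. `page_requests` is a hypothetical helper, `total_hits` is assumed to be the hit count read from the first response, and `max_result_window` is assumed to mirror the index's `index.max_result_window` setting (10,000 by default).

```python
from typing import Iterator


def page_requests(total_hits: int, size: int = 1000,
                  max_result_window: int = 10_000) -> Iterator[dict]:
    """Yield one search body per page, never reaching past the result window."""
    # Hits beyond the window cannot be fetched with from/size at all.
    reachable = min(total_hits, max_result_window)
    for start in range(0, reachable, size):
        # Shrink the last page so that from + size stays within the window.
        page_size = min(size, reachable - start)
        yield {"from": start, "size": page_size, "query": {"match_all": {}}}
```

For 2,500 hits and a page size of 1,000 this produces three bodies with `from` set to 0, 1000 and 2000. For an index larger than the window, the sequence simply stops at the window, which is the point where one of the other methods becomes necessary.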
Using Search After
A very lightweight way to iterate over all your data, without having to maintain a costly search context, is to use the `search_after` parameter. For Search After to work correctly, you need to add a sort clause to your query, ideally one that yields a unique value per document so that no hit is skipped or repeated between pages. When you receive the response, use the sort values of the last hit (its `sort` array) as the `search_after` parameter for retrieving the next page of results.
The first query will look like this (note that you must not specify `from`):
curl -X GET "localhost:9200/my_index/_search" -H 'Content-Type: application/json' -d '{ "size": 1000, "sort": { "_id": "asc" }, "query": { "match_all": {} } }'
Suppose the `_id` of the last hit returned in the response is `536738`. The next query then asks for the page of sorted results that come after that document, passing the sort value exactly as it appears in the hit's `sort` array:
curl -X GET "localhost:9200/my_index/_search" -H 'Content-Type: application/json' -d '{ "size": 1000, "sort": { "_id": "asc" }, "search_after": ["536738"], "query": { "match_all": {} } }'
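Deriving the next request from the previous response is mechanical, so it can be captured in a few lines. The Python sketch below illustrates that single step. `next_search_after_body` is a hypothetical helper name, and it assumes the response has the usual `hits.hits` layout in which every hit of a sorted search carries a `sort` array.

```python
import copy
from typing import Optional


def next_search_after_body(body: dict, response: dict) -> Optional[dict]:
    """Return the request body for the next page, or None if the page was empty."""
    hits = response["hits"]["hits"]
    if not hits:
        return None  # nothing left to iterate over
    next_body = copy.deepcopy(body)  # leave the caller's body untouched
    next_body.pop("from", None)      # per the note above: no `from` with search_after
    # The sort values of the last hit mark where the next page starts.
    next_body["search_after"] = hits[-1]["sort"]
    return next_body
```

Feeding it the first query and a response whose last hit has `"sort": ["536738"]` yields the second query shown above.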
Search After comes with some drawbacks. For instance, if your index is constantly being written to and a refresh occurs, the sort order of your documents might change while you’re iterating over your index, causing inconsistent results across pages. To prevent this, you can create a point in time (PIT), which preserves the state of the index while you iterate over it.
First, create a point in time, using the `keep_alive` query string parameter to specify how long the context should stay alive:
curl -X POST "localhost:9200/my_index/_pit?keep_alive=1m"
The response will include a PIT token such as:
{ "id": "46ToAwMDaWR5…gAAAAA==" }
Then you can use the same Search After query as before, including the PIT token that you have just received. Note that the request now targets `_search` without an index name, because the PIT already determines which index is searched:
curl -X GET "localhost:9200/_search" -H 'Content-Type: application/json' -d '{ "size": 1000, "sort": { "_id": "asc" }, "pit": { "id": "46ToAwMDaWR5…gAAAAA==", "keep_alive": "1m" }, "query": { "match_all": {} } }'
Each response includes the most recent PIT id in the `pit_id` field, which you need to send in your next request along with the `search_after` values of the last hit. Repeat until you’ve iterated over the full index.
When you’re done iterating, you should make sure to close the PIT context explicitly:
curl -X DELETE "localhost:9200/_pit" -H 'Content-Type: application/json' -d '{ "id": "<pit_id>" }'
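Put together, the PIT workflow is: open a PIT, page with `search_after`, carry the latest `pit_id` forward, and close the PIT. The Python sketch below is one possible way to wire this up under stated assumptions, not a definitive implementation: `iterate_with_pit`, `http_transport`, `Transport` and `BASE_URL` are hypothetical names, the node is assumed to be an unauthenticated `localhost:9200`, and the sort on `_id` is carried over from the examples above.

```python
import json
import urllib.request
from typing import Callable, Iterator, Optional

# Hypothetical transport signature: (method, path, body) -> parsed JSON response.
Transport = Callable[[str, str, Optional[dict]], dict]

BASE_URL = "http://localhost:9200"  # assumption: same local node as the curl examples


def http_transport(method: str, path: str, body: Optional[dict] = None) -> dict:
    """Send one JSON request to Elasticsearch using only the standard library."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        BASE_URL + path, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def iterate_with_pit(index: str, send: Transport = http_transport,
                     size: int = 1000, keep_alive: str = "1m") -> Iterator[dict]:
    """Yield every hit in `index` from a consistent point-in-time view."""
    # Open the PIT; its id replaces the index name in every later search.
    pit_id = send("POST", f"/{index}/_pit?keep_alive={keep_alive}", None)["id"]
    body = {"size": size, "sort": {"_id": "asc"}, "query": {"match_all": {}}}
    try:
        while True:
            body["pit"] = {"id": pit_id, "keep_alive": keep_alive}
            # No index in the path: the PIT already identifies what is searched.
            response = send("POST", "/_search", body)
            # Always carry the most recent PIT id into the next request.
            pit_id = response.get("pit_id", pit_id)
            hits = response["hits"]["hits"]
            if not hits:
                break
            yield from hits
            # Continue after the last hit of this page.
            body["search_after"] = hits[-1]["sort"]
    finally:
        # Close the PIT explicitly, even if the caller stops early.
        send("DELETE", "/_pit", {"id": pit_id})
```

Because the cleanup sits in a `finally` block, the PIT is also closed when the consuming loop exits early or an error occurs mid-iteration.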
Conclusion
Using the basic `from`/`size` parameters is the simplest method to iterate over your data, but by default you can only reach the first 10,000 documents, and raising that limit leads to the deep pagination problem.
The next lightweight option is to use `search_after` in a sorted query, but if your index is refreshed due to new writes, the results might become inconsistent.
To alleviate this, you can use a point in time (PIT) token, which keeps a consistent view of the index while you’re iterating, regardless of the index’s write status.
The Scroll API achieves the same results as PIT; however, the search context that it keeps is much more substantial.
The recommendation is to use PIT if you can, but make sure to not use the Scroll API for real-time searches. After completing your operations, it’s essential to close either the Scroll or PIT context that you’ve used.
Each of these methods has its own advantages and use cases. The Search API with `from`/`size` pagination is the simplest and most straightforward for small result sets, `search_after` combined with a PIT suits consistent iteration over a full index, and the Scroll API remains an option for retrieving large numbers of documents in bulk. Choose the method that best fits your needs.