Elasticsearch Pagination Techniques
Elasticsearch currently provides three techniques for fetching many results: pagination, Search-After and Scroll.
Each use case calls for a different technique. We’ll cover the considerations in this guide.
When you build a user-facing search application or an API reading from Elasticsearch, it’s crucial to think about the number of results to be returned per search request.
In many search applications, 10 hits are shown on the first page. There are different ways to let the user select “show more”.
Either there is a button for the next page, or the pages are listed explicitly and the user can freely jump to another page.
In recent years, infinite scrolling has become popular: the page shows more hits as you scroll down.
These are the most typical use cases for a standard search application like a webshop or a feed.
Some applications also allow the user to export hits and download them in CSV or Excel format.
The default in Elasticsearch is to return the first 10 hits via pagination. Whenever you want to show more than 10 documents to your user, you’ll have to pick the correct technique out of the following options.
Pagination
The default mechanism to fetch many results in Elasticsearch is pagination.
When you send a query to Elasticsearch without any paging parameters, it uses the default values (“from” 0, “size” 10) and returns the first 10 documents, which are the most relevant ones unless you sort differently.
To show the next page (in this case the next 10 hits), set the “from” parameter of the next request to 10, then to 20 for the page after that, and so on.
How to use pagination
# Pagination

# First request
GET products/_search
{
  "query": {
    "match": {
      "name": "bike"
    }
  },
  "size": 10,
  "from": 0
}

# Next request
GET products/_search
{
  "query": {
    "match": {
      "name": "bike"
    }
  },
  "size": 10,
  "from": 10
}
The upper limit for pagination is 10,000 by default: “from” plus “size” must not exceed 10,000, so pagination will not let you reach anything beyond the first 10,000 hits.
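As a minimal sketch of how an application could translate a page number into the “from” parameter and respect the default window, consider the following. The helper name page_request and the constant MAX_RESULT_WINDOW are hypothetical, chosen only for illustration. The function builds the request body and sends nothing to Elasticsearch.

```python
MAX_RESULT_WINDOW = 10_000  # Elasticsearch default for index.max_result_window


def page_request(query: dict, page: int, size: int = 10) -> dict:
    """Build a search body for a 1-based page number (hypothetical helper)."""
    if page < 1 or size < 1:
        raise ValueError("page and size must be positive")
    offset = (page - 1) * size
    # Requests where from + size exceeds the window are rejected by
    # Elasticsearch, so fail early in the application instead.
    if offset + size > MAX_RESULT_WINDOW:
        raise ValueError(
            f"from + size = {offset + size} exceeds {MAX_RESULT_WINDOW}"
        )
    return {"query": query, "from": offset, "size": size}


bike_query = {"match": {"name": "bike"}}
second_page = page_request(bike_query, page=2)
```

Rejecting the request in the application gives you the chance to show a friendly message ("please refine your search") rather than an error from the cluster.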
There is a way to change that configuration (index.max_result_window) and many developers do so when they see the error log, but this is a common pitfall.
This is one of the few configuration parameters in Elasticsearch that you can change – but you shouldn’t.
Why not? What’s the problem when showing more than 10000 hits?
To answer this fully, we need to look a bit deeper at how a search request works.
A search request has two phases: the query phase and the fetch phase.
During the query phase the data nodes calculate the scores of the matching documents and basically return a list of scores and document IDs.
Each data node creates such a list and forwards it to the node handling the search request (the coordinating node), which merges the lists, sorts the result and keeps it in memory. As you can imagine, this score-ID list for a query can be quite large.
Then comes the fetch phase. During the fetch phase the JSON source of the documents is fetched from each node that holds the documents. That basically translates into a Multi Get request by ID for all the documents that are part of the page to be returned.
Get-requests are very efficient. However, keep in mind that all the information related to your query needs to stay in memory until the response is sent back to the client.
That’s why it’s usually a good idea to use a small page size.
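The two phases can be imitated with a small in-memory model. This is only a sketch of the idea, not how Elasticsearch is implemented: the SHARDS data and the function names query_phase, fetch_phase and search are made up for illustration.

```python
import heapq

# Two tiny in-memory "shards": document ID -> (score, source). Illustrative data.
SHARDS = [
    {"a": (0.9, {"name": "city bike"}), "b": (0.4, {"name": "bike bell"})},
    {"c": (0.7, {"name": "road bike"}), "d": (0.2, {"name": "bike lock"})},
]


def query_phase(shard: dict, from_: int, size: int) -> list:
    """Each shard returns only (score, doc_id) for its top from + size hits."""
    scored = [(score, doc_id) for doc_id, (score, _source) in shard.items()]
    return heapq.nlargest(from_ + size, scored)


def fetch_phase(doc_ids: list) -> list:
    """Load the stored source only for the documents on the requested page."""
    sources = []
    for doc_id in doc_ids:
        for shard in SHARDS:
            if doc_id in shard:
                sources.append(shard[doc_id][1])
    return sources


def search(from_: int, size: int) -> list:
    # The coordinating node merges the score-ID lists of all shards ...
    merged = sorted(
        (entry for shard in SHARDS for entry in query_phase(shard, from_, size)),
        reverse=True,
    )
    # ... keeps only the requested window, and fetches those documents by ID.
    page_ids = [doc_id for _score, doc_id in merged[from_:from_ + size]]
    return fetch_phase(page_ids)


first_page = search(from_=0, size=2)
```

Note that the fetch phase only touches the documents of the requested page, while the query phase has to consider every candidate up to that page.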
Performance killer: Deep pagination
Deep pagination is one of the top performance killers for your cluster. Deep pagination means allowing the user to access too many pages.
You should never give your users access to all the pages of their search request. If your PM is not happy about that, tell them that even Google only shows ~50 pages (500 hits).
Why is deep pagination a performance killer?
Because for every request Elasticsearch needs to recalculate all the hits, then sort the entire score-ID list and keep it in memory, even if you just want to show 10 hits starting at position 9990.
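A rough back-of-the-envelope calculation makes the cost visible. Every shard has to return its own top “from” + “size” candidates, so the node handling the request has to sort the number of shards multiplied by that window. The function name coordinator_entries and the shard count of 5 are assumptions for illustration only.

```python
def coordinator_entries(shards: int, from_: int, size: int) -> int:
    """Score-ID entries the coordinating node has to sort for one request."""
    # Every shard must return its own top (from + size) candidates, because
    # any of them could belong to the requested page.
    return shards * (from_ + size)


NUM_SHARDS = 5  # assumed shard count, purely illustrative
first_page_cost = coordinator_entries(NUM_SHARDS, from_=0, size=10)
deep_page_cost = coordinator_entries(NUM_SHARDS, from_=9990, size=10)
```

In this example the first page requires sorting 50 entries, while the page starting at position 9990 requires 50,000 entries to return the same 10 hits.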
A good UI with filters etc. and good relevance scoring will make most users happy on page 1, so the goal should always be to make deep pagination unnecessary.
Search-After
When you don’t need free access to pages (like jumping from page 1 to 5 etc.), but you’re happy with a “next” button (or when you use infinite scrolling), then the search_after parameter might be the right choice for you.
How to use search_after
# Search after

# First request
GET products/_search
{
  "query": {
    "match": {
      "name": "bike"
    }
  },
  "size": 10,
  "sort": [
    { "_score": "desc" },
    { "id.keyword": "asc" }
  ]
}

# Next request
GET products/_search
{
  "query": {
    "match": {
      "name": "bike"
    }
  },
  "size": 10,
  "sort": [
    { "_score": "desc" },
    { "id.keyword": "asc" }
  ],
  "search_after": [ 0.2876821, "1" ]
}
How does Search-After work compared to pagination?
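With search_after, you tell Elasticsearch which hit you viewed last, so all the hits before it can be ignored.
Instead of keeping the whole score-ID list for the search request in memory and having to sort it to provide the right page of results, search_after uses the sort values of the last hit of your previous search request as a bookmark. For this to work reliably, the sort needs a tiebreaker field with a unique value per document (id.keyword in the example above), so that hits with the same score still have a well-defined order.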
This is a lot more efficient when you need to show many hits. You can even use search_after to show more than 10000 hits if needed.
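In application code, the bookmark does not have to be assembled by hand: every hit of a sorted search carries its sort values in a “sort” array. The following sketch derives the next request body from the previous response. The helper name next_page_body is hypothetical and the response is shortened to the fields that matter here.

```python
import copy


def next_page_body(previous_body: dict, previous_response: dict):
    """Return the body for the next page, or None when there are no more hits."""
    hits = previous_response["hits"]["hits"]
    if not hits:
        return None
    body = copy.deepcopy(previous_body)
    # The sort values of the last hit act as the bookmark for the next request.
    body["search_after"] = hits[-1]["sort"]
    # search_after replaces the offset, so drop "from" if it was set.
    body.pop("from", None)
    return body


first_body = {
    "query": {"match": {"name": "bike"}},
    "size": 10,
    "sort": [{"_score": "desc"}, {"id.keyword": "asc"}],
}
# Shortened response for illustration; a real response has more fields.
first_response = {
    "hits": {"hits": [{"_id": "1", "_score": 0.2876821, "sort": [0.2876821, "1"]}]}
}
second_body = next_page_body(first_body, first_response)
```

The query and sort must stay identical between requests; only the search_after values change from page to page.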
What about live index updates?
Elasticsearch does a pretty good job at supporting live updates to an index without causing performance bottlenecks. So you can easily add documents, update them or delete them and still perform queries on the same index (the refresh interval is a key concept here).
Though this behaviour is fairly useful, it can cause inconsistencies across search result pages. For example, if a document relevant to the user’s query is inserted (ranking among the first 10 hits) and the user then clicks on “Page 2”, the last document they viewed on page 1 suddenly appears at the top of page 2.
Pagination and Search After are stateless. That means there is no guarantee that the order of the search results will be the same when users click back and forth between pages. That would likely be frustrating and confusing for your users.
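The effect is easy to reproduce with a toy model. The sketch below pages through a plain Python list that stands in for the index; the function page and the sample documents are made up for illustration.

```python
def page(index: list, from_: int, size: int) -> list:
    """Stateless from/size pagination over an index sorted by score (descending)."""
    ranked = sorted(index, key=lambda doc: doc["score"], reverse=True)
    return [doc["id"] for doc in ranked[from_:from_ + size]]


# Six documents with scores 0.9, 0.8, ... 0.4.
index = [{"id": f"doc{i}", "score": 1.0 - i / 10} for i in range(1, 7)]

page_1 = page(index, from_=0, size=3)

# A new, highly relevant document becomes searchable between the two requests.
index.append({"id": "new", "score": 0.99})

page_2 = page(index, from_=3, size=3)
```

Here doc3 is the last hit on page 1 and shows up again as the first hit on page 2, because the new document pushed every other hit down by one position.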
If you need to make sure the search experience is the same over a certain amount of time, you need a stateful pagination technique.
In that case you can use the Point in Time API. It is relatively new (ES 7.10 X-Pack feature), so please make sure that your Elasticsearch version supports it.
If you’re using an older version of Elasticsearch you can use the Scroll API instead.
Point in Time API
The Point in Time API can be used to extend pagination or Search-After and make them stateful. The user will always see the same version of the index over a certain period of time. Updates will be ignored, or at the very least the user won’t notice them and the search experience will be fully consistent. There won’t be any documents suddenly popping up when clicking back and forth across search result pages.
How to use pagination with the Point in Time API
# Create PIT for index
POST products/_pit?keep_alive=2m

# First page
GET _search
{
  "from": 0,
  "size": 10,
  "query": {
    "match": {
      "name": "bike"
    }
  },
  "pit": {
    "id": "85ezAwEIcHJvZHVjdHMWSzJZVTdKOU1RLU9SRVBsck43SGg3dwAWS3ptRzhJX3FUZE9iaGVpY3J5VmhTdwAAAAAAAAAABxZVYm5aNWF2U1NqcWJJdXhPc1dyS2hBAAEWSzJZVTdKOU1RLU9SRVBsck43SGg3dwAA",
    "keep_alive": "2m"
  }
}

# Following page
GET _search
{
  "from": 10,
  "size": 10,
  "query": {
    "match": {
      "name": "bike"
    }
  },
  "pit": {
    "id": "85ezAwEIcHJvZHVjdHMWSzJZVTdKOU1RLU9SRVBsck43SGg3dwAWS3ptRzhJX3FUZE9iaGVpY3J5VmhTdwAAAAAAAAAABxZVYm5aNWF2U1NqcWJJdXhPc1dyS2hBAAEWSzJZVTdKOU1RLU9SRVBsck43SGg3dwAA",
    "keep_alive": "2m"
  }
}
How to use Search After with the Point in Time API
# Create PIT for index
POST products/_pit?keep_alive=2m

# First page
GET _search
{
  "size": 10,
  "query": {
    "match": {
      "name": "bike"
    }
  },
  "pit": {
    "id": "85ezAwEIcHJvZHVjdHMWSzJZVTdKOU1RLU9SRVBsck43SGg3dwAWS3ptRzhJX3FUZE9iaGVpY3J5VmhTdwAAAAAAAAAACRZVYm5aNWF2U1NqcWJJdXhPc1dyS2hBAAEWSzJZVTdKOU1RLU9SRVBsck43SGg3dwAA",
    "keep_alive": "2m"
  },
  "sort": [
    { "_score": "desc" },
    { "_shard_doc": "asc" }
  ]
}

# Next page
GET _search
{
  "size": 10,
  "query": {
    "match": {
      "name": "bike"
    }
  },
  "pit": {
    "id": "85ezAwEIcHJvZHVjdHMWSzJZVTdKOU1RLU9SRVBsck43SGg3dwAWS3ptRzhJX3FUZE9iaGVpY3J5VmhTdwAAAAAAAAAACRZVYm5aNWF2U1NqcWJJdXhPc1dyS2hBAAEWSzJZVTdKOU1RLU9SRVBsck43SGg3dwAA",
    "keep_alive": "2m"
  },
  "sort": [
    { "_score": "desc" },
    { "_shard_doc": "asc" }
  ],
  "search_after": [ 0.2876821, 0 ]
}
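Put together, paging with a Point in Time and search_after is a simple loop. The sketch below is written against a function argument named search, which is an assumption: it stands for whatever you use to send the request body to GET _search and return the parsed JSON response. The helper iterate_hits and the in-memory fake_search are hypothetical and exist only to make the example runnable.

```python
from typing import Callable, Iterator


def iterate_hits(search: Callable[[dict], dict], pit_id: str, query: dict,
                 size: int = 10, keep_alive: str = "2m") -> Iterator[dict]:
    """Yield all hits of a query page by page, using a PIT and search_after."""
    body = {
        "size": size,
        "query": query,
        "pit": {"id": pit_id, "keep_alive": keep_alive},
        "sort": [{"_score": "desc"}, {"_shard_doc": "asc"}],
    }
    while True:
        response = search(body)
        hits = response["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Bookmark: sort values of the last hit on this page.
        body["search_after"] = hits[-1]["sort"]
        # The response may carry an updated PIT ID; keep using the latest one.
        body["pit"]["id"] = response.get("pit_id", body["pit"]["id"])


# Stand-in for a real search call, for illustration only: 25 fake hits
# whose second sort value is simply their position.
_docs = [{"_id": str(i), "sort": [1.0, i]} for i in range(25)]


def fake_search(body: dict) -> dict:
    after = body.get("search_after")
    start = 0 if after is None else after[1] + 1
    return {"pit_id": body["pit"]["id"],
            "hits": {"hits": _docs[start:start + body["size"]]}}


all_ids = [hit["_id"]
           for hit in iterate_hits(fake_search, "pit-1", {"match": {"name": "bike"}})]
```

When you are done, it is a good idea to close the Point in Time explicitly (DELETE _pit with the ID in the request body) rather than waiting for keep_alive to expire.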
Scroll API
The Scroll API can be used to iterate over a large number of documents matching a query, or even all the matching documents.
Although the API is called Scroll API, it should not be used to implement infinite scrolling and should not be used to serve frequent end user requests.
As opposed to pagination and Search-After, the Scroll API is stateful. That means that updates to the index are ignored during the lifetime of a scroll request.
To achieve this, Elasticsearch needs to store a snapshot of the current version of the index and keep it alive during the lifetime of a scroll context.
Keeping the initial search context alive has a high cost for actively updated indices.
How to use the Scroll API
# First request
GET products/_search?scroll=1m
{
  "size": 10,
  "query": {
    "match": {
      "name": "bike"
    }
  }
}

# Next request
GET _search/scroll
{
  "scroll": "1m",
  "scroll_id": "FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFlViblo1YXZTU2pxYkl1eE9zV3JLaEEAAAAAAAAAChZLem1HOElfcVRkT2JoZWljcnlWaFN3"
}
Summary
When to use pagination?
- When you need free access to pages and you’re not planning to offer deep pagination.
When to use Search-After?
- When a “next” button is sufficient for you and you want to give efficient access to many pages.
When to use the Point in Time API?
- When you need consistent order across search result pages.
- When you’re running Elasticsearch with X-Pack >7.10 or >7.13.
When to use Scroll?
- Scroll can be used to list all hits of a query. However, it should be a rare request: probably alright in expert applications or for very rare end-user requests.
- When you need consistent order across search result pages.
- When your Elasticsearch version does not support Point in Time.