Quick links
- Overview
- Use Cases
- Pagination Methods
- Paginate with Point in Time (PIT)
- Summary of Pagination Methods
- Conclusion
Overview
Pagination is a common operation when designing any search solution: instead of returning the whole dataset on every request, only a subset of the results is returned, for performance reasons.
Use Cases
There are two main use cases for pagination:
- For users to see a “page” of results at a time, rather than all the documents in one pass. There are also some “infinite scroll” experiences that show only a subset of the documents.
- For users who need to gather all documents, even though returning them all at once is not possible: the default result window is 10,000 documents, and returning millions of documents in a single query is a bad idea anyway, since everything must be held in memory during the process. Paging deep into a large result set like this is referred to as “deep pagination.”
Pagination Methods
A common way to do pagination is with the “from” and “size” parameters, which make it very easy to calculate page numbers and allow users to jump from page to page, forwards and backwards.
The problem is that “from” and “size” must walk through all the preceding results to reach the exact page the user is requesting. This is fine for small datasets but can be a killer for huge indices, and users may not always want to jump between pages, for example, in the second use case presented above.
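To see why deep from/size pagination gets expensive, here is a minimal Python sketch. This is an in-memory simulation for illustration only, not OpenSearch internals: the point is that every request must scan past all hits before “from” just to discard them.

```python
# Minimal in-memory simulation of from/size pagination (illustrative only,
# not OpenSearch internals). Every request walks past all hits before
# `from_`, so deeper pages are progressively more expensive to serve.
def paginate_from_size(docs, from_, size):
    scanned = 0
    page = []
    for doc in sorted(docs, key=lambda d: d["score"], reverse=True):
        scanned += 1
        if scanned > from_:
            page.append(doc)
        if len(page) == size:
            break
    return page, scanned  # scanned grows with from_, not with size

docs = [{"id": i, "score": float(i)} for i in range(100)]
page, scanned = paginate_from_size(docs, from_=90, size=5)
print(scanned)  # 95 - the engine scanned 95 documents to return 5
```

Requesting page 19 (from=90) forced a scan of 95 documents to hand back 5; with millions of documents and deep pages, that skipped work dominates.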
A more efficient way to do deep pagination is by using scroll, which is designed to return more results at once, but this too has its own limitations.
Scroll uses search context windows to keep data consistent while pagination occurs.
A “search context” window refers to the temporary state or snapshot that OpenSearch maintains during a scrolling operation. When a scroll request is initiated, OpenSearch takes a snapshot of the index’s current state, ensuring that the results returned during the entire scroll operation are consistent, even if the underlying data changes. The search context is maintained for the duration specified in the scroll request.
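The behavior of a search context can be sketched in Python. This is a conceptual simulation, not the actual OpenSearch implementation: opening the scroll takes a snapshot, so later changes to the live data do not affect the pages the scroll returns.

```python
# Conceptual simulation of a scroll search context (not OpenSearch internals).
# Opening the scroll snapshots the data; subsequent pages read the snapshot,
# so changes to the live index do not leak into the ongoing scroll.
class ScrollContext:
    def __init__(self, live_docs, size):
        self.snapshot = list(live_docs)  # frozen view taken at open time
        self.size = size
        self.pos = 0

    def next_page(self):
        page = self.snapshot[self.pos:self.pos + self.size]
        self.pos += self.size
        return page

live = ["doc1", "doc2", "doc3", "doc4"]
ctx = ScrollContext(live, size=2)
first = ctx.next_page()    # ["doc1", "doc2"]
live.insert(0, "doc0")     # the index changes mid-scroll
second = ctx.next_page()   # still ["doc3", "doc4"]: the snapshot is consistent
print(first, second)
```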
The preferred way to paginate in OpenSearch is search_after. Search_after uses a cursor taken from the current search results to query the next page. The cursor is the combination of “sort” values: users pass the sort values from the last result, along with the same sort criteria, to fetch the page that follows.
The difference between search_after and scroll is that search_after is stateless: there is no context window. This means that if data changes between queries, the results may not be consistent, which may or may not be a problem.
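A small Python simulation makes the statelessness concrete (conceptual only, not OpenSearch internals): each request re-runs the query against the live data and keeps only hits whose sort value comes after the cursor, so a document indexed mid-pagination shows up on a later page.

```python
# Conceptual simulation of stateless search_after pagination (not OpenSearch
# internals). Each request re-queries the *live* data and filters by the
# cursor (the last sort value), so mid-pagination inserts change later pages.
def search_after_page(docs, sort_key, after, size):
    hits = sorted(docs, key=lambda d: d[sort_key])
    if after is not None:
        hits = [d for d in hits if d[sort_key] > after]
    return hits[:size]

live = [{"ts": 1}, {"ts": 2}, {"ts": 4}]
page1 = search_after_page(live, "ts", after=None, size=2)    # ts 1, 2
cursor = page1[-1]["ts"]                                     # cursor = 2
live.append({"ts": 3})                                       # data changes between requests
page2 = search_after_page(live, "ts", after=cursor, size=2)  # ts 3, 4 - the new doc appears
print([d["ts"] for d in page2])
```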
For the sake of this article, we will assume that it is problematic to have inconsistencies. In this scenario, we need to freeze the dataset during our pagination regardless of index changes during that time.
To overcome this limitation we will use the Point in Time (PIT) API, available since OpenSearch 2.4.
Paginate with Point in Time (PIT)
PIT allows users to make search_after queries stateful and obtain consistent results every single time, even if documents are updated or deleted.
OpenSearch accomplishes this by locking a set of segments in time, and by not modifying or deleting any of the resources required for this PIT.
It’s important to highlight that PIT is not tied to a query. Users can run different queries using the same point in time and enjoy the same consistency. In other words, users can go back and forth and expect to see the exact same results.
Example
Let’s look at an example:
Before starting, let’s create a simple index.
POST test_pit/_doc
{
  "@timestamp": "2023-07-21T04:31:40.210+00:00",
  "title": "First Page"
}

POST test_pit/_doc
{
  "@timestamp": "2023-07-22T04:31:40.210+00:00",
  "title": "Second Page"
}

POST test_pit/_doc
{
  "@timestamp": "2023-07-23T04:31:40.210+00:00",
  "title": "Third Page"
}
To keep this example nice and clear, our queries will use “size 1,” so each document represents a page.
1. Create PIT
The first step is to create the PIT that will be used in the search queries to do pagination.
POST /test_pit/_search/point_in_time?keep_alive=1h
You can also add additional parameters:
Parameter | Description | Required | Default value |
---|---|---|---|
test_pit | The name of the index or indices to open the PIT against. | Yes | |
keep_alive | How long to keep the PIT (and its frozen view of the index) alive. | Yes | |
routing | Routes the search requests to a specific shard. | No | _id |
expand_wildcards | The type of index that can match the wildcard pattern. Values: all, open, closed, hidden, and none. | No | open |
preference | The node or shard used to perform the search. | No | random |
This query returns a pit_id that can be used in the search_after queries.
This is the response:
{
  "pit_id": "g5eAQQEIdGVzdF9waXQWcF9TVC1Sb1lSbE9iU2xWN2UtSVNwdwAWeTlMYkhvdUlRVWFjc09jcmkyN01EUQAAAAAAAAAABhZDc2Vfc190b1M1T19uRE1RbXpEQmZnARZwX1NULVJvWVJsT2JTbFY3ZS1JU3B3AAA=",
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "creation_time": 1696793510768
}
2. Run query with PIT
Use the PIT from step #1 to query the data:
GET /_search
{
  "size": 1,
  "pit": {
    "id": "g5eAQQEIdGVzdF9waXQWcF9TVC1Sb1lSbE9iU2xWN2UtSVNwdwAWeTlMYkhvdUlRVWFjc09jcmkyN01EUQAAAAAAAAAABhZDc2Vfc190b1M1T19uRE1RbXpEQmZnARZwX1NULVJvWVJsT2JTbFY3ZS1JU3B3AAA="
  },
  "sort": [
    { "@timestamp": { "order": "asc" } }
  ]
}
This will return the “First Page” document because of the sort order by @timestamp.
To get the next page of results, send the sort value (@timestamp for this example) of the last hit as the search_after parameter along with the same pit.id sent in the first query:
GET /_search
{
  "size": 1,
  "pit": {
    "id": "g5eAQQEIdGVzdF9waXQWcF9TVC1Sb1lSbE9iU2xWN2UtSVNwdwAWeTlMYkhvdUlRVWFjc09jcmkyN01EUQAAAAAAAAAABhZDc2Vfc190b1M1T19uRE1RbXpEQmZnARZwX1NULVJvWVJsT2JTbFY3ZS1JU3B3AAA="
  },
  "sort": [
    { "@timestamp": { "order": "asc" } }
  ],
  "search_after": [
    "2023-07-21T04:31:40.210+00:00"
  ]
}
As expected, the “Second Page” document follows.
Let’s index a new document, “Page Zero,” whose @timestamp falls before “First Page”:
POST test_pit/_doc
{
  "@timestamp": "2023-07-20T04:31:40.210+00:00",
  "title": "Page Zero"
}
If we run the first query again (sorted by @timestamp asc), the “First Page” document is still returned first, because “Page Zero” is outside this point in time’s context.
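This behavior can be simulated in Python (a conceptual sketch, not OpenSearch internals): queries that carry a PIT read from the snapshot taken at creation time, so documents indexed afterwards are invisible to them.

```python
# Conceptual simulation of PIT-backed search_after (not OpenSearch internals).
# The PIT freezes a snapshot; paginated queries read the snapshot, so a
# document indexed after PIT creation ("Page Zero") never appears.
def create_pit(live_docs):
    return list(live_docs)  # frozen snapshot, standing in for the pit_id

def query_pit(snapshot, size, search_after=None):
    hits = sorted(snapshot, key=lambda d: d["@timestamp"])
    if search_after is not None:
        hits = [d for d in hits if d["@timestamp"] > search_after]
    return hits[:size]

live = [
    {"@timestamp": "2023-07-21", "title": "First Page"},
    {"@timestamp": "2023-07-22", "title": "Second Page"},
    {"@timestamp": "2023-07-23", "title": "Third Page"},
]
pit = create_pit(live)
live.append({"@timestamp": "2023-07-20", "title": "Page Zero"})  # indexed after the PIT
first = query_pit(pit, size=1)
print(first[0]["title"])  # "First Page" - Page Zero is outside the PIT
```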
But what if I want to jump from page 1 to page 3?
With from/size users can easily jump from page to page by changing the value of “from.”
Now, what about search_after?
With search_after users can use slicing to split result sets into many pieces that can be called “pages.”
To achieve this, instead of defining a sort order and search_after, define a slice id (which acts as a page number) and a slice max, which defines how many pages to generate:
GET /_search
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "pit": {
    "id": "g5eAQQEIdGVzdF9waXQWcF9TVC1Sb1lSbE9iU2xWN2UtSVNwdwAWeTlMYkhvdUlRVWFjc09jcmkyN01EUQAAAAAAAAAABhZDc2Vfc190b1M1T19uRE1RbXpEQmZnARZwX1NULVJvWVJsT2JTbFY3ZS1JU3B3AAA="
  }
}
This will generate two slices:
- First Page/Second Page (id: 0)
- Third Page (id: 1)
So, the query we just ran will return the “Third Page” document because we requested the slice id to be 1:
{
  "pit_id": "g5eAQQEIdGVzdF9waXQWcF9TVC1Sb1lSbE9iU2xWN2UtSVNwdwAWeTlMYkhvdUlRVWFjc09jcmkyN01EUQAAAAAAAAAABhZDc2Vfc190b1M1T19uRE1RbXpEQmZnARZwX1NULVJvWVJsT2JTbFY3ZS1JU3B3AAA=",
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "test_pit",
        "_id": "MKUthokBtJ4kgTgv2JjV",
        "_score": null,
        "_source": {
          "@timestamp": "2023-07-23T04:31:40.210+00:00",
          "title": "Third Page"
        },
        "sort": [ 1690086700210 ]
      }
    ]
  }
}
Note that sort applies per slice, not to the whole result set, so this strategy is only useful when the overall order is not important.
Slicing also boosts performance, because many slices can be queried in parallel. This is a great approach for retrieving a big dataset quickly.
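The parallelism benefit can be sketched with a small Python simulation. This is conceptual only: the modulo-based slice assignment below is a stand-in, since OpenSearch computes slice membership internally from document IDs.

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual simulation of sliced pagination (not OpenSearch internals):
# the dataset is split into `max_slices` disjoint slices that a client
# can fetch in parallel, then recombine.
def fetch_slice(docs, slice_id, max_slices):
    # Stand-in slice assignment; OpenSearch derives slices from doc IDs.
    return [d for i, d in enumerate(docs) if i % max_slices == slice_id]

docs = [f"doc{i}" for i in range(10)]
max_slices = 2
with ThreadPoolExecutor(max_workers=max_slices) as pool:
    futures = [pool.submit(fetch_slice, docs, s, max_slices) for s in range(max_slices)]
    slices = [f.result() for f in futures]

all_docs = sorted(d for s in slices for d in s)
print(len(all_docs))  # 10 - the slices are disjoint and together cover the dataset
```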
Summary of Pagination Methods
The following table summarizes different pagination methods.
Feature | from/size | scroll | search_after |
---|---|---|---|
Description | A common way to paginate, allowing users to jump from page to page easily. | Designed to return a larger number of results at once using context windows for data consistency. | Uses the current search results cursor to query the next page of results, with the PIT overcoming statelessness for consistency. |
Operation | Must go through all the results to get the exact page the user is requesting. | Uses context windows to keep data consistent during pagination. | Uses the sort parameters of the last result plus the same sort criteria for querying the next page. |
Data Consistency | Paginating with from/size will produce different results if datasets change. | Maintains data consistency. | Data may not be consistent between queries without the use of PIT. With PIT, users can ensure consistency even after document updates/deletions. |
Statefulness | Not stateful | Stateful, due to search context windows. | Stateless without PIT. With PIT, stateful and consistent. |
Large Datasets | Can be inefficient for large datasets. Returns at most 10,000 results by default; increasing that limit is not recommended. | Handles larger datasets more efficiently. | Efficient for large datasets, especially when coupled with PIT and slices. |
Jump Between Pages | Allows for easy page jumping by adjusting the from parameter up to 10,000 by default. | Not possible with scroll. | Page jumping is possible through slicing, breaking the result set into many pages. |
Performance | Not ideal for huge indices. | Efficient for large result sets. | Can handle large datasets quickly, especially when using slicing for parallel querying. |
Real-time Data | Changes in the data after the initial request might affect subsequent pages. | Data changes after initial requests will not affect subsequent pages. | Without PIT, changes in data might affect subsequent pages. With PIT, changes will not affect results. |
Live vs Batch | Good for displaying live results to users, and with small datasets (less than 10,000 documents). | Not recommended for live results, but can be used to retrieve large datasets. For deep pagination search_after is recommended. | Preferred method for live results of large datasets and data dumps. |
Conclusion
There are many different strategies for paginating over documents. search_after + PIT offers the best performance and flexibility for deep pagination, covering the most relevant needs: large datasets, jumping between pages (via slices), and data consistency by freezing the result set.
In this article we explored how to create a sample index and paginate through it efficiently using search_after with Point in Time (PIT) and slicing.