Elasticsearch How to Paginate with Point in Time (Performant and Consistent) in OpenSearch

By Opster Expert Team - Gustavo

Updated: Oct 12, 2023

| 4 min read

Quick links

Overview

Pagination is a common operation used when designing any search solution. It consists of returning a subset of the results instead of the whole dataset every time for performance reasons.

Use Cases

There are two main use cases for pagination: 

  1. For users to see a “page” of results at a time, rather than all the documents in one pass. There are also some “infinite scroll” experiences that show only a subset of the documents.
  2. For users who need to gather all documents. Yet, returning all of them is not possible. The default window size is 10,000 documents, and returning millions of documents in a single query is not a good idea, since everything must be held in memory during the process. This process is referred to as “deep pagination.”

Pagination Methods

A common way to do pagination is using “from” and “size” parameters, which makes it very easy to calculate page numbers, and allow users to jump from page to page, forwards and backwards.

The problem is, “from” and “size” have to go through all the results to get the exact page the user is requesting. This is OK for small datasets, but can be a killer for huge indices, and users may not always want to jump pages, for example, in use case B present present above.

A more efficient way to do deep pagination is by using scroll, which is designed to return more results at once, but this too has its own limitations. 

Scroll uses search context windows to keep data consistent while pagination occurs. 

A “search context” window refers to the temporary state or snapshot that OpenSearch maintains during a scrolling operation. When a scroll request is initiated, OpenSearch takes a snapshot of the index’s current state, ensuring that the results returned during the entire scroll operation are consistent, even if the underlying data changes. The search context is maintained for a specific duration, which is specified in the settings.

The preferred way to use pagination on OpenSearch is by using search_after. Search_after provides a means to use the current search results cursor to query the next page of results. The cursor is a combination of “sort” parameters. So, users must provide the sort parameters from the last result, plus the same sort criteria.

The difference between search_after and scroll is that search_after results are stateless, there is no context window. This means that if data changes between queries, the results may not be consistent, which may or may not be problematic. 

For the sake of this article, we will assume that it is problematic to have inconsistencies. In this scenario, we need to freeze the dataset during our pagination regardless of index changes during that time.

To overcome this limitation we will use the Point in time (PIT) API, which is available from OpenSearch 2.4.

Paginate with Point in Time (PIT)

PIT allows users to make search_after queries stateful and obtain consistent results every single time, even if documents are updated or deleted. 

OpenSearch accomplishes this by locking a set of segments in time, and by not modifying or deleting any of the resources required for this PIT.

It’s important to highlight that PIT is not tied to a query. Users can run different queries using the same point in time and enjoy the same consistency. In other words, users can go back and forth and expect to see the exact same results.

Example

Let’s look at an example:

Before starting, let’s create a simple index.

POST test_pit/_doc
{
  "@timestamp": "2023-07-21T04:31:40.210+00:00",
  "title":"First Page"
}

POST test_pit/_doc
{
  "@timestamp": "2023-07-22T04:31:40.210+00:00",
  "title":"Second Page"
}

POST test_pit/_doc
{
  "@timestamp": "2023-07-23T04:31:40.210+00:00",
  "title":"Third Page"
}

To keep this example nice and clear, our queries will use “size 1,” so each document represents a page.

1. Create PIT 

The first step is to create the PIT that will be used in the search queries to do pagination.

POST /test_pit/_search/point_in_time?keep_alive=1h

You can also add additional parameters: 

ParameterDescriptionRequiredDefault value
test_pitThe name of the index/indices.Yes
keep_aliveDetermines how much time to keep the PIT freezing the index.Yes
routingRoutes the search requests to a specific shard.No_id
expand_wildcardsType of index that can match the wildcard pattern. Values: all, open, closed, hidden, and none.Noopen
preferenceNode or shard used to perform the search.Norandom

This query will return the pit_id which can be used in the search_after queries.

This is the response:

{
  "pit_id": "g5eAQQEIdGVzdF9waXQWcF9TVC1Sb1lSbE9iU2xWN2UtSVNwdwAWeTlMYkhvdUlRVWFjc09jcmkyN01EUQAAAAAAAAAABhZDc2Vfc190b1M1T19uRE1RbXpEQmZnARZwX1NULVJvWVJsT2JTbFY3ZS1JU3B3AAA=",
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "creation_time": 1696793510768
}

2. Run query with PIT 

Use the PIT from step #1 to query the data:

GET /_search
{
  "size": 1,
  "pit": {
    "id":  "g5eAQQEIdGVzdF9waXQWcF9TVC1Sb1lSbE9iU2xWN2UtSVNwdwAWeTlMYkhvdUlRVWFjc09jcmkyN01EUQAAAAAAAAAABhZDc2Vfc190b1M1T19uRE1RbXpEQmZnARZwX1NULVJvWVJsT2JTbFY3ZS1JU3B3AAA="
  },
  "sort": [ 
    {"@timestamp": {"order": "asc"}}
  ]
}

This will return the “First Page” document because of the sort order by @timestamp. 

Example of the returned “First Page” document.

To get the next page of results, send the sort value (@timestamp for this example) of the last hit as the search_after parameter along with the same pit.id sent in the first query:

GET /_search
{
  "size": 1,
  "pit": {
    "id":  "g5eAQQEIdGVzdF9waXQWcF9TVC1Sb1lSbE9iU2xWN2UtSVNwdwAWeTlMYkhvdUlRVWFjc09jcmkyN01EUQAAAAAAAAAABhZDc2Vfc190b1M1T19uRE1RbXpEQmZnARZwX1NULVJvWVJsT2JTbFY3ZS1JU3B3AAA="
  },
  "sort": [ 
    {"@timestamp": {"order": "asc"}}
  ],
  "search_after": [  
    "2023-07-21T04:31:40.210+00:00"
  ]
}

As expected, the “Second Page” document follows.

Example of “Second Page” document .

Let’s create a new page called: “Page Zero” whose @timestamp happens before “First Page:”

POST test_pit/_doc
{
  "@timestamp": "2023-07-20T04:31:40.210+00:00",
  "title":"Page Zero"
}

If we run the first query that returns results sorted by @timestamp asc, the “First Page” document will be returned because “Page Zero” is outside this “Point in Time’s” context.

Example of scenario where If we run the first query that returns results sorted by @timestamp asc, the “First Page” document will be returned because “Page Zero” is outside this “Point in Time’s” context.

But, What if I want to jump from page 1 to page 3? 

With from/size users can easily jump from page to page by changing the value of “from.” 

Now, what about search_after? 

With search_after users can use slicing to split result sets into many pieces that can be called “pages.” 

To achieve this, instead of defining a sorting order and search_after, just define a slice id (which will act a s page number), and a slice max, which will define how many pages you want to generate:

GET /_search
{
  "slice": {
    "id": 1,  
    "max": 2
  },
  "pit": {
    "id": "g5eAQQEIdGVzdF9waXQWcF9TVC1Sb1lSbE9iU2xWN2UtSVNwdwAWeTlMYkhvdUlRVWFjc09jcmkyN01EUQAAAAAAAAAABhZDc2Vfc190b1M1T19uRE1RbXpEQmZnARZwX1NULVJvWVJsT2JTbFY3ZS1JU3B3AAA="
  }
}

This will generate two slices:

  1. First Page/Second Page (id: 0)
  2. Third Page  (id: 1)

So, the query we just ran will return the “Third Page” document because we requested the slice id to be 1:

{
  "pit_id": "g5eAQQEIdGVzdF9waXQWcF9TVC1Sb1lSbE9iU2xWN2UtSVNwdwAWeTlMYkhvdUlRVWFjc09jcmkyN01EUQAAAAAAAAAABhZDc2Vfc190b1M1T19uRE1RbXpEQmZnARZwX1NULVJvWVJsT2JTbFY3ZS1JU3B3AAA=",
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": {
      "value": 1,
      "relation": "eq"
    },
    "max_score": null,
    "hits": [
      {
        "_index": "test_pit",
        "_id": "MKUthokBtJ4kgTgv2JjV",
        "_score": null,
        "_source": {
          "@timestamp": "2023-07-23T04:31:40.210+00:00",
          "title": "Third Page"
        },
        "sort": [
          1690086700210
        ]
      }
    ]
  }
}

Here, sort will apply per slice, and not for the whole result set, so this strategy is only useful if the overall order is not important.

Slicing also superpowers performance because many slices can be queried in parallel. This is a great  approach to getting a big dataset quickly.

Summary of Pagination Methods

The following table summarizes different pagination methods.

Featurefrom/sizescrollsearch_after
DescriptionA common way to paginate, allowing users to jump from page to page easily.Designed to return a larger number of results at once using context windows for data consistency.Uses the current search results cursor to query the next page of results, with the PIT overcoming statelessness for consistency.
OperationMust go through all the results to get the exact page the user is requesting.Uses context windows to keep data consistent during pagination.Uses the sort parameters of the last result plus the same sort criteria for querying the next page.
Data ConsistencyPaginating with from/size will produce different results if datasets change.Maintains data consistency.Data may not be consistent between queries without the use of PIT. With PIT, users can ensure consistency even after document updates/deletions.
StatefulnessNot statefulStateful, due to search context windows.Stateless without PIT. With PIT, stateful and consistent.
Large DatasetsCan be inefficient for large datasets. Max. results can return 10,000 by default, increasing values is not recommended.Handles larger datasets more efficiently.Efficient for large datasets, especially when coupled with PIT and slices.
Jump Between PagesAllows for easy page jumping by adjusting the from parameter up to 10,000 by default.Not possible with scroll.Page jumping is possible through slicing, breaking the result set into many pages.
PerformanceNot ideal for huge indices.Efficient for large result sets.Can handle large datasets quickly, especially when using slicing for parallel querying.
Real-time DataChanges in the data after the initial request might affect subsequent pages.Data changes after initial requests will not affect subsequent pages.Without PIT, changes in data might affect subsequent pages. With PIT, changes will not affect results.
Live vs BatchGood for displaying live results to users, and with small datasets (less than 10,000 documents).Not recommended for live results, but can be used to retrieve large datasets. For deep pagination search_after is recommended.Preferred method for live results of large datasets and data dumps.

Conclusion

There are many different strategies to paginate on documents. Search_after + PIT is the one with the best performance for deep pagination, and flexibility, allowing all the most relevant features: large datasets, jump between pages, and data consistency to freeze the results set.

In this article we explored how to create a sample set, and efficiently paginate through it by applying search_after with Point in time (PIT), and slicing.