Elasticsearch: When You Should Transform Your Data Instead of Using Aggregations

By Opster Expert Team - Flávio

Updated: Jan 28, 2024 | 2 min read

Transform API

Starting from version 7.3, Elasticsearch offers the Transform API, which allows you to convert existing Elasticsearch indices into summarized indices. This provides opportunities for new insights and analytics. 

With this API you can, for example:

  • Pivot your data into entity-centric indices that summarize the behavior of users, sessions or other entities in your data.
  • Find the latest document among all the documents that share a certain unique key (a minimal sketch of this follows the list).
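
For the second case, Elasticsearch offers the latest type of transform (added in a later release than the pivot type). The request below is a minimal sketch, assuming a hypothetical source index named user-events with a user_id keyword field and a @timestamp date field; it keeps only the most recent document per user in a destination index:

PUT _transform/latest-user-event
{
  "source": { "index": "user-events" },
  "latest": {
    "unique_key": ["user_id"],
    "sort": "@timestamp"
  },
  "dest": { "index": "latest-user-event" }
}

POST _transform/latest-user-event/_start

Once started, the transform writes one document per user_id to latest-user-event, always reflecting that user's most recent event.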

There are at least three use cases where you should consider using transforms instead of aggregations:

  • You need a complete feature index rather than a top-N set of items.
  • You need to sort aggregation results by a pipeline aggregation.
  • You want to create summary tables to optimize queries.

Consider the common use case of analyzing user web session data. This data is probably stored in one or more indices, with each document representing a single behavioral event in the user's interaction with the website. If you want to extract summarized values, such as the number of requests per time period, or even break that down by geolocation, the cluster can usually calculate this without trouble.

However, something like the average session duration could cause your cluster to run out of memory. In this scenario each document holds information only about its own event, so the cluster is forced to group every event by session and locate the documents that represent the very first and very last event of each session, which requires complex query logic and a lot of memory, roughly as in the sketch below.
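
To make the cost concrete, here is a rough sketch of what the query-time approach could look like, assuming hypothetical web-events indices with session_id and @timestamp fields. A terms aggregation has to build one bucket per session on every request, and with millions of sessions that quickly runs into memory and bucket limits:

GET web-events/_search
{
  "size": 0,
  "aggs": {
    "per_session": {
      "terms": { "field": "session_id", "size": 10000 },
      "aggs": {
        "session_start": { "min": { "field": "@timestamp" } },
        "session_end": { "max": { "field": "@timestamp" } },
        "duration_ms": {
          "bucket_script": {
            "buckets_path": { "start": "session_start", "end": "session_end" },
            "script": "params.end - params.start"
          }
        }
      }
    }
  }
}

Even this only covers the first 10,000 sessions; averaging durations across all sessions would mean paging through every bucket or raising limits, and all of it is recomputed on every query.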

This is where transforms can come in handy. You can have this calculated in an ongoing background process that merges the data for related events into a summarized index, keeping your source index unmodified.
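
As a minimal sketch under the same assumptions (hypothetical web-events indices with session_id and @timestamp fields), a continuous pivot transform could maintain one summary document per session in a new index, updating it as new events arrive:

PUT _transform/web-sessions
{
  "source": { "index": "web-events" },
  "pivot": {
    "group_by": {
      "session_id": { "terms": { "field": "session_id" } }
    },
    "aggregations": {
      "session_start": { "min": { "field": "@timestamp" } },
      "session_end": { "max": { "field": "@timestamp" } },
      "event_count": { "value_count": { "field": "session_id" } },
      "duration_ms": {
        "bucket_script": {
          "buckets_path": { "start": "session_start.value", "end": "session_end.value" },
          "script": "params.end - params.start"
        }
      }
    }
  },
  "dest": { "index": "web-sessions" },
  "frequency": "5m",
  "sync": { "time": { "field": "@timestamp", "delay": "60s" } }
}

POST _transform/web-sessions/_start

The grouping work happens incrementally in the background, so a query such as an avg aggregation over duration_ms on the web-sessions index becomes a cheap operation.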

Data Rollups

Since we’ve mentioned transforms, another approach worth considering is data rollups. For a full guide on how to roll up your data, see here.

With data rollups you can summarize high-granularity time-based data into a reduced-granularity format for long-term storage. Elasticsearch implements this by creating a separate index that holds the rolled-up data. It is a normal Elasticsearch index, so you can query it with the Query DSL as usual.
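
As a minimal sketch, assuming hypothetical metrics indices matching sensor-* with a @timestamp date field, a node keyword field, and a numeric temperature field, a rollup job that summarizes the data into hourly buckets could look like this:

PUT _rollup/job/sensor-hourly
{
  "index_pattern": "sensor-*",
  "rollup_index": "sensor_rollup",
  "cron": "0 0 * * * ?",
  "page_size": 1000,
  "groups": {
    "date_histogram": { "field": "@timestamp", "fixed_interval": "1h" },
    "terms": { "fields": ["node"] }
  },
  "metrics": [
    { "field": "temperature", "metrics": ["min", "max", "avg", "value_count"] }
  ]
}

POST _rollup/job/sensor-hourly/_start

The cron expression controls when the job looks for new data to roll up, and sensor_rollup is the separate index that holds the summarized documents.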

This index will be smaller than the original one (since it keeps only summarized data), so it will not only save you some space (if you choose to delete the original data from which the rollup was derived) but also drastically improve the response time of your aggregations.

Not only that: through the _rollup_search endpoint, the Rollup API lets you run a single search/aggregation request whose result automatically merges your summarized and real-time data, giving preference to the latter where they overlap. You benefit from this parallel structure (the rollup index) that Elasticsearch created and keeps in sync with the original source of truth, which can certainly help improve your aggregation performance.
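
Continuing the hypothetical sensor example, a single request to _rollup_search can aggregate across live indices and the rollup index at once (it only returns aggregation results, so size must be 0 or omitted):

GET sensor-*,sensor_rollup/_rollup_search
{
  "size": 0,
  "aggs": {
    "max_temperature": {
      "max": { "field": "temperature" }
    }
  }
}

Elasticsearch answers from the rolled-up buckets for older periods and from the raw documents where live data is available, merging both into a single aggregation result.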

Remember to pay attention to some current limitations Elastic mentions in its documentation.

The Rollup functionality allows fields to be grouped with the following aggregations:

  • Date histogram aggregation
  • Histogram aggregation
  • Terms aggregation

And the following metrics can be specified for numeric fields:

  • Min aggregation
  • Max aggregation
  • Sum aggregation
  • Average aggregation
  • Value count aggregation

Improve Aggregation Performance

For use cases in which aggregations should definitely be used, it's important to know how to improve their performance by making small but meaningful adjustments. Follow the steps in this guide to improve performance.