Quick links
- Overview and Background
- Composite aggregation uses
- How to implement composite aggregations in OpenSearch
- Consequences
- Notes and good things to know
Overview and Background
In OpenSearch, an aggregation groups related documents together and computes summary information about them. The aggregation framework collects data from the documents that match a search request, which helps in building summaries of that data.
With aggregations you can not only search your data, but also take it a step further and extract analytical information, making it easier for organizations to glean insights from amassed data.
Aggregations are used throughout visualization front ends such as OpenSearch Dashboards and Kibana: in dashboards, the APM app, the Machine Learning app, and so on.
OpenSearch organizes aggregations into:
- Metric aggregations – compute metrics from field values, like a sum or average.
- Bucket aggregations – group documents into buckets (also known as bins) based on field values, ranges, or other criteria.
- Pipeline aggregations – accept data from other aggregations rather than from fields or documents.
Unlike metric aggregations, bucket aggregations do not calculate metrics over fields; they create buckets of documents. Each bucket has a criterion, which depends on the aggregation type, that decides whether a document in the current context “falls” into it. In other words, the buckets effectively define document sets. In addition to the buckets themselves, bucket aggregations compute and return the number of documents that “fell into” each bucket.
Also in contrast to metric aggregations, bucket aggregations can hold sub-aggregations. These sub-aggregations are computed for each bucket produced by their “parent” bucket aggregation. There are various bucket aggregators, each with its own “bucketing” strategy. Some define a single bucket, some define a fixed number of buckets, and others create the buckets dynamically as the data is processed.
The composite aggregation is a bucket aggregation. More precisely, it is a multi-bucket aggregation that combines values from several sources to produce composite buckets.
Composite aggregation uses
Unlike other multi-bucket aggregations, the composite aggregation lets you paginate efficiently through every bucket of a multi-level aggregation. It offers a way to stream all the buckets of a particular aggregation, similar to what scroll does for documents. The composite buckets are built from the combinations of the values extracted or created for each document, and each combination is treated as one composite bucket.
How to implement composite aggregations in OpenSearch
The main required parameter of a composite aggregation is sources, which specifies the source fields to use when building the composite buckets. Each source must be given a unique name. The order in which the sources are defined determines the order in which the keys are returned. The sources parameter accepts the following value source types: terms, histogram, date_histogram, and geotile_grid.
The terms value source works like the terms aggregation: the values are extracted from a field. It is possible to use a runtime field to generate values for the composite buckets, just like with terms aggregation. The following example runs a composite aggregation on the authors index with a single terms source, named “authors”, that aggregates the “author_name” field.
GET authors/_search
{
  "size": 0,
  "aggs": {
    "our_buckets": {
      "composite": {
        "sources": [
          { "authors": { "terms": { "field": "author_name" } } }
        ]
      }
    }
  }
}
The histogram value source builds fixed-size intervals over numeric values. The interval parameter defines how the numeric values are transformed. For example, with an interval of 5, the value 106 is translated to 105, which is the key for the interval between 105 and 110. It is possible to use a runtime field to generate values for the composite buckets, just like with histogram aggregation.
The following example runs a composite aggregation on the authors index with a histogram source named “booksnum” and an interval of 5. The source uses the books_number field, which holds the number of books each author has published.
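The rounding described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not OpenSearch's implementation, and the function name histogram_key is made up for this example.

```python
import math

def histogram_key(value, interval):
    """Round a value down to the start of its fixed-size interval.

    Illustrative only: mirrors the "106 -> 105 for interval 5" example.
    """
    return math.floor(value / interval) * interval

# 106 falls into the interval [105, 110), whose key is 105.
print(histogram_key(106, 5))
```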
GET authors/_search
{
  "size": 0,
  "aggs": {
    "our_buckets": {
      "composite": {
        "sources": [
          { "booksnum": { "histogram": { "field": "books_number", "interval": 5 } } }
        ]
      }
    }
  }
}
The date_histogram value source is similar to the histogram value source, except for the interval: in date_histogram, the interval is defined by a date/time expression.
The following example runs a composite aggregation on the books index with a date_histogram source named “week” and an interval of one week. The source uses the publish_date field.
The publish_date field holds the date on which a book was published. The aggregation creates one interval per week and translates each publish_date value to the start of the interval that contains it.
GET books/_search
{
  "size": 0,
  "aggs": {
    "our_buckets": {
      "composite": {
        "sources": [
          { "week": { "date_histogram": { "field": "publish_date", "calendar_interval": "1w" } } }
        ]
      }
    }
  }
}
The following terms can be used to express an interval in the date_histogram: second, minute, hour, day, week, month, quarter, and year. The time units parser supports the use of abbreviations to specify time values. Note that there is no support for fractional time values, but you can work around this by switching to a different time unit (e.g. 2.5h could instead be specified as 150m).
The date_histogram source accepts three additional parameters: format, time_zone, and offset. Internally, a date is stored as a 64-bit number representing a timestamp in milliseconds since the epoch, and by default these timestamps are returned as the bucket keys. To return a formatted date string instead, specify a format with the format parameter. Below is an example of using the format parameter in the date_histogram source.
GET books/_search
{
  "size": 0,
  "aggs": {
    "our_buckets": {
      "composite": {
        "sources": [
          {
            "formatteddate": {
              "date_histogram": {
                "field": "publish_date",
                "calendar_interval": "1d",
                "format": "MM-dd-yyyy"
              }
            }
          }
        ]
      }
    }
  }
}
OpenSearch stores date-times as UTC. By default, UTC is also used for all bucketing and rounding. To specify that bucketing should utilize a different time zone, use the time_zone parameter.
Use the offset parameter to shift the start of each bucket by a positive (+) or negative (-) duration, such as 1h for an hour or 1d for a day. For instance, with an interval of one day, each bucket runs from midnight to midnight. With the offset parameter set to +7h, each bucket runs from 7 am to 7 am instead. The offset is applied after any time_zone adjustment has been made.
For geo_point fields, the geotile_grid value source groups points into buckets that correspond to the cells of a grid. The resulting grid can be sparse, because it only contains cells that have matching data. Each cell corresponds to a map tile of the kind used by many online map sites. Each cell is identified using the format “{zoom}/{x}/{y}”, where zoom is equal to the precision that the user specifies.
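The effect of offset on daily buckets can be illustrated with a simplified Python sketch. It assumes UTC and fixed 24-hour days (no time_zone handling or daylight saving changes), and the helper name day_bucket_start is hypothetical, not part of OpenSearch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
DAY = timedelta(days=1)

def day_bucket_start(ts, offset=timedelta(0)):
    """Return the start of the one-day bucket that contains ts.

    Simplified model: bucket edges are midnight UTC shifted by `offset`.
    """
    shifted = ts - offset - EPOCH          # move edges back to midnight
    whole_days = shifted // DAY            # floor to whole days since epoch
    return EPOCH + whole_days * DAY + offset

ts = datetime(2024, 3, 10, 5, 30, tzinfo=timezone.utc)
print(day_bucket_start(ts))                      # bucket starting at midnight
print(day_bucket_start(ts, timedelta(hours=7)))  # bucket starting at 7 am the day before
```

With the +7h offset, a document timestamped at 05:30 belongs to the bucket that started at 7 am on the previous day.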
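To make the “{zoom}/{x}/{y}” format concrete, here is a sketch that computes a tile key from a latitude and longitude using the standard Web Mercator ("slippy map") tile formula. It assumes that the grid follows that common tiling scheme; the function name geotile_key is hypothetical and this is not OpenSearch's own code.

```python
import math

def geotile_key(lat, lon, zoom):
    """Return a "{zoom}/{x}/{y}" key using the standard Web Mercator tiling."""
    n = 2 ** zoom                                  # tiles per axis at this zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that edge values (lon = 180, extreme latitudes) stay on the grid.
    x = min(max(x, 0), n - 1)
    y = min(max(y, 0), n - 1)
    return f"{zoom}/{x}/{y}"

print(geotile_key(51.5, -0.12, 6))  # a point in London at precision 6
```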
The following example runs a composite aggregation on the authors index with a geotile_grid source named “authorsloc” and a precision of 6. The source uses the authors_location field, which holds the location of the author.
GET authors/_search
{
  "size": 0,
  "aggs": {
    "our_buckets": {
      "composite": {
        "sources": [
          { "authorsloc": { "geotile_grid": { "field": "authors_location", "precision": 6 } } }
        ]
      }
    }
  }
}
The highest geotile precision, 29, produces cells smaller than 10 cm by 10 cm. This precision is particularly well suited to composite aggregations, because the tiles do not all have to be generated and loaded into memory. Restricting the geotile source to a particular geo bounding box reduces the range of tiles used. These bounds are useful when only a small part of a geographical area needs high-precision tiling.
The sources parameter accepts an array of value sources, and different value sources can be combined to create composite buckets. In the example below, composite buckets are built from the values produced by two value sources, a date_histogram and a terms source. Each bucket is made up of two values, one for each value source defined in the aggregation. Any combination of sources is allowed, and the composite buckets preserve the order of the array.
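The 10 cm figure can be checked with a quick calculation. This sketch assumes that tiles follow Web Mercator tiling, where the equator (about 40,075 km) is split into 2^zoom tiles; the constant and function name are illustrative.

```python
# Equatorial circumference used by Web Mercator, in metres (assumed value).
EQUATOR_M = 40_075_016.686

def tile_width_at_equator_m(zoom):
    """Approximate east-west width of one tile at the equator, in metres."""
    return EQUATOR_M / (2 ** zoom)

width_cm = tile_width_at_equator_m(29) * 100
print(f"{width_cm:.1f} cm")  # about 7.5 cm, below the 10 cm mentioned above
```

Tiles get narrower away from the equator, so this is the upper bound on the tile width.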
GET books/_search
{
  "size": 0,
  "aggs": {
    "our_buckets": {
      "composite": {
        "sources": [
          { "week": { "date_histogram": { "field": "publish_date", "calendar_interval": "1w" } } },
          { "authors": { "terms": { "field": "author_name" } } }
        ]
      }
    }
  }
}
By default, the composite buckets are sorted by their natural ordering, with values in ascending order. When multiple value sources are used, the ordering is applied source by source: the first value of one composite bucket is compared with the first value of the other, and if they are equal, the next values are used to break the tie. You can set the sort direction for each value source by setting order to asc (the default) or desc directly in the value source definition. In the example below, the composite buckets are sorted in descending order when comparing values from the date_histogram source and in ascending order when comparing values from the terms source.
GET books/_search
{
  "size": 0,
  "aggs": {
    "our_buckets": {
      "composite": {
        "sources": [
          {
            "week": {
              "date_histogram": {
                "field": "publish_date",
                "calendar_interval": "1w",
                "order": "desc"
              }
            }
          },
          { "authors": { "terms": { "field": "author_name", "order": "asc" } } }
        ]
      }
    }
  }
}
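The ordering rule (compare the first source, break ties with the next) can be mimicked on the client side. The sketch below is illustrative only: it sorts a few made-up composite keys using the per-source order values from the request above, and sort_composite_keys is a hypothetical helper.

```python
def sort_composite_keys(keys, sources):
    """Sort composite keys source by source.

    `sources` is a list of (name, order) pairs in definition order.
    Stable sorts applied from the last source to the first yield an
    ordering where earlier sources take precedence and later ones break ties.
    """
    ordered = list(keys)
    for name, order in reversed(sources):
        ordered.sort(key=lambda k: k[name], reverse=(order == "desc"))
    return ordered

# Made-up keys: "week" is a millisecond timestamp, "authors" a name.
keys = [
    {"week": 1704672000000, "authors": "Bob"},
    {"week": 1704067200000, "authors": "Alice"},
    {"week": 1704672000000, "authors": "Alice"},
]
for key in sort_composite_keys(keys, [("week", "desc"), ("authors", "asc")]):
    print(key)
```

The most recent week comes first, and within the same week the authors appear in alphabetical order.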
By default, documents that don’t have a value for a given source are ignored. You can include them in the response by setting the missing_bucket parameter to true (its default value is false).
The optional missing_order parameter controls where the null bucket appears. If missing_order is first or last, the null bucket is placed in the first or last position respectively. If missing_order is omitted or set to default, the source’s order determines the bucket’s position: the null bucket comes first if the order is asc (ascending) and last if the order is desc (descending).
The size parameter defines how many composite buckets are returned. Each composite bucket counts as a single bucket, so a size of 10 returns the first 10 composite buckets created from the value sources. The response contains the values of each composite bucket, extracted from each value source. The default size is 10.
If the number of composite buckets is too large (or unknown) to be returned in a single response, the retrieval can be split across multiple requests. Because composite buckets are flat by nature, the requested size is exactly the number of composite buckets returned in the response (assuming there are at least size composite buckets to return). To retrieve all composite buckets, it is preferable to use a small size (100 or 1000, for example) and then use the after parameter to fetch the next results.
Like any bucket aggregation, the composite aggregation can contain sub-aggregations. These sub-aggregations can be used to compute other buckets or statistics on each composite bucket created by the parent aggregation.
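The placement rules for the null bucket can be summarized as a small decision function. This is just a restatement of the rules above in Python; the function name null_bucket_position is made up for illustration.

```python
def null_bucket_position(order="asc", missing_order="default"):
    """Return "first" or "last": where the null bucket ends up.

    An explicit missing_order of "first" or "last" wins; otherwise the
    source's sort order decides (asc -> first, desc -> last).
    """
    if missing_order in ("first", "last"):
        return missing_order
    return "first" if order == "asc" else "last"

print(null_bucket_position(order="desc"))                        # last
print(null_bucket_position(order="desc", missing_order="first")) # first
```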
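A pagination loop might look like the following sketch. It assumes that each response carries an after_key for the last returned bucket, and that passing this value as after in the next request resumes from there. The search argument is a hypothetical callable that sends a request body and returns the parsed JSON response; wire it to whichever client you use.

```python
def iter_composite_buckets(search, sources, page_size=1000, agg_name="our_buckets"):
    """Yield every composite bucket, one page at a time.

    `search` is a hypothetical callable: request body (dict) -> response (dict).
    """
    after = None
    while True:
        composite = {"size": page_size, "sources": sources}
        if after is not None:
            composite["after"] = after  # resume after the last key seen
        body = {"size": 0, "aggs": {agg_name: {"composite": composite}}}
        result = search(body)["aggregations"][agg_name]
        yield from result["buckets"]
        after = result.get("after_key")
        # Stop when the page is empty or no further key is offered.
        if after is None or not result["buckets"]:
            return

def demo_search(body):
    """Stand-in for a real client call; serves three fake buckets."""
    names = ["ann", "bob", "cy"]
    comp = body["aggs"]["our_buckets"]["composite"]
    after = comp.get("after")
    rest = [n for n in names if after is None or n > after["authors"]]
    page = rest[: comp["size"]]
    agg = {"buckets": [{"key": {"authors": n}, "doc_count": 1} for n in page]}
    if page:
        agg["after_key"] = {"authors": page[-1]}
    return {"aggregations": {"our_buckets": agg}}

sources = [{"authors": {"terms": {"field": "author_name"}}}]
for bucket in iter_composite_buckets(demo_search, sources, page_size=2):
    print(bucket["key"])
```

Writing the loop as a generator keeps memory use flat: only one page of buckets is held at a time, whatever the total number of buckets.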
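As an illustration, the request body below nests an avg metric under the composite aggregation so that each composite bucket also reports an average. The field name page_count and the sub-aggregation name avg_value are assumptions made for this sketch; they do not come from the indices described above.

```python
import json

def composite_with_avg(metric_field):
    """Build a request body with an avg sub-aggregation per composite bucket.

    `metric_field` is a hypothetical numeric field on the index.
    """
    return {
        "size": 0,
        "aggs": {
            "our_buckets": {
                "composite": {
                    "sources": [
                        {"authors": {"terms": {"field": "author_name"}}}
                    ]
                },
                # Sub-aggregations sit next to "composite", not inside it.
                "aggs": {
                    "avg_value": {"avg": {"field": metric_field}}
                },
            }
        },
    }

print(json.dumps(composite_with_avg("page_count"), indent=2))
```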
Consequences
The cost of composite aggregation is high. Before deploying a composite aggregation in production, load test your application.
Notes and good things to know
- OpenSearch caches the results of frequently run aggregations in the shard request cache for faster responses. To get cached results, use the same preference string for each search. OpenSearch routes searches with the same preference string to the same shards. If the data on those shards doesn’t change between searches, the shards return cached aggregation results.
- For best performance, set the index sort on the index so that it matches, fully or partially, the source order in the composite aggregation.
- Setting track_total_hits to false in the request is recommended so that the aggregation can terminate early.
- The order of the sources is significant. If your use case does not depend on the order of the sources, you can follow these simple rules: place the fields with the highest cardinality first, and make sure the field order matches the index sort.
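The notes above can be combined into one sketch: an index created with a sort that matches the composite sources, and a search that disables total hit tracking and uses a fixed preference string. The index sort settings (index.sort.field, index.sort.order), track_total_hits, and the preference query parameter are standard options, but the preference value weekly-report is an arbitrary example, and the field mappings are assumed to support sorting.

```python
from urllib.parse import urlencode

# Body for creating the index (PUT /books), sorted like the sources below.
index_body = {
    "settings": {
        "index": {
            "sort.field": ["publish_date", "author_name"],
            "sort.order": ["asc", "asc"],
        }
    }
}

# Same preference string on every call so repeated searches hit the same shards.
search_path = "/books/_search?" + urlencode({"preference": "weekly-report"})

search_body = {
    "size": 0,
    "track_total_hits": False,  # lets the aggregation terminate early
    "aggs": {
        "our_buckets": {
            "composite": {
                "sources": [
                    {"week": {"date_histogram": {
                        "field": "publish_date", "calendar_interval": "1w"}}},
                    {"authors": {"terms": {"field": "author_name"}}},
                ]
            }
        }
    },
}

print(search_path)
```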