Elasticsearch Count Distinct

By Opster Team

Updated: Aug 17, 2023


Introduction 

Elasticsearch provides a wide range of functionality, including the ability to count the distinct values of a field in a dataset, which is useful for data analysis. This article explains how to count distinct values in Elasticsearch, with examples and step-by-step instructions.

The cardinality aggregation feature

To count distinct values in Elasticsearch, we use the cardinality aggregation. Cardinality is the number of distinct elements in a collection. In Elasticsearch, the cardinality aggregation returns an approximate count of distinct values. The approximation is necessary because an exact count requires keeping every distinct value in memory and merging those sets across shards, which becomes resource-intensive on large datasets.

The cardinality aggregation is based on the HyperLogLog++ algorithm, which provides a trade-off between accuracy and memory usage. While it may not always provide the exact count, it is usually close enough for most practical purposes.
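To make the accuracy versus memory trade-off concrete, here is a minimal sketch of a plain HyperLogLog estimator in Python. It is a simplified illustration only, not the HyperLogLog++ variant or the implementation Elasticsearch uses; the function name `hll_estimate` and the parameter `p` are assumptions made for this example.

```python
import hashlib
import math

def hll_estimate(values, p=10):
    """Estimate the number of distinct values using 2**p small registers."""
    m = 1 << p
    registers = [0] * m
    for v in values:
        # Derive a 64-bit hash from the value.
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)   # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)       # bias correction, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw

# More registers (larger p) cost more memory but reduce the typical error.
for p in (10, 14):
    print(p, round(hll_estimate(range(100000), p=p)))
```

The estimator only ever stores `2**p` registers, no matter how many values it sees, which is why the memory cost is fixed while the result is approximate.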

Using the cardinality aggregation

To use the cardinality aggregation, you need to specify the field you want to count distinct values for. Here is a basic example:

json
GET /_search
{
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "field_name"
      }
    }
  }
}

In this example, replace “field_name” with the name of the field you want to count distinct values for. The response will include an aggregation named “distinct_count”, which contains the approximate count of distinct values.
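As a rough illustration of how a client might build this request and read the result, the following Python sketch uses plain dictionaries only. The helper names are hypothetical, the sample response is a hand-written fragment with a made-up value, and `"size": 0` is an optional addition that skips the search hits when only the aggregation is needed.

```python
import json

def build_distinct_count_request(field, agg_name="distinct_count"):
    """Build a search body that returns only the cardinality aggregation."""
    return {
        "size": 0,  # skip the hits; only the aggregation result is needed
        "aggs": {agg_name: {"cardinality": {"field": field}}},
    }

def read_distinct_count(response, agg_name="distinct_count"):
    """Pull the approximate distinct count out of a search response."""
    return response["aggregations"][agg_name]["value"]

# Illustrative response fragment; the value 42 is made up for this example.
sample_response = {"aggregations": {"distinct_count": {"value": 42}}}

print(json.dumps(build_distinct_count_request("field_name"), indent=2))
print(read_distinct_count(sample_response))
```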

Controlling the precision of the cardinality aggregation

Elasticsearch allows you to control the precision of the cardinality aggregation through the `precision_threshold` option. This option sets the count below which results are expected to be close to exact; above it, counts may become less accurate. The higher the value, the more accurate the count, but also the more memory used. The default is 3000 and the maximum supported value is 40000; higher settings behave as if 40000 had been specified.

Here is an example of how to use the `precision_threshold` option:

json
GET /_search
{
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "field": "field_name",
        "precision_threshold": 1000
      }
    }
  }
}

In this example, counts below about 1000 distinct values are expected to be close to exact. Beyond that, the count may be less accurate.
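To get a feel for the memory side of this trade-off, the sketch below applies the rule of thumb from the Elasticsearch documentation that a threshold of c needs about c * 8 bytes per aggregation instance. The function name is hypothetical, and the result is a rough estimate rather than a measurement.

```python
MAX_PRECISION_THRESHOLD = 40000  # values above this behave like 40000

def approx_memory_bytes(precision_threshold):
    """Rough per-aggregation memory estimate: about threshold * 8 bytes."""
    effective = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return effective * 8

for threshold in (100, 1000, 3000, 40000):
    print(threshold, approx_memory_bytes(threshold), "bytes")
```

Keep in mind that this cost applies to each bucket when the cardinality aggregation is nested under a bucket aggregation.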

Multi-field cardinality aggregation

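Elasticsearch also allows you to count distinct combinations of values across multiple fields. This is done using a script that combines the field values into a single value. Note that a script runs for every matching document at query time, so it is slower than aggregating on a single indexed field. Here is an example: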

json
GET /_search
{
  "aggs": {
    "distinct_count": {
      "cardinality": {
        "script": {
          "source": "doc['field1'].value + ' ' + doc['field2'].value"
        }
      }
    }
  }
}

In this example, Elasticsearch will count distinct combinations of “field1” and “field2”. The script assumes that every matching document has a value for both fields.
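One detail worth checking with this approach is the separator. If the separator can also appear inside the field values, two different combinations may collapse into the same string and be counted once. The Python sketch below simulates this outside Elasticsearch with made-up values; the helper names and the choice of `\u0000` as an alternative separator are assumptions for illustration.

```python
# Hypothetical (field1, field2) pairs; the last one is a true duplicate.
pairs = [("a b", "c"), ("a", "b c"), ("a", "b c")]

def count_exact(pairs):
    """Distinct combinations, compared as tuples."""
    return len(set(pairs))

def count_joined(pairs, sep):
    """Distinct combinations after joining both values with a separator."""
    return len({f1 + sep + f2 for f1, f2 in pairs})

print(count_exact(pairs))             # two genuinely different combinations
print(count_joined(pairs, " "))       # both pairs become "a b c"
print(count_joined(pairs, "\u0000"))  # separator absent from the values
```

Choosing a separator that cannot occur in either field keeps the combined string unambiguous.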

Limitations and considerations

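While the cardinality aggregation is a powerful tool, it’s important to be aware of its limitations. The count is approximate, and the error tends to grow once the number of distinct values exceeds the precision threshold. Memory usage also grows with the precision threshold, and it adds up quickly when the cardinality aggregation is nested under a bucket aggregation that produces many buckets.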

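Furthermore, the cardinality of `text` fields is rarely meaningful. Text fields are analyzed, so the aggregation would count distinct terms (tokens) rather than distinct field values, and by default Elasticsearch rejects aggregations on `text` fields unless `fielddata` is enabled, which can consume a lot of heap memory. In most cases it’s better to aggregate on a `keyword` field, or on a `keyword` sub-field of the text field.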
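The difference between distinct terms and distinct values can be sketched in a few lines of Python. The lowercase-and-split step below is only a rough stand-in for a real analyzer, and the field name `city` and its sample values are hypothetical. The two dictionaries show what a mapping with a `keyword` sub-field and the matching aggregation body could look like.

```python
values = ["New York", "New Jersey"]

def distinct_values(values):
    """What a keyword field would count: whole, unmodified values."""
    return len(set(values))

def distinct_terms(values):
    """Rough stand-in for an analyzed text field: lowercase, split on spaces."""
    return len({term for v in values for term in v.lower().split()})

print(distinct_values(values))  # whole values: "New York", "New Jersey"
print(distinct_terms(values))   # terms: "new", "york", "jersey"

# Mapping with a keyword sub-field (field name "city" is hypothetical).
mapping = {
    "mappings": {
        "properties": {
            "city": {"type": "text", "fields": {"keyword": {"type": "keyword"}}}
        }
    }
}

# Aggregate on the keyword sub-field rather than on the text field itself.
request_body = {
    "size": 0,
    "aggs": {"distinct_count": {"cardinality": {"field": "city.keyword"}}},
}
```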