Counting Unique Values in Elasticsearch: An In-depth Guide

By Opster Team

Updated: Nov 5, 2023

Overview

Elasticsearch can count the unique values of a field across a dataset, which is particularly useful when the dataset is too large to count by hand. In this article, we will walk through counting unique values in Elasticsearch using the cardinality aggregation.

Understanding Cardinality Aggregation

The cardinality aggregation is a single-value metric aggregation that counts the distinct values of a field. It is based on HyperLogLog++, a probabilistic algorithm that estimates the number of distinct values from hashes of those values instead of storing every value it sees. The algorithm lets you trade memory usage against accuracy.
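To give a feel for how this kind of probabilistic counting works, here is a minimal sketch of plain HyperLogLog in Python. It is a simplified illustration, not HyperLogLog++ and not Elasticsearch's implementation: the function name `hll_estimate`, the SHA-1 hash, and the default of 2^12 registers are all assumptions made for this example.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct items with a plain HyperLogLog sketch.

    Illustrative only: Elasticsearch uses HyperLogLog++, which adds
    further corrections on top of this basic idea.
    """
    m = 1 << p                      # number of registers (2^p)
    registers = [0] * m
    for value in values:
        # Deterministic 64-bit hash of the value (SHA-1 is an arbitrary choice).
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - p)                  # first p bits select a register
        rest = h & ((1 << (64 - p)) - 1)       # remaining 64 - p bits
        # Rank = position of the leftmost 1-bit within the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)

    alpha = 0.7213 / (1 + 1.079 / m)           # bias constant, valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


# 50,000 inputs, but only 10,000 distinct values among them.
estimate = hll_estimate(i % 10000 for i in range(50000))
print(round(estimate))
```

The sketch only ever keeps the `m` registers, however many values pass through it, which is why the result is an estimate and why memory stays bounded.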

How to Use Cardinality Aggregation

To count unique values in Elasticsearch, you can use the cardinality aggregation feature. Here is a step-by-step guide on how to use this feature:

Step 1: Start by choosing the Elasticsearch index you want to query. This is the dataset you will be working with.

Step 2: Next, choose the field whose unique values you want to count. The field must exist in that index. For string data, use a `keyword` field, since aggregations on analyzed `text` fields are disabled by default.

Step 3: Use the cardinality aggregation feature to count the unique values in the defined field. Here is an example of how to do this:

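GET /your_index_name/_search
{
  "size": 0,
  "aggs": {
    "unique_count": {
      "cardinality": {
        "field": "your_field_name"
      }
    }
  }
}

In the above example, replace `your_index_name` with the name of your index and `your_field_name` with the name of the field whose unique values you want to count. Setting `size` to 0 omits the matching documents from the response, so only the aggregation result is returned.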

Step 4: Run the query. The response reports the approximate count of unique values in the specified field under `aggregations.unique_count.value`.
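The steps above can be sketched in Python as two small helpers: one builds the request body and one reads the count out of the response. The helper names and the sample response are hypothetical and the value 42 is made up for illustration. Sending the request is left out, since that depends on which client you use.

```python
def cardinality_query(field, agg_name="unique_count"):
    """Build the body of a search request with one cardinality aggregation."""
    return {
        "size": 0,  # skip the hits, only the aggregation result is needed
        "aggs": {agg_name: {"cardinality": {"field": field}}},
    }


def extract_unique_count(response, agg_name="unique_count"):
    """Read the approximate distinct count out of a search response."""
    return response["aggregations"][agg_name]["value"]


body = cardinality_query("your_field_name")

# Hypothetical response, trimmed to the part that matters here.
sample_response = {"aggregations": {"unique_count": {"value": 42}}}
count = extract_unique_count(sample_response)
print(count)
```

The aggregation name (`unique_count` here) is arbitrary, but the same name must be used when reading the response.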

Precision Control in Cardinality Aggregation

While the cardinality aggregation is highly useful, it returns an approximate count, not an exact one. This is due to the probabilistic nature of the HyperLogLog++ algorithm. However, Elasticsearch lets you control the accuracy of the count through the `precision_threshold` option.

The `precision_threshold` option sets a threshold below which counts are expected to be close to accurate. Above it, counts may become less accurate. The default value is 3000 and the maximum supported value is 40000. Here is an example of how to use this option:

GET /your_index_name/_search
{
  "size": 0,
  "aggs": {
    "unique_count": {
      "cardinality": {
        "field": "your_field_name",
        "precision_threshold": 100
      }
    }
  }
}

In the above example, the count is expected to be close to accurate as long as the field holds fewer than about 100 unique values. Beyond that, the estimate becomes less precise.
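As a rough sketch of how this option might be set from code, the hypothetical Python helper below adds `precision_threshold` to a cardinality aggregation. It caps the value at 40000 on the client side, on the assumption that this mirrors the documented behavior that larger thresholds have the same effect as 40000.

```python
MAX_PRECISION_THRESHOLD = 40000  # documented maximum; larger values act like 40000


def cardinality_agg(field, precision_threshold=None):
    """Build a cardinality aggregation, optionally with a precision threshold."""
    agg = {"field": field}
    if precision_threshold is not None:
        if precision_threshold < 0:
            raise ValueError("precision_threshold must not be negative")
        # Cap locally so the request shows the value that will actually apply.
        agg["precision_threshold"] = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return {"cardinality": agg}


low = cardinality_agg("your_field_name", precision_threshold=100)
capped = cardinality_agg("your_field_name", precision_threshold=100000)
print(low)
print(capped)
```

Leaving `precision_threshold` unset keeps the server-side default.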

Limitations of Cardinality Aggregation

While the cardinality aggregation is a powerful tool, it does have some limitations. The most significant is that it returns an approximate count, as discussed above, because of the probabilistic nature of the HyperLogLog++ algorithm.

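Another limitation is memory usage. HyperLogLog++ does not store a data structure for each unique value: its memory use depends on the configured precision, roughly `precision_threshold` * 8 bytes per aggregation, regardless of how many unique values the field holds. Memory use can still become high when `precision_threshold` is large and the cardinality aggregation runs inside a bucket aggregation with many buckets, because each bucket keeps its own sketch.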
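For a back-of-the-envelope estimate, the Elasticsearch documentation gives a rule of thumb of about `precision_threshold` * 8 bytes per cardinality aggregation. The Python sketch below applies that rule. The function name and the `buckets` parameter are assumptions for this example, where `buckets` stands for the number of parent buckets that each compute their own cardinality.

```python
def estimated_sketch_bytes(precision_threshold, buckets=1):
    """Rough memory estimate: about precision_threshold * 8 bytes per sketch."""
    return precision_threshold * 8 * buckets


# One top-level aggregation at the default threshold of 3000.
single = estimated_sketch_bytes(3000)
# The same aggregation nested under a hypothetical 1,000-bucket terms aggregation.
nested = estimated_sketch_bytes(3000, buckets=1000)
print(single, nested)
```

These figures are approximations meant for sizing, not exact measurements of heap usage.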

Despite these limitations, cardinality aggregation is a highly useful feature in Elasticsearch. It provides a way to count unique values in large datasets quickly and efficiently. By understanding how to use this feature and its limitations, you can make the most of it in your Elasticsearch projects.