Elasticsearch Cardinality: Low & High Cardinality Fields

By Opster Team

Updated: Jun 6, 2023

Quick links

Overview
Examples
- Low cardinality fields
- High cardinality fields
Low vs. high cardinality fields
How to determine field cardinality in Elasticsearch
The importance of cardinality in Elasticsearch

Overview

What is cardinality in Elasticsearch?

Cardinality in Elasticsearch denotes the total number of distinct values found in a specific field or group of fields, across all the documents saved in a given index.

Understanding cardinality is key for optimizing Elasticsearch performance, as it can help inform decisions about data modeling, query construction, and resource allocation.

What is low and high cardinality in Elasticsearch?

Low cardinality refers to fields that have few distinct values, while high cardinality refers to fields that have large numbers of distinct values.

For example, a “status” field with “Open” and “Closed” values would be classed as having low cardinality because it has only two distinct values. On the other hand, a “product_name” field with a large number of distinct values would be considered a high cardinality field.
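
To make the definition concrete, here is a minimal Python sketch that counts distinct values per field over a handful of in-memory documents. The sample documents and the `field_cardinality` helper are hypothetical and only illustrate the idea; Elasticsearch computes this differently on real indices.

```python
# Hypothetical sample documents; field names follow the examples above.
docs = [
    {"status": "Open", "product_name": "Laptop"},
    {"status": "Closed", "product_name": "Phone"},
    {"status": "Open", "product_name": "Monitor"},
    {"status": "Open", "product_name": "Keyboard"},
]


def field_cardinality(documents, field):
    """Return the number of distinct values of `field` across the documents."""
    return len({doc[field] for doc in documents if field in doc})


status_cardinality = field_cardinality(docs, "status")
product_cardinality = field_cardinality(docs, "product_name")

# "status" repeats the same two values; "product_name" is different in every document.
print(status_cardinality, product_cardinality)
```

In this tiny sample, “status” behaves like a low cardinality field and “product_name” like a high cardinality one.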

Examples

Here are some examples of low and high cardinality fields:

Low cardinality fields:

Country: This field may have a limited number of values, depending on the scope of the dataset. For example, a dataset of customers in a particular region may only have a few countries to choose from.
Product categories: A dataset of products may only have a small number of categories, such as “electronics,” “clothing,” “home goods,” etc.
Payment method: This field might only have a few possible values, such as “credit card,” “PayPal,” “cash,” etc.
Marital status: This field typically has a limited number of values, such as “married,” “single,” “divorced,” “widowed,” etc.

High cardinality fields:

User IDs: In a dataset of online activity, each user typically has a unique ID, so the number of distinct user IDs, and therefore the cardinality of the field, grows with the number of users.
IP addresses: A dataset of network activity usually records traffic from many different devices and networks, so its IP address field tends to have high cardinality.
Email addresses: In a dataset of email communications, the “to” and “from” fields may have a large number of unique email addresses.
URLs: In a dataset of web traffic, most of the URLs visited by users may be distinct.
Product IDs: In a dataset of e-commerce activity, each product might have a unique ID.

Low vs. high cardinality fields

Distinguishing between low and high cardinality fields is important because it can impact indexing and search performance. Generally, low cardinality fields can be indexed more efficiently than high ones. Queries that involve high cardinality fields may also require more processing power and take longer to execute. Understanding field cardinality can also help inform data modeling decisions.

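Elasticsearch does not switch storage strategies based on cardinality. For most field types it maintains two data structures, and cardinality affects how well each one performs.

For search, Elasticsearch stores the unique values (terms) of a field in an inverted index, which maps each term to the documents that contain it. Term-level queries retrieve matching documents through this structure. It can handle large volumes of unique values, but a high cardinality field produces a larger term dictionary, which takes more disk space and more resources to work with.

For sorting and aggregations, Elasticsearch uses “doc values,” an on-disk, column-oriented structure built at index time that stores each document’s value for the field. Doc values are available for most field types other than “text.” Low cardinality fields compress well in this structure and are fast to aggregate, while aggregations on high cardinality fields tend to be slower and more resource-intensive because there are many more distinct values to track.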
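
As a rough mental model of the inverted index and doc values, the Python sketch below builds a toy version of each for a single field. This is an illustration under simplifying assumptions, not how Lucene lays out data on disk; the document set and the helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents keyed by document ID.
docs = {
    0: {"status": "Open"},
    1: {"status": "Closed"},
    2: {"status": "Open"},
}


def build_inverted_index(documents, field):
    """Map each term to the sorted list of document IDs that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(documents):
        index[documents[doc_id][field]].append(doc_id)
    return dict(index)


def build_column(documents, field):
    """Return the field values ordered by document ID (a toy stand-in for doc values)."""
    return [documents[doc_id][field] for doc_id in sorted(documents)]


inverted = build_inverted_index(docs, "status")
column = build_column(docs, "status")

# Search: "which documents have status Open?" is a single lookup by term.
open_docs = inverted["Open"]

# Aggregation: "how many documents per status?" walks the column of values.
counts = {}
for value in column:
    counts[value] = counts.get(value, 0) + 1

print(open_docs, counts)
```

The number of keys in `inverted` and in `counts` equals the cardinality of the field, which is why both structures grow as cardinality rises.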

To optimize performance for both low and high cardinality fields, it’s important to choose the right mapping type and index settings, and to use the appropriate query and aggregation strategies for each field type.

How to determine field cardinality in Elasticsearch

Users can run the following cardinality aggregation to determine the cardinality of the “product_name” field in the “orders” index (this assumes “product_name” is mapped as a keyword field):

GET orders/_search
{
  "size": 0,
  "aggs": {
    "unique_products": {
      "cardinality": {
        "field": "product_name"
      }
    }
  }
}

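The response includes an “aggregations” section, where the “value” field under “unique_products” holds the number of unique values for the “product_name” field. For high cardinality fields this count is approximate, because the cardinality aggregation is based on the HyperLogLog++ algorithm. Its “precision_threshold” parameter trades memory for accuracy.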
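
The same request can be assembled programmatically. The sketch below builds the request body as a Python dictionary and reads the count out of a response. The helper names are hypothetical, the optional `precision_threshold` key is the cardinality aggregation's accuracy setting, and the sample response is a made-up fragment used only to show the shape being parsed; no cluster is contacted.

```python
import json


def cardinality_request(field, agg_name="unique_values", precision_threshold=None):
    """Build the body of a search request with a single cardinality aggregation."""
    cardinality = {"field": field}
    if precision_threshold is not None:
        # Optional: higher thresholds use more memory for more accurate counts.
        cardinality["precision_threshold"] = precision_threshold
    return {"size": 0, "aggs": {agg_name: {"cardinality": cardinality}}}


def extract_cardinality(response, agg_name="unique_values"):
    """Read the distinct-value count from a search response dictionary."""
    return response["aggregations"][agg_name]["value"]


body = cardinality_request("product_name", agg_name="unique_products")
print(json.dumps(body, indent=2))

# Made-up response fragment for illustration only; the value is not real data.
sample_response = {"aggregations": {"unique_products": {"value": 1250}}}
print(extract_cardinality(sample_response, "unique_products"))
```

The printed body matches the request above and could be sent to “orders/_search” with any HTTP client.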

The importance of cardinality in Elasticsearch

Cardinality is an important data analysis concept for several reasons:

Data quality: Understanding the cardinality of a dataset can help determine data quality. For example, if a low cardinality field has a large number of distinct values, it could be an indication of data entry errors or inconsistencies.
Indexing: Understanding the cardinality of a field can help determine the appropriate indexing strategy, since low and high cardinality fields may benefit from different mappings and index settings.
Query performance: High cardinality fields can impact query performance because they require more processing power to search and aggregate. By understanding the cardinality of the data, queries can be optimized to handle high cardinality fields more efficiently.
Data modeling: Cardinality is an important consideration in data modeling, which in Elasticsearch means designing indices and their mappings. Understanding the cardinality of a dataset can help inform decisions about how fields are mapped and how documents relate to each other.
Business insights: Understanding the cardinality of a dataset can also provide valuable business insights. For example, tracking the cardinality of a product ID field over time shows how many distinct products were ordered in each period, which can help reveal sales trends.
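
The data quality point can be sketched in a few lines of Python: a field expected to have low cardinality is checked for more distinct values than it should have. The expected value set, the sample values, and the `cardinality_report` helper are all hypothetical.

```python
# Hypothetical set of values the "status" field is expected to contain.
EXPECTED_STATUSES = {"open", "closed"}

# Hypothetical raw values with inconsistent casing, stray whitespace, and a typo.
raw_statuses = ["Open", "open", "Closed", "CLOSED ", "Opne"]


def cardinality_report(values, expected):
    """Compare raw and normalized cardinality and list values outside `expected`."""
    normalized = {value.strip().lower() for value in values}
    return {
        "raw_cardinality": len(set(values)),
        "normalized_cardinality": len(normalized),
        "unexpected": sorted(normalized - expected),
    }


report = cardinality_report(raw_statuses, EXPECTED_STATUSES)
print(report)
```

A raw cardinality well above the expected number of values suggests inconsistent data entry, and the `unexpected` list points at the values that need cleaning.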
