Mastering Elasticsearch Normalizers for Improved Text Analysis

By Opster Team

Updated: Oct 31, 2023

2 min read


Overview

Elasticsearch normalizers are a crucial component in the text analysis process, specifically when dealing with keyword fields. Normalizers come into play when you want to aggregate, sort, or access values in a keyword field, but still need to maintain a certain level of text processing. This article delves into the intricacies of Elasticsearch normalizers, their usage, and how to optimize them for improved text analysis.

Understanding Elasticsearch Normalizers

Normalizers are similar to analyzers, but they are used with keyword fields instead of text fields. While analyzers break text into multiple tokens, normalizers don't: they take a string, process it, and emit it as a single token. This is particularly useful when you want to perform operations like sorting or aggregating on a keyword field while still applying some text processing, such as lowercasing or removing diacritical marks.
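
For example, a terms aggregation on such a field buckets "FOO", "Foo", and "foo" together, because aggregations operate on the stored, normalized values. Here is a minimal sketch, assuming an index `my_index` whose `my_field` keyword field uses a lowercasing normalizer (one such normalizer is defined in the next section):

```json
GET /my_index/_search
{
  "size": 0,
  "aggs": {
    "distinct_values": {
      "terms": { "field": "my_field" }
    }
  }
}
```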

Creating a Custom Normalizer

Elasticsearch provides built-in normalizers like the `lowercase` normalizer. However, there are scenarios where you might need to create custom normalizers to meet specific requirements. Here’s how you can define a custom normalizer:

```json
PUT /my_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}
```

In this example, a custom normalizer named `my_normalizer` is created. It uses two built-in filters: `lowercase` and `asciifolding`. The `lowercase` filter converts the text to lower case, and the `asciifolding` filter removes diacritical marks from the text. This normalizer is then applied to the `my_field` keyword field.
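
You can verify this behavior with the `_analyze` API, which accepts a `normalizer` parameter. For example, running the following against the index defined above:

```json
GET /my_index/_analyze
{
  "normalizer": "my_normalizer",
  "text": "Déjà Vu"
}
```

This should return the single token `deja vu`, confirming that the input was lowercased and stripped of diacritics rather than split into multiple tokens.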

When defining custom normalizers, any of the following token filters can be used: `arabic_normalization`, `asciifolding`, `bengali_normalization`, `cjk_width`, `decimal_digit`, `elision`, `german_normalization`, `hindi_normalization`, `indic_normalization`, `lowercase`, `pattern_replace`, `persian_normalization`, `scandinavian_folding`, `serbian_normalization`, `sorani_normalization`, `trim`, and `uppercase`.
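
As an illustration, here is a sketch of a custom normalizer that combines `trim` and `lowercase` (the index and normalizer names here are hypothetical):

```json
PUT /trimmed_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "trim_lowercase": {
          "type": "custom",
          "filter": ["trim", "lowercase"]
        }
      }
    }
  }
}
```

With this normalizer, a value like "  Foo Bar  " would be stored as "foo bar": `trim` removes the leading and trailing whitespace, and `lowercase` converts the remaining characters to lower case.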

Optimizing Normalizers for Improved Text Analysis

While normalizers are powerful tools, they need to be used judiciously to ensure optimal performance. Here are some tips to optimize the use of normalizers:

1. Use Built-in Normalizers When Possible: As of version 8.10, Elasticsearch provides only a single built-in normalizer, called `lowercase`. It is optimized for performance and should be used whenever possible; see the sketch after this list for how to reference it directly in a mapping.

2. Limit the Number of Custom Normalizers: Each custom normalizer adds to the complexity of the analysis process. Therefore, limit the number of custom normalizers to only those that are absolutely necessary.

3. Avoid Using Normalizers on High-Cardinality Fields: Normalizers can significantly slow down operations on high-cardinality fields. If a field has a large number of unique values, consider a different approach to text processing.

4. Test Normalizers Before Deployment: Always test the impact of normalizers on your Elasticsearch operations before deploying them in a production environment. This can help identify any potential performance issues early on.
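
Regarding the first tip: because `lowercase` is built in, it can be referenced directly in a mapping with no `settings` block at all. A minimal sketch (the index and field names are hypothetical):

```json
PUT /logs_index
{
  "mappings": {
    "properties": {
      "status": {
        "type": "keyword",
        "normalizer": "lowercase"
      }
    }
  }
}
```

Pairing this with an `_analyze` check like the one shown earlier is a simple way to validate a normalizer's behavior before deploying it, as recommended in the fourth tip.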

Conclusion

Elasticsearch normalizers are a powerful tool for text analysis on keyword fields. By understanding how they work and applying the best practices above, you can significantly improve the efficiency and effectiveness of your text analysis processes.