Elasticsearch Lowercase Analysis

By Opster Team

Updated: Aug 29, 2023

Introduction

Elasticsearch provides a wide range of text analysis mechanisms, one of which is lowercasing the tokens of an input stream. Several built-in analyzers convert all text to lowercase during the analysis phase in order to ensure case-insensitive matching at search time, among them the `standard`, `simple` and `stop` analyzers. This article will also show how to build a custom analyzer and a custom normalizer in order to support specific use cases.

Understanding the Standard Analyzer

The Standard Analyzer is the default built-in analyzer in Elasticsearch. First, it tokenizes the input text into individual terms using the Unicode Text Segmentation algorithm, as defined by the Unicode Consortium. It then applies the Lowercase Token Filter to each token, converting it to lowercase. The analyzer also includes a Stop Token Filter, but it is disabled by default, so no stop words are removed unless you explicitly configure a stop word list.

Let’s consider an example. If we have the text “Hello World”, the Standard Analyzer would tokenize this into two tokens: “Hello” and “World”. Then, it would convert these tokens into “hello” and “world”.
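You can verify this behavior with the `_analyze` API:

POST /_analyze
{
  "analyzer": "standard",
  "text": "Hello World"
}

The response contains the two tokens “hello” and “world”, each with its position and character offsets.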

To use the Standard Analyzer, you can specify it in the field mapping:

PUT /my_index
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "standard"
      }
    }
  }
}

In this example, any text that is indexed in the “my_field” field will be analyzed using the Standard Analyzer.

Using the `simple` or `stop` built-in analyzers yields a similar result: all tokens are lowercased in the end, even though these analyzers tokenize differently and apply different sets of token filters. Note that the `whitespace` analyzer, by contrast, does not lowercase its tokens.
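For instance, the `simple` analyzer divides text at any non-letter character and lowercases the result, so digits and punctuation disappear entirely:

POST /_analyze
{
  "analyzer": "simple",
  "text": "Hello-World 42"
}

This produces the tokens “hello” and “world”; the hyphen and the number are dropped by the tokenizer.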

Customizing the Standard Analyzer

While the Standard Analyzer is useful in many scenarios, there might be cases where you need to customize it. For instance, you might want to keep certain words in their original case. Elasticsearch allows you to create custom analyzers for such requirements.

To create a custom analyzer that uses the Lowercase Token Filter, you can use the following command:

PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_custom_filter"]
        }
      },
      "filter": {
        "my_custom_filter": {
          "type": "word_delimiter",
          "preserve_original": "true"
        }
      }
    }
  }
}

In this example, we have created a custom analyzer named “my_custom_analyzer”. This analyzer uses the standard tokenizer and applies two filters: the Lowercase Token Filter and a custom filter named “my_custom_filter”. The custom filter is a word delimiter filter that preserves the original token as well as the split tokens.
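You can test the custom analyzer with the `_analyze` API once the index exists. The sample text below is only an illustration; note that because `lowercase` runs before `my_custom_filter`, the word delimiter never sees the original casing, so it splits only on boundaries such as the letter-to-digit transition:

POST /my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "XL500"
}

With `preserve_original` enabled, the response should include the original token “xl500” alongside the split tokens “xl” and “500”.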

To use this custom analyzer, you can specify it in the field mapping:

PUT /my_index/_mapping
{
  "properties": {
    "my_field": {
      "type": "text",
      "analyzer": "my_custom_analyzer"
    }
  }
}

Understanding the Lowercase Normalizer

Analyzers can only be applied to text fields, but keyword fields can also be configured to apply some transformation rules at indexing time, similar to how text fields are processed during analysis. Normalizers are to keyword fields what analyzers are to text fields. They are defined much like analyzers, but since they emit a single token, they don’t use a tokenizer and only accept character filters and token filters that work on a per-character basis, such as `lowercase` and `asciifolding`.

To use a custom normalizer for your keyword field, you can define it like this:

PUT /my_index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "lowercase_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

Once defined, you can apply this custom normalizer on a keyword field like this:

PUT /my_index/_mapping
{
  "properties": {
    "my_field": {
      "type": "keyword",
      "normalizer": "lowercase_normalizer"
    }
  }
}

As a result, the value of the keyword field will be indexed in lowercase.
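The normalizer is also applied at query time by term-level queries, so a differently-cased search value still matches. The document ID and field value below are just illustrative:

PUT /my_index/_doc/1
{
  "my_field": "Hello World"
}

GET /my_index/_search
{
  "query": {
    "term": {
      "my_field": "HELLO WORLD"
    }
  }
}

The term query value is normalized to “hello world” before being compared against the indexed value, so the document matches even though the cases differ.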