Elasticsearch Stop Words

By Opster Team

Updated: Aug 24, 2023

Introduction

Stop words are a fundamental aspect of text analysis in Elasticsearch. They are commonly used words that a search engine has been programmed to ignore. Words like ‘and’, ‘the’, ‘is’, and ‘in’ are often used as stop words since they do not add significant value to a search. However, the use of stop words in Elasticsearch is not as straightforward as it seems. This article will delve into the intricacies of using stop words in Elasticsearch, providing examples and step-by-step instructions where necessary.

Elasticsearch uses a stop token filter to handle stop words. This filter is used in the analysis process to remove stop words from the token stream. By default, Elasticsearch uses a set of English stop words, but you can configure it to use any set of stop words you want.
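
You can see this default behavior with the _analyze API. The request below is a minimal sketch: it runs the standard tokenizer, the lowercase filter, and the built-in stop filter (which defaults to the _english_ stop word set) over a sample sentence, and the tokens returned omit ‘the’, ‘is’, and ‘in’.

json
GET /_analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "stop"],
  "text": "The quick brown fox is in the barn"
}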

Implementing a custom stop words filter: A step-by-step guide

To illustrate, let’s consider an example. Suppose you have a text field in your index and you want to apply a custom stop words filter. Here’s how you can do it:

Step 1: Define your custom analyzer

Define your custom analyzer in the settings of your index. In the definition, include the stop filter and specify your stop words.

json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": ["the", "and"]
        }
      }
    }
  }
}

In this example, ‘the’ and ‘and’ are the stop words. The ‘my_analyzer’ analyzer will remove these words from the token stream.
If the list of stop words grows large, it is also possible to store all the stop words in a file and reference the path of the file in the “stopwords_path” parameter. The file must be UTF-8 encoded with one stop word per line, and it must be present on all nodes of the cluster, which means that whenever the list of stop words changes, the file must be updated on all nodes and the analyzers reloaded.
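
As a sketch, assuming a UTF-8 file at analysis/stopwords.txt (an illustrative path, resolved relative to each node’s config directory) containing one stop word per line, the filter definition would reference it like this:

json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords_path": "analysis/stopwords.txt"
        }
      }
    }
  }
}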

Step 2: Apply the custom analyzer to your text field

json
PUT /my_index/_mapping
{
  "properties": {
    "my_field": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

In this step, we apply ‘my_analyzer’ to ‘my_field’. Now, whenever ‘my_field’ is indexed, ‘the’ and ‘and’ will be removed from the token stream.
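
A quick way to verify the setup is the index-level _analyze API. With the stop words defined above, a phrase such as “the quick and the dead” should come back as just ‘quick’ and ‘dead’:

json
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "the quick and the dead"
}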

Advanced stop words configuration

It’s important to note that removing stop words can significantly reduce the size of your index and increase search speed. However, it can also lead to less precise search results, and a query made up entirely of stop words returns nothing at all: if you search for the phrase “the and”, both tokens are stripped out during analysis, so no results are returned.
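
To illustrate, assuming ‘my_analyzer’ from the earlier example is applied to ‘my_field’, the match query below analyzes to an empty token stream and, with the default zero_terms_query of ‘none’, matches no documents:

json
GET /my_index/_search
{
  "query": {
    "match": {
      "my_field": "the and"
    }
  }
}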

Elasticsearch also allows you to disable stop word removal entirely by setting the ‘stopwords’ parameter to ‘_none_’ in the stop token filter, which configures an empty stop word list. This can be useful in scenarios where precision is more important than index size and search speed.

json
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "my_stop"]
        }
      },
      "filter": {
        "my_stop": {
          "type": "stop",
          "stopwords": "_none_"
        }
      }
    }
  }
}

This example showed how to configure Elasticsearch so that no stop words are removed during analysis, by setting the ‘stopwords’ parameter to ‘_none_’ in the stop token filter.
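
As a quick check, re-running the earlier _analyze request against this index should now return every token, including ‘the’ and ‘and’:

json
GET /my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "the quick and the dead"
}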