Elasticsearch String Contains Substring: Advanced Query Techniques

By Opster Team

Updated: Nov 2, 2023

Introduction

Searching for documents in which a field contains a specific substring is a common requirement in Elasticsearch. This article explores advanced techniques for doing so. We will discuss the query_string, match_phrase, and wildcard queries, and then show how analyzers and tokenizers, in particular n-gram tokenization, can improve the accuracy of substring searches.

1. Query String Query

The query_string query is a powerful and flexible way to search for documents containing a specific substring. It allows you to use the Lucene query syntax, which provides a wide range of search options. Here’s an example of a query_string query that searches for documents containing the substring “example”:

GET /_search
{
  "query": {
    "query_string": {
      "query": "*example*"
    }
  }
}

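In this example, the asterisks (*) are wildcards that match any sequence of characters, including none. Because no field is specified, the query runs against the fields named by the index.query.default_field setting, which by default covers all eligible fields, and it returns documents in which any indexed term contains “example”. Beware of leading wildcards, though: Elasticsearch has to examine every term in the field to find the matches, which can be detrimental to your cluster performance.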
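To limit both the cost and the scope of such a query, it can be restricted to a single field. The following is a minimal sketch rather than a tuned query; field_name is a placeholder for a field in your own mapping.

```
# field_name is a placeholder; replace it with a real field
GET /_search
{
  "query": {
    "query_string": {
      "query": "*example*",
      "default_field": "field_name"
    }
  }
}
```

The query_string query also accepts an allow_leading_wildcard parameter, which defaults to true. Setting it to false makes Elasticsearch reject patterns that start with a wildcard, which can be a useful safeguard when the query text comes from end users.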

2. Match Phrase Query

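The match_phrase query is another option, but it matches a sequence of whole terms rather than an arbitrary run of characters. The query text is analyzed into terms, and a document matches only if those terms appear in the field next to each other and in the same order. It is therefore suitable when the “substring” you are looking for is a phrase made of complete words; it will not find “qui” inside “quick”. The optional slop parameter relaxes the requirement by allowing the terms to be a limited number of positions apart. Here’s an example of a match_phrase query that searches for documents containing the phrase “quick brown”: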

GET /_search
{
  "query": {
    "match_phrase": {
      "field_name": "quick brown"
    }
  }
}

In this example, the match_phrase query will return documents containing the exact phrase “quick brown” in the specified field.
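When a looser match is acceptable, the slop parameter can be added by using the expanded form of the query. This sketch assumes the same placeholder field, field_name, and a slop value of 2 chosen purely for illustration.

```
# field_name is a placeholder; the slop value is illustrative
GET /_search
{
  "query": {
    "match_phrase": {
      "field_name": {
        "query": "quick brown",
        "slop": 2
      }
    }
  }
}
```

With a slop of 2, the query would also match text such as “quick dark brown” (one position moved) and “brown quick” (swapping two adjacent terms counts as two moves).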

3. Wildcard Query

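The wildcard query is a simple way to search for documents containing a specific substring. It is a term-level query: the pattern is not analyzed and is matched directly against the terms stored in the index. Two wildcard operators are supported: * matches zero or more characters, and ? matches exactly one character. On a keyword field the pattern is compared with the whole field value, whereas on an analyzed text field it is compared with each individual token. Here’s an example of a wildcard query that searches for documents containing the substring “exam”: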

GET /_search
{
  "query": {
    "wildcard": {
      "field_name": "*exam*"
    }
  }
}

In this example, the wildcard query will return documents in which a term of the specified field contains “exam”. As with query_string, pay special attention to leading wildcards: a pattern that begins with * or ? increases the number of terms Elasticsearch has to iterate over and can slow down your search performance.
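Because the pattern is not analyzed, letter case matters when the indexed terms have not been lowercased, as is typical for keyword fields. The sketch below uses the expanded form of the wildcard query with the case_insensitive parameter; it assumes Elasticsearch 7.10 or later and the same placeholder field name.

```
# Assumes Elasticsearch 7.10+; field_name is a placeholder
GET /_search
{
  "query": {
    "wildcard": {
      "field_name": {
        "value": "*exam*",
        "case_insensitive": true
      }
    }
  }
}
```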

4. Analyzers and Tokenizers

To improve the accuracy of substring searches, you can use analyzers and tokenizers to process the text in your documents. An analyzer converts text into the tokens that are stored in the index and matched at search time. It consists of optional character filters, exactly one tokenizer, which splits the text into individual tokens, and optional token filters, which modify those tokens (for example, by lowercasing them).
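A convenient way to see which tokens an analyzer produces is the _analyze API. The request below is a small illustration using the built-in standard analyzer and an arbitrary sample sentence.

```
# The sample text is arbitrary
POST /_analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown Fox"
}
```

The response lists the tokens the, quick, brown, and fox. Since only whole words are indexed, a term-based search for “row” would not match this text, which is the gap that n-gram tokenization addresses.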

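For example, you can use the n-gram tokenizer to create tokens of varying lengths from the input text. It emits every contiguous sequence of characters whose length lies between min_gram and max_gram; with lengths 3 to 5, the word “quick” yields qui, quic, quick, uic, uick, and ick. Because each of these substrings is stored as an ordinary term, Elasticsearch can match them without wildcards, at the cost of a larger index. Here’s an example of how to create a custom analyzer with an n-gram tokenizer: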

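PUT /my_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 2
    },
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      }
    }
  }
}

In this example, the custom analyzer uses an n-gram tokenizer with a minimum token length of 3 and a maximum token length of 5. By default, Elasticsearch 7.x and later reject an n-gram tokenizer whose max_gram exceeds its min_gram by more than 1, so the request also raises the index.max_ngram_diff setting to 2. You can then assign this analyzer to a text field in the index mapping so that it is applied when documents are indexed and when queries against that field are analyzed.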
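As a rough sketch of how the analyzer might be put to use, the requests below add a text field that uses my_analyzer and then search it with a match query. The field name field_name is a placeholder, and the "operator": "and" setting is an assumption about the desired behaviour: because the query text is also split into n-grams, the default OR operator would match any document that shares even a single n-gram with the query.

```
# field_name is a placeholder field added for illustration
PUT /my_index/_mapping
{
  "properties": {
    "field_name": {
      "type": "text",
      "analyzer": "my_analyzer"
    }
  }
}

# Require every n-gram of the search text to be present
GET /my_index/_search
{
  "query": {
    "match": {
      "field_name": {
        "query": "exam",
        "operator": "and"
      }
    }
  }
}
```

Search text shorter than min_gram (here, fewer than 3 characters) produces no tokens with this analyzer, so such queries return no hits.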

Conclusion

Elasticsearch provides several advanced techniques for querying documents containing specific substrings. By using query_string, match_phrase, and wildcard queries, as well as custom analyzers and tokenizers, you can improve the accuracy and flexibility of your substring searches. Experiment with these techniques to find the best approach for your specific use case and dataset.