Introduction
Partial matching is a common requirement when building search applications with Elasticsearch. It lets users find relevant documents even when their search terms do not exactly match the indexed data, either because a term can be written in slightly different ways or, more often, because the user misspelled it. In this article, we will discuss several techniques for achieving partial matching in Elasticsearch. We’ll also cover best practices for optimizing search performance, since some of these approaches can be very resource-intensive.
1. Using Wildcards
Wildcards are a simple way to perform partial matching in Elasticsearch. In a wildcard query, the asterisk (*) matches any number of characters, including none, and the question mark (?) matches exactly one character.
Example:
GET /my_index/_search
{
  "query": {
    "wildcard": {
      "title": "*lastic*"
    }
  }
}
This query will match documents whose titles contain “Elasticsearch” or “Elastic”, as well as any other document whose title contains a term matching the pattern, such as “Plastic” or “Pyroclastic”.
Be aware that wildcard queries can be slow and resource-intensive, especially when using leading wildcards. It is not recommended to use them for large-scale production environments.
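To make the matching rules concrete, here is a minimal Python sketch. It is an illustration only and not how Elasticsearch evaluates the query internally: the term list is hypothetical, and Python's `fnmatchcase` stands in for the wildcard syntax (it also understands `[...]` character sets, which the sketch does not use). It does show why a leading wildcard is costly: with no fixed prefix to narrow the candidates, every term has to be checked.

```python
from fnmatch import fnmatchcase

# Hypothetical term dictionary for the "title" field, lowercased the way the
# default standard analyzer would store the terms.
indexed_terms = ["elastic", "elasticsearch", "matching", "plastic", "pyroclastic", "search"]


def wildcard_matches(pattern, terms):
    """Return every term that fits the pattern, testing the terms one by one."""
    return [term for term in terms if fnmatchcase(term, pattern)]


# "*" matches any number of characters, including none.
any_length = wildcard_matches("*lastic*", indexed_terms)
print(any_length)   # ['elastic', 'elasticsearch', 'plastic', 'pyroclastic']

# "?" matches exactly one character.
single_char = wildcard_matches("?lastic", indexed_terms)
print(single_char)  # ['elastic', 'plastic']
```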
2. Using Edge N-grams
Edge N-grams are a more efficient way to achieve partial matching in Elasticsearch. They rely on a custom analyzer that, at index time, generates edge N-grams (prefixes of increasing length) for each indexed term. Because these prefixes are stored in the index, Elasticsearch can find a partial term with an ordinary term lookup. Since the N-grams are anchored to the start of each term, this approach supports prefix matching rather than matching in the middle of a word.
Let’s see an example of how to use edge N-grams in Elasticsearch.
A) First, create an index with a custom analyzer that uses a tokenizer of type edge_ngram (named edge_ngram_tokenizer here).
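As a rough illustration of what the analyzer produces for a single term, the following Python sketch generates edge N-grams by hand. The function name `edge_ngrams` is made up for this example; the defaults of 2 and 10 mirror the min_gram and max_gram values used in the index settings later in this section.

```python
def edge_ngrams(term, min_gram=2, max_gram=10):
    """Return the prefixes of `term` whose length is between min_gram and max_gram."""
    longest = min(max_gram, len(term))
    return [term[:size] for size in range(min_gram, longest + 1)]


grams = edge_ngrams("Elasticsearch")
print(grams)
# ['El', 'Ela', 'Elas', 'Elast', 'Elasti', 'Elastic', 'Elastics', 'Elasticse', 'Elasticsea']

# A term shorter than min_gram produces no N-grams at all.
print(edge_ngrams("a"))  # []
```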
PUT /my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "edge_ngram_analyzer": {
          "tokenizer": "edge_ngram_tokenizer"
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [ "letter", "digit" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "edge_ngram_analyzer"
      }
    }
  }
}
B) You can now use the Analyze API to test how the analyzer will break up text for the “title” field. Run the following command and verify that each term is split into prefixes of 2 to 10 characters.
POST /my_index/_analyze
{
  "analyzer": "edge_ngram_analyzer",
  "text": [ "Elasticsearch can help you with partial matching" ]
}
C) Now, index a document with a “title” field. Run the following command:
PUT /my_index/_doc/1
{
  "title": "Elasticsearch can help you with partial matching"
}
D) Finally, search for a partial term, and observe that Elasticsearch returns the document we indexed, even though the search request does not contain the complete term.
POST /my_index/_search
{
  "query": {
    "match": {
      "title": "Ela"
    }
  }
}
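The steps above can be approximated in a few lines of Python to see why the search for “Ela” succeeds. This is a simplified model under stated assumptions, not Elasticsearch code: `analyze` and `would_match` are hypothetical helpers, the regular expression only approximates the "letter" and "digit" token classes, and the match query is modeled with its default OR behavior. Because the analyzer defined above has no lowercase filter and is also applied to the search text, the sketch also shows two side effects worth knowing about.

```python
import re


def analyze(text, min_gram=2, max_gram=10):
    """Approximate edge_ngram_analyzer: split on characters that are not letters
    or digits, then emit the prefixes of each word. No lowercasing is applied."""
    tokens = []
    for word in re.findall(r"[^\W_]+", text):
        longest = min(max_gram, len(word))
        tokens.extend(word[:size] for size in range(min_gram, longest + 1))
    return tokens


def would_match(query_text, document_text):
    """Model a match query (OR operator): a hit if any query token is an indexed token."""
    indexed = set(analyze(document_text))
    return any(token in indexed for token in analyze(query_text))


title = "Elasticsearch can help you with partial matching"

print(would_match("Ela", title))     # True: "El" and "Ela" are both indexed
print(would_match("ela", title))     # False: tokens are case-sensitive here
print(would_match("Elaxyz", title))  # True: the query itself is split into "El", "Ela", ...
```

The last two results are the reason many setups add a lowercase token filter to the analyzer and configure a separate search_analyzer that does not produce N-grams. Whether that is appropriate depends on your use case.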
3. Using Fuzzy Queries
Fuzzy queries in Elasticsearch allow you to find documents that are approximately similar to the search query. They are based on the Levenshtein edit distance, which measures the number of single-character edits (insertions, deletions, or substitutions) required to transform one string into another.
Example:
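GET /my_index/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "elsticsearch",
        "fuzziness": 2
      }
    }
  }
}
This query will match documents whose titles contain “Elasticsearch”, even though the search term is misspelled: “elsticsearch” is one edit (a missing “a”) away from the indexed term. Note that the fuzzy query is a term-level query, so the search term is not analyzed. The example therefore assumes that the title field is indexed with the default standard analyzer, which lowercases terms, rather than with the edge N-gram analyzer from the previous section.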
However, fuzzy queries can be slow for large datasets, and it is essential to choose the appropriate fuzziness value to balance search accuracy and performance.
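The edit distance described at the start of this section can be computed with a short dynamic-programming routine. The Python sketch below is a plain Levenshtein implementation written for illustration; it is not the code Elasticsearch runs. In particular, Elasticsearch by default also counts a swap of two adjacent characters as a single edit (the `transpositions` option), which this sketch does not. The `auto_fuzziness` helper is a hypothetical name that mirrors the documented default thresholds of the `AUTO` fuzziness setting.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Edits allowed by "fuzziness": "AUTO" with its default thresholds:
    0 for terms of 1-2 characters, 1 for 3-5 characters, 2 for longer terms."""
    if len(term) < 3:
        return 0
    if len(term) < 6:
        return 1
    return 2


print(levenshtein("elsticsearch", "elasticsearch"))  # 1, within a fuzziness of 2
print(levenshtein("kitten", "sitting"))              # 3, too far for a fuzziness of 2
print(auto_fuzziness("elsticsearch"))                # 2
```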
Best Practices
- Choose the right technique: Depending on your use case and dataset size, choose the most appropriate partial matching technique. Wildcards are simple but can be slow. Edge N-grams make searches fast at the cost of a larger index and extra analyzer configuration. Fuzzy queries need no special mapping but can become slow on large datasets.
- Optimize your analyzers: When using edge N-grams or other custom analyzers, ensure that you optimize their settings to balance index size and search performance.
- Monitor search performance: Regularly monitor the performance of your Elasticsearch cluster and fine-tune your partial matching techniques to ensure optimal search performance and user experience.
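To make the trade-off between index size and analyzer settings concrete, the following Python sketch counts how many tokens an edge N-gram analyzer would emit for one sample title under different min_gram and max_gram values. It is a back-of-the-envelope estimate only: `count_edge_ngrams` is a hypothetical helper, the tokenization is approximated with a regular expression, and real index size also depends on factors the sketch ignores.

```python
import re


def count_edge_ngrams(text, min_gram, max_gram):
    """Count the edge N-grams emitted for `text` with the given gram lengths."""
    total = 0
    for word in re.findall(r"[^\W_]+", text):
        longest = min(max_gram, len(word))
        total += max(0, longest - min_gram + 1)
    return total


title = "Elasticsearch can help you with partial matching"  # 7 words

for min_gram, max_gram in [(1, 20), (2, 10), (3, 6)]:
    tokens = count_edge_ngrams(title, min_gram, max_gram)
    print(f"min_gram={min_gram}, max_gram={max_gram}: {tokens} tokens")
# min_gram=1, max_gram=20: 42 tokens
# min_gram=2, max_gram=10: 32 tokens
# min_gram=3, max_gram=6: 18 tokens
```

A narrower range keeps the index smaller, but it also limits which partial terms can match: with a min_gram of 3, for instance, a two-character search term no longer produces any token.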
Conclusion
Partial matching is an essential feature for many search applications, and Elasticsearch offers multiple techniques to achieve it. By understanding the different methods and following best practices, you can build efficient and user-friendly search experiences with Elasticsearch.
If you plan to follow the edge N-gram approach and want to learn more about creating custom analyzers, take a look at this guide available on Opster’s website. You can also find a complete guide on how to use fuzzy queries.