Introduction
Fuzzy queries are how Elasticsearch handles approximate or imprecise search terms. They let users find documents containing terms that are similar to the query term, even if they are not exactly the same. This is particularly useful when users make typos or enter spelling variations of the same term.
In this article, we will discuss advanced techniques and use cases for Elasticsearch fuzzy queries.
Advanced Techniques and Use Cases
1. Customizing Fuzziness
By default, Elasticsearch uses the Damerau-Levenshtein edit distance to measure how far apart two terms are: the distance is the number of one-character changes (insertions, deletions, substitutions, or transpositions of two adjacent characters) needed to turn one term into the other. Transpositions count as a single edit by default and can be turned off by setting the `transpositions` parameter to `false`. The `fuzziness` parameter controls the maximum number of edits allowed between the query term and the matching terms in the index. You can set it to `AUTO` or to an integer from 0 to 2.
For example, to allow a maximum of 2 edits, you can set the fuzziness parameter as follows:
```json
{
  "query": {
    "fuzzy": {
      "field_name": {
        "value": "black",
        "fuzziness": 2
      }
    }
  }
}
```
The `fuzzy` query above allows a maximum of two edits, which means that searching for the term `black` will match `block` (one edit) as well as `clock` (two edits).
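To make the edit counts above concrete, here is a minimal Python sketch of an edit distance that counts insertions, deletions, substitutions, and adjacent transpositions as one edit each. It is an illustration of the idea only, not the code Elasticsearch or Lucene runs internally, and the function name `edit_distance` is made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Edit distance where insertions, deletions, substitutions and
    adjacent transpositions each cost one edit (illustrative only)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all characters of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Two adjacent characters swapped count as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


for candidate in ["block", "clock", "balck"]:
    print(candidate, edit_distance("black", candidate))
```

With `fuzziness` set to 2, any indexed term whose distance to `black` is 2 or less would be a candidate match.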
In general, it is a good idea to use the fuzziness value `AUTO` instead of a fixed integer, because the allowed edit distance then adapts to the length of the search token, which is usually not known in advance. `AUTO` allows up to two edits for search tokens of six or more characters, one edit for tokens of three to five characters, and none for shorter tokens, which must match exactly. For instance, a `fuzzy` query with `AUTO` fuzziness will not find `my` when searching for `me`.
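The length thresholds described above can be sketched as a small Python function. The name `auto_fuzziness` is hypothetical; the `low` and `high` arguments mirror the default thresholds of 3 and 6, which Elasticsearch also lets you override with the `AUTO:low,high` syntax.

```python
def auto_fuzziness(token: str, low: int = 3, high: int = 6) -> int:
    """Maximum number of edits AUTO would allow for a token of this length
    (a sketch of the documented rule, not Elasticsearch source code)."""
    length = len(token)
    if length < low:
        return 0  # very short tokens must match exactly
    if length < high:
        return 1  # tokens of low..high-1 characters: one edit
    return 2      # tokens of high or more characters: two edits


for token in ["me", "black", "search"]:
    print(token, auto_fuzziness(token))
```

Note that `black` has five characters, so with `AUTO` it gets one edit: it would still match `block`, but no longer `clock`.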
2. Prefix Length and Max Expansions
You can also control the number of characters at the beginning of the query term that must match exactly by setting the `prefix_length` parameter. This can improve performance by reducing the number of terms that need to be examined. By default, the prefix length is 0, which means that edits are allowed anywhere in the token, including its first character.
Additionally, you can limit the number of terms that the fuzzy query expands to by setting the `max_expansions` parameter. This helps prevent overly broad queries that could hurt performance. The default is 50. Avoid setting it much higher, especially with the default prefix length of 0, so that the query does not have to examine too many variations of the search token.
```json
{
  "query": {
    "fuzzy": {
      "field_name": {
        "value": "search_term",
        "fuzziness": 2,
        "prefix_length": 3,
        "max_expansions": 50
      }
    }
  }
}
```
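As a rough illustration of why a non-zero prefix length reduces work, the following Python sketch filters a toy term list by shared prefix before any edit distance would be computed. This is a simplified model under stated assumptions: the term list and the helper `terms_sharing_prefix` are invented for the example, and Lucene's actual term enumeration is far more sophisticated.

```python
def terms_sharing_prefix(indexed_terms, query_term, prefix_length):
    """Keep only the terms whose first prefix_length characters equal
    those of the query term (toy model of the prefix_length filter)."""
    prefix = query_term[:prefix_length]
    return [term for term in indexed_terms if term.startswith(prefix)]


# Hypothetical terms standing in for the index's term dictionary.
indexed_terms = ["search", "sear", "seaside", "starch", "perch", "march"]

all_candidates = terms_sharing_prefix(indexed_terms, "search", 0)
narrowed = terms_sharing_prefix(indexed_terms, "search", 3)

print(len(all_candidates), all_candidates)
print(len(narrowed), narrowed)
```

The trade-off is visible in the toy data: `starch` is only one substitution away from `search`, but with a prefix length of 3 it is excluded because the typo falls inside the protected prefix.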
3. Boosting and Tie Breaker
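In some cases, you might want to give higher relevance to documents that contain terms with fewer edits. Keep in mind that the `boost` parameter multiplies the score of a whole query clause and does not distinguish between individual matching terms by their edit distance. A common way to favor closer matches is therefore to combine several clauses in a `bool` query, for example an exact-match clause with a higher `boost` alongside a fuzzy clause with the default boost.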
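One possible way to rank exact matches above fuzzy ones is to pair a boosted `term` clause with a `fuzzy` clause inside a `bool` query. The Python sketch below only builds the request body; the helper name `exact_then_fuzzy_query`, the field name, and the boost value of 2.0 are placeholders chosen for illustration, not recommendations.

```python
import json


def exact_then_fuzzy_query(field: str, value: str, exact_boost: float = 2.0) -> dict:
    """Build a bool query body in which an exact match scores higher
    than a fuzzy match on the same field (illustrative sketch)."""
    return {
        "query": {
            "bool": {
                "should": [
                    # Exact match: its score is multiplied by exact_boost.
                    {"term": {field: {"value": value, "boost": exact_boost}}},
                    # Fuzzy match: default boost, edits chosen by AUTO.
                    {"fuzzy": {field: {"value": value, "fuzziness": "AUTO"}}},
                ]
            }
        }
    }


body = exact_then_fuzzy_query("field_name", "black")
print(json.dumps(body, indent=2))
```

A document containing the exact term matches both clauses and so accumulates both scores, while a document containing only a near miss matches the fuzzy clause alone.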
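Moreover, if you search across multiple fields, you can use the `multi_match` query with the `best_fields` type and set the `tie_breaker` parameter to control how the scores from different fields are combined. With `best_fields`, a document's score comes from its best-matching field. A `tie_breaker` between 0.0 and 1.0 adds that fraction of the scores of the other matching fields, so documents that match in several fields rank above those that match in only one.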
```json
{
  "query": {
    "multi_match": {
      "query": "search_term",
      "type": "best_fields",
      "fields": ["field1", "field2"],
      "fuzziness": 2,
      "tie_breaker": 0.3
    }
  }
}
```
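The effect of `tie_breaker` on the final score can be sketched with a few lines of Python. The function `best_fields_score` and the per-field scores are made up for the example; the formula follows the documented behavior of taking the best field's score and adding `tie_breaker` times the scores of the other matching fields.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores the way best_fields with a tie_breaker
    is documented to: best score + tie_breaker * sum of the others."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    return best + tie_breaker * others


# Hypothetical scores for field1 and field2 on a single document.
scores = [2.0, 1.0]
print(best_fields_score(scores, tie_breaker=0.0))
print(best_fields_score(scores, tie_breaker=0.3))
print(best_fields_score(scores, tie_breaker=1.0))
```

With a `tie_breaker` of 0.0 only the best field counts, and with 1.0 the field scores are simply added together; values in between trade off the two behaviors.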
4. Use Cases
Elasticsearch fuzzy queries can be beneficial in various scenarios, including:
- Autocomplete suggestions: By allowing for approximate matches, fuzzy queries can provide more relevant suggestions to users as they type their search queries.
- Spell correction: Fuzzy queries can be used to identify and suggest corrections for misspelled words in user queries.
Conclusion
Elasticsearch fuzzy queries offer a powerful way to handle imprecise search terms and improve the overall search experience. By customizing fuzziness, prefix length, and max expansions, you can fine-tune the behavior of fuzzy queries to suit your specific use cases and performance requirements.