Quicklinks:
- Overview
- Suggester types
- How to implement search suggesters in Elasticsearch
- Notes and good things to know
Overview
Elasticsearch provides various tools to help users avoid spelling mistakes. Apart from the more well-known fuzzy search, another feature that can be used is the “suggester”. Suggesters work differently and use a different syntax from regular Elasticsearch queries.
Suggester types: term, phrase, completion
Elasticsearch offers three types of suggesters:
- Term suggesters
- Phrase suggesters
- Completion suggesters (autocomplete)
Term suggester
The term suggester can be run at the same time as a query, and can be used to suggest “did you mean …?” alternatives, particularly in the case where a user has misspelled a word.
The term suggester, in its simplest form, will search a field for a given set of words. If the suggester finds very similarly spelled words in the index, which are more common than the terms used in the query, then the suggester will suggest replacing these terms with alternatives.
Phrase suggester
The phrase suggester is similar to the term suggester but more sophisticated. It looks at the position of the words in the text and tries to propose an improved phrase that is more likely to give relevant results. To find relevant results, the phrase suggester needs a field analyzed with a specific analyzer (a shingle-based “trigram” analyzer).
Completion suggester
A completion suggester is quite different to the term and phrase suggesters. Rather than suggesting an improved query, it provides “search as you type” results. The completion suggester is optimized for speed.
How to implement search suggesters in Elasticsearch
How to implement term suggesters
In its simplest form, the term suggester will work without a specific mapping on any “text” field.
POST content_programmes_v4/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "elasticsearch search engyne",
      "term": {
        "field": "content.body",
        "suggest_mode": "missing"
      }
    }
  }
}
The output is shown below. Each entry's “text” is a token from the original input, and “options” lists the suggested replacements for that token, if any.
"suggest" : {
  "my-suggestion" : [
    { "text" : "elasticsearch", "offset" : 0, "length" : 13, "options" : [ ] },
    { "text" : "search", "offset" : 14, "length" : 6, "options" : [ ] },
    {
      "text" : "engyne", "offset" : 21, "length" : 6,
      "options" : [
        { "text" : "engine", "score" : 0.6, "freq" : 12 },
        { "text" : "enzyme", "score" : 0.5, "freq" : 36 }
      ]
    }
  ]
}
The behavior of the suggester is controlled mainly by the “suggest_mode” parameter.
Using “missing” will only suggest alternatives for terms that have zero occurrences in the field in the index, whereas “popular” will suggest alternatives for any term where a reasonable variant is found that is more common than the term in the query. “missing” is a good option for most use cases, but if your data set is likely to already include a significant number of misspelled words, then you should probably use “popular”.
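As an illustration of switching between suggest modes, here is a small Python sketch that builds the request body for either case. The function name term_suggest_body and its defaults are assumptions made for this example; only the JSON structure it produces comes from the request shown earlier.

```python
VALID_SUGGEST_MODES = {"missing", "popular", "always"}


def term_suggest_body(text, field, suggest_mode="missing", name="my-suggestion"):
    """Build the body of a _search request that runs a single term suggester."""
    if suggest_mode not in VALID_SUGGEST_MODES:
        raise ValueError(f"unknown suggest_mode: {suggest_mode!r}")
    return {
        "suggest": {
            name: {
                "text": text,
                "term": {"field": field, "suggest_mode": suggest_mode},
            }
        }
    }


# Curated index: only suggest for terms that do not occur in the field at all.
strict = term_suggest_body("search engyne", "content.body")

# Index that may itself contain typos: also suggest more common variants.
lenient = term_suggest_body("search engyne", "content.body", suggest_mode="popular")
```

The resulting dictionaries can be serialized to JSON and sent to the _search endpoint with whichever HTTP or Elasticsearch client you already use.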
Below is a table with the main parameters you can use with the term and phrase suggesters. Note that gram_size, highlight and collate apply to the phrase suggester only:
field | The field in which the suggester looks for suggestions |
analyzer | The analyzer applied to the suggest text; defaults to the search analyzer of the field |
gram_size | Phrase suggester only. Should be set to 1 or omitted if the field is NOT an ngram or shingle field |
text | The input text to be searched |
highlight | Phrase suggester only. Defines the tags used to highlight changed terms in the output |
collate | Phrase suggester only. Defines a query to re-check all suggestions (see Using collate below) |
size | The number of suggestions to be retrieved (best to use 5 or fewer to avoid irrelevant suggestions) |
shard_size | The maximum number of suggestions to be returned per shard |
How to implement phrase suggesters
The phrase suggester requires a trigram mapping.
PUT my-index
{
  "settings": {
    "index": {
      "number_of_shards": 1,
      "analysis": {
        "analyzer": {
          "trigram": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "shingle"]
          }
        },
        "filter": {
          "shingle": {
            "type": "shingle",
            "min_shingle_size": 2,
            "max_shingle_size": 3
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "fields": {
          "trigram": { "type": "text", "analyzer": "trigram" }
        }
      }
    }
  }
}
You can then run the following query:
POST my-index/_search
{
  "suggest": {
    "text": "Scarlett Johanssen",
    "simple_phrase": {
      "phrase": {
        "field": "content.trigram",
        "size": 1,
        "gram_size": 3,
        "direct_generator": [
          { "field": "content.trigram", "suggest_mode": "always" }
        ],
        "highlight": { "pre_tag": "<em>", "post_tag": "</em>" }
      }
    }
  }
}
The response contains the corrected phrase:
"suggest" : {
  "simple_phrase" : [
    {
      "text" : "Scarlett Johanssen",
      "offset" : 0,
      "length" : 18,
      "options" : [
        {
          "text" : "scarlett johannsson",
          "highlighted" : "scarlett <em>johannsson</em>",
          "score" : 0.19398963
        }
      ]
    }
  ]
}
In this case, although the text might contain a number of misspelled versions of “johannsson”, Elasticsearch uses the context of “scarlett” to pull out the version of “johannsson” most commonly found together with the word “scarlett”.
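To see why the trigram field gives the suggester this kind of context, it can help to look at the tokens a shingle filter produces. The Python sketch below only approximates the “trigram” analyzer from the mapping (lowercasing plus shingles of 2 to 3 words); it splits on whitespace rather than using the standard tokenizer, and the function name shingles is hypothetical.

```python
def shingles(text, min_size=2, max_size=3):
    """Approximate lowercase + shingle analysis by returning word n-grams."""
    words = text.lower().split()
    tokens = []
    for start in range(len(words)):
        # The shingle filter also emits single words (unigrams) by default.
        tokens.append(words[start])
        for size in range(min_size, max_size + 1):
            if start + size <= len(words):
                tokens.append(" ".join(words[start:start + size]))
    return tokens


print(shingles("Scarlett Johansson films"))
# ['scarlett', 'scarlett johansson', 'scarlett johansson films',
#  'johansson', 'johansson films', 'films']
```

Because word pairs and triples are indexed as tokens in their own right, the suggester can compare how often each candidate spelling appears next to its neighbouring words.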
The phrase suggester relies on a list of direct_generators (one or more), which are responsible for generating the candidate suggestions for each term in the input phrase. Each generator requires the field parameter as a minimum, although you can fine-tune it with the following options:
field | The field in the index used to look for suggestions |
max_edits | 1 or 2 (defaults to 2): the maximum number of edits allowed between the input term and a suggestion |
min_doc_freq | Disabled by default. The minimum number of documents a term must appear in to be offered as a suggestion |
min_word_length | Defaults to 4, because spelling errors don't often occur in short words. Increasing this number improves performance at the risk of eliminating possible suggestions |
max_term_freq | If the original term occurs more often than this threshold in the index, then do not look for suggestions. This improves performance by not looking up common words (it is assumed they are correctly spelled) at the risk of eliminating possible suggestions |
prefix_length | Defaults to 1, since spelling errors don't normally occur at the start of words. Increasing this number improves performance at the risk of eliminating possible suggestions |
pre_filter | An analyzer applied to each token of the input text before looking for suggestions. This could be used to propose synonyms as suggestions, e.g. for “sales analysis”, propose “revenue analysis” |
post_filter | An analyzer applied to each generated suggestion before it is returned |
size | The maximum number of suggestions per token |
suggest_mode | missing, popular or always: missing will only suggest terms not present in the index, popular is better if you have many misspelled words in the index, always is useful if you want to maximize the number of results |
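The options in the table can be combined into a generator definition. The following Python sketch builds one direct generator and wraps it in a phrase suggester request like the one shown earlier. The helper names direct_generator and phrase_suggest_body are hypothetical, and the defaults simply mirror the values described in the table.

```python
def direct_generator(field, suggest_mode="missing", max_edits=2,
                     prefix_length=1, min_word_length=4, size=None):
    """Build one direct_generator entry for a phrase suggester."""
    if max_edits not in (1, 2):
        raise ValueError("max_edits must be 1 or 2")
    if suggest_mode not in ("missing", "popular", "always"):
        raise ValueError(f"unknown suggest_mode: {suggest_mode!r}")
    generator = {
        "field": field,
        "suggest_mode": suggest_mode,
        "max_edits": max_edits,
        "prefix_length": prefix_length,
        "min_word_length": min_word_length,
    }
    if size is not None:
        generator["size"] = size
    return generator


def phrase_suggest_body(text, field, generators, size=1, gram_size=3):
    """Wrap one or more generators in a phrase suggester request body."""
    return {
        "suggest": {
            "text": text,
            "simple_phrase": {
                "phrase": {
                    "field": field,
                    "size": size,
                    "gram_size": gram_size,
                    "direct_generator": list(generators),
                }
            },
        }
    }


# Favour recall: consider every term and allow short words to be corrected.
generator = direct_generator("content.trigram", suggest_mode="always",
                             min_word_length=3)
body = phrase_suggest_body("Scarlett Johanssen", "content.trigram", [generator])
```

Raising prefix_length or min_word_length in the generator trades recall for speed, as described in the table.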
Using collate
Collate lets you fine-tune the search suggestions by re-checking each one against a query.
For example:
POST actors/_search
{
  "suggest": {
    "text": "scarlett johannson",
    "simple_phrase": {
      "phrase": {
        "field": "actors.trigram",
        "size": 1,
        "direct_generator": [
          { "field": "actors.trigram", "suggest_mode": "always", "min_word_length": 1 }
        ],
        "collate": {
          "query": {
            "source": { "match": { "{{field_name}}": "{{suggestion}}" } }
          },
          "params": { "field_name": "actors" },
          "prune": true
        }
      }
    }
  }
}
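This query re-checks each suggestion by running it as a match query against the non-trigram field (actors rather than actors.trigram). By default (“prune”: false), a suggestion whose collate query matches no documents is not included. With “prune”: true, as in the example above, every suggestion is returned with an additional “collate_match” flag that indicates whether the collate query found matching documents.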
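When “prune” is set to true, Elasticsearch returns every suggestion together with a “collate_match” flag, so the filtering can be done by the client. Below is a minimal Python sketch of that step; the function name confirmed_suggestions is hypothetical and the sample response fragment is made up purely for illustration.

```python
def confirmed_suggestions(entries):
    """Return suggestion texts whose collate query matched at least one document."""
    confirmed = []
    for entry in entries:
        for option in entry.get("options", []):
            # collate_match is only present when "prune" is true. A missing
            # flag is treated as a match, since without prune Elasticsearch
            # has already dropped the non-matching suggestions.
            if option.get("collate_match", True):
                confirmed.append(option["text"])
    return confirmed


# Made-up response fragment, for illustration only.
sample_entries = [
    {
        "text": "scarlett johannson",
        "offset": 0,
        "length": 18,
        "options": [
            {"text": "scarlett johansson", "score": 0.2, "collate_match": True},
            {"text": "scarlett johanson", "score": 0.1, "collate_match": False},
        ],
    }
]

print(confirmed_suggestions(sample_entries))
# ['scarlett johansson']
```

Keeping the unconfirmed suggestions available can be useful for logging or debugging, which is one reason to prefer “prune”: true over the default.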
How to implement completion suggesters
To create an autocomplete type suggester, you need to create a specific mapping with type “completion”.
PUT actors
{
  "mappings": {
    "properties": {
      "suggest": { "type": "completion" },
      "name": { "type": "keyword" }
    }
  }
}
In the example above, we created the field “suggest” to contain the data to be searched.
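Before the suggester can return anything, documents need to be indexed with values in the “suggest” field. As a rough sketch, the Python function below prepares a newline-delimited payload for the Bulk API (POST _bulk with content type application/x-ndjson). The function name bulk_payload, the document IDs and the sample names are assumptions made for this example.

```python
import json


def bulk_payload(index, names):
    """Build an NDJSON body for the _bulk endpoint, one document per name."""
    lines = []
    for doc_id, name in enumerate(names, start=1):
        # Action line: index the following document under the given ID.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        # Source line: the completion field and the keyword field from the mapping.
        lines.append(json.dumps({"suggest": [name], "name": name}))
    # The Bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


payload = bulk_payload("actors", ["Scarlett Johansson", "Scatman Crothers"])
print(payload)
```

Each value in the “suggest” array becomes an input that the completion suggester can match by prefix.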
To search the above index, we can run a query such as:
POST actors/_search
{
  "_source": "suggest",
  "suggest": {
    "my-suggest": {
      "prefix": "sca",
      "completion": {
        "field": "suggest",
        "size": 5,
        "skip_duplicates": true
      }
    }
  }
}
The “_source” is limited to the suggester field in order to make the response quicker.
The “skip_duplicates” option, when true, filters out suggestions with identical text coming from different documents; it defaults to false.
The “size” indicates how many suggestions should be returned.
Fuzzy search inside autocomplete suggesters
You can also allow spelling mistakes in the completion query by adding a “fuzzy” object alongside “field”, inside the “completion” object.
"field": "suggest", "fuzzy": { "fuzziness": 2, "min_length": 4 }
The value of fuzziness indicates the number of spelling errors permitted in the query. However, be aware that adding fuzziness can return unexpected results, especially at the beginning when the number of letters in the search term is small. For this reason it is advised to set a reasonably high value of “min_length”, which is the minimum length of the search term before fuzziness is applied.
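To show where the “fuzzy” object sits in the full request, here is a small Python sketch that builds the completion query with or without fuzziness. The function name completion_suggest_body and its defaults are hypothetical; the structure follows the completion query shown earlier.

```python
def completion_suggest_body(prefix, field="suggest", size=5,
                            fuzziness=None, min_length=None):
    """Build a completion suggester request, optionally with fuzzy matching."""
    completion = {"field": field, "size": size, "skip_duplicates": True}
    if fuzziness is not None:
        fuzzy = {"fuzziness": fuzziness}
        if min_length is not None:
            # Fuzziness is only applied once the prefix reaches this length.
            fuzzy["min_length"] = min_length
        # "fuzzy" sits next to "field" inside the "completion" object.
        completion["fuzzy"] = fuzzy
    return {
        "_source": field,
        "suggest": {"my-suggest": {"prefix": prefix, "completion": completion}},
    }


exact = completion_suggest_body("sca")
tolerant = completion_suggest_body("scarl", fuzziness=2, min_length=4)
```

With min_length set to 4, a three-letter prefix such as “sca” is still matched exactly, which avoids the noisy results described above for very short inputs.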
Autocompleter response
The response would be something like:
"suggest": {
  "my-suggest": [
    {
      "text": "sca", "offset": 0, "length": 3,
      "options": [
        {
          "text": "scarlett johannssen",
          "_index": "actors",
          "_id": "1",
          "_score": 1.0,
          "_source": { "suggest": [ "scarlett johannssen" ] }
        }
      ]
    }
  ]
}
Autocomplete context filtering
Normally, an autocomplete suggester will work on all documents in the index and it is not possible to filter on other fields in the document. However, it is possible to add a “context” field to enable the autocomplete to be filtered by either category or location. In this way it would be possible to suggest products in a webstore filtered by a certain category, or to suggest hotels filtered by a certain geographical area. In order to do this, it is necessary to add the context to each document at index time.
Add context to suggester mapping
The following mapping specifies that the completion suggester has two contexts: one of type category which uses the tags found in the field “category_tags”, and the other of type geo, found in the field “loc” which must be of type geo_point.
PUT hotels
{
  "mappings": {
    "properties": {
      "suggest": {
        "type": "completion",
        "contexts": [
          { "name": "hotel_type", "type": "category", "path": "category_tags" },
          { "name": "location", "type": "geo", "precision": 4, "path": "loc" }
        ]
      },
      "loc": { "type": "geo_point" },
      "category_tags": { "type": "keyword" }
    }
  }
}
For each document, we need to add the appropriate data to the context fields defined in the mapping.
PUT hotels/_doc/1
{
  "suggest": ["Gran Hotel"],
  "category_tags": ["family_run", "3 star"],
  "loc": "41.56,-71.27"
}
When running the suggest query, we must pass in the context or contexts which we want to use for filtering.
In the query below we pass in two contexts, one for location and another for hotel type. Note however that if more than one context is passed, then the suggestions only need to match one context to be returned, and there is no guarantee that suggestions meeting a greater number of contexts get returned higher up the list.
POST hotels/_search
{
  "suggest": {
    "hotel_suggestion": {
      "prefix": "gra",
      "completion": {
        "field": "suggest",
        "size": 10,
        "contexts": {
          "location": { "lat": 41.56, "lon": -71.27 },
          "hotel_type": "family_run"
        }
      }
    }
  }
}
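A sign error in the coordinates silently points the geo context at a different part of the world, so it can be worth building this query in code. The Python sketch below assembles the context query for the hotels index and rejects out-of-range coordinates; the function name hotel_suggest_body is hypothetical, and the context names come from the mapping above.

```python
def hotel_suggest_body(prefix, lat, lon, hotel_type, size=10):
    """Build a completion query filtered by a geo context and a category context."""
    if not -90 <= lat <= 90:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180 <= lon <= 180:
        raise ValueError(f"longitude out of range: {lon}")
    return {
        "suggest": {
            "hotel_suggestion": {
                "prefix": prefix,
                "completion": {
                    "field": "suggest",
                    "size": size,
                    "contexts": {
                        # Context names must match those declared in the mapping.
                        "location": {"lat": lat, "lon": lon},
                        "hotel_type": hotel_type,
                    },
                },
            }
        }
    }


# Same point as the indexed document: "41.56,-71.27" is latitude, then longitude.
hotel_body = hotel_suggest_body("gra", 41.56, -71.27, "family_run")
```

Note that a range check only catches impossible values; a longitude of 71.27 instead of -71.27 is still valid and would simply match a different area.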
Notes and good things to know
Term and phrase suggesters can run in the same request as the main query, so you can obtain suggestions and search results together without having to send two requests.
However, the use of suggesters can quite significantly slow down your queries, so you might prefer to only send the suggester query in the event that the original query returns a low number of results.
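The “only suggest when results are scarce” approach could be sketched in Python as follows. The search argument stands in for whatever function sends a request body to Elasticsearch and returns the parsed response; it, the function name search_with_fallback and the min_hits threshold are all assumptions for this example. The sketch also assumes the response reports the hit count as hits.total.value, as Elasticsearch 7 and later do by default.

```python
def search_with_fallback(search, query, suggest, min_hits=3):
    """Run the main query; only run the suggester if it returns few hits."""
    response = search({"query": query})
    total = response["hits"]["total"]["value"]
    if total >= min_hits:
        # Enough results: skip the extra cost of the suggester.
        return response, None
    # Too few results: send a second, suggest-only request (no hits needed).
    suggest_response = search({"size": 0, "suggest": suggest})
    return response, suggest_response.get("suggest")


# Stand-in for a real client call, used here only to make the sketch runnable.
calls = []

def fake_search(body):
    calls.append(body)
    if "suggest" in body:
        return {"hits": {"total": {"value": 0}, "hits": []},
                "suggest": {"my-suggestion": []}}
    return {"hits": {"total": {"value": 1}, "hits": []}}


suggest_part = {"my-suggestion": {"text": "engyne",
                                  "term": {"field": "content.body"}}}
results, suggestions = search_with_fallback(
    fake_search, {"match": {"content.body": "engyne"}}, suggest_part)
```

This costs an extra round trip for low-result queries, in exchange for keeping the common case free of suggester overhead.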
If you are considering using a completion suggester, bear in mind there are other alternatives with similar functionality such as prefix queries or “search as you type”, or the Terms Enum API (from 7.14 onwards). In general, completion suggesters have the advantage of being fast, and it is easy to eliminate duplicates. However, it is harder to fine tune the priority in which suggestions are returned.
For a complete discussion on these alternatives, please see https://opster.com/guides/elasticsearch/how-tos/elasticsearch-auto-complete-guide.