How to design an Elasticsearch index to work with synonyms
(Note: this article was written with commands for Elasticsearch version 8.10, although you may find the information helpful for other versions as well. All commands listed are sent via Dev Tools within Kibana.)
Creating a search engine that supports synonyms is one of the most effective ways to improve the relevance of search results for your end users. You may find users complaining that they don’t get the results they expect because they typed a synonym instead of the exact keyword stored in the document.
For example, let’s pretend that you’ve created a search engine for a furniture store. You’ve imported these 3 documents:
{ "id":"1" "product":"red leather sofa" } { "id":"2" "product":"1950 FABRIC couch " } { "id":"3" "product":"corner sectional, leather, extra large" }
When a user searches for the word “couch”, they would only get the second result and miss out on seeing the other two relevant products. Herein lies the importance of incorporating synonyms into Elasticsearch: we want a user who types in any of “couch”, “sofa” or “sectional” to see all three results without having to imagine all of the possible synonyms themselves. Our solution: the synonym token filter.
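To see the problem concretely, assume these three documents were stored in an index named `furniture-index` with default settings (a hypothetical setup for illustration; we build a proper index later in this article). A basic match query for “couch” would only return document 2:

GET furniture-index/_search
{
  "query": {
    "match": {
      "product": "couch"
    }
  }
}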
Text analysis overview
Before we get into the specifics of the synonym token filter, let’s go over some background information so we better understand what we’re doing.
Elasticsearch offers a broad set of functionalities to index data. Beyond being a great place to store your data, Elasticsearch runs analyzers on your data behind the scenes in order to make it searchable. Analyzers are the instructions given to Elasticsearch that describe exactly how your text fields should be indexed.
Each index has a default analyzer, called the standard analyzer, that is used if none is specified and works well for many cases. The beautiful thing about Elasticsearch is that you can create and specify a custom one yourself to fit your needs.
An analyzer has two main components: a tokenizer and zero or more token filters. There is also a third component, character filters, which lets you replace specific characters, or patterns thereof, before tokenization, but we won’t cover them in this article.
Tokenizer
A tokenizer decides how Elasticsearch will take a set of words and divide it into separate terms called “tokens”. One of the simplest examples is the whitespace tokenizer, which breaks up a set of words by whitespace. For example, a field like “red leather sofa” would be indexed into Elasticsearch as three tokens: “red”, “leather” and “sofa”. Similarly, “1950 FABRIC couch” would be indexed as “1950”, “FABRIC”, and “couch”. If a user types in any of these terms, the documents containing them will be returned.
Without tokenizers, a multi-word field would be saved as one full string that would require an exact match to be found, i.e. searching for exactly the phrase “red leather sofa”. Don’t worry, this tokenization happens behind the scenes and you’ll still see your document exactly as you imported it, not as chunked pieces of text. Elasticsearch offers many other tokenizers as well: some create tokens on a change of case (lower to upper), some on a change from one character class to another (letters to numbers), and so on.
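If you want to see tokenization for yourself, the `_analyze` API lets you test a tokenizer without creating an index; for example:

GET _analyze
{
  "tokenizer": "whitespace",
  "text": "red leather sofa"
}

The response lists the three tokens “red”, “leather” and “sofa”, along with their positions and character offsets.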
Token filter
Once tokens have been created, they then run through the analyzer’s token filters. A token filter is an operation that modifies the tokens in some way. For example, you can convert all tokens to lowercase (“FABRIC” to “fabric”), remove whitespace from each token (turning the token “red leather sofa” into “redleathersofa”), or strip each token down to its stem.
What are stems? In English, you may not want the words “walk”, “walks”, “walking”, and “walked” to all be indexed as different words, so a stemmer token filter will take all four of those words and turn them into “walk”. Again, this doesn’t change or edit the document you’ve imported; you’ll still see it as you imported it, but when searched, all versions of that word will resolve to the same stem. You can layer as many token filters as your use case requires; the only important thing to pay attention to is that token filters are chained and applied in the same order as they have been specified.
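As a quick sketch of filter chaining, the following `_analyze` call combines the built-in `lowercase` and `stemmer` filters (the `stemmer` filter defaults to English):

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [ "lowercase", "stemmer" ],
  "text": "walking Walked WALKS"
}

Each of the three words comes back as the token “walk”: the lowercase filter runs first, then the stemmer, in the order they were specified.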
Token filters are exactly where you can add support for synonyms to your documents, with the synonym token filter.
How to implement synonyms into Elasticsearch with the synonym token filter
The synonym token filter will take each token created in the tokenization process mentioned above, and check to see if it matches any of the terms that you define in the synonyms list. If so, it will map the synonyms to those tokens. If you import a document with “red leather sofa” with a whitespace tokenizer, you get the “red”, “leather” and “sofa” tokens. If you add a synonym token filter to map the word “sofa” to “couch” and “sectional”, then when Elasticsearch creates a token called “sofa” it will also create tokens for the synonyms, so that typing “couch” will return the document you stored with “red leather sofa”.
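You can preview this expansion without creating an index by defining a throwaway synonym filter inline in an `_analyze` call:

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": [
    {
      "type": "synonym",
      "synonyms": [ "couch, sectional, sofa" ]
    }
  ],
  "text": "red leather sofa"
}

The response contains the tokens “red”, “leather” and “sofa”, plus “couch” and “sectional” emitted at the same position as “sofa”.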
Without going into too many specifics, you can decide whether these synonym tokens get created at index time or on the fly at search time. In other words, with an index-time analyzer, the word “sofa” is also saved as “couch” and “sectional” as additional tokens within the index (taking up a little more storage space); alternatively, with a search-time analyzer, the word “sofa” is saved only as “sofa” within the index, but when a user searches for the word “sofa”, it gets expanded to “couch” and “sectional” at search time.
One of the main advantages of using search-time analyzers over index-time analyzers is that you can edit and extend the synonyms on the fly without having to reindex all of your documents. Reindexing can be a headache if you need to repeat it every time you want to change your synonym list. It is generally best practice to use search-time analyzers for synonyms rather than index-time ones. See this article for more information about search-time vs index-time analyzers.
In the mapping for a particular field, you can define both how the tokens are saved in the index (the index-time analyzer) and how search terms are analyzed when a user queries (the search-time analyzer). By default they are the same.
Let’s look at an example:
PUT /furniture-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_synonym_analyzer": {
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "my_synonym_token_filter"
            ]
          }
        },
        "filter": {
          "my_synonym_token_filter": {
            "type": "synonym",
            "synonyms": [
              "couch, sectional, sofa",
              "big, large => lrg",
              "small, tiny => sm, smll"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "my_synonym_analyzer"
      }
    }
  }
}
From top to bottom: we’re creating `my_synonym_analyzer`, which tokenizes on whitespace and then applies two token filters, first lowercasing each token and then running it through `my_synonym_token_filter` to expand potential synonyms.
As you can see, `my_synonym_token_filter` defines synonyms in three ways. The first rule makes the three terms “couch”, “sectional”, and “sofa” equivalent, so if a user searches for any one of them, documents containing the others will also appear.
The second rule is unidirectional: if a user types in “big” or “large”, results with “lrg” will be returned, but typing in “lrg” won’t return results with “big” or “large”. The third rule works the same way with multiple replacements: typing in “small” or “tiny” returns any results with “sm” or “smll”, but not vice versa. At the end of the command, we set the mapping for the product field so that the index-time analyzer is the predefined `standard` analyzer and the search-time analyzer is our custom `my_synonym_analyzer`.
Let’s add some sample documents:
POST furniture-index/_bulk
{"index":{"_id":"1"}}
{"product":"smll red leather sofa"}
{"index":{"_id":"2"}}
{"product":"sm 1950 FABRIC couch"}
{"index":{"_id":"3"}}
{"product":"lrg corner sectional, leather"}
Let’s run a sample search:
GET furniture-index/_search
{
  "query": {
    "match": {
      "product": "sofa"
    }
  }
}
All three products are returned!
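To see why, you can ask the index how the search analyzer processes the query term:

GET furniture-index/_analyze
{
  "analyzer": "my_synonym_analyzer",
  "text": "sofa"
}

The term expands into the tokens “sofa”, “couch” and “sectional”, which is why all three documents match.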
So far, we have been managing our list of synonyms directly inside the index settings. Depending on how many synonyms you need to manage, this might not be a good idea, because storing the synonyms inside your index settings inflates your cluster state and can lead to performance issues.
There are two other ways that you can use to manage synonyms in a more effective and optimal way: a) through a synonyms file, and b) through the new Synonyms API released in 8.10. We’re going to introduce each option next.
If you want to pick the first option, you first need to create a text file called “my_list_synonyms.txt” in the `config` directory of ALL your Elasticsearch instances. If you are on Elastic Cloud, you won’t have access to your nodes’ file system, but you can handle synonyms files through custom bundles. The file can be written in either the Solr or the WordNet format; in Solr format, ours looks like this:
couch, sectional, sofa
big, large => lrg
small, tiny => sm, smll
We also need to change the index settings within `my_synonym_token_filter`, replacing the inline list with the path to the synonyms text file that we just created:
PUT /furniture-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_synonym_analyzer": {
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "my_synonym_token_filter"
            ]
          }
        },
        "filter": {
          "my_synonym_token_filter": {
            "type": "synonym",
            "synonyms_path": "my_list_synonyms.txt",
            "updateable": true
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "my_synonym_analyzer"
      }
    }
  }
}
Now when we want to update synonyms, because we have set the filter to `"updateable": true`, we simply edit the file and send the following two commands (on version 7.3 and above):
POST /furniture-index/_reload_search_analyzers
POST /furniture-index/_cache/clear?request=true
This will reload any changes we made to the synonyms file and empty the request cache to make sure it doesn’t contain responses based on the previous version of your synonyms file!
For those of you who are running a version older than 7.3, you’ll need to temporarily close and then reopen the index in which you want to reload the synonyms file rather than sending the two commands above. Upon reopening, the index will reload the new synonyms file.
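In that case, the close/reopen sequence looks like this (note that the index is briefly unavailable for searches while it’s closed):

POST /furniture-index/_close
POST /furniture-index/_open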
*Important*: If you have multiple nodes in your cluster, be sure to update the synonyms file on every node to ensure a consistent result.
The second option is to use the new Synonyms API, released in beta in 8.10. This option is the most flexible one, as it doesn’t inflate the cluster state like we saw earlier, and it doesn’t require you to upload a synonyms file to all your nodes or to clear your caches.
This new API simply stores the synonyms in a system index, like any other documents. It allows you to create synonyms sets and their associated rules. Synonyms sets are simply a way to store related synonyms in a consistent way, and they can be updated dynamically at search time. There is also no limit on how many sets you can define. Each synonyms set contains synonyms rules. Let’s now see how to define such a synonyms set and apply it to our `my_synonym_analyzer`.
First, we need to create our first synonyms set using the command below (note that the format to be used must be the Solr one):
PUT _synonyms/furniture-synonyms
{
  "synonyms_set": [
    {
      "id": "couch-synonyms",
      "synonyms": "couch, sectional, sofa"
    },
    {
      "id": "large-synonyms",
      "synonyms": "big, large => lrg"
    }
  ]
}
So, we have created a synonyms set called `furniture-synonyms`, but it only contains the first two synonym rules we used earlier, for “couch” and “large”. We have forgotten to add the rule for “small”, but we don’t need to update the full synonyms set; we can just add the single missing rule to our existing set as shown below:
PUT _synonyms/furniture-synonyms/small-synonyms
{
  "synonyms": "small, tiny => sm, smll"
}
A new synonym rule with the identifier `small-synonyms` has now been added to our previously created synonyms set. And there is no need to reload anything, all synonyms sets and rules are reloaded automatically for us and search analyzers are ready to pick them up. See how easy that was?
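If you ever want to double-check what’s stored, you can retrieve the full set (individual rules can likewise be fetched or deleted by their identifier):

GET _synonyms/furniture-synonyms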
Now that our synonyms have been stored, we can redefine our `my_synonym_token_filter` token filter to reference them, using the following command (note that we are only showing the parts that change):
PUT /furniture-index
{
  "settings": {
    "index": {
      "analysis": {
        …
        "filter": {
          "my_synonym_token_filter": {
            "type": "synonym",
            "synonyms_set": "furniture-synonyms",
            "updateable": true
          }
        }
      }
    }
  },
  …
}
As you can see, we only need to specify the `synonyms_set` setting instead of `synonyms_path`; all the rest stays the same.
Depending on your needs, environment constraints and the version you’re using, you now have different options to manage your synonyms.
Summary
In this article, we have explained the importance of text analyzers as well as why search-time analyzers are more flexible than index-time analyzers. We have then reviewed the benefits of the synonym token filter in the analysis process.
We wrapped up by showing three different ways of managing your synonyms, namely inline in your index settings, through a text file uploaded to all your nodes (reloadable on the fly since 7.3), and via the new Synonyms API (since 8.10), which offers the greatest flexibility.
Giving your users a relevant search experience requires a lot of fine-tuning, and synonyms are a great way to deliver the experience you’re hoping they’ll have. As with any system, it’s important to continually test and adjust your configuration to keep things up to date and working as you intend.