Quick links
- How to design an OpenSearch index to work with synonyms
- Text analysis overview
- How to implement synonyms into OpenSearch with the synonym token filter
- Summary
How to design an OpenSearch index to work with synonyms
(Note: All commands listed are sent via the Dev Tools within OpenSearch Dashboards.)
Creating a search engine that supports synonyms is one of the most effective ways to improve the relevance of search results for your end users. You may find users complaining that they don't get the results they expect because, instead of typing the exact keyword stored in a document, they typed a synonym of it and came up empty.
For example, let’s pretend that you’ve created a search engine for a furniture store. You’ve imported these 3 documents:
{ "id":"1" "product":"red leather sofa" } { "id":"2" "product":"1950 FABRIC couch " } { "id":"3" "product":"corner sectional, leather, extra large" }
When a user searches for the word “couch”, they only get the second result and miss the other two relevant products. Herein lies the importance of incorporating synonyms into OpenSearch: we want a user who types “couch”, “sofa”, or “sectional” to see all three results without having to imagine every possible synonym themselves. Our solution: the synonym token filter.
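As a preview of where we're headed, here is a sketch of the kind of query a user would run (`furniture-index` is the index we'll build later in this article). Against an index with default settings, it would match only the second document:

GET furniture-index/_search
{
  "query": {
    "match": {
      "product": "couch"
    }
  }
}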
Text analysis overview
Before we get into the specifics of the synonym token filter, let’s go over some background information so we better understand what we’re doing.
OpenSearch offers a broad set of functionality for indexing data. Besides being a great place to store your data, OpenSearch runs analyzers on that data behind the scenes in order to make it searchable. Analyzers are the instructions given to OpenSearch on the nitty-gritty of how your text fields should be indexed and stored.
Each index has a default analyzer, the standard analyzer, which is used when none is specified and works well for many cases. The beautiful thing about OpenSearch is that you can create and specify a custom analyzer yourself to fit your needs.
An analyzer has 2 main components: a tokenizer and zero or more token filters.
Tokenizer
A tokenizer decides how OpenSearch takes a set of words and divides it into separate terms called “tokens”. One of the simplest tokenizers is the whitespace tokenizer, which breaks up text on whitespace. For example, a field like “red leather sofa” would be indexed into OpenSearch as 3 tokens: “red”, “leather”, and “sofa”. Likewise, “1950 FABRIC couch” would be indexed as “1950”, “FABRIC”, and “couch”. If a user types any of these terms, the documents containing them will be returned.
Without tokenizers, every multi-word field would be stored as one full string that required an exact match to return a result, i.e. searching for exactly the phrase “red leather sofa”. Don’t worry: tokenization happens behind the scenes, and you’ll still see your document exactly as you imported it, not as chunked pieces of text. OpenSearch offers many other types of tokenizers as well, such as ones that create tokens on a change of case (lower to upper) or a change from one character class to another (letters to numbers).
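If you want to see tokenization in action, a quick way is the `_analyze` API (a minimal sketch; the tokenizer and text here are just examples):

POST /_analyze
{
  "tokenizer": "whitespace",
  "text": "red leather sofa"
}

The response lists each token (“red”, “leather”, “sofa”) along with its position and character offsets.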
Token filter
Once tokens have been created, they run through the analyzer’s token filters. A token filter is an operation that modifies tokens in some way or another. For example, you can convert all tokens to lowercase (“FABRIC” to “fabric”), remove whitespace from a token (turning the token “red leather sofa” into “redleathersofa”), or strip each token down to its stem.
What are stems? In English, you may not want the words “walk”, “walks”, “walking”, and “walked” to be indexed as four different words, so a stemming token filter turns all of them into “walk”. Again, this doesn’t change or edit the document you’ve imported; you’ll still see it as you imported it, but all versions of that word are indexed as the same stem, so a search for any of them matches. You can layer as many token filters as your use case requires; the only important thing to pay attention to is that token filters are chained and applied in the same order as they have been specified.
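You can see both filtering and filter ordering with the same `_analyze` API (a sketch; `porter_stem` is one of several stemmers OpenSearch ships with):

POST /_analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "porter_stem"],
  "text": "WALKED walks walking"
}

Because `lowercase` runs first, “WALKED” becomes “walked” before the stemmer sees it, and all three tokens come back as “walk”.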
Token filters are exactly where you can add support for synonyms to your documents, with the synonym token filter.
How to implement synonyms into OpenSearch with the synonym token filter
The synonym token filter takes each token created in the tokenization process described above and checks whether it matches any of the terms you define in a synonyms list. If so, it maps the synonyms to those tokens. If you index a document containing “red leather sofa” with a whitespace tokenizer, you get the tokens “red”, “leather”, and “sofa”. If you add a synonym token filter that maps the word “sofa” to “couch” and “sectional”, then when OpenSearch creates the token “sofa” it also creates tokens for the synonyms, so that typing “couch” returns the document you stored with “red leather sofa”.
Without going into too many specifics, so as not to get lost in details: you can decide whether these synonym tokens get created at index time or on the fly at search time. In other words, with an index-time analyzer, the word “sofa” is also saved as “couch” and “sectional” as additional tokens within the index (taking up a little more storage space); alternatively, with a search-time analyzer, the word “sofa” is saved only as “sofa” within the index, but when a user searches for “sofa”, the query is expanded to “couch” and “sectional” at search time.
One of the main advantages of using search-time analyzers vs index-time analyzers is you can edit and add to the synonyms on the fly without having to reindex all of your documents. Reindexing can be a headache if you need to repeatedly do it every time you want to make a change to your synonym list. It is generally best practice to use search-time analyzers for synonyms rather than index-time ones.
In the mapping for a particular field, you can define how you want to save the tokens in the index (index-time analyzers), and how you want terms to be analyzed when a user searches (search-time analyzers). By default they are the same.
Let’s look at an example:
PUT /furniture-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_synonym_analyzer": {
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "my_synonym_token_filter"
            ]
          }
        },
        "filter": {
          "my_synonym_token_filter": {
            "type": "synonym",
            "synonyms": [
              "couch, sectional, sofa",
              "big, large => lrg",
              "small, tiny => sm, smll"
            ]
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "my_synonym_analyzer"
      }
    }
  }
}
From top to bottom: we’re creating `my_synonym_analyzer`, which tokenizes on whitespace and then applies two token filters in order: `lowercase` first, then `my_synonym_token_filter` to expand any synonyms.
As you can see, `my_synonym_token_filter` defines synonyms with 3 rules. The first makes the three terms “couch”, “sectional”, and “sofa” equivalent, so if a user searches for any of these three words, documents containing any of the others will also appear.
The second rule is unidirectional: if a user types “big” or “large”, results containing “lrg” are returned, but typing “lrg” would not return results containing “big” or “large”. The third rule maps to multiple terms: typing “small” or “tiny” returns any results with “sm” OR “smll”. At the end of the command, we set the mappings for the product field so that the index-time analyzer is the predefined `standard` analyzer and the search-time analyzer is our custom `my_synonym_analyzer`.
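To verify how a search term will be expanded, you can ask the index to analyze it with the custom analyzer (the `_analyze` API again, this time pointed at our index):

GET /furniture-index/_analyze
{
  "analyzer": "my_synonym_analyzer",
  "text": "sofa"
}

You should see “sofa” plus the “couch” and “sectional” synonym tokens at the same position.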
Let’s add some sample documents:
POST furniture-index/_bulk
{"index":{"_id":"1"}}
{"product":"smll red leather sofa"}
{"index":{"_id":"2"}}
{"product":"sm 1950 FABRIC couch"}
{"index":{"_id":"3"}}
{"product":"lrg corner sectional, leather"}
Let’s run a sample search:
GET furniture-index/_search
{
  "query": {
    "match": {
      "product": "sofa"
    }
  }
}
All 3 products returned!
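The explicit mappings work too. As a quick check against the documents we just indexed, this query matches the third product because “large” is rewritten to “lrg” at search time:

GET furniture-index/_search
{
  "query": {
    "match": {
      "product": "large"
    }
  }
}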
However, it is typically best practice to keep synonyms in a separate file stored locally on the nodes rather than defining them inline like we did above.
Create a file called “my_list_synonyms.txt” in the `config` directory of your OpenSearch instance (the acceptable formats for this file are Solr and WordNet):
couch, sectional, sofa
big, large => lrg
small, tiny => sm, smll
Next, change the index settings so that `my_synonym_token_filter` points to the synonyms file you created. Note that analysis settings can’t be changed on an open index, so if you’re reusing the index from above, delete it first (DELETE /furniture-index) and then recreate it:
PUT /furniture-index
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_synonym_analyzer": {
            "tokenizer": "whitespace",
            "filter": [
              "lowercase",
              "my_synonym_token_filter"
            ]
          }
        },
        "filter": {
          "my_synonym_token_filter": {
            "type": "synonym",
            "synonyms_path": "my_list_synonyms.txt",
            "updateable": true
          }
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "product": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "my_synonym_analyzer"
      }
    }
  }
}
Because we have set `"updateable": true` on the filter, whenever we need to update the synonyms we simply edit the file and send these two commands (on version 7.3 and above):
POST /furniture-index/_reload_search_analyzers
POST /furniture-index/_cache/clear?request=true
This will reload any changes we make to the synonyms!
For those of you running a version older than 7.3, instead of sending the two commands above, you’ll need to temporarily close and then reopen the index whose synonym file you want to reload. Upon reopening, the index will load the new synonym file.
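A minimal sketch of that close/reopen cycle:

POST /furniture-index/_close
POST /furniture-index/_open

Keep in mind that the index rejects reads and writes while it is closed.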
*Important*: If you have multiple nodes in your cluster, be sure to update the synonyms file on every node to ensure a consistent result.
Summary
Giving your users a relevant search experience requires lots of fine-tuning, and synonyms are a great way to deliver the experience you’re hoping they have. As with any system, it’s important to continually test and adjust your configuration to keep things up to date and working as you intend.