By Opster Team

Updated: Aug 27, 2023

Leveraging the More_Like_This query in Elasticsearch for enhanced search results

Elasticsearch’s More_Like_This (MLT) query finds documents that are similar to a given piece of text or to a given set of documents. This is particularly useful when you want to provide recommendations or surface related content based on a specific document or text.

The More_Like_This query works by using a simple bag-of-words (BoW) model. It analyzes the text supplied in the ‘like’ parameter and then uses this analysis to find similar documents. The MLT query can be run on any indexed field of type ‘text’ or ‘keyword’.

How More_Like_This query works

The More_Like_This query operates by selecting a set of representative terms from the ‘like’ text. These terms are chosen based on their tf-idf (term frequency-inverse document frequency) score, a statistical measure of how important a word is to a document in a collection or corpus. The importance increases with the number of times a word appears in the document but is offset by how frequently the word appears across the corpus.
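The term selection described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual Lucene implementation: the whitespace tokenizer, the toy corpus, the particular idf formula, and the function name `select_terms` are all assumptions made for the example. Only the parameter names mirror the real query.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return text.lower().split()


def select_terms(like_text, corpus, min_term_freq=2, min_doc_freq=5,
                 max_query_terms=25):
    """Return the highest-scoring tf-idf terms from like_text (simplified)."""
    term_freq = Counter(tokenize(like_text))

    # Document frequency: in how many corpus documents each term occurs.
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))

    scored = []
    for term, tf in term_freq.items():
        df = doc_freq[term]
        if tf < min_term_freq or df < min_doc_freq:
            continue  # too rare in the input text or in the corpus
        # One common smoothed idf variant, chosen for illustration only.
        idf = math.log((len(corpus) + 1) / (df + 1)) + 1
        scored.append((tf * idf, term))

    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:max_query_terms]]


corpus = [
    "the dragon guards the castle",
    "the knight rides to the castle",
    "the dragon sleeps",
    "the knight sleeps",
]
top_terms = select_terms("dragon castle dragon and the", corpus,
                         min_term_freq=1, min_doc_freq=1, max_query_terms=2)
print(top_terms)  # ['dragon', 'castle']
```

In this toy corpus, "the" appears in every document, so its idf is low and it loses out to "dragon" and "castle", while "and" is dropped because it does not occur in the corpus at all.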

Once the words are chosen, the More_Like_This query creates a disjunctive query (a ‘should’ Boolean query) with these terms. Documents that have more of these terms will rank higher.
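As a rough illustration of that disjunctive query, the sketch below builds a ‘should’ Boolean query from a list of already selected terms and ranks a few toy documents by how many of those terms they contain. The helper names and sample documents are hypothetical, and counting matches is only a stand-in for Elasticsearch's real relevance scoring, which weights each term rather than simply counting it.

```python
import json


def build_should_query(field, terms):
    """Combine terms into a bool query in which every clause is optional."""
    return {"bool": {"should": [{"term": {field: term}} for term in terms]}}


def count_matching_terms(doc_text, terms):
    """Toy relevance signal: how many of the selected terms the text contains."""
    tokens = set(doc_text.lower().split())
    return sum(1 for term in terms if term in tokens)


terms = ["dragon", "castle"]
query = build_should_query("title", terms)
print(json.dumps(query, indent=2))

docs = [
    "the knight sleeps",
    "the dragon sleeps",
    "the dragon guards the castle",
]
# Documents containing more of the selected terms come first.
ranked = sorted(docs, key=lambda d: count_matching_terms(d, terms), reverse=True)
print(ranked)
```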

Parameters of More_Like_This query

The More_Like_This query accepts several parameters that can be used to fine-tune the results. Some of the key parameters include:

– `min_term_freq`: The minimum term frequency. Terms that occur fewer times than this in the input text are ignored. The default value is 2.
– `max_query_terms`: The maximum number of query terms that will be included in the generated query. The default value is 25.
– `min_doc_freq`: The minimum document frequency. Terms that appear in fewer documents than this are ignored. The default value is 5.
– `max_doc_freq`: The maximum document frequency. Terms that appear in more documents than this are ignored, which is useful for excluding high-frequency words such as stop words. The default value is unbounded (Integer.MAX_VALUE).
– `stop_words`: An array of stop words that will be ignored.
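To see how these parameters fit together in a request body, here is a small Python sketch that assembles a More_Like_This search body as a plain dictionary. The helper name `more_like_this_body` is hypothetical; the keyword defaults simply mirror the defaults listed above, and optional parameters are left out of the body when they are not set.

```python
import json


def more_like_this_body(fields, like, min_term_freq=2, max_query_terms=25,
                        min_doc_freq=5, max_doc_freq=None, stop_words=None):
    """Build a search body containing a more_like_this query."""
    mlt = {
        "fields": fields,
        "like": like,
        "min_term_freq": min_term_freq,
        "max_query_terms": max_query_terms,
        "min_doc_freq": min_doc_freq,
    }
    # Leave max_doc_freq out entirely to keep it unbounded (the default).
    if max_doc_freq is not None:
        mlt["max_doc_freq"] = max_doc_freq
    if stop_words:
        mlt["stop_words"] = stop_words
    return {"query": {"more_like_this": mlt}}


body = more_like_this_body(
    ["title", "description"],
    "Once upon a time",
    min_term_freq=1,
    stop_words=["a", "the"],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be pasted as the body of a `GET my-index/_search` request.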

Using More_Like_This query

Here is an example of how to use the More_Like_This query:

```json
GET my-index/_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["title", "description"],
            "like" : "Once upon a time",
            "min_term_freq" : 1,
            "max_query_terms" : 12
        }
    }
}
```

In this example, the More_Like_This query will find documents whose ‘title’ or ‘description’ is similar to the text ‘Once upon a time’. Terms only need to appear once in that text to be considered (`min_term_freq` of 1), and at most 12 terms will be included in the generated query (`max_query_terms` of 12).

The `like` parameter can also accept references to specific documents (in any index) instead of a text string. In this case, Elasticsearch will retrieve the ‘title’ and ‘description’ content of the referenced documents and perform the More_Like_This query on those values:

```json
GET my-index/_search
{
    "query": {
        "more_like_this" : {
            "fields" : ["title", "description"],
            "like" : [
              {
                 "_index": "other-index",
                 "_id": "123"
              },
              {
                 "_index": "other-index",
                 "_id": "456"
              }
            ],
            "min_term_freq" : 1,
            "max_query_terms" : 12
        }
    }
}
```

In the previous example, the documents with IDs 123 and 456 will be retrieved from “other-index”, and the values of their “title” and “description” fields will be used in the More_Like_This query. The `like` parameter also lets you mix and match text strings and document references in the same array.
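A mixed `like` array might look like the following, sketched here as a Python dictionary that serializes to the request body. The `doc_ref` helper is hypothetical and exists only to keep the example short; the index name and ID are reused from the example above.

```python
import json


def doc_ref(index, doc_id):
    """Build a document reference for the like array; IDs are sent as strings."""
    return {"_index": index, "_id": str(doc_id)}


mixed_body = {
    "query": {
        "more_like_this": {
            "fields": ["title", "description"],
            # Free text and a document reference side by side.
            "like": ["Once upon a time", doc_ref("other-index", 123)],
            "min_term_freq": 1,
            "max_query_terms": 12,
        }
    }
}
print(json.dumps(mixed_body, indent=2))
```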

Fine-tuning More_Like_This query

While the More_Like_This query can be very effective out of the box, it’s often necessary to fine-tune it to get the best results. This can be done by adjusting the parameters based on the specific requirements of your use case.

For instance, if you’re dealing with a large corpus of documents, you might want to raise `min_doc_freq` to exclude very rare words and set `max_doc_freq` to a finite value (it is unbounded by default) to exclude very common ones. Similarly, if you’re dealing with short texts, where few words repeat, you might want to decrease `min_term_freq` (for example, to 1) so that more terms are included in the query.
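The short-text case is easy to demonstrate. The sketch below counts term frequencies in a short input and shows which terms survive a given `min_term_freq`. It assumes a naive lowercase, whitespace-split tokenizer rather than a real Elasticsearch analyzer, and the function name `surviving_terms` is made up for the example.

```python
from collections import Counter


def surviving_terms(text, min_term_freq):
    """Return the terms that occur at least min_term_freq times in text."""
    counts = Counter(text.lower().split())
    return sorted(term for term, count in counts.items()
                  if count >= min_term_freq)


short_text = "Once upon a time"
print(surviving_terms(short_text, 2))  # []
print(surviving_terms(short_text, 1))  # ['a', 'once', 'time', 'upon']
```

With the default `min_term_freq` of 2, no word in this short text qualifies, so the generated query would have no terms to work with. Lowering the threshold to 1 keeps all four words.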