Elasticsearch Combined_Fields Query Type in OpenSearch

By Opster Expert Team

Updated: Mar 10, 2024


Overview and background

Sometimes a match can span several text fields. In such cases, you can use the combined_fields query to search multiple text fields as though their values had been indexed into one combined field. Beyond this, the main benefit of combined_fields is its robust and interpretable scoring algorithm.

The combined_fields query belongs to the full-text query group, which lets you search analyzed text fields (for example, an error message).

The query string is analyzed before the search runs. If a search_analyzer is specified in the field mapping, that analyzer processes the query string. If no search_analyzer is specified, the analyzer applied to the field during indexing is used instead.
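The fallback rule above can be sketched in a few lines of Python. This is only an illustration, not OpenSearch code: resolve_search_analyzer is a hypothetical helper, the mapping is a plain dict shaped like one field of an index mapping, and the fallback to the standard analyzer when the mapping names no analyzer at all is an assumption.

```python
def resolve_search_analyzer(field_mapping: dict) -> str:
    """Return the name of the analyzer that would process the query string.

    Hypothetical helper: `field_mapping` is a plain dict shaped like one
    field of an index mapping, e.g. {"type": "text", "analyzer": "english"}.
    """
    # An explicit search_analyzer always wins at query time.
    if "search_analyzer" in field_mapping:
        return field_mapping["search_analyzer"]
    # Otherwise fall back to the index-time analyzer. "standard" is assumed
    # here as the default when the mapping names no analyzer at all.
    return field_mapping.get("analyzer", "standard")


print(resolve_search_analyzer({"type": "text", "analyzer": "english"}))
print(resolve_search_analyzer(
    {"type": "text", "analyzer": "english", "search_analyzer": "simple"}
))
```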

An analyzer is a component built out of three smaller building blocks: character filters, tokenizers, and token filters, whether built-in or customized. The built-in analyzers pre-package these building blocks for various languages and kinds of text. OpenSearch also exposes the individual building blocks, so you can combine them to create new custom analyzers. You can read more about analyzers in the OpenSearch documentation.

What is this query used for?

The combined_fields query lets you search multiple text fields as though their values had been indexed into one combined field. It is handy when a match could span several text fields.

The combined_fields query works by taking a term-centric view of the query. It analyzes the query string into individual terms, and then searches for each term in any of the fields. This query can be used instead of the cross_fields mode of the multi_match query. It offers more straightforward behavior and a more robust scoring system. The combined_fields query only works on fields that share exactly the same search analyzer, while the multi_match query does not require the fields to have the same search analyzer.
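To make the term-centric view concrete, here is a minimal Python sketch of the matching step. It is a simulation under stated assumptions, not how OpenSearch is implemented: matches is a hypothetical function, and lowercasing plus whitespace splitting stands in for the shared search analyzer.

```python
def matches(doc: dict, query: str, fields: list, operator: str = "or") -> bool:
    """Term-centric match: each query term may be found in any listed field."""
    # Stand-in for real analysis: lowercase and split on whitespace.
    terms = query.lower().split()
    # Pool the tokens of all searched fields, as if they were one field.
    combined = set()
    for name in fields:
        combined.update(str(doc.get(name, "")).lower().split())
    hits = [term in combined for term in terms]
    return all(hits) if operator == "and" else any(hits)


doc = {"summary": "an introduction to data", "content": "mining techniques"}
fields = ["summary", "abstract", "content"]
# "data" is in summary and "mining" is in content, so "and" still matches.
print(matches(doc, "data mining", fields, operator="and"))
```

Note that the document matches with the "and" operator even though no single field contains both terms, which is the practical difference from a field-centric query.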

Top-level parameters in the combined_fields query

The combined_fields query has six top-level parameters: fields, query, auto_generate_synonyms_phrase_query, operator, minimum_should_match, and zero_terms_query.

  1. fields: a required array of strings listing the fields you want to search. Only text fields are allowed, and all of them must use the same search analyzer. Wildcard patterns are permitted in field names.
  2. query: a required string containing the text to search for in the fields listed in the fields parameter. The combined_fields query analyzes this text before executing the search.
  3. auto_generate_synonyms_phrase_query: an optional boolean that defaults to true. When it is true, match phrase queries are generated automatically for multi-term synonyms.
  4. operator: an optional string that sets the boolean logic used to interpret the text in the query value.

There are two valid values for the operator parameter:

    1. “or” (the default value): a query value of “data mining” is interpreted as “data OR mining”.
    2. “and”: a query value of “data mining” is interpreted as “data AND mining”.
  5. minimum_should_match: an optional string that sets the minimum number of clauses that must match for a document to be returned.

The minimum_should_match parameter accepts six kinds of values, as listed in the official documentation:

  • Integer: specifies a fixed value regardless of the number of optional clauses. An example of an integer value for the minimum_should_match parameter is 4.
  • Negative integer: specifies that the entire number of optional clauses, minus this number, should be required. An example of a negative integer value for the minimum_should_match parameter is -3.
  • Percentage: specifies that this percentage of the total number of optional clauses is required. The number computed from the percentage is rounded down and used as the minimum. An example of a percentage value for the minimum_should_match parameter is 73%.
  • Negative percentage: specifies that this percentage of the total number of optional clauses can be missing. The percentage-based number is rounded down before being deducted from the total, to determine the minimum. An example of a negative percentage value for the minimum_should_match parameter is -20%.
  • Combination: a conditional specification consisting of a positive integer, followed by the less-than sign and any of the previously mentioned specifiers. If the number of optional clauses is equal to or less than the integer, they are all required; if it is greater than the integer, the specifier applies. An example of a combined value for the minimum_should_match parameter is 4<80%. In this example, if there are 1 to 4 clauses they are all required, but for 5 or more clauses only 80% are required.
  • Multiple combinations: multiple conditional specifications can be separated by spaces, with each one applying only when the number of clauses is greater than the integer of the one before it. An example of multiple combinations for the value of the minimum_should_match parameter is 2<-25% 9<-3. In this example, if there are 1 or 2 clauses both are required, if there are 3 to 9 clauses all but 25% are required, and if there are more than 9 clauses, all but three are required.
  6. zero_terms_query: an optional string that controls whether any documents are returned when the analyzer removes all tokens from the query, for example when a stop filter is used. The zero_terms_query parameter has two valid values:
    1. “none” (the default value): no documents are returned if the analyzer removes all tokens.
    2. “all”: all documents are returned, like the match_all query.
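The minimum_should_match rules listed above are easier to follow with numbers. The following Python sketch computes how many clauses are required for a given specification. It is written from the descriptions in this article and is not the OpenSearch source code: required_clauses is a hypothetical name, and integer arithmetic is assumed for the "rounded down" step.

```python
def required_clauses(total: int, spec: str) -> int:
    """Number of optional clauses that must match, out of `total`."""
    spec = spec.strip()
    if "<" in spec:
        # One or more conditional specifications separated by spaces.
        result = total
        for condition in spec.split():
            bound, inner = condition.split("<", 1)
            if total <= int(bound):
                return result
            result = required_clauses(total, inner)
        return result
    if spec.endswith("%"):
        percent = int(spec[:-1])
        # Round the percentage-based number down before using it.
        share = total * abs(percent) // 100
        result = share if percent >= 0 else total - share
    else:
        value = int(spec)
        result = value if value >= 0 else total + value
    return max(result, 0)


for spec in ["4", "-3", "73%", "-20%", "4<80%", "2<-25% 9<-3"]:
    print(spec, "->", required_clauses(10, spec))
```

With 10 optional clauses, the values from the list above work out to 4, 7, 7, 8, 8, and 7 required clauses respectively.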

How to implement it

In the example below, the combined_fields query takes a term-centric view of the query by analyzing the “data mining” query string into two individual terms, “data” and “mining”. Because the operator is “and”, both terms must be present, but each term may be found in any of the “summary”, “abstract”, and “content” fields.

GET my_index/_search
{
  "query": {
    "combined_fields" : {
      "query":      "data mining",
      "fields":     ["summary", "abstract", "content"],
      "operator":   "and"
    }
  }
}

To conclude, the above search query searches the my_index index for documents that contain both the data and mining terms across the summary, abstract, and content fields. Because the query is term-centric, the two terms do not have to appear in the same field: a document matches as long as each term appears in at least one of those fields.
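If you build requests from application code, the same request body can be assembled programmatically. The sketch below is one possible way to do it in Python. The combined_fields_body helper is hypothetical and only builds the JSON body, so sending it to the _search endpoint is left to whatever HTTP or OpenSearch client you already use.

```python
import json


def combined_fields_body(query, fields, operator="or", minimum_should_match=None):
    """Build a search request body for a combined_fields query (hypothetical helper)."""
    if operator not in ("or", "and"):
        raise ValueError("operator must be 'or' or 'and'")
    clause = {"query": query, "fields": list(fields), "operator": operator}
    # Only include optional parameters that were actually provided.
    if minimum_should_match is not None:
        clause["minimum_should_match"] = minimum_should_match
    return {"query": {"combined_fields": clause}}


body = combined_fields_body(
    "data mining", ["summary", "abstract", "content"], operator="and"
)
print(json.dumps(body, indent=2))
```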

Based on the simple BM25F formula, the combined_fields query provides a systematic approach to scoring. The BM25F algorithm is one of the most successful Web-search and corporate-search similarity algorithms. It is based on the Probabilistic Relevance Framework (PRF) which is a formal framework for document retrieval. 

To understand the BM25F algorithm, it is recommended to read The Probabilistic Relevance Framework: BM25 and Beyond, especially the chapter on derived models, whose BM25F section explains how the algorithm works.

BM25F offers the option to weight fields individually. Field weights act by multiplying the raw term frequency of a field before the individual field statistics are combined into document-level statistics. This accomplishes two major goals: it captures relative field significance and creates a generalizable ranking mechanism.

The query combines term and collection statistics across fields when scoring matches. This enables it to score each match as though all of the selected fields were merged into a single field. (Please note that this is a best effort; combined_fields makes some approximations, and scores will not precisely follow this model.)
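The idea of weighting term frequencies per field and then scoring the merged result can be illustrated with a toy BM25F scorer in Python. This is a textbook-style sketch, not the scoring code used by OpenSearch: bm25f_score is a hypothetical function, whitespace splitting stands in for analysis, k1 = 1.2 and b = 0.75 are assumed as the usual BM25 defaults, and the field lengths are weighted the same way as the term frequencies.

```python
import math

K1, B = 1.2, 0.75  # assumed BM25 defaults


def bm25f_score(query_terms, doc, corpus, weights):
    """Score `doc` as if its weighted fields were merged into one field."""

    def combined_tf(d, term):
        # Field weights multiply the raw term frequency of each field.
        return sum(w * d.get(f, "").lower().split().count(term)
                   for f, w in weights.items())

    def combined_len(d):
        return sum(w * len(d.get(f, "").split()) for f, w in weights.items())

    n = len(corpus)
    avg_len = sum(combined_len(d) for d in corpus) / n
    score = 0.0
    for term in (t.lower() for t in query_terms):
        tf = combined_tf(doc, term)
        if tf == 0:
            continue
        # Document frequency is counted over the combined field.
        df = sum(1 for d in corpus if combined_tf(d, term) > 0)
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        norm = K1 * (1 - B + B * combined_len(doc) / avg_len)
        score += idf * tf / (tf + norm)
    return score


corpus = [
    {"abstract": "machine learning", "content": "notes here"},
    {"abstract": "notes here", "content": "machine learning"},
]
weights = {"abstract": 2.0, "content": 1.0}
for d in corpus:
    print(bm25f_score(["machine", "learning"], d, corpus, weights))
```

Both sample documents contain the same words, but the first one scores higher because its matches sit in the field with the larger weight.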

You can boost individual fields with the caret (^) notation. In the example below, the abstract field receives a boost of 2; the score is computed as if each term in the abstract appeared twice in the synthesized combined field.

GET my_index/_search
{
  "query": {
    "combined_fields" : {
      "query":      "machine learning",
      "fields":     ["abstract^2", "content"]
    }
  }
}

In the combined_fields query, field boosts must be greater than or equal to 1.0. Fractional boosts are allowed.
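A small client-side check can catch invalid boosts before a request is sent. The Python sketch below parses the caret notation and applies the rule stated above. The parse_field helper is hypothetical, and treating a field without a caret as having a boost of 1.0 is an assumption of this sketch.

```python
def parse_field(spec: str):
    """Split a "name^boost" string into (name, boost) and validate the boost."""
    name, caret, boost_text = spec.partition("^")
    if not name:
        raise ValueError("field name is missing")
    # A field written without a caret is treated as unboosted (assumption).
    boost = float(boost_text) if caret else 1.0
    if boost < 1.0:
        raise ValueError(f"boost for {name!r} must be >= 1.0, got {boost}")
    return name, boost


print(parse_field("abstract^2"))
print(parse_field("content"))
print(parse_field("title^1.5"))
```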

When should you use a multi_match query or combined_fields query? Main differences

The main difference between the combined_fields query and the multi_match query is that the combined_fields query supports only text fields, and accepts text fields that share the same search analyzer. The multi_match query, on the other hand, supports both text and non-text fields, and it also accepts text fields that don’t share the same search analyzer.

The combined_fields query offers a principled way of matching and scoring across multiple text fields. This requires the same search analyzer across all fields.

The multi_match query may be a better fit if you want a single query that handles fields of various types, such as keywords or integers. It allows both text and non-text fields, as well as text fields with different analyzers.
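The field-type and analyzer criteria above can be turned into a simple decision helper. The Python sketch below is only an illustration of that rule of thumb: choose_query_type is a hypothetical function, the mappings argument is a plain dict shaped like the properties section of an index mapping, and the standard analyzer is assumed when a field names none.

```python
def choose_query_type(mappings: dict, fields: list) -> str:
    """Suggest "combined_fields" or "multi_match" for the given fields."""
    analyzers = set()
    for name in fields:
        mapping = mappings[name]
        # Any non-text field (keyword, integer, ...) rules out combined_fields.
        if mapping.get("type") != "text":
            return "multi_match"
        analyzers.add(
            mapping.get("search_analyzer", mapping.get("analyzer", "standard"))
        )
    # combined_fields needs one shared search analyzer across all fields.
    return "combined_fields" if len(analyzers) == 1 else "multi_match"


mappings = {
    "summary": {"type": "text", "analyzer": "english"},
    "abstract": {"type": "text", "analyzer": "english"},
    "year": {"type": "integer"},
}
print(choose_query_type(mappings, ["summary", "abstract"]))
print(choose_query_type(mappings, ["summary", "year"]))
```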

The second difference between the two is that the best_fields and most_fields modes of multi_match take a field-centric view of the query. The combined_fields query, on the other hand, is term-centric, with operator and minimum_should_match applied per term rather than per field.

The cross_fields mode of multi_match is also term-centric and applies operator and minimum_should_match per term. The key benefit of combined_fields over cross_fields is its robust and interpretable scoring, which is based on the BM25F algorithm.

The combined_fields query has a simpler API than the multi_match query. It omits many parameters, including analyzer, fuzziness and its associated parameters, lenient, slop, and tie_breaker.

Notes and good things to know

In the combined_fields query, the number of fields that may be queried at once is limited. The limit is determined by the indices.query.bool.max_clause_count search setting, which is set to 1024 by default.

The combined_fields query currently supports only the BM25 similarity, which is the default unless a custom similarity is configured. Per-field similarities are also not permitted. In either of these cases, using combined_fields will result in an error.

The main benefit of combined_fields is the robust and understandable scoring algorithm.