Quick links
- Background and overview
- Match Query
- Match Phrase Query
- Multi-Match Query
- Notes and good things to know
- Summary
Background and overview
The match query is the standard query for full-text search, including searches that involve fuzzy matching. The match_phrase query is similar, but it also takes word order and proximity into account. The multi_match query offers a simplified syntax for running match and match_phrase queries against multiple fields.
Match queries work on analyzed text, so results depend on the analyzers that have been defined to break down the text into tokens (usually words). Analyzers are usually defined in the index mappings. For more information on analyzers and text analysis in general, see: https://opster.com/guides/opensearch/opensearch-data-architecture/text-analyzers
Match query
The match query returns documents that match a specific set of searched tokens. The search value can be text, a number, a date, or a boolean value. With its options for fuzzy matching, the match query is the standard query used to perform a full-text search.
The following are two equivalent structures for performing a match query:
GET index/_search
{
  "query": {
    "match": {
      "<field>": "The value that needs to be found in the provided field"
    }
  }
}

GET index/_search
{
  "query": {
    "match": {
      "<field>": {
        "query": "The value that needs to be found in the provided field"
      }
    }
  }
}
<field> represents the field that you want to search.
IMPORTANT: If users need to add any of the following parameters to the query, the second syntax must be used.
The following are match query parameters:
- query: A required parameter holding the text, number, boolean value, or date that users are searching for in the specified field.
- analyzer: An optional parameter of type string that defines the analyzer used to convert the query text into tokens. The default value is the index analyzer mapped for the <field>.
- auto_generate_synonyms_phrase_query: An optional parameter of type boolean. If true, match phrase queries are created automatically for multi-term synonyms. This is only relevant if the index uses a graph token filter. Its default value is true.
- fuzziness: An optional parameter of type string that represents the greatest edit distance permitted for matching. Edit distance measures how similar two strings are: it is the number of single-character edits (insertions, deletions, or substitutions) required to transform one string into the other. A lower edit distance indicates higher similarity. For example, turning “lisence” into “license” requires 2 edits, representing an edit distance of 2. Use a number (0, 1, or 2) or use “AUTO” to let the permitted edit distance vary with the length of the word.
- max_expansions: An optional parameter of type integer representing the maximum number of terms into which the query will be expanded. This is relevant to a fuzzy search, which rewrites the query into a number of terms that depends on the fuzzy parameters. The more terms used, the higher the load on the cluster. Its default value is 50.
- prefix_length: An optional parameter of type integer that represents how many starting characters must be left unaltered for fuzzy matching. Its default value is 0. The logic here is that spelling mistakes don’t usually occur at the beginning of a word, so setting prefix_length makes fuzzy queries more efficient.
- fuzzy_transpositions: An optional parameter of type boolean. If true, edits for fuzzy matching may include transposing two adjacent characters (ab → ba), and each transposition counts as 1 edit instead of 2. Its default value is true.
- fuzzy_rewrite: An optional parameter of type string that represents the method used to rewrite the query. By default, the match query uses a fuzzy_rewrite method of top_terms_blended_freqs_${max_expansions} if the fuzziness parameter is not 0.
- lenient: An optional parameter of type boolean. If true, format-based errors are ignored, such as a text query value supplied for a number field. Its default value is false.
- operator: An optional parameter of type string that represents the boolean logic used to interpret the text in the query value. It has two valid values, OR and AND, where OR is the default value.
- minimum_should_match: An optional parameter of type string that represents the minimum number of clauses that must match for a document to be returned. This should only be used with the “OR” operator. It is more flexible than a simple “and/or,” since users can set rules that depend on the length of the phrase to be matched. For example:
  - 75% (e.g. 3 out of 4 words to be matched, or 6 out of 8)
  - 2 (minimum 2 words to be matched, irrespective of the length of the string)
- zero_terms_query: An optional parameter of type string that indicates whether documents are returned when the analyzer removes all tokens, for example when a stop filter is used. It has two valid values, none and all. With none, the default value, no documents are returned if the analyzer removes all tokens (for example, for the query “the” with an English language analyzer). With all, every document is returned, as with a match_all query.
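The edit distance behind the fuzziness and fuzzy_transpositions parameters can be sketched in a few lines of Python. This is only an illustration of the counting rule described above, not necessarily the algorithm the search engine runs internally. The function names edit_distance and auto_fuzziness are hypothetical, and auto_fuzziness assumes the default “AUTO” thresholds (words of up to 2 characters must match exactly, 3 to 5 characters allow 1 edit, and longer words allow 2).

```python
def edit_distance(a: str, b: str, transpositions: bool = True) -> int:
    """Count the single-character edits needed to turn a into b.

    Insertions, deletions, and substitutions each cost 1. When
    transpositions is True, swapping two adjacent characters also costs 1
    (the "optimal string alignment" flavor of Damerau-Levenshtein).
    """
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = edits needed to turn a[:i] into b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # substitution (or exact match)
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # adjacent swap, e.g. "ab" -> "ba", counted as one edit
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


def auto_fuzziness(term: str) -> int:
    """Edit distance allowed by "AUTO", assuming its default thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


print(edit_distance("lisence", "license"))              # 2
print(edit_distance("ab", "ba"))                        # 1
print(edit_distance("ab", "ba", transpositions=False))  # 2
print(auto_fuzziness("observability"))                  # 2
```

A term is a fuzzy match when its edit distance to the query term does not exceed the configured fuzziness, which is why a fuzziness of 2 tolerates “lisence” for “license”.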
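For example, the following query searches the title field of the books index using the default “or” operator:
GET books/_search
{
  "query": {
    "match": {
      "title": {
        "query": "OpenSearch Observability",
        "operator": "or"
      }
    }
  }
}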
The above query will match documents whose title contains either “OpenSearch” or “Observability” (or both).
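Between plain “or” and plain “and”, the minimum_should_match parameter sets a threshold on how many query terms must be found. The sketch below shows how a value such as 75% or 2 could translate into a required number of terms. It is a simplification: the function name required_matches is hypothetical, only positive integers and positive percentages are handled, percentages are assumed to round down, and the result is capped at the number of terms in the query.

```python
def required_matches(minimum_should_match: str, term_count: int) -> int:
    """Translate a positive integer or percentage into a required term count."""
    spec = minimum_should_match.strip()
    if spec.endswith("%"):
        # Assumption: percentages round down to a whole number of terms.
        needed = term_count * int(spec[:-1]) // 100
    else:
        needed = int(spec)
    # Assumption: never require more terms than the query contains.
    return min(needed, term_count)


print(required_matches("75%", 4))  # 3
print(required_matches("75%", 8))  # 6
print(required_matches("2", 5))    # 2
```

With a four-word query and a value of 75%, a document that contains any three of the four words qualifies, which sits between the “or” behavior (one word is enough) and the “and” behavior (all four are required).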
To increase the precision of the previous query, change the “or” logic of the match query to “and” using the “operator” parameter. This way, only the documents with both words, “OpenSearch” and “Observability”, in the “title” field will be a hit.
The following example uses the “and” operator and also adds the fuzziness parameter, permitting a maximum of 2 edits per word. Fuzziness increases the number of returned hits and changes the relevance of the results.
GET books/_search
{
  "query": {
    "match": {
      "title": {
        "query": "OpenSearch Observability",
        "fuzziness": 2,
        "operator": "and"
      }
    }
  }
}
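Using the above query, the following titles will match:
- OpenSearch observability
- observability based on OpenSearch
- observability of OpenSearch
Word order and intervening words do not matter to a match query. Because of the fuzzy search, each word in the title may also differ from the query term by up to 2 edits and still match (for example, a misspelling such as “OpenSaerch”).
A title containing just “OpenSearch” on its own will not match (because of the and operator).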
Match phrase query
The match_phrase query takes word proximity into account, requiring the words to be found within a configurable distance known as the “slop”. The example below is a match_phrase query that searches the books index for books with the “OpenSearch Observability” phrase in the title.
GET books/_search
{
  "query": {
    "match_phrase": {
      "title": "OpenSearch Observability"
    }
  }
}
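By default, a phrase query matches the words only when they appear next to each other and in the same order. The slop parameter (the default value is 0) sets how many positions the words may be moved to line up with the phrase. Words in reversed order require a slop of at least 2. For instance: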
GET books/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "OpenSearch Observability",
        "slop": 1
      }
    }
  }
}
The above will match:
- OpenSearch observability
- OpenSearch system observability (because we configured a slop of 1)
Will not match:
- OpenSearch system downtime observability (the two intervening words would require a slop of 2)
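The slop behavior in these examples can be approximated with a small Python model. This is a simplified sketch rather than the engine's actual phrase matcher: the function name phrase_matches is hypothetical, lowercasing and whitespace splitting stand in for the real analyzer, and repeated query terms are not treated specially.

```python
from itertools import product


def phrase_matches(query: str, text: str, slop: int = 0) -> bool:
    """Simplified slop check on lowercased, whitespace-split tokens."""
    query_terms = query.lower().split()
    doc_terms = text.lower().split()
    # All positions at which each query term occurs in the document.
    positions = [
        [pos for pos, tok in enumerate(doc_terms) if tok == term]
        for term in query_terms
    ]
    if not query_terms or any(not found for found in positions):
        return False  # empty query, or a query term is missing entirely
    for combo in product(*positions):
        # Shift each position back by the term's index in the query. For an
        # exact phrase all shifted values are equal, so their spread measures
        # how far the terms had to move to form the phrase.
        shifted = [pos - i for i, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


query = "OpenSearch Observability"
print(phrase_matches(query, "OpenSearch system observability", slop=1))           # True
print(phrase_matches(query, "OpenSearch system downtime observability", slop=1))  # False
print(phrase_matches(query, "observability OpenSearch", slop=2))                  # True
```

In this model each intervening word costs one position of slop, and swapping two adjacent words costs two, which is consistent with the examples listed above.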
Like the match query, the match_phrase query also supports the zero_terms_query parameter.
Fuzzy search cannot be used with phrase matches.
Multi-match query
The multi_match query builds on the match query to allow multi-field queries. A book can discuss OpenSearch observability in great detail without having “OpenSearch observability” in the title, so searching only one field is not the best approach for this example. In the example below, a multi_match query searches both the title and abstract fields for “OpenSearch observability.” The multi_match query delivers many more hits than either the match or the match_phrase query on the title field alone, while the top hits maintain their relevance.
GET books/_search
{
  "query": {
    "multi_match": {
      "query": "OpenSearch Observability",
      "fields": [ "title", "abstract" ]
    }
  }
}
Wildcards can be used to specify fields. In the example below, the book_title, first_name, and last_name fields will be queried.
GET authors/_search
{
  "query": {
    "multi_match": {
      "query": "Elie Smith",
      "fields": [ "book_title", "*_name" ]
    }
  }
}
The caret (^) notation can be used to boost scores of specific fields. In the following example, the title field receives a boost of 2.
GET books/_search
{
  "query": {
    "multi_match": {
      "query": "OpenSearch Observability",
      "fields": [ "title^2", "abstract" ]
    }
  }
}
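To make the wildcard and caret notations concrete, the following Python sketch resolves a fields list against a set of mapped field names. It only illustrates the idea: resolve_fields is a hypothetical helper, the list of mapped fields is made up for the example, and the standard library's fnmatchcase is assumed to be a close enough stand-in for the engine's own wildcard matching.

```python
from fnmatch import fnmatchcase


def resolve_fields(field_specs, mapped_fields):
    """Expand wildcards and caret boosts into a {field_name: boost} dict."""
    resolved = {}
    for spec in field_specs:
        # "title^2" -> pattern "title", boost 2.0; no caret means boost 1.0
        pattern, _, boost = spec.partition("^")
        weight = float(boost) if boost else 1.0
        for name in mapped_fields:
            if fnmatchcase(name, pattern):
                resolved[name] = weight
    return resolved


# Hypothetical field names, used only for this illustration.
author_fields = ["book_title", "first_name", "last_name", "age"]
print(resolve_fields(["book_title", "*_name"], author_fields))
# {'book_title': 1.0, 'first_name': 1.0, 'last_name': 1.0}

print(resolve_fields(["title^2", "abstract"], ["title", "abstract"]))
# {'title': 2.0, 'abstract': 1.0}
```

The boost acts as a multiplier on the score contributed by that field, so a match in title counts twice as much as the same match in abstract.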
If no fields are specified, the multi_match query defaults to the index.query.default_field index setting, which in turn defaults to *. The * wildcard extracts all fields in the mapping that are eligible for term queries and filters out the metadata fields. A query is then created by combining all the extracted fields.
The type parameter, which has the following options, determines how the multi_match query is internally executed:
– best_fields: The default type. It finds documents that match any field, but uses the _score from the highest scoring field. The best_fields type is most helpful when looking for multiple words that are best found in the same field. As shown in the example below, “OpenSearch Observability” in a single field is more relevant than “OpenSearch” in one field and “Observability” in the other field.
GET books/_search
{
  "query": {
    "multi_match": {
      "query": "OpenSearch Observability",
      "type": "best_fields",
      "fields": [ "title", "abstract" ],
      "tie_breaker": 0.3
    }
  }
}
The best_fields type accepts the following parameters, which are explained in the match query section: analyzer, boost, operator, minimum_should_match, fuzziness, lenient, prefix_length, max_expansions, fuzzy_rewrite, zero_terms_query, auto_generate_synonyms_phrase_query, and fuzzy_transpositions.
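The tie_breaker value in the best_fields example controls how much the other matching fields contribute. Assuming the usual dis_max-style combination (the best field's score plus tie_breaker times the scores of the other matching fields), the arithmetic can be sketched as follows. The function name best_fields_score and the sample scores are hypothetical.

```python
def best_fields_score(field_scores, tie_breaker=0.0):
    """Combine per-field scores the way a dis_max-style query would."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    others = sum(field_scores) - best
    # Only the best field counts in full; the rest are scaled by tie_breaker.
    return best + tie_breaker * others


# Hypothetical scores for the title and abstract fields of one document.
scores = [2.0, 1.0]
print(best_fields_score(scores))                   # 2.0
print(best_fields_score(scores, tie_breaker=1.0))  # 3.0
```

With tie_breaker set to 0 only the best field counts, while a value of 1.0 simply adds all field scores together. A value such as 0.3 keeps the best field dominant but still rewards documents that match in more than one field.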
– most_fields: Finds documents that match any field and combines the _score from each field. The most_fields type is most helpful when searching through several fields that contain the same text analyzed in different ways.
As shown in the example below, the primary field might include stemming, synonyms, and terms without diacritics, a second field might hold the original terms, and a third field might hold shingles. Combining the scores from all three fields lets the primary field match as many documents as possible, while the second and third fields push the most similar results to the top of the list.
GET books/_search
{
  "query": {
    "multi_match": {
      "query": "data mining",
      "type": "most_fields",
      "fields": [ "title", "title.shingles", "title.original" ]
    }
  }
}
The total score from all of the match clauses is divided by the total number of match clauses. The following parameters, described previously, are accepted in the most_fields type: analyzer, boost, operator, minimum_should_match, fuzziness, lenient, prefix_length, max_expansions, fuzzy_rewrite, and zero_terms_query.
– cross_fields: Treats all fields with the same analyzer as if they were one big field, and looks for each word in any of those fields. The cross_fields type is particularly helpful for structured documents where several fields need to match. For example, the best match when searching for “Elie Smith” likely includes “Elie” in the first_name field and “Smith” in the last_name field.
GET authors/_search
{
  "query": {
    "multi_match": {
      "query": "Elie Smith",
      "type": "cross_fields",
      "fields": [ "first_name", "last_name" ],
      "operator": "and"
    }
  }
}
The following parameters, described previously, are accepted in the cross_fields type: analyzer, boost, operator, minimum_should_match, lenient, and zero_terms_query.
– phrase: Performs a match_phrase query on each field and uses the _score from the best field.
– phrase_prefix: Performs a match_phrase_prefix query on each field and uses the _score from the best field. The phrase and phrase_prefix types function exactly like best_fields, except that they use a match_phrase or match_phrase_prefix query in place of a match query.
GET books/_search
{
  "query": {
    "multi_match": {
      "query": "OpenSearch Observability",
      "type": "phrase_prefix",
      "fields": [ "title", "abstract" ]
    }
  }
}
The following parameters, described previously, are accepted in the phrase and phrase_prefix types: analyzer, boost, lenient, zero_terms_query, and slop. Also, max_expansions is accepted by the phrase_prefix type.
– bool_prefix: Performs a match_bool_prefix query on each field and combines the _score from each field. The scoring of the bool_prefix type behaves similarly to that of most_fields, but it uses a match_bool_prefix query instead of a match query.
GET books/_search
{
  "query": {
    "multi_match": {
      "query": "OpenSearch Observability",
      "type": "bool_prefix",
      "fields": [ "title", "abstract" ]
    }
  }
}
The following parameters are supported: analyzer, boost, operator, minimum_should_match, lenient, zero_terms_query, and auto_generate_synonyms_phrase_query.
For terms used to build term queries, the fuzziness, prefix_length, max_expansions, fuzzy_rewrite, and fuzzy_transpositions parameters are supported; however, they have no impact on the prefix query built from the final term. In addition, this query type does not support the slop parameter.
As with the match_phrase query, fuzzy search cannot be used with the phrase and phrase_prefix types.
Notes and good things to know
- Terms with synonyms and situations in which the analysis process generates multiple tokens at the same spot are not subject to fuzzy matching. These terms are expanded internally to special synonym queries that combine term frequencies and that do not support fuzzy expansion.
- By default, a query can contain only a limited number of clauses. The indices.query.bool.max_clause_count setting, with a default value of 1024 in OpenSearch, establishes this restriction. The number of clauses for multi-match queries is determined by multiplying the number of fields by the number of terms.
- In the multi-match query, the best_fields and most_fields types produce a match query for each field, making them field-centric. As a result, the minimum_should_match and the operator parameters will be applied to each field independently, which is likely not what you want. A term-centric method is provided by the combined_fields query, which treats operator and minimum_should_match on a per-term basis. The cross_fields type of the multi-match query also addresses this problem.
- In the multi-match query, users can’t use the fuzziness parameter with the phrase, phrase_prefix, and the cross_fields types.
- The cross_fields type combines field statistics in a way that occasionally results in scores that are not well-formed (for example, scores can become negative). The combined_fields query, which is also term-centric but more robustly combines field statistics, is an alternative worth considering.
- Keep in mind that cross_fields type in the multi-match query is typically only beneficial for short string fields with boosts of 1. Otherwise, the score is affected by boosts, term frequencies, and length normalization in a way that makes the combination of term statistics meaningless.
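The clause-count rule from the notes above (number of fields multiplied by number of terms) is easy to check before sending a query. The sketch below is a rough estimate only: both function names are hypothetical, whitespace splitting stands in for the real analyzer, and the limit is passed in explicitly rather than read from the cluster.

```python
def estimated_clause_count(fields, query_text):
    """Estimate multi_match clauses as number of fields times number of terms."""
    terms = query_text.split()  # assumption: one term per whitespace-separated word
    return len(fields) * len(terms)


def exceeds_clause_limit(fields, query_text, limit):
    """Return True if the estimate is above the configured clause limit."""
    return estimated_clause_count(fields, query_text) > limit


fields = ["title", "abstract"]
print(estimated_clause_count(fields, "OpenSearch Observability"))         # 4
print(exceeds_clause_limit(fields, "OpenSearch Observability", limit=3))  # True
```

Wildcard field patterns can expand to many fields, and fuzzy matching expands each term further, so the real number of clauses can be considerably higher than this estimate.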
Summary
“Match”, “Multi-Match”, and “Match Phrase” are all types of query in OpenSearch, used to search for matching documents in an index.
- Match Query: The match query is used to search for documents that contain one or more specific terms. It allows for partial matches and will automatically apply some basic text analysis to the query string to generate the set of terms to search.
- Multi-Match Query: The multi-match query is used to search for documents that contain one or more specific terms in multiple fields. This query is useful when you have several fields that you want to search and you want to use the same query for each field.
- Match Phrase Query: The match phrase query is used to search for documents that contain an exact phrase. It matches the complete phrase, including the order of terms, as it appears in the document. Unlike the match query, the match phrase query does not allow for partial matches.
In terms of performance, the match query is generally faster than the match phrase query, because it only needs to look up the terms in the inverted index, whereas the match phrase query must also check the positions of the terms. The multi-match query is typically slower than the match query, as it needs to search multiple fields, but it can provide more comprehensive results.