Quick links
- Introduction
- Creating a Custom Analyzer with Shingles
- Indexing Documents with Shingles
- Searching with Shingles
- Conclusion
Elasticsearch Shingles Example: Boosting Relevance with N-Grams
Introduction
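Shingles, also known as word N-grams, are a useful technique for improving the relevance of search results in Elasticsearch. A shingle filter breaks text into overlapping groups of adjacent words and indexes each group as a single term, so a query can match on word sequences rather than only on individual words. This makes phrase matching more accurate and lets documents that preserve the query's word order rank higher.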
In this article, we will explore how to use shingles in Elasticsearch with a practical example.
Creating a Custom Analyzer with Shingles
To use shingles, we need to create a custom analyzer that includes a shingle token filter. The following example demonstrates how to create an index with a custom analyzer that generates 2-word shingles:
PUT /shingle_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "shingle_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "shingle_filter"]
        }
      },
      "filter": {
        "shingle_filter": {
          "type": "shingle",
          "min_shingle_size": 2,
          "max_shingle_size": 2,
          "output_unigrams": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "shingle_analyzer"
      }
    }
  }
}
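In this example, we define a custom analyzer called “shingle_analyzer” that uses the standard tokenizer, followed by a lowercase filter and a custom shingle filter. Setting both min_shingle_size and max_shingle_size to 2 makes the filter generate 2-word shingles only, and setting output_unigrams to false suppresses the single-word tokens that the filter would otherwise emit alongside the shingles. The “text” field is mapped to use this analyzer.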
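To make the effect of these settings concrete, here is a minimal Python sketch that imitates the analyzer chain. It is not Elasticsearch's implementation: the function name shingle_analyzer and its parameters are hypothetical, and the regular expression is only a simplified stand-in for the standard tokenizer.

```python
import re


def shingle_analyzer(text, min_size=2, max_size=2, output_unigrams=False):
    """Rough imitation of the chain: tokenize, lowercase, then shingle."""
    # Simplified stand-in for the standard tokenizer: runs of word characters.
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            # Emit the single word before the shingles that start at it.
            out.append(tokens[i])
        for size in range(min_size, max_size + 1):
            # Only emit a shingle if enough words remain to fill it.
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out


print(shingle_analyzer("The quick brown fox"))
# ['the quick', 'quick brown', 'brown fox']

# Wider shingles plus unigrams, for comparison with the settings above.
print(shingle_analyzer("The quick brown fox", max_size=3, output_unigrams=True))
```

Note that in this sketch a single-word input produces no tokens at all when output_unigrams is false, which is worth keeping in mind when choosing shingle settings.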
Indexing Documents with Shingles
Now that we have created an index with a custom shingle analyzer, let’s index some sample documents:
POST /shingle_example/_doc
{
  "text": "The quick brown fox jumps over the lazy dog"
}

POST /shingle_example/_doc
{
  "text": "The quick brown dog jumps over the lazy fox"
}
To see the shingles that the analyzer generates for a piece of text, run the following request:
POST /shingle_example/_analyze
{
  "analyzer": "shingle_analyzer",
  "text": "The quick brown fox jumps over the lazy dog"
}
Searching with Shingles
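To search for documents using shingles, we can use the match query. The query text is run through the same shingle analyzer as the indexed text. Naming the analyzer in the query is optional here, because the field's own analyzer is used at search time by default, but it makes the intent explicit: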
GET /shingle_example/_search
{
  "query": {
    "match": {
      "text": {
        "query": "quick brown fox",
        "analyzer": "shingle_analyzer"
      }
    }
  }
}
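The search results rank the first document above the second. The query text is analyzed into the shingles “quick brown” and “brown fox”. The first document contains both, while the second contains only “quick brown” (its next shingle is “brown dog”). Because the match query combines terms with OR by default, the second document is still returned, just with a lower score. Without shingles, the two documents would typically receive the same relevance score, as they contain the same individual words.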
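The ranking difference can be illustrated with a small Python sketch that counts which query shingles each sample document contains. This only models shingle overlap under simplifying assumptions (whitespace tokenization, a hypothetical bigrams helper); it does not reproduce Elasticsearch's actual scoring.

```python
def bigrams(text):
    """Return the set of lowercase 2-word shingles in the text."""
    words = text.lower().split()
    return {" ".join(pair) for pair in zip(words, words[1:])}


docs = {
    "doc1": "The quick brown fox jumps over the lazy dog",
    "doc2": "The quick brown dog jumps over the lazy fox",
}
query = bigrams("quick brown fox")

# Which query shingles does each document contain?
matches = {name: sorted(query & bigrams(text)) for name, text in docs.items()}
for name, matched in matches.items():
    print(name, matched)
# doc1 ['brown fox', 'quick brown']
# doc2 ['quick brown']

# At the single-word level the two documents are indistinguishable.
same_words = set(docs["doc1"].lower().split()) == set(docs["doc2"].lower().split())
print(same_words)
# True
```

The first document matches both query shingles and the second matches one, which is consistent with the first ranking higher, while the word sets alone give no basis for telling the two apart.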
Conclusion
Using shingles in Elasticsearch can help improve the relevance of search results by considering the order of words and matching phrases more accurately. By creating a custom analyzer with a shingle token filter, you can easily implement this technique in your Elasticsearch setup. Experiment with different shingle sizes and configurations to find the best approach for your specific use case.