What is k-NN?
k-NN stands for k-nearest neighbors and is used to find nearby documents based on the distance between their vector representations.
This strategy is widely used for recommendations: users are classified and grouped based on many parameters (vector dimensions), and when a new user arrives you can use k-NN to compare their vector with the rest of the users' and give relevant recommendations.
Another interesting use case is “word embedding”, or converting words to vectors. In this scenario, you want a search for “cat” to find things like “pet supplies”, “kitty sand”, and “whiskers”, as opposed to “car” or “catapult”.
To achieve this, a model is trained to assign a vector to each word based on context. Then, using k-NN, we can tell that “cat” and “pet supplies” are neighbors even though they are completely different words.
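For example, if the model assigned “cat” the vector [2, 3], “pet supplies” the vector [3, 4], and “car” the vector [9, 1] (made-up values, for illustration only), the Euclidean distance from “cat” to “pet supplies” would be √((3-2)² + (4-3)²) = √2 ≈ 1.41, while the distance to “car” would be √((9-2)² + (1-3)²) = √53 ≈ 7.28, so “pet supplies” is by far the closer neighbor.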
The OpenSearch k-NN plugin
OpenSearch comes to the rescue with a plugin that handles this distance search. The only thing you need to do is ingest your documents with their vectors as a k-NN field type.
Installation
The full distribution of OpenSearch bundles the plugin, so you don’t have to install it separately.
If you are not sure, you can list your installed plugins and look for opensearch-knn:
sudo bin/opensearch-plugin list
If you don’t see opensearch-knn in the list, you can easily install it:
sudo bin/opensearch-plugin install opensearch-knn
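If you don’t have shell access to the nodes (for example, on a managed cluster), one way to check is to list the installed plugins through the REST API and look for the same entry:
GET _cat/plugins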
Usage
The k-NN plugin supports three methods to obtain the neighbors. Below is a summary of these methods; if you’d like a more in-depth explanation, see the official docs.
Name | Method | Features | Use case |
---|---|---|---|
Approximate k-NN | HNSW (Hierarchical Navigable Small World) algorithm | Low latency, scalable | Large indices (many vectors), no pre-filters |
Script Score k-NN | Brute force | Supports binary-encoded fields | Smaller indices or pre-filtered queries |
Painless extensions | Custom | Exposes distance functions for custom scoring | More complex scenarios |
You should use Approximate k-NN for large indices with no pre-filter queries, Script Score k-NN for smaller indices or pre-filtered queries, and Painless extensions when you need a custom distance function to calculate the score.
k-NN Mapping
The plugin introduces a new field type called “knn_vector” that has many configuration options; you can check them out in the official docs.
A knn_vector field type looks like this:
{ "my_vector_field": { "type": "knn_vector", "dimension": 4, "method": { "name": "hnsw", "space_type": "l2", "engine": "nmslib", "parameters": { "ef_construction": 128, "m": 24 } } } }
The most relevant properties are:
- dimension: The number of dimensions your vectors will have.
- space_type: The distance function used to calculate the distance between vectors. The available options are listed in the official docs.
- ef_construction: Size of the dynamic list used for k-NN. The higher the value, the more precision you’ll have, but the slower indexing will be.
- m: Number of bidirectional links the plugin will create. This impacts memory usage; keep it between 2 and 100.
Approximate k-NN
Create a k-nn index
Let’s ingest some documents to search across using the Approximate k-NN method, reusing the word embedding example. In a real scenario the vector values would be generated by a word embedding library; the ones in this example are made up.
PUT test-knn
{
  "settings": {
    "index": {
      "knn": true,
      "knn.algo_param.ef_search": 100
    }
  },
  "mappings": {
    "properties": {
      "product_vector": {
        "type": "knn_vector",
        "dimension": 4,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}
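The knn.algo_param.ef_search setting is the query-time counterpart of ef_construction: it is the size of the dynamic list used during the search, where higher values give more accurate results at the cost of slower queries.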
Now let’s ingest some example documents.
POST _bulk
{ "index": { "_index": "test-knn", "_id": "1" } }
{ "product_vector": [1, 5, 5, 4], "price": 10.00, "name": "Hygienic sand" }
{ "index": { "_index": "test-knn", "_id": "2" } }
{ "product_vector": [5, 4, 4, 4], "price": 25.00, "name": "Pet supplies pack" }
{ "index": { "_index": "test-knn", "_id": "3" } }
{ "product_vector": [7, 9, 9, 9], "price": 500, "name": "Catapult" }
{ "index": { "_index": "test-knn", "_id": "4" } }
{ "product_vector": [1, 1, 2, 1], "price": 5, "name": "Hot Wheels Car" }
Each element in the vector array is a dimension that represents a feature of the document. Machine learning models can calculate thousands of different dimensions; the dimension property in the mappings supports up to 10,000.
Querying
Let’s say the vector representation for “cat” after training our model is: [2,3,5,6].
GET test-knn/_search
{
  "size": 2,
  "query": {
    "knn": {
      "product_vector": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  }
}
- size: The number of results to be returned.
- k: The number of neighbors each graph will return.
- vector: The vector representation of the point we are searching for.
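Note that k applies per graph (one graph is built per segment), so the engine may collect more than k candidates overall; size is what ultimately limits how many documents come back. In practice you’ll usually set size and k to the same value, as in this example.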
The response might look like this:
{ "took" : 2, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 2, "relation" : "eq" }, "max_score" : 0.1, "hits" : [ { "_index" : "test-knn", "_type" : "_doc", "_id" : "1", "_score" : 0.1, "_source" : { "product_vector" : [ 1, 5, 5, 4 ], "price" : 10.0, "name" : "Hygienic sand" } }, { "_index" : "test-knn", "_type" : "_doc", "_id" : "2", "_score" : 0.0625, "_source" : { "product_vector" : [ 5, 4, 4, 4 ], "price" : 25.0, "name" : "Pet supplies pack" } } ] } }
See how “Catapult” and “Hot Wheels Car” are not returned, while “Hygienic sand” and “Pet supplies pack” are, ranked from most to least relevant.
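If you are wondering where those scores come from: with the l2 space type, the plugin scores each hit as 1 / (1 + l2²). For “Hygienic sand”, l2² = (2 - 1)² + (3 - 5)² + (5 - 5)² + (6 - 4)² = 9, which gives 1 / 10 = 0.1, matching the max_score in the response above.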
The interesting thing is that the criteria used to define these distances can vary. Word embedding models are trained on tons of text like books or newspapers, but dimensions can also represent things like movie category preferences, ratings, a person’s age, city, movie duration, and so on.
Usage with filters
As we said, Approximate k-NN cannot be pre-filtered, so if, for example, we want to filter the products by price after running the k-NN search, we can use a post_filter clause.
GET test-knn/_search
{
  "size": 2,
  "query": {
    "knn": {
      "product_vector": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  },
  "post_filter": {
    "range": {
      "price": {
        "lte": 20
      }
    }
  }
}
With this query the Pet supplies pack will not be returned because of its price.
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 0.1, "hits" : [ { "_index" : "test-knn", "_type" : "_doc", "_id" : "1", "_score" : 0.1, "_source" : { "product_vector" : [ 1, 5, 5, 4 ], "price" : 10.0, "name" : "Hygienic sand" } } ] } }
Script Score k-NN
With this approach we can apply a filter before executing the nearest neighbor search. For this reason it does not scale as well, but it is useful for dynamic search cases where the set of documents to search over may vary based on other conditions.
Mappings
PUT my-knn-index-1
{
  "mappings": {
    "properties": {
      "product_vector": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}
As you can see, this is a shorter version of the approximate search mapping, since many of the configuration options are exclusive to that approach.
If you want to use this index for both script score and Approximate k-NN, you have to set index.knn: true and choose an index.knn.space_type.
If you only want to use the script score capabilities, it’s recommended to leave index.knn set to false. You will benefit from lower memory usage and faster indexing, but you will not be able to perform standard k-NN queries against the index.
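As a sketch, and assuming the older-style index settings used by this version of the plugin, a hypothetical index that keeps both options open could look like this (my-knn-index-2 and the cosinesimil space type are illustrative choices):
PUT my-knn-index-2
{
  "settings": {
    "index.knn": true,
    "index.knn.space_type": "cosinesimil"
  },
  "mappings": {
    "properties": {
      "product_vector": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}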
Query
GET my-knn-index-1/_search
{
  "size": 4,
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "knn_score",
        "lang": "knn",
        "params": {
          "field": "product_vector",
          "query_value": [2.0, 3.0, 5.0, 6.0],
          "space_type": "cosinesimil"
        }
      }
    }
  }
}
All parameters are required.
The most relevant are:
- lang: knn instead of painless.
- source: the name of the script.
- query_value: The point you want to find neighbors for, represented as an array of floats whose length must match the dimension defined in the mappings.
- space_type: The function used to calculate distance between points.
Binary data in Script Score
Script Score k-NN also supports passing query_value as binary data instead of an array. In this case you have to use the Hamming distance space type (hammingbit) and encode the value as a base64 string. You can also represent the value as type long.
Create the index
PUT my-index
{
  "mappings": {
    "properties": {
      "my_binary": {
        "type": "binary",
        "doc_values": true
      },
      "color": {
        "type": "keyword"
      }
    }
  }
}
Index documents
POST _bulk
{ "index": { "_index": "my-index", "_id": "1" } }
{ "my_binary": "SGVsbG8gV29ybGQh", "color": "RED" }
{ "index": { "_index": "my-index", "_id": "2" } }
{ "my_binary": "ay1OTiBjdXN0b20gc2NvcmluZyE=", "color": "RED" }
Run a query
GET my-index/_search
{
  "size": 2,
  "query": {
    "script_score": {
      "query": {
        "bool": {
          "filter": {
            "term": {
              "color": "RED"
            }
          }
        }
      },
      "script": {
        "lang": "knn",
        "source": "knn_score",
        "params": {
          "field": "my_binary",
          "query_value": "U29tZXRoaW5nIEltIGxvb2tpbmcgZm9y",
          "space_type": "hammingbit"
        }
      }
    }
  }
}
Note that you are looking for RED neighbors only, so documents with any other color will be ignored.
Painless Script
This method allows you to manipulate the distance function’s result and the final score. You can choose from a list of available functions and, for example, multiply the score, or involve other fields of the document in the equation. All of the functions take the query vector and the document field as parameters.
See the example below.
GET my-knn-index-1/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "1.0 + cosineSimilarity(params.query_value, doc[params.field])",
        "params": {
          "field": "product_vector",
          "query_value": [7.9, 2.9, 4.9, 4.9]
        }
      }
    }
  }
}
This script executes a cosine similarity function to calculate the score and then adds 1.0 to the total. Adding 1.0 matters because cosine similarity ranges from -1 to 1, and document scores cannot be negative (see the constraints below).
Constraints
- You can choose between the l2Squared, l1Norm, and cosineSimilarity distance functions (see the official docs for more information).
- The query_value dimension must match the one defined in the mapping or the query will fail.
- If a document does not have a vector value, the query will also fail. You can avoid this by checking for the value before running the distance function, as shown in the sketch after this list: "source": "doc[params.field].size() == 0 ? 0 : 1 / (1 + l2Squared(params.query_value, doc[params.field]))".
- Scores can only be positive, so with a guard like the one above, documents that do have a vector field will always score higher than those that don’t.
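Putting that guard to work, here is a sketch of a complete query against the my-knn-index-1 index from earlier (the match_all query and the [2.0, 3.0, 5.0, 6.0] query vector are illustrative choices); documents without a product_vector simply score 0 instead of breaking the query:
GET my-knn-index-1/_search
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "doc[params.field].size() == 0 ? 0 : 1 / (1 + l2Squared(params.query_value, doc[params.field]))",
        "params": {
          "field": "product_vector",
          "query_value": [2.0, 3.0, 5.0, 6.0]
        }
      }
    }
  }
}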