The Open Distro project is archived. Open Distro development has moved to OpenSearch. The Open Distro plugins will continue to work with legacy versions of Elasticsearch OSS, but we recommend upgrading to OpenSearch to take advantage of the latest features and improvements.
Approximate k-NN Search
The approximate k-NN method uses nmslib’s implementation of the HNSW algorithm to power k-NN search. Here, approximate means that, for a given search, the neighbors returned are an estimate of the true k-nearest neighbors. Of the plugin’s three search methods, this one scales best for large data sets. As a general rule, prefer this approach once the data set reaches hundreds of thousands of vectors.
During indexing, the plugin builds an HNSW graph of the vectors for each “knn-vector field”/“Lucene segment” pair. At search time, it uses these graphs to efficiently find the k-nearest neighbors to a query vector. To learn more about Lucene segments, refer to Apache Lucene’s documentation. The graphs are loaded into native memory during search and managed by a cache. To learn more about pre-loading graphs into memory, refer to the warmup API. To see which graphs are already loaded in memory, refer to the stats API section.
Because the graphs are constructed during indexing, you cannot apply a filter to an index first and then run this search method on the filtered subset. All filters are applied to the results that the approximate nearest neighbor search produces.
Get started with approximate k-NN
To use the k-NN plugin’s approximate search functionality, you must first create a k-NN index with the index.knn setting set to true. This setting tells the plugin to create HNSW graphs for the index.
Additionally, if you are using the approximate k-nearest neighbor method, you should set index.knn.space_type to the space that you are interested in. This setting cannot be changed after it is set. To see which spaces are supported, refer to the spaces section. By default, index.knn.space_type is l2. For more information on index settings, such as algorithm parameters that you can adjust to tune performance, refer to the documentation.
Next, you must add one or more fields of the knn_vector data type. Here is an example that creates an index with two knn_vector fields and uses cosine similarity:
PUT my-knn-index-1
{
  "settings": {
    "index": {
      "knn": true,
      "knn.space_type": "cosinesimil"
    }
  },
  "mappings": {
    "properties": {
      "my_vector1": {
        "type": "knn_vector",
        "dimension": 2
      },
      "my_vector2": {
        "type": "knn_vector",
        "dimension": 4
      }
    }
  }
}
The knn_vector data type supports a vector of floats with up to 10,000 dimensions, as set by the dimension mapping parameter.
In Elasticsearch, codecs handle the storage and retrieval of indices. The k-NN plugin uses a custom codec to write vector data to graphs so that the underlying k-NN search library can read it.
After you create the index, you can add some data to it:
POST _bulk
{ "index": { "_index": "my-knn-index-1", "_id": "1" } }
{ "my_vector1": [1.5, 2.5], "price": 12.2 }
{ "index": { "_index": "my-knn-index-1", "_id": "2" } }
{ "my_vector1": [2.5, 3.5], "price": 7.1 }
{ "index": { "_index": "my-knn-index-1", "_id": "3" } }
{ "my_vector1": [3.5, 4.5], "price": 12.9 }
{ "index": { "_index": "my-knn-index-1", "_id": "4" } }
{ "my_vector1": [5.5, 6.5], "price": 1.2 }
{ "index": { "_index": "my-knn-index-1", "_id": "5" } }
{ "my_vector1": [4.5, 5.5], "price": 3.7 }
{ "index": { "_index": "my-knn-index-1", "_id": "6" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 10.3 }
{ "index": { "_index": "my-knn-index-1", "_id": "7" } }
{ "my_vector2": [2.5, 3.5, 5.6, 6.7], "price": 5.5 }
{ "index": { "_index": "my-knn-index-1", "_id": "8" } }
{ "my_vector2": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
{ "index": { "_index": "my-knn-index-1", "_id": "9" } }
{ "my_vector2": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }
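If you generate the bulk request programmatically, the body has to be newline-delimited JSON: one action line followed by one source line per document, with a newline after the last line. The following is a minimal sketch using only Python's standard library. The helper name build_bulk_body and the tuple layout of DOCS are illustrative assumptions, not part of the plugin, and sending the body to the cluster is left out.

```python
import json

# Documents mirror the bulk example above: (id, field name, vector, price).
DOCS = [
    ("1", "my_vector1", [1.5, 2.5], 12.2),
    ("2", "my_vector1", [2.5, 3.5], 7.1),
    ("3", "my_vector1", [3.5, 4.5], 12.9),
    ("4", "my_vector1", [5.5, 6.5], 1.2),
    ("5", "my_vector1", [4.5, 5.5], 3.7),
    ("6", "my_vector2", [1.5, 5.5, 4.5, 6.4], 10.3),
    ("7", "my_vector2", [2.5, 3.5, 5.6, 6.7], 5.5),
    ("8", "my_vector2", [4.5, 5.5, 6.7, 3.7], 4.4),
    ("9", "my_vector2", [1.5, 5.5, 4.5, 6.4], 8.9),
]


def build_bulk_body(index_name, docs):
    """Return an NDJSON string: one action line plus one source line per document."""
    lines = []
    for doc_id, field, vector, price in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps({field: vector, "price": price}))
    # Every line of a bulk body, including the last one, must end with a newline.
    return "\n".join(lines) + "\n"


body = build_bulk_body("my-knn-index-1", DOCS)
```

Keeping each JSON object on a single line matters here, because the bulk format uses line breaks as the only delimiter between actions and documents.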
Then you can execute an approximate nearest neighbor search on the data using the knn query type:
GET my-knn-index-1/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector2": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  }
}
k is the number of neighbors that the search of each graph returns. You must also include the size option, which indicates how many results the query actually returns. The plugin returns k results for each shard (and each segment) and size results for the entire query. The plugin supports a maximum k value of 10,000.
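The relationship between k and size can be pictured with a simplified model: each segment contributes its own k best candidates, and the query keeps the best size of them overall. The sketch below is only an illustration of that idea under stated assumptions. The function names, the two segments, and the precomputed distances are all hypothetical, and the real shard-level merge in Elasticsearch is more involved.

```python
import heapq


def search_segment(segment, k):
    """Return the k (distance, doc_id) pairs with the smallest distance in one segment."""
    return heapq.nsmallest(k, segment)


def merge_results(segments, k, size):
    """Collect k candidates per segment, then keep the best `size` overall."""
    candidates = []
    for segment in segments:
        candidates.extend(search_segment(segment, k))
    return heapq.nsmallest(size, candidates)


# Hypothetical precomputed distances for two segments of one index.
segment_a = [(0.10, "a1"), (0.40, "a2"), (0.90, "a3")]
segment_b = [(0.20, "b1"), (0.30, "b2"), (0.80, "b3")]

# k=2 yields four candidates in total; size=3 keeps the three closest.
top = merge_results([segment_a, segment_b], k=2, size=3)

# k=1 yields only two candidates, so fewer than `size` results come back.
short = merge_results([segment_a, segment_b], k=1, size=3)
```

In this model, a small k combined with a larger size can return fewer results than size, because only k candidates per segment ever reach the merge step.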
Using approximate k-NN with filters
If you use the knn query alongside filters or other clauses (e.g. bool, must, match), you might receive fewer than k results. In this example, post_filter reduces the number of results from 2 to 1:
GET my-knn-index-1/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector2": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  },
  "post_filter": {
    "range": {
      "price": {
        "gte": 5,
        "lte": 10
      }
    }
  }
}
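To see why the filter can shrink the result set, it helps to replay the two steps by hand: first pick the k nearest neighbors, then apply the price range to only those neighbors. The sketch below does this in plain Python for the my_vector2 documents from the bulk example. It uses an exact cosine distance as a stand-in for the approximate HNSW search, and it breaks the tie between documents 6 and 9 (which hold identical vectors) by input order. Both choices are assumptions of this sketch, so the plugin may pick a different second neighbor.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# (id, my_vector2, price) for the documents that have a my_vector2 field.
DOCS = [
    ("6", [1.5, 5.5, 4.5, 6.4], 10.3),
    ("7", [2.5, 3.5, 5.6, 6.7], 5.5),
    ("8", [4.5, 5.5, 6.7, 3.7], 4.4),
    ("9", [1.5, 5.5, 4.5, 6.4], 8.9),
]


def knn_then_filter(query, k, low, high):
    """Pick the k nearest documents first, then apply the price filter to them."""
    # Step 1: exact nearest neighbors stand in for the approximate search.
    # sorted() is stable, so documents with equal distance keep their input order.
    ranked = sorted(DOCS, key=lambda doc: cosine_distance(query, doc[1]))
    neighbors = ranked[:k]
    # Step 2: the filter only ever sees those k neighbors.
    return [doc_id for doc_id, _, price in neighbors if low <= price <= high]


result = knn_then_filter([2, 3, 5, 6], k=2, low=5, high=10)
```

In this replay, the two nearest neighbors are documents 7 and 6, and only document 7 has a price in the requested range, so a single result survives even though document 9 would also have matched the filter.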
Spaces
A space corresponds to the function used to measure the distance between two points in order to determine the k-nearest neighbors. From the k-NN perspective, a lower score equates to a closer and better result. This is the opposite of how Elasticsearch scores results, where a greater score equates to a better result. To convert distances to Elasticsearch scores, we take 1 / (1 + distance). Currently, the k-NN plugin supports the following spaces:
spaceType | Distance Function | Elasticsearch Score |
---|---|---|
l2 | \[ Distance(X, Y) = \sum_{i=1}^n (X_i - Y_i)^2 \] | 1 / (1 + Distance Function) |
l1 | \[ Distance(X, Y) = \sum_{i=1}^n \lvert X_i - Y_i \rvert \] | 1 / (1 + Distance Function) |
cosinesimil | \[ 1 - \frac{A \cdot B}{\lVert A \rVert \cdot \lVert B \rVert} = 1 - \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \cdot \sqrt{\sum_{i=1}^n B_i^2}} \] where \(\lVert A \rVert\) and \(\lVert B \rVert\) are the Euclidean norms of the vectors. | 1 / (1 + Distance Function) |
hammingbit | Distance = countSetBits(X \(\oplus\) Y) | 1 / (1 + Distance Function) |
The cosine similarity formula does not include the 1 - prefix. However, because nmslib equates smaller scores with closer results, it returns 1 - cosineSimilarity for its cosine similarity space, which is why 1 - is included in the distance function.
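The distance functions and the score conversion can be checked by hand with a few lines of Python. This is a minimal sketch of the formulas, not the nmslib implementation. It assumes that l1 sums absolute differences (the standard definition), that hammingbit operates on integers treated as bit strings, and that all vectors passed to cosinesimil are non-zero. The function names simply reuse the space names for readability.

```python
import math


def l2(x, y):
    """Sum of squared differences, as in the l2 row of the table."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def l1(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def cosinesimil(x, y):
    """1 - cosine similarity, so that smaller values mean closer vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def hammingbit(x, y):
    """Number of set bits in x XOR y, for integers x and y."""
    return bin(x ^ y).count("1")


def to_score(distance):
    """Convert a distance into an Elasticsearch-style score."""
    return 1.0 / (1.0 + distance)


# Identical vectors have distance 0 and therefore the maximum score of 1.
best = to_score(l2([1.0, 2.0], [1.0, 2.0]))
# A larger distance maps to a smaller score.
worse = to_score(l2([0.0, 0.0], [3.0, 4.0]))
```

Because the conversion is monotonically decreasing in the distance, ranking by score in descending order gives the same order as ranking by distance in ascending order.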