1 Introduction

The BiocNeighbors package implements a few algorithms for exact nearest neighbor searching:

  • The k-means for k-nearest neighbors (KMKNN) algorithm (Wang 2012) uses k-means clustering to create an index. Within each cluster, the distance from each of that cluster’s points to the cluster center is computed and used to sort the points. Given a query point, the distance to each cluster center is determined and the triangle inequality is applied to determine which points in each cluster warrant a full distance calculation (see the sketch after this list).
  • The vantage point (VP) tree algorithm (Yianilos 1993) involves constructing a tree where each node is located at a data point and is associated with a subset of neighboring points. Each node progressively partitions its points into two subsets, containing those that are closer to or further from the node than a given threshold. Given a query point, the triangle inequality is applied at each node in the tree to determine whether the child nodes warrant searching.
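
To make the pruning step concrete, below is a minimal sketch of the KMKNN bound using base R’s kmeans(); the variable names and the fixed threshold are hypothetical, and the actual implementation tightens the threshold as closer neighbors are found. The VP tree applies the same inequality at each node of its tree.

# Mock data and a k-means "index" (a sketch; BiocNeighbors does this internally).
set.seed(100)
pts <- matrix(runif(1000 * 5), ncol=5)
km <- kmeans(pts, centers=10)

# Distances from a query point to each cluster center.
query <- runif(5)
d.query.center <- sqrt(colSums((t(km$centers) - query)^2))

# By the triangle inequality, a point 'p' in cluster 'c' satisfies
# d(query, p) >= |d(query, c) - d(p, c)|, so any point whose lower
# bound exceeds the current k-th smallest distance can be skipped.
threshold <- 0.5 # hypothetical current k-th smallest distance.
for (clust in seq_len(nrow(km$centers))) {
    in.clust <- which(km$cluster == clust)
    d.pts.center <- sqrt(rowSums(
        sweep(pts[in.clust,,drop=FALSE], 2, km$centers[clust,])^2))
    lower.bound <- abs(d.query.center[clust] - d.pts.center)
    candidates <- in.clust[lower.bound <= threshold]
    # ... full distance calculations are only needed for 'candidates'.
}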

Both methods involve a component of randomness during index construction, though the k-nearest neighbors result is fully deterministic.
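
Despite this randomness, repeated searches on the same data return the same neighbors. A quick check along these lines, using findKNN() (introduced below) on mock data:

library(BiocNeighbors)
mock <- matrix(runif(1000 * 10), ncol=10)

set.seed(1)
res1 <- findKNN(mock, k=5)
set.seed(2)
res2 <- findKNN(mock, k=5)

# Expected to be TRUE in the absence of tied distances.
identical(res1$index, res2$index)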

2 Identifying k-nearest neighbors

The most obvious application is to perform a k-nearest neighbors search. We’ll mock up an example here with points uniformly distributed in a 20-dimensional hypercube, for which we want to identify the 10 nearest neighbors of each point.

nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)

The findKNN() method expects a numeric matrix as input with data points as the rows and variables/dimensions as the columns. We indicate that we want to use the KMKNN algorithm by setting BNPARAM=KmknnParam() (which is also the default, so this is not strictly necessary here). We could use a VP tree instead by setting BNPARAM=VptreeParam().

fout <- findKNN(data, k=10, BNPARAM=KmknnParam())
head(fout$index)
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 2191 1284 8943 1829 8043 3484 4226 6109 1044  9461
## [2,] 9994 6271 6048 5050 1961 7677  698 4011 9721  9356
## [3,]  424 5554 4009 2829 9467 5302 7066 4813 1189  9951
## [4,] 4700 8180 7883  830 4408 2712 4703 5723 3591  9530
## [5,] 3693 1540 1457  843 5085 7008  673 3114 3830  5045
## [6,] 6459 6877  553 1002 6656  264  922 3988 8291  2087
head(fout$distance)
##           [,1]      [,2]      [,3]      [,4]      [,5]     [,6]      [,7]
## [1,] 0.8133961 0.9400178 0.9703469 1.0006581 1.0020703 1.013964 1.0451076
## [2,] 1.0488302 1.0740912 1.1017112 1.1366849 1.1378279 1.139590 1.1519148
## [3,] 0.9431701 0.9618933 0.9919817 1.0038151 1.0179641 1.027765 1.0379311
## [4,] 0.8552499 0.8808786 0.9122011 0.9184494 0.9589407 0.961043 0.9776151
## [5,] 0.9730693 0.9945448 1.0333593 1.0337936 1.0416727 1.093584 1.0940787
## [6,] 1.0308669 1.0722899 1.0863727 1.0940507 1.1140036 1.115656 1.1264884
##           [,8]     [,9]     [,10]
## [1,] 1.0521773 1.054278 1.0661377
## [2,] 1.1536413 1.155539 1.1593822
## [3,] 1.0609799 1.063281 1.0682536
## [4,] 0.9858938 0.995758 0.9976013
## [5,] 1.1019123 1.104304 1.1124395
## [6,] 1.1617155 1.177938 1.1855291

Each row of the index matrix corresponds to a point in data and contains the row indices in data that are its nearest neighbors. For example, the 3rd point in data has the following nearest neighbors:

fout$index[3,]
##  [1]  424 5554 4009 2829 9467 5302 7066 4813 1189 9951

… with the following distances to those neighbors:

fout$distance[3,]
##  [1] 0.9431701 0.9618933 0.9919817 1.0038151 1.0179641 1.0277653 1.0379311
##  [8] 1.0609799 1.0632809 1.0682536

Note that the reported neighbors are sorted by distance.
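
This is easy to verify, e.g., for the distances to the 3rd point’s neighbors shown above:

all(diff(fout$distance[3,]) >= 0)
## [1] TRUE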

3 Querying k-nearest neighbors

Another application is to identify the k-nearest neighbors in one dataset based on query points in another dataset. Again, we mock up a small dataset:

nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)

We then use the queryKNN() function to identify the 5 nearest neighbors in data for each point in query.

qout <- queryKNN(data, query, k=5, BNPARAM=KmknnParam())
head(qout$index)
##      [,1] [,2] [,3] [,4] [,5]
## [1,] 6797 1683 6427 9274 8486
## [2,] 4751 9190 6949 1038 8123
## [3,] 7598 3613 8355  952 2173
## [4,] 7168  728 9034 2740 8100
## [5,]  599 3212 8370 9934 9244
## [6,] 6659 3529 4457 7137 4122
head(qout$distance)
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.8576696 0.9786691 0.9967828 1.0032973 1.0110760
## [2,] 0.8192674 0.8883764 0.9419443 0.9419605 0.9603309
## [3,] 0.9086010 0.9097717 0.9399545 0.9681321 0.9706446
## [4,] 0.9245513 0.9529929 0.9586078 1.0535925 1.0562416
## [5,] 0.8205016 0.8399731 0.9228952 0.9341433 0.9379830
## [6,] 0.9199978 0.9494325 0.9944878 1.0072480 1.0093574

Each row of the index matrix contains the row indices in data that are the nearest neighbors of a point in query. For example, the 3rd point in query has the following nearest neighbors in data:

qout$index[3,]
## [1] 7598 3613 8355  952 2173

… with the following distances to those neighbors:

qout$distance[3,]
## [1] 0.9086010 0.9097717 0.9399545 0.9681321 0.9706446

Again, the reported neighbors are sorted by distance.

4 Further options

Users can perform the search for a subset of query points using the subset= argument. This is more efficient than performing the search for all points and subsetting the output, while yielding the same result (see the check below).

findKNN(data, k=5, subset=3:5)
## $index
##      [,1] [,2] [,3] [,4] [,5]
## [1,]  424 5554 4009 2829 9467
## [2,] 4700 8180 7883  830 4408
## [3,] 3693 1540 1457  843 5085
## 
## $distance
##           [,1]      [,2]      [,3]      [,4]      [,5]
## [1,] 0.9431701 0.9618933 0.9919817 1.0038151 1.0179641
## [2,] 0.8552499 0.8808786 0.9122011 0.9184494 0.9589407
## [3,] 0.9730693 0.9945448 1.0333593 1.0337936 1.0416727
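
As a quick check of this equivalence, the subsetted results can be compared to the corresponding rows of a full search; both comparisons below should return TRUE.

full <- findKNN(data, k=5)
sub <- findKNN(data, k=5, subset=3:5)
identical(sub$index, full$index[3:5,])
identical(sub$distance, full$distance[3:5,])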

If only the indices are of interest, users can set get.distance=FALSE to avoid returning the matrix of distances. This will save some time and memory.

names(findKNN(data, k=2, get.distance=FALSE))
## [1] "index"

It is also simple to speed up the calculations by parallelizing them across workers with the BiocParallel framework, via the BPPARAM= argument.

library(BiocParallel)
out <- findKNN(data, k=10, BPPARAM=MulticoreParam(3))
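
Note that MulticoreParam() relies on forked processes and is not supported on Windows; Windows users can use SnowParam() to achieve the same effect.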

For multiple queries against a constant dataset, the pre-clustering can be performed in a separate step with buildIndex(). The result can then be passed to multiple calls, avoiding the overhead of repeated clustering.

pre <- buildIndex(data, BNPARAM=KmknnParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)

The default setting is to search on the Euclidean distance. Alternatively, we can use the Manhattan distance by setting distance="Manhattan" in the BiocNeighborParam object.

out.m <- findKNN(data, k=5, BNPARAM=KmknnParam(distance="Manhattan"))

Advanced users may also be interested in the raw.index= argument, which returns indices directly into the precomputed object rather than into data. This may be useful inside package functions where it may be more convenient to work with a common precomputed object, as sketched below.
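
As a sketch using the precomputed object from above, assuming that bnorder() retrieves the permutation mapping the internal ordering of the precomputed object back to row indices of data:

raw <- findKNN(BNINDEX=pre, k=5, raw.index=TRUE)

# Raw indices refer to the internal ordering of points in 'pre';
# bnorder(pre) is assumed to map them back to rows of 'data'.
remapped <- matrix(bnorder(pre)[raw$index], nrow=nrow(raw$index))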

5 Session information

sessionInfo()
## R Under development (unstable) (2020-01-27 r77730)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2012 R2 x64 (build 9600)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocParallel_1.21.2 BiocNeighbors_1.5.2 knitr_1.28         
## [4] BiocStyle_2.15.6   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3          bookdown_0.18       lattice_0.20-40    
##  [4] digest_0.6.25       grid_4.0.0          stats4_4.0.0       
##  [7] magrittr_1.5        evaluate_0.14       rlang_0.4.5        
## [10] stringi_1.4.6       S4Vectors_0.25.12   Matrix_1.2-18      
## [13] rmarkdown_2.1       tools_4.0.0         stringr_1.4.0      
## [16] parallel_4.0.0      xfun_0.12           yaml_2.2.1         
## [19] compiler_4.0.0      BiocGenerics_0.33.0 BiocManager_1.30.10
## [22] htmltools_0.4.0

References

Wang, X. 2012. “A Fast Exact k-Nearest Neighbors Algorithm for High Dimensional Search Using k-Means Clustering and Triangle Inequality.” Proc Int Jt Conf Neural Netw 43 (6): 2351–8.

Yianilos, P. N. 1993. “Data Structures and Algorithms for Nearest Neighbor Search in General Metric Spaces.” In SODA, 93:311–21.