BiocNeighbors 1.5.2
The BiocNeighbors package provides several algorithms for approximate neighbor searches:
These methods complement the exact algorithms described previously.
Again, it is straightforward to switch from one algorithm to another by simply changing the BNPARAM
argument in findKNN
and queryKNN
.
We perform the k-nearest neighbors search with the Annoy algorithm by specifying BNPARAM=AnnoyParam()
.
nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)
fout <- findKNN(data, k=10, BNPARAM=AnnoyParam())
head(fout$index)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## [1,] 6670 5842 4500 3393 8541 7131 9803 9499 1765 7784
## [2,] 41 8166 2239 9779 9915 9942 2601 306 7626 5323
## [3,] 8682 807 2703 2960 9354 6073 978 8852 6997 2741
## [4,] 7672 4103 2564 4031 6400 3554 4607 4557 7800 3620
## [5,] 7464 2283 158 6981 4752 8833 2908 3936 8379 3841
## [6,] 8952 5883 9564 130 4482 3590 4024 9142 5892 238
head(fout$distance)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## [1,] 0.7185587 0.8849660 0.8990269 0.9272472 0.9426381 0.9509270 0.9781944
## [2,] 0.9375380 1.0224338 1.0480847 1.0677400 1.0908558 1.1178118 1.1202444
## [3,] 0.9684885 0.9710075 0.9873037 0.9948838 1.0341672 1.0428770 1.0463076
## [4,] 0.9653872 1.0137937 1.0161486 1.0230589 1.0237193 1.0367494 1.0368437
## [5,] 0.8731992 0.8765553 0.8851078 0.8890289 0.9425309 0.9523616 0.9726140
## [6,] 0.9558727 1.0701734 1.0881078 1.1048518 1.1080862 1.1132088 1.1334031
## [,8] [,9] [,10]
## [1,] 1.014034 1.016446 1.024055
## [2,] 1.123012 1.124689 1.128097
## [3,] 1.069681 1.081369 1.088220
## [4,] 1.039266 1.041562 1.047012
## [5,] 0.979152 1.008427 1.011603
## [6,] 1.134022 1.140072 1.154267
We can also identify the k-nearest neighbors in one dataset based on query points in another dataset.
nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)
qout <- queryKNN(data, query, k=5, BNPARAM=AnnoyParam())
head(qout$index)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1162 391 4447 2610 5932
## [2,] 9325 8106 264 9919 6426
## [3,] 4514 6004 2082 7320 5387
## [4,] 7159 1214 7660 2820 8678
## [5,] 5989 8563 4765 2050 738
## [6,] 3098 9614 1291 901 8368
head(qout$distance)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.8793623 0.8979356 0.9143692 0.9378607 1.0349739
## [2,] 0.9202604 0.9274356 0.9511770 0.9615410 0.9667929
## [3,] 0.8071835 0.9572333 0.9834715 1.0442172 1.0469409
## [4,] 0.9567441 1.1421938 1.1536205 1.1682745 1.1762619
## [5,] 0.7843636 0.9174890 0.9415302 0.9419450 0.9560050
## [6,] 0.8676425 0.9382980 0.9920224 1.0139333 1.0571074
It is similarly easy to use the HNSW algorithm by setting BNPARAM=HnswParam()
.
Most of the options described for the exact methods are also applicable here. For example:
subset
to identify neighbors for a subset of points.get.distance
to avoid retrieving distances when unnecessary.BPPARAM
to parallelize the calculations across multiple workers.BNINDEX
to build the forest once for a given data set and re-use it across calls.The use of a pre-built BNINDEX
is illustrated below:
pre <- buildIndex(data, BNPARAM=AnnoyParam())
out1 <- findKNN(BNINDEX=pre, k=5)
out2 <- queryKNN(BNINDEX=pre, query=query, k=2)
Both Annoy and HNSW perform searches based on the Euclidean distance by default.
Searching by Manhattan distance is done by simply setting distance="Manhattan"
in AnnoyParam()
or HnswParam()
.
Users are referred to the documentation of each function for specific details on the available arguments.
Both Annoy and HNSW generate indexing structures - a forest of trees and series of graphs, respectively -
that are saved to file when calling buildIndex()
.
By default, this file is located in tempdir()
1 and will be removed when the session finishes.
AnnoyIndex_path(pre)
## [1] "C:\\Users\\biocbuild\\bbs-3.11-bioc\\tmpdir\\Rtmpe0W1Zo\\file15e4172e18da.idx"
If the index is to persist across sessions, the path of the index file can be directly specified in buildIndex
.
This can be used to construct an index object directly using the relevant constructors, e.g., AnnoyIndex()
, HnswIndex()
.
However, it becomes the responsibility of the user to clean up any temporary indexing files after calculations are complete.
sessionInfo()
## R Under development (unstable) (2020-01-27 r77730)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2012 R2 x64 (build 9600)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=C
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] BiocNeighbors_1.5.2 knitr_1.28 BiocStyle_2.15.6
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.3 bookdown_0.18 lattice_0.20-40
## [4] digest_0.6.25 grid_4.0.0 stats4_4.0.0
## [7] magrittr_1.5 evaluate_0.14 rlang_0.4.5
## [10] stringi_1.4.6 S4Vectors_0.25.12 Matrix_1.2-18
## [13] rmarkdown_2.1 BiocParallel_1.21.2 tools_4.0.0
## [16] stringr_1.4.0 parallel_4.0.0 xfun_0.12
## [19] yaml_2.2.1 compiler_4.0.0 BiocGenerics_0.33.0
## [22] BiocManager_1.30.10 htmltools_0.4.0