1 Identifying all neighbors within range

Another application of the KMKNN or VP tree algorithms is to identify all neighboring points within a certain distance1 of the current point. We first mock up some data:

nobs <- 10000
ndim <- 20
data <- matrix(runif(nobs*ndim), ncol=ndim)

We apply the findNeighbors() function to data:

fout <- findNeighbors(data, threshold=1)
head(fout$index)
## [[1]]
##  [1] 2863 5322 9246 4974 4640 1893 5928 9541 6049 6816   97 6913 6336 8539 5588
## [16] 3447 4237    1 1430 9864 2080 3162 1611
## 
## [[2]]
## [1]  694 2755 1632    2
## 
## [[3]]
##  [1] 7217    3 6327 4091 9260 4416 9476 8327 6428 4503 1038 7416
## 
## [[4]]
## [1] 6912 7143 2947    4 9124
## 
## [[5]]
## [1] 4848    5
## 
## [[6]]
## [1] 5995 1383    6 5584 8370 3652
head(fout$distance)
## [[1]]
##  [1] 0.9609800 0.9518324 0.9700027 0.9870617 0.9027826 0.9863867 0.9537165
##  [8] 0.9920214 0.9579141 0.8663649 0.9959886 0.9925796 0.9751466 0.9155481
## [15] 0.8140463 0.9465446 0.9336174 0.0000000 0.9208373 0.7875078 0.9856976
## [22] 0.9970428 0.9005577
## 
## [[2]]
## [1] 0.9241700 0.9622873 0.9102940 0.0000000
## 
## [[3]]
##  [1] 0.7928541 0.0000000 0.9100973 0.8986305 0.9746544 0.9786117 0.9569915
##  [8] 0.9298775 0.9270930 0.9129424 0.9789321 0.9997940
## 
## [[4]]
## [1] 0.9931604 0.9643559 0.9589190 0.0000000 0.9628718
## 
## [[5]]
## [1] 0.9259691 0.0000000
## 
## [[6]]
## [1] 0.7775292 0.9133919 0.0000000 0.9671387 0.9672200 0.9189961

Each entry of the index list corresponds to a point in data and contains the row indices in data that are within threshold. For example, the 3rd point in data has the following neighbors:

fout$index[[3]]
##  [1] 7217    3 6327 4091 9260 4416 9476 8327 6428 4503 1038 7416

… with the following distances to those neighbors:

fout$distance[[3]]
##  [1] 0.7928541 0.0000000 0.9100973 0.8986305 0.9746544 0.9786117 0.9569915
##  [8] 0.9298775 0.9270930 0.9129424 0.9789321 0.9997940

Note that, for this function, the reported neighbors are not sorted by distance. The order of the output is completely arbitrary and will vary depending on the random seed. However, the identity of the neighbors is fully deterministic.

2 Querying another data set for neighbors

The queryNeighbors() function is also provided for identifying all points within a certain distance of a query point. Given a query data set:

nquery <- 1000
ndim <- 20
query <- matrix(runif(nquery*ndim), ncol=ndim)

… we apply the queryNeighbors() function:

qout <- queryNeighbors(data, query, threshold=1)
length(qout$index)
## [1] 1000

… where each entry of qout$index corresponds to a row of query and contains its neighbors in data. Again, the order of the output is arbitrary but the identity of the neighbors is deterministic.

3 Further options

Most of the options described for findKNN() are also applicable here. For example:

  • subset to identify neighbors for a subset of points.
  • get.distance to avoid retrieving distances when unnecessary.
  • BPPARAM to parallelize the calculations across multiple workers.
  • raw.index to return the raw indices from a precomputed index.

Note that the argument for a precomputed index is precomputed:

pre <- buildIndex(data, BNPARAM=KmknnParam())
fout.pre <- findNeighbors(BNINDEX=pre, threshold=1)
qout.pre <- queryNeighbors(BNINDEX=pre, query=query, threshold=1)

Users are referred to the documentation of each function for specific details.

4 Session information

sessionInfo()
## R Under development (unstable) (2020-01-27 r77730)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows Server 2012 R2 x64 (build 9600)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=C                          
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] BiocParallel_1.21.2 BiocNeighbors_1.5.2 knitr_1.28         
## [4] BiocStyle_2.15.6   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.3          bookdown_0.18       lattice_0.20-40    
##  [4] digest_0.6.25       grid_4.0.0          stats4_4.0.0       
##  [7] magrittr_1.5        evaluate_0.14       rlang_0.4.5        
## [10] stringi_1.4.6       S4Vectors_0.25.12   Matrix_1.2-18      
## [13] rmarkdown_2.1       tools_4.0.0         stringr_1.4.0      
## [16] parallel_4.0.0      xfun_0.12           yaml_2.2.1         
## [19] compiler_4.0.0      BiocGenerics_0.33.0 BiocManager_1.30.10
## [22] htmltools_0.4.0