Package {discoCVI}


Type: Package
Title: Implementation of the DISCO Metric for Internal Clustering Evaluation
Version: 0.1.1
Description: Implementation of the DISCO (Density-based Internal Score for Clusterings with nOise) metric, a cluster validity index for evaluating density-based clustering results without ground truth labels. DISCO is the first index to explicitly assess the quality of noise point assignments in addition to cluster quality. It uses density-connectivity distance derived from a minimum spanning tree of the mutual-reachability graph, providing interpretable, bounded scores in [-1, 1]. Higher scores indicate better clustering. Based on Beer et al. (2025) <doi:10.48550/arXiv.2503.00127>.
License: MIT + file LICENSE
Encoding: UTF-8
Language: en-GB
Depends: R (≥ 3.6.0)
Imports: FNN, stats
Suggests: dbscan, testthat (≥ 3.0.0)
RoxygenNote: 7.3.3
URL: https://github.com/aminentezari/discoCVI
BugReports: https://github.com/aminentezari/discoCVI/issues
NeedsCompilation: no
Packaged: 2026-05-02 08:55:19 UTC; aminentezari
Author: Amin Entezari ORCID iD [aut, cre], Davide Chicco ORCID iD [ctb] (Supervision and guidance), Anna Beer [aut] (Original Python implementation (arXiv:2503.00127)), Lena Krieger [aut] (Original Python implementation (arXiv:2503.00127)), Pascal Weber [aut] (Original Python implementation (arXiv:2503.00127))
Maintainer: Amin Entezari <amin_entezari@outlook.com>
Repository: CRAN
Date/Publication: 2026-05-05 18:20:13 UTC

disco: Density-Informed Clustering Scoring and Optimisation

Description

The DISCO metric evaluates the quality of a clustering result using density-connectivity (DC) distances derived from the minimum spanning tree of the mutual-reachability graph. It generalises the silhouette coefficient to density-based partitions and assigns a principled score to noise points.

Author(s)

Maintainer: Amin Entezari amin_entezari@outlook.com (ORCID)

Other contributors:

See Also

Useful links:


Compute mutual-reachability (reachability) distances

Description

For every pair of points (i, j), the mutual-reachability distance is

d_{\text{reach}}(i,j) = \max\bigl(\text{core}_k(i),\; \text{core}_k(j),\; d(i,j)\bigr)

where \text{core}_k(i) is the distance from point i to its k-th nearest neighbour (k = \texttt{min\_points} - 1) and d(i,j) is the Euclidean distance.

Usage

calculate_reachability_distance(points, min_points = 5)

Arguments

points

A numeric matrix of shape n \times p.

min_points

A positive integer \geq 2 giving the neighbourhood size used for the core-distance calculation. Corresponds to MinPts in HDBSCAN. Default is 5.

Value

A symmetric n \times n numeric matrix of mutual-reachability distances with zeros on the diagonal.


Compute density-connectivity (DC) distances

Description

The main entry point for constructing the DC-distance matrix used throughout the DISCO metric. The procedure is:

  1. Compute mutual-reachability distances (calculate_reachability_distance).

  2. Build the minimum spanning tree of those distances (get_mst_edges).

  3. Extract pairwise DC distances as minimax-path weights in the MST (extract_dc_distances_from_mst).

Usage

compute_dc_distances(X, min_points = 5)

Arguments

X

A numeric matrix of shape n \times p, or a data.frame that will be coerced to a matrix.

min_points

A positive integer for the core-distance neighbourhood size. Default is 5.

Value

A symmetric n \times n numeric matrix of DC distances.

Examples

set.seed(1)
X <- matrix(rnorm(40), ncol = 2)
D <- compute_dc_distances(X, min_points = 5)
dim(D)   # 20 x 20

Per-sample DISCO scores

Description

Computes a pointwise DISCO score for every observation in X. The score lies in [-1, 1]:

Four cases are handled:

  1. All noise — every point labelled -1: all scores are -1.

  2. Single cluster with no noise — all scores are 0.

  3. One real cluster + noise — cluster points are scored via p_cluster; noise points are scored via p_noise.

  4. Two or more clusters (optional noise) — non-noise points use p_cluster on the non-noise sub-matrix; noise points use p_noise.

Usage

disco_samples(X, labels, min_points = 5)

Arguments

X

A numeric matrix (n \times p) or data.frame.

labels

An integer vector of length n containing cluster labels. Use -1 to denote noise/outlier points.

min_points

A positive integer for the core-distance neighbourhood size. Default is 5.

Value

A numeric vector of length n with per-point DISCO scores.

See Also

disco_score, p_cluster, p_noise

Examples

set.seed(42)
X <- matrix(rnorm(100), ncol = 2)
labels <- rep(c(0L, 1L), each = 25)
s <- disco_samples(X, labels)
hist(s, main = "Per-sample DISCO scores")

Overall DISCO score

Description

Returns the mean of the per-sample DISCO scores produced by disco_samples. This is the single-number summary of clustering quality used for model selection and comparison.

Usage

disco_score(X, labels, min_points = 5)

Arguments

X

A numeric matrix (n \times p) or data.frame.

labels

An integer vector of cluster labels (-1 for noise).

min_points

Positive integer; core-distance neighbourhood size. Default is 5.

Value

A single numeric value in [-1, 1].

See Also

disco_samples

Examples

set.seed(42)
X <- matrix(rnorm(100), ncol = 2)
labels <- rep(c(0L, 1L), each = 25)
disco_score(X, labels)

Extract DC distances from an MST via BFS minimax-path traversal

Description

Given the edges of a minimum spanning tree, computes the density-connectivity (DC) distance matrix. The DC distance between two points is the maximum edge weight on the unique path connecting them in the MST (minimax path). BFS is used to propagate these path maxima from every source node.

Usage

extract_dc_distances_from_mst(mst_edges, n)

Arguments

mst_edges

A data.frame with columns i, j, dist as returned by get_mst_edges.

n

Integer, total number of points.

Value

A symmetric n \times n numeric matrix of DC distances.


Compute a minimum spanning tree via Prim's algorithm

Description

Implements Prim's algorithm exactly as in the reference Python code (dctree.py, lines 401–438), operating on a precomputed distance matrix.

Usage

get_mst_edges(dist_matrix)

Arguments

dist_matrix

A symmetric n \times n numeric distance matrix.

Value

A data.frame with n-1 rows and three columns:

i

Integer index of the first endpoint (1-based).

j

Integer index of the second endpoint (1-based).

dist

Edge weight.


Silhouette-like cluster scores using DC distances

Description

Computes a silhouette coefficient for each non-noise point using the precomputed (or internally derived) DC-distance matrix rather than raw Euclidean distances. This matches the formula used by sklearn.metrics.silhouette_samples with metric="precomputed" (Python reference: disco.py, line 280):

s(i) = \frac{b(i) - a(i)}{\max(a(i),\, b(i))}

where a(i) is the mean DC distance from i to all other points in the same cluster and b(i) is the minimum over other clusters of the mean DC distance from i to that cluster.

Usage

p_cluster(X, labels, min_points = 5, precomputed_dc_dists = FALSE)

Arguments

X

Either (a) a numeric matrix of raw data (n \times p) when precomputed_dc_dists = FALSE, or (b) a precomputed n \times n DC-distance matrix when precomputed_dc_dists = TRUE.

labels

An integer vector of cluster labels of length n. Noise labels (-1) should not be present; pass only the non-noise subset.

min_points

Positive integer; used only when precomputed_dc_dists = FALSE. Default is 5.

precomputed_dc_dists

Logical. If TRUE, X is treated as an already-computed DC-distance matrix; no further distances are computed. Default is FALSE.

Value

A numeric vector of length n with per-point silhouette-style scores in [-1, 1].

See Also

disco_samples, compute_dc_distances

Examples

set.seed(7)
X <- matrix(rnorm(60), ncol = 2)
labels <- rep(c(0L, 1L, 2L), each = 10)
p_cluster(X, labels)

Noise-point scores

Description

Assigns a quality score to each point labelled as noise (-1). Two complementary sub-scores are computed and the element-wise minimum is taken as the final score:

p_{\text{sparse}}

Measures whether the noise point is sparser (more peripheral in density) than the densest part of every real cluster. For cluster c, let M_c be the maximum core distance over all points in c. Then

p_{\text{sparse}}(i) = \min_c \frac{\text{core}_k(i) - M_c}{\max(\text{core}_k(i),\, M_c)}.

p_{\text{far}}

Measures whether the noise point is far from every cluster in DC-distance space. For cluster c, let d_{\min,c}(i) be the minimum DC distance from i to any point in c. Then

p_{\text{far}}(i) = \min_c \frac{d_{\min,c}(i) - M_c}{\max(d_{\min,c}(i),\, M_c)}.

Both sub-scores lie in [-1, 1]: positive means the noise point is legitimately sparse/far; negative means it is denser/closer than the cluster boundary and might be a misclassified inlier.

Usage

p_noise(X, labels, min_points = 5, dc_dists = NULL)

Arguments

X

A numeric matrix (n \times p) or data.frame of all points (including non-noise).

labels

An integer vector of length n; -1 denotes noise.

min_points

Positive integer \geq 2; core-distance neighbourhood size. Default is 5.

dc_dists

Optional precomputed n \times n DC-distance matrix. If NULL (default), it is computed internally.

Value

A named list with two numeric vectors, each of length equal to the number of noise points:

p_sparse

Sparsity-based noise scores.

p_far

DC-distance-based noise scores.

See Also

disco_samples

Examples

set.seed(3)
X      <- matrix(rnorm(60), ncol = 2)   # 30 rows x 2 cols
labels <- c(rep(0L, 10), rep(1L, 10), rep(-1L, 10))  # 30 labels
nr     <- p_noise(X, labels, min_points = 3)
str(nr)