1 Introduction

The dataSDA package (v0.2.6) gathers various symbolic data tailored to different research themes and provides a comprehensive set of functions for reading, writing, converting, and analyzing symbolic data. The package is available on CRAN at https://CRAN.R-project.org/package=dataSDA and on GitHub at https://github.com/hanmingwu1103/dataSDA.

The package includes 114 datasets spanning seven types of symbolic data. Each dataset name uses a suffix that indicates its type:

Type	Suffix	Datasets	Description
Interval	`.int`, `.iGAP`, `.int.mm`	57	Interval-valued data in RSDA (54), iGAP (2), and min-max (1) formats
Histogram	`.hist`	25	Histogram-valued distributional data
Mixed	`.mix`	11	Datasets combining interval and categorical variables
Interval Time Series	`.its`	9	Interval-valued time series data
Modal	`.modal`	7	Modal multi-valued symbolic data
Distributional	`.distr`	3	Distributional symbolic data
Other	???	2	Auxiliary datasets (`bank_rates`, `hierarchy`)
Total		114

The package provides functions organized into the following categories:

Category	Functions	Count
Format detection & conversion	`int_detect_format`, `int_list_conversions`, `int_convert_format`, `RSDA_to_MM`, `iGAP_to_MM`, `SODAS_to_MM`, `MM_to_iGAP`, `RSDA_to_iGAP`, `SODAS_to_iGAP`, `MM_to_RSDA`, `iGAP_to_RSDA`	11
Core statistics	`int_mean`, `int_var`, `int_cov`, `int_cor`	4
Geometric properties	`int_width`, `int_radius`, `int_center`, `int_midrange`, `int_overlap`, `int_containment`	6
Position & scale	`int_median`, `int_quantile`, `int_range`, `int_iqr`, `int_mad`, `int_mode`	6
Robust statistics	`int_trimmed_mean`, `int_winsorized_mean`, `int_trimmed_var`, `int_winsorized_var`	4
Distribution shape	`int_skewness`, `int_kurtosis`, `int_symmetry`, `int_tailedness`	4
Similarity measures	`int_jaccard`, `int_dice`, `int_cosine`, `int_overlap_coefficient`, `int_tanimoto`, `int_similarity_matrix`	6
Uncertainty & variability	`int_entropy`, `int_cv`, `int_dispersion`, `int_imprecision`, `int_granularity`, `int_uniformity`, `int_information_content`	7
Distance measures	`int_dist`, `int_dist_matrix`, `int_pairwise_dist`, `int_dist_all`	4
Histogram statistics	`hist_mean`, `hist_var`, `hist_cov`, `hist_cor`	4
Utilities	`clean_colnames`, `RSDA_format`, `set_variable_format`, `read_symbolic_csv`, `write_symbolic_csv`, `check_zero_width_intervals`	6

2 Data Formats and Conversion

2.1 Interval data formats overview

The dataSDA package works with three primary formats for interval-valued data:

RSDA format: symbolic_tbl objects where intervals are encoded as complex numbers (min + max*i). Used by the RSDA package.
MM format: Standard data frames with paired _min / _max columns for each variable.
iGAP format: Data frames where each interval is a comma-separated string (e.g., "2.5,4.0").

data(mushroom.int)
head(mushroom.int, 3)
class(mushroom.int)

data(abalone.int)
head(abalone.int, 3)
class(abalone.int)

data(abalone.iGAP)
head(abalone.iGAP, 3)
class(abalone.iGAP)

The int_detect_format() function automatically identifies the format of a dataset:

int_detect_format(mushroom.int)
int_detect_format(abalone.int)
int_detect_format(abalone.iGAP)

Use int_list_conversions() to see all available format conversion paths:

int_list_conversions()

2.2 Unified format conversion

The int_convert_format() function provides a unified interface for converting between formats. It auto-detects the source format and applies the appropriate conversion:

# RSDA to MM
mushroom.MM <- int_convert_format(mushroom.int, to = "MM")
head(mushroom.MM, 3)

# iGAP to MM
abalone.MM <- int_convert_format(abalone.iGAP, to = "MM")
head(abalone.MM, 3)

# iGAP to RSDA
data(face.iGAP)
face.RSDA <- int_convert_format(face.iGAP, to = "RSDA")
head(face.RSDA, 3)

2.3 Direct conversion functions

For explicit control, direct conversion functions are available:

# RSDA to MM
mushroom.MM <- RSDA_to_MM(mushroom.int, RSDA = TRUE)
head(mushroom.MM, 3)

# MM to iGAP
mushroom.iGAP <- MM_to_iGAP(mushroom.MM)
head(mushroom.iGAP, 3)

# iGAP to MM
data(face.iGAP)
face.MM <- iGAP_to_MM(face.iGAP, location = 1:6)
head(face.MM, 3)

# MM to RSDA
face.RSDA <- MM_to_RSDA(face.MM)
head(face.RSDA, 3)
class(face.RSDA)

# iGAP to RSDA (direct, one-step)
abalone.RSDA <- iGAP_to_RSDA(abalone.iGAP, location = 1:7)
head(abalone.RSDA, 3)
class(abalone.RSDA)

# RSDA to iGAP
mushroom.iGAP2 <- RSDA_to_iGAP(mushroom.int)
head(mushroom.iGAP2, 3)

The SODAS_to_MM() and SODAS_to_iGAP() functions convert SODAS XML files but require an XML file path and are not demonstrated here.

2.4 Legacy workflow: creating symbolic_tbl from raw data

The traditional workflow for converting a raw data frame into the symbolic_tbl class used by RSDA involves several steps. We illustrate with the mushroom dataset, which contains 23 species described by 3 interval-valued variables and 2 categorical variables.

data(mushroom.int.mm)
head(mushroom.int.mm, 3)

First, use set_variable_format() to create pseudo-variables for each category using one-hot encoding:

mushroom_set <- set_variable_format(
  data = mushroom.int.mm, location = 8,
  var = "Species"
)
head(mushroom_set, 3)

Next, apply RSDA_format() to prefix each variable with $I (interval) or $S (set) tags:

mushroom_tmp <- RSDA_format(
  data = mushroom_set,
  sym_type1 = c("I", "I", "I", "S"),
  location = c(25, 27, 29, 31),
  sym_type2 = c("S"),
  var = c("Species")
)
head(mushroom_tmp, 3)

Clean up variable names with clean_colnames() and write to CSV with write_symbolic_csv():

mushroom_clean <- clean_colnames(data = mushroom_tmp)
head(mushroom_clean, 3)

write_symbolic_csv(mushroom_clean, file = "mushroom_interval.csv")
mushroom_int <- read_symbolic_csv(file = "mushroom_interval.csv")
head(mushroom_int, 3)
class(mushroom_int)

Note: The MM_to_RSDA() function provides a simpler one-step alternative to this workflow.

2.5 Histogram data: the MatH class

Histogram-valued data uses the MatH class from the HistDAWass package. The built-in BLOOD dataset is a MatH object with 14 patient groups and 3 distributional variables:

BLOOD[1:3, 1:2]

Below we illustrate constructing a MatH object from raw histogram data:

A1 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B1 <- c(0.00, 0.02, 0.08, 0.32, 0.62, 0.86, 0.92, 1.00)
A2 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B2 <- c(0.00, 0.05, 0.12, 0.42, 0.68, 0.88, 0.94, 1.00)
A3 <- c(50, 60, 70, 80, 90, 100, 110, 120)
B3 <- c(0.00, 0.03, 0.24, 0.36, 0.75, 0.85, 0.98, 1.00)

ListOfWeight <- list(
  distributionH(A1, B1),
  distributionH(A2, B2),
  distributionH(A3, B3)
)

Weight <- methods::new("MatH",
  nrows = 3, ncols = 1, ListOfDist = ListOfWeight,
  names.rows = c("20s", "30s", "40s"),
  names.cols = c("weight"), by.row = FALSE
)
Weight

3 The Eight Interval Methods

Many dataSDA functions accept a method parameter that determines how interval boundaries are used in computations. The eight available methods (Wu, Kao and Chen, 2020) are:

Method	Name	Description
CM	Center Method	Uses the midpoint (center) of each interval
VM	Vertices Method	Uses both endpoints of the intervals
QM	Quantile Method	Uses a quantile-based representation
SE	Stacked Endpoints Method	Stacks the lower and upper values of an interval
FV	Fitted Values Method	Fits a linear regression model
EJD	Empirical Joint Density Method	Joint distribution of lower and upper bounds
GQ	Symbolic Covariance Method	Alternative expression of the symbolic sample variance
SPT	Total Sum of Products	Decomposition of the SPT

Quick demonstration:

data(mushroom.int)
var_name <- c("Stipe.Length", "Stipe.Thickness")
int_mean(mushroom.int, var_name, method = c("CM", "FV", "EJD"))

4 Descriptive Statistics for Interval-Valued Data

The core statistical functions int_mean, int_var, int_cov, and int_cor compute descriptive statistics for interval-valued data across any combination of the eight methods.

4.1 Mean and variance

We compute the mean and variance of Pileus.Cap.Width and Stipe.Length in the mushroom.int dataset using all eight interval methods.

data(mushroom.int)
var_name <- c("Pileus.Cap.Width", "Stipe.Length")
method <- c("CM", "VM", "QM", "SE", "FV", "EJD", "GQ", "SPT")

mean_mat <- int_mean(mushroom.int, var_name, method)
mean_mat

var_mat <- int_var(mushroom.int, var_name, method)
var_mat

The means are identical across most methods because methods other than FV operate on the same midpoint or boundary values; only FV (which regresses upper bounds on lower bounds) produces a different mean. In contrast, the variances differ substantially across methods, reflecting how each method weighs interval width and position.

cols <- c("#4E79A7", "#F28E2B")
par(mfrow = c(2, 1), mar = c(5, 4, 3, 6), las = 2, xpd = TRUE)

# --- Mean across eight methods ---
bp <- barplot(t(mean_mat),
  beside = TRUE, col = cols,
  main = "Interval Mean by Method (mushroom.int)",
  ylab = "Mean",
  ylim = c(0, max(mean_mat) * 1.25)
)
legend("topright",
  inset = c(-0.18, 0),
  legend = colnames(mean_mat), fill = cols, bty = "n", cex = 0.85
)

# --- Variance across eight methods ---
bp <- barplot(t(var_mat),
  beside = TRUE, col = cols,
  main = "Interval Variance by Method (mushroom.int)",
  ylab = "Variance",
  ylim = c(0, max(var_mat) * 1.25)
)
legend("topright",
  inset = c(-0.18, 0),
  legend = colnames(var_mat), fill = cols, bty = "n", cex = 0.85
)

4.2 Covariance and correlation

We compute the covariance and correlation between Pileus.Cap.Width and Stipe.Length across all eight methods. Note that EJD, GQ, and SPT methods require character variable names (not numeric indices).

cov_list <- int_cov(mushroom.int, "Pileus.Cap.Width", "Stipe.Length", method)
cor_list <- int_cor(mushroom.int, "Pileus.Cap.Width", "Stipe.Length", method)

# Collect scalar values into named vectors for display and plotting
cov_vec <- sapply(cov_list, function(x) x[1, 1])
cor_vec <- sapply(cor_list, function(x) x[1, 1])

data.frame(
  Method = names(cov_vec), Covariance = round(cov_vec, 4),
  Correlation = round(cor_vec, 4), row.names = NULL
)

The SE method yields the largest covariance because it doubles the effective sample by stacking both endpoints, amplifying joint variation. VM produces the lowest correlation (0.36) because the vertex expansion introduces $2^p$ combinations per observation, many of which are non-informative.

par(mfrow = c(2, 1), mar = c(5, 4, 3, 1), las = 2)

# --- Covariance across eight methods ---
bar_cols <- c(
  "#4E79A7", "#59A14F", "#F28E2B", "#E15759",
  "#76B7B2", "#EDC948", "#B07AA1", "#FF9DA7"
)
bp <- barplot(cov_vec,
  col = bar_cols, border = NA,
  main = "Cov(Pileus.Cap.Width, Stipe.Length) by Method",
  ylab = "Covariance",
  ylim = c(0, max(cov_vec) * 1.25)
)
text(bp, cov_vec, labels = round(cov_vec, 2), pos = 3, cex = 0.8)

# --- Correlation across eight methods ---
bp <- barplot(cor_vec,
  col = bar_cols, border = NA,
  main = "Cor(Pileus.Cap.Width, Stipe.Length) by Method",
  ylab = "Correlation",
  ylim = c(0, 1.15)
)
text(bp, cor_vec, labels = round(cor_vec, 2), pos = 3, cex = 0.8)
abline(h = 1, lty = 2, col = "grey50")

5 Geometric Properties

Geometric functions characterize the shape and spatial properties of individual intervals and relationships between interval variables.

5.1 Width, radius, center, and midrange

data(mushroom.int)

# Width = upper - lower
head(int_width(mushroom.int, "Stipe.Length"))

# Radius = width / 2
head(int_radius(mushroom.int, "Stipe.Length"))

# Center = (lower + upper) / 2
head(int_center(mushroom.int, "Stipe.Length"))

# Midrange
head(int_midrange(mushroom.int, "Stipe.Length"))

5.2 Overlap and containment

These functions measure the degree to which intervals from two variables overlap or contain each other, observation by observation:

# Overlap between two interval variables
head(int_overlap(mushroom.int, "Stipe.Length", "Stipe.Thickness"))

# Containment: proportion of var_name2 contained within var_name1
head(int_containment(mushroom.int, "Stipe.Length", "Stipe.Thickness"))

6 Position and Scale Measures

6.1 Median and quantiles

data(mushroom.int)

# Median (default method = "CM")
int_median(mushroom.int, "Stipe.Length")

# Quantiles
int_quantile(mushroom.int, "Stipe.Length", probs = c(0.25, 0.5, 0.75))

# Compare median across methods
int_median(mushroom.int, "Stipe.Length", method = c("CM", "FV"))

6.2 Range, IQR, MAD, and mode

# Range (max - min)
int_range(mushroom.int, "Stipe.Length")

# Interquartile range (Q3 - Q1)
int_iqr(mushroom.int, "Stipe.Length")

# Median absolute deviation
int_mad(mushroom.int, "Stipe.Length")

# Mode (histogram-based estimation)
int_mode(mushroom.int, "Stipe.Length")

7 Robust Statistics

Robust statistics reduce the influence of outliers by trimming or winsorizing extreme values.

7.1 Trimmed and winsorized means

data(mushroom.int)

# Compare standard mean vs trimmed mean (10% trim)
int_mean(mushroom.int, "Stipe.Length", method = "CM")
int_trimmed_mean(mushroom.int, "Stipe.Length", trim = 0.1, method = "CM")

# Winsorized mean: extreme values are replaced (not removed)
int_winsorized_mean(mushroom.int, "Stipe.Length", trim = 0.1, method = "CM")

7.2 Trimmed and winsorized variances

int_var(mushroom.int, "Stipe.Length", method = "CM")
int_trimmed_var(mushroom.int, "Stipe.Length", trim = 0.1, method = "CM")
int_winsorized_var(mushroom.int, "Stipe.Length", trim = 0.1, method = "CM")

8 Distribution Shape

Shape functions characterize the distribution of interval-valued data.

data(mushroom.int)

# Skewness: asymmetry of the distribution
int_skewness(mushroom.int, "Stipe.Length", method = "CM")

# Kurtosis: tail heaviness
int_kurtosis(mushroom.int, "Stipe.Length", method = "CM")

# Symmetry coefficient
int_symmetry(mushroom.int, "Stipe.Length", method = "CM")

# Tailedness (related to kurtosis)
int_tailedness(mushroom.int, "Stipe.Length", method = "CM")

9 Similarity Measures

Similarity functions quantify how alike two interval variables are across all observations. Available measures include Jaccard, Dice, cosine, and overlap coefficient.

data(mushroom.int)

int_jaccard(mushroom.int, "Stipe.Length", "Stipe.Thickness")
int_dice(mushroom.int, "Stipe.Length", "Stipe.Thickness")
int_cosine(mushroom.int, "Stipe.Length", "Stipe.Thickness")
int_overlap_coefficient(mushroom.int, "Stipe.Length", "Stipe.Thickness")

Note: int_tanimoto() is equivalent to int_jaccard() for interval-valued data:

int_tanimoto(mushroom.int, "Stipe.Length", "Stipe.Thickness")

The int_similarity_matrix() function computes a pairwise similarity matrix across all interval variables:

int_similarity_matrix(mushroom.int, method = "jaccard")

10 Uncertainty and Variability

These functions measure the uncertainty, variability, and information content of interval-valued data.

10.1 Entropy, CV, and dispersion

data(mushroom.int)

# Shannon entropy (higher = more uncertainty)
int_entropy(mushroom.int, "Stipe.Length", method = "CM")

# Coefficient of variation (SD / mean)
int_cv(mushroom.int, "Stipe.Length", method = "CM")

# Dispersion index
int_dispersion(mushroom.int, "Stipe.Length", method = "CM")

10.2 Imprecision, granularity, uniformity, and information content

# Imprecision: based on interval widths
int_imprecision(mushroom.int, "Stipe.Length")

# Granularity: variability in interval sizes
int_granularity(mushroom.int, "Stipe.Length")

# Uniformity: inverse of granularity (higher = more uniform)
int_uniformity(mushroom.int, "Stipe.Length")

# Normalized information content (between 0 and 1)
int_information_content(mushroom.int, "Stipe.Length", method = "CM")

11 Distance Measures

Distance functions compute dissimilarity between observations in interval-valued datasets. Available methods include: euclidean, hausdorff, ichino, de_carvalho, and others.

We use the interval columns of car.int for distance examples (excluding the character Car column):

data(car.int)
car_num <- car.int[, 2:5]
head(car_num, 3)

11.1 Single distance method

# Euclidean distance between observations
int_dist(car_num, method = "euclidean")

11.2 Distance matrix

# Return as a full matrix
dm <- int_dist_matrix(car_num, method = "hausdorff")
dm[1:5, 1:5]

11.3 Pairwise distance between variables

int_pairwise_dist(car_num, "Price", "Max_Velocity", method = "euclidean")

11.4 All distance methods at once

all_dists <- int_dist_all(car_num)
names(all_dists)

12 Descriptive Statistics for Histogram-Valued Data

The hist_mean, hist_var, hist_cov, and hist_cor functions compute descriptive statistics for histogram-valued data (MatH objects). All four functions support the same four methods: BG (Bertrand and Goupil, 2000), BD (Billard and Diday, 2006), B (Billard, 2008), and L2W (L2 Wasserstein).

12.1 Mean and variance

We compute the mean and variance of Cholesterol and Hemoglobin in the BLOOD dataset using all four methods.

all_methods <- c("BG", "BD", "B", "L2W")
var_names <- c("Cholesterol", "Hemoglobin")

# Compute mean for each variable and method
mean_mat <- sapply(all_methods, function(m) {
  sapply(var_names, function(v) hist_mean(BLOOD, v, method = m))
})
rownames(mean_mat) <- var_names
mean_mat

# Compute variance for each variable and method
var_mat <- sapply(all_methods, function(m) {
  sapply(var_names, function(v) hist_var(BLOOD, v, method = m))
})
rownames(var_mat) <- var_names
var_mat

The BG, BD, and B means are identical because they share the same first-order moment definition; only L2W (quantile-based) differs slightly. The variances, however, show large differences: BG is the largest because it includes within-histogram spread, while BD, B, and L2W progressively decrease.

bar_cols <- c("#4E79A7", "#59A14F", "#F28E2B", "#E15759")
par(mfrow = c(2, 2), mar = c(4, 5, 3, 1), las = 1)

# --- Mean: Cholesterol ---
bp <- barplot(mean_mat["Cholesterol", ],
  col = bar_cols, border = NA,
  main = "Mean of Cholesterol", ylab = "Mean",
  ylim = c(0, max(mean_mat["Cholesterol", ]) * 1.15)
)
text(bp, mean_mat["Cholesterol", ],
  labels = round(mean_mat["Cholesterol", ], 2), pos = 3, cex = 0.8
)

# --- Mean: Hemoglobin ---
bp <- barplot(mean_mat["Hemoglobin", ],
  col = bar_cols, border = NA,
  main = "Mean of Hemoglobin", ylab = "Mean",
  ylim = c(0, max(mean_mat["Hemoglobin", ]) * 1.15)
)
text(bp, mean_mat["Hemoglobin", ],
  labels = round(mean_mat["Hemoglobin", ], 2), pos = 3, cex = 0.8
)

# --- Variance: Cholesterol ---
bp <- barplot(var_mat["Cholesterol", ],
  col = bar_cols, border = NA,
  main = "Variance of Cholesterol", ylab = "Variance",
  ylim = c(0, max(var_mat["Cholesterol", ]) * 1.25)
)
text(bp, var_mat["Cholesterol", ],
  labels = round(var_mat["Cholesterol", ], 1), pos = 3, cex = 0.8
)

# --- Variance: Hemoglobin ---
bp <- barplot(var_mat["Hemoglobin", ],
  col = bar_cols, border = NA,
  main = "Variance of Hemoglobin", ylab = "Variance",
  ylim = c(0, max(var_mat["Hemoglobin", ]) * 1.25)
)
text(bp, var_mat["Hemoglobin", ],
  labels = round(var_mat["Hemoglobin", ], 4), pos = 3, cex = 0.8
)

12.2 Covariance and correlation

We compute the covariance and correlation between Cholesterol and Hemoglobin using all four methods.

cov_vec <- sapply(all_methods, function(m) {
  hist_cov(BLOOD, "Cholesterol", "Hemoglobin", method = m)
})
cor_vec <- sapply(all_methods, function(m) {
  hist_cor(BLOOD, "Cholesterol", "Hemoglobin", method = m)
})

data.frame(
  Method = all_methods,
  Covariance = round(cov_vec, 4),
  Correlation = round(cor_vec, 4),
  row.names = NULL
)

All four methods yield a negative association between Cholesterol and Hemoglobin. Following Irpino and Verde (2015, Eqs. 30–32), the BG, BD, and B correlations all use the Bertrand-Goupil standard deviation in the denominator, so their values are similar (around -0.20 to -0.22). Only L2W uses its own Wasserstein-based variance, which produces a different correlation.

par(mfrow = c(1, 2), mar = c(4, 5, 3, 1), las = 1)

# --- Covariance ---
bp <- barplot(cov_vec,
  col = bar_cols, border = NA,
  main = "Cov(Cholesterol, Hemoglobin)",
  ylab = "Covariance",
  ylim = c(min(cov_vec) * 1.35, 0)
)
text(bp, cov_vec, labels = round(cov_vec, 2), pos = 1, cex = 0.8)

# --- Correlation ---
bp <- barplot(cor_vec,
  col = bar_cols, border = NA,
  main = "Cor(Cholesterol, Hemoglobin)",
  ylab = "Correlation",
  ylim = c(min(cor_vec) * 1.4, 0)
)
text(bp, cor_vec, labels = round(cor_vec, 2), pos = 1, cex = 0.8)
abline(h = -1, lty = 2, col = "grey50")

13 Application: Benchmarking Symbolic Data Methods

This section demonstrates how dataSDA datasets can be used for benchmarking symbolic data analysis methods across four analytical tasks: clustering (interval and histogram), classification, and regression. Five representative datasets are selected for each task, with no overlap among the interval-data tasks.

13.1 Exploratory Data Analysis and Visualization

13.1.1 Aggregating classical data to symbolic data

The aggregate_to_symbolic() function converts a classical data frame into interval-valued or histogram-valued symbolic data via grouping (clustering, resampling, or a categorical variable). Here we stratify by Species and form 10 k-means clusters within each, giving up to 30 interval-valued concepts. The zero_width = "remove" option drops any concept that collapses to a zero-width interval (min == max, e.g. a singleton cluster), since such degenerate intervals break width-based plots such as the index image below.

set.seed(42)
iris_int <- aggregate_to_symbolic(
  iris,
  type = "int",
  group_by = "kmeans",
  stratify_var = "Species",
  K = 10,
  zero_width = "remove"
)
iris_int

13.1.2 Visualization with ggInterval

The ggInterval package provides specialized plots for symbolic data including index image plots, PCA biplots, and radar plots. The following examples require ggInterval to be installed. Note: with around 30 observations the index image and PCA plots may take several minutes to render.

Index image plot – a heatmap of all interval variables:

library(ggInterval)
library(ggplot2)
ggInterval_indexImage(iris_int[, 2:5], plotAll = TRUE, full_strip = FALSE, 
                      column_condition = FALSE) + 
  scale_colour_distiller(palette = "Spectral")

PCA biplot – principal component analysis for interval data:

ggInterval_PCA(iris_int[, 2:5])

Radar plot – multivariate comparison of interval columns from environment.mix (observations 4 and 6):

data(environment.mix)
ggInterval_radarplot(environment.mix,
  plotPartial = c(4, 6),
  showLegend = FALSE, addText = FALSE
)

13.1.3 Interval time series visualization

We plot the first 12 months of the irish_wind.its dataset, showing each station’s wind speed interval as a bar with midpoint lines.

library(ggplot2)
data(irish_wind.its)
wind_sub <- irish_wind.its[1:12, ]

# Reshape to long format
stations <- c("BIR", "DUB", "KIL", "SHA", "VAL")
wind_long <- do.call(rbind, lapply(stations, function(st) {
  data.frame(
    month_num = seq_len(12),
    Station   = st,
    lower     = wind_sub[[paste0(st, "_l")]],
    upper     = wind_sub[[paste0(st, "_u")]],
    mid       = (wind_sub[[paste0(st, "_l")]] + wind_sub[[paste0(st, "_u")]]) / 2
  )
}))
wind_long$Station <- factor(wind_long$Station, levels = stations)

# Dodge bars for each station within each month
n_st <- length(stations)
bar_w <- 0.6 / n_st
wind_long$st_idx <- as.numeric(wind_long$Station)
wind_long$x <- wind_long$month_num +
  (wind_long$st_idx - (n_st + 1) / 2) * bar_w

ggplot(wind_long) +
  geom_rect(
    aes(
      xmin = x - bar_w / 2, xmax = x + bar_w / 2,
      ymin = lower, ymax = upper, fill = Station
    ),
    alpha = 0.4, color = NA
  ) +
  geom_line(aes(x = x, y = mid, color = Station, group = Station),
    linewidth = 0.5
  ) +
  geom_point(aes(x = x, y = mid, color = Station), size = 1) +
  scale_x_continuous(breaks = 1:12, labels = month.abb) +
  labs(
    title = "Irish Wind Speed Intervals (1961)",
    x = "Month", y = "Wind Speed (knots)"
  ) +
  theme_grey(base_size = 12)

13.2 Clustering for Interval-Valued Data

We benchmark three clustering algorithms on five interval-valued datasets using the quality index $1 - \text{WSS}/\text{TSS}$:

RSDA::sym.kmeans() – K-means for symbolic data
symbolicDA::DClust() – Distance-based symbolic clustering
symbolicDA::SClust() – Symbolic clustering

Each method independently determines its own optimal number of clusters $k$ via an n-adaptive elbow method. For each method, we sweep $k$ from 2 to $k_{\max} = \min(n-1,\, 10,\, \max(3,\, \lfloor n/5 \rfloor))$ and compute the quality index at each $k$. The elbow is detected using an absolute gain threshold $\tau = \Delta_{\max} / (1 + n/100)$, where $\Delta_{\max}$ is the largest quality gain across all $k$. A 2-step lookahead skips temporary dips. This yields a higher threshold (fewer clusters) for small datasets and a lower threshold (more clusters allowed) for large datasets.

library(symbolicDA)

set.seed(123)

datasets_clust_int <- list(
  list(name = "face.iGAP", data = "face.iGAP"),
  list(name = "prostate.int", data = "prostate.int"),
  list(name = "nycflights.int", data = "nycflights.int"),
  list(name = "china_temp.int", data = "china_temp.int"),
  list(name = "lisbon_air_quality.int", data = "lisbon_air_quality.int")
)

clust_int_results <- do.call(rbind, lapply(datasets_clust_int, function(ds) {
  tryCatch(
    {
      data(list = ds$data)
      x <- get(ds$data)
      if (!inherits(x, "symbolic_tbl")) {
        x <- tryCatch(int_convert_format(x, to = "RSDA"), error = function(e) x)
        for (i in seq_along(x)) {
          if (is.complex(x[[i]]) && !inherits(x[[i]], "symbolic_interval")) {
            class(x[[i]]) <- c("symbolic_interval", "vctrs_vctr")
          }
        }
        if (!inherits(x, "symbolic_tbl")) {
          class(x) <- c("symbolic_tbl", class(x))
        }
      }
      x_int <- .get_interval_cols(x)
      n <- nrow(x_int)
      p <- ncol(x_int)
      k_max <- min(n - 1, 10, max(3, floor(n / 5)))

      d <- int_dist_matrix(x_int, method = "hausdorff")
      so <- simple2SO(.to_3d_array(x_int))

      km_qs <- dc_qs <- sc_qs <- setNames(
        rep(NA_real_, k_max - 1),
        as.character(2:k_max)
      )
      for (k in 2:k_max) {
        set.seed(123)
        km_qs[as.character(k)] <- tryCatch(
          {
            res <- sym.kmeans(x_int, k = k)
            1 - res$tot.withinss / res$totss
          },
          error = function(e) NA
        )

        set.seed(123)
        dc_qs[as.character(k)] <- tryCatch(
          {
            cl <- DClust(d, cl = k, iter = 100)
            .clust_quality(d, cl)
          },
          error = function(e) NA
        )

        set.seed(123)
        sc_qs[as.character(k)] <- tryCatch(
          {
            cl <- SClust(so, cl = k, iter = 100)
            .clust_quality(d, cl)
          },
          error = function(e) NA
        )
      }

      km_k <- .find_optimal_k(km_qs, n)
      km_q <- km_qs[as.character(km_k)]
      dc_k <- .find_optimal_k(dc_qs, n)
      dc_q <- dc_qs[as.character(dc_k)]
      sc_k <- .find_optimal_k(sc_qs, n)
      sc_q <- sc_qs[as.character(sc_k)]

      data.frame(
        Dataset = ds$name, n = n, p = p,
        sym.kmeans = sprintf("%.4f (%d)", km_q, km_k),
        DClust = sprintf("%.4f (%d)", dc_q, dc_k),
        SClust = sprintf("%.4f (%d)", sc_q, sc_k)
      )
    },
    error = function(e) NULL
  )
}))

kable(clust_int_results,
  row.names = FALSE,
  caption = "Table 4: Interval clustering quality (1 - WSS/TSS) with optimal k in parentheses"
)

13.3 Clustering for Histogram-Valued Data

We benchmark three clustering algorithms on five histogram-valued datasets from dataSDA. Each dataset is converted from dataSDA’s histogram string format to HistDAWass::MatH objects for analysis:

WH_kmeans() – K-means for histogram data
WH_fcmeans() – Fuzzy C-means for histogram data
WH_hclust() – Hierarchical clustering with Wasserstein distance

The same n-adaptive elbow method from Section 4.2 is used for each method to independently select its optimal $k$.

set.seed(123)

datasets_clust_hist <- list(
  list(name = "age_pyramids.hist"),
  list(name = "ozone.hist"),
  list(name = "china_climate_season.hist"),
  list(name = "french_agriculture.hist"),
  list(name = "flights_detail.hist")
)

clust_hist_results <- do.call(rbind, lapply(datasets_clust_hist, function(ds) {
  tryCatch(
    {
      data(list = ds$name, package = "dataSDA")
      raw <- get(ds$name)
      x <- .dataSDA_hist_to_MatH(raw)
      n <- nrow(x@M)
      p <- ncol(x@M)
      k_max <- min(n - 1, 10, max(3, floor(n / 5)))

      # Precompute Wasserstein distance matrix and hclust tree (shared across k)
      dm <- WH_MAT_DIST(x)
      set.seed(123)
      hc <- WH_hclust(x, simplify = TRUE)

      km_qs <- fc_qs <- hc_qs <- setNames(
        rep(NA_real_, k_max - 1),
        as.character(2:k_max)
      )
      for (k in 2:k_max) {
        set.seed(123)
        km_qs[as.character(k)] <- tryCatch(
          {
            res <- WH_kmeans(x, k = k)
            res$quality
          },
          error = function(e) NA
        )

        set.seed(123)
        fc_qs[as.character(k)] <- tryCatch(
          {
            res <- WH_fcmeans(x, k = k)
            res$quality
          },
          error = function(e) NA
        )

        set.seed(123)
        hc_qs[as.character(k)] <- tryCatch(
          {
            cl <- cutree(hc, k = k)
            .clust_quality(dm, cl)
          },
          error = function(e) NA
        )
      }

      km_k <- .find_optimal_k(km_qs, n)
      km_q <- km_qs[as.character(km_k)]
      fc_k <- .find_optimal_k(fc_qs, n)
      fc_q <- fc_qs[as.character(fc_k)]
      hc_k <- .find_optimal_k(hc_qs, n)
      hc_q <- hc_qs[as.character(hc_k)]

      data.frame(
        Dataset = ds$name, n = n, p = p,
        WH_kmeans = sprintf("%.4f (%d)", km_q, km_k),
        WH_fcmeans = sprintf("%.4f (%d)", fc_q, fc_k),
        WH_hclust = sprintf("%.4f (%d)", hc_q, hc_k)
      )
    },
    error = function(e) NULL
  )
}))

kable(clust_hist_results,
  row.names = FALSE,
  caption = "Table 5: Histogram clustering quality (1 - WSS/TSS) with optimal k in parentheses"
)

13.4 Classification for Interval-Valued Data

We benchmark three classifiers on five interval-valued datasets and report resubstitution accuracy:

MAINT.Data::lda() – Linear discriminant analysis for interval data
MAINT.Data::qda() – Quadratic discriminant analysis for interval data
e1071::svm() – Support vector machine on lower/upper bound features

library(MAINT.Data)
library(e1071)

datasets_class <- list(
  list(
    name = "cars.int", data = "cars.int",
    class_col = "class",
    class_desc = "class: Utilitarian(7), Berlina(8), Sportive(8), Luxury(4)"
  ),
  list(
    name = "china_temp.int", data = "china_temp.int",
    class_col = "GeoReg",
    class_desc = "GeoReg: 6 regions"
  ),
  list(
    name = "mushroom.int", data = "mushroom.int",
    class_col = "Edibility",
    class_desc = "Edibility: T(4), U(2), Y(17)"
  ),
  list(
    name = "ohtemp.int", data = "ohtemp.int",
    class_col = "STATE",
    class_desc = "STATE: 10 groups"
  ),
  list(
    name = "wine.int", data = "wine.int",
    class_col = "class",
    class_desc = "class: 1(21), 2(12)"
  )
)

class_results <- do.call(rbind, lapply(datasets_class, function(ds) {
  tryCatch(
    {
      data(list = ds$data)
      x <- get(ds$data)
      grp <- .get_class_labels(x, ds$class_col)

      idata <- .build_IData(x)

      int_cols <- sapply(x, function(col) inherits(col, "symbolic_interval"))
      svm_df <- data.frame(row.names = seq_len(nrow(x)))
      for (v in names(x)[int_cols]) {
        cv <- unclass(x[[v]])
        svm_df[[paste0(v, "_l")]] <- Re(cv)
        svm_df[[paste0(v, "_u")]] <- Im(cv)
      }

      set.seed(123)
      lda_acc <- tryCatch(
        {
          res <- MAINT.Data::lda(idata, grouping = grp)
          pred <- predict(res, idata)
          mean(pred$class == grp)
        },
        error = function(e) NA
      )

      set.seed(123)
      qda_acc <- tryCatch(
        {
          res <- MAINT.Data::qda(idata, grouping = grp)
          pred <- predict(res, idata)
          mean(pred$class == grp)
        },
        error = function(e) NA
      )

      set.seed(123)
      svm_acc <- tryCatch(
        {
          svm_df$class <- grp
          res <- svm(class ~ ., data = svm_df, kernel = "radial")
          pred <- predict(res, svm_df)
          mean(pred == grp)
        },
        error = function(e) NA
      )

      data.frame(
        Dataset = ds$name, Response = ds$class_desc,
        LDA = lda_acc, QDA = qda_acc, SVM = svm_acc
      )
    },
    error = function(e) NULL
  )
}))

kable(class_results,
  digits = 4, row.names = FALSE,
  caption = "Table 6: Classification accuracy (resubstitution)"
)

13.5 Regression for Interval-Valued Data

We benchmark five regression methods on five interval-valued datasets and report $R^2$:

RSDA::sym.lm() – Symbolic linear regression (center method)
RSDA::sym.glm() – LASSO regression via glmnet (center method)
RSDA::sym.rf() – Symbolic random forest
RSDA::sym.rt() – Symbolic regression tree
RSDA::sym.nnet() – Symbolic neural network

datasets_reg <- list(
  list(
    name = "abalone.iGAP", data = "abalone.iGAP",
    response = "Length", n_x = 6
  ),
  list(
    name = "cardiological.int", data = "cardiological.int",
    response = "pulse", n_x = 4
  ),
  list(
    name = "nycflights.int", data = "nycflights.int",
    response = "distance", n_x = 3
  ),
  list(
    name = "oils.int", data = "oils.int",
    response = "specific_gravity", n_x = 3
  ),
  list(
    name = "prostate.int", data = "prostate.int",
    response = "lpsa", n_x = 8
  )
)

reg_results <- do.call(rbind, lapply(datasets_reg, function(ds) {
  tryCatch(
    {
      data(list = ds$data)
      x <- get(ds$data)

      if (!inherits(x, "symbolic_tbl")) {
        x2 <- tryCatch(int_convert_format(x, to = "RSDA"), error = function(e) NULL)
        if (!is.null(x2)) {
          x <- x2
          for (i in seq_along(x)) {
            if (is.complex(x[[i]]) && !inherits(x[[i]], "symbolic_interval")) {
              class(x[[i]]) <- c("symbolic_interval", "vctrs_vctr")
            }
          }
          if (!inherits(x, "symbolic_tbl")) {
            class(x) <- c("symbolic_tbl", class(x))
          }
        } else {
          cn <- colnames(x)
          l_cols <- grep("_l$", cn, value = TRUE)
          vars <- sub("_l$", "", l_cols)
          out <- data.frame(row.names = seq_len(nrow(x)))
          for (v in vars) {
            lv <- x[[paste0(v, "_l")]]
            uv <- x[[paste0(v, "_u")]]
            si <- complex(real = lv, imaginary = uv)
            class(si) <- c("symbolic_interval", "vctrs_vctr")
            out[[v]] <- si
          }
          class(out) <- c("symbolic_tbl", class(out))
          x <- out
        }
      }
      x_int <- .get_interval_cols(x)

      fml <- as.formula(paste(ds$response, "~ ."))
      nc <- data.frame(row.names = seq_len(nrow(x_int)))
      for (v in names(x_int)) {
        cv <- unclass(x_int[[v]])
        nc[[v]] <- (Re(cv) + Im(cv)) / 2
      }
      actual <- nc[[ds$response]]
      resp_idx <- which(names(x_int) == ds$response)
      .r2 <- function(a, p) 1 - sum((a - p)^2) / sum((a - mean(a))^2)

      set.seed(123)
      lm_r2 <- tryCatch(
        {
          res <- sym.lm(fml, sym.data = x_int, method = "cm")
          summary(res)$r.squared
        },
        error = function(e) NA
      )

      set.seed(123)
      glm_r2 <- tryCatch(
        {
          res <- sym.glm(sym.data = x_int, response = resp_idx, method = "cm")
          pred <- as.numeric(predict(res,
            newx = as.matrix(nc[, -resp_idx]),
            s = "lambda.min"
          ))
          .r2(actual, pred)
        },
        error = function(e) NA
      )

      set.seed(123)
      rf_r2 <- tryCatch(
        {
          res <- sym.rf(fml, sym.data = x_int, method = "cm")
          tail(res$rsq, 1)
        },
        error = function(e) NA
      )

      set.seed(123)
      rt_r2 <- tryCatch(
        {
          res <- sym.rt(fml, sym.data = x_int, method = "cm")
          .r2(actual, predict(res))
        },
        error = function(e) NA
      )

      set.seed(123)
      nnet_r2 <- tryCatch(
        {
          res <- sym.nnet(fml, sym.data = x_int, method = "cm")
          pred_sc <- as.numeric(res$net.result[[1]])
          pred <- pred_sc * res$data_c_sds[resp_idx] + res$data_c_means[resp_idx]
          .r2(actual, pred)
        },
        error = function(e) NA
      )

      data.frame(
        Dataset = ds$name, Response = ds$response, p = ds$n_x,
        sym.lm = lm_r2, sym.glm = glm_r2, sym.rf = rf_r2,
        sym.rt = rt_r2, sym.nnet = nnet_r2
      )
    },
    error = function(e) NULL
  )
}))

kable(reg_results,
  digits = 4, row.names = FALSE,
  caption = "Table 7: Regression R-squared"
)

Field	Description	Example
Dataset Name	A clear, descriptive title.	“face recognition data”
Dataset Short Name	A clear,abbreviation title.	“face data”
Authors	Full names of donator.	“First name, Last name”
E-mail	Contact email.	“abc123@gmail.com”
Institutes	Affiliated organizations.	“-”
Country	Origin of the dataset.	“France”
Dataset Descriptions	Data descriptive	See ‘README’
Sample Size	Number of instances/rows.	27
Number of Variables	Total features/columns (categorical/numeric).	6 (interval)
Missing Values	Indicate if missing values exist and how they are handled.	“None” / “Yes, marked as NA”
Variable Descriptions	Detailed description of each column (name, type, units, range).	See ‘README’
Source	Original data source (if applicable).	“Leroy et al. (1996)”
References	Citations for prior work using the dataset.	“Douzal-Chouakria, Billard, and Diday (2011)”
Applied Areas	Relevant fields (e.g., biology, finance).	“Machine Learning”
Usage Constraints	Licensing (CC-BY, MIT) or restrictions.	“Public domain”
Data Link	URL to download the dataset (Google Drive, GitHub, etc.).	“(https)”

Introduction to dataSDA

Supplementary material for dataSDA: datasets and basic statistics for symbolic data analysis in R

Po-Wei Chen, Chun-houh Chen and Han-Ming Wu*

June 11, 2026