1 Why splitGraph exists

Leakage in biomedical evaluation workflows often comes from dataset structure rather than from an obvious coding mistake. Two samples may look independent in a model matrix while still sharing the same subject, batch, study, timepoint, or feature provenance. If those relationships stay implicit, train/test separation can look correct while violating the scientific separation you actually intended.

splitGraph exists to make those relationships explicit before evaluation. It turns metadata into a typed dependency graph that can be:

  • validated for structural and leakage-relevant problems
  • queried to inspect hidden overlap and provenance
  • converted into deterministic split constraints
  • translated into a stable, tool-agnostic split specification through the split_spec class and as_split_spec() / validate_split_spec() API

The package is intentionally narrow. It does not fit models, run preprocessing pipelines, or generate resamples by itself. Its job is to represent dependency structure clearly enough that downstream evaluation can be trustworthy.

2 A realistic toy dataset

The example below includes exactly the kinds of relationships that usually matter for leakage-aware evaluation:

  • repeated subjects (P1 and P2)
  • reused batch (B1)
  • one subject (P2) appearing across studies
  • explicit time ordering with one sample missing time metadata
  • a feature set derived at the full-dataset scope

meta <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id = c("P1", "P1", "P2", "P3", "P4", "P2"),
  batch_id = c("B1", "B2", "B1", "B3", NA, "B1"),
  study_id = c("ST1", "ST1", "ST1", "ST2", "ST3", "ST2"),
  timepoint_id = c("T0", "T1", "T0", "T2", NA, "T1"),
  assay_id = c("RNAseq", "RNAseq", "RNAseq", "RNAseq", "Proteomics", "RNAseq"),
  featureset_id = c("FS_GLOBAL", "FS_GLOBAL", "FS_GLOBAL", "FS_GLOBAL", "FS_PROT", "FS_GLOBAL"),
  outcome_id = c("O_case", "O_case", "O_ctrl", "O_case", "O_ctrl", "O_ctrl"),
  stringsAsFactors = FALSE
)

meta
#>   sample_id subject_id batch_id study_id timepoint_id   assay_id featureset_id
#> 1        S1         P1       B1      ST1           T0     RNAseq     FS_GLOBAL
#> 2        S2         P1       B2      ST1           T1     RNAseq     FS_GLOBAL
#> 3        S3         P2       B1      ST1           T0     RNAseq     FS_GLOBAL
#> 4        S4         P3       B3      ST2           T2     RNAseq     FS_GLOBAL
#> 5        S5         P4     <NA>      ST3         <NA> Proteomics       FS_PROT
#> 6        S6         P2       B1      ST2           T1     RNAseq     FS_GLOBAL
#>   outcome_id
#> 1     O_case
#> 2     O_case
#> 3     O_ctrl
#> 4     O_case
#> 5     O_ctrl
#> 6     O_ctrl

This is still a small example, but it already contains enough structure to make naive random splitting risky.

3 Fast path: graph_from_metadata()

When your metadata already uses the canonical column names (sample_id, subject_id, batch_id, study_id, timepoint_id, time_index, assay_id, featureset_id, outcome_id / outcome_value), graph_from_metadata() does ingestion, typed node construction, canonical edge construction, and optional timepoint_precedes derivation in a single call:

quick_graph <- graph_from_metadata(
  data.frame(
    sample_id    = c("S1", "S2", "S3", "S4", "S5", "S6"),
    subject_id   = c("P1", "P1", "P2", "P2", "P3", "P3"),
    batch_id     = c("B1", "B2", "B1", "B2", "B1", "B2"),
    timepoint_id = c("T0", "T1", "T0", "T1", "T0", "T1"),
    time_index   = c(0, 1, 0, 1, 0, 1),
    outcome_value = c(0, 1, 0, 1, 1, 0)
  ),
  graph_name = "quick_demo"
)

quick_graph
#> <dependency_graph> quick_demo 
#>   Nodes: 15 
#>   Edges: 25

The rest of this vignette uses the explicit constructor path because it lets us show node attributes (time_index, visit_label, platform, derivation_scope) and non-canonical edges (featureset_generated_from_*, subject_has_outcome) that graph_from_metadata() does not build for you. Use graph_from_metadata() when the canonical columns are enough; use the explicit path when you need custom attributes or extra relations.

4 Ingest metadata and build typed nodes and edges

The first step is to standardize metadata and then turn each entity type into canonical graph nodes. Sample-level relations become typed edges.

meta <- ingest_metadata(meta, dataset_name = "VignetteDemo")

sample_nodes <- create_nodes(meta, type = "Sample", id_col = "sample_id")
subject_nodes <- create_nodes(meta, type = "Subject", id_col = "subject_id")
batch_nodes <- create_nodes(meta, type = "Batch", id_col = "batch_id")
study_nodes <- create_nodes(meta, type = "Study", id_col = "study_id")

time_nodes <- create_nodes(
  data.frame(
    timepoint_id = c("T0", "T1", "T2"),
    time_index = c(0L, 1L, 2L),
    visit_label = c("baseline", "follow_up", "late_follow_up"),
    stringsAsFactors = FALSE
  ),
  type = "Timepoint",
  id_col = "timepoint_id",
  attr_cols = c("time_index", "visit_label")
)

assay_nodes <- create_nodes(
  data.frame(
    assay_id = c("RNAseq", "Proteomics"),
    modality = c("transcriptomics", "proteomics"),
    platform = c("NovaSeq", "Orbitrap"),
    stringsAsFactors = FALSE
  ),
  type = "Assay",
  id_col = "assay_id",
  attr_cols = c("modality", "platform")
)

featureset_nodes <- create_nodes(
  data.frame(
    featureset_id = c("FS_GLOBAL", "FS_PROT"),
    featureset_name = c("global_rna_signature", "proteomics_panel"),
    derivation_scope = c("per_dataset", "external"),
    feature_count = c(500L, 80L),
    stringsAsFactors = FALSE
  ),
  type = "FeatureSet",
  id_col = "featureset_id",
  attr_cols = c("featureset_name", "derivation_scope", "feature_count")
)

outcome_nodes <- create_nodes(
  data.frame(
    outcome_id = c("O_case", "O_ctrl"),
    outcome_name = c("response", "response"),
    outcome_type = c("binary", "binary"),
    observation_level = c("subject", "subject"),
    stringsAsFactors = FALSE
  ),
  type = "Outcome",
  id_col = "outcome_id",
  attr_cols = c("outcome_name", "outcome_type", "observation_level")
)

subject_edges <- create_edges(
  meta, "sample_id", "subject_id",
  "Sample", "Subject", "sample_belongs_to_subject"
)

batch_edges <- create_edges(
  meta, "sample_id", "batch_id",
  "Sample", "Batch", "sample_processed_in_batch",
  allow_missing = TRUE
)

study_edges <- create_edges(
  meta, "sample_id", "study_id",
  "Sample", "Study", "sample_from_study"
)

time_edges <- create_edges(
  meta, "sample_id", "timepoint_id",
  "Sample", "Timepoint", "sample_collected_at_timepoint",
  allow_missing = TRUE
)

assay_edges <- create_edges(
  meta, "sample_id", "assay_id",
  "Sample", "Assay", "sample_measured_by_assay"
)

featureset_edges <- create_edges(
  meta, "sample_id", "featureset_id",
  "Sample", "FeatureSet", "sample_uses_featureset"
)

outcome_edges <- create_edges(
  data.frame(
    subject_id = c("P1", "P2", "P3", "P4"),
    outcome_id = c("O_case", "O_ctrl", "O_case", "O_ctrl"),
    stringsAsFactors = FALSE
  ),
  "subject_id", "outcome_id",
  "Subject", "Outcome", "subject_has_outcome"
)

precedence_edges <- create_edges(
  data.frame(
    from_timepoint = c("T0", "T1"),
    to_timepoint = c("T1", "T2"),
    stringsAsFactors = FALSE
  ),
  "from_timepoint", "to_timepoint",
  "Timepoint", "Timepoint", "timepoint_precedes"
)

featureset_from_study <- create_edges(
  data.frame(
    featureset_id = "FS_GLOBAL",
    study_id = "ST1",
    stringsAsFactors = FALSE
  ),
  "featureset_id", "study_id",
  "FeatureSet", "Study", "featureset_generated_from_study"
)

featureset_from_batch <- create_edges(
  data.frame(
    featureset_id = "FS_GLOBAL",
    batch_id = "B1",
    stringsAsFactors = FALSE
  ),
  "featureset_id", "batch_id",
  "FeatureSet", "Batch", "featureset_generated_from_batch"
)

The node and edge tables are canonical and typed. The package assigns globally unique node IDs such as sample:S1 and subject:P1, so different entity types cannot collide accidentally.

sample_nodes
#> <graph_node_set> 6 nodes across 1 types (schema 0.1.0 )
as.data.frame(sample_nodes)[, c("node_id", "node_type", "node_key", "label")]
#>     node_id node_type node_key label
#> 1 sample:S1    Sample       S1    S1
#> 2 sample:S2    Sample       S2    S2
#> 3 sample:S3    Sample       S3    S3
#> 4 sample:S4    Sample       S4    S4
#> 5 sample:S5    Sample       S5    S5
#> 6 sample:S6    Sample       S6    S6

edge_preview <- do.call(rbind, lapply(
  list(
    subject_edges, batch_edges, study_edges, time_edges,
    assay_edges, featureset_edges, outcome_edges,
    precedence_edges, featureset_from_study, featureset_from_batch
  ),
  as.data.frame
))

edge_preview[, c("from", "to", "edge_type")]
#>                    from                   to                       edge_type
#> 1             sample:S1           subject:P1       sample_belongs_to_subject
#> 2             sample:S2           subject:P1       sample_belongs_to_subject
#> 3             sample:S3           subject:P2       sample_belongs_to_subject
#> 4             sample:S4           subject:P3       sample_belongs_to_subject
#> 5             sample:S5           subject:P4       sample_belongs_to_subject
#> 6             sample:S6           subject:P2       sample_belongs_to_subject
#> 7             sample:S1             batch:B1       sample_processed_in_batch
#> 8             sample:S2             batch:B2       sample_processed_in_batch
#> 9             sample:S3             batch:B1       sample_processed_in_batch
#> 10            sample:S4             batch:B3       sample_processed_in_batch
#> 11            sample:S6             batch:B1       sample_processed_in_batch
#> 12            sample:S1            study:ST1               sample_from_study
#> 13            sample:S2            study:ST1               sample_from_study
#> 14            sample:S3            study:ST1               sample_from_study
#> 15            sample:S4            study:ST2               sample_from_study
#> 16            sample:S5            study:ST3               sample_from_study
#> 17            sample:S6            study:ST2               sample_from_study
#> 18            sample:S1         timepoint:T0   sample_collected_at_timepoint
#> 19            sample:S2         timepoint:T1   sample_collected_at_timepoint
#> 20            sample:S3         timepoint:T0   sample_collected_at_timepoint
#> 21            sample:S4         timepoint:T2   sample_collected_at_timepoint
#> 22            sample:S6         timepoint:T1   sample_collected_at_timepoint
#> 23            sample:S1         assay:RNAseq        sample_measured_by_assay
#> 24            sample:S2         assay:RNAseq        sample_measured_by_assay
#> 25            sample:S3         assay:RNAseq        sample_measured_by_assay
#> 26            sample:S4         assay:RNAseq        sample_measured_by_assay
#> 27            sample:S5     assay:Proteomics        sample_measured_by_assay
#> 28            sample:S6         assay:RNAseq        sample_measured_by_assay
#> 29            sample:S1 featureset:FS_GLOBAL          sample_uses_featureset
#> 30            sample:S2 featureset:FS_GLOBAL          sample_uses_featureset
#> 31            sample:S3 featureset:FS_GLOBAL          sample_uses_featureset
#> 32            sample:S4 featureset:FS_GLOBAL          sample_uses_featureset
#> 33            sample:S5   featureset:FS_PROT          sample_uses_featureset
#> 34            sample:S6 featureset:FS_GLOBAL          sample_uses_featureset
#> 35           subject:P1       outcome:O_case             subject_has_outcome
#> 36           subject:P2       outcome:O_ctrl             subject_has_outcome
#> 37           subject:P3       outcome:O_case             subject_has_outcome
#> 38           subject:P4       outcome:O_ctrl             subject_has_outcome
#> 39         timepoint:T0         timepoint:T1              timepoint_precedes
#> 40         timepoint:T1         timepoint:T2              timepoint_precedes
#> 41 featureset:FS_GLOBAL            study:ST1 featureset_generated_from_study
#> 42 featureset:FS_GLOBAL             batch:B1 featureset_generated_from_batch

The node table shows the canonical sample IDs that everything else refers to. The edge table shows the package’s central design choice: dependency structure is explicit, typed, and inspectable.

5 Assemble the dependency graph

graph <- build_dependency_graph(
  nodes = list(
    sample_nodes, subject_nodes, batch_nodes, study_nodes,
    time_nodes, assay_nodes, featureset_nodes, outcome_nodes
  ),
  edges = list(
    subject_edges, batch_edges, study_edges, time_edges,
    assay_edges, featureset_edges, outcome_edges,
    precedence_edges, featureset_from_study, featureset_from_batch
  ),
  graph_name = "vignette_graph",
  dataset_name = attr(meta, "dataset_name")
)

graph
#> <dependency_graph> vignette_graph 
#>   Nodes: 25 
#>   Edges: 42
summary(graph)
#> $graph_name
#> [1] "vignette_graph"
#> 
#> $dataset_name
#> [1] "VignetteDemo"
#> 
#> $schema_version
#> [1] "0.1.0"
#> 
#> $n_nodes
#> [1] 25
#> 
#> $n_edges
#> [1] 42
#> 
#> $node_types
#>        value n
#> 1     Sample 6
#> 2    Subject 4
#> 3      Batch 3
#> 4      Study 3
#> 5  Timepoint 3
#> 6      Assay 2
#> 7 FeatureSet 2
#> 8    Outcome 2
#> 
#> $edge_types
#>                              value n
#> 1        sample_belongs_to_subject 6
#> 2                sample_from_study 6
#> 3         sample_measured_by_assay 6
#> 4           sample_uses_featureset 6
#> 5    sample_collected_at_timepoint 5
#> 6        sample_processed_in_batch 5
#> 7              subject_has_outcome 4
#> 8               timepoint_precedes 2
#> 9  featureset_generated_from_batch 1
#> 10 featureset_generated_from_study 1

At this point the package has a single dependency_graph object with both tabular and igraph representations behind it. The summary is useful because it tells you exactly which entity types and relation types are present before you derive any split rules.

5.1 Visualize the typed structure

plot() renders the graph with a typed, layered layout: Sample on top, peer dependencies (Subject, Batch, Study, Timepoint) in the middle band, Assay/FeatureSet below that, and Outcome at the bottom. Node colors are keyed to type and an auto-generated legend is drawn by default.

plot(graph)

Useful options:

plot(graph, layout = "sugiyama")         # alternative hierarchical layout
plot(graph, show_labels = FALSE)         # hide labels on dense graphs
plot(graph, legend = FALSE)              # suppress the legend
plot(graph, legend_position = "bottomright")
plot(graph, node_colors = c(Sample = "#000000"))

6 Validate before you split

Validation is where splitGraph starts paying off. The graph below is structurally valid, but it still carries leakage-relevant warnings and advisories.

validation <- validate_graph(graph)

validation
#> <depgraph_validation_report>  vignette_graph
#>   Valid: TRUE 
#>   Issues: 6 
#>   By severity:
#>    - advisory : 5 
#>    - warning : 1
as.data.frame(validation)[, c("level", "severity", "code", "message")]
#>     level severity                         code
#> 1 leakage advisory     repeated_subject_samples
#> 2 leakage advisory     repeated_subject_samples
#> 3 leakage  warning  subject_cross_study_overlap
#> 4 leakage advisory       per_dataset_featureset
#> 5 leakage advisory shared_featureset_provenance
#> 6 leakage advisory            heavy_batch_reuse
#>                                                                    message
#> 1                      Subject `subject:P1` is linked to multiple samples.
#> 2                      Subject `subject:P2` is linked to multiple samples.
#> 3                    Subject `subject:P2` appears across multiple studies.
#> 4 FeatureSet `featureset:FS_GLOBAL` was derived at the full-dataset scope.
#> 5     FeatureSet `featureset:FS_GLOBAL` is shared across multiple samples.
#> 6                          Batch `batch:B1` is reused across many samples.

That output is the core value proposition of the package in one place:

  • repeated subjects are surfaced explicitly
  • cross-study subject overlap is surfaced explicitly
  • full-dataset feature provenance is surfaced explicitly
  • heavy batch reuse is surfaced explicitly

Valid: TRUE here means the graph contains no errors. It does not mean the dataset is free of leakage risk: warnings and advisories still matter.

The package is also intentionally strict about silent failure. If you ask for a subset of samples and some of them do not resolve, it errors instead of dropping them.

tryCatch(
  derive_split_constraints(graph, mode = "subject", samples = c("S1", "BAD")),
  error = function(e) e$message
)
#> [1] "Unknown sample IDs: BAD"

That behavior is important in practice because quietly omitting samples would silently change the split problem you think you are solving.

7 Query the graph to inspect hidden structure

You can inspect local provenance, trace paths, and project direct sample dependencies.

neighbors_s1 <- query_neighbors(graph, node_ids = "sample:S1", direction = "out")
neighbors_s1
#> <graph_query_result> query_neighbors 
#>   Rows: 6
as.data.frame(neighbors_s1)[, c("seed_node_id", "node_id", "node_type", "edge_type")]
#>   seed_node_id              node_id  node_type                     edge_type
#> 1    sample:S1           subject:P1    Subject     sample_belongs_to_subject
#> 2    sample:S1             batch:B1      Batch     sample_processed_in_batch
#> 3    sample:S1            study:ST1      Study             sample_from_study
#> 4    sample:S1         timepoint:T0  Timepoint sample_collected_at_timepoint
#> 5    sample:S1         assay:RNAseq      Assay      sample_measured_by_assay
#> 6    sample:S1 featureset:FS_GLOBAL FeatureSet        sample_uses_featureset

subject_outcome_path <- query_shortest_paths(
  graph,
  from = "sample:S1",
  to = "outcome:O_case",
  edge_types = c("sample_belongs_to_subject", "subject_has_outcome")
)

subject_outcome_path
#> <graph_query_result> query_shortest_paths 
#>   Rows: 3
as.data.frame(subject_outcome_path)
#>   path_id step        node_id node_type                     edge_id
#> 1  path_1    1      sample:S1    Sample                        <NA>
#> 2  path_1    2     subject:P1   Subject sample_belongs_to_subject:1
#> 3  path_1    3 outcome:O_case   Outcome       subject_has_outcome:1
#>                   edge_type
#> 1                      <NA>
#> 2 sample_belongs_to_subject
#> 3       subject_has_outcome

The first query shows everything the graph knows directly about S1. The second shows that S1 reaches the subject-level outcome through its subject node, which is exactly the kind of relationship that would stay implicit in a plain metadata table.

shared_dependencies <- detect_shared_dependencies(
  graph,
  via = c("Subject", "Batch", "FeatureSet")
)

as.data.frame(shared_dependencies)[, c(
  "sample_id_1", "sample_id_2", "shared_node_type", "shared_node_id", "edge_type"
)]
#>    sample_id_1 sample_id_2 shared_node_type       shared_node_id
#> 1           S1          S2          Subject           subject:P1
#> 2           S3          S6          Subject           subject:P2
#> 3           S1          S3            Batch             batch:B1
#> 4           S1          S6            Batch             batch:B1
#> 5           S3          S6            Batch             batch:B1
#> 6           S1          S2       FeatureSet featureset:FS_GLOBAL
#> 7           S1          S3       FeatureSet featureset:FS_GLOBAL
#> 8           S1          S4       FeatureSet featureset:FS_GLOBAL
#> 9           S1          S6       FeatureSet featureset:FS_GLOBAL
#> 10          S2          S3       FeatureSet featureset:FS_GLOBAL
#> 11          S2          S4       FeatureSet featureset:FS_GLOBAL
#> 12          S2          S6       FeatureSet featureset:FS_GLOBAL
#> 13          S3          S4       FeatureSet featureset:FS_GLOBAL
#> 14          S3          S6       FeatureSet featureset:FS_GLOBAL
#> 15          S4          S6       FeatureSet featureset:FS_GLOBAL
#>                    edge_type
#> 1  sample_belongs_to_subject
#> 2  sample_belongs_to_subject
#> 3  sample_processed_in_batch
#> 4  sample_processed_in_batch
#> 5  sample_processed_in_batch
#> 6     sample_uses_featureset
#> 7     sample_uses_featureset
#> 8     sample_uses_featureset
#> 9     sample_uses_featureset
#> 10    sample_uses_featureset
#> 11    sample_uses_featureset
#> 12    sample_uses_featureset
#> 13    sample_uses_featureset
#> 14    sample_uses_featureset
#> 15    sample_uses_featureset

dependency_components <- detect_dependency_components(
  graph,
  via = c("Subject", "Batch")
)

as.data.frame(dependency_components)
#>   sample_id sample_node_id component_id component_size
#> 1        S1      sample:S1  component_1              4
#> 2        S2      sample:S2  component_1              4
#> 3        S3      sample:S3  component_1              4
#> 4        S4      sample:S4  component_2              1
#> 5        S5      sample:S5  component_3              1
#> 6        S6      sample:S6  component_1              4

These projected queries are useful because they answer the splitting question directly. They tell you which samples should be treated as structurally linked, not just which metadata columns happen to match.
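The component table above translates directly into a split decision. The sketch below is plain base R, not a splitGraph API; the components frame simply mirrors the printed output. It shows how holding out whole components keeps every dependency-linked pair of samples on one side of the split:

```r
# Sketch: turning the component table into a leakage-safe holdout split.
# Splitting by component_id guarantees that no pair of samples linked
# through a shared Subject or Batch straddles the train/test boundary.
components <- data.frame(
  sample_id = c("S1", "S2", "S3", "S4", "S5", "S6"),
  component_id = c("component_1", "component_1", "component_1",
                   "component_2", "component_3", "component_1"),
  stringsAsFactors = FALSE
)

holdout_components <- "component_2"  # component(s) chosen for the test set
test_samples <- components$sample_id[components$component_id %in% holdout_components]
train_samples <- setdiff(components$sample_id, test_samples)

train_samples  # "S1" "S2" "S3" "S5" "S6"
test_samples   # "S4"
```

Any resampling scheme that assigns whole components, rather than individual samples, to folds inherits the same guarantee.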

8 Derive split constraints from the graph

splitGraph can derive direct constraints for subject, batch, study, and time as well as composite constraints that combine multiple dependency sources.

subject_constraint <- derive_split_constraints(graph, mode = "subject")
batch_constraint <- derive_split_constraints(graph, mode = "batch")
study_constraint <- derive_split_constraints(graph, mode = "study")
time_constraint <- derive_split_constraints(graph, mode = "time")

strict_constraint <- derive_split_constraints(
  graph,
  mode = "composite",
  strategy = "strict",
  via = c("Subject", "Batch")
)

rule_based_constraint <- derive_split_constraints(
  graph,
  mode = "composite",
  strategy = "rule_based",
  priority = c("batch", "study", "subject", "time")
)

constraint_overview <- do.call(rbind, lapply(
  list(
    subject = subject_constraint,
    batch = batch_constraint,
    study = study_constraint,
    time = time_constraint,
    composite_strict = strict_constraint,
    composite_rule = rule_based_constraint
  ),
  function(x) {
    data.frame(
      strategy = x$strategy,
      groups = length(unique(x$sample_map$group_id)),
      warnings = if (is.null(x$metadata$warnings)) 0L else length(x$metadata$warnings),
      stringsAsFactors = FALSE
    )
  }
))

constraint_overview <- cbind(constraint = row.names(constraint_overview), constraint_overview)
row.names(constraint_overview) <- NULL

constraint_overview
#>         constraint   strategy groups warnings
#> 1          subject    subject      4        0
#> 2            batch      batch      4        1
#> 3            study      study      3        0
#> 4             time       time      4        1
#> 5 composite_strict     strict      3        0
#> 6   composite_rule rule_based      4        0

That summary already shows why the package is useful: different notions of dependency produce different splitting units.

8.1 Batch constraints

batch_constraint
#> <split_constraint> batch 
#>   Samples: 6 
#>   Groups: 4 
#>   Warnings: 1
as.data.frame(batch_constraint)[, c("sample_id", "group_id", "group_label", "explanation")]
#>   sample_id          group_id group_label
#> 1        S1          batch:B1          B1
#> 2        S2          batch:B2          B2
#> 3        S3          batch:B1          B1
#> 4        S4          batch:B3          B3
#> 5        S5 batch:unlinked:S5 unlinked_S5
#> 6        S6          batch:B1          B1
#>                                                                          explanation
#> 1                          Grouped by batch through sample_processed_in_batch -> B1.
#> 2                          Grouped by batch through sample_processed_in_batch -> B2.
#> 3                          Grouped by batch through sample_processed_in_batch -> B1.
#> 4                          Grouped by batch through sample_processed_in_batch -> B3.
#> 5 No batch assignment was available; sample retained as an unlinked singleton group.
#> 6                          Grouped by batch through sample_processed_in_batch -> B1.

Batch grouping keeps all B1 samples together and preserves S5 as an explicit singleton because it has no batch assignment. Missing structure is not hidden.

8.2 Time constraints

time_constraint
#> <split_constraint> time 
#>   Samples: 6 
#>   Groups: 4 
#>   Warnings: 1
as.data.frame(time_constraint)[, c("sample_id", "group_id", "timepoint_id", "order_rank")]
#>   sample_id         group_id timepoint_id order_rank
#> 1        S1          time:T0           T0          1
#> 2        S2          time:T1           T1          2
#> 3        S3          time:T0           T0          1
#> 4        S4          time:T2           T2          3
#> 5        S5 time:unlinked:S5         <NA>         NA
#> 6        S6          time:T1           T1          2

Time grouping adds order_rank, the field downstream tooling actually needs for ordered evaluation. The missing timepoint on S5 stays visible as NA, so the ordering remains explicitly partial instead of being silently completed.
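As a hedged illustration (base R only, not a splitGraph function), order_rank supports exactly this kind of earlier/later cut. The ranks below mirror the table above; samples without a rank are excluded rather than silently assigned a position:

```r
# Sketch: a "train on earlier, test on later" cut driven by order_rank.
ranks <- data.frame(
  sample_id  = c("S1", "S2", "S3", "S4", "S5", "S6"),
  order_rank = c(1, 2, 1, 3, NA, 2)
)

cutoff <- 2  # train on ranks <= cutoff, test on strictly later ranks
train <- ranks$sample_id[!is.na(ranks$order_rank) & ranks$order_rank <= cutoff]
test  <- ranks$sample_id[!is.na(ranks$order_rank) & ranks$order_rank >  cutoff]

train  # "S1" "S2" "S3" "S6"
test   # "S4"
```

Note that S5 appears in neither set; whether excluded samples should be reclaimed (for example, into the training set only) is a downstream policy decision.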

8.3 Composite constraints

strict_constraint
#> <split_constraint> strict 
#>   Samples: 6 
#>   Groups: 3
as.data.frame(strict_constraint)[, c("sample_id", "group_id", "constraint_type")]
#>   sample_id    group_id  constraint_type
#> 1        S1 component_1 composite_strict
#> 2        S2 component_1 composite_strict
#> 3        S3 component_1 composite_strict
#> 4        S4 component_2 composite_strict
#> 5        S5 component_3 composite_strict
#> 6        S6 component_1 composite_strict

rule_based_constraint
#> <split_constraint> rule_based 
#>   Samples: 6 
#>   Groups: 4
as.data.frame(rule_based_constraint)[, c("sample_id", "group_id", "constraint_type", "group_label")]
#>   sample_id            group_id constraint_type group_label
#> 1        S1  composite_batch:B1           batch          B1
#> 2        S2  composite_batch:B2           batch          B2
#> 3        S3  composite_batch:B1           batch          B1
#> 4        S4  composite_batch:B3           batch          B3
#> 5        S5 composite_study:ST3           study         ST3
#> 6        S6  composite_batch:B1           batch          B1

The strict composite constraint uses transitive closure: S1, S2, S3, and S6 end up in the same group because subject and batch links connect them into one dependency component. The rule-based composite constraint is different: it uses the highest-priority available dependency per sample, so S5 falls back to study-level grouping instead of becoming a composite component.
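To make "transitive closure" concrete, here is a minimal base-R sketch (not the package's implementation) that reproduces the strict grouping with a tiny union-find over shared subjects and batches:

```r
# Sketch: strict composite grouping as the transitive closure of shared
# Subject and Batch links, using a minimal union-find.
meta <- data.frame(
  sample_id  = c("S1", "S2", "S3", "S4", "S5", "S6"),
  subject_id = c("P1", "P1", "P2", "P3", "P4", "P2"),
  batch_id   = c("B1", "B2", "B1", "B3", NA, "B1"),
  stringsAsFactors = FALSE
)

parent <- setNames(meta$sample_id, meta$sample_id)
find <- function(x) { while (parent[[x]] != x) x <- parent[[x]]; x }
union_ <- function(a, b) parent[[find(a)]] <<- find(b)

# Union every pair of samples that shares a subject or a (non-missing)
# batch; split() drops NA keys, so S5's missing batch links nothing.
for (col in c("subject_id", "batch_id")) {
  for (grp in split(meta$sample_id, meta[[col]])) {
    if (length(grp) > 1) for (s in grp[-1]) union_(grp[1], s)
  }
}

components <- vapply(meta$sample_id, find, character(1))
split(meta$sample_id, components)
```

Run on the vignette metadata, this yields the same three components as strict_constraint: one of size four (S1, S2, S3, S6) and two singletons (S4, S5).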

9 Time ordering can come from precedence edges alone

If explicit time_index metadata are unavailable, splitGraph can still infer time order from timepoint_precedes edges.

precedence_meta <- data.frame(
  sample_id = c("S1", "S2", "S3"),
  subject_id = c("P1", "P1", "P2"),
  study_id = c("ST1", "ST1", "ST2"),
  timepoint_id = c("T0", "T1", "T2"),
  stringsAsFactors = FALSE
)

precedence_graph <- build_dependency_graph(
  nodes = list(
    create_nodes(precedence_meta, type = "Sample", id_col = "sample_id"),
    create_nodes(precedence_meta, type = "Subject", id_col = "subject_id"),
    create_nodes(precedence_meta, type = "Study", id_col = "study_id"),
    create_nodes(
      data.frame(timepoint_id = c("T0", "T1", "T2"), stringsAsFactors = FALSE),
      type = "Timepoint",
      id_col = "timepoint_id"
    )
  ),
  edges = list(
    create_edges(
      precedence_meta, "sample_id", "subject_id",
      "Sample", "Subject", "sample_belongs_to_subject"
    ),
    create_edges(
      precedence_meta, "sample_id", "study_id",
      "Sample", "Study", "sample_from_study"
    ),
    create_edges(
      precedence_meta, "sample_id", "timepoint_id",
      "Sample", "Timepoint", "sample_collected_at_timepoint"
    ),
    create_edges(
      data.frame(
        from_timepoint = c("T0", "T1"),
        to_timepoint = c("T1", "T2"),
        stringsAsFactors = FALSE
      ),
      "from_timepoint", "to_timepoint",
      "Timepoint", "Timepoint", "timepoint_precedes"
    )
  ),
  graph_name = "precedence_only_graph"
)

precedence_time_constraint <- derive_split_constraints(precedence_graph, mode = "time")

precedence_time_constraint$metadata$time_order_source
#> [1] "timepoint_precedes"
as.data.frame(precedence_time_constraint)[, c("sample_id", "timepoint_id", "time_index", "order_rank")]
#>   sample_id timepoint_id time_index order_rank
#> 1        S1           T0         NA          1
#> 2        S2           T1         NA          2
#> 3        S3           T2         NA          3

The important detail is that ordering is still derived, but the source is timepoint_precedes rather than time_index.

10 Translate the constraint into a split specification

The graph-derived constraint is not the end of the workflow. The main handoff target is a canonical sample-level split specification — the split_spec class. Downstream tools consume it through their own adapters, so split_spec stays tool-agnostic.

split_spec <- as_split_spec(strict_constraint, graph = graph)
split_spec
#> <split_spec> composite 
#>   Samples: 6 
#>   Groups: 3 
#>   Recommended resampling: custom_grouped_cv

as.data.frame(split_spec)[, c(
  "sample_id", "group_id", "batch_group", "study_group", "timepoint_id", "order_rank"
)]
#>   sample_id    group_id batch_group study_group timepoint_id order_rank
#> 1        S1 component_1          B1         ST1           T0          1
#> 2        S2 component_1          B2         ST1           T1          2
#> 3        S3 component_1          B1         ST1           T0          1
#> 4        S4 component_2          B3         ST2           T2          3
#> 5        S5 component_3        <NA>         ST3         <NA>         NA
#> 6        S6 component_1          B1         ST2           T1          2

split_spec_validation <- validate_split_spec(split_spec)
split_spec_validation
#> <split_spec_validation>
#>   Valid: TRUE 
#>   Issues: 0
as.data.frame(split_spec_validation)
#> [1] issue_id   severity   code       message    n_affected details   
#> <0 rows> (or 0-length row.names)

This translation step is where the package becomes operational for downstream evaluation workflows:

  • group_id carries the split unit
  • batch_group and study_group are available for blocking
  • order_rank is available for ordered evaluation
  • the generated object is validated before handoff

11 Summarize the leakage picture in one object

The final helper combines graph validation, constraint diagnostics, and split-spec readiness into one summary object.

risk_summary <- summarize_leakage_risks(
  graph,
  constraint = strict_constraint,
  split_spec = split_spec
)

risk_summary
#> <leakage_risk_summary>
#>   Overview: Detected 12 structural leakage diagnostics across validation, constraint, and split-spec readiness. 
#>   Diagnostics: 12
as.data.frame(risk_summary)[, c("source", "severity", "category", "message")]
#>        source severity                     category
#> 1  validation advisory     repeated_subject_samples
#> 2  validation advisory     repeated_subject_samples
#> 3  validation  warning  subject_cross_study_overlap
#> 4  validation advisory       per_dataset_featureset
#> 5  validation advisory shared_featureset_provenance
#> 6  validation advisory            heavy_batch_reuse
#> 7  constraint advisory   singleton_heavy_constraint
#> 8  split_spec advisory             split_spec_ready
#> 9  split_spec advisory           ordering_available
#> 10 split_spec advisory           blocking_available
#> 11 split_spec advisory           blocking_available
#> 12 split_spec advisory   split_spec_singleton_heavy
#>                                                                     message
#> 1                       Subject `subject:P1` is linked to multiple samples.
#> 2                       Subject `subject:P2` is linked to multiple samples.
#> 3                     Subject `subject:P2` appears across multiple studies.
#> 4  FeatureSet `featureset:FS_GLOBAL` was derived at the full-dataset scope.
#> 5      FeatureSet `featureset:FS_GLOBAL` is shared across multiple samples.
#> 6                           Batch `batch:B1` is reused across many samples.
#> 7            The derived split constraint is dominated by singleton groups.
#> 8                                   Split spec passed preflight validation.
#> 9     Split spec provides ordering through `order_rank` for 5 of 6 samples.
#> 10  Split spec provides blocking variable `batch_group` for 5 of 6 samples.
#> 11  Split spec provides blocking variable `study_group` for 6 of 6 samples.
#> 12                    Split spec grouping is dominated by singleton groups.

This is a useful stopping point before model training. It gives you one place to review whether the graph is structurally sound, whether the chosen constraint is overly singleton-heavy, and whether the downstream split spec is ready to use.

12 Downstream handoff

split_spec is the tool-agnostic handoff artifact. splitGraph does not know about any particular resampling package — downstream consumers provide their own adapters so splitGraph stays neutral. The typical end-to-end flow is:

  1. graph_from_metadata() (or the explicit constructor path) → typed dependency_graph
  2. derive_split_constraints(g, mode = ...) → split_constraint
  3. as_split_spec(constraint, graph = g) → split_spec
  4. adapter in the downstream package → native resamples

The sample_data frame carried by split_spec exposes exactly the columns downstream adapters consume: sample_id for joining against the observation frame, group_id for grouped resampling, batch_group / study_group for blocking, and order_rank for ordered evaluation. Adapters can be built by any package that wants to consume a split_spec — for example, on top of rsample::group_vfold_cv() (grouped CV keyed to group_id) or rsample::rolling_origin() (ordered evaluation keyed to order_rank).
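As a sketch of that adapter step — assuming the split_spec object built above and a hypothetical observation frame obs keyed by sample_id (obs is not part of splitGraph; it stands in for whatever data frame your modeling code uses):

library(rsample)

spec_df <- as.data.frame(split_spec)
obs_grouped <- merge(obs, spec_df[, c("sample_id", "group_id")], by = "sample_id")

# group_vfold_cv() holds out whole groups, so samples sharing a group_id
# never straddle the train/assessment boundary.
folds <- group_vfold_cv(obs_grouped, group = group_id, v = 3)

Note that v cannot exceed the number of distinct groups; the strict composite constraint above yields only three components, so v = 3 here amounts to leave-one-group-out.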

13 Case studies

The end-to-end workflow above shows the package surface. The case studies below show how the same graph leads to different evaluation decisions depending on the scientific question.

13.1 Case study 1: repeated subjects in a longitudinal cohort

Suppose the real question is whether future observations from the same subject should be held out from training. In this setting, subject reuse and time ordering both matter, but they solve different problems.

subject_groups <- grouping_vector(subject_constraint)
time_groups <- time_constraint$sample_map[, c("sample_id", "group_id", "timepoint_id", "order_rank")]

subject_groups
#>           S1           S2           S3           S4           S5           S6 
#> "subject:P1" "subject:P1" "subject:P2" "subject:P3" "subject:P4" "subject:P2"
time_groups
#>   sample_id         group_id timepoint_id order_rank
#> 1        S1          time:T0           T0          1
#> 2        S2          time:T1           T1          2
#> 3        S3          time:T0           T0          1
#> 4        S4          time:T2           T2          3
#> 5        S5 time:unlinked:S5         <NA>         NA
#> 6        S6          time:T1           T1          2

Interpretation:

  • S1 and S2 share subject P1, so subject-grouped evaluation keeps them together.
  • S3 and S6 share subject P2, so they also stay together under a subject-based split.
  • time grouping adds a different axis: T0, T1, and T2 become ordered units with explicit order_rank.

If the leakage concern is repeated measurements from the same individual, use the subject constraint. If the evaluation question is prospective prediction, the time constraint adds the ordering information you need.
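A concrete prospective split can be cut directly from the time constraint's sample_map with base R. The rank threshold below (hold out the latest timepoint) is an illustrative choice, not a package default:

tm <- time_constraint$sample_map
# Train on earlier timepoints (T0, T1), assess on the latest (T2).
# S5 has no time metadata and is excluded by the NA filter.
train_ids <- tm$sample_id[!is.na(tm$order_rank) & tm$order_rank <= 2]
test_ids  <- tm$sample_id[!is.na(tm$order_rank) & tm$order_rank == 3]

train_ids
#> [1] "S1" "S2" "S3" "S6"
test_ids
#> [1] "S4"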

13.2 Case study 2: a subject reused across studies

The graph intentionally includes subject P2 in both ST1 and ST2. A study-only split would treat those studies as separate units, but the graph shows that subject overlap breaks the intended independence.

validation_df <- as.data.frame(validation)
cross_study_issues <- validation_df[
  validation_df$code == "subject_cross_study_overlap",
  c("severity", "code", "message")
]

p2_shared <- detect_shared_dependencies(
  graph,
  via = "Subject",
  samples = c("S3", "S6")
)

study_only_map <- study_constraint$sample_map[, c("sample_id", "group_id", "group_label")]
strict_map <- strict_constraint$sample_map[, c("sample_id", "group_id", "constraint_type")]

cross_study_issues
#>   severity                        code
#> 3  warning subject_cross_study_overlap
#>                                                 message
#> 3 Subject `subject:P2` appears across multiple studies.
as.data.frame(p2_shared)
#>   sample_id_1 sample_id_2 sample_node_id_1 sample_node_id_2 shared_node_id
#> 1          S3          S6        sample:S3        sample:S6     subject:P2
#>   shared_node_type                 edge_type
#> 1          Subject sample_belongs_to_subject
study_only_map[study_only_map$sample_id %in% c("S3", "S6"), ]
#>   sample_id  group_id group_label
#> 3        S3 study:ST1         ST1
#> 6        S6 study:ST2         ST2
strict_map[strict_map$sample_id %in% c("S3", "S6"), ]
#>   sample_id    group_id  constraint_type
#> 3        S3 component_1 composite_strict
#> 6        S6 component_1 composite_strict

Interpretation:

  • validation surfaces the cross-study subject overlap directly
  • the shared-dependency query confirms that S3 and S6 are linked through the same subject
  • a study-only split would place them in different groups (ST1 versus ST2)
  • the strict composite constraint correctly keeps them in the same dependency component

This is exactly the kind of failure mode splitGraph is designed to expose: metadata columns suggest a legitimate study split, but graph structure shows that the split would still leak subject information.

13.3 Case study 3: partially observed technical metadata

Real metadata are rarely complete. Here, S5 has no batch assignment and no timepoint assignment. The package does not pretend those fields exist. It keeps the sample visible and tells you how the split logic handled it.

batch_missing <- batch_constraint$sample_map[
  batch_constraint$sample_map$sample_id == "S5",
  c("sample_id", "group_id", "group_label", "explanation")
]

rule_based_missing <- rule_based_constraint$sample_map[
  rule_based_constraint$sample_map$sample_id == "S5",
  c("sample_id", "group_id", "constraint_type", "group_label", "explanation")
]

split_spec_df <- as.data.frame(split_spec)
split_spec_missing <- split_spec_df[
  split_spec_df$sample_id == "S5",
  c("sample_id", "group_id", "batch_group", "study_group", "timepoint_id", "order_rank")
]

batch_missing
#>   sample_id          group_id group_label
#> 5        S5 batch:unlinked:S5 unlinked_S5
#>                                                                          explanation
#> 5 No batch assignment was available; sample retained as an unlinked singleton group.
rule_based_missing
#>   sample_id            group_id constraint_type group_label
#> 5        S5 composite_study:ST3           study         ST3
#>                                                                                                                                                  explanation
#> 5 Composite rule-based grouping selected study based on priority order batch > study > subject > time -> ST3. Additional available dependencies: subject=P4.
split_spec_missing
#>   sample_id    group_id batch_group study_group timepoint_id order_rank
#> 5        S5 component_3        <NA>         ST3         <NA>         NA

Interpretation:

  • batch-based splitting keeps S5 as an explicit singleton because batch metadata are missing
  • the rule-based composite strategy falls back to study-level grouping for S5
  • the translated split specification preserves the missing batch and time fields as NA rather than silently inventing values

That behavior matters because incomplete metadata are common. splitGraph stays strict about what is known, but still produces a usable, inspectable split object.
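A downstream consumer still has to decide what those NA fields mean for its own resampling. One defensible pattern — illustrative only, and batch_block is not a splitGraph column — is to make the unknown explicit rather than dropping the sample:

spec_df <- as.data.frame(split_spec)
# Recode missing batch metadata as an explicit level so S5 stays visible
# in blocked designs instead of vanishing in an NA filter.
spec_df$batch_block <- ifelse(
  is.na(spec_df$batch_group), "batch_unknown", spec_df$batch_group
)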

13.4 Case study 4: choosing a defensible split strategy

A typical practical question is not “what can the package compute?” but “which constraint should I actually use?” The answer depends on which dependency source is scientifically unacceptable to leak across train and test.

strategy_summary <- data.frame(
  constraint = c("subject", "batch", "study", "time", "composite_strict", "composite_rule"),
  groups = c(
    length(unique(subject_constraint$sample_map$group_id)),
    length(unique(batch_constraint$sample_map$group_id)),
    length(unique(study_constraint$sample_map$group_id)),
    length(unique(time_constraint$sample_map$group_id)),
    length(unique(strict_constraint$sample_map$group_id)),
    length(unique(rule_based_constraint$sample_map$group_id))
  ),
  warnings = c(
    length(or_empty(subject_constraint$metadata$warnings)),
    length(or_empty(batch_constraint$metadata$warnings)),
    length(or_empty(study_constraint$metadata$warnings)),
    length(or_empty(time_constraint$metadata$warnings)),
    length(or_empty(strict_constraint$metadata$warnings)),
    length(or_empty(rule_based_constraint$metadata$warnings))
  ),
  recommended_resampling = c(
    as_split_spec(subject_constraint, graph = graph)$recommended_resampling,
    as_split_spec(batch_constraint, graph = graph)$recommended_resampling,
    as_split_spec(study_constraint, graph = graph)$recommended_resampling,
    as_split_spec(time_constraint, graph = graph)$recommended_resampling,
    as_split_spec(strict_constraint, graph = graph)$recommended_resampling,
    as_split_spec(rule_based_constraint, graph = graph)$recommended_resampling
  ),
  stringsAsFactors = FALSE
)

strategy_summary
#>         constraint groups warnings recommended_resampling
#> 1          subject      4        0             grouped_cv
#> 2            batch      4        1             blocked_cv
#> 3            study      3        0    leave_one_group_out
#> 4             time      4        1          ordered_split
#> 5 composite_strict      3        0      custom_grouped_cv
#> 6   composite_rule      4        0             grouped_cv

Interpretation:

  • subject grouping is the right default when repeated individuals are the dominant leakage source
  • batch grouping is appropriate when technical runs are the main contamination risk
  • study grouping is useful for cross-study generalization only when no higher-level dependency crosses study boundaries
  • strict composite grouping is the safest choice when multiple dependency sources can connect samples transitively
  • rule-based composite grouping is a pragmatic fallback when you want a single deterministic hierarchy over partially observed metadata

The package does not choose the scientific objective for you. It makes the trade-off visible and auditable.
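One lightweight audit over the table above — purely illustrative — is to shortlist the constraints that derived without warnings and then choose among them on scientific grounds rather than on counts alone:

strategy_summary[
  strategy_summary$warnings == 0,
  c("constraint", "groups", "recommended_resampling")
]
#>         constraint groups recommended_resampling
#> 1          subject      4             grouped_cv
#> 3            study      3    leave_one_group_out
#> 5 composite_strict      3      custom_grouped_cv
#> 6   composite_rule      4             grouped_cv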

14 When splitGraph is useful

splitGraph is a good fit when:

  • sample relationships are scientifically meaningful and must influence evaluation
  • metadata contain repeated subjects, shared batches, multiple studies, or temporal structure
  • feature provenance or outcome level matters for leakage assessment
  • you want deterministic, inspectable split constraints instead of ad hoc grouping code

15 What splitGraph is not for

splitGraph is not:

  • a general biological network analysis package
  • a model training framework
  • a resampling engine
  • a substitute for downstream performance auditing

Its value is earlier in the workflow: it makes dependency structure explicit so that the split design itself can be justified.

16 Takeaway

If you already know your data have repeated subjects, reused batches, temporal ordering, or shared feature provenance, then you already have a graph problem whether you model it explicitly or not. splitGraph is useful because it turns that hidden graph into an object you can validate, query, and convert into a split design that downstream tooling can trust.