Mind map and workflow

rtForecastEval compares two real-time forecasters on a common time grid. The schematic below matches the analysis order in this vignette: prepare a long tibble, run the global delta test (calc_Z → calc_eig → calc_pval), and optionally summarize pointwise loss with calc_L_s2 and plot_pcb.

Typical workflow: global test (top) and pointwise loss plot (bottom).
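Concretely, the long tibble holds one row per (game, grid point) pair. A minimal sketch of the expected layout, assuming the column names used throughout this vignette (grid, game, Y, phat_A, phat_B) with made-up values:

tibble::tribble(
  ~game, ~grid, ~Y, ~phat_A, ~phat_B,
      1,  0.01,  1,    0.55,    0.52,
      1,  0.02,  1,    0.57,    0.51,
      2,  0.01,  0,    0.48,    0.50
)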


Figures in the paper and where they are produced

The paper (preprint arXiv:2010.00781) has two kinds of graphics:

(1) methodological figures on simulated data (the global test and pointwise skill curves), which this vignette reproduces with the package's exported functions; and

(2) NBA application figures: calibration surfaces, reliability diagrams, and model comparisons.

The figures in (2) are not self-contained in rtForecastEval: they require the scraped NBA play-by-play inputs, the load_nba_data() / pre_process() pipeline, and several bespoke plotting scripts. Reproduce them from the separate replication repository (RTPForNBA; historically bundled with the paper's supplementary code), not from this package alone:

| Idea in the paper | Typical script (replication repo) | Depends on |
|---|---|---|
| Calibration surface (3D: time × forecast × centered residual) | plotting/surfacePlot.R | nba_FD.R (fits logit, builds my_df); plot3D, akima, lattice; binned CIs |
| Reliability-style plot at one time (e.g. mid-game) | plotting/calibrationPlot.R (see also the end of surfacePlot.R) | same prepared df_try object |
| Skill curves / \(\hat\Delta_n(t)\) for ESPN vs models (PgRS, LTW, etc.) | plotting/1_PgRS.R, 2_LTW.R, 3_CF.R, … | utility.R (calc_L_s2, L_smoothing); nba_FD.R |
| Simulation figures (oracle, score difference, etc.) | simulation/PF-sim/plot_sim_*.R | simulation/PF-sim/simulator.R and generated outputs |

Suggested order in the replication repo: prepare the data under data/, run nba_FD.R (or the season scripts it sources), then source() the relevant file under plotting/, with the working directory set to that project root.
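As an R sketch of that order (the script locations are taken from the table above; the checkout path is hypothetical):

# run from the replication-repo root, not from this package
setwd("path/to/RTPForNBA")       # hypothetical checkout location
source("nba_FD.R")               # fits the logit, prepares my_df / df_try
source("plotting/surfacePlot.R") # or calibrationPlot.R, 1_PgRS.R, ...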

The rest of this vignette focuses on (1).

library(rtForecastEval) # df_gen(), calc_Z(), calc_eig(), calc_pval(), calc_L_s2(), plot_pcb()
library(dplyr)
library(ggplot2)
library(tibble)
library(rlist) # list.rbind()
library(MASS)
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
nsamp <- 100 # number of in-game events (grid points) per game
ngame <- 200 # number of games (the README example uses the same values, for comparable figures)

# Parameters for the eigenvalue truncation and the simulated p-values
D <- 10 # number of leading eigenvalues to keep
N_MC <- 5000 # Monte Carlo draws for simulating the p-value

# squared-error (Brier) loss
L <- function(x, y) {
  (x - y)^2
}

# Data generation ---------------------------------------------------------
# df_gen() simulates ngame games on a common grid of nsamp points; we also
# store the centered and non-centered forecast differences used below to
# build the covariance kernel.
df_equ <- df_gen(N = nsamp, Ngame = ngame) %>%
  group_by(grid) %>%
  mutate(
    p_bar_12 = mean(phat_A - phat_B),
    diff_non_cent = phat_A - phat_B,
    diff_cent = phat_A - phat_B - p_bar_12
  ) %>%
  ungroup()

# Apply our test (explicit construction, as in utility.R) ----------------

# Average the loss difference over games at each grid point, then
# aggregate the squared averages across the grid.
Z <- df_equ %>%
  group_by(grid) %>%
  summarise(delta_n = mean(L(phat_A, Y) - L(phat_B, Y))) %>%
  {
    sum((.)$delta_n^2) / nsamp * ngame
  }
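In formulas, the pipeline above computes

\[
\hat\Delta_n(t) = \frac{1}{N_{\mathrm{game}}} \sum_{g=1}^{N_{\mathrm{game}}} \bigl\{ L(\hat p_{A,g}(t), Y_g) - L(\hat p_{B,g}(t), Y_g) \bigr\},
\qquad
Z = \frac{N_{\mathrm{game}}}{N_{\mathrm{samp}}} \sum_{t \in \mathrm{grid}} \hat\Delta_n(t)^2,
\]

with \(N_{\mathrm{samp}} = \) nsamp grid points and \(N_{\mathrm{game}} = \) ngame games.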

# One tibble per grid point; rows within each tibble are games.
temp <- df_equ %>% group_split(grid, .keep = FALSE)

# Discretized covariance kernel of the non-centered differences,
# K[i, j] = (1/ngame) * sum_g diff_g(t_i) * diff_g(t_j); its top-D
# eigenvalues, divided by nsamp, estimate the kernel eigenvalues.
eigV_hat <- lapply(1:nsamp, function(i) {
  sapply(1:nsamp, function(j) {
    as.numeric(temp[[i]]$diff_non_cent %*% temp[[j]]$diff_non_cent / ngame)
  })
}) %>% list.rbind() %>% {
  RSpectra::eigs_sym(
    A = (.),
    k = D,
    which = "LM",
    opts = list(retvec = FALSE)
  )$values
} %>%
  {
    (.) / nsamp
  }

# Same construction with the centered differences.
eigV_til <- lapply(1:nsamp, function(i) {
  sapply(1:nsamp, function(j) {
    as.numeric(temp[[i]]$diff_cent %*% temp[[j]]$diff_cent / ngame)
  })
}) %>% list.rbind() %>% {
  RSpectra::eigs_sym(
    A = (.),
    k = D,
    which = "LM",
    opts = list(retvec = FALSE)
  )$values
} %>%
  {
    (.) / nsamp
  }
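Under \(H_0\) of equal forecasting skill, \(Z\) is asymptotically distributed as a weighted sum of independent \(\chi^2_1\) variables, \(\sum_{k \ge 1} \lambda_k \chi^2_{1,k}\), with weights given by the kernel eigenvalues. The Monte Carlo step below truncates the sum at the leading D eigenvalues and draws N_MC realizations under each set of weights.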

# Null draws: weighted sums of independent chi-square(1) variables
MC_hat <- sapply(1:N_MC, function(x) {
  crossprod(eigV_hat, rchisq(D, df = 1))
})

q_90_hat <- quantile(MC_hat, 0.90)
q_95_hat <- quantile(MC_hat, 0.95)
q_99_hat <- quantile(MC_hat, 0.99)

# Same draws under the centered eigenvalues
MC_til <- sapply(1:N_MC, function(x) {
  crossprod(eigV_til, rchisq(D, df = 1))
})

q_90_til <- quantile(MC_til, 0.90)
q_95_til <- quantile(MC_til, 0.95)
q_99_til <- quantile(MC_til, 0.99)
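Each simulated quantile serves as a critical value: at level \(\alpha\), reject \(H_0\) when \(Z\) exceeds the \(1 - \alpha\) quantile of the corresponding null draws.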

# p-value of Z under the simulated null; computed from the non-centered
# draws and reported in both rows of the table below
p_hat <- 1 - ecdf(MC_hat)(Z)

tibble(
  type = c("non-center", "center"),
  Z = rep(Z, 2),
  pval = c(p_hat, p_hat),
  "90%" = c(q_90_hat, q_90_til),
  "95%" = c(q_95_hat, q_95_til),
  "99%" = c(q_99_hat, q_99_til)
)
#> # A tibble: 2 × 6
#>   type            Z  pval `90%` `95%` `99%`
#>   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 non-center 0.0574 0.719 0.442 0.627 0.991
#> 2 center     0.0574 0.719 0.449 0.631 1.04

Same analysis using package functions

to_center <- FALSE # non-centered eigenvalues, matching the explicit construction above

ZZ <- calc_Z(df = df_equ, pA = "phat_A", pB = "phat_B", Y = "Y", nsamp = nsamp, ngame = ngame)
eigg <- calc_eig(
  df = df_equ, n_eig = D, ngame = ngame,
  nsamp = nsamp, grid = "grid", cent = to_center
)
oh <- calc_pval(ZZ, eig = eigg, quan = c(0.90, 0.95, 0.99), n_MC = N_MC)
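To get the centered variant (the "center" row of the explicit table), flip the cent flag and recompute the null distribution; a sketch reusing the objects above:

eigg_c <- calc_eig(
  df = df_equ, n_eig = D, ngame = ngame,
  nsamp = nsamp, grid = "grid", cent = TRUE
)
oh_c <- calc_pval(ZZ, eig = eigg_c, quan = c(0.90, 0.95, 0.99), n_MC = N_MC)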

# Pointwise loss summary and plot
loss_df <- calc_L_s2(df = df_equ, pA = "phat_A", pB = "phat_B")

plot_pcb(df = loss_df)
Pointwise mean loss difference (A vs B) with naive normal band — simulation setting. This is a skill curve, not a calibration diagram.



tibble(
  type = ifelse(to_center, "center", "non-center"),
  Z = ZZ,
  pval = oh$p_val,
  "90%" = oh$quantile[1],
  "95%" = oh$quantile[2],
  "99%" = oh$quantile[3]
)
#> # A tibble: 1 × 6
#>   type            Z  pval `90%` `95%` `99%`
#>   <chr>       <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 non-center 0.0574 0.709 0.458 0.629  1.04
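Up to Monte Carlo noise in the simulated null draws, this matches the explicit construction above: the p-value is 0.709 here versus 0.719 there, with near-identical critical values.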

A simple calibration (reliability) view at one time

The previous figure tracks skill (which forecaster accumulates less Brier loss over time). A complementary check is marginal calibration: within a narrow time slice, do predicted probabilities match observed event frequencies? The paper's NBA figures use richer calibration surfaces (time × forecast × residual); here we only bin forecasts at a single grid point (the one closest to mid-game, \(t \approx 0.5\)) and plot mean outcome against mean forecast, a standard reliability diagram. Grey vertical segments show a 95% central range for \(\bar Y\) under \(H_0\) in each bin: \(n\bar Y \sim \mathrm{Binomial}(n, p)\) with \(p = \overline{\hat p}\), using exact qbinom bounds rather than a normal approximation with clipping, which can distort small-sample bands.
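As a worked instance of the band: a bin with \(n = 20\) games and mean forecast \(p = 0.6\) gets

qbinom(0.025, size = 20, prob = 0.6) / 20 # 0.40
qbinom(0.975, size = 20, prob = 0.6) / 20 # 0.80

so its grey segment runs from 0.40 to 0.80.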

g_mid <- df_equ %>%
  distinct(grid) %>%
  slice_min(order_by = abs(grid - 0.5), n = 1) %>%
  pull(grid)

slice_t <- df_equ %>%
  filter(grid == g_mid) %>%
  transmute(
    game,
    Y,
    phat_A,
    phat_B
  )

rel_long <- slice_t %>%
  tidyr::pivot_longer(c(phat_A, phat_B), names_to = "forecaster", values_to = "phat") %>%
  mutate(forecaster = ifelse(forecaster == "phat_A", "Forecaster A", "Forecaster B"))

rel_binned <- rel_long %>%
  group_by(forecaster) %>%
  mutate(bin = ntile(phat, 10)) %>% # equal-count decile bins per forecaster
  group_by(forecaster, bin) %>%
  summarise(
    mean_forecast = mean(phat),
    mean_outcome = mean(Y),
    n_games = dplyr::n(),
    .groups = "drop"
  ) %>%
  mutate(
    # exact central 95% Binomial range for mean(Y) under H0 (p = mean forecast)
    p_bin = pmin(pmax(mean_forecast, 0), 1),
    ci_lo = stats::qbinom(0.025, size = n_games, prob = p_bin) / n_games,
    ci_hi = stats::qbinom(0.975, size = n_games, prob = p_bin) / n_games
  )

ggplot(rel_binned, aes(mean_forecast, mean_outcome)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey45") +
  geom_errorbar(
    aes(x = mean_forecast, ymin = ci_lo, ymax = ci_hi),
    width = 0.02,
    color = "grey55",
    alpha = 0.85,
    linewidth = 0.35
  ) +
  geom_point(aes(size = n_games), alpha = 0.9, color = "darkred") +
  facet_wrap(~forecaster, nrow = 1) +
  coord_fixed(xlim = c(0, 1), ylim = c(0, 1), ratio = 1) +
  labs(
    title = "Binned reliability (calibration) at one grid",
    subtitle = paste0(
      "Grid = ", signif(g_mid, 3),
      " (closest to 0.5); grey: exact 95% Binomial range for mean(Y) under H0"
    ),
    x = "Mean forecast in bin",
    y = "Mean outcome (Y) in bin",
    size = "Games"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(face = "bold")
  )
Binned reliability diagram at a fixed grid (closest to 0.5): mean outcome vs mean forecast for A and B; grey bars are exact 95% central Binomial range for mean(Y) under H0. Points near the diagonal indicate better marginal calibration at that time.


Reliability with shared bins (two standard diagrams)

The facet reliability figure above uses separate decile bins per forecaster. Here we bin the same games with equal-width cuts of the midpoint \(m = (\hat p_A + \hat p_B)/2\), then draw one classical reliability diagram per forecaster (mean forecast vs mean outcome, 45° line). Grey vertical segments are the exact Binomial 95% central range for \(\bar Y\) under \(H_0\) at each bin's mean forecast. We keep the forecasters in separate panels rather than overlaying both in one square.
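The binning choice matters: ntile() builds equal-count bins, while cut() with a fixed break sequence builds equal-width bins. A toy contrast on made-up values:

x <- c(0.02, 0.03, 0.05, 0.90)
dplyr::ntile(x, 2)            # 1 1 2 2: two bins of two values each
cut(x, breaks = c(0, 0.5, 1)) # (0,0.5] (0,0.5] (0,0.5] (0.5,1]: unequal counts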

cal_bins <- slice_t %>%
  mutate(mid = (phat_A + phat_B) / 2) %>%
  mutate(bin = cut(mid, breaks = seq(0, 1, length.out = 11), include.lowest = TRUE))

rel_joint <- cal_bins %>%
  group_by(bin) %>%
  summarise(
    mean_forecast_A = mean(phat_A),
    mean_forecast_B = mean(phat_B),
    mean_outcome = mean(Y),
    n_games = dplyr::n(),
    .groups = "drop"
  ) %>%
  filter(!is.na(bin)) %>%
  mutate(
    p_A = pmin(pmax(mean_forecast_A, 0), 1),
    p_B = pmin(pmax(mean_forecast_B, 0), 1),
    ci_lo_A = stats::qbinom(0.025, size = n_games, prob = p_A) / n_games,
    ci_hi_A = stats::qbinom(0.975, size = n_games, prob = p_A) / n_games,
    ci_lo_B = stats::qbinom(0.025, size = n_games, prob = p_B) / n_games,
    ci_hi_B = stats::qbinom(0.975, size = n_games, prob = p_B) / n_games
  )

rel_joint_long <- rel_joint %>%
  tidyr::pivot_longer(
    c(mean_forecast_A, mean_forecast_B),
    names_to = "which",
    names_prefix = "mean_forecast_",
    values_to = "mean_forecast"
  ) %>%
  mutate(
    forecaster = ifelse(which == "A", "Forecaster A", "Forecaster B"),
    ci_lo = ifelse(which == "A", ci_lo_A, ci_lo_B),
    ci_hi = ifelse(which == "A", ci_hi_A, ci_hi_B)
  )

ggplot(rel_joint_long, aes(mean_forecast, mean_outcome)) +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey45") +
  geom_errorbar(
    aes(x = mean_forecast, ymin = ci_lo, ymax = ci_hi),
    width = 0.02,
    color = "grey55",
    alpha = 0.85,
    linewidth = 0.4
  ) +
  geom_point(aes(color = forecaster, size = n_games), alpha = 0.9) +
  facet_wrap(~forecaster, nrow = 1) +
  coord_fixed(xlim = c(0, 1), ylim = c(0, 1), ratio = 1) +
  scale_color_manual(values = c("Forecaster A" = "steelblue", "Forecaster B" = "orangered")) +
  labs(
    title = "Reliability: shared midpoint bins (one classical diagram per forecaster)",
    subtitle = "Bins from cut((phat_A + phat_B) / 2); grey = exact 95% Binomial H0 range for mean(Y)",
    x = "Mean forecast in bin",
    y = "Mean outcome (Y) in bin",
    color = NULL,
    size = "Games"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(face = "bold"),
    legend.position = "right"
  ) +
  guides(color = "none")
Shared midpoint bins: two standard reliability diagrams (A and B); grey bars = exact 95% Binomial range for mean(Y) under H0.


Comparing calibration between forecasters (error curves)

The reliability diagrams show levels against the 45° line; here we plot the calibration errors \(\bar Y - \bar{\hat p}_k\), \(k \in \{A, B\}\), against the bin midpoint. Because both errors share the same \(\bar Y\), their difference reduces to a pure forecast gap, \((\bar Y - \bar{\hat p}_B) - (\bar Y - \bar{\hat p}_A) = \bar{\hat p}_A - \bar{\hat p}_B\); this is the dashed line.

cal_compare <- cal_bins %>%
  group_by(bin) %>%
  summarise(
    mid_hat = mean(mid),
    cal_err_A = mean(Y) - mean(phat_A),
    cal_err_B = mean(Y) - mean(phat_B),
    n_games = dplyr::n(),
    .groups = "drop"
  ) %>%
  filter(!is.na(bin))

cal_long <- cal_compare %>%
  tidyr::pivot_longer(
    c(cal_err_A, cal_err_B),
    names_to = "which",
    values_to = "cal_err"
  ) %>%
  mutate(
    which = ifelse(which == "cal_err_A", "A: mean(Y) - phat_A", "B: mean(Y) - phat_B")
  )

ggplot(cal_long, aes(mid_hat, cal_err, color = which)) +
  geom_hline(yintercept = 0, linetype = "dashed", color = "grey70") +
  geom_line(linewidth = 0.85) +
  geom_point(aes(size = n_games), alpha = 0.85) +
  geom_line(
    data = cal_compare,
    aes(mid_hat, cal_err_B - cal_err_A),
    inherit.aes = FALSE,
    color = "grey35",
    linetype = "dashed",
    linewidth = 0.65
  ) +
  labs(
    title = "Calibration comparison (same midpoint bins)",
    subtitle = "Solid: mean calibration error per forecaster; dashed: mean(phat_B) - mean(phat_A)",
    x = "Mean (phat_A + phat_B) / 2 in bin",
    y = "mean(Y) - mean(phat)",
    color = "Curve",
    size = "Games"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = ggplot2::element_blank(),
    plot.title = ggplot2::element_text(face = "bold"),
    legend.position = "bottom"
  )
Mean calibration error per forecaster (solid) and difference in mean forecasts (dashed), binned by (phat_A + phat_B) / 2 at the same grid as the reliability figure.
