rtForecastEval compares two real-time forecasters on
a common time grid. The schematic below matches the analysis order in
this vignette: prepare a long tibble, run the global delta
test (calc_Z → calc_eig →
calc_pval), and optionally summarize
pointwise loss with calc_L_s2 and
plot_pcb.
Typical workflow: global test (top) and pointwise loss plot (bottom).
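All functions below assume one long tibble with one row per (game, grid point) pair. A minimal sketch of that shape (the column names match the calc_Z() call later in this vignette; the values are illustrative only):

library(tibble)
toy <- tibble(
  game   = rep(1:2, each = 3),          # game identifier
  grid   = rep(c(0.25, 0.5, 0.75), 2),  # common time grid on [0, 1]
  Y      = rep(c(1, 0), each = 3),      # realized outcome, constant within a game
  phat_A = runif(6),                    # forecaster A's probability at each grid point
  phat_B = runif(6)                     # forecaster B's probability at each grid point
)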
The paper (preprint arXiv:2010.00781) has two kinds of graphics: (1) simulation-style figures, which this vignette reproduces in spirit, and (2) NBA application figures. The latter are not self-contained in
rtForecastEval because they require the scraped NBA
play-by-play inputs, the load_nba_data() /
pre_process() pipeline, and several bespoke plotting
scripts.
Reproduce them from the separate replication repository (RTPForNBA; historically bundled with the paper’s supplementary code), not from this package alone:
| Idea in the paper | Typical script (replication repo) | Depends on |
|---|---|---|
| Calibration surface (3D: time × forecast × centered residual) | plotting/surfacePlot.R | nba_FD.R (fits logit, builds my_df), plot3D, akima, lattice, binned CIs |
| Reliability-style plot at one time (e.g. mid-game) | plotting/calibrationPlot.R (see also the end of surfacePlot.R) | Same prepared df_try object |
| Skill curves / \(\hat\Delta_n(t)\) for ESPN vs models (PgRS, LTW, etc.) | plotting/1_PgRS.R, 2_LTW.R, 3_CF.R, … | utility.R (calc_L_s2, L_smoothing), nba_FD.R |
| Simulation figures (oracle, score difference, etc.) | simulation/PF-sim/plot_sim_*.R | simulation/PF-sim/simulator.R and its generated outputs |
Suggested order in the replication repo: prepare data under data/, run nba_FD.R (or the season scripts it sources), then source() the relevant file under plotting/ with the working directory set to that project root.
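A session in the replication repo might therefore look like the sketch below; the script names come from the table above, but the exact layout is an assumption, so adjust the paths to the actual repo.

setwd("~/RTPForNBA")                  # project root (hypothetical path)
source("nba_FD.R")                    # fits the logit, builds my_df / df_try
source("plotting/surfacePlot.R")      # 3D calibration surface
source("plotting/calibrationPlot.R")  # reliability-style plot at one time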
The rest of this vignette focuses on (1).
library(rtForecastEval)  # df_gen, calc_Z, calc_eig, calc_pval, calc_L_s2, plot_pcb
library(dplyr)           # group_by(), mutate(), summarise(), ntile(), ...
library(ggplot2)
library(tibble)
library(rlist)           # list.rbind()
library(MASS)
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#>     select
nsamp <- 100 # number of in-game events
ngame <- 200 # number of games (the README example uses the same values, for comparable figures)
# Parameters for the eigenvalue truncation and the Monte Carlo p-value
D <- 10      # number of eigenvalues to keep
N_MC <- 5000 # Monte Carlo draws for simulating the p-value
# Squared-error (Brier) loss
L <- function(x, y) {
  (x - y)^2
}
# Data generation ---------------------------------------------------------
# df_gen() ships with the package; "equ" marks the equal-skill setting
df_equ <- df_gen(N = nsamp, Ngame = ngame) %>%
  group_by(grid) %>%
  mutate(
    p_bar_12 = mean(phat_A - phat_B),        # per-grid mean forecast gap
    diff_non_cent = phat_A - phat_B,         # raw forecast difference
    diff_cent = phat_A - phat_B - p_bar_12   # centered forecast difference
  ) %>%
  ungroup()
# Apply our test (explicit construction, as in utility.R) ----------------
# Z = ngame * (1/nsamp) * sum_t delta_n(t)^2, a Riemann sum over the grid
Z <- df_equ %>%
  group_by(grid) %>%
  summarise(delta_n = mean(L(phat_A, Y) - L(phat_B, Y))) %>%
  {
    sum((.)$delta_n^2) / nsamp * ngame
  }
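In symbols, with \(m\) = ngame games and \(n\) = nsamp grid points, the code above computes

\[
\hat\Delta_n(t) = \frac{1}{m} \sum_{g=1}^{m} \Bigl\{ L\bigl(\hat p_{A,g}(t), Y_g\bigr) - L\bigl(\hat p_{B,g}(t), Y_g\bigr) \Bigr\},
\qquad
Z = \frac{m}{n} \sum_{t \in \text{grid}} \hat\Delta_n(t)^2 \approx m \int_0^1 \hat\Delta_n(t)^2 \, dt .
\]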
temp <- df_equ %>% group_split(grid, .keep = FALSE)  # one tibble per grid point
# nsamp x nsamp Gram matrix of the non-centered difference process, then its
# top-D eigenvalues (divided by nsamp to approximate the integral operator)
eigV_hat <- lapply(1:nsamp, function(i) {
  sapply(1:nsamp, function(j) {
    as.numeric(temp[[i]]$diff_non_cent %*% temp[[j]]$diff_non_cent / ngame)
  })
}) %>% list.rbind() %>% {
  RSpectra::eigs_sym(
    A = (.),
    k = D,
    which = "LM",
    opts = list(retvec = FALSE)
  )$values
} %>%
  {
    (.) / nsamp
  }
# Same construction with the centered differences
eigV_til <- lapply(1:nsamp, function(i) {
  sapply(1:nsamp, function(j) {
    as.numeric(temp[[i]]$diff_cent %*% temp[[j]]$diff_cent / ngame)
  })
}) %>% list.rbind() %>% {
  RSpectra::eigs_sym(
    A = (.),
    k = D,
    which = "LM",
    opts = list(retvec = FALSE)
  )$values
} %>%
  {
    (.) / nsamp
  }
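These eigenvalues feed a truncated weighted chi-square approximation to the null distribution of \(Z\); the Monte Carlo step below draws from

\[
\sum_{j=1}^{D} \hat\lambda_j \, \chi^2_{1,j}, \qquad \chi^2_{1,j} \ \text{i.i.d. chi-square with one degree of freedom},
\]

which is exactly what each crossprod() call computes for one draw.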
# Null draws: sum_j lambda_hat_j * chisq_1, with the non-centered eigenvalues
MC_hat <- sapply(1:N_MC, function(x) {
  crossprod(eigV_hat, rchisq(D, df = 1))
})
q_90_hat <- quantile(MC_hat, 0.90)
q_95_hat <- quantile(MC_hat, 0.95)
q_99_hat <- quantile(MC_hat, 0.99)
# Same with the centered eigenvalues
MC_til <- sapply(1:N_MC, function(x) {
  crossprod(eigV_til, rchisq(D, df = 1))
})
q_90_til <- quantile(MC_til, 0.90)
q_95_til <- quantile(MC_til, 0.95)
q_99_til <- quantile(MC_til, 0.99)
# Monte Carlo p-value from the non-centered null draws; both rows of the
# table below reuse it (the centered analogue would be 1 - ecdf(MC_til)(Z))
p_hat <- 1 - ecdf(MC_hat)(Z)
tibble(
type = c("non-center", "center"),
Z = rep(Z, 2),
"pval" = c(p_hat, p_hat),
"90%" = c(q_90_hat, q_90_til),
"95%" = c(q_95_hat, q_95_til),
"99%" = c(q_99_hat, q_99_til)
)
#> # A tibble: 2 × 6
#> type Z pval `90%` `95%` `99%`
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 non-center 0.0574 0.719 0.442 0.627 0.991
#> 2 center 0.0574 0.719 0.449 0.631 1.04

to_center <- FALSE
# The same test via the package API; results should match the explicit
# construction above (up to Monte Carlo noise in the p-value)
ZZ <- calc_Z(df = df_equ, pA = "phat_A", pB = "phat_B", Y = "Y", nsamp = nsamp, ngame = ngame)
eigg <- calc_eig(
  df = df_equ, n_eig = D, ngame = ngame,
  nsamp = nsamp, grid = "grid", cent = to_center
)
oh <- calc_pval(ZZ, eig = eigg, quan = c(0.90, 0.95, 0.99), n_MC = N_MC)
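A quick sanity check can compare the two routes. The sketch below assumes calc_Z() returns the scalar statistic and calc_eig() the vector of eigenvalues; check the package documentation if the return shapes differ.

# Assumed return shapes: calc_Z() -> scalar, calc_eig() -> numeric vector
all.equal(as.numeric(ZZ), Z)
all.equal(as.numeric(eigg), eigV_hat, tolerance = 1e-6)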
temp <- calc_L_s2(df = df_equ, pA = "phat_A", pB = "phat_B")
plot_pcb(df = temp)

Pointwise mean loss difference (A vs B) with a naive normal band, simulation setting. This is a skill curve, not a calibration diagram.
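The "naive normal band" presumably has the usual pointwise form below (an assumption about plot_pcb(); consult the package source to confirm), which ignores the across-time dependence that the global test accounts for:

\[
\hat\Delta_n(t) \pm 1.96\, \frac{\widehat{\mathrm{sd}}\{ d_g(t) \}}{\sqrt{m}},
\qquad
d_g(t) = L\bigl(\hat p_{A,g}(t), Y_g\bigr) - L\bigl(\hat p_{B,g}(t), Y_g\bigr).
\]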
The previous figure tracks skill: which forecaster incurs less Brier loss over time. A complementary check is marginal calibration: within a narrow time slice, do predicted probabilities match observed event frequencies? Grey vertical segments show a 95% central range for \(\bar Y\) under \(H_0\) in each bin: \(n\bar Y \sim \mathrm{Binomial}(n, p)\) with \(p = \overline{\hat p}\), using exact qbinom bounds rather than a normal approximation with clipping, which can distort small-sample bands. The paper’s NBA figures use richer calibration surfaces (time × forecast × residual); here we only bin forecasts at a single grid point (the one closest to mid-game, \(t \approx 0.5\)) and plot mean outcome vs mean forecast, a standard reliability diagram.
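For intuition about the exact bounds: in a bin of \(n = 20\) games with \(p = 0.5\), the central 95% range for \(\bar Y\) is

# Exact central 95% range for mean(Y) in a bin of 20 games under H0: p = 0.5
stats::qbinom(c(0.025, 0.975), size = 20, prob = 0.5) / 20
#> [1] 0.3 0.7

so a bin mean outside \([0.30, 0.70]\) would be surprising under perfect calibration there.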
# Grid point closest to mid-game (t = 0.5)
g_mid <- df_equ %>%
  distinct(grid) %>%
  slice_min(order_by = abs(grid - 0.5), n = 1) %>%
  pull(grid)
# One row per game at that grid point
slice_t <- df_equ %>%
  filter(grid == g_mid) %>%
  transmute(game, Y, phat_A, phat_B)
rel_long <- slice_t %>%
  tidyr::pivot_longer(c(phat_A, phat_B), names_to = "forecaster", values_to = "phat") %>%
  mutate(forecaster = ifelse(forecaster == "phat_A", "Forecaster A", "Forecaster B"))
# Decile bins per forecaster; exact Binomial 95% range under H0: p = mean forecast
rel_binned <- rel_long %>%
  group_by(forecaster) %>%
  mutate(bin = ntile(phat, 10)) %>%
  group_by(forecaster, bin) %>%
  summarise(
    mean_forecast = mean(phat),
    mean_outcome = mean(Y),
    n_games = dplyr::n(),
    .groups = "drop"
  ) %>%
  mutate(
    p_bin = pmin(pmax(mean_forecast, 0), 1),  # guard against numeric drift outside [0, 1]
    ci_lo = stats::qbinom(0.025, size = n_games, prob = p_bin) / n_games,
    ci_hi = stats::qbinom(0.975, size = n_games, prob = p_bin) / n_games
  )
ggplot(rel_binned, aes(mean_forecast, mean_outcome)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "grey45") +
geom_errorbar(
aes(x = mean_forecast, ymin = ci_lo, ymax = ci_hi),
width = 0.02,
color = "grey55",
alpha = 0.85,
linewidth = 0.35
) +
geom_point(aes(size = n_games), alpha = 0.9, color = "darkred") +
facet_wrap(~forecaster, nrow = 1) +
coord_fixed(xlim = c(0, 1), ylim = c(0, 1), ratio = 1) +
labs(
title = "Binned reliability (calibration) at one grid",
subtitle = paste0(
"Grid = ", signif(g_mid, 3),
" (closest to 0.5); grey: exact 95% Binomial range for mean(Y) under H0"
),
x = "Mean forecast in bin",
y = "Mean outcome (Y) in bin",
size = "Games"
) +
theme_minimal(base_size = 12) +
theme(
panel.grid.minor = ggplot2::element_blank(),
plot.title = ggplot2::element_text(face = "bold")
)

Binned reliability diagram at a fixed grid point (closest to 0.5): mean outcome vs mean forecast for A and B; grey bars are the exact 95% central Binomial range for mean(Y) under H0. Points near the diagonal indicate better marginal calibration at that time.
The reliability diagram shows levels on the 45° plot; here we instead plot calibration errors \(\bar Y - \bar{\hat p}_k\) against the bin midpoint. The dashed line is the difference in mean forecasts, \(\bar{\hat p}_B - \bar{\hat p}_A\).
# cal_bins is not constructed elsewhere in this vignette; the definition below
# is an assumed reconstruction matching the axis label and caption: bin the
# mid-game slice by the forecast midpoint (phat_A + phat_B) / 2
cal_bins <- slice_t %>%
  mutate(
    mid = (phat_A + phat_B) / 2,
    bin = cut(mid, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)
  )
cal_compare <- cal_bins %>%
  group_by(bin) %>%
  summarise(
    mid_hat = mean(mid),
    cal_err_A = mean(Y) - mean(phat_A),  # calibration error of A in the bin
    cal_err_B = mean(Y) - mean(phat_B),  # calibration error of B in the bin
    n_games = dplyr::n(),
    .groups = "drop"
  ) %>%
  filter(!is.na(bin))
cal_long <- cal_compare %>%
tidyr::pivot_longer(
c(cal_err_A, cal_err_B),
names_to = "which",
values_to = "cal_err"
) %>%
mutate(
which = ifelse(which == "cal_err_A", "A: mean(Y) - phat_A", "B: mean(Y) - phat_B")
)
ggplot(cal_long, aes(mid_hat, cal_err, color = which)) +
geom_hline(yintercept = 0, linetype = "dashed", color = "grey70") +
geom_line(linewidth = 0.85) +
geom_point(aes(size = n_games), alpha = 0.85) +
  geom_line(
    data = cal_compare,
    # cal_err_A - cal_err_B = mean(phat_B) - mean(phat_A), matching the subtitle
    aes(mid_hat, cal_err_A - cal_err_B),
    inherit.aes = FALSE,
    color = "grey35",
    linetype = "dashed",
    linewidth = 0.65
  ) +
labs(
title = "Calibration comparison (same midpoint bins)",
subtitle = "Solid: mean calibration error per forecaster; dashed: mean(phat_B) - mean(phat_A)",
x = "Mean (phat_A + phat_B) / 2 in bin",
y = "mean(Y) - mean(phat)",
color = "Curve",
size = "Games"
) +
theme_minimal(base_size = 12) +
theme(
panel.grid.minor = ggplot2::element_blank(),
plot.title = ggplot2::element_text(face = "bold"),
legend.position = "bottom"
)

Mean calibration error per forecaster (solid) and difference in mean forecasts (dashed), binned by (phat_A + phat_B) / 2 at the same grid point as the reliability figure.