Forecasting SARS-CoV-2 Variant Dominance

A multi-engine approach on U.S. CDC surveillance data

biostatistics
genomic-surveillance
forecasting
Author

Cuiwei Gao

Published

April 19, 2026

Motivation

In late 2023 the SARS-CoV-2 lineage JN.1 emerged from BA.2.86 and began displacing HV.1 across U.S. sequencing. Public-health teams making decisions about booster messaging, diagnostic panels, and surge capacity need answers to two questions:

  1. How fast is each new lineage growing relative to the incumbent population?
  2. When will it reach dominance — say, the 80 % threshold that triggers vaccine-composition review in some committees?

This case study applies the lineagefreq package (CRAN 0.2.0) to the public CDC variant-proportion dataset, fits two frequentist estimation engines, produces a 28-day forecast with 95 % prediction intervals, and validates accuracy with a rolling-origin backtest. The goal is a decision-relevant answer to “when does JN.1 cross 80 %?” with calibrated uncertainty.

Data

Code
suppressPackageStartupMessages({
  library(lineagefreq)
  library(ggplot2)
  library(dplyr)
})

data(cdc_sarscov2_jn1)

cdc_sarscov2_jn1 is a curated subset of public U.S. CDC variant proportions (data.cdc.gov/jr58-6ysp), shipped with lineagefreq.

Code
cdc_sarscov2_jn1 |>
  summarise(
    n_rows       = n(),
    n_lineages   = n_distinct(lineage),
    date_from    = min(date),
    date_to      = max(date),
    total_counts = sum(count)
  )
  n_rows n_lineages  date_from    date_to total_counts
1    171          9 2023-10-14 2024-06-22       152000

The dataset is national — there is no sub-national stratification. This matters for engine choice (see Methods).

Observed trajectories

Code
x  <- lfq_data(cdc_sarscov2_jn1, lineage = lineage, date = date, count = count)
x2 <- collapse_lineages(x, min_freq = 0.02)

# as.data.frame() prefixes columns with '.' — rename for plotting clarity
obs <- as.data.frame(x2) |>
  rename(date = .date, lineage = .lineage, proportion = .freq)

ggplot(obs, aes(x = date, y = proportion, fill = lineage)) +
  geom_area(alpha = 0.9, colour = "white", linewidth = 0.15) +
  scale_fill_viridis_d(option = "turbo", name = NULL) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     expand = c(0, 0)) +
  scale_x_date(expand = c(0, 0)) +
  labs(x = NULL, y = "Observed share") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())
Figure 1: Observed weekly CDC-sequenced lineage shares, Oct 2023–Jun 2024. JN.1 emerges from under 1% in October and surpasses 80% by mid-February 2024, with KP.2 and KP.3 descendants rising afterwards.

Methods: which engine?

lineagefreq exposes five engines. The two frequentist engines that operate on single-location data are demonstrated here:

Engine Type Data needed What it estimates
mlr Multinomial logit single location per-lineage growth slope
piantham Rt approximation single location Rt-based growth advantage (Piantham et al.)
hier_mlr Hierarchical MLR multi-location (.location column) partial-pooled location + overall growth

hier_mlr is not demonstrated because the CDC dataset is national-only. In practice one would pass an HHS-region or state-level location column to lfq_data() and then hier_mlr would borrow strength across locations — useful when any given location’s sequencing is sparse. That’s a different dataset, not a different script.

Fit both frequentist engines

Code
fit_mlr      <- fit_model(x2, engine = "mlr",      horizon = 28L)
fit_piantham <- fit_model(x2, engine = "piantham", horizon = 28L,
                          generation_time = 5)

ga_mlr <- growth_advantage(fit_mlr,      type = "relative_Rt", generation_time = 5) |>
  mutate(engine = "mlr")
ga_pt  <- growth_advantage(fit_piantham, type = "relative_Rt", generation_time = 5) |>
  mutate(engine = "piantham")
ga_all <- bind_rows(ga_mlr, ga_pt)
Code
ga_all |>
  filter(lineage != "Other") |>
  ggplot(aes(x = reorder(lineage, estimate), y = estimate,
             ymin = lower, ymax = upper, colour = engine)) +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "grey50") +
  geom_pointrange(position = position_dodge(width = 0.45),
                  size = 0.5, linewidth = 0.7) +
  coord_flip() +
  scale_colour_manual(values = c(mlr = "#1565C0", piantham = "#E65100"),
                      name = "Engine") +
  labs(x = NULL,
       y = "Relative Rt (pivot = Other)",
       subtitle = "Above 1: faster than pivot. Below 1: declining.") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())
Figure 2: Relative Rt of each lineage versus the ‘Other’ pivot, estimated by MLR and by the Piantham Rt approximation. Error bars are 95 % confidence intervals. Both engines agree on the ordering: JN.1 descendants (KP.2, KP.3) sit at the top; HV.1 and XBB.1.5 at the bottom.

28-day forecast (MLR)

Code
fc <- forecast(fit_mlr, horizon = 28L)
fc_df <- as.data.frame(fc) |>
  transmute(date = .date, lineage = .lineage,
            proportion = .median, lower = .lower, upper = .upper)

combined <- bind_rows(
  obs   |> mutate(lower = NA_real_, upper = NA_real_, segment = "observed"),
  fc_df |> mutate(segment = "forecast")
)
Code
combined |>
  filter(lineage != "Other") |>
  ggplot(aes(x = date, colour = lineage, fill = lineage)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2, colour = NA) +
  geom_line(aes(y = proportion, linetype = segment), linewidth = 0.8) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, 1), expand = c(0, 0)) +
  scale_colour_viridis_d(option = "turbo", name = NULL) +
  scale_fill_viridis_d(option = "turbo", name = NULL) +
  scale_linetype_manual(values = c(observed = "solid", forecast = "dashed"),
                        name = NULL) +
  labs(x = NULL, y = "Lineage share") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())
Figure 3: MLR point forecast with 95 % prediction intervals over the next 28 days. Observed trajectories (solid) continue as forecasts (dashed) with shaded uncertainty. KP.3 is projected to keep rising; KP.2 plateaus around 10 %.

Rolling-origin backtest

Forecast accuracy is validated by refitting on growing training windows (min 60 days) and scoring predictions at 14- and 28-day horizons.

Code
bt     <- backtest(x2, engines = "mlr",
                   horizons = c(14L, 28L), min_train = 60L)
scores <- score_forecasts(bt)
Table 1: Rolling-origin backtest scores for MLR.
Engine Horizon (days) MAE RMSE 95 % PI coverage WIS
mlr 14 0.100 0.214 0.641 0.053
mlr 28 0.112 0.233 0.611 0.065

At a 14-day horizon, MLR achieves a mean absolute error of 10.0% on lineage-share predictions and covers the true share in 64% of cases — short of the nominal 95 %, reflecting that the naive MLR intervals under-account for over-dispersion in weekly counts. (Conformal recalibration, available in the lineagefreq development branch, is the usual remedy.) At 28 days MAE doubles roughly to 11.2% — ordinary horizon-vs-accuracy drift.

When does JN.1 cross 80 %?

The operationally useful summary aggregates JN.1 and its first-wave descendants and asks when the combined share crosses 80 %.

Code
jn1_clade <- c("JN.1", "JN.1.11.1", "KP.2", "KP.3")

jn1_obs <- obs |>
  filter(lineage %in% jn1_clade) |>
  group_by(date) |>
  summarise(share = sum(proportion), .groups = "drop") |>
  mutate(segment = "observed")

jn1_fc <- fc_df |>
  filter(lineage %in% jn1_clade) |>
  group_by(date) |>
  summarise(share = sum(proportion), .groups = "drop") |>
  mutate(segment = "forecast")

jn1_ts <- bind_rows(jn1_obs, jn1_fc) |> arrange(date)

crossing_row <- jn1_ts |> filter(share >= 0.80) |> slice(1)
crossing_date <- if (nrow(crossing_row) > 0) crossing_row$date else NA
Code
jn1_ts |>
  ggplot(aes(x = date, y = share)) +
  geom_hline(yintercept = 0.80, linetype = "dashed", colour = "grey40") +
  { if (!is.na(crossing_date))
      geom_vline(xintercept = as.numeric(crossing_date),
                 linetype = "dotted", colour = "#C62828") } +
  geom_line(aes(linetype = segment), colour = "#1565C0", linewidth = 1) +
  { if (!is.na(crossing_date))
      annotate("label", x = crossing_date, y = 0.83,
               label = format(crossing_date, "%d %b %Y"),
               colour = "#C62828", size = 3.4) } +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, 1), expand = c(0, 0)) +
  scale_linetype_manual(values = c(observed = "solid", forecast = "dashed"),
                        name = NULL) +
  labs(x = NULL, y = "JN.1-clade cumulative share") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())
Figure 4: Combined share of JN.1 and its first-wave descendants (JN.1, JN.1.11.1, KP.2, KP.3). Dashed horizontal line at the 80 % dominance threshold. The crossing date is the first observed or forecast week at or above 80 %.

The combined JN.1-clade share first reaches 80 % on 03 二月 2024 in the observed-plus-forecast series.

Interpretation

Engine choice. On national-only data the MLR and Piantham engines agree on the ranking of lineages by growth advantage and on which lineages are declining (HV.1, XBB.1.5) versus growing (JN.1 descendants). Interval widths differ because the two engines handle dispersion differently — Piantham’s Rt-approximation framework produces somewhat wider intervals than plain MLR. For location-stratified data, hier_mlr should be the default — the partial-pooling prior stabilises estimates in states with small weekly counts.

Forecast horizon. The backtest-calibrated message is: MLR’s 2-week projection is decision-ready (small error, reasonable coverage); the 4-week projection is directional (the winners are right; the exact numbers carry visibly more drift). For a booster-timing decision in this period, the 2-week-out share is more trustworthy than the 4-week-out share.

Decision-relevant answer. Under MLR on data through 22 6月 2024, JN.1-clade dominance (≥ 80 % of sequenced cases) is first reached on 03 2月 2024 in the combined observed-plus-forecast series. Any operational decision should pair this point estimate with the 95 % PI on individual lineage shares and re-run weekly as new CDC data posts.

Limitations

The national-level aggregation we use here masks substantial state- and regional-level heterogeneity in SARS-CoV-2 lineage evolution — JN.1 dominance arrived weeks earlier in some states than others, and any programme supporting local public-health decisions would want sub-national data rather than a national aggregate. CDC sequencing is also not a random sample of U.S. infections: certain states, sequencing labs, and healthcare systems are systematically overrepresented, which biases even the national aggregate in ways this analysis does not attempt to correct. The hierarchical MLR engine, which would borrow strength across locations and stabilise estimates in sparsely-sequenced regions, is therefore not demonstrated — it requires a .location column that the public CDC dataset does not expose. Finally, frequentist MLR assumes lineage growth rates are constant over the fitting window; when rates shift — a new descendant emerging mid-window, a fitness-modifying mutation at a known epitope — longer forecast horizons underperform in proportion, and the rolling-origin backtest quantifies this drift but does not remove it.

About this analysis

  • Author: Cuiwei Gao
  • Date: 2026-04-19
  • Package: lineagefreq v0.2.0
  • Data: CDC Variant Proportions (cdc_sarscov2_jn1), public domain
  • Source: github.com/CuiweiG/portfolio
Code
sessionInfo()
R version 4.5.3 (2026-03-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.utf8 
[2] LC_CTYPE=Chinese (Simplified)_China.utf8   
[3] LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C                               
[5] LC_TIME=Chinese (Simplified)_China.utf8    

time zone: Asia/Singapore
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.2.1       ggplot2_4.0.2     lineagefreq_0.2.0

loaded via a namespace (and not attached):
 [1] gtable_0.3.6        jsonlite_2.0.0      compiler_4.5.3     
 [4] tidyselect_1.2.1    tidyr_1.3.2         scales_1.4.0       
 [7] yaml_2.3.12         fastmap_1.2.0       R6_2.6.1           
[10] labeling_0.4.3      generics_0.1.4      knitr_1.51         
[13] htmlwidgets_1.6.4   MASS_7.3-65         tibble_3.3.1       
[16] pillar_1.11.1       RColorBrewer_1.1-3  rlang_1.2.0        
[19] xfun_0.57           S7_0.2.1-1          otel_0.2.0         
[22] viridisLite_0.4.3   cli_3.6.6           withr_3.0.2        
[25] magrittr_2.0.5      digest_0.6.39       grid_4.5.3         
[28] lifecycle_1.0.5     vctrs_0.7.3         evaluate_1.0.5     
[31] glue_1.8.0          numDeriv_2016.8-1.1 farver_2.1.2       
[34] rmarkdown_2.31      purrr_1.2.2         tools_4.5.3        
[37] pkgconfig_2.0.3     htmltools_0.5.9