Optimising Sequencing Budgets for Pathogen Surveillance

Design-adjusted inference and Neyman allocation under unequal sampling

genomic-surveillance
survey-sampling
public-health
Author

Cuiwei Gao

Published

April 19, 2026

Motivation

Pathogen surveillance systems sequence unevenly. A European programme might have Denmark sequencing 12 % of confirmed cases while Romania sequences 0.3 %, a 40× difference. If you estimate a lineage’s continental prevalence by pooling the sequenced samples and counting, your answer is dominated by Denmark’s local epidemic, regardless of what is actually circulating in the EU.

The same asymmetry appears in time: a fast-growing variant looks rarer than it really is while its sequences are still travelling through the lab towards a report. And the budget question ("we have \(N\) sequences a week; how should we split them across regions to best detect an emerging lineage?") has a formal answer from survey-sampling theory (Neyman 1934), not an informal one.

This case study applies survinger (CRAN 0.1.1) to a realistic simulated surveillance dataset shipped with the package, and walks through four decisions every surveillance programme implicitly makes:

  1. How to estimate prevalence when sequencing rates differ.
  2. How to nowcast around report delays.
  3. How to allocate a fixed sequencing budget across regions.
  4. How to size a programme for reliable early detection.

Data

Code
suppressPackageStartupMessages({
  library(survinger)
  library(ggplot2)
  library(dplyr)
})

data(sarscov2_surveillance)
ss <- sarscov2_surveillance

sarscov2_surveillance is a simulated 6-month programme across 5 regions with deliberately unequal sequencing rates — the package ships the ground truth alongside the observed data so the effect of methodological choices can be measured, not guessed.

Code
ss$population |>
  transmute(Region = region,
            `Positive cases` = n_positive,
            `Sequenced` = n_sequenced,
            `Rate` = scales::percent(seq_rate, accuracy = 0.1),
            `Population` = scales::comma(pop_total))
Table 1
# A tibble: 5 × 5
  Region   `Positive cases` Sequenced Rate  Population
  <chr>               <int>     <int> <chr> <chr>     
1 Region_A             6784       983 14.5% 1,233,454 
2 Region_B             2117       183 8.6%  384,909   
3 Region_C             2385        72 3.0%  433,636   
4 Region_D             8712        84 1.0%  1,584,000 
5 Region_E             6095        27 0.4%  1,108,181 

The sequencing rate spans 0.4% to 14.5% across the five regions — a roughly 36× gap.

Step 1 — A surveillance design object

Code
d <- surv_design(
  data            = ss$sequences,
  strata          = ~ region,
  sequencing_rate = ss$population[, c("region", "seq_rate")],
  population      = ss$population
)
d

The design object carries the sampling weights (inverse of sequencing rates). Every estimator downstream uses them.
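The weighting logic is easy to see by hand. Below is a toy sketch of the idea, not the survinger API: the Hajek estimator is an inverse-probability-weighted proportion, normalised by the sum of the weights, with all numbers invented for illustration.

```r
# Toy illustration (not survinger internals): two regions, one sequencing
# 10% of its cases and one sequencing 1%.  Each sequence gets weight
# w_i = 1 / (its region's sequencing rate); the Hajek estimate of lineage
# prevalence is the weighted share of lineage-positive sequences.
seqs <- data.frame(
  region     = c("A", "A", "B", "B", "B"),
  is_lineage = c(1, 0, 1, 1, 0)
)
rates <- c(A = 0.10, B = 0.01)
w     <- 1 / rates[seqs$region]

naive <- mean(seqs$is_lineage)              # 0.6     -- ignores the design
hajek <- sum(w * seqs$is_lineage) / sum(w)  # 0.65625 -- up-weights region B
```

The under-sequenced region’s signal is restored by its larger weight; that is the whole adjustment.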

Step 2 — Naive vs design-adjusted prevalence

For an emerging lineage such as BA.2.86, how much does sequencing-rate inequality bias a naive estimate? The package ships the simulation ground truth, so we can plot naive and Hajek-adjusted estimates against it directly.

Code
naive <- surv_naive_prevalence(d, lineage = "BA.2.86")$estimates |>
  mutate(method = "Naive (unweighted)")

hajek <- surv_lineage_prevalence(d, lineage = "BA.2.86",
                                 method = "hajek")$estimates |>
  mutate(method = "Design-adjusted (Hajek)")

truth <- ss$truth |>
  filter(lineage == "BA.2.86") |>
  group_by(epiweek) |>
  summarise(
    prevalence = weighted.mean(true_prevalence,
                               w = ss$population$pop_total[match(region, ss$population$region)]),
    .groups    = "drop"
  ) |>
  rename(time = epiweek) |>
  mutate(method = "Truth (population-weighted)")

compare <- bind_rows(
  naive |> select(time, prevalence, ci_lower, ci_upper, method),
  hajek |> select(time, prevalence, ci_lower, ci_upper, method),
  truth |> mutate(ci_lower = NA_real_, ci_upper = NA_real_)
) |>
  # strptime() accepts but ignores "%U"/"%u" on input, so build the
  # week-start date arithmetically from the "YYYY-Www" label instead
  mutate(time = as.Date(paste0(sub("-W.*", "", time), "-01-01")) +
           (as.integer(sub(".*-W", "", time)) - 1L) * 7L)
Code
ggplot(compare, aes(x = time, y = prevalence, colour = method, fill = method)) +
  geom_ribbon(aes(ymin = ci_lower, ymax = ci_upper),
              alpha = 0.2, colour = NA) +
  geom_line(linewidth = 0.9) +
  scale_colour_manual(values = c(
    "Naive (unweighted)"          = "#C62828",
    "Design-adjusted (Hajek)"     = "#1565C0",
    "Truth (population-weighted)" = "#2E7D32"
  ), name = NULL) +
  scale_fill_manual(values = c(
    "Naive (unweighted)"          = "#C62828",
    "Design-adjusted (Hajek)"     = "#1565C0",
    "Truth (population-weighted)" = "#2E7D32"
  ), name = NULL) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(x = NULL, y = "BA.2.86 prevalence") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(),
        legend.position = "top")
Figure 1: Estimated BA.2.86 prevalence over 26 epiweeks: naive (unweighted) versus design-adjusted (Hajek) versus population-weighted ground truth. The naive estimator is dominated by the high-sequencing regions (A, B) and systematically misrepresents the population trajectory. Hajek tracks the truth within the 95 % CI. Ribbons are the 95 % CIs.

Averaged across the 26 weeks, the naive estimator is off by 7.8% (mean absolute error versus the known population truth); the Hajek estimator is off by 4.4%, roughly 0.6× the naive error. The entire gap is a design issue; no model tuning is involved.

Step 3 — Delay-adjusted nowcasting

Sequences arrive late. The report_date comes days to weeks after the collection_date, so the most recent weeks in a “raw” estimate are always under-counted. surv_estimate_delay() fits a delay distribution (negative binomial here, more flexible than Poisson because delay tails are heavy) and surv_adjusted_prevalence() corrects for it.

Code
dfit    <- surv_estimate_delay(d, distribution = "negbin")
adjusted <- surv_adjusted_prevalence(d, delay_fit = dfit,
                                     lineage = "BA.2.86")
Code
parse_epiweek <- function(x) {
  # strptime() accepts but ignores "%U"/"%u" on input, so compute the
  # week-start date directly from the "YYYY-Www" label
  year <- as.integer(sub("-W.*", "", x))
  week <- as.integer(sub(".*-W", "", x))
  as.Date(paste0(year, "-01-01")) + (week - 1L) * 7L
}

adj_df <- adjusted$estimates |>
  mutate(time = parse_epiweek(time))
raw_df <- hajek |>
  mutate(time = parse_epiweek(time))

ggplot() +
  geom_ribbon(data = adj_df,
              aes(x = time, ymin = ci_lower, ymax = ci_upper),
              fill = "#C62828", alpha = 0.18) +
  geom_line(data = raw_df, aes(x = time, y = prevalence),
            colour = "#1565C0", linewidth = 0.9) +
  geom_line(data = adj_df, aes(x = time, y = prevalence),
            colour = "#C62828", linewidth = 0.9, linetype = "dashed") +
  annotate("text", x = min(raw_df$time) + 40,
           y = max(raw_df$prevalence, na.rm = TRUE) * 0.9,
           label = "Hajek (raw)", colour = "#1565C0",
           hjust = 0, size = 3.8, fontface = "bold") +
  annotate("text", x = max(adj_df$time),
           y = max(adj_df$prevalence, na.rm = TRUE),
           label = "Delay-adjusted",
           colour = "#C62828", hjust = 1, vjust = 1.4,
           size = 3.8, fontface = "bold") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(x = NULL, y = "BA.2.86 prevalence") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())
Figure 2: Raw Hajek prevalence (blue) versus delay-adjusted prevalence (red). The delay adjustment uses a fitted negative-binomial report-delay distribution to up-weight the most-recent weeks whose reports are still arriving; without the correction, a rising variant looks flat at the right-hand edge. Ribbon is the 95 % CI of the delay-adjusted estimate.
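The mechanics behind the adjustment amount to a reporting-completeness correction. The sketch below is illustrative only, not survinger’s internals, and the delay parameters and counts are invented:

```r
# If F(d) = P(reported within d weeks of collection), then at reporting week T
# only a fraction F(T - t) of week-t sequences is visible yet, so dividing
# observed counts by that fraction inflates the still-incomplete recent weeks.
F_delay   <- pnbinom(0:5, size = 2, mu = 1.5)  # assumed neg-binomial delay CDF
obs       <- c(40, 38, 35, 30, 18, 6)          # observed counts, oldest week first
weeks_ago <- 5:0                               # lag from collection week to "now"
nowcast   <- obs / F_delay[weeks_ago + 1]      # most recent weeks inflated the most
```

A full nowcast also propagates the uncertainty of the fitted delay distribution into the interval, which is why the ribbon in Figure 2 widens at the right-hand edge.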

Step 4 — Optimal allocation of a fixed budget

Suppose the programme is capped at 500 sequences per week. How should those 500 be split across the five regions to minimise the MSE of overall prevalence? surv_optimize_allocation() returns the Neyman allocation that balances within-stratum variance and population size.

Code
alloc <- surv_optimize_allocation(d, objective = "min_mse",
                                  total_capacity = 500L)
alloc$allocation
# A tibble: 5 × 3
  region   n_allocated proportion
  <chr>          <int>      <dbl>
1 Region_A         130      0.26 
2 Region_B          42      0.084
3 Region_C          44      0.088
4 Region_D         166      0.332
5 Region_E         118      0.236
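The formula behind this is the classic Neyman allocation: stratum \(h\) receives \(n_h = n \, N_h S_h / \sum_k N_k S_k\), where \(N_h\) is the stratum population and \(S_h\) its within-stratum standard deviation. A hand-rolled sketch, with the standard deviations invented for illustration (the package derives them from the data):

```r
# Textbook Neyman allocation, not the survinger implementation.
N_h <- c(A = 1233454, B = 384909, C = 433636, D = 1584000, E = 1108181)  # Table 1
S_h <- c(A = 0.10,    B = 0.11,   C = 0.10,   D = 0.12,    E = 0.12)     # assumed
n   <- 500
n_h <- round(n * N_h * S_h / sum(N_h * S_h))
n_h  # larger and more variable strata get more of the budget
```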
Code
current <- ss$population |>
  mutate(proportion = n_sequenced / sum(n_sequenced),
         n_allocated = round(proportion * 500)) |>
  transmute(region, n_allocated, proportion, allocation = "Current (scaled to 500)")

optimal <- alloc$allocation |>
  mutate(allocation = "Neyman optimal (min-MSE)")

alloc_df <- bind_rows(current, optimal)

ggplot(alloc_df, aes(x = region, y = proportion, fill = allocation)) +
  geom_col(position = position_dodge(width = 0.7), width = 0.6) +
  geom_text(aes(label = scales::percent(proportion, accuracy = 1)),
            position = position_dodge(width = 0.7),
            vjust = -0.3, size = 3.5, fontface = "bold") +
  scale_fill_manual(values = c("Current (scaled to 500)" = "#546E7A",
                               "Neyman optimal (min-MSE)" = "#1565C0"),
                    name = NULL) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     expand = expansion(mult = c(0, 0.15))) +
  labs(x = NULL, y = "Share of 500 weekly sequences") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(),
        legend.position = "top")
Figure 3: Current sequencing allocation (observed counts normalised to 500) versus Neyman-optimal allocation for minimising MSE of overall prevalence. Region A absorbs nearly three-quarters of current capacity but warrants about a quarter; Regions D and E are under-sequenced relative to their information content.

Step 5 — How large a programme detects a rare lineage?

Finally: at current capacity, what is the probability of catching a hypothetical lineage at 1 % true prevalence within a single week? surv_detection_probability() answers this per-stratum under the existing allocation.

Code
det <- surv_detection_probability(d, true_prevalence = 0.01,
                                  n_periods = 1L,
                                  detection_threshold = 1L)
det$by_stratum
# A tibble: 5 × 3
  stratum_id n_seq_per_period p_detect
       <int>            <dbl>    <dbl>
1          1            37.8    0.316 
2          2             7.04   0.0683
3          3             2.77   0.0274
4          4             3.23   0.0319
5          5             1.04   0.0104
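The per-stratum numbers can be reproduced with standard binomial logic (presumably close to what surv_detection_probability() computes): a lineage at true prevalence \(p\) escapes \(n\) sequences with probability \((1-p)^n\), and, treating strata as independent, the overall miss probability is the product of the per-stratum misses.

```r
p   <- 0.01
n_h <- c(37.8, 7.04, 2.77, 3.23, 1.04)  # expected sequences per week, from above
p_detect_h <- 1 - (1 - p)^n_h           # reproduces the p_detect column above
overall    <- 1 - prod((1 - p)^n_h)     # ~0.41, Figure 4's value at 1% prevalence
```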
Code
prev_grid <- seq(0.002, 0.05, length.out = 20)
curve_df <- do.call(rbind, lapply(prev_grid, function(p) {
  r <- surv_detection_probability(d, true_prevalence = p,
                                  n_periods = 1L,
                                  detection_threshold = 1L)
  data.frame(true_prevalence = p, p_detect = r$overall)
}))
Code
ggplot(curve_df, aes(x = true_prevalence, y = p_detect)) +
  geom_line(colour = "#1565C0", linewidth = 1) +
  geom_hline(yintercept = 0.9, linetype = "dashed", colour = "grey50") +
  scale_x_continuous(labels = scales::percent_format(accuracy = 0.1)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(0, 1)) +
  labs(x = "True prevalence of a target lineage",
       y = "Probability of ≥1 observation in a week") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())
Figure 4: Overall detection probability in a single week as a function of a lineage’s true prevalence, at current sequencing capacity and allocation. A variant at 1 % prevalence has roughly 40 % probability of being observed in any given week; to cross 90 % detection probability requires true prevalence above ~3 %.
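The 90 % threshold can also be inverted analytically. If detection means at least one observation and the lineage sits at the same prevalence \(p\) in every stratum, the per-stratum misses multiply to \((1-p)^{\sum_h n_h}\), so only total capacity matters and \(1-(1-p)^n \ge 0.9\) gives \(n \ge \log 0.1 / \log(1-p)\). A quick sketch:

```r
# Weekly sequences needed for >= 90% probability of observing a lineage at
# prevalence p at least once (assumes equal prevalence across strata, in
# which case the split across strata does not affect the answer).
n_required <- function(p, target = 0.9) ceiling(log(1 - target) / log(1 - p))
n_required(0.01)  # 230 sequences/week for a 1%-prevalence lineage
n_required(0.03)
```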

Interpretation

Bias is a design problem, not a model problem. Switching from naive pooling to the Hajek estimator cut MAE to roughly 0.6× the naive value on this dataset; the cost was one line of code and the sequencing-rate table that the programme already maintains for its own reporting. Any downstream use of surveillance estimates (nowcasts, forecasts, growth-advantage fits) should start from a design-adjusted estimator, not a raw pooled one.

Budgets aren’t obvious. The current allocation is proportional to each region’s sequencing output, which tracks programme capacity rather than statistical need. The Neyman-optimal allocation redistributes toward the strata whose variance contributions to the overall estimate are largest: for this programme, Regions D and E, both currently under-sequenced relative to their information content. Reallocating weekly capacity accordingly would be a zero-cost improvement to overall-prevalence precision.

Early-warning realism. At current capacity, the probability of detecting a 1 %-prevalence emerging variant within a single week is only about 40 %, worse than a coin flip. Reaching a 90 % one-week detection probability requires either more sequences per week or a variant already past ~3 % prevalence. That framing, a detection probability rather than a detection guarantee, is the one a programme should be able to produce from its design alone, and survinger makes it one call.

Limitations

  • Stationary delays. The delay distribution we fit with surv_estimate_delay() is treated as stationary across the six-month window; in practice laboratory delays lengthen sharply during epidemic peaks (testing backlog, reagent shortages, weekend staffing) and shorten during troughs, so a production system would refit the delay model weekly rather than once.
  • Known variances. Neyman allocation assumes stratum-level variances are known; in this simulated dataset they are — the ground truth is shipped alongside the data — but a real programme would have to estimate them from pilot data, with attendant uncertainty that would widen the optimal allocation’s credible intervals.
  • One pathogen. We treat the sequencing programme in isolation, whereas most surveillance systems run multiple pathogens on shared laboratory capacity, a joint-allocation problem we do not address.
  • Operational constraints. Political and resource constraints — fixed per-region agreements, ring-fenced budget lines, unequal sample access — are not modelled; the Neyman allocation here is a mathematical optimum, not a deployable schedule, and the gap between the two is where most of the operational work lives.

About this analysis

  • Author: Cuiwei Gao
  • Date: 2026-04-19
  • Package: survinger v0.1.1
  • Data: sarscov2_surveillance (simulated 5-region, 26-week SARS-CoV-2 programme shipped with survinger)
  • Source: github.com/CuiweiG/portfolio
Code
sessionInfo()
R version 4.5.3 (2026-03-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.utf8 
[2] LC_CTYPE=Chinese (Simplified)_China.utf8   
[3] LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C                               
[5] LC_TIME=Chinese (Simplified)_China.utf8    

time zone: Asia/Singapore
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.2.1     ggplot2_4.0.2   survinger_0.1.1

loaded via a namespace (and not attached):
 [1] vctrs_0.7.3        cli_3.6.6          knitr_1.51         rlang_1.2.0       
 [5] xfun_0.57          otel_0.2.0         purrr_1.2.2        generics_0.1.4    
 [9] S7_0.2.1-1         jsonlite_2.0.0     labeling_0.4.3     glue_1.8.0        
[13] backports_1.5.1    htmltools_0.5.9    scales_1.4.0       rmarkdown_2.31    
[17] grid_4.5.3         tibble_3.3.1       evaluate_1.0.5     fastmap_1.2.0     
[21] yaml_2.3.12        lifecycle_1.0.5    compiler_4.5.3     RColorBrewer_1.1-3
[25] pkgconfig_2.0.3    htmlwidgets_1.6.4  farver_2.1.2       digest_0.6.39     
[29] R6_2.6.1           utf8_1.2.6         tidyselect_1.2.1   pillar_1.11.1     
[33] magrittr_2.0.5     checkmate_2.3.4    withr_3.0.2        tools_4.5.3       
[37] gtable_0.3.6