Simulate genomic surveillance data — surv

Generates synthetic surveillance datasets with realistic features: multiple regions with unequal sequencing rates, multiple lineages with time-varying prevalence, configurable reporting delays, and multiple sample sources.

Usage

surv_simulate(
  n_regions = 5L,
  n_weeks = 26L,
  total_positive_per_week = 1000L,
  sequencing_rates = NULL,
  lineage_dynamics = NULL,
  delay_params = list(mu = 10, size = 3),
  sources = c("clinical", "wastewater", "sentinel"),
  source_weights = c(0.7, 0.2, 0.1),
  seed = NULL
)

Arguments

n_regions: Integer. Number of geographic regions. Default 5.
n_weeks: Integer. Number of epiweeks. Default 26.
total_positive_per_week: Integer. Mean number of positive cases per week, summed over all regions. Default 1000.
sequencing_rates: Numeric vector of length n_regions. Per-region sequencing probability. If NULL, rates are drawn from a Beta distribution so that they differ unevenly across regions. Default NULL.
lineage_dynamics: Named list of functions, each taking a week number and returning a positive weight. If NULL, uses a default four-lineage scenario. Default NULL.
delay_params: List with elements mu (mean) and size (dispersion) of the negative binomial reporting-delay distribution. Default list(mu = 10, size = 3).
sources: Character vector of sample source types. Default c("clinical", "wastewater", "sentinel").
source_weights: Numeric vector of relative weights, one per entry in sources and in the same order. Default c(0.7, 0.2, 0.1).
seed: Integer or NULL. Random seed. Default NULL.
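
The two structured arguments, lineage_dynamics and delay_params, can be sketched in base R as follows. This is a minimal illustration, not the package's internal code: the two-lineage scenario, the curve shapes, and the step that rescales weights to proportions are assumptions made for the example.

```r
# Hypothetical two-lineage scenario; names and curve shapes are illustrative only.
dynamics <- list(
  BA.5    = function(week) exp(-0.15 * week),        # declining lineage
  XBB.1.5 = function(week) exp(0.15 * (week - 13))   # emerging lineage
)

weeks <- 1:26

# Evaluate every function at every week:
# one row per week, one column per lineage, all entries positive.
weights <- sapply(dynamics, function(f) vapply(weeks, f, numeric(1)))

# Assumption: weights are rescaled to sum to 1 within each week.
prevalence <- weights / rowSums(weights)
round(head(prevalence, 3), 3)

# delay_params uses the mu/size form of the negative binomial, as in rnbinom().
delay_params <- list(mu = 10, size = 3)
set.seed(1)
delays <- rnbinom(1e4, mu = delay_params$mu, size = delay_params$size)

# Under this parameterisation the variance is mu + mu^2 / size.
delay_var <- delay_params$mu + delay_params$mu^2 / delay_params$size
c(sample_mean = mean(delays), sample_var = var(delays), theory_var = delay_var)

# With the package loaded, the list would be passed as documented above:
# sim <- surv_simulate(lineage_dynamics = dynamics, delay_params = delay_params)
```

A smaller size gives a heavier right tail at the same mean, which is one way to mimic occasional very late reports.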

Value

A named list with elements:

sequences: Tibble of individual sequence records.
population: Tibble with one row per region.
truth: Tibble of true lineage prevalence by region and week.
parameters: List of all input parameters.

Examples

sim <- surv_simulate(n_regions = 3, n_weeks = 8, seed = 42)
head(sim$sequences)
#> # A tibble: 6 × 7
#>   sequence_id region   source_type lineage collection_date report_date epiweek 
#>   <chr>       <chr>    <chr>       <chr>   <date>          <date>      <chr>   
#> 1 seq_1_1_1   Region_A wastewater  BA.5    2024-01-07      2024-01-10  2024-W01
#> 2 seq_1_1_2   Region_A clinical    BA.5    2024-01-05      2024-01-07  2024-W01
#> 3 seq_1_1_3   Region_A clinical    BA.5    2024-01-05      2024-01-12  2024-W01
#> 4 seq_1_1_4   Region_A clinical    BA.5    2024-01-04      2024-01-09  2024-W01
#> 5 seq_1_1_5   Region_A clinical    XBB.1.5 2024-01-02      2024-01-07  2024-W01
#> 6 seq_1_1_6   Region_A wastewater  XBB.1.5 2024-01-04      2024-01-04  2024-W01
sim$population
#> # A tibble: 3 × 5
#>   region   n_positive n_sequenced seq_rate pop_total
#>   <chr>         <int>       <int>    <dbl>     <int>
#> 1 Region_A       4784         265   0.0554    350844
#> 2 Region_B       1163         313   0.269      85291
#> 3 Region_C       2112         197   0.0933    154887
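
The printed output above can be checked by hand. The sketch below copies values from the two printouts into plain vectors and data frames, so it runs without the package. The inverse-probability weight column (ipw) is a hypothetical illustration of how unequal sequencing rates might be corrected for; it is not something surv_simulate() returns.

```r
# Values copied from the sim$population printout above.
population <- data.frame(
  region      = c("Region_A", "Region_B", "Region_C"),
  n_positive  = c(4784L, 1163L, 2112L),
  n_sequenced = c(265L, 313L, 197L)
)

# Recompute the sequencing rate; to three significant digits this
# reproduces the seq_rate column (0.0554, 0.269, 0.0933).
population$seq_rate <- population$n_sequenced / population$n_positive
signif(population$seq_rate, 3)

# Hypothetical inverse-probability weights: each sequence from a region
# with a low sequencing rate stands in for more positive cases.
population$ipw <- 1 / population$seq_rate
population

# Dates copied from the first six rows of sim$sequences.
collection_date <- as.Date(c("2024-01-07", "2024-01-05", "2024-01-05",
                             "2024-01-04", "2024-01-02", "2024-01-04"))
report_date <- as.Date(c("2024-01-10", "2024-01-07", "2024-01-12",
                         "2024-01-09", "2024-01-07", "2024-01-04"))

# Reporting delay in days for each of the six records.
delay_days <- as.integer(report_date - collection_date)
delay_days
```

Six records are too few to say anything about the delay distribution; the same calculation on the full sequences tibble would be needed for that.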