library(lineagefreq)
# Load the BA.1 → BA.2 transition dataset
data("cdc_ba2_transition")
# Fit a multinomial logistic regression model
fit <- fit_model(
data = cdc_ba2_transition,
engine = "mlr"
)
# Estimate growth advantages relative to the reference lineage
ga <- growth_advantage(fit)
gaIntroducing lineagefreq: Tracking Pathogen Variant Dynamics in R
When a new pathogen variant begins circulating, one of the first questions public health agencies ask is: how fast is it growing relative to existing lineages? Answering that question from raw sequence counts is harder than it looks. Counts are noisy, sampling is uneven across time and geography, and the multinomial nature of the data — every sequence belongs to exactly one lineage — means you cannot just fit separate logistic curves and call it a day.
lineagefreq is an R package that tackles this problem end to end. It fits multinomial logistic regression models to genomic surveillance count data, estimates relative growth advantages between lineages, generates short-term frequency forecasts, and validates those forecasts with rigorous rolling-origin backtesting. The current CRAN release is 0.2.0; the development version (0.6.0) on GitHub adds several new features.
Five Engines, One Interface
A key design decision in lineagefreq is offering multiple estimation engines behind a single fit_model() interface. You choose the engine with one argument; everything else — data format, output structure, downstream methods — stays the same.
The five engines are:
mlr— standard multinomial logistic regression (frequentist, fast, good default)hier_mlr— hierarchical multinomial logistic regression for multi-site or multi-region datapiantham— the Piantham et al. method, commonly used in SARS-CoV-2 variant trackingfga— fixed growth advantage model, assumes constant relative fitness over the estimation windowgarw— growth advantage random walk, allows the fitness advantage to drift over time (Bayesian)
This means you can benchmark multiple statistical approaches on the same dataset with minimal code changes, which is exactly what you want when advising decision-makers who need to understand model uncertainty.
Built-in Real-World Datasets
Rather than shipping toy data, lineagefreq includes four real CDC and public surveillance datasets:
cdc_sarscov2_jn1— SARS-CoV-2 JN.1 sublineage emergence datacdc_ba2_transition— the BA.1-to-BA.2 transition in the United Statesinfluenza_h3n2— influenza H3N2 clade competition datasarscov2_us_2022— US Omicron sublineage dynamics across 2022
These serve double duty: they make vignettes and examples reproducible, and they provide realistic test cases for validating new methods. The BA.2 transition dataset, for example, is the basis for the validation result I discuss below.
A Quick Example
Here is a minimal workflow using the BA.2 transition data. We fit a model, extract growth advantages, and generate a short-term forecast:
The growth_advantage() output includes point estimates and confidence intervals. For the BA.2 versus BA.1 comparison, lineagefreq estimates a growth advantage of approximately 1.34×, which aligns well with published estimates in the literature (typically reported in the range of 1.3–1.5×).
Generating a forecast and evaluating it is equally straightforward:
# 4-week-ahead frequency forecast
fc <- forecast(fit, horizon = 4)
# Rolling-origin backtesting with proper scoring
bt <- backtest(
data = cdc_ba2_transition,
engine = "mlr",
horizon = 4,
window = 8
)
# Score the backtested forecasts
scores <- score_forecasts(bt)
scoresThe backtest() function walks a sliding window across the time series, refitting the model at each origin and scoring out-of-sample predictions. score_forecasts() computes proper scoring rules (log score, Brier score) so you can compare engines or parameter choices on a level playing field.
Tidy Integration
lineagefreq fits naturally into the tidyverse ecosystem. Fitted model objects support the standard broom generics:
library(broom)
# Coefficient-level summaries
tidy(fit)
# Model-level goodness-of-fit statistics
glance(fit)
# Observation-level fitted values and residuals
augment(fit)This makes it easy to pipe results into ggplot2 for visualization or into dplyr workflows for further analysis, without learning a new set of accessor functions.
Eight Vignettes
The package ships with eight vignettes covering everything from a getting-started guide to advanced topics like hierarchical multi-region modelling and custom scoring rule implementation. The vignettes are built around the real datasets and walk through complete analysis pipelines, not just isolated function calls. You can browse them on the pkgdown site.
What Comes Next
The development version (0.6.0) on GitHub is where active work happens. Current priorities include improving the Bayesian engine performance, adding support for weighted observation models, and expanding the forecast evaluation toolkit. If you work in genomic surveillance or infectious disease modelling and want to try it out, I would love to hear your feedback.
- CRAN: https://cran.r-project.org/package=lineagefreq
- GitHub: https://github.com/CuiweiG/lineagefreq
- pkgdown: https://cuiweig.github.io/lineagefreq
- Posit Community tutorial: From Sequence Counts to Variant Forecasts
install.packages("lineagefreq")
# Or the development version:
# pak::pak("CuiweiG/lineagefreq")