<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Cuiwei Gao</title>
<link>https://cuiweig.github.io/blog.html</link>
<atom:link href="https://cuiweig.github.io/blog.xml" rel="self" type="application/rss+xml"/>
<description>R packages for genomic surveillance, clinical AI fairness, and privacy-preserving data generation</description>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Fri, 17 Apr 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Fair and Private Clinical Modeling: A Two-Package Workflow</title>
  <dc:creator>Cuiwei Gao</dc:creator>
  <link>https://cuiweig.github.io/posts/2026-04-19-fair-private-clinical-modeling/</link>
  <description><![CDATA[ 




<p>Clinical prediction models are increasingly used to triage patients, flag deterioration risk, and allocate scarce resources. Two problems shadow every deployment: <strong>privacy</strong> (can the training data be shared safely across institutions?) and <strong>fairness</strong> (does the model perform equitably across demographic groups?). These concerns are usually tackled separately, but they interact — you cannot audit a model for fairness if privacy rules prevent you from accessing the data it was trained on.</p>
<p>This post walks through a combined workflow using two R packages: <a href="https://cran.r-project.org/package=syntheticdata">syntheticdata</a> for generating and validating privacy-preserving synthetic clinical data, and <a href="https://cran.r-project.org/package=clinicalfair">clinicalfair</a> for auditing the resulting prediction model for group fairness. Every function call below uses real exported functions from these packages.</p>
<section id="step-1-generate-synthetic-data" class="level2">
<h2 class="anchored" data-anchor-id="step-1-generate-synthetic-data">Step 1: Generate Synthetic Data</h2>
<p>Suppose you have a clinical dataset with patient demographics, lab values, and a binary outcome (e.g., 30-day readmission). You want to share a synthetic copy with an external collaborator. <code>synthesize()</code> supports three methods: Gaussian copula, bootstrap resampling, and Laplace noise injection.</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(syntheticdata)</span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Original clinical dataset (not shared externally)</span></span>
<span id="cb1-4">data_original <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read.csv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"clinical_cohort.csv"</span>)</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate synthetic data using Gaussian copula</span></span>
<span id="cb1-7">data_synthetic <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">synthesize</span>(</span>
<span id="cb1-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> data_original,</span>
<span id="cb1-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">method =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"copula"</span>,</span>
<span id="cb1-10">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">n =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">nrow</span>(data_original)</span>
<span id="cb1-11">)</span></code></pre></div></div>
</div>
<p>The copula method estimates the joint distribution of all variables via their empirical marginals and a Gaussian copula, then samples from it. This preserves correlation structure while breaking the one-to-one link between synthetic and real records.</p>
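<p>To make the idea concrete, here is a minimal base-R sketch of Gaussian copula sampling on toy data. It is an illustration under stated assumptions, not the code inside <code>synthesize()</code>: the column names <code>age</code> and <code>lab</code> are made up, only numeric columns are handled, and the marginals are taken from ranks and empirical quantiles.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">set.seed(1)

# Toy "real" data with two correlated numeric columns (hypothetical names)
n &lt;- 500
age &lt;- rnorm(n, mean = 60, sd = 10)
real &lt;- data.frame(age = age, lab = 0.5 * age + rnorm(n, sd = 5))

# 1. Convert each column to normal scores through its ranks (empirical marginal)
z &lt;- sapply(real, function(col) qnorm(rank(col) / (length(col) + 1)))

# 2. Estimate the copula correlation on the normal scale
rho &lt;- cor(z)

# 3. Draw new correlated normal scores with that correlation matrix
z_new &lt;- matrix(rnorm(n * ncol(real)), nrow = n) %*% chol(rho)

# 4. Map the scores back through each column's empirical quantile function
synth &lt;- as.data.frame(mapply(
  function(col, u) quantile(col, probs = u, names = FALSE),
  real, as.data.frame(pnorm(z_new))
))

# Compare the correlation in the real and synthetic tables
round(c(real = cor(real$age, real$lab), synthetic = cor(synth$age, synth$lab)), 2)</code></pre>
</div>
<p>Each synthetic row is drawn from the fitted joint distribution rather than copied from a real row, which is what breaks the record-level link while keeping the dependence between columns.</p>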
</section>
<section id="step-2-validate-the-synthetic-data" class="level2">
<h2 class="anchored" data-anchor-id="step-2-validate-the-synthetic-data">Step 2: Validate the Synthetic Data</h2>
<p>Before trusting synthetic data for downstream analysis, you need to verify that it preserves the statistical properties of the original. <code>validate_synthetic()</code> runs a suite of distributional comparisons:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1">validation <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">validate_synthetic</span>(</span>
<span id="cb2-2">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">original =</span> data_original,</span>
<span id="cb2-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">synthetic =</span> data_synthetic</span>
<span id="cb2-4">)</span>
<span id="cb2-5">validation</span></code></pre></div></div>
</div>
<p>The output reports per-variable KS statistics for continuous columns and chi-squared statistics for categorical columns, along with an overall utility score. If a variable diverges substantially, you know to investigate before proceeding.</p>
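<p>The two kinds of comparison mentioned above can be reproduced by hand with base R. The sketch below uses simulated stand-ins for the original and synthetic tables (the columns <code>age</code> and <code>sex</code> are hypothetical) and is not meant to mirror the exact output of <code>validate_synthetic()</code>.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">set.seed(42)
n &lt;- 300

# Hypothetical toy data standing in for the original and synthetic tables
original  &lt;- data.frame(age = rnorm(n, 65, 12),
                        sex = sample(c("F", "M"), n, replace = TRUE))
synthetic &lt;- data.frame(age = rnorm(n, 65, 12),
                        sex = sample(c("F", "M"), n, replace = TRUE))

# Continuous column: two-sample Kolmogorov-Smirnov statistic
# (largest gap between the two empirical CDFs; 0 means identical)
ks &lt;- ks.test(original$age, synthetic$age)

# Categorical column: chi-squared test on the source-by-level table
counts &lt;- table(
  source = rep(c("original", "synthetic"), each = n),
  sex    = c(as.character(original$sex), as.character(synthetic$sex))
)
chi &lt;- chisq.test(counts)

c(ks_statistic = unname(ks$statistic), chisq_statistic = unname(chi$statistic))</code></pre>
</div>
<p>Small values of both statistics indicate that the synthetic column is hard to tell apart from the original one on that marginal; neither statistic says anything about relationships between columns.</p>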
</section>
<section id="step-3-assess-privacy-risk" class="level2">
<h2 class="anchored" data-anchor-id="step-3-assess-privacy-risk">Step 3: Assess Privacy Risk</h2>
<p>Even well-constructed synthetic data can leak information about individuals in the original dataset, especially for rare combinations of attributes. <code>privacy_risk()</code> performs membership inference testing and attribute disclosure risk assessment:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1">risk <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">privacy_risk</span>(</span>
<span id="cb3-2">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">original =</span> data_original,</span>
<span id="cb3-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">synthetic =</span> data_synthetic</span>
<span id="cb3-4">)</span>
<span id="cb3-5">risk</span></code></pre></div></div>
</div>
<p>The membership inference test trains a classifier to distinguish real from synthetic records; if it cannot do much better than chance, the synthetic data does not obviously memorize the training set. The attribute disclosure component checks whether knowing a subset of quasi-identifiers in the synthetic data allows reconstructing sensitive attributes from the original. Together, these provide a practical privacy profile you can report to your IRB or data governance board.</p>
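<p>The classifier-based check described above can be sketched in a few lines of base R. This is a simplified illustration on simulated data, assuming a logistic regression as the classifier and a 70/30 train/test split; the classifier and split used inside <code>privacy_risk()</code> may well differ.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">set.seed(7)
n &lt;- 400

# Hypothetical real and synthetic tables drawn from the same distribution
real  &lt;- data.frame(age = rnorm(n, 65, 12), lab = rnorm(n, 1.0, 0.3))
synth &lt;- data.frame(age = rnorm(n, 65, 12), lab = rnorm(n, 1.0, 0.3))

# Stack both tables and label the source of each row
both &lt;- rbind(cbind(real, is_real = 1), cbind(synth, is_real = 0))

# Hold out 30% of rows so the AUC is not inflated by overfitting
train_idx &lt;- sample(nrow(both), size = round(0.7 * nrow(both)))
disc &lt;- glm(is_real ~ age + lab, data = both[train_idx, ], family = binomial)
test &lt;- both[-train_idx, ]
score &lt;- predict(disc, newdata = test, type = "response")

# AUC from the rank-sum (Mann-Whitney) identity; 0.5 means chance level
n1 &lt;- sum(test$is_real == 1)
n0 &lt;- sum(test$is_real == 0)
auc &lt;- (sum(rank(score)[test$is_real == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc</code></pre>
</div>
<p>An AUC near 0.5 means the classifier cannot separate the two sources on these features. A value well above 0.5 would be a signal to look more closely at what distinguishes the synthetic rows.</p>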
</section>
<section id="step-4-train-a-prediction-model" class="level2">
<h2 class="anchored" data-anchor-id="step-4-train-a-prediction-model">Step 4: Train a Prediction Model</h2>
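<p>With a validated, privacy-assessed synthetic dataset in hand, you (or your collaborator) can train a clinical prediction model. This example uses a simple logistic regression, but the downstream fairness audit is model-agnostic. To keep the example short, the predictions below are in-sample (scored on the same rows used for fitting), so metrics computed from them will tend to look better than they would on a held-out set.</p>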
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train on synthetic data</span></span>
<span id="cb4-2">model <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glm</span>(</span>
<span id="cb4-3">  readmission <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span> age <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> sex <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> race <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> lab_value_1 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> lab_value_2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span> comorbidity_index,</span>
<span id="cb4-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> data_synthetic,</span>
<span id="cb4-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">family =</span> binomial</span>
<span id="cb4-6">)</span>
<span id="cb4-7"></span>
<span id="cb4-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate predictions</span></span>
<span id="cb4-9">data_synthetic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pred_prob <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">predict</span>(model, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">type =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"response"</span>)</span>
<span id="cb4-10">data_synthetic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pred_class <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ifelse</span>(data_synthetic<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>pred_prob <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.5</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span></code></pre></div></div>
</div>
</section>
<section id="step-5-prepare-data-for-fairness-audit" class="level2">
<h2 class="anchored" data-anchor-id="step-5-prepare-data-for-fairness-audit">Step 5: Prepare Data for Fairness Audit</h2>
<p><code>fairness_data()</code> bundles the observed outcomes, predicted scores, and protected attributes into a structure that all downstream clinicalfair functions expect:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(clinicalfair)</span>
<span id="cb5-2"></span>
<span id="cb5-3">fdata <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fairness_data</span>(</span>
<span id="cb5-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> data_synthetic,</span>
<span id="cb5-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">outcome =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"readmission"</span>,</span>
<span id="cb5-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">predicted =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pred_prob"</span>,</span>
<span id="cb5-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">group =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"race"</span></span>
<span id="cb5-8">)</span></code></pre></div></div>
</div>
</section>
<section id="step-6-compute-group-stratified-metrics" class="level2">
<h2 class="anchored" data-anchor-id="step-6-compute-group-stratified-metrics">Step 6: Compute Group-Stratified Metrics</h2>
<p><code>fairness_metrics()</code> computes performance metrics stratified by the protected attribute. By default it calculates sensitivity, specificity, PPV, NPV, AUC, and calibration slope for each group, with bootstrap confidence intervals:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb6-1">metrics <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fairness_metrics</span>(fdata)</span>
<span id="cb6-2">metrics</span></code></pre></div></div>
</div>
<p>The output is a tidy data frame — one row per group per metric — that you can pipe directly into ggplot2 for visualization or into reporting templates.</p>
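<p>For readers who want to see what "stratified by group" means in practice, the following base-R sketch computes sensitivity and specificity per group on simulated data. The data, the 0.5 cutoff, and the wide one-row-per-group layout are choices made for this illustration; it does not reproduce the bootstrap intervals or the long format returned by <code>fairness_metrics()</code>.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">set.seed(3)
n &lt;- 600

# Hypothetical data: a group label and a binary outcome
toy &lt;- data.frame(
  group = sample(c("A", "B", "C"), n, replace = TRUE),
  y     = rbinom(n, 1, 0.3)
)

# Simulated risk scores that carry information about the outcome
toy$score &lt;- plogis(-1 + 2 * toy$y + rnorm(n))
toy$pred  &lt;- as.integer(toy$score &gt; 0.5)

# Split by group and compute each metric within the group
by_group &lt;- do.call(rbind, lapply(split(toy, toy$group), function(d) {
  data.frame(
    group       = as.character(d$group[1]),
    n           = nrow(d),
    sensitivity = mean(d$pred[d$y == 1]),      # TP / (TP + FN)
    specificity = mean(1 - d$pred[d$y == 0])   # TN / (TN + FP)
  )
}))
by_group</code></pre>
</div>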
</section>
<section id="step-7-four-fifths-rule-check" class="level2">
<h2 class="anchored" data-anchor-id="step-7-four-fifths-rule-check">Step 7: Four-Fifths Rule Check</h2>
<p>The four-fifths (or 80%) rule, borrowed from employment discrimination law and increasingly applied to clinical AI, flags a metric as disparate if the worst-performing group’s rate falls below 80% of the best-performing group’s rate. <code>fairness_report()</code> automates this check:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb7-1">report <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fairness_report</span>(fdata)</span>
<span id="cb7-2">report</span></code></pre></div></div>
</div>
<p>The report identifies which metrics violate the four-fifths rule, which group comparisons drive the violation, and the magnitude of the disparity. This gives you a concrete, auditable summary to include in model documentation or regulatory submissions.</p>
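<p>The arithmetic behind the rule is simple enough to check by hand. The sketch below applies it to three made-up sensitivity values; the group labels and numbers are hypothetical and are not output from <code>fairness_report()</code>.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical per-group sensitivities
sens &lt;- c(A = 0.82, B = 0.78, C = 0.61)

# Worst-performing group's rate relative to the best-performing group's rate
ratio &lt;- min(sens) / max(sens)

# The rule flags the metric when that ratio falls below 0.8
flagged &lt;- ratio &lt; 0.8

round(ratio, 3)          # 0.61 / 0.82 = 0.744
flagged                  # TRUE
names(which.min(sens))   # "C" is the group driving the violation</code></pre>
</div>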
</section>
<section id="step-8-intersectional-analysis" class="level2">
<h2 class="anchored" data-anchor-id="step-8-intersectional-analysis">Step 8: Intersectional Analysis</h2>
<p>Single-axis fairness (stratifying by race alone, or sex alone) can mask compounded disparities at the intersection of multiple attributes. <code>intersectional_fairness()</code> crosses protected groups and repeats the analysis:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb8-1">inter <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">intersectional_fairness</span>(</span>
<span id="cb8-2">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> data_synthetic,</span>
<span id="cb8-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">outcome =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"readmission"</span>,</span>
<span id="cb8-4">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">predicted =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"pred_prob"</span>,</span>
<span id="cb8-5">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">groups =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">c</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"race"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sex"</span>)</span>
<span id="cb8-6">)</span>
<span id="cb8-7">inter</span></code></pre></div></div>
</div>
<p>This might reveal, for example, that the model performs adequately for each race group and each sex group in isolation, but fails for a specific race–sex intersection. These are the disparities that single-axis audits miss.</p>
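<p>The crossing step itself is easy to picture. The base-R sketch below builds race-by-sex cells with <code>interaction()</code> on simulated data and computes sensitivity in each cell; the labels <code>R1</code> and <code>R2</code> and all numbers are hypothetical, and the sketch is not the implementation behind <code>intersectional_fairness()</code>.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">set.seed(11)
n &lt;- 800

# Hypothetical data with two protected attributes and a binary outcome
toy &lt;- data.frame(
  race = sample(c("R1", "R2"), n, replace = TRUE),
  sex  = sample(c("F", "M"), n, replace = TRUE),
  y    = rbinom(n, 1, 0.3)
)

# Simulated binary predictions: 80% hit rate on positives, 10% false alarms
toy$pred &lt;- rbinom(n, 1, ifelse(toy$y == 1, 0.8, 0.1))

# Cross the two attributes into a single intersectional grouping factor
toy$cell &lt;- interaction(toy$race, toy$sex, sep = " x ")

# Sensitivity within each race-by-sex cell, with the number of positives
pos &lt;- toy[toy$y == 1, ]
cells &lt;- data.frame(
  n_pos       = as.vector(table(pos$cell)),
  sensitivity = as.vector(tapply(pos$pred, pos$cell, mean)),
  row.names   = levels(toy$cell)
)
cells</code></pre>
</div>
<p>Reporting the number of positives next to each estimate matters here: cell sizes shrink quickly as attributes are crossed, so an apparent disparity in a small cell may be mostly noise.</p>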
</section>
<section id="step-9-threshold-based-mitigation" class="level2">
<h2 class="anchored" data-anchor-id="step-9-threshold-based-mitigation">Step 9: Threshold-Based Mitigation</h2>
<p>If disparities are identified, one practical mitigation strategy is to use group-specific decision thresholds that equalize a chosen metric (e.g., sensitivity) across groups. <code>threshold_optimize()</code> searches for the threshold set that achieves this:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb9-1">optimized <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">threshold_optimize</span>(</span>
<span id="cb9-2">  fdata,</span>
<span id="cb9-3">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">metric =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"sensitivity"</span></span>
<span id="cb9-4">)</span>
<span id="cb9-5">optimized</span></code></pre></div></div>
</div>
<p>The output includes per-group thresholds and the resulting equalized metric values. This is not a silver bullet: equalizing sensitivity with group-specific thresholds generally shifts specificity and PPV within each group, so one disparity can be traded for another. It does, however, provide a transparent, auditable mechanism for reducing disparity, which is often what regulatory reviewers want to see.</p>
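<p>One simple way to obtain group-specific thresholds for a target sensitivity is to take a quantile of the scores among each group's true positives. The sketch below does this on simulated data; the 0.80 target, the group labels, and the quantile approach are assumptions made for illustration, and <code>threshold_optimize()</code> may use a different search.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">set.seed(5)
n &lt;- 1000

toy &lt;- data.frame(
  group = sample(c("A", "B"), n, replace = TRUE),
  y     = rbinom(n, 1, 0.3)
)

# Group B's scores are shifted down, so a shared cutoff misses more of its positives
toy$score &lt;- plogis(-1 + 2 * toy$y - 0.8 * (toy$group == "B") + rnorm(n))

target &lt;- 0.80  # desired sensitivity in every group (hypothetical choice)

# Per group, the cutoff is the (1 - target) quantile of scores among true
# positives, so roughly 80% of that group's positives score above it
cutoffs &lt;- sapply(split(toy, toy$group), function(d) {
  quantile(d$score[d$y == 1], probs = 1 - target, names = FALSE)
})

# Apply each row's own group cutoff, then recompute the metrics
toy$pred &lt;- as.integer(toy$score &gt; cutoffs[as.character(toy$group)])
sens &lt;- tapply(toy$pred[toy$y == 1], toy$group[toy$y == 1], mean)
spec &lt;- tapply(1 - toy$pred[toy$y == 0], toy$group[toy$y == 0], mean)

round(rbind(cutoff = cutoffs, sensitivity = sens, specificity = spec), 3)</code></pre>
</div>
<p>Sensitivity lands close to the target in both groups, while the specificity row shows what each group pays for it.</p>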
</section>
<section id="putting-it-together" class="level2">
<h2 class="anchored" data-anchor-id="putting-it-together">Putting It Together</h2>
<p>The combined workflow — generate synthetic data, validate it, assess privacy risk, train a model, and audit for fairness — addresses the two core concerns in a single reproducible pipeline. The synthetic data lets you share and collaborate without exposing patient records; the fairness audit ensures the resulting model does not quietly disadvantage vulnerable subgroups.</p>
<p>Both packages are on CRAN:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb10-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install.packages</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"syntheticdata"</span>)</span>
<span id="cb10-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install.packages</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"clinicalfair"</span>)</span></code></pre></div></div>
</div>
<ul>
<li><strong>syntheticdata:</strong> <a href="https://cran.r-project.org/package=syntheticdata">CRAN</a> | <a href="https://github.com/CuiweiG/syntheticdata">GitHub</a> | <a href="https://cuiweig.github.io/syntheticdata">Docs</a></li>
<li><strong>clinicalfair:</strong> <a href="https://cran.r-project.org/package=clinicalfair">CRAN</a> | <a href="https://github.com/CuiweiG/clinicalfair">GitHub</a> | <a href="https://cuiweig.github.io/clinicalfair">Docs</a></li>
</ul>


</section>

 ]]></description>
  <category>R</category>
  <category>Clinical AI</category>
  <category>Fairness</category>
  <category>Privacy</category>
  <category>CRAN</category>
  <guid>https://cuiweig.github.io/posts/2026-04-19-fair-private-clinical-modeling/</guid>
  <pubDate>Fri, 17 Apr 2026 00:00:00 GMT</pubDate>
</item>
<item>
  <title>Introducing lineagefreq: Tracking Pathogen Variant Dynamics in R</title>
  <dc:creator>Cuiwei Gao</dc:creator>
  <link>https://cuiweig.github.io/posts/2026-04-17-introducing-lineagefreq/</link>
  <description><![CDATA[ 




<p>When a new pathogen variant begins circulating, one of the first questions public health agencies ask is: <em>how fast is it growing relative to existing lineages?</em> Answering that question from raw sequence counts is harder than it looks. Counts are noisy, sampling is uneven across time and geography, and the multinomial nature of the data — every sequence belongs to exactly one lineage — means you cannot just fit separate logistic curves and call it a day.</p>
<p><a href="https://cran.r-project.org/package=lineagefreq">lineagefreq</a> is an R package that tackles this problem end to end. It fits multinomial logistic regression models to genomic surveillance count data, estimates relative growth advantages between lineages, generates short-term frequency forecasts, and validates those forecasts with rigorous rolling-origin backtesting. The current CRAN release is 0.2.0; the development version (0.6.0) on <a href="https://github.com/CuiweiG/lineagefreq">GitHub</a> adds several new features.</p>
<section id="five-engines-one-interface" class="level2">
<h2 class="anchored" data-anchor-id="five-engines-one-interface">Five Engines, One Interface</h2>
<p>A key design decision in lineagefreq is offering multiple estimation engines behind a single <code>fit_model()</code> interface. You choose the engine with one argument; everything else — data format, output structure, downstream methods — stays the same.</p>
<p>The five engines are:</p>
<ul>
<li><strong><code>mlr</code></strong> — standard multinomial logistic regression (frequentist, fast, good default)</li>
<li><strong><code>hier_mlr</code></strong> — hierarchical multinomial logistic regression for multi-site or multi-region data</li>
<li><strong><code>piantham</code></strong> — the Piantham et al.&nbsp;method, commonly used in SARS-CoV-2 variant tracking</li>
<li><strong><code>fga</code></strong> — fixed growth advantage model, assumes constant relative fitness over the estimation window</li>
<li><strong><code>garw</code></strong> — growth advantage random walk, allows the fitness advantage to drift over time (Bayesian)</li>
</ul>
<p>This means you can benchmark multiple statistical approaches on the same dataset with minimal code changes, which is exactly what you want when advising decision-makers who need to understand model uncertainty.</p>
</section>
<section id="built-in-real-world-datasets" class="level2">
<h2 class="anchored" data-anchor-id="built-in-real-world-datasets">Built-in Real-World Datasets</h2>
<p>Rather than shipping toy data, lineagefreq includes four real CDC and public surveillance datasets:</p>
<ul>
<li><code>cdc_sarscov2_jn1</code> — SARS-CoV-2 JN.1 sublineage emergence data</li>
<li><code>cdc_ba2_transition</code> — the BA.1-to-BA.2 transition in the United States</li>
<li><code>influenza_h3n2</code> — influenza H3N2 clade competition data</li>
<li><code>sarscov2_us_2022</code> — US Omicron sublineage dynamics across 2022</li>
</ul>
<p>These serve double duty: they make vignettes and examples reproducible, and they provide realistic test cases for validating new methods. The BA.2 transition dataset, for example, is the basis for the validation result I discuss below.</p>
</section>
<section id="a-quick-example" class="level2">
<h2 class="anchored" data-anchor-id="a-quick-example">A Quick Example</h2>
<p>Here is a minimal workflow using the BA.2 transition data. We fit a model, extract growth advantages, and generate a short-term forecast:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb1-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(lineagefreq)</span>
<span id="cb1-2"></span>
<span id="cb1-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load the BA.1 → BA.2 transition dataset</span></span>
<span id="cb1-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"cdc_ba2_transition"</span>)</span>
<span id="cb1-5"></span>
<span id="cb1-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Fit a multinomial logistic regression model</span></span>
<span id="cb1-7">fit <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">fit_model</span>(</span>
<span id="cb1-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> cdc_ba2_transition,</span>
<span id="cb1-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">engine =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mlr"</span></span>
<span id="cb1-10">)</span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Estimate growth advantages relative to the reference lineage</span></span>
<span id="cb1-13">ga <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">growth_advantage</span>(fit)</span>
<span id="cb1-14">ga</span></code></pre></div></div>
</div>
<p>The <code>growth_advantage()</code> output includes point estimates and confidence intervals. For the BA.2 versus BA.1 comparison, lineagefreq estimates a growth advantage of approximately 1.34×, which is consistent with published estimates (typically reported in the range of 1.3–1.5×).</p>
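<p>As background on where a number like this can come from, the sketch below simulates a two-lineage competition and recovers the multiplier from a logistic regression slope. Everything in it is an assumption for illustration: the weekly multiplier of 1.5, the counts, and the definition of growth advantage as the exponentiated time slope on the odds scale. The definition and time unit used by <code>growth_advantage()</code> may differ.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r">set.seed(2)
weeks &lt;- 0:9
true_ratio &lt;- 1.5   # hypothetical weekly multiplier on the odds variant : reference

# True variant frequency follows a logistic curve in time
p_variant &lt;- plogis(-3 + log(true_ratio) * weeks)

# Simulated weekly sequence counts (500 sequences per week)
variant   &lt;- rbinom(length(weeks), size = 500, prob = p_variant)
reference &lt;- 500 - variant

# With two lineages, multinomial logistic regression reduces to
# binomial logistic regression on the variant/reference counts
fit_toy &lt;- glm(cbind(variant, reference) ~ weeks, family = binomial)

# The exponentiated time slope is the estimated weekly multiplier on the odds
est &lt;- unname(exp(coef(fit_toy)["weeks"]))
ci  &lt;- exp(confint.default(fit_toy)["weeks", ])   # Wald 95% interval

round(c(estimate = est, ci), 2)</code></pre>
</div>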
<p>Generating a forecast and evaluating it is equally straightforward:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb2-1"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># 4-week-ahead frequency forecast</span></span>
<span id="cb2-2">fc <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">forecast</span>(fit, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">horizon =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>)</span>
<span id="cb2-3"></span>
<span id="cb2-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Rolling-origin backtesting with proper scoring</span></span>
<span id="cb2-5">bt <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">backtest</span>(</span>
<span id="cb2-6">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">data =</span> cdc_ba2_transition,</span>
<span id="cb2-7">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">engine =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"mlr"</span>,</span>
<span id="cb2-8">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">horizon =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">4</span>,</span>
<span id="cb2-9">  <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">window =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">8</span></span>
<span id="cb2-10">)</span>
<span id="cb2-11"></span>
<span id="cb2-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Score the backtested forecasts</span></span>
<span id="cb2-13">scores <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">score_forecasts</span>(bt)</span>
<span id="cb2-14">scores</span></code></pre></div></div>
</div>
<p>The <code>backtest()</code> function walks a sliding window across the time series, refitting the model at each origin and scoring out-of-sample predictions. <code>score_forecasts()</code> computes proper scoring rules (log score, Brier score) so you can compare engines or parameter choices on a level playing field.</p>
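<p>For intuition about the two scoring rules, here is a small base-R calculation on made-up numbers. The forecast frequencies and observed counts are hypothetical, and the per-sequence averaging shown here is one common convention; <code>score_forecasts()</code> may aggregate differently.</p>
<div class="cell">
<pre class="sourceCode r"><code class="sourceCode r"># Hypothetical forecast frequencies for three lineages over two target weeks
forecast &lt;- rbind(
  week1 = c(BA.1 = 0.60, BA.2 = 0.35, other = 0.05),
  week2 = c(BA.1 = 0.40, BA.2 = 0.55, other = 0.05)
)

# Observed sequence counts for the same weeks (also made up)
observed &lt;- rbind(
  week1 = c(BA.1 = 110, BA.2 = 80,  other = 10),
  week2 = c(BA.1 = 70,  BA.2 = 120, other = 10)
)
obs_freq &lt;- observed / rowSums(observed)

# Multiclass Brier score averaged over each week's sequences.
# For one sequence of lineage k: sum_j (f_j - 1[j == k])^2 = sum_j f_j^2 - 2 f_k + 1
brier &lt;- rowSums(obs_freq * (1 - 2 * forecast)) + rowSums(forecast^2)

# Log score: mean negative log forecast probability of the observed lineage
log_score &lt;- -rowSums(obs_freq * log(forecast))

round(cbind(brier, log_score), 3)</code></pre>
</div>
<p>Both scores are negatively oriented in this form, so lower values indicate better forecasts, and both penalize a forecast that puts little probability on the lineage that actually shows up.</p>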
</section>
<section id="tidy-integration" class="level2">
<h2 class="anchored" data-anchor-id="tidy-integration">Tidy Integration</h2>
<p>lineagefreq fits naturally into the tidyverse ecosystem. Fitted model objects support the standard <a href="https://broom.tidymodels.org/">broom</a> generics:</p>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb3-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(broom)</span>
<span id="cb3-2"></span>
<span id="cb3-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Coefficient-level summaries</span></span>
<span id="cb3-4"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">tidy</span>(fit)</span>
<span id="cb3-5"></span>
<span id="cb3-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Model-level goodness-of-fit statistics</span></span>
<span id="cb3-7"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">glance</span>(fit)</span>
<span id="cb3-8"></span>
<span id="cb3-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Observation-level fitted values and residuals</span></span>
<span id="cb3-10"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">augment</span>(fit)</span></code></pre></div></div>
</div>
<p>This makes it easy to pipe results into ggplot2 for visualization or into dplyr workflows for further analysis, without learning a new set of accessor functions.</p>
</section>
<section id="eight-vignettes" class="level2">
<h2 class="anchored" data-anchor-id="eight-vignettes">Eight Vignettes</h2>
<p>The package ships with eight vignettes covering everything from a getting-started guide to advanced topics like hierarchical multi-region modelling and custom scoring rule implementation. The vignettes are built around the real datasets and walk through complete analysis pipelines, not just isolated function calls. You can browse them on the <a href="https://cuiweig.github.io/lineagefreq">pkgdown site</a>.</p>
</section>
<section id="what-comes-next" class="level2">
<h2 class="anchored" data-anchor-id="what-comes-next">What Comes Next</h2>
<p>The development version (0.6.0) on GitHub is where active work happens. Current priorities include improving the Bayesian engine performance, adding support for weighted observation models, and expanding the forecast evaluation toolkit. If you work in genomic surveillance or infectious disease modelling and want to try it out, I would love to hear your feedback.</p>
<ul>
<li><strong>CRAN:</strong> <a href="https://cran.r-project.org/package=lineagefreq" class="uri">https://cran.r-project.org/package=lineagefreq</a></li>
<li><strong>GitHub:</strong> <a href="https://github.com/CuiweiG/lineagefreq" class="uri">https://github.com/CuiweiG/lineagefreq</a></li>
<li><strong>pkgdown:</strong> <a href="https://cuiweig.github.io/lineagefreq" class="uri">https://cuiweig.github.io/lineagefreq</a></li>
<li><strong>Posit Community tutorial:</strong> <a href="https://forum.posit.co/t/from-sequence-counts-to-variant-forecasts-a-lineagefreq-tutorial/211035">From Sequence Counts to Variant Forecasts</a></li>
</ul>
<div class="cell">
<div class="code-copy-outer-scaffold"><div class="sourceCode cell-code" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb4-1"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">install.packages</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"lineagefreq"</span>)</span>
<span id="cb4-2"></span>
<span id="cb4-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Or the development version:</span></span>
<span id="cb4-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># pak::pak("CuiweiG/lineagefreq")</span></span></code></pre></div></div>
</div>


</section>

 ]]></description>
  <category>R</category>
  <category>Bioinformatics</category>
  <category>Genomic Surveillance</category>
  <category>CRAN</category>
  <guid>https://cuiweig.github.io/posts/2026-04-17-introducing-lineagefreq/</guid>
  <pubDate>Thu, 16 Apr 2026 00:00:00 GMT</pubDate>
</item>
</channel>
</rss>
