Cuiwei Gao

Fair and Private Clinical Modeling: A Two-Package Workflow

Cuiwei Gao — Fri, 17 Apr 2026 00:00:00 GMT

Clinical prediction models are increasingly used to triage patients, flag deterioration risk, and allocate scarce resources. Two problems shadow every deployment: privacy (can the training data be shared safely across institutions?) and fairness (does the model perform equitably across demographic groups?). These concerns are usually tackled separately, but they interact — you cannot audit a model for fairness if privacy rules prevent you from accessing the data it was trained on.

This post walks through a combined workflow using two R packages: syntheticdata for generating and validating privacy-preserving synthetic clinical data, and clinicalfair for auditing the resulting prediction model for group fairness. Every function call below uses real exported functions from these packages.

Step 1: Generate Synthetic Data

Suppose you have a clinical dataset with patient demographics, lab values, and a binary outcome (e.g., 30-day readmission). You want to share a synthetic copy with an external collaborator. synthesize() supports three methods: Gaussian copula, bootstrap resampling, and Laplace noise injection.

library(syntheticdata)

# Original clinical dataset (not shared externally)
data_original <- read.csv("clinical_cohort.csv")

# Generate synthetic data using Gaussian copula
data_synthetic <- synthesize(
  data = data_original,
  method = "copula",
  n = nrow(data_original)
)

The copula method estimates the joint distribution of all variables via their empirical marginals and a Gaussian copula, then samples from it. This preserves correlation structure while breaking the one-to-one link between synthetic and real records.
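To make the idea concrete, here is a minimal base-R sketch of Gaussian-copula sampling for numeric columns. It is not the code behind synthesize(); the helper name gaussian_copula_sample() and the toy columns are assumptions made for illustration, and a real implementation would also need to handle categorical variables and missing values.

```r
# Hypothetical helper: Gaussian-copula sampling for numeric columns only.
gaussian_copula_sample <- function(df, n = nrow(df)) {
  stopifnot(all(vapply(df, is.numeric, logical(1))))
  # 1. Map each column to normal scores through its empirical ranks.
  z <- as.data.frame(lapply(df, function(x) qnorm(rank(x) / (length(x) + 1))))
  # 2. Estimate the copula correlation on the normal-score scale.
  rho <- cor(z)
  # 3. Draw correlated standard normals using the Cholesky factor of rho.
  draws <- matrix(rnorm(n * ncol(df)), nrow = n) %*% chol(rho)
  # 4. Push each draw back through the column's empirical quantile function.
  out <- lapply(seq_along(df), function(j) {
    quantile(df[[j]], probs = pnorm(draws[, j]), type = 8, names = FALSE)
  })
  names(out) <- names(df)
  as.data.frame(out)
}

# Simulated toy data (not a real cohort)
set.seed(1)
toy <- data.frame(age = rnorm(500, 60, 12))
toy$lab_value_1 <- 0.05 * toy$age + rnorm(500)

syn <- gaussian_copula_sample(toy)
cor(toy$age, toy$lab_value_1)  # correlation in the original
cor(syn$age, syn$lab_value_1)  # should be close in the synthetic copy
```

No synthetic row is a copy of an original row, yet the marginal shapes and the pairwise correlation carry over, which is the property the paragraph above describes.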

Step 2: Validate the Synthetic Data

Before trusting synthetic data for downstream analysis, you need to verify that it preserves the statistical properties of the original. validate_synthetic() runs a suite of distributional comparisons:

validation <- validate_synthetic(
  original = data_original,
  synthetic = data_synthetic
)
validation

The output reports per-variable KS statistics for continuous columns and chi-squared statistics for categorical columns, along with an overall utility score. If a variable diverges substantially, you know to investigate before proceeding.
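For readers who want to see what such a comparison involves, the sketch below runs a two-sample KS test on numeric columns and a chi-squared test on categorical ones using base R only. The helper compare_columns() is hypothetical and the data are simulated; the checks and the utility score computed by validate_synthetic() may be defined differently.

```r
# Hypothetical helper: per-column distributional comparison.
compare_columns <- function(original, synthetic) {
  rows <- lapply(names(original), function(v) {
    x <- original[[v]]
    y <- synthetic[[v]]
    if (is.numeric(x)) {
      # Two-sample Kolmogorov-Smirnov test for continuous columns
      res <- suppressWarnings(ks.test(x, y))
      kind <- "KS"
    } else {
      # Chi-squared test on the 2 x k table of category counts
      lv <- union(as.character(x), as.character(y))
      tab <- rbind(table(factor(x, levels = lv)), table(factor(y, levels = lv)))
      res <- suppressWarnings(chisq.test(tab))
      kind <- "chi-squared"
    }
    data.frame(variable = v, test = kind,
               statistic = unname(res$statistic), p_value = res$p.value)
  })
  do.call(rbind, rows)
}

# Simulated stand-ins for the original and synthetic tables
set.seed(2)
orig_toy <- data.frame(age = rnorm(300, 60, 12),
                       sex = sample(c("F", "M"), 300, replace = TRUE))
syn_toy <- data.frame(age = rnorm(300, 60, 12),
                      sex = sample(c("F", "M"), 300, replace = TRUE))

comparison <- compare_columns(orig_toy, syn_toy)
comparison
```

A large KS statistic or a very small p-value for a column is the kind of divergence worth investigating before any modeling.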

Step 3: Assess Privacy Risk

Even well-constructed synthetic data can leak information about individuals in the original dataset, especially for rare combinations of attributes. privacy_risk() performs membership inference testing and attribute disclosure risk assessment:

risk <- privacy_risk(
  original = data_original,
  synthetic = data_synthetic
)
risk

The membership inference test trains a classifier to distinguish real from synthetic records; if it cannot do much better than chance, the synthetic data does not obviously memorize the training set. The attribute disclosure component checks whether knowing a subset of quasi-identifiers in the synthetic data allows reconstructing sensitive attributes from the original. Together, these provide a practical privacy profile you can report to your IRB or data governance board.
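The real-versus-synthetic classifier idea can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the internals of privacy_risk(): the helper real_vs_synthetic_auc() is hypothetical, the classifier is a plain logistic regression, and the data are simulated.

```r
# Hypothetical helper: how well can a classifier tell real from synthetic rows?
real_vs_synthetic_auc <- function(original, synthetic, holdout = 0.3) {
  stacked <- rbind(
    data.frame(original, is_real = 1),
    data.frame(synthetic, is_real = 0)
  )
  # Hold out a random subset so the AUC is measured out of sample
  test_idx <- sample(nrow(stacked), size = floor(holdout * nrow(stacked)))
  fit <- glm(is_real ~ ., data = stacked[-test_idx, ], family = binomial)
  score <- predict(fit, newdata = stacked[test_idx, ], type = "response")
  label <- stacked$is_real[test_idx]
  # AUC from the rank-sum (Mann-Whitney) identity
  n1 <- sum(label == 1)
  n0 <- sum(label == 0)
  (sum(rank(score)[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

set.seed(3)
real     <- data.frame(age = rnorm(400, 60, 12), lab = rnorm(400))
good_syn <- data.frame(age = rnorm(400, 60, 12), lab = rnorm(400))  # same distribution
bad_syn  <- data.frame(age = rnorm(400, 75, 12), lab = rnorm(400))  # shifted age

auc_good <- real_vs_synthetic_auc(real, good_syn)  # expected near 0.5
auc_bad  <- real_vs_synthetic_auc(real, bad_syn)   # expected well above 0.5
c(good = auc_good, bad = auc_bad)
```

An AUC near 0.5 means the classifier is guessing. Note that this only shows the two tables are hard to tell apart; it does not by itself rule out disclosure for rare attribute combinations, which is why the attribute disclosure check is reported alongside it.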

Step 4: Train a Prediction Model

With a validated, privacy-assessed synthetic dataset in hand, you (or your collaborator) can train a clinical prediction model. For this example we use a simple logistic regression, but the downstream fairness audit is model-agnostic.

# Train on synthetic data
model <- glm(
  readmission ~ age + sex + race + lab_value_1 + lab_value_2 + comorbidity_index,
  data = data_synthetic,
  family = binomial
)

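# Generate predictions for every row. Passing newdata keeps the vector
# aligned with data_synthetic even if glm() dropped rows with missing values.
data_synthetic$pred_prob <- predict(model, newdata = data_synthetic, type = "response")
data_synthetic$pred_class <- ifelse(data_synthetic$pred_prob > 0.5, 1, 0)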

Step 5: Prepare Data for Fairness Audit

fairness_data() bundles the observed outcomes, predicted scores, and protected attributes into a structure that all downstream clinicalfair functions expect:

library(clinicalfair)

fdata <- fairness_data(
  data = data_synthetic,
  outcome = "readmission",
  predicted = "pred_prob",
  group = "race"
)

Step 6: Compute Group-Stratified Metrics

fairness_metrics() computes performance metrics stratified by the protected attribute. By default it calculates sensitivity, specificity, PPV, NPV, AUC, and calibration slope for each group, with bootstrap confidence intervals:

metrics <- fairness_metrics(fdata)
metrics

The output is a tidy data frame — one row per group per metric — that you can pipe directly into ggplot2 for visualization or into reporting templates.
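As a rough picture of what a group-stratified, one-row-per-group-per-metric table looks like, here is a base-R sketch that computes four confusion-matrix metrics at a fixed threshold. The helper group_metrics() and the toy vectors are assumptions for illustration; it omits AUC, calibration slope, and the bootstrap intervals that fairness_metrics() is described as reporting.

```r
# Hypothetical helper: confusion-matrix metrics per group, in long format.
group_metrics <- function(outcome, prob, group, threshold = 0.5) {
  pred <- as.integer(prob > threshold)
  idx_by_group <- split(seq_along(outcome), group)
  rows <- lapply(names(idx_by_group), function(g) {
    y <- outcome[idx_by_group[[g]]]
    p <- pred[idx_by_group[[g]]]
    tp <- sum(p == 1 & y == 1)
    fp <- sum(p == 1 & y == 0)
    tn <- sum(p == 0 & y == 0)
    fn <- sum(p == 0 & y == 1)
    data.frame(
      group  = g,
      metric = c("sensitivity", "specificity", "ppv", "npv"),
      value  = c(tp / (tp + fn), tn / (tn + fp), tp / (tp + fp), tn / (tn + fn))
    )
  })
  do.call(rbind, rows)
}

# Hand-made toy example: two groups of four patients each
outcome <- c(1, 1, 0, 0,   1, 1, 0, 0)
prob    <- c(0.9, 0.8, 0.2, 0.6,   0.9, 0.3, 0.1, 0.2)
group   <- rep(c("A", "B"), each = 4)

metrics_toy <- group_metrics(outcome, prob, group)
metrics_toy
# Group A: sensitivity 1.0, specificity 0.5
# Group B: sensitivity 0.5, specificity 1.0
```

The long format is what makes it easy to facet by metric and plot groups side by side.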

Step 7: Four-Fifths Rule Check

The four-fifths (or 80%) rule, borrowed from employment discrimination law and increasingly applied to clinical AI, flags a metric as disparate if the worst-performing group’s rate falls below 80% of the best-performing group’s rate. fairness_report() automates this check:

report <- fairness_report(fdata)
report

The report identifies which metrics violate the four-fifths rule, which group comparisons drive the violation, and the magnitude of the disparity. This gives you a concrete, auditable summary to include in model documentation or regulatory submissions.
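The rule itself is simple arithmetic, and a small base-R sketch shows it. The helper four_fifths() and the metric values below are made up for illustration; the structure of the fairness_report() output is not reproduced here.

```r
# Hypothetical helper: worst-to-best ratio per metric, flagged below 0.8.
four_fifths <- function(metrics) {
  rows <- lapply(split(metrics, metrics$metric), function(d) {
    data.frame(
      metric      = d$metric[1],
      worst_group = d$group[which.min(d$value)],
      best_group  = d$group[which.max(d$value)],
      ratio       = min(d$value) / max(d$value)
    )
  })
  out <- do.call(rbind, rows)
  out$violates <- out$ratio < 0.8
  rownames(out) <- NULL
  out
}

# Made-up per-group metric values
metrics_toy <- data.frame(
  group  = c("A", "B", "C", "A", "B", "C"),
  metric = rep(c("sensitivity", "specificity"), each = 3),
  value  = c(0.80, 0.60, 0.75, 0.90, 0.85, 0.88)
)

rule_check <- four_fifths(metrics_toy)
rule_check
# sensitivity: 0.60 / 0.80 = 0.75, below 0.8, flagged (worst B, best A)
# specificity: 0.85 / 0.90 is about 0.94, not flagged
```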

Step 8: Intersectional Analysis

Single-axis fairness (stratifying by race alone, or sex alone) can mask compounded disparities at the intersection of multiple attributes. intersectional_fairness() crosses protected groups and repeats the analysis:

inter <- intersectional_fairness(
  data = data_synthetic,
  outcome = "readmission",
  predicted = "pred_prob",
  groups = c("race", "sex")
)
inter

This might reveal, for example, that the model performs adequately for each race group and each sex group in isolation, but fails for a specific race–sex intersection. These are the disparities that single-axis audits miss.
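A constructed toy example makes the masking effect visible. The counts below are invented so that sensitivity is identical across race groups and across sex groups, yet differs sharply across race-by-sex cells; the group labels R1 and R2 are placeholders.

```r
# Four race-by-sex cells, each with 10 patients who truly had the outcome.
cells <- expand.grid(race = c("R1", "R2"), sex = c("F", "M"),
                     stringsAsFactors = FALSE)
cells$detected <- c(9, 5, 5, 9)  # true positives out of 10 in each cell

# Expand to one row per patient; pred_class is 1 when the model flagged them
toy <- cells[rep(seq_len(nrow(cells)), each = 10), c("race", "sex")]
toy$pred_class <- unlist(lapply(cells$detected, function(k) {
  rep(c(1, 0), times = c(k, 10 - k))
}))

# Sensitivity = share of true cases flagged (every row here is a true case)
by_race <- tapply(toy$pred_class, toy$race, mean)
by_sex  <- tapply(toy$pred_class, toy$sex, mean)
by_cell <- tapply(toy$pred_class, interaction(toy$race, toy$sex), mean)

by_race  # 0.7 for both race groups
by_sex   # 0.7 for both sex groups
by_cell  # 0.9, 0.5, 0.5, 0.9 across the four cells
```

Both single-axis views show a ratio of 1.0, while the cell-level ratio is 0.5 / 0.9, roughly 0.56, which falls well below the four-fifths threshold.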

Step 9: Threshold-Based Mitigation

If disparities are identified, one practical mitigation strategy is to use group-specific decision thresholds that equalize a chosen metric (e.g., sensitivity) across groups. threshold_optimize() searches for the threshold set that achieves this:

optimized <- threshold_optimize(
  fdata,
  metric = "sensitivity"
)
optimized

The output includes per-group thresholds and the resulting equalized metric values. This is not a silver bullet — adjusting thresholds trades off one metric against another — but it provides a transparent, auditable mechanism for reducing disparity, which is often what regulatory reviewers want to see.
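One simple way to pick group-specific thresholds is to read them off the sorted scores of the true cases in each group. The sketch below does that in base R; sensitivity_thresholds() is a hypothetical helper, the data are hand-made, and the search that threshold_optimize() actually performs may work differently. Note that the sketch flags a patient when the score is at or above the threshold.

```r
# Hypothetical helper: per-group threshold that reaches a target sensitivity.
sensitivity_thresholds <- function(outcome, prob, group, target = 0.75) {
  pos <- outcome == 1
  sapply(split(prob[pos], group[pos]), function(s) {
    # Flagging the k highest-scoring true cases gives sensitivity k / n
    k <- ceiling(target * length(s))
    sort(s, decreasing = TRUE)[k]
  })
}

# Hand-made toy data: group B's scores run systematically lower
toy <- data.frame(
  group   = rep(c("A", "B"), each = 6),
  outcome = rep(c(1, 1, 1, 1, 0, 0), times = 2),
  prob    = c(0.9, 0.8, 0.7, 0.4, 0.6, 0.2,
              0.6, 0.5, 0.3, 0.2, 0.4, 0.1),
  stringsAsFactors = FALSE
)

thr <- sensitivity_thresholds(toy$outcome, toy$prob, toy$group)
thr  # A = 0.7, B = 0.3

# Apply each patient's group threshold and recompute both metrics
flagged <- toy$prob >= thr[toy$group]
sens <- tapply(flagged[toy$outcome == 1], toy$group[toy$outcome == 1], mean)
spec <- tapply(!flagged[toy$outcome == 0], toy$group[toy$outcome == 0], mean)
sens  # 0.75 in both groups
spec  # 1.0 for A, 0.5 for B
```

The toy result shows the trade-off directly: sensitivity is equalized at 0.75, but group B pays for it with lower specificity.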

Putting It Together

The combined workflow — generate synthetic data, validate it, assess privacy risk, train a model, and audit for fairness — addresses the two core concerns in a single reproducible pipeline. The synthetic data lets you share and collaborate without exposing patient records; the fairness audit ensures the resulting model does not quietly disadvantage vulnerable subgroups.

Both packages are on CRAN:

install.packages("syntheticdata")
install.packages("clinicalfair")

syntheticdata: CRAN | GitHub | Docs
clinicalfair: CRAN | GitHub | Docs

Introducing lineagefreq: Tracking Pathogen Variant Dynamics in R

Cuiwei Gao — Thu, 16 Apr 2026 00:00:00 GMT

When a new pathogen variant begins circulating, one of the first questions public health agencies ask is: how fast is it growing relative to existing lineages? Answering that question from raw sequence counts is harder than it looks. Counts are noisy, sampling is uneven across time and geography, and the multinomial nature of the data — every sequence belongs to exactly one lineage — means you cannot just fit separate logistic curves and call it a day.

lineagefreq is an R package that tackles this problem end to end. It fits multinomial logistic regression models to genomic surveillance count data, estimates relative growth advantages between lineages, generates short-term frequency forecasts, and validates those forecasts with rigorous rolling-origin backtesting. The current CRAN release is 0.2.0; the development version (0.6.0) on GitHub adds several new features.
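The multinomial constraint is easiest to see in a small simulation. In the sketch below, each lineage has a linear predictor in time and frequencies come from a softmax, so they sum to one at every time point by construction. The function softmax_freq(), the lineage names, and the intercept and slope values are illustrative assumptions, not parameters estimated by lineagefreq.

```r
# Hypothetical helper: lineage frequencies under a multinomial logistic model.
softmax_freq <- function(t, intercept, slope) {
  # Linear predictor for each time point (rows) and lineage (columns)
  eta <- outer(t, slope) +
    matrix(intercept, nrow = length(t), ncol = length(slope), byrow = TRUE)
  w <- exp(eta)
  w / rowSums(w)  # normalize each row so frequencies sum to 1
}

weeks <- 0:8
freq <- softmax_freq(
  weeks,
  intercept = c(0, -3, -5),   # reference lineage fixed at 0
  slope     = c(0, 0.4, 0.9)  # growth relative to the reference, per week
)
colnames(freq) <- c("reference", "variant_a", "variant_b")

round(freq, 3)
rowSums(freq)  # exactly 1 at every week, up to floating-point error
```

Fitting three independent logistic curves would not enforce that constraint, which is the problem the multinomial formulation avoids.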

Five Engines, One Interface

A key design decision in lineagefreq is offering multiple estimation engines behind a single fit_model() interface. You choose the engine with one argument; everything else — data format, output structure, downstream methods — stays the same.

The five engines are:

mlr — standard multinomial logistic regression (frequentist, fast, good default)
hier_mlr — hierarchical multinomial logistic regression for multi-site or multi-region data
piantham — the Piantham et al. method, commonly used in SARS-CoV-2 variant tracking
fga — fixed growth advantage model, assumes constant relative fitness over the estimation window
garw — growth advantage random walk, allows the fitness advantage to drift over time (Bayesian)

This means you can benchmark multiple statistical approaches on the same dataset with minimal code changes, which is exactly what you want when advising decision-makers who need to understand model uncertainty.

Built-in Real-World Datasets

Rather than shipping toy data, lineagefreq includes four real CDC and public surveillance datasets:

cdc_sarscov2_jn1 — SARS-CoV-2 JN.1 sublineage emergence data
cdc_ba2_transition — the BA.1-to-BA.2 transition in the United States
influenza_h3n2 — influenza H3N2 clade competition data
sarscov2_us_2022 — US Omicron sublineage dynamics across 2022

These serve double duty: they make vignettes and examples reproducible, and they provide realistic test cases for validating new methods. The BA.2 transition dataset, for example, is the basis for the validation result I discuss below.

A Quick Example

Here is a minimal workflow using the BA.2 transition data. We fit a model, extract growth advantages, and generate a short-term forecast:

library(lineagefreq)

# Load the BA.1 → BA.2 transition dataset
data("cdc_ba2_transition")

# Fit a multinomial logistic regression model
fit <- fit_model(
  data = cdc_ba2_transition,
  engine = "mlr"
)

# Estimate growth advantages relative to the reference lineage
ga <- growth_advantage(fit)
ga

The growth_advantage() output includes point estimates and confidence intervals. For the BA.2 versus BA.1 comparison, lineagefreq estimates a growth advantage of approximately 1.34×, which aligns well with published estimates in the literature (typically reported in the range of 1.3–1.5×).
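To show where a multiplicative growth advantage can come from, here is a two-lineage sketch in base R. With only two lineages the multinomial model reduces to a binomial logistic regression, and one common convention converts the daily slope into an advantage per generation as exp(slope × generation time). The counts are simulated, the 5-day generation time is an assumption, and growth_advantage() may use a different convention, so this does not reproduce the BA.2 estimate quoted above.

```r
# Simulated weekly counts for a variant replacing a reference lineage
days <- seq(0, 63, by = 7)
true_slope <- 0.05                       # log-odds growth per day (made up)
p_variant <- plogis(-3 + true_slope * days)
counts <- data.frame(
  day       = days,
  variant   = round(1000 * p_variant),
  reference = 1000 - round(1000 * p_variant)
)

# Two-lineage case: binomial logistic regression on the count pairs
fit_toy <- glm(cbind(variant, reference) ~ day, data = counts, family = binomial)
slope_per_day <- unname(coef(fit_toy)["day"])

# Convert to a per-generation multiplicative advantage (assumed convention)
generation_time <- 5                     # days; an assumption for illustration
advantage <- exp(slope_per_day * generation_time)

c(slope_per_day = slope_per_day, advantage = advantage)
```

Because the result depends on the assumed generation time, published growth advantages are only comparable when they use the same convention.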

Generating a forecast and evaluating it is equally straightforward:

# 4-week-ahead frequency forecast
fc <- forecast(fit, horizon = 4)

# Rolling-origin backtesting with proper scoring
bt <- backtest(
  data = cdc_ba2_transition,
  engine = "mlr",
  horizon = 4,
  window = 8
)

# Score the backtested forecasts
scores <- score_forecasts(bt)
scores

The backtest() function walks a sliding window across the time series, refitting the model at each origin and scoring out-of-sample predictions. score_forecasts() computes proper scoring rules (log score, Brier score) so you can compare engines or parameter choices on a level playing field.
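The rolling-origin idea can be sketched without the package. The code below slides an 8-week training window over simulated two-lineage counts, refits a binomial logistic model at each origin, forecasts 4 weeks ahead, and scores the forecasts. The helper rolling_origin() and the per-sequence Brier and log score definitions are assumptions for illustration; backtest() and score_forecasts() may define windows and scores differently.

```r
# Simulated weekly counts (not surveillance data)
set.seed(42)
weeks <- 0:19
total <- rep(500, length(weeks))
variant <- rbinom(length(weeks), total, plogis(-3 + 0.35 * weeks))
counts <- data.frame(week = weeks, variant = variant, other = total - variant)

# Hypothetical helper: refit at each origin and forecast `horizon` weeks ahead
rolling_origin <- function(counts, window = 8, horizon = 4) {
  origins <- window:(nrow(counts) - horizon)
  rows <- lapply(origins, function(o) {
    train  <- counts[(o - window + 1):o, ]  # only data up to the origin
    target <- counts[o + horizon, ]         # the week being forecast
    fit <- glm(cbind(variant, other) ~ week, data = train, family = binomial)
    data.frame(
      origin_week = counts$week[o],
      target_week = target$week,
      predicted   = unname(predict(fit, newdata = target, type = "response")),
      observed    = target$variant / (target$variant + target$other)
    )
  })
  do.call(rbind, rows)
}

bt_toy <- rolling_origin(counts)

# Per-sequence scores averaged over the observed lineage mix (lower is better)
brier <- with(bt_toy, observed * (1 - predicted)^2 + (1 - observed) * predicted^2)
log_score <- with(bt_toy, -(observed * log(predicted) +
                            (1 - observed) * log(1 - predicted)))
c(mean_brier = mean(brier), mean_log_score = mean(log_score))
```

Every forecast is scored on a week the model never saw at fitting time, which is what keeps the comparison between engines honest.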

Tidy Integration

lineagefreq fits naturally into the tidyverse ecosystem. Fitted model objects support the standard broom generics:

library(broom)

# Coefficient-level summaries
tidy(fit)

# Model-level goodness-of-fit statistics
glance(fit)

# Observation-level fitted values and residuals
augment(fit)

This makes it easy to pipe results into ggplot2 for visualization or into dplyr workflows for further analysis, without learning a new set of accessor functions.

Eight Vignettes

The package ships with eight vignettes covering everything from a getting-started guide to advanced topics like hierarchical multi-region modelling and custom scoring rule implementation. The vignettes are built around the real datasets and walk through complete analysis pipelines, not just isolated function calls. You can browse them on the pkgdown site.

What Comes Next

The development version (0.6.0) on GitHub is where active work happens. Current priorities include improving the Bayesian engine performance, adding support for weighted observation models, and expanding the forecast evaluation toolkit. If you work in genomic surveillance or infectious disease modelling and want to try it out, I would love to hear your feedback.

CRAN: https://cran.r-project.org/package=lineagefreq
GitHub: https://github.com/CuiweiG/lineagefreq
pkgdown: https://cuiweig.github.io/lineagefreq
Posit Community tutorial: From Sequence Counts to Variant Forecasts

install.packages("lineagefreq")

# Or the development version:
# pak::pak("CuiweiG/lineagefreq")