Releasing Synthetic Clinical Data

A privacy-utility analysis across three synthesis methods

clinical-data
synthetic-data
privacy
data-sharing
Author

Cuiwei Gao

Published

April 19, 2026

Motivation

A research consortium wants to release a patient cohort to external collaborators. HIPAA and equivalent non-US frameworks block direct release. The consortium considers a synthetic surrogate — a dataset with the statistical texture of the original but no one-to-one mapping to any real patient. Three questions determine whether this is a safe release:

  1. Does the synthetic data still carry the real-data signal? Means, variances, and (crucially) correlations must be preserved or downstream analyses will reach different conclusions.
  2. Is it safe? Can an adversary link a synthetic record back to a specific patient? How close are synthetic records to real ones?
  3. Which synthesis method is right for this use case?

This case study applies syntheticdata (v0.1.0, on CRAN) to a simulated clinical cohort, comparing Gaussian copula, bootstrap-with-noise, and Laplace-noise synthesis against a shared utility-and-privacy scorecard, and then mapping the three methods onto a privacy-utility Pareto frontier.

Data

Because the point of this study is data release, the “real” cohort here is itself simulated — a 500-patient clinical-like dataset whose distributions and correlations match the kind of cohort a cardiology trial might collect (age, SBP, BMI, glucose, LDL, and a 30-day readmission outcome). Everything downstream treats it as the real data.

Code
suppressPackageStartupMessages({
  library(syntheticdata)
  library(ggplot2)
  library(dplyr)
  library(tidyr)
})

set.seed(42)
n <- 500
real <- data.frame(
  age     = pmax(35, pmin(90,  rnorm(n, 65,  10))),
  sbp     = pmax(85, pmin(200, rnorm(n, 132, 18))),
  bmi     = pmax(17, pmin(45,  rnorm(n, 28,   5))),
  glucose = pmax(65, pmin(280, rnorm(n, 98,  22))),
  ldl     = pmax(40, pmin(220, rnorm(n, 115, 28)))
)
# A correlated binary outcome (higher risk with higher SBP and glucose)
lp <- scale(real$sbp)[, 1] * 0.6 + scale(real$glucose)[, 1] * 0.4 - 2
real$readmit_30d <- rbinom(n, 1, plogis(lp))

glimpse(real)
Rows: 500
Columns: 6
$ age         <dbl> 78.70958, 59.35302, 68.63128, 71.32863, 69.04268, 63.93875…
$ sbp         <dbl> 150.5245, 148.4659, 131.9558, 134.4482, 119.0372, 128.4338…
$ bmi         <dbl> 39.62529, 30.62061, 32.85367, 29.88487, 23.02033, 25.01259…
$ glucose     <dbl> 84.76957, 95.01204, 76.28000, 116.30235, 80.50869, 105.490…
$ ldl         <dbl> 122.01619, 107.21813, 66.70740, 58.81226, 78.82937, 125.24…
$ readmit_30d <int> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0…
Per-variable means and SDs of the simulated cohort:

# A tibble: 5 × 3
  variable  mean    sd
  <chr>    <dbl> <dbl>
1 age       64.7   9.7
2 sbp      132.   18.4
3 bmi       27.8   4.8
4 glucose   99.2  21  
5 ldl      115.   27.9
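The summary table above was printed without its generating chunk. Something like the following reproduces it, up to row order and tibble formatting (a sketch using dplyr and tidyr; `real` is re-created here exactly as above so the chunk runs standalone):

```r
library(dplyr)
library(tidyr)

# Re-create `real` exactly as in the Data section so this chunk is standalone.
set.seed(42)
n <- 500
real <- data.frame(
  age     = pmax(35, pmin(90,  rnorm(n, 65,  10))),
  sbp     = pmax(85, pmin(200, rnorm(n, 132, 18))),
  bmi     = pmax(17, pmin(45,  rnorm(n, 28,   5))),
  glucose = pmax(65, pmin(280, rnorm(n, 98,  22))),
  ldl     = pmax(40, pmin(220, rnorm(n, 115, 28)))
)

# One row per variable, then mean and SD of each.
summary_tbl <- real |>
  pivot_longer(everything(), names_to = "variable", values_to = "value") |>
  group_by(variable) |>
  summarise(mean = round(mean(value), 1), sd = round(sd(value), 1))
summary_tbl
```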

Step 1 — Generate synthetic data with three methods

Code
syn_para <- synthesize(real, method = "parametric", seed = 42)
syn_boot <- synthesize(real, method = "bootstrap",  seed = 42)
syn_nois <- synthesize(real, method = "noise",      seed = 42)

Three distinct strategies:

  • parametric fits a Gaussian copula to the continuous variables and a multinomial model to categoricals, preserving marginal distributions and pairwise dependence.
  • bootstrap resamples rows with replacement, optionally adding small noise — preserves joint distribution exactly but exposes record-level similarity to the real data.
  • noise adds Laplace noise calibrated to each variable’s scale (a differential-privacy-inspired mechanism).
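To make the three mechanisms concrete, here is a hand-rolled base-R sketch of each on a toy two-column dataset. This illustrates the ideas only; it is not syntheticdata's implementation:

```r
# Toy "real" data with an induced x-y correlation.
set.seed(1)
toy <- data.frame(x = rnorm(200, 50, 10), y = rnorm(200, 100, 20))
toy$y <- toy$y + 0.8 * (toy$x - 50)

# 1. Parametric / copula-style: draw from a fitted multivariate normal.
#    (With normal marginals, the Gaussian-copula step reduces to this.)
mu <- colMeans(toy)
R  <- chol(cov(toy))                     # cov(toy) = t(R) %*% R
z  <- matrix(rnorm(200 * 2), ncol = 2)
syn_para <- as.data.frame(sweep(z %*% R, 2, mu, `+`))
names(syn_para) <- names(toy)

# 2. Bootstrap with small jitter: resample real rows, perturb slightly.
syn_boot <- toy[sample(nrow(toy), replace = TRUE), ] +
  matrix(rnorm(200 * 2, sd = 0.5), ncol = 2)

# 3. Laplace noise scaled to each column's spread (inverse-CDF sampler).
rlaplace <- function(n, b) {
  u <- runif(n) - 0.5
  -b * sign(u) * log(1 - 2 * abs(u))
}
syn_nois <- as.data.frame(lapply(toy, function(col)
  col + rlaplace(length(col), b = sd(col) / 5)))

# All three should roughly preserve the x-y correlation:
round(c(real       = cor(toy)[1, 2],
        parametric = cor(syn_para)[1, 2],
        bootstrap  = cor(syn_boot)[1, 2],
        noise      = cor(syn_nois)[1, 2]), 2)
```

The privacy differences already show up here: `syn_boot` rows are real rows plus tiny jitter, while `syn_para` rows are fresh draws that share only the fitted moments.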

Step 2 — Marginal fidelity

Code
combine_syn <- function(real, syn, method_label) {
  bind_rows(
    mutate(real, source = "real"),
    mutate(syn$synthetic, source = "synthetic")
  ) |>
    select(age, sbp, bmi, glucose, ldl, source) |>
    pivot_longer(-source, names_to = "variable", values_to = "value") |>
    mutate(method = method_label)
}

all_df <- bind_rows(
  combine_syn(real, syn_para, "Gaussian copula"),
  combine_syn(real, syn_boot, "Bootstrap"),
  combine_syn(real, syn_nois, "Laplace noise")
)

ggplot(all_df, aes(x = value, fill = source, colour = source)) +
  geom_density(alpha = 0.35, linewidth = 0.5) +
  facet_grid(method ~ variable, scales = "free", switch = "y") +
  scale_fill_manual(values = c(real = "#1565C0", synthetic = "#E65100"),
                    name = NULL) +
  scale_colour_manual(values = c(real = "#1565C0", synthetic = "#E65100"),
                      name = NULL) +
  labs(x = NULL, y = NULL) +
  theme_minimal(base_size = 11) +
  theme(panel.grid.minor = element_blank(),
        strip.text.y.left = element_text(angle = 0, face = "bold"),
        strip.text.x = element_text(face = "bold"),
        axis.text.y = element_blank(),
        legend.position = "top")
Figure 1: Real vs synthetic marginal distributions across all five continuous variables, for each of the three methods. Blue = real, orange = synthetic. The Gaussian copula tracks each marginal tightly; bootstrap preserves shape (by construction); Laplace noise inflates the tails.

Step 3 — Multi-metric scorecard

compare_methods() runs all three synthesis methods against the same real data and returns a unified validation tibble: mean KS statistic (marginal fidelity), correlation preservation (difference between real and synthetic correlation matrices), discriminative AUC (a classifier trained to tell real from synthetic; ideally ≈ 0.5), and nearest-neighbor distance ratio (≥ 1 means synthetic records are no closer to real records than real records are to each other).
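Two of these metrics are easy to compute by hand, which makes the definitions concrete. The sketch below does so on toy data drawn from the same distribution twice; it illustrates the idea, not compare_methods() internals:

```r
# Toy "real" and "synthetic" datasets from the same distribution.
set.seed(7)
real_toy <- data.frame(a = rnorm(300), b = rnorm(300, 5, 2))
syn_toy  <- data.frame(a = rnorm(300), b = rnorm(300, 5, 2))

# Mean KS statistic: the largest CDF gap per marginal, averaged over variables.
ks_mean <- mean(vapply(names(real_toy), function(v)
  unname(ks.test(real_toy[[v]], syn_toy[[v]])$statistic), numeric(1)))

# Nearest-neighbor distance ratio: typical synthetic-to-real NN distance
# divided by the typical real-to-real NN distance (self-matches excluded).
r_mat <- scale(as.matrix(real_toy))            # standardise on real data
s_mat <- scale(as.matrix(syn_toy),
               center = attr(r_mat, "scaled:center"),
               scale  = attr(r_mat, "scaled:scale"))
n     <- nrow(r_mat)
cross <- as.matrix(dist(rbind(r_mat, s_mat)))
syn_nn  <- apply(cross[seq_len(n), n + seq_len(n)], 2, min)
within  <- as.matrix(dist(r_mat)); diag(within) <- Inf
real_nn <- apply(within, 1, min)
nn_ratio <- median(syn_nn) / median(real_nn)

c(ks_statistic_mean = ks_mean, nn_distance_ratio = nn_ratio)
```

Because both toy datasets come from the same distribution, the ratio lands near 1; a bootstrap-style synthesizer would push it well below 1.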

Code
cmp <- compare_methods(real, seed = 42)
Table 1: Unified utility + privacy scorecard from compare_methods().

method        ks_statistic_mean   correlation_diff   discriminative_auc   nn_distance_ratio
bootstrap                 0.111              0.010                0.543               0.285
noise                     0.111              0.010                0.540               0.372
parametric                0.025              0.009                0.517               0.995

Step 4 — Privacy-utility Pareto plot

Code
ks_df <- cmp |> filter(metric == "ks_statistic_mean") |>
  transmute(method, fidelity = 1 - value)
nn_df <- cmp |> filter(metric == "nn_distance_ratio") |>
  transmute(method, privacy = value)
auc_df <- cmp |> filter(metric == "discriminative_auc") |>
  transmute(method, disc_auc = value)
corr_df <- cmp |> filter(metric == "correlation_diff") |>
  transmute(method, corr_diff = value)
plot_df <- ks_df |>
  left_join(nn_df,   by = "method") |>
  left_join(auc_df,  by = "method") |>
  left_join(corr_df, by = "method") |>
  mutate(label = recode(method,
                        parametric = "Gaussian copula",
                        bootstrap  = "Bootstrap",
                        noise      = "Laplace noise"))

ggplot(plot_df, aes(x = fidelity, y = privacy, colour = method)) +
  annotate("rect", xmin = -Inf, xmax = Inf, ymin = 1, ymax = Inf,
           fill = "#2E7D32", alpha = 0.06) +
  geom_hline(yintercept = 1, linetype = "dashed", colour = "grey50") +
  geom_point(size = 6) +
  geom_text(aes(label = label), vjust = -1.2, fontface = "bold",
            size = 4.1, show.legend = FALSE) +
  annotate("text",
           x = min(plot_df$fidelity) - 0.02, y = 1,
           label = "Privacy floor (NN ratio = 1)",
           hjust = 0, vjust = -0.5, size = 3.3, colour = "grey40") +
  scale_colour_manual(values = c(parametric = "#1565C0",
                                 bootstrap  = "#E65100",
                                 noise      = "#6A1B9A"),
                      guide = "none") +
  scale_x_continuous(labels = scales::percent_format(accuracy = 1),
                     limits = c(min(plot_df$fidelity) - 0.05, 1)) +
  scale_y_continuous(limits = c(min(plot_df$privacy, na.rm = TRUE) - 0.1,
                                max(plot_df$privacy, na.rm = TRUE) + 0.4)) +
  labs(x = "Distributional fidelity (1 − mean KS)",
       y = "Privacy (nearest-neighbor distance ratio)") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())
Figure 2: Each synthesis method as a point on a privacy–utility plane. X axis: distributional fidelity (1 − mean KS; higher = better utility). Y axis: nearest-neighbor distance ratio (higher = safer). The dashed line at 1 is the privacy floor: below it, synthetic records are on average closer to real records than real records are to each other. On this cohort the Gaussian copula dominates, sitting highest on both axes; Laplace noise and bootstrap crowd into the low-privacy region.

Step 5 — Does the synthetic data preserve predictive signal?

A privacy-preserving synthetic dataset is useless if downstream modelling on it gives different answers from modelling on the real data. model_fidelity() trains a logistic regression on the synthetic data, applies it to the real data, and reports AUC — compared against an in-sample real-data baseline.
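The mechanics of that check are simple enough to sketch by hand. The base-R version below assumes the same train-on-synthetic / score-on-real scheme (toy data and a rank-based AUC helper, not model_fidelity() itself):

```r
# Toy cohorts: a "real" one and an identically distributed "synthetic" stand-in.
set.seed(3)
make_cohort <- function(n) {
  x1 <- rnorm(n); x2 <- rnorm(n)
  y  <- rbinom(n, 1, plogis(0.8 * x1 + 0.5 * x2 - 0.5))
  data.frame(x1, x2, y)
}
real_toy <- make_cohort(400)
syn_toy  <- make_cohort(400)

# Rank-based (Mann-Whitney) AUC: P(score of a case > score of a control).
auc <- function(score, y) {
  r  <- rank(score)
  n1 <- sum(y)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * sum(1 - y))
}

# Train on synthetic, evaluate on real; compare with the in-sample real baseline.
fit_syn  <- glm(y ~ x1 + x2, family = binomial, data = syn_toy)
auc_syn  <- auc(predict(fit_syn, newdata = real_toy, type = "response"),
                real_toy$y)
fit_real <- glm(y ~ x1 + x2, family = binomial, data = real_toy)
auc_real <- auc(fitted(fit_real), real_toy$y)

round(c(real_baseline = auc_real, synthetic_trained = auc_syn), 3)
```

If the synthesizer preserved the outcome-predictor relationship, the two AUCs should sit close together, as they do for the Gaussian copula below.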

Code
fi_para <- model_fidelity(syn_para, outcome = "readmit_30d") |>
  mutate(method = "Gaussian copula")
fi_boot <- model_fidelity(syn_boot, outcome = "readmit_30d") |>
  mutate(method = "Bootstrap")
fi_nois <- model_fidelity(syn_nois, outcome = "readmit_30d") |>
  mutate(method = "Laplace noise")
fi_all <- bind_rows(fi_para, fi_boot, fi_nois)
fi_all
# A tibble: 6 × 4
  train_data metric  value method         
  <chr>      <chr>   <dbl> <chr>          
1 real       auc     0.725 Gaussian copula
2 synthetic  auc     0.688 Gaussian copula
3 real       auc     0.725 Bootstrap      
4 synthetic  auc    NA     Bootstrap      
5 real       auc     0.725 Laplace noise  
6 synthetic  auc    NA     Laplace noise  
Code
fi_all |>
  mutate(train_data = recode(train_data,
                              real      = "Real baseline",
                              synthetic = "Synthetic")) |>
  ggplot(aes(x = method, y = value, fill = train_data)) +
  geom_col(position = position_dodge(width = 0.7), width = 0.6) +
  geom_text(aes(label = sprintf("%.3f", value)),
            position = position_dodge(width = 0.7),
            vjust = -0.3, size = 3.6, fontface = "bold") +
  scale_fill_manual(values = c("Real baseline" = "#546E7A",
                               "Synthetic"     = "#1565C0"),
                    name = NULL) +
  scale_y_continuous(limits = c(0, 1),
                     labels = scales::number_format(accuracy = 0.01),
                     expand = expansion(mult = c(0, 0.15))) +
  labs(x = NULL, y = "AUC for 30-day readmission") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(),
        legend.position = "top")
Figure 3: Downstream model AUC (30-day readmission on SBP + glucose + other numerics) when training on real data versus training on synthetic data from each method. Closer to the real baseline is better. The Gaussian copula loses almost no predictive signal relative to the real baseline (0.688 vs 0.725); for bootstrap and Laplace noise the synthetic-trained AUC came back NA in this run, so only the real baseline is plotted for those methods.

Interpretation

The three methods give meaningfully different trade-offs on this cohort. The Gaussian copula preserves marginal shapes and pairwise correlation best (correlation Frobenius-diff 0.009 vs 0.010 for bootstrap), while also maintaining the highest nearest-neighbor distance ratio (0.995, vs 0.285 for bootstrap). That combination makes it the most defensible default for a release-to-collaborators scenario.

Bootstrap is a trap. It looks attractive because the synthetic marginals match the real ones exactly, and because resampled real rows should in principle preserve downstream model behaviour (although the synthetic-trained AUC came back NA in this run). But it achieves that fidelity by reusing actual real records (with optional noise): the nearest-neighbor distance ratio is substantially below 1, meaning a synthetic record is typically closer to some real record than a random real-to-real pair is. A record-linkage adversary would have an easy job. Use it only for internal development.

Laplace noise is the differential-privacy-flavoured option but requires careful calibration of the noise scale. In the default configuration here it sacrifices some marginal fidelity for modest privacy gain — a reasonable choice when the consumer of the data only needs aggregate summaries, not individual-record realism.
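For reference, the textbook Laplace-mechanism calibration sets the noise scale to b = Δf/ε, where Δf is the released statistic's sensitivity. The sketch below applies that formula to a clipped mean; it is an assumption about how a target ε would be hit, since the package's noise method is only described as differential-privacy-inspired:

```r
# Textbook Laplace-mechanism calibration: noise scale b = sensitivity / epsilon.
# (Assumed sketch; syntheticdata's "noise" method may parameterise differently.)
laplace_scale <- function(sensitivity, epsilon) sensitivity / epsilon

# Example: releasing the mean SBP of n = 500 patients, with SBP clipped to
# [85, 200] as in the simulation above. Changing one record moves a clipped
# mean by at most (200 - 85) / n, so:
n <- 500
sens_mean <- (200 - 85) / n
b <- laplace_scale(sens_mean, epsilon = 1)
b   # 0.23 mmHg of Laplace noise for an epsilon = 1 release of the mean
```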

The decision, framed operationally:

Intended use                                                          Recommended method
Release to external collaborators (e.g. cross-site trial enablement)  Gaussian copula (parametric)
Internal model-development sandbox                                    Bootstrap (bootstrap)
Aggregate-statistic release to the public                             Laplace noise (noise), with noise scale tuned to a target ε

Limitations

  • The Gaussian copula assumes each continuous marginal can be transformed to a standard normal: a good approximation for lab values and anthropometrics, but a poor fit for highly skewed counts (hospital length of stay), bounded proportions, or values clustered at detection limits. For those, a non-parametric or mixed-distribution synthesiser would be preferable.
  • The nearest-neighbour distance ratio is an empirical privacy measure, not a formal differential-privacy guarantee, and does not bound the ε a regulator might ask for.
  • The model_fidelity() test covers one downstream task: a logistic regression for 30-day readmission on numeric predictors. Tree-based classifiers, survival models, and deep networks may transfer differently, and a production use case should validate the specific downstream task it will support.
  • All three generators are evaluated on cross-sectional data only. Longitudinal data (EHR timestamps, multiple visits per patient, time-varying covariates) violates the i.i.d.-rows assumption all three methods rely on and calls for sequence-aware generators not covered here.

About this analysis

  • Author: Cuiwei Gao
  • Date: 2026-04-19
  • Package: syntheticdata v0.1.0
  • Data: 500-patient simulated clinical cohort generated inline with set.seed(42)
  • Source: github.com/CuiweiG/portfolio
Code
sessionInfo()
R version 4.5.3 (2026-03-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.utf8 
[2] LC_CTYPE=Chinese (Simplified)_China.utf8   
[3] LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C                               
[5] LC_TIME=Chinese (Simplified)_China.utf8    

time zone: Asia/Singapore
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tidyr_1.3.2         dplyr_1.2.1         ggplot2_4.0.2      
[4] syntheticdata_0.1.0

loaded via a namespace (and not attached):
 [1] vctrs_0.7.3        cli_3.6.6          knitr_1.51         rlang_1.2.0       
 [5] xfun_0.57          otel_0.2.0         purrr_1.2.2        generics_0.1.4    
 [9] S7_0.2.1-1         jsonlite_2.0.0     labeling_0.4.3     glue_1.8.0        
[13] htmltools_0.5.9    scales_1.4.0       rmarkdown_2.31     grid_4.5.3        
[17] tibble_3.3.1       evaluate_1.0.5     fastmap_1.2.0      yaml_2.3.12       
[21] lifecycle_1.0.5    compiler_4.5.3     RColorBrewer_1.1-3 pkgconfig_2.0.3   
[25] htmlwidgets_1.6.4  farver_2.1.2       digest_0.6.39      R6_2.6.1          
[29] utf8_1.2.6         tidyselect_1.2.1   pillar_1.11.1      magrittr_2.0.5    
[33] withr_3.0.2        tools_4.5.3        gtable_0.3.6