
Trains a predictive model on synthetic data and evaluates it on real data, then compares the result with a model trained on the real data (the gold standard). The comparison measures whether the synthetic data preserves predictive signal.

Usage

model_fidelity(x, outcome, predictors = NULL)

Arguments

x

A synthetic_data object.

outcome

Character. Name of the outcome column.

predictors

Character vector (optional). Names of the predictor columns. If NULL (the default), all other numeric columns are used.

Value

A tibble with columns: train_data, metric, value. For binary outcomes the metric is AUC; for continuous outcomes it is R-squared.
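
For continuous outcomes, the reported metric can be illustrated with a minimal base-R sketch. The helper `r_squared()` is hypothetical and uses the common definition 1 - SSE/SST; the package may compute R-squared differently (for example, as a squared correlation).

```r
# Hypothetical helper (an assumption, not the package's code):
# R-squared as 1 - SSE/SST, computed from observed and predicted values.
r_squared <- function(actual, predicted) {
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

set.seed(2)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)

# Fit a linear model and score it on the data it was trained on.
fit <- lm(y ~ x, data = d)
r2 <- r_squared(d$y, predict(fit, newdata = d))
```

When the model is scored on the data it was fitted to, this agrees with `summary(fit)$r.squared`. When a model fitted elsewhere is scored on real data, the same formula can return negative values.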

Details

The real-data baseline is evaluated in-sample (trained and tested on the same real data), so it gives an optimistic upper bound on achievable performance. The synthetic-data model is evaluated on that same real data, so the gap between the two rows reflects how much predictive signal the synthetic data preserves.
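
The comparison described above can be sketched in base R for a binary outcome. This is an illustration under stated assumptions, not the package's implementation: `auc_rank()` is a hypothetical helper, logistic regression via `glm()` is an assumed model choice, and the "synthetic" data is a stand-in produced by bootstrapping rows and adding noise rather than by `synthesize()`.

```r
# Hypothetical helper: rank-based AUC (Mann-Whitney statistic).
auc_rank <- function(score, label) {
  r <- rank(score)  # average ranks for ties
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
n <- 500
real <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
real$y <- rbinom(n, 1, plogis(1.5 * real$x1 - real$x2))

# Stand-in for a synthesizer (assumption): bootstrap rows, jitter predictors.
syn <- real[sample(n, replace = TRUE), ]
syn$x1 <- syn$x1 + rnorm(n, sd = 0.3)
syn$x2 <- syn$x2 + rnorm(n, sd = 0.3)

# Train one model on real data and one on synthetic data.
fit_real <- glm(y ~ x1 + x2, data = real, family = binomial())
fit_syn  <- glm(y ~ x1 + x2, data = syn,  family = binomial())

# Evaluate both models on the real data.
auc_real <- auc_rank(predict(fit_real, newdata = real, type = "response"), real$y)
auc_syn  <- auc_rank(predict(fit_syn,  newdata = real, type = "response"), real$y)

result <- data.frame(
  train_data = c("real", "synthetic"),
  metric     = "auc",
  value      = c(auc_real, auc_syn)
)
```

A small gap between the two `value` entries suggests the stand-in synthetic data kept most of the signal. A synthetic-data AUC near 0.5 suggests the signal was lost.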

References

Jordon J, et al. (2022). Synthetic Data – what, why and how? arXiv preprint arXiv:2205.03257. doi:10.48550/arXiv.2205.03257

Examples

set.seed(42)
real <- data.frame(
    x1 = rnorm(200), x2 = rnorm(200),
    y = rbinom(200, 1, 0.3))
syn <- synthesize(real, seed = 42)
model_fidelity(syn, outcome = "y")
#> Warning: non-integer #successes in a binomial glm!
#> # A tibble: 2 × 3
#>   train_data metric value
#>   <chr>      <chr>  <dbl>
#> 1 real       auc    0.577
#> 2 synthetic  auc    0.438