Post-hoc Fairness Audit of a Deployed Risk Score

A worked example on the COMPAS dataset

clinical-ai
algorithmic-fairness
audit
Author

Cuiwei Gao

Published

April 19, 2026

The decision this study supports

You are a clinician, compliance officer, or data-science auditor. A deployed risk-scoring model has landed on your desk with a request: audit this for disparate performance across protected groups, in a form the FDA or your IRB will accept. You cannot re-train it. You do not have its training weights. You have its outputs, the observed outcomes, and the protected attribute.

This is the problem clinicalfair (v0.1.0 on CRAN) is designed for — post-hoc fairness auditing. It takes predictions, labels, and a protected attribute, and returns the set of metrics and visual diagnostics that a regulator or review board expects. No GLM re-fitting, no access to the original training pipeline.

As a worked example I use the COMPAS dataset — the canonical recidivism-prediction audit in the algorithmic-fairness literature, shipped with clinicalfair as compas_sim (predictions, outcomes, and race, based on ProPublica's COMPAS investigation; Angwin et al. 2016). COMPAS is a criminal-justice tool, not a clinical one, but the audit workflow applies identically to a clinical risk score; the final section generalises back.

Data

Code
suppressPackageStartupMessages({
  library(clinicalfair)
  library(ggplot2)
  library(dplyr)
})

data(compas_sim)
dim(compas_sim)
[1] 1000    3
Code
head(compas_sim, 3)
  risk_score recidivism  race
1     0.4864          1 Black
2     0.2680          0 Black
3     0.1677          0 White

compas_sim contains 1 000 rows with the deployed model’s risk score, the binary recidivism outcome, and the protected race attribute (Black vs White). Because we are in audit mode, we treat the risk score as an opaque input — the question is not how it was produced but how its errors distribute across groups.

Step 1 — Compute metric parity with bootstrap CIs

Code
fd <- fairness_data(
  predictions    = compas_sim$risk_score,
  labels         = compas_sim$recidivism,
  protected_attr = compas_sim$race,
  threshold      = 0.5
)

fm <- fairness_metrics(fd,
                       metrics = c("selection_rate", "tpr", "fpr",
                                   "ppv", "accuracy", "brier"),
                       ci = TRUE, n_boot = 2000)
Code
fm |>
  mutate(metric = factor(metric,
                         levels = c("selection_rate", "tpr", "fpr",
                                    "ppv", "accuracy", "brier"),
                         labels = c("Selection rate", "TPR", "FPR",
                                    "PPV", "Accuracy", "Brier"))) |>
  ggplot(aes(x = group, y = value, fill = group)) +
  geom_col(width = 0.6) +
  geom_errorbar(aes(ymin = ci_lower, ymax = ci_upper),
                width = 0.2, colour = "grey20") +
  facet_wrap(~ metric, scales = "free_y", nrow = 2) +
  scale_fill_manual(values = c(Black = "#546E7A", White = "#E65100"),
                    guide = "none") +
  scale_y_continuous(labels = scales::number_format(accuracy = 0.01)) +
  labs(x = NULL, y = "Metric value") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(),
        strip.text = element_text(face = "bold"))
Figure 1: Group-wise performance with bootstrap 95 % confidence intervals. Black is the reference group (highest selection rate). The disparity story is concentrated in the rate metrics: selection rate and TPR are both substantially lower for White, and FPR is lower still. PPV and accuracy are close, which is the point around which the Northpointe–ProPublica debate pivoted.

Step 2 — Automated four-fifths screening

The four-fifths rule (EEOC, 29 CFR 1607.4(D); also used by FDA AI/ML guidance as a screening heuristic) flags any group whose metric value falls below 0.8× or above 1.25× the reference group’s value. fairness_report() runs this automatically.

Code
fr <- fairness_report(fd)
fr

The report flags 3 metric(s) that violate the four-fifths rule: selection_rate (ratio 0.51), tpr (ratio 0.64), fpr (ratio 0.26). Selection-rate and error-rate disparities are large, constituting a prima facie disparate-impact flag under the four-fifths screen. PPV and Brier are not flagged — the model is approximately equally calibrated across groups, even though its errors land very differently.
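The screen itself is simple enough to reproduce outside the package. A minimal base-R sketch of the ratio test — the group-wise values below are illustrative placeholders, not the audit's actual rates:

```r
# Four-fifths screen: flag any metric whose comparison/reference ratio
# falls below 0.8 or above 1.25.
four_fifths_flag <- function(reference, comparison, lower = 0.8, upper = 1.25) {
  ratio <- comparison / reference
  data.frame(metric  = names(reference),
             ratio   = round(ratio, 2),
             flagged = ratio < lower | ratio > upper)
}

# Illustrative group-wise rates (Black as reference, per Figure 1).
black <- c(selection_rate = 0.55, tpr = 0.70, fpr = 0.40)
white <- c(selection_rate = 0.28, tpr = 0.45, fpr = 0.10)
four_fifths_flag(black, white)   # all three metrics flagged
```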

Step 3 — Disparity in where errors fall (ROC by group)

Code
plot_roc(fd) + theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())
Figure 2: ROC curves by racial group, produced by plot_roc(). Both curves sit well above the 45° diagonal (AUC is comparable across groups), but at any fixed threshold the two groups’ false-positive and true-positive rates differ. A single global threshold cannot match both TPR and FPR across groups — this is the geometric reason for the disparity above.
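The geometric point — that one global cutoff picks a different operating point on each group's ROC curve — can be demonstrated with a toy computation (hand-crafted scores and labels, not the audit data):

```r
# TPR and FPR at a fixed threshold, per group. When score distributions
# differ across groups, the same cutoff lands at different operating points.
rates_at <- function(score, label, threshold = 0.5) {
  pred <- as.integer(score >= threshold)
  c(tpr = sum(pred == 1 & label == 1) / sum(label == 1),
    fpr = sum(pred == 1 & label == 0) / sum(label == 0))
}

# Group A's scores sit higher on the scale than group B's.
group_a <- data.frame(score = c(0.90, 0.80, 0.70, 0.60, 0.40, 0.30),
                      label = c(1, 1, 1, 0, 0, 0))
group_b <- data.frame(score = c(0.60, 0.55, 0.45, 0.40, 0.30, 0.20),
                      label = c(1, 1, 1, 0, 0, 0))

rates_at(group_a$score, group_a$label)   # tpr = 1.00, fpr ≈ 0.33
rates_at(group_b$score, group_b$label)   # tpr ≈ 0.67, fpr = 0.00
```

Matching both rates for both groups would require group-specific cutoffs, which is exactly what Step 5 explores.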

Step 4 — Calibration by group

Code
plot_calibration(fd, n_bins = 10) + theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())
Figure 3: Calibration by group: mean predicted risk vs observed recidivism rate within 10 equal-frequency bins. The two curves are close to the diagonal and to each other — the model is calibrated within each group at roughly equal quality. Northpointe relied on this property to argue the score was fair.
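The binning behind such a plot is straightforward to sketch in base R — a simplified stand-in for plot_calibration(), run on synthetic data where outcomes are drawn from the scores themselves (i.e. a perfectly calibrated score):

```r
# Reliability points: mean predicted risk vs observed event rate within
# equal-frequency bins of the score.
calibration_bins <- function(score, outcome, n_bins = 10) {
  bin <- cut(rank(score, ties.method = "first"), breaks = n_bins, labels = FALSE)
  data.frame(mean_pred = tapply(score, bin, mean),
             obs_rate  = tapply(outcome, bin, mean))
}

set.seed(42)
p  <- runif(2000)                 # predicted risks
y  <- rbinom(2000, 1, p)          # outcomes drawn from those risks
cb <- calibration_bins(p, y)
max(abs(cb$mean_pred - cb$obs_rate))   # small: points hug the diagonal
```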

Step 5 — Threshold optimisation

Changing the global 0.5 threshold to group-specific thresholds is the simplest post-hoc mitigation. threshold_optimize() searches the Cartesian product of candidate per-group thresholds for the combination that minimises a disparity objective subject to a minimum accuracy floor.
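As a mental model of what such a search does, here is a simplified base-R grid search — not clinicalfair's actual implementation — that minimises the equalized-odds gap (|ΔTPR| + |ΔFPR|) over pairs of per-group cutoffs, subject to an overall accuracy floor, on synthetic two-group data:

```r
# Per-group TPR/FPR from 0/1 predictions.
rates <- function(pred, label) {
  c(tpr = mean(pred[label == 1]), fpr = mean(pred[label == 0]))
}

# Exhaustive search over pairs of per-group thresholds: keep the pair with
# the smallest equalized-odds gap among those meeting the accuracy floor.
eo_search <- function(score, label, group,
                      grid = seq(0.1, 0.9, by = 0.05), min_accuracy = 0.5) {
  gs   <- sort(unique(group))
  best <- list(gap = Inf, thresholds = c(NA, NA))
  for (t1 in grid) for (t2 in grid) {
    thr  <- ifelse(group == gs[1], t1, t2)
    pred <- as.integer(score >= thr)
    if (mean(pred == label) < min_accuracy) next
    gap <- sum(abs(rates(pred[group == gs[1]], label[group == gs[1]]) -
                   rates(pred[group == gs[2]], label[group == gs[2]])))
    if (gap < best$gap) best <- list(gap = gap, thresholds = c(t1, t2))
  }
  best
}

set.seed(7)
n     <- 400
grp   <- rep(c("A", "B"), each = n / 2)
base  <- ifelse(grp == "A", 0.6, 0.3)                  # differing base rates
y     <- rbinom(n, 1, base)
score <- pmin(pmax(base + ifelse(y == 1, 0.2, -0.2) + rnorm(n, 0, 0.15), 0), 1)

eo_search(score, y, grp)   # per-group cutoffs with a smaller TPR/FPR gap
```

The exhaustive double loop is fine for two groups and a coarse grid; more groups or finer grids need a smarter search, which is part of what the package call abstracts away.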

Code
topt_eo <- threshold_optimize(fd, objective = "equalized_odds",
                              min_accuracy = 0.5)
topt_dp <- threshold_optimize(fd, objective = "demographic_parity",
                              min_accuracy = 0.5)
topt_eo
topt_dp
Code
compare_df <- bind_rows(
  mutate(topt_eo$before, policy = "Global threshold 0.5"),
  mutate(topt_eo$after,  policy = "Equalized odds"),
  mutate(topt_dp$after,  policy = "Demographic parity")
) |>
  filter(metric == "selection_rate") |>
  mutate(policy = factor(policy, levels = c("Global threshold 0.5",
                                            "Equalized odds",
                                            "Demographic parity")))

ggplot(compare_df, aes(x = policy, y = value, fill = group)) +
  geom_col(position = position_dodge(width = 0.7), width = 0.6) +
  geom_text(aes(label = scales::percent(value, accuracy = 1)),
            position = position_dodge(width = 0.7),
            vjust = -0.3, size = 3.6, fontface = "bold") +
  scale_fill_manual(values = c(Black = "#546E7A", White = "#E65100"), name = NULL) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1),
                     expand = expansion(mult = c(0, 0.18))) +
  labs(x = NULL, y = "Selection rate") +
  theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank())
Figure 4: Selection rate by group before (global threshold 0.5) and after two group-specific threshold policies: equalized odds (matches TPR/FPR) and demographic parity (matches selection rate). Demographic parity equalises the bars exactly; equalized odds narrows the disparity while keeping error rates closer. Neither is free — both incur accuracy costs relative to the global threshold.

The accuracy cost of the two mitigations relative to the global threshold is small: -1.4% for equalized odds and -1.4% for demographic parity. Whether that cost is acceptable, and which fairness objective to choose, are policy questions — see below.

Interpretation: the impossibility result in plain terms

Three observations from this audit reproduce the most-cited tension in algorithmic fairness:

  1. The score is approximately calibrated within each group. PPV, Brier, and the calibration curves by group are close. Under calibration as the fairness criterion (the position Northpointe took in responding to ProPublica), the model passes.
  2. Error rates are very unequal across groups. Black defendants face a higher false-positive rate; White defendants face a lower true-positive rate. Under equalized odds — TPR and FPR parity across groups — the model clearly fails.
  3. You cannot satisfy both simultaneously when base rates differ between groups (Chouldechova 2017; Kleinberg, Mullainathan, Raghavan 2017). This is a mathematical impossibility, not an engineering failure.
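Point 3 is just Bayes' rule. PPV is fully determined by TPR, FPR, and the group's base rate, so holding TPR and FPR equal across groups with different base rates forces PPV apart — a three-line check:

```r
# PPV from error rates and prevalence: PPV = TPR*p / (TPR*p + FPR*(1 - p)).
ppv <- function(tpr, fpr, p) tpr * p / (tpr * p + fpr * (1 - p))

# Identical error rates, different base rates: PPV cannot match.
ppv(tpr = 0.7, fpr = 0.2, p = 0.5)   # ≈ 0.78 in the higher-base-rate group
ppv(tpr = 0.7, fpr = 0.2, p = 0.3)   # = 0.60 in the lower-base-rate group
```

Run the other direction, the same identity shows that equalising PPV (calibration at the threshold) with unequal base rates forces the error rates apart.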

clinicalfair makes both views visible in the same report. The choice between them is a policy decision about which kind of harm matters more — systematic over-flagging of one group (an error-rate concern) versus unequal reliability of the score at equivalent predicted risks (a calibration concern). Statisticians can identify the tension; deciding whose harm counts more is a role for clinical leadership, ethics review, and the regulator.

Transfer back to the clinical setting

The identical workflow applies to a deployed clinical risk score — for example, a hospital-acquired-infection predictor, a mortality early-warning score, or a readmission model — simply by passing the relevant columns:

fd <- fairness_data(
  predictions    = hospital_df$score,
  labels         = hospital_df$event,
  protected_attr = hospital_df$race  # or sex, insurance, language, etc.
)
fairness_report(fd)

clinicalfair is deliberately model-agnostic. It doesn’t care whether the score came from logistic regression, gradient boosting, or a vendor black-box — only that predictions and outcomes share an index. That constraint — not access to training code, only access to outputs — is what makes it suitable for audits of deployed systems, which are typically the ones actually affecting patients.

Limitations

The compas_sim dataset records race as a binary Black/White attribute. That is analytically convenient but not representative of clinical populations: real audits must handle multi-category race and ethnicity, insurance status, primary language, and their intersections, each of which brings smaller per-cell sample sizes and ethical questions about which axes to prioritise that a two-group worked example cannot surface.

The mathematical impossibility result (Chouldechova 2017) means that satisfying both calibration parity and equalised odds when base rates differ is formally infeasible; the choice between them is a value judgement rather than a modelling one, and clinicalfair is deliberately agnostic about which to prefer.

The threshold-optimisation policy we apply was fit on the same data it is evaluated on; out-of-sample performance typically degrades, and production use should validate thresholds on a held-out cohort.

Finally, an audit is a diagnosis, not a prescription. Identifying a disparity leaves open whether the appropriate response is threshold adjustment, model redesign, better training data, or not deploying the model at all.

About this analysis

  • Author: Cuiwei Gao
  • Date: 2026-04-19
  • Package: clinicalfair v0.1.0
  • Data: compas_sim, based on ProPublica’s COMPAS investigation (Angwin et al. 2016)
  • Source: github.com/CuiweiG/portfolio
Code
sessionInfo()
R version 4.5.3 (2026-03-11 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default
  LAPACK version 3.12.1

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.utf8 
[2] LC_CTYPE=Chinese (Simplified)_China.utf8   
[3] LC_MONETARY=Chinese (Simplified)_China.utf8
[4] LC_NUMERIC=C                               
[5] LC_TIME=Chinese (Simplified)_China.utf8    

time zone: Asia/Singapore
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.2.1        ggplot2_4.0.2      clinicalfair_0.1.0

loaded via a namespace (and not attached):
 [1] vctrs_0.7.3        cli_3.6.6          knitr_1.51         rlang_1.2.0       
 [5] xfun_0.57          otel_0.2.0         generics_0.1.4     S7_0.2.1-1        
 [9] jsonlite_2.0.0     labeling_0.4.3     glue_1.8.0         htmltools_0.5.9   
[13] scales_1.4.0       rmarkdown_2.31     grid_4.5.3         tibble_3.3.1      
[17] evaluate_1.0.5     fastmap_1.2.0      yaml_2.3.12        lifecycle_1.0.5   
[21] compiler_4.5.3     RColorBrewer_1.1-3 pkgconfig_2.0.3    htmlwidgets_1.6.4 
[25] farver_2.1.2       digest_0.6.39      R6_2.6.1           tidyselect_1.2.1  
[29] pillar_1.11.1      magrittr_2.0.5     withr_3.0.2        tools_4.5.3       
[33] gtable_0.3.6