<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Posts | Patrick Rockenschaub</title>
    <link>https://www.patrick-rockenschaub.com/_post/</link>
      <atom:link href="https://www.patrick-rockenschaub.com/_post/index.xml" rel="self" type="application/rss+xml" />
    <description>Posts</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Fri, 27 May 2022 12:00:00 +0000</lastBuildDate>
    <image>
      <url>https://www.patrick-rockenschaub.com/media/icon_hu681d357afb5ae78ae86392db2273c6c5_77265_512x512_fill_lanczos_center_3.png</url>
      <title>Posts</title>
      <link>https://www.patrick-rockenschaub.com/_post/</link>
    </image>
    
    <item>
      <title>Graphical analysis of model stability</title>
      <link>https://www.patrick-rockenschaub.com/_post/2022-05-27-stable-prediction/</link>
      <pubDate>Fri, 27 May 2022 12:00:00 +0000</pubDate>
      <guid>https://www.patrick-rockenschaub.com/_post/2022-05-27-stable-prediction/</guid>
      <description>


&lt;p&gt;Predicting likely patient outcomes with machine learning has been a hot topic for several years now. The increasing collection of routine medical data has enabled the modelling of a wide range of different outcomes across various medical specialties. This interest in data-driven diagnosis and prognosis has only further burgeoned with the arrival of the SARS-CoV-2 pandemic. Countless research groups across countries and institutions have published models that use routine data to predict everything from &lt;a href=&#34;https://bmcmedicine.biomedcentral.com/articles/10.1186/s12916-020-01893-3&#34;&gt;COVID-19 related deaths&lt;/a&gt; or escalation of care such as &lt;a href=&#34;https://www.nature.com/articles/s41598-021-83784-y&#34;&gt;admission to intensive care units or initiation of invasive ventilation&lt;/a&gt; to simply the &lt;a href=&#34;https://erj.ersjournals.com/content/56/2/2000775.short&#34;&gt;presence or absence of the virus&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Unfortunately, if there is one thing my PhD’s taught me repeatedly it’s that deriving reliable models from routine medical data is challenging (as this &lt;a href=&#34;https://www.bmj.com/content/369/bmj.m1328&#34;&gt;systematic review of 232 COVID-19 prediction models&lt;/a&gt; can attest). There are many reasons why a given prediction model may not be reliable, but the one I focus on in my own research — and which we will therefore discuss in more detail in this blog post — is model stability across environments. Here, environments can mean many different things, but in the case of clinical prediction models the environments of interest are often different healthcare providers (e.g., hospitals), with each provider representing a single environment in which we may want to use our model. Ideally, we would like our model to work well across many healthcare providers. If that’s the case, we can use a single model across all providers, and the model may be considered “stable”, “generalisable”, or “transferable”. If our model instead works well at only some providers but not at others, we may need to (re-)train it for each provider at which we want to use it. This not only causes additional overhead but also increases the risk of overfitting to any single provider and raises questions about the validation of each local model. Stability is therefore a desirable property of predictive models. In the remainder of this post, we will discuss the necessary conditions for stability and how we can identify likely instability in our prediction models.&lt;/p&gt;
&lt;div id=&#34;who-this-post-is-for&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Who this post is for&lt;/h1&gt;
&lt;p&gt;Here’s what I assume you to know:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You’re familiar with &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; and the &lt;a href=&#34;https://www.tidyverse.org/&#34;&gt;tidyverse&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You know a little bit about fitting and evaluating linear regression models.&lt;/li&gt;
&lt;li&gt;You have a working knowledge of causal inference and Directed Acyclic Graphs (DAGs). We will use DAGs to represent assumptions about our data and graphically reason about (in)stability through the backdoor criterion. If these concepts are new to you, first have a look &lt;a href=&#34;https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/&#34;&gt;here&lt;/a&gt; and &lt;a href=&#34;https://www.andrewheiss.com/research/chapters/heiss-causal-inference-2021/10-causal-inference.pdf&#34;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We will use the following &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; packages throughout this post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(ggdag)
library(ggthemes)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-stability&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Model stability&lt;/h1&gt;
&lt;p&gt;In the introduction, I considered models stable if they worked comparably across multiple environments. While intuitive, this definition is of course very vague. Let’s spend a little more time on defining what exactly (in mathematical terms) we mean by stability. The definition here closely follows that of &lt;a href=&#34;https://arxiv.org/abs/1812.04597&#34;&gt;Subbaswamy and Saria (2019)&lt;/a&gt;, who recently introduced a (in my opinion) very neat framework for thinking and reasoning about model stability using DAGs.&lt;/p&gt;
&lt;p&gt;Take for example the relatively simple DAG introduced in &lt;a href=&#34;https://arxiv.org/abs/1808.03253&#34;&gt;Subbaswamy and Saria (2018)&lt;/a&gt; and displayed in Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt;. Let’s say we want to predict T, which may represent a clinical outcome of interest such as the onset of sepsis. In our dataset, we observe two variables Y and A that we could use to predict T. The arrows between T, Y, and A denote causal relationships between these variables, i.e., both T and A causally affect the value of Y. The absence of an arrow between T and A means that these variables do not directly affect each other. However, there is a final variable D that affects both the value of T and the value of A. We display D in grey because it is not observed in our dataset (e.g., because it is not routinely recorded by the clinician). If you have taken courses in statistics or epidemiology, you will know D as a confounding variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;coords &amp;lt;- list(x = c(T = -1, A = 1, D = 0, Y = 0, S = 1),
                y = c(T = 0, A = 0, D = 1, Y = -1, S = 1))

dag &amp;lt;- dagify(
  T ~ D,
  A ~ D + S,
  Y ~ T + A,
  coords = coords
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(dag, aes(x, y, xend = xend, yend = yend)) + 
  geom_dag_edges() + 
  # prediction target:
  geom_dag_point(data = ~ filter(., name == &amp;quot;T&amp;quot;), colour = &amp;quot;darkorange&amp;quot;) +     
  # observed variables:
  geom_dag_point(data = ~ filter(., name %in% c(&amp;quot;Y&amp;quot;, &amp;quot;A&amp;quot;)), colour = &amp;quot;darkblue&amp;quot;) + 
  # unobserved variables:
  geom_dag_point(data = ~ filter(., name == &amp;quot;D&amp;quot;), colour = &amp;quot;grey&amp;quot;) + 
  # selection variable indicating a distribution that changes across environments:
  geom_dag_point(data = ~ filter(., name == &amp;quot;S&amp;quot;), shape = 15) + 
  geom_dag_text() + 
  theme_dag()&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:example-dag&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://www.patrick-rockenschaub.com/_post/2022-05-27-stable-prediction/index.en_files/figure-html/example-dag-1.png&#34; alt=&#34;Directed acyclic graph specifying the causal relationships between a prediction target T, observed predictors A and Y, and an unobserved confounder D. The square node S represents an auxiliary selection variable that indicates variables that are mutable, i.e., change across different environments.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Directed acyclic graph specifying the causal relationships between a prediction target T, observed predictors A and Y, and an unobserved confounder D. The square node S represents an auxiliary selection variable that indicates variables that are mutable, i.e., change across different environments.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;
 
&lt;/p&gt;
&lt;p&gt;So far this is a pretty standard DAG. However, there is an odd square node in this graph that we haven’t mentioned yet: the selection variable S. &lt;a href=&#34;https://arxiv.org/abs/1812.04597&#34;&gt;Subbaswamy and Saria (2019)&lt;/a&gt; suggest using the auxiliary variable S to point to any variables in our graph that may vary arbitrarily across environments. Variables referenced by S are also called &lt;em&gt;mutable&lt;/em&gt; variables. By including an arrow from S to A in Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt;, we therefore claim that A is mutable and cannot be relied on in any environment that isn’t the training environment. Note that we do not make any claim as to why this variable is mutable; we merely state that its distribution may shift across environments.&lt;/p&gt;
&lt;p&gt;Once we have defined a DAG and all its mutable variables, we can graphically check whether our predictor is unstable by looking for any active unstable paths. &lt;a href=&#34;https://arxiv.org/abs/1812.04597&#34;&gt;Subbaswamy and Saria (2019)&lt;/a&gt; show that &lt;em&gt;the non-existence of active unstable paths is a graphical criterion for determining […] stability&lt;/em&gt;. Easy, right? At least once we know what they mean by an active unstable path. Let’s look at it term by term:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;path&lt;/em&gt;: a path is simply a sequence of nodes in which each consecutive pair of nodes is connected by an edge. Note that the direction of the edge (i.e., which way the arrow points) does not matter here. There are many different paths in Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt; such as &lt;code&gt;D -&amp;gt; T -&amp;gt; Y&lt;/code&gt; or &lt;code&gt;T &amp;lt;- D -&amp;gt; A &amp;lt;- S&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;active&lt;/em&gt;: whether a path is active or closed can be determined using the standard rules of d-separation (see chapter 6 of &lt;a href=&#34;https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/&#34;&gt;Hernán and Robins (2020)&lt;/a&gt; for a refresher on d-separation). Roughly speaking, a path is closed if it either a) contains a non-collider variable that is conditioned on by including it in the model or b) contains a collider that is &lt;strong&gt;not&lt;/strong&gt; conditioned on; otherwise, the path is active. For example, &lt;code&gt;T &amp;lt;- D -&amp;gt; A &amp;lt;- S&lt;/code&gt; is closed due to the unconditioned collider &lt;code&gt;-&amp;gt; A &amp;lt;-&lt;/code&gt; but becomes active if A is included in the model. It can be closed again by also including D in the model (if it were observed).&lt;/li&gt;
&lt;li&gt;&lt;em&gt;unstable&lt;/em&gt;: a path is unstable if it includes a selection variable S.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have worked with DAGs before, you probably already knew about active paths. The only new thing you need to learn is to only look for those active paths that are unstable, which is easy enough to verify. You don’t even need to look at all paths, only at those that include S! So let’s do it for our example in Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt;.&lt;/p&gt;
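&lt;p&gt;We can also let the computer enumerate these paths for us. The &lt;a href=&#34;https://cran.r-project.org/web/packages/dagitty/&#34;&gt;&lt;em&gt;dagitty&lt;/em&gt;&lt;/a&gt; package (which &lt;em&gt;ggdag&lt;/em&gt; builds on) lists all paths between two nodes together with their open/closed status. The snippet below is just a sketch to illustrate the idea; it rebuilds the DAG from Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt; with S included as an ordinary node:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dagitty)

# Same structure as Figure 1, with the selection variable S as a regular node
g &amp;lt;- dagitty(&amp;quot;dag { D -&amp;gt; T; D -&amp;gt; A; S -&amp;gt; A; T -&amp;gt; Y; A -&amp;gt; Y }&amp;quot;)

# Without conditioning on anything, both paths between T and S are closed
paths(g, from = &amp;quot;T&amp;quot;, to = &amp;quot;S&amp;quot;)

# Conditioning on A and Y opens the collider at A, activating T &amp;lt;- D -&amp;gt; A &amp;lt;- S
paths(g, from = &amp;quot;T&amp;quot;, to = &amp;quot;S&amp;quot;, Z = c(&amp;quot;A&amp;quot;, &amp;quot;Y&amp;quot;))&lt;/code&gt;&lt;/pre&gt;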
&lt;/div&gt;
&lt;div id=&#34;applying-the-theory-to-a-toy-example&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Applying the theory to a toy example&lt;/h1&gt;
&lt;p&gt;Given the DAG in Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt;, we could use different sets of variables to predict our target variable T. For example, we could a) use the observed variables A and Y, b) use Y alone, c) explore the possibility of using all variables by collecting additional data on D, or d) use no predictors (i.e., always predict the average). Let’s look at those options in turn and determine whether they would result in a stable model.&lt;/p&gt;
&lt;div id=&#34;use-all-observed-variables-as-predictors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Use all observed variables as predictors&lt;/h2&gt;
&lt;p&gt;A common practice in prediction modelling is to include as many variables as possible (and available). In Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt;, this would mean that we’d use A and Y to estimate the conditional probability &lt;span class=&#34;math inline&#34;&gt;\(P(T~|~A, Y)\)&lt;/span&gt;. Would such an estimate be stable? Let’s check for active unstable paths. There are two paths that include S: &lt;code&gt;T -&amp;gt; Y &amp;lt;- A &amp;lt;- S&lt;/code&gt; and &lt;code&gt;T &amp;lt;- D -&amp;gt; A &amp;lt;- S&lt;/code&gt;. In the first, the collider at &lt;code&gt;-&amp;gt; Y &amp;lt;-&lt;/code&gt; is opened by including Y in the model, but the path is blocked by also conditioning on A, making it closed. The second path also contains an open collider, namely at &lt;code&gt;-&amp;gt; A &amp;lt;-&lt;/code&gt;, and since we do not observe D, nothing blocks it: this path is active and unstable.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;use-only-non-mutable-observed-variables-as-predictors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Use only non-mutable observed variables as predictors&lt;/h2&gt;
&lt;p&gt;In recent years, researchers have become mindful of the fact that some relationships may be unreliable. For example, it is not unusual to see &lt;a href=&#34;https://arxiv.org/abs/2107.05230&#34;&gt;models that purposefully ignore information on medication to avoid spurious relationships&lt;/a&gt;. Following a similar line of argument, it could be tempting to remove A (which is mutable) from the model and only predict &lt;span class=&#34;math inline&#34;&gt;\(P(T~|~Y)\)&lt;/span&gt;. After all, if we are not relying on mutable variables we may be safe from instability. Unfortunately, this isn’t an option either (at least not in this particular example). If we remove A from the model, the previously blocked path &lt;code&gt;T -&amp;gt; Y &amp;lt;- A &amp;lt;- S&lt;/code&gt; is now open and we are again left with an unstable model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;collect-additional-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Collect additional data&lt;/h2&gt;
&lt;p&gt;By now, you might have thrown your hands up in despair. Neither option using the observed variables led to a stable model (note that adjusting only for A also does not solve the issue because there is still an open path via D). In our particular example, there is another possibility for a stable model if we have the time and resources to measure the previously unobserved variable D, but of course we only want to do so if it leads to a stable predictor. So is &lt;span class=&#34;math inline&#34;&gt;\(P(T~|~A, Y, D)\)&lt;/span&gt; stable? It turns out it is, as both &lt;code&gt;T -&amp;gt; Y &amp;lt;- A &amp;lt;- S&lt;/code&gt; (by A) and &lt;code&gt;T &amp;lt;- D -&amp;gt; A &amp;lt;- S&lt;/code&gt; (by D) are blocked and our model will therefore be stable across environments.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;use-no-predictors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Use no predictors&lt;/h2&gt;
&lt;p&gt;What else can we do if we can’t (or don’t want to) collect data on D? One final option is to admit defeat and simply make a prediction based on the average &lt;span class=&#34;math inline&#34;&gt;\(P(T)\)&lt;/span&gt;. This estimate is stable but obviously isn’t a very good predictor. Yet what else is there left to do? Thankfully, not all is lost: there are other smart things we could do to obtain a stable predictor without the need for additional data collection. I will talk about some of these options in my next posts.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;testing-the-theory&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Testing the theory&lt;/h1&gt;
&lt;p&gt;Up to now, we have used theory to determine whether a particular model would result in a stable predictor. In this final section, we simulate data for Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt; to test our conclusions and confirm the (lack of) stability of all models considered above. Following the example in &lt;a href=&#34;https://arxiv.org/abs/1808.03253&#34;&gt;Subbaswamy and Saria (2018)&lt;/a&gt;, we use simple linear relationships and Gaussian noise for all variables, giving the following structural equations:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
D &amp;amp;\sim N(0, \sigma^2) \\
T &amp;amp;\sim N(\beta_1D, \sigma^2) \\
A &amp;amp;\sim N(\beta_2^eD, \sigma^2) \\
Y &amp;amp;\sim N(\beta_3T + \beta_4A, \sigma^2)
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;You might have noticed the superscript &lt;span class=&#34;math inline&#34;&gt;\(e\)&lt;/span&gt; in &lt;span class=&#34;math inline&#34;&gt;\(\beta^e_2\)&lt;/span&gt;. We use this superscript to indicate that the coefficient depends on the environment &lt;span class=&#34;math inline&#34;&gt;\(e \in \mathcal{E}\)&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(\mathcal{E}\)&lt;/span&gt; is the set of all possible environments. Since the value of the coefficient depends on the environment, A is mutable (note that we could have chosen other ways to make A mutable, for example by including another unobserved variable that influences A and changes across environments). All other coefficients are constant across environments, i.e., &lt;span class=&#34;math inline&#34;&gt;\(\beta_i^e = \beta_i\)&lt;/span&gt; for &lt;span class=&#34;math inline&#34;&gt;\(i \in \{1, 3, 4 \}\)&lt;/span&gt;. Finally, we set a uniform noise &lt;span class=&#34;math inline&#34;&gt;\(\sigma^2=0.1\)&lt;/span&gt; for all variables. We combine this into a function that draws a sample of size &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;simulate_data &amp;lt;- function(n, beta, dev = 0) {
  noise &amp;lt;- sqrt(0.1) # rnorm is parameterised as sigma instead of sigma^2
  
  D &amp;lt;- rnorm(n, sd = noise)
  T &amp;lt;- rnorm(n, beta[1] * D, sd = noise)
  A &amp;lt;- rnorm(n, (beta[2] + dev) * D, sd = noise)
  Y &amp;lt;- rnorm(n, beta[3] * T + beta[4] * A, sd = noise)
  
  tibble(D, T, A, Y)
}

set.seed(42)
n &amp;lt;- 30000

# Choose coefficients
beta &amp;lt;- vector(&amp;quot;numeric&amp;quot;, length = 4)
beta[2] &amp;lt;- 2                  # we manually set beta_2 and vary it by env
beta[c(1, 3, 4)] &amp;lt;- rnorm(3)  # we randomly draw values for the other betas

cat(&amp;quot;Betas: &amp;quot;, beta)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Betas:  1.370958 2 -0.5646982 0.3631284&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will define model performance in terms of the mean squared error (MSE) &lt;span class=&#34;math inline&#34;&gt;\(n^{-1} \sum_{i=1}^n (t_i - \hat t_i)^2\)&lt;/span&gt;, where &lt;span class=&#34;math inline&#34;&gt;\(t_i\)&lt;/span&gt; is the true value of T for patient &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\hat t_i\)&lt;/span&gt; is the estimate given by our model. The function &lt;code&gt;fit_and_eval()&lt;/code&gt; fits a linear regression model to the training data and returns its MSE on some test data. By varying &lt;span class=&#34;math inline&#34;&gt;\(\beta^{e}_2\)&lt;/span&gt; in the test environment, we can test how our models perform when the coefficient deviates more and more from the value seen in our training environment.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mse &amp;lt;- function(y, y_hat){
  mean((y - y_hat) ^ 2)
}

fit_and_eval &amp;lt;- function(formula, train, test) {
  fit &amp;lt;- lm(formula, data = train)
  pred &amp;lt;- predict(fit, test)
  mse(test$T, pred)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we’ve defined everything we need to run our simulation, let’s see how our models fare. Since we are only running linear regressions that are easy to compute, we can set the number of samples to a high value (N=30,000) to get stable results. The performance of our four models across the range of &lt;span class=&#34;math inline&#34;&gt;\(\beta_2\)&lt;/span&gt; can be seen in Figure &lt;a href=&#34;#fig:run-simulations&#34;&gt;2&lt;/a&gt;. Our simulations appear to confirm our theoretical analysis. The full model &lt;code&gt;M3&lt;/code&gt; (blue) retains a stable performance across all considered &lt;span class=&#34;math inline&#34;&gt;\(\beta_2\)&lt;/span&gt;’s. &lt;code&gt;M1&lt;/code&gt; and &lt;code&gt;M2&lt;/code&gt; on the other hand have U-shaped performance curves that depend on the value of &lt;span class=&#34;math inline&#34;&gt;\(\beta_2\)&lt;/span&gt; in the test environment. When &lt;span class=&#34;math inline&#34;&gt;\(\beta_2\)&lt;/span&gt; is close to the value in the training environment (vertical grey line), &lt;code&gt;M1&lt;/code&gt; achieves a performance that is almost as good as that of the full model &lt;code&gt;M3&lt;/code&gt;. However, as the coefficient deviates from its value in the training environment, model performance quickly deteriorates and even becomes worse than simply using the global average (&lt;code&gt;M4&lt;/code&gt; green line).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Training environment (always the same)
train &amp;lt;- simulate_data(n, beta)

# Test environments (beta_2 deviates from training env along a grid)
grid_len &amp;lt;- 100L
all_obs &amp;lt;- only_y &amp;lt;- add_d &amp;lt;- no_pred &amp;lt;- vector(&amp;quot;numeric&amp;quot;, grid_len)
devs &amp;lt;- seq(-12, 4, length.out = grid_len)

for (i in 1:grid_len) {
  # Draw test environment
  test &amp;lt;- simulate_data(n, beta, dev = devs[i])
  
  # Fit each model
  all_obs[i] &amp;lt;- fit_and_eval(T ~ Y + A    , train, test)
  only_y[i]  &amp;lt;- fit_and_eval(T ~ Y        , train, test)
  add_d[i]   &amp;lt;- fit_and_eval(T ~ Y + A + D, train, test)
  no_pred[i] &amp;lt;- fit_and_eval(T ~ 1        , train, test)
}

results &amp;lt;- tibble(devs, all_obs, only_y, add_d, no_pred)


ggplot(results, aes(x = beta[2] + devs)) + 
  geom_vline(xintercept = beta[2], colour = &amp;quot;darkgrey&amp;quot;, size = 1) + 
  geom_point(aes(y = all_obs, colour = &amp;quot;M1: all observed variables&amp;quot;), size = 2) + 
  geom_point(aes(y = only_y, colour = &amp;quot;M2: non-mutable variables&amp;quot;), size = 2) + 
  geom_point(aes(y = add_d, colour = &amp;quot;M3: additional data&amp;quot;), size = 2) + 
  geom_point(aes(y = no_pred, colour = &amp;quot;M4: no predictor&amp;quot;), size = 2) + 
  scale_colour_colorblind() + 
  labs(
    x = expression(beta[2]),
    y = &amp;quot;Mean squared error\n&amp;quot;
  ) +
  coord_cartesian(ylim = c(0, 0.5), expand = FALSE) + 
  theme_bw() + 
  theme(
    panel.grid = element_blank()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:run-simulations&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://www.patrick-rockenschaub.com/_post/2022-05-27-stable-prediction/index.en_files/figure-html/run-simulations-1.png&#34; alt=&#34;Mean squared error of all models across a range of test environments that differ in the coefficient for the relationship D -&amp;gt; A. The vertical grey line indicates the training environment.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2: Mean squared error of all models across a range of test environments that differ in the coefficient for the relationship D -&amp;gt; A. The vertical grey line indicates the training environment.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I have to admit I was surprised by the results for &lt;code&gt;M2&lt;/code&gt;, which I expected to be similar to but slightly worse than &lt;code&gt;M1&lt;/code&gt;. Instead of a minimum close to the training environment, however, &lt;code&gt;M2&lt;/code&gt; achieved its best performance far away at &lt;span class=&#34;math inline&#34;&gt;\(\beta_2 \approx -7.5\)&lt;/span&gt;, whereas the performance in the training environment was barely better than using no predictors. Its &lt;span class=&#34;math inline&#34;&gt;\(R^2\)&lt;/span&gt; was only 0.093 compared to 0.672 for &lt;code&gt;M1&lt;/code&gt; and 0.741 for &lt;code&gt;M3&lt;/code&gt;. The reason for this seems to be a very low variance of Y given the particular set of &lt;span class=&#34;math inline&#34;&gt;\(\beta\)&lt;/span&gt;’s chosen. As the value of &lt;span class=&#34;math inline&#34;&gt;\(\beta_2\)&lt;/span&gt; decreases, the variance of Y and its covariance with D and T change such that Y becomes a better predictor of T (and even crowds out the “direct” effect of D due to the active backdoor path &lt;code&gt;D -&amp;gt; A -&amp;gt; Y &amp;lt;- T&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Finally, note that the scale of the curves in Figure &lt;a href=&#34;#fig:run-simulations&#34;&gt;2&lt;/a&gt; depends on the values chosen for &lt;span class=&#34;math inline&#34;&gt;\(\beta_1\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(\beta_2\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(\beta_4\)&lt;/span&gt;. The shape of the curves and the overall conclusions remain the same, though.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;note-on-patient-mix&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Note on patient mix&lt;/h1&gt;
&lt;p&gt;So far I’ve acted as if incorrectly estimated model coefficients are the only reason for changes in performance across environments. However, if you’ve ever performed (or read about) external validation of clinical prediction models, you may by now be shouting at your screen that there are other reasons for performance changes. In fact, even if our model is specified perfectly (i.e., all coefficients are estimated at their true causal values), it may not always be possible to achieve the same performance across environments. I discussed in a &lt;a href=&#34;https://www.patrick-rockenschaub.com/posts/2021/11/contextual-nature-of-auc/&#34;&gt;previous post&lt;/a&gt; how the AUC may change depending on the make-up of your target population even if we know the exact model that generated the data. The same general principle is true for the MSE. Some patients may simply be harder to predict than others, and if your population contains more of one type of patient than another, the average performance of your model may change (though model performance remains the same for each individual patient, assuming your coefficients are correct!). The make-up of your population is often referred to as patient mix. In our case, patient mix remained stable across environments (we did not change &lt;code&gt;D -&amp;gt; T&lt;/code&gt;). I chose this setup to focus on the effects of a mutable variable when estimating model parameters. However, thinking hard about patient mix becomes indispensable when transferring a model to new populations. If you want to read up further on this topic, I can recommend chapter 19 of &lt;a href=&#34;https://link.springer.com/book/10.1007/978-0-387-77244-8&#34;&gt;Ewout Steyerberg’s book on Clinical Prediction Models&lt;/a&gt;, which includes some general advice on how to distinguish changes in patient mix from issues of model misspecification.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;acknowledgements&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;The structure of this post (and likely all future posts) was inspired by the great posts on &lt;a href=&#34;https://www.andrewheiss.com/&#34;&gt;Andrew Heiss’ blog&lt;/a&gt; and in particular his posts on &lt;a href=&#34;https://www.andrewheiss.com/blog/2021/12/18/bayesian-propensity-scores-weights/&#34;&gt;inverse probability weighting in Bayesian models&lt;/a&gt; and the &lt;a href=&#34;https://www.andrewheiss.com/blog/2021/09/07/do-calculus-backdoors/&#34;&gt;derivation of the three rules of do-calculus&lt;/a&gt;. Andrew is an assistant professor in the Department of Public Management and Policy at Georgia State University, teaching on causal inference, statistics, and data science. His posts on these topics have been a joy to read and I am striving to make mine as effortlessly educational.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Implementing the optimism-adjusted bootstrap with tidymodels</title>
      <link>https://www.patrick-rockenschaub.com/_post/2022-01-26-tidymodels-optimism-bootstrap/</link>
      <pubDate>Wed, 26 Jan 2022 12:00:00 +0000</pubDate>
      <guid>https://www.patrick-rockenschaub.com/_post/2022-01-26-tidymodels-optimism-bootstrap/</guid>
      <description>


&lt;p&gt;It is well known that prediction models have a tendency to overfit to the training data, especially if we only have a limited amount of training data. While performance of such overfitted models appears high when evaluated on the data available during training, their performance on new, previously unseen data is often considerably worse. Although it may be tempting to the analyst to choose a model with high training performance, it is the model’s performance in future data that we are really interested in.&lt;/p&gt;
&lt;p&gt;Several resampling methods have been proposed to account for this issue. The most widely used techniques fall into two categories: cross-validation and bootstrapping. The idea underlying these techniques is similar. By repeating the model fitting multiple times on different subsets of the training data, we may get a better understanding of the magnitude of overfitting and can account for it in our model building and evaluation. Without going into too much detail, cross-validation separates the data into &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; mutually exclusive folds and always holds one back as a “hidden” test set. Note that the sample size available to the model during each training run necessarily decreases to &lt;span class=&#34;math inline&#34;&gt;\(\frac{k-1}{k}n\)&lt;/span&gt;. Bootstrapping, on the other hand, resamples (with replacement) a data set of the same size &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; as the original training set and then — depending on the exact method — uses a weighted combination of the sampled and excluded observations.&lt;/p&gt;
&lt;p&gt;Whereas the machine learning community almost exclusively uses cross-validation for model validation, bootstrap-based methods may be more commonly seen in biomedical sciences. One reason for this popularity may be the fact that they are championed by preeminent experts in the field: both Frank Harrell &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-Harrell2015-ws&#34; role=&#34;doc-biblioref&#34;&gt;Harrell 2015&lt;/a&gt;)&lt;/span&gt; and Ewout Steyerberg &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-Steyerberg2019-yc&#34; role=&#34;doc-biblioref&#34;&gt;Steyerberg 2019&lt;/a&gt;)&lt;/span&gt; prominently feature the bootstrap — and in particular the optimism-adjusted bootstrap (OAD) — in their textbooks. In this post, I give a brief introduction to OAD and compare it to repeated cross-validation and the regular bootstrap. OAD is implemented in the R packages &lt;a href=&#34;https://cran.r-project.org/web/packages/caret/&#34;&gt;&lt;em&gt;caret&lt;/em&gt;&lt;/a&gt; and Frank Harrell’s &lt;a href=&#34;https://cran.r-project.org/web/packages/rms/&#34;&gt;&lt;em&gt;rms&lt;/em&gt;&lt;/a&gt; but not in the recent &lt;a href=&#34;https://cran.r-project.org/web/packages/tidymodels/&#34;&gt;&lt;em&gt;tidymodels&lt;/em&gt;&lt;/a&gt; ecosystem &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-Kuhn2022-al&#34; role=&#34;doc-biblioref&#34;&gt;Kuhn and Silge 2022&lt;/a&gt;)&lt;/span&gt;. This post will therefore provide a step-by-step guide to doing OAD with &lt;a href=&#34;https://cran.r-project.org/web/packages/tidymodels/&#34;&gt;&lt;em&gt;tidymodels&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;who-this-post-is-for&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Who this post is for&lt;/h1&gt;
&lt;p&gt;Here’s what I assume you to know:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You’re familiar with &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; and the &lt;a href=&#34;https://www.tidyverse.org/&#34;&gt;tidyverse&lt;/a&gt;, including the amazing &lt;a href=&#34;https://www.tidymodels.org/&#34;&gt;tidymodels&lt;/a&gt; framework (if not go check it out now!).&lt;/li&gt;
&lt;li&gt;You know a little bit about fitting and evaluating linear regression models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We will use the following &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; packages throughout this post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(tidymodels)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;optimism-adjusted-bootstrap&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Optimism-adjusted bootstrap&lt;/h1&gt;
&lt;p&gt;Like other resampling schemes, OAB aims to avoid overly optimistic estimation of model performance during internal validation — i.e., validation of model performance using the training dataset. As we will see further down, simply calculating performance metrics on the same data used for training leads to artificially good performance estimates. We will call this the “apparent” performance. OAB obtains a better estimate by directly estimating the amount of “optimism” in the apparent performance. The steps needed to do so are as follows &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-Steyerberg2019-yc&#34; role=&#34;doc-biblioref&#34;&gt;Steyerberg 2019&lt;/a&gt;)&lt;/span&gt;:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Fit a model &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; to the original training set &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt; and use &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; to calculate the apparent performance &lt;span class=&#34;math inline&#34;&gt;\(R(M, S)\)&lt;/span&gt; (e.g., accuracy) on the training data&lt;/li&gt;
&lt;li&gt;Draw a bootstrapped sample &lt;span class=&#34;math inline&#34;&gt;\(S^*\)&lt;/span&gt; of the same size as &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt; through sampling &lt;em&gt;with&lt;/em&gt; replacement&lt;/li&gt;
&lt;li&gt;Construct another model &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt; by performing all model building steps (pre-processing, imputation, model selection, etc.) on &lt;span class=&#34;math inline&#34;&gt;\(S^*\)&lt;/span&gt; and calculate its apparent performance &lt;span class=&#34;math inline&#34;&gt;\(R(M^*, S^*)\)&lt;/span&gt; on &lt;span class=&#34;math inline&#34;&gt;\(S^*\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Use &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt; to estimate the performance &lt;span class=&#34;math inline&#34;&gt;\(R(M^*, S)\)&lt;/span&gt; that it would have had on the original data &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;Calculate the optimism &lt;span class=&#34;math inline&#34;&gt;\(O^* = R(M^*, S^*) - R(M^*, S)\)&lt;/span&gt; as the difference between the apparent and test performance of &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;Repeat steps 2.-5. many times &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; to obtain a sufficiently stable estimate (common recommendations range from 100 to 1000 repetitions, depending on computational feasibility)&lt;/li&gt;
&lt;li&gt;Subtract the mean optimism &lt;span class=&#34;math inline&#34;&gt;\(\frac{1}{B} \sum^B_{b=1} O^*_b\)&lt;/span&gt; from the apparent performance &lt;span class=&#34;math inline&#34;&gt;\(R(M, S)\)&lt;/span&gt; in the original training data &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt; to get an optimism-adjusted estimate of model performance.&lt;/li&gt;
&lt;/ol&gt;
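&lt;p&gt;To make these steps concrete before turning to &lt;em&gt;tidymodels&lt;/em&gt;, here is a minimal base-R sketch of the procedure on toy data with a plain &lt;code&gt;lm()&lt;/code&gt; fit (purely illustrative, not the code we use below):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)

# Toy training set S
S &amp;lt;- data.frame(x = rnorm(100))
S$y &amp;lt;- S$x + rnorm(100)

rmse &amp;lt;- function(model, data) sqrt(mean((data$y - predict(model, data))^2))

# Step 1: apparent performance R(M, S)
M &amp;lt;- lm(y ~ x, data = S)
r_app &amp;lt;- rmse(M, S)

# Steps 2-6: estimate the optimism from B = 200 bootstrap samples
optimism &amp;lt;- replicate(200, {
  S_star &amp;lt;- S[sample(nrow(S), replace = TRUE), ]  # step 2
  M_star &amp;lt;- lm(y ~ x, data = S_star)              # step 3
  rmse(M_star, S_star) - rmse(M_star, S)          # steps 4-5
})

# Step 7: optimism-adjusted performance estimate
r_adj &amp;lt;- r_app - mean(optimism)&lt;/code&gt;&lt;/pre&gt;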
&lt;p&gt;The basic intuition behind this procedure is that the model &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt; will overfit to &lt;span class=&#34;math inline&#34;&gt;\(S^*\)&lt;/span&gt; in the same way as &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; overfits to &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt;. We can then estimate the difference between &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt;’s observed apparent performance &lt;span class=&#34;math inline&#34;&gt;\(R(M, S)\)&lt;/span&gt; and its unobserved performance on future test data &lt;span class=&#34;math inline&#34;&gt;\(R(M, U)\)&lt;/span&gt; from the difference between the bootstrapped model &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt;’s apparent performance &lt;span class=&#34;math inline&#34;&gt;\(R(M^*, S^*)\)&lt;/span&gt; and its test performance &lt;span class=&#34;math inline&#34;&gt;\(R(M^*, S)\)&lt;/span&gt; (both of which are observed). The training data &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt; acts as a stand-in test set for the bootstrapped model &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;The following sections will apply this basic idea to the Ames housing dataset and compare estimates derived via OAB to repeated cross-validation and standard bootstrap.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-ames-data-set&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The Ames data set&lt;/h1&gt;
&lt;p&gt;The Ames data set contains information on 2,930 properties in Ames, Iowa, described by 74 variables including the number of bedrooms, whether the property includes a garage, and the sale price. We choose this data set because it provides a decent sample size for predictive modelling and is used prominently in the documentation of the R &lt;code&gt;tidymodels&lt;/code&gt; ecosystem. More information on the Ames data set can be found in &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-Kuhn2022-al&#34; role=&#34;doc-biblioref&#34;&gt;Kuhn and Silge 2022&lt;/a&gt;)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)

data(ames)
dim(ames)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 2930   74&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ames[1:5, 1:5]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 × 5
##   MS_SubClass                         MS_Zoning     Lot_Frontage Lot_Area Street
##   &amp;lt;fct&amp;gt;                               &amp;lt;fct&amp;gt;                &amp;lt;dbl&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;fct&amp;gt; 
## 1 One_Story_1946_and_Newer_All_Styles Residential_…          141    31770 Pave  
## 2 One_Story_1946_and_Newer_All_Styles Residential_…           80    11622 Pave  
## 3 One_Story_1946_and_Newer_All_Styles Residential_…           81    14267 Pave  
## 4 One_Story_1946_and_Newer_All_Styles Residential_…           93    11160 Pave  
## 5 Two_Story_1946_and_Newer            Residential_…           74    13830 Pave&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For this exercise, we try to predict sale prices within the dataset. To keep preprocessing simple, we limit the predictors to numeric variables only, which we centre and scale. Since sale prices are right-skewed, we log-transform them before prediction. Finally, we hold back a random quarter of the data to simulate external validation on an independent, identically distributed test set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Define sale price as the prediction target
formula &amp;lt;- Sale_Price ~ .

# Remove categorical variables, log sale price, scale the numeric predictors
preproc &amp;lt;- recipe(formula, data = ames[0, ]) %&amp;gt;% 
  step_rm(all_nominal_predictors()) %&amp;gt;% 
  step_log(all_outcomes()) %&amp;gt;% 
  step_normalize(all_numeric_predictors(), -all_outcomes())

# Randomly split into training (3/4) and testing (1/4) sets
train_test_split &amp;lt;- initial_split(ames, prop = 3/4)
train &amp;lt;- training(train_test_split)
test &amp;lt;- testing(train_test_split)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;optimism-adjusted-bootstrap-with-tidymodels&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Optimism-adjusted bootstrap with &lt;em&gt;tidymodels&lt;/em&gt;&lt;/h1&gt;
&lt;p&gt;Now that we have set up the data, let’s look at how we can build a linear regression model and validate it via OAB. We proceed according to the steps described above.&lt;/p&gt;
&lt;div id=&#34;step-1-calculate-apparent-perforamnce&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 1: Calculate apparent performance&lt;/h2&gt;
&lt;p&gt;To start, we simply fit and evaluate our model &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; on the original training data &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt; (note that we also apply preprocessing, so strictly speaking we train our model on the preprocessed data &lt;span class=&#34;math inline&#34;&gt;\(S&amp;#39;\)&lt;/span&gt;). Since our outcome is a continuous value strictly greater than zero, we will use the root mean squared error (RMSE) as our performance metric.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prepped &amp;lt;- prep(preproc, train)
preproc_orig &amp;lt;- juice(prepped)
fit_orig &amp;lt;- fit(linear_reg(), formula, preproc_orig)
preds_orig &amp;lt;- predict(fit_orig, new_data = preproc_orig)
perf_orig &amp;lt;- rmse_vec(preproc_orig$Sale_Price, preds_orig$.pred)

perf_orig&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.1693906&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2-create-bootstrapped-samples&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 2: Create bootstrapped samples&lt;/h2&gt;
&lt;p&gt;After obtaining &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(R(M, S)\)&lt;/span&gt;, we now produce a set of bootstrap samples to estimate the amount of optimism in this performance estimate. We use the &lt;em&gt;tidymodels&lt;/em&gt; sub-package &lt;em&gt;rsample&lt;/em&gt; to create a data frame &lt;code&gt;bs&lt;/code&gt; with &lt;code&gt;100&lt;/code&gt; bootstrap samples. All of these resamples have training data of equal size to the original training data (n = 2197). Note, however, that the “testing data” set aside differs between splits: it consists of all rows that did not get sampled into the training data, the number of which is random and varies between bootstraps. We won’t use this testing data for OAB, but it is used, for example, by the standard bootstrap that we compare against later.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bs &amp;lt;- bootstraps(train, times = 100L)

bs %&amp;gt;% slice(1:5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 × 2
##   splits             id          
##   &amp;lt;list&amp;gt;             &amp;lt;chr&amp;gt;       
## 1 &amp;lt;split [2197/813]&amp;gt; Bootstrap001
## 2 &amp;lt;split [2197/818]&amp;gt; Bootstrap002
## 3 &amp;lt;split [2197/813]&amp;gt; Bootstrap003
## 4 &amp;lt;split [2197/786]&amp;gt; Bootstrap004
## 5 &amp;lt;split [2197/792]&amp;gt; Bootstrap005&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bs %&amp;gt;% slice((n()-5):n())&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 × 2
##   splits             id          
##   &amp;lt;list&amp;gt;             &amp;lt;chr&amp;gt;       
## 1 &amp;lt;split [2197/806]&amp;gt; Bootstrap095
## 2 &amp;lt;split [2197/822]&amp;gt; Bootstrap096
## 3 &amp;lt;split [2197/778]&amp;gt; Bootstrap097
## 4 &amp;lt;split [2197/800]&amp;gt; Bootstrap098
## 5 &amp;lt;split [2197/833]&amp;gt; Bootstrap099
## 6 &amp;lt;split [2197/815]&amp;gt; Bootstrap100&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3-fit-bootstrapped-models-and-calculate-their-apparent-performance&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 3: Fit bootstrapped models and calculate their apparent performance&lt;/h2&gt;
&lt;p&gt;We now use the bootstrap data.frame &lt;code&gt;bs&lt;/code&gt; to preprocess each sample &lt;span class=&#34;math inline&#34;&gt;\(S^*\)&lt;/span&gt; individually, fit a linear regression &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt; to it, and calculate its apparent performance &lt;span class=&#34;math inline&#34;&gt;\(R(M^*, S^*)\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bs &amp;lt;- bs %&amp;gt;% 
  mutate(
    # Apply preprocessing separately for each bootstrapped sample S*
    processed = map(splits, ~ juice(prep(preproc, training(.)))),
    # Fit a separate model M* to each preprocessed bootstrap
    fitted = map(processed, ~ fit(linear_reg(), formula, data = .)),
    # Predict values for each bootstrap&amp;#39;s training data S* and calculate RMSE
    pred_app = map2(fitted, processed, ~ predict(.x, new_data = .y)),
    perf_app = map2_dbl(processed, pred_app, ~ rmse_vec(.x$Sale_Price, .y$.pred))
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4-evaluate-on-the-original-training-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 4: Evaluate on the original training data&lt;/h2&gt;
&lt;p&gt;Since we stored the fitted models &lt;span class=&#34;math inline&#34;&gt;\(M^*_i\)&lt;/span&gt; in a column of the data.frame, we can easily re-use them to predict values for the original data and evaluate them. Remember that because some of the rows in the original dataset did not end up in the bootstrapped dataset, we expect each model &lt;span class=&#34;math inline&#34;&gt;\(M^*_i\)&lt;/span&gt; to perform worse on the original data — i.e., have a higher RMSE &lt;span class=&#34;math inline&#34;&gt;\(R(M^*_i, S)\)&lt;/span&gt; — than on its own training data &lt;span class=&#34;math inline&#34;&gt;\(R(M^*_i, S^*_i)\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bs &amp;lt;- bs %&amp;gt;% 
  mutate(
    pred_test = map(fitted, ~ predict(., new_data = preproc_orig)),
    perf_test = map_dbl(pred_test, ~ rmse_vec(preproc_orig$Sale_Price, .$.pred)),
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-5-estimate-the-optimism&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 5: Estimate the optimism&lt;/h2&gt;
&lt;p&gt;The amount of optimism in our apparent estimate is now estimated simply as the difference between the apparent and test performance in each bootstrap sample.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bs &amp;lt;- bs %&amp;gt;% 
  mutate(
    optim = perf_app - perf_test
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;steps-6-7-adjust-for-optimism&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Steps 6-7: Adjust for optimism&lt;/h2&gt;
&lt;p&gt;We already repeated this procedure for all 100 bootstrap samples, so step 6 is fulfilled. In order to get a single, final estimate, all that’s left to do is to calculate the mean and standard deviation of the optimism and subtract the mean from the apparent performance obtained in step 1, using the standard error to form approximate normal 95% Wald confidence limits. This is now the performance that we report for our model after internal validation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mean_opt &amp;lt;- mean(bs$optim)
std_opt &amp;lt;- sd(bs$optim)

(perf_orig - mean_opt) + c(-2, 0, 2) * std_opt / sqrt(nrow(bs))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.1766124 0.1789846 0.1813568&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;external-validation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;External validation&lt;/h2&gt;
&lt;p&gt;Remember that we set aside a quarter of the data for external validation (“external” is a bit of a misnomer here, but more on that later). We can now check how our estimate from internal validation compares to the performance in the held-out data. Indeed, the performance has dropped slightly further, and the held-out RMSE even falls just outside the narrow confidence limits suggested by OAB above — although it remains much closer to the OAB estimate than to the apparent performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;preproc_test &amp;lt;- bake(prepped, test)
preds_test &amp;lt;- predict(fit_orig, new_data = preproc_test)
rmse_vec(preproc_test$Sale_Price, preds_test$.pred)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.1855153&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;putting-everything-together&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Putting everything together&lt;/h1&gt;
&lt;p&gt;Using what we learned above, we can create a single function &lt;code&gt;calculate_optimism_adjusted()&lt;/code&gt; that performs all steps and returns the adjusted model performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;calculate_optimism_adjusted &amp;lt;- function(train_data, formula, preproc, n_resamples = 10L) {
  # Get apparent performance
  prepped &amp;lt;- prep(preproc, train_data)
  preproc_orig &amp;lt;- juice(prepped)
  fit_orig &amp;lt;- fit(linear_reg(), formula, preproc_orig)
  preds_orig &amp;lt;- predict(fit_orig, new_data = preproc_orig)
  perf_orig &amp;lt;- rmse_vec(last(preproc_orig), preds_orig$.pred)
  
  # Estimate optimism via bootstrap
  rsmpl &amp;lt;- bootstraps(train_data, times = n_resamples) %&amp;gt;% 
    mutate(
      processed = map(splits, ~ juice(prep(preproc, training(.)))),
      fitted = map(processed, ~ fit(linear_reg(), formula, data = .)),
      pred_app = map2(fitted, processed, ~ predict(.x, new_data = .y)),
      perf_app = map2_dbl(processed, pred_app, ~ rmse_vec(.x$Sale_Price, .y$.pred)),
      pred_test = map(fitted, ~ predict(., new_data = preproc_orig)),
      perf_test = map_dbl(pred_test, ~ rmse_vec(last(preproc_orig), .$.pred)),
      optim = perf_app - perf_test
    )
  
  mean_opt &amp;lt;- mean(rsmpl$optim)
  std_opt &amp;lt;- sd(rsmpl$optim)

  # Adjust for optimism
  tibble(
    .metric = &amp;quot;rmse&amp;quot;,
    mean = perf_orig - mean_opt, 
    n = n_resamples, 
    std_err = std_opt / sqrt(n_resamples)
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also define a similar function &lt;code&gt;eval_test()&lt;/code&gt; for the external validation, as well as wrappers around &lt;em&gt;tidymodels&lt;/em&gt;’ &lt;code&gt;fit_resamples()&lt;/code&gt; that do the same for repeated cross-validation (&lt;code&gt;calculate_repeated_cv()&lt;/code&gt;) and the standard bootstrap (&lt;code&gt;calculate_standard_bs()&lt;/code&gt;), which we will compare in a second.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;eval_test &amp;lt;- function(train_data, test_data, formula, preproc) {
  
  prepped &amp;lt;- prep(preproc, train_data)
  preproc_train &amp;lt;- juice(prepped)
  preproc_test &amp;lt;- bake(prepped, test_data)
  fitted &amp;lt;- fit(linear_reg(), formula, data = preproc_train)
  preds &amp;lt;- predict(fitted, new_data = preproc_test)
  rmse_vec(preproc_test$Sale_Price, preds$.pred)
}

calculate_repeated_cv &amp;lt;- function(train_data, formula, preproc, v = 10L, repeats = 1L){
  rsmpl &amp;lt;- vfold_cv(train_data, v = v, repeats = repeats)

  show_best(fit_resamples(linear_reg(), preproc, rsmpl), metric = &amp;quot;rmse&amp;quot;) %&amp;gt;% 
    select(-.estimator, -.config)
}

calculate_standard_bs &amp;lt;- function(train_data, formula, preproc, n_resamples = 10L) {
  rsmpl &amp;lt;- bootstraps(train_data, times = n_resamples, apparent = FALSE)

  show_best(fit_resamples(linear_reg(), preproc, rsmpl), metric = &amp;quot;rmse&amp;quot;) %&amp;gt;% 
    select(-.estimator, -.config)
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;comparison-of-validation-methods&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Comparison of validation methods&lt;/h1&gt;
&lt;p&gt;In this last section, we compare the results obtained from OAB to two other well-known validation methods: repeated 10-fold cross-validation and the standard bootstrap. In the former, we randomly split the data into 10 mutually exclusive folds of equal size. In a round-robin fashion, we set aside one fold as an evaluation set and use the remaining nine to train our model. We then choose the next fold and do the same. After one round, we have ten estimates of model performance, one for each held-out fold. We repeat this process several times with new random seeds to obtain the same number of resamples as were used for the bootstrap. With the standard bootstrap, we fit our models on bootstrapped samples as before but evaluate them on the observations that were randomly excluded from each particular bootstrap — similar to the held-out fold in cross-validation.&lt;/p&gt;
&lt;p&gt;In order to get a good comparison of methods, we won’t stick with a single train-test split as before but use nested validation. The reason for this is that our test data isn’t truly external. Instead, it is randomly sampled from the entire development dataset (which in this case was all of Ames). Holding out a single chunk of that data as test data would be wasteful and could again result in us being particularly lucky or unlucky in the selection of that chunk. This is particularly problematic if we further perform hyperparameter searches. In nested validation, we mitigate this risk by wrapping our entire internal validation in another cross-validation loop, i.e., we treat the held-out set of an outer cross-validation as the “external” test set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;outer &amp;lt;- vfold_cv(ames, v = 5, repeats = 1)

outer &amp;lt;- outer %&amp;gt;% 
  mutate(
    opt = splits %&amp;gt;% 
      map(~ calculate_optimism_adjusted(training(.), formula, preproc, 100L)),
    cv = splits %&amp;gt;% 
      map(~ calculate_repeated_cv(training(.), formula, preproc, repeats = 10L)), 
    bs = splits %&amp;gt;% 
      map(~ calculate_standard_bs(training(.), formula, preproc, 100L)), 
    test = splits %&amp;gt;% 
      map_dbl(~ eval_test(training(.), testing(.), formula, preproc))
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can see below that, in this example, all resampling methods perform more or less similarly. Notably, both bootstrap-based methods have narrower confidence intervals. This was to be expected, as cross-validation typically has high variance. With the bootstrap, this increased precision is traded for a risk of bias, which is usually pessimistic — as with the standard bootstrap in this example. OAB here seems to have a slight optimistic bias: while its mean is similar to that of cross-validation, its narrower confidence interval means that the average test performance over the nested runs is not contained within the approximate confidence limits. Nonetheless, all resampling methods give us a more accurate estimate of likely future model performance than the apparent performance of 0.169.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;format_results &amp;lt;- function(outer, method) {
  method &amp;lt;- rlang::enquo(method)
  
  outer %&amp;gt;% 
    unnest(!!method) %&amp;gt;% 
    summarise(
      rsmpl_lower = mean(mean - 2 * std_err),
      rsmpl_mean  = mean(mean), 
      rsmpl_upper = mean(mean + 2 * std_err), 
      test_mean   = mean(test)
    )
}

tibble(method = c(&amp;quot;opt&amp;quot;, &amp;quot;cv&amp;quot;, &amp;quot;bs&amp;quot;)) %&amp;gt;% 
  bind_cols(bind_rows(
    format_results(outer, opt), 
    format_results(outer, cv), 
    format_results(outer, bs), 
  ))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 3 × 5
##   method rsmpl_lower rsmpl_mean rsmpl_upper test_mean
##   &amp;lt;chr&amp;gt;        &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
## 1 opt          0.175      0.178       0.180     0.179
## 2 cv           0.170      0.177       0.184     0.179
## 3 bs           0.179      0.182       0.186     0.179&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references csl-bib-body hanging-indent&#34;&gt;
&lt;div id=&#34;ref-Harrell2015-ws&#34; class=&#34;csl-entry&#34;&gt;
Harrell, Frank E, Jr. 2015. &lt;em&gt;Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis&lt;/em&gt;. Springer, Cham.
&lt;/div&gt;
&lt;div id=&#34;ref-Kuhn2022-al&#34; class=&#34;csl-entry&#34;&gt;
Kuhn, Max, and Julia Silge. 2022. &lt;em&gt;Tidy Modeling with r&lt;/em&gt;. &lt;a href=&#34;https://www.tmwr.org/&#34;&gt;https://www.tmwr.org/&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-Steyerberg2019-yc&#34; class=&#34;csl-entry&#34;&gt;
Steyerberg, Ewout W. 2019. &lt;em&gt;Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating&lt;/em&gt;. Springer, Cham.
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Contextual nature of AUC</title>
      <link>https://www.patrick-rockenschaub.com/_post/2021-11-26-contextual-nature-of-auc/</link>
      <pubDate>Fri, 26 Nov 2021 21:13:14 -0500</pubDate>
      <guid>https://www.patrick-rockenschaub.com/_post/2021-11-26-contextual-nature-of-auc/</guid>
      <description>


&lt;p&gt;The area under the receiver operating characteristic (AUC) is arguably among
the most frequently used measures of classification performance. Unlike other
common measures like sensitivity, specificity, or accuracy, the AUC does not
require (often arbitrary) thresholds. It also lends itself to a very simple
and intuitive interpretation: a model’s AUC equals the probability that, for any
randomly chosen pair with and without the outcome, the observation with the
outcome is assigned a higher risk score by the model than the observation
without the outcome, or&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
P(f(x_i) &amp;gt; f(x_j))
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where &lt;span class=&#34;math inline&#34;&gt;\(f(x_i)\)&lt;/span&gt; is the risk score that the model
assigned to observation &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt; based on its covariates &lt;span class=&#34;math inline&#34;&gt;\(x_i\)&lt;/span&gt;,
&lt;span class=&#34;math inline&#34;&gt;\(i \in D_{y=1}\)&lt;/span&gt; is an observation taken from among all cases &lt;span class=&#34;math inline&#34;&gt;\(y=1\)&lt;/span&gt;, and
&lt;span class=&#34;math inline&#34;&gt;\(j \in D_{y=0}\)&lt;/span&gt; is an observation taken from among all controls &lt;span class=&#34;math inline&#34;&gt;\(y=0\)&lt;/span&gt;. As such,
the AUC has a nice probabilistic meaning and can be linked back to the
well-known &lt;a href=&#34;https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Area-under-curve_(AUC)_statistic_for_ROC_curves&#34;&gt;Mann-Whitney U test&lt;/a&gt;.&lt;/p&gt;
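&lt;p&gt;This equivalence is easy to verify numerically. The sketch below (made-up risk scores, base R) computes the AUC once as the proportion of concordant case–control pairs and once from the Mann-Whitney U statistic returned by &lt;code&gt;wilcox.test()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)

# Made-up risk scores for cases (y = 1) and controls (y = 0)
f_cases    &amp;lt;- rnorm(500, mean = 1)
f_controls &amp;lt;- rnorm(500, mean = 0)

# AUC as P(f(x_i) &amp;gt; f(x_j)) over all case-control pairs
auc_pairs &amp;lt;- mean(outer(f_cases, f_controls, `&amp;gt;`))

# AUC from the Mann-Whitney U statistic
U &amp;lt;- wilcox.test(f_cases, f_controls)$statistic
auc_u &amp;lt;- unname(U) / (length(f_cases) * length(f_controls))

all.equal(auc_u, auc_pairs)  # TRUE&lt;/code&gt;&lt;/pre&gt;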
&lt;p&gt;The ubiquitous use of AUC isn’t without controversy (which, like so
many things these days, spilled over into
&lt;a href=&#34;https://twitter.com/cecilejanssens/status/1104134423673479169&#34;&gt;Twitter&lt;/a&gt;).
Regularly voiced criticisms of AUC — and the closely linked receiver operating
characteristic (ROC) curve — include its indifference to class imbalance and
extreme observations.&lt;/p&gt;
&lt;p&gt;In this post, I want to take a closer look at a feature of AUC that — at least
in my experience — is often overlooked when evaluating models in an external
test set: the dependence of the AUC on the distribution of variables in the
test set. We will see that this can lead to considerable changes in estimated
AUC even if our model is actually correct, and make it harder to disentangle
changes in performance due to model misspecification from changes in performance
due to differences between development and test sets.&lt;/p&gt;
&lt;div id=&#34;who-this-post-is-for&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Who this post is for&lt;/h1&gt;
&lt;p&gt;Here’s what I assume you to know:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You’re familiar with &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; and the &lt;a href=&#34;https://www.tidyverse.org/&#34;&gt;tidyverse&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You know a little bit about fitting and evaluating linear regression models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We will use the following &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; packages throughout this post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(MASS)
library(tidyverse)
library(tidymodels)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;generating-some-dummy-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Generating some dummy data&lt;/h1&gt;
&lt;p&gt;Let’s get started by
simulating some fake medical data for 100,000 patients. Assume we are
interested in predicting their probability of death (=outcome). We use the
patients’ sex (binary) and three continuous measurements — age, blood pressure
, and cholesterol — to do so. In this fake data set, being female has a strong
protective effect (odds ratio = exp(-2) = 0.14) and all other variables have a
moderate effect (odds ratio per standard deviation = exp(0.3) = 1.35).
Any influence by other, unmeasured factors is simulated by drawing from a
Bernoulli distribution with a probability defined by sex, age, blood pressure,
and cholesterol.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)

# Set number of rows and predictors
n &amp;lt;- 100000
p &amp;lt;- 3

# Simulate an additional binary predictor (e.g., sex)
sex &amp;lt;- rep(0:1, each = n %/% 2)

# Simulate multivariate normal predictors (e.g., age, blood pressure, 
# cholesterol)
mu &amp;lt;- rep(0, p)
Sigma &amp;lt;- 0.8 * diag(p) + 0.2 * matrix(1, p, p)
other_covars &amp;lt;- MASS::mvrnorm(n, mu, Sigma)
colnames(other_covars) &amp;lt;- c(&amp;quot;age&amp;quot;, &amp;quot;bp&amp;quot;, &amp;quot;chol&amp;quot;)

# Simulate binary outcome (e.g., death)
logistic &amp;lt;- function(x) 1 / (1 + exp(-x))
betas &amp;lt;- c(0.8, -2, 0.3, 0.3, 0.3)
lp &amp;lt;- cbind(1, sex, other_covars) %*% betas
death &amp;lt;- rbinom(n, 1, logistic(lp)) 

# Make into a data.frame and split into a training set (first half) and 
# a biased test set (rows in the second half X * beta &amp;gt; 0)
data &amp;lt;- as_tibble(other_covars)
data$sex &amp;lt;- factor(sex, 0:1, c(&amp;quot;male&amp;quot;, &amp;quot;female&amp;quot;))
data$pred_risk &amp;lt;- as.vector(logistic(lp))
data$death &amp;lt;- factor(death, 0:1, c(&amp;quot;no&amp;quot;, &amp;quot;yes&amp;quot;))
data$id &amp;lt;- 1:nrow(data)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;estimating-predictive-performance&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Estimating predictive performance&lt;/h1&gt;
&lt;p&gt;Now that we have some data, we can evaluate how well our model is able to
predict each patient’s risk of death. To make our lives as simple as possible,
we assume that we were able to divine the true effects of each of our variables,
i.e., we know that the data is generated via a logistic regression model with
&lt;span class=&#34;math inline&#34;&gt;\(\beta = [0.8, -2, 0.3, 0.3, 0.3]\)&lt;/span&gt; and there is no uncertainty around those
estimates. Under these assumptions, our model would be able to achieve the
following AUC in the simulated data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;auc &amp;lt;- function(data) {
  yardstick::roc_auc(data, death, pred_risk, event_level = &amp;quot;second&amp;quot;)
}

data %&amp;gt;% auc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 roc_auc binary         0.785&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notably, this is a summary measure that depends on the entire data set &lt;span class=&#34;math inline&#34;&gt;\(D\)&lt;/span&gt; and
&lt;em&gt;cannot&lt;/em&gt; be calculated for an individual patient alone. By definition, it
requires at least one patient with and one without the outcome. This has important
implications for interpreting the AUC. Let’s see what happens if we evaluate
our (true) model in men and women separately.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;men &amp;lt;- data %&amp;gt;% filter(sex == &amp;quot;male&amp;quot;)
men %&amp;gt;% auc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 roc_auc binary         0.662&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;women &amp;lt;- data %&amp;gt;% filter(sex == &amp;quot;female&amp;quot;)
women %&amp;gt;% auc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 roc_auc binary         0.665&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In each subset, the AUC dropped from 0.78 to around 0.66. This perhaps isn’t
too surprising, given that sex was a strong predictor of death.
However, remember that the model coefficients and hence the predicted risk for
each individual patient — i.e., how “good” that prediction is for that patient
— remain unchanged. We merely changed the set of patients that we included in
the evaluation. Although this might be obvious, I believe this is an important
point to highlight.&lt;/p&gt;
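&lt;p&gt;To make this concrete, we can check directly that subsetting leaves each
patient’s prediction untouched; only the set of patients they are compared
against changes (a quick sanity check on the &lt;code&gt;data&lt;/code&gt; tibble from
above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The predicted risk of each man is identical whether we look it up in the
# full data or in the male-only subset
men_in_full &amp;lt;- data$pred_risk[data$sex == &amp;quot;male&amp;quot;]
men_alone &amp;lt;- data %&amp;gt;% filter(sex == &amp;quot;male&amp;quot;) %&amp;gt;% pull(pred_risk)
all.equal(men_in_full, men_alone)&lt;/code&gt;&lt;/pre&gt;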
&lt;p&gt;Looking at the distribution of risks in the total data and by sex might provide
some further intuition for this finding. In the total population, the predicted
risks of both those who did and those who did not die are clearly bimodal,
clustering around the average risks for men and women (Figure &lt;a href=&#34;#fig:dist-of-risks-overall&#34;&gt;1&lt;/a&gt;). Even so, there
is good separation between them. The majority of patients who died (red curve)
had a predicted risk &amp;gt;50%. Vice versa, the majority of patients who remained
alive (green curve) had a risk &amp;lt;50%. Looking at the risks of men and women
separately, however, we can see that most men had a high and most women a low
predicted risk (Figure &lt;a href=&#34;#fig:dist-of-risks-by-sex&#34;&gt;2&lt;/a&gt;). There is much less
separation between the red and green curves, as any differences within each sex
(for example, among the men) are entirely due to the moderate effects of our
simulated continuous covariates.&lt;/p&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:dist-of-risks-overall&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://www.patrick-rockenschaub.com/_post/2021-11-26-contextual-nature-of-auc/index.en_files/figure-html/dist-of-risks-overall-1.png&#34; alt=&#34;Distribution of predicted risk of death for patients that ultimately did and did not die.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Distribution of predicted risk of death for patients that ultimately did and did not die.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:dist-of-risks-by-sex&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://www.patrick-rockenschaub.com/_post/2021-11-26-contextual-nature-of-auc/index.en_files/figure-html/dist-of-risks-by-sex-1.png&#34; alt=&#34;Distribution of predicted risk of death by sex.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2: Distribution of predicted risk of death by sex.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;So far, we have looked at two extremes: the entire data set, in which sex was
perfectly balanced, and two completely separated subsets with only men or only
women. Let’s see what would happen if we gradually reduce the number of men in
our evaluation population. We can see that estimated performance drops as we
remove more and more men from the data set
(Figure &lt;a href=&#34;#fig:plot-auc-under-shift&#34;&gt;3&lt;/a&gt;),
particularly at the right-hand side of the graph, where only a few men remain
and the sex imbalance is greatest.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;auc_biased &amp;lt;- function(data, p_men_remove) {
  n_men &amp;lt;- sum(data$sex == &amp;quot;male&amp;quot;)
  n_exclude &amp;lt;- floor(n_men * p_men_remove)
  
  data %&amp;gt;% 
    slice(-(1:n_exclude)) %&amp;gt;% 
    auc()
}

p_men_remove &amp;lt;- seq(0, 1, by = 0.01)
auc_shift &amp;lt;- p_men_remove %&amp;gt;% 
  map_dfr(auc_biased, data = data) %&amp;gt;% 
  mutate(p_men_remove = p_men_remove)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:plot-auc-under-shift&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://www.patrick-rockenschaub.com/_post/2021-11-26-contextual-nature-of-auc/index.en_files/figure-html/plot-auc-under-shift-1.png&#34; alt=&#34;Estimated AUC by proportion of men removed from the full data set.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 3: Estimated AUC by proportion of men removed from the full data set.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;difficulty-to-distinguish-model-misspecification&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The difficulty of distinguishing model misspecification&lt;/h1&gt;
&lt;p&gt;What we have seen so far is particularly problematic in the interpretation of
external model validation, i.e., when we test a model that was developed in one
set of patients (and potentially overfit to that population) in another patient
population in order to estimate the model’s likely future performance. This is
because in most real world cases, it isn’t quite as straightforward to quantify
the difference between the development population and the evaluation cohort.
Since we also usually don’t know the true model parameters (or even the true
model class), it is difficult to disentangle the effects of population makeup
from the effects of model misspecification. Let’s assume for example that —
unlike earlier — we don’t know the exact model parameters and instead needed
to estimate them from a prior development data set. As a result, we obtained
&lt;span class=&#34;math inline&#34;&gt;\(\beta_{biased} = [0.4, -1, 0.3, -0.2, 0.5]\)&lt;/span&gt;, which clearly differs from the true
&lt;span class=&#34;math inline&#34;&gt;\(\beta\)&lt;/span&gt; used to generate the data. Let’s now recalculate the AUC in a data set where
there is some imbalance between men and women.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;biased_betas &amp;lt;- c(0.4, -1, 0.3, -0.2, 0.5)
alt_risk &amp;lt;- logistic(cbind(1, sex, other_covars) %*% biased_betas) %&amp;gt;% 
  as.vector()

data %&amp;gt;% 
  mutate(pred_risk = alt_risk) %&amp;gt;% 
  auc_biased(0.5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 roc_auc binary         0.722&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, the estimated model performance dropped. However, how much of the drop
was due to our biased estimate &lt;span class=&#34;math inline&#34;&gt;\(\beta_{biased} \neq \beta\)&lt;/span&gt; and how much of it
was due to the fact that our evaluation data set contained fewer men? This, in
general, is not straightforward to answer.&lt;/p&gt;
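&lt;p&gt;In a simulation like ours, where the true model is known, one way to get a
handle on this is to evaluate the &lt;em&gt;true&lt;/em&gt; predictions on the very same
imbalanced subset. A sketch of this decomposition (only possible here because
we know the true risks, which we rarely do in practice):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# AUC of the true model on the same imbalanced subset
data %&amp;gt;% auc_biased(0.5)

# The gap between this value and the 0.722 obtained with the biased betas is
# attributable to the coefficients; the remaining drop relative to the full
# data reflects the changed population makeup&lt;/code&gt;&lt;/pre&gt;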
&lt;/div&gt;
&lt;div id=&#34;takeaway&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Takeaway&lt;/h1&gt;
&lt;p&gt;If there is one takeaway from this post it is that external validations of
predictive models mustn’t solely report on differences in AUC but also need to
comment on the comparability of development and test sets used. Such a
discussion is warranted irrespective of whether the performance remained the
same, dropped, or even increased in the test set. Only by discussing, and
ideally quantifying, differences between the data sets can the reader
fully assess the evidence for retained model performance and judge its likely
value in the future.&lt;/p&gt;
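&lt;p&gt;As a minimal sketch of what such quantification could look like, assuming a
hypothetical development set &lt;code&gt;dev_data&lt;/code&gt; with the same columns as our
test data, one could at least report how a key predictor is distributed in the
two cohorts:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Compare the sex distribution across cohorts (dev_data is hypothetical)
compare_sex &amp;lt;- function(dev_data, test_data) {
  tibble(
    cohort = c(&amp;quot;development&amp;quot;, &amp;quot;test&amp;quot;),
    p_male = c(mean(dev_data$sex == &amp;quot;male&amp;quot;), mean(test_data$sex == &amp;quot;male&amp;quot;))
  )
}&lt;/code&gt;&lt;/pre&gt;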
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
