<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Posts | Patrick Rockenschaub</title>
    <link>https://www.patrick-rockenschaub.com/_post/</link>
      <atom:link href="https://www.patrick-rockenschaub.com/_post/index.xml" rel="self" type="application/rss+xml" />
    <description>Posts</description>
    <generator>Wowchemy (https://wowchemy.com)</generator><language>en-us</language><lastBuildDate>Fri, 27 May 2022 12:00:00 +0000</lastBuildDate>
    <image>
      <url>https://www.patrick-rockenschaub.com/media/icon_hu681d357afb5ae78ae86392db2273c6c5_77265_512x512_fill_lanczos_center_3.png</url>
      <title>Posts</title>
      <link>https://www.patrick-rockenschaub.com/_post/</link>
    </image>
    
    <item>
      <title>Graphical analysis of model stability</title>
      <link>https://www.patrick-rockenschaub.com/_post/2022-05-27-stable-prediction/</link>
      <pubDate>Fri, 27 May 2022 12:00:00 +0000</pubDate>
      <guid>https://www.patrick-rockenschaub.com/_post/2022-05-27-stable-prediction/</guid>
      <description>


&lt;p&gt;Predicting likely patient outcomes with machine learning has been a hot topic for several years now. The increasing collection of routine medical data has enabled the modelling of a wide range of different outcomes across various medical specialties. This interest in data-driven diagnosis and prognosis has only further burgeoned with the arrival of the SARS-CoV-2 pandemic. Countless research groups across countries and institutions have published models that use routine data to predict everything from &lt;a href=&#34;https://bmcmedicine.biomedcentral.com/articles/10.1186/s12916-020-01893-3&#34;&gt;COVID-19 related deaths&lt;/a&gt; or escalation of care such as &lt;a href=&#34;https://www.nature.com/articles/s41598-021-83784-y&#34;&gt;admission to intensive care units or initiation of invasive ventilation&lt;/a&gt; to simply the &lt;a href=&#34;https://erj.ersjournals.com/content/56/2/2000775.short&#34;&gt;presence or absence of the virus&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Unfortunately, if there is one thing my PhD’s taught me repeatedly it’s that deriving reliable models from routine medical data is challenging (as this &lt;a href=&#34;https://www.bmj.com/content/369/bmj.m1328&#34;&gt;systematic review of 232 COVID-19 prediction models&lt;/a&gt; can attest). There are many reasons why a given prediction model may not be reliable, but the one I focus on in my own research — and which we will therefore discuss in more detail in this blog post — is model stability across environments. Here, environments can mean many different things, but in the case of clinical prediction models the environments of interest are often different healthcare providers (e.g., hospitals), with each provider representing a single environment in which we may want to use our model. Ideally, we would like our model to work well across many healthcare providers. If that’s the case, we can use a single model across all providers, and the model may be considered “stable”, “generalisable”, or “transferable”. If our model instead works well at only some providers but not at others, we may need to (re-)train it for each provider at which we want to use it. This not only causes additional overhead but also increases the risk of overfitting to any single provider and raises questions about the validation of each local model. Stability is therefore a desirable property of predictive models. In the remainder of this post, we will discuss the necessary conditions for stability and how we can identify likely instability in our prediction models.&lt;/p&gt;
&lt;div id=&#34;who-this-post-is-for&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Who this post is for&lt;/h1&gt;
&lt;p&gt;Here’s what I assume you to know:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You’re familiar with &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; and the &lt;a href=&#34;https://www.tidyverse.org/&#34;&gt;tidyverse&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You know a little bit about fitting and evaluating linear regression models.&lt;/li&gt;
&lt;li&gt;You have a working knowledge of causal inference and Directed Acyclic Graphs (DAGs). We will use DAGs to represent assumptions about our data and graphically reason about (in)stability through the backdoor criterion. If these concepts are new to you, first have a look &lt;a href=&#34;https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/&#34;&gt;here&lt;/a&gt; and &lt;a href=&#34;https://www.andrewheiss.com/research/chapters/heiss-causal-inference-2021/10-causal-inference.pdf&#34;&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We will use the following &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; packages throughout this post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(ggdag)
library(ggthemes)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;model-stability&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Model stability&lt;/h1&gt;
&lt;p&gt;In the introduction, I considered models stable if they worked comparably across multiple environments. While intuitive, this definition is of course very vague. Let’s spend a little more time on defining what exactly (in mathematical terms) we mean by stability. The definition here closely follows that of &lt;a href=&#34;https://arxiv.org/abs/1812.04597&#34;&gt;Subbaswamy and Saria (2019)&lt;/a&gt;, who recently introduced a (in my opinion) very neat framework for thinking and reasoning about model stability using DAGs.&lt;/p&gt;
&lt;p&gt;Take for example the relatively simple DAG introduced in &lt;a href=&#34;https://arxiv.org/abs/1808.03253&#34;&gt;Subbaswamy and Saria (2018)&lt;/a&gt; and displayed in Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt;. Let’s say we want to predict T, which may represent a clinical outcome of interest such as the onset of sepsis. In our dataset, we observe two variables Y and A that we could use to predict T. The arrows between T, Y, and A denote causal relationships between these variables, i.e., both T and A causally affect the value of Y. The absence of an arrow between T and A means that these variables do not directly affect each other. However, there is a final variable D that affects both the value of T and the value of A. We display D in grey because it is not observed in our dataset (e.g., because it is not routinely recorded by the clinician). If you have taken courses in statistics or epidemiology, you will know D as a confounding variable.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;coords &amp;lt;- list(x = c(T = -1, A = 1, D = 0, Y = 0, S = 1),
                y = c(T = 0, A = 0, D = 1, Y = -1, S = 1))

dag &amp;lt;- dagify(
  T ~ D,
  A ~ D + S,
  Y ~ T + A,
  coords = coords
)&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ggplot(dag, aes(x, y, xend = xend, yend = yend)) + 
  geom_dag_edges() + 
  # prediction target:
  geom_dag_point(data = ~ filter(., name == &amp;quot;T&amp;quot;), colour = &amp;quot;darkorange&amp;quot;) +     
  # observed variables:
  geom_dag_point(data = ~ filter(., name %in% c(&amp;quot;Y&amp;quot;, &amp;quot;A&amp;quot;)), colour = &amp;quot;darkblue&amp;quot;) + 
  # unobserved variables:
  geom_dag_point(data = ~ filter(., name == &amp;quot;D&amp;quot;), colour = &amp;quot;grey&amp;quot;) + 
  # selection variable indicating a distribution that changes across environments:
  geom_dag_point(data = ~ filter(., name == &amp;quot;S&amp;quot;), shape = 15) + 
  geom_dag_text() + 
  theme_dag()&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:example-dag&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://www.patrick-rockenschaub.com/_post/2022-05-27-stable-prediction/index.en_files/figure-html/example-dag-1.png&#34; alt=&#34;Directed acyclic graph specifying the causal relationships between a prediction target T, observed predictors A and Y, and an unobserved confounder D. The square node S represents an auxiliary selection variable that indicates variables that are mutable, i.e., change across different environments.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Directed acyclic graph specifying the causal relationships between a prediction target T, observed predictors A and Y, and an unobserved confounder D. The square node S represents an auxiliary selection variable that indicates variables that are mutable, i.e., change across different environments.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;
 
&lt;/p&gt;
&lt;p&gt;So far this is a pretty standard DAG. However, there is an odd square node in this graph that we haven’t mentioned yet: the selection variable S. &lt;a href=&#34;https://arxiv.org/abs/1812.04597&#34;&gt;Subbaswamy and Saria (2019)&lt;/a&gt; suggest using the auxiliary variable S to point to any variables in our graph that may vary arbitrarily across environments. Variables referenced by S are also called &lt;em&gt;mutable&lt;/em&gt; variables. By including an arrow from S to A in Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt;, we therefore claim that A is mutable and cannot be relied on in any environment that isn’t the training environment. Note that we do not make any claim as to why this variable is mutable; we merely state that its distribution may shift across environments.&lt;/p&gt;
&lt;p&gt;Once we have defined a DAG and all its mutable variables, we can graphically check whether our predictor is unstable by looking for any active unstable paths. &lt;a href=&#34;https://arxiv.org/abs/1812.04597&#34;&gt;Subbaswamy and Saria (2019)&lt;/a&gt; show that &lt;em&gt;the non-existence of active unstable paths is a graphical criterion for determining […] stability&lt;/em&gt;. Easy, right? At least once we know what they mean by an active unstable path. Let’s look at it term by term:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;path&lt;/em&gt;: a path is simply a sequence of nodes in which each consecutive pair of nodes is connected by an edge. Note that the direction of the edge (i.e., which way the arrow points) does not matter here. There are many different paths in Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt; such as &lt;code&gt;D -&amp;gt; T -&amp;gt; Y&lt;/code&gt; or &lt;code&gt;T &amp;lt;- D -&amp;gt; A &amp;lt;- S&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;active&lt;/em&gt;: whether a path is active or closed can be determined using the standard rules of d-separation (see chapter 6 of &lt;a href=&#34;https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/&#34;&gt;Hernán and Robins (2020)&lt;/a&gt; for a refresher on d-separation). Roughly speaking, a path is closed if it either a) contains a non-collider variable that is conditioned on by including it in the model or b) contains a collider that is &lt;strong&gt;not&lt;/strong&gt; conditioned on; otherwise, the path is active. For example, &lt;code&gt;T &amp;lt;- D -&amp;gt; A &amp;lt;- S&lt;/code&gt; is closed due to the unconditioned collider &lt;code&gt;-&amp;gt; A &amp;lt;-&lt;/code&gt; but becomes active if A is included in the model. It can be closed again by also including D in the model (if it were observed).&lt;/li&gt;
&lt;li&gt;&lt;em&gt;unstable&lt;/em&gt;: a path is unstable if it includes a selection variable S.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you have worked with DAGs before, you probably already knew about active paths. The only new thing you need to learn is to only look for those active paths that are unstable, which is easy enough to verify. You don’t even need to look at all paths, only at those that include S! So let’s do it for our example in Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt;.&lt;/p&gt;
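&lt;p&gt;We can also let the computer enumerate these paths for us. The &lt;a href=&#34;https://cran.r-project.org/web/packages/dagitty/&#34;&gt;&lt;em&gt;dagitty&lt;/em&gt;&lt;/a&gt; package (which &lt;em&gt;ggdag&lt;/em&gt; builds on) lists all paths between two nodes together with their open/closed status. The snippet below is just a sketch to illustrate the idea; it rebuilds the DAG from Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt; with S included as an ordinary node:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(dagitty)

# Same structure as Figure 1, with the selection variable S as a regular node
g &amp;lt;- dagitty(&amp;quot;dag { D -&amp;gt; T; D -&amp;gt; A; S -&amp;gt; A; T -&amp;gt; Y; A -&amp;gt; Y }&amp;quot;)

# Without conditioning on anything, both paths between T and S are closed
paths(g, from = &amp;quot;T&amp;quot;, to = &amp;quot;S&amp;quot;)

# Conditioning on A and Y opens the collider at A, activating T &amp;lt;- D -&amp;gt; A &amp;lt;- S
paths(g, from = &amp;quot;T&amp;quot;, to = &amp;quot;S&amp;quot;, Z = c(&amp;quot;A&amp;quot;, &amp;quot;Y&amp;quot;))&lt;/code&gt;&lt;/pre&gt;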
&lt;/div&gt;
&lt;div id=&#34;applying-the-theory-to-a-toy-example&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Applying the theory to a toy example&lt;/h1&gt;
&lt;p&gt;Given the DAG in Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt;, we could use different sets of variables to predict our target variable T. For example, we could a) use the observed variables A and Y, b) use Y alone, c) explore the possibility of using all variables by collecting additional data on D, or d) use no predictors (i.e., always predict the average). Let’s look at those options in turn and determine whether they would result in a stable model.&lt;/p&gt;
&lt;div id=&#34;use-all-observed-variables-as-predictors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Use all observed variables as predictors&lt;/h2&gt;
&lt;p&gt;A common practice in prediction modelling is to include as many variables as possible (and available). In Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt;, this would mean that we’d use A and Y to estimate the conditional probability &lt;span class=&#34;math inline&#34;&gt;\(P(T~|~A, Y)\)&lt;/span&gt;. Would such an estimate be stable? Let’s check for active unstable paths. There are two paths that include S: &lt;code&gt;T -&amp;gt; Y &amp;lt;- A &amp;lt;- S&lt;/code&gt; and &lt;code&gt;T &amp;lt;- D -&amp;gt; A &amp;lt;- S&lt;/code&gt;. In the first, the collider at &lt;code&gt;-&amp;gt; Y &amp;lt;-&lt;/code&gt; is opened by including Y in the model, but the path is blocked by also conditioning on A, making it closed. The second path also contains an open collider, namely at &lt;code&gt;-&amp;gt; A &amp;lt;-&lt;/code&gt;, and since we do not observe D, nothing blocks it: this path is active and unstable.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;use-only-non-mutable-observed-variables-as-predictors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Use only non-mutable observed variables as predictors&lt;/h2&gt;
&lt;p&gt;In recent years, researchers have become mindful of the fact that some relationships may be unreliable. For example, it is not unusual to see &lt;a href=&#34;https://arxiv.org/abs/2107.05230&#34;&gt;models that purposefully ignore information on medication to avoid spurious relationships&lt;/a&gt;. Following a similar line of argument, it could be tempting to remove A (which is mutable) from the model and only predict &lt;span class=&#34;math inline&#34;&gt;\(P(T~|~Y)\)&lt;/span&gt;. After all, if we are not relying on mutable variables we may be safe from instability. Unfortunately, this isn’t an option either (at least not in this particular example). If we remove A from the model, the previously blocked path &lt;code&gt;T -&amp;gt; Y &amp;lt;- A &amp;lt;- S&lt;/code&gt; is now open and we are again left with an unstable model.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;collect-additional-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Collect additional data&lt;/h2&gt;
&lt;p&gt;By now, you might have thrown your hands up in despair. Neither option using the observed variables led to a stable model (note that adjusting only for A also does not solve the issue because there is still an open path via D). In our particular example, there is another possibility for a stable model if we have the time and resources to measure the previously unobserved variable D, but of course we only want to do so if it leads to a stable predictor. So is &lt;span class=&#34;math inline&#34;&gt;\(P(T~|~A, Y, D)\)&lt;/span&gt; stable? It turns out it is, as both &lt;code&gt;T -&amp;gt; Y &amp;lt;- A &amp;lt;- S&lt;/code&gt; (by A) and &lt;code&gt;T &amp;lt;- D -&amp;gt; A &amp;lt;- S&lt;/code&gt; (by D) are blocked and our model will therefore be stable across environments.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;use-no-predictors&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Use no predictors&lt;/h2&gt;
&lt;p&gt;What else can we do if we can’t (or don’t want to) collect data on D? One final option is to admit defeat and simply make a prediction based on the average &lt;span class=&#34;math inline&#34;&gt;\(P(T)\)&lt;/span&gt;. This estimate is stable but obviously isn’t a very good predictor. Yet what else is there left to do? Thankfully, not all is lost: there are other smart things we could do to obtain a stable predictor without the need for additional data collection. I will talk about some of these options in my next posts.&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;testing-the-theory&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Testing the theory&lt;/h1&gt;
&lt;p&gt;Up to now, we have used theory to determine whether a particular model would result in a stable predictor. In this final section, we simulate data for Figure &lt;a href=&#34;#fig:example-dag&#34;&gt;1&lt;/a&gt; to test our conclusions and confirm the (lack of) stability of all models considered above. Following the example in &lt;a href=&#34;https://arxiv.org/abs/1808.03253&#34;&gt;Subbaswamy and Saria (2018)&lt;/a&gt;, we use simple linear relationships and Gaussian noise for all variables, giving the following structural equations:&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
\begin{aligned}
D &amp;amp;\sim N(0, \sigma^2) \\
T &amp;amp;\sim N(\beta_1D, \sigma^2) \\
A &amp;amp;\sim N(\beta_2^eD, \sigma^2) \\
Y &amp;amp;\sim N(\beta_3T + \beta_4A, \sigma^2)
\end{aligned}
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;You might have noticed the superscript &lt;span class=&#34;math inline&#34;&gt;\(e\)&lt;/span&gt; in &lt;span class=&#34;math inline&#34;&gt;\(\beta^e_2\)&lt;/span&gt;. We use this superscript to indicate that the coefficient depends on the environment &lt;span class=&#34;math inline&#34;&gt;\(e \in \mathcal{E}\)&lt;/span&gt; where &lt;span class=&#34;math inline&#34;&gt;\(\mathcal{E}\)&lt;/span&gt; is the set of all possible environments. Since the value of the coefficient depends on the environment, A is mutable (note that we could have chosen other ways to make A mutable, for example by including another unobserved variable that influences A and changes across environments). All other coefficients are constant across environments, i.e., &lt;span class=&#34;math inline&#34;&gt;\(\beta_i^e = \beta_i\)&lt;/span&gt; for &lt;span class=&#34;math inline&#34;&gt;\(i \in \{1, 3, 4 \}\)&lt;/span&gt;. Finally, we set a uniform noise &lt;span class=&#34;math inline&#34;&gt;\(\sigma^2=0.1\)&lt;/span&gt; for all variables. We combine this into a function that draws a sample of size &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;simulate_data &amp;lt;- function(n, beta, dev = 0) {
  noise &amp;lt;- sqrt(0.1) # rnorm is parameterised as sigma instead of sigma^2
  
  D &amp;lt;- rnorm(n, sd = noise)
  T &amp;lt;- rnorm(n, beta[1] * D, sd = noise)
  A &amp;lt;- rnorm(n, (beta[2] + dev) * D, sd = noise)
  Y &amp;lt;- rnorm(n, beta[3] * T + beta[4] * A, sd = noise)
  
  tibble(D, T, A, Y)
}

set.seed(42)
n &amp;lt;- 30000

# Choose coefficients
beta &amp;lt;- vector(&amp;quot;numeric&amp;quot;, length = 4)
beta[2] &amp;lt;- 2                  # we manually set beta_2 and vary it by env
beta[c(1, 3, 4)] &amp;lt;- rnorm(3)  # we randomly draw values for the other betas

cat(&amp;quot;Betas: &amp;quot;, beta)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## Betas:  1.370958 2 -0.5646982 0.3631284&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We will define model performance in terms of the mean squared error (MSE) &lt;span class=&#34;math inline&#34;&gt;\(n^{-1} \sum_{i=1}^n (t_i - \hat t_i)^2\)&lt;/span&gt;, where &lt;span class=&#34;math inline&#34;&gt;\(t_i\)&lt;/span&gt; is the true value of T for patient &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(\hat t_i\)&lt;/span&gt; is the estimate given by our model. The function &lt;code&gt;fit_and_eval()&lt;/code&gt; fits a linear regression model to the training data and returns its MSE on some test data. By varying &lt;span class=&#34;math inline&#34;&gt;\(\beta^{e}_2\)&lt;/span&gt; in the test environment, we can test how our models perform when the coefficient deviates more and more from the value seen in our training environment.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mse &amp;lt;- function(y, y_hat){
  mean((y - y_hat) ^ 2)
}

fit_and_eval &amp;lt;- function(formula, train, test) {
  fit &amp;lt;- lm(formula, data = train)
  pred &amp;lt;- predict(fit, test)
  mse(test$T, pred)
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Now that we’ve defined everything we need to run our simulation, let’s see how our models fare. Since we are only running linear regressions that are easy to compute, we can set the number of samples to a high value (N=30,000) to get stable results. The performance of our four models across the range of &lt;span class=&#34;math inline&#34;&gt;\(\beta_2\)&lt;/span&gt; can be seen in Figure &lt;a href=&#34;#fig:run-simulations&#34;&gt;2&lt;/a&gt;. Our simulations appear to confirm our theoretical analysis. The full model &lt;code&gt;M3&lt;/code&gt; (blue) retains a stable performance across all considered &lt;span class=&#34;math inline&#34;&gt;\(\beta_2\)&lt;/span&gt;’s. &lt;code&gt;M1&lt;/code&gt; and &lt;code&gt;M2&lt;/code&gt; on the other hand have U-shaped performance curves that depend on the value of &lt;span class=&#34;math inline&#34;&gt;\(\beta_2\)&lt;/span&gt; in the test environment. When &lt;span class=&#34;math inline&#34;&gt;\(\beta_2\)&lt;/span&gt; is close to the value in the training environment (vertical grey line), &lt;code&gt;M1&lt;/code&gt; achieves a performance that is almost as good as that of the full model &lt;code&gt;M3&lt;/code&gt;. However, as the coefficient deviates from its value in the training environment, model performance quickly deteriorates and even becomes worse than simply using the global average (&lt;code&gt;M4&lt;/code&gt; green line).&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Training environment (always the same)
train &amp;lt;- simulate_data(n, beta)

# Test environments (beta_2 deviates from training env along a grid)
grid_len &amp;lt;- 100L
all_obs &amp;lt;- only_y &amp;lt;- add_d &amp;lt;- no_pred &amp;lt;- vector(&amp;quot;numeric&amp;quot;, grid_len)
devs &amp;lt;- seq(-12, 4, length.out = grid_len)

for (i in 1:grid_len) {
  # Draw test environment
  test &amp;lt;- simulate_data(n, beta, dev = devs[i])
  
  # Fit each model
  all_obs[i] &amp;lt;- fit_and_eval(T ~ Y + A    , train, test)
  only_y[i]  &amp;lt;- fit_and_eval(T ~ Y        , train, test)
  add_d[i]   &amp;lt;- fit_and_eval(T ~ Y + A + D, train, test)
  no_pred[i] &amp;lt;- fit_and_eval(T ~ 1        , train, test)
}

results &amp;lt;- tibble(devs, all_obs, only_y, add_d, no_pred)


ggplot(results, aes(x = beta[2] + devs)) + 
  geom_vline(xintercept = beta[2], colour = &amp;quot;darkgrey&amp;quot;, size = 1) + 
  geom_point(aes(y = all_obs, colour = &amp;quot;M1: all observed variables&amp;quot;), size = 2) + 
  geom_point(aes(y = only_y, colour = &amp;quot;M2: non-mutable variables&amp;quot;), size = 2) + 
  geom_point(aes(y = add_d, colour = &amp;quot;M3: additional data&amp;quot;), size = 2) + 
  geom_point(aes(y = no_pred, colour = &amp;quot;M4: no predictor&amp;quot;), size = 2) + 
  scale_colour_colorblind() + 
  labs(
    x = expression(beta[2]),
    y = &amp;quot;Mean squared error\n&amp;quot;
  ) +
  coord_cartesian(ylim = c(0, 0.5), expand = FALSE) + 
  theme_bw() + 
  theme(
    panel.grid = element_blank()
  )&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:run-simulations&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://www.patrick-rockenschaub.com/_post/2022-05-27-stable-prediction/index.en_files/figure-html/run-simulations-1.png&#34; alt=&#34;Mean squared error of all models across a range of test environments that differ in the coefficient for the relationship D -&amp;gt; A. The vertical grey line indicates the training environment.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2: Mean squared error of all models across a range of test environments that differ in the coefficient for the relationship D -&amp;gt; A. The vertical grey line indicates the training environment.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;I have to admit I was surprised by the results for &lt;code&gt;M2&lt;/code&gt;, which I expected to be similar to but slightly worse than &lt;code&gt;M1&lt;/code&gt;. Instead of a minimum close to the training environment, however, &lt;code&gt;M2&lt;/code&gt; achieved its best performance far away at &lt;span class=&#34;math inline&#34;&gt;\(\beta_2 \approx -7.5\)&lt;/span&gt;, whereas the performance in the training environment was barely better than using no predictors. Its &lt;span class=&#34;math inline&#34;&gt;\(R^2\)&lt;/span&gt; was only 0.093 compared to 0.672 for &lt;code&gt;M1&lt;/code&gt; and 0.741 for &lt;code&gt;M3&lt;/code&gt;. The reason for this seems to be a very low variance of Y given the particular set of &lt;span class=&#34;math inline&#34;&gt;\(\beta\)&lt;/span&gt;’s chosen. As the value of &lt;span class=&#34;math inline&#34;&gt;\(\beta_2\)&lt;/span&gt; decreases, the variance of Y and its covariance with D and T change such that Y becomes a better predictor of T (and even crowds out the “direct” effect of D due to the active backdoor path &lt;code&gt;D -&amp;gt; A -&amp;gt; Y &amp;lt;- T&lt;/code&gt;).&lt;/p&gt;
&lt;p&gt;Finally, note that the scale of the curves in Figure &lt;a href=&#34;#fig:run-simulations&#34;&gt;2&lt;/a&gt; depends on the values chosen for &lt;span class=&#34;math inline&#34;&gt;\(\beta_1\)&lt;/span&gt;, &lt;span class=&#34;math inline&#34;&gt;\(\beta_2\)&lt;/span&gt;, and &lt;span class=&#34;math inline&#34;&gt;\(\beta_4\)&lt;/span&gt;. The shape of the curves and the overall conclusions remain the same, though.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;note-on-patient-mix&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Note on patient mix&lt;/h1&gt;
&lt;p&gt;So far I’ve acted as if incorrectly estimated model coefficients are the only reason for changes in performance across environments. However, if you’ve ever performed (or read about) external validation of clinical prediction models, you may by now be shouting at your screen that there are other reasons for performance changes. In fact, even if our model is specified perfectly (i.e., all coefficients are estimated at their true causal values), it may not always be possible to achieve the same performance across environments. I discussed in a &lt;a href=&#34;https://www.patrick-rockenschaub.com/posts/2021/11/contextual-nature-of-auc/&#34;&gt;previous post&lt;/a&gt; how the AUC may change depending on the make-up of your target population even if we know the exact model that generated the data. The same general principle is true for the MSE. Some patients may simply be harder to predict than others, and if your population contains more of one type of patient than another, the average performance of your model may change (though model performance remains the same for each individual patient, assuming your coefficients are correct!). The make-up of your population is often referred to as patient mix. In our case, patient mix remained stable across environments (we did not change &lt;code&gt;D -&amp;gt; T&lt;/code&gt;). I chose this setup to focus on the effects of a mutable variable when estimating model parameters. However, thinking hard about patient mix becomes indispensable when transferring a model to new populations. If you want to read up further on this topic, I can recommend chapter 19 of &lt;a href=&#34;https://link.springer.com/book/10.1007/978-0-387-77244-8&#34;&gt;Ewout Steyerberg’s book on Clinical Prediction Models&lt;/a&gt;, which includes some general advice on how to distinguish changes in patient mix from issues of model misspecification.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;acknowledgements&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;The structure of this post (and likely all future posts) was inspired by the great posts on &lt;a href=&#34;https://www.andrewheiss.com/&#34;&gt;Andrew Heiss’ blog&lt;/a&gt; and in particular his posts on &lt;a href=&#34;https://www.andrewheiss.com/blog/2021/12/18/bayesian-propensity-scores-weights/&#34;&gt;inverse probability weighting in Bayesian models&lt;/a&gt; and the &lt;a href=&#34;https://www.andrewheiss.com/blog/2021/09/07/do-calculus-backdoors/&#34;&gt;derivation of the three rules of do-calculus&lt;/a&gt;. Andrew is an assistant professor in the Department of Public Management and Policy at Georgia State University, teaching on causal inference, statistics, and data science. His posts on these topics have been a joy to read and I am striving to make mine as effortlessly educational.&lt;/p&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Implementing the optimism-adjusted bootstrap with tidymodels</title>
      <link>https://www.patrick-rockenschaub.com/_post/2022-01-26-tidymodels-optimism-bootstrap/</link>
      <pubDate>Wed, 26 Jan 2022 12:00:00 +0000</pubDate>
      <guid>https://www.patrick-rockenschaub.com/_post/2022-01-26-tidymodels-optimism-bootstrap/</guid>
      <description>


&lt;p&gt;It is well known that prediction models have a tendency to overfit to the training data, especially if we only have a limited amount of training data. While performance of such overfitted models appears high when evaluated on the data available during training, their performance on new, previously unseen data is often considerably worse. Although it may be tempting to the analyst to choose a model with high training performance, it is the model’s performance in future data that we are really interested in.&lt;/p&gt;
&lt;p&gt;Several resampling methods have been proposed to account for this issue. The most widely used techniques fall into two categories: cross-validation and bootstrapping. The idea underlying these techniques is similar. By repeating the model fitting multiple times on different subsets of the training data, we may get a better understanding of the magnitude of overfitting and can account for it in our model building and evaluation. Without going into too much detail, cross-validation separates the data into &lt;span class=&#34;math inline&#34;&gt;\(k\)&lt;/span&gt; mutually exclusive folds and always holds one back as a “hidden” test set. Note that the sample size available to the model during each training run necessarily decreases to &lt;span class=&#34;math inline&#34;&gt;\(\frac{k-1}{k}n\)&lt;/span&gt;. Bootstrapping, on the other hand, resamples (with replacement) a data set of the same size &lt;span class=&#34;math inline&#34;&gt;\(n\)&lt;/span&gt; as the original training set and then — depending on the exact method — uses a weighted combination of the sampled and excluded observations.&lt;/p&gt;
&lt;p&gt;Whereas the machine learning community almost exclusively uses cross-validation for model validation, bootstrap-based methods may be more commonly seen in biomedical sciences. One reason for this popularity may be the fact that they are championed by preeminent experts in the field: both Frank Harrell &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-Harrell2015-ws&#34; role=&#34;doc-biblioref&#34;&gt;Harrell 2015&lt;/a&gt;)&lt;/span&gt; and Ewout Steyerberg &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-Steyerberg2019-yc&#34; role=&#34;doc-biblioref&#34;&gt;Steyerberg 2019&lt;/a&gt;)&lt;/span&gt; prominently feature the bootstrap — and in particular the optimism-adjusted bootstrap (OAD) — in their textbooks. In this post, I give a brief introduction to OAD and compare it to repeated cross-validation and the regular bootstrap. OAD is implemented in the R packages &lt;a href=&#34;https://cran.r-project.org/web/packages/caret/&#34;&gt;&lt;em&gt;caret&lt;/em&gt;&lt;/a&gt; and Frank Harrell’s &lt;a href=&#34;https://cran.r-project.org/web/packages/rms/&#34;&gt;&lt;em&gt;rms&lt;/em&gt;&lt;/a&gt; but not in the recent &lt;a href=&#34;https://cran.r-project.org/web/packages/tidymodels/&#34;&gt;&lt;em&gt;tidymodels&lt;/em&gt;&lt;/a&gt; ecosystem &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-Kuhn2022-al&#34; role=&#34;doc-biblioref&#34;&gt;Kuhn and Silge 2022&lt;/a&gt;)&lt;/span&gt;. This post will therefore provide a step-by-step guide to doing OAD with &lt;a href=&#34;https://cran.r-project.org/web/packages/tidymodels/&#34;&gt;&lt;em&gt;tidymodels&lt;/em&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;div id=&#34;who-this-post-is-for&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Who this post is for&lt;/h1&gt;
&lt;p&gt;Here’s what I assume you to know:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You’re familiar with &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; and the &lt;a href=&#34;https://www.tidyverse.org/&#34;&gt;tidyverse&lt;/a&gt;, including the amazing &lt;a href=&#34;https://www.tidymodels.org/&#34;&gt;tidymodels&lt;/a&gt; framework (if not go check it out now!).&lt;/li&gt;
&lt;li&gt;You know a little bit about fitting and evaluating linear regression models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We will use the following &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; packages throughout this post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(tidyverse)
library(tidymodels)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;optimism-adjusted-bootstrap&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Optimism-adjusted bootstrap&lt;/h1&gt;
&lt;p&gt;Like other resampling schemes, OAB aims to avoid overly optimistic estimation of model performance during internal validation — i.e., validation of model performance using the training dataset. As we will see further down, simply calculating performance metrics on the same data used for training leads to artificially good performance estimates. We will call this the “apparent” performance. OAB obtains a better estimate by directly estimating the amount of “optimism” in the apparent performance. The steps needed to do so are as follows &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-Steyerberg2019-yc&#34; role=&#34;doc-biblioref&#34;&gt;Steyerberg 2019&lt;/a&gt;)&lt;/span&gt;:&lt;/p&gt;
&lt;ol style=&#34;list-style-type: decimal&#34;&gt;
&lt;li&gt;Fit a model &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; to the original training set &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt; and use &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; to calculate the apparent performance &lt;span class=&#34;math inline&#34;&gt;\(R(M, S)\)&lt;/span&gt; (e.g., accuracy) on the training data&lt;/li&gt;
&lt;li&gt;Draw a bootstrapped sample &lt;span class=&#34;math inline&#34;&gt;\(S^*\)&lt;/span&gt; of the same size as &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt; through sampling &lt;em&gt;with&lt;/em&gt; replacement&lt;/li&gt;
&lt;li&gt;Construct another model &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt; by performing all model building steps (pre-processing, imputation, model selection, etc.) on &lt;span class=&#34;math inline&#34;&gt;\(S^*\)&lt;/span&gt; and calculate its apparent performance &lt;span class=&#34;math inline&#34;&gt;\(R(M^*, S^*)\)&lt;/span&gt; on &lt;span class=&#34;math inline&#34;&gt;\(S^*\)&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;Use &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt; to estimate the performance &lt;span class=&#34;math inline&#34;&gt;\(R(M^*, S)\)&lt;/span&gt; that it would have had on the original data &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;Calculate the optimism &lt;span class=&#34;math inline&#34;&gt;\(O^* = R(M^*, S^*) - R(M^*, S)\)&lt;/span&gt; as the difference between the apparent and test performance of &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt;.&lt;/li&gt;
&lt;li&gt;Repeat steps 2.-5. many times &lt;span class=&#34;math inline&#34;&gt;\(B\)&lt;/span&gt; to obtain a sufficiently stable estimate (common recommendations range from 100 to 1000 repetitions, depending on computational feasibility)&lt;/li&gt;
&lt;li&gt;Subtract the mean optimism &lt;span class=&#34;math inline&#34;&gt;\(\frac{1}{B} \sum^B_{b=1} O^*_b\)&lt;/span&gt; from the apparent performance &lt;span class=&#34;math inline&#34;&gt;\(R(M, S)\)&lt;/span&gt; in the original training data &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt; to get an optimism-adjusted estimate of model performance.&lt;/li&gt;
&lt;/ol&gt;
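&lt;p&gt;To make these steps concrete before turning to &lt;em&gt;tidymodels&lt;/em&gt;, here is a minimal base-R sketch of the procedure on toy data with a plain &lt;code&gt;lm()&lt;/code&gt; fit (purely illustrative, not the code we use below):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)

# Toy training set S
S &amp;lt;- data.frame(x = rnorm(100))
S$y &amp;lt;- S$x + rnorm(100)

rmse &amp;lt;- function(model, data) sqrt(mean((data$y - predict(model, data))^2))

# Step 1: apparent performance R(M, S)
M &amp;lt;- lm(y ~ x, data = S)
r_app &amp;lt;- rmse(M, S)

# Steps 2-6: estimate the optimism from B = 200 bootstrap samples
optimism &amp;lt;- replicate(200, {
  S_star &amp;lt;- S[sample(nrow(S), replace = TRUE), ]  # step 2
  M_star &amp;lt;- lm(y ~ x, data = S_star)              # step 3
  rmse(M_star, S_star) - rmse(M_star, S)          # steps 4-5
})

# Step 7: optimism-adjusted performance estimate
r_adj &amp;lt;- r_app - mean(optimism)&lt;/code&gt;&lt;/pre&gt;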
&lt;p&gt;The basic intuition behind this procedure is that the model &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt; will overfit to &lt;span class=&#34;math inline&#34;&gt;\(S^*\)&lt;/span&gt; in the same way as &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; overfits to &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt;. We can then estimate the difference between &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt;’s observed apparent performance &lt;span class=&#34;math inline&#34;&gt;\(R(M, S)\)&lt;/span&gt; and its unobserved performance on future test data &lt;span class=&#34;math inline&#34;&gt;\(R(M, U)\)&lt;/span&gt; from the difference between the bootstrapped model &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt;’s apparent performance &lt;span class=&#34;math inline&#34;&gt;\(R(M^*, S^*)\)&lt;/span&gt; and its test performance &lt;span class=&#34;math inline&#34;&gt;\(R(M^*, S)\)&lt;/span&gt; (both of which are observed). The training data &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt; acts as a stand-in test set for the bootstrapped model &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;The following sections will apply this basic idea to the Ames housing dataset and compare estimates derived via OAB to repeated cross-validation and standard bootstrap.&lt;/p&gt;
&lt;/div&gt;
&lt;div id=&#34;the-ames-data-set&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The Ames data set&lt;/h1&gt;
&lt;p&gt;The Ames data set contains information on 2,930 properties in Ames, Iowa, described by 74 variables including the number of bedrooms, whether the property includes a garage, and the sale price. We choose this data set because it provides a decent sample size for predictive modelling and is used prominently in the documentation of the R &lt;code&gt;tidymodels&lt;/code&gt; ecosystem. More information on the Ames data set can be found in &lt;span class=&#34;citation&#34;&gt;(&lt;a href=&#34;#ref-Kuhn2022-al&#34; role=&#34;doc-biblioref&#34;&gt;Kuhn and Silge 2022&lt;/a&gt;)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(123)

data(ames)
dim(ames)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 2930   74&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;ames[1:5, 1:5]&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 × 5
##   MS_SubClass                         MS_Zoning     Lot_Frontage Lot_Area Street
##   &amp;lt;fct&amp;gt;                               &amp;lt;fct&amp;gt;                &amp;lt;dbl&amp;gt;    &amp;lt;int&amp;gt; &amp;lt;fct&amp;gt; 
## 1 One_Story_1946_and_Newer_All_Styles Residential_…          141    31770 Pave  
## 2 One_Story_1946_and_Newer_All_Styles Residential_…           80    11622 Pave  
## 3 One_Story_1946_and_Newer_All_Styles Residential_…           81    14267 Pave  
## 4 One_Story_1946_and_Newer_All_Styles Residential_…           93    11160 Pave  
## 5 Two_Story_1946_and_Newer            Residential_…           74    13830 Pave&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;For this exercise, we try to predict sale prices within the dataset. To keep preprocessing simple, we limit the predictors to numeric variables only, which we centre and scale. Since sale prices are right-skewed, we log-transform them before prediction. Finally, we hold back a random quarter of the data to simulate external validation on an independent, identically distributed test set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Define sale price as the prediction target
formula &amp;lt;- Sale_Price ~ .

# Remove categorical variables, log sale price, scale the numeric predictors
preproc &amp;lt;- recipe(formula, data = ames[0, ]) %&amp;gt;% 
  step_rm(all_nominal_predictors()) %&amp;gt;% 
  step_log(all_outcomes()) %&amp;gt;% 
  step_normalize(all_numeric_predictors(), -all_outcomes())

# Randomly split into training (3/4) and testing (1/4) sets
train_test_split &amp;lt;- initial_split(ames, prop = 3/4)
train &amp;lt;- training(train_test_split)
test &amp;lt;- testing(train_test_split)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;optimism-adjusted-bootstrap-with-tidymodels&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Optimism-adjusted bootstrap with &lt;em&gt;tidymodels&lt;/em&gt;&lt;/h1&gt;
&lt;p&gt;Now that we have set up the data, let’s look at how we can build a linear regression model and validate it via OAB. We proceed according to the steps described above.&lt;/p&gt;
&lt;div id=&#34;step-1-calculate-apparent-perforamnce&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 1: Calculate apparent performance&lt;/h2&gt;
&lt;p&gt;To start, we simply fit and evaluate our model &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; on the original training data &lt;span class=&#34;math inline&#34;&gt;\(S\)&lt;/span&gt; (note that we also apply preprocessing, so strictly speaking we train our model on the preprocessed data &lt;span class=&#34;math inline&#34;&gt;\(S&amp;#39;\)&lt;/span&gt;). Since our outcome is a continuous value strictly greater than zero, we will use the root mean squared error (RMSE) as our performance metric.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;prepped &amp;lt;- prep(preproc, train)
preproc_orig &amp;lt;- juice(prepped)
fit_orig &amp;lt;- fit(linear_reg(), formula, preproc_orig)
preds_orig &amp;lt;- predict(fit_orig, new_data = preproc_orig)
perf_orig &amp;lt;- rmse_vec(preproc_orig$Sale_Price, preds_orig$.pred)

perf_orig&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.1693906&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-2-create-bootstrapped-samples&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 2: Create bootstrapped samples&lt;/h2&gt;
&lt;p&gt;After obtaining &lt;span class=&#34;math inline&#34;&gt;\(M\)&lt;/span&gt; and &lt;span class=&#34;math inline&#34;&gt;\(R(M, S)\)&lt;/span&gt;, we now produce a set of bootstrap samples to estimate the amount of optimism in this performance estimate. We use the &lt;em&gt;tidymodels&lt;/em&gt; sub-package &lt;em&gt;rsample&lt;/em&gt; to create a data frame &lt;code&gt;bs&lt;/code&gt; with &lt;code&gt;100&lt;/code&gt; bootstrap samples. All of these resamples have training data of equal size to the original training data (n = 2197). Note, however, that the “testing data” set aside differs between splits: it consists of all rows that did not get sampled into the training data, the number of which is random and varies between bootstraps. We won’t use this testing data for OAB, but it is used, for example, by the standard bootstrap that we compare against later.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bs &amp;lt;- bootstraps(train, times = 100L)

bs %&amp;gt;% slice(1:5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 5 × 2
##   splits             id          
##   &amp;lt;list&amp;gt;             &amp;lt;chr&amp;gt;       
## 1 &amp;lt;split [2197/813]&amp;gt; Bootstrap001
## 2 &amp;lt;split [2197/818]&amp;gt; Bootstrap002
## 3 &amp;lt;split [2197/813]&amp;gt; Bootstrap003
## 4 &amp;lt;split [2197/786]&amp;gt; Bootstrap004
## 5 &amp;lt;split [2197/792]&amp;gt; Bootstrap005&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bs %&amp;gt;% slice((n()-5):n())&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 6 × 2
##   splits             id          
##   &amp;lt;list&amp;gt;             &amp;lt;chr&amp;gt;       
## 1 &amp;lt;split [2197/806]&amp;gt; Bootstrap095
## 2 &amp;lt;split [2197/822]&amp;gt; Bootstrap096
## 3 &amp;lt;split [2197/778]&amp;gt; Bootstrap097
## 4 &amp;lt;split [2197/800]&amp;gt; Bootstrap098
## 5 &amp;lt;split [2197/833]&amp;gt; Bootstrap099
## 6 &amp;lt;split [2197/815]&amp;gt; Bootstrap100&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-3-fit-bootstrapped-models-and-calculate-their-apparent-performance&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 3: Fit bootstrapped models and calculate their apparent performance&lt;/h2&gt;
&lt;p&gt;We now use the bootstrap data.frame &lt;code&gt;bs&lt;/code&gt; to preprocess each sample &lt;span class=&#34;math inline&#34;&gt;\(S^*\)&lt;/span&gt; individually, fit a linear regression &lt;span class=&#34;math inline&#34;&gt;\(M^*\)&lt;/span&gt; to it, and calculate its apparent performance &lt;span class=&#34;math inline&#34;&gt;\(R(M^*, S^*)\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bs &amp;lt;- bs %&amp;gt;% 
  mutate(
    # Apply preprocessing separately for each bootstrapped sample S*
    processed = map(splits, ~ juice(prep(preproc, training(.)))),
    # Fit a separate model M* to each preprocessed bootstrap
    fitted = map(processed, ~ fit(linear_reg(), formula, data = .)),
    # Predict values for each bootstrap&amp;#39;s training data S* and calculate RMSE
    pred_app = map2(fitted, processed, ~ predict(.x, new_data = .y)),
    perf_app = map2_dbl(processed, pred_app, ~ rmse_vec(.x$Sale_Price, .y$.pred))
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-4-evaluate-on-the-original-training-data&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 4: Evaluate on the original training data&lt;/h2&gt;
&lt;p&gt;Since we stored the fitted models &lt;span class=&#34;math inline&#34;&gt;\(M^*_i\)&lt;/span&gt; in a column of the data.frame, we can easily re-use them to predict values for the original data and evaluate them. Remember that because some of the rows in the original dataset did not end up in the bootstrapped dataset, we expect each model &lt;span class=&#34;math inline&#34;&gt;\(M^*_i\)&lt;/span&gt; to perform worse on the original data — i.e., have a higher RMSE &lt;span class=&#34;math inline&#34;&gt;\(R(M^*_i, S)\)&lt;/span&gt; — than on its own training data &lt;span class=&#34;math inline&#34;&gt;\(R(M^*_i, S^*_i)\)&lt;/span&gt;.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bs &amp;lt;- bs %&amp;gt;% 
  mutate(
    pred_test = map(fitted, ~ predict(., new_data = preproc_orig)),
    perf_test = map_dbl(pred_test, ~ rmse_vec(preproc_orig$Sale_Price, .$.pred)),
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;step-5-estimate-the-optimism&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Step 5: Estimate the optimism&lt;/h2&gt;
&lt;p&gt;The amount of optimism in our apparent estimate is now estimated simply as the difference between the apparent and test performance in each bootstrap sample.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;bs &amp;lt;- bs %&amp;gt;% 
  mutate(
    optim = perf_app - perf_test
  )&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;steps-6-7-adjust-for-optimism&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;Steps 6-7: Adjust for optimism&lt;/h2&gt;
&lt;p&gt;We already repeated this procedure for all 100 bootstrap samples, so step 6 is fulfilled. In order to get a single, final estimate, all that’s left to do is to calculate the mean and standard deviation of the optimism and subtract the mean from the apparent performance obtained in step 1, using the standard error to form approximate normal 95% Wald confidence limits. This is now the performance that we report for our model after internal validation.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;mean_opt &amp;lt;- mean(bs$optim)
std_opt &amp;lt;- sd(bs$optim)

(perf_orig - mean_opt) + c(-2, 0, 2) * std_opt / sqrt(nrow(bs))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.1766124 0.1789846 0.1813568&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;external-validation&#34; class=&#34;section level2&#34;&gt;
&lt;h2&gt;External validation&lt;/h2&gt;
&lt;p&gt;Remember that we set aside a quarter of the data for external validation (“external” is a bit of a misnomer here, but more on that later). We can now check how our estimate from internal validation compares to the performance in the held-out data. Indeed, the performance has dropped slightly further, and the held-out RMSE even falls just outside the narrow confidence limits suggested by OAB above — although it remains much closer to the OAB estimate than to the apparent performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;preproc_test &amp;lt;- bake(prepped, test)
preds_test &amp;lt;- predict(fit_orig, new_data = preproc_test)
rmse_vec(preproc_test$Sale_Price, preds_test$.pred)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## [1] 0.1855153&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;putting-everything-together&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Putting everything together&lt;/h1&gt;
&lt;p&gt;Using what we learned above, we can create a single function &lt;code&gt;calculate_optimism_adjusted()&lt;/code&gt; that performs all steps and returns the adjusted model performance.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;calculate_optimism_adjusted &amp;lt;- function(train_data, formula, preproc, n_resamples = 10L) {
  # Get apparent performance
  prepped &amp;lt;- prep(preproc, train_data)
  preproc_orig &amp;lt;- juice(prepped)
  fit_orig &amp;lt;- fit(linear_reg(), formula, preproc_orig)
  preds_orig &amp;lt;- predict(fit_orig, new_data = preproc_orig)
  perf_orig &amp;lt;- rmse_vec(last(preproc_orig), preds_orig$.pred)
  
  # Estimate optimism via bootstrap
  rsmpl &amp;lt;- bootstraps(train_data, times = n_resamples) %&amp;gt;% 
    mutate(
      processed = map(splits, ~ juice(prep(preproc, training(.)))),
      fitted = map(processed, ~ fit(linear_reg(), formula, data = .)),
      pred_app = map2(fitted, processed, ~ predict(.x, new_data = .y)),
      perf_app = map2_dbl(processed, pred_app, ~ rmse_vec(.x$Sale_Price, .y$.pred)),
      pred_test = map(fitted, ~ predict(., new_data = preproc_orig)),
      perf_test = map_dbl(pred_test, ~ rmse_vec(last(preproc_orig), .$.pred)),
      optim = perf_app - perf_test
    )
  
  mean_opt &amp;lt;- mean(rsmpl$optim)
  std_opt &amp;lt;- sd(rsmpl$optim)

  # Adjust for optimism
  tibble(
    .metric = &amp;quot;rmse&amp;quot;,
    mean = perf_orig - mean_opt, 
    n = n_resamples, 
    std_err = std_opt / sqrt(n_resamples)
  )
}&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We also define a similar function &lt;code&gt;eval_test()&lt;/code&gt; for the external validation, as well as wrappers around &lt;em&gt;tidymodels&lt;/em&gt;’ &lt;code&gt;fit_resamples()&lt;/code&gt; that do the same for repeated cross-validation (&lt;code&gt;calculate_repeated_cv()&lt;/code&gt;) and the standard bootstrap (&lt;code&gt;calculate_standard_bs()&lt;/code&gt;), which we will compare in a second.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;eval_test &amp;lt;- function(train_data, test_data, formula, preproc) {
  
  prepped &amp;lt;- prep(preproc, train_data)
  preproc_train &amp;lt;- juice(prepped)
  preproc_test &amp;lt;- bake(prepped, test_data)
  fitted &amp;lt;- fit(linear_reg(), formula, data = preproc_train)
  preds &amp;lt;- predict(fitted, new_data = preproc_test)
  rmse_vec(preproc_test$Sale_Price, preds$.pred)
}

calculate_repeated_cv &amp;lt;- function(train_data, formula, preproc, v = 10L, repeats = 1L){
  rsmpl &amp;lt;- vfold_cv(train_data, v = v, repeats = repeats)

  show_best(fit_resamples(linear_reg(), preproc, rsmpl), metric = &amp;quot;rmse&amp;quot;) %&amp;gt;% 
    select(-.estimator, -.config)
}

calculate_standard_bs &amp;lt;- function(train_data, formula, preproc, n_resamples = 10L) {
  rsmpl &amp;lt;- bootstraps(train_data, times = n_resamples, apparent = FALSE)

  show_best(fit_resamples(linear_reg(), preproc, rsmpl), metric = &amp;quot;rmse&amp;quot;) %&amp;gt;% 
    select(-.estimator, -.config)
}&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;comparison-of-validation-methods&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Comparison of validation methods&lt;/h1&gt;
&lt;p&gt;In this last section, we compare the results obtained from OAB to two other well-known validation methods: repeated 10-fold cross-validation and the standard bootstrap. In the former, we randomly split the data into 10 mutually exclusive folds of equal size. In a round-robin fashion, we set aside one fold as an evaluation set and use the remaining nine to train our model. We then choose the next fold and do the same. After one round, we have ten estimates of model performance, one for each held-out fold. We repeat this process several times with new random seeds to obtain the same number of resamples as were used for the bootstrap. With the standard bootstrap, we fit our models on bootstrapped samples as before but evaluate them on the observations that were randomly excluded from each particular bootstrap — similar to the held-out fold in cross-validation.&lt;/p&gt;
&lt;p&gt;In order to get a good comparison of methods, we won’t stick with a single train-test split as before but use nested validation. The reason for this is that our test data isn’t truly external. Instead, it is randomly sampled from the entire development dataset (which in this case was all of Ames). Holding out a single chunk of that data as test data would be wasteful and could again result in us being particularly lucky or unlucky in the selection of that chunk. This is particularly problematic if we further perform hyperparameter searches. In nested validation, we mitigate this risk by wrapping our entire internal validation in another cross-validation loop, i.e., we treat the held-out set of an outer cross-validation as the “external” test set.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;outer &amp;lt;- vfold_cv(ames, v = 5, repeats = 1)

outer &amp;lt;- outer %&amp;gt;% 
  mutate(
    opt = splits %&amp;gt;% 
      map(~ calculate_optimism_adjusted(training(.), formula, preproc, 100L)),
    cv = splits %&amp;gt;% 
      map(~ calculate_repeated_cv(training(.), formula, preproc, repeats = 10L)), 
    bs = splits %&amp;gt;% 
      map(~ calculate_standard_bs(training(.), formula, preproc, 100L)), 
    test = splits %&amp;gt;% 
      map_dbl(~ eval_test(training(.), testing(.), formula, preproc))
  )&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can see below that, in this example, all resampling methods perform more or less similarly. Notably, both bootstrap-based methods have narrower confidence intervals. This was to be expected, as cross-validation typically has high variance. With the bootstrap, this increased precision is traded for a risk of bias, which is usually pessimistic — as with the standard bootstrap in this example. OAB here seems to have a slight optimistic bias: while its mean is similar to that of cross-validation, its narrower confidence interval means that the average test performance over the nested runs is not contained within the approximate confidence limits. Nonetheless, all resampling methods give us a more accurate estimate of likely future model performance than the apparent performance of 0.169.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;format_results &amp;lt;- function(outer, method) {
  method &amp;lt;- rlang::enquo(method)
  
  outer %&amp;gt;% 
    unnest(!!method) %&amp;gt;% 
    summarise(
      rsmpl_lower = mean(mean - 2 * std_err),
      rsmpl_mean  = mean(mean), 
      rsmpl_upper = mean(mean + 2 * std_err), 
      test_mean   = mean(test)
    )
}

tibble(method = c(&amp;quot;opt&amp;quot;, &amp;quot;cv&amp;quot;, &amp;quot;bs&amp;quot;)) %&amp;gt;% 
  bind_cols(bind_rows(
    format_results(outer, opt), 
    format_results(outer, cv), 
    format_results(outer, bs), 
  ))&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 3 × 5
##   method rsmpl_lower rsmpl_mean rsmpl_upper test_mean
##   &amp;lt;chr&amp;gt;        &amp;lt;dbl&amp;gt;      &amp;lt;dbl&amp;gt;       &amp;lt;dbl&amp;gt;     &amp;lt;dbl&amp;gt;
## 1 opt          0.175      0.178       0.180     0.179
## 2 cv           0.170      0.177       0.184     0.179
## 3 bs           0.179      0.182       0.186     0.179&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;references&#34; class=&#34;section level1 unnumbered&#34;&gt;
&lt;h1&gt;References&lt;/h1&gt;
&lt;div id=&#34;refs&#34; class=&#34;references csl-bib-body hanging-indent&#34;&gt;
&lt;div id=&#34;ref-Harrell2015-ws&#34; class=&#34;csl-entry&#34;&gt;
Harrell, Frank E, Jr. 2015. &lt;em&gt;Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis&lt;/em&gt;. Springer, Cham.
&lt;/div&gt;
&lt;div id=&#34;ref-Kuhn2022-al&#34; class=&#34;csl-entry&#34;&gt;
Kuhn, Max, and Julia Silge. 2022. &lt;em&gt;Tidy Modeling with r&lt;/em&gt;. &lt;a href=&#34;https://www.tmwr.org/&#34;&gt;https://www.tmwr.org/&lt;/a&gt;.
&lt;/div&gt;
&lt;div id=&#34;ref-Steyerberg2019-yc&#34; class=&#34;csl-entry&#34;&gt;
Steyerberg, Ewout W. 2019. &lt;em&gt;Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating&lt;/em&gt;. Springer, Cham.
&lt;/div&gt;
&lt;/div&gt;
&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Contextual nature of AUC</title>
      <link>https://www.patrick-rockenschaub.com/_post/2021-11-26-contextual-nature-of-auc/</link>
      <pubDate>Fri, 26 Nov 2021 21:13:14 -0500</pubDate>
      <guid>https://www.patrick-rockenschaub.com/_post/2021-11-26-contextual-nature-of-auc/</guid>
      <description>


&lt;p&gt;The area under the receiver operating characteristic (AUC) is arguably among
the most frequently used measures of classification performance. Unlike other
common measures like sensitivity, specificity, or accuracy, the AUC does not
require (often arbitrary) thresholds. It also lends itself to a very simple
and intuitive interpretation: a model’s AUC equals the probability that, for any
randomly chosen pair with and without the outcome, the observation with the
outcome is assigned a higher risk score by the model than the observation
without the outcome, or&lt;/p&gt;
&lt;p&gt;&lt;span class=&#34;math display&#34;&gt;\[
P(f(x_i) &amp;gt; f(x_j))
\]&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;where &lt;span class=&#34;math inline&#34;&gt;\(f(x_i)\)&lt;/span&gt; is the risk score that the model
assigned to observation &lt;span class=&#34;math inline&#34;&gt;\(i\)&lt;/span&gt; based on its covariates &lt;span class=&#34;math inline&#34;&gt;\(x_i\)&lt;/span&gt;,
&lt;span class=&#34;math inline&#34;&gt;\(i \in D_{y=1}\)&lt;/span&gt; is an observation taken from among all cases &lt;span class=&#34;math inline&#34;&gt;\(y=1\)&lt;/span&gt;, and
&lt;span class=&#34;math inline&#34;&gt;\(j \in D_{y=0}\)&lt;/span&gt; is an observation taken from among all controls &lt;span class=&#34;math inline&#34;&gt;\(y=0\)&lt;/span&gt;. As such,
the AUC has a nice probabilistic meaning and can be linked back to the
well-known &lt;a href=&#34;https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test#Area-under-curve_(AUC)_statistic_for_ROC_curves&#34;&gt;Mann-Whitney U test&lt;/a&gt;.&lt;/p&gt;
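&lt;p&gt;This equivalence is easy to verify numerically. The sketch below (made-up risk scores, base R) computes the AUC once as the proportion of concordant case–control pairs and once from the Mann-Whitney U statistic returned by &lt;code&gt;wilcox.test()&lt;/code&gt;:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(1)

# Made-up risk scores for cases (y = 1) and controls (y = 0)
f_cases    &amp;lt;- rnorm(500, mean = 1)
f_controls &amp;lt;- rnorm(500, mean = 0)

# AUC as P(f(x_i) &amp;gt; f(x_j)) over all case-control pairs
auc_pairs &amp;lt;- mean(outer(f_cases, f_controls, `&amp;gt;`))

# AUC from the Mann-Whitney U statistic
U &amp;lt;- wilcox.test(f_cases, f_controls)$statistic
auc_u &amp;lt;- unname(U) / (length(f_cases) * length(f_controls))

all.equal(auc_u, auc_pairs)  # TRUE&lt;/code&gt;&lt;/pre&gt;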
&lt;p&gt;The ubiquitous use of AUC isn’t without controversy (which, like so
many things these days, spilled over into
&lt;a href=&#34;https://twitter.com/cecilejanssens/status/1104134423673479169&#34;&gt;Twitter&lt;/a&gt;).
Regularly voiced criticisms of AUC — and the closely linked receiver operating
characteristic (ROC) curve — include its indifference to class imbalance and
extreme observations.&lt;/p&gt;
&lt;p&gt;In this post, I want to take a closer look at a feature of AUC that — at least
in my experience — is often overlooked when evaluating models in an external
test set: the dependence of the AUC on the distribution of variables in the
test set. We will see that this can lead to considerable changes in estimated
AUC even if our model is actually correct, and make it harder to disentangle
changes in performance due to model misspecification from changes in performance
due to differences between development and test sets.&lt;/p&gt;
&lt;div id=&#34;who-this-post-is-for&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Who this post is for&lt;/h1&gt;
&lt;p&gt;Here’s what I assume you to know:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;You’re familiar with &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; and the &lt;a href=&#34;https://www.tidyverse.org/&#34;&gt;tidyverse&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You know a little bit about fitting and evaluating linear regression models.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We will use the following &lt;a href=&#34;https://www.r-project.org/&#34;&gt;R&lt;/a&gt; packages throughout this post:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;library(MASS)
library(tidyverse)
library(tidymodels)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;generating-some-dummy-data&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Generating some dummy data&lt;/h1&gt;
&lt;p&gt;Let’s get started by
simulating some fake medical data for 100,000 patients. Assume we are
interested in predicting their probability of death (=outcome). We use the
patients’ sex (binary) and three continuous measurements — age, blood pressure
, and cholesterol — to do so. In this fake data set, being female has a strong
protective effect (odds ratio = exp(-2) = 0.14) and all other variables have a
moderate effect (odds ratio per standard deviation = exp(0.3) = 1.35).
Any influence by other, unmeasured factors is simulated by drawing from a
Bernoulli distribution with a probability defined by sex, age, blood pressure,
and cholesterol.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;set.seed(42)

# Set number of rows and predictors
n &amp;lt;- 100000
p &amp;lt;- 3

# Simulate an additional binary predictor (e.g., sex)
sex &amp;lt;- rep(0:1, each = n %/% 2)

# Simulate multivariate normal predictors (e.g., age, blood pressure, 
# cholesterol)
mu &amp;lt;- rep(0, p)
Sigma &amp;lt;- 0.8 * diag(p) + 0.2 * matrix(1, p, p)
other_covars &amp;lt;- MASS::mvrnorm(n, mu, Sigma)
colnames(other_covars) &amp;lt;- c(&amp;quot;age&amp;quot;, &amp;quot;bp&amp;quot;, &amp;quot;chol&amp;quot;)

# Simulate binary outcome (e.g., death)
logistic &amp;lt;- function(x) 1 / (1 + exp(-x))
betas &amp;lt;- c(0.8, -2, 0.3, 0.3, 0.3)
lp &amp;lt;- cbind(1, sex, other_covars) %*% betas
death &amp;lt;- rbinom(n, 1, logistic(lp)) 

# Make into a data.frame and split into a training set (first half) and 
# a biased test set (rows in the second half X * beta &amp;gt; 0)
data &amp;lt;- as_tibble(other_covars)
data$sex &amp;lt;- factor(sex, 0:1, c(&amp;quot;male&amp;quot;, &amp;quot;female&amp;quot;))
data$pred_risk &amp;lt;- as.vector(logistic(lp))
data$death &amp;lt;- factor(death, 0:1, c(&amp;quot;no&amp;quot;, &amp;quot;yes&amp;quot;))
data$id &amp;lt;- 1:nrow(data)&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;
&lt;div id=&#34;estimating-predictive-performance&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Estimating predictive performance&lt;/h1&gt;
&lt;p&gt;Now that we have some data, we can evaluate how well our model is able to
predict each patient’s risk of death. To make our lives as simple as possible,
we assume that we were able to divine the true effects of each of our variables,
i.e., we know that the data is generated via a logistic regression model with
&lt;span class=&#34;math inline&#34;&gt;\(\beta = [0.8, -2, 0.3, 0.3, 0.3]\)&lt;/span&gt; and there is no uncertainty around those
estimates. Under these assumptions, our model would be able to achieve the
following AUC in the simulated data.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;auc &amp;lt;- function(data) {
  yardstick::roc_auc(data, death, pred_risk, event_level = &amp;quot;second&amp;quot;)
}

data %&amp;gt;% auc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 roc_auc binary         0.785&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Notably, this is a summary measure that depends on the entire data set &lt;span class=&#34;math inline&#34;&gt;\(D\)&lt;/span&gt; and
&lt;em&gt;cannot&lt;/em&gt; be calculated for an individual patient alone. By definition, it
requires at least one patient with and one without the outcome. This has important
implications for interpreting the AUC. Let’s see what happens if we evaluate
our (true) model in men and women separately.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;men &amp;lt;- data %&amp;gt;% filter(sex == &amp;quot;male&amp;quot;)
men %&amp;gt;% auc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 roc_auc binary         0.662&lt;/code&gt;&lt;/pre&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;women &amp;lt;- data %&amp;gt;% filter(sex == &amp;quot;female&amp;quot;)
women %&amp;gt;% auc()&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 roc_auc binary         0.665&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In each subset, the AUC dropped from 0.78 to around 0.66. This perhaps isn’t
too surprising, given that sex was a strong predictor of death.
However, remember that the model coefficients and hence the predicted risk for
each individual patient — i.e., how “good” that prediction is for that patient
— remain unchanged. We merely changed the set of patients that we included in
the evaluation. Although this might be obvious, I believe this is an important
point to highlight.&lt;/p&gt;
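&lt;p&gt;To make this concrete, we can check directly that subsetting leaves each
patient’s prediction untouched; only the set of patients they are compared
against changes (a quick sanity check on the &lt;code&gt;data&lt;/code&gt; tibble from
above):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# The predicted risk of each man is identical whether we look it up in the
# full data or in the male-only subset
men_in_full &amp;lt;- data$pred_risk[data$sex == &amp;quot;male&amp;quot;]
men_alone &amp;lt;- data %&amp;gt;% filter(sex == &amp;quot;male&amp;quot;) %&amp;gt;% pull(pred_risk)
all.equal(men_in_full, men_alone)&lt;/code&gt;&lt;/pre&gt;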
&lt;p&gt;Looking at the distribution of risks in the total data and by sex might provide
some further intuition for this finding. In the total population, the predicted
risks of both those who did and those who did not die are clearly bimodal,
clustering around the average risks for men and women (Figure &lt;a href=&#34;#fig:dist-of-risks-overall&#34;&gt;1&lt;/a&gt;). Even so, there
is good separation between them. The majority of patients who died (red curve)
had a predicted risk &amp;gt;50%. Vice versa, the majority of patients who remained
alive (green curve) had a risk &amp;lt;50%. Looking at the risks of men and women
separately, however, we can see that most men had a high and most women a low
predicted risk (Figure &lt;a href=&#34;#fig:dist-of-risks-by-sex&#34;&gt;2&lt;/a&gt;). There is much less
separation between the red and green curves, as any differences within each sex
(for example, among the men) are entirely due to the moderate effects of our
simulated continuous covariates.&lt;/p&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:dist-of-risks-overall&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://www.patrick-rockenschaub.com/_post/2021-11-26-contextual-nature-of-auc/index.en_files/figure-html/dist-of-risks-overall-1.png&#34; alt=&#34;Distribution of predicted risk of death for patients that ultimately did and did not die.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 1: Distribution of predicted risk of death for patients that ultimately did and did not die.
&lt;/p&gt;
&lt;/div&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:dist-of-risks-by-sex&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://www.patrick-rockenschaub.com/_post/2021-11-26-contextual-nature-of-auc/index.en_files/figure-html/dist-of-risks-by-sex-1.png&#34; alt=&#34;Distribution of predicted risk of death by sex.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 2: Distribution of predicted risk of death by sex.
&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;So far, we have looked at two extremes: the entire data set, in which sex was
perfectly balanced, and two completely separated subsets with only men or only
women. Let’s see what would happen if we gradually reduce the number of men in
our evaluation population. We can see that estimated performance drops as we
remove more and more men from the data set
(Figure &lt;a href=&#34;#fig:plot-auc-under-shift&#34;&gt;3&lt;/a&gt;),
particularly at the right-hand side of the graph, where only a few men remain
and the sex imbalance is greatest.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;auc_biased &amp;lt;- function(data, p_men_remove) {
  n_men &amp;lt;- sum(data$sex == &amp;quot;male&amp;quot;)
  n_exclude &amp;lt;- floor(n_men * p_men_remove)
  
  data %&amp;gt;% 
    slice(-(1:n_exclude)) %&amp;gt;% 
    auc()
}

p_men_remove &amp;lt;- seq(0, 1, by = 0.01)
auc_shift &amp;lt;- p_men_remove %&amp;gt;% 
  map_dfr(auc_biased, data = data) %&amp;gt;% 
  mutate(p_men_remove = p_men_remove)&lt;/code&gt;&lt;/pre&gt;
&lt;div class=&#34;figure&#34; style=&#34;text-align: center&#34;&gt;&lt;span style=&#34;display:block;&#34; id=&#34;fig:plot-auc-under-shift&#34;&gt;&lt;/span&gt;
&lt;img src=&#34;https://www.patrick-rockenschaub.com/_post/2021-11-26-contextual-nature-of-auc/index.en_files/figure-html/plot-auc-under-shift-1.png&#34; alt=&#34;Estimated AUC by proportion of men removed from the full data set.&#34; width=&#34;672&#34; /&gt;
&lt;p class=&#34;caption&#34;&gt;
Figure 3: Estimated AUC by proportion of men removed from the full data set.
&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;div id=&#34;difficulty-to-distinguish-model-misspecification&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;The difficulty of distinguishing model misspecification&lt;/h1&gt;
&lt;p&gt;What we have seen so far is particularly problematic in the interpretation of
external model validation, i.e., when we test a model that was developed in one
set of patients (and potentially overfit to that population) in another patient
population in order to estimate the model’s likely future performance. This is
because in most real world cases, it isn’t quite as straightforward to quantify
the difference between the development population and the evaluation cohort.
Since we also usually don’t know the true model parameters (or even the true
model class), it is difficult to disentangle the effects of population makeup
from the effects of model misspecification. Let’s assume for example that —
unlike earlier — we don’t know the exact model parameters and instead needed
to estimate them from a prior development data set. As a result, we obtained
&lt;span class=&#34;math inline&#34;&gt;\(\beta_{biased} = [0.4, -1, 0.3, -0.2, 0.5]\)&lt;/span&gt;, which clearly differs from the true
&lt;span class=&#34;math inline&#34;&gt;\(\beta\)&lt;/span&gt; used to generate the data. Let’s now recalculate the AUC in a data set where
there is some imbalance between men and women.&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;biased_betas &amp;lt;- c(0.4, -1, 0.3, -0.2, 0.5)
alt_risk &amp;lt;- logistic(cbind(1, sex, other_covars) %*% biased_betas) %&amp;gt;% 
  as.vector()

data %&amp;gt;% 
  mutate(pred_risk = alt_risk) %&amp;gt;% 
  auc_biased(0.5)&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code&gt;## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   &amp;lt;chr&amp;gt;   &amp;lt;chr&amp;gt;          &amp;lt;dbl&amp;gt;
## 1 roc_auc binary         0.722&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, the estimated model performance dropped. However, how much of the drop
was due to our biased estimate &lt;span class=&#34;math inline&#34;&gt;\(\beta_{biased} \neq \beta\)&lt;/span&gt; and how much of it
was due to the fact that our evaluation data set contained fewer men? This, in
general, is not straightforward to answer.&lt;/p&gt;
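&lt;p&gt;In a simulation like ours, where the true model is known, one way to get a
handle on this is to evaluate the &lt;em&gt;true&lt;/em&gt; predictions on the very same
imbalanced subset. A sketch of this decomposition (only possible here because
we know the true risks, which we rarely do in practice):&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# AUC of the true model on the same imbalanced subset
data %&amp;gt;% auc_biased(0.5)

# The gap between this value and the 0.722 obtained with the biased betas is
# attributable to the coefficients; the remaining drop relative to the full
# data reflects the changed population makeup&lt;/code&gt;&lt;/pre&gt;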
&lt;/div&gt;
&lt;div id=&#34;takeaway&#34; class=&#34;section level1&#34;&gt;
&lt;h1&gt;Takeaway&lt;/h1&gt;
&lt;p&gt;If there is one takeaway from this post it is that external validations of
predictive models mustn’t solely report on differences in AUC but also need to
comment on the comparability of development and test sets used. Such a
discussion is warranted irrespective of whether the performance remained the
same, dropped, or even increased in the test set. Only by discussing, and
ideally quantifying, differences between the data sets can the reader
fully assess the evidence for retained model performance and judge its likely
value in the future.&lt;/p&gt;
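&lt;p&gt;As a minimal sketch of what such quantification could look like, assuming a
hypothetical development set &lt;code&gt;dev_data&lt;/code&gt; with the same columns as our
test data, one could at least report how a key predictor is distributed in the
two cohorts:&lt;/p&gt;
&lt;pre class=&#34;r&#34;&gt;&lt;code&gt;# Compare the sex distribution across cohorts (dev_data is hypothetical)
compare_sex &amp;lt;- function(dev_data, test_data) {
  tibble(
    cohort = c(&amp;quot;development&amp;quot;, &amp;quot;test&amp;quot;),
    p_male = c(mean(dev_data$sex == &amp;quot;male&amp;quot;), mean(test_data$sex == &amp;quot;male&amp;quot;))
  )
}&lt;/code&gt;&lt;/pre&gt;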
&lt;/div&gt;
</description>
    </item>
    
  </channel>
</rss>
