The logic of statistical estimation via Monte Carlo simulation

Example with seed germination

You have generated a new kind of parsley plant. You think it makes the most delicious parsley. Now, you want to sell your seeds, but before you do, you need to figure out the probability that the seed will germinate. You know from your development of the new kind of plant that lots of seeds fail to germinate. So, you want to do an experiment. You will plant \(n\) parsley seeds. You will water them, keep them warm, and give them some light. Then, you will examine which of the seeds germinate within 4 weeks of planting. Denote your data as \(X_1, \dots, X_n \in \{0,1\}\), where \(X_i = 1\) denotes that seed \(i\) germinates.

What is the probability that a seed will germinate?

We are going to imagine that the \(X_i\)’s are random variables that are independent and identically distributed Bernoulli(\(p\)). In this statistical model, we want to estimate \(p\).

You know that if you plant enough seeds, then the average of the \(X_i\)’s will be really close to the true value \(p\). However, the experiment takes up space and time. So, you cannot afford to plant a gazillion seeds. Key questions:

How many seeds should you plant if you are willing to be off by 2%? This is a question about experimental design.
Suppose you planted 2500 seeds and 2157 of the seeds germinated. We estimate \(p\) as \(2157/2500 = .8628\). How close is this estimate to the true value of \(p\)?

Example with correlation

We observe pairs of mother and daughter heights. For example, the mother 167cm and the daughter is 169cm. We observe this data for 100 mothers and daughters. We want to estimate the correlation between these two quantities. With 100 pairs (mother height, daughter height), we can compute the sample correlation between those two heights. However, if we were to collect data on another 100 pairs, or if we were to collect more data, this sample quantity would change. We want to compute an interval around our “point estimate” such that the interval is likely to contain the “long run average” of getting lots and lots of pairs of heights.

Expectation

Often, we want to estimate the “long run average” of our Statistic. That is, what happens if we have a lot of data. Notice that this is the case in both the seed example and the heights example. The “Expectation of a random variable” is the essential mathematical notion for what we want to estimate.

Recall that we this is one of the three quantities that we discussed in Monte Carlo.

What is statistical estimation?

Here is the more general setup for statistical estimation. Suppose that you have collected your data \(X_1, \dots, X_n\). If you were to repeat your data collection, you would likely obtain a different value of the \(X_i\)’s. So, we imagine these values \(X_i\) to be the realization of a random variable. For example, the \(X_i\) could represent whether seed \(i\) germinated in the parsely experiment. Then, we imagine that \(X_1, \dots, X_n\) are realizations of Bernoulli(\(p\)) independent random variables. Alternatively, \(X_i = (M_i, D_i)\) could be a pair of numbers (mother height, daughter height) and we imagine these pairs being normally distributed with correlation \(\rho\), with each \(X_i\) indepenendent from all the others.

With your data, you summarize it with a statistic \(S(X_1, \dots, X_n)\). For example, in the seed experiment, we summarized the \(X_i\)’s with their average \[S(X_1, \dots, X_n) = \frac{1}{n}(X_1 + \dots + X_n).\] With the heights data, the statistic is the sample correlation between the two heights.

\[S(X_1, \dots, X_n) = \frac{\sum\limits_{i=1}^n (M_i-\bar{M})(D_i-\bar{D})} {\sqrt{\sum\limits_{i=1}^n (M_i-\bar{M})^2 \sum\limits_{i=1}^n (D_i-\bar{D})^2}}\]

This statistic \(S\) can be any function of your data. In the homeworks, you are going to explore some statistics that are very strange!

In any experiment for which we “do statistics,” we imagine that if we were to repeat the experiment, then we would get different data. Thus, we imagine that the data is random and thus we also imagine that the statistic \(S\) is random. So, just like \(X_i\) has a “distribution over possible values”, so does \(S\).

Seed germination continued; experimental design.

Let’s do a Monte Carlo simulation and fiddle with \(n\) and \(p\) to see how many seeds you should plant so that your estimate is within 2%.

First, write a function to simulate S, given values of \(n\) and \(p\).

library(tidyverse)
library(purrr)
n = 100
p = .5

simulate_S = function(){
  X = rbernoulli(n = n, p)
  S = mean(X)
  return(S)
}

Then, write a function to evaluate whether \(S\) is within 2% of p. We want to compute \[P(S \in (p - .02, p + .02)) = P(|S - p| < .02).\] So…

check_if_S_in_A = function(S){
  return(abs(S-p)  < .02 )
}

Now, we use the same Monte Carlo code from before…

r = 1000
monte_carlo = data.frame(replicate = 1:r, 
                         S = rep(NA,r), 
                         S_in_A = rep(NA, r)) 
for(i in 1:r){
  monte_carlo$S[i] = simulate_S()
  monte_carlo$S_in_A[i] = check_if_S_in_A(monte_carlo$S[i])
 
}

monte_carlo = as_tibble(monte_carlo)
monte_carlo %>% summarise(`P(|S - p| < .02)` = mean(S_in_A))

## # A tibble: 1 x 1
##   `P(|S - p| < .02)`
##                <dbl>
## 1              0.248

Here is what the distribution of outcomes looks like.

monte_carlo %>% ggplot(aes(x = S)) + geom_histogram(bins = 30)

Exercise: What happens when you change \(n\) and keep \(p\) fixed? What happens when you fix \(n\) and change \(p\)? Can you find a value of \(n\) such that for all possible values of \(p\), \(P(|S - p| < .02) \ge .95\)?

Estimation is impossible (if you expect too much)

Your estimate is never going to be right all the time. Even if you allow for an interval of uncertainty (e.g. +/- 2% as above), that interval is never going to contain the truth on every experiement. The best we can do is to have a formula/function that creates an interval and on 95% of (random) experiments, that formula will create an interval that includes the true value.

What are we trying to estimate?

Everything above might still feel like hypothesis testing, a little bit. But, this is different. Now, we are interested in (1) estimating \(p\) and (2) knowing how good our estimate is.

In general, after you compute the statistic \(S\), you could just report it and be done with it. However, we often want to know “how close are we?” If you knew \(p\), then you could say how close you are… but you don’t! That’s the problem! So, how can we say how close we are?

Above, \(p\) was defined as a parameter in a model. However, often times, “parameters” can be imagined as something different. Here are two other ways:

Imagine getting an infinite amount of data. What would be the value of \(S\) with an infinite amount of data? Or,
Imagine repeating the whole experiment lots of different times and on experiment \(i\) you created statistics \(S_i\). What is the average of those statistics \(S_1, S_2, \dots\)?

For most functions of the data \(S\), these two values are the same thing. The first one might be a bit easier to think about. However, if they are different (and sometimes they are), it is the second one that we are actually going to use. We call that second value the expected value of the statistic \(\mathbb{E}(S)\), or to be fancy, \(\mathbb{E}(S(X_1, \dots, X_n))\).

Example: The maximum of the \(X_i\)’s is one statistic for which those two notions are not the same. So is the minimum. Why?

So, here is the problem. \(S\) is a random variable. We only observe one value of it, but we want to estimate \(\mathbb{E}(S)\). As discussed above, estimation is impossible. The best we can do is to make an interval that is usually right.

The logic of statistical estimation

You want to compute an interval of uncertainty (or a “confidence interval”) around the statistic \(S\). You want this interval to contain the expected value of that statistic.

Imagine that we were able to repeat the actual experiment lots of times. From those replicates of the statistic \(S\), we can compute the 2.5 and 97.5 percentiles. What this means, is that 95% of the experiments create an \(S\) within that interval. For example, suppose that interval is \((\mathbb{E}(S) - 3, \mathbb{E}(S) + 3)\). So, the sampled value of \(S\) is within 3 of \(\mathbb{E}(S)\) on 95% of experiments. Said another way, \(\mathbb{E}(S)\) is within 3 of \(S\) on 95% of experiments. That is the 95% confidence interval! Now, all we need to do is be able to find that value of 3. To do that, we will run a simulation where we pretend to know the true model; we do this by estimating the model parameters with our data.

Here are the steps to creating a confidence interval:

Use the collected data to estimate the parameter(s) of the model that we imagine generating our data.
Simulate lots of imagined experiments from that model, using the estimated parameters.
On each imagined experiment, compute the statistic (this might be the parameter estimate computed on the simulated data). You now have a distribution of that statistic over the imagined experiments.
For a 95% confidence interval, find the 2.5% and 97.5% quantiles of that distribution. This leaves 2.5% on both sides of the interval. Those add up to 5% (100%-95% = 5%). These quantiles give you a 95% confidence interval. (There is some fine print, see Version 2 below.)

Notice that in the parsley seed example, the statistic was the parameter of the model. This is often the case, but not always.

Notice how this logic aligns with the notion of imagined experiments above: Imagine repeating the whole experiment lots of different times and on experiment \(i\) you created statistics \(S_i\). What is the average of those statistics \(S_1, S_2, \dots\)? We literally repeat the experiment in the above logic. By doing so, we see the distribution of the \(S_i\) under the estimated model. By studying that distribution (i.e. computing quantiles), we can understand how much uncertainty there is in each \(S_i\) and thus in our original data.

Seed example after data collection.

Suppose you planted 2500 seeds and 2157 of the seeds germinated (2157/2500 = .8628). When we were designing the experiment, we didn’t know the value of \(p\). Now that we have an estimate of \(p\), we often call it \(\hat p\) (“p hat”). We can use that!

We can do another simulation to estimate our confidence interval:

n = 2500
phat = 2157/2500

simulate_S = function(){
  X = rbernoulli(n = n, phat)
  S = mean(X)
  return(S)
}

In the previous chapter, we

Now, we use the same Monte Carlo code from before…

r = 1000
monte_carlo = data.frame(replicate = 1:r, 
                         S = rep(NA,r)) 
for(i in 1:r){
  monte_carlo$S[i] = simulate_S()
}

monte_carlo = as_tibble(monte_carlo)

qs = monte_carlo$S %>% quantile(prob =c(.025, .975))
qs

##   2.5%  97.5% 
## 0.8484 0.8768

hist(monte_carlo$S, main = "Imagined Experiments", breaks = 30)
lines(qs[1]*c(1,1), c(0,10^9), col = "red", lwd = 3)
lines(qs[2]*c(1,1), c(0,10^9), col = "red", lwd = 3)

If you have learned a little bit about confidence intervals before this class, then the above logic should look strange. However, if scientists had “electronic computers” when they first started doing statistics, then this is exactly what they would have done and it is exactly what you would have been taught. However, when us humans started doing statistics, it was very hard to perform simulations. So, we came up with mathematical formulas to approximate the above logic. Those formulas are the things you might have previously learned. Once you understand the logic above, I hope that you can return to those formulas and find a deeper understanding for them. At that point, the formulas are great! They are fast and nimble; much easier than writing a new simulation for every different variation.

What about “parameter estimation” to “fit the model” in step 1?

In some cases, it is easy to imagine how we should estimate the parameter in the model (e.g. take an average). In other cases, this is more complicated. For example, within a couple weeks, we will discuss logistic regression where we want to predict \(Y \in \{0,1\}\) which could represent “whether a person defaults on a loan?” or “do they die?” or anything else that has 2 possible outcomes. To predict it, we are given \(p\) features/predictors/covariates about the person \(x_1, \dots, x_p\). Here, this \(p\) is different from above! This one stands for “predictors” instead of “probability”… this is standard notation and a super big bummer that it conflicts!

The logistic regression model is

\[P(Y = 1| x_1, \dots, x_p) = \frac{\exp(\sum_i x_i \beta_i)}{1+\exp(\sum_i x_i \beta_i)},\]

where \(\beta_1, \dots, \beta_p \in R\) are the parameters of the model. Given a bunch of people’s data, we want to estimate the \(\beta\)’s and give confidence intervals. How do we estimate these parameters? It’s complicated! But there is code in R that makes it easy. Once we can do that, then we can follow the logic above!

What makes this logic logical?

There are three key conditions that we need to ensure that the above logic works. Two of these conditions are easy to check. The other is not as easy.

Suppose that \(S\) is the observed statistic and \(S^*\) is one “simulated version” of \(S\) from the fitted model. Let \(q_.025\) and \(q_.975\) be the 2.5% and 97.5% quantiles from the distribution of \(S^*\).

Key condition 1: The quantiles of \(S^*\) are symmetric around \(\mathbb{E}(S^*)\). That is, there exists a number \(a\) such that \[(q_.025, q_.975) = (\mathbb{E}(S^*)-a, \mathbb{E}(S^*)+a).\]

Notice that all of these quantities are observed in the logic of statistical estimation. So, this condition can be easily examined by checking if \(S - q_.025\) is close to \(q_.975 -S\).

Key condition 2: The “width of the distribution” of \(S^*\) is pretty close to the “width of the distribution” of \(S\). That is, for values of \(a\) like the one in Key condition 1, \[P(S^* \in (\mathbb{E}(S^*)-a, \mathbb{E}(S^*)+a)) \approx P(S \in (\mathbb{E}(S)-a, \mathbb{E}(S)+a)) \] This is likely to be true because we imagine \(S\) as being generated in the world from the same statistical model that we use to generate \(S^*\) on our computer subject to slightly different values of parameters in their models; that is, we estimate the parameters that generate \(S\). The big time magic is that those parameters can be a little bit off, but this key condition is still reasonable.

This condition is harder to check. How could you do it? It is way more complicated than anything we will do. You would do a 'second layer' of resampling. For each simulation that gives you an S*, you would fit a model and resample a bunch of 'S**'. Then, you would compare P(S*...) to P(S** ...).

Key condition 3: \(\mathbb{E}(S^*) = S\)

This condition is easy to check: take an average over \(r\) replicates of \(S*\). Is that average close to \(S\)?

Why it works…

Note that a 95% confidence interval \((L,U)\) for \(\mathbb{E}(S)\) should satisfy \[P(\mathbb{E}(S) \in (L,U)) = .95,\] Why is this event random? (scroll over eyes for answer) The *interval* (L,U) is a random interval because L and U are random variables that we compute from data.

From the definition of \(q_.025\) and \(q_.975\), we know that \[P(S^* \in (q_.025, q_.975)) = .95.\] What is random here?

We are going to set \(L=q_.025\) and \(U = q_.975\) and we want to show that the first probability is still (approximately) .95. That would mean that \((L,U) = (q_.025, q_.975)\) is a 95% confidence interval.

Ok. This should be confusing. In the first probability, \(L\) is random. Then, we are saying that \(q_.025\) is fixed, but that \(L = q_.025\) (and similarly for \(U\) and \(q_.975\)). Yes. That should be confusing. If you are not confused, try harder! :)

The reason both are true is that the first and second probabilities are entirely different notions of probability. The first one is “random stuff in the world that generates your data”. The second one is “random stuff that you do with Monte Carlo”. And in that second one, we typically no longer consider our data to be random. Weird. Let’s return to this confusion in a bit.

Here we go

By key condition 1, we have a number \(a\) such that

\[P(S^* \in (\mathbb{E}(S^*)-a, \mathbb{E}(S^*)+a)) = P(S^* \in (q_.025, q_.975)).\] We know that this is equal to .95 from the definitions of \(q_.025\) and \(q_.975\).

Using that same value of \(a\) and the key condition 2… \[P(S^* \in (\mathbb{E}(S^*)-a, \mathbb{E}(S^*)+a)) \approx P(S \in (\mathbb{E}(S)-a, \mathbb{E}(S)+a)) = P(\mathbb{E}(S) \in (S-a, S+a)) .\]

Finally, using key condition 3, \[(S-a, S+a) = (\mathbb{E}(S^*)-a, \mathbb{E}(S^*)+a) = (q_.025, q_.975).\]

Thus, putting the pieces together, \[.95 = P(S^* \in (\mathbb{E}(S^*)-a, \mathbb{E}(S^*)+a)) \approx P(\mathbb{E}(S) \in (S-a, S+a)) = P(\mathbb{E}(S) \in (q_.025, q_.975)). \]

Version 2 of the logic for statistical estimation (advanced)

We wish to find an interval \((S-a_.95, S+b_.95)\) such that \[P\left(\mathbb{E}(S) \in (S - a_.95,S+b_.95)\right) = .95,\] where \(S\) is the only thing that is random in the above probability. If we can find values \(a_.95\) and \(b_.95\) that satisfy that probability, then \((S-a_.95, S+b_.95)\) is a 95% confidence interval for \(\mathbb{E}(S)\).

Exercise: Show that \[P\left(\mathbb{E}(S) \in (S - a_.95,S+b_.95)\right) = P\left(S \in (\mathbb{E}(S) - b_.95,\mathbb{E}(S)+ a_.95)\right).\] The right hand side is easier to think about because for \(A = (\mathbb{E}(S) - b_.95, \mathbb{E}(S)+a_.95)\), we will need to compute \(P(S \in A)\) using some type of Monte Carlo. Here is the answer to the above exercise.

We want to find \(a_.95\) and \(b_.95\) that satisfy \[P\left(\mathbb{E}(S) \in (S - a_.95,S+b_.95)\right) = .95.\] The problem is that we don’t know the distribution of \(S\) (and we don’t have a null hypothesis like before). So, we are going to find values for \(a_.95\) and \(b_.95\) using Monte Carlo simulation and an approximation that allows us to do the sampling.

First, we must estimate the distribution of \(S\). Once we estimate that distribution, we can generate random variables \(S^*\) from that estimated model. Here, the * superscript denotes that \(S^*\) is from the imagined model with estimated parameters; this has a different distribution that \(S\). However, we can use it in the following way. We assume that the “Monte Carlo” \(S^*\) behaves like the original random variable \(S\) in that for any choice of \(a\) and \(b\), \[P\left(S^* \in (\mathbb{E}(S^*) - b,\mathbb{E}(S^*)+ a)\right) \approx P\left(S \in (\mathbb{E}(S) - b,\mathbb{E}(S)+ a)\right).\]

Under that approximation, it is sufficient to find \(a_.95\) and \(b_.95\) using Monte Carlo with \(S^*\), \[P\left(S^* \in (\mathbb{E}(S^*) - b_.95,\mathbb{E}(S^*)+ a_.95)\right) = .95.\]

Here, the set \(A\) is \((\mathbb{E}(S^*) - b_.95,\mathbb{E}(S^*)+ a_.95)\) and we can simulate \(S^*\). Once you find those values for \(S^*\) (see below for computing \(a_.95\) and \(b_.95\)), then we use the approximation to get a confidence interval: \[.95 = P\left(S^* \in (\mathbb{E}(S^*) - b_{.95},\mathbb{E}(S^*)+ a_{.95})\right) \approx P\left(\mathbb{E}(S) \in (S - a_{.95},S+ b_{.95})\right).\] Thus, \((S-a_{.95}, S+b_{.95})\) is a 95% confidence interval for \(\mathbb{E}(S)\).

How to find \(a_{.95}\) and \(b_{.95}\)

Find the .025 and .975 quantiles of \(S^*\); call these \(q_{.025}\) and \(q_{.975}\). Then, \(b_.95\) solves \(\mathbb{E}(S^*) - b_.95 = q_.025\) and \(a_.95\) solves \(\mathbb{E}(S^*) + a_.95 = q_.975\). You can compute \(\mathbb{E}(S^*)\) by taking the average of your Monte Carlo simulations of \(S^*\). \[b_.95 = \mathbb{E}(S^*) - q_.025 \quad a_.95 = q_.975 - \mathbb{E}(S^*)\]

When does Version 1 = Version 2?

If \(\mathbb{E}(S) = \mathbb{E}(S^*)\) (“unbiased” or centered) and if \(|b_.95| = |a_.95|\) (“symmetric”), then Version 1 and Version 2 will provide the same interval, provided you use enough Monte Carlo replicates.