Suppose one of your classmates gave you a sequence of 200 H’s and T’s and you wanted to do “fraud detection” to see if it was generated “by hand” or with a random number generator.
sequence_of_flips = "THTHHHTTTTHTTTHTTHTHTTTTHHHTHHHTHTHHTHHHTHTTHHHTTTHTTTHHHTHTTTHTHHHTHTHTHHTTTHTHTHTTTHTHHHTTHTHHHHTHTTHTHTTHTHHHTHTHHTHTHTHHTHHHHTTHHTTHTTTHHHTHHHTHHTHHTHTTHTHHTHHHTHTHHHHTTHHHHTTTHTTHHHHTTHHTTTTHHHHH"
We could do this by testing the hypothesis
\[H_0: \mbox{Each coin flip is independent Bernoulli($p=1/2$)}\]
The null hypothesis is a possible statistical model for the data we observed; it is often called the “null model”. Now, we want to examine whether there is a “pattern in our data” that would be unlikely “under the null”. Why? Be sure to read the excerpt from Jordan Ellenberg’s book!
There are lots of “patterns” or test statistics we could examine. For example, “the proportion of H’s” is the one you maybe learned about in a previous statistics course (this statistic has the advantage of being approximately normally distributed, which is where the Z-table and 1.96 come from).
Because we suspect that the sequence was generated by hand, the proportion of H’s might not be a very good test statistic; it is easily “gamed” by the person writing it. It is much harder to “game” the length of the longest sequence. This is because humans have a hard time imagining what that length should be. As a result, fraudulent sequences have runs which tend to be too short.
So, the pattern in our data that we are going to examine is the length of the longest run. This will be our test statistic. We will call the test statistic \(S\) when it is a random variable and we will call it \(s\) when it is the observed value from our data. Capital letters for things we simulate, lower case letters for things from our data. So, \(S\) is a random variable that we can simulate and \(s\) is our data.
In the .Rmd file for this page, there is a function longestRun(). It counts the length of the longest run. For the sequence above…
s = sequence_of_flips %>% strsplit(split = "") %>% unlist() %>% longestRun()
s
## [1] 5
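The longestRun() code itself is not printed here; one possible way to write it (a minimal sketch of my own, not necessarily the version in the .Rmd file) uses rle(), R’s run-length encoding:

longestRun = function(flips){
  # assumes `flips` is a character vector of single flips, e.g. c("T", "H", "T", ...)
  runs = rle(flips)    # lengths of consecutive identical values
  max(runs$lengths)    # the longest of those runs
}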
How surprising would this value be if our data was generated from the null model?
In homework 2, we defined a random variable \(X\) that is the length of the longest run in a sequence of 200 independent, fair coin flips (i.e. exactly the null above). Now, we are going to call this random variable \(S\) because we imagine it as a test statistic. Below is a histogram of 10,000 simulations of \(S\). It definitely is not normally distributed! How can we see this? Don’t use a Z-table or 1.96 for this statistic; even if it did have central limit behavior (which it doesn’t!), computing the expectation and standard error in order to get a Z-statistic would be really really hard.
There is a red line at 5, the length of the longest run in the sequence your classmate gave you. So, most simulated values of \(S\) are larger than 5. This is consistent with what was stated above… “humans have a hard time imagining what that length should be. As a result, fraudulent sequences have runs which tend to be too short.” So far, it looks like that might be the case for this sequence.
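If you want to recreate that histogram yourself, here is one way to do it (a sketch; simulate_longest_run and S_sim are my names, and it assumes the longestRun() sketch above):

library(ggplot2)
# simulate the longest run in 200 independent fair coin flips, 10,000 times
simulate_longest_run = function(){
  flips = sample(c("H", "T"), size = 200, replace = TRUE)
  longestRun(flips)
}
S_sim = replicate(10000, simulate_longest_run())
# histogram of the simulated values of S, with a red line at the observed s = 5
ggplot(data.frame(S = S_sim), aes(x = S)) +
  geom_bar() +
  geom_vline(xintercept = 5, color = "red")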
The p-value tells us exactly how surprising \(s = 5\) is under the null hypothesis. It is the probability of observing \(S\) equal to, or more extreme than, our observed value.
\[\mbox{p-value} = P(S\le 5).\]
Why “or more extreme”?
[deep breath]
Before learning about Monte Carlo, if someone asked you to compute some weird probability like \(P(S\le 5)\) where \(S\) is the longest run in a sequence of 200 coin flips, what would you have done? Beth Harmon or Mark Wahlberg?
Of course, now you know Monte Carlo. So, you know how to compute these probabilities! In what fraction of our simulations is \(S\le 5\)?
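For example, reusing the S_sim vector from my sketch above, it is one line:

# the fraction of simulated longest runs that are at most the observed s = 5
p_value = mean(S_sim <= s)
p_value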
This Monte Carlo probability is a p-value (that’s why I called the variable p_value). It is equal to 0.0325, which many would say is “statistically significant” because it is less than .05; this .05 number is culturally important to many researchers, but there is nothing mathematically important or special about the .05 cutoff.
Because we observe a longest run which is improbable under the null, we “reject the null hypothesis”.
What if the longest run was 4? What if the longest run was still 5, but the sequence of flips was 10,000 instead of 200? How would we compute p-values for these scenarios?
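As a hint, the first question can reuse the same simulations (my S_sim from above); the second requires re-simulating with sequences of length 10,000:

# p-value if the observed longest run had been 4 (same null model, same simulations)
mean(S_sim <= 4)
# for 10,000 flips, simulate longest runs in sequences of length 10,000 instead
S_sim_10000 = replicate(10000, longestRun(sample(c("H", "T"), size = 10000, replace = TRUE)))
mean(S_sim_10000 <= 5)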
What other patterns could we examine besides the longest run and proportion of H’s?
The above example is a hypothesis testing problem. Here is the outline for how we do these problems:

1. State the null hypothesis: a statistical model for the data we observed (the “null model”).
2. Choose a test statistic. Call its observed value \(s\), and write code to simulate its imagined counterpart \(S\) under the null model.
3. Decide which values of \(S\) would be surprising under the null (the set \(A\), e.g. \(S \le s\) or \(|S| \ge |s|\)).
4. Monte Carlo: simulate \(S\) lots of times and compute the p-value, the proportion of simulations that land in \(A\). If the p-value is small, reject the null hypothesis.
In this process, we combine our understanding of the world, our data, and our model. With all three, we make statistical inference.
The above logic performs a Monte Carlo simulation under a hypothesized reality. Suppose that you observe a statistic \(s\). Using a simulation of hypothesized reality, we can simulate a random variable \(S\) that we imagine as mimicking \(s\) under the null. Using Monte Carlo, we can compute (approximate) \(P(S \ge s)\) or \(P(S\le s)\) (i.e. one-sided tests) or any other comparison that seems reasonable given the circumstances (e.g. two-sided tests, etc). If the null model is unlikely to generate a value equal to \(s\), or more extreme, then we reject the null hypothesis.
If you have learned a little bit about hypothesis testing before this class, then the above logic might look very strange. However, if scientists had “electronic computers” when they first started doing statistics, then this is exactly what they would have done and it is exactly what you would have been taught. Back when we humans first started doing statistics, though, it was very hard to simulate lots of \(S\) values. So, we came up with mathematical formulas to approximate the above logic. Those formulas are the things you might have previously learned. Once you understand the logic above, I hope that you can return to those formulas and find a deeper understanding of them.
A friend wants to gamble with you and proposes a coin flip game. Before playing the game with her, you want to check whether the coin is fair, so you flip it 1,000 times and observe 54% heads.
Let’s follow the logic above…
library(purrr)
simulate_S = function(){
  lots_of_random_variables = rbernoulli(n = 1000, p = 1/2)
  X = mean(lots_of_random_variables)
  ## or, you could write:
  # X = rbinom(n = 1, size = 1000, prob = 1/2) / 1000
  S = X - 1/2
  return(S)
}
Then, we would find large values of \(S\) to be surprising; large and positive or large and negative. This is a “two-sided test”.
check_if_S_in_surprising_set = function(S){
  return(abs(S) >= .04)
}
library(dplyr)
r = 10000
monte_carlo = data.frame(replicate = 1:r,
                         S = rep(NA, r),
                         S_in_suprising_set = rep(NA, r))
for(i in 1:r){
  monte_carlo$S[i] = simulate_S()
  monte_carlo$S_in_suprising_set[i] = check_if_S_in_surprising_set(monte_carlo$S[i])
}
monte_carlo = as_tibble(monte_carlo)
monte_carlo %>% summarise(mean(S_in_suprising_set))
## # A tibble: 1 x 1
## `mean(S_in_suprising_set)`
## <dbl>
## 1 0.0115
So, if the coin were fair, this would happen only about 1% of the time. I would say that what we’ve observed (54% heads) is pretty surprising!
What if you had instead observed 52% heads in the 1,000 flips? What has changed in the logic above?
Here is the simulation…
check_if_S_in_surprising_set = function(S){
  return(abs(S) >= .02) # why .02????
}
library(dplyr)
r = 10000
monte_carlo = data.frame(replicate = 1:r,
                         S = rep(NA, r),
                         S_in_suprising_set = rep(NA, r))
for(i in 1:r){
  monte_carlo$S[i] = simulate_S()
  monte_carlo$S_in_suprising_set[i] = check_if_S_in_surprising_set(monte_carlo$S[i])
}
monte_carlo = as_tibble(monte_carlo)
monte_carlo %>% summarise(mean(S_in_suprising_set))
## # A tibble: 1 x 1
## `mean(S_in_suprising_set)`
## <dbl>
## 1 0.214
So, if the coin were fair, then roughly 20% of experiments would create something at least as surprising as 52% heads. I don’t think this is that surprising. Does this mean that the coin is fair? No. It means that we don’t have enough evidence to suggest it is unfair. That is, we don’t “accept the null hypothesis (fair coin)”. We never “accept” the null hypothesis.
What has changed in the logic above?
Compared to the previous question, we have the same proportion (52%), but 10 times as many samples.
Here is the simulation…
simulate_S = function(){
  lots_of_random_variables = rbernoulli(n = 10000, p = 1/2)
  X = mean(lots_of_random_variables)
  ## or, you could write:
  # X = rbinom(n = 1, size = 10000, prob = 1/2) / 10000
  S = X - 1/2
  return(S)
}
r = 10000
monte_carlo = data.frame(replicate = 1:r,
                         S = rep(NA, r),
                         S_in_suprising_set = rep(NA, r))
for(i in 1:r){
  monte_carlo$S[i] = simulate_S()
  monte_carlo$S_in_suprising_set[i] = check_if_S_in_surprising_set(monte_carlo$S[i])
}
monte_carlo = as_tibble(monte_carlo)
monte_carlo %>% summarise(mean(S_in_suprising_set))
## # A tibble: 1 x 1
## `mean(S_in_suprising_set)`
## <dbl>
## 1 0.0002
Ten times more samples makes 52% heads roughly 1,000 times less likely! Whoa.
Discuss… What is a more productive variation on this question?
In 2018, 39,404 people in the US died in car accidents. In 2019, it was 38,800. So, it went down by roughly 1.5%. Is this a difference that could have happened by chance? Or, is this difference “statistically significant”?
Follow the logic of statistical testing above.
s = (38800-39404)/(39404)
s
## [1] -0.01532839
library(dplyr)
# First, write a function to simulate S
simulate_S = function(){
  # 39102 = (39404 + 38800)/2, the average of the two observed yearly counts
  D2018 = rpois(1, 39102)
  D2019 = rpois(1, 39102)
  S = (D2019 - D2018)/(D2018)
  return(S)
}
# Second, write a function to evaluate whether S \in A.
check_if_S_in_A = function(S){
  return(abs(S) > abs(s))
}
# Now, we are going to do it lots of times.
# Let's arrange the simulations in a data.frame with three columns
r = 100000
monte_carlo = data.frame(replicate = 1:r,
                         S = rep(NA, r),
                         S_in_A = rep(NA, r))
for(i in 1:r){
  # monte_carlo$S[i] = simulate_S()
  # monte_carlo$S_in_A[i] = check_if_S_in_A(monte_carlo$S[i])
  # I'm going to use the alternative way...
  monte_carlo[i, 2] = simulate_S()
  monte_carlo[i, 3] = check_if_S_in_A(monte_carlo[i, 2])
}
monte_carlo = as_tibble(monte_carlo)
monte_carlo %>% summarise(mean(S_in_A))
## # A tibble: 1 x 1
## `mean(S_in_A)`
## <dbl>
## 1 0.0324
library(ggplot2)
monte_carlo %>% ggplot(aes(x = S)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the logic above, it looks very clear… steps 1, 2, 3, 4. This is an ideal. Often, it is more iterative and back-and-forth. For example, it might look like this…
January 1 is an odd date, because 1 is odd. What percentage of students are born on odd dates? Given that half of the counting numbers are odd, let’s test the null hypothesis \[H_0: \mbox{half of students have odd birthdays}\] by collecting some data. For example, the data that I imagine observing could look like this:
# enter the data:
odd = 20
even = 15
proportionOdd = odd/(odd+even)
proportionOdd
## [1] 0.5714286
The proportion of students who have odd birthdays will be our test statistic \(S\).
In class, we will collect some real data. It might look like we are ready to do that now. However, the fundamental thing from steps 1 and 2 is that we are able to simulate \(S\) under the null hypothesis.
How would we do that? The null hypothesis does not really create a statistical model for data we observe. We will need to clarify it…
Perhaps we think the code below can clarify our statistical model (step 2 before step 1?). Well, it certainly expresses the null/imagined model exactly.
library(purrr)
simulate_S = function(){
  which_birthdays_are_odd = rbernoulli(n = 35, p = .5) # why 35? why .5?
  oddBirthdays = sum(which_birthdays_are_odd)
  proportionOdd = oddBirthdays/35
  S = proportionOdd
  # # rbernoulli comes from purrr.
  # # if you prefer very concise code
  # # you could write:
  # S = rbinom(n = 1, size = 35, prob = .5)/35
  # note that this concise code doesn't depend on purrr
  return(S)
}
This simulation model actually makes another modeling assumption that we haven’t discussed! It is one of the most important assumptions in all of statistics and it is always hidden. It’s an assumption that we always need to talk about in the context of statistical testing and p-values. In the long tradition of hiding the discussion of this assumption, you will need to find it in this chapter.
In our model (i.e. in our imagination), we can generate lots of simulated/imagined data sets that have the same sample size as our observed data. Then, we are going to Monte Carlo (yes, it is now a verb) with \(S\) = proportionOdd and \(A =\) ? to compute \(P(S ?)\).
In the end, we will have data. We want to know something about the world that generated that data. This is inference (“going up”). To do statistical inference we need a statistical model and we are going to do what Jordan calls reductio ad unlikely (be sure to read the excerpt from Jordan’s book). In this example, the steps of the logic are all jumbled up. Sometimes, when you are still finding your way, this is what it looks like.
Statistical Independence: In order to generate more than one birthday, the code above “simply generates 35 birthdays”. The assumption is inside of that.
This implicitly models the birthdays as “independent”. This means that if you know some set of the birthdays, you are not able to predict anything about the other birthdays. For this example, in the real world, I would call this a “reasonable assumption that is definitely wrong”. Why is the assumption wrong? Why is it still reasonable?
If you think the percentage of odd birthdays is greater than 50%, roughly what would that number be? How many samples would you need to see this difference? Note that “seeing a difference” is random! Even if there is a difference, sometimes you won’t see it due to random chance. The probability that we detect the difference is called the power of the test. These ideas are important when you are designing an experiment; you want to make sure that your experiment succeeds! How many samples would we need, in order to make the power in the Odd Birthdays experiment at least 80%? Do some simulations!
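Here is one way to start those simulations. Everything specific in this sketch is my own assumption for illustration: I pretend the true proportion of odd birthdays is 55%, use a two-sided Monte Carlo p-value, and “detect” the difference when that p-value is below .05; estimate_power is my name for the function.

# Estimate the power of the Odd Birthdays test for a sample of size n,
# assuming (hypothetically) that the true proportion of odd birthdays is true_p.
estimate_power = function(n, true_p = 0.55, n_null_sims = 1000, n_experiments = 500){
  # simulate the null distribution of the proportion, for this sample size
  null_props = rbinom(n_null_sims, size = n, prob = 0.5) / n
  # repeat the whole experiment many times under the alternative and
  # record how often the two-sided Monte Carlo p-value falls below .05
  rejections = replicate(n_experiments, {
    observed_prop = rbinom(1, size = n, prob = true_p) / n
    p_value = mean(abs(null_props - 0.5) >= abs(observed_prop - 0.5))
    p_value < .05
  })
  mean(rejections)  # the estimated power
}

# try a few sample sizes and look for the smallest one with power of at least 80%
sapply(c(100, 400, 800, 1600), estimate_power)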