These notes weave together the basics of doing science with data, focusing on the role of statistical models and algorithms.
In the fabric of these notes, we will make sense of statistical inference as a know-how that synthesizes your understanding of the world, your data, and your model.
There is a trichotomy among the world, measurements, and models.
We want to see the world. So, we first summarize the world by taking measurements of it. Then, we summarize our measurements with a statistic. Finally, a model lets us see the world through our measurements and statistics. While these steps happen sequentially in time, statistical inference in the final step requires all three to align.
Science lives in the world. Statistics live with measurements. Models live in your head.
Why is data science a misnomer?
There are five types of statistical practice that we will touch on. We will call them
Wouldn’t it all be so easy if correlation implied causation?
Well, good news! Sometimes, it does! Sometimes, correlation implies causation.
This requires that the “treatment” is somehow randomly allocated. The best example is a randomized, placebo-controlled, double-blind experiment.
Example:
On November 16, 2020, Moderna released the first results from their phase 3 clinical trial for their COVID-19 vaccine.
In summary, there were over 30,000 people in the trial. I assume half received the vaccine and the other half received the placebo (can someone confirm this?). In total, there were 95 cases of COVID-19 among the participants: 90 in the placebo group and 5 in the “treated” group. Of the 95 cases, 11 were severe, and all of the severe cases were in the placebo group.
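Before answering, it helps to see where the headline “efficacy” number comes from. Here is a back-of-the-envelope calculation, sketched in Python for illustration (the course itself uses R); it assumes, as stated above, that the two arms were roughly equal in size, in which case the arm sizes cancel out of the ratio.

```python
# Back-of-the-envelope vaccine efficacy from the reported case counts.
# ASSUMPTION (stated above, not confirmed): the vaccine and placebo
# arms were roughly equal in size, so the rates reduce to raw counts.
placebo_cases = 90
vaccine_cases = 5

# Efficacy = 1 - (attack rate in vaccine arm) / (attack rate in placebo arm).
efficacy = 1 - vaccine_cases / placebo_cases
print(round(100 * efficacy, 1))  # about 94.4
```

This matches the roughly 94% efficacy figure that was widely reported for this trial.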
In this study, vaccination is correlated with reduction of risk. Does this mean that the vaccination causally reduced COVID-19 infection? Why or why not?
Presidential election polls often report a +/- 3% “confidence interval”. What is that all about?
Example:
At the end of October 2020, many wanted to know the outcome of the forthcoming 2020 presidential election. A common polling technique is “random digit dialing”.
This poll of Wisconsin reached 806 registered voters. After a significant amount of statistical work, the poll reported that 48% of likely voters would choose Biden, 43% Trump, 2% Jorgensen, and 7% undecided. The margin of error for this poll was reported to be +/- 4.3%.
In order to reach those 806 participants, many more numbers needed to be dialed. In this poll, the response rate was 4.3%. See here for more data and the definition of how the 4.3% number is computed. In that link, you can see that over 100,000 numbers were dialed! Most of those calls were never picked up. Among those dialed, 806 were registered voters who agreed to participate; another 1,113 refused to participate (or hung up). There is so much work that goes into making these polls work!
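To get a feel for where a margin of error comes from, here is the textbook formula for a 95% interval on a proportion, sketched in Python for illustration (the course itself uses R). Note that this simple formula gives a smaller number than the 4.3% the pollster reported; real polls adjust for weighting and design effects, which inflate the margin beyond this naive calculation.

```python
import math

# Naive margin of error for a proportion: 1.96 * sqrt(p(1-p)/n).
# This is a sketch, not the pollster's actual method; their reported
# 4.3% accounts for weighting and design effects beyond this formula.
p = 0.48   # Biden's share among likely voters in this poll
n = 806    # number of respondents
moe = 1.96 * math.sqrt(p * (1 - p) / n)
print(round(100 * moe, 1))  # about 3.4 percentage points
```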
In the actual election, Biden received 49.45% of votes cast in Wisconsin and Trump received 48.82%.
Suppose your data looked like this. What would you do?
In exploration, we want to group things into clusters or decompose things into separate pieces… but we don’t say in advance what those clusters/pieces are. Instead, we find them in the data.
Example exploration: My research studies how the news media directs our attention. We focus on Twitter. Given that the news media is highly segmented (“media bubbles”), we grouped leading journalists and media Twitter accounts into 11 different groups. We did not predetermine what these groups should be! Instead, we used the data to find “clusters”. In particular, the data we used is who-follows-who on Twitter.
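To make the idea of “finding clusters in the data” concrete, here is a minimal k-means sketch in Python (for illustration; the course itself uses R, and the Twitter research above uses richer graph data and different methods). The data here are made up: two obvious blobs of points, with no predetermined group definitions.

```python
import random

# A minimal k-means sketch: the groups are discovered from the data,
# not specified in advance. The data are made up for illustration.
random.seed(1)
points = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)] + \
         [(random.gauss(8, 1), random.gauss(8, 1)) for _ in range(50)]

def kmeans(points, k=2, steps=10):
    # arbitrary deterministic initialization: first and last points
    centers = [points[0], points[-1]]
    for _ in range(steps):
        # assign each point to its nearest center
        groups = [[] for _ in range(k)]
        for x, y in points:
            i = min(range(k), key=lambda c: (x - centers[c][0]) ** 2
                                            + (y - centers[c][1]) ** 2)
            groups[i].append((x, y))
        # move each center to the mean of its assigned points
        centers = [
            (sum(x for x, _ in g) / len(g), sum(y for _, y in g) / len(g))
            for g in groups if g
        ]
    return centers

centers = kmeans(points)
print(sorted(round(x, 1) for x, _ in centers))  # centers near 0 and 8
```

The algorithm recovers the two blobs without being told what they are; that is the spirit of exploration.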
To be discussed later.
Random variables are a way of expressing random numbers with mathematical notation.
Learning objectives: After Chapter 1, you should be able to
Suppose you have a complicated random variable or random process \(X\) and you want to know how often \(X\) has a certain property \(A\). For a simple example, \(X\) could be a number and \(A\) could be the property that \(X>5\). We write \(X \in A\) if \(X\) has the property \(A\). We want to compute \(P(X \in A)\). As we will see in the next chapters, we need these types of probabilities to test hypotheses and create confidence intervals. For example, if you have ever used a Z-table before, it gives you such probabilities for one type of random variable \(X\) and one type of set \(A\).
Monte Carlo simulation is a really fancy phrase to describe a super simple idea: To “compute” \(P(X \in A)\), simulate the random variable \(X\) with your computer. Then, check if \(X\) has that property. Do this simulation lots of times. The proportion of those simulations for which \(X\) has property \(A\) is an estimate of \(P(X \in A)\). Monte Carlo is a key idea for this course. We will use it again and again and again.
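Here is the idea in its entirety, sketched in Python for illustration (the course itself uses R). We estimate \(P(X > 5)\) for a Normal random variable with mean 3 and standard deviation 2; the exact answer is \(P(Z > 1) \approx 0.159\), so the estimate should land close to that.

```python
import random

# Monte Carlo estimate of P(X > 5) where X is Normal(mean=3, sd=2):
# simulate X many times and count how often the property holds.
random.seed(1)
n_sims = 100_000
hits = sum(1 for _ in range(n_sims) if random.gauss(3, 2) > 5)
print(hits / n_sims)  # roughly 0.159
```

The same three lines of logic (simulate, check the property, take the proportion) work for random variables far too complicated for any table.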
Learning objectives: After the chapter on Monte Carlo, you should know how to compute
by simulating the random variable \(X\) multiple times. These techniques are referred to as Monte Carlo simulation.
Randomness creates all sorts of artifacts. Sometimes, those artifacts look like “signal,” leading us to make inferences that are false. Hypothesis testing asks, “might we have observed this thing simply due to chance?” Asked another way, “could this pattern be an artifact of noise?”
If you have learned about hypothesis testing before, buckle up! We are going to use a more general formulation in this course, using Monte Carlo simulation. If you have not heard of hypothesis testing, that’s ok too!
Either way, read the excerpt from Jordan Ellenberg’s New York Times best-selling book How Not to Be Wrong: The Power of Mathematical Thinking, which should be posted on Canvas. It offers a fresh view of hypothesis testing. While it is not the main point of the reading, one of my favorite parts is how it relates the proof that \(\sqrt{2}\) is irrational (the proof is given in the reading and it is surprisingly simple!) to the logic of statistical hypothesis testing. We will have a reading quiz on the main points from the reading.
Learning objectives: After the logic of statistical testing via Monte Carlo Simulation, you should know how to test a hypothesis with Monte Carlo. This involves three steps.
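One plausible version of those three steps, sketched in Python for illustration (the course itself uses R, and its homework uses different data): compute the observed statistic, simulate the statistic many times under the null hypothesis, then report the proportion of simulations at least as extreme as what was observed (the p-value). The coin-flip example below is made up.

```python
import random

# Three steps of Monte Carlo testing, on a made-up example:
#   1. compute the observed statistic (here, 60 heads in 100 flips),
#   2. simulate the statistic under the null (a fair coin),
#   3. the p-value is the proportion of simulations at least as extreme.
random.seed(1)
observed = 60
n_flips, n_sims = 100, 10_000

def simulate_null():
    # one simulated dataset under the null hypothesis: a fair coin
    return sum(random.random() < 0.5 for _ in range(n_flips))

null_stats = [simulate_null() for _ in range(n_sims)]
p_value = sum(s >= observed for s in null_stats) / n_sims
print(p_value)  # small: 60 heads in 100 flips is unusual for a fair coin
```

A small p-value says the observed “signal” would rarely arise from noise alone.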
Homework for testing. Note that you can change the web address to get the .Rmd code.
In this section, we talk about creating confidence intervals around statistical point estimates. These confidence intervals are constructed so that they contain the true parameter in (e.g.) 95% of experiments. If you have learned about confidence intervals before, buckle up! We are going to have a more general approach that uses Monte Carlo simulation.
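One common Monte Carlo route to an interval is the percentile bootstrap; whether or not it is the exact method used later in the course, it captures the spirit. Here is a minimal sketch in Python for illustration (the course itself uses R), with made-up data: resample the data with replacement many times, recompute the statistic each time, and take the middle 95% of those values.

```python
import random

# Percentile-bootstrap sketch for a 95% interval on a mean.
# The data are made up: 200 draws from Normal(mean=10, sd=2).
random.seed(1)
data = [random.gauss(10, 2) for _ in range(200)]

boot_means = []
for _ in range(2_000):
    # resample the data with replacement and recompute the statistic
    resample = [random.choice(data) for _ in data]
    boot_means.append(sum(resample) / len(resample))

boot_means.sort()
lo, hi = boot_means[49], boot_means[1949]  # 2.5th and 97.5th percentiles
print(round(lo, 2), round(hi, 2))  # an interval near the true mean of 10
```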
The logic of statistical estimation via Monte Carlo simulation
The CLT says that an average of lots of random variables “looks like” a Normal random variable. Here are the notes.
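You can check the CLT’s claim with a simulation. This sketch, in Python for illustration (the course itself uses R), averages \(n = 100\) Exponential(1) random variables, which are quite skewed individually (mean 1, standard deviation 1); the CLT says the average should look approximately Normal with mean 1 and standard deviation \(1/\sqrt{100} = 0.1\).

```python
import random

# CLT sketch: averages of many skewed random variables look Normal.
# Each X is Exponential(1); the average of n = 100 of them should be
# approximately Normal(mean=1, sd=0.1).
random.seed(1)
averages = [
    sum(random.expovariate(1) for _ in range(100)) / 100
    for _ in range(5_000)
]
mean = sum(averages) / len(averages)
sd = (sum((a - mean) ** 2 for a in averages) / len(averages)) ** 0.5
print(round(mean, 2), round(sd, 2))  # close to 1 and 0.1
```

Plotting a histogram of `averages` would show the familiar bell shape, even though each individual Exponential draw is far from bell-shaped.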
In a prediction problem, you are given data pairs \((X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\) and you want to use \(X_i\) to predict \(Y_i\). We typically imagine \(X_i\) as containing several values (i.e. it is a “vector”).
There are two types of prediction problems: continuous \(Y_i\) and discrete \(Y_i\). For example, you might want to predict tomorrow’s price for asset \(i\) using data that is available today. So, you develop a historical training set, where the information on asset \(i\) from one day is contained in \(X_i\) and that asset’s price for the next day is contained in \(Y_i\). Here, the stock price is continuous.
Alternatively, you might only be interested in knowing if you should buy the stock, sell the stock, or hold the stock. So, you develop an historical training set where \(X_i\) contains the information that is available on one day. Then, you develop “labels” \(Y_i\) using data from the next day that say whether you should have bought, sold, or held onto the asset. Here, the label (buy, sell, hold) is discrete.
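Constructing those labels from next-day prices can be sketched in a few lines. This is in Python for illustration (the course itself uses R), and the +/- 1% thresholds are an arbitrary choice made up for this sketch, not part of the course notes.

```python
# A hedged sketch of turning next-day returns into discrete labels.
# The +/- 1% thresholds are arbitrary, for illustration only.
def label(today_price, tomorrow_price, threshold=0.01):
    ret = (tomorrow_price - today_price) / today_price
    if ret > threshold:
        return "buy"    # price rose enough that you should have bought
    if ret < -threshold:
        return "sell"   # price fell enough that you should have sold
    return "hold"       # price barely moved

prices = [100.0, 103.0, 102.5, 101.0]
labels = [label(p, q) for p, q in zip(prices, prices[1:])]
print(labels)  # ['buy', 'hold', 'sell']
```

Each \(X_i\) would hold the information available on day \(i\), and the corresponding `labels[i]` plays the role of the discrete \(Y_i\).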
We will often call continuous outcomes “regression” and discrete outcomes “classification”.
Alternatively, perhaps we want to make predictions about the 2020 election. You could try to predict who is going to win (classification) or the number of delegates/votes that the Republicans receive (regression).
In the cases above, there are two natural versions of the same problem (one is regression and one is classification). However, many classification problems do not have an analogous regression problem. For example, in the handwritten digit example in Chapter 1 of ISLR, \(X_i\) is an image of a handwritten digit and \(Y_i\) is a label that says whether the digit is 0, 1, 2, …, or 9.
We are going to imagine two broad approaches to regression and classification.
Chapter 2 in ISLR provides a broad overview of prediction. In the previous weeks of this course, Monte Carlo provided the basic computational tool; we were always working to get the problem stated as something that we could solve with Monte Carlo. Now, the basic computational tool is numerical optimization. We will not write code to do optimization. Instead, we will see optimization problems multiple times; optimization is often used to define our various techniques.
Note that you can download the textbook for ISLR and get all of the R labs at the book’s website.
Homework 4: ISLR (p52) 2.4.1, 2.4.2, 2.4.10.