These notes weave together a basic understanding of how to do science with data, focusing on the role of statistical models and algorithms.

What is data science?

In the fabric of these notes, we will make sense of statistical inference as a know-how that synthesizes your understanding of the world, your data, and your model.

There is a trichotomy between world, measurements, and models.

We want to see the world. So, we first summarize the world by taking measurements of the world. Then, we summarize our measurements with a statistic. Finally, a model lets us see the world, through our measurements and statistics. While these steps happen sequentially in time, statistical inference in the final step requires all three to align, at the same time.

Science lives in the world. Statistics live with measurements. Models live in your head.

Why is data science a misnomer?

Data science.

There are five types of statistical practice that we will touch on. We will call them

  1. Experiments
  2. Sampling
  3. Exploration (my favorite)
  4. Prediction
  5. Observational data analysis

0.1 Experiments

Wouldn’t it all be so easy if correlation implied causation?
Well, good news! Sometimes, it does! Sometimes, correlation implies causation.

This requires that the “treatment” is somehow randomly allocated. The best example is a randomized, placebo-controlled, double-blind experiment.

Example:

On November 16, 2020, Moderna released the first results from the phase 3 clinical trial of its COVID-19 vaccine.

In summary, over 30,000 people participated in the trial. I assume half received the vaccine and the other half received the placebo (can someone confirm this?). In total, there were 95 cases of COVID-19 among the participants: 90 in the placebo group and 5 in the “treated” group. Of the 95 cases, 11 were severe, and all of those were in the placebo group.
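Under the even-split assumption above (roughly 15,000 people per arm, which is my assumption rather than a published figure), a back-of-the-envelope estimate of vaccine efficacy looks like this in R; it lands near the roughly 94% figure that was widely reported:

```r
# Back-of-the-envelope efficacy estimate, assuming an even split of ~15,000 per arm
n_per_arm     <- 15000
cases_placebo <- 90
cases_vaccine <- 5

rate_placebo <- cases_placebo / n_per_arm   # attack rate in the placebo arm
rate_vaccine <- cases_vaccine / n_per_arm   # attack rate in the vaccine arm

1 - rate_vaccine / rate_placebo             # estimated vaccine efficacy, about 0.94
```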

In this study, vaccination is correlated with a reduction in risk. Does this mean that vaccination causally reduced COVID-19 infection? Why or why not?

0.2 Sampling

Presidential election polls often come with a +/- 3% “confidence interval”. What is that all about?

Example:

At the end of October 2020, many wanted to know the outcome of the forthcoming 2020 presidential election. A common technique is “random digit dialing”.

This poll reached 806 registered voters in Wisconsin. After a significant amount of statistical work, the poll reported that 48% of likely voters would choose Biden, 43% Trump, 2% Jorgensen, and 7% were undecided. The margin of error for this poll was reported to be +/- 4.3%.

In order to reach those 806 participants, many more numbers needed to be dialed. In this poll, the response rate was 4.3%. See here for more data and the definition of how that 4.3% is computed. In that link, you can see that in fact over 100,000 numbers were dialed! Most of those calls were never picked up. Among those dialed, 806 were registered voters who agreed to participate, and 1,113 refused to participate (or hung up). There is so much work that goes into making these polls work!

In the actual election, Biden received 49.45% of votes cast in Wisconsin and Trump received 48.82%.
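As an aside, the textbook 95% margin of error for a proportion from a simple random sample is about \(1.96\sqrt{p(1-p)/n}\), which for \(n = 806\) is roughly 3.5 percentage points; real polls report a larger margin (here 4.3%) partly because weighting and design adjustments add uncertainty. A quick sketch:

```r
# Naive 95% margin of error for a sample proportion under simple random sampling
n <- 806
p <- 0.5                       # p = 0.5 maximizes p * (1 - p), the worst case
1.96 * sqrt(p * (1 - p) / n)   # about 0.035, i.e. +/- 3.5 percentage points
```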

0.3 Exploration

What if your data looked like this? What would you do?

In exploration, we want to group things into clusters or decompose things into separate pieces… but we don’t say in advance what those clusters/pieces are. Instead, we find them in the data.

Example exploration: My research studies how the news media directs our attention. We focus on Twitter. Given that the news media is highly segmented (“media bubbles”), we grouped leading journalists and media Twitter accounts into 11 different groups. We did not predetermine what these groups should be! Instead, we used the data to find “clusters”. In particular, the data we used is who-follows-who on Twitter.
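As a toy illustration of finding clusters without pre-specifying them, here is k-means in R on made-up two-dimensional data (a sketch, not the Twitter study):

```r
set.seed(1)
# Simulate 100 points that happen to come from two groups, without labeling them
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 3), ncol = 2))

fit <- kmeans(x, centers = 2)   # we only tell the algorithm how many clusters to find
table(fit$cluster)              # the cluster memberships are discovered from the data
plot(x, col = fit$cluster)      # points colored by their discovered cluster
```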

0.4 Prediction

Chapter 1 in ISLR

See handwritten digits.

0.5 Observational data analysis

To be discussed later.

1 Random Variables

Random variables are a way of expressing random numbers with mathematical notation.

Learning objectives: After Chapter 1, you should be able to

  1. Identify random variables for basic things that we want to model.
  2. Critique why a certain distribution is a poor model for some real world phenomenon.
  3. Start to build richer models with basic random variables.
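To make objective 3 concrete, here is a minimal R sketch that builds a slightly richer random variable out of simple ones (the scenario and numbers are made up for illustration):

```r
set.seed(1)
# Suppose each of 30 people independently tests positive with probability 0.1
positives <- rbinom(n = 30, size = 1, prob = 0.1)   # 30 Bernoulli(0.1) random variables
sum(positives)   # a richer random variable: the total number of positives in the group
```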

homework1.

2 Monte Carlo Simulation.

Suppose you have a complicated random variable or random process \(X\) and you want to know how often \(X\) has a certain property \(A\). For a simple example, \(X\) could be a number and \(A\) could be the property that \(X>5\). We write \(X \in A\) if \(X\) has the property \(A\). We want to compute \(P(X \in A)\). As we will see in the next chapters, we need these types of probabilities to test hypotheses and create confidence intervals. For example, if you have ever used a Z-table before, you know that it gives you such probabilities for one type of random variable \(X\) and one type of set \(A\).

Monte Carlo simulation is a really fancy phrase to describe a super simple idea: To “compute” \(P(X \in A)\), simulate the random variable \(X\) with your computer. Then, check if \(X\) has that property. Do this simulation lots of times. The proportion of those simulations for which \(X\) has property \(A\) is an estimate of \(P(X \in A)\). Monte Carlo is a key idea for this course. We will use it again and again and again.
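For the simple example above, a minimal sketch in R, taking \(X\) to be Normal with mean 4 and standard deviation 2 (a made-up choice just for illustration):

```r
set.seed(1)
reps <- 100000
x <- rnorm(reps, mean = 4, sd = 2)   # simulate X many times
mean(x > 5)                          # proportion of simulations with property A: X > 5
# For this particular X we can check the answer: 1 - pnorm(5, mean = 4, sd = 2) is about 0.31
```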

Learning objectives: After the chapter on Monte Carlo, you should know how to compute

  1. probabilities \(P(X \in A)\) as frequencies
  2. expectations \(\mathbb{E}(X)\) as averages and
  3. “distributions” as histograms

by simulating the random variable \(X\) multiple times. These techniques are referred to as Monte Carlo simulation.
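Continuing with the same made-up \(X\) as in the sketch above, objectives 2 and 3 look like this:

```r
set.seed(1)
x <- rnorm(100000, mean = 4, sd = 2)   # simulate X many times
mean(x)    # E(X) approximated as an average; it should be close to 4
hist(x)    # the "distribution" of X approximated as a histogram
```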

Homework

3 Testing: how do you know that you are not fooling yourself?

Randomness creates all sorts of artifacts. Sometimes, those artifacts look like “signal,” leading us to make inferences that are false. Hypothesis testing asks, “might we have observed this thing simply due to chance?” Asked another way, “could this pattern be an artifact of noise?”

If you have learned about hypothesis testing before, buckle up! We are going to use a more general formulation in this course, using Monte Carlo simulation. If you have not heard of hypothesis testing, that’s ok too!

Either way, read the excerpt from Jordan Ellenberg’s New York Times best-selling book How Not to Be Wrong: The Power of Mathematical Thinking that should be posted on Canvas. It offers a fresh view of hypothesis testing. While it is not the main point of the reading, one of my favorite parts is how it relates the proof that \(\sqrt{2}\) is irrational (the proof is given in the reading and it is surprisingly simple!) to the logic of statistical hypothesis testing. We will have a reading quiz on the main points from the reading.

Learning objectives: After the logic of statistical testing via Monte Carlo Simulation, you should know how to test a hypothesis with Monte Carlo. This involves three steps.

  1. Convert null hypothesis into a statistical model from which we can simulate (recall chapter 1).
  2. Develop a test statistic \(S\) and “surprising set” for \(S\) based upon our understanding of the setting.
  3. Compute \(P(S \in SurprisingSet)\) with Monte Carlo to get a p-value (recall chapter 2).
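Here is a minimal sketch of those three steps for a made-up example: testing whether a coin is fair after observing 60 heads in 100 flips.

```r
set.seed(1)
# Step 1: the null hypothesis "the coin is fair" as a model we can simulate from
simulate_null <- function() rbinom(1, size = 100, prob = 0.5)

# Step 2: test statistic S = number of heads; the "surprising set" is every value
#         at least as far from 50 as the observed 60, i.e. |S - 50| >= 10
observed <- 60

# Step 3: Monte Carlo estimate of P(S in SurprisingSet) under the null (the p-value)
s <- replicate(10000, simulate_null())
mean(abs(s - 50) >= abs(observed - 50))   # roughly 0.06 for this example
```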

homework for testing. Note that you can change the web address to get the .Rmd code.

4 Estimation

In this section, we talk about creating confidence intervals around statistical point estimates. These confidence intervals are constructed so that they cover the true quantity being estimated (the estimand) in (e.g.) 95% of experiments. If you have learned about confidence intervals before, buckle up! We are going to have a more general approach that uses Monte Carlo simulation.

The logic of statistical estimation via Monte Carlo simulation
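One common Monte Carlo recipe for a confidence interval is the bootstrap; whether or not it is exactly the construction in the linked notes, the flavor is the same. A minimal sketch for a sample proportion, with made-up data:

```r
set.seed(1)
# Made-up data: 806 poll respondents, each supporting a candidate with probability 0.48
x <- rbinom(806, size = 1, prob = 0.48)
mean(x)   # the point estimate

# Resample the data with replacement many times and recompute the estimate each time
boot <- replicate(10000, mean(sample(x, replace = TRUE)))

# The middle 95% of the resampled estimates gives a 95% confidence interval
quantile(boot, probs = c(0.025, 0.975))
```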

Homework 3.

5 Variance and the Central Limit Theorem (CLT).

The CLT says that an average of lots of random variables “looks like” a Normal random variable. Here are the notes.
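A quick Monte Carlo illustration (a sketch, not from the linked notes): even when the individual random variables are skewed, their averages look roughly Normal.

```r
set.seed(1)
# Each "experiment" averages 100 draws from a skewed Exponential(1) distribution
averages <- replicate(10000, mean(rexp(100, rate = 1)))
hist(averages)   # despite the skewed inputs, the histogram of averages looks roughly Normal
```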

6 Prediction

In a prediction problem, you are given data pairs \((X_1, Y_1), (X_2, Y_2), \dots, (X_n, Y_n)\) and you want to use \(X_i\) to predict \(Y_i\). We typically imagine \(X_i\) as containing several values (i.e. it is a “vector”).

There are two types of prediction problems, continuous \(Y_i\) and discrete \(Y_i\). For example, you might want to predict tomorrow’s price of asset \(i\) using data that is available today. So, you develop a historical training set, where \(X_i\) contains the information about asset \(i\) from one day and \(Y_i\) contains that asset’s price on the next day. Here, the price is continuous.

Alternatively, you might only be interested in knowing whether you should buy, sell, or hold the asset. So, you develop a historical training set where \(X_i\) contains the information that is available on one day. Then, you develop “labels” \(Y_i\) using data from the next day that say whether you should have bought, sold, or held the asset. Here, the label (buy, sell, hold) is discrete.

We will often call prediction with continuous outcomes “regression” and prediction with discrete outcomes “classification”.

Alternatively, perhaps we want to make predictions about the 2020 election. You could try to predict who is going to win (classification) or the number of delegates/votes that the Republicans receive (regression).

In the cases above, there are two natural versions of the same problem (one is regression and one is classification). However, many classification problems do not have an analogous regression problem. For example, in the handwritten digit example in Chapter 1 of ISLR, \(X_i\) is an image of a handwritten digit and \(Y_i\) is a label that says whether the digit is 0, 1, 2, 3, 4,… , or 9.

We are going to imagine two broad approaches to regression and classification.

  1. Model-based approaches parameterize the distribution of \(Y_i\) given \(X_i\). That is, we imagine \(Y_i\) being a random variable that follows some distribution and that distribution is somehow parameterized by the variables in \(X_i\).
  2. Black-box approaches are defined algorithmically.
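As a preview, here is a minimal sketch contrasting the two approaches on a made-up classification problem; logistic regression stands in for a model-based method and k-nearest neighbors for a black-box one (these particular choices are for illustration, not necessarily the techniques we will emphasize):

```r
library(class)   # for knn()
set.seed(1)

# Made-up training data: two predictors x1, x2 and a binary label y (buy or sell)
n <- 200
x <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n) > 0, "buy", "sell"))
train <- data.frame(x1 = x[, 1], x2 = x[, 2], y = y)

# Model-based: logistic regression parameterizes the distribution of Y given X
model_fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# Black-box: k-nearest neighbors is defined algorithmically, with no distribution for Y
new_point <- data.frame(x1 = 0.5, x2 = -0.2)
predict(model_fit, newdata = new_point, type = "response")   # an estimated probability
knn(train = x, test = as.matrix(new_point), cl = y, k = 5)   # a predicted label
```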

Chapter 2 in ISLR provides a broad overview of prediction. In the previous weeks of this course, Monte Carlo provided the basic computational tool; we were always working to get the problem stated as something that we could solve with Monte Carlo. Now, the basic computational tool is numerical optimization. We will not write code to do optimization. Instead, we will see optimization problems multiple times; it is often used to define our various techniques.

Note that you can download the textbook for ISLR and get all of the R labs at the book’s website.

Homework 4: ISLR (p52) 2.4.1, 2.4.2, 2.4.10.