
Group Discussion (5-10 min)

As a group, briefly discuss the following questions!

  1. What is the defining feature of Monte Carlo methods? In other words, what makes a method “Monte Carlo”?
  2. What are some advantages and drawbacks of Monte Carlo methods? List at least 2 of each.
  3. What are some ways a Monte Carlo method can fail?

Exercises (rest of discussion)

Now, break off into groups of 3-4 students. In your group, nominate one person to share their screen. As a group, choose at least 2 of the following exercises to complete (you’re welcome to choose more!).

At the end, please delete any exercises or parts you did not attempt before submitting! This will help us grade more easily. Thanks in advance!

As usual, bonus parts are optional but will give extra statistical EXP. Do them if you have time and like the challenge! 😃


  1. Estimating \(\pi\).
    a. Randomly generate \(N\) points \((x,y)\), with both coordinates drawn uniformly between \(-1\) and \(1\), for some large \(N\) (at least \(1000\), but the more the better!)
    b. Using R, plot the points as small dots, along with the unit circle (i.e. the circle centered at \((0,0)\) with radius 1) and a square that circumscribes it. Your result should look similar to this:
    c. What proportion of your points lie inside the circle? (Hint)
    d. How does this proportion relate to \(\pi\), and how can you use it to estimate \(\pi\)?
    e. Calculate your percentage error from the true value.
    f. Bonus: can you increase the accuracy of your estimate by using more points? How low can you get the error?
    g. Extra bonus: Make a plot showing the rate of convergence of your estimates, including the estimated value at each \(N\), the true value (as a horizontal line), and the percentage error.
# Do exercise 1 here
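# If you're not sure where to start, here is a minimal sketch of the sampling-and-counting
# idea (the names N, inside, and pi_hat are just placeholder choices, not part of the
# assignment); the plotting in parts b-c is left to you.
N <- 10000
x <- runif(N, -1, 1)            # x-coordinates, uniform on [-1, 1]
y <- runif(N, -1, 1)            # y-coordinates, uniform on [-1, 1]
inside <- x^2 + y^2 <= 1        # TRUE when a point falls inside the unit circle
pi_hat <- 4 * mean(inside)      # proportion inside times 4, since circle/square area ratio is pi/4
abs(pi_hat - pi) / pi * 100     # percentage error from the true value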



  2. Estimating Robbins constant (the mean distance between two random points in a cube).
    a. Randomly generate 2 points \((x_1,y_1,z_1)\), \((x_2,y_2,z_2)\) uniformly in the unit cube, a total of \(N\) times (again, at least \(1000\), but the more the better!)
      • hint: you can easily generate all the coordinates you need at once with runif(6*N), then reshape the result as an \(N\times 6\) matrix (one column for each coordinate component, with each row representing a pair of points) and perform the arithmetic in the next step on whole columns at once to improve computational efficiency (see the sketch below for one way to set this up)
      • if you’re having difficulty with the above, you can always use the naive approach of running a for loop N times, where in each step of the loop you generate 2 points (6 coordinates total) and then perform the arithmetic in the next step
    b. Next, compute the standard Euclidean distance between each pair of points and find the mean distance. (Bonus: plot the distribution of these distances!)
    c. Calculate your percentage error from the true value (Robbins constant is approximately \(0.6617\)).
    d. Bonus: can you increase the accuracy of your estimate by using more points? How low can you get the error?
    e. Super bonus: Repeat the above for another 2D or 3D object of your choice (how about a triangle or a sphere?)
# Do exercise 2 here
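# One possible way to set up the vectorized approach from the hint (the names N, P, and
# dists are arbitrary; the column order below is just one convention):
N <- 10000
P <- matrix(runif(6 * N), ncol = 6)   # each row holds one pair of points: (x1, y1, z1, x2, y2, z2)
dists <- sqrt((P[, 1] - P[, 4])^2 + (P[, 2] - P[, 5])^2 + (P[, 3] - P[, 6])^2)
mean(dists)                           # Monte Carlo estimate of the mean distance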



  3. How many heads in a row should you see in \(N\) flips of a fair coin?
    a. Start by randomly flipping a fair coin (\(p=0.5\)) a total of \(N=10\) times (hint: use either the rbernoulli function from purrr or rbinom with n=10 and size=1) and record the longest run of heads (a head is defined as a value of \(1\)) you observe (this has been implemented for you in the longestHeadRun function below).
    b. Repeat the above step \(M\) times (at least \(1000\) times, but this time, don’t use an extremely large \(M\), since we will repeat the previous step for other values of \(N\)). What is the mean length of the longest run of heads in \(10\) flips?
      • NOTE: \(N\) here is the size of each experiment (i.e. each experiment consists of \(N\) flips), whereas \(M\) is how many experiments are performed. It is common in Monte Carlo methods to have different parameters governing each experiment vs how many experiments are performed. Increasing \(N\) (number of flips in each experiment) will increase the mean run length, whereas increasing \(M\) (number of experiments) will increase the precision of your mean-run-length estimate for a particular number of flips.
    c. Now, repeat the above (you may use the same \(M\)) for at least 3 other values of \(N\) (again, feel free to do more if you wish!). Display your results in a table.
      • NOTE this step should be easy if you’ve written your code with good style. I recommend writing a function that does all the above for any given \(N\) and \(M\) and maybe \(p\), e.g. findMeanRun = function(N,M,p=0.5){......}. Then, for different values of \(N\) and \(M\), you can simply change the arguments given to the function, e.g. findMeanRun(10,1000) or findMeanRun(20,1000), etc. (see the sketch below for one possible shape).
      • ALSO NOTE the above function syntax sets N and M as arguments without default values, but sets 0.5 as the default value of the argument p. For a different example, see this.
    d. Validate your results against other people’s results (for example, this post). Are your results consistent with theirs?
    e. Bonus: run a few more values of \(N\) and plot the results, showing the mean run length vs the number of flips \(N\). (bonus²: what happens if you increase \(M\)?)
    f. DoublePlusBonus if you still want MORE: Like the post referenced above, can you fit a smooth curve through the points?
# given output of rbernoulli or rbinom (a vector of 0's and 1's)
# compute the length of the longest continuous run of 1's
longestHeadRun = function(trials){
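  # rle() encodes the 0/1 sequence as runs; keep the lengths of runs of 1's (heads)
  # and take the longest, returning 0 if there are no heads at all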
  return(with(rle(trials),max(c(0,lengths[values==1]))))
}

# demo (output hidden for brevity)
longestHeadRun(c(0,0,0,0,0,0,0,0,0,0)) # returns 0
longestHeadRun(c(1,0,1,1,0,1,1,1,1,0)) # returns 4
# Do exercise 3 here
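# A possible shape for the helper function suggested in the notes above (the signature
# follows the findMeanRun example given there; the body is just one way to fill it in):
findMeanRun = function(N, M, p = 0.5){
  runs <- replicate(M, longestHeadRun(rbinom(N, size = 1, prob = p)))   # longest run in each of M experiments
  mean(runs)                                                            # mean longest-run length over the M experiments
}
# e.g. findMeanRun(10, 1000), findMeanRun(20, 1000), ...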



  4. Estimating a \(t\)-distribution with \(N-1\) degrees of freedom.
    a. Choose an arbitrary \(\mu\) and \(\sigma>0\) to use for the rest of the problem (you may choose the standard normal \(N(0,1)\) if you really wish, but where’s the fun in that?).
    b. Start by sampling \(N=2\) values from the normal distribution with mean \(\mu\) and standard deviation \(\sigma\) (note this counts as \(1\) experiment) and calculate the \(t\)-statistic of your sample. Recall the \(t\)-statistic for a sample \(X\) is defined as \[t=\frac{\overline{X}-\mu}{s/\sqrt{N}}~,~~~s=\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(X_i-\overline{X})^2}\] where \(\overline{X}\) is the sample mean and \(s\) is the sample standard deviation.
      • Make sure you’re actually computing \(s\) for the sample, and NOT just using \(\sigma\) here!
      • You can use the built-in mean( ) and sd( ) functions, but if you really want to do a barebones-style Monte Carlo, feel free to compute the \(t\)-statistic manually.
      • NOTE: Similar to the note in exercise 3.b., \(N\) here is the size of each experiment and \(M\) is how many experiments are performed. Increasing \(N\) gives a \(t\)-distribution with a different number of degrees of freedom (namely, \(N-1\)), whereas increasing \(M\) gives a more accurate estimate of the distribution for a particular number of degrees of freedom.
    c. Repeat the above step \(M\) times (similar to exercise 3.b., use at least \(1000\) times, but don’t use an extremely large \(M\) since we will repeat this for other values of \(N\)).
    d. You’ve just simulated drawing from a \(t\)-distribution with \(N-1=1\) degree of freedom! Now plot the resulting values in a density plot.
    e. For comparison, plot the theoretical distribution with \(1\) degree of freedom (this page may be helpful). For best results, overlay this on top of the previous plot, but if you’re having trouble with this, you can also plot them side-by-side.
    f. Repeat the above steps for at least 3 other values of \(N\) (for example 3, 6, 11, but feel free to choose your own or choose more than 3!). For each \(N\), plot both your simulated distribution and the theoretical distribution.
      • NOTE: again, like the note in exercise 3.c., this should be easy if you used a function!
    g. Bonus: What do you notice about these distributions? What do they converge to and why?
# Do exercise 4 here
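# A minimal sketch of one experiment and its repetition (mu, sigma, oneT, and tvals are
# placeholder choices; base-R plotting is used here, but ggplot works just as well):
mu <- 3; sigma <- 2
oneT <- function(N){
  X <- rnorm(N, mean = mu, sd = sigma)
  (mean(X) - mu) / (sd(X) / sqrt(N))        # t-statistic, using the sample sd (not sigma)
}
M <- 5000
tvals <- replicate(M, oneT(2))              # N = 2 draws per experiment, so N - 1 = 1 degree of freedom
plot(density(tvals), main = "Simulated t, df = 1")
curve(dt(x, df = 1), add = TRUE, lty = 2)   # theoretical t density for comparison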



Submission

As usual, make sure the names of everyone who worked on this with you are included in the header of this document. Then, knit this document and submit both this file and the HTML output on Canvas under Assignments ⇒ Discussion 4