The goal of today’s discussion is to get you thinking about ways of synthesizing data from multiple sources to create a data frame that gives a more complete picture of the world and is more interesting and productive to study.
One of the easiest ways of doing this is by using what Karl calls data playgrounds; these are usually some kind of unique identifier that can be used to connect two disjoint data sets. Examples include FIPS county codes, NPI numbers, postal codes, etc.
Using the latest figures from USA Facts, we have prepared a dataset of daily covid-19 cases and deaths (direct download link) for each county in the US (.csv.gz means it’s a gzip compressed csv file (to save space), which R can read directly like any other csv file using standard read.csv
or read_csv
from the tidyverse package). The data frame includes the date, names and FIPS codes for each state and county, population, and confirmed cases & deaths.
# install package manager (if not installed) for convenience, then load tidyverse
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse)
# change this path to wherever you downloaded the dataset
# col_types used to force FIPS codes to read as characters (to preserve leading 0's)
# read for more info: https://readr.tidyverse.org/reference/read_delim.html
covid19 = read_csv("../../../data/covid/covid19.csv.gz",col_types="Dcccciii")
# inspect the data frame
head(covid19)
# # A tibble: 6 x 8
# date stateName stateFIPS countyName countyFIPS population cases deaths
# <date> <chr> <chr> <chr> <chr> <int> <int> <int>
# 1 2020-01-22 NY 36 New York Ci~ <NA> NA 0 0
# 2 2020-01-22 AL 01 Autauga Cou~ 01001 55869 0 0
# 3 2020-01-22 AL 01 Baldwin Cou~ 01003 223234 0 0
# 4 2020-01-22 AL 01 Barbour Cou~ 01005 24686 0 0
# 5 2020-01-22 AL 01 Bibb County 01007 22394 0 0
# 6 2020-01-22 AL 01 Blount Coun~ 01009 57826 0 0
tail(covid19)
# # A tibble: 6 x 8
# date stateName stateFIPS countyName countyFIPS population cases deaths
# <date> <chr> <chr> <chr> <chr> <int> <int> <int>
# 1 2021-01-28 WY 56 Sublette Co~ 56035 9831 648 7
# 2 2021-01-28 WY 56 Sweetwater ~ 56037 42343 3440 32
# 3 2021-01-28 WY 56 Teton County 56039 23464 3028 6
# 4 2021-01-28 WY 56 Uinta County 56041 20226 1957 12
# 5 2021-01-28 WY 56 Washakie Co~ 56043 7805 864 25
# 6 2021-01-28 WY 56 Weston Coun~ 56045 6927 610 4
Things to note:
Your task is to merge the covid-19 data with a second dataset to give it a new and unique perspective, using the FIPS codes. To save time, Karl has written a script to provide you with monthly county statistics from the Bureau of Labor Statistics (BLS) over a period of about a year, including estimates of the labor force, number of employed vs unemployment, and employment rate. Use this as your second dataset.
# run Karl's script
source("https://raw.githubusercontent.com/karlrohe/340-Spring21/master/data/bls/recent_county_months.R")
# run Karl's function to get BLS county data
bls_county = get_bls_county()
# inspect the data frame
str(bls_county)
# tibble [43,974 x 8] (S3: tbl_df/tbl/data.frame)
# $ period : Date[1:43974], format: "2019-10-01" "2019-10-01" ...
# $ fips_state : chr [1:43974] "01" "01" "01" "01" ...
# $ fips_county : chr [1:43974] "001" "003" "005" "007" ...
# $ area_title : chr [1:43974] "Autauga County, AL" "Baldwin County, AL" "Barbour County, AL" "Bibb County, AL" ...
# $ labor_force : num [1:43974] 26143 97213 8672 8722 25460 ...
# $ employed : num [1:43974] 25530 94967 8374 8501 24890 ...
# $ unemployed : num [1:43974] 613 2246 298 221 570 ...
# $ unemployed_rate: num [1:43974] 2.3 2.3 3.4 2.5 2.2 3.6 3.3 3 2.5 2.4 ...
# - attr(*, "na.action")= 'omit' Named int [1:5] 45067 45068 45069 45070 45071
# ..- attr(*, "names")= chr [1:5] "45067" "45068" "45069" "45070" ...
head(bls_county)
# # A tibble: 6 x 8
# period fips_state fips_county area_title labor_force employed unemployed
# <date> <chr> <chr> <chr> <dbl> <dbl> <dbl>
# 1 2019-10-01 01 001 Autauga C~ 26143 25530 613
# 2 2019-10-01 01 003 Baldwin C~ 97213 94967 2246
# 3 2019-10-01 01 005 Barbour C~ 8672 8374 298
# 4 2019-10-01 01 007 Bibb Coun~ 8722 8501 221
# 5 2019-10-01 01 009 Blount Co~ 25460 24890 570
# 6 2019-10-01 01 011 Bullock C~ 4850 4674 176
# # ... with 1 more variable: unemployed_rate <dbl>
tail(bls_county)
# # A tibble: 6 x 8
# period fips_state fips_county area_title labor_force employed unemployed
# <date> <chr> <chr> <chr> <dbl> <dbl> <dbl>
# 1 2020-11-01 56 035 Sublette ~ 4018 3800 218
# 2 2020-11-01 56 037 Sweetwate~ 21155 19882 1273
# 3 2020-11-01 56 039 Teton Cou~ 14191 13371 820
# 4 2020-11-01 56 041 Uinta Cou~ 8990 8503 487
# 5 2020-11-01 56 043 Washakie ~ 3878 3695 183
# 6 2020-11-01 56 045 Weston Co~ 3826 3701 125
# # ... with 1 more variable: unemployed_rate <dbl>
Note this dataset covers a different period of time than the covid-19 dataset, so you will probably need to either take a subset of a single point in time, or use some kind of summarization/aggregation to make the two datasets compatible before joining them. You should use your best judgment as aspiring data scientists to determine what method both makes sense and is also interesting.
After joining them, make at least 1 interesting plot visualizing the data in some way that was not possible before merging the two datasets.
Make sure the names of everyone who worked on this with you is included in the header of this document. Then, knit this document and submit both this file and the HTML output on Canvas under Assignments ⇒ Discussion 2.