Link to source file

Synthesizing data using “playgrounds”

Background

The goal of today’s discussion is to get you thinking about ways of synthesizing data from multiple sources to create a data frame that gives a more complete picture of the world and is more interesting and productive to study.

One of the easiest ways of doing this is by using what Karl calls data playgrounds; these are usually some kind of unique identifier that can be used to connect two disjoint data sets. Examples include FIPS county codes, NPI numbers, postal codes, etc.

Using the latest figures from USA Facts, we have prepared a dataset of daily covid-19 cases and deaths (direct download link) for each county in the US (.csv.gz means it’s a gzip compressed csv file (to save space), which R can read directly like any other csv file using standard read.csv or read_csv from the tidyverse package). The data frame includes the date, names and FIPS codes for each state and county, population, and confirmed cases & deaths.

# install package manager (if not installed) for convenience, then load tidyverse
if(!require(pacman)) install.packages("pacman")
pacman::p_load(tidyverse)

# change this path to wherever you downloaded the dataset
# col_types used to force FIPS codes to read as characters (to preserve leading 0's)
# read for more info: https://readr.tidyverse.org/reference/read_delim.html
covid19 = read_csv("../../../data/covid/covid19.csv.gz",col_types="Dcccciii")

# inspect the data frame
head(covid19)
 # # A tibble: 6 x 8
 #   date       stateName stateFIPS countyName   countyFIPS population cases deaths
 #   <date>     <chr>     <chr>     <chr>        <chr>           <int> <int>  <int>
 # 1 2020-01-22 NY        36        New York Ci~ <NA>               NA     0      0
 # 2 2020-01-22 AL        01        Autauga Cou~ 01001           55869     0      0
 # 3 2020-01-22 AL        01        Baldwin Cou~ 01003          223234     0      0
 # 4 2020-01-22 AL        01        Barbour Cou~ 01005           24686     0      0
 # 5 2020-01-22 AL        01        Bibb County  01007           22394     0      0
 # 6 2020-01-22 AL        01        Blount Coun~ 01009           57826     0      0
tail(covid19)
 # # A tibble: 6 x 8
 #   date       stateName stateFIPS countyName   countyFIPS population cases deaths
 #   <date>     <chr>     <chr>     <chr>        <chr>           <int> <int>  <int>
 # 1 2021-01-28 WY        56        Sublette Co~ 56035            9831   648      7
 # 2 2021-01-28 WY        56        Sweetwater ~ 56037           42343  3440     32
 # 3 2021-01-28 WY        56        Teton County 56039           23464  3028      6
 # 4 2021-01-28 WY        56        Uinta County 56041           20226  1957     12
 # 5 2021-01-28 WY        56        Washakie Co~ 56043            7805   864     25
 # 6 2021-01-28 WY        56        Weston Coun~ 56045            6927   610      4

Things to note:

  • These are only the confirmed counts (actual counts likely to be much higher in most counties).
  • Some rows denote unallocated cases and may need special treatment.
  • The state FIPS code forms the first two digits of the county FIPS code (e.g. Wyoming (WY) has state FIPS code 56, so every county in WY has a FIPS code starting with “56…”).
  • In some datasets, the county FIPS code may not include the state FIPS code, which may be in a separate column (e.g. Teton County (WY) may have FIPS code listed as just “039” instead of “56039”).

Exercise

Your task is to merge the covid-19 data with a second dataset to give it a new and unique perspective, using the FIPS codes. To save time, Karl has written a script to provide you with monthly county statistics from the Bureau of Labor Statistics (BLS) over a period of about a year, including estimates of the labor force, number of employed vs unemployment, and employment rate. Use this as your second dataset.

# run Karl's script
source("https://raw.githubusercontent.com/karlrohe/340-Spring21/master/data/bls/recent_county_months.R")

# run Karl's function to get BLS county data
bls_county = get_bls_county()

# inspect the data frame
str(bls_county)
 # tibble [43,974 x 8] (S3: tbl_df/tbl/data.frame)
 #  $ period         : Date[1:43974], format: "2019-10-01" "2019-10-01" ...
 #  $ fips_state     : chr [1:43974] "01" "01" "01" "01" ...
 #  $ fips_county    : chr [1:43974] "001" "003" "005" "007" ...
 #  $ area_title     : chr [1:43974] "Autauga County, AL" "Baldwin County, AL" "Barbour County, AL" "Bibb County, AL" ...
 #  $ labor_force    : num [1:43974] 26143 97213 8672 8722 25460 ...
 #  $ employed       : num [1:43974] 25530 94967 8374 8501 24890 ...
 #  $ unemployed     : num [1:43974] 613 2246 298 221 570 ...
 #  $ unemployed_rate: num [1:43974] 2.3 2.3 3.4 2.5 2.2 3.6 3.3 3 2.5 2.4 ...
 #  - attr(*, "na.action")= 'omit' Named int [1:5] 45067 45068 45069 45070 45071
 #   ..- attr(*, "names")= chr [1:5] "45067" "45068" "45069" "45070" ...
head(bls_county)
 # # A tibble: 6 x 8
 #   period     fips_state fips_county area_title labor_force employed unemployed
 #   <date>     <chr>      <chr>       <chr>            <dbl>    <dbl>      <dbl>
 # 1 2019-10-01 01         001         Autauga C~       26143    25530        613
 # 2 2019-10-01 01         003         Baldwin C~       97213    94967       2246
 # 3 2019-10-01 01         005         Barbour C~        8672     8374        298
 # 4 2019-10-01 01         007         Bibb Coun~        8722     8501        221
 # 5 2019-10-01 01         009         Blount Co~       25460    24890        570
 # 6 2019-10-01 01         011         Bullock C~        4850     4674        176
 # # ... with 1 more variable: unemployed_rate <dbl>
tail(bls_county)
 # # A tibble: 6 x 8
 #   period     fips_state fips_county area_title labor_force employed unemployed
 #   <date>     <chr>      <chr>       <chr>            <dbl>    <dbl>      <dbl>
 # 1 2020-11-01 56         035         Sublette ~        4018     3800        218
 # 2 2020-11-01 56         037         Sweetwate~       21155    19882       1273
 # 3 2020-11-01 56         039         Teton Cou~       14191    13371        820
 # 4 2020-11-01 56         041         Uinta Cou~        8990     8503        487
 # 5 2020-11-01 56         043         Washakie ~        3878     3695        183
 # 6 2020-11-01 56         045         Weston Co~        3826     3701        125
 # # ... with 1 more variable: unemployed_rate <dbl>

Note this dataset covers a different period of time than the covid-19 dataset, so you will probably need to either take a subset of a single point in time, or use some kind of summarization/aggregation to make the two datasets compatible before joining them. You should use your best judgment as aspiring data scientists to determine what method both makes sense and is also interesting.

After joining them, make at least 1 interesting plot visualizing the data in some way that was not possible before merging the two datasets.

Submission

Make sure the names of everyone who worked on this with you is included in the header of this document. Then, knit this document and submit both this file and the HTML output on Canvas under Assignments ⇒ Discussion 2.