Intro to longpca with nycflights data
Source: vignettes/Intro_to_longpca_with_nycflights_data.Rmd
Introduction
If you’ve already read the README, you can skip to the section “PCA the nycflights”.
This package introduces a novel formula syntax for PCA. In modern
applications (where data is often in “long format”), the formula syntax
helps to fluidly imagine PCA without thinking about matrices. In other
words, it provides a layer of abstraction above matrices. Given the
formula and the (long) data, the code in this package transforms your
data into a proper format for fast PCA via sparse linear algebra. The
package also provides code to 1) help pick the number of dimensions to
compute, 2) diagnose the suitability of PCA (both pre and post PCA), 3)
rotate the PCs with varimax, 4) visualize and interpret the dimensions
uncovered, and (not yet) 5) make predictions. This package uses “PCA” as
a broad term for computing the leading singular vectors of a normalized
(sometimes incomplete) matrix. Some might refer to specific instances as
factor analysis, correspondence analysis, latent semantic analysis,
social network analysis, or low-rank matrix completion, among other
possible terms. This is big-tent PCA, all included. longpca
is in development. So, functions and syntax might change.
The current approach to PCA (principal components analysis) is matrix-first. This note begins to explore an alternative path, one that is model-first. The formula syntax provides an alternative way to think about PCA that makes matrices transparent: completely hidden, unless you want to see the code.
I hope this makes PCA legible for folks who have not yet learned linear algebra (just like linear models are legible without solving linear systems of equations).
I am personally inspired by this approach because (despite the fact that I love matrices and linear algebra) I find that this model-first way of thinking is so much easier and more direct.
This document illustrates the package with a PCA analysis of the popular nycflights13 data. Headline: we find two seasonal effects (annual and weekly) and also the “fly-over zone” (midwest 4ever. ride or die <3 much love to my midwest fam). Code details follow this analysis.
(Disclaimer: this is very early in this project. So, the syntax and the code are likely to change a great deal. Input is very welcome about ways to improve it.)
Install
The functions for PCA for the People are contained in an R package, longpca. If you do not already have devtools installed, you will first need to install that:
install.packages("devtools")
devtools::install_github("karlrohe/longpca")
Thank you to Alex Hayes for helpful feedback in this process and for suggesting the name longpca.
PCA the nycflights.
To illustrate the code, consider the popular data example nycflights13, which contains a row for every flight departing from the three main New York City airports during 2013 (LGA, JFK, and EWR). It includes things like date, destination, and information about delays.
library(nycflights13)
flights
#> # A tibble: 336,776 × 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> 7 2013 1 1 555 600 -5 913 854
#> 8 2013 1 1 557 600 -3 709 723
#> 9 2013 1 1 557 600 -3 838 846
#> 10 2013 1 1 558 600 -2 753 745
#> # ℹ 336,766 more rows
#> # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#> # hour <dbl>, minute <dbl>, time_hour <dttm>
There are so many matrices “inside” of this data, but you don’t think of them when you see it. Many applications are like this. The data does not look like a matrix. Instead, it looks like a spreadsheet, a SQL database, or a tidy tibble. That is how users often think about their data. And underneath it, there are so many possible matrices. This code will reveal these possibilities.
The first step to using longpca is the function make_interaction_model. This requires two arguments: the data (tidy data in “long format”) and the formula (explained in detail below).
Right now, there are four ways to specify a model with make_interaction_model. Hopefully, we can find more. Perhaps you will have a way?
Once you make an interaction_model, there are many things you can do. Before running PCA, you can diagnose to see if the data is “too sparse” and, if so, you can core to find the dense core of observed data. You can also pick_dim to estimate how large of a model k your data can infer.
Here, for the introduction, we will just run PCA right away:
library(longpca)
library(nycflights13)
library(tidyverse)   # dplyr, tidyr, and ggplot2 verbs are used throughout
library(lubridate)   # wday() and make_date() are used below
im = make_interaction_model(flights, ~ (month & day)*(dest))
pcs = pca(im, k = 6)
This performs PCA. The key innovation is the “formula”: ~ (month & day)*(dest), which specifies the model that we want. If you are already comfortable with matrices and the SVD, then you can imagine this as the matrix we will apply SVD to (in the future, this will be imagined as fitting a statistical model via least squares).
The basic syntax of the formula is this: outcome ~ unit * context.
The outcome can be left blank (Specification 1 below); otherwise, it should be a variable in the data. unit and context can each be a variable in the data, or they might be specified by multiple variables in the data. For example, in the formula ~ (month & day)*(dest), the units are specified by (month & day).
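To make the template concrete, here are the three formulas used later in this vignette, written against the flights data (each one is discussed in the specifications below):
~ (month & day)*(dest)              # Specification 1: empty left side, counts co-occurrences
seats ~ (month & day)*dest          # Specification 2: sums the outcome over duplicate rows
arr_delay ~ (month & day)*(dest)    # Specification 3: averages the outcome (with duplicates = "average")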
The next section discusses the various ways of using the formula.
Four ways to specify an interaction_model
Specification 1: Empty left-side
The first and most basic use (perhaps also the most powerful) is to not specify an outcome: ~ (month & day)*(dest).
Specification 1 happens when the left side of the formula is empty. It specifies a matrix where
- the units (or rows) are indexed by (month & day), two variables of the flights data,
- the context (or columns) are indexed by destinations dest, and
- the outcome (or elements of the matrix) is the number of flights to each dest on each (month & day).
So, there are 365 rows (one for each day), about 100 columns (one for each destination), and the sum of all the elements in the matrix is 336,776, the number of rows in the flights data.
Said another way, if you leave the left side empty, it counts co-occurrences of (month & day) with dest and makes a matrix of those co-occurrences. Some people call this matrix a cross-tab or a contingency table. When we do PCA on this matrix of counts, some folks call that Correspondence Analysis.
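As a quick sanity check of those numbers, you can pull the underlying matrix out of the im object built above with get_Matrix (described later in this vignette); a minimal sketch:
A = get_Matrix(im)   # the sparse matrix of co-occurrence counts
dim(A)               # roughly 365 days by ~100 destinations
sum(A)               # 336776, one count for each row of flights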
Specification 2: A variable on the left side that counts something
In this formulation, the outcome (on the left side of the formula) is a variable. For example, nycflights13 has the number of seats available in each plane. We can join this data to the flights to find the number of seats available on each flight.
dat = flights %>%
  select(month, day, dest, tailnum) %>%
  left_join(planes %>% select(tailnum, seats))
#> Joining with `by = join_by(tailnum)`
Then, we can use the formula seats ~ (month & day)*dest with the new data dat in make_interaction_model. This is Specification 2. It happens when (i) we specify an outcome on the left-hand side of the formula and (ii) we use the default settings of make_interaction_model.
im_seats = make_interaction_model(dat, seats ~ (month & day)*dest)
In the formula above, the units and context are identical to the example for Specification 1. The difference is that now the outcome is the total number of seats that flew to each destination on each day.
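For intuition, here is roughly what each cell of that matrix holds, written as an ordinary dplyr summary (a sketch of the idea only, not of what make_interaction_model does internally; in particular, how missing seats values are handled may differ):
dat %>%
  group_by(month, day, dest) %>%
  summarize(total_seats = sum(seats, na.rm = TRUE), .groups = "drop")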
In fact, Specification 2 is very similar to Specification 1. Imagine that flights had a variable called 1 and every element of that variable is a numeric 1. Then, the formula 1 ~ (month & day)*(dest) could be interpreted via Specification 2. And in fact, you can type that formula without having a variable 1 in your data, and make_interaction_model will understand it the same as ~ (month & day)*(dest).
Specification 3: The outcome variable (the thing on the left side of the formula) should be averaged
The flights data contains a column arr_delay, which gives the arrival delay for the flight. One possible model is arr_delay ~ (month & day)*(dest). It does not make sense to simply add up the arrival delays; it makes sense to average them. This is specified in make_interaction_model with the argument duplicates = "average":
make_interaction_model(flights, arr_delay ~ (month & day)*(dest), duplicates = "average")
This uses the same units and context as the last two examples, but it is helpful to begin imagining things that correspond to arrival delays and how they might interact. From this, we build our model. For example, in my experience, flights later in the day are more likely to be delayed. Also, weekday flights are worse than weekend flights. I wonder if there is an interaction between them.
day_dat = flights |>
  mutate(day = wday(time_hour, label = TRUE, abbr = FALSE))
im_delays = day_dat |>
  select(hour, day, arr_delay) |>
  make_interaction_model(arr_delay ~ hour*day, duplicates = "average")
Specification 3 happens when (i) there is a variable on the left-hand side of the formula, (ii) duplicates = "average" is set in make_interaction_model, and (iii) typically, you then want to use pca_na instead of pca to compute the principal components:
pc_delays = pca_na(im_delays, k = 3)
#> Taking 3 core. Starting with:
#> 20 rows
#> 7 columns
#> 133 observed values[1] "adding graph summaries (coreness and connected components)."
#> After taking 3 core. There remain:
#> 19 rows
#> 7 columns
#> 133 observed values
Often, many (or most) of the possible combinations of the (unit, context) pairs do not appear in the data. For example, there were no flights to MSN (Madison, Wisconsin) on January 5th. When most of the possible combinations do not appear, we say that the data is “sparse”. (You can diagnose the level of sparsity with diagnose and you can look at a “dense subset” of the data with core.)
If any of the combinations do not appear in the data, then you need to decide if PCA should understand those values to be zero or NA. You need to think about your problem to decide which one makes sense. In Specifications 1 and 2, it makes sense to make the missing elements zero because the sum of zero numbers is zero. In this case, you use the function pca. In Specification 3, it likely makes sense to have the missing values be NA because the average of zero numbers is undefined (NA). In this case, call pca_na, which computes the PCs via softImpute::softImpute.
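In short, the choice looks like this (a minimal sketch, reusing the objects built above):
pcs_counts = pca(im, k = 6)            # Specifications 1 and 2: missing cells are treated as 0
pcs_delays = pca_na(im_delays, k = 3)  # Specification 3: missing cells are treated as NA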
Another way to specify this model (i.e. a different interaction to explore) would be to make the unit (day & hour) and the context dest:
im_dest_delay = day_dat |>
  make_interaction_model(arr_delay ~ (day & hour)*dest, duplicates = "average")
Specification 4: The variable after the * is a text field (or a sequence) that needs to be parsed
There are no good motivating examples for Specification 4 in the flights data. Instead, please see the parse_text vignette. You can access that functionality with the argument parse_text = TRUE inside make_interaction_model.
What is inside the output of pca
The output of pca (and pca_na) contains the PCs and their loadings in a tidy format:
im = make_interaction_model(flights, ~ (month & day)*(dest))
pcs = pca(im, k = 6)
pcs$row_features %>% sample_n(size = 3)
#> # A tibble: 3 × 12
#> month day n row_num degree weighted_degree pc_1_rows pc_2_rows pc_3_rows
#> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 10 23 975 143 84 975 1.03 -1.02 0.684
#> 2 3 20 970 161 83 970 1.03 1.28 1.17
#> 3 8 26 982 109 87 982 1.03 -0.827 0.00865
#> # ℹ 3 more variables: pc_4_rows <dbl>, pc_5_rows <dbl>, pc_6_rows <dbl>
pcs$column_features %>% sample_n(size = 3)
#> # A tibble: 3 × 11
#> dest n col_num degree weighted_degree pc_1_columns pc_2_columns
#> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 LEX 1 104 1 1 0.00135 -0.00124
#> 2 MKE 2802 35 365 2802 0.935 -0.0786
#> 3 BWI 1781 48 365 1781 0.726 1.94
#> # ℹ 4 more variables: pc_3_columns <dbl>, pc_4_columns <dbl>,
#> # pc_5_columns <dbl>, pc_6_columns <dbl>
Because these outputs are tidy, they are pretty easy to ggplot.
First, let’s do the units (i.e. dates/rows). To interpret PCs, it is best to plot them in their native space. For dates, the native space is a time series or a sequence. Let’s plot them there. I give my interpretation after the plots.
im = make_interaction_model(flights, ~ (month & day)*(dest))
pcs = pca(im, k = 6)
pcs$row_features %>%
  mutate(date = make_date(day = day, month = month, year = 2013)) %>%
  select(date, contains("pc_")) %>%
  pivot_longer(contains("pc_"), names_to = "pc_dimension", values_to = "loadings") %>%
  ggplot(aes(x = date, y = loadings)) + geom_line() +
  facet_wrap(~pc_dimension, scales = "free") + geom_smooth()
#> `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
I always think of the first PC as the “mean”. What we see is that flights are more or less constant throughout the year (see the y-axis). I presume that the oscillations are for the weekends; pc_1 says that, across destinations, there are more flights during the work week and fewer flights on the weekends. The second pc gives a seasonal effect (fewer flights in winter, more in summer); importantly, after pc_1, some destinations will have negative values of this (i.e. more in the winter, fewer in the summer). The third pc is positive on weekend destinations (more flights on the weekends and fewer during the weekdays relative to pc_1). Again, like pc_2, some destinations will have a negative value (i.e. more flights on the weekends and fewer during the weekdays relative to the previous two pcs). The last three are harder to interpret. My uninformed guess is that they are some artifact of airline decisions. If you have a guess, I’d love to hear it. Also, later on with pick_dim, we have some evidence that they are noise.
Now, let’s do the columns (i.e. destinations). The “native space” for destinations is a map. Let’s plot it there. Be sure you have maps installed.
airports %>% sample_n(size = 3)
#> # A tibble: 3 × 8
#> faa name lat lon alt tz dst tzone
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/New…
#> 2 GTB Wheeler Sack Aaf 44.1 -75.7 690 -5 A America/New…
#> 3 MTH Florida Keys Marathon Airport 24.7 -81.1 7 -5 A America/New…
# first, get the lat and lon for the airports:
airport_dat = pcs$column_features %>%
  left_join(airports %>% select(dest = faa, lat, lon)) %>%
  select(lat, lon, contains("_col")) %>%
  pivot_longer(contains("pc_"),
               names_to = "pc_dimension", values_to = "loadings") %>%
  drop_na()
#> Joining with `by = join_by(dest)`
library(maps)
usa_map <- map_data("state")
p <- ggplot() +
  geom_polygon(data = usa_map, aes(x = long, y = lat, group = group),
               fill = "white", color = "black") +
  coord_fixed(1.3, xlim = c(-125, -65), ylim = c(25, 50))
# i'm only keeping lower 48 states, dropping Anchorage and Honolulu.
p + geom_point(data = airport_dat, aes(x = lon, y = lat,
                                       size = abs(loadings), color = loadings)) +
  facet_wrap(~ pc_dimension) +
  scale_color_gradient2(low = "red", high = "blue", mid = "white")
Here, pc_1 should align with larger and smaller airports (bigger airports <-> more flights throughout the year). pc_2 is negative on Denver and Florida and positive in Maine. Looking back at the time series plots, I interpret this to mean that people go to Denver (skiing) and Florida (beach) in the winter and Maine (coastline) in the summer. pc_3 picks up the “fly-over zone”… looking back at the time series, folks prefer to travel there during the work week. So, the blue areas are more weekend (vacation) destinations and the red areas are the fly-over. The other pcs are difficult for me to interpret (my guess is that they are weird artifacts of airline things… noise). We do see that the last three are heavily localized on a few airports; looking at the pairs plots from plot(pcs) below, you can see this localization. Given that, my sense is that they are not so interesting, but if I needed to make sense of them, I would print out their most extreme elements and dig into those airports. Making this function is a todo item.
So, using the code is easy. You just need to specify a formula. It’s fun to think of other combinations and easy to try them out.
There are three functions that you might like, diagnose, pick_dim, and plot, which are explained below.
A deeper look inside the code.
What is an interaction_model?
To make maximum use of this package, it is helpful to think about models, not matrices. Each of the key functions in this package handles an object of class interaction_model:
formula = ~ (month & day)*dest
im = make_interaction_model(flights, formula)
class(im)
#> [1] "interaction_model"
names(im)
#> [1] "interaction_tibble" "row_universe" "column_universe"
#> [4] "settings"
im is a list of four elements. First, $interaction_tibble, which can be thought of as a sparse matrix in triplet form; get_Matrix(im) uses this to construct a sparse matrix. Then, $row_universe and $column_universe, which can be thought of as holding the information corresponding to each row/column. Finally, $settings contains various details about the construction.
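A quick way to poke around is to look at each piece directly; a minimal sketch (the exact columns you see may differ a little from what is described here):
im$row_universe          # one row per (month & day) unit
im$column_universe       # one row per dest
im$interaction_tibble    # sparse triplet form: one row per observed (row, column, value)
A = get_Matrix(im)       # the corresponding sparse Matrix
dim(A)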
Let’s do a more careful analysis
Examining the matrix sparsity
This package contains a few helper functions. First, if lots of rows or columns have very few non-zero elements, this can cause “localization issues”. The matrix needs to be “dense enough” for the PCA to find good stuff. So, diagnose prints some simple diagnostics and plots the “degree distribution” for the rows and columns. Here, “degree” is the number of non-zero elements in that row or column.
# inspect "degree distributions" with this funciton:
# recall that im is the interaction_model defined above.
diagnose(im)
#> # A tibble: 6 × 3
#> measurement dest `month & day`
#> <chr> <dbl> <dbl>
#> 1 number_of_items 105 365
#> 2 average_degree 297 86
#> 3 median_degree 365 86
#> 4 percent_le_1 2 0
#> 5 percent_le_2 2 0
#> 6 percent_le_3 2 0
For example, if either average degree were less than 10, then I might be worried. Another diagnostic in the printout is percent_le_x, which gives the percent of rows/columns that have row/col sums less than or equal to x. If a majority of rows or columns had degree less than or equal to 3, then I would be worried. These are clues that the matrix is very sparse and you might have trouble. Issues with sparsity will likely manifest as localization, something that will be evaluated in the functions below.
One possibility is to take the “core”:
im_cored = core(im,core_threshold = 3)
#> [1] "adding graph summaries (coreness and connected components)."
nrow(im$column_universe)
#> [1] 105
nrow(im_cored$column_universe)
#> [1] 103
diagnose(im_cored)
#> Warning in scale_y_log10(): log-10 transformation introduced
#> infinite values.
#> Warning: Removed 27 rows containing missing values or values outside the scale range
#> (`geom_bar()`).
#> # A tibble: 6 × 3
#> measurement dest `month & day`
#> <chr> <dbl> <dbl>
#> 1 number_of_items 103 365
#> 2 average_degree 303 86
#> 3 median_degree 365 86
#> 4 percent_le_1 0 0
#> 5 percent_le_2 0 0
#> 6 percent_le_3 0 0
This finds the largest subset of rows and columns such that each row and column has at least core_threshold = 3 data points (and it will also return the largest connected component). In this case, it discards two destination airports, LEX and LGA:
im$column_universe |> anti_join(im_cored$column_universe, by = "dest")
#> # A tibble: 2 × 3
#> dest n col_num
#> <chr> <int> <int>
#> 1 LEX 1 104
#> 2 LGA 1 105
Picking k with cross-validated eigenvalues
When doing a PCA, you need to pick the model size k. The way that we do this in my lab is with cross-validated eigenvalues. It gives you a Z-score and a p-value. Here is the arxiv paper. Alex Hayes and Fan Chen made it a proper R package on CRAN, gdim. For this example, it picks k = 4.
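The call that produced cv_eigs is not shown here. A minimal sketch of what it might look like, assuming pick_dim takes the interaction_model, a maximum candidate dimension, and a number of bootstraps (the actual argument names may differ; check ?pick_dim):
cv_eigs = pick_dim(im, dim_max = 10, num_bootstraps = 5)  # argument names are assumptions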
cv_eigs
#> Estimated graph dimension: 4
#>
#> Number of bootstraps: 5
#> Edge splitting probabaility: 0.1
#> Significance level: 0.05
#>
#> ------------ Summary of Tests ------------
#> k z pvals padj
#> 1 166.9096296 0.000000e+00 0.000000e+00
#> 2 12.5328707 2.467185e-36 2.467185e-36
#> 3 8.7270943 1.306496e-18 1.306496e-18
#> 4 4.7824034 8.660580e-07 8.660580e-07
#> 5 0.6311604 2.639678e-01 2.639678e-01
#> 6 -2.8350333 9.977090e-01 9.977090e-01
#> 7 -5.9380729 1.000000e+00 1.000000e+00
#> 8 -6.0137710 1.000000e+00 1.000000e+00
#> 9 -7.7210977 1.000000e+00 1.000000e+00
#> 10 -7.6142011 1.000000e+00 1.000000e+00
Notice that the top line of the printout says that the estimated graph dimension is 4. We used k = 6 above, and we can see that in this example the dimensions beyond 4 become harder to interpret. This is what we would expect if they were just noise… but also, maybe they are not just noise?
Let’s get the people some PCA
For right now, the default of make_interaction_model is that if there are multiple rows of the long data that have the same values for (month & day) and also for dest, then the value inside the matrix is the sum of the values on the left-hand side of the formula. If there is no variable specified on the left-hand side, then it is like imagining there is an additional column in your data, 1, that has the value 1 in every row. So, formula = ~ (month & day)*(dest) is equivalent to formula = 1 ~ (month & day)*(dest). Summing that 1 over the duplicate rows counts how many times each entry appears.
Because it is a count, and the variance-stabilizing transformation for a Poisson is the square root, the default in the code for pca is to take the square root of each count. Then, the code computes the normalized and regularized Laplacian L (using the number of non-zero entries as the degree). Then, it computes the leading k singular vectors. This is all done with sparse linear algebra via the packages Matrix and irlba.
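For readers who want to see the matrices anyway, here is a rough by-hand version of that pipeline, following the description above. This is a sketch: it assumes get_Matrix returns a standard sparse Matrix of counts, and the exact regularization constants used by pca may differ.
library(Matrix)
library(irlba)
A = get_Matrix(im)                 # sparse counts: (month & day) x dest
A@x = sqrt(A@x)                    # square root each count (variance stabilization)
deg_row = rowSums(A != 0)          # "degree" = number of non-zero entries in each row
deg_col = colSums(A != 0)          # ... and in each column
tau_row = mean(deg_row)            # regularization; pca() may use different constants
tau_col = mean(deg_col)
D_row = Diagonal(x = 1 / sqrt(deg_row + tau_row))
D_col = Diagonal(x = 1 / sqrt(deg_col + tau_col))
L = D_row %*% A %*% D_col          # normalized and regularized Laplacian
s = irlba(L, nv = 6)               # leading k = 6 singular vectors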
The row_features and column_features are the PCs and loadings (I don’t prefer those old terms). middle_B gives the singular values. settings contains some details that are handy in later functions.
sample_n(pcs$row_features, size = 3)
#> # A tibble: 3 × 12
#> month day n row_num degree weighted_degree pc_1_rows pc_2_rows pc_3_rows
#> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 4 23 965 177 82 965 1.04 0.146 -0.671
#> 2 6 26 995 40 88 995 1.04 -0.719 -0.00845
#> 3 6 22 812 307 88 812 0.928 -0.368 2.24
#> # ℹ 3 more variables: pc_4_rows <dbl>, pc_5_rows <dbl>, pc_6_rows <dbl>
sample_n(pcs$column_features, size=3)
#> # A tibble: 3 × 11
#> dest n col_num degree weighted_degree pc_1_columns pc_2_columns
#> <chr> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 EYW 17 99 17 17 0.0215 0.423
#> 2 SDF 1157 56 365 1157 0.596 -0.526
#> 3 HDN 15 100 15 15 0.0183 0.612
#> # ℹ 4 more variables: pc_3_columns <dbl>, pc_4_columns <dbl>,
#> # pc_5_columns <dbl>, pc_6_columns <dbl>
Notice that these features are in a wide and tidy form, making it easy to use lubridate::make_date (for row_features) and to left-join with airports (to get latitude and longitude) for column_features.
Diagnostic plots
You can plot(pcs). It makes these five plots, each described after all plots are displayed.
plot(pcs)
#> Press [Enter] to continue to the next plot...
#> Press [Enter] to continue to the next plot...
#> `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
#> Press [Enter] to continue to the next plot...
#> Press [Enter] to continue to the next plot...
These are the five plots:
- Screeplot: the top k singular values of L.
- Better screeplot: its singular values 2:k (because the first one is usually dominant and it is difficult to see an elbow past it).
- A “localization plot”, which is very similar (maybe exact?) to this stuff; for each row (and column), compute its degree and its leverage score. Take the log of both. Fit a linear model log(leverage) ~ log(degree) and plot the residuals against log(degree). If there is localization, I suspect that there will be a big curl on the right side. See the sketch after this list for a hand-rolled version.
- Pairs plot of row_features. This is the plot emphasized in the varimax paper. In these example plots, we do not see very clear radial streaks.
- A pairs plot of column_features. In both pairs plots, if there are more than 1000 points, then the code samples 1000 points with probability proportional to their leverage scores. It will plot up to k = 10 dimensions. If k is larger, then it plots the first 5 and the last 5.
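For the curious, the localization plot can be roughly reproduced by hand from row_features; a minimal sketch (the leverage scores computed here from the pc_*_rows columns may differ slightly from what plot uses internally):
lev_dat = pcs$row_features %>%
  mutate(leverage = rowSums(across(contains("pc_"))^2)) %>%
  filter(degree > 0, leverage > 0) %>%
  mutate(log_degree = log(degree), log_leverage = log(leverage))
fit = lm(log_leverage ~ log_degree, data = lev_dat)
lev_dat %>%
  mutate(residual = resid(fit)) %>%
  ggplot(aes(x = log_degree, y = residual)) +
  geom_point() +
  geom_smooth()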