Karl Rohe 2024-01-03

This package introduces a novel formula syntax for PCA. In modern applications (where data is often in “long format”), the formula syntax helps to fluidly imagine PCA without thinking about matrices. In other words, it provides a layer of abstraction above matrices. Given the formula and the (long) data, the code in this package transforms your data into a proper format for fast PCA via sparse linear algebra. The package also provides code to 1) help pick the number of dimensions to compute, 2) diagnose the suitability of PCA (both pre and post PCA), 3) rotate the PCs with varimax, 4) visualize and interpret the dimensions uncovered, and (not yet) 5) make predictions. This package uses “PCA” as a broad term for computing the leading singular vectors of a normalized (sometimes incomplete) matrix. Some might refer to specific instances as factor analysis, correspondence analysis, latent symantic analysis, social network analysis, or low-rank matrix completion, among other possible terms. This is big-tent PCA, all included. `longpca`

is in development. So, functions and syntax might change.

The current approach to PCA (principal components analysis) is *matrix first*. This note begins to explore an alternative path, one that is *model first*. The formula syntax provides an alternative way to think about PCA that makes matrices transparent; completely hidden, unless you want to see the code.

I hope this makes PCA legible for folks that have not yet learned linear algebra (just like linear models are legible without solving linear systems of equations).

I am personally inspired by this approach because (despite the fact that I love matrices and linear algebra) I find that this *model first* way of thinking is so much easier and more direct.

This document gives an illustration with a data analysis of the popular `nycflights13`

data via PCA. Headline: we find two seasonal effects (annual and weekly) and also the “fly-over-zone” (midwest 4ever. ride or die <3 much love to my midwest fam). Code details follow this analysis.

(Disclaimer: this is very early in this project. So, the syntax and the code is likely to change a great deal. Input is very welcome about ways to improve it.)

#### Install

The functions for PCA for the People are contained in an R package `longpca`

. If you do not already have `devtools`

installed, you will first need to install that:

```
install.packages("devtools")
devtools::install_github("karlrohe/longpca")
```

Thank you to Alex Hayes for helpful feedback in this process and suggesting the name `longpca`

.

### PCA the nycflights.

The code is fast and nimble. First you define “the model” with a formula… and some data:

```
formula = 1 ~ (month & day)*(dest)
im = make_interaction_model(flights, formula)
pcs = pca(im, k = 6)
```

There are three functions to run on `im`

: `diagnose`

, `pick_dim`

, and `pca`

.

There are three key functions to run on `pcs`

: `plot`

, `rotate`

, and `top`

.

See the vignettes for further illustrations:

#### Slightly more detail…

The hope is that *model first* PCA with the formula makes interacting with the matrix / linear algebra unnecessary. That said, it might be instructive to understand the class `interaction_model`

to see how it represents a matrix “under the hood”.

The function `make_interaction_model`

constructs a list with the class `interaction_model`

. You can think of this as an abstraction of a matrix…

```
formula = 1 ~ (month & day)*(dest)
im = make_interaction_model(flights,formula)
names(im)
```

`class(im)`

In this “matrix like thing,” the `month & day`

index the rows and `dest`

indexes the columns. This is because `month & day`

come before the interaction `*`

in the formula and `dest`

comes afterwords.

`im$row_universe`

```
## # A tibble: 365 × 4
## month day n row_num
## <int> <int> <int> <int>
## 1 11 27 1014 1
## 2 7 11 1006 2
## 3 7 8 1004 3
## 4 7 10 1004 4
## 5 12 2 1004 5
## 6 7 18 1003 6
## 7 7 25 1003 7
## 8 7 12 1002 8
## 9 7 9 1001 9
## 10 7 17 1001 10
## # ℹ 355 more rows
```

`im$column_universe`

```
## # A tibble: 105 × 3
## dest n col_num
## <chr> <int> <int>
## 1 ORD 17283 1
## 2 ATL 17215 2
## 3 LAX 16174 3
## 4 BOS 15508 4
## 5 MCO 14082 5
## 6 CLT 14064 6
## 7 SFO 13331 7
## 8 FLL 12055 8
## 9 MIA 11728 9
## 10 DCA 9705 10
## # ℹ 95 more rows
```

Then, “the matrix” is in sparse triplet form:

`im$interaction_tibble`

```
## # A tibble: 31,229 × 3
## row_num col_num outcome
## <int> <int> <dbl>
## 1 1 1 52
## 2 1 2 51
## 3 1 3 49
## 4 1 4 43
## 5 1 5 40
## 6 1 6 42
## 7 1 7 43
## 8 1 8 38
## 9 1 9 37
## 10 1 10 28
## # ℹ 31,219 more rows
```

If for any reason you actually wanted the sparse `Matrix`

…

```
A = get_Matrix(im, import_names = TRUE)
str(A)
```

```
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:31229] 0 1 2 3 4 5 6 7 8 9 ...
## ..@ p : int [1:106] 0 365 730 1095 1460 1825 2190 2555 2920 3285 ...
## ..@ Dim : int [1:2] 365 105
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:365] "11/27" "7/11" "7/8" "7/10" ...
## .. ..$ : chr [1:105] "ORD" "ATL" "LAX" "BOS" ...
## ..@ x : num [1:31229] 52 55 55 55 49 54 55 55 54 55 ...
## ..@ factors : list()
```

The hope is that model first PCA with the `interaction_model`

makes data analysis more direct, i.e. that you should not need to think about this matrix (too much). Instead, this path is simply a way to estimate a “low rank” statistical model via least squares.