# 7. Singular Value Decomposition (SVD)¶

In addition to regular packages contained in Anaconda by default, this lecture also requires:

```
!pip install quandl
```

```
Collecting quandl
```

```
Downloading Quandl-3.7.0-py2.py3-none-any.whl (26 kB)
```

```
Requirement already satisfied: pandas>=0.14 in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (from quandl) (1.4.4)
Requirement already satisfied: numpy>=1.8 in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (from quandl) (1.23.5)
Requirement already satisfied: inflection>=0.3.1 in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (from quandl) (0.5.1)
Requirement already satisfied: python-dateutil in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (from quandl) (2.8.2)
Requirement already satisfied: six in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (from quandl) (1.16.0)
```

```
Collecting more-itertools
Downloading more_itertools-9.0.0-py3-none-any.whl (52 kB)
?25l ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/52.8 kB ? eta -:--:--
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 52.8/52.8 kB 10.9 MB/s eta 0:00:00
?25hRequirement already satisfied: requests>=2.7.0 in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (from quandl) (2.28.1)
Requirement already satisfied: pytz>=2020.1 in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (from pandas>=0.14->quandl) (2022.7)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (from requests>=2.7.0->quandl) (1.26.11)
Requirement already satisfied: charset-normalizer<3,>=2 in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (from requests>=2.7.0->quandl) (2.0.4)
```

```
Requirement already satisfied: idna<4,>=2.5 in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (from requests>=2.7.0->quandl) (3.3)
Requirement already satisfied: certifi>=2017.4.17 in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (from requests>=2.7.0->quandl) (2022.9.14)
```

```
Installing collected packages: more-itertools, quandl
```

```
Successfully installed more-itertools-9.0.0 quandl-3.7.0
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
```

```
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt
%matplotlib inline
import quandl as ql
import pandas as pd
```

## 7.1. Overview¶

The **singular value decomposition** (SVD) is a work-horse in applications of least squares projection that
form foundations for many statistical and machine learning methods.

After defining the SVD, we’ll describe how it connects to

the

**four fundamental spaces**of linear algebraunder-determined and over-determined

**least squares regressions****principal components analysis**(PCA)

We’ll also tell the essential role that the SVD plays in

dynamic mode decomposition (DMD)

Like principal components analysis (PCA), DMD can be thought of as a data-reduction procedure that represents salient patterns by projecting data onto a limited set of factors.

## 7.2. The Setting¶

Let \(X\) be an \(m \times n\) matrix of rank \(p\).

Necessarily, \(p \leq \min(m,n)\).

In much of this lecture, we’ll think of \(X\) as a matrix of **data** in which

each column is an

**individual**– a time period or person, depending on the applicationeach row is a

**random variable**describing an attribute of a time period or a person, depending on the application

We’ll be interested in two situations

A

**short and fat**case in which \(m << n\), so that there are many more columns (individuals) than rows (attributes).A

**tall and skinny**case in which \(m >> n\), so that there are many more rows (attributes) than columns (individuals).

We’ll apply a **singular value decomposition** of \(X\) in both situations.

In the \( m < < n\) case in which there are many more individuals \(n\) than attributes \(m\), we can calculate sample moments of a joint distribution by taking averages across observations of functions of the observations.

In this \( m < < n\) case, we’ll look for **patterns** by using a **singular value decomposition** to do a **principal components analysis** (PCA).

In the \(m > > n\) case in which there are many more attributes \(m\) than individuals \(n\) and when we are in a time-series setting in which \(n\) equals the number of time periods covered in the data set \(X\), we’ll proceed in a different way.

We’ll again use a **singular value decomposition**, but now to construct a **dynamic mode decomposition** (DMD)

## 7.3. Singular Value Decomposition¶

A **singular value decomposition** of an \(m \times n\) matrix \(X\) of rank \(p \leq \min(m,n)\) is

where

and

\(U\) is an \(m \times m\) orthogonal matrix of

**left singular vectors**of \(X\)Columns of \(U\) are eigenvectors of \(X^T X\)

\(V\) is an \(n \times n\) orthogonal matrix of

**right singular values**of \(X\)Columns of \(V\) are eigenvectors of \(X X^T\)

\(\Sigma\) is an \(m \times n\) matrix in which the first \(p\) places on its main diagonal are positive numbers \(\sigma_1, \sigma_2, \ldots, \sigma_p\) called

**singular values**; remaining entries of \(\Sigma\) are all zeroThe \(p\) singular values are positive square roots of the eigenvalues of the \(m \times m\) matrix \(X X^T\) and also of the \(n \times n\) matrix \(X^T X\)

We adopt a convention that when \(U\) is a complex valued matrix, \(U^T\) denotes the

**conjugate-transpose**or**Hermitian-transpose**of \(U\), meaning that \(U_{ij}^T\) is the complex conjugate of \(U_{ji}\).Similarly, when \(V\) is a complex valued matrix, \(V^T\) denotes the

**conjugate-transpose**or**Hermitian-transpose**of \(V\)

The matrices \(U,\Sigma,V\) entail linear transformations that reshape in vectors in the following ways:

multiplying vectors by the unitary matrices \(U\) and \(V\)

**rotates**them, but leaves**angles between vectors**and**lengths of vectors**unchanged.multiplying vectors by the diagonal matrix \(\Sigma\) leaves

**angles between vectors**unchanged but**rescales**vectors.

Thus, representation (7.1) asserts that multiplying an \(n \times 1\) vector \(y\) by the \(m \times n\) matrix \(X\) amounts to performing the following three multiplcations of \(y\) sequentially:

**rotating**\(y\) by computing \(V^T y\)**rescaling**\(V^T y\) by multipying it by \(\Sigma\)**rotating**\(\Sigma V^T y\) by multiplying it by \(U\)

This structure of the \(m \times n\) matrix \(X\) opens the door to constructing systems
of data **encoders** and **decoders**.

Thus,

\(V^T y\) is an encoder

\(\Sigma\) is an operator to be applied to the encoded data

\(U\) is a decoder to be applied to the output from applying operator \(\Sigma\) to the encoded data

We’ll apply this circle of ideas later in this lecture when we study Dynamic Mode Decomposition.

**Road Ahead**

What we have described above is called a **full** SVD.

In a **full** SVD, the shapes of \(U\), \(\Sigma\), and \(V\) are \(\left(m, m\right)\), \(\left(m, n\right)\), \(\left(n, n\right)\), respectively.

Later we’ll also describe an **economy** or **reduced** SVD.

Before we study a **reduced** SVD we’ll say a little more about properties of a **full** SVD.

## 7.4. Four Fundamental Subspaces¶

Let \({\mathcal C}\) denote a column space, \({\mathcal N}\) denote a null space, and \({\mathcal R}\) denote a row space.

Let’s start by recalling the four fundamental subspaces of an \(m \times n\) matrix \(X\) of rank \(p\).

The

**column space**of \(X\), denoted \({\mathcal C}(X)\), is the span of the columns of \(X\), i.e., all vectors \(y\) that can be written as linear combinations of columns of \(X\). Its dimension is \(p\).The

**null space**of \(X\), denoted \({\mathcal N}(X)\) consists of all vectors \(y\) that satisfy \(X y = 0\). Its dimension is \(m-p\).The

**row space**of \(X\), denoted \({\mathcal R}(X)\) is the column space of \(X^T\). It consists of all vectors \(z\) that can be written as linear combinations of rows of \(X\). Its dimension is \(p\).The

**left null space**of \(X\), denoted \({\mathcal N}(X^T)\), consist of all vectors \(z\) such that \(X^T z =0\). Its dimension is \(n-p\).

For a full SVD of a matrix \(X\), the matrix \(U\) of left singular vectors and the matrix \(V\) of right singular vectors contain orthogonal bases for all four subspaces.

They form two pairs of orthogonal subspaces that we’ll describe now.

Let \(u_i, i = 1, \ldots, m\) be the \(m\) column vectors of \(U\) and let \(v_i, i = 1, \ldots, n\) be the \(n\) column vectors of \(V\).

Let’s write the full SVD of X as

where \( \Sigma_p\) is a \(p \times p\) diagonal matrix with the \(p\) singular values on the diagonal and

Representation (7.2) implies that

or

or

Equations (7.4) tell how the transformation \(X\) maps a pair of orthonormal vectors \(v_i, v_j\) for \(i\) and \(j\) both less than or equal to the rank \(p\) of \(X\) into a pair of orthonormal vectors \(u_i, u_j\).

Equations (7.3) assert that

Taking transposes on both sides of representation (7.2) implies

or

or

Notice how equations (7.6) assert that the transformation \(X^T\) maps a pairsof distinct orthonormal vectors \(u_i, u_j\) for \(i\) and \(j\) both less than or equal to the rank \(p\) of \(X\) into a pair of distinct orthonormal vectors \(v_i, v_j\) .

Equations (7.5) assert that

Thus, taken together, the systems of quations (7.3) and (7.5) describe the four fundamental subspaces of \(X\) in the following ways:

Since \(U\) and \(V\) are both orthonormal matrices, collection (7.7) asserts that

\(U_L\) is an orthonormal basis for the column space of \(X\)

\(U_R\) is an orthonormal basis for the null space of \(X^T\)

\(V_L\) is an orthonormal basis for the row space of \(X\)

\(V_R\) is an orthonormal basis for the null space of \(X\)

We have verified the four claims in (7.7) simply by performing the multiplications called for by the right side of (7.2) and reading them.

The claims in (7.7) and the fact that \(U\) and \(V\) are both unitary (i.e, orthonormal) matrices imply that

the column space of \(X\) is orthogonal to the null space of of \(X^T\)

the null space of \(X\) is orthogonal to the row space of \(X\)

Sometimes these properties are described with the following two pairs of orthogonal complement subspaces:

\({\mathcal C}(X)\) is the orthogonal complement of \( {\mathcal N}(X^T)\)

\({\mathcal R}(X)\) is the orthogonal complement \({\mathcal N}(X)\)

Let’s do an example.

```
np.set_printoptions(precision=2)
# Define the matrix
A = np.array([[1, 2, 3, 4, 5],
[2, 3, 4, 5, 6],
[3, 4, 5, 6, 7],
[4, 5, 6, 7, 8],
[5, 6, 7, 8, 9]])
# Compute the SVD of the matrix
U, S, V = np.linalg.svd(A,full_matrices=True)
# Compute the rank of the matrix
rank = np.linalg.matrix_rank(A)
# Print the rank of the matrix
print("Rank of matrix:\n", rank)
print("S: \n", S)
# Compute the four fundamental subspaces
row_space = U[:, :rank]
col_space = V[:, :rank]
null_space = V[:, rank:]
left_null_space = U[:, rank:]
print("U:\n", U)
print("Column space:\n", col_space)
print("Left null space:\n", left_null_space)
print("V.T:\n", V.T)
print("Row space:\n", row_space.T)
print("Right null space:\n", null_space.T)
```

```
Rank of matrix:
2
S:
[2.69e+01 1.86e+00 7.83e-16 3.27e-16 4.69e-17]
U:
[[-0.27 -0.73 -0.47 0.06 -0.42]
[-0.35 -0.42 0.1 -0.18 0.81]
[-0.43 -0.11 0.75 -0.27 -0.4 ]
[-0.51 0.19 0.06 0.83 0.05]
[-0.59 0.5 -0.45 -0.45 -0.04]]
Column space:
[[-0.27 -0.35]
[ 0.73 0.42]
[ 0.02 0.06]
[ 0.52 -0.83]
[-0.37 -0.08]]
Left null space:
[[-0.47 0.06 -0.42]
[ 0.1 -0.18 0.81]
[ 0.75 -0.27 -0.4 ]
[ 0.06 0.83 0.05]
[-0.45 -0.45 -0.04]]
V.T:
[[-0.27 0.73 0.02 0.52 -0.37]
[-0.35 0.42 0.06 -0.83 -0.08]
[-0.43 0.11 0.29 0.18 0.82]
[-0.51 -0.19 -0.83 0.06 0.04]
[-0.59 -0.5 0.46 0.07 -0.42]]
Row space:
[[-0.27 -0.35 -0.43 -0.51 -0.59]
[-0.73 -0.42 -0.11 0.19 0.5 ]]
Right null space:
[[-0.43 0.11 0.29 0.18 0.82]
[-0.51 -0.19 -0.83 0.06 0.04]
[-0.59 -0.5 0.46 0.07 -0.42]]
```

## 7.5. Eckart-Young Theorem¶

Suppose that we want to construct the best rank \(r\) approximation of an \(m \times n\) matrix \(X\).

By best we mean a matrix \(X_r\) of rank \(r < p\) that, among all rank \(r\) matrices, minimizes

where \( || \cdot || \) denotes a norm of a matrix \(X\) and where \(X_r\) belongs to the space of all rank \(r\) matrices of dimension \(m \times n\).

Three popular **matrix norms** of an \(m \times n\) matrix \(X\) can be expressed in terms of the singular values of \(X\)

the

**spectral**or \(l^2\) norm \(|| X ||_2 = \max_{y \in \textbf{R}^n} \frac{||X y ||}{||y||} = \sigma_1\)the

**Frobenius**norm \(||X ||_F = \sqrt{\sigma_1^2 + \cdots + \sigma_p^2}\)the

**nuclear**norm \( || X ||_N = \sigma_1 + \cdots + \sigma_p \)

The Eckart-Young theorem states that for each of these three norms, same rank \(r\) matrix is best and that it equals

You can read about the Eckart-Young theorem and some of its uses here https://en.wikipedia.org/wiki/Low-rank_approximation.

We’ll make use of this theorem when we discuss principal components analysis (PCA) and also dynamic mode decomposition (DMD).

## 7.6. Full and Reduced SVD’s¶

Up to now we have described properties of a **full** SVD in which shapes of \(U\), \(\Sigma\), and \(V\) are \(\left(m, m\right)\), \(\left(m, n\right)\), \(\left(n, n\right)\), respectively.

There is an alternative bookkeeping convention called an **economy** or **reduced** SVD in which the shapes of \(U, \Sigma\) and \(V\) are different from what they are in a full SVD.

Thus, note that because we assume that \(X\) has rank \(p\), there are only \(p\) nonzero singular values, where \(p=\textrm{rank}(X)\leq\min\left(m, n\right)\).

A **reduced** SVD uses this fact to express \(U\), \(\Sigma\), and \(V\) as matrices with shapes \(\left(m, p\right)\), \(\left(p, p\right)\), \(\left( n, p\right)\).

You can read about reduced and full SVD here https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html

For a full SVD,

But not all these properties hold for a **reduced** SVD.

Which properties hold depend on whether we are in a **tall-skinny** case or a **short-fat** case.

In a

**tall-skinny**case in which \(m > > n\), for a**reduced**SVD

In a

**short-fat**case in which \(m < < n\), for a**reduced**SVD

When we study Dynamic Mode Decomposition below, we shall want to remember these properties when we use a reduced SVD to compute some DMD representations.

Let’s do an exercise to compare **full** and **reduced** SVD’s.

To review,

in a

**full**SVD\(U\) is \(m \times m\)

\(\Sigma\) is \(m \times n\)

\(V\) is \(n \times n\)

in a

**reduced**SVD\(U\) is \(m \times p\)

\(\Sigma\) is \(p\times p\)

\(V\) is \(n \times p\)

First, let’s study a case in which \(m = 5 > n = 2\).

(This is a small example of the **tall-skinny** case that will concern us when we study **Dynamic Mode Decompositions** below.)

```
import numpy as np
X = np.random.rand(5,2)
U, S, V = np.linalg.svd(X,full_matrices=True) # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD
print('U, S, V =')
U, S, V
```

```
U, S, V =
```

```
(array([[-0.36, -0.48, -0.51, -0.51, -0.35],
[-0.5 , -0.51, 0.6 , 0.33, -0.15],
[-0.51, 0.62, 0.34, -0.49, -0.1 ],
[-0.5 , 0.33, -0.5 , 0.61, -0.11],
[-0.34, -0.16, -0.12, -0.13, 0.91]]),
array([2.22, 0.29]),
array([[-0.65, -0.76],
[-0.76, 0.65]]))
```

```
print('Uhat, Shat, Vhat = ')
Uhat, Shat, Vhat
```

```
Uhat, Shat, Vhat =
```

```
(array([[-0.36, -0.48],
[-0.5 , -0.51],
[-0.51, 0.62],
[-0.5 , 0.33],
[-0.34, -0.16]]),
array([2.22, 0.29]),
array([[-0.65, -0.76],
[-0.76, 0.65]]))
```

```
rr = np.linalg.matrix_rank(X)
print(f'rank of X = {rr}')
```

```
rank of X = 2
```

**Properties:**

Where \(U\) is constructed via a full SVD, \(U^T U = I_{p\times p}\) and \(U U^T = I_{m \times m}\)

Where \(\hat U\) is constructed via a reduced SVD, although \(\hat U^T \hat U = I_{p\times p}\) it happens that \(\hat U \hat U^T \neq I_{m \times m}\)

We illustrate these properties for our example with the following code cells.

```
UTU = U.T@U
UUT = U@U.T
print('UUT, UTU = ')
UUT, UTU
```

```
UUT, UTU =
```

```
(array([[ 1.00e+00, 1.24e-16, 9.25e-17, 2.25e-16, 1.38e-16],
[ 1.24e-16, 1.00e+00, -5.67e-17, -7.72e-17, -2.87e-17],
[ 9.25e-17, -5.67e-17, 1.00e+00, -1.31e-17, 3.44e-17],
[ 2.25e-16, -7.72e-17, -1.31e-17, 1.00e+00, 5.61e-17],
[ 1.38e-16, -2.87e-17, 3.44e-17, 5.61e-17, 1.00e+00]]),
array([[ 1.00e+00, 2.36e-16, 2.86e-16, 1.56e-16, 9.77e-17],
[ 2.36e-16, 1.00e+00, -5.50e-20, 5.82e-17, 9.95e-17],
[ 2.86e-16, -5.50e-20, 1.00e+00, 8.35e-17, 4.01e-17],
[ 1.56e-16, 5.82e-17, 8.35e-17, 1.00e+00, 4.13e-17],
[ 9.77e-17, 9.95e-17, 4.01e-17, 4.13e-17, 1.00e+00]]))
```

```
UhatUhatT = Uhat@Uhat.T
UhatTUhat = Uhat.T@Uhat
print('UhatUhatT, UhatTUhat= ')
UhatUhatT, UhatTUhat
```

```
UhatUhatT, UhatTUhat=
```

```
(array([[ 0.36, 0.42, -0.11, 0.02, 0.2 ],
[ 0.42, 0.51, -0.06, 0.08, 0.25],
[-0.11, -0.06, 0.64, 0.46, 0.07],
[ 0.02, 0.08, 0.46, 0.36, 0.11],
[ 0.2 , 0.25, 0.07, 0.11, 0.14]]),
array([[1.00e+00, 2.36e-16],
[2.36e-16, 1.00e+00]]))
```

**Remarks:**

The cells above illustrate application of the `fullmatrices=True`

and `full-matrices=False`

options.
Using `full-matrices=False`

returns a reduced singular value decomposition.

The **full** and **reduced** SVd’s both accurately decompose an \(m \times n\) matrix \(X\)

When we study Dynamic Mode Decompositions below, it will be important for us to remember the preceding properties of full and reduced SVD’s in such tall-skinny cases.

Now let’s turn to a short-fat case.

To illustrate this case, we’ll set \(m = 2 < 5 = n \) and compute both full and reduced SVD’s.

```
import numpy as np
X = np.random.rand(2,5)
U, S, V = np.linalg.svd(X,full_matrices=True) # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD
print('U, S, V = ')
U, S, V
```

```
U, S, V =
```

```
(array([[-0.54, -0.84],
[-0.84, 0.54]]),
array([1.72, 0.5 ]),
array([[-0.46, -0.68, -0.37, -0.4 , -0.19],
[ 0.63, -0.63, 0.43, 0. , -0.14],
[-0.55, 0.03, 0.82, -0.15, -0.06],
[-0.3 , -0.3 , -0.02, 0.91, -0.07],
[-0.05, -0.24, 0.04, -0.02, 0.97]]))
```

```
print('Uhat, Shat, Vhat = ')
Uhat, Shat, Vhat
```

```
Uhat, Shat, Vhat =
```

```
(array([[-0.54, -0.84],
[-0.84, 0.54]]),
array([1.72, 0.5 ]),
array([[-0.46, -0.68, -0.37, -0.4 , -0.19],
[ 0.63, -0.63, 0.43, 0. , -0.14]]))
```

Let’s verify that our reduced SVD accurately represents \(X\)

```
SShat=np.diag(Shat)
np.allclose(X, Uhat@SShat@Vhat)
```

```
True
```

## 7.7. Polar Decomposition¶

A **reduced** singular value decomposition (SVD) of \(X\) is related to a **polar decomposition** of \(X\)

where

Here

\(S\) is an \(m \times m\)

**symmetric**matrix\(Q\) is an \(m \times n\)

**orthogonal**matrix

and in our reduced SVD

\(U\) is an \(m \times p\) orthonormal matrix

\(\Sigma\) is a \(p \times p\) diagonal matrix

\(V\) is an \(n \times p\) orthonormal

## 7.8. Principal Components Analysis (PCA)¶

Let’s begin with a case in which \(n >> m\), so that we have many more individuals \(n\) than attributes \(m\).

The matrix \(X\) is **short and fat** in an \(n >> m\) case as opposed to a **tall and skinny** case with \(m > > n \) to be discussed later.

We regard \(X\) as an \(m \times n\) matrix of **data**:

where for \(j = 1, \ldots, n\) the column vector \(X_j = \begin{bmatrix}X_{1j}\\X_{2j}\\\vdots\\X_{mj}\end{bmatrix}\) is a vector of observations on variables \(\begin{bmatrix}x_1\\x_2\\\vdots\\x_m\end{bmatrix}\).

In a **time series** setting, we would think of columns \(j\) as indexing different **times** at which random variables are observed, while rows index different random variables.

In a **cross section** setting, we would think of columns \(j\) as indexing different **individuals** for which random variables are observed, while rows index different **attributes**.

The number of positive singular values equals the rank of matrix \(X\).

Arrange the singular values in decreasing order.

Arrange the positive singular values on the main diagonal of the matrix \(\Sigma\) of into a vector \(\sigma_R\).

Set all other entries of \(\Sigma\) to zero.

## 7.9. Relationship of PCA to SVD¶

To relate a SVD to a PCA (principal component analysis) of data set \(X\), first construct the SVD of the data matrix \(X\):

where

In equation (7.9), each of the \(m \times n\) matrices \(U_{j}V_{j}^T\) is evidently of rank \(1\).

Thus, we have

Here is how we would interpret the objects in the matrix equation (7.10) in a time series context:

\( \textrm{for each} \ k=1, \ldots, n \), the object \(\lbrace V_{kj} \rbrace_{j=1}^n\) is a time series for the \(k\)th

**principal component**\(U_j = \begin{bmatrix}U_{1k}\\U_{2k}\\\ldots\\U_{mk}\end{bmatrix} \ k=1, \ldots, m\) is a vector of

**loadings**of variables \(X_i\) on the \(k\)th principal component, \(i=1, \ldots, m\)\(\sigma_k \) for each \(k=1, \ldots, p\) is the strength of \(k\)th

**principal component**, where strength means contribution to the overall covariance of \(X\).

## 7.10. PCA with Eigenvalues and Eigenvectors¶

We now use an eigen decomposition of a sample covariance matrix to do PCA.

Let \(X_{m \times n}\) be our \(m \times n\) data matrix.

Let’s assume that sample means of all variables are zero.

We can assure this by **pre-processing** the data by subtracting sample means.

Define a sample covariance matrix \(\Omega\) as

Then use an eigen decomposition to represent \(\Omega\) as follows:

Here

\(P\) is \(m×m\) matrix of eigenvectors of \(\Omega\)

\(\Lambda\) is a diagonal matrix of eigenvalues of \(\Omega\)

We can then represent \(X\) as

where

and

We can verify that

It follows that we can represent the data matrix \(X\) as

To reconcile the preceding representation with the PCA that we had obtained earlier through the SVD, we first note that \(\epsilon_j^2=\lambda_j\equiv\sigma^2_j\).

Now define \(\tilde{\epsilon_j} = \frac{\epsilon_j}{\sqrt{\lambda_j}}\), which implies that \(\tilde{\epsilon}_j\tilde{\epsilon}_j^T=1\).

Therefore

which agrees with

provided that we set

\(U_j=P_j\) (a vector of loadings of variables on principal component \(j\))

\({V_k}^{T}=\tilde{\epsilon_k}\) (the \(k\)th principal component)

Because there are alternative algorithms for computing \(P\) and \(U\) for given a data matrix \(X\), depending on algorithms used, we might have sign differences or different orders between eigenvectors.

We can resolve such ambiguities about \(U\) and \(P\) by

sorting eigenvalues and singular values in descending order

imposing positive diagonals on \(P\) and \(U\) and adjusting signs in \(V^T\) accordingly

## 7.11. Connections¶

To pull things together, it is useful to assemble and compare some formulas presented above.

First, consider an SVD of an \(m \times n\) matrix:

Compute:

Compare representation (7.12) with equation (7.11) above.

Evidently, \(U\) in the SVD is the matrix \(P\) of eigenvectors of \(XX^T\) and \(\Sigma \Sigma^T\) is the matrix \(\Lambda\) of eigenvalues.

Second, let’s compute

Thus, the matrix \(V\) in the SVD is the matrix of eigenvectors of \(X^TX\)

Summarizing and fitting things together, we have the eigen decomposition of the sample covariance matrix

where \(P\) is an orthogonal matrix.

Further, from the SVD of \(X\), we know that

where \(U\) is an orthonal matrix.

Thus, \(P = U\) and we have the representation of \(X\)

It follows that

Note that the preceding implies that

so that everything fits together.

Below we define a class `DecomAnalysis`

that wraps PCA and SVD for a given a data matrix `X`

.

```
class DecomAnalysis:
"""
A class for conducting PCA and SVD.
"""
def __init__(self, X, n_component=None):
self.X = X
self.Ω = (X @ X.T)
self.m, self.n = X.shape
self.r = LA.matrix_rank(X)
if n_component:
self.n_component = n_component
else:
self.n_component = self.m
def pca(self):
𝜆, P = LA.eigh(self.Ω) # columns of P are eigenvectors
ind = sorted(range(𝜆.size), key=lambda x: 𝜆[x], reverse=True)
# sort by eigenvalues
self.𝜆 = 𝜆[ind]
P = P[:, ind]
self.P = P @ diag_sign(P)
self.Λ = np.diag(self.𝜆)
self.explained_ratio_pca = np.cumsum(self.𝜆) / self.𝜆.sum()
# compute the N by T matrix of principal components
self.𝜖 = self.P.T @ self.X
P = self.P[:, :self.n_component]
𝜖 = self.𝜖[:self.n_component, :]
# transform data
self.X_pca = P @ 𝜖
def svd(self):
U, 𝜎, VT = LA.svd(self.X)
ind = sorted(range(𝜎.size), key=lambda x: 𝜎[x], reverse=True)
# sort by eigenvalues
d = min(self.m, self.n)
self.𝜎 = 𝜎[ind]
U = U[:, ind]
D = diag_sign(U)
self.U = U @ D
VT[:d, :] = D @ VT[ind, :]
self.VT = VT
self.Σ = np.zeros((self.m, self.n))
self.Σ[:d, :d] = np.diag(self.𝜎)
𝜎_sq = self.𝜎 ** 2
self.explained_ratio_svd = np.cumsum(𝜎_sq) / 𝜎_sq.sum()
# slicing matrices by the number of components to use
U = self.U[:, :self.n_component]
Σ = self.Σ[:self.n_component, :self.n_component]
VT = self.VT[:self.n_component, :]
# transform data
self.X_svd = U @ Σ @ VT
def fit(self, n_component):
# pca
P = self.P[:, :n_component]
𝜖 = self.𝜖[:n_component, :]
# transform data
self.X_pca = P @ 𝜖
# svd
U = self.U[:, :n_component]
Σ = self.Σ[:n_component, :n_component]
VT = self.VT[:n_component, :]
# transform data
self.X_svd = U @ Σ @ VT
def diag_sign(A):
"Compute the signs of the diagonal of matrix A"
D = np.diag(np.sign(np.diag(A)))
return D
```

We also define a function that prints out information so that we can compare decompositions obtained by different algorithms.

```
def compare_pca_svd(da):
"""
Compare the outcomes of PCA and SVD.
"""
da.pca()
da.svd()
print('Eigenvalues and Singular values\n')
print(f'λ = {da.λ}\n')
print(f'σ^2 = {da.σ**2}\n')
print('\n')
# loading matrices
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
plt.suptitle('loadings')
axs[0].plot(da.P.T)
axs[0].set_title('P')
axs[0].set_xlabel('m')
axs[1].plot(da.U.T)
axs[1].set_title('U')
axs[1].set_xlabel('m')
plt.show()
# principal components
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
plt.suptitle('principal components')
axs[0].plot(da.ε.T)
axs[0].set_title('ε')
axs[0].set_xlabel('n')
axs[1].plot(da.VT[:da.r, :].T * np.sqrt(da.λ))
axs[1].set_title('$V^T*\sqrt{\lambda}$')
axs[1].set_xlabel('n')
plt.show()
```

For an example PCA applied to analyzing the structure of intelligence tests see this lecture Multivariable Normal Distribution.

Look at parts of that lecture that describe and illustrate the classic factor analysis model.

## 7.12. Vector Autoregressions¶

We want to fit a **first-order vector autoregression**

where \(\epsilon_{t+1}\) is the time \(t+1\) instance of an i.i.d. \(m \times 1\) random vector with mean vector zero and identity covariance matrix and where the \( m \times 1 \) vector \( X_t \) is

and where \( T \) again denotes complex transposition and \( X_{i,t} \) is an observation on variable \( i \) at time \( t \).

We want to fit equation (7.13).

Our data are organized in an \( m \times (n+1) \) matrix \( \tilde X \)

where for \( t = 1, \ldots, n +1 \), the \( m \times 1 \) vector \( X_t \) is given by (7.14).

Thus, we want to estimate a system (7.13) that consists of \( m \) least squares regressions of **everything** on one lagged value of **everything**.

The \(i\)’th equation of (7.13) is a regression of \(X_{i,t+1}\) on the vector \(X_t\).

We proceed as follows.

From \( \tilde X \), we form two \(m \times n\) matrices

and

Here \( ' \) is part of the name of the matrix \( X' \) and does not indicate matrix transposition.

We continue to use \(\cdot^T\) to denote matrix transposition or its extension to complex matrices.

In forming \( X \) and \( X' \), we have in each case dropped a column from \( \tilde X \), the last column in the case of \( X \), and the first column in the case of \( X' \).

Evidently, \( X \) and \( X' \) are both \( m \times n \) matrices.

We denote the rank of \( X \) as \( p \leq \min(m, n) \).

Two possible cases are

\( n > > m\), so that we have many more time series observations \(n\) than variables \(m\)

\(m > > n\), so that we have many more variables \(m \) than time series observations \(n\)

At a general level that includes both of these special cases, a common formula describes the least squares estimator \(\hat A\) of \(A\).

But important details differ.

The common formula is

where \(X^+\) is the pseudo-inverse of \(X\).

To read about the **Moore-Penrose pseudo-inverse** please see Moore-Penrose pseudo-inverse

Applicable formulas for the pseudo-inverse differ for our two cases.

**Short-Fat Case:**

When \( n > > m\), so that we have many more time series observations \(n\) than variables \(m\) and when
\(X\) has linearly independent **rows**, \(X X^T\) has an inverse and the pseudo-inverse \(X^+\) is

Here \(X^+\) is a **right-inverse** that verifies \( X X^+ = I_{m \times m}\).

In this case, our formula (7.15) for the least-squares estimator of the population matrix of regression coefficients \(A\) becomes

This formula for least-squares regression coefficients widely used in econometrics.

For example, it is used to estimate vector autorgressions.

The right side of formula (7.16) is proportional to the empirical cross second moment matrix of \(X_{t+1}\) and \(X_t\) times the inverse of the second moment matrix of \(X_t\).

**Tall-Skinny Case:**

When \(m > > n\), so that we have many more attributes \(m \) than time series observations \(n\) and when \(X\) has linearly independent **columns**,
\(X^T X\) has an inverse and the pseudo-inverse \(X^+\) is

Here \(X^+\) is a **left-inverse** that verifies \(X^+ X = I_{n \times n}\).

In this case, our formula (7.15) for a least-squares estimator of \(A\) becomes

Please compare formulas (7.16) and (7.17) for \(\hat A\).

Here we are interested in formula (7.17).

The \( i \)th row of \( \hat A \) is an \( m \times 1 \) vector of regression coefficients of \( X_{i,t+1} \) on \( X_{j,t}, j = 1, \ldots, m \).

If we use formula (7.17) to calculate \(\hat A X\) we find that

so that the regression equation **fits perfectly**.

This is the usual outcome in an **underdetermined least-squares** model.

To reiterate, in our **tall-skinny** case in which we have a number \(n\) of observations that is small relative to the number \(m\) of
attributes that appear in the vector \(X_t\), we want to fit equation (7.13).

To offer ideas about how we can efficiently calculate the pseudo-inverse \(X^+\), as our estimator \(\hat A\) of \(A\) we form an \(m \times m\) matrix that solves the least-squares best-fit problem

where \(|| \cdot ||_F\) denotes the Frobenius (or Euclidean) norm of a matrix.

The Frobenius norm is defined as

The minimizer of the right side of equation (7.18) is

where the (possibly huge) \( n \times m \) matrix \( X^{+} = (X^T X)^{-1} X^T\) is again a pseudo-inverse of \( X \).

For some situations that we are interested in, \(X^T X \) can be close to singular, a situation that can make some numerical algorithms be error-prone.

To acknowledge that possibility, we’ll use efficient algorithms for computing and for constructing reduced rank approximations of \(\hat A\) in formula (7.17).

The \( i \)th row of \( \hat A \) is an \( m \times 1 \) vector of regression coefficients of \( X_{i,t+1} \) on \( X_{j,t}, j = 1, \ldots, m \).

An efficient way to compute the pseudo-inverse \(X^+\) is to start with a singular value decomposition

where we remind ourselves that for a **reduced** SVD, \(X\) is an \(m \times n\) matrix of data, \(U\) is an \(m \times p\) matrix, \(\Sigma\) is a \(p \times p\) matrix, and \(V\) is an \(n \times p\) matrix.

We can efficiently construct the pertinent pseudo-inverse \(X^+\) by recognizing the following string of equalities.

(Since we are in the \(m > > n\) case in which \(V^T V = I_{p \times p}\) in a reduced SVD, we can use the preceding string of equalities for a reduced SVD as well as for a full SVD.)

Thus, we shall construct a pseudo-inverse \( X^+ \) of \( X \) by using a singular value decomposition of \(X\) in equation (7.20) to compute

where the matrix \( \Sigma^{-1} \) is constructed by replacing each non-zero element of \( \Sigma \) with \( \sigma_j^{-1} \).

We can use formula (7.22) together with formula (7.19) to compute the matrix \( \hat A \) of regression coefficients.

Thus, our estimator \(\hat A = X' X^+\) of the \(m \times m\) matrix of coefficients \(A\) is

## 7.13. Dynamic Mode Decomposition (DMD)¶

We turn to the **tall and skinny** case associated with **Dynamic Mode Decomposition**, the case in which \( m >>n \).

Here an \( m \times n \) data matrix \( \tilde X \) contains many more attributes \( m \) than individuals \( n \).

Dynamic mode decomposition was introduced by [Sch10],

You can read about Dynamic Mode Decomposition here [KBBWP16] and here [BK19] (section 7.2).

The key idea underlying **Dynamic Mode Decomposition** (DMD) is to compute a rank \( r < p > \) approximation to the least square regression coefficients \( \hat A \) that we described above by formula (7.23).

We’ll build up gradually to a formulation that is typically used in applications of DMD.

We’ll do this by describing three alternative representations of our first-order linear dynamic system, i.e., our vector autoregression.

**Guide to three representations:** In practice, we’ll be interested in Representation 3. We present the first 2 in order to set the stage for some intermediate steps that might help us understand what is under the hood of Representation 3. In applications, we’ll use only a small subset of the DMD to approximate dynamics. To do that, we’ll want to use the **reduced** SVD’s affiliated with representation 3, not the **full** SVD’s affiliated with representations 1 and 2.

**Guide to impatient reader:** In our applications, we’ll be using Representation 3. You might want to skip
the stage-setting representations 1 and 2 on first reading.

## 7.14. Representation 1¶

In this representation, we shall use a **full** SVD of \(X\).

We use the \(m\) **columns** of \(U\), and thus the \(m\) **rows** of \(U^T\), to define a \(m \times 1\) vector \(\tilde b_t\) as

The original data \(X_t\) can be represented as

(Here we use \(b\) to remind ourselves that we are creating a **basis** vector.)

Since we are now using a **full** SVD, \(U U^T = I_{m \times m}\).

So it follows from equation (7.24) that we can reconstruct \(X_t\) from \(\tilde b_t\).

In particular,

Equation (7.24) serves as an

**encoder**that**rotates**the \(m \times 1\) vector \(X_t\) to become an \(m \times 1\) vector \(\tilde b_t\)Equation (7.25) serves as a

**decoder**that**reconstructs**the \(m \times 1\) vector \(X_t\) by rotating the \(m \times 1\) vector \(\tilde b_t\)

Define a transition matrix for an \(m \times 1\) basis vector \(\tilde b_t\) by

We can recover \(\hat A\) from

Dynamics of the \(m \times 1\) basis vector \(\tilde b_t\) are governed by

To construct forecasts \(\overline X_t\) of future values of \(X_t\) conditional on \(X_1\), we can apply decoders (i.e., rotators) to both sides of this equation and deduce

where we use \(\overline X_{t+1}, t \geq 1 \) to denote a forecast.

## 7.15. Representation 2¶

This representation is related to one originally proposed by [Sch10].

It can be regarded as an intermediate step on the way to obtaining a related representation 3 to be presented later

As with Representation 1, we continue to

use a

**full**SVD and**not**a reduced SVD

As we observed and illustrated earlier in this lecture

(a) for a full SVD \(U U^T = I_{m \times m} \) and \(U^T U = I_{p \times p}\) are both identity matrices

(b) for a reduced SVD of \(X\), \(U^T U \) is not an identity matrix.

As we shall see later, a full SVD is too confining for what we ultimately want to do, namely, cope with situations in which \(U^T U\) is **not** an identity matrix because we use a reduced SVD of \(X\).

But for now, let’s proceed under the assumption that we are using a full SVD so that both of the preceding two requirements (a) and (b) are satisfied.

Form an eigendecomposition of the \(m \times m\) matrix \(\tilde A = U^T \hat A U\) defined in equation (7.26):

where \(\Lambda\) is a diagonal matrix of eigenvalues and \(W\) is an \(m \times m\) matrix whose columns are eigenvectors corresponding to rows (eigenvalues) in \(\Lambda\).

When \(U U^T = I_{m \times m}\), as is true with a full SVD of \(X\), it follows that

According to equation (7.28), the diagonal matrix \(\Lambda\) contains eigenvalues of \(\hat A\) and corresponding eigenvectors of \(\hat A\) are columns of the matrix \(UW\).

It follows that the systematic (i.e., not random) parts of the \(X_t\) dynamics captured by our first-order vector autoregressions are described by

Multiplying both sides of the above equation by \(W^{-1} U^T\) gives

or

where our **encoder** is now

and our **decoder** is

We can use this representation to construct a predictor \(\overline X_{t+1}\) of \(X_{t+1}\) conditional on \(X_1\) via:

In effect, [Sch10] defined an \(m \times m\) matrix \(\Phi_s\) as

and a generalized inverse

[Sch10] then represented equation (7.29) as

Components of the basis vector \( \hat b_t = W^{-1} U^T X_t \equiv \Phi_s^+ X_t\) are often called DMD **modes**, or sometimes also
DMD **projected modes**.

To understand why they are called **projected modes**, notice that

so that the \(m \times p\) matrix

is a matrix of regression coefficients of the \(m \times n\) matrix \(X\) on the \(m \times p\) matrix \(\Phi_s\).

We’ll say more about this interpretation in a related context when we discuss representation 3.

We turn next to an alternative representation suggested by Tu et al. [TRL+14].

It is more appropriate to use this alternative representation when, as is typically the case in practice, we use a reduced SVD.

## 7.16. Representation 3¶

Departing from the procedures used to construct Representations 1 and 2, each of which deployed a **full** SVD, we now use a **reduced** SVD.

Again, we let \(p \leq \textrm{min}(m,n)\) be the rank of \(X\).

Construct a **reduced** SVD

where now \(\tilde U\) is \(m \times p\), \(\tilde \Sigma\) is \( p \times p\), and \(\tilde V^T\) is \(p \times n\).

Our minimum-norm least-squares estimator approximator of \(A\) now has representation

Paralleling a step in Representation 1, define a transition matrix for a rotated \(p \times 1\) state \(\tilde b_t\) by

**Interpretation as projection coefficients**

[BK22] remark that \(\tilde A\) can be interpreted in terms of a projection of \(\hat A\) onto the \(p\) modes in \(\tilde U\).

To verify this, first note that, because \( \tilde U^T \tilde U = I\), it follows that

Next, we’ll just compute the regression coefficients in a projection of \(\hat A\) on \(\tilde U\) using the standard least-square formula

Note that because we are now working with a reduced SVD, \(\tilde U \tilde U^T \neq I\).

Consequently,

and we can’t simply recover \(\hat A\) from \(\tilde A\) and \(\tilde U\).

Nevertheless, we hope for the best and proceed to construct an eigendecomposition of the \(p \times p\) matrix \(\tilde A\):

Mimicking our procedure in Representation 2, we cross our fingers and compute an \(m \times p\) matrix

that corresponds to (7.30) for a full SVD.

At this point, where \(\hat A\) is given by formula (7.33) it is interesting to compute \(\hat A \tilde \Phi_s\):

That
\( \hat A \tilde \Phi_s \neq \tilde \Phi_s \Lambda \) means, that unlike the corresponding situation in Representation 2, columns of \(\tilde \Phi_s = \tilde U \tilde W\)
are **not** eigenvectors of \(\hat A\) corresponding to eigenvalues on the diagonal of matix \(\Lambda\).

But in a quest for eigenvectors of \(\hat A\) that we **can** compute with a reduced SVD, let’s define the \(m \times p\) matrix
\(\Phi\) as

It turns out that columns of \(\Phi\) **are** eigenvectors of \(\hat A\).

This is a consequence of a result established by Tu et al. [TRL+14], which we now present.

**Proposition** The \(p\) columns of \(\Phi\) are eigenvectors of \(\hat A\).

**Proof:** From formula (7.38) we have

Thus, we have deduced that

Let \(\phi_i\) be the \(i\)th column of \(\Phi\) and \(\lambda_i\) be the corresponding \(i\) eigenvalue of \(\tilde A\) from decomposition (7.36).

Writing out the \(m \times 1\) vectors on both sides of equation (7.39) and equating them gives

This equation confirms that \(\phi_i\) is an eigenvector of \(\hat A\) that corresponds to eigenvalue \(\lambda_i\) of both \(\tilde A\) and \(\hat A\).

This concludes the proof.

Also see [BK22] (p. 238)

### 7.16.1. Decoder of \(X\) as a linear projection¶

From eigendecomposition (7.39) we can represent \(\hat A\) as

From formula (7.40) we can deduce the reduced dimension dynamics

where

Since the \(m \times p\) matrix \(\Phi\) has \(p\) linearly independent columns, the generalized inverse of \(\Phi\) is

and so

The \(p \times n\) matrix \(\check b\) is recognizable as a matrix of least squares regression coefficients of the \(m \times n\) matrix \(X\) on the \(m \times p\) matrix \(\Phi\) and consequently

is an \(m \times n\) matrix of least squares projections of \(X\) on \(\Phi\).

By virtue of least-squares projection theory discussed in this quantecon lecture e https://python-advanced.quantecon.org/orth_proj.html, we can represent \(X\) as the sum of the projection \(\check X\) of \(X\) on \(\Phi\) plus a matrix of errors.

To verify this, note that the least squares projection \(\check X\) is related to \(X\) by

or

where \(\epsilon\) is an \(m \times n\) matrix of least squares errors satisfying the least squares orthogonality conditions \(\epsilon^T \Phi =0 \) or

Rearranging the orthogonality conditions (7.45) gives \(X^T \Phi = \check b \Phi^T \Phi\), which implies formula (7.42).

### 7.16.2. A useful approximation¶

There is a useful way to approximate the \(p \times 1\) vector \(\check b_t\) instead of using formula (7.41).

In particular, the following argument adapted from [BK22] (page 240) provides a computationally efficient way to approximate \(\check b_t\).

For convenience, we’ll do this first for time \(t=1\).

For \(t=1\), from equation (7.44) we have

where \(\check b_1\) is a \(p \times 1\) vector.

Recall from representation 1 above that \(X_1 = U \tilde b_1\), where \(\tilde b_1\) is a time \(1\) basis vector for representation 1 and \(U\) is from a full SVD of \(X\).

It then follows from equation (7.44) that

where \(\epsilon_1\) is a least-squares error vector from equation (7.44).

It follows that

Replacing the error term \(U^T \epsilon_1\) by zero, and replacing \(U\) from a full SVD of \(X\) with \(\tilde U\) from a reduced SVD, we obtain an approximation \(\hat b_1\) to \(\tilde b_1\):

Recall that from equation (7.35), \( \tilde A = \tilde U^T X' \tilde V \tilde \Sigma^{-1}\).

It then follows that

and therefore, by the eigendecomposition (7.36) of \(\tilde A\), we have

Consequently,

or

which is a computationally efficient approximation to the following instance of equation (7.41) for the initial vector \(\check b_1\):

(To highlight that (7.47) is an approximation, users of DMD sometimes call components of the basis vector \(\check b_t = \Phi^+ X_t \) the **exact** DMD modes.)

Conditional on \(X_t\), we can compute our decoded \(\check X_{t+j}, j = 1, 2, \ldots \) from either

or use the approximation

We can then use \(\check X_{t+j}\) or \(\hat X_{t+j}\) to forecast \(X_{t+j}\).

### 7.16.3. Using Fewer Modes¶

In applications, we’ll actually want to just a few modes, often three or less.

Some of the preceding formulas assume that we have retained all \(p\) modes associated with the positive singular values of \(X\).

We can adjust our formulas to describe a situation in which we instead retain only the \(r < p\) largest singular values.

In that case, we simply replace \(\tilde \Sigma\) with the appropriate \(r\times r\) matrix of singular values, \(\tilde U\) with the \(m \times r\) matrix whose columns correspond to the \(r\) largest singular values, and \(\tilde V\) with the \(n \times r\) matrix whose columns correspond to the \(r\) largest singular values.

Counterparts of all of the salient formulas above then apply.

## 7.17. Source for Some Python Code¶

You can find a Python implementation of DMD here: