# 6. Singular Value Decomposition (SVD)¶

In addition to regular packages contained in Anaconda by default, this lecture also requires:

```
!pip install quandl
```

```
Collecting quandl
```

```
Downloading Quandl-3.7.0-py2.py3-none-any.whl (26 kB)
Requirement already satisfied: numpy>=1.8 in /usr/share/miniconda3/envs/quantecon/lib/python3.8/site-packages (from quandl) (1.20.3)
Requirement already satisfied: inflection>=0.3.1 in /usr/share/miniconda3/envs/quantecon/lib/python3.8/site-packages (from quandl) (0.5.1)
Requirement already satisfied: six in /usr/share/miniconda3/envs/quantecon/lib/python3.8/site-packages (from quandl) (1.16.0)
Requirement already satisfied: requests>=2.7.0 in /usr/share/miniconda3/envs/quantecon/lib/python3.8/site-packages (from quandl) (2.26.0)
Requirement already satisfied: pandas>=0.14 in /usr/share/miniconda3/envs/quantecon/lib/python3.8/site-packages (from quandl) (1.3.4)
Requirement already satisfied: more-itertools in /usr/share/miniconda3/envs/quantecon/lib/python3.8/site-packages (from quandl) (8.10.0)
Requirement already satisfied: python-dateutil in /usr/share/miniconda3/envs/quantecon/lib/python3.8/site-packages (from quandl) (2.8.2)
```

```
Requirement already satisfied: pytz>=2017.3 in /usr/share/miniconda3/envs/quantecon/lib/python3.8/site-packages (from pandas>=0.14->quandl) (2021.3)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/share/miniconda3/envs/quantecon/lib/python3.8/site-packages (from requests>=2.7.0->quandl) (2.0.4)
Requirement already satisfied: certifi>=2017.4.17 in /usr/share/miniconda3/envs/quantecon/lib/python3.8/site-packages (from requests>=2.7.0->quandl) (2021.10.8)
Requirement already satisfied: idna<4,>=2.5 in /usr/share/miniconda3/envs/quantecon/lib/python3.8/site-packages (from requests>=2.7.0->quandl) (3.2)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/share/miniconda3/envs/quantecon/lib/python3.8/site-packages (from requests>=2.7.0->quandl) (1.26.7)
```

```
Installing collected packages: quandl
```

```
Successfully installed quandl-3.7.0
```

```
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt
%matplotlib inline
import quandl as ql
import pandas as pd
```

## 6.1. Overview¶

The **singular value decomposition** is a work-horse in applications of least squares projection that
form the backbone of important parts of modern machine learning methods.

This lecture describes the singular value decomposition and two of its uses:

principal components analysis (PCA)

dynamic mode decomposition (DMD)

Each of these can be thought of as data-reduction methods that are designed to capture principal patterns in data by projecting data onto a limited set of factors.

## 6.2. The Setup¶

Let \(X\) be an \(m \times n\) matrix of rank \(r\).

In this lecture, we’ll think of \(X\) as a matrix of **data**.

each column is an

**individual**– a time period or person, depending on the applicationeach row is a

**random variable**measuring an attribute of a time period or a person, depending on the application

We’ll be interested in two distinct cases

A

**short and fat**case in which \(m << n\), so that there are many more columns than rows.A

**tall and skinny**case in which \(m >> n\), so that there are many more rows than columns.

We’ll apply a **singular value decomposition** of \(X\) in both situations.

In the first case in which there are many more observations \(n\) than random variables \(m\), we learn about the joint distribution of the random variables by taking averages across observations of functions of the observations.

Here we’ll look for **patterns** by using a **singular value decomosition** to do a **principal components analysis** (PCA).

In the second case in which there are many more random variables \(m\) than observations \(n\), we’ll proceed in a different way.

We’ll again use a **singular value decomposition**, but now to do a **dynamic mode decomposition** (DMD)

## 6.3. Singular Value Decomposition¶

A **singular value decomposition** of an \(m \times n\) matrix \(X\) of rank \(r \leq \min(m,n)\) is

where

where

\(U\) is an \(m \times m\) matrix whose columns are eigenvectors of \(X^T X\)

\(V\) is an \(n \times n\) matrix whose columns are eigenvectors of \(X X^T\)

\(\Sigma\) is an \(m \times r\) matrix in which the first \(r\) places on its main diagonal are positive numbers \(\sigma_1, \sigma_2, \ldots, \sigma_r\) called

**singular values**; remaining entries of \(\Sigma\) are all zeroThe \(r\) singular values are square roots of the eigenvalues of the \(m \times m\) matrix \(X X^T\) and the \(n \times n\) matrix \(X^T X\)

When \(U\) is a complex valued matrix, \(U^T\) denotes the

**conjugate-transpose**or**Hermitian-transpose**of \(U\), meaning that \(U_{ij}^T\) is the complex conjugate of \(U_{ji}\).Similarly, when \(V\) is a complex valued matrix, \(V^T\) denotes the

**conjugate-transpose**or**Hermitian-transpose**of \(V\)

The shapes of \(U\), \(\Sigma\), and \(V\) are \(\left(m, m\right)\), \(\left(m, n\right)\), \(\left(n, n\right)\), respectively.

Below, we shall assume these shapes.

However, though we chose not to, there is an alternative shape convention that we could have used.

Thus, note that because we assume that \(A\) has rank \(r\), there are only \(r \) nonzero singular values, where \(r=\textrm{rank}(A)\leq\min\left(m, n\right)\).

Therefore, we could also write \(U\), \(\Sigma\), and \(V\) as matrices with shapes \(\left(m, r\right)\), \(\left(r, r\right)\), \(\left(r, n\right)\).

Sometimes, we will choose the former one to be consistent with what is adopted by `numpy`

.

At other times, we’ll use the latter convention in which \(\Sigma\) is an \(r \times r\) diagonal matrix.

Also, when we discuss the **dynamic mode decomposition** below, we’ll use a special case of the latter convention in which it is understood that
\(r\) is just a pre-specified small number of leading singular values that we think capture the most interesting dynamics.

## 6.4. Digression: Polar Decomposition¶

Through the following identities, the singular value decomposition (SVD) is related to the **polar decomposition** of \(X\)

where \(S\) is evidently a symmetric matrix and \(Q\) is an orthogonal matrix.

## 6.5. Principle Components Analysis (PCA)¶

Let’s begin with a case in which \(n >> m\), so that we have many more observations \(n\) than random variables \(m\).

The data matrix \(X\) is **short and fat** in an \(n >> m\) case as opposed to a **tall and skinny** case with \(m > > n \) to be discussed later in this lecture.

We regard \(X\) as an \(m \times n\) matrix of **data**:

where for \(j = 1, \ldots, n\) the column vector \(X_j = \begin{bmatrix}X_{1j}\\X_{2j}\\\vdots\\X_{mj}\end{bmatrix}\) is a vector of observations on variables \(\begin{bmatrix}x_1\\x_2\\\vdots\\x_m\end{bmatrix}\).

In a **time series** setting, we would think of columns \(j\) as indexing different **times** at which random variables are observed, while rows index different random variables.

In a **cross section** setting, we would think of columns \(j\) as indexing different **individuals** for which random variables are observed, while rows index different **random variables**.

The number of singular values equals the rank of matrix \(X\).

Arrange the singular values in decreasing order.

Arrange the positive singular values on the main diagonal of the matrix \(\Sigma\) of into a vector \(\sigma_R\).

Set all other entries of \(\Sigma\) to zero.

## 6.6. Relationship of PCA to SVD¶

To relate a SVD to a PCA (principal component analysis) of data set \(X\), first construct the SVD of the data matrix \(X\):

where

In equation (6.1), each of the \(m \times n\) matrices \(U_{j}V_{j}^T\) is evidently of rank \(1\).

Thus, we have

Here is how we would interpret the objects in the matrix equation (6.2) in a time series context:

\( V_{k}^T= \begin{bmatrix}V_{k1} & V_{k2} & \ldots & V_{kn}\end{bmatrix} \quad \textrm{for each} \ k=1, \ldots, n \) is a time series \(\lbrace V_{kj} \rbrace_{j=1}^n\) for the \(k\)th principal component

\(U_j = \begin{bmatrix}U_{1k}\\U_{2k}\\\ldots\\U_{mk}\end{bmatrix} \ k=1, \ldots, m\) is a vector of loadings of variables \(X_i\) on the \(k\)th principle component, \(i=1, \ldots, m\)

\(\sigma_k \) for each \(k=1, \ldots, r\) is the strength of \(k\)th

**principal component**

## 6.7. Reduced Versus Full SVD¶

You can read about reduced and full SVD here https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html

Let’s do a small experiment to see the difference

```
import numpy as np
X = np.random.rand(5,2)
U, S, V = np.linalg.svd(X,full_matrices=True) # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD
print('U, S, V ='), U, S, V
```

```
U, S, V =
```

```
(None,
array([[-0.24206603, 0.06289228, -0.37106607, -0.56479415, -0.69337302],
[-0.30685803, -0.37828315, -0.74432356, 0.44343507, 0.10994444],
[-0.47185752, -0.6846985 , 0.52260407, 0.01313754, -0.18775207],
[-0.46404262, 0.56628115, 0.18613475, 0.55910981, -0.34167262],
[-0.63973207, 0.25191279, -0.02304866, -0.41424212, 0.59594887]]),
array([1.49124531, 0.4202209 ]),
array([[-0.8096618 , -0.58689673],
[-0.58689673, 0.8096618 ]]))
```

```
print('Uhat, Shat, Vhat = '), Uhat, Shat, Vhat
```

```
Uhat, Shat, Vhat =
```

```
(None,
array([[-0.24206603, 0.06289228],
[-0.30685803, -0.37828315],
[-0.47185752, -0.6846985 ],
[-0.46404262, 0.56628115],
[-0.63973207, 0.25191279]]),
array([1.49124531, 0.4202209 ]),
array([[-0.8096618 , -0.58689673],
[-0.58689673, 0.8096618 ]]))
```

```
rr = np.linalg.matrix_rank(X)
rr
```

```
2
```

## 6.8. PCA with Eigenvalues and Eigenvectors¶

We now turn to using the eigen decomposition of a sample covariance matrix to do PCA.

Let \(X_{m \times n}\) be our \(m \times n\) data matrix.

Let’s assume that sample means of all variables are zero.

We can make sure that this is true by **pre-processing** the data by substracting sample means appropriately.

Define the sample covariance matrix \(\Omega\) as

Then use an eigen decomposition to represent \(\Omega\) as follows:

Here

\(P\) is \(m×m\) matrix of eigenvectors of \(\Omega\)

\(\Lambda\) is a diagonal matrix of eigenvalues of \(\Omega\)

We can then represent \(X\) as

where

We can verify that

It follows that we can represent the data matrix as

where

To reconcile the preceding representation with the PCA that we obtained through the SVD above, we first note that \(\epsilon_j^2=\lambda_j\equiv\sigma^2_j\).

Now define \(\tilde{\epsilon_j} = \frac{\epsilon_j}{\sqrt{\lambda_j}}\) which evidently implies that \(\tilde{\epsilon}_j\tilde{\epsilon}_j^T=1\).

Therefore

which evidently agrees with

provided that we set

\(U_j=P_j\) (the loadings of variables on principal components)

\({V_k}^{T}=\tilde{\epsilon_k}\) (the principal components)

Since there are several possible ways of computing \(P\) and \(U\) for given a data matrix \(X\), depending on algorithms used, we might have sign differences or different orders between eigenvectors.

We want a way that leads to the same \(U\) and \(P\).

In the following, we accomplish this by

sorting eigenvalues and singular values in descending order

imposing positive diagonals on \(P\) and \(U\) and adjusting signs in \(V^T\) accordingly

## 6.9. Connections¶

To pull things together, it is useful to assemble and compare some formulas presented above.

First, consider the following SVD of an \(m \times n\) matrix:

Compute:

Thus, \(U\) in the SVD is the matrix \(P\) of eigenvectors of \(XX^T\) and \(\Sigma \Sigma^T\) is the matrix \(\Lambda\) of eigenvalues.

Second, let’s compute

Thus, the matrix \(V\) in the SVD is the matrix of eigenvectors of \(X^TX\)

Summarizing and fitting things together, we have the eigen decomposition of the sample covariance matrix

where \(P\) is an orthogonal matrix.

Further, from the SVD of \(X\), we know that

where \(U\) is an orthonal matrix.

Thus, \(P = U\) and we have the representation of \(X\)

It follows that

Note that the preceding implies that

so that everything fits together.

Below we define a class `DecomAnalysis`

that wraps PCA and SVD for a given a data matrix `X`

.

```
class DecomAnalysis:
"""
A class for conducting PCA and SVD.
"""
def __init__(self, X, n_component=None):
self.X = X
self.Ω = (X @ X.T)
self.m, self.n = X.shape
self.r = LA.matrix_rank(X)
if n_component:
self.n_component = n_component
else:
self.n_component = self.m
def pca(self):
𝜆, P = LA.eigh(self.Ω) # columns of P are eigenvectors
ind = sorted(range(𝜆.size), key=lambda x: 𝜆[x], reverse=True)
# sort by eigenvalues
self.𝜆 = 𝜆[ind]
P = P[:, ind]
self.P = P @ diag_sign(P)
self.Λ = np.diag(self.𝜆)
self.explained_ratio_pca = np.cumsum(self.𝜆) / self.𝜆.sum()
# compute the N by T matrix of principal components
self.𝜖 = self.P.T @ self.X
P = self.P[:, :self.n_component]
𝜖 = self.𝜖[:self.n_component, :]
# transform data
self.X_pca = P @ 𝜖
def svd(self):
U, 𝜎, VT = LA.svd(self.X)
ind = sorted(range(𝜎.size), key=lambda x: 𝜎[x], reverse=True)
# sort by eigenvalues
d = min(self.m, self.n)
self.𝜎 = 𝜎[ind]
U = U[:, ind]
D = diag_sign(U)
self.U = U @ D
VT[:d, :] = D @ VT[ind, :]
self.VT = VT
self.Σ = np.zeros((self.m, self.n))
self.Σ[:d, :d] = np.diag(self.𝜎)
𝜎_sq = self.𝜎 ** 2
self.explained_ratio_svd = np.cumsum(𝜎_sq) / 𝜎_sq.sum()
# slicing matrices by the number of components to use
U = self.U[:, :self.n_component]
Σ = self.Σ[:self.n_component, :self.n_component]
VT = self.VT[:self.n_component, :]
# transform data
self.X_svd = U @ Σ @ VT
def fit(self, n_component):
# pca
P = self.P[:, :n_component]
𝜖 = self.𝜖[:n_component, :]
# transform data
self.X_pca = P @ 𝜖
# svd
U = self.U[:, :n_component]
Σ = self.Σ[:n_component, :n_component]
VT = self.VT[:n_component, :]
# transform data
self.X_svd = U @ Σ @ VT
def diag_sign(A):
"Compute the signs of the diagonal of matrix A"
D = np.diag(np.sign(np.diag(A)))
return D
```

We also define a function that prints out information so that we can compare decompositions obtained by different algorithms.

```
def compare_pca_svd(da):
"""
Compare the outcomes of PCA and SVD.
"""
da.pca()
da.svd()
print('Eigenvalues and Singular values\n')
print(f'λ = {da.λ}\n')
print(f'σ^2 = {da.σ**2}\n')
print('\n')
# loading matrices
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
plt.suptitle('loadings')
axs[0].plot(da.P.T)
axs[0].set_title('P')
axs[0].set_xlabel('m')
axs[1].plot(da.U.T)
axs[1].set_title('U')
axs[1].set_xlabel('m')
plt.show()
# principal components
fig, axs = plt.subplots(1, 2, figsize=(14, 5))
plt.suptitle('principal components')
axs[0].plot(da.ε.T)
axs[0].set_title('ε')
axs[0].set_xlabel('n')
axs[1].plot(da.VT[:da.r, :].T * np.sqrt(da.λ))
axs[1].set_title('$V^T*\sqrt{\lambda}$')
axs[1].set_xlabel('n')
plt.show()
```

## 6.10. Dynamic Mode Decomposition (DMD)¶

We now turn to the case in which \(m >>n \) so that there are many more random variables \(m\) than observations \(n\).

This is the **tall and skinny** case associated with **Dynamic Mode Decomposition**.

You can read about Dynamic Mode Decomposition here [KBBWP16].

Starting with an \(m \times n \) matrix of data \(X\), we form two matrices

and

In forming \(\tilde X\) and \(\tilde X'\), we have in each case dropped a column from \(X\).

Evidently, \(\tilde X\) and \(\tilde X'\) are both \(m \times \tilde n\) matrices where \(\tilde n = n - 1\).

We start with a system consisting of \(m\) least squares regressions of **everything** on one lagged value of **everything**:

where

and where the (huge) \(m \times m \) matrix \(X^{+}\) is the Moore-Penrose generalized inverse of \(X\) that we could compute as

where the matrix \(\Sigma^{-1}\) is constructed by replacing each non-zero element of \(\Sigma\) with \(\sigma_j^{-1}\).

The idea behind **dynamic mode decomposition** is to construct an approximation that

sidesteps computing the generalized inverse \(X^{+}\)

retains only the largest \(\tilde r< < r\) eigenvalues and associated eigenvectors of \(U\) and \(V^T\)

constructs an \(m \times \tilde r\) matrix \(\Phi\) that captures effects of \(r\) dynamic modes on all \(m\) variables

uses \(\Phi\) and the \(\tilde r\) leading singular values to forecast

*future*\(X_t\)’s

The magic of **dynamic mode decomposition** is that we accomplish this without ever computing the regression coefficients \(A = X' X^{+}\).

To accomplish a DMD, we deploy the following steps:

Compute the singular value decomposition

\[ X = U \Sigma V^T \]where \(U\) is \(m \times r\), \(\Sigma\) is an \(r \times r\) diagonal matrix, and \(V^T\) is an \(r \times \tilde n\) matrix.

Notice that (though it would be costly), we could compute \(A\) by solving

\[ A = X' V \Sigma^{-1} U^T \]But we won’t do that.

Instead we’ll proceed as follows.

Note that since, \(X' = A U \Sigma V^T\), we know that

\[ A U = X' V \Sigma^{-1} \]so that

\[ U^T X' V \Sigma^{-1} = U^T A U \equiv \tilde A \]At this point, in constructing \(\tilde A\) according to the above formula, we take only the columns of \(U\) corresponding to the \(\tilde r\) largest singular values.

Tu et al. [TRL+14] verify that eigenvalues and eigenvectors of \(\tilde A\) equal the leading eigenvalues and associated eigenvectors of \(A\).

Construct an eigencomposition of \(\tilde A\) that satisfies

\[ \tilde A W = W \Lambda \]where \(\Lambda\) is a \(\tilde r \times \tilde r\) diagonal matrix of eigenvalues and the columns of \(W\) are corresponding eigenvectors of \(\tilde A\). Both \(\Lambda\) and \(W\) are \(\tilde r \times \tilde r\) matrices.

Construct the \(m \times \tilde r\) matrix

\[ \Phi = X' V \Sigma^{-1} W \]Let \(\Phi^{+}\) be a generalized inverse of \(\Phi\); \(\Phi^{+}\) is an \(\tilde r \times m\) matrix.

Define an initial vector \(b\) of dominant modes by

\[ b= \Phi^{+} X_1 \]where evidently \(b\) is an \(\tilde r \times 1\) vector.

With \(\Lambda, \Phi\) in hand, our least-squares fitted dynamics fitted to the \(r\) dominant modes are governed by

Conditional on \(X_t\), forecasts \(\check X_{t+j} \) of \(X_{t+j}, j = 1, 2, \ldots, \) are evidently given by

## 6.11. Source for Some Python Code¶

You can find a Python implementation of DMD here: