7. Singular Value Decomposition (SVD)

In addition to regular packages contained in Anaconda by default, this lecture also requires:

!pip install quandl
Collecting quandl
  Downloading Quandl-3.7.0-py2.py3-none-any.whl (26 kB)
Requirement already satisfied: inflection>=0.3.1 in /usr/share/miniconda3/envs/quantecon/lib/python3.9/site-packages (from quandl) (0.5.1)
Requirement already satisfied: requests>=2.7.0 in /usr/share/miniconda3/envs/quantecon/lib/python3.9/site-packages (from quandl) (2.27.1)
Requirement already satisfied: six in /usr/share/miniconda3/envs/quantecon/lib/python3.9/site-packages (from quandl) (1.16.0)
Requirement already satisfied: pandas>=0.14 in /usr/share/miniconda3/envs/quantecon/lib/python3.9/site-packages (from quandl) (1.4.2)
Collecting more-itertools
  Downloading more_itertools-8.13.0-py3-none-any.whl (51 kB)
?25l
     |██████▍                         | 10 kB 43.4 MB/s eta 0:00:01
     |████████████▊                   | 20 kB 49.5 MB/s eta 0:00:01
     |███████████████████             | 30 kB 39.3 MB/s eta 0:00:01
     |█████████████████████████▍      | 40 kB 24.7 MB/s eta 0:00:01
     |███████████████████████████████▊| 51 kB 29.3 MB/s eta 0:00:01
     |████████████████████████████████| 51 kB 1.4 MB/s 
?25hRequirement already satisfied: numpy>=1.8 in /usr/share/miniconda3/envs/quantecon/lib/python3.9/site-packages (from quandl) (1.21.5)
Requirement already satisfied: python-dateutil in /usr/share/miniconda3/envs/quantecon/lib/python3.9/site-packages (from quandl) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/share/miniconda3/envs/quantecon/lib/python3.9/site-packages (from pandas>=0.14->quandl) (2021.3)
Requirement already satisfied: idna<4,>=2.5 in /usr/share/miniconda3/envs/quantecon/lib/python3.9/site-packages (from requests>=2.7.0->quandl) (3.3)
Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/share/miniconda3/envs/quantecon/lib/python3.9/site-packages (from requests>=2.7.0->quandl) (1.26.9)
Requirement already satisfied: certifi>=2017.4.17 in /usr/share/miniconda3/envs/quantecon/lib/python3.9/site-packages (from requests>=2.7.0->quandl) (2021.10.8)
Requirement already satisfied: charset-normalizer~=2.0.0 in /usr/share/miniconda3/envs/quantecon/lib/python3.9/site-packages (from requests>=2.7.0->quandl) (2.0.4)
Installing collected packages: more-itertools, quandl
Successfully installed more-itertools-8.13.0 quandl-3.7.0
import numpy as np
import numpy.linalg as LA
import matplotlib.pyplot as plt
%matplotlib inline
import quandl as ql
import pandas as pd

7.1. Overview

The singular value decomposition is a work-horse in applications of least squares projection that form a foundation for some important machine learning methods.

This lecture describes the singular value decomposition and two of its uses:

  • principal components analysis (PCA)

  • dynamic mode decomposition (DMD)

Each of these can be thought of as a data-reduction procedure designed to capture salient patterns by projecting data onto a limited set of factors.

7.2. The Setup

Let \(X\) be an \(m \times n\) matrix of rank \(r\).

Necessarily, \(r \leq \min(m,n)\).

In this lecture, we’ll think of \(X\) as a matrix of data.

  • each column is an individual – a time period or person, depending on the application

  • each row is a random variable measuring an attribute of a time period or a person, depending on the application

We’ll be interested in two cases

  • A short and fat case in which \(m << n\), so that there are many more columns than rows.

  • A tall and skinny case in which \(m >> n\), so that there are many more rows than columns.

We’ll apply a singular value decomposition of \(X\) in both situations.

In the first case in which there are many more observations \(n\) than random variables \(m\), we learn about a joint distribution by taking averages across observations of functions of the observations.

Here we’ll look for patterns by using a singular value decomposition to do a principal components analysis (PCA).

In the second case in which there are many more random variables \(m\) than observations \(n\), we’ll proceed in a different way.

We’ll again use a singular value decomposition, but now to do a dynamic mode decomposition (DMD)

7.3. Singular Value Decomposition

A singular value decomposition of an \(m \times n\) matrix \(X\) of rank \(r \leq \min(m,n)\) is

\[ X = U \Sigma V^T \]

where

\[\begin{align*} UU^T & = I & \quad U^T U = I \cr VV^T & = I & \quad V^T V = I \end{align*}\]

where

  • \(U\) is an \(m \times m\) matrix whose columns are eigenvectors of \(X^T X\)

  • \(V\) is an \(n \times n\) matrix whose columns are eigenvectors of \(X X^T\)

  • \(\Sigma\) is an \(m \times n\) matrix in which the first \(r\) places on its main diagonal are positive numbers \(\sigma_1, \sigma_2, \ldots, \sigma_r\) called singular values; remaining entries of \(\Sigma\) are all zero

  • The \(r\) singular values are square roots of the eigenvalues of the \(m \times m\) matrix \(X X^T\) and the \(n \times n\) matrix \(X^T X\)

  • When \(U\) is a complex valued matrix, \(U^T\) denotes the conjugate-transpose or Hermitian-transpose of \(U\), meaning that \(U_{ij}^T\) is the complex conjugate of \(U_{ji}\).

  • Similarly, when \(V\) is a complex valued matrix, \(V^T\) denotes the conjugate-transpose or Hermitian-transpose of \(V\)

In what is called a full SVD, the shapes of \(U\), \(\Sigma\), and \(V\) are \(\left(m, m\right)\), \(\left(m, n\right)\), \(\left(n, n\right)\), respectively.

There is also an alternative shape convention called an economy or reduced SVD .

Thus, note that because we assume that \(X\) has rank \(r\), there are only \(r \) nonzero singular values, where \(r=\textrm{rank}(X)\leq\min\left(m, n\right)\).

A reduced SVD uses this fact to express \(U\), \(\Sigma\), and \(V\) as matrices with shapes \(\left(m, r\right)\), \(\left(r, r\right)\), \(\left(r, n\right)\).

Sometimes, we will use a full SVD in which \(U\), \(\Sigma\), and \(V\) have shapes \(\left(m, m\right)\), \(\left(m, n\right)\), \(\left(n, n\right)\)

Caveat: The properties

\[\begin{align*} UU^T & = I & \quad U^T U = I \cr VV^T & = I & \quad V^T V = I \end{align*}\]

apply to a full SVD but not to a reduced SVD.

In the tall-skinny case in which \(m > > n\), for a reduced SVD

\[\begin{align*} UU^T & \neq I & \quad U^T U = I \cr VV^T & = I & \quad V^T V = I \end{align*}\]

while in the short-fat case in which \(m < < n\), for a reduced SVD

\[\begin{align*} UU^T & = I & \quad U^T U = I \cr VV^T & = I & \quad V^T V \neq I \end{align*}\]

When we study Dynamic Mode Decomposition below, we shall want to remember this caveat because we’ll be using reduced SVD’s to compute key objects.

7.4. Reduced Versus Full SVD

Earlier, we mentioned full and reduced SVD’s.

You can read about reduced and full SVD here https://numpy.org/doc/stable/reference/generated/numpy.linalg.svd.html

In a full SVD

  • \(U\) is \(m \times m\)

  • \(\Sigma\) is \(m \times n\)

  • \(V\) is \(n \times n\)

In a reduced SVD

  • \(U\) is \(m \times r\)

  • \(\Sigma\) is \(r \times r\)

  • \(V\) is \(n \times r\)

Let’s do a some small exercise to compare full and reduced SVD’s.

First, let’s study a case in which \(m = 5 > n = 2\).

(This is a small example of the tall-skinny that will concern us when we study Dynamic Mode Decompositions below.)

import numpy as np
X = np.random.rand(5,2)
U, S, V = np.linalg.svd(X,full_matrices=True)  # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD
print('U, S, V ='), U, S, V
U, S, V =
(None,
 array([[-0.34052222, -0.09787182, -0.5846155 , -0.72972194,  0.01401177],
        [-0.38181254, -0.26160055,  0.4990526 , -0.20008981, -0.70476585],
        [-0.33771234,  0.69106226,  0.45761617, -0.29529329,  0.33432416],
        [-0.74468446,  0.09913197, -0.31013576,  0.58232208, -0.01829527],
        [-0.26393036, -0.65923597,  0.32182989, -0.03424646,  0.62530105]]),
 array([1.60274772, 0.59508313]),
 array([[-0.70109594, -0.71306696],
        [ 0.71306696, -0.70109594]]))
print('Uhat, Shat, Vhat = '), Uhat, Shat, Vhat
Uhat, Shat, Vhat = 
(None,
 array([[-0.34052222, -0.09787182],
        [-0.38181254, -0.26160055],
        [-0.33771234,  0.69106226],
        [-0.74468446,  0.09913197],
        [-0.26393036, -0.65923597]]),
 array([1.60274772, 0.59508313]),
 array([[-0.70109594, -0.71306696],
        [ 0.71306696, -0.70109594]]))
rr = np.linalg.matrix_rank(X)
print('rank of X - '), rr
rank of X - 
(None, 2)

Properties:

  • Where \(U\) is constructed via a full SVD, \(U^T U = I_{r \times r}\) and \(U U^T = I_{m \times m}\)

  • Where \(\hat U\) is constructed via a reduced SVD, although \(\hat U^T \hat U = I_{r \times r}\) it happens that \(\hat U \hat U^T \neq I_{m \times m}\)

We illustrate these properties for our example with the following code cells.

UTU = U.T@U
UUT = U@U.T
print('UUT, UTU = '), UUT, UTU 
UUT, UTU = 
(None,
 array([[ 1.00000000e+00, -8.05184352e-17, -1.19656263e-16,
         -2.48986421e-17, -6.82998569e-17],
        [-8.05184352e-17,  1.00000000e+00,  2.14558810e-17,
         -8.38393706e-17, -9.77995777e-17],
        [-1.19656263e-16,  2.14558810e-17,  1.00000000e+00,
         -7.17705303e-17,  1.45382600e-16],
        [-2.48986421e-17, -8.38393706e-17, -7.17705303e-17,
          1.00000000e+00, -4.66847349e-17],
        [-6.82998569e-17, -9.77995777e-17,  1.45382600e-16,
         -4.66847349e-17,  1.00000000e+00]]),
 array([[ 1.00000000e+00, -8.14900249e-18, -9.65482051e-19,
         -1.90029973e-16,  1.58919362e-17],
        [-8.14900249e-18,  1.00000000e+00,  2.40409630e-17,
          2.80505338e-17, -4.58320922e-17],
        [-9.65482051e-19,  2.40409630e-17,  1.00000000e+00,
         -1.38376530e-17,  2.96849500e-17],
        [-1.90029973e-16,  2.80505338e-17, -1.38376530e-17,
          1.00000000e+00,  1.93847587e-17],
        [ 1.58919362e-17, -4.58320922e-17,  2.96849500e-17,
          1.93847587e-17,  1.00000000e+00]]))
UhatUhatT = Uhat@Uhat.T
UhatTUhat = Uhat.T@Uhat
print('UhatUhatT, UhatTUhat= '), UhatUhatT, UhatTUhat
UhatUhatT, UhatTUhat= 
(None,
 array([[ 0.12553427,  0.15561898,  0.04736303,  0.24387938,  0.15439477],
        [ 0.15561898,  0.21421567, -0.05183946,  0.25839689,  0.27322842],
        [ 0.04736303, -0.05183946,  0.59161667,  0.31999549, -0.36644056],
        [ 0.24387938,  0.25839689,  0.31999549,  0.56438209,  0.13119347],
        [ 0.15439477,  0.27322842, -0.36644056,  0.13119347,  0.5042513 ]]),
 array([[ 1.00000000e+00, -8.14900249e-18],
        [-8.14900249e-18,  1.00000000e+00]]))

Remark: The cells above illustrate application of the fullmatrices=True and full-matrices=False options. Using full-matrices=False returns a reduced singular value decomposition. This option implements an optimal reduced rank approximation of a matrix, in the sense of minimizing the Frobenius norm of the discrepancy between the approximating matrix and the matrix being approximated. Optimality in this sense is established in the celebrated Eckart–Young theorem. See https://en.wikipedia.org/wiki/Low-rank_approximation.

When we study Dynamic Mode Decompositions below, it will be important for us to remember the following important properties of full and reduced SVD’s in such tall-skinny cases.

Let’s do another exercise, but now we’ll set \(m = 2 < 5 = n \)

import numpy as np
X = np.random.rand(2,5)
U, S, V = np.linalg.svd(X,full_matrices=True)  # full SVD
Uhat, Shat, Vhat = np.linalg.svd(X,full_matrices=False) # economy SVD
print('U, S, V ='), U, S, V
U, S, V =
(None,
 array([[-0.56798888,  0.82303623],
        [-0.82303623, -0.56798888]]),
 array([2.00712531, 0.41399433]),
 array([[-0.34791301, -0.60810382, -0.40868281, -0.42717968, -0.39957748],
        [ 0.62772545,  0.07120616, -0.10971881, -0.7297166 ,  0.23741504],
        [-0.07114726, -0.36995093,  0.89371147, -0.24168038, -0.03073731],
        [ 0.55088397, -0.66837365, -0.10113969,  0.46921007,  0.13934199],
        [-0.41997883, -0.20384405, -0.10950213, -0.08039718,  0.87384837]]))
print('Uhat, Shat, Vhat = '), Uhat, Shat, Vhat
Uhat, Shat, Vhat = 
(None,
 array([[-0.56798888,  0.82303623],
        [-0.82303623, -0.56798888]]),
 array([2.00712531, 0.41399433]),
 array([[-0.34791301, -0.60810382, -0.40868281, -0.42717968, -0.39957748],
        [ 0.62772545,  0.07120616, -0.10971881, -0.7297166 ,  0.23741504]]))
rr = np.linalg.matrix_rank(X)
print('rank X = '), rr
rank X = 
(None, 2)

7.5. Digression: Polar Decomposition

A singular value decomposition (SVD) is related to the polar decomposition of \(X\)

\[ X = SQ \]

where

\[\begin{align*} S & = U\Sigma U^T \cr Q & = U V^T \end{align*}\]

and \(S\) is evidently a symmetric matrix and \(Q\) is an orthogonal matrix.

7.6. Principle Components Analysis (PCA)

Let’s begin with a case in which \(n >> m\), so that we have many more observations \(n\) than random variables \(m\).

The matrix \(X\) is short and fat in an \(n >> m\) case as opposed to a tall and skinny case with \(m > > n \) to be discussed later.

We regard \(X\) as an \(m \times n\) matrix of data:

\[ X = \begin{bmatrix} X_1 \mid X_2 \mid \cdots \mid X_n\end{bmatrix} \]

where for \(j = 1, \ldots, n\) the column vector \(X_j = \begin{bmatrix}X_{1j}\\X_{2j}\\\vdots\\X_{mj}\end{bmatrix}\) is a vector of observations on variables \(\begin{bmatrix}x_1\\x_2\\\vdots\\x_m\end{bmatrix}\).

In a time series setting, we would think of columns \(j\) as indexing different times at which random variables are observed, while rows index different random variables.

In a cross section setting, we would think of columns \(j\) as indexing different individuals for which random variables are observed, while rows index different random variables.

The number of positive singular values equals the rank of matrix \(X\).

Arrange the singular values in decreasing order.

Arrange the positive singular values on the main diagonal of the matrix \(\Sigma\) of into a vector \(\sigma_R\).

Set all other entries of \(\Sigma\) to zero.

7.7. Relationship of PCA to SVD

To relate a SVD to a PCA (principal component analysis) of data set \(X\), first construct the SVD of the data matrix \(X\):

(7.1)\[ X = U \Sigma V^T = \sigma_1 U_1 V_1^T + \sigma_2 U_2 V_2^T + \cdots + \sigma_r U_r V_r^T \]

where

\[ U=\begin{bmatrix}U_1|U_2|\ldots|U_m\end{bmatrix} \]
\[\begin{split} V^T = \begin{bmatrix}V_1^T\\V_2^T\\\ldots\\V_n^T\end{bmatrix} \end{split}\]

In equation (7.1), each of the \(m \times n\) matrices \(U_{j}V_{j}^T\) is evidently of rank \(1\).

Thus, we have

(7.2)\[\begin{split} X = \sigma_1 \begin{pmatrix}U_{11}V_{1}^T\\U_{21}V_{1}^T\\\cdots\\U_{m1}V_{1}^T\\\end{pmatrix} + \sigma_2\begin{pmatrix}U_{12}V_{2}^T\\U_{22}V_{2}^T\\\cdots\\U_{m2}V_{2}^T\\\end{pmatrix}+\ldots + \sigma_r\begin{pmatrix}U_{1r}V_{r}^T\\U_{2r}V_{r}^T\\\cdots\\U_{mr}V_{r}^T\\\end{pmatrix} \end{split}\]

Here is how we would interpret the objects in the matrix equation (7.2) in a time series context:

  • \( V_{k}^T= \begin{bmatrix}V_{k1} & V_{k2} & \ldots & V_{kn}\end{bmatrix} \quad \textrm{for each} \ k=1, \ldots, n \) is a time series \(\lbrace V_{kj} \rbrace_{j=1}^n\) for the \(k\)th principal component

  • \(U_j = \begin{bmatrix}U_{1k}\\U_{2k}\\\ldots\\U_{mk}\end{bmatrix} \ k=1, \ldots, m\) is a vector of loadings of variables \(X_i\) on the \(k\)th principle component, \(i=1, \ldots, m\)

  • \(\sigma_k \) for each \(k=1, \ldots, r\) is the strength of \(k\)th principal component

7.8. PCA with Eigenvalues and Eigenvectors

We now use an eigen decomposition of a sample covariance matrix to do PCA.

Let \(X_{m \times n}\) be our \(m \times n\) data matrix.

Let’s assume that sample means of all variables are zero.

We can assure this by pre-processing the data by subtracting sample means.

Define the sample covariance matrix \(\Omega\) as

\[ \Omega = XX^T \]

Then use an eigen decomposition to represent \(\Omega\) as follows:

\[ \Omega =P\Lambda P^T \]

Here

  • \(P\) is \(m×m\) matrix of eigenvectors of \(\Omega\)

  • \(\Lambda\) is a diagonal matrix of eigenvalues of \(\Omega\)

We can then represent \(X\) as

\[ X=P\epsilon \]

where

\[ \epsilon\epsilon^T=\Lambda . \]

We can verify that

\[ XX^T=P\Lambda P^T . \]

It follows that we can represent the data matrix as

\[\begin{equation*} X=\begin{bmatrix}X_1|X_2|\ldots|X_m\end{bmatrix} =\begin{bmatrix}P_1|P_2|\ldots|P_m\end{bmatrix} \begin{bmatrix}\epsilon_1\\\epsilon_2\\\ldots\\\epsilon_m\end{bmatrix} = P_1\epsilon_1+P_2\epsilon_2+\ldots+P_m\epsilon_m \end{equation*}\]

where

\[ \epsilon\epsilon^T=\Lambda . \]

To reconcile the preceding representation with the PCA that we obtained through the SVD above, we first note that \(\epsilon_j^2=\lambda_j\equiv\sigma^2_j\).

Now define \(\tilde{\epsilon_j} = \frac{\epsilon_j}{\sqrt{\lambda_j}}\), which evidently implies that \(\tilde{\epsilon}_j\tilde{\epsilon}_j^T=1\).

Therefore

\[\begin{split} \begin{aligned} X&=\sqrt{\lambda_1}P_1\tilde{\epsilon_1}+\sqrt{\lambda_2}P_2\tilde{\epsilon_2}+\ldots+\sqrt{\lambda_m}P_m\tilde{\epsilon_m}\\ &=\sigma_1P_1\tilde{\epsilon_2}+\sigma_2P_2\tilde{\epsilon_2}+\ldots+\sigma_mP_m\tilde{\epsilon_m} , \end{aligned} \end{split}\]

which evidently agrees with

\[ X=\sigma_1U_1{V_1}^{T}+\sigma_2 U_2{V_2}^{T}+\ldots+\sigma_{r} U_{r}{V_{r}}^{T} \]

provided that we set

  • \(U_j=P_j\) (the loadings of variables on principal components)

  • \({V_k}^{T}=\tilde{\epsilon_k}\) (the principal components)

Since there are several possible ways of computing \(P\) and \(U\) for given a data matrix \(X\), depending on algorithms used, we might have sign differences or different orders between eigenvectors.

We can resolve such ambiguities about \(U\) and \(P\) by

  1. sorting eigenvalues and singular values in descending order

  2. imposing positive diagonals on \(P\) and \(U\) and adjusting signs in \(V^T\) accordingly

7.9. Connections

To pull things together, it is useful to assemble and compare some formulas presented above.

First, consider the following SVD of an \(m \times n\) matrix:

\[ X = U\Sigma V^T \]

Compute:

\[\begin{align*} XX^T&=U\Sigma V^TV\Sigma^T U^T\cr &\equiv U\Sigma\Sigma^TU^T\cr &\equiv U\Lambda U^T \end{align*}\]

Thus, \(U\) in the SVD is the matrix \(P\) of eigenvectors of \(XX^T\) and \(\Sigma \Sigma^T\) is the matrix \(\Lambda\) of eigenvalues.

Second, let’s compute

\[\begin{align*} X^TX &=V\Sigma^T U^TU\Sigma V^T\\ &=V\Sigma^T{\Sigma}V^T \end{align*}\]

Thus, the matrix \(V\) in the SVD is the matrix of eigenvectors of \(X^TX\)

Summarizing and fitting things together, we have the eigen decomposition of the sample covariance matrix

\[ X X^T = P \Lambda P^T \]

where \(P\) is an orthogonal matrix.

Further, from the SVD of \(X\), we know that

\[ X X^T = U \Sigma \Sigma^T U^T \]

where \(U\) is an orthonal matrix.

Thus, \(P = U\) and we have the representation of \(X\)

\[ X = P \epsilon = U \Sigma V^T \]

It follows that

\[ U^T X = \Sigma V^T = \epsilon \]

Note that the preceding implies that

\[ \epsilon \epsilon^T = \Sigma V^T V \Sigma^T = \Sigma \Sigma^T = \Lambda , \]

so that everything fits together.

Below we define a class DecomAnalysis that wraps PCA and SVD for a given a data matrix X.

class DecomAnalysis:
    """
    A class for conducting PCA and SVD.
    """

    def __init__(self, X, n_component=None):

        self.X = X

        self.Ω = (X @ X.T)

        self.m, self.n = X.shape
        self.r = LA.matrix_rank(X)

        if n_component:
            self.n_component = n_component
        else:
            self.n_component = self.m

    def pca(self):

        𝜆, P = LA.eigh(self.Ω)    # columns of P are eigenvectors

        ind = sorted(range(𝜆.size), key=lambda x: 𝜆[x], reverse=True)

        # sort by eigenvalues
        self.𝜆 = 𝜆[ind]
        P = P[:, ind]
        self.P = P @ diag_sign(P)

        self.Λ = np.diag(self.𝜆)

        self.explained_ratio_pca = np.cumsum(self.𝜆) / self.𝜆.sum()

        # compute the N by T matrix of principal components 
        self.𝜖 = self.P.T @ self.X

        P = self.P[:, :self.n_component]
        𝜖 = self.𝜖[:self.n_component, :]

        # transform data
        self.X_pca = P @ 𝜖

    def svd(self):

        U, 𝜎, VT = LA.svd(self.X)

        ind = sorted(range(𝜎.size), key=lambda x: 𝜎[x], reverse=True)

        # sort by eigenvalues
        d = min(self.m, self.n)

        self.𝜎 = 𝜎[ind]
        U = U[:, ind]
        D = diag_sign(U)
        self.U = U @ D
        VT[:d, :] = D @ VT[ind, :]
        self.VT = VT

        self.Σ = np.zeros((self.m, self.n))
        self.Σ[:d, :d] = np.diag(self.𝜎)

        𝜎_sq = self.𝜎 ** 2
        self.explained_ratio_svd = np.cumsum(𝜎_sq) / 𝜎_sq.sum()

        # slicing matrices by the number of components to use
        U = self.U[:, :self.n_component]
        Σ = self.Σ[:self.n_component, :self.n_component]
        VT = self.VT[:self.n_component, :]

        # transform data
        self.X_svd = U @ Σ @ VT

    def fit(self, n_component):

        # pca
        P = self.P[:, :n_component]
        𝜖 = self.𝜖[:n_component, :]

        # transform data
        self.X_pca = P @ 𝜖

        # svd
        U = self.U[:, :n_component]
        Σ = self.Σ[:n_component, :n_component]
        VT = self.VT[:n_component, :]

        # transform data
        self.X_svd = U @ Σ @ VT

def diag_sign(A):
    "Compute the signs of the diagonal of matrix A"

    D = np.diag(np.sign(np.diag(A)))

    return D

We also define a function that prints out information so that we can compare decompositions obtained by different algorithms.

def compare_pca_svd(da):
    """
    Compare the outcomes of PCA and SVD.
    """

    da.pca()
    da.svd()

    print('Eigenvalues and Singular values\n')
    print(f'λ = {da.λ}\n')
    print(f'σ^2 = {da.σ**2}\n')
    print('\n')

    # loading matrices
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))
    plt.suptitle('loadings')
    axs[0].plot(da.P.T)
    axs[0].set_title('P')
    axs[0].set_xlabel('m')
    axs[1].plot(da.U.T)
    axs[1].set_title('U')
    axs[1].set_xlabel('m')
    plt.show()

    # principal components
    fig, axs = plt.subplots(1, 2, figsize=(14, 5))
    plt.suptitle('principal components')
    axs[0].plot(da.ε.T)
    axs[0].set_title('ε')
    axs[0].set_xlabel('n')
    axs[1].plot(da.VT[:da.r, :].T * np.sqrt(da.λ))
    axs[1].set_title('$V^T*\sqrt{\lambda}$')
    axs[1].set_xlabel('n')
    plt.show()

For an example PCA applied to analyzing the structure of intelligence tests see this lecture Multivariable Normal Distribution.

Look at the parts of that lecture that describe and illustrate the classic factor analysis model.

7.10. Dynamic Mode Decomposition (DMD)

We turn to the case in which \( m >>n \).

Here an \( m \times n \) data matrix \( \tilde X \) contains many more random variables \( m \) than observations \( n \).

This tall and skinny case is associated with Dynamic Mode Decomposition.

Dynamic mode decomposition was introduced by [Sch10],

You can read more about Dynamic Mode Decomposition here [KBBWP16] and here [BK19] (section 7.2).

We want to fit a first-order vector autoregression

(7.3)\[ X_{t+1} = A X_t + C \epsilon_{t+1} \]

where \(\epsilon_{t+1}\) is the time \(t+1\) instance of an i.i.d. \(m \times 1\) random vector with mean vector zero and identity covariance matrix and

where the \( m \times 1 \) vector \( X_t \) is

(7.4)\[ X_t = \begin{bmatrix} X_{1,t} & X_{2,t} & \cdots & X_{m,t} \end{bmatrix}^T \]

and where \( T \) again denotes complex transposition and \( X_{i,t} \) is an observation on variable \( i \) at time \( t \).

We want to fit equation (7.3).

Our data are organized in an \( m \times (n+1) \) matrix \( \tilde X \)

\[ \tilde X = \begin{bmatrix} X_1 \mid X_2 \mid \cdots \mid X_n \mid X_{n+1} \end{bmatrix} \]

where for \( t = 1, \ldots, n +1 \), the \( m \times 1 \) vector \( X_t \) is given by (7.4).

Thus, we want to estimate a system (7.3) that consists of \( m \) least squares regressions of everything on one lagged value of everything.

The \(i\)’th equation of (7.3) is a regression of \(X_{i,t+1}\) on the vector \(X_t\).

We proceed as follows.

From \( \tilde X \), we form two \(m \times n\) matrices

\[ X = \begin{bmatrix} X_1 \mid X_2 \mid \cdots \mid X_{n}\end{bmatrix} \]

and

\[ X' = \begin{bmatrix} X_2 \mid X_3 \mid \cdots \mid X_{n+1}\end{bmatrix} \]

Here \( ' \) does not indicate matrix transposition but instead is part of the name of the matrix \( X' \).

In forming \( X \) and \( X' \), we have in each case dropped a column from \( \tilde X \), the last column in the case of \( X \), and the first column in the case of \( X' \).

Evidently, \( X \) and \( X' \) are both \( m \times n \) matrices.

We denote the rank of \( X \) as \( p \leq \min(m, n) \).

Two possible cases are

  • \( n > > m\), so that we have many more time series observations \(n\) than variables \(m\)

  • \(m > > n\), so that we have many more variables \(m \) than time series observations \(n\)

At a general level that includes both of these special cases, a common formula describes the least squares estimator \(\hat A\) of \(A\) for both cases, but important details differ.

The common formula is

(7.5)\[ \hat A = X' X^+ \]

where \(X^+\) is the pseudo-inverse of \(X\).

Formulas for the pseudo-inverse differ for our two cases.

When \( n > > m\), so that we have many more time series observations \(n\) than variables \(m\) and when \(X\) has linearly independent rows, \(X X^T\) has an inverse and the pseudo-inverse \(X^+\) is

\[ X^+ = X^T (X X^T)^{-1} \]

Here \(X^+\) is a right-inverse that verifies \( X X^+ = I_{m \times m}\).

In this case, our formula (7.5) for the least-squares estimator of the population matrix of regression coefficients \(A\) becomes

\[ \hat A = X' X^T (X X^T)^{-1} \]

This formula is widely used in economics to estimate vector autorgressions.

The right side is proportional to the empirical cross second moment matrix of \(X_{t+1}\) and \(X_t\) times the inverse of the second moment matrix of \(X_t\).

This least-squares formula widely used in econometrics.

Tall-Skinny Case:

When \(m > > n\), so that we have many more variables \(m \) than time series observations \(n\) and when \(X\) has linearly independent columns, \(X^T X\) has an inverse and the pseudo-inverse \(X^+\) is

\[ X^+ = (X^T X)^{-1} X^T \]

Here \(X^+\) is a left-inverse that verifies \(X^+ X = I_{n \times n}\).

In this case, our formula (7.5) for a least-squares estimator of \(A\) becomes

(7.6)\[ \hat A = X' (X^T X)^{-1} X^T \]

This is the case that we are interested in here.

If we use formula (7.6) to calculate \(\hat A X\) we find that

\[ \hat A X = X' \]

so that the regression equation fits perfectly, the usual outcome in an underdetermined least-squares model.

Thus, we want to fit equation (7.3) in a situation in which we have a number \(n\) of observations that is small relative to the number \(m\) of variables that appear in the vector \(X_t\).

To reiterate and offer an idea about how we can efficiently calculate the pseudo-inverse \(X^+\), as our estimator \(\hat A\) of \(A\) we form an \(m \times m\) matrix that solves the least-squares best-fit problem

(7.7)\[ \hat A = \textrm{argmin}_{\check A} || X' - \check A X ||_F \]

where \(|| \cdot ||_F\) denotes the Frobeneus norm of a matrix.

The minimizer of the right side of equation (7.7) is

(7.8)\[ \hat A = X' X^{+} \]

where the (possibly huge) \( n \times m \) matrix \( X^{+} = (X^T X)^{-1} X^T\) is again a pseudo-inverse of \( X \).

The \( i \)th row of \( \hat A \) is an \( m \times 1 \) vector of regression coefficients of \( X_{i,t+1} \) on \( X_{j,t}, j = 1, \ldots, m \).

For some situations that we are interested in, \(X^T X \) can be close to singular, a situation that can make some numerical algorithms be error-prone.

To confront that possibility, we’ll use efficient algorithms for computing and for constructing reduced rank approximations of \(\hat A\) in formula (7.6).

The \( i \)th row of \( \hat A \) is an \( m \times 1 \) vector of regression coefficients of \( X_{i,t+1} \) on \( X_{j,t}, j = 1, \ldots, m \).

An efficient way to compute the pseudo-inverse \(X^+\) is to start with a singular value decomposition

(7.9)\[ X = U \Sigma V^T \]

We can use the singular value decomposition (7.9) efficiently to construct the pseudo-inverse \(X^+\) by recognizing the following string of equalities.

(7.10)\[\begin{split} \begin{aligned} X^{+} & = (X^T X)^{-1} X^T \\ & = (V \Sigma U^T U \Sigma V^T)^{-1} V \Sigma U^T \\ & = (V \Sigma \Sigma V^T)^{-1} V \Sigma U^T \\ & = V \Sigma^{-1} \Sigma^{-1} V^T V \Sigma U^T \\ & = V \Sigma^{-1} U^T \end{aligned} \end{split}\]

(Since we are in the \(m > > n\) case in which \(V^T V = I\) in a reduced SVD, we can use the preceding string of equalities for a reduced SVD as well as for a full SVD.)

Thus, we shall construct a pseudo-inverse \( X^+ \) of \( X \) by using a singular value decomposition of \(X\) in equation (7.9) to compute

(7.11)\[ X^{+} = V \Sigma^{-1} U^T \]

where the matrix \( \Sigma^{-1} \) is constructed by replacing each non-zero element of \( \Sigma \) with \( \sigma_j^{-1} \).

We can use formula (7.11) together with formula (7.8) to compute the matrix \( \hat A \) of regression coefficients.

Thus, our estimator \(\hat A = X' X^+\) of the \(m \times m\) matrix of coefficients \(A\) is

(7.12)\[ \hat A = X' V \Sigma^{-1} U^T \]

In addition to doing that, we’ll eventually use dynamic mode decomposition to compute a rank \( r \) approximation to \( \hat A \), where \( r < p \).

Remark: We described and illustrated a reduced singular value decomposition above, and compared it with a full singular value decomposition. In our Python code, we’ll typically use a reduced SVD.

Next, we describe alternative representations of our first-order linear dynamic system.

7.11. Representation 1

In this representation, we shall use a full SVD of \(X\).

We use the \(m\) columns of \(U\), and thus the \(m\) rows of \(U^T\), to define a \(m \times 1\) vector \(\tilde b_t\) as follows

(7.13)\[ \tilde b_t = U^T X_t \]

and

(7.14)\[ X_t = U \tilde b_t \]

(Here we use the notation \(b\) to remind ourselves that we are creating a basis vector.)

Since we are using a full SVD, \(U U^T\) is an \(m \times m\) identity matrix.

So it follows from equation (7.13) that we can reconstruct \(X_t\) from \(\tilde b_t\) by using

  • Equation (7.13) serves as an encoder that rotates the \(m \times 1\) vector \(X_t\) to become an \(m \times 1\) vector \(\tilde b_t\)

  • Equation (7.14) serves as a decoder that recovers the \(m \times 1\) vector \(X_t\) by rotating the \(m \times 1\) vector \(\tilde b_t\)

Define a transition matrix for a rotated \(m \times 1\) state \(\tilde b_t\) by

(7.15)\[ \tilde A = U^T \hat A U \]

We can evidently recover \(\hat A\) from

\[ \hat A = U \tilde A U^T \]

Dynamics of the rotated \(m \times 1\) state \(\tilde b_t\) are governed by

\[ \tilde b_{t+1} = \tilde A \tilde b_t \]

To construct forecasts \(\overline X_t\) of future values of \(X_t\) conditional on \(X_1\), we can apply decoders (i.e., rotators) to both sides of this equation and deduce

\[ \overline X_{t+1} = U \tilde A^t U^T X_1 \]

where we use \(\overline X_t\) to denote a forecast.

7.12. Representation 2

This representation is related to one originally proposed by [Sch10].

It can be regarded as an intermediate step to a related and perhaps more useful representation 3.

As with Representation 1, we continue to

  • use a full SVD and not a reduced SVD

As we observed and illustrated earlier in this lecture, for a full SVD \(U U^T\) and \(U^T U\) are both identity matrices; but under a reduced SVD of \(X\), \(U^T U\) is not an identity matrix.

As we shall see, a full SVD is too confining for what we ultimately want to do, namely, situations in which \(U^T U\) is not an identity matrix because we use a reduced SVD of \(X\).

But for now, let’s proceed under the assumption that both of the preceding two requirements are satisfied.

Form an eigendecomposition of the \(m \times m\) matrix \(\tilde A = U^T \hat A U\) defined in equation (7.15):

(7.16)\[ \tilde A = W \Lambda W^{-1} \]

where \(\Lambda\) is a diagonal matrix of eigenvalues and \(W\) is an \(m \times m\) matrix whose columns are eigenvectors corresponding to rows (eigenvalues) in \(\Lambda\).

When \(U U^T = I_{m \times m}\), as is true with a full SVD of \(X\), it follows that

(7.17)\[ \hat A = U \tilde A U^T = U W \Lambda W^{-1} U^T \]

Evidently, according to equation (7.17), the diagonal matrix \(\Lambda\) contains eigenvalues of \(\hat A\) and corresponding eigenvectors of \(\hat A\) are columns of the matrix \(UW\).

Thus, the systematic (i.e., not random) parts of the \(X_t\) dynamics captured by our first-order vector autoregressions are described by

\[ X_{t+1} = U W \Lambda W^{-1} U^T X_t \]

Multiplying both sides of the above equation by \(W^{-1} U^T\) gives

\[ W^{-1} U^T X_{t+1} = \Lambda W^{-1} U^T X_t \]

or

\[ \hat b_{t+1} = \Lambda \hat b_t \]

where now our encoder is

\[ \hat b_t = W^{-1} U^T X_t \]

and our decoder is

\[ X_t = U W \hat b_t \]

We can use this representation to construct a predictor \(\overline X_{t+1}\) of \(X_{t+1}\) conditional on \(X_1\) via:

(7.18)\[ \overline X_{t+1} = U W \Lambda^t W^{-1} U^T X_1 \]

In effect, [Sch10] defined an \(m \times m\) matrix \(\Phi_s\) as

(7.19)\[ \Phi_s = UW \]

and represented equation (7.18) as

(7.20)\[ \overline X_{t+1} = \Phi_s \Lambda^t \Phi_s^+ X_1 \]

Components of the basis vector \( \hat b_t = W^{-1} U^T X_t \equiv \Phi_s^+\) are often called DMD modes, or sometimes also DMD projected nodes.

We turn next to an alternative representation suggested by Tu et al. [TRL+14], one that is more appropriate to use when, as in practice is typically the case, we use a reduced SVD.

7.13. Representation 3

Departing from the procedures used to construct Representations 1 and 2, each of which deployed a full SVD, we now use a reduced SVD.

Again, we let \(p \leq \textrm{min}(m,n)\) be the rank of \(X\).

Construct a reduced SVD

\[ X = \tilde U \tilde \Sigma \tilde V^T, \]

where now \(U\) is \(m \times p\) and \(\Sigma\) is \( p \times p\) and \(V^T\) is \(p \times n\).

Our minimum-norm least-squares estimator approximator of \(A\) now has representation

\[ \hat A = X' \tilde V \tilde \Sigma^{-1} \tilde U^T \]

Paralleling a step in Representation 1, define a transition matrix for a rotated \(p \times 1\) state \(\tilde b_t\) by

(7.21)\[ \tilde A =\tilde U^T \hat A \tilde U \]

Because we are now working with a reduced SVD, so that \(\tilde U \tilde U^T \neq I\), since \(\hat A \neq \tilde U \tilde A \tilde U^T\), we can’t simply recover \(\hat A\) from \(\tilde A\) and \(\tilde U\).

Nevertheless, hoping for the best, we persist and construct an eigendecomposition of what is now a \(p \times p\) matrix \(\tilde A\):

(7.22)\[ \tilde A = W \Lambda W^{-1} \]

Mimicking our procedure in Representation 2, we cross our fingers and compute the \(m \times p\) matrix

(7.23)\[ \tilde \Phi_s = \tilde U W \]

that corresponds to (7.19) for a full SVD.

At this point, it is interesting to compute \(\hat A \tilde \Phi_s\):

\[\begin{split} \begin{aligned} \hat A \tilde \Phi_s & = (X' \tilde V \tilde \Sigma^{-1} \tilde U^T) (\tilde U W) \\ & = X' \tilde V \tilde \Sigma^{-1} W \\ & \neq (\tilde U W) \Lambda \\ & = \tilde \Phi_s \Lambda \end{aligned} \end{split}\]

That \( \hat A \tilde \Phi_s \neq \tilde \Phi_s \Lambda \) means, that unlike the corresponding situation in Representation 2, columns of \(\tilde \Phi_s = \tilde U W\) are not eigenvectors of \(\hat A\) corresponding to eigenvalues \(\Lambda\).

But in a quest for eigenvectors of \(\hat A\) that we can compute with a reduced SVD, let’s define

\[ \Phi \equiv \hat A \tilde \Phi_s = X' \tilde V \tilde \Sigma^{-1} W \]

It turns out that columns of \(\Phi\) are eigenvectors of \(\hat A\), a consequence of a result established by Tu et al. [TRL+14].

To present their result, for convenience we’ll drop the tilde \(\tilde \cdot\) above \(U, V,\) and \(\Sigma\) and adopt the understanding that each of them is computed with a reduced SVD.

Thus, we now use the notation that the \(m \times p\) matrix \(\Phi\) is defined as

(7.24)\[ \Phi = X' V \Sigma^{-1} W \]

Proposition The \(p\) columns of \(\Phi\) are eigenvectors of \(\check A\).

Proof: From formula (7.24) we have

\[ \begin{aligned} \hat A \Phi & = (X' V \Sigma^{-1} U^T) (X' V \Sigma^{-1} W) \cr & = X' V \Sigma^{-1} \tilde A W \cr & = X' V \Sigma^{-1} W \Lambda \cr & = \Phi \Lambda \end{aligned} \]

Thus, we have deduced that

(7.25)\[ \hat A \Phi = \Phi \Lambda \]

Let \(\phi_i\) be the the \(i\)the column of \(\Phi\) and \(\lambda_i\) be the corresponding \(i\) eigenvalue of \(\tilde A\) from decomposition (7.22).

Writing out the \(m \times 1\) vectors on both sides of equation (7.25) and equating them gives

\[ \hat A \phi_i = \lambda_i \phi_i . \]

Thus, \(\phi_i\) is an eigenvector of \(\hat A\) that corresponds to eigenvalue \(\lambda_i\) of \(\tilde A\).

This concludes the proof.

Also see [BK19] (p. 238)

7.13.1. Decoder of \(X\) as linear projection

From eigendecomposition (7.25) we can represent \(\hat A\) as

(7.26)\[ \hat A = \Phi \Lambda \Phi^+ . \]

From formula (7.26) we can deduce the reduced dimension dynamics

\[ \check b_{t+1} = \Lambda \check b_t \]

where

(7.27)\[ \check b_t = \Phi^+ X_t \]

Since \(\Phi\) has \(p\) linearly independent columns, the generalized inverse of \(\Phi\) is

\[ \Phi^{\dagger} = (\Phi^T \Phi)^{-1} \Phi^T \]

and so

(7.28)\[ \check b = (\Phi^T \Phi)^{-1} \Phi^T X \]

\(\check b\) is recognizable as the matrix of least squares regression coefficients of the matrix \(X\) on the matrix \(\Phi\) and

\[ \check X = \Phi \check b \]

is the least squares projection of \(X\) on \(\Phi\).

By virtue of least-squares projection theory discussed here https://python-advanced.quantecon.org/orth_proj.html, we can represent \(X\) as the sum of the projection \(\check X\) of \(X\) on \(\Phi\) plus a matrix of errors.

To verify this, note that the least squares projection \(\check X\) is related to \(X\) by

\[ X = \Phi \check b + \epsilon \]

where \(\epsilon\) is an \(m \times n\) matrix of least squares errors satisfying the least squares orthogonality conditions \(\epsilon^T \Phi =0 \) or

(7.29)\[ (X - \Phi \check b)^T \Phi = 0_{m \times p} \]

Rearranging the orthogonality conditions (7.29) gives \(X^T \Phi = \check b \Phi^T \Phi\) which implies formula (7.28).

7.13.2. Alternative algorithm

There is a better way to compute the \(p \times 1\) vector \(\check b_t\) than provided by formula (7.27).

In particular, the following argument from [BK19] (page 240) provides a computationally efficient way to compute \(\check b_t\).

For convenience, we’ll do this first for time \(t=1\).

For \(t=1\), we have

(7.30)\[ X_1 = \Phi \check b_1 \]

where \(\check b_1\) is an \(r \times 1\) vector.

Recall from representation 1 above that \(X_1 = U \tilde b_1\), where \(\tilde b_1\) is the time \(1\) basis vector for representation 1.

It then follows from equation (7.24) that

\[ U \tilde b_1 = X' V \Sigma^{-1} W \check b_1 \]

and consequently

\[ \tilde b_1 = U^T X' V \Sigma^{-1} W \check b_1 \]

Recall that from equation (7.12), \( \tilde A = U^T X' V \Sigma^{-1}\).

It then follows that

\[ \tilde b_1 = \tilde A W \check b_1 \]

and therefore, by the eigendecomposition (7.16) of \(\tilde A\), we have

\[ \tilde b_1 = W \Lambda \check b_1 \]

Consequently,

\[ \check b_1 = ( W \Lambda)^{-1} \tilde b_1 \]

or

(7.31)\[ \check b_1 = ( W \Lambda)^{-1} U^T X_1 , \]

which is computationally more efficient than the following instance of equation (7.27) for computing the initial vector \(\check b_1\):

(7.32)\[ \check b_1= \Phi^{+} X_1 \]

Users of DMD sometimes call components of the basis vector \(\check b_t = \Phi^+ X_t \equiv (W \Lambda)^{-1} U^T X_t\) the exact DMD modes.

Conditional on \(X_t\), we can compute our decoded \(\check X_{t+j}, j = 1, 2, \ldots \) from either

(7.33)\[ \check X_{t+j} = \Phi \Lambda^j \Phi^{+} X_t \]

or

(7.34)\[ \check X_{t+j} = \Phi \Lambda^j (W \Lambda)^{-1} U^T X_t . \]

We can then use \(\check X_{t+j}\) to forcast \(X_{t+j}\).

7.14. Using Fewer Modes

Some of the preceding formulas assume that we have retained all \(p\) modes associated with the positive singular values of \(X\).

We can adjust our formulas to describe a situation in which we instead retain only the \(r < p\) largest singular values.

In that case, we simply replace \(\Sigma\) with the appropriate \(r \times r\) matrix of singular values, \(U\) with the \(m \times r\) matrix of whose columns correspond to the \(r\) largest singular values, and \(V\) with the \(n \times r\) matrix whose columns correspond to the \(r\) largest singular values.

Counterparts of all of the salient formulas above then apply.

7.15. Source for Some Python Code

You can find a Python implementation of DMD here:

https://mathlab.github.io/PyDMD/