# 8. VARs and DMDs#

This lecture applies computational methods that we learned about in this lecture Singular Value Decomposition to

first-order vector autoregressions (VARs)

dynamic mode decompositions (DMDs)

connections between DMDs and first-order VARs

## 8.1. First-Order Vector Autoregressions#

We want to fit a **first-order vector autoregression**

where \(\epsilon_{t+1}\) is the time \(t+1\) component of a sequence of i.i.d. \(m \times 1\) random vectors with mean vector zero and identity covariance matrix and where the \( m \times 1 \) vector \( X_t \) is

and where \(\cdot ^\top \) again denotes complex transposition and \( X_{i,t} \) is variable \( i \) at time \( t \).

We want to fit equation (8.1).

Our data are organized in an \( m \times (n+1) \) matrix \( \tilde X \)

where for \( t = 1, \ldots, n +1 \), the \( m \times 1 \) vector \( X_t \) is given by (8.2).

Thus, we want to estimate a system (8.1) that consists of \( m \) least squares regressions of **everything** on one lagged value of **everything**.

The \(i\)’th equation of (8.1) is a regression of \(X_{i,t+1}\) on the vector \(X_t\).

We proceed as follows.

From \( \tilde X \), we form two \(m \times n\) matrices

and

Here \( ' \) is part of the name of the matrix \( X' \) and does not indicate matrix transposition.

We use \(\cdot^\top \) to denote matrix transposition or its extension to complex matrices.

In forming \( X \) and \( X' \), we have in each case dropped a column from \( \tilde X \), the last column in the case of \( X \), and the first column in the case of \( X' \).

Evidently, \( X \) and \( X' \) are both \( m \times n \) matrices.

We denote the rank of \( X \) as \( p \leq \min(m, n) \).

Two cases that interest us are

\( n > > m\), so that we have many more time series observations \(n\) than variables \(m\)

\(m > > n\), so that we have many more variables \(m \) than time series observations \(n\)

At a general level that includes both of these special cases, a common formula describes the least squares estimator \(\hat A\) of \(A\).

But important details differ.

The common formula is

where \(X^+\) is the pseudo-inverse of \(X\).

To read about the **Moore-Penrose pseudo-inverse** please see Moore-Penrose pseudo-inverse

Applicable formulas for the pseudo-inverse differ for our two cases.

**Short-Fat Case:**

When \( n > > m\), so that we have many more time series observations \(n\) than variables \(m\) and when
\(X\) has linearly independent **rows**, \(X X^\top \) has an inverse and the pseudo-inverse \(X^+\) is

Here \(X^+\) is a **right-inverse** that verifies \( X X^+ = I_{m \times m}\).

In this case, our formula (8.3) for the least-squares estimator of the population matrix of regression coefficients \(A\) becomes

This formula for least-squares regression coefficients is widely used in econometrics.

It is used to estimate vector autorgressions.

The right side of formula (8.4) is proportional to the empirical cross second moment matrix of \(X_{t+1}\) and \(X_t\) times the inverse of the second moment matrix of \(X_t\).

**Tall-Skinny Case:**

When \(m > > n\), so that we have many more attributes \(m \) than time series observations \(n\) and when \(X\) has linearly independent **columns**,
\(X^\top X\) has an inverse and the pseudo-inverse \(X^+\) is

Here \(X^+\) is a **left-inverse** that verifies \(X^+ X = I_{n \times n}\).

In this case, our formula (8.3) for a least-squares estimator of \(A\) becomes

Please compare formulas (8.4) and (8.5) for \(\hat A\).

Here we are especially interested in formula (8.5).

The \( i \)th row of \( \hat A \) is an \( m \times 1 \) vector of regression coefficients of \( X_{i,t+1} \) on \( X_{j,t}, j = 1, \ldots, m \).

If we use formula (8.5) to calculate \(\hat A X\) we find that

so that the regression equation **fits perfectly**.

This is a typical outcome in an **underdetermined least-squares** model.

To reiterate, in the **tall-skinny** case (described in Singular Value Decomposition) in which we have a number \(n\) of observations that is small relative to the number \(m\) of
attributes that appear in the vector \(X_t\), we want to fit equation (8.1).

We confront the facts that the least squares estimator is underdetermined and that the regression equation fits perfectly.

To proceed, we’ll want efficiently to calculate the pseudo-inverse \(X^+\).

The pseudo-inverse \(X^+\) will be a component of our estimator of \(A\).

As our estimator \(\hat A\) of \(A\) we want to form an \(m \times m\) matrix that solves the least-squares best-fit problem

where \(|| \cdot ||_F\) denotes the Frobenius (or Euclidean) norm of a matrix.

The Frobenius norm is defined as

The minimizer of the right side of equation (8.6) is

where the (possibly huge) \( n \times m \) matrix \( X^{+} = (X^\top X)^{-1} X^\top \) is again a pseudo-inverse of \( X \).

For some situations that we are interested in, \(X^\top X \) can be close to singular, a situation that makes some numerical algorithms be inaccurate.

To acknowledge that possibility, we’ll use efficient algorithms to constructing
a **reduced-rank approximation** of \(\hat A\) in formula (8.5).

Such an approximation to our vector autoregression will no longer fit perfectly.

The \( i \)th row of \( \hat A \) is an \( m \times 1 \) vector of regression coefficients of \( X_{i,t+1} \) on \( X_{j,t}, j = 1, \ldots, m \).

An efficient way to compute the pseudo-inverse \(X^+\) is to start with a singular value decomposition

where we remind ourselves that for a **reduced** SVD, \(X\) is an \(m \times n\) matrix of data, \(U\) is an \(m \times p\) matrix, \(\Sigma\) is a \(p \times p\) matrix, and \(V\) is an \(n \times p\) matrix.

We can efficiently construct the pertinent pseudo-inverse \(X^+\) by recognizing the following string of equalities.

(Since we are in the \(m > > n\) case in which \(V^\top V = I_{p \times p}\) in a reduced SVD, we can use the preceding string of equalities for a reduced SVD as well as for a full SVD.)

Thus, we shall construct a pseudo-inverse \( X^+ \) of \( X \) by using a singular value decomposition of \(X\) in equation (8.8) to compute

where the matrix \( \Sigma^{-1} \) is constructed by replacing each non-zero element of \( \Sigma \) with \( \sigma_j^{-1} \).

We can use formula (8.10) together with formula (8.7) to compute the matrix \( \hat A \) of regression coefficients.

Thus, our estimator \(\hat A = X' X^+\) of the \(m \times m\) matrix of coefficients \(A\) is

## 8.2. Dynamic Mode Decomposition (DMD)#

We turn to the \( m >>n \) **tall and skinny** case associated with **Dynamic Mode Decomposition**.

Here an \( m \times n+1 \) data matrix \( \tilde X \) contains many more attributes (or variables) \( m \) than time periods \( n+1 \).

Dynamic mode decomposition was introduced by [Sch10],

You can read about Dynamic Mode Decomposition here [KBBWP16] and here [BK19] (section 7.2).

**Dynamic Mode Decomposition** (DMD) computes a rank \( r < p \) approximation to the least squares regression coefficients \( \hat A \) described by formula (8.11).

We’ll build up gradually to a formulation that is useful in applications.

We’ll do this by describing three alternative representations of our first-order linear dynamic system, i.e., our vector autoregression.

**Guide to three representations:** In practice, we’ll mainly be interested in Representation 3.

We use the first two representations to present some useful intermediate steps that help us to appreciate what is under the hood of Representation 3.

In applications, we’ll use only a small subset of **DMD modes** to approximate dynamics.

We use such a small subset of DMD modes to construct a reduced-rank approximation to \(A\).

To do that, we’ll want to use the **reduced** SVD’s affiliated with representation 3, not the **full** SVD’s affiliated with representations 1 and 2.

**Guide to impatient reader:** In our applications, we’ll be using Representation 3.

You might want to skip the stage-setting representations 1 and 2 on first reading.

## 8.3. Representation 1#

In this representation, we shall use a **full** SVD of \(X\).

We use the \(m\) **columns** of \(U\), and thus the \(m\) **rows** of \(U^\top \), to define a \(m \times 1\) vector \(\tilde b_t\) as

The original data \(X_t\) can be represented as

(Here we use \(b\) to remind ourselves that we are creating a **basis** vector.)

Since we are now using a **full** SVD, \(U U^\top = I_{m \times m}\).

So it follows from equation (8.12) that we can reconstruct \(X_t\) from \(\tilde b_t\).

In particular,

Equation (8.12) serves as an

**encoder**that**rotates**the \(m \times 1\) vector \(X_t\) to become an \(m \times 1\) vector \(\tilde b_t\)Equation (8.13) serves as a

**decoder**that**reconstructs**the \(m \times 1\) vector \(X_t\) by rotating the \(m \times 1\) vector \(\tilde b_t\)

Define a transition matrix for an \(m \times 1\) basis vector \(\tilde b_t\) by

We can recover \(\hat A\) from

Dynamics of the \(m \times 1\) basis vector \(\tilde b_t\) are governed by

To construct forecasts \(\overline X_t\) of future values of \(X_t\) conditional on \(X_1\), we can apply decoders (i.e., rotators) to both sides of this equation and deduce

where we use \(\overline X_{t+1}, t \geq 1 \) to denote a forecast.

## 8.4. Representation 2#

This representation is related to one originally proposed by [Sch10].

It can be regarded as an intermediate step on the way to obtaining a related representation 3 to be presented later

As with Representation 1, we continue to

use a

**full**SVD and**not**a reduced SVD

As we observed and illustrated in a lecture about the Singular Value Decomposition

(a) for a full SVD \(U U^\top = I_{m \times m} \) and \(U^\top U = I_{p \times p}\) are both identity matrices

(b) for a reduced SVD of \(X\), \(U^\top U \) is not an identity matrix.

As we shall see later, a full SVD is too confining for what we ultimately want to do, namely, cope with situations in which \(U^\top U\) is **not** an identity matrix because we use a reduced SVD of \(X\).

But for now, let’s proceed under the assumption that we are using a full SVD so that requirements (a) and (b) are both satisfied.

Form an eigendecomposition of the \(m \times m\) matrix \(\tilde A = U^\top \hat A U\) defined in equation (8.14):

where \(\Lambda\) is a diagonal matrix of eigenvalues and \(W\) is an \(m \times m\) matrix whose columns are eigenvectors corresponding to rows (eigenvalues) in \(\Lambda\).

When \(U U^\top = I_{m \times m}\), as is true with a full SVD of \(X\), it follows that

According to equation (8.16), the diagonal matrix \(\Lambda\) contains eigenvalues of \(\hat A\) and corresponding eigenvectors of \(\hat A\) are columns of the matrix \(UW\).

It follows that the systematic (i.e., not random) parts of the \(X_t\) dynamics captured by our first-order vector autoregressions are described by

Multiplying both sides of the above equation by \(W^{-1} U^\top \) gives

or

where our **encoder** is

and our **decoder** is

We can use this representation to construct a predictor \(\overline X_{t+1}\) of \(X_{t+1}\) conditional on \(X_1\) via:

In effect, [Sch10] defined an \(m \times m\) matrix \(\Phi_s\) as

and a generalized inverse

[Sch10] then represented equation (8.17) as

Components of the basis vector \( \hat b_t = W^{-1} U^\top X_t \equiv \Phi_s^+ X_t\) are

DMD **projected modes**.

To understand why they are called **projected modes**, notice that

so that the \(m \times p\) matrix

is a matrix of regression coefficients of the \(m \times n\) matrix \(X\) on the \(m \times p\) matrix \(\Phi_s\).

We’ll say more about this interpretation in a related context when we discuss representation 3, which was suggested by Tu et al. [TRL+14].

It is more appropriate to use representation 3 when, as is often the case in practice, we want to use a reduced SVD.

## 8.5. Representation 3#

Departing from the procedures used to construct Representations 1 and 2, each of which deployed a **full** SVD, we now use a **reduced** SVD.

Again, we let \(p \leq \textrm{min}(m,n)\) be the rank of \(X\).

Construct a **reduced** SVD

where now \(\tilde U\) is \(m \times p\), \(\tilde \Sigma\) is \( p \times p\), and \(\tilde V^\top \) is \(p \times n\).

Our minimum-norm least-squares approximator of \(A\) now has representation

**Computing Dominant Eigenvectors of \(\hat A\)**

We begin by paralleling a step used to construct Representation 1, define a transition matrix for a rotated \(p \times 1\) state \(\tilde b_t\) by

**Interpretation as projection coefficients**

[BK22] remark that \(\tilde A\) can be interpreted in terms of a projection of \(\hat A\) onto the \(p\) modes in \(\tilde U\).

To verify this, first note that, because \( \tilde U^\top \tilde U = I\), it follows that

Next, we’ll just compute the regression coefficients in a projection of \(\hat A\) on \(\tilde U\) using a standard least-squares formula

Thus, we have verified that \(\tilde A\) is a least-squares projection of \(\hat A\) onto \(\tilde U\).

**An Inverse Challenge**

Because we are using a reduced SVD, \(\tilde U \tilde U^\top \neq I\).

Consequently,

so we can’t simply recover \(\hat A\) from \(\tilde A\) and \(\tilde U\).

**A Blind Alley**

We can start by hoping for the best and proceeding to construct an eigendecomposition of the \(p \times p\) matrix \(\tilde A\):

where \(\Lambda\) is a diagonal matrix of \(p\) eigenvalues and the columns of \(\tilde W\) are corresponding eigenvectors.

Mimicking our procedure in Representation 2, we cross our fingers and compute an \(m \times p\) matrix

that corresponds to (8.18) for a full SVD.

At this point, where \(\hat A\) is given by formula (8.21) it is interesting to compute \(\hat A \tilde \Phi_s\):

That
\( \hat A \tilde \Phi_s \neq \tilde \Phi_s \Lambda \) means that, unlike the corresponding situation in Representation 2, columns of \(\tilde \Phi_s = \tilde U \tilde W\)
are **not** eigenvectors of \(\hat A\) corresponding to eigenvalues on the diagonal of matix \(\Lambda\).

**An Approach That Works**

Continuing our quest for eigenvectors of \(\hat A\) that we **can** compute with a reduced SVD, let’s define an \(m \times p\) matrix
\(\Phi\) as

It turns out that columns of \(\Phi\) **are** eigenvectors of \(\hat A\).

This is a consequence of a result established by Tu et al. [TRL+14] that we now present.

**Proposition** The \(p\) columns of \(\Phi\) are eigenvectors of \(\hat A\).

**Proof:** From formula (8.26) we have

so that

Let \(\phi_i\) be the \(i\)th column of \(\Phi\) and \(\lambda_i\) be the corresponding \(i\) eigenvalue of \(\tilde A\) from decomposition (8.24).

Equating the \(m \times 1\) vectors that appear on the two sides of equation (8.27) gives

This equation confirms that \(\phi_i\) is an eigenvector of \(\hat A\) that corresponds to eigenvalue \(\lambda_i\) of both \(\tilde A\) and \(\hat A\).

This concludes the proof.

Also see [BK22] (p. 238)

### 8.5.1. Decoder of \(\check b\) as a linear projection#

From eigendecomposition (8.27) we can represent \(\hat A\) as

From formula (8.28) we can deduce dynamics of the \(p \times 1\) vector \(\check b_t\):

where

Since the \(m \times p\) matrix \(\Phi\) has \(p\) linearly independent columns, the generalized inverse of \(\Phi\) is

and so

The \(p \times n\) matrix \(\check b\) is recognizable as a matrix of least squares regression coefficients of the \(m \times n\) matrix \(X\) on the \(m \times p\) matrix \(\Phi\) and consequently

is an \(m \times n\) matrix of least squares projections of \(X\) on \(\Phi\).

**Variance Decomposition of \(X\)**

By virtue of the least-squares projection theory discussed in this quantecon lecture https://python-advanced.quantecon.org/orth_proj.html, we can represent \(X\) as the sum of the projection \(\check X\) of \(X\) on \(\Phi\) plus a matrix of errors.

To verify this, note that the least squares projection \(\check X\) is related to \(X\) by

or

where \(\epsilon\) is an \(m \times n\) matrix of least squares errors satisfying the least squares orthogonality conditions \(\epsilon^\top \Phi =0 \) or

Rearranging the orthogonality conditions (8.33) gives \(X^\top \Phi = \check b \Phi^\top \Phi\), which implies formula (8.30).

### 8.5.2. An Approximation#

We now describe a way to approximate the \(p \times 1\) vector \(\check b_t\) instead of using formula (8.29).

In particular, the following argument adapted from [BK22] (page 240) provides a computationally efficient way to approximate \(\check b_t\).

For convenience, we’ll apply the method at time \(t=1\).

For \(t=1\), from equation (8.32) we have

where \(\check b_1\) is a \(p \times 1\) vector.

Recall from representation 1 above that \(X_1 = U \tilde b_1\), where \(\tilde b_1\) is a time \(1\) basis vector for representation 1 and \(U\) is from the full SVD \(X = U \Sigma V^\top\).

It then follows from equation (8.32) that

where \(\epsilon_1\) is a least-squares error vector from equation (8.32).

It follows that

Replacing the error term \(U^\top \epsilon_1\) by zero, and replacing \(U\) from a **full** SVD of \(X\) with \(\tilde U\) from a **reduced** SVD, we obtain an approximation \(\hat b_1\) to \(\tilde b_1\):

Recall that from equation (8.23), \( \tilde A = \tilde U^\top X' \tilde V \tilde \Sigma^{-1}\).

It then follows that

and therefore, by the eigendecomposition (8.24) of \(\tilde A\), we have

Consequently,

or

which is a computationally efficient approximation to the following instance of equation (8.29) for the initial vector \(\check b_1\):

(To highlight that (8.35) is an approximation, users of DMD sometimes call components of basis vector \(\check b_t = \Phi^+ X_t \) the **exact** DMD modes and components of \(\hat b_t = ( \tilde W \Lambda)^{-1} \tilde U^\top X_t\) the **approximate** modes.)

Conditional on \(X_t\), we can compute a decoded \(\check X_{t+j}, j = 1, 2, \ldots \) from the exact modes via

or use compute a decoded \(\hat X_{t+j}\) from approximate modes via

We can then use a decoded \(\check X_{t+j}\) or \(\hat X_{t+j}\) to forecast \(X_{t+j}\).

### 8.5.3. Using Fewer Modes#

In applications, we’ll actually use only a few modes, often three or less.

Some of the preceding formulas assume that we have retained all \(p\) modes associated with singular values of \(X\).

We can adjust our formulas to describe a situation in which we instead retain only the \(r < p\) largest singular values.

In that case, we simply replace \(\tilde \Sigma\) with the appropriate \(r\times r\) matrix of singular values, \(\tilde U\) with the \(m \times r\) matrix whose columns correspond to the \(r\) largest singular values, and \(\tilde V\) with the \(n \times r\) matrix whose columns correspond to the \(r\) largest singular values.

Counterparts of all of the salient formulas above then apply.

## 8.6. Source for Some Python Code#

You can find a Python implementation of DMD here: