12. Two Meanings of Probability

12.1. Overview

This lecture illustrates two distinct interpretations of a probability distribution

  • A frequentist interpretation as relative frequencies anticipated to occur in a large i.i.d. sample

  • A Bayesian interpretation as a personal opinion (about a parameter or list of parameters) after seeing a collection of observations

We recommend watching this video about hypothesis testing within the frequentist approach

After you watch that video, please watch the following video on the Bayesian approach to constructing coverage intervals

After you are familiar with the material in these videos, this lecture uses the Socratic method to to help consolidate your understanding of the different questions that are answered by

  • a frequentist confidence interval

  • a Bayesian coverage interval

We do this by inviting you to write some Python code.

It would be especially useful if you tried doing this after each question that we pose for you, before proceeding to read the rest of the lecture.

We provide our own answers as the lecture unfolds, but you’ll learn more if you try writing your own code before reading and running ours.

Code for answering questions:

In addition to what’s in Anaconda, this lecture will deploy the following library:

pip install prettytable
Requirement already satisfied: prettytable in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (3.5.0)
Requirement already satisfied: wcwidth in /__w/lecture-python.myst/lecture-python.myst/3/envs/quantecon/lib/python3.9/site-packages (from prettytable) (0.2.5)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

Note: you may need to restart the kernel to use updated packages.

To answer our coding questions, we’ll start with some imports

import numpy as np
import pandas as pd
import prettytable as pt
import matplotlib.pyplot as plt
from scipy.stats import binom
import scipy.stats as st
%matplotlib inline

Empowered with these Python tools, we’ll now explore the two meanings described above.

12.2. Frequentist Interpretation

Consider the following classic example.

The random variable \(X \) takes on possible values \(k = 0, 1, 2, \ldots, n\) with probabilties

\[ \textrm{Prob}(X = k | \theta) = \left(\frac{n!}{k! (n-k)!} \right) \theta^k (1-\theta)^{n-k} \]

where the fixed parameter \(\theta \in (0,1)\).

This is called the binomial distribution.

Here

  • \(\theta\) is the probability that one toss of a coin will be a head, an outcome that we encode as \(Y = 1\).

  • \(1 -\theta\) is the probability that one toss of the coin will be a tail, an outcome that we denote \(Y = 0\).

  • \(X\) is the total number of heads that came up after flipping the coin \(n\) times.

Consider the following experiment:

Take \(I\) independent sequences of \(n\) independent flips of the coin

Notice the repeated use of the adjective independent:

  • we use it once to describe that we are drawing \(n\) independent times from a Bernoulli distribution with parameter \(\theta\) to arrive at one draw from a Binomial distribution with parameters \(\theta,n\).

  • we use it again to describe that we are then drawing \(I\) sequences of \(n\) coin draws.

Let \(y_h^i \in \{0, 1\}\) be the realized value of \(Y\) on the \(h\)th flip during the \(i\)th sequence of flips.

Let \(\sum_{h=1}^n y_h^i\) denote the total number of times heads come up during the \(i\)th sequence of \(n\) independent coin flips.

Let \(f_k\) record the fraction of samples of length \(n\) for which \(\sum_{h=1}^n y_h^i = k\):

\[ f_k^I = \frac{\textrm{number of samples of length n for which } \sum_{h=1}^n y_h^i = k}{ I} \]

The probability \(\textrm{Prob}(X = k | \theta)\) answers the following question:

  • As \(I\) becomes large, in what fraction of \(I\) independent draws of \(n\) coin flips should we anticipate \(k\) heads to occur?

As usual, a law of large numbers justifies this answer.

Exercise 12.1

  1. Please write a Python class to compute \(f_k^I\)

  2. Please use your code to compute \(f_k^I, k = 0, \ldots , n\) and compare them to \(\textrm{Prob}(X = k | \theta)\) for various values of \(\theta, n\) and \(I\)

  3. With the Law of Large numbers in mind, use your code to say something

Let’s do some more calculations.

Comparison with different \(\theta\)

Now we fix

\[ n=20, k=10, I=1,000,000 \]

We’ll vary \(\theta\) from \(0.01\) to \(0.99\) and plot outcomes against \(\theta\).

θ_low, θ_high, npt = 0.01, 0.99, 50
thetas = np.linspace(θ_low, θ_high, npt)
P = []
f_kI = []
for i in range(npt):
    freq = frequentist(thetas[i], n, I)
    freq.binomial(k)
    freq.draw()
    freq.compute_fk(k)
    P.append(freq.P)
    f_kI.append(freq.f_kI)
fig, ax = plt.subplots(figsize=(8, 6))
ax.grid()
ax.plot(thetas, P, 'k-.', label='Theoretical')
ax.plot(thetas, f_kI, 'r--', label='Fraction')
plt.title(r'Comparison with different $\theta$', fontsize=16)
plt.xlabel(r'$\theta$', fontsize=15)
plt.ylabel('Fraction', fontsize=15)
plt.tick_params(labelsize=13)
plt.legend()
plt.show()
_images/prob_meaning_9_0.png

Comparison with different \(n\)

Now we fix \(\theta=0.7, k=10, I=1,000,000\) and vary \(n\) from \(1\) to \(100\).

Then we’ll plot outcomes.

n_low, n_high, nn = 1, 100, 50
ns = np.linspace(n_low, n_high, nn, dtype='int')
P = []
f_kI = []
for i in range(nn):
    freq = frequentist(θ, ns[i], I)
    freq.binomial(k)
    freq.draw()
    freq.compute_fk(k)
    P.append(freq.P)
    f_kI.append(freq.f_kI)
fig, ax = plt.subplots(figsize=(8, 6))
ax.grid()
ax.plot(ns, P, 'k-.', label='Theoretical')
ax.plot(ns, f_kI, 'r--', label='Frequentist')
plt.title(r'Comparison with different $n$', fontsize=16)
plt.xlabel(r'$n$', fontsize=15)
plt.ylabel('Fraction', fontsize=15)
plt.tick_params(labelsize=13)
plt.legend()
plt.show()
_images/prob_meaning_12_0.png

Comparison with different \(I\)

Now we fix \(\theta=0.7, n=20, k=10\) and vary \(\log(I)\) from \(2\) to \(7\).

I_log_low, I_log_high, nI = 2, 6, 200
log_Is = np.linspace(I_log_low, I_log_high, nI)
Is = np.power(10, log_Is).astype(int)
P = []
f_kI = []
for i in range(nI):
    freq = frequentist(θ, n, Is[i])
    freq.binomial(k)
    freq.draw()
    freq.compute_fk(k)
    P.append(freq.P)
    f_kI.append(freq.f_kI)
fig, ax = plt.subplots(figsize=(8, 6))
ax.grid()
ax.plot(Is, P, 'k-.', label='Theoretical')
ax.plot(Is, f_kI, 'r--', label='Fraction')
plt.title(r'Comparison with different $I$', fontsize=16)
plt.xlabel(r'$I$', fontsize=15)
plt.ylabel('Fraction', fontsize=15)
plt.tick_params(labelsize=13)
plt.legend()
plt.show()
_images/prob_meaning_15_0.png

From the above graphs, we can see that \(I\), the number of independent sequences, plays an important role.

When \(I\) becomes larger, the difference between theoretical probability and frequentist estimate becomes smaller.

Also, as long as \(I\) is large enough, changing \(\theta\) or \(n\) does not substantially change the accuracy of the observed fraction as an approximation of \(\theta\).

The Law of Large Numbers is at work here.

For each draw of an independent sequence, \(\textrm{Prob}(X_i = k | \theta)\) is the same, so aggregating all draws forms an i.i.d sequence of a binary random variable \(\rho_{k,i},i=1,2,...I\), with a mean of \(\textrm{Prob}(X = k | \theta)\) and a variance of

\[ n \cdot \textrm{Prob}(X = k | \theta) \cdot (1-\textrm{Prob}(X = k | \theta)). \]

So, by the LLN, the average of \(P_{k,i}\) converges to:

\[ E[\rho_{k,i}] = \textrm{Prob}(X = k | \theta) = \left(\frac{n!}{k! (n-k)!} \right) \theta^k (1-\theta)^{n-k} \]

as \(I\) goes to infinity.

12.3. Bayesian Interpretation

We again use a binomial distribution.

But now we don’t regard \(\theta\) as being a fixed number.

Instead, we think of it as a random variable.

\(\theta\) is described by a probability distribution.

But now this probability distribution means something different than a relative frequency that we can anticipate to occur in a large i.i.d. sample.

Instead, the probability distribution of \(\theta\) is now a summary of our views about likely values of \(\theta\) either

  • before we have seen any data at all, or

  • before we have seen more data, after we have seen some data

Thus, suppose that, before seeing any data, you have a personal prior probability distribution saying that

\[ P(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta -1}}{B(\alpha, \beta)} \]

where \(B(\alpha, \beta)\) is a beta function , so that \(P(\theta)\) is a beta distribution with parameters \(\alpha, \beta\).

Exercise 12.2

a) Please write down the likelihood function for a sample of length \(n\) from a binomial distribution with parameter \(\theta\).

b) Please write down the posterior distribution for \(\theta\) after observing one flip of the coin.

c) Now pretend that the true value of \(\theta = .4\) and that someone who doesn’t know this has a beta prior distribution with parameters with \(\beta = \alpha = .5\). Please write a Python class to simulate this person’s personal posterior distribution for \(\theta\) for a single sequence of \(n\) draws.

d) Please plot the posterior distribution for \(\theta\) as a function of \(\theta\) as \(n\) grows as \(1, 2, \ldots\).

e) For various \(n\)’s, please describe and compute a Bayesian coverage interval for the interval \([.45, .55]\).

f) Please tell what question a Bayesian coverage interval answers.

g) Please compute the Posterior probabililty that \(\theta \in [.45, .55]\) for various values of sample size \(n\).

h) Please use your Python class to study what happens to the posterior distribution as \(n \rightarrow + \infty\), again assuming that the true value of \(\theta = .4\), though it is unknown to the person doing the updating via Bayes’ Law.

How shall we interpret the patterns above?

The answer is encoded in the Bayesian updating formulas.

It is natural to extend the one-step Bayesian update to an \(n\)-step Bayesian update.

\[ \textrm{Prob}(\theta|k) = \frac{\textrm{Prob}(\theta,k)}{\textrm{Prob}(k)}=\frac{\textrm{Prob}(k|\theta)*\textrm{Prob}(\theta)}{\textrm{Prob}(k)}=\frac{\textrm{Prob}(k|\theta)*\textrm{Prob}(\theta)}{\int_0^1 \textrm{Prob}(k|\theta)*\textrm{Prob}(\theta) d\theta} \]
\[ =\frac{{N \choose k} (1 - \theta)^{N-k} \theta^k*\frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)}}{\int_0^1 {N \choose k} (1 - \theta)^{N-k} \theta^k*\frac{\theta^{\alpha - 1} (1 - \theta)^{\beta - 1}}{B(\alpha, \beta)} d\theta} \]
\[ =\frac{(1 -\theta)^{\beta+N-k-1}* \theta^{\alpha+k-1}}{\int_0^1 (1 - \theta)^{\beta+N-k-1}* \theta^{\alpha+k-1} d\theta} \]
\[ ={Beta}(\alpha + k, \beta+N-k) \]

A beta distribution with \(\alpha\) and \(\beta\) has the following mean and variance.

The mean is \(\frac{\alpha}{\alpha + \beta}\)

The variance is \(\frac{\alpha \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}\)

  • \(\alpha\) can be viewed as the number of successes

  • \(\beta\) can be viewed as the number of failures

The random variables \(k\) and \(N-k\) are governed by Binomial Distribution with \(\theta=0.4\).

Call this the true data generating process.

According to the Law of Large Numbers, for a large number of observations, observed frequencies of \(k\) and \(N-k\) will be described by the true data generating process, i.e., the population probability distribution that we assumed when generating the observations on the computer. (See Exercise 12.1).

Consequently, the mean of the posterior distribution converges to \(0.4\) and the variance withers to zero.

upper_bound = [ii.ppf(0.95) for ii in Bay_stat.posterior_list]
lower_bound = [ii.ppf(0.05) for ii in Bay_stat.posterior_list]

fig, ax = plt.subplots(figsize=(10, 6))
ax.scatter(np.arange(len(upper_bound)), upper_bound, label='95 th Quantile')
ax.scatter(np.arange(len(lower_bound)), lower_bound, label='05 th Quantile')

ax.set_xticks(np.arange(0, len(upper_bound), 2))  
ax.set_xticklabels(num_list[::2])
ax.set_xlabel('Number of Observations', fontsize=12)
ax.set_title('Bayesian Coverage Intervals of Posterior Distributions', fontsize=15)

ax.legend(fontsize=11)
plt.show()
_images/prob_meaning_29_0.png

After observing a large number of outcomes, the posterior distribution collapses around \(0.4\).

Thus, the Bayesian statististian comes to believe that \(\theta\) is near \(.4\).

As shown in the figure above, as the number of observations grows, the Bayesian coverage intervals (BCIs) become narrower and narrower around \(0.4\).

However, if you take a closer look, you will find that the centers of the BCIs are not exactly \(0.4\), due to the persistent influence of the prior distribution and the randomness of the simulation path.

12.4. Role of a Conjugate Prior

We have made assumptions that link functional forms of our likelihood function and our prior in a way that has eased our calculations considerably.

In particular, our assumptions that the likelihood function is binomial and that the prior distribution is a beta distribution have the consequence that the posterior distribution implied by Bayes’ Law is also a beta distribution.

So posterior and prior are both beta distributions, albeit ones with different parameters.

When a likelihood function and prior fit together like hand and glove in this way, we can say that the prior and posterior are conjugate distributions.

In this situation, we also sometimes say that we have conjugate prior for the likelihood function \(\textrm{Prob}(X | \theta)\).

Typically, the functional form of the likelihood function determines the functional form of a conjugate prior.

A natural question to ask is why should a person’s personal prior about a parameter \(\theta\) be restricted to be described by a conjugate prior?

Why not some other functional form that more sincerely describes the person’s beliefs.

To be argumentative, one could ask, why should the form of the likelihood function have anything to say about my personal beliefs about \(\theta\)?

A dignified response to that question is, well, it shouldn’t, but if you want to compute a posterior easily you’ll just be happier if your prior is conjugate to your likelihood.

Otherwise, your posterior won’t have a convenient analytical form and you’ll be in the situation of wanting to apply the Markov chain Monte Carlo techniques deployed in this quantecon lecture.

We also apply these powerful methods to approximating Bayesian posteriors for non-conjugate priors in this quantecon lecture and this quantecon lecture