{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "08d04cc4",
   "metadata": {},
   "source": [
    "# Job Search VII: A McCall Worker Q-Learns"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b79ac29d",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "This lecture illustrates a powerful machine learning technique called Q-learning.\n",
    "\n",
    "[[Sutton and Barto, 2018](https://python.quantecon.org/zreferences.html#id33)] presents Q-learning and a variety of other statistical learning procedures.\n",
    "\n",
    "The Q-learning algorithm combines ideas from\n",
    "\n",
    "- dynamic programming  \n",
    "- a recursive version of least squares known as [temporal difference learning](https://en.wikipedia.org/wiki/Temporal_difference_learning).  \n",
    "\n",
    "\n",
    "This lecture applies a Q-learning algorithm to the situation faced by  a   McCall worker.\n",
    "\n",
    "This lecture also considers the case where a McCall worker is given an option to quit the current job.\n",
    "\n",
    "Relative to the dynamic programming formulation of the McCall worker model that we studied in  [quantecon lecture](https://python.quantecon.org/mccall_model.html), a Q-learning algorithm gives the worker less knowledge about\n",
    "\n",
    "- the random process that generates a sequence of wages  \n",
    "- the reward function that tells  consequences of accepting or rejecting a job  \n",
    "\n",
    "\n",
    "The Q-learning algorithm  invokes a statistical learning model to learn about these things.\n",
    "\n",
    "Statistical learning often comes down to some version of least squares, and it will be here too.\n",
    "\n",
    "Any time we say **statistical learning**, we have to say what object is being learned.\n",
    "\n",
    "For Q-learning, the object that is learned is not  the **value function** that is a focus\n",
    "of dynamic programming.\n",
    "\n",
    "But it is something that is closely affiliated with it.\n",
    "\n",
    "In the finite-action, finite state context studied in this lecture, the object to be learned statistically is a **Q-table**, an instance of a **Q-function** for finite sets.\n",
    "\n",
    "Sometimes a Q-function or Q-table is called  a quality-function or quality-table.\n",
    "\n",
    "The rows and columns of a Q-table correspond to possible states that an agent might encounter, and possible\n",
    "actions that he can take in each state.\n",
    "\n",
    "An equation  that resembles a  Bellman equation  plays an important role in the algorithm.\n",
    "\n",
    "It  differs from the Bellman equation for the McCall model that we have seen in [this quantecon lecture](https://python.quantecon.org/mccall_model.html)\n",
    "\n",
    "In this lecture, we’ll learn a little about\n",
    "\n",
    "- the **Q-function** or **quality function** that is affiliated with any Markov decision problem whose optimal value function satisfies a Bellman equation  \n",
    "- **temporal difference learning**,  a key component of a Q-learning algorithm  \n",
    "\n",
    "\n",
    "As usual, let’s  import some Python modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd3944b6",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "!pip install quantecon"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c98ae997",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "from numba import jit, float64, int64\n",
    "from numba.experimental import jitclass\n",
    "from quantecon.distributions import BetaBinomial\n",
    "\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "np.random.seed(123)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d0767462",
   "metadata": {},
   "source": [
    "## Review of McCall Model\n",
    "\n",
    "We begin by reviewing the McCall model described in [this quantecon lecture](https://python.quantecon.org/mccall_model.html).\n",
    "\n",
    "We’ll  compute an optimal value function and a policy that attains it.\n",
    "\n",
    "We’ll eventually compare that optimal policy to what the Q-learning McCall worker learns.\n",
    "\n",
    "The McCall model is characterized by parameters $ \\beta,c $ and a known distribution of wage offers $ F $.\n",
    "\n",
    "A McCall worker wants to maximize an expected discounted sum of lifetime incomes\n",
    "\n",
    "$$\n",
    "\\mathbb{E} \\sum_{t=0}^{\\infty} \\beta^t y_t\n",
    "$$\n",
    "\n",
    "The worker’s income $ y_t $ equals his wage $ w $ if he is employed, and unemployment compensation $ c $ if he is unemployed.\n",
    "\n",
    "An optimal value  $ V\\left(w\\right) $ for a McCall worker who has just received a wage offer $ w $ and is deciding whether\n",
    "to accept or reject it satisfies the Bellman equation\n",
    "\n",
    "\n",
    "<a id='equation-eq-mccallbellman'></a>\n",
    "$$\n",
    "V\\left(w\\right)=\\max_{\\text{accept, reject}}\\;\\left\\{ \\frac{w}{1-\\beta},c+\\beta\\int V\\left(w'\\right)dF\\left(w'\\right)\\right\\} \\tag{44.1}\n",
    "$$\n",
    "\n",
    "To form a benchmark to compare with results from Q-learning, we  first approximate the optimal value function.\n",
    "\n",
    "With possible states residing in a finite discrete state space indexed by $ \\{1,2,...,n\\} $, we make an initial guess for the value function of $ v\\in\\mathbb{R}^{n} $ and then iterate on the Bellman equation:\n",
    "\n",
    "$$\n",
    "v^{\\prime}(i)=\\max \\left\\{\\frac{w(i)}{1-\\beta}, c+\\beta \\sum_{1 \\leq j \\leq n} v(j) q(j)\\right\\} \\quad \\text { for } i=1, \\ldots, n\n",
    "$$\n",
    "\n",
    "Let’s use  Python   code from [this quantecon lecture](https://python.quantecon.org/mccall_model.html).\n",
    "\n",
    "We use a Python method called `VFI` to compute the optimal value function using value function iterations.\n",
    "\n",
    "We construct an assumed distribution  of wages and plot it with the following Python code"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7983d38b",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "n, a, b = 10, 200, 100                        # default parameters\n",
    "q_default = BetaBinomial(n, a, b).pdf()       # default choice of q\n",
    "\n",
    "w_min, w_max = 10, 60\n",
    "w_default = np.linspace(w_min, w_max, n+1)\n",
    "\n",
    "# plot distribution of wage offer\n",
    "fig, ax = plt.subplots(figsize=(10,6))\n",
    "ax.plot(w_default, q_default, '-o', label='$q(w(i))$')\n",
    "ax.set_xlabel('wages')\n",
    "ax.set_ylabel('probabilities')\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "128faa91",
   "metadata": {},
   "source": [
    "Next we’ll compute the worker’s optimal value function by iterating to convergence on the Bellman equation.\n",
    "\n",
    "Then we’ll plot various iterates on the Bellman operator."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa63bc8d",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "mccall_data = [\n",
    "    ('c', float64),      # unemployment compensation\n",
    "    ('β', float64),      # discount factor\n",
    "    ('w', float64[:]),   # array of wage values, w[i] = wage at state i\n",
    "    ('q', float64[:]),    # array of probabilities\n",
    "]\n",
    "\n",
    "\n",
    "@jitclass(mccall_data)\n",
    "class McCallModel:\n",
    "\n",
    "    def __init__(self, c=25, β=0.99, w=w_default, q=q_default):\n",
    "\n",
    "        self.c, self.β = c, β\n",
    "        self.w, self.q = w, q\n",
    "\n",
    "    def state_action_values(self, i, v):\n",
    "        \"\"\"\n",
    "        The values of state-action pairs.\n",
    "        \"\"\"\n",
    "        # Simplify names\n",
    "        c, β, w, q = self.c, self.β, self.w, self.q\n",
    "        # Evaluate value for each state-action pair\n",
    "        # Consider action = accept or reject the current offer\n",
    "        accept = w[i] / (1 - β)\n",
    "        reject = c + β * np.sum(v * q)\n",
    "\n",
    "        return np.array([accept, reject])\n",
    "\n",
    "    def VFI(self, eps=1e-5, max_iter=500):\n",
    "        \"\"\"\n",
    "        Find the optimal value function.\n",
    "        \"\"\"\n",
    "\n",
    "        n = len(self.w)\n",
    "        v = self.w / (1 - self.β)\n",
    "        v_next = np.empty_like(v)\n",
    "        flag=0\n",
    "\n",
    "        for i in range(max_iter):\n",
    "            for j in range(n):\n",
    "                v_next[j] = np.max(self.state_action_values(j, v))\n",
    "\n",
    "            if np.max(np.abs(v_next - v))<=eps:\n",
    "                flag=1\n",
    "                break\n",
    "            v[:] = v_next\n",
    "\n",
    "        return v, flag\n",
    "\n",
    "def plot_value_function_seq(mcm, ax, num_plots=8):\n",
    "    \"\"\"\n",
    "    Plot a sequence of value functions.\n",
    "\n",
    "        * mcm is an instance of McCallModel\n",
    "        * ax is an axes object that implements a plot method.\n",
    "\n",
    "    \"\"\"\n",
    "\n",
    "    n = len(mcm.w)\n",
    "    v = mcm.w / (1 - mcm.β)\n",
    "    v_next = np.empty_like(v)\n",
    "    for i in range(num_plots):\n",
    "        ax.plot(mcm.w, v, '-', alpha=0.4, label=f\"iterate {i}\")\n",
    "        # Update guess\n",
    "        for i in range(n):\n",
    "            v_next[i] = np.max(mcm.state_action_values(i, v))\n",
    "        v[:] = v_next  # copy contents into v\n",
    "\n",
    "    ax.legend(loc='lower right')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8a0f32c4",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "mcm = McCallModel()\n",
    "valfunc_VFI, flag = mcm.VFI()\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10,6))\n",
    "ax.set_xlabel('wage')\n",
    "ax.set_ylabel('value')\n",
    "plot_value_function_seq(mcm, ax)\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee24a13d",
   "metadata": {},
   "source": [
    "Next we’ll print out the limit of the sequence of iterates.\n",
    "\n",
    "This  is the approximation to the McCall worker’s value function that is produced by value function iteration.\n",
    "\n",
    "We’ll use this value function as a benchmark later after we have done some Q-learning."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1523278f",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "print(valfunc_VFI)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "73aee41f",
   "metadata": {},
   "source": [
    "## Implied Quality Function  $ Q $\n",
    "\n",
    "A **quality function** $ Q $ map  state-action pairs into optimal values.\n",
    "\n",
    "They are tightly linked to optimal  value functions.\n",
    "\n",
    "But value functions  are functions just of states, and not actions.\n",
    "\n",
    "For each given  state, the quality function gives a list of optimal values that can be attained starting from that\n",
    "state, with each component of the list indicating one of the possible actions that is taken.\n",
    "\n",
    "For our McCall worker with a finite set of possible wages\n",
    "\n",
    "- the state space  $ \\mathcal{W}=\\{w_1,w_2,...,w_n\\} $ is indexed by integers $ 1,2,...,n $  \n",
    "- the action space is  $ \\mathcal{A}=\\{\\text{accept}, \\text{reject}\\} $  \n",
    "\n",
    "\n",
    "Let $ a \\in \\mathcal{A} $ be one of the  two possible actions, i.e., accept or reject.\n",
    "\n",
    "For our McCall worker, an optimal Q-function $ Q(w,a) $ equals the maximum value of that a  previously unemployed   worker who has offer $ w $ in hand can attain if he takes action $ a $.\n",
    "\n",
    "This definition of $ Q(w,a) $ presumes that in subsequent periods the worker  takes  optimal actions.\n",
    "\n",
    "An optimal   Q-function for our McCall worker satisfies\n",
    "\n",
    "\n",
    "<a id='equation-eq-impliedq'></a>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "Q\\left(w,\\text{accept}\\right) & =\\frac{w}{1-\\beta} \\\\\n",
    "Q\\left(w,\\text{reject}\\right) & =c+\\beta\\int\\max_{\\text{accept, reject}}\\left\\{ \\frac{w'}{1-\\beta},Q\\left(w',\\text{reject}\\right)\\right\\} dF\\left(w'\\right)\n",
    "\\end{aligned} \\tag{44.2}\n",
    "$$\n",
    "\n",
    "Note that the first equation of system [(44.2)](#equation-eq-impliedq) presumes that after  the agent has  accepted an offer, he will not have the objection to reject that same offer in the future.\n",
    "\n",
    "These equations are aligned with the Bellman equation for the worker’s  optimal value function that we studied in [this quantecon lecture](https://python.quantecon.org/mccall_model.html).\n",
    "\n",
    "Evidently, the optimal value function $ V(w) $ described in that lecture is related to our Q-function by\n",
    "\n",
    "$$\n",
    "V(w) = \\max_{\\textrm{accept},\\textrm{reject}} \\left\\{ Q(w, \\text{accept} \\right), Q\\left(w,\\text{reject} \\right)\\}\n",
    "$$\n",
    "\n",
    "If we stare at the second equation of system [(44.2)](#equation-eq-impliedq), we notice that since the wage process is identically and independently distributed over time,\n",
    "$ Q\\left(w,\\text{reject}\\right) $, the right side of the equation is independent of the current state   $ w $.\n",
    "\n",
    "So we can denote it as a scalar\n",
    "\n",
    "$$\n",
    "Q_r := Q\\left(w,\\text{reject}\\right) \\quad \\forall \\, w\\in\\mathcal{W}.\n",
    "$$\n",
    "\n",
    "This fact provides us with an\n",
    "an alternative, and  as it turns out in this case, a faster way to compute an optimal value function and associated optimal policy for the McCall worker.\n",
    "\n",
    "Instead of using the  value function iterations that we deployed above, we can instead  iterate to convergence on a version of the second equation in system [(44.2)](#equation-eq-impliedq)  that maps an estimate of  $ Q_r $ into an improved estimate $ Q_r' $:\n",
    "\n",
    "$$\n",
    "Q_{r}^\\prime=c+\\beta\\int\\max_{\\text{}}\\left\\{ \\frac{w'}{1-\\beta},Q_{r}\\right\\} dF\\left(w'\\right)\n",
    "$$\n",
    "\n",
    "After a $ Q_r $ sequence has converged, we can recover the optimal value function $ V(w) $ for the McCall worker from\n",
    "\n",
    "$$\n",
    "V\\left(w\\right)=\\max\\left\\{ \\frac{w}{1-\\beta},Q_{r}\\right\\}\n",
    "$$"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "24a4f1a9",
   "metadata": {},
   "source": [
    "## From Probabilities  to Samples\n",
    "\n",
    "We noted  above that  the optimal Q function for our McCall worker satisfies the Bellman equations\n",
    "\n",
    "\n",
    "<a id='equation-eq-probtosample1'></a>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "         w  & + \\beta \\max_{\\textrm{accept, reject}} \\left\\{ Q (w, \\textrm{accept}), Q(w, \\textrm{reject}) \\right\\} - Q (w, \\textrm{accept})   = 0  \\cr\n",
    "         c  & +\\beta\\int\\max_{\\text{accept, reject}}\\left\\{ Q(w', \\textrm{accept}),Q\\left(w',\\text{reject}\\right)\\right\\} dF\\left(w'\\right) - Q\\left(w,\\text{reject}\\right)  = 0  \\cr\n",
    "\\end{aligned} \\tag{44.3}\n",
    "$$\n",
    "\n",
    "Notice the integral over $ F(w') $ on the second line.\n",
    "\n",
    "Erasing the integral sign sets the stage for an illegitmate argument that can get us started thinking about  Q-learning.\n",
    "\n",
    "Thus, construct a difference  equation system that keeps the first equation of [(44.3)](#equation-eq-probtosample1)\n",
    "but replaces the second by removing integration over $ F (w') $:\n",
    "\n",
    "\n",
    "<a id='equation-eq-probtosample2'></a>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "         w  & + \\beta \\max_{\\textrm{accept, reject}} \\left\\{ Q (w, \\textrm{accept}), Q(w, \\textrm{reject}) \\right\\} - Q (w, \\textrm{accept})   = 0  \\cr\n",
    "         c  & +\\beta \\max_{\\text{accept, reject}}\\left\\{ Q(w', \\textrm{accept}),Q\\left(w',\\text{reject}\\right)\\right\\}  - Q\\left(w,\\text{reject}\\right)  \\approx 0  \\cr\n",
    "\\end{aligned} \\tag{44.4}\n",
    "$$\n",
    "\n",
    "The second equation can’t  hold for all $ w, w' $ pairs in the appropriate Cartesian product of our  state space.\n",
    "\n",
    "But maybe an appeal to  a   Law of Large numbers could let  us  hope that it would hold\n",
    "**on average** for a long time series sequence of draws of $ w_t, w_{t+1} $ pairs, where\n",
    "we are thinking of $ w_t $ as $ w $ and $ w_{t+1} $ as $ w' $.\n",
    "\n",
    "The basic idea of Q-learning is to draw a long sample of wage offers from $ F $ (we know $ F $ though we assume that the worker doesn’t) and iterate on a  recursion\n",
    "that maps an estimate $ \\hat Q_t $ of a  Q-function at date $ t $ into an improved estimate\n",
    "$ \\hat Q_{t+1} $ at date\n",
    "$ t+1 $.\n",
    "\n",
    "To set up such an algorithm, we first define some errors or “differences”\n",
    "\n",
    "\n",
    "<a id='equation-eq-old105'></a>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "         w  & + \\beta \\max_{\\textrm{accept, reject}} \\left\\{ \\hat Q_t (w_t, \\textrm{accept}), \\hat Q_t(w_t, \\textrm{reject}) \\right\\} - \\hat Q_t(w_t, \\textrm{accept})   = \\textrm{diff}_{\\textrm{accept},t}  \\cr\n",
    "         c  & +\\beta \\max_{\\text{accept, reject}}\\left\\{ \\hat Q_t(w_{t+1}, \\textrm{accept}),\\hat Q_t\\left(w_{t+1},\\text{reject}\\right)\\right\\}  - \\hat Q_t\\left(w_t,\\text{reject}\\right)  = \\textrm{diff}_{\\textrm{reject},t}  \\cr\n",
    "\\end{aligned} \\tag{44.5}\n",
    "$$\n",
    "\n",
    "The adaptive learning scheme would then be some version of\n",
    "\n",
    "\n",
    "<a id='equation-eq-old106'></a>\n",
    "$$\n",
    "\\hat Q_{t+1} = \\hat Q_t + \\alpha \\ \\textrm{diff}_t \\tag{44.6}\n",
    "$$\n",
    "\n",
    "where $ \\alpha \\in (0,1) $ is a small **gain** parameter that governs the rate of learning and  $ \\hat Q_t $ and $ \\textrm{diff}_t $ are $ 2 \\times 1 $ vectors corresponding\n",
    "to  objects in equation system [(44.5)](#equation-eq-old105).\n",
    "\n",
    "This informal argument takes us to the threshold of Q-learning."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b5ca52ef",
   "metadata": {},
   "source": [
    "## Q-Learning\n",
    "\n",
    "Let’s first describe  a $ Q $-learning algorithm precisely.\n",
    "\n",
    "Then we’ll  implement it.\n",
    "\n",
    "The algorithm works by using a Monte Carlo method to  update estimates of a Q-function.\n",
    "\n",
    "We begin with an initial guess for a  Q-function.\n",
    "\n",
    "In the example studied in this lecture,  we have a finite action space and also a finite state space.\n",
    "\n",
    "That means that we can represent a Q-function as a matrix or Q-table, $ \\widetilde{Q}(w,a) $.\n",
    "\n",
    "Q-learning proceeds by updating the Q-function as the decision maker acquires experience along a path of wage draws generated by simulation.\n",
    "\n",
    "During the learning process, our McCall worker  takes actions and\n",
    "experiences rewards that are consequences of those actions.\n",
    "\n",
    "He learns simultaneously about the environment, in this case the distribution of wages, and the reward function,\n",
    "in this case the unemployment compensation $ c $ and the present value of wages.\n",
    "\n",
    "The updating algorithm is based on a slight modification (to be described soon) of  a  recursion  like\n",
    "\n",
    "\n",
    "<a id='equation-eq-old3'></a>\n",
    "$$\n",
    "\\widetilde{Q}^{new}\\left(w,a\\right)=\\widetilde{Q}^{old}\\left(w,a\\right)+\\alpha \\widetilde{TD}\\left(w,a\\right) \\tag{44.7}\n",
    "$$\n",
    "\n",
    "where\n",
    "\n",
    "\n",
    "<a id='equation-eq-old4'></a>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\widetilde{TD}\\left(w,\\text{accept}\\right) & = \\left[ w+\\beta\\max_{a'\\in\\mathcal{A}}\\widetilde{Q}^{old}\\left(w,a'\\right) \\right]-\\widetilde{Q}^{old}\\left(w,\\text{accept}\\right) \\\\\n",
    "\\widetilde{TD}\\left(w,\\text{reject}\\right) & = \\left[ c+\\beta\\max_{a'\\in\\mathcal{A}}\\widetilde{Q}^{old}\\left(w',a'\\right) \\right]-\\widetilde{Q}^{old}\\left(w,\\text{reject}\\right),\\;w'\\sim F\n",
    "\\end{aligned} \\tag{44.8}\n",
    "$$\n",
    "\n",
    "The terms  $ \\widetilde{TD}(w,a) $ for $ a = \\left\\{\\textrm{accept,reject} \\right\\} $  are the **temporal difference errors** that drive the updates.\n",
    "\n",
    "This system is thus a version of the adaptive system that we sketched informally\n",
    "in equation [(44.6)](#equation-eq-old106).\n",
    "\n",
    "An aspect of the algorithm not yet captured by equation system [(44.8)](#equation-eq-old4) is random **experimentation** that\n",
    "we add by occasionally randomly replacing\n",
    "\n",
    "$$\n",
    "\\textrm{argmax}_{a'\\in\\mathcal{A}}\\widetilde{Q}^{old}\\left(w,a'\\right)\n",
    "$$\n",
    "\n",
    "with\n",
    "\n",
    "$$\n",
    "\\textrm{argmin}_{a'\\in\\mathcal{A}}\\widetilde{Q}^{old}\\left(w,a'\\right)\n",
    "$$\n",
    "\n",
    "and\n",
    "occasionally replacing\n",
    "\n",
    "$$\n",
    "\\textrm{argmax}_{a'\\in\\mathcal{A}}\\widetilde{Q}^{old}\\left(w',a'\\right)\n",
    "$$\n",
    "\n",
    "with\n",
    "\n",
    "$$\n",
    "\\textrm{argmin}_{a'\\in\\mathcal{A}}\\widetilde{Q}^{old}\\left(w',a'\\right)\n",
    "$$\n",
    "\n",
    "We activate such experimentation with probability $ \\epsilon $ in step 3 of the following\n",
    "pseudo-code for   our McCall worker to do Q-learning:\n",
    "\n",
    "1. Set  an arbitrary initial Q-table.  \n",
    "1. Draw an initial wage offer $ w $ from $ F $.  \n",
    "1. From the appropriate row in the Q-table, choose an action using the following $ \\epsilon $-greedy algorithm:  \n",
    "  - with probability $ 1-\\epsilon $, choose the action that maximizes the value, and  \n",
    "  - with probability $ \\epsilon $, choose the alternative action.  \n",
    "1. Update the state associated with the chosen action and compute $ \\widetilde{TD} $ according to [(44.8)](#equation-eq-old4) and update $ \\widetilde{Q} $ according to [(44.7)](#equation-eq-old3).  \n",
    "1. Either draw a new state  $ w' $ if required or else take existing wage if and update the Q-table again according to [(44.7)](#equation-eq-old3).  \n",
    "1. Stop when the old and new Q-tables are close enough, i.e., $ \\lVert\\tilde{Q}^{new}-\\tilde{Q}^{old}\\rVert_{\\infty}\\leq\\delta $ for given $ \\delta $ or if the worker keeps accepting for $ T $ periods for a prescribed $ T $.  \n",
    "1. Return to step 2 with the updated Q-table.  \n",
    "\n",
    "\n",
    "Repeat this procedure for $ N $ episodes or until the updated Q-table has converged.\n",
    "\n",
    "We call one pass through  steps 2 to 7 an “episode” or “epoch”  of temporal difference learning.\n",
    "\n",
    "In our context, each episode starts with an agent drawing an initial wage offer, i.e., a new state.\n",
    "\n",
    "The agent then takes actions based on the preset Q-table, receives rewards, and then enters a new state implied by this period’s actions.\n",
    "\n",
    "The Q-table is updated via temporal difference learning.\n",
    "\n",
    "We iterate this until convergence of the Q-table or the maximum length of an episode is reached.\n",
    "\n",
    "Multiple episodes allow the agent to start afresh and visit states that she was less likely to visit from the terminal state of a previos episode.\n",
    "\n",
    "For example, an agent who has accepted a wage offer based on her Q-table will be less likely to draw a new offer from other parts of the wage distribution.\n",
    "\n",
    "By using the $ \\epsilon $-greedy method and also by increasing the number of episodes, the Q-learning algorithm  balances  gains from exploration and from exploitation.\n",
    "\n",
    "**Remark:** Notice that    $ \\widetilde{TD} $ associated with  an optimal Q-table defined in [(44.7)](#equation-eq-old3) automatically above satisfies  $ \\widetilde{TD}=0 $ for all state action pairs.  Whether a limit of our Q-learning algorithm converges to an optimal Q-table depends on whether the algorithm visits all state-action pairs often enough.\n",
    "\n",
    "We implement this pseudo code  in a Python class.\n",
    "\n",
    "For simplicity and convenience, we let `s` represent the state index between $ 0 $ and $ n=50 $ and $ w_s=w[s] $.\n",
    "\n",
    "The first column of the Q-table represents the value associated with rejecting the wage and the second represents accepting the wage.\n",
    "\n",
    "We use `numba` compilation to accelerate computations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "212a1c7e",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "params=[\n",
    "    ('c', float64),            # unemployment compensation\n",
    "    ('β', float64),            # discount factor\n",
    "    ('w', float64[:]),         # array of wage values, w[i] = wage at state i\n",
    "    ('q', float64[:]),         # array of probabilities\n",
    "    ('eps', float64),          # for epsilon greedy algorithm\n",
    "    ('δ', float64),            # Q-table threshold\n",
    "    ('lr', float64),           # the learning rate α\n",
    "    ('T', int64),              # maximum periods of accepting\n",
    "    ('quit_allowed', int64)    # whether quit is allowed after accepting the wage offer\n",
    "]\n",
    "\n",
    "@jitclass(params)\n",
    "class Qlearning_McCall:\n",
    "    def __init__(self, c=25, β=0.99, w=w_default, q=q_default, eps=0.1,\n",
    "                 δ=1e-5, lr=0.5, T=10000, quit_allowed=0):\n",
    "\n",
    "        self.c, self.β = c, β\n",
    "        self.w, self.q = w, q\n",
    "        self.eps, self.δ, self.lr, self.T = eps, δ, lr, T\n",
    "        self.quit_allowed = quit_allowed\n",
    "\n",
    "\n",
    "    def draw_offer_index(self):\n",
    "        \"\"\"\n",
    "        Draw a state index from the wage distribution.\n",
    "        \"\"\"\n",
    "\n",
    "        q = self.q\n",
    "        return np.searchsorted(np.cumsum(q), np.random.random(), side=\"right\")\n",
    "\n",
    "    def temp_diff(self, qtable, state, accept):\n",
    "        \"\"\"\n",
    "        Compute the TD associated with state and action.\n",
    "        \"\"\"\n",
    "\n",
    "        c, β, w = self.c, self.β, self.w\n",
    "\n",
    "        if accept==0:\n",
    "            state_next = self.draw_offer_index()\n",
    "            TD = c + β*np.max(qtable[state_next, :]) - qtable[state, accept]\n",
    "        else:\n",
    "            state_next = state\n",
    "            if self.quit_allowed == 0:\n",
    "                TD = w[state_next] + β*np.max(qtable[state_next, :]) - qtable[state, accept]\n",
    "            else:\n",
    "                TD = w[state_next] + β*qtable[state_next, 1] - qtable[state, accept]\n",
    "\n",
    "        return TD, state_next\n",
    "\n",
    "    def run_one_epoch(self, qtable, max_times=20000):\n",
    "        \"\"\"\n",
    "        Run an \"epoch\".\n",
    "        \"\"\"\n",
    "\n",
    "        c, β, w = self.c, self.β, self.w\n",
    "        eps, δ, lr, T = self.eps, self.δ, self.lr, self.T\n",
    "\n",
    "        s0 = self.draw_offer_index()\n",
    "        s = s0\n",
    "        accept_count = 0\n",
    "\n",
    "        for t in range(max_times):\n",
    "\n",
    "            # choose action\n",
    "            accept = np.argmax(qtable[s, :])\n",
    "            if np.random.random()<=eps:\n",
    "                accept = 1 - accept\n",
    "\n",
    "            if accept == 1:\n",
    "                accept_count += 1\n",
    "            else:\n",
    "                accept_count = 0\n",
    "\n",
    "            TD, s_next = self.temp_diff(qtable, s, accept)\n",
    "\n",
    "            # update qtable\n",
    "            qtable_new = qtable.copy()\n",
    "            qtable_new[s, accept] = qtable[s, accept] + lr*TD\n",
    "\n",
    "            if np.max(np.abs(qtable_new-qtable))<=δ:\n",
    "                break\n",
    "\n",
    "            if accept_count == T:\n",
    "                break\n",
    "\n",
    "            s, qtable = s_next, qtable_new\n",
    "\n",
    "        return qtable_new\n",
    "\n",
    "@jit\n",
    "def run_epochs(N, qlmc, qtable):\n",
    "    \"\"\"\n",
    "    Run epochs N times with qtable from the last iteration each time.\n",
    "    \"\"\"\n",
    "\n",
    "    for n in range(N):\n",
    "        if n%(N/10)==0:\n",
    "            print(f\"Progress: EPOCHs = {n}\")\n",
    "        new_qtable = qlmc.run_one_epoch(qtable)\n",
    "        qtable = new_qtable\n",
    "\n",
    "    return qtable\n",
    "\n",
    "def valfunc_from_qtable(qtable):\n",
    "    return np.max(qtable, axis=1)\n",
    "\n",
    "def compute_error(valfunc, valfunc_VFI):\n",
    "    return np.mean(np.abs(valfunc-valfunc_VFI))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7f15e065",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "# create an instance of Qlearning_McCall\n",
    "qlmc = Qlearning_McCall()\n",
    "\n",
    "# run\n",
    "qtable0 = np.zeros((len(w_default), 2))\n",
    "qtable = run_epochs(20000, qlmc, qtable0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "48e1c2b6",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "print(qtable)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ac8b5740",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "# inspect value function\n",
    "valfunc_qlr = valfunc_from_qtable(qtable)\n",
    "\n",
    "print(valfunc_qlr)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dd2a18e2",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "# plot\n",
    "fig, ax = plt.subplots(figsize=(10,6))\n",
    "ax.plot(w_default, valfunc_VFI, '-o', label='VFI')\n",
    "ax.plot(w_default, valfunc_qlr, '-o', label='QL')\n",
    "ax.set_xlabel('wages')\n",
    "ax.set_ylabel('optimal value')\n",
    "ax.legend()\n",
    "\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "292cf6b9",
   "metadata": {},
   "source": [
    "Now, let us compute the case with a larger state space: $ n=30 $ instead of $ n=10 $."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e554c48",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "n, a, b = 30, 200, 100                        # default parameters\n",
    "q_new = BetaBinomial(n, a, b).pdf()           # default choice of q\n",
    "\n",
    "w_min, w_max = 10, 60\n",
    "w_new = np.linspace(w_min, w_max, n+1)\n",
    "\n",
    "\n",
    "# plot distribution of wage offer\n",
    "fig, ax = plt.subplots(figsize=(10,6))\n",
    "ax.plot(w_new, q_new, '-o', label='$q(w(i))$')\n",
    "ax.set_xlabel('wages')\n",
    "ax.set_ylabel('probabilities')\n",
    "\n",
    "plt.show()\n",
    "\n",
    "# VFI\n",
    "mcm = McCallModel(w=w_new, q=q_new)\n",
    "valfunc_VFI, flag = mcm.VFI()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f4e858d",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "mcm = McCallModel(w=w_new, q=q_new)\n",
    "valfunc_VFI, flag = mcm.VFI()\n",
    "valfunc_VFI"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74314527",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "def plot_epochs(epochs_to_plot, quit_allowed=1):\n",
    "    \"Plot value function implied by outcomes of an increasing number of epochs.\"\n",
    "    qlmc_new = Qlearning_McCall(w=w_new, q=q_new, quit_allowed=quit_allowed)\n",
    "    qtable = np.zeros((len(w_new),2))\n",
    "    epochs_to_plot = np.asarray(epochs_to_plot)\n",
    "    # plot\n",
    "    fig, ax = plt.subplots(figsize=(10,6))\n",
    "    ax.plot(w_new, valfunc_VFI, '-o', label='VFI')\n",
    "\n",
    "    max_epochs = np.max(epochs_to_plot)\n",
    "    # iterate on epoch numbers\n",
    "    for n in range(max_epochs + 1):\n",
    "        if n%(max_epochs/10)==0:\n",
    "            print(f\"Progress: EPOCHs = {n}\")\n",
    "        if n in epochs_to_plot:\n",
    "            valfunc_qlr = valfunc_from_qtable(qtable)\n",
    "            error = compute_error(valfunc_qlr, valfunc_VFI)\n",
    "\n",
    "            ax.plot(w_new, valfunc_qlr, '-o', label=f'QL:epochs={n}, mean error={error}')\n",
    "\n",
    "\n",
    "        new_qtable = qlmc_new.run_one_epoch(qtable)\n",
    "        qtable = new_qtable\n",
    "\n",
    "    ax.set_xlabel('wages')\n",
    "    ax.set_ylabel('optimal value')\n",
    "    ax.legend(loc='lower right')\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c07e0b7",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "plot_epochs(epochs_to_plot=[100, 1000, 10000, 100000, 200000])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ea027f0a",
   "metadata": {},
   "source": [
    "The above graphs indicates that\n",
    "\n",
    "- the Q-learning algorithm has trouble  learning  the Q-table well for wages that are rarely drawn  \n",
    "- the quality of approximation to the “true” value function computed by value function iteration improves for longer epochs  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "785d444f",
   "metadata": {},
   "source": [
    "## Employed Worker Can’t Quit\n",
    "\n",
    "The preceding version of temporal difference Q-learning described in  equation system  [(44.8)](#equation-eq-old4) lets an employed  worker quit, i.e., reject her wage as an incumbent and instead receive unemployment compensation this period\n",
    "and draw a new offer next period.\n",
    "\n",
    "This is an option that the McCall worker described in [this quantecon lecture](https://python.quantecon.org/mccall_model.html) would not take.\n",
    "\n",
    "See [[Ljungqvist and Sargent, 2018](https://python.quantecon.org/zreferences.html#id187)], chapter 6 on search, for a proof.\n",
    "\n",
    "But in the context of Q-learning, giving the worker the option to quit and get unemployment compensation while\n",
    "unemployed turns out to accelerate the learning process by promoting experimentation vis a vis premature\n",
    "exploitation only.\n",
    "\n",
    "To illustrate this, we’ll amend our formulas for temporal differences to forbid an employed worker from quitting a job she had accepted earlier.\n",
    "\n",
    "With this understanding about available choices, we obtain the following temporal difference values:\n",
    "\n",
    "\n",
    "<a id='equation-eq-temp-diff'></a>\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\widetilde{TD}\\left(w,\\text{accept}\\right) & = \\left[ w+\\beta\\widetilde{Q}^{old}\\left(w,\\text{accept}\\right) \\right]-\\widetilde{Q}^{old}\\left(w,\\text{accept}\\right) \\\\\n",
    "\\widetilde{TD}\\left(w,\\text{reject}\\right) & = \\left[ c+\\beta\\max_{a'\\in\\mathcal{A}}\\widetilde{Q}^{old}\\left(w',a'\\right) \\right]-\\widetilde{Q}^{old}\\left(w,\\text{reject}\\right),\\;w'\\sim F\n",
    "\\end{aligned} \\tag{44.9}\n",
    "$$\n",
    "\n",
    "It turns out that formulas [(44.9)](#equation-eq-temp-diff) combined with our Q-learning recursion [(44.7)](#equation-eq-old3) can lead our agent to eventually learn the optimal value function as well as in the case where an option to redraw can be exercised.\n",
    "\n",
    "But learning is slower because  an agent who ends up accepting a wage offer prematurally loses the option to explore new states in the same episode and to adjust the value associated with that state.\n",
    "\n",
    "This can lead to inferior outcomes when the number of epochs/episodes is low.\n",
    "\n",
    "But if we increase the number of epochs/episodes, we can observe that the error decreases and the outcomes get better.\n",
    "\n",
    "We illustrate these possibilities with the following code and graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9568220e",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "plot_epochs(epochs_to_plot=[100, 1000, 10000, 100000, 200000], quit_allowed=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d37ecec",
   "metadata": {},
   "source": [
    "## Possible Extensions\n",
    "\n",
    "To extend the algorthm to handle problems with continuous state spaces,\n",
    "a typical approach is to restrict Q-functions and policy functions to take particular\n",
    "functional forms.\n",
    "\n",
    "This is the approach in **deep Q-learning** where the idea is to use a multilayer neural\n",
    "network as a good function approximator.\n",
    "\n",
    "We will take up this topic in a subsequent quantecon lecture."
   ]
  }
 ],
 "metadata": {
  "date": 1751452621.2318974,
  "filename": "mccall_q.md",
  "kernelspec": {
   "display_name": "Python",
   "language": "python3",
   "name": "python3"
  },
  "title": "Job Search VII: A McCall Worker Q-Learns"
 },
 "nbformat": 4,
 "nbformat_minor": 5
}