{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "73eaca28",
   "metadata": {},
   "source": [
    "<div id=\"qe-notebook-header\" align=\"right\" style=\"text-align:right;\">\n",
    "        <a href=\"https://quantecon.org/\" title=\"quantecon.org\">\n",
    "                <img style=\"width:250px;display:inline;\" width=\"250px\" src=\"https://assets.quantecon.org/img/qe-menubar-logo.svg\" alt=\"QuantEcon\">\n",
    "        </a>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "362a8b6d",
   "metadata": {},
   "source": [
    "# Linear Regression in Python"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b676e10f",
   "metadata": {},
   "source": [
    "## Contents\n",
    "\n",
    "- [Linear Regression in Python](#Linear-Regression-in-Python)  \n",
    "  - [Overview](#Overview)  \n",
    "  - [Simple Linear Regression](#Simple-Linear-Regression)  \n",
    "  - [Extending the Linear Regression Model](#Extending-the-Linear-Regression-Model)  \n",
    "  - [Endogeneity](#Endogeneity)  \n",
    "  - [Summary](#Summary)  \n",
    "  - [Exercises](#Exercises)  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "106e1769",
   "metadata": {},
   "source": [
    "In addition to what’s in Anaconda, this lecture will need the following libraries:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "701b2e4d",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "!pip install linearmodels"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "46e9bc8a",
   "metadata": {},
   "source": [
    "## Overview\n",
    "\n",
    "Linear regression is a standard tool for analyzing the relationship between two or more variables.\n",
    "\n",
    "In this lecture, we’ll use the Python package `statsmodels` to estimate, interpret, and visualize linear regression models.\n",
    "\n",
    "Along the way, we’ll discuss a variety of topics, including\n",
    "\n",
    "- simple and multivariate linear regression  \n",
    "- visualization  \n",
    "- endogeneity and omitted variable bias  \n",
    "- two-stage least squares  \n",
    "\n",
    "\n",
    "As an example, we will replicate results from Acemoglu, Johnson and Robinson’s seminal paper [[Acemoglu *et al.*, 2001](https://python.quantecon.org/zreferences.html#id98)].\n",
    "\n",
    "- You can download a copy [here](https://economics.mit.edu/research/publications/colonial-origins-comparative-development-empirical-investigation).  \n",
    "\n",
    "\n",
    "In the paper, the authors emphasize the importance of institutions in economic development.\n",
    "\n",
    "The main contribution is the use of settler mortality rates as a source of *exogenous* variation in institutional differences.\n",
    "\n",
    "Such variation is needed to determine whether it is institutions that give rise to greater economic growth, rather than the other way around.\n",
    "\n",
    "Let’s start with some imports:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b3f2e9c4",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "plt.rcParams[\"figure.figsize\"] = (11, 5)  #set default figure size\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import statsmodels.api as sm\n",
    "from statsmodels.iolib.summary2 import summary_col\n",
    "from linearmodels.iv import IV2SLS\n",
    "import seaborn as sns\n",
    "sns.set_theme()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eda76b22",
   "metadata": {},
   "source": [
    "### Prerequisites\n",
    "\n",
    "This lecture assumes you are familiar with basic econometrics.\n",
    "\n",
    "For an introductory text covering these topics, see, for example,\n",
    "[[Wooldridge, 2015](https://python.quantecon.org/zreferences.html#id97)]."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6831e0b7",
   "metadata": {},
   "source": [
    "## Simple Linear Regression\n",
    "\n",
    "[[Acemoglu *et al.*, 2001](https://python.quantecon.org/zreferences.html#id98)] wish to determine whether or not differences in institutions can help to explain observed economic outcomes.\n",
    "\n",
    "How do we measure *institutional differences* and *economic outcomes*?\n",
    "\n",
    "In this paper,\n",
    "\n",
    "- economic outcomes are proxied by log GDP per capita in 1995, adjusted for exchange rates.  \n",
    "- institutional differences are proxied by an index of protection against expropriation on average over 1985-95, constructed by the [Political Risk Services Group](https://www.prsgroup.com/).  \n",
    "\n",
    "\n",
    "These variables and other data used in the paper are available for download on Daron Acemoglu’s [webpage](https://economics.mit.edu/people/faculty/daron-acemoglu/data-archive).\n",
    "\n",
    "We will use pandas’ `.read_stata()` function to read in data contained in the `.dta` files to dataframes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c81b4dc6",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df1 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable1.dta?raw=true')\n",
    "df1.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eb74168f",
   "metadata": {},
   "source": [
    "Let’s use a scatterplot to see whether any obvious relationship exists\n",
    "between GDP per capita and the protection against\n",
    "expropriation index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c44d1f3f",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df1.plot(x='avexpr', y='logpgp95', kind='scatter')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7b3522ce",
   "metadata": {},
   "source": [
    "The plot shows a fairly strong positive relationship between\n",
    "protection against expropriation and log GDP per capita.\n",
    "\n",
    "Specifically, if higher protection against expropriation is a measure of\n",
    "institutional quality, then better institutions appear to be positively\n",
    "correlated with better economic outcomes (higher GDP per capita).\n",
    "\n",
    "Given the plot, choosing a linear model to describe this relationship\n",
    "seems like a reasonable assumption.\n",
    "\n",
    "We can write our model as\n",
    "\n",
    "$$\n",
    "{logpgp95}_i = \\beta_0 + \\beta_1 {avexpr}_i + u_i\n",
    "$$\n",
    "\n",
    "where:\n",
    "\n",
    "- $ \\beta_0 $ is the intercept of the linear trend line on the\n",
    "  y-axis  \n",
    "- $ \\beta_1 $ is the slope of the linear trend line, representing\n",
    "  the *marginal effect* of protection against risk on log GDP per\n",
    "  capita  \n",
    "- $ u_i $ is a random error term (deviations of observations from\n",
    "  the linear trend due to factors not included in the model)  \n",
    "\n",
    "\n",
    "Visually, this linear model involves choosing a straight line that best\n",
    "fits the data, as in the following plot (Figure 2 in [[Acemoglu *et al.*, 2001](https://python.quantecon.org/zreferences.html#id98)])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b5b78dd",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "# Dropping NA's is required to use numpy's polyfit\n",
    "df1_subset = df1.dropna(subset=['logpgp95', 'avexpr'])\n",
    "\n",
    "# Use only 'base sample' for plotting purposes\n",
    "df1_subset = df1_subset[df1_subset['baseco'] == 1]\n",
    "\n",
    "X = df1_subset['avexpr']\n",
    "y = df1_subset['logpgp95']\n",
    "labels = df1_subset['shortnam']\n",
    "\n",
    "# Replace markers with country labels\n",
    "fig, ax = plt.subplots()\n",
    "ax.scatter(X, y, marker='')\n",
    "\n",
    "for i, label in enumerate(labels):\n",
    "    ax.annotate(label, (X.iloc[i], y.iloc[i]))\n",
    "\n",
    "# Fit a linear trend line\n",
    "ax.plot(np.unique(X),\n",
    "         np.poly1d(np.polyfit(X, y, 1))(np.unique(X)),\n",
    "         color='black')\n",
    "\n",
    "ax.set_xlim([3.3,10.5])\n",
    "ax.set_ylim([4,10.5])\n",
    "ax.set_xlabel('Average Expropriation Risk 1985-95')\n",
    "ax.set_ylabel('Log GDP per capita, PPP, 1995')\n",
    "ax.set_title('Figure 2: OLS relationship between expropriation \\\n",
    "    risk and income')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d6513c07",
   "metadata": {},
   "source": [
    "The most common technique to estimate the parameters ($ \\beta $’s)\n",
    "of the linear model is Ordinary Least Squares (OLS).\n",
    "\n",
    "As the name implies, an OLS model is solved by finding the parameters\n",
    "that minimize *the sum of squared residuals*, i.e.\n",
    "\n",
    "$$\n",
    "\\underset{\\hat{\\beta}}{\\min} \\sum^N_{i=1}{\\hat{u}^2_i}\n",
    "$$\n",
    "\n",
    "where $ \\hat{u}_i $ is the difference between the observation and\n",
    "the predicted value of the dependent variable.\n",
    "\n",
    "To estimate the constant term $ \\beta_0 $, we need to add a column\n",
    "of 1’s to our dataset (consider the equation if $ \\beta_0 $ was\n",
    "replaced with $ \\beta_0 x_i $ and $ x_i = 1 $)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "47f9c691",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df1['const'] = 1"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19e08332",
   "metadata": {},
   "source": [
    "Now we can construct our model in `statsmodels` using the OLS function.\n",
    "\n",
    "We will use `pandas` dataframes with `statsmodels`, however standard arrays can also be used as arguments"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6f14de9a",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "reg1 = sm.OLS(endog=df1['logpgp95'], exog=df1[['const', 'avexpr']], \\\n",
    "    missing='drop')\n",
    "type(reg1)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe735c54",
   "metadata": {},
   "source": [
    "So far we have simply constructed our model.\n",
    "\n",
    "We need to use `.fit()` to obtain parameter estimates\n",
    "$ \\hat{\\beta}_0 $ and $ \\hat{\\beta}_1 $"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "badeb94f",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "results = reg1.fit()\n",
    "type(results)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9a269d2b",
   "metadata": {},
   "source": [
    "We now have the fitted regression model stored in `results`.\n",
    "\n",
    "To view the OLS regression results, we can call the `.summary()`\n",
    "method.\n",
    "\n",
    "Note that an observation was mistakenly dropped from the results in the\n",
    "original paper (see the note located in `maketable2.do` from Acemoglu’s webpage), and thus the\n",
    "coefficients differ slightly."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79a0c642",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "print(results.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b6b187ec",
   "metadata": {},
   "source": [
    "From our results, we see that\n",
    "\n",
    "- The intercept $ \\hat{\\beta}_0 = 4.63 $.  \n",
    "- The slope $ \\hat{\\beta}_1 = 0.53 $.  \n",
    "- The positive $ \\hat{\\beta}_1 $ parameter estimate implies that.\n",
    "  institutional quality has a positive effect on economic outcomes, as\n",
    "  we saw in the figure.  \n",
    "- The p-value of 0.000 for $ \\hat{\\beta}_1 $ implies that the\n",
    "  effect of institutions on GDP is statistically significant (using p <\n",
    "  0.05 as a rejection rule).  \n",
    "- The R-squared value of 0.611 indicates that around 61% of variation\n",
    "  in log GDP per capita is explained by protection against\n",
    "  expropriation.  \n",
    "\n",
    "\n",
    "Using our parameter estimates, we can now write our estimated\n",
    "relationship as\n",
    "\n",
    "$$\n",
    "\\widehat{logpgp95}_i = 4.63 + 0.53 \\ {avexpr}_i\n",
    "$$\n",
    "\n",
    "This equation describes the line that best fits our data, as shown in\n",
    "Figure 2.\n",
    "\n",
    "We can use this equation to predict the level of log GDP per capita for\n",
    "a value of the index of expropriation protection.\n",
    "\n",
    "For example, for a country with an index value of 7.07 (the average for\n",
    "the dataset), we find that their predicted level of log GDP per capita\n",
    "in 1995 is 8.38."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0b003ab2",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "mean_expr = np.mean(df1_subset['avexpr'])\n",
    "mean_expr"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9f007754",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "predicted_logpdp95 = 4.63 + 0.53 * 7.07\n",
    "predicted_logpdp95"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "95e2f2af",
   "metadata": {},
   "source": [
    "An easier (and more accurate) way to obtain this result is to use\n",
    "`.predict()` and set $ constant = 1 $ and\n",
    "$ {avexpr}_i = mean\\_expr $"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "619701f8",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "results.predict(exog=[1, mean_expr])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08992f27",
   "metadata": {},
   "source": [
    "We can obtain an array of predicted $ {logpgp95}_i $ for every value\n",
    "of $ {avexpr}_i $ in our dataset by calling `.predict()` on our\n",
    "results.\n",
    "\n",
    "Plotting the predicted values against $ {avexpr}_i $ shows that the\n",
    "predicted values lie along the linear line that we fitted above.\n",
    "\n",
    "The observed values of $ {logpgp95}_i $ are also plotted for\n",
    "comparison purposes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6a9ca03",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "# Drop missing observations from whole sample\n",
    "\n",
    "df1_plot = df1.dropna(subset=['logpgp95', 'avexpr'])\n",
    "\n",
    "# Plot predicted values\n",
    "\n",
    "fix, ax = plt.subplots()\n",
    "ax.scatter(df1_plot['avexpr'], results.predict(), alpha=0.5,\n",
    "        label='predicted')\n",
    "\n",
    "# Plot observed values\n",
    "\n",
    "ax.scatter(df1_plot['avexpr'], df1_plot['logpgp95'], alpha=0.5,\n",
    "        label='observed')\n",
    "\n",
    "ax.legend()\n",
    "ax.set_title('OLS predicted values')\n",
    "ax.set_xlabel('avexpr')\n",
    "ax.set_ylabel('logpgp95')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e9b81d31",
   "metadata": {},
   "source": [
    "## Extending the Linear Regression Model\n",
    "\n",
    "So far we have only accounted for institutions affecting economic\n",
    "performance - almost certainly there are numerous other factors\n",
    "affecting GDP that are not included in our model.\n",
    "\n",
    "Leaving out variables that affect $ logpgp95_i $ will result in **omitted variable bias**, yielding biased and inconsistent parameter estimates.\n",
    "\n",
    "We can extend our bivariate regression model to a **multivariate regression model** by adding in other factors that may affect $ logpgp95_i $.\n",
    "\n",
    "[[Acemoglu *et al.*, 2001](https://python.quantecon.org/zreferences.html#id98)] consider other factors such as:\n",
    "\n",
    "- the effect of climate on economic outcomes; latitude is used to proxy\n",
    "  this  \n",
    "- differences that affect both economic performance and institutions,\n",
    "  eg. cultural, historical, etc.; controlled for with the use of\n",
    "  continent dummies  \n",
    "\n",
    "\n",
    "Let’s estimate some of the extended models considered in the paper\n",
    "(Table 2) using data from `maketable2.dta`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "466f4c7b",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df2 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable2.dta?raw=true')\n",
    "\n",
    "# Add constant term to dataset\n",
    "df2['const'] = 1\n",
    "\n",
    "# Create lists of variables to be used in each regression\n",
    "X1 = ['const', 'avexpr']\n",
    "X2 = ['const', 'avexpr', 'lat_abst']\n",
    "X3 = ['const', 'avexpr', 'lat_abst', 'asia', 'africa', 'other']\n",
    "\n",
    "# Estimate an OLS regression for each set of variables\n",
    "reg1 = sm.OLS(df2['logpgp95'], df2[X1], missing='drop').fit()\n",
    "reg2 = sm.OLS(df2['logpgp95'], df2[X2], missing='drop').fit()\n",
    "reg3 = sm.OLS(df2['logpgp95'], df2[X3], missing='drop').fit()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6fb866c9",
   "metadata": {},
   "source": [
    "Now that we have fitted our model, we will use `summary_col` to\n",
    "display the results in a single table (model numbers correspond to those\n",
    "in the paper)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "86a73f57",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "info_dict={'R-squared' : lambda x: f\"{x.rsquared:.2f}\",\n",
    "           'No. observations' : lambda x: f\"{int(x.nobs):d}\"}\n",
    "\n",
    "results_table = summary_col(results=[reg1,reg2,reg3],\n",
    "                            float_format='%0.2f',\n",
    "                            stars = True,\n",
    "                            model_names=['Model 1',\n",
    "                                         'Model 3',\n",
    "                                         'Model 4'],\n",
    "                            info_dict=info_dict,\n",
    "                            regressor_order=['const',\n",
    "                                             'avexpr',\n",
    "                                             'lat_abst',\n",
    "                                             'asia',\n",
    "                                             'africa'])\n",
    "\n",
    "results_table.add_title('Table 2 - OLS Regressions')\n",
    "\n",
    "print(results_table)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d9ac383e",
   "metadata": {},
   "source": [
    "## Endogeneity\n",
    "\n",
    "As [[Acemoglu *et al.*, 2001](https://python.quantecon.org/zreferences.html#id98)] discuss, the OLS models likely suffer from\n",
    "**endogeneity** issues, resulting in biased and inconsistent model\n",
    "estimates.\n",
    "\n",
    "Namely, there is likely a two-way relationship between institutions and\n",
    "economic outcomes:\n",
    "\n",
    "- richer countries may be able to afford or prefer better institutions  \n",
    "- variables that affect income may also be correlated with\n",
    "  institutional differences  \n",
    "- the construction of the index may be biased; analysts may be biased\n",
    "  towards seeing countries with higher income having better\n",
    "  institutions  \n",
    "\n",
    "\n",
    "To deal with endogeneity, we can use **two-stage least squares (2SLS)\n",
    "regression**, which is an extension of OLS regression.\n",
    "\n",
    "This method requires replacing the endogenous variable\n",
    "$ {avexpr}_i $ with a variable that is:\n",
    "\n",
    "1. correlated with $ {avexpr}_i $  \n",
    "1. not correlated with the error term (ie. it should not directly affect\n",
    "  the dependent variable, otherwise it would be correlated with\n",
    "  $ u_i $ due to omitted variable bias)  \n",
    "\n",
    "\n",
    "The new set of regressors is called an **instrument**, which aims to\n",
    "remove endogeneity in our proxy of institutional differences.\n",
    "\n",
    "The main contribution of [[Acemoglu *et al.*, 2001](https://python.quantecon.org/zreferences.html#id98)] is the use of settler mortality\n",
    "rates to instrument for institutional differences.\n",
    "\n",
    "They hypothesize that higher mortality rates of colonizers led to the\n",
    "establishment of institutions that were more extractive in nature (less\n",
    "protection against expropriation), and these institutions still persist\n",
    "today.\n",
    "\n",
    "Using a scatterplot (Figure 3 in [[Acemoglu *et al.*, 2001](https://python.quantecon.org/zreferences.html#id98)]), we can see protection\n",
    "against expropriation is negatively correlated with settler mortality\n",
    "rates, coinciding with the authors’ hypothesis and satisfying the first\n",
    "condition of a valid instrument."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fec593a7",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "# Dropping NA's is required to use numpy's polyfit\n",
    "df1_subset2 = df1.dropna(subset=['logem4', 'avexpr'])\n",
    "\n",
    "X = df1_subset2['logem4']\n",
    "y = df1_subset2['avexpr']\n",
    "labels = df1_subset2['shortnam']\n",
    "\n",
    "# Replace markers with country labels\n",
    "fig, ax = plt.subplots()\n",
    "ax.scatter(X, y, marker='')\n",
    "\n",
    "for i, label in enumerate(labels):\n",
    "    ax.annotate(label, (X.iloc[i], y.iloc[i]))\n",
    "\n",
    "# Fit a linear trend line\n",
    "ax.plot(np.unique(X),\n",
    "         np.poly1d(np.polyfit(X, y, 1))(np.unique(X)),\n",
    "         color='black')\n",
    "\n",
    "ax.set_xlim([1.8,8.4])\n",
    "ax.set_ylim([3.3,10.4])\n",
    "ax.set_xlabel('Log of Settler Mortality')\n",
    "ax.set_ylabel('Average Expropriation Risk 1985-95')\n",
    "ax.set_title('Figure 3: First-stage relationship between settler mortality \\\n",
    "    and expropriation risk')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "14c971b7",
   "metadata": {},
   "source": [
    "The second condition may not be satisfied if settler mortality rates in the 17th to 19th centuries have a direct effect on current GDP (in addition to their indirect effect through institutions).\n",
    "\n",
    "For example, settler mortality rates may be related to the current disease environment in a country, which could affect current economic performance.\n",
    "\n",
    "[[Acemoglu *et al.*, 2001](https://python.quantecon.org/zreferences.html#id98)] argue this is unlikely because:\n",
    "\n",
    "- The majority of settler deaths were due to malaria and yellow fever\n",
    "  and had a limited effect on local people.  \n",
    "- The disease burden on local people in Africa or India, for example,\n",
    "  did not appear to be higher than average, supported by relatively\n",
    "  high population densities in these areas before colonization.  \n",
    "\n",
    "\n",
    "As we appear to have a valid instrument, we can use 2SLS regression to\n",
    "obtain consistent and unbiased parameter estimates.\n",
    "\n",
    "**First stage**\n",
    "\n",
    "The first stage involves regressing the endogenous variable\n",
    "($ {avexpr}_i $) on the instrument.\n",
    "\n",
    "The instrument is the set of all exogenous variables in our model (and\n",
    "not just the variable we have replaced).\n",
    "\n",
    "Using model 1 as an example, our instrument is simply a constant and\n",
    "settler mortality rates $ {logem4}_i $.\n",
    "\n",
    "Therefore, we will estimate the first-stage regression as\n",
    "\n",
    "$$\n",
    "{avexpr}_i = \\delta_0 + \\delta_1 {logem4}_i + v_i\n",
    "$$\n",
    "\n",
    "The data we need to estimate this equation is located in\n",
    "`maketable4.dta` (only complete data, indicated by `baseco = 1`, is\n",
    "used for estimation)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b7fa083f",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "# Import and select the data\n",
    "df4 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable4.dta?raw=true')\n",
    "df4 = df4[df4['baseco'] == 1]\n",
    "\n",
    "# Add a constant variable\n",
    "df4['const'] = 1\n",
    "\n",
    "# Fit the first stage regression and print summary\n",
    "results_fs = sm.OLS(df4['avexpr'],\n",
    "                    df4[['const', 'logem4']],\n",
    "                    missing='drop').fit()\n",
    "print(results_fs.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5fbf8e27",
   "metadata": {},
   "source": [
    "**Second stage**\n",
    "\n",
    "We need to retrieve the predicted values of $ {avexpr}_i $ using\n",
    "`.predict()`.\n",
    "\n",
    "We then replace the endogenous variable $ {avexpr}_i $ with the\n",
    "predicted values $ \\widehat{avexpr}_i $ in the original linear model.\n",
    "\n",
    "Our second stage regression is thus\n",
    "\n",
    "$$\n",
    "{logpgp95}_i = \\beta_0 + \\beta_1 \\widehat{avexpr}_i + u_i\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a6d9a8d2",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "df4['predicted_avexpr'] = results_fs.predict()\n",
    "\n",
    "results_ss = sm.OLS(df4['logpgp95'],\n",
    "                    df4[['const', 'predicted_avexpr']]).fit()\n",
    "print(results_ss.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0473c69d",
   "metadata": {},
   "source": [
    "The second-stage regression results give us an unbiased and consistent\n",
    "estimate of the effect of institutions on economic outcomes.\n",
    "\n",
    "The result suggests a stronger positive relationship than what the OLS\n",
    "results indicated.\n",
    "\n",
    "Note that while our parameter estimates are correct, our standard errors\n",
    "are not and for this reason, computing 2SLS ‘manually’ (in stages with\n",
    "OLS) is not recommended.\n",
    "\n",
    "We can correctly estimate a 2SLS regression in one step using the\n",
    "[linearmodels](https://github.com/bashtage/linearmodels) package, an extension of `statsmodels`\n",
    "\n",
    "Note that when using `IV2SLS`, the exogenous and instrument variables\n",
    "are split up in the function arguments (whereas before the instrument\n",
    "included exogenous variables)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9d1f8581",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "iv = IV2SLS(dependent=df4['logpgp95'],\n",
    "            exog=df4['const'],\n",
    "            endog=df4['avexpr'],\n",
    "            instruments=df4['logem4']).fit(cov_type='unadjusted')\n",
    "\n",
    "print(iv.summary)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c8534ad8",
   "metadata": {},
   "source": [
    "Given that we now have consistent and unbiased estimates, we can infer\n",
    "from the model we have estimated that institutional differences\n",
    "(stemming from institutions set up during colonization) can help\n",
    "to explain differences in income levels across countries today.\n",
    "\n",
    "[[Acemoglu *et al.*, 2001](https://python.quantecon.org/zreferences.html#id98)] use a marginal effect of 0.94 to calculate that the\n",
    "difference in the index between Chile and Nigeria (ie. institutional\n",
    "quality) implies up to a 7-fold difference in income, emphasizing the\n",
    "significance of institutions in economic development."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6ddc0d79",
   "metadata": {},
   "source": [
    "## Summary\n",
    "\n",
    "We have demonstrated basic OLS and 2SLS regression in `statsmodels` and `linearmodels`.\n",
    "\n",
    "If you are familiar with R, you may want to use the [formula interface](https://www.statsmodels.org/dev/example_formulas.html) to `statsmodels`, or consider using [r2py](https://rpy2.github.io/) to call R from within Python."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c83d2a2d",
   "metadata": {},
   "source": [
    "## Exercises"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f8e2c4a7",
   "metadata": {},
   "source": [
    "## Exercise 76.1\n",
    "\n",
    "In the lecture, we think the original model suffers from endogeneity\n",
    "bias due to the likely effect income has on institutional development.\n",
    "\n",
    "Although endogeneity is often best identified by thinking about the data\n",
    "and model, we can formally test for endogeneity using the **Hausman\n",
    "test**.\n",
    "\n",
    "We want to test for correlation between the endogenous variable,\n",
    "$ avexpr_i $, and the errors, $ u_i $\n",
    "\n",
    "$$\n",
    "\\begin{aligned}\n",
    " H_0 : Cov(avexpr_i, u_i) = 0  \\quad (no\\ endogeneity) \\\\\n",
    " H_1 : Cov(avexpr_i, u_i) \\neq 0 \\quad (endogeneity)\n",
    " \\end{aligned}\n",
    "$$\n",
    "\n",
    "This test is running in two stages.\n",
    "\n",
    "First, we regress $ avexpr_i $ on the instrument, $ logem4_i $\n",
    "\n",
    "$$\n",
    "avexpr_i = \\pi_0 + \\pi_1 logem4_i + \\upsilon_i\n",
    "$$\n",
    "\n",
    "Second, we retrieve the residuals $ \\hat{\\upsilon}_i $ and include\n",
    "them in the original equation\n",
    "\n",
    "$$\n",
    "logpgp95_i = \\beta_0 + \\beta_1 avexpr_i + \\alpha \\hat{\\upsilon}_i + u_i\n",
    "$$\n",
    "\n",
    "If $ \\alpha $ is statistically significant (with a p-value < 0.05),\n",
    "then we reject the null hypothesis and conclude that $ avexpr_i $ is\n",
    "endogenous.\n",
    "\n",
    "Using the above information, estimate a Hausman test and interpret your\n",
    "results."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "51515a0b",
   "metadata": {},
   "source": [
    "## Solution to[ Exercise 76.1](https://python.quantecon.org/#ols_ex1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd179681",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "# Load in data\n",
    "df4 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable4.dta?raw=true')\n",
    "\n",
    "# Add a constant term\n",
    "df4['const'] = 1\n",
    "\n",
    "# Estimate the first stage regression\n",
    "reg1 = sm.OLS(endog=df4['avexpr'],\n",
    "              exog=df4[['const', 'logem4']],\n",
    "              missing='drop').fit()\n",
    "\n",
    "# Retrieve the residuals\n",
    "df4['resid'] = reg1.resid\n",
    "\n",
    "# Estimate the second stage residuals\n",
    "reg2 = sm.OLS(endog=df4['logpgp95'],\n",
    "              exog=df4[['const', 'avexpr', 'resid']],\n",
    "              missing='drop').fit()\n",
    "\n",
    "print(reg2.summary())"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5c745b1a",
   "metadata": {},
   "source": [
    "The output shows that the coefficient on the residuals is statistically\n",
    "significant, indicating $ avexpr_i $ is endogenous."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4a298cd8",
   "metadata": {},
   "source": [
    "## Exercise 76.2\n",
    "\n",
    "The OLS parameter $ \\beta $ can also be estimated using matrix\n",
    "algebra and `numpy` (you may need to review the\n",
    "[numpy](https://python-programming.quantecon.org/numpy.html) lecture to\n",
    "complete this exercise).\n",
    "\n",
    "The linear equation we want to estimate is (written in matrix form)\n",
    "\n",
    "$$\n",
    "y = X\\beta + u\n",
    "$$\n",
    "\n",
    "To solve for the unknown parameter $ \\beta $, we want to minimize\n",
    "the sum of squared residuals\n",
    "\n",
    "$$\n",
    "\\underset{\\hat{\\beta}}{\\min} \\hat{u}'\\hat{u}\n",
    "$$\n",
    "\n",
    "Rearranging the first equation and substituting into the second\n",
    "equation, we can write\n",
    "\n",
    "$$\n",
    "\\underset{\\hat{\\beta}}{\\min} \\ (Y - X\\hat{\\beta})' (Y - X\\hat{\\beta})\n",
    "$$\n",
    "\n",
    "Solving this optimization problem gives the solution for the\n",
    "$ \\hat{\\beta} $ coefficients\n",
    "\n",
    "$$\n",
    "\\hat{\\beta} = (X'X)^{-1}X'y\n",
    "$$\n",
    "\n",
    "Using the above information, compute $ \\hat{\\beta} $ from model 1\n",
    "using `numpy` - your results should be the same as those in the\n",
    "`statsmodels` output from earlier in the lecture."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dfdff478",
   "metadata": {},
   "source": [
    "## Solution to[ Exercise 76.2](https://python.quantecon.org/#ols_ex2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "bd91a14b",
   "metadata": {
    "hide-output": false
   },
   "outputs": [],
   "source": [
    "# Load in data\n",
    "df1 = pd.read_stata('https://github.com/QuantEcon/lecture-python/blob/master/source/_static/lecture_specific/ols/maketable1.dta?raw=true')\n",
    "df1 = df1.dropna(subset=['logpgp95', 'avexpr'])\n",
    "\n",
    "# Add a constant term\n",
    "df1['const'] = 1\n",
    "\n",
    "# Define the X and y variables\n",
    "y = np.asarray(df1['logpgp95'])\n",
    "X = np.asarray(df1[['const', 'avexpr']])\n",
    "\n",
    "# Compute β_hat\n",
    "β_hat = np.linalg.solve(X.T @ X, X.T @ y)\n",
    "\n",
    "# Print out the results from the 2 x 1 vector β_hat\n",
    "print(f'β_0 = {β_hat[0]:.2}')\n",
    "print(f'β_1 = {β_hat[1]:.2}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "08b42a7f",
   "metadata": {},
   "source": [
    "It is also possible to use `np.linalg.inv(X.T @ X) @ X.T @ y` to solve\n",
    "for $ \\beta $, however `.solve()` is preferred as it involves fewer\n",
    "computations."
   ]
  }
 ],
 "metadata": {
  "date": 1752455389.5421355,
  "filename": "ols.md",
  "kernelspec": {
   "display_name": "Python",
   "language": "python3",
   "name": "python3"
  },
  "title": "Linear Regression in Python"
 },
 "nbformat": 4,
 "nbformat_minor": 5
}