Lecture 20: Two-Stage Least Squares

Economics 326 — Introduction to Econometrics II

Vadim Marmer, UBC

Beyond the simple IV model

  • Previously: simple IV model with one endogenous regressor, one instrument, no controls.

  • The simple IV formula does not extend to models with:

    1. Exogenous control variables.

    2. Multiple instruments (l > 1).

  • Two-stage least squares (2SLS) handles both extensions.

  • Notation: y = dependent variable, Y = endogenous regressor, X’s = exogenous controls, Z’s = instruments.

IV model with exogenous controls

  • The model: y_i = \gamma_0 + {\color{red}\beta_1 Y_i} + {\color{blue}\gamma_1 X_{1,i} + \ldots + \gamma_k X_{k,i}} + U_i, where

    • \gamma_0 is the intercept: \mathrm{E}\left[U_i\right] = 0.

    • {\color{red}Y_i} is the endogenous regressor: \mathrm{Cov}\left(Y_i, U_i\right) \neq 0.

    • {\color{blue}X_{1,i}, \ldots, X_{k,i}} are k exogenous regressors: \mathrm{Cov}\left(X_{1,i}, U_i\right) = \ldots = \mathrm{Cov}\left(X_{k,i}, U_i\right) = 0.

  • l instruments {\color{blue}Z_{1,i}, \ldots, Z_{l,i}}, excluded from the regression.

Identification with instruments

  • There are {\color{red}k + 2} unknown coefficients: y_i = {\color{red}\gamma_0} + {\color{red}\beta_1} Y_i + {\color{red}\gamma_1} X_{1,i} + \ldots + {\color{red}\gamma_k} X_{k,i} + U_i.

  • Exogeneity gives only {\color{blue}k + 1} equations: \mathrm{E}\left[U_i\right] = 0, \quad \mathrm{Cov}\left(X_{j,i}, U_i\right) = 0, \quad j = 1, \ldots, k.

  • l instruments provide l additional moment conditions: \mathrm{Cov}\left(Z_{j,i}, U_i\right) = 0, \quad j = 1, \ldots, l, for a total of {\color{blue}k + 1 + l} equations.

  • Order condition: l \geq 1.

    • l = 1: exactly identified.
    • l > 1: overidentified.
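  • Concretely, with one control and one instrument (k = 1, l = 1), the k + 1 + l = 3 equations pinning down the three unknowns \gamma_0, \beta_1, \gamma_1 are:

```latex
\mathrm{E}\left[U_i\right] = 0, \qquad
\mathrm{Cov}\left(X_{1,i}, U_i\right) = 0, \qquad
\mathrm{Cov}\left(Z_{1,i}, U_i\right) = 0,
\quad \text{where } U_i = y_i - \gamma_0 - \beta_1 Y_i - \gamma_1 X_{1,i}.
```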

The first-stage equation

  • Consider a system of two equations. The original regression becomes the second stage; a new first-stage equation describes how Y_i depends on Z’s and X’s: \begin{aligned} \text{(first stage)} \quad Y_i &= \pi_0 + \pi_1 Z_{1,i} + \ldots + \pi_l Z_{l,i} \\ &\quad + \pi_{l+1} X_{1,i} + \ldots + \pi_{l+k} X_{k,i} + V_i \\[4pt] \text{(second stage)} \quad y_i &= \gamma_0 + \beta_1 Y_i + \gamma_1 X_{1,i} + \ldots + \gamma_k X_{k,i} + U_i \end{aligned}

  • All RHS variables in the first-stage equation are exogenous → the \pi’s can be estimated consistently by OLS.

  • Y_i is endogenous because \mathrm{Cov}\left(U_i, V_i\right) \neq 0: since the Z’s and X’s are exogenous, \mathrm{Cov}\left(Y_i, U_i\right) = \mathrm{Cov}\left(V_i, U_i\right) \neq 0.

  • IV relevance condition: at least one \pi_j \neq 0 for j = 1, \ldots, l.

  • Y_i contains both exogenous variation (driven by Z’s and X’s) and endogenous variation (correlated with U_i).

  • OLS uses all variation in Y_i → inconsistent.

  • Idea: estimate \beta_1 using only the exogenous variation in Y_i.

  • The first stage extracts this variation: \hat{Y}_i captures the part of Y_i explained by Z’s and X’s.

  • X’s appear in the first stage because they affect Y_i directly and can be correlated with Z’s.

  • \hat{Y}_i depends only on exogenous variables → uncorrelated with U_i.

2SLS: two stages

  • Stage 1: Regress Y_i on Z_{1,i}, \ldots, Z_{l,i}, X_{1,i}, \ldots, X_{k,i} by OLS. Obtain fitted values: \begin{aligned} \hat{Y}_i &= \hat{\pi}_0 + \hat{\pi}_1 Z_{1,i} + \ldots + \hat{\pi}_l Z_{l,i} \\ &\quad + \hat{\pi}_{l+1} X_{1,i} + \ldots + \hat{\pi}_{l+k} X_{k,i}. \end{aligned}

  • Stage 2: Regress y_i on \hat{Y}_i and X_{1,i}, \ldots, X_{k,i} by OLS: \begin{aligned} y_i &= \hat{\gamma}_0^{2SLS} + \hat{\beta}_1^{2SLS} \hat{Y}_i \\ &\quad + \hat{\gamma}_1^{2SLS} X_{1,i} + \ldots + \hat{\gamma}_k^{2SLS} X_{k,i} + \hat{U}_i. \end{aligned}

  • The 2SLS estimators are consistent and asymptotically normal.

  • Standard errors from naïve second-stage OLS are incorrect; statistical packages report corrected standard errors.
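  • The two stages can be sketched on simulated data. This is a minimal illustration in Python rather than the course's R; all variable names and parameter values are invented: one endogenous regressor Y, one exogenous control X, two instruments Z1, Z2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
X = rng.normal(size=n)                            # exogenous control
Z1, Z2 = rng.normal(size=n), rng.normal(size=n)   # instruments
U = rng.normal(size=n)                            # structural error
V = 0.8 * U + rng.normal(size=n)                  # Cov(U, V) != 0 -> Y endogenous
Y = 1.0 + 0.5 * Z1 + 0.5 * Z2 + 0.3 * X + V      # first-stage equation
y = 2.0 + 1.5 * Y + 0.7 * X + U                   # second stage, true beta_1 = 1.5

def ols(W, t):
    # OLS coefficients of t on the columns of W
    return np.linalg.lstsq(W, t, rcond=None)[0]

ones = np.ones(n)
# Stage 1: regress Y on a constant, the Z's, and the X's; keep fitted values.
W1 = np.column_stack([ones, Z1, Z2, X])
Y_hat = W1 @ ols(W1, Y)
# Stage 2: regress y on a constant, Y_hat, and the X's.
coef_2sls = ols(np.column_stack([ones, Y_hat, X]), y)
coef_ols = ols(np.column_stack([ones, Y, X]), y)  # inconsistent benchmark
print(coef_2sls[1], coef_ols[1])  # 2SLS near 1.5; OLS biased upward
```

In practice one uses a packaged routine (such as ivreg() in the R example later in this lecture), which also reports the corrected standard errors.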

2SLS with a single instrument

  • Special case: l = 1 (one instrument Z_i), k = 0 (no exogenous controls).

  • Stage 1: regress Y_i on Z_i → \hat{Y}_i = \hat{\alpha}_0 + \hat{\alpha}_1 Z_i.

  • Stage 2: regress y_i on \hat{Y}_i.

  • The 2SLS estimator coincides with the simple IV estimator from the previous lecture.

  • With exogenous controls or multiple instruments, the simple IV formula no longer works; 2SLS is the standard approach.
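  • A quick numerical check of this equivalence (an illustrative Python sketch with made-up data, not part of the lecture): with l = 1 and k = 0, the two-stage estimate reproduces the simple IV formula \mathrm{Cov}(Z, y)/\mathrm{Cov}(Z, Y).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1_000
Z = rng.normal(size=n)                          # single instrument
U = rng.normal(size=n)
Y = 0.6 * Z + 0.5 * U + rng.normal(size=n)      # endogenous regressor
y = 1.0 + 2.0 * Y + U

ones = np.ones(n)
# Stage 1: fitted values from regressing Y on a constant and Z.
alpha = np.linalg.lstsq(np.column_stack([ones, Z]), Y, rcond=None)[0]
Y_hat = alpha[0] + alpha[1] * Z
# Stage 2: regress y on a constant and Y_hat.
beta_2sls = np.linalg.lstsq(np.column_stack([ones, Y_hat]), y, rcond=None)[0][1]
# Simple IV formula from the previous lecture.
beta_iv = np.cov(Z, y)[0, 1] / np.cov(Z, Y)[0, 1]
print(beta_2sls, beta_iv)  # identical up to floating-point error
```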

Example: returns to education (setup)

  • Estimate the returns to education using the MROZ dataset (Wooldridge, 2006): \begin{aligned} \ln \text{Wage}_i &= \gamma_0 + {\color{red}\beta_1 \text{Educ}_i} \\ &\quad + \gamma_1 \text{Exper}_i + \gamma_2 \text{Exper}_i^2 + U_i. \end{aligned}

    • Endogenous: {\color{red}\text{Educ}} (education).
    • Instruments: \text{MotherEduc} and \text{FatherEduc} (parents’ education).
    • Exogenous: \text{Exper} and \text{Exper}^2 (experience).
  • The model is overidentified (l = 2 > 1).

  • In R, use ivreg() from the AER package with vcovHC(..., type = "HC1") for heteroskedasticity-robust standard errors.

Example: first stage

First-stage regression (\text{Educ} on instruments and exogenous regressors):

first_stage <- lm(educ ~ exper + expersq + motheduc + fatheduc, data = d)
coeftest(first_stage, vcov = vcovHC(first_stage, type = "HC1"))

t test of coefficients:

              Estimate Std. Error t value              Pr(>|t|)    
(Intercept)  9.1026401  0.4241444 21.4612 < 0.00000000000000022 ***
exper        0.0452254  0.0419107  1.0791                0.2812    
expersq     -0.0010091  0.0013233 -0.7626                0.4461    
motheduc     0.1575970  0.0354502  4.4456         0.00001121343 ***
fatheduc     0.1895484  0.0324419  5.8427         0.00000001026 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both instruments (motheduc, fatheduc) are statistically significant. First-stage R^2 = 0.2115.

Example: 2SLS vs OLS

  • 2SLS second stage:

    iv_fit <- ivreg(lwage ~ educ + exper + expersq |
                      exper + expersq + motheduc + fatheduc, data = d)
    coeftest(iv_fit, vcov = vcovHC(iv_fit, type = "HC1"))
    
    t test of coefficients:
    
                   Estimate  Std. Error t value Pr(>|t|)   
    (Intercept)  0.04810031  0.42979771  0.1119 0.910945   
    educ         0.06139663  0.03333859  1.8416 0.066231 . 
    exper        0.04417039  0.01554638  2.8412 0.004711 **
    expersq     -0.00089897  0.00043008 -2.0902 0.037193 * 
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • OLS for comparison:

    ols_fit <- lm(lwage ~ educ + exper + expersq, data = d)
    coeftest(ols_fit, vcov = vcovHC(ols_fit, type = "HC1"))
    
    t test of coefficients:
    
                   Estimate  Std. Error t value            Pr(>|t|)    
    (Intercept) -0.52204056  0.20165046 -2.5888            0.009961 ** 
    educ         0.10748964  0.01321897  8.1315 0.00000000000000472 ***
    exper        0.04156651  0.01527304  2.7216            0.006765 ** 
    expersq     -0.00081119  0.00042007 -1.9311            0.054139 .  
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The 2SLS estimate (\hat{\beta}_1^{2SLS} = 0.061) is smaller than OLS (\hat{\beta}_1^{OLS} = 0.107), consistent with upward ability bias.

  • The 2SLS standard error (0.033) is larger than OLS (0.013): 2SLS uses only the exogenous variation in \text{Educ}, so estimates are noisier.
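  • Both patterns can be reproduced in a small Monte Carlo (a Python sketch with invented parameters, not the MROZ data): across simulated samples, OLS estimates center above the true coefficient, while 2SLS estimates center near it but with a larger spread.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 2000
true_beta = 1.0
b_ols, b_2sls = [], []
for _ in range(reps):
    Z = rng.normal(size=n)                      # instrument
    U = rng.normal(size=n)                      # structural error
    Y = 0.6 * Z + 0.6 * U + rng.normal(size=n)  # regressor loads on U -> endogenous
    y = 1.0 + true_beta * Y + U
    ones = np.ones(n)
    WZ = np.column_stack([ones, Z])
    Y_hat = WZ @ np.linalg.lstsq(WZ, Y, rcond=None)[0]  # first stage
    b_ols.append(np.linalg.lstsq(np.column_stack([ones, Y]), y, rcond=None)[0][1])
    b_2sls.append(np.linalg.lstsq(np.column_stack([ones, Y_hat]), y, rcond=None)[0][1])
print(np.mean(b_ols), np.mean(b_2sls))  # OLS centered above 1.0, 2SLS near 1.0
print(np.std(b_ols), np.std(b_2sls))    # 2SLS spread is larger
```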

Multiple endogenous regressors

  • So far: one endogenous variable with exogenous controls and multiple instruments.

  • In practice, models may have several endogenous regressors.

  • Example: \begin{aligned} \ln \text{Wage}_i &= \gamma_0 + {\color{red}\beta_1 \text{Educ}_i + \beta_2 \text{Children}_i} \\ &\quad + {\color{blue}\gamma_1 \text{Age}_i + \gamma_2 \text{Sex}_i} + U_i. \end{aligned}

    • {\color{red}\text{Endogenous}} regressors: education and children (family size).

    • {\color{blue}\text{Exogenous}} regressors: age, sex, and a constant.

General model

  • General model with m endogenous regressors: \begin{aligned} y_i &= \gamma_0 + {\color{red}\beta_1 Y_{1,i} + \ldots + \beta_m Y_{m,i}} \\ &\quad + {\color{blue}\gamma_1 X_{1,i} + \ldots + \gamma_k X_{k,i}} + U_i, \end{aligned} where

    • {\color{red}Y_{1,i}, \ldots, Y_{m,i}} are m endogenous regressors: \mathrm{Cov}\left(Y_{1,i}, U_i\right) \neq 0, \ldots, \mathrm{Cov}\left(Y_{m,i}, U_i\right) \neq 0.

    • {\color{blue}X_{1,i}, \ldots, X_{k,i}} are k exogenous regressors: \mathrm{Cov}\left(X_{1,i}, U_i\right) = \ldots = \mathrm{Cov}\left(X_{k,i}, U_i\right) = 0.

Identification and the order condition

  • There are {\color{red}k + 1 + m} unknown coefficients, but exogeneity of X’s gives only {\color{blue}k + 1} equations. We need m more.

  • l additional exogenous IVs {\color{blue}Z_{1,i}, \ldots, Z_{l,i}}, excluded from the second-stage equation, provide l moment conditions: \mathrm{Cov}\left(Z_{j,i}, U_i\right) = 0, \quad j = 1, \ldots, l.

  • Order condition: {\color{red}l \geq m}.

    • l = m: exactly identified.
    • l > m: overidentified.
    • l < m: underidentified (coefficients cannot be estimated).
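  • For instance, in the wage example above m = 2 (\text{Educ}, \text{Children}) and k = 2 (\text{Age}, \text{Sex}): there are k + 1 + m = 5 unknown coefficients but only k + 1 = 3 equations from the exogenous X’s, so at least l = 2 instruments are required; with exactly two, the model is exactly identified.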

First-stage equations

  • The system has m first-stage equations, one per endogenous regressor: \begin{aligned} Y_{1,i} &= \pi_{0,1} + \pi_{1,1} Z_{1,i} + \ldots + \pi_{l,1} Z_{l,i} \\ &\quad + \pi_{l+1,1} X_{1,i} + \ldots + \pi_{l+k,1} X_{k,i} + V_{1,i}, \\ &\;\;\vdots \\ Y_{m,i} &= \pi_{0,m} + \pi_{1,m} Z_{1,i} + \ldots + \pi_{l,m} Z_{l,i} \\ &\quad + \pi_{l+1,m} X_{1,i} + \ldots + \pi_{l+k,m} X_{k,i} + V_{m,i}. \end{aligned}

  • The exogenous regressors X’s appear in each first-stage equation because they can affect the Y’s directly and be correlated with the Z’s.

  • The X’s and Z’s are uncorrelated with all of the errors: U_i and V_{1,i}, \ldots, V_{m,i}.

2SLS: the first stage

  • Estimate each first-stage equation by OLS. The fitted values are: \begin{aligned} \hat{Y}_{1,i} &= \hat{\pi}_{0,1} + \hat{\pi}_{1,1} Z_{1,i} + \ldots + \hat{\pi}_{l,1} Z_{l,i} \\ &\quad + \hat{\pi}_{l+1,1} X_{1,i} + \ldots + \hat{\pi}_{l+k,1} X_{k,i}, \\ &\;\;\vdots \\ \hat{Y}_{m,i} &= \hat{\pi}_{0,m} + \hat{\pi}_{1,m} Z_{1,i} + \ldots + \hat{\pi}_{l,m} Z_{l,i} \\ &\quad + \hat{\pi}_{l+1,m} X_{1,i} + \ldots + \hat{\pi}_{l+k,m} X_{k,i}. \end{aligned}

  • The \hat{Y}’s are functions of Z’s and X’s (all exogenous), so they are asymptotically uncorrelated with the errors.

2SLS: the second stage

  • In the second stage, regress (OLS) y on a constant, \hat{Y}’s, and X’s: \begin{aligned} y_i &= \hat{\gamma}_0^{2SLS} + \hat{\beta}_1^{2SLS} \hat{Y}_{1,i} + \ldots + \hat{\beta}_m^{2SLS} \hat{Y}_{m,i} \\ &\quad + \hat{\gamma}_1^{2SLS} X_{1,i} + \ldots + \hat{\gamma}_k^{2SLS} X_{k,i} + \hat{U}_i. \end{aligned}

  • The 2SLS estimators \hat{\beta}_1^{2SLS}, \ldots, \hat{\beta}_m^{2SLS}, \hat{\gamma}_0^{2SLS}, \ldots, \hat{\gamma}_k^{2SLS} are consistent and asymptotically normal.

  • Standard errors from naïve second-stage OLS are incorrect: they do not account for the estimation error in \hat{\pi}’s from the first stage.

  • Statistical packages report the corrected standard errors.
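  • For reference, a compact matrix-notation summary (notation not used in the slides): stack the exogenous variables W_i = \left(1, Z_{1,i}, \ldots, Z_{l,i}, X_{1,i}, \ldots, X_{k,i}\right)' into W and the second-stage regressors R_i = \left(1, Y_{1,i}, \ldots, Y_{m,i}, X_{1,i}, \ldots, X_{k,i}\right)' into R. With \hat{R} = W \left(W'W\right)^{-1} W' R collecting the first-stage fitted values, both stages collapse to a single expression:

```latex
\hat{\theta}^{2SLS} = \left( \hat{R}' \hat{R} \right)^{-1} \hat{R}' y,
\qquad
\hat{\theta}^{2SLS} = \left( \hat{\gamma}_0^{2SLS},\,
  \hat{\beta}_1^{2SLS}, \ldots, \hat{\beta}_m^{2SLS},\,
  \hat{\gamma}_1^{2SLS}, \ldots, \hat{\gamma}_k^{2SLS} \right)'.
```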