Lecture 19: Instrumental Variables
Economics 326 — Introduction to Econometrics II
Endogeneity
In the linear regression model, Y_{i}=\beta _{0}+\beta _{1}X_{i}+U_{i}, the condition for consistent estimation of \beta _{1} by OLS is that X is exogenous: \mathrm{Cov}\left(X_{i},U_{i}\right) =0.
When \mathrm{Cov}\left(X_{i},U_{i}\right) \neq 0, we say that the regressor X is endogenous.
When the regressor is endogenous, the OLS estimator is inconsistent: \begin{align*} \hat{\beta}_{1,n}-\beta _{1} &= \frac{\frac{1}{n}\sum_{i=1}^{n}\left( X_{i}-\bar{X}_{n}\right) U_{i}}{\frac{1}{n}\sum_{i=1}^{n}\left( X_{i}-\bar{X}_{n}\right) ^{2}} \\ &\rightarrow _{p}\frac{{\color{red}\mathrm{Cov}\left(X_{i},U_{i}\right)}}{\mathrm{Var}\left(X_{i}\right)} \neq 0. \end{align*}
Consequences of endogeneity
The causal effect of X on Y is not estimated consistently: \hat{\beta}_{1,n}\rightarrow _{p}\beta _{1}+\frac{\mathrm{Cov}\left(X_{i},U_{i}\right)}{\mathrm{Var}\left(X_{i}\right)}. The effect can be over- or underestimated depending on the sign of \mathrm{Cov}\left(X_{i},U_{i}\right).
Tests and confidence intervals based on the OLS estimator are invalid.
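The probability limit above can be checked numerically. The following is a minimal simulation sketch, not part of the original lecture: the coefficient values, the seed, and the construction X = W + 0.5U (which gives Cov(X, U) = 0.5 and Var(X) = 1.25) are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)      # arbitrary seed
n = 200_000
beta0, beta1 = 1.0, 2.0             # hypothetical true coefficients

# Endogenous regressor (assumed design): X = W + 0.5 * U,
# so Cov(X, U) = 0.5 and Var(X) = 1 + 0.25 = 1.25.
u = rng.normal(size=n)
w = rng.normal(size=n)
x = w + 0.5 * u
y = beta0 + beta1 * x + u

# OLS slope = sample Cov(X, Y) / sample Var(X).
ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Limit predicted by the formula: beta1 + Cov(X, U) / Var(X).
plim = beta1 + 0.5 / 1.25

print(f"OLS: {ols:.3f}  predicted limit: {plim:.3f}  true beta1: {beta1}")
```

With a sample this large, the OLS estimate should sit close to the predicted limit rather than to the true slope, and increasing n does not remove the gap.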
Sources of endogeneity
Several possible sources of endogeneity:
Omitted explanatory variables.
Simultaneity.
Errors in variables.
All result in regressors correlated with the errors.
Omitted explanatory variables
Suppose that the true model is \ln \text{Wage}_{i}=\beta _{0}+\beta _{1}\text{Education}_{i}+\beta _{2}\text{Ability}_{i}+V_{i}, where V_{i} is uncorrelated with \text{Education} and \text{Ability}.
Since \text{Ability} is unobservable, the econometrician regresses \ln \text{Wage} against \text{Education}, and \beta _{2}\text{Ability} goes into the error: \begin{aligned} \ln \text{Wage}_{i} &= \beta _{0}+\beta _{1}\text{Education}_{i}+U_{i}, \\ U_{i} &= \beta _{2}\text{Ability}_{i}+V_{i}. \end{aligned}
\text{Education} is correlated with \text{Ability}: we can expect that \mathrm{Cov}\left(\text{Education}_{i},\text{Ability}_{i}\right) >0 and \beta _{2}>0. Since V_{i} is uncorrelated with \text{Education}, \mathrm{Cov}\left(\text{Education}_{i},U_{i}\right) =\beta _{2}\mathrm{Cov}\left(\text{Education}_{i},\text{Ability}_{i}\right) >0. Thus, OLS will overestimate the return to education.
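A small simulation can illustrate the direction and size of this bias. This is a sketch under assumed, made-up parameter values (the coefficients, the way education depends on ability, and the noise scales are hypothetical and are not estimates from any data set). It compares the short regression that omits ability with the infeasible long regression that includes it.

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed
n = 200_000
beta0, beta1, beta2 = 1.0, 0.08, 0.10   # hypothetical true coefficients

# Assumed design: Cov(Education, Ability) = 2 and Var(Education) = 4 + 4 = 8.
ability = rng.normal(size=n)            # unobserved in practice
education = 12 + 2 * ability + rng.normal(scale=2, size=n)
v = rng.normal(scale=0.3, size=n)
log_wage = beta0 + beta1 * education + beta2 * ability + v

# Short regression: ln Wage on Education only (Ability is in the error).
short_slope = np.cov(education, log_wage)[0, 1] / np.var(education, ddof=1)

# Long regression: includes Ability (infeasible with real data).
design = np.column_stack([np.ones(n), education, ability])
long_coefs, *_ = np.linalg.lstsq(design, log_wage, rcond=None)

# Predicted limit of the short regression:
# beta1 + beta2 * Cov(Education, Ability) / Var(Education).
predicted_short = beta1 + beta2 * 2 / 8

print(f"short: {short_slope:.4f}  predicted: {predicted_short:.4f}")
print(f"long (Education coefficient): {long_coefs[1]:.4f}  true: {beta1}")
```

In this design the short regression overstates the return to education, while the long regression recovers it.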
Simultaneity
Consider a demand-supply system: \begin{array}{ll} \text{Demand:} & Q^{d}=\beta _{0}^{d}+\beta _{1}^{d}P+U^{d}, \\ \text{Supply:} & Q^{s}=\beta _{0}^{s}+\beta _{1}^{s}P+U^{s}, \end{array} where Q^{d} = quantity demanded, Q^{s} = quantity supplied, P = price.
The quantity and price are determined simultaneously in the equilibrium: Q^{d}=Q^{s}=Q.
Q^{d} and Q^{s} are not observed separately; we observe only the equilibrium values Q.
Solving for P: \begin{aligned} Q^{d} &= Q^{s} \\ \beta _{0}^{d}+\beta _{1}^{d}P+U^{d} &= \beta _{0}^{s}+\beta _{1}^{s}P+U^{s}, \end{aligned} so P=-\frac{\beta _{0}^{d}-\beta _{0}^{s}}{\beta _{1}^{d}-\beta _{1}^{s}}-\frac{U^{d}-U^{s}}{\beta _{1}^{d}-\beta _{1}^{s}}.
Since the equilibrium price P depends on both U^{d} and U^{s}, in general \mathrm{Cov}\left(P,U^{d}\right) \neq 0 \text{ and } \mathrm{Cov}\left(P,U^{s}\right) \neq 0. Neither the demand nor the supply equation can be estimated consistently by OLS.
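To see what OLS picks up in such a system, one can simulate equilibrium data from the solved expression for P. The sketch below assumes hypothetical coefficients and independent standard normal shocks U^d and U^s. With these particular choices the OLS slope of Q on P is close to zero, which is neither the demand slope (-1) nor the supply slope (+1).

```python
import numpy as np

rng = np.random.default_rng(2)   # arbitrary seed
n = 200_000
b0d, b1d = 10.0, -1.0            # hypothetical demand intercept and slope
b0s, b1s = 2.0, 1.0              # hypothetical supply intercept and slope

ud = rng.normal(size=n)          # demand shock
us = rng.normal(size=n)          # supply shock (independent of ud by assumption)

# Equilibrium price from setting Q^d = Q^s and solving for P.
p = -(b0d - b0s) / (b1d - b1s) - (ud - us) / (b1d - b1s)
# Equilibrium quantity (the demand and supply equations give the same value).
q = b0d + b1d * p + ud

cov_p_ud = np.cov(p, ud)[0, 1]   # about +0.5 in this design
cov_p_us = np.cov(p, us)[0, 1]   # about -0.5 in this design
ols = np.cov(p, q)[0, 1] / np.var(p, ddof=1)

print(f"Cov(P, U^d): {cov_p_ud:.3f}  Cov(P, U^s): {cov_p_us:.3f}")
print(f"OLS slope of Q on P: {ols:.3f}")
```

With independent shocks, the OLS limit is a variance-weighted average of the two slopes, so its value depends on the relative size of the demand and supply shocks.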
Simultaneity
Consider a labour supply model for married women: \text{Hours}_{i}=\beta _{0}+\beta _{1}\text{Children}_{i}+\text{Other Factors}+U_{i}, where \text{Hours} = hours of work, \text{Children} = number of children.
It is reasonable to assume that women decide simultaneously how much time to devote to career and family.
Thus, while we may be mainly interested in the effect of family size on labour supply, there is another equation: \text{Children}_{i}=\gamma _{0}+\gamma _{1}\text{Hours}_{i}+\text{Other Factors}+V_{i}, and \text{Children} and \text{Hours} are determined simultaneously in an equilibrium.
As a result, \mathrm{Cov}\left(\text{Children}_{i},U_{i}\right) \neq 0, and the effect of family size cannot be estimated consistently by OLS.
Errors in variables
Consider a model: Y_{i}=\beta _{0}+\beta _{1}X_{i}^{\ast }+V_{i}, where X_{i}^{\ast } is the “true” regressor.
Suppose that X_{i}^{\ast } is not directly observable. Instead, we observe X_{i} that measures X_{i}^{\ast } with an error \varepsilon_{i}: X_{i}=X_{i}^{\ast }+\varepsilon _{i}.
Since X_{i}^{\ast } is unobservable, the econometrician has to regress Y_{i} against X_{i}.
The model for Y_{i} as a function of X_{i} can be written as \begin{aligned} Y_{i} &= \beta _{0}+\beta _{1}\left( X_{i}-\varepsilon _{i}\right) +V_{i} \\ &= \beta _{0}+\beta _{1}X_{i}+V_{i}-\beta _{1}\varepsilon _{i}, \end{aligned} or \begin{aligned} Y_{i} &= \beta _{0}+\beta _{1}X_{i}+U_{i}, \\ U_{i} &= V_{i}-\beta _{1}\varepsilon _{i}. \end{aligned}
Assume that the true regressor, the regression error, and the measurement error are mutually uncorrelated: \mathrm{Cov}\left(X_{i}^{\ast },V_{i}\right) =\mathrm{Cov}\left(X_{i}^{\ast },\varepsilon_{i}\right) =\mathrm{Cov}\left(\varepsilon _{i},V_{i}\right) =0.
However, \begin{aligned} \mathrm{Cov}\left(X_{i},U_{i}\right) &= \mathrm{Cov}\left(X_{i}^{\ast }+\varepsilon _{i},\; V_{i}-\beta _{1}\varepsilon_{i}\right) \\ &= \mathrm{Cov}\left(X_{i}^{\ast },V_{i}\right) -\beta _{1}\mathrm{Cov}\left(X_{i}^{\ast },\varepsilon _{i}\right) \\ &\quad +\mathrm{Cov}\left(\varepsilon _{i},V_{i}\right) -\beta _{1}{\color{red}\mathrm{Cov}\left(\varepsilon _{i},\varepsilon _{i}\right)} \\ &= -\beta _{1}\mathrm{Var}\left(\varepsilon _{i}\right) \neq 0. \end{aligned}
Thus, whenever \beta _{1}\neq 0 and \mathrm{Var}\left(\varepsilon _{i}\right) >0, X_{i} is endogenous and \beta _{1} cannot be estimated consistently by OLS.
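The covariance derived above can be verified on simulated data. The following sketch assumes hypothetical values: \beta_1 = 2 and unit variances for X^*, \varepsilon and V, so that Cov(X, U) = -\beta_1 Var(\varepsilon) = -2 and Var(X) = 2. The OLS limit \beta_1 + Cov(X, U)/Var(X) is then 1, half the true slope.

```python
import numpy as np

rng = np.random.default_rng(3)   # arbitrary seed
n = 200_000
beta0, beta1 = 1.0, 2.0          # hypothetical true coefficients

x_star = rng.normal(size=n)      # true regressor (unobserved in practice)
eps = rng.normal(size=n)         # measurement error
v = rng.normal(size=n)           # regression error

x = x_star + eps                 # observed, mismeasured regressor
y = beta0 + beta1 * x_star + v
u = v - beta1 * eps              # composite error when regressing Y on X

cov_xu = np.cov(x, u)[0, 1]                      # about -beta1 * Var(eps) = -2
ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)     # about 1
predicted = beta1 + (-beta1 * 1.0) / 2.0         # beta1 + Cov(X, U) / Var(X)

print(f"Cov(X, U): {cov_xu:.3f}")
print(f"OLS: {ols:.3f}  predicted limit: {predicted:.3f}  true beta1: {beta1}")
```

In this design the estimate is pulled toward zero, and the pull grows with the variance of the measurement error relative to that of the true regressor.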
Instrumental variable (IV)
Consider \begin{aligned} Y_{i} &= \beta _{0}+\beta _{1}X_{i}+U_{i}, \\ \mathrm{Cov}\left(X_{i},U_{i}\right) &\neq 0. \end{aligned}
Suppose that the econometrician observes another variable Z_{i}, called the instrumental variable, that satisfies the following conditions:
The IV is exogenous: \mathrm{Cov}\left(Z_{i},U_{i}\right) =0.
The IV is relevant, i.e. it is correlated with the endogenous regressor: \mathrm{Cov}\left(Z_{i},X_{i}\right) \neq 0.
When an IV satisfying those conditions is available, it allows us to estimate the effect of X on Y consistently.
IV regression
Consider the IV estimator of \beta _{1}: {\color{blue}\hat{\beta}_{1,n}^{IV}=\frac{\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) Y_{i}}{\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i}}}.
Substituting Y_{i}=\beta _{0}+\beta _{1}X_{i}+U_{i}: \begin{aligned} \hat{\beta}_{1,n}^{IV} &= \frac{\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) \left( \beta _{0}+\beta _{1}X_{i}+U_{i}\right)}{\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i}} \\ &= \beta _{1}+\frac{\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) U_{i}}{\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i}}. \end{aligned} The \beta _{0} term drops out because \sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) =0.
Consistency of the IV estimator
The IV conditions:
- Exogeneity: {\color{red}\mathrm{Cov}\left(Z_{i},U_{i}\right) =0}.
- Relevance: {\color{blue}\mathrm{Cov}\left(Z_{i},X_{i}\right) \neq 0}.
By the LLN: \begin{aligned} \frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) U_{i} &\rightarrow _{p} {\color{red}\mathrm{Cov}\left(Z_{i},U_{i}\right)}, \\ \frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i} &\rightarrow _{p} {\color{blue}\mathrm{Cov}\left(Z_{i},X_{i}\right)}. \end{aligned}
The IV estimator is consistent: \begin{align*} \hat{\beta}_{1,n}^{IV} &= \beta _{1}+\frac{\frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) U_{i}}{\frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i}} \\ &\rightarrow _{p}\beta _{1}+\frac{{\color{red}0}}{{\color{blue}\mathrm{Cov}\left(Z_{i},X_{i}\right)}} =\beta _{1}. \end{align*}
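The IV formula and its consistency can be illustrated with a short simulation. This is a sketch under an assumed data-generating process (not from the lecture): Z, U and W are independent standard normals and X = Z + 0.5U + W, so that Cov(Z, U) = 0, Cov(Z, X) = 1 and Cov(X, U) = 0.5. The function names `iv_slope` and `ols_slope` are hypothetical helpers.

```python
import numpy as np

def iv_slope(y, x, z):
    """IV estimator: sum((Z - Zbar) * Y) / sum((Z - Zbar) * X)."""
    zc = z - z.mean()
    return np.sum(zc * y) / np.sum(zc * x)

def ols_slope(y, x):
    """OLS slope: sum((X - Xbar) * Y) / sum((X - Xbar)^2)."""
    xc = x - x.mean()
    return np.sum(xc * y) / np.sum(xc ** 2)

rng = np.random.default_rng(4)   # arbitrary seed
beta0, beta1 = 1.0, 2.0          # hypothetical true coefficients

results = {}
for n in (1_000, 200_000):
    z = rng.normal(size=n)       # instrument: exogenous and relevant by design
    u = rng.normal(size=n)
    w = rng.normal(size=n)
    x = z + 0.5 * u + w          # endogenous regressor
    y = beta0 + beta1 * x + u
    results[n] = (ols_slope(y, x), iv_slope(y, x, z))
    print(f"n={n:>7}: OLS={results[n][0]:.3f}  IV={results[n][1]:.3f}")
```

As n grows, the IV estimate should settle near the true slope, while OLS stays near \beta_1 + 0.5/2.25 in this design.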
Natural experiments
Theoretically, the causal effect can be estimated from controlled experiments:
To estimate the return to education, select a random sample of children, randomly assign how many years of education they should have, and measure their income several years after graduation.
To estimate the effect of family size on labour supply, select a random sample of parents, randomly assign how many children they should have, and measure their labour market outcomes.
Such experiments are usually infeasible because of their cost or for ethical reasons.
Natural experiments: use random variation in the variable of interest that arises outside the researcher's control, for example from laws or chance, to estimate the causal effect.
Example: Compulsory schooling laws
Angrist and Krueger (1991, QJE) suggested using school start age policy to estimate \beta _{1} in \ln \text{Wage}_{i}=\beta _{0}+\beta _{1}\text{Education}_{i}+\beta _{2}\text{Ability}_{i}+V_{i}.
We need an IV Z such that \mathrm{Cov}\left(\text{Ability}_{i},Z_{i}\right) =0 and \mathrm{Cov}\left(\text{Education}_{i},Z_{i}\right) \neq 0.
They argue that due to compulsory schooling laws, the season of birth satisfies the IV conditions:
A child must attend school until reaching a certain drop-out age.
Students born in the first quarter reach the legal drop-out age before classmates born later in the year, and can therefore leave school with less education.
The quarter-of-birth dummy is correlated with education.
The quarter of birth is uncorrelated with ability.
Example: Sibling-sex composition
Angrist and Evans (1998, AER) argue that parents’ preferences for a mixed sibling-sex composition can be used to estimate \beta _{1} in \text{Hours}_{i}=\beta _{0}+\beta _{1}\text{Children}_{i}+\ldots +U_{i}.
We need an IV Z such that \mathrm{Cov}\left(U_{i},Z_{i}\right) =0 and \mathrm{Cov}\left(\text{Children}_{i},Z_{i}\right) \neq 0.
Consider a dummy variable equal to one if the sex of the second child matches the sex of the first child.
If parents prefer a mixed sibling-sex composition, they are more likely to have another child if their first two children are of the same sex.
The same-sex dummy is correlated with the number of children.
Since sex mix is randomly determined, the same-sex dummy is exogenous.
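With a dummy instrument such as the same-sex indicator, the IV formula reduces algebraically to a ratio of differences in group means, (\bar{Y}_{1}-\bar{Y}_{0})/(\bar{X}_{1}-\bar{X}_{0}), often called the Wald estimator. The sketch below checks this identity on simulated data. The data-generating process and all numbers are made up for illustration and are not taken from Angrist and Evans (1998).

```python
import numpy as np

rng = np.random.default_rng(5)   # arbitrary seed
n = 200_000
beta0, beta1 = 30.0, -5.0        # hypothetical true coefficients

same_sex = rng.integers(0, 2, size=n)   # dummy instrument Z, random by design
u = rng.normal(size=n)                  # unobserved determinant of hours

# Assumed fertility rule: same-sex families are more likely to have a third
# child, and the decision also depends on u, which makes Children endogenous.
latent = 0.5 * same_sex - 0.7 * u + rng.normal(size=n)
children = 2 + (latent > 0.5).astype(int)
hours = beta0 + beta1 * children + 4.0 * u

# IV estimator from the general formula.
zc = same_sex - same_sex.mean()
iv = np.sum(zc * hours) / np.sum(zc * children)

# Ratio of differences in group means (Wald form).
g1 = same_sex == 1
wald = (hours[g1].mean() - hours[~g1].mean()) / (
    children[g1].mean() - children[~g1].mean()
)

ols = np.cov(children, hours)[0, 1] / np.var(children, ddof=1)
print(f"IV: {iv:.3f}  Wald: {wald:.3f}  OLS: {ols:.3f}  true beta1: {beta1}")
```

The two IV computations agree up to floating-point error, while OLS in this design overstates the negative effect of family size.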
Asymptotic distribution of the IV estimator
Write \begin{align*} \sqrt{n}\left( \hat{\beta}_{1,n}^{IV}-\beta _{1}\right) &= \frac{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) U_{i}}{\frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i}} \\ &\rightarrow _{d}\frac{N\left( 0,\;\mathrm{E}\left[\left( Z_{i}-\mathrm{E}\left[Z_{i}\right]\right)^{2}U_{i}^{2}\right]\right)}{\mathrm{Cov}\left(Z_{i},X_{i}\right)}. \end{align*}
Thus, \begin{aligned} \sqrt{n}\left( \hat{\beta}_{1,n}^{IV}-\beta _{1}\right) &\rightarrow_{d} N\left( 0,V^{IV}\right), \text{ where} \\ V^{IV} &= \frac{\mathrm{E}\left[\left( Z_{i}-\mathrm{E}\left[Z_{i}\right]\right) ^{2}U_{i}^{2}\right]}{\left(\mathrm{Cov}\left(Z_{i},X_{i}\right)\right) ^{2}}. \end{aligned}
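A Monte Carlo experiment gives a rough check of this limiting distribution. The sketch assumes a design in which Z, U and W are independent standard normals and X = Z + 0.5U + W, so that \mathrm{E}[(Z_i-\mathrm{E}[Z_i])^2U_i^2] = 1, \mathrm{Cov}(Z_i,X_i) = 1 and hence V^{IV} = 1. The sample size, the number of replications and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(6)   # arbitrary seed
n, reps = 500, 2_000             # sample size and number of replications
beta0, beta1 = 1.0, 2.0          # hypothetical true coefficients

# Each row is one simulated sample of size n.
z = rng.normal(size=(reps, n))
u = rng.normal(size=(reps, n))
w = rng.normal(size=(reps, n))
x = z + 0.5 * u + w
y = beta0 + beta1 * x + u

# IV estimate in every replication.
zc = z - z.mean(axis=1, keepdims=True)
iv = np.sum(zc * y, axis=1) / np.sum(zc * x, axis=1)

v_iv = 1.0                               # V^IV implied by the assumed design
t = np.sqrt(n) * (iv - beta1)            # should be approximately N(0, V^IV)
share_inside = np.mean(np.abs(t) <= 1.96 * np.sqrt(v_iv))

print(f"variance of sqrt(n)(b_IV - beta1): {t.var():.3f}  (V^IV = {v_iv})")
print(f"share within +/-1.96*sqrt(V^IV): {share_inside:.3f}  (normal: 0.95)")
```

The simulated variance and coverage should be close to, though not exactly equal to, the asymptotic values, since n is finite.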
Variance estimation
Let \hat{\beta}_{0,n}^{IV}=\bar{Y}_{n}-\hat{\beta}_{1,n}^{IV}\cdot \bar{X}_{n}.
Let \hat{U}_{i}=Y_{i}-\hat{\beta}_{0,n}^{IV}-\hat{\beta}_{1,n}^{IV}X_{i}.
Estimate V^{IV} by \hat{V}_{n}^{IV}=\frac{\frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) ^{2}\hat{U}_{i}^{2}}{\left( \frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i}\right) ^{2}}.
In finite samples, we use the approximation: \hat{\beta}_{1,n}^{IV}\overset{a}{\sim }N\left( \beta _{1},\frac{\hat{V}_{n}^{IV}}{n}\right).
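The steps above can be collected into a short routine that returns the IV estimates, the standard error \sqrt{\hat{V}_{n}^{IV}/n}, and an approximate 95% confidence interval. This is a minimal sketch: the function name `iv_fit` and the simulated data are assumptions for illustration, and the design is chosen so that V^{IV} = 1.

```python
import numpy as np

def iv_fit(y, x, z):
    """Return (beta0_hat, beta1_hat, se_beta1) for the simple IV regression."""
    n = len(y)
    zc = z - z.mean()
    s_zx = np.mean(zc * x)                     # sample analogue of Cov(Z, X)
    beta1 = np.mean(zc * y) / s_zx             # IV slope
    beta0 = y.mean() - beta1 * x.mean()        # IV intercept
    resid = y - beta0 - beta1 * x              # IV residuals U_hat
    v_hat = np.mean(zc ** 2 * resid ** 2) / s_zx ** 2   # estimate of V^IV
    return beta0, beta1, np.sqrt(v_hat / n)

# Hypothetical simulated data: Cov(Z, U) = 0, Cov(Z, X) = 1, V^IV = 1.
rng = np.random.default_rng(7)   # arbitrary seed
n = 100_000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = z + 0.5 * u + rng.normal(size=n)
y = 1.0 + 2.0 * x + u

b0, b1, se = iv_fit(y, x, z)
ci = (b1 - 1.96 * se, b1 + 1.96 * se)          # approximate 95% interval
print(f"beta1_IV = {b1:.4f}, se = {se:.4f}, 95% CI = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Note that the residuals are computed with the original regressor X_{i} and the IV coefficients, as in the definition of \hat{U}_{i} above.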