Economics 326 — Introduction to Econometrics II
In the linear regression model, Y_{i}=\beta _{0}+\beta _{1}X_{i}+U_{i}, the condition for consistent estimation of \beta _{1} by OLS is that X is exogenous: \mathrm{Cov}\left(X_{i},U_{i}\right) =0.
When \mathrm{Cov}\left(X_{i},U_{i}\right) \neq 0, we say that the regressor X is endogenous.
When the regressor is endogenous, the OLS estimator is inconsistent: \begin{align*} \hat{\beta}_{1,n}-\beta _{1} &= \frac{\frac{1}{n}\sum_{i=1}^{n}\left( X_{i}-\bar{X}_{n}\right) U_{i}}{\frac{1}{n}\sum_{i=1}^{n}\left( X_{i}-\bar{X}_{n}\right) ^{2}} \\ &\rightarrow _{p}\frac{{\color{red}\mathrm{Cov}\left(X_{i},U_{i}\right)}}{\mathrm{Var}\left(X_{i}\right)} \neq 0. \end{align*}
The causal effect of X on Y is not estimated consistently: \hat{\beta}_{1,n}\rightarrow _{p}\beta _{1}+\frac{\mathrm{Cov}\left(X_{i},U_{i}\right)}{\mathrm{Var}\left(X_{i}\right)}. The effect can be over- or underestimated depending on the sign of \mathrm{Cov}\left(X_{i},U_{i}\right).
Tests and confidence intervals based on the OLS estimator are therefore invalid.
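The probability limit above can be checked numerically. The following is a minimal simulation sketch, not part of the original notes: the data-generating process (X_i = cU_i + E_i with independent standard normal U_i and E_i), the parameter values, and the helper names `cov`, `ols_slope` and `simulate` are all illustrative assumptions. Under this design \mathrm{Cov}\left(X_{i},U_{i}\right) = c and \mathrm{Var}\left(X_{i}\right) = c^{2}+1.

```python
import random


def mean(v):
    return sum(v) / len(v)


def cov(a, b):
    """Sample covariance (divides by n)."""
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)


def ols_slope(x, y):
    """OLS slope in a regression of y on x with an intercept."""
    return cov(x, y) / cov(x, x)


def simulate(n, beta0=1.0, beta1=2.0, c=0.5, seed=0):
    """Hypothetical design: X = c*U + E, so Cov(X, U) = c and Var(X) = c^2 + 1."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)
        xi = c * u + rng.gauss(0, 1)
        x.append(xi)
        y.append(beta0 + beta1 * xi + u)
    return x, y


x, y = simulate(50_000)
slope = ols_slope(x, y)
# Limit predicted by the formula: beta1 + Cov(X, U) / Var(X).
plim = 2.0 + 0.5 / (0.5 ** 2 + 1.0)
print(f"OLS slope: {slope:.3f}, predicted limit: {plim:.3f}, true beta1: 2.0")
```

With a large sample the OLS slope should settle near the predicted limit rather than near the true \beta _{1}, and increasing n does not remove the gap.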
Several possible sources of endogeneity:
Omitted explanatory variables.
Simultaneity.
Errors in variables.
All result in regressors correlated with the errors.
Suppose that the true model is \ln \text{Wage}_{i}=\beta _{0}+\beta _{1}\text{Education}_{i}+\beta _{2}\text{Ability}_{i}+V_{i}, where V_{i} is uncorrelated with \text{Education} and \text{Ability}.
Since \text{Ability} is unobservable, the econometrician regresses \ln \text{Wage} against \text{Education}, and \beta _{2}\text{Ability} goes into the error: \begin{aligned} \ln \text{Wage}_{i} &= \beta _{0}+\beta _{1}\text{Education}_{i}+U_{i}, \\ U_{i} &= \beta _{2}\text{Ability}_{i}+V_{i}. \end{aligned}
\text{Education} is correlated with \text{Ability}: we can expect that \mathrm{Cov}\left(\text{Education}_{i},\text{Ability}_{i}\right) >0, \beta _{2}>0, and therefore \mathrm{Cov}\left(\text{Education}_{i},U_{i}\right) >0. Thus, OLS will overestimate the return to education.
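The sign argument can be written out step by step. The derivation below is a sketch that uses only the assumptions already stated (V_{i} uncorrelated with \text{Education}_{i}) together with the general probability limit of the OLS estimator.

```latex
\begin{aligned}
\mathrm{Cov}\left(\text{Education}_{i},U_{i}\right)
  &= \mathrm{Cov}\left(\text{Education}_{i},\beta_{2}\text{Ability}_{i}+V_{i}\right) \\
  &= \beta_{2}\mathrm{Cov}\left(\text{Education}_{i},\text{Ability}_{i}\right)
     +\mathrm{Cov}\left(\text{Education}_{i},V_{i}\right) \\
  &= \beta_{2}\mathrm{Cov}\left(\text{Education}_{i},\text{Ability}_{i}\right) > 0, \\
\hat{\beta}_{1,n}
  &\rightarrow_{p} \beta_{1}
     +\beta_{2}\frac{\mathrm{Cov}\left(\text{Education}_{i},\text{Ability}_{i}\right)}
                    {\mathrm{Var}\left(\text{Education}_{i}\right)} > \beta_{1}.
\end{aligned}
```

The bias is larger when ability matters more for wages (larger \beta _{2}) or when education and ability are more strongly related.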
Consider a demand-supply system: \begin{array}{ll} \text{Demand:} & Q^{d}=\beta _{0}^{d}+\beta _{1}^{d}P+U^{d}, \\ \text{Supply:} & Q^{s}=\beta _{0}^{s}+\beta _{1}^{s}P+U^{s}, \end{array} where Q^{d} = quantity demanded, Q^{s} = quantity supplied, P = price.
The quantity and price are determined simultaneously in the equilibrium: Q^{d}=Q^{s}=Q.
Q^{d} and Q^{s} are not observed separately; we observe only the equilibrium values Q.
Solving for P: \begin{aligned} Q^{d} &= Q^{s} \\ \beta _{0}^{d}+\beta _{1}^{d}P+U^{d} &= \beta _{0}^{s}+\beta _{1}^{s}P+U^{s}, \end{aligned} so P=-\frac{\beta _{0}^{d}-\beta _{0}^{s}}{\beta _{1}^{d}-\beta _{1}^{s}}-\frac{U^{d}-U^{s}}{\beta _{1}^{d}-\beta _{1}^{s}}.
Because the equilibrium price depends on both U^{d} and U^{s}, \mathrm{Cov}\left(P,U^{d}\right) \neq 0 \text{ and } \mathrm{Cov}\left(P,U^{s}\right) \neq 0. Neither the demand nor the supply equation can be estimated consistently by OLS.
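A small simulation illustrates what OLS recovers from equilibrium data. This is a sketch under assumptions that are not in the notes: the structural parameter values are hypothetical, and the shocks U^{d} and U^{s} are drawn as independent standard normals. Under those assumptions the OLS slope of Q on P converges to a variance-weighted average of the demand and supply slopes, which equals zero for the values chosen here.

```python
import random


def mean(v):
    return sum(v) / len(v)


def ols_slope(x, y):
    """OLS slope in a regression of y on x with an intercept."""
    xbar = mean(x)
    num = sum((xi - xbar) * yi for xi, yi in zip(x, y))
    den = sum((xi - xbar) ** 2 for xi in x)
    return num / den


# Hypothetical structural parameters (illustrative values only).
b0d, b1d = 10.0, -1.0   # demand intercept and slope
b0s, b1s = 2.0, 1.0     # supply intercept and slope

rng = random.Random(0)
n = 40_000
prices, quantities = [], []
for _ in range(n):
    ud = rng.gauss(0, 1)   # demand shock
    us = rng.gauss(0, 1)   # supply shock
    # Equilibrium price from setting demand equal to supply.
    p = -(b0d - b0s) / (b1d - b1s) - (ud - us) / (b1d - b1s)
    # Equilibrium quantity read off the demand curve.
    q = b0d + b1d * p + ud
    prices.append(p)
    quantities.append(q)

slope = ols_slope(prices, quantities)
print(f"OLS slope of Q on P: {slope:.3f} (demand slope {b1d}, supply slope {b1s})")
```

The estimated slope identifies neither curve: it is far from both the demand slope and the supply slope.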
Consider a labour supply model for married women: \text{Hours}_{i}=\beta _{0}+\beta _{1}\text{Children}_{i}+\text{Other Factors}+U_{i}, where \text{Hours} = hours of work, \text{Children} = number of children.
It is reasonable to assume that women decide simultaneously how much time to devote to career and family.
Thus, while we may be mainly interested in the effect of family size on labour supply, there is another equation: \text{Children}_{i}=\gamma _{0}+\gamma _{1}\text{Hours}_{i}+\text{Other Factors}+V_{i}, and \text{Children} and \text{Hours} are determined simultaneously in an equilibrium.
As a result, \mathrm{Cov}\left(\text{Children}_{i},U_{i}\right) \neq 0, and the effect of family size cannot be estimated consistently by OLS.
Consider a model: Y_{i}=\beta _{0}+\beta _{1}X_{i}^{\ast }+V_{i}, where X_{i}^{\ast } is the “true” regressor.
Suppose that X_{i}^{\ast } is not directly observable. Instead, we observe X_{i} that measures X_{i}^{\ast } with an error \varepsilon_{i}: X_{i}=X_{i}^{\ast }+\varepsilon _{i}.
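Since X_{i}^{\ast } is unobservable, the econometrician has to regress Y_{i} against X_{i}. Substituting X_{i}^{\ast }=X_{i}-\varepsilon _{i} into the model gives Y_{i}=\beta _{0}+\beta _{1}X_{i}+U_{i}, where U_{i}=V_{i}-\beta _{1}\varepsilon _{i}.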
We can assume that \mathrm{Cov}\left(X_{i}^{\ast },V_{i}\right) =\mathrm{Cov}\left(X_{i}^{\ast },\varepsilon_{i}\right) =\mathrm{Cov}\left(\varepsilon _{i},V_{i}\right) =0.
However, \begin{aligned} \mathrm{Cov}\left(X_{i},U_{i}\right) &= \mathrm{Cov}\left(X_{i}^{\ast }+\varepsilon _{i},\; V_{i}-\beta _{1}\varepsilon_{i}\right) \\ &= \mathrm{Cov}\left(X_{i}^{\ast },V_{i}\right) -\beta _{1}\mathrm{Cov}\left(X_{i}^{\ast },\varepsilon _{i}\right) \\ &\quad +\mathrm{Cov}\left(\varepsilon _{i},V_{i}\right) -\beta _{1}{\color{red}\mathrm{Cov}\left(\varepsilon _{i},\varepsilon _{i}\right)} \\ &= -\beta _{1}\mathrm{Var}\left(\varepsilon _{i}\right) \neq 0. \end{aligned}
Thus, whenever \beta _{1}\neq 0 and \mathrm{Var}\left(\varepsilon _{i}\right) >0, X_{i} is endogenous and \beta _{1} cannot be estimated consistently by OLS.
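Combining the covariance computed above with the general probability limit of the OLS estimator shows the size and direction of the bias. The following worked step is a sketch under the stated zero-covariance assumptions; it uses \mathrm{Var}\left(X_{i}\right) =\mathrm{Var}\left(X_{i}^{\ast }\right) +\mathrm{Var}\left(\varepsilon _{i}\right), which follows from \mathrm{Cov}\left(X_{i}^{\ast },\varepsilon _{i}\right) =0.

```latex
\begin{aligned}
\hat{\beta}_{1,n}
  &\rightarrow_{p} \beta_{1}+\frac{\mathrm{Cov}\left(X_{i},U_{i}\right)}{\mathrm{Var}\left(X_{i}\right)}
   = \beta_{1}-\beta_{1}\frac{\mathrm{Var}\left(\varepsilon_{i}\right)}
                             {\mathrm{Var}\left(X_{i}^{\ast}\right)+\mathrm{Var}\left(\varepsilon_{i}\right)} \\
  &= \beta_{1}\,\frac{\mathrm{Var}\left(X_{i}^{\ast}\right)}
                     {\mathrm{Var}\left(X_{i}^{\ast}\right)+\mathrm{Var}\left(\varepsilon_{i}\right)}.
\end{aligned}
```

The limit is \beta _{1} multiplied by a factor between zero and one, so measurement error of this kind shrinks the OLS estimate toward zero (often called attenuation bias).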
Consider \begin{aligned} Y_{i} &= \beta _{0}+\beta _{1}X_{i}+U_{i}, \\ \mathrm{Cov}\left(X_{i},U_{i}\right) &\neq 0. \end{aligned}
Suppose that the econometrician observes another variable Z_{i}, called the instrumental variable, that satisfies the following conditions:
The IV is exogenous: \mathrm{Cov}\left(Z_{i},U_{i}\right) =0.
The IV determines the endogenous regressor: \mathrm{Cov}\left(Z_{i},X_{i}\right) \neq 0.
When an IV satisfying those conditions is available, it allows us to estimate the effect of X on Y consistently.
Consider the IV estimator of \beta _{1}: {\color{blue}\hat{\beta}_{1,n}^{IV}=\frac{\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) Y_{i}}{\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i}}}.
Substituting Y_{i}=\beta _{0}+\beta _{1}X_{i}+U_{i}: \begin{aligned} \hat{\beta}_{1,n}^{IV} &= \frac{\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) \left( \beta _{0}+\beta _{1}X_{i}+U_{i}\right)}{\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i}} \\ &= \beta _{1}+\frac{\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) U_{i}}{\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i}}. \end{aligned} The \beta _{0} term drops out because \sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) =0.
Recall the IV conditions: \mathrm{Cov}\left(Z_{i},U_{i}\right) =0 and \mathrm{Cov}\left(Z_{i},X_{i}\right) \neq 0.
By the LLN: \begin{aligned} \frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) U_{i} &\rightarrow _{p} {\color{red}\mathrm{Cov}\left(Z_{i},U_{i}\right)}, \\ \frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i} &\rightarrow _{p} {\color{blue}\mathrm{Cov}\left(Z_{i},X_{i}\right)}. \end{aligned}
The IV estimator is consistent: \begin{align*} \hat{\beta}_{1,n}^{IV} &= \beta _{1}+\frac{\frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) U_{i}}{\frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i}} \\ &\rightarrow _{p}\beta _{1}+\frac{{\color{red}0}}{{\color{blue}\mathrm{Cov}\left(Z_{i},X_{i}\right)}} =\beta _{1}. \end{align*}
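The consistency result can be illustrated by computing the IV estimator on simulated data and comparing it with OLS. This is a minimal sketch: the data-generating process (X_i = 0.8 Z_i + 0.5 U_i + E_i with independent standard normal Z_i, U_i, E_i), the parameter values, and the function names are assumptions made for illustration. By construction the instrument is exogenous and relevant, while X is endogenous.

```python
import random


def mean(v):
    return sum(v) / len(v)


def iv_slope(z, x, y):
    """IV estimator: sum (Z_i - Zbar) Y_i / sum (Z_i - Zbar) X_i."""
    zbar = mean(z)
    num = sum((zi - zbar) * yi for zi, yi in zip(z, y))
    den = sum((zi - zbar) * xi for zi, xi in zip(z, x))
    return num / den


def ols_slope(x, y):
    """OLS is the special case of the IV formula with Z = X."""
    return iv_slope(x, x, y)


rng = random.Random(1)
n = 50_000
beta0, beta1 = 1.0, 2.0   # hypothetical true coefficients
z, x, y = [], [], []
for _ in range(n):
    zi = rng.gauss(0, 1)   # instrument, independent of u
    u = rng.gauss(0, 1)    # structural error
    xi = 0.8 * zi + 0.5 * u + rng.gauss(0, 1)   # endogenous regressor
    z.append(zi)
    x.append(xi)
    y.append(beta0 + beta1 * xi + u)

b_ols = ols_slope(x, y)
b_iv = iv_slope(z, x, y)
print(f"true beta1 = {beta1}, OLS = {b_ols:.3f}, IV = {b_iv:.3f}")
```

In this design the OLS estimate stays above the true slope, while the IV estimate should be close to it for large n.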
Theoretically, the causal effect can be estimated from controlled experiments:
To estimate the return to education, select a random sample of children, randomly assign how many years of education they should have, and measure their income several years after graduation.
To estimate the effect of family size on labour supply, select a random sample of parents, randomly assign how many children they should have, and measure their labour market outcomes.
Such experiments are usually infeasible because of their cost, for ethical reasons, or both.
Natural experiments: use naturally occurring random variation in the variable of interest to estimate the causal effect.
Angrist and Krueger (1991, QJE) suggested using school start age policy to estimate \beta _{1} in \ln \text{Wage}_{i}=\beta _{0}+\beta _{1}\text{Education}_{i}+\beta _{2}\text{Ability}_{i}+V_{i}.
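We need an IV Z that is uncorrelated with the unobserved determinants of wages, \mathrm{Cov}\left(\text{Ability}_{i},Z_{i}\right) =0 and \mathrm{Cov}\left(V_{i},Z_{i}\right) =0, and correlated with the endogenous regressor, \mathrm{Cov}\left(\text{Education}_{i},Z_{i}\right) \neq 0.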
They argue that due to compulsory schooling laws, the season of birth satisfies the IV conditions:
A child must attend school until reaching a certain drop-out age.
Students born in the first quarter are older than classmates born later in the year, so they reach the legal drop-out age earlier in their schooling and can leave with less completed education.
The quarter-of-birth dummy is correlated with education.
The quarter of birth is uncorrelated with ability.
Angrist and Evans (1998, AER) argue that parents’ preferences for a mixed sibling-sex composition can be used to estimate \beta _{1} in \text{Hours}_{i}=\beta _{0}+\beta _{1}\text{Children}_{i}+\ldots +U_{i}.
We need an IV Z such that \mathrm{Cov}\left(U_{i},Z_{i}\right) =0 and \mathrm{Cov}\left(\text{Children}_{i},Z_{i}\right) \neq 0.
Consider a dummy variable equal to one if the sex of the second child matches the sex of the first child.
If parents prefer a mixed sibling-sex composition, they are more likely to have another child if their first two children are of the same sex.
The same-sex dummy is correlated with the number of children.
Since sex mix is randomly determined, the same-sex dummy is exogenous.
Write \begin{align*} \sqrt{n}\left( \hat{\beta}_{1,n}^{IV}-\beta _{1}\right) &= \frac{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) U_{i}}{\frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i}} \\ &\rightarrow _{d}\frac{N\left( 0,\;\mathrm{E}\left[\left( Z_{i}-\mathrm{E}\left[Z_{i}\right]\right)^{2}U_{i}^{2}\right]\right)}{\mathrm{Cov}\left(Z_{i},X_{i}\right)}. \end{align*}
Thus, \begin{aligned} \sqrt{n}\left( \hat{\beta}_{1,n}^{IV}-\beta _{1}\right) &\rightarrow_{d} N\left( 0,V^{IV}\right), \text{ where} \\ V^{IV} &= \frac{\mathrm{E}\left[\left( Z_{i}-\mathrm{E}\left[Z_{i}\right]\right) ^{2}U_{i}^{2}\right]}{\left(\mathrm{Cov}\left(Z_{i},X_{i}\right)\right) ^{2}}. \end{aligned}
Let \hat{\beta}_{0,n}^{IV}=\bar{Y}_{n}-\hat{\beta}_{1,n}^{IV}\cdot \bar{X}_{n}.
Let \hat{U}_{i}=Y_{i}-\hat{\beta}_{0,n}^{IV}-\hat{\beta}_{1,n}^{IV}X_{i}.
Estimate V^{IV} by \hat{V}_{n}^{IV}=\frac{\frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) ^{2}\hat{U}_{i}^{2}}{\left( \frac{1}{n}\sum_{i=1}^{n}\left( Z_{i}-\bar{Z}_{n}\right) X_{i}\right) ^{2}}.
In finite samples, we use the approximation: \hat{\beta}_{1,n}^{IV}\overset{a}{\sim }N\left( \beta _{1},\frac{\hat{V}_{n}^{IV}}{n}\right).
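The steps above (intercept, residuals, variance estimator, normal approximation) can be sketched in code. This is an illustrative implementation on simulated data, not code from the course: the function name `iv_fit`, the data-generating process, and the parameter values are assumptions. The 95% interval uses the standard normal critical value 1.96, which follows from the normal approximation. In this particular design V^{IV} = 1/0.8^{2} = 1.5625.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def iv_fit(z, x, y):
    """Return IV intercept, slope, estimated asymptotic variance, and standard error."""
    n = len(y)
    zbar, xbar, ybar = mean(z), mean(x), mean(y)
    szx = sum((zi - zbar) * xi for zi, xi in zip(z, x)) / n
    szy = sum((zi - zbar) * yi for zi, yi in zip(z, y)) / n
    b1 = szy / szx
    b0 = ybar - b1 * xbar
    # IV residuals use the observed regressor X, not the instrument.
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    num = sum((zi - zbar) ** 2 * ui ** 2 for zi, ui in zip(z, resid)) / n
    v_hat = num / szx ** 2
    se = math.sqrt(v_hat / n)
    return b0, b1, v_hat, se


# Hypothetical design: X = 0.8*Z + 0.5*U + E, true beta0 = 1, beta1 = 2.
rng = random.Random(2)
n = 50_000
z, x, y = [], [], []
for _ in range(n):
    zi = rng.gauss(0, 1)
    u = rng.gauss(0, 1)
    xi = 0.8 * zi + 0.5 * u + rng.gauss(0, 1)
    z.append(zi)
    x.append(xi)
    y.append(1.0 + 2.0 * xi + u)

b0, b1, v_hat, se = iv_fit(z, x, y)
ci_low, ci_high = b1 - 1.96 * se, b1 + 1.96 * se
print(f"beta1_IV = {b1:.3f}, V_hat = {v_hat:.3f}, se = {se:.4f}")
print(f"approximate 95% CI: [{ci_low:.3f}, {ci_high:.3f}]")
```

The variance estimator weights squared residuals by \left( Z_{i}-\bar{Z}_{n}\right) ^{2}, so it does not require homoskedastic errors.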