The fundamental challenge: we can never observe the same individual both with and without treatment at the same time.
This is modeled using the potential outcomes framework: each individual has two potential outcomes, one under treatment and one under control, but we only observe one of them.
Potential outcomes
Treatment status: D_i = 1 if individual i receives treatment, D_i = 0 otherwise.
For each individual i, define two potential outcomes:
Y_i(1): outcome if individual i receives treatment (D_i = 1),
Y_i(0): outcome if individual i does not receive treatment (D_i = 0).
The individual treatment effect for person i is:
Y_i(1) - Y_i(0).
Example: if Y_i is earnings and D_i indicates job training, then Y_i(1) - Y_i(0) is the causal effect of training on earnings for person i.
The fundamental problem
The observed outcome is:
Y_i = D_i \, Y_i(1) + (1 - D_i) \, Y_i(0).
If D_i = 1, we observe Y_i(1) but not Y_i(0).
If D_i = 0, we observe Y_i(0) but not Y_i(1).
We can never observe both potential outcomes for the same individual. This is the fundamental problem of causal inference.
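A small simulation (base R, purely hypothetical numbers) makes the missing-data structure concrete: we generate both potential outcomes for every unit, but the observed outcome reveals only one of them.

```r
set.seed(1)
n  <- 6
y0 <- round(rnorm(n, mean = 10), 1)   # potential outcome without treatment, Y(0)
y1 <- round(y0 + 2, 1)                # potential outcome with treatment, Y(1); effect = 2
d  <- rbinom(n, 1, 0.5)               # treatment status D
y  <- d * y1 + (1 - d) * y0           # observed outcome: Y = D*Y(1) + (1-D)*Y(0)

# For each unit, exactly one potential outcome is missing:
data.frame(d  = d,
           y1 = ifelse(d == 1, y1, NA),   # Y(1) observed only if treated
           y0 = ifelse(d == 0, y0, NA),   # Y(0) observed only if untreated
           y  = y)
```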
Two standard estimands are the average treatment effect (ATE), \mathrm{E}\left[Y_i(1) - Y_i(0)\right], and the average treatment effect on the treated (ATT), \mathrm{E}\left[Y_i(1) - Y_i(0) \mid D_i = 1\right]. ATE averages over the entire population; ATT averages only over those who actually receive treatment.
Selection bias
Since we cannot observe the counterfactual for each individual, a natural idea is to use the outcomes of other people as stand-ins: compare average outcomes of the treated group to average outcomes of the untreated group.
Can we estimate ATE by simply comparing group averages, \mathrm{E}\left[Y_i \mid D_i = 1\right] - \mathrm{E}\left[Y_i \mid D_i = 0\right]?
Comparing group averages seems natural, but it conflates the treatment effect with pre-existing differences between groups:
\mathrm{E}\left[Y_i \mid D_i = 1\right] - \mathrm{E}\left[Y_i \mid D_i = 0\right] = \mathrm{E}\left[Y_i(1) - Y_i(0) \mid D_i = 1\right] + \left(\mathrm{E}\left[Y_i(0) \mid D_i = 1\right] - \mathrm{E}\left[Y_i(0) \mid D_i = 0\right]\right).
The first term is the ATT; the second is selection bias, the difference in baseline outcomes between the two groups. The naive comparison recovers a treatment effect only when the selection bias term is zero, which requires treatment assignment to be independent of the potential outcomes. Randomized experiments (like the National Supported Work program) achieve this by randomly assigning individuals to treatment and control groups.
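A simulation (hypothetical parameters) illustrates the point: when the treated would have had higher outcomes even without treatment, the naive comparison of group means overstates the effect, while random assignment removes the bias.

```r
set.seed(42)
n <- 1e5
ability <- rnorm(n)                  # unobserved confounder
y0 <- 5 + 2 * ability + rnorm(n)     # baseline outcome rises with ability
y1 <- y0 + 1                         # true treatment effect = 1 for everyone

# Self-selection: high-ability individuals are more likely to enroll
d_self <- as.integer(ability + rnorm(n) > 0)
y_self <- d_self * y1 + (1 - d_self) * y0
naive  <- mean(y_self[d_self == 1]) - mean(y_self[d_self == 0])

# Random assignment: treatment independent of potential outcomes
d_rand <- rbinom(n, 1, 0.5)
y_rand <- d_rand * y1 + (1 - d_rand) * y0
random <- mean(y_rand[d_rand == 1]) - mean(y_rand[d_rand == 0])

c(naive = naive, randomized = random)   # naive is biased upward; randomized is near 1
```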
Regression with a treatment dummy
With random assignment, the ATE can be estimated by regressing Y_i on a treatment dummy:
Y_i = \alpha + \tau D_i + U_i,
where \mathrm{E}\left[U_i \mid D_i\right] = 0 under randomization.
The OLS estimate of \tau equals the difference in sample means:
\hat{\tau} = \bar{Y}_1 - \bar{Y}_0,
where \bar{Y}_1 and \bar{Y}_0 are the sample averages for the treated and control groups.
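This equivalence is easy to verify on simulated data (illustrative numbers only): the OLS coefficient on the dummy equals the difference in group means exactly, not just approximately.

```r
set.seed(7)
d <- rbinom(200, 1, 0.5)
y <- 3 + 1.5 * d + rnorm(200)              # true effect 1.5

tau_hat   <- coef(lm(y ~ d))["d"]          # OLS coefficient on the treatment dummy
diff_mean <- mean(y[d == 1]) - mean(y[d == 0])

all.equal(unname(tau_hat), diff_mean)      # identical up to numerical precision
```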
Example: Lalonde data
The National Supported Work (NSW) demonstration recruited disadvantaged workers (long-term unemployed, high-school dropouts, former drug users, ex-offenders).
Among eligible applicants, some were randomly assigned to receive job training (treated group), and the rest formed the experimental control group. Both groups come from the same disadvantaged population.
We use the jtrain2 dataset from the wooldridge package, based on the Lalonde (1986) study:
train: 1 if randomly assigned to job training, 0 if assigned to control.
re78: real earnings in 1978 (thousands of dollars).
Estimating the ATE
Since treatment was randomly assigned, the coefficient on train estimates the ATE:
summary(lm(re78 ~ train, data = jtrain2))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.554802 0.4080460 11.162474 1.154113e-25
train 1.794343 0.6328536 2.835321 4.787524e-03
The estimated ATE is approximately $1,800 (re78 is in thousands): on average, participation in the job training program increased 1978 earnings by about $1,800.
Observational studies
In many settings, randomization is not feasible or ethical. Treatment is determined by individual choices or institutional rules.
Self-selection generates selection bias: workers who voluntarily enroll in job training may be more motivated; workers at firms offering 401(k) plans tend to have higher incomes.
Key idea: if we can identify covariates X_i that explain why some individuals are treated and others are not, then after controlling for X_i, treatment may be as good as random.
This is the selection on observables assumption: all confounding factors are captured by observable covariates.
Potential outcomes with a covariate
The selection on observables assumption can be formalized as follows: after conditioning on X_i, treatment assignment D_i is as good as random.
Suppose the potential outcomes depend linearly on a covariate X_i:
Y_i(0) = \alpha_0 + \beta_0 X_i + U_i(0),
Y_i(1) = \alpha_1 + \beta_1 X_i + U_i(1),
where \mathrm{E}\left[U_i(0) \mid X_i, D_i\right] = 0 and \mathrm{E}\left[U_i(1) \mid X_i, D_i\right] = 0.
These conditions are called conditional mean independence: while Y_i(0) and Y_i(1) depend on X_i (and thus on D_i), the residual terms after controlling for X_i, namely U_i(0) and U_i(1), are uncorrelated with both X_i and D_i.
The effect of X_i on the potential outcomes can differ between the treated and control groups: \beta_1 may not equal \beta_0.
To estimate the ATE in a single regression, include X_i and the interaction of D_i with the demeaned covariate:
Y_i = a + \tau D_i + b X_i + \delta D_i\left(X_i - \bar{X}\right) + V_i.
The coefficient \hat{\tau} on D_i directly estimates the ATE.
This works because demeaning the interaction “absorbs” the (\beta_1 - \beta_0)\mathrm{E}\left[X_i\right] part of the ATE into the coefficient on D_i.
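A simulation with unequal slopes (hypothetical parameters, treatment probability depending on x) shows the trick at work: the demeaned-interaction regression recovers ATE = (\alpha_1 - \alpha_0) + (\beta_1 - \beta_0)\mathrm{E}\left[X_i\right].

```r
set.seed(123)
n  <- 1e5
x  <- rnorm(n, mean = 2)
d  <- rbinom(n, 1, plogis(x - 2))     # treatment probability depends on x
y0 <- 1 + 0.5 * x + rnorm(n)          # alpha0 = 1, beta0 = 0.5
y1 <- 2 + 1.5 * x + rnorm(n)          # alpha1 = 2, beta1 = 1.5
y  <- d * y1 + (1 - d) * y0

# True ATE = (2 - 1) + (1.5 - 0.5) * E[x] = 1 + 1 * 2 = 3
fit <- lm(y ~ d + x + I(d * (x - mean(x))))
coef(fit)["d"]                        # close to the true ATE of 3
```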
Why not regress Y_i on D_i and X_i?
A simpler regression omits the interaction:
Y_i = a + \tau D_i + b X_i + V_i.
This is valid if \beta_1 = \beta_0 (the covariate has the same effect in both groups). Then the interaction coefficient \delta = \beta_1 - \beta_0 is zero, the interaction drops out, and the coefficient on D_i is the ATE.
If \beta_1 \neq \beta_0 (the effect of X_i differs between groups), the omitted interaction creates bias: the interaction X_iD_i affects Y_i and is correlated with D_i, so the coefficient on D_i does not equal the ATE.
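A simulation (hypothetical parameters) makes the bias visible: slopes differ between groups, treatment depends on x, and the regression without the interaction misses the ATE while the demeaned-interaction regression recovers it.

```r
set.seed(321)
n <- 1e5
x <- rnorm(n, mean = 2)
d <- as.integer(x + rnorm(n) > 3)      # treated units have higher x on average
# Unequal slopes: control line 1 + 0.5x, treated line 2 + 1.5x; true ATE = 1 + 1*2 = 3
y <- (1 - d) * (1 + 0.5 * x) + d * (2 + 1.5 * x) + rnorm(n)

short <- coef(lm(y ~ d + x))["d"]                          # omits the interaction
long  <- coef(lm(y ~ d + x + I(d * (x - mean(x)))))["d"]   # demeaned interaction
c(short = unname(short), long = unname(long))              # short overshoots the ATE
```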
Example: separate regressions
Estimate separate regressions of re78 on educ for the treatment and control groups:
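One way to carry this out (a sketch, assuming jtrain2 is available from the wooldridge package as above): fit the two group regressions, combine the intercept and slope differences at the mean of educ, and compare with the pooled demeaned-interaction regression.

```r
data("jtrain2", package = "wooldridge")

fit0 <- lm(re78 ~ educ, data = subset(jtrain2, train == 0))  # control group
fit1 <- lm(re78 ~ educ, data = subset(jtrain2, train == 1))  # treated group

# ATE = (alpha1 - alpha0) + (beta1 - beta0) * mean(educ)
ate_sep <- (coef(fit1)[1] - coef(fit0)[1]) +
           (coef(fit1)[2] - coef(fit0)[2]) * mean(jtrain2$educ)

# Pooled regression with the demeaned interaction gives the same number as tau-hat
pooled <- lm(re78 ~ train + educ + I(train * (educ - mean(educ))), data = jtrain2)
c(separate = unname(ate_sep), pooled = unname(coef(pooled)["train"]))
```

The two estimates agree exactly: the pooled model with a full (demeaned) interaction is an algebraic reparameterization of the two separate regressions.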
The coefficient on train directly estimates the ATE, matching the result from the separate regressions approach.
Example: adding more covariates
So far, we used only educ as the control. The dataset contains more pre-treatment characteristics: age, black, married.
With random assignment, slopes are approximately equal across groups, so we can add controls directly without demeaned interactions:
reg_X <- lm(re78 ~ train + educ + age + black + married, data = jtrain2)
round(summary(reg_X)$coefficients, 1)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.9 2.2 0.4 0.7
train 1.7 0.6 2.7 0.0
educ 0.4 0.2 2.4 0.0
age 0.1 0.0 1.2 0.2
black -2.3 0.8 -2.7 0.0
married 0.2 0.8 0.2 0.9
Comparison: simple vs. controlled
Compare the estimated ATE and its standard error across specifications, including one with demeaned interactions:
m1 <- lm(re78 ~ train, data = jtrain2)
m2 <- lm(re78 ~ train + educ + age + black + married, data = jtrain2)
covs <- c("educ", "age", "black", "married")
for (v in covs) jtrain2[[paste0(v, "_dm")]] <- jtrain2[[v]] - mean(jtrain2[[v]])
m3 <- lm(re78 ~ train + educ + age + black + married +
           I(train * educ_dm) + I(train * age_dm) +
           I(train * black_dm) + I(train * married_dm), data = jtrain2)
make_row <- function(m) c(Estimate = coef(m)["train"],
                          SE = summary(m)$coef["train", "Std. Error"])
tab <- rbind("No controls" = make_row(m1),
             "Controls, equal slopes" = make_row(m2),
             "Demeaned interactions" = make_row(m3))
round(tab, 2)
Estimate.train SE
No controls 1.79 0.63
Controls, equal slopes 1.68 0.63
Demeaned interactions 1.64 0.63
All three estimates are nearly identical. The demeaned interactions barely matter because randomization ensures approximately equal slopes across groups.
Why do controls change little here?
Under random assignment, D_i is independent of all covariates: the treatment and control groups are balanced in expectation.
Because of balance, including X_i does not change the coefficient on D_i: there is no selection bias to remove.
The NSW program recruited a narrow, disadvantaged population: long-term unemployed, high-school dropouts, former drug users, ex-offenders.
Within this homogeneous group, covariates like educ, age, black, and married vary little and have limited predictive power for earnings.
Both the point estimate and the standard error are nearly unchanged because the controls add almost no information beyond the treatment dummy.
Example: 401(k) eligibility and savings
The k401ksubs dataset (from the wooldridge package) contains 9,275 households from the 1991 Survey of Income and Program Participation:
e401k: 1 if the worker is eligible for a 401(k) plan, 0 otherwise.
nettfa: net total financial assets (thousands of dollars).
inc: annual family income (thousands of dollars).
Eligible workers have much higher income and much higher savings. The raw gap in savings is not a treatment effect: it partly reflects income differences.
Separate regressions
Estimate separate regressions of nettfa on inc for each group:
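A sketch (assuming k401ksubs is available from the wooldridge package, with e401k as the treatment dummy): the two groups' regression lines differ in both intercept and slope, and the demeaned-interaction regression combines them into an ATE evaluated at mean income.

```r
data("k401ksubs", package = "wooldridge")

fit0 <- lm(nettfa ~ inc, data = subset(k401ksubs, e401k == 0))  # not eligible
fit1 <- lm(nettfa ~ inc, data = subset(k401ksubs, e401k == 1))  # eligible
rbind(not_eligible = coef(fit0), eligible = coef(fit1))

# ATE at mean income = (alpha1 - alpha0) + (beta1 - beta0) * mean(inc),
# identical to the e401k coefficient in the demeaned-interaction regression
ate_sep <- (coef(fit1)[1] - coef(fit0)[1]) +
           (coef(fit1)[2] - coef(fit0)[2]) * mean(k401ksubs$inc)
pooled  <- lm(nettfa ~ e401k + inc + I(e401k * (inc - mean(inc))),
              data = k401ksubs)
c(separate = unname(ate_sep), pooled = unname(coef(pooled)["e401k"]))
```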
At inc = 0 (zero income), the estimated gap between the two groups' regression lines is an extrapolation far outside the data range, and its sign flips to negative.
Contrast with jtrain2: there, controls barely mattered because randomization already eliminated selection bias. Here, controls and the demeaning trick are essential.
Summary
The fundamental problem of causal inference is that we never observe both potential outcomes Y_i(1) and Y_i(0) for the same individual.
Comparing group averages \mathrm{E}\left[Y_i \mid D_i = 1\right] - \mathrm{E}\left[Y_i \mid D_i = 0\right] does not generally yield the ATE because of selection bias: treated and control groups may differ in their baseline characteristics.
Randomization eliminates selection bias by making treatment independent of potential outcomes. The simple difference in means then equals the ATE, which can be estimated by regressing Y_i on a treatment dummy D_i.
In observational studies, treatment is not randomly assigned. Under selection on observables, controlling for covariates X_i restores the “as good as random” property of treatment assignment.
With covariates, \text{ATE} = (\alpha_1 - \alpha_0) + (\beta_1 - \beta_0)\mathrm{E}\left[X_i\right] depends on both intercept and slope differences.
The demeaning trick makes the ATE appear directly as the coefficient on D_i: interact D_i with (X_i - \bar{X}) instead of X_i. This absorbs the (\beta_1 - \beta_0)\mathrm{E}\left[X_i\right] component into the D_i coefficient.