Lab 03: Post-Lasso and double Lasso with growth data

```{r} suppressMessages({ library(stargazer) library(hdm) library(AER) }) ``` # Model $$\begin{equation} (\Delta\log GDP)_i=\alpha \cdot GDP^0_i+U_i. \end{equation}$$ * $(\Delta\log GDP)_i$ is the change in the log of GDP per capita of country $i$ between periods $t_0$ and $t_1$, $t_0⋆p<0.1; ^⋆⋆p<0.05; ^⋆⋆⋆p<0.01") ) ) ``` ##### **No support for the catching up hypothesis**. # Conditional model: * Institutions and technology matter. * The catching up hypothesis works with similar countries only. * Need to control for the characteristics:$$\begin{equation} (\Delta\log GDP)_i=\alpha \cdot GDP^0_i+X_i'\beta+U_i. \end{equation}$$ * $X_i$ is the vector of controls describing the economic conditions of country $i$ in period $t_0$. There are a lot of potential controls in the data: ```{r} dim(GrowthData) names(GrowthData) ``` * 90 observations (countries). * Over 60 potential controls. * This is a "big-data"/"high-dimensional" example. ### Estimation of the conditional model ```{r} y=as.vector(GrowthData$Outcome) D=as.vector(GrowthData$gdpsh465) Controls=as.matrix(GrowthData)[,-c(1,2,3)] ``` * `y` = GDP per capita growth rate. * `D` = initial GDP per capita. * `-c(1,2,3)` instructs to exclude the first 3 variables in `GrowthData`: * `Outcome` * `intercept` * `gdpsh465` OLS regression with all controls: ```{r, results='asis'} conditional=lm(y~D+Controls) suppressWarnings( stargazer(conditional, header=FALSE, title="Testing the conditional catching up hypothesis", omit.stat = "all", omit="Controls", #type="text" type="html",notes.append = FALSE,notes = c("^⋆p<0.1; ^⋆⋆p<0.05; ^⋆⋆⋆p<0.01") ) ) ``` ##### No support for the conditional catching up hypothesis. * The estimate is negative but the std.err is too large - too many controls, still no support. * The std.err. on the initial GDP per capita increased from `0.006` to `0.030`. # Post-Lasso with Double Lasso ```{r} ?rlassoEffect ``` Usage: * `x=` specifies the matrix of controls. * `y=` specifies the outcome variable. * `d=` specifies the treatment variable (the main regressor of interest). ```{r} Effect<-rlassoEffect(x=Controls,y=y,d=D,method="double selection") summary(Effect) ``` ```{r} names(Effect) ``` ##### **A negative significant estimate!** ```{r} Effect$selection.index ``` Included controls: ```{r} sum(Effect$selection.index==TRUE) Effect$selection.index[Effect$selection.index==TRUE] ``` #### Double Lasso selected 7 controls: * `bmp1l`: Log of the black market premium. * `freetar`: Measure of tariff restrictions. * `hm65`: Male gross enrollment ratio for higher education in 1965. * `sf65`: Female gross enrollment ratio for secondary education in 1965. * `lifee065`: Life expectancy at 0 in 1965. * `humanf65`: Average schooling years in the female population over age 25 in 1965. * `pop6565`: Population Proportion over 65 in 1965. ```{r} ?rlasso() ``` ##### Step 1 of the algorithm: ```{r} lasso.Y<-rlasso(Outcome~Controls,data=GrowthData) names(lasso.Y) ``` ```{r} coef(lasso.Y) ``` #### Step 2 ```{r} lasso.D<-rlasso(D~Controls) coef(lasso.D) ``` ### Using the partialling out approach: ```{r} Effect_PO<-rlassoEffect(x=Controls,y=y,d=D,method="partialling out") summary(Effect_PO) ``` * A very similar estimate to the Double Lasso approach. ```{r} sum(Effect_PO$selection.index==TRUE) Effect_PO$selection.index[Effect_PO$selection.index==TRUE] ``` * The same selected controls. ```{r} lasso.Y<-rlasso(Outcome~Controls,data=GrowthData) Ytilde<-lasso.Y$residuals lasso.D<-rlasso(D~Controls,data=GrowthData) Dtilde<-lasso.D$residuals Post<-lm(Ytilde~ -1+ Dtilde) coeftest(Post,vcov. = vcovHC(Post,type="HC0")) ``` ### Conclusion This data set contains many potential controls that are highly correlated among each other: * Many similar education variables. * Many similar demographic variables. * Many similar political variables. * Etc. These controls are related not only among each other, but also to the main regressor (treatment): The "initial" GDP per capita. As a result, including all potential controls produces insignificant estimates due to the presence of many controls (over 60 controls with only 90 observations). It is plausible to assume that the model is sparse: only certain demographic, education, and etc. variables matter. This is an appropriate problem for Lasso as we need to select few out of many controls. Lasso selects the important controls. The double Lasso step also selects the controls that are related to the main regressor to avoid potential omitted variables bias. Post Lasso produces significant estimates on the main regressor. The result implies that the conditional catching up hypothesis holds: growth rates converge for countries with similar economic and demographic characteristics. ---

Vadim Marmer