Lab 01: BIC illustration

```{r} suppressMessages({ library(foreach) library(stargazer) }) ``` We illustrate the use of the BIC model selection procedures using simulated data from a polynomial regression model. # DGP Consider a polynomial regression model: \begin{align} Y_i & = \beta_0+\beta_1 X_{i}+\beta_2 X_i^2+\ldots+\beta_{p_0} X_i^{p_0}+U_i,\\ U_i &\sim N(0,1),\\ X_i &\sim N(0,1). \end{align} Let's define a custom R function to generate the data: ```{r} DGP<-function(n,p0){ #p0 = the number of polynomial terms #n = the number of observations X<-rnorm(n) U<-rnorm(n) Y<-U for (j in 1:p0){ Y<-Y+X^j } return(list(y=Y,x=X)) } ``` Let's simulate some data. To make the results reproducible, we will use a specified state/seed of the random number generator. Each time you run the code, the same sequence of ovservations is simulated. ```{r} n=30 p0=4 set.seed(42) MyData<-DGP(n=n,p0=p0) ``` # Polynomial regression We can obtain the OLS estimator for the polynomial regression using the function `poly(x,p,raw=TRUE)`, where: - `x` is the data vector used to construct the polynomial terms. - `p` is the degree of the polynomial. - `raw=TRUE` instructs R to **not** use orthogonal polynomial terms. Below, we estimate the model with the true number of the polynomial terms `p=p0`: ```{r, results="asis"} model4<-lm(MyData$y ~ poly(MyData$x,p=p0,raw=TRUE)) stargazer(model4, covariate.labels=c(as.character(seq(from=1,to=4,by=1)),"Intercept"), omit.stat=c("ll","f","ser","n","adj.rsq") #,type="html",notes.append = FALSE,notes = c("^⋆p<0.1; ^⋆⋆p<0.05; ^⋆⋆⋆p<0.01") ,type="text" ) ```
* All estimates are significant. * All estimates are reasonably close to one (the true value) Now let's increase the number of regressors (polynomial terms) beyond what was used to generate the data: `p=10` ```{r, results="asis"} model10<-lm(MyData$y ~ poly(MyData$x,p=10,raw=TRUE)) stargazer(model10, covariate.labels=c(as.character(seq(from=1,to=10,by=1)),"Intercept"), omit.stat=c("ll","f","ser","n","adj.rsq") #,type="html",notes.append = FALSE,notes = c("^⋆p<0.1; ^⋆⋆p<0.05; ^⋆⋆⋆p<0.01") ,type="text" ) ```
* No significant estimates. * The estimates for the first four parameters can substantially deviate from the true values due to the large variances. * The variances (and the standard errors) are large because the extra regressors are correlated with the relevant regressors. Let's try a misspecified model: ```{r, results="asis"} model3<-lm(MyData$y ~ poly(MyData$x,p=3,raw=TRUE)) stargazer(model3, covariate.labels=c(as.character(seq(from=1,to=3,by=1)),"Intercept"), omit.stat=c("ll","f","ser","n","adj.rsq") #,type="html",notes.append = FALSE,notes = c("^⋆p<0.1; ^⋆⋆p<0.05; ^⋆⋆⋆p<0.01") ,type="text" ) ```
* Biased estimates due to the omitted variables # BIC - The R `BIC()` function is slightly differently defined than the one we used in the notes. - It works similarly to our version of BIC. ```{r} cat("BIC for: \n", '- correct model =', BIC(model4),'\n', '- overspecified model =', BIC(model10),'\n', '- under-specified model =', BIC(model3),'\n') ``` The custom function below runs all polynomial regression up to the `max_p` order, and picks the one with the lowest BIC value: ```{r} MySelect<-function(MyData,max_p){ Powers<-seq(from=1,to=max_p,by=1) myBICs<-foreach(j=Powers) %do% BIC(lm(MyData$y ~ poly(MyData$x,p=j,raw=TRUE))) p_hat<-which.min(myBICs) return(p_hat) } ``` Let's try BIC selection: ```{r} MySelect(MyData,max_p=10) ``` # Monter Carlo simulations for the accuracy of BIC in selecting the correct model Let's check how likely BIC to select the right model using MC simulations. - `n` is the sample size. - `p0` is the true degree. - `max_p` is the largest degree to try. - `R` is the number of MC repetitions. - `stateRNG` is to set the seed for reproducibility. - `.export=c("DGP","MySelect","MyBIC")` is to make sure that the user-defined objects are passed to the multiple cores. Note that while under some installations the code also runs without that option, in other cases excluding the option may result in errors "could not find function". ```{r} MC_BIC<-function(n,p0,max_p,R,seed){ Success<-foreach(r=1:R, .combine='rbind',.export=c("DGP","MySelect","MyBIC")) %do% { MyData<-DGP(n,p0=p0) p_hat<-MySelect(MyData,max_p=max_p) (p_hat==p0) } cat("With n=",n, "the probability of BIC selecting the true model is",mean(Success)) } ``` ```{r} MC_BIC(n=30,p0=4,max_p=10,R=10^2,seed=42) ``` ```{r} MC_BIC(n=100,p0=4,max_p=10,R=10^2,seed=42) ``` # What if some of the coefficients are small? We re-define the function that generates the coefficients: - The coefficient on the term of order $p_0$ is now **small**: ```{r} DGP<-function(n,p0){ X<-rnorm(n) U<-rnorm(n) Y<-U+0.01*X^p0 for (j in 1:p0-1){ Y<-Y+X^j } return(list(y=Y,x=X)) } ``` ```{r} MC_BIC(n=100,p0=4,max_p=10,R=10^2,seed=42) ``` - Now, the probability of selecting the true order $p_0$ is much smaller. - BIC thinks the $p_0$ term is irrelevant. ---

Vadim Marmer