suppressMessages({
library(foreach)
library(stargazer)
})

We illustrate the use of the BIC model selection procedures using simulated data from a polynomial regression model.

DGP

Consider a polynomial regression model:

\[\begin{align} Y_i & = \beta_0+\beta_1 X_{i}+\beta_2 X_i^2+\ldots+\beta_{p_0} X_i^{p_0}+U_i,\\ U_i &\sim N(0,1),\\ X_i &\sim N(0,1). \end{align}\]

Let’s define a custom R function to generate the data:

DGP<-function(n,p0){
  #p0 = the number of polynomial terms
  #n = the number of observations
  X<-rnorm(n)
  U<-rnorm(n)
  Y<-U
  for (j in 1:p0){
    Y<-Y+X^j
  }
  return(list(y=Y,x=X))
}

Let’s simulate some data. To make the results reproducible, we will use a specified state/seed of the random number generator. Each time you run the code, the same sequence of ovservations is simulated.

n=30
p0=4
set.seed(42)
MyData<-DGP(n=n,p0=p0)

Polynomial regression

We can obtain the OLS estimator for the polynomial regression using the function poly(x,p,raw=TRUE), where:

Below, we estimate the model with the true number of the polynomial terms p=p0:

model4<-lm(MyData$y ~ poly(MyData$x,p=p0,raw=TRUE))
stargazer(model4,
          covariate.labels=c(as.character(seq(from=1,to=4,by=1)),"Intercept"),
          omit.stat=c("ll","f","ser","n","adj.rsq")
          ,type="html",notes.append = FALSE,notes = c("<sup>&sstarf;</sup>p<0.1; <sup>&sstarf;&sstarf;</sup>p<0.05; <sup>&sstarf;&sstarf;&sstarf;</sup>p<0.01")
          #,type="text"
          )
Dependent variable:
y
1 1.117***
(0.388)
2 1.023***
(0.347)
3 0.922***
(0.096)
4 0.989***
(0.063)
Intercept -0.119
(0.288)
R2 0.993
Note: p<0.1; ⋆⋆p<0.05; ⋆⋆⋆p<0.01


Now let’s increase the number of regressors (polynomial terms) beyond what was used to generate the data: p=10

model10<-lm(MyData$y ~ poly(MyData$x,p=10,raw=TRUE))
stargazer(model10,
          covariate.labels=c(as.character(seq(from=1,to=10,by=1)),"Intercept"),
          omit.stat=c("ll","f","ser","n","adj.rsq")
          ,type="html",notes.append = FALSE,notes = c("<sup>&sstarf;</sup>p<0.1; <sup>&sstarf;&sstarf;</sup>p<0.05; <sup>&sstarf;&sstarf;&sstarf;</sup>p<0.01")
          #,type="text"
          )
Dependent variable:
y
1 0.437
(1.203)
2 3.453
(3.131)
3 0.283
(3.387)
4 -0.863
(3.957)
5 0.998
(2.619)
6 0.684
(1.944)
7 -0.327
(0.722)
8 -0.146
(0.410)
9 0.032
(0.065)
10 0.013
(0.031)
Intercept -0.510
(0.447)
R2 0.995
Note: p<0.1; ⋆⋆p<0.05; ⋆⋆⋆p<0.01


Let’s try a misspecified model:

model3<-lm(MyData$y ~ poly(MyData$x,p=3,raw=TRUE))
stargazer(model3,
          covariate.labels=c(as.character(seq(from=1,to=3,by=1)),"Intercept"),
          omit.stat=c("ll","f","ser","n","adj.rsq")
          ,type="html",notes.append = FALSE,notes = c("<sup>&sstarf;</sup>p<0.1; <sup>&sstarf;&sstarf;</sup>p<0.05; <sup>&sstarf;&sstarf;&sstarf;</sup>p<0.01")
          #,type="text"
          )
Dependent variable:
y
1 1.729
(1.255)
2 6.194***
(0.373)
3 0.386
(0.292)
Intercept -2.156**
(0.837)
R2 0.927
Note: p<0.1; ⋆⋆p<0.05; ⋆⋆⋆p<0.01


BIC

cat("BIC for: \n",
   '- correct model         =', BIC(model4),'\n', 
   '- overspecified model   =', BIC(model10),'\n',
   '- under-specified model =', BIC(model3),'\n')
## BIC for: 
##  - correct model         = 105.3882 
##  - overspecified model   = 117.5081 
##  - under-specified model = 173.8286

The custom function below runs all polynomial regression up to the max_p order, and picks the one with the lowest BIC value:

MySelect<-function(MyData,max_p){
  Powers<-seq(from=1,to=max_p,by=1)
  myBICs<-foreach(j=Powers) %do% BIC(lm(MyData$y ~ poly(MyData$x,p=j,raw=TRUE)))
    p_hat<-which.min(myBICs)
  return(p_hat)
}

Let’s try BIC selection:

MySelect(MyData,max_p=10)
## [1] 4

Monter Carlo simulations for the accuracy of BIC in selecting the correct model

Let’s check how likely BIC to select the right model using MC simulations.

MC_BIC<-function(n,p0,max_p,R,seed){
    Success<-foreach(r=1:R, .combine='rbind',.export=c("DGP","MySelect","MyBIC"))  %do% {
      MyData<-DGP(n,p0=p0)
      p_hat<-MySelect(MyData,max_p=max_p)
      (p_hat==p0)
  }
  cat("With n=",n, "the probability of BIC selecting the true model is",mean(Success))
}
MC_BIC(n=30,p0=4,max_p=10,R=10^2,seed=42)
## With n= 30 the probability of BIC selecting the true model is 0.73
MC_BIC(n=100,p0=4,max_p=10,R=10^2,seed=42)
## With n= 100 the probability of BIC selecting the true model is 0.94

What if some of the coefficients are small?

We re-define the function that generates the coefficients:

DGP<-function(n,p0){
  X<-rnorm(n)
  U<-rnorm(n)
  Y<-U+0.01*X^p0
  for (j in 1:p0-1){
    Y<-Y+X^j
  }
  return(list(y=Y,x=X))
}
MC_BIC(n=100,p0=4,max_p=10,R=10^2,seed=42)
## With n= 100 the probability of BIC selecting the true model is 0