suppressMessages({
library(foreach)
library(stargazer)
})
We illustrate the use of the BIC model selection procedures using simulated data from a polynomial regression model.
Consider a polynomial regression model:
\[\begin{align} Y_i & = \beta_0+\beta_1 X_{i}+\beta_2 X_i^2+\ldots+\beta_{p_0} X_i^{p_0}+U_i,\\ U_i &\sim N(0,1),\\ X_i &\sim N(0,1). \end{align}\]
Let’s define a custom R function to generate the data:
DGP<-function(n,p0){
#p0 = the number of polynomial terms
#n = the number of observations
X<-rnorm(n)
U<-rnorm(n)
Y<-U
for (j in 1:p0){
Y<-Y+X^j
}
return(list(y=Y,x=X))
}
Let’s simulate some data. To make the results reproducible, we will use a specified state/seed of the random number generator. Each time you run the code, the same sequence of ovservations is simulated.
n=30
p0=4
set.seed(42)
MyData<-DGP(n=n,p0=p0)
We can obtain the OLS estimator for the polynomial regression using
the function poly(x,p,raw=TRUE)
, where:
x
is the data vector used to construct the polynomial
terms.p
is the degree of the polynomial.raw=TRUE
instructs R to not use
orthogonal polynomial terms.Below, we estimate the model with the true number of the polynomial
terms p=p0
:
model4<-lm(MyData$y ~ poly(MyData$x,p=p0,raw=TRUE))
stargazer(model4,
covariate.labels=c(as.character(seq(from=1,to=4,by=1)),"Intercept"),
omit.stat=c("ll","f","ser","n","adj.rsq")
,type="html",notes.append = FALSE,notes = c("<sup>⋆</sup>p<0.1; <sup>⋆⋆</sup>p<0.05; <sup>⋆⋆⋆</sup>p<0.01")
#,type="text"
)
Dependent variable: | |
y | |
1 | 1.117*** |
(0.388) | |
2 | 1.023*** |
(0.347) | |
3 | 0.922*** |
(0.096) | |
4 | 0.989*** |
(0.063) | |
Intercept | -0.119 |
(0.288) | |
R2 | 0.993 |
Note: | ⋆p<0.1; ⋆⋆p<0.05; ⋆⋆⋆p<0.01 |
Now let’s increase the number of regressors (polynomial terms) beyond
what was used to generate the data: p=10
model10<-lm(MyData$y ~ poly(MyData$x,p=10,raw=TRUE))
stargazer(model10,
covariate.labels=c(as.character(seq(from=1,to=10,by=1)),"Intercept"),
omit.stat=c("ll","f","ser","n","adj.rsq")
,type="html",notes.append = FALSE,notes = c("<sup>⋆</sup>p<0.1; <sup>⋆⋆</sup>p<0.05; <sup>⋆⋆⋆</sup>p<0.01")
#,type="text"
)
Dependent variable: | |
y | |
1 | 0.437 |
(1.203) | |
2 | 3.453 |
(3.131) | |
3 | 0.283 |
(3.387) | |
4 | -0.863 |
(3.957) | |
5 | 0.998 |
(2.619) | |
6 | 0.684 |
(1.944) | |
7 | -0.327 |
(0.722) | |
8 | -0.146 |
(0.410) | |
9 | 0.032 |
(0.065) | |
10 | 0.013 |
(0.031) | |
Intercept | -0.510 |
(0.447) | |
R2 | 0.995 |
Note: | ⋆p<0.1; ⋆⋆p<0.05; ⋆⋆⋆p<0.01 |
Let’s try a misspecified model:
model3<-lm(MyData$y ~ poly(MyData$x,p=3,raw=TRUE))
stargazer(model3,
covariate.labels=c(as.character(seq(from=1,to=3,by=1)),"Intercept"),
omit.stat=c("ll","f","ser","n","adj.rsq")
,type="html",notes.append = FALSE,notes = c("<sup>⋆</sup>p<0.1; <sup>⋆⋆</sup>p<0.05; <sup>⋆⋆⋆</sup>p<0.01")
#,type="text"
)
Dependent variable: | |
y | |
1 | 1.729 |
(1.255) | |
2 | 6.194*** |
(0.373) | |
3 | 0.386 |
(0.292) | |
Intercept | -2.156** |
(0.837) | |
R2 | 0.927 |
Note: | ⋆p<0.1; ⋆⋆p<0.05; ⋆⋆⋆p<0.01 |
BIC()
function is slightly differently defined
than the one we used in the notes.cat("BIC for: \n",
'- correct model =', BIC(model4),'\n',
'- overspecified model =', BIC(model10),'\n',
'- under-specified model =', BIC(model3),'\n')
## BIC for:
## - correct model = 105.3882
## - overspecified model = 117.5081
## - under-specified model = 173.8286
The custom function below runs all polynomial regression up to the
max_p
order, and picks the one with the lowest BIC
value:
MySelect<-function(MyData,max_p){
Powers<-seq(from=1,to=max_p,by=1)
myBICs<-foreach(j=Powers) %do% BIC(lm(MyData$y ~ poly(MyData$x,p=j,raw=TRUE)))
p_hat<-which.min(myBICs)
return(p_hat)
}
Let’s try BIC selection:
MySelect(MyData,max_p=10)
## [1] 4
Let’s check how likely BIC to select the right model using MC simulations.
n
is the sample size.p0
is the true degree.max_p
is the largest degree to try.R
is the number of MC repetitions.stateRNG
is to set the seed for reproducibility..export=c("DGP","MySelect","MyBIC")
is to make sure
that the user-defined objects are passed to the multiple cores. Note
that while under some installations the code also runs without that
option, in other cases excluding the option may result in errors “could
not find function”.MC_BIC<-function(n,p0,max_p,R,seed){
Success<-foreach(r=1:R, .combine='rbind',.export=c("DGP","MySelect","MyBIC")) %do% {
MyData<-DGP(n,p0=p0)
p_hat<-MySelect(MyData,max_p=max_p)
(p_hat==p0)
}
cat("With n=",n, "the probability of BIC selecting the true model is",mean(Success))
}
MC_BIC(n=30,p0=4,max_p=10,R=10^2,seed=42)
## With n= 30 the probability of BIC selecting the true model is 0.73
MC_BIC(n=100,p0=4,max_p=10,R=10^2,seed=42)
## With n= 100 the probability of BIC selecting the true model is 0.94
We re-define the function that generates the coefficients:
DGP<-function(n,p0){
X<-rnorm(n)
U<-rnorm(n)
Y<-U+0.01*X^p0
for (j in 1:p0-1){
Y<-Y+X^j
}
return(list(y=Y,x=X))
}
MC_BIC(n=100,p0=4,max_p=10,R=10^2,seed=42)
## With n= 100 the probability of BIC selecting the true model is 0