11.4 Stepwise selection
The basic idea behind stepwise model selection is that we wish to create and test models in a variable-by-variable manner until only “important” (say “well supported”) variables are left in the model. The support for each variable is evaluated in turn relative to some pre-determined criterion and an arbitrary (or not) starting point.
While convenient, this approach has some well-known pitfalls. First, it is generally not a useful way to construct biological hypotheses for experiments or for observational studies. Second, it is easy to miss important relationships that are never considered because of the automated inclusion or exclusion of ‘significant’ explanatory variables and the order in which they are entered or dropped. For example, most readily accessible implementations do not, by default, consider interaction terms that might be of biological interest. Therefore, regardless of the method used, careful thought is warranted regarding the variables included and their potential mathematical combinations. For most purely predictive situations, better tools have existed since the advent of machine learning algorithms.
11.4.1 Forward selection
We start by making a “null” model that includes no explanatory variables. This model is simply a least-squares estimate of the mean. If we think about it in terms of a linear model, the only parameter is the intercept, \(Y = \beta_0\), so the estimate of (Intercept) that R reports from the model is simply the mean of y in the data. Let’s demonstrate with the swiss data.
We write the null model like this.
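A minimal sketch of that call, assuming the response is Fertility from R's built-in swiss data set and that the model object is named null (the name used later in this section):

```r
# Load the built-in swiss data set
data(swiss)

# Intercept-only ("null") model: no explanatory variables
null <- lm(Fertility ~ 1, data = swiss)

# The single coefficient, (Intercept), should match the sample mean of the response
coef(null)
mean(swiss$Fertility)
```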
Mathematically, the 1 just tells R to make a model matrix with a single column of 1s called (Intercept). Have a look:
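One way to print the first rows of that model matrix, assuming the intercept-only model is stored in an object named null:

```r
# Refit the intercept-only model so this snippet stands on its own
data(swiss)
null <- lm(Fertility ~ 1, data = swiss)

# First six rows of the design matrix: a single column of 1s
head(model.matrix(null))
```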
## (Intercept)
## Courtelary 1
## Delemont 1
## Franches-Mnt 1
## Moutier 1
## Neuveville 1
## Porrentruy 1
Now that we have a null model, we need to make a full model. The full model is the model that includes all the variables we want to consider in different combinations. In phylogenetics, these would be different trees that consider varying numbers of splits and different groupings. We can write out the formula for the full model by hand in the lm() function, or we can use . on the right-hand side of the formula to tell R that we want it to consider additive combinations of all columns other than Fertility.
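A minimal sketch of the full model using the dot shorthand, assuming the object name full (used later in this section) and the built-in swiss data. The name full_by_hand is hypothetical and only there to show the equivalent hand-written formula:

```r
data(swiss)

# The dot expands to every column of swiss other than the response
full <- lm(Fertility ~ ., data = swiss)

# Equivalent model with the formula written out by hand
full_by_hand <- lm(
  Fertility ~ Agriculture + Examination + Education + Catholic + Infant.Mortality,
  data = swiss
)

# Explanatory variables included in the full model
attr(terms(full), "term.labels")
```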
Now we perform the forward selection using the step() function. Watch them fly by in real time! Here we are telling R to start with the null model we created above using object = null, but we could actually specify any other model between the null and full if we wanted to. Next, we tell R that the scope of models to consider should include all combinations of explanatory variables (Education, Examination, Catholic, Infant.Mortality, and Agriculture), including none of them (null) and all of them (full). Then, we tell R what direction to build models in, either forward, backward, or both.
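A sketch of a forward-selection call consistent with this description, assuming the models are named null and full. The result name forward is hypothetical, and passing the scope as formulas extracted with formula() is one choice that follows the documented form of the scope argument:

```r
data(swiss)
null <- lm(Fertility ~ 1, data = swiss)
full <- lm(Fertility ~ ., data = swiss)

# Forward selection: start from the null model and consider adding
# one term at a time, up to the terms in the full model
forward <- step(
  object = null,
  scope = list(lower = formula(null), upper = formula(full)),
  direction = "forward"
)

# Print the selected model
forward
```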
## Start: AIC=238.35
## Fertility ~ 1
##
## Df Sum of Sq RSS AIC
## + Education 1 3162.7 4015.2 213.04
## + Examination 1 2994.4 4183.6 214.97
## + Catholic 1 1543.3 5634.7 228.97
## + Infant.Mortality 1 1245.5 5932.4 231.39
## + Agriculture 1 894.8 6283.1 234.09
## <none> 7178.0 238.34
##
## Step: AIC=213.04
## Fertility ~ Education
##
## Df Sum of Sq RSS AIC
## + Catholic 1 961.07 3054.2 202.18
## + Infant.Mortality 1 891.25 3124.0 203.25
## + Examination 1 465.63 3549.6 209.25
## <none> 4015.2 213.04
## + Agriculture 1 61.97 3953.3 214.31
##
## Step: AIC=202.18
## Fertility ~ Education + Catholic
##
## Df Sum of Sq RSS AIC
## + Infant.Mortality 1 631.92 2422.2 193.29
## + Agriculture 1 486.28 2567.9 196.03
## <none> 3054.2 202.18
## + Examination 1 2.46 3051.7 204.15
##
## Step: AIC=193.29
## Fertility ~ Education + Catholic + Infant.Mortality
##
## Df Sum of Sq RSS AIC
## + Agriculture 1 264.176 2158.1 189.86
## <none> 2422.2 193.29
## + Examination 1 9.486 2412.8 195.10
##
## Step: AIC=189.86
## Fertility ~ Education + Catholic + Infant.Mortality + Agriculture
##
## Df Sum of Sq RSS AIC
## <none> 2158.1 189.86
## + Examination 1 53.027 2105.0 190.69
##
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality +
## Agriculture, data = swiss)
##
## Coefficients:
## (Intercept) Education Catholic Infant.Mortality
## 62.1013 -0.9803 0.1247 1.0784
## Agriculture
## -0.1546
Here, we see that the best model is the one that includes the additive effects of Education, Catholic, Infant.Mortality, and Agriculture. This is our full model without Examination: adding Examination at the last step would have raised the AIC from 189.86 to 190.69, so the search stops. Go ahead and try it with a different starting object or direction to see if this changes the result.
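As one possible variation, the search can start from the full model and work backward. This is a minimal sketch, assuming the same swiss data; the object name backward is hypothetical:

```r
data(swiss)
full <- lm(Fertility ~ ., data = swiss)

# Backward selection: start from the full model and consider dropping
# one term at a time. trace = 0 suppresses the step-by-step printout.
backward <- step(object = full, direction = "backward", trace = 0)

# Terms retained by the backward search, for comparison with the forward result
attr(terms(backward), "term.labels")
```

Setting direction = "both" instead lets step() consider both adding and dropping a term at each step.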