11.4 Stepwise selection

The basic idea behind stepwise model selection is that we wish to create and test models in a variable-by-variable manner until only “important” (say “well supported”) variables are left in the model. The support for each variable is evaluated in turn relative to some pre-determined criterion and an arbitrary (or not) starting point.

While convenient, this approach has some well-known pitfalls. First, this generally is not a useful way to construct biological hypotheses for experiments or for observational studies. Second, it is easy to miss out on important relationships that are not considered because of the automated inclusion or exclusion of ‘significant’ explanatory variables and the order in which they are entered or dropped. For example, in most readily accessible applications, this tool also does not include interaction terms that might be of biological interest by default. Therefore, regardless of the method used, careful thought is warranted regarding the variables included and their potential mathematical combinations. For most purely predictive situations better tools now exist since the advent of machine learning algorithms.

11.4.1 Forward selection

We start by making a “null” model that includes no explanatory variables. This model is simply a least-squares estimate of the mean. If we think about it in terms of a linear model, the only parameter is the intercept $Y = \beta_0$ , so the estimate of (Intercept) that R reports from the model is simply the mean of y in the data. Let’s demonstrate with the swiss data.

We write the null model like this.

data(swiss)
null <- lm(Fertility ~ 1, data = swiss)

Mathemetically, the 1 just tells R to make a model matrix with a single column of 1s called (Intercept). Have a look:

head(model.matrix(null))

##              (Intercept)
## Courtelary             1
## Delemont               1
## Franches-Mnt           1
## Moutier                1
## Neuveville             1
## Porrentruy             1

Now that we have a null model, we need to make a full model. The full model is the model that includes all the variables we want to consider in different combinations. In phylogenetics, these would be different trees that consider varying numbers of splits and different groupings. We can write out the formula for the full model by hand in the lm() function, or we can use . to tell R that we want it to consider additive combinations of all columns other than Fertility.

full <- lm(Fertility ~ ., data = swiss)

Now we perform the forward selection using the step() function. Watch them fly by in real time! Here we are telling R to start with the null model we created above using object = null, but we could actually specify any other model between the null and full if we wanted to. Next, we tell R that the scope of models to consider should include all combinations of explanatory variables (Education, Catholic, Infant.Mortality, and Agriculture), including none of them (null) and all of them (full). Then, we tell R what direction to build models in, either forward, backward, or both.

step(object = null, scope = list(lower = null, upper = full), direction = "forward")

## Start:  AIC=238.35
## Fertility ~ 1
## 
##                    Df Sum of Sq    RSS    AIC
## + Education         1    3162.7 4015.2 213.04
## + Examination       1    2994.4 4183.6 214.97
## + Catholic          1    1543.3 5634.7 228.97
## + Infant.Mortality  1    1245.5 5932.4 231.39
## + Agriculture       1     894.8 6283.1 234.09
## <none>                          7178.0 238.34
## 
## Step:  AIC=213.04
## Fertility ~ Education
## 
##                    Df Sum of Sq    RSS    AIC
## + Catholic          1    961.07 3054.2 202.18
## + Infant.Mortality  1    891.25 3124.0 203.25
## + Examination       1    465.63 3549.6 209.25
## <none>                          4015.2 213.04
## + Agriculture       1     61.97 3953.3 214.31
## 
## Step:  AIC=202.18
## Fertility ~ Education + Catholic
## 
##                    Df Sum of Sq    RSS    AIC
## + Infant.Mortality  1    631.92 2422.2 193.29
## + Agriculture       1    486.28 2567.9 196.03
## <none>                          3054.2 202.18
## + Examination       1      2.46 3051.7 204.15
## 
## Step:  AIC=193.29
## Fertility ~ Education + Catholic + Infant.Mortality
## 
##               Df Sum of Sq    RSS    AIC
## + Agriculture  1   264.176 2158.1 189.86
## <none>                     2422.2 193.29
## + Examination  1     9.486 2412.8 195.10
## 
## Step:  AIC=189.86
## Fertility ~ Education + Catholic + Infant.Mortality + Agriculture
## 
##               Df Sum of Sq    RSS    AIC
## <none>                     2158.1 189.86
## + Examination  1    53.027 2105.0 190.69

## 
## Call:
## lm(formula = Fertility ~ Education + Catholic + Infant.Mortality + 
##     Agriculture, data = swiss)
## 
## Coefficients:
##      (Intercept)         Education          Catholic  Infant.Mortality  
##          62.1013           -0.9803            0.1247            1.0784  
##      Agriculture  
##          -0.1546

Here, we see that the best model is that which includes the additive effects of Education, Catholic, Infant.Mortality, and Agriculture, or our full model. Go ahead and try it with a different starting object or direction to see if this changes the result.