15.2 Logistic regression

For our first example this week, we will use the choice data from a couple weeks ago that we used to demonstrate binomial logistic regression. This time, we will add in a random intercept term that will allow us to account for repeated observations within a year. This has two implications: 1) it accounts for the fact that the years in which we conducted this study are random samples from a larger, unobserved population, and 2) it accounts for the heterogeneity of variance that theoretically might occur as a result of taking multiple, and variable, numbers of measurements within a given year- thereby reducing the overall error of the model and our associated parameter estimates (in theory).

##   path year hatchery length mass date flow
## 1    0 2010        1    176   57  118  345
## 2    0 2005        1    205  101  128 1093
## 3    0 2010        1    180   56  118  345
## 4    0 2010        1    193   74  118  345
## 5    0 2005        1    189   76  128 1093
## 6    0 2010        1    180   65  118  345

15.2.1 Data Explanation

These data are from a study that examined factors affecting path choice by wild and hatchery-reared endangered Atlantic salmon smolts during seaward migration in the Penobscot River, Maine. State, local, and federal fishery managers were interested in understanding what factors affected migratory routing through the lower river because there were different numbers of dams, with different estimated smolt mortality rates, on either side of a large island hydropower project in this system. If managers could understand factors influencing migratory route, they might be able to manipulate flows, stocking dates, and dam operation to improve survival of these endangered fish. Furthermore, the results of the study were used to predict the effects of dam removal, and hydropower re-allocation in the lower river on population-level consequences for these fish. Please see the <08_glm_logisticRegression>logistic regression module for a complete explanation of the data.

15.2.2 Data analysis

We are going to use the 1/0 binary data to estimate the effects of a number of covariates of interest on the probability that an individual fish used the Stillwater Branch for migration in each year of this study using logistic regression. In order to do this, we will use the ‘logit’ link function, which can be defined as:

logit <- function(x) {
  log(x / (1 - x))
}

The inverse of the logit function is:

invlogit <- function(x) {
  exp(x) / (1 + exp(x))
}

Since we are not interested in the linear trend in the use of the Stillwater Branch through time, we need to convert year to factor. This is the same as if we wanted to use this as a fixed effect in the model as we did in Chapter 12.4 when we last worked with these data.

choice$year <- as.factor(choice$year)

Next, define a set of models based on a priori combinations of explanatory variables.

# First, make an empty list to hold the models
mods <- list()

# Now, fill the list with several a priori models
# Need to load the `lme4` package for the `glmer` function

library(lme4)
# Here is the list
mods[[1]] <- glmer(path ~ (1 | year) + hatchery + length + flow, family = binomial, data = choice)
mods[[2]] <- glmer(path ~ (1 | year) + flow, family = binomial, data = choice)
mods[[3]] <- glmer(path ~ (1 | year) + hatchery, family = binomial, data = choice)
mods[[4]] <- glmer(path ~ (1 | year) + length, family = binomial, data = choice)
mods[[5]] <- glmer(path ~ (1 | year) + length + hatchery, family = binomial, data = choice)
mods[[6]] <- glmer(path ~ (1 | year) + length + flow, family = binomial, data = choice)
mods[[7]] <- glmer(path ~ (1 | year) + hatchery + flow, family = binomial, data = choice)

Give the models some names using the formulas for each of the models. Remember: models are stored as list objects in R, and each of those list objects (models) has names. We can reference those names using the $ notation:

for (i in 1:length(mods)) {
  names(mods)[i] <- as.character(summary(mods[[i]])$call$formula)[3]
}

Now, we use the AICcmodavg package to make a model selection table like we did last week:

# Load the package and make the table
library(AICcmodavg)
modtable <- aictab(cand.set = mods, modnames = names(mods))

Here, we see the best model is the one that incorporates only flow, and that the addition of length or hatchery or both doesn’t really improve the fit of our model. Therefore, we’ll go ahead and make predictions from the simplest (best) model alone.

Let’s re-name that model really quick to make things easier to remember:

best_mod <- glmer(
  path ~ (1 | year) + flow,
  family = binomial, data = choice
)

15.2.3 Predictions

Finally, we can use these models to make predictions about the relationships in our models the same way we have done previously with linear models and GLMs.

# Load the merTools package
library(merTools)

# Simulate predictions from the relationship
# stored in the model fit using our original data
logit_preds <- predictInterval(
  merMod = best_mod,
  level = 0.95, n.sims = 10000,
  stat = "median", type = "linear.prediction"
)

# Convert them to the real scale for plotting
real_preds <- apply(logit_preds, 2, invlogit)

# Combine predictions with the original data
mer_preds <- data.frame(choice, real_preds)

# Plot the predictions
ggplot(mer_preds, aes(x = flow, y = fit, color = year, fill = year)) +
  geom_line(alpha = 0.10) +
  geom_ribbon(aes(ymin = lwr, ymax = upr, color = NULL), alpha = .3) +
  geom_line(aes(y = fit), lwd = 1, alpha = 0.50) +
  xlab("Discharge") +
  ylab("Probability of using Stillwater Branch")