18.4 Zero inflation

The fits of these two models, in reality, suggest the need to for what is becoming an increasingly common statistical tool: the zero inflated count model. Zero inflation (excess zeroes in count data) can arise by one of two mechanisms: true (“process”) zeros and observational zeros that result from imperfect detection. There are a number of ways to handle these types of data.

One approach to dealing with this is to use a hurdle model. The idea is to make two separate models: 1) a logistic regression model to help us determine which factors influence whether the phenomenon of interest even occurred (0 or 1), and 2) a count model to help us determine what factors influence with the frequency of occurrence given that it occurred in the first place.

When these models are linked mathematically, we call it a “mixture model” - an approach that has become very popular for accounting for imperfect detection when estimating abundance of organisms. For now, let’s just look at the hurdle model for our crab data as the n-mixture approach falls solidly in the realm of “advanced” methods we’ll not discuss in this decidedly Worst statistics book.

18.4.1 Step 1: Presence-absence

First, we need to make a binary indicator variable to represent whether or not any satellites were present:

# Make a new column for count
# and absence (of satellite males)
# and initialize to zero
crabs$present <- 0

# Assign a '1' if any satellites were
# observed
crabs$present[crabs$satellites > 0] <- 1

Now, the first step in the hurdle model is to fit a logistic regression model to predict how our response is affected by some combination of explanatory variables.

hurdle_step1 <- stan_glm(
  present ~ mass, 
  data = crabs, 
  family = binomial())

Let’s have a look at the estimated coefficient for mass from this model:

coeffs <- data.frame(hurdle_step1)

ggplot(coeffs, aes(x = mass)) + geom_histogram(bins = 50)

Here we see that mass has a significant effect on whether or not any satellite males are present because zero is not included within the credible range of estimates. You could imagine fitting any number of plausible biological models for comparison using LOO-IC or elpd_loo at this point.

18.4.2 Step 2: Counts given presence

Step 2 is to fit a count model to explain the effects of some combination of explanatory variables on the frequency with which the phenomenon occurs given that it ever occurred in the first place. Note: This does not have to be the same combination of explanatory variables. In fact, it is always conceivable that different processes influence these two distinct phenomena. As with the count-absence model, you could even fit a candidate set of models and proceed with model comparisons using LOO-IC.

crabs2 <- crabs[crabs$satellites != 0, ]

# Make a model relating the number
# of satellite males to the mass
# of female crabs
hurdle_step2 <- stan_glm(
  satellites ~ mass,
  data = crabs2,
  family = neg_binomial_2()
)

From these results, we can see that our count models in the previous sections were really just picking up on the large number of zeroes in our data set. We know this because of the differences in the results between the models hurdle_step1 and hurdle_step2.

Likewise, we can take another look at our model diagnostics for hurdle_step2 to see if our diagnostic plots look more reasonable now.

resids <- data.frame(.fitted = hurdle_step2$fitted.values, 
                     .resid = hurdle_step2$residuals)

Now, we can plot them:

ggplot(resids, aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  xlab("Fitted values") +
  ylab(expression(paste(epsilon)))

These residuals still aren’t perfect, but they are a lot better than they were before. For now, we will go ahead and make some predictive plots to see how we did.

18.4.3 Predictions

Now that we are finally happy with our residual plots (wow, that took a lot longer than fitting any of the models!) we can make a plot of our predictions against the raw data to see how we did.

Let’s start with the count/absence component that we fit in hurdle_step1:

Remember, that hurdle_step1 is just a logistic regression model, so our predictions will follow the same process as we did in Chapter 12.4. We’ll use the crabs data to make and plot predictions for this one since we had presence-absence info for each crab.

# Make predictions on the logit scale
logit_preds <- data.frame( predict(hurdle_step1, se.fit = TRUE) )

# Get 95% confidence intervals
logit_preds$lwr <- logit_preds$fit + logit_preds$se.fit * qnorm(0.025)
logit_preds$upr <- logit_preds$fit + logit_preds$se.fit * qnorm(0.975)

# Invert the logit-link function
logit_preds <- apply(logit_preds, 2, invlogit)

# Combine the new masses and the predictions with which
# they are associated
pres_preds <- data.frame(crabs, logit_preds)

Important: Our hurdle model actually contains two models (hurdle_step1 and hurdle_step2). The hurdle_step1 component is actually logistic regression and therefore uses the logit link function that we introduced in Chapter 12, so we need to invert the logit to get the probability of a female having any satellite males as a function of mass. This is done in the code above. Make sure you understand how and why we do this!

Once you’ve got that down, it’s all over but the plotting. Here is how predicted probability of satellite male crab count changes across the range of observed female mass:

ggplot(pres_preds, aes(x = mass, y = fit)) +
  geom_line() +
  geom_ribbon(aes(ymin = lwr, ymax = upr, color = NULL), alpha = .3) +
  scale_y_continuous(limits = c(0, 1)) +
  xlab("Mass of female crab (g)") +
  ylab("Probability of satellite male")

Finally, we can make a plot of the number of satellite males we would expect to see on a female crab given that she had attracted any males in the first place.

Also important: We need to remember here that we have two different models. The first model hurdle_step1 was a binary logistic regression, so it used the logit link. The second model hurdle_step2 was a count model and used the log link. That means we need to invert the log link to get our predicted counts back on the real scale.

We will combine our predictions with the crabs2 dataframe, which only contains observations for which there was at least one satellite.

# Make predictions using step2 model and the crabs2 dataframe
count_preds <- data.frame(predict(hurdle_step2, se.fit = TRUE))

# Get 95% confidence intervals
count_preds$lwr <- count_preds$fit + count_preds$se.fit * qnorm(0.025)
count_preds$upr <- count_preds$fit + count_preds$se.fit * qnorm(0.975)

# Invert the log link function
count_preds <- apply(count_preds, 2, exp)

# Combine the new masses and the predictions with which
# they are associated - overwriting on the fly - yuck!!
count_preds <- data.frame(crabs2, count_preds)

Here is a plot showing how the number of satellite males observed changes with the mass of female horseshoe crabs:

ggplot(count_preds, aes(x = mass, y = satellites)) +
  geom_jitter() +
  geom_line(aes(y = fit)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr, color = NULL), alpha = .3) +
  xlab("Mass of female crab (kg)") +
  ylab("Number of satellite males")

Well, it doesn’t exactly inspire great confidence in the biological relationship between female horseshoe crab mass and the number of satellite males she attracts, but that is exactly why it is so important to communicate these effects