13.2 Poisson regression

Poisson regression is useful for situations in which we have a response (dependent variable) that is a count. These are discrete data that cannot be treated as continuous because it is impossible for them to take on non-integer or negative values. Common examples of these types of responses include species count data in ecology, cell or colony counts in biology, and the number of respondents or patients reporting a side effect or symptom of interest in health care studies.

For the Poisson family, the link function that we will use is the “log” link function. This one is exactly what it sounds like: we model the log of the expected count as a linear function of our explanatory variables. This link guarantees that the predicted mean is non-negative, a desirable property when we are working with count data.
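To make that concrete, if \(\lambda_i\) is the expected count for observation \(i\) and \(x_i\) is its vector of explanatory variables (generic notation here, not anything specific to the crab example below), then the log link says

\[\log(\lambda_i) = x_i^\top \beta \quad \Longleftrightarrow \quad \lambda_i = e^{x_i^\top \beta},\]

so the predicted mean count is guaranteed to be positive no matter what values the coefficients take.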

Let’s use the crab data set to demonstrate the GLM with Poisson data. We will walk through this data set for both the Poisson and negative binomial examples, addressing some distributional assumptions and model fit along the way.
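The preview below comes from having a quick look at the data with head() and str(). A minimal sketch of that step is shown here, assuming the data are read from a CSV file (the file name is a placeholder, not the actual location of the data):

# Read in the crab data (file name is a placeholder)
crabs <- read.csv("crabs.csv")

# Preview the first few rows and the structure of the data
head(crabs)
str(crabs)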

##   color spine width mass satellites
## 1     2     3  28.3 3.05          8
## 2     3     3  26.0 2.60          4
## 3     3     3  25.6 2.15          0
## 4     4     2  21.0 1.85          0
## 5     2     3  29.0 3.00          1
## 6     1     2  25.0 2.30          3
## 'data.frame':    173 obs. of  5 variables:
##  $ color     : int  2 3 3 4 2 1 4 2 2 2 ...
##  $ spine     : int  3 3 3 2 3 2 3 3 1 3 ...
##  $ width     : num  28.3 26 25.6 21 29 25 26.2 24.9 25.7 27.5 ...
##  $ mass      : num  3.05 2.6 2.15 1.85 3 2.3 1.3 2.1 2 3.15 ...
##  $ satellites: int  8 4 0 0 1 3 0 0 8 6 ...

13.2.1 Data explanation

These data represent the number of satellite male horseshoe crabs per female (rows) in relation to a number of characteristics of the females. Here, our response of interest is satellites and the female characteristics are our explanatory (independent) variables. These include female color, spine condition, carapace width (cm), and mass (kg).

The full citation for the paper on which this data set is based is here:

H. J. Brockmann. 1996. Satellite male groups in horseshoe crabs, Limulus polyphemus. Ethology 102:1-21. doi:10.1111/j.1439-0310.1996.tb01099.x

We are going to convert color to a factor to start because it is currently stored as a numeric variable.

# We want to convert color to
# a factor right off the bat
crabs$color <- as.factor(crabs$color)
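If we want to confirm that the conversion worked, a quick optional check (not part of the original code) is to look at the class and the levels of the new factor:

# Confirm that color is now a factor and see which levels it has
class(crabs$color)
levels(crabs$color)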

Next, we’ll fit a “full” model that assumes the number of satellites is a function of width, mass, spine condition, and color.

# Fit a model
count_mod <- glm(
  satellites ~ width + mass + spine + color,
  data = crabs,
  family = "poisson"(link = log)
)

Before we go any further, let’s have a quick look at the model diagnostics using the methods we applied to linear models in Chapter 9. Right away, we can see that this model is not a very good fit to the data.

ggplot(count_mod, aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  xlab("Fitted values") +
  ylab(expression(paste(epsilon)))
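If it helps to see where the residuals ought to be centered, we could also add a dashed reference line at zero. This is just an optional tweak to the plot above, not part of the original figure:

ggplot(count_mod, aes(x = .fitted, y = .resid)) +
  geom_jitter() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("Fitted values") +
  ylab(expression(paste(epsilon)))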

This brings us to the next important point we need to make about GLMs…

Even though we are relaxing the assumptions of linear models, we still need to check to make sure the models we use are valid with respect to our assumptions.

This will become considerably more complicated as we begin to move into distributions other than the binomial and the Poisson, as our standard methods become less and less applicable and in-depth model validation becomes more obscure and more involved (but more important!).

So, what is going on here? At this point a few things should jump out at you from this plot. First, there is some kind of trend happening at the bottom of our plot where the residuals slant downward from left to right. That’s something that should scream “not good!” at you, even if you don’t know why. Second, this plot doesn’t really make it look like our residuals are symmetrically distributed with a mean of zero. But, it is really hard to tell from this graph, especially because of the weird patterns at the bottom of the scatter plot.

We can check this second concern with a histogram, which lets us see whether the residuals “look normal,” like this:

ggplot(count_mod, aes(x = .resid)) +
  geom_histogram(bins = 25) +
  xlab(expression(paste(epsilon)))
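Another common visual check, offered here as an alternative sketch rather than something used in the text, is a normal quantile-quantile plot of the residuals; points falling far from the straight line suggest departures from normality:

# QQ plot of the model residuals
ggplot(count_mod, aes(sample = .resid)) +
  stat_qq() +
  stat_qq_line() +
  xlab("Theoretical quantiles") +
  ylab("Sample quantiles")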

Now that we see it presented this way, it is pretty clear that our residuals are not symmetrically distributed around zero even if the mean is about zero. We could also calculate some descriptive statistics for the residuals to really nail this down.

mean(count_mod$residuals)
## [1] -0.01809651

Hmmm…the mean is fairly close to zero here, but we’ve already talked about the fact that the mean isn’t a very good descriptive statistic for distributions that we suspect or know are not normal. What about the median as a measure of central tendency?

median(count_mod$residuals)
## [1] -0.281012

Wow, okay, that is a bit more negative than the mean, and at this point we may start to question whether we can say that the residuals are normally distributed with a mean of zero. If we are uncomfortable coming right out and saying that the residuals are not normal, we could always use a Shapiro-Wilk normality test to check (see Chapter 6 if you’ve forgotten what this one is).

shapiro.test(count_mod$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  count_mod$residuals
## W = 0.82351, p-value = 3.419e-13

Remember that the null hypothesis here is that our residuals were drawn from a normal distribution, so a p-value of < 0.05 tells us to reject the null and now our residuals are officially not normal! Crap, wasn’t the whole point of using GLM supposed to be so we could keep meeting this assumption?

So, why has this happened to us? Recall that the Poisson distribution is controlled by a single parameter, \(\lambda\), that is both the mean and the variance of the distribution. If we had started by doing data exploration we would have, of course, noticed that even though the data represent counts, they are pretty clearly over-dispersed (variance is much larger than mean) and are indicative of a negative binomial sampling distribution.
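One quick way to see this is to compare the sample mean and variance of the raw counts; for a Poisson response these should be roughly equal, so a variance several times larger than the mean is a red flag:

# Compare the mean and variance of the response
mean(crabs$satellites)
var(crabs$satellites)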

For now, we won’t bother to look at the results from this model. The negative binomial model uses the same log link, so we can get its results in exactly the same way, and if our model violates its assumptions then the results would be unreliable anyway.
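Looking ahead, one common way to fit that negative binomial model in R is glm.nb() from the MASS package, which uses a log link by default. A minimal sketch with the same predictors is shown below; this is just an illustration of the idea, not necessarily the exact code used in the next section:

# Negative binomial GLM with a log link (MASS package)
library(MASS)
nb_mod <- glm.nb(satellites ~ width + mass + spine + color, data = crabs)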