GLM: Count models

Introduction

This week we continued to work with the generalized linear model (GLM) and discussed why we might want to use it and how to interpret the results of these tools. The purpose of this lab is to get you comfortable with regression for count data and with some of the diagnostics we use for these models, while continuing to practice interpreting results with respect to the link functions we are working with. Next week, we will talk about how we can deal with things like repeated measures on the same categories and nested designs.

By the end of this lab, you should 1) be comfortable specifying generalized linear models for count data in R, 2) be able to apply this tool to analyze count data from start to finish, and 3) be able to diagnose and interpret the results of these models.

We’ll be working with functions from the tidyverse as usual and the MASS package for the negative.binomial family, so go ahead and load those up when you are ready to start!

library(tidyverse) # For everything we love
library(MASS) # For negative.binomial family
library(car) # For Anova()

Exercises

Let’s start by reading the file and taking a look at the structure of the data set that we’ll be working with this week.

# Read in the data
uw <- read.csv("data/weights.csv")

# Look at the data
head(uw)
  year region  country    sex number
1 1986 Africa  Senegal Female   15.3
2 1987 Africa     Mali Female   27.0
3 1987 Africa  Burundi Female   27.8
4 1988 Africa     Togo Female   19.7
5 1988 Africa Zimbabwe Female    7.2
6 1988 Africa   Uganda Female   19.1

These are data that I downloaded from the World Health Organization’s website free of charge. The data set contains 1099 observations of 5 variables. The variables are year, region, country, sex, and number of individuals per 100 that were underweight. These data are estimates, so they are presented as numeric values. We know that they actually represent counts per some unit (i.e. a rate) so we want to convert them to integer class before we start working with them.

# Now we need to make the counts into an integer because these are
# counts, and counts are whole numbers (integers)
uw$number <- as.integer(uw$number)

Poisson regression

Start by looking at a single region; we will use Africa in this case.

# Make a new dataframe that contains only those records for which
# the region is `Africa`
africa <-
  uw %>%
  filter(region == "Africa")

Let’s do some data exploration to look at our sampling distribution of number. Here, a quick histogram of your data should show you what you are looking for.
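
As a sketch of what that exploration might look like, here is one way to draw the histogram with ggplot2 (loaded as part of the tidyverse); the bin width is just a starting point you can adjust:

# Quick histogram of the counts for the African region
ggplot(africa, aes(x = number)) +
  geom_histogram(binwidth = 2) +
  xlab("Number underweight per 100") +
  ylab("Frequency")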

For thought: Do these data appear to conform to a Poisson distribution?

Now, let’s fit a model using number as the response and take a quick look at some regression diagnostics for the model. We will include the variable sex to see if there are differences between males and females. We will also include the variable year as a numeric variable to determine whether or not there is a linear increase or decrease in the number throughout the time series.

This is basically just ANCOVA, but with a count for y and a different sampling distribution!

# Fit the model, being sure to specify
# the appropriate error distribution and
# link function.
model <- glm(number ~ year + sex,
  data = africa,
  family = negative.binomial(theta = 1)
)

Before you look at the summary of the model, be sure to have a look at the regression diagnostics for the count model. Does it appear that you have violated the assumptions underlying the model that you have fit? Think about why or why not. This is not a homework question; it is simply part of doing the analysis the right way.
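
Your textbook walks through specific diagnostics; as a minimal sketch, the default plot() method for a fitted glm gives residuals vs. fitted values, a Q-Q plot of the residuals, a scale-location plot, and residuals vs. leverage:

# Default regression diagnostics for the fitted glm
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))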

Now that you have looked at the regression diagnostics, take a look at the model summary.

# Print the summary of the model
summary(model)

Call:
glm(formula = number ~ year + sex, family = negative.binomial(theta = 1), 
    data = africa)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.41262  -0.26729  -0.08145   0.25462   0.93405  

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 36.269089   6.045822   5.999 4.68e-09 ***
year        -0.016632   0.003018  -5.511 6.65e-08 ***
sexMale      0.142461   0.041978   3.394 0.000763 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(1) family taken to be 0.1587279)

    Null deviance: 72.861  on 377  degrees of freedom
Residual deviance: 66.292  on 375  degrees of freedom
AIC: 3065.2

Number of Fisher Scoring iterations: 4

You’ll note that we have included both categorical and numerical explanatory variables in this model. This makes the interpretation of the model slightly more complicated than some of the simpler models we have looked at so far.

If you try to predict the mean expected value for females (Intercept term) from these coefficients as we have done with previous exercises, we get huge values that don’t make much logical sense. The reason for this is the combination of categorical and numerical explanatory variables. Remember: the interpretation of a regression coefficient (\(\beta_j\)) is the effect of \(X_j\) on \(Y_i\) given that all other \(X\) are held constant. If we simply do:

exp(beta)

…then we are implicitly telling R that the variable year is being held constant at zero. Because the scale of year is large (1986-2015) relative to our counts, we get ridiculously large values if we try to predict the mean value of number for females without specifying a reasonable value for year. So, if we want to get a reasonable prediction for the mean expected value of number for females, we need a reasonable value of year. In most cases, unless you are predicting across the range of years, using the mean is fine (although not ideal if it is significant!):

# Predict mean values of female
female <- 36.269089 - 0.016632 * mean(africa$year)
female
[1] 2.950881

# Now, we need to use the inverse of the link function to get our prediction.
# Since the link function was 'log', we use:
exp(female)
[1] 19.12279

We see that the mean number of females per 100 in the African region that were underweight during the time period from 1986 through 2015 was about 19. What if we wanted to do this for males? Your turn:

Question 1. What is the mean number of males per 100 in the African region that were underweight during the time period under study? Use the model coefficients to determine this like I demonstrated above.

Plot the predictions from your model. Note that you may need to combine information from Chapter 13.4 and Chapter 10.4 to accomplish this. Remember: you essentially have a glm() count model that is structured like ANCOVA.
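
A minimal sketch of one way to do this, assuming you just want the fitted counts drawn over the raw data (the fit column name here is arbitrary); the chapters you are pointed to may also add confidence intervals:

# Predicted counts on the response scale for every observation
africa$fit <- predict(model, type = "response")

# Raw data with one fitted line per sex
ggplot(africa, aes(x = year, y = number, colour = sex)) +
  geom_point(alpha = 0.5) +
  geom_line(aes(y = fit)) +
  xlab("Year") +
  ylab("Number underweight per 100")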

Question 2. Using either your predicted values or the plots you have made, what is the mean difference between males and females during the time period under study? The difference is statistically significant…do you think it is biologically meaningful? Defend your answer.

Negative binomial regression

Now, we will take a step back and start working with the entire data set. Let’s take another quick look at it just to refresh our memories.

# Look at the first 10 rows of data
head(uw, 10)
   year region                     country    sex number
1  1986 Africa                     Senegal Female     15
2  1987 Africa                        Mali Female     27
3  1987 Africa                     Burundi Female     27
4  1988 Africa                        Togo Female     19
5  1988 Africa                    Zimbabwe Female      7
6  1988 Africa                      Uganda Female     19
7  1988 Africa                       Ghana Female     23
8  1990 Africa                     Nigeria Female     33
9  1990 Africa                  Mauritania Female     42
10 1991 Africa United Republic of Tanzania Female     23

# Let's find out how many unique regions we have in the data set
unique(uw$region)
[1] Africa                Americas              Eastern Mediterranean Europe                South-East Asia       Western Pacific      
Levels: Africa Americas Eastern Mediterranean Europe South-East Asia Western Pacific

Okay, so now instead of looking at just one region from the data set, we will look at all six to see if there are differences between regions. Let’s have another exploratory look at the data:
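
For example, a histogram of number faceted by region (a sketch using the same ggplot2 approach as before):

# Histogram of the counts, faceted by region
ggplot(uw, aes(x = number)) +
  geom_histogram(binwidth = 2) +
  facet_wrap(~ region) +
  xlab("Number underweight per 100") +
  ylab("Frequency")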

It’s pretty obvious here that no matter which way we slice it, these are decidedly not coming from a Poisson distribution. So, what do we do? This is GLM! We just need to use a different error distribution. From our histogram, it should be clear that we are now working with a negative binomial error structure.

Start by fitting a negative binomial regression model to the data that tests the effect of region (x) on number (y). This one is essentially an ANOVA model, but with counts and a different sampling distribution that we can fit with glm()!
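
A minimal sketch of that model, using glm.nb() from the MASS package (which estimates the dispersion parameter theta from the data); the object name nb_model is just a placeholder, and you could instead use glm() with the negative.binomial() family as we did above. You can then plot the residual diagnostics for nb_model the same way we did for the earlier model.

# Negative binomial model testing the effect of region on number
nb_model <- glm.nb(number ~ region, data = uw)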

Plot the residual diagnostics for this model following the approach in Chapter 13.3 of the textbook.

Question 3. Based on the residual diagnostics, does it look like this model provides a reasonable fit to our data? Are we in obvious violations of any of our assumptions?

Now that we have done some model validation, let’s have a look at the actual results of our model.

First, check the overall effect of region on number using the Anova() function from the car package:
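
Assuming your model object is called nb_model as in the sketch above:

# Likelihood-ratio test for the overall effect of region
Anova(nb_model)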

We can see from the output that there was a significant effect of region on the number of individuals per 100 that were underweight.

Now, predict the mean number of individuals that were underweight per 100 for each of the regions in this study using the approach demonstrated in Chapter 13.4.
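
One way to sketch this, again assuming the nb_model object from above (the textbook approach may also add confidence intervals on the link scale; new_dat is just a placeholder name):

# Predict the mean count for each region on the response scale
new_dat <- data.frame(region = unique(uw$region))
new_dat$fit <- predict(nb_model, newdata = new_dat, type = "response")
new_dat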

Question 4. Which region reported the greatest number of individuals that were underweight? Which region reported the fewest?

Now that you have your model results, make a boxplot or violin plot of number by region that shows your predictions over your raw data like we did in Chapter 10.4.1 when we were doing ANOVA (because these are the same things!).
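
As a sketch, assuming the new_dat predictions from above, you might overlay the predicted means on a boxplot of the raw counts:

# Boxplot of the raw counts by region with predicted means overlaid
ggplot(uw, aes(x = region, y = number)) +
  geom_boxplot() +
  geom_point(data = new_dat, aes(y = fit), colour = "red", size = 3) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))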

Model selection for count regression

Finally, we can use these data to examine how the number of underweight individuals changed during the time series in each region, and to determine whether there were differences between regions in how these numbers changed over time.

Make two models for this section. For the first model, consider only the additive effects of region and year (x variables) on number (y). In the second model, include the interaction effect between region and year on number.
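
A sketch of the two candidate models, again using glm.nb(); the object names m_add and m_int are just placeholders:

# Additive effects of region and year
m_add <- glm.nb(number ~ region + year, data = uw)

# Interaction between region and year (trends can differ by region)
m_int <- glm.nb(number ~ region * year, data = uw)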

Use the AIC function in base R to get the AIC score for each of the two models you have created.
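
Assuming the model names from the sketch above:

# Compare the two candidate models by AIC (lower is better)
AIC(m_add, m_int)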

Plot the predictions of the best model.
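
A sketch of one way to do this, using whichever model your AIC comparison favors (shown here with the hypothetical m_int; swap in m_add if that turns out to be your best model):

# Predicted counts for every observation from the best model
uw$fit <- predict(m_int, type = "response")

# Raw data with one fitted line per region
ggplot(uw, aes(x = year, y = number, colour = region)) +
  geom_point(alpha = 0.3) +
  geom_line(aes(y = fit)) +
  xlab("Year") +
  ylab("Number underweight per 100")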

Question 5. According to the AIC scores that you have extracted, is there unequivocal evidence to indicate that one of these models is better than the other? Does this make sense based on your plots?




This work is licensed under a Creative Commons Attribution 4.0 International License. Data are provided for educational purposes only unless otherwise noted.