12.3 Introducing the GLM

There are a number of situations that should just scream “GLM!!!” at you. Most of these are easy to identify because you will know right away that the response variable you are interested in is clearly not continuous or normally distributed. This is the number one reason most people move into the GLM framework. Such response variables include counts (integers), binary outcomes (1 or 0), categorical variables (“Jane”, “Bill”, “Tracy”), and even probabilities or proportions.

The standard GLM consists of three major components:

  1. A random variable (Y) that is our response of interest,

  2. Linear predictor(s) of Y, called X, and

  3. An invertible “link function” that maps the expectation of Y onto another scale, based on assumptions about the distributional family of Y.

The first two components are familiar to us. They are the exact same basic components of any regression formula that takes the following form:

\(Y_i = \beta_0 + \sum_{j} \beta_j \cdot X_{i,j}\),

or

\(Y = mX + b\),

if you prefer.
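As a quick refresher, the coefficients of that familiar single-predictor linear model can be found in closed form with ordinary least squares. A minimal sketch in Python with hypothetical data (illustrative only; in practice R does this for us with `lm()`):

```python
def ols_fit(x, y):
    """Closed-form ordinary least squares for Y = b0 + b1 * X."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # slope = covariance(x, y) / variance(x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
         sum((xi - xbar) ** 2 for xi in x)
    b0 = ybar - b1 * xbar  # intercept
    return b0, b1

# Hypothetical data lying exactly on the line y = 1 + 2x
x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]
b0, b1 = ols_fit(x, y)   # recovers b0 = 1.0, b1 = 2.0
```

The GLM keeps this same linear predictor; what changes, as described next, is what that predictor is connected to.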

So, this much should be familiar. The major change from the linear models with which we have been working is the addition of the invertible link function, and it is the component from which the GLM inherits its name. The link function is just a way for us to put the expectation of the response on the scale of the linear predictor, so that we can relax the assumptions of the linear model to accommodate new data types. In essence, it is very similar to the kinds of transformations that we talked about earlier in the semester, but it is applied during estimation rather than beforehand.
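To make this concrete, consider the logit link used for binary responses (previewed in Chapter 12.4 below): it maps a probability in (0, 1) onto the whole real line, and because it is invertible, the inverse link maps the linear predictor back to a probability. A minimal sketch in Python (illustrative only; R handles this inside `glm()`):

```python
import math

def logit(p):
    """Link function: map a probability in (0, 1) onto the real line."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Inverse link: map a value on the real line back to a probability."""
    return 1 / (1 + math.exp(-x))

# The link is invertible: applying the inverse recovers the probability.
p = 0.25
x = logit(p)                  # about -1.0986 (negative because p < 0.5)
print(round(inv_logit(x), 6))  # -> 0.25
```

Notice that no matter what value the linear predictor takes, the inverse link returns a valid probability, which is exactly why the GLM can accommodate response types that an ordinary linear model cannot.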

To solve for the coefficients (betas) of a GLM, we move fully into the realm of maximum likelihood, with which you are all undoubtedly still familiar thanks to your close reading of Chapter 5. A given link function is paired with the distribution that we assume for our data, and a likelihood for that distribution can be defined so that we can calculate the likelihood of the data given our parameter estimates, much as we did for the standard normal distribution earlier this semester. Within this framework, an algorithm tries different values (guesses) for the parameters one step at a time, searching for the values that maximize the likelihood of our data. Once the change in likelihood between steps becomes sufficiently small (i.e., the derivative of the log-likelihood with respect to the parameters is approximately zero), we accept that the algorithm has ‘converged’ on the optimal estimates for our model parameters (our \(\beta_j\)), and the algorithm stops. This all assumes that the parameters follow defined sampling distributions - you guessed it, the normal! You do not need to be able to do this by hand (thank goodness for R!), but you do need to understand what is going on so you can troubleshoot when R says that the model failed to converge…
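The guess-evaluate-update-converge loop above can be sketched for a toy case: an intercept-only Poisson GLM with a log link, fit by Newton-Raphson. (This is not how R's `glm()` is implemented internally, which uses iteratively reweighted least squares, but the logic of stepping until the change is small is the same.)

```python
import math

def fit_poisson_intercept(y, tol=1e-8, max_iter=25):
    """Toy maximum-likelihood fit of an intercept-only Poisson GLM.

    Model: Y ~ Poisson(mu), log(mu) = beta (log link, no predictors).
    """
    n = len(y)
    beta = 0.0  # starting guess for the intercept (on the log scale)
    for _ in range(max_iter):
        mu = math.exp(beta)          # inverse link: E[Y] = exp(beta)
        score = sum(y) - n * mu      # derivative of log-likelihood w.r.t. beta
        step = score / (n * mu)      # Newton step (score / negative Hessian)
        beta += step
        if abs(step) < tol:          # change sufficiently small -> converged
            return beta, True
    return beta, False               # this is when R warns about convergence

y = [2, 0, 3, 1, 4, 2]
beta_hat, converged = fit_poisson_intercept(y)
# For this model the MLE is log(mean(y)) = log(2.0), so beta_hat ~ 0.6931
```

If the algorithm runs out of iterations before the steps get small, it reports failure instead of an estimate, which is exactly the situation behind R's “failed to converge” warnings.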

Let’s take a look at a few variable types that we might consider to be common applications for GLM in biology and ecology. We will cover each of these in detail below; here is a list so you know what is coming:

  1. Binary response (Chapter 12.4)

  2. Count data (Poisson) (Chapter 13)

  3. Overdispersed count data (negative binomial, also Chapter 13)