Linear models

Introduction

The objective of this assignment is to move beyond simple null hypothesis significance tests and get you thinking about how you can approach real-world problems mathematically. To do this, we introduced a class of tools known as linear models in Chapter 7 and Chapter 8. The simplest of these tools are the linear regression model, and the analysis of variance (ANOVA), but we learned that we can also combine them in the general linear model.

In addition to introducing the concepts of linear models, we also will begin to build your statistical vocabulary and start using the kind of language that you see in peer-reviewed journal articles for framing the analysis in the methods section and reporting the results of statistical tests. We will ease into some of the more common vernacular this week and continue to develop this skill for the rest of the semester.

By the end of this homework, you should be comfortable running and interpreting parametric linear models (regression, ANOVA, ANCOVA) and be starting to think about how we test and compare multiple hypotheses within this framework.

We’ll work with the smolt.txt data from the class data this week, and we’ll continue to work with packages from the tidyverse. Go ahead and load both of those before you get started.

Exercises

Analysis of variance (ANOVA)

We’ll start this week with analysis of variance (ANOVA). ANOVA is an extension of the t-tests we worked with last week. Last week, we tested hypotheses about differences in the mean of some response between two groups. ANOVA let’s us extend that to include cases in which we might want to test for differences in more than two groups. As we will see, this allows for comparisons not only between more than two groups (e.g. through one-way ANOVA), but also allows us to look across multiple grouping factors in a single test (e.g. two-way ANOVA).

We will use the simple case of the one-way ANOVA to practice our newly acquired skills in n-way ANOVA. Begin by reading in the smolt data from last week. See Homework 4 for a description of the data.

# Read in the smolt data
smolts <- read.csv("smolts.txt")

# Look at the first few lines for a sanity check
head(smolts)

##      ID     stage  fl  mass   nka osmolality      pref max n.switch
## 1 31B1F     Smolt 216  97.5 6.667        318 0.3638229  83       93
## 2 31B20     Smolt 184  63.0 4.444        306 0.3922490  47      107
## 3 31B27  Presmolt 183  61.1 1.429        418 0.3378025  45       80
## 4 31B29 Postsmolt 204  76.4 3.899        356 0.4551466  76      123
## 5 31B2A  Presmolt 218 117.0 3.623        419 0.4337464  62      107
## 6 31B2E Postsmolt 199  70.8 2.151        355 0.4828588  59      100

Use a one-way ANOVA to investigate differences in nka activity between stage of smolts (Presmolt, Postsmolt, and Smolt).

Question 1. State the purpose of the analysis (i.e. to detect the effects of “blank” on “other blank”), and formally state the null hypothesis for this analysis.

Question 2. Explain the results of the test in sentence format, being sure to include the test statistic, the numerator and denominator degrees of freedom, and the p-value along with the test that you used. Also report the R² value for this model (recall you can get this multiple ways): see Chapter 4 or Chapter 7 for examples of how to do this. You should be able to do this in 1-2 sentences. See the ANOVA example for help putting this into words, but give it your own flare.

Now, use a Tukey HSD test to determine factor-level differences in nka activity between Presmolt, Smolt, and Postsmolt.

Reminder: if you have forgotten how to conduct the Tukey HSD for pairwise comparisons of factor-level means, you can find this in Chapter 7.

Linear regression

Moving on…Now let’s consider the case for which we have a continuous response variable and a continuous explanatory variable. Linear regression is going to be our tool of choice for this job!

Remember, to use linear regression in R, we are still going to work with the lm function because ANOVA and linear regression are the same thing as far as we are concerned, and they are both linear models of the form: \(Y_i = \beta_0 + \beta_i*X_i + \varepsilon_i\) for which we are (for now) assuming that \(\varepsilon_i\) is an error term normally distributed with \(\mu = 0\) and \(\sigma^2 = 1\).

Let’s look at the relationship between nka activity and plasma osmolality of the fish used in this study. Use the lm function to test the effects of gill nka on plasma osmolality following exposure to salt water.

Question 5. State the null hypothesis for the test and report the Type-I error rate (\(\alpha\)) in sentence form.

Question 6. Is there a significant effect of nka activity on plasma osmolality following exposure to saltwater? Remember, you need to use the summary() function to answer this question. What is the direction of this relationship (positive or negative), and what response in plasma osmolality would you expect to see with increasing nka activity?

Question 7. How much of the variability in plasma osmolality does nka activity explain (think R²)?

Question 8. Plot the predictions from the regression model against the raw data. Remember there are examples of how to do this in Chapter 7.2. Make it pretty (as if I even need to tell you that).

ANCOVA

Putting it all together now. Let us imagine a world in which we are not only interested in the effects of categorical OR continuous variables on our response, but rather the additive or combined effects of categorical and continuous variables on our response. Now we’re talking about useful stuff! To do this we will use the analysis of covariance (ANCOVA).

Remember, to use ANCOVA in R, we are still going to work with the lm function because ANOVA and linear regression and ANCOVA are the same thing here, and they are all linear models of the form: \(Y_i = \beta_0 + \beta_i*X_i + \varepsilon_i\) in which we are (for now) assuming that \(\varepsilon_i\) is normally distributed error with \(\mu = 0\) and \(\sigma^2 = 1\).

Awesome sauce.

With the lm function in R, use ANCOVA to test the main (additive) effects of stage and nka activity on plasma osmolality.

Now look at the results. Remember there are two different methods we need to use to understand the results of the ANCOVA because now we have both groups and continuous explanatory variables. Remember that the summary() function gives us the regression coefficients and an overall R² for the model, and the Anova() function will give us the anova table for the model that we can use to determine whether or not stage had a significant effect overall, but we need to specify the argument type = 'III' in the call. In order to use the Anova() function, you will need to install and load the car library:

# Install package
# install.packages('car')

# Load it
library(car)

# Check out the Anova function
?Anova

Sometimes people have problems installing car for one reason or another. If unable to install car, you can use anova() function from base R for this assignment, but you won’t be able to use type = III.

Question 9. What can you conclude about the effects of stage and nka on osmolality? Report this in sentence format as you did above, this time being sure to include the test name, test statistic, p-value, R² for the model. This may need a few sentences because we now have to report the results of the ANOVA-style component of the analysis.

Question 10. Describe how osmolality varied between stages (reporting mean and sd of osmolality for each stage in addition to differences between levels) as you did for Question 2. Then, describe how osmolality changed with nka activity in addition to stage (be sure to indicate the direction of the relationship using your regression coefficients - you can get this from the summary() of the model).