# Data manipulation and visualization

## Introduction

The objective of this module is to reinforce common tools that you learned about in the first half of the Introductory session and get you started developing skills in data manipulation and visualization. We will continue to build these skills and integrate them into statistical analyses in the Intermediate R workshop.

By the end of this module, you should be comfortable: 1) reading data into R, 2) conducting some basic data manipulation, 3) generating new variables, and 4) creating and modifying basic plots.

## Exercises

### Packages

We’re going to work with the dplyr, ggplot2, and magrittr packages for this module, so you can go ahead and put some code in your script to load them from the library where they were installed on your computer. If you installed them with the tidyverse, you can go ahead and load the whole thing all at once (I love this feature). Note that library(tidyverse) attaches dplyr and ggplot2 but not magrittr itself; the %>% pipe is still available because dplyr re-exports it.

library(tidyverse)
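If you did not install the tidyverse, or you just want to see which of the three packages are available on your computer, a quick check like the one below can help. This is only a sketch: `pkgs`, `installed`, and `not_installed` are names made up for illustration.

```r
# Packages used in this module
pkgs <- c("dplyr", "ggplot2", "magrittr")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of any packages that still need install.packages()
not_installed <- pkgs[!installed]
not_installed

# Attach the packages that are available, one at a time
for (p in pkgs[installed]) {
  library(p, character.only = TRUE)
}
```

If `not_installed` comes back empty, you are ready to go.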

### Data import

Read the data from physical.csv into R from the data directory in your workshop folder using the read.csv() function. If you saved your script in the workshop folder, then you can just run dir("data/") to see the file names and copy and paste from the output into your read.csv() function so you don’t mess up the name.

The header = TRUE argument tells R that our columns have names in the first row.

otsego <- read.csv(file = "data/physical.csv", header = TRUE)

Verify that this worked by checking your Environment tab in your RStudio session or running ls() in the console:

ls()
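If the path gives you trouble, it can help to run the same steps on a file you know exists. Here is a minimal sketch that writes a tiny made-up CSV to a temporary folder, lists it with dir(), and reads it back. The values are invented and the column name depth is an assumption for illustration; it does not touch the real physical.csv.

```r
# Make a throwaway folder and a tiny CSV with made-up values
tmp_dir <- file.path(tempdir(), "data")
dir.create(tmp_dir, showWarnings = FALSE)

toy <- data.frame(
  year = c(1990, 1990, 1991),
  depth = c(2, 40, 40),        # hypothetical column name
  do_mgl = c(9.8, 4.1, 3.7)
)
write.csv(toy, file.path(tmp_dir, "toy_physical.csv"), row.names = FALSE)

# List the files so the name can be copied exactly
dir(tmp_dir)

# Read it back; header = TRUE uses the first row as column names
toy_in <- read.csv(file = file.path(tmp_dir, "toy_physical.csv"), header = TRUE)

# Confirm the object now exists
exists("toy_in")
```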

### Data explanation

These are data collected each year from Otsego Lake by students, staff, and faculty at the SUNY Oneonta Biological Field Station in Cooperstown, NY, USA. The data set includes temperature (°C), pH, dissolved oxygen, and specific conductance measurements from a period of about 40 years. There are all kinds of cool spatial and seasonal patterns in the data that we can look at.

We will use do_mgl for the examples that follow. This variable is a measure of the amount of dissolved oxygen available in the water at each depth on each day of the data set. Dissolved oxygen is important for supporting aerobic organisms that can’t produce their own food through photosynthesis (e.g. aquatic bugs, clams, fish). It has been measured along with other water characteristics to help monitor changes in the lake due to watershed and in-lake management as well as introduced species or climate change.

One of the biggest changes to Otsego Lake in the last several decades was a reduction in the amount of dissolved oxygen in the deepest water (hypolimnion). A primary cause of this reduction was the introduction of an invasive fish, the alewife (Alosa pseudoharengus). Alewife are really good at eating the microscopic animals (zooplankton) that graze on algae. When they were introduced, the alewife population increased rapidly and basically ate all of the zooplankton that eat algae. This allowed algae to grow out of control during the summer. Once the algae die each year, they sink to the bottom of the lake, where they are decomposed by aerobic organisms that use up oxygen deep in the lake. The amount of oxygen available in the deep water is fixed from about June until December each year, when seasonal changes cause all of the water to mix together again. This means that the lowest oxygen levels were occurring in the deepest water around October or November each year.

Oxygen levels got so low that the deep, cold water needed by popular sport fishes such as lake trout (Salvelinus namaycush) did not have enough oxygen to support them. Therefore, the New York State Department of Environmental Conservation (NYSDEC), SUNY Oneonta Biological Field Station, Otsego Lake Association, and others began stocking walleye (Sander vitreus) to eliminate the alewife and reverse these changes. What a hot mess, huh?!

It turns out that these folks were actually successful in eliminating the alewife. Well, sort of. It is kind of a messy story thanks to the subsequent introduction of invasive zebra mussels (Dreissena polymorpha) in 2008. But, hey - that’s lake management!

For this exercise, we will weed through about 30 years of data to summarize changes in dissolved oxygen in Otsego Lake while all of these cool changes were happening. But, to get there we’ll need to do some data munging and some quality control to ensure that we know what we’re working with. These steps are pretty typical for most data sets you’ll run into, and are a critical part of experimental design and the statistical analyses that we’ll want to conduct later.

### First steps for data

I constantly forget about the fancy point-and-click tools that RStudio brings to the table, having learned much of my basic R syntax right before that revolution occurred. Make sure that you look at your data structure in your Environment tab or by using the str() function.

You can also use the built-in function summary() to get a closer look at your data.
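For reference, here is roughly what those checks look like. This sketch runs on a small made-up data frame (toy) so that it works without the workshop file; with the real data you would pass otsego instead. The values are invented, and the column name depth is an assumption.

```r
# Small stand-in data frame with one missing value
toy <- data.frame(
  year = c(1990, 1990, 1991, 1991),
  depth = c(2, 40, 2, 40),        # hypothetical column name
  do_mgl = c(9.8, 4.1, 10.2, NA)
)

str(toy)      # column types and the first few values of each column
summary(toy)  # min, quartiles, mean, max, and NA count for each column
head(toy, 2)  # first two rows
```

The NA count in the summary() output is an easy first quality-control check.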

Once you’ve taken a look at the data, let’s check out some specifics.

How many (unique) years are included in this data set? If you are stuck, look back at how we did this during the first half of the Introductory session. Or, you could Google it.

# Insert code for this here
# This is a reminder for me to have people
# do this, not a leftover reminder to put code here.
# Do it. I'll demonstrate, don't worry!
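In case it helps while you work on this, one common pattern is to pull the column out with $, reduce it to its unique values, and count them. The sketch below uses made-up years rather than the lake data, and toy is a hypothetical name.

```r
# Made-up years with repeats, standing in for the real year column
toy <- data.frame(year = c(1990, 1990, 1991, 1993, 1993))

# Keep one copy of each year
years <- unique(toy$year)
years

# Count how many distinct years there are
length(years)
```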

Next, calculate the mean dissolved oxygen (do_mgl) throughout the water column for each year. You can use the approach demonstrated in the first half for this.
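I can't reproduce the first-half approach here, so as a hedged alternative here is a sketch using base R's aggregate() on made-up values. The formula interface drops rows with missing values by default. If you prefer dplyr, group_by() followed by summarize() with mean(do_mgl, na.rm = TRUE) is the equivalent idea.

```r
# Made-up measurements: two per year, one of them missing
toy <- data.frame(
  year = c(1990, 1990, 1991, 1991),
  do_mgl = c(9.0, 5.0, 8.0, NA)
)

# Mean do_mgl for each year; rows with NA are omitted by default
annual_means <- aggregate(do_mgl ~ year, data = toy, FUN = mean)
annual_means
```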

Next, we’ll modify hypo$alewife to replace “present” with “absent” for all years after 2010 by treating it like an atomic vector!

hypo$alewife[hypo$year > 2010] <- "absent"

Now if we wanted to, we could calculate the mean do_mgl in years when alewife were present or absent. Go ahead and give this a try.

# Insert code for this here
# This is a reminder for me to have people
# do this, not a leftover reminder to put code here.
# Do it. I'll demonstrate, don't worry!

### Make it pretty

Okay, let’s wrap up the Introductory session by making a couple of nice plots to visually compare hypolimnetic oxygen (hypo$do_mgl) between years in which alewife was "present" or "absent".
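The steps that create hypo and its alewife column are not shown in this section, so before plotting it may help to see the recoding and the group means on a small stand-in. This is a sketch with made-up numbers, not the real lake data: hypo_toy, its values, and the starting assumption that every row is "present" are all invented for illustration.

```r
# Stand-in for hypo: every year starts out as "present"
hypo_toy <- data.frame(
  year = c(2005, 2008, 2011, 2014),
  do_mgl = c(1.5, 2.0, 5.5, 6.5),
  alewife = "present",
  stringsAsFactors = FALSE
)

# Logical index: TRUE where the year is after 2010
hypo_toy$year > 2010

# Use that index to overwrite only the matching elements
hypo_toy$alewife[hypo_toy$year > 2010] <- "absent"
table(hypo_toy$alewife)

# Mean do_mgl within each alewife group
group_means <- tapply(hypo_toy$do_mgl, hypo_toy$alewife, mean)
group_means
```

The same indexing pattern works on any column of a data frame, because each column is just a vector.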

Here is the basic plot to get you started:

ggplot(hypo, aes(x = alewife, y = do_mgl, fill = alewife)) +
  geom_boxplot()

You can view some of the cool options that you can change in a boxplot geometry by running ?geom_boxplot in the console. You can also find complete, built-in themes by running ?ggtheme in the console. Or, if you’d like to modify one of those or build your own, you can use the theme() function to change basically every aspect of the graph.

We’ll take some time to play with these options to wrap things up for the Introductory session.

Of course, these data could also be represented easily using a histogram:

ggplot(hypo, aes(x = do_mgl, fill = alewife)) +
  geom_histogram()
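By default, geom_histogram() uses 30 bins and stacks the groups on top of each other, which may or may not suit your data. Here is a sketch of two options you might adjust, using made-up values in a hypothetical hypo_toy data frame. The choice of binwidth = 1 is arbitrary.

```r
library(ggplot2)

# Made-up dissolved oxygen values for two groups
set.seed(1)
hypo_toy <- data.frame(
  alewife = rep(c("absent", "present"), each = 50),
  do_mgl = c(rnorm(50, mean = 6, sd = 1), rnorm(50, mean = 2, sd = 1))
)

# binwidth sets the bin size in data units (mg/l here);
# position = "identity" overlays the groups instead of stacking them
p_hist <- ggplot(hypo_toy, aes(x = do_mgl, fill = alewife)) +
  geom_histogram(binwidth = 1, position = "identity", alpha = 0.5)
```

Type p_hist in the console to draw the plot.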

If you prefer violin plots over boxplots (no shame in that game), you can basically copy-and-paste the boxplot code and just replace the geometry!

ggplot(hypo, aes(x = alewife, y = do_mgl, fill = alewife)) +
  geom_violin()

Any of the modifications you make to the overall plot will be more-or-less transferable across whichever geometry you choose (e.g. geom_boxplot(), geom_point(), geom_line(), geom_violin(), geom_histogram()). This means you can basically copy and paste all of your plotting code once you have a style with which you are happy.
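If copying and pasting gets tedious, one option is to store the geometry-independent pieces in a list and add that list to each plot. This is a sketch rather than the approach used in the rest of this module; hypo_toy, my_style, and base_plot are hypothetical names and the values are made up.

```r
library(ggplot2)

# Made-up data standing in for hypo
hypo_toy <- data.frame(
  alewife = rep(c("absent", "present"), each = 5),
  do_mgl = c(5.8, 6.1, 6.4, 5.5, 6.9, 1.2, 2.4, 1.9, 2.8, 1.6)
)

# Pieces that do not depend on the geometry, stored once
my_style <- list(
  ylab("Dissolved oxygen (mg/l)"),
  theme_bw(),
  theme(panel.grid = element_blank())
)

# Shared data and aesthetics
base_plot <- ggplot(hypo_toy, aes(x = alewife, y = do_mgl, fill = alewife))

# Same style, different geometries
p_box <- base_plot + geom_boxplot() + my_style
p_violin <- base_plot + geom_violin() + my_style
```

Changing my_style in one place then updates every plot that uses it.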

I will demonstrate some of these with the violin plots.

You can also layer on other geometries by adding a + to the previous code. Here is an example that shows the raw data jittered over the violin plots and uses an alpha channel to assign transparencies in each of the respective geometries while maintaining a consistent set of aesthetics.

ggplot(hypo,
       aes(x = alewife, y = do_mgl, color = alewife, fill = alewife)) +
  geom_violin(alpha = 0.10) +
  geom_jitter(alpha = 0.20)

We could tweak a few of the options in the geom_violin() function. Remember to run ?geom_violin to see these options. I’ll add a line to each violin for the median (50th percentile).

ggplot(hypo,
       aes(x = alewife, y = do_mgl, color = alewife, fill = alewife)) +
  geom_violin(alpha = 0.10, draw_quantiles = 0.50) +
  geom_jitter(alpha = 0.20)

And, of course we have full control over axis titles, group names, and how the legend is displayed.

ggplot(hypo,
       aes(x = alewife, y = do_mgl, color = alewife, fill = alewife)) +
  geom_violin(alpha = 0.10, draw_quantiles = 0.50) +
  geom_jitter(alpha = 0.20) +
  scale_x_discrete(labels = c("Absent", "Present")) +
  xlab("Alewife presence or absence") +
  ylab("Dissolved oxygen (mg/l)") +
  labs(fill = "Alewife", color = "Alewife") +
  theme(
    axis.title.x = element_text(vjust = -1),
    axis.title.y = element_text(vjust = 3)
  )

If you don’t like the default panel layout, you can use a built-in ggtheme to change it, or add arguments to the theme() function to modify specific elements, or both, like this:

ggplot(hypo,
       aes(x = alewife, y = do_mgl, color = alewife, fill = alewife)) +
  geom_violin(alpha = 0.10, draw_quantiles = 0.50) +
  geom_jitter(alpha = 0.20) +
  scale_x_discrete(labels = c("Absent", "Present")) +
  xlab("Alewife presence or absence") +
  ylab("Dissolved oxygen (mg/l)") +
  labs(fill = "Alewife", color = "Alewife") +
  theme_bw() +
  theme(
    axis.title.x = element_text(vjust = -1),
    axis.title.y = element_text(vjust = 3),
    panel.grid = element_blank()
  )

And finally, of course, we can change the colors manually or using color palettes. In this case, we have two levels so I will do it manually:

ggplot(hypo,
       aes(x = alewife, y = do_mgl, color = alewife, fill = alewife)) +
  geom_violin(alpha = 0.10, draw_quantiles = 0.50) +
  geom_jitter(alpha = 0.20) +
  scale_x_discrete(labels = c("Absent", "Present")) +
  scale_fill_manual(values = c("gray40", "black")) +
  scale_color_manual(values = c("gray40", "black")) +
  xlab("Alewife presence or absence") +
  ylab("Dissolved oxygen (mg/l)") +
  labs(fill = "Alewife", color = "Alewife") +
  theme_bw() +
  theme(
    axis.title.x = element_text(vjust = -1),
    axis.title.y = element_text(vjust = 3),
    panel.grid = element_blank()
  )
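One thing to watch in the code above: unnamed values and labels are matched to the groups in order (alphabetical for a character column), so "gray40" goes with absent only because "absent" sorts first. A slightly more defensive sketch names each value so the pairing is explicit. Here hypo_toy, alewife_cols, and alewife_labs are hypothetical names, and the numbers are made up.

```r
library(ggplot2)

# Made-up data standing in for hypo
hypo_toy <- data.frame(
  alewife = rep(c("absent", "present"), each = 5),
  do_mgl = c(5.8, 6.1, 6.4, 5.5, 6.9, 1.2, 2.4, 1.9, 2.8, 1.6)
)

# Named vectors: each name is a value found in the alewife column
alewife_cols <- c(absent = "gray40", present = "black")
alewife_labs <- c(absent = "Absent", present = "Present")

p_named <- ggplot(hypo_toy,
                  aes(x = alewife, y = do_mgl,
                      color = alewife, fill = alewife)) +
  geom_violin(alpha = 0.10) +
  geom_jitter(alpha = 0.20, width = 0.1) +
  scale_x_discrete(labels = alewife_labs) +
  scale_fill_manual(values = alewife_cols) +
  scale_color_manual(values = alewife_cols)
```

With names in place, reordering the groups later will not silently swap the colors or labels.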

And, of course, if you really hate violin plots and wish I had stuck with the boxplot, you can just swap out the geometry that is being used here!!

ggplot(hypo,
       aes(x = alewife, y = do_mgl, color = alewife, fill = alewife)) +
  geom_boxplot(alpha = 0.10, width = 0.3) +
  geom_jitter(alpha = 0.20, width = 0.1) +
  scale_x_discrete(labels = c("Absent", "Present")) +
  scale_fill_manual(values = c("gray40", "black")) +
  scale_color_manual(values = c("gray40", "black")) +
  xlab("Alewife presence or absence") +
  ylab("Dissolved oxygen (mg/l)") +
  labs(fill = "Alewife", color = "Alewife") +
  theme_bw() +
  theme(
    axis.title.x = element_text(vjust = -1),
    axis.title.y = element_text(vjust = 3),
    panel.grid = element_blank()
  )

And, we could just as easily swap back to using a histogram by making just a couple of quick tweaks to the plotting code!

ggplot(hypo,
       aes(x = do_mgl, color = alewife, fill = alewife)) +
  geom_histogram(alpha = 0.20) +
  scale_fill_manual(values = c("gray40", "black")) +
  scale_color_manual(values = c("gray40", "black")) +
  ylab("Frequency of observation") +
  xlab("Dissolved oxygen (mg/l)") +
  labs(fill = "Alewife", color = "Alewife") +
  theme_bw() +
  theme(
    axis.title.x = element_text(vjust = -1),
    axis.title.y = element_text(vjust = 3),
    panel.grid = element_blank()
  )

And, we could just as easily wrap these into a multi-faceted plot, if you prefer, by using the facet_wrap() function like this:

ggplot(hypo,
       aes(x = do_mgl, color = alewife, fill = alewife)) +
  geom_histogram(alpha = 0.20) +
  scale_fill_manual(values = c("gray40", "black")) +
  scale_color_manual(values = c("gray40", "black")) +
  ylab("Frequency of observation") +
  xlab("Dissolved oxygen (mg/l)") +
  labs(fill = "Alewife", color = "Alewife") +
  facet_wrap(~alewife) +
  theme_bw() +
  theme(
    axis.title.x = element_text(vjust = -1),
    axis.title.y = element_text(vjust = 3),
    panel.grid = element_blank()
  )

## Summary and next steps

Hopefully the power of these data manipulation and plotting techniques is starting to resonate with you. The workflow is a little tough to wrap your head around at first, but once you get it down you can re-use it again and again. This is the reason why the whole world is using these techniques now. Just think about how much time these plots would have taken to build and customize in Excel or Sigma!! This is time you can now spend collecting data, paddling around your favorite lake, playing with your kids, or writing your dissertation. You’re welcome.

We will continue to apply these techniques in the Intermediate session. If you are dying for more, just Google “How to ____ in ggplot” and you will be served more examples than you can complete in a lifetime!