Welcome Applications of R for Aquatic Plant data! This module will serve as a tutorial to help you understand how to improve your data and analysis workflows for aquatic plant data in R. In this module, we will read in and manipulate a standard data file format used for point-intercept (rake-toss) surveys, we will demonstrate some basic analyses for species richness and presence-absence data, and we will cover some tools for visualizing results.

Libraries

Let’s go ahead and load the libraries we’ll be working with in the first half of this workshop.

library(tidyverse) # Data manipulation, summaries, plotting
library(RColorBrewer) # Pretty color ramps for graphs

Data

The first data set we’ll work with is from a point-intercept (rake-toss) survey in Otsego Lake, NY, USA during summer 2025. Well…sort of.

The original data were collected by Chii Kojima, an undergraduate at Mt. Holyoke College and SUNY Oneonta BFS summer intern, and Mairi Meehan, a professor at SUNY Cobleskill and BFS visiting faculty researcher. They completed 3 surveys at 54 points throughout the growing season on the dates included in the data set, with three rake tosses at each site. They recorded Depth, location (Longitude and Latitude), and categorical abundance (Z, T, S, M, or D) of each species at each site. If you are not familiar with this categorical abundance scale, the letters represent Zero, Trace, Sparse, Moderate, and Dense. They wanted to understand seasonality in plant presence and abundance and whether this and other factors could be used to improve survey design for long-term monitoring.

We used the original data to simulate new data using an institutional large language model (LLM, “AI”) to simulate new observations at sites and dates from the survey based on spatial and temporal autocorrelation in the data, with depth as a covariate of abundance, and with permission for all of this from the data originators. As an aside, this was a fun learning experience and the AI ended up using the same methods we would have employed, but ironically it used Python instead of R. And, it did a pretty not bad job of simulating the new data set that we’ll play with. Back on track here.

Read in the simulated data now that we are done with story time:

plants <- read.csv("data/plant_abundances.csv")

Have a look at the data set using glimpse(), as shown below, or you can look at it in data table format using View(). We’re not showing the output here in our version because it’s a bit long.

glimpse(plants)

This is a pretty standard point-intercept data set. It is in a somewhat “wide” format. That is, plant species are in columns to the right of all the other aggregated site and event data. The wide format is good for data entry and viewing in spreadsheets, but it is not how we’ll want the data formatted for most of our needs in R.

We’ll first need to get the data set into a longer format to work with it more easily for most uses in R. This would be a bit of a copy-paste nightmare in your typical spreadsheet application. But, it is very easily done using the pivot_longer() function from the dplyr package. We’re telling R to stack the individual columns for species 8:ncol(plants) into two columns: one with species name, and one with categorical abundance. You would need to change this for a different data set, but could also do things like identify columns with underscore characters (_) if your species columns were named similarly.

long_data <- pivot_longer(plants,
                          cols = 8:ncol(plants),
                          names_to = "Species",
                          values_to = "Abundance")

Presence-absence data

First, we’ll look at how we can work with information about presences and absences of plants at each site on each survey to say something about the plant community. Some common tools for assessing plant communities might include metrics such as species richness or evenness.

To start with, we’ll need to make a new column for presence-absence data so we can do things like add up species. There are a few different ways to do this. A succinct approach would be to initialize a new column that is all zeroes and then replace any zeroes with ones if the Abundance column contained anything other than Z. Like this:

# Initialize new column called Presence of all `0`s
long_data$Presence <- 0

# Change this to `1` if Abundance does not equal `Z` for zero
long_data$Presence[long_data$Abundance != "Z"] <- 1

We could look at the presence absence data in a few different ways. First, let’s estimate species richness for each Site and Survey using the replicate observations (Toss). We’re also keeping columns for Longitude and Latitude here so we can play with those below.

richness <- long_data %>% 
  group_by(Site, Survey, Toss, Longitude, Latitude) %>% 
  summarize(Richness = sum(Presence))

Now we can make a boxplot to visualize richness across surveys (June, July, August) to see if there are obvious seasonal trends.

ggplot(richness, aes(x = Survey, y = Richness, group = Survey)) +
  geom_boxplot()

This is cool, but you could spend some time to make it a little prettier. It looks like there is an increase from June through the later months, but it’s not super clear whether Richness differed between surveys 2 and 3.

You could formally test for differences among groups using a variety of approaches.

Presence-absence for a given species

We could use a similar approach to the one we used above to explore trends in the presence of individual species as well. To do this, we could work directly with the Presence column we created in long_data during data manipulation.

Let’s filter the data set so that we are working with a single species. You can choose a different one to work with, and none of the code that follows should need to change, although your results will likely differ dependent upon species.

The perceived expansion of clasping pondweed (P. richardsonii) following dreissenid mussel invasions and increased water clarity has been a common topic of discussion at basically any gathering around Otsego Lake in recent years, so we’ll check that out here.

presence_data <- long_data %>% 
  filter(Species == "Potamogeton_richardsonii")

Now, we can use the same glm() function that we used to analyze species richness in the example above, but this time we’ll use a “binomial” family for logistic regression. It really is that easy to use in R.

presence_mod <- glm(Presence ~ Longitude * Latitude + factor(Survey),
                    data = presence_data,
                    family = binomial)

Just as before, we can make predictions from this model. In this case, the response scale is probability of presence.

presence_data$fit <- predict(presence_mod, type = "response") 

And, we can now plot probability of presence by site and survey!

# We could use the RColorBrewer package to make a spectral scale that looks cool
myPalette <- colorRampPalette(rev(brewer.pal(10, "Spectral")))
sc <- scale_colour_gradientn(colours = myPalette(10), limits=c(0, 1),
                             oob = scales::squish)

# Then we just add it to our ggplot like so, along with a couple of other 
# changes in the name of beautiful data visualization
ggplot(presence_data, aes(x = Longitude, y = Latitude, color = fit)) +
  geom_point(size = 4, alpha = 0.85) +
  facet_wrap(~Survey) +
  sc +
  labs(color = "Probability of presence") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Isn’t that cool?

Summary

In this module, we’ve walked through the process of data read, manipulation of a standard point-intercept data format, and examined multiple ways to analyze and visualize species richness and individual species presence-absence data. This should give you an appreciation for the power and diversity at your disposal in R. And, this is really just scratching the surface of the tools that are available to address these needs in R. And, hopefully, you have the confidence to start exploring more of those tools having been oriented to some of them!



This work is licensed under a Creative Commons Attribution 4.0 International License. Data are provided for educational purposes only unless otherwise noted.