3.3 Subsetting and selecting data

Before we can make meaningful data summaries, we will probably need to re-organize our data in a logical way by subsetting, or selecting, specific chunks of it. A lot of the time, we do this along the way without really thinking about it.

3.3.1 Manual subsets and selections

We talked a little about subsetting data with logical queries in Chapter 2. Now, let’s refresh and take that a little further to see why we might want to do it.

First, we’ll select just the data from am_shad where backCalculated was FALSE. This will give us only the measured Length and Mass for each fish, along with their Sex and yearCollected. I’ll call this new object measured. Remember, am_shad is a data frame, so it has two dimensions. When we subset with [ ], we separate those dimensions with a comma, like this: object[rows, columns]. When we leave the columns blank, R knows that it should keep all of the columns.

measured <- am_shad[am_shad$backCalculated == FALSE, ]

We could do this for as many conditions as we might conceivably want to subset on, but the code can get clunky and hard to manage. For example, can you imagine re-writing this if you just wanted to select age-6 roes without back-calculated lengths?

# Notice how we string together multiple 
# conditions with "&". If these were 'or'
# we would use the vertical pipe "|"
age_6_roes_measured <- am_shad[am_shad$backCalculated == FALSE & 
                                 am_shad$Sex == "R" &
                                 am_shad$Age == 6, ]

3.3.2 Subsetting and summaries in base R

This notation can be really confusing to folks who are just learning a new programming language. Because of that, there are great functions like subset() available that read more intuitively (though they are arguably less explicit to programmers). You could also subset the data using the following code:

measured <- subset(am_shad, backCalculated == FALSE)

We could also get our age-six females from the previous example using this approach, and at least the code is a little cleaner:

age_6_roes_measured <- subset(am_shad,
                              backCalculated == FALSE &
                                Sex == "R" &
                                Age == 6
                              )

Both do the same thing, but we’ll see later that using functions like subset or filter is preferable if we plan on chaining together a bunch of data manipulation commands using pipes (%>% or |>).
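To preview why that matters, here is a quick sketch of subset() on the right-hand side of the native pipe (|>, available in R 4.1 and later), which passes the data frame in as the first argument. The little data frame below is a made-up stand-in for am_shad with hypothetical values, just so the example runs on its own:

```r
# Hypothetical stand-in for am_shad (not the real data)
am_shad_demo <- data.frame(
  Sex = c("B", "R", "R", "B", "R"),
  Age = c(4, 6, 5, 5, 6),
  Length = c(410, 480, 455, 430, 470),
  backCalculated = c(FALSE, FALSE, TRUE, FALSE, FALSE)
)

# The pipe hands am_shad_demo to subset() as its first argument,
# so we can read the operation left to right
age_6_roes_demo <- am_shad_demo |>
  subset(backCalculated == FALSE & Sex == "R" & Age == 6)

nrow(age_6_roes_demo)
```

Here only two rows survive all three conditions, so nrow() returns 2. With a single condition the pipe buys us little, but once we start chaining several operations it keeps the code readable.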

Next, we might be interested to know how many fish we have represented in each Sex. We can find this out using the table function in base R:

# Here, I use the column name because
# we just want all observations of a single
# variable. Be careful switching between names,
# numbers, and $names!
table(measured['Sex'])
## Sex
##    B    R 
## 1793 1253

We see that we have 1793 males (B, for bucks) and 1253 females (R, for roes).

We can also get tallies of the number of fish in each Age for each Sex if we would like to see that:

table(measured$Sex, measured$Age)
##    
##       3   4   5   6   7
##   B 255 848 579 108   3
##   R   0 361 658 220  14

But, what if we wanted to calculate some kind of summary statistic, like a mean and report that by group?

For our age-6 females example, it would look like this:

age_6_roes_measured <- subset(am_shad,
                              backCalculated == FALSE &
                                Sex == "R" &
                                Age == 6
                              )

age_6_female_mean <- mean(age_6_roes_measured$Length)

Again, we could do this manually, but it would require a lot of code for a simple calculation if we used the methods above all by themselves to get these means for each age group of roes.

We would basically just copy and paste the code over and over to force R into making the data summaries we need. There is nothing wrong with this approach, and it certainly has its uses for simple summaries, but it can be cumbersome and redundant. It also fills your workspace with tons of objects that are hard to keep track of and that will make your code-completion suggestions wicked annoying in RStudio.
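For what it’s worth, base R does have a shortcut for this exact pattern: tapply(), which applies a function (like mean) to one variable within groups defined by another. A sketch with made-up lengths standing in for the measured roes:

```r
# Hypothetical stand-in for the measured roes (not the real data)
roes_demo <- data.frame(
  Age = c(4, 4, 5, 5, 6, 6),
  Length = c(420, 430, 450, 460, 480, 490)
)

# One call returns the mean Length for every Age at once,
# instead of one subset() + mean() pair per age group
tapply(roes_demo$Length, roes_demo$Age, mean)
##   4   5   6 
## 425 455 485
```

That gets us the numbers, but the result is a named vector rather than a data frame, and chaining further steps onto it gets awkward fast.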

That usually means there is a better way to write the code…

3.3.3 Subsetting and summaries in the tidyverse

Long ago, when I was still a noOb writing R code with a stand-alone text editor and a console, there were not a ton of packages available for the express purpose of cleaning up data manipulation in R. The one I relied on most heavily was the plyr package. Since then, R has grown, and a lot of these functions have been gathered under the umbrella of the tidyverse: a collection of R packages designed to make the whole process less painful. These include packages like dplyr (which replaced plyr) and others that are designed to work together, with similar syntax, to make data science (for us, data manipulation and presentation) a lot cleaner and better standardized. We will rely heavily on packages in the tidyverse throughout this book.

Before we can work with these packages, however, we need to install them, something we haven’t talked about yet! Most of the critical R packages are hosted through the Comprehensive R Archive Network, or CRAN, but tons of others are available for installation from hosting services like GitHub and GitLab.

If you haven’t seen it yet, here is a three-minute video explaining how to install packages using RStudio. Watch it. Please.

It is also easy to install packages by running a line of code in the console. We could install each of the packages in the tidyverse separately, but we can also get all of them at once because they are bundled together.

Follow the instructions in the YouTube link above, or install the package from the command line:

install.packages('tidyverse')

Once we have installed these packages, we can use the functions in them to clean up our data manipulation pipeline and get some really useful information.
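As a taste of what is coming, here is one way the age-by-sex mean lengths might look as a dplyr chain: filter out the back-calculated lengths, group by Sex and Age, and summarize each group in one pass. The data frame is a hypothetical stand-in for am_shad, just so the sketch runs on its own:

```r
library(dplyr)

# Hypothetical stand-in for am_shad (not the real data)
am_shad_demo <- data.frame(
  Sex = c("B", "R", "R", "B", "R", "R"),
  Age = c(4, 6, 5, 5, 6, 5),
  Length = c(410, 480, 455, 430, 470, 450),
  backCalculated = c(FALSE, FALSE, FALSE, FALSE, FALSE, TRUE)
)

# Drop back-calculated lengths, then take the mean Length
# for every Sex-by-Age combination in a single chain
am_shad_demo |>
  filter(backCalculated == FALSE) |>
  group_by(Sex, Age) |>
  summarize(mean_length = mean(Length), .groups = "drop")
```

Compare this to the copy-and-paste approach from the last section: one readable chain replaces a separate subset() and mean() for every group, and the result comes back as a tidy data frame we can keep working with.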