3.4 Better data summaries

Now, we’ll look at some slightly more advanced summaries. Start by loading the dplyr package into your R session with the following code.

library(dplyr)

We can use functions from the dplyr package to calculate the mean Length of fish for each combination of Sex and Age much more easily than we did for a single group above.

First, we group the data in the measured data frame that we created previously using the group_by function. For this, we just need to give R the data frame and the variables by which we would like to group:

g_lengths <- group_by(measured, Sex, Age)

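This doesn’t change how we see the data much (it gets converted to a tibble), just how R sees it: functions like summarize will now work within each combination of Sex and Age instead of on the whole data set.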
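If you want to confirm what group_by actually did, dplyr has a few helper functions for inspecting the grouping. Here is a minimal sketch that assumes a small made-up data frame called toy standing in for measured (the values are invented for illustration only):

```r
library(dplyr)

# Hypothetical stand-in for the measured data frame (values are made up)
toy <- data.frame(
  Sex = c("B", "B", "B", "B", "R", "R"),
  Age = c(3L, 3L, 4L, 4L, 4L, 4L),
  Length = c(37, 39, 40, 42, 44, 46)
)

g_toy <- group_by(toy, Sex, Age)

is_grouped_df(g_toy) # Is the result a grouped data frame?
group_vars(g_toy)    # Names of the grouping variables
n_groups(g_toy)      # Number of Sex-by-Age combinations present
```

The rows and columns are the same as before; only the grouping information attached to the object has changed.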

Next, we summarize the variable Length by Sex and Age using the summarize function:

sum_out <- summarize(g_lengths, avg = mean(Length))

head(sum_out)
## # A tibble: 6 × 3
## # Groups:   Sex [2]
##   Sex     Age   avg
##   <chr> <int> <dbl>
## 1 B         3  38.1
## 2 B         4  40.5
## 3 B         5  42.0
## 4 B         6  43.4
## 5 B         7  46.8
## 6 R         4  45.0
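Notice the "Groups: Sex [2]" line in the output. By default, summarize removes only the last grouping variable (Age here), so the result is still grouped by Sex. If you would rather get back an ungrouped tibble, recent versions of dplyr (1.0.0 and later) accept a .groups argument. A minimal sketch, again using a made-up toy data frame in place of measured:

```r
library(dplyr)

# Hypothetical stand-in for the measured data frame (values are made up)
toy <- data.frame(
  Sex = c("B", "B", "B", "B", "R", "R"),
  Age = c(3L, 3L, 4L, 4L, 4L, 4L),
  Length = c(37, 39, 40, 42, 44, 46)
)

g_toy <- group_by(toy, Sex, Age)

# Default behavior: the result is still grouped by Sex
kept <- summarize(g_toy, avg = mean(Length))
group_vars(kept)

# Drop all grouping from the result
dropped <- summarize(g_toy, avg = mean(Length), .groups = "drop")
group_vars(dropped)
```

Calling ungroup on the result afterward is another way to get to the same place.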

Wow! That was super-easy!

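Finally, to make things even more streamlined, we can chain all of these operations together using the pipe operator, %>%, from magrittr (it is loaded automatically with dplyr). The pipe passes whatever is on its left-hand side as the first argument to the function on its right-hand side. This really cleans up the code and gives us small chunks of code that are easier to read than the dozens of lines of code it would take to do this manually.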

# This will do it all at once!
sum_out <- # Front-end object assignment
  measured %>% # Pass measured to the group_by function
  group_by(Sex, Age) %>% # Group by Sex and Age and pass to summarize
  summarize(avg = mean(Length))
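To see that the pipe is only a different way of writing the same function calls, we can compare the piped version against the nested version. This is a rough sketch that assumes a made-up toy data frame in place of measured:

```r
library(dplyr)

# Hypothetical stand-in for the measured data frame (values are made up)
toy <- data.frame(
  Sex = c("B", "B", "B", "B", "R", "R"),
  Age = c(3L, 3L, 4L, 4L, 4L, 4L),
  Length = c(37, 39, 40, 42, 44, 46)
)

# Nested calls: read from the inside out
nested <- summarize(group_by(toy, Sex, Age), avg = mean(Length))

# Piped calls: read from top to bottom
piped <-
  toy %>%
  group_by(Sex, Age) %>%
  summarize(avg = mean(Length))

# Compare the two results as plain data frames
all.equal(as.data.frame(nested), as.data.frame(piped))
```

Both versions do the same work. The piped version just lets us read the steps in the order they happen.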

We could also assign the output to a variable at the end, whichever is easier for you to read:

measured %>% # Pass measured to the group_by function
  group_by(Sex, Age) %>% # Group by Sex and Age and pass to summarize
  summarize(avg = mean(Length)) -> sum_out # Back-end object assignment

And, it is really easy to get multiple summaries out like this at once:

sum_out <-
  measured %>% 
  group_by(Sex, Age) %>% 
  summarize(avg = mean(Length), s.d. = sd(Length))

head(sum_out)
## # A tibble: 6 × 4
## # Groups:   Sex [2]
##   Sex     Age   avg  s.d.
##   <chr> <int> <dbl> <dbl>
## 1 B         3  38.1  2.75
## 2 B         4  40.5  2.70
## 3 B         5  42.0  2.29
## 4 B         6  43.4  2.09
## 5 B         7  46.8  1.61
## 6 R         4  45.0  2.65
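Any function that returns a single value per group can go inside summarize. For example, the dplyr function n counts the rows in each group, which is handy for judging how much to trust each mean and standard deviation. A minimal sketch, assuming a made-up toy data frame in place of measured (the column names n_fish and s_d are just illustrative choices):

```r
library(dplyr)

# Hypothetical stand-in for the measured data frame (values are made up)
toy <- data.frame(
  Sex = c("B", "B", "B", "B", "R", "R"),
  Age = c(3L, 3L, 4L, 4L, 4L, 4L),
  Length = c(37, 39, 40, 42, 44, 46)
)

toy_out <-
  toy %>%
  group_by(Sex, Age) %>%
  summarize(
    n_fish = n(),       # Number of rows (fish) in each group
    avg = mean(Length), # Mean Length in each group
    s_d = sd(Length)    # Standard deviation of Length in each group
  )

toy_out
```

Keep in mind that sd returns NA for any group that contains only one observation.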

Isn’t that slick? Just think how long that would have taken most of us in Excel!

This is just one example of how functions in packages can make your life easier and your code more efficient. Now that we have the basics under our belts, let’s move on to how we create new variables.