6.3 Frequency analysis

Now, what if we didn’t collect very good data, or we binned our data into low-resolution categories for the sake of ease in our study design? Often, and for a variety of reasons other than crappy data collection, we want to compare frequencies of events between two (or more) groups. We may even design studies specifically to test these kinds of hypotheses when we think about rates, for example. This is very common in studies of population genetics [definitely citations available for that one - go Google them]

The simplest way to test for differences in the frequency of a categorical response between two groups is (some would argue) the \(\chi^2\) test. The \(\chi^2\) is another one of those that you should really work out by hand because it is used in a variety of settings “under the hood” of more complex routines. Here is your token video link showing an example. Watch it. Please.

6.3.1 Worked example

Let’s say we want to know if the number of grass carp in a given age group (say age 10) varies between years. These fish are sterile hybrids, so we would expect that the number of fish in each age would change drastically with increasing time since the year of initial stocking (1995).

First, make a table showing the number of fish in each Age by Year with the grasscarp data.

agexyear <- with(grasscarp, table(Age, Year))

print(agexyear)
##     Year
## Age  2006 2007 2009 2010 2017
##   1     0    0    0    2    0
##   2     6    7    7    8    0
##   3    11   10   10   14    3
##   4     4    3    6    3    3
##   5     0    1    3    2   10
##   6     0    1    2    1   12
##   7     0    0    1    3   11
##   8     0    1    1    2   12
##   9     1    0    1    3   21
##   10    4    1    1    2   21
##   11    4    9    0    6   15
##   12    3   15    2    3   18
##   13    0    1    4    3    7
##   14    0    0   11    8    6
##   15    0    0    2   12   12
##   16    0    0    0    2   10
##   17    0    0    0    0    8
##   18    0    0    0    0    9
##   19    0    0    0    0   12
##   20    0    0    0    0   12
##   21    0    0    0    0    5
##   22    0    0    0    0    3
##   23    0    0    0    0    2

Basically what we are going to do is analyze the proportion of total fish in each column by age.

You should see some pretty obvious patterns here. We have a couple of things to think about now. First, this is the kind of question you don’t need statistics for. Second, we have a whole bunch of empty groups, and these are not random with respect to year. Some of these come from ages that were not yet available in years 2006 - 2010 and some from patterns in fish stocking. The large number of empty pairings and the fact that most age classes had fewer than five fish in any year prior to 2017 means we should probably break the data down a little further. This stinks because we lose resolution, but that is the cost.

For the sake of demonstration, let’s summarize the data by high and low density again and we’ll look at the number of fish collected in each age class during high and low density years.

freqs <- grasscarp %>% # Pass grass carp data frame to group_by()
    filter(Age %in% c(10:15)) # Select only shared age range

head(freqs)
##   Year Length Weight_kg Age cohort n_stocked     n acre   ha nha     kg
## 1 2006   1262    23.154  12   1994      7000 25128 2957 1203  21 129929
## 2 2006   1138    23.923  10   1996      7000 25128 2957 1203  21 129929
## 3 2006   1187    18.614  10   1996      7000 25128 2957 1203  21 129929
## 4 2006   1207    19.522  12   1994      7000 25128 2957 1203  21 129929
## 5 2006   1086    21.285  10   1996      7000 25128 2957 1203  21 129929
## 6 2006   1095    18.160  12   1994      7000 25128 2957 1203  21 129929
##   kg_ha density
## 1   108     low
## 2   108     low
## 3   108     low
## 4   108     low
## 5   108     low
## 6   108     low

We will test the null hypothesis that there is no difference in the number of age-10 fish between high and low densities.

# Run the test
chi_test <- with(freqs, chisq.test(x = density, y = Age))

# Have a look
print(chi_test)
## 
##  Pearson's Chi-squared test
## 
## data:  density and Age
## X-squared = 13.107, df = 5, p-value = 0.0224

And, bam! We see that there is a difference in the frequency of fish collected in each age class in high and low density years. Shocker.

Data visualization techniques for contingency table analyses like this seem to have generally lagged behind theory in terms of wide-spread implementation. There is a base R mosaicplot that plots relative freqencies. You can interpret the width of the bars as the proportion of total observations in each age class. Likewise, the height of the vertical segnmens corresponds to proportion of high or low density observations in each Age.

mosaicplot(Age ~ density, data = freqs)

The need for improved graphical representation for these data types is recognized. There have been recent efforts to extend the philosophies used ggplot() to contingency analysis by r developers (see Wickham and Hofmann 2011). It was even the topic of a recent master’s thesis (see Grant 2017). But as far as I know the ideas from these works have not been integrated into ggplot2 or the tidyverse yet. Sorry about the citations. I don’t know what I was thinking. This is supposed to be the Worst Stats Text eveR.