3.1 Data read

Few things will turn someone away from a statistical software program faster than not being able to figure out how to get it to read in their data. So, we are going to get that out of the way right up front!

Let’s start by reading in a data file - this time we use real data.

The data are stored in a “comma separated values” file (.csv extension). This is a fairly universal format, so we read it in using the fairly universal read.csv() function. The function you use would change depending on how the data are stored or how big the data files are, but that is a topic for a later date. I probably do 95% of my data reads using .csv files. We’ll look at a few others later.
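If you want to see what read.csv() does before touching the class data, here is a minimal sketch. It writes a tiny, made-up .csv file to a temporary location and reads it back in. The file contents and the names tmp_file and toy are invented for illustration; they are not part of the shad data.

```r
# Write a tiny, made-up CSV to a temporary file (not the class data)
tmp_file <- tempfile(fileext = ".csv")
writeLines(
  c("Sex,Age,Length",
    "B,1,13",
    "R,2,25.5"),
  tmp_file
)

# Read it back in; the first line of the file becomes the column names
toy <- read.csv(tmp_file)

# Each column gets a type based on what it contains
str(toy)
```

The same call works on any comma-separated file as long as the path you give it points to where the file actually lives.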

Important: Remember that I am assuming your scripts are in the same directory (folder) on your computer as where you downloaded and unzipped the class data (see here for a reminder).

Before you can read this file you will need to set your working directory. For class, I will ask that you click Session > Set Working Directory > To Source File Location in RStudio. This will set the working directory to wherever you have saved your code so that R can find the data folder and the files inside of it. You’ll notice that RStudio spits out some code (a call to setwd()) in the console when you click this. You can also use that code to set a working directory in your script, but the path in it is specific to your computer, which can cause all kinds of problems when the script is moved or shared, so don’t do it.
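If R complains that it cannot find the file, a quick check like the one below can help you figure out where R is looking. This is only a sketch: data_path is a name I made up here, and what file.exists() returns depends on how things are set up on your own computer.

```r
# Print the current working directory
getwd()

# The relative path used in this section
data_path <- "data/ctr_fish.csv"

# TRUE if R can see the file from the current working directory
file.exists(data_path)
```

If the last line returns FALSE, the working directory is probably not the folder that contains the data folder.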

# Start by reading in the data
am_shad <- read.csv("data/ctr_fish.csv")

Once you’ve read your data in, it’s always a good idea to look at the first few lines of data to make sure nothing looks ‘fishy’. Ha-ha, I couldn’t help myself!

These are sex-specific length and age data for American shad (Alosa sapidissima) from the Connecticut River, USA. The data are used in models that I maintain with collaborators from NOAA Fisheries, the US Geological Survey, the US Fish and Wildlife Service, and others. The data were provided by CT Department of Energy and Environmental Protection (CTDEEP) and come from adult fish that return to the river from the ocean each year to spawn in fresh water.

You can look at the first few rows of data with the head() function:

# Look at the first 10 rows
head(am_shad, 10)
##    Sex Age Length yearCollected backCalculated
## 1    B   1     13          2010           TRUE
## 2    B   1     15          2010           TRUE
## 3    B   1     15          2010           TRUE
## 4    B   1     15          2010           TRUE
## 5    B   1     15          2010           TRUE
## 6    B   1     15          2010           TRUE
## 7    B   1     16          2010           TRUE
## 8    B   1     16          2010           TRUE
## 9    B   1     16          2010           TRUE
## 10   B   1     16          2010           TRUE
##    Mass
## 1    NA
## 2    NA
## 3    NA
## 4    NA
## 5    NA
## 6    NA
## 7    NA
## 8    NA
## 9    NA
## 10   NA

The NA values are supposed to be there. They are missing data.
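Missing values change how a lot of functions behave, so it helps to know how to find them. Here is a small sketch using a made-up vector (toy_mass is invented for illustration and is not the shad data):

```r
# A made-up vector with missing values (not the shad data)
toy_mass <- c(1200, NA, 1500, NA)

# Which values are missing?
is.na(toy_mass)

# How many are missing?
sum(is.na(toy_mass))

# Many summary functions return NA unless told to drop missing values
mean(toy_mass)
mean(toy_mass, na.rm = TRUE)
```

The same functions can be applied to a column of a data frame, such as am_shad$Mass.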

And, don’t forget about your old friend str() for a peek at how R sees your data. This can take care of a lot of potential problems later on.

# Look at the structure of the data
str(am_shad)
## 'data.frame':    16946 obs. of  6 variables:
##  $ Sex           : chr  "B" "B" "B" "B" ...
##  $ Age           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Length        : num  13 15 15 15 15 15 16 16 16 16 ...
##  $ yearCollected : int  2010 2010 2010 2010 2010 2010 2010 2010 2010 2010 ...
##  $ backCalculated: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
##  $ Mass          : int  NA NA NA NA NA NA NA NA NA NA ...

There are about 17,000 observations (rows) of 6 variables (columns) in this data set. Here is a quick breakdown:
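If all you want is the number of rows and columns, you do not need to read them off the str() output. A minimal sketch with a made-up data frame (toy is invented for illustration; swap in am_shad to check the real data):

```r
# A small, made-up data frame (not the shad data)
toy <- data.frame(
  Sex = c("B", "R", "B"),
  Age = c(1L, 2L, 3L),
  Length = c(13, 25.5, 30)
)

# Number of observations (rows)
nrow(toy)

# Number of variables (columns)
ncol(toy)

# Both at once: rows first, then columns
dim(toy)

# Column names
names(toy)
```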

Sex: fish sex. B stands for ‘buck’ (males), R stands for ‘roe’ (females).

Age: an integer describing fish age.

Length: fish length at age (cm).

yearCollected: the year in which the fish was caught.

backCalculated: a logical indicating whether or not the length of the fish was back-calculated from aging.

Mass: the mass of individual fish (in grams). Note that this is NA for all fish whose lengths were back-calculated from hard structures (so all cases for which backCalculated == TRUE).
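One way to check a claim like the one about Mass and backCalculated is to cross-tabulate the two. The sketch below uses a tiny, made-up data frame with the same column names (toy_shad and check are invented for illustration, and the values are not real data). You could run the same table() call on am_shad to check the real thing.

```r
# A made-up data frame with the same column names (not the shad data)
toy_shad <- data.frame(
  backCalculated = c(TRUE, TRUE, FALSE, FALSE),
  Mass = c(NA, NA, 1200L, 1500L)
)

# Cross-tabulate missing Mass against backCalculated
check <- table(
  mass_missing = is.na(toy_shad$Mass),
  back_calculated = toy_shad$backCalculated
)

# Rows are whether Mass is missing, columns are backCalculated
check
```

If every count falls on the diagonal of this table, then Mass is missing exactly when backCalculated is TRUE.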