5.6 Sample statistics

In this section, we will learn how to estimate the parameters of the normal distribution using a few different methods in R. We will use this opportunity to re-introduce the parameters as moments of the distribution so we can talk about what we mean by confidence intervals, and we will introduce a couple of different methods for calculating those moments. Specifically, we will look at how to derive…

5.6.1 Moments about the mean

Sounds fancy, huh? Here they are, like a bandaid:

  1. Zeroth moment
    • The total probability of the distribution, which is always 1.00 (see the quick check after this list)
  2. First moment
    • The mean
    • We will look at a few ways to calculate this
  3. Second moment
    • The variance
    • As with the mean, we will examine a couple of options for calculating it
  4. Third moment
    • Skew
    • We won’t calculate this for class, but we have discussed it: this parameter describes the asymmetry of the distribution (whether the peak is pushed to the left or right of center)
  5. Fourth moment
    • Kurtosis
    • Similarly, we won’t cover the calculation, but this is another moment we may have discussed with respect to departures from a z distribution in the normal (how heavy the tails are)
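As a quick check on that zeroth moment, we can ask R to integrate the normal density over its entire support. The total probability should come out to (approximately) 1; here is a minimal sketch using the built-in dnorm() and integrate() functions:

# Integrate the standard normal density
# over its entire support: the total
# probability should be 1
integrate(dnorm, lower = -Inf, upper = Inf)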

5.6.2 Estimating parameters of the normal distribution from a sample

The tools demonstrated below can be used for most of the probability distributions implemented in R, and we could go on and on forever about them. But, for the sake of our collective sanity, we will walk through the available tools using the normal distribution alone. Most of the time this will suffice, because our real objective in understanding other distributions is to be able to assume asymptotic normality later on, whether in response variables (with transformations) or in parameter distributions (with link functions).

5.6.2.1 Method of moments estimator

The moments of the normal distribution are well defined, and you are probably familiar with how to calculate a mean (average) already. See if you can reconcile the formulas below with how you already know to calculate a mean and a variance!
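For reference, here are the first two moments about the mean written out as the estimators we will compute below:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2$$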

Start by simulating a variable with a known mean and standard deviation. We’ll pretend that we are simulating cold temperatures here. (Because we don’t set a random seed, your numbers will differ slightly from the output shown.)

# Take a random sample from a normal
# with a mean of 20 and a standard
# deviation of 2
test_norm <- rnorm(1e4, 20, 2)

First, we’ll estimate the mean and variance by making our own function:

# Write the function
# First, define a function by name
norm.mom = function(x){      
  
  # Calculate mean
  x_bar = (1/length(x)) * sum(x) 
  
  # Calculate variance
  sigma2 = (1/length(x)) * sum((x-x_bar)^2)
  
  # Return the calculations
  return(c(x_bar, sigma2))      
  
}
# Test the function
norm.mom(test_norm)
## [1] 20.001882  3.967866

Because this one is so common, R has built-in estimators for the first two moments. One caveat: the built-in var() function computes the sample variance, which divides by n - 1 instead of n, so it will differ slightly from our method of moments estimate:

mean(test_norm)
## [1] 20.00188
var(test_norm)
## [1] 3.968263

Wow, that was a lot less code. That is the beauty of having these functions available. Scroll back up and compare: the means match exactly, and the variances differ only by that n - 1 denominator.
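If you want to convince yourself that the small difference in the variances really is just the denominator, you can rescale the output of var() by hand:

# Rescale the sample variance by (n-1)/n
# to recover the method of moments estimate
n <- length(test_norm)
var(test_norm) * (n - 1) / n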

5.6.2.2 Maximum likelihood estimator

R also has built-in maximum likelihood estimators that provide an exact solution for the parameters of the normal distribution. These are available through the fitdistr() function in the MASS package, which we need to load first.

library(MASS)

fitdistr(test_norm, 'normal')
##       mean           sd     
##   20.00188246    1.99195024 
##  ( 0.01991950) ( 0.01408522)

Only one problem here: R doesn’t report the second moment! It reports the square root of the second moment: the standard deviation! (The values in parentheses are the standard errors of the estimates.)
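If we want the variance back, we can save the fit and square the estimated standard deviation ourselves. (The object name fit below is just our choice; fitdistr() stores the estimates in an element called estimate.)

# Save the fit and square the estimated
# sd to recover the second moment
fit <- fitdistr(test_norm, 'normal')
fit$estimate["sd"]^2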

Finally, let’s write our own function and maximize the likelihood with the optim() function in R.

# Define the function
normal.lik = function(theta, y){
  
  # The current value of mu passed in
  # through the vector 'theta'
  mu = theta[1]
  
  # The current value of sigma2
  sigma2 = theta[2]
  
  # Number of observations in the data
  n = nrow(y)
  
  # Compute the log likelihood of the
  # data (y) using the likelihood
  # function for the normal distribution
  # given the current values of our
  # parameters (contained in the vector 'theta')
  logl = -.5*n*log(2*pi) - .5*n*log(sigma2) - (1/(2*sigma2))*sum((y - mu)^2)
  
  # Return the negative log likelihood
  # because optim() minimizes by default
  return(-logl)
}
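Before we optimize anything, it’s worth a quick sanity check: evaluate the function at the parameter values we used to simulate the data (mu = 20, sigma2 = 4). It should return a single finite number (the negative log likelihood):

# Evaluate the negative log likelihood
# at the known parameter values
normal.lik(c(20, 4), data.frame(test_norm))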

Now, we use the optim() function to maximize the likelihood of the data (technically by minimizing the negative log likelihood that our function returns) given different values of our parameters (mu and sigma2).

To get started, we need to take a guess at what those parameters could be. (Yes, we know they are mu = 20 and sigma2 = 4, because we simulated the data with a standard deviation of 2.)

optim(c(20, 4), normal.lik, y=data.frame(test_norm))
## $par
## [1] 20.001884  3.968585
## 
## $value
## [1] 21080.53
## 
## $counts
## function gradient 
##       49       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

The estimates are in $par here (right where we told R to put them!), and a convergence code of 0 tells us the optimizer finished successfully. We can also make the output into an object and call the parts by name:

# Make it into an object
est = optim(c(0, 1),
            normal.lik,
            y=data.frame(test_norm)
            ) 
## Warning in log(sigma2): NaNs produced

Don’t worry about the warnings: starting from c(0, 1), the optimizer occasionally tries negative values for sigma2 on its way to the answer, and log() of a negative number produces NaN. Now look at the structure. I’ll be damned, it’s a list! Hey, we learned about those!

str(est)   
## List of 5
##  $ par        : num [1:2] 20 3.97
##  $ value      : num 21081
##  $ counts     : Named int [1:2] 101 NA
##   ..- attr(*, "names")= chr [1:2] "function" "gradient"
##  $ convergence: int 0
##  $ message    : NULL

And, here are the estimates:

# Both
est$par 
## [1] 20.001204  3.971007
# The mean  
est$par[1]
## [1] 20.0012
# The variance  
est$par[2] 
## [1] 3.971007
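Remember that the second element is the variance. If you want to compare it to the standard deviation reported by fitdistr() above, just take the square root:

# Square root of the variance gives
# the standard deviation
sqrt(est$par[2])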

There you have it: two ways to estimate the parameters of the normal distribution, with a couple of different implementations of each.

5.6.3 Quantiles and other descriptive statistics

There are a number of other ways we might like to describe this (or any) sampling distribution. Here are a few examples that we will work with this semester.

# Here is the median, or 50th percentile
median(test_norm) 
## [1] 20.0211
# The 2.5th and 97.5th percentiles,
# which bound the middle 95% of the sample
quantile(test_norm, probs = c(0.025, 0.975)) 
##     2.5%    97.5% 
## 16.05969 23.85148
# Interquartile range (Q1 and Q3)
quantile(test_norm, probs = c(0.25, 0.75))    
##      25%      75% 
## 18.70077 21.33718
# Range of sample
range(test_norm)                             
## [1] 12.93180 28.44992
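Several of these descriptive statistics are available all at once through the built-in summary() function, which is often all you need for a first look at a sample:

# Minimum, first quartile, median, mean,
# third quartile, and maximum at once
summary(test_norm)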