undergrad level stat homework with R coding

Tutorial6-Checkingmodelfit.pdf

Home >Mathematics homework help >Statistics homework help >undergrad level stat homework with R coding

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

l2STAT 503 – Statistical Methods for Biology

Tutorial 6: Checking model fit

Material from this tutorial will be used in Homework 7, which is due at 11:59 PM on Tuesday, November 6.

Tutorial 6 introduces methods for checking the distributional assumptions of a model, along with some more advanced graphing techniques. It also introduces the use of predictive simulation as a model diagnostic tool. Working through this tutorial, you will learn:

1) How to overlay a theoretical distribution on an empirical histogram. 2) How to generate a normal quantile plot. 3) How to use the with() function to access variables in a data frame. 4) How to perform the same procedure on subsets of the data using dplyr's do() function. 5) How to use facet_wrap() and/or color coding to represent subsets of the data in a

graph. 6) How to generate simulated data from a fitted model with replicate() and rnorm(),

and then use those simulations to check the model's distributional assumptions. 7) How to randomly subsample a dataset using the sample() function.

Please attempt all of the exercises and ask for help if you encounter any problems or confusion.

1. Getting started

Open RStudio. Remember that if you selected “Save” the last time that you closed it, your previous workspace will be reloaded, and that

1) any scripts that were open when you closed RStudio will be reopened, 2) any objects that were saved in the Environment will still be there, and 3) your working directory will be set to whatever it was when you closed RStudio.

In this tutorial, we will use the dplyr and ggplot2 packages. Remember that when you attach the dplyr package, it may mask a few items from the base and stats packages. This is okay. There is no data to load with this tutorial.

2. Data

This tutorial will use the original rats.csv data file. In addition, I would like to introduce a new tool for locating data files, which should make your life easier. In a script, write,

filename <- file.choose()

filename

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

Now run these two lines. When the GUI file selection window pops up, simply navigate to rats.csv, and select it, the same way you would if you were opening a file in Excel. After you click "Open" (Mac might say something different?), R will print the value of the object filename. This is the path to rats.csv. However, you do not need to copy it. Just put,

filename <- file.choose()

rats <- read.csv(filename)

at the beginning of your script, and you'll be prompted to locate it when you run the script (if you don't want to do this, then you can copy and paste the path into your script). Note that this approach can also be used with write.table() or write.csv() when outputting results.

3. Overlaying histograms and density curves

Any time you that you use a statistical model to make an inference, you are making certain assumptions about the distribution of the data. It is important to check these assumptions carefully. Although they will almost never be met exactly, severe violations of the model assumptions can lead to faulty inferences and potentially to false scientific conclusions.

One of the key assumptions made by most statistical models is that the data come from a particular distribution. There are several ways to check this assumption, but the most informative techniques are graphical. This section will introduce two methods for checking the fit between a model's distribution and data: overlaying histograms with density curves, and generating quantile plots. In the next section, we will build on the quantile-plot methodology by comparing the plot for the data with plots of simulated data generated from the model.

3.1 The basics

Start a new script and load the original rats.csv dataset. If you remember, this dataset contains weights of 15 male and 15 female rats, with 5 individuals from each sex assigned to each of three diet treatments: control, A, or B. Use qplot(), ggplot(), or hist() (whichever you prefer) to plot a histogram of the weight variable (don't forget to adjust your bins – you can use bins, binwidth, or breaks to do so).

qplot(weight_g, data = rats, binwidth=5,

col=I('black'), fill=I('cornflowerblue'))

Typically, we would expect body mass to be normally distributed because its value reflects the additive effects of many small, random factors that differ among individuals, but here, we have a bimodal distribution. A quick glance at some summary statistics suggests a reason for this: the average mass for the male rats is noticeably larger than for the female rats (there might also be some diet effects, but that is less clear):

group_by(rats, sex, diet) %>%

summarize(mean = mean(weight_g), sd = sd(weight_g))

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

# A tibble: 6 x 4 # Groups: sex [?] sex diet mean sd <fct> <fct> <dbl> <dbl> 1 female A 141. 5.37 2 female B 144. 5.95 3 female control 139. 4.38 4 male A 161. 4.73 5 male B 166. 4.83 6 male control 162. 3.58

When we think about the process that generates a normal distribution, the word small is important. If a single large source of variation is added to many small sources of variation, then we get something like the histogram seen here. It is a mixture of (at least) two different normal distributions, one for male rats, and one for female rats.

Next, we'll overlay a normal curve on the histogram for the full rat dataset. We already know that a normal distribution will not fit these data – they're strongly bimodal – so if we were analyzing these rat data in real life, we would probably skip this step. However, working through it will demonstrate how the code and the process work.

We will use ggplot() to make this plot. Begin by setting up a plot object that defines weight_g as the variable to be represented on the x-axis:

rat_hist <- ggplot(rats, aes(x = weight_g))

rat_hist <- rat_hist + labs(x = 'Body mass (g)')

rat_hist <- rat_hist + theme_gray(base_size = 18)

The second and third lines here relabel the axis in the plot and increase the font size. They do not affect the plot's information content.

We want to overlay a normal density curve on a histogram for the full data. To do this, we will need to add two different geometries: one for the histogram, and one for the curve. Each new geometry that we include adds a new layer to the plot. Since the layers are added in order, we want to add the histogram first, and then the curve.

To add the histogram layer, we use geom_histogram() as usual, but to make sure that the histogram and the density curve will both be on the same scale, we need to tell geom_histogram() to plot a density histogram instead of the usual frequency histogram (in other words, we need a histogram of relative frequencies instead of absolute frequencies). To do this, we can add aes(y = ..density..) to the geom_histogram() call:

rat_hist <- rat_hist +

geom_histogram(

mapping = aes(y = ..density..),

binwidth = 5,

col = 'black',

fill = 'cornflowerblue'

)

This code tells R that we want the y-axis to represent densities instead of counts, we want the bins to be 5 grams wide, and we want cornflower blue bars with black outlines. You can, of course, pick your own colors and bin widths.

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

There are a couple of different ways to add the density curve. The one that was demonstrated in lecture uses the ggplot2 function stat_function(). Unfortunately, although it works well when you want to make a single graph, stat_function() is not completely compatible with facets (see Section 3.2). Therefore, we will use a different method that is slightly more complex, but ultimately more flexible.

Before we look at the code to add the curve, make sure you understand what we are trying to accomplish. Our goal is to model the population that the rat data came from. With this plot, we are investigating whether or not a model that consists of a single normal distribution would be adequate to serve our needs. If it is adequate, then the shape of our histogram should roughly match the shape of a normal density function with parameters based on our data. To generate the correct comparison curve, we need to (1) calculate the summary statistics from our data, (2) use the estimated parameter values to calculate the normal density values that we want to plot, and then (3) add the normal curve to the graph. The parameters of the normal distribution are a mean, , and a standard deviation, , and they are estimated by the sample mean, ̅, and sample standard deviation, , respectively. Therefore, we need to calculate ̅ and for the rats, and then return a density curve based on those values.

Our reference curve should cover the full range of weights in our data, and should be reasonably smooth. To draw the curve, we will calculate densities at a number of points along the x-axis, and then connect the dots to approximate the normal curve. If the number of points along the x-axis is too small, then the approximation will be jagged, but if it is too large, then we will increase processing time without getting much benefit from it. Feel free to adjust the number of points to fit your own preferences, but 200-300 points spread at equal intervals along the x-axis is usually sufficient to ensure a nice- looking curve, and does not pose any kind of burden for modern computers.

Here is code to calculate the normal parameter estimates, and then return a data frame with (x, y) coordinates to draw the curve:

rat_normals <- with(data = rats, {

# collect summary statistics for weight_g

mean_w <- mean(weight_g)

sd_w <- sd(weight_g)

min_w <- min(weight_g)

max_w <- max(weight_g)

# make density curve

weight_grid <- seq(from = min_w, to = max_w, len = 300)

normal_dens <- dnorm(weight_grid, mean = mean_w, sd = sd_w)

# return data frame with coordinates

data.frame(x = weight_grid, fx = normal_dens)

})

The with() function is part of base R, not dplyr, but it behaves a lot like a dplyr function. It allows you to run a block of code that calls the columns of a data frame by name, without using the dollar-sign ($) operator. The first argument to with() is the data

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

frame that contains the variables you want to work with. The code that you want to run is passed as the second argument to with(), and should always be wrapped in curly brackets ({}) if it includes more than one statement (like ours does here).

The code that I have written in this with() call has three parts. The first four lines get summary statistics for the variable weight_g, including the minimum and maximum as well as the sample mean and standard deviation. Each statistic is saved in a named object. Next, I use these summary statistics to define the curve. The first line in this section uses the seq() function to generate a sequence of 300 evenly spaced values ranging from the minimum of weight_g to its maximum. The len argument tells R that I want 300 values in the output (len is short for length.out). R will figure out how to space them on its own. The second statement evaluates a normal density function at each value on the weight grid, and assigns the results to normal_dens. Finally, the last line returns a data frame with two columns, named x and fx. Each row is a pair of coordinate points for a value along the x-axis and its corresponding probability density (i.e., and

) given ∼ ̅, . This output data frame is assigned ("<-") to the object named rat_normals.

There are a couple of additional things to note about this code. First, the result of the last statement is the only output generated by with(). Everything else (mean_w, weight_grid, etc.) is temporary and gets cleaned up (disappears) as soon as with() finishes running. In addition, there is no reason that I had to name the columns in the output x and fx. If I had wanted to, I could have written data.frame(weight_grid = weight_grid, normal_dens = normal_dens); I used x and fx so the statement would fit on one line in the tutorial text.

Now we have found the summary statistics and created a data frame with the coordinates for points along the curve. The next step is to add the curve to our graph. To do this, we'll use geom_line():

rat_hist1 <- rat_hist +

geom_line(

mapping = aes(x = x, y = fx),

data = rat_normals,

col = 'red',

lwd = 1

)

Notice that for the line, I am completely changing the aesthetic mapping and the dataset being referred to. Instead of the original rats data, I am telling geom_line() to get its data from rat_normals, and I am telling it to use x and fx for the x-axis and y-axis values, respectively. The lwd argument sets the line width (that's a lower-case L in lwd, not a one, but the value is set to 1).

Plotting the graph confirms what we already know: a model which assumes that all of the rat weights come from the same normal distribution does a poor job of fitting these data (Figure 1).

rat_hist1

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

Figure 1: Histogram of rat body masses overlaid with a normal density

curve parameterized using the overall mean and standard deviation of the data.

3.2 Working with subsets of the data: do() and facet_wrap()

Based on the summary results from the top of page 3, we might expect to have better luck if we split the rats up by sex and fit separate models for the females and males. To do this, we can make some small modifications to the code that we used to generate the curves. First, we need to group the rats by sex. Then we need to calculate summary statistics for each group, and finally, we need to generate a curve for each group. We can do this using a pipe, along with a new dplyr function called do() (if you dislike pipes, you can also do this in a pair of assignments – see Appendix):

rat_normals <- rats %>% # using a pipe

group_by(sex) %>%

do(

with(data = ., { # NOTE CHANGE HERE!

mean_w <- mean(weight_g)

sd_w <- sd(weight_g)

min_w <- min(weight_g)

max_w <- max(weight_g)

weight_grid <- seq(from = min_w, to = max_w, len = 300)

normal_dens <- dnorm(weight_grid, mean = mean_w, sd = sd_w)

data.frame(x = weight_grid, fx = normal_dens)

})

) Starting from the top, we are going to redefine the object rat_normals (we are using

the assignment arrow to overwrite its current value). To get the new rat_normals, we

0.00

0.01

0.02

0.03

0.04

130 140 150 160 170

Body mass (g)

d e n s ity

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

will take the data frame rats, group it by the values of the column named sex. Then we will pass each group on to do(), which will do more or less anything we ask it to do. Unlike summarize(), which can only return one row for each group, do() can return as many rows as we want it to return. The code inside the parentheses for do() is almost an exact copy of the previous code from with(). The comments have been removed. In addition, I have made one critically important change that allows it to accept input from do(). On page 4, we said data = rats. Here, I've replaced rats with a lone period (see the line with the comment that reads "# NOTE CHANGE HERE!"). This period tells do() where to put the data that it receives from group_by().

If you're not sure from reading the code how the output will look, use glimpse(rat_normals) to look at it. The do() function generates a whole set of normal density values for each group of rats and appends a column (or columns, if more than one grouping variable is used) to identify which group each row of the output belongs to. As a result, rat_normals now has three columns: sex, x, and fx. To make the new graph, we can use the same code that we used earlier, but I'll name the output rat_hist2 to keep it separate from the previous histogram.

rat_hist2 <- rat_hist + geom_line( mapping = aes(x = x, y = fx), data = rat_normals, col = 'red', lwd = 1 )

Figure 2: Histogram of rat body masses overlaid with a pair of normal

density curves parameterized separately for females (left) and males. The two curves are joined by the straight line in the middle of the graph.

0.00

0.02

0.04

0.06

0.08

130 140 150 160 170

Body mass (g)

d e n s ity

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

Simply plotting rat_hist2 shows a histogram for the full data, along with a bimodal curve with peaks that roughly correspond to the peaks in the histogram (Figure 2).

rat_hist2

But the histogram has shrunk! Why?!? The reason for this has to do with the way that we calculated the normal curves. We generated separate normal density values for female and male rats. These separate curves are conditional probability density functions for weight given sex. Each curve integrates to 1. However, the histogram includes both sexes. Since there are 15 female rats and 15 males, the total density of for each sex is 0.5, not 1. As a result, the histogram is short compared to the curves.

There are two ways we can fix this height-mismatch. One option is to multiply the values for normal_dens in the do() call by a fraction that gives the proportion of the dataset for each sex. Alternatively (and much more easily), we can split the data and generate separate histograms for male and female rats. This is extremely easy; we can simply add the facet_wrap() function when we draw rat_hist2, specifying that we want to split the graph by the variable sex:

rat_hist2 + facet_wrap(~ sex)

… and voilà, two separate histograms, nicely labeled, each with its own normal curve (Figure 3). Unlike the model that assumes the rats' weights all come from a single normal population distribution, the model that assumes separate distributions for females and males appears to work reasonably well.

Figure 3: Histograms of body masses for female (left) and male rats

overlaid with parameterized normal density curves.

female male

130 140 150 160 170 130 140 150 160 170

0.00

0.02

0.04

0.06

0.08

Body mass (g)

d e n s ity

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

The argument to facet_wrap() is called a model formula (that's R jargon, by the way, not statistics jargon). In this case, it tells R that you want to draw separate plots according to the variable sex. The tilde can be read as, "conditional on." For this to work properly, all of the datasets referenced in the graphing instructions must include the variable(s) that are being referenced by facet_wrap(), or R will get confused. For example, try this:

rat_hist2 + facet_wrap(~ diet)

The do() command added sex to rat_normals for us, but we did not group by diet. Since it is not included in rat_normals, we get a set of histograms split by diet that are similar to the original rat_hist2 without the facet_wrap().

3.3 (OPTIONAL) One variable with several different geometries

In the previous Sections 3.1 and 3.2, we used geom_line() to add the normal curve to our plot. If you are still a little mystified/confused/perplexed/intimidated by all of these different geom_'s, it may be helpful to take a moment to consider some of the other ways that we might have added the information in rat_normals to the graph.

For this exercise, we will only use every fifth row in rat_normals. Otherwise, the differences between different plotting methods will be difficult to see clearly. To get a reduced data frame with only rows 1, 6, 11, and so on, we can use the slice() function:

rat_norms <- slice( rat_normals, seq(from = 1, to = nrow(rat_normals), by = 5) )

The new data frame rat_norms is the same as the old data frame rat_normals, except that rat_norms has 1/5 the number of rows. Use glimpse() to check this if you like.

Each row in rat_norms contains the coordinates for one point in the (x, y) plane represented on the graph. To see the points themselves, we can use geom_point(),

rat_hist + geom_point(aes(x=x, y=fx), data=rat_norms)

However, for the graphs in the previous section, we wanted smooth curves, so instead of geom_point(), we used geom_line(), which connected the points together with straight lines:

rat_hist + geom_line(aes(x=x, y=fx), data=rat_norms)

In previous tutorials, we also used geom_step(), which draws a stair-step connection between the points, instead of a straight line:

rat_hist + geom_step(aes(x=x, y=fx), data=rat_norms)

As you can see, the data do not change in any of these graphs. Only the way that we draw the relationship changes.

rat_hist + geom_step(aes(x=x, y=fx), data=rat_norms) +

geom_line(aes(x=x, y=fx), data=rat_norms, col='red') +

geom_point(aes(x=x, y=fx), data=rat_norms, col = 'blue')

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

4. Normal quantile plots

Histogram overlays provide an intuitive way to examine the fit between a model and your dataset. However, the appearance of a histogram can be sensitive to the break points for the bins that you choose to use to build the histogram. For this reason, quantile plots (a.k.a. QQ- plots) are generally the preferred method for examining a distributional model's fit to data.

4.1 The basics

To generate a quantile plot, the variable being examined (weight_g, in this case) should be mapped to the sample aesthetic, as shown in the first line of the following code:

rat_qq <- ggplot(rats, aes(sample = weight_g))

rat_qq <- rat_qq + theme_gray(base_size=18)

rat_qq + geom_qq() + geom_qq_line()

In the last line, geom_qq() generates a normal quantile plot, and geom_qq_line() adds a reference line. If the data come from a normal distribution, the points will fall near this line. If the patterns made by the points is substantially non-linear, then the data are not consistent with a normal distribution. In this case, the weights in the complete rats dataset are definitely non-normal (Figure 4).

It is worth pointing out that we did not need to do any fitting or summarizing of the data to generate this plot. All of the necessary parameterization is done by geom_qq() itself. This is also the case if we add a facet_wrap() to the quantile plot (see Figure 5):

rat_qq + geom_qq() + geom_qq_line() + facet_wrap(~ sex)

Figure 4: Normal quantile plot for rat body masses, including both males

and females in the same distribution. The data are clearly non- normal, an impression confirmed by the histogram in Figure 1.

120

140

160

180

-2 -1 0 1 2

theoretical

s a m

p le

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

Figure 5: Normal quantile plots for body masses of female (left) and male

rats. Splitting the data by sex produces a much better fit to the normal distribution for each group.

As an alternative to facet_wrap(), we can specify colors for the different sexes and

then display both quantile plots on the same graph:

rat_qq <- ggplot(rats, aes(sample = weight_g, col = sex))

rat_qq <- rat_qq + theme_gray(base_size=18) rat_qq + geom_qq() + geom_qq_line()

You may or may not appreciate ggplot2's choice of colors. If you would like to change them, you can use scale_color_manual() to map your own choice of colors to different values of the variable that color encodes (here, sex). For example,

rat_qq + geom_qq() + geom_qq_line() +

scale_color_manual(values = c(female = 'red', male = 'blue'))

(Figure 6). Color coding like this has the advantage that it allows you to display separate quantile plots for different groups at the same time that you use facet_wrap() to display simulated comparison data.

female male

-2 -1 0 1 2-2 -1 0 1 2

140

150

160

170

theoretical

s a m

p le

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

Figure 6: Normal quantile plots for body masses of female (red) and

male rats (blue), both in grams.

4.2 Using predictive simulations to help evaluate model fit

The evaluation of quantile plots is a subjective skill that improves with experience, and both students and beginning researchers often struggle with the classification of borderline plots that may or may not indicate problems. Comparisons against random simulations from the fitted model can help you to identify whether or not a particular plot suggests that distributional assumptions are being violated. Keep in mind, however, that unless your sample size is large enough to reliably detect a lack of fit, failing to find evidence that a model does not fit the data is not the same as finding positive evidence that the model does fit.

The basic idea behind generating comparison sets is that you first estimate the model parameters from your data, and then use those estimates to simulate several new datasets from the parameterized model. The simulated datasets will give you an idea of the range of patterns that you might see if the model were a perfect description of the population. They are intended to provide a visual aid. Nothing more.

The code below generates eight simulations from the model for rat weights that assumes different normal distributions for female and male rats. It is very similar to the code that we used to generate reference curves in Section 3. However, this code does not calculate the minimum and maximum since we don't need them. Instead, after getting the mean and standard deviation for weight, we also get the length of the weight_g vector, and store it to the object n. Next, n, mean_w, and sd_w are used as arguments for the rnorm() function inside replicate() to generate the simulated datasets (notice that the number of simulations is set by nsim, which I have defined outside the with() command [but this placement is not necessary]).

140

150

160

170

-2 -1 0 1 2

theoretical

sa m

p le sex

female male

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

nsim <- 8

rat_sims <- rats %>% # using a pipe

group_by(sex) %>%

do(

with(data = ., { # Remember to use . here!

mean_w <- mean(weight_g)

sd_w <- sd(weight_g)

n <- length(weight_g)

weight_sim <- replicate(nsim, rnorm(n, mean_w, sd_w))

all_weights <- cbind(weight_g, weight_sim)

rep_number <- cbind(0, col(weight_sim))

data.frame(

rep = as.vector(rep_number),

weight_g = as.vector(all_weights)

)

})

)

Recall that replicate() returns a matrix of values. Here, the result will have n rows

(one for each data point) and nsim columns. The cbind() function binds columns together. So, for example, cbind(weight_g, weight_sim) attaches the real data (weight_g) to the weight_sim matrix, making it the new first column of a matrix with nsim + 1 columns (as you might guess, there is also an rbind() function).

An extended explanation – The col() function takes a matrix as its input and

returns a matrix of the same dimensions, where each element gives the column number for itself. To see how some of these functions work, we can go to the console and make a small matrix, z, like this

> z <- matrix(1:9, nrow = 3) > z [,1] [,2] [,3] [1,] 1 4 7 [2,] 2 5 8 [3,] 3 6 9

then,

> col(z) [,1] [,2] [,3] [1,] 1 2 3 [2,] 1 2 3 [3,] 1 2 3

gives the column indices for each value in the matrix. Using cbind() along with col() produces,

> cbind(0, col(z))

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

[,1] [,2] [,3] [,4] [1,] 0 1 2 3 [2,] 0 1 2 3 [3,] 0 1 2 3

Because 0 is a scalar (it has length 1), it is recycled so that we get a column of zeroes. Alternatively, we could say,

> cbind(c(12,1,47), col(z))

[,1] [,2] [,3] [,4] [1,] 12 1 2 3 [2,] 1 1 2 3 [3,] 47 1 2 3

and attach a vector. If the length of the vector does not allow either the vector or Finally, in our data.frame() call to generate the output, as.vector() converts the matrix into a vector. For instance,

> as.vector(z)

[1] 1 2 3 4 5 6 7 8 9

(c(z) would do the same thing). You can use,

> rm(z)

to get rid of matrix z now that this demonstration is done.

Back to the rats! Run the code to generate rat_sims, and then take a glimpse() at it. You'll see that it has three columns: one for weight in grams, one for replicate number (with a 0 for the real data), and one for the sex of the rat. The last one was attached by do(). If we grouped by a different variable (or multiple variables), then this would be reflected in our output. To check the model, we can now combine the color- coding and facet_wrap() approaches used earlier:

rat_qq <- ggplot(rat_sims, aes(sample = weight_g, col = sex))

rat_qq <- rat_qq + theme_gray(base_size=18) rat_qq + geom_qq() + geom_qq_line() + facet_wrap(~rep)

Nothing stands out that would clearly identify Plot 0 (the real data) as different from the others. Therefore, based on these results, there is little reason to think that the model with separate normal distributions for female and male rats does not fit the data. We could therefore proceed with caution under the assumption that this model does a reasonable job of describing the population of rat weights. Some caution is warranted, however, since we only have 15 data points for each group.

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

Figure 7: Normal quantile plots for body masses of female (red) and

male rats (blue), both in grams. Plot 0 shows the real data, and plots 1-8 show simulated data from the fitted model. The fit for the real data appears consistent with the simulations.

4.3 Effects of sample size on fit evaluations

Our model checking results in Section 3.2 gave us little reason to seriously distrust our model. This result is consistent with the results in Section 2.2, where we found a reasonably good match between our empirical histograms and theoretical curves. It is fair to ask, however, whether we have sufficient sample size (power) to detect a serious violation of our model assumptions. To help you get a feeling for how these tools behave with different quantities of data, we will run the quantile simulation for the original all- rats-from-one-normal model, using progressively smaller subsets of the data.

Start by generating the quantile plot with simulations for the full dataset, without grouping by sex. To do this, you can simply comment out the group_by() line in the code on page 13 (or other remove the group_by() statement from your code). You will also need to remove col = sex from the ggplot() command to avoid an error. The original data have 30 points, which is usually considered sufficient spot major distributional violations. Based on your plots, would you be confident rejecting the idea that these data came from a normal distribution. What features of Plot 0 would you point to if you needed to convince a skeptical audience that your interpretation was correct (i.e., what about Plot 0 makes it different from the others)?

Next, we'll subsample the data and remake the plot. To generate a new data frame that is a random subset of the original, you can use,

nsub <- 20

indices <- sample(nrow(rats), size = nsub)

6 7 8

3 4 5

0 1 2

-2 -1 0 1 2-2 -1 0 1 2-2 -1 0 1 2

130

140

150

160

170

130

140

150

160

170

130

140

150

160

170

theoretical

s a m

p le sex

female male

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

rat_sub <- slice(rats, indices)

In the first line, nsub is simply the number of data points that you want to keep in the subsample. The second line introduces the sample() command; its behavior depends on what its first argument looks like. Here, the first argument is the number of rows in the rats dataset (i.e., 30). When given a single integer, sample() will generate a random sample of integers from 1 to the argument value. The size argument tells it how many values to return. So here, we get 20 random numbers between 1 and 30. All of the values have an equal chance of being selected, and the sampling is done without replacement (we could change this by setting replace = TRUE). The selected rows are placed in the object named indices; then the slice() function returns a data frame containing only those rows. If you are having trouble following this explanation, run the code, and then print out both rats and rat_sub, and compare the two data frames (and then do it again).

Rerun the quantile plots once at each sample size using subsamples of 25 rats, 20 rats, 10 rats, and 5 rats. We know that in fact, the rat weights do not come from a normal distribution. However, is there a sample size below which you can no longer reliably tell from the data that the distribution of rat weights is not normal? If possible, you may want to compare your graphs with those generated by other students in the class and see if they reach similar conclusions.

4.4 Different ways for a model not to fit

In lecture, we discussed several patterns that can arise in quantile plots when the data do not conform to a normal distribution. For review, these are summarized if Figure 8. Take a moment to compare Figures 4 and 8. Can you think of a reason that the bimodality of the rats data would show up as "short-tailed" in a quantile plot? It may help to look at Figure 1 and imagine how the curve would look if you extended it until the tails got near zero.

All of the patterns in Figure 8 use continuous data, but discrete data also show distinct patterns in quantile plots. To see this, we can use rbinom() to simulate several replications of a binomial trial. For instance,

bindata <- data.frame(x = rbinom(100, size = 5, prob = 0.3))

simulates 100 replicates of a trial with sample size 5 and 0.3. Make a quantile plot of these data and take note of any patterns that you notice. Then repeat the process with sample sizes of 10, 50, 500, and 5000. What happens to the pattern from the original graph as the sample size increases?

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

Figure 8: Normal quantile plots for various types of non-normal data.

"Short" and "long-tailed" data are also referred to as "light" and "heavy-tailed," respectively.

5. Unknown distributions

Please load the dataset unknowns.csv. This dataset contains four variables, A, B, C, and D. For each variable, apply the tools that you have learned in this tutorial, as well as glimpse() or other tools (e.g., try is.integer()), to answer the following questions:

1. Does a normal model fit the data?

2. If a normal distribution does fit the data, please report the point estimates for the mean and standard deviation.

3. If the normal distribution does not fit the data, in what way does it not fit, and how can you tell?

Short-tailed Long-tailed

Right-skewed Left-skewed

-2 -1 0 1 2 -2 -1 0 1 2

-4

-2

-4

-2

theoretical

s a m

p le

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

Appendix: Pipe alternatives

Just like writing styles, programming styles differ. Some people love pipes, and some hate them. Here are three different ways to write the code from Section 3.3 that generates different curves for female and male rats. The original code using a pipe:

rat_normals <- rats %>% # using a pipe

group_by(sex) %>%

do(

with(data = ., { # NOTE CHANGE HERE!

mean_w <- mean(weight_g)

sd_w <- sd(weight_g)

min_w <- min(weight_g)

max_w <- max(weight_g)

weight_grid <- seq(from = min_w, to = max_w, len = 300)

normal_dens <- dnorm(weight_grid, mean = mean_w, sd = sd_w)

data.frame(x = weight_grid, fx = normal_dens)

})

)

Same thing, but with a series of assignment arrows:

rat_groups <- group_by(rats, sex) # using assignments

rat_normals <- do(

rat_groups,

with(data = ., { # NOTE CHANGE HERE!

mean_w <- mean(weight_g)

sd_w <- sd(weight_g)

min_w <- min(weight_g)

max_w <- max(weight_g)

weight_grid <- seq(from = min_w, to = max_w, len = 300)

normal_dens <- dnorm(weight_grid, mean = mean_w, sd = sd_w)

data.frame(x = weight_grid, fx = normal_dens)

})

)

Lichti/STAT 503 – Statistical Methods for Biology Modified: 2018‐28‐10

Same thing again, this time with a single set of nested function calls:

rat_normals <- do(

group_by(rats, sex), # nested functions

with(data = ., { # NOTE CHANGE HERE!

mean_w <- mean(weight_g)

sd_w <- sd(weight_g)

min_w <- min(weight_g)

max_w <- max(weight_g)

weight_grid <- seq(from = min_w, to = max_w, len = 300)

normal_dens <- dnorm(weight_grid, mean = mean_w, sd = sd_w)

data.frame(x = weight_grid, fx = normal_dens)

})

)