STA 622 Lab 2: Making graphs with R

Whenever you use R, include your relevant code and relevant output as part of your submission, as that is how we show work. Please get in the habit of commenting your code at all times (with the # sign). Always work from a script, as on Beckerman, P15-19, and load dplyr and ggplot2 with the library() function at the top of every script, as on Beckerman, P23. You can upload a Word document or PDF to Blackboard, with your answers, R code, and R output organized by each of the following parts.
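As a rough illustration of the script habits described above, here is a minimal sketch of how a lab script might begin. The object name my_data and the use of the built-in iris data set are placeholders chosen for this example, not something taken from Beckerman.

```r
# Lab 2 script (illustrative skeleton; adapt to your own data)

# Load the packages used in this lab at the top of the script
library(dplyr)
library(ggplot2)

# Placeholder data: the built-in iris data set stands in for your own data
my_data <- iris

# Look at the first few rows to confirm the data loaded as expected
head(my_data)
```

Every step gets a comment, and the library() calls come first so that anyone running the script from the top has the packages available.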

I agree wholeheartedly with the first paragraph on page 79. Graphs are always the starting point for any analysis. This lab will have you work with the R package called ggplot2, which builds graphs in layers and gives you a lot of control over how they look.

I recommend you reread Beckerman Chapter 4 prior to doing this lab (recall I asked you to read the preface and chapters 1-4 several weeks before the semester started).

You’ll need to import a data set into R; it should have at least two quantitative variables and at least one categorical variable. If the data set you used for Lab 1 meets that requirement, feel free to use it. You do not have to show how you imported the data into R, though the import can simply be part of your script. As always, verify that the data are in long format before proceeding.
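One way to check the requirements above is sketched here. To keep the example self-contained, it writes the built-in iris data to a temporary CSV file and reads it back; in your own script you would point read.csv() at your own file instead. The names csv_path, lab_data, n_quant, and n_cat are made up for this illustration.

```r
library(dplyr)

# Stand-in for your own file: write a built-in data set to a temporary CSV
csv_path <- tempfile(fileext = ".csv")
write.csv(iris, csv_path, row.names = FALSE)

# Import the data; stringsAsFactors = TRUE stores text columns as factors
lab_data <- read.csv(csv_path, stringsAsFactors = TRUE)

# In long format, each row is one observation and each column one variable
glimpse(lab_data)

# Count quantitative (numeric) and categorical (factor) columns
n_quant <- sum(sapply(lab_data, is.numeric))
n_cat <- sum(sapply(lab_data, is.factor))
cat("Quantitative variables:", n_quant, "\n")
cat("Categorical variables:", n_cat, "\n")
```

If a categorical variable is coded with numbers in your file, it will be counted as numeric here, so treat the counts as a quick check rather than a final answer.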

1. The primary graphical summary of the relationship between two quantitative variables is called a scatterplot, which is just a plot of (x,y) pairs. Please always be sure to put the response variable on the y-axis and the explanatory variable on the x-axis.

(a) Create a script exactly like the one on Beckerman P82, but using your own data. Show your code and the output (the output from both glimpse() and ggplot()).

(b) Include the changes to ggplot() shown on Beckerman P84. Show your code and output. If your categorical variable has more than 3 categories, you can subset the data to have fewer categories for the purposes of this lab if you prefer (please explain that you did this, and show your code for it).
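For orientation, here is a generic sketch of a scatterplot in ggplot2, followed by a styled version and an example of subsetting to fewer categories. It uses the built-in iris data and is not a copy of the Beckerman code; the variable choices, labels, and object names (p_basic, p_styled, iris_two) are assumptions made for this example.

```r
library(dplyr)
library(ggplot2)

# Check the structure of the data first
glimpse(iris)

# Basic scatterplot: explanatory variable on x, response variable on y
p_basic <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point()
print(p_basic)

# Styled version: colour points by the categorical variable,
# enlarge the points, relabel the axes, and use a plain theme
p_styled <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width,
                             colour = Species)) +
  geom_point(size = 3) +
  xlab("Petal length (cm)") +
  ylab("Petal width (cm)") +
  theme_bw()
print(p_styled)

# Subsetting to fewer categories: keep two of the three species
# and drop the unused factor level
iris_two <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  droplevels()
```

The subset iris_two can then be passed to ggplot() in place of the full data set.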

2. When your response variable is quantitative and the explanatory variable is categorical, the most common kind of plot is the side-by-side boxplot (also called a box-and-whisker plot).

(a) First implement the code at the top of Beckerman P86 (the code without the geom_point() function) with x being your categorical variable, and y being one of the quantitative variables that might be thought to be related to the categorical variable. Show your code and output.

(b) Implement the code in the middle of Beckerman P86 (the code with the geom_point() function). Explain what the addition of the geom_point() function does. Show your code and output.

(c) As you can see, the default in ggplot() for boxplots is to make them vertical. Some people prefer them to be horizontal. Doing some googling on your own, determine how to make the boxplots horizontal. Show your code and output.

(d) Explain how, looking only at the side-by-side boxplots, you can compare each of the following across the groups defined by the categorical variable.

(i) Center of the distribution (which group has higher center)

(ii) Variability of the distribution (which group has higher variability)

(iii) Shape of the distribution (symmetric or right skewed or left skewed)

(iv) Outliers (are there any, and if so, on the high end or low end, and how extreme)
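The features listed in (d) have numeric counterparts that can be computed alongside the plot. The sketch below, again using the built-in iris data as a stand-in, draws a side-by-side boxplot, overlays the raw points, and then summarises each group. The object names are illustrative, and the outlier count uses boxplot.stats(), whose 1.5 x IQR rule may differ slightly from what geom_boxplot() draws.

```r
library(dplyr)
library(ggplot2)

# Side-by-side boxplot: categorical variable on x, quantitative on y
p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_boxplot() +
  theme_bw()
print(p_box)

# Same plot with the raw observations drawn on top of the boxes
p_box_points <- p_box +
  geom_point(size = 2, colour = "grey40", alpha = 0.5)
print(p_box_points)

# Numeric counterparts of what the boxes display, one row per group
box_summary <- iris %>%
  group_by(Species) %>%
  summarise(
    median = median(Sepal.Length),   # centre: the line inside the box
    iqr = IQR(Sepal.Length),         # variability: the height of the box
    n_outliers = length(boxplot.stats(Sepal.Length)$out)  # flagged points
  )
print(box_summary)
```

Shape is read from the plot itself: a median sitting off-centre in the box, or one whisker much longer than the other, suggests skew in that direction.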

3. A histogram is another kind of graph that is useful for quantitative variables, provided that the sample size is sufficiently large. A histogram is essentially a bar graph of the frequencies of the data carved up into classes or bins.

(a) One drawback to histograms is that the shape of the histogram is dependent on the number or width of the bins. Using the code on Beckerman P89, play around with either the number of bins or binwidth, to show two different histograms of the same data. Show your code and output.

(b) You can also easily make side by side histograms, dividing the data by your categorical variable. Use the code on Beckerman P90 to do so for your data. Show your code and output.
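To illustrate how bin choices change a histogram, here is a small sketch using the built-in iris data. It is not the Beckerman code; the bin settings and object names (p_few_bins, p_narrow_bins, p_by_group) are arbitrary choices for this example.

```r
library(ggplot2)

# Few, wide bins: a smooth but coarse picture of the distribution
p_few_bins <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 8)
print(p_few_bins)

# Many narrow bins: more detail, but a noisier shape
p_narrow_bins <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.1)
print(p_narrow_bins)

# One panel per level of the categorical variable, with a shared x-axis
p_by_group <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25) +
  facet_wrap(~ Species)
print(p_by_group)
```

Setting bins fixes how many bars are drawn, while binwidth fixes how wide each bar is in the units of the variable; you specify one or the other.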

4. R makes presentation-quality graphs, so you’ll want to be able to save them to put into presentations or manuscripts. Try both versions of saving a graph on Beckerman P91. The only thing you have to turn in for this problem is which one you prefer and why.
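One scripted way to save a graph is ggplot2's ggsave() function, sketched below. Whether this matches either of the versions on Beckerman P91 is an assumption; the file name, dimensions, and resolution are arbitrary choices for this example, and the file is written to a temporary folder so the sketch does not clutter your working directory.

```r
library(ggplot2)

# A plot to save (built-in data used as a stand-in)
p <- ggplot(iris, aes(x = Petal.Length, y = Petal.Width)) +
  geom_point() +
  theme_bw()

# Build an output path in R's temporary directory
out_file <- file.path(tempdir(), "scatter_example.png")

# Save the plot; width and height are in inches by default
ggsave(filename = out_file, plot = p, width = 6, height = 4, dpi = 300)

# Confirm that the file was written
file.exists(out_file)
```

Saving from the script keeps the size and resolution reproducible, which matters when a figure has to be regenerated after the data change.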