Statistics Data Project using RStudio
DATA PROJECT INSTRUCTIONS & DOCUMENTATION
Please read the following instructions carefully. You will be penalized heavily if you do not follow the instructions.
1. There are 3 different datasets in your “DATA PROJECT” folder in D2L. They are called diamond, taxi and risk.
· If your last name begins with the letters A – H, use the data set diamonds for your project
· If your last name begins with the letters I – Q, use the data set taxi for your project
· If your last name begins with the letters R – Z, use the data set risk for your project
NOTE: Using the wrong data set automatically costs you 50% of your project grade.
2. All graphs and analyses are to be done with R. An example with code will be provided in a separate document. Look at the “examplewithoutcode” document to see what your project should look like when you upload it to D2L. You must save your project as a PDF file before you upload it to D2L. Only the last document you upload before the due date will be graded.
3. There are two parts in this project: Part 1 involves graphical representation and summary of data (descriptive statistics), and Part 2 deals with inferential statistics. Each part will be scored independently.
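The last-name rule in instruction 1 can be written out as a small lookup. This is only an illustrative sketch: the function name dataset_for is hypothetical, and the returned labels simply repeat the data set names used in the bullets above.

```r
# Hypothetical helper: map a last name to the data set assigned in instruction 1.
dataset_for <- function(last_name) {
  # Take the first letter of the name, ignoring case and surrounding spaces.
  initial <- toupper(substr(trimws(last_name), 1, 1))
  if (initial %in% LETTERS[1:8]) {          # A through H
    "diamonds"
  } else if (initial %in% LETTERS[9:17]) {  # I through Q
    "taxi"
  } else if (initial %in% LETTERS[18:26]) { # R through Z
    "risk"
  } else {
    NA_character_                           # not a letter A-Z; ask your instructor
  }
}

dataset_for("Hernandez")
dataset_for("ito")
dataset_for("Smith")
```

If your last name starts with a character outside A to Z, the sketch returns NA rather than guessing.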
Data Documentation
Below are the sources and descriptions of the data sets used in this project:
Diamonds
Your data set is a subset of a very large data set included with the ggplot2 package in R. It contains information about 400 diamonds that were listed for sale.
To read in the diamond data (follow the same procedure for the other data sets):
a. Download the “diamond” file from D2L and save it to your desktop.
b. Open RStudio. Under the “Environment” tab on your right, click on “Import Dataset” and select “From CSV…”
c. Click “Browse” and locate your data file. Then click “Import”.
You should see something like the following in your console:
library(readr)
diamonds <- read_csv("…diamonds.csv")
View(diamonds)
Your data is now loaded!
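After importing, it helps to confirm that the data frame has the rows and columns you expect. The sketch below is a self-contained illustration, not the required procedure: it writes a tiny stand-in CSV with invented values to a temporary file so that it runs anywhere, and it uses base R's read.csv() instead of readr's read_csv(). In your project, the path would point to the file you saved from D2L.

```r
# Stand-in for the downloaded file: three invented rows, NOT the real project data.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "carat,price,cut,color,clarity",
  "0.30,500,Ideal,E,VS1",
  "0.71,2800,Premium,G,SI1",
  "1.01,5200,Good,J,SI2"
), csv_path)

# Read the CSV into a data frame (base R; no extra packages needed).
diamonds <- read.csv(csv_path, stringsAsFactors = FALSE)

# Quick checks on what was loaded.
nrow(diamonds)    # number of observations
names(diamonds)   # column names
str(diamonds)     # type of each column
head(diamonds)    # first few rows
```

With the real file, nrow() should match the number of observations stated in the description of your data set.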
Variable Descriptions
carat: Weight of the diamond.
price: price in US dollars
cut: quality of the cut, classified as Fair, Good, Very Good, Premium, or Ideal.
color: diamond color, classified from J (worst) to D (best)
clarity: a measurement of how clear the diamond is, classified as I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)
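Because cut has a natural order, R can be told about that order so that tables and graphs list the categories from worst to best instead of alphabetically. The following is a minimal sketch on an invented vector of cut values (not the project data), using the level order given in the description of cut above.

```r
# Invented example values, NOT taken from the project data.
cut_raw <- c("Ideal", "Fair", "Premium", "Good", "Ideal", "Very Good")

# Declare the categories as an ordered factor, from Fair (worst) to Ideal (best).
cut <- factor(
  cut_raw,
  levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"),
  ordered = TRUE
)

# Counts now appear in quality order rather than alphabetical order.
table(cut)

# Ordered factors also support comparisons such as "Premium or better".
sum(cut >= "Premium")
```

The same approach could be applied to color and clarity once their level order is written out in full.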
Taxi
The data set contains a sample of taxi trips in New York City during the month of June 2017. The data comes from the NYC Taxi and Limousine Commission website at http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
Variable Descriptions
distance: The distance of the trip, in miles.
minutes: The number of minutes between the pickup and drop off times recorded by the meter.
fare: The total fare paid by the customer in US dollars
payment: The method of payment, classified as Cash or Card
call: Whether the taxi was hailed on the street (Street_Hail) or sent by the dispatch office (Dispatch).
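The taxi variables are a mix of numeric measurements (distance, minutes, fare) and categories (payment, call). As a hedged illustration of how the two kinds are usually inspected in R, the sketch below builds a four-row data frame with invented values that only mimics the column names listed above. It is not the project data.

```r
# Invented rows that mimic the documented columns; NOT the real taxi data.
taxi_sample <- data.frame(
  distance = c(1.2, 3.5, 0.8, 9.6),
  minutes  = c(8, 19, 5, 31),
  fare     = c(7.5, 15.0, 5.5, 32.0),
  payment  = c("Card", "Cash", "Card", "Card"),
  call     = c("Street_Hail", "Street_Hail", "Dispatch", "Street_Hail"),
  stringsAsFactors = FALSE
)

# Numeric variables: minimum, quartiles, mean and maximum.
summary(taxi_sample[, c("distance", "minutes", "fare")])

# Categorical variables: a two-way table of counts.
counts <- table(taxi_sample$payment, taxi_sample$call)
counts
```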
Risk
From https://www.cdc.gov/brfss/ : The Behavioral Risk Factor Surveillance System (BRFSS) is the nation's premier system of health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services.
Your data is a subset of the data collected on 500 people during the year 2000.
Variable Descriptions
age: The age of the respondent in years.
gender: The gender of the respondent, classified as m or f.
height: The height of the respondent in inches.
weight: The weight of the respondent in pounds.
genhlth: The self-reported general health status of the respondent, classified as poor (worst), fair, good, very good, or excellent.
smoke100: Whether the respondent has smoked at least 100 cigarettes in their lifetime, classified as yes or no.
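One way to confirm that a loaded data set matches the documentation is to check that each categorical variable contains only the documented categories. The sketch below does this on three invented rows (not the project data). The exact spelling and capitalization of the categories in the real file are an assumption based on the descriptions above.

```r
# Invented rows that mimic the documented columns; NOT the real risk data.
risk_sample <- data.frame(
  age      = c(34, 58, 21),
  gender   = c("m", "f", "f"),
  height   = c(70, 64, 66),
  weight   = c(180, 140, 125),
  genhlth  = c("good", "very good", "excellent"),
  smoke100 = c("no", "yes", "no"),
  stringsAsFactors = FALSE
)

# Categories taken from the variable descriptions (spelling assumed).
allowed <- list(
  gender   = c("m", "f"),
  genhlth  = c("poor", "fair", "good", "very good", "excellent"),
  smoke100 = c("yes", "no")
)

# For each categorical variable, TRUE if every value is a documented category.
matches_doc <- sapply(names(allowed), function(v) {
  all(risk_sample[[v]] %in% allowed[[v]])
})
matches_doc
```

A FALSE for any variable would suggest a typo, a different spelling, or a missing value worth looking at before starting the analysis.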