R Studio Project Chapter 9
Northampton Community College Introductory Statistics Chapter 9 Project Confidence Intervals
SPRING 2022
Confidence Intervals
Introduction
In this project you will use confidence intervals to determine whether new wood-burning stoves generate less pollution than the old wood-burning stoves. The amount of pollution generated by the old stoves is well established from good data. The new stoves have not yet been evaluated, so samples of air pollution are taken after the new stoves are installed. Because the pollution level of the new stoves is not yet known, you will estimate it with confidence intervals built from these new samples.
The four pollutants tested are particulate matter (PM), total carbon (TC), organic carbon (OC), which is carbon bound in organic molecules, and levoglucosan (LE).
The pollution output of the old stoves is given by the baseline levels, which are compared with the levels of pollution measured for the new stoves. If the baseline level for a pollutant is within the confidence interval, there is no reason to believe the new stoves differ from the old stoves. If the baseline level is not within the confidence interval, it is reasonable to conclude that the pollution level of the new stoves differs from that of the old stoves. In particular, if the baseline level is above the upper limit of the confidence interval, it is reasonable to conclude that the new stoves generate less pollution than the old stoves.
Baseline levels:
Particulate Matter (PM): 27.08 micrograms per cubic meter
Total Carbon (TC): 18.87 micrograms per cubic meter
Organic Carbon (OC): 17.41 micrograms per cubic meter
Levoglucosan (LE): 2.57 micrograms per cubic meter
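The decision rule described above can be written as a small helper. This is only a sketch: the names `baseline` and `compare_to_baseline` are illustrative and are not part of the project code, and the interval used in the last lines is a made-up placeholder, not a result from the data.

```r
# Baseline levels from the list above (micrograms per cubic meter)
baseline <- c(PM = 27.08, TC = 18.87, OC = 17.41, LE = 2.57)

# Hypothetical helper: compares a baseline value with a confidence
# interval given as c(lower, upper)
compare_to_baseline <- function(base, ci) {
  if (base > ci[2]) {
    "less pollution"    # baseline is above the whole interval
  } else if (base < ci[1]) {
    "more pollution"    # baseline is below the whole interval
  } else {
    "no difference"     # baseline falls inside the interval
  }
}

# Placeholder interval for illustration only (not computed from the data)
example_ci <- c(10, 20)
compare_to_baseline(baseline["PM"], example_ci)
```

The helper only automates the comparison; each conclusion in the project still needs to be explained in context.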
Your project must have 8 boxplots, 8 confidence intervals and 8 conclusions. For each pollutant you will generate a boxplot based on the data in the Excel file. Then you will create a confidence interval. Finally you will decide if the new stove is generating more or less pollution than the old stove or if there is no difference. The decision is based on the baseline number in comparison with the confidence interval for the pollutant.
Code to complete the project is given on pages 5 and 6.
Data file: Chapter_9_Project_CI_Pollution.xlsx
Make sure you answer all questions in context and print out any outputs and graphs that you perform in R. These printouts will serve as justification for your work. For all probabilities round your answers to 4 decimal places. All graphs must have titles and labeled axes.
The town of Libby, Montana, has experienced high levels of air pollution in the winter because many of the homes in Libby are heated by wood stoves that produce a lot of pollution. In an attempt to reduce the level of air pollution in Libby, a program was undertaken in which almost every wood stove in the town was replaced with a newer, cleaner-burning model. Measurements of several air pollutants were taken both before and after the stove replacement. They included particulate matter (PM), total carbon (TC), organic carbon (OC), which is carbon bound in organic molecules, and levoglucosan (LE), which is a compound found in charcoal and is thus an indicator of the amount of wood smoke in the atmosphere.
In order to determine how much the pollution levels were reduced, scientists measured the levels of these pollutants for three winters prior to the replacement. The mean levels over this period of time are referred to as the baseline levels. Following are the baseline levels for these pollutants. The units are micrograms per cubic meter.
PM: 27.08 TC: 18.87 OC: 17.41 LE: 2.57
The data in Blackboard titled Chapter_9_Project_CI_Pollution.xlsx provides values measured on samples of winter days during the two years following replacement. For each pollutant, 20 measurements were taken. The first-year measurements are in PM1, OC1, TC1 and LE1. The second-year measurements are in PM2, OC2, TC2 and LE2.
Year 1
For each of the four pollutants measured in the first year following replacement (PM1, OC1, TC1 and LE1), complete the following and include the graphs and statistics in your project.
1. Construct a boxplot for the values for Year 1 to verify that the assumptions for constructing a confidence interval are satisfied. [You should have one boxplot for each measure for a total of 4 boxplots.] Make sure you elaborate on these assumptions and use the boxplot to justify your results. [Do the boxplots show that the distribution of values is symmetrical?]
2. Report the median, mean and standard deviation for each sample.
3. Report the shape of the distribution of data. Is it symmetrical or skewed?
4. Construct a 95% confidence interval (CI) for the mean level of each pollutant for Year 1. Make sure you interpret each CI in context.
5. Based on the confidence interval, is it reasonable to conclude that the mean level in Year 1 was lower than the baseline level for the pollutant? Explain your results in context. [Use the confidence interval to decide. Is the baseline number within the confidence interval for the pollutant? If the baseline number is not within the confidence interval, what can you conclude?]
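As a rough numerical companion to reading the boxplot in Step 1, the sketch below lists the points a boxplot would draw as outliers and compares the mean with the median. The vector `x` holds made-up placeholder values, not project data; in the project you would substitute a column such as `Chapter_9_Project_CI_Pollution$PM1`.

```r
# Placeholder values for illustration only (not from the project data)
x <- c(12, 14, 15, 15, 16, 17, 18, 18, 19, 35)

# boxplot.stats() returns the five numbers drawn by boxplot()
# and the points that lie beyond the whiskers
bs <- boxplot.stats(x)
bs$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
bs$out     # values flagged as outliers

# A mean well above the median suggests right skew; well below, left skew
mean(x) - median(x)
```

These numbers support, but do not replace, the written discussion of the assumptions that Step 1 asks for.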
Year 2
The investigators were concerned that the reduction in pollution levels might be only temporary. Specifically, they were concerned that people might use their new stoves carefully at first, thus obtaining the full advantage of their cleaner burning, but then become more casual in their operation, leading to an increase in pollution levels. You will investigate this issue by constructing confidence intervals for the mean levels in Year 2. Again, 20 measurements were taken for each pollutant.
Repeat Steps 1 through 5 for each of the four pollutants in Year 2: PM2, OC2, TC2 & LE2. [Construct 4 boxplots using R. Construct 4 confidence intervals.]
Boxplots:
Function for boxplots:
boxplot(dataset$variable, main="title", xlab="label x axis", horizontal = T)
Example:
boxplot(Chapter_9_Project_CI_Pollution$PM1, main="Particulate Matter Year 1", xlab="Particulate Matter", horizontal=TRUE)
Function for median: median(dataset$variable)
Example: median(Chapter_9_Project_CI_Pollution$PM1)
Function for mean: mean(dataset$variable)
Example: mean(Chapter_9_Project_CI_Pollution$PM1)
Function for standard deviation: sd(dataset$variable)
Example: sd(Chapter_9_Project_CI_Pollution$PM1)
Confidence Interval for µ when σ is unknown.
Given a simple random sample of size n with sample mean $\bar{x}$ and sample standard deviation s, and assuming a normal distribution or a large enough sample size, the confidence interval for µ when σ is unknown is:
$$\bar{x} - M < \mu < \bar{x} + M$$
where $\bar{x}$ = sample mean of a random sample and M = the margin of error.
$$M = t_c \cdot \frac{s}{\sqrt{n}}$$
c = confidence level (0 < c < 1)
$t_c$ = critical value for confidence level c and degrees of freedom d.f. = n - 1
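For the sample size used in this project, the margin of error can be worked partway before looking at any data. With n = 20 the degrees of freedom are 19, and for c = 0.95 a t-table gives a critical value of about 2.093. The sample standard deviation s is left symbolic because it differs for each pollutant.

```latex
M = t_c \cdot \frac{s}{\sqrt{n}}
  \approx 2.093 \cdot \frac{s}{\sqrt{20}}
  \approx 0.468\, s
```

So each 95% interval in this project extends a little less than half a sample standard deviation on either side of the sample mean.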
R code for Confidence Interval (σ is unknown) using t distribution.
Function for (Student) t-distribution critical value: qt(value of α/2, degrees of freedom)
Function for absolute value: abs(number or variable or function)
Function for square root: sqrt(number or variable or function)
Example:
alpha05 = 0.05 #significance level α for a 95% confidence level
n = 20 #sample size
tcrit = abs(qt(alpha05/2, n-1)) #calculates the t-critical value for a confidence interval with n - 1 d.f. when the confidence level is 1-α
xbarPM1 = mean(Chapter_9_Project_CI_Pollution$PM1)
sPM1 = sd(Chapter_9_Project_CI_Pollution$PM1)
stderrPM1 = sPM1/sqrt(n) #calculates the standard error for sample mean
MEPM1= tcrit*stderrPM1 #calculates the margin of error
CIPM1 = c(xbarPM1-MEPM1, xbarPM1+MEPM1) #calculates the confidence interval
CIPM1 #prints out the confidence interval
Refer to page 40 in Introduction to and R-Studio
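One way to check the step-by-step calculation is to compare it with the interval reported by R's built-in `t.test()` function. The sketch below uses made-up placeholder values in `x`, not project data, and the names `manual_ci` and `builtin_ci` are illustrative.

```r
# Placeholder values for illustration only (not from the project data)
x <- c(21.3, 18.7, 25.1, 19.9, 22.4, 20.8, 23.6, 17.5, 24.2, 20.1)

n <- length(x)
tcrit <- abs(qt(0.05 / 2, n - 1))   # t critical value for 95% confidence
ME <- tcrit * sd(x) / sqrt(n)       # margin of error
manual_ci <- c(mean(x) - ME, mean(x) + ME)

# t.test() reports the same interval in its conf.int component
builtin_ci <- as.numeric(t.test(x, conf.level = 0.95)$conf.int)

manual_ci
builtin_ci
all.equal(manual_ci, builtin_ci)
```

If the two intervals agree, the manual steps were entered correctly.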
____________________________________________________________________________
Proposed R code
Set values for t-critical value [Use same values for all pollutants—there is no need to repeat these lines of code for each pollutant]
> alpha05 = 0.05 #significance level α for a 95% confidence level
> n = 20 #sample size
> tcrit = abs(qt(alpha05/2, n-1)) #calculates the t-critical value for a confidence interval with n - 1 d.f. when the confidence level is 1-α
Create Boxplot
> boxplot(Chapter_9_Project_CI_Pollution$PM1, main="Particulate Matter Year 1", xlab="Particulate Matter", horizontal=TRUE)
Compute the summary statistics [median, mean and standard deviation]
> median(Chapter_9_Project_CI_Pollution$PM1) #median of PM1
> xbar = mean(Chapter_9_Project_CI_Pollution$PM1) #assign the mean of PM1 to xbar
To display the mean, $\bar{x}$, for the sample, type xbar on the command line and execute [press the ENTER key]
> xbar
> s = sd(Chapter_9_Project_CI_Pollution$PM1) #assign the standard deviation of PM1 to s
To display the standard deviation, s, for the sample, type s on the command line and execute [press the ENTER key]
> s
> stderr = s/sqrt(n) #calculates the standard error [standard deviation of the sample mean]
Calculate margin of error and bounds for confidence interval
> ME= tcrit*stderr #calculates the margin of error
> CI = c(xbar-ME, xbar+ME) #calculates the confidence interval and assigns it to CI
> CI #prints out the confidence interval
To create the boxplot and get the confidence interval for each of the other pollutants, change the variable name in the boxplot code, reassign xbar to the mean of that sample and s to its standard deviation, then copy, paste, and execute the remaining lines of code above.
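As an alternative to copying and pasting, the same calculation can be wrapped in a function and applied to several columns at once. This is a sketch under stated assumptions: `ci_mean` is a hypothetical helper, and `demo` is a small made-up data frame standing in for `Chapter_9_Project_CI_Pollution`.

```r
# Hypothetical helper: t-based confidence interval for the mean of x
ci_mean <- function(x, conf = 0.95) {
  n <- length(x)
  tcrit <- abs(qt((1 - conf) / 2, n - 1))   # t critical value, n - 1 d.f.
  ME <- tcrit * sd(x) / sqrt(n)             # margin of error
  c(lower = mean(x) - ME, upper = mean(x) + ME)
}

# Placeholder data frame for illustration only (values are made up);
# in the project, use Chapter_9_Project_CI_Pollution instead
demo <- data.frame(
  PM1 = c(20, 22, 19, 24, 21, 23),
  OC1 = c(11, 13, 12, 10, 14, 12)
)

# One row per column: lower and upper limits of the 95% interval
t(sapply(demo, ci_mean))
```

The project still requires a separate boxplot, interval and written conclusion for each pollutant, so this is best treated as a way to double-check the eight intervals.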