statistics project
Fall 2021 Semester Project Detailed Instructions
Due Date: I must receive your project no later than 11:59 PM on the due date in the syllabus. Given the advance notice and the amount of instructional time devoted to this project in the final week, no extensions will be granted. Projects turned in after the deadline will receive a grade of 0.
Instructor/ACE Assistance: This project will be treated like a take-home final exam. This means that I and the ACE tutors will be happy to answer general questions about these instructions or about any of the techniques and procedures you need to use when completing the project. However, we will not help you produce the required content, nor will we review your report prior to submission to provide "feedback." You will be expected to do the analysis and write the project report on your own. Please refer to the sample project in the Semester Project module if you need an example of what a successful (100%) project looks like.
Collaboration between Classmates: This is not only allowed but encouraged. However, you are not permitted to use the Semester Project discussion board to collaborate, and all discussion threads of that nature will be deleted. If you collaborate, you may do so offline. However, please remember to submit your own report; submission of duplicate reports or a single group report with multiple names is not permitted.
Academic Dishonesty: The data set used in this project has the same name as data sets I've used in previous semesters. However, these sets are all random subsets of a larger master data set and therefore produce different enough results that I will instantly be able to tell if your submission is based on a previous semester's version of the project. Any such submission, or any previous semester's project, as well as duplicate projects from this semester (see above), will result in an immediate grade of 0 and a report of academic dishonesty will be filed.
Document Format: You must submit your project report in a single file through Canvas. The acceptable formats are Microsoft Word (*.docx) or PDF (*.pdf) – no exceptions. The submission page is in the Semester Project module. Projects submitted in multiple parts, in a format other than Word or PDF, or via email/hardcopy will be rejected.
Style Requirements: The Semester Project module contains a sample project that would receive a 100% grade. Your report should be formatted similarly.
− The first page of your report must be a title page containing your name, the course and section number, the title "Semester Project," and the submission date.
− Use a font suitable for an official business document. Any standard typeface is acceptable as long as it is readable and presents a professional appearance (Calibri and Times New Roman are good examples, but not the only possibilities). The size should be no smaller than 12 point, and the color should be black.
− Do not include any borders, decorative images/illustrations, or watermarking.
− Embed all graphics directly into your project file. I will not accept separate files containing graphics.
Data Set: All students will use the same data set: Fruit Fly Data v1. The data set is located in the StatCrunch MTH 245 Assignments Group. This data set is an extract of a larger data set that came from a study of the behavior of fruit flies. The two variables of interest are Percent Time Asleep and Longevity (measured in days).
Technology Requirements: Except where required to build graphs or charts, all numerical calculations must be performed using StatCrunch. Do not use a graphing calculator, Excel, R, Minitab, standard normal tables, or any other method for your numerical calculations.
Graphics Requirements: All graphics must be constructed using StatCrunch, Excel, or another computer-based graphics program. Hand-drawn plots, cell phone pictures of graphics, etc., are not acceptable. All graphics must include an informative title and (except for boxplots) correct labels for both axes. Orient all boxplots horizontally.
Rounding Rules:
− In Section 1, all upper and lower class bounds should be integers.
− In Section 2, do not round the five-number summaries.
− Round all other sample statistics in Section 2, as well as the confidence limits in Section 3, to one decimal place.
− Round all p-values (one each in Sections 4 and 5) to three decimal places.
− In Section 5, round all StatCrunch output (besides the p-values) to one decimal place.
− Add trailing zeroes to any rounded value as needed.
− Do not simply paste screen shots of StatCrunch output into your report.
− Warning: do not use the sample project as an example of how to round! The data set used in that particular project is different from yours, so the values are all rounded differently!
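As an illustration of the trailing-zero rule only (the values themselves must still come from StatCrunch), here is a minimal Python sketch. The helper name `fmt` and the sample numbers are hypothetical and are not taken from the project data.

```python
def fmt(value, places):
    """Round to a fixed number of decimal places, keeping trailing zeroes."""
    return f"{value:.{places}f}"


# Hypothetical values, not from the Fruit Fly data set.
print(fmt(57.0, 1))    # a sample statistic shown to one decimal place
print(fmt(0.0396, 3))  # a p-value shown to three decimal places
```

The point is that a rounded value always shows the full number of decimal places required, even when the last digit is zero.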
Required Content: Organize your report in five separate sections using the following numbers and titles. The required elements for each section are as follows:
Section 1 – Visual Data Assessment. For each variable of interest – Percent Time Asleep and Longevity – create a grouped frequency histogram. For the Percent Time Asleep histogram, use a lower limit of 0.0 and a class width of 10.0. For Longevity, use a lower limit of 10.0 and a class width of 10.0. Each histogram must include an informative title, along with correct labels for both axes. For each histogram, include a paragraph that answers each of the following questions:
a. Is the histogram symmetric, left-skewed, right-skewed, or uniform?
b. How many peaks does the histogram have, and in which class(es) are they located (must include the correct lower and upper bounds for each class listed)?
c. Does the histogram have any gaps between classes? If so, where are they?
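To make the class structure concrete, the sketch below shows how a lower limit and class width determine the class bounds, and how an empty class shows up as a gap. It is an illustration only: the histograms in the report must be built with an approved graphics program. The function name `class_counts`, the sample values, and the convention that each class includes its lower bound and excludes its upper bound are all assumptions.

```python
def class_counts(values, lower, width):
    """Return (lower_bound, upper_bound, count) for each class.

    Assumes every value is >= `lower` and that a class includes its
    lower bound but excludes its upper bound.
    """
    if min(values) < lower:
        raise ValueError("all values must be at or above the lower limit")
    n_classes = int((max(values) - lower) // width) + 1
    counts = [0] * n_classes
    for v in values:
        counts[int((v - lower) // width)] += 1
    return [(lower + i * width, lower + (i + 1) * width, c)
            for i, c in enumerate(counts)]


# Hypothetical Longevity values (days), not from the Fruit Fly data set.
sample = [12, 18, 25, 31, 37, 39, 44, 68]
for lo, hi, count in class_counts(sample, lower=10, width=10):
    note = "  <- gap" if count == 0 else ""
    print(f"{lo} to {hi}: {count}{note}")
```

A class with a count of zero that sits between non-empty classes is a gap, and the class or classes with the highest counts are the peaks.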
Section 2 – Descriptive Statistics.
a. For each variable, find the mean, range, variance, standard deviation, and five-number summary. Display these numbers in a format that is easy to understand. (Do not simply copy screencaps of the StatCrunch output!)
b. Construct a regular boxplot for each variable. For each boxplot, include a brief statement containing an assessment of whether the data appear to be symmetric, left-skewed, right-skewed, or uniform.
c. For each variable, construct a modified boxplot and use it to identify potential outliers. If any exist, list them by value; if none exist, say so.
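The quantities requested in Section 2 can be sketched as follows, purely to show what each one means; the reported values must come from StatCrunch. The function name `describe` and the sample data are hypothetical. Two assumptions are worth flagging: quartiles are computed with Python's "inclusive" method, which may not match StatCrunch's quartile convention, and potential outliers are flagged with the common 1.5 × IQR fence rule used for modified boxplots.

```python
import statistics


def describe(values):
    """Summary statistics plus potential outliers by the 1.5 * IQR rule."""
    data = sorted(values)
    # Quartile convention is an assumption; software packages differ.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "mean": statistics.mean(data),
        "range": data[-1] - data[0],
        "variance": statistics.variance(data),  # sample variance (n - 1)
        "std_dev": statistics.stdev(data),      # sample standard deviation
        "five_number": (data[0], q1, median, q3, data[-1]),
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical values, not from the Fruit Fly data set.
summary = describe([1, 2, 3, 4, 5, 6, 7, 8, 30])
for name, value in summary.items():
    print(name, value)
```

Because quartile conventions differ between packages, the five-number summary printed here could differ slightly from StatCrunch's for the same data.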
Section 3 – Confidence Intervals. Construct a 95% confidence interval for the mean μ of each variable (two intervals total). You may use either algebraic or interval notation as shown in the course notes. State the distribution you used for each interval (t or normal).
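For reference, the usual textbook form of a t-based confidence interval for a mean is sketched below in LaTeX. The symbols (sample mean \bar{x}, sample standard deviation s, sample size n, margin of error E) follow common convention and are assumptions here; the course notes may use different notation, and the limits themselves must still be calculated in StatCrunch.

```latex
% Margin of error for a t-based interval (population standard deviation unknown)
E = t_{\alpha/2,\,n-1}\,\frac{s}{\sqrt{n}}

% Algebraic notation
\bar{x} - E < \mu < \bar{x} + E

% Interval notation
\left(\bar{x} - E,\ \bar{x} + E\right)
```

For a 95% interval, \alpha = 0.05. A normal-based interval has the same shape, with the critical value z_{\alpha/2} in place of t_{\alpha/2, n-1} and the known population standard deviation in place of s.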
Section 4 – Hypothesis Test. Using the p-value method, conduct a formal hypothesis test of the claim that μ, the mean longevity of fruit flies, is less than 57 days. Use α = 0.01. Include the following in your written summary of the results:
a. Your null and alternate hypotheses in the proper format using standard notation.
b. The type of distribution you used (t or normal).
c. The p-value and its logical relationship to α (≤ or >).
d. Your decision regarding the null hypothesis: reject or fail to reject.
e. A statement interpreting your decision: reject/fail to reject (or support/fail to support) the original claim that the mean longevity of fruit flies is less than 57 days.
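The decision rule behind items c and d of the p-value method can be written as a short sketch. The function name `decide` and the example p-values are hypothetical; the actual p-value must come from StatCrunch.

```python
def decide(p_value, alpha):
    """Return the p-value's relationship to alpha and the decision about H0."""
    if p_value <= alpha:
        return "p <= alpha", "reject H0"
    return "p > alpha", "fail to reject H0"


# Hypothetical p-values, not results from the Fruit Fly data set.
print(decide(0.004, alpha=0.01))
print(decide(0.200, alpha=0.01))
```

The same comparison applies wherever these instructions ask for a p-value's logical relationship to α; only the significance level changes.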
Note: Section 4 only applies to Longevity. There is no hypothesis test related to Percent Time Asleep.
Section 5 – Correlation/Regression Analysis.
a. Construct a linear regression model with Percent Time Asleep as the predictor and Longevity as the response. State the equation in correct algebraic format as shown in the course notes.
b. Create a scatter plot of the data with a plot of the least squares line included. (StatCrunch should have generated this plot when you calculated the model in 5a.) The plot must include an informative title and correct labels for both axes.
c. Use the coefficient of determination to identify the percentage of the variation in Longevity explained by the variation in Percent Time Asleep.
d. Identify all likely influential points (Cook's Distance greater than 1.0). If any exist, list them as ordered pairs in the form (Percent Time Asleep, Longevity). If none exist, say so.
e. Conduct a formal hypothesis test at α = 0.05 to determine if there is sufficient evidence of correlation between Percent Time Asleep and Longevity. Include the following:
1) The p-value and its logical relationship to α (≤ or >).
2) Your decision regarding the null hypothesis: reject or fail to reject.
3) A statement regarding the sufficiency of the evidence for a linear relationship between Percent Time Asleep and Longevity.
f. State whether the equation in 5a satisfies the following LINE criteria (assume the residuals are independent):
− Linear Relationship (L): Using the scatterplot with fitted line, determine if a linear model is appropriate based on the model's visual fit to the data.
− Independent Residuals (I): Include a statement that the residuals are assumed to be independent.
− Normally-Distributed Residuals (N): Determine if the residuals fit a normal distribution using a residual histogram and a Q-Q plot. (Do not use a boxplot.) Verify your assessment by conducting a Shapiro-Wilk goodness-of-fit test for normality on the model residuals. Use α = 0.05. Report the p-value, its logical relationship to α (≤ or >), and your interpretation of the result.
− Equal Variances of the Residuals (E): Assess the residuals for constant variance using a plot of the residuals versus Percent Time Asleep.
g. Using the results from 5e and 5f, clearly state whether the model you built in 5a provides valid estimates of Longevity as a function of Percent Time Asleep. Justify your claim.
h. Provide a valid estimate of y_new, a new observation of Longevity for a fruit fly with Percent Time Asleep = 20. Use either the regression model you constructed in 5a or calculate the value using the Longevity data column by itself, whichever is appropriate.
i. If you use the regression model from 5a to calculate the estimate in 5h, calculate a 95% prediction interval estimate of y_new. If the model in 5a is invalid, include a statement that a prediction interval estimate is not applicable.
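To show what the least squares line in 5a and the coefficient of determination in 5c represent, here is a minimal Python sketch of a simple linear regression fit. It is an illustration only: the model, its diagnostics, and any prediction interval must be produced in StatCrunch. The function name `fit_line` and the sample (Percent Time Asleep, Longevity) pairs are hypothetical.

```python
def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x; also returns R^2."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # share of variation in y explained by x
    return intercept, slope, r_squared


# Hypothetical (Percent Time Asleep, Longevity) pairs, not the project data.
asleep = [5, 12, 20, 28, 35, 41]
longevity = [62, 55, 58, 47, 44, 39]

b0, b1, r2 = fit_line(asleep, longevity)
print(f"Longevity = {b0:.1f} + ({b1:.1f})(Percent Time Asleep)")
print(f"R^2 = {r2:.3f}")
print(f"Point estimate at Percent Time Asleep = 20: {b0 + b1 * 20:.1f}")
```

A point estimate from the fitted line is only meaningful when the model passes the checks in 5e and 5f, which is why 5h and 5i make the choice of method conditional on the model's validity.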