undergrad level stat homework with R coding

stat503_fall2018_hw7.docx

Daborn et al. (2002, Science 297: 2253-2256) used microarrays to investigate the genetic origins of insecticide resistance in Drosophila melanogaster, and found that upregulation of a single gene, Cyp6g1, conferred resistance to DDT. The data below were inferred from one of their figures (Fig. 1c). Each row gives the mean transcription of Cyp6g1 relative to a standard for a particular strain of Drosophila, along with the strain's phenotype (resistant or susceptible). The data are on a log10 scale, so a value of 0 means that Cyp6g1 expression is equal to the standard, a value of 1 means that it is 10 times the standard, etc.

Phenotype      Expression
resistant       1.200
resistant       1.050
resistant       0.975
resistant       0.800
resistant       0.200
resistant       0.675
resistant       0.700
resistant       0.500
resistant       0.875
resistant       0.950
resistant       0.600
resistant       0.700
susceptible    -1.100
susceptible    -1.750
susceptible    -0.700
susceptible    -0.550
susceptible    -1.475
susceptible    -1.400
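
The data can be typed straight into R. The following is a minimal sketch; the object name cyp and the column names Phenotype and Expression are chosen here for illustration and are not prescribed by the assignment.

```r
# Expression values (log10 scale) copied from the data listed above.
resistant   <- c(1.200, 1.050, 0.975, 0.800, 0.200, 0.675,
                 0.700, 0.500, 0.875, 0.950, 0.600, 0.700)
susceptible <- c(-1.100, -1.750, -0.700, -0.550, -1.475, -1.400)

# 'cyp' is a hypothetical name for the assembled data frame.
cyp <- data.frame(
  Phenotype  = factor(rep(c("resistant", "susceptible"),
                          times = c(length(resistant), length(susceptible)))),
  Expression = c(resistant, susceptible)
)

str(cyp)               # shows the column types and the first few values
table(cyp$Phenotype)   # number of strains per phenotype
```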

Please use these data to answer the questions on the following pages. Please answer in complete sentences and include captions and appropriate formatting with all graphs.

1. [2 points] Would it be reasonable to use a single normal distribution to model the expression of Cyp6g1 in both the resistant and susceptible strains? Please provide a graph (with caption, etc.) to support your answer, and briefly explain why the graph supports your answer.
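
One possible graph for Question 1 is a histogram of all 18 strains pooled together. This is only a sketch: the bin width of 0.25 and the axis labels are assumptions made here, and other choices (or a different plot type) may suit the question equally well.

```r
resistant   <- c(1.200, 1.050, 0.975, 0.800, 0.200, 0.675,
                 0.700, 0.500, 0.875, 0.950, 0.600, 0.700)
susceptible <- c(-1.100, -1.750, -0.700, -0.550, -1.475, -1.400)

# Pool both phenotypes, as a single-distribution model would.
expression <- c(resistant, susceptible)

# Bin width of 0.25 is an arbitrary illustrative choice.
h <- hist(expression,
          breaks = seq(-2, 1.5, by = 0.25),
          main   = "Cyp6g1 expression, all strains pooled",
          xlab   = "Cyp6g1 expression relative to standard (log10 scale)",
          ylab   = "Number of strains")
```

The object h keeps the bin counts, which can be inspected with h$counts when describing the shape of the distribution in words.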

2. [4 points] Would it be reasonable to model Cyp6g1 expression in resistant and susceptible strains using two different normal distributions (yes or no)?

a. Please provide a graph to support your answer and explain why it supports your answer.

b. Based on your graph, is it appropriate to conclude that Cyp6g1 expression definitely does follow a normal distribution? Why or why not?
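
For Question 2, a normal quantile-quantile plot per phenotype is one common choice of graph. The sketch below uses base R's qqnorm() and qqline(); the side-by-side layout and the panel titles are assumptions made here, and paired histograms or boxplots would be reasonable alternatives.

```r
resistant   <- c(1.200, 1.050, 0.975, 0.800, 0.200, 0.675,
                 0.700, 0.500, 0.875, 0.950, 0.600, 0.700)
susceptible <- c(-1.100, -1.750, -0.700, -0.550, -1.475, -1.400)

op <- par(mfrow = c(1, 2))   # two panels side by side

# Points lying close to the reference line are consistent with normality.
qq_res <- qqnorm(resistant, main = "Resistant strains")
qqline(resistant)

qq_sus <- qqnorm(susceptible, main = "Susceptible strains")
qqline(susceptible)

par(op)                      # restore the previous plotting layout
```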

3. [2 points] Calculate the sample mean and standard error of the mean for each phenotype. You may use R or a calculator to perform the calculations, but please show the equations that you are applying at each step (sentences are not required for this question).
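
The calculations in Question 3 rest on the usual definitions: sample mean = sum(x) / n, sample standard deviation s = sqrt(sum((x - mean)^2) / (n - 1)), and standard error of the mean = s / sqrt(n). A minimal sketch of applying them step by step in R follows; the helper name sem is hypothetical, not a built-in function.

```r
resistant   <- c(1.200, 1.050, 0.975, 0.800, 0.200, 0.675,
                 0.700, 0.500, 0.875, 0.950, 0.600, 0.700)
susceptible <- c(-1.100, -1.750, -0.700, -0.550, -1.475, -1.400)

# Hypothetical helper: standard error of the mean, s / sqrt(n).
sem <- function(x) {
  n <- length(x)
  s <- sqrt(sum((x - mean(x))^2) / (n - 1))   # same value as sd(x)
  s / sqrt(n)
}

mean_res <- sum(resistant) / length(resistant)       # sample mean, resistant
mean_sus <- sum(susceptible) / length(susceptible)   # sample mean, susceptible
se_res   <- sem(resistant)
se_sus   <- sem(susceptible)

c(mean_res = mean_res, se_res = se_res, mean_sus = mean_sus, se_sus = se_sus)
```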

4. [2 points] Calculate an approximate 95% confidence interval for the mean expression of Cyp6g1 in the resistant Drosophila strains using the normal distribution (sentences are not required for this question).
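
The normal-based interval in Question 4 has the form mean ± z × SE, where z is the 0.975 quantile of the standard normal distribution (about 1.96). A sketch, assuming the variable names shown, which are chosen here for illustration:

```r
resistant <- c(1.200, 1.050, 0.975, 0.800, 0.200, 0.675,
               0.700, 0.500, 0.875, 0.950, 0.600, 0.700)

n    <- length(resistant)
xbar <- mean(resistant)
se   <- sd(resistant) / sqrt(n)

# 95% interval leaves 2.5% in each tail, so use the 0.975 quantile.
z    <- qnorm(0.975)
ci_z <- c(lower = xbar - z * se, upper = xbar + z * se)
ci_z
```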

5. [2 points] The interval calculated in Question 4 should consist of 2 numbers. Please explain the interpretation of this interval.

6. [3 points] Use the Student's t distribution to find a 95% confidence interval for the mean expression of Cyp6g1 in resistant Drosophila strains (sentences are not required for this question).

a. Determine the appropriate degrees of freedom.

b. Use the qt() function in R to find the appropriate value of the t-statistic. Its first argument (p) is the quantile that you want to evaluate, and its second argument (df) is the degrees of freedom. Please include the statement that you use to find the t-value here, along with your answer.

c. Calculate the confidence interval.
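
The three parts of Question 6 can be sketched as follows, assuming the usual one-sample setup in which the degrees of freedom are n - 1. The variable names are illustrative.

```r
resistant <- c(1.200, 1.050, 0.975, 0.800, 0.200, 0.675,
               0.700, 0.500, 0.875, 0.950, 0.600, 0.700)

n    <- length(resistant)
xbar <- mean(resistant)
se   <- sd(resistant) / sqrt(n)

# (a) Degrees of freedom for a one-sample mean.
df <- n - 1

# (b) Two-sided 95% interval, so evaluate the 0.975 quantile.
t_crit <- qt(p = 0.975, df = df)

# (c) Interval: mean plus or minus t times the standard error.
ci_t <- c(lower = xbar - t_crit * se, upper = xbar + t_crit * se)

t_crit
ci_t
```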

7. [4 points] Compare your answers in Questions 4 and 6.

a. How do the intervals differ?

b. What accounts for this difference (please do not just tell me that you used different distributions – what about them is different)?

c. Which one should you use, and why?
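
When thinking about Question 7, it may help to look at how the t critical value behaves as the degrees of freedom change. The sketch below tabulates the 0.975 quantile of the t distribution for a few degrees of freedom alongside the normal quantile; the particular df values are arbitrary choices made here for illustration.

```r
# Arbitrary illustrative degrees of freedom, from very small to large.
dfs <- c(2, 5, 11, 30, 100, 1000)

t_crit <- qt(0.975, df = dfs)   # t critical values, one per df
z_crit <- qnorm(0.975)          # normal critical value

comparison <- data.frame(
  df     = dfs,
  t_crit = round(t_crit, 3),
  ratio  = round(t_crit / z_crit, 3)   # how much wider the t interval is
)
comparison
```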

8. [3 points] Calculate the appropriate 95% confidence interval for the susceptible strains (complete sentences are not required; please show your work).
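
For Question 8, one way to check hand calculations is to wrap the t-based interval in a small function and compare it with R's built-in t.test(). This is a sketch only: the function name t_ci is hypothetical, and it assumes a t-based interval is the one intended for the six susceptible strains.

```r
susceptible <- c(-1.100, -1.750, -0.700, -0.550, -1.475, -1.400)

# Hypothetical helper: two-sided t-based confidence interval for a mean.
t_ci <- function(x, level = 0.95) {
  n      <- length(x)
  se     <- sd(x) / sqrt(n)
  t_crit <- qt(1 - (1 - level) / 2, df = n - 1)
  mean(x) + c(-1, 1) * t_crit * se
}

ci_sus   <- t_ci(susceptible)

# Cross-check against the interval reported by the built-in one-sample t-test.
ci_check <- as.numeric(t.test(susceptible, conf.level = 0.95)$conf.int)

ci_sus
ci_check
```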

9. [2 points] Write a short paragraph summarizing your results and drawing a conclusion regarding the difference (or lack thereof) in Cyp6g1 expression between susceptible and resistant strains. Be sure to include the mean, standard error, and confidence intervals.

10. [2 points] Report your answers for the unknown distributions problem (Section 5) in Tutorial 6. Please keep your answers short (a few sentences per distribution).

a. Distribution A:

b. Distribution B:

c. Distribution C:

d. Distribution D:

11. [3 points] Please attach your complete code for the analyses performed in this homework. For full credit, your script must meet all of the following criteria:

a. It must be clutter-free.

b. It must include one or more lines that either create the data using data.frame() or read the data in from a CSV file (you can use file.choose() if you want).

c. It must include code for the graphs in Questions 1 and 2, the t-quantiles in Questions 6 and 8, and any analyses you performed for Question 10. Code for other questions or calculations is optional. Comments labeling the code for each question are helpful to the graders.

d. We should be able to copy your script into R, change only the path for the input data file, and run it to produce all of your results, without errors (you can check this yourself – clear your workspace with rm(list=ls()), and then highlight and run the entire script).
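
As an illustration of criterion (b), the sketch below writes the data to a CSV file and reads it back with read.csv(). The temporary file path stands in for whatever path a real script would use, and the object names are assumptions made here.

```r
# In a fresh session, a script would typically start with: rm(list = ls())

cyp <- data.frame(
  Phenotype  = rep(c("resistant", "susceptible"), times = c(12, 6)),
  Expression = c(1.200, 1.050, 0.975, 0.800, 0.200, 0.675,
                 0.700, 0.500, 0.875, 0.950, 0.600, 0.700,
                 -1.100, -1.750, -0.700, -0.550, -1.475, -1.400)
)

# Stand-in path; a real script would point at the actual input file.
csv_path <- tempfile(fileext = ".csv")
write.csv(cyp, csv_path, row.names = FALSE)

# Read the data back in, treating Phenotype as a factor.
cyp_in <- read.csv(csv_path, stringsAsFactors = TRUE)

# Interactive alternative, which opens a file picker instead of using a path:
# cyp_in <- read.csv(file.choose(), stringsAsFactors = TRUE)

summary(cyp_in)
```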
