statistics with R programming

Week 7 Outline (10-21-20):

All papers are graded! Assignment 5 coming after exam, over Chapter 6.

Exam 1 will run from Thursday October 22 at 9:00 pm to Saturday October 24 at 9:00 pm. Please do not wait until the last minute to start. Exam 1 will cover Aho chapters 1 through 5, and R. Focus on what we have done in class as the most important ideas.

· You may not discuss the exams with anyone during the 48 hour exam period, other than the professor.

· You may use your notes and textbook, though I recommend you prepare as if you did not have these resources, so you don’t have to look everything up.

· You may not use any online resources, other than to access R if necessary.

· You may not seek help with any part of the exam from any online resources, including tutors.

· I looked for extra practice problems with solutions that I could share with you, but I did not find any over relevant topics. I can understand the request, especially in a traditional semester. However, please keep in mind that you will have 48 hours to take the exam, and you will have access to your textbook and notes, so I think that alleviates at least some of the need for extra practice problems.

Questions

Finish Aho Chapter 6: Hypothesis testing for comparing two population means from independent (quantitative) populations

· Remember that all hypothesis testing and confidence interval procedures that we cover in our class require that the sampling procedure used is random sampling. If that is not the case, then adjustments would have to be made, though we don’t have time to cover these in our class.

· Remember that if you are asked to use a sample to learn about a population, you must do either a confidence interval or hypothesis test. If the problem does not say confidence interval anywhere, you are being asked to do a hypothesis test. Here are the steps to follow:

Step 0: what are you observing from each individual, and what kind of problem is it (e.g. categorical or quantitative)?

Step 1: check any necessary conditions

Step 2: define the relevant parameter in words

Step 3: set up hypotheses and specify $\alpha$

Step 4: compute test statistic

Step 5: compute p-value

Step 6: Decide to reject or not reject (by comparing p-value to $\alpha$)

Step 7: Explain decision (step 6) in the context of the problem (without statistical jargon)

· Hypothesis testing for two independent (quantitative) populations

· Two independent sample t test

· Overview of how it differs from the one sample t test

· Pooled vs unpooled (Welch’s approximate t test or Satterthwaite approximation)

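· For unpooled, df = $\dfrac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$, where $s_i^2$ is the sample variance and $n_i$ is the sample size for sample $i$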

· Confidence interval for difference in two population means

· Example:

· Sample from Population 1: 12, 13, 13, 14, 15, 17, 20, 25, 33

· Sample from Population 2: 14, 15, 15, 16, 17, 17, 21, 25, 27, 35, 50

· R code (y is quantitative response variable and x is categorical explanatory (group) variable)

	y<-c(12,13,13,14,15,17,20,25,33,14,15,15,16,17,17,21,25,27,35,50)

	x<-c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2)

For unpooled two sample t test:
> t.test(y~x, alternative = c("two.sided"), mu = 0, var.equal = FALSE,
        conf.level = 0.95)
	Welch Two Sample t-test
data:  y by x
t = -1.206, df = 17.046, p-value = 0.2443
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -13.495698   3.677517
sample estimates:
mean in group 1 mean in group 2 
       18.00000        22.90909 
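
The unpooled output above can be checked by hand. The following is a minimal sketch, assuming the two samples are stored in vectors named `g1` and `g2` (names chosen here for illustration); it computes the test statistic, the Welch-Satterthwaite degrees of freedom, the two-sided p-value, and the 95% confidence interval from the sample means and variances.

```r
# Data from the example (g1 and g2 are illustrative names)
g1 <- c(12, 13, 13, 14, 15, 17, 20, 25, 33)
g2 <- c(14, 15, 15, 16, 17, 17, 21, 25, 27, 35, 50)

n1 <- length(g1)
n2 <- length(g2)
v1 <- var(g1) / n1   # s1^2 / n1
v2 <- var(g2) / n2   # s2^2 / n2

# Unpooled standard error of the difference in sample means
se_welch <- sqrt(v1 + v2)

# Test statistic for H0: difference in population means = 0
t_welch <- (mean(g1) - mean(g2)) / se_welch

# Welch-Satterthwaite approximate degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value and 95% confidence interval
p_welch  <- 2 * pt(-abs(t_welch), df_welch)
ci_welch <- (mean(g1) - mean(g2)) + c(-1, 1) * qt(0.975, df_welch) * se_welch

round(c(t = t_welch, df = df_welch, p.value = p_welch), 4)
ci_welch
```

The values should agree with the `t.test` output above up to rounding.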
	

For pooled two sample t test:
> t.test(y~x, alternative = c("two.sided"), mu = 0, var.equal = TRUE,
        conf.level = 0.95)
	Two Sample t-test
data:  y by x
t = -1.1524, df = 18, p-value = 0.2642
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -13.858902   4.040721
sample estimates:
mean in group 1 mean in group 2 
       18.00000        22.90909 
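
The pooled output can be checked the same way. This is a sketch under the pooled test's assumption that the two populations share a common variance; `g1`, `g2`, and the other variable names are illustrative.

```r
# Data from the example (g1 and g2 are illustrative names)
g1 <- c(12, 13, 13, 14, 15, 17, 20, 25, 33)
g2 <- c(14, 15, 15, 16, 17, 17, 21, 25, 27, 35, 50)

n1 <- length(g1)
n2 <- length(g2)

# Pooled variance: weighted average of the two sample variances
sp2 <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)

# Standard error, test statistic, and degrees of freedom
se_pooled <- sqrt(sp2 * (1 / n1 + 1 / n2))
t_pooled  <- (mean(g1) - mean(g2)) / se_pooled
df_pooled <- n1 + n2 - 2

# Two-sided p-value and 95% confidence interval
p_pooled  <- 2 * pt(-abs(t_pooled), df_pooled)
ci_pooled <- (mean(g1) - mean(g2)) + c(-1, 1) * qt(0.975, df_pooled) * se_pooled

round(c(t = t_pooled, df = df_pooled, p.value = p_pooled), 4)
ci_pooled
```

Unlike the unpooled version, the degrees of freedom here are always the whole number n1 + n2 - 2.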


			
· Duality between confidence intervals and hypothesis tests
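
One way to see the duality in R, sketched here with the example data stored in illustrative vectors `g1` and `g2`: a two-sided test at level 0.05 fails to reject a hypothesized difference exactly when that value lies inside the 95% confidence interval, and testing a value at the edge of the interval gives a p-value equal to 0.05.

```r
# Data from the example (g1 and g2 are illustrative names)
g1 <- c(12, 13, 13, 14, 15, 17, 20, 25, 33)
g2 <- c(14, 15, 15, 16, 17, 17, 21, 25, 27, 35, 50)

# Unpooled two-sided test of H0: difference in means = 0, with 95% CI
fit <- t.test(g1, g2, mu = 0, var.equal = FALSE, conf.level = 0.95)
ci  <- fit$conf.int

# 0 is inside the interval exactly when the test does not reject at alpha = 0.05
zero_in_ci     <- ci[1] < 0 && 0 < ci[2]
fail_to_reject <- fit$p.value > 0.05
c(zero_in_ci = zero_in_ci, fail_to_reject = fail_to_reject)

# Testing a hypothesized difference equal to the lower endpoint
# of the interval gives a p-value of 0.05 (up to numerical precision)
p_at_lower <- t.test(g1, g2, mu = ci[1], var.equal = FALSE)$p.value
p_at_lower
```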

· What if we cannot conclude that the difference in sample means is normally distributed?
· Wilcoxon rank sum test (sometimes referred to as “Mann-Whitney U”, which gives the same p-value as Wilcoxon rank sum test)
· If the difference in sample means is not normally distributed, that means skewness or outliers are causing a problem, which means we focus on population median instead of population mean
· Requires that the two populations have the same shape and same variability
· Hand calculate test statistic W
· Key idea is to rank the values as if they were all in one group, using average ranks for ties
· If one group has systematically lower values, then that group would generally receive lower ranks
· $R_1$ = sum of ranks in sample 1
· $R_2$ = sum of ranks in sample 2
· Whichever group has higher population median should generally produce higher values, so should have higher sum of ranks
· Rescale $R_1$ and $R_2$ by subtracting lowest possible value for each (if each group’s values were lower than all the values for the other group)
· $W_1 = R_1 - n_1(n_1+1)/2$
· $W_2 = R_2 - n_2(n_2+1)/2$
· This is because sum of first n whole numbers is n(n+1)/2
· Aho textbook says
· For two tail test, W = min($W_1$, $W_2$)
· For right tail test, W = 
· For left tail test, W = 
· R just uses W = $W_1$ (the rescaled rank sum for the first group), so that is what we will do
· Use R to get p-value (y is quantitative response variable, x is categorical explanatory (group) variable)

> wilcox.test(y~x, alternative = c("two.sided"), mu = 0, exact = TRUE)
	Wilcoxon rank sum test with continuity correction
data:  y by x
W = 30, p-value = 0.1472
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(x = c(12, 13, 13, 14, 15, 17, 20, 25, 33),  :
  cannot compute exact p-value with ties
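
The hand calculation of W described above can be sketched in R as follows. The names `R1`, `R2` (rank sums) and `W1`, `W2` (rescaled rank sums) are labels chosen for this sketch, as are `g1` and `g2`. `rank()` assigns average ranks to ties by default, and `exact = FALSE` is used because an exact p-value cannot be computed with ties.

```r
# Data from the example (g1 and g2 are illustrative names)
g1 <- c(12, 13, 13, 14, 15, 17, 20, 25, 33)
g2 <- c(14, 15, 15, 16, 17, 17, 21, 25, 27, 35, 50)
n1 <- length(g1)
n2 <- length(g2)

# Rank all values as if they were one group (ties get average ranks)
r <- rank(c(g1, g2))

# Sum of ranks in each sample
R1 <- sum(r[seq_len(n1)])
R2 <- sum(r[-seq_len(n1)])

# Rescale by subtracting the lowest possible rank sum, n(n+1)/2
W1 <- R1 - n1 * (n1 + 1) / 2
W2 <- R2 - n2 * (n2 + 1) / 2
c(R1 = R1, R2 = R2, W1 = W1, W2 = W2)

# R reports the rescaled rank sum of the first group as W
fit <- wilcox.test(g1, g2, alternative = "two.sided", exact = FALSE)
fit$statistic
fit$p.value
```

A useful check on the arithmetic is that W1 + W2 always equals n1 times n2.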

· When the conditions for the t test are met, the two independent sample t test has more power than the Wilcoxon rank sum test
· Decision tree for checking conditions for two independent (quantitative) populations
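
The power comparison can be explored with a small simulation. This is only a sketch under assumed settings that are not from the outline: two normal populations with equal variance whose means differ by one standard deviation, 15 observations per group, and 1000 simulated data sets. Power is estimated as the proportion of data sets in which each test rejects at the 0.05 level.

```r
set.seed(2020)   # arbitrary seed, for reproducibility only

n_sims <- 1000   # number of simulated data sets (assumed)
n      <- 15     # sample size per group (assumed)
shift  <- 1      # true difference in population means (assumed)

# For each simulated data set, record whether each test rejects at 0.05
reject <- replicate(n_sims, {
  a <- rnorm(n, mean = 0, sd = 1)
  b <- rnorm(n, mean = shift, sd = 1)
  c(t_test   = t.test(a, b, var.equal = TRUE)$p.value < 0.05,
    wilcoxon = wilcox.test(a, b)$p.value < 0.05)
})

# Estimated power = proportion of rejections for each test
power_est <- rowMeans(reject)
power_est
```

With normal data the t test's estimate is usually the larger of the two, though the gap tends to be small. Replacing `rnorm` with a skewed or heavy-tailed distribution can change which test does better.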

Start Aho Chapter 7:  Sampling Design and Experimental Design
· Terminology already covered previously
· Variables
· Explanatory vs response variable
· Other words for explanatory:  independent variable, predictor, covariate
· Other word for response variable:  dependent variable, outcome variable
· Categorical vs quantitative variables
· Discrete vs continuous variables
· Sampling
· The goal is always to get a representative sample, so that statistics will be unbiased, meaning that the average value of the statistic is equal to the parameter that is being estimated
· Common kinds of bias
· Selection bias- something about how you select the sample leads to an unrepresentative sample
· Example: undercoverage- some subjects were not available to be selected
· Response bias- subjects supply information, but it is inaccurate
· Example:  people lie about their age
· Nonresponse bias- some subjects are selected, but are not included in the sample because they cannot be reached or found, or because they refuse to respond
· If the subjects that do not provide information are different than those that do, that is a biased sample
· Example:  people refuse to answer what their age is
· There are three main ways to get a representative sample from a population
· Simple random sampling
· Every subject has an equal chance of being selected, and every possible sample of the same size is equally likely
· Stratified random sampling
· Population is divided into strata, and hopefully within strata, the subjects are similar to each other (homogeneous)
· A random sample is taken from each stratum, which makes sure that no group is missed, and if done well, this can make the statistics from the sample have better properties
· Cluster sampling
· Population is divided into clusters, and then some of the clusters are chosen randomly (and all subjects in each chosen cluster are in the sample)
· This is typically motivated by large geographical distances between clusters, and using clusters allows data to be acquired more efficiently
· Everything we do in our class requires having a simple random sample
· The main thing to check when dealing with your data is whether the sample was taken in a way other than simple random sampling, because the analysis has to account for stratified or cluster sampling
· We won’t worry about this in our class, beyond reading a description and determining whether the sample was taken as a simple random sample
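
The three sampling methods above can be sketched with `sample()`. The population here is entirely hypothetical (1000 subjects, 4 strata of 250, 50 clusters of 20), and the sample sizes are arbitrary choices for illustration.

```r
set.seed(123)   # arbitrary seed, for reproducibility only

# Hypothetical population: 1000 subjects, 4 strata, 50 clusters
population <- data.frame(
  id      = 1:1000,
  stratum = rep(c("A", "B", "C", "D"), each = 250),
  cluster = rep(1:50, each = 20)
)

# Simple random sample: 40 subjects drawn from the whole population
srs <- population[sample(nrow(population), size = 40), ]

# Stratified random sample: 10 subjects drawn at random within each stratum
stratified <- do.call(rbind, lapply(
  split(population, population$stratum),
  function(s) s[sample(nrow(s), size = 10), ]
))

# Cluster sample: choose 2 clusters at random, keep every subject in them
chosen_clusters <- sample(unique(population$cluster), size = 2)
cluster_sample  <- population[population$cluster %in% chosen_clusters, ]

table(srs$stratum)          # strata counts vary by chance
table(stratified$stratum)   # exactly 10 per stratum by design
table(cluster_sample$cluster)
```

All three samples contain 40 subjects, but only the stratified sample guarantees that every stratum is represented.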
· Observational study vs randomized experiment
· After a sample is taken, we collect variables from each subject
· if subjects are not manipulated by the researcher, then the data collection method is called an observational study
· if subjects are manipulated (randomly assigned to different categories of the primary explanatory variable) by the researcher, then the data collection method is called a randomized experiment
· Confounding and lurking variables
· When we have a response variable (y) and one or more explanatory variables (x’s), the goal is to learn about the relationship between y and the x’s
· Sometimes there can be a strong relationship between y and x1 , but that can be because there is another variable x2 that is related to both y and x1 
· x2 is called a confounding variable
· example:  y = murder rate,  x1 = ice cream sales
· example:  y= lung cancer (Yes/No), x1 = smoking status
· when x2  is not known or not observed, x2  is often called a lurking variable
· When you randomly assign subjects to groups (as defined by the primary explanatory variable), as in a randomized experiment, you can balance out other (confounding) variables, so that the only systematic difference between the groups is the variable of primary interest. This allows you to infer a cause and effect relationship between the response variable and the primary explanatory variable

· Website:  http://www.rossmanchance.com/applets/Subjects.html

· Response variable is how high people can jump
· Primary explanatory variable is group (1 or 2), where group 1 has been training with a special kind of shoe, and group 2 is traditional shoe
· Other observed explanatory variables are gender and height
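
The balancing effect of random assignment can also be sketched in R. The subjects below are hypothetical (24 people with made-up genders and heights in inches) and are not the applet's data. Any single random assignment may be somewhat unbalanced, but over many repeated assignments the difference in average height between the two groups averages out to about zero.

```r
set.seed(7)   # arbitrary seed, for reproducibility only

# Hypothetical subjects: 12 women and 12 men with simulated heights (inches)
subjects <- data.frame(
  gender = rep(c("F", "M"), each = 12),
  height = c(rnorm(12, mean = 64, sd = 3), rnorm(12, mean = 70, sd = 3))
)

# Randomly assign 12 subjects to group 1 and 12 to group 2
assign_once <- function() sample(rep(1:2, each = 12))

group <- assign_once()
table(group, subjects$gender)          # gender split in one assignment
tapply(subjects$height, group, mean)   # mean height in each group

# Repeat the random assignment many times and track the height difference
diffs <- replicate(1000, {
  g <- assign_once()
  mean(subjects$height[g == 1]) - mean(subjects$height[g == 2])
})
mean(diffs)   # close to 0: on average, neither group is taller
```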
· There are many different ways to design a randomized experiment, depending on the research goals
· If you are someone that may be designing experiments in your future, I recommend you read section 7.6 on pages 269 to 287
· Unfortunately we don’t have time to go into great detail on these many types of experimental design


· Confounding can take place for quantitative explanatory variables too, but you typically don’t have control over these, so you have to try to incorporate all important explanatory variables into the analysis
· What does confounding look like in 3D?
· Y = BP, X1 = fiber intake, X2 = weight
· Note the difference in the slopes of the two lines.  The red line (line at the top) is when weight is ignored (not in the model), and the blue line (line at the bottom) is when weight is in the model alongside fiber.



· Note the red line is the average pattern for the relationship between BP and fiber when weight is not included in the model
· When weight is included, the relationship between BP and fiber is where the blue plane intersects the BP/fiber plane, and that slope is not as steep as the red line
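
The change in slope can be reproduced with simulated data. This is a sketch with made-up numbers, not the data behind the plot: weight is simulated to raise BP and to be associated with lower fiber intake, so weight confounds the relationship between BP and fiber. All coefficients below are assumptions chosen for illustration.

```r
set.seed(42)   # arbitrary seed, for reproducibility only
n <- 1000

# Simulated data (all coefficients are illustrative assumptions)
weight <- rnorm(n, mean = 80, sd = 10)
fiber  <- 40 - 0.3 * weight + rnorm(n, sd = 3)                # heavier subjects eat less fiber
bp     <- 100 + 0.5 * weight - 0.5 * fiber + rnorm(n, sd = 5)

# Slope for fiber when weight is ignored (not in the model)
slope_ignored <- coef(lm(bp ~ fiber))["fiber"]

# Slope for fiber when weight is in the model alongside fiber
slope_adjusted <- coef(lm(bp ~ fiber + weight))["fiber"]

c(ignored = unname(slope_ignored), adjusted = unname(slope_adjusted))
```

In this simulation the slope for fiber is steeper when weight is left out, because part of the effect of weight on BP is attributed to fiber.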

Chapter 8 reading:  Aho Chapter 8:  Pages 295-311 and 316-318 (starting at section 8.4, ignoring Kendall’s tau)
