Statistics with R Programming
Week 7 Outline (10-21-20):
All papers are graded! Assignment 5 coming after exam, over Chapter 6.
Exam 1 will be open from Thursday, October 22, 9:00 pm to Saturday, October 24, 9:00 pm. Please do not wait until the last minute to start. Exam 1 will cover Aho chapters 1 through 5, and R. Focus on what we have done in class as the most important ideas.
· You may not discuss the exams with anyone during the 48 hour exam period, other than the professor.
· You may use your notes and textbook, though I recommend you prepare as if you did not have these resources, so you don’t have to look everything up.
· You may not use any online resources, other than to access R if necessary.
· You may not seek help with any part of the exam from any online resources, including tutors.
· I looked for extra practice problems with solutions that I could share with you, but I did not find any on relevant topics. I can understand the request, especially in a traditional semester. However, please keep in mind that you will have 48 hours to take the exam, and you will have access to your textbook and notes, so I think that alleviates at least some of the need for extra practice problems.
Questions
Finish Aho Chapter 6: Hypothesis testing for comparing two population means from independent (quantitative) populations
· Remember that all hypothesis testing and confidence procedures that we cover in our class require that the sampling procedure used is random sampling. If that is not the case, then adjustments would have to be made, though we don’t have time to cover these in our class.
· Remember that if you are asked to use a sample to learn about a population, you must do either a confidence interval or hypothesis test. If the problem does not say confidence interval anywhere, you are being asked to do a hypothesis test. Here are the steps for a hypothesis test:
Step 0: what are you observing from each individual, and what kind of problem is it (e.g. categorical or quantitative)?
Step 1: check any necessary conditions
Step 2: define the relevant parameter in words
Step 3: state the null and alternative hypotheses
Step 4: compute test statistic
Step 5: compute p-value
Step 6: make a decision about the null hypothesis
Step 7: explain the decision (step 6) in the context of the problem (without statistical jargon)
· Hypothesis testing for two independent (quantitative) populations
· Two independent sample t test
· Overview of how it differs from the one sample t test
· Pooled vs unpooled (Welch’s approximate t test, also called the Satterthwaite approximation)
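· For unpooled, df = (s1^2/n1 + s2^2/n2)^2 / [ (s1^2/n1)^2/(n1 - 1) + (s2^2/n2)^2/(n2 - 1) ], where s1^2 and s2^2 are the sample variances and n1 and n2 are the sample sizes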
· Confidence interval for difference in two population means
· Example:
· Sample from Population 1: 12, 13, 13, 14, 15, 17, 20, 25, 33
· Sample from Population 2: 14, 15, 15, 16, 17, 17, 21, 25, 27, 35, 50
· R code (y is quantitative response variable and x is categorical explanatory (group) variable)
y<-c(12,13,13,14,15,17,20,25,33,14,15,15,16,17,17,21,25,27,35,50)
x<-c(1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2)
For unpooled two sample t test:
> t.test(y~x, alternative = c("two.sided"), mu = 0, var.equal = FALSE, conf.level = 0.95)
Welch Two Sample t-test
data: y by x
t = -1.206, df = 17.046, p-value = 0.2443
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.495698 3.677517
sample estimates:
mean in group 1 mean in group 2
18.00000 22.90909
For pooled two sample t test:
> t.test(y~x, alternative = c("two.sided"), mu = 0, var.equal = TRUE, conf.level = 0.95)
Two Sample t-test
data: y by x
t = -1.1524, df = 18, p-value = 0.2642
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-13.858902 4.040721
sample estimates:
mean in group 1 mean in group 2
18.00000 22.90909
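To see where the numbers in the two t.test() outputs come from, here is a minimal sketch that computes the unpooled and pooled test statistics by hand for the same two samples. The variable names (samp1, samp2, v1, v2, and so on) are labels chosen for this sketch and do not come from the textbook.

```r
# Data from the example above
samp1 <- c(12, 13, 13, 14, 15, 17, 20, 25, 33)
samp2 <- c(14, 15, 15, 16, 17, 17, 21, 25, 27, 35, 50)

n1 <- length(samp1)
n2 <- length(samp2)
v1 <- var(samp1) / n1   # s1^2 / n1
v2 <- var(samp2) / n2   # s2^2 / n2
diff_means <- mean(samp1) - mean(samp2)

# Unpooled (Welch): standard error, test statistic, approximate df
se_welch <- sqrt(v1 + v2)
t_welch  <- diff_means / se_welch
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value and 95% confidence interval for the difference in means
p_welch  <- 2 * pt(-abs(t_welch), df = df_welch)
ci_welch <- diff_means + c(-1, 1) * qt(0.975, df = df_welch) * se_welch

# Pooled: one common variance estimate, df = n1 + n2 - 2
sp2      <- ((n1 - 1) * var(samp1) + (n2 - 1) * var(samp2)) / (n1 + n2 - 2)
t_pooled <- diff_means / sqrt(sp2 * (1 / n1 + 1 / n2))

round(c(t = t_welch, df = df_welch, p = p_welch), 4)
round(ci_welch, 4)
round(t_pooled, 4)
```

The rounded values should agree with the t, df, p-value, and confidence interval that t.test() reports for the same data.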
· Duality between confidence intervals and hypothesis tests
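One way to see the duality is to check it on the example data. This is a small sketch, assuming the same y and x as in the example. The names fit, ci, zero_inside, and p_at_lower are made up for illustration.

```r
# Same data as the example above
y <- c(12, 13, 13, 14, 15, 17, 20, 25, 33,
       14, 15, 15, 16, 17, 17, 21, 25, 27, 35, 50)
x <- c(rep(1, 9), rep(2, 11))

fit <- t.test(y ~ x, mu = 0, var.equal = FALSE, conf.level = 0.95)
ci  <- fit$conf.int

# The 95% interval contains 0, and the two-sided p-value is above 0.05:
# both say that a difference of 0 is plausible
zero_inside <- ci[1] < 0 && 0 < ci[2]
zero_inside
fit$p.value > 0.05

# Testing a null value that sits exactly on an endpoint of the 95% interval
# gives a two-sided p-value of 0.05 (up to rounding error)
p_at_lower <- t.test(y ~ x, mu = ci[1], var.equal = FALSE)$p.value
p_at_lower
```

Null values inside the 95% interval give two-sided p-values above 0.05, and null values outside it give p-values below 0.05.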
· What if we cannot conclude that the normality condition for the two sample t test is met?
· Wilcoxon rank sum test (sometimes referred to as “Mann-Whitney U”, which gives the same p-value as Wilcoxon rank sum test)
· If the normality condition is not met, that means skewness or outliers are causing a problem, so we focus on the population medians instead of the population means
· Requires that the two populations have the same shape and same variability
· Hand calculate test statistic W
· Key idea is to rank the values as if they were all in one group, using average ranks for ties
· If one group has systematically lower values, then that group would generally receive lower ranks
· R1 = sum of ranks in sample 1
· R2 = sum of ranks in sample 2
· Whichever group has higher population median should generally produce higher values, so should have higher sum of ranks
· Rescale R1 and R2 by subtracting the lowest possible value for each (the value it would take if all of that group’s values were lower than all the values for the other group)
· W1 = R1 - n1(n1+1)/2
· W2 = R2 - n2(n2+1)/2
· This is because sum of first n whole numbers is n(n+1)/2
· Aho textbook says
· For two tail test, W = min(W1, W2)
· For right tail test, W =
· For left tail test, W =
· R just uses W = W1, the rescaled rank sum for sample 1 (sum of ranks in sample 1 minus n1(n1+1)/2), so that is what we will do
· Use R to get p-value (y is quantitative response variable, x is categorical explanatory (group) variable)
> wilcox.test(y~x, alternative = c("two.sided"), mu = 0, exact = TRUE)
Wilcoxon rank sum test with continuity correction
data: y by x
W = 30, p-value = 0.1472
alternative hypothesis: true location shift is not equal to 0
Warning message:
In wilcox.test.default(x = c(12, 13, 13, 14, 15, 17, 20, 25, 33), :
cannot compute exact p-value with ties
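The hand calculation of W described above can be sketched in R with rank(), which uses average ranks for ties by default. The names R1, R2, W1, and W2 are labels chosen for this sketch (rank sums and rescaled rank sums for samples 1 and 2) and may not match the textbook's notation.

```r
# Data from the example above
samp1 <- c(12, 13, 13, 14, 15, 17, 20, 25, 33)
samp2 <- c(14, 15, 15, 16, 17, 17, 21, 25, 27, 35, 50)
n1 <- length(samp1)
n2 <- length(samp2)

# Rank all values as if they were one group (ties get average ranks)
r <- rank(c(samp1, samp2))

# Sum of ranks in each sample
R1 <- sum(r[1:n1])
R2 <- sum(r[(n1 + 1):(n1 + n2)])

# Rescale by subtracting the lowest possible rank sum for each sample
W1 <- R1 - n1 * (n1 + 1) / 2
W2 <- R2 - n2 * (n2 + 1) / 2
c(R1 = R1, R2 = R2, W1 = W1, W2 = W2)

# Compare with R's test; exact = FALSE asks for the normal approximation
# directly, which avoids the warning about ties
wt <- wilcox.test(samp1, samp2, alternative = "two.sided", exact = FALSE)
wt$statistic
wt$p.value
```

Here W1 matches the W = 30 that R reports, and W1 + W2 equals n1 times n2, so knowing one rescaled sum determines the other.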
· When the conditions for the t test are met, the two independent sample t test has more power than the Wilcoxon rank sum test
· Decision tree for checking conditions for two independent (quantitative) populations
Start Aho Chapter 7: Sampling Design and Experimental Design
· Terminology already covered previously
· Variables
· Explanatory vs response variable
· Other words for explanatory: independent variable, predictor, covariate
· Other word for response variable: dependent variable, outcome variable
· Categorical vs quantitative variables
· Discrete vs continuous variables
· Sampling
· The goal is always to get a representative sample, so that statistics will be unbiased, meaning that the average value of the statistic is equal to the parameter that is being estimated
· Common kinds of bias
· Selection bias- something about how you select the sample leads to an unrepresentative sample
· Example: undercoverage- some subjects were not available to be selected
· Response bias- subjects supply information, but it is inaccurate
· Example: people lie about their age
· Nonresponse bias- some subjects are selected, but are not included in the sample because they cannot be reached or found, or because they decline to answer
· If the subjects that do not provide information are different than those that do, that is a biased sample
· Example: people refuse to answer what their age is
· There are three main ways to get a representative sample from a population
· Simple random sampling
· Every subject has an equal chance of being selected, and every possible sample of the same size is equally likely
· Stratified random sampling
· Population is divided into strata, and hopefully within each stratum, the subjects are similar to each other (homogeneous)
· A random sample is taken from each stratum, which makes sure that no group is missed, and if done well, this can make the statistics from the sample have better properties
· Cluster sampling
· Population is divided into clusters, and then some of the clusters are chosen randomly (and all subjects in each chosen cluster are in the sample)
· This is typically motivated by large geographical distances between clusters, and using clusters allows data to be acquired more efficiently
· Everything we do in our class requires having a simple random sample
· The main thing you should worry about when dealing with your data is if the sample was taken in a way other than simple random sampling, because the way you analyze the data has to account for using stratified or cluster sampling
· We won’t worry about this in our class, other than you reading a description and determining if the sample was taken other than as a simple random sample
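To make the three sampling methods concrete, here is a minimal sketch using sample() on a made-up population. The population data frame, the stratum and cluster labels, the sample sizes, and the seed are all hypothetical and chosen only for illustration.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: 100 subjects, 4 strata, 10 clusters
pop <- data.frame(id      = 1:100,
                  stratum = rep(c("A", "B", "C", "D"), each = 25),
                  cluster = rep(1:10, times = 10))

# Simple random sample of 20 subjects
srs <- pop[sample(nrow(pop), 20), ]

# Stratified random sample: 5 subjects chosen at random within each stratum
strat_ids <- unlist(lapply(split(pop$id, pop$stratum),
                           function(ids) sample(ids, 5)))
strat <- pop[pop$id %in% strat_ids, ]

# Cluster sample: choose 2 clusters at random, keep everyone in them
chosen <- sample(unique(pop$cluster), 2)
clust  <- pop[pop$cluster %in% chosen, ]

table(strat$stratum)   # every stratum is represented
table(clust$cluster)   # only the chosen clusters appear
```

All three samples have 20 subjects here, but only the first is a simple random sample, which is what the procedures in our class assume.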
· Observational study vs randomized experiment
· After a sample is taken, we collect variables from each subject
· if subjects are not manipulated by the researcher, then the data collection method is called an observational study
· if subjects are manipulated (randomly assigned to different categories of the primary explanatory variable) by the researcher, then the data collection method is called a randomized experiment
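Random assignment in a randomized experiment can be sketched in R as follows. This assumes 20 hypothetical subjects split evenly into two groups. The subject numbers, group labels, and seed are made up for illustration.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible

# Hypothetical subjects, numbered 1 to 20
subjects <- 1:20

# Shuffle ten 1s and ten 2s so each subject gets a random group
group <- sample(rep(c(1, 2), each = 10))

assignment <- data.frame(subject = subjects, group = group)
table(assignment$group)   # 10 subjects in each group
head(assignment)
```

Because chance alone decides who is in each group, other variables (observed or not) tend to balance out between the groups.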
· Confounding and lurking variables
· When we have a response variable (y) and one or more explanatory variables (x’s), the goal is to learn about the relationship between y and the x’s
· Sometimes there can be a strong relationship between y and x1 , but that can be because there is another variable x2 that is related to both y and x1
· x2 is called a confounding variable
· example: y = murder rate, x1 = ice cream sales
· example: y= lung cancer (Yes/No), x1 = smoking status
· when x2 is not known or not observed, x2 is often called a lurking variable
· Website: http://www.rossmanchance.com/applets/Subjects.html
· Response variable is how high people can jump
· Primary explanatory variable is group (1 or 2), where group 1 has been training with a special kind of shoe, and group 2 is traditional shoe
· Other observed explanatory variables are gender and height
· There are many different ways to design a randomized experiment, depending on the research goals
· If you are someone that may be designing experiments in your future, I recommend you read section 7.6 on pages 269 to 287
· Unfortunately we don’t have time to go into great detail on these many types of experimental design
· Confounding can take place for quantitative explanatory variables too, but you typically don’t have control over these; you just have to try to incorporate all important explanatory variables into the analysis
· What does confounding look like in 3D?
· Y = BP, X1 = fiber intake, X2 = weight
· Note the difference in the slopes of the two lines. The red line (line at the top) is when weight is ignored (not in the model), and the blue line (line at the bottom) is when weight is in the model alongside fiber.
[Figure: 3D plot of BP against fiber intake and weight, showing the red line and the blue plane]
· Note the red line is the average pattern for the relationship between BP and fiber when weight is not included in the model
· When weight is included, the relationship between BP and fiber is where the blue plane intersects the BP/fiber plane, and that slope is not as steep as the red line
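The change in slope can be reproduced with simulated data. This is only a sketch: the data below are randomly generated, not the data behind the plot, and every number in it (sample size, coefficients, standard deviations, seed) is an assumption chosen so that weight is related to both fiber and BP.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible
n <- 500

# Simulated (made-up) data: heavier subjects eat less fiber and have higher BP
weight <- rnorm(n, mean = 80, sd = 12)
fiber  <- 40 - 0.25 * weight + rnorm(n, sd = 3)
bp     <- 100 + 0.5 * weight - 0.3 * fiber + rnorm(n, sd = 5)

# Slope for fiber when weight is ignored (like the red line)
slope_ignoring <- coef(lm(bp ~ fiber))["fiber"]

# Slope for fiber when weight is in the model (like the blue plane)
slope_adjusted <- coef(lm(bp ~ fiber + weight))["fiber"]

round(c(ignoring_weight = unname(slope_ignoring),
        with_weight     = unname(slope_adjusted)), 2)
```

In this simulation the fiber slope is noticeably steeper when weight is left out, because part of what looks like a fiber effect is really a weight effect.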
Chapter 8 reading: Aho Chapter 8, pages 295-311 and 316-318 (starting at section 8.4, ignoring Kendall’s tau)