statistics with R programming
Week 9 Outline (10-22-20):
Assignment 5 is on BB, due Friday at 5.
Questions
Finish Aho Chapter 7: Sampling Design and Experimental Design
· Terminology already covered previously
· Variables
· Explanatory vs response variable
· Other words for explanatory: independent variable, predictor, covariate
· Other word for response variable: dependent variable, outcome variable
· Categorical vs quantitative variables
· Discrete vs continuous variables
· Sampling
· The goal is always to get a representative sample, so that statistics will be unbiased, meaning that the average value of the statistic is equal to the parameter that is being estimated
· Common kinds of bias
· Selection bias- something about how you select the sample leads to an unrepresentative sample
· Example: undercoverage- some subjects were not available to be selected
· Response bias- subjects supply information, but it is inaccurate
· Example: people lie about their age
· Nonresponse bias- some subjects are selected, but are not included in the sample because they cannot be reached or they refuse to respond
· If the subjects that do not provide information are different from those that do, that is a biased sample
· Example: people refuse to answer what their age is
· There are three main ways to get a representative sample from a population
· Simple random sampling
· Every subject has an equal chance of being selected, and every possible sample of a given size is equally likely
· Stratified random sampling
· Population is divided into strata, and hopefully within each stratum, the subjects are similar to each other (homogeneous)
· A random sample is taken from each stratum, which makes sure that no group is missed, and if done well, this can make the statistics from the sample have better properties (for example, less variability)
· Cluster sampling
· Population is divided into clusters, and then some of the clusters are chosen randomly (and all subjects in each chosen cluster are in the sample)
· This is typically motivated by large geographical distances between clusters, and using clusters allows data to be acquired more efficiently
· Everything we do in our class requires having a simple random sample
· The main thing you should worry about when dealing with your data is if the sample was taken in a way other than simple random sampling, because the way you analyze the data has to account for using stratified or cluster sampling
· We won’t worry about this in our class, other than you reading a description and determining if the sample was taken other than as a simple random sample
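The three sampling designs above can be sketched in R with `sample()`. This is only an illustration: the `population` data frame, its strata labels, and its cluster numbers are hypothetical and not from the Aho text.

```r
set.seed(1)  # make the random draws reproducible

# Hypothetical population: 100 subjects in 4 strata and 10 clusters
population <- data.frame(
  id      = 1:100,
  stratum = rep(c("A", "B", "C", "D"), each = 25),
  cluster = rep(1:10, times = 10)
)

# Simple random sample of 20 subjects
srs <- population[sample(nrow(population), size = 20), ]

# Stratified random sample: 5 subjects drawn at random within each stratum
strat_rows <- unlist(lapply(
  split(seq_len(nrow(population)), population$stratum),
  function(rows) sample(rows, size = 5)
))
stratified <- population[strat_rows, ]

# Cluster sample: choose 2 clusters at random, keep every subject in them
chosen_clusters <- sample(unique(population$cluster), size = 2)
clustered <- population[population$cluster %in% chosen_clusters, ]

table(stratified$stratum)  # how many subjects came from each stratum
table(clustered$cluster)   # which clusters ended up in the sample
```

Note that the stratified sample is guaranteed to include every stratum, while the cluster sample includes only the chosen clusters.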
· Observational study vs randomized experiment
· After a sample is taken, we collect variables from each subject
· if subjects are not manipulated by the researcher, then the data collection method is called an observational study
· if subjects are manipulated (randomly assigned to different categories of the primary explanatory variable) by the researcher, then the data collection method is called a randomized experiment
· Confounding and lurking variables
· When we have a response variable (y) and one or more explanatory variables (x’s), the goal is to learn about the relationship between y and the x’s
· Sometimes there can be a strong relationship between y and x1, but that can be because there is another variable x2 that is related to both y and x1
· x2 is called a confounding variable
· example: y = murder rate, x1 = ice cream sales
· example: y= lung cancer (Yes/No), x1 = smoking status
· when x2 is not known or not observed, x2 is often called a lurking variable
· When you randomly assign subjects to groups (as defined by the primary explanatory variable), as in a randomized experiment, you can balance out other (confounding) variables, so that the only difference between the groups is that particular variable of primary interest, which will allow you to infer a cause and effect relationship between the response variable and the primary explanatory variable
· Website: http://www.rossmanchance.com/applets/Subjects.html
· Response variable is how high people can jump
· Primary explanatory variable is group (1 or 2), where group 1 has been training with a special kind of shoe, and group 2 is traditional shoe
· Other observed explanatory variables are gender and height
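The idea behind the applet can be imitated with a few lines of R. The sketch below uses made-up heights (the `subjects` data frame is hypothetical) and randomly assigns subjects to the two shoe groups, then compares the groups on a variable that played no part in the assignment.

```r
set.seed(2)  # reproducible random assignment

# Hypothetical subjects: 20 people with a recorded height (inches)
subjects <- data.frame(
  id     = 1:20,
  height = round(rnorm(20, mean = 68, sd = 3), 1)
)

# Randomly assign exactly 10 subjects to each shoe group
subjects$group <- sample(rep(c(1, 2), each = 10))

# Compare the groups on a variable that was NOT used in the assignment
tapply(subjects$height, subjects$group, mean)
```

Rerunning with different seeds shows that the group means for height differ only by chance, which is what random assignment is meant to achieve.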
· There are many different ways to design a randomized experiment, depending on the research goals
· If you are someone that may be designing experiments in your future, I recommend you read section 7.6 on pages 269 to 287
· Unfortunately we don’t have time to go into great detail on these many types of experimental design
· Confounding can take place for quantitative explanatory variables too, but you typically don’t have control over these; you just have to try to incorporate all important explanatory variables into the analysis
· What does confounding look like in 3D?
· Y = BP, X1 = fiber intake, X2 = weight
· Note the difference in the slopes of the two lines. The red line (line at the top) is when weight is ignored (not in the model), and the blue line (line at the bottom) is when weight is in the model alongside fiber.
· Note the red line is the average pattern for the relationship between BP and fiber when weight is not included in the model
· When weight is included, the relationship between BP and fiber is where the blue plane intersects the BP/fiber plane, and that slope is not as steep as the red line
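The change in slope described above can be reproduced with simulated data. This is a sketch under invented assumptions: the numbers below are not real BP, fiber, or weight measurements, and the coefficients were chosen only so that weight is related to both fiber and BP.

```r
set.seed(3)
n <- 1000

# Simulated data (not real measurements): weight drives both fiber and BP
weight <- rnorm(n, mean = 80, sd = 10)
fiber  <- 40 - 0.3 * weight + rnorm(n, sd = 3)
bp     <- 100 + 0.5 * weight - 0.2 * fiber + rnorm(n, sd = 5)

fit_ignore <- lm(bp ~ fiber)           # weight left out of the model
fit_adjust <- lm(bp ~ fiber + weight)  # weight included in the model

coef(fit_ignore)["fiber"]  # steeper slope (it absorbs the weight effect)
coef(fit_adjust)["fiber"]  # close to the -0.2 used in the simulation
```

The slope for fiber is steeper when weight is ignored, matching the red line versus blue line comparison.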
Start Chapter 8: Correlation
· Main idea is that we have two quantitative variables, and want to know how strong the association is between them
· We typically use y for response variable and x for explanatory variable
· Scatterplot description for SLR- address 4 issues:
· (1) What is the average pattern? Linear or clearly non-linear
· (2) What is the direction? Positive or negative
· (3) What is the strength? Very strong, moderately strong, somewhat weak, etc.
· Strength means how much vertical variability is there from the average pattern
· (4) Identify any significant outliers
· Example:
· We use $r$ for (Pearson’s) sample correlation coefficient, and $\rho$ for (Pearson’s) population correlation coefficient
· r is always between -1 and 1
· In order to use r to summarize strength, we need to verify that the average pattern is linear (or at least not clearly non-linear) and that there are no influential outliers
· Regression applet for effect of outliers:
· http://www.rossmanchance.com/applets/RegShuffle.htm
· One way to check to see if outliers are influential is to calculate r with and without the outlier present
· Nonlinear- r can still be calculated when the average pattern is non-linear (not a good idea), but it can be misleading
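The with-and-without check for an influential outlier can be sketched as follows. The data are made up for illustration: ten points close to a line, plus one point far from the pattern.

```r
# Made-up data: ten points close to a line, plus one outlier (the 11th point)
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9, 2.0)

r_with    <- cor(x, y)            # all 11 points
r_without <- cor(x[-11], y[-11])  # outlier removed

c(with = r_with, without = r_without)
```

A large difference between the two values is a sign that the single point is influential.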
· Other facts about r
· If you change the units on x or y, the value of r will not change
· If you switch the roles of x and y, the value of r will not change
· A high correlation is not causation
· Ice cream sales vs murder rates
· Spurious correlation website: https://www.tylervigen.com/spurious-correlations
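The first two facts about r (unit changes and switching x and y) can be checked directly. The sketch uses R's built-in `cars` data (speed in mph, stopping distance in feet) rather than the class `crabs` data, so it can be run without any extra files.

```r
# Built-in 'cars' data: speed (mph) and stopping distance (ft)
x <- cars$speed
y <- cars$dist

r_original <- cor(x, y)
r_units    <- cor(x * 1.609344, y * 0.3048)  # speed in km/h, distance in meters
r_switched <- cor(y, x)                      # roles of x and y swapped

all.equal(r_original, r_units)     # TRUE
all.equal(r_original, r_switched)  # TRUE
```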
· We won’t hand calculate r, but it can be found from the calculator or R
· How to get r from R
> cor(crabs$width,crabs$weight,method="pearson")
[1] 0.8868715
· Can do hypothesis testing and confidence intervals for $\rho$
· Aho text talks about testing and calculating confidence intervals for $\rho$, both of which require that y and x have a bivariate normal distribution (see Aho pages 299-300 for what bivariate normal looks like), in addition to no obvious non-linear relationship, and no influential outliers
· While the confidence interval can be useful (and easily found with R as you can see below), we will wait until the next chapter to consider hypothesis testing for regression, as there is an equivalent way to test by focusing on the slope of the best fitting line
> cor.test(crabs$width,crabs$weight,method="pearson")
        Pearson's product-moment correlation

data:  crabs$width and crabs$weight
t = 25.102, df = 171, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8501665 0.9149979
sample estimates:
      cor
0.8868715
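The equivalence between testing the correlation and testing the slope of the best fitting line can be checked numerically. The sketch below uses the built-in `cars` data as a stand-in for the class `crabs` data; the object names `ct`, `fit`, and `slope_row` are just illustrative.

```r
ct  <- cor.test(cars$speed, cars$dist, method = "pearson")
fit <- lm(dist ~ speed, data = cars)
slope_row <- summary(fit)$coefficients["speed", ]

ct$statistic          # t statistic for testing that the correlation is 0
slope_row["t value"]  # t statistic for testing that the slope is 0
ct$conf.int           # 95% confidence interval for the correlation
```

The two t statistics agree, so the two tests give the same p-value.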
· What if the average pattern is not linear, or if there are influential outliers?
· A transformation of y or x may fix non-linearity, such as square root, square, etc.
· We won’t do anything with this in our class
· Spearman’s correlation: $\rho_s$ for population and $r_s$ for sample
· Requires a monotonic average pattern (consistently increasing or consistently decreasing, but not necessarily linear)
· Accounts for non-linear when monotonic, and minimizes effect of outliers
· Rank the values of x, and separately rank the values of y, and then compute pearson’s correlation on the ranks instead of the original data
· How to get Spearman’s correlation from R
> cor(crabs$width,crabs$weight,method="spearman")
[1] 0.899067
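The ranking description of Spearman's correlation can be verified by hand. This sketch again uses the built-in `cars` data in place of `crabs`; ties are given average ranks, which is the default for `rank()`.

```r
x <- cars$speed
y <- cars$dist

r_s_builtin <- cor(x, y, method = "spearman")

# Rank x and y separately, then compute Pearson's correlation on the ranks
r_s_by_hand <- cor(rank(x), rank(y), method = "pearson")

all.equal(r_s_builtin, r_s_by_hand)  # TRUE
```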
Start Chapter 9: Linear regression
· Main idea is that we have a quantitative response variable y, and one or more explanatory variables (categorical or quantitative) that we call the x’s
· If there is only one x, we have simple linear regression (SLR)
· If there is more than one x, we have multiple linear regression (MLR)
· SLR
· Besides correlation, we want to summarize the relationship between y and x by describing the average pattern with a linear function, called the ordinary least squares (OLS) regression line, or best fitting line
· The slope is a summary of the relationship between y and x because it describes how fast y changes with respect to x
· SLR model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ for subject $i$
· This is sometimes written as $E(y_i) = \beta_0 + \beta_1 x_i$ because the line describes the average pattern of y, or expected value of y
· $\beta_0$ is the population y-intercept
· $\beta_1$ is the population slope
· $\varepsilon_i$ is the population residual or error for subject i
· We never have the entire population, so we often write the SLR model in terms of the sample data:
· $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{\varepsilon}_i$
· Putting “hats” on the beta’s just means estimated, or the sample version of the population parameter
· This can also be written in terms of the equation of the best fitting line with a “hat” on y: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
· So $\hat{y}_i$ is the point on the line for subject i, or the predicted value of y for subject i
· This means the residual (or error) can be written as $\hat{\varepsilon}_i = y_i - \hat{y}_i$
· How to find the best fitting line? Ordinary least squares (OLS)
· This means we want to find the line that makes the sum of the squared residuals as small as possible
· http://www.rossmanchance.com/applets/RegShuffle.htm
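For SLR, the least squares line has a closed-form solution: the slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the point of means. The sketch below computes these by hand on the built-in `cars` data (an assumption, chosen so the code runs without the class data) and compares them with `lm()`.

```r
x <- cars$speed
y <- cars$dist

# Closed-form OLS estimates
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # y-intercept

fit <- lm(y ~ x)
coef(fit)   # estimates from lm()
c(b0, b1)   # estimates computed by hand

# Predicted values and residuals from the hand-computed line
y_hat <- b0 + b1 * x
residuals_by_hand <- y - y_hat
sum(residuals_by_hand^2)  # the quantity that OLS makes as small as possible
```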
· Interpreting the slope and y-intercept
· Predicting y based on a given value of x
· $\hat{y}$ can be viewed as the predicted value of y for an individual subject with explanatory variable value $x$
· $\hat{y}$ can also be viewed as the average value of y for all subjects with a common explanatory variable value $x$
· Just as when we talked about having a point estimate vs a confidence interval, we have confidence intervals for both perspectives, which we’ll chat about after we cover hypothesis testing (because there are assumptions to worry about)
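Both perspectives are available in R through `predict()` on a fitted `lm` object. The sketch below uses the built-in `cars` data and an arbitrarily chosen speed of 15 mph; both choices are assumptions made only for illustration.

```r
fit   <- lm(dist ~ speed, data = cars)
new_x <- data.frame(speed = 15)  # arbitrary value of the explanatory variable

# Interval for the average value of y among subjects with this x
conf_int <- predict(fit, newdata = new_x, interval = "confidence", level = 0.95)

# Interval for the value of y for one individual subject with this x
pred_int <- predict(fit, newdata = new_x, interval = "prediction", level = 0.95)

conf_int
pred_int
```

The point estimate (`fit` column) is the same in both cases, but the interval for an individual subject is wider than the interval for the mean.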
· Coefficient of determination
· Goal is to see how much of the total variation in y is explained by the regression line, so we partition the total variation into two parts, residual and regression:
· $\sum (y_i - \bar{y})^2 = \sum (y_i - \hat{y}_i)^2 + \sum (\hat{y}_i - \bar{y})^2$
· Example:
Subject i | x | y | Predicted value of y ($\hat{y}_i$) | Total variation in y ($y_i - \bar{y}$) | Residual error ($y_i - \hat{y}_i$) | Regression ($\hat{y}_i - \bar{y}$)
1 | 1 | 2 | | | |
2 | 3 | 6 | | | |
3 | 5 | 4 | | | |
4 | 7 | 8 | | | |
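The partition of the total variation can be worked through in R for the four subjects in the example. This sketch assumes the three numbers listed for each subject are the subject number, x, and y (so x = 1, 3, 5, 7 and y = 2, 6, 4, 8); that reading of the table is an assumption.

```r
# Example data, assuming the table columns are subject, x, y
x <- c(1, 3, 5, 7)
y <- c(2, 6, 4, 8)

fit   <- lm(y ~ x)
y_hat <- fitted(fit)  # predicted value of y for each subject

# One row per subject: the three deviations that get squared and summed
data.frame(
  x, y, y_hat,
  total      = y - mean(y),
  residual   = y - y_hat,
  regression = y_hat - mean(y)
)

ss_total <- sum((y - mean(y))^2)      # total variation in y
ss_resid <- sum((y - y_hat)^2)        # residual (unexplained) variation
ss_reg   <- sum((y_hat - mean(y))^2)  # variation explained by the line

c(total = ss_total, residual = ss_resid, regression = ss_reg)

# Coefficient of determination: proportion of total variation explained
r_squared <- ss_reg / ss_total
r_squared
```

For SLR, `r_squared` equals the square of Pearson's r, which can be confirmed with `cor(x, y)^2`.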
· Hypothesis testing
· Focusing on slope directly
· Confidence interval for mean of y and prediction interval for individual subject
· MLR
Chapter 9 reading: Aho Chapter 9: Pages 321-359