statistics with R programming
Week 10 Outline (11-4-20):
Assignment 6 coming soon.
Exam 1 is not graded.
Questions
Start Chapter 9: Linear regression
· Main idea is that we have a quantitative response variable y, and one or more explanatory variables (categorical or quantitative) that we call the x’s
· If there is only one x, we have simple linear regression (SLR)
· If there is more than one x, we have multiple linear regression (MLR)
· SLR
· Besides correlation, we want to summarize the relationship between y and x by describing the average pattern with a linear function, called the ordinary least squares (OLS) regression line, or best fitting line
· The slope is a summary of the relationship between y and x because it describes how fast y changes with respect to x
· SLR model: for subject
· We never have the entire population, so we often write the SLR model in term of the sample data:
·
· Putting “hats” on the beta’s just means estimated, or sample version of
· This can also be written in terms of the equation of the best fitting line with a “hat” on y:
· So is the point on the line for subject i, or the predicted value of y for subject i
· This means the residual (or error) can be written as
· Coefficient of determination
· Goal is to see how much of the total variation in y is explained by the regression line, so we partition the total variation into two parts, residual and regression:
· Example:
|
|
|
|
Predicted value of y |
Total variation in y |
Residual error |
Regression |
|
|
|
|
|
|
|
|
|
1 |
1 |
2
|
|
|
|
|
|
2 |
3 |
6
|
|
|
|
|
|
3 |
5 |
4
|
|
|
|
|
|
4 |
7 |
8
|
|
|
|
|
· Hypothesis testing
· Still need a random sample to do these
· Main idea is to determine if there is evidence in the sample that x and y have a linear relationship in the population
· If there is no relationship between x and y in the population, then there would be no discernable pattern
· The best fitting line to this scatterplot would be horizontal
· That means , which says there is no linear relationship between x and y in the population
· can be , or , or just as in previous chapters, depending on the problem statement
· We can still follow the same steps as before, but the conditions to check are different, and the test statistic is different
· Conditions to check in order for the test for the slope to be valid (just to be really clear, we still require a random sample, and that the relationship is not clearly not linear, and no influential outliers, just as before)
· Normality: The residuals must be normally distributed (sample size does not play a role here)
· Can just make boxplot or histogram of residuals to check
· Constant variance: y should have the same variability regardless of the value of x (or equivalently, the residuals should have the same variability when plotted against the predicted (fitted) values)
· This can be checked with scatterplot of y vs x as shown, but typically is checked with scatterplot with residuals plotted on vertical axis and predicted (fitted) values on horizontal axis
· This is because when we get to MLR, there will be more than one x, but still only one set of residuals and one set of predicted values
· The above scatterplots would correspond to residual vs predicted plots as shown here:
· Putting these together looks like
· Test statistic
· We won’t calculate by hand because of limited time, but you can see the formula in the book, and it closely resembles the t test statistic for one quantitative variable
· Example of 8 steps: crabs data, y=weight, x=width
· Determine if there is evidence in the sample that there is a positive linear relationship between weight and width in the population.
Step 0: From each individual (crab), we observe weight and width, both quantitative, and we want to determine if there is a relationship, so this is simple linear regression.
Step 1: (1) linear relationship (or at least nothing clearly not linear)? Scatterplot of y vs x looks linear
> ggplot(data=crabs,aes(x=width,y=weight))+geom_point()
(2) residuals normal?
> fit<-lm(data=crabs,weight~width)
> summary(fit)
Call:lm(formula = weight ~ width, data = crabs)Residuals:Min 1Q Median 3Q Max-1.18600 -0.11615 -0.00513 0.15076 1.01550Coefficients:Estimate Std. Error t value Pr(>|t|)(Intercept) -3.944019 0.255027 -15.46 <2e-16 ***width 0.242642 0.009666 25.10 <2e-16 ***---
> ggplot(data=crabs,aes(y=fit$residuals))+geom_boxplot()
(3) Constant variance (no funnel shape)?
> plot(fit)
Step 2: = population slope for x=width and y=weight
Step 3:
= 0.05
Step 4: test statistic is t = 25.10 from R
> fit<-lm(data=crabs,weight~width)> summary(fit)Call:lm(formula = weight ~ width, data = crabs)Residuals:Min 1Q Median 3Q Max-1.18600 -0.11615 -0.00513 0.15076 1.01550Coefficients:Estimate Std. Error t value Pr(>|t|)(Intercept) -3.944019 0.255027 -15.46 <2e-16 ***width 0.242642 0.009666 25.10 <2e-16 ***
Step 5: From R we get two sided p-value < 2x10^-16, so we divide by 2, to get p-value < 1x10^-16
Step 6: p-value = 1x10^-16 < 0.05 = so we reject
Step 7: There is sufficient evidence to support that there is a positive linear relationship between weight and width in the population.
· Can also get CI for slope
> confint(fit,'width',level=0.95,data=crabs)2.5 % 97.5 %width 0.2235614 0.261723
· Anova approach to testing population slope
· This will give the same result as t test for SLR, but different for MLR as we will see below
· The main difference is the test statistic F, which follows an F distribution if the conditions are met
· F compares the SS Regression to the SS residual (rather than the sample slope to the value in Ho), incorporating appropriate degrees of freedom
> fit<-lm(data=crabs,weight~width)> summary(fit)Call:lm(formula = weight ~ width, data = crabs)Residuals:Min 1Q Median 3Q Max-1.18600 -0.11615 -0.00513 0.15076 1.01550Coefficients:Estimate Std. Error t value Pr(>|t|)(Intercept) -3.944019 0.255027 -15.46 <2e-16 ***width 0.242642 0.009666 25.10 <2e-16 ***---Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1Residual standard error: 0.2674 on 171 degrees of freedomMultiple R-squared: 0.7865, Adjusted R-squared: 0.7853F-statistic: 630.1 on 1 and 171 DF, p-value: < 2.2e-16
· Confidence interval for mean of y for all individuals with a given value of x, and prediction interval for y for an individual subject with a particular value of x
· Remember not to predict y for values of x outside the range of x’s in the data (extrapolation)
> new.dat <- data.frame(width=25)> predict(fit, newdata = new.dat, interval = 'confidence')fit lwr upr1 2.122036 2.074874 2.169199> predict(fit, newdata = new.dat, interval = 'prediction')fit lwr upr1 2.122036 1.592156 2.651917
· MLR
· MLR model: for subject
https://qph.fs.quoracdn.net/main-qimg-f51f77ed4567fb0a12611b4f50214c1c· We never have the entire population, so we often write the SLR model in term of the sample data:
·
· Putting “hats” on the beta’s just means estimated, or sample version of
· This can also be written in terms of the equation of the best fitting line with a “hat” on y:
· Slopes (regression coefficients) are interpreted the same way as in SLR, but we are also force the other x’s to remain constant when we interpret the slope for a particular x
· We still require that
· No obvious non-linear relationship, but this is harder to check in more than two dimensions, though we can check the scatterplot of each x with y
· No extreme outliers exerting influence
· We use r2 typically in MLR, because r is specific to two variables, but r2 can be interpreted as the percent of variability in y explained by the x’s (or explained by the model)
· You can still use r, but it is interpreted as the correlation between the raw values of and (and is always positive in MLR)
· We still want to make predictions for y based on given values for the x’s
· We still need a strong relationship (“high” value of r2) to make reliable predictions
· We still need to avoid extrapolation (making predictions based on x’s outside the range of the x’s in the data)
· There are more issues in MLR because there is more than one explanatory variable
· How to decide what x’s should be in the model
· There are different ways of doing this, but the best is if the expert in the context of the data knows which x’s should be present
· Some people use hypothesis testing on individual slopes, but this is not ideal
· There are more complicated ways of trying to decide which x’s should be in the model, but we don’t have time to go into it in our class
· Collinearity
· The quantitative explanatory variables are strongly related to each other
· This redundancy causes reliability issues in the slope estimates (makes their variability artificially higher)
Perfectly colinear x’s:
Not colinear: y=mpg, x1 = weight of engine, x2 = compression ratio of engine
Highly colinear: y = mpg, x1 = weight, x2 = displacement of engine
· Collinearity can be detected by
· High correlations between quantitative x’s
· Variance inflation factor (or tolerance)
· For each quantitative x, a regression is run where that x variable is the response variable, and all other quantitative x’s are explanatory variables
· From that regression we calculate a value of rx2
· VIF = rx2)
· VIF cannot be smaller than 1, and bigger values mean more collinearity
· Need to load car package to use vif() in R
> fitMLR<-lm(mpg~weight+displacement+acceleration,data=cars)> summary(fitMLR)Call:lm(formula = mpg ~ weight + displacement + acceleration, data = cars)Residuals:Min 1Q Median 3Q Max-11.7382 -2.8112 -0.3607 2.5231 16.1845Coefficients:Estimate Std. Error t value Pr(>|t|)(Intercept) 41.2990756 1.8614975 22.186 < 2e-16 ***weight -0.0061889 0.0007396 -8.368 1.03e-15 ***displacement -0.0108953 0.0065036 -1.675 0.0947 .acceleration 0.1738507 0.0975107 1.783 0.0754 .
> vif(fitMLR)weight displacement acceleration8.444839 9.899313 1.556585
· Interaction (between two quantitative x’s)
· The relationship between y and one of the x’s can depend on the value of another x variable
· Interaction between two x’s is incorporated by including the product (multiplication) between the two x’s:
· Interpretation of
> fitMLR<-lm(mpg~weight*displacement,data=cars)> summary(fitMLR)Call:lm(formula = mpg ~ weight * displacement, data = cars)Residuals:Min 1Q Median 3Q Max-13.9214 -2.4624 -0.3155 1.8369 17.9674Coefficients:Estimate Std. Error t value Pr(>|t|)(Intercept) 5.395e+01 1.925e+00 28.032 < 2e-16 ***weight -8.998e-03 8.412e-04 -10.697 < 2e-16 ***displacement -7.936e-02 1.130e-02 -7.025 9.44e-12 ***weight:displacement 1.772e-05 2.778e-06 6.379 5.00e-10 ***
· Geometrically interaction (between two quantitative x’s) puts curvature into the plane
No interaction Interaction
· Confounding
· The relationship between y and one of the x’s (measured by the slope for that x) can change a lot based on if a confounding variable is included in the model or not
· This can be avoided by including all important x’s in the model
· What does confounding look like in 3D?
· Y = BP, X1 = fiber intake, X2 = weight
· Note the difference in the slopes of the two lines. The red line (line at the top) is when weight is ignored (not in the model), and the blue line (line at the bottom) is when weight is in the model alongside fiber.
· We can tell that interaction is not included here because the plane does not have any curvature to it.
· Note the red line is the average pattern for the relationship between BP and fiber when weight is not included in the model
· When weight is included, the relationship between BP and fiber is where the blue plane intersects the BP/fiber plane, and that slope is not as steep as the red line
· Categorical explanatory variables
· Sometimes there will be explanatory variables that we want to incorporate alongside the quantitative explanatory variables
· We have to use dummy (indicator) variables to do so
· These are 0 or 1, basically like true or false
· You need one less dummy variable than number of categories
· If you want to include gender, then you could have 1 for male and 0 for female
· If you want to include a variable with three categories, then you need two dummy variables
· Example: race could be white, black or other
· If we pick “other” to be the reference category, then we could call the dummy variables dummywhite and dummyblack
|
Race |
Dummywhite |
Dummyblack |
|
White |
1 |
0 |
|
Black |
0 |
1 |
|
Other |
0 |
0 |
· Suppose blood pressure was the response variable, and we found the equation of the model to be
· Now predict y for each of the three groups to see what the slopes mean
· Interaction between a quantitative x and a categorical x
· Suppose y is blood pressure, and we have two explanatory variables, race and temperature
· That means we need the product of dummywhite with temperature and dummyblack with temperature
· Suppose the equation of the model is
· How to interpret these slopes? Simplify by interpreting separately for each race
· Can test all slopes = 0 simultaneously with anova approach
· Can test subset of slopes
· Can have CI for mean of Y or prediction interval
· One way to do MLR from scratch
Chapter 9 reading: Aho Chapter 9: Pages 321-359
Chapter 10 reading: Aho Chapter 10: Pages 421-428, 436-442, 454-456 (section 10.7), 459-463 (section 10.8), 466-469 (section 10.9)