statistics with R programming

jamesfiona1993

Week11outline.docx

Home >Mathematics homework help >Statistics homework help >statistics with R programming

Week 11 Outline (11-11-20):

Assignment 6 on BB, due Sunday at midnight.

Don’t forget about group presentation. I’ve only been approached by one student so far about a topic.

Exam 1 is not graded.

When to have Exam 2, and what it should cover? What about Thurs Dec 3 at 5, to Sat Dec 5 at 5? Chapters 6, 7, 8 and 9 (SLR only, not MLR), so over assignments 5 and 6?

Questions

Continue Aho Chapter 9

· We are still talking about OLS (ordinary least squares) regression, where our response variable is quantitative

· Now we turn our attention to Multiple Linear Regression (MLR) which means we have more than one explanatory variable, and these explanatory variables can be quantitative and/or categorical

· Having more than one explanatory variable raises issues that we do not encounter in simple linear regression (SLR), because there was only one explanatory variable

· Everything we cover here still requires our sample is taken randomly

· MLR

· MLR model: for subject

https://qph.fs.quoracdn.net/main-qimg-f51f77ed4567fb0a12611b4f50214c1c

· We never have the entire population, so we often write the SLR model in term of the sample data:

· In the above image, is being played by b

· Putting “hats” on the beta’s just means estimated, or sample version of

· This can also be written in terms of the equation of the best fitting line with a “hat” on y:

· Slopes (regression coefficients) are interpreted the same way as in SLR, but we are also force the other x’s to remain constant when we interpret the slope for a particular x

· We still require that

· No obvious non-linear relationships, but this is harder to check in more than two dimensions, though we can check the scatterplot of each x with y

· No extreme outliers exerting influence

· These can be easily checked with a scatter matrix for all the quantitative variables, which is just a bunch of scatterplots all shown together

· The one I use here requires the psych package

> head(cars)

  mpg cylinders displacement horsep weight acceleration year location

1  18         8          307    130   3504         12.0   70        1

2  15         8          350    165   3693         11.5   70        1

3  18         8          318    150   3436         11.0   70        1

4  16         8          304    150   3433         12.0   70        1

5  17         8          302    140   3449         10.5   70        1

6  15         8          429    198   4341         10.0   70        1

> newdata<-cars%>%select(1,3:5)

> head(newdata)

  mpg displacement horsep weight

1  18          307    130   3504

2  15          350    165   3693

3  18          318    150   3436

4  16          304    150   3433

5  17          302    140   3449

6  15          429    198   4341

> pairs.panels(newdata,method="pearson",density=F,ellipse=F,lm=T,bg=c("red","yellow","blue")[cars$location],pch=21)

· We use r2 typically in MLR, because r is specific to two variables, but r2 can be interpreted as the percent of variability in y explained by the x’s (or explained by the model)

· You can still use r, but it is interpreted as the correlation between the raw values of and (and is always positive in MLR)

· We still want to make predictions for y based on given values for the x’s

· We still need a strong relationship (“high” value of r2) to make reliable predictions

· We still need to avoid extrapolation (making predictions based on x’s outside the range of the x’s in the data)

· There are more issues in MLR because there is more than one explanatory variable

· How to decide what x’s should be in the model

· There are different ways of doing this, but the best is if the expert in the context of the data knows which x’s should be present

· Some people use hypothesis testing on individual slopes, but this is not ideal

· There are more complicated ways of trying to decide which x’s should be in the model, but we don’t have time to go into it in our class

· If you have more x’s than sample size, then you must somehow collapse the number of x’s in order to fit a model

· Collinearity

· The quantitative explanatory variables are strongly related to each other

· This redundancy causes reliability issues in the slope estimates (makes their variability artificially higher)

Perfectly colinear x’s:

Not colinear: y=mpg, x1 = weight of engine, x2 = compression ratio of engine

Highly colinear: y = mpg, x1 = weight, x2 = displacement of engine

· Collinearity can be detected by

· High correlations between quantitative x’s

· Variance inflation factor (or tolerance = 1/VIF)

· For each quantitative x, a regression is run where that x variable is the response variable, and all other quantitative x’s are explanatory variables

· From that regression we calculate a value of rx2

· VIF = rx2)

· VIF cannot be smaller than 1, and bigger values mean more collinearity (rule of thumb, VIF > 10 means a lot of collinearity, and 4<VIF<10 means moderate collinearity)

· Need to load car package to use vif() in R

> fitMLR<-lm(mpg~weight+displacement+acceleration,data=cars)

> summary(fitMLR)

Call:
lm(formula = mpg ~ weight + displacement + acceleration, data = cars)
Residuals:
     Min       1Q   Median       3Q      Max 
-11.7382  -2.8112  -0.3607   2.5231  16.1845 
Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  41.2990756  1.8614975  22.186  < 2e-16 ***
weight       -0.0061889  0.0007396  -8.368 1.03e-15 ***
displacement -0.0108953  0.0065036  -1.675   0.0947 .  
acceleration  0.1738507  0.0975107   1.783   0.0754 .  

> vif(fitMLR)
      weight displacement acceleration 
    8.444839     9.899313     1.556585 


· Interaction (between two quantitative x’s)
· The relationship between y and one of the x’s can depend on the value of another x variable
· Interaction between two x’s is incorporated by including the product (multiplication) between the two x’s:  
· Interpretation of 
· In order to interpret the slope from the interaction (product term), both of the original variables (often called main effects) must also be in the model
· There are differing philosophies on searching for interactions, but for our class when we are talking about MLR, you only need to worry about interactions if you are asked to do so

> fitMLR<-lm(mpg~weight*displacement,data=cars)
> summary(fitMLR)
Call:
lm(formula = mpg ~ weight * displacement, data = cars)
Residuals:
     Min       1Q   Median       3Q      Max 
-13.9214  -2.4624  -0.3155   1.8369  17.9674 
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          5.395e+01  1.925e+00  28.032  < 2e-16 ***
weight              -8.998e-03  8.412e-04 -10.697  < 2e-16 ***
displacement        -7.936e-02  1.130e-02  -7.025 9.44e-12 ***
weight:displacement  1.772e-05  2.778e-06   6.379 5.00e-10 ***

· Geometrically interaction (between two quantitative x’s) puts curvature into the plane



			No interaction					Interaction

· Confounding
· The relationship between y and one of the x’s (measured by the slope for that x) can change a lot based on if a confounding variable is included in the model or not
· This can be avoided by including all important x’s in the model
· What does confounding look like in 3D?
· Y = BP, X1 = fiber intake, X2 = weight
· Note the difference in the slopes of the two lines.  The red line (line at the top) is when weight is ignored (not in the model), and the blue line (line at the bottom) is when weight is in the model alongside fiber.
· We can tell that interaction is not included here because the plane does not have any curvature to it.



· Note the red line is the average pattern for the relationship between BP and fiber when weight is not included in the model
· When weight is included, the relationship between BP and fiber is where the blue plane intersects the BP/fiber plane, and that slope is not as steep as the red line


· Categorical explanatory variables
· Sometimes there will be explanatory variables that we want to incorporate alongside the quantitative explanatory variables
· We have to use dummy (indicator) variables to do so
· These are 0 or 1, basically like true or false
· You need one less dummy variable than number of categories
· If you want to include gender, then you could have 1 for male and 0 for female
· If you want to include a variable with three categories, then you need two dummy variables
· Example: race could be white, black or other
· If we pick “other” to be the reference category, then we could call the dummy variables dummywhite and dummyblack


	

Race

	

Dummywhite

	

Dummyblack



	
White 
	
1
	
0


	
Black 
	
0
	
1


	
Other 
	
0
	
0





· Suppose blood pressure was the response variable, and we found the equation of the model to be

· Now predict y for each of the three groups to see what the slopes mean



# Uses location 1 as reference group
> fitMLR<-lm(mpg~weight+displacement+acceleration+factor(location),data=cars)
> summary(fitMLR)
Call:
lm(formula = mpg ~ weight + displacement + acceleration + factor(location), 
    data = cars)
Residuals:
     Min       1Q   Median       3Q      Max 
-12.5893  -2.7235  -0.3225   2.3763  15.3953 
Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)       38.5507145  2.0194877  19.089  < 2e-16 ***
weight            -0.0061732  0.0007396  -8.347 1.21e-15 ***
displacement      -0.0050043  0.0069244  -0.723 0.470289    
acceleration       0.2339626  0.0980582   2.386 0.017509 *  
factor(location)2  0.9188979  0.6913708   1.329 0.184589    
factor(location)3  2.3423538  0.6819803   3.435 0.000657 ***




# Uses location 2 as reference group
> fitMLR<-lm(mpg~weight+displacement+acceleration+relevel(factor(location),ref="2"),
data=cars)
> summary(fitMLR)
Call:
lm(formula = mpg ~ weight + displacement + acceleration + relevel(factor(location), 
    ref = "2"), data = cars)
Residuals:
     Min       1Q   Median       3Q      Max 
-12.5893  -2.7235  -0.3225   2.3763  15.3953 
Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)                           39.4696124  1.9519160  20.221  < 2e-16 ***
weight                                -0.0061732  0.0007396  -8.347 1.21e-15 ***
displacement                          -0.0050043  0.0069244  -0.723   0.4703    
acceleration                           0.2339626  0.0980582   2.386   0.0175 *  
relevel(factor(location), ref = "2")1 -0.9188979  0.6913708  -1.329   0.1846    
relevel(factor(location), ref = "2")3  1.4234558  0.7057713   2.017   0.0444 *




· Interaction between a quantitative x and a categorical x
· Suppose y is blood pressure, and we have two explanatory variables, race and temperature
· That means we need the product of dummywhite with temperature and dummyblack with temperature
· Suppose the equation of the model is 

· How to interpret these slopes?  Simplify by interpreting separately for each race

· Can test all slopes = 0 simultaneously with anova approach
· Still requires that we have constant variance and normal residuals


> summary(fitMLR)
Call:
lm(formula = mpg ~ weight * displacement, data = cars)
Residuals:
     Min       1Q   Median       3Q      Max 
-13.9214  -2.4624  -0.3155   1.8369  17.9674 
Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          5.395e+01  1.925e+00  28.032  < 2e-16 ***
weight              -8.998e-03  8.412e-04 -10.697  < 2e-16 ***
displacement        -7.936e-02  1.130e-02  -7.025 9.44e-12 ***
weight:displacement  1.772e-05  2.778e-06   6.379 5.00e-10 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.105 on 394 degrees of freedom
Multiple R-squared:  0.7263,	Adjusted R-squared:  0.7242 
F-statistic: 348.4 on 3 and 394 DF,  p-value: < 2.2e-16




· Can test subset of slopes to be 0 (called partial F test)
· Still requires that we have constant variance and normal residuals
· Requires that all x’s the smaller model are also in the larger model (nested)
· Useful for generally comparing two models, or testing all interaction terms at once
> fitMLR1<-lm(mpg~weight+displacement+acceleration,data=cars)
> fitMLR2<-lm(mpg~weight,data=cars)
> anova(fitMLR1,fitMLR2)
Analysis of Variance Table
Model 1: mpg ~ weight + displacement + acceleration
Model 2: mpg ~ weight
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1    394 7266.2                                
2    396 7474.8 -2   -208.59 5.6552 0.003789 **



· Can have CI for mean of Y or prediction interval just as we did for MLR, and you have to specify values of all x’s in the model to do so (just as we did in SLR but there was only one x)
· Still requires that we have constant variance and normal residuals
· See code from last time to get these

· One way to do MLR from scratch, based only on what we’ve covered in our class
· These steps assume you don’t have a large number of explanatory variables
· There is so much more to regression, but we can only scratch the surface in our class.

· Do an exploratory data analysis (one variable at a time), to get familiar with all variables.  This would include summaries of all variables, especially the response variable.  
· For quantitative explanatory variables, check for outliers (and follow up on any to rule out data entry errors) or other unusual characteristics of the data, such as different sample sizes for the different variables (missing values).  What are typical values of the quantitative variables?  What is the shape of the distribution of the quantitative variables, especially the dependent variable?
· For categorical explanatory variables, determine how many individuals fall in each category.  If there are some categories with low counts, some adjustment may need to be made, such as collapsing categories in some meaningful way.  If any categorical variables have too many categories (more than 5 or 6), then some collapsing may need to occur.



· Perform basic bivariate (Y with one X at a time) explorations.  
· Make a scatter-matrix of your quantitative explanatory variables with the response variable, verifying that all relationships with the response variable are linear (or at least not obviously non-linear), and looking for outliers.  Transformations may need to be employed at this point.
· You can also make side-by-side boxplots for the categorical explanatory variables with the response variable, and note any interesting features, such as outliers, shapes of distributions, or difference in centers or variability.

· Check for collinearity for the quantitative explanatory variables.  
· If there is significant collinearity, then you will have to decide which independent variables to get rid of at this point.  

· Fit a MLR regression model for the x’s specified by the owner of the data (the expert in the data context should determine which x’s are important
· This includes dummy variables for categorical explanatory variable


· If any interactions are requested, check for them, by forming products (multiply) of independent variables.  
· If there are multiple interactions to check, consider doing partial F test to check all relevant slopes simultaneously.
· If an interaction term is to be kept in the final model, make sure to keep the original variables that it was created from (the main effects), even if they are not significant.

· When you get the final model, check conditions.  (If you have non-normal residuals or non-constant variance, transformations of y such as sqrt(y) may need to be used as remedies at this point, and model building redone, depending on how things look.)
· normality of residuals (boxplot, histogram)
· outliers (scattermatrix, residual vs fitted plot)
· constant variance (residuals vs fitted plot)
· You should have already looked for non-linearity in scattermatrix, but the residual vs fitted plot is also useful for this, meaning there should be no pattern at all in the residual vs fitted plot


· Interpret all slopes that are in your final model, including interactions.


· 

Interpret  for your final model.  What does this tell you about how much faith you should have in any predictions?

· Make any requested predictions, including confidence interval and/or prediction interval.




Chapter 10 reading:  Aho Chapter 10:  Pages 421-428, 436-442, 454-456 (section 10.7), 459-463 (section 10.8), 466-469 (section 10.9)

2
r