regression

profileGman 21$
Week7MultipleLinearRegression.pdf

MLR Recall: Linear Regression is a data analysis technique that tries to find a linear pattern in the data. We use all the data to calculate a straight line which may be used to predict the values. The equation of line for a Simple Linear Regression (SLR) is: �̂� = 𝛽1𝑥 + 𝛽0 Where 𝛽1 is the slope coefficient or the coefficient, 𝛽0 is the y-intercept and �̂� is the predicted y value. For Multiple Linear Regression (MLR) the equation of a line will look like: �̂� = 𝛽1𝑥1 + 𝛽2𝑥2 + 𝛽3𝑥3 + 𝛽4𝑥4+ 𝛽5𝑥5 + … + 𝛽0 Where 𝛽1, 𝛽2, 𝛽3, 𝛽4, 𝛽5, … are the slopes coefficient or the coefficients, 𝛽0 is the y-intercept and �̂� is the predicted y value. For Multiple Linear Regression you will have more than 1 slope, but you will still only have 1 y-intercept. The number of slopes will depend on the number of x- variables you have. In the example equation above, I only wrote out 5 slopes but know you can have any amount as along as it is 2 and above. Because if there is only 1-slope then this is a SLR NOT a MLR. Example: Let’s move away some our car price data and look at home prices. Is the MLR model a good predictor of home prices? If so, what variables help predict the price of a home? Data: Price Area (Sq Ft) Floor Bedrooms Bathrooms

$ 69,000 600 1 2 1

$ 118,500 1000 2 2 2 $ 125,000 1100 1 3 2

$ 139,300 1300 2 3 2 $ 147,900 1700 2 3 2

$ 169,900 1800 1 3 2.5 $ 134,900 1300 1 4 2.5

$ 169,900 1700 2 4 3

$ 194,500 2000 2 5 3.5 $ 209,900 2100 3 5 4

Looking at our example, we want to use Area, Floors, Bedrooms and Bathrooms to try and predict home prices. This means the x-variables are Area, Floors, Bedrooms, and Bathrooms and the y-variable is Price. Because there are 4 x-variables, this means there are 4 slopes in the MLR model. Next, we will run a Regression using Excel. We will use the Data Analysis ToolPak to run the Regression. Go to Data - > Data Analysis When the new window pops ups, scroll to where it says “Regression”, highlight it and Click “OK”

Then it will say “Input” Input Y Range: Click in the box and highlight the y values or the price column. Input x Range: Click in the box and highlight the x values or the area, floors, bedrooms, and bathrooms columns. Check the box that say “Labels” this will tell you that the first row has labels in it. Output Options Make sure the second bubble is highlighted. “New Worksheet Ply” Make sure you check the box for Residuals and Standardized Results Then Click “OK”

(Remember, the x-values predicts the y-value. Area, Floors, Bedrooms and Bathrooms will predict what the Price of a Home. This is very important to understand and remember) It should look like this:

Once you click OK, here is the Regression Output:

Looking at the output we see the estimated regression line is: 𝑃𝑟𝑖𝑐�̂� = .053181(𝐴𝑟𝑒𝑎) − .111766(𝐹𝑙𝑜𝑜𝑟) − 5.382699(𝐵𝑒𝑑) + 24.79927(𝐵𝑎𝑡ℎ) + 27.9676 To interpret the Bathroom Slope we state: For holding all other slopes constant, if the number of Bathrooms in a house increases by 1, then the price of the home will increase by $24,799. This makes sense because bathrooms are something everyone uses, and a good selling point for the home that we will discuss below some more. The y-intercept is $27,968. This means if you have 0 Sq. Ft., 0 Floors, 0 Bedrooms and 0 Bathrooms then the price of a home will be $27,968. This doesn’t make sense in the context of our problem because if you don’t have anything in a home, then you will have a plot of land. And plots of land sell differently than homes that are already done, and you would use different x-variables as predictors for plots of land.

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.994699234

R Square 0.989426567

Adjusted R Square 0.98096782

Standard Error 5.602053083

Observations 10

ANOVA

df SS MS F Significance F

Regression 4 14683.58101 3670.895252 116.9708249 3.99315E-05

Residual 5 156.9149937 31.38299874

Total 9 14840.496

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

Intercept 27.96762856 7.053708269 3.964953964 0.010689849 9.835494214 46.09976291

Area (Sq Ft) 0.053181736 0.008231449 6.460798716 0.001322498 0.032022122 0.074341349

Floor -0.11176635 3.801943471 -0.029397162 0.977685138 -9.884973176 9.661440476

Bedrooms -5.382699832 4.706373686 -1.143704302 0.304530968 -17.48081854 6.715418878

Bathrooms 24.79927334 7.643770423 3.24437705 0.022837965 5.150335932 44.44821074

Is this model a good predictor for Price of a Home? Looking at the overall Significance F value, we get 3.99315E-05. The “E” means scientific notation in Excel. This can be rewritten as 3.99315 x 10-5, moving the decimal place 5 places to the LEFT the overall p-value is: .0000399315. .0000399215 < .05, Yes, this model is a good predictor for Price of a Home. Now that we know the overall model is significant, which variables, or slopes, are a good predictor for Home Price? We want to look at the individual p-values for each slope. - Area: .001322 < .05, Yes this is significant and a good predictor for Home Price.

- Floor: .97769 > .05, No, this is not significant and not a good predictor for Home Price.

- Beds: .30453 > .05, No, this is not significant and not a good predictor for Home Price.

- Baths: .02284 < .05, Yes this is significant and a good predictor for Home Price.

We see that Area and Bathrooms are significant predictors for Home Price because those p-values are less than .05.

From the Regression output we notice that the Standard Error is 5.602 and the Adjusted R-squared is 98.1%.

The last thing we want to look at are the standardize residuals to see if there are any outliers.

RESIDUAL OUTPUT

Observation Predicted Price Residuals Standard Residuals Price

1 73.79877725 -4.79877725 -1.149263527 69,000$

2 119.7589785 -1.25897848 -0.301513901 118,500$

3 119.8062186 5.193781442 1.24386344 125,000$

4 130.3307993 8.969200671 2.148042024 139,300$

5 151.6034936 -3.703493572 -0.886953043 147,900$

6 169.4330702 0.466929849 0.111825454 169,900$

7 137.4595025 -2.559502515 -0.612977585 134,900$

8 171.0200671 -1.120067077 -0.268245883 169,900$

9 193.9915246 0.508475405 0.121775237 194,500$

10 211.5975685 -1.697568474 -0.406552217 209,900$

Any Standardize Residuals outside of the range -2 to 2 can be considered an outlier. Anything outside of -3 to 3, is also an outlier and you might want to do additional research to investigate this data point. When the Home Price is $139,300 the Predicted Price was $130,331. The Standardize Residual is 2.14. This data point can be considered an outlier. This means that the model underpredicted this data point for the house. If you are buying this home, then this is good news for you, because you can make a lower offer. If you are selling this home, that might not be such good news for you because you want to get as much money for your home as possible. When the Home Price is $209,900 the Predicted Price was $211,596. The Standardize Residual is -.406. This data point is not an outlier. But this means that the model overpredicted this data point for the house. If you are selling this home, this might be good news for you because then you can list the home for higher than you originally thought. But you don’t want to list it too high because you do want offers. But it is something to consider. Make sure you look at the context of the problem to see if the outliers play into your favor or not. But regardless, you will want to look at a range from -2 to 2.