Statistic 7

edwin.villa
Week7Regression.pdf

Correlation is the direction and strength of the association between two variables. It is often expressed as a single number called the correlation coefficient, which is denoted by the variable r.

• r can only be between -1 and 1: -1 ≤ r ≤ 1.
• If r = 0, then there is no linear relationship at all.
• If r = -1, then there is a perfect linear relationship that slopes down.
• If r = 1, then there is a perfect linear relationship that slopes up.
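For reference, the usual formula behind the correlation coefficient is shown below. The handout itself only uses Excel, so the notation here is an assumption: x_i and y_i are the paired observations and x̄, ȳ are their means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator sets the sign of r (the direction), and the denominator scales the result so that it always falls between -1 and 1.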

The Coefficient of Determination is the percentage of the variation in the dependent variable (y) that is accounted for by the model. This is denoted by R². Note: in simple linear regression R² = r², so if you have one you can find the other (the sign of r comes from the direction of the slope).

Simple Linear Regression is a data analysis technique that tries to find a linear pattern in the data. In linear regression, we use all the data to calculate a straight line, which may then be used to predict values. We will also discuss whether the regression is significant and whether the independent variable (x) is a significant predictor of the dependent variable (y).

The equation of the line for a Simple Linear Regression (SLR) is:

\[ \hat{y} = \beta_1 x + \beta_0 \]

where \(\beta_1\) is the slope coefficient, \(\beta_0\) is the y-intercept, and \(\hat{y}\) is the predicted y value.

Let's review our car price example. From the car price data, we also found out what year these cars were manufactured in.

Observation    Car Price    Year
1              $20,000      2015
2              $25,000      2016
3              $30,000      2018
4              $31,000      2018
5              $22,500      2016
6              $25,000      2016
7              $29,500      2018
8              $24,000      2015
9              $24,500      2017
10             $25,000      2017

Having this information, we first want to see if there is a correlation between Year and Car Price. Usually, the older the car, the cheaper it is: as the age goes up, the price goes down. In other words, the Price of the car depends on the Year it was manufactured. This describes a negative correlation, but I want to check whether my assumption is correct and what the actual correlation value is.

Before we can do any calculations on the data, we need to convert the Year to an age in years. Keeping the calendar Year (values such as 2015 or 2018) would make the results awkward to interpret later; for example, the y-intercept would describe a car built in Year 0. Taking 2019 as the current year, a car made in 2018 is 1 year old: 2019 - 2018 = 1. I am rounding all ages to full years for ease of the example. Converting all the Years gives:

Observation    Car Price    Year    Years Old
1              $20,000      2015    4
2              $25,000      2016    3
3              $30,000      2018    1
4              $31,000      2018    1
5              $22,500      2016    3
6              $25,000      2016    3
7              $29,500      2018    1
8              $24,000      2015    4
9              $24,500      2017    2
10             $25,000      2017    2
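The Year-to-age conversion above can also be sketched in a few lines of Python. This is only an illustration: the handout does the step by hand, and the names REFERENCE_YEAR, model_years and years_old are assumptions introduced here.

```python
# Reference year the handout subtracts from (2019 - Year = Years Old).
REFERENCE_YEAR = 2019

# Model years for Observations 1 through 10, in table order.
model_years = [2015, 2016, 2018, 2018, 2016, 2016, 2018, 2015, 2017, 2017]

# Age in whole years: reference year minus model year.
years_old = [REFERENCE_YEAR - year for year in model_years]

print(years_old)  # [4, 3, 1, 1, 3, 3, 1, 4, 2, 2]
```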

Now that we have our data, we can start analyzing it. To find the correlation we will use the =CORREL( ) function in Excel. In an empty cell, type =CORREL( including the left parenthesis, highlight the first column, type a comma, highlight the second column, close the parenthesis, and hit Enter. Note: it does not matter which column you highlight first.

Here we see that the Correlation = -0.8846. This is in fact a negative correlation and agrees with our assumption: as the Age of the Car goes up, the Price of the Car goes down. Now that we have the Correlation, we can find R² by squaring it: (-0.8846) × (-0.8846) = 0.7825, or 78.25%.
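As a cross-check outside Excel, the correlation and R² can be reproduced with a short Python sketch. It assumes the ten observations from the table above. The helper name pearson_r is hypothetical, and the code uses the textbook Pearson formula rather than anything specific to Excel's CORREL.

```python
from math import sqrt

years_old = [4, 3, 1, 1, 3, 3, 1, 4, 2, 2]
prices = [20000, 25000, 30000, 31000, 22500, 25000, 29500, 24000, 24500, 25000]


def pearson_r(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


r = pearson_r(years_old, prices)
r_squared = r ** 2

print(f"r   = {r:.4f}")          # r   = -0.8846
print(f"R^2 = {r_squared:.4f}")  # R^2 = 0.7825
```

Swapping the two arguments gives the same value, which matches the note that the column order does not matter.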

Next, we will run a Regression using Excel. We will use the Data Analysis ToolPak to run the Regression. Go to Data -> Data Analysis. When the new window pops up, scroll to where it says “Regression”, highlight it, and click “OK”.

Once you click “OK”, a new window pops up with the following sections:

• Input: For Input Y Range, click in the box and highlight the y values. For Input X Range, click in the box and highlight the x values. Check the box that says “Labels”; this tells Excel that the first row has labels in it.
• Output Options: Make sure the second bubble, “New Worksheet Ply”, is selected.
• Residuals: Make sure you check the boxes for Residuals and Standardized Residuals.

Then click “OK”.

(Remember, the x-value predicts the y-value. The age of the car will predict the Price of the Car. This tells us that Years Old is the x-value and Price is the y-value. This is very important to understand and remember.)

Once you click OK, here is the Regression Output:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.884606501
R Square             0.782528661
Adjusted R Square    0.755344744
Standard Error       1725.490814
Observations         10

ANOVA
              df    SS             MS             F           Significance F
Regression    1     85706451.61    85706451.61    28.78646    0.000673381
Residual      8     23818548.39    2977318.548
Total         9     109525000

              Coefficients    Standard Error    t Stat          P-value     Lower 95%      Upper 95%       Lower 95.0%    Upper 95.0%
Intercept     31959.67742     1296.435244       24.65196589     7.83E-09    28970.09239    34949.26245     28970.09239    34949.26245
Years Old     -2629.032258    490.0064638       -5.365301179    0.000673    -3758.98919    -1499.075326    -3758.98919    -1499.075326

Looking at the output, we see that Multiple R is the correlation. We know the Correlation is negative, but the regression output reports it as a positive value (its absolute value). Make sure you look at the sign of the slope coefficient to confirm the direction. We also see that R-squared is 78.25%, which is what we calculated before. R-squared tells us that 78.25% of the variation in Price can be accounted for by this model, that is, by Years Old. The best possible R-squared value is 100%. Our value is less than that, but it is still high enough to give us a good indication of what the data will look like, and it tells us that we want to interpret the model further.
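The main numbers in the regression output can be reproduced with a minimal Python sketch, assuming the ten observations from the table. It uses the closed-form least-squares formulas for a single predictor, which is not necessarily how Excel computes them internally, and all variable names are illustrative.

```python
years_old = [4, 3, 1, 1, 3, 3, 1, 4, 2, 2]
prices = [20000, 25000, 30000, 31000, 22500, 25000, 29500, 24000, 24500, 25000]

n = len(years_old)
mean_x = sum(years_old) / n
mean_y = sum(prices) / n

# Sums of squares and cross-products around the means.
sxx = sum((x - mean_x) ** 2 for x in years_old)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years_old, prices))

# Least-squares slope and intercept for one predictor.
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Variation split: total = regression + residual.
predicted = [intercept + slope * x for x in years_old]
ss_total = sum((y - mean_y) ** 2 for y in prices)
ss_residual = sum((y - p) ** 2 for y, p in zip(prices, predicted))
ss_regression = ss_total - ss_residual

r_squared = ss_regression / ss_total
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)
standard_error = (ss_residual / (n - 2)) ** 0.5

print(f"slope          = {slope:.2f}")
print(f"intercept      = {intercept:.2f}")
print(f"R squared      = {r_squared:.4f}")
print(f"adjusted R sq. = {adjusted_r_squared:.4f}")
print(f"standard error = {standard_error:.2f}")
```

The slope comes out negative, which is the check on direction that Multiple R alone cannot give.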

Next, we want to see if this model is significant and if Years Old is a significant predictor of Price. We will look at the Significance F value, which is the p-value for the model. Recall: if the p-value is less than alpha (here alpha = .05), then yes, the model is significant.

The p-value associated with this model is .00067338.

.00067338 < .05, so the p-value is in fact less than alpha. We can state that yes, Years Old is in fact a significant predictor of Price. This can be valuable information to know if we are going out to buy a new car. Now that we know the model is significant, let’s write out the Regression Equation and interpret the values.
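The significance check can be sketched in Python using only numbers already printed in the regression output. The p-value itself is taken from Excel's Significance F rather than recomputed, and the variable names are assumptions made for this example.

```python
ALPHA = 0.05  # significance level used in the handout

# Values copied from the ANOVA and coefficient tables.
ss_regression, df_regression = 85706451.61, 1
ss_residual, df_residual = 23818548.39, 8
significance_f = 0.000673381      # p-value reported by Excel
t_stat_years_old = -5.365301179   # t Stat for the Years Old coefficient

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the two mean squares.
f_statistic = ms_regression / ms_residual

# Decision rule: significant when the p-value is below alpha.
model_is_significant = significance_f < ALPHA

print(f"F   = {f_statistic:.3f}")            # F   = 28.786
print(f"t^2 = {t_stat_years_old ** 2:.3f}")  # t^2 = 28.786
print("significant:", model_is_significant)  # significant: True
```

With a single x-variable, F equals the square of the slope's t Stat, which is why Significance F and the P-value for Years Old agree in the output.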

In the Regression Output, if we look under Coefficients, this is where we will find the values to write out the Regression Equation. They are in the Coefficients column below.

              Coefficients    Standard Error    t Stat          P-value     Lower 95%      Upper 95%       Lower 95.0%    Upper 95.0%
Intercept     31959.67742     1296.435244       24.65196589     7.83E-09    28970.09239    34949.26245     28970.09239    34949.26245
Years Old     -2629.032258    490.0064638       -5.365301179    0.000673    -3758.98919    -1499.075326    -3758.98919    -1499.075326

Next to those values we see the word “Intercept”, which corresponds to the y-intercept value, and the words “Years Old”, which correspond to the slope coefficient value. Using the equation \(\hat{y} = \beta_1 x + \beta_0\), we will write out the regression equation and replace “x” and “y” with the actual variable names.

\[ \widehat{\text{Price}} = -2{,}629.03\,(\text{Years Old}) + 31{,}959.68 \]

We see that the y-intercept is $31,959.68. This means when Years Old equals 0, the Price of a Car should be $31,959.68.

\[ \widehat{\text{Price}} = -2{,}629.03\,(0) + 31{,}959.68 = 31{,}959.68 \]

This makes sense because Year 0 is 2019. So, if you bought this type of car in the Year 2019, you would expect to pay $31,959.68.

Please note: the y-intercept makes sense in this case, but it does not always have a practical meaning, and that is OK. For example, if you wanted to use the Weight of a Car to predict the Price, the Weight of a Car will NEVER be 0 pounds, so the y-intercept would not have a practical meaning in that problem.

Next, we want to interpret the Slope. As Years Old increases by 1 year, the predicted Price of the Car goes down by $2,629.03. In other words, as the car gets older, the price keeps decreasing by $2,629.03 every year.

Lastly, I want to use my Regression Equation to predict prices. What would I expect to pay for a car that was manufactured in 2014? Remember 2019 - 2014 = 5. This means the car is 5 Years Old. This is the value you want to substitute into the Regression Equation. Do NOT put a calendar year such as 2014 or 2019 into the equation.
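The substitution can be sketched as a small Python function. The coefficients are copied from the regression output; the function name predict_price and the constant names are hypothetical, introduced only for this example.

```python
SLOPE = -2629.032258     # "Years Old" coefficient from the regression output
INTERCEPT = 31959.67742  # "Intercept" coefficient from the regression output
REFERENCE_YEAR = 2019    # year the handout measures age from


def predict_price(model_year):
    """Predicted price for a car built in model_year, using age as x."""
    years_old = REFERENCE_YEAR - model_year  # convert the year to an age first
    return SLOPE * years_old + INTERCEPT


print(f"${predict_price(2014):,.2f}")  # $18,814.52
print(f"${predict_price(2019):,.2f}")  # $31,959.68
```

The function takes the model year and converts it to an age internally, so the calendar year never reaches the equation. Note that the data only cover cars that are 1 to 4 years old, so a 5-year-old car is a mild extrapolation beyond the observed range.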

Substituting Years Old = 5, and carrying the unrounded coefficients from the output through the arithmetic:

\[
\begin{aligned}
\widehat{\text{Price}} &= -2{,}629.03\,(\text{Years Old}) + 31{,}959.68 \\
&= -2{,}629.03\,(5) + 31{,}959.68 \\
&= -13{,}145.16 + 31{,}959.68 \\
&= 18{,}814.52
\end{aligned}
\]

For a car manufactured in 2014, which is 5 Years Old, we would expect to pay $18,814.52.

This is a good analysis for an SLR. But what if we wanted to analyze the data further? We could run a Multiple Linear Regression (MLR). Multiple Linear Regression is just what it sounds like: instead of having only 1 x-variable, we have multiple x-variables. In our example the x-variable was Years Old, and it did a good job of predicting Price. But what other variables could you use to predict the Price of a Car?

One thing that comes to mind is Total Miles. When you are looking to buy a car, you also want to look at Total Miles. Usually you want a car with fewer miles on it. The fewer miles on the car, the higher the price; the more miles on the car, the lower the price. This appears to be another negative correlation, or relationship.

Another variable that comes to mind is a 5-star safety rating. The safer the car, the more people are willing to pay for it. A car with 5 stars will be more expensive than a car with only 2 or 3 stars. This appears to be a positive correlation, or relationship.

You would want to run an MLR to justify and verify your claims, but these are just a few variables you could include to turn this SLR into an MLR.
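To give a feel for what an MLR fit involves, here is a minimal Python sketch with two x-variables. Everything in it is an assumption for illustration: the handout has no mileage data, so the rows below are synthetic and the prices are generated from a made-up equation (32,000 - 2,000 per year - 100 per thousand miles). The helper fit_mlr is hypothetical and solves the normal equations directly, which is one standard approach and not necessarily what Excel's ToolPak does.

```python
def fit_mlr(rows, y):
    """Least-squares coefficients [b0, b1, ..., bk] via the normal equations."""
    # Design matrix with a leading 1 in each row for the intercept.
    X = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(X[0])
    # Augmented system (X'X | X'y).
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# SYNTHETIC rows (Years Old, Total Miles in thousands); not from the handout.
rows = [(1, 10), (1, 15), (2, 20), (2, 30), (3, 25), (3, 45), (4, 40), (4, 60)]

# SYNTHETIC prices generated from a made-up equation, so the fit is checkable.
prices = [32000 - 2000 * years - 100 * miles for years, miles in rows]

intercept, b_years, b_miles = fit_mlr(rows, prices)

# Recovers approximately 32000, -2000 and -100, the made-up coefficients.
print(round(intercept, 2), round(b_years, 2), round(b_miles, 2))
```

With real data the coefficients would not match any known equation, and each one would be read the same way as the SLR slope: the expected change in Price for a one-unit change in that variable, holding the others fixed.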