Quiz 8 Notes
Scatterplots, Correlation and Regression
We are turning to our last quiz topic: regression. To get to regression, we need to understand several concepts first.
To start with, we will be working with two quantitative variables. The goal is to see if there is a relationship/association between the two variables. As one variable increases, what does the second variable do? If the second variable makes a consistent change then a relationship may exist. MAJOR POINT: saying a relationship exists does NOT mean there is Causation. The greatest abuse of statistical work is here, when a person runs a regression then says Variable A causes Variable B to change. You must have experimental results to establish causation.
Looking at the two variables that will be in a regression, you need to know that each variable plays a specific role. One of the variables, X, will be the independent/explanatory variable and the other, Y, will be the dependent/response variable. In a regression we are looking to see if changes in Y occur as X changes. It is very important that you establish at the beginning which of your variables will be X and which will be Y. Swapping the places of the two variables may not work. Let’s do an example.
In economics, we discuss the relationship between the quantity demanded and the price of a good. Which one would be the X in a regression, and which would be the Y? The Law of Demand says, “as the price of a good increases, the quantity demanded decreases”. Which is allowed to change on its own and which then follows? If you said price is allowed to change, you are correct. So, the price of the good is the X variable and we look to see how much the quantity demanded, Y, changes. It would not make sense to an economist if you switched the two variables.
When we go to visualize the two quantitative variables, we are building what is called a scatterplot. In a scatterplot we put the independent variable on the X axis (hence the name X variable) and the dependent variable on the Y axis (hence the name Y variable). There are 5 aspects of a scatterplot that we need to discuss. This discussion will start to build a common terminology and a common frame of reference.
5 Aspects of a Scatterplot Association
1. Direction – are the points on the scatterplot lining up in a specific direction?
The left scatterplot above is showing a negative direction (downward sloping to the right): as the X variable increases, the Y variable decreases. The right scatterplot above is showing a positive direction (upward sloping to the right): as the X variable increases, so does the Y variable. There is a third option, no relationship. It will look like what I call a shotgun pattern, as the scatterplot below illustrates.
2. Form – does the pattern in the scatterplot show a linear or non-linear relationship? The two scatterplots on the last page would be good candidates to show a linear relationship. A non-linear pattern should have a definite “U” shape or “N” shape or a sideways “S” pattern.
3. Strength – how close are the points on the scatterplot? If they are very close it is called a “strong” relationship. Somewhat close is “moderate”, and loosely together is “weak”.
Strong Moderate Weak
4. Outliers – is (are) there a point(s) that sticks out from the rest? Outliers can come in three forms:
a. “X” Outliers – The point on the right side of the scatterplot sticks out in the X direction but not the Y direction (up or down). X outliers can artificially increase the strength of the relationship.
b. “Y” Outliers – The point at the top of the scatterplot sticks out in the Y direction but not in the X direction (left or right). Y outliers can shift the intercept and maybe the slope.
c. X & Y outliers – these are outliers that stick out in both directions. They are known as “influential” outliers because they can completely alter the relationship.
As you can see, the point up and to the right is the outlier. The regression model we will work with will try to fit a straight line using all the points, so it might look like the red line (upward sloping), when in reality the relationship is more negative, like the blue line (downward sloping).
5. Trends – does the collection of points suggest a trend beyond the range of points or a trend within the range of points? Extending a trend beyond the last point in a range can be very dangerous because the world is not linear. Going a little beyond is okay, but going too far can lead to very incorrect predictions. A gap within the range, where for some reason an empty space exists, is usually not a problem as long as the strength is moderate or greater.
Yes Maybe No
Correlation
Correlation measures the direction and strength of a relationship between two quantitative variables. Scatterplots can be manipulated to change the perceived view of a relationship. The correlation coefficient value, “r”, removes that ability to manipulate the axes and change the view of the relationship.
Correlation changes all the points on the scatterplot into X and Y standardized coordinates. Mentally, imagine placing cross hairs on top of a scatterplot and then calculate the distance for each point from where the cross hairs intersect. If you use standard deviations to measure you are doing what the correlation does.
r = Σ(ZX · ZY) / (n − 1)
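Here is a minimal Python sketch of that formula, assuming sample standard deviations (dividing by n - 1). The function name `correlation` and the data are made up for illustration; they are not from the class survey.

```python
import math


def correlation(xs, ys):
    """Return r: the sum of the z-score products divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample standard deviations (assumption: divide by n - 1).
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    # Standardize each coordinate, multiply the pair, then add the products.
    z_products = [
        ((x - mean_x) / sd_x) * ((y - mean_y) / sd_y)
        for x, y in zip(xs, ys)
    ]
    return sum(z_products) / (n - 1)


# Hypothetical data, made up for illustration only.
hours_worked = [10, 20, 30, 40, 50]
hours_slept = [8, 7.5, 7, 7, 6]

r = correlation(hours_worked, hours_slept)
print(f"r = {r:.4f}")
```

Because both variables are converted to standard deviations first, changing the units (hours to minutes, for example) should not change the result.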
Correlation Properties:
1. “r” will always be between -1 and 1
2. A -1 means a perfect line sloping down to the right. 1 means a perfect line sloping up to the right.
3. The closer to -1 or 1 the stronger the relationship
4. “r” = 0 means no linear relationship
5. Correlations have no units
6. Outliers can change “r”, so run correlations with and without the outliers
7. Watch out for spurious relationships. A spurious relationship is a false relationship. If you just pick two quantitative variables that have nothing in common, you may get a high “r” value by chance
8. Correlation does not equal Causation
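To see how much a single influential (X & Y) outlier can move “r”, here is a small Python sketch. The points are hypothetical, chosen only to make the effect obvious, and the helper `correlation` uses a formula that is algebraically the same as the z-score version.

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r (algebraically the same as the z-score formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical points, made up for illustration: a clean downward line...
xs = [1, 2, 3, 4, 5]
ys = [5, 4, 3, 2, 1]
# ...plus one point far up and to the right (an X & Y outlier).
outlier = (15, 12)

r_without = correlation(xs, ys)
r_with = correlation(xs + [outlier[0]], ys + [outlier[1]])

print(f"without the outlier: r = {r_without:.3f}")
print(f"with the outlier:    r = {r_with:.3f}")
```

In this made-up case one point is enough to flip the sign of “r”, which is why the correlation should be run both ways.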
Ranges for “r”
1. Strong = .7 to 1
2. Moderate = .4 to .7
3. Weak = .1 to .4
4. No relationship = 0 to .1
Same is true in the negative direction
What is a “good” “r” value? It depends on the field of study, to some degree. Fields that can control for outside variables will want higher “r” and fields that cannot control as much will accept lower “r” values.
Examples:
(Rough idea of the differences)
Chemistry wants over .5
Biology wants over .3
Social Sciences over .2
Examples of “r”
1. “r” = .623, you would say something like: It is a moderate, positive relationship
2. “r” = -.427, you would say a weakly moderate, negative relationship.
You can add as many adjectives as you need to better describe what the value is representing. Yes, people express slightly different views of the same value.
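The ranges above can be turned into a small lookup. This is only a sketch: the function name `describe_r` is made up, and since the ranges share their endpoints (.1, .4, .7), I am assuming a value sitting exactly on a cutoff goes into the stronger category.

```python
def describe_r(r):
    """Put a correlation into words using the ranges from these notes."""
    size = abs(r)
    # Assumption: a value exactly on a cutoff (.1, .4, .7)
    # goes into the stronger category.
    if size >= 0.7:
        strength = "strong"
    elif size >= 0.4:
        strength = "moderate"
    elif size >= 0.1:
        strength = "weak"
    else:
        return "no relationship"
    direction = "positive" if r > 0 else "negative"
    return f"{strength}, {direction}"


for value in (0.623, -0.427, 0.05):
    print(value, "->", describe_r(value))
```

A lookup like this only gives the base label. Extra adjectives such as “weakly moderate” are still a judgment call.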
Regression
Correlation tells:
1. That there is an association/relationship
2. It tells what the direction and strength of the association/relationship is
Correlation does not tell what that association/relationship is.
Regression tries to answer that last part: what is the association/relationship between the two variables?
We will be using a simple, linear regression. Simple means only one X variable. Linear means we will be fitting a straight line through the data on the scatterplot. The line will be the “line of best fit”.
“Line of best fit” means that the computer will place a line through the scatterplot points that has the least amount of error (the distance of the data points from the line). This is also called the “least squares” line. The computer measures the distance of each point from the line, then squares the distances to remove the negatives. The line that has the smallest sum of squared errors is the line that best fits the data.
Best fit can be misleading. If your data is weak and shotgun patterned, your error could be very high, and the line may be a bad representation of the data, but the computer will give you the line that is the best possibility.
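To make “smallest sum of squared errors” concrete, here is a minimal Python sketch using the standard closed-form least-squares formulas. The function names and the data are hypothetical, and StatCrunch does this work for you.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def sum_of_squared_errors(xs, ys, intercept, slope):
    """Add up the squared vertical distances from each point to the line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, made up for illustration only.
hours_worked = [10, 20, 30, 40, 50]
hours_slept = [8, 7.5, 7, 7, 6]

intercept, slope = least_squares_line(hours_worked, hours_slept)
best_sse = sum_of_squared_errors(hours_worked, hours_slept, intercept, slope)

# Nudge the line a little: the error total should only get worse.
nudged_slope_sse = sum_of_squared_errors(
    hours_worked, hours_slept, intercept, slope + 0.01
)
nudged_intercept_sse = sum_of_squared_errors(
    hours_worked, hours_slept, intercept + 0.1, slope
)

print(f"line: y = {intercept:.3f} + ({slope:.3f})x, error total = {best_sse:.3f}")
print(f"nudged slope: {nudged_slope_sse:.3f}, nudged intercept: {nudged_intercept_sse:.3f}")
```

Note that “best” only means smaller than every other line. The error total itself can still be large when the data is weak.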
The Model:
Since we are doing a linear regression, we will be using a linear equation:
You know this equation as Y = mX + b.
We will adjust that equation for the regression:
Ŷ = β0 + β1X (β = Beta)
where Ŷ is the predicted “Y”, β0 is the Y intercept, and β1 is the slope of the line. Using this equation we are idealizing the data to get a “predicted Y” for every X. The “predicted Y” could be close to the actual Y or Y’s, or quite a way apart.
When a regression is run, the computer will generate the equation and the “r” and “r2”. “r2” is the correlation coefficient squared but it explains something very different. “r2” explains how much of the change in your Y variable occurs as your X variable changes. The stronger the relationship is, the higher your “r2” should be. Your regression is capturing more of the change in Y.
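A short Python sketch of the two ways to arrive at “r2”: as the share of Y’s variation captured by the line, and as the correlation coefficient squared. The function name and the data are hypothetical.

```python
import math


def r_and_r_squared(xs, ys):
    """Return (r, r2), where r2 is computed from the leftover error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    ss_total = sum((y - mean_y) ** 2 for y in ys)  # all the variation in Y
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * x for x in xs]
    # Variation in Y that the line does NOT capture.
    ss_error = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    r = sxy / math.sqrt(sxx * ss_total)
    return r, 1 - ss_error / ss_total


# Hypothetical data, made up for illustration only.
hours_worked = [10, 20, 30, 40, 50]
hours_slept = [8, 7.5, 7, 7, 6]

r, r2 = r_and_r_squared(hours_worked, hours_slept)
print(f"r = {r:.4f}, r squared = {r ** 2:.4f}, r2 from the errors = {r2:.4f}")
```

The two numbers should agree for a simple linear regression, which is why the output can report both from one run.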
Well, those are the basics of scatterplots, correlation and regression. Now on to the fun stuff: let’s run a regression. I will be following the steps from your computer packet. In the packet it says to split the data by gender; this was assuming we were still doing the project. We are not, so you just have to pick two quantitative variables. Please use your variables from the survey.
You do have to determine beforehand which variable will be the independent/explanatory variable and which one will be the dependent/response variable.
The steps for the regression are different than we have been using as you can see from Quiz 8. I am using variables from the T/R class: Hours Worked and Hours Slept.
Everything in red is either explanation or comment. Also, remember that copy/paste helps when redoing a test
Step 1: State your Problem
We will run a regression to see if there is a relationship between the variables, “Average Hours a Week Worked” and “Average hours of Sleep”. We will make “Worked” our X variable and “Sleep” our Y variable. (I am saying that the number of hours worked impacts the number of hours slept. I expect a negative relationship but who knows?)
Step 2: Input and Visualize Your Data (Produce a Scatterplot)
I just realized I did not put the steps to get a scatterplot in the packet. In StatCrunch click on “Graph” then click on “Scatterplot”. New box: Choose your X variable and Y variable then click “compute”. Copy/paste the scatterplot like the one above.
Step 3: Check the Association and Comment on Each Aspect
1. Direction: Negative (You could argue no relationship)
2. Form: Not Non-linear (Key point here; if the data is truly non-linear you cannot run the test. I have had only one student with truly non-linear data. Using “Not Non-linear” is not saying it is linear, only that a linear line could work)
3. Strength: Weak (You could say very weak)
4. Outliers: Yes at (25, 9) and (65, 5) (These are the points at the top of the scatterplot and at the lower righthand corner) (Once you have stated at least one outlier any additional outliers are totally your choice. If you want to argue there are no outliers that is okay, but you must explain why) (Pet peeve: make sure that you have the X value first. I do not want to spend time figuring out what points you are talking about)
5. Trends: No (This aspect should only have a yes if the strength is at least moderate)
Step 4: Check the Conditions
1. Random – random enough, students signed up independently
2. Linear – Not Non-linear (Violate this condition you cannot run the test)
3. Equal Variance Condition: No pattern in the Residual plot (When you run the regression I have you also produce a scatterplot of the residuals (error). This checks to see if there is a pattern in the leftover error once the part explained by the regression is removed. If a pattern exists, that implies another variable may be impacting your Y variable. In this particular case there is no pattern. Patterns that need to be checked for: a funnel shape, a fan tail shape or a non-linear shape. Below is an example of a funnel shape.)
[Residual plot showing a funnel shape]
The dotted line will NOT appear in your residual plot. I put it there to help illustrate the funnel pattern. The human mind is trained to look for patterns, but here finding patterns is not the objective. When you look you must take into account ALL the points, not just the ones that may support a pattern. I have had many students think they see a pattern until I point out a couple of points that are outside what they think is a pattern.
4. Nearly Normal Condition: (A regression test assumes that the distribution of your “Y’s”, if you were to have multiple Y’s for each X value, is Normal. You check this condition by looking at the histogram I have you produce when you do the regression.)
This histogram is not normal and means there is a violation. (You cannot change this histogram, just live with it)
5. Outlier Condition: Yes, outliers exist at (25, 9) and (65, 5) (This means I will be running the regression with the outliers and without the outliers, like we did for the 2-mean testing.)
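For the Equal Variance Condition, it may help to see what a residual actually is: the actual Y minus the predicted Y. Below is a minimal Python sketch with hypothetical data and a made-up function name. StatCrunch produces the real residual plot for you.

```python
def residuals(xs, ys):
    """Return actual Y minus predicted Y for each point."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


# Hypothetical data, made up for illustration only.
hours_worked = [10, 20, 30, 40, 50]
hours_slept = [8, 7.5, 7, 7, 6]

leftover = residuals(hours_worked, hours_slept)

# These (X, residual) pairs are what the Residuals vs. X-values plot shows.
for x, e in zip(hours_worked, leftover):
    print(f"X = {x}: residual = {e:+.2f}")
```

The residuals from a least-squares line add up to zero, so what you are checking in the plot is their spread across the X values, not their average.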
Step 5: Run the Regression
Follow the directions in the computer packet. Click on “Stat”, then click on “Regression”, etc.
(Be sure to highlight “Histogram of Residuals” and “Residuals vs. X-values”. The histogram is for the Nearly Normal Condition and the residuals plot is the scatterplot for the Equal Variance Condition) (You should get outputs that look like those below) (Also, I always choose to make the sign for the Intercept Ha positive because most intercepts are, and I made the Ha for the Slope negative because that is what I stated in the Aspect Step for Direction)
Simple linear regression results:
Dependent Variable: var3
Independent Variable: var9
var3 = 6.9254592 - 0.011584884 var9 (= regression equation)
Sample size: 22
R (correlation coefficient) = -0.16849666 (= “r”)
R-sq = 0.028391125 (= “r2”)
Estimate of error standard deviation: 1.0476051
Parameter estimates:
| Parameter | Estimate | Std. Err. | Alternative | DF | T-Stat | P-value |
| Intercept | 6.9254592 | 0.43919667 | ≠ 0 | 20 | 15.768469 | <0.0001 |
| Slope | -0.011584884 | 0.015154135 | ≠ 0 | 20 | -0.76447021 | 0.4535 |
Analysis of variance table for regression model:
| Source | DF | SS | MS | F-stat | P-value |
| Model | 1 | 0.64138133 | 0.64138133 | 0.5844147 | 0.4535 |
| Error | 20 | 21.949528 | 1.0974764 | | |
| Total | 21 | 22.590909 | | | |
(The part above that starts with, “Analysis of Variance….” is not needed but you can leave if you copy it in.) (I added the stuff in red and changed the color for the word, slope and its value)
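The numbers in the output are tied together by simple arithmetic. The Python sketch below recomputes a few of them from the reported values. The variable names are mine, and the relationships used are the standard ones for a simple linear regression.

```python
import math

# Values copied from the StatCrunch output above.
slope_estimate = -0.011584884
slope_std_err = 0.015154135
ss_model = 0.64138133
ss_error = 21.949528
ss_total = 22.590909
df_error = 20

t_stat = slope_estimate / slope_std_err  # reported T-Stat: -0.76447021
r_squared = ss_model / ss_total          # reported R-sq: 0.028391125
ms_error = ss_error / df_error           # reported MS for Error: 1.0974764
f_stat = ss_model / ms_error             # reported F-stat: 0.5844147 (Model DF is 1)
error_sd = math.sqrt(ms_error)           # reported error standard deviation: 1.0476051

print(f"T-Stat = {t_stat:.5f}")
print(f"R-sq = {r_squared:.6f}")
print(f"F-stat = {f_stat:.5f} (T-Stat squared = {t_stat ** 2:.5f})")
print(f"error standard deviation = {error_sd:.5f}")
```

With only one X variable, the F-stat is the slope’s T-Stat squared, which is why the analysis of variance table repeats the slope’s two-sided p-value.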
The scatterplot above has the regression line added. It does show a slightly negative relationship.
The Residuals vs. X-values scatterplot can be left here or moved back to the condition section under the Equal Variance Condition. It shows no pattern. I added the arrows to illustrate that there is no funnel, fan tail or non-linear pattern.
I put the histogram here also because it comes when you do the regression. You can leave it here or put it back in the Condition section like I did. Repeating what I said earlier, this histogram is not normal; the red curve shows what normal would look like. This is a violation that needs to be discussed in the conclusion.
Step 6: State Your Conclusions
We ran the regression and found an “r” = -.1685 and an “r2” = .0284. The “r” = -.1685 means we have a weak negative relationship. The “r2” = .0284 means that 2.84% of the change in the average number of hours slept occurs as the average number of hours worked changes (Very little change is explained). The regression equation states that for every 1 additional hour worked, sleep decreases by .012 hours. The p-value for the slope of the regression is .4535, which means there is insufficient evidence to say there is a relationship between the variables “Hours” and “Sleep”. (This p-value would be the p-value if we ran a hypothesis test on the slope of the regression. We will discuss this in more detail further on)
We had a violation in the Nearly Normal Condition, which means that the data may not be as accurate as needed for the test and this may be impacting the results. We also had 2 outliers, so we will rerun the test without the outliers. (If there had been a violation of the Equal Variance Condition, the language would be as follows: There was a violation in the Equal Variance Condition, a pattern in the residuals, which means another variable may be impacting the Y variable.)
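To see how the regression equation turns an X into a “predicted Y”, here is a short Python sketch using the intercept and slope from the first run. The function name and the 40-hour input are hypothetical. With a slope p-value of .4535, treat any prediction from this line as an illustration only.

```python
def predicted_sleep(hours_worked):
    """Predicted average hours of sleep from the first-run regression equation."""
    intercept = 6.9254592   # Intercept estimate from the output
    slope = -0.011584884    # Slope estimate from the output
    return intercept + slope * hours_worked


# Hypothetical input: a student who works 40 hours a week.
at_40 = predicted_sleep(40)
at_41 = predicted_sleep(41)

print(f"predicted sleep at 40 hours worked: {at_40:.3f}")
print(f"change for 1 more hour worked: {at_41 - at_40:.3f}")
```

The change for one additional hour is just the slope, which is where the “.012 hours” in the conclusion comes from.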
Without the Outliers
Step 2: Redo the Scatterplot
Step 3: Check the Association and Comment on Each Aspect
1. Direction: No direction (You could argue negative or positive)
2. Form: Not Non-linear
3. Strength: Weak to none
4. Outliers: no, outliers have been removed
5. Trends: No
Step 4: Check the Conditions
1. Random – random enough, students signed up independently
2. Linear – Not Non-linear
3. Equal Variance Condition: No pattern in the Residual plot shown below
4. Nearly Normal Condition: Based on the histogram below, there is still a violation. The data is not as normally distributed as needed
5. Outliers: No, outliers have been removed
Step 5: Run the Regression (On the first try I left Ha for the Slope as negative. That is wrong; the relationship is slightly positive, so I need to run it again. I would not keep this first run, but I wanted to show you that things do change)
Simple linear regression results:
Dependent Variable: var3
Independent Variable: var9
var3 = 6.5362284 + 0.0027787202 var9
Sample size: 20
R (correlation coefficient) = 0.040879315
R-sq = 0.0016711184
Estimate of error standard deviation: 0.89057107
Parameter estimates:
| Parameter | Estimate | Std. Err. | Alternative | DF | T-Stat | P-value |
| Intercept | 6.5362284 | 0.41788691 | > 0 | 18 | 15.641142 | <0.0001 |
| Slope | 0.0027787202 | 0.016008173 | < 0 | 18 | 0.17358134 | 0.5679 |
Analysis of variance table for regression model:
| Source | DF | SS | MS | F-stat | P-value |
| Model | 1 | 0.023896993 | 0.023896993 | 0.030130483 | 0.8641 |
| Error | 18 | 14.276103 | 0.79311683 | | |
| Total | 19 | 14.3 | | | |
Second Run using a positive sign for Ha in Slope
Simple linear regression results:
Dependent Variable: var3
Independent Variable: var9
var3 = 6.5362284 + 0.0027787202 var9
Sample size: 20
R (correlation coefficient) = 0.040879315
R-sq = 0.0016711184
Estimate of error standard deviation: 0.89057107
Parameter estimates:
| Parameter | Estimate | Std. Err. | Alternative | DF | T-Stat | P-value |
| Intercept | 6.5362284 | 0.41788691 | > 0 | 18 | 15.641142 | <0.0001 |
| Slope | 0.0027787202 | 0.016008173 | > 0 | 18 | 0.17358134 | 0.4321 |
Analysis of variance table for regression model:
| Source | DF | SS | MS | F-stat | P-value |
| Model | 1 | 0.023896993 | 0.023896993 | 0.030130483 | 0.8641 |
| Error | 18 | 14.276103 | 0.79311683 | | |
| Total | 19 | 14.3 | | | |
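The only thing that changed between the two runs is the sign chosen for Ha on the slope, so the three p-values reported for the slope are linked. The Python sketch below checks that link using the reported values. The variable names are mine, and the relationships are the standard ones for a t-test.

```python
# P-values copied from the two runs above.
p_slope_less = 0.5679     # first run, Ha: slope < 0
p_slope_greater = 0.4321  # second run, Ha: slope > 0
p_anova = 0.8641          # analysis of variance table (same in both runs)

# The two one-sided p-values split the whole t-distribution between them.
one_sided_total = p_slope_less + p_slope_greater

# The two-sided p-value is double the smaller one-sided p-value.
two_sided = 2 * min(p_slope_less, p_slope_greater)

print(f"one-sided p-values add up to {one_sided_total:.4f}")
print(f"two-sided p-value = {two_sided:.4f} (table shows {p_anova})")
```

The tiny difference between the doubled value and the table value comes from rounding the p-values to four decimals.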
Step 6: State Your Conclusions
We ran the regression and found an “r” = .0409 and an “r2” = .0017. The “r” = .0409 means we probably have no relationship. The “r2” = .0017 means that 0.17% of the change in the average number of hours slept occurs as the average number of hours worked changes (Very little change is explained). The regression equation states that for every 1 additional hour worked, sleep increases by .003 hours. The p-value for the slope of the regression is .4321, which means there is insufficient evidence to say there is a relationship between the variables “Hours” and “Sleep”.
We still had a violation in the Nearly Normal Condition, which means that the data may not be as accurate as needed for the test and this may be impacting the results.
After removing the outliers there were some changes:
“r” went from -.1685 to .0409, a change from negative to positive but also a decrease to the point where there is probably no relationship
“r2” went from .0284 to .0017, from capturing a small percentage of “Sleep’s” change to almost none
Slope for the Regression – went from negative (-.012) to a positive (.003)
P-value – went from .4535 to .4321, very little change
Overall removing the outliers makes it appear the relationship went from a negative to a positive relationship. In reality, the relationship was weak to start with and became weaker after removing the outliers. With the p-values both above 40% there is little evidence to say there was a relationship in either run.
(Do not be surprised if you get similar results, especially if you use a small sample size, one class size of data. If you get confusing results, contact me and we can work through it.)
It is a lot of work but some of it is repeating and copy/paste should help.
In Step 5 (Run the Regression), the first two values in the section labelled “Parameter estimates” are the values for the intercept and slope. Remember this for Questions 7-10.
High values for “r” and “r2” do not always mean a relationship has to exist. There must be some logic as to why you are testing two variables.