chapter 10
Linear Regression
Learning Objectives
After reading this chapter, you will be able to. . .
1. explain the relationship between correlation and regression.
2. understand the importance of prediction using regression.
3. identify the two values that determine the regression line in least-squares regression.
4. predict the value of a criterion based on a predictor value.
5. describe the extension of bivariate regression to multiple regression.
6. report and interpret results of multiple regression in APA format.
suk85842_10_c10.indd 367 10/23/13 1:43 PM
CHAPTER 10Section 10.1 Regression and Correlation
Regression is a powerful analytical tool that involves using relationships between variables to predict one from a measure of the other. Generally speaking, people make regression-like predictions often. The presence of clouds in the morning sky prompts us to take an umbrella to work, for example, or a phone call from unexpected guests prompts us to prepare extra food for a meal. This chapter follows the same thinking, except that the predictions are mathematical.
Regression topics represent a bit of a puzzle. Sometimes readers shy away from regression topics, thinking that they are difficult to understand or at least difficult to master. However, regression, as the reader will see, is difficult neither to understand nor to execute. Social scientists rely on regression in virtually every advanced statistical procedure. Most of those high-end statistical techniques (such as multivariate analysis of variance, discriminant function analysis, and structural equation modeling) are beyond the scope of an introductory text, but regression analysis is an essential part of the preparation for each of them. In the meantime, we will learn to use regression for its own purposes: to identify the independent variables, known as predictors, that are significantly related to a dependent outcome variable, known as the criterion.
10.1 Regression and Correlation
In Chapter 9, we made the point that when two variables are correlated, it is because they both contain some of the same information. If intelligence and reading comprehension are correlated, it is because to some degree they both measure the same characteristic. The more highly they are correlated, the greater the quantity of whatever is measured that the two characteristics have in common.
Recall that the coefficient of determination ($r_{xy}^2$) indicates the proportion of one variable, the x variable, for example, that can be explained by the other, which is designated y. If intelligence (x) and reading comprehension (y) are correlated $r_{xy} = .8$, then $r_{xy}^2 = .64$. The coefficient of determination tells us that 64% of whatever reading comprehension measures can be explained by differences in intelligence. Another way to say this is that the coefficient of determination indicates how much information two correlated variables have in common.
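As a quick check of the arithmetic above, the following sketch squares a few correlation coefficients to show how quickly the shared proportion grows with the size of r. Only r = .8 comes from the intelligence and reading example; the other values and the function name `shared_proportion` are illustrative assumptions.

```python
def shared_proportion(r):
    """Coefficient of determination: the square of the correlation coefficient."""
    return r ** 2

# r = .8 is the intelligence/reading example; .2 and .5 are illustrative values.
for r in (0.2, 0.5, 0.8):
    print(f"r = {r:.1f} -> r squared = {shared_proportion(r):.2f}")
```

Because the coefficient is squared, a correlation of .5 accounts for only a quarter of the information, while a correlation of .8 accounts for nearly two-thirds.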
If correlated variables share information, it seems logical that if we had information about the value of one of those variables, we could make a better-than-chance prediction of what the value of the other would be.
• If age and height are correlated for teenagers and we know how old someone is, we should be able to make a better-than-chance prediction of how tall that person is, right? Or perhaps it can be turned around, and knowing how tall a teen is can predict that teen’s age.
• If education and income are correlated and we know how many years of schooling someone has had, we should be able to make an educated guess about that person’s level of income, right?
• If soldiers’ exposure to combat situations is correlated with their manifestation of posttraumatic stress disorder (PTSD), can the severity of PTSD be predicted from the number of times a soldier has been in combat?
Questions like these are at the heart of regression, and people have been asking such questions for a long time. Carl Friedrich Gauss, the man who defined the characteristics of the normal (Gaussian) distribution, began developing the mathematics behind regression in the early part of the 19th century. Many others have also made contributions to this form of quantitative analysis. Collectively, their work has allowed experts in a variety of fields to use regression procedures in their decision making for many years.
• Economists gather data on unemployment rates, wholesale inventories, and consumer spending in order to predict the rate at which the economy will grow. This approach is effective because each of those variables correlates with economic expansion.
• Meteorologists use changes in barometric pressure to predict the kind of weather that is coming. Because drops in barometric pressure are predictors of violent storms in the Great Plains states and in the southeastern part of the United States, meteorologists watch particularly for dramatic drops in air pressure.
• Odds makers use data on a team’s past performance, any injuries to key players, and the quality of the opponent to predict the outcomes in games.
• Psychologists can use genetic and social factors, including the history of alcohol abuse in a family, to predict an individual’s predisposition to abuse drugs.
• Human resource professionals can use current employee performance measures to predict the effectiveness of future employee hires.
Each of these scenarios is possible because of correlations between variables. Correlations pave the way for prediction. The point is not that the people in these examples necessarily sit down with mathematical models to calculate the probability that certain results will emerge (which is something we will do in this chapter), but they could. In fact, a review of the professional literature provides ample evidence that scholars perform analyses like these frequently. They are important because prediction allows those who must act to be proactive. Rather than waiting for some important condition to emerge, decision makers can anticipate its timing with some precision and then take an appropriate action. Prediction is the basis for sound decision making.
The Language of Regression
There are many types of regression procedures. However, even though we are interested in just one in this chapter, the concepts and much of the language used here are common to the different approaches. Because variables share information, one can be used to predict the other.
The terms independent variable and dependent variable are common enough in statistics, but in regression discussions the danger is that the words independent and dependent make it very easy to begin thinking of the relationship between the antecedent variables and the dependent variable as causal. Although we also used this language with t-tests and ANOVA, the risk is greater in correlation discussions because the discussion begins with the assumption of a relationship between the variables. To avoid this slippery slope, we will make an adjustment in the regression discussion. Rather than the terms dependent variable and independent variable, as common as those terms are elsewhere, we will refer to the variable to be predicted in a regression procedure as the criterion variable and the variable used to make the prediction as the predictor variable. This language is adopted to minimize the risk of confusing correlations with causal relationships.
Try It!
A. From the standpoint of regression, why does the size of a correlation coefficient matter?
Note that the suggestion is not that there is no causal relationship. There may be just such a connection. In fact, the relationship between exposure to combat and the development of posttraumatic stress probably is causal. The point is that the correlation alone (and correlation is the foundation for pursuing regression) is not sufficient by itself to establish causality.
Although the terms criterion and predictor are used here for descriptive purposes, there need to be some shorthand indicators as well. The symbols used in regression are the same as those used with the Pearson correlation in Chapter 9: x and y for the correlated variables. The symbols will be
• x for the predictor variable and
• y for the criterion variable.
Which Is the Predictor?
The confusion that can occur when equating correlation with causation is increased by recognizing that either variable in a significant correlation can be used to predict the other. If there is a correlation between the degree of posttraumatic stress disorder and the length of exposure to combat, it means that each variable is equally related to the other. Someone could
• predict the degree of PTSD from the length of combat exposure or
• turn it around and predict the length of combat exposure from the degree of PTSD.
Either variable in a statistically significant correlation can predict the other. From the point of view of the mathematics involved, it does not matter, although practical considerations may dictate which variable must be the predictor and which must be the criterion. Sometimes one of the variables will prove more elusive than the other because the data involved is more difficult to gather. In such cases, the difficulty involved may dictate that the more accessible variable becomes the predictor and that the less available variable be predicted rather than gathered.
If reading comprehension scores are significantly correlated with intelligence scores and someone wishes to predict the value of one from the other, it makes sense that the reading scores would be the predictor variable. Reading scores are easier to collect than intelligence scores. Most major intelligence tests must be administered one on one by someone who has been trained with the instrument, which makes the process expensive and time-consuming. Reading tests are typically less demanding to administer.
There are other considerations. Perhaps aptitude scores from a college admissions test that students take in high school are correlated with the grades that students earn during their first year of college study. From the standpoint of the correlation, scores can be predicted from grades quite as readily as grades can be predicted from scores. It does not make much sense, however, to predict backward. It will generally be the students’ future rather than their past performance that someone will be interested in knowing.
Picturing Regression
Scatterplots are an easy way to graphically represent the correlation between variables. In Chapter 9 the scatterplot was used to illustrate the correlation between verbal ability and intelligence. Recall that each point in the scatterplot represented one person’s scores on both of the measured characteristics. When we use the scatterplot for regression purposes, the vertical axis will indicate the level of the criterion variable, y, and the horizontal axis will indicate the level of the predictor variable, x:
• x signifies the predictor variable in a scatterplot.
• y is the criterion variable.
As noted in Chapter 9, when two variables are highly correlated, there tends to be little scatter from left to right along the ranges of the two variables. For example, a researcher randomly selects a group of 20 first-year college students at the end of their first year of study at a large university and gathers from them two types of data: (a) the number of hours per week they typically study and (b) their recorded grade averages at the end of their first semester. The researcher wants to determine how well the number of hours the student studies per week will predict the student’s grades. This means that the hours studied is the predictor variable, x, and grade average is the criterion variable, y. The data follows:
Row   Hours Studied (x)   Grade Average (y)
1     1                   1.5
2     2                   1.8
3     3                   2.2
4     3                   2.0
5     5                   2.0
6     5                   2.1
7     7                   2.4
8     7                   2.2
9     8                   2.4
10    10                  2.7
11    10                  2.6
12    11                  2.9
13    13                  3.0
14    15                  3.0
15    16                  3.1
16    16                  2.7
17    16                  3.3
18    17                  3.0
19    18                  3.4
20    20                  4.0
This data can be used to create the same type of graph completed in Chapter 9. If the data is placed in columns in an Excel spreadsheet just as it is here (the numbering down the left is automatic in Excel and not considered one of the data columns), the commands for creating the scatterplot are then simply Insert and then Scatter. With some additions to label the axes and title the graph, the result is Figure 10.1.
To plot the data in this example, draw the vertical and horizontal axes of the graph, and then for each individual subject mark the point where the individual’s grade (vertical axis) meets the same individual’s study time (horizontal axis). Once all subjects are plotted, at least three conclusions can be drawn:
1. There is a correlation between the number of hours studied and the students’ first-semester grades. If this was not the case, the dots would have no particular pattern and would have just been randomly scattered. However, the dots array fairly consistently from lower left to upper right.
2. The correlation is substantial; there is not much scatter in the scatterplot, and the trend from lower left to upper right is easy to recognize. If the relationship had been negative (the more they studied, the poorer they performed), the data pattern would have been from the upper left to the lower right. In that case, as the value of one variable increased, the other would decrease. (This might be the case, for example, if this was a correlation of the number of hours students spend playing video games and their grades. The more they play, the less well they do in their classes.)
3. The relationship appears to be linear. There is not a point in the pattern where it appears that as one variable seems to increase, the other tends to level off or even diminish.
Figure 10.1: A scatterplot for the relationship between study time and grades
[Figure 10.1 axes: horizontal, Hours Studied Per Week; vertical, First Term Grades on a 4.0 Scale]
About Linearity
The third point about the relationship between x and y being linear is particularly important. The type of regression discussed in this chapter is based on the assumption that the association between the predictor and criterion variables is linear. There are other regression techniques that are used when the relationship between variables is not linear, but in the reasoning used in this chapter, there needs to be linearity. Because there is a linear connection, a straight line drawn through or as close as possible to as many of the data points as possible might look like the line in the graph in Figure 10.2A.
Figure 10.2: Regression lines
That line through the data points is called a regression line. If it is positioned so that it is as close as possible to as many of the 20 data points as it can be and still be a straight line, it can be used to determine any value of y from a specified value of x (or for that matter, any value of x from a specified value of y). For example, someone using the graph can select any value of x along the horizontal axis, go vertically from that x value up to the line, and then move left horizontally from the line to where the y-axis is. The value of the point at which the y-axis is encountered will be the value of y for the specified x value. See Figure 10.2B.
[Figure 10.2 panels: A. A Regression Line through a Scatterplot; B. Using the Regression Line to Determine y from x. Axes in both panels: horizontal, Hours Studied Per Week; vertical, First Term Grades on a 4.0 Scale]
No one in the sample of 20 indicated a study time of 12.5 hours per week, but perhaps one of the researcher’s colleagues who knows what kind of analysis is under way asks, “I know someone who studies 12.5 hours every week. What is your best guess for this student’s grade average?” If the regression line is positioned accurately, the researcher can find 12.5 on the x-axis, travel vertically up to the regression line, and then move left to the y-axis to determine that 12.5 hours per week of study time predicts a grade point average of about 2.9.
The researcher gathered data from only 20 students, which is a little risky, but they were randomly selected, and sampling theory tells us that a randomly selected sample will differ from the population of all freshman students only by chance. In spite of the small sample size, the researcher may be lucky and find that the sampling error is minimal.
Coping with Less-Than-Perfect Correlations
Besides the fact that the graph in Figure 10.2 does not provide very detailed markings (the researcher had to guess that 12.5 hours studied will produce a grade average of “about 2.9”), other factors affect prediction accuracy. The researcher cannot be precise about the grades because the correlation between the two variables is not perfect. Although a grade average of 2.9 might be the best possible prediction given this data, it is quite likely that the prediction will not be exactly accurate. Maybe for a particular student who studies 12.5 hours per week, a GPA of 2.8 would turn out to be a better prediction, or perhaps 3.0.
Without yet calculating the correlation, the evidence that $r_{xy}$ is less than perfect is the scatter in the data points. For example, note that three students reported studying the same amount, 16 hours per week, but all of them ended the semester with different grade averages. Of course, an imperfect correlation such as this one reflects the fact that, besides probably including some error in the way the variables are measured, grades are affected by more than just study time. The researcher has not accounted for differences in academic ability or differences in the quality of the study time. Those problems aside, as long as the correlation between the predictor and criterion variables is statistically significant, the predicted value of y from the value of x will be more accurate over time than a number of random predictions.
Error in prediction is something we learn to tolerate. No one sues the United States Commerce Department if the prediction for job growth is off by one-tenth for the year. We do not petition the television station to fire the meteorologist when the forecast high for the day is wrong by a couple of degrees. Error is inevitable when predictions have to be based on imperfectly correlated variables, but we can at least have some measure of how extensive the error is likely to be. Later in the chapter, we will learn to calculate the amount of error and then include it with the predicted value in order to have a gauge of prediction accuracy.
Try It!
B. What is the visual evidence in a scatterplot for a weak correlation?
Understanding the Least-Squares Criterion
A scatterplot is a helpful way to introduce the idea of regression, but relying on the scattered points in a graph in order to predict one variable from the other is not practical. It is a conceptual model for what will actually be done mathematically with a regression equation as depicted in Figure 10.3. The equation is derived in a way that satisfies the requirements for what is called the least-squares criterion. The least-squares criterion (think of criterion in this case as a requirement) is this: The regression line must be positioned so that the sum of the squared prediction errors, Σ(y − ŷ)², has the lowest possible value. In other words, the vertical distances from the data points to the regression line (depicted as gray lines in the graph) are, taken together, as small as possible. The errors represented by the difference between y and ŷ are termed residual scores, as shown in Figure 10.3.
Figure 10.3: The least-squares criterion graph
Source: Used by permission Alan B. Hale, Ph.D.
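The criterion can be illustrated with the study-time data from earlier in the chapter. The sketch below is a minimal illustration, not the chapter's own procedure: it computes the slope from deviation scores (the sum of cross-products divided by the sum of squared x deviations, a form that is algebraically equivalent to the correlation-based slope formula), and it compares the resulting line with a hypothetical line drawn by eye. The function names and the by-eye intercept of 1.2 and slope of 0.2 are assumptions chosen for illustration.

```python
# Study-time data from this chapter: hours studied per week (x), grade average (y).
hours = [1, 2, 3, 3, 5, 5, 7, 7, 8, 10, 10, 11, 13, 15, 16, 16, 16, 17, 18, 20]
grades = [1.5, 1.8, 2.2, 2.0, 2.0, 2.1, 2.4, 2.2, 2.4, 2.7,
          2.6, 2.9, 3.0, 3.0, 3.1, 2.7, 3.3, 3.0, 3.4, 4.0]


def sum_squared_errors(a, b, xs, ys):
    """Sum of squared prediction errors for the line y_hat = a + b * x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Intercept and slope that minimize the sum of squared errors."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: sum of cross-products of deviations / sum of squared x deviations.
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b


a_fit, b_fit = least_squares_line(hours, grades)
sse_fit = sum_squared_errors(a_fit, b_fit, hours, grades)

# A hypothetical hand-drawn line (intercept 1.2, slope 0.2) for comparison.
sse_by_eye = sum_squared_errors(1.2, 0.2, hours, grades)

print(f"least-squares line: a = {a_fit:.3f}, b = {b_fit:.3f}, SSE = {sse_fit:.3f}")
print(f"line drawn by eye:  a = 1.200, b = 0.200, SSE = {sse_by_eye:.3f}")
```

Any other straight line, including small shifts of the fitted intercept or slope, produces a larger sum of squared errors than the least-squares line.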
Describing the Prediction Error
Because of error in the regression solution, at least some of the predictions are going to be inaccurate. These errors emerge as differences between the criterion variable’s predicted value and its actual value, the value we would see if all the data were available and there were no need to rely on the figure or on mathematics for the prediction. If
x is the actual value of the predictor variable, and
y is the actual value of the criterion variable, we can use
ŷ (y-hat) to symbolize the predicted value of the criterion variable and
y − ŷ as a way to indicate the prediction error or residual scores in a particular solution.
[Figure 10.3 labels: horizontal axis, x (independent); vertical axis, y (dependent); line ŷ = a + bx with its y-intercept; residual y₂ − ŷ₂; least squares method: minimize Σ(yᵢ − ŷᵢ)² for i = 1 to n]
Look at Figure 10.2 again. If the correlation between hours studied and end-of-term grades was perfect, the data points would form a straight line. With no scatter in the line, each value of x would result in just one corresponding value of y. Those three people who reported 16 hours per week of study time would all have had the same grades at the end of the semester.
But with scatter in the data points, establishing the regression line is not a matter of just connecting the dots. Because the regression line is a straight line and it has to be positioned so as to minimize the prediction errors, its positioning involves some compromise. It is a line of best fit, not a line of perfect fit.
Using the Residual Scores
If someone were to rely on Figure 10.2 to make a series of predictions and then go through university records to determine what the actual grade averages were for those 20 students after their first semester, any difference between what was predicted and what the grades actually were would be reflected in the residual scores. The residual scores would indicate the amount of prediction error.
If all of those residual scores were added up, what would they total? This is how the question looks in symbols:
$$\sum (y - \hat{y}) = ?$$
When a number of predictions are made, some of the residual scores will be positive (the actual value of y will be larger than the predicted value, ŷ), and some will be negative (y is smaller than ŷ). And if all the residual scores were summed, their total would be 0. The positive and negative residual scores would cancel each other out and sum to 0.
The zero sum is not very helpful to someone who needs to know how much error there is in a regression solution. A better indicator of prediction error would be to square the residual scores, which would make all the negative residual scores positive, and then sum them. If the sum of those squared values has its lowest possible value, the solution meets the least-squares criterion. The regression equation used here was developed to satisfy that requirement. That is why this particular form of regression is called ordinary least-squares regression.
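Both claims, that the residual scores sum to 0 and that their squares do not, can be checked with the study-time data. The following is a minimal sketch under the assumption that the line is fitted from deviation scores (an algebraically equivalent form of the slope formula presented in the next section); the variable names are illustrative.

```python
from statistics import mean

# Study-time data from this chapter: hours studied per week (x), grade average (y).
hours = [1, 2, 3, 3, 5, 5, 7, 7, 8, 10, 10, 11, 13, 15, 16, 16, 16, 17, 18, 20]
grades = [1.5, 1.8, 2.2, 2.0, 2.0, 2.1, 2.4, 2.2, 2.4, 2.7,
          2.6, 2.9, 3.0, 3.0, 3.1, 2.7, 3.3, 3.0, 3.4, 4.0]

# Fit the least-squares line from deviation scores.
mean_x, mean_y = mean(hours), mean(grades)
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, grades))
     / sum((x - mean_x) ** 2 for x in hours))
a = mean_y - b * mean_x

# Residual score for each student: actual grade minus predicted grade.
residuals = [y - (a + b * x) for x, y in zip(hours, grades)]

sum_of_residuals = sum(residuals)                # positives and negatives cancel
sum_of_squares = sum(e ** 2 for e in residuals)  # squaring removes the signs

print(f"sum of residuals:         {sum_of_residuals:.6f}")
print(f"sum of squared residuals: {sum_of_squares:.3f}")
```

The first total is zero apart from floating-point rounding, which is why it says nothing about accuracy; the second total is the quantity the least-squares criterion minimizes.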
10.2 Ordinary Least-Squares Regression With One Predictor
Theoretically, there can be any number of predictors, $x_1, x_2, x_3, \ldots, x_n$, in a regression problem rather than just the single x. Toward the end of the chapter, there is a description of how multiple regression works. This form of multiple regression is just ordinary least-squares regression with multiple predictors. The focus here is on regression with just one predictor. It is sometimes called simple regression or bivariate regression because there are only two variables involved.
The regression equation provides the math needed to position the regression line. So that the positioning of the regression line meets the least-squares criterion, there are two questions to be answered:
1. At what point does the regression line cross the y-axis in the graph?
2. How much does the criterion variable (y) change when the predictor variable (x) increases by 1.0?
The answer to the first question establishes the value of the intercept. The intercept indicates the value of y where the regression line intercepts the vertical axis. Put more succinctly, the intercept is the value of y when x = 0. If you look at Figure 10.2 again, it appears that the regression line crosses the y-axis at about y = 1.2. Following is an equation to calculate this value, which will indicate how close we were with the estimate.
The answer to the second question requires a value that represents the slope of the line, or just the slope. Refer to Figure 10.2. It is possible to estimate from the way the line is positioned that if x increases by 1.0, it appears that y increases by about .2. The slope value reveals how steeply the regression line inclines or declines.
The Regression Equation
The bivariate regression equation has this form:
$$\hat{y} = a + bx + e \tag{Formula 10.1}$$
Where
ŷ = the predicted value of the criterion variable
a = the intercept
b = the slope of the regression line
x = the value of the predictor variable
e = prediction error
The equation shows that for a given value of the predictor variable (x), the best prediction for ŷ is the value of the intercept plus the product of the slope times the predictor plus some component of error. Calculating the amount of error in the regression problem is a separate process rather than part of the initial calculation. The e indicates that there is always error in the prediction. Because the error will be calculated separately, we can take the e out of the equation, which makes the formula
$$\hat{y} = a + bx$$
Before calculating a regression solution, we need to know the intercept value, a, and the slope, b. They each have their own equations, but most of the terms involved are statistics that are already familiar. First, the formula for the intercept:
$$a = M_y - bM_x \tag{Formula 10.2}$$
Where
a = the intercept
b = (this is the new part) the slope of the regression line
$M_y$ = the mean of the criterion variable, y
$M_x$ = the mean of the predictor variable, x
The intercept value is
1. the mean of the criterion variable minus
2. the slope value (once that is determined) times the mean of the predictor variable.
Because the intercept formula includes the slope of the regression line, or the regression coefficient, we need to start there. That formula is
$$b = r_{xy}\left(\frac{s_y}{s_x}\right) \tag{Formula 10.3}$$
Where
b = the slope of the regression line
$r_{xy}$ = the correlation coefficient for the two variables
$s_y$ = the standard deviation of the criterion variable, y
$s_x$ = the standard deviation of the predictor variable, x
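Formulas 10.1 through 10.3 translate directly into three small functions. The sketch below is one possible rendering; the function names are illustrative, and the input values are the summary statistics reported for the study-time example in this section. Rounding after each step is an assumption made to mirror a hand calculation.

```python
def slope(r_xy, s_y, s_x):
    """Formula 10.3: b = r_xy * (s_y / s_x)."""
    return r_xy * (s_y / s_x)


def intercept(mean_y, b, mean_x):
    """Formula 10.2: a = M_y - b * M_x."""
    return mean_y - b * mean_x


def predict(a, b, x):
    """Formula 10.1 without the error term: y_hat = a + b * x."""
    return a + b * x


# Summary statistics reported for the study-time example in this section.
r_xy, s_y, s_x = 0.946, 0.613, 5.941
mean_y, mean_x = 2.615, 10.15

# Round at each step, as a hand calculation would.
b = round(slope(r_xy, s_y, s_x), 3)
a = round(intercept(mean_y, b, mean_x), 3)
y_hat = round(predict(a, b, 12.5), 3)

print(f"b = {b}, a = {a}, predicted grade for 12.5 hours = {y_hat}")
```

The slope has to be computed first because the intercept formula takes it as an input.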
Calculating a Regression Solution
Using the study-time and semester-grades data, calculate a regression solution. Remember that the researcher estimated that someone who studied 12.5 hours per week would probably have a semester grade average of about 2.9. How accurate was that estimate?
The process will be as follows:
1. Calculate the means and standard deviations for x and y.
2. Calculate the correlation of x and y.
3. Calculate the slope of the line, or the regression coefficient, b.
4. Calculate the regression intercept, a.
5. Calculate the value of ŷ.
1. For the two variables, verify that
                      Hours Studied (x)   Grade Average (y)
Mean                  10.15               2.615
Standard deviation    5.941               .613
2. Using Formula 9.2, calculate the correlation as follows:
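$$r_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
$$r_{xy} = \frac{20(596.3) - (203)(52.3)}{\sqrt{\left[20(2{,}731) - (203)^2\right]\left[20(143.91) - (52.3)^2\right]}}$$
$$r_{xy} = \frac{1{,}309.1}{\sqrt{(13{,}411)(142.91)}} = .946$$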
Checking this value against the critical value from Table 9.1, $r_{xy\,.05(18)} = .444$, indicates that the correlation is statistically significant.
To do the correlation on Excel, remember that having set up the data in the two columns, the commands are Data → Data Analysis → Correlation.
3. The slope of the line is
$$b = r_{xy}\left(\frac{s_y}{s_x}\right) = .946\left(\frac{.613}{5.941}\right) = .098$$
This value indicates that y increases .098 for every 1.0 increase in x. Earlier we guessed that the slope of the line in the graph in Figure 10.2 might be about .2, which turns out to be incorrect.
4. The regression line intercept is
$$a = M_y - bM_x = 2.615 - (.098)(10.15) = 1.620$$
This value indicates that if x = 0, then y = 1.620. Based on the visual best fit that we made to the graph in Figure 10.2, we guessed that the intercept would be about y = 1.2. That estimate was not very close either.
5. What grades would students likely earn if they study 12.5 hours per week? The estimate based on that visually placed regression line in Figure 10.2 was a grade average of about 2.9. Check this “guesstimate” by solving for ŷ and comparing the two solutions.
$$\hat{y} = a + bx = 1.620 + (.098)(12.5) = 2.845$$
6. Based on the data from the 20 students, the best prediction of the grade average someone who studies 12.5 hours per week will earn is 2.845. Interestingly, that value is not too far from the earlier prediction.
Notice that the data from two correlated variables made it possible to predict an unknown value of one variable from a known value of the other. It took a few descriptive statistics (the means, standard deviations, and the correlation coefficient) and the relatively straightforward equations for the slope, the intercept, and the predicted value of y.
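The five steps can also be carried out from the raw data in a few lines. The sketch below is a minimal illustration using only the Python standard library; the variable names are illustrative, and the use of n − 1 in the standard deviations is an assumption that reproduces the values reported in step 1. Because nothing is rounded along the way, the intercept and prediction differ slightly in the third decimal place from the hand calculation, which rounded b to .098 before continuing.

```python
import math

# Study-time data from this chapter: hours studied per week (x), grade average (y).
hours = [1, 2, 3, 3, 5, 5, 7, 7, 8, 10, 10, 11, 13, 15, 16, 16, 16, 17, 18, 20]
grades = [1.5, 1.8, 2.2, 2.0, 2.0, 2.1, 2.4, 2.2, 2.4, 2.7,
          2.6, 2.9, 3.0, 3.0, 3.1, 2.7, 3.3, 3.0, 3.4, 4.0]
n = len(hours)

# Sums needed for the raw-score formulas.
sum_x, sum_y = sum(hours), sum(grades)
sum_xy = sum(x * y for x, y in zip(hours, grades))
sum_x2 = sum(x ** 2 for x in hours)
sum_y2 = sum(y ** 2 for y in grades)

# Step 1: means and standard deviations (n - 1 in the denominator).
mean_x, mean_y = sum_x / n, sum_y / n
s_x = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
s_y = math.sqrt((sum_y2 - sum_y ** 2 / n) / (n - 1))

# Step 2: Pearson correlation, raw-score form.
r_xy = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

# Steps 3-5: slope, intercept, and the prediction for 12.5 hours per week.
b = r_xy * (s_y / s_x)
a = mean_y - b * mean_x
y_hat = a + b * 12.5

print(f"M_x = {mean_x:.2f}, M_y = {mean_y:.3f}, s_x = {s_x:.3f}, s_y = {s_y:.3f}")
print(f"r_xy = {r_xy:.3f}, b = {b:.3f}, a = {a:.3f}, y_hat = {y_hat:.3f}")
```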
Practicing the Regression Solution
Now that the calculations of the slope (b) and the intercept (a) have been completed, it is a simple matter to solve for any other value of x. For example, what grades can be predicted for someone who seems to spend every waking minute studying and reports studying 30 hours per week?
$$\hat{y} = a + bx = 1.620 + (.098)(30) = 4.560$$
Now this is interesting because grades are on a 4-point scale; they can actually go no higher than 4.0, which is straight As. The regression procedure does not “know” there is an effective ceiling to how high grades can be.
What grade average is predicted for someone who does not study at all? Is such a student likely to receive a grade average that is also 0?
$$\hat{y} = a + bx = 1.620 + (.098)(0) = 1.620$$
Even with no time devoted to weekly study, a student is unlikely to have a 0 GPA. But this is an answer we already had. Remember that the intercept is defined as the value of y if x = 0. For someone who does not study, x is equal to 0, and we could have just reported the value of the intercept to answer the question.
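Once a and b are known, predictions for any number of x values are a one-line calculation, and a simple check can flag predictions that fall outside the grade scale. The following sketch assumes the rounded intercept and slope from the worked solution; the function names and the scale-limit constants are illustrative.

```python
A, B = 1.620, 0.098              # intercept and slope from the worked solution
GRADE_MIN, GRADE_MAX = 0.0, 4.0  # limits of the 4-point grade scale


def predict_grade(hours_per_week):
    """Predicted grade average: y_hat = a + b * x."""
    return A + B * hours_per_week


def on_scale(grade):
    """True if the predicted grade is a value the 4-point scale allows."""
    return GRADE_MIN <= grade <= GRADE_MAX


for h in (0, 12.5, 30):
    y_hat = predict_grade(h)
    note = "" if on_scale(y_hat) else "  (outside the 4.0 scale)"
    print(f"{h:>5} hours -> predicted grade {y_hat:.3f}{note}")
```

The equation itself places no limit on ŷ, so a check like this has to be added by whoever interprets the prediction.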
Try It!
C. If the correlation between two variables is negative, how will that be reflected in a regression solution?
Apply It!
Regression in a Small Business
The manager of a small, family-owned bakery that makes wedding cakes would like to use linear regression techniques to predict business growth based on
population growth in the metropolitan area. The company has been in business for 70 years, and sales have steadily increased as the population of the town has grown. First, the man- ager would like to know if population growth is a good predictor of business growth. In other words, are the two variables correlated? Second, if they are correlated, what would be the expected sales if the population of the town grows to 260,000?
To answer these questions, the manager gathers information on the town's population at 5-year intervals over the last 70 years. He also computes wedding cake sales for those same years. The data follow:
Year    Population    Cakes Sold
1945    25,780        112
1950    29,580        105
1955    36,500        125
1960    39,870        154
1965    43,580        189
1970    57,800        146
1975    59,000        278
1980    70,000        812
1985    82,000        624
1990    91,000        694
1995    129,000       1,163
2000    149,000       1,360
2005    176,000       2,285
2010    198,000       2,519
If the population is plotted as the x (predictor) value and the number of wedding cakes sold as the y (criterion) value, we see in Figure 10.4 that there seems to be a strong linear relationship between the two variables.
The line through the 14 data points is the regression line. To calculate the regression solution, the manager first computes the means and standard deviations for the population and sales figures.
Figure 10.4: Number of cakes sold as a function of town population
              Mean      Standard Deviation
Population    84,794    56,467
Cakes sold    755       810
Next he calculates the correlation coefficient (rxy) to be .9735.
The slope of the line is

$$b = r_{xy}\left(\frac{s_y}{s_x}\right) = .9735\left(\frac{810}{56{,}467}\right) = .01396$$

The regression line intercept is

$$a = M_y - bM_x = 755 - .01396(84{,}794) = -429$$
The manager has recently read a report saying that the population of the town will be approximately 260,000 in 5 years. What will be the estimated sales given this value of the predictor variable?
$$\hat{y} = a + bx = -429 + (.01396)(260{,}000) = 3{,}201$$
Next, the manager wants an interval estimate for this regression solution; this is known as a prediction interval. The confidence intervals (CI) calculated in previous chapters measure reliability around a mean, so a CI here would be centered on the mean value of 755 cakes sold. The manager instead wants the prediction interval (PI), which measures reliability around some future data value. In this case, he wants the PI for 3,201 cakes sold (the ŷ value calculated with a population of 260,000). Both intervals are based on the standard error of the estimate (SEest), so he calculates that first:
$$SE_{est} = s_y\sqrt{1 - r_{xy}^2} = 810\sqrt{1 - .9735^2} = 185$$

To determine PI at a 95% confidence level,

$$PI = \pm t_{n-2}(SE_{est}) + \hat{y}$$

t for 12 degrees of freedom is 2.179 at p = .05, so

$$PI = \pm 2.179(185) + 3{,}201 = 2{,}798 \text{ to } 3{,}604$$
Therefore, the manager can be 95% confident that the bakery's annual sales of wedding cakes will be between 2,798 and 3,604 if the town's population grows to 260,000. The manager can plan for staffing and equipment levels to support this level of business.
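The bakery solution can also be checked from the raw data. The following is a minimal sketch, not the manager's own worksheet: it uses only Python's standard library, computes the correlation from deviation scores, and takes the critical t value (2.179) from the text rather than looking it up. The variable names are illustrative.

```python
import math
import statistics

# Town population (x) and wedding cakes sold (y), 1945-2010, from the data table
population = [25780, 29580, 36500, 39870, 43580, 57800, 59000,
              70000, 82000, 91000, 129000, 149000, 176000, 198000]
cakes = [112, 105, 125, 154, 189, 146, 278,
         812, 624, 694, 1163, 1360, 2285, 2519]

n = len(population)
mean_x, mean_y = statistics.mean(population), statistics.mean(cakes)
s_x, s_y = statistics.stdev(population), statistics.stdev(cakes)  # sample SDs (n - 1)

# Pearson correlation from deviation scores
cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(population, cakes))
r_xy = cross / ((n - 1) * s_x * s_y)

slope = r_xy * (s_y / s_x)            # b = r_xy * (s_y / s_x)
intercept = mean_y - slope * mean_x   # a = M_y - b * M_x

y_hat = intercept + slope * 260_000   # predicted sales at a population of 260,000
se_est = s_y * math.sqrt(1 - r_xy ** 2)

T_CRIT = 2.179                        # t for n - 2 = 12 df at p = .05, from the text
lower, upper = y_hat - T_CRIT * se_est, y_hat + T_CRIT * se_est

print(f"r = {r_xy:.4f}, b = {slope:.5f}, a = {intercept:.0f}")
print(f"y-hat = {y_hat:.0f}, SEest = {se_est:.0f}, PI = {lower:.0f} to {upper:.0f}")
```

Because the code carries full precision instead of rounded intermediate values, its results can differ from the longhand figures by a unit or so in the last digit.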
Apply It! boxes written by Shawn Murphy.
Interpreting the Regression Results
Because b = rxy(sy/sx), the value of the regression slope is actually a proportion of the ratio of sy to sx. The proportion is determined by the strength of the correlation. A perfect correlation is an academic case, because it never happens for researchers in psychology or any of the other social sciences, but if the correlation between the two variables were perfect (rxy = 1.0), the slope would be 1 times the ratio of sy to sx. As the correlation diminishes, the value of the slope is a decreasing proportion of that ratio. If the correlation is .50, for example, the slope's value will be half the ratio of sy to sx.
The slope of the regression line does not necessarily need to be a positive value. If the correlation between the predictor and criterion variables is negative, the slope will be negative. This means that for every 1.0 increase in x, y will decline, and the regression line will move downward in a scatterplot from left to right.
Suppose a criminologist is using incarcerated inmates' good behavior to predict the length of the inmate's sentence. As the number of days of good behavior while incarcerated increases, the overall length of the inmate's sentence decreases. The slope in this case is a negative value.
A negative value for the slope is not unusual.
• If the amount of time students spend on video games is used to predict their grade averages during the first year of college, the correlation between those variables will probably be negative; as time on games increases, grades probably decline. If the correlation is negative, the slope in a regression solution will also be negative.
• If the frequency of substance abuse is used to predict job productivity, the slope is probably negative.
• The number of extramarital affairs is probably negatively correlated with the length of a marriage, and so it results in a negative slope when it is used to predict marital harmony.
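The way the slope tracks the size and sign of the correlation can be sketched numerically. The standard deviations below are hypothetical values chosen only so that the ratio of sy to sx is 2.0; they do not come from any data set in the chapter.

```python
def slope(r_xy, s_y, s_x):
    """Regression slope: b = r_xy * (s_y / s_x)."""
    return r_xy * (s_y / s_x)


# Hypothetical standard deviations, chosen so that s_y / s_x = 2.0
S_Y, S_X = 4.0, 2.0

for r in (1.0, 0.5, 0.0, -0.5):
    print(f"r = {r:+.1f}  ->  b = {slope(r, S_Y, S_X):+.2f}")
```

With a perfect correlation the slope equals the full ratio, with r = .50 it is half the ratio, and a negative correlation produces a negative slope of the same size.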
Determining the Error in a Regression Solution
The predicted grade average for the student who studies 12.5 hours per week during that first semester of college study was 2.845. That prediction is calculated from the data available for the 20 students in the data set. Random sampling should keep sampling error, the degree to which the sample is unlike the population of all freshman students at the university, to a minimum.
However, even if the data set included data for every freshman student, the answer still will not necessarily be precisely accurate for one individual. The regression equations allow the best prediction, but it is a generalization based on the group, which may or may not be exactly accurate for a particular student. No matter how large the sample, and regardless of how it is selected, some prediction error is inevitable as long as the correlation between predictor and criterion is < 1.0.
This reality does not mean that the regression process is flawed. It just means that it is imperfect. We made a similar point about calculating the various standard error statistics in the chapters on t-tests. The fact that there is error in the test statistic does not imply that mistakes were made. It indicates that there is unaccounted-for variability in the data. It is the same with regression procedures. The least-squares criterion means that the equations are designed to minimize error, but they cannot eliminate it entirely. There is always some error associated with any social science research.
For any data set where a large number of predictions are made, some of the predictions will err by being too high, some will be too low, and a few might be correct. The fact that all these prediction errors would sum to 0, Σ(y − ŷ) = 0, is small consolation if we are making one prediction for one individual and the outcome is important. So that we may know how much to trust the prediction, regression procedures include a way to estimate the amount of error.
The Standard Error of the Estimate
Recall the standard error of the mean and the standard error of the difference that are calculated for the t-tests. Both statistics are measures of error variance in the other tests. For regression, there is a similar measure of error variance called the standard error of the estimate (SEest). Theoretically, the standard error of the estimate is explained this way:
• If a researcher calculates a very large number of regression solutions from a data set and
• for each solution determines the residual score, or the difference between the actual and predicted values of the criterion variable (y − ŷ),
• the standard error of the estimate is the standard deviation of all those residual scores.
The only way that residual scores can be determined, of course, is if the researcher already has all the actual values of y to begin with. If you had that information, what would be the point of using regression? The standard deviation of residual scores explains the standard error of the estimate, but it is not a guide to how the researcher calculates that statistic. Recall that in theory, the standard error of the mean (Chapter 4) is the standard deviation of all the sample means in the population. But that value was estimated by dividing the sample standard deviation by the square root of the number in the sample. The standard error of the estimate can be estimated in a similar way:
$$SE_{est} = s_y\sqrt{1 - r_{xy}^2} \qquad \text{Formula 10.4}$$

Where
SEest = the standard error of the estimate
sy = the standard deviation of the criterion (y) variable
r²xy = the square of the correlation coefficient
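One way to see why this formula matches the theoretical definition is to start from the residuals themselves. The sketch below assumes that the predictions come from the least-squares line and that sy is the sample standard deviation (computed with n − 1); the sum-of-squares notation is introduced here only for illustration.

```latex
\begin{aligned}
\sum (y - \hat{y})^2 &= \left(1 - r_{xy}^2\right) \sum (y - M_y)^2 \\
\sqrt{\frac{\sum (y - \hat{y})^2}{n - 1}}
  &= \sqrt{\frac{\sum (y - M_y)^2}{n - 1}} \, \sqrt{1 - r_{xy}^2}
   = s_y \sqrt{1 - r_{xy}^2} = SE_{est}
\end{aligned}
```

Because the residuals sum to 0, the left side of the second line is their standard deviation, so the formula reproduces the standard deviation of the residual scores without requiring each residual to be calculated.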
For the study-time and semester-grades problem, the standard error of the estimate will be as follows:
1. With the correlation between study time and grade averages rxy = .946
2. and the standard deviation of the y variable (grade averages) sy = .613,
3. the standard error of the estimate is

$$SE_{est} = s_y\sqrt{1 - r_{xy}^2} = .613\sqrt{1 - .946^2} = .199$$
A large SEest value indicates substantial error in the prediction. Consider the factors affecting the size of the standard error of the estimate.
1. The sy value is the standard deviation of the variable to be predicted. Highly variable data sets result in large standard deviation values and, as a result, in large SEest values.
2. The sy value is a multiplier for the result of √(1 − r²xy). The more highly x and y are correlated, the smaller this resulting value will be and, as a consequence, the smaller the SEest value will be.
The smallest SEest can be is 0, and the largest the value can be is the value of sy. In either case, can you see why?
• If the correlation between predictor and criterion is perfect (that is, if rxy = 1.0), the latter part of the term, √(1 − r²xy), becomes √(1 − 1²) or √0; sy × 0 = 0.
• At the other extreme, if the correlation between predictor and criterion has its weakest possible value (0), the latter part of the term becomes √(1 − 0²) or 1; sy × 1 = sy.
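Formula 10.4 and its two limiting cases are easy to verify with a short function. This is a minimal sketch; the function name is illustrative, and the inputs are the study-time values given above.

```python
import math


def standard_error_of_estimate(s_y, r_xy):
    """Formula 10.4: SEest = s_y * sqrt(1 - r_xy**2)."""
    return s_y * math.sqrt(1 - r_xy ** 2)


S_Y = 0.613  # standard deviation of grade averages

print(round(standard_error_of_estimate(S_Y, 0.946), 3))  # study-time example
print(standard_error_of_estimate(S_Y, 1.0))              # perfect correlation: no error
print(standard_error_of_estimate(S_Y, 0.0))              # no correlation: equals s_y
```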
Using the Standard Error of the Estimate
By itself, the standard error of the estimate is not very helpful. It is hard to know when the amount of error is comparatively large, and when it is not. Remembering the relationship between a standard deviation and the normal distribution provides some guidance.
Recall that in a normal distribution, the area from 1 standard deviation below the mean to 1 standard deviation above the mean includes about two-thirds of the entire population. Because the SEest is like a standard deviation of all possible error scores, the range from the predicted value (μ) minus 1 SEest to μ plus 1 SEest is a range within which the true value of the criterion variable, y, will occur about 68% of the time. Therefore, if
μ = 2.845 and
SEest = .199
the range from 2.646 (2.845 − .199) to 3.044 (2.845 + .199) will include the true grade average for someone who studies 12.5 hours per week 68% of the time. Or more concisely, with p = .68, the value of y is between 2.646 and 3.044.
Capturing the true value only 68% of the time leaves a great deal to chance. When confidence intervals are used in regression procedures, it is more common to calculate them so that they capture the true value 95% of the time. This is called a 95% confidence interval. The process for a .95 confidence interval for the grade average and number of hours studied problem is as follows:
$$CI = \mu \pm t(SE_{est}) \qquad \text{Formula 10.5}$$

Where
CI = the confidence interval for the regression solution
t = a critical value of t for n − 2 df for p = .05 (for a .99 confidence interval, it is the value for p = .01)
SEest = the standard error of the estimate according to Formula 10.4
μ = the mean value for the criterion variable
So, for the problem predicting grade average from the number of hours studied, a .95 confidence interval will be as follows:
With SEest = .199,
t(df = 18) = 2.101, and
μ = 2.845

$$CI = \pm t_{n-2}(SE_{est}) + \hat{y} = \pm 2.101(.199) + 2.845 = 3.263,\ 2.427$$
To be .95 confident of having captured the true grade average for someone who studies 12.5 hours per week, the range for possible grades needs to be from 3.263 down to 2.427. This wide confidence interval stretches from what is ordinarily a grade of C, to a substantial B. It is a wider interval than if we were satisfied with .9 confidence, and not as wide as if we adopted .99 confidence. Several factors affect the width of the confidence interval:
• the level of confidence,
• the sample size, which affects both the amount of variability in y and the critical value of t, and
• the strength of the correlation.
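The 68% range and the .95 confidence interval differ only in the multiplier applied to SEest, which a small helper makes explicit. This is a sketch under the values given in the text; the function name `interval` is hypothetical, and the critical t is supplied by hand rather than computed.

```python
def interval(center, t_crit, se_est):
    """Return (lower, upper) for center +/- t_crit * se_est (Formula 10.5)."""
    margin = t_crit * se_est
    return center - margin, center + margin


PREDICTED = 2.845   # predicted grade average for 12.5 study hours
SE_EST = 0.199      # standard error of the estimate
T_95 = 2.101        # critical t for df = 18, p = .05 (from the text)

low68, high68 = interval(PREDICTED, 1.0, SE_EST)   # plus or minus 1 SEest
low95, high95 = interval(PREDICTED, T_95, SE_EST)  # .95 confidence interval

print(f"about 68%: {low68:.3f} to {high68:.3f}")
print(f"95%:       {low95:.3f} to {high95:.3f}")
```

Passing a larger critical value (for a higher level of confidence) or a larger SEest (for a weaker correlation) widens the interval, which is the pattern the list above describes.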
Try It!
D. What are the factors in the width of a confidence interval?
Another Regression Problem
A second regression problem provides an opportunity to calculate the best prediction and then a confidence interval for that predicted value. A social worker is responsible for encouraging people who are indigent to advance their educations in order to improve their living conditions. To demonstrate to clients that schooling affects income, the social worker gathers the following data for a group of 12 people:
Person    Years of Education    Income in Thousands
1         10                    23.3
2         12                    27
3         12                    30.5
4         14                    34
5         14                    45
6         16                    55
7         16                    57.5
8         16                    62
9         16                    68
10        18                    70
11        18                    85
12        18                    90
One of the clients who is a high school graduate (12 years of education) asks, “If I attend the community college and complete a two-year certification program, what is my income likely to be?” The social worker proceeds to answer that question.
                       Years of Education (x)    Income in Thousands (y)
Means                  15.0                      53.942
Standard deviations    2.629                     22.342
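The correlation:

$$r_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$

$$r_{xy} = \frac{12(10{,}319) - (180)(647.3)}{\sqrt{\left[12(2{,}776) - (180)^2\right]\left[12(40{,}407.39) - (647.3)^2\right]}}$$

$$r_{xy} = \frac{7{,}314}{\sqrt{(912)(65{,}891.39)}} = .944$$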
The slope (regression coefficient):

$$b = r_{xy}\left(\frac{s_y}{s_x}\right) = .944\left(\frac{22.342}{2.629}\right) = 8.022$$

The intercept (regression constant):

$$a = M_y - bM_x = 53.942 - (8.022)(15.0) = -66.388$$
Now to solve the equation for 14 years of education (12 plus the 2 years of community college education),
$$\hat{y} = a + bx = -66.388 + (8.022)(14) = 45.92$$
The best prediction that the social worker can make with this data set is that the individual is likely to make about 45.92 thousand dollars ($45,920) after completing the additional schooling. If the social worker wants a confidence interval for that answer, the first step is to determine the standard error of the estimate:
$$SE_{est} = s_y\sqrt{1 - r_{xy}^2} = 22.342\sqrt{1 - .944^2} = 7.372$$

And then to determine the prediction interval:

$$PI = \pm t(SE_{est}) + \hat{y} = \pm 2.228(7.372) + 45.92 = 62.345,\ 29.495$$
With all the variability in income (it ranges from $23,300 to $90,000 for this sample), the prediction interval cannot be very precise. For this data set, with p = .95 or 95% confidence, completing the community college program is predicted to result in an income somewhere between $29,495 and $62,345.
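The whole education and income solution can be traced in code, following the same order as the longhand work: the raw-score correlation formula, then the slope, intercept, prediction, standard error of the estimate, and prediction interval. This is a sketch using only the standard library; the critical t (2.228) is taken from the text, and the variable names are illustrative.

```python
import math
import statistics

years = [10, 12, 12, 14, 14, 16, 16, 16, 16, 18, 18, 18]          # x
income = [23.3, 27, 30.5, 34, 45, 55, 57.5, 62, 68, 70, 85, 90]   # y, in thousands

n = len(years)
sum_x, sum_y = sum(years), sum(income)
sum_xy = sum(x * y for x, y in zip(years, income))
sum_x2 = sum(x * x for x in years)
sum_y2 = sum(y * y for y in income)

# Raw-score formula for the Pearson correlation
numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r_xy = numerator / denominator

s_x, s_y = statistics.stdev(years), statistics.stdev(income)  # sample SDs (n - 1)
slope = r_xy * (s_y / s_x)                   # b = r_xy * (s_y / s_x)
intercept = sum_y / n - slope * (sum_x / n)  # a = M_y - b * M_x

y_hat = intercept + slope * 14               # 12 years plus 2 years of college
se_est = s_y * math.sqrt(1 - r_xy ** 2)
T_CRIT = 2.228                               # t for n - 2 = 10 df at p = .05, from the text
lower, upper = y_hat - T_CRIT * se_est, y_hat + T_CRIT * se_est

print(f"r = {r_xy:.3f}, b = {slope:.3f}, a = {intercept:.3f}")
print(f"y-hat = {y_hat:.2f}, SEest = {se_est:.3f}, PI = {lower:.2f} to {upper:.2f}")
```

The code does not round the correlation to .944 before squaring it, so its standard error and interval limits come out slightly different from the longhand values; the differences are rounding effects, not a different method.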
Try It!
E. If additional data in a regression problem produces a higher correlation between x and y, what will be the impact on a confidence interval for the solution?
Apply It!
Regression Techniques at a Power Plant
The engineer at a power plant wishes to develop a method to accurately determine the flow rate of cooling water to the condenser at any time. Currently, the engineer must hire an outside consulting firm to measure the flow rate using ultrasonic test equipment. Although extremely accurate, this method is time-consuming and expensive. The engineer knows that water flow rates through a pipe are proportional to the square root of the pressure drop across a restriction. The engineer would like to measure the pressure drop across a naturally occurring restriction and use this to calculate the flow rate. To determine the slope of the equation that relates these two variables, she uses linear regression techniques.
In this example, the engineer already knows that these two variables are correlated, that there is a linear relationship between them and that the intercept is 0 because the differential pressure measurement is 0 when the flow rate is 0. She will use regression techniques to find the slope of the line.
First, the engineer installs a differential pressure monitor across a restriction. She then measures the pressure drop while the ultrasonic measurement device measures flow rate. She conducts three different tests with either one, two, or all three of the circulating water pumps in operation. Test results showing values for the square root of the pressure drop (predictor) and flow rate (criterion) are shown below. The pressure drop is measured in inches of water, and the flow rate is measured in gallons per minute.
Square Root of Pressure Drop    Flow Rate
0                               0
3.75                            49,000
6.75                            95,000
9.53                            132,000
Although this is a small number of data points, the engineer already knows that this is a linear relationship and that the two variables are correlated. Rather than determining the strength of correlation, she is more interested in the slope between the two values.
As expected, the correlation coefficient of rxy = 0.9995 is very nearly equal to 1.
She finds the slope of the line from the correlation coefficient and the standard deviations of the two variables.

$$b = r_{xy}\left(\frac{s_y}{s_x}\right) = .9995\left(\frac{57{,}172}{4.088}\right) = 13{,}977$$
She then uses this value to develop a mathematical model for the flow rate in gallons per minute based on the square root of the differential pressure.
$$\hat{y} = 0 + (13{,}977)x$$

For example, if the engineer records 42.5 inches of water pressure (square root = 6.52, or 6.5192 before rounding), she knows the cooling water flow rate is

$$\hat{y} = (13{,}977)(6.5192) = 91{,}119 \text{ gallons per minute}$$
The regression analytical tool has allowed the engineer to determine a difficult-to-measure variable (flow rate) based on an easily measured variable (the square root of the differential pressure). This is possible because the two variables are correlated.
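The engineer's model can be rebuilt from the four test points. The sketch below follows the text in fixing the intercept at 0 and taking the slope from the correlation and the two standard deviations; the function name `predicted_flow` is illustrative.

```python
import math
import statistics

sqrt_pressure = [0, 3.75, 6.75, 9.53]      # square root of pressure drop (predictor)
flow_rate = [0, 49000, 95000, 132000]      # gallons per minute (criterion)

n = len(sqrt_pressure)
mean_x, mean_y = statistics.mean(sqrt_pressure), statistics.mean(flow_rate)
s_x, s_y = statistics.stdev(sqrt_pressure), statistics.stdev(flow_rate)

# Pearson correlation from deviation scores, then b = r_xy * (s_y / s_x)
cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(sqrt_pressure, flow_rate))
r_xy = cross / ((n - 1) * s_x * s_y)
slope = r_xy * (s_y / s_x)


def predicted_flow(pressure_drop_inches):
    """Flow in gallons per minute, with the intercept fixed at 0 as in the text."""
    return slope * math.sqrt(pressure_drop_inches)


print(f"s_x = {s_x:.3f}, s_y = {s_y:.0f}, r = {r_xy:.4f}, b = {slope:.0f}")
print(f"flow at 42.5 inches = {predicted_flow(42.5):.0f} gallons per minute")
```

The function takes the pressure drop in inches of water and applies the square root itself, so the reading from the differential pressure monitor can be passed in directly.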
Apply It! boxes written by Shawn Murphy.
10.3 Regression With Excel
The regression procedure in Excel provides useful output. Using the previous problem predicting the income from education, the procedure is as follows:
1. Arrange the data in columns, A for years of education and B for income. Enter the labels “years” and “income” in cells A1 and B1, respectively.
2. Select the Data tab at the top of the page and then Data Analysis at the far right just below the page tabs.
3. In the Analysis Tools window, scroll down to Regression and click OK.
4. Click the Input Y Range window and drag the cursor from B2 to B13.
5. Click the Input X Range window and drag the cursor from A2 to A13.
6. Click the Output Range window and enter something like A15 so that the output doesn’t overwrite the data. Click OK.
The result is shown in Figure 10.5.
Figure 10.5: Using Excel to predict income from years of education
Column A has been expanded in Figure 10.5 to make it easier to read the output. The regression statistics in Excel are as follows:
• Multiple R is the same correlation value that was calculated in the longhand solution. Excel calls it Multiple R, which is actually the name for the correlation in procedures involving more than one predictor.
• The R² value is the square of the Pearson correlation, indicating the proportion of variance in the criterion (y) explained by the predictor (x).
• The Adjusted R² is the R² value reduced to allow for the risk of overfitting in small samples.
• The Standard Error value is the standard error of the estimate, but calculated using the Adjusted R Square value.
• The number of observations is n = 12.
• The ANOVA tests the assumption that there is no significant relationship between x and y (i.e., that the prediction model explains nothing). Note that the result is significant, F(1, 10) = 81.07, p = 4.12E-06 (p = .00000412).
• The last table in Figure 10.5 provides the regression solution. The intercept and slope values are similar to the longhand calculations; the differences are likely due to rounding. The standard error values for a and b were not calculated longhand, nor were the significance tests or confidence intervals for those individual values. The significance tests are redundant in the type of regression completed in this chapter because, if rxy is significant, x is a significant predictor of y. A t statistic is provided to indicate the significance of x’s influence on y; in this case, t(10) = 9.00, p = .00000412. This statistic becomes important when there are multiple predictors, because it indicates which of the predictors in the multiple regression significantly influence the criterion. This is demonstrated later in the chapter in SPSS Example 2.
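The summary quantities in the list above can be approximated outside Excel. The following sketch assumes the usual definitions of R Square, Adjusted R Square, the standard error, and the F ratio; it is not a description of Excel's internal code, and the variable names simply mirror the output labels.

```python
import math

years = [10, 12, 12, 14, 14, 16, 16, 16, 16, 18, 18, 18]          # x
income = [23.3, 27, 30.5, 34, 45, 55, 57.5, 62, 68, 70, 85, 90]   # y, in thousands

n = len(years)
k = 1  # number of predictors

mean_x, mean_y = sum(years) / n, sum(income) / n
ss_x = sum((x - mean_x) ** 2 for x in years)
ss_total = sum((y - mean_y) ** 2 for y in income)
ss_cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, income))

slope = ss_cross / ss_x
intercept = mean_y - slope * mean_x
ss_residual = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(years, income))
ss_regression = ss_total - ss_residual

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
standard_error = math.sqrt(ss_residual / (n - k - 1))
f_ratio = (ss_regression / k) / (ss_residual / (n - k - 1))
t_slope = math.sqrt(f_ratio)  # with one predictor, t squared equals F

print(f"Multiple R = {math.sqrt(r_square):.3f}, R Square = {r_square:.3f}")
print(f"Adjusted R Square = {adjusted_r_square:.3f}, Standard Error = {standard_error:.3f}")
print(f"F({k}, {n - k - 1}) = {f_ratio:.2f}, t({n - k - 1}) = {t_slope:.2f}")
```

The residual degrees of freedom are n − k − 1 = 10, which is why the F ratio and the t statistic for the slope share that value.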
Shrinkage and Overfitting the Sample
When samples are small and correlations are relatively weak, it is important to make an allowance for prediction error. Even when confidence intervals are narrow, there are risks involved in regression solutions. Because the solution is based on the available data, it is easy to develop a solution that fits a particular sample. However, data gathered later might not fit that solution as well. This problem is referred to as overfitting the sample. It is always a concern in regression, but it is most problematic when samples are small. Another term to describe this problem is shrinkage. Shrinkage is the degree to which the accuracy of a regression solution is diminished when it is used with new data.
The Requirements for Ordinary Least-Squares Regression
There are many different regression procedures. Each is adapted to a different set of circumstances. The requirements of bivariate ordinary least-squares regression are the following:
• The criterion variable involved must be interval or ratio scale. The predictors may be at least categorical (with two outcomes).
• The variables must be normally distributed in their populations.
• The relationship between the criterion variable and the predictor variable must be linear.
• There should be no linear relationship between the predictors; in other words, if they correlate too highly, then multicollinearity has occurred.
• The data must have similar amounts of variability throughout their ranges; that is, they must be homoscedastic.
Try It!
F. What does shrinkage mean in regression, and how can it be avoided?
10.4 A Conceptual Introduction to Multiple Regression
The regression discussion in this chapter has been confined to what is called simple or bivariate regression. It involves one predictor and one criterion variable. Multiple regression uses similar logic but employs more than one predictor. If multiple variables are correlated with a criterion variable, and if each of those multiple variables has something in common with the criterion variable that is unique, multiple predictors can often provide a more precise prediction of y than a single predictor. Here is the multiple regression equation with two predictors:

$$\hat{y} = a + b_1x_1 + b_2x_2$$

Although there is still an intercept value, a, now there are two b values, two slopes, one for each of the predictor variables, x1 and x2. The complication in multiple regression is that there has to be a way to control redundancy between predictors. If two predictors provide some of the same information about y, and variables that describe people inevitably do, there has to be a way to control the repetition so that the prediction is accurate. For that reason, the b values in multiple regression are partial unstandardized regression coefficients. Remember that the regression coefficient, or slope, indicates how much y changes each time x increases by 1.0. The partial regression coefficient answers the question, How much does y change when x1 increases by 1.0 and x2 is held constant, that is, when the effect of x2 is neutralized? With the second predictor, it is a similar question in reverse: How much does y change when x2 increases by 1.0 and x1 is held constant? This means that the regression coefficients for each predictor have to be calculated so that each indicates the contribution to y of the particular variable, exclusive of the effect of the other predictor or predictors. Although the processes involved are not difficult, they require more arithmetic and a new form of correlation called a multiple correlation. The multiple correlation coefficient, which is the correlation between a criterion variable and two or more predictor variables, gauges the strength of the correlation. Its symbol is the uppercase R, as seen in the Excel regression output. Similarly, R² is the multiple coefficient of determination, which indicates the proportion of variance in the criterion that can be explained by all predictors combined. The multiple coefficient of nondetermination is 1 − R², which is the unexplained variance or error.
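How partial coefficients remove the overlap between predictors can be sketched for the two-predictor case. The closed-form solution below is a standard least-squares result, stated here as a supplement, not as the chapter's procedure; the data are hypothetical and were built so that y = 1 + 2x1 + 3x2 exactly, which makes the answer easy to check.

```python
def two_predictor_regression(x1, x2, y):
    """Return (a, b1, b2) for y-hat = a + b1*x1 + b2*x2 by least squares."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n

    # Sums of squares and cross-products of deviation scores
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))

    # The s12 terms remove the overlap (redundancy) between the two predictors
    denom = s11 * s22 - s12 ** 2
    b1 = (s1y * s22 - s2y * s12) / denom
    b2 = (s2y * s11 - s1y * s12) / denom
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2


# Hypothetical data built so that y = 1 + 2*x1 + 3*x2 exactly
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 6]
y = [9, 8, 19, 18, 29]

a, b1, b2 = two_predictor_regression(x1, x2, y)
print(round(a, 6), round(b1, 6), round(b2, 6))
```

In these hypothetical data x1 and x2 are themselves correlated, so a bivariate regression of y on x1 alone would credit x1 with some of the effect of x2; the partial coefficients separate the two contributions.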
One other important distinction is between the unstandardized regression coefficients (symbolized b) and the standardized regression coefficients (symbolized β). The unstandardized partial coefficient b (the unstandardized beta value) is the one used for each predictor in the regression equation, ŷ = a + b1x1 + b2x2, where each predictor adds its own coefficient (or slope) to the equation in calculating ŷ. The standardized coefficient β (the standardized beta value) expresses the same relationship in standard deviation units. Each coefficient is also tested, using the standard error associated with its b value, to see whether it is significantly different from zero (similar to a correlation); this calculation produces a t-test value. If the t-test value is significant (p < .05), then the predictor makes a significant contribution to the overall regression/prediction model. In addition, because the β values are standardized (measured in standard deviation units), they can be compared across predictors to show the contribution of each predictor in influencing y. The higher the absolute value of β, the greater the influence on the criterion variable. In SPSS and other packages, both unstandardized and standardized beta values are calculated; the b value is labeled “B” and the β value is listed as “Beta,” as you will see in the coefficients tables in Figures 10.7 and 10.9 in the next section.
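The relationship between the two kinds of coefficients, and the test applied to each, can be written compactly. These are standard relationships offered as a supplement; the subscript j (for the jth predictor) and the symbol for the standard error of b are notation introduced for this sketch.

```latex
\beta_j = b_j \left( \frac{s_{x_j}}{s_y} \right),
\qquad
t_j = \frac{b_j}{SE_{b_j}}
```

With a single predictor, substituting b = rxy(sy/sx) shows that β reduces to rxy, so in a bivariate regression the standardized coefficient and the correlation coefficient are the same number.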
10.5 Presenting Results
Using a public data set from Pew Research (2011), the GenChange.sav file, both a bivariate and a multiple regression will be performed using SPSS.
SPSS Example 1: Steps for a Simple Regression Analysis
A data analyst wishes to test age as a predictor of happiness. Her hypothesis is that age is a significant predictor of happiness. If significance is found, the prediction model would allow the researcher to predict a person’s happiness based on his or her age.
To perform a simple regression analysis, go to Analyze → Regression → Linear. Input age into the Independent(s) box and q1 (happiness) into the Dependent box (your screen should look like that in Figure 10.6). Then click OK. Output tables are presented in Figure 10.7.
Figure 10.6: SPSS steps for a simple regression
Source: Data from Pew Research Social & Demographic Trends. (2011). General public survey on veterans & generational change. Retrieved from http://www.pewsocialtrends.org/category/datasets/
Figure 10.7: SPSS output of a simple regression
Source: Data from Pew Research Social & Demographic Trends. (2011). General public survey on veterans & generational change. Retrieved from http://www.pewsocialtrends.org/category/datasets/
Model Summary
Model    R         R Square    Adjusted R Square    Std. Error of the Estimate
1        0.083a    0.007       0.006                1.317
a. Predictors: (Constant), AGE. What is your age?

ANOVAa
Model 1       Sum of Squares    df      Mean Square    F         Sig.
Regression    24.334            1       24.334         14.036    0.000b
Residual      3469.024          2001    1.734
Total         3493.358          2002
a. Dependent Variable: Q1. Generally, how would you say things are these days in your life - would you say that you are very happy, pretty happy, or not too happy?
b. Predictors: (Constant), AGE. What is your age?

Coefficientsa
                          Unstandardized Coefficients    Standardized Coefficients
Model 1                   B        Std. Error            Beta                         t         Sig.
(Constant)                1.776    0.085                                              20.981    0.000
AGE. What is your age?    0.006    0.002                 0.083                        3.746     0.000
a. Dependent Variable: Q1. Generally, how would you say things are these days in your life - would you say that you are very happy, pretty happy, or not too happy?
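The entries in these output tables are tied together by simple arithmetic, which makes the output easy to sanity-check. The sketch below copies values from the ANOVA and Coefficients tables in Figure 10.7; the variable names are illustrative, and small discrepancies are expected because the tables are rounded to three decimals.

```python
# Values copied from the ANOVA and Coefficients tables in Figure 10.7
ss_regression, df_regression = 24.334, 1
ss_residual, df_residual = 3469.024, 2001
ss_total = 3493.358
t_age = 3.746

ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

r_square = ss_regression / ss_total      # proportion of variance explained
f_ratio = ms_regression / ms_residual    # F = MS regression / MS residual
r = r_square ** 0.5                      # with one predictor, R is the size of r_xy

print(f"R Square = {r_square:.3f}, R = {r:.3f}")
print(f"F = {f_ratio:.3f}, t squared = {t_age ** 2:.3f}")
```

With one predictor, the square of the t statistic for the slope should agree with F apart from rounding, which is why the two significance tests carry the same information in a simple regression.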
SPSS Example 2: Steps for a Multiple Regression Analysis
To perform a multiple regression analysis, go to Analyze → Regression → Linear. Input q2a (satisfaction with family life), q2b (satisfaction with finances), q2c (satisfaction with health), age, educ (education), and income into the Independent(s) box and q1 (happiness) into the Dependent box. Click on Statistics and check Estimates and Part and partial correlations (your screen should look like that in Figure 10.8). Click Continue and then OK. Output tables are presented in Figure 10.9.
Figure 10.8: SPSS steps for a multiple regression
Source: Data from Pew Research Social & Demographic Trends. (2011). General public survey on veterans & generational change. Retrieved from http://www.pewsocialtrends.org/category/datasets/
Figure 10.9: SPSS output of a multiple regression
Model Summary
Model    R         R Square    Adjusted R Square    Std. Error of the Estimate
1        0.383a    0.146       0.144                1.222
a. Predictors: (Constant), INCOME. Last year, that is in 2010, what was your total family income from all sources, before taxes? Just stop me when I get to the right category. [READ], Q2c. Please tell me whether you are satisfied or dissatisfied, on the whole, with the following aspects of your life. [Would you say you are very (dis)satisfied or SOMEWHAT dis(satisfied)]? - Your health, AGE. What is your age?, Q2b. Please tell me whether you are satisfied or dissatisfied, on the whole, with the following aspects of your life. [Would you say you are very (dis)satisfied or SOMEWHAT dis(satisfied)]? - Your personal financial situation, Q2a. Please tell me whether you are satisfied or dissatisfied, on the whole, with the following aspects of your life. [Would you say you are very (dis)satisfied or SOMEWHAT dis(satisfied)]? - Your family life, EDUC. What is the last grade or class you completed in school? (DO NOT READ)
b. Dependent Variable: Q1. Generally, how would you say things are these days in your life - would you say that you are very happy, pretty happy, or not too happy?

ANOVAa
Model 1       Sum of Squares    df      Mean Square    F         Sig.
Regression    511.550           6       85.258         57.071    0.000b
Residual      2981.808          1996    1.494
Total         3493.358          2002
a. Dependent Variable: Q1. Generally, how would you say things are these days in your life - would you say that you are very happy, pretty happy, or not too happy?
b. Predictors: (Constant), INCOME. Last year, that is in 2010, what was your total family income from all sources, before taxes? Just stop me when I get to the right category. [READ], Q2c. Please tell me whether you are satisfied or dissatisfied, on the whole, with the following aspects of your life. [Would you say you are very (dis)satisfied or SOMEWHAT dis(satisfied)]? - Your health, AGE. What is your age?, Q2b. Please tell me whether you are satisfied or dissatisfied, on the whole, with the following aspects of your life. [Would you say you are very (dis)satisfied or SOMEWHAT dis(satisfied)]? - Your personal financial situation, Q2a. Please tell me whether you are satisfied or dissatisfied, on the whole, with the following aspects of your life. [Would you say you are very (dis)satisfied or SOMEWHAT dis(satisfied)]? - Your family life, EDUC. What is the last grade or class you completed in school? (DO NOT READ)
Figure 10.9: SPSS output of a multiple regression (continued)
10.6 Interpreting Results
Refer to the most recent edition of the APA manual for specific detail on formatting statistics; Table 10.1 may be used as a quick guide in presenting the statistics covered in this chapter.
Table 10.1: Guide to APA formatting of statistics results
Abbreviation or Term Description
a y-intercept of the regression line
b Slope of the regression line as an unstandardized beta value
R Multiple correlation coefficient
R² Multiple coefficient of determination
β Standardized beta value of the predictor on the criterion
PI Prediction interval
Source: Publication Manual of the American Psychological Association, 6th edition. © 2009 American Psychological Association, pp. 119–122.
Coefficientsᵃ
                Unstandardized Coefficients   Standardized Coefficients                     Correlations
Model           B        Std. Error           Beta                        t        Sig.     Zero-order   Partial   Part
1 (Constant)    1.077    0.129                                            8.328    0.000
  Q2a.          0.159    0.020                0.178                       8.017    0.000    0.271        0.177     0.166
  Q2b.          0.209    0.020                0.234                       10.556   0.000    0.306        0.230     0.218
  Q2c.          0.078    0.022                0.081                       3.591    0.000    0.212        0.080     0.074
  AGE.          0.005    0.001                0.067                       3.151    0.002    0.083        0.070     0.065
  EDUC.         −0.015   0.018                −0.019                      −0.875   0.382    −0.089       −0.020    −0.018
  INCOME.       −0.012   0.011                −0.025                      −1.142   0.254    −0.079       −0.026    −0.024
a. Dependent Variable: Q1. Generally, how would you say things are these days in your life: would you say that you are very happy, pretty happy, or not too happy?
Using the results from SPSS Example 1, we present the results from Figure 10.7 in the following way:
• The Model Summary table, R = .083, indicates the correlation between the criterion (q1, happiness) and the predictor (age), whereas R² = .007 indicates the proportion of variance in the criterion (q1, happiness) explained by the predictor (age). The positive correlation means that as age increases, so does the level of happiness.
• The ANOVA table indicates that the overall regression model is statistically significant, F(1, 2001) = 14.04, p < .05.
• The Coefficients table indicates age as a significant predictor of q1 (happiness), β = .083, t(2001) = 3.746, p < .05. The regression model equation is, therefore, ŷ_q1 = 1.78 + .006(x_age).
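With a single predictor, the reported statistics are redundant in useful ways: R² is the square of the correlation, and the F ratio equals the square of the t value for the slope. The sketch below uses only the values reported in the bullets above to confirm that they agree within rounding, and then applies the regression equation to one age. The age of 40 and the function name are illustrative assumptions.

```python
# Statistics reported for SPSS Example 1 (one predictor: age).
r = 0.083        # correlation between age and q1
t_value = 3.746  # t for the age coefficient
f_value = 14.04  # F for the overall model

r_squared = r ** 2        # proportion of variance explained
t_squared = t_value ** 2  # equals F when there is one predictor

print(f"R squared = {r_squared:.3f}")
print(f"t squared = {t_squared:.2f} (reported F = {f_value})")


def predict_q1(age):
    """Apply the reported equation: y-hat = 1.78 + .006 * age."""
    return 1.78 + 0.006 * age


# Hypothetical respondent aged 40.
print(f"Predicted q1 at age 40 = {predict_q1(40):.2f}")
```

The small gap between t² and the reported F comes from rounding in the printed output.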
Using the results from SPSS Example 2, we present the results from Figure 10.9 in the following way:
• The Model Summary table, R = .383, indicates the correlation between the criterion (q1, happiness) and the predictors (q2a, q2b, q2c, age, educ, income), whereas R² = .146 indicates the proportion of variance in the criterion (q1, happiness) explained by the predictors (q2a, q2b, q2c, age, educ, income).
• The ANOVA table indicates that the regression model is statistically significant, F(6, 1996) = 57.07, p < .05.
• The Coefficients table indicates q2a as a significant predictor of q1 (happiness), β = .178, t(1996) = 8.02, p < .05; q2b as a significant predictor of q1 (happiness), β = .234, t(1996) = 10.56, p < .05; q2c as a significant predictor of q1 (happiness), β = .081, t(1996) = 3.59, p < .05; and age as a significant predictor of q1 (happiness), β = .067, t(1996) = 3.15, p < .05. There were two nonsignificant predictors: educ, β = −.019, t(1996) = −0.875, p = .382; and income, β = −.025, t(1996) = −1.142, p = .254. The next step would be to eliminate the nonsignificant predictors, since they add little to the overall regression model. Excluding educ and income, the regression equation is ŷ_q1 = 1.08 + .159(x_q2a) + .209(x_q2b) + .078(x_q2c) + .005(x_age).
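A multiple regression equation is used the same way as a simple one: multiply each predictor value by its coefficient and add the constant. The sketch below takes the unstandardized B values for the four significant predictors from the Coefficients table in Figure 10.9. The respondent's scores are hypothetical, since the response coding of the survey items is not shown here. In practice, the model would be refitted after dropping predictors, so the coefficients could shift slightly.

```python
# Unstandardized B values from the Coefficients table (Figure 10.9),
# restricted to the four significant predictors.
INTERCEPT = 1.077
B = {"q2a": 0.159, "q2b": 0.209, "q2c": 0.078, "age": 0.005}


def predict_q1(respondent):
    """Return the predicted q1 score: constant plus the sum of B * x."""
    return INTERCEPT + sum(B[name] * respondent[name] for name in B)


# Hypothetical respondent; these values are for illustration only.
respondent = {"q2a": 1, "q2b": 2, "q2c": 1, "age": 40}
predicted = predict_q1(respondent)
print(f"Predicted q1 = {predicted:.3f}")
```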
Summary
The correlation coefficient is an elegant statistic. Whenever separate measures have some quality in common, correlations indicate the strength of the relationship between them. Building on correlations, regression procedures capitalize on this by using what is contained in one measure to predict what the level of the other measure is likely to be (Objective 1). Because prediction is a part of all science and of virtually every social domain as well, regression has remarkably wide application. When variables are related but one is more difficult to measure than the other, the more accessible variable can be used to predict the more elusive variable (Objective 2).
There are many types of regression. Bivariate regression has one predictor variable and one predicted variable, the criterion variable. The math involved is called least-squares regression, or ordinary least-squares regression, and it reveals where to position a regression line so that the sum of the squared errors from a series of predictions has its lowest possible value. The line is a visual representation of the relationship between the variables. It allows the prediction of y from x and, because there is no assumption about which is the cause, also of x from y when that is helpful. The regression line is the best fit for the available data, but unless the correlation between the predictor and criterion variables is perfect, there will always be some error in the predicted value of y. The standard error of the estimate indicates the magnitude of the error and, when used in a confidence interval, shows how large the interval around ŷ must be in order to have confidence that the true value of y is within that range. The regression line is determined by two values, the constant and the slope, which together give the regression equation, ŷ = a + bx, where a = constant, b = slope, x = predictor value, and ŷ = the estimate of the criterion (Objectives 3 and 4).
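The least-squares idea can be made concrete with a few lines of code. The following is a minimal sketch on a small made-up data set, using only the standard library. The slope is the sum of cross-products of deviations divided by the sum of squared deviations of x, and the constant follows from the means. The data, function names, and the comparison line are all illustrative assumptions.

```python
from statistics import mean


def least_squares(x, y):
    """Return (a, b) for the line y-hat = a + b*x that minimizes squared error."""
    x_bar, y_bar = mean(x), mean(y)
    sum_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sum_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum_xy / sum_xx      # slope
    a = y_bar - b * x_bar    # constant (y-intercept)
    return a, b


def sum_squared_errors(x, y, a, b):
    """Sum of squared differences between actual and predicted y."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares(x, y)
print(f"y-hat = {a:.1f} + {b:.1f}x")
print(f"Prediction at x = 6: {a + b * 6:.1f}")

# Any other line produces a larger sum of squared errors.
best = sum_squared_errors(x, y, a, b)
other = sum_squared_errors(x, y, a, b + 0.1)
print(f"SSE for fitted line: {best:.2f}; for a steeper line: {other:.2f}")
```

Changing the slope or the constant in either direction raises the sum of squared errors, which is what "least squares" means.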
An extension of bivariate, or simple, regression is multiple regression, in which several predictors are involved. The same statistical principles apply, with the addition of predictor variables that may account for more of the variance in the criterion variable (Objective 5).
When regression solutions are tailored too closely to a data set, particularly a small data set, the solution is overfitted to the sample. This means there will be enough error in the values of a, b, and the positioning of the regression line that the solution will not predict as well for other data sets. Shrinkage refers to this reduction in the value of the regression solution as it is applied to new data. Finally, both simple and multiple regression are executed using Excel and SPSS, and the results are reported and interpreted in APA format (Objective 6).
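One common way to gauge likely shrinkage is the adjusted R², which discounts R² according to the sample size and the number of predictors. The summary above does not give a formula, so the sketch below is an assumption about how shrinkage might be estimated, not the chapter's own method. It uses the sums of squares from the multiple regression example and contrasts the actual sample with a hypothetical small one.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R squared for n cases and k predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# R squared from the multiple regression example (SS regression / SS total).
r_squared = 511.550 / 3493.358
n = 2002 + 1  # total degrees of freedom plus one
k = 6         # number of predictors

large_sample = adjusted_r_squared(r_squared, n, k)
print(f"R squared = {r_squared:.3f}, adjusted = {large_sample:.3f}")

# Hypothetical: the same R squared obtained from only 30 cases.
small_sample = adjusted_r_squared(r_squared, 30, k)
print(f"Adjusted R squared with n = 30: {small_sample:.3f}")
```

With about 2,000 cases the adjustment is slight. With 30 cases and six predictors the adjusted value falls below zero, which illustrates why small samples are especially prone to overfitting.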
Key Terms
beta values A quantification of the relationship between a predictor and the criterion. The larger the absolute value of the beta value, the greater the influence the predictor has on the criterion. There can be unstandardized and standardized beta values.
criterion variable The variable for which the value is predicted in a regression procedure.
intercept The point where the regression line crosses the y-axis when the regression solution is plotted in a graph. Its value is the value of the criterion variable when the predictor variable x = 0.
least-squares criterion or ordinary least-squares regression A form of regression in which the sum of the squared prediction errors must have its lowest possible value.
multicollinearity High correlations among predictors in a regression, which violates an assumption of the procedure.
multiple correlation coefficient The correlation of multiple predictor variables with one criterion variable. The symbol is R.
overfitting the sample Occurs when the regression solution predicts less well for any other data than it predicts for the sample.
prediction interval A range around a predicted value, based on the standard error of the estimate, that indicates how reliably some future data value can be predicted.
predictor variable The variable used to predict the value of the criterion variable.
regression coefficient The value equal to the value of the slope.
regression line A line fitted through the data points in a scatterplot illustrating the relationship between predictors and criterion variables. It allows a prediction of the criterion from the predictor.
residual scores The differences between the actual and predicted values of the criterion variable when a number of predictions are completed.
shrinkage The degree to which the solution diminishes in accuracy when it is applied to new data sets.
simple or bivariate regression Regression with one predictor variable.
slope The steepness and direction of the regression line. It is the change in y produced by increasing x by 1.0.
standard error of the estimate (SEest) A measure of error in a regression solution. It is based on the strength of the correlation between the variables and the variability in the criterion variable.
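The definition of the standard error of the estimate can be turned into a short calculation. The sketch below assumes the common textbook form SEest = s_y × √(1 − r²), which matches the two ingredients named in the definition, and it builds an interval around a predicted value using a normal (z) multiplier as a large-sample approximation in place of a tabled t value. All numbers and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def standard_error_of_estimate(s_y, r):
    """SEest from the criterion's standard deviation and the correlation."""
    return s_y * sqrt(1 - r ** 2)


def approximate_interval(y_hat, se_est, confidence=0.95):
    """Interval around y-hat using a z multiplier (large-sample approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * se_est
    return y_hat - margin, y_hat + margin


# Hypothetical values: criterion SD of 10, correlation of .60, prediction of 50.
se_est = standard_error_of_estimate(10, 0.60)
lower, upper = approximate_interval(50, se_est)
print(f"SEest = {se_est:.2f}")
print(f"Approximate 95% interval: {lower:.2f} to {upper:.2f}")

# A stronger correlation gives a smaller error and a narrower interval.
print(f"SEest with r = .90: {standard_error_of_estimate(10, 0.90):.2f}")
```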
Chapter Exercises
Answers to Try It! Questions
The answers to all Try It! questions introduced in this chapter are provided below.
A. The size of the correlation matters because the larger it is, the more the two variables have in common, and the more accurately the value of one can be predicted from the value of the other.
B. A weak correlation is indicated by extensive scatter among the points that make up the scatterplot and by their distance from the regression line, which reflects the degree of error, or residual.
C. A negative correlation is reflected in a slope that declines from left to right in the graph. The value of b, the regression coefficient, will be negative.
D. The factors in the width, or size, of a confidence interval are the strength of the correlation, the variability in the criterion variable, the sample size, and the level of confidence.
E. A higher correlation between x and y will result in a narrower confidence interval for the solution. A higher correlation results in more precision.
F. In regression, shrinkage means that a regression solution does not fit subsequent data sets as well as it fits the sample for which it was initially calculated. The best way to avoid shrinkage is to ensure that the sample reflects the characteristics of the population. This means large, randomly selected samples.
Review Questions
The answers to the odd-numbered items can be found in the answers appendix.
The table immediately below is a correlation matrix. It’s an economical way to show the correlations among several variables. The correlation between probsolv (problem solving) and analytic (analytical ability), for example, is determined by moving down the left column for one variable, across the top for the other variable, and then finding the value at the point where the two meet. At the intersection of probsolv down the left and analytic across the top, the value is .726. The correlation of probsolv and analytic is r = .726.
Use the following information to answer Questions 1–4.
Correlation Matrix
            Problem Solving   Analytic   Comprehension   Reasoning   Computation   Vocabulary
probsolv    1.0               .726       .833            .598        .919          .714
analytic    .726              1.0        .767            .857        .734          .894
comprehen   .833              .767       1.0             .686        .736          .740
reasoning   .598              .857       .686            1.0         .534          .852
computat    .919              .734       .736            .534        1.0           .675
vocab       .714              .894       .740            .852        .675          1.0
Note: All correlations are statistically significant.
Descriptive Statistics
Test Mean Standard Deviation
Problem solving 43.000 8.441
Analytic 46.500 9.317
Comprehension 46.500 8.893
Reasoning 48.000 6.144
Computation 52.750 7.502
Vocabulary 54.850 5.250
1. Noting the correlation between problem solving and the computation score, what computation score can be predicted for a student whose problem solving score is 49?
a. How much will the computation score increase for every 1.0 increase in the problem solving score?
b. What value will the computation score have if the problem solving score is 0?
c. In terms of regression solutions, why is the value of the computation score relevant when the problem solving score is 0?
2. What is the standard error of the estimate for the solution to Question 1?
3. Calculate a .99 confidence interval for the solution to Question 1. Assume n = 52.
a. What is the confidence interval expected to contain?
b. On average, how often will the assumption referred to in Question 3a be wrong?
c. What could a researcher do to shrink the confidence interval?
4. Referring to the matrix at the beginning of the Review Questions, what variable will provide the best prediction of comprehension scores? Explain.
5. What impact does a negative correlation between x and y have on the slope of the regression line?
6. What are the factors that determine error in a regression prediction?
Analyzing the Research
Review the article abstract provided below. You can then access the full article via your university’s online library portal to answer the critical thinking questions. Answers can be found in the answers appendix.
Using a Pearson Correlation in a Health-Related Quality of Life Study
Bize, R., & Plotnikoff, R. C. (2009). The relationship between a short measure of health status and physical activity in a workplace population. Psychology, Health & Medicine, 14(1), 53–61.
Article Abstract
Many interventions promoting physical activity (PA) are effective in preventing disease onset, and although studies have found a positive relationship between health-related quality of life (HRQL) and PA, most of these studies have focused on older adults and those with chronic conditions. Less is known regarding the association between PA level and HRQL among healthy adults. Our objective was to analyze the relationship between PA level and HRQL among a sample of 573 employees aged 20–68 taking part in a workplace intervention to promote PA. Measures included HRQL (using a single item) and PA (i.e. Godin Leisure-Time Questionnaire). The Modified Canadian Aerobic Fitness Test (MCAFT) was also completed by 10% of the employees. MET-minute scores (assessing energy expenditure over one week) were compared across HRQL categories using ANOVA. A multiple linear regression analysis was conducted to further examine the relationship between HRQL and PA, controlling for potential covariates. Participants in the higher health status categories were found to report higher levels of energy expenditure (one-way ANOVA, p < 0.001). In the multiple linear regression model, each unit increase in health status level translated in a mean increase of 356 MET-minutes in energy expenditure (p < 0.001). This single-item assessment of health status explained six percent of the variance in energy expenditure. The study concludes that higher energy expenditure through PA among an adult workplace population is positively associated with increased health status, and it also suggests that a single-item HRQL measure is suitable for community- and population-based studies, reducing response burden and research costs.
Critical Thinking Questions
1. From the study, what are the degrees of freedom in the simple regression where the sample size is 887?
2. Why was a multiple regression analysis conducted to model the relationship between HRQL and PA?
3. What are the predictors for the multiple regression?
4. What percentage of the variance in the criterion can be explained by the predictors for the multiple regression? What is the multiple coefficient of nondetermination value?