9 Linear Regression
Chapter Learning Objectives
After reading this chapter, you should be able to do the following:
1. Explain the relationship between correlation and regression.
2. Describe the regression line in least-squares regression.
3. Estimate a predictor-based criterion value using regression.
4. Explain multiple regression.
tan82773_09_ch09_263-294.indd 263 3/3/16 1:01 PM
© 2016 Bridgepoint Education, Inc. All rights reserved. Not for resale or redistribution.
Section 9.1 Regression and Correlation
Introduction
Regression is a powerful analytical tool that in its simplest form uses the relationship between two variables to predict one from the other. People often make regression-like predictions. The presence of clouds in the morning sky prompts us to take an umbrella to work, for example, or a phone call from unexpected guests results in extra food prepared for a meal. The concepts in this chapter follow the same thinking, except that the predictions are mathematical.
Social scientists rely on regression in virtually every advanced statistical procedure. Most of those high-end statistical techniques (such as multivariate analysis of variance, discriminant-function analysis, and structural-equations modeling) are beyond the scope of an introductory text, but regression analysis is an essential part of the preparation for each of them. In the meantime, regression has value in its own right, as a mathematical process that uses the relationship between two variables to predict the value of one from the value of the other.
9.1 Regression and Correlation
Chapter 8 made the point that when two variables are correlated, it is because they share information. For example, if intelligence correlates with reading comprehension, it is because to some degree each measures a common characteristic. The more highly they are correlated, the more the two characteristics have in common, which is what the coefficient of determination ($r_{xy}^2$) indicates. It reveals the proportion of the variability in one variable that can be explained by the other. If intelligence (x) and reading comprehension (y) are correlated at, say, $r_{xy} = 0.8$, then $r_{xy}^2 = 0.64$: 64% of whatever reading comprehension measures can be explained by variations in intelligence.
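The arithmetic behind the coefficient of determination can be sketched in a few lines of Python. This is only an illustration: the function name is made up for the example, and the correlations other than 0.8 are hypothetical values added to show how quickly the shared proportion shrinks as the correlation weakens.

```python
def coefficient_of_determination(r_xy):
    """Square a correlation coefficient to get the shared proportion of variability."""
    return r_xy ** 2


# 0.8 is the value used in the text; 0.5 and 0.2 are hypothetical comparisons.
for r_xy in (0.8, 0.5, 0.2):
    r_squared = coefficient_of_determination(r_xy)
    print(f"r = {r_xy:.1f}  ->  r squared = {r_squared:.2f} ({r_squared:.0%} explained)")
```

A correlation must be fairly strong before one variable explains most of the variability in the other, which is why the strength of the correlation matters for prediction.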
If correlated variables share information and we have information about the value of one of those variables, we should be able to make a better-than-chance prediction of the corresponding value of the other.
• If age and height are correlated for teenagers, and we know how old a subject is, we should be able to make a better-than-chance prediction of the individual’s height. Conversely, if we know a teen’s height, we ought to be able to predict age.
• If education and income are correlated, and we know how many years of schooling a subject has had, we should be able to make a reasonable prediction of that person’s income.
• If the length of soldiers’ exposure to combat correlates with their manifestation of post-traumatic stress disorder (PTSD), we can predict the severity of PTSD from the length of combat exposure.
Regression allows us to address issues such as these mathematically. The concept is not new. Karl Gauss, the same mathematician who defined the characteristics of the normal (Gaussian) distribution, began developing the procedures behind regression in the early part of the 19th century. Many others have also contributed. Collectively, their work has allowed experts in a variety of fields to use regression procedures in their decision-making for many years.
• Economists gather data on unemployment rates, wholesale inventories, and consumer spending in order to predict the rate at which the economy will grow. This approach is effective because each of those variables correlates with economic expansion.
• Meteorologists use changes in barometric pressure to predict weather. Because drops in barometric pressure are predictors of violent storms in the Great Plains states and in the southeastern part of the United States, meteorologists watch particularly for dramatic drops in air pressure.
• Sports oddsmakers rely on data such as a team’s past performance, injuries to key players, and the quality of the opponent to predict game outcomes.
• Psychologists use genetic and social factors, including the history of alcohol abuse in a family, to predict an individual’s predisposition to abuse drugs.
Each of these scenarios is possible because of correlations between variables. Correlations pave the way for prediction. The point is not that the people in the examples necessarily sit down with mathematical models to calculate the probability that certain results will emerge, but they could. In fact, a review of the professional literature provides ample evidence that scholars perform analyses like these frequently. They are important because prediction allows those who must act to be proactive. Rather than waiting for some important condition to emerge, affected parties can anticipate its timing with some precision, and then take appropriate action. Prediction is the basis for sound decision-making.
The Language of Regression
Many types of regression procedures are employed; although this chapter concerns itself with just one, the concepts and much of the language used here are common to the different approaches. In the Statistical Package for the Social Sciences (SPSS), one of the most popular computer programs for statistical analysis, the variable to be predicted is called the dependent variable. The variable used to make the prediction is called the independent variable.
Those terms are common enough in statistics, but in regression discussions, the words independent and dependent run the risk of suggesting a causal relationship between the antecedent variables and that dependent variable. Although we also used this language with t tests and ANOVA, the risk is greater in correlation discussions because the discussion begins by assuming a relationship between the variables. To avoid this slippery slope, we will make an adjustment in discussing regression. Rather than the terms “dependent variable” and “independent variable,” as common as those terms are elsewhere, we will refer to the variable to be predicted in a regression procedure as the criterion variable, and the variable used to make the prediction as the predictor variable. We adopt this language to minimize the risk of confusing correlations with causal relationships.
Try It!: #1 From the standpoint of making a prediction, why does the strength of the correlation between the two variables involved matter?
Meteorologists use regression procedures to predict the occurrence of violent storms.
This does not mean that no causal relationship exists; just such a connection may be at work. In fact, the relationship between exposure to combat and the development of post-traumatic stress, for example, probably is causal. The point is that the correlation alone (and correlation is the foundation for pursuing regression) is not usually sufficient by itself to establish causality.
Although this chapter uses the terms criterion and predictor here for descriptive purposes, some shorthand indicators are needed as well. The symbols used in regression are the same as those used with the Pearson correlation in Chapter 8: x and y for the correlated variables. Here, x symbolizes the predictor variable and y symbolizes the criterion variable.
Choosing the Predictor
The confusion that can occur when equating correlation with cause increases when we recognize that either variable in a significant correlation can be used to predict the other. If a correlation exists between the degree of post-traumatic stress disorder (PTSD) and the length of exposure to combat, it means that each variable is equally related to the other. A researcher might, for instance, predict the degree of PTSD from the length of combat exposure or predict the converse relationship: the length of combat exposure from the degree of PTSD.
Either variable in a statistically significant correlation can predict the other. From the point of view of the mathematics involved, which variable predicts which does not matter, although practical considerations may dictate the predictor and the criterion. Sometimes one of the variables will prove more elusive than the other because the data involved are more difficult to gather. In such cases, the difficulty involved may require that the more accessible variable becomes the predictor and that the less available variable be predicted rather than gathered.
If reading comprehension scores are significantly correlated with intelligence scores, and someone wishes to predict the value of one from the other, using reading scores as the predictor variable makes sense. Reading scores are more accessible than intelligence scores. Most major intelligence tests must be administered to one subject at a time by someone trained to use the instrument. This process makes gathering intelligence scores expensive and time-consuming. Reading tests, on the other hand, can be group administered and usually require little training.
Other factors must be considered when determining predictors and criteria. Perhaps the scores from the college-aptitude tests students take in high school are correlated with the grades that students earn during their first year of college study. From the standpoint of the correlation, scores can be predicted from grades quite as readily as grades can be predicted from scores, but it will generally be the students’ future performance, rather than their past performance, that will be of interest.
Picturing Regression
Chapter 8’s scatterplot illustrated the correlation between verbal ability and intelligence. In the graph, each point represented one subject’s scores on two variables. When variables are highly correlated, the points reflect an inclining or declining line from left to right in the scatterplot, depending upon whether the correlation is positive or negative. Little “scatter” along the line indicates high correlation between variables.
When scatterplots are applied to regression, the predictor variable scores (x) are plotted on the horizontal axis, the criterion variable scores (y) on the vertical axis. Perhaps a researcher randomly selects a group of 20 students at the end of their first term of study at a large university and gathers from them two types of data: (a) the number of hours per week they typically study and (b) their recorded grade averages at the end of the first term. The researcher wants to determine how well the number of hours the student studies per week will predict the student’s grades. This means that the hours studied is the predictor variable, x, and grade average is the criterion variable, y. Table 9.1 lists the data.
These data can be used to create another scatterplot like the one in Chapter 8. If the data are placed in columns in an Excel spreadsheet just as they are here (the left-hand numbering occurs automatically in Excel and is not considered one of the data columns), the commands for creating the scatterplot are Insert and then Scatter. The resulting scatterplot (with added labels and title) is Figure 9.1.
Figure 9.1: A scatterplot for the relationship between study time and grades
[Scatterplot. Horizontal axis: hours studied per week (0 to 25). Vertical axis: first-term grades on a 4.0 scale (0 to 4.5).]

Table 9.1: Study data for hours studied and grade average

Subject   Hours studied (x)   Grade average (y)
1         1                   1.5
2         2                   1.8
3         3                   2.2
4         3                   2.0
5         5                   2.0
6         5                   2.1
7         7                   2.4
8         7                   2.2
9         8                   2.4
10        10                  2.7
11        10                  2.6
12        11                  2.9
13        13                  3.0
14        15                  3.0
15        16                  3.1
16        16                  2.7
17        16                  3.3
18        17                  3.0
19        18                  3.4
20        20                  4.0

To plot the data manually, draw the vertical and horizontal axes of the graph. Mark equal intervals on the horizontal axis for increasing hours studied and along the vertical axis for increasing grade averages. For the first subject, identify one hour studied on the horizontal axis and move vertically to GPA = 1.5. Mark the point. With points plotted for all subjects, three conclusions emerge:
• The number of hours studied and the students’ first-term grades appear correlated. If this were not the case, the dots would have no particular pattern.
• The correlation is substantial; the points vary little from an imaginary line drawn from lower left to upper right. If we plotted the number of hours students spend playing video games against their grades, the pattern might instead run from upper left to lower right: more play, poorer grades.
• The relationship between x and y appears to be consistent and linear.
About Linearity
The third point about the linear relationship between x and y is particularly important. The type of regression discussed here assumes that the association between the predictor and criterion variables is linear. Because of the linearity assumption, we can draw a straight line through the data, attempting to remain as close to as many data points as possible while keeping it straight. Such a line might resemble the graph in Figure 9.2A.
The line through the data points is called a regression line. It is positioned so that it is as close as possible to as many of the 20 data points as it can be and still be a straight line. The regression line can be used to determine the value of y from a specified value of x. For example, someone using the graph can select any value of x along the horizontal axis, go vertically from that x value up to the line, and then move left horizontally from the line to the y axis. The value where the y axis is encountered will be the value of y for the specified x value, according to the data from these 20 students. Figure 9.2B shows how to use the regression line to determine y from x.
No one in the sample of 20 indicated 12.5 hours per week study time, but perhaps one of the researcher’s colleagues, aware of what kind of analysis is underway, asks, “I know someone who studies 12.5 hours every week. What corresponding grade average might we expect for that student?” If the regression line is positioned accurately, the researcher can locate 12.5 on the x axis, travel vertically up to the regression line and then move left to the y axis to determine that 12.5 hours per week of study time predicts a grade point average of about 2.9.
The researcher gathered data from only 20 students. The sample size is somewhat risky, but subjects were randomly selected, and sampling theory tells us that a randomly selected sample will differ from the population of all freshman students only by chance. In spite of the small sample size, perhaps the sampling error is minimal.
Regression can be used to help us understand the correlation between the amount of time students spend studying and the grades they earn.
Coping with Less-Than-Perfect Correlations
Besides the fact that the graph in Figure 9.2A does not provide very detailed markings (the researcher had to guess that 12.5 hours studied will produce a grade average of “about 2.9”), other factors affect prediction accuracy. The researcher cannot be precise about what grade a specified number of hours studied will predict because the correlation between the two variables is imperfect. Although a grade average of 2.9 might be the best possible prediction given these data, it is quite likely that the prediction will not be exact. For a particular student who studies 12.5 hours per week, a GPA of 2.8 or perhaps 3.0 may be a more accurate prediction.
Without yet calculating the correlation, the evidence for $r_{xy} < 1.0$ is the scatter in the data points. For example, note that three students reported studying the same 16 hours per week but ended the term with different grade averages. This result reflects the fact that grades are affected by more than just study time. The researcher has not accounted for differences in academic ability, class rigor, teaching quality, or a host of other variables. Those problems aside, as long as the correlation between the predictor and criterion variables is statistically significant, the predicted value of y from the value of x will be more accurate over time than a number of random predictions.

Figure 9.2A: Regression lines (A. A regression line through a scatterplot)
[Scatterplot with a straight line through the points. Horizontal axis: hours studied per week (0 to 25). Vertical axis: first-term grades on a 4.0 scale (0 to 5).]

Figure 9.2B: Using the regression line to determine y from x
[Same scatterplot and line, used to read a value of y from a chosen value of x. Horizontal axis: hours studied per week (0 to 25). Vertical axis: first-term grades on a 4.0 scale (0 to 5).]
Error in prediction is something we tolerate. No one sues the College Board if a student with a high SAT score performs poorly the first year in college. Viewers do not petition the television station to fire the meteorologist when the forecast high for the day is wrong by a couple of degrees. Error is inevitable when predictions must be based on imperfectly correlated variables, but we can at least have some measure of how extensive the error is likely to be. Later, the chapter will discuss how to calculate the amount of error and then use it to qualify the predicted value in a way that gauges prediction accuracy.
Understanding the Least-Squares Criterion
A scatterplot is a helpful way to introduce the idea of regression, but relying on the scattered points in a graph and a positioned line to predict one variable from the other is not practical. The regression line is a conceptual model for what a regression equation actually does. The equation meets what is called the least-squares criterion, the requirement that the regression line be positioned so that the sum of the squared prediction errors has the lowest possible value. In short, the equation for the regression solution minimizes prediction error.
Describing Prediction Error
Whenever correlations are imperfect, a regression solution will include some error. If we predict the corresponding value of y for a series of x values, and we know the actual value of y in each case, error is the difference between the criterion variable’s predicted value according to the regression equation and its actual value according to the data. To avoid confusing the various values, we will identify them as follows:
• x is the actual value of the predictor variable,
• y is the actual value of the criterion variable, and
• $y'$ (y prime) is the predicted value of the criterion variable.
Researchers often do not know the actual value of y (thus the value of regression procedures), but if they did, the difference between the actual and predicted values ($y - y'$) would indicate the error in a solution. The $y - y'$ difference is called a residual score.
Look at Figure 9.2A again. If the correlation between hours studied and end-of-term grades formed a straight line (that is, if the correlation between those variables was $r_{xy} = 1.0$), each value of x would result in just one corresponding value of y. Those three people who reported 16 hours per week study time would all have had the same grades at the end of the term.
But with scatter in the data points, establishing the regression line is not a matter of just connecting the dots. Because the regression line is a straight line and must be positioned so as to minimize the prediction errors, its placement involves some compromise. It is a “line of best fit,” not a “line of perfect fit.”
Try It!: #2 What is the visual evidence in a scatterplot for a weak correlation?
Using the Residual Scores
If someone were to rely on Figure 9.2A to make a series of predictions and then go through university records to determine what the actual grade averages were for those 20 students after their first term, any difference between predicted grades and actual grades would be a residual score.
If all of those residual scores were added up [$\sum(y - y')$], what would they total? Some residual scores would be positive (the actual value of y will be larger than the predicted value, $y'$) and some negative (y is smaller than $y'$). The positive and negative residual scores would cancel each other out and sum to 0; summing up residual errors does not reveal much about the amount of error in a series of regression solutions. However, if the residual scores are squared (which eliminates the negative residual scores), and then summed, the result better indicates the amount of error. When the regression line is positioned so that the sum of those squared values is as low as possible, the solution meets the least-squares criterion. Positioning the line so as to minimize error is the function of the regression equation and the reason that this particular form of regression is called ordinary least-squares regression.
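The behavior of the residual scores can be checked numerically with the Table 9.1 data. The sketch below is an illustration under stated assumptions: the helper names are invented for the example, the line is fitted with the standard closed-form least-squares solution (deviation cross-products divided by squared deviations of x), and the comparison line with intercept 1.2 and slope 0.2 is a hypothetical hand-drawn line chosen only for contrast.

```python
# Table 9.1 data: hours studied per week (x) and first-term grade average (y).
hours = [1, 2, 3, 3, 5, 5, 7, 7, 8, 10, 10, 11, 13, 15, 16, 16, 16, 17, 18, 20]
grades = [1.5, 1.8, 2.2, 2.0, 2.0, 2.1, 2.4, 2.2, 2.4, 2.7,
          2.6, 2.9, 3.0, 3.0, 3.1, 2.7, 3.3, 3.0, 3.4, 4.0]


def fit_least_squares(x, y):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_sq_x = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum_cross / sum_sq_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residual_scores(x, y, intercept, slope):
    """Return the residual score (actual y minus predicted y) for each pair."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def sum_of_squares(values):
    """Square each value and add the squares together."""
    return sum(v ** 2 for v in values)


a, b = fit_least_squares(hours, grades)
fitted = residual_scores(hours, grades, a, b)
# Hypothetical hand-drawn line (intercept 1.2, slope 0.2), used only for comparison.
hand_drawn = residual_scores(hours, grades, 1.2, 0.2)

print(f"sum of residuals, least-squares line: {sum(fitted):.6f}")
print(f"sum of squared residuals, least-squares line: {sum_of_squares(fitted):.3f}")
print(f"sum of squared residuals, hand-drawn line: {sum_of_squares(hand_drawn):.3f}")
```

For the least-squares line the raw residuals cancel to zero apart from floating-point rounding, so only the squared total says anything about the size of the error, and that total is smaller than for the comparison line.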
9.2 Ordinary Least-Squares Regression with One Predictor
Theoretically, a regression problem can have any number of predictors, $x_1, x_2, x_3, \ldots$. Having more than one predictor makes the procedure “multiple” regression. Toward the end of the chapter, we will describe how multiple regression works. For now, the chapter will focus on regression with just one predictor, sometimes called simple regression or bivariate regression because there are only two variables involved: a predictor variable and a criterion variable.
To position the regression line so as to meet the least-squares criterion requires answers to two questions:
1. Where does the regression line cross the y axis in the graph?
2. How much does the criterion variable (y) change when the predictor variable (x) increases by 1.0?
The answer to the first question establishes the regression line’s intercept, called that because it indicates the value of y where the regression line intercepts the y axis. The intercept is the value of y when $x = 0$. Look at Figure 9.2A again, and note that the regression line appears to cross the y axis at about $y = 1.2$. Following is an equation to calculate the intercept value, which will indicate how close the estimate is.
The second question above concerns the slope of the line. It indicates how much the regression line inclines or declines from left to right. Using the units in Figure 9.2A, we might estimate that whenever x increases by 5.0, from left to right, y increases by a little less than 1.0. Reducing it to the units used in the second question above, if x increases by 1.0, y increases by something less than 0.2.
The Regression Equation
The simple, or bivariate, regression equation has this form:

Formula 9.1

$$y' = a + bx + e$$

where

$y'$ = the predicted value of the criterion variable
$a$ = the intercept
$b$ = the slope of the regression line
$x$ = the value of the predictor variable
$e$ = prediction error
Formula 9.1 shows that for any value of the predictor (x), the predicted value of y is the value of the intercept (a), plus the slope times the predictor variable’s value (bx), plus error (e). Calculating the amount of error in a regression problem is a separate process, and the e will be dropped from the equation hereafter, resulting in $y' = a + bx$. The error symbol is in Formula 9.1 to remind us that absent a perfect correlation, prediction error is always present.
Before calculating a regression solution, we need to know the intercept value, a, and the slope, b. Each has its own equation, but the components of both are already familiar. First, the formula for the intercept is
Formula 9.2

$$a = M_y - bM_x$$

where

$a$ = the intercept
$b$ = the slope of the regression line
$M_y$ = the mean of the criterion variable, y
$M_x$ = the mean of the predictor variable, x
That is, the intercept value is the mean of the criterion variable minus the slope value times the mean of the predictor variable.
Because the intercept formula includes the slope of the regression line, or the regression coefficient, we need to start there. To determine the regression coefficient, b, use the following:
Formula 9.3

$$b = r_{xy}\left(\frac{s_y}{s_x}\right)$$

where

$b$ = the slope of the regression line
$r_{xy}$ = the correlation coefficient for the two variables
$s_y$ = the standard deviation of the criterion variable, y
$s_x$ = the standard deviation of the predictor variable, x
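As a side note, Formula 9.3 can be tied back to the deviation scores. The short derivation below is a sketch that assumes the Pearson correlation is written in its deviation-score form and that both standard deviations use the same divisor ($n - 1$); under those assumptions the divisors cancel and the slope reduces to the cross-products of the deviations divided by the squared deviations of x.

```latex
\begin{align*}
b &= r_{xy}\left(\frac{s_y}{s_x}\right) \\
  &= \frac{\sum (x - M_x)(y - M_y)}{\sqrt{\sum (x - M_x)^2 \, \sum (y - M_y)^2}}
     \cdot
     \frac{\sqrt{\sum (y - M_y)^2 / (n - 1)}}{\sqrt{\sum (x - M_x)^2 / (n - 1)}} \\
  &= \frac{\sum (x - M_x)(y - M_y)}{\sum (x - M_x)^2}
\end{align*}
```

Both forms give the same slope; Formula 9.3 is convenient when the correlation and the standard deviations have already been calculated.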
Calculating a Regression Solution
Using the study-time and term-grades data, we will calculate a regression solution for the individual who studied 12.5 hours per week. The graph suggested that such a person would probably have a term grade average of about 2.9. How accurate was that estimate? To answer, we will need to calculate the following:
• the means and standard deviations for x (hours studied) and y (grade average),
• the correlation of x and y,
• the slope of the line (the regression coefficient), b,
• the regression intercept, a, and
• the value of $y'$.
1. For the x and y variables, verify that $M_x = 10.15$, $s_x = 5.941$, $M_y = 2.615$, and $s_y = 0.613$.
2. Using Formula 8.2, calculate the correlation as follows:

$$r_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$

$$r_{xy} = \frac{20(596.3) - (203)(52.3)}{\sqrt{[20(2{,}731) - 203^2][20(143.91) - 52.3^2]}} = \frac{1{,}309.1}{\sqrt{(13{,}411)(142.91)}} = 0.946$$
Checking this value against the critical value from Table 8.5, $r_{xy\,0.05(18)} = 0.444$, indicates that the correlation is statistically significant.
To identify the correlation using Excel, remember that having set up the data in the two columns, the commands are Data > Data Analysis > Correlation.
3. The slope of the line is

$$b = r_{xy}\left(\frac{s_y}{s_x}\right) = 0.946\left(\frac{0.613}{5.941}\right) = 0.098$$
This value indicates that y increases 0.098 for every 1.0 increase in x. Earlier, we guessed that the slope of the line in the Figure 9.2 graphs might be about 0.2. The estimate was not very accurate.
4. The regression line intercept is

$$a = M_y - bM_x = 2.615 - (0.098)(10.15) = 1.620$$
This value indicates that if $x = 0$, then $y = 1.620$. Based on the visual best fit that we made to the graphs in Figures 9.2A and 9.2B, we guessed that the intercept would be about $y = 1.2$. So that estimate was not close either. In Figure 9.3, rather than estimating where the regression line would be positioned to establish a best fit, as we did in the 9.2 figures, Excel has completed the regression calculations and positioned the line. Note that the Excel values conform more closely to our calculations.
What grades would someone who studies 12.5 hours per week likely earn? From Figure 9.2A, we estimated about 2.9. Check this estimate by solving for y9, comparing the two solutions, and consulting Figure 9.3.
$$y' = a + bx = 1.620 + (0.098)(12.5) = 2.845$$
Based on the data from the 20 students, the grade average predicted for someone who studies 12.5 hours per week is 2.845. Interestingly, that value is not far from the earlier prediction.
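The whole calculation can be reproduced from the Table 9.1 data. The following Python sketch uses only the standard library and follows the chapter's steps (means and standard deviations, the correlation, Formula 9.3 for the slope, Formula 9.2 for the intercept, then the prediction). The variable names are assumptions made for the example, and the sample standard deviation with an $n - 1$ divisor is assumed because it matches the values quoted in step 1.

```python
import math
import statistics

# Table 9.1 data: hours studied per week (x) and first-term grade average (y).
hours = [1, 2, 3, 3, 5, 5, 7, 7, 8, 10, 10, 11, 13, 15, 16, 16, 16, 17, 18, 20]
grades = [1.5, 1.8, 2.2, 2.0, 2.0, 2.1, 2.4, 2.2, 2.4, 2.7,
          2.6, 2.9, 3.0, 3.0, 3.1, 2.7, 3.3, 3.0, 3.4, 4.0]
n = len(hours)

# Step 1: means and sample standard deviations (n - 1 divisor).
mean_x = statistics.mean(hours)
mean_y = statistics.mean(grades)
sd_x = statistics.stdev(hours)
sd_y = statistics.stdev(grades)

# Step 2: Pearson correlation from the raw-score sums.
sum_x = sum(hours)
sum_y = sum(grades)
sum_xy = sum(x * y for x, y in zip(hours, grades))
sum_x2 = sum(x * x for x in hours)
sum_y2 = sum(y * y for y in grades)
r_xy = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Step 3: slope (Formula 9.3). Step 4: intercept (Formula 9.2).
b = r_xy * (sd_y / sd_x)
a = mean_y - b * mean_x

# Predicted grade average for 12.5 hours of study per week.
predicted = a + b * 12.5

print(f"M_x = {mean_x:.2f}, s_x = {sd_x:.3f}, M_y = {mean_y:.3f}, s_y = {sd_y:.3f}")
print(f"r_xy = {r_xy:.3f}, b = {b:.3f}, a = {a:.3f}, predicted = {predicted:.3f}")
```

Because the script carries full precision through every step, its intercept and prediction can differ from the hand calculation in the third decimal place; the chapter rounds b to 0.098 before computing a.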
Practicing the Regression Solution
Now that the calculations of the slope (b) and the intercept (a) have been completed, it is a simple matter to solve for any other value of x. For example, what grades can be predicted for someone who seems to study incessantly, perhaps 30 hours a week? Based on the formula for the predicted value $y' = a + bx$, we can substitute in our values of a (1.620), b (0.098), and x (30) to find the following:

$$y' = a + bx = 1.620 + (0.098)(30) = 4.560$$
This result is interesting because grades are averaged on a four-point scale, meaning they can be no higher than 4.0, straight As. The regression procedure does not “know” there is an effective ceiling to how high grades can be. The linearity assumption, to which we referred earlier, is that the relationship continues higher and lower in either direction from the data gathered.
What grade average is predicted for someone who does not study at all? Is such a student likely to receive a grade average that is also 0?
$$y' = a + bx = 1.620 + (0.098)(0) = 1.620$$
Even with no time devoted to weekly study, a student is unlikely to have a GPA of zero. But this is an answer we already had. Remember that the intercept is defined as the value of y if $x = 0$. In terms of our analysis, this is equivalent to asking what the GPA value (the y variable) is for someone who studies zero hours (the x variable). To answer that question, we could have just reported the value of the intercept.
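The three predictions worked above can be wrapped in a small function that also flags the two cautions raised in this section. This is a sketch, not part of the chapter's procedure: the function and constant names are invented for the example, the coefficients are the rounded values from the text, and the observed range of 1 to 20 hours is read from Table 9.1.

```python
INTERCEPT = 1.620          # a, rounded value from the chapter's calculation
SLOPE = 0.098              # b, rounded value from the chapter's calculation
GRADE_CEILING = 4.0        # highest possible average on a four-point scale
OBSERVED_HOURS = (1, 20)   # smallest and largest x values in Table 9.1


def predict_grade(hours_per_week):
    """Return the predicted grade average, a + b * x."""
    return INTERCEPT + SLOPE * hours_per_week


for hours_per_week in (0, 12.5, 30):
    predicted = predict_grade(hours_per_week)
    notes = []
    if not OBSERVED_HOURS[0] <= hours_per_week <= OBSERVED_HOURS[1]:
        notes.append("outside the observed range of x")
    if predicted > GRADE_CEILING:
        notes.append("above the 4.0 ceiling")
    flag = f" ({'; '.join(notes)})" if notes else ""
    print(f"{hours_per_week} hours -> predicted grade average {predicted:.3f}{flag}")
```

The equation itself produces a value for any x; deciding whether a prediction falls within the range of the data, or within the possible range of the criterion, is left to the person using it.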
Determining the Error in a Regression Solution
The prediction of the grade average for the student who studies 12.5 hours per week during that first term of college was 2.845. It is the best prediction that can be made with the data that are available for the 20 students in the data set. Random sampling should keep any sampling error (any degree to which the sample is unlike the population of all freshman students at the university) small.
However, even if the data set included data for every freshman student, the answer still will not necessarily be precisely accurate for one individual. The regression equations allow the best prediction, but it is a generalization based on the group, which may or may not be exactly accurate for a particular student. No matter how large the sample, and regardless of how it is selected, some prediction error is inevitable as long as the correlation between predictor and criterion is less than 1.0.
This reality does not mean that the regression process is flawed, simply imperfect. We made a similar point about calculating the various standard-error statistics in the chapters about t test. The fact that the test statistic includes error does not imply that mistakes were made. Error indicates variability in the data for which the research cannot account. It is the same with regression procedures. The least-squares criterion explains that the equations are designed to minimize error but cannot eliminate it entirely.
For any data set that makes a large number of predictions, some of the predictions will err by being too high, some will be too low, and a few might be correct. The fact that all these prediction errors would sum to 0, ∑(y 2 y9) 5 0, is small consolation if we are making one prediction for one individual and the outcome is important. To know how much to trust the prediction, regression procedures include a way to estimate the amount of error.
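The claim that prediction errors sum to zero can be checked directly. The following is a minimal Python sketch, assuming a small set of (x, y) pairs invented purely for illustration (they are not the chapter's study-time data); the helper name `fit_line` is likewise hypothetical.

```python
# Hypothetical (x, y) pairs, invented for illustration only.
xs = [2.0, 4.0, 5.0, 7.0, 9.0, 12.0]
ys = [1.9, 2.1, 2.4, 2.3, 2.9, 3.2]

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y' = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))  # sum of cross products
    ssx = sum((x - mx) ** 2 for x in xs)                   # sum of squares for x
    b = sp / ssx
    a = my - b * mx
    return a, b

a, b = fit_line(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # each y minus its y'
residual_sum = sum(residuals)

print([round(e, 3) for e in residuals])  # individual errors are not zero
print(abs(residual_sum) < 1e-9)          # but they cancel, up to rounding
```

The individual residuals differ from zero even though their sum does not, which is why a separate measure of typical error size is needed.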
The Standard Error of the Estimate
Recall the standard error of the mean and the standard error of the difference that are calculated for the t tests. Both those statistics measure error variance in the other tests. Regression has a similar measure of error variance called the standard error of the estimate ($SE_{est}$). Theoretically, the standard error of the estimate is explained this way:
If a researcher calculates a very large number of regression solutions from a data set and for each solution determines the residual score, or the difference between the actual and predicted values of the criterion variable ($y - y'$), then the standard error of the estimate is the standard deviation of all those residual scores.
The only way that residual scores can be determined, of course, is if the researcher already has all the actual values of y. If the researcher had that information, what would be the point of using regression? The standard deviation of residual scores explains the standard error of the estimate, but it is not a guide to how the researcher calculates that statistic. Recall that in theory, the standard error of the mean (Chapter 4) is the standard deviation of all the sample means in the population. But that value was estimated by dividing the sample standard deviation by the square root of the number in the sample. The standard error of the estimate can be calculated in a similar way:
Formula 9.4
$$SE_{est} = s_y\sqrt{1 - r_{xy}^2}$$
where
$SE_{est}$ = the standard error of the estimate
$s_y$ = the standard deviation of the criterion (y) variable
$r_{xy}^2$ = the square of the correlation coefficient
For the hours-studied and grade-point-average problem, the correlation between study time and grade averages is $r_{xy} = 0.946$, and the standard deviation of the y variable (grade averages) is $s_y = 0.613$, so the standard error of the estimate is
$$SE_{est} = s_y\sqrt{1 - r_{xy}^2} = 0.613\sqrt{1 - 0.946^2} = 0.199$$
A large $SE_{est}$ value indicates substantial error in the prediction. Consider the factors affecting the size of the standard error of the estimate.
1. The $s_y$ value is the standard deviation of the variable to be predicted. Highly variable data sets result in large standard deviation values and, as a result, in large $SE_{est}$ values.
2. The $s_y$ value is a multiplier for the result of $\sqrt{1 - r_{xy}^2}$. The more highly x and y are correlated, the smaller this resulting value will be and, as a consequence, the smaller the $SE_{est}$ value.
The smallest $SE_{est}$ can be is 0, and the largest the value can be is the value of $s_y$. In either case, can you see why?
• If the correlation between predictor and criterion is perfect (that is, if $r_{xy} = 1.0$), the latter part of the term, $\sqrt{1 - r_{xy}^2}$, becomes $\sqrt{1 - 1}$ or $\sqrt{0}$; $s_y \times 0 = 0$.
• At the other extreme, if the correlation between predictor and criterion is 0 (no relationship at all), the latter part of the term becomes $\sqrt{1 - 0}$, or 1; $s_y \times 1 = s_y$.
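Formula 9.4 and its two limiting cases can be sketched in a few lines of Python. The function name `standard_error_of_estimate` is an illustrative choice, not something defined in the chapter; the inputs are the study-time values given above.

```python
import math

def standard_error_of_estimate(s_y, r_xy):
    """Formula 9.4: SEest = s_y * sqrt(1 - r_xy**2)."""
    return s_y * math.sqrt(1 - r_xy ** 2)

# Study-time example: s_y = 0.613, r_xy = 0.946
se_est = standard_error_of_estimate(s_y=0.613, r_xy=0.946)
print(round(se_est, 3))  # 0.199

# Holding s_y fixed, SEest shrinks from s_y toward 0 as the correlation strengthens.
for r in (0.0, 0.5, 0.9, 1.0):
    print(r, round(standard_error_of_estimate(0.613, r), 3))
```

At r = 0 the function returns $s_y$ itself, and at r = 1.0 it returns 0, matching the two bullet points above.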
Using the Standard Error of the Estimate
By itself, the standard error of the estimate is not that helpful. It is hard to know when the amount of error is comparatively large and when it is not. Remembering the relationship between a standard deviation and the normal distribution provides some guidance.
Recall that in a normal distribution, the area from one standard deviation below the mean to one standard deviation above the mean includes about two-thirds (68%) of the entire population. Because the $SE_{est}$ is like a standard deviation of all possible error scores, the range from the predicted value ($y'$) minus 1 $SE_{est}$ to $y'$ plus 1 $SE_{est}$ is one within which the true value of the criterion variable, y, will occur about 68% of the time. Therefore, if $y' = 2.845$ and $SE_{est} = 0.199$, then the range from 2.646 (2.845 − 0.199) to 3.044 (2.845 + 0.199) will include the true grade average for someone who studies 12.5 hours per week 68% of the time. To put it more concisely, with p = 0.68, the value of y is between 2.646 and 3.044.
Capturing the true value only 68% of the time leaves a great deal to chance. When researchers use confidence intervals (CIs) in regression procedures, they more commonly calculate them so that they capture the true value of y 95% or 99% of the time. These are called 0.95 or 0.99 confidence intervals. The process for a 0.95 confidence interval for the grade average and number of hours studied problem is as follows:
Formula 9.5
$$CI = y' \pm t(SE_{est})$$
where
CI = the confidence interval for the regression solution
t = a critical value of t for n − 2 df for p = 0.05 (for a 0.99 confidence interval, it is the value for p = 0.01)
$SE_{est}$ = the standard error of the estimate according to Formula 9.4
$y'$ = the predicted value for the criterion variable
So, for the problem predicting grade average from the number of hours studied, a 0.95 confidence interval will be as follows:
With $SE_{est} = 0.199$, $t(df = 18) = 2.101$, and $y' = 2.845$, then
$$CI = y' \pm t_{n-2}(SE_{est}) = 2.845 \pm 2.101(0.199) = 2.845 \pm 0.418$$
which gives an upper limit of 3.263 and a lower limit of 2.427.
To be 0.95 confident of having captured the true grade average for someone who studies 12.5 hours per week, the range for possible grades needs to encompass every value from 3.263 down to 2.427. This wide confidence interval stretches from a substantial B to what is ordinarily a C. It is a wider interval than if we were satisfied with 0.90 confidence, and not as wide as if we adopted 0.99 confidence. Several factors affect the width of the confidence interval:
• the level of confidence, as we must note with the difference between 0.90, 0.95, and 0.99
• the sample size, which affects both the amount of variability in y and the critical value of t
• the strength of the correlation
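The effect of the confidence level on interval width can be sketched as follows. The 0.95 critical value (2.101) is the one used above; the 0.90 and 0.99 values are two-tailed critical values for 18 degrees of freedom taken from a standard t table rather than from this chapter, and the function name `confidence_interval` is an illustrative assumption.

```python
def confidence_interval(y_pred, se_est, t_crit):
    """Formula 9.5: y' plus and minus t * SEest. Returns (lower, upper)."""
    margin = t_crit * se_est
    return y_pred - margin, y_pred + margin

# Two-tailed critical values of t for df = 18 (standard t table).
T_CRIT_DF18 = {0.90: 1.734, 0.95: 2.101, 0.99: 2.878}

y_pred, se_est = 2.845, 0.199  # values from the study-time example

intervals = {}
for level, t_crit in T_CRIT_DF18.items():
    lower, upper = confidence_interval(y_pred, se_est, t_crit)
    intervals[level] = (lower, upper)
    print(f"{level:.2f} CI: {lower:.3f} to {upper:.3f}")
```

The 0.95 line reproduces the interval from 2.427 to 3.263, and the other two lines show the interval narrowing at 0.90 and widening at 0.99.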
Try It!: #3 What factors affect the width of a confidence interval?
Apply It! Using Regression to Predict Growth
A psychologist is considering whether to purchase a marriage and family therapy practice from someone who is retiring. The prospective purchaser wants to predict the practice’s growth. This sort of procedure is sometimes called trend analysis, but the work involved is regression.
Historically, the psychologist’s practice appears to have grown along with the population of the town, which is currently booming because of the growth of nearby government-research and development-defense industries. To verify the relationship between the number of clients and population growth, the buyer will calculate a correlation to start. If the two variables are significantly correlated, the psychologist can then use the town population to predict growth in the number of clients. Since data from the county office predict that within the next five years the town will grow to 260,000, the psychologist wishes to know how many clients can be expected when the population reaches that projected number.
To pursue all of this, the psychologist gathers data on the town’s population at one-year intervals over the previous 14 years along with the number of clients in the practice for the same years. Table 9.2 lists these data.
If population is plotted as the x (predictor) value and the number of clients as the y (criterion) value, Figure 9.3 seems to indicate a strong linear relationship between the two variables.
The line through the 14 data points is what Excel calls a “trend line.” For our purposes, it is the regression line. To proceed with a regression solution, the psychologist needs the means and standard deviations for the two variables, the strength of the correlation between x and y, and the values for a, the intercept or regression constant, and b, the slope or the regression coefficient.
Figure 9.3: Number of clients as a function of population growth
[Scatterplot with trend line. x-axis: Population (0 to 250,000); y-axis: No. of clients (0 to 300)]
First, for the correlation, we have $r_{xy} = 0.943$. In comparison, the critical value is $r_{0.05(12)} = 0.532$, so the correlation is statistically significant. The means and standard deviations for the two variables are shown in Table 9.3.
Table 9.2: Population and the number of clients
Year    Population    No. of clients
2001    25,780        27
2002    29,580        32
2003    36,500        75
2004    39,870        82
2005    43,580        102
2006    57,800        111
2007    59,000        131
2008    70,000        118
2009    82,000        152
2010    91,000        149
2011    129,000       174
2012    149,000       188
2013    176,000       209
2014    198,000       254
Table 9.3: Descriptive statistics for population and the number of clients
                      Mean         Standard deviation
Population (x)        84793.571    56466.608
No. of clients (y)    128.214      63.926
The slope of the regression line, or the regression coefficient, is
$$b = r_{xy}\left(\frac{s_y}{s_x}\right) = 0.943\left(\frac{63.926}{56466.608}\right) = 0.001$$
The intercept, sometimes called the regression constant, is
$$a = M_y - bM_x = 128.214 - (0.001 \times 84793.571) = 43.420$$
Recall that the psychologist’s question was how many clients could be expected if the population of the town grew to 260,000. In the regression equation, then, the value of x is 260,000, so using $y' = a + bx$ results in $y' = 43.420 + (0.001)(260{,}000) = 303.420$.
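Because the slope here is a very small number that gets multiplied by a very large x, the result is sensitive to how far b is rounded. The sketch below, which assumes only the summary statistics from Table 9.3 and the reported correlation, compares the prediction made with b rounded to 0.001 (as in the hand calculation) against the prediction made with b left unrounded. The helper name `predict` is an illustrative choice.

```python
r_xy = 0.943
m_x, s_x = 84793.571, 56466.608  # population (x), Table 9.3
m_y, s_y = 128.214, 63.926       # number of clients (y), Table 9.3
x_new = 260_000

def predict(b):
    """Intercept a = M_y - b*M_x, then y' = a + b*x_new."""
    a = m_y - b * m_x
    return a + b * x_new

b_full = r_xy * (s_y / s_x)   # slope at full floating-point precision
b_rounded = round(b_full, 3)  # 0.001, the value used in the hand calculation

y_rounded = predict(b_rounded)
y_full = predict(b_full)

print(f"b = {b_full:.7f}")
print(f"prediction with b rounded to 0.001: {y_rounded:.2f}")
print(f"prediction with unrounded b:        {y_full:.2f}")
```

Under these assumptions the two predictions differ by roughly a dozen clients, which suggests carrying more decimal places in b whenever the predictor is measured in large units.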
The data suggest that if the population grew to 260,000, the best prediction for number of clients is about 303. To have a sense of how much error there might be in this prediction, the psychologist needs a confidence interval, for which an important part is the standard error of the estimate:
$$SE_{est} = s_y\sqrt{1 - r_{xy}^2} = 63.926\sqrt{1 - 0.943^2} = 21.274$$
Since the significance of the correlation was tested at p = 0.05 (note the critical value for the correlation), the confidence interval should be at the same level. That will make it a 0.95 confidence interval.
$$CI_{0.95} = y' \pm t(SE_{est})$$
The value of t for 12 degrees of freedom is 2.179 at p = 0.05, so
$$CI_{0.95} = y' \pm t_{0.05(12)}(SE_{est}) = 303.420 \pm 2.179(21.274) = 303.420 \pm 46.356$$
which gives an upper limit of 349.776 and a lower limit of 257.064.
With 95% confidence, the psychologist can expect somewhere between 257 and 350 clients when the city’s population is 260,000. Even with a strong correlation between the population (x) and the number of clients (y), the interval is fairly wide because the number of clients is itself highly variable ($s_y = 63.926$).
Apply It! boxes written by Shawn Murphy
Interpreting the Regression Results
The value of the slope, $b = r_{xy}(s_y / s_x)$, the regression coefficient, is a proportion of the ratio of $s_y$ to $s_x$. The proportion is determined by the strength of the correlation. It never happens in human-subjects research, but if the correlation between the two variables were perfect ($r_{xy} = 1.0$), the value of the slope would be one times that ratio of $s_y$ to $s_x$. As the correlation diminishes, the slope’s value is a decreasing proportion of that ratio. At $r_{xy} = 0.50$, for example, the slope’s value is half the ratio of $s_y$ to $s_x$.
The slope of the regression line need not be a positive value. If the correlation between the predictor and criterion variables is negative, the slope will be negative. This means that as x increases, y declines, something illustrated by a scatterplot that descends from left to right.
Suppose a criminologist is using incarcerated inmates’ good behavior to predict the length of time the inmates actually serve. As the number of days of good behavior while incarcerated increases, the overall length of the inmate’s sentence decreases. The regression slope for such a problem would be negative, declining from left to right. Negative values for b and slopes that decline from left to right are not unusual.
• If the amount of time students spend on video games is used to predict their grade averages during the first year of college, the correlation between those variables will probably be negative; as time on games increases, grades probably decline.
• If the frequency of substance abuse is used to predict job productivity, the slope is probably negative.
• The number of extramarital affairs is probably negatively correlated with the length of a marriage. The regression line for predicting the length of a marriage from the number of affairs would have a negative slope.
Using the first example above, an enterprising graduate student with an interest in predicting students’ grades gathers data on video gaming and grades from 10 randomly selected undergraduates, resulting in Table 9.4.
Table 9.4: Video gaming and grades
Student    Video gaming hours    Grades
1          0                     3.9
2          1                     3.8
3          1                     3.6
4          3                     3.6
5          5                     3.4
6          5                     3.0
7          7                     2.9
8          6                     2.7
9          4                     2.9
10         8                     2.5
Try It!: #4 If the correlation between x and y is negative, what happens to the predicted value of y as x increases?
The scatterplot and the Excel solution for the regression line are represented in Figure 9.4.
Figure 9.4: The relationship between hours in video gaming per day and grades
[Scatterplot with regression line. x-axis: Hours per day video gaming (0 to 10); y-axis: Grades (0 to 4.5)]
Because the correlation is negative, the regression line here declines from left to right; according to these data, more time spent gaming is associated with lower grades.
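The chapter does not report numeric values for this example, so the following Python sketch computes them from the data in Table 9.4. It relies on the identity that $r_{xy}(s_y / s_x)$ equals the sum of cross products divided by the sum of squares for x; the variable names are illustrative.

```python
import math

hours = [0, 1, 1, 3, 5, 5, 7, 6, 4, 8]                        # x, Table 9.4
grades = [3.9, 3.8, 3.6, 3.6, 3.4, 3.0, 2.9, 2.7, 2.9, 2.5]   # y, Table 9.4

n = len(hours)
m_x, m_y = sum(hours) / n, sum(grades) / n
sp = sum((x - m_x) * (y - m_y) for x, y in zip(hours, grades))  # cross products
ssx = sum((x - m_x) ** 2 for x in hours)
ssy = sum((y - m_y) ** 2 for y in grades)

r_xy = sp / math.sqrt(ssx * ssy)  # Pearson correlation
b = sp / ssx                      # slope; same as r_xy * (s_y / s_x)
a = m_y - b * m_x                 # intercept

print(f"r = {r_xy:.3f}, b = {b:.3f}, a = {a:.3f}")
```

Both the correlation and the slope come out negative, so each additional hour of gaming lowers the predicted grade average by the amount of b.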
Another Regression Problem
A social worker is responsible for encouraging indigent people to advance their educations so as to improve their living conditions. To demonstrate to clients that schooling affects income, the social worker gathers the data for a group of 12 people as listed in Table 9.5.
Table 9.5: Data for study comparing education and income
Group member    Years of education (x)    Income in thousands (y)
1               10                        23.3
2               12                        27
3               12                        30.5
4               14                        34
5               14                        45
6               16                        55
7               16                        57.5
8               16                        62
9               16                        68
10              18                        70
11              18                        85
12              18                        90
One client who is a high school graduate (12 years of education) asks, “If I attend the community college and complete a two-year certification program, what is a good prediction for my income?” Regression analysis with x = 14 will answer the question.
First, the social worker calculates the correlation:
$$r_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{[n\sum x^2 - (\sum x)^2][n\sum y^2 - (\sum y)^2]}}$$
$$= \frac{12(10{,}319) - (180)(647.3)}{\sqrt{[12(2{,}776) - 180^2][12(40{,}407.39) - 647.3^2]}}$$
$$= \frac{7{,}314}{\sqrt{(912)(65{,}891.39)}} = 0.944$$
Calculating the slope and intercept requires knowing the means and standard deviations. Table 9.6 tabulates these.
Table 9.6: Means and standard deviations for study comparing education and income
                       Years of education (x)    Income in thousands (y)
Means                  15.0                      53.942
Standard deviations    2.629                     22.342
The slope (regression coefficient) is then
$$b = r_{xy}\left(\frac{s_y}{s_x}\right) = 0.944\left(\frac{22.342}{2.629}\right) = 8.022$$
and the intercept (regression constant) is
$$a = M_y - bM_x = 53.942 - (8.022)(15.0) = -66.388$$
Solving the equation for 14 years of education (12 plus the 2 years of community college education) produces
$$y' = a + bx = -66.388 + (8.022)(14) = 45.92$$
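The hand calculation above can be retraced from the raw data in Table 9.5. This is a minimal Python sketch using the raw-score Pearson formula and sample standard deviations (n − 1 in the denominator); the variable names are illustrative.

```python
import math

years = [10, 12, 12, 14, 14, 16, 16, 16, 16, 18, 18, 18]         # x, Table 9.5
income = [23.3, 27, 30.5, 34, 45, 55, 57.5, 62, 68, 70, 85, 90]  # y, in thousands

n = len(years)
sum_x, sum_y = sum(years), sum(income)
sum_xy = sum(x * y for x, y in zip(years, income))
sum_x2 = sum(x * x for x in years)
sum_y2 = sum(y * y for y in income)

# Raw-score Pearson formula
r_xy = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

# Sample standard deviations (n - 1 in the denominator)
s_x = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
s_y = math.sqrt((sum_y2 - sum_y ** 2 / n) / (n - 1))

b = r_xy * (s_y / s_x)           # slope
a = sum_y / n - b * (sum_x / n)  # intercept
y_pred = a + b * 14              # prediction for 14 years of education

print(f"r = {r_xy:.3f}, b = {b:.3f}, a = {a:.3f}, y' = {y_pred:.2f}")
```

Because the code carries the correlation at full precision instead of rounding it to 0.944, the slope and intercept differ from the hand-calculated values in the third decimal place, while the prediction still rounds to 45.92.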
The best prediction that the social worker can make with these data is that the individual is likely to make about 45.92 thousand dollars ($45,920) with the benefit of the additional schooling. If the social worker wants a confidence interval for that answer, the first step is to determine the standard error of the estimate:
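$$SE_{est} = s_y\sqrt{1 - r_{xy}^2} = 22.342\sqrt{1 - 0.944^2} = 7.372$$
With 12 people in the sample, the critical value of t for n − 2 = 10 degrees of freedom at p = 0.05 is 2.228, so the confidence interval is
$$CI = y' \pm t(SE_{est}) = 45.92 \pm 2.228(7.372) = 45.92 \pm 16.425$$
which gives an upper limit of 62.345 and a lower limit of 29.495.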
Although the correlation between education and income is very high (r = 0.944), other factors affect the value of the standard error of the estimate and therefore the confidence interval. One is the variability in each of the predictor and criterion variables. The other is the probability level involved, which will be discussed later. With so much variability in income (it ranges from $23,300 to $90,000 for this sample), the standard error of the estimate is quite large, and the confidence interval is not precise. For this data set, the 0.95 confidence interval suggests that the community college program will result in an income somewhere between $29,495 and $62,345.
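To see how much the strength of the correlation matters for this interval, the sketch below holds the income standard deviation and the critical value of t fixed at the values from this example and varies only the correlation. The correlations other than 0.944 are hypothetical, chosen just to show the pattern, and `half_width` is an illustrative name.

```python
import math

s_y, t_crit = 22.342, 2.228  # from the education example (df = 10)

def half_width(r_xy):
    """Half the width of the 0.95 interval: t * s_y * sqrt(1 - r**2)."""
    return t_crit * s_y * math.sqrt(1 - r_xy ** 2)

# 0.944 is the observed correlation; the others are hypothetical.
widths = {r: half_width(r) for r in (0.80, 0.90, 0.944, 0.99)}
for r, hw in widths.items():
    print(f"r = {r}: y' plus or minus {hw:.3f} (thousands of dollars)")
```

With the same variability in income, a stronger correlation produces a narrower interval, and a weaker one widens it.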
Try It!: #5 If additional data in a regression problem produce a higher correlation between x and y, what will be the impact on a confidence interval for the solution?
Apply It! Intelligence and Working Memory
Although psychometric intelligence and working memory represent separate traditions in the study of intelligence, research indicates they are linked (see, for example, Thomas, Rammsayer, Schweizer, & Troche, 2015). Working memory is often evaluated in terms of the number of separate bits of data an individual can accurately retrieve after a brief exposure. As such, even with a short-form test of psychometric intelligence, working memory data are much easier to collect than traditional intelligence scores. A psychologist reasons that if the correlation between working memory and psychometric intelligence is statistically significant and substantial, working memory data can be used to predict intelligence scores. Table 9.7 presents data on both variables for 10 subjects.
Table 9.7: Short-term memory (STM) and intelligence
No. of items retrieved (STM)    Psychometric intelligence
6                               105
4                               95
5                               100
7                               105
7                               110
6                               100
8                               115
6                               100
6                               95
5                               90
After collecting the data, the psychologist encounters someone who has STM = 10, a full two points better than anyone in the sample. What is the best prediction for intelligence if STM = 10?
Table 9.8 shows the calculated means and standard deviations.
Table 9.8: Descriptive statistics for short-term memory and intelligence
                       No. of items (x)    Intelligence score (y)
Means                  6.0                 101.50
Standard deviations    1.155               7.472
Using the Pearson correlation formula gives $r_{xy} = 0.837$.
Since the critical value is $r_{0.05(8)} = 0.632$, the psychologist can be confident that the relationship between short-term memory ability and intelligence scores is not random. A regression solution based on this relationship is appropriate. For a regression solution for STM = 10, the slope (regression coefficient) is
$$b = r_{xy}\left(\frac{s_y}{s_x}\right) = 0.837\left(\frac{7.472}{1.155}\right) = 5.415$$
The intercept (regression constant) is
$$a = M_y - bM_x = 101.50 - 5.415(6.0) = 69.010$$
Solving the equation for STM = 10 results in:
$$y' = a + bx = 69.010 + 5.415(10) = 123.160$$
Given the data from the 10 people in the sample, the best prediction of a psychometric intelligence score for someone who has a short-term memory capacity of 10 is about 123 points. This solution raises an important issue related to regression. Although the STM = 10 is 1.667 times the mean of 6.0 for STM (10/6 = 1.667), the predicted intelligence score is only 1.213 times the mean for all intelligence scores. Put more succinctly, the predicted criterion value is less extreme than the predictor. This concept, called regression to the mean, is characteristic of linear regression solutions; whenever the correlation is less than perfect, an extreme predictor value predicts a less extreme criterion value. Consider the nature of normal distributions, and the reason becomes clear. Most of the individuals in any normal distribution occur to the right of extremely low values and to the left of extremely high values in the distribution. Regression solutions reflect this characteristic.
To calculate a confidence interval for the solution, the psychologist first determines the standard error of the estimate:
$$SE_{est} = s_y\sqrt{1 - r_{xy}^2} = 7.472\sqrt{1 - 0.837^2} = 4.089$$
With 10 people in the sample, the critical value of t for n − 2 = 8 degrees of freedom at p = 0.05 is 2.306, so the confidence interval is
$$CI = y' \pm t(SE_{est}) = 123.160 \pm 2.306(4.089) = 123.160 \pm 9.429$$
With 0.95 probability, the intelligence score for someone who has a short-term memory score of 10 is somewhere between 113.731 and 132.589.
As always with confidence intervals, the width of the interval is a function of the size of the standard error of the estimate, the amount of variability in the data, and finally, the level of probability at which it is calculated. Ordinarily the confidence interval is calculated for the same level as that at which the test was conducted. When we test at α = 0.05 we have established the probability of a type 1 (alpha) error; in other words, 5% of the time, what appears to be a statistically significant finding will be a random outcome. The corresponding level of probability for a confidence interval is p = 0.95, which is to suggest that 5% of the time, the actual value of the criterion variable will be either above or below the calculated interval.
Apply It! boxes written by Shawn Murphy
9.3 Regression with Excel
The regression procedure that is part of Excel for Windows provides useful output. Using the problem predicting income from education, the procedure is as follows:
1. Arrange the data in columns, A for years of education and B for income. Enter the labels “years” and “income” in cells A1 and B1, respectively.
2. Select the Data tab at the top of the page and then Data Analysis at the far right just below the page tabs.
3. In the Analysis Tools window, scroll down to Regression and click OK.
4. Click the Input Y Range window and drag the cursor from B2 to B13.
5. Click the Input X Range window and drag the cursor from A2 to A13.
6. Click the Output Range window and enter something like A15 so that the output does not overwrite the data. Click OK.
The result is shown in Figure 9.5.
Figure 9.5: Using Excel to predict income from years of education
Source: Microsoft Excel. Used with permission from Microsoft.
Column A has been expanded in Figure 9.5 to make it easier to read the output. Regression statistics in Excel illustrate the following:
• The correlation value is the same as in the manual solution, although Excel calls it “Multiple R,” which is actually the name for the correlation in procedures involving more than one predictor.
• The R Square value is the square of the Pearson correlation, indicating the proportion of variance in y explained by x.
• The Adjusted R Square is the R Square value adjusted downward to offset the tendency of small samples (which this is) to overstate the strength of the relationship.
• The Standard Error value is the standard error of the estimate, but using the Adjusted R Square value.
• The number of Observations is n = 12.
• The ANOVA tests the probability that the relationship between x and y occurred by chance. The entry 4.12E-06 is a shorthand way of indicating 4.12 with the decimal moved 6 places to the left ($4.12 \times 10^{-6}$), so that p = 0.00000412 that the relationship between x and y is random.
• The last table in Figure 9.5 provides the regression solution. The intercept and slope values are similar to the manual calculations, with differences attributable to differences in rounding. The standard error values for a and b were not calculated manually, nor were the significance tests or confidence intervals for those individual values. The significance tests are redundant because in least-squares regression, if $r_{xy}$ is statistically significant, x is also a significant predictor of y.
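The summary statistics described above can be approximated outside Excel. The following Python sketch recomputes them from the Table 9.5 data, assuming the usual adjustment for one predictor, Adjusted R² = 1 − (1 − R²)(n − 1)/(n − 2); it is an illustration of the definitions, not a reproduction of Excel's internal procedure.

```python
import math

years = [10, 12, 12, 14, 14, 16, 16, 16, 16, 18, 18, 18]         # x, Table 9.5
income = [23.3, 27, 30.5, 34, 45, 55, 57.5, 62, 68, 70, 85, 90]  # y, in thousands

n = len(years)
m_x, m_y = sum(years) / n, sum(income) / n
sp = sum((x - m_x) * (y - m_y) for x, y in zip(years, income))
ssx = sum((x - m_x) ** 2 for x in years)
ssy = sum((y - m_y) ** 2 for y in income)

r = sp / math.sqrt(ssx * ssy)                          # "Multiple R"
r_square = r ** 2                                      # "R Square"
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)  # one predictor
s_y = math.sqrt(ssy / (n - 1))
std_error = s_y * math.sqrt(1 - adj_r_square)          # SEest from adjusted R^2

# Scientific notation such as 4.12E-06 is an ordinary number.
p_value = float("4.12E-06")

print(f"R = {r:.3f}, R^2 = {r_square:.3f}, adjusted R^2 = {adj_r_square:.3f}")
print(f"standard error = {std_error:.3f}, p = {p_value:.8f}")
```

Using the adjusted value makes the standard error somewhat larger than the 7.372 obtained from Formula 9.4, which is the conservative correction for a small sample.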
Excel for Mac does not include a regression procedure. It does, however, include correlation and the other descriptive statistics. It will also produce a scatterplot with a “trend line,” which is like the regression line in ordinary least-squares regression.
Shrinkage and Overfitting the Sample
When samples are small and correlations are relatively weak, bear in mind the potential for prediction error. Even when confidence intervals are narrow, regression solutions carry potential risks. By necessity, the solution is based on the available data. This is not a problem as long as the existing data set represents the population of all such data reasonably well; no sample, however, can exactly emulate a population, and small samples involve particular risks. When a regression solution fits a sample but not the population, the problem is overfitting the sample. Another term to describe this characteristic is shrinkage: the degree to which the accuracy of a regression solution is diminished when used with other data from the same population. In the earlier example where the number of hours in daily video-gaming was used to predict grades, the sample of 10 students may not be a good representation of the population of all undergraduates. But the solution is (by necessity) based on those 10 students who were all the graduate student/researcher had available. A larger sample might provide a different correlation between the two variables. A larger sample might also provide less variability in either the x or y variable, which is typical as sample sizes grow. With either change, the solution based on 10 students would be inaccurate for the population; the solution would be overfitted to the sample.
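Shrinkage can be illustrated with a small simulation. The sketch below assumes a hypothetical population in which grades fall by 0.15 points per hour of gaming plus random noise (these numbers are invented, not estimated from Table 9.4). It repeatedly fits a line to a sample of 10 and then checks how well that same line predicts a fresh sample of 10 from the same population.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

def draw_sample(n):
    """Hypothetical population: y = 3.9 - 0.15*x + normal noise (sd 0.3)."""
    xs = [random.uniform(0, 8) for _ in range(n)]
    ys = [3.9 - 0.15 * x + random.gauss(0, 0.3) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y' = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    b = sp / ssx
    return my - b * mx, b

def mean_squared_error(a, b, xs, ys):
    """Average squared difference between actual y and predicted y'."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

in_sample, out_of_sample = [], []
for _ in range(2000):
    xs, ys = draw_sample(10)
    a, b = fit_line(xs, ys)
    in_sample.append(mean_squared_error(a, b, xs, ys))
    new_xs, new_ys = draw_sample(10)  # fresh data, same population
    out_of_sample.append(mean_squared_error(a, b, new_xs, new_ys))

avg_in = sum(in_sample) / len(in_sample)
avg_out = sum(out_of_sample) / len(out_of_sample)
print(f"average squared error, fitted sample: {avg_in:.4f}")
print(f"average squared error, new sample:    {avg_out:.4f}")
```

On average the line predicts the sample it was fitted to better than it predicts new data, which is the loss of accuracy that the term shrinkage describes.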
Try It!: #6 What does shrinkage mean in regression, and how can it be avoided?
The Requirements for Ordinary Least-Squares Regression
Many different regression procedures are possible, with each adapted to a different set of circumstances. Bivariate, ordinary least-squares regression requires the following:
• The variables involved must be interval or ratio scale.
• The variables must be normally distributed in their populations.
• The predictor and criterion variables must show statistically significant correlation.
• The relationship between the variables must be linear.
• The data must have similar amounts of variability throughout their ranges.
9.4 A Conceptual Introduction to Multiple Regression
To this point, the chapter has confined its discussion to what is called “simple” or “bivariate” regression. It involves one predictor (x) and one criterion variable (y). Multiple regression, on the other hand, uses similar logic but employs more than one predictor. If multiple variables are correlated with a criterion variable, and they are not too highly correlated with each other, multiple predictors can often provide a more precise prediction of y than a single predictor. Here is the multiple regression equation with two predictors:
$$y' = a + b_1x_1 + b_2x_2$$
The intercept or constant value, a, is defined as the value of y when both x values equal zero. There are two b values, two slopes, one for each of the predictor variables, $x_1$ and $x_2$. Each indicates how much y changes when the particular x value increases by 1.0, and the value of the other x value is held constant.
Recall how using short-term memory data predicted psychometric intelligence. Suppose someone modifies that problem so that short-term memory (STM; now $x_1$) and problem-solving ability (prob.solv.; $x_2$) are both used to predict psychometric intelligence (y). The intercept would indicate the psychometric intelligence value if both STM and prob.solv. = 0. The $b_1$ value would indicate how much psychometric intelligence changes if STM increases by 1.0, and problem-solving ability is unchanged. The $b_2$ value would indicate how much psychometric intelligence changes if prob.solv. increases by 1.0 and STM is unchanged.
Holding one predictor unchanged as the other increases controls redundancy between predictors. When two predictors are correlated with a criterion value, they also tend to be correlated with each other, and an accurate prediction requires determining what each predictor reveals about y that is unique.
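A two-predictor solution can be sketched with nothing more than sums of squares and cross products. In the Python example below, the scores are hypothetical and are built from a known rule (y = 60 + 5 × STM + 2 × prob.solv.) so that the recovered values can be checked; `fit_two_predictors` is an illustrative name, and the formulas are the standard least-squares solution for two predictors.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares a, b1, b2 for y' = a + b1*x1 + b2*x2."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))  # predictor overlap
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # zero if the predictors are perfectly correlated
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2

# Hypothetical scores generated from y = 60 + 5*STM + 2*prob_solv
stm = [4, 5, 5, 6, 6, 7, 7, 8]
prob_solv = [3, 6, 4, 7, 5, 9, 6, 8]
iq = [60 + 5 * s + 2 * p for s, p in zip(stm, prob_solv)]

a, b1, b2 = fit_two_predictors(stm, prob_solv, iq)
print(f"a = {a:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")
```

The `s12` term is what removes the redundancy between the predictors: each slope reflects only the part of its predictor that the other predictor does not already account for.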
As a footnote to this overview, note that not all regression procedures are based on a linear relationship between x (or the xs) and y. Bivariate and the multiple regression mentioned above are based on that requirement, but the relationship between x and y is not always linear. Indeed, the x and y variables are not always measured on an interval or ratio scale. Those instances involve still other regression procedures, procedures beyond the scope of an introductory text.
Writing Up Statistics
Although this chapter has focused on simple regression with one predictor variable, human complexity makes explaining behavior with one variable very difficult. Consequently, published research more commonly uses multiple regression. Bai, Lai, Lee, Chang, and Chiou (2015) used several variables to explain fatigue among patients receiving dialysis because of failed kidneys. The research literature indicated that age, length of employment, amount of physical activity, amount of medication taken, and degree of depression were all factors in dialysis patients’ fatigue. Gathering data for 193 patients, the researchers developed a regression model that explained 64.2% of the variance in fatigue. Such a model would allow the researchers to make quite accurate predictions of the level of fatigue, given the levels of these other variables.
Summary and Resources
Chapter Summary The correlation coefficient is an elegant statistic. Whenever separate measures have some quality in common, correlations indicate the strength of the relationship between them. Regression procedures capitalize on this by using what is contained in one measure to pre- dict the probable level of the other measure (Objective 1). Because prediction is a part of all science and of virtually every social domain as well, regression has remarkably wide appli- cation. When variables are related, but one is more difficult to measure than the other, the more accessible variable can be used to predict the more elusive variable (Objective 3).
Many types of regression exist. Bivariate regression has one predictor variable and one variable predicted, the criterion variable. The math in least-squares regression, or ordinary least-squares regression, produces a solution that minimizes the sum of the squared errors from a series of predictions. The regression line is a visual representation of the relationship between the variables. It allows the prediction of y from x, and because regression makes no assumption about which variable is the cause, the same procedure can also be used to predict x from y when that is helpful (Objective 2).
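The least-squares idea in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration, not the chapter's own worked example: the function names and the data are hypothetical, and the slope is computed from the sum of cross-products divided by the sum of squared deviations in x, which is one standard way to express the least-squares solution.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y' = a + b*x."""
    mx, my = mean(xs), mean(ys)
    # Sum of cross-products and sum of squared deviations in x.
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx      # slope (regression coefficient)
    a = my - b * mx    # intercept: predicted y when x = 0
    return a, b

def sum_squared_errors(xs, ys, a, b):
    """Sum of squared differences between actual and predicted y."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, invented for illustration only.
hours = [1, 2, 3, 4, 5]
score = [52, 55, 61, 60, 67]

a, b = fit_line(hours, score)
best = sum_squared_errors(hours, score, a, b)
# Moving the intercept away from the fitted value raises the error total.
shifted = sum_squared_errors(hours, score, a + 1.0, b)
print(a, b, best, shifted)
```

Any other line drawn through the same points, such as the shifted one here, produces a larger sum of squared errors, which is what the least-squares criterion requires.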
The regression line is a best fit given the available data, but because correlations between the predictor and criterion variable are never perfect in human subjects research, some prediction error is always likely. The standard error of the estimate is an average measure of that error and, when used in a confidence interval, indicates how large the interval must be around the predicted value of y, called “y prime” (y′), to capture with confidence the true value of y.
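As a rough sketch of how the standard error of the estimate feeds a confidence interval, the Python below assumes the common textbook form, the criterion's standard deviation multiplied by the square root of 1 minus the squared correlation. Some texts add a small sample-size correction, so treat the formula as an assumption here. The function names and all numeric inputs are hypothetical, and the critical t value is one the reader would look up in a t table.

```python
import math

def standard_error_of_estimate(s_y, r_xy):
    """Assumed form: s_y * sqrt(1 - r_xy squared)."""
    return s_y * math.sqrt(1.0 - r_xy ** 2)

def confidence_interval(y_prime, se_est, t_crit):
    """Interval of y' plus or minus t_crit standard errors."""
    margin = t_crit * se_est
    return y_prime - margin, y_prime + margin

# Hypothetical inputs, for illustration only.
s_y = 10.0       # standard deviation of the criterion variable
r_xy = 0.6       # correlation between predictor and criterion
y_prime = 50.0   # predicted criterion value
t_crit = 2.009   # approximate two-tailed t for 0.95 with 50 degrees of freedom

se_est = standard_error_of_estimate(s_y, r_xy)
low, high = confidence_interval(y_prime, se_est, t_crit)
print(se_est, low, high)

# A stronger correlation shrinks the standard error and so the interval.
print(standard_error_of_estimate(s_y, 0.9))
```

The two inputs to the standard error match the glossary definition: the strength of the correlation and the variability in the criterion variable.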
When regression solutions are based on a data set not representative of the population, they can be “overfitted” to the sample. The evidence of overfitting is a solution that predicts less well for other data sets drawn from the same population. The reduction in the utility of the regression solution is sometimes referred to as shrinkage.
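Shrinkage can be illustrated with a small sketch. The two samples below are hypothetical and are simply imagined as coming from the same population. A line fitted to the first sample is applied to the second, and its error there is compared with the error of the line fitted directly to the second sample. The helper names are assumptions made for this example.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y' = a + b*x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b

def mean_squared_error(xs, ys, a, b):
    """Average squared prediction error of the line on a data set."""
    return mean((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Two hypothetical samples, invented for illustration only.
x1, y1 = [1, 2, 3, 4, 5], [3, 4, 8, 7, 11]
x2, y2 = [1, 2, 3, 4, 5], [4, 6, 5, 9, 9]

a1, b1 = fit_line(x1, y1)   # solution tailored to sample 1
a2, b2 = fit_line(x2, y2)   # solution tailored to sample 2

error_at_home = mean_squared_error(x1, y1, a1, b1)
error_on_new = mean_squared_error(x2, y2, a1, b1)
best_on_new = mean_squared_error(x2, y2, a2, b2)
print(error_at_home, error_on_new, best_on_new)
```

The line fitted to the first sample can never beat the second sample's own least-squares line on that sample. With these particular numbers it also predicts the new sample less well than it predicted the sample it came from, which is the pattern the term shrinkage describes, although that second comparison is typical rather than guaranteed.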
Bivariate regression employs one predictor variable to estimate the value of a criterion. The same principles used here for bivariate regression can be applied to multiple regression. For that procedure, the information about a criterion contained in multiple predictor variables is used to calculate multiple regression coefficients which are combined to estimate the value of the criterion variable (Objective 4).
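For the two-predictor case, the combination of information can be sketched from correlations alone. The code below uses the standard formulas for standardized regression weights with two predictors; it is not necessarily the notation this text uses, and the function names and correlation values are hypothetical.

```python
def standardized_weights(r_y1, r_y2, r_12):
    """Standardized weights for two predictors.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors.
    """
    denom = 1.0 - r_12 ** 2
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2

def r_squared(r_y1, r_y2, r_12):
    """Proportion of criterion variance explained by both predictors."""
    beta_1, beta_2 = standardized_weights(r_y1, r_y2, r_12)
    return beta_1 * r_y1 + beta_2 * r_y2

def predict_z(z_1, z_2, beta_1, beta_2):
    """Predicted criterion z score from two predictor z scores."""
    return beta_1 * z_1 + beta_2 * z_2

# Hypothetical correlations, for illustration only.
r_y1, r_y2, r_12 = 0.6, 0.5, 0.3

beta_1, beta_2 = standardized_weights(r_y1, r_y2, r_12)
print(beta_1, beta_2)
print(r_squared(r_y1, r_y2, r_12), r_y1 ** 2)
print(predict_z(1.0, -0.5, beta_1, beta_2))
```

Comparing the two values on the second printed line shows the point of adding a predictor: together the two variables explain more of the criterion than the stronger predictor does alone, but less than the sum of their separate squared correlations because the predictors overlap.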
Key Terms
criterion variable The variable for which the value is predicted in a regression procedure.
intercept The point where the regression line crosses the y axis when the regression solution is plotted in a graph, defined as the value of y when x = 0.
least-squares criterion The requirement that the sum of squared errors have its lowest possible value.
ordinary least-squares regression A form of regression in which the sum of the squared prediction errors must have its lowest possible value.
overfitting the sample When the accuracy of regression solutions diminishes with new data sets.
predictor variable In regression, the variable used to predict the value of the criterion variable.
regression coefficient Indicates the attitude of the regression line. Sometimes called the slope value, it is defined as the impact on y of increasing x by 1.0.
regression line When positioned in a scatterplot, an illustration of the relationship between predictor and criterion variables. It is positioned so as to minimize the sum of all squared prediction errors.
regression to the mean Term used to describe the fact that extreme values of a predictor always predict less extreme values of a criterion variable. This occurs because normal distributions have the preponderance of data in the middle of the distribution, and frequency declines with distance from the mean.
residual scores The differences between the actual and predicted values of the criterion.
shrinkage The degree to which a regression solution diminishes in accuracy when applied to new data sets.
simple or bivariate regression Refers to regression with one predictor variable.
slope Indicates how much the regression line inclines or declines from left to right.
standard error of the estimate An average measure of error in a regression solution. It is based on the strength of the correlation between the variables and the variability in the criterion variable.
Review Questions Answers to the odd-numbered questions are provided in Appendix A.
The table immediately below is an economical way to show the correlations among several variables. Using the different correlations listed across the top line and also down the first column, we can determine the correlation coefficient between any two variables. The correlation between probsolv (problem solving) and analytic (analytical ability), for example, is determined by moving down the left column for one variable, across the top for the other variable, and then finding the value at the point where the two meet. At the intersection of probsolv down the left and analytic across the top, the value is 0.726. The correlation of probsolv and analytic is r = 0.726.
Use the following information to answer questions 1–6.
Correlation Matrix*
           probsolv  analytic  comprehen  reasoning  computat  vocab
probsolv      1.000     0.726      0.833      0.598     0.919  0.714
analytic      0.726     1.000      0.767      0.857     0.734  0.894
comprehen     0.833     0.767      1.000      0.686     0.736  0.740
reasoning     0.598     0.857      0.686      1.000     0.534  0.852
computat      0.919     0.734      0.736      0.534     1.000  0.675
vocab         0.714     0.894      0.740      0.852     0.675  1.000
*All correlations are statistically significant.
Descriptive Statistics
Test             Mean    Standard deviation
Problem solving  43.000  8.441
Analytic         46.500  9.317
Comprehension    46.500  8.893
Reasoning        48.000  6.144
Computation      52.750  7.502
Vocabulary       54.850  5.250
1. Noting the correlation between problem solving and the computation score, what computation score can be predicted for a student whose problem-solving score is 49?
   a. How much will the computation score increase for every 1.0 increase in problem solving?
   b. What value will computation have if problem solving is 0?
   c. In terms of regression solutions, why is the value of computation relevant when problem solving is 0?
2. What is the standard error of the estimate for the Question 1 solution?
3. Calculate a 0.99 confidence interval for the Question 1 solution. Assume n = 52.
   a. What is the confidence interval expected to contain?
   b. On average, how often will the assumption referred to in Question 3a be wrong?
   c. What could a researcher do to shrink the confidence interval?
4. Referring to the matrix at the beginning of the Review Questions, what variable will provide the best prediction of comprehension scores? Explain.
5. From the matrix at the beginning of the Review Questions, what vocabulary score is predicted for someone who has an analytic score of 57.5?
6. What comprehension score is predicted for someone who has a reasoning score of 60?
7. The data below indicate the number of times subjects are reinforced for solving each problem and the number of problems correctly solved in the class period.
Reinforcements  Number of problems solved
2               3
5               6
4               5
1               3
3               4
5               7
4               4
6               9
   a. What is the correlation between reinforcement and response rates?
   b. What is the best prediction for number of responses for someone who has been reinforced 5 times?
   c. Assume n = 52 and determine a 0.95 confidence interval for the Question 7b solution.
   d. What would the confidence interval be if the correlation were only rxy = 0.7?
8. If two sets of data are uncorrelated, what is the best prediction for the value of y?
9. What impact does a negative correlation between x and y have on the slope of the regression line?
10. What factors determine error in a regression prediction?
Answers to Try It! Questions
1. The size of the correlation matters because the larger it is, the more the two variables have in common, and the more accurately the value of one can be predicted from the value of the other.
2. A weak correlation is indicated by extensive scatter among the points in a scatterplot.
3. The factors in the width, or size, of a confidence interval are the strength of the correlation, the variability in the criterion variable, the sample size, and the level of confidence.
4. A negative correlation is reflected in a slope that declines from left to right in the graph. The value of b, the regression coefficient, will be negative.
5. A higher correlation between x and y results in a narrower confidence interval for the solution. A higher correlation results in more precision.
6. In regression, shrinkage means that a regression solution does not fit subsequent data sets as well as it fits the sample for which it was initially calculated. The best way to avoid shrinkage is to ensure that the sample reflects the characteristics of the population. This means large, randomly selected samples.