week 8 PPOL 505 DB


16 USING LINEAR REGRESSION: PREDICTING THE FUTURE


Difficulty Scale

(as hard as they get!)

WHAT YOU WILL LEARN IN THIS CHAPTER

· Understanding how prediction works and how it can be used in the social and behavioral sciences

· Understanding how and why linear regression works when predicting one variable on the basis of another

· Judging the accuracy of predictions

· Understanding how multiple regression works and why it is useful

INTRODUCTION TO LINEAR REGRESSION

You’ve seen it all over the news: concern about obesity and how it affects work and daily life. A set of researchers in Sweden was interested in looking at how well mobility disability and/or obesity predicted job strain and whether social support at work can modify this association. The study included more than 35,000 participants, and differences in job strain mean scores were estimated using linear regression, the exact focus of what we are discussing in this chapter. The results found that level of mobility disability did predict job strain and that social support at work significantly modified the association among job strain, mobility disability, and obesity.

Want to know more? Go to the library or go online …

Norrback, M., De Munter, J., Tynelius, P., Ahlstrom, G., & Rasmussen, F. (2016). The association of mobility disability, weight status and job strain: A cross-sectional study. Scandinavian Journal of Public Health, 44, 311–319.

WHAT IS PREDICTION ALL ABOUT?

Here’s the scoop. Not only can you compute the degree to which two variables are related to one another (by computing a correlation coefficient as we did in Chapter 5), but you can also use these correlations to predict the value of one variable based on the value of another. This is a very special case of how correlations can be used, and it is a very powerful tool for social and behavioral sciences researchers.

The basic idea is to use a set of previously collected data (such as data on variables X and Y), calculate how correlated these variables are with one another, and then use that correlation and the knowledge of X to predict Y. Sound difficult? It’s not really, especially once you see it illustrated.

For example, a researcher collects data on total high school grade point average (GPA) and first-year college GPA for 400 students in their freshman year at the state university. He computes the correlation between the two variables. Then, he uses the techniques you’ll learn about later in this chapter to take a new set of high school GPAs and (knowing the relationship between high school GPA and first-year college GPA from the previous set of students) predict what first-year GPA should be for a new student who is just starting out. Pretty nifty, huh?

Here’s another example. A group of kindergarten teachers is interested in finding out how well extra help for their students aids them in first grade. That is, does the amount of extra help in kindergarten predict success in first grade? Once again, these teachers know the correlation between the amount of extra help and first-grade performance from prior years; they can apply it to a new set of students and predict first-grade performance based on the amount of kindergarten help.

How does regression work? Data are collected on past events (such as the existing relationship between two variables) and then applied to a future event given knowledge of only one variable. It’s easier than you think.

The higher the absolute value of the correlation coefficient, regardless of whether it is direct or indirect (positive or negative), the more accurate the prediction is of one variable from the other based on that correlation. That’s because the more two variables share in common, the more you know about the second variable based on your knowledge of the first variable. And you may already surmise that when the correlation is perfect (+1.0 or −1.0), then the prediction is perfect as well. If rxy = −1.0 or +1.0 and if you know the value of X, then you also know the exact value of Y. Likewise, if rxy = −1.0 or +1.0 and you know the value of Y, then you also know the exact value of X. Either way works just fine.

What we’ll do in this chapter is go through the process of using linear regression to predict a Y score from an X score. We’ll begin by discussing the general logic that underlies prediction, then review some simple line-drawing skills and, finally, discuss the prediction process using specific examples.

Why the prediction of Y from X and not the other way around? Convention. Seems like a good idea to have a consistent way to identify variables, so the Y variable becomes the dependent variable or the one being predicted and the X variable becomes the independent variable and is the variable used to predict the value of Y. And when predicted, the Y value is represented as Y′ (read as “Y prime”), the predicted value of Y. (To sound like an expert, you might call the independent variable a predictor and the dependent variable the criterion. Purists save the terms independent and dependent to describe cause-and-effect relationships, which we cannot assume when talking about correlations.)

THE LOGIC OF PREDICTION

Before we begin with the actual calculations and show you how correlations are used for prediction, let’s understand the argument for why and how prediction works. We will continue with the example of predicting college GPA from high school GPA.

Prediction is the computation of future outcomes based on a knowledge of present ones. When we want to predict one variable from another, we need to first compute the correlation between the two variables. Table 16.1 shows the data we will be using in this example. Figure 16.1 shows the scatterplot (see Chapter 5) of the two variables that are being correlated.

Table 16.1 ⬢ Total High School GPA and First-Year College GPA

High School GPA    First-Year College GPA
3.50               3.30
2.50               2.20
4.00               3.50
3.80               2.70
2.80               3.50
1.90               2.00
3.20               3.10
3.70               3.40
2.70               1.90
3.30               3.70

Figure 16.1 ⬢ Scatterplot of high school GPA and college GPA

To predict college GPA from high school GPA, we have to create a regression equation and use that to plot what is called a regression line. A regression line reflects our best guess as to what score on the Y variable (college GPA) would be predicted by a score on the X variable (high school GPA). For all the data you see in Table 16.1, the regression line is drawn so that it minimizes the distance between itself and each of the points on the predicted (Y′) variable. You’ll learn shortly how to draw that line, shown in Figure 16.2.

What does the regression line you see in Figure 16.2 represent?

First, it’s the regression of the Y variable on the X variable. In other words, Y (college GPA) is being predicted from X (high school GPA). This regression line is also called the line of best fit. The line fits these data because it minimizes the distance between each individual point and the regression line. Each of those distances is an error: the prediction was some distance from the right answer. The line is drawn to minimize those errors. For example, if you take all these points and try to find the line that best fits them all at once, the line you see in Figure 16.2 is the one you would use.

Second, it’s the line that allows us our best guess at what college GPA would be, given each high school GPA. For example, if high school GPA is 3.0, then college GPA should be around 2.8 (remember, this is only an eyeball prediction). Take a look at Figure 16.3 to see how we did this. We located the predictor value (3.0) on the x-axis, drew a perpendicular line from the x-axis to the regression line, then drew a horizontal line to the y-axis, and finally estimated what the predicted value of Y would be.

Figure 16.3 ⬢ Estimating college GPA given high school GPA

Third, the distance between each individual data point and the regression line is the error in prediction, a direct reflection of the correlation between the two variables. For example, if you look at data point (3.3, 3.7), marked in Figure 16.4, you can see that this (X, Y) data point is above the regression line. The distance between that point and the line is the error in prediction, as marked in Figure 16.4, because if the prediction were perfect, then all the predicted points would fall where? Right on the regression or prediction line.

Figure 16.4 ⬢ Prediction is rarely perfect: estimating the error in prediction

Fourth, if the correlation were perfect (and the x-axis meets the y-axis at Y’s mean), all the data points would align themselves along a 45° angle, and the regression line would pass through each point (just as we said earlier in the third point).

Given the regression line, we can use it to precisely predict any future score. That’s what we’ll do right now—create the line and then do some prediction work.


DRAWING THE WORLD’S BEST LINE (FOR YOUR DATA)

The simplest way to think of prediction is that you are determining the score on one variable (which we’ll call Y, the criterion or dependent variable) based on the value of another score (which we’ll call X, the predictor or independent variable).

The way that we find out how well X can predict Y is through the creation of the regression line we mentioned earlier in this chapter. This line is created from data that have already been collected. The equations are then used to predict scores using a new value for X, the predictor variable.

Formula 16.1 shows the general formula for the regression line, which may look familiar because you may have used something very similar in a high school or college math course. In geometry, it’s the formula for any straight line:

(16.1)

$$Y' = bX + a,$$

where

· Y′ is the predicted score of Y based on a known value of X;

· b is the slope, or direction, of the line;

· X is the score being used as the predictor; and

· a is the point at which the line crosses the y-axis.

Let’s use the same data shown earlier in Table 16.1, along with a few more calculations that we will need thrown in.

         X       Y       X²        Y²        XY
         3.5     3.3     12.25     10.89     11.55
         2.5     2.2      6.25      4.84      5.50
         4.0     3.5     16.00     12.25     14.00
         3.8     2.7     14.44      7.29     10.26
         2.8     3.5      7.84     12.25      9.80
         1.9     2.0      3.61      4.00      3.80
         3.2     3.1     10.24      9.61      9.92
         3.7     3.4     13.69     11.56     12.58
         2.7     1.9      7.29      3.61      5.13
         3.3     3.7     10.89     13.69     12.21
Total    31.4    29.3    102.50    89.99     94.75

From this table, we see that

· ∑X, or the sum of all the X values, is 31.4.

· ∑Y, or the sum of all the Y values, is 29.3.

· ∑X², or the sum of each X value squared, is 102.5.

· ∑Y², or the sum of each Y value squared, is 89.99.

· ∑XY, or the sum of the products of X and Y, is 94.75.
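The five sums above can be checked with a few lines of Python. This is only a sketch for verifying the table by hand; the function name `regression_sums` and the dictionary keys are our own labels, not anything from SPSS or the chapter.

```python
# Table 16.1 data: high school GPA (X) and first-year college GPA (Y)
x = [3.5, 2.5, 4.0, 3.8, 2.8, 1.9, 3.2, 3.7, 2.7, 3.3]
y = [3.3, 2.2, 3.5, 2.7, 3.5, 2.0, 3.1, 3.4, 1.9, 3.7]


def regression_sums(xs, ys):
    """Return the five sums used in the slope and intercept formulas."""
    return {
        "sum_x": sum(xs),                                # sum of X
        "sum_y": sum(ys),                                # sum of Y
        "sum_x2": sum(v * v for v in xs),                # sum of X squared
        "sum_y2": sum(v * v for v in ys),                # sum of Y squared
        "sum_xy": sum(a * b for a, b in zip(xs, ys)),    # sum of X times Y
    }


sums = regression_sums(x, y)
for name, value in sums.items():
    print(f"{name}: {value:.2f}")
```

Each printed value should match the Total row of the table, allowing for ordinary floating-point rounding.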

Formula 16.2 is used to compute the slope of the regression line (b in the equation for a straight line):

(16.2)

$$b = \frac{\Sigma XY - (\Sigma X \Sigma Y / n)}{\Sigma X^2 - [(\Sigma X)^2 / n]}.$$

In Formula 16.3, you can see the computed value for b, the slope of the line:

(16.3)

$$b = \frac{94.75 - [(31.4 \times 29.3)/10]}{102.5 - [(31.4)^2/10]}, \qquad b = \frac{2.748}{3.904} = 0.704.$$

Formula 16.4 is used to compute the point at which the line crosses the y-axis (a in the equation for a straight line):

(16.4)

$$a = \frac{\Sigma Y - b\,\Sigma X}{n}.$$

In Formula 16.5, you can see the computed value for a, the intercept of the line:

(16.5)

$$a = \frac{29.3 - (0.704 \times 31.4)}{10}, \qquad a = \frac{7.19}{10} = 0.719.$$

Now, if we go back and substitute b and a into the equation for a straight line (Y′ = bX + a), we come up with the final regression line:

$$Y' = 0.704X + 0.719.$$
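If you would like to check the slope and intercept without a calculator, the sketch below applies Formulas 16.2 and 16.4 to the Table 16.1 data. The function names `slope` and `intercept` are our own, and the slope is rounded to three decimals before the intercept is computed because that is how the hand calculation in the text proceeds.

```python
# Table 16.1 data: high school GPA (X) and first-year college GPA (Y)
x = [3.5, 2.5, 4.0, 3.8, 2.8, 1.9, 3.2, 3.7, 2.7, 3.3]
y = [3.3, 2.2, 3.5, 2.7, 3.5, 2.0, 3.1, 3.4, 1.9, 3.7]


def slope(xs, ys):
    """Formula 16.2: b = [sum(XY) - sum(X)sum(Y)/n] / [sum(X^2) - sum(X)^2/n]."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    return (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)


def intercept(xs, ys, b):
    """Formula 16.4: a = [sum(Y) - b * sum(X)] / n."""
    return (sum(ys) - b * sum(xs)) / len(xs)


b = round(slope(x, y), 3)          # slope rounded as in the hand calculation
a = round(intercept(x, y, b), 3)   # intercept computed from the rounded slope
print(f"Y' = {b}X + {a}")
```

One design note: if the unrounded slope is carried into Formula 16.4 instead, the intercept comes out closer to 0.720 than 0.719. The difference is a rounding effect, not a different line.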

Why the Y′ and not just a plain Y? Remember, we are using X to predict Y, so we use Y′ to mean the predicted and not the actual value of Y.

So, now that we have this equation, what can we do with it? Predict Y, of course.

For example, let’s say that high school GPA equals 2.8 (or X = 2.8). If we substitute the value of 2.8 into the equation, we get the following formula:

$$Y' = 0.704(2.8) + 0.719 = 2.69.$$

So, 2.69 is the predicted value of Y (or Y′) given X is equal to 2.8. Now, for any X score, we can easily and quickly compute a predicted Y score.
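Once the line is known, prediction is a one-line calculation. Here is a small sketch that wraps the regression equation in a function; the name `predict_college_gpa` and the extra X values in the loop are ours, chosen only for illustration.

```python
def predict_college_gpa(high_school_gpa, b=0.704, a=0.719):
    """Return the predicted first-year college GPA, Y' = bX + a."""
    return b * high_school_gpa + a


# Try a few high school GPAs, including the 2.8 used in the text
for gpa in (2.0, 2.8, 3.5):
    print(f"X = {gpa:.1f} -> Y' = {predict_college_gpa(gpa):.2f}")
```

For X = 2.8 this reproduces the 2.69 computed above.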

You can use this formula and the known values to compute predicted values. That’s most of what we just talked about. But you can also plot a regression line to show how well the scores (what you are trying to predict) actually fit the data from which you are predicting. Take another look at Figure 16.2, the plot of the high school–college GPA data. It includes a regression line, which is also called a trend line. How did we get this line? Easy. We used the same charting skills you learned in Chapter 5 to create a scatterplot; then we selected Add Fit Line in the SPSS Chart Editor. Poof! Done!

You can see that the trend is positive (in that the line has a positive slope) and that the correlation is .6835—very positive. And you can see that the data points do not align directly on the line, but they are pretty close, which indicates that there is a relatively small amount of error.
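The correlation quoted here can be reproduced from the same sums used for the slope. The sketch below uses the computational form of the Pearson formula from Chapter 5; the function name `pearson_r` is our own.

```python
import math

# Table 16.1 data: high school GPA (X) and first-year college GPA (Y)
x = [3.5, 2.5, 4.0, 3.8, 2.8, 1.9, 3.2, 3.7, 2.7, 3.3]
y = [3.3, 2.2, 3.5, 2.7, 3.5, 2.0, 3.1, 3.4, 1.9, 3.7]


def pearson_r(xs, ys):
    """Pearson correlation from raw sums (computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # Sum of cross-products and sums of squares, each adjusted for the means
    sp_xy = sum(a * b for a, b in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(a * a for a in xs) - sum_x ** 2 / n
    ss_y = sum(b * b for b in ys) - sum_y ** 2 / n
    return sp_xy / math.sqrt(ss_x * ss_y)


r = pearson_r(x, y)
print(f"r = {r:.4f}")
```

Notice that the numerator of r is the same quantity as the numerator of the slope b, which is why the sign of the correlation always matches the direction of the regression line.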

Not all lines that fit best between a bunch of data points are straight. Rather, they could be curvilinear, just as you can have a curvilinear relationship between your variables, as we discussed in Chapter 5. For example, the relationship between anxiety and performance is such that when people are not at all anxious or very anxious, they don’t perform very well. But if they’re moderately anxious, then performance can be enhanced. The relationship between these two variables is curvilinear, and the prediction of Y from X takes that into account. Dealing with curvilinear relationships is beyond the scope of this book, but fortunately, most relationships you’ll see in the social sciences are essentially linear.

HOW GOOD IS YOUR PREDICTION?

How can we measure how good a job we have done predicting one outcome from another? We know that the higher the absolute magnitude of the correlation between two variables, the better the prediction. In theory, that’s great. But being practical, we can also look at the difference between the predicted value (Y′) and the actual value (Y) when we first compute the formula of the regression line.

For example, if the formula for the regression line is Y′ = 0.704X + 0.719, the predicted Y (or Y′) for an X value of 2.8 is 0.704(2.8) + 0.719, or 2.69. We know that the actual Y value that corresponds to an X value of 2.8 is 3.5 (from the data set shown in Table 16.1). The difference between 3.5 and 2.69 is 0.81, and that’s the size of the error in prediction.
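The same subtraction can be done for every student in Table 16.1 at once. This sketch lists the actual score, the predicted score, and the error in prediction (actual minus predicted) for each data point; the helper name `predict` is ours.

```python
# Table 16.1 data: high school GPA (X) and first-year college GPA (Y)
x = [3.5, 2.5, 4.0, 3.8, 2.8, 1.9, 3.2, 3.7, 2.7, 3.3]
y = [3.3, 2.2, 3.5, 2.7, 3.5, 2.0, 3.1, 3.4, 1.9, 3.7]


def predict(x_value, b=0.704, a=0.719):
    """Predicted Y for one X, using the chapter's regression line."""
    return b * x_value + a


# Error in prediction for each point: actual Y minus predicted Y'
errors = [(xv, yv, yv - predict(xv)) for xv, yv in zip(x, y)]

print(" X     Y     Y'    error")
for xv, yv, err in errors:
    print(f"{xv:.1f}  {yv:.2f}  {predict(xv):.2f}  {err:+.2f}")
```

A positive error means the point sits above the regression line and a negative error means it sits below.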

Another measure of error that you could use is the coefficient of determination (see Chapter 5), which is the percentage of error that is reduced in the relationship between variables. For example, if the correlation between two variables is .4, the coefficient of determination is .4² = .16, or 16%. The reduction in error is 16%, since we start from the assumption that the relationship between the two variables is 0, or 100% error (no predictive value at all).
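As a quick arithmetic check, the coefficient of determination is just the correlation squared. The function name below is our own, and the second value in the loop is the GPA correlation reported earlier in the chapter.

```python
def coefficient_of_determination(r):
    """Square the correlation to get the proportion of error reduced."""
    return r ** 2


for r in (0.4, 0.6835):
    share = coefficient_of_determination(r)
    print(f"r = {r}: r squared = {share:.4f} ({share:.0%} of error reduced)")
```

Because the correlation is squared, a negative correlation of the same size reduces error by exactly the same amount as a positive one.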

If we take all of these differences, we can compute the average amount that each data point differs from the predicted data point, or the standard error of estimate. This is a kind of standard deviation that reflects average error along the line of regression. The value tells us how much imprecision there is in our estimate. As you might expect, the higher the correlation between the two values (and the better the prediction), the lower this standard error of estimate will be. In fact, if the correlation between the two variables is perfect (either +1 or −1), then the standard error of estimate is zero. Why? Because if prediction is perfect, all of the actual data points fall on the regression line, and there’s no error in estimating Y from X.
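The chapter does not give a formula for the standard error of estimate, so the sketch below makes an assumption: it uses the common definition, the square root of the sum of squared errors divided by n − 2, which is the form regression software typically reports. The function names are ours.

```python
import math

# Table 16.1 data: high school GPA (X) and first-year college GPA (Y)
x = [3.5, 2.5, 4.0, 3.8, 2.8, 1.9, 3.2, 3.7, 2.7, 3.3]
y = [3.3, 2.2, 3.5, 2.7, 3.5, 2.0, 3.1, 3.4, 1.9, 3.7]


def fit_line(xs, ys):
    """Least-squares slope and intercept (Formulas 16.2 and 16.4, unrounded)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(a * b for a, b in zip(xs, ys))
    sum_x2 = sum(a * a for a in xs)
    b = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    a = (sum_y - b * sum_x) / n
    return b, a


def standard_error_of_estimate(xs, ys):
    """Assumed definition: sqrt(sum of squared errors / (n - 2))."""
    b, a = fit_line(xs, ys)
    sse = sum((yv - (b * xv + a)) ** 2 for xv, yv in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


print(f"GPA data: {standard_error_of_estimate(x, y):.3f}")

# A perfectly correlated toy data set: every point lies on Y = 2X
print(f"Perfect line: {standard_error_of_estimate([1, 2, 3, 4], [2, 4, 6, 8]):.3f}")
```

The second call illustrates the point made above: when every point falls on the line, the standard error of estimate is zero.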

The predicted Y′, or dependent variable, need not always be a continuous one, such as height, test score, or problem-solving skills. It can be a categorical variable, such as admit/don’t admit, Level A/Level B, or Social Class 1/Social Class 2. The score that’s used in the prediction is “dummy coded” to be a 0 or a 1 (or any two values) and then used in the same equation. Yes, you are right that the level of measurement for this sort of correlational stuff is supposed to be at the interval level, but a variable with just two values works mathematically as if it has equal-sized intervals because there is only one interval.

USING SPSS TO COMPUTE THE REGRESSION LINE

Let’s use SPSS to compute the regression line that predicts Y′ from X. The data set we are using is Chapter 16 Data Set 1. We will be using the number of hours of training to predict how severe injuries will be if someone is injured playing football.

There are two variables in this data set:

Variable        Definition
Training (X)    Number of hours per week of strength training
Injuries (Y)    Severity of injuries on a scale from 1 to 10

Here are the steps to compute the regression line that we discussed in this chapter. Follow along and do it yourself.

1. Open the file named Chapter 16 Data Set 1.

2. Click Analyze → Regression → Linear. You’ll see the Linear Regression dialog box shown in Figure 16.5.

3. Click on the variable named Injuries and then move it to the Dependent: variable box. It’s the dependent variable because its value depends on the value of number of hours of training. In other words, it’s the variable being predicted.

4. Click on the variable named Training and then move it to the Independent(s): variable box.

5. Click OK, and you will see the partial results of the analysis, as shown in Figure 16.6.

We’ll get to the interpretation of this output in a moment. First, let’s have SPSS overlay a regression line on the scatterplot for these data like the one you saw earlier in Figure 16.2.

6. Click Graphs → Legacy Dialogs → Scatter/Dot.

7. Click Simple Scatter and then click Define. You’ll see the simple Scatterplot dialog box.

8. Click Injuries and move it to the Y Axis: box. Remember, the predicted variable is represented by the y-axis.

9. Click Training and move it to the X Axis: box.

10. Click OK, and you will see the scatterplot as shown in Figure 16.7.

Now let’s draw the regression line.

11. If you are not in the chart editor, double-click on the chart to select it for editing.

12. Click on the Add Fit Line at Total button (on the second row of buttons, about fifth from the left).

13. Close the Properties box that opened when you selected the Add Fit Line at Total button and then close the chart editor window. The completed scatterplot, with the regression line, is shown in Figure 16.8 along with the multiple regression value R², which equals 0.21. As you will read more about shortly, the multiple regression correlation coefficient is the regression of all the X values on the predicted value.

When you have the Properties dialog box open for drawing the regression line, notice that there is a set of Confidence Intervals options. When clicked, these show you a boundary within which there is a specific probability as to how good the prediction is. For example, if you click Mean and specify 95%, the graph will show you the boundaries surrounding the regression line, within which there is a 95% chance of the predicted scores occurring. This idea of wanting to be within a certain range of error 95% of the time is the same as wanting a .05 significance level for statistical analyses.

Figure 16.5 ⬢ Linear Regression dialog box

Understanding the SPSS Output

The SPSS output tells us several things:

1. The formula for the regression line is taken from the first set of output shown in Figure 16.6 as Y′ = –0.125X + 6.847. This equation can be used to predict level of injury given any number of hours spent in strength training.

2. As you can see in Figure 16.8, the regression line has a negative slope, reflecting a negative correlation (of –.458, which is what Beta is in Figure 16.6) between hours of training and severity of injuries. So it appears, given the data, that the more one trains, the fewer severe injuries occur.

3. You can also see that the prediction is significant. In other words, predicting Y from X is based on a significant relationship between the two variables: the tests of significance for both the constant (the intercept) and the predictor (Training) show values significantly different from zero (zero is what we would expect if X had no predictive value for Y).

So just how good is the prediction? Well, the SPSS output (which we did not show you) also indicates that the standard error of estimate for Injuries (the predicted variable) is 2.182; double that (4.36) and you’ll see that there is a 95% chance (remember, 1.96 or about 2 standard deviations away from the mean creates a 95% confidence interval) the prediction will fall within ±4.36 of the mean of all injuries (which is 4.33). So, based on the correlation coefficient, the prediction is okay but not great.
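The numbers reported from the SPSS output can be put to work in a few lines. This sketch uses only the values quoted in the text (slope, intercept, Beta, and standard error of estimate); the function name and the choice of 10 training hours are ours, purely for illustration.

```python
# Values quoted from the SPSS output described in the text
B = -0.125      # unstandardized slope for Training
A = 6.847       # constant (intercept)
BETA = -0.458   # standardized coefficient, equal to r with one predictor
SEE = 2.182     # standard error of estimate for Injuries


def predict_injury_severity(training_hours):
    """Predicted injury severity, Y' = -0.125X + 6.847."""
    return B * training_hours + A


hours = 10  # hypothetical number of weekly training hours
print(f"Predicted severity at {hours} hours: {predict_injury_severity(hours):.3f}")

# Roughly 2 standard errors on either side gives the approximate 95% band
band = 2 * SEE
print(f"Approximate 95% band: +/-{band:.2f}")

# With a single predictor, squaring Beta gives the R squared shown on the chart
print(f"R squared: {BETA ** 2:.2f}")
```

A band of more than 4 points on a 10-point severity scale is wide, which is another way of seeing why the prediction is described as okay but not great.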

THE MORE PREDICTORS THE BETTER? MAYBE

All of the examples that we have used so far in the chapter have been for one criterion or outcome measure and one predictor variable. There is also the case of regression where more than one predictor or independent variable is used to predict a particular outcome. If one variable can predict an outcome with some degree of accuracy, then why couldn’t two do a better job? Maybe so, but there’s a big caveat—read on.

For example, if high school GPA is a pretty good indicator of college GPA, then how about high school GPA plus number of hours of extracurricular activities? So, instead of

$$Y' = bX + a,$$

the model for the regression equation becomes

$$Y' = b_1X_1 + b_2X_2 + a,$$

where

· X1 is the value of the first independent variable,

· X2 is the value of the second independent variable,

· b1 and b2 are the regression weights for X1 and X2, respectively, and

· a is the intercept of the regression line, or where the regression line crosses the y-axis.

As you may have guessed, this model is called multiple regression (multiple predictors, right?). So, in theory anyway, you are predicting an outcome from two independent variables rather than one. But you want to add additional predictor variables only under certain conditions. Read on.
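To make the two-predictor model concrete, here is a minimal sketch that finds the two regression weights and the intercept by ordinary least squares. The chapter leaves this computation to SPSS, so the function name `fit_two_predictors` and the approach (solving the two normal equations directly) are our own. The data are a made-up toy set built to follow Y = 2·X1 + 3·X2 + 1 exactly, so you can see the function recover known values.

```python
def fit_two_predictors(x1, x2, y):
    """Least-squares weights for Y' = b1*X1 + b2*X2 + a."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    # Sums of squares and cross-products of deviations from the means
    s11 = sum((u - m1) ** 2 for u in x1)
    s22 = sum((v - m2) ** 2 for v in x2)
    s12 = sum((u - m1) * (v - m2) for u, v in zip(x1, x2))
    s1y = sum((u - m1) * (w - my) for u, w in zip(x1, y))
    s2y = sum((v - m2) * (w - my) for v, w in zip(x2, y))
    det = s11 * s22 - s12 ** 2
    if det == 0:
        # The predictors are perfectly related, so their weights cannot be separated
        raise ValueError("predictors are perfectly correlated")
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = my - b1 * m1 - b2 * m2
    return b1, b2, a


# Hypothetical toy data, generated from Y = 2*X1 + 3*X2 + 1
toy_x1 = [1, 2, 3, 4, 5]
toy_x2 = [2, 1, 4, 3, 6]
toy_y = [2 * u + 3 * v + 1 for u, v in zip(toy_x1, toy_x2)]

b1, b2, a = fit_two_predictors(toy_x1, toy_x2, toy_y)
print(f"Y' = {b1:.3f}*X1 + {b2:.3f}*X2 + {a:.3f}")
```

The `det == 0` check hints at the caveat discussed next: when two predictors overlap completely, there is no way to tell their contributions apart.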


Any variable you add has to make a unique contribution to understanding the dependent variable. Otherwise, why use it? What do we mean by unique? The additional variable needs to explain differences in the predicted variable that the first predictor does not. That is, the two variables in combination should predict Y better than any one of the variables would do alone.

In our example, level of participation in extracurricular activities could make a unique contribution. But should we add a variable such as the number of hours each student studied in high school as a third independent variable or predictor? Because number of hours of study is probably highly related to high school GPA (another of our predictor variables, remember?), study time probably would not add very much to the overall prediction of college GPA. We might be better off looking for another variable (such as ratings on letters of recommendation) rather than collecting the data on study time.

Take a look at Figure 16.9, which is the result of a multiple regression analysis that adds the number of extracurricular activity hours to the data you saw in Table 16.1. You can see how both high school GPA and number of hours of extracurricular activity are significant contributors to first-year college GPA. This is a powerful way of examining whether and how more than one independent variable contributes to the prediction of another variable.

Figure 16.9 ⬢ A multiple regression analysis

The Big Rule(s) When It Comes to Using Multiple Predictor Variables

If you are using more than one predictor variable, try to keep the following two important guidelines in mind:

1. When selecting a variable to predict an outcome, select a predictor variable (X) that is related to the criterion variable (Y). That way, the two share something in common (remember, they should be correlated).

2. When selecting more than one predictor variable (such as X1 and X2), try to select variables that are independent or uncorrelated with one another but are both related to the outcome or predicted (Y) variable.

In effect, you want only independent or predictor variables that are related to the dependent variable and are unrelated to each other. That way, each one makes as distinct a contribution as possible to predicting the dependent or predicted variable.
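One way to apply the two guidelines before running a multiple regression is to look at the correlations directly. The sketch below is an illustration under our own assumptions: the function names and the tiny data set are hypothetical and are not taken from any of the chapter’s data files.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sp = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    ss_x = sum((a - mx) ** 2 for a in xs)
    ss_y = sum((b - my) ** 2 for b in ys)
    return sp / math.sqrt(ss_x * ss_y)


def screen_predictors(predictors, outcome):
    """Return each predictor's r with the outcome and r for each predictor pair."""
    names = list(predictors)
    with_outcome = {name: pearson_r(predictors[name], outcome) for name in names}
    between = {
        (p, q): pearson_r(predictors[p], predictors[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }
    return with_outcome, between


# Hypothetical scores for six people on two candidate predictors and one outcome
candidates = {
    "X1": [1, 2, 3, 4, 5, 6],
    "X2": [3, 1, 4, 2, 6, 5],
}
outcome = [2, 3, 5, 4, 8, 7]

with_outcome, between = screen_predictors(candidates, outcome)
for name, r in with_outcome.items():
    print(f"r({name}, Y) = {r:.2f}")       # guideline 1: want these large
for (p, q), r in between.items():
    print(f"r({p}, {q}) = {r:.2f}")        # guideline 2: want these small
```

In practice you would read the same information from a correlation matrix in SPSS; the point is simply to check both guidelines before adding a predictor.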

There are whole books on multiple regression, and much of what one needs to learn about this powerful procedure is beyond the scope of this book. Chapter 18 talks more about multiple regression.

How many predictor variables are too many? Well, if one variable predicts some outcome, and two are even more accurate, then why not three, four, or five predictor variables? In practical terms, every time you add a variable, an expense is incurred. Someone has to go collect the data, it takes time (which is $$$ when it comes to research budgets), and so on. From a theoretical sense, there is a fixed limit on how many variables can contribute to an understanding of what we are trying to predict. Remember that it is best when the predictor or independent variables are independent or unrelated to each other. The problem is that once you get to three or four variables, fewer things can remain unrelated. Better to be accurate and conservative than to include too many variables and waste money and the power of prediction.

Real-World Stats

How children feel about what they do is often very closely related to how well they do what they do. The aim of this study was to analyze the consequences of emotion during a writing exercise. In the model this research follows, motivation and affect (the experience of emotion) play an important role during the writing process. Fourth and fifth graders were instructed to write autobiographical narratives with no emotional content, positive emotional content, and negative emotional content. The results showed no effect regarding these instructions on the proportion of spelling errors, but the results did reveal an effect on the length of narrative the children wrote. A simple regression analysis (just like the ones we did and discussed in this chapter) showed a correlation and some predictive value between working memory capacity and the number of spelling errors in the neutral condition only. Since the model on which the researchers based much of their preliminary thought about this topic states that emotions can increase the cognitive load or the amount of “work” necessary in writing, that becomes the focus of the discussion in this research article.

Want to know more? Go online or to the library and find …

Fartoukh, M., Chanquoy, L., & Piolat, A. (2012). Effects of emotion on writing processes in children. Written Communication, 29, 391–411.

Summary

Prediction is a useful application of simple correlation, and it is a very powerful tool for examining complex relationships. This chapter might have been a little more difficult than others, but you’ll be well served by what you have learned, especially if you can apply it to the research reports and journal articles that you have to read. We are now at the end of lots of chapters on inference, and we’re about to move on in the next part of this book to using statistics when the sample size is very small or when the assumption that the scores are distributed in a normal way is violated.

Time to Practice

1. How does linear regression differ from analysis of variance?

2. Chapter 16 Data Set 2 contains the data for a group of participants who took a timed test. The data are the average amount of time the participants took on each item (Time) and the number of guesses it took to get each item correct (Correct).

a. What is the regression equation for predicting response time from number correct?

b. What is the predicted response time if the number correct is 8?

c. What is the difference between the predicted and the actual number correct for each of the predicted response times?

Time to Practice Video

Chapter 16: Problem 2

Chapter 16, Problem 2 will show you how to compute a linear regression, so you can answer the question about the regression equation and make some predictions. Doing a regression is pretty straightforward. First, we want to go to our SPSS Data Set, which is Chapter 16, Data Set 2. Once we're here, we want to go to Analyze, then Regression, then Linear, since we're using just two variables.

When we click on Linear, you want to think about where you place your dependent and independent variables. The dependent is the outcome, or that thing that is dependent on something else. In this case, the number of correct answers is dependent on the amount of time we think they spent on it. So, correct answers is the dependent and time is the independent. Click OK, and our output pops up.

The key thing we're going to look at for making the prediction is under Coefficients. And, for doing this, we're going to look at the Unstandardized Beta Weights. Here you see 7.414 and negative ... These are the two bits of information. So let's refresh ourselves with what the data set looks like in terms of creating a regression equation. This is the regression equation. And so, what we have already, we figured out the beta and then we figured out the point at which it crosses the y-axis. When we put that information here, here is our regression equation, y equals negative ... So, that's the equation for Part A.

Part B asks us to make a prediction. What is the predicted time if the number of correct answers is 8? In this situation, we take our regression equation, but now we add the x, in this case 8, the number of correct answers, we do the computation, and we come up with 6.262, which is the predicted time.

Part C asks you to determine the difference between the predicted and actual time for each of them. And then all you want to do there is, when you look at your data set, take the predicted and compute that, but instead of putting in 8, put in each of these numbers, and compute the difference. And that's how we answer Chapter 16, Problem 2. Good luck.

1. Betsy is interested in predicting how many 75-year-olds will develop Alzheimer’s disease and is using as predictors level of education and general physical health graded on a scale from 1 to 10. But she is interested in using other predictor variables as well. Answer the following questions:

a. What criteria should she use in the selection of other predictors? Why?

b. Name two other predictors that you think might be related to the development of Alzheimer’s disease.

c. With the four predictor variables—level of education and general physical health and the two new ones that you named in (b)—draw out what the model of the regression equation would look like.

2. Go to the library or online and locate three different research studies in your area of interest that use linear regression. It’s okay if the studies contain more than one predictor variable. Answer the following questions for each study:

a. What is one independent variable? What is the dependent variable?

b. If there is more than one independent variable, what argument does the researcher make that these variables are independent of one another?

c. Which of the three studies seems to present the least convincing evidence that the dependent variable is predicted by the independent variable, and why?

3. Here’s where you can apply the information in one of this chapter’s tips and get a chance to predict a Super Bowl winner! Joe Coach was curious to know whether the average number of games won in a year predicts Super Bowl performance (win or lose). The X variable was the average number of games won during the past 10 seasons. The Y variable was whether the team ever won the Super Bowl during the past 10 seasons. Here are the data:

Team                     Average Number of Wins Over 10 Years    Won Super Bowl? (1 = yes and 0 = no)
Savannah Sharks          12                                      1
Pittsburgh Pelicans      11                                      0
Williamstown Warriors    15                                      0
Bennington Bruisers      12                                      1
Atlanta Angels           13                                      1
Trenton Terrors          16                                      0
Virginia Vipers          15                                      1
Charleston Crooners       9                                      0
Harrisburg Heathens       8                                      0
Eaton Energizers         12                                      1

a. How would you assess the usefulness of the average number of wins as a predictor of whether a team ever won a Super Bowl?

b. What’s the advantage of being able to use a categorical variable (such as 1 or 0) as a dependent variable?

c. What other variables might you use to predict the dependent variable, and why would you choose them?

4. Check your calculation of the correlation coefficient of the relationship between coffee consumption and stress done in Chapter 15, Question 5. If you want to know whether coffee consumption predicts group membership:

a. What is the predictor or the independent variable?

b. What is the criterion or the dependent variable?

c. Do you have an idea what R² will be?

5. Time to try out multiple predictor variables. Take a look at the data shown here where the outcome is becoming a great chef. We suspect that variables such as number of years of experience cooking, level of formal culinary education, and number of different positions (sous chef, pasta station, etc.) all contribute to rankings or scores on the Great Chef Test.

By this time, you should be pretty much used to creating equations from data like these, so let’s get to the real questions:

Years of Experience    Level of Education    # Positions    Score on Great Chef Test
 5                     1                      5             88
 6                     2                      4             78
12                     3                      9             56
21                     3                      8             88
 7                     2                      5             97
 9                     1                      8             90
13                     2                      8             79
16                     2                      9             85
21                     2                      9             60
11                     1                      4             89
15                     2                      7             88
15                     3                      7             76
 1                     3                      3             78
17                     2                      6             98
26                     2                      8             91
11                     2                      6             88
18                     3                      7             90
31                     3                     12             98
27                     2                     16             88

a. Which are the best predictors of a chef’s score?

b. What score can you expect from a person with 12 years of experience and a Level 2 education who has held five positions?

6. Take a look at Chapter 16 Data Set 3, where number of home sales (Number_Homes_Sold) is being predicted by years in the business (Years_In_Business) and level of education in years (Level_Of_Education). Why is level of education such a poor contributor to the overall prediction (using both years in the business and level of education combined) of number of homes sold? What’s the best predictor and how do you know? (Hint: These are sort of trick questions. Before you go ahead and analyze the data, look at the raw data in the file for the characteristics you know are important for one variable to be correlated with another.)

7. For any combinations of predicted and predictor variables, what should be the nature of the relationship between them?
