Statistics
Exercise 29
Calculating Simple Linear Regression
Simple linear regression is a procedure that provides an estimate of the value of a dependent variable (outcome) based on the value of an independent variable (predictor). Knowing that estimate with some degree of accuracy, we can use regression analysis to predict the value of one variable if we know the value of the other variable (Cohen & Cohen, 1983). The regression equation is a mathematical expression of the influence that a predictor has on a dependent variable, based on some theoretical framework. For example, in Exercise 14, Figure 14-1 illustrates the linear relationship between gestational age and birth weight. As shown in the scatterplot, there is a strong positive relationship between the two variables. Advanced gestational ages predict higher birth weights.
A regression equation can be generated with a data set containing subjects' x and y values. Once this equation is generated, it can be used to predict future subjects' y values, given only their x values. In simple or bivariate regression, predictions are made in cases with two variables. The score on variable y (dependent variable, or outcome) is predicted from the same subject's known score on variable x (independent variable, or predictor).
Research Designs Appropriate for Simple Linear Regression
Research designs that may utilize simple linear regression include any associational design (Gliner et al., 2009). The variables involved in the design are attributional, meaning the variables are characteristics of the participant, such as health status, blood pressure, gender, diagnosis, or ethnicity. Regardless of the nature of the variables, the dependent variable submitted to simple linear regression must be measured as continuous, at the interval or ratio level.
Statistical Formula and Assumptions
Use of simple linear regression involves the following assumptions (Zar, 2010):
1. Normal distribution of the dependent (y) variable
2. Linear relationship between x and y
3. Independent observations
4. No (or little) multicollinearity
5. Homoscedasticity
Data that are homoscedastic are evenly dispersed both above and below the regression line, which indicates a linear relationship on a scatterplot. Homoscedasticity reflects equal variance of both variables. In other words, for every value of x, the distribution of y values should have equal variability. If the data for the predictor and dependent variable are not homoscedastic, inferences made during significance testing could be invalid (Cohen & Cohen, 1983; Zar, 2010). Visual examples of homoscedasticity and heteroscedasticity are presented in Exercise 30.
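The idea of equal variability in y across the range of x can also be examined numerically. The following is a minimal sketch, not a formal test: it fits a least-squares line and compares the variance of the residuals in the lower and upper halves of the x values. The function name `residual_variance_by_half` and the sample values are hypothetical and are not taken from the exercise.

```python
from statistics import mean, pvariance


def residual_variance_by_half(x, y):
    """Fit y = bx + a by least squares, then return the residual variance
    for the lower half and the upper half of the x values."""
    mean_x, mean_y = mean(x), mean(y)
    b = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - b * mean_x
    # Residual = observed y minus predicted y, with pairs ordered by x.
    residuals = [yi - (b * xi + a) for xi, yi in sorted(zip(x, y))]
    half = len(residuals) // 2
    return pvariance(residuals[:half]), pvariance(residuals[half:])


# Hypothetical illustration data (not from the exercise).
x_demo = [1, 2, 3, 4, 5, 6, 7, 8]
y_demo = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
low_var, high_var = residual_variance_by_half(x_demo, y_demo)
print(f"lower-half residual variance: {low_var:.3f}")
print(f"upper-half residual variance: {high_var:.3f}")
```

Roughly similar values in the two halves are consistent with homoscedasticity, whereas a large difference suggests the spread of y changes with x. A scatterplot of the residuals remains the more usual check.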
In simple linear regression, the dependent variable is continuous, and the predictor can be any scale of measurement; however, if the predictor is nominal, it must be correctly coded. Once the data are ready, the parameters a and b are computed to obtain a regression equation. To understand the mathematical process, recall the algebraic equation for a straight line:
y = bx + a
where y is the dependent variable, x is the predictor, b is the slope of the line, and a is the y-intercept.
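Writing the line as y = bx + a, with b the slope and a the y-intercept, the two parameters can be obtained from sums over the data. The formulas below are the standard least-squares computational forms, given here as a sketch; n is assumed to denote the number of subjects, and the notation may differ from the exercise's own presentation.

```latex
b = \frac{n\sum xy - \sum x \sum y}{n\sum x^{2} - \left(\sum x\right)^{2}}
\qquad
a = \frac{\sum y - b\sum x}{n}
```

These forms use only Σx, Σy, Σx², and Σxy, which is why a hand-computation table typically carries x² and xy columns.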
Data for Additional Computational Practice for the Questions to be Graded
In the example from Mancini and colleagues (2014), students enrolled in an RN to BSN program were assessed for demographics at enrollment. The predictor in this example is age at program enrollment, and the dependent variable is the number of months it took for the student to complete the RN to BSN program. The null hypothesis is: “Student age at enrollment does not predict the number of months until completion of an RN to BSN program.” The data are presented in Table 29-2. A simulated subset of 20 students was randomly selected for this example so that the computations would be small and manageable.
TABLE 29-2
AGE AT ENROLLMENT AND MONTHS TO COMPLETION IN AN RN TO BSN PROGRAM
| Student ID | x (Student Age) | y (Months to Completion) | x² | xy |
|---|---|---|---|---|
| 1 | 23 | 17 | 529 | 391 |
| 2 | 24 | 9 | 576 | 216 |
| 3 | 24 | 17 | 576 | 408 |
| 4 | 26 | 9 | 676 | 234 |
| 5 | 31 | 16 | 961 | 496 |
| 6 | 31 | 11 | 961 | 341 |
| 7 | 32 | 15 | 1,024 | 480 |
| 8 | 33 | 12 | 1,089 | 396 |
| 9 | 33 | 15 | 1,089 | 495 |
| 10 | 34 | 12 | 1,156 | 408 |
| 11 | 34 | 14 | 1,156 | 476 |
| 12 | 35 | 10 | 1,225 | 350 |
| 13 | 35 | 17 | 1,225 | 595 |
| 14 | 39 | 20 | 1,521 | 780 |
| 15 | 40 | 9 | 1,600 | 360 |
| 16 | 42 | 12 | 1,764 | 504 |
| 17 | 42 | 14 | 1,764 | 588 |
| 18 | 44 | 10 | 1,936 | 440 |
| 19 | 51 | 17 | 2,601 | 867 |
| 20 | 24 | 11 | 576 | 264 |
| Sum (Σ) | 677 | 267 | 24,005 | 9,089 |
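The hand computations for Table 29-2 can be cross-checked with a short script. The sketch below assumes the standard least-squares formulas for the line y = bx + a, with b as the slope and a as the y-intercept; the function names `fit_simple_regression` and `r_squared` are illustrative and do not come from the exercise or from SPSS.

```python
# Data from Table 29-2: student age at enrollment (x) and months to completion (y).
ages = [23, 24, 24, 26, 31, 31, 32, 33, 33, 34,
        34, 35, 35, 39, 40, 42, 42, 44, 51, 24]
months = [17, 9, 17, 9, 16, 11, 15, 12, 15, 12,
          14, 10, 17, 20, 9, 12, 14, 10, 17, 11]


def fit_simple_regression(x, y):
    """Return (b, a) for the line y = bx + a using the sums-based formulas."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_x2 = sum(xi * xi for xi in x)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    # Slope: (n*Sxy - Sx*Sy) / (n*Sx2 - Sx^2)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    # Intercept: (Sy - b*Sx) / n
    a = (sum_y - b * sum_x) / n
    return b, a


def r_squared(x, y, b, a):
    """Proportion of variance in y explained by the fitted line."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - (b * xi + a)) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_resid / ss_total


b, a = fit_simple_regression(ages, months)
r2 = r_squared(ages, months, b, a)
print(f"b (slope)     = {b:.4f}")
print(f"a (intercept) = {a:.4f}")
print(f"R^2           = {r2:.4f}")
```

The intermediate sums in `fit_simple_regression` correspond to the Σ row of the table, so they can be compared column by column against the hand-calculated totals before b and a are computed.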
EXERCISE 29 Questions to Be Graded
Name: _______________________________________________________ Class: _____________________
Date: ___________________________________________________________________________________
Follow your instructor's directions to submit your answers to the following questions for grading. Your instructor may ask you to write your answers below and submit them as a hard copy for grading. Alternatively, your instructor may ask you to use the space below for notes and submit your answers online at http://evolve.elsevier.com/Grove/Statistics/ under “Questions to Be Graded.”
1. If you have access to SPSS, compute the Shapiro-Wilk test of normality for the variable age (as demonstrated in Exercise 26). If you do not have access to SPSS, plot the frequency distributions by hand. What do the results indicate?
2. State the null hypothesis where age at enrollment is used to predict the time for completion of an RN to BSN program.
3. What is b as computed by hand (or using SPSS)?
4. What is a as computed by hand (or using SPSS)?
5. Write the new regression equation.
6. How would you characterize the magnitude of the obtained R² value? Provide a rationale for your answer.
7. How much variance in months to RN to BSN program completion is explained by knowing the student's enrollment age?
8. What was the correlation between the actual y values and the predicted y values using the new regression equation in the example?
9. Write your interpretation of the results as you would in an APA-formatted journal.
10. Given the results of your analyses, would you use the calculated regression equation to predict future students' program completion time by using enrollment age as x? Provide a rationale for your answer.