Forecasting: Time series and trend analysis
115
4
After completing this chapter, students will be able to:
CHAPTER OUTLINE
LEARNING OBJECTIVES
Regression Models
1. Identify variables and use them in a regression model.
2. Develop simple linear regression equations from sample data and interpret the slope and intercept.
3. Compute the coefficient of determination and the coefficient of correlation and interpret their meanings.
4. Interpret the F test in a linear regression model.
5. List the assumptions used in regression and use residual plots to identify problems.
4.1 Introduction
4.2 Scatter Diagrams
4.3 Simple Linear Regression
4.4 Measuring the Fit of the Regression Model
4.5 Using Computer Software for Regression
4.6 Assumptions of the Regression Model
4.7 Testing the Model for Significance
4.8 Multiple Regression Analysis
4.9 Binary or Dummy Variables
4.10 Model Building
4.11 Nonlinear Regression
4.12 Cautions and Pitfalls in Regression Analysis
6. Develop a multiple regression model and use it for prediction purposes.
7. Use dummy variables to model categorical data.
8. Determine which variables should be included in a multiple regression model.
9. Transform a nonlinear function into a linear one for use in regression.
10. Understand and avoid mistakes commonly made in the use of regression analysis.
CHAPTER
Summary • Glossary • Key Equations • Solved Problems • Self-Test • Discussion Questions and Problems
Case Study: North–South Airline • Bibliography
Appendix 4.1 Formulas for Regression Calculations
Appendix 4.2 Regression Models Using QM for Windows
Appendix 4.3 Regression Analysis in Excel QM or Excel 2007
4.1 Introduction
Regression analysis is a very valuable tool for today’s manager. Regression has been used to model such things as the relationship between level of education and income, the price of a house and the square footage, and the sales volume for a company relative to the dollars spent on advertising. When businesses are trying to decide which location is best for a new store or branch office, regression models are often used. Cost estimation models are often regression models. The applicability of regression analysis is virtually limitless.
There are generally two purposes for regression analysis. The first is to understand the relationship between variables such as advertising expenditures and sales. The second purpose is to predict the value of one variable based on the value of the other. Because of this, regression is a very important forecasting technique and will be mentioned again in Chapter 5.
In this chapter, the simple linear regression model will first be developed, and then a more complex multiple regression model will be used to incorporate even more variables into our model. In any regression model, the variable to be predicted is called the dependent variable or response variable. The value of this is said to be dependent upon the value of an independent variable, which is sometimes called an explanatory variable or a predictor variable.
Two purposes of regression analysis are to understand the relationship between variables and to predict the value of one based on the other.
4.2 Scatter Diagrams
To investigate the relationship between variables, it is helpful to look at a graph of the data. Such a graph is often called a scatter diagram or a scatter plot. Normally the independent variable is plotted on the horizontal axis and the dependent variable is plotted on the vertical axis. The following example will illustrate this.
Triple A Construction Company renovates old homes in Albany. Over time, the company has found that its dollar volume of renovation work is dependent on the Albany area payroll. The figures for Triple A’s revenues and the amount of money earned by wage earners in Albany for the past six years are presented in Table 4.1. Economists have predicted the local area pay- roll to be $600 million next year, and Triple A wants to plan accordingly.
Figure 4.1 provides a scatter diagram for the Triple A Construction data given in Table 4.1. This graph indicates that higher values for the local payroll seem to result in higher sales for the company. There is not a perfect relationship because not all the points lie in a straight line, but there is a relationship. A line has been drawn through the data to help show the relationship that exists between the payroll and sales. The points do not all lie on the line, so there would be some error involved if we tried to predict sales based on payroll using this or any other line. Many lines could be drawn through these points, but which one best represents the true relationship? Regression analysis provides the answer to this question.
116 CHAPTER 4 • REGRESSION MODELS
A scatter diagram is a graph of the data.
TRIPLE A’S SALES LOCAL PAYROLL ($100,000s) ($100,000,000s)
6 3
8 4
9 6
5 4
4.5 2
9.5 5
TABLE 4.1 Triple A Construction Company Sales and Local Payroll
4.3 SIMPLE LINEAR REGRESSION 117
0 1 2 0
2
4
6
8
10
12
3 4 5 6 7 8
Payroll ($100 million)
S al
es (
$1 00
,0 00
)
FIGURE 4.1 Scatter Diagram of Triple A Construction Company Data
4.3 Simple Linear Regression
In any regression model, there is an implicit assumption (which can be tested) that a relationship exists between the variables. There is also some random error that cannot be predicted. The un- derlying simple linear regression model is
(4-1)
where
dependent variable (response variable)
independent variable (predictor variable or explanatory variable)
intercept (value of Y when )
slope of regression line
random error
The true values for the intercept and slope are not known, and therefore they are estimated using sample data. The regression equation based on sample data is given as
(4-2)
where
predicted value of Y
estimate of based on sample results
estimate of based on sample results
In the Triple A Construction example, we are trying to predict the sales, so the dependent variable (Y) would be sales. The variable we use to help predict sales is the Albany area payroll, so this is the independent variable (X). Although any number of lines can be drawn through these points to show a relationship between X and Y in Figure 4.1, the line that will be chosen is the one that in some way minimizes the errors. Error is defined as
(4-3)
Since errors may be positive or negative, the average error could be zero even though there are extremely large errors—both positive and negative. To eliminate the difficulty of negative errors
e = Y - YN
Error = 1Actual value2 - 1Predicted value2
b1,b1 = b0,b0 =
YN =
YN = b0 + b1X
P = b1 =
X = 0b0 = X = Y =
Y = b0 + b1X + P
The dependent variable is Y and the independent variable is X.
Estimates of the slope and intercept are found from sample data.
The regression line minimizes the sum of the squared errors.
118 CHAPTER 4 • REGRESSION MODELS
canceling positive errors, the errors can be squared. The best regression line will be defined as the one with the minimum sum of the squared errors. For this reason, regression analysis is sometimes called least-squares regression.
Statisticians have developed formulas that we can use to find the equation of a straight line that would minimize the sum of the squared errors. The simple linear regression equation is
The following formulas can be used to compute the intercept and the slope:
(4-4)
(4-5)
The preliminary calculations are shown in Table 4.2. There are other “shortcut” formulas that are helpful when doing the computations on a calculator, and these are presented in Appendix 4.1. They will not be shown here, as computer software will be used for most of the other examples in this chapter.
Computing the slope and intercept of the regression equation for the Triple A Construction Company example, we have
The estimated regression equation therefore is
or
If the payroll next year is $600 million then the predicted value would be
or $950,000.
YN = 2 + 1.25162 = 9.5
1X = 62,sales = 2 + 1.251payroll2 YN = 2 + 1.25X
b0 = Y - b1X = 7 - 11.252142 = 2
b1 = ©1X - X21Y - Y2
©1X - X22 = 12.5
10 = 1.25
Y = ©X
6 =
42
6 = 7
X = ©X
6 =
24
6 = 4
b0 = Y - b1X
b1 = ©1X - X21Y - Y2
©1X - X22 Y =
©Y
n = average 1mean2 of Y values
X = ©X
n = average 1mean2 of X values
YN = b0 + b1X
Y X (X – X– )2 (X – X– )(Y – Y– )
6 3 (3 – 4)2 = 1 (3 – 4)(6 – 7) = 1
8 4 (4 – 4)2 = 0 (4 – 4)(8 – 7) = 0
9 6 (6 – 4)2 = 4 (6 – 4)(9 – 7) = 4
5 4 (4 – 4)2 = 0 (4 – 4)(5 – 7) = 0
4.5 2 (2 – 4)2 = 4 (2 – 4)(4.5 – 7) = 5
9.5 5 (5 – 4)2 = 1 (5 – 4)(9.5 – 7) = 2.5
Y � 42 X � 24
Y = 24>6 = 4Y = 42>6 = 7
g1X - X21Y - Y2 = 12.5g1X - X22 = 10gg
TABLE 4.2 Regression Calculations for Triple A Construction
4.4 MEASURING THE FIT OF THE REGRESSION MODEL 119
4.4 Measuring the Fit of the Regression Model
A regression equation can be developed for any variables X and Y, even random numbers. We certainly would not have any confidence in the ability of one random number to predict the value of another random number. How do we know that the model is actually helpful in predicting Y based on X? Should we have confidence in this model? Does the model provide better predic- tions (smaller errors) than simply using the average of the Y values?
In the Triple A Construction example, sales figures (Y) varied from a low of 4.5 to a high of 9.5, and the mean was 7. If each sales value is compared with the mean, we see how far they deviate from the mean and we could compute a measure of the total variability in sales. Because Y is sometimes higher and sometimes lower than the mean, there may be both positive and nega- tive deviations. Simply summing these values would be misleading because the negatives would cancel out the positives, making it appear that the numbers are closer to the mean than they actually are. To prevent this problem, we will use the sum of the squares total (SST) to meas- ure the total variability in Y:
(4-6)
If we did not use X to predict Y, we would simply use the mean of Y as the prediction, and the SST would measure the accuracy of our predictions. However, a regression line may be used to predict the value of Y, and while there are still errors involved, the sum of these squared errors will be less than the total sum of squares just computed. The sum of the squares error (SSE) is
(4-7)
Table 4.3 provides the calculations for the Triple A Construction Example. The mean is compared to each value and we get
The prediction for each observation is computed and compared to the actual value. This results in
The SSE is much lower than the SST. Using the regression line has reduced the variability in the sum of squares by This is called the sum of squares due to22.5 - 6.875 = 15.625.
SSE = 6.875
1YN2 SST = 22.5
1Y = 72 SSE = ge2 = g1Y - YN 22
SST = g1Y - Y22
Deviations (errors) may be positive or negative.
The SST measures the total variability in Y about the mean.
The SSE measures the variability in Y about the regression line.
One of the purposes of regression is to understand the relationship among variables. This model tells us that for each $100 million (represented by X) increase in the payroll, we would ex- pect the sales to increase by $125,000 since ($100,000s). This model helps Triple A Construction see how the local economy and company sales are related.
b1 = 1.25
TABLE 4.3 Sum of Squares for Triple A Construction
Y X (Y � Y– )2 Ŷ (Y �Ŷ)2 (Ŷ � Y– )2
6 3 (6 – 7)2 � 1 2 � 1.25(3) � 5.75 0.0625 1.563
8 4 (8 – 7)2 � 1 2 � 1.25(4) � 7.00 1 0
9 6 (9 – 7)2 � 4 2 � 1.25(6) � 9.50 0.25 6.25
5 4 (5 – 7)2 � 4 2 � 1.25(4) � 7.00 4 0
4.5 2 (4.5 – 7)2 � 6.25 2 � 1.25(2) � 4.50 0 6.25
9.5 5 (9.5 – 7)2 � 6.25 2 � 1.25(5) � 8.25 1.5625 1.563
SST � 22.5 SSE � 6.875 SSR � 15.625Y = 7
g1YN - Y22 = 15.625g1Y - YN 22 = 6.875g1Y - Y22 = 22.5
120 CHAPTER 4 • REGRESSION MODELS
regression (SSR) and indicates how much of the total variability in Y is explained by the regres- sion model. Mathematically, this can be calculated as
(4-8)
Table 4.3 indicates
There is a very important relationship between the sums of squares that we have computed:
(4-9)
Figure 4.2 displays the data for Triple A Construction. The regression line is shown, as is a line representing the mean of the Y values. The errors used in computing the sums of squares are shown on this graph. Notice how the sample points are closer to the regression line than they are to the mean.
Coefficient of Determination The SSR is sometimes called the explained variability in Y while the SSE is the unexplained variability in Y. The proportion of the variability in Y that is explained by the regression equa- tion is called the coefficient of determination and is denoted by Thus,
(4-10)
Thus, can be found using either the SSR or the SSE. For Triple A Construction, we have
This means that about 69% of the variability in sales (Y) is explained by the regression equation based on payroll (X).
If every point in the sample were on the regression line (meaning all errors are 0), then 100% of the variability in Y could be explained by the regression equation, so and
The lowest possible value of is 0, indicating that X explains 0% of the variability in Y. Thus, can range from a low of 0 to a high of 1. In developing regression equations, a good model will have an value close to 1.r2
r2 r2SSE = 0.
r2 = 1
r2 = 15.625
22.5 = 0.6944
r2
r2 = SSR
SST = 1 -
SSE
SST
r2.
SST = SSR + SSE
1Sum of squares total2 = 1Sum of squares due to regression2 + 1Sum of squares error2 SSR = 15.625
SSR = g1YN - Y22
is the proportion of variability in Y that is explained by the regression equation.
r2
If every point lies on the regression line, and SSE � 0.
r2 � 1
0 1 2 0
2
4
6
8
10
12
3 4 5 6 7 8
Payroll ($100 million)
S al
es (
$1 00
,0 00
)
Y – Y ^ Y
Y – Y Y – Y
^
Y = 2 � 1.25 X ^
⎧ ⎪ ⎨ ⎪⎪⎩
⎧
⎧
⎨ ⎩
⎩⎨
Y
X
FIGURE 4.2 Deviations from the Regression Line and from the Mean
4.4 MEASURING THE FIT OF THE REGRESSION MODEL 121
The correlation coefficient ranges from -1 to �1.
(a) Perfect Positive Correlation: r � �1
Y
X (b) Positive Correlation: 0 � r �1
Y
X
(c) No Correlation: r � 0
Y
X (d ) Perfect Negative Correlation: r � �1
Y
X
FIGURE 4.3 Four Values of the Correlation Coefficient
Correlation Coefficient Another measure related to the coefficient of determination is the coefficient of correlation. This measure also expresses the degree or strength of the linear relationship. It is usually expressed as r and can be any number between and including +1 and -1. Figure 4.3 illustrates possible scatter diagrams for different values of r. The value of r is the square root of It is negative if the slope is negative, and it is positive if the slope is positive. Thus,
(4-11)
For the Triple A Construction example with
We know it is positive because the slope is +1.25.
r = 10.6944 = 0.8333
r2 = 0.6944,
r = �2r2
r2.
Multiple Regression Modeling at Canada’s TransAlta Utilities
TransAlta Utilities (TAU) is a $1.6 billion energy company oper- ating in Canada, New Zealand, Australia, Argentina, and the United States. Headquartered in Alberta, Canada, TAU is that country’s largest publicly owned utility. It serves 340,000 cus- tomers in Alberta through 57 customer-service facilities, each of which was staffed by 5 to 20 customer service linemen. The 270 linemen’s jobs are to handle new connections and repairs and to patrol power lines and check substations. This existing system was not the result of some optimal central planning but was put in place incrementally as the company grew.
With help from the University of Alberta, TAU wanted to develop a causal model to decide how many linemen would be best assigned to each facility. The research team decided to build a multiple regres- sion model with only three independent variables. The hardest part of the task was to select variables that were easy to quantify based
on available data. In the end, the explanatory variables were number of urban customers, number of rural customers, and geographic size of a service area. The implicit assumptions in this model are that the time spent on customers is proportional to the number of customers; and the time spent on facilities (line patrol and substation checks) and travel are proportional to the size of the service region. By definition, the unexplained time in the model accounts for time that is not explained by the three variables (such as meetings, breaks, or unpro- ductive time).
Not only did the results of the model please TAU managers, but the project (which included optimizing the number of facili- ties and their locations) saved $4 million per year.
Source: Based on E. Erkut, T. Myroon, and K. Strangway. “TransAlta Re- designs Its Service-Delivery Network,” Interfaces (March–April 2000): 54–69.
IN ACTION
122 CHAPTER 4 • REGRESSION MODELS
4.5 Using Computer Software for Regression
Software such as QM for Windows (Appendix 4.2), Excel, and Excel QM (Appendix 4.3) is often used for regression calculations. We will rely on Excel for most of the calculations in the rest of this chapter. When using Excel to develop a regression model, the input and output for Excel 2007 and Excel 2010 are the same.
The Triple A Construction example will be used to illustrate how to develop a regression model in Excel 2010. Go to the Data tab and select Data Analysis, as shown in Program 4.1A. If Data Analysis does not appear, then the Excel add-in Data Analysis from the Analysis Tool- Pak must be enabled or activated. Appendix F at the end of this book provides instructions on how to enable this and other add-ins for Excel 2010 and Excel 2007. Once an add-in is activated, it will remain on the Data tab for future use.
When the Data Analysis window opens, scroll down to and highlight Regression and click OK, as illustrated in Program 4.1A. The Regression window will open, as shown in Program 4.1B, and you can input the X and Y ranges. Check the Labels box because the cells with the variable name were included in the first row of the X and Y ranges. To have the out- put presented on this page rather than on a new worksheet, select Output Range and give a cell address for the start of the output. Click the OK button, and the output appears in the out- put range specified.
Program 4.1C shows the intercept (2), slope (1.25), and other information that was previ- ously calculated for the Triple A Construction example.
The sums of squares are shown in the column headed by SS. Another name for error is residual. In Excel, the sum of squares error is shown as the sum of squares residual. The values in this output are the same values shown in Table 4.3:
The coefficient of determination is shown to be 0.6944. The coefficient of correlation (r) is
called Multiple R in the Excel output, and this is 0.8333.
1r22 Sum of squares total = SST = 22.5
Sum of squares error 1residual2 = SSE = 6.8750
Sum of squares regression = SSR = 15.625
Errors are also called residuals.
When the Data Analysis window opens. scroll down to Regression.
Go to the Data tab.
Select Data Analysis.
Click ok.
PROGRAM 4.1A Accessing the Regression Option in Excel 2010
4.6 ASSUMPTIONS OF THE REGRESSION MODEL 123
Specify the X and Y ranges.Check the Labels box if the first row in the X and Y ranges includes the variable names.
Click OK to have Excel develop the regression model.
Specify the location for the output. To put this on the current worksheet, click Output Range and give a cell location for this to begin.
PROGRAM 4.1B Data Input for Regression in Excel
A high r2 (close to 1) is desirable.
A low (e.g., less than 0.05) Significance F (p-value for overall model) indicates a significant relationship between X and Y.
The regression coefficients are given here.
The SSR (regression), SSE (residual or error), and SST (total) are shown in the SS column of the ANOVA table.
PROGRAM 4.1C Excel Output for the Triple A Construction Example
4.6 Assumptions of the Regression Model
If we can make certain assumptions about the errors in a regression model, we can perform statisti- cal tests to determine if the model is useful. The following assumptions are made about the errors:
1. The errors are independent. 2. The errors are normally distributed. 3. The errors have a mean of zero. 4. The errors have a constant variance (regardless of the value of X).
124 CHAPTER 4 • REGRESSION MODELS
A plot of the errors may highlight problems with the model.
E rr
or
X
FIGURE 4.4A Pattern of Errors Indicating Randomness
E rr
or
X
FIGURE 4.4B Nonconstant Error Variance
E rr
or
X
FIGURE 4.4C Errors Indicate Relationship is Not Linear
It is possible to check the data to see if these assumptions are met. Often a plot of the residuals will highlight any glaring violations of the assumptions. When the errors (residuals) are plotted against the independent variable, the pattern should appear random.
Figure 4.4 presents some typical error patterns, with Figure 4.4A displaying a pattern that is expected when the assumptions are met and the model is appropriate. The errors are random and no discernible pattern is present. Figure 4.4B demonstrates an error pattern in which the errors increase as X increases, violating the constant variance assumption. Figure 4.4C shows errors
consistently increasing at first, and then consistently decreasing. A pattern such as this would indicate that the model is not linear and some other form (perhaps quadratic) should be used. In general, patterns in the plot of the errors indicate problems with the assumptions or the model specification.
Estimating the Variance While the errors are assumed to have constant variance this is usually not known. It can be estimated from the sample results. The estimate of is the mean squared error (MSE) and is denoted by The MSE is the sum of squares due to error divided by the degrees of freedom:*
(4-12)
where
of observations in the sample
of independent variables
In this example, and So
From this we can estimate the standard deviation as
(4-13)
This is called the standard error of the estimate or the standard deviation of the regression. In the example shown in Program 4.1D,
This is used in many of the statistical tests about the model. It is also used to find interval estimates for both Y and regression coefficients.**
s = 1MSE = 11.7188 = 1.31
s = 1MSE
s2 = MSE = SSE
n - k - 1 =
6.8750
6 - 1 - 1 =
6.8750
4 = 1.7188
k = 1.n = 6
k = number
n = number
s2 = MSE = SSE
n - k - 1
s2. �2 1�22,
4.7 TESTING THE MODEL FOR SIGNIFICANCE 125
*See bibliography at end of this chapter for books with further details. **The MSE is a common measure of accuracy in forecasting. When used with techniques besides regression, it is com- mon to divide the SSE by n rather than n - k - 1.
The error variance is estimated by the MSE.
4.7 Testing the Model for Significance
Both the MSE and provide a measure of accuracy in a regression model. However, when the sample size is too small, it is possible to get good values for both of these even if there is no relationship between the variables in the regression model. To determine whether these values are meaningful, it is necessary to test the model for significance.
To see if there is a linear relationship between X and Y, a statistical hypothesis test is per- formed. The underlying linear model was given in Equation 4-1 as
If then Y does not depend on X in any way. The null hypothesis says there is no linear relationship between the two variables (i.e., ). The alternate hypothesis is that there is a linear relationship (i.e., ). If the null hypothesis can be rejected, then we have proven that a linear relationship does exist, so X is helpful in predicting Y. The F distribution is used for test- ing this hypothesis. Appendix D contains values for the F distribution which can be used when calculations are performed by hand. See Chapter 2 for a review of the F distribution. The results of the test can also be obtained from both Excel and QM for Windows.
b1 Z 0 b1 = 0
b1 = 0,
Y = b0 + b1X + P
r2
An F test is used to determine if there is a relationship between X and Y.
126 CHAPTER 4 • REGRESSION MODELS
If the significance level for the F test is low, there is a relationship between X and Y.
The F statistic used in the hypothesis test is based on the MSE (seen in the previous sec- tion) and the mean squared regression (MSR). The MSR is calculated as
(4-14)
where
The F statistic is
(4-15)
Based on the assumptions regarding the errors in a regression model, this calculated F statistic is described by the F distribution with
where
If there is very little error, the denominator (MSE) of the F statistic is very small relative to the numerator (MSR), and the resulting F statistic would be large. This would be an indi- cation that the model is useful. A significance level related to the value of the F statistic is then found. Whenever the F value is large, the significance level (p-value) will be low, indi- cating that it is extremely unlikely that this could have occurred by chance. When the F value is large (with a resulting small significance level), we can reject the null hypothesis that there is no linear relationship. This means that there is a linear relationship and the values of MSE and are meaningful.
The hypothesis test just described is summarized here:
Steps in Hypothesis Test for a Significant Regression Model
1. Specify null and alternative hypotheses:
2. Select the level of significance (�). Common values are 0.01 and 0.05.
3. Calculate the value of the test statistic using the formula
4. Make a decision using one of the following methods: (a) Reject the null hypothesis if the test statistic is greater than the F value from the table
in Appendix D. Otherwise, do not reject the null hypothesis:
(b) Reject the null hypothesis if the observed significance level, or p-value, is less than the level of significance (�). Otherwise, do not reject the null hypothesis:
Reject if p-value 6 �
p-value = P1F 7 calculated test statistic2 df2 = n - k - 1
df1 = k
Reject if Fcalculated 7 F�,df1,df2
F = MSR
MSE
H1:b1 Z 0
H0:b1 = 0
r2
k = the number of independent 1X2 variables
degrees of freedom for the denominator = df2 = n - k - 1.
degrees of freedom for the numerator = df1 = k
F = MSR
MSE
k = number of independent variables in the model
MSR = SSR
k
4.7 TESTING THE MODEL FOR SIGNIFICANCE 127
Triple A Construction Example To illustrate the process of testing the hypothesis about a significant relationship, consider the Triple A Construction example. Appendix D will be used to provide values for the F distribution.
Step 1.
Step 2.
Step 3. Calculate the value of the test statistic. The MSE was already calculated to be 1.7188. The MSR is then calculated so that F can be found:
Step 4. (a) Reject the null hypothesis if the test statistic is greater than the F value from the table in Appendix D:
The value of F associated with a 5% level of significance and with degrees of freedom 1 and 4 is found in Appendix D. Figure 4.5 illustrates this:
Thus, there is sufficient data to conclude that there is a statistically significant relationship between X and Y, so the model is helpful. The strength of this relationship is measured by
Thus, we can conclude that about 69% of the variability in sales (Y) is explained by the regression model based on local payroll (X).
The Analysis of Variance (ANOVA) Table When software such as Excel or QM for Windows is used to develop regression models, the out- put provides the observed significance level, or p-value, for the calculated F value. This is then compared to the level of significance (α) to make the decision.
r2 = 0.69.
Reject H0 because 9.09 7 7.71
Fcalculated = 9.09
F0.05,1,4 = 7.71
df2 = n - k - 1 = 6 - 1 - 1 = 4
df1 = k = 1
F = MSR
MSE =
15.6250
1.7188 = 9.09
MSR = SSR
k =
15.6250
1 = 15.6250
Select � = 0.05.
H1 : b1 Z 0 1linear relationship exists between X and Y2 H0 : b1 = 0 1no linear relationship between X and Y2
F � 7.71 9.09
0.05
FIGURE 4.5 F Distribution for Triple A Construction Test for Significance
128 CHAPTER 4 • REGRESSION MODELS
4.8 Multiple Regression Analysis
The multiple regression model is a practical extension of the model we just observed. It allows us to build a model with several independent variables. The underlying model is
(4-16)
where
dependent variable (response variable)
th independent variable (predictor variable or explanatory variable)
intercept (value of Y when all )
coefficient of the ith independent variable
number of independent variables
random error
To estimate the values of these coefficients, a sample is taken and the following equation is developed:
(4-17)
where
predicted value of Y
sample intercept (and is an estimate of )
sample coefficient of ith variable (and is an estimate of )
Consider the case of Jenny Wilson Realty, a real estate company in Montgomery, Alabama. Jenny Wilson, owner and broker for this company, wants to develop a model to determine a suggested listing price for houses based on the size of the house and the age of the house. She selects a sample of houses that have sold recently in a particular area, and she records the selling price, the square footage of the house, the age of the house, and also the con- dition (good, excellent, or mint) of each house as shown in Table 4.5. Initially Jenny plans to
bibi = b0b0 =
YN =
YN = b0 + b1X1 + b2X2 + Á + bkXk
P = k = bi =
Xi = 0b0 = Xi = i
Y =
Y = b0 + b1X1 + b2X2 + Á + bkXk + P
A multiple regression model has more than one independent variable.
DF SS MS F SIGNIFICANCE F
Regression k SSR MSR = SSR/k MSR/MSE P(F MSR/MSE)
Residual n – k – 1 SSE MSE = SSE/(n–k–1)
Total n – 1 SST
TABLE 4.4 Analysis of Variance (ANOVA) Table for Regression
Table 4.4 provides summary information about the ANOVA table. This shows how the num- bers in the last three columns of the table are computed. The last column of this table, labeled Significance F, is the p-value, or observed significance level, which can be used in the hypothe- sis test about the regression model.
Triple A Construction ANOVA Example The Excel output that includes the ANOVA table for the Triple A Construction data is shown in Program 4.1C. The observed significance level for is given to be 0.0394. This means
Because this probability is less than 0.05 (α), we would reject the hypothesis of no linear rela- tionship and conclude that there is a linear relationship between X and Y. Note in Figure 4.5 that the area under the curve to the right of 9.09 is clearly less than 0.05, which is the area to the right of the F value associated with a 0.05, level of signicance.
P1F 7 9.09092 = 0.0394
F = 9.0909
4.8 MULTIPLE REGRESSION ANALYSIS 129
use only the square footage and age to develop a model, although she wants to save the infor- mation on condition of the house to use later. She wants to find the coefficients for the follow- ing multiple regression model:
where
predicted value of dependent variable (selling price)
intercept
and value of the two independent variables (square footage and age), respectively
and slopes for and respectively
The mathematics of multiple regression becomes quite complex, so we leave formulas for and to regression textbooks.* Excel can be used to develop a multiple regression model
just as it was used for a simple linear regression model. When entering the data in Excel, it is important that all of the independent variables are in adjoining columns to facilitate the input. From the Data tab in Excel, select Data Analysis and then Regression, as shown earlier, in Program 4.1A. This opens the regression window to allow the input, as shown in Program 4.2A. Note that the X Range includes the data in two columns (B and C) because there are two inde- pendent variables. The Excel output that Jenny Wilson obtains is shown in Program 4.2B, and it provides the following equation:
Evaluating the Multiple Regression Model A multiple regression model can be evaluated in a manner similar to the way a simple linear regression model is evaluated. Both the p-value for the F test and can be interpreted the same with multiple regression models as they are with simple linear regression models. However, as
r2
= 146,630.89 + 43.82 X1 - 2898.69 X2
YN = b0 + b1X1 + b2X2
b2b0, b1,
X2,X1b2 =b1
X2 =X1
b0 = Y
YN =
YN = b0 + b1X1 + b2X2
SELLING SQUARE PRICE ($) FOOTAGE AGE CONDITION
95,000 1,926 30 Good
119,000 2,069 40 Excellent
124,800 1,720 30 Excellent
135,000 1,396 15 Good
142,800 1,706 32 Mint
145,000 1,847 38 Mint
159,000 1,950 27 Mint
165,000 2,323 30 Excellent
182,000 2,285 26 Mint
183,000 3,752 35 Good
200,000 2,300 18 Good
211,000 2,525 17 Good
215,000 3,800 40 Excellent
219,000 1,740 12 Mint
TABLE 4.5 Jenny Wilson Real Estate Data
*See, for example, Norman R. Draper and Harry Smith. Applied Regression Analysis, 3rd ed. New York: John Wiley & Sons, Inc., 1998.
Excel can be used to develop multiple regression models.
130 CHAPTER 4 • REGRESSION MODELS
Input the X range to include both column B and column C.
Output range begins at cell A19.
Variable names (row 3) are included in X and Y ranges, so Labels must be checked.
PROGRAM 4.2A Input Screen for the Jenny Wilson Realty Multiple Regression Example
The coefficient of determination (r2) is 0.67.
The regression coefficients are found here.
A low significance level for F proves a relationship exists between Y and at least one of the independent (X ) variables.
The p-values are used to test the individual variables for significance.
PROGRAM 4.2B Output for the Jenny Wilson Realty Multiple Regression Example
To determine which of the independent variables in a multiple regression model is signifi- cant, a significance test on the coefficient for each variable is performed. While statistics text- books can provide the details of these tests, the results of these tests are automatically displayed in the Excel output. The null hypothesis is that the coefficient is and the alternate hypothesis is that it is not zero The test statistic is calculated in Excel, and the p-value is given. If the p-value is lower than the level of significance (α), then the null hypothe- sis is rejected and it can be concluded that the variable is significant.
Jenny Wilson Realty Example In the Jenny Wilson Realty example in Program 4.2B, the overall model is statistically signifi- cant and useful in predicting the selling price of the house because the p-value for the F test is 0.002. The value is 0.6719, so 67% of the variability in selling price for these houses can be explained by the regression model. However, there were two independent variables in the model—square footage and age. It is possible that one of these is significant and the other is not. The F test simply indicates that the model as a whole is significant.
r2
1H1:bi Z 02. 0 1H0:bi = 02
there is more than one independent variable, the hypothesis that is being tested with the F test is that all the coefficients are equal to 0. If all these are 0, then none of the independent variables in the model is helpful in predicting the dependent variable.
4.9 BINARY OR DUMMY VARIABLES 131
Two significance tests can be performed to determine if square footage or age (or both) are significant. In Program 4.2B, the results of two hypothesis tests are provided. The first test for variable (square footage) is
Using a 5% level of significance the null hypothesis is rejected because the p-value for this is 0.0013. Thus, square footage is helpful in predicting the price of a house.
Similarly, the variable (age) is tested using the Excel output, and the p-value is 0.0039. The null hypothesis is rejected because this is less than 0.05. Thus, age is also helpful in predict- ing the price of a house.
X2
1� = 0.052, H1:b1 Z 0
H0:b1 = 0
X1
4.9 Binary or Dummy Variables
All of the variables we have used in regression examples have been quantitative variables such as sales figures, payroll numbers, square footage, and age. These have all been easily measura- ble and have had numbers associated with them. There are many times when we believe a quali- tative variable rather than a quantitative variable would be helpful in predicting the dependent variable Y. For example, regression may be used to find a relationship between annual income and certain characteristics of the employees. Years of experience at a particular job would be a quantitative variable. However, information regarding whether or not a person has a college degree might also be important. This would not be a measurable value or quantity, so a special variable called a dummy variable (or a binary variable or an indicator variable) would be used. A dummy variable is assigned a value of 1 if a particular condition is met (e.g., a person has a college degree), and a value of 0 otherwise.
Return to the Jenny Wilson Realty example. Jenny believes that a better model can be de- veloped if the condition of the property is included. To incorporate the condition of the house into the model, Jenny looks at the information available (see Table 4.5), and sees that the three categories are good condition, excellent condition, and mint condition. Since these are not quan- titative variables, she must use dummy variables. These are defined as
Notice there is no separate variable for “good” condition. If and are both 0, then the house cannot be in excellent or mint condition, so it must be in good condition. When using dummy variables, the number of variables must be 1 less than the number of categories. In this problem, there were three categories (good, excellent, and mint condition) so we must have two dummy variables. If we had mistakenly used too many variables and the number of dummy variables equaled the number of categories, then the mathematical computations could not be performed or would not give reliable values.
These dummy variables will be used with the two previous variables ( —square footage, and —age) to try to predict the selling prices of houses for Jenny Wilson. Programs 4.3A and 4.3B provide the Excel input and output for this new data, and this shows how the dummy vari- ables were coded. The significance level for the F test is 0.00017, so this model is statistically significant. The coefficient of determination is 0.898, so this is a much better model than the previous one. The regression equation is
This indicates that a house in excellent condition would sell for about $33,162 more than a house in good condition A house in mint condition1X3 = 0, X4 = 02.1X3 = 1, X4 = 02YN = 121,658 + 56.43X1 - 3,962X2 + 33,162X3 + 47,369X4
1r22 X2
X1
X4X3
= 0 otherwise
X4 = 1 if house is in mint condition
= 0 otherwise
X3 = 1 if house is in excellent condition
A dummy variable is also called an indicator variable or a binary variable.
The number of dummy variables must equal one less than the number of categories of a qualitative variable.
would sell for about $47,369 more than a house in good condition.1X3 = 0, X4 = 12
132 CHAPTER 4 • REGRESSION MODELS
The X range includes columns B, C, D, and E, but not column F.
PROGRAM 4.3A Input Screen for the Jenny Wilson Realty Example with Dummy Variables
The overall model is helpful because the significance F probability is low (much less than 5%).
Each of the variables individually is helpful because the p-values for each of them is low (much less than 5%).
The coefficient of age is negative, indicating that the price decreases as a house gets older.
PROGRAM 4.3B Output for the Jenny Wilson Realty Example with Dummy Variables
4.10 Model Building
In developing a good regression model, possible independent variables are identified and the best ones are selected to be used in the model. The best model is a statistically significant model with a high and few variables.
As more variables are added to a regression model, will usually increase, and it cannot decrease. It is tempting to keep adding variables to a model to try to increase However, if too many independent variables are included in the model, problems can arise. For this reason, the adjusted value is often used (rather than ) to determine if an additional independent vari- able is beneficial. The adjusted takes into account the number of independent variables in the model, and it is possible for the adjusted to decrease. The formula for is
r2 = SSR
SST = 1 -
SSE
SST
r2r2 r2
r2r2
r2. r2
r2
The value of can never decrease when more variables are added to the model.
r2
The adjusted may decrease when more variables are added to the model.
r2
4.11 NONLINEAR REGRESSION 133
The adjusted is
(4-18)
Notice that as the number of variables (k) increases, will decrease. This causes to increase, and consequently the adjusted will decrease unless the extra
variable in the model causes a significant decrease in the SSE. Thus, the reduction in error (and SSE) must be sufficient to offset the change in k.
As a general rule of thumb, if the adjusted increases when a new variable is added to the model, the variable should probably remain in the model. If the adjusted decreases when a new variable is added, the variable should not remain in the model. Other factors should also be considered when trying to build the model, but they are beyond the introduc- tory level of this chapter.
STEPWISE REGRESSION While the process of model building may be tedious, there are many statistical software packages that include stepwise regression procedures to do this. Stepwise regression is an automated process to systematically add or delete independent variables from a regression model. A forward stepwise procedure puts the most significant variable in the model first and then adds the next variable that will improve the model the most, given that the first variable is already in the model. Variables continue to be added in this fashion until all the vari- ables are in the model or until any remaining variables do not significantly improve the model. A backwards stepwise procedure begins with all independent variables in the model, and one-by-one the least helpful variables are deleted. This continues until only significant variables remain. Many variations of these stepwise models exist.
MULTICOLLINEARITY In the Jenny Wilson Realty example illustrated in Program 4.3B, we saw an of about 0.90 and an adjusted of 0.85. While other variables such as the size of the lot, the number of bedrooms, and the number of bathrooms might be related to the selling price of a house, we may not want to include these in the model. It is likely that these variables would be correlated with the square footage of the house (e.g., more bedrooms usually means a larger house), which is already included in the model. Thus, the information provided by these addi- tional variables might be duplication of information already in the model.
When an independent variable is correlated with one other independent variable, the vari- ables are said to be collinear. If an independent variable is correlated with a combination of other independent variables, the condition of multicollinearity exists. This can create problems in interpreting the coefficients of the variables as several variables are providing duplicate infor- mation. For example, if two independent variables were monthly salary expenses for a company and annual salary expenses for a company, the information provided in one is also provided in the other. Several sets of regression coefficients for these two variables would yield exactly the same results. Thus, individual interpretation of these variables would be questionable, although the model itself is still good for prediction purposes. When multicollinearity exists, the overall F test is still valid, but the hypothesis tests related to the individual coefficients are not. A variable may appear to be significant when it is insignificant, or a variable may appear to be insignificant when it is significant.
r2r2
r2 r2
r2SSE>1n - k - 12 n - k - 1
Adjusted r2 = 1 - SSE>1n - k - 12
SST>1n - 12 r2
A variable should not be added to the model if it causes the adjusted to decrease.r2
Multicollinearity exists when a variable is correlated to other variables.
4.11 Nonlinear Regression
The regression models we have seen are linear models. However, at times there exist nonlinear rela- tionships between variables. Some simple variable transformations can be used to create an appar- ently linear model from a nonlinear relationship. This allows us to use Excel and other linear regression programs to perform the calculations. We will demonstrate this in the following example.
On every new automobile sold in the United States, the fuel efficiency (as measured by miles per gallon of gasoline (MPG) of the automobile is prominently displayed on the window sticker. The MPG is related to several factors, one of which is the weight of the automobile. Engineers at Colonel Motors, in an attempt to improve fuel efficiency, have been asked to study the impact of weight on MPG. They have decided that a regression model should be used to do this.
Transformations may be used to turn a nonlinear model into a linear model.
134 CHAPTER 4 • REGRESSION MODELS
1.00 2.00 3.00 4.00 5.00 0
5
10
25
15
20
30
35
40
45
Weight (1,000 lb.)
M P
G
FIGURE 4.6A Linear Model for MPG Data
A sample of 12 new automobiles was selected, and the weight and MPG rating were recorded. Table 4.6 provides this data. A scatter diagram of this data in Figure 4.6A shows the weight and MPG. A linear regression line is drawn through the points. Excel was used to develop a simple linear regression equation to relate the MPG (Y) to the weight in 1,000 lb. in the form
YN = b0 + b1X1
1X12
TABLE 4.6 Automobile Weight vs. MPG
WEIGHT WEIGHT MPG (1,000 LB.) MPG (1,000 LB.)
12 4.58 20 3.18
13 4.66 23 2.68
15 4.02 24 2.65
18 2.53 33 1.70
19 3.09 36 1.95
19 3.11 42 1.92
1.00 2.00 3.00 4.00 5.00 0
5
10
25
15
20
30
35
40
45
Weight (1,000 lb.)
M P
G
FIGURE 4.6B Nonlinear Model for MPG Data
4.11 NONLINEAR REGRESSION 135
The Excel output is shown in Program 4.4. From this we get the equation
or
The model is useful since the significance level for the F test is small and However, further examination of the graph in Figure 4.6A brings into question the use of a lin- ear model. Perhaps a nonlinear relationship exists, and maybe the model should be modified to account for this. A quadratic model is illustrated in Figure 4.6B. This model would be of the form
The easiest way to develop this model is to define a new variable
This gives us the model
We can create another column in Excel, and again run the regression tool. The output is shown in Program 4.5. The new equation is
The significance level for F is low (0.0002) so the model is useful, and The adjusted increased from 0.719 to 0.814, so this new variable definitely improved the model.
This model is good for prediction purposes. However, we should not try to interpret the coefficients of the variables due to the correlation between (weight) and (weight squared). Normally we would interpret the coefficient for as the change in Y that results from a 1-unit change in while holding all other variables constant. Obviously holding oneX1,
X1
X2X1
r2 r2 = 0.8478.
YN = 79.8 - 30.2X1 + 3.4X2
YN = b0 + b1X1 + b2X2
X2 = 1weight22 MPG = b0 + b11weight2 + b21weight22
r2 = 0.7446.
MPG = 47.6 - 8.21weight in 1,000 lb.2 YN = 47.6 - 8.2X1
PROGRAM 4.4 Excel Output for Linear Regression Model with MPG Data
PROGRAM 4.5 Excel Output for Nonlinear Regression Model with MPG Data
A low significance value for F and a high are indications of a good model.
r2
136 CHAPTER 4 • REGRESSION MODELS
variable constant while changing the other is impossible in this example since If changes, then must change also. This is an example of a problem that exists when multi- collinearity is present.
Other types of nonlinearities can be handled using a similar approach. A number of transfor- mations exist that may help to develop a linear model from variables with nonlinear relationships.
X2
X1X2 = X2 1.
4.12 Cautions and Pitfalls in Regression Analysis
This chapter has provided a brief introduction into regression analysis, one of the most widely used quantitative techniques in business. However, some common errors are made with regres- sion models, so caution should be observed when using this.
If the assumptions are not met, the statistical tests may not be valid. Any interval estimates are also invalid, although the model can still be used for prediction purposes.
Correlation does not necessarily mean causation. Two variables (such as the price of auto- mobiles and your annual salary) may be highly correlated to one another, but one is not causing the other to change. They may both be changing due to other factors such as the economy in general or the inflation rate.
If multicollinearity is present in a multiple regression model, the model is still good for pre- diction, but interpretation of individual coefficients is questionable. The individual tests on the regression coefficients are not valid.
Using a regression equation beyond the range of X is very questionable. A linear relation- ship may exist within the range of values of X in the sample. What happens beyond this range is unknown; the linear relationship may become nonlinear at some point. For example, there is usually a linear relationship between advertising and sales within a limited range. As more money is spent on advertising, sales tend to increase even if everything else is held constant. However, at some point, increasing advertising expenditures will have less impact on sales unless the company does other things to help, such as opening new markets or expanding the product offerings. If advertising is increased and nothing else changes, the sales will probably level off at some point.
Related to the limitation regarding the range of X is the interpretation of the intercept Since the lowest value for X in a sample is often much greater than 0, the intercept is a point on the regression line beyond the range of X. Therefore, we should not be concerned if the t-test for this coefficient is not significant as we should not be using the regression equation to predict a value of Y when This intercept is merely used in defining the line that fits the sample points the best.
Using the F test and concluding a linear regression model is helpful in predicting Y does not mean that this is the best relationship. While this model may explain much of the variability in Y, it is possible that a nonlinear relationship might explain even more. Similarly, if it is con- cluded that no linear relationship exists, another type of relationship could exist.
A statistically significant relationship does not mean it has any practical value. With large enough samples, it is possible to have a statistically significant relationship, but might be 0.01. This would normally be of little use to a manager. Similarly, a high could be found due to random chance if the sample is small. The F test must also show significance to place any value in r2.
r2 r2
X = 0.
1b02.
A high correlation does not mean one variable is causing a change in the other.
The regression equation should not be used with values of X that are below the lowest value of X or above the highest value of X found in the sample.
A significant F value may occur even when the relationship is not strong.
Summary
Regression analysis is an extremely valuable quantitative tool. Using scatter diagrams helps to see relationships between vari- ables. The F test is used to determine if the results can be con- sidered useful. The coefficient of determination is used to measure the proportion of variability in Y that is explained by the regression model. The correlation coefficient measures the relationship between two variables.
1r22 Multiple regression involves the use of more than one in-
dependent variable. Dummy variables (binary or indicator vari- ables) are used with qualitative or categorical data. Nonlinear models can be transformed into linear models.
We saw how to use Excel to develop regression models. Interpretation of computer output was presented, and several examples were provided.
KEY EQUATIONS 137
Glossary
Adjusted A measure of the explanatory power of a regression model that takes into consideration the number of independent variables in the model.
Binary Variable See Dummy Variable. Coefficient of Correlation A measure of the strength of
the relationship between two variables. Coefficient of Determination The percent of the vari-
ability in the dependent variable that is explained by the regression equation.
Collinearity A condition that exists when one independent variable is correlated with another independent variable.
Dependent Variable The Y-variable in a regression model. This is what is being predicted.
Dummy Variable A variable used to represent a qualitative factor or condition. Dummy variables have values of 0 or 1. This is also called a binary variable or an indicator variable.
Error. The difference between the actual value (Y) and the predicted value
Explanatory Variable The independent variable in a regres- sion equation.
Independent Variable The X-variable in a regression equa- tion. This is used to help predict the dependent variable.
Least Squares A reference to the criterion used to select the regression line, to minimize the squared distances between the estimated straight line and the observed values.
Mean Squared Error (MSE) An estimate of the error variance.
Multicollinearity A condition that exists when one independent variable is correlated with other independent variables.
1YN 2.
1Y21r22
1r2
r2 Multiple Regression Model A regression model that has more than one independent variable.
Observed Significance Level Another name for p-value. p-Value A probability value that is used when testing a
hypothesis. The hypothesis is rejected when this is low. Predictor Variable Another name for explanatory variable. Regression Analysis A forecasting procedure that uses the
least squares approach on one or more independent variables to develop a forecasting model.
Residual. Another term for error. Response Variable The dependent variable in a regression
equation. Scatter Diagrams Diagrams of the variable to be forecasted,
plotted against another variable, such as time. Also called scatter plots.
Standard Error of the Estimate An estimate of the standard deviation of the errors and is sometimes called the standard deviation of the regression.
Stepwise Regression An automated process to systematically add or delete independent variables from a regression model.
Sum of Squares Error (SSE) The total sum of the squared differences between each observation (Y) and the predicted value
Sum of Squares Regression (SSR) The total sum of the squared differences between each predicted value and the mean
Sum of Squares Total (SST) The total sum of the squared differences between each observation (Y) and the mean 1Y2.1Y2. 1YN 21YN 2.
Key Equations
(4-1) Underlying linear model for simple linear regression.
(4-2) Simple linear regression model computed from a sample.
(4-3) Error in regression model.
(4-4)
Slope in the regression line.
(4-5) The intercept in the regression line.
(4-6) Total sums of squares.
(4-7) Sum of squares due to error.
(4-8) Sum of squares due to regression. SSR = g1YN - Y22 SSE = ge2 = g1Y - YN 22 SST = g1Y - Y22 b0 = Y - b1X
b1 = ©1X - X21Y - Y2
©1X - X22 e = Y - YN
YN = b0 + b1X
Y = b0 + b1X + P (4-9) Relationship among sums of squares in regression.
(4-10)
Coefficient of determination.
(4-11) Coefficient of correlation. This has the same sign as the slope.
(4-12)
An estimate of the variance of the errors in regression; n is the sample size and k is the number of independent variables.
(4-13) An estimate of the standard deviation of the errors. Also called the standard error of the estimate.
s = 1MSE
s2 = MSE = SSE
n - k - 1
r = ; 2r2
r2 = SSR
SST = 1 -
SSE
SST
SST = SSR + SSE
138 CHAPTER 4 • REGRESSION MODELS
(4-14)
Mean square regression. k is the number of independ- ent variables.
(4-15)
F statistic used to test significance of overall regression model.
F = MSR
MSE
MSR = SSR
k
(4-16) Underlying model for multiple regression model.
(4-17) Multiple regression model computed from a sample.
(4-18)
Adjusted used in building multiple regression models.
r2
Adjusted r2 = 1 - SSE>1n - k - 12
SST>1n - 12 YN = b0 + b1X1 + b2X2 + Á + bkXk
Y = b0 + b1X1 + b2X2 + Á + bkXk + P
Solved Problems
Solved Problem 4-1 Judith Thompson runs a florist shop on the Gulf Coast of Texas, specializing in floral arrangements for weddings and other special events. She advertises weekly in the local newspapers and is considering in- creasing her advertising budget. Before doing so, she decides to evaluate the past effectiveness of these ads. Five weeks are sampled, and the advertising dollars and sales volume for each of these is shown in the following table. Develop a regression equation that would help Judith evaluate her advertising. Find the coefficient of determination for this model.
SALES ($1,000) ADVERTISING ($100)
11 5
6 3
10 7
6 2
12 8
Solution
SALES Y ADVERTISING X
11 5 (5 - 5)2 = 0 (5 - 5)(11 - 9) = 0
6 3 (3 - 5)2 = 4 (3 - 5)(6 - 9) = 6
10 7 (7 - 5)2 = 4 (7 - 5)(10 - 9) = 2
6 2 (2 - 5)2 = 9 (2 - 5)(6 - 9) = 9
12 8 (8 - 5)2 = 9 (8 - 5)(12 - 9) = 9
g Y � 45 g X � 25
� 9 � 5
X = 25>5Y = 45>5 g1X - X21Y - Y2 = 26g1X - X22 = 26
1X � X21Y � Y21X - X22
The regression equation is
YN = 4 + 1X
b0 = Y - b1X = 9 - 112152 = 4
b1 = ©1X - X21Y - Y2
©1X - X22 = 26
26 = 1
SOLVED PROBLEMS 139
To compute we use the following table:r2,
Y X
11 5 9 (11 - 9)2 = 4 (11 - 9)2 = 4
6 3 7 (6 - 7)2 = 1 (6 - 9)2 = 9
10 7 11 (10 - 11)2 = 1 (10 - 9)2 = 1
6 2 6 (6 - 6)2 = 0 (6 - 9)2 = 9
12 8 12 (12 - 12)2 = 0 (12 - 9)2 = 9
gY = 45 gX = 25
SSE SSTX = 5Y = 9
g1Y - Y22 = 32g1Y - YN 22 = 6
1Y - Y221Y - YN22YN � 4 � 1X
The slope tells us that for each 1 unit increase in X (or $100 in advertising), sales increase by 1 unit (or $1,000). Also, indicating that about 81% of the variability in sales can be explained by the regression model with advertising as the independent variable.
Solved Problem 4-2 Use Excel with the data in Solved Problem 4-1 to find the regression model. What does the F test say about this model?
Solution Program 4.6 provides the Excel output for this problem. We see the equation is
The coefficient of determination is shown to be 0.8125. The significance level for the F test is 0.0366, which is less than 0.05. This indicates the model is statistically significant. Thus, there is suffi- cient evidence in the data to conclude that the model is useful, and there is a relationship between X (advertising) and Y (sales).
1r22 YN = 4 + 1X
r2 = 0.8125 1b1 = 12
PROGRAM 4.6 Excel Output for Solved Problem 4-2
140 CHAPTER 4 • REGRESSION MODELS
Self-Test
� Before taking the self-test, refer to the learning objectives at the beginning of the chapter, the notes in the margins, and the glossary at the end of the chapter.
� Use the key at the back of the book to correct your answers. � Restudy pages that correspond to any questions that you answered incorrectly or material you feel uncertain about.
c. the coefficient of determination would be -1. d. the coefficient of determination would be 0.
8. When using dummy variables in a regression equation to model a qualitative or categorical variable, the number of dummy variables should equal to a. the number of categories. b. one more than the number of categories. c. one less than the number of categories. d. the number of other independent variables in the
model. 9. A multiple regression model differs from a simple linear
regression model because the multiple regression model has more than one a. independent variable. b. dependent variable. c. intercept. d. error.
10. The overall significance of a regression model is tested using an F test. The model is significant if a. the F value is low. b. the significance level of the F value is low. c. the value is low. d. the slope is lower than the intercept.
11. A new variable should not be added to a multiple regres- sion model if that variable causes a. to decrease. b. the adjusted to decrease. c. the SST to decrease. d. the intercept to decrease.
12. A good regression model should have a. a low and a low significance level for the F test. b. a high and a high significance level for the F test. c. a high and a low significance level for the F test. d. a low and a high significance level for the F test.r2
r2 r2
r2
r2 r2
r2
Discussion Questions and Problems
Discussion Questions 4-1 What is the meaning of least squares in a regression
model?
4-2 Discuss the use of dummy variables in regression analysis.
4-3 Discuss how the coefficient of determination and the coefficient of correlation are related and how they are used in regression analysis.
4-4 Explain how a scatter diagram can be used to iden- tify the type of regression to use.
4-5 Explain how the adjusted value is used in devel- oping a regression model.
4-6 Explain what information is provided by the F test.
4-7 What is the SSE? How is this related to the SST and the SSR?
4-8 Explain how a plot of the residuals can be used in developing a regression model.
r2
1. One of the assumptions in regression analysis is that a. the errors have a mean of 1. b. the errors have a mean of 0. c. the observations (Y) have a mean of 1. d. the observations (Y) have a mean of 0.
2. A graph of the sample points that will be used to develop a regression line is called a. a sample graph. b. a regression diagram. c. a scatter diagram. d. a regression plot.
3. When using regression, an error is also called a. an intercept. b. a prediction. c. a coefficient. d. a residual.
4. In a regression model, Y is called a. the independent variable. b. the dependent variable. c. the regression variable. d. the predictor variable.
5. A quantity that provides a measure of how far each sam- ple point is from the regression line is a. the SSR. b. the SSE. c. the SST. d. the MSR.
6. The percentage of the variation in the dependent variable that is explained by a regression equation is measured by a. the coefficient of correlation. b. the MSE. c. the coefficient of determination. d. the slope.
7. In a regression model, if every sample point is on the regression line (all errors are 0), then a. the correlation coefficient would be 0. b. the correlation coefficient would be -1 or 1.
DISCUSSION QUESTIONS AND PROBLEMS 141
Problems 4-9 John Smith has developed the following forecasting
model:
where
Demand for K10 air conditioners
the outside temperature
(a) Forecast the demand for K10 when the tempera- ture is 70°F.
(b) What is the demand for a temperature of 80°F? (c) What is the demand for a temperature of 90°F?
4-10 The operations manager of a musical instrument distributor feels that demand for bass drums may be related to the number of television appearances by the popular rock group Green Shades during the pre- ceding month. The manager has collected the data shown in the following table:
1°F2X1 = YN =
YN = 36 + 4.3X1
(a) Develop a regression model that could be used to predict the final average in the course based on the first test grade.
(b) Predict the final average of a student who made an 83 on the first test.
(c) Give the values of r and for this model. Inter- pret the value of in the context of this problem.
4-14 Using the data in Problem 4-13, test to see if there is a statistically significant relationship between the grade on the first test and the final average at the 0.05 level of significance. Use the formulas in this chapter and Appendix D.
4-15 Using computer software, find the least squares re- gression line for the data in Problem 4-13. Based on the F test, is there a statistically significant relation- ship between the first test grade and the final aver- age in the course?
4-16 Steve Caples, a real estate appraiser in Lake Charles, Louisiana, has developed a regression model to help appraise residential housing in the Lake Charles area. The model was developed using recently sold homes in a particular neighborhood. The price (Y) of the house is based on the square footage (X) of the house. The model is
The coefficient of correlation for the model is 0.63. (a) Use the model to predict the selling price of a
house that is 1,860 square feet. (b) A house with 1,860 square feet recently sold for
$95,000. Explain why this is not what the model predicted.
(c) If you were going to use multiple regression to develop an appraisal model, what other quantita- tive variables might be included in the model?
(d) What is the coefficient of determination for this model?
4-17 Accountants at the firm Walker and Walker believed that several traveling executives submit unusually high travel vouchers when they return from business trips. The accountants took a sample of 200 vouchers submitted from the past year; they then developed the following multiple regression equation relating expected travel cost (Y) to number of days on the road and distance traveled in miles:
YN = $90.00 + $48.50X1 + $0.40X2
1X221X12
YN = 13,473 + 37.65X
r2 r2
DEMAND FOR GREEN SHADES BASS DRUMS TV APPEARANCE
3 3
6 4
7 7
5 6
10 8
8 5
(a) Graph these data to see whether a linear equation might describe the relationship between the group’s television shows and bass drum sales.
(b) Using the equations presented in this chapter, compute the SST, SSE, and SSR. Find the least squares regression line for these data.
(c) What is your estimate for bass drum sales if the Green Shades performed on TV six times last month?
4-11 Using the data in Problem 4-10, test to see if there is a statistically significant relationship between sales and TV appearances at the 0.05 level of significance. Use the formulas in this chapter and Appendix D.
4-12 Using computer software, find the least squares regression line for the data in Problem 4-10. Based on the F test, is there a statistically significant rela- tionship between the demand for drums and the number of TV appearances?
4-13 Students in a management science class have just re- ceived their grades on the first test. The instructor has provided information about the first test grades in some previous classes as well as the final average for the same students. Some of these grades have been sampled and are as follows:
STUDENT 1 2 3 4 5 6 7 8 9
1st test grade 98 77 88 80 96 61 66 95 69
Final average 93 78 84 73 84 64 64 95 76
Note: means the problem may be solved with QM for Windows; means the problem may be
solved with Excel QM; and means the problem may be solved with QM for Windows and/or Excel QM.
142 CHAPTER 4 • REGRESSION MODELS
The coefficient of correlation computed was 0.68. (a) If Thomas Williams returns from a 300-mile trip
that took him out of town for five days, what is the expected amount that he should claim as ex- penses?
(b) Williams submitted a reimbursement request for $685; what should the accountant do?
(c) Comment on the validity of this model. Should any other variables be included? Which ones? Why?
4-18 Thirteen students entered the undergraduate busi- ness program at Rollins College 2 years ago. The following table indicates what their grade-point av- erages (GPAs) were after being in the program for 2 years and what each student scored on the SAT exam (maximum 2400) when he or she was in high school. Is there a meaningful relationship between grades and SAT scores? If a student scores a 1200 on the SAT, what do you think his or her GPA will be? What about a student who scores 2400?
(a) Plot these data and determine whether a linear model is reasonable.
(b) Develop a regression model. (c) What is expected ridership if 10 million tourists
visit the city? (d) If there are no tourists at all, explain the pre-
dicted ridership.
4-20 Use computer software to develop a regression model for the data in Problem 4-19. Explain what this output indicates about the usefulness of this model.
4-21 The following data give the starting salary for stu- dents who recently graduated from a local university and accepted jobs soon after graduation. The start- ing salary, grade-point average (GPA), and major (business or other) are provided.
STUDENT SAT SCORE GPA STUDENT SAT SCORE GPA
A 1263 2.90 H 1443 2.53
B 1131 2.93 I 2187 3.22
C 1755 3.00 J 1503 1.99
D 2070 3.45 K 1839 2.75
E 1824 3.66 L 2127 3.90
F 1170 2.88 M 1098 1.60
G 1245 2.15
4-19 Bus and subway ridership in Washington, D.C., dur- ing the summer months is believed to be heavily tied to the number of tourists visiting the city. During the past 12 years, the following data have been obtained:
NUMBER OF TOURISTS RIDERSHIP
YEAR (1,000,000s) (100,000s)
1 7 15
2 2 10
3 6 13
4 4 15
5 14 25
6 15 27
7 16 24
8 12 20
9 14 27
10 20 44
11 15 34
12 7 17
SALARY $29,500 $46,000 $39,800 $36,500
GPA 3.1 3.5 3.8 2.9
Major Other Business Business Other
SALARY $42,000 $31,500 $36,200
GPA 3.4 2.1 2.5
Major Business Other Business
(a) Using a computer, develop a regression model that could be used to predict starting salary based on GPA and major.
(b) Use this model to predict the starting salary for a business major with a GPA of 3.0.
(c) What does the model say about the starting salary for a business major compared to a non- business major?
(d) Do you believe this model is useful in predicting the starting salary? Justify your answer, using information provided in the computer output.
4-22 The following data give the selling price, square footage, number of bedrooms, and age of houses that have sold in a neighborhood in the past 6 months. Develop three regression models to predict the selling price based upon each of the other factors individually. Which of these is best?
SELLING SQUARE AGE PRICE($) FOOTAGE BEDROOMS (YEARS)
64,000 1,670 2 30
59,000 1,339 2 25
61,500 1,712 3 30
79,000 1,840 3 40
87,500 2,300 3 18
92,500 2,234 3 30
95,000 2,311 3 19
113,000 2,377 3 7
(Continued on next page)
DISCUSSION QUESTIONS AND PROBLEMS 143
4-23 Use the data in Problem 4-22 and develop a regres- sion model to predict selling price based on the square footage and number of bedrooms. Use this to predict the selling price of a 2,000-square-foot house with 3 bedrooms. Compare this model with the mod- els in Problem 4-22. Should the number of bed- rooms be included in the model? Why or why not?
4-24 Use the data in Problem 4-22 and develop a regres- sion model to predict selling price based on the square footage, number of bedrooms, and age. Use this to predict the selling price of a 10-year-old, 2,000-square-foot house with 3 bedrooms.
4-25 Tim Cooper plans to invest money in a mutual fund that is tied to one of the major market indices, either the S&P 500 or the Dow Jones Industrial Average. To obtain even more diversification, Tim has thought about investing in both of these. To determine whether investing in two funds would help, Tim decided to take 20 weeks of data and compare the two markets. The closing price for each index is shown in the table below:
4-26 The total expenses of a hospital are related to many factors. Two of these factors are the number of beds in the hospital and the number of admissions. Data were collected on 14 hospitals, as shown in the table below:
SELLING SQUARE AGE PRICE($) FOOTAGE BEDROOMS (YEARS)
115,000 2,736 4 10
138,000 2,500 3 1
142,500 2,500 4 3
144,000 2,479 3 3
145,000 2,400 3 1
147,500 3,124 4 0
144,000 2,500 3 2
155,500 4,062 4 10
165,000 2,854 3 3
WEEK 1 2 3 4 5 6 7
DJIA 10,226 10,473 10,452 10,442 10,471 10,213 10,187
S&P 1,107 1,141 1,135 1,139 1,142 1,108 1,110
WEEK 8 9 10 11 12 13 14
DJIA 10,240 10,596 10,584 10,619 10,628 10,593 10,488
S&P 1,121 1,157 1,145 1,144 1,146 1,143 1,131
WEEK 15 16 17 18 19 20
DJIA 10,568 10,601 10,459 10,410 10,325 10,278
S&P 1,142 1,140 1,122 1,108 1,096 1,089
Develop a regression model that would predict the DJIA based on the S&P 500 index. Based on this model, what would you expect the DJIA to be when the S&P is 1,100? What is the correlation coefficient (r) between the two markets?
NUMBER ADMISSIONS TOTAL EXPENSES HOSPITAL OF BEDs (100s) (MILLIONS)
1 215 77 57
2 336 160 127
3 520 230 157
4 135 43 24
5 35 9 14
6 210 155 93
7 140 53 45
8 90 6 6
9 410 159 99
10 50 18 12
11 65 16 11
12 42 29 15
13 110 28 21
14 305 98 63
Find the best regression model to predict the total expenses of a hospital. Discuss the accuracy of this model. Should both variables be included in the model? Why or why not?
4-27 A sample of 20 automobiles was taken, and the miles per gallon (MPG), horsepower, and total weight were recorded. Develop a linear regression model to predict MPG, using horsepower as the only independent variable. Develop another model with weight as the independent variable. Which of these two models is better? Explain.
MPG HORSEPOWER WEIGHT
44 67 1,844
44 50 1,998
40 62 1,752
37 69 1,980
37 66 1,797
34 63 2,199
35 90 2,404
32 99 2,611
30 63 3,236
28 91 2,606
26 94 2,580
26 88 2,507
(Continued on next page)
144 CHAPTER 4 • REGRESSION MODELS
4-28 Use the data in Problem 4-27 to develop a multiple linear regression model. How does this compare with each of the models in Problem 4-27?
4-29 Use the data in Problem 4-27 to find the best quad- ratic regression model. (There is more than one to consider.) How does this compare to the models in Problems 4-27 and 4-28?
4-30 A sample of nine public universities and nine private universities was taken. The total cost for the year (including room and board) and the median SAT score (maximum total is 2400) at each school were recorded. It was felt that schools with higher median SAT scores would have a better reputation and would charge more tuition as a result of that. The data is in the table below. Use regression to help an- swer the following questions based on this sample data. Do schools with higher SAT scores charge more in tuition and fees? Are private schools more expensive than public schools when SAT scores are
CATEGORY TOTAL COST ($) MEDIAN SAT
Public 21,700 1990
Public 15,600 1620
Public 16,900 1810
Public 15,400 1540
Public 23,100 1540
Public 21,400 1600
Public 16,500 1560
Public 23,500 1890
Public 20,200 1620
Private 30,400 1630
Private 41,500 1840
Private 36,100 1980
Private 42,100 1930
Private 27,100 2130
Private 34,800 2010
Private 32,100 1590
Private 31,800 1720
Private 32,100 1770
MPG HORSEPOWER WEIGHT
25 124 2,922
22 97 2,434
20 114 3,248
21 102 2,812
18 114 3,382
18 142 3,197
16 153 4,380
16 139 4,036
taken into consideration? Discuss how accurate you believe these results are using information related the regression models.
4-31 In 2008, the total payroll for the New York Yankees was $209.1 million, while the total payroll for the Tampa Bay Rays was about $43.8 million, or about one-fifth that of the Yankees. Many people have sug- gested that some teams are able to buy winning sea- sons and championships by spending a lot of money on the most talented players available. The table below lists the payrolls (in millions of dollars) for all 14 Major League Baseball teams in the American League as well as the total number of victories for each in the 2008 season:
TEAM PAYROLL
($MILLIONS) NUMBER
OF VICTORIES
New York Yankees 209.1 89
Detroit Tigers 138.7 74
Boston Red Sox 133.4 95
Chicago White Sox 121.2 89
Cleveland Indians 79.0 81
Baltimore Orioles 67.2 68
Oakland Athletics 48.0 75
Los Angeles Angels 119.2 100
Seattle Mariners 118.0 61
Toronto Blue Jays 98.6 86
Minnesota Twins 62.2 88
Kansas City Royals 58.2 75
Tampa Bay Rays 43.8 97
Texas Rangers 68.2 79
Develop a regression model to predict the total num- ber of victories based on the payroll of a team. Based on the results of the computer output, discuss how accurate this model is. Use the model to predict the number of victories for a team with a payroll of $79 million.
4-32 In 2009, the New York Yankees won 103 baseball games during the regular season. The table on the next page lists the number of victories (W), the earned- run-average (ERA), and the batting average (AVG) of each team in the American League. The ERA is one measure of the effectiveness of the pitching staff, and a lower number is better. The batting average is one measure of effectiveness of the hitters, and a higher number is better. (a) Develop a regression model that could be used to
predict the number of victories based on the ERA. (b) Develop a regression model that could be used to
predict the number of victories based on the bat- ting average.
CASE STUDY 145
(c) Which of the two models is better for predicting the number of victories?
(d) Develop a multiple regression model that in- cludes both ERA and batting average. How does this compare to the previous models?
4-33 The closing stock price for each of two stocks was recorded over a 12-month period. The closing price for the Dow Jones Industrial Average (DJIA) was also recorded over this same time period. These val- ues are shown in the following table:
(a) Develop a regression model to predict the price of stock 1 based on the Dow Jones Industrial Average.
(b) Develop a regression model to predict the price of stock 2 based on the Dow Jones Industrial Average.
(c) Which of the two stocks is most highly corre- lated to the Dow Jones Industrial Average over this time period?
Case Study
North–South Airline
In January 2008, Northern Airlines merged with Southeast Airlines to create the fourth largest U.S. carrier. The new North–South Airline inherited both an aging fleet of Boeing 727-300 aircraft and Stephen Ruth. Stephen was a tough for- mer Secretary of the Navy who stepped in as new president and chairman of the board.
Stephen’s first concern in creating a financially solid com- pany was maintenance costs. It was commonly surmised in the airline industry that maintenance costs rise with the age of the aircraft. He quickly noticed that historically there had been a significant difference in the reported B727-300 maintenance costs (from ATA Form 41s) both in the airframe and engine ar- eas between Northern Airlines and Southeast Airlines, with Southeast having the newer fleet.
On February 12, 2008, Peg Jones, vice president for opera- tions and maintenance, was called into Stephen’s office and asked to study the issue. Specifically, Stephen wanted to know whether the average fleet age was correlated to direct airframe maintenance costs, and whether there was a relationship be- tween average fleet age and direct engine maintenance costs.
Peg was to report back by February 26 with the answer, along with quantitative and graphical descriptions of the relationship.
Peg’s first step was to have her staff construct the average age of Northern and Southeast B727-300 fleets, by quarter, since the introduction of that aircraft to service by each airline in late 1993 and early 1994. The average age of each fleet was calculated by first multiplying the total number of calendar days each aircraft had been in service at the pertinent point in time by the average daily utilization of the respective fleet to total fleet hours flown. The total fleet hours flown was then divided by the number of aircraft in service at that time, giving the age of the “average” aircraft in the fleet.
The average utilization was found by taking the actual total fleet hours flown on September 30, 2007, from Northern and Southeast data, and dividing by the total days in service for all aircraft at that time. The average utilization for Southeast was 8.3 hours per day, and the average utilization for Northern was 8.7 hours per day. Because the available cost data were calcu- lated for each yearly period ending at the end of the first quar- ter, average fleet age was calculated at the same points in time.
TEAM W ERA AVG
New York Yankees 103 4.26 0.283
Los Angeles Angels 97 4.45 0.285
Boston Red Sox 95 4.35 0.270
Minnesota Twins 87 4.50 0.274
Texas Rangers 87 4.38 0.260
Detroit Tigers 86 4.29 0.260
Seattle Mariners 85 3.87 0.258
Tampa Bay Rays 84 4.33 0.263
Chicago White Sox 79 4.14 0.258
Toronto Blue Jays 75 4.47 0.266
Oakland Athletics 75 4.26 0.262
Cleveland Indians 65 5.06 0.264
Kansas City Royals 65 4.83 0.259
Baltimore Orioles 64 5.15 0.268
MONTH DJIA STOCK 1 STOCK 2
1 11,168 48.5 32.4
2 11,150 48.2 31.7
3 11,186 44.5 31.9
4 11,381 44.7 36.6
5 11,679 49.3 36.7
6 12,081 49.3 38.7
7 12,222 46.1 39.5
8 12,463 46.2 41.2
9 12,622 47.7 43.3
10 12,269 48.3 39.4
11 12,354 47.0 40.1
12 13,063 47.9 42.1
13 13,326 47.8 45.2
146 CHAPTER 4 • REGRESSION MODELS
The fleet data are shown in the following table. Airframe cost data and engine cost data are both shown paired with fleet aver- age age in that table.
Bibliography
Berenson, Mark L., David M. Levine, and Timothy C. Kriehbiel. Basic Busi- ness Statistics: Concepts and Applications, 11th ed. Upper Saddle River, NJ: Prentice Hall, 2009.
Black, Ken. Business Statistics: For Contemporary Decision Making, 6th ed. John Wiley & Sons, Inc., 2010.
Draper, Norman R., and Harry Smith. Applied Regression Analysis, 3rd ed. New York: John Wiley & Sons, Inc., 1998.
Kutner, Michael, John Neter, Chris J. Nachtsheim, and William Wasserman. Applied Linear Regression Models, 4th ed., Boston; New York: McGraw-Hill/Irwin, 2004.
Mendenhall, William, and Terry L. Sincich. A Second Course in Statistics: Re- gression Analysis, 6th ed., Upper Saddle River, NJ: Prentice Hall, 2004.
Discussion Question 1. Prepare Peg Jones’s response to Stephen Ruth.
Note: Dates and names of airlines and individuals have been changed in this case to maintain confidentiality. The data and issues described here are real.
North–South Airline Data for Boeing 727-300 Jets
NORTHERN AIRLINE DATA SOUTHEAST AIRLINE DATA
ENGINE AIRFRAME ENGINE AIRFRAME COST COST PER AVERAGE COST PER COST PER AVERAGE
YEAR PER AIRCRAFT($) AIRCRAFT($) AGE (HOURS) AIRCRAFT($) AIRCRAFT($) AGE (HOURS)
2001 51.80 43.49 6,512 13.29 18.86 5,107
2002 54.92 38.58 8,404 25.15 31.55 8,145
2003 69.70 51.48 11,077 32.18 40.43 7,360
2004 68.90 58.72 11,717 31.78 22.10 5,773
2005 63.72 45.47 13,275 25.34 19.69 7,150
2006 84.73 50.26 15,215 32.78 32.58 9,364
2007 78.74 79.60 18,390 35.56 38.07 8,259
Appendix 4.1 Formulas for Regression Calculations
When performing regression calculations by hand, there are other formulas that can make the task easier and are mathematically equivalent to the ones presented in the chapter. These, how- ever, make it more difficult to see the logic behind the formulas and to understand what the re- sults actually mean.
When using these formulas, it helps to set up a table with the columns shown in Table 4.7, which has the Triple A Construction Company data that was used earlier in the chapter. The sample size (n) is 6. The totals for all columns are shown, and the averages for X and Y are cal- culated. Once this is done, we can use the following formulas for computations in a simple lin- ear regression model (one independent variable). The simple linear regression equation is again given as
Slope of regression equation:
b1 = 180.5 - 6142172
106 - 61422 = 1.25
b1 = ©XY - nXY
©X2 - nX2
YN = b0 + b1X
APPENDIX 4.1 FORMULAS FOR REGRESSION CALCULATIONS 147
Y X Y2 X2 XY
6 3 62 � 36 32 � 9 3(6) � 18
8 4 82 � 64 42 � 16 4(8) � 32
9 6 92 � 81 62 �36 6(9) � 54
5 4 52 � 25 42 � 16 4(5) � 20
4.5 2 4.52 � 20.25 22 � 4 2(4.5) � 9
9.5 5 9.52 � 90.25 52 � 25 5(9.5) � 47.5
gY � 42 gX � 24 gY2 � 316.5 gX2 � 106 gXY � 180.5
X = 24>6 = 4Y = 42>6 = 7
TABLE 4.7 Preliminary Calculations for Triple A Construction
Intercept of regression equation:
Sum of squares of the error:
Estimate of the error variance:
Estimate of the error standard deviation:
Coefficient of determination:
This formula for the correlation coefficient automatically determines the sign of r. This could also be found by taking the square root of and giving it the same sign as the slope:
r = 61180.52 - 1242142223611062 - 2424361316.52 - 4224 = 0.833
r = n©XY - ©X©Y23n©X2 - 1©X2243n©Y2 - 1©Y224
r2
r2 = 1 - 6.875
316.5 - 61722 = 0.6944
r2 = 1 - SSE
©Y2 - nY 2
s = 21.71875 = 1.311
s = 1MSE
s2 = 6.875
6 - 2 = 1.71875
s2 = MSE = SSE
n - 2
SSE = 316.5 - 21422 - 1.251180.52 = 6.875
SSE = ©Y2 - b0©Y - b1©XY
b0 = 7 - 1.25142 = 2
b0 = Y - b1X
148 CHAPTER 4 • REGRESSION MODELS
Appendix 4.2 Regression Models Using QM for Windows
The use of QM for Windows to develop a regression model is very easy. We will use the Triple A Construction Company data to illustrate this. After starting QM for Windows, under Modules we select Forecasting. To enter the problem we select New and specify Least Squares—Simple and Multiple Regression, as illustrated in Program 4.7A. This opens the window shown in Program 4.7B. We enter the number of observations, which is 6 in this example. There is only 1 independent (X) variable. When OK is clicked, a window opens and the data is input as shown in Program 4.7C. After entering the data, click Solve, and the forecasting results are shown as in Program 4.7D. The equation as well as other information is provided on this screen. Additional output is available by clicking the Window option on the toolbar.
Recall that the MSE is an estimate of the error variance and the square root of this is the standard error of the estimate. The formula presented in the chapter and used in Excel is
where n is the sample size and k is the number of independent variables. This is an unbiased es- timate of In QM for Windows, the mean squared error is computed as
This is simply the average error and is a biased estimate of The standard error shown in Program 4.7D is not the square root of the MSE in the output, but rather is found using the de- nominator of If this standard error is squared, you get the MSE we saw earlier in the Ex- cel output.
n - 2.
�2.
MSE = SSE>n�2.
MSE = SSE>1n - k - 12 1�22,
PROGRAM 4.7A Initial Input Screen for QM for File—New—Least Squares–Simple and Multiple Regression
There is only one independent variable.
There are six pairs of observations in this sample.
PROGRAM 4.7B Second Input Screen for QM for Windows
APPENDIX 4.2 REGRESSION MODELS USING QM FOR WINDOWS 149
PROGRAM 4.7C Data Input for Triple A Construction Example
The regression equation is shown across two lines.
The standard error is the square root of SSE divided by n–2.
The MSE is the SSE divided by n.
PROGRAM 4.7D QM for Windows Output for Triple A Construction Data
PROGRAM 4.7E ANOVA Summary Output in QM for Windows
The F test was used to test a hypothesis about the overall effectiveness of the model. To see the ANOVA table, after the problem has been solved, select Window—ANOVA Summary, and the screen shown in Program 4.7E will be displayed.
150 CHAPTER 4 • REGRESSION MODELS
Appendix 4.3 Regression Analysis in Excel QM or Excel 2007
Excel QM Perhaps the easiest way to do regression analysis in Excel (either 2007 or 2010) is to use Excel QM, which is available on the companion website for this book. Once Excel QM has been installed as an add-in to Excel (see Appendix F at the end of the book for instructions on doing this), go to the Add-Ins tab and click Excel QM, as shown in Program 4.8A. When the menu appears, point the cursor at Forecasting, and the options will appear. Click on Multiple Regres- sion, as shown in Program 4.8A, for either simple or multiple regression models.
A window will open, as shown in Program 4.8B. Enter the number of past observations and the number of independent (X) variables. You can also enter a name or title for the problem. To enter the data for the Triple A Construction example in this chapter, enter 6 for the past periods (observations) and 1 for the number of independent variables. This will initialize the size of the spreadsheet, and the spreadsheet will appear as presented in Program 4.8C.
The shaded area under Y and x 1 will be empty, but the data are entered in this shaded area, and the calculations are automatically performed. In Program 4.8C, the intercept is 2 (the coeffi- cient in the Y column) and the slope is 1.25 (the coefficient in the x 1 column), resulting in the regression equation
which is the equation found earlier in this chapter.
Excel 2007 When doing regression in Excel (without the Excel QM add-in), the Data Analysis add-in is used in both Excel 2010 and Excel 2007. The steps and illustrations for Excel 2010 provided earlier in this chapter also apply to Excel 2007. However, the procedure to enable or activate this or any other Excel add-in varies, depending on which of the two versions of Excel is being used. See Appendix F at the end of this book for instructions for both Excel 2007 and Excel 2010.
Y = 2 + 1.25X
Go to the Add-In tab in Excel 2007 or Excel 2010. Click Excel QM.
Point the cursor at Forecasting. When options appear, click on Multiple Regression.
PROGRAM 4.8A Using Excel QM for Regression
APPENDIX 4.3 REGRESSION ANALYSIS IN EXCEL QM OR EXCEL 2007 151
Input a title.
Click OK.
Input the number of past observations.
Input the number of independent (X ) variables.
PROGRAM 4.8B Initializing the Spreadsheet in Excel QM
Enter the past observations of Y and X. Results appear automatically.
To forecast Y based on any value of X, simply input the value of X here.The intercepts and slope
are shown here.
The correlation coefficient is given here.
PROGRAM 4.8C Input and Results for Regression in Excel QM
This page intentionally left blank
- CHAPTER 4 Regression Models
- 4.1 Introduction
- 4.2 Scatter Diagrams
- 4.3 Simple Linear Regression
- 4.4 Measuring the Fit of the Regression Model
- Coefficient of Determination
- Correlation Coefficient
- 4.5 Using Computer Software for Regression
- 4.6 Assumptions of the Regression Model
- Estimating the Variance
- 4.7 Testing the Model for Significance
- Triple A Construction Example
- The Analysis of Variance (ANOVA) Table
- Triple A Construction ANOVA Example
- 4.8 Multiple Regression Analysis
- Evaluating the Multiple Regression Model
- Jenny Wilson Realty Example
- 4.9 Binary or Dummy Variables
- 4.10 Model Building
- 4.11 Nonlinear Regression
- 4.12 Cautions and Pitfalls in Regression Analysis
- Summary
- Glossary
- Key Equations
- Solved Problems
- Self-Test
- Discussion Questions and Problems
- Case Study: North–South Airline
- Bibliography
- Appendix 4.1 Formulas for Regression Calculations
- Appendix 4.2 Regression Models Using QM for Windows
- Appendix 4.3 Regression Analysis in Excel QM or Excel 2007