Forecasting: Time series and trend analysis

sri999
Chapter-4.pdf

115

4

After completing this chapter, students will be able to:

CHAPTER OUTLINE

LEARNING OBJECTIVES

Regression Models

1. Identify variables and use them in a regression model.

2. Develop simple linear regression equations from sample data and interpret the slope and intercept.

3. Compute the coefficient of determination and the coefficient of correlation and interpret their meanings.

4. Interpret the F test in a linear regression model.

5. List the assumptions used in regression and use residual plots to identify problems.

4.1 Introduction

4.2 Scatter Diagrams

4.3 Simple Linear Regression

4.4 Measuring the Fit of the Regression Model

4.5 Using Computer Software for Regression

4.6 Assumptions of the Regression Model

4.7 Testing the Model for Significance

4.8 Multiple Regression Analysis

4.9 Binary or Dummy Variables

4.10 Model Building

4.11 Nonlinear Regression

4.12 Cautions and Pitfalls in Regression Analysis

6. Develop a multiple regression model and use it for prediction purposes.

7. Use dummy variables to model categorical data.

8. Determine which variables should be included in a multiple regression model.

9. Transform a nonlinear function into a linear one for use in regression.

10. Understand and avoid mistakes commonly made in the use of regression analysis.

CHAPTER

Summary • Glossary • Key Equations • Solved Problems • Self-Test • Discussion Questions and Problems

Case Study: North–South Airline • Bibliography

Appendix 4.1 Formulas for Regression Calculations

Appendix 4.2 Regression Models Using QM for Windows

Appendix 4.3 Regression Analysis in Excel QM or Excel 2007

4.1 Introduction

Regression analysis is a very valuable tool for today’s manager. Regression has been used to model such things as the relationship between level of education and income, the price of a house and the square footage, and the sales volume for a company relative to the dollars spent on advertising. When businesses are trying to decide which location is best for a new store or branch office, regression models are often used. Cost estimation models are often regression models. The applicability of regression analysis is virtually limitless.

There are generally two purposes for regression analysis. The first is to understand the relationship between variables such as advertising expenditures and sales. The second purpose is to predict the value of one variable based on the value of the other. Because of this, regression is a very important forecasting technique and will be mentioned again in Chapter 5.

In this chapter, the simple linear regression model will first be developed, and then a more complex multiple regression model will be used to incorporate even more variables into our model. In any regression model, the variable to be predicted is called the dependent variable or response variable. The value of this is said to be dependent upon the value of an independent variable, which is sometimes called an explanatory variable or a predictor variable.

Two purposes of regression analysis are to understand the relationship between variables and to predict the value of one based on the other.

4.2 Scatter Diagrams

To investigate the relationship between variables, it is helpful to look at a graph of the data. Such a graph is often called a scatter diagram or a scatter plot. Normally the independent variable is plotted on the horizontal axis and the dependent variable is plotted on the vertical axis. The following example will illustrate this.

Triple A Construction Company renovates old homes in Albany. Over time, the company has found that its dollar volume of renovation work is dependent on the Albany area payroll. The figures for Triple A’s revenues and the amount of money earned by wage earners in Albany for the past six years are presented in Table 4.1. Economists have predicted the local area pay- roll to be $600 million next year, and Triple A wants to plan accordingly.

Figure 4.1 provides a scatter diagram for the Triple A Construction data given in Table 4.1. This graph indicates that higher values for the local payroll seem to result in higher sales for the company. There is not a perfect relationship because not all the points lie in a straight line, but there is a relationship. A line has been drawn through the data to help show the relationship that exists between the payroll and sales. The points do not all lie on the line, so there would be some error involved if we tried to predict sales based on payroll using this or any other line. Many lines could be drawn through these points, but which one best represents the true relationship? Regression analysis provides the answer to this question.

116 CHAPTER 4 • REGRESSION MODELS

A scatter diagram is a graph of the data.

TRIPLE A’S SALES LOCAL PAYROLL ($100,000s) ($100,000,000s)

6 3

8 4

9 6

5 4

4.5 2

9.5 5

TABLE 4.1 Triple A Construction Company Sales and Local Payroll

4.3 SIMPLE LINEAR REGRESSION 117

0 1 2 0

2

4

6

8

10

12

3 4 5 6 7 8

Payroll ($100 million)

S al

es (

$1 00

,0 00

)

FIGURE 4.1 Scatter Diagram of Triple A Construction Company Data

4.3 Simple Linear Regression

In any regression model, there is an implicit assumption (which can be tested) that a relationship exists between the variables. There is also some random error that cannot be predicted. The un- derlying simple linear regression model is

(4-1)

where

dependent variable (response variable)

independent variable (predictor variable or explanatory variable)

intercept (value of Y when )

slope of regression line

random error

The true values for the intercept and slope are not known, and therefore they are estimated using sample data. The regression equation based on sample data is given as

(4-2)

where

predicted value of Y

estimate of based on sample results

estimate of based on sample results

In the Triple A Construction example, we are trying to predict the sales, so the dependent variable (Y) would be sales. The variable we use to help predict sales is the Albany area payroll, so this is the independent variable (X). Although any number of lines can be drawn through these points to show a relationship between X and Y in Figure 4.1, the line that will be chosen is the one that in some way minimizes the errors. Error is defined as

(4-3)

Since errors may be positive or negative, the average error could be zero even though there are extremely large errors—both positive and negative. To eliminate the difficulty of negative errors

e = Y - YN

Error = 1Actual value2 - 1Predicted value2

b1,b1 = b0,b0 =

YN =

YN = b0 + b1X

P = b1 =

X = 0b0 = X = Y =

Y = b0 + b1X + P

The dependent variable is Y and the independent variable is X.

Estimates of the slope and intercept are found from sample data.

The regression line minimizes the sum of the squared errors.

118 CHAPTER 4 • REGRESSION MODELS

canceling positive errors, the errors can be squared. The best regression line will be defined as the one with the minimum sum of the squared errors. For this reason, regression analysis is sometimes called least-squares regression.

Statisticians have developed formulas that we can use to find the equation of a straight line that would minimize the sum of the squared errors. The simple linear regression equation is

The following formulas can be used to compute the intercept and the slope:

(4-4)

(4-5)

The preliminary calculations are shown in Table 4.2. There are other “shortcut” formulas that are helpful when doing the computations on a calculator, and these are presented in Appendix 4.1. They will not be shown here, as computer software will be used for most of the other examples in this chapter.

Computing the slope and intercept of the regression equation for the Triple A Construction Company example, we have

The estimated regression equation therefore is

or

If the payroll next year is $600 million then the predicted value would be

or $950,000.

YN = 2 + 1.25162 = 9.5

1X = 62,sales = 2 + 1.251payroll2 YN = 2 + 1.25X

b0 = Y - b1X = 7 - 11.252142 = 2

b1 = ©1X - X21Y - Y2

©1X - X22 = 12.5

10 = 1.25

Y = ©X

6 =

42

6 = 7

X = ©X

6 =

24

6 = 4

b0 = Y - b1X

b1 = ©1X - X21Y - Y2

©1X - X22 Y =

©Y

n = average 1mean2 of Y values

X = ©X

n = average 1mean2 of X values

YN = b0 + b1X

Y X (X – X– )2 (X – X– )(Y – Y– )

6 3 (3 – 4)2 = 1 (3 – 4)(6 – 7) = 1

8 4 (4 – 4)2 = 0 (4 – 4)(8 – 7) = 0

9 6 (6 – 4)2 = 4 (6 – 4)(9 – 7) = 4

5 4 (4 – 4)2 = 0 (4 – 4)(5 – 7) = 0

4.5 2 (2 – 4)2 = 4 (2 – 4)(4.5 – 7) = 5

9.5 5 (5 – 4)2 = 1 (5 – 4)(9.5 – 7) = 2.5

Y � 42 X � 24

Y = 24>6 = 4Y = 42>6 = 7

g1X - X21Y - Y2 = 12.5g1X - X22 = 10gg

TABLE 4.2 Regression Calculations for Triple A Construction

4.4 MEASURING THE FIT OF THE REGRESSION MODEL 119

4.4 Measuring the Fit of the Regression Model

A regression equation can be developed for any variables X and Y, even random numbers. We certainly would not have any confidence in the ability of one random number to predict the value of another random number. How do we know that the model is actually helpful in predicting Y based on X? Should we have confidence in this model? Does the model provide better predic- tions (smaller errors) than simply using the average of the Y values?

In the Triple A Construction example, sales figures (Y) varied from a low of 4.5 to a high of 9.5, and the mean was 7. If each sales value is compared with the mean, we see how far they deviate from the mean and we could compute a measure of the total variability in sales. Because Y is sometimes higher and sometimes lower than the mean, there may be both positive and nega- tive deviations. Simply summing these values would be misleading because the negatives would cancel out the positives, making it appear that the numbers are closer to the mean than they actually are. To prevent this problem, we will use the sum of the squares total (SST) to meas- ure the total variability in Y:

(4-6)

If we did not use X to predict Y, we would simply use the mean of Y as the prediction, and the SST would measure the accuracy of our predictions. However, a regression line may be used to predict the value of Y, and while there are still errors involved, the sum of these squared errors will be less than the total sum of squares just computed. The sum of the squares error (SSE) is

(4-7)

Table 4.3 provides the calculations for the Triple A Construction Example. The mean is compared to each value and we get

The prediction for each observation is computed and compared to the actual value. This results in

The SSE is much lower than the SST. Using the regression line has reduced the variability in the sum of squares by This is called the sum of squares due to22.5 - 6.875 = 15.625.

SSE = 6.875

1YN2 SST = 22.5

1Y = 72 SSE = ge2 = g1Y - YN 22

SST = g1Y - Y22

Deviations (errors) may be positive or negative.

The SST measures the total variability in Y about the mean.

The SSE measures the variability in Y about the regression line.

One of the purposes of regression is to understand the relationship among variables. This model tells us that for each $100 million (represented by X) increase in the payroll, we would ex- pect the sales to increase by $125,000 since ($100,000s). This model helps Triple A Construction see how the local economy and company sales are related.

b1 = 1.25

TABLE 4.3 Sum of Squares for Triple A Construction

Y X (Y � Y– )2 Ŷ (Y �Ŷ)2 (Ŷ � Y– )2

6 3 (6 – 7)2 � 1 2 � 1.25(3) � 5.75 0.0625 1.563

8 4 (8 – 7)2 � 1 2 � 1.25(4) � 7.00 1 0

9 6 (9 – 7)2 � 4 2 � 1.25(6) � 9.50 0.25 6.25

5 4 (5 – 7)2 � 4 2 � 1.25(4) � 7.00 4 0

4.5 2 (4.5 – 7)2 � 6.25 2 � 1.25(2) � 4.50 0 6.25

9.5 5 (9.5 – 7)2 � 6.25 2 � 1.25(5) � 8.25 1.5625 1.563

SST � 22.5 SSE � 6.875 SSR � 15.625Y = 7

g1YN - Y22 = 15.625g1Y - YN 22 = 6.875g1Y - Y22 = 22.5

120 CHAPTER 4 • REGRESSION MODELS

regression (SSR) and indicates how much of the total variability in Y is explained by the regres- sion model. Mathematically, this can be calculated as

(4-8)

Table 4.3 indicates

There is a very important relationship between the sums of squares that we have computed:

(4-9)

Figure 4.2 displays the data for Triple A Construction. The regression line is shown, as is a line representing the mean of the Y values. The errors used in computing the sums of squares are shown on this graph. Notice how the sample points are closer to the regression line than they are to the mean.

Coefficient of Determination The SSR is sometimes called the explained variability in Y while the SSE is the unexplained variability in Y. The proportion of the variability in Y that is explained by the regression equa- tion is called the coefficient of determination and is denoted by Thus,

(4-10)

Thus, can be found using either the SSR or the SSE. For Triple A Construction, we have

This means that about 69% of the variability in sales (Y) is explained by the regression equation based on payroll (X).

If every point in the sample were on the regression line (meaning all errors are 0), then 100% of the variability in Y could be explained by the regression equation, so and

The lowest possible value of is 0, indicating that X explains 0% of the variability in Y. Thus, can range from a low of 0 to a high of 1. In developing regression equations, a good model will have an value close to 1.r2

r2 r2SSE = 0.

r2 = 1

r2 = 15.625

22.5 = 0.6944

r2

r2 = SSR

SST = 1 -

SSE

SST

r2.

SST = SSR + SSE

1Sum of squares total2 = 1Sum of squares due to regression2 + 1Sum of squares error2 SSR = 15.625

SSR = g1YN - Y22

is the proportion of variability in Y that is explained by the regression equation.

r2

If every point lies on the regression line, and SSE � 0.

r2 � 1

0 1 2 0

2

4

6

8

10

12

3 4 5 6 7 8

Payroll ($100 million)

S al

es (

$1 00

,0 00

)

Y – Y ^ Y

Y – Y Y – Y

^

Y = 2 � 1.25 X ^

⎧ ⎪ ⎨ ⎪⎪⎩

⎨ ⎩

⎩⎨

Y

X

FIGURE 4.2 Deviations from the Regression Line and from the Mean

4.4 MEASURING THE FIT OF THE REGRESSION MODEL 121

The correlation coefficient ranges from -1 to �1.

(a) Perfect Positive Correlation: r � �1

Y

X (b) Positive Correlation: 0 � r �1

Y

X

(c) No Correlation: r � 0

Y

X (d ) Perfect Negative Correlation: r � �1

Y

X

FIGURE 4.3 Four Values of the Correlation Coefficient

Correlation Coefficient Another measure related to the coefficient of determination is the coefficient of correlation. This measure also expresses the degree or strength of the linear relationship. It is usually expressed as r and can be any number between and including +1 and -1. Figure 4.3 illustrates possible scatter diagrams for different values of r. The value of r is the square root of It is negative if the slope is negative, and it is positive if the slope is positive. Thus,

(4-11)

For the Triple A Construction example with

We know it is positive because the slope is +1.25.

r = 10.6944 = 0.8333

r2 = 0.6944,

r = �2r2

r2.

Multiple Regression Modeling at Canada’s TransAlta Utilities

TransAlta Utilities (TAU) is a $1.6 billion energy company oper- ating in Canada, New Zealand, Australia, Argentina, and the United States. Headquartered in Alberta, Canada, TAU is that country’s largest publicly owned utility. It serves 340,000 cus- tomers in Alberta through 57 customer-service facilities, each of which was staffed by 5 to 20 customer service linemen. The 270 linemen’s jobs are to handle new connections and repairs and to patrol power lines and check substations. This existing system was not the result of some optimal central planning but was put in place incrementally as the company grew.

With help from the University of Alberta, TAU wanted to develop a causal model to decide how many linemen would be best assigned to each facility. The research team decided to build a multiple regres- sion model with only three independent variables. The hardest part of the task was to select variables that were easy to quantify based

on available data. In the end, the explanatory variables were number of urban customers, number of rural customers, and geographic size of a service area. The implicit assumptions in this model are that the time spent on customers is proportional to the number of customers; and the time spent on facilities (line patrol and substation checks) and travel are proportional to the size of the service region. By definition, the unexplained time in the model accounts for time that is not explained by the three variables (such as meetings, breaks, or unpro- ductive time).

Not only did the results of the model please TAU managers, but the project (which included optimizing the number of facili- ties and their locations) saved $4 million per year.

Source: Based on E. Erkut, T. Myroon, and K. Strangway. “TransAlta Re- designs Its Service-Delivery Network,” Interfaces (March–April 2000): 54–69.

IN ACTION

122 CHAPTER 4 • REGRESSION MODELS

4.5 Using Computer Software for Regression

Software such as QM for Windows (Appendix 4.2), Excel, and Excel QM (Appendix 4.3) is often used for regression calculations. We will rely on Excel for most of the calculations in the rest of this chapter. When using Excel to develop a regression model, the input and output for Excel 2007 and Excel 2010 are the same.

The Triple A Construction example will be used to illustrate how to develop a regression model in Excel 2010. Go to the Data tab and select Data Analysis, as shown in Program 4.1A. If Data Analysis does not appear, then the Excel add-in Data Analysis from the Analysis Tool- Pak must be enabled or activated. Appendix F at the end of this book provides instructions on how to enable this and other add-ins for Excel 2010 and Excel 2007. Once an add-in is activated, it will remain on the Data tab for future use.

When the Data Analysis window opens, scroll down to and highlight Regression and click OK, as illustrated in Program 4.1A. The Regression window will open, as shown in Program 4.1B, and you can input the X and Y ranges. Check the Labels box because the cells with the variable name were included in the first row of the X and Y ranges. To have the out- put presented on this page rather than on a new worksheet, select Output Range and give a cell address for the start of the output. Click the OK button, and the output appears in the out- put range specified.

Program 4.1C shows the intercept (2), slope (1.25), and other information that was previ- ously calculated for the Triple A Construction example.

The sums of squares are shown in the column headed by SS. Another name for error is residual. In Excel, the sum of squares error is shown as the sum of squares residual. The values in this output are the same values shown in Table 4.3:

The coefficient of determination is shown to be 0.6944. The coefficient of correlation (r) is

called Multiple R in the Excel output, and this is 0.8333.

1r22 Sum of squares total = SST = 22.5

Sum of squares error 1residual2 = SSE = 6.8750

Sum of squares regression = SSR = 15.625

Errors are also called residuals.

When the Data Analysis window opens. scroll down to Regression.

Go to the Data tab.

Select Data Analysis.

Click ok.

PROGRAM 4.1A Accessing the Regression Option in Excel 2010

4.6 ASSUMPTIONS OF THE REGRESSION MODEL 123

Specify the X and Y ranges.Check the Labels box if the first row in the X and Y ranges includes the variable names.

Click OK to have Excel develop the regression model.

Specify the location for the output. To put this on the current worksheet, click Output Range and give a cell location for this to begin.

PROGRAM 4.1B Data Input for Regression in Excel

A high r2 (close to 1) is desirable.

A low (e.g., less than 0.05) Significance F (p-value for overall model) indicates a significant relationship between X and Y.

The regression coefficients are given here.

The SSR (regression), SSE (residual or error), and SST (total) are shown in the SS column of the ANOVA table.

PROGRAM 4.1C Excel Output for the Triple A Construction Example

4.6 Assumptions of the Regression Model

If we can make certain assumptions about the errors in a regression model, we can perform statisti- cal tests to determine if the model is useful. The following assumptions are made about the errors:

1. The errors are independent. 2. The errors are normally distributed. 3. The errors have a mean of zero. 4. The errors have a constant variance (regardless of the value of X).

124 CHAPTER 4 • REGRESSION MODELS

A plot of the errors may highlight problems with the model.

E rr

or

X

FIGURE 4.4A Pattern of Errors Indicating Randomness

E rr

or

X

FIGURE 4.4B Nonconstant Error Variance

E rr

or

X

FIGURE 4.4C Errors Indicate Relationship is Not Linear

It is possible to check the data to see if these assumptions are met. Often a plot of the residuals will highlight any glaring violations of the assumptions. When the errors (residuals) are plotted against the independent variable, the pattern should appear random.

Figure 4.4 presents some typical error patterns, with Figure 4.4A displaying a pattern that is expected when the assumptions are met and the model is appropriate. The errors are random and no discernible pattern is present. Figure 4.4B demonstrates an error pattern in which the errors increase as X increases, violating the constant variance assumption. Figure 4.4C shows errors

consistently increasing at first, and then consistently decreasing. A pattern such as this would indicate that the model is not linear and some other form (perhaps quadratic) should be used. In general, patterns in the plot of the errors indicate problems with the assumptions or the model specification.

Estimating the Variance While the errors are assumed to have constant variance this is usually not known. It can be estimated from the sample results. The estimate of is the mean squared error (MSE) and is denoted by The MSE is the sum of squares due to error divided by the degrees of freedom:*

(4-12)

where

of observations in the sample

of independent variables

In this example, and So

From this we can estimate the standard deviation as

(4-13)

This is called the standard error of the estimate or the standard deviation of the regression. In the example shown in Program 4.1D,

This is used in many of the statistical tests about the model. It is also used to find interval estimates for both Y and regression coefficients.**

s = 1MSE = 11.7188 = 1.31

s = 1MSE

s2 = MSE = SSE

n - k - 1 =

6.8750

6 - 1 - 1 =

6.8750

4 = 1.7188

k = 1.n = 6

k = number

n = number

s2 = MSE = SSE

n - k - 1

s2. �2 1�22,

4.7 TESTING THE MODEL FOR SIGNIFICANCE 125

*See bibliography at end of this chapter for books with further details. **The MSE is a common measure of accuracy in forecasting. When used with techniques besides regression, it is com- mon to divide the SSE by n rather than n - k - 1.

The error variance is estimated by the MSE.

4.7 Testing the Model for Significance

Both the MSE and provide a measure of accuracy in a regression model. However, when the sample size is too small, it is possible to get good values for both of these even if there is no relationship between the variables in the regression model. To determine whether these values are meaningful, it is necessary to test the model for significance.

To see if there is a linear relationship between X and Y, a statistical hypothesis test is per- formed. The underlying linear model was given in Equation 4-1 as

If then Y does not depend on X in any way. The null hypothesis says there is no linear relationship between the two variables (i.e., ). The alternate hypothesis is that there is a linear relationship (i.e., ). If the null hypothesis can be rejected, then we have proven that a linear relationship does exist, so X is helpful in predicting Y. The F distribution is used for test- ing this hypothesis. Appendix D contains values for the F distribution which can be used when calculations are performed by hand. See Chapter 2 for a review of the F distribution. The results of the test can also be obtained from both Excel and QM for Windows.

b1 Z 0 b1 = 0

b1 = 0,

Y = b0 + b1X + P

r2

An F test is used to determine if there is a relationship between X and Y.

126 CHAPTER 4 • REGRESSION MODELS

If the significance level for the F test is low, there is a relationship between X and Y.

The F statistic used in the hypothesis test is based on the MSE (seen in the previous sec- tion) and the mean squared regression (MSR). The MSR is calculated as

(4-14)

where

The F statistic is

(4-15)

Based on the assumptions regarding the errors in a regression model, this calculated F statistic is described by the F distribution with

where

If there is very little error, the denominator (MSE) of the F statistic is very small relative to the numerator (MSR), and the resulting F statistic would be large. This would be an indi- cation that the model is useful. A significance level related to the value of the F statistic is then found. Whenever the F value is large, the significance level (p-value) will be low, indi- cating that it is extremely unlikely that this could have occurred by chance. When the F value is large (with a resulting small significance level), we can reject the null hypothesis that there is no linear relationship. This means that there is a linear relationship and the values of MSE and are meaningful.

The hypothesis test just described is summarized here:

Steps in Hypothesis Test for a Significant Regression Model

1. Specify null and alternative hypotheses:

2. Select the level of significance (�). Common values are 0.01 and 0.05.

3. Calculate the value of the test statistic using the formula

4. Make a decision using one of the following methods: (a) Reject the null hypothesis if the test statistic is greater than the F value from the table

in Appendix D. Otherwise, do not reject the null hypothesis:

(b) Reject the null hypothesis if the observed significance level, or p-value, is less than the level of significance (�). Otherwise, do not reject the null hypothesis:

Reject if p-value 6 �

p-value = P1F 7 calculated test statistic2 df2 = n - k - 1

df1 = k

Reject if Fcalculated 7 F�,df1,df2

F = MSR

MSE

H1:b1 Z 0

H0:b1 = 0

r2

k = the number of independent 1X2 variables

degrees of freedom for the denominator = df2 = n - k - 1.

degrees of freedom for the numerator = df1 = k

F = MSR

MSE

k = number of independent variables in the model

MSR = SSR

k

4.7 TESTING THE MODEL FOR SIGNIFICANCE 127

Triple A Construction Example To illustrate the process of testing the hypothesis about a significant relationship, consider the Triple A Construction example. Appendix D will be used to provide values for the F distribution.

Step 1.

Step 2.

Step 3. Calculate the value of the test statistic. The MSE was already calculated to be 1.7188. The MSR is then calculated so that F can be found:

Step 4. (a) Reject the null hypothesis if the test statistic is greater than the F value from the table in Appendix D:

The value of F associated with a 5% level of significance and with degrees of freedom 1 and 4 is found in Appendix D. Figure 4.5 illustrates this:

Thus, there is sufficient data to conclude that there is a statistically significant relationship between X and Y, so the model is helpful. The strength of this relationship is measured by

Thus, we can conclude that about 69% of the variability in sales (Y) is explained by the regression model based on local payroll (X).

The Analysis of Variance (ANOVA) Table When software such as Excel or QM for Windows is used to develop regression models, the out- put provides the observed significance level, or p-value, for the calculated F value. This is then compared to the level of significance (α) to make the decision.

r2 = 0.69.

Reject H0 because 9.09 7 7.71

Fcalculated = 9.09

F0.05,1,4 = 7.71

df2 = n - k - 1 = 6 - 1 - 1 = 4

df1 = k = 1

F = MSR

MSE =

15.6250

1.7188 = 9.09

MSR = SSR

k =

15.6250

1 = 15.6250

Select � = 0.05.

H1 : b1 Z 0 1linear relationship exists between X and Y2 H0 : b1 = 0 1no linear relationship between X and Y2

F � 7.71 9.09

0.05

FIGURE 4.5 F Distribution for Triple A Construction Test for Significance

128 CHAPTER 4 • REGRESSION MODELS

4.8 Multiple Regression Analysis

The multiple regression model is a practical extension of the model we just observed. It allows us to build a model with several independent variables. The underlying model is

(4-16)

where

dependent variable (response variable)

th independent variable (predictor variable or explanatory variable)

intercept (value of Y when all )

coefficient of the ith independent variable

number of independent variables

random error

To estimate the values of these coefficients, a sample is taken and the following equation is developed:

(4-17)

where

predicted value of Y

sample intercept (and is an estimate of )

sample coefficient of ith variable (and is an estimate of )

Consider the case of Jenny Wilson Realty, a real estate company in Montgomery, Alabama. Jenny Wilson, owner and broker for this company, wants to develop a model to determine a suggested listing price for houses based on the size of the house and the age of the house. She selects a sample of houses that have sold recently in a particular area, and she records the selling price, the square footage of the house, the age of the house, and also the con- dition (good, excellent, or mint) of each house as shown in Table 4.5. Initially Jenny plans to

bibi = b0b0 =

YN =

YN = b0 + b1X1 + b2X2 + Á + bkXk

P = k = bi =

Xi = 0b0 = Xi = i

Y =

Y = b0 + b1X1 + b2X2 + Á + bkXk + P

A multiple regression model has more than one independent variable.

DF SS MS F SIGNIFICANCE F

Regression k SSR MSR = SSR/k MSR/MSE P(F MSR/MSE)

Residual n – k – 1 SSE MSE = SSE/(n–k–1)

Total n – 1 SST

TABLE 4.4 Analysis of Variance (ANOVA) Table for Regression

Table 4.4 provides summary information about the ANOVA table. This shows how the num- bers in the last three columns of the table are computed. The last column of this table, labeled Significance F, is the p-value, or observed significance level, which can be used in the hypothe- sis test about the regression model.

Triple A Construction ANOVA Example The Excel output that includes the ANOVA table for the Triple A Construction data is shown in Program 4.1C. The observed significance level for is given to be 0.0394. This means

Because this probability is less than 0.05 (α), we would reject the hypothesis of no linear rela- tionship and conclude that there is a linear relationship between X and Y. Note in Figure 4.5 that the area under the curve to the right of 9.09 is clearly less than 0.05, which is the area to the right of the F value associated with a 0.05, level of signicance.

P1F 7 9.09092 = 0.0394

F = 9.0909

4.8 MULTIPLE REGRESSION ANALYSIS 129

use only the square footage and age to develop a model, although she wants to save the infor- mation on condition of the house to use later. She wants to find the coefficients for the follow- ing multiple regression model:

where

predicted value of dependent variable (selling price)

intercept

and value of the two independent variables (square footage and age), respectively

and slopes for and respectively

The mathematics of multiple regression becomes quite complex, so we leave formulas for and to regression textbooks.* Excel can be used to develop a multiple regression model

just as it was used for a simple linear regression model. When entering the data in Excel, it is important that all of the independent variables are in adjoining columns to facilitate the input. From the Data tab in Excel, select Data Analysis and then Regression, as shown earlier, in Program 4.1A. This opens the regression window to allow the input, as shown in Program 4.2A. Note that the X Range includes the data in two columns (B and C) because there are two inde- pendent variables. The Excel output that Jenny Wilson obtains is shown in Program 4.2B, and it provides the following equation:

Evaluating the Multiple Regression Model A multiple regression model can be evaluated in a manner similar to the way a simple linear regression model is evaluated. Both the p-value for the F test and can be interpreted the same with multiple regression models as they are with simple linear regression models. However, as

r2

= 146,630.89 + 43.82 X1 - 2898.69 X2

YN = b0 + b1X1 + b2X2

b2b0, b1,

X2,X1b2 =b1

X2 =X1

b0 = Y

YN =

YN = b0 + b1X1 + b2X2

SELLING SQUARE PRICE ($) FOOTAGE AGE CONDITION

95,000 1,926 30 Good

119,000 2,069 40 Excellent

124,800 1,720 30 Excellent

135,000 1,396 15 Good

142,800 1,706 32 Mint

145,000 1,847 38 Mint

159,000 1,950 27 Mint

165,000 2,323 30 Excellent

182,000 2,285 26 Mint

183,000 3,752 35 Good

200,000 2,300 18 Good

211,000 2,525 17 Good

215,000 3,800 40 Excellent

219,000 1,740 12 Mint

TABLE 4.5 Jenny Wilson Real Estate Data

*See, for example, Norman R. Draper and Harry Smith. Applied Regression Analysis, 3rd ed. New York: John Wiley & Sons, Inc., 1998.

Excel can be used to develop multiple regression models.

130 CHAPTER 4 • REGRESSION MODELS

Input the X range to include both column B and column C.

Output range begins at cell A19.

Variable names (row 3) are included in X and Y ranges, so Labels must be checked.

PROGRAM 4.2A Input Screen for the Jenny Wilson Realty Multiple Regression Example

The coefficient of determination (r2) is 0.67.

The regression coefficients are found here.

A low significance level for F proves a relationship exists between Y and at least one of the independent (X ) variables.

The p-values are used to test the individual variables for significance.

PROGRAM 4.2B Output for the Jenny Wilson Realty Multiple Regression Example

To determine which of the independent variables in a multiple regression model is signifi- cant, a significance test on the coefficient for each variable is performed. While statistics text- books can provide the details of these tests, the results of these tests are automatically displayed in the Excel output. The null hypothesis is that the coefficient is and the alternate hypothesis is that it is not zero The test statistic is calculated in Excel, and the p-value is given. If the p-value is lower than the level of significance (α), then the null hypothe- sis is rejected and it can be concluded that the variable is significant.

Jenny Wilson Realty Example In the Jenny Wilson Realty example in Program 4.2B, the overall model is statistically signifi- cant and useful in predicting the selling price of the house because the p-value for the F test is 0.002. The value is 0.6719, so 67% of the variability in selling price for these houses can be explained by the regression model. However, there were two independent variables in the model—square footage and age. It is possible that one of these is significant and the other is not. The F test simply indicates that the model as a whole is significant.

r2

1H1:bi Z 02. 0 1H0:bi = 02

there is more than one independent variable, the hypothesis that is being tested with the F test is that all the coefficients are equal to 0. If all these are 0, then none of the independent variables in the model is helpful in predicting the dependent variable.

4.9 BINARY OR DUMMY VARIABLES 131

Two significance tests can be performed to determine if square footage or age (or both) are significant. In Program 4.2B, the results of two hypothesis tests are provided. The first test for variable (square footage) is

Using a 5% level of significance the null hypothesis is rejected because the p-value for this is 0.0013. Thus, square footage is helpful in predicting the price of a house.

Similarly, the variable (age) is tested using the Excel output, and the p-value is 0.0039. The null hypothesis is rejected because this is less than 0.05. Thus, age is also helpful in predict- ing the price of a house.

X2

1� = 0.052, H1:b1 Z 0

H0:b1 = 0

X1

4.9 Binary or Dummy Variables

All of the variables we have used in regression examples have been quantitative variables such as sales figures, payroll numbers, square footage, and age. These have all been easily measura- ble and have had numbers associated with them. There are many times when we believe a quali- tative variable rather than a quantitative variable would be helpful in predicting the dependent variable Y. For example, regression may be used to find a relationship between annual income and certain characteristics of the employees. Years of experience at a particular job would be a quantitative variable. However, information regarding whether or not a person has a college degree might also be important. This would not be a measurable value or quantity, so a special variable called a dummy variable (or a binary variable or an indicator variable) would be used. A dummy variable is assigned a value of 1 if a particular condition is met (e.g., a person has a college degree), and a value of 0 otherwise.

Return to the Jenny Wilson Realty example. Jenny believes that a better model can be de- veloped if the condition of the property is included. To incorporate the condition of the house into the model, Jenny looks at the information available (see Table 4.5), and sees that the three categories are good condition, excellent condition, and mint condition. Since these are not quan- titative variables, she must use dummy variables. These are defined as

Notice there is no separate variable for “good” condition. If and are both 0, then the house cannot be in excellent or mint condition, so it must be in good condition. When using dummy variables, the number of variables must be 1 less than the number of categories. In this problem, there were three categories (good, excellent, and mint condition) so we must have two dummy variables. If we had mistakenly used too many variables and the number of dummy variables equaled the number of categories, then the mathematical computations could not be performed or would not give reliable values.

These dummy variables will be used with the two previous variables ( —square footage, and —age) to try to predict the selling prices of houses for Jenny Wilson. Programs 4.3A and 4.3B provide the Excel input and output for this new data, and this shows how the dummy vari- ables were coded. The significance level for the F test is 0.00017, so this model is statistically significant. The coefficient of determination is 0.898, so this is a much better model than the previous one. The regression equation is

This indicates that a house in excellent condition would sell for about $33,162 more than a house in good condition A house in mint condition1X3 = 0, X4 = 02.1X3 = 1, X4 = 02YN = 121,658 + 56.43X1 - 3,962X2 + 33,162X3 + 47,369X4

1r22 X2

X1

X4X3

= 0 otherwise

X4 = 1 if house is in mint condition

= 0 otherwise

X3 = 1 if house is in excellent condition

A dummy variable is also called an indicator variable or a binary variable.

The number of dummy variables must equal one less than the number of categories of a qualitative variable.

would sell for about $47,369 more than a house in good condition.1X3 = 0, X4 = 12

132 CHAPTER 4 • REGRESSION MODELS

The X range includes columns B, C, D, and E, but not column F.

PROGRAM 4.3A Input Screen for the Jenny Wilson Realty Example with Dummy Variables

The overall model is helpful because the significance F probability is low (much less than 5%).

Each of the variables individually is helpful because the p-values for each of them is low (much less than 5%).

The coefficient of age is negative, indicating that the price decreases as a house gets older.

PROGRAM 4.3B Output for the Jenny Wilson Realty Example with Dummy Variables

4.10 Model Building

In developing a good regression model, possible independent variables are identified and the best ones are selected to be used in the model. The best model is a statistically significant model with a high and few variables.

As more variables are added to a regression model, will usually increase, and it cannot decrease. It is tempting to keep adding variables to a model to try to increase However, if too many independent variables are included in the model, problems can arise. For this reason, the adjusted value is often used (rather than ) to determine if an additional independent vari- able is beneficial. The adjusted takes into account the number of independent variables in the model, and it is possible for the adjusted to decrease. The formula for is

r2 = SSR

SST = 1 -

SSE

SST

r2r2 r2

r2r2

r2. r2

r2

The value of can never decrease when more variables are added to the model.

r2

The adjusted may decrease when more variables are added to the model.

r2

4.11 NONLINEAR REGRESSION 133

The adjusted is

(4-18)

Notice that as the number of variables (k) increases, will decrease. This causes to increase, and consequently the adjusted will decrease unless the extra

variable in the model causes a significant decrease in the SSE. Thus, the reduction in error (and SSE) must be sufficient to offset the change in k.

As a general rule of thumb, if the adjusted increases when a new variable is added to the model, the variable should probably remain in the model. If the adjusted decreases when a new variable is added, the variable should not remain in the model. Other factors should also be considered when trying to build the model, but they are beyond the introduc- tory level of this chapter.

STEPWISE REGRESSION While the process of model building may be tedious, there are many statistical software packages that include stepwise regression procedures to do this. Stepwise regression is an automated process to systematically add or delete independent variables from a regression model. A forward stepwise procedure puts the most significant variable in the model first and then adds the next variable that will improve the model the most, given that the first variable is already in the model. Variables continue to be added in this fashion until all the vari- ables are in the model or until any remaining variables do not significantly improve the model. A backwards stepwise procedure begins with all independent variables in the model, and one-by-one the least helpful variables are deleted. This continues until only significant variables remain. Many variations of these stepwise models exist.

MULTICOLLINEARITY In the Jenny Wilson Realty example illustrated in Program 4.3B, we saw an of about 0.90 and an adjusted of 0.85. While other variables such as the size of the lot, the number of bedrooms, and the number of bathrooms might be related to the selling price of a house, we may not want to include these in the model. It is likely that these variables would be correlated with the square footage of the house (e.g., more bedrooms usually means a larger house), which is already included in the model. Thus, the information provided by these addi- tional variables might be duplication of information already in the model.

When an independent variable is correlated with one other independent variable, the vari- ables are said to be collinear. If an independent variable is correlated with a combination of other independent variables, the condition of multicollinearity exists. This can create problems in interpreting the coefficients of the variables as several variables are providing duplicate infor- mation. For example, if two independent variables were monthly salary expenses for a company and annual salary expenses for a company, the information provided in one is also provided in the other. Several sets of regression coefficients for these two variables would yield exactly the same results. Thus, individual interpretation of these variables would be questionable, although the model itself is still good for prediction purposes. When multicollinearity exists, the overall F test is still valid, but the hypothesis tests related to the individual coefficients are not. A variable may appear to be significant when it is insignificant, or a variable may appear to be insignificant when it is significant.

r2r2

r2 r2

r2SSE>1n - k - 12 n - k - 1

Adjusted r2 = 1 - SSE>1n - k - 12

SST>1n - 12 r2

A variable should not be added to the model if it causes the adjusted to decrease.r2

Multicollinearity exists when a variable is correlated to other variables.

4.11 Nonlinear Regression

The regression models we have seen are linear models. However, at times there exist nonlinear rela- tionships between variables. Some simple variable transformations can be used to create an appar- ently linear model from a nonlinear relationship. This allows us to use Excel and other linear regression programs to perform the calculations. We will demonstrate this in the following example.

On every new automobile sold in the United States, the fuel efficiency (as measured by miles per gallon of gasoline (MPG) of the automobile is prominently displayed on the window sticker. The MPG is related to several factors, one of which is the weight of the automobile. Engineers at Colonel Motors, in an attempt to improve fuel efficiency, have been asked to study the impact of weight on MPG. They have decided that a regression model should be used to do this.

Transformations may be used to turn a nonlinear model into a linear model.

134 CHAPTER 4 • REGRESSION MODELS

1.00 2.00 3.00 4.00 5.00 0

5

10

25

15

20

30

35

40

45

Weight (1,000 lb.)

M P

G

FIGURE 4.6A Linear Model for MPG Data

A sample of 12 new automobiles was selected, and the weight and MPG rating were recorded. Table 4.6 provides this data. A scatter diagram of this data in Figure 4.6A shows the weight and MPG. A linear regression line is drawn through the points. Excel was used to develop a simple linear regression equation to relate the MPG (Y) to the weight in 1,000 lb. in the form

YN = b0 + b1X1

1X12

TABLE 4.6 Automobile Weight vs. MPG

WEIGHT WEIGHT MPG (1,000 LB.) MPG (1,000 LB.)

12 4.58 20 3.18

13 4.66 23 2.68

15 4.02 24 2.65

18 2.53 33 1.70

19 3.09 36 1.95

19 3.11 42 1.92

1.00 2.00 3.00 4.00 5.00 0

5

10

25

15

20

30

35

40

45

Weight (1,000 lb.)

M P

G

FIGURE 4.6B Nonlinear Model for MPG Data

4.11 NONLINEAR REGRESSION 135

The Excel output is shown in Program 4.4. From this we get the equation

or

The model is useful since the significance level for the F test is small and However, further examination of the graph in Figure 4.6A brings into question the use of a lin- ear model. Perhaps a nonlinear relationship exists, and maybe the model should be modified to account for this. A quadratic model is illustrated in Figure 4.6B. This model would be of the form

The easiest way to develop this model is to define a new variable

This gives us the model

We can create another column in Excel, and again run the regression tool. The output is shown in Program 4.5. The new equation is

The significance level for F is low (0.0002) so the model is useful, and The adjusted increased from 0.719 to 0.814, so this new variable definitely improved the model.

This model is good for prediction purposes. However, we should not try to interpret the coefficients of the variables due to the correlation between (weight) and (weight squared). Normally we would interpret the coefficient for as the change in Y that results from a 1-unit change in while holding all other variables constant. Obviously holding oneX1,

X1

X2X1

r2 r2 = 0.8478.

YN = 79.8 - 30.2X1 + 3.4X2

YN = b0 + b1X1 + b2X2

X2 = 1weight22 MPG = b0 + b11weight2 + b21weight22

r2 = 0.7446.

MPG = 47.6 - 8.21weight in 1,000 lb.2 YN = 47.6 - 8.2X1

PROGRAM 4.4 Excel Output for Linear Regression Model with MPG Data

PROGRAM 4.5 Excel Output for Nonlinear Regression Model with MPG Data

A low significance value for F and a high are indications of a good model.

r2

136 CHAPTER 4 • REGRESSION MODELS

variable constant while changing the other is impossible in this example since If changes, then must change also. This is an example of a problem that exists when multi- collinearity is present.

Other types of nonlinearities can be handled using a similar approach. A number of transfor- mations exist that may help to develop a linear model from variables with nonlinear relationships.

X2

X1X2 = X2 1.

4.12 Cautions and Pitfalls in Regression Analysis

This chapter has provided a brief introduction into regression analysis, one of the most widely used quantitative techniques in business. However, some common errors are made with regres- sion models, so caution should be observed when using this.

If the assumptions are not met, the statistical tests may not be valid. Any interval estimates are also invalid, although the model can still be used for prediction purposes.

Correlation does not necessarily mean causation. Two variables (such as the price of auto- mobiles and your annual salary) may be highly correlated to one another, but one is not causing the other to change. They may both be changing due to other factors such as the economy in general or the inflation rate.

If multicollinearity is present in a multiple regression model, the model is still good for pre- diction, but interpretation of individual coefficients is questionable. The individual tests on the regression coefficients are not valid.

Using a regression equation beyond the range of X is very questionable. A linear relation- ship may exist within the range of values of X in the sample. What happens beyond this range is unknown; the linear relationship may become nonlinear at some point. For example, there is usually a linear relationship between advertising and sales within a limited range. As more money is spent on advertising, sales tend to increase even if everything else is held constant. However, at some point, increasing advertising expenditures will have less impact on sales unless the company does other things to help, such as opening new markets or expanding the product offerings. If advertising is increased and nothing else changes, the sales will probably level off at some point.

Related to the limitation regarding the range of X is the interpretation of the intercept Since the lowest value for X in a sample is often much greater than 0, the intercept is a point on the regression line beyond the range of X. Therefore, we should not be concerned if the t-test for this coefficient is not significant as we should not be using the regression equation to predict a value of Y when This intercept is merely used in defining the line that fits the sample points the best.

Using the F test and concluding a linear regression model is helpful in predicting Y does not mean that this is the best relationship. While this model may explain much of the variability in Y, it is possible that a nonlinear relationship might explain even more. Similarly, if it is con- cluded that no linear relationship exists, another type of relationship could exist.

A statistically significant relationship does not mean it has any practical value. With large enough samples, it is possible to have a statistically significant relationship, but might be 0.01. This would normally be of little use to a manager. Similarly, a high could be found due to random chance if the sample is small. The F test must also show significance to place any value in r2.

r2 r2

X = 0.

1b02.

A high correlation does not mean one variable is causing a change in the other.

The regression equation should not be used with values of X that are below the lowest value of X or above the highest value of X found in the sample.

A significant F value may occur even when the relationship is not strong.

Summary

Regression analysis is an extremely valuable quantitative tool. Using scatter diagrams helps to see relationships between vari- ables. The F test is used to determine if the results can be con- sidered useful. The coefficient of determination is used to measure the proportion of variability in Y that is explained by the regression model. The correlation coefficient measures the relationship between two variables.

1r22 Multiple regression involves the use of more than one in-

dependent variable. Dummy variables (binary or indicator vari- ables) are used with qualitative or categorical data. Nonlinear models can be transformed into linear models.

We saw how to use Excel to develop regression models. Interpretation of computer output was presented, and several examples were provided.

KEY EQUATIONS 137

Glossary

Adjusted A measure of the explanatory power of a regression model that takes into consideration the number of independent variables in the model.

Binary Variable See Dummy Variable. Coefficient of Correlation A measure of the strength of

the relationship between two variables. Coefficient of Determination The percent of the vari-

ability in the dependent variable that is explained by the regression equation.

Collinearity A condition that exists when one independent variable is correlated with another independent variable.

Dependent Variable The Y-variable in a regression model. This is what is being predicted.

Dummy Variable A variable used to represent a qualitative factor or condition. Dummy variables have values of 0 or 1. This is also called a binary variable or an indicator variable.

Error. The difference between the actual value (Y) and the predicted value

Explanatory Variable The independent variable in a regres- sion equation.

Independent Variable The X-variable in a regression equa- tion. This is used to help predict the dependent variable.

Least Squares A reference to the criterion used to select the regression line, to minimize the squared distances between the estimated straight line and the observed values.

Mean Squared Error (MSE) An estimate of the error variance.

Multicollinearity A condition that exists when one independent variable is correlated with other independent variables.

1YN 2.

1Y21r22

1r2

r2 Multiple Regression Model A regression model that has more than one independent variable.

Observed Significance Level Another name for p-value. p-Value A probability value that is used when testing a

hypothesis. The hypothesis is rejected when this is low. Predictor Variable Another name for explanatory variable. Regression Analysis A forecasting procedure that uses the

least squares approach on one or more independent variables to develop a forecasting model.

Residual. Another term for error. Response Variable The dependent variable in a regression

equation. Scatter Diagrams Diagrams of the variable to be forecasted,

plotted against another variable, such as time. Also called scatter plots.

Standard Error of the Estimate An estimate of the standard deviation of the errors and is sometimes called the standard deviation of the regression.

Stepwise Regression An automated process to systematically add or delete independent variables from a regression model.

Sum of Squares Error (SSE) The total sum of the squared differences between each observation (Y) and the predicted value

Sum of Squares Regression (SSR) The total sum of the squared differences between each predicted value and the mean

Sum of Squares Total (SST) The total sum of the squared differences between each observation (Y) and the mean 1Y2.1Y2. 1YN 21YN 2.

Key Equations

(4-1) Underlying linear model for simple linear regression.

(4-2) Simple linear regression model computed from a sample.

(4-3) Error in regression model.

(4-4)

Slope in the regression line.

(4-5) The intercept in the regression line.

(4-6) Total sums of squares.

(4-7) Sum of squares due to error.

(4-8) Sum of squares due to regression. SSR = g1YN - Y22 SSE = ge2 = g1Y - YN 22 SST = g1Y - Y22 b0 = Y - b1X

b1 = ©1X - X21Y - Y2

©1X - X22 e = Y - YN

YN = b0 + b1X

Y = b0 + b1X + P (4-9) Relationship among sums of squares in regression.

(4-10)

Coefficient of determination.

(4-11) Coefficient of correlation. This has the same sign as the slope.

(4-12)

An estimate of the variance of the errors in regression; n is the sample size and k is the number of independent variables.

(4-13) An estimate of the standard deviation of the errors. Also called the standard error of the estimate.

s = 1MSE

s2 = MSE = SSE

n - k - 1

r = ; 2r2

r2 = SSR

SST = 1 -

SSE

SST

SST = SSR + SSE

138 CHAPTER 4 • REGRESSION MODELS

(4-14)

Mean square regression. k is the number of independ- ent variables.

(4-15)

F statistic used to test significance of overall regression model.

F = MSR

MSE

MSR = SSR

k

(4-16) Underlying model for multiple regression model.

(4-17) Multiple regression model computed from a sample.

(4-18)

Adjusted used in building multiple regression models.

r2

Adjusted r2 = 1 - SSE>1n - k - 12

SST>1n - 12 YN = b0 + b1X1 + b2X2 + Á + bkXk

Y = b0 + b1X1 + b2X2 + Á + bkXk + P

Solved Problems

Solved Problem 4-1 Judith Thompson runs a florist shop on the Gulf Coast of Texas, specializing in floral arrangements for weddings and other special events. She advertises weekly in the local newspapers and is considering in- creasing her advertising budget. Before doing so, she decides to evaluate the past effectiveness of these ads. Five weeks are sampled, and the advertising dollars and sales volume for each of these is shown in the following table. Develop a regression equation that would help Judith evaluate her advertising. Find the coefficient of determination for this model.

SALES ($1,000) ADVERTISING ($100)

11 5

6 3

10 7

6 2

12 8

Solution

SALES Y ADVERTISING X

11 5 (5 - 5)2 = 0 (5 - 5)(11 - 9) = 0

6 3 (3 - 5)2 = 4 (3 - 5)(6 - 9) = 6

10 7 (7 - 5)2 = 4 (7 - 5)(10 - 9) = 2

6 2 (2 - 5)2 = 9 (2 - 5)(6 - 9) = 9

12 8 (8 - 5)2 = 9 (8 - 5)(12 - 9) = 9

g Y � 45 g X � 25

� 9 � 5

X = 25>5Y = 45>5 g1X - X21Y - Y2 = 26g1X - X22 = 26

1X � X21Y � Y21X - X22

The regression equation is

YN = 4 + 1X

b0 = Y - b1X = 9 - 112152 = 4

b1 = ©1X - X21Y - Y2

©1X - X22 = 26

26 = 1

SOLVED PROBLEMS 139

To compute we use the following table:r2,

Y X

11 5 9 (11 - 9)2 = 4 (11 - 9)2 = 4

6 3 7 (6 - 7)2 = 1 (6 - 9)2 = 9

10 7 11 (10 - 11)2 = 1 (10 - 9)2 = 1

6 2 6 (6 - 6)2 = 0 (6 - 9)2 = 9

12 8 12 (12 - 12)2 = 0 (12 - 9)2 = 9

gY = 45 gX = 25

SSE SSTX = 5Y = 9

g1Y - Y22 = 32g1Y - YN 22 = 6

1Y - Y221Y - YN22YN � 4 � 1X

The slope tells us that for each 1 unit increase in X (or $100 in advertising), sales increase by 1 unit (or $1,000). Also, indicating that about 81% of the variability in sales can be explained by the regression model with advertising as the independent variable.

Solved Problem 4-2 Use Excel with the data in Solved Problem 4-1 to find the regression model. What does the F test say about this model?

Solution Program 4.6 provides the Excel output for this problem. We see the equation is

The coefficient of determination is shown to be 0.8125. The significance level for the F test is 0.0366, which is less than 0.05. This indicates the model is statistically significant. Thus, there is suffi- cient evidence in the data to conclude that the model is useful, and there is a relationship between X (advertising) and Y (sales).

1r22 YN = 4 + 1X

r2 = 0.8125 1b1 = 12

PROGRAM 4.6 Excel Output for Solved Problem 4-2

140 CHAPTER 4 • REGRESSION MODELS

Self-Test

� Before taking the self-test, refer to the learning objectives at the beginning of the chapter, the notes in the margins, and the glossary at the end of the chapter.

� Use the key at the back of the book to correct your answers. � Restudy pages that correspond to any questions that you answered incorrectly or material you feel uncertain about.

c. the coefficient of determination would be -1. d. the coefficient of determination would be 0.

8. When using dummy variables in a regression equation to model a qualitative or categorical variable, the number of dummy variables should equal to a. the number of categories. b. one more than the number of categories. c. one less than the number of categories. d. the number of other independent variables in the

model. 9. A multiple regression model differs from a simple linear

regression model because the multiple regression model has more than one a. independent variable. b. dependent variable. c. intercept. d. error.

10. The overall significance of a regression model is tested using an F test. The model is significant if a. the F value is low. b. the significance level of the F value is low. c. the value is low. d. the slope is lower than the intercept.

11. A new variable should not be added to a multiple regres- sion model if that variable causes a. to decrease. b. the adjusted to decrease. c. the SST to decrease. d. the intercept to decrease.

12. A good regression model should have a. a low and a low significance level for the F test. b. a high and a high significance level for the F test. c. a high and a low significance level for the F test. d. a low and a high significance level for the F test.r2

r2 r2

r2

r2 r2

r2

Discussion Questions and Problems

Discussion Questions 4-1 What is the meaning of least squares in a regression

model?

4-2 Discuss the use of dummy variables in regression analysis.

4-3 Discuss how the coefficient of determination and the coefficient of correlation are related and how they are used in regression analysis.

4-4 Explain how a scatter diagram can be used to iden- tify the type of regression to use.

4-5 Explain how the adjusted value is used in devel- oping a regression model.

4-6 Explain what information is provided by the F test.

4-7 What is the SSE? How is this related to the SST and the SSR?

4-8 Explain how a plot of the residuals can be used in developing a regression model.

r2

1. One of the assumptions in regression analysis is that a. the errors have a mean of 1. b. the errors have a mean of 0. c. the observations (Y) have a mean of 1. d. the observations (Y) have a mean of 0.

2. A graph of the sample points that will be used to develop a regression line is called a. a sample graph. b. a regression diagram. c. a scatter diagram. d. a regression plot.

3. When using regression, an error is also called a. an intercept. b. a prediction. c. a coefficient. d. a residual.

4. In a regression model, Y is called a. the independent variable. b. the dependent variable. c. the regression variable. d. the predictor variable.

5. A quantity that provides a measure of how far each sam- ple point is from the regression line is a. the SSR. b. the SSE. c. the SST. d. the MSR.

6. The percentage of the variation in the dependent variable that is explained by a regression equation is measured by a. the coefficient of correlation. b. the MSE. c. the coefficient of determination. d. the slope.

7. In a regression model, if every sample point is on the regression line (all errors are 0), then a. the correlation coefficient would be 0. b. the correlation coefficient would be -1 or 1.

DISCUSSION QUESTIONS AND PROBLEMS 141

Problems 4-9 John Smith has developed the following forecasting

model:

where

Demand for K10 air conditioners

the outside temperature

(a) Forecast the demand for K10 when the tempera- ture is 70°F.

(b) What is the demand for a temperature of 80°F? (c) What is the demand for a temperature of 90°F?

4-10 The operations manager of a musical instrument distributor feels that demand for bass drums may be related to the number of television appearances by the popular rock group Green Shades during the pre- ceding month. The manager has collected the data shown in the following table:

1°F2X1 = YN =

YN = 36 + 4.3X1

(a) Develop a regression model that could be used to predict the final average in the course based on the first test grade.

(b) Predict the final average of a student who made an 83 on the first test.

(c) Give the values of r and for this model. Inter- pret the value of in the context of this problem.

4-14 Using the data in Problem 4-13, test to see if there is a statistically significant relationship between the grade on the first test and the final average at the 0.05 level of significance. Use the formulas in this chapter and Appendix D.

4-15 Using computer software, find the least squares re- gression line for the data in Problem 4-13. Based on the F test, is there a statistically significant relation- ship between the first test grade and the final aver- age in the course?

4-16 Steve Caples, a real estate appraiser in Lake Charles, Louisiana, has developed a regression model to help appraise residential housing in the Lake Charles area. The model was developed using recently sold homes in a particular neighborhood. The price (Y) of the house is based on the square footage (X) of the house. The model is

The coefficient of correlation for the model is 0.63. (a) Use the model to predict the selling price of a

house that is 1,860 square feet. (b) A house with 1,860 square feet recently sold for

$95,000. Explain why this is not what the model predicted.

(c) If you were going to use multiple regression to develop an appraisal model, what other quantita- tive variables might be included in the model?

(d) What is the coefficient of determination for this model?

4-17 Accountants at the firm Walker and Walker believed that several traveling executives submit unusually high travel vouchers when they return from business trips. The accountants took a sample of 200 vouchers submitted from the past year; they then developed the following multiple regression equation relating expected travel cost (Y) to number of days on the road and distance traveled in miles:

YN = $90.00 + $48.50X1 + $0.40X2

1X221X12

YN = 13,473 + 37.65X

r2 r2

DEMAND FOR GREEN SHADES BASS DRUMS TV APPEARANCE

3 3

6 4

7 7

5 6

10 8

8 5

(a) Graph these data to see whether a linear equation might describe the relationship between the group’s television shows and bass drum sales.

(b) Using the equations presented in this chapter, compute the SST, SSE, and SSR. Find the least squares regression line for these data.

(c) What is your estimate for bass drum sales if the Green Shades performed on TV six times last month?

4-11 Using the data in Problem 4-10, test to see if there is a statistically significant relationship between sales and TV appearances at the 0.05 level of significance. Use the formulas in this chapter and Appendix D.

4-12 Using computer software, find the least squares regression line for the data in Problem 4-10. Based on the F test, is there a statistically significant rela- tionship between the demand for drums and the number of TV appearances?

4-13 Students in a management science class have just re- ceived their grades on the first test. The instructor has provided information about the first test grades in some previous classes as well as the final average for the same students. Some of these grades have been sampled and are as follows:

STUDENT 1 2 3 4 5 6 7 8 9

1st test grade 98 77 88 80 96 61 66 95 69

Final average 93 78 84 73 84 64 64 95 76

Note: means the problem may be solved with QM for Windows; means the problem may be

solved with Excel QM; and means the problem may be solved with QM for Windows and/or Excel QM.

142 CHAPTER 4 • REGRESSION MODELS

The coefficient of correlation computed was 0.68. (a) If Thomas Williams returns from a 300-mile trip

that took him out of town for five days, what is the expected amount that he should claim as ex- penses?

(b) Williams submitted a reimbursement request for $685; what should the accountant do?

(c) Comment on the validity of this model. Should any other variables be included? Which ones? Why?

4-18 Thirteen students entered the undergraduate busi- ness program at Rollins College 2 years ago. The following table indicates what their grade-point av- erages (GPAs) were after being in the program for 2 years and what each student scored on the SAT exam (maximum 2400) when he or she was in high school. Is there a meaningful relationship between grades and SAT scores? If a student scores a 1200 on the SAT, what do you think his or her GPA will be? What about a student who scores 2400?

(a) Plot these data and determine whether a linear model is reasonable.

(b) Develop a regression model. (c) What is expected ridership if 10 million tourists

visit the city? (d) If there are no tourists at all, explain the pre-

dicted ridership.

4-20 Use computer software to develop a regression model for the data in Problem 4-19. Explain what this output indicates about the usefulness of this model.

4-21 The following data give the starting salary for stu- dents who recently graduated from a local university and accepted jobs soon after graduation. The start- ing salary, grade-point average (GPA), and major (business or other) are provided.

STUDENT SAT SCORE GPA STUDENT SAT SCORE GPA

A 1263 2.90 H 1443 2.53

B 1131 2.93 I 2187 3.22

C 1755 3.00 J 1503 1.99

D 2070 3.45 K 1839 2.75

E 1824 3.66 L 2127 3.90

F 1170 2.88 M 1098 1.60

G 1245 2.15

4-19 Bus and subway ridership in Washington, D.C., dur- ing the summer months is believed to be heavily tied to the number of tourists visiting the city. During the past 12 years, the following data have been obtained:

NUMBER OF TOURISTS RIDERSHIP

YEAR (1,000,000s) (100,000s)

1 7 15

2 2 10

3 6 13

4 4 15

5 14 25

6 15 27

7 16 24

8 12 20

9 14 27

10 20 44

11 15 34

12 7 17

SALARY $29,500 $46,000 $39,800 $36,500

GPA 3.1 3.5 3.8 2.9

Major Other Business Business Other

SALARY $42,000 $31,500 $36,200

GPA 3.4 2.1 2.5

Major Business Other Business

(a) Using a computer, develop a regression model that could be used to predict starting salary based on GPA and major.

(b) Use this model to predict the starting salary for a business major with a GPA of 3.0.

(c) What does the model say about the starting salary for a business major compared to a non- business major?

(d) Do you believe this model is useful in predicting the starting salary? Justify your answer, using information provided in the computer output.

4-22 The following data give the selling price, square footage, number of bedrooms, and age of houses that have sold in a neighborhood in the past 6 months. Develop three regression models to predict the selling price based upon each of the other factors individually. Which of these is best?

SELLING SQUARE AGE PRICE($) FOOTAGE BEDROOMS (YEARS)

64,000 1,670 2 30

59,000 1,339 2 25

61,500 1,712 3 30

79,000 1,840 3 40

87,500 2,300 3 18

92,500 2,234 3 30

95,000 2,311 3 19

113,000 2,377 3 7

(Continued on next page)

DISCUSSION QUESTIONS AND PROBLEMS 143

4-23 Use the data in Problem 4-22 and develop a regres- sion model to predict selling price based on the square footage and number of bedrooms. Use this to predict the selling price of a 2,000-square-foot house with 3 bedrooms. Compare this model with the mod- els in Problem 4-22. Should the number of bed- rooms be included in the model? Why or why not?

4-24 Use the data in Problem 4-22 and develop a regres- sion model to predict selling price based on the square footage, number of bedrooms, and age. Use this to predict the selling price of a 10-year-old, 2,000-square-foot house with 3 bedrooms.

4-25 Tim Cooper plans to invest money in a mutual fund that is tied to one of the major market indices, either the S&P 500 or the Dow Jones Industrial Average. To obtain even more diversification, Tim has thought about investing in both of these. To determine whether investing in two funds would help, Tim decided to take 20 weeks of data and compare the two markets. The closing price for each index is shown in the table below:

4-26 The total expenses of a hospital are related to many factors. Two of these factors are the number of beds in the hospital and the number of admissions. Data were collected on 14 hospitals, as shown in the table below:

SELLING SQUARE AGE PRICE($) FOOTAGE BEDROOMS (YEARS)

115,000 2,736 4 10

138,000 2,500 3 1

142,500 2,500 4 3

144,000 2,479 3 3

145,000 2,400 3 1

147,500 3,124 4 0

144,000 2,500 3 2

155,500 4,062 4 10

165,000 2,854 3 3

WEEK 1 2 3 4 5 6 7

DJIA 10,226 10,473 10,452 10,442 10,471 10,213 10,187

S&P 1,107 1,141 1,135 1,139 1,142 1,108 1,110

WEEK 8 9 10 11 12 13 14

DJIA 10,240 10,596 10,584 10,619 10,628 10,593 10,488

S&P 1,121 1,157 1,145 1,144 1,146 1,143 1,131

WEEK 15 16 17 18 19 20

DJIA 10,568 10,601 10,459 10,410 10,325 10,278

S&P 1,142 1,140 1,122 1,108 1,096 1,089

Develop a regression model that would predict the DJIA based on the S&P 500 index. Based on this model, what would you expect the DJIA to be when the S&P is 1,100? What is the correlation coefficient (r) between the two markets?

NUMBER ADMISSIONS TOTAL EXPENSES HOSPITAL OF BEDs (100s) (MILLIONS)

1 215 77 57

2 336 160 127

3 520 230 157

4 135 43 24

5 35 9 14

6 210 155 93

7 140 53 45

8 90 6 6

9 410 159 99

10 50 18 12

11 65 16 11

12 42 29 15

13 110 28 21

14 305 98 63

Find the best regression model to predict the total expenses of a hospital. Discuss the accuracy of this model. Should both variables be included in the model? Why or why not?

4-27 A sample of 20 automobiles was taken, and the miles per gallon (MPG), horsepower, and total weight were recorded. Develop a linear regression model to predict MPG, using horsepower as the only independent variable. Develop another model with weight as the independent variable. Which of these two models is better? Explain.

MPG HORSEPOWER WEIGHT

44 67 1,844

44 50 1,998

40 62 1,752

37 69 1,980

37 66 1,797

34 63 2,199

35 90 2,404

32 99 2,611

30 63 3,236

28 91 2,606

26 94 2,580

26 88 2,507

(Continued on next page)

144 CHAPTER 4 • REGRESSION MODELS

4-28 Use the data in Problem 4-27 to develop a multiple linear regression model. How does this compare with each of the models in Problem 4-27?

4-29 Use the data in Problem 4-27 to find the best quad- ratic regression model. (There is more than one to consider.) How does this compare to the models in Problems 4-27 and 4-28?

4-30 A sample of nine public universities and nine private universities was taken. The total cost for the year (including room and board) and the median SAT score (maximum total is 2400) at each school were recorded. It was felt that schools with higher median SAT scores would have a better reputation and would charge more tuition as a result of that. The data is in the table below. Use regression to help an- swer the following questions based on this sample data. Do schools with higher SAT scores charge more in tuition and fees? Are private schools more expensive than public schools when SAT scores are

CATEGORY TOTAL COST ($) MEDIAN SAT

Public 21,700 1990

Public 15,600 1620

Public 16,900 1810

Public 15,400 1540

Public 23,100 1540

Public 21,400 1600

Public 16,500 1560

Public 23,500 1890

Public 20,200 1620

Private 30,400 1630

Private 41,500 1840

Private 36,100 1980

Private 42,100 1930

Private 27,100 2130

Private 34,800 2010

Private 32,100 1590

Private 31,800 1720

Private 32,100 1770

MPG HORSEPOWER WEIGHT

25 124 2,922

22 97 2,434

20 114 3,248

21 102 2,812

18 114 3,382

18 142 3,197

16 153 4,380

16 139 4,036

taken into consideration? Discuss how accurate you believe these results are using information related the regression models.

4-31 In 2008, the total payroll for the New York Yankees was $209.1 million, while the total payroll for the Tampa Bay Rays was about $43.8 million, or about one-fifth that of the Yankees. Many people have sug- gested that some teams are able to buy winning sea- sons and championships by spending a lot of money on the most talented players available. The table below lists the payrolls (in millions of dollars) for all 14 Major League Baseball teams in the American League as well as the total number of victories for each in the 2008 season:

TEAM PAYROLL

($MILLIONS) NUMBER

OF VICTORIES

New York Yankees 209.1 89

Detroit Tigers 138.7 74

Boston Red Sox 133.4 95

Chicago White Sox 121.2 89

Cleveland Indians 79.0 81

Baltimore Orioles 67.2 68

Oakland Athletics 48.0 75

Los Angeles Angels 119.2 100

Seattle Mariners 118.0 61

Toronto Blue Jays 98.6 86

Minnesota Twins 62.2 88

Kansas City Royals 58.2 75

Tampa Bay Rays 43.8 97

Texas Rangers 68.2 79

Develop a regression model to predict the total num- ber of victories based on the payroll of a team. Based on the results of the computer output, discuss how accurate this model is. Use the model to predict the number of victories for a team with a payroll of $79 million.

4-32 In 2009, the New York Yankees won 103 baseball games during the regular season. The table on the next page lists the number of victories (W), the earned- run-average (ERA), and the batting average (AVG) of each team in the American League. The ERA is one measure of the effectiveness of the pitching staff, and a lower number is better. The batting average is one measure of effectiveness of the hitters, and a higher number is better. (a) Develop a regression model that could be used to

predict the number of victories based on the ERA. (b) Develop a regression model that could be used to

predict the number of victories based on the bat- ting average.

CASE STUDY 145

(c) Which of the two models is better for predicting the number of victories?

(d) Develop a multiple regression model that in- cludes both ERA and batting average. How does this compare to the previous models?

4-33 The closing stock price for each of two stocks was recorded over a 12-month period. The closing price for the Dow Jones Industrial Average (DJIA) was also recorded over this same time period. These val- ues are shown in the following table:

(a) Develop a regression model to predict the price of stock 1 based on the Dow Jones Industrial Average.

(b) Develop a regression model to predict the price of stock 2 based on the Dow Jones Industrial Average.

(c) Which of the two stocks is most highly corre- lated to the Dow Jones Industrial Average over this time period?

Case Study

North–South Airline

In January 2008, Northern Airlines merged with Southeast Airlines to create the fourth largest U.S. carrier. The new North–South Airline inherited both an aging fleet of Boeing 727-300 aircraft and Stephen Ruth. Stephen was a tough for- mer Secretary of the Navy who stepped in as new president and chairman of the board.

Stephen’s first concern in creating a financially solid com- pany was maintenance costs. It was commonly surmised in the airline industry that maintenance costs rise with the age of the aircraft. He quickly noticed that historically there had been a significant difference in the reported B727-300 maintenance costs (from ATA Form 41s) both in the airframe and engine ar- eas between Northern Airlines and Southeast Airlines, with Southeast having the newer fleet.

On February 12, 2008, Peg Jones, vice president for opera- tions and maintenance, was called into Stephen’s office and asked to study the issue. Specifically, Stephen wanted to know whether the average fleet age was correlated to direct airframe maintenance costs, and whether there was a relationship be- tween average fleet age and direct engine maintenance costs.

Peg was to report back by February 26 with the answer, along with quantitative and graphical descriptions of the relationship.

Peg’s first step was to have her staff construct the average age of Northern and Southeast B727-300 fleets, by quarter, since the introduction of that aircraft to service by each airline in late 1993 and early 1994. The average age of each fleet was calculated by first multiplying the total number of calendar days each aircraft had been in service at the pertinent point in time by the average daily utilization of the respective fleet to total fleet hours flown. The total fleet hours flown was then divided by the number of aircraft in service at that time, giving the age of the “average” aircraft in the fleet.

The average utilization was found by taking the actual total fleet hours flown on September 30, 2007, from Northern and Southeast data, and dividing by the total days in service for all aircraft at that time. The average utilization for Southeast was 8.3 hours per day, and the average utilization for Northern was 8.7 hours per day. Because the available cost data were calcu- lated for each yearly period ending at the end of the first quar- ter, average fleet age was calculated at the same points in time.

TEAM W ERA AVG

New York Yankees 103 4.26 0.283

Los Angeles Angels 97 4.45 0.285

Boston Red Sox 95 4.35 0.270

Minnesota Twins 87 4.50 0.274

Texas Rangers 87 4.38 0.260

Detroit Tigers 86 4.29 0.260

Seattle Mariners 85 3.87 0.258

Tampa Bay Rays 84 4.33 0.263

Chicago White Sox 79 4.14 0.258

Toronto Blue Jays 75 4.47 0.266

Oakland Athletics 75 4.26 0.262

Cleveland Indians 65 5.06 0.264

Kansas City Royals 65 4.83 0.259

Baltimore Orioles 64 5.15 0.268

MONTH DJIA STOCK 1 STOCK 2

1 11,168 48.5 32.4

2 11,150 48.2 31.7

3 11,186 44.5 31.9

4 11,381 44.7 36.6

5 11,679 49.3 36.7

6 12,081 49.3 38.7

7 12,222 46.1 39.5

8 12,463 46.2 41.2

9 12,622 47.7 43.3

10 12,269 48.3 39.4

11 12,354 47.0 40.1

12 13,063 47.9 42.1

13 13,326 47.8 45.2

146 CHAPTER 4 • REGRESSION MODELS

The fleet data are shown in the following table. Airframe cost data and engine cost data are both shown paired with fleet aver- age age in that table.

Bibliography

Berenson, Mark L., David M. Levine, and Timothy C. Kriehbiel. Basic Busi- ness Statistics: Concepts and Applications, 11th ed. Upper Saddle River, NJ: Prentice Hall, 2009.

Black, Ken. Business Statistics: For Contemporary Decision Making, 6th ed. John Wiley & Sons, Inc., 2010.

Draper, Norman R., and Harry Smith. Applied Regression Analysis, 3rd ed. New York: John Wiley & Sons, Inc., 1998.

Kutner, Michael, John Neter, Chris J. Nachtsheim, and William Wasserman. Applied Linear Regression Models, 4th ed., Boston; New York: McGraw-Hill/Irwin, 2004.

Mendenhall, William, and Terry L. Sincich. A Second Course in Statistics: Re- gression Analysis, 6th ed., Upper Saddle River, NJ: Prentice Hall, 2004.

Discussion Question 1. Prepare Peg Jones’s response to Stephen Ruth.

Note: Dates and names of airlines and individuals have been changed in this case to maintain confidentiality. The data and issues described here are real.

North–South Airline Data for Boeing 727-300 Jets

NORTHERN AIRLINE DATA SOUTHEAST AIRLINE DATA

ENGINE AIRFRAME ENGINE AIRFRAME COST COST PER AVERAGE COST PER COST PER AVERAGE

YEAR PER AIRCRAFT($) AIRCRAFT($) AGE (HOURS) AIRCRAFT($) AIRCRAFT($) AGE (HOURS)

2001 51.80 43.49 6,512 13.29 18.86 5,107

2002 54.92 38.58 8,404 25.15 31.55 8,145

2003 69.70 51.48 11,077 32.18 40.43 7,360

2004 68.90 58.72 11,717 31.78 22.10 5,773

2005 63.72 45.47 13,275 25.34 19.69 7,150

2006 84.73 50.26 15,215 32.78 32.58 9,364

2007 78.74 79.60 18,390 35.56 38.07 8,259

Appendix 4.1 Formulas for Regression Calculations

When performing regression calculations by hand, there are other formulas that can make the task easier and are mathematically equivalent to the ones presented in the chapter. These, how- ever, make it more difficult to see the logic behind the formulas and to understand what the re- sults actually mean.

When using these formulas, it helps to set up a table with the columns shown in Table 4.7, which has the Triple A Construction Company data that was used earlier in the chapter. The sample size (n) is 6. The totals for all columns are shown, and the averages for X and Y are cal- culated. Once this is done, we can use the following formulas for computations in a simple lin- ear regression model (one independent variable). The simple linear regression equation is again given as

Slope of regression equation:

b1 = 180.5 - 6142172

106 - 61422 = 1.25

b1 = ©XY - nXY

©X2 - nX2

YN = b0 + b1X

APPENDIX 4.1 FORMULAS FOR REGRESSION CALCULATIONS 147

Y X Y2 X2 XY

6 3 62 � 36 32 � 9 3(6) � 18

8 4 82 � 64 42 � 16 4(8) � 32

9 6 92 � 81 62 �36 6(9) � 54

5 4 52 � 25 42 � 16 4(5) � 20

4.5 2 4.52 � 20.25 22 � 4 2(4.5) � 9

9.5 5 9.52 � 90.25 52 � 25 5(9.5) � 47.5

gY � 42 gX � 24 gY2 � 316.5 gX2 � 106 gXY � 180.5

X = 24>6 = 4Y = 42>6 = 7

TABLE 4.7 Preliminary Calculations for Triple A Construction

Intercept of regression equation:

Sum of squares of the error:

Estimate of the error variance:

Estimate of the error standard deviation:

Coefficient of determination:

This formula for the correlation coefficient automatically determines the sign of r. This could also be found by taking the square root of and giving it the same sign as the slope:

r = 61180.52 - 1242142223611062 - 2424361316.52 - 4224 = 0.833

r = n©XY - ©X©Y23n©X2 - 1©X2243n©Y2 - 1©Y224

r2

r2 = 1 - 6.875

316.5 - 61722 = 0.6944

r2 = 1 - SSE

©Y2 - nY 2

s = 21.71875 = 1.311

s = 1MSE

s2 = 6.875

6 - 2 = 1.71875

s2 = MSE = SSE

n - 2

SSE = 316.5 - 21422 - 1.251180.52 = 6.875

SSE = ©Y2 - b0©Y - b1©XY

b0 = 7 - 1.25142 = 2

b0 = Y - b1X

148 CHAPTER 4 • REGRESSION MODELS

Appendix 4.2 Regression Models Using QM for Windows

The use of QM for Windows to develop a regression model is very easy. We will use the Triple A Construction Company data to illustrate this. After starting QM for Windows, under Modules we select Forecasting. To enter the problem we select New and specify Least Squares—Simple and Multiple Regression, as illustrated in Program 4.7A. This opens the window shown in Program 4.7B. We enter the number of observations, which is 6 in this example. There is only 1 independent (X) variable. When OK is clicked, a window opens and the data is input as shown in Program 4.7C. After entering the data, click Solve, and the forecasting results are shown as in Program 4.7D. The equation as well as other information is provided on this screen. Additional output is available by clicking the Window option on the toolbar.

Recall that the MSE is an estimate of the error variance and the square root of this is the standard error of the estimate. The formula presented in the chapter and used in Excel is

where n is the sample size and k is the number of independent variables. This is an unbiased es- timate of In QM for Windows, the mean squared error is computed as

This is simply the average error and is a biased estimate of The standard error shown in Program 4.7D is not the square root of the MSE in the output, but rather is found using the de- nominator of If this standard error is squared, you get the MSE we saw earlier in the Ex- cel output.

n - 2.

�2.

MSE = SSE>n�2.

MSE = SSE>1n - k - 12 1�22,

PROGRAM 4.7A Initial Input Screen for QM for File—New—Least Squares–Simple and Multiple Regression

There is only one independent variable.

There are six pairs of observations in this sample.

PROGRAM 4.7B Second Input Screen for QM for Windows

APPENDIX 4.2 REGRESSION MODELS USING QM FOR WINDOWS 149

PROGRAM 4.7C Data Input for Triple A Construction Example

The regression equation is shown across two lines.

The standard error is the square root of SSE divided by n–2.

The MSE is the SSE divided by n.

PROGRAM 4.7D QM for Windows Output for Triple A Construction Data

PROGRAM 4.7E ANOVA Summary Output in QM for Windows

The F test was used to test a hypothesis about the overall effectiveness of the model. To see the ANOVA table, after the problem has been solved, select Window—ANOVA Summary, and the screen shown in Program 4.7E will be displayed.

150 CHAPTER 4 • REGRESSION MODELS

Appendix 4.3 Regression Analysis in Excel QM or Excel 2007

Excel QM Perhaps the easiest way to do regression analysis in Excel (either 2007 or 2010) is to use Excel QM, which is available on the companion website for this book. Once Excel QM has been installed as an add-in to Excel (see Appendix F at the end of the book for instructions on doing this), go to the Add-Ins tab and click Excel QM, as shown in Program 4.8A. When the menu appears, point the cursor at Forecasting, and the options will appear. Click on Multiple Regres- sion, as shown in Program 4.8A, for either simple or multiple regression models.

A window will open, as shown in Program 4.8B. Enter the number of past observations and the number of independent (X) variables. You can also enter a name or title for the problem. To enter the data for the Triple A Construction example in this chapter, enter 6 for the past periods (observations) and 1 for the number of independent variables. This will initialize the size of the spreadsheet, and the spreadsheet will appear as presented in Program 4.8C.

The shaded area under Y and x 1 will be empty, but the data are entered in this shaded area, and the calculations are automatically performed. In Program 4.8C, the intercept is 2 (the coeffi- cient in the Y column) and the slope is 1.25 (the coefficient in the x 1 column), resulting in the regression equation

which is the equation found earlier in this chapter.

Excel 2007 When doing regression in Excel (without the Excel QM add-in), the Data Analysis add-in is used in both Excel 2010 and Excel 2007. The steps and illustrations for Excel 2010 provided earlier in this chapter also apply to Excel 2007. However, the procedure to enable or activate this or any other Excel add-in varies, depending on which of the two versions of Excel is being used. See Appendix F at the end of this book for instructions for both Excel 2007 and Excel 2010.

Y = 2 + 1.25X

Go to the Add-In tab in Excel 2007 or Excel 2010. Click Excel QM.

Point the cursor at Forecasting. When options appear, click on Multiple Regression.

PROGRAM 4.8A Using Excel QM for Regression

APPENDIX 4.3 REGRESSION ANALYSIS IN EXCEL QM OR EXCEL 2007 151

Input a title.

Click OK.

Input the number of past observations.

Input the number of independent (X ) variables.

PROGRAM 4.8B Initializing the Spreadsheet in Excel QM

Enter the past observations of Y and X. Results appear automatically.

To forecast Y based on any value of X, simply input the value of X here.The intercepts and slope

are shown here.

The correlation coefficient is given here.

PROGRAM 4.8C Input and Results for Regression in Excel QM

This page intentionally left blank

  • CHAPTER 4 Regression Models
    • 4.1 Introduction
    • 4.2 Scatter Diagrams
    • 4.3 Simple Linear Regression
    • 4.4 Measuring the Fit of the Regression Model
      • Coefficient of Determination
      • Correlation Coefficient
    • 4.5 Using Computer Software for Regression
    • 4.6 Assumptions of the Regression Model
      • Estimating the Variance
    • 4.7 Testing the Model for Significance
      • Triple A Construction Example
      • The Analysis of Variance (ANOVA) Table
      • Triple A Construction ANOVA Example
    • 4.8 Multiple Regression Analysis
      • Evaluating the Multiple Regression Model
      • Jenny Wilson Realty Example
    • 4.9 Binary or Dummy Variables
    • 4.10 Model Building
    • 4.11 Nonlinear Regression
    • 4.12 Cautions and Pitfalls in Regression Analysis
    • Summary
    • Glossary
    • Key Equations
    • Solved Problems
    • Self-Test
    • Discussion Questions and Problems
    • Case Study: North–South Airline
    • Bibliography
    • Appendix 4.1 Formulas for Regression Calculations
    • Appendix 4.2 Regression Models Using QM for Windows
    • Appendix 4.3 Regression Analysis in Excel QM or Excel 2007