Wk5 DQ - Data Analysis and Business Intelligence


Multiple Regression Analysis

Chapter 14

Copyright ©2021 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.


In this chapter, we build on what we learned in the preceding chapter by using additional independent variables (x1, x2,…, and so on) that help us better explain or predict the dependent variable y. This is the more general situation. Multiple regression analysis can be used as either a descriptive or inferential technique.

Learning Objectives

LO14-1 Use multiple regression analysis to describe and interpret a relationship between several independent variables and a dependent variable

LO14-2 Evaluate how well a multiple regression equation fits the data

LO14-3 Test hypotheses about the relationships inferred by a multiple regression model

LO14-4 Evaluate the assumptions of multiple regression

LO14-5 Use and interpret a qualitative, dummy variable in multiple regression

LO14-6 Include and interpret an interaction effect in multiple regression analysis

LO14-7 Apply stepwise regression to develop a multiple regression model

LO14-8 Apply multiple regression techniques to develop a linear model

Multiple Regression Analysis

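The general form of a multiple regression equation is

ŷ = a + b1x1 + b2x2 + b3x3 + … + bkxk

a is the intercept, the value of ŷ when all x’s are zero

bj refers to the sample regression coefficients: the change in ŷ for a one-unit change in xj, holding the other independent variables constant

xk refers to the value of the various independent variables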

When there are two independent variables, the relationship can be graphically portrayed as a plane


There can be any number of independent variables. We cannot graph more than two independent variables, since graphs are limited to three dimensions. This chart shows the residuals as the difference between the actual y and the fitted ŷ on the plane.

Multiple Regression Analysis (2 of 2)

The least squares criterion is used to develop the regression equation

Example

Suppose the selling price of a home is directly related to the number of rooms and inversely related to its age. Let x1 refer to the number of rooms, x2 to the age of the home in years, and ŷ to the selling price of the home ($000)

ŷ = 21.2 + 18.7x1 – .25x2

ŷ = 21.2 + 18.7(7) – .25(30) = 144.6

So, a seven-room house that is 30 years old is expected to sell for $144,600

A statistical software package is needed to calculate the regression equation. The intercept, 21.2, is the estimated value ($21,200) when the number of rooms and the age are both 0, that is, a property without a house; 18.7 indicates that for each additional room, holding age constant, the selling price increases $18,700; and –.25 indicates that for each additional year of age, holding the number of rooms constant, the home decreases in value $250.
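The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the function name `predict_price` is an illustrative assumption, and the coefficients are the ones given on the slide.

```python
def predict_price(rooms, age):
    """Estimated selling price in $000 from the example equation."""
    # Coefficients from the slide: intercept 21.2, rooms 18.7, age -0.25
    return 21.2 + 18.7 * rooms - 0.25 * age


# A seven-room house that is 30 years old
price = predict_price(rooms=7, age=30)
print(round(price, 1))  # 144.6, i.e. $144,600
```

Changing `rooms` by 1 while holding `age` fixed moves the estimate by 18.7 ($18,700), which is the interpretation of a regression coefficient given in the notes.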

Multiple Regression Analysis Example

Salsberry Realty sells homes along the East Coast of the United States. One question frequently asked by prospective buyers is “How much can we expect to pay to heat the home in the winter?” The research department at Salsberry thinks three variables relate to heating costs: the mean daily outside temperature, the number of inches of insulation, and the age in years of the furnace. They select a random sample of 20 homes. Determine the regression equation.

y is the dependent variable

x1 is the outside temperature

x2 is inches of insulation

x3 is the age of the furnace

ŷ = a + b1x1 + b2x2 + b3x3

ŷ is used to estimate the value of y


Once the data has been collected and the variables have been defined, we are ready to use Excel to compute the statistics needed for the analysis. The Excel output is on the next slide.

Multiple Regression Analysis Example (2 of 2)

Recall:

y is the dependent variable

x1 is the outside temperature

x2 is inches of insulation

x3 is the age of the furnace

ŷ is the estimated value of y


Once we determine the regression equation, we can calculate the heating costs for January, given the mean outside temperature is 30 degrees, there are 5 inches of insulation, and the furnace is 10 years old.

ŷ = a + b1x1 + b2x2 + b3x3

ŷ = 427.194 – 4.583x1 – 14.831x2 + 6.101x3

ŷ = 427.194 – 4.583(30) – 14.831(5) + 6.101(10) = 276.56

Thus, the estimated heating costs for January are $276.56

The regression intercept, a, is labeled “intercept” in the Excel output. The regression coefficient for mean outside temperature is −4.583. This means if we increase temperature by 1 degree, and hold the other two independent variables constant, we can estimate a decrease of $4.58 in monthly heating costs. The −14.831 indicates that for every one inch increase in insulation, we can expect the cost to heat the home to decline $14.83 per month. And, for each additional year older the furnace is, we expect the cost to increase $6.10 per month.
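The January estimate and the "hold the other variables constant" reading of a coefficient can both be reproduced in a short sketch. The function and constant names are illustrative assumptions; the coefficients are those reported in the chapter's Excel output.

```python
# Coefficients as reported for the heating cost example
INTERCEPT = 427.194
B_TEMP, B_INSULATION, B_FURNACE_AGE = -4.583, -14.831, 6.101


def predict_heating_cost(temp, insulation, furnace_age):
    """Estimated monthly heating cost in dollars."""
    return (INTERCEPT
            + B_TEMP * temp
            + B_INSULATION * insulation
            + B_FURNACE_AGE * furnace_age)


january = predict_heating_cost(temp=30, insulation=5, furnace_age=10)
print(round(january, 2))  # 276.56

# One more degree, other variables unchanged: the estimate moves by b1
one_degree = predict_heating_cost(31, 5, 10) - january
print(round(one_degree, 3))  # -4.583
```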

ANOVA Table

An ANOVA table summarizes the multiple regression analysis

It reports the total amount of the variation divided into two components

The regression, the variation in y explained by all the independent variables

The residual or error, the unexplained variation of y

It reports the degrees of freedom of the independent variables, the error variation, and the total variation


We can use the information in an ANOVA table to evaluate how well the equation fits the data. The SS column lists the sum of squares for each source of variation; SSR is the Regression Sum of Squares; SSE is the Residual or Error Sum of Squares. We’ll use the ANOVA table information from the previous example to evaluate the regression equation estimating January heating costs on the next slide.

Measures of Effectiveness

There are two measures of effectiveness of the regression equation

The multiple standard error of the estimate is similar to the standard deviation

It is measured in the same units as the dependent variable

It is based on squared deviations between the observed and predicted values of the dependent variable

It ranges from 0 to plus infinity

It is calculated from the following equation

s_y.123…k = √( Σ(y – ŷ)² / [n – (k + 1)] ) = √( SSE / [n – (k + 1)] )

ANOVA Table (2 of 2)


ŷ = 427.194 – 4.583x1 – 14.831x2 + 6.101x3

ŷ = 427.194 – 4.583(35) – 14.831(3) + 6.101(6) = $258.90

Then, (y – ŷ)² = (250 – 258.90)² = (–8.90)² = 79.21


Multiple Standard Error of the estimate

We begin with the multiple standard error of estimate; recall that the standard error of estimate is comparable to the standard deviation. We first refer to the home in row 2. The actual heating cost, y, is $250. Using the regression equation, the estimated cost, ŷ, is $258.90. If we perform these calculations for all the homes, square the differences, and sum the squares, we get the SSE, the residual or error sum of squares. The SSE is used in formula 14-2 to find the multiple standard error of estimate. In formula 14-2, y is the actual observation, ŷ is the estimated value, n is the number of observations in the sample, and k is the number of independent variables. SSE is the residual sum of squares from the ANOVA table. For the heating cost example, the multiple standard error of estimate is $51.05; this is the typical error when we use this equation to predict the cost. A smaller multiple standard error indicates a better predictive equation.
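The steps described in the notes (residual, square, sum, divide by n – (k + 1), take the square root) can be sketched as follows. The row 2 numbers come from the slide; the short `toy_actual` and `toy_fitted` lists are hypothetical values made up for illustration and are not the chapter's 20 homes.

```python
import math


def multiple_standard_error(actual, fitted, k):
    """Square root of SSE / [n - (k + 1)] for k independent variables."""
    sse = sum((y - y_hat) ** 2 for y, y_hat in zip(actual, fitted))
    n = len(actual)
    return math.sqrt(sse / (n - (k + 1)))


# Home in row 2: actual cost $250, fitted cost $258.90
row2_squared_residual = (250 - 258.90) ** 2
print(round(row2_squared_residual, 2))  # 79.21

# Hypothetical toy data (NOT the chapter's sample), with k = 2
toy_actual = [10, 12, 15, 20, 22]
toy_fitted = [11, 12, 14, 19, 23]
toy_s = multiple_standard_error(toy_actual, toy_fitted, k=2)
print(round(toy_s, 4))  # 1.4142
```

In the toy data SSE is 4 and n – (k + 1) is 2, so the result is the square root of 2.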

Measures of Effectiveness (2 of 3)

The coefficient of multiple determination

Is symbolized by R²

Can range from 0 to 1

Cannot assume negative values

Is easy to interpret

It is found by the following formula

R² = SSR / SS total = .804

80.4% of the variation is explained by the 3 independent variables.

COEFFICIENT OF MULTIPLE DETERMINATION The percent of variation in the dependent variable, y, explained by the set of independent variables, x1, x2, x3, …xk.


The coefficient of multiple determination reports the percent of variation in the dependent variable explained by the variation in the set of independent variables. The coefficient of multiple determination is based on the squared deviations from the regression equation. Use the regression and the total sum of squares from the ANOVA table to compute the coefficient of determination, .804. So 80.4% of the variation in heating cost is explained by the three independent variables. The remaining 19.6% is random error and variation from independent variables not included in the equation.

Measures of Effectiveness (3 of 3)

When the number of independent variables is large, we adjust the coefficient of determination for the degrees of freedom as follows

R²adj = 1 – (SSE / [n – (k + 1)]) / (SS total / [n – 1])

For the cost of heating example, the adjusted coefficient of determination is .767

If we compare R² (.804) to the adjusted R² (.767), the difference in this case is small


The coefficient of determination tends to increase as more independent variables are added to the model. To balance the effect that the number of independent variables has on the coefficient of multiple determination, statistical software packages use an adjusted coefficient of determination.
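The adjustment can be written in terms of R² itself, because dividing SSE and SS total by SS total gives 1 – R² in the numerator. A minimal sketch, with `adjusted_r2` as an illustrative name and the heating example's values (R² = .804, n = 20, k = 3) as inputs:

```python
def adjusted_r2(r2, n, k):
    """Adjusted coefficient of determination.

    Equivalent to 1 - (SSE / [n - (k + 1)]) / (SS total / [n - 1]),
    since SSE / SS total = 1 - R^2.
    """
    return 1 - (1 - r2) * (n - 1) / (n - (k + 1))


print(round(adjusted_r2(0.804, n=20, k=3), 3))  # 0.767
```

Adding an independent variable raises k, which shrinks n – (k + 1) and so penalizes the adjusted value unless R² rises enough to compensate.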

Global Test

A global test investigates whether it is possible that all the independent variables have zero regression coefficients

The hypotheses are

H0: β1 = β2 = β3 = 0

H1: Not all βi’s are 0

The test statistic follows the F distribution

There is a family of F distributions

It cannot be negative

It is continuous

It is positively skewed

It is asymptotic


A global test investigates whether any of the independent variables has a regression coefficient that differs significantly from zero. We’ll use Greek letters to represent the population parameters; β1, β2, and β3 are the population regression coefficients and b1, b2, and b3 are the sample regression coefficients. If the hypothesis test fails to reject the null hypothesis, we cannot conclude that any of the regression coefficients differs from zero, which implies the independent variables are of no value in estimating the dependent variable (heating cost).

Global Test (2 of 4)

The formula to calculate the value of the test statistic is

F = (SSR / k) / (SSE / [n – (k + 1)])

with k (the number of independent variables) degrees of freedom in the numerator

n – (k+1) degrees of freedom in the denominator

n is sample size

We can obtain the degrees of freedom from the ANOVA table


To find the critical value of F we need three pieces of information: the numerator degrees of freedom, the denominator degrees of freedom, and the significance level. We conduct the test of hypothesis on the next slide.

Global Test (3 of 4)

Step 1: State the null and the alternate hypothesis

H0: β1 = β2 = β3 = 0

H1: Not all βi’s are 0

Step 2: Select the level of significance; we’ll use .05

Step 3: Select the test statistic, F

Step 4: Formulate the decision rule, reject H0 if F > 3.24

We use Appendix B.6 to find the critical value, using .05 as the significance level, move horizontally to 3 degrees of freedom in the numerator, then down to 16 degrees of freedom in the denominator and read the critical value of 3.24.

Global Test (4 of 4)

Step 5: Make the decision; F = 21.90 exceeds the critical value of 3.24, so reject H0

The global test assures us that outside temperature, the amount of insulation, or the age of the furnace has a bearing on heating cost!

Step 6: Interpret; at least one of the independent variables has the ability to explain the variation in heating cost.
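The six steps can be traced in a short sketch. Because the ANOVA sums of squares are not reproduced in this text, the sketch below makes an assumption: it recovers F from the rounded R² of .804 using the identity F = (R²/k) / ((1 – R²)/[n – (k + 1)]), which follows from dividing SSR and SSE by SS total. The result is therefore close to, but not exactly, the reported 21.90. The critical value 3.24 is the table value quoted on the slide.

```python
def f_from_r2(r2, n, k):
    """Global-test F statistic expressed through R^2."""
    return (r2 / k) / ((1 - r2) / (n - (k + 1)))


n, k = 20, 3
f_stat = f_from_r2(0.804, n, k)   # approximate, since R^2 is rounded
critical_value = 3.24             # F table value for 3 and 16 df at .05

reject_h0 = f_stat > critical_value
print(round(f_stat, 1), reject_h0)  # 21.9 True
```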

Test for Individual Variables


The test for individual variables determines which independent variables have regression coefficients that differ significantly from zero

Variables whose regression coefficients do not differ significantly from zero are usually dropped from the analysis

The test statistic follows the t distribution with n – (k + 1) degrees of freedom

The formula to calculate the value of the test statistic for the individual test is

t = (bi – 0) / s_bi

where bi is the sample regression coefficient and s_bi is its standard error

If an independent variable has β = 0, it is of no value in explaining any variation in the dependent variable and we may want to eliminate it from the regression equation. The sample symbol is b and the population symbol is β.

Evaluating Individual Regression Coefficients Example

Salsberry Realty will use three sets of hypotheses: one for temperature, one for insulation, and one for age of the furnace.

Step 1: State the null and alternate hypothesis

For temperature

H0: β1 = 0

H1: β1 ≠ 0

For insulation

H0: β2 = 0

H1: β2 ≠ 0

For furnace age

H0: β3 = 0

H1: β3 ≠ 0

Step 2: Select the level of significance; we use .05

Step 3: Select the test statistic; we’ll use t

Step 4: Formulate the decision rule, reject H0 if t < −2.120 or > 2.120

Step 5: Make decision; reject H0 for temperature and insulation but not furnace age

Step 6: Interpret; furnace age is not a significant predictor of heating costs

t = (bi – 0) / s_bi, computed separately for temperature (b1), insulation (b2), and furnace age (b3)


We conduct a hypothesis test and conclude the analysis should focus on temperature and insulation as predictors of heating costs. We need to drop furnace age and rerun the regression equation.

Evaluating Individual Regression Coefficients Example (2 of 2)

Salsberry Realty will rerun the regression equation using temperature and insulation.

ŷ = 490.286 – 5.150x1 – 14.718x2

For the global test, the hypotheses are as follows; reject the null hypothesis if F > 3.59

H0: β1 = β2 = 0

H1: Not all βi’s are 0

F = (SSR / k) / (SSE / [n – (k + 1)])

For temperature

H0: β1 = 0

H1: β1 ≠ 0

For insulation

H0: β2 = 0

H1: β2 ≠ 0

Step 2: Select the level of significance; we use .05

Step 3: Select the test statistic; we’ll use t

Step 4: Formulate the decision rule, reject H0 if t < −2.110 or > 2.110

Step 5: Make decision; reject H0 for temperature and insulation

Step 6: Interpret; temperature and insulation are significant predictors of heating costs

t = (bi – 0) / s_bi, computed separately for temperature (b1) and insulation (b2)


We first conduct a global test using the F statistic and .05 as the level of significance. There are 2 degrees of freedom in the numerator and 17 in the denominator. The null hypothesis is rejected and we conclude at least one of the regression coefficients is different from 0. Next, we test the regression coefficients individually. The critical value of t is found in Appendix B.5 with a level of significance of .05 and n − (k + 1) = 20 − (2 + 1) = 17 degrees of freedom; it is 2.110. In both tests, we reject the null hypothesis; outside temperature and amount of insulation are useful variables in explaining the variation in heating costs.

Multiple Regression Assumptions

Multiple regression analysis rests on five assumptions

There is a linear relationship

The variation in the residuals is the same for both large and small values of ŷ

The residuals follow the normal distribution

The independent variables should not be correlated

The residuals are independent

Next, we’ll provide a brief discussion of each of these assumptions.


The validity of the statistical global and individual tests relies on several assumptions. If the assumptions are not true, the results might be misleading. Even though strict adherence to the assumptions is not always possible, our estimates using a multiple regression equation will be closer than any that could be made otherwise.

Linear Relationship Assumption

The relationship between the dependent variable and the set of independent variables must be linear

To verify this assumption, develop a scatter diagram and plot the dependent variable on the vertical axis and the independent variable on the horizontal axis

The plots below indicate a fairly strong negative relationship between temperature and heating cost and a negative relationship between insulation and heating cost


These graphs help us to visualize the relationships and provide some initial information about the direction (positive or negative), linearity, and strength of the relationship.

Linear Relationship Assumption Continued

To verify the linearity assumption, we can plot the residuals to evaluate the linearity of the multiple regression equation

The residuals are plotted on the vertical axis against the fitted values ŷ on the horizontal axis; they should be centered around zero


Recall that the relationship between the dependent variable and the set of independent variables must be linear. The plot on the left shows the residual plot for the home heating cost example. The points are scattered, there is no obvious pattern, and there are both positive and negative values; therefore, there is no reason to doubt the linearity assumption. The points in the plot on the right follow a nonrandom pattern, which shows the relationship is probably nonlinear.

Variation Assumption

Variation is the same for both large and small values of ŷ

This condition is checked by developing a scatter diagram with the residuals on the vertical axis and the fitted values on the horizontal axis

If there is not a pattern to the plots—that is, they appear random—the residuals meet the homoscedasticity requirement

We did this on the previous slide and, based on the scatter diagram, it is reasonable to conclude that this assumption has not been violated

HOMOSCEDASTICITY The variation around the regression equation is the same for all of the values of the independent variables.


The requirement for constant variation around the regression line is called homoscedasticity. An example that might violate this assumption: we suspect that as age increases, so does income, but there will likely be more variation in incomes of 50-year-olds than for 35-year-olds.

Normal Probability Assumption

The residuals follow the normal probability distribution

This condition is checked by developing a histogram of the residuals or a normal probability plot

The mean of the distribution of the residuals is 0

If the plotted points are fairly close to the straight line drawn from lower left to upper right, the normal probability assumption is supported


Although it is difficult to show that the residuals follow a normal distribution with only 20 observations, it does appear the normality assumption is reasonable from both the histogram and the normal probability plot. The normal probability plot is often included in statistical software.

Variables Not Correlated Assumption

The independent variables are assumed not to be correlated

A correlation matrix will show all possible correlations among independent variables

Signs of trouble are correlations > 0.70 or < −0.70

Signs of correlated independent variables

an important predictor variable is found insignificant

an obvious reversal occurs in the sign of one or more of the regression coefficients

when a variable is added to or removed from the solution, there is a large change in the regression coefficients

The variance inflation factor (VIF) is used to identify correlated independent variables


Multicollinearity exists when independent variables are correlated, which makes it difficult to make inferences about the individual regression coefficients. Correlated independent variables may also lead to erroneous results in the hypothesis tests for the individual independent variables. A VIF greater than 10 is considered unsatisfactory, indicating the independent variable should be removed from the analysis.
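The VIF rule of thumb can be sketched in a few lines. The standard definition is VIF = 1 / (1 – Rj²), where Rj² is the coefficient of determination obtained by regressing the jth independent variable on the others. The Rj² values below are hypothetical, chosen only to show one variable crossing the threshold of 10; they are not the chapter's results.

```python
def vif(r2_j):
    """Variance inflation factor for one independent variable."""
    if not 0 <= r2_j < 1:
        raise ValueError("r2_j must be in [0, 1)")
    return 1 / (1 - r2_j)


def flag_high_vif(r2_by_variable, threshold=10):
    """Names of variables whose VIF exceeds the threshold."""
    return [name for name, r2_j in r2_by_variable.items()
            if vif(r2_j) > threshold]


# Hypothetical R_j^2 values for three independent variables
hypothetical_r2 = {"x1": 0.20, "x2": 0.50, "x3": 0.95}
print(flag_high_vif(hypothetical_r2))  # ['x3']
```

A VIF of 10 corresponds to Rj² = .90, so the threshold flags a variable only when 90% or more of its variation is explained by the other independent variables.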

Variables Not Correlated Assumption Example

Refer to table 14-1, which relates the heating cost to the independent variables: outside temperature, amount of insulation, and age of furnace. Develop a correlation matrix for all the independent variables. Does it appear there is a problem with multicollinearity? Find and interpret the VIF for each of the independent variables.

Because all of the correlations are between −.70 and .70, we do not suspect problems with multicollinearity.

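VIF = 1 / (1 – Rj²), computed separately for temperature, insulation, and age of furnace, where Rj² is the coefficient of determination from regressing the jth independent variable on the other independent variables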

All VIFs < 10, no multicollinearity


A correlation matrix shows the correlation between all pairs of the variables. The highlighted area indicates the correlation among the independent variables. Then we calculated VIFs for temperature, insulation, and age of furnace.

Independent Observations Assumption

Each residual is independent of other residuals

Autocorrelation occurs when successive residuals are correlated

When autocorrelation exists, the value of the standard error will be biased and will return poor results for tests of hypothesis regarding the regression coefficients


Autocorrelation frequently occurs when data are collected over a period of time. A scatter plot such as this would indicate possible autocorrelation. There is a test for autocorrelation, called the Durbin-Watson. It is discussed in chapter 18.

Techniques to Build a Regression Model

Several techniques help build a regression model

A dummy or qualitative independent variable can assume one of two possible outcomes, a 1 or a 0

Use formula (14-6) to determine if the dummy variable should remain in the equation

Example

Suppose we are interested in estimating an executive’s salary on the basis of years of experience and whether he or she graduated college. Graduation will be a yes or no.

DUMMY VARIABLE A variable in which there are only two possible outcomes. For analysis, one of the outcomes is coded a 1 and the other a 0.


Qualitative variables describe a particular quality or attribute, such as gender measured as male or female.

Dummy Variable Example

Suppose in the Salsberry Realty example that the independent variable garage is added. For homes without a garage, 0 is used; for homes with an attached garage, 1 is used. Garage will be variable x4. What is the effect of the garage variable?

Suppose we have two houses exactly alike in Buffalo, New York. One has an attached garage (1) and the other does not (0). Both have 3 inches of insulation and the temperature is 20 degrees.


Should the garage variable be included?

Dummy Variable Example (2 of 3)

Suppose we have two houses exactly alike in Buffalo, New York. One has an attached garage (1) and the other does not (0). Both have 3 inches of insulation and the temperature is 20 degrees.


For the house without the attached garage:

ŷ = 393.666 – 3.963x1 – 11.334x2 + 77.432x4

ŷ = 393.666 – 3.963(20) – 11.334(3) + 77.432(0) = 280.404

For the house with the attached garage:

ŷ = 393.666 – 3.963x1 – 11.334x2 + 77.432x4

ŷ = 393.666 – 3.963(20) – 11.334(3) + 77.432(1) = 357.836

Should the garage variable be included? We begin by evaluating the regression equation for each house. The estimates differ by 77.432, so it costs about $77.43 more to heat a home with an attached garage. Is the difference significant? We conduct a hypothesis test. Use Appendix B.5 to find the critical value: there are n − (k + 1) = 20 − (3 + 1) = 16 degrees of freedom, it is a two-tailed test, and we use .05 as the significance level, so the critical value is 2.120. The computed t is 3.399, so the null hypothesis is rejected. The independent variable garage should be included in the analysis.
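The effect of the dummy variable is easy to see in code: the two predictions differ by exactly the garage coefficient. This is a sketch with an illustrative function name; the coefficients are the ones shown on the slide.

```python
def heating_cost(temp, insulation, garage):
    """Estimated heating cost; garage is a dummy (1 = attached, 0 = none)."""
    return 393.666 - 3.963 * temp - 11.334 * insulation + 77.432 * garage


without_garage = heating_cost(temp=20, insulation=3, garage=0)
with_garage = heating_cost(temp=20, insulation=3, garage=1)

print(round(without_garage, 3))                # 280.404
print(round(with_garage, 3))                   # 357.836
print(round(with_garage - without_garage, 3))  # 77.432
```

Because the dummy takes only the values 0 and 1, its coefficient shifts the intercept for homes with a garage and leaves the temperature and insulation slopes unchanged.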

Dummy Variable Example (3 of 3)

t = (b4 – 0) / s_b4 = 3.399

Step 1: State the null and the alternate hypothesis

H0: β4 = 0

H1: β4 ≠ 0

Step 2: Select the level of significance, .05

Step 3: Select the test statistic, t

Step 4: Formulate the decision rule, reject H0 if t < -2.120 or > 2.120

Step 5: Make decision, reject H0, t= 3.399

Step 6: Interpret, the variable garage should be included in the analysis
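The decision rule in Step 4 amounts to comparing the absolute value of the computed t with the critical value. A minimal sketch, with `reject_null_two_tailed` as an illustrative name and the values 3.399 and 2.120 taken from the steps above:

```python
def reject_null_two_tailed(t_stat, critical_value):
    """True when t falls in either rejection region of a two-tailed test."""
    return abs(t_stat) > critical_value


# Garage coefficient: computed t = 3.399, critical value 2.120 (16 df, .05)
print(reject_null_two_tailed(3.399, 2.120))  # True
```

The same function applies to any individual coefficient test in this chapter; only the computed t and the degrees of freedom behind the critical value change.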


Interaction Technique


Interaction is the case in which one independent variable (such as x2) affects the relationship between another independent variable (x1) and the dependent variable (y)

In regression analysis, interaction is examined as a separate independent variable; we can multiply the data values of one independent variable by the values of another independent variable

ŷ = a + b1x1 + b2x2 + b3x1x2

The term x1x2 is the interaction term

Now develop a regression equation with the three variables and test the significance of the third

In this section, we include and interpret an interaction effect in a multiple regression analysis.

Interaction Technique Example

Refer to the heating cost example and the data in table 14-1. Is there an interaction between the outside temperature and the amount of insulation? If both variables are increased, is the effect on heating cost greater than the sum of savings from warmer temperature and the savings from increased insulation separately?

ŷ = 598.070 – 7.811x1 – 30.161x2 + 0.385x1x2


We create the interaction variable by multiplying the value of temperature by the value of insulation for each observation in the data set. The results are in column D. We find the critical value using 16 degrees of freedom and .05 significance level in Appendix B.5, it’s 2.120. The computed t value is 1.324 so we do not reject the null hypothesis.
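The mechanics of the interaction term can be sketched as follows. The coefficients are those in the equation above; the function names and the input values (20 degrees, 3 and 10 inches of insulation) are illustrative assumptions. Since the test does not find the interaction significant, this only shows how such a term behaves, not a recommended model.

```python
def heating_cost_with_interaction(temp, insulation):
    """Estimated cost from the equation that includes the x1*x2 term."""
    interaction = temp * insulation  # the new column: x1 times x2
    return (598.070 - 7.811 * temp - 30.161 * insulation
            + 0.385 * interaction)


def temperature_slope(insulation):
    """Change in estimated cost per degree, which depends on insulation."""
    return -7.811 + 0.385 * insulation


print(round(heating_cost_with_interaction(20, 3), 3))  # 374.467
print(round(temperature_slope(3), 3))                  # -6.656
print(round(temperature_slope(10), 3))                 # -3.961
```

With an interaction term, the effect of one more degree is no longer a single number: it is b1 + b3 times the insulation value.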

Interaction Technique Example Continued

It is possible to have a three-way interaction

It is possible to have an interaction where one variable is nominal scale

Step 1: State the null and the alternate hypothesis

H0: βx1x2 = 0

H1: βx1x2 ≠ 0

Step 2: Select the level of significance, .05

Step 3: Select the test statistic, t

Step 4: Formulate the decision rule; reject H0 if t < –2.120 or t > 2.120

Step 5: Make decision, do not reject H0

Step 6: Interpret, there is not a significant interaction between temperature and insulation.

t = (b3 – 0) / s_b3 = 1.324, where b3 is the coefficient of the interaction term x1x2

Stepwise Regression


Advantages to the stepwise method

Only independent variables with significant regression coefficients are entered into the equation

The steps involved are clear

It is efficient

The changes in the multiple standard error of estimate and the coefficient of determination are shown

STEPWISE REGRESSION A step-by-step method to determine a regression equation that begins with a single independent variable and adds or deletes independent variables one by one. Only independent variables with nonzero regression coefficients are included in the regression equation.


Stepwise regression is a step-by-step process to find the regression equation. Independent variables are added one at a time to the regression equation. This is a more direct approach to building a regression equation.

Stepwise Regression Technique

The stepwise procedure selects the independent variable temperature first. Temperature explains 65.85% of the variation in heating cost.

ŷ = 388.8 – 4.93x1

The next independent variable selected is garage. Now the coefficient of determination is 80.46%.

ŷ = 300.3 – 3.56x1 + 93.0x2

Next, the procedure selects insulation and stops. At this point 86.98% of the variation is explained.

ŷ = 393.7 – 3.96x1 + 77.0x2 – 11.3x3

In these equations the variables are numbered in order of entry: x1 is temperature, x2 is garage, and x3 is insulation

This is the same regression equation we developed before!

Usually the regression coefficients will change from one step to the next. We stop after the insulation variable is added. The independent variable age does not add significantly to the coefficient of determination.
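One way to read the stepwise output is to look at how much each newly entered variable adds to the coefficient of determination. The sketch below uses only the three percentages reported on the slide; it does not perform the stepwise selection itself, which requires the data and statistical software.

```python
# Coefficient of determination (%) after each step, as reported above
steps = [("temperature", 65.85), ("garage", 80.46), ("insulation", 86.98)]

previous = 0.0
gains = []
for name, r2_percent in steps:
    # Gain in explained variation attributable to entering this variable
    gains.append((name, round(r2_percent - previous, 2)))
    previous = r2_percent

print(gains)
# [('temperature', 65.85), ('garage', 14.61), ('insulation', 6.52)]
```

The shrinking gains are consistent with the procedure stopping once a further variable (here, furnace age) would not add significantly.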

Stepwise Regression Technique Continued

Chapter 14 Practice Problems


Question 3


A consulting group was hired by the Human Resources Department at General Mills, Inc. to survey company employees regarding their degree of satisfaction with their quality of life. A special index, called the index of satisfaction, was used to measure satisfaction. Six factors were studied, namely, age at the time of first marriage (x1), annual income (x2), number of children living (x3), value of all assets (x4), status of health in the form of an index (x5), and the average number of social activities per week—such as bowling and dancing (x6). Suppose the multiple regression equation is:

ŷ = 16.24 + 0.017x1 + 0.0028x2 + 42x3 + 0.0012x4 + 0.19x5 + 26.8x6

What is the estimated index of satisfaction for a person who first married at 18, has an annual income of $26,500, has three children living, has assets of $156,000, has an index of health status of 141, and has 2.5 social activities a week on the average?

Which would add more to satisfaction, an additional income of $10,000 a year or two more social activities a week?

LO14-1

Question 5


Consider the ANOVA table that follows.

Determine the standard error of estimate. About 95% of the residuals will be between what two values?

Determine the coefficient of multiple determination. Interpret this value.

Determine the coefficient of multiple determination, adjusted for the degrees of freedom.

LO14-2

Question 7


Given the following regression output,

answer the following questions:

Write the regression equation.

If x1 is 4 and x2 is 11, what is the expected or predicted value of the dependent variable?

How large is the sample? How many independent variables are there?

Conduct a global test of hypothesis to see if any of the set of regression coefficients could be different from 0. Use the .05 significance level. What is your conclusion?

Conduct a test of hypothesis for each independent variable. Use the .05 significance level. Which variable would you consider eliminating?

Outline a strategy for deleting independent variables in this case.

LO14-3

Question 21


A video media consultant collected the following data on popular LED televisions sold through online retailers.

LO14-1, 2, 3, 4, 5, 8

Question 21 continued

Does there appear to be a linear relationship between the screen size and the price?

Which variable is the “dependent” variable?

Using statistical software, determine the regression equation. Interpret the value of the slope in the regression equation.

Include the manufacturer in a multiple linear regression analysis using a “dummy” variable. Does it appear that some manufacturers can command a premium price? Hint: You will need to use a set of dummy variables.

Test each of the individual coefficients to see if they are significant.

Make a plot of the residuals and comment on whether they appear to follow a normal distribution.

Plot the residuals versus the fitted values. Do they seem to have the same amount of variation?

LO14-1, 2, 3, 4, 5, 8