Module 4 - Computer Discussion

Camm_4e_Ch07_PPT.pptx

Business Analytics

Linear Regression

Chapter 7

Introduction (Slide 1 of 2)

Managerial decisions are often based on the relationship between two or more variables:

Example: After considering the relationship between advertising expenditures and sales, a marketing manager might attempt to predict sales for a given level of advertising expenditures.

Sometimes a manager will rely on intuition to judge how two variables are related.

If data can be obtained, a statistical procedure called regression analysis can be used to develop an equation showing how the variables are related.

Introduction (Slide 2 of 2)

Dependent variable or response: Variable being predicted.

Independent variables or predictor variables: Variables being used to predict the value of the dependent variable.

Simple linear regression: A regression analysis involving one independent variable, x, and one dependent variable, y, in which any one-unit change in x is assumed to result in the same change in y.

Multiple linear regression: A regression analysis involving two or more independent variables.

Simple Linear Regression Model

Regression Model

Estimated Regression Equation

Simple Linear Regression Model (Slide 1 of 5)

Regression Model:

The equation that describes how y is related to x and an error term: y = β0 + β1x + ε, where β0 and β1 are the parameters of the model and ε is the error term.

The error term accounts for the variability in y that cannot be explained by the linear relationship between x and y.

In the Butler Trucking Company example, the population consists of all the driving assignments that can be made by the company.

For every driving assignment in the population, there is a value of x (miles traveled) and a corresponding value of y (travel time in hours).

Simple Linear Regression Model (Slide 2 of 5)

Estimated Regression Equation:

The parameter values are usually not known and must be estimated using sample data.

Substituting the sample statistics b0 and b1 for the parameters β0 and β1 in the regression equation and dropping the error term, we obtain the estimated regression equation for simple linear regression: ŷ = b0 + b1x.

To estimate the mean or expected value of travel time for a driving assignment of 75 miles, Butler Trucking would substitute the value of 75 for x in the equation ŷ = b0 + b1x.

Simple Linear Regression Model (Slide 3 of 5)

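In the estimated simple linear regression equation ŷ = b0 + b1x, ŷ is the point estimate of the mean value of y for a given value of x, b0 is the estimated y-intercept, and b1 is the estimated slope.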

The graph of the estimated simple linear regression equation is called the estimated regression line.

Simple Linear Regression Model (Slide 4 of 5)

Figure 7.1: The Estimation Process in Simple Linear Regression

Simple Linear Regression Model (Slide 5 of 5)

Figure 7.2: Possible Regression Lines in Simple Linear Regression

The regression line in Panel A shows that the mean value of y is related positively to x, with larger values of E(y|x) associated with larger values of x.

In Panel B, the mean value of y is related negatively to x, with smaller values of E(y|x) associated with larger values of x.

In Panel C, the mean value of y is not related to x; that is, E(y|x) is the same for every value of x.

Least Squares Method

Least Squares Estimates of the Regression Parameters

Using Excel’s Chart Tools to Compute the Estimated Regression Equation

Least Squares Method (Slide 1 of 15)

Least squares method: A procedure for using sample data to find the estimated regression equation.

Least Squares Method (Slide 2 of 15)

Table 7.1: Miles Traveled and Travel Time for 10 Butler Trucking Company Driving Assignments

Driving Assignment i	x = Miles Traveled	y = Travel Time (hours)
1	100	9.3
2	50	4.8
3	100	8.9
4	100	6.5
5	50	4.2
6	80	6.2
7	75	7.4
8	65	6.0
9	90	7.6
10	90	6.1

To illustrate the least squares method, suppose data were collected from a sample of 10 Butler Trucking Company driving assignments.

For the ith observation or driving assignment in the sample, xi is the miles traveled and yi is the travel time (in hours).

The values of xi and yi for the 10 driving assignments in the sample are summarized in Table 7.1.

Least Squares Method (Slide 3 of 15)

Figure 7.3: Scatter Chart of Miles Traveled and Travel Time for Sample of 10 Butler Trucking Company Driving Assignments

Figure 7.3 is a scatter chart of the data in Table 7.1.

Miles traveled is shown on the horizontal axis, and travel time (in hours) is shown on the vertical axis.

Longer travel times appear to coincide with more miles traveled.

The relationship between the travel time and miles traveled appears to be approximated by a straight line; indeed, a positive linear relationship is indicated between x and y.

Therefore, the simple linear regression model is chosen to represent this relationship.

Least Squares Method (Slide 4 of 15)

Least Squares Method (Slide 5 of 15)

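Hence, the least squares method chooses b0 and b1 to minimize Σ(yi - ŷi)² = Σ(yi - b0 - b1xi)², where yi is the observed value and ŷi is the predicted value of the dependent variable for the ith observation.

We are finding the regression line that minimizes the sum of squared errors.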

Least Squares Method (Slide 6 of 15)

Least Squares Estimates of the Regression Parameters:

For the Butler Trucking Company data in Table 7.1: b1 = 0.0678 and b0 = 1.2739.

The estimated simple linear regression model: ŷ = 1.2739 + 0.0678x

Least Squares Method (Slide 7 of 15)

If the length of a driving assignment were 1 unit (1 mile) longer, the mean travel time for that driving assignment would be 0.0678 units (0.0678 hours, or approximately 4 minutes) longer.

If the driving distance for a driving assignment was 0 units (0 miles), the mean travel time would be 1.2739 units (1.2739 hours, or approximately 76 minutes).

We interpret b0 and b1 as we would the y-intercept and slope of any straight line.

Least Squares Method (Slide 8 of 15)

Experimental region: The range of values of the independent variables in the data used to estimate the model.

The regression model is valid only over this region.

Extrapolation: Prediction of the value of the dependent variable outside the experimental region.

It is risky.

Because we have no empirical evidence that the relationship we have found holds true for values of x outside of the range of values of x in the data used to estimate the relationship, extrapolation is risky and should be avoided if possible.

For Butler Trucking, this means that any prediction of the travel time for a driving distance less than 50 miles or greater than 100 miles is not a reliable estimate, and so for this model the estimate of β0 is meaningless.

However, if the experimental region for a regression problem includes zero, the y-intercept will have a meaningful interpretation.
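The experimental region can be enforced in code so that extrapolated estimates are flagged rather than silently trusted. The sketch below uses the reported Butler Trucking estimates and the 50 to 100 mile range of the sample; the function name and the returned flag are illustrative assumptions, not part of the original material.

```python
# Reported estimates for the Butler Trucking simple linear regression.
B0, B1 = 1.2739, 0.0678
# Experimental region: range of miles traveled observed in the sample.
X_MIN, X_MAX = 50, 100


def predict_travel_time(miles):
    """Return (estimated mean travel time in hours, inside_region flag)."""
    inside_region = X_MIN <= miles <= X_MAX
    return B0 + B1 * miles, inside_region


for m in (75, 120):
    y_hat, inside = predict_travel_time(m)
    note = "" if inside else "  (extrapolation: not a reliable estimate)"
    print(f"{m} miles -> {y_hat:.4f} hours{note}")
```

A 75-mile assignment lies inside the region, so its estimate can be used; the 120-mile estimate is computed but marked as extrapolation.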

Least Squares Method (Slide 9 of 15)

Butler Trucking Company example: Use the estimated model and the known values for miles traveled for a driving assignment (x) to estimate mean travel time in hours.

For example, the first driving assignment in Table 7.1 has a value for miles traveled of x1 = 100.

The mean travel time in hours for this driving assignment is estimated to be: ŷ1 = 1.2739 + 0.0678(100) = 8.0539

The resulting residual of the estimate is: e1 = y1 - ŷ1 = 9.3 - 8.0539 = 1.2461

The simple linear regression model underestimated travel time for this driving assignment by 1.2461 hours (approximately 74 minutes).

Least Squares Method (Slide 10 of 15)

Table 7.2: Predicted Travel Time and Residuals for 10 Butler Trucking Company Driving Assignments

Note in Table 7.2 that:

The sum of predicted values is equal to the sum of the values of the dependent variable y.

The sum of the residuals ei is 0.

The sum of the squared residuals has been minimized.
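These properties of the least squares fit can be checked numerically. The sketch below recomputes the line from the Table 7.1 data with unrounded coefficients, so individual residuals differ slightly from values obtained with the rounded estimates (for example, about 1.2435 instead of 1.2461 for the first assignment). Variable names are illustrative.

```python
# Table 7.1 data. Assignment 3 is taken as 100 miles, the value consistent
# with the reported estimates b0 = 1.2739 and b1 = 0.0678.
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

n = len(miles)
x_bar = sum(miles) / n
y_bar = sum(hours) / n

# Unrounded least squares estimates.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours))
      / sum((x - x_bar) ** 2 for x in miles))
b0 = y_bar - b1 * x_bar

predicted = [b0 + b1 * x for x in miles]
residuals = [y - y_hat for y, y_hat in zip(hours, predicted)]

print(f"sum of observed y  = {sum(hours):.4f}")
print(f"sum of predicted y = {sum(predicted):.4f}")
print(f"sum of residuals   = {sum(residuals):.4f}")
```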

Least Squares Method (Slide 11 of 15)

Figure 7.4: Scatter Chart of Miles Traveled and Travel Time for Butler Trucking Company Driving Assignments with Regression Line Superimposed

Figure 7.4 shows the simple linear regression line ŷi = 1.2739 + 0.0678xi superimposed on the scatter chart for the Butler Trucking Company data in Table 7.1.

This figure, which also highlights the residuals for driving assignment 3 (e3) and driving assignment 5 (e5), shows that the regression model underpredicts travel time for some driving assignments (such as driving assignment 3) and overpredicts travel time for others (such as driving assignment 5), but in general appears to fit the data relatively well.

Least Squares Method (Slide 12 of 15)

Figure 7.5: A Geometric Interpretation of the Least Squares Method

In Figure 7.5, a vertical line is drawn from each point in the scatter chart to the linear regression line.

Each of these lines represents the difference between the actual driving time and the driving time to be predicted using linear regression for one of the assignments in the data.

The length of each line is equal to the absolute value of the residual for one of the driving assignments.

When a residual is squared, the resulting value is equal to the area of the square that is formed using the vertical dashed line representing the residual in Figure 7.4 as one side of a square.

Thus, when we find the linear regression model that minimizes the sum of squared errors for the Butler Trucking example, we are positioning the regression line in the manner that minimizes the sum of the areas of the ten squares in Figure 7.5.
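The geometric idea can be illustrated by comparing the total area of the squares for the reported least squares line with the total for a few other lines. The alternative lines below are hypothetical and chosen only for comparison; this is a sketch, not a proof of minimality.

```python
# Table 7.1 data (assignment 3 taken as 100 miles, consistent with the
# reported estimates b0 = 1.2739 and b1 = 0.0678).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]


def sum_squared_errors(b0, b1):
    """Total area of the squares built on the residuals of y_hat = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(miles, hours))


least_squares_sse = sum_squared_errors(1.2739, 0.0678)
print(f"least squares line: SSE = {least_squares_sse:.4f}")

# Hypothetical alternative lines, for comparison only.
alternatives = [(1.0, 0.07), (1.5, 0.065), (1.2739, 0.07)]
for b0, b1 in alternatives:
    print(f"b0 = {b0}, b1 = {b1}: SSE = {sum_squared_errors(b0, b1):.4f}")
```

Each alternative line produces a larger sum of squared residuals than the least squares line.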

Least Squares Method (Slide 13 of 15)

Using Excel’s Chart Tools to Compute the Estimated Regression Equation:

After constructing a scatter chart with Excel’s chart tools:

Right-click on any data point and select Add Trendline.

When the Format Trendline task pane appears:

Select Linear in the Trendline Options area.

Select Display Equation on chart in the Trendline Options area.

Least Squares Method (Slide 14 of 15)

Figure 7.6: Scatter Chart and Estimated Regression Line for Butler Trucking Company

We can use Excel’s chart tools to compute the estimated regression equation on a scatter chart of the Butler Trucking Company data in Table 7.1.

The worksheet displayed in Figure 7.6 shows the original data, scatter chart, estimated regression line, and estimated regression equation.

Note that Excel uses y instead of ŷ to denote the predicted value of the dependent variable and puts the regression equation into slope-intercept form, whereas we use the intercept-slope form that is standard in statistics.

Least Squares Method (Slide 15 of 15)

Slope Equation: b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²

y-Intercept Equation: b0 = ȳ - b1x̄
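The slope and y-intercept can be computed directly from the sample data. The sketch below assumes the standard least squares formulas, b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² and b0 = ȳ - b1x̄, and applies them to the Table 7.1 data; the function name is an illustrative choice.

```python
# Table 7.1 data. Assignment 3 is taken as 100 miles, the value that
# reproduces the reported estimates b0 = 1.2739 and b1 = 0.0678.
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]


def least_squares(x, y):
    """Return (b0, b1) for the estimated regression line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares(miles, hours)
print(f"y_hat = {b0:.4f} + {b1:.4f} x")
```

With x̄ = 80 and ȳ = 6.7, the sums are 234 and 3450, so b1 = 234/3450 and b0 = 6.7 - 80(234/3450), which round to the values shown in Excel's trendline equation.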

Assessing the Fit of the Simple Linear Regression Model

The Sums of Squares

The Coefficient of Determination

Using Excel’s Chart Tools to Compute the Coefficient of Determination

Assessing the Fit of the Simple Linear Regression Model (Slide 1 of 10)

The Sums of Squares:

Sum of squares due to error: The value of SSE is a measure of the error in using the estimated regression equation to predict the values of the dependent variable in the sample.

From Table 7.2, SSE = Σ(yi - ŷi)² = 8.0288.

Assessing the Fit of the Simple Linear Regression Model (Slide 2 of 10)

Without knowledge of any related variables, we would use the sample mean ȳ as a predictor of travel time for any given driving assignment.

To find ȳ, we divide the sum of the actual driving times yi from Table 7.2 (67) by the number of observations n in the data (10); this yields ȳ = 6.7.

Figure 7.7 provides insight on how well we would predict the values of yi in the Butler Trucking Company example using ȳ = 6.7.

From this figure, which again highlights the residuals for driving assignments 3 and 5, we can see that ȳ tends to overpredict travel times for driving assignments that have relatively small values for miles traveled (such as driving assignment 5) and tends to underpredict travel times for driving assignments that have relatively large values for miles traveled (such as driving assignment 3).

Assessing the Fit of the Simple Linear Regression Model (Slide 3 of 10)

Table 7.3 shows the sum of squared deviations obtained by using the sample mean ȳ = 6.7 to predict the value of travel time in hours for each driving assignment in the sample.

Butler Trucking Example: For the ith driving assignment in the sample, the difference yi - ȳ provides a measure of the error involved in using ȳ to predict travel time.

Assessing the Fit of the Simple Linear Regression Model (Slide 4 of 10)

The corresponding sum of squares is called the total sum of squares (SST).

SST is a measure of how well the observations cluster about the horizontal line ȳ.

Assessing the Fit of the Simple Linear Regression Model (Slide 5 of 10)

Table 7.3: Calculations for the Sum of Squares Total for the Butler Trucking Simple Linear Regression

The sum at the bottom of the last column in Table 7.3 is the total sum of squares for Butler Trucking Company: SST = 23.9.

Assessing the Fit of the Simple Linear Regression Model (Slide 6 of 10)

In Figure 7.8, we show the estimated regression line ŷi = 1.2739 + 0.0678xi and the line corresponding to ȳ = 6.7.

Note that the points cluster more closely around the estimated regression line ŷi = 1.2739 + 0.0678xi than they do about the horizontal line ȳ = 6.7.

Assessing the Fit of the Simple Linear Regression Model (Slide 7 of 10)

Relation between SST, SSR, and SSE: SST = SSR + SSE

where

SST = total sum of squares

SSR = sum of squares due to regression

SSE = sum of squares due to error.

In the Butler Trucking Company example, SSE = 8.0288 and SST = 23.9:

SSR = SST - SSE = 23.9 - 8.0288 = 15.8712

Assessing the Fit of the Simple Linear Regression Model (Slide 8 of 10)

The Coefficient of Determination:

The ratio SSR/SST is used to evaluate the goodness of fit for the estimated regression equation; this ratio is called the coefficient of determination and is denoted by r².

It takes values between zero and one.

It is interpreted as the percentage of the total sum of squares that can be explained by using the estimated regression equation.

For the Butler Trucking Company example, the coefficient of determination is

r² = SSR/SST = 15.8712/23.9 = 0.6641.

It can be concluded that 66.41% of the total sum of squares can be explained by using the estimated regression equation ŷi = 1.2739 + 0.0678xi to predict travel time.
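The sums of squares and the coefficient of determination can be reproduced from the sample data. The sketch below uses the reported rounded estimates, so SSE and SSR agree with the slide values only to about three decimal places; variable names are illustrative.

```python
# Table 7.1 data (assignment 3 taken as 100 miles, consistent with the
# reported estimates b0 = 1.2739 and b1 = 0.0678).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

b0, b1 = 1.2739, 0.0678                      # reported estimates
y_bar = sum(hours) / len(hours)              # sample mean of travel time
predicted = [b0 + b1 * x for x in miles]

sst = sum((y - y_bar) ** 2 for y in hours)                   # total
sse = sum((y - p) ** 2 for y, p in zip(hours, predicted))    # error
ssr = sst - sse                                              # regression
r_squared = ssr / sst

print(f"SST = {sst:.4f}, SSE = {sse:.4f}, SSR = {ssr:.4f}")
print(f"r^2 = {r_squared:.4f}")
```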

Assessing the Fit of the Simple Linear Regression Model (Slide 9 of 10)

Using Excel’s Chart Tools to Compute the Coefficient of Determination:

To compute the coefficient of determination using the scatter chart in Figure 7.3:

Right-click on any data point in the scatter chart and select Add Trendline…

When the Format Trendline task pane appears:

Select Display R-squared value on chart in the Trendline Options area.

Assessing the Fit of the Simple Linear Regression Model (Slide 10 of 10)

Figure 7.9 displays the scatter chart, the estimated regression equation, the graph of the estimated regression equation, and the coefficient of determination for the Butler Trucking Company data. We see that r² = 0.6641.

The Multiple Regression Model

Regression Model

Estimated Multiple Regression Equation

Least Squares Method and Multiple Regression

Butler Trucking Company and Multiple Regression

Using Excel’s Regression Tool to Develop the Estimated Multiple Regression Equation

The Multiple Regression Model (Slide 1 of 11)

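Regression Model: y = β0 + β1x1 + β2x2 + . . . + βqxq + ε, where y is the dependent variable, x1, x2, . . . , xq are the independent variables, β0, β1, β2, . . . , βq are the parameters, and ε is the error term.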

The Multiple Regression Model (Slide 2 of 11)

Regression Model (cont.):

βj represents the change in the mean value of the dependent variable y that corresponds to a one-unit increase in the independent variable xj, holding the values of all other independent variables in the model constant.

The multiple regression equation that describes how the mean value of y is related to x1, x2, . . . , xq is E(y|x1, x2, . . . , xq) = β0 + β1x1 + β2x2 + . . . + βqxq.

The Multiple Regression Model (Slide 3 of 11)

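Estimated Multiple Regression Equation: ŷ = b0 + b1x1 + b2x2 + . . . + bqxq, where b0, b1, b2, . . . , bq are the point estimates of β0, β1, β2, . . . , βq and ŷ is the estimated mean value of y given values for x1, x2, . . . , xq.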

The Multiple Regression Model (Slide 4 of 11)

Least Squares Method and Multiple Regression:

The least squares method is used to develop the estimated multiple regression equation by finding b0, b1, b2, . . . , bq that minimize Σ(yi - ŷi)².

It uses sample data to provide the values of b0, b1, b2, . . . , bq that minimize the sum of squared residuals.

The Multiple Regression Model (Slide 5 of 11)

Figure 7.10: The Estimation Process for Multiple Regression

The estimation process for multiple regression is shown in Figure 7.10.

The estimated values of the dependent variable y are computed by substituting values of the independent variables x1, x2, . . . , xq into the estimated multiple regression equation.

Computer software packages can be used to obtain the estimated regression equation and determine the regression coefficients b0, b1, b2, . . . , bq.

The Multiple Regression Model (Slide 6 of 11)

Butler Trucking Company and Multiple Regression:

The estimated simple linear regression equation is ŷ = 1.2739 + 0.0678x.

The linear effect of the number of miles traveled explains 66.41% of the variability in travel time in the sample.

This implies that 33.59% of the variability in sample travel times remains unexplained.

The managers might want to consider adding one or more independent variables, such as number of deliveries, to the model to explain some of the remaining variability in the dependent variable.

The Multiple Regression Model (Slide 7 of 11)

Butler Trucking Company and Multiple Regression (cont.):

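Estimated multiple linear regression with two independent variables: ŷ = b0 + b1x1 + b2x2, where ŷ is the estimated mean travel time, x1 is the distance traveled (miles), and x2 is the number of deliveries.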

The Multiple Regression Model (Slide 8 of 11)

Figure 7.11: Data Analysis Tools Box

Using Excel’s Regression Tool to Develop the Estimated Multiple Regression Equation:

The following steps describe how to use Excel’s Regression tool to compute the estimated regression equation using the data in the worksheet.

Copy the data values from the file ButlerWithDeliveries into an Excel worksheet from columns A through D and rows 1 through 301.

Step 1. Click the DATA tab in the Ribbon

Step 2. Click Data Analysis in the Analysis group

Step 3. Select Regression from the list of Analysis Tools in the Data Analysis tools box (shown in Figure 7.11) and click OK

The Multiple Regression Model (Slide 9 of 11)

Figure 7.12: Regression Dialog Box

Step 4. When the Regression dialog box appears (as shown in Figure 7.12):

1. Enter D1:D301 in the Input Y Range: box

2. Enter B1:C301 in the Input X Range: box

3. Select Labels

Selecting Labels tells Excel to use the names you have given to your variables in row 1 when displaying the regression model output.

4. Select Confidence Level:

5. Enter 99 in the Confidence Level: box

6. Select New Worksheet Ply:

7. Click OK

The Multiple Regression Model (Slide 10 of 11)

Figure 7.13: Excel Regression Output for the Butler Trucking Company with Miles and Deliveries as Independent Variables

In the Excel output shown in Figure 7.13, the label for the independent variable x1 is Miles (see cell A18), and the label for the independent variable x2 is Deliveries (see cell A19).

Estimated regression equation:

ŷ = 0.1273 + 0.0672x1 + 0.6900x2

Interpretation:

For a fixed number of deliveries, we estimate that the mean travel time will increase by 0.0672 hours when the distance traveled increases by 1 mile.

For a fixed distance traveled, we estimate that the mean travel time will increase by 0.69 hours when the number of deliveries increases by 1 delivery.

Multiple coefficient of determination, R² = 0.8173.

Interpretation: 81.73% of the variability in sample values of the dependent variable, travel time, is explained by the independent variables, number of deliveries and distance traveled.
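The interpretation of the coefficients can be checked by evaluating the estimated equation. The sketch below hard-codes the coefficients reported in the Excel output; the function name and the example inputs (75 miles, 3 deliveries) are illustrative.

```python
# Coefficients reported in the Excel regression output.
B0, B_MILES, B_DELIVERIES = 0.1273, 0.0672, 0.6900


def estimate_travel_time(miles, deliveries):
    """Estimated mean travel time in hours for the given assignment."""
    return B0 + B_MILES * miles + B_DELIVERIES * deliveries


base = estimate_travel_time(75, 3)
print(f"75 miles, 3 deliveries: {base:.4f} hours")

# Holding deliveries fixed, one more mile adds b1 hours.
print(f"one more mile:     +{estimate_travel_time(76, 3) - base:.4f} hours")
# Holding miles fixed, one more delivery adds b2 hours.
print(f"one more delivery: +{estimate_travel_time(75, 4) - base:.4f} hours")
```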

The Multiple Regression Model (Slide 11 of 11)

Figure 7.14: Graph of the Regression Equation for Multiple Regression Analysis with Two Independent Variables

With two independent variables x1 and x2, we now generate a predicted value of y for every combination of values of x1 and x2.

Thus, instead of a regression line, we now create a regression plane in three-dimensional space.

Figure 7.14 provides the graph of the estimated regression plane for the Butler Trucking Company example and shows the seventh driving assignment in the data.

The plane slopes upward to larger values of estimated mean travel time (ŷ) as either the number of miles traveled (x1) or the number of deliveries (x2) increases.

The residual for a driving assignment when x1 = 75 and x2 = 3 is the difference between the observed y value and the estimated mean value of y when x1 = 75 and x2 = 3.

The observed value lies above the regression plane, indicating that the regression model underestimates the expected driving time for the seventh driving assignment.

Inference and Regression

Conditions Necessary for Valid Inference in the Least Squares Regression Model

Testing Individual Regression Parameters

Addressing Nonsignificant Independent Variables

Multicollinearity

Inference and Regression (Slide 1 of 17)

Statistical inference: Process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through the analysis of sample data drawn from the population.

In regression, inference is commonly used to estimate and draw conclusions about:

The regression parameters

The mean value and/or the predicted value of the dependent variable y for specific values of the independent variables

Consider both hypothesis testing and interval estimation.

Inference and Regression (Slide 2 of 17)

Conditions Necessary for Valid Inference in the Least Squares Regression Model:

For any given combination of values of the independent variables x1, x2, . . . , xq, the population of potential error terms ε is normally distributed with a mean of 0 and a constant variance.

Inference and Regression (Slide 3 of 17)

Figure 7.15: Illustration of the Conditions for Valid Inference in Regression

Figure 7.15 illustrates the model conditions and their implications for a simple linear regression.

Note that in this graphical interpretation, the value of E(y|x) changes linearly according to the specific value of x considered, and so the mean error is zero at each value of x.

However, regardless of the x value, the error term ε and hence the dependent variable y are normally distributed, each with the same variance.

The specific value of the error ε at any particular point depends on whether the actual value of y is greater or less than E(y|x).

Inference and Regression (Slide 4 of 17)

Figure 7.16: Example of a Random Error Pattern in a Scatter Chart of Residuals and Predicted Values of the Dependent Variable

Simple scatter charts of the residuals and independent variables are an extremely effective method for assessing whether these conditions are violated.

We should review the scatter chart for patterns in the residuals indicating that one or more of the conditions have been violated.

Ideally, the residuals will be consistently scattered around zero throughout the predicted values of the dependent variable.

This is shown in the example in Figure 7.16.

Inference and Regression (Slide 5 of 17)

Figure 7.17: Examples of Diagnostic Scatter Charts of Residuals from Four Regressions

The residuals in the four panels of Figure 7.17 show distinct patterns, each of which suggests a violation of at least one of the regression model conditions.

In panel a, the variation in the residuals e increases as the value of the independent variable x increases, suggesting that the residuals do not have a constant variance.

In panel b, the residuals are positive for small and large values of the independent variable x but are negative for moderate values of the independent variable.

This pattern suggests that the linear regression model underpredicts the value of the dependent variable for small and large values of the independent variable and overpredicts the value of the dependent variable for intermediate values of the independent variable.

The residuals in panel c are not symmetrically distributed around 0; many of the negative residuals are relatively close to zero, while the relatively few positive residuals tend to be far from zero. This skewness suggests that the residuals are not normally distributed.

The residuals in panel d are plotted over time t, which generally serves as an independent variable; that is, an observation is made at each of several (usually equally spaced) points in time.

Inference and Regression (Slide 6 of 17)

Figure 7.18: Excel Residual Plots for the Butler Trucking Company Multiple Regression

You can generate scatter charts of the residuals against each independent variable in the model when using Excel’s Regression tool; to do so, select the Residual Plots option in the Residuals area of the Regression dialog box.

Figure 7.18 shows residual plots produced by Excel for the Butler Trucking Company example for which the independent variables are miles (x1) and deliveries (x2).

The residuals at each value of miles appear to have a mean of zero, to have similar variances, and to be concentrated around zero.

The residuals at each value of deliveries also appear to have a mean of zero, to have similar variances, and to be concentrated around zero.

Inference and Regression (Slide 7 of 17)

Inference and Regression (Slide 8 of 17)

Inference and Regression (Slide 9 of 17)

Testing Individual Regression Parameters:

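To determine whether statistically significant relationships exist between the dependent variable y and each of the independent variables x1, x2, . . . , xq individually, we test the hypothesis that each regression parameter βj is zero. The test statistic is t = bj / s_bj, where s_bj is the estimated standard deviation of bj.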

Inference and Regression (Slide 10 of 17)

Testing Individual Regression Parameters (cont.):

As the magnitude of t increases (as t deviates from zero in either direction), we are more likely to reject the hypothesis that the regression parameter βj is zero.

When we reject the hypothesis that the regression parameter βj is zero, we conclude that a relationship exists between the dependent variable y and the independent variable xj.

Statistical software will generally report a p value for this test statistic.

For a given value of t, this p value represents the probability of collecting a sample of the same size from the same population that yields a t statistic of larger magnitude, given that the value of βj is actually zero.

The hypothesis βj is equal to zero is rejected when the corresponding p value is smaller than some predetermined level of significance (usually 0.05 or 0.01).

Inference and Regression (Slide 11 of 17)

Testing Individual Regression Parameters (cont.):

Confidence intervals can be used to test whether each of the regression parameters β0, β1, β2, . . . , βq is equal to zero.

Confidence interval: An estimate of a population parameter that provides an interval believed to contain the value of the parameter at some level of confidence.

Confidence level: Indicates how frequently interval estimates based on samples of the same size taken from the same population using identical sampling techniques will contain the true value of the parameter we are estimating.

To test that βj is zero (i.e., there is no linear relationship between xj and y) at some predetermined level of significance (say 0.05), first build a confidence interval at the (1 – 0.05)100% confidence level.

If the resulting confidence interval does not contain zero, we conclude that βj differs from zero at the predetermined level of significance.

Similarly, to test that β0 is zero (i.e., the value of the dependent variable is zero when all the independent variables x1, x2, . . . , xq are equal to zero) at some predetermined level of significance (say 0.05), first build a confidence interval at the (1 – 0.05)100% confidence level.

If the resulting confidence interval does not contain zero, we conclude that β0 differs from zero at the predetermined level of significance.
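Both approaches can be sketched for the slope of the simple linear regression fitted to the 10 observations in Table 7.1 (not the 300-observation multiple regression). The standard error formula s_b1 = s / sqrt(Σ(xi - x̄)²) and the critical value 2.306 (two-tailed, 0.05 level, 8 degrees of freedom, taken from a t table) are assumptions supplied here, not values from the slides.

```python
import math

# Table 7.1 data (assignment 3 taken as 100 miles, consistent with the
# reported estimates b0 = 1.2739 and b1 = 0.0678).
miles = [100, 50, 100, 100, 50, 80, 75, 65, 90, 90]
hours = [9.3, 4.8, 8.9, 6.5, 4.2, 6.2, 7.4, 6.0, 7.6, 6.1]

n = len(miles)
x_bar = sum(miles) / n
y_bar = sum(hours) / n
sxx = sum((x - x_bar) ** 2 for x in miles)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, hours))

b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Standard error of the slope, with s^2 = SSE / (n - 2).
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(miles, hours))
s_b1 = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)

t_stat = b1 / s_b1

# Assumed table value: two-tailed t critical value, alpha = 0.05, 8 df.
T_CRIT = 2.306
lower, upper = b1 - T_CRIT * s_b1, b1 + T_CRIT * s_b1
reject_zero_slope = abs(t_stat) > T_CRIT

print(f"t = {t_stat:.3f}, reject beta1 = 0: {reject_zero_slope}")
print(f"95% confidence interval for beta1: ({lower:.4f}, {upper:.4f})")
```

The interval excludes zero exactly when |t| exceeds the critical value, so the two tests lead to the same conclusion.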

Inference and Regression (Slide 12 of 17)

Addressing Nonsignificant Independent Variables:

If practical experience dictates that the nonsignificant independent variable has a relationship with the dependent variable, the independent variable should be left in the model.

If the model sufficiently explains the dependent variable without the nonsignificant independent variable, then consider rerunning the regression without the nonsignificant independent variable.

Note that it is possible that the estimates of the other regression coefficients and their p values may change considerably when we remove the nonsignificant independent variable from the model.

The appropriate treatment of the inclusion or exclusion of the y-intercept when b0 is not statistically significant may require special consideration.

Since the primary purpose of the regression model is to explain or predict values of the dependent variable for values of the independent variables that lie within the experimental region on which the model is based, regression through the origin should not be forced unless there are strong a priori reasons for believing that the dependent variable is equal to zero when the values of all independent variables in the model are equal to zero.

Inference and Regression (Slide 13 of 17)

Multicollinearity:

Multicollinearity refers to the correlation among the independent variables in multiple regression analysis.

In t tests for the significance of individual parameters, the difficulty caused by multicollinearity is that it is possible to conclude that a parameter associated with one of the multicollinear independent variables is not significantly different from zero when the independent variable actually has a strong relationship with the dependent variable.

This problem is avoided when there is little correlation among the independent variables.

According to a common rule of thumb test, multicollinearity is a potential problem if the absolute value of the sample correlation coefficient exceeds 0.7 for any two of the independent variables.

The primary consequence of multicollinearity is that it increases the variances and standard errors of the regression estimates of β0, β1, β2, . . . , βq and predicted values of the dependent variable, and so inference based on these estimates is less precise than it should be.
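The rule of thumb can be applied by computing the sample correlation for every pair of independent variables. In the sketch below, the miles values come from Table 7.1 (with assignment 3 taken as 100 miles), but the gasoline and deliveries values are hypothetical numbers invented purely to exercise the check; the function names are also illustrative.

```python
import math
from itertools import combinations


def sample_correlation(x, y):
    """Pearson sample correlation coefficient between two equal-length lists."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def flag_multicollinearity(variables, threshold=0.7):
    """Return (name_a, name_b, r) for every pair with |r| above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        r = sample_correlation(a, b)
        if abs(r) > threshold:
            flagged.append((name_a, name_b, r))
    return flagged


variables = {
    "miles": [100, 50, 100, 100, 50, 80, 75, 65, 90, 90],
    # HYPOTHETICAL values, for illustration only.
    "gasoline": [9.8, 5.1, 10.3, 9.9, 4.8, 8.2, 7.4, 6.6, 9.1, 8.8],
    "deliveries": [4, 3, 4, 2, 2, 2, 3, 4, 3, 2],
}

flagged = flag_multicollinearity(variables)
for name_a, name_b, r in flagged:
    print(f"potential multicollinearity: {name_a} vs {name_b} (r = {r:.3f})")
```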

Inference and Regression (Slide 14 of 17)

Figure 7.21: Excel Regression Output for the Butler Trucking Company with Miles and Gasoline Consumption as Independent Variables

Using Excel’s Regression tool, we obtain the results shown in Figure 7.21 for our multiple regression.

When we conduct a t test to determine whether β1 is equal to zero, we find a p value of 3.1544E-07, and so we reject this hypothesis and conclude that travel time is related to miles traveled.

On the other hand, when we conduct a t test to determine whether β2 is equal to zero, we find a p value of 0.6588, and so we do not reject this hypothesis.

Inference and Regression (Slide 15 of 17)

Figure 7.22: Scatter Chart of Miles and Gasoline Consumed for Butler Trucking Company

Following the regression output shown in Figure 7.21, we can use the Excel chart tool to create a scatter chart of miles traveled and gasoline consumed, as in Figure 7.22.

The figure shows that miles traveled and gasoline consumed are strongly related.

Inference and Regression (Slide 16 of 17)

Multicollinearity (cont.):

Testing for an overall regression relationship:

Use an F test based on the F probability distribution.

If the F test leads us to reject the hypothesis that the values of β1, β2, . . . , βq are all zero:

Conclude that there is an overall regression relationship.

Otherwise, conclude that there is no overall regression relationship.

Inference and Regression (Slide 17 of 17)

Multicollinearity (cont.):

Testing for an overall regression relationship (cont.):

The test statistic generated by the sample data for this test is:

F = (SSR/q) / (SSE/(n - q - 1))

where

SSR = Sum of squares due to regression.

SSE = Sum of squares due to error.

q = the number of independent variables in the regression model.

n = the number of observations in the sample.

Larger values of F provide stronger evidence of an overall regression relationship.

The numerator of this test statistic is a measure of the variability in the dependent variable y that is explained by the independent variables x1, x2, . . . , xq.

The denominator is a measure of the variability in the dependent variable y that is not explained by the independent variables x1, x2, . . . , xq.

Statistical software will generally report a p value for this test statistic.

For a given value of F, the p value represents the probability of collecting a sample of the same size from the same population that yields a larger F statistic given that the values of β1, β2, . . . , βq are all actually zero.

Thus, smaller p values indicate stronger evidence against the hypothesis that the values of β1, β2, . . . , βq are all zero (i.e., stronger evidence of an overall regression relationship).

The hypothesis is rejected when the p value is smaller than some predetermined value (usually 0.05 or 0.01) that is referred to as the level of significance.
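
As a rough illustration, the following Python sketch computes the overall F statistic, assuming the standard form F = (SSR/q) / (SSE/(n – q – 1)). The function name and the numeric inputs are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the Butler Trucking output.

```python
def overall_f_statistic(ssr: float, sse: float, q: int, n: int) -> float:
    """Return F = (SSR / q) / (SSE / (n - q - 1))."""
    if q < 1:
        raise ValueError("q must be at least 1")
    if n - q - 1 <= 0:
        raise ValueError("need n > q + 1 observations")
    if sse <= 0:
        raise ValueError("SSE must be positive")
    msr = ssr / q              # explained variability per independent variable
    mse = sse / (n - q - 1)    # unexplained variability per degree of freedom
    return msr / mse


# Hypothetical values: 2 independent variables, 13 observations.
f_value = overall_f_statistic(ssr=90.0, sse=10.0, q=2, n=13)
print(f_value)
```

Turning F into a p value requires the F distribution with q and n – q – 1 degrees of freedom, which statistical software normally supplies.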

Categorical Independent Variables

Butler Trucking Company and Rush Hour

Interpreting the Parameters

More Complex Categorical Variables

Categorical Independent Variables (Slide 1 of 10)

Butler Trucking Company and Rush Hour:

Dependent variable, y: Travel time.

x3 = 0 if the driving assignment did not include travel on the congested segment of highway during afternoon rush hour.

x3 = 1 if the driving assignment included travel on the congested segment of highway during afternoon rush hour.

The previous independent variables we have considered (such as miles traveled and number of deliveries) have been quantitative, but the new variable, rush hour, is categorical. This variable is also called a dummy variable.

Will this dummy variable add valuable information to the current Butler Trucking regression model?

A review of the residuals produced by the current model may help us make an initial assessment.

We can create a frequency distribution and a histogram of the residuals for driving assignments that included travel on a congested segment of a highway during the afternoon rush hour period.

We then create a frequency distribution and a histogram of the residuals for driving assignments that did not include travel on a congested segment of a highway during the afternoon rush hour period.

Categorical Independent Variables (Slide 2 of 10)

Figure 7.23: Histograms of the Residuals for Driving Assignments That Included Travel on a Congested Segment of a Highway During the Afternoon Rush Hour and Residuals for Driving Assignments That Did Not

We know that the residual for the ith observation is ei = yi – ŷi, which is the difference between the observed and predicted values of the dependent variable.

The histograms in Figure 7.23 show that driving assignments that included travel on a congested segment of a highway during the afternoon rush hour period tend to have positive residuals, which means we are generally underpredicting the travel times for those driving assignments.

Conversely, driving assignments that did not include travel on a congested segment of a highway during the afternoon rush hour period tend to have negative residuals, which means we are generally overpredicting the travel times for those driving assignments.

These results suggest that the dummy variable could potentially explain a substantial proportion of the variance in travel time that is unexplained by the current model, and so we proceed by adding the dummy variable x3 to the current Butler Trucking multiple regression model.
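
The residual comparison described above can be sketched in Python. The observed and predicted values below are made up for illustration (they are not the Butler Trucking data), and a group mean stands in for the full histogram.

```python
from statistics import mean

# Hypothetical (observed, predicted, rush_hour) triples; rush_hour is 1 or 0.
assignments = [
    (9.3, 8.6, 1),
    (7.4, 6.9, 1),
    (6.6, 6.8, 0),
    (4.8, 5.3, 0),
    (8.9, 8.1, 1),
    (6.0, 6.4, 0),
]


def mean_residual_by_group(rows):
    """Average the residuals (observed - predicted) within each group."""
    groups = {0: [], 1: []}
    for observed, predicted, rush_hour in rows:
        groups[rush_hour].append(observed - predicted)
    return {flag: mean(values) for flag, values in groups.items()}


summary = mean_residual_by_group(assignments)
print(summary)
```

A positive mean for the rush-hour group and a negative mean for the other group is the pattern that would suggest the current model is missing a rush-hour effect.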

Categorical Independent Variables (Slide 3 of 10)

Using Excel’s Regression tool to develop the estimated regression equation, we obtained the Excel output in Figure 7.24.

The estimated regression equation is ŷ = –0.3302 + 0.0672x1 + 0.6735x2 + 0.9980x3.

Categorical Independent Variables (Slide 4 of 10)

Interpreting the Parameters:

The model estimates that travel time increases by:

0.0672 hours (about 4 minutes) for every increase of 1 mile traveled, holding constant the number of deliveries and whether the driving assignment route requires the driver to travel on the congested segment of a highway during the afternoon rush hour period.

0.6735 hours (about 40 minutes) for every delivery, holding constant the number of miles traveled and whether the driving assignment route requires the driver to travel on the congested segment of a highway during the afternoon rush hour period.

Categorical Independent Variables (Slide 5 of 10)

Interpreting the Parameters (cont.):

The model estimates that travel time increases by (cont.):

0.9980 hours (about 60 minutes) if the driving assignment route requires the driver to travel on the congested segment of a highway during the afternoon rush hour period, holding constant the number of miles traveled and the number of deliveries.

Categorical Independent Variables (Slide 6 of 10)

Interpreting the Parameters (cont.):
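
Substituting the two possible values of the dummy variable into the estimated regression equation gives one equation for each case. The arithmetic below is worked from the reported coefficients, assuming x3 = 1 marks an assignment that includes rush-hour travel on the congested segment; it is not copied from the Excel output.

```latex
\begin{align*}
x_3 = 0:\quad \hat{y} &= -0.3302 + 0.0672x_1 + 0.6735x_2 \\
x_3 = 1:\quad \hat{y} &= -0.3302 + 0.0672x_1 + 0.6735x_2 + 0.9980(1) \\
                      &= 0.6678 + 0.0672x_1 + 0.6735x_2
\end{align*}
```

The two equations share the same slopes and differ only in the intercept, by 0.9980 hours.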

Categorical Independent Variables (Slide 7 of 10)

More Complex Categorical Variables:

If a categorical variable has k levels, k – 1 dummy variables are required, with each dummy variable corresponding to one of the levels of the categorical variable and coded as 0 or 1.

Example:

Suppose a manufacturer of vending machines organized the sales territories for a particular state into three regions: A, B, and C.

The managers want to use regression analysis to help predict the number of vending machines sold per week.

Suppose the managers believe sales region is one of the important factors in predicting the number of units sold.

Dummy variables are often used to model seasonal effects in sales data.

Categorical Independent Variables (Slide 8 of 10)

More Complex Categorical Variables (cont.):

Example (cont.):

Sales region: categorical variable with three levels (A, B, and C).

Each variable can be coded 0 or 1 as:

x1 = 1 if Sales Region B, 0 otherwise.

x2 = 1 if Sales Region C, 0 otherwise.

Categorical Independent Variables (Slide 9 of 10)

More Complex Categorical Variables (cont.):

Example (cont.):

The regression equation relating the estimated mean number of units sold to the dummy variables is written as:

ŷ = b0 + b1x1 + b2x2

Observations corresponding to Region A are coded x1 = 0, x2 = 0,

so the estimated mean number of units sold in Region A is:

ŷ = b0 + b1(0) + b2(0) = b0

b0 = Mean or expected value of sales for Region A.

b1 = Difference between the mean number of units sold in Region B and the mean number of units sold in Region A.

b2 = Difference between the mean number of units sold in Region C and the mean number of units sold in Region A.

Categorical Independent Variables (Slide 10 of 10)

More Complex Categorical Variables (cont.):

Example (cont.):

Observations corresponding to Region B are coded x1 = 1, x2 = 0,

so the estimated mean number of units sold in Region B is:

ŷ = b0 + b1(1) + b2(0) = b0 + b1

Observations corresponding to Region C are coded x1 = 0, x2 = 1,

so the estimated mean number of units sold in Region C is:

ŷ = b0 + b1(0) + b2(1) = b0 + b2

Two dummy variables were required in this example because sales region is a categorical variable with three levels. But the assignment of specific variables is arbitrary.

The important point to remember is that when a categorical variable has k levels, k – 1 dummy variables are required in the multiple regression analysis.
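
A minimal Python sketch of k – 1 dummy coding follows. The helper name `dummy_code` and the choice of Region A as the baseline are illustrative assumptions; any level could serve as the baseline.

```python
def dummy_code(level, levels, baseline):
    """Return k - 1 dummy values (0 or 1) for one observation.

    The baseline level is coded as all zeros; every other level gets a
    single 1 in the position of its own dummy variable.
    """
    if level not in levels:
        raise ValueError(f"unknown level: {level!r}")
    if baseline not in levels:
        raise ValueError(f"unknown baseline: {baseline!r}")
    non_baseline = [lv for lv in levels if lv != baseline]
    return [1 if level == lv else 0 for lv in non_baseline]


regions = ["A", "B", "C"]
for region in regions:
    print(region, dummy_code(region, regions, baseline="A"))
```

With three regions the function returns two values per observation, matching the x1 (Region B) and x2 (Region C) coding used in the example.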

Modeling Nonlinear Relationships

Quadratic Regression Models

Piecewise Linear Regression Models

Interaction Between Independent Variables

Modeling Nonlinear Relationships (Slide 1 of 16)

Figure 7.25: Scatter Chart for the Reynolds Example

Reynolds, Inc., is a manufacturer of industrial scales and laboratory equipment.

Managers at Reynolds want to investigate the relationship between length of employment of their salespeople and the number of electronic laboratory scales sold.

Figure 7.25, the scatter chart for these data, indicates a possible curvilinear relationship between the length of time employed and the number of units sold.

Modeling Nonlinear Relationships (Slide 2 of 16)

Figure 7.26: Excel Regression Output for the Reynolds Example

Figure 7.26 is the Excel output for a simple linear regression.

The estimated regression: Sales = 113.7453 + 2.3675 Months Employed.

The computer output shows that the relationship is significant (p value = 9.3954E-06 in cell E18 of Figure 7.26 for the t test that β1 = 0) and that a linear relationship explains a high percentage of the variability in sales (r² = 0.7901 in cell B5).

Modeling Nonlinear Relationships (Slide 3 of 16)

Figure 7.27: Scatter Chart of the Residuals and Predicted Values of the Dependent Variable for the Reynolds Simple Linear Regression

The pattern in the scatter chart of residuals against the predicted values of the dependent variable in Figure 7.27 suggests that a curvilinear relationship may provide a better fit to the data.

If we have a practical reason to suspect a curvilinear relationship between the number of electronic laboratory scales sold by a salesperson and the number of months the salesperson has been employed, we may wish to consider an alternative to simple linear regression.

For example, we may believe that a recently hired salesperson faces a learning curve but becomes increasingly more effective over time and that a salesperson who has been in a sales position with Reynolds for a long time eventually becomes burned out and becomes increasingly less effective.

If our regression model supports this theory, Reynolds management can use the model to identify the approximate point in employment when its salespeople begin to lose their effectiveness, and management can plan strategies to counteract salesperson burnout.

Modeling Nonlinear Relationships (Slide 4 of 16)

Quadratic Regression Models:

In the Reynolds example, to account for the curvilinear relationship between months employed and scales sold, we could include the square of the number of months the salesperson has been employed as a second independent variable.

ŷ = b0 + b1x1 + b2x1²   (7.18)

Equation (7.18) corresponds to a quadratic regression model.

Modeling Nonlinear Relationships (Slide 5 of 16)

Figure 7.28: Relationships That Can Be Fit with a Quadratic Regression Model

Modeling Nonlinear Relationships (Slide 6 of 16)

Figure 7.29: Excel Data for the Reynolds Quadratic Regression Model

Figure 7.29 shows the Excel spreadsheet that includes the square of the number of months the employee has been with the firm.

To create the variable, which we will call MonthsSq, we create a new column and set each cell in that column equal to the square of the associated value of the variable Months. These values are shown in Column B of Figure 7.29.

Modeling Nonlinear Relationships (Slide 7 of 16)

Figure 7.30: Excel Output for the Reynolds Quadratic Regression Model

The estimated regression equation is Sales = 61.4299 + 5.8198 Months Employed – 0.0310 MonthsSq

MonthsSq is the square of the number of months the salesperson has been employed.

Because the value of b1 (5.8198) is positive, and the value of b2 (–0.0310) is negative, ŷ will initially increase as the number of months the salesperson has been employed increases.

As the value of the independent variable Months Employed increases, its squared value increases more rapidly, and eventually ŷ will decrease as the number of months the salesperson has been employed increases.

The R² of 0.9013 indicates that this regression model explains approximately 90.1% of the variation in Scales Sold for our sample data.
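
The point at which predicted sales stop rising can be estimated from the fitted coefficients. The sketch below uses the standard fact that a parabola b0 + b1x + b2x² with b2 < 0 peaks at x = –b1/(2b2); that step is not stated in the slides, and the function names are hypothetical. Because the coefficients are rounded, the result is approximate.

```python
# Coefficients from the estimated quadratic regression equation.
B0, B1, B2 = 61.4299, 5.8198, -0.0310


def reynolds_predicted_sales(months: float) -> float:
    """Predicted scales sold for a given number of months employed."""
    return B0 + B1 * months + B2 * months ** 2


def peak_month() -> float:
    """Months employed at which predicted sales are highest: -b1 / (2 * b2)."""
    return -B1 / (2 * B2)


peak = peak_month()
print(round(peak, 1), round(reynolds_predicted_sales(peak), 1))
```

The estimated peak falls a little past 90 months, so predictions for salespeople employed longer than that decline under this model.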

Modeling Nonlinear Relationships (Slide 8 of 16)

Figure 7.31: Scatter Chart of the Residuals and Predicted Values of the Dependent Variable for the Reynolds Quadratic Regression Model

The lack of a distinct pattern in the scatter chart of residuals against the predicted values of the dependent variable (Figure 7.31) suggests that the quadratic model fits the data better than the simple linear regression in the Reynolds example (the scatter chart of residuals against the independent variable Months Employed would also lead us to this conclusion).

Modeling Nonlinear Relationships (Slide 9 of 16)

Piecewise Linear Regression Models:

For the Reynolds data, as an alternative to a quadratic regression model:

Recognize that below some value of Months Employed, the relationship between Months Employed and Sales appears to be positive and linear.

Whereas the relationship between Months Employed and Sales appears to be negative and linear for the remaining observations.

Piecewise linear regression model: This model will allow us to fit these relationships as two linear regressions that are joined at the value of Months at which the relationship between Months Employed and Sales changes.

Modeling Nonlinear Relationships (Slide 10 of 16)

Piecewise Linear Regression Models (cont.):

Knot: The value of the independent variable at which the relationship between dependent variable and independent variable changes; also called breakpoint.

For the Reynolds data, knot is the value of the independent variable Months Employed at which the relationship between Months Employed and Sales changes.

Modeling Nonlinear Relationships (Slide 11 of 16)

Figure 7.32 provides the scatter chart for the Reynolds data with an indication of the possible location of the knot, which we have denoted x(k).

From this scatter chart, it appears the knot is at approximately 90 months.

Modeling Nonlinear Relationships (Slide 12 of 16)

Piecewise Linear Regression Models (cont.):

Define a dummy variable:

xk = 0 if x1 ≤ x(k); xk = 1 if x1 > x(k)

Then fit the following estimated regression equation:

ŷ = b0 + b1x1 + b2(x1 – x(k))xk

Once we have decided on the location of the knot, we define a dummy variable that is equal to zero for any observation for which the value of Months Employed is less than or equal to the value of the knot, and equal to one for any observation for which the value of Months Employed is greater than the value of the knot.

The interpretation of this model is similar to the interpretation of the quadratic regression model.
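
A short Python sketch of the piecewise terms follows, assuming the model form ŷ = b0 + b1x1 + b2(x1 – x(k))xk and a knot of 90 months read from the scatter chart. The function names and the coefficients in the usage lines are hypothetical; they are not the values in Figure 7.33.

```python
KNOT = 90  # approximate knot in months, read from the scatter chart


def piecewise_terms(months, knot=KNOT):
    """Return (x1, xk, (x1 - knot) * xk) for one observation."""
    x_k = 1 if months > knot else 0   # dummy: 0 at or below the knot, 1 above
    return months, x_k, (months - knot) * x_k


def piecewise_prediction(months, b0, b1, b2, knot=KNOT):
    """Evaluate b0 + b1*x1 + b2*(x1 - knot)*xk."""
    x1, _, hinge = piecewise_terms(months, knot)
    return b0 + b1 * x1 + b2 * hinge


# Hypothetical coefficients, for illustration only.
for months in (60, 90, 100):
    print(months, piecewise_terms(months),
          piecewise_prediction(months, b0=100.0, b1=2.0, b2=-3.0))
```

Below the knot the slope is b1; above it the slope becomes b1 + b2, and the two line segments meet at the knot.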

Modeling Nonlinear Relationships (Slide 13 of 16)

Figure 7.33: Data and Excel Output for the Reynolds Piecewise Linear Regression Model

Modeling Nonlinear Relationships (Slide 14 of 16)

Interaction Between Independent Variables:

Interaction: This occurs when the relationship between the dependent variable and one independent variable is different at various values of a second independent variable.

The estimated multiple linear regression equation is given as:

ŷ = b0 + b1x1 + b2x2 + b3x1x2

In the multiple linear regression model, y is the dependent variable, x1 and x2 are independent variables.

x1x2 depicts interaction between the two independent variables.

When interaction between two variables is present, we cannot study the relationship between one independent variable and the dependent variable y independently of the other variable.

Once we obtain the estimated regression equation, using the p value approach, we can conclude whether the interaction is significant.

The interpretation of other coefficients is the same as the ones discussed in multiple linear regression.

Note that we can combine a quadratic effect with interaction to produce a second-order polynomial model with interaction between the two independent variables.

Modeling Nonlinear Relationships (Slide 15 of 16)

Figure 7.34: Mean Unit Sales (1,000s) as a Function of Selling Price and Advertising Expenditures

Figure 7.34 shows the sample mean sales for the six price and advertising expenditure combinations.

Note that the sample mean sales corresponding to a price of $2.00 and an advertising expenditure of $50,000 is 461,000 units and that the sample mean sales corresponding to a price of $2.00 and an advertising expenditure of $100,000 is 808,000 units.

Hence, with price held constant at $2.00, the difference in mean sales between advertising expenditures of $50,000 and $100,000 is 808,000 – 461,000 = 347,000 units.

When the price of the product is $2.50, the difference in mean sales between advertising expenditures of $50,000 and $100,000 is 646,000 – 364,000 = 282,000 units.

Finally, when the price is $3.00, the difference in mean sales between advertising expenditures of $50,000 and $100,000 is 375,000 – 332,000 = 43,000 units.

Modeling Nonlinear Relationships (Slide 16 of 16)

Figure 7.35: Excel Output for the Tyler Personal Care Linear Regression Model with Interaction

The Excel output corresponding to the interaction model for the Tyler Personal Care example is provided in Figure 7.35.

The resulting estimated regression equation is Sales = –275.8333 + 175 Price + 19.68 Advertising – 6.08 Price × Advertising.
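
To see what the interaction term implies, the sketch below evaluates the estimated equation and the change in predicted Sales per one-unit increase in Advertising at each price. The function names are hypothetical, and the units of Sales and Advertising are whatever units the Excel data use.

```python
def tyler_predicted_sales(price: float, advertising: float) -> float:
    """Evaluate the estimated regression equation with interaction."""
    return (-275.8333 + 175 * price + 19.68 * advertising
            - 6.08 * price * advertising)


def advertising_slope(price: float) -> float:
    """Change in predicted Sales per one-unit increase in Advertising."""
    return 19.68 - 6.08 * price


for price in (2.00, 2.50, 3.00):
    print(price, round(advertising_slope(price), 2))
```

The slope shrinks as price rises, which is the same pattern seen in the sample means: extra advertising is associated with a smaller gain in sales at higher prices.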

Model Fitting

Variable Selection Procedures

Overfitting

Model Fitting (Slide 1 of 10)

Variable Selection Procedures:

Special procedures are sometimes employed to select the independent variables to include in the regression model.

Iterative procedures: At each step of the procedure, a single independent variable is added or removed and the new model is evaluated. Iterative procedures include:

Backward elimination.

Forward selection.

Stepwise selection.

Best subsets procedure: Evaluates regression models involving different subsets of the independent variables.

Model Fitting (Slide 2 of 10)

Variable Selection Procedures (cont.):

Backward elimination procedure:

Begins with the regression model that includes all of the independent variables under consideration.

At each step, backward elimination considers the removal of an independent variable according to some criterion.

Stops when all independent variables in the model are significant at a specified level of significance.
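
The backward elimination loop can be sketched in Python as follows. The slides leave the removal criterion open; this sketch assumes it is the largest p value above a significance level `alpha`. The `p_values_for` callable is a hypothetical stand-in for refitting the regression, and the p values in the usage lines are made up.

```python
def backward_elimination(variables, p_values_for, alpha=0.05):
    """Drop the least significant variable until all p values are <= alpha.

    p_values_for(model_vars) must return {variable: p_value} for a model
    fitted with exactly those variables.
    """
    model = list(variables)
    while model:
        p_values = p_values_for(model)
        worst = max(model, key=lambda v: p_values[v])
        if p_values[worst] <= alpha:
            break  # every remaining variable is significant
        model.remove(worst)
    return model


# Hypothetical stand-in for a regression fit: fixed p values per variable.
FAKE_P = {"miles": 0.001, "deliveries": 0.01, "truck_color": 0.62}


def fake_fit(model_vars):
    return {v: FAKE_P[v] for v in model_vars}


print(backward_elimination(["miles", "deliveries", "truck_color"], fake_fit))
```

In a real analysis the p values change each time a variable is removed, so the regression must be refit inside the loop rather than looked up from a fixed table.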

Model Fitting (Slide 3 of 10)

Variable Selection Procedures (cont.):

Forward selection procedure:

Begins with none of the independent variables under consideration included in the regression model.

At each step, forward selection considers the addition of an independent variable according to some criterion.

Stops when there are no independent variables not currently in the model that meet the criterion for being added to the regression model.

Model Fitting (Slide 4 of 10)

Variable Selection Procedures (cont.):

Stepwise selection procedure:

Begins with none of the independent variables under consideration included in the regression model.

The analyst establishes both a criterion for allowing independent variables to enter the model and a criterion for allowing independent variables to remain in the model.

To initiate the procedure, the most significant independent variable is added to the empty model if its level of significance satisfies the entering threshold.

Model Fitting (Slide 5 of 10)

Variable Selection Procedures (cont.):

Stepwise selection procedure (cont.):

Each subsequent step involves two intermediate steps:

First, the remaining independent variables not in the current model are evaluated, and the most significant one is added to the model if its level of significance satisfies the threshold for entering.

Then the independent variables in the current model are evaluated, and the least significant one is removed if its level of significance fails to satisfy the threshold for remaining.

Stops when no independent variable outside the model satisfies the threshold for entering and every independent variable in the model satisfies the threshold for remaining.

Model Fitting (Slide 6 of 10)

Variable Selection Procedures (cont.):

Best subsets procedure:

Simple linear regressions for each of the independent variables under consideration are generated, and then the multiple regressions with all combinations of two independent variables under consideration are generated, and so on.

Once a regression model has been generated for every possible subset of the independent variables under consideration, the entire collection of regression models can be compared and evaluated.
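
The enumeration step of the best subsets procedure can be sketched with Python's standard library. The variable names are illustrative, and fitting and comparing the models is left out.

```python
from itertools import combinations


def all_subsets(variables):
    """Yield every non-empty subset of variables, smallest subsets first."""
    for size in range(1, len(variables) + 1):
        for subset in combinations(variables, size):
            yield subset


candidates = ["miles", "deliveries", "rush_hour"]
subsets = list(all_subsets(candidates))
for subset in subsets:
    print(subset)
print(len(subsets))  # 2**3 - 1 = 7 candidate models
```

With k candidate variables there are 2^k – 1 non-empty subsets, so the number of regressions grows quickly as variables are added.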

Model Fitting (Slide 7 of 10)

Overfitting:

Overfitting generally results from creating an overly complex model to explain idiosyncrasies in the sample data.

In regression analysis, this often results from the use of complex functional forms or independent variables that do not have meaningful relationships with the dependent variable.

If a model is overfit to the sample data, it will perform better on the sample data used to fit the model than it will on other data from the population.

Thus, an overfit model can be misleading about its predictive capability and its interpretation.

Model Fitting (Slide 8 of 10)

Overfitting (cont.):

How does one avoid overfitting a model?

Use only independent variables that you expect to have real and meaningful relationships with the dependent variable.

Use complex models, such as quadratic models and piecewise linear regression models, only when you have a reasonable expectation that such complexity provides a more accurate depiction of what you are modeling.

Do not let software dictate your model; use iterative modeling procedures, such as the stepwise and best-subsets procedures, only for guidance and not to generate your final model.

Model Fitting (Slide 9 of 10)

Overfitting (cont.):

How does one avoid overfitting a model? (cont.):

If you have access to a sufficient quantity of data, assess your model on data other than the sample data that were used to generate the model (this is referred to as cross-validation).

One possible way to execute cross-validation is the holdout method.
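
A minimal sketch of the holdout method follows, assuming a simple linear regression and a 70/30 split. The data are hypothetical, and the split fraction, seed, and function names are illustrative choices rather than recommendations from the text.

```python
import random


def holdout_split(rows, validation_fraction=0.3, seed=0):
    """Shuffle a copy of rows and split it into (training, validation)."""
    if not 0 < validation_fraction < 1:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, round(len(shuffled) * validation_fraction))
    return shuffled[n_validation:], shuffled[:n_validation]


def fit_simple_regression(rows):
    """Least squares estimates (b0, b1) for rows of (x, y)."""
    n = len(rows)
    x_bar = sum(x for x, _ in rows) / n
    y_bar = sum(y for _, y in rows) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def mean_squared_error(rows, b0, b1):
    """Average squared prediction error of b0 + b1*x over rows."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in rows) / len(rows)


# Hypothetical (x, y) pairs, e.g. miles traveled and travel time in hours.
data = [(50, 4.1), (60, 4.9), (70, 5.4), (80, 6.6), (90, 7.0),
        (100, 8.2), (55, 4.6), (65, 5.1), (75, 6.1), (95, 7.7)]

training, validation = holdout_split(data)
b0, b1 = fit_simple_regression(training)   # fit on the training rows only
print("training MSE:  ", mean_squared_error(training, b0, b1))
print("validation MSE:", mean_squared_error(validation, b0, b1))
```

A validation error much larger than the training error would be a sign that the model is overfit to the sample used to build it.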
