2-MultipleRegression.pdf

Recall: Regression Analysis

• statistical procedure for developing an equation to show how variables are related
  - dependent variable: variable being predicted
  - independent variable: variable used to predict the dependent variable
• equation provides a model for predicting the value of the dependent variable based on the values of the independent variable(s)
• analysis helps to identify the type of relationship (e.g., linear, exponential) between variables

Recall: Simple Linear Regression

involves:

• one dependent variable: variable being predicted
• one independent variable: variable used to predict the dependent variable
• relationship is approximated by a straight line

Multiple Regression

involves:

• one dependent variable: variable being predicted
• more than one independent variable: variables used to predict the dependent variable
• makes it possible to consider more factors and obtain better estimates for the dependent variable
• objective is to explain the variation in the dependent variable by using the independent variables as potential explanatory variables (i.e., predictors)

Multiple Regression Model

Recall: Simple Linear Regression Model:

$$ y = \beta_0 + \beta_1 x + \varepsilon $$

• β0 and β1 are the parameters of the model
  - β1 is the slope
  - β0 is the y-intercept
• ε is the error term
  - random variable accounting for the variability in y that cannot be explained by the linear relationship between x and y

Multiple Regression Model:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon $$

• β0, β1, β2, …, βp are the parameters of the model
• y is a linear function of x1, x2, …, xp
• ε is the error term
  - random variable accounting for the variability in y that cannot be explained by the linear relationship between x1, x2, …, xp and y

Multiple Regression Equation

Multiple Regression Equation:

$$ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p $$

Estimated Multiple Regression Equation:

$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p $$

• ŷ is the point estimator of E(y), the mean value of y for given values of x1, x2, …, xp
  - (there may be more than one y value for a given set of x values)
• b0, b1, b2, …, bp are estimates of β0, β1, β2, …, βp

Least Squares Method

• procedure for using sample data to find the estimated regression equation
• based on determining the equation for which the sum of the squared differences between the observed and the estimated values of the dependent variable is smaller than for any other equation
• Least Squares Criterion:

$$ \min \sum_i \left( y_i - \hat{y}_i \right)^2 $$

  - y_i: observed value of the dependent variable for the i-th observation
  - ŷ_i: estimated value of the dependent variable for the i-th observation
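The least squares criterion can be illustrated with a small sketch in plain Python. It solves the normal equations for a model with an intercept and two predictors. The data are a made-up toy set (not from the OmniPower file), the function names are illustrative, and this is not a description of how Excel computes its output.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Pick the row with the largest entry in this column as the pivot row.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(xs, ys):
    """Return [b0, b1, ..., bp] that minimise the sum of squared residuals."""
    rows = [[1.0] + list(x) for x in xs]  # leading 1 for the intercept
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict(coeffs, x):
    return coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], x))


def sum_squared_errors(coeffs, xs, ys):
    return sum((y - predict(coeffs, x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data: two predictors, five observations.
toy_xs = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
toy_ys = [9.2, 7.9, 19.1, 17.8, 26.0]

b = fit_least_squares(toy_xs, toy_ys)
print("coefficients:", [round(v, 4) for v in b])
print("SSE at the fitted coefficients:", round(sum_squared_errors(b, toy_xs, toy_ys), 6))
```

Changing any fitted coefficient by a small amount and recomputing `sum_squared_errors` gives a larger value, which is what the criterion states.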

Excel’s Regression Tool

• Analysis ↳ Data Analysis ↳ Regression
• enter multiple columns for the X range
• provides:
  - Regression Statistics
  - ANOVA

Example

A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. All the stores selected have approximately the same monthly sales volume. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Data from the 34 stores are shown in the MS Excel file OmniPower.xlsx; the first 15 stores are listed here (Sales in bars, Price in cents, Promotion in dollars).

Store  Sales  Price  Promotion
1      4141   59     200
2      3842   59     200
3      3056   59     200
4      3519   59     200
5      4226   59     400
6      4630   59     400
7      3507   59     400
8      3754   59     400
9      5000   59     600
10     5120   59     600
11     4011   59     600
12     5015   59     600
13     1916   79     200
14     675    79     200
15     3636   79     200

SUMMARY OUTPUT

Regression Statistics
Multiple R          0.870474549
R Square            0.757725941
Adjusted R Square   0.742095357
Standard Error      638.0652881
Observations        34

ANOVA
            df   SS            MS         F          Significance F
Regression  2    39472730.77   19736365   48.47713   2.86258E-10
Residual    31   12620946.67   407127.3
Total       33   52093677.44

            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept   5837.520759    628.150225       9.293192   1.79E-10   4556.399929    7118.641589
Price       -53.21733631   6.852220559      -7.76644   9.2E-09    -67.19253228   -39.24214034
Promotion   3.613058036    0.685222056      5.272828   9.82E-06   2.215538439    5.010577633

From Regression analysis:

Equation:

$$ \hat{y}_i = 5837.5208 - 53.2173\,x_{1i} + 3.6131\,x_{2i} $$

Model Assumptions

Recap:

• Multiple regression model:

$$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon $$

• use the least squares method to develop values for b0, b1, b2, …, bp as estimates for β0, β1, β2, …, βp

Assumptions About ε

1. Zero Mean: The error, ε, is a random variable with E(ε) = 0

2. Homoskedasticity: The variance of ε is σ² and is the same for all values of the independent variables x1, x2, …, xp

3. Independence: Values of ε are independent

4. Normality: The error, ε, is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by the multiple regression equation

Residual Analysis

• in order to check whether the assumptions about ε hold, need to analyze the residuals
• residuals:

$$ y_i - \hat{y}_i $$

• use residual plots to check assumptions:
  - residual plot against ŷ
  - standardized residual plot against ŷ
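As a rough illustration of what a residual is, the sketch below computes y_i − ŷ_i for four of the OmniPower stores listed in the data table, using the rounded coefficients of the estimated equation. The variable and function names are illustrative; a full residual plot would need all 34 stores from OmniPower.xlsx.

```python
# Rounded coefficients of the estimated OmniPower equation.
B0, B_PRICE, B_PROMO = 5837.5208, -53.2173, 3.6131

# (sales, price in cents, promotion in dollars) for stores 1, 5, 9 and 13.
stores = [(4141, 59, 200), (4226, 59, 400), (5000, 59, 600), (1916, 79, 200)]


def predicted_sales(price, promotion):
    """Estimated sales from the fitted equation."""
    return B0 + B_PRICE * price + B_PROMO * promotion


# Residual = observed value minus estimated value.
residuals = [sales - predicted_sales(price, promo) for sales, price, promo in stores]

for (sales, price, promo), e in zip(stores, residuals):
    print(f"price={price} promotion={promo} observed={sales} residual={e:.2f}")
```

Plotting these residuals against the corresponding predicted values is the first of the two plots listed above.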

2. The research department at Salsberry Realty has been asked to develop some guidelines regarding heating costs for single family homes. Three variables are thought to relate to the heating costs: the mean daily outside temperature, the number of inches of attic insulation, and the age of the furnace. A random sample of 20 homes was selected. Using the measurements from these homes as found in Salsberry.xlsx, find the multiple regression equation.

            Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%
Intercept   427.193803     59.60142931      7.167509   2.24E-06   300.8444175   553.543189
Temp        -4.5826626     0.772319353      -5.93364   2.1E-05    -6.21990652   -2.9454187
Insul       -14.830863     4.754412281      -3.11939   0.006606   -24.9097665   -4.7519589
Age         6.10103206     4.012120166      1.52065    0.147862   -2.40428274   14.6063469

$$ \hat{y}_i = 427 - 4.58\,x_{1i} - 14.80\,x_{2i} + 6.10\,x_{3i} $$

The insulation coefficient, −14.8309 in the table, appears as −14.80 in the equation used throughout this problem.

Equation


Multiple Regression Equation

Estimated Multiple Regression Equation:

$$ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p $$

• b0 is the y-intercept
  - generally not meaningful in multiple regression
• b1, b2, …, bp are slopes corresponding to the independent variables
  - each slope indicates how much the value of the dependent variable changes with an increase of 1 unit in that independent variable, holding the other independent variables constant
• use b0, b1, b2, …, bp to find estimates of y

Example

A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. All the stores selected have approximately the same monthly sales volume. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Data from the 34 stores are shown in the MS Excel file OmniPower.xlsx. Based on the data from the 34 stores, the multiple regression equation for the sales of OmniPower is found to be:

$$ \hat{y}_i = 5837.5208 - 53.2173\,x_{1i} + 3.6131\,x_{2i} $$

where x1i is the price of the bar (in cents) for store i and x2i is the promotional expenditure for store i.

a) What is the interpretation of the slopes, b1 and b2?
b) What are the expected sales for a store charging $.79 per bar during a month in which promotional expenditures are $400?

a) b1 = −53.2173. For a given amount of promotional expenditure, every increase of $.01 in the price is expected to decrease mean sales by 53.2173 bars.

b2 = 3.6131. For a given price, every increase of $1 in promotional expenditures is expected to increase mean sales by 3.6131 bars.

b)

$$ \hat{y}_i = 5837.5208 - 53.2173\,x_{1i} + 3.6131\,x_{2i} = 5837.5208 - 53.2173(79) + 3.6131(400) \approx 3078.57 $$

Therefore, sales are expected to be 3078.57 bars.

Use cell references to the coefficients in the Regression table when calculating in MS Excel: the unrounded coefficients give 3078.57, while the four-decimal coefficients shown here give 3078.59.
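The effect of rounding the coefficients can be checked with a short sketch. The function name is illustrative; the coefficient values are copied from the estimated equation (four decimals) and from the Coefficients column of the regression output (full precision).

```python
def expected_sales(b0, b_price, b_promo, price_cents, promotion_dollars):
    """Evaluate the estimated regression equation for one store."""
    return b0 + b_price * price_cents + b_promo * promotion_dollars


# Price of $.79 is entered as 79 cents; promotion is $400.
with_rounded = expected_sales(5837.5208, -53.2173, 3.6131, 79, 400)
with_unrounded = expected_sales(5837.520759, -53.21733631, 3.613058036, 79, 400)

print(f"rounded coefficients:   {with_rounded:.2f}")
print(f"unrounded coefficients: {with_unrounded:.2f}")
```

The two results differ only in the second decimal place, which is why referencing the coefficient cells directly reproduces the 3078.57 figure.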

1. The research department at Salsberry Realty has been asked to develop some guidelines regarding heating costs for single family homes. Three variables are thought to relate to the heating costs: the mean daily outside temperature, the number of inches of attic insulation, and the age of the furnace. A random sample of 20 homes was selected. Based on measurements from these homes, the multiple regression equation for the monthly heating cost was found to be:

$$ \hat{y}_i = 427 - 4.58\,x_{1i} - 14.80\,x_{2i} + 6.10\,x_{3i} $$

where x1i is the mean daily outside temperature, x2i is the number of inches of attic insulation, and x3i is the age of the furnace.

a) What is the interpretation of the slopes, b1, b2, and b3?
b) What is the expected heating cost if the mean outside temperature is 30°F, there are 5 inches of insulation, and the furnace is 10 years old?

a) b1 = −4.58. For every increase of 1°F in the mean outside temperature, the heating cost is expected to decrease by $4.58 per month, holding insulation and furnace age constant.

b2 = −14.80. For every increase of 1 inch of attic insulation, the heating cost is expected to decrease by $14.80 per month, holding temperature and furnace age constant.

b3 = 6.10. For every increase of 1 year in the age of the furnace, the heating cost is expected to increase by $6.10 per month, holding temperature and insulation constant.

b)

$$ \hat{y}_i = 427 - 4.58\,x_{1i} - 14.80\,x_{2i} + 6.10\,x_{3i} = 427 - 4.58(30) - 14.80(5) + 6.10(10) = 276.6 $$

Therefore, the expected heating cost for the month is $276.60.

Multiple Coefficient of Determination

How well does the estimated regression equation fit the data?

• Multiple coefficient of determination, r², provides a measure of goodness of fit
• requires calculation of:
  - SSE: sum of squares due to error
  - SSR: sum of squares due to regression
  - SST: total sum of squares (SST = SSR + SSE)
• Proportion of variability in the dependent variable that can be explained by the multiple regression model:

$$ r^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} $$
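As a quick check of the definition, the sketch below recomputes r² from the sums of squares reported in the ANOVA table of the OmniPower regression output. The variable names are illustrative.

```python
# Sums of squares from the ANOVA table (OmniPower example).
ssr = 39472730.77  # regression
sse = 12620946.67  # residual (error)

sst = ssr + sse          # total sum of squares
r_squared = ssr / sst    # proportion of variability explained

print(f"SST = {sst:.2f}")
print(f"r^2 = {r_squared:.4f}")
```

The result agrees with the R Square value reported under Regression Statistics.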

Example

A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Use the multiple coefficient of determination to determine the fit of the multiple regression equation.

From Regression analysis:

Regression Statistics
Multiple R          0.870474549
R Square            0.757725941
Adjusted R Square   0.742095357
Standard Error      638.0652881
Observations        34

Multiple coefficient of determination:

$$ r^2 = .7577 $$

Therefore, 75.77% of the variation in sales is explained by the variation in the price and in the promotional expenditures; the remaining 24.23% is due to factors not included in the model.

Multiple Coefficient of Determination

How can the multiple coefficient of determination be improved (i.e., made to give a higher value)?

• it is almost always possible to raise the coefficient of determination R² by including additional independent variables
• if adding additional variables, be careful to avoid multicollinearity of variables

Multicollinearity

• Independent variables may not be independent of each other, i.e., they may be correlated or collinear
  - e.g., in building a multiple regression model for the price of houses, one independent variable might be the house size in ft². Adding another variable such as the number of bedrooms may not be appropriate because of the close relationship between house size and number of bedrooms.
• Most of the time, there will be some correlation between independent variables
• avoid adding new variables that add no significant information to the regression model
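One simple way to spot the kind of overlap described above is to compute the sample correlation between two candidate independent variables before adding both to the model. The sketch below does this for made-up house data; the numbers and names are hypothetical and only illustrate the house-size and bedrooms example.

```python
from math import sqrt


def pearson_correlation(xs, ys):
    """Sample correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical houses: size in square feet and number of bedrooms.
house_size_sqft = [1000, 1500, 2000, 2500, 3000]
bedrooms = [2, 3, 3, 4, 5]

r = pearson_correlation(house_size_sqft, bedrooms)
print(f"correlation between size and bedrooms: {r:.3f}")
```

A correlation close to 1 or −1 suggests the second variable carries little information that the first does not already provide.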

Multiple Coefficient of Determination

How well does the estimated regression equation fit the data?

• it is almost always possible to raise the coefficient of determination R² by including additional independent variables
• to avoid overfitting the model, an adjustment can be made to the R² statistic to penalize the inclusion of useless predictors
  ➢ Adjusted multiple coefficient of determination

Multiple Coefficient of Determination

How well does the estimated regression equation fit the data?

• Multiple coefficient of determination:

$$ r^2 = \frac{\mathrm{SSR}}{\mathrm{SST}} $$

• Adjusted multiple coefficient of determination:

$$ r^2_{adj} = 1 - \left( 1 - r^2 \right) \frac{n - 1}{n - k - 1} $$

  - n is the sample size and k is the number of independent variables in the regression equation
  - adjusts according to the number of independent variables and the size of the sample
  - used by some statisticians to avoid overestimating the impact of adding an independent variable to the model

Example

A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Use the adjusted multiple coefficient of determination to determine the fit of the multiple regression equation.

From Regression analysis:

Regression Statistics
Multiple R          0.870474549
R Square            0.757725941
Adjusted R Square   0.742095357
Standard Error      638.0652881
Observations        34

Multiple Coefficient of Determination:

$$ r^2 = .7577 $$

75.77% of the variation in sales is explained by the variation in the price and in the promotional expenditures.

Adjusted Multiple Coefficient of Determination:

$$ r^2_{adj} = .7421 $$

74.21% of the variation in sales is explained by the multiple regression model, adjusted for the number of independent variables (k = 2) and sample size (n = 34).
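The adjusted value can be reproduced from r², n, and k with the adjustment formula r²adj = 1 − (1 − r²)(n − 1)/(n − k − 1). The sketch below is illustrative only; the function name is an assumption, and the inputs are the R Square and Observations values from the OmniPower regression statistics.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjust r^2 for sample size n and number of independent variables k."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# OmniPower: r^2 = 0.757725941, n = 34 stores, k = 2 independent variables.
omnipower_adj = adjusted_r_squared(0.757725941, n=34, k=2)
print(f"adjusted r^2 = {omnipower_adj:.4f}")
```

Because (n − 1)/(n − k − 1) is greater than 1 whenever k ≥ 1, the adjusted value is always below r², and the gap widens as more independent variables are added for a fixed sample size.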

3. The research department at Salsberry Realty has been asked to develop some guidelines regarding heating costs for single family homes. Three variables are thought to relate to the heating costs: the mean daily outside temperature, the number of inches of attic insulation, and the age of the furnace. A random sample of 20 homes was selected. Find and interpret both the multiple coefficient of determination and the adjusted multiple coefficient of determination.

Regression Statistics
Multiple R          0.8967553
R Square            0.80417007
Adjusted R Square   0.76745195
Standard Error      51.0485536
Observations        20

Multiple Coefficient of Determination:

$$ r^2 = .8042 $$

80.42% of the variation in heating cost is explained by the variations in mean temperature, inches of insulation, and age of furnace.

Adjusted Multiple Coefficient of Determination:

$$ r^2_{adj} = .7675 $$

76.75% of the variation in heating cost is explained by the multiple regression model, adjusted for the number of independent variables (k = 3) and sample size (n = 20).

Significance

Testing for Significance

Recall:

• Simple Linear Regression Equation:

$$ E(y) = \beta_0 + \beta_1 x $$

• if β1 = 0, then the mean value of y does not depend on the value of x
  - i.e., x and y are not related
• therefore, to test for a significant regression relationship, test

  H0: β1 = 0
  Ha: β1 ≠ 0

• two possible tests:
  1. t test: using the t distribution to test the hypothesis
  2. F test: using the F distribution to test the hypothesis

Testing for Significance

Multiple Regression Equation:

$$ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p $$

• two possible tests:
  1. F test: test for overall significance
     - used to determine whether a significant relationship exists between the dependent variable and the set of ALL independent variables

F Test for Significance in Multiple Regression

• test

  H0: β1 = β2 = … = βp = 0
  Ha: one or more of the parameters is not equal to 0

• if H0 is rejected, then conclude that one or more of the parameters ≠ 0 and that the overall relationship between y and the set of independent variables is significant
• if H0 is not rejected, then there is insufficient evidence to conclude that a significant relationship is present

Hypothesis Testing

Research Question → Data Collection → Statistical Analysis → Answer Question

Sample Data Collection → Statistical Analysis → Statistical Inference

Regression Analysis:

Step 1: state the hypotheses, H0: β1 = β2 = … = βp = 0; Ha: one or more βi ≠ 0

Step 2: collect sample data to determine the values of the bi’s

Step 3: using the F test, determine the probability of obtaining sample results at least this extreme if all βi’s = 0

Step 4: if the probability is too small, reject H0 and conclude that Ha is true

Testing for Significance

F Test for Significance in Multiple Regression

• test

  H0: β1 = β2 = … = βp = 0
  Ha: one or more of the parameters is not equal to 0

• test statistic:

$$ F = \frac{\mathrm{MSR}}{\mathrm{MSE}} $$

• rejection rule (with Fα based on the F distribution with p degrees of freedom in the numerator and n − p − 1 degrees of freedom in the denominator):
  - p-value approach: Reject H0 if p-value ≤ α
  - critical value approach: Reject H0 if F ≥ Fα
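The test statistic can be recomputed from the sums of squares, using MSR = SSR/p and MSE = SSE/(n − p − 1), which match the degrees of freedom stated in the rejection rule. The sketch below uses the ANOVA values from the OmniPower regression output; the function name is illustrative.

```python
def f_statistic(ssr, sse, n, p):
    """F = MSR / MSE for a regression with p independent variables and n observations."""
    msr = ssr / p              # mean square due to regression
    mse = sse / (n - p - 1)    # mean square due to error
    return msr / mse


# OmniPower ANOVA values: SSR, SSE, n = 34 stores, p = 2 independent variables.
f_value = f_statistic(39472730.77, 12620946.67, n=34, p=2)
print(f"F = {f_value:.5f}")
```

The result agrees with the F column of the ANOVA table.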

Example A sample of 34 stores in a supermarket chain are selected for a test- market study of OmniPower, a new high-energy bar. Two independent variables are considered here—the price of the bar and the monthly budget for in-store promotional expenditures. Test for overall significance using a .05 level of significance and interpret the result.


H0: βprice = βpromotion = 0; Ha: at least one of the slopes ≠ 0

From Regression analysis:

Significance of F is 2.86258 × 10⁻¹⁰ < α = .05

Therefore, reject the null hypothesis that the slopes are all equal to 0.

We conclude that at least one slope corresponding to the independent variables (i.e., price and promotional expenditures) is not equal to 0: at least one of the independent variables is significantly related to volume of sales.


ANOVA

             df   SS            MS         F          Significance F
Regression    2   39472730.77   19736365   48.47713   2.86258E-10
Residual     31   12620946.67   407127.3
Total        33   52093677.44
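The F statistic and its significance in the ANOVA table can be reproduced from the sums of squares. The following is a minimal Python sketch, not part of the original slides; the variable names are illustrative. The p-value line relies on a closed form for the upper tail of the F distribution that holds only when the numerator has 2 degrees of freedom, as it does here with two independent variables. For other cases a statistical table or library would be needed.

```python
# Values copied from the ANOVA table for the OmniPower example.
n, p = 34, 2                          # stores sampled, independent variables
ssr, sse = 39472730.77, 12620946.67   # regression and residual sums of squares

df_regression = p                     # numerator degrees of freedom
df_residual = n - p - 1               # denominator degrees of freedom

msr = ssr / df_regression             # mean square due to regression
mse = sse / df_residual               # mean square error
f_stat = msr / mse

# Upper-tail area of the F distribution.
# ASSUMPTION: this closed form is valid ONLY when df_regression == 2.
p_value = (1 + 2 * f_stat / df_residual) ** (-df_residual / 2)

alpha = 0.05
reject_h0 = p_value <= alpha          # p-value approach

print(f"MSR = {msr:.1f}, MSE = {mse:.1f}")
print(f"F = {f_stat:.5f}")
print(f"p-value = {p_value:.3e}")
print("Reject H0" if reject_h0 else "Do not reject H0")
```

Run as written, the computed F and p-value should agree with the F and Significance F columns of the table up to rounding.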

Testing for Significance Multiple Regression Equation:

 two possible tests:

1. F test: test for overall significance

 used to determine whether a significant relationship exists between the dependent variable and the set of ALL independent variables

2. t test: test for individual significance

 used to determine whether each of the individual independent variables is significant

Testing for Significance t Test for Significance in Multiple Regression

 For any βi, test H0: βi = 0

Ha: βi ≠ 0

 if H0 is rejected, then conclude that βi ≠ 0 and that a statistically significant relationship exists between the two variables, xi and y

Hypothesis Testing

Research Question → Data Collection → Statistical Analysis → Answer Question

Sample Data Collection → Statistical Analysis → Statistical Inference

Chapter 14: Regression Analysis

Step 1: State the hypotheses: H0: βi = 0; Ha: βi ≠ 0

Step 2: Collect sample data from the population to determine bi

Step 3: Using the t test, determine the probability (p-value) of obtaining sample results at least this extreme if βi = 0

Step 4: If the probability is too small, then reject H0 and conclude that Ha holds.

Testing for Significance t Test for Significance in Multiple Regression

 test H0: βi = 0

Ha: βi ≠ 0

 test statistic:

$$t_{STAT} = \frac{b_i}{s_{b_i}}$$

 rejection rule (with tα/2 based on a t distribution with n – p – 1 degrees of freedom):

 p-value approach: Reject H0 if p-value ≤ α

 critical value approach: Reject H0 if tSTAT ≤ -tα/2 or tSTAT ≥ tα/2

Example A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Test each independent variable for significance at the 0.05 level and interpret the results.

H0: βprice = 0; Ha: βprice ≠ 0

H0: βpromotion = 0; Ha: βpromotion ≠ 0


From Regression analysis:


            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%   Lower 95.0%   Upper 95.0%
Intercept   5837.520759    628.150225       9.293192   1.79E-10   4556.399929    7118.6416   4556.39993    7118.641589
Price       -53.21733631   6.852220559      -7.76644   9.2E-09    -67.19253228   -39.24214   -67.1925323   -39.24214034
Promotion   3.613058036    0.685222056      5.272828   9.82E-06   2.215538439    5.0105776   2.21553844    5.010577633

Price: P-value = 9.2 × 10⁻⁹ < α = .05

Therefore, reject the null hypothesis. We conclude that slope ≠ 0: price of bar is significantly related to volume of sales.

Promotion: P-value = 9.82 × 10⁻⁶ < α = .05

Therefore, reject the null hypothesis. We conclude that slope ≠ 0: amount spent on promotion is significantly related to volume of sales.
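The same conclusions follow from the critical value approach. The sketch below is illustrative and not part of the original slides: it recomputes each t statistic as the coefficient divided by its standard error and compares it with tα/2. The critical value 2.0395 (t distribution, 31 degrees of freedom, α/2 = .025) is an assumption taken from a standard t table; the slides do not print it.

```python
# Coefficients and standard errors copied from the regression output.
n, p = 34, 2
df = n - p - 1                 # degrees of freedom for the t distribution

# ASSUMPTION: t(.025) with 31 degrees of freedom, from a standard t table.
T_CRITICAL = 2.0395

estimates = {
    "Price": (-53.21733631, 6.852220559),
    "Promotion": (3.613058036, 0.685222056),
}

results = {}
for name, (b_i, s_bi) in estimates.items():
    t_stat = b_i / s_bi        # test statistic: coefficient / standard error
    # Two-tailed rejection rule: reject H0 if t <= -t_crit or t >= t_crit.
    reject_h0 = t_stat <= -T_CRITICAL or t_stat >= T_CRITICAL
    results[name] = (t_stat, reject_h0)
    decision = "reject H0" if reject_h0 else "do not reject H0"
    print(f"{name}: t = {t_stat:.5f}, df = {df}, {decision}")
```

The recomputed t statistics should match the t Stat column of the output up to rounding, and both fall well outside ±2.0395.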

Testing for Significance Multicollinearity

 Independent variables used to predict dependent variable

 However, independent variables may not be independent of each other—i.e., may be correlated or collinear

 May create problems in analysis:

 E.g., overall significance, but no individual significance

 Try to avoid correlation among independent variables

 Can be tested for (but beyond scope here)

Multicollinearity  Independent variables may not be independent of each other, i.e., they may be correlated or collinear

 e.g., in building a multiple regression model for the price of houses, one independent variable might be the house size in ft². Adding another variable such as the number of bedrooms may not be appropriate because of the close relationship between house size and number of bedrooms.

 Most of the time, there will be some correlation between independent variables

 avoid adding new variables that add no significant information to the regression model
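One informal way to see whether a candidate variable overlaps with one already in the model is to look at the sample correlation between the two. The sketch below is only an illustration: the six houses are hypothetical numbers made up for this example (they do not come from the slides), and the helper name `pearson_r` is likewise an assumption. It is not a formal test for multicollinearity.

```python
import math


def pearson_r(xs, ys):
    """Sample correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# HYPOTHETICAL illustration data (not from the slides): six houses.
house_size_ft2 = [1100, 1400, 1650, 1900, 2300, 2800]
bedrooms = [2, 3, 3, 4, 4, 5]

r = pearson_r(house_size_ft2, bedrooms)
print(f"correlation between house size and bedrooms: r = {r:.3f}")
```

A correlation close to 1 or -1 between two independent variables suggests that the second one adds little new information to the model.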

Multicollinearity Potential Problems:

 model may be overall significant, but individual t values for coefficients may not be significant

 algebraic sign of estimated regression coefficients may be the opposite of what would be expected

 i.e., expect sign to be positive, but turns out to be negative

Multicollinearity Example:

In a model for estimating crude oil production, two possible independent variables could be fuel rate and coal production

 model using fuel rate: $\hat{y}_{oil} = 44.689 + 0.7838\,x_{fuelrate}$

 model using coal production: $\hat{y}_{oil} = 45.072 + 0.0157\,x_{coal}$

 model using both: $\hat{y}_{oil} = 45.806 + 0.0227\,x_{coal} - 0.3934\,x_{fuelrate}$

 note that the coefficient for fuel rate has a different sign in the two models that include it

 also, the combined model is significant overall, but neither coefficient tests significant, unlike when each variable is tested individually

Multicollinearity  avoid adding new variables that add no significant information to the regression model

 avoid adding new independent variable that seems to be closely correlated to another independent variable

 be cautious in interpretation

 employ methods to detect multicollinearity

 beyond scope of this course

Example A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Another independent variable, whether or not customers received a coupon for a discount, is being considered. What would you advise?

 The use of a coupon would reduce the price of the bar, so the coupon variable would probably not be independent of the price variable.

 Could include the discount from the coupon as part of the promotional expenditure.

4. The research department at Salsberry Realty has been asked to develop some guidelines regarding heating costs for single-family homes. Three variables are thought to relate to the heating costs: the mean daily outside temperature, the number of inches of attic insulation, and the age of the furnace. A random sample of 20 homes was selected. Determine if there is overall significance for the multiple regression model and determine the significance of the independent variables. Interpret the results.


From Regression Analysis:

Significance of F is .000007 < α = .05.

Therefore, reject null hypothesis.

We conclude that at least one of the independent variables (i.e., mean daily outside temperature, the number of inches of attic insulation, the age of the furnace) is related to heating costs.

ANOVA

             df   SS            MS         F          Significance F
Regression    3   171220.4728   57073.49   21.90118   6.56178E-06
Residual     16   41695.27717   2605.955
Total        19   212915.75

From Regression Analysis:

Mean daily outside temperature (Temp): p-value is .000021 < α = .05.

Therefore, reject null hypothesis and conclude that there is a significant relationship between the mean outside temperature and heating costs.

Inches of attic insulation (Insul): p-value is .007 < α = .05.

Therefore, reject null hypothesis and conclude that there is a significant relationship between the inches of attic insulation and heating costs.

Age of furnace (Age): p-value is .148 > α = .05.

Therefore, do not reject null hypothesis. There does not appear to be a significant relationship between the age of the furnace and heating costs.


            Coefficients   Standard Error   t Stat     P-value    Lower 95%     Upper 95%    Lower 95.0%   Upper 95.0%
Intercept   427.193803     59.60142931      7.167509   2.24E-06   300.8444175   553.543189   300.844417    553.543189
Temp        -4.5826626     0.772319353      -5.93364   2.1E-05    -6.21990652   -2.9454187   -6.2199065    -2.9454187
Insul       -14.830863     4.754412281      -3.11939   0.006606   -24.9097665   -4.7519589   -24.909766    -4.7519589
Age         6.10103206     4.012120166      1.52065    0.147862   -2.40428274   14.6063469   -2.4042827    14.6063469
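As a cross-check on the heating cost example, the p-value approach and the critical value approach should lead to the same decision for each variable. The sketch below is illustrative and not part of the original slides: it recomputes F from the ANOVA sums of squares and each t statistic from the coefficient table, then applies both rules. The critical value 2.120 (t distribution, 16 degrees of freedom, α/2 = .025) is an assumption taken from a standard t table.

```python
ALPHA = 0.05
n, p = 20, 3
df_residual = n - p - 1            # 16 degrees of freedom

# Overall F test, from the ANOVA table: F = (SSR / p) / (SSE / (n - p - 1)).
ssr, sse = 171220.4728, 41695.27717
f_stat = (ssr / p) / (sse / df_residual)
print(f"F = {f_stat:.5f}")

# ASSUMPTION: t(.025) with 16 degrees of freedom, from a standard t table.
T_CRITICAL = 2.120

# (coefficient, standard error, reported p-value) from the regression output.
rows = {
    "Temp": (-4.5826626, 0.772319353, 2.1e-05),
    "Insul": (-14.830863, 4.754412281, 0.006606),
    "Age": (6.10103206, 4.012120166, 0.147862),
}

decisions = {}
for name, (b_i, s_bi, p_value) in rows.items():
    t_stat = b_i / s_bi
    by_p_value = p_value <= ALPHA                  # p-value approach
    by_critical = abs(t_stat) >= T_CRITICAL        # critical value approach
    decisions[name] = (t_stat, by_p_value, by_critical)
    verdict = "significant" if by_p_value else "not significant"
    print(f"{name}: t = {t_stat:.5f}, p-value = {p_value}, {verdict}")
```

Under these assumptions both rules reject H0 for Temp and Insul and fail to reject it for Age, matching the interpretation given for this problem.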