Assignment for "THE GRADE"
Recall: Regression Analysis
statistical procedure for developing an equation to show how variables are related
dependent variable—variable being predicted
independent variable—variable used to predict dependent variable
equation provides a model for predicting the value of the dependent variable based on the values of the independent variable(s)
analysis helps to identify type of relationship (e.g., linear, exponential) between variables
Recall: Simple Linear Regression involves:
one dependent variable—variable being predicted
one independent variable—variable used to predict dependent variable
relationship is approximated by a straight line
Multiple Regression involves:
one dependent variable—variable being predicted
more than one independent variable—variables used to predict dependent variable
makes it possible to consider more factors and obtain better estimates for dependent variable
objective is to explain the variation in the dependent variable by using the independent variables as potential explanatory variables (i.e., predictors)
Multiple Regression Model
Recall: Simple Linear Regression Model:
\[ y = \beta_0 + \beta_1 x + \varepsilon \]
β0 and β1 are the parameters of the model
β1 is the slope
β0 is the y-intercept
ε is the error term
random variable accounting for the variability in y that cannot be explained by the linear relationship between x and y
Multiple Regression Model:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \]
β0, β1, β2, …, βp are the parameters of the model
y is a linear function of x1, x2, …, xp
ε is the error term
random variable accounting for the variability in y that cannot be explained by the linear relationship between x and y
Multiple Regression Equation:
\[ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \]
Estimated Multiple Regression Equation:
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p \]
ŷ is the point estimator of E(y), the mean value of y for a given value of x
(there may be more than one y value for a given x)
b0, b1, b2, …, bp are estimates of β0, β1, β2, …, βp
Least Squares Method
procedure for using sample data to find the estimated regression equation
based on finding the equation for which the sum of the squared distances between the observed and estimated values of the dependent variable is smaller than for any other equation
Least Squares Criterion:
\[ \min \sum_i (y_i - \hat{y}_i)^2 \]
y_i: observed value of the dependent variable for the i-th observation
ŷ_i: estimated value of the dependent variable for the i-th observation
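The least squares criterion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data set is hypothetical (generated from y = 1 + 2·x1 + 3·x2 with no noise), the helper names `least_squares` and `sse` are made up for this sketch, and the solver uses the normal equations with plain Gauss-Jordan elimination rather than whatever method Excel uses internally.

```python
def least_squares(X, y):
    """Return [b0, b1, ..., bp] minimizing sum((y_i - yhat_i)**2)."""
    # Design matrix: a leading 1 in each row carries the intercept b0.
    A = [[1.0] + [float(v) for v in row] for row in X]
    k = len(A[0])
    # Normal equations (A^T A) b = A^T y, stored as an augmented k x (k+1) matrix.
    M = [
        [sum(r[i] * r[j] for r in A) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(A, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda row: abs(M[row][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(k):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [a - factor * b for a, b in zip(M[r], M[c])]
    return [M[i][k] / M[i][i] for i in range(k)]


def sse(X, y, b):
    """Sum of squared differences between observed and estimated y."""
    return sum(
        (yi - (b[0] + sum(bj * xj for bj, xj in zip(b[1:], row)))) ** 2
        for row, yi in zip(X, y)
    )


# Hypothetical data generated from y = 1 + 2*x1 + 3*x2 (no noise).
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 5)]
y = [9, 8, 19, 18, 26]

b = least_squares(X, y)
print([round(v, 6) for v in b])   # close to [1, 2, 3]
print(sse(X, y, b))               # close to 0
```

Any other choice of coefficients gives a larger sum of squared errors on the same data, which is what the criterion requires.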
Excel’s Regression Tool
Analysis
↳ Data Analysis
↳ Regression
Enter multiple columns for the X range
Excel’s Regression Tool provides:
Regression Statistics
ANOVA

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.870474549
R Square             0.757725941
Adjusted R Square    0.742095357
Standard Error       638.0652881
Observations         34

ANOVA
             df   SS            MS         F          Significance F
Regression    2   39472730.77   19736365   48.47713   2.86258E-10
Residual     31   12620946.67   407127.3
Total        33   52093677.44

             Coefficients    Standard Error   t Stat     P-value    Lower 95%       Upper 95%    Lower 95.0%    Upper 95.0%
Intercept    5837.520759     628.150225       9.293192   1.79E-10   4556.399929     7118.6416    4556.39993     7118.641589
Price        -53.21733631    6.852220559      -7.76644   9.2E-09    -67.19253228    -39.24214    -67.1925323    -39.24214034
Promotion    3.613058036     0.685222056      5.272828   9.82E-06   2.215538439     5.0105776    2.21553844     5.010577633
Example
A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. All the stores selected have approximately the same monthly sales volume. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Data from the 34 stores is shown in the MS Excel file OmniPower.xlsx.

Store   Sales (bars)   Price (cents)   Promotion ($)
1       4141           59              200
2       3842           59              200
3       3056           59              200
4       3519           59              200
5       4226           59              400
6       4630           59              400
7       3507           59              400
8       3754           59              400
9       5000           59              600
10      5120           59              600
11      4011           59              600
12      5015           59              600
13      1916           79              200
14      675            79              200
15      3636           79              200
From Regression analysis:
Equation:
\[ \hat{y}_i = 5837.5208 - 53.2173 x_{1i} + 3.6131 x_{2i} \]
Model Assumptions
Recap:
Multiple regression model:
\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p + \varepsilon \]
use the least squares method to develop values for b0, b1, b2, …, bp as estimates for β0, β1, β2, …, βp
Assumptions About ε
1. Zero Mean: The error, ε, is a random variable with E(ε) = 0
2. Homoskedasticity: The variance of ε is σ² and is the same for all values of the independent variables x1, x2, x3, …, xp
3. Independence: The values of ε are independent
4. Normality: The error, ε, is a normally distributed random variable reflecting the deviation between the y value and the expected value of y given by the multiple regression equation
Residual Analysis
in order to check whether the assumptions about ε hold, we need to analyze the residuals
residuals:
\[ y_i - \hat{y}_i \]
use residual plots to check the assumptions:
Residual plot against ŷ
Standardized residual plot against ŷ
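As a rough sketch of how the residuals could be computed before plotting, the snippet below applies the OmniPower coefficients from the Excel output to a handful of the listed stores (stores 1 to 4 and 13). The variable names are illustrative, and only five of the 34 stores are used, so this is not a full residual analysis.

```python
# Coefficients as reported in the Excel output for the OmniPower example.
b0, b_price, b_promo = 5837.520759, -53.21733631, 3.613058036

# (sales in bars, price in cents, promotion in $) for a few of the listed stores.
stores = [
    (4141, 59, 200),
    (3842, 59, 200),
    (3056, 59, 200),
    (3519, 59, 200),
    (1916, 79, 200),
]

# Fitted value y-hat for each store, then residual = observed - fitted.
fitted = [b0 + b_price * price + b_promo * promo for _, price, promo in stores]
residuals = [sales - y_hat for (sales, _, _), y_hat in zip(stores, fitted)]

# These (fitted, residual) pairs are the points of a residual plot against y-hat.
for y_hat, e in zip(fitted, residuals):
    print(f"fitted = {y_hat:8.2f}   residual = {e:8.2f}")
```

Plotting the residuals against the fitted values for all 34 stores and looking for patterns (curvature, a funnel shape) is one way to check the assumptions about ε.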
2. The research department at Salsberry Realty has been asked to develop some guidelines regarding heating costs for single family homes. Three variables are thought to relate to the heating costs: the mean daily outside temperature, the number of inches of attic insulation, the age of the furnace. A random sample of 20 homes was selected. Using the measurements from these homes as found in Salsberry.xlsx, find the multiple regression equation.
Equation:
\[ \hat{y}_i = 427 - 4.58 x_{1i} - 14.80 x_{2i} + 6.10 x_{3i} \]

             Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%     Lower 95.0%   Upper 95.0%
Intercept    427.193803     59.60142931      7.167509   2.24E-06   300.8444175    553.543189    300.844417    553.543189
Temp         -4.5826626     0.772319353      -5.93364   2.1E-05    -6.21990652    -2.9454187    -6.2199065    -2.9454187
Insul        -14.830863     4.754412281      -3.11939   0.006606   -24.9097665    -4.7519589    -24.909766    -4.7519589
Age          6.10103206     4.012120166      1.52065    0.147862   -2.40428274    14.6063469    -2.4042827    14.6063469
Multiple Regression Equation
Estimated Multiple Regression Equation:
\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p \]
• b0 is the y-intercept
  • generally not meaningful in multiple regression
• b1, b2, …, bp are the slopes corresponding to the independent variables
  • each slope indicates how much the value of the dependent variable changes with an increase of 1 unit of that independent variable (with the other independent variables held constant)
• use b0, b1, b2, …, bp to find estimates of y
Example
A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. All the stores selected have approximately the same monthly sales volume. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Data from the 34 stores is shown in the MS Excel file OmniPower.xlsx. Based on the data from the 34 stores, the multiple regression equation for the sales of OmniPower is found to be:
\[ \hat{y}_i = 5837.5208 - 53.2173 x_{1i} + 3.6131 x_{2i} \]
where x1i is the price of the bar for store i and x2i is the promotional expenditure for store i.
a) What is the interpretation of the slopes, b1 and b2?
b) What are the expected sales for a store charging $.79 per bar during a month in which promotional expenditures are $400?
a) b1 = -53.2173. For every increase of $.01 in the price, the mean sales are expected to decrease by 53.2173 bars, for a given amount of promotional expenditures.
b2 = 3.6131. For every increase of $1 in promotional expenditures, the mean sales are expected to increase by 3.6131 bars, for a given price.
b)
\[ \hat{y}_i = 5837.5208 - 53.2173 x_{1i} + 3.6131 x_{2i} = 5837.5208 - 53.2173(79) + 3.6131(400) = 3078.57 \]
Therefore, sales are expected to be 3078.57 bars.
Use cell references to the coefficients in the Regression table when calculating in MS Excel.
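A small sketch of why the cell-reference advice matters: the prediction computed from the unrounded coefficients in the Excel output reproduces the 3078.57 figure, while the four-decimal coefficients of the displayed equation give a slightly different value. The function name `predict` and the tuple layout are illustrative choices, not part of the original slides.

```python
# Unrounded coefficients from the Excel regression table.
full = (5837.520759, -53.21733631, 3.613058036)
# The same coefficients rounded to four decimals, as in the displayed equation.
rounded = (5837.5208, -53.2173, 3.6131)


def predict(coeffs, price_cents, promotion):
    """Estimated sales (bars) for a given price in cents and promotion in $."""
    b0, b1, b2 = coeffs
    return b0 + b1 * price_cents + b2 * promotion


print(round(predict(full, 79, 400), 2))     # 3078.57
print(round(predict(rounded, 79, 400), 2))  # 3078.59
```

The two-hundredths difference comes only from rounding the coefficients before multiplying.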
1. The research department at Salsberry Realty has been asked to develop some guidelines regarding heating costs for single-family homes. Three variables are thought to relate to the heating costs: the mean daily outside temperature, the number of inches of attic insulation, and the age of the furnace. A random sample of 20 homes was selected. Based on measurements from these homes, the multiple regression equation for the monthly heating cost was found to be:
\[ \hat{y}_i = 427 - 4.58 x_{1i} - 14.80 x_{2i} + 6.10 x_{3i} \]
where x1i is the mean daily outside temperature, x2i is the number of inches of attic insulation, and x3i is the age of the furnace.
a) What is the interpretation of the slopes, b1, b2, and b3?
b) What is the expected heating cost if the mean outside temperature is 30°F, there are 5 inches of insulation, and the furnace is 10 years old?
a) b1 = -4.58. For every increase of 1°F in the mean outside temperature, the heating cost is expected to decrease by $4.58 per month.
b2 = -14.80. For every increase of 1 inch of attic insulation, the heating cost is expected to decrease by $14.80 per month.
b3 = 6.10. For every increase of 1 year in the age of the furnace, the heating cost is expected to increase by $6.10 per month.
b)
\[ \hat{y}_i = 427 - 4.58 x_{1i} - 14.80 x_{2i} + 6.10 x_{3i} = 427 - 4.58(30) - 14.80(5) + 6.10(10) = 276.6 \]
Therefore, the expected heating cost for the month is $276.60.
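The same calculation, together with the slope interpretation from part a), can be sketched as follows. The helper `predict` is a hypothetical name; the coefficients are the rounded ones used in the exercise. Raising insulation by one inch while holding the other two variables fixed changes the estimate by exactly b2.

```python
def predict(intercept, slopes, values):
    """Estimated y = intercept + sum of slope * value over the predictors."""
    return intercept + sum(b * x for b, x in zip(slopes, values))


intercept = 427
slopes = [-4.58, -14.80, 6.10]   # temperature, insulation, furnace age

cost = predict(intercept, slopes, [30, 5, 10])
print(round(cost, 2))  # 276.6

# One more inch of insulation, everything else unchanged.
cost_more_insulation = predict(intercept, slopes, [30, 6, 10])
print(round(cost_more_insulation - cost, 2))  # -14.8
```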
Multiple Coefficient of Determination
How well does the estimated regression equation fit the data?
Multiple coefficient of determination, r², provides a measure of goodness of fit
requires calculation of:
SSE: sum of squares due to error
SSR: sum of squares due to regression
SST: total sum of squares (SST = SSR + SSE)
Proportion of variability in the dependent variable that can be explained by the multiple regression model:
\[ r^2 = \frac{SSR}{SST} \]
Example
A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Use the multiple coefficient of determination to determine the fit of the multiple regression equation.
From Regression analysis:

Regression Statistics
Multiple R           0.870474549
R Square             0.757725941
Adjusted R Square    0.742095357
Standard Error       638.0652881
Observations         34

Multiple coefficient of determination:
\[ r^2 = .7577 \]
Therefore, 75.77% of the variation in sales is explained by the variation in the price and in the promotional expenditures; the remaining 24.23% is due to other factors not included in the model.
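The R Square value can be reproduced from the sums of squares in the ANOVA part of the Excel output shown earlier. A minimal sketch, with variable names chosen for this illustration:

```python
ssr = 39472730.77   # sum of squares due to regression (ANOVA "Regression" row)
sse = 12620946.67   # sum of squares due to error (ANOVA "Residual" row)
sst = ssr + sse     # total sum of squares

r2 = ssr / sst
print(round(sst, 2))  # matches the ANOVA "Total" row
print(round(r2, 4))   # 0.7577
```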
Multiple Coefficient of Determination
How can the multiple coefficient of determination be improved (i.e., made to give a higher value)?
it is almost always possible to raise the coefficient of determination R² by including additional independent variables
when adding variables, be careful to avoid multicollinearity among them
Multicollinearity
Independent variables may not be independent of each other, i.e., they may be correlated (collinear)
e.g., In building a multiple regression model for the price of houses, one independent variable might be the house size in ft². Adding another variable such as the number of bedrooms may not be appropriate because of the close relationship between house size and number of bedrooms.
Most of the time, there will be some correlation between independent variables
avoid adding new variables that add no significant information to the regression model
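One simple way to screen for this is to compute the correlation between two candidate independent variables before adding both to the model. The sketch below uses a hand-written Pearson correlation and entirely hypothetical house data (the sizes and bedroom counts are made up for illustration); a correlation close to 1 suggests the second variable adds little new information.

```python
from math import sqrt


def pearson_r(a, b):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical data: house size in square feet and number of bedrooms.
size = [1000, 1500, 2000, 2500, 3000]
bedrooms = [2, 3, 3, 4, 5]

r = pearson_r(size, bedrooms)
print(round(r, 3))  # 0.971
```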
Multiple Coefficient of Determination
it is almost always possible to raise the coefficient of determination R² by including additional independent variables
to avoid overfitting the model, an adjustment can be made to the R² statistic to penalize the inclusion of useless predictors
➢ Adjusted multiple coefficient of determination
Multiple Coefficient of Determination
Multiple coefficient of determination:
\[ r^2 = \frac{SSR}{SST} \]
Adjusted multiple coefficient of determination:
\[ r_{adj}^2 = 1 - (1 - r^2)\,\frac{n - 1}{n - k - 1} \]
k is the number of independent variables in the regression equation; n is the sample size
Adjusts according to the number of independent variables and the size of the sample
Used by some statisticians to avoid overestimating the impact of adding an independent variable to the model
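The adjustment formula is easy to check numerically. The sketch below uses a hypothetical helper `adjusted_r2` and plugs in the OmniPower values from the Excel output shown earlier (R Square 0.757725941, 34 observations, 2 independent variables).

```python
def adjusted_r2(r2, n, k):
    """Adjusted r^2 for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


omnipower = adjusted_r2(0.757725941, n=34, k=2)
print(round(omnipower, 4))  # 0.7421
```

Because (n - 1)/(n - k - 1) is greater than 1 whenever k ≥ 1, the adjusted value is always below r² (for r² < 1), and the gap widens as more variables are added to a small sample.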
Example
A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Use the adjusted multiple coefficient of determination to determine the fit of the multiple regression equation.
From Regression analysis:
Multiple coefficient of determination:
\[ r^2 = .7577 \]
Adjusted multiple coefficient of determination:
\[ r_{adj}^2 = .7421 \]
Example
Multiple Coefficient of Determination:
75.77% of the variation in sales is explained by the variation in the price and in the promotional expenditures.
Adjusted Multiple Coefficient of Determination:
74.21% of the variation in sales is explained by the multiple regression model adjusted for the number of independent variables (k=2) and sample size (n=34).
3. The research department at Salsberry Realty has been asked to develop some guidelines regarding heating costs for single family homes. Three variables are thought to relate to the heating costs: the mean daily outside temperature, the number of inches of attic insulation, the age of the furnace. A random sample of 20 homes was selected. Find and interpret both the multiple coefficient of determination and the adjusted multiple coefficient of determination.
Regression Statistics
Multiple R           0.8967553
R Square             0.80417007
Adjusted R Square    0.76745195
Standard Error       51.0485536
Observations         20

\[ r^2 = .8042 \]
\[ r_{adj}^2 = .7675 \]
Multiple Coefficient of Determination:
80.42% of the variation in heating cost is explained by the variation in mean temperature, inches of insulation, and age of furnace
Adjusted Multiple Coefficient of Determination:
76.75% of the variation in heating cost is explained by the multiple regression model adjusted for the number of independent variables (k=3) and sample size (n=20)
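As a quick check, the adjusted value can be recovered by hand from r², n = 20, and k = 3 using the adjustment formula given earlier (r² is rounded to four decimals here, so the last digit is approximate):

```latex
\[
r_{adj}^2 = 1 - (1 - r^2)\,\frac{n - 1}{n - k - 1}
          = 1 - (1 - 0.8042)\,\frac{20 - 1}{20 - 3 - 1}
          = 1 - (0.1958)(1.1875)
          \approx 0.7675
\]
```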
Testing for Significance
Recall:
Simple Linear Regression Equation:
\[ E(y) = \beta_0 + \beta_1 x \]
if β1 = 0, then the mean value of y does not depend on the value of x, i.e., x and y are not related
therefore, to test for a significant regression relationship, test
H0: β1 = 0
Ha: β1 ≠ 0
two possible tests:
1. t test: using the t distribution to test the hypothesis
2. F test: using the F distribution to test the hypothesis
Testing for Significance
Multiple Regression Equation:
\[ E(y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p \]
two possible tests:
1. F test: test for overall significance
used to determine whether a significant relationship exists between the dependent variable and the set of ALL independent variables
Testing for Significance
F Test for Significance in Multiple Regression
test H0: β1 = β2 = … = βp = 0
Ha: one or more of the parameters is not equal to 0
if H0 is rejected, then conclude that one or more of the parameters ≠ 0 and that the overall relationship between y and the set of independent variables is significant
if H0 is not rejected, then there is insufficient evidence to conclude that a significant relationship is present
Hypothesis Testing
Research Question → Sample Data Collection → Statistical Analysis → Statistical Inference (Answer Question)
Regression Analysis:
Step 1: state the hypotheses H0: β1 = β2 = … = βp = 0; Ha: one or more βi ≠ 0
Step 2: collect sample data to determine the values of the bi's
Step 3: using the F test, determine the probability of obtaining sample results like these if all βi's = 0
Step 4: if the probability is too small, conclude that H0 is not true and the opposite must be true
Testing for Significance
F Test for Significance in Multiple Regression
test H0: β1 = β2 = … = βp = 0
Ha: one or more of the parameters is not equal to 0
test statistic:
\[ F = \frac{MSR}{MSE} \]
rejection rule (with Fα based on the F distribution with p degrees of freedom in the numerator and n - p - 1 degrees of freedom in the denominator):
p-value approach: Reject H0 if p-value ≤ α
critical value approach: Reject H0 if F ≥ Fα
Example
A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Test for overall significance using a .05 level of significance and interpret the result.
H0: βprice = βpromotion = 0
Ha: at least one of the slopes ≠ 0
From Regression analysis:

ANOVA
             df   SS            MS         F          Significance F
Regression    2   39472730.77   19736365   48.47713   2.86258E-10
Residual     31   12620946.67   407127.3
Total        33   52093677.44
Significance F is 2.86258 × 10⁻¹⁰ < α = .05
Therefore, reject the null hypothesis that the slopes are all equal to 0.
We conclude that at least one slope corresponding to the independent variables (i.e., price and promotional expenditures) is not equal to 0: at least one of the independent variables is significantly related to volume of sales.
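The F statistic and its p-value can be reproduced from the ANOVA sums of squares. This is a sketch under one loud assumption: the p-value line uses a closed-form upper-tail formula that holds only when the numerator degrees of freedom equal 2, which happens to be the case here (two independent variables). For other models a statistics library or Excel's own Significance F would be needed.

```python
ssr, df_reg = 39472730.77, 2    # ANOVA "Regression" row (p = 2)
sse, df_res = 12620946.67, 31   # ANOVA "Residual" row (n - p - 1 = 34 - 2 - 1)

msr = ssr / df_reg              # mean square due to regression
mse = sse / df_res              # mean square due to error
f_stat = msr / mse

# Upper-tail probability of the F distribution, valid ONLY for numerator df = 2:
# P(F > f) = (1 + 2 f / df_res) ** (-df_res / 2)
p_value = (1 + 2 * f_stat / df_res) ** (-df_res / 2)

alpha = 0.05
print(f"F = {f_stat:.4f}")
print(f"p-value = {p_value:.3e}")
print("reject H0" if p_value <= alpha else "do not reject H0")
```

Both values agree with the F and Significance F columns of the Excel output to the precision shown there.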
A sample of 34 stores in a supermarket chain are selected for a test- market study of OmniPower, a new high-energy bar. Two independent variables are considered here—the price of the bar and the monthly budget for in-store promotional expenditures. Test for overall significance using a .05 level of significance and interpret the result.
ANOVA
df SS MS F Significance F
Regression 2 39472730.77 19736365 48.47713 2.86258E-10
Residual 31 12620946.67 407127.3
Total 33 52093677.44
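The F statistic and its significance can be reproduced from the sums of squares in the ANOVA table. The sketch below is only an illustration: the variable names are ours, and the p-value line uses a closed-form upper-tail expression for the F distribution that holds only when the numerator degrees of freedom equal 2, as they do here. For other cases a statistics package would be needed.

```python
# ANOVA values copied from the OmniPower regression output.
ssr, df_reg = 39472730.77, 2    # regression sum of squares, df = p
sse, df_res = 12620946.67, 31   # residual sum of squares, df = n - p - 1

msr = ssr / df_reg              # mean square regression
mse = sse / df_res              # mean square error
f_stat = msr / mse              # F = MSR / MSE

# Upper-tail probability P(F > f_stat).
# This closed form is valid ONLY for numerator df = 2 (an assumption of this sketch).
assert df_reg == 2
p_value = (1 + df_reg * f_stat / df_res) ** (-df_res / 2)

alpha = 0.05
reject_h0 = p_value <= alpha

print(f"F = {f_stat:.5f}, Significance F = {p_value:.5e}, reject H0: {reject_h0}")
```

Both values agree with the F and Significance F columns of the output, so the decision to reject H0 follows directly from the sums of squares.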
Testing for Significance
Multiple Regression Equation: two possible tests:
1. F test: test for overall significance
used to determine whether a significant relationship exists between the dependent variable and the set of ALL independent variables
2. t test: test for individual significance
used to determine whether each of the individual independent variables is significant
Testing for Significance
t Test for Significance in Multiple Regression
For any βi, test H0: βi = 0
Ha: βi ≠ 0
if H0 is rejected, then conclude that βi ≠ 0 and that a statistically significant relationship exists between xi and y
Hypothesis Testing (Chapter 14: Regression Analysis)
Research Question → Data Collection → Statistical Analysis → Answer Question
Step 1 (research question): state the hypotheses H0: βi = 0, Ha: βi ≠ 0
Step 2 (sample data collection): collect sample data from the population to determine bi
Step 3 (statistical analysis): using the t test, determine the p-value, i.e., the probability of obtaining a bi at least this far from 0 if βi = 0 were true
Step 4 (statistical inference): if that probability is too small, reject H0 and conclude that Ha (βi ≠ 0) is supported
Testing for Significance
t Test for Significance in Multiple Regression
test H0: βi = 0
Ha: βi ≠ 0
test statistic: $t_{STAT} = \frac{b_i}{s_{b_i}}$, where $s_{b_i}$ is the standard error of $b_i$
rejection rule (with tα/2 based on a t distribution with n - p - 1 degrees of freedom):
p-value approach: Reject H0 if p-value ≤ α
critical value approach: Reject H0 if tSTAT ≤ -tα/2 or tSTAT ≥ tα/2
Example
A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Test each independent variable for significance at the 0.05 level and interpret the results.
H0: βprice = 0; HA: βprice ≠ 0
H0: βpromotion = 0; HA: βpromotion ≠ 0
From Regression analysis:
            Coefficients    Standard Error   t Stat     P-value    Lower 95%       Upper 95%
Intercept   5837.520759     628.150225       9.293192   1.79E-10   4556.399929     7118.641589
Price       -53.21733631    6.852220559      -7.76644   9.2E-09    -67.19253228    -39.24214034
Promotion   3.613058036     0.685222056      5.272828   9.82E-06   2.215538439     5.010577633
Price: P-value = 9.2 × 10^-9 < α = .05
Therefore, reject the null hypothesis. We conclude that the slope ≠ 0: price of the bar is significantly related to volume of sales.
Promotion: P-value = 9.82 × 10^-6 < α = .05
Therefore, reject the null hypothesis. We conclude that the slope ≠ 0: the amount spent on promotion is significantly related to volume of sales.
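The same conclusions can be reached with the critical value approach. The sketch below recomputes each t statistic as the coefficient divided by its standard error. As an assumption of this sketch, it does not look up tα/2 in a t table; it recovers the critical value from the half-width of the 95% interval in the output, using n - p - 1 = 31 degrees of freedom. The dictionary layout and names are illustrative only.

```python
n, p = 34, 2
df = n - p - 1  # residual degrees of freedom, 31

# (coefficient b_i, standard error s_bi, upper 95% bound) from the regression output
output = {
    "Price": (-53.21733631, 6.852220559, -39.24214034),
    "Promotion": (3.613058036, 0.685222056, 5.010577633),
}

# Recover t_{alpha/2} from the interval half-width: upper = b_i + t_crit * s_bi.
# (Assumption: the reported bounds are ordinary 95% t intervals.)
b, s, upper = output["Price"]
t_crit = (upper - b) / s

results = {}
for name, (b_i, s_bi, _) in output.items():
    t_stat = b_i / s_bi
    reject = t_stat <= -t_crit or t_stat >= t_crit
    results[name] = {"t_stat": t_stat, "reject": reject}
    print(f"{name}: t = {t_stat:.5f}, critical value = ±{t_crit:.4f} (df = {df}), reject H0: {reject}")
```

Both t statistics fall well outside the critical values, which matches the p-value approach used above.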
Testing for Significance
Multicollinearity
Independent variables are used to predict the dependent variable
However, independent variables may not be independent of each other, i.e., may be correlated or collinear
May create problems in analysis:
E.g., overall significance, but no individual significance
Try to avoid correlation among independent variables
Can be tested for (but beyond scope here)
Multicollinearity
Independent variables may not be independent of each other, i.e., may be correlated or collinear
e.g., in building a multiple regression model for the price of houses, one independent variable might be the house size in ft^2. Adding another variable such as the number of bedrooms may not be appropriate because of the close relationship between house size and number of bedrooms.
Most of the time, there will be some correlation between independent variables
avoid adding new variables that add no significant information to the regression model
Multicollinearity Potential Problems:
model may be overall significant, but individual t values for coefficients may not be significant
algebraic sign of estimated regression coefficients may be the opposite of what would be expected
i.e., expect sign to be positive, but turns out to be negative
Multicollinearity Example:
In a model for estimating crude oil production, two possible independent variables could be fuel rate and coal production
model using fuel rate: $y_{oil} = 44.689 + 0.7838\,x_{fuelrate}$
model using coal production: $y_{oil} = 45.072 + 0.0157\,x_{coal}$
model using both: $y_{oil} = 45.806 + 0.0227\,x_{coal} - 0.3934\,x_{fuelrate}$
note that the coefficient for fuel rate has a different sign in the two models that include it
also, the combined model is overall significant, but neither coefficient tests significant, unlike when each variable is tested individually
Multicollinearity
avoid adding new variables that add no significant information to the regression model
avoid adding a new independent variable that seems to be closely correlated to another independent variable
be cautious in interpretation
employ methods to detect multicollinearity (beyond scope of this course)
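Formal detection methods are outside the scope here, but a quick informal check is to look at the correlation between two candidate independent variables before adding both to the model. The sketch below does this for the house size and number of bedrooms example. The data are hypothetical and invented purely for illustration, and the 0.8 cutoff is an assumed rule of thumb, not a value from the course material.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, invented for illustration only.
house_size_sqft = [1200, 1500, 1800, 2100, 2400, 3000]
bedrooms = [2, 3, 3, 4, 4, 5]

r = pearson_r(house_size_sqft, bedrooms)

THRESHOLD = 0.8  # assumed rule-of-thumb cutoff for "closely correlated"
possible_multicollinearity = abs(r) > THRESHOLD

print(f"r = {r:.3f}; possible multicollinearity: {possible_multicollinearity}")
```

A correlation this close to 1 suggests that the second variable adds little new information, which is the situation the guidelines above advise avoiding.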
Example
A sample of 34 stores in a supermarket chain is selected for a test-market study of OmniPower, a new high-energy bar. Two independent variables are considered here: the price of the bar and the monthly budget for in-store promotional expenditures. Another independent variable, whether or not customers received a coupon for a discount, is considered. What would you advise?
The use of a coupon would reduce the price of the bar, and would probably not be independent of the price variable.
Could include the discount from the coupon as part of promotion expenditure.
4. The research department at Salsberry Realty has been asked to develop some guidelines regarding heating costs for single family homes. Three variables are thought to relate to the heating costs: the mean daily outside temperature, the number of inches of attic insulation, and the age of the furnace. A random sample of 20 homes was selected. Determine if there is overall significance for the multiple regression model and determine the significance of the independent variables. Interpret the results.
From Regression Analysis:
ANOVA
            df   SS             MS          F          Significance F
Regression   3   171220.4728    57073.49    21.90118   6.56178E-06
Residual    16   41695.27717    2605.955
Total       19   212915.75
Significance of F is .000007 < α = .05.
Therefore, reject the null hypothesis that all slopes are equal to 0.
We conclude that at least one of the independent variables (i.e., mean daily outside temperature, the number of inches of attic insulation, the age of the furnace) is related to heating costs.
From Regression Analysis:
            Coefficients   Standard Error   t Stat     P-value    Lower 95%      Upper 95%
Intercept   427.193803     59.60142931      7.167509   2.24E-06   300.8444175    553.543189
Temp        -4.5826626     0.772319353      -5.93364   2.1E-05    -6.21990652    -2.9454187
Insul       -14.830863     4.754412281      -3.11939   0.006606   -24.9097665    -4.7519589
Age         6.10103206     4.012120166      1.52065    0.147862   -2.40428274    14.6063469
Mean daily outside temperature (Temp): p-value is .000021 < α = .05.
Therefore, reject the null hypothesis and conclude that there is a significant relationship between the mean outside temperature and heating costs.
Inches of attic insulation (Insul): p-value is .007 < α = .05.
Therefore, reject the null hypothesis and conclude that there is a significant relationship between the inches of attic insulation and heating costs.
Age of furnace (Age): p-value is .148 > α = .05.
Therefore, do not reject the null hypothesis. There does not appear to be a significant relationship between the age of the furnace and heating costs.
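The 95% interval columns in the output give a second way to read the same result: for a two-sided test at α = .05, H0: βi = 0 is rejected exactly when the 95% confidence interval for βi excludes 0. The sketch below checks that both readings agree for the three heating cost variables. The dictionary layout and names are illustrative assumptions; the numbers are copied from the output.

```python
alpha = 0.05

# (p-value, lower 95% bound, upper 95% bound) from the regression output
output = {
    "Temp": (2.1e-05, -6.21990652, -2.9454187),
    "Insul": (0.006606, -24.9097665, -4.7519589),
    "Age": (0.147862, -2.40428274, 14.6063469),
}

decisions = {}
for name, (p_value, lower, upper) in output.items():
    reject_by_p = p_value <= alpha             # p-value approach
    reject_by_ci = not (lower <= 0 <= upper)   # interval excludes 0
    # The two approaches should always agree for a two-sided test at the same alpha.
    assert reject_by_p == reject_by_ci
    decisions[name] = reject_by_p
    print(f"{name}: p = {p_value:g}, 95% CI = ({lower:.4f}, {upper:.4f}), reject H0: {reject_by_p}")
```

Only the interval for Age contains 0, which is consistent with Age being the one variable that does not test significant.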