Assignment for "THE GRADE"
Concept
Relationships Between Variables Selling Price
Style Square Feet Bedrooms
113000 TRI-LE 1350 3
113542 BI-LEV 1700 4
120000 TRI-LE 1350 4
120000 TRI-LE 1350 3
122100 RANCH 1700 4
123000 TRI-LE 1350 4
128500 RANCH 2457 3
137000 RANCH 2100 3
139900 RANCH 2811 3
142000 RANCH 2920 3
: : : :
Is there a relationship between the selling price and the square footage of the houses?
Relationships Between Variables Selling Price
Style Square Feet Bedrooms
113000 TRI-LE 1350 3
113542 BI-LEV 1700 4
120000 TRI-LE 1350 4
120000 TRI-LE 1350 3
122100 RANCH 1700 4
123000 TRI-LE 1350 4
128500 RANCH 2457 3
137000 RANCH 2100 3
139900 RANCH 2811 3
142000 RANCH 2920 3
: : : :
Is there a relationship between the selling price and the square footage of the houses?
0
50000
100000
150000
200000
250000
300000
350000
0 1000 2000 3000 4000 5000
Price by Square Feet
Scatterplot with trendline indicates positive relationship.
Relationships Between Variables Selling Price
Style Square Feet Bedrooms
113000 TRI-LE 1350 3
113542 BI-LEV 1700 4
120000 TRI-LE 1350 4
120000 TRI-LE 1350 3
122100 RANCH 1700 4
123000 TRI-LE 1350 4
128500 RANCH 2457 3
137000 RANCH 2100 3
139900 RANCH 2811 3
142000 RANCH 2920 3
: : : :
Is there a relationship between the selling price and the square footage of the houses?
0
50000
100000
150000
200000
250000
300000
350000
0 1000 2000 3000 4000 5000
Price by Square Feet
Correlation coefficient = 0.676870565
Scatterplot with trendline indicates positive relationship.
Correlation coefficient indicates positive relationship.
Regression Analysis statistical procedure for developing an equation to
show how variables are related
dependent variable—variable being predicted
independent variable—variable used to predict dependent variable
regression equation provides a model for predicting the value of the dependent variable based on the values of the independent variable(s)
analysis helps to identify type of relationship (e.g., linear, exponential) between variables
Simple Linear Regression involves:
one dependent variable—variable being predicted
one independent variable—variable used to predict dependent variable
relationship is approximated by a straight line
Simple Linear Regression Model Recall:
general form of linear equation:
can determine a value for y given any value of x
value of y dependent on value of x
m is the slope and b is the y-intercept:
slope m represents rate of change
the change in the value of y for every unit of change in x
y-intercept b is a constant
y mx b
Simple Linear Regression Model Linear Equation
0
y
x
b
y mx b
x 2
m rise
run
y 2 y
1
x 2 x
1
rise
x 1
run
y 1
y 2
Simple Linear Regression Model Simple Linear Regression Model:
β0 and β1 are the parameters of the model
β1 is the slope (corresponds to m)
β0 is the y-intercept (corresponds to b)
ε is the error term
equation will not fit each point exactly
ε is random variable accounting for the variability in y that cannot be explained by the linear relationship between x and y
y 0
1 x
Simple Linear Regression Equation Simple Linear Regression Equation:
Estimated Simple Linear Regression Equation:
ො𝑦 point estimator of , the mean value of y for a given value of x
(may be more than one y value for a given x)
xyE 10
xbby 10
ˆ
E y
Simple Linear Regression Model Estimated Regression Model
0
y
x
x i
x 1
y 1
y 2
x 1 , y
1
x 2
x 2 , y
2
x i , y
i Black dots represent observed values of y.
i y
Simple Linear Regression Model Estimated Regression Model
0
y
x
ŷ = b 0 +b
1 x
x i
x 1
y 1
y 2
x 2
i ŷ
b 0
Black dots represent observed values of y.
Blue dots represent estimated values of y
based on regression model.
1 ŷ
2 ŷ
i y
x 1 , y
1
x 2 , y
2
x i , y
i
(𝑥𝑖, ෝ𝑦𝑖)
(𝑥2, ෞ𝑦2)(𝑥1, ෞ𝑦1)
Simple Linear Regression Model Estimated Regression Model
0
y
x
ŷ = b 0 +b
1 x
x i
x 1
y 1
y 2
x 2
i ŷ
b 0
Black dots represent observed values of y.
Blue dots represent estimated values of y
based on regression model.
1 ŷ
2 ŷ
i y
x 1 , y
1
x 2 , y
2
x i , y
i
(𝑥𝑖, ෝ𝑦𝑖)
(𝑥2, ෞ𝑦2)(𝑥1, ෞ𝑦1)
e i
Simple Linear Regression Model Estimated Regression Model
0
y
x
ŷ = b 0 +b
1 x
x i
x 1
y 1
y 2
x 2
i ŷ
b 0
e represents error or
difference between
observed and
estimated values of
y. 1
ŷ
2 ŷ
i y
x 1 , y
1
x 2 , y
2
x i , y
i
(𝑥𝑖, ෝ𝑦𝑖)
(𝑥2, ෞ𝑦2)(𝑥1, ෞ𝑦1)
e i
Simple Linear Regression Model Estimated Regression Model
0
y
x
ŷ = b 0 +b
1 x
x i
x 1
y 1
y 2
x 1 , y
1
e 1
x 2
x 2 , y
2
e 2
x i , y
i
e i
i ŷ
b 0
1 ŷ
2 ŷ
i y
goal is to determine the
equation of the line for
which the e’s are the least
(𝑥𝑖, ෝ𝑦𝑖)
(𝑥2, ෞ𝑦2)(𝑥1, ෞ𝑦1)
Simple Linear Regression Equation Estimation process in simple linear regression:
1. collect sample data for x and y
2. use sample data to find b0 and b1 in
3. use b0 and b1 as estimates for β0 and β1 in
E y 0 1x
ŷ = b 0 +b
1 x
Simple Linear Regression Equation Estimation process in simple linear regression:
1. collect sample data for x and y
2. use sample data to find b0 and b1 in
3. use b0 and b1 as estimates for β0 and β1 in
E y 0 1x
ŷ = b 0 +b
1 x
how? use Least Squares Method
Least Squares Method procedure for using sample data to find the
estimated regression equation
based on determining the equation of the line for which the distance from each sample point to the line is less than for any other line
Least Squares Method Least Squares Criterion:
Slope for Estimated Regression Equation
y-Intercept for Estimated Regression Equation
b 1 x
i x yi y
x i x
2
b 0 y b
1 x
—Observed value of dependent variable for the i’th observation
—Estimated value of dependent variable for the i’th observation
y i
ŷ i
—value of independent variable for the i’th observation
—value of dependent variable for the i’th observation —mean value of independent variable —mean value of dependent variable
x i
y i
x
y
min å y i - ŷ
i( ) 2
Finding Estimated Regression Equation Options in MS Excel: Example: Sunflowers.xlsx
1. perform calculations
See Calculations worksheet
2. use LINEST function (with INDEX function)
See Calculations worksheet
3. use scatterplot with trendline and equation
See Graph worksheet
4. use Data Analysis tool for Regression
See Regression worksheet
Finding Estimated Regression Equation Options in MS Excel: Example: Sunflowers.xlsx
1. perform calculations
See Calculations worksheet
2. use LINEST function (with INDEX function)
See Calculations worksheet
3. use scatterplot with trendline and equation
See Graph worksheet
4. use Data Analysis tool for Regression
See Regression worksheet
Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
Example
scatterplot:
The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
0
2
4
6
8
10
12
14
0 1 2 3 4 5 6 7
S q
u a
re f
e e
t (i
n T
h o
u s a
n d
s )
o f
S to
re s
Annual Sales (Million $'s)
Example
trendline:
The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
0
2
4
6
8
10
12
14
0 1 2 3 4 5 6 7
S q
u a
re f
e e
t (i
n T
h o
u s a
n d
s )
o f
S to
re s
Annual Sales (Million $'s)
Example
estimated regression equation:
The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
y = 1.6699x + 0.9645
0
2
4
6
8
10
12
14
0 1 2 3 4 5 6 7
S q
u a
re f
e e
t (i
n T
h o
u s a
n d
s )
o f
S to
re s
Annual Sales (Million $'s)
Example
estimated regression equation:
The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
y = 1.6699x + 0.9645
0
2
4
6
8
10
12
14
0 1 2 3 4 5 6 7
S q
u a
re f
e e
t (i
n T
h o
u s a
n d
s )
o f
S to
re s
Annual Sales (Million $'s)
slope
Example
estimated regression equation:
The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
y = 1.6699x + 0.9645
0
2
4
6
8
10
12
14
0 1 2 3 4 5 6 7
S q
u a
re f
e e
t (i
n T
h o
u s a
n d
s )
o f
S to
re s
Annual Sales (Million $'s)
y-intercept
Finding Estimated Regression Equation Options in MS Excel: Example: Sunflowers.xlsx
1. perform calculations
See Calculations worksheet
2. use LINEST function (with INDEX function)
See Calculations worksheet
3. use scatterplot with trendline and equation
See Graph worksheet
4. use Data Analysis tool for Regression
See Regression worksheet
Preferred, as other
useful information
provided by report
Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.95088328
R Square 0.904179
Adjusted R Square 0.89619392
Standard Error 0.96637968
Observations 14
ANOVA
df SS MS F Significance F
Regression 1 105.7476095 105.74761 113.233513 1.8227E-07
Residual 12 11.20667621 0.93388968
Total 13 116.9542857
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038
Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334
Excel’s Regression Tool Analysis
↳ Data Analysis
↳ Regression
Excel’s Regression Tool Analysis
↳ Data Analysis
↳ Regression
Dependent Variable
y
Independent Variable
x
Excel’s Regression Tool Analysis
↳ Data Analysis
↳ Regression
Dependent Variable
y
Independent Variable
x
SAFETY CHECK: It is critically important to
specify the correct data for the independent variable x and the
dependent variable y.
Excel’s Regression Tool Analysis
↳ Data Analysis
↳ Regression
Enter Confidence Level
Check if labels selected with data
Specify Cell for Output
Excel’s Regression Tool Analysis
↳ Data Analysis
↳ Regression
Enter Confidence Level
Check if labels selected with data
Specify Cell for Output
SAFETY CHECK: Select the column labels with the data and check the Labels box to make sure that the MS
Excel output is clearly labelled.
Excel’s Regression Tool Analysis
↳ Data Analysis
↳ Regression
Analysis provides:
ANOVA
Regression Statistics
Regression Model
Example
Data Analysis
↳ Regression
↳ Input Y Range?
↳ Input X Range?
The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
Example
Data Analysis
↳ Regression
↳ Input Y Range?
↳ Input X Range?
The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
Dependent variable (the one you are
trying to predict)
Independent variable (the
one you think is causing
variance in the dependent variable)
Example The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.95088328
R Square 0.904179
Adjusted R Square 0.89619392
Standard Error 0.96637968
Observations 14
ANOVA
df SS MS F Significance F
Regression 1 105.7476095 105.74761 113.233513 1.8227E-07
Residual 12 11.20667621 0.93388968
Total 13 116.9542857
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038
Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334
Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.95088328
R Square 0.904179
Adjusted R Square 0.89619392
Standard Error 0.96637968
Observations 14
ANOVA
df SS MS F Significance F
Regression 1 105.7476095 105.74761 113.233513 1.8227E-07
Residual 12 11.20667621 0.93388968
Total 13 116.9542857
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038
Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334
y-intercept
Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.95088328
R Square 0.904179
Adjusted R Square 0.89619392
Standard Error 0.96637968
Observations 14
ANOVA
df SS MS F Significance F
Regression 1 105.7476095 105.74761 113.233513 1.8227E-07
Residual 12 11.20667621 0.93388968
Total 13 116.9542857
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038
Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334
y-intercept
slope
Example
Therefore, the estimated regression equation is
The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.95088328
R Square 0.904179
Adjusted R Square 0.89619392
Standard Error 0.96637968
Observations 14
ANOVA
df SS MS F Significance F
Regression 1 105.7476095 105.74761 113.233513 1.8227E-07
Residual 12 11.20667621 0.93388968
Total 13 116.9542857
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038
Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334
y-intercept
slope ŷ = 0.9645+1.6699x
2. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx.
a) Construct a scatterplot. b) Find the simple linear regression equation.
2. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx.
a) Construct a scatterplot. b) Find the simple linear regression equation.
a) Scatterplot: b) simple linear regression
equation: ŷ i =145 + 7.4x
i
y = 7.4x + 145
0
100
200
300
400
0 5 10 15 20 25
Sales
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0%
Upper
95.0%
Intercept 145 21.78302091 6.656560658 5.66278E-05 96.464405 193.535595 96.464405 193.535595
Shelf Space 7.4 1.590806923 4.651727304 0.000905656 3.855461304 10.9445387 3.855461304 10.9445387
ŷ i =145 + 7.4x
i
Regression Equation
Interpreting Coefficients
Recall: — dependent variable
— independent variable
Coefficients: — slope
corresponds to m in y = mx + b
rate of change in y for each change in x
— y-intercept
corresponds to b in y = mx + b
value of y when x = 0 (i.e., where line crosses y-axis)
̂y b 0 b
1 x
ˆ y
x
b 1
b 0
Example
Therefore, the estimated regression equation is
The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.95088328
R Square 0.904179
Adjusted R Square 0.89619392
Standard Error 0.96637968
Observations 14
ANOVA
df SS MS F Significance F
Regression 1 105.7476095 105.74761 113.233513 1.8227E-07
Residual 12 11.20667621 0.93388968
Total 13 116.9542857
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038
Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334
y-intercept
slope ŷ = 0.9645+1.6699x
Interpreting Coefficients
Recall: — dependent variable
— independent variable
Coefficients: — slope
corresponds to m in y = mx + b
rate of change in y for each change in x
— y-intercept
corresponds to b in y = mx + b
value of y when x = 0 (i.e., where line crosses y-axis)
̂y b 0 b
1 x
ˆ y
x
b 1
b 0
Helpful for understanding how independent variable affects
dependent variable
Interpreting Coefficients
Recall: — dependent variable
— independent variable
Coefficients: — slope
corresponds to m in y = mx + b
rate of change in y for each change in x
— y-intercept
corresponds to b in y = mx + b
value of y when x = 0 (i.e., where line crosses y-axis)
̂y b 0 b
1 x
ˆ y
x
b 1
b 0
may or may not be helpful for making sense of business
context
Example A statistics professor wants to use the number of hours a student studies for a statistic final exam (x) to predict the final exam score (y). A regression model was fit based on data collected for a class during the previous semester, with the following results:
ො𝑦 = 35.0 + 3𝑥𝑖 What is the interpretation of the y-intercept b0 and the slope b1?
Example
Interpretation of y-intercept b0:
Interpretation of slope b1:
A statistics professor wants to use the number of hours a student studies for a statistic final exam (x) to predict the final exam score (y). A regression model was fit based on data collected for a class during the previous semester, with the following results:
What is the interpretation of the y-intercept, b0, and the slope, b1? ො𝑦 = 35.0 + 3𝑥𝑖
Example
Interpretation of y-intercept b0:
• y-intercept of 35 indicates that when the student does not study at all for the final exam, the predicted final score is 35
Interpretation of slope b1:
A statistics professor wants to use the number of hours a student studies for a statistic final exam (x) to predict the final exam score (y). A regression model was fit based on data collected for a class during the previous semester, with the following results:
What is the interpretation of the y-intercept, b0, and the slope, b1?
ො𝑦 = 35.0 + 3𝑥𝑖
Example
Interpretation of y-intercept b0:
• y-intercept of 35 indicates that when the student does not study at all for the final exam, the predicted final score is 35
Interpretation of slope b1:
• the slope of 3 indicates that for every hour of studying, the mean change in exam score is 3%
A statistics professor wants to use the number of hours a student studies for a statistic final exam (x) to predict the final exam score (y). A regression model was fit based on data collected for a class during the previous semester, with the following results:
What is the interpretation of the y-intercept, b0, and the slope, b1? ො𝑦 = 35.0 + 3𝑥𝑖
Example
Therefore, the estimated regression equation is
The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.95088328
R Square 0.904179
Adjusted R Square 0.89619392
Standard Error 0.96637968
Observations 14
ANOVA
df SS MS F Significance F
Regression 1 105.7476095 105.74761 113.233513 1.8227E-07
Residual 12 11.20667621 0.93388968
Total 13 116.9542857
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038
Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334
y-intercept
slope ŷ = 0.9645+1.6699x
Example
Therefore, the estimated regression equation is
The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.95088328
R Square 0.904179
Adjusted R Square 0.89619392
Standard Error 0.96637968
Observations 14
ANOVA
df SS MS F Significance F
Regression 1 105.7476095 105.74761 113.233513 1.8227E-07
Residual 12 11.20667621 0.93388968
Total 13 116.9542857
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038
Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334
y-intercept
slope ŷ = 0.9645+1.6699x
For a store of 0 square feet, sales are expected to be $0.9645 million Not
meaningful
For each additional 1000 sq ft, sales
increase by $1.6699 million
Estimating Dependent Variable
Recall: — dependent variable
— independent variable
can find values of corresponding to values of x to make estimates of the dependent variable corresponding to values of the independent variable
̂y b 0 b
1 x
ˆ y
x
ˆ y
Example A statistics professor wants to use the number of hours a student studies for a statistic final exam (x) to predict the final exam score (y). A regression model was fit based on data collected for a class during the previous semester, with the following results:
What is a student who studies 10 hours estimated to score? 12 hours? ො𝑦 = 35.0 + 3𝑥𝑖
Example
10 hours:
therefore, estimate that student scores 65
12 hours:
therefore, estimate that student scores 71
A statistics professor wants to use the number of hours a student studies for a statistic final exam (x) to predict the final exam score (y). A regression model was fit based on data collected for a class during the previous semester, with the following results:
What is a student who studies 10 hours estimated to score? 12 hours?
ˆ y i 35.0 3x
i
35.0 3 10
65
ˆ y i 35.0 3x
i
35.0 3 12
71
ො𝑦 = 35.0 + 3𝑥𝑖
Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Estimate sales for a store of 5000 square feet.
ŷ = 0.9645+1.6699x
Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Estimate sales for a store of 5000 square feet.
Store Square Feet (Thousands)
Annual Sales (Million $'s)
1 1.7 3.7
2 1.6 3.9
3 2.8 6.7
4 5.6 9.5
5 1.3 3.4
6 2.2 5.6
7 1.3 3.7
8 1.1 2.7
9 3.2 5.5
10 1.5 2.9
11 5.2 10.7
12 4.6 7.6
13 5.8 11.8
14 3 4.1
Note that size of store in thousands of square feet and sales in millions
of dollars. Therefore, need to interpret values of dependent and
independent variables in these units.
ŷ = 0.9645+1.6699x
Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Estimate sales for a store of 5000 square feet.
ŷ = 0.9645+1.6699x
5000 square feet x = 5:
therefore, estimate that store has revenue of $9.314 million.
ˆ y 0.9645 1.6699x
0.9645 1.6699 5
9.314
2. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx.
a) Construct a scatterplot. b) Find the simple linear regression equation. c) Interpret the meaning of the slope in this problem. d) Predict or estimate the mean weekly sales of pet food for stores with 8 feet
of shelf space for pet food.
2. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx.
a) Construct a scatterplot. b) Find the simple linear regression equation. c) Interpret the meaning of the slope in this problem. d) Predict or estimate the mean weekly sales of pet food for stores with 8 feet
of shelf space for pet food.
a) Scatterplot: b) simple linear regression
equation: ො𝑦 = 145 + 7.4𝑥𝑖 c) for each increase in shelf
space of 1 foot, weekly sales are estimated to increase by $7.40
d) 8 feet:
Therefore, predicted mean weekly sales are $204.20.
y = 7.4x + 145
0
50
100
150
200
250
300
350
0 5 10 15 20 25
Sales
ො𝑦 = 145 + 7.4𝑥𝑖 = 145 + 7.4(8)
Regression Fit
Coefficient of Determination How well does the estimated regression equation fit the data?
Coefficient of Determination How well does the estimated regression equation fit the data?
coefficient of determination, r2, provides measure of goodness of fit
requires calculation of:
SSE—sum of squares due to error
SSR—sum of squares due to regression
SST—total sum of squares (SST = SSR + SSE)
r 2
SSR
SST
Coefficient of Determination For independent variable xi:
0
y
x
y
y i
x i
y
Observed dependent variable
Estimated dependent variable
Mean dependent variable
ෝ𝑦𝑖
ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥𝑖
Measures of Variation for independent variable xi:
Coefficient of Determination
0
y
x
y
y i
x i
y i y
represents the error in using ෝ𝑦𝑖 to estimate yi
ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥𝑖𝑦𝑖 − ෝ𝑦𝑖
ෝ𝑦𝑖
ෝ𝑦𝑖 − ത𝑦
Overall Measures of Variation:
0
y
x
y
y i
x i
SST y i y
2
Error sum of squares
Regression sum of squares
Total sum of squares
Coefficient of Determination
𝑆𝑆𝑅 = ෝ𝑦𝑖 − ത𝑦 2
𝑆𝑆𝐸 = 𝑦𝑖 − ෝ𝑦𝑖 2
ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥𝑖
ෝ𝑦𝑖
Coefficient of Determination sum of squares due to error (SSE)
0
y
x
y
y i
x i
SST y i y
2
Error sum of squares
Regression sum of squares
Total sum of squares
• Difference 𝑦𝑖 − ෝ𝑦𝑖 represents the error in using ෝ𝑦𝑖 to estimate yi.
• SSE represents the sum of the differences between all yi’s and ෝ𝑦𝑖 ‘s
• this value minimized by least squares method
𝑆𝑆𝑅 = ෝ𝑦𝑖 − ത𝑦 2
𝑆𝑆𝐸 = 𝑦𝑖 − ෝ𝑦𝑖 2
ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥𝑖
ෝ𝑦𝑖
Coefficient of Determination total sum of squares (SST)
0
y
x
y
̂y i b
0 b
1 x
i
y i
x i
SSE y i ˆ y
i 2
SST y i y
2
SSR ˆ y i y
2
Error sum of squares
Regression sum of squares
Total sum of squares
• can be used to estimate yi
• Difference represents the error in using to estimate yi
• SST represents the sum of the differences between all yi’s and ‘s
ˆ y i
y
y i y
y
y
Coefficient of Determination Regression sum of squares (SSR)
0
y
x
y
y i
x i
SST y i y
2
Error sum of squares
Regression sum of squares
Total sum of squares
• Difference ෝ𝑦𝑖 − ത𝑦 measures how much the ෝ𝑦𝑖 values on the regression line deviate from
• SSR represents the sum of the differences between all ෝ𝑦𝑖’s and ‘s
y
y
𝑆𝑆𝑅 = ෝ𝑦𝑖 − ത𝑦 2
𝑆𝑆𝐸 = 𝑦𝑖 − ෝ𝑦𝑖 2
ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥𝑖
ෝ𝑦𝑖
Coefficient of Determination yields the percentage of how much of the variability in dependent variable y can be explained by the linear relationship
0
y
x
y
y i
x i
SST y i y
2
Error sum of squares
Regression sum of squares
Total sum of squares
r 2
SSR
SST
𝑆𝑆𝑅 = ෝ𝑦𝑖 − ത𝑦 2
𝑆𝑆𝐸 = 𝑦𝑖 − ෝ𝑦𝑖 2
ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥𝑖
ෝ𝑦𝑖
Coefficient of Determination How well does the estimated regression equation fit the data?
coefficient of determination, r2, provides measure of goodness of fit
result yields the percentage of how much of the variability in the y can be explained by the linear relationship
r 2
SSR
SST
Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx. Find and interpret the coefficient of determination.
Finding Coefficient of Determination Options in MS Excel: Example: Sunflowers.xlsx
1. perform calculations
See Coefficient worksheet
2. use CORREL function
See Coefficient worksheet
3. use scatterplot and show R2
See Graph worksheet
4. use data analysis tool for Regression
See Regression worksheet
Finding Coefficient of Determination Options in MS Excel: Example: Sunflowers.xlsx
1. perform calculations
See Coefficient worksheet
2. use CORREL function
See Coefficient worksheet
3. use scatterplot and show R2
See Graph worksheet
4. use data analysis tool for Regression
See Regression worksheet
Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx. Find and interpret the coefficient of determination.
y = 1.6699x + 0.9645 R² = 0.90418
0
2
4
6
8
10
12
14
0 1 2 3 4 5 6 7
S q
u a
re f
e e
t (i
n T
h o
u s a
n d
s )
o f
S to
re s
Annual Sales (Million $'s) Therefore, 90.42% of the variability in annual sales is explained by the linear relationship with the size of the store.
9.58% of sample variability is due to other factors.
Finding Coefficient of Determination Options in MS Excel: Example: Sunflowers.xlsx
1. perform calculations
See Coefficient worksheet
2. use CORREL function
See Coefficient worksheet
3. use scatterplot and show R2
See Graph worksheet
4. use data analysis tool for Regression
See Regression worksheet
Preferred, as other
useful information
provided by report
Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx. Find and interpret the coefficient of determination.
From MS Excel:
ANOVA
df SS MS F Significance F
Regression 1 105.74761 105.74761 113.233513 1.8227E-07
Residual 12 11.2066762 0.93388968
Total 13 116.954286 Regression Statistics
Multiple R 0.950883275
R Square 0.904179003
Adjusted R Square 0.89619392
Standard Error 0.966379679
Observations 14
Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx. Find and interpret the coefficient of determination.
From MS Excel:
ANOVA
df SS MS F Significance F
Regression 1 105.74761 105.74761 113.233513 1.8227E-07
Residual 12 11.2066762 0.93388968
Total 13 116.954286 Regression Statistics
Multiple R 0.950883275
R Square 0.904179003
Adjusted R Square 0.89619392
Standard Error 0.966379679
Observations 14
Therefore, 90.42% of the variability in annual sales is explained by the linear relationship with the size of the store.
9.58% of sample variability is due to other factors.
3. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx. Determine the coefficient of determination and interpret its meaning.
3. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx. Determine the coefficient of determination and interpret its meaning.
ANOVA
df SS MS F Significance F
Regression 1 20535 20535 21.638567 0.000905656
Residual 10 9490 949
Total 11 30025
68.4% of the variability in sales of the pet food can be explained by the linear relationship with the size of shelf space.
r 2
SSR
SST
20,535
30,025 0.684
Regression Statistics
Multiple R 0.82700064
R Square 0.683930058
Adjusted R Square 0.652323064
Standard Error 30.8058436
Observations 12
Assumptions
Regression Model Recap:
assumed regression model:
use least squares method to develop values for b0 and b1 as estimates for β0 and β1
coefficient of determination r2 used as measure of goodness of fit of estimated regression equation
y 0
1 x
̂y b 0 b
1 x
r 2
SSR
SST
Regression Model Recap:
assumed regression model:
use least squares method to develop values for b0 and b1 as estimates for β0 and β1
ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥
coefficient of determination r2 used as measure of goodness of fit of estimated regression equation
y 0
1 x
r 2
SSR
SST
WARNING: It is not always
appropriate to use Regression Analysis
Pitfalls in Regression assuming the relationship between the variables is linear
lacking awareness of the assumptions of least-squares regression
not knowing how to evaluate the assumptions of least- squares regression
using a regression model without knowledge of the subject matter
extrapolating outside the relevant range
concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship
Pitfalls in Regression assuming the relationship between the variables is linear
lacking awareness of the assumptions of least-squares regression
not knowing how to evaluate the assumptions of least- squares regression
using a regression model without knowledge of the subject matter
extrapolating outside the relevant range
concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship
Assuming Linear Relationship quick and easy to check that relationship is linear or
not by drawing scatterplot
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0 0.5 1 1.5 2
0
20
40
60
80
100
120
0 10 20 30 40 50
0
2
4
6
8
10
12
14
0 2 4 6 8
Assuming Linear Relationship quick and easy to check that relationship is linear or
not by drawing scatterplot
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0 0.5 1 1.5 2
0
20
40
60
80
100
120
0 10 20 30 40 50
0
2
4
6
8
10
12
14
0 2 4 6 8
appears to be a linear
relationship
no obvious pattern
clearly not a linear
relationship
Assuming Linear Relationship quick and easy to check that relationship is linear or
not by drawing scatterplot
0
0.05
0.1
0.15
0.2
0.25
0.3
0.35
0 0.5 1 1.5 2
0
20
40
60
80
100
120
0 10 20 30 40 50
0
2
4
6
8
10
12
14
0 2 4 6 8
appears to be a linear
relationship
no obvious pattern
clearly not a linear
relationship
✔ Proceed with
regression analysis
Proceed with
caution
✗ Do not use linear
regression analysis
Pitfalls in Regression assuming the relationship between the variables is linear
lacking awareness of the assumptions of least-squares regression
not knowing how to evaluate the assumptions of least- squares regression
using a regression model without knowledge of the subject matter
extrapolating outside the relevant range
concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship
Residual Analysis: Validating Model Assumptions Assumptions about Linear Regression Model
1. Linearity
relationship between variables is linear; ε is random variable with E(ε) = 0
2. Independence of errors
values of ε are independent
3. Normality of error:
ε is a normally distributed variable
4. Equal variance
variance of ε, denoted by σ2 is the same for all values of x
y 0
1 x
Residual Analysis: Validating Model Assumptions Assumptions about Linear Regression Model
1. Linearity
relationship between variables is linear; ε is random variable with E(ε) = 0
2. Independence of errors
values of ε are independent
3. Normality of error:
ε is a normally distributed variable
4. Equal variance
variance of ε, denoted by σ2 is the same for all values of x
y 0
1 x
acronymn LINE to help remember
Residual Analysis: Validating Model Assumptions in order to check if assumptions about ε are true, need to
analyze the residuals
residuals: yi − ෝ𝑦𝑖 difference between each value of the dependent variable and its
estimated value as calculated using estimated regression line
e.g., for a given value of x, the data shows y = 13.2 and the linear regression equation calculates a y value of 13.5. The residual is 13.2 – 13.5 = –0.3
Residual Analysis: Validating Model Assumptions in order to check if assumptions about ε are true, need to
analyze the residuals
residuals:
difference between each value of the dependent variable and its estimated value as calculated using estimated regression line
e.g., for a given value of x, the data shows y = 13.2 and the linear regression equation calculates a y value of 13.5. The residual is 13.2 – 13.5 = –0.3
use residual plots and standardized residuals to check assumptions
if assumptions seem questionable, then:
results of hypothesis tests may not be valid
interval estimates may not be accurate
yi − ෝ𝑦𝑖
Residual Analysis: Validating Model Assumptions
Assumption Residual Plot to Check
1. linearity of relationship between variables; ε is random variable, E(ε) = 0
residual plot against x
2. independence: values of ε are independent
plot residuals in order or sequence of data collection (may not always be applicable)
3. normality of ε: ε is a normally distributed variable
standardized residual plot
4. equal variance: variance of ε, denoted by σ2 is the same for all values of x
residual plot against x
Residual Analysis using Excel Data
Analysis Tool
Check these boxes for residual
analysis
Pitfalls in Regression assuming the relationship between the variables is linear
lacking awareness of the assumptions of least-squares regression
not knowing how to evaluate the assumptions of least- squares regression
using a regression model without knowledge of the subject matter
extrapolating outside the relevant range
concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship
Extrapolating Outside the Relevant Range need to be careful about making estimates or predictions about
dependent variable based on extreme values of independent variable
other factors may come into play for extremely small or large values of dependent variable and estimated linear regression may not fit points
e.g., Sunflowers Apparel
estimated regression equation is y = 0.9645 + 1.6699x
based on range of values for independent variable
stores are from 1.1 to 5.8 thousand sq. ft.
For values of x less than 1.1 thousand sq. ft., relationship between size of store and sales may not be the same
For values of x greater than 5.8 thousand sq. ft., relationship between size of store and sales may not be the same, especially for much larger values
Pitfalls in Regression assuming the relationship between the variables is linear
lacking awareness of the assumptions of least-squares regression
not knowing how to evaluate the assumptions of least- squares regression
using a regression model without knowledge of the subject matter
extrapolating outside the relevant range
concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship
Cause-and-Effect Relationship temptation to attribute cause-and-effect relationship
between variables i.e., independent variable is causing values of dependent variable
variables may be related, but necessarily causal e.g., consider 2 related variables: mean price of houses sold in a
month and mean number of house listings in a month
could develop model with mean price as independent variable and mean number of listings as dependent variable
could also develop model with mean number of listings as independent variable and mean price as dependent variable
both scatterplots would show same pattern and regression analysis would generate same R2 value
not clear whether there is a causal relationship—could be a other factors causing relationship between variables
Pitfalls in Regression To avoid pitfalls:
1. examine scatterplot of x and y observe possible relationship
2. Check assumptions of least-squares regression
3. If no violations of assumptions, carry out statistical analysis
significance tests, confidence and prediction intervals
4. Avoid making predictions outside the relevant range of the independent variable
5. Do not attribute any cause-and-effect relationship to variables
6. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx. a) based on the scatterplot, does it seem reasonable to proceed with
simple linear regression? b) Explain why the regression equation should not be used to estimate
the sales of pet food for 30 units of shelf space? c) Can the manager conclude that more shelf space for the product will
result in more sales? Why or why not?
6. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx. a) based on the scatterplot, does it seem reasonable to proceed with
simple linear regression? b) Explain why the regression equation should not be used to estimate
the sales of pet food for 30 units of shelf space? c) Can the manager conclude that more shelf space for the product will
result in more sales? Why or why not?
a) The dots in the scatterplot appear to be
arranged in a linear pattern, suggesting
a linear relationship: simple linear
regression appears to be appropriate. 0
100
200
300
400
0 5 10 15 20 25
6. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx. a) based on the scatterplot, does it seem reasonable to proceed with
simple linear regression? b) Explain why the regression equation should not be used to estimate
the sales of pet food for 30 units of shelf space? c) Can the manager conclude that more shelf space for the product will
result in more sales? Why or why not?
b) The amount of shelf space in the sample
data ranges from 5 to 20. Therefore, it
would be inadvisable to use the estimated
regression equation to estimate sales for
shelf space outside of this range. Because
30 is outside of this range, it would be
inadvisable to try to estimate sales for 30
units of shelf space.
Shelf Space Sales Aisle Location
5 160 0
5 220 1
5 140 0
10 190 0
10 240 0
10 260 1
15 230 0
15 270 0
15 280 1
20 260 0
20 290 0
20 310 1
6. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx. a) based on the scatterplot, does it seem reasonable to proceed with
simple linear regression? b) Explain why the regression equation should not be used to estimate
the sales of pet food for 30 units of shelf space? c) Can the manager conclude that more shelf space for the product will
result in more sales? Why or why not?
c) There is evidence of a positive relationship between the independent
variable shelf space and the dependent variable sales: as shelf space
increases, sales increase, and vice versa. Although, intuitively, based on
this positive relationship, the manager might want to conclude that more
shelf space results in more sales, statistically this would be an invalid
conclusion. Although there is evidence of a positive relationship, it has
not been statistically established that the relationship is causal.