Assignment for "THE GRADE"

profilesunsam233
1-SimpleRegression.pdf

Concept

Relationships Between Variables Selling Price

Style Square Feet Bedrooms

113000 TRI-LE 1350 3

113542 BI-LEV 1700 4

120000 TRI-LE 1350 4

120000 TRI-LE 1350 3

122100 RANCH 1700 4

123000 TRI-LE 1350 4

128500 RANCH 2457 3

137000 RANCH 2100 3

139900 RANCH 2811 3

142000 RANCH 2920 3

: : : :

Is there a relationship between the selling price and the square footage of the houses?

Relationships Between Variables Selling Price

Style Square Feet Bedrooms

113000 TRI-LE 1350 3

113542 BI-LEV 1700 4

120000 TRI-LE 1350 4

120000 TRI-LE 1350 3

122100 RANCH 1700 4

123000 TRI-LE 1350 4

128500 RANCH 2457 3

137000 RANCH 2100 3

139900 RANCH 2811 3

142000 RANCH 2920 3

: : : :

Is there a relationship between the selling price and the square footage of the houses?

0

50000

100000

150000

200000

250000

300000

350000

0 1000 2000 3000 4000 5000

Price by Square Feet

Scatterplot with trendline indicates positive relationship.

Relationships Between Variables Selling Price

Style Square Feet Bedrooms

113000 TRI-LE 1350 3

113542 BI-LEV 1700 4

120000 TRI-LE 1350 4

120000 TRI-LE 1350 3

122100 RANCH 1700 4

123000 TRI-LE 1350 4

128500 RANCH 2457 3

137000 RANCH 2100 3

139900 RANCH 2811 3

142000 RANCH 2920 3

: : : :

Is there a relationship between the selling price and the square footage of the houses?

0

50000

100000

150000

200000

250000

300000

350000

0 1000 2000 3000 4000 5000

Price by Square Feet

Correlation coefficient = 0.676870565

Scatterplot with trendline indicates positive relationship.

Correlation coefficient indicates positive relationship.

Regression Analysis  statistical procedure for developing an equation to

show how variables are related

 dependent variable—variable being predicted

 independent variable—variable used to predict dependent variable

 regression equation provides a model for predicting the value of the dependent variable based on the values of the independent variable(s)

 analysis helps to identify type of relationship (e.g., linear, exponential) between variables

Simple Linear Regression  involves:

 one dependent variable—variable being predicted

 one independent variable—variable used to predict dependent variable

 relationship is approximated by a straight line

Simple Linear Regression Model Recall:

 general form of linear equation:

 can determine a value for y given any value of x

 value of y dependent on value of x

 m is the slope and b is the y-intercept:

 slope m represents rate of change

 the change in the value of y for every unit of change in x

 y-intercept b is a constant



y  mx b

Simple Linear Regression Model Linear Equation

0

y

x

b



y  mx b



x 2



m  rise

run 

y 2  y

1

x 2  x

1

rise



x 1

run



y 1



y 2

Simple Linear Regression Model Simple Linear Regression Model:

 β0 and β1 are the parameters of the model

 β1 is the slope (corresponds to m)

 β0 is the y-intercept (corresponds to b)

 ε is the error term

 equation will not fit each point exactly

 ε is random variable accounting for the variability in y that cannot be explained by the linear relationship between x and y



y   0 

1 x 

Simple Linear Regression Equation Simple Linear Regression Equation:

Estimated Simple Linear Regression Equation:

 ො𝑦 point estimator of , the mean value of y for a given value of x

 (may be more than one y value for a given x)

  xyE 10  

xbby 10

ˆ 



E y 

Simple Linear Regression Model Estimated Regression Model

0

y

x



x i



x 1



y 1



y 2



x 1 , y

1 



x 2



x 2 , y

2 



x i , y

i  Black dots represent observed values of y.

i y

Simple Linear Regression Model Estimated Regression Model

0

y

x

ŷ = b 0 +b

1 x



x i



x 1



y 1



y 2



x 2

i ŷ



b 0

Black dots represent observed values of y.

Blue dots represent estimated values of y

based on regression model.

1 ŷ

2 ŷ

i y



x 1 , y

1 



x 2 , y

2 



x i , y

i 

(𝑥𝑖, ෝ𝑦𝑖)

(𝑥2, ෞ𝑦2)(𝑥1, ෞ𝑦1)

Simple Linear Regression Model Estimated Regression Model

0

y

x

ŷ = b 0 +b

1 x



x i



x 1



y 1



y 2



x 2

i ŷ



b 0

Black dots represent observed values of y.

Blue dots represent estimated values of y

based on regression model.

1 ŷ

2 ŷ

i y



x 1 , y

1 



x 2 , y

2 



x i , y

i 

(𝑥𝑖, ෝ𝑦𝑖)

(𝑥2, ෞ𝑦2)(𝑥1, ෞ𝑦1) 

e i

Simple Linear Regression Model Estimated Regression Model

0

y

x

ŷ = b 0 +b

1 x



x i



x 1



y 1



y 2



x 2

i ŷ



b 0

e represents error or

difference between

observed and

estimated values of

y. 1

2 ŷ

i y



x 1 , y

1 



x 2 , y

2 



x i , y

i 

(𝑥𝑖, ෝ𝑦𝑖)

(𝑥2, ෞ𝑦2)(𝑥1, ෞ𝑦1) 

e i

Simple Linear Regression Model Estimated Regression Model

0

y

x

ŷ = b 0 +b

1 x



x i



x 1



y 1



y 2



x 1 , y

1 



e 1



x 2



x 2 , y

2 



e 2



x i , y

i 



e i

i ŷ



b 0

1 ŷ

2 ŷ

i y

goal is to determine the

equation of the line for

which the e’s are the least

(𝑥𝑖, ෝ𝑦𝑖)

(𝑥2, ෞ𝑦2)(𝑥1, ෞ𝑦1)

Simple Linear Regression Equation Estimation process in simple linear regression:

1. collect sample data for x and y

2. use sample data to find b0 and b1 in

3. use b0 and b1 as estimates for β0 and β1 in



E y  0 1x

ŷ = b 0 +b

1 x

Simple Linear Regression Equation Estimation process in simple linear regression:

1. collect sample data for x and y

2. use sample data to find b0 and b1 in

3. use b0 and b1 as estimates for β0 and β1 in



E y  0 1x

ŷ = b 0 +b

1 x

how? use Least Squares Method

Least Squares Method  procedure for using sample data to find the

estimated regression equation

 based on determining the equation of the line for which the distance from each sample point to the line is less than for any other line

Least Squares Method  Least Squares Criterion:

 Slope for Estimated Regression Equation

 y-Intercept for Estimated Regression Equation



b 1   x

i  x   yi  y  

 x i  x  

2



b 0  y b

1 x

—Observed value of dependent variable for the i’th observation

—Estimated value of dependent variable for the i’th observation



y i

ŷ i

—value of independent variable for the i’th observation

—value of dependent variable for the i’th observation —mean value of independent variable —mean value of dependent variable



x i



y i



x



y

min å y i - ŷ

i( ) 2

Finding Estimated Regression Equation Options in MS Excel: Example: Sunflowers.xlsx

1. perform calculations

See Calculations worksheet

2. use LINEST function (with INDEX function)

See Calculations worksheet

3. use scatterplot with trendline and equation

See Graph worksheet

4. use Data Analysis tool for Regression

See Regression worksheet

Finding Estimated Regression Equation Options in MS Excel: Example: Sunflowers.xlsx

1. perform calculations

See Calculations worksheet

2. use LINEST function (with INDEX function)

See Calculations worksheet

3. use scatterplot with trendline and equation

See Graph worksheet

4. use Data Analysis tool for Regression

See Regression worksheet

Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

Example

scatterplot:

The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

0

2

4

6

8

10

12

14

0 1 2 3 4 5 6 7

S q

u a

re f

e e

t (i

n T

h o

u s a

n d

s )

o f

S to

re s

Annual Sales (Million $'s)

Example

trendline:

The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

0

2

4

6

8

10

12

14

0 1 2 3 4 5 6 7

S q

u a

re f

e e

t (i

n T

h o

u s a

n d

s )

o f

S to

re s

Annual Sales (Million $'s)

Example

estimated regression equation:

The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

y = 1.6699x + 0.9645

0

2

4

6

8

10

12

14

0 1 2 3 4 5 6 7

S q

u a

re f

e e

t (i

n T

h o

u s a

n d

s )

o f

S to

re s

Annual Sales (Million $'s)

Example

estimated regression equation:

The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

y = 1.6699x + 0.9645

0

2

4

6

8

10

12

14

0 1 2 3 4 5 6 7

S q

u a

re f

e e

t (i

n T

h o

u s a

n d

s )

o f

S to

re s

Annual Sales (Million $'s)

slope

Example

estimated regression equation:

The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

y = 1.6699x + 0.9645

0

2

4

6

8

10

12

14

0 1 2 3 4 5 6 7

S q

u a

re f

e e

t (i

n T

h o

u s a

n d

s )

o f

S to

re s

Annual Sales (Million $'s)

y-intercept

Finding Estimated Regression Equation Options in MS Excel: Example: Sunflowers.xlsx

1. perform calculations

See Calculations worksheet

2. use LINEST function (with INDEX function)

See Calculations worksheet

3. use scatterplot with trendline and equation

See Graph worksheet

4. use Data Analysis tool for Regression

See Regression worksheet

Preferred, as other

useful information

provided by report

Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.95088328

R Square 0.904179

Adjusted R Square 0.89619392

Standard Error 0.96637968

Observations 14

ANOVA

df SS MS F Significance F

Regression 1 105.7476095 105.74761 113.233513 1.8227E-07

Residual 12 11.20667621 0.93388968

Total 13 116.9542857

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038

Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334

Excel’s Regression Tool  Analysis

↳ Data Analysis

↳ Regression

Excel’s Regression Tool  Analysis

↳ Data Analysis

↳ Regression

Dependent Variable

y

Independent Variable

x

Excel’s Regression Tool  Analysis

↳ Data Analysis

↳ Regression

Dependent Variable

y

Independent Variable

x

SAFETY CHECK: It is critically important to

specify the correct data for the independent variable x and the

dependent variable y.

Excel’s Regression Tool  Analysis

↳ Data Analysis

↳ Regression

Enter Confidence Level

Check if labels selected with data

Specify Cell for Output

Excel’s Regression Tool  Analysis

↳ Data Analysis

↳ Regression

Enter Confidence Level

Check if labels selected with data

Specify Cell for Output

SAFETY CHECK: Select the column labels with the data and check the Labels box to make sure that the MS

Excel output is clearly labelled.

Excel’s Regression Tool  Analysis

↳ Data Analysis

↳ Regression

 Analysis provides:

ANOVA

Regression Statistics

Regression Model

Example

Data Analysis

↳ Regression

↳ Input Y Range?

↳ Input X Range?

The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

Example

Data Analysis

↳ Regression

↳ Input Y Range?

↳ Input X Range?

The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

Dependent variable (the one you are

trying to predict)

Independent variable (the

one you think is causing

variance in the dependent variable)

Example The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.95088328

R Square 0.904179

Adjusted R Square 0.89619392

Standard Error 0.96637968

Observations 14

ANOVA

df SS MS F Significance F

Regression 1 105.7476095 105.74761 113.233513 1.8227E-07

Residual 12 11.20667621 0.93388968

Total 13 116.9542857

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038

Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334

Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.95088328

R Square 0.904179

Adjusted R Square 0.89619392

Standard Error 0.96637968

Observations 14

ANOVA

df SS MS F Significance F

Regression 1 105.7476095 105.74761 113.233513 1.8227E-07

Residual 12 11.20667621 0.93388968

Total 13 116.9542857

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038

Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334

y-intercept

Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.95088328

R Square 0.904179

Adjusted R Square 0.89619392

Standard Error 0.96637968

Observations 14

ANOVA

df SS MS F Significance F

Regression 1 105.7476095 105.74761 113.233513 1.8227E-07

Residual 12 11.20667621 0.93388968

Total 13 116.9542857

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038

Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334

y-intercept

slope

Example

Therefore, the estimated regression equation is

The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.95088328

R Square 0.904179

Adjusted R Square 0.89619392

Standard Error 0.96637968

Observations 14

ANOVA

df SS MS F Significance F

Regression 1 105.7476095 105.74761 113.233513 1.8227E-07

Residual 12 11.20667621 0.93388968

Total 13 116.9542857

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038

Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334

y-intercept

slope ŷ = 0.9645+1.6699x

2. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx.

a) Construct a scatterplot. b) Find the simple linear regression equation.

2. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx.

a) Construct a scatterplot. b) Find the simple linear regression equation.

a) Scatterplot: b) simple linear regression

equation: ŷ i =145 + 7.4x

i

y = 7.4x + 145

0

100

200

300

400

0 5 10 15 20 25

Sales

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0%

Upper

95.0%

Intercept 145 21.78302091 6.656560658 5.66278E-05 96.464405 193.535595 96.464405 193.535595

Shelf Space 7.4 1.590806923 4.651727304 0.000905656 3.855461304 10.9445387 3.855461304 10.9445387

ŷ i =145 + 7.4x

i

Regression Equation

Interpreting Coefficients

 Recall: — dependent variable

— independent variable

 Coefficients: — slope

 corresponds to m in y = mx + b

 rate of change in y for each change in x

— y-intercept

 corresponds to b in y = mx + b

 value of y when x = 0 (i.e., where line crosses y-axis)



̂y  b 0 b

1 x



ˆ y



x



b 1



b 0

Example

Therefore, the estimated regression equation is

The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.95088328

R Square 0.904179

Adjusted R Square 0.89619392

Standard Error 0.96637968

Observations 14

ANOVA

df SS MS F Significance F

Regression 1 105.7476095 105.74761 113.233513 1.8227E-07

Residual 12 11.20667621 0.93388968

Total 13 116.9542857

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038

Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334

y-intercept

slope ŷ = 0.9645+1.6699x

Interpreting Coefficients

 Recall: — dependent variable

— independent variable

 Coefficients: — slope

 corresponds to m in y = mx + b

 rate of change in y for each change in x

— y-intercept

 corresponds to b in y = mx + b

 value of y when x = 0 (i.e., where line crosses y-axis)



̂y  b 0 b

1 x



ˆ y



x



b 1



b 0

Helpful for understanding how independent variable affects

dependent variable

Interpreting Coefficients

 Recall: — dependent variable

— independent variable

 Coefficients: — slope

 corresponds to m in y = mx + b

 rate of change in y for each change in x

— y-intercept

 corresponds to b in y = mx + b

 value of y when x = 0 (i.e., where line crosses y-axis)



̂y  b 0 b

1 x



ˆ y



x



b 1



b 0

may or may not be helpful for making sense of business

context

Example A statistics professor wants to use the number of hours a student studies for a statistic final exam (x) to predict the final exam score (y). A regression model was fit based on data collected for a class during the previous semester, with the following results:

ො𝑦 = 35.0 + 3𝑥𝑖 What is the interpretation of the y-intercept b0 and the slope b1?

Example

Interpretation of y-intercept b0:

Interpretation of slope b1:

A statistics professor wants to use the number of hours a student studies for a statistic final exam (x) to predict the final exam score (y). A regression model was fit based on data collected for a class during the previous semester, with the following results:

What is the interpretation of the y-intercept, b0, and the slope, b1? ො𝑦 = 35.0 + 3𝑥𝑖

Example

Interpretation of y-intercept b0:

• y-intercept of 35 indicates that when the student does not study at all for the final exam, the predicted final score is 35

Interpretation of slope b1:

A statistics professor wants to use the number of hours a student studies for a statistic final exam (x) to predict the final exam score (y). A regression model was fit based on data collected for a class during the previous semester, with the following results:

What is the interpretation of the y-intercept, b0, and the slope, b1?

ො𝑦 = 35.0 + 3𝑥𝑖

Example

Interpretation of y-intercept b0:

• y-intercept of 35 indicates that when the student does not study at all for the final exam, the predicted final score is 35

Interpretation of slope b1:

• the slope of 3 indicates that for every hour of studying, the mean change in exam score is 3%

A statistics professor wants to use the number of hours a student studies for a statistic final exam (x) to predict the final exam score (y). A regression model was fit based on data collected for a class during the previous semester, with the following results:

What is the interpretation of the y-intercept, b0, and the slope, b1? ො𝑦 = 35.0 + 3𝑥𝑖

Example

Therefore, the estimated regression equation is

The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.95088328

R Square 0.904179

Adjusted R Square 0.89619392

Standard Error 0.96637968

Observations 14

ANOVA

df SS MS F Significance F

Regression 1 105.7476095 105.74761 113.233513 1.8227E-07

Residual 12 11.20667621 0.93388968

Total 13 116.9542857

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038

Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334

y-intercept

slope ŷ = 0.9645+1.6699x

Example

Therefore, the estimated regression equation is

The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx.

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.95088328

R Square 0.904179

Adjusted R Square 0.89619392

Standard Error 0.96637968

Observations 14

ANOVA

df SS MS F Significance F

Regression 1 105.7476095 105.74761 113.233513 1.8227E-07

Residual 12 11.20667621 0.93388968

Total 13 116.9542857

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 0.96447366 0.526193302 1.83292652 0.09172683 -0.1820031 2.11095038 -0.1820031 2.11095038

Square Feet (Thousands)1.66986232 0.156925375 10.6411237 1.8227E-07 1.3279513 2.01177334 1.3279513 2.01177334

y-intercept

slope ŷ = 0.9645+1.6699x

For a store of 0 square feet, sales are expected to be $0.9645 million  Not

meaningful

For each additional 1000 sq ft, sales

increase by $1.6699 million

Estimating Dependent Variable

 Recall: — dependent variable

— independent variable

 can find values of corresponding to values of x to make estimates of the dependent variable corresponding to values of the independent variable



̂y  b 0 b

1 x



ˆ y



x



ˆ y

Example A statistics professor wants to use the number of hours a student studies for a statistic final exam (x) to predict the final exam score (y). A regression model was fit based on data collected for a class during the previous semester, with the following results:

What is a student who studies 10 hours estimated to score? 12 hours? ො𝑦 = 35.0 + 3𝑥𝑖

Example

 10 hours:

 therefore, estimate that student scores 65

 12 hours:

 therefore, estimate that student scores 71

A statistics professor wants to use the number of hours a student studies for a statistic final exam (x) to predict the final exam score (y). A regression model was fit based on data collected for a class during the previous semester, with the following results:

What is a student who studies 10 hours estimated to score? 12 hours?



ˆ y i  35.0  3x

i

 35.0  3 10 

 65



ˆ y i  35.0  3x

i

 35.0  3 12 

 71

ො𝑦 = 35.0 + 3𝑥𝑖

Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Estimate sales for a store of 5000 square feet.

ŷ = 0.9645+1.6699x

Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Estimate sales for a store of 5000 square feet.

Store Square Feet (Thousands)

Annual Sales (Million $'s)

1 1.7 3.7

2 1.6 3.9

3 2.8 6.7

4 5.6 9.5

5 1.3 3.4

6 2.2 5.6

7 1.3 3.7

8 1.1 2.7

9 3.2 5.5

10 1.5 2.9

11 5.2 10.7

12 4.6 7.6

13 5.8 11.8

14 3 4.1

Note that size of store in thousands of square feet and sales in millions

of dollars. Therefore, need to interpret values of dependent and

independent variables in these units.

ŷ = 0.9645+1.6699x

Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Estimate sales for a store of 5000 square feet.

ŷ = 0.9645+1.6699x

 5000 square feet  x = 5:

 therefore, estimate that store has revenue of $9.314 million.



ˆ y  0.9645 1.6699x

 0.9645 1.6699 5 

 9.314

2. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx.

a) Construct a scatterplot. b) Find the simple linear regression equation. c) Interpret the meaning of the slope in this problem. d) Predict or estimate the mean weekly sales of pet food for stores with 8 feet

of shelf space for pet food.

2. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx.

a) Construct a scatterplot. b) Find the simple linear regression equation. c) Interpret the meaning of the slope in this problem. d) Predict or estimate the mean weekly sales of pet food for stores with 8 feet

of shelf space for pet food.

a) Scatterplot: b) simple linear regression

equation: ො𝑦 = 145 + 7.4𝑥𝑖 c) for each increase in shelf

space of 1 foot, weekly sales are estimated to increase by $7.40

d) 8 feet:

Therefore, predicted mean weekly sales are $204.20.

y = 7.4x + 145

0

50

100

150

200

250

300

350

0 5 10 15 20 25

Sales

ො𝑦 = 145 + 7.4𝑥𝑖 = 145 + 7.4(8)

Regression Fit

Coefficient of Determination How well does the estimated regression equation fit the data?

Coefficient of Determination How well does the estimated regression equation fit the data?

 coefficient of determination, r2, provides measure of goodness of fit

 requires calculation of:

 SSE—sum of squares due to error

 SSR—sum of squares due to regression

 SST—total sum of squares (SST = SSR + SSE)



r 2 

SSR

SST

Coefficient of Determination For independent variable xi:

0

y

x



y



y i



x i



y

Observed dependent variable

Estimated dependent variable

Mean dependent variable

ෝ𝑦𝑖

ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥𝑖

Measures of Variation for independent variable xi:

Coefficient of Determination

0

y

x



y



y i



x i



y i  y

represents the error in using ෝ𝑦𝑖 to estimate yi

ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥𝑖𝑦𝑖 − ෝ𝑦𝑖

ෝ𝑦𝑖

ෝ𝑦𝑖 − ത𝑦

Overall Measures of Variation:

0

y

x



y



y i



x i



SST   y i  y  

2

Error sum of squares

Regression sum of squares

Total sum of squares

Coefficient of Determination

𝑆𝑆𝑅 = ෍ ෝ𝑦𝑖 − ത𝑦 2

𝑆𝑆𝐸 = ෍ 𝑦𝑖 − ෝ𝑦𝑖 2

ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥𝑖

ෝ𝑦𝑖

Coefficient of Determination sum of squares due to error (SSE)

0

y

x



y



y i



x i



SST   y i  y  

2

Error sum of squares

Regression sum of squares

Total sum of squares

• Difference 𝑦𝑖 − ෝ𝑦𝑖 represents the error in using ෝ𝑦𝑖 to estimate yi.

• SSE represents the sum of the differences between all yi’s and ෝ𝑦𝑖 ‘s

• this value minimized by least squares method

𝑆𝑆𝑅 = ෍ ෝ𝑦𝑖 − ത𝑦 2

𝑆𝑆𝐸 = ෍ 𝑦𝑖 − ෝ𝑦𝑖 2

ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥𝑖

ෝ𝑦𝑖

Coefficient of Determination total sum of squares (SST)

0

y

x



y



̂y i  b

0 b

1 x

i 

y i



x i



SSE   y i  ˆ y

i  2



SST   y i  y  

2



SSR   ˆ y i  y  

2

Error sum of squares

Regression sum of squares

Total sum of squares

• can be used to estimate yi

• Difference represents the error in using to estimate yi

• SST represents the sum of the differences between all yi’s and ‘s



ˆ y i



y



y i  y



y



y

Coefficient of Determination Regression sum of squares (SSR)

0

y

x



y



y i



x i



SST   y i  y  

2

Error sum of squares

Regression sum of squares

Total sum of squares

• Difference ෝ𝑦𝑖 − ത𝑦 measures how much the ෝ𝑦𝑖 values on the regression line deviate from

• SSR represents the sum of the differences between all ෝ𝑦𝑖’s and ‘s



y



y

𝑆𝑆𝑅 = ෍ ෝ𝑦𝑖 − ത𝑦 2

𝑆𝑆𝐸 = ෍ 𝑦𝑖 − ෝ𝑦𝑖 2

ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥𝑖

ෝ𝑦𝑖

Coefficient of Determination yields the percentage of how much of the variability in dependent variable y can be explained by the linear relationship

0

y

x



y



y i



x i



SST   y i  y  

2

Error sum of squares

Regression sum of squares

Total sum of squares



r 2 

SSR

SST

𝑆𝑆𝑅 = ෍ ෝ𝑦𝑖 − ത𝑦 2

𝑆𝑆𝐸 = ෍ 𝑦𝑖 − ෝ𝑦𝑖 2

ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥𝑖

ෝ𝑦𝑖

Coefficient of Determination How well does the estimated regression equation fit the data?

 coefficient of determination, r2, provides measure of goodness of fit

 result yields the percentage of how much of the variability in the y can be explained by the linear relationship

r 2 

SSR

SST

Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx. Find and interpret the coefficient of determination.

Finding Coefficient of Determination Options in MS Excel: Example: Sunflowers.xlsx

1. perform calculations

See Coefficient worksheet

2. use CORREL function

See Coefficient worksheet

3. use scatterplot and show R2

See Graph worksheet

4. use data analysis tool for Regression

See Regression worksheet

Finding Coefficient of Determination Options in MS Excel: Example: Sunflowers.xlsx

1. perform calculations

See Coefficient worksheet

2. use CORREL function

See Coefficient worksheet

3. use scatterplot and show R2

See Graph worksheet

4. use data analysis tool for Regression

See Regression worksheet

Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx. Find and interpret the coefficient of determination.

y = 1.6699x + 0.9645 R² = 0.90418

0

2

4

6

8

10

12

14

0 1 2 3 4 5 6 7

S q

u a

re f

e e

t (i

n T

h o

u s a

n d

s )

o f

S to

re s

Annual Sales (Million $'s) Therefore, 90.42% of the variability in annual sales is explained by the linear relationship with the size of the store.

9.58% of sample variability is due to other factors.

Finding Coefficient of Determination Options in MS Excel: Example: Sunflowers.xlsx

1. perform calculations

See Coefficient worksheet

2. use CORREL function

See Coefficient worksheet

3. use scatterplot and show R2

See Graph worksheet

4. use data analysis tool for Regression

See Regression worksheet

Preferred, as other

useful information

provided by report

Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx. Find and interpret the coefficient of determination.

From MS Excel:

ANOVA

df SS MS F Significance F

Regression 1 105.74761 105.74761 113.233513 1.8227E-07

Residual 12 11.2066762 0.93388968

Total 13 116.954286 Regression Statistics

Multiple R 0.950883275

R Square 0.904179003

Adjusted R Square 0.89619392

Standard Error 0.966379679

Observations 14

Example The sales for Sunf lowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years. As the new director of planning, you believe that the size of the store significantly contributes to store sales, and you want to develop a systematic approach that will lead you to making better forecasts of annual sales based on store size. Data from 14 stores is shown the MS Excel file Sunf lowers.xlsx. Find and interpret the coefficient of determination.

From MS Excel:

ANOVA

df SS MS F Significance F

Regression 1 105.74761 105.74761 113.233513 1.8227E-07

Residual 12 11.2066762 0.93388968

Total 13 116.954286 Regression Statistics

Multiple R 0.950883275

R Square 0.904179003

Adjusted R Square 0.89619392

Standard Error 0.966379679

Observations 14

Therefore, 90.42% of the variability in annual sales is explained by the linear relationship with the size of the store.

9.58% of sample variability is due to other factors.

3. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx. Determine the coefficient of determination and interpret its meaning.

3. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx. Determine the coefficient of determination and interpret its meaning.

ANOVA

df SS MS F Significance F

Regression 1 20535 20535 21.638567 0.000905656

Residual 10 9490 949

Total 11 30025

68.4% of the variability in sales of the pet food can be explained by the linear relationship with the size of shelf space.



r 2 

SSR

SST 

20,535

30,025  0.684

Regression Statistics

Multiple R 0.82700064

R Square 0.683930058

Adjusted R Square 0.652323064

Standard Error 30.8058436

Observations 12

Assumptions

Regression Model Recap:

 assumed regression model:

 use least squares method to develop values for b0 and b1 as estimates for β0 and β1

 coefficient of determination r2 used as measure of goodness of fit of estimated regression equation



y   0 

1 x 



̂y  b 0 b

1 x



r 2 

SSR

SST

Regression Model Recap:

 assumed regression model:

 use least squares method to develop values for b0 and b1 as estimates for β0 and β1

ෝ𝑦𝑖 = 𝑏0 + 𝑏1𝑥

 coefficient of determination r2 used as measure of goodness of fit of estimated regression equation



y   0 

1 x 



r 2 

SSR

SST

WARNING: It is not always

appropriate to use Regression Analysis

Pitfalls in Regression  assuming the relationship between the variables is linear

 lacking awareness of the assumptions of least-squares regression

 not knowing how to evaluate the assumptions of least- squares regression

 using a regression model without knowledge of the subject matter

 extrapolating outside the relevant range

 concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship

Pitfalls in Regression  assuming the relationship between the variables is linear

 lacking awareness of the assumptions of least-squares regression

 not knowing how to evaluate the assumptions of least- squares regression

 using a regression model without knowledge of the subject matter

 extrapolating outside the relevant range

 concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship

Assuming Linear Relationship  quick and easy to check that relationship is linear or

not by drawing scatterplot

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0 0.5 1 1.5 2

0

20

40

60

80

100

120

0 10 20 30 40 50

0

2

4

6

8

10

12

14

0 2 4 6 8

Assuming Linear Relationship  quick and easy to check that relationship is linear or

not by drawing scatterplot

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0 0.5 1 1.5 2

0

20

40

60

80

100

120

0 10 20 30 40 50

0

2

4

6

8

10

12

14

0 2 4 6 8

appears to be a linear

relationship

no obvious pattern

clearly not a linear

relationship

Assuming Linear Relationship  quick and easy to check that relationship is linear or

not by drawing scatterplot

0

0.05

0.1

0.15

0.2

0.25

0.3

0.35

0 0.5 1 1.5 2

0

20

40

60

80

100

120

0 10 20 30 40 50

0

2

4

6

8

10

12

14

0 2 4 6 8

appears to be a linear

relationship

no obvious pattern

clearly not a linear

relationship

✔ Proceed with

regression analysis

Proceed with

caution

✗ Do not use linear

regression analysis

Pitfalls in Regression  assuming the relationship between the variables is linear

 lacking awareness of the assumptions of least-squares regression

 not knowing how to evaluate the assumptions of least- squares regression

 using a regression model without knowledge of the subject matter

 extrapolating outside the relevant range

 concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship

Residual Analysis: Validating Model Assumptions Assumptions about Linear Regression Model

1. Linearity

 relationship between variables is linear; ε is random variable with E(ε) = 0

2. Independence of errors

 values of ε are independent

3. Normality of error:

 ε is a normally distributed variable

4. Equal variance

 variance of ε, denoted by σ2 is the same for all values of x



y   0 

1 x 

Residual Analysis: Validating Model Assumptions Assumptions about Linear Regression Model

1. Linearity

 relationship between variables is linear; ε is random variable with E(ε) = 0

2. Independence of errors

 values of ε are independent

3. Normality of error:

 ε is a normally distributed variable

4. Equal variance

 variance of ε, denoted by σ2 is the same for all values of x



y   0 

1 x 

acronymn LINE to help remember

Residual Analysis: Validating Model Assumptions  in order to check if assumptions about ε are true, need to

analyze the residuals

 residuals: yi − ෝ𝑦𝑖  difference between each value of the dependent variable and its

estimated value as calculated using estimated regression line

 e.g., for a given value of x, the data shows y = 13.2 and the linear regression equation calculates a y value of 13.5. The residual is 13.2 – 13.5 = –0.3

Residual Analysis: Validating Model Assumptions  in order to check if assumptions about ε are true, need to

analyze the residuals

 residuals:

 difference between each value of the dependent variable and its estimated value as calculated using estimated regression line

 e.g., for a given value of x, the data shows y = 13.2 and the linear regression equation calculates a y value of 13.5. The residual is 13.2 – 13.5 = –0.3

 use residual plots and standardized residuals to check assumptions

 if assumptions seem questionable, then:

 results of hypothesis tests may not be valid

 interval estimates may not be accurate

yi − ෝ𝑦𝑖

Residual Analysis: Validating Model Assumptions

Assumption Residual Plot to Check

1. linearity of relationship between variables; ε is random variable, E(ε) = 0

residual plot against x

2. independence: values of ε are independent

plot residuals in order or sequence of data collection (may not always be applicable)

3. normality of ε: ε is a normally distributed variable

standardized residual plot

4. equal variance: variance of ε, denoted by σ2 is the same for all values of x

residual plot against x

Residual Analysis  using Excel Data

Analysis Tool

Check these boxes for residual

analysis

Pitfalls in Regression  assuming the relationship between the variables is linear

 lacking awareness of the assumptions of least-squares regression

 not knowing how to evaluate the assumptions of least- squares regression

 using a regression model without knowledge of the subject matter

 extrapolating outside the relevant range

 concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship

Extrapolating Outside the Relevant Range  need to be careful about making estimates or predictions about

dependent variable based on extreme values of independent variable

 other factors may come into play for extremely small or large values of dependent variable and estimated linear regression may not fit points

e.g., Sunflowers Apparel

 estimated regression equation is y = 0.9645 + 1.6699x

 based on range of values for independent variable

 stores are from 1.1 to 5.8 thousand sq. ft.

 For values of x less than 1.1 thousand sq. ft., relationship between size of store and sales may not be the same

 For values of x greater than 5.8 thousand sq. ft., relationship between size of store and sales may not be the same, especially for much larger values

Pitfalls in Regression  assuming the relationship between the variables is linear

 lacking awareness of the assumptions of least-squares regression

 not knowing how to evaluate the assumptions of least- squares regression

 using a regression model without knowledge of the subject matter

 extrapolating outside the relevant range

 concluding that a significant relationship identified in an observational study is due to a cause-and-effect relationship

Cause-and-Effect Relationship  temptation to attribute cause-and-effect relationship

between variables  i.e., independent variable is causing values of dependent variable

 variables may be related, but necessarily causal  e.g., consider 2 related variables: mean price of houses sold in a

month and mean number of house listings in a month

 could develop model with mean price as independent variable and mean number of listings as dependent variable

 could also develop model with mean number of listings as independent variable and mean price as dependent variable

 both scatterplots would show same pattern and regression analysis would generate same R2 value

 not clear whether there is a causal relationship—could be a other factors causing relationship between variables

Pitfalls in Regression To avoid pitfalls:

1. examine scatterplot of x and y  observe possible relationship

2. Check assumptions of least-squares regression

3. If no violations of assumptions, carry out statistical analysis

 significance tests, confidence and prediction intervals

4. Avoid making predictions outside the relevant range of the independent variable

5. Do not attribute any cause-and-effect relationship to variables

6. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx. a) based on the scatterplot, does it seem reasonable to proceed with

simple linear regression? b) Explain why the regression equation should not be used to estimate

the sales of pet food for 30 units of shelf space? c) Can the manager conclude that more shelf space for the product will

result in more sales? Why or why not?

6. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx. a) based on the scatterplot, does it seem reasonable to proceed with

simple linear regression? b) Explain why the regression equation should not be used to estimate

the sales of pet food for 30 units of shelf space? c) Can the manager conclude that more shelf space for the product will

result in more sales? Why or why not?

a) The dots in the scatterplot appear to be

arranged in a linear pattern, suggesting

a linear relationship: simple linear

regression appears to be appropriate. 0

100

200

300

400

0 5 10 15 20 25

6. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx. a) based on the scatterplot, does it seem reasonable to proceed with

simple linear regression? b) Explain why the regression equation should not be used to estimate

the sales of pet food for 30 units of shelf space? c) Can the manager conclude that more shelf space for the product will

result in more sales? Why or why not?

b) The amount of shelf space in the sample

data ranges from 5 to 20. Therefore, it

would be inadvisable to use the estimated

regression equation to estimate sales for

shelf space outside of this range. Because

30 is outside of this range, it would be

inadvisable to try to estimate sales for 30

units of shelf space.

Shelf Space Sales Aisle Location

5 160 0

5 220 1

5 140 0

10 190 0

10 240 0

10 260 1

15 230 0

15 270 0

15 280 1

20 260 0

20 290 0

20 310 1

6. The marketing manager of a large supermarket chain would like to use shelf space to predict the sales of pet food. A random sample of 12 equal-sized stores is selected, with the results stored in the file Petfood.xlsx. a) based on the scatterplot, does it seem reasonable to proceed with

simple linear regression? b) Explain why the regression equation should not be used to estimate

the sales of pet food for 30 units of shelf space? c) Can the manager conclude that more shelf space for the product will

result in more sales? Why or why not?

c) There is evidence of a positive relationship between the independent

variable shelf space and the dependent variable sales: as shelf space

increases, sales increase, and vice versa. Although, intuitively, based on

this positive relationship, the manager might want to conclude that more

shelf space results in more sales, statistically this would be an invalid

conclusion. Although there is evidence of a positive relationship, it has

not been statistically established that the relationship is causal.