Wk5 DQ - Data Analysis and Business Intelligence

Lind_18e_Chap013_PPT.pptx

Correlation and Linear Regression

Chapter 13

Copyright ©2021 McGraw-Hill Education. All rights reserved. No reproduction or distribution without the prior written consent of McGraw-Hill Education.


In this chapter, we study the relationship between two interval- or ratio-level variables and develop numerical measures to express the relationship between two variables. We also develop an equation to express the relationship between variables. We examine both correlation analysis and regression analysis.

Learning Objectives

LO13-1 Explain the purpose of correlation analysis

LO13-2 Calculate a correlation coefficient to test and interpret the relationship between two variables

LO13-3 Apply regression analysis to estimate the linear relationship between two variables

LO13-4 Evaluate the significance of the slope of the regression equation

LO13-5 Evaluate a regression equation’s ability to predict using the standard error of estimate and the coefficient of determination

LO13-6 Calculate and interpret confidence and prediction intervals

LO13-7 Use a log function to transform a nonlinear relationship

What is Correlation Analysis?

Used to report the relationship between two variables

In addition to graphing techniques, we’ll develop numerical measures to describe the relationships

Examples

Does the amount Healthtex spends per month on training its sales force affect its monthly sales?

Does the number of hours students study for an exam influence the exam score?

CORRELATION ANALYSIS A group of techniques to measure the relationship between two variables.


In all business fields, identifying and studying relationships between variables can provide information on ways to increase profits, methods to decrease costs, or variables to predict demand.

Scatter Diagram

A scatter diagram is a graphic tool used to portray the relationship between two variables

The independent variable is scaled on the X-axis and is the variable used as the predictor

The dependent variable is scaled on the Y-axis and is the variable being estimated

Graphing the data in a scatter diagram will make the relationship between sales calls and copier sales easier to see.


We often begin our study of the relationship between two variables with a scatter diagram. It gives us a visual representation of the relationship between the variables. For instance, a sales manager wants to know if there is a relationship between the number of sales calls made in a month and the number of copiers sold that month and begins the analysis with a random sample of 15 sales representatives. With this data, the number of sales calls is the independent variable and number of copiers sold is the dependent variable.

Scatter Diagram Example

North American Copier Sales sells copiers to businesses of all sizes throughout the United States and Canada. The new national sales manager is preparing for an upcoming sales meeting and would like to impress upon the sales representatives the importance of making an extra sales call each day. She takes a random sample of 15 sales representatives and gathers information on the number of sales calls made last month and the number of copiers sold. Develop a scatter diagram of the data.

Sales reps who make more calls tend to sell more copiers!

We develop a scatter diagram of the data. The first salesperson, Brian Virost, made 96 sales calls and sold 41 copiers; to plot this point, move along the horizontal axis to x = 96, then go vertically to y = 41 and place a dot at that intersection. Do this for all the sales data. It is perfectly reasonable for the manager to tell the salespeople that the more sales calls they make, the more copiers they can expect to sell. Note that while there does seem to be a positive relationship between the two variables, the points do not all fall on a line.

Correlation Coefficient

Characteristics of the correlation coefficient are:

The sample correlation coefficient is identified as r

It shows the direction and strength of the linear relationship between two interval- or ratio-scale variables

It ranges from −1.00 to 1.00

If it’s 0, there is no linear association

A value near 1.00 indicates a direct or positive correlation

A value near −1.00 indicates a negative correlation

CORRELATION COEFFICIENT A measure of the strength of the linear relationship between two variables.

Both variables must be measured on at least an interval scale to find the correlation coefficient. A value of −1 indicates perfect negative correlation and a value of +1 indicates perfect positive correlation.

Correlation Coefficient (2 of 2)

The following graphs summarize the strength and direction of the correlation coefficient

In the set of charts at the bottom of the slide, the first indicates no correlation between the number of children (the independent variable) and income (the dependent variable). The middle chart shows a slightly negative correlation between price and quantity. The chart on the right shows a strong positive relationship between hours studied (the independent variable) and exam score (the dependent variable).

Correlation Coefficient, r

How is the correlation coefficient determined? We’ll use North American Copier Sales as an example. We begin with a scatter diagram, but this time we’ll draw a vertical line at the mean of the x-values (96 sales calls) and a horizontal line at the mean of the y-values (45 copiers).

Drawing lines through the center of the data establishes quadrants. These two variables are positively related when points with an above-average number of sales calls also have an above-average number of copiers sold; these points appear in quadrant I. When the number of sales calls is below the mean and so is the number of copiers sold, the points appear in quadrant III.

Correlation Coefficient, r, Continued

How is the correlation coefficient determined? Now we find the deviations from the mean number of sales calls and from the mean number of copiers sold, then multiply them. The sum of these products is 6,672 and is used in formula 13-1 to find r. We also need the standard deviations. The result, r = .865, indicates a strong, positive relationship.


The correlation coefficient is designated by the letter r and found with equation 13-1. We will use Excel to find the standard deviations of the two variables, x (sales calls) and y (copier sales) to use in the formula.
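
Formula 13-1 itself did not survive in this text. As a minimal sketch, assuming it is the usual sample formula (the sum of the products of the deviations, divided by (n − 1) times the two standard deviations), the calculation might look like this in Python. The function name and the five data pairs are hypothetical; they are not the North American Copier Sales data.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: sum((x - x_bar)(y - y_bar)) / ((n - 1) * s_x * s_y)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of the products of the deviations from each mean
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))


# Hypothetical data for illustration only (not the copier example).
calls = [20, 40, 60, 80, 100]
sold = [30, 35, 50, 45, 60]
r = correlation(calls, sold)
print(round(r, 3))
```

With the copier data, the same steps would use the sum of products 6,672 together with the two standard deviations obtained from Excel.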

Correlation Coefficient Example

The Applewood Auto Group’s marketing department believes younger buyers purchase vehicles on which lower profits are earned and older buyers purchase vehicles on which higher profits are earned. They would like to use this information as part of an upcoming advertising campaign to try to attract older buyers. Develop a scatter diagram and then determine the correlation coefficient. Would this be a useful advertising feature?

The scatter diagram suggests that a positive relationship does exist between age and profit, but it does not appear to be a strong relationship.

Next, calculate r, which is 0.262. The relationship is positive but weak. The data does not support a business decision to create an advertising campaign to attract older buyers!

We use Excel to calculate r; r is .262, which is much closer to zero than to one. We conclude that the relationship between the age of the buyer and the profit on their purchase is not strong.

Testing the Significance of r

Testing the Significance of r Example


The population in this example is all of the salespeople employed by the firm. This is a two-tailed test. We use Appendix B.5 for degrees of freedom n-2=15-2=13 and a level of significance of .05. Use formula 13-2; the result is 6.216. We reject the null hypothesis; there is correlation with respect to the number of sales calls made and the number of copiers sold in the population of salespeople.
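
Formula 13-2 is not reproduced in this text. Assuming it is the usual t test for a correlation coefficient, t = r√(n − 2) / √(1 − r²) with n − 2 degrees of freedom, the result quoted above can be checked with a short Python sketch. The function name is hypothetical, and the critical value 2.160 is an assumed table value (two-tailed, .05 level, 13 degrees of freedom) that is not stated in the notes.

```python
import math


def t_for_correlation(r, n):
    """t statistic for H0: population correlation is zero (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t = t_for_correlation(0.865, 15)  # r and n from the copier example
critical = 2.160                  # assumed table value: two-tailed, .05 level, 13 df
reject_h0 = abs(t) > critical
print(round(t, 3), reject_h0)
```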

Testing the Significance of r Example Continued


Step 5: Make decision; reject H0, t=6.216

Step 6: Interpret; there is correlation with respect to the number of sales calls made and the number of copiers sold in the population of salespeople.

Testing the Significance of the Correlation Coefficient


In the Applewood Auto Group example, we found an r=0.262 which is positive, but rather weak. We test our conclusion by conducting a hypothesis test that the correlation is greater than 0.

This is a one-tailed (right-tailed) test. The degrees of freedom in this test are n − 2 = 180 − 2 = 178. Appendix B.5 does not list 178, so we use 180, which gives a critical value of 1.653. We use formula 13-2 and conclude that the sample correlation is too large to have come from a population with no correlation. Even so, because the relationship is weak, the outcome of a marketing campaign directed to older buyers is uncertain.

Regression Analysis

In regression analysis, we estimate one variable based on another variable

The variable being estimated is the dependent variable

The variable used to make the estimate or predict the value is the independent variable

The relationship between the variables is linear

Both the independent and the dependent variables must be interval or ratio scale

REGRESSION EQUATION An equation that expresses the linear relationship between two variables.


The least squares criterion is used to determine the regression equation.

Least Squares Principle

In regression analysis, our objective is to use the data to position a line that best represents the relationship between two variables

The first approach is to use a scatter diagram to visually position the line

But this depends on judgement; we would prefer a method that results in a single, best regression line

The lines drawn in the chart on the right represent the judgement of four people. The method that results in a single, best regression line is called the least squares principle.

Least Squares Regression Line

To illustrate, the same data are plotted in the three charts below

LEAST SQUARES PRINCIPLE A mathematical procedure that uses the data to position a line with the objective of minimizing the sum of the squares of the vertical distances between the actual y values and the predicted values of y.

The line drawn in chart 13-9 is the best-fitting line and is drawn using the least squares method. It is the best fitting because the sum of the squares of the vertical deviations about it is at a minimum; the sum of the squares is 24. The lines in charts 13-10 and 13-11 were drawn differently, and their sums of squares are 44 and 132, respectively.

Least Squares Regression Line (2 of 2)

Least Squares Regression Line Example

Recall the example of North American Copier Sales. The sales manager gathered information on the number of sales calls made and the number of copiers sold. Use the least squares method to determine a linear equation to express the relationship between the two variables.


The first step is to find the slope of the least squares regression line, b

Next, find a

Then determine the regression line

So if a salesperson makes 100 calls, he or she can expect to sell 46.0432 copiers

The b value of .2608 indicates that for each additional sales call, the sales representative can expect to increase the number of copiers sold by about .2608. So 20 additional sales calls in a month will result in about five more copiers being sold.
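
The formulas for b and a are not reproduced in this text. A minimal Python sketch of the three steps, assuming the standard least squares formulas b = r(s_y / s_x) and a = ȳ − b·x̄, is shown here. The names `fit_line` and `copiers_sold` and the five data pairs are hypothetical; only the final equation, 19.9632 + .2608x, comes from the example.

```python
from statistics import mean, stdev


def fit_line(xs, ys):
    """Return (a, b) for y_hat = a + b * x, with b = r * s_y / s_x and a = y_bar - b * x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # step 1: slope
    a = y_bar - b * x_bar    # step 2: intercept
    return a, b


# Hypothetical data for illustration only (not the copier example).
calls = [20, 40, 60, 80, 100]
sold = [30, 35, 50, 45, 60]
a, b = fit_line(calls, sold)
print(round(a, 4), round(b, 4))


def copiers_sold(sales_calls):
    """Step 3: the regression equation reported for the copier example."""
    return 19.9632 + 0.2608 * sales_calls


print(round(copiers_sold(100), 4))
```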

Drawing the Regression Line

The line of regression is drawn on the scatter diagram. Estimated sales for all sales representatives are calculated using the formula we determined earlier and placed in the table. The regression line will always pass through the mean of variables x and y. Moreover, there is no other line through the data for which the sum of the squared deviations is smaller.

Regression Equation Slope Test

Regression Equation Slope Test Example

This is a one-tailed test. If we do not reject the null hypothesis, we conclude that the slope of the regression line could be zero. We use Excel to determine the needed regression statistics. We find the critical value in Appendix B.5 with n − 2 = 15 − 2 = 13 degrees of freedom and a level of significance of .05; it is 1.771. We reject the null hypothesis and conclude the slope of the line is greater than 0.

Regression Equation Slope Test Example (2 of 2)


Highlighted, b is .2606; the standard error is .0420
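
The test statistic formula for the slope is not shown in this text. Assuming the usual form, t = (b − 0) / s_b with n − 2 degrees of freedom, the decision can be sketched in Python using the two highlighted values and the critical value 1.771 from the one-tailed test at the .05 level with 13 degrees of freedom.

```python
b = 0.2606        # slope from the regression output
s_b = 0.0420      # standard error of the slope
critical = 1.771  # one-tailed, .05 level, 13 degrees of freedom

# Assumed form of the test statistic: (estimate - hypothesized value) / standard error
t = (b - 0) / s_b
reject_h0 = t > critical
print(round(t, 3), reject_h0)
```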

Evaluating a Regression Equation’s Ability to Predict


Perfect prediction is practically impossible in almost all disciplines, including economics and business

The North American Copier Sales example showed a significant relationship between sales calls and copier sales; the equation is

Number of copiers sold = 19.9632 + .2608(Number of sales calls)

What if the number of sales calls is 84? The equation estimates 41.8704 copiers sold, but the two employees who made 84 sales calls sold just 30 and 24

So, is the regression equation a good predictor?

We need a measure that will tell how inaccurate the estimate might be
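
To make the size of these misses concrete, a small Python sketch can compare the estimate for 84 sales calls with the two actual results. The function and variable names are hypothetical; the equation and the sales figures are the ones quoted on this slide.

```python
def predicted_copiers(sales_calls):
    """Regression equation from the copier example."""
    return 19.9632 + 0.2608 * sales_calls


predicted = predicted_copiers(84)
actual_sales = [30, 24]  # the two employees who made 84 calls

# Error of the estimate for each employee: actual minus predicted
errors = [actual - predicted for actual in actual_sales]
print(round(predicted, 4), [round(e, 4) for e in errors])
```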

The measure we’ll use is the standard error of the estimate, s_y·x. We find more information on the next slide.

The Standard Error of Estimate

The standard error of estimate measures the variation around the regression line

It is in the same units as the dependent variable

It is based on squared deviations from the regression line

Small values indicate that the points cluster closely about the regression line

It is computed using the following formula

STANDARD ERROR OF ESTIMATE A measure of the dispersion, or scatter, of the observed values around the line of regression for a given value of x.


The standard error of estimate is the same concept as the standard deviation in chapter 3. The standard deviation measures dispersion around the mean. The standard error of estimate measures dispersion around the regression line for a given value of x.
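
The formula referred to on this slide did not survive in the text. A minimal Python sketch, assuming the standard definition (the square root of the sum of squared deviations from the line divided by n − 2), is given here. The function name, the data, and the line y_hat = 23 + 0.35x are hypothetical and serve only as an illustration.

```python
import math


def standard_error_of_estimate(xs, ys, a, b):
    """sqrt(sum((y - y_hat)^2) / (n - 2)) for the line y_hat = a + b * x."""
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(sse / (len(xs) - 2))


# Hypothetical data and its least squares line (not the copier example).
calls = [20, 40, 60, 80, 100]
sold = [30, 35, 50, 45, 60]
a, b = 23.0, 0.35
s_yx = standard_error_of_estimate(calls, sold, a, b)
print(round(s_yx, 3))
```

Dividing by n − 2 rather than n − 1 reflects the two quantities, a and b, that are estimated from the sample.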

The Standard Error of Estimate Example

The standard error of estimate is 6.720

If the standard error of estimate is small, this indicates that the data are relatively close to the regression line and the regression equation can be used. If it is large, the data are widely scattered around the regression line and the regression equation will not provide a precise estimate of y.


The standard error of estimate can be calculated using statistical software like Excel.

Coefficient of Determination

It ranges from 0 to 1.0

It is the square of the correlation coefficient

It is found from the following formula

In the North American Copier Sales example, the correlation coefficient was .865; just square that, (.865)² = .748; this is the coefficient of determination

This means 74.8% of the variation in the number of copiers sold is explained by the variation in sales calls

COEFFICIENT OF DETERMINATION The proportion of the total variation in the dependent variable Y that is explained, or accounted for, by the variation in the independent variable X.


The coefficient of determination provides a more interpretable measure of a regression equation’s ability to predict. It’s easy to compute too; just square the correlation coefficient.
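
As a brief Python sketch, the squaring step can be checked directly, and the same quantity can be computed as the proportion of explained variation, 1 − SSE / SS total. The second part assumes fitted values from a least squares line; the function name and the observed and fitted values are hypothetical.

```python
# Correlation coefficient reported for the copier example
r = 0.865
r_squared = r ** 2
print(round(r_squared, 3))


def coefficient_of_determination(ys, predictions):
    """1 - SSE / SS total: the share of the variation in y explained by the line."""
    y_bar = sum(ys) / len(ys)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    sse = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    return 1 - sse / ss_total


# Hypothetical observed values and least squares fitted values, for illustration only.
observed = [30, 35, 50, 45, 60]
fitted = [30, 37, 44, 51, 58]
explained = coefficient_of_determination(observed, fitted)
print(round(explained, 3))
```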

Relationships among r, r², and s_y·x

Recall the standard error of estimate measures how close the actual values are to the regression line

When it is small, the two variables are closely related

The correlation coefficient measures the strength of the linear association between two variables

When points on the scatter diagram are close to the line, the correlation coefficient tends to be large

Therefore, the correlation coefficient and the standard error of estimate are inversely related


As the strength of a linear relationship between two variables increases, the correlation coefficient increases and the standard error of the estimate decreases.

Inference about Linear Regression

We can predict the number of copiers sold (y) for a selected value of number of sales calls made (x)

But first, let’s review the regression assumptions of each of the distributions in the graph below


We’ll now relate these assumptions to North American Copier Sales.

Constructing Confidence and Prediction Intervals

Use a confidence interval when the regression equation is used to predict the mean value of y for a given value of x

For instance, we would use a confidence interval to estimate the mean salary of all executives in the retail industry based on their years of experience

Use a prediction interval when the regression equation is used to predict an individual y for a given value of x

For instance, we would estimate the salary of a particular retail executive who has 20 years of experience

Two different predictions can be made for a selected value of the independent variable: a confidence interval and a prediction interval. In a confidence interval, the width of the interval is affected by the level of confidence, the size of the standard error of the estimate, and the size of the sample, as well as the value of the independent variable. The prediction interval is also based on the level of confidence, the size of the standard error of the estimate, the size of the sample, and the value of the independent variable. The difference between formulas 13-11 and 13-12 is the additional 1 under the radical in the prediction interval formula, so the prediction interval will be wider than the confidence interval.

Confidence Interval and Prediction Interval Example

We return to the North American Copier Sales example. Determine a 95% confidence interval for all sales representatives who make 50 calls, and determine a prediction interval for Sheila Baker, a west coast sales representative who made 50 sales calls.

The 95% confidence interval for all sales representatives is 27.3942 up to 38.6122.

The 95% prediction interval for Sheila Baker is 17.442 up to 48.5644 copiers.
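
The interval formulas are not reproduced in this text. The Python sketch below assumes the usual forms: the estimate plus or minus t·s_y·x·√(1/n + (x − x̄)² / Σ(x − x̄)²) for the mean, with an extra 1 under the radical for an individual. Two inputs are assumptions that do not appear in this text: the t value 2.160 (two-tailed, .05 level, 13 degrees of freedom) and the sum of squared deviations of x, 25,600, chosen because it is consistent with the intervals reported above. The function name is hypothetical.

```python
import math


def interval(y_hat, t, s_yx, n, x, x_bar, ss_x, individual=False):
    """Interval for the mean of y at x, or for one individual y when individual=True."""
    extra = 1.0 if individual else 0.0  # the additional 1 under the radical
    half_width = t * s_yx * math.sqrt(extra + 1 / n + (x - x_bar) ** 2 / ss_x)
    return y_hat - half_width, y_hat + half_width


y_hat = 19.9632 + 0.2608 * 50  # estimated copiers sold for 50 sales calls
t = 2.160      # assumed table value: two-tailed, .05 level, 13 df
s_yx = 6.720   # standard error of estimate from the example
n = 15
x_bar = 96     # mean number of sales calls
ss_x = 25_600  # assumed sum of squared deviations of x (not stated in this text)

conf = interval(y_hat, t, s_yx, n, 50, x_bar, ss_x)
pred = interval(y_hat, t, s_yx, n, 50, x_bar, ss_x, individual=True)
print([round(v, 4) for v in conf], [round(v, 4) for v in pred])
```

Because 50 calls is far below the mean of 96, the (x − x̄)² term widens both intervals compared with a value of x near the mean.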

Transforming Data

Regression analysis and the correlation coefficient require the relationship between the variables to be linear

But what if the relationship is not linear?

If the relationship is not linear, we can rescale one or both of the variables so the new relationship is linear

Common transformations include

Computing the log to the base 10 of y, Log(y)

Taking the square root

Taking the reciprocal

Squaring one or both variables

Caution: when interpreting a correlation coefficient or regression equation, keep in mind that the underlying relationship could be nonlinear


For example, instead of using the actual values of the dependent variable y, we would create a new dependent variable by transforming it.

Transforming Data Example

GroceryLand Supermarkets is a regional grocery chain located in the midwestern United States. The director of marketing wishes to study the effect of price on weekly sales of their two-liter private brand diet cola. The objectives of the study are

To determine whether there is a relationship between selling price and weekly sales. Is this relationship direct or indirect? Is it strong or weak?

To determine the effect of price increases or decreases on sales. Can we effectively forecast sales based on the price?

To begin, the company decides to price the two-liter diet cola from $0.50 to $2.00. To collect the data, a random sample of 20 stores is taken and then each store is randomly assigned a selling price.

There is a strong relationship between the two variables. The coefficient of determination is 88.9%, so 88.9% of the variation in Sales is accounted for by the variation in Price. But a careful analysis of the scatter diagram reveals that the relationship may not be linear. That means we need to transform the data.

Transforming Data Example (2 of 3)


A strong, inverse relationship!

Transforming Data Example (3 of 3)


The director of marketing decides to transform the dependent variable, Sales, by taking the logarithm to the base 10 of each sales value. Note the new variable, Log-Sales, in the following analysis as it is used as the dependent variable with Price as the independent variable.


Clearly, as price increases, sales decrease. This relationship will be very helpful to GroceryLand when making pricing decisions for this product.
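
The GroceryLand data are not included in this text, so the following Python sketch uses hypothetical prices and weekly sales to illustrate the idea: take the base-10 logarithm of the dependent variable and compare the correlation before and after the transformation. The `correlation` helper is a hypothetical name implementing the usual sample formula.

```python
import math
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation coefficient r."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return cross / ((n - 1) * stdev(xs) * stdev(ys))


# Hypothetical prices and weekly sales with a curved, decreasing pattern.
prices = [0.50, 1.00, 1.50, 2.00]
sales = [1000, 316, 100, 32]

# New dependent variable: log to the base 10 of each sales value
log_sales = [math.log10(s) for s in sales]

r_raw = correlation(prices, sales)
r_log = correlation(prices, log_sales)
print(round(r_raw, 3), round(r_log, 3))
```

For data shaped like this, the correlation with Log-Sales is closer to −1 than the correlation with raw sales, which is the sign that the transformed relationship is closer to linear.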

Chapter 13 Practice Problems

Question 3

Bi-lo Appliance Super-Store has outlets in several large metropolitan areas in New England. The general sales manager aired a commercial for a digital camera on selected local TV stations prior to a sale starting on Saturday and ending Sunday. She obtained the information for Saturday–Sunday digital camera sales at the various outlets and paired it with the number of times the advertisement was shown on the local TV stations. The purpose is to find whether there is any relationship between the number of times the advertisement was aired and digital camera sales. The pairings are:

What is the dependent variable?

Draw a scatter diagram.

Determine the correlation coefficient.

Interpret these statistical measures.

LO13-2

Question 11


The Airline Passenger Association studied the relationship between the number of passengers on a particular flight and the cost of the flight. It seems logical that more passengers on the flight will result in more weight and more luggage, which in turn will result in higher fuel costs. For a sample of 15 flights, the correlation between the number of passengers and total fuel cost was .667. Is it reasonable to conclude that there is positive association in the population between the two variables? Use the .01 significance level.

LO13-2

Question 17


Bloomberg Intelligence listed 50 companies to watch in 2018 (www.bloomberg.com/features/companies-to-watch-2018). Twelve of the companies are listed here with their total assets and 12-month sales.

Let sales be the dependent variable and total assets the independent variable.

Draw a scatter diagram.

Compute the correlation coefficient.

Determine the regression equation.

For a company with $100 billion in assets, predict the 12-month sales.

LO13-3

Question 23


Refer to Exercise 17. The regression equation is ŷ = 1.85 + .08x, the sample size is 12, and the standard error of the slope is 0.03. Use the .05 significance level. Can we conclude that the slope of the regression line is different from zero?

LO13-4

Question 27


Bradford Electric Illuminating Company is studying the relationship between kilowatt-hours (thousands) used and the number of rooms in a private single-family residence. A random sample of 10 homes yielded the following:

Determine the standard error of estimate and the coefficient of determination. Interpret the coefficient of determination.

LO13-5

Question 33


Bradford Electric Illuminating Company is studying the relationship between kilowatt-hours (thousands) used and the number of rooms in a private single-family residence. A random sample of 10 homes yielded the following:

Determine the .95 confidence interval, in thousands of kilowatt-hours, for the mean of all six-room homes.

Determine the .95 prediction interval, in thousands of kilowatt-hours, for a particular six-room home.

LO13-6

Question 35


Using the following data with x as the independent variable and y as the dependent variable, answer the items.

Create a scatter diagram and describe the relationship between x and y.

Compute the correlation coefficient.

Transform the x variable by squaring each value, x².

Create a scatter diagram and describe the relationship between x² and y.

Compute the correlation coefficient between x² and y.

Compare the relationships between x and y, and x² and y.

Interpret your results.

LO13-7