5080 UNIT 1414

Lesson Guide

Assess the differences between correlation and causation.

Explain the correlation of data points to a given equation.

Determine assumptions of the regression model.

Regression Models

In business and government, situations occur that lead observers to wonder “is the change in X related to the change in Y?” Indeed, they may see Y change every time X does and already conclude that the two are correlated, yet be unable to say how strongly without analysis.

Once you have seen one quantity change along with another in your field, the next question is “how much will Y change when X changes?” Data that cluster around a sloping line can give a mathematical answer to that question, because every non-vertical straight line on a graph has an equation of the form y = mx + b (these terms will be defined shortly). So you do not have to guess at how the data change. You can use regression analysis to calculate the change. There are two reasons to use regression analysis:

1. to understand the relationship between variables as shown by a collected pattern of data, and

2. to predict the value of one variable if the value of the other variable is known or set.

This unit walks you through taking a scatter plot of data and using a line (simple linear regression) to model the correlation, predict an unknown value of Y for a given value of X, and determine how well the linear equation fits the data, in terms of error and standard deviation. The error and standard deviation could be very small, showing that the linear equation fits well and the data points cluster closely around the line. Or the error and standard deviation could be large, showing that the predictions will not be very reliable and leading you to wonder whether you have the right linear equation modeling the data.

Note the pattern of the six data points (the local payroll amounts). When plotted on a graph of X/Y axes, these form a scatter plot, which can be modeled by a line with a certain position and slope. Of course, the standard mathematics line equation, y = mx + b, can model this line and any other straight line; what differs from line to line is the values of m and b. In terms of linear regression, the equation for a line becomes:

Y = β0 + β1X + ϵ , where

Y = dependent / response variable

X = independent variable

β0 = Y-intercept (the value of Y when X = 0)

β1 = slope of the line

ϵ = random error

In a linear regression model, the true values of β0 and β1 are not known, but they are estimated with sample data. Rewrite the linear (regression) equation based on sample data as:

Ŷ = b0 + b1X, where

Ŷ = predicted value of Y

b0 = estimate of β0 based on sample results

b1 = estimate of β1 based on sample results

You could draw a line by eye and report that you have a close model, but to be accurate you must determine the position of the line with minimal error. Error is defined in a common-sense way:

Error = actual value – predicted value

And in terms of linear regression equations, this means:

e = Y – Ŷ

Errors are squared so that an error in the negative direction does not cancel out an error in the positive direction, which would make the predicted values look more accurate than they are. The best regression line, then, is the one with the minimum sum of squared errors, which is why regression analysis is also termed least-squares regression.
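
The cancellation problem can be illustrated with a short Python sketch. The six (payroll, sales) pairs below are not listed in this guide; they are assumed values chosen to be consistent with the Triple A Construction line Ŷ = 2 + 1.25X quoted later in this unit. The flat line at Y = 7 is a hypothetical candidate added only for comparison, and the function names are illustrative.

```python
# Assumed sample data (not given in this guide): payroll is X, sales is Y.
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]

def sum_of_errors(b0, b1, xs, ys):
    """Plain sum of e = Y - Y_hat; positive and negative errors can cancel."""
    return sum(y - (b0 + b1 * x) for x, y in zip(xs, ys))

def sum_of_squared_errors(b0, b1, xs, ys):
    """Sum of e squared; every term is non-negative, so nothing cancels."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

flat_line = (7.0, 0.0)     # hypothetical candidate: Y_hat = 7 for every X
fitted_line = (2.0, 1.25)  # the line quoted in this unit: Y_hat = 2 + 1.25X

for name, (b0, b1) in [("flat", flat_line), ("fitted", fitted_line)]:
    print(name,
          sum_of_errors(b0, b1, payroll, sales),
          sum_of_squared_errors(b0, b1, payroll, sales))
# flat 0.0 22.5
# fitted 0.0 6.875
```

Under these assumptions both lines have a plain error sum of zero, so only the squared errors reveal that the fitted line is the better model.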

You find b0 and b1 from the averages of X and Y. Compute each average by summing all the values and dividing the sum by the number of values, then substitute the averages into the formulas for b1 and b0:

b1 = Σ (X – average of X values)(Y – average of Y values) / Σ (X – average of X values)²

b0 = (average of Y values) – b1 (average of X values)

For the example data, the equation Ŷ = b0 + b1X becomes Ŷ = 2 + 1.25X, or sales = 2 + 1.25 (payroll), which lets you estimate the predicted value of sales for whatever amount the payroll is set to. As noted, finding the numbers for the linear regression equation also shows the relationship between the variables. Here, you can see how sales should move, given certain payroll amounts (do not forget that payroll is in units of hundreds of millions and sales is in units of hundreds of thousands).
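
As a minimal sketch, the averages-based calculation of b0 and b1 can be written in Python. The data values are assumed (they are not listed in this guide) and were chosen because they reproduce the line Ŷ = 2 + 1.25X stated above; `fit_line` is an illustrative name, not a library function.

```python
# Assumed sample data (not given in this guide): payroll is X, sales is Y.
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]

def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line Y_hat = b0 + b1 * X."""
    n = len(xs)
    x_bar = sum(xs) / n  # average of X: the sum divided by the count
    y_bar = sum(ys) / n  # average of Y
    # b1 = sum((X - x_bar)(Y - y_bar)) / sum((X - x_bar)^2)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    # b0 = y_bar - b1 * x_bar, so the line passes through (x_bar, y_bar)
    b0 = y_bar - b1 * x_bar
    return b0, b1

b0, b1 = fit_line(payroll, sales)
print(b0, b1)       # 2.0 1.25

# Predict sales for a payroll value of 6 (same units as the data).
print(b0 + b1 * 6)  # 9.5
```

With payroll in hundreds of millions and sales in hundreds of thousands, a payroll value of 6 gives predicted sales of 9.5 in the sales units.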

Measuring the Fit

As previously addressed, you can try several linear regression equations and settle on one that appears to fit well, but the question of how much error remains will persist and can be argued over. Analysts therefore want to measure how much error an equation has and to find the equation with the smallest error. To address these issues and ward off objections to calculations, analysts developed the sum of squares total (SST), the sum of squares error (SSE), and the sum of squares regression (SSR), along with methods to test for significance.

The reason you square terms in these equations, as you have in past units, is that an error with a negative value may cancel out an error with a positive value when these are added, making the regression equation appear to have a smaller error than it really has. Squared terms are never negative, so that problem is eliminated by converting formulas for error to those where error values are squared.

So:

Sum of squares total = SST = Σ (Y – average of Y values)²

Sum of squares error = SSE = Σ e² = Σ (Y – Ŷ)²

The sum of squares regression (SSR) shows how much of the variability in Y is explained by the regression equation:

SSR = Σ (Ŷ – average of Y values)²

These sums are related: SST = SSR + SSE. As noted in the textbook on page 118 (Render et al., 2015), SSR can be seen as the explained variability in Y and SSE as the unexplained variability in Y. The proportion of the total variability (SST) that is explained (SSR) is called the coefficient of determination, r², and is calculated with SST, SSE, and SSR like this:

r² = SSR / SST = 1 – SSE / SST

The value of r² is the proportion of the variability in Y that is explained by the regression equation, such as the one developed from payroll (X) in the Triple A Construction Company example.
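
The three sums of squares and the coefficient of determination can be checked with a short Python sketch. The data values are assumed (they are not listed in this guide) and are consistent with the line Ŷ = 2 + 1.25X quoted earlier; the function name is illustrative.

```python
# Assumed sample data (not given in this guide): payroll is X, sales is Y.
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]

def sums_of_squares(xs, ys, b0, b1):
    """Return (SST, SSE, SSR) for the line Y_hat = b0 + b1 * X."""
    y_bar = sum(ys) / len(ys)
    y_hat = [b0 + b1 * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)             # total variability
    sse = sum((y - p) ** 2 for y, p in zip(ys, y_hat))  # unexplained
    ssr = sum((p - y_bar) ** 2 for p in y_hat)          # explained
    return sst, sse, ssr

sst, sse, ssr = sums_of_squares(payroll, sales, 2.0, 1.25)
print(sst, sse, ssr)         # 22.5 6.875 15.625

r_squared = ssr / sst        # equivalently 1 - sse / sst
print(round(r_squared, 4))   # 0.6944
```

For these assumed values, about 69% of the variability in sales is explained by the regression on payroll. The identity SST = SSR + SSE holds here because the line is the least-squares line; it does not hold in general for an arbitrary line.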

This discussion can now tie in the title of the unit, Correlation. The coefficient of correlation, r, is the square root of r² and takes the same sign as the slope b1; it shows the strength of the linear relationship between X and Y. Note the four examples in Figure 4.3 of the textbook, and how in two cases, (a) and (d), the data points fall exactly on a line, so that line has perfect correlation.

Because the line can slope upward or downward and still be a perfect correlation, r can be any number from –1 to +1, inclusive.
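
A small Python sketch can show how r behaves at and between the two extremes. It computes r directly from the data, which gives the same magnitude as the square root of r² and carries the sign of the slope. The payroll and sales values are assumed (not listed in this guide), and the two three-point data sets are hypothetical examples of perfectly aligned points.

```python
import math

def correlation(xs, ys):
    """Coefficient of correlation r; its sign matches the slope of the line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical points lying exactly on a rising line and on a falling line.
print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0
print(correlation([1, 2, 3], [6, 4, 2]))  # -1.0

# Assumed sample data (not given in this guide): payroll is X, sales is Y.
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]
r = correlation(payroll, sales)
print(round(r, 4))                        # 0.8333
```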

Checking Significance

As you can imagine, there are cases where the available samples are too small for these measures of fit to be reliable on their own. In these cases, another satisfactory method is to check the regression for significance. For this, use the F distribution shown in Unit III. As mentioned, analysts turn to the F distribution because it describes the ratio of two variances, which you will now use to check for significance.

In short, the F statistic is: F = MSR / MSE, where MSR is the mean squared regression, SSR / k, and MSE is the mean squared error, SSE / (n – k – 1), for n observations and k independent variables.

If MSE is very small in relation to MSR (meaning the F statistic is large), then the regression equation leaves little unexplained error and the equation is useful. A large F statistic indicates that the relationship found is unlikely to have occurred by chance. Again, note that F statistic tables have been calculated and published, reducing the amount of mathematics required to check for significance.
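
The significance check can be sketched in Python as follows. It assumes the usual definitions MSR = SSR / k and MSE = SSE / (n – k – 1), with n observations and k independent variables. The data values are assumed (not listed in this guide), and the critical value 7.71 is taken from a standard published F table for a 0.05 significance level with 1 and 4 degrees of freedom; the function name is illustrative.

```python
# Assumed sample data (not given in this guide): payroll is X, sales is Y.
payroll = [3, 4, 6, 4, 2, 5]
sales = [6, 8, 9, 5, 4.5, 9.5]

def f_statistic(xs, ys, b0, b1, k=1):
    """Return F = MSR / MSE for the line Y_hat = b0 + b1 * X."""
    n = len(xs)
    y_bar = sum(ys) / n
    y_hat = [b0 + b1 * x for x in xs]
    ssr = sum((p - y_bar) ** 2 for p in y_hat)
    sse = sum((y - p) ** 2 for y, p in zip(ys, y_hat))
    msr = ssr / k            # mean squared regression, k degrees of freedom
    mse = sse / (n - k - 1)  # mean squared error, n - k - 1 degrees of freedom
    return msr / mse

f_value = f_statistic(payroll, sales, 2.0, 1.25)
print(round(f_value, 2))     # 9.09

# Critical value from a published F table (alpha = 0.05, df1 = 1, df2 = 4).
F_CRITICAL = 7.71
print(f_value > F_CRITICAL)  # True
```

Because the computed F exceeds the table value under these assumptions, the relationship between payroll and sales would be judged significant at the 0.05 level.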