homework 4
CHAPTER SEVEN
Objective:
A brief introduction to the basic concepts of linear regression analysis, the correlation coefficient, and analysis of variance (ANOVA) for linear regression.
Chapter Content:
Any process is the result of the interaction of many variables. Some of them can assume any value as they interact in the process; these are known as independent variables. Others are influenced by the values of other variables; these are known as dependent variables. In this chapter we will determine whether there is a linear relationship between one independent variable and one dependent variable. We will also be able to establish whether the dependent variable is influenced by the selected independent variable.
Let us say that a production machine has a production level of 95% to 99% and that this level may be influenced by the temperature of the equipment. What do you think the independent variable is? If your answer is the temperature, you are correct, since the influenced variable, the production level, depends on the value of the temperature.
If the production level increases or decreases whenever the temperature increases, the relationship may behave like a linear function. If the production level increases every time the temperature increases, a chart of the data will look similar to the following figure.
[Figure: scatter plot of Production Level (vertical axis) against Temperature (horizontal axis)]
Let us say that the blue dots represent the production level each time the temperature increases. You can see in the chart that a line can be drawn through this pattern. If this is the case, we can determine a linear function that models how the production level responds when the temperature changes.
To do this we will use linear regression to determine the linear function that best reflects the behavior of the variables. To make the results easy to calculate we will use the sum of squares method. This method measures the amount of variation in the variables of the equation. We will use these measures to confirm whether the variation follows a linear path.
The main variations in a linear function are the one generated by the independent variable, known as X; the one generated by the dependent variable, known as Y; and the one generated by the interaction of the two variables.
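[Figure: the line Y = a + bX, with Production Level (Y) on the vertical axis and Temperature (X) on the horizontal axis; a marks the intercept on the Y axis and b marks the slope]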
The linear function can be defined as Y = a + bX, where Y is the value of the dependent variable, X is the value of the independent variable, a is the intercept (the value of Y when X is 0), and b is the slope (the change in Y for each unit change in X).
Knowing the components of the linear function, we can explain how the sum of squares method relates to them. Let us start with the calculation of the variation generated by X, the independent variable.
$$S_{xx} = \sum X^2 - \frac{\left(\sum X\right)^2}{n}$$
In the first term each X value is squared and the squares are added to obtain $\sum X^2$. In the second term the X values are added and their sum is squared to obtain $(\sum X)^2$, which is then divided by n, the number of samples taken for the analysis. The result should never be negative (it is zero only when all the X values are equal).
We continue with the second variable, the dependent variable Y, whose variation is determined as:
$$S_{yy} = \sum Y^2 - \frac{\left(\sum Y\right)^2}{n}$$
In the first term each Y value is squared and the squares are added to obtain $\sum Y^2$. In the second term the Y values are added and their sum is squared to obtain $(\sum Y)^2$, which is then divided by n, the number of samples taken for the analysis. The result should never be negative (it is zero only when all the Y values are equal).
Finally, we determine the variation caused by the interaction of both variables.
$$S_{xy} = \sum XY - \frac{\left(\sum X\right)\left(\sum Y\right)}{n}$$
In the first term each X value is multiplied by its corresponding Y value and the products are added to obtain $\sum XY$. In the second term the sum of the X values is multiplied by the sum of the Y values to obtain $(\sum X)(\sum Y)$, which is then divided by n, the number of samples taken for the analysis. The result can be positive or negative, since it shows the direction of the linear function.
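The three formulas can be sketched in Python as follows. This is only an illustration: the function name `sums_of_squares` and the toy data are assumptions made for the example and do not come from the chapter.

```python
def sums_of_squares(xs, ys):
    """Return (Sxx, Syy, Sxy) using the shortcut formulas from the text."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    # Sum of the squares minus the squared sum divided by n.
    sxx = sum(x * x for x in xs) - sum_x ** 2 / n
    syy = sum(y * y for y in ys) - sum_y ** 2 / n
    # Sum of the products minus the product of the sums divided by n.
    sxy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    return sxx, syy, sxy


# Hypothetical toy data, chosen only to keep the arithmetic small.
toy_x = [1, 2, 3, 4]
toy_y = [2, 4, 5, 8]
toy_sxx, toy_syy, toy_sxy = sums_of_squares(toy_x, toy_y)
print(toy_sxx, toy_syy, toy_sxy)  # 5.0 18.75 9.5
```

By hand, the toy data gives Sxx = 30 - 100/4 = 5, Syy = 109 - 361/4 = 18.75 and Sxy = 57 - 190/4 = 9.5, which matches the printed values.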
If we use the following data, where X shows the change in temperature and Y shows the production level achieved for each X, we can fill in the table below. The first two columns hold X and Y, the next two columns hold their squares, and the fifth column holds the product of X and Y. The last row is the total of each column.
Delta Temp (X)  Production (Y)  X^2       Y^2     XY
             1            1250    1   1562500   1250
             2            1590    4   2528100   3180
             3            1340    9   1795600   4020
             4            1510   16   2280100   6040
             5            1486   25   2208196   7430
             6            1440   36   2073600   8640
Total       21            8616   91  12448096  30560
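The columns and totals of the table can be reproduced with a short Python sketch. The variable names (`delta_temp`, `production`, `rows`, `totals`) are illustrative choices; the data values are the ones in the table.

```python
# Data from the table: change in temperature (X) and production level (Y).
delta_temp = [1, 2, 3, 4, 5, 6]
production = [1250, 1590, 1340, 1510, 1486, 1440]

# One row per sample: X, Y, X^2, Y^2, XY.
rows = [(x, y, x * x, y * y, x * y) for x, y in zip(delta_temp, production)]

# Column totals, obtained by transposing the rows and summing each column.
totals = tuple(sum(column) for column in zip(*rows))

row_format = "{:>3} {:>6} {:>4} {:>9} {:>6}"
print(row_format.format("X", "Y", "X^2", "Y^2", "XY"))
for row in rows:
    print(row_format.format(*row))
print(row_format.format(*totals))
```

The last printed line holds the totals 21, 8616, 91, 12448096 and 30560, the same values as the bottom row of the table.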
We proceed to substitute the corresponding values in the formulae.
$$S_{xx} = 91 - \frac{21^2}{6} = 17.5$$
$$S_{yy} = 12448096 - \frac{8616^2}{6} = 75520$$
$$S_{xy} = 30560 - \frac{(21)(8616)}{6} = 404$$
Please note that Sxx and Syy are positive values, as they must be; a negative value would indicate an arithmetic mistake. Also note that Sxy is positive. This means that the slope of the linear function will be positive, so the fitted line predicts a higher Y as X increases.
With the three sums of squares calculated, we can calculate the least-squares estimator for β, the slope of the linear function (b), by dividing Sxy by Sxx:
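As a cross-check of the substitution, the same arithmetic can be written in Python starting from the column totals. The variable names are illustrative; the totals are the ones from the data table.

```python
n = 6                       # number of samples
sum_x, sum_y = 21, 8616     # totals of the X and Y columns
sum_x2, sum_y2 = 91, 12448096
sum_xy = 30560

sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n
sxy = sum_xy - sum_x * sum_y / n

# Sxx and Syy can never be negative; a negative value signals an arithmetic slip.
assert sxx >= 0 and syy >= 0

print(sxx, syy, sxy)  # 17.5 75520.0 404.0
```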
b = Sxy / Sxx = 404 /17.5 = 23.09
Then, to determine the least-squares estimator for α, the intercept a, we first need to calculate the averages of X and Y.
$$\bar{X} = \frac{21}{6} = 3.5, \qquad \bar{Y} = \frac{8616}{6} = 1436$$
$$a = \bar{Y} - b\bar{X} = 1436 - (23.09)(3.5) = 1355.2$$
Therefore, having obtained the slope b and the intercept a, we can write the linear function as:
Y = 1355.2 + 23.09X
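The slope, the intercept and the resulting function can be sketched in Python as below. The helper `predict` is a hypothetical name introduced for this illustration; the inputs are the totals and sums of squares calculated above.

```python
n = 6
sum_x, sum_y = 21, 8616
sxx, sxy = 17.5, 404.0

b = sxy / sxx               # slope
x_bar = sum_x / n           # average of X
y_bar = sum_y / n           # average of Y
a = y_bar - b * x_bar       # intercept


def predict(x):
    """Production level forecast by the fitted line for a temperature change x."""
    return a + b * x


print(round(b, 2), round(a, 1))  # 23.09 1355.2
print(round(predict(3.5), 1))    # 1436.0
```

The second printed value illustrates a general property of the least-squares line: it passes through the point formed by the two averages, here (3.5, 1436).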
To determine how effective this function is at forecasting the behavior of Y when X changes, a correlation analysis has to be performed. This analysis provides a factor that represents the interdependence of the two variables. Known as the correlation coefficient (r), it measures the linear relationship between two variables within a range of -1 to 1, with 0 indicating no linear relationship.
$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{404}{\sqrt{(17.5)(75520)}} = 0.35$$
For our course, we will use the following interpretation of the correlation coefficient, splitting the range into three sections.
· If r is close to 1 (between 0.5 and 1), then we can infer that there is a linear relationship with a positive slope. A positive slope means that whenever the X value increases the Y value increases.
[Figure: Y against X, line with a positive slope]
· If r is close to -1 (between -1 and -0.5), then we can infer that there is a linear relationship with a negative slope. A negative slope means that whenever the X value increases the Y value decreases.
[Figure: Y against X, line with a negative slope]
· If r is close to 0 (between -0.49 and 0.49), then there is no linear relationship between the two variables. This means that the Y value does not depend linearly on the X value.
For our example, 0.35 falls within the range of -0.49 to 0.49, which means that there is no linear relationship between these two variables.
There is another way to determine whether there is a linear relationship between two variables. Analysis of variance (ANOVA) for linear regression allows us to test the relationship at a chosen confidence coefficient.
Using the following table, we can define the sums of squares required to calculate the experimental F value and compare it to a critical F value based on the confidence coefficient:
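The correlation coefficient and the three-range interpretation used in this course can be sketched in Python as follows. The function names `correlation` and `interpret` are hypothetical, and treating everything strictly between -0.5 and 0.5 as "no linear relationship" is an assumption made here to cover values between 0.49 and 0.5.

```python
import math


def correlation(sxx, syy, sxy):
    """Correlation coefficient r = Sxy / sqrt(Sxx * Syy)."""
    return sxy / math.sqrt(sxx * syy)


def interpret(r):
    """Apply the three ranges used in this course (cutoffs at -0.5 and 0.5 assumed)."""
    if r >= 0.5:
        return "linear relationship with a positive slope"
    if r <= -0.5:
        return "linear relationship with a negative slope"
    return "no linear relationship"


r = correlation(17.5, 75520, 404)
print(round(r, 2), "->", interpret(r))  # 0.35 -> no linear relationship
```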
Variation   Sum of            Degrees of  Mean
Source      Squares           Freedom     Square             Exp. F
Regression  SSR = b*Sxy       1           MSR = SSR/1        MSR/MSE
Error       SSE = Syy - SSR   n-2         MSE = SSE/(n-2)
Example:
Variation   Sum of                       Degrees of  Mean
Source      Squares                      Freedom     Square                  Exp. F
Regression  SSR = 23.09*404 = 9328       1           MSR = 9328/1 = 9328     9328/16548 = 0.564
Error       SSE = 75520 - 9328 = 66192   6-2 = 4     MSE = 66192/4 = 16548
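The ANOVA quantities in the example can be recomputed with a short Python sketch. The variable names are illustrative. Because the code keeps the slope unrounded instead of using 23.09, SSR comes out near 9326.6 rather than 9328, while the experimental F still rounds to 0.564.

```python
n = 6
sxx, syy, sxy = 17.5, 75520.0, 404.0

b = sxy / sxx            # slope, kept unrounded here
ssr = b * sxy            # regression sum of squares
sse = syy - ssr          # error sum of squares
msr = ssr / 1            # regression mean square, 1 degree of freedom
mse = sse / (n - 2)      # error mean square, n - 2 degrees of freedom
f_exp = msr / mse        # experimental F value

print(round(ssr, 1), round(sse, 1), round(mse, 1))
print(round(f_exp, 3))   # 0.564
```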
We can refer to the hypothesis test conclusions from the last chapter to interpret the results of the ANOVA. First of all, the probability distribution being used is the F distribution, which is used to compare the variances of two populations. Let us say that the requested confidence coefficient is 95%; this means that alpha, the probability of a type I error, is 5%. To find the critical value of the F distribution we then need the two values of degrees of freedom.
For the regression source of variation, the degrees of freedom will always be 1, while for the error source of variation the degrees of freedom will be n-2, in our example 6-2 = 4. So if we go to table 6 (page 694), we will find that for df1 = 1, df2 = 4 and alpha = 0.05 the F value is 7.71.
If the experimental value of F is higher than the critical value, we can infer that there is a linear relationship between the two variables. If the experimental value is lower than or equal to the critical F value, we can infer that there is no linear relationship between the two variables.
Exp F > Critical F, There is linear relationship
Exp F ≤ Critical F, There is no linear relationship
It is important to clarify that the F distribution cannot assume negative values, since it is the ratio of two quantities built from squared values. If a negative value is obtained, a calculation mistake has been made. By locating the experimental F value on the distribution curve, we find that the obtained value of 0.564 is lower than the critical value of 7.71, and we conclude that there is no linear relationship between the production level and the temperature.
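The decision rule can be written as a small Python function. This is a sketch: the name `f_test_decision` is hypothetical, and the critical value 7.71 is typed in from the F table (df1 = 1, df2 = 4, alpha = 0.05) rather than computed.

```python
def f_test_decision(f_exp, f_crit):
    """Compare the experimental F value with the critical F value."""
    if f_exp < 0:
        # An F value is a ratio of nonnegative quantities, so this is a mistake.
        raise ValueError("F cannot be negative; check the calculations")
    if f_exp > f_crit:
        return "There is linear relationship"
    return "There is no linear relationship"


F_CRITICAL = 7.71  # from the F table: df1 = 1, df2 = 4, alpha = 0.05
print(f_test_decision(0.564, F_CRITICAL))  # There is no linear relationship
```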
(The F distribution can only assume values from 0 to infinity.)
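[Figure: F Distribution Curve, from 0 to ∞. The critical F = 7.71 separates the 95% region on the left from the 5% region on the right; Exp F = 0.564 falls in the 95% region.]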