Six Sigma, Statistics expert needed
Regression
Simple Linear Least Squares Regression
Simple linear regression is used to estimate the coefficients of the model
$y_i = a + b x_i + e_i$ (14.1)
where $y_i$ is the dependent variable,
$x_i$ is the independent variable, and
$e_i$ is the residual, the error in the fit of the model.
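As a minimal sketch of how the coefficients a and b in (14.1) can be estimated, the closed-form least squares solution is shown here in plain Python. The function name `fit_simple_linear` and the sample data are hypothetical and do not come from the slides.

```python
def fit_simple_linear(x, y):
    """Return (a, b) that minimize sum((y_i - a - b*x_i)**2)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of cross products and squares about the means
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx          # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


# Hypothetical data, chosen only for illustration
x = [28, 32, 36, 40, 44]
y = [118, 127, 139, 148, 158]

a, b = fit_simple_linear(x, y)
# Residuals e_i = y_i - (a + b*x_i); with an intercept they sum to zero
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
print(f"a = {a:.3f}, b = {b:.3f}")
print("residuals:", [round(e, 3) for e in residuals])
```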
[Figure: plot of Dependent Variable (110 to 160) versus Independent Variable (28 to 44)]
Multiple Regression
$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki} + e_i$
where $y_i$ is the dependent variable,
$x_{1i}, x_{2i}, \dots, x_{ki}$ are the independent variables, and
$e_i$ is the residual, the error in the fit of the model.
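The multiple regression model can be fitted the same way by stacking the predictors into a design matrix. The sketch below assumes NumPy is available and uses hypothetical, noiseless data so that the recovered coefficients are easy to check. None of the numbers come from the slides.

```python
import numpy as np

# Hypothetical predictors and a noiseless response y = 1.5 + 2.0*x1 - 0.5*x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 1.5 + 2.0 * x1 - 0.5 * x2

# Design matrix: a column of ones for b0, then one column per predictor
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares solution for [b0, b1, b2]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

print("coefficients:", np.round(coef, 6))
print("largest |residual|:", float(np.max(np.abs(residuals))))
```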
Regression Assumptions
• Linearity of the relationship between dependent and independent variables
• Constant variance of the errors
• Independent predictors
• Independence of the dependent values over time
• Normality of the error distribution
Linearity of the Relationship Between Dependent and Independent Variables
• Nonlinear relationship
  • Problems
    • Significant predictors may have non-significant p-values
    • Prediction and confidence limits are incorrect
  • What to do
    • Always plot each predictor individually and perform a visual check
    • Transform data using the appropriate non-linear model
• Interactions
  • Problems
    • The interaction adds to the statistical noise
    • A larger sample size is needed to detect statistically significant main effects
    • Significant factors may be missed if the interaction has a large effect
  • What to do
    • Do not use the mean square as an estimate for noise in the system
    • Do not use the standard error of the coefficient as an estimate for noise in the system
    • A pattern can be detected on individual scatter plots
Nonlinear Relationship
[Figure: Fitted Line Plot of y versus x1, y = 217.1 - 0.999 x1; S = 175.244, R-Sq = 0.6%, R-Sq(adj) = 0.0%]
Nonlinear Relationship
• Solution
  • Add x^2 as a predictor
  • Y = c0 + c1 x + c2 x^2 + error
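A minimal sketch of the quadratic fit follows, assuming NumPy is available. It uses only the seven (x, y) rows listed in this deck's Excel Analysis Tool Pak example, so the coefficients will be close to, but not identical to, the 48-row result reported there.

```python
import numpy as np

# The seven rows shown in the deck (the full data set has 48 rows)
x = np.array([33, 13, 3, 9, 2, 40, 7], dtype=float)
y = np.array([64.4, 144.5, 484.2, 256.5, 529.8, 225.3, 324.7])

# A straight line misses the curvature
slope, intercept = np.polyfit(x, y, 1)
rss_line = float(np.sum((y - (intercept + slope * x)) ** 2))

# Adding x^2 as a predictor: polyfit returns [c2, c1, c0] for degree 2
c2, c1, c0 = np.polyfit(x, y, 2)
rss_quad = float(np.sum((y - (c0 + c1 * x + c2 * x**2)) ** 2))

print(f"Y = {c0:.2f} + {c1:.3f} x + {c2:.4f} x^2")
print(f"residual sum of squares: line {rss_line:.1f}, quadratic {rss_quad:.3f}")
```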
Excel
1. Draw a scatter graph
2. Add a trend line
3. Select the appropriate non-linear model
Limitations
1. Only works with a single predictor
2. Only handles 5 types of non-linear equations
Excel Analysis Tool Pak
1. Add additional columns to model desired equation
2. Perform regression with Analysis Tool Pak
   1. Tools
   2. Data Analysis (May need to select using add-ins menu)
   3. Regression

x     x^2    y
33    1089   64.4
13    169    144.5
3     9      484.2
9     81     256.5
2     4      529.8
40    1600   225.3
7     49     324.7
Excel Analysis Tool Pak
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.999999
R Square            0.999997
Adjusted R Square   0.999997
Standard Error      0.29061
Observations        48

ANOVA
             df   SS        MS         F         Significance F
Regression   2    1421905   710952.5   8418177   4.0427E-126
Residual     45   3.80045   0.084454
Total        47   1421909

            Coefficients  Standard Error  t Stat     P-value   Lower 95%     Upper 95%    Lower 95.0%   Upper 95.0%
Intercept   625.6399      0.13127         4766.067   5.7E-130  625.3754741   625.9042553  625.3754741   625.9042553
x           -50.00772     0.012358        -4046.525  0%        -50.03260973  -49.9828284  -50.03260973  -49.9828284
x^2         1.000023      0.000245        4089.723   0%        0.999530235   1.000515216  0.999530235   1.000515216
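The entries in the summary output are related by simple arithmetic, which can be useful when checking a report. The sketch below recomputes several of them from the degrees of freedom, sums of squares, and coefficients printed above. The variable names are illustrative, and small differences from the printed values come from rounding in the table.

```python
import math

# Values copied from the summary output
n_obs = 48
df_reg, ss_reg = 2, 1421905.0
df_res, ss_res = 45, 3.80045
ss_total = ss_reg + ss_res          # table shows 1421909 after rounding

ms_reg = ss_reg / df_reg            # mean square, regression
ms_res = ss_res / df_res            # mean square, residual
f_stat = ms_reg / ms_res            # F statistic
std_error = math.sqrt(ms_res)       # "Standard Error" of the regression
r_sq = 1 - ss_res / ss_total        # R Square
adj_r_sq = 1 - (ss_res / df_res) / (ss_total / (n_obs - 1))  # Adjusted R Square

# t Stat is the coefficient divided by its standard error
t_intercept = 625.6399 / 0.13127

print(f"MS residual    = {ms_res:.6f}")
print(f"F              = {f_stat:.0f}")
print(f"Standard Error = {std_error:.5f}")
print(f"R Square       = {r_sq:.6f}, adjusted = {adj_r_sq:.6f}")
print(f"t (Intercept)  = {t_intercept:.1f}")
```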
Limitations
1. You must transform data
Also gives all statistical graphs
Nonlinear Relationship Multiple predictors
[Figure: Residual Plots for T (Kelvin). Panels: Normal Probability Plot (Percent vs Residual), Versus Fits (Residual vs Fitted Value), Histogram (Frequency vs Residual), Versus Order (Residual vs Observation Order)]
Nonlinear Relationship Multiple predictors
[Figure: Scatterplot of T (Kelvin) vs P (Atm), V (Liter), n (Moles), x1, x2, x3]
Source of non-linearity may not be obvious
Nonlinear Relationship Multiple predictors
• The regression correctly identified the predictors
• How good is the equation?
  • Maximum temperature = 44.5
  • Minimum temperature = 19.95
  • The maximum residual is 1.86
  • The worst error is 6.1% of the temperature range
• Use your engineering judgment
  • Is 6.1% error OK?
  • The error is worse when low and high temperatures are predicted
• What if we extrapolate?
  • We are doomed!

P (Atm)           4
V (Liter)         4
n (Moles)         2
Predicted T (K)   75.94
True T (K)        97.49
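The size of the extrapolation error can be checked with a few lines of arithmetic. The sketch below assumes the "True T" value comes from the ideal gas law, T = PV/(nR), with R = 0.082057 L·atm/(mol·K). The slides do not state this, but that formula reproduces the 97.49 K shown. The function name is hypothetical.

```python
R = 0.082057  # gas constant in L·atm/(mol·K); assumed, not stated in the slides


def ideal_gas_temperature(p_atm, v_liter, n_moles):
    """Temperature in Kelvin from the ideal gas law, T = P*V / (n*R)."""
    return p_atm * v_liter / (n_moles * R)


true_t = ideal_gas_temperature(4, 4, 2)   # extrapolation point from the slide
predicted_t = 75.94                       # regression prediction from the slide

error = true_t - predicted_t
print(f"True T      = {true_t:.2f} K")
print(f"Predicted T = {predicted_t:.2f} K")
print(f"Error       = {error:.2f} K ({100 * error / true_t:.1f}% of the true value)")
```

The fitted range of temperatures was 19.95 to 44.5, so this point lies far outside the data the model was built on.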
Nonlinear Relationship Interactions
• Y = B + E - 5BE

 A    B    C    D    E    F    Output
-1   -1   -1    1    1    1     5.14
 1   -1   -1   -1   -1    1    -6.92
-1    1   -1   -1    1   -1    -2.89
 1    1   -1    1   -1   -1     5.26
-1   -1    1    1   -1   -1    -6.73
 1   -1    1   -1    1   -1     5.15
-1    1    1   -1   -1    1     5.30
 1    1    1    1    1    1    -2.55
 1    1    1    1    1    1    -2.76
Nonlinear Relationship Interactions
The interaction is treated as experimental error
This increase in error masks the effect of the coefficients
Neither B nor E appears as statistically significant
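The effect of leaving out the interaction can be sketched with the nine runs listed in the table for Y = B + E - 5BE, assuming NumPy is available. To keep the example short, only columns B and E are used as predictors, which is a simplification of the six-factor regression in the slides. The helper `fit_rss` is hypothetical.

```python
import numpy as np

# Columns B and E and the Output from the nine-run table
B = np.array([-1, -1, 1, 1, -1, -1, 1, 1, 1], dtype=float)
E = np.array([1, -1, 1, -1, -1, 1, -1, 1, 1], dtype=float)
y = np.array([5.14, -6.92, -2.89, 5.26, -6.73, 5.15, 5.30, -2.55, -2.76])


def fit_rss(X, y):
    """Least squares coefficients and residual sum of squares."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = float(np.sum((y - X @ coef) ** 2))
    return coef, rss


ones = np.ones_like(B)

# Main effects only: the interaction ends up in the residual (the "noise")
coef_main, rss_main = fit_rss(np.column_stack([ones, B, E]), y)

# Main effects plus the B*E interaction column
coef_full, rss_full = fit_rss(np.column_stack([ones, B, E, B * E]), y)

print("main effects only:", np.round(coef_main, 3), "RSS =", round(rss_main, 3))
print("with B*E term:    ", np.round(coef_full, 3), "RSS =", round(rss_full, 3))
```

With the B*E column included, the coefficients for B, E, and B*E come out near 1, 1, and -5, and the residual sum of squares drops sharply.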
Nonlinear Relationship Interactions
•Without Interaction
Experimental error (SE Coef) is much lower
B and E are statistically significant
Nonlinear Relationship Interactions
• Y = B + E - 5BE
A B C D E F Output
0.480 -0.159 -0.326 -0.397 -0.531 -0.647 -0.650
-0.162 -0.174 0.796 0.084 0.954 -0.528 1.670
-0.016 -0.319 -0.224 0.120 0.074 0.794 -0.002
0.604 -0.032 0.260 -0.227 0.302 0.855 0.704
-0.053 -0.560 0.428 -0.965 0.999 -0.161 3.640
-0.712 0.560 -0.043 -0.021 0.160 0.949 0.681
-0.431 -0.308 0.862 -0.527 -0.352 -0.088 -0.806
-0.985 -0.973 -0.866 -0.747 -0.489 0.983 -3.728
0.173 0.585 -0.424 0.353 0.173 0.173 0.353
0.778 -0.219 -0.788 -0.293 -0.119 0.227 -0.437
0.009 0.984 0.348 0.141 0.707 -0.197 -1.453
-0.101 -0.197 -0.874 -0.335 -0.592 -0.653 -1.252
-0.734 0.861 0.079 -0.487 0.727 -0.776 -1.358
-0.797 0.219 0.125 0.480 -0.852 0.769 0.323
-0.843 -0.121 -0.949 0.970 0.646 0.677 1.256
0.763 0.965 -0.209 0.198 -0.865 0.783 4.629
Not all data is shown
There are 50 rows of data
Nonlinear Relationship Interactions
Both B and E are statistically significant
Nonlinear Relationship Interactions
[Figure: Scatterplot of Output vs A, B, C, D, E, F. Two panels are annotated "Pattern Indicates Significance"]
Linearity of the Relationship Between Dependent and Independent Variables
• Nonlinear relationship
  • Problems
    • Significant predictors have non-significant p-value
    • Prediction and confidence limits incorrect
  • What to do
    • Plot each predictor individually and perform a visual check
    • Transform data using the appropriate non-linear model
• Interactions
  • Problems
    • The interactions may add to the statistical noise
    • A larger sample size is needed to detect statistically significant main effects
    • Significant factors may be missed if the interaction has a large effect
  • What to do
    • Add interactions manually if there are enough degrees of freedom
    • Do not use the mean square as an estimate for noise in the system
    • Do not use the standard error of the coefficient as an estimate for noise in the system
    • A pattern can be detected on individual scatter plots
Constant Variance of the Errors
• Problems
  • Confidence limits not valid
  • Prediction limits not valid
• What to do
  • Nothing is required unless prediction or confidence limits are needed
  • Perform regression on subsets of the data
    • Example: original data 20 < x < 100
    • Make 4 data sets: 20 < x < 40, 40 < x < 60, 60 < x < 80, 80 < x < 100
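The subset approach can be sketched as follows, assuming NumPy is available. The data are simulated with a noise level that grows with x, and the model coefficients, seed, and sample size are illustrative assumptions rather than values from the slides. Each of the four ranges gets its own fit and its own residual standard deviation S.

```python
import numpy as np

# Simulated data with non-constant variance (all values are assumptions)
rng = np.random.default_rng(0)
x = rng.uniform(20, 100, size=400)
y = 100 + 2.4 * x + rng.normal(0, 0.05 * x)   # noise sd proportional to x

edges = [20, 40, 60, 80, 100]
results = []
for lo, hi in zip(edges[:-1], edges[1:]):
    mask = (x >= lo) & (x < hi)
    xs, ys = x[mask], y[mask]
    slope, intercept = np.polyfit(xs, ys, 1)          # fit within this range only
    resid = ys - (intercept + slope * xs)
    s = float(np.sqrt(np.sum(resid**2) / (len(xs) - 2)))  # residual standard deviation
    results.append((lo, hi, len(xs), float(slope), float(intercept), s))

for lo, hi, n, slope, intercept, s in results:
    print(f"{lo:>3} <= x < {hi:<3} n={n:<4} slope={slope:7.3f} S={s:6.3f}")
```

Limits computed within one subset use that subset's S, so they are not distorted by the larger or smaller scatter elsewhere in the range.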
Constant Variance of the Errors
[Figure: Versus Fits (response is Y); Residual vs Fitted Value]
Constant Variance of the Errors
• Confidence limits and prediction limits are not valid
[Figure: Fitted Line Plot of Y versus x, Y = 106.0 + 2.418 x, showing the regression line and 95% PI; S = 34.6383, R-Sq = 94.5%, R-Sq(adj) = 94.3%]
Constant Variance of the Errors
• P-values are approximately correct
Independent Predictors
• What happens if the predictors are not independent?
• What causes predictors to be dependent?
  • Stiffness is increased when diameter is reduced
  • Coolant density is increased when cutting speed is increased
• Example
x1 x2 x3 x4 x5 y
23.2 47.2 27.4 0.6 70.9 95.1
11.5 49.8 26.3 7.5 61.3 107.5
3.5 22.6 39.5 27.3 26.9 73
17.1 5.3 8.7 17.5 22.9 28.7
4.9 5.5 14.3 13.5 11.3 24.6
24.7 7.7 48.5 1.7 32.4 17.7
41.1 12.2 13 39.9 53.9 64.8
38.3 19.7 43.3 19 58.1 59.3
4.4 33.1 1 8.8 38.2 75.9
41.3 25.8 36.9 8.6 67.1 60.3
5.8 20.2 11.4 21.2 26.9 62.2
Not all data is shown
There are 48 rows of data
9/23/2021 © SKF Group Slide 29
Independent Predictors
X2 and X4 are significant
Independent Predictors
• Let’s remove X2 as a predictor and perform regression again
• What predictors will be significant?
X1, X4 and X5 are significant
Independent Predictors
• Why didn’t X1 and X5 appear as significant in the initial regression?
  • X5 is a function of X1 and X2
• Fix
  • Never perform regression without verifying the independence of predictors
  • Check the correlation between predictors
Independent Predictors
Options: remove X5, or remove X1 and X2
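One way to verify independence is sketched below, assuming NumPy is available and using only the eleven rows shown in the example table (the full data set has 48 rows). Pairwise correlations can understate the problem when one predictor depends on two others, so the sketch also regresses x5 on x1 and x2 and reports R-squared. That second step is an addition for illustration, not a procedure taken from the slides.

```python
import numpy as np

# Columns x1..x5 from the eleven rows shown in the example table
data = np.array([
    [23.2, 47.2, 27.4,  0.6, 70.9],
    [11.5, 49.8, 26.3,  7.5, 61.3],
    [ 3.5, 22.6, 39.5, 27.3, 26.9],
    [17.1,  5.3,  8.7, 17.5, 22.9],
    [ 4.9,  5.5, 14.3, 13.5, 11.3],
    [24.7,  7.7, 48.5,  1.7, 32.4],
    [41.1, 12.2, 13.0, 39.9, 53.9],
    [38.3, 19.7, 43.3, 19.0, 58.1],
    [ 4.4, 33.1,  1.0,  8.8, 38.2],
    [41.3, 25.8, 36.9,  8.6, 67.1],
    [ 5.8, 20.2, 11.4, 21.2, 26.9],
])

# Pairwise correlation matrix of the predictors (5 x 5)
corr = np.corrcoef(data, rowvar=False)
print(np.round(corr, 2))

# Regress x5 on x1 and x2 to see how much of x5 they explain
x5 = data[:, 4]
X = np.column_stack([np.ones(len(data)), data[:, 0], data[:, 1]])
coef, *_ = np.linalg.lstsq(X, x5, rcond=None)
ss_res = float(np.sum((x5 - X @ coef) ** 2))
ss_tot = float(np.sum((x5 - x5.mean()) ** 2))
r_sq = 1 - ss_res / ss_tot
print("x5 ~ x1 + x2 coefficients:", np.round(coef, 3), "R-sq =", round(r_sq, 4))
```

An R-squared close to 1 here indicates that x5 carries almost no information beyond x1 and x2.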
Independence of the Dependent Values Over Time
• This is detected in the residuals versus order chart
• It is also detected in the normal probability chart
• Causes
  • Data is correlated to itself
  • Manufacturing
    • Change tool, process degrades, then tool is changed again
    • Mold heats up over time
  • Engineering
    • Technician learns and improves
    • Technician gets fatigued and gets worse
    • First part gets cold tools
Independence of the Dependent Values Over Time
[Figure: Residual Plots for y. Panels: Normal Probability Plot (Percent vs Residual), Versus Fits (Residual vs Fitted Value), Histogram (Frequency vs Residual), Versus Order (Residual vs Observation Order)]
Classic pattern for correlated Ys: the variation in a short time period is much smaller than the total variation
Independence of the Dependent Values Over Time
• Problems
  • All statistics are unreliable, including
    • P-values
    • Coefficients
    • Confidence limits
    • Prediction limits
• Fix
  • Use autocorrelation analysis to determine lag, or
  • Remove autocorrelation and model residuals, or
  • Correct the underlying problem in the manufacturing or engineering process
Independence of the Dependent Values Over Time
• Use every 11th data point
[Figure: Autocorrelation Function for y (with 5% significance limits for the autocorrelations); Autocorrelation vs Lag]
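A minimal sketch of using the autocorrelation function to pick a lag is shown below in plain Python. The series is simulated (a first-order autoregressive process with an assumed coefficient of 0.8), and the limit of 2/sqrt(n) is a common rough approximation to 5% significance limits. Statistical packages may compute the limits differently.

```python
import math
import random


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((v - mean) ** 2 for v in series)
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return num / denom


# Simulated autocorrelated data: y_t = 0.8 * y_(t-1) + noise (assumed model)
rng = random.Random(1)
y = [0.0]
for _ in range(149):
    y.append(0.8 * y[-1] + rng.gauss(0, 1))

max_lag = 35
limit = 2 / math.sqrt(len(y))   # rough 5% significance limit
acf = {k: autocorrelation(y, k) for k in range(1, max_lag + 1)}

# First lag whose autocorrelation falls inside the limits
first_ok = next((k for k, r in acf.items() if abs(r) < limit), max_lag)

# Keep every first_ok-th point so the retained values are roughly independent
thinned = y[::first_ok]
print(f"lag-1 autocorrelation = {acf[1]:.2f}, limit = {limit:.3f}")
print(f"use every {first_ok}th data point: {len(thinned)} of {len(y)} points kept")
```

Thinning reduces the sample size, so this trades statistical power for more trustworthy p-values and limits.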
Normality of the Error Distribution
[Figure: Residual Plots for y. Panels: Normal Probability Plot (Percent vs Residual), Versus Fits (Residual vs Fitted Value), Histogram (Frequency vs Residual), Versus Order (Residual vs Observation Order)]
Normality of the Error Distribution
• Problems
  • The prediction limits will not be correct
  • The confidence intervals will have some error
• Fix
  • There is no fix
• Why it happens
  • This is unusual
  • The regression is fit by minimizing the sum of the squared residual values
  • Non-normal data commonly produces normally distributed residuals
  • Can be caused by an unknown predictor
Summary
• When performing regression
  • Check for correlated predictors
  • Check residual graph for patterns
  • Check time series graph for patterns
• Violating assumptions does not make the analysis totally invalid
  • Coefficients may be OK
  • P-values may be OK
Modified Power Example
x         y
0.00001   -132.575
0.0002    -117.1
0.0004    -107.875
0.0006    -105.075
0.0008    -101.35
0.001     -106.65
0.002     -67.225
0.005     -27.5
0.01      -4.1