
Regression

Simple Linear Least Squares Regression

Simple linear regression is used to estimate the coefficients of the model

$y_i = a + b x_i + e_i$ (14.1)

where $y_i$ is the dependent variable,

$x_i$ is the independent variable, and

$e_i$ is the residual; the error in the fit of the model.
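As a rough illustration of equation (14.1), the closed-form least-squares estimates of a and b can be computed directly. This is a minimal sketch; the function name `fit_line` and the data values are hypothetical and are not taken from the slides.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a and b in y_i = a + b*x_i + e_i."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx            # slope
    a = y_bar - b * x_bar      # intercept
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, residuals

# Hypothetical data, not taken from the slides
xs = [28, 30, 32, 34, 36, 38, 40, 42, 44]
ys = [112, 118, 121, 130, 133, 141, 144, 152, 157]

a, b, residuals = fit_line(xs, ys)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With an intercept in the model, the residuals of a least-squares fit sum to zero, which is a quick sanity check on any implementation.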

[Figure: Simple Linear Least Squares Regression. x-axis: Independent Variable; y-axis: Dependent Variable.]

Multiple Regression

$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_k x_{ki} + e_i$

where $y_i$ is the dependent variable,

$x_{1i}, x_{2i}, \dots, x_{ki}$ are the independent variables, and

$e_i$ is the residual; the error in the fit of the model.
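The multiple regression coefficients can be estimated by solving the normal equations. The sketch below uses only the Python standard library; the helper names `solve` and `fit_multiple` and the noise-free data are hypothetical, chosen so the recovered coefficients are easy to check.

```python
def solve(a_mat, rhs):
    """Solve a_mat * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [r] for row, r in zip(a_mat, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_multiple(rows, ys):
    """Return [b0, b1, ..., bk] minimising the sum of squared residuals."""
    design = [[1.0] + list(row) for row in rows]   # leading 1 gives b0
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical, noise-free data generated from y = 1 + 2*x1 - 3*x2
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in rows]

coeffs = fit_multiple(rows, ys)
print([round(c, 6) for c in coeffs])
```

Solving the normal equations directly is adequate for small, well-scaled problems like this one; statistical packages generally use more numerically stable decompositions.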

Regression Assumptions

• Linearity of the relationship between dependent and independent variables

• Constant variance of the errors

• Independent predictors

• Independence of the dependent values over time

• Normality of the error distribution

Linearity of the Relationship Between Dependent and Independent Variables

• Nonlinear relationship
  • Problems
    • Significant predictors have non-significant p-values
    • Prediction and confidence limits are incorrect
  • What to do
    • Always plot each predictor individually and perform a visual check
    • Transform data using the appropriate non-linear model

• Interactions
  • Problems
    • The interaction adds to the statistical noise
    • A larger sample size is needed to detect statistically significant main effects
    • Significant factors may be missed if the interaction has a large effect
  • What to do
    • Do not use the mean square as an estimate for noise in the system
    • Do not use the standard error of the coefficient as an estimate for noise in the system
    • A pattern can be detected on individual scatter plots

Nonlinear Relationship

[Figure: Fitted Line Plot of y versus x1. Fitted line: y = 217.1 - 0.999 x1; S = 175.244, R-Sq = 0.6%, R-Sq(adj) = 0.0%.]

Nonlinear Relationship

• Solution
  • Add $x^2$ as a predictor
  • $Y = c_0 + c_1 x + c_2 x^2 + \text{error}$

Excel

1. Draw a scatter graph

2. Add a trend line

3. Select the appropriate non-linear model

Limitations

1. Only works with a single predictor

2. Only handles 5 types of non-linear equations

Excel Analysis ToolPak

1. Add additional columns to model the desired equation

2. Perform regression with the Analysis ToolPak
   1. Tools
   2. Data Analysis (may need to be enabled from the Add-Ins menu)
   3. Regression

x     x^2     y
33    1089    64.4
13    169     144.5
3     9       484.2
9     81      256.5
2     4       529.8
40    1600    225.3
7     49      324.7
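The same two steps (add an x^2 column, then regress) can be sketched outside Excel. The example below uses only the seven rows shown above, so its coefficients will differ slightly from a fit to the full data set; the helper names `least_squares` and `r_squared` are hypothetical.

```python
def least_squares(design, ys):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    p = len(design[0])
    aug = [[sum(d[i] * d[j] for d in design) for j in range(p)]
           + [sum(d[i] * y for d, y in zip(design, ys))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    coeffs = [0.0] * p
    for r in range(p - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, p))
        coeffs[r] = (aug[r][p] - tail) / aug[r][r]
    return coeffs

def r_squared(design, ys, coeffs):
    """Fraction of the variation in ys explained by the fitted model."""
    y_bar = sum(ys) / len(ys)
    fitted = [sum(c * v for c, v in zip(coeffs, row)) for row in design]
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# The seven rows shown in the table (the full data set is not reproduced here)
xs = [33, 13, 3, 9, 2, 40, 7]
ys = [64.4, 144.5, 484.2, 256.5, 529.8, 225.3, 324.7]

linear = [[1.0, x] for x in xs]              # y = c0 + c1*x
quadratic = [[1.0, x, x ** 2] for x in xs]   # extra column for x^2

c_lin = least_squares(linear, ys)
c_quad = least_squares(quadratic, ys)
r2_lin = r_squared(linear, ys, c_lin)
r2_quad = r_squared(quadratic, ys, c_quad)

print("quadratic coefficients:", [round(c, 4) for c in c_quad])
print(f"R-Sq linear = {r2_lin:.3f}, R-Sq quadratic = {r2_quad:.6f}")
```

The model is still linear in its coefficients, which is why ordinary least squares applies once the x^2 column has been added.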

Excel Analysis ToolPak

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.999999
R Square             0.999997
Adjusted R Square    0.999997
Standard Error       0.29061
Observations         48

ANOVA
              df    SS         MS          F          Significance F
Regression    2     1421905    710952.5    8418177    4.0427E-126
Residual      45    3.80045    0.084454
Total         47    1421909

             Coefficients    Standard Error    t Stat       P-value     Lower 95%       Upper 95%       Lower 95.0%     Upper 95.0%
Intercept    625.6399        0.13127           4766.067     5.7E-130    625.3754741     625.9042553     625.3754741     625.9042553
x            -50.00772       0.012358          -4046.525    0%          -50.03260973    -49.9828284     -50.03260973    -49.9828284
x^2          1.000023        0.000245          4089.723     0%          0.999530235     1.000515216     0.999530235     1.000515216


Limitations

1. You must transform data

Also gives all statistical graphs

Nonlinear Relationship Multiple predictors

[Figure: Residual Plots for T (Kelvin). Panels: Normal Probability Plot (Percent vs. Residual), Versus Fits (Residual vs. Fitted Value), Histogram (Frequency vs. Residual), Versus Order (Residual vs. Observation Order).]

Nonlinear Relationship Multiple predictors

[Figure: Scatterplot of T (Kelvin) vs P (Atm), V (Liter), n (Moles), x1, x2, x3.]

The source of non-linearity may not be obvious.

Nonlinear Relationship Multiple predictors

• The regression correctly identified the predictors

• How good is the equation?
  • Maximum temperature = 44.5
  • Minimum temperature = 19.95
  • The maximum residual is 1.86
  • The worst error is 6.1% of the temperature range

• Use your engineering judgment
  • Is 6.1% error OK?
  • The error is worse when low and high temperatures are predicted

• What if we extrapolate?
  • We are doomed!

P (Atm)            4
V (Liter)          4
n (Moles)          2
Predicted T (K)    75.94
True T (K)         97.49
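The size of the extrapolation error can be checked with a few lines of arithmetic. The slides do not state how the true temperature was obtained; the value shown is consistent with the ideal gas law, T = PV/(nR), with R = 0.082057 L·atm/(mol·K), and that is the assumption made in this sketch. The function name is hypothetical.

```python
R = 0.082057  # gas constant in L*atm/(mol*K)

def ideal_gas_temperature(p_atm, v_liter, n_moles):
    """Temperature in kelvin from the ideal gas law, T = P*V / (n*R)."""
    return p_atm * v_liter / (n_moles * R)

true_t = ideal_gas_temperature(4, 4, 2)
predicted_t = 75.94  # regression prediction quoted on the slide
error = predicted_t - true_t

print(f"true T = {true_t:.2f} K, error = {error:.2f} K "
      f"({abs(error) / true_t:.1%} of the true value)")
```

The prediction misses by more than 20 K, far larger than the maximum residual of 1.86 seen inside the fitted temperature range of 19.95 to 44.5.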

Nonlinear Relationship Interactions

• Y = B + E - 5BE

A     B     C     D     E     F     Output
-1    -1    -1    1     1     1     5.14
1     -1    -1    -1    -1    1     -6.92
-1    1     -1    -1    1     -1    -2.89
1     1     -1    1     -1    -1    5.26
-1    -1    1     1     -1    -1    -6.73
1     -1    1     -1    1     -1    5.15
-1    1     1     -1    -1    1     5.30
1     1     1     1     1     1     -2.55
1     1     1     1     1     1     -2.76

Nonlinear Relationship Interactions

The interaction is treated as experimental error.

This increase in error masks the effect of the coefficients.

Neither B nor E appears statistically significant.
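The masking effect can be reproduced from the nine runs in the table for Y = B + E - 5BE. This is a simplified sketch: it fits only B and E (the slide's regression presumably included all six factors), once without and once with a B*E column, and compares the residual standard error. The helper names are hypothetical.

```python
import math

def least_squares(design, ys):
    """Solve the normal equations (X'X) b = X'y by Gaussian elimination."""
    p = len(design[0])
    aug = [[sum(d[i] * d[j] for d in design) for j in range(p)]
           + [sum(d[i] * y for d, y in zip(design, ys))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, p):
            factor = aug[r][col] / aug[col][col]
            aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    coeffs = [0.0] * p
    for r in range(p - 1, -1, -1):
        tail = sum(aug[r][c] * coeffs[c] for c in range(r + 1, p))
        coeffs[r] = (aug[r][p] - tail) / aug[r][r]
    return coeffs

def residual_std_error(design, ys, coeffs):
    """Square root of the residual mean square (noise estimate of the fit)."""
    fitted = [sum(c * v for c, v in zip(coeffs, row)) for row in design]
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return math.sqrt(ss_res / (len(ys) - len(coeffs)))

# Columns B, E and Output from the nine-run table
b = [-1, -1, 1, 1, -1, -1, 1, 1, 1]
e = [1, -1, 1, -1, -1, 1, -1, 1, 1]
out = [5.14, -6.92, -2.89, 5.26, -6.73, 5.15, 5.30, -2.55, -2.76]

main_only = [[1.0, bi, ei] for bi, ei in zip(b, e)]           # no interaction term
with_be = [[1.0, bi, ei, bi * ei] for bi, ei in zip(b, e)]    # adds the B*E column

coef_main = least_squares(main_only, out)
coef_be = least_squares(with_be, out)
s_main = residual_std_error(main_only, out, coef_main)
s_be = residual_std_error(with_be, out, coef_be)

print("with B*E:", [round(c, 3) for c in coef_be])
print(f"residual std error: main effects only = {s_main:.2f}, with B*E = {s_be:.2f}")
```

Without the B*E column, the interaction's contribution lands in the residual, so the noise estimate is inflated and the B and E coefficients are hard to distinguish from zero.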

Nonlinear Relationship Interactions

•Without Interaction

Experimental error (SE Coef) is much lower

B and E are statistically significant

Nonlinear Relationship Interactions

• Y = B + E - 5BE

A         B         C         D         E         F         Output
0.480     -0.159    -0.326    -0.397    -0.531    -0.647    -0.650
-0.162    -0.174    0.796     0.084     0.954     -0.528    1.670
-0.016    -0.319    -0.224    0.120     0.074     0.794     -0.002
0.604     -0.032    0.260     -0.227    0.302     0.855     0.704
-0.053    -0.560    0.428     -0.965    0.999     -0.161    3.640
-0.712    0.560     -0.043    -0.021    0.160     0.949     0.681
-0.431    -0.308    0.862     -0.527    -0.352    -0.088    -0.806
-0.985    -0.973    -0.866    -0.747    -0.489    0.983     -3.728
0.173     0.585     -0.424    0.353     0.173     0.173     0.353
0.778     -0.219    -0.788    -0.293    -0.119    0.227     -0.437
0.009     0.984     0.348     0.141     0.707     -0.197    -1.453
-0.101    -0.197    -0.874    -0.335    -0.592    -0.653    -1.252
-0.734    0.861     0.079     -0.487    0.727     -0.776    -1.358
-0.797    0.219     0.125     0.480     -0.852    0.769     0.323
-0.843    -0.121    -0.949    0.970     0.646     0.677     1.256
0.763     0.965     -0.209    0.198     -0.865    0.783     4.629

Not all data are shown; there are 50 rows of data.

Nonlinear Relationship Interactions

Both B and E are statistically significant

Nonlinear Relationship Interactions

[Figure: Scatterplot of Output vs A, B, C, D, E, F. Callout, shown twice: "Pattern indicates significance".]

Linearity of the Relationship Between Dependent and Independent Variables

• Nonlinear relationship
  • Problems
    • Significant predictors have non-significant p-values
    • Prediction and confidence limits are incorrect
  • What to do
    • Plot each predictor individually and perform a visual check
    • Transform data using the appropriate non-linear model

• Interactions
  • Problems
    • The interactions may add to the statistical noise
    • A larger sample size is needed to detect statistically significant main effects
    • Significant factors may be missed if the interaction has a large effect
  • What to do
    • Add interactions manually if there are enough degrees of freedom
    • Do not use the mean square as an estimate for noise in the system
    • Do not use the standard error of the coefficient as an estimate for noise in the system
    • A pattern can be detected on individual scatter plots

Constant Variance of the Errors

• Problems
  • Confidence limits are not valid
  • Prediction limits are not valid

• What to do
  • Nothing is required unless prediction or confidence limits are needed
  • Perform regression on subsets of the data
    • Example: original data 20 < x < 100
    • Make 4 data sets: 20 < x < 40, 40 < x < 60, 60 < x < 80, 80 < x < 100
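The subset approach can be sketched as follows. The data here are hypothetical and deterministic: the error is made proportional to x so that the variance is visibly non-constant. The function name `fit_line` is an illustrative choice and not something defined in the slides.

```python
import math

def fit_line(xs, ys):
    """Return intercept, slope and residual standard deviation of a line fit."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_res = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    return a, b, math.sqrt(ss_res / (n - 2))

# Hypothetical data: y = 10 + 2x with an error whose size grows with x
xs = list(range(21, 100))
ys = [10 + 2 * x + 0.05 * x * (1 if x % 2 == 0 else -1) for x in xs]

edges = [20, 40, 60, 80, 100]
subset_fits = []
for lo, hi in zip(edges, edges[1:]):
    # Strict inequalities, matching the ranges lo < x < hi
    sub_x = [x for x in xs if lo < x < hi]
    sub_y = [y for x, y in zip(xs, ys) if lo < x < hi]
    a, b, s = fit_line(sub_x, sub_y)
    subset_fits.append((lo, hi, a, b, s))
    print(f"{lo} < x < {hi}: y = {a:.2f} + {b:.3f} x, S = {s:.2f}")
```

Each subset gets its own residual standard deviation, so limits computed within a subset reflect the local spread instead of one pooled value for the whole range.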

Constant Variance of the Errors

[Figure: Versus Fits (response is Y). x-axis: Fitted Value; y-axis: Residual.]

Constant Variance of the Errors

•Confidence limits and Prediction limits are not valid

[Figure: Fitted Line Plot, Y = 106.0 + 2.418 x, with regression line and 95% PI. S = 34.6383, R-Sq = 94.5%, R-Sq(adj) = 94.3%.]

Constant Variance of the Errors

•P-values are approximately correct

Independent Predictors

• What happens if the predictors are not independent?

• What causes predictors to be dependent?
  • Stiffness is increased when diameter is reduced
  • Coolant density is increased when cutting speed is increased

• Example

x1      x2      x3      x4      x5      y
23.2    47.2    27.4    0.6     70.9    95.1
11.5    49.8    26.3    7.5     61.3    107.5
3.5     22.6    39.5    27.3    26.9    73
17.1    5.3     8.7     17.5    22.9    28.7
4.9     5.5     14.3    13.5    11.3    24.6
24.7    7.7     48.5    1.7     32.4    17.7
41.1    12.2    13      39.9    53.9    64.8
38.3    19.7    43.3    19      58.1    59.3
4.4     33.1    1       8.8     38.2    75.9
41.3    25.8    36.9    8.6     67.1    60.3
5.8     20.2    11.4    21.2    26.9    62.2

Not all data are shown; there are 48 rows of data.

9/23/2021 © SKF Group Slide 29

Independent Predictors

X2 and X4 are significant


Independent Predictors

• Let’s remove X2 as a predictor and perform regression again

• What predictors will be significant?

X1, X4 and X5 are significant


Independent Predictors

• Why didn’t X1 and X5 appear as significant in the initial regression?

• X5 is a function of X1 and X2

• Fix
  • Never perform regression without verifying the independence of predictors
  • Check the correlation between predictors
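A correlation check of this kind can be sketched with the eleven example rows shown earlier (the full 48-row data set is not reproduced). The slides say only that X5 is a function of X1 and X2; in the rows shown, x5 is close to x1 + x2, and that form is assumed here for illustration. The `pearson` helper is hypothetical.

```python
import math
from itertools import combinations

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    cov = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    var_u = sum((a - u_bar) ** 2 for a in u)
    var_v = sum((b - v_bar) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)

# Predictor columns from the eleven rows shown in the example table
predictors = {
    "x1": [23.2, 11.5, 3.5, 17.1, 4.9, 24.7, 41.1, 38.3, 4.4, 41.3, 5.8],
    "x2": [47.2, 49.8, 22.6, 5.3, 5.5, 7.7, 12.2, 19.7, 33.1, 25.8, 20.2],
    "x3": [27.4, 26.3, 39.5, 8.7, 14.3, 48.5, 13, 43.3, 1, 36.9, 11.4],
    "x4": [0.6, 7.5, 27.3, 17.5, 13.5, 1.7, 39.9, 19, 8.8, 8.6, 21.2],
    "x5": [70.9, 61.3, 26.9, 22.9, 11.3, 32.4, 53.9, 58.1, 38.2, 67.1, 26.9],
}

# Pairwise correlations between predictors
for (name_a, col_a), (name_b, col_b) in combinations(predictors.items(), 2):
    print(f"r({name_a}, {name_b}) = {pearson(col_a, col_b):+.2f}")

# Assumed form of the dependence: x5 compared with x1 + x2
x1_plus_x2 = [a + b for a, b in zip(predictors["x1"], predictors["x2"])]
r_sum = pearson(predictors["x5"], x1_plus_x2)
print(f"r(x5, x1 + x2) = {r_sum:+.4f}")
```

A predictor can depend on a combination of others without any single pairwise correlation being extreme, so it is worth checking combinations suggested by the physics of the process as well as the pairwise values.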

Independent Predictors

Options: remove X5, or remove X1 and X2

Independence of the Dependent Values Over Time

• This is detected in the residuals versus order chart

• It is also detected in the normal probability chart

• Causes
  • Data is correlated to itself
  • Manufacturing
    • Change tool, process degrades, then tool is changed again
    • Mold heats up over time
  • Engineering
    • Technician learns and improves
    • Technician gets fatigued and gets worse
    • First part gets cold tools

Independence of the Dependent Values Over Time

[Figure: Residual Plots for y. Panels: Normal Probability Plot (Percent vs. Residual), Versus Fits (Residual vs. Fitted Value), Histogram (Frequency vs. Residual), Versus Order (Residual vs. Observation Order).]

Classic pattern for correlated Ys: the variation in a short time period is much smaller than the total variation.

Independence of the Dependent Values Over Time

• Problems
  • All statistics are unreliable, including
    • P-values
    • Coefficients
    • Confidence limits
    • Prediction limits

• Fix
  • Use autocorrelation analysis to determine lag, or
  • Remove autocorrelation and model residuals, or
  • Correct the underlying problem in the manufacturing or engineering process

Independence of the Dependent Values Over Time

• Use every 11th data point

[Figure: Autocorrelation Function for y (with 5% significance limits for the autocorrelations). x-axis: Lag; y-axis: Autocorrelation.]
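An autocorrelation analysis of this kind can be sketched as below. The series is simulated (a hypothetical first-order autoregressive process, not the slide data), and the limit of 2/sqrt(n) is a rough large-sample approximation assumed here; statistical packages may compute their significance limits differently.

```python
import math
import random

def autocorrelation(ys, max_lag):
    """Sample autocorrelation r_k for lags k = 0..max_lag."""
    n = len(ys)
    y_bar = sum(ys) / n
    c0 = sum((y - y_bar) ** 2 for y in ys)
    return [
        sum((ys[t] - y_bar) * (ys[t + k] - y_bar) for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]

# Hypothetical self-correlated series: each value carries over 80% of the last
random.seed(1)
ys = [0.0]
for _ in range(499):
    ys.append(0.8 * ys[-1] + random.gauss(0, 1))

max_lag = 35
acf = autocorrelation(ys, max_lag)
limit = 2 / math.sqrt(len(ys))  # assumed approximate 5% significance limit

# First lag whose autocorrelation falls inside the limits (max_lag if none does)
first_small = next((k for k, r in enumerate(acf) if abs(r) < limit), max_lag)

# Keep every first_small-th point so the retained values are roughly uncorrelated
thinned = ys[::first_small]
print(f"r1 = {acf[1]:.2f}, first lag inside limits = {first_small}, "
      f"kept {len(thinned)} of {len(ys)} points")
```

Thinning discards data, so it is only practical when the original series is long enough to leave a usable sample.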

Normality of the Error Distribution

[Figure: Residual Plots for y. Panels: Normal Probability Plot (Percent vs. Residual), Versus Fits (Residual vs. Fitted Value), Histogram (Frequency vs. Residual), Versus Order (Residual vs. Observation Order).]

Normality of the Error Distribution

• Problems
  • The prediction limits will not be correct
  • The confidence intervals will have some error

• Fix
  • There is no fix

• Why it happens
  • This is unusual
  • The regression is fit by minimizing the sum of the squared residual values
  • Non-normal data commonly produces normally distributed residuals
  • Can be caused by an unknown predictor
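One way to put a number on what a normal probability plot shows is to correlate the sorted residuals with the normal scores they would be plotted against; a value near 1 means the points fall close to a straight line. This is a sketch under stated assumptions: the median-rank plotting position (i - 0.3)/(n + 0.4) is one common choice among several, and both sets of residuals are hypothetical.

```python
import math
from statistics import NormalDist

def normal_scores(n):
    """Standard-normal quantiles at median-rank plotting positions."""
    return [NormalDist().inv_cdf((i - 0.3) / (n + 0.4)) for i in range(1, n + 1)]

def probability_plot_correlation(residuals):
    """Correlation between sorted residuals and their normal scores."""
    ordered = sorted(residuals)
    scores = normal_scores(len(ordered))
    n = len(ordered)
    r_bar, s_bar = sum(ordered) / n, sum(scores) / n
    cov = sum((r - r_bar) * (s - s_bar) for r, s in zip(ordered, scores))
    var_r = sum((r - r_bar) ** 2 for r in ordered)
    var_s = sum((s - s_bar) ** 2 for s in scores)
    return cov / math.sqrt(var_r * var_s)

# Hypothetical residuals: one set normal-shaped, one set strongly skewed
normal_like = [2.0 * z for z in normal_scores(30)]
skewed = [math.exp(z) for z in normal_scores(30)]

r_normal = probability_plot_correlation(normal_like)
r_skewed = probability_plot_correlation(skewed)
print(f"normal-shaped residuals: r = {r_normal:.4f}")
print(f"skewed residuals:        r = {r_skewed:.4f}")
```

A low value flags non-normal residuals but does not say why; as noted above, an unknown predictor is one possible cause.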


Summary

• When performing regression
  • Check for correlated predictors
  • Check residual graph for patterns
  • Check time series graph for patterns

• Violating assumptions does not make analysis totally invalid
  • Coefficients may be OK
  • P-values may be OK

Modified Power Example

x          y
0.00001    -132.575
0.0002     -117.1
0.0004     -107.875
0.0006     -105.075
0.0008     -101.35
0.001      -106.65
0.002      -67.225
0.005      -27.5
0.01       -4.1