hw8 and task 2
DATA & Analysis Instructions
| EXCEL'S REGRESSION ANALYSIS TOOL | |||
| Found in the menu for Excel's Data Analysis Tool (if you installed that) | |||
| It's free and already on your computer (if you have Excel 2013 and beyond on a PC) | |||
| SAMLE DATA* | |||
| Insulin sensitivity | %C20-22 fatty acids | STEP 1 | |
| [mg/m2/min] | polyunsaturated | Open the dialog box for the Data analysis tool, click on the DATA tab and then on the icon for the Data Analysis tool all the way to the right. | |
| Y | X | ||
| 250 | 17.9 | ||
| 220 | 18.3 | ||
| 145 | 18.3 | ||
| 115 | 18.4 | ||
| 230 | 18.4 | ||
| 200 | 20.2 | ||
| 330 | 20.3 | ||
| 400 | 21.8 | ||
| 370 | 21.9 | STEP 2 | |
| 260 | 22.1 | Scroll down to Regression and hit OK | |
| 270 | 23.1 | ||
| 530 | 24.2 | ||
| 375 | 24.4 | ||
| *Data from Motulsky textbook, p. 145 | |||
| STEP 3 | |||
| Enter the columns for the Y data and for the X data. Check off as many of the outputs as you like. | |||
| Remember to pick your DEPENDENT (outcome) variable as Y and your INDEPENDENT (predictor) variable as X | |||
| Otherwise the regression analysis does not make sense! | |||
| TIP: Since the output is quite comprehensive, I recommend that you select a new worksheet ply (or tab) for the results. | |||
| I selected "new worksheet ply". | |||
| Once Excel put the data in there, I renamed the new ply "Regression Results". | |||
| You can see that ply / tab below. | |||
| Click on it to get an explanation of all the result outputs. | |||
| I selected to obtain 3 types of plots: Residual Plots, Line Fit Plots, and Normal Probability Plots. | |||
| They originally appeared in the same tab. | |||
| However, in order to have more space to add explanations, I moved them to a new tab that I named "Regression Plots". |
Regression Results
| SUMMARY OUTPUT | ||||||||||
| Regression Statistics | ||||||||||
| Multiple R | 0.7700025428 | Pearson correlation coefficient R | ||||||||
| R Square | 0.592903916 | Square of R | ||||||||
| Adjusted R Square | 0.5558951811 | ignore - only relevant for multiple regression analysis | ||||||||
| Standard Error | 75.8954867325 | of residuals | ||||||||
| Observations | 13 | |||||||||
| Hypothesis test for regression | ||||||||||
| Excel uses a hypothesis test for the analysis of two variances (ANOVA) | ||||||||||
| Null hypothesis: regression variance ≤ residual variance (MSregression ≤ MSresiduals) | ||||||||||
| Alternative hypothesis: regression variance > residual variance | ||||||||||
| Test statistic: F | F= MS regression / MS residuals | (you can check this out below) | ||||||||
| ANOVA | ||||||||||
| df | SS | MS | F | Significance F | ||||||
| Regression | 1 | 92280.9337222659 | 92280.9337222659 | 16.020648028 | 0.0020770122 | this is the p-value for the regression | ||||
| Residual | 11 | 63361.3739700418 | 5760.1249063674 | |||||||
| Total | 12 | 155642.307692308 | ||||||||
| This table is the "coefficients table" | Excel does not provide the formula for the fitted stright line. | |||||||||
| Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% | Lower 95.0% | Upper 95.0% | You can either take it from a scatter plot you made (should be the same) | ||
| Intercept | -486.5419945992 | 193.7160213841 | -2.5116249607 | 0.0289028264 | -912.9080829377 | -60.1759062607 | -912.9080829377 | -60.1759062607 | or create it here fore yourself: | |
| X Variable 1 | 37.2077457475 | 9.2959401573 | 4.0025801713 | 0.0020770122 | 16.7475194122 | 57.6679720827 | 16.7475194122 | 57.6679720827 | y = slope * x + intercept | |
| The slope is named after the x variable (in most programs). | thus: | y = 37.208x - 486.5 | ||||||||
| Since I didn't check the box for a name for the x and y variables, Excel just calls it X variable 1 | ||||||||||
| (Excel puts the name of the x variable here, if the name is provided in the column) | ||||||||||
| Next, Excel creates data tables for plots that analyse the residuals and the normality of the data. | ||||||||||
| For each y coordinate, Excel calculates the "residual": | ||||||||||
| The difference between that y coordinate and its predicted y value (the corresponding y value from the straight line) | ||||||||||
| See lecture 7, slide 13 | ||||||||||
| To obtain the standard residual the residual is divided by the standard error | ||||||||||
| RESIDUAL OUTPUT | PROBABILITY OUTPUT | Excel puts all observed y values in order from smallest to largest ("ranks" them) | ||||||||
| Excel then uses the normal distribution to calculate the precentile of each value. | ||||||||||
| Observation | Predicted Y | Residuals | Standard Residuals | Percentile | Y | ( the probability that values smaller than this value occur in a normally distributed data set) | ||||
| 1 | 179.4766542802 | 70.5233457198 | 0.9705350799 | 3.8461538462 | 115 | |||||
| 2 | 194.3597525792 | 25.6402474208 | 0.3528584659 | 11.5384615385 | 145 | Note: minitab does not rank the actual y values, but the residuals! | ||||
| 3 | 194.3597525792 | -49.3597525792 | -0.6792838729 | 19.2307692308 | 200 | It is more useful to test whether the residuals behave like a normal distribution than the actual y values. | ||||
| 4 | 198.080527154 | -83.080527154 | -1.1433457281 | 26.9230769231 | 220 | |||||
| 5 | 198.080527154 | 31.919472846 | 0.4392725248 | 34.6153846154 | 230 | |||||
| 6 | 265.0544694994 | -65.0544694994 | -0.895272964 | 42.3076923077 | 250 | |||||
| 7 | 268.7752440741 | 61.2247559259 | 0.842568837 | 50 | 260 | |||||
| 8 | 324.5868626953 | 75.4131373047 | 1.0378278922 | 57.6923076923 | 270 | |||||
| 9 | 328.3076372701 | 41.6923627299 | 0.573766037 | 65.3846153846 | 330 | |||||
| 10 | 335.7491864196 | -75.7491864196 | -1.0424525658 | 73.0769230769 | 370 | |||||
| 11 | 372.956932167 | -102.956932167 | -1.4168827835 | 80.7692307692 | 375 | |||||
| 12 | 413.8854524892 | 116.1145475108 | 1.5979565418 | 88.4615384615 | 400 | |||||
| 13 | 421.3270016387 | -46.3270016387 | -0.6375474643 | 96.1538461538 | 530 |
Plots from Regression Tool
| The first plot is the Residuals vs. x values plot | |||||||
| Excel calls it ("name of x variable") Residual plot | |||||||
| This plot answers the question: | |||||||
| Are the errors in the y values (residuals) evenly distributed among all x values? | |||||||
| Or are there any sections on the line (or in the data) that show a bigger variation in y? | |||||||
| This would indicate that the errors are not normally distributed and not totally random. | |||||||
| It could be that they are influenced by the x value | |||||||
| Here you see that blue dots are above and below the line, no matter what the x value is. | |||||||
| Looking at the data in the first tab you see that x values range from ~ 18 to ~ 24. | |||||||
| The default plot here ranges from x=0 to x=30, so it wastes a lot of space with no data. | |||||||
| Tip: Double click the x axis to change its range. | |||||||
| The "Format Axis" menu will open to the right. | |||||||
| Change the minimum and maximum bound for x. | |||||||
| The next plot is Excel's horrible attempt to show the actual data points (blue) | |||||||
| together with the fitted data points (orange) from the line of best fit. | |||||||
| It would be much better to draw the actual line here | |||||||
| and omit the fitted data points. | |||||||
| You can do this easily yourself: | |||||||
| Create a scatter plot from your data and insert a trendline | |||||||
| (with formula and R value) | |||||||
| This plot tests whether the y values are normally distributed. | |||||||
| The data for this plot are in the table called "Probability Output" | |||||||
| I'm repeating the comments I made there. | |||||||
| Excel puts all observed y values in order from smallest to largest ("ranks" them) | |||||||
| Excel then uses the normal distribution to calculate the precentile of each value | |||||||
| ( the probability that values smaller than this value occur in a normally distributed data set). | |||||||
| Note: minitab does not rank the actual y values, but the residuals! | |||||||
| It is more useful to test whether the residuals behave like a normal distribution than the actual y values. | |||||||
| Questions to answer with the normal probability plot: | |||||||
| Are y values across their range showing an approximate normal distribution? | |||||||
| Are there any outliers in the values? | |||||||
| (looks like the data point with the highest y score - 530 - may be an outlier!) |
X Variable 1 Line Fit Plot
Y 17.899999999999999 18.3 18.3 18.399999999999999 18.399999999999999 20.2 20.3 21.8 21.9 22.1 23.1 24.2 24.4 250 220 145 115 230 200 330 400 370 260 270 530 375 Predicted Y 17.899999999999999 18.3 18.3 18.399999999999999 18.399999999999999 20.2 20.3 21.8 21.9 22.1 23.1 24.2 24.4 179.47665428024101 194.3597525792232 194.3597525792232 198.08052715396872 198.08052715396872 265.0544694993884 268.77524407413404 324.58686269531705 328.30763727006257 335.74918641955372 372.95693216700914 413.88545248920997 421.32700163870101X Variable 1
Y
Normal Probability Plot
3.8461538461538463 11.538461538461538 19.23076923076923 26.923076923076923 34.615384615384613 42.307692307692307 50 57.692307692307693 65.384615384615387 73.076923076923066 80.769230769230759 88.461538461538453 96.153846153846146 115 145 200 220 230 250 260 270 330 370 375 400 530Sample Percentile
Y
X Variable 1 Residual Plot
17.899999999999999 18.3 18.3 18.399999999999999 18.399999999999999 20.2 20.3 21.8 21.9 22.1 23.1 24.2 24.4 70.523345719758993 25.640247420776802 -49.359752579223198 -83.080527153968717 31.919472846031283 -65.054469499388404 61.224755925865963 75.413137304682948 41.692362729937429 -75.749186419553723 -102.95693216700914 116.11454751079003 -46.327001638701006X Variable 1
Residuals