MTH 245 Lesson 27 Notes
Sums of Squares and the Coefficient of Determination
We learned in Lesson 7 that the sample variance of a random variable $y$ is given by the equation
$$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$$
If $y$ is a response variable correlated with a predictor variable $x$, a percentage of $s_y^2$ can be explained by an estimated regression model.
To understand how this works, consider a data point $(x_i, y_i)$ contained in the data set used to construct the model $\hat{y} = b_0 + b_1 x$.
− The total deviation of $y_i$ is the vertical distance between $(x_i, y_i)$ and the horizontal line passing through $\bar{y}$. In symbols: $y_i - \bar{y}$. Note that we calculate this value for each $y_i$ when calculating $s_y^2$ (see above).
− The explained deviation of $y_i$ is the vertical distance between $\hat{y}_i$, the model's predicted value at $x_i$, and the horizontal line passing through $\bar{y}$. In symbols: $\hat{y}_i - \bar{y}$.
− The unexplained deviation of $y_i$ is the vertical distance between $(x_i, y_i)$ and the regression line. Note: this is identical to the residual of $(x_i, y_i)$. In symbols: $y_i - \hat{y}_i$.
As we can see in the above plot, if we add explained deviation and unexplained deviation, we get the total deviation. In symbols: $(\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i) = (y_i - \bar{y})$.
If we calculate $(y_i - \bar{y})$ for each response variable value in the data set, square each result, and add the squares together, we get the total sum of squares (SSTO):
$$\text{SSTO} = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
This is the numerator of the formula for $s_y^2$. Similarly, we can find the model (or regression) sum of squares (SSM) and the error sum of squares (SSE) as follows:
$$\text{SSM} = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
$$\text{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
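The three sums of squares can also be computed by hand. The following is a minimal Python sketch, not the StatCrunch procedure: the five data points are hypothetical (invented for illustration), and the slope and intercept use the standard least-squares formulas. It shows that the explained and unexplained parts add up to the total.

```python
from statistics import mean

# Hypothetical (x, y) data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

x_bar, y_bar = mean(x), mean(y)

# Least-squares estimates of the slope b1 and intercept b0.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Predicted value y-hat_i at each x_i.
y_hat = [b0 + b1 * xi for xi in x]

# Total, model (explained) and error (unexplained) sums of squares.
ssto = sum((yi - y_bar) ** 2 for yi in y)
ssm = sum((yh - y_bar) ** 2 for yh in y_hat)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))

print(f"SSTO = {ssto:.4f}, SSM = {ssm:.4f}, SSE = {sse:.4f}")
print(f"SSM + SSE = {ssm + sse:.4f}")
```

For this toy data set the sketch gives SSTO = 6.0, SSM = 3.6 and SSE = 2.4, so SSM + SSE equals SSTO, as the deviation identity suggests for a least-squares line.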
Fortunately, StatCrunch calculates these sums of squares for us and displays them in the regression output's analysis of variance table:
Recall that the coefficient of determination $r^2$ is the proportion of variation in a dependent variable $y$ that is explained by the linear relationship between $y$ and the independent variable $x$. We can calculate $r^2$ directly using the following formula:
$$r^2 = \frac{\text{SSM}}{\text{SSTO}}$$
Example 1: Fit an estimated regression model using the Handspan-Height data set and do the following:
a. Locate SSM, SSE and SSTO on the regression output screen.
$\text{SSM} = 1500.0600$; $\text{SSE} = 1242.7027$; $\text{SSTO} = 2742.7627$
b. Use these values to calculate $r^2$. Does this result match the value of $r^2$ reported in the StatCrunch output (allowing for rounding error)?
$$r^2 = \frac{\text{SSM}}{\text{SSTO}} = \frac{1500.0600}{2742.7627} \approx 0.547$$
This matches the $r^2$ value reported at the top of the output screen.
c. Interpret $r^2$ in the context of this problem.
Since $r^2 = 0.547$, the estimated model explains 54.7% of the variation in height measurements.
Example 2: Fit an estimated regression model using the IQ-Cranial data set and do the following:
a. Locate SSM, SSE and SSTO on the regression output screen.
$\text{SSM} = 63.0031$; $\text{SSE} = 3252.9969$; $\text{SSTO} = 3316.0000$
b. Use these values to calculate $r^2$. Does this result match the value of $r^2$ reported in the StatCrunch output (allowing for rounding error)?
$$r^2 = \frac{\text{SSM}}{\text{SSTO}} = \frac{63.0031}{3316.0000} \approx 0.019$$
This matches the $r^2$ value reported at the top of the output screen.
c. Interpret $r^2$ in the context of this problem.
Since $r^2 = 0.019$, the estimated model explains only 1.9% of the variation in IQ scores.
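The arithmetic in Examples 1 and 2 can be double-checked with a short script. This is only a sketch: the helper name `r_squared` is made up for illustration, and it relies on the identity SSTO = SSM + SSE, so only the SSM and SSE values reported in the examples are needed.

```python
def r_squared(ssm, sse):
    """Return r^2 = SSM / SSTO, where SSTO is rebuilt as SSM + SSE."""
    ssto = ssm + sse
    return ssm / ssto

# SSM and SSE values as reported in Examples 1 and 2.
handspan_height = r_squared(ssm=1500.0600, sse=1242.7027)
iq_cranial = r_squared(ssm=63.0031, sse=3252.9969)

print(round(handspan_height, 3))  # 0.547
print(round(iq_cranial, 3))       # 0.019
```

Rebuilding SSTO from SSM and SSE is also a quick way to catch a copying error when reading values off the analysis of variance table.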
Hypothesis Test for Correlation
The SLR model $\hat{y} = b_0 + b_1 x$ should only be used to estimate $\mu_y$ when the predictor and response variables are linearly correlated. If they are not (that is, if $\beta_1 = 0$), the model $\hat{y} = b_0 + b_1 x$ is not appropriate, and the model $\hat{y} = \bar{y}$ (the estimate of the horizontal line $y = \mu_y$) should be used instead.
To establish the existence of correlation between two variables, we need to conduct a formal hypothesis test. Our hypotheses are:
$H_0: \beta_1 = 0$ (alternatively, $\mu_y = \beta_0$, the reduced model)
$H_A: \beta_1 \neq 0$ (alternatively, $\mu_y = \beta_0 + \beta_1 x$, the full model)
For SLR, the p-value for this test is located in both the parameter estimates and analysis of variance tables:
This test follows the same general procedure described in Section 7.1: compare the p-value to the given significance level $\alpha$. Unless otherwise specified, the original claim is always that the variables are correlated (that is, the original claim is represented by $H_A$).
− If the p-value ≤ $\alpha$, reject $H_0$ and conclude that there is sufficient evidence that the variables are correlated. Use the full model ($\hat{y} = b_0 + b_1 x$).
− If the p-value > $\alpha$, fail to reject $H_0$ and conclude that there is insufficient evidence that the variables are correlated. Use the reduced model ($\hat{y} = \bar{y}$).
Example 3: Using the Handspan-Height data set, fit an SLR model and use it to determine if there is sufficient evidence of linear correlation between height and handspan. (Use $\alpha = 0.05$.) Is the full model appropriate for estimating $\mu_y$?
Since the p-value = 0.000 < $\alpha$ = 0.05, there is sufficient evidence of correlation. The full model ($\hat{y} = b_0 + b_1 x$) is appropriate for estimating $\mu_y$.
Example 4: Using the IQ-Cranial data set, fit an SLR model and use it to determine if there is sufficient evidence of linear correlation between Stanford-Binet IQ Test score and brain capacity. (Use $\alpha = 0.05$.) Based on this result, identify the appropriate model for estimating $\mu_y$ (full or reduced).
Since the p-value = 0.562 > $\alpha$ = 0.05, there is insufficient evidence of correlation. The reduced model ($\hat{y} = \bar{y}$) is appropriate for estimating $\mu_y$.
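The decision rule applied in Examples 3 and 4 can be written as a small function. This is a minimal sketch: the function name `choose_model` and its return strings are hypothetical, and the p-values are the ones reported in the two examples.

```python
def choose_model(p_value, alpha=0.05):
    """Apply the decision rule for H0: beta_1 = 0 at significance level alpha."""
    if p_value <= alpha:
        # Sufficient evidence of correlation: reject H0.
        return "reject H0: use the full model"
    # Insufficient evidence of correlation: fail to reject H0.
    return "fail to reject H0: use the reduced model"

print(choose_model(0.000))  # Example 3 (Handspan-Height)
print(choose_model(0.562))  # Example 4 (IQ-Cranial)
```

Note that a p-value exactly equal to $\alpha$ falls on the "reject" side, matching the ≤ in the rule above.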
SLR Model Validity
For the full SLR model $\hat{y} = b_0 + b_1 x$ to produce valid estimates of $\mu_y$, there must not only be evidence of correlation between $x$ and $y$, but the model must also satisfy all four of the following LINE assumptions:
− Linear Relationship: The two variables must show evidence of a linear relationship.
− Independence: The residuals must be independent of each other.
− Normal Distribution: The residuals must come from a normally distributed population.
− Equal Variances: The residuals must have equal variances at each value of the predictor variable (i.e., they must be homoscedastic).
Linear Relationship. To determine if a linear relationship exists, examine a scatter plot of the data that includes the fitted least-squares line (see examples below). If the points on the scatter plot form a "band" around the estimated regression line, it's appropriate to assume linearity. (Note: Evidence of correlation doesn't mean the relationship is linear, only that it exists.)
Independence. This is an advanced topic we won't cover in MTH 245. Unless told otherwise, assume the independence assumption holds.
Normal Distribution. To determine if the residuals associated with a particular model are normally distributed, we use the graphical method from Lesson 16. Because StatCrunch doesn't construct a boxplot for the residuals in Stat > Regression > Simple Linear, we will only use a histogram and a Q-Q plot.
Equal Variances. To check for equal variances, we use a residual-vs-predictor plot, which is a plot of each residual $y_i - \hat{y}_i$ against its associated $x$-value. If the variances of the residuals are equal (homoscedastic), the plot should show no identifiable pattern. If there is an obvious pattern, usually in the form of a funnel, fan or diamond, then the residuals have unequal variances (they are heteroscedastic).
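To make the residual checks concrete, here is a minimal Python sketch of how the residuals behind these plots could be computed. It is not what StatCrunch does internally: the helper name `residuals` and the five data points are hypothetical, and the fit uses the standard least-squares formulas.

```python
from statistics import mean

def residuals(x, y):
    """Fit y-hat = b0 + b1*x by least squares; return the residuals y_i - y-hat_i."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Hypothetical data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

e = residuals(x, y)

# (x_i, e_i) pairs: the points drawn on a residual-vs-predictor plot.
points = list(zip(x, e))
for xi, ei in points:
    print(f"x = {xi:.1f}, residual = {ei:+.2f}")
```

The list `e` is what the histogram and Q-Q plot would be built from, and `points` is what the residual-vs-predictor plot would display. With only five points no pattern can be judged; the sketch only shows where the plotted values come from.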
Assessing LINE assumptions in StatCrunch: The scatter plot with fitted line is generated by default by running Stat > Regression > Simple Linear. To also build the residual histogram, Q-Q plot, and residual-vs-predictor plot, select the appropriate items from the "Graphs:" section of the options screen prior to fitting the model.
Warning! If you stare at any of these plots for long enough, you'll start to see patterns even when there are none. Residual analysis should be done thoroughly and carefully, but without over-interpreting every slight anomaly. If a plot looks "mostly OK" at first glance, chances are the model meets that particular LINE assumption.
Example 5: Determine if the model you constructed in Example 3 (Height vs Handspan) is valid. Justify your answer.
L: Satisfied. The points on the scatter plot form a loose band around the fitted line.
I: Satisfied by assumption.
N: Satisfied. The histogram is approximately symmetric and the Q-Q plot aligns closely with the trend line.
E: Satisfied. There is no evidence of a pattern in the residual-vs-predictor plot.
Since the model satisfies all four of the LINE criteria, and we also saw in Example 3 that there is sufficient evidence of correlation between the variables, the model is VALID.
Example 6: Determine if the model you constructed in Example 4 (IQ vs cranial capacity) is valid. Justify your answer.
We already know from Example 4 that the full model is inappropriate since there is insufficient evidence of correlation between IQ score and cranial capacity. However, let's examine the LINE criteria anyway:
L: Not satisfied. There is no evidence of a pattern in the scatter plot.
I: Satisfied by assumption.
N: Not satisfied. The histogram and the Q-Q plot both display patterns that suggest the residuals have a right-skewed, non-normal distribution.
E: Satisfied. There is no evidence of a pattern in the residual-vs-predictor plot.
Since the model satisfies only two of the LINE criteria, even if there were sufficient evidence of correlation, it would be INVALID.
Extrapolation and Limits on Validity
Even if the full SLR model $\hat{y} = b_0 + b_1 x$ meets all four LINE criteria, that does not mean it can be used to predict $\mu_y$ for any value of $x$. An SLR model is only valid when used with $x$-values that are within the scope of the model, that is, $x$-values that lie between the minimum and maximum $x$-values in the original data set used to construct the model. Using a model with any $x$-value greater than the maximum or less than the minimum is referred to as extrapolation, and the $x$-value is referred to as being outside the scope of the model.
In the plot to the right, the SLR model (represented by the blue line) can be used with any $x$-value between the two red vertical lines, but $x$-values outside those lines are out of the scope of the model and could produce invalid estimates of $\mu_y$.
The second plot shows an example of the potential negative effect of extrapolation. In this case, the true regression model is nonlinear. The regression line provides a reasonably close linear approximation to the true model within the scope of the data, but outside the scope, it is not at all accurate. As a result, the new $x$-value, which is well out of the scope of the estimated model, produces a $\hat{y}$ value with a large error compared to the true value of $\mu_y$.
Example 7: Suppose an analyst wants to use the Handspan-Height model from Example 5 to estimate the mean height of all people with a handspan of 26 cm. Can the model be used to determine a valid estimate in this case?
No. The maximum Handspan value in the Handspan-Height data set is 25.5 cm. Therefore, 26 cm is outside the scope of the model, and the model cannot be used to estimate mean height at that handspan.
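The scope check in Example 7 amounts to comparing the new $x$-value with the smallest and largest $x$-values in the data. A minimal sketch follows; the function name `in_scope` is hypothetical, and the handspan list is invented for illustration (only its maximum of 25.5 cm comes from the example).

```python
def in_scope(x_new, x_values):
    """Return True if x_new lies between the min and max of the observed x-values."""
    return min(x_values) <= x_new <= max(x_values)

# Hypothetical handspans in cm; only the maximum (25.5) is taken from Example 7.
handspans = [18.0, 20.5, 22.0, 23.5, 25.5]

print(in_scope(26.0, handspans))  # False: using the model here would be extrapolation
print(in_scope(22.5, handspans))  # True: within the scope of the model
```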