economic math
Regression Analysis with Excel
In statistical modeling, regression analysis is used to estimate the relationships between two or more variables:
· Dependent variable (aka criterion variable) is the main factor you are trying to understand and predict.
· Independent variables (aka explanatory variables, or predictors) are the factors that might influence the dependent variable.
Regression analysis helps you understand how the dependent variable changes when one of the independent variables varies and allows to mathematically determine which of those variables really has an impact.
Technically, a regression analysis model is based on the sum of squares, which is a mathematical way to find the dispersion of data points. The goal of a model is to get the smallest possible sum of squares and draw a line that comes closest to the data.
In statistics, they differentiate between a simple and multiple linear regression. Simple linear regression models the relationship between a dependent variable and one independent variables using a linear function. If you use two or more explanatory variables to predict the independent variable, you deal with multiple linear regression. If the dependent variables are modeled as a non-linear function because the data relationships do not follow a straight line, use nonlinear regression instead.
Multiple linear regression
The regression equation that expresses the linear relationships between a single dependent variable and one or more independent variables is:
ŷ = b0 + b1x1 + b2x2 + … + bk-1xk-1 + bkxk
In this equation, ŷ is the predicted value of the dependent variable. Values of the k independent variables are denoted by x1, x2, x3, … , xk.
And finally, we have the b's - b0, b1, b2, … , bk. The b's are constants, called regression coefficients. Values are assigned to the b's based on the principle of least squares.
The regression constant (b0) is equal to the y intercept of the regression line.
The regression coefficient (b1) is the average change in the dependent variable (Y) for a 1-unit change in the independent variable (X). It is the slope of the regression line.
The least squares regression line
· The line minimizes the sum of squared differences between observed values (the y values) and predicted values (the ŷ values computed from the regression equation).
· The regression line passes through the mean of the X values (x) and through the mean of the Y values (y).
Example
Consider the table below. It shows three performance measures for 10 students.
|
Student |
Test score |
IQ |
Study hours |
|
1 |
100 |
110 |
40 |
|
2 |
95 |
110 |
40 |
|
3 |
90 |
120 |
30 |
|
4 |
85 |
110 |
40 |
|
5 |
80 |
100 |
20 |
|
6 |
75 |
110 |
40 |
|
7 |
70 |
90 |
0 |
|
8 |
65 |
110 |
40 |
|
9 |
60 |
80 |
10 |
|
10 |
55 |
80 |
10 |
Using data from the table, we are going to complete the following tasks:
· Develop a least-squares regression equation to predict test score, based on (1) IQ and (2) the number of hours that the student studied.
· Assess how well the regression equation predicts test score, the dependent variable.
· Assess the contribution of each independent variable (i.e., IQ and study hours) to the prediction.
How to Enable Excel
When you open Excel, the module for regression analysis may or may not be enabled. So, before you do anything else, you need to determine whether Excel is enabled. Here's how to do that:
· Open Excel.
· Click the Data tab.
· If you see the Data Analysis button in the upper right corner, the Analysis TookPak is enabled and you are ready to go.
If the Data Analysis button is not visible, the Analysis ToolPak is not enabled. In that case, do the following:
· Click the File tab.
· Select Options to open the Excel Options dialog box.
· Click the Add-Ins item, from the left column. This opens the View and Manage Microsoft Office Add-ins screen
· From the Manage drop-down box, choose Excel Add-Ins and click the Go button. This opens the Add-Ins dialog box.
· From the Add-Ins dialog, check the box beside Analysis ToolPak and click Go.
This enables the Analysis ToolPak. Now, when you click the Data tab, you will see a Data Analysis button in the upper right corner under the Data tab.
DATA ENTRY
Data entry with Excel is easy. There are three main steps:
· Enter data on spreadsheet.
· Identify independent and dependent variables.
· Specify desired analyses.
DATA ANALYSIS
Excel will display the Regression dialog box. This is where you identify data fields for the independent and dependent variables. In the Input Y Range, enter coordinates for the independent variable. In the Input X Range, enter coordinates for the dependent variable(s). If you include column labels in these input ranges, check the Labels box. In the example below, we have included labels, so the Labels box is checked.
By default, Excel will produce a standard set of outputs. For this sample problem, that's all we need; so click OK to generate standard regression outputs.
|
|
|
|
|
|
|
|
|
|
|
Result in a regression coefficients table:
|
|
|
|
|
|
|
|
|
SUMMARY OUTPUT |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Regression Statistics |
|
|
|
|
|
|
|
Multiple R |
0.951455 |
|
|
|
|
|
|
R Square |
0.905267 |
|
|
|
|
|
|
Adjusted R Square |
0.8782 |
|
|
|
|
|
|
Standard Error |
3.875119 |
|
|
|
|
|
|
Observations |
10 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
ANOVA |
|
|
|
|
|
|
|
|
df |
SS |
MS |
F |
Significance F |
|
|
Regression |
2 |
1004.484 |
502.2421 |
33.44592 |
0.000262 |
|
|
Residual |
7 |
105.1158 |
15.01654 |
|
|
|
|
Total |
9 |
1109.6 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Coefficients |
Standard Error |
t Stat |
P-value |
Lower 95% |
Upper 95% |
|
Intercept |
23.15614 |
15.96718 |
1.450234 |
0.190278 |
-14.6002 |
60.91251881 |
|
IQ |
0.509434 |
0.180809 |
2.817519 |
0.025865 |
0.081888 |
0.93698026 |
|
STUDY HRS |
0.467134 |
0.17196 |
2.716522 |
0.029916 |
0.060512 |
0.873754908 |
The regression intercept (b0) is 23.156, the regression coefficient for IQ (b1) is 0.509, and the regression coefficient for study hours (b2) is 0.467. So the least-squares regression equation can be re-written as:
ŷ = 23.156 + 0.505 * IQ + 0.467 * Hours
This is the only linear equation that satisfies a least-squares criterion. That means this equation fits the data from which it was created better than any other linear equation.
MULTIPLE R - (0.951455) This is the correlation coefficient . It tells you how strong the linear relationship is. For example, a value of 1 means a perfect positive relationship and a value of zero means no relationship at all. It is the square root of r squared
R SQUARE - (0.905267) The coefficient of determination is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable. Or It tells you how many points fall on the regression line.
· The coefficient of determination ranges from 0 to 1.
· An R2 of 0 means that the dependent variable cannot be predicted from the independent variable.
· An R2 of 1 means the dependent variable can be predicted without error from the independent variable.
· An R2 between 0 and 1 indicates the extent to which the dependent variable is predictable. An R2 of 0.10 means that 10 percent of the variance in Y is predictable from X; an R2 of 0.20 means that 20 percent is predictable; and so on.
The coefficient of multiple determination measures the proportion of variation in the dependent variable that can be predicted from the set of independent variables in the regression equation. When the regression equation fits the data well, R2 will be large (i.e., close to 1); and vice versa.
The coefficient of multiple determination can be defined in terms of sums of squares:
SSR = Σ ( ŷ - y )2
SSTO = Σ ( y - y )2
R2 = SSR / SSTO
where SSR is the sum of squares due to regression, SSTO is the total sum of squares, ŷ is the predicted value of the dependent variable, y is the dependent variable mean, and y is the dependent variable raw score.
A quick glance at the output suggests that the regression equation fits the data pretty well. The coefficient of multiple determination is 0.905. For our sample problem, this means 90.5% of test score variation can be explained by IQ and by hours spent in study.
ADJUSTED R SQUARE (0.8782): It is the proportion of the variation Y explained by X1 ,X2, . . . Xk adjusted for the number of predictors in the model. As the number of independent variables in the regression model increases, the coefficient of multiple determination, R2 , increases. Even if the additional independent variable is not a good predictor, its inclusion in the model decreases SSE which in turn increases SSR and R2. Because of this, another measure of the effectiveness of a multiple regression model, called Radj2 ,should be considered.
STANDARD ERROR (3.875119) about the regression line (often denoted by SE) is a measure of the average amount that the regression equation over- or under-predicts. The standard error of the regression is the precision that the regression coefficient is measured. The higher the coefficient of determination, the lower the standard error; and the more accurate predictions are likely to be. Standard error represents the average distance that the observed values fall from the regression line. Conveniently, it tells you how wrong the regression model is on average using the units of the response variable. Smaller values are better because it indicates that the observations are closer to the fitted line.
For example the S 3.875, which tells us that the average distance of the data points from the fitted line is about 3.87%.
Unlike R-squared, you can use the standard error of the regression to assess the precision of the predictions
Observations (10) . Number of observations in the sample.
ANOVA Table
Another way to evaluate the regression equation would be to assess the statistical significance of the regression sum of squares. For that, we examine the ANOVA table produced by Excel:
This table tests the statistical significance of the independent variables as predictors of the dependent variable.
The last column of the table shows the results of an overall F test. The F statistic (33.4) is big, and the p value (0.00026) is small. This indicates that one or both independent variables has explanatory power beyond what would be expected by chance.
Like the coefficient of multiple correlation, the overall F test found in the ANOVA table suggests that the regression equation fits the data well.
F-test in regression compares the fits of different linear models. Unlike t-tests that can assess only one regression coefficient at a time, the F-test can assess multiple coefficients simultaneously.
If the P value for the F-test of overall significance test is less than your significance level.
The F-statistic is simply a ratio of two variances. Variances are a measure of dispersion, or how far the data are scattered from the mean. Larger values represent greater dispersion.
If the group means are clustered close to the overall mean, their variance is low. However, if the group means are spread out further from the overall mean, their variance is higher.
While R-squared provides an estimate of the strength of the relationship between your model and the response variable, it does not provide a formal hypothesis test for this relationship. The overall F-test determines whether this relationship is statistically significant. If the P value for the overall F-test is less than your significance level, you can conclude that the R-squared value is significantly different from zero.
Significance of Regression Coefficients
With multiple regression, there is more than one independent variable; so it is natural to ask whether a particular independent variable contributes significantly to the regression after effects of other variables are taken into account. The answer to this question can be found in the regression coefficients table:
The regression coefficients table shows the following information each coefficient: its value, its standard error, a t-statistic, and the significance of the t-statistic. In this example, the t-statistics for IQ and study hours are both statistically significant at the 0.05 level. This means that IQ contributes significantly to the regression after effects of study hours are taken into account. And study hours contribute significantly to the regression after effects of IQ are taken into account.
T-Test - The two sample t-test simply tests whether or not two independent populations have different mean values on some measure. For example, we might have a research hypothesis that rich people have a different quality of life than poor people. We give a questionnaire that measures quality of life to a random sample of rich people and a random sample of poor people. The null hypothesis, which is assumed to be true until proven wrong, is that there is really no difference between these two populations. We gather some sample data and observe that the two groups have different average scores. But does this represent a real difference between the two populations, or just a difference in our particular sample, which might not be representative of the true population (e.g. unlucky random sample, or convenience sampling etc.). The statistics t-test allows us to answer this question by using the t-test statistic to determine a p-value that indicates how likely we could have gotten these results by chance, if in fact the null hypothesis were true (i.e. no difference in the population). By convention, if there is less than 5% chance of getting the observed differences by chance, we reject the null hypothesis and say we found a statistically significant difference between the two groups.
The p-value for each term tests the null hypothesis that the coefficient is equal to zero (no effect). A low p-value (< 0.05) indicates that you can reject the null hypothesis. In other words, a predictor that has a low p-value is likely to be a meaningful addition to your model because changes in the predictor's value are related to changes in the response variable.
Conversely, a larger (insignificant) p-value suggests that changes in the predictor are not associated with changes in the response.
The P-value (the probability of chance) can be thought of as the probability that the results of a statistical experiment are due only to chance. The lower the p-value, the greater the likelihood of obtaining the same (or very similar) results in a repetition of the statistical experiment. Thus, a low p-value is a good indication that the results are not due to random chance alone.
Exercise 1 – the following table shows the measure of the intelligence and extroversion of 20 sales persons along with the amount of sales per week.
Using excel – find out the both the correlation and regression output tables and interpret the results.
|
Sales Person |
Intelligence |
Extroversion |
$ Sales/Week |
|
1 |
89 |
21 |
$ 2,625 |
|
2 |
93 |
24 |
$ 2,700 |
|
3 |
91 |
21 |
$ 3,100 |
|
4 |
122 |
23 |
$ 3,150 |
|
5 |
115 |
27 |
$ 3,175 |
|
6 |
100 |
18 |
$ 3,100 |
|
7 |
98 |
19 |
$ 2,700 |
|
8 |
105 |
16 |
$ 2,475 |
|
9 |
112 |
23 |
$ 3,625 |
|
10 |
109 |
28 |
$ 3,525 |
|
11 |
130 |
20 |
$ 3,225 |
|
12 |
104 |
25 |
$ 3,450 |
|
13 |
104 |
20 |
$ 2,425 |
|
14 |
111 |
26 |
$ 3,025 |
|
15 |
97 |
28 |
$ 3,625 |
|
16 |
115 |
29 |
$ 2,750 |
|
17 |
113 |
25 |
$ 3,150 |
|
18 |
88 |
23 |
$ 2,600 |
|
19 |
108 |
19 |
$ 2,525 |
|
20 |
101 |
16 |
$ 2,650 |
Exercise 2
Suppose a commercial developer is considering purchasing a group of small office buildings in an established business district. The developer can use multiple linear regression analysis to estimate the value of an office building in a given area based on the following variables.
y = Assessed value of the office building
x1 = Floor space in square feet
x2 = Number of offices
x3= Number of entrances
x4 = Age of the office building in years
The developer randomly chooses a sample of 11 office buildings from a possible 1,500 office buildings and obtains the spreadsheet data. "Half an entrance" means an entrance for deliveries only.
|
Floor Space (x1) |
Offices (x2) |
Entrances (x3) |
Age (x4) |
Assed Value (y) |
|
2310 |
2 |
2 |
20 |
142000 |
|
2333 |
2 |
2 |
12 |
144000 |
|
2356 |
3 |
1.5 |
33 |
151000 |
|
2379 |
3 |
2 |
43 |
150000 |
|
2402 |
2 |
3 |
53 |
139000 |
|
2425 |
4 |
2 |
23 |
169000 |
|
2448 |
2 |
1.5 |
99 |
126000 |
|
2471 |
2 |
2 |
34 |
142900 |
|
2494 |
3 |
3 |
23 |
163000 |
|
2517 |
4 |
4 |
55 |
169000 |
|
2540 |
2 |
3 |
22 |
149000 |
|
2500 |
3 |
2 |
25 |
158258 |
Using Excel find the correlation and regression output tables and interpret the results.