excel assignment
California State University, East Bay College of Business and Economics
BUS 602 Business Analytics for Managers
Regression Analysis
Dr. Z. Radovilsky
Lecture Materials
Dr. Z. Radovilsky
‹#›
1
1
Learning Objectives
Explain regression analysis using simple and multiple linear regressions
Identify six general steps of linear regression analysis
Apply general steps in developing simple linear regression
Utilize various methods of developing regression equation
Analyze regression output using coefficient of determination, regression coefficient(s) and intercept, and significance Fs
Develop multiple regression analysis using the steps to create the best regression equation
Explain approaches of building good regression equation
Dr. Z. Radovilsky
‹#›
2
2
Regression analysis is the study of relationships between variables
Main goal: estimate values of one variable (dependent variable) on the basis of known values of the other variable or variables (independent variables), all of which are numeric
One of the most useful tools for business data analysis and business forecasting
Dependent or response variable is the single variable being explained by the regression
Independent or explanatory variable(s) is (are) used to explain the dependent variable
Regression analysis can be used for
Cross-sectional data
Time series data (forecasting)
Regression Analysis
Dr. Z. Radovilsky
‹#›
Regression Analysis – Linear Regression
Linear regression (one of the most popular types of regression) involves the attempt to develop a straight-line or linear mathematical equation that describes the relationship between a dependent variable (Y) and one or more independent variables (X1 , X2 , ... Xn )
Simple linear regression:
Y = b0 + b1X1
Multiple linear regression:
Y = b0 + b1X1 + b2X2 +...+ bk Xk
Dr. Z. Radovilsky
‹#›
Example – Simple Regression
Market Value = b0 + b1*Square Feet
Y = 426,798 + 38.127 X
Dr. Z. Radovilsky
‹#›
Example – Multiple Regression
Average Bank Balance = b0 + b1*Median Age + b2*Median Years Education + b3*Median Income + b4*Median Home Value
Dr. Z. Radovilsky
‹#›
Six General Steps of Linear Regression Analysis
Formulate the goal of regression analysis
Identify the dependent variable and potential independent variable(s)
Plot the data
A suggested step to check for potentially linear relationship between dependent and independent variables
Develop correlation coefficients (correlation matrix)
A suggested step: examine the strength of relationships between dependent and independent variable(s)
Identify a linear regression equation
If the plot and correlation coefficients indicate linear relationship between the dependent and independent variable(s)
Analyze statistical significance and usability of the regression equation
Identify if the regression equation is “good for use”
Dr. Z. Radovilsky
‹#›
Example of Linear Regression Analysis: Steps 1-3
Formulate the goal of regression analysis, and identify the dependent variable and potential independent variable(s)
Identify relationships between market value and square footage
Market Value – dependent variable (Y)
Square Feet – independent variable (X)
Plot the data
A suggested step to check for potentially linear relationship between dependent and independent variables
Scatterplot
Line plot, specifically for time series data
Identify correlation coefficients (correlation matrix) and examine the strength of relationships between dependent and independent variable(s)
Look for statistically significant correlation coefficient(s) between the dependent variable and independent variable(s)
Dr. Z. Radovilsky
‹#›
Simple Linear Regression
Single independent variable
Linear relationship
Dr. Z. Radovilsky
‹#›
Example: Scatterplot of Home Market Value
Dr. Z. Radovilsky
‹#›
Example: Correlation Coefficients
Dr. Z. Radovilsky
‹#›
Example of Regression Analysis: Step 4
If the plot and correlation coefficients indicate linear relationship between the dependent and independent variable(s), identify a linear regression equation
Dr. Z. Radovilsky
‹#›
Simple Linear Regression Equation
Simple regression model
Y = β0 + β1X1 + ε
where:
βo = Intercept
β1 = Slope (regression coefficient)
ε = Random error (residual)
Nature of random error
Variable may be influenced by other variables not included in the regression equation
Errors of measurement (rounding)
Unexplained (random) variability
Least-squares regression estimates β0 and β1 by b0 and b1, respectively, by minimizing the sum of squares of the residuals:
Y = b0 + b1X1
Dr. Z. Radovilsky
‹#›
Errors (Residuals)
Dr. Z. Radovilsky
‹#›
Least-Squares Regression for Simple Leniear Regression
Least squares method
The most popular method of identifying b0 and b1
Identifies such coefficients b0 and b1 that minimize the sum of the squared residuals Minimize εi2 = (Yi - (b0 + b1X1))2
Formulas for b1 (slope of the regression line) and b0 (intercept of the regression line)
Typically, utilize Excel or statistical software to identify b0 and b1
Using Excel functions: INTERCEPT ( ) and SLOPE ( )
Using Excel Trendline option on the scatterplot chart
Using Regression option in Data Analysis box
Dr. Z. Radovilsky
‹#›
Market Value Example: Excel Functions
For every additional square foot, the market value increases by approximately $38.13.
For a house with 1,800 square feet, the estimated market value is
Y = 426,798 + 38.127*1,800 = $495,427
Dr. Z. Radovilsky
‹#›
Market Value Example: Using Trendline
Dr. Z. Radovilsky
‹#›
Market Value Example: Simple Linear Regression Excel Output in Data Analysis Box
Dr. Z. Radovilsky
‹#›
Coefficient of Determination
“R Square” = R2, coefficient of determination: the proportion of variation explained by the independent variable (regression model)
0 R2 1
“Multiple R” = square root of R2, is the sample correlation coefficient r between X and Y
The sign of r is the same as the slope of the fitted line
Dr. Z. Radovilsky
‹#›
Statistical Significance of Regression Coefficient (Slope of Regression Line)
Regression coefficient of independent variable
Represents a slope of the regression line. It should be proven that it is statistically significant (not equal to zero)
P-value identifies a probability that the population regression coefficient is equal to 0
In order for a regression coefficient to be statistically significant, the P-value should be no more than 0.05 (no more than 5% of chance that the regression coefficient is equal to 0)
Dr. Z. Radovilsky
‹#›
Significance F
Significance F or F-Statistic (Variance Ratio Test)
Is a test of the overall significance of the estimated population regression line, that is, whether a dependent variable is linearly related to all independent variables
The regression equation is overall statistically significant if Significance F is no more than 0.05
Significance F ≤ 0.05
A chance that all population regression coefficients and R2 are equal to 0 is no more than 0.05 or 5%
Dr. Z. Radovilsky
‹#›
Continue with General Steps of Linear Regression Analysis
Analyze statistical significance and usability of the regression equation
Coefficient of Determination – R 2
Significance of regression coefficient and intercept
Overall significance of the regression equation – Significance F
Regression residuals (errors)
Overall, the regression equation is “good for use” if:
Coefficient of determination R2 is high (relatively high)
Regression coefficient(s) and intercept are statistically significant (P-values are less than 5%)
Significance F is also less than 5%
Dr. Z. Radovilsky
‹#›
Multiple Linear Regression
Multiple linear regression model:
Y = b0 + b1 X1 + b2 X2 + ... + bk Xk + e
Predicted model:
Y = b0 + b1X1 + b2 X2 + ... + bk Xk
The b’s are called partial regression coefficients
Dr. Z. Radovilsky
‹#›
Example: Banking Data
Need to identify relationships between Average Bank Balance and other variables, i.e., Median Age, Median Years of Education, Median Income, and Median Home Value
Dr. Z. Radovilsky
‹#›
Banking Data: Scatterplots
Dr. Z. Radovilsky
‹#›
Banking Data: Correlation Matrix
Dr. Z. Radovilsky
‹#›
Regression Equation for Banking Data
Dr. Z. Radovilsky
‹#›
Analyzing Results: R Square
Regression statistics similar to single independent variable case
R Square (coefficient of determination)
The value .923 indicates that 92.3% of the variation in graduation rate can be explained by the variation in the independent variables.
Adjusted R2 accounts for sample size and number of independent variables. It is useful for comparing models with different sets of independent variables
Dr. Z. Radovilsky
‹#›
Analyzing Results: Partial Regression Coefficients
Regression coefficients for Age and Income are statistically significant (good for use), because the respective P-values ≤ 0.05
However, the regression coefficients for Education and Home Value are not statistically significant (P value is substantially greater than 0.05). There is a high chance that the population regression coefficients for both variables are equal to 0.
Home Value and Education variables may be considered for removal from the regression equation
Dr. Z. Radovilsky
‹#›
Improved Regression Equation
Dr. Z. Radovilsky
‹#›
Building Good Models
Include only significant independent variables
Use the fewest necessary to permit adequate interpretation of the dependent variable
10 variables has potentially 210 = 1024 models!
As you add more explanatory variables to a model, R2 increases (even if the variables are irrelevant)
However, the Adjusted R2 could either increase or decrease, thus providing information about the value of additional variables.
Dr. Z. Radovilsky
‹#›
Modeling Approach
Construct a model with all available independent variables
Check for significance of the independent variables by examining the p-values
Identify the independent variable having the largest p-value that exceeds the chosen level of significance
Remove the variable from the model and evaluate adjusted R2
Continue until all variables are significant
Dr. Z. Radovilsky
‹#›
House
Age, Years
Square
Feet
Market
Value, $
331,812$490,000.00
321,914$504,400.00
321,842$493,300.00
331,812$491,000.00
321,836$501,900.00
332,028$508,500.00
321,732$487,600.00
331,850$496,000.00
321,791$489,200.00
331,566$488,400.00
321,852$500,800.00
321,620$496,700.00
321,592$487,500.00
322,472$514,000.00
322,372$513,200.00
331,666$487,500.00
322,123$516,100.00
321,620$494,700.00
321,731$486,400.00
321,666$487,100.00
281,520$483,400.00
271,484$479,800.00
281,588$481,500.00
281,598$487,100.00
281,484$482,600.00
281,484$478,800.00
281,520$487,600.00
271,701$494,200.00
281,484$482,000.00
281,468$488,100.00
281,520$488,100.00
271,520$488,600.00
271,484$476,600.00
281,520$484,400.00
271,668$490,900.00
281,588$481,000.00
281,784$491,300.00
271,484$481,300.00
272,020$500,700.00
281,520$487,200.00
271,684$496,700.00
271,881$520,700.00
MedianMedian YearsMedianMedianAverage Bank
AgeEducationIncomeHome ValueBalance
35.914.8$101,033$283,104$38,517
37.713.8$96,748$263,843$40,618
36.813.8$82,245$242,732$35,206
35.313.2$80,639$245,024$33,434
35.313.2$74,879$235,951$28,162
34.813.7$85,591$255,334$36,708
39.314.4$90,615$281,265$38,766
36.613.9$86,507$249,880$34,811
35.716.1$117,935$376,139$41,032
40.515.1$92,557$282,088$41,742
37.914.2$68,294$223,500$29,950
Correlation Matrix
House Age, YearsSquare FeetMarket Value, $
House Age, Years1
Square Feet0.4971.000
Market Value, $0.3610.8451
2
2
1
1
1
X
n
X
Y
X
n
Y
X
b
i
n
i
i
i
n
i
-
å
-
å
=
=
=
X
b
Y
b
1
0
-
=
Using INTERCEPT and SLOPE Functions
Intercept (bo)=INTERCEPT(C4:C45,B4:B45)426,798
Slope (b1)=SLOPE(C4:C45, B4:B45)38.127
Regression EquationY = 426,798 + 38.127 X
Square Feet X = 1,800Y = 426,798 + 38.127*1800 =495,427
SUMMARY OUTPUT
Regression Statistics
Multiple R0.8455
R Square0.7148
Adjusted R Square0.7077
Standard Error5705.2971
Observations42
ANOVA
dfSSMSF
Significance F
Regression132640531763264053176100.2771.8558E-12
Residual40130201658532550414.64
Total414566069762
Coefficients
Standard Errort Stat
P-value
Lower 95%Upper 95%Lower 95.0%Upper 95.0%
Intercept4267986577.22166364.890391823.7E-42413505.4299440091.55413505.43440091.552
Square Feet38.1273.80742473910.013833131.9E-1230.4318235645.82200830.431823645.8220084
Regression Statistics
Multiple R0.8455
R Square0.7148
Adjusted R Square0.7077
Standard Error5705.2971
Observations42
Coefficients
Standard Errort Stat
P-value
Lower 95%Upper 95%Lower 95.0%Upper 95.0%
Intercept4267986577.22166364.890391823.7E-42413505.4299440091.55413505.43440091.552
Square Feet38.1273.80742473910.013833131.9E-1230.4318235645.82200830.431823645.8220084
ANOVA
dfSSMSF
Significance F
Regression132640531763264053176100.2771.8558E-12
Residual40130201658532550414.64
Total414566069762
MedianMedian YearsMedianMedianAverage Bank
AgeEducationIncomeHome ValueBalance
35.914.8$101,033$283,104$38,517
37.713.8$96,748$263,843$40,618
36.813.8$82,245$242,732$35,206
35.313.2$80,639$245,024$33,434
35.313.2$74,879$235,951$28,162
34.813.7$85,591$255,334$36,708
39.314.4$90,615$281,265$38,766
36.613.9$86,507$249,880$34,811
35.716.1$117,935$376,139$41,032
40.515.1$92,557$282,088$41,742
37.914.2$68,294$223,500$29,950
43.115.8$98,041$294,369$51,107
37.712.9$74,597$219,305$34,936
36.013.1$74,894$241,011$32,387
40.416.1$71,091$294,928$32,150
33.813.6$86,771$259,531$37,996
36.413.5$65,609$223,085$24,672
37.712.8$84,091$243,750$37,603
36.212.9$63,713$212,649$26,785
39.112.7$70,262$226,928$32,576
39.416.1$121,548$330,893$56,569
36.112.8$58,600$205,737$26,144
35.312.7$61,419$204,149$24,558
37.512.8$61,182$206,898$23,584
34.412.8$70,753$195,869$26,773
33.713.8$74,601$203,737$27,877
$0$10,000$20,000$30,000$40,000$50,000$60,00015.020.025.030.035.040.045.0
Average Bank Balance vs. Median Age
$0$10,000$20,000$30,000$40,000$50,000$60,00010.011.012.013.014.015.016.017.0
Average Bank Balance vs. Median Years
of Education
$0$10,000$20,000$30,000$40,000$50,000$60,000$0$20,000$40,000$60,000$80,000$100,000$120,000$140,000
Average Bank Balance vs. Median
Income
$0$10,000$20,000$30,000$40,000$50,000$60,000$100,000$150,000$200,000$250,000$300,000$350,000$400,000
Average Bank Balance vs. Median
Home Value
AgeEducationIncomeHome ValueBalance
Age1.000
Education0.1731.000
Income0.4770.5751.000
Home Value0.3860.7540.7951.000
Balance0.5650.5550.9520.7661.000
Regression Statistics
Multiple R0.960469892
R Square0.922502414
Adjusted R Square0.919306637
Standard Error2470.750497
Observations102
ANOVA
dfSSMSF
Significance F
Regression470486971671.76E+09288.66297176.17711E-53
Residual97592146977.86104608
Total1017640844145
Coefficients
Standard Errort Stat
P-value
Lower 95%Upper 95%Lower 95.0%Upper 95.0%
Intercept-13554.52574552.740433-2.977220.003673236-22590.45516-4518.6-22590.45516-4518.596169
Age333.450821573.252534534.5520721.5418E-05188.0648178478.8368188.0648178478.8368252
Education313.0083905379.27425870.8252820.411236296-439.74602471065.763-439.74602471065.762806
Income0.3884537980.02198618417.668094.42981E-320.3448173090.432090.3448173090.432090288
Home Value-0.001393910.013128309-0.106180.915662272-0.0274499660.024662-0.0274499660.024662153
Regression Statistics
Multiple R0.960
R Square0.923
Adjusted R Square0.919
Standard Error2470.750
Observations102
Coefficients
Standard Errort Stat
P-value
Lower 95%Upper 95%
Intercept-13554.5264552.740433-2.977220.003673236-22590.45516-4518.596
Age333.45173.252534534.5520721.5418E-05188.0648178478.83683
Education313.008379.27425870.8252820.411236296-439.74602471065.7628
Income0.3880.02198618417.668094.42981E-320.3448173090.4320903
Home Value-0.0010.013128309-0.106180.915662272-0.0274499660.0246622
SUMMARY OUTPUT
Regression Statistics
Multiple R0.960
R Square0.922
Adjusted R Square0.920
Standard Error2457.293
Observations102
ANOVA
dfSSMSF
Significance F
Regression270430535763.52E+09583.19951.67372E-55
Residual99597790568.56038289
Total1017640844145
Coefficients
Standard Errort Stat
P-value
Lower 95%Upper 95%
Intercept-9878.3112273.255582-4.345453.37E-05-14388.94352-5367.679
Age322.72471.587740164.5080871.8E-05180.6781674464.76938
Income0.3970.01436968527.600442.73E-480.3680970230.4251222