excel assignment

profileflower
602_04_regression_analysis.pptx

California State University, East Bay College of Business and Economics

BUS 602 Business Analytics for Managers

Regression Analysis

Dr. Z. Radovilsky

Lecture Materials

Dr. Z. Radovilsky

‹#›

1

1

Learning Objectives

Explain regression analysis using simple and multiple linear regressions

Identify six general steps of linear regression analysis

Apply general steps in developing simple linear regression

Utilize various methods of developing regression equation

Analyze regression output using coefficient of determination, regression coefficient(s) and intercept, and significance Fs

Develop multiple regression analysis using the steps to create the best regression equation

Explain approaches of building good regression equation

Dr. Z. Radovilsky

‹#›

2

2

Regression analysis is the study of relationships between variables

Main goal: estimate values of one variable (dependent variable) on the basis of known values of the other variable or variables (independent variables), all of which are numeric

One of the most useful tools for business data analysis and business forecasting

Dependent or response variable is the single variable being explained by the regression

Independent or explanatory variable(s) is (are) used to explain the dependent variable

Regression analysis can be used for

Cross-sectional data

Time series data (forecasting)

Regression Analysis

Dr. Z. Radovilsky

‹#›

Regression Analysis – Linear Regression

Linear regression (one of the most popular types of regression) involves the attempt to develop a straight-line or linear mathematical equation that describes the relationship between a dependent variable (Y) and one or more independent variables (X1 , X2 , ... Xn )

Simple linear regression:

Y = b0 + b1X1

Multiple linear regression:  

Y = b0 + b1X1 + b2X2 +...+ bk Xk  

Dr. Z. Radovilsky

‹#›

Example – Simple Regression

Market Value = b0 + b1*Square Feet

Y = 426,798 + 38.127 X

Dr. Z. Radovilsky

‹#›

Example – Multiple Regression

Average Bank Balance = b0 + b1*Median Age + b2*Median Years Education + b3*Median Income + b4*Median Home Value

Dr. Z. Radovilsky

‹#›

Six General Steps of Linear Regression Analysis

Formulate the goal of regression analysis

Identify the dependent variable and potential independent variable(s)

Plot the data

A suggested step to check for potentially linear relationship between dependent and independent variables

Develop correlation coefficients (correlation matrix)

A suggested step: examine the strength of relationships between dependent and independent variable(s)

Identify a linear regression equation

If the plot and correlation coefficients indicate linear relationship between the dependent and independent variable(s)

Analyze statistical significance and usability of the regression equation

Identify if the regression equation is “good for use”

Dr. Z. Radovilsky

‹#›

Example of Linear Regression Analysis: Steps 1-3

Formulate the goal of regression analysis, and identify the dependent variable and potential independent variable(s)

Identify relationships between market value and square footage

Market Value – dependent variable (Y)

Square Feet – independent variable (X)

Plot the data

A suggested step to check for potentially linear relationship between dependent and independent variables

Scatterplot

Line plot, specifically for time series data

Identify correlation coefficients (correlation matrix) and examine the strength of relationships between dependent and independent variable(s)

Look for statistically significant correlation coefficient(s) between the dependent variable and independent variable(s)

Dr. Z. Radovilsky

‹#›

Simple Linear Regression

Single independent variable

Linear relationship

Dr. Z. Radovilsky

‹#›

Example: Scatterplot of Home Market Value

Dr. Z. Radovilsky

‹#›

Example: Correlation Coefficients

Dr. Z. Radovilsky

‹#›

Example of Regression Analysis: Step 4

If the plot and correlation coefficients indicate linear relationship between the dependent and independent variable(s), identify a linear regression equation

Dr. Z. Radovilsky

‹#›

Simple Linear Regression Equation

Simple regression model

Y = β0 + β1X1 + ε

where:

βo = Intercept

β1 = Slope (regression coefficient)

ε = Random error (residual)

Nature of random error

Variable may be influenced by other variables not included in the regression equation

Errors of measurement (rounding)

Unexplained (random) variability

Least-squares regression estimates β0 and β1 by b0 and b1, respectively, by minimizing the sum of squares of the residuals:

Y = b0 + b1X1

 

 

 

 

Dr. Z. Radovilsky

‹#›

Errors (Residuals)

Dr. Z. Radovilsky

‹#›

Least-Squares Regression for Simple Leniear Regression

Least squares method

The most popular method of identifying b0 and b1

Identifies such coefficients b0 and b1 that minimize the sum of the squared residuals Minimize  εi2 =  (Yi - (b0 + b1X1))2

Formulas for b1 (slope of the regression line) and b0 (intercept of the regression line)

Typically, utilize Excel or statistical software to identify b0 and b1

Using Excel functions: INTERCEPT ( ) and SLOPE ( )

Using Excel Trendline option on the scatterplot chart

Using Regression option in Data Analysis box

Dr. Z. Radovilsky

‹#›

Market Value Example: Excel Functions

For every additional square foot, the market value increases by approximately $38.13.

For a house with 1,800 square feet, the estimated market value is

Y = 426,798 + 38.127*1,800 = $495,427

Dr. Z. Radovilsky

‹#›

Market Value Example: Using Trendline

Dr. Z. Radovilsky

‹#›

Market Value Example: Simple Linear Regression Excel Output in Data Analysis Box

Dr. Z. Radovilsky

‹#›

Coefficient of Determination

“R Square” = R2, coefficient of determination: the proportion of variation explained by the independent variable (regression model)

0  R2  1

“Multiple R” = square root of R2, is the sample correlation coefficient r between X and Y

The sign of r is the same as the slope of the fitted line

Dr. Z. Radovilsky

‹#›

Statistical Significance of Regression Coefficient (Slope of Regression Line)

Regression coefficient of independent variable

Represents a slope of the regression line. It should be proven that it is statistically significant (not equal to zero)

P-value identifies a probability that the population regression coefficient is equal to 0

In order for a regression coefficient to be statistically significant, the P-value should be no more than 0.05 (no more than 5% of chance that the regression coefficient is equal to 0)

Dr. Z. Radovilsky

‹#›

Significance F

Significance F or F-Statistic (Variance Ratio Test)

Is a test of the overall significance of the estimated population regression line, that is, whether a dependent variable is linearly related to all independent variables

The regression equation is overall statistically significant if Significance F is no more than 0.05

Significance F ≤ 0.05

A chance that all population regression coefficients and R2 are equal to 0 is no more than 0.05 or 5%

Dr. Z. Radovilsky

‹#›

Continue with General Steps of Linear Regression Analysis

Analyze statistical significance and usability of the regression equation

Coefficient of Determination – R 2

Significance of regression coefficient and intercept

Overall significance of the regression equation – Significance F

Regression residuals (errors)

Overall, the regression equation is “good for use” if:

Coefficient of determination R2 is high (relatively high)

Regression coefficient(s) and intercept are statistically significant (P-values are less than 5%)

Significance F is also less than 5%

Dr. Z. Radovilsky

‹#›

Multiple Linear Regression

Multiple linear regression model:

Y = b0 + b1 X1 + b2 X2 + ... + bk Xk + e

Predicted model:

Y = b0 + b1X1 + b2 X2 + ... + bk Xk

The b’s are called partial regression coefficients

Dr. Z. Radovilsky

‹#›

Example: Banking Data

Need to identify relationships between Average Bank Balance and other variables, i.e., Median Age, Median Years of Education, Median Income, and Median Home Value

Dr. Z. Radovilsky

‹#›

Banking Data: Scatterplots

Dr. Z. Radovilsky

‹#›

Banking Data: Correlation Matrix

Dr. Z. Radovilsky

‹#›

Regression Equation for Banking Data

Dr. Z. Radovilsky

‹#›

Analyzing Results: R Square

Regression statistics similar to single independent variable case

R Square (coefficient of determination)

The value .923 indicates that 92.3% of the variation in graduation rate can be explained by the variation in the independent variables.

Adjusted R2 accounts for sample size and number of independent variables. It is useful for comparing models with different sets of independent variables

Dr. Z. Radovilsky

‹#›

Analyzing Results: Partial Regression Coefficients

Regression coefficients for Age and Income are statistically significant (good for use), because the respective P-values ≤ 0.05

However, the regression coefficients for Education and Home Value are not statistically significant (P value is substantially greater than 0.05). There is a high chance that the population regression coefficients for both variables are equal to 0.

Home Value and Education variables may be considered for removal from the regression equation

Dr. Z. Radovilsky

‹#›

Improved Regression Equation

Dr. Z. Radovilsky

‹#›

Building Good Models

Include only significant independent variables

Use the fewest necessary to permit adequate interpretation of the dependent variable

10 variables has potentially 210 = 1024 models!

As you add more explanatory variables to a model, R2 increases (even if the variables are irrelevant)

However, the Adjusted R2 could either increase or decrease, thus providing information about the value of additional variables.

Dr. Z. Radovilsky

‹#›

Modeling Approach

Construct a model with all available independent variables

Check for significance of the independent variables by examining the p-values

Identify the independent variable having the largest p-value that exceeds the chosen level of significance

Remove the variable from the model and evaluate adjusted R2

Continue until all variables are significant

Dr. Z. Radovilsky

‹#›

House

Age, Years

Square

Feet

Market

Value, $

331,812$490,000.00

321,914$504,400.00

321,842$493,300.00

331,812$491,000.00

321,836$501,900.00

332,028$508,500.00

321,732$487,600.00

331,850$496,000.00

321,791$489,200.00

331,566$488,400.00

321,852$500,800.00

321,620$496,700.00

321,592$487,500.00

322,472$514,000.00

322,372$513,200.00

331,666$487,500.00

322,123$516,100.00

321,620$494,700.00

321,731$486,400.00

321,666$487,100.00

281,520$483,400.00

271,484$479,800.00

281,588$481,500.00

281,598$487,100.00

281,484$482,600.00

281,484$478,800.00

281,520$487,600.00

271,701$494,200.00

281,484$482,000.00

281,468$488,100.00

281,520$488,100.00

271,520$488,600.00

271,484$476,600.00

281,520$484,400.00

271,668$490,900.00

281,588$481,000.00

281,784$491,300.00

271,484$481,300.00

272,020$500,700.00

281,520$487,200.00

271,684$496,700.00

271,881$520,700.00

MedianMedian YearsMedianMedianAverage Bank

AgeEducationIncomeHome ValueBalance

35.914.8$101,033$283,104$38,517

37.713.8$96,748$263,843$40,618

36.813.8$82,245$242,732$35,206

35.313.2$80,639$245,024$33,434

35.313.2$74,879$235,951$28,162

34.813.7$85,591$255,334$36,708

39.314.4$90,615$281,265$38,766

36.613.9$86,507$249,880$34,811

35.716.1$117,935$376,139$41,032

40.515.1$92,557$282,088$41,742

37.914.2$68,294$223,500$29,950

Correlation Matrix

House Age, YearsSquare FeetMarket Value, $

House Age, Years1

Square Feet0.4971.000

Market Value, $0.3610.8451

2

2

1

1

1

X

n

X

Y

X

n

Y

X

b

i

n

i

i

i

n

i

-

å

-

å

=

=

=

X

b

Y

b

1

0

-

=

Using INTERCEPT and SLOPE Functions

Intercept (bo)=INTERCEPT(C4:C45,B4:B45)426,798

Slope (b1)=SLOPE(C4:C45, B4:B45)38.127

Regression EquationY = 426,798 + 38.127 X

Square Feet X = 1,800Y = 426,798 + 38.127*1800 =495,427

SUMMARY OUTPUT

Regression Statistics

Multiple R0.8455

R Square0.7148

Adjusted R Square0.7077

Standard Error5705.2971

Observations42

ANOVA

dfSSMSF

Significance F

Regression132640531763264053176100.2771.8558E-12

Residual40130201658532550414.64

Total414566069762

Coefficients

Standard Errort Stat

P-value

Lower 95%Upper 95%Lower 95.0%Upper 95.0%

Intercept4267986577.22166364.890391823.7E-42413505.4299440091.55413505.43440091.552

Square Feet38.1273.80742473910.013833131.9E-1230.4318235645.82200830.431823645.8220084

Regression Statistics

Multiple R0.8455

R Square0.7148

Adjusted R Square0.7077

Standard Error5705.2971

Observations42

Coefficients

Standard Errort Stat

P-value

Lower 95%Upper 95%Lower 95.0%Upper 95.0%

Intercept4267986577.22166364.890391823.7E-42413505.4299440091.55413505.43440091.552

Square Feet38.1273.80742473910.013833131.9E-1230.4318235645.82200830.431823645.8220084

ANOVA

dfSSMSF

Significance F

Regression132640531763264053176100.2771.8558E-12

Residual40130201658532550414.64

Total414566069762

MedianMedian YearsMedianMedianAverage Bank

AgeEducationIncomeHome ValueBalance

35.914.8$101,033$283,104$38,517

37.713.8$96,748$263,843$40,618

36.813.8$82,245$242,732$35,206

35.313.2$80,639$245,024$33,434

35.313.2$74,879$235,951$28,162

34.813.7$85,591$255,334$36,708

39.314.4$90,615$281,265$38,766

36.613.9$86,507$249,880$34,811

35.716.1$117,935$376,139$41,032

40.515.1$92,557$282,088$41,742

37.914.2$68,294$223,500$29,950

43.115.8$98,041$294,369$51,107

37.712.9$74,597$219,305$34,936

36.013.1$74,894$241,011$32,387

40.416.1$71,091$294,928$32,150

33.813.6$86,771$259,531$37,996

36.413.5$65,609$223,085$24,672

37.712.8$84,091$243,750$37,603

36.212.9$63,713$212,649$26,785

39.112.7$70,262$226,928$32,576

39.416.1$121,548$330,893$56,569

36.112.8$58,600$205,737$26,144

35.312.7$61,419$204,149$24,558

37.512.8$61,182$206,898$23,584

34.412.8$70,753$195,869$26,773

33.713.8$74,601$203,737$27,877

$0$10,000$20,000$30,000$40,000$50,000$60,00015.020.025.030.035.040.045.0

Average Bank Balance vs. Median Age

$0$10,000$20,000$30,000$40,000$50,000$60,00010.011.012.013.014.015.016.017.0

Average Bank Balance vs. Median Years

of Education

$0$10,000$20,000$30,000$40,000$50,000$60,000$0$20,000$40,000$60,000$80,000$100,000$120,000$140,000

Average Bank Balance vs. Median

Income

$0$10,000$20,000$30,000$40,000$50,000$60,000$100,000$150,000$200,000$250,000$300,000$350,000$400,000

Average Bank Balance vs. Median

Home Value

AgeEducationIncomeHome ValueBalance

Age1.000

Education0.1731.000

Income0.4770.5751.000

Home Value0.3860.7540.7951.000

Balance0.5650.5550.9520.7661.000

Regression Statistics

Multiple R0.960469892

R Square0.922502414

Adjusted R Square0.919306637

Standard Error2470.750497

Observations102

ANOVA

dfSSMSF

Significance F

Regression470486971671.76E+09288.66297176.17711E-53

Residual97592146977.86104608

Total1017640844145

Coefficients

Standard Errort Stat

P-value

Lower 95%Upper 95%Lower 95.0%Upper 95.0%

Intercept-13554.52574552.740433-2.977220.003673236-22590.45516-4518.6-22590.45516-4518.596169

Age333.450821573.252534534.5520721.5418E-05188.0648178478.8368188.0648178478.8368252

Education313.0083905379.27425870.8252820.411236296-439.74602471065.763-439.74602471065.762806

Income0.3884537980.02198618417.668094.42981E-320.3448173090.432090.3448173090.432090288

Home Value-0.001393910.013128309-0.106180.915662272-0.0274499660.024662-0.0274499660.024662153

Regression Statistics

Multiple R0.960

R Square0.923

Adjusted R Square0.919

Standard Error2470.750

Observations102

Coefficients

Standard Errort Stat

P-value

Lower 95%Upper 95%

Intercept-13554.5264552.740433-2.977220.003673236-22590.45516-4518.596

Age333.45173.252534534.5520721.5418E-05188.0648178478.83683

Education313.008379.27425870.8252820.411236296-439.74602471065.7628

Income0.3880.02198618417.668094.42981E-320.3448173090.4320903

Home Value-0.0010.013128309-0.106180.915662272-0.0274499660.0246622

SUMMARY OUTPUT

Regression Statistics

Multiple R0.960

R Square0.922

Adjusted R Square0.920

Standard Error2457.293

Observations102

ANOVA

dfSSMSF

Significance F

Regression270430535763.52E+09583.19951.67372E-55

Residual99597790568.56038289

Total1017640844145

Coefficients

Standard Errort Stat

P-value

Lower 95%Upper 95%

Intercept-9878.3112273.255582-4.345453.37E-05-14388.94352-5367.679

Age322.72471.587740164.5080871.8E-05180.6781674464.76938

Income0.3970.01436968527.600442.73E-480.3680970230.4251222