Multiple Linear Regression

Report for Multiple Linear Regression Project / IE 5318 / Fall 2020

Due date: Dec 16th, 2020

Title

To study the expected profit of 50 start-up firms in the USA.

Dataset reference

https://www.kaggle.com/root64shivansh/profit-in-startup-of-a-company/notebooks?sortBy=dateRun&group=profile&pageSize=20&datasetId=457637

Team members

Shekhar Madhav Khairnar (1001822833)

Jayant Vithalrao Madan (1001814817)

Section 1: Background and data description

Describe the problem and the variables.

Multiple linear regression is a statistical technique used to predict the outcome of one variable from the values of two or more other variables. It is sometimes known simply as multiple regression, and it is an extension of simple linear regression. The variable we want to predict is the dependent variable, while the variables we use to predict it are the independent or explanatory variables. Our project focuses on the profit of a start-up. Earning a profit is important because it determines whether a company can secure financing from a bank, attract investors to fund its operations, and grow its business. Start-ups are full of promise and excitement, but they are also full of risk and uncertainty.

The variables

The dataset consists of the following attributes:

• The response variable (y): Profit

• Predictor variables (x): Research and Development spend, Administration spend, Marketing spend

Present and discuss the matrix scatter plot of the variables, and check response-predictor and predictor-predictor pairwise correlations.

Fig – heat map

Fig – scatter plot

Fig – correlation matrix

Response vs Predictor

(profit [Y] vs Research and development spend [X1])

The scatter plot shows a strong positive linear relationship between these two variables, with a steady upward trend. Including this predictor in the MLR model will help explain the variation in profit. The correlation coefficient is 0.972.

(profit [Y] vs Marketing spend [X2])

The scatter plot shows a roughly linear pattern with an upward trend between these two variables. The relationship is positive but weaker than for research and development spend, so we call it moderate. The correlation coefficient is 0.747. The plot also suggests possible outliers.

(profit [Y] vs Administration spend [X3])

The scatter plot is a random point cloud. In our judgment, this variable may not be able to explain the variability in profit to the extent that the other two variables do. The correlation coefficient is 0.20. Given the scattered pattern, it may also contain outliers.

Predictor vs Predictor

(Research and development spend [X1] vs Marketing spend [X2])

Here there is a positive upward trend, and the correlation coefficient is 0.724. These two predictors are the most strongly correlated pair.

(Research and development spend [X1] vs Administration spend [X3])

Here the figure shows a random cloud with no linear trend. The correlation coefficient is 0.241, so these two predictors are only weakly correlated.

(Marketing spend [X2] vs Administration spend [X3])

Here the figure shows a random cloud with no linear trend. The correlation coefficient is -0.032, so these two predictors are essentially uncorrelated (very slightly negative).

Because all predictor-predictor correlations (the largest is 0.724) are well below the strongest response-predictor correlation (0.972), we may not have a serious multicollinearity problem.
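The pairwise correlations discussed above can be reproduced with a few lines of Python. The following is a minimal sketch using only the standard library; the function name `pearson_r` and the toy values are illustrative assumptions and are not rows from the start-up dataset.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical toy values (NOT taken from the start-up dataset).
rd_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
profit = [12.0, 19.0, 33.0, 38.0, 52.0]

r = pearson_r(rd_spend, profit)
print(f"r = {r:.4f}")
```

Applying the same function to every pair of columns gives the entries of the correlation matrix.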

Discuss potential complications

Since the predictor-predictor correlations are all less than 0.90, we do not have a serious multicollinearity problem in our model. The plots of predictor against predictor show no curved pattern, so we do not expect a curvilinearity problem. There may be outliers, which we examine further with the help of residual plots and diagnostics.

Section 2: The multiple linear regression model

Fit a preliminary model.

The general multiple linear regression model is

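$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$$

where $Y_i$ is the dependent variable, $\beta_0, \beta_1, \dots, \beta_{p-1}$ are parameters, $X_{i1}, \dots, X_{i,p-1}$ are known constants, and the $\varepsilon_i$ are independent $N(0, \sigma^2)$, $i = 1, \dots, n$.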

MODEL I

Regression Statistics
Multiple R           0.975062046
R Square             0.950745994
Adjusted R Square    0.947533776
Standard Error       9232.334837
Observations         50

ANOVA
             df    SS             MS             F             Significance F
Regression   3     75683964196    25227988065    295.9780624   4.52851E-30
Residual     46    3920856301     85236006.54
Total        49    79604820497

                  Coefficients    Standard Error   t Stat         P-value       Lower 95%      Upper 95%     Lower 90.0%    Upper 90.0%
Intercept         50122.19299     6572.352622      7.626217867    1.05738E-09   36892.73332    63351.65266   39089.44482    61154.94116
R&D Spend         0.80571505      0.04514727       17.84637376    2.63497E-22   0.714838309    0.89659179    0.729928115    0.881501984
Administration    -0.026815968    0.05102878       -0.525506752   0.601755108   -0.129531575   0.075899638   -0.112475961   0.058844024
Marketing Spend   0.027228065     0.016451235      1.6550773      0.104716819   -0.005886553   0.060342682   -0.000387971   0.054844101

As per our model, estimated best fit equation is,

Ŷ (profit) = 50122.19299 + 0.8057 (R and D spend) - 0.0268 (administration) + 0.0272 (marketing spend)

According to this model, holding the other predictors fixed, a one-unit increase in R and D spend is associated with a 0.80571505-unit increase in expected profit, a one-unit increase in administration with a 0.026815968-unit decrease, and a one-unit increase in marketing spend with a 0.027228065-unit increase.

Model assumptions

The MLR model is reasonable

Residual have constant variance

The plot of residuals vs Ŷ helps us judge whether the variance is constant. Glancing at the graph, the variance appears to be non-constant; to make sure, we performed a Modified Levene test for non-constant variance.

Residuals are normally distributed

The normal probability plot helps us judge whether the residuals are normally distributed. In this figure the plot has shorter tails, and hence we conclude that the model assumption of normality is not satisfied. Furthermore, we calculated the Pearson correlation coefficient of 0.9646 using software and used a hypothesis test to confirm this result.

To check the fit of the model form, a residual plot against each predictor variable is needed to check for curvature. According to the above residual plots there is no curvilinear trend, and hence it can be concluded that the model form is reasonable.

F – test:

H0: Variance is constant

H1: Variance is not constant

Decision rule: if p > α, fail to reject H0. α = 0.10

From the above table, p = 0.99, which is greater than 0.10.

So we fail to reject H0 and conclude that the error variance is constant.
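One common form of the modified Levene (Brown-Forsythe) test splits the residuals into two groups and compares their mean absolute deviations from the group medians with a two-sample t statistic. The sketch below assumes the split is made at the median fitted value; the report only gives the software p-value, so the exact procedure used there is not known, and the toy residuals are hypothetical.

```python
import math
from statistics import mean, median

def modified_levene_t(residuals, fitted):
    """Two-sample t statistic of the modified Levene (Brown-Forsythe) test.

    Residuals are split at the median fitted value; the statistic compares
    the mean absolute deviation from each group's median residual.
    """
    cut = median(fitted)
    g1 = [e for e, f in zip(residuals, fitted) if f <= cut]
    g2 = [e for e, f in zip(residuals, fitted) if f > cut]
    d1 = [abs(e - median(g1)) for e in g1]
    d2 = [abs(e - median(g2)) for e in g2]
    n1, n2 = len(d1), len(d2)
    # Pooled variance of the absolute deviations.
    pooled_var = (sum((d - mean(d1)) ** 2 for d in d1)
                  + sum((d - mean(d2)) ** 2 for d in d2)) / (n1 + n2 - 2)
    return (mean(d1) - mean(d2)) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Hypothetical toy data (NOT the project residuals): spread grows with Y-hat.
toy_fitted = [1, 2, 3, 4, 5, 6, 7, 8]
toy_resid = [-1, 1, -2, 2, -3, 3, -6, 6]
t_star = modified_levene_t(toy_resid, toy_fitted)
print(f"t* = {t_star:.4f}")
```

Constant variance is rejected when |t*| exceeds t(1 − α/2; n − 2), or equivalently when the two-sided p-value is below α.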

Residuals are uncorrelated

This assumption is relevant only for time-series data, where one observation depends on another. Our data are not ordered in time.

No outliers

To find X outliers we used the hat matrix: an observation with leverage hii > 2p/n = 2(4)/50 = 0.16 is considered an X outlier. In our case, observations 7, 20, 47 and 49, with leverage values 0.17196, 0.18271, 0.2406 and 0.20802 respectively, are X outliers. We did not find any Y outliers.

To check their influence, we used the following measures:

1) DFFITS: for observations 7, 20, 47 and 49 we got -2.55, -2.92, 6.37 and -1.37, which in absolute value exceed the cutoff 2√(p/n) = 0.565, and hence they are influential.

2) Cook's distance: for observations 7, 20, 47 and 49 we got 8.51, 1.57, 6.56 and 3.87, which are greater than F(0.5; 4, 46) = 0.8516, and hence they are influential.
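The leverage and DFFITS cutoffs used here are simple functions of p and n. As a quick check, the following sketch recomputes them for p = 4 and n = 50 and flags the leverage values quoted in the text; the function name `outlier_thresholds` is an illustrative assumption.

```python
import math

def outlier_thresholds(p, n):
    """Rule-of-thumb cutoffs: leverage h_ii > 2p/n and |DFFITS| > 2*sqrt(p/n)."""
    return 2 * p / n, 2 * math.sqrt(p / n)

leverage_cut, dffits_cut = outlier_thresholds(p=4, n=50)

# Leverage values quoted in the text for observations 7, 20, 47 and 49.
reported_leverage = {7: 0.17196, 20: 0.18271, 47: 0.2406, 49: 0.20802}
flagged = [obs for obs, h in reported_leverage.items() if h > leverage_cut]

print(leverage_cut, round(dffits_cut, 4), flagged)
```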

Predictors are not highly correlated

Variance Inflation Factor: VIF is a method of detecting multicollinearity among the predictors. It measures how much the variance of an estimated regression coefficient is inflated compared with when the predictors are not linearly related. A VIF of 10 or above indicates that a predictor is highly correlated with the others. For our model the calculated variance inflation factors are 1.73, 1.23 and 1.54, which are all below 10, so the assumption that the predictors are not highly correlated is valid.

Present your preliminary model that satisfies the model assumptions.

Our preliminary model does not satisfy all the assumptions, but the model form is reasonable and the error has constant variance. For the test of significance, F* = 295.97 is greater than F(0.90; 3, 46) = 2.2068, so we reject H0 and conclude that the regression is significant at the 10% level. Moreover, 95.07% of the variability in profit is explained by the model. Because R² never decreases as predictors are added, we also report the adjusted R², which penalizes the number of parameters: adjusted R² = 1 − (SSE/(n − p))/(SSTO/(n − 1)). In our case it is 94.75%.
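The summary statistics quoted for Model I follow directly from the sums of squares in its ANOVA table. The sketch below recomputes R², adjusted R² and F* from those reported values; the variable names are illustrative.

```python
# Sums of squares copied from the Model I ANOVA table.
ssr = 75683964196.0   # regression sum of squares
sse = 3920856301.0    # residual (error) sum of squares
ssto = 79604820497.0  # total sum of squares
n, p = 50, 4          # observations; parameters including the intercept

r_squared = ssr / ssto
adj_r_squared = 1 - (sse / (n - p)) / (ssto / (n - 1))
f_star = (ssr / (p - 1)) / (sse / (n - p))  # MSR / MSE

print(f"R^2 = {r_squared:.4f}, adjusted R^2 = {adj_r_squared:.4f}, F* = {f_star:.2f}")
```

The adjusted value is slightly smaller than R² because the ratio of mean squares, not sums of squares, is used.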

Section 3: Explore the interaction terms

Partial regression plots. And Discuss the addition of possibly useful interaction terms.

Partial regression plots are related to, but distinct from, partial residual plots. Partial regression plots are most commonly used to identify leverage points and influential data points that might not be leverage points. Partial residual plots are most commonly used to identify the nature of the relationship between Y and Xi. In the partial regression plot, the x-axis is not Xi, which limits its usefulness in determining the need for a transformation.

H0: Normally distributed

H1: Normality is violated

Decision rule: if ρ̂ < c(α, n), reject H0. α = 0.10

From the critical values table c(α, n) = 0.981, and from the above table ρ̂ = 0.9646, which is less than the critical value.

So, we reject H0. It is a strong conclusion. We conclude that normality is violated.
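The statistic ρ̂ in this test is the correlation between the ordered residuals and their expected values under normality. A minimal sketch follows, assuming the usual approximation √MSE · z((k − 0.375)/(n + 0.25)) for the expected value of the k-th smallest residual; the function names are illustrative, and the critical value is passed in from the table rather than computed.

```python
import math
from statistics import NormalDist

def normal_probability_correlation(residuals, mse):
    """Correlation between ordered residuals and their expected normal values."""
    n = len(residuals)
    ordered = sorted(residuals)
    expected = [math.sqrt(mse) * NormalDist().inv_cdf((k - 0.375) / (n + 0.25))
                for k in range(1, n + 1)]
    mean_o = sum(ordered) / n
    mean_e = sum(expected) / n
    sxy = sum((o - mean_o) * (e - mean_e) for o, e in zip(ordered, expected))
    sxx = sum((o - mean_o) ** 2 for o in ordered)
    syy = sum((e - mean_e) ** 2 for e in expected)
    return sxy / math.sqrt(sxx * syy)

def normality_rejected(rho_hat, critical_value):
    """Decision rule: reject H0 (normality) when rho_hat < c(alpha, n)."""
    return rho_hat < critical_value

# Values quoted in the text: rho_hat = 0.9646, c(0.10, 50) = 0.981.
print(normality_rejected(0.9646, 0.981))
```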

Section 4: Model search

Obtain a set of two potentially good models (backward deletion & stepwise regression):

To further narrow our search for a good regression model we used two approaches: backward deletion and stepwise regression. We also assumed that multicollinearity is not a serious problem.

Backward deletion

We began with the full set of predictor variables and α = 0.10, regressed Y on the full set, and got the following coefficient table.

Coefficients Standard Error t Stat P-value

Intercept 50122.19299 6572.352622 7.626217867 1.05738E-09

R&D Spend 0.80571505 0.04514727 17.84637376 2.63497E-22

Administration -0.026815968 0.05102878 -0.525506752 0.601755108

Marketing Spend 0.027228065 0.016451235 1.6550773 0.104716819

In this table the highest p-value is 0.601 (administration), which is greater than α = 0.10, so we remove the administration predictor. After removing it we got the following results.

Partial regression plot notes (in these notes X1 = administration, X2 = R and D spend, X3 = marketing spend):

According to the graph there is some kind of linear trend, so the interaction of X2 (R and D spend) and X3 (marketing spend) needs to be added.

The points have no trend (random cloud), so we conclude not to add the interaction of X1 (administration) and X3 (marketing spend), because it will not contribute to our model.

The points have no trend (random cloud), so we conclude not to add the interaction of X1 (administration) and X2 (R and D spend), because it will not contribute to our model.

Coefficients Standard Error t Stat P-value

Intercept 46975.86422 2689.932925 17.46358201 3.50406E-22

R&D Spend 0.796584044 0.041347578 19.26555518 6.04043E-24

Marketing Spend 0.029907875 0.015520012 1.927052287 0.060030397

At this point all the p values of predictor variables are less than 0.10, Hence we got this as the potentially good model.
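The backward deletion steps can be written as a short loop: fit, drop the predictor with the largest p-value if it exceeds α, and refit. The sketch below is illustrative only. Instead of refitting a regression, it looks up the p-values already reported in the two coefficient tables of this section, and the helper names (`backward_deletion`, `REPORTED_PVALUES`) are assumptions.

```python
def backward_deletion(predictors, pvalues_for, alpha=0.10):
    """Drop the least significant predictor until all p-values are <= alpha.

    `pvalues_for` maps a tuple of predictor names to a dict of their p-values;
    in a real analysis it would refit the regression on those predictors.
    """
    current = list(predictors)
    while current:
        pvals = pvalues_for(tuple(current))
        worst = max(current, key=lambda name: pvals[name])
        if pvals[worst] <= alpha:
            break  # every remaining predictor is significant
        current.remove(worst)
    return current

# p-values copied from the coefficient tables in the text (not recomputed).
REPORTED_PVALUES = {
    ("R&D Spend", "Administration", "Marketing Spend"): {
        "R&D Spend": 2.63497e-22,
        "Administration": 0.601755108,
        "Marketing Spend": 0.104716819,
    },
    ("R&D Spend", "Marketing Spend"): {
        "R&D Spend": 6.04043e-24,
        "Marketing Spend": 0.060030397,
    },
}

selected = backward_deletion(
    ["R&D Spend", "Administration", "Marketing Spend"],
    REPORTED_PVALUES.__getitem__,
)
print(selected)
```

Note that marketing spend is not significant in the full model (p = 0.105) but becomes significant once administration is removed, which is why only one predictor is dropped at a time.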

Forward selection approach

In the forward selection approach, we regressed Y (the dependent variable) on each independent variable separately and compared each p-value with α = 0.10. We got the following results.

Iteration I

                  Coefficients   Standard Error   t Stat     P-value
Intercept         76974.47       25320.18         3.040044   0.003823543
Administration    0.288749       0.203417         1.419493   0.162217395

Step 1. We regressed Y (profit) on administration and got a p-value of 0.162217, which is greater than α = 0.10, so administration is not significant, i.e. it does not explain a significant share of the variability in the data.

                  Coefficients   Standard Error   t Stat     P-value
Intercept         60003.55       7684.53          7.808356   4.29474E-10
Marketing Spend   0.246459       0.031587         7.802657   4.38107E-10

Step 2. We regressed Y (profit) on marketing spend and got a p-value of 4.38107E-10, which is less than α = 0.10, so marketing spend is significant, i.e. it explains a significant share of the variability in the data.

Step 3. We regressed Y (profit) on R and D spend and got a p-value of 3.5E-32, which is less than α = 0.10, so R and D spend is significant. It explains more of the variability in the data than marketing spend, because its p-value is lower than the p-values of marketing spend and administration.

Iteration II

Step 1. We regressed Y (profit) on administration and R and D spend. The p-value of R and D spend is 2.28E-31, which is less than α = 0.10, but the p-value of administration is 0.288893, which is greater than α = 0.10 and therefore not significant. Hence we dropped this model.

                  Coefficients   Standard Error   t Stat     P-value
Intercept         54886.62       6016.718         9.122352   5.7E-12
R&D Spend         0.862118       0.030156         28.58887   2.28E-31
Administration    -0.053         0.049405         -1.07268   0.288893

Step 2. We regressed Y (profit) on marketing spend and R and D spend. The p-value of R and D spend is 6.04E-24, which is less than α = 0.10, and the p-value of marketing spend is 0.06003, which is also less than α = 0.10. Hence we selected this as our final model.

R and D spend only (Iteration I, Step 3):

                  Coefficients   Standard Error   t Stat     P-value
Intercept         49032.9        2537.897         19.32029   2.78E-24
R&D Spend         0.854291       0.029306         29.15114   3.5E-32

Marketing spend and R and D spend (Iteration II, Step 2):

                  Coefficients   Standard Error   t Stat     P-value
Intercept         46975.86       2689.933         17.46358   3.5E-22
Marketing Spend   0.029908       0.01552          1.927052   0.06003
R&D Spend         0.796584       0.041348         19.26556   6.04E-24

Also, the correlation coefficient of 0.724 between the two predictors is not a serious concern, because the respective VIF is 3.6, which is less than 10 and hence acceptable.
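For a model with exactly two predictors, the VIF of either predictor depends only on their pairwise correlation r, through VIF = 1/(1 − r²). As a rough cross-check under that assumption, the sketch below evaluates it for r = 0.724; the function name is illustrative. It gives about 2.1, somewhat lower than the reported 3.6, which may have been computed differently; both values are well below 10.

```python
def vif_two_predictors(r):
    """VIF of either predictor in a two-predictor model with correlation r."""
    return 1.0 / (1.0 - r ** 2)

vif = vif_two_predictors(0.724)  # r between R and D spend and marketing spend
print(f"VIF = {vif:.3f}, below 10: {vif < 10}")
```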

Present your potentially good models.

We used two approaches in our model selection and achieved a similar outcome. The analysis of variance for the

model is shown below.

MODEL II

Regression Statistics
Multiple R           0.97491
R Square             0.95045
Adjusted R Square    0.948342
Standard Error       9160.966
Observations         50

ANOVA
             df    SS          MS          F          Significance F
Regression   2     7.57E+10    3.78E+10    450.7713   2.16E-31
Residual     47    3.94E+09    83923295
Total        49    7.96E+10

                  Coefficients   Standard Error   t Stat     P-value
Intercept         46975.86       2689.933         17.46358   3.5E-22
Marketing Spend   0.029908       0.01552          1.927052   0.06003
R&D Spend         0.796584       0.041348         19.26556   6.04E-24

Section 5: Model selection

For each model, verify model assumptions and check diagnostics.

Both selection procedures gave the same result, and the estimated best-fit equation for the model is

Ŷ (profit) = 46975 + 0.029908 (marketing spend) + 0.796584 (R and D spend)

Model Assumptions

The MLR model is reasonable

According to the above residual plots, there is no curvilinear trend and hence it can be concluded that the model

form is reasonable.

Residual have constant variance

Residuals are normally distributed

No outliers

To find X outliers we used the hat matrix: an observation with leverage hii > 2p/n = 2(3)/50 = 0.12 is considered an X outlier. In our case, observations 1, 7, 20 and 49, with leverage values 0.121, 0.17196, 0.18271 and 0.217 respectively, are X outliers. Observation 1 was also found to be a Y outlier.

To check their influence, we used the following measures:

1) DFFITS: for observations 1, 7 and 49 we got -10.5343, -1.86309 and 4.164383, which in absolute value exceed the cutoff 2√(p/n) = 0.489, and hence they are influential.

2) Cook's distance: for observations 1, 7, 20 and 49 we got 3.79, 2.66, 1.18 and 2.66, which are greater than F(0.5; 3, 47) = 0.8, and hence they are influential.

Modified Levene test for non-constant variance.

F – test:

H0: Variance is constant

H1: Variance is not constant

Decision rule: if p > α, fail to reject H0. α = 0.10

From the above table, p = 0.9566, which is greater than 0.10.

So we fail to reject H0 and conclude that the variance is constant.

H0: Normally distributed

H1: Normality is violated

Decision rule: if ρ̂ < c(α, n), reject H0. α = 0.10

From the critical values table c(α, n) = 0.964, and from the above table ρ̂ = 0.9646, which is slightly greater than the critical value.

So we fail to reject H0 and conclude that normality is not violated.

Predictors are not highly correlated

Variance Inflation Factor: VIF is a method of detecting multicollinearity among the predictors. It measures how much the variance of an estimated regression coefficient is inflated compared with when the predictors are not linearly related. A VIF of 10 or above indicates that a predictor is highly correlated with the others. For our model the calculated variance inflation factor is 3.6, which is below 10, so the assumption that the predictors are not highly correlated is not violated.

Fully discuss and justify your choice of the best overall model

In this study two models were considered. The first model includes all the variables; only two of its assumptions are satisfied, i.e. the model form is reasonable and the errors have constant variance, and its adjusted R² is 0.9475. The second model, selected by both forward selection and backward deletion, satisfies all the assumptions and has fewer outliers than the first model. Its adjusted R² is 0.948, which is slightly greater than that of the first model. Moreover, the first model contains one predictor that is insignificant according to its p-value, while the second contains none. To conclude, as per our understanding, the model obtained from the selection procedure is the best model we can choose.

Present and interpret the meaning of your final model.

After verifying the assumptions and performing the tests, we consider Model II our final model. We can predict profit from marketing spend and R and D spend; that is, the amount by which the profit of a start-up increases depends primarily on marketing spend and R and D spend.

Ŷ (profit) = 46975 + 0.029908 (marketing spend) Xi1 + 0.796584 (R and D spend) Xi2

Holding the other predictor fixed, a one-unit increase in marketing spend is associated with a 0.029908-unit increase in expected profit, and a one-unit increase in R and D spend with a 0.796584-unit increase. The variance inflation factor is 3.57, which is less than 10, so we conclude that there is no severe multicollinearity problem. From the table, the p-value for R and D spend is less than 0.05, so this predictor is significant. The p-value of marketing spend (0.06) is slightly greater than 0.05, so this predictor is only marginally significant; at α = 0.10, marketing spend is significant.

F test: to check whether the regression is significant

H0: β1 = β2 = 0

H1: not both β1 and β2 equal 0

Decision rule: if F* > F(1 − α; p − 1, n − p), reject H0. α = 0.10

F* = 450.77, F(0.90; 2, 47) ≈ 2.42

As 450.77 > 2.42, we reject H0 and conclude that not both β1 and β2 are zero. In other words, the regression is significant.

The coefficient of determination was found to be 0.95.
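For Model II the decision rule uses p − 1 = 2 numerator degrees of freedom and n − p = 47 denominator degrees of freedom. When the numerator degrees of freedom equal 2, the F quantile has a closed form, so the critical value can be checked without statistical tables. The sketch below relies on that special case only; the function name is illustrative.

```python
def f_quantile_numerator_df_2(prob, df2):
    """Quantile of the F(2, df2) distribution.

    Uses the closed form that holds only for numerator df = 2:
    P(F <= x) = 1 - (1 + 2x/df2) ** (-df2/2).
    """
    return (df2 / 2.0) * ((1.0 - prob) ** (-2.0 / df2) - 1.0)

f_crit = f_quantile_numerator_df_2(0.90, 47)  # alpha = 0.10
f_star = 450.7713                             # from the Model II ANOVA table
reject_h0 = f_star > f_crit

print(f"F(0.90; 2, 47) = {f_crit:.3f}, reject H0: {reject_h0}")
```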

Discuss the fit of the model and interpret inferences (explained variability, joint C.I. for the parameters; C.I., C.B., and P.I. calculated at one xh of interest).

Joint C.I. for the parameters

The Bonferroni joint confidence intervals can be used to estimate several regression coefficients simultaneously. If g parameters are to be estimated jointly, where g ≤ p, the confidence limits are

bk ± B s{bk}, where B = t(1 − α/(2g); n − p), g = 2, p = 3, n = 50

B = t(1 − 0.10/(2 × 2); 50 − 3) = t(0.975; 47) = 2.0117

From the Model II coefficient table, with β1 for R and D spend and β2 for marketing spend:

b1 = 0.7966, s{b1} = 0.041348

b2 = 0.0299, s{b2} = 0.01552

β1: 0.7966 ± 2.0117 × 0.041348 = (0.7134, 0.8798)

β2: 0.0299 ± 2.0117 × 0.01552 = (-0.0013, 0.0611)

The joint interval for β2 includes zero.
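The Bonferroni limits bk ± B s{bk} can be evaluated directly from the coefficient estimates and standard errors in the Model II coefficient table. The sketch below does this with B = 2.0117 as given in the text; the function name is illustrative, and the standard errors are the "Standard Error" column of that table.

```python
def bonferroni_intervals(estimates, std_errors, b_multiplier):
    """Joint confidence limits b_k +/- B * s{b_k} for each coefficient."""
    return [(b - b_multiplier * s, b + b_multiplier * s)
            for b, s in zip(estimates, std_errors)]

B = 2.0117                      # t(0.975; 47), value taken from the text
estimates = [0.796584, 0.029908]    # R and D spend, marketing spend
std_errors = [0.041348, 0.01552]    # from the Model II coefficient table

intervals = bonferroni_intervals(estimates, std_errors, B)
for name, (low, high) in zip(["R and D spend", "marketing spend"], intervals):
    print(f"{name}: ({low:.4f}, {high:.4f})")
```

With these inputs the interval for R and D spend lies entirely above zero, while the interval for marketing spend contains zero.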

Prediction interval with new Xh

Xh = [1 90000 150000]ᵀ (intercept, R and D spend, marketing spend)

h = Xhᵀ (XᵀX)⁻¹ Xh = 0.04709

s²{Ŷh} = MSE × h = 83923294.69 × 0.04709 = 3951948

s{pred} = √(MSE + s²{Ŷh}) = √(83923294.69 + 3951948) = 9374.18

Ŷh = 46975.86422 + 0.7965 × 90000 + 0.0299 × 150000 = 123145.864

Prediction interval:

Ŷh ± t(1 − α/2; n − p) × s{pred} = 123145.864 ± t(0.95; 47) × 9374.18, with t(0.95; 47) = 1.6779

Prediction interval = (107416.9, 138874.8)

Prediction interval for the mean of 3 new observations:

s{predmean} = √(MSE/3 + s²{Ŷh}) = √(27974431.56 + 3951948) = 5650.34

Ŷh ± t(1 − α/2; n − p) × s{predmean} = 123145.864 ± 1.6779 × 5650.34

PI = (113665.2, 132626.6)
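The prediction standard error combines the error variance with the variance of the fitted value, s²{Ŷh} = MSE × h, where h = Xhᵀ(XᵀX)⁻¹Xh. The following sketch evaluates both intervals under that formula, using MSE, h and Ŷh as quoted in the text and assuming t(0.95; 47) = 1.6779; the function name and the parameter `m` (number of new observations averaged) are illustrative.

```python
import math

def prediction_interval(y_hat, mse, h, t_crit, m=1):
    """Prediction limits for the mean of m new observations at X_h.

    Standard error: sqrt(MSE/m + MSE*h), where h = X_h' (X'X)^-1 X_h.
    """
    se = math.sqrt(mse / m + mse * h)
    return y_hat - t_crit * se, y_hat + t_crit * se

mse = 83923294.69    # Model II mean squared error
h = 0.04709          # X_h' (X'X)^-1 X_h, as reported in the text
y_hat = 123145.864   # fitted profit at X_h = [1, 90000, 150000]
t_crit = 1.6779      # assumed value of t(0.95; 47)

single = prediction_interval(y_hat, mse, h, t_crit)
mean_of_3 = prediction_interval(y_hat, mse, h, t_crit, m=3)
print(single, mean_of_3)
```

Averaging three new observations shrinks only the MSE/m term, so the interval narrows but not by a factor of √3.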

Section 6: Final discussion

Summary and Conclusions

To conclude, our project focused on predicting the profit of a start-up firm from its administration spend, research and development spend, and marketing spend. We found a dataset of 50 entries on Kaggle, covering firms in New York, California, and Florida. Using Python, we examined the relationship between each predictor variable and the dependent variable and ruled out the one with a low correlation coefficient; we also calculated the correlation coefficients between predictor variables and ruled out the one with the highest. Using the variables, we regressed the dependent variable on the predictors and then checked the assumptions that need to be satisfied, because the best model is the one that satisfies all of them. Many assumptions were not satisfied. We proceeded with partial regression plots to explore the relationships between variables, and then with a model search in which we used two methods to find the best model. Both methods yielded the same model with the same predictor variables. We then checked the assumptions again: most of them were satisfied in the second model, including the ones that were not satisfied in the first, and the number of outliers was also reduced. In sum, the model search methods and partial plots helped a lot in finding the best model.

Matrices used for the prediction interval:

XᵀX =
50             3686080.78     10551254.89
3686080.78     3.74988E+11    9.77065E+11
10551254.89    9.77065E+11    2.95937E+12

Xh = [1 90000 150000]ᵀ