Multiple Linear Regression
Report for Multiple Linear Regression Project / IE 5318 / Fall 2020
Due date: Dec 16th, 2020
Title
To study the expected profit of 50 start-up firms in the USA.
Dataset reference
https://www.kaggle.com/root64shivansh/profit-in-startup-of-a-company/notebooks?sortBy=dateRun&group=profile&pageSize=20&datasetId=457637
Team members
Shekhar Madhav Khairnar (1001822833)
Jayant Vithalrao Madan (1001814817)
Section 1: Background and data description
Describe the problem and the variables.
Multiple linear regression is a statistical technique that predicts the outcome of one variable
from the values of two or more other variables. It is sometimes known simply as multiple regression, and
it is an extension of simple linear regression. The variable we want to predict is the dependent (response)
variable, and the variables used to predict it are the independent, or explanatory, variables.
Our project focuses on the profit of a start-up. Earning
a profit matters because it determines whether a company can secure financing from a bank, attract
investors to fund its operations, and grow its business. Start-ups are full of promise and excitement, but
they are also full of risk and uncertainty.
The variables
The dataset consists of the following attributes:
• The response variable (Y): Profit
• Predictor variables (X): Research and Development (R&D) spend, Administration spend, and Marketing spend
Present and discuss the matrix scatter plot of the variables, and check the response-predictor
and predictor-predictor pairwise correlations.
Fig – heat map
Fig – scatter plot
Fig – correlation matrix
Response vs Predictor
(profit [Y] vs Research and development spend [X1])
The scatter plot shows a strong positive linear relationship between these two variables, with a steady
upward trend. Including R&D spend in the MLR model will help explain the variation in profit.
The correlation coefficient is 0.972.
(profit [Y] vs Marketing spend [X2])
The scatter plot shows a roughly linear pattern with an upward trend between these two variables.
The relationship is positive but weaker than the one for research and development spend, so we
describe it as moderate. The correlation coefficient is 0.747.
The plot also suggests possible outliers.
(profit [Y] vs Administration spend [X3])
The scatter plot is a random point cloud, so this variable may not be able to explain the
variability in profit to the extent that the other two variables do.
The correlation coefficient is 0.20. The scattered pattern also suggests possible outliers.
Predictor vs Predictor
(Research and development spend [X1] vs Marketing spend [X2])
Here, there is a positive upward trend, and the correlation coefficient is 0.724. These two predictors are
highly correlated.
(Research and development spend [X1] vs Administration spend [X3])
Here, the figure shows a random cloud with no linear trend. The correlation coefficient is 0.241, so
these two predictors are only weakly correlated.
(Marketing spend [X2] vs Administration spend [X3])
Here, the figure shows a random cloud with no linear trend. The correlation coefficient is -0.032, so
these two predictors are essentially uncorrelated (very slightly negative).
All predictor-predictor correlations are well below the strongest response-predictor correlation
(r = 0.97), and the largest of them is 0.724, so we may not have a serious multicollinearity problem.
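The pairwise correlations discussed above are ordinary sample Pearson coefficients. A minimal sketch of that calculation follows; the function name `pearson_r` and the five data points are illustrative assumptions, not rows from the start-up dataset.

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up illustrative values (assumption), NOT taken from the dataset.
rd_spend = [10.0, 20.0, 30.0, 40.0, 50.0]
profit = [15.0, 24.0, 37.0, 44.0, 58.0]

r = pearson_r(rd_spend, profit)
print(round(r, 3))
```

Applying the same function to every pair of columns would reproduce a correlation matrix like the one shown in the figure.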
Discuss potential complications
Since all predictor-predictor correlations are below 0.90, we do not expect a
serious multicollinearity problem in our model. The predictor-versus-predictor plots show
no curved pattern, so we do not expect a curvilinearity problem either.
Outliers are possible, and we examine them further with the
help of diagnostic plots.
Section 2: The multiple linear regression model
Fit a preliminary model.
The general multiple linear regression model is
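$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots + \beta_{p-1} X_{i,p-1} + \varepsilon_i$
where $Y_i$ is the dependent variable; $\beta_0, \beta_1, \dots, \beta_{p-1}$ are parameters; $X_{i1}, X_{i2}, \dots, X_{i,p-1}$ are known constants; and the $\varepsilon_i$ are independent $N(0, \sigma^2)$, for $i = 1, \dots, n$.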
MODEL I
Regression Statistics
Multiple R          0.975062046
R Square            0.950745994
Adjusted R Square   0.947533776
Standard Error      9232.334837
Observations        50
ANOVA
             df   SS            MS            F             Significance F
Regression   3    75683964196   25227988065   295.9780624   4.52851E-30
Residual     46   3920856301    85236006.54
Total        49   79604820497
                  Coefficients   Standard Error   t Stat         P-value       Lower 95%      Upper 95%     Lower 90.0%    Upper 90.0%
Intercept         50122.19299    6572.352622      7.626217867    1.05738E-09   36892.73332    63351.65266   39089.44482    61154.94116
R&D Spend         0.80571505     0.04514727       17.84637376    2.63497E-22   0.714838309    0.89659179    0.729928115    0.881501984
Administration    -0.026815968   0.05102878       -0.525506752   0.601755108   -0.129531575   0.075899638   -0.112475961   0.058844024
Marketing Spend   0.027228065    0.016451235      1.6550773      0.104716819   -0.005886553   0.060342682   -0.000387971   0.054844101
As per our model, the estimated best-fit equation is
(profit) Ŷi = 50122.19299 + 0.8057 (R&D spend)i - 0.0268 (Administration)i + 0.0272 (Marketing spend)i
According to this model, holding the other predictors fixed, a one-unit increase in R and D spend is associated with a 0.80571505-unit increase in expected profit,
a one-unit increase in administration spend with a 0.026815968-unit decrease, and a one-unit increase in marketing spend with a 0.027228065-unit increase.
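To make the interpretation of the coefficients concrete, the sketch below evaluates the fitted Model I equation. The coefficients are copied from the Model I output; the function name and the spend figures for the example firm are hypothetical.

```python
# Coefficients copied from the Model I regression output in this report.
INTERCEPT = 50122.19299
B_RD = 0.80571505
B_ADMIN = -0.026815968
B_MARKETING = 0.027228065

def predict_profit_model_1(rd_spend, admin_spend, marketing_spend):
    """Fitted profit under Model I for the given spend levels."""
    return (INTERCEPT
            + B_RD * rd_spend
            + B_ADMIN * admin_spend
            + B_MARKETING * marketing_spend)

# Hypothetical firm (assumption): these spend levels are for illustration only.
base = predict_profit_model_1(100000.0, 120000.0, 200000.0)

# Raising R&D spend by one unit, other predictors fixed, changes the
# fitted profit by the R&D coefficient.
bumped = predict_profit_model_1(100001.0, 120000.0, 200000.0)
print(round(base, 2), round(bumped - base, 4))
```

The difference between the two predictions equals the R and D coefficient, which is the "holding the other predictors fixed" reading of a slope.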
Model assumptions
The MLR model is reasonable
Residual have constant variance
The plot of residuals vs Ŷ helps assess whether the error variance is constant. At a glance, the graph
suggests possible non-constant variance; to check, we performed a Modified Levene test for
non-constant variance.
Residuals are normally distributed
The normal probability plot helps assess whether the residuals are normally distributed. In this figure the plot has shorter
tails than a normal distribution, so we conclude that the model assumption of normality is not satisfied.
To confirm, we used software to calculate the Pearson correlation coefficient between the ordered residuals and their
expected values under normality, ρ̂ = 0.9646, and used a hypothesis test to reach a conclusion.
To check the model form, a residual plot against each predictor variable is needed to check for curvature.
According to the above residual plots there is no curvilinear trend, so we conclude that the model form is reasonable.
F – test:
H0: Variance is constant
H1: Variance is not constant
Decision rule: if p > α, fail to reject H0. α = 0.10
From the above table, p = 0.99, which is greater than 0.10.
So we fail to reject H0. Failing to reject is a weak conclusion.
We conclude that the error variance is constant.
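For reference, the Modified Levene (Brown-Forsythe) statistic can be sketched as follows. This assumes the usual two-group form, where residuals are split into two groups (for example by low and high fitted values) and a two-sample t statistic is computed on the absolute deviations from each group median. The function name and the toy residuals are assumptions; the report's own p-values came from its software output.

```python
import math
from statistics import mean, median

def modified_levene_t(group1, group2):
    """Two-sample t statistic on absolute deviations from each group's median."""
    med1, med2 = median(group1), median(group2)
    d1 = [abs(e - med1) for e in group1]
    d2 = [abs(e - med2) for e in group2]
    n1, n2 = len(d1), len(d2)
    m1, m2 = mean(d1), mean(d2)
    # Pooled variance of the absolute deviations.
    pooled_var = (sum((d - m1) ** 2 for d in d1)
                  + sum((d - m2) ** 2 for d in d2)) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# Toy residuals (assumption), split into a low-fitted and a high-fitted group.
low_group = [1.0, 2.0, 3.0, 4.0, 5.0]
high_group = [2.0, 4.0, 6.0, 8.0, 10.0]

t_levene = modified_levene_t(low_group, high_group)
print(round(t_levene, 4))
```

The statistic is compared with a t distribution on n - 2 degrees of freedom; a large p-value means constant variance is not rejected.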
Residuals are uncorrelated
This check applies mainly to time-series models, where one observation depends on another. Our observations are 50 separate firms, so we do not test it.
No outliers
To find X outliers we used the hat matrix: an observation with leverage $h_{ii} > 2p/n = 2(4)/50 = 0.16$ is considered an X outlier. In our case,
observations 7, 20, 47 and 49, with leverage values 0.17196, 0.18271, 0.2406 and 0.20802 respectively, are X outliers. We
did not find any Y outliers.
To check their influence, we used two measures:
1) DFFITS: for observations 7, 20, 47 and 49 we got -2.55, -2.92, 6.37 and -1.37, whose absolute values exceed $2\sqrt{p/n} = 0.565$, so they are influential.
2) Cook's distance: for observations 7, 20, 47 and 49 we got 8.51, 1.57, 6.56 and 3.87, which are greater than F(0.5, 4, 46) =
0.8516, so they are influential.
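The two cutoffs used above are simple functions of p and n. The sketch below recomputes them for Model I (p = 4, n = 50) and flags the reported observations; the leverage and DFFITS values are copied from the text, while the function and variable names are illustrative.

```python
import math

def leverage_cutoff(p, n):
    """Rule-of-thumb leverage threshold 2p/n."""
    return 2 * p / n

def dffits_cutoff(p, n):
    """Rule-of-thumb |DFFITS| threshold 2*sqrt(p/n)."""
    return 2 * math.sqrt(p / n)

p, n = 4, 50  # Model I: intercept plus three predictors, 50 firms

# Values reported in the text for observations 7, 20, 47 and 49.
leverages = {7: 0.17196, 20: 0.18271, 47: 0.2406, 49: 0.20802}
dffits = {7: -2.55, 20: -2.92, 47: 6.37, 49: -1.37}

x_outliers = [obs for obs, h in leverages.items() if h > leverage_cutoff(p, n)]
influential = [obs for obs, d in dffits.items() if abs(d) > dffits_cutoff(p, n)]
print(x_outliers, influential)
```

Note that the DFFITS rule is applied to the absolute value, which is why the negative values are also flagged.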
Predictors are not highly correlated
Variance Inflation Factor: the VIF is a measure for detecting multicollinearity among the predictors. It measures how much the variance of an estimated regression coefficient is inflated compared with when the predictors are not linearly
related. A VIF of 10 or above indicates that a predictor is highly correlated with the others. For our model the calculated variance inflation
factors are 1.73, 1.23 and 1.54, all below 10, so the assumption that the predictors are not highly correlated is
valid.
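As a reminder of where the threshold of 10 comes from, the VIF of predictor k is 1 / (1 - R_k²), where R_k² is the coefficient of determination from regressing that predictor on the other predictors. The sketch below shows only this relationship; the function names are assumptions, and the report's VIF values were taken from software rather than computed this way here.

```python
def vif_from_r2(r2_k):
    """VIF of a predictor given R^2 from regressing it on the other predictors."""
    return 1.0 / (1.0 - r2_k)

def r2_from_vif(vif):
    """Inverse relationship: the R^2 implied by a given VIF."""
    return 1.0 - 1.0 / vif

# A VIF of 10 corresponds to 90% of a predictor's variance being
# explained by the other predictors.
print(round(vif_from_r2(0.90), 2))

# R^2 values implied by the VIFs reported for Model I.
implied = [round(r2_from_vif(v), 3) for v in (1.73, 1.23, 1.54)]
print(implied)
```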
Present your preliminary model that satisfies the model assumptions.
Our preliminary model does not satisfy all of the assumptions, but the model form is reasonable and the error has
constant variance. In the test of significance, F* = 295.97 is greater than F(0.90, 3, 46) = 2.2068, so
we reject H0 and conclude that our model is significant at the 10% level. Moreover, 95.07% of the variability
in profit is explained by the model. Because R² never decreases as predictors are added, we also report the
adjusted R², $R_a^2 = 1 - \frac{n-1}{n-p}\,\frac{SSE}{SSTO}$, which penalizes for the number of parameters; in our case it is 94.75%.
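The summary statistics quoted for Model I can be recomputed from the sums of squares in its ANOVA table. The sketch below does so; the sums of squares are copied from the table, and the variable names are illustrative.

```python
# Sums of squares from the Model I ANOVA table.
n, p = 50, 4                 # observations; parameters including the intercept
ssr = 75683964196.0          # regression sum of squares
sse = 3920856301.0           # residual (error) sum of squares
ssto = 79604820497.0         # total sum of squares

r_squared = ssr / ssto
# Adjusted R^2 rescales SSE and SSTO by their degrees of freedom.
adj_r_squared = 1 - (n - 1) / (n - p) * (sse / ssto)
# Overall F statistic: MSR / MSE.
f_star = (ssr / (p - 1)) / (sse / (n - p))

print(round(r_squared, 6), round(adj_r_squared, 6), round(f_star, 3))
```

These reproduce the R Square, Adjusted R Square and F entries of the Model I output to the precision shown.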
Section 3: Explore the interaction terms
Partial regression plots. And Discuss the addition of possibly useful interaction terms.
Partial regression plots are related to, but distinct from, partial residual plots. Partial regression plots are most commonly
used to identify leverage points and influential data points that might not be leverage points. Partial residual plots are
most commonly used to identify the nature of the relationship between Y and Xi. In the partial regression plot, the x-axis is
not Xi, which limits its usefulness in determining the need for a transformation.
Normality test for Model I:
H0: Normally distributed
H1: Normality is violated
Decision rule: if ρ̂ < c(α, n), reject H0. α = 0.10
From the critical values table, c(α, n) = 0.981, and from the above table ρ̂ = 0.9646, which is less than 0.981.
So we reject H0. It is a strong conclusion.
We conclude that normality is violated.
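The correlation test for normality can be sketched as follows. The sketch assumes the common approximation in which the k-th smallest residual is paired with the standard normal quantile at (k - 0.375) / (n + 0.25); multiplying those quantiles by √MSE would not change the correlation, so the scale factor is omitted. The function names and toy residuals are assumptions; the critical value must still come from a table.

```python
import math
from statistics import NormalDist

def normal_plot_correlation(residuals):
    """Correlation between ordered residuals and approximate expected normal scores."""
    n = len(residuals)
    ordered = sorted(residuals)
    scores = [NormalDist().inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]
    mean_r = sum(ordered) / n
    mean_s = sum(scores) / n
    srs = sum((r - mean_r) * (s - mean_s) for r, s in zip(ordered, scores))
    srr = sum((r - mean_r) ** 2 for r in ordered)
    sss = sum((s - mean_s) ** 2 for s in scores)
    return srs / math.sqrt(srr * sss)

def reject_normality(rho_hat, critical_value):
    """Decision rule from the text: reject H0 when rho_hat < c(alpha, n)."""
    return rho_hat < critical_value

# Toy residuals (assumption) to exercise the function.
rho_toy = normal_plot_correlation([-2.0, -1.0, 0.0, 1.0, 2.0])
print(round(rho_toy, 3))

# Decision with the values quoted for Model I.
print(reject_normality(0.9646, 0.981))
```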
Section 4: Model search
Obtain a set of two potentially good models (backward deletion & stepwise regression):
To narrow our search for a good regression model we used two approaches: backward
deletion and stepwise regression. We also assumed that multicollinearity is not a serious problem.
Backward deletion
We began with the full set of predictor variables, took α = 0.10, and regressed Y on the full set. The resulting coefficient table is shown below.
Coefficients Standard Error t Stat P-value
Intercept 50122.19299 6572.352622 7.626217867 1.05738E-09
R&D Spend 0.80571505 0.04514727 17.84637376 2.63497E-22
Administration -0.026815968 0.05102878 -0.525506752 0.601755108
Marketing Spend 0.027228065 0.016451235 1.6550773 0.104716819
In this table the largest p-value is 0.6018 (Administration), which exceeds α = 0.10, so we remove the Administration predictor variable. After
removing it and refitting, we got the following results.
Partial regression plot observations:
According to the graph there is some kind of linear trend, so the interaction of X3 (marketing spend) and X2 (R and D spend) needs to be added.
The points have no trend (random cloud), so we conclude not to add the interaction of X1 (administration) and X3 (marketing spend), because it will not contribute to our model.
The points have no trend (random cloud), so we conclude not to add the interaction of X1 (administration) and X2 (R and D spend), because it will not contribute to our model.
Coefficients Standard Error t Stat P-value
Intercept 46975.86422 2689.932925 17.46358201 3.50406E-22
R&D Spend 0.796584044 0.041347578 19.26555518 6.04043E-24
Marketing Spend 0.029907875 0.015520012 1.927052287 0.060030397
At this point the p-values of all remaining predictor variables are less than 0.10, so we take this as a potentially good model.
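The decision made at each backward deletion step can be written as a small rule: drop the predictor with the largest p-value if that p-value exceeds α, otherwise stop. The sketch below applies the rule to the p-values reported in the two coefficient tables; it does not refit any regression, and the function name is hypothetical.

```python
def next_to_drop(p_values, alpha=0.10):
    """Return the predictor with the largest p-value if it exceeds alpha, else None."""
    name, worst = max(p_values.items(), key=lambda item: item[1])
    return name if worst > alpha else None

# P-values copied from the full-model and reduced-model tables.
full_model = {
    "R&D Spend": 2.63497e-22,
    "Administration": 0.601755108,
    "Marketing Spend": 0.104716819,
}
reduced_model = {
    "R&D Spend": 6.04043e-24,
    "Marketing Spend": 0.060030397,
}

print(next_to_drop(full_model))     # Administration
print(next_to_drop(reduced_model))  # None
```

In a full implementation the model would be refitted after each removal, since the remaining p-values change, as they do here for marketing spend.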
Forward selection approach
In the forward selection approach, we first regressed Y (the dependent variable) on each independent variable separately,
compared each p-value with α = 0.10, and got the following results.
Iteration I
Coefficients Standard Error t Stat P-value
Intercept 76974.47 25320.18 3.040044 0.003823543
Administration 0.288749 0.203417 1.419493 0.162217395
Step 1. We regressed Y (profit) on administration and got a p-value of 0.162217, which is greater than α = 0.10, so administration is not significant; on its own it explains little of the variability in the data.
Coefficients Standard Error t Stat P-value
Intercept 60003.55 7684.53 7.808356 4.29474E-10
Marketing Spend 0.246459 0.031587 7.802657 4.38107E-10
Step 2. We regressed Y (profit) on marketing spend and got a p-value of 4.38107E-10, which is less than α = 0.10, so marketing spend is significant; it explains a meaningful share of the variability in the data.
Step 3. We regressed Y (profit) on R and D spend and got a p-value of 3.5E-32, which is less than α = 0.10, so R and D spend is significant. It explains more of the variability in the data than marketing spend does, because its p-value is the lowest of the three single-predictor models.
Iteration II
Step 1. We regressed Y (profit) on administration and R and D spend. The p-value for R and D spend is 2.28E-31,
which is less than α = 0.10, but the p-value for administration is 0.288893, which is greater than α = 0.10 and therefore not
significant. Hence we dropped this model.
Coefficients Standard Error t Stat P-value
Intercept 54886.62 6016.718 9.122352 5.7E-12
R&D Spend 0.862118 0.030156 28.58887 2.28E-31
Administration -0.053 0.049405 -1.07268 0.288893
Step 2. We regressed Y (profit) on marketing spend and R and D spend. The p-value for R and D spend is 6.04E-24
and the p-value for marketing spend is 0.06003, both less than α = 0.10. Hence we selected
this as our final model.
Y on R and D spend only (Iteration I, Step 3):
Coefficients Standard Error t Stat P-value
Intercept 49032.9 2537.897 19.32029 2.78E-24
R&D Spend 0.854291 0.029306 29.15114 3.5E-32
Y on marketing spend and R and D spend (Iteration II, Step 2):
Coefficients Standard Error t Stat P-value
Intercept 46975.86 2689.933 17.46358 3.5E-22
Marketing Spend 0.029908 0.01552 1.927052 0.06003
R&D Spend 0.796584 0.041348 19.26556 6.04E-24
Also, the correlation coefficient of 0.724 between R and D spend and marketing spend is not a serious concern, because the corresponding VIF is 3.6, which is less
than 10 and hence acceptable.
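The forward selection decisions can be summarized with the mirror-image rule: among the candidates not yet in the model, add the one with the smallest p-value if it is below α. The sketch below replays the iterations using only p-values reported in this section and in the Model I table; it does not refit any regression, and the function name is hypothetical.

```python
def next_to_add(candidate_p_values, alpha=0.10):
    """Return the candidate with the smallest p-value if it is below alpha, else None."""
    name, best = min(candidate_p_values.items(), key=lambda item: item[1])
    return name if best < alpha else None

# Iteration I: each predictor alone.
iteration_1 = {
    "Administration": 0.162217395,
    "Marketing Spend": 4.38107e-10,
    "R&D Spend": 3.5e-32,
}
# Iteration II: each remaining predictor added to R&D Spend.
iteration_2 = {"Administration": 0.288893, "Marketing Spend": 0.06003}
# A further step: Administration added to R&D Spend and Marketing Spend
# (its p-value in the full Model I table).
iteration_3 = {"Administration": 0.601755108}

print(next_to_add(iteration_1))  # R&D Spend
print(next_to_add(iteration_2))  # Marketing Spend
print(next_to_add(iteration_3))  # None
```

The procedure stops with R and D spend and marketing spend in the model, which matches the result of backward deletion.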
Present your potentially good models.
We used two approaches in our model selection and both led to the same model. The regression output for this
model is shown below.
MODEL II
Regression Statistics
Multiple R          0.97491
R Square            0.95045
Adjusted R Square   0.948342
Standard Error      9160.966
Observations        50
ANOVA
             df   SS         MS         F          Significance F
Regression   2    7.57E+10   3.78E+10   450.7713   2.16E-31
Residual     47   3.94E+09   83923295
Total        49   7.96E+10
                  Coefficients   Standard Error   t Stat     P-value
Intercept         46975.86       2689.933         17.46358   3.5E-22
Marketing Spend   0.029908       0.01552          1.927052   0.06003
R&D Spend         0.796584       0.041348         19.26556   6.04E-24
Section 5: Model selection
For each model, verify model assumptions and check diagnostics.
Both selection procedures produced the same model, whose estimated best-fit equation is
(profit) Ŷi = 46975.86 + 0.029908 Xi1 + 0.796584 Xi2, where Xi1 is marketing spend and Xi2 is R and D spend.
Model Assumptions
The MLR model is reasonable
According to the above residual plots, there is no curvilinear trend and hence it can be concluded that the model
form is reasonable.
Residual have constant variance
Residuals are normally distributed
No outliers
To find X outliers we used the hat matrix: an observation with leverage $h_{ii} > 2p/n = 2(3)/50 = 0.12$ is considered an X outlier. In our case,
observations 1, 7, 20 and 49, with leverage values 0.121, 0.17196, 0.18271 and 0.217 respectively, are X outliers.
Observation 1 was also found to be a Y outlier.
To check their influence, we used two measures:
1) DFFITS: for observations 1, 7 and 49 we got -10.5343, -1.86309 and 4.164383, whose absolute values exceed $2\sqrt{p/n} = 0.489$, so they are influential.
2) Cook's distance: for observations 1, 7, 20 and 49 we got 3.79, 2.66, 1.18 and 2.66, which are greater than F(0.5, 3, 47) =
0.8, so they are influential.
Modified Levene test for non-constant variance.
F – test:
H0: Variance is constant
H1: Variance is not constant
Decision rule: if p > α, fail to reject H0. α = 0.10
From the above table, p = 0.9566, which is greater than 0.10.
So we fail to reject H0. Failing to reject is a weak conclusion. We
conclude that the variance is constant.
H0: Normally distributed
H1: Normality is violated
Decision rule: if ρ̂ < c(α, n), reject H0. α = 0.10
From the critical values table, c(α, n) = 0.964, and
from the above table ρ̂ = 0.9646, which is marginally greater.
So we fail to reject H0. Failing to reject is a weak conclusion. We
conclude that normality is not violated.
Predictors are not highly correlated
Variance Inflation Factor: the VIF is a measure for detecting multicollinearity among the predictors. It measures how much the variance of an estimated regression coefficient is inflated compared with when the predictors are not linearly
related. A VIF of 10 or above indicates that a predictor is highly correlated with the others. For our model the calculated inflation
factor is 3.6, which is below 10, so the assumption that the predictors are not highly correlated is not violated.
Fully discuss and justify your choice of the best overall model
In this study two models were considered. The first model includes all three predictors; only
two of its assumptions are satisfied, namely that the model form is reasonable and that the errors have constant variance, and its adjusted R² is
0.9475. The second model, chosen by both forward selection and backward deletion, satisfies all the
assumptions, and we found fewer outliers than in the first model. Its adjusted
R² is 0.9483, slightly higher than that of the first model. Moreover, the first model contains
an insignificant predictor according to its p-value (Administration, p = 0.60), whereas the second contains none. We therefore conclude that
the model obtained from the selection procedures is the best model to choose.
Present and interpret the meaning of your final model.
After verifying the assumptions and performing the tests, we take Model II as our final model. We can predict
profit from marketing spend and research and development spend; that is, the amount by which the profit of a
start-up increases depends primarily on these two predictors.
(profit) Ŷi = 46975.86 + 0.029908 (marketing spend) Xi1 + 0.796584 (R and D spend) Xi2
Holding the other predictor fixed, a one-unit increase in marketing spend is associated with a 0.029908-unit increase in expected profit, and a one-unit increase
in R and D spend with a 0.796584-unit increase. The variance inflation factor of 3.57 is less than 10, so we conclude that there is no severe
multicollinearity problem. From the table, the p-value for R and D spend is less than 0.05, so
this predictor is significant. The p-value of marketing spend (0.06) is slightly greater than 0.05, so this predictor is only marginally
significant at that level, but at α = 0.10 marketing spend is significant.
F test: To check whether the regression is significant or not
H0: β1 = β2 = 0
H1: not both β1 and β2 equal 0
Decision rule: if F* > F(1-α; p-1, n-p), reject H0. α = 0.10
F* = 450.77, F(0.90; 2, 47) = 2.419
As 450.77 > 2.419, we reject H0 and conclude that β1 and β2 are not both zero. In other words, the regression is significant.
The coefficient of determination was found to be 0.95.
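The overall F test for the final model can be checked from its R² and degrees of freedom. With p - 1 = 2 numerator degrees of freedom, the F quantile has a closed form, because the CDF of F(2, m) is 1 - (1 + 2x/m)^(-m/2). The sketch below uses that identity with α = 0.10; the function names are assumptions, and R² is the rounded value from the Model II output.

```python
def f_statistic_from_r2(r2, n, p):
    """Overall F statistic expressed through R^2: (R^2/(p-1)) / ((1-R^2)/(n-p))."""
    return (r2 / (p - 1)) / ((1 - r2) / (n - p))

def f_quantile_two_numerator_df(prob, df2):
    """Quantile of F(2, df2), obtained by inverting its closed-form CDF."""
    return (df2 / 2) * ((1 - prob) ** (-2 / df2) - 1)

n, p = 50, 3          # Model II: intercept plus two predictors
r2 = 0.95045          # R Square from the Model II output (rounded)

f_star = f_statistic_from_r2(r2, n, p)
f_crit = f_quantile_two_numerator_df(0.90, n - p)

print(round(f_star, 1), round(f_crit, 3), f_star > f_crit)
```

The statistic exceeds the critical value by a wide margin, so the regression is significant at the 10% level.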
Discuss the fit of the model and interpret inferences (explained variability, joint C.I. for the
parameters; C.I., C.B., and P.I. calculated at one xh of interest).
Joint C.I. for the parameters
The Bonferroni joint confidence intervals can be used to estimate several regression coefficients simultaneously. If g
parameters are to be estimated jointly, where g ≤ p, the confidence limits are
bk ± B s{bk}, where B = t(1 – α/(2g); n – p), g = 2, p = 3, n = 50, α = 0.10
B = t(1 – 0.10/(2*2); 50 – 3) = t(0.975; 47) = 2.0117
b1 = 0.7966 (R and D spend), b2 = 0.0299 (marketing spend)
The standard errors s{bk} are taken from the Model II coefficient table:
s{b1} = 0.041348
s{b2} = 0.01552
β1: b1 ± B s{b1} = 0.7966 ± 2.0117*0.041348
β1: (0.7134, 0.8798)
β2: b2 ± B s{b2} = 0.0299 ± 2.0117*0.01552
β2: (-0.0013, 0.0611)
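As a check on the Bonferroni limits, the sketch below computes bk ± B s{bk} using the estimates and standard errors listed in the Model II coefficient table, with B = t(0.975; 47) = 2.0117 as quoted above. Taking s{bk} from the "Standard Error" column is the assumption here; the variable names are illustrative.

```python
B = 2.0117  # t(0.975; 47), as quoted in the report

# (estimate, standard error) pairs from the Model II coefficient table.
estimates = {
    "R&D Spend": (0.796584, 0.041348),
    "Marketing Spend": (0.029908, 0.01552),
}

# Joint 90% Bonferroni interval for each slope: b_k +/- B * s{b_k}.
intervals = {
    name: (b - B * se, b + B * se)
    for name, (b, se) in estimates.items()
}

for name, (lower, upper) in intervals.items():
    print(name, round(lower, 4), round(upper, 4))
```

With these standard errors the joint interval for R and D spend excludes zero, while the one for marketing spend narrowly includes it.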
Prediction interval with new Xh
Xh = [1 90000 150000]ᵀ (R and D spend = 90000, marketing spend = 150000)
Xhᵀ (XᵀX)⁻¹ Xh = 0.04709
s²{Ŷh} = MSE * Xhᵀ (XᵀX)⁻¹ Xh = 83923294.69 * 0.04709 = 3951947.95
s{pred} = √(MSE + s²{Ŷh}) = √(83923294.69 + 3951947.95) = 9374.18
Ŷh = 46975.86422 + 0.7965*90000 + 0.0299*150000 = 123145.864
Prediction interval (α = 0.10):
Ŷh ± t(1 – α/2; n – p) * s{pred} = 123145.864 ± t(0.95; 47) * 9374.18, with t(0.95; 47) = 1.6779
Prediction interval = (107416.93, 138874.80)
Prediction interval for the mean of 3 new observations:
s{predmean} = √(MSE/3 + s²{Ŷh}) = √(27974431.56 + 3951947.95) = 5650.34
Ŷh ± t(1 – α/2; n – p) * s{predmean} = 123145.864 ± t(0.95; 47) * 5650.34
PI = (113665.15, 132626.57)
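The prediction limits depend on how the estimated variance of the fitted value enters the formula. The sketch below uses the standard form s²{pred} = MSE/m + MSE · Xhᵀ(XᵀX)⁻¹Xh, where m is the number of new observations being averaged. MSE, the quadratic form 0.04709 and the fitted value are taken from this section; t(0.95; 47) = 1.6779 is an assumed table value, and the function name is hypothetical.

```python
import math

mse = 83923294.69      # residual mean square of Model II
h_new = 0.04709        # Xh' (X'X)^-1 Xh, as reported in the text
t_crit = 1.6779        # assumed table value for t(0.95; 47)

# Fitted value at R&D spend = 90000 and marketing spend = 150000,
# using the rounded coefficients from the text.
y_hat = 46975.86422 + 0.7965 * 90000 + 0.0299 * 150000

def prediction_interval(y_hat, mse, h_new, t_crit, m=1):
    """Limits for the mean of m new observations at Xh, plus the standard error used."""
    s_pred = math.sqrt(mse / m + mse * h_new)
    half_width = t_crit * s_pred
    return y_hat - half_width, y_hat + half_width, s_pred

single = prediction_interval(y_hat, mse, h_new, t_crit)
mean_of_3 = prediction_interval(y_hat, mse, h_new, t_crit, m=3)

print([round(v, 2) for v in single])
print([round(v, 2) for v in mean_of_3])
```

Averaging three new observations shrinks only the MSE/m term, so the interval narrows but not by a factor of √3.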
Section 6: Final discussion
Summary and Conclusions
To conclude, our project focused on predicting the profit of a start-up firm from its administration spend, research and development spend and marketing spend. The data, 50 entries found online at Kaggle, come from start-ups in New York, California, and Florida. Using Python, we examined the relation between each predictor variable and the dependent variable and ruled out the one with a low coefficient of correlation; we also calculated the coefficients of correlation between the predictor variables and ruled out the one with the highest. Using these variables, we regressed the dependent variable and then checked the assumptions that need to be satisfied, because the best model is the one for which all assumptions hold. Many assumptions were not satisfied. We proceeded with partial regression plots and tried to extract the relations between variables. Then we carried out the model search, using two methods to look for the best model. Both methods yielded the same model with the same predictor variables. We then checked the assumptions again: most of them were satisfied in the second model, including the ones that were not satisfied in the first model, and the number of outliers was also reduced. In sum, the model search methods and partial plots helped a great deal in finding the best model.
XᵀX matrix and Xh vector used for the prediction interval:
XᵀX =
50 3686080.78 10551254.89
3686080.78 3.74988E+11 9.77065E+11
10551254.89 9.77065E+11 2.95937E+12
Xh = [1 90000 150000]ᵀ
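The quadratic form Xhᵀ(XᵀX)⁻¹Xh used in the prediction interval can be recomputed from the matrix and vector above. The sketch below solves (XᵀX) z = Xh by plain Gaussian elimination, which is adequate for a small symmetric positive definite matrix, and then takes the dot product Xhᵀz. The helper name `solve_spd` is hypothetical, and the matrix entries are the rounded values printed in the report, so the result is only approximate.

```python
def solve_spd(a, b):
    """Solve a x = b by Gaussian elimination without pivoting.

    Intended only for a small symmetric positive definite matrix.
    """
    n = len(b)
    a = [row[:] for row in a]   # work on copies
    b = b[:]
    # Forward elimination.
    for i in range(n):
        for j in range(i + 1, n):
            factor = a[j][i] / a[i][i]
            for k in range(i, n):
                a[j][k] -= factor * a[i][k]
            b[j] -= factor * b[i]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x

# X'X and Xh as printed in the report (columns: intercept, R&D, marketing).
xtx = [
    [50.0, 3686080.78, 10551254.89],
    [3686080.78, 3.74988e11, 9.77065e11],
    [10551254.89, 9.77065e11, 2.95937e12],
]
xh = [1.0, 90000.0, 150000.0]

z = solve_spd(xtx, xh)
h_new = sum(a * b for a, b in zip(xh, z))
print(round(h_new, 4))
```

The result agrees with the value 0.04709 reported in the prediction interval calculation, up to the rounding of the printed matrix entries.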