stats
4
Xu Zhang
The relationship between moving labor hours and home size and elevator equipment
The owner of a moving company is making changes. He or she would like to use a data-driven method to estimate and predict the total number of labor hours required to complete an upcoming move, instead of directly asking the most experienced managers in the company for a rough number. Therefore, the company collected data on the number of labor hours, the number of cubic feet moved, and whether there is an elevator in the apartment building. There are 36 observations in total. The research questions are whether we can use the latter two variables as predictors of the number of labor hours in a linear regression, how good the estimate is, and whether we can predict future labor hours with this model.
Suppose we identify $Y$ as the dependent variable, the number of labor hours for a move; $X_1$ as the first independent variable, the number of cubic feet moved, which is a continuous variable; and $X_2$ as the second independent variable, whether there is an elevator in the apartment building, which is a binary variable. We code $X_2 = 1$ if there is an elevator in the building and $X_2 = 0$ if not. As a consequence, the multiple regression model can be written as
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon$$
My hypotheses for this model are as follows.
1. The linear regression itself is statistically significant.
2. $\beta_1$, the coefficient on cubic feet moved, is statistically significant and positive. It means that the number of cubic feet moved is positively correlated with the number of labor hours for moving.
3. $\beta_2$, the coefficient on the elevator indicator, is statistically significant and negative. It means that the presence of an elevator in the building reduces the number of labor hours for moving.
With these hypotheses in mind, we run the multiple linear regression. The regression output tells us the R squared value is 90%, which means that the regression model explains 90% of the variation in the data. It is a fairly high percentage. From the ANOVA output, the p-value for the regression is close to 0, meaning that the multiple linear regression model is statistically significant. Finally, we have the estimated coefficient for each independent variable as well as the intercept. The intercept turns out to be non-significant, while the coefficients for both independent variables are statistically significant when comparing the p-values with the 0.05 level of significance. So all three hypotheses listed above are supported by the data. With $X_1$ the cubic feet moved and $X_2$ the elevator indicator, the final multiple linear regression model can be written as
$$\hat{Y} = 2.45 + 0.05 X_1 - 4.53 X_2$$
| Regression Statistics | |
|---|---|
| Multiple R | 0.95 |
| R Square | 0.90 |
| Adjusted R Square | 0.90 |
| Standard Error | 4.78 |
| Observations | 36.00 |

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 2.00 | 7016.66 | 3508.33 | 153.39 | 0.00 |
| Residual | 33.00 | 754.78 | 22.87 | | |
| Total | 35.00 | 7771.44 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 2.45 | 2.98 | 0.82 | 0.42 | -3.62 | 8.52 |
| Feet | 0.05 | 0.00 | 16.01 | 0.00 | 0.04 | 0.05 |
| Elevator(Yes) | -4.53 | 2.10 | -2.15 | 0.04 | -8.81 | -0.25 |
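The summary statistics in this output can be reproduced from the ANOVA sums of squares alone. The sketch below is an illustration, not part of the original analysis; the function name `anova_summary` is hypothetical, and the inputs are the rounded values printed in the ANOVA table, so results agree only to about two decimals.

```python
def anova_summary(ss_regression, ss_residual, n_obs, n_predictors):
    """Recompute R squared, adjusted R squared, F and the standard error
    for a regression WITH an intercept from its ANOVA sums of squares."""
    ss_total = ss_regression + ss_residual
    df_residual = n_obs - n_predictors - 1
    r_squared = ss_regression / ss_total
    adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_residual
    ms_regression = ss_regression / n_predictors
    ms_residual = ss_residual / df_residual
    return {
        "r_squared": r_squared,
        "adj_r_squared": adj_r_squared,
        "f_stat": ms_regression / ms_residual,
        "std_error": ms_residual ** 0.5,
    }

# Values copied from the ANOVA table for the moving data (36 moves, 2 predictors).
moving = anova_summary(ss_regression=7016.66, ss_residual=754.78,
                       n_obs=36, n_predictors=2)
for name, value in moving.items():
    print(f"{name}: {value:.2f}")
```

The printed values should line up with the R Square, Adjusted R Square, F and Standard Error entries reported above.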
To check the model validity, we output the following three figures. The first one shows the residuals versus the fitted values. The residuals appear to scatter around 0, so we can roughly say that the equal variance assumption is satisfied. The second figure is the Q-Q plot of the estimated residuals. The residual dots generally follow the straight line, which indicates that the normality assumption is satisfied. The third figure indicates whether there are outliers or influential points. Point 36 lies farthest from the rest, but it is still acceptable. So we can roughly say the data has no outliers.
Since the model is valid, we now interpret it. At the 0.05 level of significance, there is a significant relationship between labor hours and cubic feet moved, and also between labor hours and the presence of an elevator. So both independent variables contribute to the regression model. However, the intercept term is not significant at the 0.05 level. As a result, we rerun the regression without an intercept. The new model without intercept is attached below. This time the R squared value is 98%, compared with 90% in the model with intercept. The two values are not directly comparable, however: without an intercept, the total sum of squares is measured about zero rather than about the mean (37960.50 versus 7771.44 in the ANOVA tables), which inflates R squared, while the standard error barely changes (4.76 versus 4.78). The model itself and all of its coefficients are significant, so we adopt it as the working model. The model formula is
$$\hat{Y} = 0.05 X_1 - 3.23 X_2$$
| Regression Statistics | |
|---|---|
| Multiple R | 0.99 |
| R Square | 0.98 |
| Adjusted R Square | 0.95 |
| Standard Error | 4.76 |
| Observations | 36.00 |

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 2.00 | 37190.28 | 18595.14 | 820.85 | 0.00 |
| Residual | 34.00 | 770.22 | 22.65 | | |
| Total | 36.00 | 37960.50 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Feet | 0.05 | 0.00 | 27.64 | 0.00 | 0.05 | 0.05 |
| Elevator(Yes) | -3.23 | 1.38 | -2.34 | 0.03 | -6.04 | -0.42 |
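When comparing the fit with and without an intercept, it helps to recompute the fit measures on a common footing. The sketch below is an illustration using only the rounded values from the two ANOVA tables; the dictionary keys and function names are hypothetical. It assumes, as the total degrees of freedom of 36 suggest, that the no-intercept output reports the total sum of squares about zero rather than about the mean.

```python
# Rounded values copied from the two ANOVA tables.
with_intercept = {"ss_resid": 754.78, "ss_total": 7771.44, "df_resid": 33}
no_intercept = {"ss_resid": 770.22, "ss_total": 37960.50, "df_resid": 34}

def r_squared(model, ss_total=None):
    """1 - SSE/SST, optionally against a different total sum of squares."""
    total = model["ss_total"] if ss_total is None else ss_total
    return 1 - model["ss_resid"] / total

def residual_std_error(model):
    """Square root of the residual mean square."""
    return (model["ss_resid"] / model["df_resid"]) ** 0.5

# R squared exactly as each output reports it.
print(round(r_squared(with_intercept), 2), round(r_squared(no_intercept), 2))

# No-intercept fit measured against the mean-centered total instead.
centered_total = with_intercept["ss_total"]
print(round(r_squared(no_intercept, ss_total=centered_total), 2))

# Residual standard errors, which use the same scale in both models.
print(round(residual_std_error(with_intercept), 2),
      round(residual_std_error(no_intercept), 2))
```

Measured against the same centered total, the two fits are nearly identical, which is consistent with the almost unchanged standard error.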
We would like to predict the mean labor hours for moving 500 cubic feet. Substituting $X_1 = 500$ into the fitted model gives a predicted value of 22.40 hours.
So on average it takes 22.40 labor hours to move a 500 cubic foot home. We should tell the owner of the moving company that the larger the home is, the more labor time the move is expected to take.
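A prediction from the no-intercept model can be sketched as a small function. This is an illustration only: the name `predict_hours` is hypothetical, and the coefficients are the two-decimal values printed in the coefficient table, so the output need not match the reported 22.40 exactly, since the unrounded estimates are not shown in the output.

```python
FEET_COEF = 0.05       # hours per cubic foot, rounded value from the table
ELEVATOR_COEF = -3.23  # change in hours when the building has an elevator

def predict_hours(cubic_feet, has_elevator):
    """Predicted mean labor hours from the no-intercept model."""
    elevator = 1 if has_elevator else 0
    return FEET_COEF * cubic_feet + ELEVATOR_COEF * elevator

for has_elevator in (False, True):
    hours = predict_hours(500, has_elevator)
    print(f"500 cubic feet, elevator={has_elevator}: {hours:.2f} hours")
```

The function makes the dependence on the elevator indicator explicit: the same volume is predicted to take 3.23 fewer hours when an elevator is available.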
The 95% confidence interval estimate of the population slope for the relationship between labor hours and cubic feet moved is [0.046, 0.054].
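The interval follows the usual form, estimate plus or minus a critical t value times the standard error. The sketch below is an illustration under two assumptions: the standard error is recovered as the coefficient divided by its t statistic (the table prints it as 0.00 after rounding), and the critical value 2.032 for 34 residual degrees of freedom is taken from a standard t table. The function name `slope_interval` is hypothetical.

```python
T_CRIT_95_DF34 = 2.032  # two-sided 95% critical value of t with 34 df (from a t table)

def slope_interval(coef, t_stat, t_crit):
    """Confidence interval for a slope: coef +/- t_crit * standard error."""
    std_error = abs(coef / t_stat)  # recover the SE from the printed t statistic
    margin = t_crit * std_error
    return coef - margin, coef + margin

# Feet coefficient and t statistic from the no-intercept output.
lower, upper = slope_interval(coef=0.05, t_stat=27.64, t_crit=T_CRIT_95_DF34)
print(round(lower, 3), round(upper, 3))  # 0.046 0.054
```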
The adjusted R squared value is 95%, which means that after adjusting for the degrees of freedom used by the independent variables, the multiple linear regression explains 95% of the variation in the data.
In the next step we would like to see whether adding an interaction term makes the model perform better. The interaction term is defined as the product of the number of cubic feet moved and the indicator that there is an elevator in the building. The output of the regression with intercept gives an R squared value of 92%, which is only slightly better than the original 90%.
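Written out, the interaction model takes the following form. The coefficient $\beta_3$ on the product term is notation introduced here for illustration; the source does not report its estimate.

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \varepsilon$$

Setting the elevator indicator to each of its two values shows what the extra term does:

$$X_2 = 0: \quad Y = \beta_0 + \beta_1 X_1 + \varepsilon$$

$$X_2 = 1: \quad Y = (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_1 + \varepsilon$$

So the interaction allows the hours per cubic foot to differ between buildings with and without an elevator, rather than only shifting the line up or down.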
As a business owner, I conclude that the number of cubic feet moved has a significant positive correlation with the moving labor hours, while the presence of an elevator has a negative correlation with the moving labor hours. In the future, we should predict the moving labor hours by including these two factors in our regression model.
The exploration of used car prices with multiple features
We are interested in predicting used car prices. For this research, we focus on a dataset of cars that are currently part of the inventory of a used car dealership. We have a total of 500 used cars. For each car, we collect the year it was built, its age, the mileage it has been driven, its horse power, its miles per gallon (MPG), whether its region of origin is the USA or a foreign country, whether it has had a single owner or more than one owner, and finally the listed price. Specifically, for the region of origin indicator, we denote USA by 1 and foreign countries by 0. For ownership, we denote a single owner by 1 and more than one owner by 0. Since year and age carry the same information, we include only the used cars’ age in the model.
Before statistical analysis, we make the following hypotheses.
1. With the available independent variables, we can have a significant multiple linear regression model;
2. The used car price is positively correlated with car’s MPG;
3. The used car price is negatively correlated with car’s age;
4. The used car price is negatively correlated with car’s mileage;
5. The used car price is positively correlated with car’s horse power;
6. The used car price is positively correlated with single owner indicator;
7. The used car price may not be correlated with the region of origin.
With these hypotheses in mind, we first look at the summary statistics, which are presented in the following table. The age variable ranges from 1 to 16, with similar mean and median around 8. The price variable ranges from 4800 to 13400, with similar mean and median around 9100. The mileage variable ranges from 5000 to 184000, with mean and median around 92000. The horse power variable ranges from 130 to 220, with mean and median around 175. The MPG variable ranges from 12 to 42, with mean and median around 26. For the region of origin variable, there are 353 cars made in the USA and 147 made in foreign countries. For the owner variable, there are 334 single owner used cars and 166 used cars with more than one owner.
| | Age | Price | Mileage | Power.HP. | Fuel.MPG. |
|---|---|---|---|---|---|
| Min. | 1.000 | 4800 | 5000 | 130.0 | 12.00 |
| 1st Qu. | 4.000 | 7300 | 45000 | 150.0 | 15.00 |
| Median | 8.000 | 9100 | 91000 | 180.0 | 27.00 |
| Mean | 8.362 | 9058 | 92492 | 174.8 | 25.47 |
| 3rd Qu. | 12.250 | 10900 | 137250 | 200.0 | 33.00 |
| Max. | 16.000 | 13400 | 184000 | 220.0 | 42.00 |

| Region_of_origin | Count |
|---|---|
| Foreign | 147 |
| USA | 353 |

| Single_owner | Count |
|---|---|
| No | 166 |
| Yes | 334 |
Then we look at the histogram of each variable to see how it is distributed. The histograms of all variables are shown below. For all variables, the distribution is spread fairly uniformly, so there are no extreme outliers in the sample.
After that, we examine the relationship between the used cars’ prices and each independent variable, as shown below. The scatterplot of price against car’s age indicates a negative relationship, and the scatterplot of price against car’s mileage indicates a negative relationship as well. It is hard to tell from the scatterplots whether price is correlated with horse power, MPG, region of origin or the single owner indicator, so we need to check these further in the multiple linear regression.
For the multiple linear regression, we first use all available independent variables in the model. To make the notation simpler, we denote the used cars’ price by $Y$, the age of the used car by $X_1$, the mileage the car has been driven by $X_2$, the MPG of the car by $X_3$, the single owner indicator by $X_4$, the horse power of the car by $X_5$, and the foreign region of origin indicator by $X_6$. Then the first regression has the following formula.
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \beta_6 X_6 + \varepsilon$$
The output of the multiple linear regression gives the following information. The adjusted R squared value is 94%, which means that after accounting for the degrees of freedom used by the independent variables, the model explains 94% of the variation in the data. The p-value for the ANOVA is close to zero, which means that the model is statistically significant. Finally, we find that horse power, MPG, region and ownership are not significant at the 0.05 level of significance (p-values 0.30, 0.10, 0.58 and 0.17), while age and mileage are significant. So we remove the non-significant variables one by one, starting with the least significant, until all remaining independent variables are statistically significant.
| Regression Statistics | |
|---|---|
| Multiple R | 0.97 |
| R Square | 0.94 |
| Adjusted R Square | 0.94 |
| Standard Error | 530.41 |
| Observations | 500.00 |

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 6.00 | 2039879937.17 | 339979989.53 | 1208.44 | 0.00 |
| Residual | 493.00 | 138699642.83 | 281338.02 | | |
| Total | 499.00 | 2178579580.00 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 12924.73 | 772.00 | 16.74 | 0.00 | 11407.91 | 14441.55 |
| Age | -285.13 | 46.32 | -6.16 | 0.00 | -376.14 | -194.11 |
| Mileage | -0.01 | 0.00 | -2.92 | 0.00 | -0.02 | 0.00 |
| HP | -8.37 | 8.11 | -1.03 | 0.30 | -24.31 | 7.57 |
| MPG | 42.67 | 25.73 | 1.66 | 0.10 | -7.88 | 93.21 |
| Region(Foreign) | 28.60 | 52.30 | 0.55 | 0.58 | -74.16 | 131.35 |
| SingleOwner(Yes) | 69.22 | 50.44 | 1.37 | 0.17 | -29.89 | 168.33 |
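The variable removal used in this study is a backward elimination: at each step the predictor with the largest p-value above 0.05 is dropped and the model is refit. The sketch below illustrates only the selection rule, applied to the rounded p-values printed in the four regression outputs of this report; it does not refit anything, and the function name `next_to_drop` is hypothetical.

```python
def next_to_drop(p_values, alpha=0.05):
    """Return the least significant predictor if its p-value exceeds alpha,
    otherwise None (every remaining predictor is significant)."""
    name, p_value = max(p_values.items(), key=lambda item: item[1])
    return name if p_value > alpha else None

# P-values copied from the successive regression outputs (intercept excluded).
steps = [
    {"Age": 0.00, "Mileage": 0.00, "HP": 0.30, "MPG": 0.10,
     "Region(Foreign)": 0.58, "SingleOwner(Yes)": 0.17},
    {"Age": 0.00, "Mileage": 0.00, "HP": 0.28, "MPG": 0.09,
     "SingleOwner(Yes)": 0.17},
    {"Age": 0.00, "Mileage": 0.00, "MPG": 0.00, "SingleOwner(Yes)": 0.16},
    {"Age": 0.00, "Mileage": 0.01, "MPG": 0.00},
]

dropped = [next_to_drop(step) for step in steps]
print(dropped)  # ['Region(Foreign)', 'HP', 'SingleOwner(Yes)', None]
```

The rule reproduces the order of removal followed in the text: region first, then horse power, then the single owner indicator, after which nothing further is dropped.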
Since the region indicator is the least significant independent variable in the previous model, we remove it and rerun the regression. The model formula is now
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \beta_5 X_5 + \varepsilon$$
This time, when we check the output below, we find that horse power, MPG and the single owner indicator are still not significant, so we need to remove further independent variables to improve the regression model.
| Regression Statistics | |
|---|---|
| Multiple R | 0.97 |
| R Square | 0.94 |
| Adjusted R Square | 0.94 |
| Standard Error | 530.04 |
| Observations | 500.00 |

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 5.00 | 2039795803.97 | 407959160.79 | 1452.13 | 0.00 |
| Residual | 494.00 | 138783776.03 | 280938.82 | | |
| Total | 499.00 | 2178579580.00 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 12965.90 | 767.78 | 16.89 | 0.00 | 11457.39 | 14474.41 |
| Age | -285.32 | 46.29 | -6.16 | 0.00 | -376.27 | -194.37 |
| Mileage | -0.01 | 0.00 | -2.91 | 0.00 | -0.02 | 0.00 |
| HP | -8.72 | 8.08 | -1.08 | 0.28 | -24.60 | 7.16 |
| MPG | 43.73 | 25.63 | 1.71 | 0.09 | -6.63 | 94.10 |
| SingleOwner(Yes) | 69.36 | 50.41 | 1.38 | 0.17 | -29.68 | 168.40 |
Since horse power is the least significant independent variable in the previous model, we remove it and rerun the regression. The model formula is now
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \varepsilon$$
This time, when we check the regression output, we find that only the single owner indicator is still not significant, so we finally need to remove the single owner indicator to obtain the final regression model.
| Regression Statistics | |
|---|---|
| Multiple R | 0.97 |
| R Square | 0.94 |
| Adjusted R Square | 0.94 |
| Standard Error | 530.12 |
| Observations | 500.00 |

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 4.00 | 2039468693.65 | 509867173.41 | 1814.27 | 0.00 |
| Residual | 495.00 | 139110886.35 | 281032.09 | | |
| Total | 499.00 | 2178579580.00 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 12142.96 | 88.55 | 137.13 | 0.00 | 11968.98 | 12316.93 |
| Age | -290.06 | 46.09 | -6.29 | 0.00 | -380.61 | -199.51 |
| Mileage | -0.01 | 0.00 | -2.83 | 0.00 | -0.02 | 0.00 |
| MPG | 16.21 | 2.53 | 6.41 | 0.00 | 11.24 | 21.18 |
| SingleOwner(Yes) | 70.52 | 50.40 | 1.40 | 0.16 | -28.51 | 169.55 |
After removing the single owner indicator variable, we have the model formula
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \varepsilon$$
The regression has the following results. This time all independent variables are statistically significant, and the linear regression model is significant as well at the 0.05 level of significance. The adjusted R squared value equals 0.94, meaning that after taking into account the degrees of freedom used by the independent variables, the model explains 94% of the variation in the data. The final model is
$$\hat{Y} = 12192.26 - 292.91 X_1 - 0.01 X_2 + 16.12 X_3$$
Only the cars’ age ($X_1$), mileage ($X_2$) and MPG ($X_3$) are significantly related to the used cars’ price. The remaining independent variables are not significantly related to the price. This confirms the hypotheses that age and mileage are negatively correlated with used cars’ price while MPG is positively correlated with it.
| Regression Statistics | |
|---|---|
| Multiple R | 0.97 |
| R Square | 0.94 |
| Adjusted R Square | 0.94 |
| Standard Error | 530.64 |
| Observations | 500.00 |

| ANOVA | df | SS | MS | F | Significance F |
|---|---|---|---|---|---|
| Regression | 3.00 | 2038918613.01 | 679639537.67 | 2413.71 | 0.00 |
| Residual | 496.00 | 139660966.99 | 281574.53 | | |
| Total | 499.00 | 2178579580.00 | | | |

| | Coefficients | Standard Error | t Stat | P-value | Lower 95% | Upper 95% |
|---|---|---|---|---|---|---|
| Intercept | 12192.26 | 81.31 | 149.94 | 0.00 | 12032.50 | 12352.02 |
| Age | -292.91 | 46.09 | -6.36 | 0.00 | -383.46 | -202.36 |
| Mileage | -0.01 | 0.00 | -2.76 | 0.01 | -0.02 | 0.00 |
| MPG | 16.12 | 2.53 | 6.37 | 0.00 | 11.14 | 21.09 |
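To show how the final model would be used, the sketch below plugs a car's age, mileage and MPG into the fitted equation. It is an illustration only: the function name `predict_price` and the example car are hypothetical, and the coefficients are the two-decimal values printed in the table. The mileage coefficient in particular is heavily rounded (-0.01), so the result should be read as approximate.

```python
# Rounded coefficients copied from the final regression output.
INTERCEPT = 12192.26
AGE_COEF = -292.91      # price change per year of age
MILEAGE_COEF = -0.01    # price change per mile driven (heavily rounded)
MPG_COEF = 16.12        # price change per mile per gallon

def predict_price(age_years, mileage, mpg):
    """Predicted listing price from the three-variable model."""
    return (INTERCEPT
            + AGE_COEF * age_years
            + MILEAGE_COEF * mileage
            + MPG_COEF * mpg)

# Hypothetical car at the sample medians: 8 years old, 91000 miles, 27 MPG.
example_price = predict_price(age_years=8, mileage=91000, mpg=27)
print(f"Predicted price: {example_price:.2f}")
```

Using the sample medians as inputs gives a prediction near the middle of the observed price range, which is a quick sanity check on the signs and magnitudes of the coefficients.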
The diagnostics for the final model are shown below. The first figure shows the relationship between residuals and fitted values. Since the data points lie close to the horizontal line, we believe that the equal variance assumption is satisfied. The second figure, the Q-Q plot, shows that all dots are very close to the straight line, which indicates that the normality assumption is satisfied. The third plot indicates that there are no outliers, leverage points or influential points, so there is no need to delete any unusual points.
In summary, we build a multiple linear regression of used cars’ price on cars’ age, mileage and MPG. This model has a fairly high adjusted R squared value of 94%, which means that our model can explain 94% of the variation in the data. However, although this model has high predictive power, some important variables that may be highly correlated with the used cars’ price are still missing. First, the cars’ brand may be highly correlated with price. It is common sense that BMW 7 series cars are much more expensive than Toyota Corolla cars. So even if two such cars have the same age, the same mileage and the same MPG, there is still a big gap in price. The second important variable is whether the car has a clean title. If a car has been in an accident, it is highly likely that its value drops substantially. The last one is location. The used cars’ selling prices in New York might be very different from the prices in New Mexico. So we should take the location information into consideration as well.
[Figure: Scatterplot of SingleOwner(Yes=1)]
[Figure: Scatterplot of Region(Foreign=1)]