Data Analysis Project: Final Report
Running Head: Report 1
Data Analysis Project: A Proposal
By Julia Fernandez
MSD 201
Professor Heeyong Kim
Rider University
July 25, 2018.
Running Head: Report 2
Describe your data set (including the source). Why are you interested in it? What do you
hope to learn? Before exploring your data set, state some hypotheses (guesses) about how the
variables should be related, perhaps based on your knowledge and experience, but please
provide at least one reference to support your claim. Be sure to identify the response variable
and the predictor variables.
The data set that I selected for this study is the “Manufacturing companies of US”. The
industrial sector of any country contributes a lot to a stable economy where a lot of jobs is created
and earn foreign exchange as well through trading (Timothy E. Zimmer, 2017). Hence, for me, the
analysis of manufacturing companies is quite attractive because I want to explore all those factors
that contribute to the success of these companies. The data on these companies is obtained from
the annual report of US where manufacturing firms data is mentioned against their geographical
region.
The data of 2016 is used in this study that is obtained from the US census and a total number
of firms that are taken under this is 45 (census.gov, 2016). (Appendix A)
Variables of this study:
I took three variables in this study that help me to understand the performance of
manufacturing companies. The employees or human capital is an important pillar of any company
that is the cause of its success and productivity. Hence, the relationship between a number of
employees and their productivity in determining the total expenditure has analyzed in this study.
The aim of this study is to explore that whether the number of employees and their productivity
increases or decreases the expense of the company.
The independent variables (predictor variable) of this study are;
Running Head: Report 3
• Number of employees, and
• Production of workers.
The dependent variables (response variable) of this study is
• The total expenditure of manufacturing firm.
From this analysis, I will able to learn how the manpower and their productivity increases or
decreases the total expenditure of a company.
The hypothesis is:
H1: There is a relationship between a number of employees and the production of workers in
determining the total expenditure of manufacturing firms in the US.
Ho: There is no relationship between a number of employees and the production of workers in
determining the total expenditure of manufacturing firms in the US.
Sub-hypothesis:
H1: There is a relationship between a number of employees in determining the total expenditure
of manufacturing firms in the US.
Ho: There is no relationship between a number of employees in determining the total expenditure
of manufacturing firms in the US.
H2: There is a relationship between the production of workers in determining the total expenditure
of manufacturing firms in the US.
Ho: There is a no relationship between production of workers in determining the total expenditure
of manufacturing firms in the US.
Running Head: Report 4
Regression analysis:
In this study, multiple regression analysis is used to find to the relationship between two
independent and one dependent variable. A multiple regression analysis equations are developed
to predict the dependent variable from two independent variables
Y=a+b1X1 + b2X2
• Y= it is the value of the dependent variable. In this study, the dependent variable is
“Total expenditure of manufacturing firm.
• a= it is the value of alpha or a constant or intercept
• b1 it is the slope of beta coefficient for the independent variable X1.
• X1 it is the first independent variable that is used in the study for explaining the variance
Y. in this study the first independent variable is “Number of employees”
• b2 it is the slope of beta coefficient for the independent variable X2.
• X2 it is the second independent variable that is used in the study for explaining the
variance Y. In this study, the second independent variable is “Production of workers”
Number of employees:
Running Head: Report 5
Adjusted R – square:
The value of r – square help in determining the relationship between dependent and independent
variables. Adjusted R –square is the modified foam of r square. The values of R square fall between
the 0 to 1. If the values of R square close to 1 then it shows that there is a strong relationship and
if it is close to 0 then there is a weak relationship. The value of R – square is 1.02 that is close to
1 it means that a strong relationship exists.
Statistical Significance (sig.)
The value of F statistics is used to analyze the significance of the regression model. The higher
values of F statistics show that the model is significant and if it is less than 5 then the model is not
significant. The value of F statistics of study is greater than five that shows that this model is
significant.
Running Head: Report 6
Productivity of workers:
Adjusted R – square:
The value of r – square help in determining the relationship between dependent and independent
variables. Adjusted R – square is the modified foam of r square. The values of R square fall
between the 0 to 1. If the values of R square close to 1 then it shows that there is a strong
relationship and if it is close to 0 then there is a weak relationship. The value of R – square is 1.02
that is close to 1 it means that a strong relationship exists.
Statistical Significance (sig.)
The value of F statistics is used to analyze the significance of the regression model. The higher
values of F statistics show that the model is significant and if it is less than 5 then the model is not
significant. The value of F statistics of study is 8.093 that is greater than five that shows that this
model is significant.
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.046114549
R Square 0.002126552
Adjusted R Square 1.02055239
Standard Error 24854214.28
Observations 46
ANOVA
df SS MS F Significance F
Regression 1 579232885 579232885 8.093767672 0.760885193
Residual 44 2718020657 61773196762
Total 45 27238129864
Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%
Intercept 7384332.289 3812361.214 1.936944553 0.059189439 -298976.8838 15067641.46 -298976.8838 15067641.46
X Variable 1 -1.005973127 3.28518486 -0.306215075 0.760885193 -7.626828171 5.614881916 -7.626828171 5.614881916
Running Head: Report 7
From the analysis of regression model it is evaluated that there is a positive relationship between
a number of employees, the productivity of workers with the total expenditure of the company.
Under the results of regression analysis, the null hypothesis is accepted and alternate hypothesis is
rejected.
Make a scatterplot of your response variable (on the Y-axis) versus one of the predictor
variables (on the X-axis). Describe the pattern you see. Is this pattern consistent with what
you expected? Note any apparent outliers in the plot. Can you propose a "cause" for these
outliers? Repeat the entire procedure for the other predictor variables.
Number of employees:
Below mentioned are the scatterplot by taking response variables that is total expenditure
on Y-axis and predictor variables on the x-axis. The outcome of this graph is consistent than
anticipated. The pattern that is seen in the below-mentioned plot is that all the values of the firm
lie between 0 – 2000000 while there is the maximum range of values goes more than 10000000
and total expenditure is more than 150000000.
Outliers of this plot are:
Number of employees
max 11112764
min 1259
range 11111505
mode 22862
Running Head: Report 8
Productivity of workers
Below mentioned are the scatterplot by taking response variables that is total expenditure
on Y- axis and predictor variables on x – axis that is productivity of employees. The outcome of
this graph is consistent that anticipated. The pattern that is see in the below mentioned plot is that
all the values of firm lies between 0 – 2000000 while there is maximum range of values goes more
than 16000000 and total expenditure is more than 80000000.
Productivity of workers
max 7733162
min 813
range 11111505
mode 22862
0
200000
400000
600000
800000
1000000
1200000
0 50000000 100000000 150000000 200000000
Number of employees
Running Head: Report 9
Can you think of any other variables (not in your data set) that might be useful in predicting
Y? Try to list a few possibilities.
There are multiple variables that are used in this study for the prediction of the variable in Y. These
variables are stated below:
• The annual payroll of employees. This will help in determining that whether the payroll
increases the amount of expenditure of company and to what extent this will have a positive
impact on manufacturing firm. It also helps in determining whether the company is overly
hired or not.
• The total cost of the material. The total cost of material is an important variable as
material cost is an important predictor of total expenditure as more than 50% of the total
cost of the company depends on it.
0
20000000
40000000
60000000
80000000
100000000
120000000
140000000
160000000
180000000
0 100000 200000 300000 400000 500000 600000 700000 800000
Production of workers
Running Head: Report 10
• The total value of shipments and receipts for services. This help in determining that
how the value of shipment lowers the expenditure of the company. when the company
earns more from shipment will easily bear the high expense and generated profit.
For each variable, obtain descriptive statistics using Data Analysis Toolkit and create a histogram
using Excel.
Number of employees (Predictor variables):
Histogram:
Mean 222950.3333
Standard Error 34215.00257
Median 154684
Standard Deviation 229521.2148
Sample Variance 52679988045
Kurtosis 4.292423976
Skewness 1.841321098
Range 1118637
Minimum 1259
Maximum 1119896
Sum 10032765
Count 45
Confidence Level(95.0%) 68955.80674
No of employees
Running Head: Report 11
Production of workers (Predictor variables):
Histogram:
Mean 155262.5333
Standard Error 23151.59335
Median 108945
Standard Deviation 155305.6095
Sample Variance 24119832354
Kurtosis 2.564765983
Skewness 1.551343512
Range 705577
Minimum 813
Maximum 706390
Sum 6986814
Count 45
Confidence Level(95.0%) 46658.97053
production of workers
Running Head: Report 12
For each variable, based on the descriptive statistics output, decide if your variable has
normal distribution or not. Also, you can check a scatterplot to see the liner relationship. If
you see the problem with your data, and if all of the data values for this variable are positive,
try taking log of the variable. Then create the descriptive statistics graph for the log of the
variable, and decide whether the problem is reduced. Please note that if a variable has any
zero or negative values, then taking logs is NOT appropriate, so there is no point in trying it
in this case. Excel tutorial on how to graph log-transformed data in Excel.
0
5
10
15
20
25
30
F re
q u
e n
cy
Bin
Histogram
Running Head: Report 13
From the log of graph of variable it is analyzed that there exists a linear relationship.
Return the scatterplot (and answer the rest of question 2) using the logged variables
wherever this was found appropriate in question # 5). Here are some examples of what I
mean. If you decide to take logs of predictor variable X2 only, then you should run a
scatterplot of your response variable (let's call it Y) against log(X2). If you decided to take
0
0.5
1
1.5
2
2.5
3
0 0.2 0.4 0.6 0.8 1 1.2 1.4
log of number of employees
0
0.5
1
1.5
2
2.5
3
0 0.2 0.4 0.6 0.8 1 1.2
Log of production of workers
Running Head: Report 14
logs of X2 and X3, then you should run scatterplots of Y versus log(X2) and Y versus log(X3).
If you decided to take logs of Y only, then you should run scatterplots of log(Y) versus all of
the (non-logged) predictor variables. If you decided not to take the log of any of the variables,
you do not need to do anything. For each scatterplot you create here, compare it with the
corresponding one from question 2). Did taking logs help you to uncover a relationship
between the variables?
-10000000
10000000
30000000
50000000
70000000
90000000
110000000
130000000
150000000
170000000
190000000
0.5 200000.5 400000.5 600000.5 800000.5 1000000.5 1200000.5
Number of Employees
Running Head: Report 15
References
• census.gov. (2016, 1 1). Manufacturing Firms . Retrieved from census.gov:
https://www.census.gov/data/tables/2016/econ/asm/2016-asm.html
• Timothy E. Zimmer, P. K. (2017). Lean manufacturing: The production employment and
wages connection. Indiana Business Review, 92(1), 1-10.
Running Head: Report 16
Appendix A:
Id2 Year Geographic area
name
Number of
employees
Production
workers average
for year
Total capital
expenditures
($1,000)
1 2016 District of Columbia 1259 813 168317599
2 2016 Hawaii 11513 7149 4187468
3 2016 Alaska 12178 10176 204082
4 2016 Montana 16697 11238 1924308
5 2016 New Mexico 21747 14700 1994511
6 2016 North Dakota 22862 16912 13062872
7 2016 Delaware 25434 17901 1534982
8 2016 Vermont 27420 18641 1452456
9 2016 Rhode Island 36081 23745 500411
10 2016 Nevada 41356 27759 9730
11 2016 South Dakota 44094 32491 3927971
12 2016 Maine 49710 35983 4450840
13 2016 Idaho 55774 41205 129621
14 2016 New Hampshire 65553 39464 1342579
15 2016 Maryland 91791 56258 6681908
16 2016 Nebraska 92945 70717 7046728
17 2016 Louisiana 113914 80129 3658211
18 2016 Colorado 121069 79052 2211855
19 2016 Oklahoma 121220 89068 3904222
20 2016 Mississippi 130537 103082 8699895
21 2016 Arizona 136946 82409 438790
22 2016 Arkansas 145733 116757 1153209
23 2016 Kansas 154684 110902 2369195
24 2016 Connecticut 155062 87968 7409725
25 2016 Oregon 160128 108945 3419845
26 2016 Iowa 203835 147954 1589847
27 2016 New Jersey 210291 139870 2823733
28 2016 South Carolina 213050 159797 642828
29 2016 Massachusetts 223996 131020 1109775
30 2016 Kentucky 230763 178593 456874
31 2016 Alabama 234803 176679 569796
32 2016 Missouri 245352 182878 2738486
33 2016 Florida 270180 179959 502613
34 2016 Minnesota 297770 197099 5140968
35 2016 Tennessee 308966 228500 5152963
36 2016 Georgia 351951 265733 331143
37 2016 New York 395129 261216 10755438
38 2016 North Carolina 411050 303771 2273823
39 2016 Indiana 476417 357501 2623293
40 2016 Pennsylvania 522221 362007 7348275
41 2016 Illinois 538183 370965 346802
42 2016 Michigan 555005 398946 3549962
43 2016 Ohio 642945 460781 433012
44 2016 Texas 725255 493691 5121563
45 2016 California 1119896 706390 19745490
Running Head: Report 17
- References