Data Analysis Project: Final Report

profileZelda23
DataAnalysisProjectcopy.pdf

Running Head: Report 1

Data Analysis Project: A Proposal

By Julia Fernandez

MSD 201

Professor Heeyong Kim

Rider University

July 25, 2018.

Running Head: Report 2

Describe your data set (including the source). Why are you interested in it? What do you

hope to learn? Before exploring your data set, state some hypotheses (guesses) about how the

variables should be related, perhaps based on your knowledge and experience, but please

provide at least one reference to support your claim. Be sure to identify the response variable

and the predictor variables.

The data set that I selected for this study is the “Manufacturing companies of US”. The

industrial sector of any country contributes a lot to a stable economy where a lot of jobs is created

and earn foreign exchange as well through trading (Timothy E. Zimmer, 2017). Hence, for me, the

analysis of manufacturing companies is quite attractive because I want to explore all those factors

that contribute to the success of these companies. The data on these companies is obtained from

the annual report of US where manufacturing firms data is mentioned against their geographical

region.

The data of 2016 is used in this study that is obtained from the US census and a total number

of firms that are taken under this is 45 (census.gov, 2016). (Appendix A)

Variables of this study:

I took three variables in this study that help me to understand the performance of

manufacturing companies. The employees or human capital is an important pillar of any company

that is the cause of its success and productivity. Hence, the relationship between a number of

employees and their productivity in determining the total expenditure has analyzed in this study.

The aim of this study is to explore that whether the number of employees and their productivity

increases or decreases the expense of the company.

The independent variables (predictor variable) of this study are;

Running Head: Report 3

• Number of employees, and

• Production of workers.

The dependent variables (response variable) of this study is

• The total expenditure of manufacturing firm.

From this analysis, I will able to learn how the manpower and their productivity increases or

decreases the total expenditure of a company.

The hypothesis is:

H1: There is a relationship between a number of employees and the production of workers in

determining the total expenditure of manufacturing firms in the US.

Ho: There is no relationship between a number of employees and the production of workers in

determining the total expenditure of manufacturing firms in the US.

Sub-hypothesis:

H1: There is a relationship between a number of employees in determining the total expenditure

of manufacturing firms in the US.

Ho: There is no relationship between a number of employees in determining the total expenditure

of manufacturing firms in the US.

H2: There is a relationship between the production of workers in determining the total expenditure

of manufacturing firms in the US.

Ho: There is a no relationship between production of workers in determining the total expenditure

of manufacturing firms in the US.

Running Head: Report 4

Regression analysis:

In this study, multiple regression analysis is used to find to the relationship between two

independent and one dependent variable. A multiple regression analysis equations are developed

to predict the dependent variable from two independent variables

Y=a+b1X1 + b2X2

• Y= it is the value of the dependent variable. In this study, the dependent variable is

“Total expenditure of manufacturing firm.

• a= it is the value of alpha or a constant or intercept

• b1 it is the slope of beta coefficient for the independent variable X1.

• X1 it is the first independent variable that is used in the study for explaining the variance

Y. in this study the first independent variable is “Number of employees”

• b2 it is the slope of beta coefficient for the independent variable X2.

• X2 it is the second independent variable that is used in the study for explaining the

variance Y. In this study, the second independent variable is “Production of workers”

Number of employees:

Running Head: Report 5

Adjusted R – square:

The value of r – square help in determining the relationship between dependent and independent

variables. Adjusted R –square is the modified foam of r square. The values of R square fall between

the 0 to 1. If the values of R square close to 1 then it shows that there is a strong relationship and

if it is close to 0 then there is a weak relationship. The value of R – square is 1.02 that is close to

1 it means that a strong relationship exists.

Statistical Significance (sig.)

The value of F statistics is used to analyze the significance of the regression model. The higher

values of F statistics show that the model is significant and if it is less than 5 then the model is not

significant. The value of F statistics of study is greater than five that shows that this model is

significant.

Running Head: Report 6

Productivity of workers:

Adjusted R – square:

The value of r – square help in determining the relationship between dependent and independent

variables. Adjusted R – square is the modified foam of r square. The values of R square fall

between the 0 to 1. If the values of R square close to 1 then it shows that there is a strong

relationship and if it is close to 0 then there is a weak relationship. The value of R – square is 1.02

that is close to 1 it means that a strong relationship exists.

Statistical Significance (sig.)

The value of F statistics is used to analyze the significance of the regression model. The higher

values of F statistics show that the model is significant and if it is less than 5 then the model is not

significant. The value of F statistics of study is 8.093 that is greater than five that shows that this

model is significant.

SUMMARY OUTPUT

Regression Statistics

Multiple R 0.046114549

R Square 0.002126552

Adjusted R Square 1.02055239

Standard Error 24854214.28

Observations 46

ANOVA

df SS MS F Significance F

Regression 1 579232885 579232885 8.093767672 0.760885193

Residual 44 2718020657 61773196762

Total 45 27238129864

Coefficients Standard Error t Stat P-value Lower 95% Upper 95% Lower 95.0% Upper 95.0%

Intercept 7384332.289 3812361.214 1.936944553 0.059189439 -298976.8838 15067641.46 -298976.8838 15067641.46

X Variable 1 -1.005973127 3.28518486 -0.306215075 0.760885193 -7.626828171 5.614881916 -7.626828171 5.614881916

Running Head: Report 7

From the analysis of regression model it is evaluated that there is a positive relationship between

a number of employees, the productivity of workers with the total expenditure of the company.

Under the results of regression analysis, the null hypothesis is accepted and alternate hypothesis is

rejected.

Make a scatterplot of your response variable (on the Y-axis) versus one of the predictor

variables (on the X-axis). Describe the pattern you see. Is this pattern consistent with what

you expected? Note any apparent outliers in the plot. Can you propose a "cause" for these

outliers? Repeat the entire procedure for the other predictor variables.

Number of employees:

Below mentioned are the scatterplot by taking response variables that is total expenditure

on Y-axis and predictor variables on the x-axis. The outcome of this graph is consistent than

anticipated. The pattern that is seen in the below-mentioned plot is that all the values of the firm

lie between 0 – 2000000 while there is the maximum range of values goes more than 10000000

and total expenditure is more than 150000000.

Outliers of this plot are:

Number of employees

max 11112764

min 1259

range 11111505

mode 22862

Running Head: Report 8

Productivity of workers

Below mentioned are the scatterplot by taking response variables that is total expenditure

on Y- axis and predictor variables on x – axis that is productivity of employees. The outcome of

this graph is consistent that anticipated. The pattern that is see in the below mentioned plot is that

all the values of firm lies between 0 – 2000000 while there is maximum range of values goes more

than 16000000 and total expenditure is more than 80000000.

Productivity of workers

max 7733162

min 813

range 11111505

mode 22862

0

200000

400000

600000

800000

1000000

1200000

0 50000000 100000000 150000000 200000000

Number of employees

Running Head: Report 9

Can you think of any other variables (not in your data set) that might be useful in predicting

Y? Try to list a few possibilities.

There are multiple variables that are used in this study for the prediction of the variable in Y. These

variables are stated below:

• The annual payroll of employees. This will help in determining that whether the payroll

increases the amount of expenditure of company and to what extent this will have a positive

impact on manufacturing firm. It also helps in determining whether the company is overly

hired or not.

• The total cost of the material. The total cost of material is an important variable as

material cost is an important predictor of total expenditure as more than 50% of the total

cost of the company depends on it.

0

20000000

40000000

60000000

80000000

100000000

120000000

140000000

160000000

180000000

0 100000 200000 300000 400000 500000 600000 700000 800000

Production of workers

Running Head: Report 10

• The total value of shipments and receipts for services. This help in determining that

how the value of shipment lowers the expenditure of the company. when the company

earns more from shipment will easily bear the high expense and generated profit.

For each variable, obtain descriptive statistics using Data Analysis Toolkit and create a histogram

using Excel.

Number of employees (Predictor variables):

Histogram:

Mean 222950.3333

Standard Error 34215.00257

Median 154684

Standard Deviation 229521.2148

Sample Variance 52679988045

Kurtosis 4.292423976

Skewness 1.841321098

Range 1118637

Minimum 1259

Maximum 1119896

Sum 10032765

Count 45

Confidence Level(95.0%) 68955.80674

No of employees

Running Head: Report 11

Production of workers (Predictor variables):

Histogram:

Mean 155262.5333

Standard Error 23151.59335

Median 108945

Standard Deviation 155305.6095

Sample Variance 24119832354

Kurtosis 2.564765983

Skewness 1.551343512

Range 705577

Minimum 813

Maximum 706390

Sum 6986814

Count 45

Confidence Level(95.0%) 46658.97053

production of workers

Running Head: Report 12

For each variable, based on the descriptive statistics output, decide if your variable has

normal distribution or not. Also, you can check a scatterplot to see the liner relationship. If

you see the problem with your data, and if all of the data values for this variable are positive,

try taking log of the variable. Then create the descriptive statistics graph for the log of the

variable, and decide whether the problem is reduced. Please note that if a variable has any

zero or negative values, then taking logs is NOT appropriate, so there is no point in trying it

in this case. Excel tutorial on how to graph log-transformed data in Excel.

0

5

10

15

20

25

30

F re

q u

e n

cy

Bin

Histogram

Running Head: Report 13

From the log of graph of variable it is analyzed that there exists a linear relationship.

Return the scatterplot (and answer the rest of question 2) using the logged variables

wherever this was found appropriate in question # 5). Here are some examples of what I

mean. If you decide to take logs of predictor variable X2 only, then you should run a

scatterplot of your response variable (let's call it Y) against log(X2). If you decided to take

0

0.5

1

1.5

2

2.5

3

0 0.2 0.4 0.6 0.8 1 1.2 1.4

log of number of employees

0

0.5

1

1.5

2

2.5

3

0 0.2 0.4 0.6 0.8 1 1.2

Log of production of workers

Running Head: Report 14

logs of X2 and X3, then you should run scatterplots of Y versus log(X2) and Y versus log(X3).

If you decided to take logs of Y only, then you should run scatterplots of log(Y) versus all of

the (non-logged) predictor variables. If you decided not to take the log of any of the variables,

you do not need to do anything. For each scatterplot you create here, compare it with the

corresponding one from question 2). Did taking logs help you to uncover a relationship

between the variables?

-10000000

10000000

30000000

50000000

70000000

90000000

110000000

130000000

150000000

170000000

190000000

0.5 200000.5 400000.5 600000.5 800000.5 1000000.5 1200000.5

Number of Employees

Running Head: Report 15

References

• census.gov. (2016, 1 1). Manufacturing Firms . Retrieved from census.gov:

https://www.census.gov/data/tables/2016/econ/asm/2016-asm.html

• Timothy E. Zimmer, P. K. (2017). Lean manufacturing: The production employment and

wages connection. Indiana Business Review, 92(1), 1-10.

Running Head: Report 16

Appendix A:

Id2 Year Geographic area

name

Number of

employees

Production

workers average

for year

Total capital

expenditures

($1,000)

1 2016 District of Columbia 1259 813 168317599

2 2016 Hawaii 11513 7149 4187468

3 2016 Alaska 12178 10176 204082

4 2016 Montana 16697 11238 1924308

5 2016 New Mexico 21747 14700 1994511

6 2016 North Dakota 22862 16912 13062872

7 2016 Delaware 25434 17901 1534982

8 2016 Vermont 27420 18641 1452456

9 2016 Rhode Island 36081 23745 500411

10 2016 Nevada 41356 27759 9730

11 2016 South Dakota 44094 32491 3927971

12 2016 Maine 49710 35983 4450840

13 2016 Idaho 55774 41205 129621

14 2016 New Hampshire 65553 39464 1342579

15 2016 Maryland 91791 56258 6681908

16 2016 Nebraska 92945 70717 7046728

17 2016 Louisiana 113914 80129 3658211

18 2016 Colorado 121069 79052 2211855

19 2016 Oklahoma 121220 89068 3904222

20 2016 Mississippi 130537 103082 8699895

21 2016 Arizona 136946 82409 438790

22 2016 Arkansas 145733 116757 1153209

23 2016 Kansas 154684 110902 2369195

24 2016 Connecticut 155062 87968 7409725

25 2016 Oregon 160128 108945 3419845

26 2016 Iowa 203835 147954 1589847

27 2016 New Jersey 210291 139870 2823733

28 2016 South Carolina 213050 159797 642828

29 2016 Massachusetts 223996 131020 1109775

30 2016 Kentucky 230763 178593 456874

31 2016 Alabama 234803 176679 569796

32 2016 Missouri 245352 182878 2738486

33 2016 Florida 270180 179959 502613

34 2016 Minnesota 297770 197099 5140968

35 2016 Tennessee 308966 228500 5152963

36 2016 Georgia 351951 265733 331143

37 2016 New York 395129 261216 10755438

38 2016 North Carolina 411050 303771 2273823

39 2016 Indiana 476417 357501 2623293

40 2016 Pennsylvania 522221 362007 7348275

41 2016 Illinois 538183 370965 346802

42 2016 Michigan 555005 398946 3549962

43 2016 Ohio 642945 460781 433012

44 2016 Texas 725255 493691 5121563

45 2016 California 1119896 706390 19745490

Running Head: Report 17

  • References