
REGRESSION AND STATISTICAL EFFECTS

Descriptive statistics were computed to obtain the minimum, maximum, mean and standard deviation of the numeric variables in the provided data set. The output of this procedure is presented in Table 1. The means (standard deviations) of the variables age, years in department, salary, teaching rating and publication prestige were 48.28 (8.52), 16.32 (7.88), 61955.79 (30858.05), 9.74 (0.67) and 3.31 (1.14) respectively. The minimum and maximum values are also presented in Table 1 below.

Table 1: Descriptive statistics for numeric variables

Variable                Minimum      Maximum      Mean         Std. Dev.
Age                     26.01        71.27        48.28        8.52
Years in department     1.00         42.43        16.32        7.88
Salary                  25916.00     212266.00    61955.79     30858.05
Teaching rating         5.58         10.66        9.74         0.67
Publication prestige    1.60         6.00         3.31         1.14
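The summary in Table 1 can be reproduced for any numeric column with a few lines of code. The following is a minimal sketch in Python, assuming the values of one variable are available as a list. The function name `describe` and the sample values are hypothetical and are not taken from the data set. It uses the sample standard deviation (n - 1 denominator), which is assumed here to match the statistical package used for the report.

```python
import statistics

def describe(values):
    """Return minimum, maximum, mean and sample standard deviation."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "std_dev": statistics.stdev(values),  # n - 1 denominator
    }

# Hypothetical ages, for illustration only (not from the data set)
ages = [26.0, 41.5, 48.0, 52.5, 71.0]
summary = describe(ages)
print({name: round(value, 2) for name, value in summary.items()})
```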

Histograms were also created for each of the variables above to visualize their distributions; these are presented in Figure 1. The results indicate that salary (Figure 1a) and teaching rating (Figure 1d) are skewed to the right and to the left respectively. On the other hand, the variables age, years in department and publication prestige appear approximately normally distributed, as can be seen in Figures 1b, 1c and 1e respectively.

a) Histogram of salary

b) Histogram of age

c) Histogram of years in department

d) Histogram of teaching rating

e) Histogram of publication prestige

Figure 1: Graphical representation of the distribution of the interval variables salary, age, years in department, teaching rating and publication prestige
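The direction of skew read off the histograms can also be checked numerically. The sketch below computes the moment coefficient of skewness, which is positive for a long right tail and negative for a long left tail. The function name and both sample lists are hypothetical, and this is only one of several skewness definitions a statistical package might use.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical values, for illustration only
right_tailed = [1, 2, 2, 3, 3, 3, 4, 20]   # long right tail, like salary
left_tailed = [-v for v in right_tailed]   # mirror image, long left tail

print(skewness(right_tailed) > 0, skewness(left_tailed) < 0)
```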

Table 2: Frequency distributions and percentages for ordinal and dummy variables

Variable                   Group                           Frequency    Percent    Cum. Percent
Sex                        Male                            262          42.3       42.3
                           Female                          358          57.7       100.0
Level                      Assistant professor             320          51.7       51.7
                           Associate professor             139          22.5       74.2
                           Full professor                  80           12.9       87.1
                           Chaired professor               80           12.9       100.0
Minority                   Minority                        155          25.0       25.0
                           Non-minority                    464          75.0       100.0
Admin. responsibilities    No admin. responsibilities      159          25.7       25.7
                           Some admin. responsibilities    460          74.3       100.0
Science                    In science                      171          27.6       27.6
                           Not in science                  448          72.4       100.0
Degree tier                Degree is not from top-tier     480          77.4       77.4
                           Degree is from top-tier         140          22.6       100.0

A frequency analysis was also carried out to show the distribution of the respondents among the groups of the categorical variables, namely sex, level, minority, administrative responsibilities, science and degree tier. In particular, there were more females, 358 (57.7%), than males, 262 (42.3%), and assistant professors formed the largest level group, 320 (51.7%). For the other four dummy (dichotomous) variables, the groups non-minority, some administrative responsibilities, not in science and degree not from a top-tier university had the higher frequencies of 464 (75.0%), 460 (74.3%), 448 (72.4%) and 480 (77.4%) respectively. These results are presented in Table 2.
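The percent and cumulative percent columns of Table 2 follow directly from the group counts. The sketch below rebuilds the sex rows from the frequencies reported in Table 2. The function name `frequency_table` is hypothetical, and the rounding to one decimal place is assumed to match the table.

```python
from collections import Counter

def frequency_table(labels):
    """Return rows of (group, frequency, percent, cumulative percent)."""
    counts = Counter(labels)
    total = sum(counts.values())
    rows, cumulative = [], 0
    for group, freq in counts.items():  # groups in order of first appearance
        cumulative += freq
        rows.append((group, freq,
                     round(100 * freq / total, 1),
                     round(100 * cumulative / total, 1)))
    return rows

# Rebuild the sex column from the counts reported in Table 2
sex = ["Male"] * 262 + ["Female"] * 358
for row in frequency_table(sex):
    print(row)
```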

Scatter plot of salary versus age

The study sought to investigate the relationships between salary and age (Figure 2), teaching rating and years in department (Figure 3), and salary and teaching rating (Figure 4). The output in Figure 2 reveals a positive and fairly strong linear correlation between salary and age. Thus, from the available data, the salary of the respondents increases with increasing age or, simply stated, older respondents earn more than their younger counterparts.

Figure 2: Scatter plot of salary versus age

On the other hand, the study did not find any relationship between teaching rating and years in department, as can be seen in Figure 3. Therefore, the number of years in the department does not appear to influence how the respondents are rated across all their classes. Lastly, the scatter plot in Figure 4 reveals an inverse relationship between the salary the respondents earn and their teaching rating. This relationship is, however, weak.

Figure 3: Scatter plot of teaching rating and years in the department (figure on the left)

Figure 4: Scatter plot of salary and teaching rating (figure on the right)
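The strength and direction of the relationships seen in the scatter plots can be summarized with the Pearson correlation coefficient. The following is a minimal sketch; the function name and all three sample lists are hypothetical and only mimic the patterns described above (a positive salary and age relationship, an inverse salary and teaching rating relationship).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, for illustration only
ages = [30, 40, 50, 60, 70]
salaries = [40000, 52000, 58000, 75000, 90000]
ratings = [10.0, 9.8, 9.9, 9.5, 9.4]

print(round(pearson_r(ages, salaries), 3))     # positive: salary rises with age
print(round(pearson_r(salaries, ratings), 3))  # negative: inverse relationship
```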

Simple linear regression

To test the hypothesis that the dependent variable salary is a function of the independent variable tch_rating (teaching rating), that is, that the slope is significantly different from zero, a simple linear regression was carried out. The results of this regression analysis (see Table 3) revealed that only a small share of the variance in the dependent variable, 2.7% (R-squared = 0.027), was accounted for by the independent variable, which is much lower than the recommended 36%. Further, the model was found to be statistically significant at the 99% confidence level, F(1, 618) = 17.20, p < 0.01. The t statistic confirmed that the independent variable tch_rating is a significant predictor at the 99% confidence level, t = -4.147, p < 0.01.

The best fitting model for predicting salary given tch_rating, using the coefficients in Table 3, is therefore:

salary = 136254.17 - 7628.84 × tch_rating

From this model, the salary decreases by 7628.84 for each unit increase in the average teaching rating across all classes.
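How such a model is fitted and then used for prediction can be sketched as follows. `fit_simple_ols` implements the ordinary least-squares formulas for one predictor, and `predict_salary` plugs in the constant and slope reported in Table 3. Both function names are hypothetical, and the report does not state which software produced the estimates.

```python
def fit_simple_ols(xs, ys):
    """Least-squares intercept, slope and R-squared for y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1 - ss_res / ss_tot

def predict_salary(tch_rating, intercept=136254.17, slope=-7628.84):
    """Predicted salary from the coefficients reported in Table 3."""
    return intercept + slope * tch_rating

# Prediction at the mean teaching rating from Table 1 (9.74)
print(round(predict_salary(9.74), 2))
```

Because a least-squares line passes through the point of means, the prediction at the mean teaching rating should land close to the mean salary of 61955.79 in Table 1, up to rounding of the reported coefficients.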

Table 3: Simple linear regression parameters (t or F statistics)

Constant            136254.17 (7.587 ***)
Teaching rating     -7628.84 (-4.147 ***)
F (1, 618)          17.20 ***
Durbin-Watson       0.067
R-squared           0.027
No. observations    619

*** indicates a t or F statistic that is significant at the 99% confidence level (p < 0.01)

Residuals

A plot of the residuals versus the predicted values of the dependent variable (salary) was produced, with a normal curve superimposed for clear visual representation. The output, shown in Figure 5, indicates no problems with the assumptions that the residuals are normally distributed at each level of the dependent variable and have constant variance across levels of this variable. On this basis, the results of the regression procedure presented in Table 3 and their interpretations are statistically valid.

Figure 5: Plot of regression standardized residuals against the residuals
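The regression tables in this report also list a Durbin-Watson statistic, which is computed from the residuals taken in observation order. The sketch below shows the standard formula; the function name and both residual series are hypothetical. The statistic lies between 0 and 4, with values near 2 indicating no first-order autocorrelation, values toward 0 indicating positive autocorrelation and values toward 4 indicating negative autocorrelation.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences over the sum of squared residuals."""
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residual series, for illustration only
alternating = [1, -1, 1, -1, 1, -1]   # sign flips every step
drifting = [1, 2, 3, 4, 5, 6]         # neighbours are very similar

print(round(durbin_watson(alternating), 3))
print(round(durbin_watson(drifting), 3))
```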

Multiple linear regression

To test the hypothesis that all the slopes are equal to zero, that is, that none of the independent variables influences the dependent variable, a multiple linear regression procedure was run. Tests for multicollinearity did not reveal any case of high correlation among the independent variables, since all values of the variance inflation factor (VIF) were below the set threshold of 10. As a result, there is no reason to be concerned about multicollinearity in this multiple regression procedure.
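The VIF check described above can be sketched as follows. The VIF of each predictor equals the corresponding diagonal entry of the inverse of the predictors' correlation matrix, so the sketch builds that matrix and inverts it with Gauss-Jordan elimination. All function names and the sample columns are hypothetical; the threshold of 10 is the one stated in the text.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * c for v, c in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF per predictor: diagonal of the inverse correlation matrix."""
    k = len(columns)
    corr = [[pearson_r(columns[i], columns[j]) for j in range(k)]
            for i in range(k)]
    inverse = invert(corr)
    return [inverse[i][i] for i in range(k)]

# Hypothetical predictor columns, for illustration only
predictors = {
    "age": [30, 35, 40, 45, 50, 55],
    "yrs_dept": [2, 8, 9, 16, 18, 27],
    "tch_rating": [9.5, 9.9, 9.1, 9.8, 9.3, 9.6],
}
for name, value in zip(predictors, vif(list(predictors.values()))):
    print(name, round(value, 2), "high" if value > 10 else "ok")
```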

Similarly, a plot of the residuals versus the predicted values of the dependent variable (salary) was produced, with a normal curve superimposed for clear visual representation. The output, shown in Figure 6, indicates no problems with the assumptions that the residuals are normally distributed at each level of the dependent variable and have constant variance across levels of this variable. Since the normality assumption has been satisfied, the results of the regression procedure presented in Table 4 and their interpretations are statistically valid.

Figure 6: Plot of regression standardized residuals against the residuals

The results of this regression analysis (see Table 4) show that the independent variables included in the model account for a large percentage, 77.3% (R-squared = 0.773), of the variance in the dependent variable. The results of the ANOVA table (not shown) indicate that the model is statistically significant at the 99% confidence level, F(10, 618) = 207.08, p < 0.01. The null hypothesis that the slopes are all equal to zero is thus rejected in favour of the alternative hypothesis that at least one of the slopes is different from zero, that is, there is at least one independent variable that affects the dependent variable in the model.
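The overall F statistic is tied to R-squared, which gives a quick consistency check on the reported figures. The sketch below uses the standard relation F = (R² / k) / ((1 - R²) / (n - k - 1)), assuming the usual degrees of freedom for k predictors and n observations; the function name is hypothetical, and R-squared and n are taken from Tables 3 and 4.

```python
def f_from_r_squared(r_squared, n_obs, n_predictors):
    """Overall F statistic implied by R-squared, n and the number of predictors."""
    df_model = n_predictors
    df_resid = n_obs - n_predictors - 1
    return (r_squared / df_model) / ((1 - r_squared) / df_resid)

f_multiple = f_from_r_squared(0.773, 619, 10)  # Table 4 values
f_simple = f_from_r_squared(0.027, 619, 1)     # Table 3 values
print(round(f_multiple, 2), round(f_simple, 2))
```

Both results come out close to the reported 207.08 and 17.20; small differences are expected because R-squared is only reported to three decimal places.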

Using the t statistic, the results confirmed that the independent variables level, tch_rating, minority, degree_tier and pub_prestige were statistically significant predictors of the dependent variable at the 99% confidence level. The other variables were not statistically significant at this level.

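The best fitting model for predicting salary from the independent variables, using the standardized coefficients in Table 4, is therefore:

salary = -0.015 age + 0.020 years in department - 0.031 sex + 0.459 level - 0.151 tch_rating + 0.104 minority - 0.055 admin_resp - 0.014 science + 0.147 degree_tier + 0.533 pub_prestige

The model lacks a constant term because the standardized coefficients were used, since the scales of the independent variables differ (interval and ordinal scales). All variables in the equation are therefore expressed in standardized units.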
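The link between unstandardized and standardized coefficients can be illustrated with figures already in the report. A standardized coefficient is the unstandardized slope multiplied by the ratio of the predictor's standard deviation to that of the dependent variable. The sketch below applies this to the simple regression of salary on teaching rating, using the slope from Table 3 and the standard deviations from Table 1; the function name is hypothetical.

```python
def standardized_beta(b, sd_x, sd_y):
    """Convert an unstandardized slope b into a standardized coefficient."""
    return b * sd_x / sd_y

# Slope from Table 3, standard deviations from Table 1
beta = standardized_beta(-7628.84, 0.67, 30858.05)
print(round(beta, 3), round(beta ** 2, 3))
```

In a one-predictor model the standardized slope equals the correlation coefficient, so its square should agree with the R-squared of 0.027 reported in Table 3.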

Table 4: Multiple linear regression parameters (t or F statistics)

Age                                -0.015 (0.437)
Years in department                0.020 (0.578)
Sex                                -0.031 (-1.561)
Level                              0.459 (15.761 ***)
Teaching rating                    -0.151 (-7.775 ***)
Minority                           0.104 (5.132 ***)
Administrative responsibilities    -0.055 (-2.367)
Science                            -0.014 (-0.660)
Degree tier university             0.147 (6.850 ***)
Publication prestige               0.533 (20.179 ***)
F (10, 618)                        207.08 ***
Durbin-Watson                      0.583
R-squared                          0.773
No. observations                   619

*** indicates a t or F statistic that is significant at the 99% confidence level (p < 0.01)

Stepwise regression

In order to choose the best predictor variables from those included in the multiple linear regression model above, a stepwise regression procedure was carried out. This procedure runs the multiple regression analysis a number of times, each time removing the most weakly correlated variable. In this analysis, a total of six iterations were carried out, producing a model with six independent variables, namely pub_prestige, level, tch_rating, degree_tier, minority and admin_resp. To test the assumption of normality, a plot of the dependent variable and residuals was produced. This is presented in Figure 7, and the results confirm that this assumption has been met. Similarly, the variance inflation factor (VIF) was used to check for multicollinearity. The results did not show any case of multicollinearity among the independent variables, implying that none of the variables were highly correlated.
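The idea of repeatedly dropping the weakest variable can be sketched as a backward-elimination loop. This is an illustration under stated assumptions, not the procedure the statistical package actually ran: the removal threshold of |t| >= 1.96 is assumed, the function names are hypothetical, the short labels age, years_in_dept, sex and science are placeholders, and the refit step is replaced by a static lookup of the t statistics in Table 4 (a real run would re-estimate the model after every removal).

```python
def backward_eliminate(predictors, t_stats_for, threshold=1.96):
    """Drop the predictor with the smallest |t| until all reach the threshold."""
    kept = list(predictors)
    while kept:
        t_stats = t_stats_for(kept)  # stands in for refitting on the current set
        weakest = min(kept, key=lambda name: abs(t_stats[name]))
        if abs(t_stats[weakest]) >= threshold:
            break
        kept.remove(weakest)
    return kept

# t statistics from Table 4, used as a static stand-in for real refits
TABLE_4_T = {
    "age": 0.437, "years_in_dept": 0.578, "sex": -1.561, "level": 15.761,
    "tch_rating": -7.775, "minority": 5.132, "admin_resp": -2.367,
    "science": -0.660, "degree_tier": 6.850, "pub_prestige": 20.179,
}

def lookup_t(names):
    """Return the Table 4 t statistic for each predictor still in the model."""
    return {name: TABLE_4_T[name] for name in names}

selected = backward_eliminate(list(TABLE_4_T), lookup_t)
print(selected)
```

Under these assumptions the loop retains the same six predictors named in the text above.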

Table 5: Stepwise linear regression parameters (t statistics)

Publication prestige               0.539 (21.695 ***)
Level                              0.468 (19.605 ***)
Teaching rating                    -0.151 (-7.809 ***)
Degree tier university             0.148 (7.091 ***)
Minority                           0.102 (5.142 ***)
Administrative responsibilities    -0.052 (-2.247)
Durbin-Watson                      0.572
No. observations                   619

*** indicates a t statistic that is significant at the 99% confidence level (p < 0.01)

Since the predictor variables have different measurement scales (interval or ordinal), the standardized coefficients are used as regression coefficients instead of the unstandardized coefficients, and the model therefore does not include a constant term. The value of the dependent variable (salary) can be predicted from the six independent variables included in the model by using the relation below, in which all variables are in standardized units.

salary = 0.539 pub_prestige + 0.468 level - 0.151 tch_rating + 0.148 degree_tier + 0.102 minority - 0.052 admin_resp

All the predictors in the model above except admin_resp are statistically significant at the 99% confidence level, as can be seen from Table 5. This reflects the stepwise selection method that was adopted, together with the fairly large sample.

The residuals plot used for testing the assumption of normality for the residuals is presented in Figure 7.

Figure 7: Plot of regression standardized residuals against the residuals