Course Project
Contents
Research Questions & Hypotheses
Introduction & Methodology
  Research Design
  Sample
  Business Understanding
Results
  Descriptive Statistics
  Normality
  Correlation
  Multiple Linear Regression Model-1
  Multiple Linear Regression Model-2
  Multiple Linear Regression Model-3
Discussion
Limitations
Recommendations
Research Questions & Hypotheses
The purpose of this research study is to identify the determinants of Ph.D. completion among
candidates, including GPA and other selector metrics. To fulfill this purpose, the following
research questions have been identified:
1. What are the determinants of Ph.D. completion among successful candidates?
2. Is GPA a suitable factor for determining Ph.D. completion among candidates?
Based on these research questions, the following hypotheses have been devised:
H1 (Hypothesis 1): GPA and GRE scores determine Ph.D. completion among candidates.
H2 (Hypothesis 2): Letters of Recommendation and Student Motivation determine Ph.D.
completion among candidates.
H3 (Hypothesis 3): Age and Gender determine Ph.D. completion among candidates.
H4 (Hypothesis 4): Emotional Stability, Financial Resources, Hostility, Social Abilities, and the
Mean Rating of the Selectors' Impression of the Applicant determine Ph.D. completion among
candidates.
Introduction & Methodology
Research Design
The methodology of this research study follows a quantitative research design grounded in a
critical realist approach. Because the study examines a real-world sample, a realist approach was
adopted to explain the determinants of Ph.D. completion among candidates. The study can also be
called deductive, since it derives generalizations from the results of the detailed analysis
presented later in this report. As mentioned above, the research strategy is quantitative and
largely descriptive, so that the determinants or predictors of Ph.D. completion among candidates
can be discussed in detail. A cross-sectional study was chosen over a longitudinal one to avoid
discrepancies such as differences in grading plans.
For data collection, a survey based on a structured questionnaire was distributed to the chosen
sample and was filled in by the students over time. The survey covered 18 variables, including
gender, age, GPA, GRE scores, letters of recommendation, motivation, stability, financial
funding, marital status, social skills, hostility, and impression. Among these variables, Ph.D.
completion is the dependent variable, whereas GPA, GRE Scores, Motivation, Stability, Financial
Resources, Hostility, Impression, Letter of Recommendation, and Social Abilities have been
identified as independent variables. The rating variables were measured on a 9-point hedonic
scale ranging from extremely low to extremely high, while GPA and GRE scores were recorded on
their own scales (see Table 5).
Sample
Since the study concerns Ph.D. completion, 100 Ph.D. candidates from higher-degree institutes
were selected for the sample. Each candidate was asked about the completion status of their
degree. The sample was chosen by convenience sampling, which allowed the researcher to select
candidates who were within easy reach and available, reducing time and cost.
Business Understanding
For data analysis, a multiple linear regression model was fitted using the SPSS software, to
avoid human error in the analysis of the results. As mentioned above, Ph.D. completion is the
dependent variable, whereas GPA, GRE Scores, Gender, Age at Entry, Motivation, Stability,
Financial Resources, Hostility, Impression, Letter of Recommendation, and Social Abilities have
been identified as independent variables.
Results
Descriptive Statistics
A total of 100 Ph.D. candidates took part in this research study, forming the small sample
considered. Of this sample, 64% (64 candidates) were female and 36% (36 candidates) were male,
as displayed in Table 1.
Table 1. Gender
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Female        64        64.0        64.0              64.0
       Male          36        36.0        36.0             100.0
       Total        100       100.0       100.0
Similarly, most of the candidates were single (60%), whereas the remaining 40% of the sample
were married, as displayed in Table 2.
Table 2. Marital Status
                 Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Married       40        40.0        40.0              40.0
       Single        60        60.0        60.0             100.0
       Total        100       100.0       100.0
As a Ph.D. is a higher degree, the age bracket starts from 20 years of age. 76% of the sample
was between the ages of 20 and 30, followed by 20% aged 31-40, and only 4% were 41 or above.
Table 3 displays the categorical distribution of the sample by age.
Table 3. Age at Entry
                       Frequency   Percent   Valid Percent   Cumulative Percent
Valid  20-30               76        76.0        76.0              76.0
       31-40               20        20.0        20.0              96.0
       41 and Above         4         4.0         4.0             100.0
       Total              100       100.0       100.0
Table 4. Ph.D. completion status
                     Frequency   Percent   Valid Percent   Cumulative Percent
Valid  Completed         50        50.0        50.0              50.0
       Incomplete        50        50.0        50.0             100.0
       Total            100       100.0       100.0
Turning to the main variable under consideration, the sample is divided into Ph.D. candidates
who have completed their degree and those who have not. The two groups were kept equal in size
so that each is equally represented: 50% of the sample has completed the degree and 50% has
not, as displayed in Table 4.
Normality
Table 5. Descriptive Statistics
                                                                                    Skewness               Kurtosis
Variable                            N     Minimum  Maximum  Mean     Std. Deviation  Statistic  Std. Error  Statistic  Std. Error
Letter of Recommendation-1          100   4        9        6.94     1.324           -.101      .241        -.844      .478
Letter of Recommendation-2          100   4        9        7.00     1.303           -.168      .241        -.768      .478
Letter of Recommendation-3          100   4        9        7.06     1.324           -.219      .241        -.783      .478
Student Motivation                  100   6        9        7.82     .957            -.334      .241        -.848      .478
Emotional Stability                 100   4        9        6.38     1.668           .121       .241        -1.211     .478
Financial Resources                 100   3        9        5.78     1.685           .044       .241        -.754      .478
Interpersonal Skills                100   4        9        6.58     1.365           -.123      .241        -.639      .478
Hostility                           100   1        5        2.60     1.044           .217       .241        -.422      .478
Selectors Impression of Applicant   100   5        9        7.08     1.203           -.227      .241        -.970      .478
College GPA                         100   2.75     3.97     3.5130   .26591          -.567      .241        .278       .478
Major GPA                           100   3.20     4.00     3.7778   .19812          -.811      .241        .222       .478
GRE Specialty                       100   520      790      652.20   70.504          -.041      .241        -.889      .478
GRE Quantitative                    100   550      787      688.49   63.906          -.343      .241        -.899      .478
GRE Verbal                          100   470      780      631.80   71.596          -.051      .241        -.652      .478
Valid N (listwise)                  100
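Since almost all skewness and kurtosis statistics fall within +/-1, the data are treated as
approximately normal, as displayed above in Table 5. The one exception is the kurtosis of
Emotional Stability (-1.211), which lies slightly outside this range.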
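The +/-1 screening rule can be rechecked directly against the Table 5 figures. The sketch below is a minimal illustration, not the SPSS procedure itself: the function names are hypothetical, the limit of 1.0 is the rule of thumb used in this report, and the standard-error formulas are the usual sample-size-only expressions, which reproduce the .241 and .478 shown in Table 5 for N = 100.

```python
import math

# (skewness, kurtosis) statistics copied from Table 5
TABLE_5 = {
    "Letter of Recommendation-1": (-0.101, -0.844),
    "Letter of Recommendation-2": (-0.168, -0.768),
    "Letter of Recommendation-3": (-0.219, -0.783),
    "Student Motivation": (-0.334, -0.848),
    "Emotional Stability": (0.121, -1.211),
    "Financial Resources": (0.044, -0.754),
    "Interpersonal Skills": (-0.123, -0.639),
    "Hostility": (0.217, -0.422),
    "Selectors Impression of Applicant": (-0.227, -0.970),
    "College GPA": (-0.567, 0.278),
    "Major GPA": (-0.811, 0.222),
    "GRE Specialty": (-0.041, -0.889),
    "GRE Quantitative": (-0.343, -0.899),
    "GRE Verbal": (-0.051, -0.652),
}


def standard_errors(n):
    """Standard errors of sample skewness and excess kurtosis for sample size n."""
    se_skew = math.sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return se_skew, se_kurt


def outside_range(stats, limit=1.0):
    """Names of variables whose skewness or kurtosis exceeds the limit in absolute value."""
    return [
        name
        for name, (skew, kurt) in stats.items()
        if abs(skew) > limit or abs(kurt) > limit
    ]


se_skew, se_kurt = standard_errors(100)
flagged = outside_range(TABLE_5)
print(round(se_skew, 3), round(se_kurt, 3))
print(flagged)
```

On these figures the check singles out only the kurtosis of Emotional Stability (-1.211); every other statistic is inside the +/-1 band.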
Correlation
Table 6 shows that Ph.D. completion correlates significantly with the GRE Quantitative score
(r = -0.608), and with Letter of Recommendation-1, Letter of Recommendation-3, and Student
Motivation, with correlation coefficients of -0.592, -0.683, and -0.567 respectively. All of
these variables have a strong negative relationship with Ph.D. completion. College GPA, Letter
of Recommendation-2, and Age at Entry have a moderate negative relationship with Ph.D.
completion, with correlation coefficients of -0.433, -0.494, and -0.377 respectively. From these
correlation values, we can assume that these factors help determine a candidate's Ph.D.
completion. However, the results report no significant relationship with the Impression of the
Selector.
The correlation table also shows that several independent variables are highly correlated with
each other and have a strong positive relationship. This high correlation between independent
variables leads to a multicollinearity issue. The highly correlated independent variables are
as follows. College GPA and Major GPA have a correlation coefficient of 0.901. GRE Specialty and
GRE Verbal have a correlation of 0.984. GRE Quantitative correlates with Letter of
Recommendation-1 and Letter of Recommendation-3, with coefficients of 0.699 and 0.646, and with
Student Motivation, with a coefficient of 0.591. Letter of Recommendation-1 and Letter of
Recommendation-2 correlate with Student Motivation, with coefficients of 0.501 and 0.567
respectively. Letter of Recommendation-1 and Letter of Recommendation-3 have a correlation of
0.520.
Table 6. Correlation
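To make the multicollinearity screening concrete, the sketch below lists the predictor pairs reported in the text and flags those above a cutoff. The cutoff of 0.8 and the function names are assumptions made for illustration, not values taken from the SPSS output. The pairwise variance inflation factor, 1 / (1 - r^2), considers only two predictors at a time, so it understates the VIF that a full regression on all predictors would give.

```python
# Correlations between independent variables, as reported in the Correlation section
REPORTED_PAIRS = [
    ("College GPA", "Major GPA", 0.901),
    ("GRE Specialty", "GRE Verbal", 0.984),
    ("GRE Quantitative", "Letter of Recommendation-1", 0.699),
    ("GRE Quantitative", "Letter of Recommendation-3", 0.646),
    ("GRE Quantitative", "Student Motivation", 0.591),
    ("Letter of Recommendation-1", "Student Motivation", 0.501),
    ("Letter of Recommendation-2", "Student Motivation", 0.567),
    ("Letter of Recommendation-1", "Letter of Recommendation-3", 0.520),
]


def pairwise_vif(r):
    """Variance inflation factor implied by a single pairwise correlation r."""
    return 1.0 / (1.0 - r * r)


def flag_collinear(pairs, threshold=0.8):
    """Return (var_a, var_b, r, pairwise VIF) for pairs with |r| >= threshold."""
    return [
        (a, b, r, pairwise_vif(r))
        for a, b, r in pairs
        if abs(r) >= threshold
    ]


flagged = flag_collinear(REPORTED_PAIRS)
for a, b, r, vif in flagged:
    print(f"{a} / {b}: r = {r:.3f}, pairwise VIF = {vif:.2f}")
```

With this cutoff, only the two GPA measures and the GRE Specialty and GRE Verbal scores are flagged; the remaining pairs are moderately correlated.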
Multiple Linear Regression Model-1
From the Table 7 model summary, we observe R = 0.850, which is the correlation coefficient of
the overall model with the dependent variable, Ph.D. completion. R square = 0.722, also known as
the coefficient of determination, indicates that 72.2% of the variance in Ph.D. completion can
be predicted from the independent variables. Both the R and R square values are high, from which
we can assume that the model may be a good fit.
From the ANOVA (Analysis of Variance) table, the significance value of the overall model is
p < 0.001. Since this is less than alpha = 0.05, the overall model is statistically significant.
Table 7. Model 1
As we have 100 observations, the degrees of freedom are taken as df = number of observations - 1
= 99. At the 95% confidence level (alpha = 0.05, two-tailed) with df = 99, the critical t-value
is ±1.984. Strictly, coefficient t-tests use the residual degrees of freedom, n - k - 1, where k
is the number of predictors; with n = 100 this changes the critical value only slightly (to
about ±1.99), so ±1.984 is used as the cutoff throughout. Based on the corresponding t-values
and p-values, we can tell which independent variables are useful for the model. An independent
variable is retained when its calculated t-value falls outside the interval of the critical
t-value (that is, |t| > 1.984) and its p-value is below 0.05.
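The critical value quoted above can be reproduced without statistical tables. The following is a minimal standard-library sketch, assuming a two-tailed test at alpha = 0.05; it integrates the Student's t density numerically with Simpson's rule and finds the quantile by bisection. The function names are hypothetical, and the t-values passed to `is_significant` are illustrative inputs, not figures from the SPSS output.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    norm = math.exp(log_norm) / math.sqrt(df * math.pi)
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x), integrating the density from 0 to |x| with Simpson's rule."""
    upper = abs(x)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_critical(df, alpha=0.05):
    """Two-tailed critical t-value, found by bisection on the CDF."""
    target = 1 - alpha / 2
    lo, hi = 0.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def is_significant(t_value, df, alpha=0.05):
    """True when the calculated t-value lies outside the critical interval."""
    return abs(t_value) > t_critical(df, alpha)


critical_99 = t_critical(99)
print(round(critical_99, 3))      # approximately 1.984, the cutoff used in the text
print(is_significant(2.5, 99))    # illustrative t-value outside the interval
print(is_significant(-1.5, 99))   # illustrative t-value inside the interval
```

Calling `t_critical` with the residual degrees of freedom instead of 99 shows how little the cutoff moves for a sample of this size.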
Applying this rule, we find that the independent variables College GPA, Major GPA, GRE
Specialty, GRE Verbal, GRE Quantitative, Letter of Recommendation-1, Letter of
Recommendation-2, Emotional Stability, Marital Status, Interpersonal Skills, Hostility, and
Selectors' Impression of Applicant have p-values greater than 0.05 and calculated t-values that
fall inside the interval of ±1.98, the critical t-value. These variables are therefore not
statistically significant and may overfit the model. They need to be dropped, and the regression
model rerun without the variables that fail the statistical significance criterion.
From the histogram and the normal P-P plot, we can infer that the residuals (errors) are not
normally distributed: the residuals do not fall along the diagonal line in the normal P-P plot,
and their distribution in the histogram does not look normal either. This violates the normally
distributed errors assumption.
Homoscedasticity, independent errors, linearity, and normally distributed errors assumptions:
The scatter plot shown above addresses the linearity of the multiple regression, with the
standardized predicted values on the x-axis and the standardized residuals on the y-axis, both
centered around zero (0). The dots form a pattern rather than being randomly scattered, which
shows that successive residuals are correlated and that the errors are not normally
distributed. The linearity, independent errors, and normally distributed errors assumptions are
therefore violated. The pattern may also indicate that the variance of the residuals is not
constant, which violates the homoscedasticity assumption.
Multiple Linear Regression Model-2
The following linear regression model is obtained by dropping the non-significant independent
variables from regression Model-1, based on their p-values and t-values.
Table 8 Correlations
From the above correlation table, we can infer that the independent variables are only weakly
related to each other, indicating that there is no multicollinearity issue. Letter of
Recommendation-3 and Student Motivation are highly correlated with Ph.D. completion and exhibit
a strong negative relationship.
Linear Regression:
From the Table 8 model summary, we observe the correlation coefficient R = 0.786 and the
coefficient of determination R square = 0.618. Both are lower than in regression Model-1, yet
the model is still a good fit, as 61.8% of the variance in Ph.D. completion can be predicted
from the independent variables. The significance value is unchanged, so the overall model is
still statistically significant, rejecting the null hypothesis with p < 0.001 from the ANOVA
table.
Table 8 Model 2.
The critical t-value remains the same as obtained earlier (±1.98), since df = 99. The same
conditions are applied, comparing the calculated t-values against the critical t-value and
checking the corresponding p-values, to decide which independent variables need to be dropped.
On that basis, we conclude that the Financial Resources independent variable needs to be
dropped. Its t-value of -1.732 falls inside the interval of the critical t-value, and the
corresponding p-value of 0.086 > 0.05 indicates that the variable is not statistically
significant.
From the histogram and the normal P-P plot of the residuals, we can see that the residuals are
approximately normally distributed in the histogram and lie along the diagonal line in the P-P
plot, which is an improvement after dropping the variables. Hence, we may assume that the
normally distributed errors assumption holds.
The scatter plot of the standardized predicted values against the residuals is an improvement
over the Model-1 plot. It still needs to be improved, however, because the residuals are not
randomly scattered, which violates the homoscedasticity, linearity, and independent errors
assumptions.
Multiple Linear Regression Model-3
We now consider a third regression model, obtained by dropping the Financial Resources
independent variable, which was not statistically significant in Model-2.
Correlation
Table 9 Correlations
From the above correlation table, we can infer that Letter of Recommendation-3 and Student
Motivation are highly correlated with Ph.D. completion, exhibiting a strong negative
relationship, and that there is no violation of the multicollinearity assumption.
Linear Regression Model-3:
From the Table 9 model summary, we observe the correlation coefficient R = 0.778 and the
coefficient of determination R square = 0.606, which indicates a good fit, as 60.6% of the
variance in Ph.D. completion can be predicted from the independent variables. The significance
value is unchanged, so the overall model is still statistically significant, rejecting the null
hypothesis with p < 0.001 from the ANOVA table. The F-value also compares well with those of
Model-1 and Model-2.
Table 9 Model-3
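The link between R square and the overall F-test can be illustrated with the Model-3 figures. The sketch below assumes n = 100 observations and k = 4 predictors (Gender, Letter of Recommendation-3, Student Motivation, and Age at Entry) and starts from the rounded R square of 0.606, so the derived values are approximations that may differ slightly from the SPSS output. The function names are hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R square for n observations and k predictors."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


def f_statistic(r_squared, n, k):
    """Overall F statistic implied by R square, with (k, n - k - 1) degrees of freedom."""
    explained_per_predictor = r_squared / k
    unexplained_per_df = (1 - r_squared) / (n - k - 1)
    return explained_per_predictor / unexplained_per_df


# Assumed inputs: rounded R square from the Model-3 summary, n = 100, k = 4
n_obs, n_predictors, r2_model3 = 100, 4, 0.606

adj_r2 = adjusted_r_squared(r2_model3, n_obs, n_predictors)
f_value = f_statistic(r2_model3, n_obs, n_predictors)
print(f"adjusted R square = {adj_r2:.3f}")
print(f"F({n_predictors}, {n_obs - n_predictors - 1}) = {f_value:.2f}")
```

Because the adjusted R square penalizes each extra predictor, it is a fairer basis than R square alone for comparing Model-3 with the larger Model-1 and Model-2.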
From the coefficients table:
Gender: with a coefficient of 0.203, t-value = 2.923, p = 0.04 < 0.05, it is a statistically
significant independent variable.
Letter of Recommendation-3: with a coefficient of -0.194, t-value = -6.926, p = 0.001 < 0.05, it
is a statistically significant independent variable.
Student Motivation: with a coefficient of -0.130, t-value = -3.291, p = 0.001 < 0.05, it is a
statistically significant independent variable.
Age at Entry: with a coefficient of -0.169, t-value = -2.609, p = 0.011 < 0.05, it is a
statistically significant independent variable.
Since all four independent variables are statistically significant, we can now write the
regression equation that represents the factors influencing Ph.D. completion.
PhD = 3.827 + 0.203*Gender - 0.194*Letter of Recommendation-3 - 0.130*Student Motivation -
0.169*Age at Entry.
From the regression equation, holding the other predictors constant, we can interpret that:
Gender: a one-unit increase in the Gender code is associated with an increase of 0.203 in the
predicted Ph.D. completion value.
Letter of Recommendation-3: a one-unit increase in the Letter of Recommendation-3 rating is
associated with a decrease of 0.194.
Student Motivation: a one-unit increase in the Student Motivation rating is associated with a
decrease of 0.130.
Age at Entry: a one-unit increase in the Age at Entry category is associated with a decrease of
0.169.
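The fitted equation can be evaluated for a given candidate as in the sketch below, which uses the intercept of 3.827 and the coefficients listed in the coefficients table (0.203 for Gender, -0.194 for Letter of Recommendation-3, -0.130 for Student Motivation, and -0.169 for Age at Entry). The numeric coding of Gender, Age at Entry, and the completion variable is not stated in this report, so the input values and the dictionary keys are hypothetical.

```python
INTERCEPT = 3.827

# Unstandardized coefficients from the Model-3 coefficients table
COEFFICIENTS = {
    "gender": 0.203,
    "letter_of_recommendation_3": -0.194,
    "student_motivation": -0.130,
    "age_at_entry": -0.169,
}


def predict_phd(values):
    """Evaluate the Model-3 regression equation for one candidate."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * values[name] for name in COEFFICIENTS
    )


# Hypothetical candidate; the category codes are assumptions for illustration only
candidate = {
    "gender": 1,
    "letter_of_recommendation_3": 7,
    "student_motivation": 8,
    "age_at_entry": 1,
}
stronger_letter = dict(candidate, letter_of_recommendation_3=8)

baseline = predict_phd(candidate)
shifted = predict_phd(stronger_letter)
print(round(baseline, 3))
print(round(shifted - baseline, 3))  # equals the Letter of Recommendation-3 coefficient
```

Raising one predictor by a single unit while holding the others fixed changes the prediction by exactly that predictor's coefficient, which is how the interpretation of each coefficient should be read.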
Compared to the Model-2 residual graphs, the Model-3 residual plots are far better. From the
histogram and the normal P-P plot of the residuals, we can see that the residuals are
approximately normally distributed in the histogram and lie along the diagonal line in the P-P
plot, which is an improvement after dropping the variable. Hence, we may assume that the
normally distributed errors assumption holds.
The scatter plot of the standardized predicted values against the residuals is an improvement
over the Model-2 plot. It still needs to be improved, however, because a few dots remain
clustered close to each other. Since the residuals are not randomly scattered, the
homoscedasticity, linearity, and independent errors assumptions are still violated.
Discussion
Based on the results of the regression analysis, Student Motivation, Letter of
Recommendation-3, Gender, and Age at Entry can each act as a determinant or predictor of Ph.D.
completion among candidates. Student Motivation and Letter of Recommendation-3 are strongly
correlated with Ph.D. completion and exhibit a strong negative relationship, with correlation
coefficients of -0.567 and -0.683 respectively. Student Motivation can be regarded as a strong
predictor because it is an individual's own willpower that enables them to reach greater
heights and to stay on the path towards the Ph.D. degree. Letters of recommendation, in turn,
provide information about the caliber and capabilities of the individual, and are therefore a
strong predictor of whether the candidate can complete the Ph.D. degree in time. Age at Entry
and Gender are less strongly correlated with Ph.D. completion: age shows a negative correlation
of -0.377, while gender has a positive correlation of 0.250. All four independent variables,
Student Motivation, Letter of Recommendation-3, Gender, and Age at Entry, are individually
statistically significant. Of the four hypotheses derived from the research questions, H2 and
H3 are supported by the regression analysis of Model-3.
Limitations
In terms of limitations, only a small sample was considered to identify whether GPA, GRE
scores, and the interpersonal abilities of an individual are significant predictors of Ph.D.
completion among candidates. Such a small sample cannot truly represent an entire country; at
best it is comparable to a small locality. Moreover, only 100 candidates were selected within
the locality, which is too few to be representative of a large country.
Recommendations
The sample can further be separated into two groups, i.e., people who completed the Ph.D. and
people who did not. Using these two groups, the determinants can be examined with an
independent-samples t-test, which would show whether the groups differ statistically on each
predictor. Moreover, additional predictors can be examined in future studies.
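As a rough illustration of the recommended comparison, the sketch below computes an independent-samples t statistic for one predictor across the two groups. It uses Welch's unequal-variance form of the test, which is a choice made here rather than something specified in this report, and the two lists of ratings are hypothetical placeholders, not data from the study.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error contributed by each group (sample variance / group size)
    se2_a = variance(group_a) / n_a
    se2_b = variance(group_b) / n_b
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1))
    return t, df


# Hypothetical motivation ratings for completers and non-completers
completed = [8, 9, 7, 8, 9, 8]
not_completed = [7, 6, 8, 7, 6, 7]

t_value, df = welch_t(completed, not_completed)
print(f"t = {t_value:.3f}, df = {df:.1f}")
```

The resulting t statistic would then be compared with the critical t-value for the computed degrees of freedom, in the same way as the coefficient t-values in the regression models.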