
SAMPLE PROJECT

PART 1

Research question: Do the factors course attendance, class, Test1 score, and sex predict Test2 score in Math1150?

Variables:

1) Course attendance: number of classes missed, with ½ hour or more missed counted as 1/2 of an absence = numeric; AND also level of missing (m = days missed): 0 ≤ m ≤ 1 day LOW, 1 < m < 4 MED, m ≥ 4 HIGH = categorical

2) Class = categorical (class1, class2, class3)

3) Score on Test 1 = numeric and categorical: 75%+ Pass, under 75% Not Pass

4) Sex = categorical

5) Test 2 grade = numeric and categorical: 75%+ Pass, under 75% Not Pass
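The categorical versions of the numeric variables can be derived in R instead of being coded by hand. This is a minimal sketch, not part of the original project code: the helper names `absence_level` and `pass_status` are hypothetical, and it assumes that exactly 1 day missed counts as LOW and exactly 4 days as HIGH, as in the coded data in the Appendix. The example values are the class 3 absences and Test1 scores from the Appendix.

```r
# Hypothetical helpers (not part of the original project code).
# Assumption: exactly 1 day missed counts as Low, exactly 4 days as High.
absence_level <- function(days) {
  lev <- ifelse(days <= 1, "Low", ifelse(days < 4, "Med", "High"))
  factor(lev, levels = c("Low", "Med", "High"))
}

# Pass = 75% or more, No = under 75%
pass_status <- function(score) {
  factor(ifelse(score >= 75, "Pass", "No"), levels = c("Pass", "No"))
}

days3 <- c(1, 3.5, 1, 1, 0, 1, 2, 10.5, 3)              # days absent, class 3
T1_3  <- c(90, 76, 95, 86.5, 78, 79.5, 83, 33.5, 75.5)  # Test1 scores, class 3

abs_lev3 <- absence_level(days3)
pass3    <- pass_status(T1_3)
table(abs_lev3)   # counts per absence level
table(pass3)      # counts per passing status
```

Deriving the levels from the numeric column keeps the two versions of each variable consistent if a score or an absence count is later corrected.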

Type of study: Observational. The data will be obtained from grade records: no random group assignment, no treatment.

Population of interest: Math1150 students

Sampling method: convenience sample (bad!) of Jane Gringauz’s classes. The reason: this is just a class demo.

Predict the result: Test 1 and attendance should be good predictors of Test 2 scores.

PART 2

SAMPLING METHOD

A convenience sample is used this time. In the future, when 4 classes’ worth of records have been collected, stratified sampling could be used: pick an SRS of n=7 or 8 students from each class, for a total of n=30. Each class would be a stratum. In that case, the effect of class on Test 2 would not be assessed.
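The stratified plan can be sketched in R as follows. This is only an illustration under stated assumptions: the roster, its class sizes, the seed, and the split of 30 into 8 + 8 + 7 + 7 are all made up for the example.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical roster: 4 classes with made-up sizes (90 students in all)
roster <- data.frame(id = 1:90,
                     class = rep(1:4, times = c(25, 20, 22, 23)))

# Assumed allocation: 7 or 8 students per class, 8 + 8 + 7 + 7 = 30
n_per_class <- c("1" = 8, "2" = 8, "3" = 7, "4" = 7)

strata <- split(roster, roster$class)      # one data frame per stratum
picked <- lapply(names(strata), function(k) {
  s <- strata[[k]]
  s[sample(nrow(s), n_per_class[[k]]), ]   # SRS without replacement within the stratum
})
srs <- do.call(rbind, picked)

table(srs$class)   # number of students sampled from each class
```

Because every class contributes a fixed number of students by design, class is no longer a variable whose effect the sample can estimate freely, which is why the plan says the class effect would not be assessed.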

HOW DATA WILL BE OBTAINED

The paper records of the classes included in the study will be searched to fill out the following table (made-up data inserted as a demo).

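Column groups: Class through Test1 Pass status = PREDICTORS OR EXPLANATORY VARIABLES (i.e. factors); Score Test2 and Test2 Pass status = Outcome

Person (no name) | Class | Sex | Number days absent per semester | Absence level | Score Test1 (%) | Test1 Pass status | Score Test2 | Test2 Pass status

1 | 1 | F | 1 | L | 79 | | 86 |

2 | 1 | M | 5 | H | 66 | | 71 |

27 | 2 | M | 2.5 | M | 95 | | 93 |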

WHAT I THINK THE RESULTS WILL BE:

It would be better to use “before Test 2” attendance rather than course attendance, but that information would be harder to obtain. Attendance is probably still a reasonably good predictor of Test2 scores. The classes are possibly not all the same, and Test1 scores should predict Test 2 scores pretty well.

PILOT STUDY: A pilot has been done with one and then with two classes. The pilots showed that the study design is reasonable. Additional variables, such as homework and quiz scores, would have been useful in obtaining a better model.

PART 3

Display of data: I will make 1) barplots for Passing status of Test1 and Test2, and for absence level, all for the 3 classes [for this Project, barplots are optional: ONLY USEFUL IF YOU START WITH 5+ CATEGORIES, say letter grades, THEN COMBINE THEM into P and F]; 2) histograms of Test1 and of Test2 scores for each class; 3) side-by-side boxplots of T1 and of T2 for all 3 classes; 4) a scatterplot of T2 vs. T1 and a scatterplot of T2 vs. days absent, both for classes combined, to determine whether a linear model is reasonable; and 5) qqnorm and qqline plots of Test2 scores to check whether they are normally distributed.

Linear models:

I will fit a linear model that predicts Test2 score from all four predictors (class, days absent, Test1 score, and sex). I will then use backward selection to remove variables, and also fit interactions, to maximize R2 and to minimize the p-value as much as possible.
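Written out, the starting model would presumably look like the following. This is a sketch inferred from the four predictors named in the research question; the symbols $\beta_j$, $\varepsilon_i$ and the 0/1 indicator notation are assumptions for illustration, not notation taken from the project.

```latex
\text{Test2}_i = \beta_0
  + \beta_1\,\mathbf{1}[\text{class}_i = 2]
  + \beta_2\,\mathbf{1}[\text{class}_i = 3]
  + \beta_3\,\text{daysabsent}_i
  + \beta_4\,\text{Test1}_i
  + \beta_5\,\mathbf{1}[\text{sex}_i = \text{m}]
  + \varepsilon_i
```

Class 1 and sex f act as the reference levels, so $\beta_1$, $\beta_2$ and $\beta_5$ are differences from those baselines. Backward selection then drops terms from this full model one at a time.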

Chi squared tests of association:

I will conduct the following chi squared tests: to determine whether there is a significant association between sex (M, F, O) and passing status of Test2 (Pass, No); between passing status of Test1 (Pass, No) and passing status of Test2 (Pass, No); and between absence status (Low, Med, High) and passing status of Test2 (Pass, No).

##############################################################

USUALLY, THE APPENDIX IS PUT AT THE END OF A PAPER. I AM PUTTING IT AFTER PART 3 TO SHOW YOU THE PROCESS OF obtaining results TO WRITE UP in Part 4. THE APPENDIX = DO what you said in PART 3!

##############################################################

APPENDIX WITH ALL THE CODE

In research, only the code associated with the best of the models tested would be put in the Appendix. For the class, though, put all your code (copy from R): it saves time.

1) #Enter data into R via arrays. Then you can make a data frame. Or enter data into Excel and upload into R (see orange packet).

C1=rep(1,22) #to enter the class number for all 43 students, I can make R repeat

C2=rep(2,12) #for each class the class number, in this case, 2, twelve times

C3=rep(3,9)

class=c(C1,C2,C3);length(class) #made one array called “class”

class=factor(class) # to let R know that classes 1,2,3 are not numbers

#data for class 1

daysabs.1=c(2,2,0,6,2,1,3.5,3,0,11.5,0,8,0,6.5,2,2,9,7,4,5,2,13);length(daysabs.1)#22

abs_status.1=c(2,2,1,3,2,1,2,2,1,3,1,3,1,3,2,2,3,3,3,3,2,3);length(abs_status.1)

#Smart: 1=Low, 2=Med, 3=High.

#Then change to categorical using "factor" command

T1.1=c(91,87.5,86,92,88.5,82.5,93,82,86,63,80.5,80,90.5,87,96,84.5,92.5,70.5,83.5,83,72.5,90.5);length(T1.1)

PassT1.1=c(1,1,1,1,1,1,1,1,1,1,0,1,1,1,1,1,1,1,0,1,1,0,1);length(PassT1.1)# OOPS 23

PassT1.1=PassT1.1[-1] #remove the 1st entry into the array

length(PassT1.1) #OK now: no extra 1, n=22, but are the 0s in the right place?

order(PassT1.1)

#[1] 10 18 21 1 2 3 4 5 6 7 8 9 11 12 13 14 15 16 17 19 20 22

#order() lists the positions of the sorted values: the three 0s are at

#positions 10, 18 and 21, followed by the positions of the 1s. OK!

T2.1=c(65,88.5,85.5,97,81.5,74,93,94.5,76.5,31,94.5,33,71,57,72,94,83.5,67,61.5,70.5,52,76)

PassT2.1=c(0,1,1,1,1,0,1,1,1,0,1,0,0,0,0,1,1,0,0,0,0,1);length(PassT2.1)

sex.1=c("f","f","f","f","m","m","m","f","m","m","f","f","m","f","f","m","f","m","m","m","f","m") ; length(sex.1)

#Not smart: coding sex as 0,1 would be much better: binary, treated as categorical

#data for class 2

daysabs.2=c(2,5,3,6,1,0,3,0,4,2,3,5);length(daysabs.2) #12 students

abs_status.2=c(2,3,2,3,1,1,2,1,3,2,2,3);length(abs_status.2)

T1.2=c(82,84,56,86.5,97,81,90,88,91,83,90.5,86);length(T1.2)

PassT1.2=c(1,1,0,1,1,1,1,1,1,1,1,1);length(PassT1.2)

T2.2=c(58.5,59,46,72.5,94.5,93,87,86,87.5,95,79,83);length(T2.2)

PassT2.2=c(0,0,0,0,1,1,1,1,1,1,1,1);length(PassT2.2)

sex.2=c("f","f","m","m","m","f","m","f","m","m","f","m");length(sex.2)

#data for class 3

daysabs.3=c(1,3.5,1,1.0,1,2,10.5,3);length(daysabs.3)#8 OOPS!

daysabs.3=c(1,3.5,1,1,0,1,2,10.5,3);length(daysabs.3)#9 OK

abs_status.3=c(1,2,1,1,1,1,2,3,2);length(abs_status.3)

sex.3=c("f","f","m","m","m","m","f","f","m");length(sex.3)

T1.3=c(90,76,95,86.5,78,79.5,83,33.5,75.5);length(T1.3)

PassT1.3=c(1,1,1,1,1,1,1,0,1);length(PassT1.3)

T2.3=c(76.5,84.5,82,93.5,72.5,76,95.5,27,71);length(T2.3)

PassT2.3=c(1,1,1,1,0,1,1,0,0);length(PassT2.3)

#Now to prepare arrays for the data frame

days_abs=c(daysabs.1,daysabs.2,daysabs.3)

abs_status=c(abs_status.1,abs_status.2,abs_status.3)

T1=c(T1.1,T1.2,T1.3); length(T1) #43 OK: 22+12+9

PassT1=c(PassT1.1,PassT1.2,PassT1.3); length(PassT1)

T2=c(T2.1,T2.2,T2.3);length(T2)

PassT2=c(PassT2.1,PassT2.2,PassT2.3); length(PassT2)

sex=c(sex.1,sex.2,sex.3)

data=data.frame(class,days_abs,abs_status,sex,T1,T2,PassT1,PassT2)

data$abs_status=factor(data$abs_status) #tell R that absence status is categorical

sum(data$abs_status) #Error in Summary.factor(c(2L,… GOOD! R no longer treats it as a number

dim(data) #Get 43 8: 43 rows = 43 students, yes, 8 columns=8 variables

head(data) #to check that the first few lines of the data frame look OK

# class days_abs abs_status sex T1 T2 PassT1 PassT2

# 1 1 2 2 f 91.0 65.0 1 0

# 2 1 2 2 f 87.5 88.5 1 1

# 3 1 0 1 f 86.0 85.5 1 1 …

5) Display data visually: What we said in Part3.

Bar plots are actually not necessary for the data that we got: there are never “at least 5 bars”, so they are “extras”, but we promised them in Part 3...

d1=subset(data,data$class==1) #to separate data from each class

d2=subset(data,data$class==2) #for easier counting

d3=subset(data,data$class==3) #also used rbind: found it harder

#Count, then calculate relative frequencies (order: Low, Med, High and Pass, No)

rf1as=c(5/22,8/22,9/22);rf2as=c(3/12,5/12,4/12);rf3as=c(5/9,3/9,1/9)

T1passc1=c(19/22,3/22); T1passc2=c(11/12,1/12);T1passc3=c(8/9,1/9)

T2pass.c1=c(11/22,11/22); T2pass.c2=c(8/12,4/12);T2pass.c3=c(6/9,3/9)
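The relative frequencies counted by hand above can also be obtained directly with `table()` and `prop.table()`, which avoids counting slips. A minimal sketch for class 3 only, with the values copied from the data-entry step; the names ending in `_check` are hypothetical.

```r
# Class 3 values copied from the data-entry step of the Appendix
abs_status.3 <- c(1, 2, 1, 1, 1, 1, 2, 3, 2)   # 1=Low, 2=Med, 3=High
PassT2.3     <- c(1, 1, 1, 1, 0, 1, 1, 0, 0)   # 1=Pass, 0=No

# table() counts each level; prop.table() turns counts into relative frequencies
rf3as_check <- prop.table(table(factor(abs_status.3, levels = 1:3)))

# Order the levels Pass (1) then No (0) to match the hand-entered vectors
T2pass3_check <- prop.table(table(factor(PassT2.3, levels = c(1, 0))))

rf3as_check
T2pass3_check
```

Fixing the factor levels explicitly keeps a level in the table even when its count is zero, so the bars stay in the same order for every class.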

par(mfrow=c(3,3)) # Graph1: optional. Much work, little information

barplot(rf1as,ylim=c(0,0.5),xlab="LOW MED HIGH",ylab="Rel. freq.",main="Absence status, class1")

barplot(rf2as,ylim=c(0,0.5),xlab="LOW MED HIGH",ylab="Rel. freq.",main="Absence status, class2")

barplot(rf3as,ylim=c(0,0.5),xlab="LOW MED HIGH",ylab="Rel. freq.",main="Absence status, class3")

barplot(T1passc1, ylim=c(0,1.00),xlab="Pass No",ylab="Rel. freq",main="Class1 Test1 Passing status")

barplot(T1passc2, ylim=c(0,1.00),xlab="Pass No",ylab="Rel. freq",main="Class2 Test1 Passing status")

barplot(T1passc3, ylim=c(0,1.00),xlab="Pass No",ylab="Rel. freq",main="Class3 Test1 Passing status")

barplot(T2pass.c1, ylim=c(0,1.00),xlab="Pass No",ylab="Rel. freq",main="Class1 Test2 Passing status")

barplot(T2pass.c2, ylim=c(0,1.00),xlab="Pass No",ylab="Rel. freq",main="Class2 Test2 Passing status")

barplot(T2pass.c3, ylim=c(0,1.00),xlab="Pass No",ylab="Rel. freq",main="Class3 Test2 Passing status")

par(mfrow=c(2,3)) #Graph2

h1.1=hist(d1$T1,breaks=5,plot=FALSE);h1.1$counts=h1.1$counts/sum(h1.1$counts);plot(h1.1,xlim=c(20,100),ylim=c(0,0.60), xlab="scores out of 100",ylab="Rel. freq.",main="Test1,class1")

h1.2=hist(d2$T1,breaks=5,plot=FALSE);h1.2$counts=h1.2$counts/sum(h1.2$counts);plot(h1.2, xlim=c(20,100),ylim=c(0,0.60), xlab="scores out of 100", ylab="Rel. freq.",main="Test1,class2")

h1.3=hist(d3$T1,breaks=5,plot=FALSE);h1.3$counts=h1.3$counts/sum(h1.3$counts);plot(h1.3, xlim=c(20,100),ylim=c(0,0.60), xlab="scores out of 100", ylab="Rel. freq.",main="Test1,class3")

h2.1=hist(d1$T2,breaks=5,plot=FALSE);h2.1$counts=h2.1$counts/sum(h2.1$counts);plot(h2.1,xlim=c(20,100),ylim=c(0,0.60), xlab="scores out of 100",ylab="Rel. freq.",main="Test2,class1")

h2.2=hist(d2$T2,breaks=5,plot=FALSE);h2.2$counts=h2.2$counts/sum(h2.2$counts);plot(h2.2,xlim=c(20,100),ylim=c(0,0.60), xlab="scores out of 100", ylab="Rel. freq.",main="Test2,class2")

h2.3=hist(d3$T2,breaks=5,plot=FALSE);h2.3$counts=h2.3$counts/sum(h2.3$counts);plot(h2.3,xlim=c(20,100),ylim=c(0,0.60), xlab="scores out of 100",ylab="Rel. freq.",main="Test2,class3")

par(mfrow=c(2,3)) #Graph3

boxplot(data$T1~data$class,ylim=c(0,100), ylab="scores out of 100",xlab="class1,class2,class3", main="Test1")

boxplot(data$T2~data$class,ylim=c(0,100),xlab="class1,class2,class3", ylab="scores out of 100",main="Test2")

plot(data$T1,data$T2, main="T2 vs. T1 linear ?")

plot(data$days_abs,data$T2, main="T2 vs. days absent linear?") #These are the only numeric explanatory variables

qqnorm(data$T2, main="Test2 Normal?");qqline(data$T2)

M1=lm(data$T2~data$class+days_abs+data$T1+data$sex)

summary(M1)

Estimate Std. Error t value Pr(>|t|) <- 2-sided P-VALUES

(Intercept) 0.4656 18.7794 0.025 0.9804

data$class2 2.9117 4.8095 0.605 0.5486

data$class3 6.1269 5.6272 1.089 0.2833

days_abs -1.5265 0.7007 -2.178 0.0358 *

data$T1 0.9282 0.2037 4.556 5.5e-05 ***

data$sexm 1.4652 4.0351 0.363 0.7186

Multiple R-squared: 0.5266, Adjusted R-squared: 0.4627

F-statistic: 8.233 on 5 and 37 DF, p-value: 2.646e-05

The original model is significant (p-value approx. 2.646e-05 << 0.05) and accounts for more than 46% of the variability in Test2 scores (adjusted R-squared = 0.4627). Classes 2 and 3 did not differ significantly from class1 (p-values > 0.05). Sex was not a significant predictor (p-value > 0.70). Drop sex for the next model.

M2=lm(data$T2~data$class+data$days_abs+data$T1); summary(M2)

Residuals: Min 1Q Median 3Q Max

-30.267 -11.104 0.105 10.310 21.792

Coefficients: Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.8200 18.5386 0.044 0.9650

data$class2 3.0478 4.7398 0.643 0.5241

data$class3 6.2535 5.5519 1.126 0.2671

days_abs -1.5154 0.6920 -2.190 0.0348 *

data$T1 0.9321 0.2011 4.635 4.12e-05 ***

Multiple R-squared: 0.525, Adjusted R-squared: 0.475

F-statistic: 10.5 on 4 and 38 DF, p-value: 7.912e-06

This is a better model: higher adjusted R2 (0.475 vs. 0.4627) and an even lower overall p-value. The residuals show that this model fits pretty well: the median is close to 0, |Q1| is close to |Q3|, and |min| is reasonably close to |max|. It is worth trying a model with an interaction between class and days absent.

M3=lm(data$T2~data$class*data$days_abs+data$T1); summary(M3)

(Intercept) -13.2855 22.1603 -0.600 0.5526

data$class2 7.3619 8.0257 0.917 0.3651

data$class3 2.3337 7.1383 0.327 0.7456

data$days_abs -1.5205 0.7681 -1.980 0.0554 .

data$T1 1.0990 0.2510 4.378 9.88e-05 ***

data$class2:data$days_abs -1.5204 2.1606 -0.704 0.4861

data$class3:data$days_abs 2.0018 2.0186 0.992 0.3280

Multiple R-squared: 0.5458, Adjusted R-squared: 0.47

F-statistic: 7.209 on 6 and 36 DF, p-value: 4.187e-05

The interactions are not significant and the adjusted R squared is a little worse (0.47 vs. 0.475). Test a few more models:

M4=lm(data$T2~data$days_abs+data$T1); summary(M4) #Remove factor “class”

# The best model per R squared: Adjusted R-squared: 0.4836, but we can

# decide to include a factor of interest even if it is not significant in

# the given sample.

M5=lm(data$T2~data$days_abs*data$T1); summary(M5) # test the interaction

# The interaction of Test1 scores and days absent is not significant

# (p-value>0.50). The R squared is 0.475.

#DECISION: KEEP MODEL2. WANT TO RETAIN “CLASS” AS A FACTOR
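The candidate models can also be compared side by side in one step. The sketch below is an alternative idiom, not the code used for the results above: it passes the variables through the `data=` argument of `lm()`, and `dat`, `adj_r2`, `models` and `comparison` are hypothetical names. The data are copied from the data-entry step; the printed values are meant to be checked against the adjusted R-squared figures in the summaries rather than taken as new results.

```r
# Data copied from the data-entry step of the Appendix (43 students)
class <- factor(rep(1:3, times = c(22, 12, 9)))
days_abs <- c(2,2,0,6,2,1,3.5,3,0,11.5,0,8,0,6.5,2,2,9,7,4,5,2,13,
              2,5,3,6,1,0,3,0,4,2,3,5,
              1,3.5,1,1,0,1,2,10.5,3)
T1 <- c(91,87.5,86,92,88.5,82.5,93,82,86,63,80.5,80,90.5,87,96,84.5,
        92.5,70.5,83.5,83,72.5,90.5,
        82,84,56,86.5,97,81,90,88,91,83,90.5,86,
        90,76,95,86.5,78,79.5,83,33.5,75.5)
T2 <- c(65,88.5,85.5,97,81.5,74,93,94.5,76.5,31,94.5,33,71,57,72,94,
        83.5,67,61.5,70.5,52,76,
        58.5,59,46,72.5,94.5,93,87,86,87.5,95,79,83,
        76.5,84.5,82,93.5,72.5,76,95.5,27,71)
dat <- data.frame(class, days_abs, T1, T2)

# Hypothetical helper: fit a formula on dat and return its adjusted R-squared
adj_r2 <- function(f) summary(lm(f, data = dat))$adj.r.squared

models <- c(M2 = "T2 ~ class + days_abs + T1",
            M3 = "T2 ~ class * days_abs + T1",
            M4 = "T2 ~ days_abs + T1",
            M5 = "T2 ~ days_abs * T1")
comparison <- sapply(models, function(f) adj_r2(as.formula(f)))
round(comparison, 4)   # one adjusted R-squared per candidate model
```

Adjusted R-squared is the fairer yardstick here because multiple R-squared can only go up when a term is added, whether or not the term is useful.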

# CONFIDENCE INTERVALS OF COEFFICIENTS FOR MODEL2 below

confint(M2)

2.5 % 97.5 %

(Intercept) -36.7093848 38.3493085

data$class2 -6.5473317 12.6430028

data$class3 -4.9857367 17.4926829

days_abs -2.9163170 -0.1144435

data$T1 0.5250475 1.3392074

AT2=rbind(data$abs_status,data$PassT2) # TO BE ABLE TO COUNT (named AT2 so that the array T1 is not overwritten)

AT2

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]

[1,] 2 2 1 3 2 1 2 2 1 3 1 3 1 3 2 2 3 3

[2,] 0 1 1 1 1 0 1 1 1 0 1 0 0 0 0 1 1 0 …

r1=c(10,3);r2=c(10,6);r3=c(5,9); chi1=rbind(r1,r2,r3) #rows: Low, Med, High absence; columns: Test2 Pass, No

T1ch=chisq.test(chi1)

T1ch

# Pearson's Chi-squared test data: chi1

# X-squared = 4.9025, df = 2, p-value = 0.08619

#Borderline significant. Tried a 2x2 table (absence collapsed into 2 levels): worse
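The counts entered by hand into `chi1` can be cross-checked by letting R build the table from the raw vectors. A minimal sketch: the vectors are copied from the data-entry step, while the names `absence`, `passed` and `counts` are hypothetical.

```r
# Absence status (1=Low, 2=Med, 3=High) and Test2 passing status (1=Pass, 0=No)
abs_status <- c(2,2,1,3,2,1,2,2,1,3,1,3,1,3,2,2,3,3,3,3,2,3,   # class 1
                2,3,2,3,1,1,2,1,3,2,2,3,                       # class 2
                1,2,1,1,1,1,2,3,2)                             # class 3
PassT2 <- c(0,1,1,1,1,0,1,1,1,0,1,0,0,0,0,1,1,0,0,0,0,1,       # class 1
            0,0,0,0,1,1,1,1,1,1,1,1,                           # class 2
            1,1,1,1,0,1,1,0,0)                                 # class 3

absence <- factor(abs_status, levels = 1:3, labels = c("Low", "Med", "High"))
passed  <- factor(PassT2, levels = c(1, 0), labels = c("Pass", "No"))

counts <- table(absence, passed)   # 3 x 2 table of counts
counts
chisq.test(counts)                 # same test as chisq.test(chi1)
```

Building the table with `table()` also labels the rows and columns, so the printed result shows which count belongs to which level.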

# NO PERSON WHO FAILED TEST1 PASSED TEST2: that cell count = 0, so

# a chi squared test on Passing T1 and Passing T2 CANNOT BE PERFORMED.

# CAN FIND C.I.s for the proportions passing / not passing Test2 among those who passed Test1.

25/37-1.96*sqrt((25/37)*(12/37)/37) #0.5248365

25/37+1.96*sqrt((25/37)*(12/37)/37) #0.8265148

12/37-1.96*sqrt((25/37)*(12/37)/37) #0.1734852

12/37+1.96*sqrt((25/37)*(12/37)/37) #0.4751635
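The four hand calculations above all use the same normal-approximation formula, p ± 1.96·sqrt(p(1−p)/n). A small helper makes that explicit; `prop_ci` is a hypothetical name, and the counts 25 and 37 are simply the ones used in the lines above.

```r
# Hypothetical helper: normal-approximation (Wald) C.I. for a proportion
prop_ci <- function(x, n, z = 1.96) {
  p  <- x / n                    # sample proportion
  se <- sqrt(p * (1 - p) / n)    # standard error of the proportion
  c(lower = p - z * se, upper = p + z * se)
}

ci_pass <- prop_ci(25, 37)   # proportion passing Test2
ci_fail <- prop_ci(12, 37)   # proportion not passing Test2
round(ci_pass, 4)
round(ci_fail, 4)
```

The two intervals mirror each other around 0.5 because the standard error is the same for p and 1 − p.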

par(mfrow=c(1,2)) #Graph4

plot(M2$residuals,main="Linear Model residuals");abline(0,0)

qqnorm(M2$residuals, main="Linear Model residuals"); qqline(M2$residuals)

PART 4

Research question:

The study is a class demonstration. The question addressed was whether Test1 score, the specific Math1150 class, attendance (measured by the number of classes missed per semester), and sex of the student predicted Test2 scores for students who did not drop out before Test2. In addition, the association between the categorical variables absence status and Test2 Passing status, as well as between Test1 Passing status and Test2 Passing status, was assessed.

Method:

Records for the 3 classes provided data for this study. All classes, taught by J. Gringauz, used different versions of the same tests. The classes were selected on a convenience basis. A linear model to predict Test 2 scores and to estimate the effects of the factors class, Test1, days absent and sex of each student was fit using backward selection, and the best of the models fit was analyzed. Association between attendance and Test 2 Passing was assessed using the Chi squared test. A C.I. for the proportion of students passing Test 2 given that they passed Test1 was calculated in lieu of a Chi squared test, which was impossible due to a cell with count zero in the table.

Data display:

Graph 1: Absence status, Test1 Passing/Not, Test2 Passing/Not

The top row shows that the pattern of attendance varied among classes: in class 1, a large proportion of students missed 4+ classes; in class 2, attendance varied; and in class 3, a small proportion of students missed 4+ classes. The bar graphs of passing and not passing status of the tests are not illuminating except to indicate that the classes vary in success rates.

Graph 2:

The scores for Test1 display a bell-shaped distribution, but the scores for Test2 appear left skewed, which is problematic for fitting models that assume that the response variable is normally distributed. However, linear models are moderately robust against deviations from assumptions. (I assessed ln, square root and square transformations of Test 2 scores. None of them yielded a distribution close to normal per the qqnorm and qqline plots. For ease of interpretation, I decided to use untransformed Test2 scores in data analysis.)

The graphs display problems associated with small group sizes: not all of them have the 5 classes (bins) requested in the R code (breaks=5), or at least 5 bars, as is considered desirable for histograms.

Graph 3:

The plots indicate the following: 1) much variability in success within Test1 and Test 2 scores, as well as variability between classes and between tests; 2) linear models are reasonably appropriate using Test1 scores and/or days absent as predictors of Test2 scores; 3) Test2 scores from the three classes combined deviate from normality.

Results:

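The linear model fit to predict Test 2 scores was:

Predicted Test2 score = 0.820 + 3.048(class2) + 6.254(class3) - 1.515(days absent) + 0.932(Test1 score), where class2 and class3 equal 1 for a student in that class and 0 otherwise.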

The model is highly significant overall, p-value<<0.01. Therefore, it is unlikely that all factor effects (coefficients) are zero in the population. The model accounts for 47.5% of the variability in Test2 scores. This may mean that important factors were not included in the model, for example, the homework score, or number of days absent before Test2.

To illustrate, the model estimates that a person in class3 with 3 days absence and with 76% on Test1 will score 0.820+3.048(0)+6.254(1) –1.515(3)+0.932(76)=73.361%, on average.
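The arithmetic of this illustration can be reproduced in R. A minimal sketch using the rounded Model 2 coefficients quoted in the sentence above; the vector names `b` and `x` are hypothetical.

```r
# Coefficients of Model 2, rounded to 3 decimals as in the text
b <- c(intercept = 0.820, class2 = 3.048, class3 = 6.254,
       days_abs = -1.515, T1 = 0.932)

# Student in class3 (class2 = 0, class3 = 1), 3 days absent, 76% on Test1
x <- c(intercept = 1, class2 = 0, class3 = 1, days_abs = 3, T1 = 76)

pred <- sum(b * x)   # 0.820 + 3.048(0) + 6.254(1) - 1.515(3) + 0.932(76)
pred
```

For a model fit through the `data=` argument of `lm()`, `predict()` with a `newdata` data frame would give the same kind of estimate from the unrounded coefficients.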

Interpretation of coefficients of the linear model: The intercept estimates the Test2 score of a student in class1 with no absences and a Test1 score of 0% as 0.820%, 95% C.I. (-36.7, 38.3), which is not significantly different from 0%. This estimate is an extrapolation far outside the observed Test1 scores, which explains the very wide interval. With the other predictors held fixed, the students in classes 2 and 3 scored 3.048 points and 6.254 points more on average than students in class1, respectively. Neither difference is significant: 95% C.I. (-6.5, 12.6) for class2 and 95% C.I. (-5.0, 17.5) for class3. Each additional day of class missed per semester is estimated to lower Test2 scores by 1.515 points on average, 95% C.I. (-2.9, -0.1), indicating an average reduction of between 0.1 and 2.9 points, with 95% confidence. Each additional point on Test1 is estimated to increase Test2 scores by 0.932 points on average, 95% C.I. (0.5, 1.3), indicating an average increase of between 0.5 and 1.3 points, with 95% confidence.

The association of attendance status and Test2 Passing status had borderline significance (chi squared with 2 df, p-value = 0.086): it is significant at the 10% level but not at the 5% level. Attendance status was coded Low = 0-1, Med = more than 1 but fewer than 4, High = 4+ days missed, and Passing status of Test2 was coded Passing = 75%+ (on the basis of “enough for a ‘B’ in the course”). The significance of the association between Passing status of Test1 and Test2 could not be determined because no individual who did not pass Test1 passed Test2. The 95% C.I. for the proportion of individuals who passed Test2 given that they passed Test1 is (0.52, 0.83), indicating that the majority of those who pass Test1 would also pass Test2.

Discussion:

The fit of the linear model was moderate, as checked by the plots in Graph4. The residuals displayed a random pattern, but deviated from a Gaussian distribution.

The 5-number summary of the residuals (-30.267, -11.104, 0.105, 10.310, 21.792) is approximately symmetric, supporting the reasonableness of the linear model.

Some of the problematic aspects of the linear model included the lack of normality in the outcome, Test2 scores, and the apparent lack of important predictors, such as the homework score, previous classes taken, and days missed before Test2, as well as nonlinear components of the association between the predictors and the outcome, as indicated by the graphs and by R2=0.475. Another problematic aspect of this study is the lack of a clear population of interest. If the population of interest is all Math1150 classes taught by Jane Gringauz, then the non-random sample of 3 classes examined would make the inference questionable. If the three classes constitute the population of interest, the inferential methods used in data analysis are questionable, or unnecessary.

Because the classes are possibly different populations, or different strata, the C.I.s of their effect on Test2 have a questionable interpretation. However, the C.I.s contain zero, which may indicate that the difference between the classes could be zero if all other factors were equal. Including class as a predictor variable is, therefore, also questionable.

Recommendations for future studies include adding more classes to the data set and additional predictor variables to the linear model. Squaring the scores of Test2 may also improve model fit.