Titanic Question

Assignment7.pdf

Assignment 7: Titanic Survival Prediction

This assignment is worth 80 points, four times the normal assignment weight. The goal of the project is to predict what types of passengers were more likely to survive. The available features include Name, Age, Gender, Fare Class, etc. A data dictionary is provided in the appendix. The data is partitioned into (1) ProjectTrain.csv and (2) ProjectTest.csv. Use the Train data to develop the models and report performance results on the Test dataset.

1) Develop Logistic Regression, LDA, QDA, and KNN based survival prediction models using Pclass, Sex, Age, SibSp, Parch, and Embarked as predictor variables. Note that some of these variables may need to be cast as categorical (factors in R). Also, Age has a lot of missing values, which may need to be imputed (e.g., with the mean) before this variable can be used. Try a few values of k in KNN to determine a suitable value. Compare and interpret the True Positive (TP) and False Positive (FP) results of the different models using test data. 40 Points
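
One possible shape for part 1 is sketched below. It is only an illustration under stated assumptions: the data is simulated (it is not ProjectTrain.csv or ProjectTest.csv), the survival mechanism inside `simulate()` is made up, the helper names `simulate` and `tp_fp` are hypothetical, and scaling the KNN inputs and fixing k = 5 are choices of this sketch rather than requirements of the assignment.

```r
library(MASS)   # lda(), qda()
library(class)  # knn()

set.seed(1)

# Simulated stand-in for the assignment data (NOT the real Titanic files).
simulate <- function(n) {
  d <- data.frame(
    Pclass   = factor(sample(1:3, n, replace = TRUE), levels = 1:3),
    Sex      = factor(sample(c("female", "male"), n, replace = TRUE)),
    Age      = round(runif(n, 1, 70)),
    SibSp    = rpois(n, 0.5),
    Parch    = rpois(n, 0.4),
    Embarked = factor(sample(c("C", "Q", "S"), n, replace = TRUE))
  )
  # Made-up survival mechanism, only so the models have something to learn.
  lin <- 2 - 2.5 * (d$Sex == "male") - 0.5 * as.integer(d$Pclass)
  d$Survived <- rbinom(n, 1, plogis(lin))
  d$Age[sample(n, n %/% 5)] <- NA   # inject missing ages
  d
}
train <- simulate(400)
test  <- simulate(200)

# Mean imputation: the mean is estimated on the training data only.
age_mean <- mean(train$Age, na.rm = TRUE)
train$Age[is.na(train$Age)] <- age_mean
test$Age[is.na(test$Age)]   <- age_mean

f <- Survived ~ Pclass + Sex + Age + SibSp + Parch + Embarked

# Logistic regression with the default 0.5 threshold.
lr_fit  <- glm(f, data = train, family = binomial)
lr_pred <- as.integer(predict(lr_fit, test, type = "response") > 0.5)

# LDA and QDA from MASS.
lda_pred <- predict(lda(f, data = train), test)$class
qda_pred <- predict(qda(f, data = train), test)$class

# KNN needs a numeric matrix: expand factors to dummies, then scale
# the test set with the training means and standard deviations.
x_train <- scale(model.matrix(f, train)[, -1])
x_test  <- scale(model.matrix(f, test)[, -1],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))
knn_pred <- knn(x_train, x_test, cl = factor(train$Survived), k = 5)

# True positives and false positives on the test set.
tp_fp <- function(pred) {
  pred <- as.integer(as.character(pred))
  c(TP = sum(pred == 1 & test$Survived == 1),
    FP = sum(pred == 1 & test$Survived == 0))
}
results <- rbind(LR = tp_fp(lr_pred), LDA = tp_fp(lda_pred),
                 QDA = tp_fp(qda_pred), KNN = tp_fp(knn_pred))
print(results)
```

With the real files, `train` and `test` would presumably come from `read.csv()` instead of `simulate()`, and the categorical columns would need an explicit `factor()` call with the same levels in both data frames.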

2) “Cabin” has sparse data content. One approach to handling the missing data is to assign a special value, “Not Available”, to all the missing values. For the Logistic Regression model, evaluate the change in performance with and without the Cabin feature using test data. 10 Points
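
A minimal sketch of the "Not Available" recoding follows. The `cabin` vector is a made-up example, and the assumption that missing cabins appear as either `NA` or an empty string depends on how the CSV is read. The `has_cabin` indicator is an optional simplification suggested here, not something the assignment asks for.

```r
# Hypothetical Cabin values; missing entries may be NA or "" after read.csv().
cabin <- c("C85", "", NA, "E46", "")

# Replace every missing entry with the special value.
cabin[is.na(cabin) | cabin == ""] <- "Not Available"
cabin <- factor(cabin)
print(table(cabin))

# Optional: a two-level indicator avoids factor levels that appear
# in the test data but not in the training data.
has_cabin <- factor(cabin != "Not Available", levels = c(FALSE, TRUE))
```

The indicator can matter in practice because `predict()` on a `glm` fit stops with an error when the test data contains factor levels that were not present during training.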

3) Like linear regression, logistic regression (LR) has the advantage of interpretability. Research the concepts of “Unadjusted Odds Ratio” and “Adjusted Odds Ratio”. Determine the adjusted odds ratios for Sex, Pclass, and Embarked using LR. Interpret the results. 10 Points
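
The sketch below illustrates the distinction on simulated data (not the assignment files; the coefficients used to generate `Survived` are made up). It assumes the usual reading of the two terms: the unadjusted odds ratio comes from a model with a single predictor, and the adjusted odds ratio comes from a model that also includes the other predictors. In both cases the odds ratio is the exponentiated coefficient.

```r
set.seed(2)
n <- 500

# Simulated stand-in data with two categorical predictors.
d <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Pclass = factor(sample(1:3, n, replace = TRUE), levels = 1:3)
)
d$Survived <- rbinom(
  n, 1, plogis(1.5 - 2.5 * (d$Sex == "male") - 0.7 * (as.integer(d$Pclass) - 1))
)

# Unadjusted: Sex is the only predictor.
unadj    <- glm(Survived ~ Sex, data = d, family = binomial)
or_unadj <- exp(coef(unadj))["Sexmale"]

# Cross-check against the 2x2 table: odds(male) / odds(female).
tab      <- table(Sex = d$Sex, Survived = d$Survived)
or_table <- (tab["male", "1"] * tab["female", "0"]) /
            (tab["male", "0"] * tab["female", "1"])

# Adjusted: the effect of Sex holding Pclass fixed, and vice versa.
adj    <- glm(Survived ~ Sex + Pclass, data = d, family = binomial)
or_adj <- exp(coef(adj))[-1]          # drop the intercept
ci_adj <- exp(confint.default(adj))   # Wald 95% intervals on the OR scale

print(c(unadjusted = unname(or_unadj), from_table = or_table))
print(or_adj)
```

An odds ratio below 1 for `Sexmale` would mean lower odds of survival for males than for females (the reference level), with the other predictors in the model held constant.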

4) The default probability threshold for assigning an observation to a class is 0.5. For the LR models, vary the threshold to 0.8, 0.5, and 0.2. Which threshold value do you think is appropriate for survival prediction? Why? Justify your answer with respect to the misclassification rate on test data. 10 Points
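
To make the threshold comparison concrete, here is a small sketch with hypothetical numbers. The vectors `prob` and `actual` are invented for illustration; in the assignment, `prob` would presumably come from `predict(fit, test, type = "response")`.

```r
# Hypothetical predicted survival probabilities and true outcomes.
prob   <- c(0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.25, 0.10)
actual <- c(1,    1,    1,    0,    1,    0,    1,    0)

thresholds <- c(0.8, 0.5, 0.2)

# Misclassification rate: share of cases where the thresholded
# prediction disagrees with the true outcome.
misclass <- sapply(thresholds, function(t) mean((prob > t) != actual))
names(misclass) <- thresholds
print(misclass)
# 0.8: 0.375, 0.5: 0.375, 0.2: 0.25 for these made-up numbers
```

The misclassification rate treats both kinds of error equally, so a threshold that looks best on this measure may still be a poor choice when false positives and false negatives have different costs.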

5) Develop an ROC plot for the LDA model. 5 Points
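
An ROC curve can be built by hand from the LDA posterior probabilities, as in the sketch below. The data is simulated with a made-up survival mechanism and the numeric `Male` column is a hypothetical stand-in for Sex. Dedicated packages for ROC curves exist, but only base R and MASS are used here to keep the mechanics visible.

```r
library(MASS)  # lda()

set.seed(3)
n <- 300

# Simulated stand-in data (not the assignment files).
d <- data.frame(Age = runif(n, 1, 70), Male = rbinom(n, 1, 0.5))
d$Survived <- rbinom(n, 1, plogis(1 - 2 * d$Male - 0.01 * d$Age))
train <- d[1:200, ]
test  <- d[201:300, ]

fit   <- lda(Survived ~ Age + Male, data = train)
score <- predict(fit, test)$posterior[, "1"]   # P(Survived = 1)

# Sweep the threshold from high to low and record both rates.
thresholds <- sort(unique(c(0, score, 1)), decreasing = TRUE)
tpr <- sapply(thresholds, function(t) mean(score[test$Survived == 1] >= t))
fpr <- sapply(thresholds, function(t) mean(score[test$Survived == 0] >= t))

# Area under the curve by the trapezoid rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

plot(fpr, tpr, type = "l",
     xlab = "False positive rate", ylab = "True positive rate",
     main = "ROC curve, LDA (simulated data)")
abline(0, 1, lty = 2)   # reference line for a random classifier
```

Each point on the curve corresponds to one classification threshold, so the plot also shows what the thresholds discussed in part 4 trade off.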

6) What features do you think are important for making the prediction? Why? Evaluate the KNN model performance when including just the important features. 5 Points
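
The sketch below shows one way to evaluate KNN on a reduced feature set over several values of k. Everything specific in it is an assumption: the data is simulated, the numeric `Male` column stands in for Sex, and the choice of `Male`, `Pclass`, and `Age` as the "important" features is a placeholder rather than an answer to the question.

```r
library(class)  # knn()

set.seed(4)

# Simulated stand-in data (not the assignment files).
simulate <- function(n) {
  d <- data.frame(
    Male   = rbinom(n, 1, 0.5),
    Pclass = sample(1:3, n, replace = TRUE),
    Age    = runif(n, 1, 70),
    Parch  = rpois(n, 0.4)
  )
  d$Survived <- rbinom(n, 1, plogis(2 - 2.5 * d$Male - 0.5 * d$Pclass))
  d
}
train <- simulate(300)
test  <- simulate(150)

# Hypothetical "important" feature subset.
features <- c("Male", "Pclass", "Age")

# Scale with training statistics so no feature dominates the distance.
x_train <- scale(train[, features])
x_test  <- scale(test[, features],
                 center = attr(x_train, "scaled:center"),
                 scale  = attr(x_train, "scaled:scale"))

# Test misclassification rate for a few values of k.
ks  <- c(1, 3, 5, 7, 9)
err <- sapply(ks, function(k) {
  pred <- knn(x_train, x_test, cl = factor(train$Survived), k = k)
  mean(as.character(pred) != as.character(test$Survived))
})
names(err) <- ks
print(err)
```

Rerunning the same loop with the full feature set gives the comparison the question asks for.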

In the report, include the text of the R code.

Submit through link: eCampus -> Assignment 7

Deadline: March 18, 11:55 PM

Data Dictionary

Variable Notes

pclass: A proxy for socio-economic status (SES)

1st = Upper

2nd = Middle

3rd = Lower

Age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The dataset defines family relations in this way...

Sibling = brother, sister, stepbrother, stepsister

Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The dataset defines family relations in this way...

Parent = mother, father

Child = daughter, son, stepdaughter, stepson

Some children travelled only with a nanny, therefore parch=0 for them.