Machine Learning Model Training 1.


Phase II Model Training Question Sheet

Click on the HTML file attached to read the scenario. After reading through the case, please review the two questions in this assignment.

Keep the HTML file open so that it is easier to look up the information the questions ask about. Check all the correct answers and explain each choice in 2 or 3 sentences.

Q 1 Part 1.

The team split the data into two partitions: the training set and the test set. It is considered best practice to have a third partition, the validation set. What added utility is there in having a validation set? Check all that apply.

The validation set can be used for tuning hyperparameters

The validation set can be used for early stopping

The test set should be used for final evaluation only

The validation set can be used for updating the model directly

Part 2.

The team split the data randomly, without accounting for the patient to whom each exam belongs. Why would this be a problem? Recall: “The COVID dataset consists of 30,000 exams across 21,000 patients (some patients may be associated with multiple exams)”

Patient overlap between the training and test sets may lead to problems with model convergence due to exposure to the test set

Patient overlap between the training and test sets may lead to problems with model bias because of the underrepresentation of certain patient demographics in the training set

Patient overlap between the training and test sets may lead to the leakage of PHI or other sensitive data

Patient overlap between the training and test sets may lead to inflated model performance due to unrealistic evaluation conditions
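
For reference while reasoning about this part, a split that keeps all exams from one patient in the same partition can be sketched as below. This is a minimal illustration, not the team's code: the function name `split_by_patient`, the split fractions, and the toy patient IDs are all assumptions.

```python
import random
from collections import defaultdict


def split_by_patient(exam_patient_ids, fractions=(0.7, 0.15, 0.15), seed=0):
    """Assign whole patients (not individual exams) to train/val/test.

    exam_patient_ids[i] is the (hypothetical) patient ID of exam i.
    Returns a dict mapping partition name to a list of exam indices.
    """
    # Group exam indices by the patient they belong to.
    exams_of = defaultdict(list)
    for exam_idx, patient_id in enumerate(exam_patient_ids):
        exams_of[patient_id].append(exam_idx)

    # Shuffle patients, not exams, so a patient never straddles partitions.
    patients = sorted(exams_of)
    random.Random(seed).shuffle(patients)

    n_train = int(fractions[0] * len(patients))
    n_val = int(fractions[1] * len(patients))
    groups = {
        "train": patients[:n_train],
        "val": patients[n_train:n_train + n_val],
        "test": patients[n_train + n_val:],
    }
    return {
        name: [i for pid in pids for i in exams_of[pid]]
        for name, pids in groups.items()
    }


# Toy example: 10 exams across 8 patients (some patients have several exams).
toy_ids = ["a", "a", "b", "c", "c", "d", "e", "f", "g", "h"]
toy_splits = split_by_patient(toy_ids)
```

The fractions apply to patients rather than exams, so the exam counts per partition are only approximately 70/15/15.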

Part 3.

The team downsized the images to 224 by 224 pixels. Why might this lead to worse model performance?

The discriminative features in the image may be too small to identify without a higher resolution

Many publicly available models use 224 by 224 pixel images

Memory constraints may limit the model’s ability to process high-resolution images

224 by 224 pixel chest x-rays are easier to classify than 3000 by 3000 pixel chest x-rays

Part 4.

Why are Convolutional Neural Networks (CNN) particularly well suited for image classification tasks? Check all that apply.

CNN architectures take advantage of feature locality through the use of filters

CNN architectures leverage multiple decision trees in order to make their predictions more robust

CNN architectures are parameter-efficient because they use the same set of weights on each region of the image

CNN architectures can condition on previous timesteps, which they take as input in addition to the images themselves

Part 5.

What learning phenomenon is the team observing?

Convergence

Overfitting

Underfitting

Generalization

Part 6.

(i)

A colleague approaches you and suggests that it would be better if you created a model that relied only on observable features and exam metadata (patient age, gender, ethnicity, etc.). What trade-offs must be considered when using lab values as features?

Answer in 3-5 sentences

(ii)

Before using the new public COVID dataset, you want to verify that there is no PHI in the data. What are some privacy issues that could come into play with imaging data?

Answer in 3-5 sentences

Q 2 Part 1.

The D-DIMER values are highly concentrated below 1k, but many samples are several orders of magnitude away from the rest. What is the most likely explanation for this? (Hint: look at the data samples, particularly the exam metadata.)

There is a large disparity in D-DIMER lab values across patient gender

There is a large disparity in D-DIMER lab values across patient age

The data collected from one of the clinics may use different units

The data was collected from two cohorts from two different time periods

Part 2.

Which of the following strategies can be used to accommodate the missing values in the EHR dataset? Check all that apply.

A logistic regression model can be trained after the missing values are synthetically generated, using a process known as imputation

A tree-based model, such as random forest, can be trained directly on the data with missing values

A tree-based model, such as random forest, can be trained after the missing values are synthetically generated, using a process known as imputation

A logistic regression model can be trained directly on the data with missing values
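
To make the term "imputation" in the options above concrete, here is a minimal sketch of median imputation in plain Python. The column names (`d_dimer`, `age`) and the values are hypothetical, and median fill is only one of several possible imputation strategies; in practice the fill values would be computed from the training partition only.

```python
from statistics import median


def impute_median(rows, columns):
    """Replace missing entries (None) with the median of the observed values.

    rows is a list of dicts; columns lists the keys to impute.
    Raises statistics.StatisticsError if a column has no observed values.
    """
    # Compute one fill value per column from the non-missing entries.
    fill = {}
    for col in columns:
        observed = [row[col] for row in rows if row[col] is not None]
        fill[col] = median(observed)

    # Build new rows; the input rows are left unmodified.
    return [
        {col: (row[col] if row[col] is not None else fill[col]) for col in columns}
        for row in rows
    ]


# Hypothetical records with gaps.
records = [
    {"d_dimer": 200.0, "age": 50},
    {"d_dimer": None, "age": 60},
    {"d_dimer": 400.0, "age": None},
]
filled = impute_median(records, ["d_dimer", "age"])
```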

Part 3.

Which of the following is FALSE regarding logistic regression models?

Logistic regression uses the sigmoid activation function

Logistic regression can take unstructured inputs, such as images or text

Logistic regression produces values between 0 and 1, regardless of the scale of the features

Logistic regression is commonly used for classification problems
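
As a reminder of how a logistic regression prediction is computed, the sketch below applies the sigmoid function to a weighted sum of numeric features. The weights, bias, and feature values are made up for illustration and are not taken from the scenario.

```python
import math


def sigmoid(z):
    """Numerically stable sigmoid: 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form to avoid overflow in exp(-z).
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_proba(weights, bias, features):
    """Logistic regression output for one example with numeric features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


# Hypothetical model and inputs on very different feature scales.
p_small = predict_proba([2.0, -1.0], 0.1, [0.5, 0.2])
p_large = predict_proba([2.0, -1.0], 0.1, [1000.0, 3.0])
```

With floating-point arithmetic, very large positive or negative sums round to exactly 1.0 or 0.0, so the output always stays within the closed interval from 0 to 1.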

Part 4.

Which of the following is FALSE regarding random forest models?

Random forest models are a type of decision tree algorithm

Random forest models are highly interpretable

Random forest models learn multiple decision trees that each learn on a subset of the available features

Random forest models require feature normalization (i.e. scaling the features such that they are between 0 and 1) in order to work effectively
