Python Assignment 4
12/2/2020 Assignment_3
localhost:8888/nbconvert/html/Downloads/Assignment_3.ipynb?download=false 1/8
INSTRUCTIONS:
* Add your code as indicated in each cell.
* Besides adding your code, do not alter this file.
* Do not delete or change test cases. Once you are done with a question, you can run the test cases to see if you programmed the question correctly.
* If you get a question wrong, do not give up. Keep trying until you pass the test cases.
* Rename the file as firstname_lastname_assignmentid.ipynb (e.g., marina_johnson_assignment1.ipynb)
* Only submit .ipynb files (no .py files)
#
Question 1
1. Read the employee_attrition dataset and save it as df. Recall that the target variable in this dataset is named 'Attrition'.
2. Check if the dataset is imbalanced by counting the number of Noes and Yeses in the target variable Attrition.
Hints:
Imbalanced data refers to a situation where the number of observations is not the same for all the classes in a dataset.
For example, the number of churned employees is 4000, while the number of unchurned employees is 40000. This
means this dataset is imbalanced.
You need to access the target variable Attrition and count how many Yes and No values there are in this variable. If the number
of Yeses is equal to the number of Noes, then the dataset is balanced. Otherwise, it is not balanced.
Check Module 5g: Encoding Categorical Variables to learn more about data imbalance problems. In particular, check
2.5: Balancing datasets in Module 5.
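The counting step described in the hints can be sketched on a small made-up column. The toy dataframe and the names `toy`, `n_yes`, `n_no`, and `is_balanced` are illustrative assumptions, not part of the assignment; the real counts come from the employee_attrition file.

```python
import pandas as pd

# Hypothetical toy target column (the real data comes from the dataset file).
toy = pd.DataFrame({"Attrition": ["No", "No", "Yes", "No", "No", "Yes", "No"]})

# value_counts() returns the number of rows per class label.
counts = toy["Attrition"].value_counts()
n_yes = int(counts.get("Yes", 0))
n_no = int(counts.get("No", 0))

# The dataset is balanced only if both classes have the same count.
is_balanced = n_yes == n_no
print(n_yes, n_no, is_balanced)
```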
Do not alter the below cell. It is a test case for Question 1
{'question 1': 'pass'}
In [138…
# Do not delete this cell
import numpy as np
score = dict()
np.random.seed(333)
In [139…
import pandas as pd
df = # your code to read the dataset goes in here
number_of_yes = # your code to find the number of yeses in the Attrition variable goes in here
number_of_no = # your code to find the number of noes in the Attrition variable goes in here
In [140…
try:
    if (number_of_yes == 237 and number_of_no == 1233):
        score['question 1'] = 'pass'
    else:
        score['question 1'] = 'fail'
except:
    score['question 1'] = 'fail'
score
Out[140…
#
Question 2
1. Identify the names of the numerical input variables and save them as a LIST
2. Identify the names of the categorical input variables and save them as a LIST
Hints:
Remember Attrition is the target (output) variable, so exclude Attrition from both LISTS containing the numerical and
categorical input variables.
Check Module 5b: Dropping Variables and Module 3e: Helpful Functions (check after minute 4)
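One way to separate column names by type, sketched on made-up data, is pandas' `select_dtypes`. The toy dataframe and the names `inputs`, `numerical_cols`, and `categorical_cols` are assumptions for illustration; the course modules may show a different approach.

```python
import pandas as pd

# Hypothetical toy data with a target column named Attrition.
toy = pd.DataFrame({
    "Age": [30, 41, 25],
    "MonthlyIncome": [5000, 7200, 3100],
    "Department": ["Sales", "HR", "Sales"],
    "Attrition": ["No", "Yes", "No"],
})

# Exclude the target first so it cannot end up in either list.
inputs = toy.drop(columns=["Attrition"])

# Numeric dtypes go in one list, everything else in the other.
numerical_cols = inputs.select_dtypes(include="number").columns.tolist()
categorical_cols = inputs.select_dtypes(exclude="number").columns.tolist()
print(numerical_cols, categorical_cols)
```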
Do not alter the below cell. It is a test case for Question 2
{'question 1': 'pass', 'question 2': 'pass'}
#
Question 3
1. Identify the numerical variables with zero variance (i.e., zero standard deviation) and save them in a LIST
2. Drop these numerical variables with zero variance (i.e., zero standard deviation) from the dataset df. The dataset df should not
have these variables going forward.
Hints:
For each numerical variable, compute the standard deviation. If the standard deviation is zero, delete (i.e., drop) that
variable from the dataset df.
Check Module 5b: Dropping Variables
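The rule in the hints (compute the standard deviation, drop the column if it is zero) might look like the following on a toy frame. The data and the names `stds` and `zero_var` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two columns are constant across all rows.
toy = pd.DataFrame({
    "Age": [30, 41, 25],
    "StandardHours": [80, 80, 80],
    "EmployeeCount": [1, 1, 1],
})

# Standard deviation per column; constant columns have a deviation of 0.
stds = toy.std()
zero_var = stds[stds == 0].index.tolist()

# Drop the constant columns so they are gone going forward.
toy = toy.drop(columns=zero_var)
print(zero_var, list(toy.columns))
```

The resulting list follows the column order of the dataframe, which matters when a test compares lists element by element.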
Do not alter the below cell. It is a test case for Question 3
{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass'}
In [141…
numerical_variables = # Your code to identify numerical variables goes in here
categorical_varables = # Your code to identify categorical variables goes in here
In [142…
try:
    if ((sorted(numerical_variables) == ['Age','DailyRate','DistanceFromHome','Education',
            'EmployeeCount','EmployeeNumber','EnvironmentSatisfaction',
            'HourlyRate','JobInvolvement','JobLevel','JobSatisfaction',
            'MonthlyIncome','MonthlyRate','NumCompaniesWorked','PercentSalaryHike',
            'PerformanceRating','RelationshipSatisfaction','StandardHours',
            'StockOptionLevel','TotalWorkingYears','TrainingTimesLastYear',
            'WorkLifeBalance','YearsAtCompany','YearsInCurrentRole',
            'YearsSinceLastPromotion','YearsWithCurrManager'])
        and
        (sorted(categorical_varables) == ['BusinessTravel','Department','EducationField','Gender',
            'JobRole','MaritalStatus','Over18','OverTime'])):
        score['question 2'] = 'pass'
    else:
        score['question 2'] = 'fail'
except:
    score['question 2'] = 'fail'
score
Out[142…
In [143…
zero_variance_numerical_variables = # your code to find the numerical variables with zero variance goes in here
df = # your code to drop the zero variance numerical variables goes in here
In [144…
try:
    if (zero_variance_numerical_variables == ['EmployeeCount', 'StandardHours']):
        score['question 3'] = 'pass'
    else:
        score['question 3'] = 'fail'
except:
    score['question 3'] = 'fail'
score
Out[144…
#
Question 4
1. Identify the categorical variables with zero variance (i.e., a single level) and save them in a LIST
2. Drop these categorical variables with zero variance (i.e., a single level) from the dataset df. The dataset df should not have
these variables going forward.
Hints:
For each categorical variable, find the number of levels. If the number of levels is 1, delete (i.e., drop) that variable from
the dataset df. For example, if a variable named occupation has only "Engineers" across all the rows (i.e., one level), the
variable does not contain any information. In other words, it has zero variation.
Check Module 5b: Dropping Variables
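Counting levels per categorical variable can be sketched with `nunique()`. The toy data below reuses the occupation example from the hint; the names `levels` and `single_level` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data: occupation has a single level, Gender has two.
toy = pd.DataFrame({
    "occupation": ["Engineer", "Engineer", "Engineer"],
    "Gender": ["Female", "Male", "Female"],
})

# nunique() gives the number of distinct levels in each column.
levels = toy.nunique()
single_level = levels[levels == 1].index.tolist()

# A one-level column carries no information, so it is dropped.
toy = toy.drop(columns=single_level)
print(single_level, list(toy.columns))
```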
Do not alter the below cell. It is a test case for Question 4
{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass'}
#
Question 5
1. Find the categorical variables with very high variance (i.e., very high cardinality) and save them in a LIST. Use 200 as the
threshold. In other words, categorical variables with more than 200 levels should be considered variables with high cardinality (i.e.,
with high variance).
2. Drop the categorical variables with very high variance (i.e., very high cardinality) from the dataset df. The dataset df should not
have these variables going forward.
Hints:
For each categorical variable, find the number of levels. If the number of levels is greater than 200, delete (i.e., drop)
that variable from the dataset df. For example,
Check Module 5b: Dropping Variables
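The threshold rule can be sketched the same way as the single-level check, only with a different comparison. The `badge_id` column and the names `THRESHOLD`, `levels`, and `high_card` are made up for this illustration.

```python
import pandas as pd

THRESHOLD = 200  # cardinality threshold stated in the question

# Hypothetical toy data: badge_id has 250 distinct levels, Department has 2.
toy = pd.DataFrame({
    "badge_id": [f"B{i:04d}" for i in range(250)],
    "Department": ["Sales", "HR"] * 125,
})

# Keep only the names of columns whose level count exceeds the threshold.
levels = toy.nunique()
high_card = levels[levels > THRESHOLD].index.tolist()

toy = toy.drop(columns=high_card)
print(high_card, list(toy.columns))
```

If no column exceeds the threshold, `high_card` is simply an empty list and the drop leaves the dataframe unchanged.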
Do not alter the below cell. It is a test case for Question 5
In [145…
zero_variance_categorical_variables = [] # your code to find the categorical variables with zero variance goes in here
df = # your code to drop the zero variance categorical variables goes in here
In [146…
try:
    if (zero_variance_categorical_variables == ['Over18']):
        score['question 4'] = 'pass'
    else:
        score['question 4'] = 'fail'
except:
    score['question 4'] = 'fail'
score
Out[146…
In [147…
high_cardinality_categorical_variables = [] # your code to find the categorical variables with high variance (i.e., cardinality) goes in here
df = # your code to drop the high cardinality categorical variables goes in here
In [148…
try:
    if (high_cardinality_categorical_variables == []):
        score['question 5'] = 'pass'
    else:
        score['question 5'] = 'fail'
{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass'}
#
Question 6
1. Scale (i.e., standardize) the numerical variables in the dataset using the standardization method. Drop the original numerical
variables and keep only the standardized ones.
2. The new standardized numerical variables should have the same variable names. For example, the age variable after being
standardized should be named the same (i.e., age)
Hints:
Feature standardization makes the values of each feature in the data have zero-mean (when subtracting the mean in
the numerator) and unit-variance. This method is widely used for normalization in many machine learning algorithms.
Check M5d: Standardization
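Standardization replaces each value x with (x - mean) / std, computed per column. The sketch below applies this to made-up numbers and overwrites the columns in place so the names stay the same. It uses the population standard deviation (`ddof=0`), which is also what scikit-learn's `StandardScaler` uses; pandas' `.std()` defaults to `ddof=1`. Which of the two the test case expects is not stated here, and the difference matters when a test checks for exact equality.

```python
import pandas as pd

# Hypothetical toy data with two numerical columns.
toy = pd.DataFrame({"Age": [20.0, 30.0, 40.0], "DailyRate": [100.0, 300.0, 500.0]})
num_cols = ["Age", "DailyRate"]

# Subtract the column mean and divide by the population standard deviation.
# Assigning back to the same columns keeps the original variable names.
toy[num_cols] = (toy[num_cols] - toy[num_cols].mean()) / toy[num_cols].std(ddof=0)
print(toy)
```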
Do not alter the below cell. It is a test case for Question 6
{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass'}
#
Question 7
1. Encode the categorical input variables. Do not encode the target variable Attrition. You will do that in the following question.
Hints:
You will create dummies for categorical variables.
Example: Let's say you have a variable named occupation. This variable has three levels: Engineer, Teacher, Manager.
We will use dummy encoding and create dummies for the levels of the occupation variable.
Technically, we are converting the categorical variable into new numerical variables.
We will have two new variables for this occupation variable: occupation_engineer and occupation_manager. We do
not need occupation_teacher because we can infer whether the person is a teacher by checking the occupation_manager and
occupation_engineer variables.
For example: If occupation_engineer and occupation_manager are both zero, then this person is a teacher.
If occupation_engineer is 1, this person is an engineer.
Check Module 5g: Encoding Categorical Variables
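The occupation example can be sketched with `pd.get_dummies`. This is one possible approach on made-up data, not necessarily the one shown in the module. Note that `drop_first=True` drops the alphabetically first level (Engineer here), which may differ from the level the hint leaves out; the information content is the same either way.

```python
import pandas as pd

# Hypothetical toy data with a three-level categorical variable.
toy = pd.DataFrame({"occupation": ["Engineer", "Teacher", "Manager", "Engineer"]})

# Create one 0/1 column per level, then drop the first level as redundant.
encoded = pd.get_dummies(toy, columns=["occupation"], drop_first=True, dtype=int)
print(encoded)
```

A row with zeros in both remaining columns belongs to the dropped level, which is how the omitted category is inferred.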
except:
    score['question 5'] = 'fail'
score
Out[148…
In [149…
# your code to standardize numerical variables goes in here
df =
In [150…
try:
    if ((df['Age'].max() == 2.526885578888087) and
        (df['DailyRate'].max() == 1.7267301192801021)):
        score['question 6'] = 'pass'
    else:
        score['question 6'] = 'fail'
except:
    score['question 6'] = 'fail'
score
Out[150…
In [151…
# your code to encode categorical input variables goes in here
df =
Do not alter the below cell. It is a test case for Question 7
{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass', 'question 7': 'pass'}
#
Question 8
1. Encode the categorical output variable: Attrition. Yes should be coded as 1, and No should be coded as 0. The new encoded
target variable should be named Attrition. Do not forget to drop the categorical Attrition variable. Basically, you will convert
the categorical Attrition variable into a numerical Attrition variable such that Yes will be mapped to 1, and No will be mapped to
zero.
Hints:
Check Module 3 and Module 5 videos.
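A minimal sketch of the Yes/No mapping on made-up labels, using `Series.map` with a dictionary. Overwriting the column replaces the categorical version with the numerical one under the same name; the toy data is an assumption for illustration.

```python
import pandas as pd

# Hypothetical toy target column.
toy = pd.DataFrame({"Attrition": ["Yes", "No", "No", "Yes", "No"]})

# Map each label to its numeric code and overwrite the original column.
toy["Attrition"] = toy["Attrition"].map({"Yes": 1, "No": 0})
print(toy["Attrition"].tolist())
```

After the mapping, the column mean equals the share of Yes rows, which is the quantity the test case for this question checks on the real data.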
Do not alter the below cell. It is a test case for Question 8
{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass', 'question 7': 'pass', 'question 8': 'pass'}
#
Question 9
1. Balance the dataset
2. Your code should return the input and output variables separately. The input variables will be saved as a dataframe named X.
The output variable will be saved as a dataframe named y.
Hints:
Imbalanced data refers to a situation where the number of observations is not the same for all the classes in a dataset.
For example, the number of churned employees is 4000, while the number of unchurned employees is 40000. This
means this dataset is imbalanced.
You need to access the target variable Attrition and increase the number of ones (i.e., Yeses) so that both the number
of zeros (i.e., Noes) and the number of ones (i.e., Yeses) will be equal.
In [152…
try:
    if ((df['JobRole_Laboratory Technician'].mean() == 0.1761904761904762) and
        (df['EducationField_Marketing'].mean() == 0.10816326530612246)):
        score['question 7'] = 'pass'
    else:
        score['question 7'] = 'fail'
except:
    score['question 7'] = 'fail'
score
Out[152…
In [153…
# your code to encode categorical output variables Attrition goes in here
df =
In [154…
try:
    if (df['Attrition'].mean() == 0.16122448979591836):
        score['question 8'] = 'pass'
    else:
        score['question 8'] = 'fail'
except:
    score['question 8'] = 'fail'
score
Out[154…
Check M5g: Encoding Categorical Variables. Balancing datasets is discussed in this video.
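Increasing the number of ones until both classes are equal can be done in several ways. The sketch below uses plain random oversampling with replacement on made-up data; this is an assumption, and the module may demonstrate a different technique. The seed value and the names `majority`, `minority`, and `balanced` are illustrative.

```python
import pandas as pd

# Hypothetical toy data: four rows of class 0 and two rows of class 1.
toy = pd.DataFrame({
    "Age": [25, 32, 47, 51, 38, 29],
    "Attrition": [0, 0, 0, 0, 1, 1],
})

majority = toy[toy["Attrition"] == 0]
minority = toy[toy["Attrition"] == 1]

# Resample the minority class with replacement up to the majority size.
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=333)
balanced = pd.concat([majority, minority_upsampled], ignore_index=True)

# Split into inputs (X) and output (y); double brackets keep y a dataframe.
X = balanced.drop(columns=["Attrition"])
y = balanced[["Attrition"]]
print(y["Attrition"].value_counts())
```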
Do not alter the below cell. It is a test case for Question 9
{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass', 'question 7': 'pass', 'question 8': 'pass', 'question 9': 'pass'}
#
Question 10
Split the dataset into training and testing sets. Basically, using the X and y dataframes, you will create X_train, X_test, y_train, and
y_test.
You need to keep 70% of the dataset for training and 30% for testing.
Hints:
You can use the train_test_split function in the sklearn library
Check Module 6c: Classification
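A 70/30 split with `train_test_split` can be sketched on a ten-row toy frame. The data and the `random_state` value are assumptions for illustration. On the balanced assignment data (1233 rows per class, 2466 in total), a 70% share is roughly 1726 training rows, which falls inside the range the test case checks.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy inputs and output, ten rows each.
X = pd.DataFrame({"feature": range(10)})
y = pd.DataFrame({"Attrition": [0, 1] * 5})

# test_size=0.3 keeps 70% of the rows for training and 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=333
)
print(X_train.shape, X_test.shape)
```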
Do not alter the below cell. It is a test case for Question 10
{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass', 'question 7': 'pass', 'question 8': 'pass', 'question 9': 'pass', 'question 10': 'pass'}
#
Question 11
1. Train a KNN model where k is 3 using the training dataset.
In [156…
# Your code to balance the dataset goes in here
X = # dataframe containing the input variables after balancing
y = # dataframe containing the output variable Attrition after balancing
In [157…
try:
    if ((y.Attrition.value_counts()[0] == 1233) and
        (y.Attrition.value_counts()[1] == 1233)):
        score['question 9'] = 'pass'
    else:
        score['question 9'] = 'fail'
except:
    score['question 9'] = 'fail'
score
Out[157…
In [158…
# your code to create train and test sets goes in here
X_train, X_test, y_train, y_test = # your code to create train and test sets goes in here
In [159…
try:
    if ((X_train.shape[0]<1750) and (X_train.shape[0]>1700)):
        score['question 10'] = 'pass'
    else:
        score['question 10'] = 'fail'
except:
    score['question 10'] = 'fail'
score
Out[159…
2. Make predictions using the test dataset
3. Compute accuracy and save as accuracy
Hints:
You need to use the KNeighborsClassifier class. Instantiate a knn object and pass the number of neighbors to the
constructor. Train the model using X_train and y_train. Then make predictions using X_test. Then compute the
accuracy using the predicted values and y_test.
Check Module 6d: Model Performance and Module 5c: Classification
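The instantiate, fit, predict, and score sequence from the hints can be sketched on synthetic data. The two well-separated clusters, the seed, and the variable names are assumptions made so the example is self-contained; accuracy on the assignment data will differ.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: two well-separated clusters labelled 0 and 1.
rng = np.random.default_rng(333)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=333
)

# Instantiate with k = 3, train on the training set, predict on the test set.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)

# Accuracy is the share of test rows predicted correctly.
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```

If y_train is a one-column dataframe, scikit-learn may warn about its shape; `y_train.values.ravel()` flattens it to a one-dimensional array.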
Do not alter the below cell. It is a test case for Question 11
{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass', 'question 7': 'pass', 'question 8': 'pass', 'question 9': 'pass', 'question 10': 'pass', 'question 11': 'pass'}
#
Question 12
1. Train a Random Forest model where the number of estimators is 100 using the training dataset.
2. Make predictions using the test dataset
3. Compute accuracy and save as accuracy
Hints:
You need to use the RandomForestClassifier class. Instantiate a RandomForestClassifier object and pass the
number of estimators to the constructor. Train the model using X_train and y_train. Then make predictions using
X_test. Then compute the accuracy using the predicted values and y_test.
Check Module 6d: Model Performance and Module 5c: Classification
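The same workflow with a random forest, again on synthetic data chosen only to keep the example self-contained. The clusters, the seed, and the `random_state` argument are assumptions; the question itself only fixes the number of estimators at 100.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: two well-separated clusters labelled 0 and 1.
rng = np.random.default_rng(333)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=333
)

# Instantiate with 100 trees, train, predict, and compute accuracy.
forest = RandomForestClassifier(n_estimators=100, random_state=333)
forest.fit(X_train, y_train)
predictions = forest.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```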
Do not alter the below cell. It is a test case for Question 12
{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass',
In [160…
# Your code to train knn, make predictions, and compute accuracy goes in here
accuracy = # compute accuracy here
In [161…
try:
    if ((accuracy<.80) and (accuracy>.74)):
        score['question 11'] = 'pass'
    else:
        score['question 11'] = 'fail'
except:
    score['question 11'] = 'fail'
score
Out[161…
In [162…
# Your code to train random forest, make predictions, and compute accuracy goes in here
accuracy = # compute accuracy here
In [163…
try:
    if ((accuracy<.95) and (accuracy>.90)):
        score['question 12'] = 'pass'
    else:
        score['question 12'] = 'fail'
except:
    score['question 12'] = 'fail'
score
Out[163…
'question 5': 'pass', 'question 6': 'pass', 'question 7': 'pass', 'question 8': 'pass', 'question 9': 'pass', 'question 10': 'pass', 'question 11': 'pass', 'question 12': 'pass'}
#
Your Grade
Your overall score is: 100
In [164…
print('Your overall score is: ', round(list(score.values()).count('pass')*8.3333))