Python Assignment


12/4/22, 7:44 PM Assignment_3

localhost:8888/nbconvert/html/Assignment_3/Assignment_3.ipynb?download=false 1/11

INSTRUCTIONS:

* Add your code as indicated in each cell.

* Besides adding your code, do not alter this file.

* Do not delete or change test cases. Once you are done with a question, you can run the test cases to see if you programmed the question correctly.

* If you get a question wrong, do not give up. Keep trying until you pass the test cases.

* Rename the file as firstname_lastname_assignmentid.ipynb (e.g., marina_johnson_assignment1.ipynb)

* Only submit .ipynb files (no .py files)

#

Question 1

1. Read the employee_attrition dataset and save it as df. Recall that the target variable in this dataset is named 'Attrition'.

2. Check whether the dataset is imbalanced by counting the number of Noes and Yeses in the target variable Attrition.

Hints:

Imbalanced data refers to a situation where the number of observations is not the same for all the classes in a dataset. For example, the number of churned employees is 4000, while the number of unchurned employees is 40000. This means this dataset is imbalanced.

You need to access the target variable Attrition and count how many Yes and No values it contains. If the number of Yeses equals the number of Noes, the dataset is balanced. Otherwise, it is imbalanced.
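The counting step in the hint can be sketched on a small made-up table. The dataframe `toy` and its values are assumptions for illustration only; they are not the employee_attrition data, so the counts will differ from the ones the test case expects.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({"Attrition": ["No", "Yes", "No", "No", "Yes", "No"]})

# value_counts() returns a Series indexed by class label.
counts = toy["Attrition"].value_counts()
n_yes = int(counts.get("Yes", 0))
n_no = int(counts.get("No", 0))

# The data is balanced only if both classes have the same count.
is_balanced = n_yes == n_no
print(f"Yes: {n_yes}, No: {n_no}, balanced: {is_balanced}")
```

Using `counts.get(label, 0)` avoids a KeyError when one class is missing entirely.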

In [138…
    # Do not delete this cell
    import numpy as np
    score = dict()
    np.random.seed(333)


Check Module 5g: Encoding Categorical Variables to learn more about data imbalance problems. In particular, check 2.5: Balancing datasets in Module 5.

Do not alter the below cell. It is a test case for Question 1

{'question 1': 'pass'}

#

Question 2

1. Identify the names of the numerical input variables and save them as a LIST

2. Identify the names of the categorical input variables and save them as a LIST

Hints: Remember Attrition is the target (output) variable, so exclude Attrition from both LISTS containing the numerical and categorical input variables. Check Module 5b: Dropping Variables and Module 3e: Helpful Functions (check after minute 4)
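One possible way to separate numerical from categorical inputs is to select columns by dtype after dropping the target. The sketch below uses a made-up dataframe `toy`; the list names `numeric_cols` and `categorical_cols` are illustrative and are not the names the test cell checks.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "MonthlyIncome": [5993, 5130, 2090],
    "Department": ["Sales", "Research & Development", "Sales"],
    "Attrition": ["Yes", "No", "Yes"],
})

# Exclude the target so it appears in neither list.
inputs = toy.drop(columns=["Attrition"])

# Split the remaining columns by dtype.
numeric_cols = inputs.select_dtypes(include="number").columns.tolist()
categorical_cols = inputs.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```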

Do not alter the below cell. It is a test case for Question 2

In [139…
    import pandas as pd
    df = # your code to read the dataset goes in here
    number_of_yes = # your code to find the number
                    # of yeses in the Attrition variable goes in here
    number_of_no = # your code to find the number
                   # of noes in the Attrition variable goes in here

In [140…
    try:
        if (number_of_yes == 237 and number_of_no == 1233):
            score['question 1'] = 'pass'
        else:
            score['question 1'] = 'fail'
    except:
        score['question 1'] = 'fail'
    score

Out[140]:

In [141…
    numerical_variables = # Your code to identify numerical variables goes in here
    categorical_varables = # Your code to identify categorical variables goes in here

In [142… try:


{'question 1': 'pass', 'question 2': 'pass'}

#

Question 3

1. Identify the numerical variables with zero variance (i.e., zero standard deviation) and save them in a LIST

2. Drop these numerical variables with zero variance (i.e., zero standard deviation) from the dataset df. The dataset df should not have these variables going forward.

Hints: For each numerical variable, compute the standard deviation. If the standard deviation is zero, delete (i.e., drop) that variable from the dataset df. Check Module 5b: Dropping Variables
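The procedure in the hint (compute each standard deviation, then drop the columns where it is zero) might look like the following on a made-up dataframe. The name `zero_var` is illustrative.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [41, 49, 37],
    "StandardHours": [80, 80, 80],   # constant column, so its std is 0
    "Department": ["Sales", "HR", "Sales"],
})

# Standard deviation of every numerical column.
numeric_cols = toy.select_dtypes(include="number").columns
stds = toy[numeric_cols].std()

# Keep the names whose standard deviation is exactly zero, then drop them.
zero_var = stds[stds == 0].index.tolist()
toy = toy.drop(columns=zero_var)

print(zero_var)
print(toy.columns.tolist())
```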

Do not alter the below cell. It is a test case for Question 3

        if ((sorted(numerical_variables) ==
                ['Age','DailyRate','DistanceFromHome','Education',
                 'EmployeeCount','EmployeeNumber','EnvironmentSatisfaction',
                 'HourlyRate','JobInvolvement','JobLevel','JobSatisfaction',
                 'MonthlyIncome','MonthlyRate','NumCompaniesWorked','PercentSalaryHike',
                 'PerformanceRating','RelationshipSatisfaction','StandardHours',
                 'StockOptionLevel','TotalWorkingYears','TrainingTimesLastYear',
                 'WorkLifeBalance','YearsAtCompany','YearsInCurrentRole',
                 'YearsSinceLastPromotion','YearsWithCurrManager']) and
            (sorted(categorical_varables) ==
                ['BusinessTravel','Department','EducationField','Gender',
                 'JobRole','MaritalStatus','Over18','OverTime'])):
            score['question 2'] = 'pass'
        else:
            score['question 2'] = 'fail'
    except:
        score['question 2'] = 'fail'
    score

Out[142]:

In [143…
    zero_variance_numerical_variables = # your code to find the
                                        # numerical variables with zero variance goes in here
    df = # your code to drop the zero variance numerical variables goes in here

In [144…
    try:
        if (zero_variance_numerical_variables == ['EmployeeCount', 'StandardHours']):
            score['question 3'] = 'pass'
        else:
            score['question 3'] = 'fail'
    except:
        score['question 3'] = 'fail'
    score


{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass'}

#

Question 4

1. Identify the categorical variables with zero variance (i.e., a single level) and save them in a LIST

2. Drop these categorical variables with zero variance (i.e., a single level) from the dataset df. The dataset df should not have these variables going forward.

Hints:

For each categorical variable, find the number of levels. If the number of levels is 1, delete (i.e., drop) that variable from the dataset df. For example, if a variable named occupation has only "Engineers" across all the rows (i.e., one level), the variable does not contain any information. In other words, it has zero variation.

Check Module 5b: Dropping Variables
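Counting levels per categorical variable can be done with `nunique()`. The sketch below runs on a made-up dataframe; the name `single_level` is illustrative.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Over18": ["Y", "Y", "Y", "Y"],              # one level only
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [41, 49, 37, 33],
})

# Number of distinct levels in each categorical (non-numeric) column.
categorical_cols = toy.select_dtypes(exclude="number").columns
levels = toy[categorical_cols].nunique()

# A column with a single level carries no information, so drop it.
single_level = levels[levels == 1].index.tolist()
toy = toy.drop(columns=single_level)

print(single_level)
print(toy.columns.tolist())
```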

Do not alter the below cell. It is a test case for Question 4

{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass'}

#

Question 5

Out[144]:

In [145…
    zero_variance_categorical_variables = [] # your code to find the
                                             # categorical variables with zero variance goes in here
    df = # your code to drop the zero variance
         # categorical variables goes in here

In [146…
    try:
        if (zero_variance_categorical_variables == ['Over18']):
            score['question 4'] = 'pass'
        else:
            score['question 4'] = 'fail'
    except:
        score['question 4'] = 'fail'
    score

Out[146]:


1. Find the categorical variables with very high variance (i.e., very high cardinality) and save them in a LIST. Use 200 as the threshold. In other words, categorical variables with more than 200 levels should be considered variables with high cardinality (i.e., with high variance).

2. Drop the categorical variables with very high variance (i.e., very high cardinality) from the dataset df. The dataset df should not have these variables going forward.

Hints: For each categorical variable, find the number of levels. If the number of levels is greater than 200, delete (i.e., drop) that variable from the dataset df. Check Module 5b: Dropping Variables
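The same level-counting idea applies with a threshold. In this sketch the dataframe `toy` and the column `EmployeeCode` are made up so that one column exceeds 200 levels. On the real dataset the expected result may well be an empty list.

```python
import pandas as pd

# Toy data: EmployeeCode is unique per row (250 levels), Gender has 2 levels.
n = 250
toy = pd.DataFrame({
    "EmployeeCode": [f"E{i:03d}" for i in range(n)],
    "Gender": ["Female", "Male"] * (n // 2),
})

THRESHOLD = 200  # threshold given in the question

# Columns with more than THRESHOLD distinct levels count as high cardinality.
levels = toy.nunique()
high_card = levels[levels > THRESHOLD].index.tolist()
toy = toy.drop(columns=high_card)

print(high_card)
print(toy.columns.tolist())
```

Dropping an empty list of columns is a no-op in pandas, so the same two lines work whether or not any column crosses the threshold.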

Do not alter the below cell. It is a test case for Question 5

{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass'}

#

Question 6

1. Scale (i.e., standardize) the numerical variables in the dataset using the standardization method and drop the original numerical variables and only keep the standardized ones.

2. The new standardized numerical variables should have the same variable names. For example, the age variable after being standardized should be named the same (i.e., age)

In [147…
    high_cardinality_categorical_variables = [] # your code to find the
                                                # categorical variables with high variance (i.e., cardinality) goes in here
    df = # your code to drop the high cardinality
         # categorical variables goes in here

In [148…
    try:
        if (high_cardinality_categorical_variables == []):
            score['question 5'] = 'pass'
        else:
            score['question 5'] = 'fail'
    except:
        score['question 5'] = 'fail'
    score

Out[148]:

Hints: Feature standardization makes the values of each feature in the data have zero mean (when subtracting the mean in the numerator) and unit variance. This method is widely used for normalization in many machine learning algorithms. Check M5d: Standardization
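Standardization replaces each value x with (x - mean) / standard deviation, computed per column. The sketch below applies this in place so the column names stay the same, as Question 6 requires. It runs on made-up data and uses the population standard deviation (`ddof=0`), which is the convention scikit-learn's `StandardScaler` follows. Whether the course module uses that convention or pandas' default `ddof=1` is an assumption to check against the test case.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration).
toy = pd.DataFrame({
    "Age": [30.0, 40.0, 50.0],
    "DailyRate": [200.0, 400.0, 900.0],
    "Gender": ["Female", "Male", "Male"],
})

numeric_cols = toy.select_dtypes(include="number").columns

# Per-column mean and population standard deviation (ddof=0 is an assumption).
means = toy[numeric_cols].mean()
stds = toy[numeric_cols].std(ddof=0)

# Overwrite the original columns so the names do not change.
toy[numeric_cols] = (toy[numeric_cols] - means) / stds

print(toy)
```

Overwriting the columns in place avoids a separate step to drop the unscaled originals.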

Do not alter the below cell. It is a test case for Question 6

{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass'}

#

Question 7

1. Encode the categorical input variables. Do not encode the target variable Attrition. You will do that in the following question.

Hints: You will create dummies for the categorical variables. Example: Let's say you have a variable named occupation with three levels: Engineer, Teacher, and Manager. We will use binary encoding and create dummies for these levels to encode the occupation variable. Technically, we are converting the categorical variable into new numerical variables. We will have two new variables for occupation: occupation_engineer and occupation_manager. We do not need occupation_teacher because we can infer whether the person is a teacher from the occupation_engineer and occupation_manager variables. For example, if occupation_engineer and occupation_manager are both zero, this person is a teacher. If occupation_engineer is 1, this person is an engineer. Check Module 5g: Encoding Categorical Variables
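The occupation example from the hint can be reproduced with `pd.get_dummies`. This is a sketch on made-up data. Note one difference from the hint: with `drop_first=True`, pandas drops the first level in sorted order (Engineer here) rather than Teacher. Which level is dropped is a convention and does not change the information kept. Whether the assignment expects `drop_first=True` at all is an assumption.

```python
import pandas as pd

# Toy data mirroring the occupation example (values are made up).
toy = pd.DataFrame({
    "occupation": ["Engineer", "Teacher", "Manager", "Teacher"],
    "Age": [41, 49, 37, 33],
})

# One 0/1 column per level, minus the first level in sorted order.
encoded = pd.get_dummies(toy, columns=["occupation"], drop_first=True, dtype=int)

print(encoded)
```

A row with zeros in both `occupation_Manager` and `occupation_Teacher` is an Engineer, which is the same inference the hint describes.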

In [149…
    # your code to standardize numerical variables goes in here
    df =

In [150…
    try:
        if ((df['Age'].max() == 2.526885578888087) and
            (df['DailyRate'].max() == 1.7267301192801021)):
            score['question 6'] = 'pass'
        else:
            score['question 6'] = 'fail'
    except:
        score['question 6'] = 'fail'
    score

Out[150]:


Do not alter the below cell. It is a test case for Question 7

{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass', 'question 7': 'pass'}

#

Question 8

1. Encode the categorical output variable Attrition. Yes should be coded as 1, and No should be coded as 0. The new encoded target variable should be named Attrition. Do not forget to drop the categorical Attrition variable. In other words, you will convert the categorical Attrition variable into a numerical Attrition variable such that Yes is mapped to 1 and No is mapped to 0.

Hints: Check Module 3 and Module 5 videos.
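A minimal sketch of the Yes/No mapping, on made-up data. Assigning the mapped values back to the same column name replaces the categorical version, so no separate drop is needed in this variant.

```python
import pandas as pd

# Toy stand-in for the target column (values are made up for illustration).
toy = pd.DataFrame({"Attrition": ["Yes", "No", "No", "Yes", "No"]})

# Map each label to its numeric code and overwrite the original column.
toy["Attrition"] = toy["Attrition"].map({"Yes": 1, "No": 0})

print(toy["Attrition"].tolist())
print(toy["Attrition"].mean())  # share of Yes values
```

After encoding, the column mean equals the proportion of Yes values, which is what the test case for this question checks on the real data.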

Do not alter the below cell. It is a test case for Question 8

In [151…
    # your code to encode categorical input variables goes in here
    df =

In [152…
    try:
        if ((df['JobRole_Laboratory Technician'].mean() == 0.1761904761904762) and
            (df['EducationField_Marketing'].mean() == 0.10816326530612246)):
            score['question 7'] = 'pass'
        else:
            score['question 7'] = 'fail'
    except:
        score['question 7'] = 'fail'
    score

Out[152]:

In [153…
    # your code to encode categorical output variables Attrition goes in here
    df =

In [154…
    try:
        if (df['Attrition'].mean() == 0.16122448979591836):
            score['question 8'] = 'pass'
        else:
            score['question 8'] = 'fail'


{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass', 'question 7': 'pass', 'question 8': 'pass'}

#

Question 9

1. Balance the dataset

2. Your code should return the input and output variables separately. The input variables will be saved as a dataframe named X. The output variable will be saved as a dataframe named y.

Hints: Imbalanced data refers to a situation where the number of observations is not the same for all the classes in a dataset. For example, the number of churned employees is 4000, while the number of unchurned employees is 40000. This means this dataset is imbalanced. You need to access the target variable Attrition and increase the number of ones (i.e., Yeses) so that the number of zeros (i.e., Noes) and the number of ones (i.e., Yeses) are equal. Check M5g: Encoding Categorical Variables. Balancing datasets is discussed in this video.
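One simple way to increase the number of ones is random oversampling: resample minority rows with replacement until both classes have the same count. The sketch below does this with pandas only, on made-up data. The course module may use a different technique (for example a dedicated resampling library), so treat the method here as an assumption rather than the expected solution.

```python
import pandas as pd

# Toy data: 2 ones (minority) and 8 zeros (majority); values are made up.
toy = pd.DataFrame({
    "Age": list(range(20, 30)),
    "Attrition": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

counts = toy["Attrition"].value_counts()
minority_label = counts.idxmin()
n_extra = int(counts.max() - counts.min())  # rows needed to even out the classes

# Draw the missing rows from the minority class, with replacement.
minority = toy[toy["Attrition"] == minority_label]
extra = minority.sample(n=n_extra, replace=True, random_state=333)
balanced = pd.concat([toy, extra], ignore_index=True)

# Inputs and output as separate dataframes.
X = balanced.drop(columns=["Attrition"])
y = balanced[["Attrition"]]

print(y["Attrition"].value_counts())
```

The double brackets in `balanced[["Attrition"]]` keep y as a dataframe rather than a Series, matching the wording of the question.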

Do not alter the below cell. It is a test case for Question 9

    except:
        score['question 8'] = 'fail'
    score

Out[154]:

In [156…
    # Your code to balance the dataset goes in here
    X = # dataframe containing the input variables after balancing
    y = # dataframe containing the output variable Attrition after balancing

In [157…
    try:
        if ((y.Attrition.value_counts()[0] == 1233) and
            (y.Attrition.value_counts()[1] == 1233)):
            score['question 9'] = 'pass'
        else:
            score['question 9'] = 'fail'
    except:
        score['question 9'] = 'fail'
    score


{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass', 'question 7': 'pass', 'question 8': 'pass', 'question 9': 'pass'}

#

Question 10

Split the dataset into training and testing sets. Using the X and y dataframes, you will create X_train, X_test, y_train, and y_test.

You need to keep 70% of the dataset for training and 30% for testing.

Hints:

You can use the train_test_split function in the sklearn library. Check Module M6c: Classification
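A 70/30 split with `train_test_split` might look like this on made-up data. The `random_state` value is an arbitrary choice for reproducibility and is not specified by the assignment.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy inputs and output with 100 rows (values are made up for illustration).
X = pd.DataFrame({"f1": range(100), "f2": range(100, 200)})
y = pd.DataFrame({"Attrition": [0, 1] * 50})

# test_size=0.3 keeps 70% of the rows for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=333
)

print(X_train.shape, X_test.shape)
```

The function returns the four pieces in the order X_train, X_test, y_train, y_test, so the unpacking order matters.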

Do not alter the below cell. It is a test case for Question 10

{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass', 'question 7': 'pass', 'question 8': 'pass', 'question 9': 'pass', 'question 10': 'pass'}

#

Out[157]:

In [158…
    # your code to create train and test sets goes in here
    X_train, X_test, y_train, y_test = # your code to create train and test sets goes in here

In [159…
    try:
        if ((X_train.shape[0]<1750) and (X_train.shape[0]>1700)):
            score['question 10'] = 'pass'
        else:
            score['question 10'] = 'fail'
    except:
        score['question 10'] = 'fail'
    score

Out[159]:


Question 11

1. Train a knn model where k is 3 using the training dataset.

2. Make predictions using the test dataset

3. Compute accuracy and save as accuracy

Hints: You need to use the KNeighborsClassifier class. Instantiate a knn object and pass the number of neighbors to it. Train the model using X_train and y_train. Then make predictions using X_test. Then compute the accuracy using the predicted values and y_test. Check Module 6d: Model Performance and Module 5c: Classification
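The instantiate, fit, predict, and score sequence from the hint is sketched below on synthetic data: two well-separated clusters generated with NumPy. The data, the seed, and the stratified split are assumptions for illustration, so the accuracy printed here says nothing about the attrition dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data: one cluster around (0, 0), one around (5, 5).
rng = np.random.default_rng(333)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=333, stratify=y
)

# k = 3 neighbors, as in the question.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

predictions = knn.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```

If y_train is a single-column dataframe, passing the column itself (for example `y_train["Attrition"]`) avoids scikit-learn's warning about a column-vector target.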

Do not alter the below cell. It is a test case for Question 11

{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass', 'question 7': 'pass', 'question 8': 'pass', 'question 9': 'pass', 'question 10': 'pass', 'question 11': 'pass'}

#

Question 12

1. Train a Random Forests model where the number of estimators is 100 using the training dataset.

In [160…
    # Your code to train knn, make predictions, and compute accuracy goes in here
    accuracy = # compute accuracy here

In [161…
    try:
        if (accuracy > 0.70):
            score['question 11'] = 'pass'
        else:
            score['question 11'] = 'fail'
    except:
        score['question 11'] = 'fail'
    score

Out[161]:


2. Make predictions using the test dataset

3. Compute accuracy and save as accuracy

Hints: You need to use the RandomForestClassifier class. Instantiate a RandomForestClassifier object and pass the number of estimators to it. Train the model using X_train and y_train. Then make predictions using X_test. Then compute the accuracy using the predicted values and y_test. Check Module 6d: Model Performance and Module 5c: Classification
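The random forest workflow follows the same fit and predict pattern as any scikit-learn classifier. The sketch below uses synthetic, well-separated clusters. The data and the `random_state` values are assumptions for illustration, and only `n_estimators=100` comes from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-class data: one cluster around (0, 0), one around (5, 5).
rng = np.random.default_rng(333)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=333, stratify=y
)

# 100 trees, as in the question; random_state fixes the tree randomness.
forest = RandomForestClassifier(n_estimators=100, random_state=333)
forest.fit(X_train, y_train)

predictions = forest.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
```

Without a fixed `random_state`, repeated runs can give slightly different accuracies because each tree is built from a random bootstrap sample.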

Do not alter the below cell. It is a test case for Question 12

{'question 1': 'pass', 'question 2': 'pass', 'question 3': 'pass', 'question 4': 'pass', 'question 5': 'pass', 'question 6': 'pass', 'question 7': 'pass', 'question 8': 'pass', 'question 9': 'pass', 'question 10': 'pass', 'question 11': 'pass', 'question 12': 'pass'}

#

Your Grade

Your overall score is: 100

In [162…
    # Your code to train random forest, make predictions, and compute accuracy goes in here
    accuracy = # compute accuracy here

In [163…
    try:
        if (accuracy > 0.80):
            score['question 12'] = 'pass'
        else:
            score['question 12'] = 'fail'
    except:
        score['question 12'] = 'fail'
    score

Out[163]:

In [164…
    print('Your overall score is: ', round(list(score.values()).count('pass')*8.3333))