Business analytics
# this will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# To perform statistical analysis
import scipy.stats as stats
import sklearn
# Library to split data
from sklearn.model_selection import train_test_split
#Library to preprocess data
from sklearn.preprocessing import LabelEncoder, StandardScaler
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# To build logistic regression model
from sklearn.linear_model import LogisticRegression
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
precision_recall_curve,
roc_curve,
make_scorer,
)
In [2]:
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.tree import plot_tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
make_scorer,
)
In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import warnings
warnings.filterwarnings('ignore')
In [5]:
df = pd.read_csv("HR-Employee-Attrition.csv")
In [6]:
df.head()  # retrieve the first few rows of the dataframe
Out[6]:
| | Age | Attrition | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EmployeeCount | EmployeeNumber | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | Over18 | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | 1 | 1 | 2 | Female | 94 | 3 | 2 | Sales Executive | 4 | Single | 5993 | 19479 | 8 | Y | Yes | 11 | 3 | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | 1 | 2 | 3 | Male | 61 | 2 | 2 | Research Scientist | 2 | Married | 5130 | 24907 | 1 | Y | No | 23 | 4 | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | 1 | 4 | 4 | Male | 92 | 2 | 1 | Laboratory Technician | 3 | Single | 2090 | 2396 | 6 | Y | Yes | 15 | 3 | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | 1 | 5 | 4 | Female | 56 | 3 | 1 | Research Scientist | 3 | Married | 2909 | 23159 | 1 | Y | Yes | 11 | 3 | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | 1 | 7 | 1 | Male | 40 | 3 | 1 | Laboratory Technician | 2 | Married | 3468 | 16632 | 9 | Y | No | 12 | 3 | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
In [7]:
df["Attrition"].value_counts(normalize=True)  # relative frequency of each class
Out[7]:
No     0.838776
Yes    0.161224
Name: Attrition, dtype: float64
In [8]:
df.describe().round(2).T  # summary statistics for the numeric columns
Out[8]:
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 1470.0 | 36.92 | 9.14 | 18.0 | 30.00 | 36.0 | 43.00 | 60.0 |
| DailyRate | 1470.0 | 802.49 | 403.51 | 102.0 | 465.00 | 802.0 | 1157.00 | 1499.0 |
| DistanceFromHome | 1470.0 | 9.19 | 8.11 | 1.0 | 2.00 | 7.0 | 14.00 | 29.0 |
| Education | 1470.0 | 2.91 | 1.02 | 1.0 | 2.00 | 3.0 | 4.00 | 5.0 |
| EmployeeCount | 1470.0 | 1.00 | 0.00 | 1.0 | 1.00 | 1.0 | 1.00 | 1.0 |
| EmployeeNumber | 1470.0 | 1024.87 | 602.02 | 1.0 | 491.25 | 1020.5 | 1555.75 | 2068.0 |
| EnvironmentSatisfaction | 1470.0 | 2.72 | 1.09 | 1.0 | 2.00 | 3.0 | 4.00 | 4.0 |
| HourlyRate | 1470.0 | 65.89 | 20.33 | 30.0 | 48.00 | 66.0 | 83.75 | 100.0 |
| JobInvolvement | 1470.0 | 2.73 | 0.71 | 1.0 | 2.00 | 3.0 | 3.00 | 4.0 |
| JobLevel | 1470.0 | 2.06 | 1.11 | 1.0 | 1.00 | 2.0 | 3.00 | 5.0 |
| JobSatisfaction | 1470.0 | 2.73 | 1.10 | 1.0 | 2.00 | 3.0 | 4.00 | 4.0 |
| MonthlyIncome | 1470.0 | 6502.93 | 4707.96 | 1009.0 | 2911.00 | 4919.0 | 8379.00 | 19999.0 |
| MonthlyRate | 1470.0 | 14313.10 | 7117.79 | 2094.0 | 8047.00 | 14235.5 | 20461.50 | 26999.0 |
| NumCompaniesWorked | 1470.0 | 2.69 | 2.50 | 0.0 | 1.00 | 2.0 | 4.00 | 9.0 |
| PercentSalaryHike | 1470.0 | 15.21 | 3.66 | 11.0 | 12.00 | 14.0 | 18.00 | 25.0 |
| PerformanceRating | 1470.0 | 3.15 | 0.36 | 3.0 | 3.00 | 3.0 | 3.00 | 4.0 |
| RelationshipSatisfaction | 1470.0 | 2.71 | 1.08 | 1.0 | 2.00 | 3.0 | 4.00 | 4.0 |
| StandardHours | 1470.0 | 80.00 | 0.00 | 80.0 | 80.00 | 80.0 | 80.00 | 80.0 |
| StockOptionLevel | 1470.0 | 0.79 | 0.85 | 0.0 | 0.00 | 1.0 | 1.00 | 3.0 |
| TotalWorkingYears | 1470.0 | 11.28 | 7.78 | 0.0 | 6.00 | 10.0 | 15.00 | 40.0 |
| TrainingTimesLastYear | 1470.0 | 2.80 | 1.29 | 0.0 | 2.00 | 3.0 | 3.00 | 6.0 |
| WorkLifeBalance | 1470.0 | 2.76 | 0.71 | 1.0 | 2.00 | 3.0 | 3.00 | 4.0 |
| YearsAtCompany | 1470.0 | 7.01 | 6.13 | 0.0 | 3.00 | 5.0 | 9.00 | 40.0 |
| YearsInCurrentRole | 1470.0 | 4.23 | 3.62 | 0.0 | 2.00 | 3.0 | 7.00 | 18.0 |
| YearsSinceLastPromotion | 1470.0 | 2.19 | 3.22 | 0.0 | 0.00 | 1.0 | 3.00 | 15.0 |
| YearsWithCurrManager | 1470.0 | 4.12 | 3.57 | 0.0 | 2.00 | 3.0 | 7.00 | 17.0 |
In [9]:
df.info()  # display a concise summary of the dataframe
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Age                       1470 non-null   int64
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64
 6   Education                 1470 non-null   int64
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64
 9   EmployeeNumber            1470 non-null   int64
 10  EnvironmentSatisfaction   1470 non-null   int64
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64
 13  JobInvolvement            1470 non-null   int64
 14  JobLevel                  1470 non-null   int64
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64
 19  MonthlyRate               1470 non-null   int64
 20  NumCompaniesWorked        1470 non-null   int64
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64
 24  PerformanceRating         1470 non-null   int64
 25  RelationshipSatisfaction  1470 non-null   int64
 26  StandardHours             1470 non-null   int64
 27  StockOptionLevel          1470 non-null   int64
 28  TotalWorkingYears         1470 non-null   int64
 29  TrainingTimesLastYear     1470 non-null   int64
 30  WorkLifeBalance           1470 non-null   int64
 31  YearsAtCompany            1470 non-null   int64
 32  YearsInCurrentRole        1470 non-null   int64
 33  YearsSinceLastPromotion   1470 non-null   int64
 34  YearsWithCurrManager      1470 non-null   int64
dtypes: int64(26), object(9)
memory usage: 402.1+ KB
In [10]:
df.drop(["EmployeeNumber", "EmployeeCount", "StandardHours", "Over18"], axis=1, inplace=True)  # remove the specified columns from the dataframe
In [11]:
df.isnull().sum()  # count the missing values in each column
Out[11]:
Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64
In [12]:
# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns
In [13]:
# Apply Label Encoding (each text category is replaced by an integer code)
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])
In [14]:
# Separating features and target variable
X = df.drop('Attrition', axis=1)
Y = df['Attrition']
In [15]:
X.nunique()  # count the number of unique values in each column
Out[15]:
Age                           43
BusinessTravel                 3
DailyRate                    886
Department                     3
DistanceFromHome              29
Education                      5
EducationField                 6
EnvironmentSatisfaction        4
Gender                         2
HourlyRate                    71
JobInvolvement                 4
JobLevel                       5
JobRole                        9
JobSatisfaction                4
MaritalStatus                  3
MonthlyIncome               1349
MonthlyRate                 1427
NumCompaniesWorked            10
OverTime                       2
PercentSalaryHike             15
PerformanceRating              2
RelationshipSatisfaction       4
StockOptionLevel               4
TotalWorkingYears             40
TrainingTimesLastYear          7
WorkLifeBalance                4
YearsAtCompany                37
YearsInCurrentRole            19
YearsSinceLastPromotion       16
YearsWithCurrManager          18
dtype: int64
EDA
In [16]:
# Plot a histogram and a boxplot for every column
for col in df.columns:
    print(col)
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    df[col].hist(bins=10, grid=False)
    plt.ylabel('count')
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df[col])
    plt.show()
Age
Attrition
BusinessTravel
DailyRate
Department
DistanceFromHome
Education
EducationField
EnvironmentSatisfaction
Gender
HourlyRate
JobInvolvement
JobLevel
JobRole
JobSatisfaction
MaritalStatus
MonthlyIncome
MonthlyRate
NumCompaniesWorked
OverTime
PercentSalaryHike
PerformanceRating
RelationshipSatisfaction
StockOptionLevel
TotalWorkingYears
TrainingTimesLastYear
WorkLifeBalance
YearsAtCompany
YearsInCurrentRole
YearsSinceLastPromotion
YearsWithCurrManager
In [17]:
# Correlation heatmap of all (now numeric) columns
plt.figure(figsize=(20, 20))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(df.corr(), annot=True, fmt='.2f', cmap=cmap)
plt.show()
The heatmap shows strong positive correlations between several pairs of variables: age and total working years (0.68), job level and monthly income (0.95), job level and total working years (0.78), monthly income and total working years (0.77), percent salary hike and performance rating (0.77), years at the company and years in the current role (0.76), and years at the company and years with the current manager (0.77).
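Reading pairs off a 31x31 heatmap is error-prone, so the same information can be pulled out as a sorted list. The sketch below is one way to do that; the helper name `top_correlated_pairs` and the 0.7 cutoff are illustrative choices, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(frame, min_abs_corr=0.7):
    """Return feature pairs whose absolute Pearson correlation is at least min_abs_corr."""
    corr = frame.select_dtypes(include="number").corr()
    # Keep only the strict upper triangle so each pair appears once and
    # the diagonal (self-correlation of 1.0) is excluded.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    strong = pairs[pairs.abs() >= min_abs_corr]
    # Strongest relationships first, regardless of sign.
    return strong.sort_values(key=lambda s: s.abs(), ascending=False)
```

In the notebook this would be called as `top_correlated_pairs(df, 0.7)`; lowering the cutoff to 0.65 would also include weaker pairs such as age and total working years.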
In [18]:
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Set2",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its count or percentage
    plt.show()  # show the plot
In [19]:
labeled_barplot(df, "Attrition", perc=True)
About 83.9% of employees are labeled '0' (no attrition) and 16.1% are labeled '1' (attrition), so the target classes are imbalanced.
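With a split this uneven, accuracy alone can look good for a model that never predicts attrition. A minimal sketch of that baseline follows; the helper name and the 84/16 example labels are illustrative stand-ins, not the actual column.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    counts = Counter(labels)
    _, majority_count = counts.most_common(1)[0]
    return majority_count / len(labels)


# Illustrative labels with roughly the same 84/16 split as the Attrition column.
example_labels = [0] * 84 + [1] * 16
print(majority_baseline_accuracy(example_labels))
```

Any classifier evaluated later should be compared against this figure, and against recall and precision on class 1, rather than judged on accuracy in isolation.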
In [20]:
labeled_barplot(df, "Gender", perc=True)
There are more male employees, accounting for 60%, than female employees, who make up 40%.
In [21]:
labeled_barplot(df, "BusinessTravel", perc=True)
71% of employees rarely travel, 10.2% don't travel, and 18.8% travel frequently.
In [22]:
labeled_barplot(df, "MaritalStatus", perc=True)
22.2% of employees are divorced, 45.8% are married, and 32% are single.
In [23]:
plt.figure(figsize=(5, 3))
sns.boxplot(x="Attrition", y="Age", data=df).set(title="Attrition Vs Age")
Out[23]:
[Text(0.5, 1.0, 'Attrition Vs Age')]
Employees who left the company tend to be younger, as indicated by the lower median age and the narrower interquartile range of the orange box. The company might be retaining older employees better, or younger employees might be leaving for reasons such as career advancement, higher education, or other job opportunities.
In [24]:
plt.figure(figsize=(5, 3))
sns.boxplot(x="Attrition", y="YearsAtCompany", data=df).set(title="Attrition Vs YearsAtCompany")
Out[24]:
[Text(0.5, 1.0, 'Attrition Vs YearsAtCompany')]
Employees who leave the company tend to have fewer years of service. The number of years at the company also varies more among current employees than among those who have left.
In [25]:
plt.figure(figsize=(5, 3))
sns.boxplot(x="Attrition", y="StockOptionLevel", data=df).set(title="Attrition Vs StockOptionLevel")
Out[25]:
[Text(0.5, 1.0, 'Attrition Vs StockOptionLevel')]
Employees who stayed (0) tend to have higher stock option levels than those who left (1).
In [26]:
plt.figure(figsize=(5, 3))
sns.boxplot(x="Attrition", y="DistanceFromHome", data=df).set(title="Attrition Vs DistanceFromHome")
Out[26]:
[Text(0.5, 1.0, 'Attrition Vs DistanceFromHome')]
Employees who left tend to live farther from their workplace than those who stayed.
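The boxplot readings above can be checked numerically by comparing group medians. The helper below is a small sketch for that purpose; its name is hypothetical, and it assumes the target column is called `Attrition` as in this notebook.

```python
import pandas as pd


def medians_by_attrition(frame, columns, target="Attrition"):
    """Median of each listed column, split by the target label."""
    return frame.groupby(target)[list(columns)].median()
```

For the four plots above, the call would be `medians_by_attrition(df, ["Age", "YearsAtCompany", "StockOptionLevel", "DistanceFromHome"])`, which returns one row per attrition label.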
In [27]:
for column in X.columns:
    # Get unique values in each column
    unique_values = X[column].unique()
    # Print the column name and its unique values
    print(f"Unique values in '{column}': {unique_values}")
Unique values in 'Age': [41 49 37 33 27 32 59 30 38 36 35 29 31 34 28 22 53 24 21 42 44 46 39 43 50 26 48 55 45 56 23 51 40 54 58 20 25 19 57 52 47 18 60] Unique values in 'BusinessTravel': [2 1 0] Unique values in 'DailyRate': [1102 279 1373 1392 591 1005 1324 1358 216 1299 809 153 670 1346 103 1389 334 1123 1219 371 673 1218 419 391 699 1282 1125 691 477 705 924 1459 125 895 813 1273 869 890 852 1141 464 1240 1357 994 721 1360 1065 408 1211 1229 626 1434 1488 1097 1443 515 853 1142 655 1115 427 653 989 1435 1223 836 1195 1339 664 318 1225 1328 1082 548 132 746 776 193 397 945 1214 111 573 1153 1400 541 432 288 669 530 632 1334 638 1093 1217 1353 120 682 489 807 827 871 665 1040 1420 240 1280 534 1456 658 142 1127 1031 1189 1354 1467 922 394 1312 750 441 684 249 841 147 528 594 470 957 542 802 1355 1150 1329 959 1033 1316 364 438 689 201 1427 857 933 1181 1395 662 1436 194 967 1496 1169 1145 630 303 1256 440 1450 1452 465 702 1157 602 1480 1268 713 134 526 1380 140 629 1356 328 1084 931 692 1069 313 894 556 1344 290 138 926 1261 472 1002 878 905 1180 121 1136 635 1151 644 1045 829 1242 1469 896 992 1052 1147 1396 663 119 979 319 1413 944 1323 532 818 854 1034 771 1401 1431 976 1411 1300 252 1327 832 1017 1199 504 505 916 1247 685 269 1416 833 307 1311 128 488 529 1210 1463 675 1385 1403 452 666 1158 228 996 728 1315 322 1479 797 1070 442 496 1372 920 688 1449 1117 636 506 444 950 889 555 230 1232 566 1302 812 1476 218 1132 1105 906 849 390 106 1249 192 553 117 185 1091 723 1220 588 1377 1018 1275 798 672 1162 508 1482 559 210 928 1001 549 1124 738 570 1130 1192 343 144 1296 1309 483 810 544 1062 1319 641 1332 756 845 593 1171 350 921 1144 143 1046 575 156 1283 755 304 1178 329 1362 1371 202 253 164 1107 759 1305 982 821 1381 480 1473 891 1063 645 1490 317 422 1485 1368 1448 296 1398 1349 986 1099 1116 1499 983 1009 1303 1274 1277 587 413 1276 988 1474 163 267 619 302 443 828 561 426 232 1306 1094 509 775 195 258 471 799 956 535 1495 446 1245 703 823 1246 622 1287 
448 254 1365 538 525 558 782 362 1236 1112 204 1343 604 1216 646 160 238 1397 306 991 482 1176 913 1076 727 885 243 806 817 1410 1207 1442 693 929 562 608 580 970 1179 294 314 316 654 168 381 217 501 650 141 804 975 1090 346 430 268 167 621 527 883 954 310 719 725 715 657 1146 182 376 571 384 791 1111 1243 1092 1325 805 213 118 676 1252 286 1258 932 1041 859 720 946 1184 436 589 760 887 1318 625 180 586 1012 661 930 342 1230 1271 1278 607 130 300 583 1418 1269 379 395 1265 1222 341 868 1231 102 881 1383 1075 374 1086 781 177 500 1425 1454 617 1085 995 1122 618 546 462 1198 1272 154 1137 1188 188 1333 867 263 938 129 616 498 1404 1053 289 1376 231 152 882 903 1379 335 722 461 974 1126 840 1134 248 955 939 1391 1206 287 1441 109 1066 277 466 1055 265 135 247 1035 266 145 1038 1234 1109 1089 788 124 660 1186 1464 796 415 769 1003 1366 330 1492 1204 309 1330 469 697 1262 1050 770 406 203 1308 984 439 793 1451 1182 174 490 718 433 773 603 874 367 199 481 647 1384 902 819 862 1457 977 942 1402 1421 1361 917 200 150 179 696 116 363 107 1465 458 1212 1103 966 1010 326 1098 969 1167 694 1320 536 373 599 251 131 237 1429 648 735 531 429 968 879 640 412 848 360 1138 325 1322 299 1030 634 524 256 1060 935 495 282 206 943 523 507 601 855 1291 1405 1369 999 1202 285 404 736 1498 1200 1439 499 205 683 1462 949 652 332 1475 337 971 1174 667 560 172 383 1255 359 401 377 592 1445 1221 866 981 447 1326 748 990 405 115 790 830 1193 1423 467 271 410 1083 516 224 136 1029 333 1440 674 1342 898 824 492 598 740 888 1288 104 1108 479 1351 474 437 884 1370 264 1059 563 457 1313 241 1015 336 1387 170 208 671 711 737 1470 365 763 567 486 772 301 311 584 880 392 148 708 1259 786 370 678 146 581 918 1238 585 741 552 369 717 543 964 792 611 176 897 600 1054 428 181 211 1079 590 305 953 478 1375 244 511 1294 196 734 1239 1253 1128 1336 234 766 261 1194 431 572 1422 1297 574 355 207 706 280 726 414 352 1224 459 1254 1131 835 1172 1266 783 219 1213 1096 1251 1394 605 1064 1337 937 157 754 1168 155 
1444 189 911 1321 1154 557 642 801 161 1382 1037 105 582 704 345 1120 1378 468 613 1023 628] Unique values in 'Department': [2 1 0] Unique values in 'DistanceFromHome': [ 1 8 2 3 24 23 27 16 15 26 19 21 5 11 9 7 6 10 4 25 12 18 29 22 14 20 28 17 13] Unique values in 'Education': [2 1 4 3 5] Unique values in 'EducationField': [1 4 3 2 5 0] Unique values in 'EnvironmentSatisfaction': [2 3 4 1] Unique values in 'Gender': [0 1] Unique values in 'HourlyRate': [ 94 61 92 56 40 79 81 67 44 84 49 31 93 50 51 80 96 78 45 82 53 83 58 72 48 42 41 86 97 75 33 37 73 98 36 47 71 30 43 99 59 95 57 76 87 66 55 32 52 70 62 64 63 60 100 46 39 77 35 91 54 34 90 65 88 85 89 68 69 74 38] Unique values in 'JobInvolvement': [3 2 4 1] Unique values in 'JobLevel': [2 1 3 4 5] Unique values in 'JobRole': [7 6 2 4 0 3 8 5 1] Unique values in 'JobSatisfaction': [4 2 3 1] Unique values in 'MaritalStatus': [2 1 0] Unique values in 'MonthlyIncome': [5993 5130 2090 ... 9991 5390 4404] Unique values in 'MonthlyRate': [19479 24907 2396 ... 
5174 13243 10228]
Unique values in 'NumCompaniesWorked': [8 1 6 9 0 4 5 2 7 3]
Unique values in 'OverTime': [1 0]
Unique values in 'PercentSalaryHike': [11 23 15 12 13 20 22 21 17 14 16 18 19 24 25]
Unique values in 'PerformanceRating': [3 4]
Unique values in 'RelationshipSatisfaction': [1 4 2 3]
Unique values in 'StockOptionLevel': [0 1 3 2]
Unique values in 'TotalWorkingYears': [ 8 10 7 6 12 1 17 5 3 31 13 0 26 24 22 9 19 2 23 14 15 4 29 28 21 25 20 11 16 37 38 30 40 18 36 34 32 33 35 27]
Unique values in 'TrainingTimesLastYear': [0 3 2 5 1 4 6]
Unique values in 'WorkLifeBalance': [1 3 2 4]
Unique values in 'YearsAtCompany': [ 6 10 0 8 2 7 1 9 5 4 25 3 12 14 22 15 27 21 17 11 13 37 16 20 40 24 33 19 36 18 29 31 32 34 26 30 23]
Unique values in 'YearsInCurrentRole': [ 4 7 0 2 5 9 8 3 6 13 1 15 14 16 11 10 12 18 17]
Unique values in 'YearsSinceLastPromotion': [ 0 1 3 2 7 4 8 6 5 15 9 13 12 10 11 14]
Unique values in 'YearsWithCurrManager': [ 5 7 0 2 6 8 3 11 17 1 4 12 9 10 15 13 16 14]
In [28]:
# Every column is already integer-encoded, so get_dummies leaves X unchanged here
X = pd.get_dummies(X, drop_first=True)
X.sample(10)  # display a random sample of 10 rows
Out[28]:
| | Age | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | EnvironmentSatisfaction | Gender | HourlyRate | JobInvolvement | JobLevel | JobRole | JobSatisfaction | MaritalStatus | MonthlyIncome | MonthlyRate | NumCompaniesWorked | OverTime | PercentSalaryHike | PerformanceRating | RelationshipSatisfaction | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 244 | 45 | 2 | 252 | 1 | 1 | 3 | 4 | 3 | 1 | 70 | 4 | 5 | 3 | 4 | 1 | 19202 | 15970 | 0 | 0 | 11 | 3 | 3 | 1 | 25 | 2 | 3 | 24 | 0 | 1 | 7 |
| 1180 | 36 | 2 | 311 | 1 | 7 | 3 | 1 | 1 | 1 | 77 | 3 | 1 | 2 | 2 | 2 | 2013 | 10950 | 2 | 0 | 11 | 3 | 3 | 0 | 15 | 4 | 3 | 4 | 3 | 1 | 3 |
| 1173 | 36 | 2 | 711 | 1 | 5 | 4 | 1 | 2 | 0 | 42 | 3 | 3 | 0 | 1 | 1 | 8008 | 22792 | 4 | 0 | 12 | 3 | 3 | 2 | 9 | 6 | 3 | 3 | 2 | 0 | 2 |
| 494 | 34 | 2 | 204 | 2 | 14 | 3 | 5 | 3 | 0 | 31 | 3 | 1 | 8 | 3 | 0 | 2579 | 2912 | 1 | 1 | 18 | 3 | 4 | 2 | 8 | 3 | 3 | 8 | 2 | 0 | 6 |
| 700 | 58 | 2 | 289 | 1 | 2 | 3 | 5 | 4 | 1 | 51 | 3 | 1 | 6 | 3 | 2 | 2479 | 26227 | 4 | 0 | 24 | 4 | 1 | 0 | 7 | 4 | 3 | 1 | 0 | 0 | 0 |
| 152 | 53 | 2 | 1436 | 2 | 6 | 2 | 2 | 2 | 1 | 34 | 3 | 2 | 8 | 3 | 1 | 2306 | 16047 | 2 | 1 | 20 | 4 | 4 | 1 | 13 | 3 | 1 | 7 | 7 | 4 | 5 |
| 1156 | 40 | 2 | 884 | 1 | 15 | 3 | 1 | 1 | 0 | 80 | 2 | 3 | 4 | 3 | 1 | 10435 | 25800 | 1 | 0 | 13 | 3 | 4 | 2 | 18 | 2 | 3 | 18 | 15 | 14 | 12 |
| 658 | 44 | 2 | 661 | 1 | 9 | 2 | 1 | 2 | 1 | 61 | 3 | 1 | 6 | 1 | 1 | 2559 | 7508 | 1 | 1 | 13 | 3 | 4 | 0 | 8 | 0 | 3 | 8 | 7 | 7 | 1 |
| 721 | 50 | 2 | 939 | 1 | 24 | 3 | 1 | 4 | 1 | 95 | 3 | 4 | 4 | 3 | 1 | 13973 | 4161 | 3 | 1 | 18 | 3 | 4 | 1 | 22 | 2 | 3 | 12 | 11 | 1 | 5 |
| 41 | 27 | 2 | 1240 | 1 | 2 | 4 | 1 | 4 | 0 | 33 | 3 | 1 | 2 | 1 | 0 | 2341 | 19715 | 1 | 0 | 13 | 3 | 4 | 1 | 1 | 6 | 3 | 1 | 0 | 0 | 0 |
In [29]:
# Split the data into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)
In [30]:
# Relative class frequencies in the full, training, and test targets
print(Y.value_counts(1))
print(y_train.value_counts(1))
print(y_test.value_counts(1))
0    0.838776
1    0.161224
Name: Attrition, dtype: float64
0    0.844509
1    0.155491
Name: Attrition, dtype: float64
0    0.825397
1    0.174603
Name: Attrition, dtype: float64
In [31]:
def model_performance_classification_LR(model, predictors, target, thresholds=(0.5,)):
    """Compute accuracy, recall, precision and F1 at each probability threshold."""
    # Predict probabilities for the positive class
    y_probs = model.predict_proba(predictors)[:, 1]
    # Collect one row of metrics per threshold
    # (DataFrame.append was removed in pandas 2.0, so the frame is built from a list)
    rows = []
    for threshold in thresholds:
        # Convert probabilities to binary predictions based on the threshold
        y_pred = (y_probs >= threshold).astype(int)
        rows.append(
            {
                "Threshold": threshold,
                "Accuracy": accuracy_score(target, y_pred),
                "Recall": recall_score(target, y_pred),
                "Precision": precision_score(target, y_pred),
                "F1": f1_score(target, y_pred),
            }
        )
    return pd.DataFrame(rows, columns=["Threshold", "Accuracy", "Recall", "Precision", "F1"])
In [32]:
def make_confusion_matrix_LR(model, predictors, target, threshold=0.5):
    # Predict probabilities for the positive class
    y_probs = model.predict_proba(predictors)[:, 1]
    # Convert probabilities to class labels based on the threshold
    y_pred = (y_probs >= threshold).astype(int)
    # Generate the confusion matrix
    cm = confusion_matrix(target, y_pred)
    # Create labels for the confusion matrix
    group_names = ["True Neg", "False Pos", "False Neg", "True Pos"]
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [
        f"{v1}\n{v2}\n{v3}"
        for v1, v2, v3 in zip(group_names, group_counts, group_percentages)
    ]
    labels = np.asarray(labels).reshape(2, 2)
    # Plot the confusion matrix
    fig, ax = plt.subplots(figsize=(4, 4))  # set figure size to 4x4 inches
    sns.heatmap(cm, annot=labels, fmt="", cmap="Blues", annot_kws={"size": 10}, ax=ax)
    ax.set_ylabel('Actual')
    ax.set_xlabel('Predicted')
    plt.show()
In [33]:
# fitting the Logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
Out[33]:
LogisticRegression()
In [34]:
# Calculating different metrics for train and test sets
log_reg_train_perf = model_performance_classification_LR(
log_reg, X_train, y_train, thresholds=[0.5]
)
print("Training performance:\n", log_reg_train_perf)
log_reg_test_perf = model_performance_classification_LR(log_reg, X_test, y_test, thresholds=[0.5])
print("Testing performance:\n", log_reg_test_perf)
Training performance:
Threshold Accuracy Recall Precision F1
0 0.5 0.844509 0.0 0.0 0.0
Testing performance:
Threshold Accuracy Recall Precision F1
0 0.5 0.823129 0.0 0.0 0.0
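Accuracy looks high, but recall, precision, and F1 are all 0 on both sets. The training accuracy (0.8445) equals the share of class 0 in the training data, which means the model labels essentially every employee as 'no attrition' and never correctly identifies a leaver.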
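One possible contributor, stated here as an assumption rather than a verified diagnosis, is that the features are on very different scales (for example MonthlyRate in the tens of thousands next to 1-4 satisfaction scores), which can keep the default solver from converging. `StandardScaler` is imported at the top of the notebook but never applied. The sketch below wraps scaling and the model in a pipeline; the helper name and `max_iter=1000` are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def fit_scaled_log_reg(X_train, y_train, max_iter=1000):
    """Fit logistic regression on standardized features.

    Assumption: feature scaling is part of the problem; this is not
    confirmed by the results shown in the notebook.
    """
    model = make_pipeline(
        StandardScaler(),  # rescale each feature to zero mean, unit variance
        LogisticRegression(max_iter=max_iter),
    )
    model.fit(X_train, y_train)
    return model
```

The returned pipeline exposes `predict_proba`, so it could be passed to `model_performance_classification_LR` in place of `log_reg` to see whether recall at the 0.5 threshold changes.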
In [35]:
# creating confusion matrix for test set (more relevant)
make_confusion_matrix_LR(log_reg, X_test, y_test, threshold=0.5)
At the 0.5 threshold the model classifies almost every test case as negative and does not identify a single positive case:
TN: 363 true negatives, 82.31% of the predictions.
FP: 1 false positive, 0.23% of the predictions.
FN: 77 false negatives, 17.46% of the predictions.
TP: 0 true positives, 0.00% of the predictions.
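The four counts are enough to reproduce the reported scores by hand, which is a useful sanity check on the metric table. A minimal sketch, with a hypothetical helper name:

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Accuracy, recall, precision and F1 from the four confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    # Guard against division by zero when a class is never predicted or never present.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "recall": recall, "precision": precision, "f1": f1}


# Counts read off the test-set confusion matrix at threshold 0.5.
print(metrics_from_counts(tn=363, fp=1, fn=77, tp=0))
```

With these counts the accuracy works out to 363/441, matching the 0.823129 in the test performance table, while recall, precision, and F1 are all 0.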
In [36]:
# Get predicted probabilities for the positive class
y_scores = log_reg.predict_proba(X_test)[:, 1]
# Calculate precision and recall for various thresholds
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)
# Adding last threshold of 1 to match the size of precision and recall arrays
thresholds = np.append(thresholds, 1)
# Plot precision and recall for various thresholds
plt.figure(figsize=(8, 6))
plt.plot(thresholds, precision, label='Precision', marker='.', linestyle='--', color='blue')
plt.plot(thresholds, recall, label='Recall', marker='.', linestyle='--', color='red')
plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')
plt.title('Precision and Recall Curves')
plt.legend()
plt.grid(True)
plt.show()
Precision and recall appear to cross at a threshold of about 0.24.
In [37]:
# Find the threshold where the precision and recall curves are closest to each other
closest_zero = np.argmin(np.abs(precision - recall))
optimal_threshold = thresholds[closest_zero]
print("Optimal threshold:", optimal_threshold)
Optimal threshold: 0.24787148455194857
In [38]:
log_reg_test_perf = model_performance_classification_LR(log_reg, X_test, y_test, thresholds=[0.62])
print("Testing performance:\n", log_reg_test_perf)
Testing performance:
Threshold Accuracy Recall Precision F1
0 0.62 0.825397 0.0 0.0 0.0
At a threshold of 0.62 the results are almost identical to those at 0.5; only the accuracy is marginally higher (0.8254 versus 0.8231). Recall, precision, and F1 remain 0 in both cases, so the model is not correctly identifying any positive case, and raising the threshold further cannot help.
In [39]:
log_reg_test_perf = model_performance_classification_LR(log_reg, X_test, y_test, thresholds=[0.24])
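A quick way to see why a higher threshold changes nothing is to count how many cases would be labeled positive at each candidate threshold. The helper below is a sketch with a hypothetical name.

```python
import numpy as np


def predicted_positives(probabilities, thresholds):
    """Count how many samples would be labelled positive at each threshold."""
    probabilities = np.asarray(probabilities)
    return {t: int((probabilities >= t).sum()) for t in thresholds}
```

In the notebook this would be called as `predicted_positives(log_reg.predict_proba(X_test)[:, 1], [0.24, 0.5, 0.62])`. The metrics reported here imply one predicted positive at 0.5 and none at 0.62, which is why precision and recall stay at 0.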
print("Testing performance:\n", log_reg_test_perf)
Testing performance:
Threshold Accuracy Recall Precision F1
0 0.24 0.768707 0.376623 0.349398 0.3625
With a lower threshold of 0.24, the model predicts about 76.87% of the outcomes correctly. It identifies about 37.66% of the actual positive cases (recall), and when it predicts a positive it is right about 34.94% of the time (precision), which indicates a relatively high number of false positives. The F1 score is 0.3625. Precision and recall are close to each other at this threshold, but both are low, so the low F1 reflects weak detection of the positive class rather than an imbalance between the two metrics. Accuracy is also below the 82.5% that would be obtained by labeling every test case negative.
In [40]:
make_confusion_matrix_LR(log_reg, X_test, y_test, threshold=0.24)
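Choosing the point where precision equals recall is one heuristic; another common one is to pick the threshold that maximizes F1, the harmonic mean of the two. The sketch below assumes its inputs follow the layout returned by scikit-learn's `precision_recall_curve`; the helper name is hypothetical, and this is an alternative to the notebook's approach rather than a description of it.

```python
import numpy as np


def best_f1_threshold(precision, recall, thresholds):
    """Return (threshold, f1) for the threshold with the highest F1 score.

    precision and recall may have one more element than thresholds, as
    returned by sklearn.metrics.precision_recall_curve; the extra trailing
    element is ignored.
    """
    precision = np.asarray(precision, dtype=float)[: len(thresholds)]
    recall = np.asarray(recall, dtype=float)[: len(thresholds)]
    denom = precision + recall
    # F1 = 2PR / (P + R); defined as 0 where both precision and recall are 0.
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])
```

Because the threshold would be selected on the same test set it is evaluated on, the resulting score is optimistic; a separate validation split would give a fairer estimate.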
At a threshold of 0.24 the model still predicts negative cases far more reliably than positive ones. False positives (54) outnumber true positives (29), which confirms the low precision, and false negatives (48) also exceed true positives, consistent with the low recall. Lowering the decision threshold below the standard 0.5 is typically an attempt to capture more true positives; here, even with the lowered threshold, the model misses 48 of the 77 actual positives.
TN: the model correctly predicted the negative class 310 times, accounting for 70.29% of all predictions.
FP: the model incorrectly predicted the positive class 54 times.
FN: the model incorrectly predicted the negative class 48 times.
TP: the model correctly predicted the positive class 29 times.
In [41]:
# Relative class frequencies in the training and test targets
print(y_train.value_counts(1))
print(y_test.value_counts(1))
0    0.844509
1    0.155491
Name: Attrition, dtype: float64
0    0.825397
1    0.174603
Name: Attrition, dtype: float64
In [42]:
dTree = DecisionTreeClassifier(criterion="gini", random_state=1)
dTree.fit(X_train, y_train)
Out[42]:
DecisionTreeClassifier(random_state=1)
In [43]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification(model, predictors, target):
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    return df_perf
In [44]:
# Calculating different metrics
dTree_model_train_perf = model_performance_classification(
dTree, X_train, y_train
)
print("Training performance:\n", dTree_model_train_perf)
dTree_model_test_perf = model_performance_classification(dTree, X_test, y_test)
print("Testing performance:\n", dTree_model_test_perf)
Training performance:
Accuracy Recall Precision F1
0 1.0 1.0 1.0 1.0
Testing performance:
Accuracy Recall Precision F1
0 0.773243 0.324675 0.342466 0.333333
On the test data, accuracy drops to about 77.32%, recall to 32.47%, precision to 34.25%, and F1 to 0.33. Because the training scores are perfect on every metric while the test scores are far lower, the decision tree is overfitting the training data.
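A common response to this kind of overfitting is to limit how far the tree can grow and pick the limits by cross-validation. `GridSearchCV` and `make_scorer` are imported at the top of the notebook but not used in this part; the sketch below shows one way they could be applied. The helper name and the grid values are hypothetical and not tuned for this dataset.

```python
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier


def tune_tree(X_train, y_train, cv=5):
    """Grid-search a few pre-pruning settings, scoring by recall on class 1."""
    # Hypothetical grid: illustrative values only.
    param_grid = {
        "max_depth": [2, 3, 4, 5],
        "min_samples_leaf": [1, 5, 10],
    }
    search = GridSearchCV(
        DecisionTreeClassifier(criterion="gini", random_state=1),
        param_grid,
        scoring=make_scorer(recall_score),
        cv=cv,
    )
    search.fit(X_train, y_train)
    return search.best_estimator_, search.best_params_
```

Recall is used as the scoring metric on the assumption that missing an employee who leaves is the costlier error; if false alarms matter as much, `f1_score` could be passed to `make_scorer` instead. The returned estimator can be evaluated with `model_performance_classification` in the same way as `dTree`.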
In [45]:
# function to create Confusion matrix
def make_confusion_matrix(model, predictors, target, figsize=(5, 5)):
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    group_names = ["True Neg", "False Pos", "False Neg", "True Pos"]
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [
        f"{v1}\n{v2}\n{v3}"
        for v1, v2, v3 in zip(group_names, group_counts, group_percentages)
    ]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=figsize)
    sns.heatmap(cm, annot=labels, fmt="", cmap="Blues")
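The labels in `make_confusion_matrix` rely on the order in which the matrix is flattened. For 0/1 labels, scikit-learn's `confusion_matrix` puts actual classes on the rows and predicted classes on the columns, so a row-major flatten gives TN, FP, FN, TP. The following pure-Python sketch mirrors that layout on made-up labels; `binary_confusion_matrix` is a hypothetical stand-in, not the scikit-learn function.

```python
def binary_confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for 0/1 labels (rows = actual, columns = predicted)."""
    cm = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        cm[actual][predicted] += 1
    return cm


# Made-up labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

cm = binary_confusion_matrix(y_true, y_pred)
flat = [count for row in cm for count in row]  # row-major order, like cm.flatten()
group_names = ["True Neg", "False Pos", "False Neg", "True Pos"]
labelled = dict(zip(group_names, flat))
print(labelled)
```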
In [46]:
make_confusion_matrix(dTree, X_train, y_train, figsize=(4, 3))
On the training data the model reaches 100% accuracy, with no false positives or false negatives. True negatives (TN): 84.45%, the negative class was predicted correctly 869 times. False positives (FP): 0.00%, the model made no false positive predictions. False negatives (FN): 0.00%, the model missed no positive cases. True positives (TP): 15.55%, the positive class was predicted correctly 160 times.
In [47]:
feature_names = list(X.columns)  # extracting the column names from dataframe X
print(feature_names)  # prints the list of feature names
['Age', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome', 'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']
In [48]:
from sklearn.tree import plot_tree
plt.figure(figsize=(20, 30))
plot_tree(
dTree,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names = ["0","1"]
)
plt.show()
# Text report showing the rules of a decision tree
print(tree.export_text(dTree, feature_names=feature_names, show_weights=True))
|--- OverTime <= 0.50 | |--- TotalWorkingYears <= 2.50 | | |--- JobInvolvement <= 1.50 | | | |--- weights: [0.00, 4.00] class: 1 | | |--- JobInvolvement > 1.50 | | | |--- TrainingTimesLastYear <= 1.00 | | | | |--- weights: [0.00, 4.00] class: 1 | | | |--- TrainingTimesLastYear > 1.00 | | | | |--- JobSatisfaction <= 1.50 | | | | | |--- PercentSalaryHike <= 15.00 | | | | | | |--- TrainingTimesLastYear <= 4.50 | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | |--- TrainingTimesLastYear > 4.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- PercentSalaryHike > 15.00 | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | |--- JobSatisfaction > 1.50 | | | | | |--- WorkLifeBalance <= 2.50 | | | | | | |--- Age <= 24.50 | | | | | | | |--- Education <= 2.50 | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | |--- Education > 2.50 | | | | | | | | |--- RelationshipSatisfaction <= 3.50 | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- RelationshipSatisfaction > 3.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- Age > 24.50 | | | | | | | |--- Department <= 1.50 | | | | | | | | |--- weights: [8.00, 0.00] class: 0 | | | | | | | |--- Department > 1.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | |--- WorkLifeBalance > 2.50 | | | | | | |--- HourlyRate <= 56.50 | | | | | | | |--- YearsWithCurrManager <= 0.50 | | | | | | | | |--- TrainingTimesLastYear <= 2.50 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | |--- TrainingTimesLastYear > 2.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- YearsWithCurrManager > 0.50 | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | |--- HourlyRate > 56.50 | | | | | | | |--- weights: [25.00, 0.00] class: 0 | |--- TotalWorkingYears > 2.50 | | |--- EnvironmentSatisfaction <= 1.50 | | | |--- JobInvolvement <= 1.50 | | | | |--- MonthlyRate <= 6287.00 | | | | | |--- weights: [3.00, 0.00] 
class: 0 | | | | |--- MonthlyRate > 6287.00 | | | | | |--- weights: [0.00, 5.00] class: 1 | | | |--- JobInvolvement > 1.50 | | | | |--- HourlyRate <= 99.50 | | | | | |--- DailyRate <= 1468.50 | | | | | | |--- PercentSalaryHike <= 13.50 | | | | | | | |--- YearsInCurrentRole <= 1.50 | | | | | | | | |--- DailyRate <= 641.50 | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | |--- DailyRate > 641.50 | | | | | | | | | |--- StockOptionLevel <= 2.50 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | | |--- StockOptionLevel > 2.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- YearsInCurrentRole > 1.50 | | | | | | | | |--- Age <= 25.50 | | | | | | | | | |--- Education <= 1.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- Education > 1.50 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | |--- Age > 25.50 | | | | | | | | | |--- JobSatisfaction <= 3.50 | | | | | | | | | | |--- MonthlyIncome <= 2394.00 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | | |--- MonthlyIncome > 2394.00 | | | | | | | | | | | |--- weights: [31.00, 0.00] class: 0 | | | | | | | | | |--- JobSatisfaction > 3.50 | | | | | | | | | | |--- MonthlyRate <= 10396.50 | | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | | | |--- MonthlyRate > 10396.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | |--- PercentSalaryHike > 13.50 | | | | | | | |--- DailyRate <= 157.50 | | | | | | | | |--- RelationshipSatisfaction <= 3.50 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | |--- RelationshipSatisfaction > 3.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- DailyRate > 157.50 | | | | | | | | |--- Age <= 55.50 | | | | | | | | | |--- DistanceFromHome <= 28.50 | | | | | | | | | | |--- weights: [67.00, 0.00] class: 0 | | | | | | | | | |--- DistanceFromHome > 28.50 | | | | | | | | | | |--- 
TrainingTimesLastYear <= 3.50 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | | |--- TrainingTimesLastYear > 3.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- Age > 55.50 | | | | | | | | | |--- Department <= 1.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- Department > 1.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- DailyRate > 1468.50 | | | | | | |--- HourlyRate <= 76.50 | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- HourlyRate > 76.50 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- HourlyRate > 99.50 | | | | | |--- MonthlyIncome <= 4814.50 | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- MonthlyIncome > 4814.50 | | | | | | |--- weights: [1.00, 0.00] class: 0 | | |--- EnvironmentSatisfaction > 1.50 | | | |--- YearsAtCompany <= 30.00 | | | | |--- WorkLifeBalance <= 2.50 | | | | | |--- HourlyRate <= 38.50 | | | | | | |--- JobLevel <= 2.50 | | | | | | | |--- MonthlyIncome <= 2711.50 | | | | | | | | |--- MonthlyIncome <= 2219.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- MonthlyIncome > 2219.50 | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | |--- MonthlyIncome > 2711.50 | | | | | | | | |--- Age <= 54.50 | | | | | | | | | |--- weights: [11.00, 0.00] class: 0 | | | | | | | | |--- Age > 54.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- JobLevel > 2.50 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | |--- HourlyRate > 38.50 | | | | | | |--- MonthlyIncome <= 2064.00 | | | | | | | |--- Age <= 28.00 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- Age > 28.00 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- MonthlyIncome > 2064.00 | | | | | | | |--- YearsWithCurrManager <= 9.50 | | | | | | | | |--- EducationField <= 0.50 | | | | | | | | | |--- YearsAtCompany <= 6.00 | | | | 
| | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | |--- YearsAtCompany > 6.00 | | | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | | |--- EducationField > 0.50 | | | | | | | | | |--- Education <= 3.50 | | | | | | | | | | |--- EducationField <= 4.50 | | | | | | | | | | | |--- weights: [82.00, 0.00] class: 0 | | | | | | | | | | |--- EducationField > 4.50 | | | | | | | | | | | |--- truncated branch of depth 2 | | | | | | | | | |--- Education > 3.50 | | | | | | | | | | |--- PercentSalaryHike <= 14.50 | | | | | | | | | | | |--- weights: [17.00, 0.00] class: 0 | | | | | | | | | | |--- PercentSalaryHike > 14.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | |--- YearsWithCurrManager > 9.50 | | | | | | | | |--- JobRole <= 6.00 | | | | | | | | | |--- weights: [5.00, 0.00] class: 0 | | | | | | | | |--- JobRole > 6.00 | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | |--- WorkLifeBalance > 2.50 | | | | | |--- YearsSinceLastPromotion <= 14.50 | | | | | | |--- DailyRate <= 110.00 | | | | | | | |--- MonthlyRate <= 22965.50 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | | | |--- MonthlyRate > 22965.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- DailyRate > 110.00 | | | | | | | |--- JobRole <= 7.50 | | | | | | | | |--- DailyRate <= 1444.00 | | | | | | | | | |--- Age <= 45.50 | | | | | | | | | | |--- Education <= 2.50 | | | | | | | | | | | |--- truncated branch of depth 5 | | | | | | | | | | |--- Education > 2.50 | | | | | | | | | | | |--- weights: [205.00, 0.00] class: 0 | | | | | | | | | |--- Age > 45.50 | | | | | | | | | | |--- TotalWorkingYears <= 13.50 | | | | | | | | | | | |--- truncated branch of depth 4 | | | | | | | | | | |--- TotalWorkingYears > 13.50 | | | | | | | | | | | |--- weights: [39.00, 0.00] class: 0 | | | | | | | | |--- DailyRate > 1444.00 | | | | | | | | | |--- JobInvolvement <= 3.50 | | | | | | | | | | |--- weights: [13.00, 0.00] class: 0 | | | | | | | | | 
|--- JobInvolvement > 3.50 | | | | | | | | | | |--- Age <= 35.50 | | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | | |--- Age > 35.50 | | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- JobRole > 7.50 | | | | | | | | |--- TotalWorkingYears <= 9.50 | | | | | | | | | |--- weights: [12.00, 0.00] class: 0 | | | | | | | | |--- TotalWorkingYears > 9.50 | | | | | | | | | |--- YearsInCurrentRole <= 3.50 | | | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | | | |--- YearsInCurrentRole > 3.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- YearsSinceLastPromotion > 14.50 | | | | | | |--- TrainingTimesLastYear <= 1.50 | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | |--- TrainingTimesLastYear > 1.50 | | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | |--- YearsAtCompany > 30.00 | | | | |--- weights: [0.00, 1.00] class: 1 |--- OverTime > 0.50 | |--- JobLevel <= 1.50 | | |--- YearsInCurrentRole <= 0.50 | | | |--- StockOptionLevel <= 0.50 | | | | |--- weights: [0.00, 15.00] class: 1 | | | |--- StockOptionLevel > 0.50 | | | | |--- Education <= 2.50 | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | |--- Education > 2.50 | | | | | |--- Age <= 51.00 | | | | | | |--- weights: [0.00, 7.00] class: 1 | | | | | |--- Age > 51.00 | | | | | | |--- weights: [1.00, 0.00] class: 0 | | |--- YearsInCurrentRole > 0.50 | | | |--- NumCompaniesWorked <= 0.50 | | | | |--- YearsInCurrentRole <= 1.50 | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- YearsInCurrentRole > 1.50 | | | | | |--- weights: [12.00, 0.00] class: 0 | | | |--- NumCompaniesWorked > 0.50 | | | | |--- JobInvolvement <= 1.50 | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | |--- JobInvolvement > 1.50 | | | | | |--- DailyRate <= 936.00 | | | | | | |--- MonthlyIncome <= 2694.50 | | | | | | | |--- TrainingTimesLastYear <= 1.00 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- 
TrainingTimesLastYear > 1.00 | | | | | | | | |--- DailyRate <= 879.00 | | | | | | | | | |--- weights: [0.00, 13.00] class: 1 | | | | | | | | |--- DailyRate > 879.00 | | | | | | | | | |--- PercentSalaryHike <= 12.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | | |--- PercentSalaryHike > 12.50 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- MonthlyIncome > 2694.50 | | | | | | | |--- StockOptionLevel <= 1.50 | | | | | | | | |--- WorkLifeBalance <= 1.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | |--- WorkLifeBalance > 1.50 | | | | | | | | | |--- weights: [9.00, 0.00] class: 0 | | | | | | | |--- StockOptionLevel > 1.50 | | | | | | | | |--- Gender <= 0.50 | | | | | | | | | |--- weights: [0.00, 3.00] class: 1 | | | | | | | | |--- Gender > 0.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- DailyRate > 936.00 | | | | | | |--- TrainingTimesLastYear <= 3.50 | | | | | | | |--- RelationshipSatisfaction <= 2.50 | | | | | | | | |--- TotalWorkingYears <= 4.50 | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | |--- TotalWorkingYears > 4.50 | | | | | | | | | |--- Education <= 3.50 | | | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | | | | | |--- Education > 3.50 | | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | |--- RelationshipSatisfaction > 2.50 | | | | | | | | |--- weights: [16.00, 0.00] class: 0 | | | | | | |--- TrainingTimesLastYear > 3.50 | | | | | | | |--- MonthlyIncome <= 2396.00 | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | |--- MonthlyIncome > 2396.00 | | | | | | | | |--- weights: [0.00, 5.00] class: 1 | |--- JobLevel > 1.50 | | |--- JobRole <= 6.50 | | | |--- Age <= 24.00 | | | | |--- weights: [0.00, 1.00] class: 1 | | | |--- Age > 24.00 | | | | |--- TotalWorkingYears <= 37.50 | | | | | |--- MonthlyIncome <= 19853.00 | | | | | | |--- NumCompaniesWorked <= 8.50 | | | | | | | |--- DistanceFromHome <= 
28.50 | | | | | | | | |--- DailyRate <= 1421.50 | | | | | | | | | |--- TotalWorkingYears <= 8.50 | | | | | | | | | | |--- YearsAtCompany <= 1.50 | | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | | | | |--- YearsAtCompany > 1.50 | | | | | | | | | | | |--- weights: [7.00, 0.00] class: 0 | | | | | | | | | |--- TotalWorkingYears > 8.50 | | | | | | | | | | |--- weights: [93.00, 0.00] class: 0 | | | | | | | | |--- DailyRate > 1421.50 | | | | | | | | | |--- YearsWithCurrManager <= 9.00 | | | | | | | | | | |--- weights: [3.00, 0.00] class: 0 | | | | | | | | | |--- YearsWithCurrManager > 9.00 | | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- DistanceFromHome > 28.50 | | | | | | | | |--- MonthlyIncome <= 10614.50 | | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | | | |--- MonthlyIncome > 10614.50 | | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- NumCompaniesWorked > 8.50 | | | | | | | |--- EnvironmentSatisfaction <= 2.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | |--- EnvironmentSatisfaction > 2.50 | | | | | | | | |--- weights: [2.00, 0.00] class: 0 | | | | | |--- MonthlyIncome > 19853.00 | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- TotalWorkingYears > 37.50 | | | | | |--- weights: [0.00, 1.00] class: 1 | | |--- JobRole > 6.50 | | | |--- DistanceFromHome <= 11.00 | | | | |--- StockOptionLevel <= 0.50 | | | | | |--- WorkLifeBalance <= 2.50 | | | | | | |--- weights: [0.00, 5.00] class: 1 | | | | | |--- WorkLifeBalance > 2.50 | | | | | | |--- TrainingTimesLastYear <= 0.50 | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | |--- TrainingTimesLastYear > 0.50 | | | | | | | |--- MonthlyIncome <= 7653.50 | | | | | | | | |--- weights: [10.00, 0.00] class: 0 | | | | | | | |--- MonthlyIncome > 7653.50 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | |--- StockOptionLevel > 0.50 | | | | | |--- weights: [26.00, 0.00] class: 0 | | | |--- 
DistanceFromHome > 11.00 | | | | |--- StockOptionLevel <= 0.50 | | | | | |--- HourlyRate <= 34.00 | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | |--- HourlyRate > 34.00 | | | | | | |--- Education <= 2.50 | | | | | | | |--- Gender <= 0.50 | | | | | | | | |--- weights: [0.00, 2.00] class: 1 | | | | | | | |--- Gender > 0.50 | | | | | | | | |--- weights: [1.00, 0.00] class: 0 | | | | | | |--- Education > 2.50 | | | | | | | |--- weights: [0.00, 9.00] class: 1 | | | | |--- StockOptionLevel > 0.50 | | | | | |--- YearsSinceLastPromotion <= 1.50 | | | | | | |--- weights: [4.00, 0.00] class: 0 | | | | | |--- YearsSinceLastPromotion > 1.50 | | | | | | |--- DistanceFromHome <= 22.50 | | | | | | | |--- weights: [0.00, 4.00] class: 1 | | | | | | |--- DistanceFromHome > 22.50 | | | | | | | |--- HourlyRate <= 46.00 | | | | | | | | |--- weights: [0.00, 1.00] class: 1 | | | | | | | |--- HourlyRate > 46.00 | | | | | | | | |--- weights: [2.00, 0.00] class: 0In [50]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
dTree.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp
MonthlyIncome 0.085485
JobLevel 0.067762
OverTime 0.063004
TotalWorkingYears 0.061662
TrainingTimesLastYear 0.061090
YearsInCurrentRole 0.056927
Age 0.049305
DistanceFromHome 0.047290
JobInvolvement 0.046883
StockOptionLevel 0.042094
Education 0.040870
HourlyRate 0.040683
DailyRate 0.038781
JobRole 0.035187
WorkLifeBalance 0.035151
MonthlyRate 0.031497
YearsWithCurrManager 0.024574
YearsSinceLastPromotion 0.024189
RelationshipSatisfaction 0.023025
JobSatisfaction 0.022211
PercentSalaryHike 0.022082
NumCompaniesWorked 0.019826
YearsAtCompany 0.017997
EnvironmentSatisfaction 0.016954
Gender 0.010484
Department 0.010279
EducationField 0.004708
PerformanceRating 0.000000
BusinessTravel 0.000000
MaritalStatus 0.000000
In [51]:
importances = dTree.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
The three most important features are MonthlyIncome (0.085), JobLevel (0.068), and OverTime (0.063). In the unpruned tree, importance is spread thinly across many features, so no single feature dominates the model's predictions.
In [52]:
dTree_short = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=1)
dTree_short.fit(X_train, y_train)
Out[52]:
DecisionTreeClassifier(max_depth=3, random_state=1)
In [53]:
make_confusion_matrix(dTree_short, X_test, y_test, figsize=(4, 3))
On the test set, the depth-3 tree identifies negative cases well but misses most positive cases. True negatives (TN): 81.41%, 359 cases correctly predicted as negative. False positives (FP): 1.13%, 5 cases incorrectly predicted as positive. False negatives (FN): 15.42%, 68 cases incorrectly predicted as negative. True positives (TP): 2.04%, only 9 cases correctly predicted as positive.
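The four counts read from this confusion matrix are enough to recompute the classification metrics by hand. The sketch below is a minimal illustration using the standard definitions; `metrics_from_counts` is a hypothetical helper, and the counts are the ones quoted above.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Accuracy, recall, precision and F1 from binary confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"Accuracy": accuracy, "Recall": recall, "Precision": precision, "F1": f1}


# Counts quoted above for the depth-3 tree on the test set.
scores = metrics_from_counts(tn=359, fp=5, fn=68, tp=9)
for name, value in scores.items():
    print(f"{name}: {value:.6f}")
```

The values should agree with the test metrics that `model_performance_classification` prints for `dTree_short` in the next cell.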
In [54]:
# Calculating different metrics
dTree_short_model_train_perf = model_performance_classification(
dTree_short, X_train, y_train
)
print("Training performance:\n", dTree_short_model_train_perf)
dTree_short_model_test_perf = model_performance_classification(dTree_short, X_test, y_test)
print("Testing performance:\n", dTree_short_model_test_perf)
Training performance:
Accuracy Recall Precision F1
0 0.864917 0.1625 0.83871 0.272251
Testing performance:
Accuracy Recall Precision F1
0 0.834467 0.116883 0.642857 0.197802
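Performance drops only modestly from the training set to the testing set (accuracy 0.865 to 0.834, precision 0.839 to 0.643), so the depth-3 tree overfits far less than the full tree. Recall, however, is very low on both sets (0.163 and 0.117), which means the model misses most of the positive cases.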
In [55]:
plt.figure(figsize=(15, 10))
tree.plot_tree(
dTree_short,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=["0","1"],
)
plt.show()
# Text report showing the rules of a decision tree
print(tree.export_text(dTree_short, feature_names=feature_names, show_weights=True))
|--- OverTime <= 0.50
|   |--- TotalWorkingYears <= 2.50
|   |   |--- JobInvolvement <= 1.50
|   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |--- JobInvolvement > 1.50
|   |   |   |--- weights: [42.00, 15.00] class: 0
|   |--- TotalWorkingYears > 2.50
|   |   |--- EnvironmentSatisfaction <= 1.50
|   |   |   |--- weights: [116.00, 23.00] class: 0
|   |   |--- EnvironmentSatisfaction > 1.50
|   |   |   |--- weights: [509.00, 31.00] class: 0
|--- OverTime > 0.50
|   |--- JobLevel <= 1.50
|   |   |--- YearsInCurrentRole <= 0.50
|   |   |   |--- weights: [5.00, 22.00] class: 1
|   |   |--- YearsInCurrentRole > 0.50
|   |   |   |--- weights: [47.00, 34.00] class: 0
|   |--- JobLevel > 1.50
|   |   |--- JobRole <= 6.50
|   |   |   |--- weights: [106.00, 8.00] class: 0
|   |   |--- JobRole > 6.50
|   |   |   |--- weights: [44.00, 23.00] class: 0
In [57]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
dTree_short.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
Imp
OverTime 0.290465
JobLevel 0.278283
YearsInCurrentRole 0.107835
JobRole 0.107393
TotalWorkingYears 0.102742
JobInvolvement 0.069240
EnvironmentSatisfaction 0.044043
WorkLifeBalance 0.000000
TrainingTimesLastYear 0.000000
NumCompaniesWorked 0.000000
StockOptionLevel 0.000000
YearsAtCompany 0.000000
YearsSinceLastPromotion 0.000000
RelationshipSatisfaction 0.000000
PerformanceRating 0.000000
PercentSalaryHike 0.000000
Age 0.000000
MonthlyIncome 0.000000
MonthlyRate 0.000000
BusinessTravel 0.000000
MaritalStatus 0.000000
JobSatisfaction 0.000000
HourlyRate 0.000000
Gender 0.000000
EducationField 0.000000
Education 0.000000
DistanceFromHome 0.000000
Department 0.000000
DailyRate 0.000000
YearsWithCurrManager 0.000000
In [58]:
importances = dTree_short.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(10, 10))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
OverTime is the most important feature (0.290), closely followed by JobLevel (0.278). YearsInCurrentRole is third (0.108), nearly tied with JobRole (0.107). Only seven features have nonzero importance, because a depth-3 tree has at most seven split nodes. The ranking suggests that work-related features, such as overtime, job level, and time in the current role, play the main role in this model's decisions.
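The Gini importance behind these rankings is built from impurity decreases at each split. As a rough illustration, the sketch below computes the Gini impurity of the root node and the decrease produced by the root `OverTime` split, using class counts summed from the leaves of the depth-3 tree's text report. The helper names are hypothetical, and this covers only one split: scikit-learn sums the sample-weighted decreases over every node that uses a feature and then normalizes across features.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_decrease(parent, left, right):
    """Gini of the parent minus the size-weighted Gini of its two children."""
    n = sum(parent)
    weighted_children = (sum(left) / n) * gini(left) + (sum(right) / n) * gini(right)
    return gini(parent) - weighted_children


# Class counts [class 0, class 1], summed from the leaves of the depth-3 tree report.
left = [0 + 42 + 116 + 509, 4 + 15 + 23 + 31]  # OverTime <= 0.50
right = [5 + 47 + 106 + 44, 22 + 34 + 8 + 23]  # OverTime > 0.50
parent = [left[0] + right[0], left[1] + right[1]]

print("root Gini:", round(gini(parent), 4))
print("decrease from the OverTime split:", round(impurity_decrease(parent, left, right), 4))
```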
In [59]:
# Choose the type of classifier.
dTree_tuned = DecisionTreeClassifier(
    class_weight={0: 0.5, 1: 0.5}, random_state=1
)  # equal class weights, so neither class is up-weighted
# Grid of parameters to choose from
parameters = {
    "max_depth": np.arange(1, 7, 1),
    "min_samples_leaf": [5, 10, 15, 20, 25],
    "max_leaf_nodes": [3, 5, 10, 15],
    "min_impurity_decrease": [0.001, 0.01, 0.1],
}
# Type of scoring used to compare parameter combinations (F1 score of the positive class)
f1_scorer = metrics.make_scorer(metrics.f1_score)
# Run the grid search
grid_obj = GridSearchCV(dTree_tuned, parameters, scoring=f1_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_
# Fit the best algorithm to the data.
dtree_tuned.fit(X_train, y_train)
Out[59]:
DecisionTreeClassifier(class_weight={0: 0.5, 1: 0.5}, max_depth=5,
max_leaf_nodes=10, min_impurity_decrease=0.001,
min_samples_leaf=15, random_state=1)
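To get a feel for how much work this grid search does, the candidate combinations can be counted directly. The sketch below uses only the standard library and copies the candidate values from the grid above (`np.arange(1, 7, 1)` yields the integers 1 to 6); the variable names are illustrative.

```python
from itertools import product

# Same candidate values as the grid above (np.arange(1, 7, 1) yields 1..6).
parameters = {
    "max_depth": list(range(1, 7)),
    "min_samples_leaf": [5, 10, 15, 20, 25],
    "max_leaf_nodes": [3, 5, 10, 15],
    "min_impurity_decrease": [0.001, 0.01, 0.1],
}
cv_folds = 5

combinations = list(product(*parameters.values()))
n_candidates = len(combinations)
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

print(f"{n_candidates} candidates x {cv_folds} folds = {n_fits} fits")
```

`GridSearchCV` also refits the best candidate on the full training set by default (`refit=True`), so `best_estimator_` is already fitted when it is retrieved.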
In [60]:
make_confusion_matrix(dtree_tuned, X_test, y_test, figsize=(4,3))
On the test set, the tuned tree still identifies negative (non-attrition) cases well and finds more positive cases than the depth-3 tree. True negatives (TN): 79.14%, 349 cases correctly predicted as negative. False positives (FP): 3.40%, 15 cases incorrectly predicted as positive. False negatives (FN): 12.24%, 54 cases incorrectly predicted as negative. True positives (TP): 5.22%, 23 cases correctly predicted as positive.
In [61]:
# Calculating different metrics
dtree_tuned_model_train_perf = model_performance_classification(
dtree_tuned, X_train, y_train
)
print("Training performance:\n", dtree_tuned_model_train_perf)
dtree_tuned_model_test_perf = model_performance_classification(dtree_tuned, X_test, y_test)
print("Testing performance:\n", dtree_tuned_model_test_perf)
Training performance:
Accuracy Recall Precision F1
0 0.876579 0.375 0.689655 0.48583
Testing performance:
Accuracy Recall Precision F1
0 0.843537 0.298701 0.605263 0.4
Accuracy is reasonable and drops only slightly on the testing set (0.877 to 0.844). Recall, precision, and F1 are also lower on the testing set than on the training set (recall 0.375 to 0.299, precision 0.690 to 0.605, F1 0.486 to 0.400).
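The size of the drop from training to testing can be compared across the three trees by subtracting the printed scores. The sketch below simply copies the values from the three performance printouts above; `train_test_gap` is a hypothetical helper, and a larger gap is read here as a sign of stronger overfitting.

```python
# Train and test scores copied from the three performance printouts above.
reported = {
    "full": {
        "train": {"Accuracy": 1.0, "Recall": 1.0, "Precision": 1.0, "F1": 1.0},
        "test": {"Accuracy": 0.773243, "Recall": 0.324675, "Precision": 0.342466, "F1": 0.333333},
    },
    "short": {
        "train": {"Accuracy": 0.864917, "Recall": 0.1625, "Precision": 0.83871, "F1": 0.272251},
        "test": {"Accuracy": 0.834467, "Recall": 0.116883, "Precision": 0.642857, "F1": 0.197802},
    },
    "tuned": {
        "train": {"Accuracy": 0.876579, "Recall": 0.375, "Precision": 0.689655, "F1": 0.48583},
        "test": {"Accuracy": 0.843537, "Recall": 0.298701, "Precision": 0.605263, "F1": 0.4},
    },
}


def train_test_gap(scores):
    """Train score minus test score for each metric."""
    return {m: scores["train"][m] - scores["test"][m] for m in scores["train"]}


gaps = {name: train_test_gap(scores) for name, scores in reported.items()}
for name, gap in gaps.items():
    print(name, {m: round(v, 3) for m, v in gap.items()})
```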
In [62]:
plt.figure(figsize=(15, 10))
tree.plot_tree(
dtree_tuned,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=True,
class_names=["0","1"],
)
plt.show()
# Text report showing the rules of a decision tree
print(tree.export_text(dtree_tuned, feature_names=feature_names, show_weights=True))
|--- OverTime <= 0.50
|   |--- TotalWorkingYears <= 2.50
|   |   |--- HourlyRate <= 58.50
|   |   |   |--- weights: [4.50, 5.50] class: 1
|   |   |--- HourlyRate > 58.50
|   |   |   |--- weights: [16.50, 4.00] class: 0
|   |--- TotalWorkingYears > 2.50
|   |   |--- weights: [312.50, 27.00] class: 0
|--- OverTime > 0.50
|   |--- JobLevel <= 1.50
|   |   |--- YearsInCurrentRole <= 0.50
|   |   |   |--- weights: [2.50, 11.00] class: 1
|   |   |--- YearsInCurrentRole > 0.50
|   |   |   |--- EnvironmentSatisfaction <= 1.50
|   |   |   |   |--- weights: [2.50, 5.50] class: 1
|   |   |   |--- EnvironmentSatisfaction > 1.50
|   |   |   |   |--- weights: [21.00, 11.50] class: 0
|   |--- JobLevel > 1.50
|   |   |--- JobRole <= 6.50
|   |   |   |--- weights: [53.00, 4.00] class: 0
|   |   |--- JobRole > 6.50
|   |   |   |--- DistanceFromHome <= 11.00
|   |   |   |   |--- StockOptionLevel <= 0.50
|   |   |   |   |   |--- weights: [5.00, 3.50] class: 0
|   |   |   |   |--- StockOptionLevel > 0.50
|   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |--- DistanceFromHome > 11.00
|   |   |   |   |--- weights: [4.00, 8.00] class: 1
In [64]:
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
dtree_tuned.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
# Nine features now have nonzero importance, compared with seven for the depth-3 tree
Imp
OverTime 0.244882
JobLevel 0.234612
DistanceFromHome 0.112491
YearsInCurrentRole 0.090912
JobRole 0.090539
TotalWorkingYears 0.086618
StockOptionLevel 0.050132
HourlyRate 0.048698
EnvironmentSatisfaction 0.041116
Age 0.000000
PerformanceRating 0.000000
RelationshipSatisfaction 0.000000
TrainingTimesLastYear 0.000000
WorkLifeBalance 0.000000
YearsAtCompany 0.000000
YearsSinceLastPromotion 0.000000
PercentSalaryHike 0.000000
MonthlyIncome 0.000000
NumCompaniesWorked 0.000000
MonthlyRate 0.000000
BusinessTravel 0.000000
MaritalStatus 0.000000
JobSatisfaction 0.000000
JobInvolvement 0.000000
Gender 0.000000
EducationField 0.000000
Education 0.000000
Department 0.000000
DailyRate 0.000000
YearsWithCurrManager 0.000000
In [65]:
importances = dtree_tuned.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
OverTime is the most important feature according to the model (0.245). JobLevel is second (0.235), which suggests that rank within the organization plays an important role. DistanceFromHome is third (0.112), which suggests that how far an employee lives from the workplace also influences the prediction. Overall, the model emphasizes features related to employment conditions: overtime work, job level, and commuting distance.
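The fractional leaf weights in the tuned tree's text report (for example `[4.50, 5.50]`) appear to be class-weighted totals rather than raw row counts, since both classes carry a weight of 0.5. Assuming that reading, the sketch below divides the weights by 0.5 to recover the number of training rows per leaf and checks them against the `min_samples_leaf=15` and `max_leaf_nodes=10` values chosen by the grid search. The variable names are illustrative.

```python
CLASS_WEIGHT = 0.5  # weight given to both classes in the tuned tree
MIN_SAMPLES_LEAF = 15  # value selected by the grid search

# Leaf weights [class 0, class 1] copied from the tuned tree's text report.
leaf_weights = [
    [4.50, 5.50], [16.50, 4.00], [312.50, 27.00], [2.50, 11.00], [2.50, 5.50],
    [21.00, 11.50], [53.00, 4.00], [5.00, 3.50], [13.00, 0.00], [4.00, 8.00],
]

# Undo the class weighting to recover the number of training rows in each leaf.
leaf_counts = [[round(w / CLASS_WEIGHT) for w in leaf] for leaf in leaf_weights]
leaf_sizes = [sum(leaf) for leaf in leaf_counts]
class_totals = [sum(leaf[i] for leaf in leaf_counts) for i in range(2)]

print("rows per leaf:", leaf_sizes)
print("smallest leaf:", min(leaf_sizes), "(minimum allowed:", MIN_SAMPLES_LEAF, ")")
print("rows per class:", class_totals)
```

If the assumption holds, the per-class totals should match the 869 negative and 160 positive training rows seen in the training confusion matrix.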
In [66]:
# training performance comparison
models_train_comp_df = pd.concat(
[
model_performance_classification(dTree, X_train, y_train).T,
model_performance_classification(dTree_short, X_train, y_train).T,
model_performance_classification(dtree_tuned, X_train, y_train).T,
],
axis=1,
)
models_train_comp_df.columns = [
"Decision Tree full",
"Decision Tree short",
"Decision tree tuned",
]
print("training performance comparison:")
models_train_comp_df
training performance comparison:
Out[66]:
| | Decision Tree full | Decision Tree short | Decision tree tuned |
|---|---|---|---|
| Accuracy | 1.0 | 0.864917 | 0.876579 |
| Recall | 1.0 | 0.162500 | 0.375000 |
| Precision | 1.0 | 0.838710 | 0.689655 |
| F1 | 1.0 | 0.272251 | 0.485830 |
In [67]:
# test performance comparison
models_test_comp_df = pd.concat(
[
model_performance_classification(dTree, X_test, y_test).T,
model_performance_classification(dTree_short, X_test, y_test).T,
model_performance_classification(dtree_tuned, X_test, y_test).T,
],
axis=1,
)
models_test_comp_df.columns = [
"Decision Tree full",
"Decision Tree short",
"Decision Tree tuned",
]
print("test performance comparison:")
models_test_comp_df
test performance comparison:
Out[67]:
| | Decision Tree full | Decision Tree short | Decision Tree tuned |
|---|---|---|---|
| Accuracy | 0.773243 | 0.834467 | 0.843537 |
| Recall | 0.324675 | 0.116883 | 0.298701 |
| Precision | 0.342466 | 0.642857 | 0.605263 |
| F1 | 0.333333 | 0.197802 | 0.400000 |
On the test set, the tuned decision tree achieves the highest accuracy (0.844) and the best F1 score (0.400). The short decision tree has the highest precision (0.643) but the lowest recall and F1 score. The full decision tree has the highest recall (0.325) but the lowest accuracy and precision. The tuned tree therefore sits between the full and short models: it offers the highest accuracy together with a more favorable balance between precision and recall than either of the other models.
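The comparison can be checked mechanically by picking the best model for each metric from the test table. The sketch below copies the test scores from the table above into a plain dictionary; the variable names are illustrative.

```python
# Test-set scores copied from the comparison table above.
test_scores = {
    "Decision Tree full": {"Accuracy": 0.773243, "Recall": 0.324675, "Precision": 0.342466, "F1": 0.333333},
    "Decision Tree short": {"Accuracy": 0.834467, "Recall": 0.116883, "Precision": 0.642857, "F1": 0.197802},
    "Decision Tree tuned": {"Accuracy": 0.843537, "Recall": 0.298701, "Precision": 0.605263, "F1": 0.400000},
}

metric_names = ["Accuracy", "Recall", "Precision", "F1"]
# For each metric, keep the name of the model with the highest score.
best = {m: max(test_scores, key=lambda name: test_scores[name][m]) for m in metric_names}

for metric, model in best.items():
    print(f"{metric}: {model} ({test_scores[model][metric]:.3f})")
```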
In [68]:
linreg = LinearRegression()
linreg.fit(X_train, y_train)
Out[68]:
LinearRegression()
In [69]:
# metrics used by the helper functions below
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]  # number of observations
    k = predictors.shape[1]  # number of predictors
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    # predicting using the independent variables
    pred = model.predict(predictors)
    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )
    return df_perf
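To make the adjusted R-squared formula in `adj_r2_score` concrete, the sketch below applies it to plain numbers. The function name `adjusted_r2` and the values of `n_samples` and `n_features` are hypothetical, chosen only for illustration; they are not the dimensions of this dataset.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Hypothetical example: same R^2, increasing number of predictors
r2 = 0.20
for k in (1, 10, 30):
    print(k, round(adjusted_r2(r2, n_samples=400, n_features=k), 4))
```

For a fixed R-squared, the penalty grows as the number of predictors rises or the number of rows falls, so adjusted R-squared is always at or below plain R-squared.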
In [70]:
# checking model performance on train and test set
linreg_train_perf = model_performance_regression(
    linreg, X_train, y_train
)
print("Training Performance\n", linreg_train_perf)
linreg_test_perf = model_performance_regression(
    linreg, X_test, y_test
)
print("Testing Performance\n", linreg_test_perf)
Training Performance
RMSE MAE R-squared Adj. R-squared MAPE
0 0.322452 0.23493 0.208192 0.18439 inf
Testing Performance
RMSE MAE R-squared Adj. R-squared MAPE
0 0.338406 0.25468 0.205375 0.147232 inf
Both the training and testing scores are weak. RMSE and MAE are slightly lower on the training set (0.322 and 0.235) than on the test set (0.338 and 0.255), and R-squared is nearly identical at about 0.21 on both. The model is therefore underfitting rather than overfitting: it explains only about a fifth of the variance in the dependent variable. MAPE evaluates to inf, so it is not informative for this target.
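The `inf` in the MAPE column follows from the formula in `mape_score`: any row whose target is 0 and whose prediction is nonzero contributes a division by zero. The sketch below reproduces this on a tiny made-up array and shows one possible workaround, which restricts MAPE to rows with a nonzero target. `mape_nonzero` is a hypothetical helper, not something the notebook uses.

```python
import numpy as np


def mape_score(targets, predictions):
    """MAPE as defined in the notebook; infinite if a zero target is mispredicted."""
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.mean(np.abs(targets - predictions) / targets) * 100


def mape_nonzero(targets, predictions):
    """MAPE over rows with a nonzero target only (NaN if there are none)."""
    targets = np.asarray(targets, dtype=float)
    predictions = np.asarray(predictions, dtype=float)
    mask = targets != 0
    if not mask.any():
        return float("nan")
    errors = np.abs(targets[mask] - predictions[mask]) / np.abs(targets[mask])
    return np.mean(errors) * 100


# Made-up 0/1 targets with continuous predictions
y_true = np.array([0.0, 1.0, 0.0, 1.0])
y_pred = np.array([0.1, 0.8, 0.2, 0.6])

print(mape_score(y_true, y_pred))    # inf: two rows divide by zero
print(mape_nonzero(y_true, y_pred))  # averages the 20% and 40% errors
```

Dropping zero-target rows changes what the metric measures, so for a 0/1 target it may be simpler to rely on RMSE and MAE instead.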
In [71]:
from sklearn.ensemble import BaggingRegressor
# Initialize the Bagging Regressor
linreg_bagging_model = BaggingRegressor(base_estimator=LinearRegression(), n_estimators=50, random_state=42)
# Fit the model
linreg_bagging_model.fit(X_train, y_train)
Out[71]:
BaggingRegressor(base_estimator=LinearRegression(), n_estimators=50,
random_state=42)
In [72]:
# checking model performance on train and test set
linreg_bagging_model_train_perf = model_performance_regression(
    linreg_bagging_model, X_train, y_train
)
print("Training Performance\n", linreg_bagging_model_train_perf)
linreg_bagging_model_test_perf = model_performance_regression(
    linreg_bagging_model, X_test, y_test
)
print("Testing Performance\n", linreg_bagging_model_test_perf)
Training Performance
RMSE MAE R-squared Adj. R-squared MAPE
0 0.322652 0.236503 0.207209 0.183378 inf
Testing Performance
RMSE MAE R-squared Adj. R-squared MAPE
0 0.33888 0.256413 0.20315 0.144844 inf
Bagging does not improve on the single linear regression: R-squared remains low on both the training set (0.207) and the test set (0.203), and RMSE and MAE are essentially unchanged. The bagged model still explains only about a fifth of the variance in the dependent variable.
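One way to see why bagging changes so little for a linear model: ordinary least squares is a low-variance estimator, so averaging many bootstrap refits tends to land very close to the single fit. The sketch below illustrates this with NumPy on synthetic data. The data, seed, and helper name `fit_ols` are assumptions for illustration and are unrelated to the notebook's dataset.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic regression problem: 200 rows, 3 features, modest noise
n_rows, n_features = 200, 3
X = rng.normal(size=(n_rows, n_features))
true_coef = np.array([0.5, -0.3, 0.2])
y = X @ true_coef + rng.normal(scale=0.1, size=n_rows)


def fit_ols(X, y):
    """Least-squares coefficients, with the intercept as the first entry."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef


# Single fit on all rows
single_coef = fit_ols(X, y)

# Bagging: refit on 50 bootstrap resamples and average the coefficients
# (for a linear model, averaging coefficients equals averaging predictions)
boot_coefs = []
for _ in range(50):
    idx = rng.integers(0, n_rows, size=n_rows)
    boot_coefs.append(fit_ols(X[idx], y[idx]))
bagged_coef = np.mean(boot_coefs, axis=0)

max_gap = np.max(np.abs(bagged_coef - single_coef))
print("largest coefficient gap:", max_gap)
```

Bagging generally pays off for high-variance base models such as deep trees, which is why it is usually paired with those rather than with linear regression.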
In [73]:
from sklearn.tree import DecisionTreeRegressor
# Initialize the Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(random_state=0)
# Fit the model on the training data
dt_regressor.fit(X_train, y_train)
Out[73]:
DecisionTreeRegressor(random_state=0)
In [74]:
# checking model performance on train and test set
dt_regressor_train_perf = model_performance_regression(
    dt_regressor, X_train, y_train
)
print("Training Performance\n", dt_regressor_train_perf)
dt_regressor_test_perf = model_performance_regression(
    dt_regressor, X_test, y_test
)
print("Testing Performance\n", dt_regressor_test_perf)
Training Performance
RMSE MAE R-squared Adj. R-squared MAPE
0 0.0 0.0 1.0 1.0 0.0
Testing Performance
RMSE MAE R-squared Adj. R-squared MAPE
0 0.48795 0.238095 -0.652098 -0.772983 inf
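The unconstrained decision tree fits the training data perfectly (RMSE 0, R-squared 1.0) but performs very poorly on the test set, where R-squared is -0.652, worse than always predicting the mean. This is a clear sign of overfitting.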
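A perfect training fit combined with a much worse test score is the usual signature of an unconstrained tree memorising noise. The following sketch reproduces the pattern on synthetic data (not the notebook's dataset) and fits a depth-limited tree alongside it for comparison. The data-generating process and the `max_depth=3` value are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic data: weak signal in one feature, plenty of noise
X = rng.normal(size=(600, 5))
y = 0.5 * X[:, 0] + rng.normal(scale=1.0, size=600)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Unconstrained tree: grows until every training row is fitted exactly
full_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
# Depth-limited tree: cannot memorise individual rows
small_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_tr, y_tr)

full_train_r2 = r2_score(y_tr, full_tree.predict(X_tr))
full_test_r2 = r2_score(y_te, full_tree.predict(X_te))
small_train_r2 = r2_score(y_tr, small_tree.predict(X_tr))
small_test_r2 = r2_score(y_te, small_tree.predict(X_te))

print("full tree  train/test R2:", full_train_r2, full_test_r2)
print("small tree train/test R2:", small_train_r2, small_test_r2)
```

Limiting `max_depth`, `min_samples_leaf`, or similar parameters is one common way to narrow the train/test gap; the right values would need to be tuned on the actual data.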
In [75]:
# Initialize the Bagging Regressor
dt_regressor_bagging_model = BaggingRegressor(base_estimator=DecisionTreeRegressor(max_depth=5), n_estimators=50, random_state=42)
# Fit the model
dt_regressor_bagging_model.fit(X_train, y_train)
Out[75]:
BaggingRegressor(base_estimator=DecisionTreeRegressor(max_depth=5),
n_estimators=50, random_state=42)
In [76]:
# checking model performance on train and test set
dt_regressor_bagging_train_perf = model_performance_regression(
    dt_regressor_bagging_model, X_train, y_train
)
print("Training Performance\n", dt_regressor_bagging_train_perf)
dt_regressor_bagging_test_perf = model_performance_regression(
    dt_regressor_bagging_model, X_test, y_test
)
print("Testing Performance\n", dt_regressor_bagging_test_perf)
Training Performance
RMSE MAE R-squared Adj. R-squared MAPE
0 0.255089 0.166533 0.504465 0.489569 inf
Testing Performance
RMSE MAE R-squared Adj. R-squared MAPE
0 0.339019 0.225483 0.202495 0.144141 inf
The bagged decision tree regressor fits the training data reasonably well (R-squared 0.504) but generalizes poorly: on the test set, R-squared drops to 0.202 and adjusted R-squared to 0.144, which points to overfitting despite the depth limit.
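A practical note for rerunning the bagging cells above: scikit-learn 1.2 renamed the `base_estimator` argument of `BaggingRegressor` to `estimator`, and later releases dropped the old name. Passing the base model positionally, as sketched below, works in either case. The synthetic data and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic stand-in for the training data
X_demo = rng.normal(size=(300, 4))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.5, size=300)

# Base model passed positionally, so the keyword name does not matter
bagged_trees = BaggingRegressor(
    DecisionTreeRegressor(max_depth=5),
    n_estimators=50,
    random_state=42,
)
bagged_trees.fit(X_demo, y_demo)

demo_pred = bagged_trees.predict(X_demo)
print(len(bagged_trees.estimators_), demo_pred.shape)
```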
In [77]:
# testing performance comparison
models_test_comp_df = pd.concat(
    [
        linreg_test_perf.T,
        linreg_bagging_model_test_perf.T,
        dt_regressor_test_perf.T,
        dt_regressor_bagging_test_perf.T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Linear Regression",
    "Linear regression with bagging",
    "Decision Tree Regressor",
    "DT Regressor with bagging",
]
print("Testing performance comparison:")
models_test_comp_df
Testing performance comparison:
Out[77]:
| | Linear Regression | Linear regression with bagging | Decision Tree Regressor | DT Regressor with bagging |
|---|---|---|---|---|
| RMSE | 0.338406 | 0.338880 | 0.487950 | 0.339019 |
| MAE | 0.254680 | 0.256413 | 0.238095 | 0.225483 |
| R-squared | 0.205375 | 0.203150 | -0.652098 | 0.202495 |
| Adj. R-squared | 0.147232 | 0.144844 | -0.772983 | 0.144141 |
| MAPE | inf | inf | inf | inf |
Both linear regression models perform almost identically on every metric, so bagging did not reduce the prediction error or improve the variance explained. The plain Decision Tree Regressor has the highest RMSE (0.488) and a negative R-squared (-0.652), which indicates that it overfits the training data and is not suitable in its current form. Bagging improves the decision tree substantially: RMSE falls to 0.339 and R-squared rises to 0.202, although that is still low.
For the regression task, the Decision Tree Regressor with bagging is the preferred model: it has the lowest MAE (0.225), and its RMSE (0.339) and R-squared (0.202) are essentially tied with those of the linear models. For the classification task, the tuned decision tree is the most balanced model: it has the highest accuracy and F1 score without an extreme trade-off between precision and recall.
The other models are weaker choices. The linear models, with and without bagging, have higher MAE. The full decision tree has the lowest precision and accuracy, and a lower F1 score than the tuned tree. The short decision tree has low recall, meaning it misses most of the true positives. The logistic regression at a threshold of 0.62 yields very low recall, precision, and F1 score, indicating that it is too conservative in predicting the positive class.