Business analytics

TermProject1-HR-Employee-Attrition.html


In [1]:

# this will help in making the Python code more structured automatically (good coding practice)
#%load_ext nb_black

# Library to suppress warnings or deprecation notes
import warnings

warnings.filterwarnings("ignore")

# Libraries to help with reading and manipulating data

import pandas as pd
import numpy as np

# To perform statistical analysis
import scipy.stats as stats
import sklearn

# Library to split data
from sklearn.model_selection import train_test_split

#Library to preprocess data 
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# To build logistic regression model
from sklearn.linear_model import LogisticRegression

# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    roc_auc_score,
    precision_recall_curve,
    roc_curve,
    make_scorer,
)

In [2]:

# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
from sklearn import metrics
import matplotlib.pyplot as plt

%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split

# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from sklearn.tree import plot_tree

# To tune different models
from sklearn.model_selection import GridSearchCV

# To get different metric scores
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    make_scorer,
)

In [3]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

import warnings
warnings.filterwarnings('ignore')

In [5]:

df = pd.read_csv("HR-Employee-Attrition.csv")

In [6]:

df.head()  # retrieve the first five rows of the dataframe

Out[6]:

	Age	Attrition	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EmployeeCount	EmployeeNumber	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobRole	JobSatisfaction	MaritalStatus	MonthlyIncome	MonthlyRate	NumCompaniesWorked	Over18	OverTime	PercentSalaryHike	PerformanceRating	RelationshipSatisfaction	StandardHours	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
0	41	Yes	Travel_Rarely	1102	Sales	1	2	Life Sciences	1	1	2	Female	94	3	2	Sales Executive	4	Single	5993	19479	8	Y	Yes	11	3	1	80	0	8	0	1	6	4	0	5
1	49	No	Travel_Frequently	279	Research & Development	8	1	Life Sciences	1	2	3	Male	61	2	2	Research Scientist	2	Married	5130	24907	1	Y	No	23	4	4	80	1	10	3	3	10	7	1	7
2	37	Yes	Travel_Rarely	1373	Research & Development	2	2	Other	1	4	4	Male	92	2	1	Laboratory Technician	3	Single	2090	2396	6	Y	Yes	15	3	2	80	0	7	3	3	0	0	0	0
3	33	No	Travel_Frequently	1392	Research & Development	3	4	Life Sciences	1	5	4	Female	56	3	1	Research Scientist	3	Married	2909	23159	1	Y	Yes	11	3	3	80	0	8	3	3	8	7	3	0
4	27	No	Travel_Rarely	591	Research & Development	2	1	Medical	1	7	1	Male	40	3	1	Laboratory Technician	2	Married	3468	16632	9	Y	No	12	3	4	80	1	6	3	3	2	2	2	2

In [7]:

df["Attrition"].value_counts(1)  # proportion of each class (the argument 1 sets normalize=True)

Out[7]:

No     0.838776
Yes    0.161224
Name: Attrition, dtype: float64

In [8]:

df.describe().round(2).T  # summary statistics for the numeric columns

Out[8]:

	count	mean	std	min	25%	50%	75%	max
Age	1470.0	36.92	9.14	18.0	30.00	36.0	43.00	60.0
DailyRate	1470.0	802.49	403.51	102.0	465.00	802.0	1157.00	1499.0
DistanceFromHome	1470.0	9.19	8.11	1.0	2.00	7.0	14.00	29.0
Education	1470.0	2.91	1.02	1.0	2.00	3.0	4.00	5.0
EmployeeCount	1470.0	1.00	0.00	1.0	1.00	1.0	1.00	1.0
EmployeeNumber	1470.0	1024.87	602.02	1.0	491.25	1020.5	1555.75	2068.0
EnvironmentSatisfaction	1470.0	2.72	1.09	1.0	2.00	3.0	4.00	4.0
HourlyRate	1470.0	65.89	20.33	30.0	48.00	66.0	83.75	100.0
JobInvolvement	1470.0	2.73	0.71	1.0	2.00	3.0	3.00	4.0
JobLevel	1470.0	2.06	1.11	1.0	1.00	2.0	3.00	5.0
JobSatisfaction	1470.0	2.73	1.10	1.0	2.00	3.0	4.00	4.0
MonthlyIncome	1470.0	6502.93	4707.96	1009.0	2911.00	4919.0	8379.00	19999.0
MonthlyRate	1470.0	14313.10	7117.79	2094.0	8047.00	14235.5	20461.50	26999.0
NumCompaniesWorked	1470.0	2.69	2.50	0.0	1.00	2.0	4.00	9.0
PercentSalaryHike	1470.0	15.21	3.66	11.0	12.00	14.0	18.00	25.0
PerformanceRating	1470.0	3.15	0.36	3.0	3.00	3.0	3.00	4.0
RelationshipSatisfaction	1470.0	2.71	1.08	1.0	2.00	3.0	4.00	4.0
StandardHours	1470.0	80.00	0.00	80.0	80.00	80.0	80.00	80.0
StockOptionLevel	1470.0	0.79	0.85	0.0	0.00	1.0	1.00	3.0
TotalWorkingYears	1470.0	11.28	7.78	0.0	6.00	10.0	15.00	40.0
TrainingTimesLastYear	1470.0	2.80	1.29	0.0	2.00	3.0	3.00	6.0
WorkLifeBalance	1470.0	2.76	0.71	1.0	2.00	3.0	3.00	4.0
YearsAtCompany	1470.0	7.01	6.13	0.0	3.00	5.0	9.00	40.0
YearsInCurrentRole	1470.0	4.23	3.62	0.0	2.00	3.0	7.00	18.0
YearsSinceLastPromotion	1470.0	2.19	3.22	0.0	0.00	1.0	3.00	15.0
YearsWithCurrManager	1470.0	4.12	3.57	0.0	2.00	3.0	7.00	17.0

In [9]:

df.info() #displaying a concise summary

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1470 entries, 0 to 1469
Data columns (total 35 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Age                       1470 non-null   int64 
 1   Attrition                 1470 non-null   object
 2   BusinessTravel            1470 non-null   object
 3   DailyRate                 1470 non-null   int64 
 4   Department                1470 non-null   object
 5   DistanceFromHome          1470 non-null   int64 
 6   Education                 1470 non-null   int64 
 7   EducationField            1470 non-null   object
 8   EmployeeCount             1470 non-null   int64 
 9   EmployeeNumber            1470 non-null   int64 
 10  EnvironmentSatisfaction   1470 non-null   int64 
 11  Gender                    1470 non-null   object
 12  HourlyRate                1470 non-null   int64 
 13  JobInvolvement            1470 non-null   int64 
 14  JobLevel                  1470 non-null   int64 
 15  JobRole                   1470 non-null   object
 16  JobSatisfaction           1470 non-null   int64 
 17  MaritalStatus             1470 non-null   object
 18  MonthlyIncome             1470 non-null   int64 
 19  MonthlyRate               1470 non-null   int64 
 20  NumCompaniesWorked        1470 non-null   int64 
 21  Over18                    1470 non-null   object
 22  OverTime                  1470 non-null   object
 23  PercentSalaryHike         1470 non-null   int64 
 24  PerformanceRating         1470 non-null   int64 
 25  RelationshipSatisfaction  1470 non-null   int64 
 26  StandardHours             1470 non-null   int64 
 27  StockOptionLevel          1470 non-null   int64 
 28  TotalWorkingYears         1470 non-null   int64 
 29  TrainingTimesLastYear     1470 non-null   int64 
 30  WorkLifeBalance           1470 non-null   int64 
 31  YearsAtCompany            1470 non-null   int64 
 32  YearsInCurrentRole        1470 non-null   int64 
 33  YearsSinceLastPromotion   1470 non-null   int64 
 34  YearsWithCurrManager      1470 non-null   int64 
dtypes: int64(26), object(9)
memory usage: 402.1+ KB

In [10]:

df.drop(["EmployeeNumber", "EmployeeCount", "StandardHours", "Over18"], axis=1, inplace=True)  # remove the identifier and single-valued columns from the dataframe
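In the summary statistics above, EmployeeCount and StandardHours have a standard deviation of 0, and EmployeeNumber looks like a row identifier, so none of them can help a model. A minimal sketch of how such columns could be flagged automatically is given here. The `toy` frame and the helper name `find_uninformative_columns` are illustrative assumptions, not part of the notebook.

```python
import pandas as pd


def find_uninformative_columns(frame: pd.DataFrame) -> list:
    """Return columns that hold one value only, or a distinct value in every row."""
    n_unique = frame.nunique()
    constant = n_unique[n_unique == 1].index.tolist()
    identifiers = n_unique[n_unique == len(frame)].index.tolist()
    return constant + identifiers


# Small made-up frame that mimics the situation in the HR data
toy = pd.DataFrame(
    {
        "EmployeeCount": [1, 1, 1, 1],
        "EmployeeNumber": [1, 2, 4, 5],
        "Age": [41, 49, 37, 41],
        "Over18": ["Y", "Y", "Y", "Y"],
    }
)

to_drop = find_uninformative_columns(toy)
print(to_drop)
```

The "distinct value in every row" rule is only a heuristic: a continuous measurement could also be unique per row, so flagged columns should be reviewed before dropping.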

In [11]:

df.isnull().sum()  # count the missing values in each column

Out[11]:

Age                         0
Attrition                   0
BusinessTravel              0
DailyRate                   0
Department                  0
DistanceFromHome            0
Education                   0
EducationField              0
EnvironmentSatisfaction     0
Gender                      0
HourlyRate                  0
JobInvolvement              0
JobLevel                    0
JobRole                     0
JobSatisfaction             0
MaritalStatus               0
MonthlyIncome               0
MonthlyRate                 0
NumCompaniesWorked          0
OverTime                    0
PercentSalaryHike           0
PerformanceRating           0
RelationshipSatisfaction    0
StockOptionLevel            0
TotalWorkingYears           0
TrainingTimesLastYear       0
WorkLifeBalance             0
YearsAtCompany              0
YearsInCurrentRole          0
YearsSinceLastPromotion     0
YearsWithCurrManager        0
dtype: int64

In [12]:

# Identify categorical columns
categorical_cols = df.select_dtypes(include=['object']).columns

In [13]:

# Apply Label Encoding
for col in categorical_cols:
    df[col] = LabelEncoder().fit_transform(df[col])
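Because the encoders in the loop are not kept, the mapping from integer codes back to the original labels is lost. `LabelEncoder` assigns codes in sorted label order, so the mapping can be recovered as in this small sketch. The `travel` list is a made-up sample of BusinessTravel values, used only for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of BusinessTravel labels
travel = ["Travel_Rarely", "Travel_Frequently", "Non-Travel", "Travel_Rarely"]

encoder = LabelEncoder()
codes = encoder.fit_transform(travel)

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(list(codes))
```

The same rule gives Attrition "No" the code 0 and "Yes" the code 1. Note that integer codes impose an arbitrary order on nominal features such as JobRole, which a linear model will treat as if it were meaningful.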

In [14]:

# Separating features and target variable
X = df.drop('Attrition', axis=1)
Y = df['Attrition']

In [15]:

X.nunique()  # count the unique values in each feature

Out[15]:

Age                           43
BusinessTravel                 3
DailyRate                    886
Department                     3
DistanceFromHome              29
Education                      5
EducationField                 6
EnvironmentSatisfaction        4
Gender                         2
HourlyRate                    71
JobInvolvement                 4
JobLevel                       5
JobRole                        9
JobSatisfaction                4
MaritalStatus                  3
MonthlyIncome               1349
MonthlyRate                 1427
NumCompaniesWorked            10
OverTime                       2
PercentSalaryHike             15
PerformanceRating              2
RelationshipSatisfaction       4
StockOptionLevel               4
TotalWorkingYears             40
TrainingTimesLastYear          7
WorkLifeBalance                4
YearsAtCompany                37
YearsInCurrentRole            19
YearsSinceLastPromotion       16
YearsWithCurrManager          18
dtype: int64

EDA

In [16]:

for col in df.columns:
    
    print(col)
    plt.figure(figsize=(12,4))
    plt.subplot(1,2,1)
    df[col].hist(bins=10, grid=False)
    plt.ylabel('count')
    plt.subplot(1,2,2)
    sns.boxplot(x=df[col])
    plt.show()

[Output: for each column, from Age through YearsWithCurrManager, the column name is printed, followed by a histogram and a boxplot of that column. Figures not reproduced.]

In [17]:

plt.figure(figsize = (20, 20))
cmap = sns.diverging_palette(230, 20, as_cmap = True)
sns.heatmap(df.corr(), annot = True, fmt = '.2f', cmap = cmap )
plt.show()

The heatmap shows strong positive correlations between the following pairs of variables: age and total working years (0.68), job level and monthly income (0.95), job level and total working years (0.78), monthly income and total working years (0.77), percent salary hike and performance rating (0.77), years at the company and years in the current role (0.76), and years at the company and years with the current manager (0.77).
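Reading pairs off a 31 x 31 heatmap by eye is error-prone. The following sketch shows one way to list the strongly correlated pairs directly from a correlation matrix. The helper name `strong_pairs`, the 0.7 cutoff, and the synthetic `toy` data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def strong_pairs(frame: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """List variable pairs whose absolute correlation is at least `threshold`."""
    corr = frame.corr()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() >= threshold].sort_values(ascending=False)


# Synthetic data: income depends on job level, distance is unrelated
rng = np.random.default_rng(0)
level = rng.integers(1, 6, size=200)
toy = pd.DataFrame(
    {
        "JobLevel": level,
        "MonthlyIncome": level * 3000 + rng.normal(0, 500, size=200),
        "DistanceFromHome": rng.integers(1, 30, size=200),
    }
)

result = strong_pairs(toy)
print(result)
```

In the notebook, the same helper could be called on `df` in place of `toy`.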

In [18]:

# function to create labeled barplots


def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 2, 6))
    else:
        plt.figure(figsize=(n + 2, 6))

    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Set2",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # top of the bar

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage

    plt.show()  # show the plot

In [19]:

labeled_barplot(df, "Attrition", perc=True)

About 83.9% of employees are labeled '0' (no attrition) and 16.1% are labeled '1' (attrition), so the classes are imbalanced.

In [20]:

labeled_barplot(df, "Gender", perc=True)

Male employees make up 60% of the workforce and female employees 40%.

In [21]:

labeled_barplot(df, "BusinessTravel", perc=True)

71% of employees travel rarely, 18.8% travel frequently, and 10.2% do not travel.

In [22]:

labeled_barplot(df, "MaritalStatus", perc=True)

22.2% of employees are divorced, 45.8% are married, and 32% are single.

In [23]:

plt.figure(figsize=(5, 3))
sns.boxplot(x="Attrition", y="Age", data=df).set(title="Attrition Vs Age")

Out[23]:

[Text(0.5, 1.0, 'Attrition Vs Age')]

Employees who left the company tend to be younger: the attrition box (orange) has a lower median age and a narrower interquartile range. The company may be retaining older employees better, or younger employees may be leaving for reasons such as career advancement, higher education, or other job opportunities.

In [24]:

plt.figure(figsize=(5, 3))
sns.boxplot(x="Attrition", y="YearsAtCompany", data=df).set(title="Attrition Vs YearsAtCompany")

Out[24]:

[Text(0.5, 1.0, 'Attrition Vs YearsAtCompany')]

Employees who leave the company tend to have fewer years of service. Years at the company vary more widely among current employees than among those who have left.

In [25]:

plt.figure(figsize=(5, 3))
sns.boxplot(x="Attrition", y="StockOptionLevel", data=df).set(title="Attrition Vs StockOptionLevel")

Out[25]:

[Text(0.5, 1.0, 'Attrition Vs StockOptionLevel')]

Employees with no attrition (0) tend to have higher stock option levels than those with attrition (1).

In [26]:

plt.figure(figsize=(5, 3))
sns.boxplot(x="Attrition", y="DistanceFromHome", data=df).set(title="Attrition Vs DistanceFromHome")

Out[26]:

[Text(0.5, 1.0, 'Attrition Vs DistanceFromHome')]

Employees who left (attrition = 1) tend to live farther from their workplace than those who stayed.

In [27]:

for column in X.columns:
    # Get unique values in each column
    unique_values = X[column].unique()
    
    # Print the column name and its unique values
    print(f"Unique values in '{column}': {unique_values}")

Unique values in 'Age': [41 49 37 33 27 32 59 30 38 36 35 29 31 34 28 22 53 24 21 42 44 46 39 43
 50 26 48 55 45 56 23 51 40 54 58 20 25 19 57 52 47 18 60]
Unique values in 'BusinessTravel': [2 1 0]
Unique values in 'DailyRate': [1102  279 1373 1392  591 1005 1324 1358  216 1299  809  153  670 1346
  103 1389  334 1123 1219  371  673 1218  419  391  699 1282 1125  691
  477  705  924 1459  125  895  813 1273  869  890  852 1141  464 1240
 1357  994  721 1360 1065  408 1211 1229  626 1434 1488 1097 1443  515
  853 1142  655 1115  427  653  989 1435 1223  836 1195 1339  664  318
 1225 1328 1082  548  132  746  776  193  397  945 1214  111  573 1153
 1400  541  432  288  669  530  632 1334  638 1093 1217 1353  120  682
  489  807  827  871  665 1040 1420  240 1280  534 1456  658  142 1127
 1031 1189 1354 1467  922  394 1312  750  441  684  249  841  147  528
  594  470  957  542  802 1355 1150 1329  959 1033 1316  364  438  689
  201 1427  857  933 1181 1395  662 1436  194  967 1496 1169 1145  630
  303 1256  440 1450 1452  465  702 1157  602 1480 1268  713  134  526
 1380  140  629 1356  328 1084  931  692 1069  313  894  556 1344  290
  138  926 1261  472 1002  878  905 1180  121 1136  635 1151  644 1045
  829 1242 1469  896  992 1052 1147 1396  663  119  979  319 1413  944
 1323  532  818  854 1034  771 1401 1431  976 1411 1300  252 1327  832
 1017 1199  504  505  916 1247  685  269 1416  833  307 1311  128  488
  529 1210 1463  675 1385 1403  452  666 1158  228  996  728 1315  322
 1479  797 1070  442  496 1372  920  688 1449 1117  636  506  444  950
  889  555  230 1232  566 1302  812 1476  218 1132 1105  906  849  390
  106 1249  192  553  117  185 1091  723 1220  588 1377 1018 1275  798
  672 1162  508 1482  559  210  928 1001  549 1124  738  570 1130 1192
  343  144 1296 1309  483  810  544 1062 1319  641 1332  756  845  593
 1171  350  921 1144  143 1046  575  156 1283  755  304 1178  329 1362
 1371  202  253  164 1107  759 1305  982  821 1381  480 1473  891 1063
  645 1490  317  422 1485 1368 1448  296 1398 1349  986 1099 1116 1499
  983 1009 1303 1274 1277  587  413 1276  988 1474  163  267  619  302
  443  828  561  426  232 1306 1094  509  775  195  258  471  799  956
  535 1495  446 1245  703  823 1246  622 1287  448  254 1365  538  525
  558  782  362 1236 1112  204 1343  604 1216  646  160  238 1397  306
  991  482 1176  913 1076  727  885  243  806  817 1410 1207 1442  693
  929  562  608  580  970 1179  294  314  316  654  168  381  217  501
  650  141  804  975 1090  346  430  268  167  621  527  883  954  310
  719  725  715  657 1146  182  376  571  384  791 1111 1243 1092 1325
  805  213  118  676 1252  286 1258  932 1041  859  720  946 1184  436
  589  760  887 1318  625  180  586 1012  661  930  342 1230 1271 1278
  607  130  300  583 1418 1269  379  395 1265 1222  341  868 1231  102
  881 1383 1075  374 1086  781  177  500 1425 1454  617 1085  995 1122
  618  546  462 1198 1272  154 1137 1188  188 1333  867  263  938  129
  616  498 1404 1053  289 1376  231  152  882  903 1379  335  722  461
  974 1126  840 1134  248  955  939 1391 1206  287 1441  109 1066  277
  466 1055  265  135  247 1035  266  145 1038 1234 1109 1089  788  124
  660 1186 1464  796  415  769 1003 1366  330 1492 1204  309 1330  469
  697 1262 1050  770  406  203 1308  984  439  793 1451 1182  174  490
  718  433  773  603  874  367  199  481  647 1384  902  819  862 1457
  977  942 1402 1421 1361  917  200  150  179  696  116  363  107 1465
  458 1212 1103  966 1010  326 1098  969 1167  694 1320  536  373  599
  251  131  237 1429  648  735  531  429  968  879  640  412  848  360
 1138  325 1322  299 1030  634  524  256 1060  935  495  282  206  943
  523  507  601  855 1291 1405 1369  999 1202  285  404  736 1498 1200
 1439  499  205  683 1462  949  652  332 1475  337  971 1174  667  560
  172  383 1255  359  401  377  592 1445 1221  866  981  447 1326  748
  990  405  115  790  830 1193 1423  467  271  410 1083  516  224  136
 1029  333 1440  674 1342  898  824  492  598  740  888 1288  104 1108
  479 1351  474  437  884 1370  264 1059  563  457 1313  241 1015  336
 1387  170  208  671  711  737 1470  365  763  567  486  772  301  311
  584  880  392  148  708 1259  786  370  678  146  581  918 1238  585
  741  552  369  717  543  964  792  611  176  897  600 1054  428  181
  211 1079  590  305  953  478 1375  244  511 1294  196  734 1239 1253
 1128 1336  234  766  261 1194  431  572 1422 1297  574  355  207  706
  280  726  414  352 1224  459 1254 1131  835 1172 1266  783  219 1213
 1096 1251 1394  605 1064 1337  937  157  754 1168  155 1444  189  911
 1321 1154  557  642  801  161 1382 1037  105  582  704  345 1120 1378
  468  613 1023  628]
Unique values in 'Department': [2 1 0]
Unique values in 'DistanceFromHome': [ 1  8  2  3 24 23 27 16 15 26 19 21  5 11  9  7  6 10  4 25 12 18 29 22
 14 20 28 17 13]
Unique values in 'Education': [2 1 4 3 5]
Unique values in 'EducationField': [1 4 3 2 5 0]
Unique values in 'EnvironmentSatisfaction': [2 3 4 1]
Unique values in 'Gender': [0 1]
Unique values in 'HourlyRate': [ 94  61  92  56  40  79  81  67  44  84  49  31  93  50  51  80  96  78
  45  82  53  83  58  72  48  42  41  86  97  75  33  37  73  98  36  47
  71  30  43  99  59  95  57  76  87  66  55  32  52  70  62  64  63  60
 100  46  39  77  35  91  54  34  90  65  88  85  89  68  69  74  38]
Unique values in 'JobInvolvement': [3 2 4 1]
Unique values in 'JobLevel': [2 1 3 4 5]
Unique values in 'JobRole': [7 6 2 4 0 3 8 5 1]
Unique values in 'JobSatisfaction': [4 2 3 1]
Unique values in 'MaritalStatus': [2 1 0]
Unique values in 'MonthlyIncome': [5993 5130 2090 ... 9991 5390 4404]
Unique values in 'MonthlyRate': [19479 24907  2396 ...  5174 13243 10228]
Unique values in 'NumCompaniesWorked': [8 1 6 9 0 4 5 2 7 3]
Unique values in 'OverTime': [1 0]
Unique values in 'PercentSalaryHike': [11 23 15 12 13 20 22 21 17 14 16 18 19 24 25]
Unique values in 'PerformanceRating': [3 4]
Unique values in 'RelationshipSatisfaction': [1 4 2 3]
Unique values in 'StockOptionLevel': [0 1 3 2]
Unique values in 'TotalWorkingYears': [ 8 10  7  6 12  1 17  5  3 31 13  0 26 24 22  9 19  2 23 14 15  4 29 28
 21 25 20 11 16 37 38 30 40 18 36 34 32 33 35 27]
Unique values in 'TrainingTimesLastYear': [0 3 2 5 1 4 6]
Unique values in 'WorkLifeBalance': [1 3 2 4]
Unique values in 'YearsAtCompany': [ 6 10  0  8  2  7  1  9  5  4 25  3 12 14 22 15 27 21 17 11 13 37 16 20
 40 24 33 19 36 18 29 31 32 34 26 30 23]
Unique values in 'YearsInCurrentRole': [ 4  7  0  2  5  9  8  3  6 13  1 15 14 16 11 10 12 18 17]
Unique values in 'YearsSinceLastPromotion': [ 0  1  3  2  7  4  8  6  5 15  9 13 12 10 11 14]
Unique values in 'YearsWithCurrManager': [ 5  7  0  2  6  8  3 11 17  1  4 12  9 10 15 13 16 14]

In [28]:

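# All categorical columns were label-encoded above, so X is already fully numeric
# and get_dummies leaves it unchanged (no dummy columns appear in the output).
X = pd.get_dummies(X, drop_first=True)
X.sample(10)  # display a random sample of 10 rows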

Out[28]:

	Age	BusinessTravel	DailyRate	Department	DistanceFromHome	Education	EducationField	EnvironmentSatisfaction	Gender	HourlyRate	JobInvolvement	JobLevel	JobRole	JobSatisfaction	MaritalStatus	MonthlyIncome	MonthlyRate	NumCompaniesWorked	OverTime	PercentSalaryHike	PerformanceRating	RelationshipSatisfaction	StockOptionLevel	TotalWorkingYears	TrainingTimesLastYear	WorkLifeBalance	YearsAtCompany	YearsInCurrentRole	YearsSinceLastPromotion	YearsWithCurrManager
244	45	2	252	1	1	3	4	3	1	70	4	5	3	4	1	19202	15970	0	0	11	3	3	1	25	2	3	24	0	1	7
1180	36	2	311	1	7	3	1	1	1	77	3	1	2	2	2	2013	10950	2	0	11	3	3	0	15	4	3	4	3	1	3
1173	36	2	711	1	5	4	1	2	0	42	3	3	0	1	1	8008	22792	4	0	12	3	3	2	9	6	3	3	2	0	2
494	34	2	204	2	14	3	5	3	0	31	3	1	8	3	0	2579	2912	1	1	18	3	4	2	8	3	3	8	2	0	6
700	58	2	289	1	2	3	5	4	1	51	3	1	6	3	2	2479	26227	4	0	24	4	1	0	7	4	3	1	0	0	0
152	53	2	1436	2	6	2	2	2	1	34	3	2	8	3	1	2306	16047	2	1	20	4	4	1	13	3	1	7	7	4	5
1156	40	2	884	1	15	3	1	1	0	80	2	3	4	3	1	10435	25800	1	0	13	3	4	2	18	2	3	18	15	14	12
658	44	2	661	1	9	2	1	2	1	61	3	1	6	1	1	2559	7508	1	1	13	3	4	0	8	0	3	8	7	7	1
721	50	2	939	1	24	3	1	4	1	95	3	4	4	3	1	13973	4161	3	1	18	3	4	1	22	2	3	12	11	1	5
41	27	2	1240	1	2	4	1	4	0	33	3	1	2	1	0	2341	19715	1	0	13	3	4	1	1	6	3	1	0	0	0

In [29]:

# Split the data into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

In [30]:

# Print the class proportions in the full, training, and test targets
print(Y.value_counts(1))
print(y_train.value_counts(1))
print(y_test.value_counts(1))

0    0.838776
1    0.161224
Name: Attrition, dtype: float64
0    0.844509
1    0.155491
Name: Attrition, dtype: float64
0    0.825397
1    0.174603
Name: Attrition, dtype: float64
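The share of class 1 differs between the training set (15.5%) and the test set (17.5%) because the split is purely random. If closer proportions are wanted, `train_test_split` accepts a `stratify` argument. The sketch below uses a made-up target with 16% positives; it is an alternative to the split used in this notebook, and applying it would change the results reported further down.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up target with 16% positives and a placeholder feature column
y = np.array([0] * 84 + [1] * 16)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the class proportions nearly equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y
)

print("train share of class 1:", y_tr.mean())
print("test share of class 1:", y_te.mean())
```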

In [31]:

def model_performance_classification_LR(model, predictors, target, thresholds=(0.5,)):
    # Predict probabilities for the positive class
    y_probs = model.predict_proba(predictors)[:, 1]

    # Collect one row of performance metrics per threshold
    rows = []
    for threshold in thresholds:
        # Convert probabilities to binary predictions based on the threshold
        y_pred = (y_probs >= threshold).astype(int)

        # Compute the metrics and store them for this threshold
        rows.append(
            {
                "Threshold": threshold,
                "Accuracy": accuracy_score(target, y_pred),
                "Recall": recall_score(target, y_pred),
                "Precision": precision_score(target, y_pred),
                "F1": f1_score(target, y_pred),
            }
        )

    # DataFrame.append was removed in pandas 2.0, so build the frame from the rows
    return pd.DataFrame(rows, columns=["Threshold", "Accuracy", "Recall", "Precision", "F1"])

In [32]:

def make_confusion_matrix_LR(model, predictors, target, threshold=0.5):
    # Predict probabilities for the positive class
    y_probs = model.predict_proba(predictors)[:, 1]

    # Convert probabilities to class labels based on the threshold
    y_pred = (y_probs >= threshold).astype(int)

    # Generate the confusion matrix
    cm = confusion_matrix(target, y_pred)

    # Create labels for the confusion matrix
    group_names = ["True Neg", "False Pos", "False Neg", "True Pos"]
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [
        f"{v1}\n{v2}\n{v3}"
        for v1, v2, v3 in zip(group_names, group_counts, group_percentages)
    ]
    labels = np.asarray(labels).reshape(2, 2)

    # Plot the confusion matrix
    fig, ax = plt.subplots(figsize=(4, 4))  # set figure size to 4x4 inches
    sns.heatmap(cm, annot=labels, fmt="", cmap="Blues", annot_kws={"size": 10})

    plt.ylabel('Actual')
    plt.xlabel('Predicted')
    plt.show()

In [33]:

# fitting the Logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

Out[33]:

LogisticRegression()


In [34]:

# Calculating different metrics for train and test sets
log_reg_train_perf = model_performance_classification_LR(
    log_reg, X_train, y_train, thresholds=[0.5]
)
print("Training performance:\n", log_reg_train_perf)
log_reg_test_perf = model_performance_classification_LR(log_reg, X_test, y_test, thresholds=[0.5])
print("Testing performance:\n", log_reg_test_perf)

Training performance:
    Threshold  Accuracy  Recall  Precision   F1
0        0.5  0.844509     0.0        0.0  0.0
Testing performance:
    Threshold  Accuracy  Recall  Precision   F1
0        0.5  0.823129     0.0        0.0  0.0

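The accuracy looks high, but recall, precision, and F1 are all 0 on both sets, so the model identifies no attrition cases. The training accuracy of 0.8445 equals the share of class 0 in the training set, which is exactly what a model that predicts "no attrition" for every employee would score.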
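With imbalanced classes, accuracy is only meaningful relative to a majority-class baseline. The sketch below, on a made-up target with 84% negatives, shows that a classifier which always predicts the most frequent class already reaches 84% accuracy with zero recall. `DummyClassifier` is part of scikit-learn; the data is an assumption for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Made-up target: 84 negatives, 16 positives; the features are irrelevant here
y = np.array([0] * 84 + [1] * 16)
X = np.zeros((len(y), 1))

# Always predicts the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, pred)
baseline_recall = recall_score(y, pred)
print(baseline_accuracy, baseline_recall)
```

Any candidate model should beat this baseline on recall or F1, not just match it on accuracy.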

In [35]:

# creating confusion matrix for test set (more relevant)
make_confusion_matrix_LR(log_reg, X_test, y_test, threshold=0.5)

The model classified almost all negative cases correctly but did not identify a single positive case. True negatives (TN): 363, or 82.31% of the predictions. False positives (FP): 1, or 0.23%. False negatives (FN): 77, or 17.46%. True positives (TP): 0, or 0.00%.

In [36]:

# Get predicted probabilities for the positive class
y_scores = log_reg.predict_proba(X_test)[:, 1]

# Calculate precision and recall for various thresholds
precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

# Adding last threshold of 1 to match the size of precision and recall arrays
thresholds = np.append(thresholds, 1)

# Plot precision and recall for various thresholds
plt.figure(figsize=(8, 6))
plt.plot(thresholds, precision, label='Precision', marker='.', linestyle='--', color='blue')
plt.plot(thresholds, recall, label='Recall', marker='.', linestyle='--', color='red')

plt.xlabel('Threshold')
plt.ylabel('Precision/Recall')
plt.title('Precision and Recall Curves')
plt.legend()
plt.grid(True)
plt.show()

Precision and recall appear to be balanced at a threshold of about 0.24.

In [37]:

# Find the threshold where the precision and recall curves are closest to each other

closest_zero = np.argmin(np.abs(precision - recall))
optimal_threshold = thresholds[closest_zero]

print("Optimal threshold:", optimal_threshold)

Optimal threshold: 0.24787148455194857
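The crossing-point search can be checked by hand on a tiny example. In this sketch the labels and scores are made up; it drops the final (precision = 1, recall = 0) point, which has no threshold of its own, before taking the smallest gap, as a variation on the notebook's approach of appending a threshold of 1.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.80])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more entry than thresholds, so align them first
gap = np.abs(precision[:-1] - recall[:-1])
balanced_threshold = thresholds[np.argmin(gap)]
print("Balanced threshold:", balanced_threshold)
```

At a threshold of 0.40 three cases are predicted positive and two of them are correct, so precision and recall are both 2/3. A threshold tuned on the test set tends to give optimistic test scores; a separate validation split would be a safer place to tune it.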

In [38]:

log_reg_test_perf = model_performance_classification_LR(log_reg, X_test, y_test, thresholds=[0.62])
print("Testing performance:\n", log_reg_test_perf)

Testing performance:
    Threshold  Accuracy  Recall  Precision   F1
0       0.62  0.825397     0.0        0.0  0.0

The results at thresholds 0.5 and 0.62 are nearly identical; accuracy is marginally higher at 0.62 (0.8254 versus 0.8231). The main issue is that recall, precision, and F1 are 0 in both cases: at these thresholds the model predicts almost no positive cases, and none of them correctly.
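One thing that could be worth checking (an assumption, not something this notebook verifies) is feature scaling: the features range from 1-4 ratings to rates in the tens of thousands, `StandardScaler` is imported in the first cell but never used, and warnings are suppressed, so a convergence warning would not be visible. A minimal sketch of a scaled logistic regression pipeline on synthetic data follows; the variable names and data are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one informative small-scale feature, one uninformative large-scale feature
rng = np.random.default_rng(1)
n = 500
signal = rng.normal(size=n)
monthly_rate = rng.normal(14000, 7000, size=n)
y = (signal + 0.3 * rng.normal(size=n) > 1.0).astype(int)
X = np.column_stack([signal, monthly_rate])

# Standardize the features, then fit the logistic regression
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_log_reg.fit(X, y)

toy_recall = recall_score(y, scaled_log_reg.predict(X))
print("Recall on the synthetic data:", toy_recall)
```

In the notebook, the pipeline would be fitted on `X_train` and `y_train`; whether it improves recall on the HR data would need to be tested.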

In [39]:

log_reg_test_perf = model_performance_classification_LR(log_reg, X_test, y_test, thresholds=[0.24])
print("Testing performance:\n", log_reg_test_perf)

Testing performance:
    Threshold  Accuracy    Recall  Precision      F1
0       0.24  0.768707  0.376623   0.349398  0.3625

With a lower threshold of 0.24, the model predicts about 76.87% of the outcomes correctly. It identifies about 37.66% of the actual positive cases (recall), and when it predicts a case as positive it is correct about 34.94% of the time (precision), which indicates a relatively high number of false positives. The F1 score is 0.3625. Precision and recall are close to each other at this threshold, so the low F1 score reflects that both are low rather than an imbalance between them. Accuracy remains fairly good, but it is below the 82.5% share of the majority class in the test set.

In [40]:

make_confusion_matrix_LR(log_reg, X_test, y_test, threshold=0.24)

The model classifies negative cases correctly far more often than positive cases. False positives (54) outnumber true positives (29), which confirms that precision is low, and false negatives (48) also outnumber true positives, consistent with the low recall noted earlier. The decision threshold of 0.24 is lower than the standard 0.5; lowering the threshold is typically an attempt to increase the number of true positives. Even with the lowered threshold, however, the model misses most of the actual positive cases. TN: the model correctly predicted the negative class 310 times, accounting for 70.29% of all predictions. FP: the model incorrectly predicted the positive class 54 times. FN: the model incorrectly predicted the negative class 48 times. TP: the model correctly predicted the positive class 29 times.
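The metrics reported at the 0.24 threshold follow directly from these four counts. The short calculation below reproduces them from the confusion matrix values quoted above.

```python
# Confusion matrix counts at threshold 0.24, as reported above
tn, fp, fn, tp = 310, 54, 48, 29
total = tn + fp + fn + tp

accuracy = (tp + tn) / total          # share of all predictions that are correct
recall = tp / (tp + fn)               # share of actual positives that are found
precision = tp / (tp + fp)            # share of predicted positives that are correct
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.6f}")
print(f"Recall:    {recall:.6f}")
print(f"Precision: {precision:.6f}")
print(f"F1:        {f1:.4f}")
```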

In [41]:

# Class proportions in the training and test targets
print(y_train.value_counts(1))
print(y_test.value_counts(1))

0    0.844509
1    0.155491
Name: Attrition, dtype: float64
0    0.825397
1    0.174603
Name: Attrition, dtype: float64

In [42]:

dTree = DecisionTreeClassifier(criterion="gini", random_state=1)
dTree.fit(X_train, y_train)

Out[42]:

DecisionTreeClassifier(random_state=1)


In [43]:

# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification(model, predictors, target):

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )

    return df_perf

In [44]:

# Calculating different metrics
dTree_model_train_perf = model_performance_classification(
    dTree, X_train, y_train
)
print("Training performance:\n", dTree_model_train_perf)
dTree_model_test_perf = model_performance_classification(dTree, X_test, y_test)
print("Testing performance:\n", dTree_model_test_perf)

Training performance:
    Accuracy  Recall  Precision   F1
0       1.0     1.0        1.0  1.0
Testing performance:
    Accuracy    Recall  Precision        F1
0  0.773243  0.324675   0.342466  0.333333

On the test data, accuracy drops to approximately 77.32%, recall to 32.47%, precision to 34.25%, and the F1 score to 0.33. Because the training scores are perfect across all metrics while the test scores are substantially lower, the decision tree is overfitting the training data.

In [45]:

# function to create Confusion matrix
def make_confusion_matrix(model, predictors, target, figsize=(5, 5)):
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    group_names = ["True Neg", "False Pos", "False Neg", "True Pos"]
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    labels = [
        f"{v1}\n{v2}\n{v3}"
        for v1, v2, v3 in zip(group_names, group_counts, group_percentages)
    ]
    labels = np.asarray(labels).reshape(2, 2)
    plt.figure(figsize=figsize)
    sns.heatmap(cm, annot=labels, fmt="", cmap="Blues")

In [46]:

make_confusion_matrix(dTree, X_train, y_train, figsize=(4, 3))

The model achieved 100% accuracy on the training data, with no false positives or false negatives. TN (84.45%): the model correctly predicted the negative class 869 times. FP (0.00%): the model made no false positive predictions. FN (0.00%): the model did not miss any positive cases. TP (15.55%): the model correctly predicted the positive class 160 times.
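
The labels in these plots rely on the layout of a binary confusion matrix: rows are actual classes, columns are predicted classes, so flattening row by row gives TN, FP, FN, TP. The sketch below reproduces that layout in plain Python; `confusion_counts` and the toy labels are hypothetical and used only for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels coded as 0 and 1."""
    cm = [[0, 0], [0, 0]]
    for actual, predicted in zip(y_true, y_pred):
        cm[actual][predicted] += 1  # row = actual class, column = predicted class
    return cm

# Hypothetical toy labels, not taken from the attrition data
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0, 1, 0]

cm = confusion_counts(y_true, y_pred)
flat = [count for row in cm for count in row]  # row-major order, like cm.flatten()
group_names = ["True Neg", "False Pos", "False Neg", "True Pos"]
total = sum(flat)

for name, count in zip(group_names, flat):
    print(f"{name}: {count} ({count / total:.2%})")
```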

In [47]:

feature_names = list(X.columns) #extracting the column names from dataframe X
print(feature_names) #prints the list of feature names

['Age', 'BusinessTravel', 'DailyRate', 'Department', 'DistanceFromHome', 'Education', 'EducationField', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate', 'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction', 'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked', 'OverTime', 'PercentSalaryHike', 'PerformanceRating', 'RelationshipSatisfaction', 'StockOptionLevel', 'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance', 'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion', 'YearsWithCurrManager']

In [48]:

from sklearn.tree import plot_tree
plt.figure(figsize=(20, 30))
plot_tree(
    dTree,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names = ["0","1"]
)
plt.show()

In [49]:

# Text report showing the rules of a decision tree -

print(tree.export_text(dTree, feature_names=feature_names, show_weights=True))

|--- OverTime <= 0.50
|   |--- TotalWorkingYears <= 2.50
|   |   |--- JobInvolvement <= 1.50
|   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |--- JobInvolvement >  1.50
|   |   |   |--- TrainingTimesLastYear <= 1.00
|   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |--- TrainingTimesLastYear >  1.00
|   |   |   |   |--- JobSatisfaction <= 1.50
|   |   |   |   |   |--- PercentSalaryHike <= 15.00
|   |   |   |   |   |   |--- TrainingTimesLastYear <= 4.50
|   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |--- TrainingTimesLastYear >  4.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- PercentSalaryHike >  15.00
|   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |--- JobSatisfaction >  1.50
|   |   |   |   |   |--- WorkLifeBalance <= 2.50
|   |   |   |   |   |   |--- Age <= 24.50
|   |   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |   |--- RelationshipSatisfaction <= 3.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- RelationshipSatisfaction >  3.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age >  24.50
|   |   |   |   |   |   |   |--- Department <= 1.50
|   |   |   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Department >  1.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- WorkLifeBalance >  2.50
|   |   |   |   |   |   |--- HourlyRate <= 56.50
|   |   |   |   |   |   |   |--- YearsWithCurrManager <= 0.50
|   |   |   |   |   |   |   |   |--- TrainingTimesLastYear <= 2.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |--- TrainingTimesLastYear >  2.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- YearsWithCurrManager >  0.50
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |--- HourlyRate >  56.50
|   |   |   |   |   |   |   |--- weights: [25.00, 0.00] class: 0
|   |--- TotalWorkingYears >  2.50
|   |   |--- EnvironmentSatisfaction <= 1.50
|   |   |   |--- JobInvolvement <= 1.50
|   |   |   |   |--- MonthlyRate <= 6287.00
|   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |--- MonthlyRate >  6287.00
|   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |--- JobInvolvement >  1.50
|   |   |   |   |--- HourlyRate <= 99.50
|   |   |   |   |   |--- DailyRate <= 1468.50
|   |   |   |   |   |   |--- PercentSalaryHike <= 13.50
|   |   |   |   |   |   |   |--- YearsInCurrentRole <= 1.50
|   |   |   |   |   |   |   |   |--- DailyRate <= 641.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |   |--- DailyRate >  641.50
|   |   |   |   |   |   |   |   |   |--- StockOptionLevel <= 2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- StockOptionLevel >  2.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- YearsInCurrentRole >  1.50
|   |   |   |   |   |   |   |   |--- Age <= 25.50
|   |   |   |   |   |   |   |   |   |--- Education <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Education >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |--- Age >  25.50
|   |   |   |   |   |   |   |   |   |--- JobSatisfaction <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- MonthlyIncome <= 2394.00
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |   |--- MonthlyIncome >  2394.00
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [31.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- JobSatisfaction >  3.50
|   |   |   |   |   |   |   |   |   |   |--- MonthlyRate <= 10396.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- MonthlyRate >  10396.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |--- PercentSalaryHike >  13.50
|   |   |   |   |   |   |   |--- DailyRate <= 157.50
|   |   |   |   |   |   |   |   |--- RelationshipSatisfaction <= 3.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |--- RelationshipSatisfaction >  3.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- DailyRate >  157.50
|   |   |   |   |   |   |   |   |--- Age <= 55.50
|   |   |   |   |   |   |   |   |   |--- DistanceFromHome <= 28.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [67.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- DistanceFromHome >  28.50
|   |   |   |   |   |   |   |   |   |   |--- TrainingTimesLastYear <= 3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- TrainingTimesLastYear >  3.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Age >  55.50
|   |   |   |   |   |   |   |   |   |--- Department <= 1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Department >  1.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- DailyRate >  1468.50
|   |   |   |   |   |   |--- HourlyRate <= 76.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- HourlyRate >  76.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- HourlyRate >  99.50
|   |   |   |   |   |--- MonthlyIncome <= 4814.50
|   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- MonthlyIncome >  4814.50
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |--- EnvironmentSatisfaction >  1.50
|   |   |   |--- YearsAtCompany <= 30.00
|   |   |   |   |--- WorkLifeBalance <= 2.50
|   |   |   |   |   |--- HourlyRate <= 38.50
|   |   |   |   |   |   |--- JobLevel <= 2.50
|   |   |   |   |   |   |   |--- MonthlyIncome <= 2711.50
|   |   |   |   |   |   |   |   |--- MonthlyIncome <= 2219.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- MonthlyIncome >  2219.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |--- MonthlyIncome >  2711.50
|   |   |   |   |   |   |   |   |--- Age <= 54.50
|   |   |   |   |   |   |   |   |   |--- weights: [11.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  54.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- JobLevel >  2.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- HourlyRate >  38.50
|   |   |   |   |   |   |--- MonthlyIncome <= 2064.00
|   |   |   |   |   |   |   |--- Age <= 28.00
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Age >  28.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- MonthlyIncome >  2064.00
|   |   |   |   |   |   |   |--- YearsWithCurrManager <= 9.50
|   |   |   |   |   |   |   |   |--- EducationField <= 0.50
|   |   |   |   |   |   |   |   |   |--- YearsAtCompany <= 6.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |--- YearsAtCompany >  6.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- EducationField >  0.50
|   |   |   |   |   |   |   |   |   |--- Education <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- EducationField <= 4.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [82.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- EducationField >  4.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |   |   |   |--- Education >  3.50
|   |   |   |   |   |   |   |   |   |   |--- PercentSalaryHike <= 14.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [17.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |   |--- PercentSalaryHike >  14.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 3
|   |   |   |   |   |   |   |--- YearsWithCurrManager >  9.50
|   |   |   |   |   |   |   |   |--- JobRole <= 6.00
|   |   |   |   |   |   |   |   |   |--- weights: [5.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- JobRole >  6.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |--- WorkLifeBalance >  2.50
|   |   |   |   |   |--- YearsSinceLastPromotion <= 14.50
|   |   |   |   |   |   |--- DailyRate <= 110.00
|   |   |   |   |   |   |   |--- MonthlyRate <= 22965.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- MonthlyRate >  22965.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- DailyRate >  110.00
|   |   |   |   |   |   |   |--- JobRole <= 7.50
|   |   |   |   |   |   |   |   |--- DailyRate <= 1444.00
|   |   |   |   |   |   |   |   |   |--- Age <= 45.50
|   |   |   |   |   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 5
|   |   |   |   |   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [205.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- Age >  45.50
|   |   |   |   |   |   |   |   |   |   |--- TotalWorkingYears <= 13.50
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 4
|   |   |   |   |   |   |   |   |   |   |--- TotalWorkingYears >  13.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [39.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- DailyRate >  1444.00
|   |   |   |   |   |   |   |   |   |--- JobInvolvement <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- JobInvolvement >  3.50
|   |   |   |   |   |   |   |   |   |   |--- Age <= 35.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- Age >  35.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- JobRole >  7.50
|   |   |   |   |   |   |   |   |--- TotalWorkingYears <= 9.50
|   |   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- TotalWorkingYears >  9.50
|   |   |   |   |   |   |   |   |   |--- YearsInCurrentRole <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |   |   |--- YearsInCurrentRole >  3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- YearsSinceLastPromotion >  14.50
|   |   |   |   |   |   |--- TrainingTimesLastYear <= 1.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- TrainingTimesLastYear >  1.50
|   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |--- YearsAtCompany >  30.00
|   |   |   |   |--- weights: [0.00, 1.00] class: 1
|--- OverTime >  0.50
|   |--- JobLevel <= 1.50
|   |   |--- YearsInCurrentRole <= 0.50
|   |   |   |--- StockOptionLevel <= 0.50
|   |   |   |   |--- weights: [0.00, 15.00] class: 1
|   |   |   |--- StockOptionLevel >  0.50
|   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |--- Education >  2.50
|   |   |   |   |   |--- Age <= 51.00
|   |   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |   |   |   |--- Age >  51.00
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |--- YearsInCurrentRole >  0.50
|   |   |   |--- NumCompaniesWorked <= 0.50
|   |   |   |   |--- YearsInCurrentRole <= 1.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- YearsInCurrentRole >  1.50
|   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |--- NumCompaniesWorked >  0.50
|   |   |   |   |--- JobInvolvement <= 1.50
|   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |--- JobInvolvement >  1.50
|   |   |   |   |   |--- DailyRate <= 936.00
|   |   |   |   |   |   |--- MonthlyIncome <= 2694.50
|   |   |   |   |   |   |   |--- TrainingTimesLastYear <= 1.00
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- TrainingTimesLastYear >  1.00
|   |   |   |   |   |   |   |   |--- DailyRate <= 879.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 13.00] class: 1
|   |   |   |   |   |   |   |   |--- DailyRate >  879.00
|   |   |   |   |   |   |   |   |   |--- PercentSalaryHike <= 12.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- PercentSalaryHike >  12.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- MonthlyIncome >  2694.50
|   |   |   |   |   |   |   |--- StockOptionLevel <= 1.50
|   |   |   |   |   |   |   |   |--- WorkLifeBalance <= 1.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- WorkLifeBalance >  1.50
|   |   |   |   |   |   |   |   |   |--- weights: [9.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- StockOptionLevel >  1.50
|   |   |   |   |   |   |   |   |--- Gender <= 0.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |   |   |   |   |   |   |--- Gender >  0.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- DailyRate >  936.00
|   |   |   |   |   |   |--- TrainingTimesLastYear <= 3.50
|   |   |   |   |   |   |   |--- RelationshipSatisfaction <= 2.50
|   |   |   |   |   |   |   |   |--- TotalWorkingYears <= 4.50
|   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- TotalWorkingYears >  4.50
|   |   |   |   |   |   |   |   |   |--- Education <= 3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |   |   |--- Education >  3.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- RelationshipSatisfaction >  2.50
|   |   |   |   |   |   |   |   |--- weights: [16.00, 0.00] class: 0
|   |   |   |   |   |   |--- TrainingTimesLastYear >  3.50
|   |   |   |   |   |   |   |--- MonthlyIncome <= 2396.00
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- MonthlyIncome >  2396.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |--- JobLevel >  1.50
|   |   |--- JobRole <= 6.50
|   |   |   |--- Age <= 24.00
|   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |--- Age >  24.00
|   |   |   |   |--- TotalWorkingYears <= 37.50
|   |   |   |   |   |--- MonthlyIncome <= 19853.00
|   |   |   |   |   |   |--- NumCompaniesWorked <= 8.50
|   |   |   |   |   |   |   |--- DistanceFromHome <= 28.50
|   |   |   |   |   |   |   |   |--- DailyRate <= 1421.50
|   |   |   |   |   |   |   |   |   |--- TotalWorkingYears <= 8.50
|   |   |   |   |   |   |   |   |   |   |--- YearsAtCompany <= 1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- YearsAtCompany >  1.50
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [7.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- TotalWorkingYears >  8.50
|   |   |   |   |   |   |   |   |   |   |--- weights: [93.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- DailyRate >  1421.50
|   |   |   |   |   |   |   |   |   |--- YearsWithCurrManager <= 9.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- YearsWithCurrManager >  9.00
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- DistanceFromHome >  28.50
|   |   |   |   |   |   |   |   |--- MonthlyIncome <= 10614.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- MonthlyIncome >  10614.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- NumCompaniesWorked >  8.50
|   |   |   |   |   |   |   |--- EnvironmentSatisfaction <= 2.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- EnvironmentSatisfaction >  2.50
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- MonthlyIncome >  19853.00
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- TotalWorkingYears >  37.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |--- JobRole >  6.50
|   |   |   |--- DistanceFromHome <= 11.00
|   |   |   |   |--- StockOptionLevel <= 0.50
|   |   |   |   |   |--- WorkLifeBalance <= 2.50
|   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |--- WorkLifeBalance >  2.50
|   |   |   |   |   |   |--- TrainingTimesLastYear <= 0.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- TrainingTimesLastYear >  0.50
|   |   |   |   |   |   |   |--- MonthlyIncome <= 7653.50
|   |   |   |   |   |   |   |   |--- weights: [10.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- MonthlyIncome >  7653.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- StockOptionLevel >  0.50
|   |   |   |   |   |--- weights: [26.00, 0.00] class: 0
|   |   |   |--- DistanceFromHome >  11.00
|   |   |   |   |--- StockOptionLevel <= 0.50
|   |   |   |   |   |--- HourlyRate <= 34.00
|   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- HourlyRate >  34.00
|   |   |   |   |   |   |--- Education <= 2.50
|   |   |   |   |   |   |   |--- Gender <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- Gender >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Education >  2.50
|   |   |   |   |   |   |   |--- weights: [0.00, 9.00] class: 1
|   |   |   |   |--- StockOptionLevel >  0.50
|   |   |   |   |   |--- YearsSinceLastPromotion <= 1.50
|   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |--- YearsSinceLastPromotion >  1.50
|   |   |   |   |   |   |--- DistanceFromHome <= 22.50
|   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |--- DistanceFromHome >  22.50
|   |   |   |   |   |   |   |--- HourlyRate <= 46.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- HourlyRate >  46.00
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0

In [50]:

# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        dTree.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

                               Imp
MonthlyIncome             0.085485
JobLevel                  0.067762
OverTime                  0.063004
TotalWorkingYears         0.061662
TrainingTimesLastYear     0.061090
YearsInCurrentRole        0.056927
Age                       0.049305
DistanceFromHome          0.047290
JobInvolvement            0.046883
StockOptionLevel          0.042094
Education                 0.040870
HourlyRate                0.040683
DailyRate                 0.038781
JobRole                   0.035187
WorkLifeBalance           0.035151
MonthlyRate               0.031497
YearsWithCurrManager      0.024574
YearsSinceLastPromotion   0.024189
RelationshipSatisfaction  0.023025
JobSatisfaction           0.022211
PercentSalaryHike         0.022082
NumCompaniesWorked        0.019826
YearsAtCompany            0.017997
EnvironmentSatisfaction   0.016954
Gender                    0.010484
Department                0.010279
EducationField            0.004708
PerformanceRating         0.000000
BusinessTravel            0.000000
MaritalStatus             0.000000

In [51]:

importances = dTree.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

The top three features by importance are MonthlyIncome (0.085), JobLevel (0.068), and OverTime (0.063). MonthlyIncome accounts for the largest share of the impurity reduction in the tree, but the importances are spread across many features, so no single feature dominates the model's predictions.
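
Gini importances are normalized, so they can be read as shares of the total impurity reduction. As a rough check, the sketch below copies the non-zero values printed for the unpruned tree into a dictionary (the dictionary itself is an illustration, not part of the notebook) and computes how much of the total the top three features account for.

```python
# Non-zero Gini importances as printed for the unpruned tree
importances = {
    "MonthlyIncome": 0.085485, "JobLevel": 0.067762, "OverTime": 0.063004,
    "TotalWorkingYears": 0.061662, "TrainingTimesLastYear": 0.061090,
    "YearsInCurrentRole": 0.056927, "Age": 0.049305, "DistanceFromHome": 0.047290,
    "JobInvolvement": 0.046883, "StockOptionLevel": 0.042094, "Education": 0.040870,
    "HourlyRate": 0.040683, "DailyRate": 0.038781, "JobRole": 0.035187,
    "WorkLifeBalance": 0.035151, "MonthlyRate": 0.031497,
    "YearsWithCurrManager": 0.024574, "YearsSinceLastPromotion": 0.024189,
    "RelationshipSatisfaction": 0.023025, "JobSatisfaction": 0.022211,
    "PercentSalaryHike": 0.022082, "NumCompaniesWorked": 0.019826,
    "YearsAtCompany": 0.017997, "EnvironmentSatisfaction": 0.016954,
    "Gender": 0.010484, "Department": 0.010279, "EducationField": 0.004708,
}

total = sum(importances.values())
ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
top3_share = sum(value for _, value in ranked[:3]) / total

print(f"sum of importances: {total:.4f}")
print(f"top-3 features: {[name for name, _ in ranked[:3]]}")
print(f"top-3 share of total importance: {top3_share:.1%}")
```

A small top-three share is typical of a fully grown tree, where many features each contribute a few deep splits.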

In [52]:

dTree_short = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=1)
dTree_short.fit(X_train, y_train)

Out[52]:

DecisionTreeClassifier(max_depth=3, random_state=1)

In [53]:

make_confusion_matrix(dTree_short, X_test, y_test, figsize=(4, 3))

The depth-3 decision tree predicts negative cases well (high TN count) but identifies few positive cases (low TP count). TN (81.41%): the model correctly predicted the negative class 359 times. FP (1.13%): the model incorrectly predicted the positive class 5 times. FN (15.42%): the model incorrectly predicted the negative class 68 times. TP (2.04%): the model correctly predicted the positive class 9 times.

In [54]:

# Calculating different metrics
dTree_short_model_train_perf = model_performance_classification(
    dTree_short, X_train, y_train
)
print("Training performance:\n", dTree_short_model_train_perf)
dTree_short_model_test_perf = model_performance_classification(dTree_short, X_test, y_test)
print("Testing performance:\n", dTree_short_model_test_perf)

Training performance:
    Accuracy  Recall  Precision        F1
0  0.864917  0.1625    0.83871  0.272251
Testing performance:
    Accuracy    Recall  Precision        F1
0  0.834467  0.116883   0.642857  0.197802

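Performance drops from the training set to the test set: accuracy falls from 86.49% to 83.45%, recall from 16.25% to 11.69%, precision from 83.87% to 64.29%, and F1 from 0.27 to 0.20. The gap is much smaller than for the unpruned tree, but recall is very low on both sets, so the shallow tree misses most attrition cases.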

In [55]:

plt.figure(figsize=(15, 10))

tree.plot_tree(
    dTree_short,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=["0","1"],
)
plt.show()

In [56]:

# Text report showing the rules of a decision tree -

print(tree.export_text(dTree_short, feature_names=feature_names, show_weights=True))

|--- OverTime <= 0.50
|   |--- TotalWorkingYears <= 2.50
|   |   |--- JobInvolvement <= 1.50
|   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |--- JobInvolvement >  1.50
|   |   |   |--- weights: [42.00, 15.00] class: 0
|   |--- TotalWorkingYears >  2.50
|   |   |--- EnvironmentSatisfaction <= 1.50
|   |   |   |--- weights: [116.00, 23.00] class: 0
|   |   |--- EnvironmentSatisfaction >  1.50
|   |   |   |--- weights: [509.00, 31.00] class: 0
|--- OverTime >  0.50
|   |--- JobLevel <= 1.50
|   |   |--- YearsInCurrentRole <= 0.50
|   |   |   |--- weights: [5.00, 22.00] class: 1
|   |   |--- YearsInCurrentRole >  0.50
|   |   |   |--- weights: [47.00, 34.00] class: 0
|   |--- JobLevel >  1.50
|   |   |--- JobRole <= 6.50
|   |   |   |--- weights: [106.00, 8.00] class: 0
|   |   |--- JobRole >  6.50
|   |   |   |--- weights: [44.00, 23.00] class: 0
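
The leaf weights in this report are the training-sample counts per class, and each leaf predicts its majority class. As an illustrative sketch (the list below is transcribed by hand from the printed rules and is not produced by the notebook), the training metrics of the depth-3 tree can be recovered from the leaves alone.

```python
# (class-0 count, class-1 count) for the eight leaves, in printed order
leaves = [
    (0, 4), (42, 15), (116, 23), (509, 31),   # OverTime <= 0.50
    (5, 22), (47, 34), (106, 8), (44, 23),    # OverTime >  0.50
]

tp = fp = fn = tn = 0
for n0, n1 in leaves:
    if n1 > n0:      # leaf predicts class 1 (attrition)
        tp += n1
        fp += n0
    else:            # leaf predicts class 0 (no attrition)
        tn += n0
        fn += n1

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
recall = tp / (tp + fn)
precision = tp / (tp + fp)

print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"accuracy={accuracy:.6f} recall={recall:.4f} precision={precision:.5f}")
```

Only two of the eight leaves predict attrition, which is why training recall stays low even though precision is high.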

In [57]:

# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        dTree_short.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

                               Imp
OverTime                  0.290465
JobLevel                  0.278283
YearsInCurrentRole        0.107835
JobRole                   0.107393
TotalWorkingYears         0.102742
JobInvolvement            0.069240
EnvironmentSatisfaction   0.044043
WorkLifeBalance           0.000000
TrainingTimesLastYear     0.000000
NumCompaniesWorked        0.000000
StockOptionLevel          0.000000
YearsAtCompany            0.000000
YearsSinceLastPromotion   0.000000
RelationshipSatisfaction  0.000000
PerformanceRating         0.000000
PercentSalaryHike         0.000000
Age                       0.000000
MonthlyIncome             0.000000
MonthlyRate               0.000000
BusinessTravel            0.000000
MaritalStatus             0.000000
JobSatisfaction           0.000000
HourlyRate                0.000000
Gender                    0.000000
EducationField            0.000000
Education                 0.000000
DistanceFromHome          0.000000
Department                0.000000
DailyRate                 0.000000
YearsWithCurrManager      0.000000

In [58]:

importances = dTree_short.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(10, 10))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

The top three features by importance are OverTime (0.290), JobLevel (0.278), and YearsInCurrentRole (0.108). OverTime stands out as the most significant predictor, with the longest bar in the chart, followed closely by JobLevel. Only seven features have non-zero importance because a depth-3 tree contains at most seven splits. The chart suggests that work-related features, especially time-related ones such as OverTime, YearsInCurrentRole, and TotalWorkingYears, play a crucial role in the model's decision process.
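
Gini importance is built from the impurity reduction achieved at each split. As a worked example, the sketch below computes the Gini impurity before and after the root split on OverTime, using class totals obtained by summing the leaf weights printed for the depth-3 tree. The helper `gini` is illustrative, and the result is the root's unnormalized contribution, not the normalized importance reported above.

```python
def gini(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Class totals (class 0, class 1) summed over the leaves under each root branch
left = (0 + 42 + 116 + 509, 4 + 15 + 23 + 31)    # OverTime <= 0.50
right = (5 + 47 + 106 + 44, 22 + 34 + 8 + 23)    # OverTime >  0.50
parent = (left[0] + right[0], left[1] + right[1])

n_left, n_right, n_parent = sum(left), sum(right), sum(parent)
weighted_children = (n_left / n_parent) * gini(left) + (n_right / n_parent) * gini(right)
impurity_decrease = gini(parent) - weighted_children

print(f"attrition rate without overtime: {left[1] / n_left:.1%}")
print(f"attrition rate with overtime:    {right[1] / n_right:.1%}")
print(f"Gini before split: {gini(parent):.4f}")
print(f"Gini after split:  {weighted_children:.4f}")
print(f"impurity decrease: {impurity_decrease:.4f}")
```

The difference in attrition rate between the two branches is what makes OverTime the first split chosen by the tree.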

In [59]:

# Choose the type of classifier.
dTree_tuned = DecisionTreeClassifier(
    class_weight={0: 0.5, 1: 0.5}, random_state=1
)  # equal weight for both classes, so the minority class is not up-weighted

# Grid of parameters to choose from

parameters = {
    "max_depth": np.arange(1, 7, 1),
    "min_samples_leaf": [5, 10, 15, 20, 25],
    "max_leaf_nodes": [3, 5, 10, 15],
    "min_impurity_decrease": [0.001, 0.01, 0.1],
}

# Type of scoring used to compare parameter combinations (F1, despite the variable name)
acc_scorer = metrics.make_scorer(metrics.f1_score)

# Run the grid search
grid_obj = GridSearchCV(dTree_tuned, parameters, scoring=acc_scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)

# Set the classifier to the best combination of parameters
dtree_tuned = grid_obj.best_estimator_

# Fit the best estimator on the training data
# (redundant with GridSearchCV's default refit=True, but harmless)
dtree_tuned.fit(X_train, y_train)

Out[59]:

DecisionTreeClassifier(class_weight={0: 0.5, 1: 0.5}, max_depth=5,
                       max_leaf_nodes=10, min_impurity_decrease=0.001,
                       min_samples_leaf=15, random_state=1)
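
To give a sense of the search cost, the sketch below enumerates the same grid with the standard library and counts candidates and cross-validation fits. It does not train any model, and `range(1, 7)` stands in for `np.arange(1, 7, 1)`.

```python
from itertools import product

# Same values as the `parameters` grid in the cell above
param_grid = {
    "max_depth": list(range(1, 7)),
    "min_samples_leaf": [5, 10, 15, 20, 25],
    "max_leaf_nodes": [3, 5, 10, 15],
    "min_impurity_decrease": [0.001, 0.01, 0.1],
}

combinations = list(product(*param_grid.values()))
n_candidates = len(combinations)
cv_folds = 5
n_fits = n_candidates * cv_folds  # one fit per candidate per fold

# Combination reported as best in Out[59], in the key order used above
selected = (5, 15, 10, 0.001)

print(f"candidates: {n_candidates}, cross-validation fits: {n_fits}")
print(f"selected combination is in the grid: {selected in combinations}")
```

With a grid of this size, an exhaustive search remains cheap for a single decision tree on a dataset of about a thousand rows.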

In [60]:

make_confusion_matrix(dtree_tuned, X_test, y_test, figsize=(4,3))

The model has a high rate of true negatives, indicating that it identifies non-attrition cases well. TN (79.14%): the model correctly predicted the negative class 349 times. FP (3.40%): the model incorrectly predicted the positive class 15 times. FN (12.24%): the model incorrectly predicted the negative class 54 times. TP (5.22%): the model correctly predicted the positive class 23 times.

In [61]:

# Calculating different metrics
dtree_tuned_model_train_perf = model_performance_classification(
    dtree_tuned, X_train, y_train
)
print("Training performance:\n", dtree_tuned_model_train_perf)
dtree_tuned_model_test_perf = model_performance_classification(dtree_tuned, X_test, y_test)
print("Testing performance:\n", dtree_tuned_model_test_perf)

Training performance:
    Accuracy  Recall  Precision       F1
0  0.876579   0.375   0.689655  0.48583
Testing performance:
    Accuracy    Recall  Precision   F1
0  0.843537  0.298701   0.605263  0.4

The model demonstrates reasonable accuracy, with a slight decrease from the training set (0.877) to the testing set (0.844). Recall, precision, and F1 score also decline on the testing set, and recall remains low on both (0.375 on training, 0.299 on testing).

In [62]:

plt.figure(figsize=(15, 10))

tree.plot_tree(
    dtree_tuned,
    feature_names=feature_names,
    filled=True,
    fontsize=9,
    node_ids=True,
    class_names=["0","1"],
)
plt.show()

In [63]:

# Text report showing the rules of a decision tree -

print(tree.export_text(dtree_tuned, feature_names=feature_names, show_weights=True))

|--- OverTime <= 0.50
|   |--- TotalWorkingYears <= 2.50
|   |   |--- HourlyRate <= 58.50
|   |   |   |--- weights: [4.50, 5.50] class: 1
|   |   |--- HourlyRate >  58.50
|   |   |   |--- weights: [16.50, 4.00] class: 0
|   |--- TotalWorkingYears >  2.50
|   |   |--- weights: [312.50, 27.00] class: 0
|--- OverTime >  0.50
|   |--- JobLevel <= 1.50
|   |   |--- YearsInCurrentRole <= 0.50
|   |   |   |--- weights: [2.50, 11.00] class: 1
|   |   |--- YearsInCurrentRole >  0.50
|   |   |   |--- EnvironmentSatisfaction <= 1.50
|   |   |   |   |--- weights: [2.50, 5.50] class: 1
|   |   |   |--- EnvironmentSatisfaction >  1.50
|   |   |   |   |--- weights: [21.00, 11.50] class: 0
|   |--- JobLevel >  1.50
|   |   |--- JobRole <= 6.50
|   |   |   |--- weights: [53.00, 4.00] class: 0
|   |   |--- JobRole >  6.50
|   |   |   |--- DistanceFromHome <= 11.00
|   |   |   |   |--- StockOptionLevel <= 0.50
|   |   |   |   |   |--- weights: [5.00, 3.50] class: 0
|   |   |   |   |--- StockOptionLevel >  0.50
|   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |--- DistanceFromHome >  11.00
|   |   |   |   |--- weights: [4.00, 8.00] class: 1

In [64]:

# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )

print(
    pd.DataFrame(
        dtree_tuned.feature_importances_, columns=["Imp"], index=X_train.columns
    ).sort_values(by="Imp", ascending=False)
)

# Only the nine features actually used in a split receive a non-zero importance

                               Imp
OverTime                  0.244882
JobLevel                  0.234612
DistanceFromHome          0.112491
YearsInCurrentRole        0.090912
JobRole                   0.090539
TotalWorkingYears         0.086618
StockOptionLevel          0.050132
HourlyRate                0.048698
EnvironmentSatisfaction   0.041116
Age                       0.000000
PerformanceRating         0.000000
RelationshipSatisfaction  0.000000
TrainingTimesLastYear     0.000000
WorkLifeBalance           0.000000
YearsAtCompany            0.000000
YearsSinceLastPromotion   0.000000
PercentSalaryHike         0.000000
MonthlyIncome             0.000000
NumCompaniesWorked        0.000000
MonthlyRate               0.000000
BusinessTravel            0.000000
MaritalStatus             0.000000
JobSatisfaction           0.000000
JobInvolvement            0.000000
Gender                    0.000000
EducationField            0.000000
Education                 0.000000
Department                0.000000
DailyRate                 0.000000
YearsWithCurrManager      0.000000

In [65]:

importances = dtree_tuned.feature_importances_
indices = np.argsort(importances)

plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()

The top three feature importances are as follows. OverTime is the most important feature according to the model (0.245). JobLevel is the second most significant feature (0.235), which suggests that job level or rank within the organization plays an important role. DistanceFromHome is the third (0.112), which suggests that how far an employee lives from the workplace is influential. The model emphasizes features related to employment conditions, such as overtime work, job level, and commuting distance.

In [66]:

# training performance comparison

models_train_comp_df = pd.concat(
    [
        model_performance_classification(dTree, X_train, y_train).T,
        model_performance_classification(dTree_short, X_train, y_train).T,
        model_performance_classification(dtree_tuned, X_train, y_train).T,
    ],
    axis=1,
)
models_train_comp_df.columns = [
    "Decision Tree full",
    "Decision Tree short",
    "Decision tree tuned",
]
print("training performance comparison:")
models_train_comp_df

training performance comparison:

Out[66]:

	Decision Tree full	Decision Tree short	Decision tree tuned
Accuracy	1.0	0.864917	0.876579
Recall	1.0	0.162500	0.375000
Precision	1.0	0.838710	0.689655
F1	1.0	0.272251	0.485830

In [67]:

# test performance comparison

models_test_comp_df = pd.concat(
    [
        model_performance_classification(dTree, X_test, y_test).T,
        model_performance_classification(dTree_short, X_test, y_test).T,
        model_performance_classification(dtree_tuned, X_test, y_test).T,
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Decision Tree full",
    "Decision Tree short",
    "Decision Tree tuned",
]
print("test performance comparison:")
models_test_comp_df

test performance comparison:

Out[67]:

	Decision Tree full	Decision Tree short	Decision Tree tuned
Accuracy	0.773243	0.834467	0.843537
Recall	0.324675	0.116883	0.298701
Precision	0.342466	0.642857	0.605263
F1	0.333333	0.197802	0.400000

The tuned decision tree achieves the highest accuracy (0.844) and the best F1 score (0.400). The short decision tree has the highest precision (0.643) but the lowest recall and F1 score. The full decision tree has the highest recall (0.325) but the lowest accuracy and precision, and its perfect training scores indicate overfitting. The tuned decision tree strikes a balance between the full and short models, which suggests it achieves a more favorable equilibrium between precision and recall than the other models.
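The ranking per metric can be read off the test comparison table programmatically. The sketch below copies the table values into a plain dictionary (the `test_scores` name and structure are illustrative, not part of the notebook) and picks the best model for each metric.

```python
# Test-set metrics copied from the comparison table above
test_scores = {
    "Decision Tree full": {
        "Accuracy": 0.773243, "Recall": 0.324675,
        "Precision": 0.342466, "F1": 0.333333,
    },
    "Decision Tree short": {
        "Accuracy": 0.834467, "Recall": 0.116883,
        "Precision": 0.642857, "F1": 0.197802,
    },
    "Decision Tree tuned": {
        "Accuracy": 0.843537, "Recall": 0.298701,
        "Precision": 0.605263, "F1": 0.400000,
    },
}

metric_names = ["Accuracy", "Recall", "Precision", "F1"]

# For each metric, find the model with the highest test score
best_by_metric = {
    metric: max(test_scores, key=lambda name: test_scores[name][metric])
    for metric in metric_names
}

for metric, model_name in best_by_metric.items():
    print(f"{metric}: {model_name} ({test_scores[model_name][metric]:.3f})")
```

No single model wins every metric, which is why the F1 score, as a combination of precision and recall, is a sensible tie-breaker here.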

In [68]:

# Linear regression fitted on the binary (0/1) attrition target
linreg = LinearRegression()
linreg.fit(X_train, y_train)

Out[68]:

LinearRegression()

In [69]:

# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]
    k = predictors.shape[1]
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))

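# function to compute MAPE
# note: MAPE divides by the target, so it is not meaningful when the target contains zeros
# (this is why it shows up as inf in the outputs below)
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100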

# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    
    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )

    return df_perf

In [70]:

# checking model performance on train and test set

linreg_train_perf = model_performance_regression(
    linreg, X_train, y_train
)
print("Training Performance\n", linreg_train_perf)

linreg_test_perf = model_performance_regression(
    linreg, X_test, y_test
)
print("Testing Performance\n", linreg_test_perf)

Training Performance
        RMSE      MAE  R-squared  Adj. R-squared  MAPE
0  0.322452  0.23493   0.208192         0.18439   inf
Testing Performance
        RMSE      MAE  R-squared  Adj. R-squared  MAPE
0  0.338406  0.25468   0.205375        0.147232   inf

Performance is weak on both sets. RMSE and MAE are slightly lower on the training set than on the testing set, and R-squared is marginally higher (0.208 versus 0.205), so there is little sign of overfitting. However, R-squared remains consistently low, which indicates that the model is not performing effectively: it explains only about 20% of the variance in the dependent variable. MAPE is reported as inf because the target contains zeros.
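The adjusted R-squared values follow directly from R-squared through the formula used in `adj_r2_score`. A minimal sketch, assuming n = 441 test rows (the sum of the confusion-matrix counts reported for the test set) and k = 30 predictors (the number of rows in the feature-importance table). Both sizes are inferred from the outputs rather than stated in the notebook.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Assumed sizes, inferred from the outputs above (not stated explicitly)
n_test, k = 441, 30
r2_test = 0.205375  # test R-squared reported for the linear regression

adj_r2_test = adjusted_r2(r2_test, n_test, k)
print(round(adj_r2_test, 6))
```

Under these assumptions the result agrees with the reported test value of 0.147232. The penalty is noticeable here because 30 predictors is large relative to 441 rows.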

In [71]:

from sklearn.ensemble import BaggingRegressor

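# Initialize the Bagging Regressor
# (note: scikit-learn 1.2 renamed `base_estimator` to `estimator`; versions 1.4+ require the new name)
linreg_bagging_model = BaggingRegressor(base_estimator=LinearRegression(), n_estimators=50, random_state=42)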

# Fit the model
linreg_bagging_model.fit(X_train, y_train)

Out[71]:

BaggingRegressor(base_estimator=LinearRegression(), n_estimators=50,
                 random_state=42)

In [72]:

# checking model performance on train and test set

linreg_bagging_model_train_perf = model_performance_regression(
    linreg_bagging_model, X_train, y_train
)
print("Training Performance\n", linreg_bagging_model_train_perf)

linreg_bagging_model_test_perf = model_performance_regression(
    linreg_bagging_model, X_test, y_test
)

print("Testing Performance\n", linreg_bagging_model_test_perf)

Training Performance
        RMSE       MAE  R-squared  Adj. R-squared  MAPE
0  0.322652  0.236503   0.207209        0.183378   inf
Testing Performance
       RMSE       MAE  R-squared  Adj. R-squared  MAPE
0  0.33888  0.256413    0.20315        0.144844   inf

The bagging model for linear regression performs almost identically to the single linear regression. R-squared is low for both the training (0.207) and testing (0.203) sets, implying that the model does not explain a large portion of the variance in the dependent variable.

In [73]:

from sklearn.tree import DecisionTreeRegressor

# Initialize the Decision Tree Regressor
dt_regressor = DecisionTreeRegressor(random_state=0)

# Fit the model on the training data
dt_regressor.fit(X_train, y_train)

Out[73]:

DecisionTreeRegressor(random_state=0)

In [74]:

# checking model performance on train and test set

dt_regressor_train_perf = model_performance_regression(
    dt_regressor, X_train, y_train
)
print("Training Performance\n", dt_regressor_train_perf)

dt_regressor_test_perf = model_performance_regression(
    dt_regressor, X_test, y_test
)
print("Testing Performance\n", dt_regressor_test_perf)

Training Performance
    RMSE  MAE  R-squared  Adj. R-squared  MAPE
0   0.0  0.0        1.0             1.0   0.0
Testing Performance
       RMSE       MAE  R-squared  Adj. R-squared  MAPE
0  0.48795  0.238095  -0.652098       -0.772983   inf

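The model fits the training set perfectly (R-squared of 1.0, zero error) but performs very poorly on the test set, where R-squared is negative (-0.652). This gap indicates severe overfitting.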
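A negative R-squared means the model's squared error exceeds that of always predicting the mean of the target. The sketch below rebuilds the test R-squared and RMSE from counts. It assumes 441 test rows with 77 actual attrition cases (taken from the tuned tree's confusion matrix, assuming the same test split) and 105 wrong predictions (inferred from the reported MAE, 0.238095 × 441 ≈ 105). It also assumes the fully grown tree predicts exactly 0 or 1. None of these counts is printed by the notebook.

```python
# Assumed counts, inferred from the outputs above
n, positives, errors = 441, 77, 105

mean_y = positives / n

# Total sum of squares of a 0/1 target around its mean
ss_tot = positives * (1 - mean_y) ** 2 + (n - positives) * mean_y ** 2

# With 0/1 predictions, every wrong prediction adds a squared error of exactly 1
ss_res = errors

r2 = 1 - ss_res / ss_tot
rmse = (ss_res / n) ** 0.5

print(f"R-squared={r2:.6f} RMSE={rmse:.5f}")
```

Under these assumptions the numbers agree with the reported test values (R-squared of -0.652098, RMSE of 0.48795): 105 hard errors cost more squared error than predicting the attrition rate for every employee.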

In [75]:

# Initialize the Bagging Regressor 
dt_regressor_bagging_model = BaggingRegressor(base_estimator=DecisionTreeRegressor(max_depth=5), n_estimators=50, random_state=42)

# Fit the model
dt_regressor_bagging_model.fit(X_train, y_train)

Out[75]:

BaggingRegressor(base_estimator=DecisionTreeRegressor(max_depth=5),
                 n_estimators=50, random_state=42)

In [76]:

# checking model performance on train and test set

dt_regressor_bagging_train_perf = model_performance_regression(
    dt_regressor_bagging_model, X_train, y_train
)
print("Training Performance\n", dt_regressor_bagging_train_perf)

dt_regressor_bagging_test_perf = model_performance_regression(
    dt_regressor_bagging_model, X_test, y_test
)
print("Testing Performance\n", dt_regressor_bagging_test_perf)

Training Performance
        RMSE       MAE  R-squared  Adj. R-squared  MAPE
0  0.255089  0.166533   0.504465        0.489569   inf
Testing Performance
        RMSE       MAE  R-squared  Adj. R-squared  MAPE
0  0.339019  0.225483   0.202495        0.144141   inf

The bagging regressor model shows a reasonable fit to the training data (R-squared of 0.504) but a poor fit to the test data, with a significant drop in both R-squared (0.202) and adjusted R-squared (0.144).

In [77]:

# testing performance comparison

models_test_comp_df = pd.concat(
    [
        linreg_test_perf.T,
        linreg_bagging_model_test_perf.T,
        dt_regressor_test_perf.T,
        dt_regressor_bagging_test_perf.T,
        
    ],
    axis=1,
)
models_test_comp_df.columns = [
    "Linear Regression",
    "Linear regression with bagging",
    "Decision Tree Regressor",
    "DT Regressor with bagging",
    
]
print("Testing performance comparison:")
models_test_comp_df

Testing performance comparison:

Out[77]:

	Linear Regression	Linear regression with bagging	Decision Tree Regressor	DT Regressor with bagging
RMSE	0.338406	0.338880	0.487950	0.339019
MAE	0.254680	0.256413	0.238095	0.225483
R-squared	0.205375	0.203150	-0.652098	0.202495
Adj. R-squared	0.147232	0.144844	-0.772983	0.144141
MAPE	inf	inf	inf	inf

Both linear regression models show very similar performance across all metrics; bagging did not improve the prediction error or the variance explained. The Decision Tree Regressor has the highest RMSE and a negative R-squared value, which indicates that it is not suitable in its current form and is likely overfitting to the training data. Bagging improved the Decision Tree Regressor: compared with the non-bagged version, RMSE is lower and R-squared is higher, though still low.

The Decision Tree Regressor with Bagging is the best choice for the regression task: it has the lowest MAE, while its RMSE and R-squared are essentially tied with those of the linear models. The tuned Decision Tree is the most balanced for the classification task, maintaining a good balance between all the metrics without extreme trade-offs. The other models are weaker choices. The linear models, with and without bagging, have higher MAE. The full Decision Tree has lower precision and F1 score. The short Decision Tree has low recall because it misses many true positive cases. The logistic regression with a threshold of 0.62 results in very low recall, precision, and F1 score, indicating it is too conservative in predicting the positive class.
