Feature Selection

Task7_AutoFeatureSelector2.ipynb

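The notebook below leaves each selector body as a stub and then counts votes across methods. As a rough, dependency-free sketch of two of those pieces, the following shows a Pearson-correlation filter and the vote count. It is not the course's reference solution: `pearson`, `vote`, the dict-of-lists column layout, and the name-based tie-break are assumptions made for illustration, and the notebook itself would more likely operate on the pandas `X` and `y` directly.

```python
import math
from collections import Counter


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def cor_selector(columns, y, num_feats):
    """Keep the num_feats columns with the largest absolute correlation to y.

    columns is a dict mapping feature name -> list of numbers (a hypothetical
    stand-in for the notebook's DataFrame). Returns (support, selected): a list
    of booleans aligned with the column order, and the selected feature names.
    """
    names = list(columns)
    scores = {name: abs(pearson(columns[name], y)) for name in names}
    ranked = sorted(names, key=lambda name: scores[name], reverse=True)
    selected = ranked[:num_feats]
    chosen = set(selected)
    support = [name in chosen for name in names]
    return support, selected


def vote(feature_names, supports, top_n):
    """Count how many methods selected each feature and return the top_n names.

    supports maps a method label -> boolean mask aligned with feature_names.
    """
    totals = Counter()
    for support in supports.values():
        for name, picked in zip(feature_names, support):
            totals[name] += int(picked)
    # Most votes first; ties broken by name so the result is deterministic.
    ranked = sorted(feature_names, key=lambda name: (-totals[name], name))
    return ranked[:top_n]
```

Each selector in the notebook returns a boolean mask plus a list of names, so the masks can be passed to `vote` as, for example, `vote(feature_name, {'pearson': cor_support, 'rfe': rfe_support}, 5)`. Treating a constant column as having zero correlation is a design choice here; it keeps one-hot columns with no positive rows from producing a division by zero.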
{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Task 7: AutoFeatureSelector Tool\n", "## This task tests your understanding of the feature selection methods outlined in the lecture and your ability to apply them to a real-world dataset: select the best features and build an automated feature selection tool for your toolkit\n", "\n", "### Use your knowledge of different feature selection methods to build an automatic feature selection tool\n", "- Pearson Correlation\n", "- Chi-Square\n", "- RFE\n", "- Embedded\n", "- Tree (Random Forest)\n", "- Tree (LightGBM)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Dataset: FIFA 19 Player Skills\n", "#### Attributes: FIFA 2019 player attributes such as Age, Nationality, Overall, Potential, Club, Value, Wage, Preferred Foot, International Reputation, Weak Foot, Skill Moves, Work Rate, Position, Jersey Number, Joined, Loaned From, Contract Valid Until, Height, Weight, LS, ST, RS, LW, LF, CF, RF, RW, LAM, CAM, RAM, LM, LCM, CM, RCM, RM, LWB, LDM, CDM, RDM, RWB, LB, LCB, CB, RCB, RB, Crossing, Finishing, HeadingAccuracy, ShortPassing, Volleys, Dribbling, Curve, FKAccuracy, LongPassing, BallControl, Acceleration, SprintSpeed, Agility, Reactions, Balance, ShotPower, Jumping, Stamina, Strength, LongShots, Aggression, Interceptions, Positioning, Vision, Penalties, Composure, Marking, StandingTackle, SlidingTackle, GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes, and Release Clause." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd \n", "import seaborn as sns\n", "import matplotlib.pyplot as plt\n", "import scipy.stats as ss\n", "from collections import Counter\n", "import math\n", "from scipy import stats" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "player_df = pd.read_csv(\"data/fifa19.csv\")" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "numcols = ['Overall', 'Crossing','Finishing', 'ShortPassing', 'Dribbling','LongPassing', 'BallControl', 'Acceleration','SprintSpeed', 'Agility', 'Stamina','Volleys','FKAccuracy','Reactions','Balance','ShotPower','Strength','LongShots','Aggression','Interceptions']\n", "catcols = ['Preferred Foot','Position','Body Type','Nationality','Weak Foot']" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "player_df = player_df[numcols+catcols]" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "traindf = pd.concat([player_df[numcols], pd.get_dummies(player_df[catcols])],axis=1)\n", "features = traindf.columns\n", "\n", "traindf = traindf.dropna()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "traindf = pd.DataFrame(traindf,columns=features)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "y = traindf['Overall']>=87\n", "X = traindf.copy()\n", "del X['Overall']" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", 
"<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Crossing</th>\n", " <th>Finishing</th>\n", " <th>ShortPassing</th>\n", " <th>Dribbling</th>\n", " <th>LongPassing</th>\n", " <th>BallControl</th>\n", " <th>Acceleration</th>\n", " <th>SprintSpeed</th>\n", " <th>Agility</th>\n", " <th>Stamina</th>\n", " <th>...</th>\n", " <th>Nationality_Uganda</th>\n", " <th>Nationality_Ukraine</th>\n", " <th>Nationality_United Arab Emirates</th>\n", " <th>Nationality_United States</th>\n", " <th>Nationality_Uruguay</th>\n", " <th>Nationality_Uzbekistan</th>\n", " <th>Nationality_Venezuela</th>\n", " <th>Nationality_Wales</th>\n", " <th>Nationality_Zambia</th>\n", " <th>Nationality_Zimbabwe</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>84.0</td>\n", " <td>95.0</td>\n", " <td>90.0</td>\n", " <td>97.0</td>\n", " <td>87.0</td>\n", " <td>96.0</td>\n", " <td>91.0</td>\n", " <td>86.0</td>\n", " <td>91.0</td>\n", " <td>72.0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>84.0</td>\n", " <td>94.0</td>\n", " <td>81.0</td>\n", " <td>88.0</td>\n", " <td>77.0</td>\n", " <td>94.0</td>\n", " <td>89.0</td>\n", " <td>91.0</td>\n", " <td>87.0</td>\n", " <td>88.0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>79.0</td>\n", " <td>87.0</td>\n", " <td>84.0</td>\n", " <td>96.0</td>\n", " <td>78.0</td>\n", " <td>95.0</td>\n", " <td>94.0</td>\n", " <td>90.0</td>\n", " <td>96.0</td>\n", " <td>81.0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " 
<td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>17.0</td>\n", " <td>13.0</td>\n", " <td>50.0</td>\n", " <td>18.0</td>\n", " <td>51.0</td>\n", " <td>42.0</td>\n", " <td>57.0</td>\n", " <td>58.0</td>\n", " <td>60.0</td>\n", " <td>43.0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>93.0</td>\n", " <td>82.0</td>\n", " <td>92.0</td>\n", " <td>86.0</td>\n", " <td>91.0</td>\n", " <td>91.0</td>\n", " <td>78.0</td>\n", " <td>76.0</td>\n", " <td>79.0</td>\n", " <td>90.0</td>\n", " <td>...</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5 rows × 223 columns</p>\n", "</div>" ], "text/plain": [ " Crossing Finishing ShortPassing Dribbling LongPassing BallControl \\\n", "0 84.0 95.0 90.0 97.0 87.0 96.0 \n", "1 84.0 94.0 81.0 88.0 77.0 94.0 \n", "2 79.0 87.0 84.0 96.0 78.0 95.0 \n", "3 17.0 13.0 50.0 18.0 51.0 42.0 \n", "4 93.0 82.0 92.0 86.0 91.0 91.0 \n", "\n", " Acceleration SprintSpeed Agility Stamina ... Nationality_Uganda \\\n", "0 91.0 86.0 91.0 72.0 ... 0 \n", "1 89.0 91.0 87.0 88.0 ... 0 \n", "2 94.0 90.0 96.0 81.0 ... 0 \n", "3 57.0 58.0 60.0 43.0 ... 0 \n", "4 78.0 76.0 79.0 90.0 ... 
0 \n", "\n", " Nationality_Ukraine Nationality_United Arab Emirates \\\n", "0 0 0 \n", "1 0 0 \n", "2 0 0 \n", "3 0 0 \n", "4 0 0 \n", "\n", " Nationality_United States Nationality_Uruguay Nationality_Uzbekistan \\\n", "0 0 0 0 \n", "1 0 0 0 \n", "2 0 0 0 \n", "3 0 0 0 \n", "4 0 0 0 \n", "\n", " Nationality_Venezuela Nationality_Wales Nationality_Zambia \\\n", "0 0 0 0 \n", "1 0 0 0 \n", "2 0 0 0 \n", "3 0 0 0 \n", "4 0 0 0 \n", "\n", " Nationality_Zimbabwe \n", "0 0 \n", "1 0 \n", "2 0 \n", "3 0 \n", "4 0 \n", "\n", "[5 rows x 223 columns]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.head()" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "223" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "len(X.columns)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Set some fixed set of features" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "feature_name = list(X.columns)\n", "# no of maximum features we need to select\n", "num_feats=30" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filter Feature Selection - Pearson Correlation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Pearson Correlation function" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "def cor_selector(X, y,num_feats):\n", " # Your code goes here (Multiple lines)\n", " \n", " # Your code ends here\n", " return cor_support, cor_feature" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "30 selected features\n" ] } ], "source": [ "cor_support, cor_feature = cor_selector(X, y,num_feats)\n", "print(str(len(cor_feature)), 'selected features')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List the selected 
features from Pearson Correlation" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Nationality_Costa Rica',\n", " 'Position_LAM',\n", " 'Nationality_Uruguay',\n", " 'Acceleration',\n", " 'SprintSpeed',\n", " 'Strength',\n", " 'Nationality_Gabon',\n", " 'Nationality_Slovenia',\n", " 'Stamina',\n", " 'Weak Foot',\n", " 'Agility',\n", " 'Crossing',\n", " 'Nationality_Belgium',\n", " 'Dribbling',\n", " 'ShotPower',\n", " 'LongShots',\n", " 'Finishing',\n", " 'BallControl',\n", " 'FKAccuracy',\n", " 'LongPassing',\n", " 'Volleys',\n", " 'ShortPassing',\n", " 'Position_RF',\n", " 'Position_LF',\n", " 'Body Type_PLAYER_BODY_TYPE_25',\n", " 'Body Type_Courtois',\n", " 'Body Type_Neymar',\n", " 'Body Type_Messi',\n", " 'Body Type_C. Ronaldo',\n", " 'Reactions']" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "cor_feature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filter Feature Selection - Chi-Square" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_selection import SelectKBest\n", "from sklearn.feature_selection import chi2\n", "from sklearn.preprocessing import MinMaxScaler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Chi-Squared Selector function" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "def chi_squared_selector(X, y, num_feats):\n", " # Your code goes here (Multiple lines)\n", " \n", " # Your code ends here\n", " return chi_support, chi_feature" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "30 selected features\n" ] } ], "source": [ "chi_support, chi_feature = chi_squared_selector(X, y,num_feats)\n", "print(str(len(chi_feature)), 'selected features')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ 
"### List the selected features from Chi-Square " ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Finishing',\n", " 'ShortPassing',\n", " 'LongPassing',\n", " 'BallControl',\n", " 'Volleys',\n", " 'FKAccuracy',\n", " 'Reactions',\n", " 'LongShots',\n", " 'Position_CM',\n", " 'Position_LAM',\n", " 'Position_LF',\n", " 'Position_LW',\n", " 'Position_RB',\n", " 'Position_RF',\n", " 'Body Type_C. Ronaldo',\n", " 'Body Type_Courtois',\n", " 'Body Type_Messi',\n", " 'Body Type_Neymar',\n", " 'Body Type_PLAYER_BODY_TYPE_25',\n", " 'Nationality_Belgium',\n", " 'Nationality_Costa Rica',\n", " 'Nationality_Croatia',\n", " 'Nationality_Egypt',\n", " 'Nationality_England',\n", " 'Nationality_France',\n", " 'Nationality_Gabon',\n", " 'Nationality_Slovakia',\n", " 'Nationality_Slovenia',\n", " 'Nationality_Spain',\n", " 'Nationality_Uruguay']" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chi_feature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Wrapper Feature Selection - Recursive Feature Elimination" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_selection import RFE\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.preprocessing import MinMaxScaler" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### RFE Selector function" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "def rfe_selector(X, y, num_feats):\n", " # Your code goes here (Multiple lines)\n", " \n", " # Your code ends here\n", " return rfe_support, rfe_feature" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting estimator with 223 features.\n", "Fitting estimator with 213 features.\n", "Fitting estimator with 203 features.\n", "Fitting estimator 
with 193 features.\n", "Fitting estimator with 183 features.\n", "Fitting estimator with 173 features.\n", "Fitting estimator with 163 features.\n", "Fitting estimator with 153 features.\n", "Fitting estimator with 143 features.\n", "Fitting estimator with 133 features.\n", "Fitting estimator with 123 features.\n", "Fitting estimator with 113 features.\n", "Fitting estimator with 103 features.\n", "Fitting estimator with 93 features.\n", "Fitting estimator with 83 features.\n", "Fitting estimator with 73 features.\n", "Fitting estimator with 63 features.\n", "Fitting estimator with 53 features.\n", "Fitting estimator with 43 features.\n", "Fitting estimator with 33 features.\n", "30 selected features\n" ] } ], "source": [ "rfe_support, rfe_feature = rfe_selector(X, y,num_feats)\n", "print(str(len(rfe_feature)), 'selected features')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### List the selected features from RFE" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Finishing',\n", " 'ShortPassing',\n", " 'LongPassing',\n", " 'BallControl',\n", " 'SprintSpeed',\n", " 'Agility',\n", " 'Volleys',\n", " 'FKAccuracy',\n", " 'Reactions',\n", " 'Strength',\n", " 'Weak Foot',\n", " 'Position_CAM',\n", " 'Position_CM',\n", " 'Position_GK',\n", " 'Position_LCB',\n", " 'Position_LM',\n", " 'Position_RB',\n", " 'Position_RCB',\n", " 'Position_RF',\n", " 'Position_RM',\n", " 'Position_RW',\n", " 'Body Type_Courtois',\n", " 'Body Type_PLAYER_BODY_TYPE_25',\n", " 'Nationality_Belgium',\n", " 'Nationality_Costa Rica',\n", " 'Nationality_Croatia',\n", " 'Nationality_Gabon',\n", " 'Nationality_Netherlands',\n", " 'Nationality_Slovenia',\n", " 'Nationality_Uruguay']" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "rfe_feature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Embedded Selection - Lasso: SelectFromModel" ] }, { "cell_type": "code", 
"execution_count": 27, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_selection import SelectFromModel\n", "from sklearn.linear_model import LogisticRegression\n", "from sklearn.preprocessing import MinMaxScaler" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [], "source": [ "def embedded_log_reg_selector(X, y, num_feats):\n", " # Your code goes here (Multiple lines)\n", " \n", " # Your code ends here\n", " return embedded_lr_support, embedded_lr_feature" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "30 selected features\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/subashgandyer/opt/anaconda3/envs/testing/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):\n", "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n", "\n", "Increase the number of iterations (max_iter) or scale the data as shown in:\n", " https://scikit-learn.org/stable/modules/preprocessing.html\n", "Please also refer to the documentation for alternative solver options:\n", " https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n", " n_iter_i = _check_optimize_result(\n" ] } ], "source": [ "embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)\n", "print(str(len(embedded_lr_feature)), 'selected features')" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Reactions',\n", " 'Balance',\n", " 'Strength',\n", " 'Weak Foot',\n", " 'Preferred Foot_Left',\n", " 'Preferred Foot_Right',\n", " 'Position_CAM',\n", " 'Position_CM',\n", " 'Position_GK',\n", " 'Position_LCB',\n", " 'Position_LF',\n", " 'Position_LM',\n", " 'Position_LW',\n", " 'Position_RB',\n", " 'Position_RCB',\n", " 'Position_RF',\n", " 'Position_RM',\n", " 'Position_RW',\n", " 'Position_ST',\n", " 
'Body Type_Lean',\n", " 'Nationality_Argentina',\n", " 'Nationality_Belgium',\n", " 'Nationality_Brazil',\n", " 'Nationality_England',\n", " 'Nationality_France',\n", " 'Nationality_Korea Republic',\n", " 'Nationality_Netherlands',\n", " 'Nationality_Slovenia',\n", " 'Nationality_Spain',\n", " 'Nationality_Uruguay']" ] }, "execution_count": 40, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embedded_lr_feature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tree-based (Random Forest): SelectFromModel" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_selection import SelectFromModel\n", "from sklearn.ensemble import RandomForestClassifier" ] }, { "cell_type": "code", "execution_count": 86, "metadata": {}, "outputs": [], "source": [ "def embedded_rf_selector(X, y, num_feats):\n", " # Your code goes here (Multiple lines)\n", " \n", " # Your code ends here\n", " return embedded_rf_support, embedded_rf_feature" ] }, { "cell_type": "code", "execution_count": 87, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "24 selected features\n" ] } ], "source": [ "embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)\n", "print(str(len(embedded_rf_feature)), 'selected features')" ] }, { "cell_type": "code", "execution_count": 88, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Crossing',\n", " 'Finishing',\n", " 'ShortPassing',\n", " 'Dribbling',\n", " 'LongPassing',\n", " 'BallControl',\n", " 'Acceleration',\n", " 'SprintSpeed',\n", " 'Agility',\n", " 'Stamina',\n", " 'Volleys',\n", " 'FKAccuracy',\n", " 'Reactions',\n", " 'Balance',\n", " 'ShotPower',\n", " 'Strength',\n", " 'LongShots',\n", " 'Aggression',\n", " 'Interceptions',\n", " 'Weak Foot',\n", " 'Body Type_Courtois',\n", " 'Body Type_Normal',\n", " 'Nationality_Belgium',\n", " 'Nationality_Slovenia']" ] }, "execution_count": 88, "metadata": {}, 
"output_type": "execute_result" } ], "source": [ "embedded_rf_feature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Tree-based (LightGBM): SelectFromModel" ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [], "source": [ "from sklearn.feature_selection import SelectFromModel\n", "from lightgbm import LGBMClassifier" ] }, { "cell_type": "code", "execution_count": 90, "metadata": {}, "outputs": [], "source": [ "def embedded_lgbm_selector(X, y, num_feats):\n", " # Your code goes here (Multiple lines)\n", " \n", " # Your code ends here\n", " return embedded_lgbm_support, embedded_lgbm_feature" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "30 selected features\n" ] } ], "source": [ "embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)\n", "print(str(len(embedded_lgbm_feature)), 'selected features')" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "['Crossing',\n", " 'Finishing',\n", " 'ShortPassing',\n", " 'Dribbling',\n", " 'LongPassing',\n", " 'BallControl',\n", " 'Acceleration',\n", " 'SprintSpeed',\n", " 'Agility',\n", " 'Stamina',\n", " 'Volleys',\n", " 'FKAccuracy',\n", " 'Reactions',\n", " 'Balance',\n", " 'ShotPower',\n", " 'Strength',\n", " 'LongShots',\n", " 'Aggression',\n", " 'Interceptions',\n", " 'Weak Foot',\n", " 'Preferred Foot_Left',\n", " 'Preferred Foot_Right',\n", " 'Position_CAM',\n", " 'Position_CB',\n", " 'Position_CDM',\n", " 'Position_CF',\n", " 'Position_CM',\n", " 'Position_GK',\n", " 'Position_LAM',\n", " 'Position_LB']" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" } ], "source": [ "embedded_lgbm_feature" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Putting all of it together: AutoFeatureSelector Tool" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, 
"outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Feature</th>\n", " <th>Pearson</th>\n", " <th>Chi-2</th>\n", " <th>RFE</th>\n", " <th>Logistics</th>\n", " <th>Random Forest</th>\n", " <th>LightGBM</th>\n", " <th>Total</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>1</th>\n", " <td>Reactions</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>6</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>Weak Foot</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>Volleys</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>Strength</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>5</th>\n", " <td>ShortPassing</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>6</th>\n", " <td>Nationality_Slovenia</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>7</th>\n", " <td>Nationality_Belgium</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", 
" <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>8</th>\n", " <td>LongPassing</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>9</th>\n", " <td>Finishing</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>10</th>\n", " <td>FKAccuracy</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>11</th>\n", " <td>BallControl</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>5</td>\n", " </tr>\n", " <tr>\n", " <th>12</th>\n", " <td>SprintSpeed</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>13</th>\n", " <td>Position_RF</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>14</th>\n", " <td>Position_CM</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>15</th>\n", " <td>Nationality_Uruguay</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>16</th>\n", " <td>LongShots</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>17</th>\n", " <td>Body 
Type_Courtois</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>18</th>\n", " <td>Agility</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>4</td>\n", " </tr>\n", " <tr>\n", " <th>19</th>\n", " <td>Stamina</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>20</th>\n", " <td>ShotPower</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>21</th>\n", " <td>Position_RB</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>22</th>\n", " <td>Position_LF</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>23</th>\n", " <td>Position_LAM</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>24</th>\n", " <td>Position_GK</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>25</th>\n", " <td>Position_CAM</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>26</th>\n", " <td>Nationality_Gabon</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " 
<td>False</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>27</th>\n", " <td>Nationality_Costa Rica</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>28</th>\n", " <td>Dribbling</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>29</th>\n", " <td>Crossing</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>3</td>\n", " </tr>\n", " <tr>\n", " <th>30</th>\n", " <td>Body Type_PLAYER_BODY_TYPE_25</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>True</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>False</td>\n", " <td>3</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " Feature Pearson Chi-2 RFE Logistics \\\n", "1 Reactions True True True True \n", "2 Weak Foot True False True True \n", "3 Volleys True True True False \n", "4 Strength True False True True \n", "5 ShortPassing True True True False \n", "6 Nationality_Slovenia True True True True \n", "7 Nationality_Belgium True True True True \n", "8 LongPassing True True True False \n", "9 Finishing True True True False \n", "10 FKAccuracy True True True False \n", "11 BallControl True True True False \n", "12 SprintSpeed True False True False \n", "13 Position_RF True True True True \n", "14 Position_CM False True True True \n", "15 Nationality_Uruguay True True True True \n", "16 LongShots True True False False \n", "17 Body Type_Courtois True True True False \n", "18 Agility True False True False \n", "19 Stamina True False False False \n", "20 ShotPower True False False False \n", "21 Position_RB False True True True \n", "22 Position_LF True True False True \n", "23 Position_LAM True True False False \n", "24 Position_GK False 
False True True \n", "25 Position_CAM False False True True \n", "26 Nationality_Gabon True True True False \n", "27 Nationality_Costa Rica True True True False \n", "28 Dribbling True False False False \n", "29 Crossing True False False False \n", "30 Body Type_PLAYER_BODY_TYPE_25 True True True False \n", "\n", " Random Forest LightGBM Total \n", "1 True True 6 \n", "2 True True 5 \n", "3 True True 5 \n", "4 True True 5 \n", "5 True True 5 \n", "6 True False 5 \n", "7 True False 5 \n", "8 True True 5 \n", "9 True True 5 \n", "10 True True 5 \n", "11 True True 5 \n", "12 True True 4 \n", "13 False False 4 \n", "14 False True 4 \n", "15 False False 4 \n", "16 True True 4 \n", "17 True False 4 \n", "18 True True 4 \n", "19 True True 3 \n", "20 True True 3 \n", "21 False False 3 \n", "22 False False 3 \n", "23 False True 3 \n", "24 False True 3 \n", "25 False True 3 \n", "26 False False 3 \n", "27 False False 3 \n", "28 True True 3 \n", "29 True True 3 \n", "30 False False 3 " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.set_option('display.max_rows', None)\n", "# put the results of every selection method together\n", "feature_selection_df = pd.DataFrame({'Feature':feature_name, 'Pearson':cor_support, 'Chi-2':chi_support, 'RFE':rfe_support, 'Logistics':embedded_lr_support,\n", " 'Random Forest':embedded_rf_support, 'LightGBM':embedded_lgbm_support})\n", "# count how many methods selected each feature (the non-numeric 'Feature' column is excluded)\n", "feature_selection_df['Total'] = feature_selection_df.drop(columns='Feature').sum(axis=1)\n", "# display the top num_feats features\n", "feature_selection_df = feature_selection_df.sort_values(['Total','Feature'] , ascending=False)\n", "feature_selection_df.index = range(1, len(feature_selection_df)+1)\n", "feature_selection_df.head(num_feats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Can you build a Python script that takes a dataset and a list of the feature selection methods you want to try, and outputs the best (maximum votes) features from all 
methods?" ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [], "source": [ "def preprocess_dataset(dataset_path):\n", " # Your code starts here (Multiple lines)\n", " \n", " # Your code ends here\n", " return X, y, num_feats" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "def autoFeatureSelector(dataset_path, methods=()):\n", " # Parameters\n", " # dataset_path - path to the dataset to be analyzed (csv file)\n", " # methods - names of the feature selection methods outlined above that should be run (list)\n", " \n", " # preprocessing\n", " X, y, num_feats = preprocess_dataset(dataset_path)\n", " \n", " # Run each method named in the methods list and collect the features it selects\n", " if 'pearson' in methods:\n", " cor_support, cor_feature = cor_selector(X, y,num_feats)\n", " if 'chi-square' in methods:\n", " chi_support, chi_feature = chi_squared_selector(X, y,num_feats)\n", " if 'rfe' in methods:\n", " rfe_support, rfe_feature = rfe_selector(X, y,num_feats)\n", " if 'log-reg' in methods:\n", " embedded_lr_support, embedded_lr_feature = embedded_log_reg_selector(X, y, num_feats)\n", " if 'rf' in methods:\n", " embedded_rf_support, embedded_rf_feature = embedded_rf_selector(X, y, num_feats)\n", " if 'lgbm' in methods:\n", " embedded_lgbm_support, embedded_lgbm_feature = embedded_lgbm_selector(X, y, num_feats)\n", " \n", " \n", " # Combine the feature lists above and keep the features selected by the most methods\n", " #### Your Code starts here (Multiple lines)\n", " \n", " #### Your Code ends here\n", " return best_features" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fitting estimator with 223 features.\n", "Fitting estimator with 213 features.\n", "Fitting estimator with 203 features.\n", "Fitting estimator with 193 features.\n", "Fitting estimator with 183 
features.\n", "Fitting estimator with 173 features.\n", "Fitting estimator with 163 features.\n", "Fitting estimator with 153 features.\n", "Fitting estimator with 143 features.\n", "Fitting estimator with 133 features.\n", "Fitting estimator with 123 features.\n", "Fitting estimator with 113 features.\n", "Fitting estimator with 103 features.\n", "Fitting estimator with 93 features.\n", "Fitting estimator with 83 features.\n", "Fitting estimator with 73 features.\n", "Fitting estimator with 63 features.\n", "Fitting estimator with 53 features.\n", "Fitting estimator with 43 features.\n", "Fitting estimator with 33 features.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/subashgandyer/opt/anaconda3/envs/testing/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:762: ConvergenceWarning: lbfgs failed to converge (status=1):\n", "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n", "\n", "Increase the number of iterations (max_iter) or scale the data as shown in:\n", " https://scikit-learn.org/stable/modules/preprocessing.html\n", "Please also refer to the documentation for alternative solver options:\n", " https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n", " n_iter_i = _check_optimize_result(\n" ] }, { "data": { "text/plain": [ "['Reactions', 'Weak Foot', 'Volleys', 'Strength', 'ShortPassing']" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "best_features = autoFeatureSelector(dataset_path=\"data/fifa19.csv\", methods=['pearson', 'chi-square', 'rfe', 'log-reg', 'rf', 'lgbm'])\n", "best_features" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Finally, can you turn this notebook into a Python script, run it, and submit the Python (.py) file that takes a dataset and a list of methods as inputs and outputs the best features?" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { 
"display_name": "testing", "language": "python", "name": "testing" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }