python knn data

skulcandy1190
KNNproject.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# CIS 4321\n", "# Dr. Mohammad Salehan\n", "## K-Nearest Neighbors Assignment\n", "In this assignment you will conduct KNN classification on a dataset. The data include customer demographic information (age, income, etc.), the customer’s relationship with the bank (mortgage, securities account, etc.), and the customer response to the last personal loan campaign (Personal Loan). Let's start by loading the dataset." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "from pathlib import Path\n", "\n", "import pandas as pd\n", "from sklearn import preprocessing\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.metrics import accuracy_score\n", "from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier\n", "import matplotlib.pylab as plt" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(5000, 14)" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_excel('UniversalBank.xlsx', 'Data')\n", "df.shape" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>ID</th>\n", " <th>Age</th>\n", " <th>Experience</th>\n", " <th>Income</th>\n", " <th>ZIP Code</th>\n", " <th>Family</th>\n", " <th>CCAvg</th>\n", " <th>Education</th>\n", " <th>Mortgage</th>\n", " <th>Personal Loan</th>\n", " <th>Securities Account</th>\n", " <th>CD Account</th>\n", " 
<th>Online</th>\n", " <th>CreditCard</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>1</td>\n", " <td>25</td>\n", " <td>1</td>\n", " <td>49</td>\n", " <td>91107</td>\n", " <td>4</td>\n", " <td>1.6</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>2</td>\n", " <td>45</td>\n", " <td>19</td>\n", " <td>34</td>\n", " <td>90089</td>\n", " <td>3</td>\n", " <td>1.5</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>3</td>\n", " <td>39</td>\n", " <td>15</td>\n", " <td>11</td>\n", " <td>94720</td>\n", " <td>1</td>\n", " <td>1.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>4</td>\n", " <td>35</td>\n", " <td>9</td>\n", " <td>100</td>\n", " <td>94112</td>\n", " <td>1</td>\n", " <td>2.7</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " ID Age Experience Income ZIP Code Family CCAvg Education Mortgage \\\n", "0 1 25 1 49 91107 4 1.6 1 0 \n", "1 2 45 19 34 90089 3 1.5 1 0 \n", "2 3 39 15 11 94720 1 1.0 1 0 \n", "3 4 35 9 100 94112 1 2.7 2 0 \n", "\n", " Personal Loan Securities Account CD Account Online CreditCard \n", "0 0 1 0 0 0 \n", "1 0 1 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check out the proportion of the two classes in the column used as label (i.e., Personal Loan). There is no need to conduct oversampling or undersampling in this assignment." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0 4520\n", "1 480\n", "Name: Personal Loan, dtype: int64" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['Personal Loan'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Select columns\n", "Exclude ID and ZIP Code columns." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "df1 = df[['Age','Experience','Income','Family','CCAvg','Education','Mortgage','Personal Loan','Securities Account','CD Account','Online','CreditCard']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing values\n", "Check missing values. Drop them if needed." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Age 0\n", "Experience 0\n", "Income 0\n", "Family 0\n", "CCAvg 0\n", "Education 0\n", "Mortgage 0\n", "Personal Loan 0\n", "Securities Account 0\n", "CD Account 0\n", "Online 0\n", "CreditCard 0\n", "dtype: int64" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df1.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Dummies\n", "Create dummies if any is needed." 
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/html": [ "<div>\n", "<style scoped>\n", " .dataframe tbody tr th:only-of-type {\n", " vertical-align: middle;\n", " }\n", "\n", " .dataframe tbody tr th {\n", " vertical-align: top;\n", " }\n", "\n", " .dataframe thead th {\n", " text-align: right;\n", " }\n", "</style>\n", "<table border=\"1\" class=\"dataframe\">\n", " <thead>\n", " <tr style=\"text-align: right;\">\n", " <th></th>\n", " <th>Age</th>\n", " <th>Experience</th>\n", " <th>Income</th>\n", " <th>Family</th>\n", " <th>CCAvg</th>\n", " <th>Education</th>\n", " <th>Mortgage</th>\n", " <th>Personal Loan</th>\n", " <th>Securities Account</th>\n", " <th>CD Account</th>\n", " <th>Online</th>\n", " <th>CreditCard</th>\n", " </tr>\n", " </thead>\n", " <tbody>\n", " <tr>\n", " <th>0</th>\n", " <td>25</td>\n", " <td>1</td>\n", " <td>49</td>\n", " <td>4</td>\n", " <td>1.6</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>1</th>\n", " <td>45</td>\n", " <td>19</td>\n", " <td>34</td>\n", " <td>3</td>\n", " <td>1.5</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>2</th>\n", " <td>39</td>\n", " <td>15</td>\n", " <td>11</td>\n", " <td>1</td>\n", " <td>1.0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>3</th>\n", " <td>35</td>\n", " <td>9</td>\n", " <td>100</td>\n", " <td>1</td>\n", " <td>2.7</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4</th>\n", " <td>35</td>\n", " <td>8</td>\n", " <td>45</td>\n", " <td>4</td>\n", " <td>1.0</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>0</td>\n", " 
<td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " </tr>\n", " <tr>\n", " <th>...</th>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " <td>...</td>\n", " </tr>\n", " <tr>\n", " <th>4995</th>\n", " <td>29</td>\n", " <td>3</td>\n", " <td>40</td>\n", " <td>1</td>\n", " <td>1.9</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4996</th>\n", " <td>30</td>\n", " <td>4</td>\n", " <td>15</td>\n", " <td>4</td>\n", " <td>0.4</td>\n", " <td>1</td>\n", " <td>85</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4997</th>\n", " <td>63</td>\n", " <td>39</td>\n", " <td>24</td>\n", " <td>2</td>\n", " <td>0.3</td>\n", " <td>3</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4998</th>\n", " <td>65</td>\n", " <td>40</td>\n", " <td>49</td>\n", " <td>3</td>\n", " <td>0.5</td>\n", " <td>2</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>0</td>\n", " </tr>\n", " <tr>\n", " <th>4999</th>\n", " <td>28</td>\n", " <td>4</td>\n", " <td>83</td>\n", " <td>3</td>\n", " <td>0.8</td>\n", " <td>1</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>0</td>\n", " <td>1</td>\n", " <td>1</td>\n", " </tr>\n", " </tbody>\n", "</table>\n", "<p>5000 rows × 12 columns</p>\n", "</div>" ], "text/plain": [ " Age Experience Income Family CCAvg Education Mortgage \\\n", "0 25 1 49 4 1.6 1 0 \n", "1 45 19 34 3 1.5 1 0 \n", "2 39 15 11 1 1.0 1 0 \n", "3 35 9 100 1 2.7 2 0 \n", "4 35 8 45 4 1.0 2 0 \n", "... ... ... ... ... ... ... ... 
\n", "4995 29 3 40 1 1.9 3 0 \n", "4996 30 4 15 4 0.4 1 85 \n", "4997 63 39 24 2 0.3 3 0 \n", "4998 65 40 49 3 0.5 2 0 \n", "4999 28 4 83 3 0.8 1 0 \n", "\n", " Personal Loan Securities Account CD Account Online CreditCard \n", "0 0 1 0 0 0 \n", "1 0 1 0 0 0 \n", "2 0 0 0 0 0 \n", "3 0 0 0 0 0 \n", "4 0 0 0 0 1 \n", "... ... ... ... ... ... \n", "4995 0 0 0 1 0 \n", "4996 0 0 0 1 0 \n", "4997 0 0 0 0 0 \n", "4998 0 0 0 1 0 \n", "4999 0 0 0 1 1 \n", "\n", "[5000 rows x 12 columns]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.get_dummies(df1, dummy_na=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Partitioning\n", "Partition the dataset into train and validation partitions. Use 40% for validation. There is no need to make up artificial records." ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(3000, 12) (2000, 12)\n" ] } ], "source": [ "\n", "trainData, validData = train_test_split(df1, test_size=0.4, random_state=5000)\n", "print(trainData.shape, validData.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocessing\n", "Conduct all required preprocessing which includes (1) selecting features and (2) normalization. Use 'Personal Loan' as label and the rest of the columns as predictors.<br>\n", "At the end of this cell you should have 2 variables named trainNorm and validNorm representing train and validation partitions.\n", "<br>\n", "Tip: create a list of features and name it features. Use it when needed instead of copy-pasting column names each time." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## More partitioning\n", "Create four variables, train_X, train_y, valid_X, and valid_y, representing the training features, training label, validation features, and validation label, respectively." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Run KNN\n", "Run KNN. Examine k values in the range 1 to 15. Remember that the end index of the range() function is excluded." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Select the best value for K\n", "Select the best value for K and write it below. Justify your selection." ] }, { "cell_type": "raw", "metadata": {}, "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.3" } }, "nbformat": 4, "nbformat_minor": 4 }
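
The preprocessing, partitioning, and KNN cells in the notebook above are empty. The following is a minimal sketch of the steps their markdown prompts describe, not the assignment's reference solution. It makes several assumptions: it uses only the Python standard library and a small synthetic dataset (the features `income` and `cc_avg` and the labelling rule are made up for illustration) instead of UniversalBank.xlsx; the helpers `fit_zscore`, `apply_zscore`, and `knn_predict` are hypothetical names; and "range 1 to 15" is read as k = 1 through 15 inclusive, hence `range(1, 16)`.

```python
import math
import random
from collections import Counter


def fit_zscore(rows):
    """Learn per-column mean and population std from the training rows only."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for a constant column to avoid division by zero.
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return means, stds


def apply_zscore(rows, means, stds):
    """Standardize rows with statistics learned elsewhere (the training set)."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def knn_predict(train_X, train_y, query, k):
    """Majority vote among the k training rows closest in Euclidean distance."""
    nearest = sorted(zip((math.dist(x, query) for x in train_X), train_y))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Synthetic stand-in data: two features on very different scales (assumption).
rng = random.Random(0)
records = []
for _ in range(300):
    income = rng.uniform(10, 200)
    cc_avg = rng.uniform(0, 10)
    label = 1 if income + 10 * cc_avg > 160 else 0  # made-up labelling rule
    records.append(([income, cc_avg], label))

# 60% train / 40% validation, mirroring test_size=0.4 in the notebook.
rng.shuffle(records)
cut = int(len(records) * 0.6)
train, valid = records[:cut], records[cut:]

# Normalize: fit on the training partition, then apply to both partitions.
means, stds = fit_zscore([x for x, _ in train])
train_X = apply_zscore([x for x, _ in train], means, stds)
valid_X = apply_zscore([x for x, _ in valid], means, stds)
train_y = [y for _, y in train]
valid_y = [y for _, y in valid]

# Evaluate k = 1..15; range() excludes its end index, so the end is 16.
accuracy_by_k = {}
for k in range(1, 16):
    preds = [knn_predict(train_X, train_y, x, k) for x in valid_X]
    accuracy_by_k[k] = sum(p == y for p, y in zip(preds, valid_y)) / len(valid_y)

# max() returns the first maximum, so ties go to the smallest k.
best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print(accuracy_by_k)
print("best k:", best_k)
```

In the notebook itself, the same steps would more naturally use the classes it already imports: `preprocessing.StandardScaler` fitted on the training partition only, and `KNeighborsClassifier(n_neighbors=k)` scored with `accuracy_score` inside the loop. Fitting the scaler on the training rows alone keeps validation data from leaking into the normalization statistics.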