Classification Exercise

profilenieyanan
ClassificationHomework.ipynb

{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Classification Exercise\n", "Use the 3 different classifiers introduced to predict whether a patient has breast cancer or not based on the characteristics of the tumor. The dataset is provided in the file: cancer.csv, which can be downloaded from Canvas. The file has information about the tumors listed below. It also has the diagnoses in the column named \"result\". Result has two classes: 0 - benign (not cancer) and 1 - malignant (cancer). Please use all the available features in your classifiers. \n", "1. mean radius\n", "1. mean texture\n", "1. mean perimeter\n", "1. mean smoothness\n", "1. mean compactness\n", "1. mean concavity\n", "1. mean concave points\n", "1. mean symmetry\n", "1. mean fractal dimension\n", "\n", "Please follow the steps as instructed in the video and the class. The markdown cells and comments in each cell provide details on what to do. You can open the example Classification notebook and place it side-by-side with your notebook when working on this exercise.\n", "\n", "<b>Be sure to run the first cell to import libraries before running any other cells.</b>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import needed libraries and set up the environment" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt \n", "import seaborn as sns\n", "np.set_printoptions(suppress=True) # suppress scientific notation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Import and Understand Data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Load the data in cancer.csv file (can be downloaded from Canvas) into a pandas dataframe\n", "\n", "# Print the number of rows and columns in the dataset\n", "\n", "# Print the names of the features and the top five rows of 
the data.\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Use Pandas functions to explore the data" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Get the data types of the columns\n", "\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Get the descriptive statistics of the data set\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get the proportion of the two classes in the result column\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### You can visually examine the data here if you want" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Preprocess data\n", "Since the dataset has no missing data, no text data, or other problems, we will skip this step" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Prepare Data for Training\n", "### Define Predictors and Target Variable" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Use all 9 features as predictors. 
Set X and y\n", "\n", "# print out the shape of X and y\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Split data into training and test sets" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# import the necessary scikit-learn function\n", "\n", "# Partition the data into training and test set with test size = 0.4\n", "# Set the random seed to be 1 and use stratified split\n", "\n", "# print out training and test size\n", "\n", "\n", "# Print out the proportion of classes in the training set\n", "\n", "# Print out the proportion of classes in the test set\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Standardize data" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# import necessary class from scikit-learn\n", "\n", "# Create the scaler\n", "\n", "# Calculate the mean and std using training data\n", "\n", "# transform the training and test data\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Training Models" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model 1: Fit the data to the logistic regression model \n", "Please note that the target variable of this problem has only 2 classes. You don't have to set the multi_class argument as in the example. You can try different settings of the following arguments:\n", "1. solver: different algorithms to use\n", "1. max_iter: number of iterations to attempt to find the solution\n", "1. C: a hyperparameter to control overfitting\n", "You should be able to get both training and test accuracy over 85% with different combinations of parameters." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# 1. import the classifier\n", "\n", "# 2. Create your model here. Use lbfgs as the default solving algorithm\n", "\n", "# 3. 
Train the model using the training data\n", "\n", "# 4 print out training and test accuracy\n", "\n", "# 5 Make prediction on the test set data\n", "\n", "# Get the predicted probabilities and print out the first 5 rows\n", "\n", "# Get the actual prediction\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model 2 - Fit the data to decision tree model\n", "1. Set the random_state=1 for reproducible results\n", "1. For purity measurement criterion, use gini\n", "1. Try different max_depth: 2, 3, 4, 5, 6 etc to find the best result. Consider whether larger or smaller depth values lead to overfitting." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# 1. import the classifier\n", "\n", "# 2. Create your model \n", "\n", "# 3. Train the model using the training data\n", "\n", "# 4 print out training and test accuracy\n", "\n", "# 5 Make prediction on the test set data\n", "\n", "# Get the actual prediction\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Model 3 - Fit the data to KNN model\n", "1. Try different numbers of neighbors using 2, 3, 4, 5, 6, etc. to find the model with the best performance" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "# 1. import the classifier\n", "\n", "# 2. Create your model \n", "\n", "# 3. Train the model using the training data\n", "\n", "# 4 print out training and test accuracy\n", "\n", "# 5 Make prediction on the test set data\n", "\n", "# Get the actual prediction\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 2 }