{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "<center>\n", "<h1> Assignment 2: Data Preprocessing</h1>\n", "<hr>\n", "<h2>UFO Sighting Data Exploration</h2>\n", "<hr>\n", "<h3> MCIS 6283-Machine Learning </h3>\n", "\n", "<h3><mark>Due date: Feb 16th, 2021 (Tuesday)</mark></h3>\n", "<h3>Total Points: 100</h3>\n", "\n", "<h4>Instructor: Dr Ahmad Al Shami</h4>\n", "<h4>Department of Math & Computer Science</h4>\n", "<h4>Southern Arkansas University</h4>\n", "\n", "</center>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Please put your name, student ID, date and time here (5 points)\n", "* Name: Koteru Divya\n", "* Student ID: xxxxx\n", "* Date: 2/14/2021\n", "* Time: 7.03pm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "* In this assignment, you will investigate UFO data over the last century to gain some insight.\n", "* Please use all the techniques we have learned in the class to preprocess/clean the dataset <p style=\"color:blue\"><b>ufo_sightings_large.csv</b></p>. \n", "* After the dataset is preprocessed, please split the dataset into training sets and test sets.\n", "* Fit KNN to the training sets. \n", "* Print the score of KNN on the test sets." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Import dataset \"ufo_sightings_large.csv\" in pandas (5 points)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "# Import pandas and NumPy \n", "import pandas as pd \n", "import numpy as np\n", "# Load the CSV file into a DataFrame \n", "ufo = pd.read_csv(\"ufo_sightings_large.csv\") " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. Checking column types & Converting Column types (10 points)\n", "Take a look at the UFO dataset's column types using the dtypes attribute. Please convert the column types to the proper types.\n", "For example, the date column can be transformed into the datetime type. 
\n", "That will make our feature engineering efforts easier later on." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "date object\n", "city object\n", "state object\n", "country object\n", "type object\n", "seconds float64\n", "length_of_time object\n", "desc object\n", "recorded object\n", "lat object\n", "long float64\n", "dtype: object\n", "seconds float64\n", "date datetime64[ns]\n", "dtype: object\n" ] } ], "source": [ "print(ufo.dtypes)\n", "\n", "# Change the type of seconds to float\n", "ufo['seconds'] = ufo['seconds'].astype(float)\n", "\n", "# Change the date column to type datetime\n", "ufo['date'] = pd.to_datetime(ufo['date'])\n", "\n", "# Check the column types\n", "print(ufo[['seconds', 'date']].dtypes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. Dropping missing data (10 points)\n", "Let's remove some of the rows where certain columns have missing values. " ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "length_of_time 143\n", "state 419\n", "type 159\n", "dtype: int64\n", "(4283, 11)\n" ] } ], "source": [ "print(ufo[['length_of_time', 'state', 'type']].isnull().sum())\n", "\n", "# Keep only rows where length_of_time, state, and type are not null\n", "ufo_no_missing = ufo[ufo['length_of_time'].notnull() &\n", " ufo['state'].notnull() & \n", " ufo['type'].notnull()]\n", "\n", "# Print out the shape of the new dataset\n", "print(ufo_no_missing.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 4. Extracting numbers from strings (10 points)\n", "The <b>length_of_time</b> column in the UFO dataset is a text field that has the number of \n", "minutes within the string. \n", "Here, you'll extract that number from that text field using regular expressions." 
] }, { "cell_type": "code", "execution_count": 83, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0 2 weeks\n", "1 30sec.\n", "2 NaN\n", "3 about 5 minutes\n", "4 2\n", " ... \n", "4930 about 5 seconds\n", "4931 25 seconds\n", "4932 early morning\n", "4933 2 hours\n", "4934 1 minutes\n", "Name: length_of_time, Length: 4935, dtype: object\n" ] } ], "source": [ "print(ufo.length_of_time)\n", "\n", "# Extract the first run of digits from each string with a regular expression;\n", "# rows without digits (or with missing values) become NaN\n", "ufo['minutes'] = ufo['length_of_time'].str.extract(r'(\\d+)', expand=False).astype(float)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 5. Identifying features for standardization (10 points)\n", "In this section, you'll investigate the variance of columns in the UFO dataset to \n", "determine which features should be standardized. You can log-normalize the high-variance column." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 6. Encoding categorical variables (20 points)\n", "There are a couple of columns in the UFO dataset that need to be encoded before they can be \n", "modeled through scikit-learn. \n", "You'll do that transformation here, <b>using both binary and one-hot encoding methods</b>." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 7. Text vectorization (10 points)\n", "Let's transform the <b>desc</b> column in the UFO dataset into tf-idf vectors, \n", "since there's likely something we can learn from this field." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 8. Selecting the ideal dataset (10 points)\n", "Let's get rid of some of the unnecessary features. " ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 9. 
Split X and y using train_test_split, setting stratify=y (5 points)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "# Separate the features from the target column\n", "X = ufo.drop([\"type\"], axis=1)\n", "y = ufo[\"type\"].astype(str)\n", "\n", "# Stratify on y so both splits keep the same class proportions\n", "train_X, test_X, train_y, test_y = train_test_split(X, y, stratify=y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 10. Fit KNN to the training sets and print the score of KNN on the test sets (5 points)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "# Note: X must contain only numeric columns (steps 5-8) before KNN can be fitted\n", "knn = KNeighborsClassifier(n_neighbors=5)\n", "# Fit knn to the training sets\n", "knn.fit(train_X, train_y)\n", "# Print the score of knn on the test sets\n", "print(knn.score(test_X, test_y))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 2 }