CSCI 222

C++

Lab #4

For this laboratory assignment, you will work in groups of 1-3 to build a basic spam email filter
based on machine learning using C++. No worries, I won’t ask you to implement the machine
learning code! You will be using an existing open-source machine learning library.

Your code will read in an email message in some standard format (we will determine that
standard) and will classify that email as spam or non-spam. There are databases containing
example spam and non-spam emails that you can download and use to train your machine
learning filter to make this determination. More information about that database will be
provided.

The basic steps in determining whether a text file containing an email is spam or non-spam are
as follows:

1. Prepare the text data by reading in the file
2. Create the word dictionary to count the words and their frequency
3. Extract the features from the dictionary created and prepare the data to input into the
   machine learning function (whether it is for training or for testing)
4. Pick a machine learning algorithm to train on your features from step 3
5. Finally, test your code using sample emails to see how well your software works.

Step 0: Download the Ling-spam corpus data

http://www2.aueb.gr/users/ion/data/enron-spam/

Step 1: Prepare the data

Divide the data into a training set and a test set (a 4:1 ratio is a reasonable choice), where the
training data contains an equal number of spam and non-spam emails and the test data also
contains an equal number of spam and non-spam emails.
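The 4:1 split can be sketched as follows. This is only an illustration, assuming the file names of one class (spam or non-spam) have already been collected into a vector; the names `Split` and `splitDataset` are hypothetical and not part of the assignment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical container for the file names of one class after splitting.
struct Split {
    std::vector<std::string> train;
    std::vector<std::string> test;
};

// Shuffle the file names of one class, then put the first `trainFraction`
// of them into the training set and the remainder into the test set.
Split splitDataset(std::vector<std::string> files, double trainFraction,
                   unsigned seed) {
    assert(trainFraction >= 0.0 && trainFraction <= 1.0);
    std::mt19937 rng(seed);  // a fixed seed keeps the split reproducible
    std::shuffle(files.begin(), files.end(), rng);

    const std::size_t nTrain =
        static_cast<std::size_t>(files.size() * trainFraction);
    const auto middle = files.begin() + static_cast<std::ptrdiff_t>(nTrain);

    Split result;
    result.train.assign(files.begin(), middle);
    result.test.assign(middle, files.end());
    return result;
}
```

Calling `splitDataset(spamFiles, 0.8, seed)` and `splitDataset(hamFiles, 0.8, seed)` on two lists of the same length would give training and test sets with equal numbers of spam and non-spam emails.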

In any text mining problem, text cleaning is the first step: we remove from the document those
words that may not contribute to the information we want to extract. Emails may contain a lot
of undesirable content such as punctuation marks, stop words, digits, etc., which may not be
helpful in detecting spam. The emails in the Ling-spam corpus have already been
preprocessed in the following ways:

a) Removal of stop words – Stop words like “and”, “the”, “of”, etc. are very common in all
English sentences and are not very meaningful in deciding spam or legitimate status, so these
words have been removed from the emails.

b) Lemmatization – This is the process of grouping together the different inflected forms of a
word so they can be analyzed as a single item. For example, “include”, “includes”, and
“included” would all be represented as “include”. Lemmatization takes the context of the word
into account, as opposed to stemming (another buzzword in text mining), which does not
consider the meaning of the sentence.

We still need to remove non-words such as punctuation marks and special characters from the
mail documents. There are several ways to do it. Here, we will remove such words after creating
a dictionary, which is convenient because once you have a dictionary, you need to remove each
such word only once. This has already been done for you!
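A minimal sketch of removing non-words from a dictionary is shown below. It assumes the dictionary is a `std::map` from token to count; the function names `isWord` and `removeNonWords` are hypothetical, and dropping one-letter tokens is an extra assumption rather than a rule stated in this lab.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <map>
#include <string>

// True when the token has at least two characters and every character is
// alphabetic according to std::isalpha (plain letters in the default locale).
// Rejecting one-letter tokens is an assumption made for this sketch.
bool isWord(const std::string& token) {
    if (token.size() < 2) return false;
    return std::all_of(token.begin(), token.end(), [](unsigned char c) {
        return std::isalpha(c) != 0;
    });
}

// Erase every dictionary entry whose key is not a word.
void removeNonWords(std::map<std::string, int>& dictionary) {
    for (auto it = dictionary.begin(); it != dictionary.end();) {
        if (isWord(it->first)) {
            ++it;
        } else {
            it = dictionary.erase(it);  // erase returns the next valid iterator
        }
    }
}
```

Because each distinct token appears in the dictionary once, every non-word is checked and removed only once, no matter how often it occurs in the emails.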

Step 2: Create the dictionary

A sample email in the data-set looks like this:

Subject: posting

hi , ' m work phonetics project modern irish ' m hard source . anyone recommend book article english ? ' , specifically interest palatal ( slender ) consonant , work helpful too . thank ! laurel sutton ( sutton @ garnet . berkeley . edu

It can be seen that the first line of the mail is the subject and the 3rd line contains the body of
the email. We will only perform text analytics on the body to detect spam. As a first step, we
need to create a dictionary of words and their frequencies. Once the dictionary is created, we
can add just a few lines of code to remove the non-words discussed in Step 1.
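Building the dictionary can be sketched like this. The sketch assumes the format shown above (subject on the first line, tokens separated by spaces) and uses a hypothetical function name, `addEmailToDictionary`; it takes any input stream so the same code works for files and for in-memory strings.

```cpp
#include <cassert>
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Count how often each whitespace-separated token occurs in the body of one
// email and add the counts to `dictionary`. The first line (the subject) is
// skipped because only the body is analyzed.
void addEmailToDictionary(std::istream& email,
                          std::map<std::string, int>& dictionary) {
    std::string line;
    std::getline(email, line);  // discard the "Subject: ..." line

    while (std::getline(email, line)) {
        std::istringstream tokens(line);
        std::string word;
        while (tokens >> word) {
            ++dictionary[word];  // operator[] starts new entries at 0
        }
    }
}
```

Opening each training file with `std::ifstream` and passing it to this function with the same map would accumulate the counts over the whole training set.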

Step 3: Extract the Features

Once the dictionary is ready, we can extract a word count vector (our feature here) of 3000
dimensions for each email in the training set. Each word count vector contains the frequencies
of the 3000 dictionary words in that training file. Most of them will be zero. Let us take a
smaller example. Suppose we have 500 words in our dictionary, so each word count vector
contains the frequencies of the 500 dictionary words in the training file. Suppose the text in a
training file was “Get the work done, work done”; then it will be encoded as
[0,0,0,0,0,...,0,0,2,0,0,0,...,0,0,1,0,0,...,0,0,1,0,0,...,2,0,0,0,0,0]. Here, the nonzero word
counts are placed at the 296th, 359th, 415th, and 495th indices of the 500-length word count
vector and the rest are zero.
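The feature extraction can be sketched in two small functions. Both names (`makeVocabulary`, `toFeatureVector`) are hypothetical, and choosing the most frequent words as the fixed-size vocabulary is an assumption made for this sketch; the lab only fixes the vector length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;

// Keep the `maxWords` most frequent dictionary words and give each a fixed
// index. std::map iterates alphabetically and the sort is stable, so ties
// are broken alphabetically and the result is deterministic.
std::unordered_map<std::string, std::size_t> makeVocabulary(
    const std::map<std::string, int>& dictionary, std::size_t maxWords) {
    std::vector<Entry> entries(dictionary.begin(), dictionary.end());
    std::stable_sort(entries.begin(), entries.end(),
                     [](const Entry& a, const Entry& b) {
                         return a.second > b.second;
                     });
    if (entries.size() > maxWords) entries.resize(maxWords);

    std::unordered_map<std::string, std::size_t> index;
    for (std::size_t i = 0; i < entries.size(); ++i) {
        index[entries[i].first] = i;
    }
    return index;
}

// Build the word count vector of one email body: element i holds how many
// times vocabulary word i occurs. Words outside the vocabulary are ignored.
std::vector<int> toFeatureVector(
    const std::string& body,
    const std::unordered_map<std::string, std::size_t>& vocabulary) {
    std::vector<int> features(vocabulary.size(), 0);
    std::istringstream tokens(body);
    std::string word;
    while (tokens >> word) {
        const auto it = vocabulary.find(word);
        if (it != vocabulary.end()) ++features[it->second];
    }
    return features;
}
```

With `maxWords` set to 3000, every email maps to a vector of the same length, which is what the training and testing code expects as input.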

Step 4: Train the Classifiers

There are many different machine learning algorithms. We will use an industry-standard
machine learning library called PyTorch:

https://pytorch.org/

For this lab assignment, you will use a method called Support Vector Machines (SVM). SVMs
are supervised binary classifiers that are very effective when you have a large number of
features. The goal of an SVM is to find a hyperplane that separates the two classes of training
data; the subset of training points closest to that boundary are called the support vectors. The
decision function of the SVM model, which predicts the class of the test data, is based on the
support vectors and can make use of the kernel trick.
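To make the idea of a separating hyperplane concrete, here is a small stand-alone sketch of a linear SVM trained by stochastic subgradient descent on the hinge loss. It is not the PyTorch API and not the method you are required to use; it uses no kernel, and the names `LinearSvm` and `trainLinearSvm`, the label convention (+1 spam, -1 non-spam), and the hyperparameters are all assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical linear SVM: the decision value is w . x + b.
struct LinearSvm {
    std::vector<double> w;
    double b = 0.0;

    double decision(const std::vector<double>& x) const {
        double sum = b;
        for (std::size_t j = 0; j < w.size(); ++j) sum += w[j] * x[j];
        return sum;
    }

    // +1 = spam, -1 = non-spam (label convention assumed for this sketch).
    int predict(const std::vector<double>& x) const {
        return decision(x) >= 0.0 ? 1 : -1;
    }
};

// Stochastic subgradient descent on the regularized hinge loss
//   lambda/2 * |w|^2 + max(0, 1 - y_i * (w . x_i + b)),
// visiting one training sample at a time.
LinearSvm trainLinearSvm(const std::vector<std::vector<double>>& X,
                         const std::vector<int>& y, double lambda,
                         double learningRate, int epochs) {
    assert(!X.empty() && X.size() == y.size());
    LinearSvm model;
    model.w.assign(X[0].size(), 0.0);

    for (int epoch = 0; epoch < epochs; ++epoch) {
        for (std::size_t i = 0; i < X.size(); ++i) {
            const double margin = y[i] * model.decision(X[i]);
            for (std::size_t j = 0; j < model.w.size(); ++j) {
                // The regularization term shrinks every weight on every step.
                double grad = lambda * model.w[j];
                // Samples with margin below 1 also contribute the hinge term.
                if (margin < 1.0) grad -= y[i] * X[i][j];
                model.w[j] -= learningRate * grad;
            }
            if (margin < 1.0) model.b += learningRate * y[i];
        }
    }
    return model;
}
```

Only samples whose margin is below 1 change the weights through the hinge term, which mirrors the role of the points near the boundary described above. Word count vectors would need to be converted to `double` before being passed in.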

Once the classifiers are trained, we can check the performance of the models on the
test set. We extract a word count vector for each mail in the test set and predict its class
(non-spam or spam) with the trained SVM model.
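Checking performance on the test set can be sketched as a comparison of predicted and true labels. The struct `ConfusionCounts`, the function `evaluate`, and the label convention (+1 spam, -1 non-spam) are assumptions for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts of the four possible outcomes, with spam as the positive class.
struct ConfusionCounts {
    int truePositive = 0;   // spam predicted as spam
    int falsePositive = 0;  // non-spam predicted as spam
    int trueNegative = 0;   // non-spam predicted as non-spam
    int falseNegative = 0;  // spam predicted as non-spam

    // Fraction of emails classified correctly (0 when there are no emails).
    double accuracy() const {
        const int total =
            truePositive + falsePositive + trueNegative + falseNegative;
        if (total == 0) return 0.0;
        return static_cast<double>(truePositive + trueNegative) / total;
    }
};

// Compare predicted labels with true labels (+1 = spam, -1 = non-spam).
ConfusionCounts evaluate(const std::vector<int>& predicted,
                         const std::vector<int>& actual) {
    assert(predicted.size() == actual.size());
    ConfusionCounts counts;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        const bool isSpam = (actual[i] == 1);
        const bool saidSpam = (predicted[i] == 1);
        if (isSpam && saidSpam) {
            ++counts.truePositive;
        } else if (isSpam) {
            ++counts.falseNegative;
        } else if (saidSpam) {
            ++counts.falsePositive;
        } else {
            ++counts.trueNegative;
        }
    }
    return counts;
}
```

Reporting the four counts alongside accuracy shows which kind of mistake the filter makes; for a spam filter, a non-spam email flagged as spam is usually the more costly error.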

More details on how SVM works and how to install and use PyTorch will be
discussed in lectures.