CSCI 222
C++
Lab #4
For this laboratory assignment, you will work in groups of 1-3 to build a basic spam email filter
based on machine learning using C++. No worries, I won’t ask you to implement the machine
learning code! You will be using an existing open-source machine learning library.
Your code will read in an email message in a standard format (we will determine that
standard) and will classify that email as spam or non-spam. There are
databases of example spam and non-spam emails that you can download and use to
train your machine learning filter to make this determination. More details about that database
will be provided.
The basic steps in determining whether a text file containing an email is spam or non-spam are
as follows:
1. Prepare the text data by reading in the file
2. Create the word dictionary to count the words and their frequency
3. Extract the features from the dictionary and prepare the data as input to the
machine learning function (whether for training or for testing)
4. Pick a machine learning algorithm to train on your features from step 3
5. Finally, you will test your code using sample emails to see how well your software
works.
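As a starting point for step 1 in the list above, here is a minimal sketch of reading a whole
email file into a string. The function name readTextFile is hypothetical and not part of any
required interface; it assumes each email is stored as its own plain-text file.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Read the complete contents of a text file into a string.
// Throws std::runtime_error if the file cannot be opened.
std::string readTextFile(const std::string& path) {
    std::ifstream in(path);
    if (!in) {
        throw std::runtime_error("cannot open " + path);
    }
    std::ostringstream buffer;
    buffer << in.rdbuf();  // copy the whole stream in one go
    return buffer.str();
}
```

Calling readTextFile on the path of one email returns its full text, subject line included, which
the later steps can then split into words.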
Step 0: Download the Ling-spam corpus data
http://www2.aueb.gr/users/ion/data/enron-spam/
Step 1: Prepare the data
Divide the data into a training set and a test set (a 4:1 ratio is a reasonable choice), where the
training data contains an equal number of spam and non-spam emails and the testing data also
contains an equal number of spam and non-spam emails.
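One possible way to produce such a split is sketched below. It assumes you already have two
lists of file paths, one per class. The type and function names (Split, BalancedData,
balancedSplit) are made up for illustration, and the 4:1 ratio is hard-coded to match the
suggestion above.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical containers: training and test file paths for one class,
// and the pair of splits for both classes.
struct Split {
    std::vector<std::string> train;
    std::vector<std::string> test;
};

struct BalancedData {
    Split spam;
    Split nonSpam;
};

// Split one (already shuffled) list 4:1 using integer arithmetic.
Split splitFourToOne(const std::vector<std::string>& files) {
    const std::size_t nTrain = files.size() * 4 / 5;
    const auto cut = files.begin() + static_cast<std::ptrdiff_t>(nTrain);
    Split s;
    s.train.assign(files.begin(), cut);
    s.test.assign(cut, files.end());
    return s;
}

// Shuffle both classes, trim the larger one so both have the same size,
// then split each class 4:1. Equal class sizes before the split guarantee
// equal class sizes in both the training and the test set.
BalancedData balancedSplit(std::vector<std::string> spam,
                           std::vector<std::string> nonSpam,
                           unsigned seed) {
    std::mt19937 rng(seed);
    std::shuffle(spam.begin(), spam.end(), rng);
    std::shuffle(nonSpam.begin(), nonSpam.end(), rng);
    const std::size_t n = std::min(spam.size(), nonSpam.size());
    spam.resize(n);
    nonSpam.resize(n);
    return BalancedData{splitFourToOne(spam), splitFourToOne(nonSpam)};
}
```

Using a fixed seed makes the split reproducible, which helps when comparing runs within your
group.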
In any text mining problem, text cleaning is the first step: we remove those words from the
document that may not contribute to the information we want to extract. Emails may contain a
lot of undesirable content like punctuation marks, stop words, digits, etc., which may not be
helpful in detecting spam email. The emails in the Ling-spam corpus have already been
preprocessed in the following ways:
a) Removal of stop words – Stop words like “and”, “the”, “of”, etc. are very common in all
English sentences and are not very meaningful in deciding spam or legitimate status, so these
words have been removed from the emails.
b) Lemmatization – This is the process of grouping together the different inflected forms of a
word so they can be analyzed as a single item. For example, “include”, “includes”, and “included”
would all be represented as “include”. Unlike stemming (another buzzword in text mining, which
simply strips word endings without considering context), lemmatization takes the context of the
sentence into account.
We still need to remove the non-words, like punctuation marks or special characters, from the
mail documents. There are several ways to do it. Here, we will remove such non-words after
creating a dictionary, which is convenient because with a dictionary each such token needs to
be removed only once. This has already been done for you!
Step 2: Create the dictionary
A sample email in the data-set looks like this:
Subject: posting

hi , ' m work phonetics project modern irish ' m hard source . anyone recommend book article english ? ' , specifically interest palatal ( slender ) consonant , work helpful too . thank ! laurel sutton ( sutton @ garnet . berkeley . edu
It can be seen that the first line of the mail is the subject and the third line contains the body
of the email. We will only perform text analytics on the body to detect the spam mails. As a first
step, we need to create a dictionary of words and their frequencies. Once the dictionary is
created, we can add just a few lines of code to remove the non-words discussed in Step 1.
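The dictionary step could be sketched as follows. All names here (WordCounts, emailBody,
countWords, isWord, removeNonWords, topWords) are hypothetical. Two choices are assumptions made
for this sketch rather than requirements of the lab: a token counts as a word only if it is
purely alphabetic and longer than one character, and the dictionary is trimmed by keeping the
most frequent words.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using WordCounts = std::unordered_map<std::string, std::size_t>;
using Entry = std::pair<std::string, std::size_t>;

// Return everything after the first ("Subject:") line of an email.
std::string emailBody(const std::string& email) {
    const std::size_t newline = email.find('\n');
    return newline == std::string::npos ? std::string() : email.substr(newline + 1);
}

// Add every whitespace-separated token of `body` to the running counts.
void countWords(const std::string& body, WordCounts& counts) {
    std::istringstream in(body);
    std::string token;
    while (in >> token) {
        ++counts[token];
    }
}

// Assumption: a "word" is purely alphabetic and longer than one character.
bool isWord(const std::string& token) {
    return token.size() > 1 &&
           std::all_of(token.begin(), token.end(),
                       [](unsigned char c) { return std::isalpha(c) != 0; });
}

// Remove punctuation, digits, and other non-words from the dictionary.
// Each distinct non-word is erased exactly once, however often it occurred.
void removeNonWords(WordCounts& counts) {
    for (auto it = counts.begin(); it != counts.end();) {
        if (isWord(it->first)) {
            ++it;
        } else {
            it = counts.erase(it);
        }
    }
}

// Return up to `limit` words, most frequent first (ties in alphabetical order).
std::vector<std::string> topWords(const WordCounts& counts, std::size_t limit) {
    std::vector<Entry> entries(counts.begin(), counts.end());
    std::sort(entries.begin(), entries.end(), [](const Entry& a, const Entry& b) {
        return a.second != b.second ? a.second > b.second : a.first < b.first;
    });
    if (entries.size() > limit) {
        entries.resize(limit);
    }
    std::vector<std::string> words;
    for (const Entry& e : entries) {
        words.push_back(e.first);
    }
    return words;
}
```

In practice you would call countWords once per training email with the same WordCounts object,
then call removeNonWords a single time at the end.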
Step 3: Extract the Features
Once the dictionary is ready, we can extract a word count vector (our feature here) of 3000
dimensions for each email in the training set. Each word count vector contains the frequencies
of the 3000 dictionary words in that training file. Most of the entries will be zero. Let us take
an example. Suppose we have 500 words in our dictionary. Each word count vector then contains
the frequencies of the 500 dictionary words in the training file. Suppose the text in the training
file was “Get the work done, work done”; then it will be encoded as
[0, 0, ..., 0, 2, 0, ..., 0, 1, 0, ..., 0, 1, 0, ..., 0, 2, 0, ..., 0]. Here, the non-zero word
counts are placed at the 296th, 359th, 415th, and 495th indices of the 500-length word count
vector, and the rest are zero.
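A minimal sketch of this encoding is shown below. It assumes the dictionary is available as a
list of distinct words and that the email body is already lower-cased and space-separated, as in
the corpus. The names buildIndex and wordCountVector are hypothetical.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

using WordIndex = std::unordered_map<std::string, std::size_t>;

// Map each dictionary word to its position in the feature vector.
WordIndex buildIndex(const std::vector<std::string>& dictionary) {
    WordIndex index;
    for (std::size_t i = 0; i < dictionary.size(); ++i) {
        index[dictionary[i]] = i;
    }
    return index;
}

// Count how often each dictionary word occurs in `body`.
// Tokens that are not in the dictionary are ignored.
std::vector<double> wordCountVector(const std::string& body,
                                    const WordIndex& index,
                                    std::size_t dimension) {
    std::vector<double> features(dimension, 0.0);
    std::istringstream in(body);
    std::string token;
    while (in >> token) {
        const auto it = index.find(token);
        if (it != index.end() && it->second < dimension) {
            features[it->second] += 1.0;
        }
    }
    return features;
}
```

With a toy five-word dictionary {"done", "get", "the", "work", "zebra"}, the body
"get the work done , work done" yields the counts 2, 1, 1, 2, 0, which mirrors the 500-word
example above on a smaller scale.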
Step 4: Train the Classifiers
There are many different machine learning algorithms. We will use an industry-standard
machine learning library called PyTorch:
https://pytorch.org/
For this lab assignment, you will use a method called Support Vector Machines (SVM). SVMs
are supervised binary classifiers that are very effective when you have a large number of
features. An SVM looks for the hyperplane that separates the two classes of training data with
the widest margin. The training points that lie closest to that hyperplane, and therefore
determine the boundary, are called the support vectors. The decision function of the SVM model,
which predicts the class of the test data, is based on the support vectors and can make use of
the kernel trick.
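To make the idea concrete, here is a small library-free sketch of a linear SVM trained by
sub-gradient descent on the regularized hinge loss. It is a stand-in for illustration only: it
does not use PyTorch, it does not show the PyTorch API, and it covers only the linear case with
no kernel trick. The names LinearSvm and trainLinearSvm are hypothetical, and the labels are
assumed to be +1 for spam and -1 for non-spam.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical linear SVM model: decision(x) = w . x + b
struct LinearSvm {
    std::vector<double> weights;
    double bias = 0.0;

    double decision(const std::vector<double>& x) const {
        double sum = bias;
        for (std::size_t i = 0; i < weights.size(); ++i) {
            sum += weights[i] * x[i];
        }
        return sum;
    }

    // Assumed label convention: +1 = spam, -1 = non-spam.
    int predict(const std::vector<double>& x) const {
        return decision(x) >= 0.0 ? 1 : -1;
    }
};

// Minimize (lambda / 2) * |w|^2 + hinge loss, one training example at a time.
// The hinge loss max(0, 1 - y * decision(x)) is non-zero only for examples
// that are misclassified or inside the margin.
LinearSvm trainLinearSvm(const std::vector<std::vector<double>>& X,
                         const std::vector<int>& y,
                         double learningRate,
                         double lambda,
                         int epochs) {
    LinearSvm model;
    if (X.empty()) {
        return model;
    }
    model.weights.assign(X[0].size(), 0.0);
    for (int epoch = 0; epoch < epochs; ++epoch) {
        for (std::size_t n = 0; n < X.size(); ++n) {
            const double margin = y[n] * model.decision(X[n]);
            for (std::size_t i = 0; i < model.weights.size(); ++i) {
                double gradient = lambda * model.weights[i];  // regularization term
                if (margin < 1.0) {
                    gradient -= y[n] * X[n][i];               // hinge-loss term
                }
                model.weights[i] -= learningRate * gradient;
            }
            if (margin < 1.0) {
                model.bias += learningRate * y[n];
            }
        }
    }
    return model;
}
```

Here each row of X would be one word count vector from Step 3 and y the matching label. The
learning rate, lambda, and epoch count are tuning parameters you would need to experiment with.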
Once the classifier is trained, we can check the performance of the model on the
test set. We extract the word count vector for each mail in the test set and predict its class
(non-spam or spam) with the trained SVM model.
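One simple way to summarize test performance is to compare predicted and actual labels and
report accuracy along with the four confusion counts. The sketch below assumes labels of +1 for
spam and -1 for non-spam. The Evaluation struct and evaluate function are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical summary of test results, treating spam (+1) as the positive class.
struct Evaluation {
    std::size_t truePositives = 0;   // spam predicted as spam
    std::size_t falsePositives = 0;  // non-spam predicted as spam
    std::size_t trueNegatives = 0;   // non-spam predicted as non-spam
    std::size_t falseNegatives = 0;  // spam predicted as non-spam

    // Fraction of emails classified correctly (0 if there were no emails).
    double accuracy() const {
        const std::size_t total =
            truePositives + falsePositives + trueNegatives + falseNegatives;
        return total == 0
                   ? 0.0
                   : static_cast<double>(truePositives + trueNegatives) / total;
    }
};

// Compare predictions with the true labels, element by element.
// Only the positions present in both vectors are compared.
Evaluation evaluate(const std::vector<int>& predicted, const std::vector<int>& actual) {
    Evaluation result;
    const std::size_t n = predicted.size() < actual.size() ? predicted.size() : actual.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (actual[i] == 1) {
            if (predicted[i] == 1) ++result.truePositives;
            else ++result.falseNegatives;
        } else {
            if (predicted[i] == 1) ++result.falsePositives;
            else ++result.trueNegatives;
        }
    }
    return result;
}
```

For a spam filter the false positives deserve particular attention, since they are legitimate
emails that would be wrongly flagged as spam.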
More details on how SVM works and how to install and use PyTorch will be
discussed in lectures.