milestone2.docx

You will calculte a trigram frequency vector for each file in a given set of training files. You will also calculate the frequency vector for a test file. You will then compute the similarity between the test frequency vector and each training frequency vector. You will keep track of which similarity was the greatest and will output the name of training file which produced the best match.

Similarity

Given two vectors (mathematical vectors, not C++ vectors), the similarity between those vectors can be described using cosine similarity. This is the cosine of the angle between the two vectors.

A formula for the cosine similarity of two vectors A and B, where both A and B have n elements, is:

\begin{equation} \cos^2 \theta = \frac{\Big(\sum\limits_{i=0}^{n-1}{A_i B_i}\Big)^2}{\Big(\sum\limits_{i=0}^{n-1}{A_i^2}\Big) \Big(\sum\limits_{i=0}^{n-1}{B_i^2}\Big)} \end{equation}

This may look like a scary equation but, when we get down to it, it is actually pretty simple. Let's deconstruct it and look at its parts.

Let's take a look at the numerator:

\begin{equation} \sum\limits_{i=0}^{n-1}{A_i B_i} \end{equation}

This is the sum of the element-wise product of corresponding elements in the vectors A and B. In math, this is known as the dot-product of two vectors.

Let's take a look at each term in the denominator:

\begin{equation} \sum\limits_{i=0}^{n-1}{A_i^2} \end{equation}

This is the square-root of the sum of the element-wise square of each element in the vector A. In math, this is called the norm of a vector. The same goes for the second term in the denominator.

The value for cosine similarity will range between 0 and 1 inclusive. The larger this value is between two vectors, the more similar they are.

Pitfalls

As it so often happens, there are some pitfalls when calculating the similarity between vectors using built-in C++ integer types. The frequency vectors will have some values that are quite large and when these values are squared, they overlow even unsigned long long. Naturally, we are going to have you use bigints to get around this issue. You will store the intermediate parts of the similarity calculation as bigints and then use bigint methods to calculate $cos^2 \theta$.

Another thing to notice is that similarity is going to be of type double. We have not implemented bigints that can handle floating-point types. We will get around this problem by using a little math trick called scaled-division. The idea is that you:

· take the numerator and multiply it by some convenient number (e.g. $1,000,000$),

· perform bigint division using the scaled numerator,

· convert the result into a regular integer,

· and finally perform floating-point division by 1,000,000

to get the actual value for $cos^2 \theta$. You can then calculate the square-root of this value to get the cosine similaritybetween the two vectors.

Note that you will be performing multiplications and divisions with some really large numbers. The iterative bigint multiplyand divide methods you implemented are going to be painfully slow for numbers of this magnitude. We suggest that you download the solution key for bigint provided to you and use the fast_multiply and fast_divide methods implemented by a certain wonderful TA.

Milestone Requirements

Your program will receive an unknown (greater than 2) number of command-line arguments. These will be the names of the training files and finally the name of the test file. You are to calculate the trigram frequency vector for each of the files given. You will then compute the cosine similarity between the frequency vector of the test file and the frequency vector of each training file. You will find the highest value for similarity and then print the name of the language that was the best match (followed by a new-line character).

You must name your output file language.

Sample command-line input:

$ ./language english german spanish icelandic maori test

If icelandic happened to be the best match for test, your command-line should end up looking like:

$ ./language english german spanish icelandic maori test

icelandic

$

Grading Rubric

· Functional Correctness: 70%

· Design, Representation, and Comments: 30%

Important Note: You are not being graded on how fast your program runs, with one caveat: if your program is so slow or so memory-inefficient that Mimir cannot run it, you will receive no points for functional correctness. A well-written program should run in anywhere from 5-60 seconds on the data set we have provided. If your program takes multiple minutes to run on your conputer, then something is wrong.

Helpful Hints

Dealing with the data files

If you are anything like some of your instructors, you hate typing repetitive and unnecessary things on the command line. You should learn to rely on wildcards, also known as file globbing. For example, if you ran:

$ ./language training_languages/* testing_languages/english

the command line will expand the * into all the files in the training_languages sub-directory, so your program will see every file individually just as if you had painstakingly typed out the name of every file individually!

Writing your own compile script

For milestones 1 and 2, you need to write your own compile script. We have provided you something to start you off but you will need to edit it. In particular, you need your compile script to take every .cpp file you have written and compile them into a single executable file.

Note that ease-of-debugging and speed-of-execution are at odds with one another. When you are just developing, you should not have flags like -O2 (which stands for optimization level 2), and you should have a flag like -g (which keeps some symbol information so your debugger can show you your own source code). However, prior to submitting your project, you should remove the -g flag. You can also experiment with putting in the -O2 od -O3 flags for compiler optimization. You may even want to see which optimization flags produce the fastest executables. Sometimes, the difference can be significant!

You can put a stopwatch to your program with the time command:

$ time ./language training_languages/* testing_languages/english

On my (rather fast) laptop, I can say something like this:

$ time ./language training_languages/* testing_languages/english

training_languages/english

real 0m2.610s

user 0m2.586s

sys 0m0.024s

The relevant number here is the user time, since the computer is also busy doing some other stuff (like running my operating system). Essentially, my program ran in 2.586 seconds.

Memory Efficiency

If you are not careful, it is easy to make your program use too much memory or run very slowly (the two are often related). Remember that arguments are passed to functions by value in C++ by default. This means that if you take a vector containing ~20,000 8-byte integers and pass it around by value, C++ is making several copies of that vector. Each copy is worth ~160 kB. These can add up very quickly, particularly given that the total size of the language text files is ~200 MB. Use the information presented in class on passing arguments by reference to cut down on this wasted memory usage.