c++
CS/DSA 5005 – Computing Structures – Fall 2019
Project 3 – Due 12th November, 2019
Objective:
In this project, you will learn about hashing and implement a hash table in which you will store
strings, then use this hash table for text analysis. You will also learn to use the STL vector
library and to manipulate strings.
Description:
The input is a set of documents (doc1.txt, doc2.txt, …) that contain various
text/words/sentences (assumptions: no punctuation, all lower case). There is also a redirected
input file that contains the names of the text documents that are to be opened and
analyzed.
You will construct a Hash class that holds the hash table along with the number of key
values/buckets (assumed to be 0 to 19). You will create an array of instances/objects of this
Hash class, one for each of the documents.
The hash table in the Hash class will be a vector of linked lists (use the STL vector library
<vector>). The hash function h applied to a word/string in a document yields an index position i,
which is the index of the bucket to which the string will be linked/appended. This value i is
the number of times that word has occurred in that document.
For example, if the word ‘awesome’ has occurred 5 times in docX.txt, then the value will be the
string ‘awesome’ and the key value will be 5.
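The bucket layout described above can be sketched as follows. This is only an illustrative sketch, not the expected solution: the class name DocTable, its member functions, and the use of std::list for the linked lists are assumptions (the project may expect a hand-written list, and <list> may not be among the headers allowed by sampleMain.cpp). Keeping a word in the last bucket once its count passes 19 is also an assumption, since the handout does not say what should happen in that case.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Hypothetical class for illustration; the handout calls the real class "Hash".
class DocTable {
public:
    // 20 buckets, indexed 0..19, as assumed in the handout.
    DocTable() : buckets_(20) {}

    // Record one more occurrence of `word`: move it from bucket i to bucket i + 1,
    // or place it in bucket 1 if it has not been seen yet.
    void addOccurrence(const std::string& word) {
        for (std::size_t i = 1; i < buckets_.size(); ++i) {
            for (auto it = buckets_[i].begin(); it != buckets_[i].end(); ++it) {
                if (*it == word) {
                    // Assumption: a word already in the last bucket stays there.
                    if (i + 1 < buckets_.size()) {
                        buckets_[i + 1].splice(buckets_[i + 1].end(), buckets_[i], it);
                    }
                    return;
                }
            }
        }
        buckets_[1].push_back(word);
    }

    // Bucket index (occurrence count) of `word`, or 0 if the word is absent.
    std::size_t countOf(const std::string& word) const {
        for (std::size_t i = 1; i < buckets_.size(); ++i) {
            for (const std::string& w : buckets_[i]) {
                if (w == word) return i;
            }
        }
        return 0;
    }

private:
    std::vector<std::list<std::string>> buckets_;  // buckets_[i]: words seen i times
};
```

With this sketch, calling addOccurrence("awesome") five times leaves ‘awesome’ in bucket 5, matching the example above.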
Expectations:
- The strings/words in each of the documents should be stored in the hash table.
- Given a word, the output should be the total number of times the word occurs across all
documents, that is, the sum of the key values of this word in every document. – findKey()
- Given a key value, the output should be a list of all the words that occur that many times
across every document. – findValue()
- The term frequency (tf) is to be calculated, that is, the frequency of the word occurring in
a given document. This is just the raw count of the word occurring in a specific
document.
o tf(t,d) = frequency of the word ‘t’ occurring in the document ‘d’.
- The inverse document frequency (idf) is to be calculated for the given word in the
input file. This is the importance of a word across all the documents.
o idf(t) = log( number of documents / number of documents in which the word ‘t’ occurs )
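The two lookups, findKey() and findValue(), can be sketched as free functions over one table per document. This is a sketch under stated assumptions: the Table alias, the helper keyInDocument, and the function signatures are hypothetical, and findValue is read here as "collect the words stored under that key in each document, in document order, without removing duplicates". Check that reading against the sample output before relying on it.

```cpp
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Assumed representation: table[i] holds the words occurring i times in one document.
using Table = std::vector<std::list<std::string>>;

// Hypothetical helper: key value of `word` in one document, 0 if the word is absent.
std::size_t keyInDocument(const Table& table, const std::string& word) {
    for (std::size_t i = 1; i < table.size(); ++i) {
        for (const std::string& w : table[i]) {
            if (w == word) return i;
        }
    }
    return 0;
}

// findKey: sum of the key values of `word` over every document.
std::size_t findKey(const std::vector<Table>& docs, const std::string& word) {
    std::size_t total = 0;
    for (const Table& table : docs) {
        total += keyInDocument(table, word);
    }
    return total;
}

// findValue: words stored under `key` in each document (assumed interpretation).
std::vector<std::string> findValue(const std::vector<Table>& docs, std::size_t key) {
    std::vector<std::string> words;
    for (const Table& table : docs) {
        if (key >= table.size()) continue;  // no such bucket in this table
        for (const std::string& w : table[key]) {
            words.push_back(w);
        }
    }
    return words;
}
```

For instance, if ‘the’ sits in bucket 2 of one document and bucket 3 of another, findKey returns 5.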
The tf metric alone might not be the best measure of the importance of a word, because it gives
only the raw frequency of occurrence within one document. The idf metric gives the importance of a
word across all the documents, since it considers whether or not the word is present in each document.
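A small worked sketch of the two metrics may help. It assumes the per-document counts of one word have already been collected into a vector (a hypothetical layout), uses the natural logarithm because the handout does not state a base, and returns 0 when the word occurs in no document, which is also an assumption. Note that <cmath> may or may not be among the headers allowed by sampleMain.cpp.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// tf(t, d): raw count of the word in document `doc`.
// countsPerDoc[d] is the number of times the word occurs in document d.
std::size_t tf(const std::vector<std::size_t>& countsPerDoc, std::size_t doc) {
    return countsPerDoc[doc];
}

// idf(t) = log(number of documents / number of documents containing the word).
double idf(const std::vector<std::size_t>& countsPerDoc) {
    std::size_t containing = 0;
    for (std::size_t c : countsPerDoc) {
        if (c > 0) ++containing;
    }
    if (containing == 0) return 0.0;  // assumption: avoid dividing by zero
    return std::log(static_cast<double>(countsPerDoc.size()) /
                    static_cast<double>(containing));
}
```

With counts {4, 0, 1} over three documents, tf for the first document is 4 and idf is log(3/2). A word that occurs in every document gets idf = log(1) = 0 no matter how large its raw counts are, which is the difference between the two metrics described above.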
You are also expected to overload the ostream operator (operator<<) for displaying the hash table
and to implement the copy constructor.
Input file:
The following is a sample input file and how to interpret it.
3               <- This is the number of documents to read
doc1.txt 20     <- Following are the names of the documents
doc2.txt 10        along with the word count of the
doc3.txt 10        document
the             <- Input word for which we need the key across all documents
1               <- Input key for which we need the words across all documents
the doc2.txt    <- Input word and document for finding tf
the             <- Input word to find idf
Sample output files along with the documents are also provided. You can go through the
documents and hand-calculate the values to compare them with the sample output file. A
sampleMain.cpp file is also given along with this project that will be a good starting point. The
submission will be through GradeScope and will be autograded. This will be set up soon.
Constraints:
- The project is to be done individually, and code sharing of any sort will be treated as misconduct.
- You may use only the header files that are given in the sampleMain.cpp file.