CS/DSA 5005 – Computing Structures – Fall 2019

Project 3 – Due 12th November, 2019

Objective:

In this project, you will learn about hashing and implement a hash table that stores strings and supports text analysis. You will also learn to use the STL vector library and to do string manipulation.

Description:

The input is a set of documents (doc1.txt, doc2.txt, …) that contain words and sentences (assume no punctuation and all lower case). There is also a redirected input file that contains the names of the text documents to be opened and analyzed.

You will construct a Hash class that holds the hash table along with the number of key values/buckets (key values are assumed to range from 0 to 19). You will create an array of instances of this Hash class, one for each document.

The hash table in the Hash class will be a vector of linked lists (use the STL library <vector>). Applying the hash function h to a word/string in a document yields an index i, which is the index of the bucket to which the string is appended. This value i is the number of times that word occurs in that document.

For example, if the word ‘awesome’ occurs 5 times in docX.txt, then the value is the string ‘awesome’ and the key value is 5.
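The bucket layout described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the required design: the member names (table, build, bucket) are hypothetical, std::list stands in for the linked list, and std::map is used for counting even though sampleMain.cpp may not permit those headers. It also assumes no word occurs more often than the largest bucket index.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch of the per-document table: bucket i holds every word
// that occurs exactly i times in the document.
class Hash {
public:
    explicit Hash(std::size_t bucketCount = 20) : table(bucketCount) {}

    // Count the words in `text`, then append each word to the bucket whose
    // index equals its count.
    void build(const std::string& text) {
        std::map<std::string, std::size_t> counts;  // assumption: <map> is allowed
        std::istringstream in(text);
        std::string word;
        while (in >> word) ++counts[word];

        for (auto& b : table) b.clear();
        for (const auto& entry : counts) {
            // Assumption: counts never exceed the largest bucket index.
            assert(entry.second < table.size());
            table[entry.second].push_back(entry.first);
        }
    }

    // Read-only access to the words stored under one key value.
    const std::list<std::string>& bucket(std::size_t key) const {
        assert(key < table.size());
        return table[key];
    }

private:
    std::vector<std::list<std::string>> table;  // vector of linked lists
};
```

For the text "the cat and the dog and the bird", the word "the" lands in bucket 3, "and" in bucket 2, and the remaining words in bucket 1.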

Expectations:

- The strings/words in each of the documents should be stored in the hash table.

- Given a word, the output should be the total number of times the word occurs across all documents, that is, the sum of the key values of this word over every document. – findKey()

- Given a key value, the output should be a list of all the words that occur that many times across every document. – findValue()

- The term frequency (tf) is to be calculated; it is the frequency with which a word occurs in a given document. This is just the raw count of the word in that specific document.

o tf(t,d) = frequency of the word ‘t’ occurring in the document ‘d’.

- The inverse document frequency (idf) is to be calculated for the given word in the input file. It measures the importance of a word across all the documents.

o $\mathrm{idf}(t) = \log\left(\dfrac{\text{number of documents}}{\text{number of documents in which the word } t \text{ occurs}}\right)$
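The findKey() and findValue() expectations in the list above could be sketched as below. This is one possible reading, written as free functions over a hypothetical Table alias rather than as members of the Hash class; the helper countIn is also an assumption. In particular, findValue here gathers the words stored under the given key in each document in turn, so a word can appear more than once in the result; the sample output files should be checked to confirm the intended behaviour.

```cpp
#include <algorithm>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// One table per document: bucket i lists the words occurring i times.
using Table = std::vector<std::list<std::string>>;

// Hypothetical helper: the bucket index (count) of `word` in one table,
// or 0 if the word is not stored in that table.
std::size_t countIn(const Table& table, const std::string& word) {
    for (std::size_t key = 0; key < table.size(); ++key) {
        const auto& bucket = table[key];
        if (std::find(bucket.begin(), bucket.end(), word) != bucket.end())
            return key;
    }
    return 0;
}

// findKey: sum of the word's key value over every document.
std::size_t findKey(const std::vector<Table>& docs, const std::string& word) {
    std::size_t total = 0;
    for (const auto& table : docs) total += countIn(table, word);
    return total;
}

// findValue: words stored under `key`, gathered document by document.
std::vector<std::string> findValue(const std::vector<Table>& docs,
                                   std::size_t key) {
    std::vector<std::string> words;
    for (const auto& table : docs) {
        if (key >= table.size()) continue;  // this table has no such bucket
        words.insert(words.end(), table[key].begin(), table[key].end());
    }
    return words;
}
```

A linear scan over the buckets is used because the table is indexed by count, not by word, so looking up a word cannot jump directly to its bucket.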

The tf metric may not be the best measure of a word's importance, because it gives only the raw frequency of occurrence within one document. The idf metric instead reflects the importance of a word across all the documents, since it considers whether or not the word is present in each document.
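To make the contrast between the two metrics concrete, here is a minimal sketch of both. Several choices are assumptions not fixed by this handout: the Table alias and function signatures are hypothetical, the logarithm is taken as the natural log (the base is not stated), a word found in no document is given idf 0 to avoid dividing by zero, and <cmath> may or may not be among the permitted headers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// One table per document: bucket i lists the words occurring i times.
using Table = std::vector<std::list<std::string>>;

// tf(t, d): the raw count, i.e. the index of the bucket holding `word`
// in document `doc` (0 if the word is absent).
std::size_t tf(const Table& doc, const std::string& word) {
    for (std::size_t key = 0; key < doc.size(); ++key) {
        const auto& bucket = doc[key];
        if (std::find(bucket.begin(), bucket.end(), word) != bucket.end())
            return key;
    }
    return 0;
}

// idf(t) = log(number of documents / number of documents containing t).
double idf(const std::vector<Table>& docs, const std::string& word) {
    assert(!docs.empty());
    std::size_t containing = 0;
    for (const auto& doc : docs)
        if (tf(doc, word) > 0) ++containing;
    // Assumption: a word present in no document gets idf 0.
    if (containing == 0) return 0.0;
    return std::log(static_cast<double>(docs.size()) /
                    static_cast<double>(containing));  // natural log assumed
}
```

Under these assumptions, a word that appears in every document has idf = log(1) = 0 regardless of how large its tf is, while a word that appears in only one of three documents has idf = log(3).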

You are also expected to overload the ostream operator (operator<<) to display the hash table, and to implement the copy constructor.

Input file:

The following is a sample input file, annotated to show how to interpret each line.

3               <- Number of documents to read
doc1.txt 20     <- Names of the documents, each followed by
doc2.txt 10        the word count of that document
doc3.txt 10
the             <- Input word for which the key is needed across all documents
1               <- Input key for which the words are needed across all documents
the doc2.txt    <- Input word and document name for finding tf
the             <- Input word for finding idf

Sample output files along with the documents are also provided. You can go through the documents and hand-calculate the values to compare them with the sample output file. A sampleMain.cpp file is also given with this project and is a good starting point. Submission will be through GradeScope and will be autograded; that will be set up soon.

Constraints:

- The project is to be done individually; code sharing of any sort will be treated as misconduct.

- You may use only the header files that are given in the sampleMain.cpp file.