coding program
Program an inverted index :).
1. You don't need any starter files for this homework. An inverted index is just another implementation of the Index interface that you can add to the indexes package/namespace of your homework 1 solution. Decide on an appropriate way to associate terms with postings lists inside the inverted index class, then implement the getPostings(String) method and the getVocabulary() method (recall that getVocabulary() must return a sorted vocabulary list).
(a) You will also need to write an addTerm method. This method is a little tricky... Postings lists must contain distinct document IDs (no ID can occur more than once). When tokenizing and indexing a document, it's possible that you will find a term in a document that you have already seen before, and so that document ID already occurs in the postings list for that term. You must take care to not insert that document ID a second time into the postings list in the addTerm method.
Your implementation of addTerm must run in O(1) time. I will reject your eventual project if it does not. This means you cannot call a method like .find(), .contains(), or .indexOf(), and cannot write a loop over each index in a postings list. Those methods run in O(n) time.
(b) Create a new main application similar to TermDocumentIndexer in which you construct a corpus, iterate the documents, tokenize each document's content, and then add postings to the inverted index. Note that a term-document matrix requires two passes through the corpus, but the inverted index requires only one.
(c) Test your application with the Moby Dick corpus.
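One way to meet the O(1) requirement is sketched below in Python. It assumes documents are indexed in ascending document-ID order, so a repeated ID can only ever be the last entry in a term's postings list; checking that single entry replaces a linear search. The class name InvertedIndex, the snake_case method names, the use of bare integer IDs as postings, and the whitespace tokenizer are all illustrative assumptions, not part of the homework 1 code.

```python
from typing import Dict, List


class InvertedIndex:
    """Maps each term to a list of distinct, ascending document IDs."""

    def __init__(self) -> None:
        self._postings: Dict[str, List[int]] = {}

    def add_term(self, term: str, doc_id: int) -> None:
        # Assumption: doc IDs arrive in non-decreasing order, so a duplicate
        # can only be the last element of the list. The dict lookup, the
        # last-element check, and the append are all O(1) on average, and
        # the postings list is never scanned.
        postings = self._postings.setdefault(term, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)

    def get_postings(self, term: str) -> List[int]:
        return self._postings.get(term, [])

    def get_vocabulary(self) -> List[str]:
        # Sorting the dict's keys yields the sorted vocabulary list.
        return sorted(self._postings)


def build_index(documents: List[str]) -> InvertedIndex:
    """Single pass over the corpus; a document's ID is its list position."""
    index = InvertedIndex()
    for doc_id, content in enumerate(documents):
        # Hypothetical tokenizer: lowercase, split on whitespace.
        for token in content.lower().split():
            index.add_term(token, doc_id)
    return index


index = build_index(["call me ishmael", "the whale the whale", "call the whale"])
print(index.get_vocabulary())
print(index.get_postings("whale"))
```

If documents could arrive out of ID order, the last-element check would no longer be sufficient and a different structure would be needed.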
Next, extend the functionality of the search engine by teaching it to work with JSON file documents.
1. The file all-nps-sites-extracted.zip on BeachBoard contains 36,803 documents that I scraped from the National Park Service in 2016. Each document is a .json file, which is a format for representing data objects using JavaScript-like syntax.
2. Each JSON document in this corpus looks something like this:
    {
        "title" : "The title of the article",
        "body" : "The body of the article",
        "url" : "https://www.nps.gov/.........."
    }
title is the title of the document; url is the Web address where I originally scraped the article from; and body is the text of the article that will be indexed by the search engine.
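As a quick illustration, Python's standard json module parses such a document into a dictionary with those three keys. The sample string below simply repeats the placeholder values from the example above; it is not taken from the corpus.

```python
import json

# Placeholder document mirroring the layout shown above.
raw = """{
    "title" : "The title of the article",
    "body" : "The body of the article",
    "url" : "https://www.nps.gov/.........."
}"""

doc = json.loads(raw)       # parse the JSON text into a dict
print(sorted(doc.keys()))   # the three keys of the document
print(doc["body"])          # the text that would be indexed
```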
3. Create a new class in the documents package called JsonFileDocument, using TextFileDocument as a reference. You need to change two methods:
(a) getContent(): the content of a .json file is the value of the "body" key, which you can read as a string. You will need to find a way to construct a stream around that string in memory in order to be compliant with the getContent method requirements.[1]
Please note that we do not want to keep the contents of the document in memory at all times; we want to dispose of that memory as soon as we are done indexing the file. You should not save the contents of the file as a member field of the JsonFileDocument class, as doing so would keep a large number of very large strings in memory even after indexing is over. This will automatically happen for you in Python if you use the "with" statement, or in Java if you use the "try with resources" pattern (the C# counterpart is a "using" statement).
[1] This is not as easy as simply returning the body as a string. You need to construct a "stream"-like object that reads from a string variable, instead of reading from a file. Java and C# both have a StringReader class that should plug into our architecture just fine; in Python, the StringIO class should suffice.
(b) getTitle(): the title of the document is in the "title" key. Unlike the file's content, we can afford to keep a json document's title in memory, so you can read this once from the file and save it as a private member.
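The two methods might be sketched in Python as follows. This is a minimal sketch under assumptions: the constructor signature (doc_id, path), the snake_case method names, and the attribute names are hypothetical and do not come from TextFileDocument. The title is cached at construction, while the body is re-read from disk on each call and wrapped in a StringIO so the caller receives a stream rather than a string.

```python
import json
import tempfile
from io import StringIO
from pathlib import Path
from typing import TextIO


class JsonFileDocument:
    def __init__(self, doc_id: int, path: Path) -> None:
        self.id = doc_id
        self.path = path
        # The title is small, so read it once and keep it. The parsed
        # body is discarded as soon as this block ends.
        with open(path, encoding="utf-8") as f:
            self._title = json.load(f)["title"]

    def get_title(self) -> str:
        return self._title

    def get_content(self) -> TextIO:
        # Re-read the file on every call; the body is never stored on self.
        with open(self.path, encoding="utf-8") as f:
            body = json.load(f)["body"]
        # StringIO is a stream-like object that reads from an in-memory string.
        return StringIO(body)


# Usage with a throwaway file standing in for one corpus document.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "article.json"
    path.write_text(
        json.dumps({"title": "A title", "body": "Some body text", "url": ""}),
        encoding="utf-8",
    )
    doc = JsonFileDocument(0, path)
    title = doc.get_title()
    with doc.get_content() as stream:
        text = stream.read()

print(title)
print(text)
```

Parsing the file twice (once for the title, once per content request) trades a little I/O for not holding every body in memory.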
4. Once JsonFileDocument is complete, you will need to expand DirectoryCorpus so it knows how to load JsonFileDocuments. I encourage you to read the documentation of DirectoryCorpus.RegisterFileDocumentFactory and DirectoryCorpus.LoadTextDirectory, then create analogous functions to load a directory of json documents. If you do this correctly, you'll only need to change your main to use your new method(s) to construct the DirectoryCorpus, and nothing else will need to change.
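The general idea of a factory registry keyed by file extension can be sketched as below. This is not the real DirectoryCorpus API, whose signatures are not shown in this handout: the class DirectoryCorpusSketch, its method names, and the (doc_id, path) factory signature are hypothetical stand-ins. To keep the example self-contained, a lambda returning a tuple plays the role of a JsonFileDocument constructor.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical factory type: takes (doc_id, path) and returns a document object.
DocumentFactory = Callable[[int, Path], object]


class DirectoryCorpusSketch:
    def __init__(self, directory: Path) -> None:
        self.directory = directory
        self._factories: Dict[str, DocumentFactory] = {}

    def register_file_document_factory(
        self, extension: str, factory: DocumentFactory
    ) -> None:
        # Associate a file extension such as ".json" with a factory.
        self._factories[extension] = factory

    def documents(self) -> List[object]:
        # Only files whose extension has a registered factory are loaded;
        # IDs are assigned in sorted filename order.
        paths = sorted(
            p for p in self.directory.iterdir()
            if p.is_file() and p.suffix in self._factories
        )
        return [self._factories[p.suffix](i, p) for i, p in enumerate(paths)]

    @classmethod
    def load_json_directory(
        cls, directory: Path, factory: DocumentFactory
    ) -> "DirectoryCorpusSketch":
        # Hypothetical counterpart to a text-directory loader.
        corpus = cls(directory)
        corpus.register_file_document_factory(".json", factory)
        return corpus


with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.json", "a.json", "notes.txt"):
        (Path(tmp) / name).write_text("{}", encoding="utf-8")
    corpus = DirectoryCorpusSketch.load_json_directory(
        Path(tmp), lambda doc_id, path: (doc_id, path.name)
    )
    docs = corpus.documents()

print(docs)
```

With this shape, the main application changes only the call that constructs the corpus; the indexing loop stays the same.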