Data mining lab


CIS660 Big Data Analytics Sunnie S. Chung

Lab Assignment 2
CIS 660/EEC 525 Data Mining
Sunnie Chung

Part 1: Preprocessing to Build Document Vectors for Web Page Content Analysis

A document can be represented as a Bag of Words, in which the document is described by a set of terms (often thousands) used as features, each weighted by its Term Frequency (TF), the number of times the term occurs in the document, and its Inverse Document Frequency (IDF). For this transformation, you have to build an Inverted Index, which is a Term Look-Up Table or Term Dictionary with Term Frequency (TF) and Document Frequency (DF). See how to build an Inverted Index for TF-IDF in Slides 10 – 14 in the Lecture Notes.

For text analysis tasks, for example, to categorize (cluster) each document in a collection into a certain number of topics, or to find the Top N documents in a collection most related to a given user query (topics) in a Question Answering (QA) System, each document can be represented as a vector of TF-IDF weights on the topic terms (topic words/keywords, or phrases as bi-grams or tri-grams).

Construct a document vector for each of the following 6 webpages, holding the frequency count of each of the 7 given topic terms. The 7 topic terms are the 4 keywords and 3 phrases (bi-grams) below.

Topics to Analyze: research, data, mining, analytics, data mining, machine learning, deep learning

Doc1: https://www.edx.org/course/data-science-machine-learning
Doc2: https://en.wikipedia.org/wiki/Engineering
Doc3: http://my.clevelandclinic.org/research
Doc4: https://en.wikipedia.org/wiki/Data_mining
Doc5: https://en.wikipedia.org/wiki/Data_mining#Data_mining
Doc6: http://cis.csuohio.edu/~sschung/


Right-click on each webpage -> View Source, then save the HTML file as a .txt file. Those 6 text files are the input files you process to count the frequency of each topic word and construct the 6 document vectors for the 6 docs, as in the table below.

                    doc1  doc2  doc3  doc4  doc5  doc6
research
data
mining
analytics
data mining
machine learning
deep learning
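Because the saved files contain raw HTML, the visible page text has to be separated from tags, scripts, and style sheets before counting. The following is a minimal sketch of one way to do that with Python's standard html.parser module. The class name VisibleTextExtractor and the helper visible_text are hypothetical names chosen for this illustration, and skipping the noscript element is an assumption, not something the lab requires.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text found outside tags, skipping script/style bodies."""

    # Assumption: text inside these elements is not page content.
    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []      # pieces of visible text, in page order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Tag names and attributes never reach handle_data, so only
        # text between tags (including the <title> text) is collected.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the pieces and collapse runs of whitespace to one space.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html):
    """Return the text content of an HTML string as one line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A saved page could then be read with something like visible_text(open("doc1.txt", encoding="utf-8").read()), where doc1.txt is a placeholder file name. Note that joining the pieces with spaces can split a word that is interrupted by an inline tag, which is one of the simplifications worth noting in the report.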

1. We want to count only the words that appear in the webpage text as the content of the page, not the words inside any tags <……> or in any system-generated scripts or HTML code. You can also count the words that appear in the title bar and in the menu of the webpage.

2. We do not want to count any substring that is part of another word. For example, “pin” should not be counted when it occurs inside “Spin”.

3. However, counting is not case sensitive: Insert, insert, and INSERT are all counted as the same word.

4. Words from the same stem are counted as the same word. For example, program, programming, programmed, and programmable are all counted as “program”. You can directly add OR conditions with all the variations of the words that share a stem to count them all as the same word.

5. We want to count a phrase (bi-gram or tri-gram) by counting its occurrences: ‘data mining’, for example, is counted when ‘data’ is immediately followed by ‘mining’. This is usually done to add a discovered bi-gram or tri-gram to the term dictionary as a single term; for example, ‘data mining’ is added as the single term ‘data_mining’ with its frequency.

6. See the FAQ for Lab2 on the class webpage for more guidance.

Common NLP Preprocessing Procedures for Text Analysis

Minimum Requirements:


1. Remove all special symbols, such as punctuation marks and question marks, using the character deletion step of translate.

2. Remove all stop words (search for stop word lists or Python or R libraries).

3. Do stemming to reduce inflected (or sometimes derived) words to their word stem.

4. Convert uppercase to lowercase.
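The four minimum steps can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference solution: the stop word list is a tiny hand-picked sample, and naive_stem is a crude suffix stripper standing in for a real stemmer such as the Porter stemmer. The names STOP_WORDS, SUFFIXES, naive_stem, and preprocess are all hypothetical. Lowercasing is done first here so that stop word matching is case-insensitive.

```python
import string

# Assumption: a tiny illustrative stop word list; a real run would
# load a fuller list from a library or a published file.
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "of", "the", "to"}

# Assumption: crude suffix stripping in place of a real stemmer.
SUFFIXES = ("ing", "ed", "s")

# Translation table whose third argument lists characters to delete.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def naive_stem(word):
    """Strip the first matching suffix if at least 4 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, delete punctuation, drop stop words, then stem."""
    lowered = text.lower()                      # step 4
    no_punct = lowered.translate(PUNCT_TABLE)   # step 1
    tokens = no_punct.split()
    kept = [t for t in tokens if t not in STOP_WORDS]  # step 2
    return [naive_stem(t) for t in kept]        # step 3


print(preprocess("Researchers are INSERTING data, and learning!"))
```

Deleting punctuation rather than replacing it with a space joins hyphenated words (for example, "data-mining" becomes "datamining"), so mapping punctuation to spaces instead may be a better fit for phrase counting.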

For more accurate text preprocessing, see the Lab2 section. You can first build your term dictionary with each term's frequency (an Inverted Index) and then construct the 6 document vectors for the given topics. Although adding IDF is highly recommended, you can ignore document frequency for simplicity in this lab. You may skip building an Inverted Index (Term Dictionary) with TF-IDF and directly construct the 6 document vectors from the counts produced by your parsing script. Constructing an Inverted Index (Term Dictionary) with TF-IDF will count as extra credit.

You can use or adapt any online word count program for this lab if you want.
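As a sketch of constructing a document vector directly from a parsing script, the Python example below counts the 7 topic terms in a page's text with regular expressions. Word boundaries (\b) keep substrings of longer words from being counted, re.IGNORECASE makes the count case-insensitive, the alternations play the role of the OR conditions for stem variants, and \s+ requires the second word of a bi-gram to follow the first directly. The variant lists in TOPIC_PATTERNS are illustrative assumptions and not exhaustive. Another assumption is that the single-word counts for "data" and "mining" also include occurrences inside "data mining".

```python
import re

# Assumption: illustrative stem variants per topic, written as regex
# alternatives (the "OR conditions"); extend them as needed.
TOPIC_PATTERNS = {
    "research": r"research(?:es|ed|ing|er|ers)?",
    "data": r"data",
    "mining": r"(?:mining|mined|mines|mine)",
    "analytics": r"analytic(?:s|al)?",
    "data mining": r"data\s+mining",
    "machine learning": r"machine\s+learning",
    "deep learning": r"deep\s+learning",
}

TOPICS = list(TOPIC_PATTERNS)

# \b on both sides keeps "metadata" from being counted as "data".
COMPILED = {
    topic: re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)
    for topic, pattern in TOPIC_PATTERNS.items()
}


def document_vector(text):
    """Return the topic counts for one document, in TOPICS order."""
    return [len(COMPILED[topic].findall(text)) for topic in TOPICS]


sample = (
    "Data mining and DATA Mining use data. Metadata is skipped. "
    "Researchers research machine learning and deep  learning."
)
print(dict(zip(TOPICS, document_vector(sample))))
```

Applying document_vector to the visible text of each of the 6 saved pages would give the 6 rows (or columns) of the document vector table.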

For example, import java.util.StringTokenizer;

You can make any assumptions to simplify the program. Briefly note them in your report.

You can create a Term Dictionary with TF (and DF) in a table format or in MongoDB with the scheme. Your Term Dictionary with TF and DF would look like either one of the tables below.

For an Inverted Index in MongoDB, see the Common NLP Tasks in the Lab2 section.
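For the extra-credit Term Dictionary, one possible in-memory layout is a mapping from each term to its postings (document id and TF), from which DF is simply the number of postings. The Python sketch below builds that structure and derives TF-IDF weights. The function names are hypothetical, and the weighting tf * log(N / df) is an assumed common variant; use the formula from the Lecture Notes if it differs.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}.

    docs is {doc_id: [token, ...]}, e.g. the output of a tokenizer.
    """
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for term, tf in Counter(tokens).items():
            index[term][doc_id] = tf
    return dict(index)


def tf_idf(index, n_docs):
    """Weight each posting by tf * log(n_docs / df).

    Assumption: this plain-log IDF variant; df is the number of
    documents that contain the term.
    """
    weights = {}
    for term, postings in index.items():
        df = len(postings)
        idf = math.log(n_docs / df)
        weights[term] = {d: tf * idf for d, tf in postings.items()}
    return weights


# Toy tokens made up for illustration only.
toy_docs = {
    "doc1": ["data", "mining", "data"],
    "doc2": ["data", "research"],
}
toy_index = build_inverted_index(toy_docs)
toy_weights = tf_idf(toy_index, n_docs=len(toy_docs))
for term in sorted(toy_index):
    print(term, "DF =", len(toy_index[term]), "TF =", toy_index[term])
```

With this variant, a term that occurs in every document (here "data") gets weight 0, which is why IDF down-weights terms that do not help tell documents apart.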


Part 2: Data Transformation for Topic Analysis of Documents (Webpages)

Using the 6 document vectors, construct a cosine similarity matrix like the one below to indicate the cosine similarity between every pair of documents doc1 – doc6. You may want to normalize each document vector first to calculate each cos(di, dj), as in the example in the Lecture Notes on Information Retrieval.

Cosine measure: If d1 and d2 are two vectors (e.g., term-frequency vectors), then

cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||)

where · indicates the vector dot product and ||d|| is the length of vector d.
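The cosine measure translates almost directly into code. The Python sketch below computes the length of a vector, normalizes a vector to unit length, computes the cosine of two vectors, and fills the full pairwise matrix. The function names are hypothetical, and returning 0.0 when either vector is all zeros (where the formula is undefined) is an assumption made to avoid division by zero.

```python
import math


def norm(d):
    """Length ||d|| of a vector."""
    return math.sqrt(sum(x * x for x in d))


def normalize(d):
    """Scale d to unit length; an all-zero vector is returned as is."""
    length = norm(d)
    return list(d) if length == 0 else [x / length for x in d]


def cosine(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)."""
    denom = norm(d1) * norm(d2)
    if denom == 0:
        return 0.0  # Assumption: undefined case reported as 0.
    return sum(a * b for a, b in zip(d1, d2)) / denom


def similarity_matrix(vectors):
    """Cosine similarity between every pair of vectors."""
    return [[cosine(u, v) for v in vectors] for u in vectors]


# Toy 2-term vectors made up for illustration, not lab results.
toy_vectors = [[1, 0], [1, 1], [0, 2]]
for row in similarity_matrix(toy_vectors):
    print(["%.3f" % value for value in row])
```

After normalization both lengths are 1, so the cosine of two normalized vectors reduces to their dot product. The matrix is symmetric, so only the upper or lower triangle needs to be computed in practice.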

        doc1  doc2  doc3  doc4  doc5  doc6
doc1
doc2
doc3
doc4
doc5
doc6

Sample Output: (Note that this output may not be the correct output for this lab)

Part 3: Analysis and Discussion of Problems

Briefly discuss your topic analysis using your cosine similarity matrix, focusing on the following:


Does each cosine similarity value correctly indicate the similarity of the corresponding pair of docs?
Which 2 docs are most similar in terms of the 7 given topics?
Are the topics of Doc6 similar to the topics of Doc4 and Doc5? Explain why or why not in terms of the 7 TFs. If not, what are the reasons?