week 8-data science


Chapter_9.pptx


Data Science and Big Data Analytics

Chap 9: Advanced Analytical Theory and Methods:

Text Analysis

Contents

9.1 Text Analysis Steps

9.2 A Text Analysis Example

9.3 Collecting Raw Text

9.4 Representing Text

9.5 Term Frequency – Inverse Document Frequency

9.6 Categorizing Documents by Topics

9.7 Determining Sentiments

9.8 Gaining Insights

Summary

9. Text Analysis

Text analysis, or text analytics, concerns the representation, processing, and modeling of text data to derive useful insights

Text mining is an important component of text analysis that discovers relationships and interesting patterns in text

Corpus – large collection of texts (plural of corpus is corpora)

Dimension – number of distinct words or base forms in the corpus

The high dimensionality of text is a major issue

Green Eggs and Ham by Dr. Seuss – 804 total words, 50 different words

Most of the time the text is not structured
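The difference between total words and dimension can be sketched in a few lines of Python. This is only an illustration: the two-sentence corpus and the `tokens` helper are invented here, and a real corpus would need more careful tokenization.

```python
import re

# Hypothetical two-document corpus, invented for illustration.
corpus = [
    "The cat sat on the mat",
    "The dog sat on the log",
]

def tokens(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

all_tokens = [tok for doc in corpus for tok in tokens(doc)]

total_words = len(all_tokens)      # every occurrence counts
dimension = len(set(all_tokens))   # each distinct word counts once

print(total_words, dimension)
```

Even in this tiny corpus the two numbers differ, and in real text the dimension grows into the tens of thousands, which is what makes high dimensionality an issue.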

9. Text Analysis: Example Corpora in Natural Language Processing (NLP)

9. Text Analysis: Example Data Sources and Formats for Text Analysis

9.1 Text Analysis Steps

Parsing

Imposes a structure on the unstructured text

Search and Retrieval

Identifies documents in a corpus that contain search items

Specific words, phrases, topics, entities – e.g., people, organizations

These search items are generally called key terms

Text Mining

Discovers meaningful insights in the text

Uses techniques such as clustering and classification
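As a rough sketch of the first two steps, the Python below parses raw text into sets of terms and then retrieves the documents that contain given key terms. The corpus, document ids, and the `parse` and `search` helpers are all hypothetical; production systems typically use an inverted index rather than scanning every document.

```python
# Hypothetical mini-corpus; document ids and texts are invented.
corpus = {
    "doc1": "The phone battery drains quickly",
    "doc2": "Great screen and a fast processor",
    "doc3": "Battery life is poor but the screen is great",
}

def parse(text):
    """Parsing step: impose structure by turning raw text into a set of lowercase terms."""
    return set(text.lower().split())

def search(corpus, key_terms):
    """Search and retrieval step: return ids of documents containing every key term."""
    wanted = {term.lower() for term in key_terms}
    return sorted(doc_id for doc_id, text in corpus.items()
                  if wanted <= parse(text))

print(search(corpus, ["battery"]))
print(search(corpus, ["screen", "great"]))
```

The retrieved documents would then be handed to the text mining step, for example clustering or classification.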

9.1 Text Analysis Steps: Part-of-Speech (POS) Tagging, Lemmatization, and Stemming

Part-of-Speech (POS) Tagging

“he saw a fox” => PRP VBD DT NN

pronoun (personal), verb (past tense), determiner, noun (singular)

Lemmatization finds dictionary base forms

obesity causes many problems => obesity cause many problem

Stemming (e.g., Porter’s stemming algorithm)

Similar to lemmatization, but does not require a dictionary; suffixes are stripped by rule, so the resulting stems need not be real words

obesity causes many problems => obes caus mani problem
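To give a feel for rule-based stemming, here is a minimal sketch of step 1a only of Porter's algorithm (the plural-suffix rules SSES → SS, IES → I, SS → SS, S → nothing). It is deliberately incomplete: the full algorithm has several further steps, which is why its output for the sentence above ("obes caus mani problem") is shorter than what this partial version produces. The function name `porter_step_1a` is an assumption made for this example.

```python
def porter_step_1a(word):
    """Apply only step 1a of Porter's algorithm (plural suffix rules)."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word

sentence = "obesity causes many problems"
stems = [porter_step_1a(w) for w in sentence.split()]
print(" ".join(stems))
```

No dictionary is consulted at any point; the rules look only at the characters at the end of each word.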

9.2 Text Analysis Example

Fictitious company ACME

Makes two products – bPhone and bEbook

ACME monitors social media and popular review sites

Are people mentioning its products?

What is being said?

Are the products seen as good or bad?

If people say an ACME product is bad, why?

For example, are they complaining about the battery life of the bPhone or the response time of the bEbook?

9.2 Text Analysis Example

ACME’s Text Analysis Process – rest of chapter

9.3 Collecting Raw Text

The text data must first be collected

ACME is interested in what the reviews say about bPhone and bEbook and when the reviews are posted

Many websites and services offer public APIs for third-party developers to access their data

For example, Twitter APIs can retrieve public Twitter posts that contain the keywords bPhone or bEbook

9.3 Collecting Raw Text

Example tweet shown in textbook, pages 260-262

Line 02: date created

Lines 22-23:

Lines 40-42:

Lines 59-61:

9.3 Collecting Raw Text

Example RSS feed for phone review blog
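An RSS feed is plain XML, so titles and posting dates can be pulled out with a standard XML parser. The sketch below uses Python's built-in `xml.etree.ElementTree`; the feed content (titles, dates, links) is entirely made up for illustration and is not the feed from the textbook.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet; titles, dates and links are invented.
rss_text = """<rss version="2.0">
  <channel>
    <title>Phone Review Blog</title>
    <item>
      <title>bPhone: first impressions</title>
      <pubDate>Mon, 06 Jan 2014 10:00:00 GMT</pubDate>
      <link>http://example.com/bphone-first-impressions</link>
    </item>
    <item>
      <title>bEbook: a week later</title>
      <pubDate>Tue, 07 Jan 2014 09:30:00 GMT</pubDate>
      <link>http://example.com/bebook-week-later</link>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(rss_text)

# Collect (title, publication date) for every <item> in the feed.
posts = [
    (item.findtext("title"), item.findtext("pubDate"))
    for item in root.iter("item")
]

for title, published in posts:
    print(published, "|", title)
```

This gives ACME both pieces of information it cares about: what was posted and when.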

9.3 Collecting Raw Text

Use a web scraper to extract useful information from web pages

Use the curl tool [ref] to fetch HTML source code given specific URLs

Use XPath [ref] and regular expressions to select and extract data that matches certain patterns

Regular expressions can find words and strings that match particular patterns of interest
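The select-then-match idea can be sketched in Python as follows. The page fragment is invented and assumed to be well-formed XML so that the standard library's `xml.etree.ElementTree` can parse it; that module supports only a subset of XPath, and real-world HTML usually needs a dedicated HTML parser. Fetching the page (the job of curl) is left out so the example runs offline.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for fetched HTML.
html = """<html><body>
  <div class="review"><p>My bPhone battery died after 3 hours.</p></div>
  <div class="ad"><p>Buy a bPhone today!</p></div>
  <div class="review"><p>The bEbook is slow to turn pages.</p></div>
</body></html>"""

root = ET.fromstring(html)

# XPath-style selection: paragraphs inside <div class="review"> only.
reviews = [p.text for p in root.findall(".//div[@class='review']/p")]

# Regular expression: the whole words bPhone or bEbook.
product_pattern = re.compile(r"\bb(?:Phone|Ebook)\b")
mentions = [product_pattern.findall(text) for text in reviews]

for text, found in zip(reviews, mentions):
    print(found, "<-", text)
```

The path expression discards the advertisement, and the regular expression then picks out which product each remaining review mentions.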

9.3 Collecting Raw Text: Example Regular Expressions

9.4 Representing Text

Tokenization – separates words from the text

Case folding – reduces all letters to lowercase

Problems – e.g., WHO (World Health Organization) becomes "who" and can no longer be told apart from the pronoun
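A minimal sketch of both steps in Python, assuming a simple regex tokenizer (real tokenizers handle punctuation, contractions, and hyphens more carefully). The sample sentence and the helper names are invented for illustration.

```python
import re

def tokenize(text):
    """Tokenization: split the text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)

def case_fold(tokens):
    """Case folding: reduce every token to lowercase."""
    return [tok.lower() for tok in tokens]

text = "WHO issued a report. Who read it?"
tokens = tokenize(text)
folded = case_fold(tokens)

print(tokens)
print(folded)
```

After folding, the acronym "WHO" and the pronoun "Who" are both "who", which is exactly the ambiguity noted above.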

9.4 Representing Text

Bag-of-words – represents text as an unordered collection of terms, keeping only how often each term occurs

Widely used but naïve approach that ignores word order

“a dog bites a man” is equivalent to “a man bites a dog”

Despite this simplification, still considered a good approach to start with
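The loss of word order is easy to demonstrate with Python's `collections.Counter`, used here as one possible bag-of-words container (the `bag_of_words` helper is an assumption of this sketch, using plain whitespace tokenization).

```python
from collections import Counter

def bag_of_words(text):
    """Map each lowercase term to the number of times it occurs."""
    return Counter(text.lower().split())

bag1 = bag_of_words("a dog bites a man")
bag2 = bag_of_words("a man bites a dog")

print(bag1)
print(bag1 == bag2)
```

The two sentences produce identical bags even though they mean very different things.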

9.4 Representing Text

Term frequency (TF) – easily calculated from bag-of-words representation

See figure next slide

Roughly follows Zipf’s Law – the frequency of a word is inversely proportional to its rank in the frequency table
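Term frequency and the rank table that Zipf's Law refers to can be sketched as below. The sample sentence is invented and far too short to show the law itself; on a large corpus, the product rank × frequency would be expected to stay roughly constant across ranks.

```python
from collections import Counter

# Hypothetical sample text, invented for illustration.
text = "the cat and the dog and the bird saw the fish"

# Term frequency: count of each term in the bag-of-words.
tf = Counter(text.split())

# Frequency table: terms sorted from most to least frequent.
ranked = tf.most_common()

for rank, (term, freq) in enumerate(ranked, start=1):
    # Under Zipf's Law, rank * freq is roughly constant on large corpora.
    print(rank, term, freq, rank * freq)
```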

9.4 Representing Text: 50 most frequent words in Shakespeare’s Hamlet

9.4 Representing Text

Morphological features – additional info such as POS tags, named entities, etc.

The features are usually designed for a specific task

Creating the features can be a text analysis task in itself

One such example is topic modeling, a method to quickly analyze large volumes of text and identify the topics they cover

Information content (IC) – a metric to denote the importance of a term in a corpus

The next section, 9.5, discusses such a metric

9.4 Representing Text: Categories of the Brown Corpus