week 8-data science

Data Science and Big Data Analytics

Chapter 9: Advanced Analytical Theory and Methods: Text Analysis

Contents

9.1 Text Analysis Steps

9.2 A Text Analysis Example

9.3 Collecting Raw Text

9.4 Representing Text

9.5 Term Frequency – Inverse Document Frequency

9.6 Categorizing Documents by Topics

9.7 Determining Sentiments

9.8 Gaining Insights

Summary

9. Text Analysis

Text analysis, or text analytics, concerns the representation, processing, and modeling of text data to derive useful insights

Text mining is an important component of text analysis that discovers relationships and interesting patterns in text

Corpus – large collection of texts (plural of corpus is corpora)

Dimension – number of distinct words or base forms in corpus

The high dimensionality of text is a major issue

Green Eggs and Ham by Dr. Seuss – 804 total words, 50 different words

Most of the time the text is not structured

9. Text Analysis: Example Corpora in Natural Language Processing (NLP)

9. Text Analysis: Example Data Sources and Formats for Text Analysis

9.1 Text Analysis Steps

Parsing

Imposes a structure on the unstructured text

Search and Retrieval

Identifies documents in a corpus that contain search items

Specific words, phrases, topics, entities – e.g., people, organizations

These search items are generally called key terms

Text Mining

Discovers meaningful insights in the text

Uses techniques such as clustering and classification

9.1 Text Analysis Steps: Part-of-Speech (POS) Tagging, Lemmatization, and Stemming

Part-of-Speech (POS) Tagging

“he saw a fox” => PRP VBD DT NN

pronoun (personal), verb (past tense), determiner, noun (singular)

Lemmatization finds dictionary base forms

obesity causes many problems => obesity cause many problem

Stemming (e.g., Porter’s stemming algorithm)

Similar to lemmatization but dictionary not required

obesity causes many problems => obes caus mani problem
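
The difference between the two approaches can be sketched in a few lines. This is only an illustration: LEMMAS is a tiny hypothetical dictionary, and toy_stem is a made-up suffix stripper that is far simpler than Porter's algorithm, so its output differs from the Porter result shown above (for example, it leaves "obesiti" where Porter produces "obes").

```python
# Hypothetical mini-dictionary; a real lemmatizer relies on a full lexicon.
LEMMAS = {"causes": "cause", "problems": "problem"}


def lemmatize(word):
    """Look the word up in a dictionary of base forms (falls back to the word)."""
    word = word.lower()
    return LEMMAS.get(word, word)


def toy_stem(word):
    """Strip a few suffixes by rule, with no dictionary. NOT Porter's algorithm."""
    word = word.lower()
    if word.endswith("es") and len(word) > 4:
        word = word[:-2]
    elif word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        word = word[:-1]
    if word.endswith("y") and len(word) > 2:
        word = word[:-1] + "i"
    return word


sentence = "obesity causes many problems".split()
lemmas = [lemmatize(w) for w in sentence]
stems = [toy_stem(w) for w in sentence]
print(lemmas)  # ['obesity', 'cause', 'many', 'problem']
print(stems)   # ['obesiti', 'caus', 'mani', 'problem']
```

Lemmas are always dictionary words, while stems such as "caus" or "mani" need not be.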

9.2 Text Analysis Example

Fictitious company ACME

Makes two products – bPhone and bEbook

ACME monitors social media and popular review sites

Are people mentioning its products?

What is being said?

Are the products seen as good or bad?

If people say ACME product is bad, why?

For example, are they complaining about battery life of the bPhone or response time in their bEbook?

9.2 Text Analysis Example

ACME’s Text Analysis Process – rest of chapter

9.3 Collecting Raw Text

The text data must first be collected

ACME is interested in what the reviews say about bPhone and bEbook and when the reviews are posted

Many websites and services offer public APIs for third-party developers to access their data

For example, Twitter APIs can retrieve public Twitter posts that contain the keywords bPhone or bEbook

9.3 Collecting Raw Text

Example tweet shown in textbook, pages 260-262

Line 02: date created

Lines 22-23:

Lines 40-42:

Lines 59-61:

9.3 Collecting Raw Text

Example RSS feed for phone review blog

9.3 Collecting Raw Text

Use web scraper to extract useful web info

Use the curl tool [ref] to fetch HTML source code given specific URLs

Use XPath [ref] and regular expressions to select and extract data that matches certain patterns

Regular expressions can find words and strings that match particular patterns of interest
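
As a minimal sketch of this idea, the snippet below uses Python's standard re module to find product mentions. The sample snippets and the pattern are made up for illustration and are not taken from the textbook.

```python
import re

# Hypothetical snippets of scraped review text (made up for illustration).
snippets = [
    "Just got my new bPhone and the battery is great!",
    "My BEBOOK freezes when I turn pages.",
    "Nothing about ACME products here.",
]

# \b marks a word boundary; b(?:Phone|Ebook) matches either product name.
# re.IGNORECASE also catches variants such as "bphone" or "BEBOOK".
product_pattern = re.compile(r"\bb(?:Phone|Ebook)\b", re.IGNORECASE)

mentions = [product_pattern.findall(text) for text in snippets]
print(mentions)  # [['bPhone'], ['BEBOOK'], []]
```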

9.3 Collecting Raw Text: Example Regular Expressions

9.4 Representing Text

Tokenization – separates words from the text

Case folding – reduces all letters to lowercase

Problems – e.g., WHO = World Health Organization
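
A minimal sketch of both steps, assuming a simple regular-expression tokenizer (real tokenizers handle many more cases). The example sentence is made up to show how case folding makes the acronym WHO indistinguishable from the word "who".

```python
import re


def tokenize(text):
    """Split text into word tokens, keeping internal apostrophes (e.g. don't)."""
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?", text)


def case_fold(tokens):
    """Reduce every token to lowercase."""
    return [t.lower() for t in tokens]


text = "The WHO asked who is responsible."
tokens = tokenize(text)
folded = case_fold(tokens)
print(tokens)  # ['The', 'WHO', 'asked', 'who', 'is', 'responsible']
print(folded)  # ['the', 'who', 'asked', 'who', 'is', 'responsible']
```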

9.4 Representing Text

Bag-of-words – represents text as an unordered collection (bag) of terms, keeping counts but not positions

Widely-used but naïve approach that eliminates order

“a dog bites a man” is equivalent to “a man bites a dog”

Still considered a good approach
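
The loss of word order can be sketched with Python's collections.Counter. Whitespace splitting stands in for a real tokenizer here, which is an assumption made to keep the example short.

```python
from collections import Counter


def bag_of_words(text):
    """Count each term; positions and order are discarded."""
    return Counter(text.lower().split())


bag_a = bag_of_words("a dog bites a man")
bag_b = bag_of_words("a man bites a dog")

print(bag_a["a"])      # 2
print(bag_a == bag_b)  # True: the two sentences have identical bags
```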

9.4 Representing Text

Term frequency (TF) – easily calculated from bag-of-words representation

See figure next slide

Roughly follows Zipf’s Law – the frequency of a word is inversely proportional to its rank in the frequency table

9.4 Representing Text: 50 most frequent words in Shakespeare’s Hamlet

9.4 Representing Text

Morphological features – additional info such as POS tags, named entities, etc.

The features are usually designed for a specific task

Creating the features can be a text analysis task in itself

One such example is topic modeling, a method for quickly analyzing large volumes of text to identify the topics they cover

Information content (IC) – a metric to denote the importance of a term in a corpus

The next section, 9.5, discusses such a metric

9.4 Representing Text: Categories of the Brown Corpus

9.5 Term Frequency – Inverse Document Frequency (TFIDF)

TFIDF is a widely used measure in text analysis

Robust and efficient

9.5 Term Frequency – Inverse Document Frequency (TFIDF)

Other common Term Frequency measures

Log function

Normalized by the length of text document
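
These variants can be sketched as follows. The slide does not show the formulas, so the exact forms used here are assumptions: the log variant is taken as log(TF + 1) and the normalized variant as TF divided by the number of tokens in the document.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Return raw, log-scaled, and length-normalized TF for each term."""
    raw = Counter(tokens)
    n = len(tokens)  # document length in tokens
    return {
        term: {
            "raw": count,
            "log": math.log(count + 1),  # assumed form: log(TF + 1)
            "normalized": count / n,     # assumed form: TF / document length
        }
        for term, count in raw.items()
    }


tf = term_frequencies("a dog bites a man".split())
print(tf["a"])
```

The log form dampens the effect of very frequent terms, while the normalized form makes long and short documents comparable.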

9.5 Term Frequency – Inverse Document Frequency (TFIDF)

Term frequency highlights common words

Eliminate stop words, such as the, a, of, and

To further address this problem, consider the following metrics

Document frequency (DF) = the number of documents in the corpus that contain the term

9.5 Term Frequency – Inverse Document Frequency (TFIDF)

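Inverse document frequency (IDF) is obtained by dividing N, the number of documents in the corpus, by the document frequency

In log form: IDF(t) = log(N / DF(t))

Or, to avoid division by zero when a term appears in no document: IDF(t) = log(N / (DF(t) + 1))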
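
Both IDF variants can be sketched on a toy corpus, assuming the forms log(N / DF) and log(N / (DF + 1)). The three tokenized documents below are made up for illustration.

```python
import math

# Hypothetical tokenized corpus of N = 3 documents.
docs = [
    ["bphone", "battery", "great"],
    ["bphone", "screen", "cracked"],
    ["bebook", "battery", "slow"],
]


def document_frequency(term, documents):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in documents if term in doc)


def idf(term, documents):
    """log(N / DF); raises ZeroDivisionError if the term is in no document."""
    return math.log(len(documents) / document_frequency(term, documents))


def idf_smoothed(term, documents):
    """log(N / (DF + 1)); defined even when DF is zero."""
    return math.log(len(documents) / (document_frequency(term, documents) + 1))


print(idf("bphone", docs))           # DF = 2, so log(3 / 2)
print(idf("screen", docs))           # DF = 1, so log(3 / 1)
print(idf_smoothed("tablet", docs))  # DF = 0, so log(3 / 1)
```

The rarer term "screen" gets a higher IDF than "bphone", and only the smoothed form can score a term such as "tablet" that never occurs.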

9.5 Term Frequency: Brown Corpus news category (TF, DF, IDF)

9.5 Term Frequency – Inverse Document Frequency (TFIDF)

Words with high IDF tend to be more meaningful over the entire corpus

There is still a problem with IDF

Because the document count in a corpus (N) remains constant, IDF depends solely on DF

For example, sunbonnet and narcotic have the same DF in the previous figure, so their IDF values are identical

9.5 Term Frequency – Inverse Document Frequency (TFIDF)

TFIDF (or TF-IDF) involves both TF and IDF

TFIDF scores a word higher when it appears more often in a document but less often across all documents

TFIDF applies to a term in a specific document, so it gets different scores in different documents

TFIDF reveals little about inter- or intra-document structure
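
A minimal sketch of the combined score, assuming TFIDF(t, d) = TF(t, d) × IDF(t) with raw counts for TF and log(N / DF(t)) for IDF. The three tokenized reviews are hypothetical.

```python
import math
from collections import Counter

# Hypothetical tokenized reviews.
docs = {
    "review_1": "battery battery great screen".split(),
    "review_2": "battery slow".split(),
    "review_3": "screen cracked".split(),
}


def tfidf(term, doc_id, documents):
    """Raw term count in one document times log(N / DF) over the corpus."""
    tf = Counter(documents[doc_id])[term]
    df = sum(1 for tokens in documents.values() if term in tokens)
    if df == 0:
        return 0.0  # term never occurs in the corpus
    return tf * math.log(len(documents) / df)


print(tfidf("battery", "review_1", docs))  # 2 * log(3 / 2)
print(tfidf("battery", "review_2", docs))  # 1 * log(3 / 2)
print(tfidf("cracked", "review_3", docs))  # 1 * log(3 / 1)
```

The same term, "battery", scores differently in review_1 and review_2 because its term frequency differs between them.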

9.6 Categorizing Documents by Topics

Returning to the ACME example, the team wants to categorize the reviews by topic

Topic modeling – prevalent statistical approach

Uncovers hidden topical patterns within a corpus

Annotates documents according to these topics

Uses annotations to organize, search, and summarize texts

A topic is formally defined as a distribution over a fixed vocabulary of words

9.6 Categorizing Documents by Topics

Latent Dirichlet allocation (LDA) topic model

Simple generative probabilistic model of a corpus

Data treated as a result of a generative process that includes hidden variables

Assumes fixed vocabulary of words

Assumes constant predefined number of topics
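
The generative view can be sketched with the standard library alone. This is a simplified illustration, not a way to fit LDA: the vocabulary and the three topic distributions are fixed by hand (in the full model the topics are themselves drawn from a Dirichlet prior), and all names and numbers are hypothetical.

```python
import random

# Fixed vocabulary and a predefined number of topics (here 3), both assumed.
VOCAB = ["battery", "charge", "screen", "pixel", "price", "cheap"]
TOPICS = [
    [0.45, 0.45, 0.025, 0.025, 0.025, 0.025],  # a "power" topic
    [0.025, 0.025, 0.45, 0.45, 0.025, 0.025],  # a "display" topic
    [0.025, 0.025, 0.025, 0.025, 0.45, 0.45],  # a "cost" topic
]


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(length, alpha, rng):
    """Generate one document: topic mixture first, then a topic and a word per position."""
    theta = sample_dirichlet(alpha, rng)  # hidden per-document topic proportions
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(TOPICS)), weights=theta)[0]  # hidden topic
        words.append(rng.choices(VOCAB, weights=TOPICS[topic])[0])  # observed word
    return theta, words


rng = random.Random(0)
theta, words = generate_document(10, [0.5, 0.5, 0.5], rng)
print(theta)
print(words)
```

Only the words are observed in practice. Fitting an LDA model means inferring the hidden topic proportions and topic assignments that most plausibly produced them.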

9.6 Categorizing Documents by Topics: Figure illustrating intuitions behind LDA

9.6 Categorizing Documents by Topics: Distribution of ten topics over nine documents

9.7 Determining Sentiments

Sentiment analysis is a group of tasks that use statistics and NLP to mine opinions from texts

Make lists of positive and negative words

Positive – brilliant, awesome, spectacular

Negative – awful, stupid, hideous

This simple approach achieves about 60% accuracy

Machine learning classifiers: naïve Bayes, maximum entropy (MaxEnt), and support vector machines (SVM)

Can achieve about 80% accuracy
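
The word-list approach can be sketched as below, using the example words from this slide. The scoring rule (count of positive words minus count of negative words) and the sample sentences are assumptions made for illustration.

```python
POSITIVE = {"brilliant", "awesome", "spectacular"}
NEGATIVE = {"awful", "stupid", "hideous"}


def lexicon_sentiment(text):
    """Label text by counting positive and negative list words."""
    tokens = [t.strip(".,!?;:").lower() for t in text.split()]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("The bPhone camera is awesome!"))  # positive
print(lexicon_sentiment("Awful battery, hideous case."))   # negative
print(lexicon_sentiment("The camera is not awesome."))     # positive (wrong)
```

The last call shows one reason this approach has limited accuracy: word lists ignore negation and context, which trained classifiers can partly capture.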

9.7 Determining Sentiments: Evaluation of prediction models

Data usually split into training and testing sets

Supervised learning – labeled data

Confusion matrix of naïve Bayes example

Performance measures: precision, recall, etc.
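
The measures can be computed directly from the confusion matrix counts. In this sketch the actual and predicted labels are made up and do not come from the textbook's naïve Bayes example.

```python
def confusion_counts(actual, predicted, positive="pos"):
    """Return (TP, FP, FN, TN) for a binary classification task."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


# Hypothetical test-set labels and classifier predictions.
actual = ["pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]
predicted = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos"]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
accuracy = (tp + tn) / len(actual)
print(tp, fp, fn, tn)                # 3 1 1 3
print(precision, recall, accuracy)   # 0.75 0.75 0.75
```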

9.7 Determining Sentiments

Tweet demo – http://www.sentiment140.com/

9.7 Determining Sentiments: Tweet sentiment analysis for Boston weather

9.7 Determining Sentiments

Emoticons can make it easy and fast to detect sentiment but this method can be misleading

E.g., the text below with :) emoticon does not necessarily correspond to a positive sentiment

9.7 Determining Sentiments: Amazon Mechanical Turk (MTurk)

To address problems mentioned above, Amazon Mechanical Turk (MTurk) can be used

It is a crowdsourcing Internet marketplace that enables individuals or businesses to coordinate the use of human intelligence to perform tasks difficult for computers

Workers on MTurk perform Human Intelligence Tasks (HITs)

For example, for the tweets illustrative example, human workers can be asked to tag each tweet as positive, neutral, or negative

9.8 Gaining Insights

Returning to the ACME example used in this chapter, this section shows how various techniques can be used to gain insights into customer opinions

For simplicity, only the bPhone product is used here

The ACME data science team collects 300 reviews

Using the keyword bPhone

After tokenization, removing stop words, and case folding to lowercase, the 300 reviews are visualized as a word cloud with more frequently appearing words in larger font size

9.8 Gaining Insights: Word cloud on all 300 bPhone reviews

Domain-specific stop words that are not useful for the study are often removed as well.

In this case, remove words like phone, bPhone, and ACME.
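
The counts behind such a word cloud can be sketched as follows. The general stop-word list is a short hypothetical one and the three reviews are made up; a rendering library would then map each count to a font size.

```python
import re
from collections import Counter

# Short hypothetical stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "of", "and", "is", "my", "on", "it", "this"}
DOMAIN_STOP_WORDS = {"phone", "bphone", "acme"}


def word_cloud_counts(reviews, extra_stop_words=frozenset()):
    """Case-fold, tokenize, drop stop words, and count the remaining terms."""
    stop = STOP_WORDS | set(extra_stop_words)
    counts = Counter()
    for review in reviews:
        tokens = re.findall(r"[a-z]+", review.lower())
        counts.update(t for t in tokens if t not in stop)
    return counts


reviews = [
    "The bPhone battery is great",
    "My ACME phone battery died and the screen cracked",
    "Great screen on this bPhone",
]

general_only = word_cloud_counts(reviews)
counts = word_cloud_counts(reviews, DOMAIN_STOP_WORDS)
print(general_only["bphone"])  # 2: the product name would dominate the cloud
print(counts)
```

With the domain-specific words removed, terms such as battery and screen carry the weight instead of the product name.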

9.8 Gaining Insights: Word cloud on 50 five-star reviews

9.8 Gaining Insights: Word cloud on 70 one-star reviews

Note the words sim, button, stolen, and venezuela. Further investigation revealed that unauthorized sellers in Venezuela sold stolen bPhones.

9.8 Gaining Insights: Reviews highlighted by TFIDF values

9.8 Gaining Insights: LDA model with ten topics on five-star reviews

9.8 Gaining Insights: LDA model with ten topics on one-star reviews

9.8 Gaining Insights: Five topics for five-star (left) and one-star (right) reviews

9.8 Gaining Insights: Sentiment analysis on over 100 tweets

Indicates most customers satisfied with ACME’s bPhone.

Summary

Chapter discussed the subtasks of text analysis:

Parsing

Search and retrieval

Text mining

ACME example used to review the text analysis process

Collecting raw data

Representing text

Using TFIDF to compute the usefulness of each word in the text

Categorizing documents by topics using topic modeling

Sentiment analysis

Gaining greater insights