Business_Intelligence_week5


Chapter 7:

Text Analytics, Text Mining, and Sentiment Analysis

Business Intelligence and Analytics: Systems for Decision Support

(10th Edition)


Copyright © 2014 Pearson Education, Inc.


Learning Objectives

Describe text mining and understand the need for text mining

Differentiate between text mining, Web mining, and data mining

Understand the different application areas for text mining

Know the process of carrying out a text mining project

Understand the different methods to introduce structure to text-based data

(Continued…)


Learning Objectives

Describe sentiment analysis

Develop familiarity with popular applications of sentiment analysis

Learn the common methods for sentiment analysis

Become familiar with speech analytics as it relates to sentiment analysis


Opening Vignette…

Machine Versus Men on Jeopardy!: The Story of Watson

Situation

Problem

Solution

Results

Answer & discuss the case questions...

Watch it on YouTube!

https://www.youtube.com/watch?v=YLR1byL0U8M


Questions for the Opening Vignette

What is Watson? What is special about it?

What technologies were used in building Watson (both hardware and software)?

What are the innovative characteristics of DeepQA architecture that made Watson superior?

Why did IBM spend all that time and money to build Watson? Where is the ROI?


A High-Level Depiction of IBM Watson’s DeepQA Architecture


Text Mining Concepts

85-90 percent of all corporate data is in some kind of unstructured form (e.g., text)

Unstructured corporate data is doubling in size every 18 months

Tapping into these information sources is not optional; it is a necessity for staying competitive

Answer: text mining

A semi-automated process of extracting knowledge from unstructured data sources

a.k.a. text data mining or knowledge discovery in textual databases


Text Analytics and Text Mining


Data Mining versus Text Mining

Both seek novel and useful patterns

Both are semi-automated processes

Difference is the nature of the data:

Structured versus unstructured data

Structured data: in databases

Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on

Text mining: first impose structure on the data, then mine the structured data.


Text Mining Concepts

Benefits of text mining are obvious, especially in text-rich data environments

e.g., law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), marketing (customer comments), etc.

Electronic communication records (e.g., Email)

Spam filtering

Email prioritization and categorization

Automatic response generation


Text Mining Application Areas

Information extraction

Topic tracking

Summarization

Categorization

Clustering

Concept linking

Question answering


Text Mining Terminology

Unstructured or semi-structured data

Corpus (and corpora)

Terms

Concepts

Stemming

Stop words (and include words)

Synonyms (and polysemes)

Tokenizing
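Example (sketch): tokenizing, stop-word removal, and stemming can be chained as below. The stop-word list and the suffix rules are illustrative assumptions; `crude_stem` is a hypothetical helper, not a published stemming algorithm.

```python
import re

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def tokenize(text):
    """Split text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common English suffixes (a toy rule set, for illustration only)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem each remaining token."""
    return [crude_stem(t) for t in remove_stop_words(tokenize(text))]


print(preprocess("The reports are reported in reporting"))
```

All three surface forms reduce to the same stem, which is the point of stemming: related word forms are counted as one term.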


Text Mining Terminology

Term dictionary

Word frequency

Part-of-speech tagging

Morphology

Term-by-document matrix

Occurrence matrix

Singular value decomposition

Latent semantic indexing
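Example (sketch): a term dictionary and an occurrence matrix can be built from word frequencies as below. Rows are documents and columns are terms; whitespace tokenization and the function name `build_tdm` are assumptions made for brevity.

```python
from collections import Counter


def build_tdm(documents):
    """Return (terms, matrix), where matrix[i][j] counts terms[j] in documents[i]."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    # The term dictionary is the sorted set of all distinct terms in the corpus.
    terms = sorted(set().union(*counts))
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix


terms, matrix = build_tdm(["risk project risk", "project software"])
print(terms)
print(matrix)
```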


Application Case 7.1 Text Mining for Patent Analysis

What is a patent?

“exclusive rights granted by a country to an inventor for a limited period of time in exchange for a disclosure of an invention”

How do we do patent analysis (PA)?

Why do we need to do PA?

What are the benefits?

What are the challenges?

How does text mining help in PA?


Natural Language Processing (NLP)

Structuring a collection of text

Old approach: bag-of-words

New approach: natural language processing

NLP is …

a very important concept in text mining

a subfield of artificial intelligence and computational linguistics

the study of "understanding" natural human language

Syntax versus semantics-based text mining
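Example (sketch): the limitation of the bag-of-words approach is that word order is discarded, so two sentences with opposite meanings can receive the same representation. The sentences below are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Represent text as unordered word counts; word order is lost."""
    return Counter(text.lower().split())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

# Both sentences contain the same words the same number of times.
same_bag = a == b
print(same_bag)
```

Recovering who did what to whom requires syntactic or semantic analysis, which is where NLP goes beyond bag-of-words.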


Natural Language Processing (NLP)

What is “understanding”?

Humans understand; what about computers?

Natural language is vague and context-driven

True understanding requires extensive knowledge of a topic

Can (or will) computers ever understand natural language as accurately as we do?


Natural Language Processing (NLP)

Challenges in NLP

Part-of-speech tagging

Text segmentation

Word sense disambiguation

Syntax ambiguity

Imperfect or irregular input

Speech acts

Dream of AI community

to have algorithms that are capable of automatically reading and obtaining knowledge from text


Natural Language Processing (NLP)

WordNet

A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets.

A major resource for NLP.

Needs automation to be completed.

Sentiment Analysis

A technique used to detect favorable and unfavorable opinions toward specific products and services

SentiWordNet


Application Case 7.2

Text Mining Improves Hong Kong Government’s Ability to Anticipate and Address Public Complaints

Questions for Discussion

How did the Hong Kong government use text mining to better serve its constituents?

What were the challenges, the proposed solution, and the obtained results?


NLP Task Categories

Information retrieval, information extraction

Named-entity recognition

Question answering

Automatic summarization

Natural language generation & understanding

Machine translation

Foreign language reading & writing

Speech recognition

Text proofing, optical character recognition


Text Mining Applications

Marketing applications

Enables better CRM

Security applications

ECHELON, OASIS

Deception detection (…)

Medicine and biology

Literature-based gene identification (…)

Academic applications

Research stream analysis


Application Case 7.3

Mining for Lies!

Deception detection

A difficult problem

If detection is limited to only text, then the problem is even more difficult

The study

analyzed text-based testimonies of persons of interest at military bases

used only text-based features (cues)


Application Case 7.3 Mining for Lies


Application Case 7.3 Mining for Lies


Application Case 7.3 Mining for Lies

371 usable statements are generated

31 features are used

Different feature selection methods used

10-fold cross validation is used

Results (overall % accuracy)

Logistic regression 67.28

Decision trees 71.60

Neural networks 73.46
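Example (sketch): 10-fold cross validation splits the statements into ten folds, and each fold serves once as the test set while the other nine are used for training. The helper `k_fold_indices` is hypothetical; the study's actual tooling is not stated on the slide, and in practice the items are usually shuffled before splitting.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) pairs; every item is tested exactly once."""
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


# 371 usable statements, as in the case.
folds = list(k_fold_indices(371, k=10))
print([len(test) for _, test in folds])
```

The overall accuracy is then the share of correctly classified test items pooled across the ten folds.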


Text Mining Applications (Gene/Protein Interaction Identification)


Application Case 7.4

Text Mining and Sentiment Analysis Help Improve Customer Service Performance

Questions for Discussion

How did the financial services firm use text mining and text analytics to improve its customer service performance?

What were the challenges, the proposed solution, and the obtained results?


Text Mining Process

Context diagram for the text mining process


Text Mining Process

The three-step text mining process


Text Mining Process

Step 1: Establish the corpus

Collect all relevant unstructured data (e.g., textual documents, XML files, emails, Web pages, short notes, voice recordings…)

Digitize, standardize the collection (e.g., all in ASCII text files)

Place the collection in a common place (e.g., in a flat file, or in a directory as separate files)
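Example (sketch): once the documents are standardized as text files in one directory, the corpus can be loaded as below. The `.txt` extension, UTF-8 encoding, and the function name `load_corpus` are assumptions for illustration.

```python
from pathlib import Path


def load_corpus(directory):
    """Read every .txt file in `directory` into a {filename: text} dictionary."""
    corpus = {}
    for path in sorted(Path(directory).glob("*.txt")):
        # Replace undecodable bytes rather than failing on one bad file.
        corpus[path.name] = path.read_text(encoding="utf-8", errors="replace")
    return corpus
```

Usage: `corpus = load_corpus("my_corpus_dir")`, where the directory name is a placeholder for wherever the collection was placed.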


Text Mining Process

Step 2: Create the Term-by-Document Matrix (TDM)


Text Mining Process

Step 2: Create the Term-by-Document Matrix (TDM)

Should all terms be included?

Stop words, include words

Synonyms, homonyms

Stemming

What is the best representation of the indices (values in cells)?

Raw counts; binary frequencies; log frequencies;

Inverse document frequency
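Example (sketch): the cell representations listed above can be computed from a raw count `tf`. The formulas are common textbook forms and are an assumption here, since the slide does not give them: log frequency as 1 + log(tf), and the inverse-document-frequency weight as (1 + log(tf)) * log(N / df), where N is the number of documents and df is the number of documents containing the term.

```python
import math


def binary_weight(tf):
    """1 if the term occurs in the document at all, else 0."""
    return 1 if tf > 0 else 0


def log_weight(tf):
    """Dampen large counts: 1 + ln(tf) for tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0


def idf_weight(tf, n_docs, doc_freq):
    """Log frequency scaled by how rare the term is across the corpus."""
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * math.log(n_docs / doc_freq)


# A term that appears in every document carries no discriminating weight.
print(idf_weight(tf=5, n_docs=10, doc_freq=10))
```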


Text Mining Process

Step 2: Create the Term-by-Document Matrix (TDM)

TDM is a sparse matrix. How can we reduce the dimensionality of the TDM?

Manual - a domain expert goes through it

Eliminate terms with very few occurrences in very few documents (?)

Transform the matrix using singular value decomposition (SVD)

SVD is similar to principal component analysis
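Example (sketch): the simplest of the reduction options, eliminating terms that occur in very few documents, can be written as below. The threshold `min_docs` and the function name are assumptions; rows are documents and columns are terms.

```python
def prune_rare_terms(terms, matrix, min_docs=2):
    """Drop term columns that appear in fewer than `min_docs` documents."""
    keep = [
        j
        for j in range(len(terms))
        if sum(1 for row in matrix if row[j] > 0) >= min_docs
    ]
    kept_terms = [terms[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_terms, kept_matrix


terms = ["a", "b", "c"]
matrix = [[1, 0, 2], [1, 1, 0], [0, 0, 3]]
print(prune_rare_terms(terms, matrix, min_docs=2))
```

Unlike SVD, this keeps the remaining columns interpretable as individual terms, at the cost of discarding rare but possibly informative words.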


Text Mining Process

Step 3: Extract patterns/knowledge

Classification (text categorization)

Clustering (natural groupings of text)

Improve search recall

Improve search precision

Scatter/gather

Query-specific clustering

Association

Trend Analysis (…)
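Example (sketch): search precision and recall, which clustering is meant to improve, are computed from the set of retrieved documents and the set of relevant documents. The document IDs below are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: share of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(p, r)
```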


Application Case 7.5 (Research Literature Survey with Text Mining)

Mining the published IS literature

MIS Quarterly (MISQ)

Journal of MIS (JMIS)

Information Systems Research (ISR)

Covers 12-year period (1994-2005)

901 papers are included in the study

Only the paper abstracts are used

9 clusters are generated for further analysis


Application Case 7.5 (Research Literature Survey with Text Mining)


Application Case 7.5 (Research Literature Survey with Text Mining)


Application Case 7.5 (Research Literature Survey with Text Mining)


Text Mining Tools

Commercial Software Tools

IBM SPSS Modeler - Text Miner

SAS Enterprise Miner – Text Miner

Statistica Data Miner – Text Miner

ClearForest, …

Free Software Tools

RapidMiner

GATE

Spy-EM, …


Application Case 7.6

A Potpourri of Text Mining Case Synopses

Alberta’s Parks Division gains insight from unstructured data

American Honda Saves Millions by Using Text and Data Mining

Maspex Wadowice Group Analyzes Online Brand Image with Text Mining

Viseca Card Services Reduces Fraud Loss with Text Analytics

Improving Quality with Text Mining and Advanced Analytics


Sentiment Analysis Overview

Sentiment: belief, view, opinion, conviction

Sentiment analysis: also known as opinion mining, subjectivity analysis, and appraisal extraction

The goal is to answer the question:

“What do people feel about a certain topic?”

Explicit versus Implicit sentiment

Sentiment polarity

Positive versus Negative

… versus Neutral?


Example – Real-Time Social Signal (by Attensity)


Application Case 7.7

Whirlpool Achieves Customer Loyalty and Product Success with Text Analytics

Questions for Discussion

How did Whirlpool use capabilities of text analytics to better understand their customers and improve product offerings?

What were the challenges, the proposed solution, and the obtained results?


Sentiment Analysis Applications

Voice of the Customer (VOC)

Voice of the Market (VOM)

Voice of the Employee (VOE)

Brand Management

Financial Markets

Politics

Government Intelligence

… others


Sentiment Analysis Process


Sentiment Analysis Process

Step 1 – Sentiment Detection

Comes right after the retrieval and preparation of the text documents

It is also called detection of objectivity

Fact [= objectivity] versus Opinion [= subjectivity]

Step 2 – N-P Polarity Classification

Given an opinionated piece of text, the goal is to classify the opinion as falling under one of two opposing sentiment polarities

N [= negative] versus P [= positive]
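Example (sketch): Steps 1 and 2 can be illustrated with a tiny hand-made lexicon. A statement with no opinion words is treated as objective (Step 1); otherwise its polarity is decided by counting positive and negative words (Step 2). The word lists are illustrative assumptions, not taken from WordNet or SentiWordNet.

```python
import re

# Toy lexicons (assumptions for illustration only).
POSITIVE = {"good", "great", "love", "excellent"}
NEGATIVE = {"bad", "poor", "hate", "terrible"}


def classify(statement):
    """Return 'objective', 'positive', 'negative', or 'mixed'."""
    words = re.findall(r"[a-z']+", statement.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos == 0 and neg == 0:
        return "objective"  # Step 1: no opinion words detected
    if pos > neg:
        return "positive"  # Step 2: P outweighs N
    if neg > pos:
        return "negative"  # Step 2: N outweighs P
    return "mixed"


print(classify("I love this great phone."))
print(classify("The phone weighs 150 grams."))
```

A word-count rule like this ignores negation and sarcasm ("not good"), which is one reason real systems use richer lexicons or trained models.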


Sentiment Analysis Process

Step 3 – Target Identification

The goal of this step is to accurately identify the target of the expressed sentiment (e.g., a person, a product, an event, etc.)

Level of difficulty depends on the application domain

Step 4 – Collection and Aggregation

Once the sentiments of all text data points in the document are identified and calculated, they are to be aggregated

Word → Statement → Paragraph → Document
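Example (sketch): Step 4 rolls lower-level scores up to higher levels. Here statement scores in [-1, 1] are averaged into paragraph scores, which are averaged into a document score. Simple averaging is an assumption; the slide does not specify the aggregation rule.

```python
def aggregate(scores):
    """Average a list of polarity scores; an empty list scores 0.0."""
    return sum(scores) / len(scores) if scores else 0.0


def document_polarity(paragraphs):
    """paragraphs: a list of paragraphs, each a list of statement scores."""
    return aggregate([aggregate(p) for p in paragraphs])


# Two paragraphs: one balanced, one mildly positive.
print(document_polarity([[1.0, -1.0], [0.5, 0.5]]))
```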


Sentiment Analysis Methods for Polarity Identification

Polarity Identification – P vs. N

Can be made at the level of word, term, sentence, paragraph, document

Two competing methods

Using a lexicon

WordNet [wordnet.princeton.edu]

SentiWordNet [sentiwordnet.isti.cnr.it]

Using pre-classified training documents

Data mining / machine learning
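Example (sketch): the second method learns polarity from pre-classified training documents. Below is a minimal multinomial Naive Bayes classifier with add-one smoothing. Naive Bayes is one possible learner chosen for illustration; the slide does not name a specific algorithm, and the training texts are made up.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """labeled_docs: list of (text, label) pairs. Returns the counts the model needs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in labeled_docs:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability for `text`."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # class prior
        for w in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a class.
            score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train([
    ("great product love it", "P"),
    ("excellent service", "P"),
    ("terrible product hate it", "N"),
    ("poor service", "N"),
])
print(predict(model, "love excellent"))
```

Compared with the lexicon method, this needs labeled examples but adapts to the vocabulary of the application domain.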


P-N Polarity and S-O Polarity


Sentiment Analysis and Speech Analytics

Speech analytics – analysis of voice

Content versus other Voice Features

Two Approaches

The Acoustic Approach

Intensity, Pitch, Jitter, Shimmer, etc.

The Linguistic Approach

Lexical: words, phrases, etc.

Disfluencies: filled pauses, hesitation, restarts, etc.

Higher semantics: taxonomy/ontology, pragmatics

Many uses and use cases exist
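Example (sketch): one linguistic cue, disfluency, can be quantified from a call transcript as the share of words that are filled pauses. The filler list and the function name are assumptions; acoustic features such as pitch or jitter require the audio signal and are not covered here.

```python
import re

# Illustrative filled-pause list (an assumption).
FILLED_PAUSES = {"um", "uh", "er", "hmm"}


def disfluency_rate(transcript):
    """Return the fraction of words in the transcript that are filled pauses."""
    words = re.findall(r"[a-z']+", transcript.lower())
    if not words:
        return 0.0
    return sum(w in FILLED_PAUSES for w in words) / len(words)


print(disfluency_rate("Um, I uh want to cancel"))
```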


Application Case 7.8

Cutting Through the Confusion: Blue Cross Blue Shield of North Carolina Uses Nexidia’s Speech Analytics to Ease Member Experience in Healthcare

Questions for Discussion

For a large company like BCBSNC with a lot of customers, what does “listening to customer” mean?

What were the challenges, the proposed solution, and the obtained results for BCBSNC?


End of the Chapter

Questions, comments


All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America.


Figure: A High-Level Depiction of IBM Watson’s DeepQA Architecture

Stages shown: Question; Question analysis; Query decomposition; Hypothesis generation; Soft filtering; Hypothesis and evidence scoring; Synthesis; Final merging and ranking; Answer and confidence

Sub-steps shown: Primary search; Candidate answer generation; Support evidence retrieval; Deep evidence scoring

Resources shown: Answer sources; Evidence sources; Trained models

Hypothesis generation, soft filtering, and hypothesis and evidence scoring are drawn more than once, in parallel.

Figure: Text Analytics and Text Mining

Labels shown: TEXT ANALYTICS; Text Mining; Information Retrieval; Information Extraction; Web Mining; Data Mining; Natural Language Processing; Statistics; Computer Science; Artificial Intelligence; Machine Learning; Management Science; Linguistics

Figure (Application Case 7.3 Mining for Lies): process boxes

Statements Transcribed for Processing

Text Processing Software Identified Cues in Statements

Statements Labeled as Truthful or Deceptive by Law Enforcement

Text Processing Software Generated Quantified Cues

Classification Models Trained and Tested on Quantified Cues

Cues Extracted & Selected

Table (Application Case 7.3 Mining for Lies): cue categories

Category        Example Cues
Quantity        Verb count, noun-phrase count, ...
Complexity      Avg. no. of clauses, sentence length, ...
Uncertainty     Modifiers, modal verbs, ...
Nonimmediacy    Passive voice, objectification, ...
Expressivity    Emotiveness
Diversity       Lexical diversity, redundancy, ...
Informality     Typographical error ratio
Specificity     Spatiotemporal, perceptual information, ...
Affect          Positive affect, negative affect, etc.

Figure: Gene/Protein Interaction Identification

Example sentence: "...expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53."

Annotation levels shown: Gene/Protein; Ontology; Word; POS; Shallow Parse

Ontology identifiers: D007962; D016923; D001773; D019254; D044465; D001769; D002477; D003643; D016158

POS tags: NN IN NN IN VBZ IN JJ JJ NN NN NN CC NN IN NN

Shallow parse: NP PP NP NP PP NP NP PP NP

Figure: Context diagram for the text mining process

Activity (A0): Extract knowledge from available data sources

Inputs: Unstructured data (text); Structured data (databases)

Output: Context-specific knowledge

Constraints: Software/hardware limitations; Privacy issues; Linguistic limitations

Enablers: Tools and techniques; Domain expertise

Figure: The three-step text mining process

Task 1. Establish the Corpus: Collect & Organize the Domain Specific Unstructured Data

The inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.

The output of Task 1 is a collection of documents in some digitized format for computer processing.

Task 2. Create the Term-Document Matrix: Introduce Structure to the Corpus

The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies.

Task 3. Extract Knowledge: Discover Novel Patterns from the T-D Matrix

The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations.

Two feedback arrows are shown between the tasks.

Figure: Example term-by-document matrix

Rows (Documents): Document 1; Document 2; Document 3; Document 4; Document 5; Document 6; ...

Columns (Terms): investment risk; project management; software engineering; development; SAP; ...

Cell values shown: 1, 1, 1, 2, 1, 1, 1, 3, 1 (cell positions not available)

Table (Application Case 7.5): sample records from the literature data set

Columns: Journal; Year; Author(s); Title; Vol/No; Pages; Keywords; Abstract

Journal: MISQ
Year: 2005
Author(s): A. Malhotra, S. Gosain and O. A. El Sawy
Title: Absorptive capacity configurations in supply chains: Gearing for partner-enabled market knowledge creation
Vol/No: 29/1
Pages: 145-187
Keywords: knowledge management; supply chain; absorptive capacity; interorganizational information systems; configuration approaches
Abstract (truncated): The need for continual value innovation is driving supply chains to evolve from a pure transactional focus to leveraging interorganizational partnerships for sharing

Journal: ISR
Year: 1999
Author(s): D. Robey and M. C. Boudreau
Title: Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications
Vol/No: 10/2
Pages: 167-185
Keywords: organizational transformation; impacts of technology; organization theory; research methodology; intraorganizational power; electronic communication; mis implementation; culture; systems
Abstract (truncated): Although much contemporary thought considers advanced information technologies as either determinants or enablers of radical organizational change, empirical studies have revealed inconsistent findings to support the deterministic logic implicit in such arguments. This paper reviews the contradictory

Journal: JMIS
Year: 2001
Author(s): R. Aron and E. K. Clemons
Title: Achieving the optimal balance between investment in quality and investment in self-promotion for information products
Vol/No: 18/2
Pages: 65-88
Keywords: information products; internet advertising; product positioning; signaling; signaling games
Abstract (truncated): When producers of goods (or services) are confronted by a situation in which their offerings no longer perfectly match consumer preferences, they must determine the extent to which the advertised features of

Figure (Application Case 7.5): number of articles per year, one panel per cluster

Panels: CLUSTER: 1 through CLUSTER: 9

X-axis: YEAR (1994 to 2005)

Y-axis: No of Articles (0 to 35)

Figure (Application Case 7.5): number of articles per journal, one panel per cluster

Panels: CLUSTER: 1 through CLUSTER: 9

X-axis: JOURNAL (ISR, JMIS, MISQ)

Y-axis: No of Articles (0 to 100)

Figure: Sentiment Analysis Process

Input: Textual Data (a statement)

Step 1: Calculate the O-S Polarity (uses a lexicon; yields an O-S polarity measure). Decision: Is there a sentiment? (Yes/No)

Step 2: Calculate the NP polarity of the sentiment (uses a lexicon; yields the N-P Polarity)

Step 3: Identify the target for the sentiment (yields the Target)

Step 4: Tabulate & aggregate the sentiment analysis results

Also shown: Record the Polarity, Strength, and the Target of the sentiment.

Figure: P-N Polarity and S-O Polarity

P-N Polarity axis: Positive (P) (+) to Negative (N) (-)

S-O Polarity axis: Subjective (S) to Objective (O)