Business_Intelligence_week5
Chapter 7:
Text Analytics, Text Mining, and Sentiment Analysis
Business Intelligence and Analytics: Systems for Decision Support
(10th Edition)
Copyright © 2014 Pearson Education, Inc.
Learning Objectives
Describe text mining and understand the need for text mining
Differentiate between text mining, Web mining, and data mining
Understand the different application areas for text mining
Know the process of carrying out a text mining project
Understand the different methods to introduce structure to text-based data
Describe sentiment analysis
Develop familiarity with popular applications of sentiment analysis
Learn the common methods for sentiment analysis
Become familiar with speech analytics as it relates to sentiment analysis
Opening Vignette…
Machine Versus Men on Jeopardy!: The Story of Watson
Situation
Problem
Solution
Results
Answer & discuss the case questions...
Watch it on YouTube!
Questions for the Opening Vignette
What is Watson? What is special about it?
What technologies were used in building Watson (both hardware and software)?
What are the innovative characteristics of DeepQA architecture that made Watson superior?
Why did IBM spend all that time and money to build Watson? Where is the ROI?
A High-Level Depiction of IBM Watson’s DeepQA Architecture
Text Mining Concepts
85-90 percent of all corporate data is in some kind of unstructured form (e.g., text)
Unstructured corporate data is doubling in size every 18 months
Tapping into these information sources is not optional; it is necessary to stay competitive
Answer: text mining
A semi-automated process of extracting knowledge from unstructured data sources
a.k.a. text data mining or knowledge discovery in textual databases
Text Analytics and Text Mining
Data Mining versus Text Mining
Both seek novel and useful patterns
Both are semi-automated processes
Difference is the nature of the data:
Structured versus unstructured data
Structured data: in databases
Unstructured data: Word documents, PDF files, text excerpts, XML files, and so on
Text mining: first impose structure on the data, then mine the structured data.
Text Mining Concepts
Benefits of text mining are obvious, especially in text-rich data environments
e.g., law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), marketing (customer comments), etc.
Electronic communication records (e.g., Email)
Spam filtering
Email prioritization and categorization
Automatic response generation
Text Mining Application Areas
Information extraction
Topic tracking
Summarization
Categorization
Clustering
Concept linking
Question answering
Text Mining Terminology
Unstructured or semi-structured data
Corpus (and corpora)
Terms
Concepts
Stemming
Stop words (and include words)
Synonyms (and polysemes)
Tokenizing
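Three of the terms above (tokenizing, stop words, stemming) can be illustrated with a minimal Python sketch. The stop-word list and the suffix list are hypothetical and far too short for real use, and the stemmer is a naive suffix stripper, not the Porter algorithm or any method named in the chapter.

```python
import re

# Hypothetical stop-word list; real projects use a longer, domain-tuned one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

# Naive suffix list for illustration only; this is not the Porter algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that carry little differentiating value."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem what is left."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]
```

Calling `preprocess("The analysts are mining the reports")` reduces the sentence to three stems. The crude stemmer turns "mining" into "min", which shows why production systems rely on carefully designed stemming rules.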
Text Mining Terminology
Term dictionary
Word frequency
Part-of-speech tagging
Morphology
Term-by-document matrix
Occurrence matrix
Singular value decomposition
Latent semantic indexing
Application Case 7.1 Text Mining for Patent Analysis
What is a patent?
“exclusive rights granted by a country to an inventor for a limited period of time in exchange for a disclosure of an invention”
How do we do patent analysis (PA)?
Why do we need to do PA?
What are the benefits?
What are the challenges?
How does text mining help in PA?
Natural Language Processing (NLP)
Structuring a collection of text
Old approach: bag-of-words
New approach: natural language processing
NLP is …
a very important concept in text mining
a subfield of artificial intelligence and computational linguistics
the study of "understanding" natural human language
Syntax versus semantics-based text mining
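A minimal sketch of why bag-of-words is called the old approach: the representation keeps term counts but discards word order, so sentences with opposite meanings can look identical. The function name and the example sentences are illustrative assumptions, not from the chapter.

```python
from collections import Counter


def bag_of_words(text):
    """Return term counts for the text, discarding word order entirely."""
    return Counter(text.lower().split())


a = bag_of_words("dog bites man")
b = bag_of_words("man bites dog")

# The two bags are equal even though the sentences mean different things.
same_bag = a == b
```

Recovering who did what to whom requires syntactic or semantic analysis, which is where NLP comes in.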
Natural Language Processing (NLP)
What is “understanding”?
Humans understand; what about computers?
Natural language is vague and context-driven
True understanding requires extensive knowledge of a topic
Can (or will) computers ever understand natural language as accurately as we do?
Natural Language Processing (NLP)
Challenges in NLP
Part-of-speech tagging
Text segmentation
Word sense disambiguation
Syntax ambiguity
Imperfect or irregular input
Speech acts
Dream of AI community
to have algorithms that are capable of automatically reading and obtaining knowledge from text
Natural Language Processing (NLP)
WordNet
A laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets.
A major resource for NLP.
Needs automation to be completed.
Sentiment Analysis
A technique used to detect favorable and unfavorable opinions toward specific products and services
SentiWordNet
Application Case 7.2
Text Mining Improves Hong Kong Government’s Ability to Anticipate and Address Public Complaints
Questions for Discussion
How did the Hong Kong government use text mining to better serve its constituents?
What were the challenges, the proposed solution, and the obtained results?
NLP Task Categories
Information retrieval, information extraction
Named-entity recognition
Question answering
Automatic summarization
Natural language generation & understanding
Machine translation
Foreign language reading & writing
Speech recognition
Text proofing, optical character recognition
Text Mining Applications
Marketing applications
Enables better CRM
Security applications
ECHELON, OASIS
Deception detection (…)
Medicine and biology
Literature-based gene identification (…)
Academic applications
Research stream analysis
Application Case 7.3
Mining for Lies!
Deception detection
A difficult problem
If detection is limited to only text, then the problem is even more difficult
The study
analyzed text-based testimonies of persons of interest at military bases
used only text-based features (cues)
Application Case 7.3 Mining for Lies
371 usable statements are generated
31 features are used
Different feature selection methods used
10-fold cross validation is used
Results (overall % accuracy)
Logistic regression 67.28
Decision trees 71.60
Neural networks 73.46
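The 10-fold cross validation mentioned above can be sketched as follows. This is a minimal illustration of the splitting logic only; the slides do not say how the study partitioned or shuffled its 371 statements, and the function names are hypothetical.

```python
def k_fold_indices(n_items, k=10):
    """Split range(n_items) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_test_splits(n_items, k=10):
    """Yield (train_indices, test_indices), using each fold as the test set once."""
    folds = k_fold_indices(n_items, k)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With 371 statements, one fold holds 38 items and the other nine hold 37 each. Each model is trained ten times, and the reported accuracy is the average over the ten held-out folds. In practice the items are usually shuffled before splitting.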
Text Mining Applications (Gene/Protein Interaction Identification)
Application Case 7.4
Text mining and Sentiment Analysis Help Improve Customer Service Performance
Questions for Discussion
How did the financial services firm use text mining and text analytics to improve its customer service performance?
What were the challenges, the proposed solution, and the obtained results?
Text Mining Process
Context diagram for the text mining process
Text Mining Process
The three-step text mining process
Text Mining Process
Step 1: Establish the corpus
Collect all relevant unstructured data (e.g., textual documents, XML files, emails, Web pages, short notes, voice recordings…)
Digitize and standardize the collection (e.g., all in ASCII text files)
Place the collection in a common place (e.g., in a flat file, or in a directory as separate files)
Text Mining Process
Step 2: Create the Term-by-Document Matrix (TDM)
Should all terms be included?
Stop words, include words
Synonyms, homonyms
Stemming
What is the best representation of the indices (values in cells)?
Raw counts; binary frequencies; log frequencies;
Inverse document frequency
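The cell representations listed above can be compared in a short sketch. It builds a matrix with documents as rows and terms as columns and fills the cells with raw counts, binary indicators, or log frequencies, optionally scaled by inverse document frequency. The formulas `1 + log(tf)` and `log(N / n_t)` are common textbook forms assumed here. Whitespace tokenization and all function names are simplifications for illustration.

```python
import math
from collections import Counter


def term_document_counts(docs):
    """Return (sorted vocabulary, one Counter of raw term counts per document)."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    return vocab, counts


def raw(tf):
    """Raw count: the term frequency itself."""
    return tf


def binary(tf):
    """Binary frequency: 1 if the term occurs at all, else 0."""
    return 1 if tf > 0 else 0


def log_freq(tf):
    """Log frequency: 1 + log(tf) for tf > 0, else 0 (assumed form)."""
    return 1 + math.log(tf) if tf > 0 else 0.0


def inverse_document_frequency(term, counts):
    """log(N / n_t), where n_t of the N documents contain the term."""
    n_t = sum(1 for c in counts if term in c)
    return math.log(len(counts) / n_t) if n_t else 0.0


def build_matrix(docs, weight=raw, use_idf=False):
    """Build the matrix: one row per document, one column per vocabulary term."""
    vocab, counts = term_document_counts(docs)
    matrix = []
    for c in counts:
        row = []
        for term in vocab:
            value = weight(c[term])
            if use_idf:
                value *= inverse_document_frequency(term, counts)
            row.append(value)
        matrix.append(row)
    return vocab, matrix
```

For example, `build_matrix(["risk project risk", "project software"])` yields the vocabulary `["project", "risk", "software"]` and raw rows `[1, 2, 0]` and `[1, 0, 1]`. With `use_idf=True`, the column for "project" becomes zero because the term occurs in every document and so does not discriminate between them.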
Text Mining Process
Step 2: Create the Term-by-Document Matrix (TDM)
TDM is a sparse matrix. How can we reduce the dimensionality of the TDM?
Manual - a domain expert goes through it
Eliminate terms with very few occurrences in very few documents (?)
Transform the matrix using singular value decomposition (SVD)
SVD is similar to principal component analysis
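To give a feel for what SVD extracts, the sketch below approximates only the largest singular value of a matrix and its singular vectors, using power iteration in pure Python. This is an illustrative stand-in: a real project would call a numerical library's full SVD routine and keep the top few dimensions. The function name and the fixed iteration count are assumptions.

```python
import math


def top_singular_triplet(a, iterations=200):
    """Approximate the largest singular value of matrix a (a list of rows)
    and its left and right singular vectors by power iteration on A^T A."""
    n_rows, n_cols = len(a), len(a[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols  # arbitrary unit start vector
    for _ in range(iterations):
        # One step of power iteration: w = A^T (A v), then renormalize.
        av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        w = [sum(a[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av] if sigma else av
    return sigma, u, v
```

With documents as rows, the products `sigma * u[i]` give each document's coordinate on the first latent dimension, and `v` shows how strongly each term loads on it. Repeating the idea for further dimensions yields the reduced representation used in latent semantic indexing.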
Text Mining Process
Step 3: Extract patterns/knowledge
Classification (text categorization)
Clustering (natural groupings of text)
Improve search recall
Improve search precision
Scatter/gather
Query-specific clustering
Association
Trend Analysis (…)
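The search recall and search precision mentioned under clustering have standard definitions, sketched below. Precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that were retrieved. The function name and the document IDs in the usage note are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs
    measured against the set of truly relevant document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

If a query returns documents 1, 2, 3, and 4 while the relevant ones are 2, 4, and 6, precision is 2/4 and recall is 2/3. Clustering can raise recall by pulling in documents grouped with the ones matched, and raise precision by filtering out documents from unrelated groups.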
Application Case 7.5 (Research Literature Survey with Text Mining)
Mining the published IS literature
MIS Quarterly (MISQ)
Journal of MIS (JMIS)
Information Systems Research (ISR)
Covers 12-year period (1994-2005)
901 papers are included in the study
Only the paper abstracts are used
9 clusters are generated for further analysis
Text Mining Tools
Commercial Software Tools
IBM SPSS Modeler - Text Miner
SAS Enterprise Miner – Text Miner
Statistica Data Miner – Text Miner
ClearForest, …
Free Software Tools
RapidMiner
GATE
Spy-EM, …
Application Case 7.6
A Potpourri of Text Mining Case Synopses
Alberta’s Parks Division gains insight from unstructured data
American Honda Saves Millions by Using Text and Data Mining
Maspex Wadowice Group Analyzes Online Brand Image with Text Mining
Viseca Card Services Reduces Fraud Loss with Text Analytics
Improving Quality with Text Mining and Advanced Analytics
Sentiment Analysis Overview
Sentiment: belief, view, opinion, conviction
Sentiment analysis: a.k.a. opinion mining, subjectivity analysis, and appraisal extraction
The goal is to answer the question:
“What do people feel about a certain topic?”
Explicit versus Implicit sentiment
Sentiment polarity
Positive versus Negative
… versus Neutral?
Example – Real-Time Social Signal (by Attensity)
Application Case 7.7
Whirlpool Achieves Customer Loyalty and Product Success with Text Analytics
Questions for Discussion
How did Whirlpool use capabilities of text analytics to better understand their customers and improve product offerings?
What were the challenges, the proposed solution, and the obtained results?
Sentiment Analysis Applications
Voice of the Customer (VOC)
Voice of the Market (VOM)
Voice of the Employee (VOE)
Brand Management
Financial Markets
Politics
Government Intelligence
… others
Sentiment Analysis Process
Step 1 – Sentiment Detection
Comes right after the retrieval and preparation of the text documents
It is also called detection of objectivity
Fact [= objectivity] versus Opinion [= subjectivity]
Step 2 – N-P Polarity Classification
Given an opinionated piece of text, the goal is to classify the opinion as falling under one of two opposing sentiment polarities
N [= negative] versus P [= positive]
Step 3 – Target Identification
The goal of this step is to accurately identify the target of the expressed sentiment (e.g., a person, a product, an event, etc.)
Level of difficulty depends on the application domain
Step 4 – Collection and Aggregation
Once the sentiments of all text data points in the document are identified and calculated, they are to be aggregated
Word → Statement → Paragraph → Document
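A minimal sketch of the aggregation step, assuming statement-level polarity scores in [-1, 1] are already available and that a simple average is an acceptable way to roll them up. The chapter does not prescribe an aggregation formula, so the averaging rule and the function names are illustrative assumptions.

```python
from statistics import mean


def aggregate(scores):
    """Average a list of polarity scores; an empty list is treated as neutral."""
    return mean(scores) if scores else 0.0


def document_polarity(document):
    """document: a list of paragraphs, each a list of statement-level scores.
    Returns (per-paragraph scores, overall document score)."""
    paragraph_scores = [aggregate(paragraph) for paragraph in document]
    return paragraph_scores, aggregate(paragraph_scores)
```

For a document with paragraphs scored `[1.0, -1.0]` and `[0.5, 0.5]`, the paragraph scores are 0.0 and 0.5 and the document score is 0.25. Weighted schemes, for example by statement length or sentiment strength, are equally plausible.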
Sentiment Analysis Methods for Polarity Identification
Polarity Identification – P vs. N
Can be made at the level of word, term, sentence, paragraph, document
Two competing methods
Using a lexicon
WordNet [wordnet.princeton.edu]
SentiWordNet [sentiwordnet.isti.cnr.it]
Using pre-classified training documents
Data mining / machine learning
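The lexicon-based method can be sketched in a few lines. The mini-lexicon below is hypothetical and hand-made for illustration; it is not taken from WordNet or SentiWordNet. The sketch also ignores negation, intensity, and context, all of which a real system must handle.

```python
# Hypothetical mini-lexicon mapping words to a polarity in [-1, 1].
LEXICON = {
    "good": 1.0, "great": 1.0, "love": 1.0,
    "bad": -1.0, "poor": -1.0, "hate": -1.0,
}


def polarity(text, lexicon=LEXICON):
    """Sum the lexicon scores of the words in text and map the total
    to 'P' (positive), 'N' (negative), or 'neutral'."""
    words = text.lower().split()
    score = sum(lexicon.get(w.strip(".,!?"), 0.0) for w in words)
    if score > 0:
        return "P", score
    if score < 0:
        return "N", score
    return "neutral", score
```

The competing approach replaces the lexicon with a classifier trained on pre-classified documents. It needs labeled data but can pick up domain-specific vocabulary that a general lexicon misses.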
P-N Polarity and S-O Polarity
Sentiment Analysis and Speech Analytics
Speech analytics – analysis of voice
Content versus other Voice Features
Two Approaches
The Acoustic Approach
Intensity, Pitch, Jitter, Shimmer, etc.
The Linguistic Approach
Lexical: words, phrases, etc.
Disfluencies: filled pauses, hesitation, restarts, etc.
Higher semantics: taxonomy/ontology, pragmatics
Many uses and use cases exist
Application Case 7.8
Cutting Through the Confusion: Blue Cross Blue Shield of North Carolina Uses Nexidia’s Speech Analytics to Ease Member Experience in Healthcare
Questions for Discussion
For a large company like BCBSNC with a lot of customers, what does “listening to customer” mean?
What were the challenges, the proposed solution, and the obtained results for BCBSNC?
End of the Chapter
Questions, comments
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America.
Figure labels (A High-Level Depiction of IBM Watson’s DeepQA Architecture): Question; Question analysis; Query decomposition; Hypothesis generation (Primary search, Candidate answer generation; Answer sources); Soft filtering; Hypothesis and evidence scoring (Support evidence retrieval, Deep evidence scoring; Evidence sources); Synthesis; Final merging and ranking (Trained models); Answer and confidence. The hypothesis generation, soft filtering, and scoring stages are repeated in parallel for each decomposed query.
Figure labels (Text Analytics and Text Mining): TEXT ANALYTICS comprises Information Retrieval, Information Extraction, Web Mining, Data Mining, and Text Mining; related disciplines shown are Statistics, Computer Science, Natural Language Processing, Artificial Intelligence, Machine Learning, Management Science, and Linguistic.
Figure labels (Application Case 7.3, text-based deception detection process): Statements Transcribed for Processing; Statements Labeled as Truthful or Deceptive By Law Enforcement; Cues Extracted & Selected; Text Processing Software Identified Cues in Statements; Text Processing Software Generated Quantified Cues; Classification Models Trained and Tested on Quantified Cues.
Table (Application Case 7.3, categories and examples of linguistic cues):
Category        Example Cues
Quantity        Verb count, noun-phrase count, ...
Complexity      Avg. no. of clauses, sentence length, ...
Uncertainty     Modifiers, modal verbs, ...
Nonimmediacy    Passive voice, objectification, ...
Expressivity    Emotiveness
Diversity       Lexical diversity, redundancy, ...
Informality     Typographical error ratio
Specificity     Spatiotemporal, perceptual information, ...
Affect          Positive affect, negative affect, etc.
Figure labels (Text Mining Applications: Gene/Protein Interaction Identification): the example sentence "...expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53." is annotated at five levels: Gene/Protein, Ontology, Word, POS, and Shallow Parse. The numeric identifiers and tag sequences in the figure are not recoverable from the extracted text.
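Figure labels (Context diagram for the text mining process): activity A0, Extract knowledge from available data sources. Inputs: Unstructured data (text), Structured data (databases). Output: Context-specific knowledge. Constraints: Software/hardware limitations, Privacy issues, Linguistic limitations. Enablers: Tools and techniques, Domain expertise.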
Figure text (The three-step text mining process; Task 1, Task 2, and Task 3 in sequence, with two feedback arrows):
Inputs: the inputs to the process include a variety of relevant unstructured (and semi-structured) data sources such as text, XML, HTML, etc.
Task 1, Establish the Corpus: Collect & Organize the Domain Specific Unstructured Data. The output of Task 1 is a collection of documents in some digitized format for computer processing.
Task 2, Create the Term-Document Matrix: Introduce Structure to the Corpus. The output of Task 2 is a flat file called the term-document matrix, where the cells are populated with the term frequencies.
Task 3, Extract Knowledge: Discover Novel Patterns from the T-D Matrix. The output of Task 3 is a number of problem-specific classification, association, and clustering models and visualizations.
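Figure labels (Step 2: example term-by-document matrix): rows are Documents (Document 1 through Document 6, ...); columns are Terms (investment risk, project management, software engineering, development, SAP, ...); each cell holds the number of times the term occurs in the document. The positions of the individual cell values are not recoverable from the extracted text.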
Table (Application Case 7.5, sample records from the data set; columns: Journal, Year, Author(s), Title, Vol/No, Pages, Keywords, Abstract):

Journal: MISQ
Year: 2005
Author(s): A. Malhotra, S. Gosain and O. A. El Sawy
Title: Absorptive capacity configurations in supply chains: Gearing for partner-enabled market knowledge creation
Vol/No: 29/1
Pages: 145-187
Keywords: knowledge management; supply chain; absorptive capacity; interorganizational information systems; configuration approaches
Abstract: The need for continual value innovation is driving supply chains to evolve from a pure transactional focus to leveraging interorganizational partnerships for sharing …

Journal: ISR
Year: 1999
Author(s): D. Robey and M. C. Boudreau
Title: Accounting for the contradictory organizational consequences of information technology: Theoretical directions and methodological implications
Vol/No: 10/2
Pages: 167-185
Keywords: organizational transformation; impacts of technology; organization theory; research methodology; intraorganizational power; electronic communication; mis implementation; culture; systems
Abstract: Although much contemporary thought considers advanced information technologies as either determinants or enablers of radical organizational change, empirical studies have revealed inconsistent findings to support the deterministic logic implicit in such arguments. This paper reviews the contradictory …

Journal: JMIS
Year: 2001
Author(s): R. Aron and E. K. Clemons
Title: Achieving the optimal balance between investment in quality and investment in self-promotion for information products
Vol/No: 18/2
Pages: 65-88
Keywords: information products; internet advertising; product positioning; signaling; signaling games
Abstract: When producers of goods (or services) are confronted by a situation in which their offerings no longer perfectly match consumer preferences, they must determine the extent to which the advertised features of …

…
Figure labels (Application Case 7.5, distribution of articles over time): nine histogram panels, CLUSTER: 1 through CLUSTER: 9; x-axis: YEAR (1994 to 2005); y-axis: No of Articles (0 to 35).
Figure labels (Application Case 7.5, distribution of articles by journal): nine histogram panels, CLUSTER: 1 through CLUSTER: 9; x-axis: JOURNAL (ISR, JMIS, MISQ); y-axis: No of Articles (0 to 100).
Figure labels (Sentiment Analysis Process): Textual Data → A statement → Step 1: Calculate the O-S Polarity (uses a Lexicon; produces an O-S polarity measure) → Is there a sentiment? (Yes/No) → if Yes, Step 2: Calculate the N-P polarity of the sentiment (uses a Lexicon; produces the N-P Polarity) → Step 3: Identify the target for the sentiment (produces the Target) → Record the Polarity, Strength, and the Target of the sentiment → Step 4: Tabulate & aggregate the sentiment analysis results.
Figure labels (P-N Polarity and S-O Polarity): the P-N Polarity axis runs from Negative (N) (-) to Positive (P) (+); the S-O Polarity axis runs from Objective (O) to Subjective (S).