Business Intelligence 7
404 Part II • Predictive Analytics/Machine Learning
Another security-related application of text mining is in the area of deception detection. Applying text mining to a large set of real-world criminal (person-of-interest) statements, Fuller, Biros, and Delen (2008) developed prediction models to differentiate deceptive statements from truthful ones. Using a rich set of cues extracted from textual statements, the model predicted the holdout samples with 70 percent accuracy, which is believed to be a significant success considering that the cues are extracted only from textual statements (no verbal or visual cues are present). Furthermore, compared to other deception-detection techniques, such as polygraphs, this method is nonintrusive and applicable not only to textual data but also (potentially) to transcriptions of voice recordings. A more detailed description of text-based deception detection is provided in Application Case 7.3.
Biomedical Applications
Text mining holds great potential for the medical field in general and biomedicine in particular for several reasons. First, published literature and publication outlets (especially with the advent of open-access journals) in the field are expanding at an exponential rate. Second, compared to most other fields, medical literature is more standardized and orderly, making it a more “minable” information source. Finally, the terminology used in
Driven by advancements in Web-based information technologies and increasing globalization, computer-mediated communication continues to filter into everyday life, bringing with it new venues for deception. The volume of text-based chat, instant messaging, text messaging, and text generated by online communities of practice is increasing rapidly. Even the use of e-mail continues to increase. With the massive growth of text-based communication, the potential for people to deceive others through computer-mediated communication has also grown, and such deception can have disastrous results.
Unfortunately, humans generally perform poorly at deception-detection tasks. This phenomenon is exacerbated in text-based communications. A large part of the research on deception detection (also known as credibility assessment) has involved face-to-face meetings and interviews. Yet with the growth of text-based communication, text-based deception-detection techniques are essential.
Techniques for successfully detecting deception (that is, lies) have wide applicability. Law enforcement can use decision support tools and techniques to investigate crimes, conduct security screening in airports, and monitor communications of suspected terrorists. Human resources professionals might use deception-detection tools to screen applicants. These tools and techniques also have the potential to screen e-mails to uncover fraud or other wrongdoing committed by corporate officers. Although some people believe that they can readily identify those who are not being truthful, a summary of deception research showed that, on average, people are only 54 percent accurate in making veracity determinations (Bond & DePaulo, 2006). This figure may actually be worse when humans try to detect deception in text.
Using a combination of text mining and data mining techniques, Fuller et al. (2008) analyzed person-of-interest statements completed by people involved in crimes on military bases. In these statements, suspects and witnesses are required to write their recollection of the event in their own words. Military law enforcement personnel searched archival data for statements that they could conclusively identify as being truthful or deceptive. These decisions were made on the basis of corroborating evidence and case resolution. Once the statements were labeled as truthful or deceptive, the law enforcement personnel removed identifying information and gave the statements to the research team. In total, 371 usable statements were received for analysis. The text-based deception-detection method used by Fuller et al. was based on a process known as message feature mining, which relies on elements of data and text mining techniques. A simplified depiction of the process is provided in Figure 7.3.
Application Case 7.3 Mining for Lies
Chapter 7 • Text Mining, Sentiment Analysis, and Social Analytics 405
First, the researchers prepared the data for processing. The original handwritten statements had to be transcribed into a word processing file. Second, features (i.e., cues) were identified. The researchers identified 31 features representing categories or types of language that are relatively independent of the text content and that can be readily analyzed by automated means. For example, first-person pronouns such as I or me can be identified without analysis of the surrounding text. Table 7.1 lists the categories and examples of features used in this study.
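To make the idea of content-independent cues concrete, here is a minimal Python sketch that counts a handful of them with simple pattern matching. It is an illustration only, not the software or the 31 features used by Fuller et al.; the function name `extract_cues`, the word lists, and the five cue names are assumptions chosen for this example.

```python
import re

# Illustrative word lists (assumptions, not the study's actual lexicons).
FIRST_PERSON = {"i", "me", "my", "mine", "myself",
                "we", "us", "our", "ours", "ourselves"}
MODAL_VERBS = {"can", "could", "may", "might", "must",
               "shall", "should", "will", "would"}


def extract_cues(statement: str) -> dict:
    """Turn one statement into a small dictionary of numeric cues."""
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", statement) if s.strip()]
    # Lowercase word tokens (letters and apostrophes only).
    words = re.findall(r"[a-z']+", statement.lower())
    n_words = len(words)
    if n_words == 0:
        return {"word_count": 0, "avg_sentence_length": 0.0,
                "first_person_ratio": 0.0, "modal_verb_ratio": 0.0,
                "lexical_diversity": 0.0}
    return {
        # Quantity: how much was written.
        "word_count": n_words,
        # Complexity: average number of words per sentence.
        "avg_sentence_length": n_words / len(sentences),
        # Share of first-person pronouns, found without reading context.
        "first_person_ratio": sum(w in FIRST_PERSON for w in words) / n_words,
        # Uncertainty: share of modal verbs.
        "modal_verb_ratio": sum(w in MODAL_VERBS for w in words) / n_words,
        # Diversity: unique words divided by total words.
        "lexical_diversity": len(set(words)) / n_words,
    }


cues = extract_cues("I saw him leave. He might have taken my keys.")
print(cues)
```

Each statement becomes one row of numbers, which is the kind of flat, quantified representation a classification model can be trained on.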
[Figure 7.3: process diagram. Boxes shown: Statements Transcribed for Processing; Statements Labeled as Truthful or Deceptive by Law Enforcement; Cues Extracted & Selected; Text Processing Software Identified Cues in Statements; Text Processing Software Generated Quantified Cues; Classification Models Trained and Tested on Quantified Cues.]
FIGURE 7.3 Text-Based Deception-Detection Process. Source: Fuller, C. M., D. Biros, & D. Delen. (2008, January). Exploration of Feature Selection and Advanced Classification Models for High-Stakes Deception Detection. Proceedings of the Forty-First Annual Hawaii International Conference on System Sciences (HICSS), Big Island, HI: IEEE Press, pp. 80–99.
TABLE 7.1 Categories and Examples of Linguistic Features Used in Deception Detection
Number  Construct (Category)  Example Cues
1       Quantity              Verb count, noun phrase count, etc.
2       Complexity            Average number of clauses, average sentence length, etc.
3       Uncertainty           Modifiers, modal verbs, etc.
4       Nonimmediacy          Passive voice, objectification, etc.
5       Expressivity          Emotiveness
6       Diversity             Lexical diversity, redundancy, etc.
7       Informality           Typographical error ratio
8       Specificity           Spatiotemporal information, perceptual information, etc.
9       Affect                Positive affect, negative affect, etc.
this literature is relatively constant, having a fairly standardized ontology. What follows are a few exemplary studies that successfully used text mining techniques in extracting novel patterns from biomedical literature.
Experimental techniques such as DNA microarray analysis, serial analysis of gene expression (SAGE), and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins. As in any other experimental approach, it is necessary to analyze this vast amount of data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.
Knowing the location of a protein within a cell can help to elucidate its role in biological processes and to determine its potential as a drug target. Numerous location-prediction systems are described in the literature; some focus on specific organisms, whereas others attempt to analyze a wide range of organisms. Shatkay et al. (2007) proposed a comprehensive system that uses several types of sequence- and text-based features to predict the location of proteins. The main novelty of their system lies in the way in which it selects its text sources and features and integrates them with sequence-based features. They tested the system on previously used and new data sets devised specifically to test its predictive power. The results showed that their system consistently beat previously reported results.
Chun et al. (2006) described a system that extracts disease–gene relationships from literature accessed via MEDLINE. They constructed a dictionary for disease and gene names from six public databases and extracted relation candidates by dictionary matching. Because dictionary matching produces a large number of false positives, they developed a machine-learning-based named entity recognition (NER) method to filter out false recognitions of disease/gene names. They found that the success of relation extraction is heavily dependent on the performance of NER filtering and that the filtering improved the precision of relation extraction by 26.7 percent at the cost of a small reduction in recall.
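The dictionary-matching step, and the reason it overgenerates, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the system of Chun et al.: the two small dictionaries, the example sentences, and the function names `find_terms` and `relation_candidates` are all hypothetical, and the learned NER filter itself is not implemented here.

```python
import re

# Toy dictionaries (assumptions for illustration; the real system drew
# its disease and gene names from six public databases).
DISEASE_TERMS = {"breast cancer", "anemia"}
GENE_TERMS = {"BRCA1", "CAT"}  # CAT is the gene symbol for catalase


def find_terms(sentence: str, terms: set) -> list:
    """Return the dictionary terms that occur in the sentence.

    Matching is case-insensitive on whole words, which is exactly
    what makes naive dictionary matching produce false positives.
    """
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.append(term)
    return sorted(found)


def relation_candidates(sentence: str) -> list:
    """Pair every matched disease with every matched gene."""
    diseases = find_terms(sentence, DISEASE_TERMS)
    genes = find_terms(sentence, GENE_TERMS)
    return [(d, g) for d in diseases for g in genes]


# A plausible candidate: both names are used in their dictionary sense.
print(relation_candidates(
    "Mutations in BRCA1 are associated with breast cancer."))

# A false positive: "cat" is an animal here, not the catalase gene.
print(relation_candidates("The cat with anemia was treated."))
```

The second sentence shows why a filtering step matters: the dictionary alone cannot tell that "cat" is not a gene mention, so a component that judges each recognized name in context is needed to raise precision.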
Figure 7.4 shows a simplified depiction of a multilevel text analysis process for discovering gene–protein relationships (or protein–protein interactions) in the biomedical
The features were extracted from the textual statements and input into a flat file for further processing. Using several feature-selection methods along with 10-fold cross-validation, the researchers compared the prediction accuracy of three popular data mining methods. Their results indicated that neural network models performed the best, with 73.46 percent prediction accuracy on test data samples; decision trees performed second best, with 71.60 percent accuracy; and logistic regression was last, with 65.28 percent accuracy.
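For readers unfamiliar with 10-fold cross-validation, the following Python sketch shows the mechanics: the data are split into 10 folds, each fold is held out once for testing while a model is trained on the other nine, and accuracy is pooled over all held-out predictions. Everything specific here is an assumption made for illustration: the data are synthetic (not the study's statements), and a simple nearest-centroid classifier stands in for the neural network, decision tree, and logistic regression models the researchers actually compared.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def fit_nearest_centroid(X, y):
    """Stand-in model: one mean feature vector per class label."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model


def predict(model, x):
    """Assign x to the class whose centroid is closest."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(model, key=lambda label: sq_dist(model[label]))


def cross_validate(X, y, k=10, seed=0):
    """Return accuracy pooled over all k held-out folds."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = fit_nearest_centroid([X[i] for i in train_idx],
                                     [y[i] for i in train_idx])
        correct += sum(predict(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


def make_toy_data(n_per_class=50, seed=1):
    """Synthetic two-feature 'cue' vectors; not data from the study."""
    rng = random.Random(seed)
    X, y = [], []
    for label, center in (("truthful", 0.0), ("deceptive", 5.0)):
        for _ in range(n_per_class):
            X.append([center + rng.uniform(-1, 1),
                      center + rng.uniform(-1, 1)])
            y.append(label)
    return X, y


X, y = make_toy_data()
print("10-fold accuracy:", cross_validate(X, y, k=10))
```

Because every statement is used for testing exactly once and never scored by a model that saw it during training, the pooled accuracy is a fairer estimate of performance on new statements than accuracy measured on the training data.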
The results indicate that automated text-based deception detection has the potential to aid those who must try to detect lies in text and can be successfully applied to real-world data. The accuracy of these techniques exceeded the accuracy of most other deception-detection techniques, even though the method was limited to textual cues.
Questions for Case 7.3
1. Why is it difficult to detect deception?
2. How can text/data mining be used to detect deception in text?
3. What do you think are the main challenges for such an automated system?
Sources: Fuller, C. M., D. Biros, & D. Delen. (2008, January). “Exploration of Feature Selection and Advanced Classification Models for High-Stakes Deception Detection.” Proceedings of the Forty-First Annual Hawaii International Conference on System Sciences (HICSS), Big Island, HI: IEEE Press, pp. 80–99; Bond, C. F., & B. M. DePaulo. (2006). “Accuracy of Deception Judgments.” Personality and Social Psychology Review, 10(3), pp. 214–234.