Discussion paper

profileLucky2549

Post 1:

1. What is text analytics? How does it differ from text mining?

Text analytics is the application of statistical and machine learning techniques to predict, prescribe, or infer information from text-mined data. Text mining, in this view, is the tool that gets the text data cleaned up and ready for that analysis.

Differences between Text Mining and Text Analytics:

• Text Mining and Text Analytics solve the same problems, but use different techniques and are complementary ways to automatically extract meaning from text.

• Text analytics was developed within the field of computational linguistics. Its strength is the ability to encode human understanding into a series of linguistic rules. Rules generated by humans are high in precision, but they do not adapt automatically and are usually fragile when tried in new situations.

• Text mining is a newer discipline arising out of the fields of statistics, data mining, and machine learning. Its strength is the ability to inductively create models from collections of historical data. Because statistical models are learned from training data, they are adaptive and can identify “unknown unknowns”, leading to better recall. Still, they can be prone to missing something that would seem obvious to a human.

• Text analytics and text mining approaches have essentially equivalent performance. Text analytics requires an expert linguist to produce complex rule sets, whereas text mining requires the analyst to hand-label cases with outcomes or classes to create training data.

• Due to their different perspectives and strengths, combining text analytics with text mining often leads to better performance than either approach alone.
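To make the contrast in the bullets above concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the complaint phrases, the tiny hand-labeled training set, and the very simple word-scoring "model" are hypothetical stand-ins, not a real linguistic rule set or a production classifier. It only shows how a hand-written rule and a model learned from labeled data can each miss cases that the other catches.

```python
from collections import Counter

# Hypothetical hand-written rule set (the rule-based, linguistic side).
COMPLAINT_PHRASES = ("not working", "refund", "broken")


def rule_based_is_complaint(text):
    """Return True if any hand-written phrase occurs in the text."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in COMPLAINT_PHRASES)


def train_word_scores(labeled_examples):
    """Learn per-word scores from hand-labeled examples (the statistical side).

    A word gains 1 for every complaint it appears in and loses 1 for
    every non-complaint it appears in.
    """
    scores = Counter()
    for text, is_complaint in labeled_examples:
        for word in set(text.lower().split()):
            scores[word] += 1 if is_complaint else -1
    return scores


def learned_is_complaint(text, scores):
    """Classify by summing the learned scores of the words in the text."""
    total = sum(scores[word] for word in set(text.lower().split()))
    return total > 0


def combined_is_complaint(text, scores):
    """Combine both approaches: flag the text if either one flags it."""
    return rule_based_is_complaint(text) or learned_is_complaint(text, scores)


# Hypothetical hand-labeled training data: (text, is_complaint).
TRAINING = [
    ("the screen cracked after one day", True),
    ("battery died and the screen flickers", True),
    ("great phone and fast delivery", False),
    ("love the camera and the battery life", False),
]
scores = train_word_scores(TRAINING)
```

With this toy data, "please refund my order" is caught only by the rule (none of its words were seen in training), while "the screen flickers constantly" is caught only by the learned scores (no rule mentions screens). The combined check flags both, which is the point made in the last bullet.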

2. What technologies were used in building Watson (both hardware and software)?

Watson is an extraordinary computer system (a novel combination of advanced hardware and software) designed to answer questions posed in natural human language. It was developed in IBM's DeepQA project by a research team led by principal investigator David Ferrucci, and was named after IBM's first CEO, the industrialist Thomas J. Watson. The computer system was specifically developed to answer questions on the quiz show Jeopardy! In 2011, Watson competed on Jeopardy! against former winners Brad Rutter and Ken Jennings.

Watson received the first prize of $1 million. The goal was to advance computer science by exploring new ways for computer technology to affect science, business, and society. IBM undertook a challenge to build a computer system that could compete at the human champion level in real time on the American TV quiz show Jeopardy! The extent of the challenge included fielding a real-time automatic contestant on the show, capable of listening, understanding, and responding, not merely a laboratory exercise.

The architecture behind Watson, called DeepQA, is a massively parallel, text-mining-focused, probabilistic, evidence-based computational architecture. Many techniques were combined so that overlapping approaches could bring their strengths to bear and contribute to improvements in accuracy, confidence, and speed. The overarching principles in DeepQA are:

a) Massive parallelism: consider multiple interpretations and hypotheses.

b) Many experts: integration, application, and contextual evaluation of loosely coupled question and content analytics.

c) Pervasive confidence estimation: no single component commits to an answer; every component produces confidence scores that are weighed against one another.

d) Integration of shallow and deep knowledge.
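The four principles above can be illustrated with a small Python sketch. This is not IBM's DeepQA code: the two scorers, their weights, the 0.5 threshold, and the sample evidence are all hypothetical, chosen only to show the idea that several candidate answers are scored in parallel by independent "experts", no single scorer decides, and the system answers only when the combined confidence is high enough.

```python
from concurrent.futures import ThreadPoolExecutor


def keyword_overlap_scorer(question, candidate, evidence):
    """Shallow evidence: share of question words found in the candidate's passage."""
    q_words = set(question.lower().split())
    e_words = set(evidence.get(candidate, "").lower().split())
    return len(q_words & e_words) / len(q_words) if q_words else 0.0


def type_match_scorer(question, candidate, evidence):
    """Toy 'deeper' check: a 'who' question expects a capitalized name."""
    if question.lower().startswith("who"):
        return 1.0 if candidate.istitle() else 0.0
    return 0.5


# Many experts: each scorer is paired with a hypothetical weight.
SCORERS = [(keyword_overlap_scorer, 0.6), (type_match_scorer, 0.4)]


def score_candidate(question, candidate, evidence):
    """Pervasive confidence estimation: a weighted sum over all scorers."""
    return sum(weight * scorer(question, candidate, evidence)
               for scorer, weight in SCORERS)


def answer(question, candidates, evidence, threshold=0.5):
    """Score every candidate in parallel; answer only if confident enough."""
    with ThreadPoolExecutor() as pool:
        confidences = list(pool.map(
            lambda c: score_candidate(question, c, evidence), candidates))
    best_confidence, best_candidate = max(zip(confidences, candidates))
    if best_confidence >= threshold:
        return best_candidate, best_confidence
    return None, best_confidence


# Hypothetical candidates and supporting passages.
EVIDENCE = {
    "William Shakespeare": "shakespeare wrote the play hamlet",
    "london": "hamlet was staged in london",
}
best, confidence = answer("who wrote hamlet", list(EVIDENCE), EVIDENCE)
```

Raising the threshold (for example `answer(..., threshold=0.9)`) makes the sketch decline to answer, which mirrors the idea of only "buzzing in" when confidence is high.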

Software:

Watson uses IBM's DeepQA software and the Apache UIMA (Unstructured Information Management Architecture) framework. The system was written in various languages, including Java, C++, and Prolog, and runs on the SUSE Linux Enterprise Server 11 operating system, using the Apache Hadoop framework to provide distributed computing.

Hardware:

The system is workload optimized, integrating massively parallel POWER7 processors with IBM's DeepQA technology, which it uses to generate hypotheses, gather massive evidence, and analyze data. Watson is composed of a cluster of ninety IBM Power 750 servers, each of which uses 3.5 GHz eight-core POWER7 processors with four threads per core. In total, the system has 2,880 POWER7 processor cores (32 per server) and 16 terabytes of RAM. According to John Rennie, Watson can process 500 gigabytes, the equivalent of a million books, per second. IBM's master inventor and senior consultant Tony Pearson estimated Watson's hardware cost at about $3 million. Its performance stands at 80 teraFLOPS, which is not enough to place it on the Top 500 Supercomputers list. According to Rennie, the content was stored in Watson's RAM for the game because data stored on hard drives is too slow to access.
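As a quick sanity check on the hardware figures quoted above, the arithmetic can be worked through in a few lines of Python. The inputs are the numbers from the paragraph; the derived values are my own calculation, and the RAM figure assumes 1 terabyte = 1,024 gigabytes.

```python
# Figures quoted in the hardware description.
SERVERS = 90
TOTAL_CORES = 2880
CORES_PER_PROCESSOR = 8
THREADS_PER_CORE = 4
TOTAL_RAM_TB = 16

# Derived values.
cores_per_server = TOTAL_CORES // SERVERS                         # 32 cores
processors_per_server = cores_per_server // CORES_PER_PROCESSOR   # 4 processors
total_threads = TOTAL_CORES * THREADS_PER_CORE                    # 11,520 threads

# Assumption: 1 TB = 1,024 GB, giving roughly 182 GB of RAM per server.
ram_per_server_gb = TOTAL_RAM_TB * 1024 / SERVERS

# Rennie's claim: 500 gigabytes per second is about a million books per second,
# which implies an average of roughly 500 kilobytes of text per book.
bytes_per_book = 500 * 10**9 / 1_000_000
```

So 2,880 cores across 90 servers works out to 32 cores per server, which suggests four eight-core processors in each server rather than one.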

Data:

The sources of information for Watson include encyclopedias, dictionaries, thesauri, newswire articles, and literary works. Watson also used databases, taxonomies, and ontologies. Specifically, DBpedia, WordNet, and YAGO were used. The IBM team provided Watson with millions of documents, including dictionaries, encyclopedias, and other reference material that it could use to build its knowledge. Watson is a question answering (QA) computer system developed by an IBM Research team as part of a project called DeepQA and named after IBM's first president. What makes it special is that it is able to compete at the human champion level in real time on the TV quiz show Jeopardy! In fact, in 2011, it was able to defeat Ken Jennings, who held the record for the longest winning streak in the game. As Deep Blue did with chess, Watson is showing that computer systems are getting quite good at demonstrating human-like intelligence. [1]

3. Why is the popularity of text mining as an analytics tool increasing?

Text mining is also described as knowledge discovery in textual databases: a process, not fully automated, of extracting useful and important patterns from a huge collection of unstructured data. It is the process of exploring and analyzing large amounts of unstructured text data, aided by software that can identify concepts, patterns, topics, keywords, and other attributes in the data. It is also known as text analytics, although some people draw a distinction between the two terms; in that view, text analytics is an application enabled by the use of text mining techniques to sort through data sets. [2] Text analytics is a concept that includes information retrieval (e.g., searching and identifying relevant documents for a given set of key terms) as well as information extraction, data mining, and Web mining. By contrast, text mining is primarily focused on discovering new and useful knowledge from textual data sources. The overarching goal for both text analytics and text mining is to turn unstructured textual data into actionable information through the application of natural language processing (NLP) and analytics. However, text analytics is a broader term because of its inclusion of information retrieval. You can think of text analytics as a combination of information retrieval plus text mining.
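The "information retrieval plus text mining" view can be sketched in Python as two steps. This is only an illustration under my own assumptions: the three short documents are made up, retrieval is a plain inverted-index lookup, and the "mining" step is nothing more than counting which other words show up most often in the retrieved documents.

```python
from collections import Counter, defaultdict

# Hypothetical document collection.
DOCS = {
    "d1": "battery drains fast after update",
    "d2": "battery drains overnight since update",
    "d3": "camera quality is great",
}


def build_inverted_index(docs):
    """Map every word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def retrieve(index, key_terms):
    """Information retrieval: documents that contain all of the key terms."""
    matches = [index.get(term.lower(), set()) for term in key_terms]
    return set.intersection(*matches) if matches else set()


def frequent_cooccurring_terms(docs, doc_ids, key_terms, top_n=1):
    """Text mining (toy version): most frequent other words in the retrieved documents."""
    excluded = {term.lower() for term in key_terms}
    counts = Counter()
    for doc_id in sorted(doc_ids):
        counts.update(word for word in docs[doc_id].lower().split()
                      if word not in excluded)
    return counts.most_common(top_n)


index = build_inverted_index(DOCS)
hits = retrieve(index, ["battery", "update"])
pattern = frequent_cooccurring_terms(DOCS, hits, ["battery", "update"])
```

Here retrieval narrows the collection to the two battery-related documents, and the mining step surfaces "drains" as the word that recurs across them, something the search terms alone did not ask for.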

Popular application areas of text mining:

· Information extraction. Identification of key phrases and relationships within text by looking for predefined sequences in text via pattern matching.

· Topic tracking. Based on a user profile and documents that a user views, text mining can predict other documents of interest to the user.

· Summarization. Summarizing a document to save time on the part of the reader.

· Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.

· Clustering. Grouping similar documents without having a predefined set of categories.

· Concept linking. Connects related documents by identifying their shared concepts and, by doing so, helps users find information that they perhaps would not have found using traditional search methods.

· Question answering. Finding the best answer to a given question through knowledge-driven pattern matching.
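As an example of the first application area in the list above, information extraction by pattern matching, here is a minimal Python sketch. The pattern, the sample sentence, and the company names are hypothetical; real extraction systems use far richer patterns, so this only shows the idea of looking for a predefined sequence in text and pulling out the related entities.

```python
import re

# Predefined sequence: <Capitalized name> acquired|bought <Capitalized name>.
ACQUISITION_PATTERN = re.compile(
    r"\b([A-Z][A-Za-z]+) (?:acquired|bought) ([A-Z][A-Za-z]+)\b"
)


def extract_acquisitions(text):
    """Return (buyer, target) pairs for every match of the pattern."""
    return [(match.group(1), match.group(2))
            for match in ACQUISITION_PATTERN.finditer(text)]


# Hypothetical sample text.
SAMPLE = "Last year Acme acquired Globex. Analysts say Initech bought Hooli for cash."
relations = extract_acquisitions(SAMPLE)
# relations == [("Acme", "Globex"), ("Initech", "Hooli")]
```

The weakness is the same one noted for rule-based approaches in general: a sentence such as "Globex was taken over by Acme" would be missed unless someone writes another pattern for it.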

References:

1. Charu C. Aggarwal, An Introduction to Text Mining, IBM T. J. Watson Research Center, Yorktown Heights, NY.

2. International Conference on Computation of Power, Energy, Information and Communication (ICCPEIC), 2017.

Post 2:

Text analytics is an automated method for analyzing and extracting valuable information from a piece of text. It is carried out by a program designed to go through long texts and collect information that may be useful for marketing, branding, and other research. Many businesses use text analytics, with the aid of algorithms and machine learning software, to find meaning in blogs, tweets, social media posts, reviews, comments, and other forms of content.

While the common goal for both text analytics and text mining is to turn unstructured textual data into actionable information by applying natural language processing (NLP) and analytics, their meanings are very different, at least to some experts in the sector. According to them, text analytics is a wider field that involves retrieval of information (e.g. scanning and finding relevant documents for a given set of key terms) as well as extraction of information, data mining, and web mining, while text mining focuses mainly on discovering new and useful knowledge from textual data sources. Based on this definition of text analytics and text mining, we can say that Text Analytics is the combination of Information Retrieval and Text Mining (which includes Information Extraction, Data Mining and Web Mining).

Watson used advanced technologies such as text mining, text analytics, and natural language processing, which include the following:

· Parsing

· Question Classification

· Question Decomposition

· Automatic Source Acquisition and Evaluation

· Entity and Relation Detection

· Logical Form Generation

· Knowledge Representation and Reasoning

In areas where very large quantities of textual data are produced, such as law (court orders), academic science (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), and marketing (customer comments), the benefits of text mining are evident. Another field where automated processing of unstructured text has had a huge influence is electronic correspondence and e-mail. Text mining can be used not only to identify and delete junk e-mail but also to automatically prioritize e-mail based on its level of significance and to produce automated responses (Weng and Liu, 2004). That is why the popularity of text mining as an analytics tool is increasing.
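The junk e-mail case mentioned above is a classic text mining task, and a minimal sketch of it in Python could look like the following. I am assuming a naive Bayes classifier with add-one smoothing here, which is a common textbook choice and not something taken from Weng and Liu; the four training messages are made up, and the code assumes the training data contains at least one message of each label.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")


def train_naive_bayes(labeled_messages):
    """Count words per label and messages per label from hand-labeled data."""
    word_counts = {label: Counter() for label in LABELS}
    message_counts = Counter()
    for text, label in labeled_messages:
        message_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, message_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, message_counts, vocabulary = model
    total_messages = sum(message_counts.values())
    best_label, best_log_prob = None, -math.inf
    for label in LABELS:
        # Prior: share of training messages that carry this label.
        log_prob = math.log(message_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            log_prob += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if log_prob > best_log_prob:
            best_label, best_log_prob = label, log_prob
    return best_label


# Hypothetical hand-labeled training messages.
TRAINING = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("please review the project report", "ham"),
]
model = train_naive_bayes(TRAINING)
```

For example, `classify("free prize inside", model)` comes out as spam and `classify("project meeting on monday", model)` as ham. The same scoring idea could be extended with more labels (such as "urgent" or "low priority") to sketch e-mail prioritization.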

References: https://www.revuze.it/blog/what-is-text-analytics/