Text Mining, Sentiment Analysis, and Social Analytics

LEARNING OBJECTIVES

■■ Describe text analytics and understand the need for text mining

■■ Differentiate among text analytics, text mining, and data mining

■■ Understand the different application areas for text mining

■■ Know the process of carrying out a text mining project

■■ Appreciate the different methods to introduce structure to text-based data

■■ Describe sentiment analysis

■■ Develop familiarity with popular applications of sentiment analysis

■■ Learn the common methods for sentiment analysis

■■ Become familiar with speech analytics as it relates to sentiment analysis

■■ Learn three facets of Web analytics—content, structure, and usage mining

■■ Know social analytics including social media and social network analyses

This chapter provides a comprehensive overview of text analytics/mining and Web analytics/mining along with their popular application areas such as search engines, sentiment analysis, and social network/media analytics. As we have been witnessing in recent years, the unstructured data generated over the Internet of Things (IoT) (Web, sensor networks, radio-frequency identification [RFID]–enabled supply chain systems, surveillance networks, etc.) are increasing at an exponential pace, and there is no indication of its slowing down. This changing nature of data is forcing organizations to make text and Web analytics a critical part of their business intelligence/analytics infrastructure.

Chapter 7 • Text Mining, Sentiment Analysis, and Social Analytics

7.1 Opening Vignette: Amadori Group Converts Consumer Sentiments into Near-Real-Time Sales
7.2 Text Analytics and Text Mining Overview
7.3 Natural Language Processing (NLP)
7.4 Text Mining Applications
7.5 Text Mining Process
7.6 Sentiment Analysis
7.7 Web Mining Overview
7.8 Search Engines
7.9 Web Usage Mining
7.10 Social Analytics

7.1 OPENING VIGNETTE: Amadori Group Converts Consumer Sentiments into Near-Real-Time Sales

BACKGROUND

Amadori Group, or Gruppo Amadori in Italian, is a leading manufacturing company in Italy that produces and markets food products. Headquartered in San Vittore di Cesena, Italy, the company employs more than 7,000 people and operates 16 production plants.

Amadori wanted to evolve its marketing to dynamically align with the changing lifestyles and dietary needs of young people aged 25-35. It sought to create fun ways to engage this target segment by exploiting the potential of online marketing and social media. The company wanted to boost brand visibility, encourage customer loyalty, and gauge consumers’ reactions to products and marketing campaigns.

ENGAGING YOUNG ADULTS WITH CREATIVE DIGITAL MARKETING PROMOTIONS

Together with Tecla (a digital business company), Amadori used IBM WebSphere® Portal and IBM Web Content Manager software to create and manage interactive content for four mini websites, or “minisites,” which promote ready-to-eat and quick-to-prepare products that fit young adults’ preferences and lifestyles. For example, to market its new Evviva sausage product, the company created the “Evviva Il Würstel Italiano” minisite and let consumers upload images and videos of themselves attending events organized by Amadori. To encourage participation, the company offered the winner a spot in its next national ad campaign.

With this and other campaigns, the Amadori marketing staff compiled a database of consumer profiles by asking minisite visitors to share data to enter competitions, download applications, receive regular newsletters, and sign up for events. Additionally, the company uses Facebook Insights technology to obtain metrics on its Facebook page, including the number of new fans and favorite content.

MONITORING MARKETPLACE PERCEPTIONS OF THE AMADORI BRAND

The company capitalizes on IBM SPSS® Data Collection software to help assess people’s opinions of its products and draw conclusions about any fluctuation in Amadori brands’ popularity among consumers. For example, as it launched its Evviva campaign TV advertisements and Beach Party Tour, Amadori experienced a flood of consumer conversation. The company captured comments about the product from its Web site and social media networks using SPSS software’s sentiment analysis functionality and successfully adjusted its marketing efforts in near-real time. The software does not depend solely on keyword searches but also analyzes the syntax of languages, connotations, and even slang to reveal hidden speech patterns that help gauge whether comments about the company or Amadori products express a positive, negative, or neutral opinion. Figure 7.1 shows Amadori’s three facets of commerce analytics to improve consumer engagement.

Part II • Predictive Analytics/Machine Learning

MAINTAINING BRAND INTEGRITY AND CONSISTENCY ACROSS PRODUCT LINES

Building on the success of marketing minisites, Amadori launched a new corporate Web site built on the same IBM portal and content management technology. The company now concentrates on bringing visitors to the corporate Web site. Instead of individual minisites, Amadori offers sections within the Web site, some with different templates and graphics and a user interface specific to a particular marketing or ad campaign. “For example, we introduced a new product that is made from organic, free-range chickens,” says Fabbri. “As part of the marketing plan, we offered webcam viewing in a new section of our corporate site so visitors could see how the poultry live and grow. We created a new graphic, but the URL, the header and the footer are always the same so visitors understand that they are always in the Amadori site.”

Visitors can move from one section to another, remaining longer and learning about other offerings. With new content added weekly, the Amadori site has become bigger and is gaining greater prominence on Google and other search engines. “The first year after implementation, our Website traffic grew to approximately 240,000 unique visitors, with 30 percent becoming loyal users,” says Fabbri.

KEEPING CONTENT CURRENT AND ENGAGING TO DIVERSE AUDIENCES

With more content and a high volume of traffic on the Web, it is important that visitors continue to easily find what they are looking for no matter how they access the Amadori site. With this aim in mind, the Amadori project team created a content taxonomy organized by role and area of interest. For example, when people visit the Amadori Web site, they see a banner inviting them to “Reorganize the contents.” They can identify themselves as consumers, buyers, or journalists/bloggers and slide selection bars to indicate interest level in corporate, cooking, and/or entertainment information. The content appearing on the site changes in real time based on these selections. “If the visitor identifies himself as a professional buyer interested predominantly in corporate information, the icons he sees at the top of the screen invite him to either view a digital product catalog online or download a PDF,” says Fabbri. “In that same area of the screen, a consumer interested in cooking sees an icon that clicks through to pages with recipes for preparing dishes using Amadori products.”

Instrumented: The interactive digital platform supports rapid, accurate data collection from business partners and customers.

Interconnected: The digital platform also provides an integrated view of the company’s end-to-end processes from production plan to marketing and sales.

Intelligent: Content management, data collection, and predictive analytics applications monitor and analyze social media relevant to the Amadori brand, helping the company anticipate issues and better align products and marketing promotions with customers’ needs and desires.

FIGURE 7.1 Smarter Commerce—Improving Consumer Engagement through Analytics.

Amadori’s advanced analytics projects have been producing significant business benefits, making a very strong case for the company to venture into more innovative use of social data. Following are a few of the most prevalent ones:

• Boost by 100 percent the company’s ability to dynamically monitor and learn about the health of its brand using sentiment analysis.

• Improve the company’s social media presence by 100 percent using near-real-time marketing insights, gaining 45,000 Facebook fans in less than one year.

• Establish direct communication with the target segment through Web integration with social media.

• Increase sales by facilitating timely promotions such as eCoupons.

As this case illustrates, in this age of Internet and social media, customer-focused companies are in a race to better communicate with their customers to obtain an intimate understanding of their needs, wants, likes, and dislikes. Social analytics that builds on social media—providing both content and the social network–related data—enables these companies to gain deeper insights than ever before.

QUESTIONS FOR THE OPENING VIGNETTE

1. According to the vignette and based on your opinion, what are the challenges that the food industry is facing today?

2. How can analytics help businesses in the food industry to survive and thrive in this competitive marketplace?

3. What were, and still are, Amadori’s main objectives in embarking on analytics? What were the results?

4. Can you think of other businesses in the food industry that utilize analytics to become more competitive and customer focused? If not, an Internet search could help find relevant information to answer this question.

WHAT WE CAN LEARN FROM THIS VIGNETTE

It is safe to say that computer technology, both on the hardware and software fronts, is advancing faster than anything else in the last 50-plus years. Things that were too big, too complex, and impossible to solve are now well within the reach of information technology. One of the enabling technologies is perhaps text analytics/text mining and its derivative called sentiment analysis. Traditionally, we have created databases to structure the data so that they can be processed by computers. Textual content, on the other hand, has always been meant for humans to process. Can machines do the things that are meant for humans’ creativity and intelligence? Evidently, yes! This case illustrates the viability and value proposition of collecting and processing customer opinions to develop new and improved products and services, managing the company’s brand name, and engaging and energizing the customer base for mutually beneficial and closer relationships. Under the overarching name of “digital marketing,” Amadori showcases the use of text mining, sentiment analysis, and social media analytics to significantly advance the bottom line through improved customer satisfaction, increased sales, and enhanced brand loyalty.

Sources: IBM Customer Case Study. “Amadori Group Converts Consumer Sentiments into Near-Real-Time Sales.” Used with permission of IBM.

7.2 TEXT ANALYTICS AND TEXT MINING OVERVIEW

The information age that we are living in is characterized by the rapid growth in the amount of data and information collected, stored, and made available in electronic format. A vast majority of business data are stored in text documents that are virtually unstructured. According to a study by Merrill Lynch and Gartner, 85 percent of all corporate data are captured and stored in some sort of unstructured form (McKnight, 2005). The same study also stated that these unstructured data are doubling in size every 18 months. Because knowledge is power in today’s business world and knowledge is derived from data and information, businesses that effectively and efficiently tap into their text data sources will have the necessary knowledge to make better decisions, leading to a competitive advantage over those businesses that lag behind. This is where the need for text analytics and text mining fits into the big picture of today’s businesses.

Even though the overarching goal for both text analytics and text mining is to turn unstructured textual data into actionable information through the application of natural language processing (NLP) and analytics, the definitions of these terms are somewhat different, at least to some experts in the field. According to them, “text analytics” is a broader concept that includes information retrieval (e.g., searching and identifying relevant documents for a given set of key terms), as well as information extraction, data mining, and Web mining, whereas “text mining” is primarily focused on discovering new and useful knowledge from the textual data sources. Figure 7.2 illustrates the relationships between text analytics and text mining along with other related application areas. The bottom of Figure 7.2 lists the main disciplines (the foundation of the house) that play a critical role in the development of these increasingly more popular application areas. Based on this definition of text analytics and text mining, one could simply formulate the difference between the two as follows:

Text Analytics = Information Retrieval + Information Extraction + Data Mining + Web Mining

or simply

Text Analytics = Information Retrieval + Text Mining

Compared to text mining, text analytics is a relatively new term. With the recent emphasis on analytics, as has been the case in many other related technical application areas (e.g., consumer analytics, competitive analytics, visual analytics, social analytics), the field of text has also wanted to get on the analytics bandwagon. Although the term text analytics is more commonly used in a business application context, text mining is frequently used in academic research circles. Even though the two can be defined somewhat differently at times, text analytics and text mining are usually used synonymously, and we (authors of this book) concur with this.

Text mining (also known as text data mining or knowledge discovery in textual databases) is the semiautomated process of extracting patterns (useful information and knowledge) from large amounts of unstructured data sources. Remember that data mining is the process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data stored in structured databases where the data are organized in records structured by categorical, ordinal, or continuous variables. Text mining is the same as data mining in that it has the same purpose and uses the same processes, but with text mining, the input to the process is a collection of unstructured (or less structured) data files such as Word documents, PDF files, text excerpts, and XML files. In essence, text mining can be thought of as a process (with two main steps) that starts with imposing structure on the text-based data sources followed by extracting relevant information and knowledge from these structured text-based data using data mining techniques and tools.
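The two-step view of text mining described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the sample documents, the function names (`to_term_counts`, `cosine_similarity`, `most_similar`), and the choice of cosine similarity as the "data mining" step are all hypothetical and are not taken from the chapter.

```python
import math
import re
from collections import Counter


def to_term_counts(text):
    """Step 1: impose structure by turning free text into term counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Compare two structured (term-count) records; 1.0 means identical direction."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical unstructured inputs (e.g., warranty claims and a review).
documents = {
    "claim_1": "The battery failed and the charger overheated",
    "claim_2": "Battery overheated while charging, charger failed",
    "review_1": "Great screen and excellent camera quality",
}

# Step 1 applied to every document: one structured record per document.
structured = {name: to_term_counts(text) for name, text in documents.items()}


def most_similar(name):
    """Step 2: a simple mining task, finding the closest other document."""
    others = (n for n in structured if n != name)
    return max(others, key=lambda n: cosine_similarity(structured[name], structured[n]))
```

Calling `most_similar("claim_1")` pairs the two battery-related claims, because they share most of their terms once the text has been converted into counts. Any other data mining technique (classification, clustering, association) could take the place of the similarity step.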

FIGURE 7.2 Text Analytics, Related Application Areas, and Enabling Disciplines.

[Figure labels. Top: TEXT ANALYTICS. Application areas: Text Mining (“Knowledge Discovery in Textual Data”), Information Retrieval, Natural Language Processing, Web Mining, Data Mining. Other labels: Web Content Mining, Web Structure Mining, Web Usage Mining, Classification, Clustering, Association, Document Matching, Link Analysis, Search Engines, POS Tagging, Lemmatization, Word Disambiguation. Foundation disciplines: Statistics, Artificial Intelligence, Machine Learning, Computer Science, Management Science, Other Disciplines.]

The benefits of text mining are obvious in the areas in which very large amounts of textual data are being generated, such as law (court orders), academic research (research articles), finance (quarterly reports), medicine (discharge summaries), biology (molecular interactions), technology (patent files), and marketing (customer comments). For example, the free-form text-based interactions with customers in the form of complaints (or compliments) and warranty claims can be used to objectively identify product and service characteristics that are deemed to be less than perfect and can be used as input to better product development and service allocations. Likewise, market outreach programs and focus groups generate large amounts of data. By not restricting product or service feedback to a codified form, customers can present, in their own words, what they think about a company’s products and services. Another area where the automated processing of unstructured text has had much impact is in electronic communications and e-mail. Text mining can be used not only to classify and filter junk e-mail but also to automatically prioritize e-mail based on importance level as well as generate automatic responses (Weng & Liu, 2004). Following are among the most popular application areas of text mining:

• Information extraction. Identifying key phrases and relationships within text by looking for predefined objects and sequences in text by way of pattern matching.

• Topic tracking. Based on a user profile and documents that a user views, predicting other documents of interest to the user.

• Summarization. Summarizing a document to save the reader time.

• Categorization. Identifying the main themes of a document and then placing the document into a predefined set of categories based on those themes.

• Clustering. Grouping similar documents without having a predefined set of categories.

• Concept linking. Connecting related documents by identifying their shared concepts and, by doing so, helping users find information that they perhaps would not have found using traditional search methods.

• Question answering. Finding the best answer to a given question through knowledge-driven pattern matching.
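As a small illustration of the first item in the list above, information extraction by pattern matching can be sketched with regular expressions. The two patterns, the `extract_entities` helper, and the sample sentence are hypothetical and chosen only for demonstration; production systems typically rely on much richer rules or trained models.

```python
import re

# Hypothetical patterns for two predefined object types (assumptions, not from the text).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "amount": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def extract_entities(text):
    """Return every match of each predefined pattern found in the text."""
    return {label: pattern.findall(text) for label, pattern in PATTERNS.items()}


sample = "Refund of $1,250.00 requested; contact jane.doe@example.com for details."
entities = extract_entities(sample)
print(entities)
```

The extracted values could then be stored as structured fields (for example, one row per complaint) and passed on to conventional data mining tools.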

See Technology Insights 7.1 for explanations of some of the terms and concepts used in text mining. Application Case 7.1 shows how text mining and a variety of user-generated data sources enable Netflix to stay innovative in its business practices, generate deeper customer insight, and drive very successful content for its viewers.

TECHNOLOGY INSIGHTS 7.1 Text Mining Terminology

The following list describes some commonly used text mining terms:

• Unstructured data (versus structured data). Structured data have a predetermined format. They are usually organized into records with simple data values (categorical, ordinal, and continuous variables) and stored in databases. In contrast, unstructured data do not have a predetermined format and are stored in the form of textual documents. In essence, structured data are for the computers to process, whereas unstructured data are for humans to process and understand.

• Corpus. In linguistics, a corpus (plural corpora) is a large and structured set of texts (now usually stored and processed electronically) prepared for the purpose of conducting knowledge discovery.

• Terms. A term is a single word or multiword phrase extracted directly from the corpus of a specific domain by means of NLP methods.

• Concepts. Concepts are features generated from a collection of documents by means of manual, statistical, rule-based, or hybrid categorization methodology. Compared to terms, concepts are the result of higher-level abstraction.

• Stemming. Stemming is the process of reducing inflected words to their stem (or base or root) form. For instance, stemmer, stemming, stemmed are all based on the root stem.

• Stop words. Stop words (or noise words) are words that are filtered out prior to or after processing natural language data (i.e., text). Even though there is no universally accepted list of stop words, most NLP tools use a list that includes articles (a, an, the), prepositions (of, on, for), auxiliary verbs (is, are, was, were), and context-specific words that are deemed not to have differentiating value.

• Synonyms and polysemes. Synonyms are syntactically different words (i.e., spelled differently) with identical or at least similar meanings (e.g., movie, film, and motion picture). In contrast, polysemes, which are often loosely referred to as homonyms, are syntactically identical words (i.e., spelled exactly the same) with different meanings (e.g., bow can mean “to bend forward,” “the front of the ship,” “the weapon that shoots arrows,” or “a kind of tied ribbon”).

• Tokenizing. A token is a categorized block of text in a sentence. The block of text corresponding to the token is categorized according to the function it performs. This assignment of meaning to blocks of text is known as tokenizing. A token can look like anything; it just needs to be a useful part of the structured text.

• Term dictionary. This is a collection of terms specific to a narrow field that can be used to restrict the extracted terms within a corpus.

• Word frequency. This is the number of times a word is found in a specific document.

• Part-of-speech tagging. This is the process of marking the words in a text as corresponding to a particular part of speech (nouns, verbs, adjectives, adverbs, etc.) based on a word’s definition and the context in which it is used.

• Morphology. This is the branch of the field of linguistics and a part of NLP that studies the internal structure of words (patterns of word formation within a language or across languages).

• Term-by-document matrix (occurrence matrix). This term refers to the common representation schema of the frequency-based relationship between the terms and documents in tabular format where terms are listed in columns, documents are listed in rows, and the frequency between the terms and documents is listed in cells as integer values.

• Singular value decomposition (latent semantic indexing). This dimensionality reduction method is used to transform the term-by-document matrix to a manageable size by generating an intermediate representation of the frequencies using a matrix manipulation method similar to principal component analysis.
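Several of the terms above (stop words, stemming, word frequency, and the term-by-document matrix) can be tied together in a short sketch. Everything here is an assumption made for illustration: the three sample documents, the tiny stop-word list, and especially `crude_stem`, which is a toy suffix stripper and not a real stemming algorithm such as Porter's. Splitting text into lowercase word tokens is also a simplification of the tokenizing described above.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (articles, prepositions, auxiliary verbs).
STOP_WORDS = {"a", "an", "the", "of", "on", "for", "is", "are", "was", "were"}


def crude_stem(word):
    """Toy stemmer: strip one common suffix, then undo consonant doubling."""
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            if stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]  # e.g., "stemm" -> "stem"
            return stem
    return word


def preprocess(text):
    """Split into word tokens, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


# Hypothetical corpus of three short documents.
corpus = {
    "d1": "Shipping was delayed",
    "d2": "The package shipped on time",
    "d3": "Delays on the refund for the package",
}

counts = {doc: Counter(preprocess(text)) for doc, text in corpus.items()}
vocabulary = sorted(set().union(*counts.values()))

# Term-by-document matrix: documents in rows, terms in columns, counts in cells.
matrix = [[counts[doc][term] for term in vocabulary] for doc in corpus]

print(vocabulary)
for doc, row in zip(corpus, matrix):
    print(doc, row)
```

With a realistic corpus this matrix becomes very wide and sparse, which is the situation that dimensionality reduction methods such as singular value decomposition are meant to address.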

Application Case 7.1 Netflix: Using Big Data to Drive Big Engagement: Unlocking the Power of Analytics to Drive Content and Consumer Insight

The Problem

In today’s hyper-connected world, businesses are under enormous pressure to build relationships with fully engaged consumers who keep coming back for more.

In theory, fostering more intimate consumer relationships becomes easier as new sources of data emerge, data volumes continue their unprecedented growth, and technology becomes more sophisticated. These developments should enable businesses to do a much better job of personalizing marketing campaigns and generating precise content recommendations that drive engagement, adoption, and value for subscribers.

Yet achieving an advanced understanding of one’s audience is a continuous process of testing and learning. It demands the ability to quickly gather and reliably analyze thousands, millions, even billions of events every day found in a variety of data sources, formats, and locations—otherwise known as Big Data. Technology platforms crafted to gather these data and conduct the analyses must be powerful enough to deliver timely insights today and flexible enough to change and grow in business and technology landscapes that morph with remarkable speed.

Netflix, an undisputed leader and innovator in the over-the-top (OTT) content space, understands this context better than most. It has staked its business and its brand on delivering highly targeted, personalized experiences for every subscriber—and has even begun using its remarkably detailed insights to change the way it buys, licenses, and develops content, causing many throughout the Media and Entertainment industries to sit up and take notice.

To support these efforts, Netflix leverages Teradata as a critical component of its data and analytics platform. More recently, the two companies partnered to transition Netflix to the Teradata Cloud, which has given Netflix the power and flexibility it needs—and, so, the ability to maintain its focus on those initiatives at the core of its business.

A Model for Data-Driven, Consumer-Focused Business

The Netflix story is a model for data-driven, direct-to-consumer, and subscriber-based companies—and, in fact, for any business that needs engaged audiences to thrive in a rapidly changing world.

After beginning as a mail-order DVD business, Netflix became the first prominent OTT content provider and turned the media world on its head; witness recent decisions by other major media companies to begin delivering OTT content.

One major element in Netflix’s success is the way it relentlessly tweaks its recommendation engines, constantly adapting to meet each consumer’s preferred style. Most of the company’s streaming activity emerges from its recommendations, which generate enormous consumer engagement and loyalty. Every interaction a Netflix subscriber has with the service is based on meticulously culled and analyzed interactions—no two experiences are the same.

In addition, as noted above, Netflix has applied its understanding of subscribers and potential subscribers—as individuals and as groups—to make strategic purchasing, licensing, and content development decisions. It has created two highly successful dramatic series—House of Cards and Orange is the New Black—that are informed in part by the company’s extraordinary understanding of its subscribers.

While those efforts and the business minds that drive them make up the heart of the company’s business, the technology that supports these initiatives must be more powerful and reliable than that of its competitors. The data and analytics platform must be able to:

• Rapidly and reliably handle staggering workloads; it must support insightful analysis of billions of transactional events each day—every search, browse, stop, and start—in whatever data format that records the events.

• Work with a variety of analytics approaches, including neural networks, Python, Pig, as well as varied Business Intelligence tools, like MicroStrategy.

• Easily scale and contract as necessary with exceptional elasticity.

• Provide a safe and redundant repository for all of the company’s data.

• Fit within the company’s cost structure and desired profit margins.

Bringing Teradata Analytics to the Cloud

With these considerations in mind, Netflix and Teradata teamed up to launch a successful venture to bring Netflix’s Teradata Data Warehouse into the cloud.

Power and Maturity: Teradata’s well-earned reputation for exceptional performance is especially important to a company like Netflix, which pounds its analytics platform with hundreds of concurrent queries. Netflix also needed data warehousing and analytics tools that enable complex workload management—essential for creating different queues for different users, and thus allowing for the constant and reliable filtering of what each user needs.

Hybrid Analytical Ecosystems and a Unified Data Architecture: Netflix’s reliance on a hybrid analytical ecosystem that leverages Hadoop where appropriate but refuses to compromise on speed and agility was the perfect fit for Teradata. Netflix’s cloud environment relies on a Teradata-Hadoop connector that enables Netflix to seamlessly move cloud-based data from another provider into the Teradata Cloud. The result is that Netflix can do much of its analytics off a world-class data warehouse in the Teradata Cloud that offers peace-of-mind redundancy, the ability to expand and contract in response to changing business conditions, and a significantly reduced need for data movement. And, Netflix’s no-holds-barred approach to allowing their analysts to use whatever analytical tools fit the bill demanded a unique analytics platform that could accommodate them. Having a partner that works efficiently with the full complement of analytical applications—both its own and other leading software providers—was critical.

Teradata’s Unified Data Architecture (UDA) helps provide this by recognizing that most companies need a safe, cost-effective collection of services, platforms, applications, and tools for smarter data management, processing, and analytics. In turn, organizations can get the most from all their data. The Teradata UDA includes:

• An integrated data warehouse, which enables organizations to access a comprehensive and shared data environment to quickly and reliably operationalize insights throughout an organization.

• A powerful discovery platform, which offers companies discovery analytics that rapidly unlock insights from all available data using a variety of techniques accessible to mainstream business analysts.

Application Case 7.1 (Continued)

Chapter 7 • Text Mining, Sentiment Analysis, and Social Analytics 397

u SECTION 7.2 REVIEW QUESTIONS

1. What is text analytics? How does it differ from text mining? 2. What is text mining? How does it differ from data mining? 3. Why is the popularity of text mining as an analytics tool increasing? 4. What are some of the most popular application areas of text mining?

7.3 NATURAL LANGUAGE PROCESSING (NLP)

Some of the early text mining applications used a simplified representation called bag-of-words when introducing structure to a collection of text-based documents to classify them into two or more predetermined classes or to cluster them into natu- ral groupings. In the bag-of-words model, text, such as a sentence, paragraph, or complete document, is represented as a collection of words, disregarding the gram- mar or the order in which the words appear. The bag-of-words model is still used in some simple document classification tools. For instance, in spam filtering, an e-mail message can be modeled as an unordered collection of words (a bag-of-words) that is compared against two different predetermined bags. One bag is filled with words found in spam messages and the other is filled with words found in legitimate e-mails. Although some of the words are likely to be found in both bags, the “spam” bag will contain spam-related words such as stock, Viagra, and buy much more frequently than the legitimate bag, which will contain more words related to the user’s friends or workplace. The level of match between a specific e-mail’s bag-of-words and the two bags containing the descriptors determines the membership of the e-mail as either spam or legitimate.

Naturally, we (humans) do not use words without some order or structure. We use words in sentences, which have semantic as well as syntactic structure. Thus, automated techniques (such as text mining) need to look for ways to go beyond the bag-of-words

• A data platform (e.g., Hadoop) provides the means to economically gather, store, and refine all a company’s data and facilitate the type of discovery never before believed possible.

The Proof Is in the Eyeballs

Netflix scrupulously adheres to a few simple and powerful metrics when evaluating the success of its personalization capabilities: eyeballs. Are subscribers watching? Are they watching more? Are they watching more of what interests them?

With engagement always top of mind, it’s no surprise that Netflix is among the world’s leaders in personalizing content to successfully attract and retain profitable consumers. It has achieved this standing by drawing on its understanding that in a rapidly changing business and technology landscape, one key to success is constantly testing new ways of gathering and analyzing data to deliver the most effective and targeted recommendations. Working with technology partners that make such testing possible frees Netflix to focus on its core business.

Moving ahead, Netflix believes that making increased use of cloud-based technology will further empower its customer engagement initiatives. By relying on technology partners that understand how to tailor solutions and provide peace of mind about the redundancy of Netflix’s data, the company expects to continue its organic growth and expand its capacity to respond nimbly to technological change and the inevitable ebbs and flows of business.

Questions for Case 7.1

1. What does Netflix do? How did they evolve into this current business model?

2. In the case of Netflix, what did it mean to be data-driven and customer-focused?

3. How did Netflix use Teradata technologies in its analytics endeavors?

Source: Teradata Case Study “Netflix: Using Big Data to Drive Big Engagement” https://www.teradata.com/Resources/Case-Studies/Netflix-Using-Big-Data-to-Drive-Big-Engageme (accessed July 2018).

398 Part II • Predictive Analytics/Machine Learning

interpretation and incorporate more and more semantic structure into their operations. The current trend in text mining is toward including many of the advanced features that can be obtained using NLP.

It has been shown that the bag-of-words method might not produce good enough information content for text mining tasks (e.g., classification, clustering, association). A good example of this can be found in evidence-based medicine. A critical component of evidence-based medicine is incorporating the best available research findings into the clinical decision-making process, which involves appraisal of the information collected from the printed media for validity and relevance. Several researchers from the University of Maryland developed evidence assessment models using a bag-of-words method (Lin and Demner-Fushman, 2005). They employed popular machine-learning methods along with more than half a million research articles collected from Medical Literature Analysis and Retrieval System Online (MEDLINE). In their models, the researchers represented each abstract as a bag-of-words, where each stemmed term represented a feature. Despite using popular classification methods with proven experimental design methodologies, their prediction results were not much better than simple guessing, which could indicate that the bag-of-words is not generating a good enough representation of the research articles in this domain; hence, more advanced techniques such as NLP were needed.

Natural language processing (NLP) is an important component of text mining and is a subfield of artificial intelligence and computational linguistics. It studies the problem of “understanding” the natural human language with the task of converting depictions of human language (such as textual documents) into more formal representations (in the form of numeric and symbolic data) that are easier for computer programs to manipulate. The goal of NLP is to move beyond syntax-driven text manipulation (which is often called word counting) to a true understanding and processing of natural language that considers grammatical and semantic constraints as well as the context.

The definition and scope of the word understanding is one of the major discussion topics in NLP. Considering that the natural human language is vague and that a true understanding of meaning requires extensive knowledge of a topic (beyond what is in the words, sentences, and paragraphs), will computers ever be able to understand natural language the same way and with the same accuracy that humans do? Probably not! NLP has come a long way from the days of simple word counting, but it has an even longer way to go to really understand natural human language. The following are just a few of the challenges commonly associated with the implementation of NLP:

• Part-of-speech tagging. It is difficult to mark up terms in a text as corresponding to a particular part of speech (such as nouns, verbs, adjectives, or adverbs) because the part of speech depends not only on the definition of the term but also on the context within which it is used.

• Text segmentation. Some written languages, such as Chinese, Japanese, and Thai, do not have single-word boundaries. In these instances, the text-parsing task requires the identification of word boundaries, which is often difficult. Similar challenges in speech segmentation emerge when analyzing spoken language because sounds representing successive letters and words blend into each other.

• Word sense disambiguation. Many words have more than one meaning. Selecting the meaning that makes the most sense can be accomplished only by taking into account the context within which the word is used.

• Syntactic ambiguity. The grammar for natural languages is ambiguous; that is, multiple possible sentence structures often need to be considered. Choosing the most appropriate structure usually requires a fusion of semantic and contextual information.

• Imperfect or irregular input. Foreign or regional accents and vocal impediments in speech and typographical or grammatical errors in texts make the processing of the language an even more difficult task.

• Speech acts. A sentence can often be considered an action by the speaker. The sentence structure alone might not contain enough information to define this action. For example, “Can you pass the class?” requests a simple yes/no answer, whereas “Can you pass the salt?” is a request for a physical action to be performed.

A long-standing dream of the artificial intelligence community is to have algorithms that are capable of automatically reading and obtaining knowledge from text. By applying a learning algorithm to parsed text, researchers from Stanford University’s NLP lab have developed methods that can automatically identify the concepts and relationships between those concepts in the text. By applying a unique procedure to large amounts of text, the lab’s algorithms automatically acquire hundreds of thousands of items of world knowledge and use them to produce significantly enhanced repositories for WordNet. WordNet is a laboriously hand-coded database of English words, their definitions, sets of synonyms, and various semantic relations between synonym sets. It is a major resource for NLP applications, but it has proven to be very expensive to build and maintain manually. By automatically inducing knowledge into WordNet, the potential exists to make it an even greater and more comprehensive resource for NLP at a fraction of the cost. One prominent area in which the benefits of NLP and WordNet are already being harvested is in customer relationship management (CRM). Broadly speaking, the goal of CRM is to maximize customer value by better understanding and effectively responding to customers’ actual and perceived needs. An important area of CRM in which NLP is making a significant impact is sentiment analysis. Sentiment analysis is a technique used to detect favorable and unfavorable opinions toward specific products and services using a large number of textual data sources (customer feedback in the form of Web postings). A detailed coverage of sentiment analysis and WordNet is given in Section 7.6.

Analytics in general and text analytics and text mining in particular can be used in the broadcasting industry. Application Case 7.2 provides an example that uses a wide range of analytics capabilities to capture new viewers, predict ratings, and add business value to a broadcasting company.

Application Case 7.2 AMC Networks Is Using Analytics to Capture New Viewers, Predict Ratings, and Add Value for Advertisers in a Multichannel World

Over the past 10 years, the cable television sector in the United States has enjoyed a period of growth that has enabled unprecedented creativity in the creation of high-quality content. AMC Networks has been at the forefront of this new golden age of television, producing a string of successful, critically acclaimed shows such as Breaking Bad, Mad Men, and The Walking Dead.

Dedicated to producing quality programming and movie content for more than 30 years, AMC Networks owns and operates several of the most popular and award-winning brands in cable television, producing and delivering distinctive, compelling, and culturally relevant content that engages audiences across multiple platforms.

Getting Ahead of the Game

Despite its success, AMC Networks has no plans to rest on its laurels. As Vitaly Tsivin, SVP Business Intelligence, explains:

We have no interest in standing still. Although a large percentage of our business is still linear cable TV, we need to appeal to a new generation of millennials who consume content in very different ways.

TV has evolved into a multichannel, multistream business, and cable networks need to get smarter about how they market to and connect with audiences across all of those streams. Relying on traditional ratings data and third-party analytics providers is going to be a losing strategy: you need to take ownership of your data, and use it to get a richer picture of who your viewers are, what they want, and how you can keep their attention in an increasingly crowded entertainment marketplace.

Zoning in on the Viewer

The challenge is that there is just so much information available: hundreds of billions of rows of data from industry data-providers such as Nielsen and comScore, from channels such as AMC’s TV Everywhere live Web streaming and video-on-demand service, from retail partners such as iTunes and Amazon, and from third-party online video services such as Netflix and Hulu.

“We can’t rely on high-level summaries; we need to be able to analyze both structured and unstructured data, minute-by-minute and viewer-by-viewer,” says Tsivin. “We need to know who’s watching and why, and we need to know it quickly so that we can decide, for example, whether to run an ad or a promo in a particular slot during tomorrow night’s episode of Mad Men.”

AMC decided it needed to develop an industry-leading analytics capability in-house and focused on delivering this capability as quickly as possible. Instead of conducting a prolonged and expensive vendor and product selection process, AMC decided to leverage its existing relationship with IBM as its trusted strategic technology partner. The time and money traditionally spent on procurement were instead invested in realizing the solution, accelerating AMC’s progress on its analytics roadmap by at least six months.


Web-Based Dashboard Used by AMC Networks. Source: Used with permission of AMC Networks.


Empowering the Research Department

In the past, AMC’s research team spent a large portion of its time processing data. Today, thanks to its new analytics tools, it is able to focus most of its energy on gaining actionable insights.

“By investing in big data analytics technology from IBM, we’ve been able to increase the pace and detail of our research an order of magnitude,” says Tsivin. “Analyses that used to take days and weeks are now possible in minutes, or even seconds.” He added,

Bringing analytics in-house will provide major ongoing cost-savings. Instead of paying hundreds of thousands of dollars to external vendors when we need some analysis done, we can do it ourselves: more quickly, more accurately, and much more cost-effectively. We’re expecting to see a rapid return on investment.

As more sources of potential insight become available and analytics becomes more strategic to the business, an in-house approach is really the only viable way forward for any network that truly wants to gain competitive advantage from its data.

Driving Decisions with Data

Many of the results delivered by this new analytics capability demonstrate a real transformation in the way AMC operates. For example, the company’s business intelligence department has been able to create sophisticated statistical models that help the company refine its marketing strategies and make smarter decisions about how intensively it should promote each show.

• Instrumented. AMC combines ratings data with viewer information from a wide range of digital channels: its own video-on-demand and live-streaming services, retailers, and online TV services.

• Interconnected. A powerful and comprehensive big data and analytics engine centralizes the data and makes them available to a range of descriptive and predictive analytics tools for accelerated modeling, reporting, and analysis.

• Intelligent. AMC can predict which shows will be successful, how it should schedule them, what promos it should create, and to whom it should market them, helping to win new audience share in an increasingly competitive market.

With deeper insight into viewership, AMC’s direct marketing campaigns are also much more successful than before. In one recent example, intelligent segmentation and look-alike modeling helped the company target new and existing viewers so effectively that AMC video-on-demand transactions were higher than would be expected otherwise.

This newfound ability to reach out to new viewers based on their individual needs and preferences is not just valuable for AMC; it also has huge potential value for the company’s advertising partners. AMC is currently working on providing access to its rich data sets and analytics tools as a service for advertisers, helping them fine-tune their campaigns to appeal to ever-larger audiences across both linear and digital channels.

Tsivin concludes, “Now that we can really harness the value of big data, we can build a much more attractive proposition for both consumers and advertisers: creating even better content, marketing it more effectively, and helping it reach a wider audience by taking full advantage of our multichannel capabilities.”

Questions for Case 7.2

1. What are the common challenges that broadcasting companies are facing today? How can analytics help to alleviate these challenges?

2. How did AMC leverage analytics to enhance its business performance?

3. What were the types of text analytics and text mining solutions developed by AMC Networks? Can you think of other potential uses of text mining applications in the broadcasting industry?

Sources: IBM Customer Case Study. “Using Analytics to Capture New Viewers, Predict Ratings and Add Value for Advertisers in a Multichannel World.” http://www-03.ibm.com/software/businesscasestudies/us/en/corp?synkey=A023603A76220M60 (accessed July 2016); www.ibm.com; www.amcnetworks.com.


NLP has been applied successfully across a variety of domains, with computer programs automatically carrying out language-processing tasks that previously could be done only by humans. Following are among the most popular of these tasks:

• Question answering. The task of automatically answering a question posed in natural language; that is, producing a human language answer when given a human language question. To find the answer to a question, the computer program can use either a prestructured database or a collection of natural language documents (a text corpus such as the World Wide Web).

• Automatic summarization. The creation of a shortened version of a textual document by a computer program that contains the most important points of the original document.

• Natural language generation. The conversion of information from computer databases into readable human language.

• Natural language understanding. The conversion of samples of human language into more formal representations that are easier for computer programs to manipulate.

• Machine translation. The automatic translation of one human language to another.

• Foreign language reading. A computer program that assists a nonnative language speaker in reading a foreign language with correct pronunciation and accents on different parts of the words.

• Foreign language writing. A computer program that assists a nonnative language user in writing in a foreign language.

• Speech recognition. Conversion of spoken words to machine-readable input. Given a sound clip of a person speaking, the system produces a text dictation.

• Text to speech. Also called speech synthesis, a computer program that automatically converts normal language text into human speech.

• Text proofing. A computer program that reads a proof copy of a text to detect and correct any errors.

• Optical character recognition. The automatic translation of images of handwritten, typewritten, or printed text (usually captured by a scanner) into machine-editable textual documents.

The success and popularity of text mining depend greatly on advancements in NLP in both generating and understanding human languages. NLP enables the extraction of features from unstructured text so that a wide variety of data mining techniques can be used to extract knowledge (novel and useful patterns and relationships) from it. In that sense, simply put, text mining is a combination of NLP and data mining.

SECTION 7.3 REVIEW QUESTIONS

1. What is NLP?

2. How does NLP relate to text mining?

3. What are some of the benefits and challenges of NLP?

4. What are the most common tasks addressed by NLP?

7.4 TEXT MINING APPLICATIONS

As the amount of unstructured data collected by organizations increases, so do the value proposition and popularity of text mining tools. Many organizations are now realizing the importance of extracting knowledge from their document-based data repositories through the use of text mining tools. The following is only a small subset of the exemplary application categories of text mining.

Marketing Applications

Text mining can be used to increase cross-selling and up-selling by analyzing the unstructured data generated by call centers. Text-based notes generated at call centers, as well as transcriptions of voice conversations with customers, can be analyzed by text mining algorithms to extract novel, actionable information about customers’ perceptions toward a company’s products and services. In addition, blogs, user reviews of products at independent Web sites, and discussion board postings are gold mines of customer sentiments. This rich collection of information, once properly analyzed, can be used to increase satisfaction and the overall lifetime value of the customer (Coussement & Van den Poel, 2008).

Text mining has become invaluable for CRM. Companies can use text mining to analyze rich sets of unstructured text data combined with the relevant structured data extracted from organizational databases to predict customer perceptions and subsequent purchasing behavior. Coussement and Van den Poel (2009) successfully applied text mining to significantly improve a model’s ability to predict customer churn (i.e., customer attrition) so that the customers most likely to leave a company are accurately identified for retention tactics.

Ghani et al. (2006) used text mining to develop a system capable of inferring implicit and explicit attributes of products to enhance retailers’ ability to analyze product databases. Treating products as sets of attribute–value pairs rather than as atomic entities can potentially boost the effectiveness of many business applications, including demand forecasting, assortment optimization, product recommendations, assortment comparison across retailers and manufacturers, and product supplier selection. The proposed system allows a business to represent its products in terms of attributes and attribute values without much manual effort. The system learns these attributes by applying supervised and semi-supervised learning techniques to product descriptions found on retailers’ Web sites.

Security Applications

One of the largest and most prominent text mining applications in the security domain is probably the highly classified ECHELON surveillance system. As rumor has it, ECHELON is assumed to be capable of identifying the content of telephone calls, faxes, e-mails, and other types of data, intercepting information sent via satellites, public-switched telephone networks, and microwave links.

In 2007, the European Union Agency for Law Enforcement Cooperation (EUROPOL) developed an integrated system capable of accessing, storing, and analyzing vast amounts of structured and unstructured data sources to track transnational organized crime. Called the Overall Analysis System for Intelligence Support (OASIS), it aims to integrate the most advanced data and text mining technologies available in today’s market. The system has enabled EUROPOL to make significant progress in supporting its law enforcement objectives at the international level (EUROPOL, 2007).

The U.S. Federal Bureau of Investigation (FBI) and the Central Intelligence Agency (CIA), under the direction of the Department of Homeland Security, are jointly developing a supercomputer data and text mining system. The system is expected to create a gigantic data warehouse along with a variety of data and text mining modules to meet the knowledge-discovery needs of federal, state, and local law enforcement agencies. Prior to this project, the FBI and CIA each had its own separate database with little or no interconnection.

Another security-related application of text mining is in the area of deception detection. Applying text mining to a large set of real-world criminal (person-of-interest) statements, Fuller, Biros, and Delen (2008) developed prediction models to differentiate deceptive statements from truthful ones. Using a rich set of cues extracted from textual statements, the model predicted the holdout samples with 70 percent accuracy, which is believed to be a significant success considering that the cues are extracted only from textual statements (no verbal or visual cues are present). Furthermore, compared to other deception-detection techniques, such as polygraphs, this method is nonintrusive and widely applicable to not only textual data but also (potentially) transcriptions of voice recordings. A more detailed description of text-based deception detection is provided in Application Case 7.3.

Biomedical Applications

Text mining holds great potential for the medical field in general and biomedicine in particular for several reasons. First, published literature and publication outlets (especially with the advent of the open source journals) in the field are expanding at an exponential rate. Second, compared to most other fields, medical literature is more standardized and orderly, making it a more “minable” information source. Finally, the terminology used in

Application Case 7.3 Mining for Lies

Driven by advancements in Web-based information technologies and increasing globalization, computer-mediated communication continues to filter into everyday life, bringing with it new venues for deception. The volume of text-based chat, instant messaging, text messaging, and text generated by online communities of practice is increasing rapidly. Even the use of e-mail continues to increase. With the massive growth of text-based communication, the potential for people to deceive others through computer-mediated communication has also grown, and such deception can have disastrous results.

Unfortunately, in general, humans tend to perform poorly at deception-detection tasks. This phenomenon is exacerbated in text-based communications. A large part of the research on deception detection (also known as credibility assessment) has involved face-to-face meetings and interviews. Yet with the growth of text-based communication, text-based deception-detection techniques are essential.

Techniques for successfully detecting deception (that is, lies) have wide applicability. Law enforcement can use decision support tools and techniques to investigate crimes, conduct security screening in airports, and monitor communications of suspected terrorists. Human resources professionals might use deception-detection tools to screen applicants. These tools and techniques also have the potential to screen e-mails to uncover fraud or other wrongdoings committed by corporate officers. Although some people believe that they can readily identify those who are not being truthful, a summary of deception research showed that, on average, people are only 54 percent accurate in making veracity determinations (Bond & DePaulo, 2006). This figure may actually be worse when humans try to detect deception in text.

Using a combination of text mining and data mining techniques, Fuller et al. (2008) analyzed person-of-interest statements completed by people involved in crimes on military bases. In these statements, suspects and witnesses are required to write their recollection of the event in their own words. Military law enforcement personnel searched archival data for statements that they could conclusively identify as being truthful or deceptive. These decisions were made on the basis of corroborating evidence and case resolution. Once labeled as truthful or deceptive, the law enforcement personnel removed identifying information and gave the statements to the research team. In total, 371 usable statements were received for analysis. The text-based deception-detection method used by Fuller et al. was based on a process known as message feature mining, which relies on elements of data and text mining techniques. A simplified depiction of the process is provided in Figure 7.3.

First, the researchers prepared the data for processing. The original handwritten statements had to be transcribed into a word processing file. Second, features (i.e., cues) were identified. The researchers identified 31 features representing categories or types of language that are relatively independent of the text content and that can be readily analyzed by automated means. For example, first-person pronouns such as I or me can be identified without analysis of the surrounding text. Table 7.1 lists the categories and examples of features used in this study.

FIGURE 7.3 Text-Based Deception-Detection Process. [Flow diagram not reproduced. Boxes in the diagram: Statements Transcribed for Processing; Text Processing Software Identified Cues in Statements; Statements Labeled as Truthful or Deceptive by Law Enforcement; Text Processing Software Generated Quantified Cues; Classification Models Trained and Tested on Quantified Cues; Cues Extracted & Selected.] Source: Fuller, C. M., D. Biros, & D. Delen. (2008, January). Exploration of Feature Selection and Advanced Classification Models for High-Stakes Deception Detection. Proceedings of the Forty-First Annual Hawaii International Conference on System Sciences (HICSS), Big Island, HI: IEEE Press, pp. 80–99.

TABLE 7.1 Categories and Examples of Linguistic Features Used in Deception Detection

Number  Construct (Category)  Example Cues
1       Quantity              Verb count, noun phrase count, etc.
2       Complexity            Average number of clauses, average sentence length, etc.
3       Uncertainty           Modifiers, modal verbs, etc.
4       Nonimmediacy          Passive voice, objectification, etc.
5       Expressivity          Emotiveness
6       Diversity             Lexical diversity, redundancy, etc.
7       Informality           Typographical error ratio
8       Specificity           Spatiotemporal information, perceptual information, etc.
9       Affect                Positive affect, negative affect, etc.
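A few of the cue categories in Table 7.1 can be quantified with very little code, which illustrates why such cues are described as readily analyzable by automated means. The sketch below is an assumption-laden illustration, not the software used in the study: the function name extract_cues, the short word lists, and the use of a plain word count as a stand-in for the Quantity category are all hypothetical.

```python
import re

# Hypothetical, deliberately short word lists for illustration only.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}
MODAL_VERBS = {"may", "might", "could", "would", "should", "can", "must"}


def extract_cues(statement):
    """Turn a free-text statement into a small dictionary of numeric cues."""
    sentences = [s for s in re.split(r"[.!?]+", statement) if s.strip()]
    words = re.findall(r"[a-z']+", statement.lower())
    n_words = len(words)
    return {
        # Quantity (simplified: total words rather than verb or noun-phrase counts)
        "word_count": n_words,
        # Complexity
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        # Uncertainty
        "modal_verb_count": sum(w in MODAL_VERBS for w in words),
        # First-person pronouns, found without analyzing the surrounding text
        "first_person_count": sum(w in FIRST_PERSON for w in words),
        # Diversity: share of distinct words among all words
        "lexical_diversity": len(set(words)) / n_words if n_words else 0.0,
    }


cues = extract_cues("I left the office at noon. I might have locked the door.")
for name, value in cues.items():
    print(name, value)
```

Each statement becomes one row of numbers, so a collection of labeled statements can be written to a flat file and passed to any standard classification method.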


this literature is relatively constant, having a fairly standardized ontology. What follows are a few exemplary studies that successfully used text mining techniques in extracting novel patterns from biomedical literature.

Experimental techniques such as DNA microarray analysis, serial analysis of gene expression (SAGE), and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins. As in any other experimental approach, it is necessary to analyze this vast amount of data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research.

Knowing the location of a protein within a cell can help to elucidate its role in biological processes and to determine its potential as a drug target. Numerous location-prediction systems are described in the literature; some focus on specific organisms, whereas others attempt to analyze a wide range of organisms. Shatkay et al. (2007) proposed a comprehensive system that uses several types of sequence- and text-based features to predict the location of proteins. The main novelty of their system lies in the way in which it selects its text sources and features and integrates them with sequence-based features. They tested the system on previously used and new data sets devised specifically to test its predictive power. The results showed that their system consistently beat previously reported results.

Chun et al. (2006) described a system that extracts disease–gene relationships from literature accessed via MEDLINE. They constructed a dictionary for disease and gene names from six public databases and extracted relation candidates by dictionary matching. Because dictionary matching produces a large number of false positives, they developed a machine-learning–based named entity recognition (NER) method to filter out false recognition of disease/gene names. They found that the success of relation extraction is heavily dependent on the performance of NER filtering and that the filtering improved the precision of relation extraction by 26.7 percent at the cost of a small reduction in recall.
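The trade-off described above, in which dictionary matching over-generates candidates and filtering raises precision at some cost in recall, can be illustrated with a small sketch. Everything here is assumed for the example: the miniature dictionaries, the function names, and the sample sentences are hypothetical and are not taken from the study.

```python
import re

# Hypothetical miniature dictionaries; the study compiled its own from six public databases.
DISEASES = {"breast cancer", "diabetes"}
GENES = {"brca1", "ins", "tp53"}


def find_terms(text, dictionary):
    """Return dictionary entries that occur in the text as whole words or phrases."""
    lowered = text.lower()
    return sorted(t for t in dictionary
                  if re.search(r"\b" + re.escape(t) + r"\b", lowered))


def relation_candidates(sentence):
    """Pair every disease name with every gene name found in the same sentence."""
    return [(d, g)
            for d in find_terms(sentence, DISEASES)
            for g in find_terms(sentence, GENES)]


def precision(true_pos, false_pos):
    """Share of extracted relations that are correct."""
    total = true_pos + false_pos
    return true_pos / total if total else 0.0


def recall(true_pos, false_neg):
    """Share of correct relations that were actually extracted."""
    total = true_pos + false_neg
    return true_pos / total if total else 0.0


# A genuine candidate.
print(relation_candidates("Mutations in BRCA1 are associated with breast cancer."))
# A false positive: "ins" here is an ordinary word, not the gene symbol.
print(relation_candidates("Clinicians know the ins and outs of diabetes care."))
```

The second sentence shows the kind of false recognition that a filtering step is meant to remove. Discarding such candidates lowers the false-positive count in precision, while any genuine relation discarded by mistake adds to the false-negative count in recall.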

Figure 7.4 shows a simplified depiction of a multilevel text analysis process for discovering gene–protein relationships (or protein–protein interactions) in the biomedical

The features were extracted from the textual statements and input into a flat file for further processing. Using several feature-selection methods along with 10-fold cross-validation, the researchers compared the prediction accuracy of three popular data mining methods. Their results indicated that neural network models performed the best, with 73.46 percent prediction accuracy on test data samples; decision trees performed second best, with 71.60 percent accuracy; and logistic regression was last, with 65.28 percent accuracy.
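For readers unfamiliar with 10-fold cross-validation, the following sketch shows the mechanics: the data are split into ten folds, each fold is held out once for testing while the model is trained on the other nine, and the ten accuracies are averaged. This is only an outline under stated assumptions. The function names are hypothetical, the data are made up, and a trivial majority-class baseline stands in for the neural network, decision tree, and logistic regression models compared in the study.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k folds (requires n_samples >= k)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, train_fn, predict_fn, k=10, seed=0):
    """Average held-out accuracy over k folds."""
    accuracies = []
    for held_out in k_fold_indices(len(labels), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, features[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / k


# Stand-in classifier: always predict the most common training label.
def train_majority(train_features, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, feature_row):
    return model


# Made-up data: 20 statements, each with a single dummy cue value.
features = [[0.0] for _ in range(20)]
labels = ["truthful"] * 12 + ["deceptive"] * 8

print(cross_validate(features, labels, train_majority, predict_majority))
```

Any classifier can be compared on the same folds by passing a different pair of training and prediction functions, which is how accuracy figures for competing methods become comparable.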

The results indicate that automated text-based deception detection has the potential to aid those who must try to detect lies in text and can be successfully applied to real-world data. The accuracy of these techniques exceeded the accuracy of most other deception-detection techniques, even though it was limited to textual cues.

Questions for Case 7.3

1. Why is it difficult to detect deception?

2. How can text/data mining be used to detect deception in text?

3. What do you think are the main challenges for such an automated system?

Sources: Fuller, C. M., D. Biros, & D. Delen. (2008, January). “Exploration of Feature Selection and Advanced Classification Models for High-Stakes Deception Detection.” Proceedings of the Forty-First Annual Hawaii International Conference on System Sciences (HICSS), Big Island, HI: IEEE Press, pp. 80–99; Bond, C. F., & B. M. DePaulo. (2006). “Accuracy of Deception Judgments.” Personality and Social Psychology Review, 10(3), pp. 214–234.

Application Case 7.3 (Continued)


literature (Nakov, Schwartz, Wolf, and Hearst, 2005). As can be seen in this simplified example that uses a simple sentence from biomedical text, first (at the bottom three levels) the text is tokenized using part-of-speech tagging and shallow parsing. The tokenized terms (words) are then matched (and interpreted) against the hierarchical representation of the domain ontology to derive the gene–protein relationship. Application of this method (and/or some variation of it) to the biomedical literature offers great potential to decode the complexities in the Human Genome Project.

Academic Applications

The issue of text mining is of great importance to publishers who hold large databases of information requiring indexing for better retrieval. This is particularly true in scientific disciplines in which highly specific information is often contained within written text. Initiatives have been launched, such as Nature’s proposal for an Open Text Mining Interface and the National Institutes of Health’s common Journal Publishing Document Type Definition, which would provide semantic cues to machines to answer specific queries contained within text without removing publisher barriers to public access.

Academic institutions have also launched text mining initiatives. For example, the National Centre for Text Mining, a collaborative effort between the Universities of Manchester and Liverpool, provides customized tools, research facilities, and advice on text mining to the academic community. With an initial focus on text mining in the biological and biomedical sciences, research has since expanded into the social sciences. In the United States, the School of Information at the University of California–Berkeley is developing a program called BioText to assist bioscience researchers in text mining and analysis.

[Figure 7.4 (image not reproduced): a layered annotation of the sentence fragment “... expression of Bcl-2 is correlated with insufficient white blood cell death and activation of p53.” The annotation layers are labeled Gene/Protein, Ontology, Word, POS, and Shallow Parse.]

FIGURE 7.4 Multilevel Analysis of Text for Gene/Protein Interaction Identification. Source: Used with permission of Nakov, P., Schwartz, A., Wolf, B., & Hearst, M. A. (2005). Supporting annotation layers for natural language processing. Proceedings of the Association for Computational Linguistics (ACL), Interactive Poster and Demonstration Sessions, Ann Arbor, MI. Association for Computational Linguistics, 65–68.


As described in this section, text mining has a wide variety of applications in a number of different disciplines. See Application Case 7.4 for an example of how a leading computing product manufacturer uses text mining to better understand its current and potential customers’ needs and wants related to product quality and product design.

Advanced analytics techniques that use both structured and unstructured data have been successfully used in many application domains. Application Case 7.4 provides an interesting example in which a wide range of analytics capabilities is used to successfully manage the Orlando Magic organization both on and off the court in the NBA.

From ticket sales to starting lineups, the Orlando Magic have come a long way since their inaugural season in 1989. There weren’t many wins in those early years, but the franchise has weathered the ups and downs to compete at the highest levels of the NBA.

Professional sports teams in smaller markets often struggle to build a big enough revenue base to compete against their larger market rivals. By using SAS® Analytics and SAS® Data Management, the Orlando Magic are among the top revenue earners in the NBA, despite being in the 20th-largest market.

The Magic accomplish this feat by studying the resale ticket market to price tickets better, to predict season ticket holders at risk of defection (and lure them back), and to analyze concession and product merchandise sales to make sure the organization has what the fans want every time they enter the arena. The club has even used SAS to help coaches put together the best lineup.

“Our biggest challenge is to customize the fan experience, and SAS helps us manage all that in a robust way,” says Alex Martins, CEO of the Orlando Magic. Having been with the Magic since the beginning (working his way up from PR Director to President to CEO), Martins has seen it all and knows the value that analytics adds. Under Martins’ leadership, the season-ticket base has grown as large as 14,200, and the corporate sales department has seen tremendous growth.

The Challenge: Filling Every Seat

But like all professional sports teams, the Magic are constantly looking for new strategies that will keep the seats filled at each of the 41 yearly home games. “Generating new revenue streams in this day of escalating player salaries and escalating expenses is important,” says Anthony Perez, vice president of Business Strategy. But with the advent of a robust online secondary market for tickets, reaching the industry benchmark of 90 percent renewal of season tickets has become more difficult.

“In the first year, we saw ticket revenue increase around 50 percent. Over the last three years—for that period, we’ve seen it grow maybe 75 percent. It’s had a huge impact” said Anthony Perez, vice president of Business Strategy, Orlando Magic.

Perez’s group takes a holistic approach by combining data from all revenue streams (concession, merchandise, and ticket sales) with outside data (secondary ticket market) to develop models that benefit the whole enterprise. “We’re like an in-house consulting group,” explains Perez.

In the case of season ticket holders, the team uses historical purchasing data and renewal patterns to build decision tree models that place subscribers into three categories: most likely to renew, least likely to renew, and fence sitters. The fence sitters then get the customer service department’s attention come renewal time.

“SAS has helped us grow our business. It is probably one of the greatest investments that we’ve made as an organization over the last half-dozen years because we can point to top-line revenue growth that SAS has helped us create through the specific messaging that we’re able to direct to each one of our client groups.”

How Do They Predict Season Ticket Renewals?

When analytics showed the team that 80 percent of revenue was from season ticket holders, it decided to take a proactive approach to renewals and at-risk accounts. The Magic don’t have a crystal ball, but they do have SAS® Enterprise Miner™, which allowed them to better understand their data and

Application Case 7.4 The Magic Behind the Magic: Instant Access to Information Helps the Orlando Magic Up their Game and the Fan’s Experience


develop analytic models that combine three pillars for predicting season ticket holder renewals:

• Tenure (how long had the customer been a ticket holder?).

• Ticket use (did the customer actually attend the games?).

• Secondary market activity (were the unused tickets successfully sold on secondary sites?).

The data mining tools allowed the team to accomplish more accurate scoring that led to a difference—and marked improvement—in the way it approached customer retention and marketing.

Ease of Use Helps Spread Analytics Message

Perez likes how easy it is to use SAS; it was a factor in opting to do the work in-house rather than outsourcing it. Perez’s team has set up recurring processes and automated them. Data manipulation is minimal, “allowing us more time to interpret rather than just manually crunching the numbers.” Business users throughout the organization, including executives, have instant access to information through SAS® Visual Analytics. “It’s not just that we’re using the tools daily; we are using them throughout the day to make decisions,” Perez says.

Being Data-Driven

“We adopted an analytics approach years ago, and we’re seeing it transform our entire organization,” says Martins. “Analytics helps us understand customers better, helps in business planning (ticket pricing, etc.), and provides game-to-game and year-to-year data on demand by game and even by seat.”

“And analytics has helped transform the game. GMs and analytics teams look at every aspect of the game, including movements of players on the court, to transform data to predict defense against certain teams. We can now ask ourselves, ‘What are the most efficient lineups in a game? Which team can produce more points vs. another lineup? Which team is better defensively than another?’”

“We used to produce a series of reports manually, but now we can do it with five clicks of a mouse (instead of five hours overnight in anticipation of tomorrow’s game). We can have dozens of reports available to staff in minutes. Analytics has made us smarter,” says Martins.

What’s Next?

“Getting real-time data is the next step for us in our analytical growth process,” says Martins. “On a game day, getting real-time data to track what tickets are available and how to maximize yield of those tickets is critical. Additionally, you’re going to see major technological changes and acceptance of the technology on the bench to see how the games are played moving forward. Maybe as soon as next season you’ll see our assistant coaches with iPad® tablets getting real-time data, learning what the opponent is doing and what plays are working. It’ll be necessary in the future.

“We’re setting ourselves up to be successful moving forward. And in the very near future, we’ll be in a position again to compete for a conference championship and an NBA championship,” says Martins. “All of the moves made this year and the ones to come in the future will be done in order to build success on [and off] the court.”

Questions for Case 7.4

1. According to the application case, what were the main challenges the Orlando Magic was facing?

2. How did analytics help the Orlando Magic to overcome some of its most significant challenges on and off the court?

3. Can you think of other uses of analytics in sports and especially in the case of the Orlando Magic? You can search the Web to find some answers to this question.

Source: SAS Customer Story, “The magic behind the Magic: Instant access to information helps the Orlando Magic up their game and the fan’s experience” at https://www.sas.com/en_us/customers/orlando-magic.html and https://www.nba.com/magic/news/denton-25-years-magic-history (accessed November 2018).

SECTION 7.4 REVIEW QUESTIONS

1. List and briefly discuss some of the text mining applications in marketing.

2. How can text mining be used in security and counterterrorism?

3. What are some promising text mining applications in biomedicine?


7.5 TEXT MINING PROCESS

To be successful, text mining studies should follow a sound methodology based on best practices. A standardized process model is needed similar to Cross-Industry Standard Process for Data Mining (CRISP-DM), which is the industry standard for data mining projects (see Chapter 4). Even though most parts of CRISP-DM are also applicable to text mining projects, a specific process model for text mining would include much more elaborate data preprocessing activities. Figure 7.5 depicts a high-level context diagram of a typical text mining process (Delen & Crossland, 2008). This context diagram presents the scope of the process, emphasizing its interfaces with the larger environment. In essence, it draws boundaries around the specific process to explicitly identify what is included in (and excluded from) the text mining process.

As the context diagram indicates, the input (inward connection to the left edge of the box) into the text-based knowledge-discovery process is the unstructured as well as structured data collected, stored, and made available to the process. The output (outward extension from the right edge of the box) of the process is the context-specific knowledge that can be used for decision making. The controls, also called the constraints (inward connection to the top edge of the box), of the process include software and hardware limitations, privacy issues, and difficulties related to processing the text that is presented in the form of natural language. The mechanisms (inward connection to the bottom edge of the box) of the process include proper techniques, software tools, and domain expertise. The primary purpose of text mining (within the context of knowledge discovery) is to process unstructured (textual) data (along with structured data if relevant to the problem being addressed and available) to extract meaningful and actionable patterns for better decision making.

At a very high level, the text mining process can be broken down into three consecutive tasks, each of which has specific inputs to generate certain outputs (see Figure 7.6). If, for some reason, the output of a task is not what is expected, a backward redirection to the previous task execution is necessary.

Task 1: Establish the Corpus

The main purpose of the first task activity is to collect all the documents related to the context (domain of interest) being studied. This collection may include textual documents, XML files, e-mails, Web pages, and short notes. In addition to the readily available

[Figure 7.5 (image not reproduced): a context diagram with a single process box, A0, “Extract knowledge from available data sources.” Inputs: unstructured data (text) and structured data (databases). Output: context-specific knowledge. Controls: software/hardware limitations, privacy issues, and linguistic limitations. Mechanisms: tools and techniques, and domain expertise.]

FIGURE 7.5 Context Diagram for the Text Mining Process.


textual data, voice recordings may also be transcribed using speech-recognition algorithms and made a part of the text collection.

Once collected, the text documents are transformed and organized in a manner such that they are all in the same representational form (e.g., ASCII text files) for computer processing. The organization of the documents can be as simple as a collection of digitized text excerpts stored in a file folder or a list of links to a collection of Web pages in a specific domain. Many commercially available text mining software tools could accept these as input and convert them into a flat file for processing. Alternatively, the flat file can be prepared outside the text mining software and then presented as the input to the text mining application.
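As a rough illustration of preparing the flat file outside the text mining software, the sketch below gathers every `.txt` file in a folder into a single CSV file. The function name `build_flat_file`, the two-column layout (`doc_id`, `text`), and the UTF-8 encoding are assumptions made for this example, not requirements of any particular tool.

```python
import csv
from pathlib import Path

def build_flat_file(folder, out_csv):
    """Collect every .txt file in `folder` into one CSV flat file (doc_id, text)."""
    rows = []
    for path in sorted(Path(folder).glob("*.txt")):
        # Collapse all whitespace so every document has the same simple form.
        text = " ".join(path.read_text(encoding="utf-8").split())
        rows.append((path.stem, text))
    with open(out_csv, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["doc_id", "text"])  # header row (assumed layout)
        writer.writerows(rows)
    return rows
```

A call such as `build_flat_file("corpus_folder", "corpus.csv")` (both names hypothetical) would produce one row per document, ready to be handed to the next task.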

Task 2: Create the Term–Document Matrix

In this task, the digitized and organized documents (the corpus) are used to create the term–document matrix (TDM). In the TDM, the rows represent the documents and the columns represent the terms. The relationships between the terms and documents are characterized by indices (i.e., a relational measure that can be as simple as the number of occurrences of the term in respective documents). Figure 7.7 is a typical example of a TDM.

[Figure 7.6 (image not reproduced): three tasks in sequence, with feedback paths from each task back to the previous one.

Task 1, Establish the Corpus: Collect and organize the domain-specific unstructured data. The inputs to the process include a variety of relevant unstructured (and semistructured) data sources such as text, XML, and HTML. The output of Task 1 is a collection of documents in some digitized format for computer processing.

Task 2, Create the Term–Document Matrix: Introduce structure to the corpus. The output of Task 2 is a flat file called the term–document matrix, where the cells are populated with the term frequencies.

Task 3, Extract Knowledge: Discover novel patterns from the T-D matrix. The output of Task 3 is a number of problem-specific classification, association, and clustering models, and visualizations.]

FIGURE 7.6 The Three-Step/Task Text Mining Process.

[Figure 7.7 (image not reproduced): a sample term–document matrix. Rows (documents): Document 1 through Document 6, and so on. Columns (terms): Investment Risk, Project Management, Software Engineering, Development, SAP, and so on. Most cells are empty; the filled cells hold small occurrence counts (1, 2, or 3).]

FIGURE 7.7 Simple Term–Document Matrix.


The goal is to convert the list of organized documents (the corpus) into a TDM where the cells are filled with the most appropriate indices. The assumption is that the essence of a document can be represented with a list and frequency of the terms used in that document. However, are all terms important when characterizing documents? Obviously, the answer is “no.” Some terms, such as articles, auxiliary verbs, and terms used in almost all the documents in the corpus, have no differentiating power and, therefore, should be excluded from the indexing process. This list of terms, commonly called stop terms or stop words, is specific to the domain of study and should be identified by the domain experts. On the other hand, one might choose a set of predetermined terms under which the documents are to be indexed (this list of terms is conveniently called include terms or dictionary). In addition, synonyms (pairs of terms that are to be treated the same) and specific phrases (e.g., “Eiffel Tower”) can also be provided so that the index entries are more accurate.

Another filtration that should take place to accurately create the indices is stemming, which refers to the reduction of words to their roots so that, for example, different grammatical forms or conjugations of a verb are identified and indexed as the same word. For example, stemming will ensure that modeling and modeled will be recognized as the word model.
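The stop-term filtering and stemming steps described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the stop-term list is a tiny made-up sample (a real list would come from domain experts), and `naive_stem` is a hypothetical suffix stripper standing in for a proper stemming algorithm.

```python
import re
from collections import Counter

# Illustrative stop terms only; a real list is domain specific.
STOP_TERMS = {"the", "a", "an", "is", "are", "was", "of", "and", "in", "to"}

def naive_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

def build_tdm(documents):
    """Return (terms, matrix); rows are documents and columns are terms."""
    counts = []
    for text in documents:
        tokens = re.findall(r"[a-z]+", text.lower())
        stems = [naive_stem(t) for t in tokens if t not in STOP_TERMS]
        counts.append(Counter(stems))
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for terms that a document does not contain.
    matrix = [[c[t] for t in terms] for c in counts]
    return terms, matrix
```

With this sketch, "modeling" and "modeled" are both counted under the column "model", and the cell values are raw occurrence counts, matching the layout in which documents are rows and terms are columns.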

The first generation of the TDM includes all the unique terms identified in the corpus (as its columns), excluding the ones in the stop term list; all the documents (as its rows); and the occurrence count of each term for each document (as its cell values). If, as is commonly the case, the corpus includes a rather large number of documents, then there is a very good chance that the TDM will have a very large number of terms. Processing such a large matrix might be time consuming and, more important, might lead to extraction of inaccurate patterns. At this point, one has to decide the following: (1) What is the best representation of the indices? and (2) How can we reduce the dimensionality of this matrix to a manageable size?

REPRESENTING THE INDICES Once the input documents have been indexed and the initial word frequencies (by document) computed, a number of additional transformations can be performed to summarize and aggregate the extracted information. The raw term frequencies generally reflect how salient or important a word is in each document. Specifically, words that occur with greater frequency in a document are better descriptors of the contents of that document. However, it is not reasonable to assume that the word counts themselves are proportional to their importance as descriptors of the documents. For example, if a word occurs one time in document A but three times in document B, it is not necessarily reasonable to conclude that this word is three times as important a descriptor of document B as compared to document A. To have a more consistent TDM for further analysis, these raw indices need to be normalized. As opposed to showing the actual frequency counts, the numerical representation between terms and documents can be normalized using a number of alternative methods, such as log frequencies, binary frequencies, and inverse document frequencies.
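The three normalization options named above can be illustrated with one common set of formulas. These definitions are assumptions for the sketch (the text does not give formulas): log frequency as 1 + ln(tf) for tf > 0, binary frequency as 1 when the term is present, and an inverse-document-frequency weight as the log frequency multiplied by ln(N / df), where N is the number of documents and df is the number of documents containing the term.

```python
import math

def log_frequency(tf):
    """Dampened count: 1 + ln(tf) for tf > 0, else 0."""
    return 1 + math.log(tf) if tf > 0 else 0.0

def binary_frequency(tf):
    """1 if the term occurs in the document at all, else 0."""
    return 1 if tf > 0 else 0

def tf_idf(matrix):
    """Weight each cell by log frequency times ln(N / document frequency)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    weighted = []
    for row in matrix:
        new_row = []
        for j, tf in enumerate(row):
            if tf > 0:
                new_row.append(log_frequency(tf) * math.log(n_docs / doc_freq[j]))
            else:
                new_row.append(0.0)
        weighted.append(new_row)
    return weighted
```

Under these definitions a word that occurs three times receives a weight of about 2.1 rather than 3, and a term that appears in every document receives an inverse-document-frequency weight of zero because it does not help tell documents apart.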

REDUCING THE DIMENSIONALITY OF THE MATRIX Because the TDM is often very large and rather sparse (most of the cells filled with zeros), another important question is, “How do we reduce the dimensionality of this matrix to a manageable size?” Several options are available for managing the matrix size:

• A domain expert goes through the list of terms and eliminates those that do not make much sense for the context of the study (this is a manual, labor-intensive process).

• Eliminate terms with very few occurrences in very few documents.

• Transform the matrix using SVD.
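The second option in the list, dropping rare terms, might look like the following sketch. The thresholds `min_total` and `min_docs` and the function name `prune_rare_terms` are illustrative assumptions; the text does not prescribe specific cutoffs, and this is only one reading of "very few occurrences in very few documents."

```python
def prune_rare_terms(terms, matrix, min_total=2, min_docs=2):
    """Drop terms whose total count or document count falls below the thresholds."""
    keep = []
    for j, _term in enumerate(terms):
        column = [row[j] for row in matrix]
        total = sum(column)                               # occurrences across the corpus
        doc_count = sum(1 for value in column if value > 0)  # documents containing the term
        if total >= min_total and doc_count >= min_docs:
            keep.append(j)
    kept_terms = [terms[j] for j in keep]
    kept_matrix = [[row[j] for j in keep] for row in matrix]
    return kept_terms, kept_matrix
```

Because the matrix is sparse, a modest threshold often removes a large share of the columns while leaving the frequently used terms intact.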


Singular value decomposition (SVD), which is closely related to principal components analysis, reduces the overall dimensionality of the input matrix (number of input documents by number of extracted terms) to a lower-dimensional space in which each consecutive dimension represents the largest degree of variability (between words and documents) possible (Manning and Schutze, 1999). Ideally, the analyst might identify the two or three most salient dimensions that account for most of the variability (differences) between the words and documents, thus identifying the latent semantic space that organizes the words and documents in the analysis. Once such dimensions are identified, the underlying “meaning” of what is contained (discussed or described) in the documents has been extracted.
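To make the idea of a "most salient dimension" concrete, the sketch below finds only the first (largest) singular dimension of a small TDM using power iteration. This is a deliberately simplified stand-in, not a full SVD: a real project would normally call a numerical library, and the function name `leading_singular_triplet` and the fixed iteration count are assumptions. It also assumes the matrix contains at least one nonzero entry.

```python
import math

def leading_singular_triplet(matrix, iterations=200):
    """Approximate the largest singular value and its document/term vectors.

    Rows are documents, columns are terms. Assumes a nonzero matrix.
    """
    n_docs, n_terms = len(matrix), len(matrix[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms  # starting term vector
    for _ in range(iterations):
        # u = A v (document space), then w = A^T u (back to term space).
        u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
        w = [sum(matrix[i][j] * u[i] for i in range(n_docs)) for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(a * x for a, x in zip(row, v)) for row in matrix]
    sigma = math.sqrt(sum(x * x for x in u))
    u = [x / sigma for x in u]
    return sigma, u, v
```

The products `sigma * u[i]` give each document's coordinate along this first latent dimension, and `v` shows how strongly each term loads on it. Further dimensions would be obtained by repeating the process on the residual matrix, which a library SVD routine does in one call.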

Task 3: Extract the Knowledge

Using the well-structured TDM and potentially augmented with other structured data elements, novel patterns are extracted in the context of the specific problem being addressed. The main categories of knowledge extraction methods are classification, clustering, association, and trend analysis. A short description of these methods follows.

CLASSIFICATION Arguably the most common knowledge-discovery topic in analyzing complex data sources is the classification (or categorization) of certain objects. The task is to classify a given data instance into a predetermined set of categories (or classes). As it applies to the domain of text mining, the task is known as text categorization: given a set of categories (subjects, topics, or concepts) and a collection of text documents, the goal is to find the correct topic (subject or concept) for each document using models developed with a training data set that includes both the documents and their actual categories. Today, automated text classification is applied in a variety of contexts, including automatic or semi-automatic (interactive) indexing of text, spam filtering, Web page categorization under hierarchical catalogs, automatic generation of metadata, and detection of genre.

The two main approaches to text classification are knowledge engineering and machine learning (Feldman and Sanger, 2007). With the knowledge-engineering approach, an expert’s knowledge about the categories is encoded into the system either declaratively or in the form of procedural classification rules. With the machine-learning approach, a general inductive process builds a classifier by learning from a set of preclassified examples. As the number of documents increases at an exponential rate and as knowledge experts become harder to come by, the popularity trend between the two is shifting toward the machine-learning approach.
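As one possible illustration of the machine-learning approach, the sketch below trains a small multinomial naive Bayes classifier on examples that were labeled in advance. Naive Bayes is chosen here only because it is compact; the text does not single out any particular algorithm, and the function names and toy data layout are assumptions.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (tokens, label) pairs labeled in advance."""
    class_docs = Counter()              # number of training documents per class
    term_counts = defaultdict(Counter)  # term occurrences per class
    vocabulary = set()
    for tokens, label in examples:
        class_docs[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_docs, term_counts, vocabulary

def classify(tokens, model):
    """Return the label with the highest log posterior score."""
    class_docs, term_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for token in tokens:
            if token in vocabulary:  # ignore words never seen in training
                # Add-one (Laplace) smoothing avoids zero probabilities.
                p = (term_counts[label][token] + 1) / (total_terms + len(vocabulary))
                score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The classifier is induced entirely from the labeled examples; no expert-written rules are involved, which is the contrast with the knowledge-engineering approach.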

CLUSTERING Clustering is an unsupervised process whereby objects are classified into “natural” groups called clusters. Compared to categorization that uses a collection of preclassified training examples to develop a model based on the descriptive features of the classes to classify a new unlabeled example, in clustering the problem is to group an unlabeled collection of objects (e.g., documents, customer comments, Web pages) into meaningful clusters without any prior knowledge.

Clustering is useful in a wide range of applications from document retrieval to enabling better Web content searches. In fact, one of the prominent applications of clustering is the analysis and navigation of very large text collections, such as Web pages. The basic underlying assumption is that relevant documents tend to be more similar to each other than to irrelevant ones. If this assumption holds, the clustering of documents based on the similarity of their content improves search effectiveness (Feldman and Sanger, 2007):

• Improved search recall. Because it is based on overall similarity as opposed to the presence of a single term, clustering can improve the recall of a query-based search in such a way that when a query matches a document, its whole cluster is returned.


• Improved search precision. Clustering can also improve search precision. As the number of documents in a collection grows, it becomes difficult to browse through the list of matched documents. Clustering can help by grouping the documents into a number of much smaller groups of related documents, ordering them by relevance, and returning only the documents from the most relevant group (or groups).
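The recall improvement described above, where a matching document brings its whole cluster along, can be sketched as follows. The cluster assignments are assumed to have been computed beforehand by some clustering method; the function name `search_with_clusters` and the dictionary-based data layout are hypothetical.

```python
def search_with_clusters(query_term, documents, cluster_of):
    """Return (direct_hits, expanded_hits) for a single-term query.

    documents: dict mapping doc_id -> set of terms in that document.
    cluster_of: dict mapping doc_id -> cluster id (assumed precomputed).
    """
    # Documents that literally contain the query term.
    direct_hits = {doc_id for doc_id, terms in documents.items() if query_term in terms}
    # Clusters touched by at least one direct hit.
    hit_clusters = {cluster_of[doc_id] for doc_id in direct_hits}
    # Every document in those clusters, whether or not it contains the term.
    expanded = {doc_id for doc_id in documents if cluster_of[doc_id] in hit_clusters}
    return direct_hits, expanded
```

A document about an SAP rollout that never uses the word "erp" can still be returned for the query "erp" if it sits in the same cluster as a document that does, which is how clustering raises recall.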

The two most popular clustering methods are scatter/gather clustering and query-specific clustering:

• Scatter/gather. This document browsing method uses clustering to enhance the efficiency of human browsing of documents when a specific search query cannot be formulated. In a sense, the method dynamically generates a table of contents for the collection and adapts and modifies it in response to the user selection.

• Query-specific clustering. This method employs a hierarchical clustering approach where the most relevant documents to the posed query appear in small tight clusters that are nested in larger clusters containing less-similar documents, creating a spectrum of relevance levels among the documents. This method performs consistently well for document collections of realistically large sizes.

ASSOCIATION Associations, or association rule learning in data mining, is a popular and well-researched technique for discovering interesting relationships among variables in large databases. The main idea in generating association rules (or solving market-basket problems) is to identify the frequent sets that go together.

In text mining, associations specifically refer to the direct relationships between concepts (terms) or sets of concepts. The concept set association rule A ⇒ C relating two frequent concept sets A and C can be quantified by the two basic measures of support and confidence. In this case, confidence is the percentage of documents that include all the concepts in C among those documents that include all the concepts in A. Support is the percentage (or number) of documents that include all the concepts in A and C. For instance, in a document collection the concept “Software Implementation Failure” could appear most often in association with “Enterprise Resource Planning” and “Customer Relationship Management” with significant support (4%) and confidence (55%), meaning that 4 percent of the documents had all three concepts represented in the same document, and of the documents that included “Software Implementation Failure,” 55 percent of them also included “Enterprise Resource Planning” and “Customer Relationship Management.”
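The two measures can be computed directly from their definitions. In the sketch below each document is represented as a set of concepts; the function name `support_and_confidence` and the toy representation are assumptions made for illustration.

```python
def support_and_confidence(documents, antecedent, consequent):
    """Support and confidence of the rule A => C over a list of concept sets."""
    a, c = set(antecedent), set(consequent)
    with_a = [d for d in documents if a <= d]   # documents containing all of A
    with_both = [d for d in with_a if c <= d]   # ... that also contain all of C
    support = len(with_both) / len(documents)
    confidence = len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence
```

Support is measured against the whole collection, whereas confidence is measured only against the documents that contain A, which is why the confidence figure in the example above (55%) is much larger than the support figure (4%).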

Text mining with association rules was used to analyze published literature (news and academic articles posted on the Web) to chart the outbreak and progress of the bird flu (Mahgoub et al., 2008). The idea was to automatically identify the association among the geographic areas, spreading across species, and countermeasures (treatments).

TREND ANALYSIS Recent methods of trend analysis in text mining have been based on the notion that the various types of concept distributions are functions of document collections; that is, different collections lead to different concept distributions for the same set of concepts. It is, therefore, possible to compare two distributions that are otherwise identical except that they are from different subcollections. One notable direction of this type of analysis is having two collections from the same source (such as from the same set of academic journals) but from different points in time. Delen and Crossland (2008) applied trend analysis to a large number of academic articles (published in the three highest-rated academic journals) to identify the evolution of key concepts in the field of information systems.
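A minimal sketch of comparing concept distributions across two subcollections follows. Here a "distribution" is taken to be the share of documents in a subcollection that mention each concept, which is an assumption for illustration; other definitions are possible. The function names are hypothetical.

```python
from collections import Counter

def concept_distribution(documents, concepts):
    """Share of documents in a subcollection that mention each concept."""
    counts = Counter()
    for doc_concepts in documents:
        for concept in concepts:
            if concept in doc_concepts:
                counts[concept] += 1
    return {c: counts[c] / len(documents) for c in concepts}

def trend(early_docs, late_docs, concepts):
    """Change in each concept's share from the early to the late subcollection."""
    early = concept_distribution(early_docs, concepts)
    late = concept_distribution(late_docs, concepts)
    return {c: late[c] - early[c] for c in concepts}
```

Positive values mark concepts that became more common in the later period and negative values mark concepts that faded, for the same fixed set of concepts.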

As described in this section, a number of methods are available for text mining. Application Case 7.5 describes the use of a number of different techniques in analyzing a large set of literature.


Researchers conducting searches and reviews of relevant literature face an increasingly complex and voluminous task. In extending the body of relevant knowledge, it has always been important to work hard to gather, organize, analyze, and assimilate existing information from the literature, particularly from one’s home discipline. With the increasing abundance of potentially significant research being reported in related fields, and even in what are traditionally deemed to be nonrelated fields of study, the researcher’s task is ever more daunting if a thorough job is desired.

In new streams of research, the researcher’s task can be even more tedious and complex. Trying to ferret out relevant work that others have reported can be difficult, at best, and perhaps even nearly impossible if traditional, largely manual reviews of published literature are required. Even with a legion of dedicated graduate students or helpful colleagues, trying to cover all potentially relevant published work is problematic.

Many scholarly conferences take place every year. In addition to extending the body of knowledge of the current focus of a conference, organizers often desire to offer additional minitracks and workshops. In many cases, these additional events are intended to introduce attendees to significant streams of research in related fields of study and to try to identify the “next big thing” in terms of research interests and focus. Identifying reasonable candidate topics for such minitracks and workshops is often subjective rather than derived objectively from the existing and emerging research.

In a recent study, Delen and Crossland (2008) proposed a method to greatly assist and enhance the efforts of the researchers by enabling a semi-automated analysis of large volumes of published literature through the application of text mining. Using standard digital libraries and online publication search engines, the authors downloaded and collected all the available articles for the three major journals in the field of management information systems: MIS Quarterly (MISQ), Information Systems Research (ISR), and the Journal of Management Information Systems (JMIS). To maintain the same time interval for all three journals (for potential comparative longitudinal studies), the journal with the most recent starting date for its digital publication availability was used as the start time for this study (i.e., JMIS articles have been digitally available since 1994). For each article, Delen and Crossland extracted the title, abstract, author list, published keywords, volume, issue number, and year of publication. They then loaded all the article data into a simple database file. Also included in the combined data set was a field that designated the journal type of each article for likely discriminatory analysis. Editorial notes, research notes, and executive overviews were omitted from the collection. Table 7.2 shows how the data were presented in a tabular format.

In the analysis phase, the researchers chose to use only the abstract of an article as the source of information extraction. They chose not to include the keywords listed with the publications for two main reasons: (1) under normal circumstances, the abstract would already include the listed keywords, and therefore inclusion of the listed keywords for the analysis would mean repeating the same information and potentially giving them unmerited weight and (2) the listed keywords could be terms that authors would like their article to be associated with (as opposed to what is really contained in the article), therefore, potentially introducing unquantifiable bias to the analysis of the content.

The first exploratory study was to look at the longitudinal perspective of the three journals (i.e., evolution of research topics over time). To conduct a longitudinal study, Delen and Crossland divided the 12-year period (from 1994 to 2005) into four 3-year periods for each of the three journals. This framework led to 12 text mining experiments with 12 mutually exclusive data sets. At this point, for each of the 12 data sets, the researchers used text mining to extract the most descriptive terms from these collections of articles represented by their abstracts. The results were tabulated and examined for time-varying changes in the terms published in these three journals.
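The partitioning described above (three journals by four 3-year periods, giving 12 mutually exclusive data sets) can be sketched in a few lines. This is only an illustrative sketch: the record fields `journal`, `year`, and `abstract` and the function names are assumptions made here, not the layout the authors actually used.

```python
from collections import defaultdict

JOURNALS = ("MISQ", "ISR", "JMIS")
START_YEAR, END_YEAR, PERIOD_LENGTH = 1994, 2005, 3


def period_label(year):
    """Map a publication year to its 3-year period, e.g. 1996 -> '1994-1996'."""
    if not START_YEAR <= year <= END_YEAR:
        raise ValueError(f"year {year} is outside the study window")
    first = START_YEAR + ((year - START_YEAR) // PERIOD_LENGTH) * PERIOD_LENGTH
    return f"{first}-{first + PERIOD_LENGTH - 1}"


def partition(articles):
    """Group abstracts into mutually exclusive (journal, period) data sets."""
    data_sets = defaultdict(list)
    for article in articles:
        key = (article["journal"], period_label(article["year"]))
        data_sets[key].append(article["abstract"])
    return dict(data_sets)


# Hypothetical records, for illustration only.
sample = [
    {"journal": "MISQ", "year": 1994, "abstract": "abstract A"},
    {"journal": "MISQ", "year": 1996, "abstract": "abstract B"},
    {"journal": "JMIS", "year": 2005, "abstract": "abstract C"},
]
for key, abstracts in sorted(partition(sample).items()):
    print(key, len(abstracts))
```

Each resulting group would then be passed to the term-extraction step separately, so that the most descriptive terms can be compared across periods and journals.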

Application Case 7.5 Research Literature Survey with Text Mining


As a second exploration, using the complete data set (including all three journals and all four periods), Delen and Crossland conducted a clustering analysis. Clustering is arguably the most commonly used text mining technique. Clustering was used in this study to identify the natural groupings of the articles (by putting them into separate clusters) and then to list the most descriptive terms that characterized those clusters. They used SVD to reduce the dimensionality of the term-by-document matrix and then an expectation-maximization algorithm to create the clusters. They conducted several experiments to identify the optimal number of clusters, which turned out to be nine. After the construction of the nine clusters, they analyzed the content of those clusters from two perspectives: (1) representation of the journal type (see Figure 7.8a) and (2) representation of time (Figure 7.8b). The idea was to explore the potential differences and/or commonalities among the three journals and potential changes in the emphasis on those clusters; that is, to answer questions such as “Are there clusters that represent different research themes specific to a single journal?” and “Is there a time-varying characterization of those clusters?” The researchers discovered and discussed several interesting patterns using tabular and graphical representation of their findings (for further information, see Delen and Crossland, 2008).

Questions for Case 7.5

1. How can text mining be used to ease the insurmountable task of literature review?

2. What are the common outcomes of a text mining project on a specific collection of journal articles? Can you think of other potential outcomes not mentioned in this case?


TABLE 7.2 Tabular Representation of the Fields Included in the Combined Data Set


FIGURE 7.8 (a) Distribution of the Number of Articles for the Three Journals over the Nine Clusters. (b) Development of the Nine Clusters over the Years.

[Panel (a): histogram of JOURNAL (MISQ, ISR, JMIS), categorized by CLUSTER (0 through 8); vertical axis: number of observations. Panel (b): histogram of YEAR (1994 through 2005), categorized by CLUSTER (0 through 8); vertical axis: number of observations.]

Source: Used with permission of Delen, D., & M. Crossland. (2008). “Seeding the Survey and Analysis of Research Literature with Text Mining.” Expert Systems with Applications, 34(3), pp. 1707–1720.


SECTION 7.5 REVIEW QUESTIONS

1. What are the main steps in the text mining process?

2. What is the reason for normalizing word frequencies? What are the common methods for normalizing word frequencies?

3. What is SVD? How is it used in text mining?

4. What are the main knowledge extraction methods from corpus?

7.6 SENTIMENT ANALYSIS

We humans are social beings. We are adept at utilizing a variety of means to communicate. We often consult financial discussion forums before making an investment decision; ask our friends for their opinions on a newly opened restaurant or a newly released movie; and conduct Internet searches and read consumer reviews and expert reports before making a big purchase like a house, a car, or an appliance. We rely on others’ opinions to make better decisions, especially in an area where we do not have much knowledge or experience. Thanks to the growing availability and popularity of opinion-rich Internet resources such as social media outlets (e.g., Twitter, Facebook), online review sites, and personal blogs, it is now easier than ever to find opinions of others (thousands of them, as a matter of fact) on everything from the latest gadgets to political and public figures. Even though not everybody expresses opinions over the Internet, the numbers are increasing exponentially, due mostly to the fast-growing number and capabilities of social communication channels.

Sentiment is a difficult word to define. It is often linked to or confused with other terms like belief, view, opinion, and conviction. Sentiment suggests a settled opinion reflective of one’s feelings (Mejova, 2009). Sentiment has some unique properties that set it apart from other concepts that we might want to identify in text. Often we want to categorize text by topic, which could involve dealing with whole taxonomies of topics. Sentiment classification, on the other hand, usually deals with two classes (positive versus negative), a range of polarity (e.g., star ratings for movies), or even a range in strength of opinion (Pang and Lee, 2008). These classes span many topics, users, and documents. Although dealing with only a few classes might seem like an easier task than standard text analysis, this is far from the truth.

As a field of research, sentiment analysis is closely related to computational linguistics, NLP, and text mining. Sentiment analysis has many names. It is often referred to as opinion mining, subjectivity analysis, and appraisal extraction with some connections to affective computing (computer recognition and expression of emotion). The sudden upsurge of interest and activity in the area of sentiment analysis (i.e., opinion mining), which deals with the automatic extraction of opinions, feelings, and subjectivity in text, is creating opportunities and threats for businesses and individuals alike. The ones who embrace and take advantage of it will greatly benefit from it. Every opinion put on the Internet by an individual or a company will be attributed to the originator (for better or worse) and will be retrieved and mined by others (often automatically by computer programs).

Sentiment analysis is trying to answer the question, “What do people feel about a certain topic?” by digging into opinions held by many using a variety of automated tools. Bringing together researchers and practitioners in business, computer science, computational linguistics, data mining, text mining, psychology, and even sociology, sentiment analysis aims to expand the traditional fact-based text analysis to new frontiers, to realize opinion-oriented information systems. In a business setting, especially in marketing and CRM, sentiment analysis seeks to detect favorable and unfavorable opinions toward specific products and/or services using large numbers of textual data sources (customer feedback in the form of Web postings, tweets, blogs, etc.).

Sentiment that appears in text comes in two flavors: explicit, in which the subjective sentence directly expresses an opinion (“It’s a wonderful day”), and implicit, in which the text implies an opinion (“The handle breaks too easily”). Most of the earlier work done in sentiment analysis focused on the first kind of sentiment because it is easier to analyze. Current trends are to implement analytical methods to consider both implicit and explicit sentiments.

Sentiment polarity is a particular feature of text on which sentiment analysis primarily focuses. It is usually dichotomized into two (positive and negative), but polarity can also be thought of as a range. A document containing several opinionated statements will have a mixed polarity overall, which is different from not having a polarity at all (being objective; Mejova, 2009).

Timely collection and analysis of textual data, which may be coming from a variety of sources ranging from customer call center transcripts to social media postings, is a crucial part of the capabilities of proactive and customer-focused companies today. These real-time analyses of textual data are often visualized in easy-to-understand dashboards. Application Case 7.6 provides a customer success story in which a collection of analytics solutions is collectively used to enhance viewers’ experience at the Wimbledon tennis tournament.

Application Case 7.6 Creating a Unique Digital Experience to Capture Moments That Matter at Wimbledon

Known to millions of fans simply as “Wimbledon,” The Championships are the oldest of tennis’s four Grand Slams, and one of the world’s highest-profile sporting events. Organized by the All England Lawn Tennis Club (AELTC), it has been a global sporting and cultural institution since 1877.

The Champion of Championships

The organizers of The Championships, Wimbledon, and AELTC have a simple objective: every year, they want to host the best tennis championships in the world, in every way and by every metric.

[Figure: Live scores displayed on Wimbledon.com, the official Web site of The Championships, Wimbledon. Copyright AELTC and IBM. Used with permission.]

The motivation behind this commitment is not simply pride; it also has a commercial basis. Wimbledon’s brand is built on its premier status; this is what attracts both fans and partners. The world’s best media organizations and greatest corporations—IBM included—want to be associated with Wimbledon precisely because of its reputation for excellence.

For this reason, maintaining the prestige of The Championships is one of AELTC’s top priorities, but there are only two ways that the organization can directly control how the rest of the world perceives The Championships.

The first, and most important, is to provide an outstanding experience for the players, journalists, and spectators who are lucky enough to visit and watch the tennis courtside. AELTC has vast experience in this area. Since 1877, it has delivered two weeks of memorable, exciting competition in an idyllic setting: tennis in an English country garden.

The second is The Championships’ online presence, which is delivered via the wimbledon.com Web site, mobile apps, and social media channels. The constant evolution of these digital platforms is the result of a 26-year partnership between AELTC and IBM.

Mick Desmond, commercial and media director at AELTC, explains, “When you watch Wimbledon on TV, you are seeing it through the broadcaster’s lens. We do everything we can to help our media partners put on the best possible show, but at the end of the day, their broadcast is their presentation of The Championships.”

He adds, “Digital is different: it’s our platform, where we can speak directly to our fans, so it’s vital that we give them the best possible experience. No sporting event or media channel has the right to demand a viewer’s attention, so if we want to strengthen our brand, we need people to see our digital experience as the number-one place to follow The Championships online.”

To that end, AELTC set a target of attracting 70 million visits, 20 million unique devices, and 8 million social followers during the two weeks of The Championships in 2015. It was up to IBM and AELTC to find a way to deliver.

Delivering a Unique Digital Experience

IBM and AELTC embarked on a complete redesign of the digital platform, using their intimate knowledge of The Championships’ audience to develop an experience tailor-made to attract and retain tennis fans from across the globe.

“We recognized that while mobile is increasingly important, 80% of our visitors are using desktop computers to access our Web site,” says Alexandra Willis, head of Digital and Content at AELTC. She continued,

Our challenge for 2015 was how to update our digital properties to adapt to a mobile-first world, while still offering the best possible desktop experience. We wanted our new site to take maximum advantage of that large screen size and give desktop users the richest possible experience in terms of high-definition visuals and video content, while also reacting and adapting seamlessly to smaller tablet or mobile formats.

Second, we placed a major emphasis on putting content in context: integrating articles with relevant photos, videos, stats and snippets of information, and simplifying the navigation so that users could move seamlessly to the content that interests them most.

On the mobile side, the team recognized that the wider availability of high bandwidth 4G connections meant that the mobile Web site would become more popular than ever, and ensured that it would offer easy access to all rich media content. At the same time, The Championships’ mobile apps were enhanced with real-time notifications of match scores and events, and could even greet visitors as they passed through stations on the way to the grounds.

The team also built a special set of Web sites for the most important tennis fans of all: the players themselves. Using IBM Bluemix technology, it built a secure Web application that provided players a personalized view of their court bookings, transport, and on-court times, as well as helping them review their performance with access to stats on every match they played.

Turning Data into Insight—and Insight into Narrative

To supply its digital platforms with the most compelling possible content, the team took advantage of a unique opportunity: its access to real-time, shot-by-shot data on every match played during The Championships. Over the course of the Wimbledon fortnight, 48 courtside experts capture approximately 3.4 million data points, tracking the type of shot, strategies, and outcome of each and every point.

These data are collected and analyzed in real time to produce statistics for TV commentators and journalists, and for the digital platform’s own editorial team.

Willis went on to explain:

This year IBM gave us an advantage that we had never had before—using data streaming technology to provide our editorial team with real-time insight into significant milestones and breaking news.

The system automatically watched the streams of data coming in from all 19 courts, and whenever something significant happened— such as Sam Groth hitting the second-fastest serve in Championships’ history—it let us know instantly. Within seconds, we were able to bring that news to our digital audience and share it on social media to drive even more traffic to our site.

The ability to capture the moments that matter and uncover the compelling narratives within the data, faster than anyone else, was key. If you wanted to experience the emotions of The Championships live, the next best thing to being there in person was to follow the action on wimbledon.com.

Harnessing the Power of Natural Language

Another new capability tried in 2015 was the use of IBM’s NLP technologies to help mine AELTC’s huge library of tennis history for interesting contextual information. The team trained IBM Watson™ Engagement Advisor to digest this rich unstructured data set and use it to answer queries from the press desk.

The same NLP front-end was also connected to a comprehensive structured database of match statistics, dating back to the first Championships in 1877—providing a one-stop shop for both basic questions and more complex inquiries.

“The Watson trial showed a huge amount of potential. Next year, as part of our annual innovation planning process, we will look at how we can use it more widely, ultimately in pursuit of giving fans more access to this incredibly rich source of tennis knowledge,” says Desmond.

Taking to the Cloud

IBM hosted the whole digital environment in its Hybrid Cloud. IBM used sophisticated modeling techniques to predict peaks in demand based on the schedule, popularity of each player, time of day, and many other factors—enabling it to dynamically allocate cloud resources appropriately to each piece of digital content and ensure a seamless experience for millions of visitors around the world.

In addition to the powerful private cloud platform that has supported The Championships for several years, IBM also used a separate SoftLayer cloud to host the Wimbledon Social Command Centre and provide additional incremental capacity to supplement the main cloud environment during times of peak demand.

The elasticity of the cloud environment is key because The Championships’ digital platforms need to be able to scale efficiently by a factor of more than 100 within a matter of days as the interest builds ahead of the first match on Centre Court.

Keeping Wimbledon Safe and Secure

Online security is a key concern today for all organizations. For major sporting events in particular, brand reputation is everything, and while the world is watching, it is particularly important to avoid becoming a high-profile victim of cyber crime. For these reasons, security has a vital role to play in IBM’s partnership with AELTC.

Over the first five months of 2015, IBM security systems detected a 94 percent increase in security events on the wimbledon.com infrastructure compared to the same period in 2014.

As security threats, in particular distributed denial of service (DDoS) attacks, become ever more prevalent, IBM continually increases its focus on providing industry-leading levels of security for AELTC’s entire digital platform.

A full suite of IBM security products, including IBM QRadar SIEM and IBM Preventia Intrusion Prevention, enabled the 2015 Championships to run smoothly and securely and the digital platform to deliver a high-quality user experience at all times.


Sentiment Analysis Applications

Compared to traditional sentiment analysis methods, which were survey based or focus group centered, costly, and time consuming (and therefore driven from a small sample of participants), the new face of text analytics–based sentiment analysis is a limit breaker. Current solutions automate very large-scale data collection, filtering, classification, and clustering methods via NLP and data mining technologies that handle both factual and subjective information. Sentiment analysis is perhaps the most popular application of text analytics, tapping into data sources such as tweets, Facebook posts, online communities, discussion boards, Web logs, product reviews, call center logs and recordings, product rating sites, chat rooms, price comparison portals, search engine logs, and newsgroups. The following applications of sentiment analysis are meant to illustrate the power and the widespread coverage of this technology.

VOICE OF THE CUSTOMER Voice of the customer (VOC) is an integral part of analytic CRM and customer experience management systems. As the enabler of VOC, sentiment analysis can access a company’s product and service reviews (either continuously or periodically) to better understand and better manage customer complaints and compliments. For instance, a motion picture advertising/marketing company can detect negative sentiments about a movie that is soon to open in theatres (based on its trailers) and quickly change the composition of trailers and advertising strategy (on all media outlets) to mitigate the negative impact. Similarly, a software company can detect the negative buzz regarding the bugs found in their newly released product early enough to release patches and quick fixes to alleviate the situation.

Capturing Hearts and Minds

The success of the new digital platform for 2015, supported by IBM cloud, analytics, mobile, social, and security technologies, was immediate and complete. Targets for total visits and unique visitors were not only met but also exceeded. Achieving 71 million visits and 542 million page views from 21.1 million unique devices demonstrates the platform’s success in attracting a larger audience than ever before and keeping those viewers engaged throughout The Championships.

“Overall, we had 13% more visits from 23% more devices than in 2014, and the growth in the use of wimbledon.com on mobile was even more impressive,” says Willis. “We saw 125% growth in unique devices on mobile, 98% growth in total visits, and 79% growth in total page views.”

Desmond concludes, “The results show that in 2015, we won the battle for fans’ hearts and minds. People may have favorite newspapers and sports websites that they visit for 50 weeks of the year, but for two weeks, they came to us instead.”

He continued, “That’s a testament to the sheer quality of the experience we can provide— harnessing our unique advantages to bring them closer to the action than any other media channel. The ability to capture and communicate relevant content in real time helped our fans experience The Championships more vividly than ever before.”

Questions for Case 7.6

1. How did Wimbledon use analytics capabilities to enhance viewers’ experience?

2. What were the challenges, proposed solution, and obtained results?

Source: IBM Case Study. “Creating a Unique Digital Experience to Capture the Moments That Matter.” http://www-03.ibm.com/software/businesscasestudies/us/en/corp?synkey=D140192K15783Q68 (accessed May 2016).


Often, the focus of VOC is individual customers, their service- and support-related needs, wants, and issues. VOC draws data from the full set of customer touch points, including e-mails, surveys, call center notes/recordings, and social media postings, and matches customer voices to transactions (inquiries, purchases, returns) and individual customer profiles captured in enterprise operational systems. VOC, mostly driven by sentiment analysis, is a key element of customer experience management initiatives, where the goal is to create an intimate relationship with the customer.

VOICE OF THE MARKET (VOM) VOM is about understanding aggregate opinions and trends. It is about knowing what stakeholders (customers, potential customers, influencers, whoever) are saying about your (and your competitors’) products and services. A well-done VOM analysis helps companies with competitive intelligence and product development and positioning.

VOICE OF THE EMPLOYEE (VOE) Traditionally, VOE has been limited to employee satisfaction surveys. Text analytics in general (and sentiment analysis in particular) is a huge enabler of assessing the VOE. Using rich, opinionated textual data provides an effective and efficient way to listen to what employees are saying. As we all know, happy employees empower customer experience efforts and improve customer satisfaction.

BRAND MANAGEMENT Brand management focuses on listening to social media where anyone (past/current/prospective customers, industry experts, other authorities) can post opinions that can damage or boost a company’s reputation. A number of relatively newly launched start-up companies offer analytics-driven brand management services for others. Brand management is product and company (rather than customer) focused. It attempts to shape perceptions rather than to manage experiences using sentiment analysis techniques.

FINANCIAL MARKETS Predicting the future values of individual (or a group of) stocks has been an interesting and seemingly unsolvable problem. What makes a stock (or a group of stocks) move up or down is anything but an exact science. Many believe that the stock market is mostly sentiment driven, making it anything but rational (especially for short-term stock movements). Therefore, the use of sentiment analysis in financial markets has gained significant popularity. Automated analysis of market sentiment using social media, news, blogs, and discussion groups seems to be a proper way to compute the market movements. If done correctly, sentiment analysis can identify short-term stock movements based on the buzz in the market, potentially impacting liquidity and trading.

POLITICS As we all know, opinions matter a great deal in politics. Because political discussions are dominated by quotes, sarcasm, and complex references to persons, organizations, and ideas, politics is one of the most difficult, and potentially fruitful, areas for sentiment analysis. By analyzing the sentiment on election forums, one might predict who is more likely to win or lose a race. Sentiment analysis can help understand what voters are thinking and can clarify a candidate’s position on issues. Sentiment analysis can help political organizations, campaigns, and news analysts to better understand which issues and positions matter the most to voters. The technology was successfully applied by both parties to the 2008 and 2012 U.S. presidential election campaigns.

GOVERNMENT INTELLIGENCE Government intelligence is another application that has been used by intelligence agencies. For example, it has been suggested that one could monitor sources for increases in hostile or negative communications. Sentiment analysis can allow the automatic analysis of the opinions that people submit about pending policy or government regulation proposals. Furthermore, monitoring communications for spikes in negative sentiment could be of use to agencies such as Homeland Security.

OTHER INTERESTING AREAS Sentiments of customers can be used to better design e-commerce sites (product suggestions, up-sell/cross-sell advertising), better place advertisements (e.g., placing dynamic advertisements of products and services that consider the sentiment on the page the user is browsing), and manage opinion- or review-oriented search engines (i.e., an opinion-aggregation Web site, an alternative to sites similar to Epinions, summarizing user reviews). Sentiment analysis can help with e-mail filtration by categorizing and prioritizing incoming e-mails (e.g., it can detect strongly negative or flaming e-mails and forward them to a proper folder), and citation analysis can determine whether an author is citing a piece of work as supporting evidence or as research that the author dismisses.

Sentiment Analysis Process

Because of the complexity of the problem (underlying concepts, expressions in text, context in which text is expressed, etc.), there is no readily available standardized process to conduct sentiment analysis. However, based on the published work in the field of sentiment analysis so far (both on research methods and range of applications), a multistep, simple logical process as given in Figure 7.9 seems to be an appropriate methodology for sentiment analysis. These logical steps are iterative (i.e., feedback, corrections, and iterations are part of the discovery process) and experimental in nature, and once completed and combined, capable of producing desired insight about the opinions in the text collection.

STEP 1: SENTIMENT DETECTION After retrieval and preparation of the text documents, the first main task in sentiment analysis is the detection of objectivity. Here the goal is to differentiate between a fact and an opinion, which can be viewed as classification of text as objective or subjective. This can also be characterized as calculation of Objectivity–Subjectivity (O–S) polarity, which can be represented with a numerical value ranging from 0 to 1. If the objectivity value is close to 1, there is no opinion to mine (i.e., it is a fact); therefore, the process goes back and grabs the next text data to analyze. Usually opinion detection is based on the examination of adjectives in text. For example, the polarity of “what a wonderful work” can be determined relatively easily by looking at the adjective.
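A minimal sketch of this detection step is shown here, assuming a tiny hand-made list of opinion-bearing adjectives. The word list, the scoring rule (one minus the share of subjective tokens), and the 0.9 cutoff are all illustrative assumptions, not values given in the text.

```python
import re

# Hypothetical mini-lexicon of opinion-bearing adjectives; a real system
# would rely on a much larger, curated resource.
SUBJECTIVE_WORDS = {"wonderful", "terrible", "great", "awful", "good", "bad"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def objectivity(text):
    """Return an O-S value in [0, 1]; 1.0 means no subjective word was found."""
    tokens = tokenize(text)
    if not tokens:
        return 1.0
    hits = sum(1 for token in tokens if token in SUBJECTIVE_WORDS)
    return 1.0 - hits / len(tokens)


def has_sentiment(text, threshold=0.9):
    """Treat text as opinionated when its objectivity falls below the cutoff."""
    return objectivity(text) < threshold


for text in ("What a wonderful work", "The package arrived on Tuesday"):
    print(text, objectivity(text), has_sentiment(text))
```

Text that scores as objective would simply be skipped, and the process would move on to the next piece of text.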

STEP 2: N–P (NEGATIVE OR POSITIVE) POLARITY CLASSIFICATION The second main task is that of polarity classification. Given an opinionated piece of text, the goal is to classify the opinion as falling under one of two opposing sentiment polarities or to locate its position on the continuum between these two polarities (Pang & Lee, 2008). When viewed as a binary feature, polarity classification is the binary classification task of labeling an opinionated document as expressing either an overall positive or an overall negative opinion (e.g., thumbs up or thumbs down). In addition to the identification of N–P polarity, one should also be interested in identifying the strength of the sentiment (as opposed to just positive, it can be expressed as mildly, moderately, strongly, or very strongly positive). Most of this research was done on product or movie reviews where the definitions of “positive” and “negative” are quite clear. Other tasks, such as classifying news as “good” or “bad,” present some difficulty. For instance, an article could contain negative news without explicitly using any subjective words or terms. Furthermore, these classes usually appear intermixed when a document expresses both positive and negative sentiments. Then the task can be to identify the main (or dominating) sentiment of the document. Still, for lengthy texts, the tasks of classification might need to be done at several levels: term, phrase, sentence, and perhaps document level. For those, it is common to use the outputs of one level as the inputs for the next higher level. Several methods used to identify the polarity and strengths of the polarity are explained in the next section.
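As a rough illustration of polarity and strength together, the sketch below averages signed word scores from a small lexicon and maps the magnitude to the strength labels mentioned above. The lexicon entries, their numeric values, and the cutoffs between "mildly" and "very strongly" are hypothetical choices made for this example.

```python
import re

# Hypothetical polarity lexicon: negative values are negative opinions,
# positive values are positive opinions, magnitude reflects strength.
POLARITY_LEXICON = {
    "excellent": 1.0, "good": 0.5, "fine": 0.25,
    "slow": -0.25, "poor": -0.5, "terrible": -1.0,
}


def polarity(text):
    """Average the lexicon scores of the words found; 0.0 if none are found."""
    tokens = re.findall(r"[a-z']+", text.lower())
    scores = [POLARITY_LEXICON[t] for t in tokens if t in POLARITY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


def describe(score):
    """Turn a signed score into a label such as 'strongly negative'."""
    if score == 0:
        return "neutral or mixed"
    magnitude = abs(score)
    if magnitude <= 0.25:
        strength = "mildly"
    elif magnitude <= 0.5:
        strength = "moderately"
    elif magnitude <= 0.75:
        strength = "strongly"
    else:
        strength = "very strongly"
    direction = "positive" if score > 0 else "negative"
    return f"{strength} {direction}"


for text in ("An excellent camera", "Good screen but terrible battery"):
    print(text, "->", describe(polarity(text)))
```

The second sentence shows the intermixed case: one positive and one negative term are present, and the averaged score reports the dominating sentiment.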

FIGURE 7.9 Multistep Process to Sentiment Analysis. (Flow: textual data; Step 1, calculate the O–S polarity and decide whether there is a sentiment; Step 2, calculate the N–P polarity of the sentiment; Step 3, identify the target for the sentiment; Step 4, tabulate and aggregate the sentiment analysis results. Lexicons support Steps 1 and 2, and the polarity, strength, and target of each sentiment are recorded.)

STEP 3: TARGET IDENTIFICATION The goal of this step is to accurately identify the target of the expressed sentiment (e.g., a person, a product, an event). The difficulty of this task depends largely on the domain of the analysis. Even though it is usually easy to accurately identify the target for product or movie reviews because the review is directly connected to the target, it can be quite challenging in other domains. For instance, lengthy, general-purpose text such as Web pages, news articles, and blogs does not always have a predefined topic assigned to it and often mentions many objects, any of which could be deduced as the target. Sometimes there is more than one target in a sentiment sentence, which is the case in comparative texts. A subjective comparative sentence ranks objects in order of preference; for example, “This laptop computer is better than my desktop PC.” These sentences can be identified using comparative adjectives and adverbs (more, less, better, longer), superlative adjectives (most, least, best), and other words (such as same, differ, win, prefer). Once the sentences have been retrieved, the objects can be put in an order that is most representative of their merits as described in the text.
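The cue-word approach to spotting comparative sentences can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the cue lists are the examples given in the text, and the function name `is_comparative` is hypothetical. A practical system would likely also use part-of-speech tags to catch regular comparative and superlative forms.

```python
import re

# Cue words taken from the examples in the text; not an exhaustive list.
COMPARATIVE_CUES = {"more", "less", "better", "longer"}
SUPERLATIVE_CUES = {"most", "least", "best"}
OTHER_CUES = {"same", "differ", "win", "prefer"}
ALL_CUES = COMPARATIVE_CUES | SUPERLATIVE_CUES | OTHER_CUES


def is_comparative(sentence: str) -> bool:
    """Return True if the sentence contains at least one cue word."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return any(token in ALL_CUES for token in tokens)


sentences = [
    "This laptop computer is better than my desktop PC.",
    "The screen is bright.",
]
# Keep only the sentences that look comparative.
flagged = [s for s in sentences if is_comparative(s)]
```

Here only the first sentence is flagged, because it contains the cue word "better."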

STEP 4: COLLECTION AND AGGREGATION Once the sentiments of all text data points in the document have been identified and calculated, in this step they are aggregated and converted to a single sentiment measure for the entire document. This aggregation could be as simple as summing up the polarities and strengths of all texts or as complex as using semantic aggregation techniques from NLP to identify the ultimate sentiment.

Methods for Polarity Identification

As mentioned in the previous section, polarity identification can be made at the word, term, sentence, or document level. The most granular level for polarity identification is at the word level. Once the polarity identification has been made at the word level, then it can be aggregated to the next higher level, and then the next until the level of aggregation desired from the sentiment analysis is reached. Two dominant techniques have been used for identification of polarity at the word/term level, each having its advantages and disadvantages:

1. Using a lexicon as a reference library (developed either manually or automatically by an individual for a specific task or developed by an institution for general use).

2. Using a collection of training documents as the source of knowledge about the polarity of terms within a specific domain (i.e., inducing predictive models from opinionated textual documents).

Using a Lexicon

A lexicon is essentially the catalog of words, their synonyms, and their meanings for a given language. In addition to lexicons for many other languages, there are several general-purpose lexicons created for English. Often general-purpose lexicons are used to create a variety of special-purpose lexicons for use in sentiment analysis projects. Perhaps the most popular general-purpose lexicon is WordNet created at Princeton University; it has been extended and used by many researchers and practitioners for sentiment analysis purposes. As described on the WordNet Web site (wordnet.princeton.edu), it is a large lexical database of English, including nouns, verbs, adjectives, and adverbs grouped into sets of cognitive synonyms (i.e., synsets), each expressing a distinct concept. Synsets are interlinked by means of conceptual–semantic and lexical relations.

An interesting extension of WordNet was created by Esuli and Sebastiani (2006) where they added polarity (Positive–Negative; P–N) and objectivity (Subjective–Objective; S–O) labels for each term in the lexicon. To label each term, they classified the synset (a group of synonyms) to which a term belongs using a set of ternary classifiers (a measure that attaches to each object exactly one of three labels), each capable of deciding whether a synset is positive, or negative, or objective. The resulting scores range from 0.0 to 1.0, giving a graded evaluation of opinion-related properties of the terms. These can be summed up visually as in Figure 7.10. Each corner of the triangle represents one of the three classifications (positive, negative, and objective). A term can be located in this space as a point representing the extent to which it belongs to each of the classifications.

A similar extension methodology is used to create SentiWordNet, a publicly available lexicon specifically developed for opinion mining (sentiment analysis) purposes. SentiWordNet assigns to each synset of WordNet three sentiment scores: positivity, negativity, and objectivity. More about SentiWordNet can be found at sentiwordnet.isti.cnr.it.
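To make the three-score idea concrete, the following Python sketch stores a positivity and a negativity score for a synset and derives the rest. It assumes that the three scores sum to 1.0, so that objectivity is whatever remains. The class name `SynsetScores`, the two derived measures, and the numeric values are illustrative assumptions, not entries or definitions taken from SentiWordNet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynsetScores:
    """Opinion-related scores for one synset (illustrative values only)."""
    positive: float
    negative: float

    @property
    def objective(self) -> float:
        # Assumption: positivity + negativity + objectivity = 1.0.
        return 1.0 - self.positive - self.negative

    @property
    def pn_polarity(self) -> float:
        # One possible convention: position on the P-N axis, in [-1, 1].
        return self.positive - self.negative

    @property
    def subjectivity(self) -> float:
        # One possible convention: position on the S-O axis, in [0, 1].
        return self.positive + self.negative


# A made-up synset that is mostly positive and somewhat objective.
example = SynsetScores(positive=0.625, negative=0.125)
```

With these made-up values, the synset is 0.25 objective, leans positive by 0.5 on the P–N axis, and is 0.75 subjective.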

Another extension to WordNet is WordNet-Affect, developed by Strapparava and Valitutti (2004). They label WordNet synsets using affective labels representing different affective categories (emotion, cognitive state, attitude, and feeling). WordNet has also been directly used in sentiment analysis. For example, Kim and Hovy (2004) and Liu, Hu, and Cheng (2005) generate lexicons of positive and negative terms by starting with a small list of “seed” terms of known polarities (e.g., love, like, nice) and then using the antonymy and synonymy properties of terms to group them into either of the polarity categories.
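The seed-expansion idea can be illustrated with a small Python sketch. It is not the procedure of any of the cited authors; it only shows the general mechanism under simple assumptions. The `SYNONYMS` and `ANTONYMS` dictionaries are hypothetical toy relations standing in for WordNet, and `expand_seeds` is a hypothetical helper name.

```python
from collections import deque

# Hypothetical toy relations; a real project would read these from WordNet.
SYNONYMS = {
    "good": {"nice", "fine"},
    "nice": {"good", "pleasant"},
    "bad": {"poor"},
}
ANTONYMS = {
    "good": {"bad"},
    "nice": {"nasty"},
}


def expand_seeds(seeds: dict[str, int], max_depth: int = 2) -> dict[str, int]:
    """Propagate polarity (+1 or -1) outward from the seed terms.

    Synonyms inherit the polarity of the source term; antonyms receive
    the opposite polarity. The first label assigned to a term is kept.
    """
    polarity = dict(seeds)
    queue = deque((term, 0) for term in seeds)
    while queue:
        term, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the chosen depth
        neighbors = [(w, polarity[term]) for w in sorted(SYNONYMS.get(term, ()))]
        neighbors += [(w, -polarity[term]) for w in sorted(ANTONYMS.get(term, ()))]
        for word, label in neighbors:
            if word not in polarity:
                polarity[word] = label
                queue.append((word, depth + 1))
    return polarity


lexicon = expand_seeds({"good": 1})
```

Starting from the single positive seed "good," the sketch labels "nice," "fine," and "pleasant" as positive and "bad," "poor," and "nasty" as negative. Limiting the depth is a design choice: each additional hop through synonym links tends to drift further from the seed's meaning.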

Using a Collection of Training Documents

It is possible to perform sentiment classification using statistical analysis and machine-learning tools that take advantage of the vast resources of labeled (manually by annotators or using a star/point system) documents available. Product review Web sites such as Amazon, C-NET, eBay, RottenTomatoes, and the Internet Movie Database have all been extensively used as sources of annotated data. The star (or tomato, as it were) system provides an explicit label of the overall polarity of the review, and it is often taken as a gold standard in algorithm evaluation.

A variety of manually labeled textual data is available through evaluation efforts such as the Text REtrieval Conference, NII Test Collection for IR Systems, and Cross Language Evaluation Forum. The data sets these efforts produce often serve as a standard in the text mining community including sentiment analysis researchers. Individual researchers and research groups have also produced many interesting data sets. Technology Insights 7.2 lists some of the most popular ones. Once an already labeled textual data set has been obtained, a variety of predictive modeling and other machine-learning algorithms can be used to train sentiment classifiers. Some of the most popular algorithms used for this task include artificial neural networks, support vector machines, k-nearest neighbor, Naïve Bayes, decision trees, and expectation maximization-based clustering.
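As an illustration of inducing a sentiment classifier from labeled documents, the sketch below implements a small Naïve Bayes classifier in plain Python. It is a minimal teaching example under stated assumptions: the four training "reviews" are made up, whitespace tokenization is used, and add-one smoothing is one common choice among several. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def train_naive_bayes(labeled_docs):
    """Count word frequencies per class from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocabulary.add(word)
    return word_counts, class_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    word_counts, class_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Tiny made-up training set, standing in for star-rated reviews.
training = [
    ("great movie loved it", "positive"),
    ("wonderful acting great plot", "positive"),
    ("terrible movie hated it", "negative"),
    ("boring plot terrible acting", "negative"),
]
model = train_naive_bayes(training)
prediction = classify("great plot", model)
```

The new text "great plot" is labeled positive because "great" appears only in the positive training reviews, whereas "plot" appears equally often in both classes and so does not favor either.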

FIGURE 7.10 Graphical Representation of the P–N Polarity and S–O Polarity Relationship. (Axes: P–N polarity, from Positive to Negative; S–O polarity, from Subjective to Objective.)

TECHNOLOGY INSIGHTS 7.2 Large Textual Data Sets for Predictive Text Mining and Sentiment Analysis

Following are a few of the most commonly used examples of large textual data sets:

Congressional Floor-Debate Transcripts: Published by Thomas, Pang, and Lee (2006); contains political speeches that are labeled to indicate whether the speaker supported or opposed the legislation discussed.

Economining: Published by the Stern School at New York University; consists of feedback postings for merchants at Amazon.com.

Cornell Movie-Review Data Sets: Introduced by Pang and Lee (2008); contains 1,000 positive and 1,000 negative automatically derived document-level labels and 5,331 positive and 5,331 negative sentences/snippets.

Stanford Large Movie Review Data Set: A set of 25,000 highly polar movie reviews for training and 25,000 for testing. There is additional unlabeled data for use as well. Raw text and already processed bag-of-words formats are provided. (See http://ai.stanford.edu/~amaas/data/sentiment.)

MPQA Corpus: Corpus and Opinion Recognition System corpus; contains 535 manually annotated news articles from a variety of news sources containing labels for opinions and private states (beliefs, emotions, speculations, etc.).

Multiple-Aspect Restaurant Reviews: Introduced by Snyder and Barzilay (2007); contains 4,488 reviews with an explicit 1-to-5 rating for five different aspects: food, ambiance, service, value, and overall experience.

Identifying Semantic Orientation of Sentences and Phrases

Once the semantic orientation of individual words has been determined, it is often desirable to extend this to the phrase or sentence in which the word appears. The simplest way to accomplish such aggregation is to use some type of averaging for the polarities of words in the phrases or sentences. Though rarely applied, such aggregation can be as complex as using one or more machine-learning techniques to create a predictive relationship between the words (and their polarity values) and phrases or sentences.

Identifying Semantic Orientation of Documents

Even though the vast majority of the work in this area is done in determining semantic orientation of words and phrases/sentences, some tasks such as summarization and information retrieval could require semantic labeling of the whole document (Ramage et al., 2009). Similar to the case in aggregating sentiment polarity from word level to phrase or sentence level, aggregation to document level is also accomplished by some type of averaging. Sentiment orientation of the document might not make sense for very large documents; therefore, it is often used on small to medium-size documents posted on the Internet.
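The averaging described in the last two subsections can be sketched as follows. This is a simplified illustration: the `WORD_POLARITY` scores are hypothetical, words missing from the lexicon are simply ignored, and the sketch makes no attempt to handle negation or intensifiers.

```python
import re

# Hypothetical word-level polarity scores in [-1, 1].
WORD_POLARITY = {"excellent": 1.0, "good": 0.5, "slow": -0.5, "awful": -1.0}


def sentence_polarity(sentence: str) -> float:
    """Average the polarity of the words found in the lexicon."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    scores = [WORD_POLARITY[t] for t in tokens if t in WORD_POLARITY]
    return sum(scores) / len(scores) if scores else 0.0


def document_polarity(sentences: list[str]) -> float:
    """Average the sentence-level polarities."""
    if not sentences:
        return 0.0
    return sum(sentence_polarity(s) for s in sentences) / len(sentences)


review = ["The camera is excellent.", "The autofocus is slow but good."]
score = document_polarity(review)
```

The first sentence scores 1.0, the second averages to 0.0 because "slow" and "good" cancel out, and the document as a whole scores 0.5.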

SECTION 7.6 REVIEW QUESTIONS

1. What is sentiment analysis? How does it relate to text mining?
2. What are the most popular application areas for sentiment analysis? Why?
3. What would be the expected benefits and beneficiaries of sentiment analysis in politics?
4. What are the main steps in carrying out sentiment analysis projects?
5. What are the two common methods for polarity identification? Explain.

7.7 WEB MINING OVERVIEW

The Internet has changed the landscape for conducting business forever. Because of the highly connected, flattened world and broadened competition field, today’s companies are increasingly facing more opportunities (being able to reach customers and markets that they might never have thought possible) and more challenges (a globalized and ever-changing competitive marketplace). Companies with the vision and capabilities to deal with such a volatile environment are greatly benefiting from it, whereas others who resist adapting are having difficulty surviving. Having an engaged presence on the Internet is not a choice anymore; it is a business requirement. Customers are expecting companies to offer their products and/or services over the Internet. Customers are not only buying products and services but also talking about companies and sharing their transactional and usage experiences with others over the Internet.

The growth of the Internet and its enabling technologies has made data creation, data collection, and data/information/opinion exchange easier. Delays in service, manufacturing, shipping, delivery, and customer inquiries are no longer private incidents, nor are they accepted as necessary evils. Now, thanks to social media tools and technologies on the Internet, everybody knows everything. Successful companies are the ones that embrace these Internet technologies and use them to improve their business processes to better communicate with their customers, understand their needs and wants, and serve them thoroughly and expeditiously. Being customer focused and keeping customers happy have never been as important to businesses as they are now in this age of the Internet and social media.

The World Wide Web (or for short, Web) serves as an enormous repository of data and information on virtually everything one can conceive: business, personal, you name it; an abundant amount of it is there. The Web is perhaps the world’s largest data and text repository, and the amount of information on the Web is growing rapidly. Much interesting information can be found online: whose home page is linked to which other pages, how many people have links to a specific Web page, and how a particular site is organized. In addition, each visitor to a Web site, each search on a search engine, each click on a link, and each transaction on an e-commerce site create additional data. Although unstructured textual data in the form of Web pages coded in HTML or XML are the dominant content of the Web, the Web infrastructure also contains hyperlink information (connections to other Web pages) and usage information (logs of visitors’ interactions with Web sites), all of which provide rich data for knowledge discovery. Analysis of this information can help us make better use of Web sites and also aid us in enhancing relationships and value for the visitors to our own Web sites.

Because of its sheer size and complexity, mining the Web is not an easy undertaking by any means. The Web also poses great challenges for effective and efficient knowledge discovery (Han & Kamber, 2006):

• The Web is too big for effective data mining. The Web is so large and growing so rapidly that it is difficult to even quantify its size. Because of the sheer size of the Web, it is not feasible to set up a data warehouse to replicate, store, and integrate all of the data on the Web, making data collection and integration a challenge.

• The Web is too complex. The complexity of a Web page is far greater than that of a page in a traditional text document collection. Web pages lack a unified structure. They contain far more authoring style and content variation than any set of books, articles, or other traditional text-based documents.

• The Web is too dynamic. The Web is a highly dynamic information source. Not only does the Web grow rapidly but also its content is constantly being updated. Blogs, news stories, stock market results, weather reports, sports scores, prices, company advertisements, and numerous other types of information are updated regularly on the Web.

• The Web is not specific to a domain. The Web serves a broad diversity of communities and connects billions of workstations. Web users have very different backgrounds, interests, and usage purposes. Most users might not have good knowledge of the structure of the information network and might not be aware of the heavy cost of a particular search that they perform.

• The Web has everything. Only a small portion of the information on the Web is truly relevant or useful to someone (or some task). It is said that 99 percent of the information on the Web is useless to 99 percent of Web users. Although this might not seem obvious, it is true that a particular person is generally interested in only a tiny portion of the Web, whereas the rest of the Web contains information that is uninteresting to the user and could swamp desired results. Finding the portion of the Web that is truly relevant to a person and the task being performed is a prominent issue in Web-related research.

These challenges have prompted many research efforts to enhance the effectiveness and efficiency of discovering and using data assets on the Web. A number of index-based Web search engines constantly search the Web and index Web pages under certain keywords. Using these search engines, an experienced user might be able to locate documents by providing a set of tightly constrained keywords or phrases. However, a simple keyword-based search engine suffers from several deficiencies. First, a topic of any breadth can easily contain hundreds or thousands of documents. This can lead to a large number of document entries returned by the search engine, many of which are marginally relevant to the topic. Second, many documents that are highly relevant to a topic might not contain the exact keywords defining them. As we cover in more detail later in this chapter, compared to keyword-based Web search, Web mining is a prominent (and more challenging) approach that can be used to substantially enhance the power of Web search engines because Web mining can identify authoritative Web pages, classify Web documents, and resolve many ambiguities and subtleties raised in keyword-based Web search engines.

Web mining (or Web data mining) is the process of discovering intrinsic relationships (i.e., interesting and useful information) from Web data, which are expressed in the form of textual, linkage, or usage information. The term Web mining was first used by Etzioni (1996); today, many conferences, journals, and books focus on Web data mining. It is a continually evolving area of technology and business practice. Web mining is essentially the same as data mining that uses data generated over the Web. The goal is to turn vast repositories of business transactions, customer interactions, and Web site usage data into actionable information (i.e., knowledge) to promote better decision making throughout the enterprise. Because of the increased popularity of the term analytics, today many have started to refer to Web mining as Web analytics. However, these two terms are not the same. Whereas Web analytics is primarily focused on Web site usage data, Web mining is inclusive of all data generated via the Internet including transaction, social, and usage data. Where Web analytics aims to describe what has happened on the Web site (employing a predefined, metrics-driven descriptive analytics methodology), Web mining aims to discover previously unknown patterns and relationships (employing a novel predictive or prescriptive analytics methodology). From a big-picture perspective, Web analytics can be considered to be a part of Web mining. Figure 7.11 presents a simple taxonomy of Web mining divided into three main areas: Web content mining, Web structure mining, and Web usage mining. In the figure, the data sources used in these three main areas are also specified. Although these three areas are shown separately, as you will see in the following section, they are often used collectively and synergistically to address business problems and opportunities.

As Figure 7.11 indicates, Web mining relies heavily on data mining and text mining and their enabling tools and techniques, which we covered in detail earlier in this chapter and in Chapter 4. The figure also indicates that these three generic areas are further extended into several very well-known application areas. Some of these areas were explained in previous chapters, and some of the others are covered in detail in this chapter.

Web Content and Web Structure Mining

Web content mining refers to the extraction of useful information from Web pages. The documents can be extracted in some machine-readable format so that automated techniques can extract some information from these Web pages. Web crawlers (also called spiders) are used to read through the content of a Web site automatically. The information gathered could include document characteristics similar to what is used in text mining, but it could also include additional concepts, such as the document hierarchy. Such an automated (or semiautomated) process of collecting and mining of Web content can be used for competitive intelligence (collecting intelligence about competitors’ products, services, and customers). It can also be used for information/news/opinion collection and summarization, sentiment analysis, and automated data collection and structuring for predictive modeling. As an illustrative example of using Web content mining as an automated data collection tool, consider the following. For more than 10 years now, two of the three authors of this book (Drs. Sharda and Delen) have been developing models to predict the financial success of Hollywood movies before their theatrical release. The data that they use for training of the models come from several Web sites, each having a different hierarchical page structure. Collecting a large set of variables on thousands of movies (from the past several years) from these Web sites is a time-demanding, error-prone process. Therefore, Sharda and Delen use Web content mining and spiders as an enabling technology to automatically collect, verify, validate (if the specific data item is available on more than one Web site, then the values are validated against each other and anomalies are captured and recorded), and store these values in a relational database. That way, they ensure the quality of the data while saving valuable time (days or weeks) in the process.

FIGURE 7.11 Simple Taxonomy of Web Mining. (Data mining and text mining feed Web mining, which divides into three areas. Web content mining; source: unstructured textual content of the Web pages, usually in HTML format. Web structure mining; source: the uniform resource locator (URL) links contained in the Web pages. Web usage mining; source: the detailed description of a Web site’s visits, that is, sequence of clicks by sessions. Application areas shown include search engines, page rank, sentiment analysis, information retrieval, search engine optimization, marketing attribution, semantic Webs, graph mining, social network analysis, customer analytics, Web analytics, social analytics, social media analytics, 360 customer view, clickstream analysis, Weblog analysis, and voice of the customer.)

In addition to text, Web pages also contain hyperlinks pointing one page to another. Hyperlinks contain a significant amount of hidden human annotation that can potentially help to automatically infer the notion of centrality or authority. When a Web page developer includes a link pointing to another Web page, this could be regarded as the developer’s endorsement of the other page. The collective endorsement of a given page by different developers on the Web might indicate the importance of the page and might naturally lead to the discovery of authoritative Web pages (Miller, 2005). Therefore, the vast amount of Web linkage information provides a rich collection of information about the relevance, quality, and structure of the Web’s contents and thus is a rich source for Web mining.

Web content mining can also be used to enhance the results produced by search engines. In fact, search is perhaps the most prevailing application of Web content mining and Web structure mining. A search on the Web to obtain information on a specific topic (presented as a collection of keywords or a sentence) usually returns a few relevant, high-quality Web pages and a larger number of unusable Web pages. Use of a relevance index based on keywords and authoritative pages (or some measure of it) improves the search results and ranking of relevant pages. The idea of authority (or authoritative pages) stems from earlier information retrieval work using citations among journal articles to evaluate the impact of research papers (Miller, 2005). Although that was the origination of the idea, there are significant differences between the citations in research articles and hyperlinks on Web pages. First, not every hyperlink represents an endorsement (some links are created for navigation purposes and some are for paid advertisements). Although this is true, if the majority of the hyperlinks are of the endorsement type, then the collective opinion will still prevail. Second, for commercial and competitive interests, one authority rarely has its Web page point to rival authorities in the same domain. For example, Microsoft might prefer not to include links on its Web pages to Apple’s Web sites because this could be regarded as an endorsement of its competitor’s authority. Third, authoritative pages are seldom particularly descriptive. For example, the main Web page of Yahoo! might not contain the explicit self-description that it is in fact a Web search engine.

The structure of Web hyperlinks has led to another important category of Web pages called a hub, which is one or more Web pages that provide a collection of links to authoritative pages. Hub pages might not be prominent, and only a few links might point to them; however, hubs provide links to a collection of prominent sites on a specific topic of interest. A hub could be a list of recommended links on an individual’s home page, recommended reference sites on a course Web page, or a professionally assembled resource list on a specific topic. Hub pages play the role of implicitly conferring authority on pages within a narrow field. In essence, a close symbiotic relationship exists between good hubs and authoritative pages; a good hub is good because it points to many good authorities; and a good authority is good because it is being pointed to by many good hubs. Such relationships between hubs and authorities make it possible to automatically retrieve high-quality content from the Web.

The most popular publicly known and referenced algorithm used to calculate hubs and authorities is hyperlink-induced topic search (HITS). It was originally developed by Kleinberg (1999) and has since been improved by many researchers. HITS is a link-analysis algorithm that rates Web pages using the hyperlink information contained within them. In the context of Web search, the HITS algorithm collects a base document set for a specific query. It then recursively calculates the hub and authority values for each document. To gather the base document set, a root set that matches the query is fetched from a search engine. For each document retrieved, a set of documents that points to the original document and another set of documents that is pointed to by the original document are added to the set as the original document’s neighborhood. A recursive process of document identification and link analysis continues until the hub and authority values converge. These values are then used to index and prioritize the document collection generated for a specific query.
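The mutual reinforcement between hubs and authorities can be illustrated with a short Python sketch of the hub and authority update loop. It is a simplified teaching version, not a production implementation: it assumes the base document set has already been gathered into a small link graph, it uses Euclidean normalization to keep the scores bounded, and the page names are hypothetical.

```python
import math


def hits(links: dict[str, set[str]], max_iter: int = 100, tol: float = 1e-9):
    """Iteratively compute hub and authority scores.

    `links` maps each page to the set of pages it points to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(max_iter):
        # Authority: sum of the hub scores of the pages linking to a page.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub: sum of the authority scores of the pages a page links to.
        new_hub = {p: sum(new_auth[t] for t in links.get(p, ())) for p in pages}
        # Normalize so that the scores stay bounded.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        new_auth = {p: v / a_norm for p, v in new_auth.items()}
        new_hub = {p: v / h_norm for p, v in new_hub.items()}
        change = sum(
            abs(new_auth[p] - auth[p]) + abs(new_hub[p] - hub[p]) for p in pages
        )
        hub, auth = new_hub, new_auth
        if change < tol:
            break  # the values have converged
    return hub, auth


# Hypothetical neighborhood graph for one query.
graph = {
    "hub1": {"authA", "authB"},
    "hub2": {"authA", "authB"},
    "hub3": {"authA"},
}
hub_scores, auth_scores = hits(graph)
```

In this toy graph, authA receives the highest authority score because all three hubs point to it, and hub1 and hub2 score higher as hubs than hub3 because they point to both authorities.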

Web structure mining is the process of extracting useful information from the links embedded in Web documents. It is used to identify authoritative pages and hubs, which are the cornerstones of the contemporary page-rank algorithms that are central to popular search engines such as Google and Yahoo! Just as links going to a Web page can indicate a site’s popularity (or authority), links within the Web page (or the complete Web site) can indicate the depth of coverage of a specific topic. Analysis of links is very important in understanding the interrelationships among large numbers of Web pages, leading to a better understanding of a specific Web community, clan, or clique.

SECTION 7.7 REVIEW QUESTIONS

1. What are some of the main challenges the Web poses for knowledge discovery?
2. What is Web mining? How does it differ from regular data mining or text mining?
3. What are the three main areas of Web mining?
4. What is Web content mining? How can it be used for competitive advantage?
5. What is Web structure mining? How does it differ from Web content mining?

7.8 SEARCH ENGINES

In this day and age, there is no denying the importance of Internet search engines. As the size and complexity of the World Wide Web increase, finding what you want is becoming a complex and laborious process. People use search engines for a variety of reasons. We use them to learn about a product or service before committing to buy it (including who else is selling it, what the prices are at different locations/sellers, common issues people are discussing about it, how satisfied previous buyers are, and what other products or services might be better) and to search for places to go, people to meet, things to do. In a sense, search engines have become the centerpiece of most Internet-based transactions and other activities. The incredible success and popularity of Google, the most popular search engine company, is a good testament to this claim. What is somewhat of a mystery to many is how a search engine actually does what it is meant to do. In simplest terms, a search engine is a software program that searches for documents (Internet sites or files) based on the keywords (individual words, multiword terms, or a complete sentence) users have provided that have to do with the subject of their inquiry. Search engines are the workhorses of the Internet, responding to billions of queries in hundreds of different languages every day.

Technically speaking, “search engine” is the popular term for information retrieval systems. Although Web search engines are the most popular, search engines are often used in contexts other than the Web, such as desktop search engines and document search engines. As you will see in this section, many of the concepts and techniques that we covered in text analytics and text mining early in this chapter also apply here. The overall goal of a search engine is to return one or more documents/pages (if more than one document/page applies, then a ranked-order list is often provided) that best match the user’s query. The two metrics that are often used to evaluate search engines are effectiveness (or quality: finding the right documents/pages) and efficiency (or speed: returning a response quickly). These two metrics tend to work in reverse directions; improving one tends to worsen the other. Often, based on the user expectation, search engines focus on one at the expense of the other. Better search engines are the ones that excel in both at the same time. Because search engines not only search but also find and return documents/pages, perhaps a more appropriate name for them would have been finding engines.

Anatomy of a Search Engine

Now let us dissect a search engine and look inside it. At the highest level, a search engine system is composed of two main cycles: a development cycle and a responding cycle (see the structure of a typical Internet search engine in Figure 7.12). While one is interfacing with the World Wide Web, the other is interfacing with the user. One can think of the development cycle as a production process (manufacturing and inventorying documents/ pages) and the responding cycle as a retailing process (providing customers/users what they want). In the following section, these two cycles are explained in more detail.

1. Development Cycle

The two main components of the development cycle are the Web crawler and document indexer. The purpose of this cycle is to create a huge database of documents/pages organized and indexed based on their content and information value. The reason for developing such a repository of documents/pages is quite obvious: Due to its sheer size and complexity, searching the Web to find pages in response to a user query is not practical (or feasible within a reasonable time frame); therefore, search engines “cache the Web” into their database and use the cached version of the Web for searching and finding. Once created, this database allows search engines to rapidly and accurately respond to user queries.

WEB CRAWLER A Web crawler (also called a spider or Web spider) is a piece of software that systematically browses (crawls through) the World Wide Web for the purpose of finding and fetching Web pages. Often Web crawlers copy all the pages they visit for later processing by other functions of a search engine.

FIGURE 7.12 Structure of a Typical Internet Search Engine. (Development cycle: the scheduler passes a list of URLs to crawl to the Web crawler, which crawls the World Wide Web and sends unprocessed Web pages to the document indexer; processed pages are stored with metadata and an index in the cached/indexed documents DB. Responding cycle: the user’s search query goes to the query analyzer; the processed query goes to the document matcher/ranker, which turns the list of matched pages into rank-ordered pages for the user.)

A Web crawler starts with a list of URLs to visit, which are listed in the scheduler and often are called the seeds. These URLs can come from submissions made by Webmasters or, more often, from the internal hyperlinks of previously crawled documents/pages. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit (i.e., the scheduler). URLs in the scheduler are recursively visited according to a set of policies determined by the specific search engine. Because there are large volumes of Web pages, the crawler can download only a limited number of them within a given time; therefore, it might need to prioritize its downloads.
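The crawl loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `TOY_WEB` dictionary and its example URLs are hypothetical stand-ins for real page fetching, and the "policy" is a simple first-in, first-out order with a cap on the number of downloads. Real search engines use far more elaborate prioritization rules.

```python
from collections import deque

# Hypothetical toy "Web": each URL maps to the hyperlinks found on that page.
TOY_WEB = {
    "a.example/": ["a.example/about", "b.example/"],
    "a.example/about": ["a.example/"],
    "b.example/": ["c.example/", "a.example/"],
    "c.example/": [],
}

def crawl(seeds, fetch_links, max_pages):
    """Visit the seeds, queue newly found links, and stop after max_pages."""
    scheduler = deque(seeds)   # URLs waiting to be visited
    seen = set(seeds)          # URLs already scheduled, to avoid revisits
    visited = []               # order in which pages were fetched
    while scheduler and len(visited) < max_pages:
        url = scheduler.popleft()
        visited.append(url)
        for link in fetch_links(url):   # hyperlinks identified in the page
            if link not in seen:
                seen.add(link)
                scheduler.append(link)
    return visited

order = crawl(["a.example/"], lambda u: TOY_WEB.get(u, []), max_pages=3)
print(order)  # ['a.example/', 'a.example/about', 'b.example/']
```

The `max_pages` cap plays the role of the download limit mentioned above; swapping the queue for a priority queue would be one way to express a different prioritization policy.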

DOCUMENT INDEXER As the documents are found and fetched by the crawler, they are stored in a temporary staging area for the document indexer to grab and process. The document indexer is responsible for processing the documents (Web pages or document files) and placing them into the document database. To convert the documents/pages into the desired, easily searchable format, the document indexer performs the following tasks.

STEP 1: PREPROCESSING THE DOCUMENTS Because the documents fetched by the crawler might all be in different formats, for the ease of processing them further, in this step they all are converted to some type of standard representation. For instance, different content types (text, hyperlink, image, etc.) could be separated from each other, formatted (if necessary), and stored in a place for further processing.

STEP 2: PARSING THE DOCUMENTS This step is essentially the application of text mining (i.e., computational linguistics, NLP) tools and techniques to a collection of documents/pages. In this step, first the standardized documents are parsed into components to identify index-worthy words/terms. Then, using a set of rules, the words/terms are indexed. More specifically, using tokenization rules, the words/terms/entities are extracted from the sentences in these documents. Using proper lexicons, the spelling errors and other anomalies in these words/terms are corrected. Not all the terms are discriminators. The nondiscriminating words/terms (also known as stop words) are eliminated from the list of index-worthy words/terms. Because the same word/term can be in many different forms, stemming is applied to reduce the words/terms to their root forms. Again, using lexicons and other language-specific resources (e.g., WordNet), synonyms and homonyms are identified, and the word/term collection is processed before moving into the indexing phase.
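A minimal Python sketch of the tokenization, stop-word removal, and stemming tasks follows. It rests on several assumptions: the stop-word list is a tiny illustrative sample, and `crude_stem` is a deliberately simple suffix stripper written for this example, not a real stemming algorithm. Spelling correction and synonym/homonym handling are omitted.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard lexicon).
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "are", "to", "in"}

def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def parse(text):
    """Tokenize, drop stop words, and reduce the remaining terms to crude roots."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [crude_stem(t) for t in kept]

terms = parse("The crawler is fetching pages and indexing the fetched pages")
print(terms)  # ['crawler', 'fetch', 'page', 'index', 'fetch', 'page']
```

Note how "fetching" and "fetched" collapse to the same root, which is the point of stemming: different surface forms of a word end up under one index entry.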

STEP 3: CREATING THE TERM-BY-DOCUMENT MATRIX In this step, the relationships between the words/terms and documents/pages are identified and assigned weights. The weight can be as simple as assigning 1 for presence or 0 for absence of the word/term in the document/page. Usually more sophisticated weight schemas are used. For instance, as opposed to binary, one can choose to assign frequency of occurrence (number of times the same word/term is found in a document) as a weight. As we saw earlier in this chapter, text mining research and practice have clearly indicated that the best weighting could come from the use of term frequency multiplied by inverse document frequency (TF-IDF). This algorithm measures the frequency of occurrence of each word/term within a document and then compares that frequency against the frequency of occurrence in the document collection. As we all know, not all high-frequency words/terms are good document discriminators, and a good document discriminator in a domain might not be one in another domain. Once the weighting schema is determined, the weights are calculated and the term-by-document index file is created.
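As an illustration of such a weighting, the sketch below builds a small term-by-document matrix in Python. It assumes one common variant in which the inverse document frequency of a term is log(N/df), where N is the number of documents and df is the number of documents containing the term, and the weight is the term's frequency in the document times that value. The three toy documents are hypothetical, and real engines use many refinements of this formula.

```python
import math
from collections import Counter

# Hypothetical documents, already parsed into terms.
docs = {
    "d1": ["search", "engine", "crawler", "search"],
    "d2": ["search", "index"],
    "d3": ["crawler", "index", "index"],
}

def tf_idf_matrix(docs):
    """Return the vocabulary and a doc -> {term: weight} mapping."""
    n_docs = len(docs)
    counts = {d: Counter(terms) for d, terms in docs.items()}
    vocab = sorted({t for terms in docs.values() for t in terms})
    # Document frequency: in how many documents each term appears.
    doc_freq = {t: sum(1 for c in counts.values() if t in c) for t in vocab}
    matrix = {
        d: {t: c[t] * math.log(n_docs / doc_freq[t]) for t in vocab}
        for d, c in counts.items()
    }
    return vocab, matrix

vocab, matrix = tf_idf_matrix(docs)
for d in docs:
    print(d, {t: round(w, 3) for t, w in matrix[d].items()})
```

In document d1, "engine" occurs once and "search" twice, yet "engine" receives the larger weight because it appears in only one document and is therefore the better discriminator.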

2. Responding Cycle

The two main components of the responding cycle are the query analyzer and the document matcher/ranker.


QUERY ANALYZER The query analyzer is responsible for receiving a search request from the user (via the search engine’s Web server interface) and converting it into a standardized data structure so that it can be easily queried/matched against the entries in the document database. How the query analyzer does what it is supposed to do is quite similar to what the document indexer does (as we have just explained). The query analyzer parses the search string into individual words/terms using a series of tasks that include tokenization, removal of stop words, stemming, and word/term disambiguation (identification of spelling errors, synonyms, and homonyms). The close similarity between the query analyzer and the document indexer is not coincidental. In fact, it is quite logical because both are working off the document database; one is putting in documents/pages using a specific index structure, and the other is converting a query string into the same structure so that it can be used to quickly locate the most relevant documents/pages.

DOCUMENT MATCHER/RANKER This is where the structured query data are matched against the document database to find the most relevant documents/pages and rank them in the order of relevance/importance. The proficiency of this step is perhaps the most important component when different search engines are compared to one another. Every search engine has its own (often proprietary) algorithm that it uses to carry out this important step.

The early search engines used a simple keyword match against the document database and returned a list of ordered documents/pages where the determinant of the order was a function that used the number of words/terms matched between the query and the document along with the weights of those words/terms. The quality and the usefulness of the search results were not all that good. Then, in 1997, the creators of Google came up with a new algorithm, called PageRank. As the name implies, PageRank is an algorithmic way to rank order documents/pages based on their relevance and value/importance. Even though PageRank is an innovative way to rank documents/pages, it is an augmentation to the process of retrieving relevant documents from the database and ranking them based on the weights of the words/terms. Google does all of these collectively and more to identify the most relevant list of documents/pages for a given search request. Once an ordered list of documents/pages is created, it is pushed back to the user in an easily digestible format. At this point, users might choose to click on any of the documents in the list, and it might not be the one at the top. If they click on a document/page link that is not at the top of the list, then can we assume that the search engine did not do a good job ranking them? Perhaps, yes. Leading search engines like Google monitor the performance of their search results by capturing, recording, and analyzing postdelivery user actions and experiences. These analyses often lead to more and more rules to further refine the ranking of the documents/pages so that the links at the top are more preferable to the end users.
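The simple keyword matching used by early search engines can be sketched as follows. This Python example is an assumption-laden illustration, not any engine's actual algorithm: the `index` dictionary with its pages and weights is hypothetical, and the score is simply the sum of the stored weights of the query terms that a page contains. Link-based measures such as PageRank are not modeled here.

```python
def rank(query_terms, index):
    """Order pages by the summed weights of the query terms they contain."""
    scores = {}
    for page, weights in index.items():
        matched = [t for t in set(query_terms) if t in weights]
        if matched:  # pages with no matching term are not returned
            scores[page] = sum(weights[t] for t in matched)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda p: (-scores[p], p))

# Hypothetical index: page -> {term: weight}, as produced by a document indexer.
index = {
    "p1": {"maui": 0.9, "hotel": 0.4},
    "p2": {"maui": 0.2, "airfare": 0.8},
    "p3": {"hotel": 0.5},
    "p4": {"skiing": 0.7},
}

results = rank(["hotel", "maui"], index)
print(results)  # ['p1', 'p3', 'p2']
```

Page p1 matches both terms with high weights and is listed first, while p4 matches nothing and is left out of the results.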

Search Engine Optimization

Search engine optimization (SEO) is the intentional activity of affecting the visibility of an e-commerce site or a Web site in a search engine’s natural (unpaid or organic) search results. In general, the higher it is ranked on the search results page, and the more frequently a site appears in the search results list, the more visitors it will receive from the search engine’s users. As an Internet marketing strategy, SEO considers how search engines work, what people search for, the actual search terms or keywords typed into search engines, and which search engines are preferred by their targeted audience. Optimizing a Web site can involve editing its content, HTML, and associated coding to both increase its relevance to specific keywords and to remove barriers to the indexing activities of search engines. Promoting a site to increase the number of backlinks, or inbound links, is another SEO tactic.

In the early days, in order to be indexed, the only thing that Webmasters needed to do was to submit the address of a page, or URL, to the various engines, which would then send a “spider” to “crawl” that page, extract links to other pages from it, and return information found on the page to the server for indexing. The process, as explained before, involves a search engine spider downloading a page and storing it on the search engine’s own server, where a second program, known as an indexer, extracts various information about the page, such as the words it contains and where they are located as well as any weight for specific words, and all links the page contains, which are then placed into a scheduler for crawling at a later date. Today, search engines are no longer relying on Webmasters submitting URLs (even though they still can); instead, search engines are proactively and continuously crawling the Web and finding, fetching, and indexing everything about it.

Being indexed by search engines such as Google, Bing, and Yahoo! is not good enough for businesses. Getting ranked on the most widely used search engines (see Technology Insights 7.3 for a list of most widely used search engines) and getting ranked higher than your competitors are what make a difference in the eye of the customers and other constituents. A variety of methods can increase the ranking of a Web page within the search results. Cross-linking between pages of the same Web site to provide more links to the most important pages could improve its visibility. Writing content that includes frequently searched keyword phrases to be relevant to a wide variety of search queries tends to increase traffic. Updating content to keep search engines crawling back frequently can give additional weight to a site. Adding relevant keywords to a Web page’s metadata, including the title tag and metadescription, will tend to improve the relevancy of a site’s search listings, thus increasing traffic. For Web pages that are accessible via multiple URLs, URL normalization (using canonical link elements and redirects) can help make sure that the links to different versions of a URL all count toward the page’s link popularity score.
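To illustrate what URL normalization means in practice, the Python sketch below maps several spellings of the same address to one form using the standard library's `urllib.parse`. The specific rules chosen here (lowercasing the scheme and host, dropping a default port and any fragment, and using "/" for an empty path) are assumptions made for the example; which variants a site treats as equivalent is a design decision, and sites typically signal it through canonical link elements and redirects.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Map equivalent spellings of a URL to one form (illustrative rules only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(scheme)
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The empty last element drops the fragment (the part after '#').
    return urlunsplit((scheme, host, path, parts.query, ""))

variants = [
    "HTTP://Example.COM:80/Shop#top",
    "http://example.com/Shop",
]
print({normalize_url(u) for u in variants})  # {'http://example.com/Shop'}
```

Because both variants collapse to a single URL, inbound links pointing at either one can be counted toward the same page.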

Methods for Search Engine Optimization

In general, SEO techniques can be classified into two broad categories: techniques that search engines recommend as part of good site design and those techniques of which search engines do not approve. The search engines attempt to minimize the effect of the latter, which is often called spamdexing (also known as search spam, search engine spam, or search engine poisoning). Industry commentators, and the practitioners who employ them, have classified these methods as either white-hat SEO or black-hat SEO (Goodman, 2005). White hats tend to produce results that last a long time, whereas black hats anticipate that their sites might eventually be banned either temporarily or permanently once the search engines discover what they are doing.

An SEO technique is considered white hat if it conforms to the search engine’s guidelines and involves no deception. Because search engine guidelines are not written as a series of rules or commandments, this is an important distinction to note. White-hat SEO is not just about following guidelines but also about ensuring that the content a search engine indexes and subsequently ranks is the same content a user will see. White-hat advice is generally summed up as creating content for users, not for search engines, and then making that content easily accessible to the spiders rather than attempting to trick the algorithm from its intended purpose. White-hat SEO is in many ways similar to Web development that promotes accessibility, although the two are not identical.


TECHNOLOGY INSIGHTS 7.3 Top 15 Most Popular Search Engines (August 2016)

These are the 15 most popular search engines as derived from eBizMBA Rank (ebizmba.com/articles/search-engines), which is a constantly updated average of each Web site’s Alexa Global Traffic Rank and U.S. Traffic Rank from both Compete and Quantcast.

Rank  Name            Estimated Unique Monthly Visitors
1     Google          1,600,000,000
2     Bing            400,000,000
3     Yahoo! Search   300,000,000
4     Ask             245,000,000
5     AOL Search      125,000,000
6     Wow             100,000,000
7     WebCrawler      65,000,000
8     MyWebSearch     60,000,000
9     Infospace       24,000,000
10    Info            13,500,000
11    DuckDuckGo      11,000,000
12    Contenko        10,500,000
13    Dogpile         7,500,000
14    Alhea           4,000,000
15    ixQuick         1,000,000

Black-hat SEO attempts to improve rankings in ways that are not approved by the search engines or involve deception. One black-hat technique uses text that is hidden, either as text colored similar to the background, in an invisible div tag (that defines a division or a section in an HTML document), or positioned off-screen. Another method gives a different page depending on whether the page is being requested by a human visitor or a search engine, a technique known as cloaking. Search engines can penalize sites they discover using black-hat methods, either by reducing their rankings or eliminating their listings from their databases altogether. Such penalties can be applied either automatically by the search engines’ algorithms or by a manual site review. One example was the February 2006 Google removal of both BMW Germany and Ricoh Germany for use of unapproved practices (Cutts, 2006). Both companies, however, quickly apologized, fixed their practices, and were restored to Google’s list.

For some businesses, SEO can generate a significant return on investment. However, one should keep in mind that search engines are not paid for organic search traffic, their algorithms change constantly, and there are no guarantees of continued referrals. Due to this lack of certainty and stability, a business that relies heavily on search engine traffic can suffer major losses if the search engine decides to change its algorithms and stop sending visitors. According to Google’s CEO, Eric Schmidt, in 2010, Google made over 500 algorithm changes, almost 1.5 per day. Because of the difficulty in keeping up with changing search engine rules, companies that rely on search traffic practice one or more of the following: (1) hire a company that specializes in SEO (there seem to be an abundant number of those today) to continuously improve your site’s appeal to changing practices of the search engines, (2) pay the search engine providers to be listed on the paid sponsors’ sections, and (3) consider liberating yourself from dependence on search engine traffic.


Whether visitors originate from a search engine (organically or otherwise), respond to email-based marketing campaigns, or come from social media and other sites, what is most important for an e-commerce site is to maximize its leads and subsequent customer sales transactions. Having many visitors without sales is not what a typical e-commerce site is built for. Application Case 7.7 shows how a century-old fashion clothing and accessories company used email-based campaigns to generate a large number of new leads for its e-commerce business.

Application Case 7.7 Delivering Individualized Content and Driving Digital Engagement: How Barbour Collected More Than 49,000 New Leads in One Month with Teradata Interactive

Background

Founded in 1894, Barbour is an English heritage and lifestyle brand renowned for its waterproof outerwear, especially its classic waxed-cotton jacket. With more than 10,000 jackets ordered and handmade each year, Barbour has held a strong position in the luxury goods industry for more than a century, building a strong relationship with fashion-conscious men and women of the British countryside. In 2000, Barbour broadened its product offering to include a full lifestyle range of everyday clothes and accessories. Its major markets are the United Kingdom, the United States, and Germany; however, Barbour holds a presence in more than 40 countries worldwide, including Austria, New Zealand, and Japan. Using individualized insights derived with the services and digital marketing capabilities of Teradata Interactive, Barbour ran a one-month campaign that generated 49,700 new leads and 450,000 clicks to its website.

The Challenge: Taking Ownership of Customer Relationships

Barbour has experienced outstanding consistent growth within its lifetime, and in August 2013, it launched its first e-commerce site in a bid to gain a stronger online presence. However, being a late starter in the e-commerce world, it was a challenge for Barbour to establish itself in the saturated digital arena. Having previously sold its products only through wholesalers and independent retail resellers, Barbour wanted to take ownership of the end-user relationship, whole customer journey, and perception of the brand. While the brand is iconic and highly respected around the world, Barbour was aware of the importance in establishing direct relationships with its target audience, especially when encouraging users to engage with its new e-commerce platform. It also understood that it needed to take more control of shaping the customer journey. That way Barbour could create and maintain the same exceptional level of quality in the user experience as that applied to the manufacturing of its products. To do this, the company needed to develop its understanding of its target market’s online behavior. With the goal of reaching its target audience in order to build meaningful customer relationships, Barbour approached Teradata. Barbour’s marketing department needed Teradata Interactive to offer a solution that would increase its knowledge of the unique characteristics and needs of its individual customers, as well as support the launch of its new UK e-commerce website.

The Solution: Implementing a Lead Nurture Program

The increasing shift to global e-commerce and the growth in digital consumerism require brands to hold a strong online presence. This also means that retailers have to implement strategies that support their customers’ evolving wants and needs, online and offline. Barbour and Teradata Interactive embarked on the design and construction of a Lead Nurture Program that ran over a one-month period. The campaign objective was to not only raise awareness and create demand for immediate sales activity but also to create a more long-term engagement mechanism that would lead to more sales over a sustained period of time. It was clear from the start that the strong relationship Barbour enjoys with its customers was a crucial factor that set it apart from its luxury retail competitors. Teradata Interactive was keen to ensure this relationship was respected through the lead generation process.

The execution of the campaign was unique to Barbour. Typical lead generation campaigns were often executed as single registration events with a single sales promotion in mind. The data were usually restricted to just email addresses and basic profile fields, generated without consideration of the registrant’s personal needs and imported to be used solely for generic newsletter campaigns. This strategy often missed a huge opportunity for brands when learning about their prospects, often resulting in poor sales conversions. Teradata Interactive understood that the true value of lead generation is twofold. First of all, by using the registration event to gather as much information as possible, the understanding of future buying intent and its affecting factors are developed. Second, by making sure that the collated data are effectively used to deliver valuable and individualized content, relevant sales opportunities are provided to the customer when they are next in the market to buy. To make sure this strategy drove long-term sales, Teradata Interactive built a customer lifecycle program which delivered content over email and online display.

The nurture program content was integrated with display advertising and encouraged social media sharing. With Teradata Interactive’s smart tagging of nurture content, Barbour was able to segment audiences according to their product preferences and launch display re-targeting banners. Registrants were also invited to share content socially, which enabled Teradata Interactive to identify “social propensity” and segment users for future loyalty schemes and “Tell-a-Friend” activities. In addition to the focus of increasing Barbour’s newsletter base, Teradata conducted a data audit to analyze all of the data collected and better understand what factors would influence user engagement behavior.

Results

The strong collaboration between Teradata and Barbour meant that over the one-month campaign period, Barbour was able to create new and innovative ways of communicating with its customers. More than 49,700 leads were collected within the UK and DACH regions, and the lead generation program showed open rates of up to 60 percent and click-through rates of between 4 and 11 percent. The campaign also generated 450,000+ clicks to Barbour’s website and was so popular with fashion bloggers and national press that it was featured as a story in The Daily Mirror. Though the campaign was only a month long, a key focus was to help Barbour’s future marketing strategy. A preference center survey was implemented into the campaign design, which resulted in a 65 percent incentivized completion rate. User data included:

• Social network engagement
• Device engagement
• Location to nearest store
• Important considerations to the customer

A deep level of insight has effectively given Barbour a huge capability to deliver personalized content and offers to its user base.

Questions for Case 7.7

1. What does Barbour do? What was the challenge Barbour was facing?

2. What was the proposed analytics solution?

3. What were the results?

Source: Teradata Case Study, “How Barbour Collected More Than 49,000 New Leads in One Month with Teradata Interactive” http://assets.teradata.com/resourceCenter/downloads/CaseStudies/EB-8791_Interactive-Case-Study_Barbour.pdf (accessed November 2018).


SECTION 7.8 REVIEW QUESTIONS

1. What is a search engine? Why are search engines critically important for today’s businesses?

2. What is a Web crawler? What is it used for? How does it work?

3. What is “search engine optimization”? Who benefits from it?

4. What things can help Web pages rank higher in search engine results?


7.9 WEB USAGE MINING (WEB ANALYTICS)

Web usage mining (also called Web analytics) is the extraction of useful information from data generated through Web page visits and transactions. Analysis of the information collected by Web servers can help us better understand user behavior. Analysis of these data is often called clickstream analysis. By using data and text mining techniques, a company might be able to discern interesting patterns from the clickstreams. For example, it might learn that 60 percent of visitors who searched for “hotels in Maui” had searched earlier for “airfares to Maui.” Such information could be useful in determining where to place online advertisements. Clickstream analysis might also be useful for knowing when visitors access a site. For example, if a company knew that 70 percent of software downloads from its Web site occurred between 7 and 11 p.m., it could plan for better customer support and network bandwidth during those hours. Figure 7.13 shows the process of extracting knowledge from clickstream data and how the generated knowledge is used to improve the process, improve the Web site, and most importantly, increase the customer value.
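A pattern such as "visitors who searched for one phrase had earlier searched for another" can be computed directly from time-ordered clickstream records. The Python sketch below is a minimal illustration under stated assumptions: the `searches` list is hypothetical data, each record is a (visitor, query) pair already sorted by time, and the function name `share_with_earlier` is made up for this example.

```python
# Hypothetical clickstream: (visitor_id, search query), in time order.
searches = [
    ("v1", "airfares to maui"), ("v1", "hotels in maui"),
    ("v2", "hotels in maui"),
    ("v3", "airfares to maui"), ("v3", "car rental"), ("v3", "hotels in maui"),
    ("v4", "airfares to maui"),
]

def share_with_earlier(searches, target, earlier):
    """Share of visitors issuing `target` who had issued `earlier` before it."""
    seen_earlier = set()     # visitors who have already issued the earlier query
    target_visitors = set()  # visitors who issued the target query
    preceded = set()         # target visitors with the earlier query before it
    for visitor, query in searches:
        if query == earlier:
            seen_earlier.add(visitor)
        elif query == target:
            target_visitors.add(visitor)
            if visitor in seen_earlier:
                preceded.add(visitor)
    return len(preceded) / len(target_visitors) if target_visitors else 0.0

share = share_with_earlier(searches, "hotels in maui", "airfares to maui")
print(f"{share:.0%} of hotel searchers had searched for airfares first")
```

In this toy data, two of the three visitors who searched for hotels had searched for airfares beforehand, which is the kind of finding that could inform where to place an advertisement.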

Web Analytics Technologies

There are numerous tools and technologies for Web analytics in the marketplace. Because of their power to measure, collect, and analyze Internet data to better understand and optimize Web usage, the popularity of Web analytics tools is increasing. Web analytics holds the promise of revolutionizing how business is done on the Web. Web analytics is not just a tool for measuring Web traffic; it can also be used as a tool for e-business and market research and to assess and improve the effectiveness of e-commerce Web sites. Web analytics applications can also help companies measure the results of traditional print or broadcast advertising campaigns. It can help estimate how traffic to a Web site changes after the launch of a new advertising campaign. Web analytics provides information about the number of visitors to a Web site and the number of page views. It helps gauge traffic and popularity trends, which can be used for market research.

There are two main categories of Web analytics: off-site and on-site. Off-site Web analytics refers to Web measurement and analysis about you and your products that take place outside your Web site. These measurements include a Web site’s potential audience (prospect or opportunity), share of voice (visibility or word of mouth), and buzz (comments or opinions) that is happening on the Internet.

What is more mainstream has been on-site Web analytics. Historically, Web analytics has been referred to as on-site visitor measurement. However, in recent years, this has blurred, mainly because vendors are producing tools that span both categories. On-site Web analytics measure visitors’ behavior once they are on your Web site. This includes its drivers and conversions; for example, the degree to which different landing pages are associated with online purchases. On-site Web analytics measure the performance of your Web site in a commercial context. The data collected on the Web site is then compared against key performance indicators for performance and used to improve audience response on a Web site or marketing campaign. Even though Google Analytics is the most widely used on-site Web analytics service, others are provided by Yahoo! and Microsoft, and newer and better tools are emerging constantly that provide additional layers of information.

[Figure 7.13: flow diagram in which Web logs from the Web site are preprocessed (collecting, merging, cleaning, structuring; identifying users, sessions, page views, and visits) and knowledge is extracted (usage patterns, user profiles, page profiles, visit profiles, customer value), feeding back into how to better the data, how to improve the Web site, and how to increase the customer value for the user/customer; an inset bar chart shows percent of users by age group (18–24, 25–34, 35–44, 45–54, 55+).]

FIGURE 7.13 Extraction of Knowledge from Web Usage Data.

There are two technical ways to collect the data with on-site Web analytics. The first and more traditional method is the server log file analysis by which the Web server records file requests made by browsers. The second method is page tagging, which uses JavaScript embedded in the site page code to make image requests to a third-party analytics–dedicated server whenever a page is rendered by a Web browser (or when a mouse click occurs). Both collect data that can be processed to produce Web traffic reports. In addition to these two main streams, other data sources can also be added to augment Web site behavior data. These other sources can include e-mail, direct mail campaign data, sales and lead history, or social media–originated data.
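As a small illustration of server log file analysis, the Python sketch below parses one log entry with a regular expression. It assumes the server writes entries in the widely used Common Log Format (host, identity, user, timestamp, request line, status code, response size); the sample line, its IP address, and the requested path are made up for the example, and real log formats vary by server configuration.

```python
import re

# Pattern for one Common Log Format entry (assumed format for this sketch).
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_line(line):
    """Return the fields of a log entry as a dict, or None if it does not match."""
    m = CLF.match(line)
    return m.groupdict() if m else None

# Hypothetical log entry recording a file request made by a browser.
line = ('203.0.113.7 - - [10/Oct/2018:19:05:12 +0000] '
        '"GET /downloads/app.zip HTTP/1.1" 200 51234')

record = parse_line(line)
print(record["host"], record["time"], record["path"], record["status"])
```

Once every line is turned into a record like this, counting requests per page or per hour of the day becomes a simple aggregation over the parsed fields.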

Web Analytics Metrics

Using a variety of data sources, Web analytics programs provide access to much valuable marketing data, which can be leveraged for better insights to grow your business and better document your return on investment (ROI). The insight and intelligence gained from Web analytics can be used to effectively manage the marketing efforts of an organization and its various products or services. Web analytics programs provide nearly real-time data, which can document an organization’s marketing campaign successes or empower it to make timely adjustments to its current marketing strategies.

Whereas Web analytics provides a broad range of metrics, four categories of metrics are generally actionable and can directly impact your business objectives (The Westover Group, 2013). These categories include:

• Web site usability: How were they using my Web site?
• Traffic sources: Where did they come from?
• Visitor profiles: What do my visitors look like?
• Conversion statistics: What does it all mean for the business?

Web Site Usability

Beginning with your Web site, let’s take a look at how well it works for your visitors. This is where you can learn how “user friendly” it really is or whether or not you are providing the right content.

1. Page views. The most basic of measurements, this one is usually presented as the “average page views per visitor.” If people come to your Web site and do not view many pages, then your Web site could have issues with its design or structure. Another explanation for low page views is a disconnect in the marketing messages that brought the visitor to the site and the content that is actually available.

2. Time on site. Similar to page views, this is a fundamental measurement of a visitor’s interaction with your Web site. Generally, the longer a person spends on your Web site, the better it is. That could mean they are carefully reviewing your content, utilizing interactive components you have available, and building toward an informed decision to buy, respond, or take the next step you have provided. On the other hand, the time on site also needs to be examined against the number of pages viewed to make sure the visitor is not spending his or her time trying to locate content that should be more readily accessible.


3. Downloads. This includes PDFs, videos, and other resources you make available to your visitors. Consider how accessible these items are as well as how well they are promoted. If your Web statistics, for example, reveal that 60 percent of the individuals who watch a demo video also make a purchase, then you will want to strategize to increase viewership of that video.

4. Click map. Most analytics programs can show you the percentage of clicks each item on your Web page received. This includes clickable photos, text links in your copy, downloads, and, of course, any navigation you have on the page. Are they clicking the most important items?

5. Click paths. Although an assessment of click paths is more involved, they can quickly reveal where you might be losing visitors in a specific process. A well-designed Web site uses a combination of graphics and information architecture to encourage visitors to follow “predefined” paths through your Web site. These are not rigid pathways but intuitive steps that align with the various processes you have built into the Web site. One process might be that of “educating” a visitor who has minimum understanding of your product or service. Another might be a process of “motivating” a returning visitor to consider an upgrade or repurchase. A third process might be structured around items you market online. You will have as many process pathways in your Web site as you have target audiences, products, and services. Each can be measured through Web analytics to determine how effective it is.
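The first two usability measures in the list above reduce to simple averages over visit records. The following Python sketch assumes a hypothetical `visits` list in which each record already holds the pages viewed and the seconds spent during one visit; the field names and the `usability_summary` function are made up for illustration and do not correspond to any particular analytics product.

```python
# Hypothetical visit records: pages viewed and seconds spent per visit.
visits = [
    {"visitor": "v1", "pages": ["/", "/products", "/checkout"], "seconds": 240},
    {"visitor": "v2", "pages": ["/"], "seconds": 15},
    {"visitor": "v3", "pages": ["/", "/products"], "seconds": 105},
]

def usability_summary(visits):
    """Average page views, time on site, and time per page across visits."""
    n = len(visits)
    return {
        "avg_page_views": sum(len(v["pages"]) for v in visits) / n,
        "avg_seconds_on_site": sum(v["seconds"] for v in visits) / n,
        # Time per page helps flag visits spent hunting for content.
        "avg_seconds_per_page": sum(v["seconds"] / len(v["pages"]) for v in visits) / n,
    }

summary = usability_summary(visits)
print(summary["avg_page_views"], summary["avg_seconds_on_site"])  # 2.0 120.0
```

Reporting time per page next to time on site reflects the caution in item 2: a long visit spread over few pages may signal difficulty finding content rather than engagement.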

Traffic Sources

Your Web analytics program is an incredible tool for identifying where your Web traffic originates. Basic categories such as search engines, referral Web sites, and visits from bookmarked pages (i.e., direct) are compiled with little involvement by the marketer. With a small amount of effort, however, you can also identify Web traffic that was generated by your various offline or online advertising campaigns.

1. Referral Web sites. Other Web sites that contain links that send visitors directly to your Web site are considered referral Web sites. Your analytics program will identify each referral site your traffic comes from, and a deeper analysis will help you determine which referrals produce the greatest volume, the highest conversions, the most new visitors, and so on.

2. Search engines. Data in the search engine category is divided between paid search and organic (or natural) search. You can review the top keywords that generated Web traffic to your site and see if they are representative of your products and services. Depending upon your business, you might want to have hundreds (or thousands) of keywords that draw potential customers. Even the simplest product search can have multiple variations based on how the individual phrases the search query.

3. Direct. Direct visits are attributed to two sources. Individuals who bookmark one of your Web pages in their favorites and click that link will be recorded as a direct visit. Another source occurs when someone types your URL directly into her or his browser. This happens when someone retrieves your URL from a business card, brochure, print ad, radio commercial, and so on. That’s why it is a good strategy to use coded URLs.

4. Offline campaigns. If you utilize advertising options other than Web-based campaigns, your Web analytics program can capture performance data if you include a mechanism for sending respondents to your Web site. Typically, this is a dedicated URL that you include in your advertisement (e.g., “www.mycompany.com/offer50”) that delivers those visitors to a specific landing page. You now have data on how many responded to that ad by visiting your Web site.


5. Online campaigns. If you are running a banner ad campaign, search engine advertising campaign, or even e-mail campaign, you can measure individual campaign effectiveness by simply using a dedicated URL similar to the offline campaign strategy.
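The dedicated-URL idea for offline and online campaigns can be sketched in a few lines of Python. This is only an illustration: the path-to-campaign mapping, the "/spring-banner" path, and the sample visit list are hypothetical, and a real analytics program would read landing pages from its own logs.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical mapping from dedicated landing paths to the campaigns that use them.
CAMPAIGN_PATHS = {
    "/offer50": "print ad (offline)",
    "/spring-banner": "banner ad (online)",
}

def count_campaign_visits(landing_urls):
    """Count visits per campaign based on the path of each landing URL."""
    counts = Counter()
    for url in landing_urls:
        # Ignore the query string and any trailing slash when matching paths.
        path = urlparse(url).path.rstrip("/")
        counts[CAMPAIGN_PATHS.get(path, "unattributed")] += 1
    return counts

# Made-up landing URLs for illustration only.
visits = [
    "http://www.mycompany.com/offer50",
    "http://www.mycompany.com/offer50/",
    "http://www.mycompany.com/spring-banner?src=email",
    "http://www.mycompany.com/",
]
campaign_counts = count_campaign_visits(visits)
```

Visits that arrive on a page with no dedicated path fall into an "unattributed" bucket, which is one simple way to keep campaign counts separate from general traffic.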

Visitor Profiles

One of the ways you can leverage your Web analytics into a really powerful marketing tool is through segmentation. By blending data from different analytics reports, you will begin to see a variety of user profiles emerge.

1. Keywords. Within your analytics report, you can see what keywords visitors used in search engines to locate your Web site. If you aggregate your keywords by similar attributes, you will begin to see distinct visitor groups who are using your Web site. For example, the particular search phrase that was used can indicate how well they understand your product or its benefits. If they use words that mirror your own product or service descriptions, then they probably are already aware of your offerings from effective advertisements, brochures, and so on. If the terms are more general in nature, then your visitors are seeking a solution for a problem and have happened upon your Web site. If this second group of searchers is sizable, then you will want to ensure that your site has a strong education component to convince them they have found their answer and then move them into your sales channel.

2. Content groupings. Depending on how you group your content, you may be able to analyze sections of your Web site that correspond to specific products, services, campaigns, and other marketing tactics. If you conduct a number of trade shows and drive traffic to your Web site for specific product literature, then your Web analytics will highlight the activity in that section.

3. Geography. Analytics permits you to see where your traffic geographically originates, including country, state, and city locations. This can be especially useful if you use geotargeted campaigns or want to measure your visibility across a region.

4. Time of day. Web traffic generally has peaks at the beginning of the workday, during lunch, and toward the end of the workday. It is not unusual, however, to find strong Web traffic entering your Web site up until the late evening. You can analyze these data to determine when people browse versus buy and to make decisions on what hours you should offer customer service.

5. Landing page profiles. If you structure your various advertising campaigns properly, you can drive each of your targeted groups to a different landing page, which your Web analytics will capture and measure. By combining these numbers with the demographics of your campaign media, you can know what percentage of your visitors fits each demographic.

Conversion Statistics

Each organization defines a “conversion” according to its specific marketing objectives. Some Web analytics programs use the term goal to benchmark certain Web site objectives, whether that be a certain number of visitors to a page, a completed registration form, or an online purchase.

1. New visitors. If you are working to increase visibility, you will want to study the trends in your new visitors data. Analytics identifies all visitors as either new or returning.

2. Returning visitors. If you are involved in loyalty programs or offer a product that has a long purchase cycle, then your returning visitors data will help you measure progress in this area.


3. Leads. Once a form is submitted and a thank-you page is generated, you have created a lead. Web analytics will permit you to calculate a completion rate by dividing the number of completed forms by the number of Web visitors that came to your page (the abandonment rate is the remaining share of visitors). A low completion percentage would indicate a page that needs attention.

4. Sales/conversions. Depending on the intent of your Web site, you can define a “sale” by an online purchase, a completed registration, an online submission, or any number of other Web activities. Monitoring these figures will alert you to any changes (or successes!) that occur further upstream.

5. Abandonment/exit rates. Just as important as those moving through your Web site are those who began a process and quit or came to your Web site and left after a page or two. In the first case, you’ll want to analyze where the visitor terminated the process and whether there are a number of visitors quitting at the same place. Then investigate the situation for resolution. In the latter case, a high exit rate on a Web site or a specific page generally indicates an issue with expectations. Visitors click to your Web site based on some message contained in an advertisement, a presentation, and so on, and expect some continuity in that message. Make sure you are advertising a message that your Web site can reinforce and deliver.
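As a small worked example of the lead calculation described above, the following Python sketch computes a completion rate and its complement, the abandonment rate. The function names and the visitor and form counts are made up for illustration.

```python
def completion_rate(completed_forms, page_visitors):
    """Share of visitors to a form page who submitted the form."""
    if page_visitors == 0:
        return 0.0  # avoid dividing by zero when a page had no traffic
    return completed_forms / page_visitors

def abandonment_rate(completed_forms, page_visitors):
    """Share of visitors to a form page who left without submitting it."""
    if page_visitors == 0:
        return 0.0
    return 1.0 - completion_rate(completed_forms, page_visitors)

# Hypothetical weekly numbers: 300 visitors reached the form, 45 completed it.
weekly_completion = completion_rate(45, 300)
weekly_abandonment = abandonment_rate(45, 300)
```

With these illustrative numbers the completion rate is 15 percent, so 85 percent of the visitors abandoned the form, which is the kind of figure a weekly dashboard could track.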

Within each of these items are metrics that can be established for your specific organization. You can create a weekly dashboard that includes specific numbers or percentages that will indicate where you are succeeding, or highlight a marketing challenge that should be addressed. When these metrics are evaluated consistently and used in conjunction with other available marketing data, they can lead you to a highly quantified marketing program. Figure 7.14 shows a Web analytics dashboard created with freely available Google Analytics tools.

FIGURE 7.14 Sample Web Analytics Dashboard.


SECTION 7.9 REVIEW QUESTIONS

1. What are the three types of data generated through Web page visits?

2. What is clickstream analysis? What is it used for?

3. What are the main applications of Web mining?

4. What are commonly used Web analytics metrics? What is the importance of metrics?

7.10 SOCIAL ANALYTICS

Social analytics could mean different things to different people based on their worldview and field of study. For instance, the dictionary definition of social analytics refers to a philosophical perspective developed by the Danish historian and philosopher Lars-Henrik Schmidt in the 1980s. The theoretical object of the perspective is socius, a kind of “commonness” that is neither a universal account nor a communality shared by every member of a body (Schmidt, 1996). Thus, social analytics differs from traditional philosophy as well as sociology. It might be viewed as a perspective that attempts to articulate the contentions between philosophy and sociology.

Our definition of social analytics is somewhat different; as opposed to focusing on the “social” part (as is done in its philosophical definition), we are more interested in the “analytics” part of the term. Gartner (a very well-known global IT consultancy company) defined social analytics as “monitoring, analyzing, measuring and interpreting digital interactions and relationships of people, topics, ideas and content” (gartner.com/it-glossary/social-analytics/). Social analytics includes mining the textual content created in social media (e.g., sentiment analysis, NLP) and analyzing socially established networks (e.g., influencer identification, profiling, prediction) for the purpose of gaining insight about existing and potential customers’ current and future behaviors, and about the likes and dislikes toward a firm’s products and services. Based on this definition and the current practices, social analytics can be classified into two different, but not necessarily mutually exclusive, branches: social network analysis (SNA) and social media analytics.

Social Network Analysis

A social network is a social structure composed of individuals/people (or groups of individuals or organizations) linked to one another with some type of connections/relationships. The social network perspective provides a holistic approach to analyzing the structure and dynamics of social entities. The study of these structures uses SNA to identify local and global patterns, locate influential entities, and examine network dynamics. The analysis of social networks is essentially an interdisciplinary field that emerged from social psychology, sociology, statistics, and graph theory. Development and formalization of the mathematical extent of SNA dates back to the 1950s; the development of foundational theories and methods of social networks dates back to the 1980s (Scott & Davis, 2003). SNA is now one of the major paradigms in business analytics, consumer intelligence, and contemporary sociology and is employed in a number of other social and formal sciences.

A social network is a theoretical construct useful in the social sciences to study relationships between individuals, groups, organizations, or even entire societies (social units). The term is used to describe a social structure determined by such interactions. The ties through which any given social unit connects represent the convergence of the various social contacts of that unit. In general, social networks are self-organizing, emergent, and complex, such that a globally coherent pattern appears from the local interaction of the elements (individuals and groups of individuals) that make up the system.

Following are a few typical social network types that are relevant to business activities.


COMMUNICATION NETWORKS Communication studies are often considered a part of both the social sciences and the humanities, drawing heavily on fields such as sociology, psychology, anthropology, information science, biology, political science, and economics. Many communications concepts describe the transfer of information from one source to another and thus can be represented as a social network. Telecommunication companies are tapping into this rich information source to optimize their business practices and to improve customer relationships.

COMMUNITY NETWORKS Traditionally, community has referred to a specific geographic location, and studies of community ties had to do with who talked, associated, traded, and attended social activities with whom. Today, however, there are extended “online” communities developed through social networking tools and telecommunications devices. Such tools and devices continuously generate large amounts of data that companies can use to discover invaluable, actionable information.

CRIMINAL NETWORKS In criminology and urban sociology, much attention has been paid to the social networks among criminal actors. For example, studying gang murders and other illegal activities as a series of exchanges between gangs can lead to better understanding and prevention of such criminal activities. Now that we live in a highly connected world (thanks to the Internet), much of the formation and activity of criminal networks is being watched and pursued by security agencies using state-of-the-art Internet tools and tactics. Even though the Internet has changed the landscape for criminal networks and law enforcement agencies, the traditional social and philosophical theories still apply to a large extent.

INNOVATION NETWORKS Business studies on the diffusion of ideas and innovations in a network environment focus on the spread and use of ideas among the members of the social network. The idea is to understand why some networks are more innovative, and why some communities are early adopters of ideas and innovations (i.e., examining the impact of social network structure on influencing the spread of an innovation and innovative behavior).

Social Network Analysis Metrics

SNA, the systematic examination of social networks, views social relationships in terms of network theory consisting of nodes (representing individuals or organizations within the network) and ties/connections (which represent relationships between the individuals or organizations, such as friendship, kinship, or organizational position). These networks are often represented using social network diagrams, where nodes are represented as points and ties are represented as lines.
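To make the node-and-tie view concrete, here is a minimal Python sketch that stores an undirected social network as a dictionary mapping each node to the set of nodes it is tied to. The names and ties are invented for illustration, and dedicated graph libraries offer richer structures than this.

```python
from collections import defaultdict

def build_network(ties):
    """Build an undirected network: each node maps to the set of its neighbors."""
    network = defaultdict(set)
    for a, b in ties:
        # A tie such as friendship is recorded in both directions.
        network[a].add(b)
        network[b].add(a)
    return dict(network)

# Hypothetical friendship ties between four people.
ties = [("Ann", "Bob"), ("Bob", "Cara"), ("Cara", "Ann"), ("Cara", "Dev")]
network = build_network(ties)
```

In a social network diagram of this example, the four people would be drawn as points and the four ties as lines, with Cara connected to everyone else.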

Application Case 7.8 provides an interesting example of multichannel social analytics.

Application Case 7.8 Tito’s Vodka Establishes Brand Loyalty with an Authentic Social Strategy

If Tito’s Handmade Vodka had to identify a single social media metric that most accurately reflects its mission, it would be engagement. Connecting with vodka lovers in an inclusive, authentic way is something Tito’s takes very seriously, and the brand’s social strategy reflects that vision.

Founded nearly two decades ago, Tito’s credits the advent of social media with playing an integral role in engaging fans and raising brand awareness. In an interview with Entrepreneur, founder Bert “Tito” Beveridge credited social media for enabling Tito’s to compete for shelf space with more established liquor brands. “Social media is a great platform for a word-of-mouth brand, because it’s not just about who has the biggest megaphone,” Beveridge told Entrepreneur.

As Tito’s has matured, the social team has remained true to the brand’s founding values and actively uses Twitter and Instagram to have one-on-one conversations and connect with brand enthusiasts. “We never viewed social media as another way to advertise,” said Katy Gelhausen, Web & social media coordinator. “We’re on social so our customers can talk to us.”

To that end, Tito’s uses Sprout Social to understand the industry atmosphere, develop a consistent social brand, and create a dialogue with its audience. As a result, Tito’s recently organically grew its Twitter and Instagram communities by 43.5 percent and 12.6 percent, respectively, within four months.

Informing a Seasonal, Integrated Marketing Strategy

Tito’s quarterly cocktail program is a key part of the brand’s integrated marketing strategy. Each quarter, a cocktail recipe is developed and distributed through Tito’s online and offline marketing initiatives.

It is important for Tito’s to ensure that the recipe is aligned with the brand’s focus as well as the larger industry direction. Therefore, Gelhausen uses Sprout’s Brand Keywords to monitor industry trends and cocktail flavor profiles. “Sprout has been a really important tool for social monitoring. The Inbox is a nice way to keep on top of hashtags and see general trends in one stream,” she said.

The information learned is presented to Tito’s in-house mixology team and used to ensure that the same quarterly recipe is communicated to the brand’s sales team and across marketing channels. “Whether you’re drinking Tito’s at a bar, buying it from a liquor store or following us on social media you’re getting the same quarterly cocktail,” said Gelhausen.

The program ensures that, at every consumer touch point, a person receives a consistent brand experience, and that consistency is vital. In fact, according to an Infosys study on the omnichannel shopping experience, 34 percent of consumers attribute cross-channel consistency as a reason they spend more on a brand. Meanwhile, 39 percent cite inconsistency as reason enough to spend less.

At Tito’s, gathering industry insights starts with social monitoring on Twitter and Instagram through Sprout. But the brand’s social strategy does not stop there. Staying true to its roots, Tito’s uses the platform on a daily basis to authentically connect with customers.


Used with permission of Sprout Social, Inc.


Sprout’s Smart Inbox displays Tito’s Twitter and Instagram accounts in a single, cohesive feed. This helps Gelhausen manage inbound messages and quickly identify which require a response.

“Sprout allows us to stay on top of the conversations we’re having with our followers. I love how you can easily interact with content from multiple accounts in one place,” she said.

Spreading the Word on Twitter

Tito’s approach to Twitter is simple: engage in personal, one-on-one conversations with fans. Dialogue is a driving force for the brand, and over the course of four months, 88 percent of Tweets sent were replies to inbound messages.

Using Twitter as an open line of communication between Tito’s and its fans resulted in a 162.2 percent increase in engagement and a 43.5 percent gain in followers. Even more impressively, Tito’s ended the quarter with 538,306 organic impressions, an 81 percent rise. A similar strategy is applied to Instagram, which Tito’s uses to strengthen and foster a relationship with fans by publishing photos and videos of new recipe ideas, brand events, and initiatives.

Capturing the Party on Instagram

On Instagram, Tito’s primarily publishes lifestyle content and encourages followers to incorporate its brand in everyday occasions. Tito’s also uses the platform to promote its cause through marketing efforts and to tell its brand story. The team finds value in Sprout’s Instagram Profiles Report, which helps them identify what media is receiving the most engagement, analyze audience demographics and growth, dive more deeply into publishing patterns, and quantify outbound hashtag performance. “Given Instagram’s new personalized feed, it’s important that we pay attention to what really does resonate,” said Gelhausen.

Using the Instagram Profiles Report, Tito’s has been able to measure the impact of its Instagram marketing strategy and revise its approach accordingly. By utilizing the network as another way to engage with fans, the brand has steadily grown its organic audience. In four months, @TitosVodka saw a 12.6 percent rise in followers and a 37.1 percent increase in engagement. On average, each piece of published content gained 534 interactions, and mentions of the brand’s hashtag, #titoshandmadevodka, grew by 33 percent.

Where to from Here?

Social is an ongoing investment in time and attention. Tito’s will continue the momentum the brand experienced by segmenting each quarter into its own campaign. “We’re always getting smarter with our social strategies and making sure that what we’re posting is relevant and resonates,” said Gelhausen. Using social to connect with fans in a consistent, genuine, and memorable way will remain a cornerstone of the brand’s digital marketing efforts.


Over the years, various metrics (or measurements) have been developed to analyze social network structures from different perspectives. These metrics are often grouped into three categories: connections, distributions, and segmentation.

Connections

The connections category of metrics includes the following:

Homophily: The extent to which actors form ties with similar versus dissimilar others. Similarity can be defined by gender, race, age, occupation, educational achievement, status, values, or any other salient characteristic.

Multiplexity: The number of content forms contained in a tie. For example, two people who are friends and also work together would have a multiplexity of two. Multiplexity has been associated with relationship strength.

Mutuality/reciprocity: The extent to which two actors reciprocate each other’s friendship or other interaction.

Network closure: A measure of the completeness of relational triads. An individual’s assumption of network closure (i.e., that their friends are also friends) is called transitivity. Transitivity is an outcome of the individual or situational trait of need for cognitive closure.

Propinquity: The tendency for actors to have more ties with geographically close others.
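Two of the connections metrics above lend themselves to a short Python sketch. Reciprocity is computed here as the share of directed ties that are returned, which is one common way to operationalize it, and multiplexity as the number of distinct relationship types between two people. The names and ties are hypothetical.

```python
def reciprocity(directed_ties):
    """Share of directed ties (a, b) for which the reverse tie (b, a) also exists."""
    ties = set(directed_ties)
    if not ties:
        return 0.0
    mutual = sum(1 for (a, b) in ties if (b, a) in ties)
    return mutual / len(ties)

def multiplexity(typed_ties, a, b):
    """Number of distinct relationship types that connect a and b."""
    return len({kind for (x, y, kind) in typed_ties if {x, y} == {a, b}})

# Hypothetical "follows" ties: only Ann and Bob follow each other.
directed_ties = [("Ann", "Bob"), ("Bob", "Ann"), ("Ann", "Cara"), ("Dev", "Ann")]

# Hypothetical typed ties: Ann and Bob are both friends and coworkers.
typed_ties = [
    ("Ann", "Bob", "friend"),
    ("Ann", "Bob", "coworker"),
    ("Ann", "Cara", "friend"),
]

follow_reciprocity = reciprocity(directed_ties)
ann_bob_multiplexity = multiplexity(typed_ties, "Ann", "Bob")
```

Here two of the four directed ties are reciprocated, and Ann and Bob have a multiplexity of two, matching the friends-who-also-work-together example in the definition.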

Distributions

The following relate to the distributions category:

Bridge: An individual whose weak ties fill a structural hole, providing the only link between two individuals or clusters. It also includes the shortest route when a longer one is unfeasible due to a high risk of message distortion or delivery failure.

Centrality: A group of metrics that aims to quantify the importance or influence (in a variety of senses) of a particular node (or group) within a network. Examples of common methods of measuring centrality include betweenness centrality, closeness centrality, eigenvector centrality, alpha centrality, and degree centrality.

Density: The proportion of direct ties in a network relative to the total number possible.

Distance: The minimum number of ties required to connect two particular actors.

Structural holes: The absence of ties between two parts of a network. Finding and exploiting a structural hole can give an entrepreneur a competitive advantage. This concept was developed by sociologist Ronald Burt and is sometimes referred to as an alternate conception of social capital.

Application Case 7.8 (Continued)

Using Sprout’s suite of social media management tools, Tito’s will continue to foster a community of loyalists.

Some highlights of Tito’s success follow:

• A 162 percent increase in organic engagement on Twitter.

• An 81 percent increase in organic Twitter impressions.

• A 37 percent increase in engagement on Instagram.

Questions for Case 7.8

1. How can social media analytics be used in the consumer products industry?

2. What do you think are the key challenges, potential solutions, and probable results in applying social media analytics in consumer products and services firms?

Source: SproutSocial Case Study, “Tito’s Vodka Establishes Brand Loyalty with an Authentic Social Strategy.” http://sproutsocial.com/insights/case-studies/titos/ (accessed July 2016). Used with permission.


Tie strength: Defined by the linear combination of time, emotional intensity, intimacy, and reciprocity (i.e., mutuality). Strong ties are associated with homophily, propinquity, and transitivity, whereas weak ties are associated with bridges.
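Several of the distribution metrics can be computed directly from a small network. The Python sketch below, using an invented five-person undirected network, computes density, the distance between two actors (via breadth-first search), and degree centrality normalized by the number of other nodes. The normalization is one common convention and is an assumption here, not something the text prescribes.

```python
from collections import deque

# Hypothetical undirected network: each person maps to the set of people they are tied to.
network = {
    "Ann": {"Bob", "Cara"},
    "Bob": {"Ann", "Cara"},
    "Cara": {"Ann", "Bob", "Dev"},
    "Dev": {"Cara", "Eli"},
    "Eli": {"Dev"},
}

def density(network):
    """Proportion of existing ties relative to the total number possible."""
    n = len(network)
    if n < 2:
        return 0.0
    existing = sum(len(neighbors) for neighbors in network.values()) / 2
    possible = n * (n - 1) / 2
    return existing / possible

def distance(network, source, target):
    """Minimum number of ties connecting two actors, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, steps = queue.popleft()
        for neighbor in network[node]:
            if neighbor == target:
                return steps + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, steps + 1))
    return None

def degree_centrality(network):
    """Each node's number of ties divided by the number of other nodes."""
    n = len(network)
    if n < 2:
        return {node: 0.0 for node in network}
    return {node: len(neighbors) / (n - 1) for node, neighbors in network.items()}

network_density = density(network)
ann_to_eli = distance(network, "Ann", "Eli")
centrality = degree_centrality(network)
```

In this example five of the ten possible ties exist, Ann reaches Eli in three steps, and Cara has the highest degree centrality because she is tied to three of the other four people.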

Segmentation

This category includes the following:

Cliques and social circles: Groups are identified as cliques if every individual is directly tied to every other individual, as social circles if direct contact is less stringent (an imprecise notion), or as structurally cohesive blocks if precision is wanted.

Clustering coefficient: A measure of the likelihood that two associates of a node are themselves associates. A higher clustering coefficient indicates a greater cliquishness.

Cohesion: The degree to which actors are connected directly to each other by cohesive bonds. Structural cohesion refers to the minimum number of members who, if removed from a group, would disconnect the group.
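The clustering coefficient can be illustrated with a short Python sketch. It computes the local clustering coefficient of one node as the share of pairs of that node's neighbors that are themselves tied. The five-person network is hypothetical.

```python
# Hypothetical undirected network: each person maps to the set of people they are tied to.
network = {
    "Ann": {"Bob", "Cara"},
    "Bob": {"Ann", "Cara"},
    "Cara": {"Ann", "Bob", "Dev"},
    "Dev": {"Cara", "Eli"},
    "Eli": {"Dev"},
}

def clustering_coefficient(network, node):
    """Share of pairs of the node's neighbors that are tied to each other."""
    neighbors = list(network[node])
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors means there are no pairs to check
    linked_pairs = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if neighbors[j] in network[neighbors[i]]
    )
    return linked_pairs / (k * (k - 1) / 2)

ann_clustering = clustering_coefficient(network, "Ann")
cara_clustering = clustering_coefficient(network, "Cara")
```

Ann's two associates (Bob and Cara) know each other, so her coefficient is 1.0. Only one of the three pairs among Cara's associates is tied, so her coefficient is one-third, indicating a less cliquish neighborhood.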

Social Media Analytics

Social media refers to the enabling technologies of social interactions among people in which they create, share, and exchange information, ideas, and opinions in virtual communities and networks. Social media is a group of Internet-based software applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of user-generated content (Kaplan & Haenlein, 2010). Social media depends on mobile and other Web-based technologies to create highly interactive platforms for individuals and communities to share, co-create, discuss, and modify user-generated content. It introduces substantial changes to communication among organizations, communities, and individuals.

Since their emergence in the early 1990s, Web-based social media technologies have seen a significant improvement in both quality and quantity. These technologies take on many different forms, including online magazines, Internet forums, Web logs, social blogs, microblogging, wikis, social networks, podcasts, pictures, videos, and product/service evaluations/ratings. By applying a set of theories in the field of media research (social presence, media richness) and social processes (self-presentation, self-disclosure), Kaplan and Haenlein (2010) created a classification scheme with six different types of social media: collaborative projects (e.g., Wikipedia), blogs and microblogs (e.g., Twitter), content communities (e.g., YouTube), social networking sites (e.g., Facebook), virtual game worlds (e.g., World of Warcraft), and virtual social worlds (e.g., Second Life).

Web-based social media are different from traditional/industrial media, such as newspapers, television, and film, because they are inexpensive and accessible enough to enable anyone (even private individuals) to publish or access/consume information. Industrial media generally require significant resources to publish information because in most cases, the articles (or books) go through many revisions before being published (as was the case in the publication of this very book). The following are some of the most prevailing characteristics that help differentiate between social and industrial media (Morgan, Jones, & Hodges, 2010):

Quality: In industrial publishing (mediated by a publisher), the typical range of quality is substantially narrower than in niche, unmediated markets. The main challenge posed by content in social media sites is the fact that the distribution of quality has high variance from very high-quality items to low-quality, sometimes abusive, content.

Reach: Both industrial and social media technologies provide scale and are capable of reaching a global audience. Industrial media, however, typically use a centralized framework for organization, production, and dissemination, whereas social media are by their very nature more decentralized, less hierarchical, and distinguished by multiple points of production and utility.

Frequency: Compared to industrial media, updating and reposting on social media platforms is easier, faster, and cheaper, and therefore practiced more frequently, resulting in fresher content.

Accessibility: The means of production for industrial media are typically government and/or corporate (privately owned) and are costly, whereas social media tools are generally available to the public at little or no cost.

Usability: Industrial media production typically requires specialized skills and training. Conversely, most social media production requires only modest reinterpretation of existing skills; in theory, anyone with access can operate the means of social media production.

Immediacy: The time lag between communications produced by industrial media can be long (weeks, months, or even years) compared to social media (which can be capable of virtually instantaneous responses).

Updatability: Industrial media, once created, cannot be altered (once a magazine article is printed and distributed, changes cannot be made to that same article), whereas social media can be altered almost instantaneously by comments or editing.

How Do People Use Social Media?

Not only are the numbers on social networking sites growing, but so is the degree to which users are engaged with the channel. Brogan and Bastone (2011) presented research results that stratify users according to how actively they use social media and tracked the evolution of these user segments over time. They listed six different engagement levels (Figure 7.15).

According to the research results, the online user community has been steadily migrating upward on this engagement hierarchy. The most notable change is among Inactives. Of the online population, 44 percent fell into this category in 2008. Two years later, more than half of those Inactives had jumped into social media in some form or another. “Now roughly 82% of the adult population online is in one of the upper categories,” said Bastone. “Social media has truly reached a state of mass adoption” (Brogan and Bastone, 2011).

FIGURE 7.15 Evolution of Social Media User Engagement. (Vertical axis: Level of Social Media Engagement; horizontal axis: Time; engagement levels shown: Inactives, Spectators, Joiners, Collectors, Critics, and Creators.)

Social media analytics refers to the systematic and scientific ways to consume the vast amount of content created by Web-based social media outlets, tools, and techniques for the betterment of an organization’s competitiveness. Social media analytics is rapidly becoming a new force in organizations around the world, allowing them to reach out to and understand consumers as never before. In many companies, it is becoming the tool for integrated marketing and communications strategies.

The exponential growth of social media outlets, from blogs, Facebook, and Twitter to LinkedIn and YouTube, and of analytics tools that tap into these rich data sources offers organizations the chance to join a conversation with millions of customers around the globe every day. This ability is why nearly two-thirds of the 2,100 companies that participated in a recent survey by Harvard Business Review Analytic Services said they were either currently using social media channels or had social media plans in the works (Harvard Business Review, 2010). But many still say social media is an experiment, as they try to understand how to best use the different channels, gauge their effectiveness, and integrate social media into their strategy.

Measuring the Social Media Impact

For organizations, small or large, there is valuable insight hidden in all the user-generated content on social media sites. But how do you dig it out of dozens of review sites, thousands of blogs, millions of Facebook posts, and billions of tweets? Once you do that, how do you measure the impact of your efforts? These questions can be addressed by the analytics extension of the social media technologies. Once you decide on your goal for social media (what it is that you want to accomplish), a multitude of tools can help you get there. These analysis tools usually fall into three broad categories:

• Descriptive analytics: Uses simple statistics to identify activity characteristics and trends, such as how many followers you have, how many reviews were gener- ated on Facebook, and which channels are being used most often.

• Social network analysis: Follows the links between friends, fans, and follow- ers to identify connections of influence as well as the biggest sources of influence.

• Advanced analytics: Includes predictive analytics and text analytics that exam- ine the content in online conversations to identify themes, sentiments, and connec- tions that would not be revealed by casual surveillance.

Sophisticated tools and solutions to social media analytics use all three categories of analytics (i.e., descriptive, predictive, and prescriptive) in a somewhat progressive fashion.
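The first two categories can be illustrated with a minimal Python sketch. The post records, user names, and helper functions below are hypothetical and exist only for illustration: counting posts per channel stands in for descriptive analytics, and counting how often each user is mentioned is a very simple stand-in for finding sources of influence in a social network.

```python
from collections import Counter

# Hypothetical sample data: (author, channel, mentioned_user or None).
posts = [
    ("ana", "twitter", "brand"),
    ("ben", "twitter", "ana"),
    ("cam", "facebook", "ana"),
    ("ana", "facebook", None),
    ("dee", "twitter", "ana"),
    ("ben", "twitter", "cam"),
]

def posts_per_channel(posts):
    """Descriptive analytics: count activity by channel."""
    return Counter(channel for _, channel, _ in posts)

def mention_in_degree(posts):
    """Social network analysis: count how often each user is mentioned by others."""
    return Counter(target for _, _, target in posts if target is not None)

channel_counts = posts_per_channel(posts)
influence = mention_in_degree(posts)

print(channel_counts.most_common())  # channels ordered by activity
print(influence.most_common(1))      # the most-mentioned user
```

In-degree is only one of several possible influence measures; a production tool would likely weight links and follow them more than one step.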

Best Practices in Social Media Analytics

As an emerging tool, social media analytics is practiced by companies in a somewhat haphazard fashion. Because there are not well-established methodologies, everybody is trying to create their own by trial and error. What follows are some of the best field-tested practices for social media analytics proposed by Paine and Chaves (2012).

THINK OF MEASUREMENT AS A GUIDANCE SYSTEM, NOT A RATING SYSTEM Measurements are often used for punishment or rewards; they should not be. They should be about figuring out what the most effective tools and practices are, what needs to be discontinued because it does not work, and what needs to be done more because it does work very well. A good analytics system should tell you where you need to focus. Maybe all that emphasis on Facebook does not really matter because that is not where your audience is. Maybe they are all on Twitter, or vice versa. According to Paine and Chaves (2012), channel preference will not necessarily be intuitive: “We just worked with a hotel that had virtually no activity on Twitter for one brand but lots of Twitter activity for one of their higher brands.” Without an accurate measurement tool, you would not know.

TRACK THE ELUSIVE SENTIMENT Customers want to take what they are hearing and learning from online conversations and act on it. The key is to be precise in extracting and tagging their intentions by measuring their sentiments. As we saw earlier in this chapter, text analytic tools can categorize online content, uncover linked concepts, and reveal the sentiment in a conversation as “positive,” “negative,” or “neutral,” based on the words people use. Ideally, you would like to be able to attribute sentiment to a specific product, service, and business unit. The more precise you can be in understanding the tone and perception that people express, the more actionable the information becomes because you are mitigating concerns about mixed polarity. A mixed-polarity phrase, such as “hotel in great location but bathroom was smelly,” should not be tagged as “neutral” merely because its positives and negatives offset each other. To be actionable, the parts of such a phrase are to be treated separately; “bathroom was smelly” is something someone can own and improve on. One can classify and categorize these sentiments, look at trends over time, and see significant differences in the way people speak either positively or negatively about you. Furthermore, you can compare sentiment about your brand to sentiment about your competitors.
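One way to treat the parts of a mixed-polarity phrase separately is to split it into clauses and score each clause on its own. The sketch below assumes a tiny hand-made lexicon and a simple split on contrastive words and punctuation; both are illustrative assumptions, not a real sentiment resource or a production parser.

```python
import re

# Tiny illustrative lexicon (an assumption, not a real sentiment resource).
LEXICON = {"great": 1, "clean": 1, "friendly": 1,
           "smelly": -1, "dirty": -1, "rude": -1}

def split_clauses(text):
    """Split on contrastive conjunctions and punctuation so each clause is scored alone."""
    parts = re.split(r"\bbut\b|\bhowever\b|[;,.]", text.lower())
    return [p.strip() for p in parts if p.strip()]

def clause_polarity(clause):
    """Sum lexicon scores of the words in one clause and map the total to a label."""
    score = sum(LEXICON.get(word, 0) for word in re.findall(r"[a-z']+", clause))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def tag_review(text):
    """Return (clause, polarity) pairs instead of one label for the whole phrase."""
    return [(clause, clause_polarity(clause)) for clause in split_clauses(text)]

result = tag_review("Hotel in great location but bathroom was smelly")
for clause, polarity in result:
    print(f"{polarity:8s} {clause}")
```

Scoring the whole phrase at once would sum to zero and hide the complaint; the clause-level tags keep “bathroom was smelly” visible as a separate, negative item.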

CONTINUOUSLY IMPROVE THE ACCURACY OF TEXT ANALYSIS An industry-specific text analytics package will already know the vocabulary of your business. The system will have linguistic rules built into it, but it learns over time and gets better and better. Much as you would tune a statistical model as you have more data, better parameters, or new techniques to deliver better results, you would do the same thing with the NLP that goes into sentiment analysis. You set up rules, taxonomies, categorization, and meaning of words; watch what the results look like and then go back and do it again.

LOOK AT THE RIPPLE EFFECT It is one thing to be a great hit on a high-profile site, but that is only the start. There is a difference between a great hit that just sits there and goes away versus a great hit that is tweeted, retweeted, and picked up by influential bloggers. Analysis should show you which social media activities go “viral” and which quickly go dormant—and why.

LOOK BEYOND THE BRAND One of the biggest mistakes people make is to be concerned only about their brand. To successfully analyze and act on social media, people need to understand not just what is being said about their brand but also the broader conversation about the spectrum of issues surrounding their product or service. Customers do not usually care about a firm’s message or its brand; they care about themselves. Therefore, you should pay attention to what they are talking about, where they are talking, and where their interests are.

IDENTIFY YOUR MOST POWERFUL INFLUENCERS Organizations struggle to identify who has the most power in shaping public opinion. It turns out, their most important influencers are not necessarily the ones who advocate specifically for their brand; they are the ones who influence the whole realm of conversation about their topic. Organizations need to understand whether influencers are saying nice things, expressing support, or simply making observations or critiquing. What is the nature of their conversations? How is the organization’s brand being positioned relative to the competition in that space?

LOOK CLOSELY AT THE ACCURACY OF ANALYTIC TOOLS USED Until recently, computer-based automated tools were not as accurate as humans for sifting through online content. Even now, accuracy varies depending on the media. For product review sites, hotel review sites, and Twitter, the accuracy can reach anywhere between 80 and 90 percent because the context is more boxed in. When an organization starts looking at blogs and discussion forums where the conversation is more wide ranging, the software can deliver 60 to 70 percent accuracy (Paine & Chaves, 2012). These figures will increase over time because the analytics tools are continually upgraded with new rules and improved algorithms to reflect field experience, new products, changing market conditions, and emerging patterns of speech.
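Accuracy figures like these are typically obtained by comparing a tool’s labels with labels assigned by human reviewers on a sample of content. The sketch below assumes a small, made-up sample of labeled items (the channels, labels, and resulting percentages are illustrative only) and reports accuracy separately for each channel.

```python
from collections import defaultdict

# Hypothetical audit sample: (channel, human_label, tool_label).
labeled = [
    ("reviews", "positive", "positive"),
    ("reviews", "negative", "negative"),
    ("reviews", "neutral", "neutral"),
    ("reviews", "negative", "positive"),
    ("blogs", "positive", "neutral"),
    ("blogs", "negative", "negative"),
    ("blogs", "neutral", "positive"),
    ("blogs", "positive", "positive"),
]

def accuracy_by_channel(records):
    """Share of items per channel where the tool agrees with the human label."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for channel, human, tool in records:
        totals[channel] += 1
        hits[channel] += int(human == tool)
    return {channel: hits[channel] / totals[channel] for channel in totals}

accuracy = accuracy_by_channel(labeled)
for channel, value in accuracy.items():
    print(f"{channel}: {value:.0%}")
```

Reporting per channel rather than one overall number makes it easier to see where the tool’s rules need tuning.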

INCORPORATE SOCIAL MEDIA INTELLIGENCE INTO PLANNING Once an organization has a big-picture perspective and detailed insight, it can begin to incorporate this information into its planning cycle. But that is easier said than done. A quick audience poll revealed that very few people currently incorporate learning from online conversations into their planning cycles (Paine & Chaves, 2012). One way to achieve this is to find time-linked associations between social media metrics and other business activities or market events. Social media activity is typically either organically invoked or invoked by something an organization does; therefore, if the organization sees a spike in activity at some point in time, it wants to know what was behind that.
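A simple way to look for such time-linked associations is to flag days on which activity is unusually high and then check which business events occurred shortly before. The sketch below uses made-up daily mention counts, a hypothetical event calendar, and a z-score threshold chosen for illustration; none of these values come from the source.

```python
from datetime import date, timedelta
from statistics import mean, stdev

# Hypothetical daily mention counts starting on `start`, plus a hypothetical event log.
start = date(2024, 3, 1)
mentions = [100, 96, 104, 98, 102, 260, 101, 99, 103, 97]
events = {date(2024, 3, 5): "promo email sent"}

def find_spikes(counts, threshold=2.0):
    """Return indexes of days whose count is more than `threshold` std devs above the mean."""
    mu, sigma = mean(counts), stdev(counts)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(counts) if (c - mu) / sigma > threshold]

def explain_spikes(counts, first_day, events, lookback_days=2):
    """Map each spike day to the events that happened on it or shortly before it."""
    explained = {}
    for i in find_spikes(counts):
        day = first_day + timedelta(days=i)
        window = [day - timedelta(days=d) for d in range(lookback_days + 1)]
        explained[day] = [events[d] for d in window if d in events]
    return explained

for day, causes in explain_spikes(mentions, start, events).items():
    print(day.isoformat(), causes or "no known event (possibly organic)")
```

A spike with no matching event is a candidate for organically invoked activity and is worth a closer look at the underlying posts.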

SECTION 7.10 REVIEW QUESTIONS

1. What is meant by social analytics? Why is it an important business topic?
2. What is a social network? What is the need for SNA?
3. What is social media? How does it relate to Web 2.0?
4. What is social media analytics? What are the reasons behind its increasing popularity?
5. How can you measure the impact of social media analytics?

Chapter Highlights

• Text mining is the discovery of knowledge from unstructured (mostly text-based) data sources. Because a great deal of information is in text form, text mining is one of the fastest-growing branches of the business intelligence field.

• Text mining applications are in virtually every area of business and government, including marketing, finance, healthcare, medicine, and homeland security.

• Text mining uses NLP to induce structure into the text collection and then uses data mining algorithms such as classification, clustering, association, and sequence discovery to extract knowledge from it.

• Sentiment can be defined as a settled opinion reflective of one’s feelings.

• Sentiment analysis deals with differentiating between two classes, positive and negative.

• As a field of research, sentiment analysis is closely related to computational linguistics, NLP, and text mining.

• Sentiment analysis is trying to answer the question, “What do people feel about a certain topic?” by digging into opinions of many by using a variety of automated tools.

• VOC is an integral part of an analytic CRM and customer experience management systems and is often powered by sentiment analysis.

• VOM is about understanding aggregate opinions and trends at the market level.

• Polarity identification in sentiment analysis is accomplished either by using a lexicon as a reference library or by using a collection of training documents.

• WordNet is a popular general-purpose lexicon created at Princeton University.

• SentiWordNet is an extension of WordNet to be used for sentiment identification.

• Speech analytics is a growing field of science that allows users to analyze and extract information from both live and recorded conversations.

• Web mining can be defined as the discovery and analysis of interesting and useful information from the Web, about the Web, and usually using Web-based tools.

• Web mining can be viewed as consisting of three areas: content mining, structure mining, and usage mining.

• Web content mining refers to the automatic extraction of useful information from Web pages. It can be used to enhance search results produced by search engines.

• Web structure mining refers to generating interesting information from the links on Web pages.

• Web structure mining can also be used to identify the members of a specific community and perhaps even the roles of the members in the community.

• Web usage mining refers to developing useful information through analyzing Web server logs, user profiles, and transaction information.

• Text and Web mining are emerging as critical components of the next generation of business intelligence tools to enable organizations to compete successfully.

• A search engine is a software program that searches for documents (Internet sites or files) based on the keywords (individual words, multiword terms, or a complete sentence) users have provided that relate to the subject of their inquiry.

• SEO is the intentional activity of affecting the visibility of an e-commerce site or a Web site in a search engine’s natural (unpaid or organic) search results.

• VOC is a term generally used to describe the analytic process of capturing a customer’s expectations, preferences, and aversions.

• Social analytics is the monitoring, analyzing, measuring, and interpreting of digital interactions and relationships of people, topics, ideas, and content.

• A social network is a social structure composed of individuals/people (or groups of individuals or organizations) linked to one another with some type of connections/relationships.

• Social media analytics refers to the systematic and scientific ways to consume the vast amount of content created by Web-based social media outlets, tools, and techniques to better an organization’s competitiveness.

Key Terms

association
authoritative pages
classification
clickstream analysis
clustering
corpus
deception detection
hubs
hyperlink-induced topic search (HITS)
natural language processing (NLP)
part-of-speech tagging
polarity identification
polyseme
search engine
sentiment analysis
SentiWordNet
singular value decomposition (SVD)
social media analytics
social network
spider
stemming
stop words
term–document matrix (TDM)
text mining
tokenizing
trend analysis
unstructured data
voice of the customer (VOC)
Web analytics
Web content mining
Web crawler
Web mining
Web structure mining
Web usage mining
WordNet

Questions for Discussion

1. Explain the relationship among data mining, text mining, and sentiment analysis.

2. In your own words, define text mining, and discuss its most popular applications.

3. What does it mean to induce structure into text-based data? Discuss the alternative ways of inducing structure into them.

4. What is the role of NLP in text mining? Discuss the capabilities and limitations of NLP in the context of text mining.

5. List and discuss three prominent application areas for text mining. What is the common theme among the three application areas you chose?

6. What is sentiment analysis? How does it relate to text mining?

7. What are the common challenges with which sentiment analysis deals?

8. What are the most popular application areas for sentiment analysis? Why?

9. What are the main steps in carrying out sentiment analysis projects?

10. What are the two common methods for polarity identification? Explain.

11. Discuss the differences and commonalities between text mining and Web mining.

12. In your own words, define Web mining, and discuss its importance.

13. What are the three main areas of Web mining? Discuss the differences and commonalities among these three areas.

14. What is a search engine? Why is it important for businesses?

15. What is SEO? Who benefits from it? How?

16. What is Web analytics? What are the metrics used in Web analytics?

17. Define social analytics, social network, and social network analysis. What are the relationships among them?

18. What is social media analytics? How is it done? Who does it? What comes out of it?

Exercises

Teradata University Network (TUN) and Other Hands-on Exercises

1. Visit teradatauniversitynetwork.com. Identify cases about text mining. Describe recent developments in the field. If you cannot find enough cases at the Teradata University Network Web site, broaden your search to other Web-based resources.

2. Go to teradatauniversitynetwork.com to locate white papers, Web seminars, and other materials related to text mining. Synthesize your findings into a short written report.


3. Go to teradatauniversitynetwork.com and find the case study named “eBay Analytics.” Read the case carefully and extend your understanding of it by searching the Internet for additional information, and answer the case questions.

4. Go to teradatauniversitynetwork.com and find the sentiment analysis case named “How Do We Fix an App Like That?” Read the description, and follow the directions to download the data and the tool to carry out the exercise.

5. Visit teradatauniversitynetwork.com. Identify cases about Web mining. Describe recent developments in the field. If you cannot find enough cases at the Teradata University Network Web site, broaden your search to other Web-based resources.

6. Browse the Web and your library’s digital databases to identify articles that make the linkage between text/Web mining and contemporary business intelligence systems.

Team Assignments and Role-Playing Projects

1. Examine how textual data can be captured automatically using Web-based technologies. Once captured, what are the potential patterns that you can extract from these unstructured data sources?

2. Interview administrators at your college or executives in your organization to determine how text mining and Web mining could assist them in their work. Write a proposal describing your findings. Include a preliminary cost–benefit analysis in your report.

3. Go to your library’s online resources. Learn how to download attributes of a collection of literature (journal articles) in a specific topic. Download and process the data using a methodology similar to the one explained in Application Case 7.5.

4. Find a readily available sentiment text data set (see Technology Insights 7.2 for a list of popular data sets) and download it onto your computer. If you have an analytics tool that is capable of text mining, use that. If not, download RapidMiner (http://rapid-i.com) and install it. Also install the Text Analytics add-on for RapidMiner. Process the downloaded data using your text mining tool (i.e., convert the data into a structured form). Build models and assess the sentiment detection accuracy of several classification models (e.g., support vector machines, decision trees, neural networks, logistic regression). Write a detailed report in which you explain your findings and your experiences.

5. Examine how Web-based data can be captured automatically using the latest technologies. Once captured, what are the potential patterns that you can extract from these content-rich, mostly unstructured data sources?

Internet Exercises

1. Find recent cases of successful text mining and Web mining applications. Try text and Web mining software vendors and consultancy firms and look for cases or success stories. Prepare a report summarizing five new case studies.

2. Go to statsoft.com. Select Downloads, and download at least three white papers on applications. Which of these applications might have used the data/text/Web mining techniques discussed in this chapter?

3. Go to sas.com. Download at least three white papers on applications. Which of these applications might have used the data/text/Web mining techniques discussed in this chapter?

4. Go to ibm.com. Download at least three white papers on applications. Which of these applications might have used the data/text/Web mining techniques discussed in this chapter?

5. Go to teradata.com. Download at least three white papers on applications. Which of these applications might have used the data/text/Web mining techniques discussed in this chapter?

6. Go to clarabridge.com. Download at least three white papers on applications. Which of these applications might have used text mining in a creative way?

7. Go to kdnuggets.com. Explore the sections on applications as well as software. Find names of at least three additional packages for data mining and text mining.

8. Survey some Web mining tools and vendors. Identify some Web mining products and service providers that are not mentioned in this chapter.

9. Go to attensity.com. Download at least three white papers on Web analytics applications. Which of these applications might have used a combination of data/text/Web mining techniques?

References

Bond, C. F., & B. M. DePaulo. (2006). “Accuracy of Deception Judgments.” Personality and Social Psychology Review, 10(3), pp. 214–234.

Brogan, C., & J. Bastone. (2011). “Acting on Customer Intelligence from Social Media: The New Edge for Building Customer Loyalty and Your Brand.” SAS white paper.

Chun, H. W., Y. Tsuruoka, J. D. Kim, R. Shiba, N. Nagata, & T. Hishiki. (2006). “Extraction of Gene-Disease Relations from MEDLINE Using Domain Dictionaries and Machine Learning.” Proceedings of the Eleventh Pacific Symposium on Biocomputing, pp. 4–15.

Coussement, K., & D. Van Den Poel. (2008). “Improving Customer Complaint Management by Automatic Email Classification Using Linguistic Style Features as Predictors.” Decision Support Systems, 44(4), pp. 870–882.

Coussement, K., & D. Van Den Poel. (2009). “Improving Customer Attrition Prediction by Integrating Emotions from Client/Company Interaction Emails and Evaluating Multiple Classifiers.” Expert Systems with Applications, 36(3), pp. 6127–6134.

Cutts, M. (2006, February 4). “Ramping Up on International Webspam.” mattcutts.com/blog. mattcutts.com/blog/ramping-up-on-international-webspam (accessed March 2013).

Delen, D., & M. Crossland. (2008). “Seeding the Survey and Analysis of Research Literature with Text Mining.” Expert Systems with Applications, 34(3), pp. 1707–1720.

Esuli, A., & F. Sebastiani. (2006, May). “SentiWordNet: A Publicly Available Lexical Resource for Opinion Mining.” Proceedings of LREC, 6, pp. 417–422.

Etzioni, O. (1996). “The World Wide Web: Quagmire or Gold Mine?” Communications of the ACM, 39(11), pp. 65–68.

EUROPOL. (2007). EUROPOL Work Program 2005. statewatch.org/news/2006/apr/europol-work-programme-2005.pdf (accessed October 2008).

Feldman, R., & J. Sanger. (2007). The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data. Boston, MA: ABS Ventures.

Fuller, C. M., D. Biros, & D. Delen. (2008). “Exploration of Feature Selection and Advanced Classification Models for High-Stakes Deception Detection.” Proceedings of the Forty-First Annual Hawaii International Conference on System Sciences (HICSS). Big Island, HI: IEEE Press, pp. 80–99.

Ghani, R., K. Probst, Y. Liu, M. Krema, & A. Fano. (2006). “Text Mining for Product Attribute Extraction.” SIGKDD Explorations, 8(1), pp. 41–48.

Goodman, A. (2005). “Search Engine Showdown: Black Hats Versus White Hats at SES.” SearchEngineWatch. searchenginewatch.com/article/2066090/Search-Engine-Showdown-Black-Hats-vs.-White-Hats-at-SES (accessed February 2013).

Han, J., & M. Kamber. (2006). Data Mining: Concepts and Techniques, 2nd ed. San Francisco, CA: Morgan Kaufmann.

Harvard Business Review. (2010). “The New Conversation: Taking Social Media from Talk to Action.” A SAS–Sponsored Research Report by Harvard Business Review Analytic Services. sas.com/resources/whitepaper/wp_23348.pdf (accessed March 2013).

Kaplan, A. M., & M. Haenlein. (2010). “Users of the World, Unite! The Challenges and Opportunities of Social Media.” Business Horizons, 53(1), pp. 59–68.

Kim, S. M., & E. Hovy. (2004, August). “Determining the Sentiment of Opinions.” Proceedings of the Twentieth International Conference on Computational Linguistics, p. 1367.

Kleinberg, J. (1999). “Authoritative Sources in a Hyperlinked Environment.” Journal of the ACM, 46(5), pp. 604–632.

Lin, J., & D. Demner-Fushman. (2005). “‘Bag of Words’ Is Not Enough for Strength of Evidence Classification.” AMIA Annual Symposium Proceedings, pp. 1031–1032. pubmedcentral.nih.gov/articlerender.fcgi?artid=1560897.

Liu, B., M. Hu, & J. Cheng. (2005, May). “Opinion Observer: Analyzing and Comparing Opinions on the Web.” Proceedings of the Fourth International Conference on World Wide Web, pp. 342–351.

Mahgoub, H., D. Rösner, N. Ismail, & F. Torkey. (2008). “A Text Mining Technique Using Association Rules Extraction.” International Journal of Computational Intelligence, 4(1), pp. 21–28.

Manning, C. D., & H. Schutze. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

McKnight, W. (2005, January 1). “Text Data Mining in Business Intelligence.” Information Management Magazine. information-management.com/issues/20050101/1016487-1.html (accessed May 22, 2009).

Mejova, Y. (2009). “Sentiment Analysis: An Overview.” Comprehensive exam paper. http://www.cs.uiowa.edu/~ymejova/publications/CompsYelenaMejova.pdf (accessed February 2013).

Miller, T. W. (2005). Data and Text Mining: A Business Applications Approach. Upper Saddle River, NJ: Prentice Hall.

Morgan, N., G. Jones, & A. Hodges. (2010). “The Complete Guide to Social Media from the Social Media Guys.” thesocialmediaguys.co.uk/wp-content/uploads/downloads/2011/03/CompleteGuidetoSocialMedia.pdf (accessed February 2013).

Nakov, P., A. Schwartz, B. Wolf, & M. A. Hearst. (2005). “Supporting Annotation Layers for Natural Language Processing.” Proceedings of the ACL, Interactive Poster and Demonstration Sessions. Ann Arbor, MI: Association for Computational Linguistics, pp. 65–68.

Paine, K. D., & M. Chaves. (2012). “Social Media Metrics.” SAS white paper. sas.com/resources/whitepaper/wp_19861.pdf (accessed February 2013).

Pang, B., & L. Lee. (2008). Opinion Mining and Sentiment Analysis. Hanover, MA: Now Publishers; available at http://books.google.com.

Ramage, D., D. Hall, R. Nallapati, & C. D. Manning. (2009, August). “Labeled LDA: A Supervised Topic Model for Credit Attribution in Multi-Labeled Corpora.” Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing: Volume 1, pp. 248–256.

Schmidt, L.-H. (1996). “Commonness Across Cultures.” In A. N. Balslev (ed.), Cross-Cultural Conversation: Initiation (pp. 119–132). New York: Oxford University Press.

Scott, W. R., & G. F. Davis. (2003). “Networks in and Around Organizations.” Organizations and Organizing. Upper Saddle River, NJ: Pearson Prentice Hall.

Shatkay, H., A. Höglund, S. Brady, T. Blum, P. Dönnes, & O. Kohlbacher. (2007). “SherLoc: High-Accuracy Prediction of Protein Subcellular Localization by Integrating Text and Protein Sequence Data.” Bioinformatics, 23(11), pp. 1410–1415.

Snyder, B., & R. Barzilay. (2007, April). “Multiple Aspect Ranking Using the Good Grief Algorithm.” HLT-NAACL, pp. 300–307.

Strapparava, C., & A. Valitutti. (2004, May). “WordNet Affect: An Affective Extension of WordNet.” LREC, 4, pp. 1083–1086.

The Westover Group. (2013). “20 Key Web Analytics Metrics and How to Use Them.” http://www.thewestovergroup.com (accessed February 2013).

Thomas, M., B. Pang, & L. Lee. (2006, July). “Get Out the Vote: Determining Support or Opposition from Congressional Floor-Debate Transcripts.” In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp. 327–335.

Weng, S. S., & C. K. Liu. (2004). “Using Text Classification and Multiple Concepts to Answer E-Mails.” Expert Systems with Applications, 26(4), pp. 529–543.

PART III

Prescriptive Analytics and Big Data

CHAPTER 8

Prescriptive Analytics: Optimization and Simulation

LEARNING OBJECTIVES

■■ Understand the applications of prescriptive analytics techniques in combination with reporting and predictive analytics

■■ Understand the basic concepts of analytical decision modeling

■■ Understand the concepts of analytical models for selected decision problems, including linear programming and simulation models for decision support

■■ Describe how spreadsheets can be used for analytical modeling and solutions

■■ Explain the basic concepts of optimization and when to use them

■■ Describe how to structure a linear programming model

■■ Explain what is meant by sensitivity analysis, what-if analysis, and goal seeking

■■ Understand the concepts and applications of different types of simulation

■■ Understand potential applications of discrete event simulation

This chapter extends the analytics applications beyond reporting and predictive analytics. It includes coverage of selected techniques that can be employed in combination with predictive models to help support decision making. We focus on techniques that can be implemented relatively easily using either spreadsheet tools or stand-alone software tools. Of course, there is much additional detail to be learned about management science models, but the objective of this chapter is simply to illustrate what is possible and how it has been implemented in real settings.

We present this material with a note of caution: Modeling can be a difficult topic and is as much an art as it is a science. The purpose of this chapter is not necessarily for you to master the topics of modeling and analysis. Rather, the material is geared toward gaining familiarity with the important concepts as they relate to prescriptive analytics and their use in decision making. It is important to recognize that the modeling we discuss here is only cursorily related to the concepts of data modeling. You should not confuse the two. We walk through some basic concepts and definitions of decision modeling. We next introduce the idea of modeling directly in spreadsheets. We then discuss the structure and application of two successful time-proven models and methodologies: linear programming and discrete event simulation. As noted earlier, one could take multiple courses just in these two topics, but our goal is to give you a sense of what is possible. This chapter includes the following sections:

8.1 Opening Vignette: School District of Philadelphia Uses Prescriptive Analytics to Find Optimal Solution for Awarding Bus Route Contracts
8.2 Model-Based Decision Making
8.3 Structure of Mathematical Models for Decision Support
8.4 Certainty, Uncertainty, and Risk
8.5 Decision Modeling with Spreadsheets
8.6 Mathematical Programming Optimization
8.7 Multiple Goals, Sensitivity Analysis, What-If Analysis, and Goal Seeking
8.8 Decision Analysis with Decision Tables and Decision Trees
8.9 Introduction to Simulation
8.10 Visual Interactive Simulation

8.1 OPENING VIGNETTE: School District of Philadelphia Uses Prescriptive Analytics to Find Optimal Solution for Awarding Bus Route Contracts

BACKGROUND

Selecting the best vendors to work with is a laborious yet important task for companies and government organizations. After a vendor submits a proposal for a specific task through a bidding process, the company or organization evaluates the proposal and makes a decision on which vendor is best suited for their needs. Typically, governments are required to use a bidding process to select one or more vendors. The School District of Philadelphia was in search of private bus vendors to outsource some of their bus routes. The district owned a few school buses, but needed more to serve their student population. They wanted to use their own school buses for 30 to 40% of the routes, and outsource the rest of the routes to these private vendors. Charles Lowitz, the fiscal coordinator for the transportation office, was tasked with determining how to maximize the return on investment and refine the way routes were awarded to various vendors.

Historically, the process of deciding which bus vendor contracts to award given the budget and time constraints was laborious because it was done by hand. In addition, the different variables and factors that had to be taken into account added to the complexity. The vendors were evaluated based on five variables: cost, capabilities, reliance, financial stability, and business acumen. Each vendor submitted a proposal with a different price for different routes. Some vendors specified a minimum number of routes, and if that minimum wasn’t met, their cost would increase. Lowitz needed to figure out how to combine the information from each proposal to determine which bus route to award to which vendor to meet all the route requirements at the least cost for the district.

SOLUTION

Lowitz initially looked for software that he could use in conjunction with his contract model in Excel. He began using the Premium Solver Platform from Frontline Systems, Inc., which allowed him to find the most beneficial vendors for the district from a financial and operational standpoint. He created an optimization model that took into account the aforementioned variables associated with each vendor. The model included binary integer variables (yes/no) for each of the routes to be awarded to the bidders who proposed to serve a specific route at a specific cost. This amounted to about 1,600 yes/no variables. The model also included constraints indicating that each route was to be awarded to one vendor, and of course, each route had to be serviced. Other constraints specified the minimum number of routes a vendor would accept and a few other details. All such constraints can be written as equations and entered in an integer linear programming model. Such models can be formulated and solved through many software tools, but using Microsoft Excel makes it easier to understand the model. Frontline Systems’ Solver software is built into Microsoft Excel to solve smaller problems for free. A larger version can be purchased to solve larger and more complex models. That is what Lowitz used.
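The structure of such a model can be sketched on a toy instance. In the Python sketch below, the vendors, routes, bid prices, and minimum route counts are all made up for illustration, and the rule that a vendor receives either no routes or at least its stated minimum is one assumed reading of the minimum-route constraint. Each candidate plan assigns every route to exactly one bidding vendor, which plays the role of the yes/no variables; the cheapest feasible plan is found by plain enumeration rather than by a solver.

```python
from itertools import product

# Hypothetical bids: bids[vendor][route] is a price; None means no bid on that route.
bids = {
    "A": {"r1": 100, "r2": 120, "r3": 90},
    "B": {"r1": 95, "r2": 130, "r3": None},
    "C": {"r1": 110, "r2": 100, "r3": 105},
}
# Assumed rule: a vendor is awarded either zero routes or at least this many.
min_routes = {"A": 2, "B": 1, "C": 1}
routes = ["r1", "r2", "r3"]

def best_award(bids, min_routes, routes):
    """Enumerate every route-to-vendor assignment and keep the cheapest feasible one."""
    vendors = list(bids)
    best_cost, best_plan = None, None
    for choice in product(vendors, repeat=len(routes)):
        # Each route goes to exactly one vendor, and that vendor must have bid on it.
        if any(bids[v][r] is None for v, r in zip(choice, routes)):
            continue
        # Minimum-route constraint: zero routes, or at least the vendor's minimum.
        if any(0 < choice.count(v) < min_routes[v] for v in vendors):
            continue
        cost = sum(bids[v][r] for v, r in zip(choice, routes))
        if best_cost is None or cost < best_cost:
            best_cost, best_plan = cost, dict(zip(routes, choice))
    return best_cost, best_plan

cost, plan = best_award(bids, min_routes, routes)
print(cost, plan)
```

Enumeration works only because this instance is tiny. With roughly 1,600 yes/no variables, as in the vignette, the number of combinations is far too large to list, which is why an integer programming solver is used instead.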

BENEFITS

In addition to determining how many of the vendors should be awarded contracts, the model helped develop the size of each of the contracts. The size of the contracts varied from one vendor getting four routes to another receiving 97 routes. Ultimately, the School District of Philadelphia was able to create a plan with an optimized number of bus company vendors using Excel instead of a manual handwritten process. By using the Premium Solver Platform analytic tools to create an optimization model with the different variables, the district saved both time and money.

QUESTIONS FOR THE OPENING VIGNETTE

1. What decision was being made in this vignette?

2. What data (descriptive and/or predictive) might one need to make the best allocations in this scenario?

3. What other costs or constraints might you have to consider in awarding contracts for such routes?

4. Which other situations might be appropriate for applications of such models?

WHAT CAN WE LEARN FROM THIS OPENING VIGNETTE?

Most organizations face decisions in which they must select from multiple options, each with an associated cost and capability. The goal of such models is to select the combination of options that meets all the requirements while minimizing cost. Prescriptive analytics applies particularly well to such decisions, and tools such as the built-in Solver or Premium Solver for Excel make it easy to apply these techniques.

Source: Based on “Optimizing Vendor Contract Awards Gets an A+,” http://www.solver.com/news/optimizing-vendor-contract-awards-gets, 2016 (accessed Sept 2018).

8.2 MODEL-BASED DECISION MAKING

As the preceding vignette indicates, making decisions using some kind of analytical model is what we call prescriptive analytics. In the last several chapters we have learned the value and the process of knowing the history of what has been going on and using that information to predict what is likely to happen. However, we go through that exercise to determine what we should do next. This might entail deciding which customers are likely to buy from us and making an offer or giving a price point that will maximize the likelihood that they would buy and our profit would be optimized. Conversely, it might involve being able to predict which customer is likely to go somewhere else and

Chapter 8 • Prescriptive Analytics: Optimization and Simulation 463

making a promotion offer to retain them as a customer and optimize our value. We may need to make decisions on awarding contracts to our vendors to make sure all our needs are covered and the costs are minimized. We could be facing a situation of deciding which prospective customers should receive what promotional campaign material so that our cost of promotion is not outrageous, and we maximize the response rate while managing within a budget. We may be deciding how much to pay for different paid search keywords to maximize the return on investment of our advertising budget. In another setting, we may have to study the history of our customers’ arrival patterns and use that information to predict future arrival rates, and apply that to schedule an appropriate number of store employees to maximize customer responses and optimize our labor costs. We could be deciding where to locate our warehouses based on our analysis and prediction of demand for our products and the supply chain costs. We could be setting daily delivery routes on the basis of product volumes to be delivered at various locations and the delivery costs and vehicle availability. One can find hundreds of examples of situations where data-based decisions are valuable. Indeed, the biggest opportunity for the growing analytics profession is the ability to use descriptive and predictive insights to help a decision maker make better decisions. Although there are situations where one can use experience and intuition to make decisions, it is more likely that a decision supported by a model will help a decision maker make better decisions. In addition, it also provides decision makers with justification for what they are recommending. Thus prescriptive analytics has emerged as the next frontier for analytics. It essentially involves using an analytical model to help guide a decision maker in making a decision, or automating the decision process so that a model can make recommendations or decisions. Because the focus of prescriptive analytics is on making recommendations or making decisions, some call this category of analytics decision analytics.

INFORMS publications, such as Interfaces, ORMS Today, and Analytics magazine, all include stories that illustrate successful applications of decision models in real settings. This chapter includes many examples of such prescriptive analytic applications. Applying models to real-world situations can save millions of dollars or generate millions of dollars in revenue. Christiansen et al. (2009) describe the applications of such models in shipping company operations using TurboRouter, a decision support system (DSS) for ship routing and scheduling. They claim that over the course of a 3-week period, a company used this model to better utilize its fleet, generating additional profit of $1-2 million in such a short time. We provide another example of a model application in Application Case 8.1 that illustrates a sports application.

Application Case 8.1 Canadian Football League Optimizes Game Schedule

The Canadian Football League (CFL) is Canada’s equivalent of the U.S. National Football League (NFL). It faced the challenge of scheduling 81 football games for 9 teams over a period of 5 months while balancing priorities for sales revenue, television ratings, and team rest days. Other considerations included scheduling games across different time zones and holding the main rivalry games on major public holidays. For any league, a robust schedule is a driving force for a variety of business collaborations, such as coordinating with broadcasting channels and organizing ground ticket sales. If the schedule is not optimized, it directly hampers promotions, resulting in a large loss of revenue and poor channel ratings.

The CFL used to create match schedules manually and hence had to find better ways to improve its schedules, taking all the constraints into account. The league had tried to work with a consultant to build a comprehensive model for scheduling, but the implementation remained a challenge. The league decided to tackle the issue with the Solver available within Microsoft Excel. Some of the matching priorities to be balanced while optimizing the schedule were:

1. Sales Revenue: Setting a schedule that gives matches and time slots to the clubs that generate more revenue.

2. Channel Ratings: Setting a schedule with games that would improve channel ratings for the broadcasting company.

3. Team Rest Days: Setting a schedule in which the two teams playing against each other have enough rest days.

The league decided to improve the match schedules by giving player rest days the highest priority, followed by sales revenue and channel scores for the broadcasting company. This is mainly because the sales revenue and channel scores are a byproduct of team players’ performance on the field, which is directly related to the rest days taken by the teams.

Methodology/Solution

Initially, organizing schedules was a huge task to perform in Excel through the built-in Solver feature. Frontline Systems provided a premium version of Solver, which allowed the model size to grow from about 200 decisions to 8,000 decisions. The league also had to add more industry-specific constraints: games had to be telecast across different time zones, double-header games could not overlap, and arch-rival games had to be scheduled on Labor Day. Adding these limitations was never simple until the Frontline Systems consultants stepped in to help the CFL turn this nonlinear problem into a linear problem. The linear programming “engine” got the model running. Premium Solver software turned out to be of great help in getting an improved schedule.

Results/Benefits

Using the optimized schedule would lead to increased revenue through higher ticket sales and higher TV scores for the broadcasting channels. This was achieved because the tool was able to support added constraints of the vendors with great ease. The optimized schedule pleased most of the league’s stakeholders. This is a repetitive process, but those match schedules were CFL’s most advanced season match schedules to date.

Questions for Discussion

1. List three ways in which Solver-based scheduling of games could result in more revenue as compared to the manual scheduling.

2. In what other ways can CFL leverage the Solver software to expand and enhance their other business operations?

3. What other considerations could be important in scheduling such games?

What Can We Learn from This Application Case?

By using the Solver add-in for Excel, the CFL made better decisions in scheduling their games by taking stakeholders and industry constraints into consideration, leading to revenue generation and good channel ratings. Thus, an optimized schedule, a purview of prescriptive analytics, derived significant value. According to the case study, the modeler, Mr. Trevor Hardy, was an expert Excel user, but not an expert in modeling. However, the ease of use of Excel permitted him to develop a practical application of prescriptive analytics.

Compiled from “Canadian Football League Uses Frontline Solvers to Optimize Scheduling in 2016.” Solver, September 7, 2016, www.solver.com/news/canadian-football-league-uses-frontline-solvers-optimize-scheduling-2016 (accessed September 2018); Kostuk, Kent J., and Keith A. Willoughby. “A Decision Support System for Scheduling the Canadian Football League.” Interfaces, vol. 42, no. 3, 2012, pp. 286–295; Dilkina, Bistra N., and William S. Havens. The U.S. National Football League Scheduling Problem. Intelligent Systems Lab, www.cs.cornell.edu/~bistra/papers/NFLsched1.pdf (accessed September 2018).


Prescriptive Analytics Model Examples

Modeling is a key element for prescriptive analytics. In the examples mentioned earlier in the introduction and application cases, one has to employ a mathematical model to be able to recommend a decision for any realistic problem. For example, deciding which customers (among potentially millions) will receive what offer so as to maximize the overall response value while staying within a budget is not something you can do manually. Building a probability-based response maximization model with the budget as a constraint would give us the information we are seeking. Depending on the problem we are addressing, there are many classes of models, and there are often many specialized techniques for solving each one. We will learn about two different modeling methods in this chapter. Most universities have multiple courses that cover these topics under titles such as Operations Research, Management Science, Decision Support Systems, and Simulation that can help you build more expertise in these topics. Because prescriptive analytics typically involves the application of mathematical models, the term data science is sometimes associated more closely with the application of such models. Before we learn about mathematical modeling support in prescriptive analytics, let us understand some modeling issues first.
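As an illustration of what a probability-based response maximization model with a budget constraint might look like, here is a minimal sketch in Python. The customers, response probabilities, response values, and offer costs are hypothetical, and the solution method (a 0/1 knapsack dynamic program over integer costs) is one possible choice for a toy problem, not the method used in any case in this chapter.

```python
def choose_offers(customers, budget):
    """Pick customers to maximize total expected response value without
    exceeding the budget. Offer costs must be non-negative integers."""
    # best[b] = (expected value, chosen names) achievable with spend <= b
    best = [(0.0, [])] * (budget + 1)
    for name, prob, value, cost in customers:
        expected = prob * value  # expected value of sending this offer
        new_best = list(best)
        for b in range(cost, budget + 1):
            candidate = best[b - cost][0] + expected
            if candidate > new_best[b][0]:
                new_best[b] = (candidate, best[b - cost][1] + [name])
        best = new_best  # each customer is considered at most once
    return best[budget]


# Hypothetical data: (name, response probability, response value, offer cost)
customers = [
    ("C1", 0.30, 200, 4),
    ("C2", 0.10, 500, 3),
    ("C3", 0.50, 80, 2),
    ("C4", 0.05, 400, 1),
]
expected_value, chosen = choose_offers(customers, budget=6)
print(chosen, expected_value)
```

With these made-up numbers, the customer with the highest individual expected value (C1) is not selected, because the other three together fit the same budget and yield more. This kind of interaction is what makes manual selection impractical at scale.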

Identification of the Problem and Environmental Analysis

No decision is made in a vacuum. It is important to analyze the scope of the domain and the forces and dynamics of the environment. A decision maker needs to identify the organizational culture and the corporate decision-making processes (e.g., who makes decisions, degree of centralization). It is entirely possible that environmental factors have created the current problem. This can formally be called environmental scanning and analysis, which is the monitoring, scanning, and interpretation of collected information. Business intelligence/business analytics (BI/BA) tools can help identify problems by scanning for them. The problem must be understood, and everyone involved should share the same frame of understanding because the problem will ultimately be represented by the model in one form or another. Otherwise, the model will not help the decision maker.

VARIABLE IDENTIFICATION Identification of a model’s variables (e.g., decision, result, uncontrollable) is critical, as are the relationships among the variables. Influence diagrams, which are graphical models of mathematical models, can facilitate the identification process. A more general form of an influence diagram, a cognitive map, can help a decision maker develop a better understanding of a problem, especially of variables and their interactions.

FORECASTING (PREDICTIVE ANALYTICS) As we have noted previously, an important prerequisite of prescriptive analytics is knowing what has happened and what is likely to happen. This form of predictive analytics is essential for constructing and manipulating models because when a decision is implemented, the results usually occur in the future. There is no point in running a what-if (sensitivity) analysis on the past because decisions made then have no impact on the future. Online commerce and communication have created an immense need for forecasting and an abundance of available information for performing it. These activities occur quickly, yet information about such purchases is gathered and should be analyzed to produce forecasts. Part of the analysis involves simply predicting demand; however, forecasting models can use product life-cycle needs and information about the marketplace and consumers to analyze the entire situation, ideally leading to additional sales of products and services.

We describe an effective example of such forecasting and its use in decision making at Ingram Micro in Application Case 8.2.


Ingram Micro is the world’s largest two-tier distributor of technology products. In a two-tier distribution system, a company purchases products from manufacturers and sells them to retailers who in turn sell these products to the end users. For example, one can purchase a Microsoft Office 365 package from Ingram rather than purchasing it directly from Microsoft. Ingram has partnerships with Best Buy, Buffalo, Google, Honeywell, Libratone, and Sharper Image. The company delivers its products to 200,000 solution providers across the world and thus has a large volume of transaction data. Ingram wanted to use insights from this data to identify cross-selling opportunities and determine prices to offer to specific customers in conjunction with product bundles. This required setting up a business intelligence center (BIC) to compile and analyze the data. In setting up the BIC, Ingram faced various issues.

1. Ingram faced several issues in their data-capture process such as a lack of loss data, ensuring the accuracy of end-user information, and linking quotes to orders.

2. Ingram faced technical issues in implementing a customer relationship management (CRM) system capable enough to handle its operations around the world.

3. They faced resistance to the idea of demand pricing (determining price based on demand of product).

Methodology/Solution

Ingram explored communicating directly with its customers (resellers) using e-mail and offered them discounts on the purchase of supporting technologies related to the products being ordered. They identified these opportunities through segmented market-basket analysis and developed the following business intelligence applications that helped in determining optimized prices. Ingram developed a new price optimization tool known as IMPRIME, which is capable of setting data-driven prices and providing data-driven negotiation guidance. IMPRIME sets an optimized price for each level of the product hierarchy (i.e., customer level, vendor-customer level, customer-segment level, and vendor-customer segment level). It does so by taking into account the trade-off between the demand signal and pricing at that level.

The company also developed a digital marketing platform known as Intelligence INGRAM. This platform utilizes predictive lead scoring (PLS), which selects end users to target with specific marketing programs. PLS is their system to score predictive leads for companies that have no direct relation with end users. Intelligence INGRAM is used to run white space programs, which encourage a reseller to purchase related products by offering discounts. For example, if a reseller purchases a server from INGRAM, then INGRAM offers a discount on disk storage units as both products are required to work together. Similarly, Intelligence INGRAM is used to run growth incentive campaigns (offering cash rewards to resellers on exceeding quarterly spend goals) and cross-sell campaigns (e-mailing the end users about the products that are related to their recently purchased product).

Results/Benefits

Profit generated by using IMPRIME is measured using a lift measurement methodology. This methodology compares periods before and after changing the prices and compares test groups versus control groups. Lift measurement is done on average daily sales, gross margin, and machine margin. The use of IMPRIME led to a $757 million growth in revenue and a $18.8 million increase in gross profits.
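The case does not give the formula behind the lift measurement, so the following Python sketch is only one plausible reading of "before versus after" combined with "test versus control": the change in the test group minus the change in the control group. The function name `lift` and all figures are hypothetical.

```python
def lift(test_before, test_after, control_before, control_after):
    """Change in the test group minus the change in the control group.

    This difference-style calculation is an assumption made for illustration;
    it is not stated in the case as Ingram's actual methodology.
    """
    return (test_after - test_before) - (control_after - control_before)


# Hypothetical average daily sales in dollars (not figures from the case).
sales_lift = lift(
    test_before=10_000, test_after=11_500,
    control_before=9_800, control_after=10_300,
)
print(sales_lift)  # 1000
```

Subtracting the control group's change is meant to remove movement that would have happened anyway (for example, seasonal growth), so that the remainder can more reasonably be attributed to the price change.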

Questions for Discussion

1. What were the main challenges faced by Ingram Micro in developing a BIC?

2. List all the business intelligence solutions developed by Ingram to optimize the prices of their products and to profile their customers.

3. What benefits did Ingram receive after using the newly developed BI applications?

Application Case 8.2 Ingram Micro Uses Business Intelligence Applications to Make Pricing Decisions


What Can We Learn from This Application Case?

By first building a BIC, a company begins to better understand its product lines, its customers, and their purchasing patterns. This insight is derived from what we call descriptive and predictive analytics. Further value from this is derived through price optimization, a purview of prescriptive analytics.

Sources: R. Mookherjee, J. Martineau, L. Xu, M. Gullo, K. Zhou, A. Hazlewood, X. Zhang, F. Griarte, & N. Li. (2016). “End-to-End Predictive Analytics and Optimization in Ingram Micro’s Two-Tier Distribution Business.” Interfaces, 46(1), 49–73; ingrammicrocommerce.com, “CUSTOMERS,” https://www.ingrammicrocommerce.com/customers/ (accessed July 2016).

Model Categories

Table 8.1 classifies some decision models into seven groups and lists several representative techniques for each category. Each technique can be applied to either a static or a dynamic model, which can be constructed under assumed environments of certainty, uncertainty, or risk. To expedite model construction, we can use special decision analysis systems that have modeling languages and capabilities embedded in them. These include spreadsheets, data mining systems, online analytic processing (OLAP) systems, and modeling languages that help an analyst build a model. We will introduce one of these systems later in the chapter.

MODEL MANAGEMENT Models, like data, must be managed to maintain their integrity, and thus their applicability. Such management is done with the aid of model-based management systems, which are analogous to database management systems (DBMS).

KNOWLEDGE-BASED MODELING DSS uses mostly quantitative models, whereas expert systems use qualitative, knowledge-based models in their applications. Some knowledge is necessary to construct solvable (and therefore usable) models. Many of the predictive analytics techniques, such as classification and clustering, can be used in building knowledge-based models.

TABLE 8.1 Categories of Models

| Category | Process and Objective | Representative Techniques |
|---|---|---|
| Optimization of problems with few alternatives | Find the best solution from a small number of alternatives | Decision tables, decision trees, analytic hierarchy process |
| Optimization via algorithm | Find the best solution from a large number of alternatives, using a step-by-step improvement process | Linear and other mathematical programming models, network models |
| Optimization via an analytic formula | Find the best solution in one step, using a formula | Some inventory models |
| Simulation | Find a good enough solution or the best among the alternatives checked, using experimentation | Several types of simulation |
| Heuristics | Find a good enough solution, using rules | Heuristic programming, expert systems |
| Predictive models | Predict the future for a given scenario | Forecasting models, Markov analysis |
| Other models | Solve a what-if case, using a formula | Financial modeling, waiting lines |

CURRENT TRENDS IN MODELING One recent trend in modeling involves the development of model libraries and solution technique libraries. Some of these codes can be run directly on the owner’s Web server for free, and others can be downloaded and run on a local computer. The availability of these codes means that powerful optimization and simulation packages are available to decision makers who may have only experienced these tools from the perspective of classroom problems. For example, the Mathematics and Computer Science Division at Argonne National Laboratory (Argonne, Illinois) maintains the NEOS Server for Optimization at https://neos-server.org/neos/index.html. You can find links to other sites by clicking the Resources link at informs.org, the Web site of the Institute for Operations Research and the Management Sciences (INFORMS). A wealth of modeling and solution information is available from INFORMS. The Web site for one of INFORMS’ publications, OR/MS Today, at http://www.orms-today.org/ormsmain.shtml includes links to many categories of modeling software. We will learn about some of these shortly.

There is a clear trend toward developing and using cloud-based tools and software to access and even run software to perform modeling, optimization, simulation, and so on. This has, in many ways, simplified the application of many models to real-world problems. However, to use models and solution techniques effectively, it is necessary to truly gain experience through developing and solving simple ones. This aspect is often overlooked. Organizations that have key analysts who understand how to apply models indeed apply them very effectively. This is most notably occurring in the revenue management area, which has moved from the province of airlines, hotels, and automobile rentals to retail, insurance, entertainment, and many other areas. CRM also uses models, but they are often transparent to the user. With management models, the amount of data and model sizes are quite large, necessitating the use of data warehouses to supply the data and parallel computing hardware to obtain solutions in a reasonable time frame.

There is a continuing trend toward making analytics models completely transparent to the decision maker. For example, multidimensional analysis (modeling) involves data analysis in several dimensions. In multidimensional analysis (modeling), data are generally shown in a spreadsheet format, with which most decision makers are familiar. Many decision makers accustomed to slicing and dicing data cubes are now using OLAP systems that access data warehouses. Although these methods may make modeling palatable, they also eliminate many important and applicable model classes from consideration, and they eliminate some important and subtle solution interpretation aspects. Modeling involves much more than data analysis with trend lines and establishing relationships with statistical methods.

There is also a trend to build a model of a model to help in its analysis. An influence diagram is a graphical representation of a model; that is, a model of a model. Some influence diagram software packages are capable of generating and solving the resultant model.

SECTION 8.2 REVIEW QUESTIONS

1. List three lessons learned from modeling.

2. List and describe the major issues in modeling.

3. What are the major types of models used in DSS?

4. Why are models not used in industry as frequently as they should or could be?

5. What are the current trends in modeling?


8.3 STRUCTURE OF MATHEMATICAL MODELS FOR DECISION SUPPORT

In the following sections, we present the topics of analytical mathematical models (e.g., mathematical, financial, and engineering). These include the components and the structure of models.

The Components of Decision Support Mathematical Models

All quantitative models are typically made up of four basic components (see Figure 8.1): result (or outcome) variables, decision variables, uncontrollable variables (and/or parameters), and intermediate result variables. Mathematical relationships link these components together. In nonquantitative models, the relationships are symbolic or qualitative. The results of decisions are determined based on the decision made (i.e., the values of the decision variables), the factors that cannot be controlled by the decision maker (in the environment), and the relationships among the variables. The modeling process involves identifying the variables and relationships among them. Solving a model determines the values of these and the result variable(s).

RESULT (OUTCOME) VARIABLES Result (outcome) variables reflect the level of effectiveness of a system; that is, they indicate how well the system performs or attains its goal(s). These variables are outputs. Examples of result variables are shown in Table 8.2. Result variables are considered dependent variables. Intermediate result variables are sometimes used in modeling to identify intermediate outcomes. In the case of a dependent variable, another event must occur first before the event described by the variable can occur. Result variables depend on the occurrence of the decision variables and the uncontrollable variables.

DECISION VARIABLES Decision variables describe alternative courses of action. The decision maker controls the decision variables. For example, for an investment problem, the amount to invest in bonds is a decision variable. In a scheduling problem, the decision variables are people, times, and schedules. Other examples are listed in Table 8.2.

UNCONTROLLABLE VARIABLES, OR PARAMETERS In any decision-making situation, there are factors that affect the result variables but are not under the control of the decision maker. Either these factors can be fixed, in which case they are called uncontrollable variables, or parameters, or they can vary, in which case they are called variables. Examples of factors are the prime interest rate, a city’s building code, tax regulations, and utilities costs. Most of these factors are uncontrollable because they are in and determined by elements of the system environment in which the decision maker works. Some of these variables limit the decision maker and therefore form what are called constraints of the problem.

FIGURE 8.1 The General Structure of a Quantitative Model. [Figure components: uncontrollable variables, decision variables, mathematical relationships, intermediate variables, result variables.]

TABLE 8.2 Examples of Components of Models

| Area | Decision Variables | Result Variables | Uncontrollable Variables and Parameters |
|---|---|---|---|
| Financial investment | Investment alternatives and amounts | Total profit, risk; rate of return on investment (ROI); earnings per share; liquidity level | Inflation rate; prime rate; competition |
| Marketing | Advertising budget; where to advertise | Market share; customer satisfaction | Customer’s income; competitor’s actions |
| Manufacturing | What and how much to produce; inventory levels; compensation programs | Total cost; quality level; employee satisfaction | Machine capacity; technology; materials prices |
| Accounting | Use of computers; audit schedule | Data processing cost; error rate | Computer technology; tax rates; legal requirements |
| Transportation | Shipments schedule; use of smart cards | Total transport cost; payment float time | Delivery distance; regulations |
| Services | Staffing levels | Customer satisfaction | Demand for services |

INTERMEDIATE RESULT VARIABLES Intermediate result variables reflect intermediate outcomes in mathematical models. For example, in determining machine scheduling, spoilage is an intermediate result variable, and total profit is the result variable (i.e., spoilage is one determinant of total profit). Another example is employee salaries. This constitutes a decision variable for management: It determines employee satisfaction (i.e., intermediate outcome), which, in turn, determines the productivity level (i.e., final result).

The Structure of Mathematical Models

The components of a quantitative model are linked by mathematical (algebraic) expressions, that is, equations or inequalities.

A very simple financial model is

P = R - C

where P = profit, R = revenue, and C = cost. This equation describes the relationship among the variables. Another well-known financial model is the simple present-value cash flow model,

P = F / (1 + i)^n

where P = present value, F = a future single payment in dollars, i = interest rate (percentage), and n = number of years. With this model, it is possible to determine the present value of a payment of $100,000 to be made 5 years from today, at a 10% (0.1) interest rate, as follows:

P = 100,000 / (1 + 0.1)^5 = 62,092

We present more interesting and complex mathematical models in the following sections.
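The present-value calculation above can be checked with a few lines of Python. This is a minimal sketch; the function name `present_value` and its parameter names are chosen here for illustration.

```python
def present_value(future_payment, interest_rate, years):
    """Discount a single future payment back to today: P = F / (1 + i)^n."""
    return future_payment / (1 + interest_rate) ** years


# $100,000 received 5 years from today, discounted at 10% per year.
p = present_value(100_000, 0.10, 5)
print(round(p))  # 62092
```

In the terms of Section 8.3, the payment and the interest rate act as parameters of the model, and the present value is the result variable.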


SECTION 8.3 REVIEW QUESTIONS

1. What is a decision variable?

2. List and briefly discuss the major components of a quantitative model.

3. Explain the role of intermediate result variables.

8.4 CERTAINTY, UNCERTAINTY, AND RISK

The¹ decision-making process involves evaluating and comparing alternatives. During this process, it is necessary to predict the future outcome of each proposed alternative. Decision situations are often classified on the basis of what the decision maker knows (or believes) about the forecasted results. We customarily classify this knowledge into three categories (see Figure 8.2), ranging from complete knowledge to complete ignorance:

• Certainty
• Uncertainty
• Risk

When we develop models, any of these conditions can occur, and different kinds of models are appropriate for each case. Next, we discuss both the basic definitions of these terms and some important modeling issues for each condition.

Decision Making under Certainty

In decision making under certainty, it is assumed that complete knowledge is available so that the decision maker knows exactly what the outcome of each course of action will be (as in a deterministic environment). It may not be true that the outcomes are 100% known, nor is it necessary to really evaluate all the outcomes, but often this assumption simplifies the model and makes it tractable. The decision maker is viewed as a perfect predictor of the future because it is assumed that there is only one outcome for each alternative. For example, the alternative of investing in U.S. Treasury bills is one for which there is complete availability of information about the future return on investment if it is held to maturity. A situation involving decision making under certainty occurs most often with structured problems and short time horizons (up to 1 year). Certainty models are relatively easy to develop and solve, and they can yield optimal solutions. Many financial models are constructed under assumed certainty, even though the market is anything but 100% certain.

¹ Some parts of the original versions of these sections were adapted from Turban and Meredith (1994).

FIGURE 8.2 The Zones of Decision Making. [Figure: a continuum of knowledge running from complete knowledge (certainty), through risk, to total ignorance (uncertainty).]


Decision Making under Uncertainty

In decision making under uncertainty, the decision maker considers situations in which several outcomes are possible for each course of action. In contrast to the risk situation, in this case, the decision maker does not know, or cannot estimate, the probability of occurrence of the possible outcomes. Decision making under uncertainty is more difficult than decision making under certainty because there is insufficient information. Modeling of such situations involves assessment of the decision maker’s (or the organization’s) attitude toward risk.

Managers attempt to avoid uncertainty as much as possible, even to the point of assuming it away. Instead of dealing with uncertainty, they attempt to obtain more information so that the problem can be treated under certainty (because it can be “almost” certain) or under calculated (i.e., assumed) risk. If more information is not available, the problem must be treated under a condition of uncertainty, which is less definitive than the other categories.

Decision Making under Risk (Risk Analysis)

A decision made under risk² (also known as a probabilistic or stochastic decision-making situation) is one in which the decision maker must consider several possible outcomes for each alternative, each with a given probability of occurrence. The long-run probabilities that the given outcomes will occur are assumed to be known or can be estimated. Under these assumptions, the decision maker can assess the degree of risk associated with each alternative (called calculated risk). Most major business decisions are made under assumed risk. Risk analysis (i.e., calculated risk) is a decision-making method that analyzes the risk (based on assumed known probabilities) associated with different alternatives. Risk analysis can be performed by calculating the expected value of each alternative and selecting the one with the best expected value. Application Case 8.3 illustrates one application to reduce uncertainty.

² Our definitions of the terms risk and uncertainty were formulated by F. H. Knight of the University of Chicago in 1933. Other, comparable definitions also are in use.
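The expected-value rule used in risk analysis can be sketched in a few lines of Python. The three alternatives, their payoffs, and their probabilities are hypothetical numbers chosen only for illustration; none of them come from the chapter.

```python
# Hypothetical alternatives: each maps to a list of (probability, payoff) pairs.
# All numbers are illustrative assumptions, not data from the chapter.
alternatives = {
    "bonds":   [(0.5, 12.0), (0.3, 6.0), (0.2, 3.0)],
    "stocks":  [(0.5, 15.0), (0.3, 3.0), (0.2, -2.0)],
    "deposit": [(1.0, 6.5)],
}

def expected_value(outcomes):
    """Probability-weighted average payoff of one alternative."""
    total_probability = sum(p for p, _ in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * payoff for p, payoff in outcomes)

# Evaluate every alternative and pick the one with the best expected value.
expected = {name: expected_value(o) for name, o in alternatives.items()}
best = max(expected, key=expected.get)
```

The single-outcome "deposit" entry shows that certainty is simply the special case in which one outcome has probability 1.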

Application Case 8.3 American Airlines Uses Should-Cost Modeling to Assess the Uncertainty of Bids for Shipment Routes

American Airlines, Inc. (AA) is one of the world’s largest airlines. Its core business is passenger transportation, but it has other vital ancillary functions that include full-truckload (FTL) freight shipment of maintenance equipment and in-flight shipment of passenger service items that could add up to over $1 billion in inventory at any given time. AA receives numerous bids from suppliers in response to requests for quotes (RFQs) for inventories. AA’s RFQs could total over 500 in any given year. Bid quotes vary significantly as a result of the large number of bids and resultant complex bidding process. Sometimes, a single contract bid could deviate by about 200%. As a result of the complex process, it is common to either overpay or underpay suppliers for their services. To this end, AA wanted a should-cost model that would streamline and assess bid quotes from suppliers to choose bid quotes that were fair to both AA and its suppliers.

Methodology/Solution

To determine fair cost for supplier products and services, three steps were taken:

1. Primary (e.g., interviews) and secondary (e.g., Internet) sources were scouted for base-case


SECTION 8.4 REVIEW QUESTIONS

1. Define what it means to perform decision making under assumed certainty, risk, and uncertainty.

2. How can decision-making problems under assumed certainty be handled?

3. How can decision-making problems under assumed uncertainty be handled?

4. How can decision-making problems under assumed risk be handled?

8.5 DECISION MODELING WITH SPREADSHEETS

Models can be developed and implemented in a variety of programming languages and systems. We focus primarily on spreadsheets (with their add-ins), modeling languages, and transparent data analysis tools. With their strength and flexibility, spreadsheet packages were quickly recognized as easy-to-use implementation software for the development of a wide range of applications in business, engineering, mathematics, and science. Spreadsheets include extensive statistical, forecasting, and other modeling and database management capabilities, functions, and routines. As spreadsheet packages evolved, add-ins were developed for structuring and solving specific model classes. Among the add-in packages, many were developed for DSS development. These DSS-related add-ins include Solver (Frontline Systems Inc., solver.com) and What’sBest! (a version of Lindo, from Lindo Systems, Inc., lindo.com) for performing linear and nonlinear optimization; Braincel (Jurik Research Software, Inc., jurikres.com) and NeuralTools (Palisade Corp., palisade.com) for artificial neural networks; Evolver (Palisade Corp.) for genetic algorithms; and @RISK (Palisade Corp.) for performing simulation studies. Comparable add-ins are available for free or at a very low cost. (Conduct a Web search to find them; new ones are added to the marketplace on a regular basis.)

and range data that would inform cost variables that affect an FTL bid.

2. Cost variables were chosen so that they were mutually exclusive and collectively exhaustive.

3. The DPL decision analysis software was used to model the uncertainty.

Furthermore, the Extended Swanson-Megill approximation was used to model the probability distribution of the most sensitive cost variables. This was done to account for the high variability in the bids in the initial model.

Results/Benefits

A pilot test was done on an RFQ that attracted bids from six FTL carriers. Out of the six bids presented, five were within three standard deviations from the mean, whereas one was considered an outlier. Subsequently, AA used the should-cost FTL model on more than 20 RFQs to determine what a fair and accurate cost of goods and services should be.

It is expected that this model will help in reducing the risk of either overpaying or underpaying its suppliers.

Questions for Discussion

1. Besides reducing the risk of overpaying or underpaying suppliers, what are some other benefits AA would derive from its “should-cost” model?

2. Can you think of other domains besides air transportation where such a model could be used?

3. Discuss other possible methods with which AA could have solved its bid overpayment and underpayment problem.

Source: Based on Bailey, M. J., Snapp, J., Yetur, S., Stonebraker, J. S., Edwards, S. A., Davis, A., & Cox, R. (2011). Practice summaries: American Airlines uses should-cost modeling to assess the uncertainty of bids for its full-truckload shipment routes. Interfaces, 41(2), 194–196.


The spreadsheet is clearly the most popular end-user modeling tool because it incorporates many powerful financial, statistical, mathematical, and other functions. Spreadsheets can perform model solution tasks such as linear programming and regression analysis. The spreadsheet has evolved into an important tool for analysis, planning, and modeling (see Farasyn, Perkoz, & Van de Velde, 2008; Hurley & Balez, 2008; Ovchinnikov & Milner, 2008). Application Cases 8.4 and 8.5 describe interesting applications of spreadsheet-based models in a nonprofit setting.

Application Case 8.4 Pennsylvania Adoption Exchange Uses Spreadsheet Model to Better Match Children with Families

The Pennsylvania Adoption Exchange (PAE) was established in 1979 by the State of Pennsylvania to help county and nonprofit agencies find prospective families for orphan children who had not been adopted due to age or special needs. The PAE keeps detailed records about children and preferences of families who may adopt them. The exchange looks for families for the children across all 67 counties of Pennsylvania.

The Pennsylvania Statewide Adoption and Permanency Network is responsible for finding permanent homes for orphans. If after a few attempts the network fails to place a child with a family, they then get help from the PAE. The PAE uses an automated assessment tool to match children to families. This tool gives matching recommendations by calculating a score between 0 and 100% for a child on 78 pairs of the child’s attribute values and family preferences. For some years now, the PAE has struggled to give adoption match recommendations to caseworkers for children. They are finding it difficult to manage a vast database of children collected over time for all 67 counties. The basic search algorithm produced match recommendations that were proving unfruitful for caseworkers. As a result, the number of children who have not been adopted has increased significantly, and there is a growing urgency to find families for these orphans.

Methodology/Solution

The PAE started collecting information about the orphans and families through online surveys that include a new set of questions. These questions collect information about hobbies of the child, child–caseworker preferences for families, and preference of the age range of children by families. The PAE and consultants created a spreadsheet matching tool that included additional features compared to the previously used automated tool. In this model, caseworkers can specify the weight of the attributes for selecting a family for a child. For example, if a family had a narrow set of preferences regarding gender, age, and race, then those factors can receive a higher weight. Also, caseworkers can give preference about the family’s county of residence, as community relationship is an important factor for a child. Using this tool, the matching committee can compare a child and family on each attribute, thus making a more accurate match decision between a family and a child.

Results/Benefits

Since the PAE started using the new spreadsheet model for matching a family with a child, they have been able to make better matching decisions. As a result, the percentage of children getting a permanent home has increased.

This short case is one of the many examples of using spreadsheets as a decision support tool. By creating a simple scoring system for a family’s desire and a child’s attribute, a better matching system is produced so that fewer rejections are reported on either side.

Questions for Discussion

1. What were the challenges faced by PAE while making adoption matching decisions?

2. What features of the new spreadsheet tool helped PAE solve their issues of matching a family with a child?

Source: Based on Slaugh, V. W., Akan, M., Kesten, O., & Unver, M. U. (2016). The Pennsylvania Adoption Exchange improves its matching process. Interfaces, 46(2), 133–154.
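The idea of a weighted attribute-by-attribute score, as described in Application Case 8.4, can be illustrated with a minimal sketch. This is not the PAE's actual tool: the attribute names, the records, the weights, and the scoring rule (the weighted share of attributes on which child and family agree) are all assumptions made for illustration.

```python
def match_score(child, family_prefs, weights):
    """Weighted share (in percent) of attributes where the child fits the family's preferences."""
    total_weight = sum(weights.values())
    matched = sum(
        w for attr, w in weights.items()
        if child.get(attr) in family_prefs.get(attr, ())
    )
    return 100.0 * matched / total_weight

# Hypothetical records and caseworker-chosen weights (illustrative only).
child = {"age_group": "10-12", "gender": "F", "county": "Allegheny"}
family = {"age_group": {"7-9", "10-12"}, "gender": {"F"}, "county": {"Butler"}}
weights = {"age_group": 3, "gender": 2, "county": 1}

score = match_score(child, family, weights)  # age and gender match; county does not
```

Raising the weight on an attribute, as a caseworker might for a family with narrow preferences, makes a mismatch on that attribute cost more of the total score.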


Application Case 8.5 Metro Meals on Wheels Treasure Valley Uses Excel to Find Optimal Delivery Routes

Meals on Wheels Association of America (now Meals on Wheels America) is a not-for-profit organization that delivers approximately one million meals to homes of older people in need across the United States. Metro Meals on Wheels Treasure Valley is a local branch of Meals on Wheels America operating in Idaho. This branch has a team of volunteer drivers that drive their personal vehicles each day to deliver meals to 800 clients along 21 routes and cover an area of 2,745 square kilometers.

The Meals on Wheels Treasure Valley organization was facing many issues. First, they were looking to minimize the delivery time as the cooked food was temperature sensitive and could perish easily. They wanted to deliver the cooked food within 90 minutes after a driver left for the delivery. Second, the scheduling process was very time consuming. Two employees spent much of their time developing scheduled routes for delivery. A route coordinator determined the stops according to the number of meal recipients for a given day. After determining the stops, the coordinator made a sequence of stops that minimized the travel time of volunteers. This routing schedule was then entered into an online tool to determine turn-by-turn driving instructions for drivers. The whole process of manually deciding routes was taking a lot of extra time. Metro Meals on Wheels wanted a routing tool that could improve their delivery system and generate routing solutions for both one-way and round-trip directions for delivering meals. Those who drive regularly could deliver the warmers or coolers the next day. Others who drive only occasionally would need to come back to the kitchen to drop off the warmers/coolers.

Methodology/Solution

To solve the routing problem, a spreadsheet-based tool was developed. This tool had an interface to easily input information about the recipient such as his/her name, meal requirements, and delivery address. This information needed to be filled in the spreadsheet for each stop in the route. Next, Excel’s Visual Basic for Applications functionality was used to access a developer’s networking map application programming interface (API) called MapQuest. This API was used to create a travel matrix that calculated time and distance needed for delivery of the meal. This tool gave time and distance information for 5,000 location pairs a day without any cost.

When the program starts, the MapQuest API first validates the entered addresses of meal recipients. Then the program uses the API to retrieve driving distance, estimated driving time, and turn-by-turn instructions for driving between all stops in the route. The tool can then find the optimal route for up to 30 stops within a feasible time limit.

Results/Benefits

As a result of using this tool, the total annual driving distance decreased by 10,000 miles, while travel time was reduced by 530 hours. Metro Meals on Wheels Treasure Valley saved $5,800 in 2015, based on an estimated savings rate of $0.58 per mile (for a midsize sedan). This tool also reduced the time spent on route planning for meal deliveries. Other benefits included increased volunteer satisfaction and more retention of volunteers.

Questions for Discussion

1. What were the challenges faced by Metro Meals on Wheels Treasure Valley related to meal delivery before adoption of the spreadsheet-based tool?

2. Explain the design of the spreadsheet-based model.

3. What are the intangible benefits of using the Excel-based model to Metro Meals on Wheels?

Source: Based on Manikas, A. S., Kroes, J. R., & Gattiker, T. F. (2016). Metro Meals on Wheels Treasure Valley employs a low-cost routing tool to improve deliveries. Interfaces, 46(2), 154–167.
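To make the role of the travel matrix in Application Case 8.5 concrete, the sketch below finds the fastest stop sequence for a tiny route by trying every ordering, for both round-trip and one-way deliveries. The travel times are hypothetical, and exhaustive enumeration is an assumption made for illustration; the case does not say which search method the actual tool uses, and enumeration would not scale to 30 stops.

```python
from itertools import permutations

# Hypothetical symmetric travel times in minutes; index 0 is the kitchen.
travel = [
    [0, 12, 20, 15, 18],
    [12, 0, 9, 14, 22],
    [20, 9, 0, 7, 16],
    [15, 14, 7, 0, 10],
    [18, 22, 16, 10, 0],
]

def route_time(order, round_trip):
    """Total minutes for visiting stops in `order`, starting at the kitchen."""
    path = (0, *order) + ((0,) if round_trip else ())
    return sum(travel[a][b] for a, b in zip(path, path[1:]))

def best_route(round_trip):
    """Try every ordering of the stops and keep the fastest one."""
    stops = range(1, len(travel))
    return min(permutations(stops), key=lambda order: route_time(order, round_trip))

round_trip_route = best_route(round_trip=True)   # driver returns warmers to the kitchen
one_way_route = best_route(round_trip=False)     # regular driver keeps them overnight
```

With four stops there are only 24 orderings to check; the number grows factorially, which is why larger routes need a smarter search.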

Other important spreadsheet features include what-if analysis, goal seeking, data management, and programmability (i.e., macros). With a spreadsheet, it is easy to change a cell’s value and immediately see the result. Goal seeking is performed by indicating a target cell, its desired value, and a changing cell. Extensive database management can be performed with small data sets, or parts of a database can be imported for analysis (which is essentially how OLAP works with multidimensional data cubes; in fact, most OLAP systems have the look and feel of advanced spreadsheet software after the data are loaded). Templates, macros, and other tools enhance the productivity of building DSS.

Most spreadsheet packages provide fairly seamless integration because they read and write common file structures and easily interface with databases and other tools. Microsoft Excel is the most popular spreadsheet package. In Figure 8.3, we show a simple loan calculation model in which the boxes on the spreadsheet describe the contents of the cells, which contain formulas. A change in the interest rate in cell E7 is immediately reflected in the monthly payment in cell E13. The results can be observed and analyzed immediately. If we require a specific monthly payment, we can use goal seeking to determine an appropriate interest rate or loan amount.
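The same static model and goal-seeking step can be sketched outside a spreadsheet. The code below assumes the standard fixed-rate annuity payment formula and uses bisection as a stand-in for goal seeking; the loan amount, rate, term, and target payment are hypothetical and are not the values shown in Figure 8.3.

```python
def monthly_payment(principal, annual_rate, years):
    """Level monthly payment for a fixed-rate loan (standard annuity formula)."""
    r = annual_rate / 12          # monthly interest rate
    n = years * 12                # number of monthly payments
    if r == 0:
        return principal / n
    return principal * r / (1 - (1 + r) ** -n)

def goal_seek_rate(principal, years, target_payment, lo=0.0, hi=1.0, tol=1e-10):
    """Find the annual rate that yields target_payment, by bisection.

    Works because the payment increases as the rate increases.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if monthly_payment(principal, mid, years) < target_payment:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical inputs (not taken from Figure 8.3).
payment = monthly_payment(150_000, 0.08, 30)
rate_for_1000 = goal_seek_rate(150_000, 30, target_payment=1_000)
```

Changing an input and calling `monthly_payment` again plays the role of editing the interest-rate cell; `goal_seek_rate` plays the role of naming a target cell, its desired value, and a changing cell.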

Static or dynamic models can be built in a spreadsheet. For example, the monthly loan calculation spreadsheet shown in Figure 8.3 is static. Although the problem affects the borrower over time, the model indicates a single month’s performance, which is replicated. A dynamic model, in contrast, represents behavior over time. The loan calculations in the spreadsheet shown in Figure 8.4 indicate the effect of prepayment on the principal over time. Risk analysis can be incorporated into spreadsheets by using built-in random-number generators to develop simulation models (see the next chapter).
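A dynamic version of the loan model tracks the balance month by month, in the spirit of Figure 8.4. The sketch below assumes a fixed extra payment applied to principal every month; the loan terms and the prepayment amount are hypothetical and do not reproduce the figure.

```python
def monthly_payment(principal, annual_rate, years):
    """Level monthly payment for a fixed-rate loan (standard annuity formula)."""
    r = annual_rate / 12
    n = years * 12
    return principal * r / (1 - (1 + r) ** -n)

def balances_over_time(principal, annual_rate, years, prepayment=0.0):
    """Return the remaining balance after each month until the loan is paid off."""
    r = annual_rate / 12
    payment = monthly_payment(principal, annual_rate, years)
    balance = principal
    history = []
    while balance > 0.005:                      # stop once less than half a cent remains
        interest = balance * r                  # interest accrued this month
        principal_paid = min(balance, payment - interest + prepayment)
        balance -= principal_paid
        history.append(balance)
    return history

# Hypothetical loan: $150,000 at 8% for 30 years, with and without $100 extra per month.
base = balances_over_time(150_000, 0.08, 30)
prepaid = balances_over_time(150_000, 0.08, 30, prepayment=100.0)
months_saved = len(base) - len(prepaid)
```

Each element of `history` corresponds to one row of a dynamic spreadsheet, so the effect of prepayment shows up as a shorter list.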

Spreadsheet applications for models are reported regularly. We will learn how to use a spreadsheet-based optimization model in the next section.

SECTION 8.5 REVIEW QUESTIONS

1. What is a spreadsheet?

2. What is a spreadsheet add-in? How can add-ins help in DSS creation and use?

3. Explain why a spreadsheet is so conducive to the development of DSS.

FIGURE 8.3 Excel Spreadsheet Static Model Example of a Simple Loan Calculation of Monthly Payments.


8.6 MATHEMATICAL PROGRAMMING OPTIMIZATION

Mathematical programming is a family of tools designed to help solve managerial problems in which the decision maker must allocate scarce resources among competing activities to optimize a measurable goal. For example, the distribution of machine time (the resource) among various products (the activities) is a typical allocation problem. Linear programming (LP) is the best-known technique in a family of optimization tools called mathematical programming; in LP, all relationships among the variables are linear. It is used extensively in DSS (see Application Case 8.6). LP models have many important applications in practice. These include supply chain management, product mix decisions, routing, and so on. Special forms of the models can be used for specific applications. For example, Application Case 8.6 describes a spreadsheet model that was used to create a schedule for physicians.

LP allocation problems usually display the following characteristics:

• A limited quantity of economic resources is available for allocation.
• The resources are used in the production of products or services.
• There are two or more ways in which the resources can be used. Each is called a solution or a program.
• Each activity (product or service) in which the resources are used yields a return in terms of the stated goal.
• The allocation is usually restricted by several limitations and requirements, called constraints.

FIGURE 8.4 Excel Spreadsheet Dynamic Model Example of a Simple Loan Calculation of Monthly Payments and the Effects of Prepayment.


Application Case 8.6 Mixed-Integer Programming Model Helps the University of Tennessee Medical Center with Scheduling Physicians

Regional Neonatal Associates is a nine-physician group working for the Neonatal Intensive Care Unit (NICU) at the University of Tennessee Medical Center in Knoxville, Tennessee. The group also serves two local hospitals in the Knoxville area for emergency purposes. For many years, one member of the group would schedule physicians manually. However, as his retirement approached, there was a need for a more automatic system to schedule physicians. The physicians wanted this system to balance their workload, as the previous schedules did not properly balance workload among them. In addition, the schedule needed to ensure there would be 24-7 NICU coverage by the physicians, and if possible, accommodate individual preferences of physicians for shift types. To address this problem, the physicians contacted the faculty of Management Science at the University of Tennessee.

The problem of scheduling physicians to shifts was characterized by constraints based on workload and lifestyle choices. The first step in solving the scheduling issue was to group shifts according to their types (day and night). The next step was determining constraints for the problem. The model needed to cover a nine-week period with nine physicians, with two physicians working weekdays and one physician overnight and on weekends. In addition, one physician had to be assigned exclusively for 24-7 coverage to the two local hospitals. Other obvious constraints also needed to be considered. For example, a day shift could not be assigned to a physician just after a night shift.

Methodology/Solution

The problem was formulated by creating a binary, mixed-integer optimization model. The first model divided workload equally among the nine physicians. But it could not assign an equal number of day and night shifts among them. This created a question of fair distribution. In addition, the physicians had differing opinions of the assigned workload. Six physicians wanted a schedule in which an equal number of day and night shifts would be assigned to each physician in the nine-week schedule, while the others wanted a schedule based on individual preference of shifts. To satisfy requirements of both groups of physicians, a new model was formed and named the Hybrid Preference Scheduling Model (HPSM). For satisfying the equality requirement of six physicians, the model first calculated one week’s workload and divided it for nine weeks for them. This way, the work was divided equally for all six physicians. The workload for the three remaining physicians was distributed in the nine-week schedule according to their preference. The resulting schedule was reviewed by the physicians and they found the schedule more acceptable.

Results/Benefits

The HPSM method accommodated both the equality and individual preference requirements of the physicians. In addition, the schedules from this model provided better rest times for the physicians compared to the previous manual schedules, and vacation requests could also be accommodated in the schedules. The HPSM model can solve similar scheduling problems demanding relative preferences among shift types.

Techniques such as mixed-integer programming models can build optimal schedules and help in operations. These techniques have been used in large organizations for a long time. Now it is possible to implement such prescriptive analytic models in spreadsheets and other easily available software.

Questions for Discussion

1. What was the issue faced by the Regional Neonatal Associates group?

2. How did the HPSM model solve all of the physicians’ requirements?

Source: Adapted from Bowers, M. R., Noon, C. E., Wu, W., & Bass, J. K. (2016). Neonatal physician scheduling at the University of Tennessee Medical Center. Interfaces, 46(2), 168–182.


The LP allocation model is based on the following rational economic assumptions:

• Returns from different allocations can be compared; that is, they can be measured by a common unit (e.g., dollars, utility).

• The return from any allocation is independent of other allocations.
• The total return is the sum of the returns yielded by the different activities.
• All data are known with certainty.
• The resources are to be used in the most economical manner.

Allocation problems typically have a large number of possible solutions. Depending on the underlying assumptions, the number of solutions can be either infinite or finite. Usually, different solutions yield different rewards. Of the available solutions, at least one is the best, in the sense that the degree of goal attainment associated with it is the highest (i.e., the total reward is maximized). This is called an optimal solution, and it can be found by using a special algorithm.

Linear Programming Model

Every LP model is composed of decision variables (whose values are unknown and are searched for), an objective function (a linear mathematical function that relates the decision variables to the goal, measures goal attainment, and is to be optimized), objective function coefficients (unit profit or cost coefficients indicating the contribution to the objective of one unit of a decision variable), constraints (expressed in the form of linear inequalities or equalities that limit resources and/or requirements; these relate the variables through linear relationships), capacities (which describe the upper and sometimes lower limits on the constraints and variables), and input/output (technology) coefficients (which indicate resource utilization for a decision variable).
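In symbols, these components fit together as shown below for a maximization problem with resource-type constraints. The notation is a common textbook convention assumed here for illustration rather than notation taken from the chapter: X_j are the decision variables, c_j the objective function coefficients, a_ij the input/output (technology) coefficients, b_i the capacities, and l_j the lower limits on the variables (often zero).

```latex
\begin{aligned}
\text{Maximize}   \quad & Z = \sum_{j=1}^{n} c_j X_j \\
\text{subject to} \quad & \sum_{j=1}^{n} a_{ij} X_j \le b_i, \qquad i = 1, \dots, m \\
                        & X_j \ge l_j, \qquad j = 1, \dots, n
\end{aligned}
```

Requirement-type constraints use the opposite inequality, and a minimization problem replaces "Maximize" with "Minimize"; the structure is otherwise unchanged.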

TECHNOLOGY INSIGHTS 8.1 Linear Programming

LP is perhaps the best-known optimization model. It deals with the optimal allocation of resources among competing activities. The allocation problem is represented by the model described here.

The problem is to find the values of the decision variables X1, X2, and so on, such that the value of the result variable Z is maximized, subject to a set of linear constraints that express the technology, market conditions, and other uncontrollable variables. The mathematical relationships are all linear equations and inequalities. Theoretically, any allocation problem of this type has an infinite number of possible solutions. Using special mathematical procedures, the LP approach applies a unique computerized search procedure that finds the best solution(s) in a matter of seconds. Furthermore, the solution approach provides automatic sensitivity analysis.

Let us look at an example. MBI Corporation, which manufactures special-purpose computers, needs to make a decision: How many computers should it produce next month at the Boston plant? MBI is considering two types of computers: the CC-7, which requires 300 days of labor and $10,000 in materials, and the CC-8, which requires 500 days of labor and $15,000 in materials. The profit contribution of each CC-7 is $8,000, whereas that of each CC-8 is $12,000. The plant has a capacity of 200,000 working days per month, and the material budget is $8 million per month. Marketing requires that at least 100 units of the CC-7 and at least 200 units of the CC-8 be produced each month. The problem is to maximize the company’s profits by determining how many units of the CC-7 and how many units of the CC-8 should be produced each month. Note that in a real-world environment, it could possibly take months to obtain the data in the problem statement, and while gathering the data the decision maker would no doubt uncover facts about how to structure the model to be solved. Web-based tools for gathering data can help.

Modeling in LP: An Example

A standard LP model can be developed for the MBI Corporation problem just described. As discussed in Technology Insights 8.1, the LP model has three components: decision variables, result variables, and uncontrollable variables (constraints). The decision variables are as follows:

X1 = units of CC-7 to be produced
X2 = units of CC-8 to be produced

The result variable is as follows:

Total profit = Z

The objective is to maximize total profit:

Z = 8,000X1 + 12,000X2

The uncontrollable variables (constraints) are as follows:

Labor constraint: 300X1 + 500X2 ≤ 200,000 (in days)
Budget constraint: 10,000X1 + 15,000X2 ≤ 8,000,000 (in dollars)
Marketing requirement for CC-7: X1 ≥ 100 (in units)
Marketing requirement for CC-8: X2 ≥ 200 (in units)

This information is summarized in Figure 8.5.

The model also has a fourth, hidden component. Every LP model has some internal intermediate variables that are not explicitly stated. The labor and budget constraints may each have some slack in them when the left-hand side is strictly less than the right-hand side. This slack is represented internally by slack variables that indicate excess resources available. The marketing requirement constraints may each have some surplus in them when the left-hand side is strictly greater than the right-hand side. This surplus is represented internally by surplus variables indicating that there is some room to adjust the right-hand sides of these constraints. These slack and surplus variables are intermediate. They can be of great value to a decision maker because LP solution methods use them in establishing sensitivity parameters for economic what-if analyses.
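Slack and surplus are easy to compute for any candidate production plan. The sketch below uses the MBI coefficients from the model above; the trial plan of 200 CC-7 units and 250 CC-8 units is a hypothetical choice made only to show the arithmetic, not a solution from the text.

```python
# Trial production plan (hypothetical values chosen for illustration only).
x1, x2 = 200, 250            # units of CC-7 and CC-8

labor_used = 300 * x1 + 500 * x2            # days of labor consumed
budget_used = 10_000 * x1 + 15_000 * x2     # dollars of materials consumed

# Slack: unused capacity in the "less than or equal to" constraints.
slack = {
    "labor": 200_000 - labor_used,
    "budget": 8_000_000 - budget_used,
}

# Surplus: amount by which the "greater than or equal to" requirements are exceeded.
surplus = {
    "CC-7": x1 - 100,
    "CC-8": x2 - 200,
}

# A plan is feasible only if no slack or surplus value is negative.
feasible = all(v >= 0 for v in slack.values()) and all(v >= 0 for v in surplus.values())
profit = 8_000 * x1 + 12_000 * x2
```

A constraint with zero slack or zero surplus is binding at that plan; the others still have room to spare.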

FIGURE 8.5 Mathematical Model of a Product-Mix Example. [Figure: the decision variables X1 and X2 (units of CC-7 and CC-8) feed the mathematical relationships (maximize Z subject to the labor, budget, and marketing constraints), which produce the result variable Z (total profit).]


The product-mix model has an infinite number of possible solutions. Assuming that a production plan is not restricted to whole numbers (a reasonable assumption in a monthly production plan), we want a solution that maximizes total profit: an optimal solution. Fortunately, Excel comes with the add-in Solver, which can readily obtain an optimal (best) solution to this problem. Although the location of the Solver add-in has moved from one version of Excel to another, it is still available as a free add-in. Look for it under the Data tab, in the Analysis group of the ribbon. If it is not there, you should be able to enable it by going to Excel’s Options menu and selecting Add-ins.

We enter these data directly into an Excel spreadsheet, activate Solver, and identify the goal (by setting Target Cell equal to Max), decision variables (by setting By Changing Cells), and constraints (by ensuring that Total Consumed elements is less than or equal to Limit for the first two rows and is greater than or equal to Limit for the third and fourth rows). Cells C7 and D7 constitute the decision variable cells. Results in these cells will be filled after running the Solver Add-in. Target Cell is Cell E7, which is also the result variable, representing a product of decision variable cells and their per unit profit coefficients (in Cells C8 and D8). Note that all the numbers have been divided by 1,000 to make it easier to type (except the decision variables). Rows 9–12 describe the constraints of the problem: the constraints on labor capacity, budget, and the desired minimum production of the two products X1 and X2. Columns C and D define the coefficients of these constraints. Column E includes the formulae that multiply the decision variables (Cells C7 and D7) with their respective coefficients in each row. Column F defines the right-hand side value of these constraints. Excel’s matrix multiplication capabilities (e.g., SUMPRODUCT function) can be used to develop such row and column multiplications easily.

After the model’s calculations have been set up in Excel, it is time to invoke the Solver Add-in. Clicking on the Solver Add-in (again under the Analysis group under Data Tab) opens a dialog box (window) that lets you specify the cells or ranges that define the objective function cell, decision/changing variables (cells), and the constraints. Also, in Options, we select the solution method (usually Simplex LP), and then we solve the problem. Next, we select all three reports—Answer, Sensitivity, and Limits—to obtain an optimal solution of X1 = 333.33, X2 = 200, Profit = $5,066,667, as shown in Figure 8.6. Solver produces three useful reports about the solution. Try it. Solver now also includes the ability to solve nonlinear programming problems and integer programming problems by using other solution methods available within it.
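The reported optimum can be cross-checked without Excel. Because the MBI model has only two decision variables, the sketch below enumerates the corner points of the feasible region (every intersection of two constraint boundaries that satisfies all constraints) and keeps the most profitable one. This corner-point enumeration is a swapped-in technique chosen for transparency; it is not the Simplex LP method that Solver applies, and it is practical only for very small models.

```python
from itertools import combinations

# Each constraint is written as a*X1 + b*X2 <= c; ">=" rows are negated to fit this form.
constraints = [
    (300, 500, 200_000),           # labor (days)
    (10_000, 15_000, 8_000_000),   # budget (dollars)
    (-1, 0, -100),                 # X1 >= 100
    (0, -1, -200),                 # X2 >= 200
]
unit_profit = (8_000, 12_000)

def intersect(c1, c2):
    """Point where two constraint boundaries meet (Cramer's rule), or None if parallel."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if det == 0:
        return None
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def is_feasible(point, eps=1e-6):
    """True if the point satisfies every constraint (within a small tolerance)."""
    return all(a * point[0] + b * point[1] <= c + eps for a, b, c in constraints)

def total_profit(point):
    return unit_profit[0] * point[0] + unit_profit[1] * point[1]

# Collect the feasible corner points and pick the most profitable one.
corners = []
for c1, c2 in combinations(constraints, 2):
    point = intersect(c1, c2)
    if point is not None and is_feasible(point):
        corners.append(point)

best = max(corners, key=total_profit)
best_profit = total_profit(best)
```

The best corner agrees with the Solver result quoted above: labor is the binding resource, the CC-8 is held at its marketing minimum, and the budget constraint has slack.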

The following example was created by Professor Rick Wilson of Oklahoma State University to further illustrate the power of spreadsheet modeling for decision support.

The table in Figure 8.7 describes some hypothetical data and attributes of nine “swing states” for the 2016 election. Attributes of the nine states include their number of electoral votes, two regional descriptors (note that three states are classified as neither North nor South), and an estimated “influence function,” which relates to increased candidate support per unit of campaign financial investment in that state.

For instance, influence function F1 shows that every financial unit invested in that state produces a total 10-unit increase in voter support (let units stay general here), made up of an increase in young men support by 3 units, old men support by 1 unit, and young and old women support by 3 units each.

The campaign has 1,050 financial units to invest in the nine states. It must invest in each state at least 5% of the overall total invested, but no more than 25% of the overall total invested can be in any one state. Not all 1,050 units have to be invested (your model must correctly deal with this).

The campaign has some other restrictions as well. From a financial investment standpoint, the West states (in total) must have campaign investments at levels that are at least 60% of the total invested in East states. In terms of people influenced, the decision to allocate financial investments to states must lead to at least 9,200 total people influenced.

482 Part III • Prescriptive Analytics and Big Data

FIGURE 8.7 Data for Election Resource Allocation Example.

FIGURE 8.6 Excel Solver Solution to the Product-Mix Example.

Chapter 8 • Prescriptive Analytics: Optimization and Simulation 483

Overall, the total number of females influenced must be greater than or equal to the total number of males influenced. Also, at least 46% of all people influenced must be “old.”

Our task is to create an appropriate integer programming model that determines the optimal integer (i.e., whole number) allocation of financial units to states that maximizes the sum of the products of the electoral votes times units invested subject to the other aforementioned restrictions. (Thus, indirectly, this model is giving preference to states with higher numbers of electoral votes.) Note that for ease of implementation by the campaign staff, all decisions for allocation in the model should lead to integer values.

The three aspects of the model can be categorized based on the following questions that they answer:

1. What do we control? The amount invested in advertisements across the nine states, Nevada, Colorado, Iowa, Wisconsin, Ohio, Virginia, North Carolina, Florida, and New Hampshire, which are represented by the nine decision variables, NV, CO, IA, WI, OH, VA, NC, FL, and NH.

2. What do we want to achieve? We want to maximize the total electoral vote gains. We know the value of each electoral vote in each state (EV), so this amounts to EV*Investments aggregated over the nine states, that is,

Max (6NV + 9CO + 6IA + 10WI + 18OH + 13VA + 15NC + 29FL + 4NH)

3. What constrains us? Following are the constraints as given in the problem description:

a. No more than 1,050 financial units to invest, that is, NV + CO + IA + WI + OH + VA + NC + FL + NH <= 1,050.

b. Invest at least 5% of the total in each state, that is,

NV >= 0.05 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
CO >= 0.05 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
IA >= 0.05 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
WI >= 0.05 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
OH >= 0.05 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
VA >= 0.05 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
NC >= 0.05 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
FL >= 0.05 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
NH >= 0.05 (NV + CO + IA + WI + OH + VA + NC + FL + NH)

We can implement these nine constraints in a variety of ways using Excel.

c. Invest no more than 25% of the total in each state.

As with (b), we need nine individual constraints again because we do not know how much of the 1,050 we will invest. We must write the constraints in “general” terms.

NV <= 0.25 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
CO <= 0.25 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
IA <= 0.25 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
WI <= 0.25 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
OH <= 0.25 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
VA <= 0.25 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
NC <= 0.25 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
FL <= 0.25 (NV + CO + IA + WI + OH + VA + NC + FL + NH)
NH <= 0.25 (NV + CO + IA + WI + OH + VA + NC + FL + NH)


d. Western states must have investment levels that are at least 60% of the Eastern states.

West States = NV + CO + IA + WI
East States = OH + VA + NC + FL + NH

So, (NV + CO + IA + WI) >= 0.60 (OH + VA + NC + FL + NH). Again, we can implement this constraint in a variety of ways using Excel.

e. Influence at least 9,200 total people, that is,

(10NV + 7.5CO + 8IA + 10WI + 7.5OH + 7.5VA + 10NC + 8FL + 8NH) >= 9,200

f. Influence at least as many females as males. This requires translating the influence functions into women and men influenced per unit invested.

F1 = 6 women influenced, F2 = 3.5 women influenced, F3 = 3 women influenced
F1 = 4 men influenced, F2 = 4 men influenced, F3 = 5 men influenced

So, implementing females >= males, we get:

(6NV + 3.5CO + 3IA + 6WI + 3.5OH + 3.5VA + 6NC + 3FL + 3NH) >= (4NV + 4CO + 5IA + 4WI + 4OH + 4VA + 4NC + 5FL + 5NH)

As before, we can implement this in Excel in a couple of different ways.

g. At least 46% of all people influenced must be old.

All people influenced were on the left-hand side of constraint (e). So, old people influenced would be:

(4NV + 3.5CO + 4.5IA + 4WI + 3.5OH + 3.5VA + 4NC + 4.5FL + 4.5NH)

This would be set >= 0.46 times the left-hand side of constraint (e), (10NV + 7.5CO + 8IA + 10WI + 7.5OH + 7.5VA + 10NC + 8FL + 8NH), which would give a right-hand side of (4.6NV + 3.45CO + 3.68IA + 4.6WI + 3.45OH + 3.45VA + 4.6NC + 3.68FL + 3.68NH).

This is the last constraint other than to force all variables to be integers.

All told in algebraic terms, this integer programming model would have 9 decision variables and 24 constraints (one constraint for integer requirements).
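Before building the spreadsheet, it can help to see the whole formulation as a feasibility checker. The sketch below does not solve the integer program; it only evaluates the objective and constraints (a) through (g) for a given allocation. The per-unit influence numbers are read off the coefficients of constraints (e), (f), and (g); the names `evaluate` and `candidate` are illustrative, and the candidate allocation is a hand-built feasible point that is not claimed to be optimal.

```python
# Electoral votes per state (the objective coefficients).
EV = {"NV": 6, "CO": 9, "IA": 6, "WI": 10, "OH": 18,
      "VA": 13, "NC": 15, "FL": 29, "NH": 4}

# Per-unit influence (women, men, old), read off constraints (e), (f), and (g).
INFLUENCE = {"F1": (6.0, 4.0, 4.0), "F2": (3.5, 4.0, 3.5), "F3": (3.0, 5.0, 4.5)}
FUNCTION = {"NV": "F1", "WI": "F1", "NC": "F1",
            "CO": "F2", "OH": "F2", "VA": "F2",
            "IA": "F3", "FL": "F3", "NH": "F3"}
WEST = ("NV", "CO", "IA", "WI")
EAST = ("OH", "VA", "NC", "FL", "NH")


def evaluate(alloc):
    """Return (objective, {constraint name: satisfied?}) for an allocation."""
    total = sum(alloc.values())
    women = sum(INFLUENCE[FUNCTION[s]][0] * u for s, u in alloc.items())
    men = sum(INFLUENCE[FUNCTION[s]][1] * u for s, u in alloc.items())
    old = sum(INFLUENCE[FUNCTION[s]][2] * u for s, u in alloc.items())
    people = women + men  # left-hand side of constraint (e)
    west = sum(alloc[s] for s in WEST)
    east = sum(alloc[s] for s in EAST)
    checks = {
        "a_budget": total <= 1050,
        "b_min_5pct": all(u >= 0.05 * total for u in alloc.values()),
        "c_max_25pct": all(u <= 0.25 * total for u in alloc.values()),
        "d_west_vs_east": west >= 0.60 * east,
        "e_people": people >= 9200,
        "f_women_vs_men": women >= men,
        "g_old_share": old >= 0.46 * people,
        "integer": all(isinstance(u, int) for u in alloc.values()),
    }
    objective = sum(EV[s] * u for s, u in alloc.items())
    return objective, checks


# Hand-built feasible allocation (illustrative only, not the optimum).
candidate = {"NV": 53, "CO": 53, "IA": 96, "WI": 197, "OH": 53,
             "VA": 53, "NC": 230, "FL": 262, "NH": 53}
objective, checks = evaluate(candidate)
print(objective, all(checks.values()))
```

Writing constraints (b) and (c) as comparisons against the running total, rather than against 1,050, is what lets the model handle the case where not all units are invested.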

Implementation

One approach would be to implement the model in strict “standard form,” or a row-column form, where all constraints are written with decision variables on the left-hand side and a number on the right-hand side. Figure 8.8 shows such an implementation and displays the solved model.

Alternatively, we could use the spreadsheet to calculate different parts of the model in a less rigid manner, as well as uniquely implementing the repetitive constraints (b) and (c), and have a much more concise (but not as transparent) spreadsheet. This is shown in Figure 8.9.

LP models (and their specializations and generalizations) can also be specified directly in a number of other user-friendly modeling systems. Two of the best known are Lindo and Lingo (Lindo Systems, Inc., lindo.com; demos are available). Lindo is an LP and integer programming system. Models are specified in essentially the same way that they are defined algebraically. Based on the success of Lindo, the company developed Lingo, a modeling language that includes the powerful Lindo optimizer and extensions for solving nonlinear problems. Many other modeling languages, such as AMPL, AIMMS, MPL, and XPRESS, are also available.

FIGURE 8.8 Model for Election Resource Allocation: Standard Version.

The most common optimization models can be solved by a variety of mathematical programming methods, including the following:

• Assignment (best matching of objects)
• Dynamic programming
• Goal programming
• Investment (maximizing rate of return)
• Linear and integer programming
• Network models for planning and scheduling
• Nonlinear programming
• Replacement (capital budgeting)
• Simple inventory models (e.g., economic order quantity)
• Transportation (minimize cost of shipments)

SECTION 8.6 REVIEW QUESTIONS

1. List and explain the assumptions involved in LP.
2. List and explain the characteristics of LP.
3. Describe an allocation problem.
4. Define the product-mix problem.
5. Define the blending problem.
6. List several common optimization models.


8.7 MULTIPLE GOALS, SENSITIVITY ANALYSIS, WHAT-IF ANALYSIS, AND GOAL SEEKING

Many, if not most, decision situations involve juggling between competing goals and alternatives. In addition, there is significant uncertainty about the assumptions and predictions being used in building a prescriptive analytics model. The following paragraphs simply recognize that these are also addressed in prescriptive analytics software and techniques. Coverage of these techniques is usually common in prescriptive analytics or operations research/management science courses.

Multiple Goals

The analysis of management decisions aims at evaluating, to the greatest possible extent, how far each alternative advances managers toward their goals. Unfortunately, managerial problems are seldom evaluated with a single simple goal, such as profit maximization. Today’s management systems are much more complex, and one with a single goal is rare. Instead, managers want to attain simultaneous goals, some of which may conflict. Different stakeholders have different goals. Therefore, it is often necessary to analyze each alternative in light of its effect on each of several goals (see Koksalan & Zionts, 2001).

FIGURE 8.9 A Compact Formulation for Election Resource Allocation.

For example, consider a profit-making firm. In addition to earning money, the company wants to grow, develop its products and employees, provide job security to its workers, and serve the community. Managers want to satisfy the shareholders and at the same time enjoy high salaries and expense accounts, and employees want to increase their take-home pay and benefits. When a decision is to be made (say, about an investment project), some of these goals complement each other, whereas others conflict. Kearns (2004) described how the analytic hierarchy process (AHP), combined with integer programming, addresses multiple goals in evaluating information technology (IT) investments.

Many quantitative models of decision theory are based on comparing a single measure of effectiveness, generally some form of utility to the decision maker. Therefore, it is usually necessary to transform a multiple-goal problem into a single-measure-of-effectiveness problem before comparing the effects of the solutions. This is a common method for handling multiple goals in an LP model.

Certain difficulties may arise when analyzing multiple goals:

• It is usually difficult to obtain an explicit statement of the organization’s goals.
• The decision maker may change the importance assigned to specific goals over time or for different decision scenarios.
• Goals and subgoals are viewed differently at various levels of the organization and within different departments.
• Goals change in response to changes in the organization and its environment.
• The relationship between alternatives and their role in determining goals may be difficult to quantify.
• Complex problems are solved by groups of decision makers, each of whom has a personal agenda.
• Participants assess the importance (priorities) of the various goals differently.

Several methods of handling multiple goals can be used when working with such situations. The most common ones are

• Utility theory
• Goal programming
• Expression of goals as constraints, using LP
• A points system

Sensitivity Analysis

A model builder makes predictions and assumptions regarding input data, many of which deal with the assessment of uncertain futures. When the model is solved, the results depend on these data. Sensitivity analysis attempts to assess the impact of a change in the input data or parameters on the proposed solution (i.e., the result variable).

Sensitivity analysis is extremely important in prescriptive analytics because it allows flexibility and adaptation to changing conditions and to the requirements of different decision-making situations, provides a better understanding of the model and the decision-making situation it attempts to describe, and permits the manager to input data to increase the confidence in the model. Sensitivity analysis tests relationships such as the following:

• The impact of changes in external (uncontrollable) variables and parameters on the outcome variable(s)

• The impact of changes in decision variables on the outcome variable(s)


• The effect of uncertainty in estimating external variables
• The effects of different dependent interactions among variables
• The robustness of decisions under changing conditions

Sensitivity analyses are used for:

• Revising models to eliminate too-large sensitivities
• Adding details about sensitive variables or scenarios
• Obtaining better estimates of sensitive external variables
• Altering a real-world system to reduce actual sensitivities
• Accepting and using the sensitive (and hence vulnerable) real world, leading to the continuous and close monitoring of actual results

The two types of sensitivity analyses are automatic and trial and error.

AUTOMATIC SENSITIVITY ANALYSIS Automatic sensitivity analysis is performed in standard quantitative model implementations such as LP. For example, it reports the range within which a certain input variable or parameter value (e.g., unit cost) can vary without having any significant impact on the proposed solution. Automatic sensitivity analysis is usually limited to one change at a time, and only for certain variables. However, it is powerful because of its ability to establish ranges and limits very fast (and with little or no additional computational effort). Sensitivity analysis is provided by Solver and almost all other software packages such as Lindo. Consider the MBI Corporation example introduced previously. Sensitivity analysis could be used to determine that if the right-hand side of the marketing constraint on CC-8 could be decreased by one unit, then the net profit would increase by $1,333.33. This is valid for the right-hand side decreasing to zero. Significant additional analysis is possible along these lines.
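The $1,333.33 figure can be checked by perturbing the right-hand side and re-evaluating the solution. The sketch below is one way to do that, under two stated assumptions: the MBI coefficients are those of the product-mix formulation given earlier in the chapter (not restated in this section), and the labor constraint and the CC-8 minimum remain the two binding constraints as the minimum is lowered. The code asserts that the remaining constraints still hold, so the second assumption is verified rather than taken on faith.

```python
# ASSUMED MBI product-mix data from the chapter's earlier formulation.
LABOR = (300, 500, 200_000)           # labor per CC-7, per CC-8, capacity
BUDGET = (10_000, 15_000, 8_000_000)  # cost per CC-7, per CC-8, budget
PROFIT = (8_000, 12_000)              # profit per CC-7, per CC-8
MIN_CC7 = 100                         # marketing minimum for CC-7


def profit_at(min_cc8):
    """Optimal profit when labor and the CC-8 minimum are both binding."""
    x2 = float(min_cc8)
    x1 = (LABOR[2] - LABOR[1] * x2) / LABOR[0]  # use all remaining labor on CC-7
    # Verify the binding-constraint assumption: other constraints must hold.
    assert x1 >= MIN_CC7
    assert BUDGET[0] * x1 + BUDGET[1] * x2 <= BUDGET[2]
    return PROFIT[0] * x1 + PROFIT[1] * x2


base = profit_at(200)
gain_per_unit = profit_at(199) - base  # relax the CC-8 minimum by one unit
print(f"Gain per unit relaxed: ${gain_per_unit:,.2f}")

# The same per-unit gain holds all the way down to a right-hand side of zero.
print(f"Profit with no CC-8 minimum: ${profit_at(0):,.2f}")
```

Solver's Sensitivity report gives this number directly as the shadow price of the constraint, without any re-solving.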

TRIAL-AND-ERROR SENSITIVITY ANALYSIS The impact of changes in any variable, or in several variables, can be determined through a simple trial-and-error approach. You change some input data and solve the problem again. When the changes are repeated several times, better and better solutions may be discovered. Such experimentation, which is easy to conduct when using appropriate modeling software, such as Excel, has two approaches: what-if analysis and goal seeking.

What-If Analysis

What-if analysis is structured as What will happen to the solution if an input variable, an assumption, or a parameter value is changed? Here are some examples:

• What will happen to the total inventory cost if the cost of carrying inventories increases by 10%?

• What will be the market share if the advertising budget increases by 5%?

With the appropriate user interface, it is easy for managers to ask a computer model these types of questions and get immediate answers. Furthermore, they can perform multiple cases and thereby change the percentage, or any other data in the question, as desired. The decision maker does all this directly, without a computer programmer.

Figure 8.10 shows a spreadsheet example of a what-if query for a cash flow problem. At first, initial sales were 100, growing at 3% per quarter, yielding an annual net profit of $127. When the user changes the initial sales cell to 120 and the sales growth rate to 4% per quarter, the program immediately recomputes the annual net profit cell, which rises to $182. What-if analysis is common in many decision systems. Users are given the opportunity to change their answers to some of the system’s questions, and a revised recommendation is found.
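In code, a what-if query is simply a model function re-evaluated under changed inputs. The sketch below uses a hypothetical quarterly cash flow model; the margin and fixed-cost parameters are invented for illustration, and the worksheet behind Figure 8.10 is not reproduced here, so the numbers printed will not match the figure.

```python
def annual_net_profit(initial_sales, growth_rate,
                      margin=0.35, fixed_cost_per_quarter=8.0):
    """Hypothetical model: four quarters of sales growing at a constant rate.

    margin and fixed_cost_per_quarter are illustrative assumptions.
    """
    sales = initial_sales
    total = 0.0
    for _ in range(4):
        total += margin * sales - fixed_cost_per_quarter
        sales *= 1 + growth_rate
    return total


# Each scenario is one what-if question: (initial sales, quarterly growth rate).
scenarios = {
    "base case": (100, 0.03),
    "higher sales and growth": (120, 0.04),
    "higher sales only": (120, 0.03),
}
for name, (sales, growth) in scenarios.items():
    print(f"{name}: {annual_net_profit(sales, growth):.2f}")
```

A spreadsheet does the same thing implicitly: changing an input cell triggers recalculation of every formula that depends on it.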

Goal Seeking

Goal seeking calculates the values of the inputs necessary to achieve a desired level of an output (goal). It represents a backward solution approach. The following are some examples of goal seeking:

• What annual R&D budget is needed for an annual growth rate of 15% by 2018?
• How many nurses are needed to reduce the average waiting time of a patient in the emergency room to less than 10 minutes?

An example of goal seeking is shown in Figure 8.11. In a financial planning model in Excel, the internal rate of return (IRR) is the interest rate that produces a net present value (NPV) of zero. Given a stream of annual returns in Column E, we can compute the NPV of the planned investment. To find the IRR of this cash flow, including the investment, we apply goal seeking: we set the NPV cell to the value 0 by changing the interest rate cell. The answer is 38.77059%.

COMPUTING A BREAK-EVEN POINT BY USING GOAL SEEKING Some modeling software packages can directly compute break-even points, which is an important application of goal seeking. This involves determining the value of the decision variables (e.g., quantity to produce) that generate zero profit.

In many general applications programs, it can be difficult to conduct sensitivity analysis because the prewritten routines usually present only a limited opportunity for asking what-if questions. In a DSS, the what-if and the goal-seeking options must be easy to perform.
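Goal seeking can be sketched as a one-dimensional root search: adjust one input until the output reaches the target. The example below uses bisection, which is one simple way to do this and is not necessarily the method Excel uses. The cash flows, price, unit cost, and fixed cost are hypothetical (the data behind Figure 8.11 are not available here), and `goal_seek` is an illustrative name.

```python
def goal_seek(f, target, lo, hi, tol=1e-9, max_iter=200):
    """Find x in [lo, hi] where f(x) equals target, by bisection.

    Assumes f is continuous and crosses the target once between lo and hi.
    """
    g_lo = f(lo) - target
    g_hi = f(hi) - target
    if g_lo * g_hi > 0:
        raise ValueError("target is not bracketed by lo and hi")
    for _ in range(max_iter):
        mid = (lo + hi) / 2
        g_mid = f(mid) - target
        if abs(g_mid) < tol or (hi - lo) / 2 < tol:
            return mid
        if g_lo * g_mid <= 0:
            hi = mid            # the crossing lies in the lower half
        else:
            lo, g_lo = mid, g_mid
    return (lo + hi) / 2


# Hypothetical cash flows: an initial investment, then four annual returns.
CASH_FLOWS = [-1000.0, 400.0, 400.0, 400.0, 400.0]


def npv(rate):
    """Net present value of CASH_FLOWS at the given interest rate."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(CASH_FLOWS))


# IRR: the interest rate at which NPV equals zero.
irr = goal_seek(npv, target=0.0, lo=0.0, hi=1.0)


def profit(quantity, price=25.0, unit_cost=15.0, fixed_cost=5000.0):
    """Hypothetical profit model used for the break-even search."""
    return (price - unit_cost) * quantity - fixed_cost


# Break-even point: the quantity at which profit equals zero.
break_even = goal_seek(profit, target=0.0, lo=0.0, hi=10_000.0)
print(f"IRR = {irr:.4%}, break-even quantity = {break_even:.1f}")
```

The same `goal_seek` function serves both questions because IRR and break-even are both "which input gives this output?" problems.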

FIGURE 8.10 Example of a What-If Analysis Done in an Excel Worksheet.


SECTION 8.7 REVIEW QUESTIONS

1. List some difficulties that may arise when analyzing multiple goals.
2. List the reasons for performing sensitivity analysis.
3. Explain why a manager might perform what-if analysis.
4. Explain why a manager might use goal seeking.

8.8 DECISION ANALYSIS WITH DECISION TABLES AND DECISION TREES

Decision situations that involve a finite and usually not too large number of alternatives are modeled through an approach called decision analysis (see Arsham, 2006a,b; Decision Analysis Society, decision-analysis.society.informs.org). Using this approach, the alternatives are listed in a table or a graph, with their forecasted contributions to the goal(s) and the probability of obtaining the contribution. These can be evaluated to select the best alternative.

Single-goal situations can be modeled with decision tables or decision trees. Multiple goals (criteria) can be modeled with several other techniques, described later in this chapter.

Decision Tables

Decision tables conveniently organize information and knowledge in a systematic, tabular manner to prepare it for analysis. For example, say that an investment company is considering investing in one of three alternatives: bonds, stocks, or certificates of deposit (CDs). The company is interested in one goal: maximizing the yield on the investment after 1 year. If it were interested in other goals, such as safety or liquidity, the problem would be classified as one of multicriteria decision analysis (see Koksalan & Zionts, 2001).

The yield depends on the state of the economy sometime in the future (often called the state of nature), which can be in solid growth, stagnation, or inflation. Experts estimated the following annual yields:

FIGURE 8.11 Goal-Seeking Analysis.


• If there is solid growth in the economy, bonds will yield 12%, stocks 15%, and CDs 6.5%.
• If stagnation prevails, bonds will yield 6%, stocks 3%, and CDs 6.5%.
• If inflation prevails, bonds will yield 3%, stocks will bring a loss of 2%, and CDs will yield 6.5%.

The problem is to select the one best investment alternative. These are assumed to be discrete alternatives. Combinations such as investing 50% in bonds and 50% in stocks must be treated as new alternatives.

The investment decision-making problem can be viewed as a two-person game (see Kelly, 2002). The investor makes a choice (i.e., a move), and then a state of nature occurs (i.e., makes a move). Table 8.3 shows the payoff of a mathematical model. The table includes decision variables (the alternatives), uncontrollable variables (the states of the economy; e.g., the environment), and result variables (the projected yield; e.g., outcomes). All the models in this section are structured in a spreadsheet framework.

If this were a decision-making problem under certainty, we would know what the economy would be and could easily choose the best investment. But that is not the case, so we must consider the two situations of uncertainty and risk. For uncertainty, we do not know the probabilities of each state of nature. For risk, we assume that we know the probabilities with which each state of nature will occur.

TREATING UNCERTAINTY Several methods are available for handling uncertainty. For example, the optimistic approach assumes that the best possible outcome of each alternative will occur and then selects the best of the best (i.e., stocks). The pessimistic approach assumes that the worst possible outcome for each alternative will occur and selects the best of these (i.e., CDs). Another approach simply assumes that all states of nature are equally possible (see Clemen & Reilly, 2000; Goodwin & Wright, 2000; Kontoghiorghes, Rustem, & Siokos, 2002). Every approach for handling uncertainty has serious problems. Whenever possible, the analyst should attempt to gather enough information so that the problem can be treated under assumed certainty or risk.

TREATING RISK The most common method for solving this risk analysis problem is to select the alternative with the greatest expected value. Assume that experts estimate the chance of solid growth at 50%, the chance of stagnation at 30%, and the chance of inflation at 20%. The decision table is then rewritten with the known probabilities (see Table 8.3). An expected value is computed by multiplying the results (i.e., outcomes) by their respective probabilities and adding them. For example, investing in bonds yields an expected return of 12(0.5) + 6(0.3) + 3(0.2) = 8.4%.
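The decision rules for uncertainty and risk can be written as short functions over the payoff table. The sketch below uses the yields and probabilities stated in the text; the helper names (`best_by`, `expected_value`) are illustrative.

```python
# Annual yield (%) by state of nature: (solid growth, stagnation, inflation).
PAYOFFS = {
    "Bonds": (12.0, 6.0, 3.0),
    "Stocks": (15.0, 3.0, -2.0),
    "CDs": (6.5, 6.5, 6.5),
}
PROBABILITIES = (0.5, 0.3, 0.2)  # expert estimates used in the risk case


def best_by(score):
    """Return the alternative whose payoff row gets the highest score."""
    return max(PAYOFFS, key=lambda alt: score(PAYOFFS[alt]))


def expected_value(row):
    """Probability-weighted average of the outcomes in one row."""
    return sum(p * y for p, y in zip(PROBABILITIES, row))


# Decision making under uncertainty (no probabilities known).
optimistic = best_by(max)                                    # best of the best
pessimistic = best_by(min)                                   # best of the worst
equal_likelihood = best_by(lambda row: sum(row) / len(row))  # all states equally likely

# Decision making under risk (probabilities known).
risk_choice = best_by(expected_value)

print("Optimistic:", optimistic)
print("Pessimistic:", pessimistic)
print("Equal likelihood:", equal_likelihood)
for alt, row in PAYOFFS.items():
    print(f"Expected yield of {alt}: {expected_value(row):.1f}%")
print("Greatest expected value:", risk_choice)
```

The optimistic and pessimistic rules pick stocks and CDs, as stated in the text, and the expected-value rule picks bonds at 8.4%.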

TABLE 8.3 Investment Problem Decision Table Model

                 State of Nature (Uncontrollable Variables)
Alternative      Solid Growth (%)   Stagnation (%)   Inflation (%)
Bonds            12.0               6.0              3.0
Stocks           15.0               3.0              -2.0
CDs              6.5                6.5              6.5

This approach can sometimes be a dangerous strategy because the utility of each potential outcome may be different from the value. Even if there is an infinitesimal chance of a catastrophic loss, the expected value may seem reasonable, but the investor may not be willing to cover the loss. For example, suppose a financial advisor presents you with an “almost sure” investment of $1,000 that can double your money in one day, and then the advisor says, “Well, there is a 0.9999 probability that you will double your money, but unfortunately there is a 0.0001 probability that you will be liable for a $500,000 out-of-pocket loss.” The expected value of this investment is as follows:

0.9999($2,000 - $1,000) + 0.0001(-$500,000 - $1,000) = $999.90 - $50.10 = $949.80

The potential loss could be catastrophic for any investor who is not a billionaire. Depending on the investor’s ability to cover the loss, an investment has different expected utilities. Remember that the investor makes the decision only once.
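The gap between expected value and expected utility can be made concrete with a small calculation. The sketch below reproduces the $949.80 expected value and then scores the same gamble with an exponential utility function, U(x) = 1 - exp(-x/R). That functional form and the two risk-tolerance values R are assumptions chosen only for illustration; the text does not prescribe a particular utility function.

```python
import math

# (probability, net payoff in dollars) for the "almost sure" investment.
OUTCOMES = [(0.9999, 2_000.0 - 1_000.0), (0.0001, -500_000.0 - 1_000.0)]

expected_value = sum(p * x for p, x in OUTCOMES)


def expected_utility(risk_tolerance):
    """Expected exponential utility; U(0) = 0 is the utility of not investing.

    risk_tolerance (R, in dollars) is a hypothetical parameter: the larger it
    is, the more easily the investor can absorb a loss.
    """
    return sum(p * (1 - math.exp(-x / risk_tolerance)) for p, x in OUTCOMES)


print(f"Expected value: ${expected_value:,.2f}")
for label, r in [("modest means", 50_000.0), ("very wealthy", 10_000_000.0)]:
    eu = expected_utility(r)
    decision = "accept" if eu > 0 else "decline"
    print(f"Investor of {label} (R = ${r:,.0f}): {decision}")
```

Both investors face the same positive expected value, yet under this assumed utility function only the one who can absorb the loss finds the gamble attractive.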

Decision Trees

An alternative representation of the decision table is a decision tree. A decision tree shows the relationships of the problem graphically and can handle complex situations in a compact form. However, a decision tree can be cumbersome if there are many alternatives or states of nature. TreeAge Pro (TreeAge Software Inc., treeage.com) and PrecisionTree (Palisade Corp., palisade.com) include powerful, intuitive, and sophisticated decision tree analysis systems. These vendors also provide excellent examples of decision trees used in practice. Note that the phrase decision tree has been used to describe two different types of models and algorithms. In the current context, decision trees refer to scenario analysis. On the other hand, some classification algorithms in predictive analysis (see Chapters 4 and 5) are also called decision tree algorithms. The reader is advised to note the difference between these two uses of the name decision tree.

A simplified investment case of multiple goals (a decision situation in which alternatives are evaluated with several, sometimes conflicting, goals) is shown in Table 8.4. The three goals (criteria) are yield, safety, and liquidity. This situation is under assumed certainty; that is, only one possible consequence is projected for each alternative; the more complex cases of risk or uncertainty could be considered. Some of the results are qualitative (e.g., low, high) rather than numeric.
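One way to compare alternatives on several goals is a points system that converts every criterion to a common scale and combines them with weights. The sketch below applies that idea to the values in Table 8.4. The numeric mapping of the qualitative levels and both sets of weights are hypothetical choices made for illustration; a real analysis would elicit them from the decision maker.

```python
# Values from Table 8.4.
ALTERNATIVES = {
    "Bonds": {"yield": 8.4, "safety": "High", "liquidity": "High"},
    "Stocks": {"yield": 8.0, "safety": "Low", "liquidity": "High"},
    "CDs": {"yield": 6.5, "safety": "Very high", "liquidity": "High"},
}

# HYPOTHETICAL mapping of qualitative ratings to points.
LEVELS = {"Low": 1, "Medium": 2, "High": 3, "Very high": 4}


def score(alt, weights):
    """Weighted sum of criteria, each scaled to the range 0..1."""
    max_yield = max(a["yield"] for a in ALTERNATIVES.values())
    max_level = max(LEVELS.values())
    row = ALTERNATIVES[alt]
    return (weights["yield"] * row["yield"] / max_yield
            + weights["safety"] * LEVELS[row["safety"]] / max_level
            + weights["liquidity"] * LEVELS[row["liquidity"]] / max_level)


def best(weights):
    """Alternative with the highest weighted score."""
    return max(ALTERNATIVES, key=lambda alt: score(alt, weights))


# Two HYPOTHETICAL weight profiles.
yield_focused = {"yield": 0.5, "safety": 0.3, "liquidity": 0.2}
safety_focused = {"yield": 0.2, "safety": 0.6, "liquidity": 0.2}

print("Yield-focused investor:", best(yield_focused))
print("Safety-focused investor:", best(safety_focused))
```

With these illustrative weights the preferred alternative changes from bonds to CDs as safety gains importance, which shows how strongly a points system depends on the weights chosen.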

See Clemen and Reilly (2000), Goodwin and Wright (2000), and Decision Analysis Society (informs.org/Community/DAS) for more on decision analysis. Although doing so is quite complex, it is possible to apply mathematical programming directly to decision-making situations under risk. We discuss several other methods of treating risk later in the book. These include simulation, certainty factors, and fuzzy logic.

SECTION 8.8 REVIEW QUESTIONS

1. What is a decision table?
2. What is a decision tree?
3. How can a decision tree be used in decision making?
4. Describe what it means to have multiple goals.

TABLE 8.4 Multiple Goals

Alternative      Yield (%)   Safety      Liquidity
Bonds            8.4         High        High
Stocks           8.0         Low         High
CDs              6.5         Very high   High


8.9 INTRODUCTION TO SIMULATION

In this section and the next we introduce a category of techniques that are used for supporting decision making. Very broadly, these methods fall under the umbrella of simulation. Simulation is the appearance of reality. In decision systems, simulation is a technique for conducting experiments (e.g., what-if analyses) with a computer on a model of a management system. Strictly speaking, simulation is a descriptive rather than a prescriptive method. There is no automatic search for an optimal solution. Instead, a simulation model describes or predicts the characteristics of a given system under different conditions. When the values of the characteristics are computed, the best of several alternatives can be selected. The simulation process usually repeats an experiment many times to obtain an estimate (and a variance) of the overall effect of certain actions. For most situations, a computer simulation is appropriate, but there are some well-known manual simulations (e.g., a city police department simulated its patrol car scheduling with a carnival game wheel).

Typically, real decision-making situations involve some randomness. Because many decision situations deal with semistructured or unstructured situations, reality is complex, which may not be easily represented by optimization or other models but can often be handled by simulation. Simulation is one of the most commonly used decision support methods. See Application Case 8.6 for an example. Application Case 8.7 illustrates the value of simulation in another setting where the problem complexity does not permit building a traditional optimization model.

Major Characteristics of Simulation

Simulation typically involves building a model of reality to the extent practical. Simulation models may require fewer assumptions about the decision situation than other prescriptive analytics models. In addition, simulation is a technique for conducting experiments. Therefore, it involves testing specific values of the decision or uncontrollable variables in the model and observing the impact on the output variables.

Finally, simulation is normally used only when a problem is too complex to be treated using numerical optimization techniques. Complexity in this situation means either that the problem cannot be formulated for optimization (e.g., because the assumptions do not hold), that the formulation is too large, that there are too many interactions among the variables, or that the problem is stochastic in nature (i.e., exhibits risk or uncertainty).
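The experiment-and-observe character of simulation can be illustrated with a small Monte Carlo sketch. The setting is hypothetical: a seller must choose an order quantity before random demand is known, and each candidate quantity is evaluated by repeating the experiment many times to estimate the mean and variance of profit. The price, unit cost, demand range, and candidate quantities are all invented for illustration.

```python
import random
import statistics

PRICE, UNIT_COST = 10, 6            # hypothetical selling price and unit cost
DEMAND_LOW, DEMAND_HIGH = 50, 150   # hypothetical uniform integer demand range


def simulate(order_qty, runs=10_000, seed=42):
    """Repeat the experiment `runs` times; return (mean, variance) of profit."""
    rng = random.Random(seed)  # fixed seed so every alternative sees the same demands
    profits = []
    for _ in range(runs):
        demand = rng.randint(DEMAND_LOW, DEMAND_HIGH)
        units_sold = min(order_qty, demand)
        profits.append(PRICE * units_sold - UNIT_COST * order_qty)
    return statistics.mean(profits), statistics.variance(profits)


# Test specific values of the decision variable and observe the output.
results = {q: simulate(q) for q in (50, 90, 150)}
best_qty = max(results, key=lambda q: results[q][0])

for q, (mean, var) in results.items():
    print(f"order {q}: mean profit {mean:.1f}, variance {var:.1f}")
print("Best of the alternatives tested:", best_qty)
```

Note that the simulation only ranks the alternatives it was given. It does not search for an optimum, which is the sense in which simulation is descriptive rather than prescriptive.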

Application Case 8.7 Steel Tubing Manufacturer Uses a Simulation-Based Production Scheduling System

A steel manufacturing plant produces rolled-steel tubes for different industries across the country. The plant builds tubes based on each customer’s requirements and specifications. Maintaining high-quality norms and timely delivery of products are two of the most important criteria for this steel tubing plant. The plant views its manufacturing system as a sequence of operations where it unrolls steel from one reel and rolls it onto a different reel. This happens once the forming, welding, editing, or inspecting operation is finished. The ultimate product would be a reel of rolled steel tubing that weighs about 20 tons. The reel is then shipped to the customer.

A key challenge for management is to be able to predict the appropriate delivery date for an order, and its impact on the currently planned production schedule. Given the complexity of the production process, it is not easy to develop an optimization model in Excel or other software to build a production schedule (see Application Case 8.1). The issue is that these tools fail to capture key planning issues such as employee schedules and qualifications, material accessibility, material allocation complication, and random aspects of the operation.

Methodology/Solution

When traditional modeling methods do not capture the problem subtleties or complexities, a simulation model could perhaps be built. The predictive analysis approach uses a versatile Simio simulation model that takes into consideration all the operational complexity, manufacturing material matching algorithms, and deadline considerations. Also, Simio’s service offering, known as risk-based planning and scheduling (RPS), provides some user interfaces and reports simply designed for production management. This gives the client the ability to explore the impact of a new order on their production plan and schedule within about 10 minutes.

Results/Benefits

Such models provide significant visibility into the production schedule. The risk-based planning and scheduling system should be able to warn the master scheduler that a specific order has a chance of being delivered late. Changes could also be made sooner to rectify issues with an order. Success for this steel tubing manufacturer is directly tied to product quality and on-time delivery. By exploiting Simio’s predictive RPS offering, the plant expects improved market share.

Questions for Discussion

1. Explain the advantages of using Simio’s simulation model over traditional methods.
2. In what ways has the predictive analysis approach helped management achieve the goals of analyzing the production schedules?
3. Besides the steel manufacturing industry, in what other industries could such a modeling approach help improve quality and service?

What Can We Learn from This Application Case?

By using Simio’s simulation model, the manufacturing plant made better decisions in assessment of operations, taking all of the problem issues into consideration. Thus, a simulation-based production scheduling system could derive higher returns and market share for the steel tubing manufacturer. Simulation is an important technique for prescriptive analytics.

Compiled from Arthur, Molly. “Simulation-Based Production Scheduling System.” www.simio.com, Simio LLC, 2014, www.simio.com/case-studies/A-Steel-Tubing-Manufacturer-Expects-More-Market-Share/A-Steel-Tubing-Manufacturer-Expects-More-Market-Share.pdf (accessed September 2018); “Risk-Based Planning and Scheduling (RPS) with Simio.” www.simio.com, Simio LLC, www.simio.com/about-simio/why-simio/simio-RPS-risk-based-planning-and-scheduling.php (accessed September 2018).

Advantages of Simulation

Simulation is used in decision support modeling for the following reasons:

• The theory is fairly straightforward.
• A great amount of time compression can be attained, quickly giving a manager some feel as to the long-term (1- to 10-year) effects of many policies.
• Simulation is descriptive rather than normative. This allows the manager to pose what-if questions. Managers can use a trial-and-error approach to problem solving and can do so faster, at less expense, more accurately, and with less risk.
• A manager can experiment to determine which decision variables and which parts of the environment are really important, and with different alternatives.
• An accurate simulation model requires an intimate knowledge of the problem, thus forcing the model builder to constantly interact with the manager. This is desirable for DSS development because the developer and manager both gain a better understanding of the problem and the potential decisions available.

• The model is built from the manager’s perspective.

• The simulation model is built for one particular problem and typically cannot solve any other problem. Thus, no generalized understanding is required of the manager; every component in the model corresponds to part of the real system.

• Simulation can handle an extremely wide variety of problem types, such as inventory and staffing, as well as higher-level managerial functions, such as long-range planning.

• Simulation generally can include the real complexities of problems; simplifications are not necessary. For example, simulation can use real probability distributions rather than approximate theoretical distributions.

• Simulation automatically produces many important performance measures.

• Simulation is often the only DSS modeling method that can readily handle relatively unstructured problems.

• Some relatively easy-to-use simulation packages (e.g., Monte Carlo simulation) are available. These include add-in spreadsheet packages (e.g., @RISK), influence diagram software, Java-based (and other Web development) packages, and the visual interactive simulation systems to be discussed shortly.

Disadvantages of Simulation

The primary disadvantages of simulation are as follows:

• An optimal solution cannot be guaranteed, but relatively good ones are generally found.

• Simulation model construction can be a slow and costly process, although newer modeling systems are easier to use than ever.

• Solutions and inferences from a simulation study are usually not transferable to other problems because the model incorporates unique problem factors.

• Simulation is sometimes so easy to explain to managers that analytic methods are often overlooked.

• Simulation software sometimes requires special skills because of the complexity of the formal solution method.

The Methodology of Simulation

Simulation involves setting up a model of a real system and conducting repetitive experiments on it. The methodology consists of the following steps, as shown in Figure 8.12:

1. Define the problem. We examine and classify the real-world problem, specifying why a simulation approach is appropriate. The system’s boundaries, environment, and other such aspects of problem clarification are handled here.

2. Construct the simulation model. This step involves determination of the variables and their relationships, as well as data gathering. Often the process is described by using a flowchart, and then a computer program is written.

3. Test and validate the model. The simulation model must properly represent the system being studied. Testing and validation ensure this.

4. Design the experiment. When the model has been proven valid, an experiment is designed. Determining how long to run the simulation is part of this step. There are two important and conflicting objectives: accuracy and cost. It is also prudent to identify typical (e.g., mean and median cases for random variables), best-case (e.g., low-cost, high-revenue), and worst-case (e.g., high-cost, low-revenue) scenarios. These help establish the ranges of the decision variables and environment in which to work and also assist in debugging the simulation model.

5. Conduct the experiment. Conducting the experiment involves issues ranging from random-number generation to result presentation.

6. Evaluate the results. The results must be interpreted. In addition to standard statistical tools, sensitivity analyses can also be used.

7. Implement the results. The implementation of simulation results involves the same issues as any other implementation. However, the chances of success are better because the manager is usually more involved with the simulation process than with other models. Higher levels of managerial involvement generally lead to higher levels of implementation success.
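The trade-off between accuracy and cost in steps 4 through 6 can be sketched in a few lines of Python. This is only an illustrative sketch: the model (`one_replication`, a month of normally distributed daily demand with mean 100 and standard deviation 20) and the function names are assumptions made up for this example, not part of the methodology itself. It shows how running more replications narrows the confidence interval around the estimated result, at the cost of more computation.

```python
import random
import statistics

def one_replication(rng, days=30):
    """Hypothetical model output: total demand over one simulated month."""
    return sum(rng.gauss(100, 20) for _ in range(days))

def run_experiment(n_replications, seed=42):
    """Run independent replications; return the mean and an approximate
    95% confidence half-width (normal approximation)."""
    rng = random.Random(seed)
    results = [one_replication(rng) for _ in range(n_replications)]
    mean = statistics.mean(results)
    half_width = 1.96 * statistics.stdev(results) / n_replications ** 0.5
    return mean, half_width

# More replications cost more run time but give a tighter interval.
for n in (10, 100, 1000):
    mean, half_width = run_experiment(n)
    print(f"{n:5d} replications: {mean:8.1f} +/- {half_width:.1f}")
```

Reporting the interval rather than the mean alone is one simple way to avoid presenting a single point estimate as if it were certain.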

Banks and Gibson (2009) presented some useful advice about simulation practices. For example, they list the following seven issues as the common mistakes committed by simulation modelers. The list, though not exhaustive, provides general directions for professionals working on simulation projects.

• Focusing more on the model than on the problem

• Providing point estimates

• Not knowing when to stop

• Reporting what the client wants to hear rather than what the model results say

• Lack of understanding of statistics

• Confusing cause and effect

• Failure to replicate reality

In a follow-up article they provide additional guidelines. You should consult this article: analytics-magazine.org/spring-2009/205-software-solutions-the-abcs-of-simulation-practice.html.

Simulation Types

As we have seen, simulation and modeling are used when pilot studies and experimenting with real systems are expensive or sometimes impossible. Simulation models allow us to investigate various interesting scenarios before making any investment. In fact, in simulations, the real-world operations are mapped into the simulation model. The model consists of relationships and, consequently, equations that all together present the real-world operations. The results of a simulation model, then, depend on the set of parameters given to the model as inputs.

FIGURE 8.12 The Process of Simulation. [Flowchart boxes: Real-world problem; Define the problem; Construct the simulation model; Test and validate the model; Design and conduct the experiments; Evaluate the experiments’ results; Implement the results; with “Change the real-world problem” and “Do over/Feedback” links.]

There are various simulation paradigms such as Monte Carlo simulation, discrete event, agent based, or system dynamics. One of the factors that determine the type of simulation technique is the level of abstraction in the problem. Discrete event and agent-based models are usually used for middle or low levels of abstraction. They usually consider individual elements such as people, parts, and products in the simulation models, whereas system dynamics is more appropriate for aggregate analysis.

In the following section, we introduce the major types of simulation: probabilistic simulation, time-dependent and time-independent simulation, and visual simulation. There are many other simulation techniques, such as system dynamics modeling and agent-based modeling. As has been noted before, the goal here is to make you aware of the potential of some of these techniques rather than to make you an expert in using them.

PROBABILISTIC SIMULATION In probabilistic simulation, one or more of the independent variables (e.g., the demand in an inventory problem) are probabilistic. They follow certain probability distributions, which can be either discrete distributions or continuous distributions:

• Discrete distributions involve a situation with a limited number of events (or variables) that can take on only a finite number of values.

• Continuous distributions are situations with unlimited numbers of possible events that follow density functions, such as the normal distribution.

The two types of distributions are shown in Table 8.5.
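To make the distinction concrete, the following Python sketch draws daily demand from both distributions in Table 8.5, using only the standard library. The function names, the sample size, and the random seed are illustrative assumptions; the demand values, probabilities, mean, and standard deviation come from the table.

```python
import random
import statistics

# Values and probabilities taken from Table 8.5.
DEMAND_VALUES = [5, 6, 7, 8, 9]
DEMAND_PROBS = [0.10, 0.15, 0.30, 0.25, 0.20]

def sample_discrete(rng, n):
    """Draw n daily demands from the discrete distribution."""
    return rng.choices(DEMAND_VALUES, weights=DEMAND_PROBS, k=n)

def sample_continuous(rng, n, mean=7.0, sd=1.2):
    """Draw n daily demands from the normal distribution."""
    return [rng.gauss(mean, sd) for _ in range(n)]

rng = random.Random(1)  # arbitrary seed, fixed for repeatability
discrete = sample_discrete(rng, 10_000)
continuous = sample_continuous(rng, 10_000)

# The discrete draws take only the five listed values; the continuous
# draws can be any real number near the mean.
print("distinct discrete values:", sorted(set(discrete)))
print(f"discrete sample mean:   {statistics.mean(discrete):.2f}")
print(f"continuous sample mean: {statistics.mean(continuous):.2f}")
```

Note that the normal distribution can occasionally return fractional or even implausibly low demand values, which is one reason a modeler might round or truncate the samples in practice.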

TIME-DEPENDENT VERSUS TIME-INDEPENDENT SIMULATION Time-independent refers to a situation in which it is not important to know exactly when the event occurred. For example, we may know that the demand for a certain product is three units per day, but we do not care when during the day the item is demanded. In some situations, time may not be a factor in the simulation at all, such as in steady-state plant control design. However, in waiting-line problems applicable to e-commerce, it is important to know the precise time of arrival (to know whether the customer will have to wait). This is a time-dependent situation.

TABLE 8.5 Discrete versus Continuous Probability Distributions

Daily Demand    Discrete Probability    Continuous Probability
5               0.10                    Daily demand is normally distributed with a mean of 7 and a standard deviation of 1.2
6               0.15
7               0.30
8               0.25
9               0.20

Monte Carlo Simulation

In most business decision problems, we usually employ one of the following two types of probabilistic simulations. The most common simulation method for business decision problems is the Monte Carlo simulation. This method usually begins with building a model of the decision problem without having to consider the uncertainty of any variables. Then we recognize that certain parameters or variables are uncertain or follow an assumed or estimated probability distribution. This estimation is based on analysis of past data. Then we begin running sampling experiments. Running sampling experiments consists of generating random values of uncertain parameters and then computing values of the variables that are impacted by such parameters or variables. These sampling experiments essentially amount to solving the same model hundreds or thousands of times. We can then analyze the behavior of these dependent or performance variables by examining their statistical distributions. This method has been used in simulations of physical as well as business systems.

A good public tutorial on the Monte Carlo simulation method is available on Palisade.com (http://www.palisade.com/risk/monte_carlo_simulation.asp). Palisade markets a tool called @RISK, a popular spreadsheet-based Monte Carlo simulation software. Another popular software in this category is Crystal Ball, now marketed by Oracle as Oracle Crystal Ball. Of course, it is also possible to build and run Monte Carlo experiments within an Excel spreadsheet without using any add-on software such as the two just mentioned. But these tools make it more convenient to run such experiments in Excel-based models. Monte Carlo simulation models have been used in many commercial applications. Examples include Procter & Gamble using these models to determine how to hedge foreign-exchange risks; Lilly using the model for deciding optimal plant capacity; Abu Dhabi Water and Electricity Company using @RISK for forecasting water demand in Abu Dhabi; and literally thousands of other actual case studies. Each of the simulation software companies’ Web sites includes many such success stories.
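The sampling procedure described above can be sketched in plain Python, without any add-on tool. The example below is an illustration under stated assumptions: the deterministic model (`daily_profit`), the unit price of 10, and the unit cost of 6 are hypothetical numbers chosen for this sketch, and only the demand distribution is borrowed from Table 8.5. The model is first written as if demand were known, and then solved thousands of times with randomly generated demand.

```python
import random
import statistics

# Uncertain input: daily demand, using the discrete distribution in Table 8.5.
DEMAND_VALUES = [5, 6, 7, 8, 9]
DEMAND_PROBS = [0.10, 0.15, 0.30, 0.25, 0.20]

UNIT_PRICE = 10.0  # hypothetical selling price per unit
UNIT_COST = 6.0    # hypothetical purchase cost per unit

def daily_profit(stock, demand):
    """Deterministic model: revenue on units sold minus cost of units stocked."""
    return UNIT_PRICE * min(stock, demand) - UNIT_COST * stock

def simulate(stock, n_trials=10_000, seed=7):
    """Solve the same model n_trials times with randomly sampled demand."""
    rng = random.Random(seed)
    demands = rng.choices(DEMAND_VALUES, weights=DEMAND_PROBS, k=n_trials)
    return [daily_profit(stock, d) for d in demands]

# Examine the distribution of the performance variable for each decision.
for stock in DEMAND_VALUES:
    profits = simulate(stock)
    mean = statistics.mean(profits)
    share_losing = sum(p < 0 for p in profits) / len(profits)
    print(f"stock={stock}: mean profit={mean:6.2f}, "
          f"min={min(profits):6.2f}, share of losing days={share_losing:.1%}")
```

Looking at the whole distribution (for instance, the worst case and the share of losing days) rather than only the mean is what distinguishes this kind of analysis from a single deterministic what-if calculation.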

Discrete Event Simulation

Discrete event simulation refers to building a model of a system where the interaction between different entities is studied. The simplest example of this is a shop consisting of a server and customers. By modeling the customers arriving at various rates and the server serving at various rates, we can estimate the average performance of the system, waiting time, the number of waiting customers, and so on. Such systems are viewed as collections of customers, queues, and servers. There are thousands of documented applications of discrete event simulation models in engineering, business, and so on. Tools for building discrete event simulation models have been around for a long time, but these have evolved to take advantage of developments in graphical capabilities for building and understanding the results of such simulation models. We will discuss this modeling method further in the next section. Application Case 8.8 gives an example of the use of such simulation in analyzing complexities of a supply chain that uses a visual simulation to be described in the next section.
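The single-server shop mentioned above can be sketched as a small event-driven program. This is a minimal illustration, not how commercial tools are implemented: the function name `simulate_shop`, the exponential interarrival and service times, and the rates used are all assumptions for the example. The simulation keeps a time-ordered list of arrival and departure events and processes them one at a time, recording how long each customer waits.

```python
import heapq
import random
from collections import deque

def simulate_shop(arrival_rate, service_rate, n_customers, seed=3):
    """Single-server, first-come-first-served shop driven by an event list.
    Returns (average wait before service, longest waiting line)."""
    rng = random.Random(seed)
    events = []  # heap of (time, sequence number, kind)
    seq = 0
    clock = 0.0
    for _ in range(n_customers):  # schedule all arrivals up front
        clock += rng.expovariate(arrival_rate)
        heapq.heappush(events, (clock, seq, "arrival"))
        seq += 1

    waiting = deque()  # arrival times of customers standing in line
    server_busy = False
    waits = []
    longest_line = 0

    while events:
        now, _, kind = heapq.heappop(events)
        start_service = False
        if kind == "arrival":
            if server_busy:
                waiting.append(now)
                longest_line = max(longest_line, len(waiting))
            else:
                waits.append(0.0)  # served immediately
                start_service = True
        else:  # a departure frees the server
            if waiting:
                waits.append(now - waiting.popleft())
                start_service = True
            else:
                server_busy = False
        if start_service:
            server_busy = True
            done = now + rng.expovariate(service_rate)
            heapq.heappush(events, (done, seq, "departure"))
            seq += 1

    return sum(waits) / len(waits), longest_line

for rate in (0.3, 0.6, 0.9):  # customers per minute; server handles 1.0 per minute
    avg_wait, longest = simulate_shop(rate, 1.0, 5000)
    print(f"arrival rate {rate}: average wait {avg_wait:.2f}, longest line {longest}")
```

Running the model at several arrival rates shows how waiting grows as the server approaches full utilization, which is the kind of question such models are typically built to answer.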

Application Case 8.8 Cosan Improves Its Renewable Energy Supply Chain Using Simulation

Introduction

Cosan is a Brazil-based conglomerate that operates globally. One of its major activities is to grow and process sugar cane. Besides being a major source of sugar, sugar cane is now a major source of ethanol, a main ingredient in renewable energy. Because of the growing demand for renewable energy, ethanol production has become such a major activity for Cosan that it now operates two refineries in addition to 18 production plants, and of course, millions of hectares of sugar cane farms. According to recent data, it processed over 44 million tons of sugar cane, produced over 1.3 billion liters of ethanol, and produced 3.3 million tons of sugar. As one might imagine, operations of this scale lead to complex supply chains. So the logistics team was asked to make recommendations to the senior management to:

• Determine the optimum number of vehicles required in a fleet used to transport sugar cane to processing mills to preserve capital.

• Propose how to increase the actual capacity of sugar cane received at the sugar mills.

• Identify the production bottleneck problems to solve to improve the flow of sugar cane.

Methodology/Solution

The logistics team worked with Simio software and built a complex simulation model of the Cosan supply chain as it pertains to these issues. According to a Simio brief, “Over the course of three months, newly hired engineers collected data in the field and received hands-on training and modeling assistance from Paragon Consulting of San Palo.”

To analyze the sugar cane’s postharvest journey to the production mills, the model of agricultural operations covered the road transport fleet carrying the sugar cane crop to Unity Costa Pinto, the actual cane reception capacity of the sugar mills, bottlenecks and points for improvement in the cut-load-haul (CCT) flow of sugar cane, and so on.

The model parameters are as follows:

Input Variables: 32
Output Variables: 39
Auxiliary Variables: 92
Variable Entities: 8
Input Tables: 19
Simulated Days: 240 (1st season)
Number of Entities: 12 (10 harvester compositional types for transport of sugar cane)

Results/Benefits

Analyses produced by these Simio models provided a good view of the risk of operation over the 240-day period due to various uncertainties. By analyzing the various bottlenecks and ways to mitigate those scenarios, the company was able to make better decisions and save over $500,000 from this modeling effort alone.

Questions for Discussion

1. What type of supply chain disruptions might occur in moving the sugar cane from the field to the production plants to develop sugar and ethanol?

2. What types of advanced planning and prediction might be useful in mitigating such disruptions?

What Can We Learn from This Application Case?

This short application story illustrates the value of applying simulation to a problem where it might be difficult to build an optimization model. By incorporating a discrete event simulation model and visual interactive simulation (VIS), one can visualize the impact of interruptions in supply chain due to fleet failure, unexpected downtime at the plant, and so on, and come up with planned corrections.

Sources: Compiled from Wikipedia contributors, Cosan, Wikipedia, The Free Encyclopedia, https://en.wikipedia.org/w/index.php?title=Cosan&oldid=713298536 (accessed July 10, 2016); Agricultural Operations Simulation Case Study: Cosan, http://www.simio.com/case-studies/Cosan-agricultural-logistics-simulation-software-case-study/agricultural-simulation-software-case-study-video-cosan.php (accessed July 2016); Cosan Case Study: Optimizing agricultural logistics operations, http://www.simio.com/case-studies/Cosan-agricultural-logistics-simulation-software-case-study/index.php (accessed July 2016).

SECTION 8.9 REVIEW QUESTIONS

1. List the characteristics of simulation.
2. List the advantages and disadvantages of simulation.
3. List and describe the steps in the methodology of simulation.
4. List and describe the types of simulation.

8.10 VISUAL INTERACTIVE SIMULATION

We next examine methods that show a decision maker a representation of the decision-making situation in action as it runs through scenarios of the various alternatives. These powerful methods overcome some of the inadequacies of conventional methods and help build trust in the solution attained because they can be visualized directly.

Conventional Simulation Inadequacies

Simulation is a well-established, useful, descriptive, mathematics-based method for gaining insight into complex decision-making situations. However, simulation does not usually allow decision makers to see how a solution to a complex problem evolves over (compressed) time, nor can decision makers interact with the simulation (which would be useful for training purposes and teaching). Simulation generally reports statistical results at the end of a set of experiments. Decision makers are thus not an integral part of simulation development and experimentation, and their experience and judgment cannot be used directly. If the simulation results do not match the intuition or judgment of the decision maker, a confidence gap in the results can occur.

Visual Interactive Simulation

Visual interactive simulation (VIS), also known as visual interactive modeling (VIM) and visual interactive problem solving, is a simulation method that lets decision makers see what the model is doing and how it interacts with the decisions made, as they are made. This technique has been used with great success in operations analysis in many fields such as supply chain and healthcare. The user can employ his or her knowledge to determine and try different decision strategies while interacting with the model. Enhanced learning, about both the problem and the impact of the alternatives tested, can and does occur. Decision makers also contribute to model validation. Decision makers who use VIS generally support and trust their results.

VIS uses animated computer graphic displays to present the impact of different managerial decisions. It differs from regular graphics in that the user can adjust the decision-making process and see results of the intervention. A visual model is a graphic used as an integral part of decision making or problem solving, not just as a communication device. Some people respond better than others to graphical displays, and this type of interaction can help managers learn about the decision-making situation.

VIS can represent static or dynamic systems. Static models display a visual image of the result of one decision alternative at a time. Dynamic models display systems that evolve over time, and the evolution is represented by animation. The latest visual simulation technology has been coupled with the concept of virtual reality, where an artificial world is created for a number of purposes, from training to entertainment to viewing data in an artificial landscape. For example, the U.S. military uses VIS systems so that ground troops can gain familiarity with terrain or a city to very quickly orient themselves. Pilots also use VIS to gain familiarity with targets by simulating attack runs. The VIS software can also include GIS coordinates.

Visual Interactive Models and DSS

VIM in DSS has been used in several operations management decisions. The method consists of priming (like priming a water pump) a visual interactive model of a plant (or company) with its current status. The model then runs rapidly on a computer, allowing managers to observe how a plant is likely to operate in the future.

Waiting-line management (queuing) is a good example of VIM. Such a DSS usually computes several measures of performance for the various decision alternatives (e.g., waiting time in the system). Complex waiting-line problems require simulation. VIM can display the size of the waiting line as it changes during the simulation runs and can also graphically present the answers to what-if questions regarding changes in input variables. Application Case 8.9 gives an example of a visual simulation that was used to explore the applications of radio-frequency identification (RFID) technology in developing new scheduling rules in a manufacturing setting.
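As a rough, text-only stand-in for what a VIM display does with a waiting line, the sketch below prints the length of the line minute by minute and repeats the run for a what-if change in the number of servers. Everything here is an assumption made for illustration: the function names, the minute-by-minute time step, and the arrival and service probabilities. Real VIS packages use animated graphics rather than printed bars.

```python
import random

def queue_trace(n_servers, minutes=30, arrival_prob=0.6, service_prob=0.4, seed=11):
    """Return the waiting-line length at the end of each simulated minute."""
    rng = random.Random(seed)
    waiting = 0  # customers standing in line
    busy = 0     # servers currently serving a customer
    trace = []
    for _ in range(minutes):
        # Each busy server finishes its customer with probability service_prob.
        busy -= sum(rng.random() < service_prob for _ in range(busy))
        # At most one customer arrives per minute.
        if rng.random() < arrival_prob:
            waiting += 1
        # Free servers pull customers from the front of the line.
        started = min(waiting, n_servers - busy)
        waiting -= started
        busy += started
        trace.append(waiting)
    return trace

def show(trace, label):
    """Print one bar per minute; the bar length is the waiting-line length."""
    print(label)
    for minute, length in enumerate(trace, start=1):
        print(f"  t={minute:2d} | {'#' * length}")

# What-if question: how does the line evolve with one server versus two?
show(queue_trace(n_servers=1), "One server")
show(queue_trace(n_servers=2), "Two servers")
```

Seeing the line build up and drain over time, rather than only an end-of-run average, is the point the text makes about visual models helping managers understand the situation.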

The VIM approach can also be used in conjunction with artificial intelligence. Integration of the two techniques adds several capabilities that range from the ability to build systems graphically to learning about the dynamics of the system. These systems, especially those developed for the military and the video-game industry, have “thinking” characters who can behave with a relatively high level of intelligence in their interactions with users.

Simulation Software

Hundreds of simulation packages are available for a variety of decision-making situations. Many run as Web-based systems. ORMS Today publishes a periodic review of simulation software. One recent review (current as of October 2018) is located at https://www.informs.org/ORMS-Today/Public-Articles/October-Volume-44-Number-5/Simulation-Software-Survey-Simulation-new-and-improved-reality-show (accessed November 2018). PC software packages include Analytica (Lumina Decision Systems, lumina.com) and the Excel add-ins Crystal Ball (now sold by Oracle as Oracle Crystal Ball, oracle.com) and @RISK (Palisade Corp., palisade.com). A major commercial software for discrete event simulation has been Arena (sold by Rockwell Intl., arenasimulation.com). Original developers of Arena have now developed Simio (simio.com), a user-friendly VIS software. Another popular discrete event VIS software is ExtendSim (extendsim.com). SAS has a graphical analytics software package called JMP that also includes a simulation component in it.

Application Case 8.9 Improving Job-Shop Scheduling Decisions through RFID: A Simulation-Based Assessment

A manufacturing services provider of complex optical and electromechanical components seeks to gain efficiency in its job-shop scheduling decision because the current shop-floor operations suffer from a few issues:

• There is no system to record when the work-in-process (WIP) items actually arrive at or leave operating workstations and how long those WIPs actually stay at each workstation.

• The current system cannot monitor or keep track of the movement of each WIP in the production line in real time.

As a result, the company is facing two main issues at this production line: high backlogs and high costs of overtime to meet the demand. In addition, the upstream cannot respond to unexpected incidents such as changes in demand or material shortages quickly enough and revise schedules in a cost-effective manner. The company is considering implementing RFID on a production line. However, the company does not know if going to this major expense of adding RFID chips on production boxes, installing RFID readers throughout the production line, and of course, the systems to process this information will result in any real gains. So one question is to explore any new production scheduling changes that may result by investing in RFID infrastructure.

Methodology

Because exploring the introduction of any new system in the physical production system can be extremely expensive or even disruptive, a discrete event simulation model was developed to examine how tracking and traceability through RFID can facilitate job-shop production scheduling activities. A visibility-based scheduling (VBS) rule that utilizes the real-time traceability systems to track those WIPs, parts and components, and raw materials in shop-floor operations was proposed. A simulation approach was applied to examine the benefit of the VBS rule against the classical scheduling rules: the first-in-first-out and earliest due date dispatching rules. The simulation model was developed using Simio. Simio is a 3-D simulation modeling software package that employs an object-oriented approach to modeling and has recently been used in many areas such as factories, supply chains, healthcare, airports, and service systems.
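The two classical dispatching rules named above can be compared with a very small Python sketch. This example does not implement the VBS rule or the Simio model from the case; the four jobs, their processing times, and their due dates are hypothetical, and a single workstation with all jobs available at time zero is assumed. It only shows how the order in which jobs are dispatched changes lateness.

```python
from typing import NamedTuple

class Job(NamedTuple):
    name: str
    processing_time: int  # hours needed at the workstation
    due_date: int         # hours from now

def max_lateness(sequence):
    """Process jobs in the given order on one workstation and return the
    largest (completion time - due date); negative means early."""
    clock = 0
    worst = float("-inf")
    for job in sequence:
        clock += job.processing_time
        worst = max(worst, clock - job.due_date)
    return worst

# Hypothetical jobs, listed in their order of arrival.
jobs = [Job("A", 4, 20), Job("B", 3, 5), Job("C", 6, 12), Job("D", 2, 8)]

fifo = jobs                                        # first-in-first-out
edd = sorted(jobs, key=lambda job: job.due_date)   # earliest due date

print("FIFO order:", [job.name for job in fifo], "max lateness:", max_lateness(fifo))
print("EDD order: ", [job.name for job in edd], "max lateness:", max_lateness(edd))
```

A full simulation study such as the one in the case would evaluate rules like these under random arrivals, multiple workstations, and many replications rather than on one fixed list of jobs.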

Figure 8.13 presents a screenshot of the Simio interface panel of this production line. The parameter estimates used for the initial state in the simulation model include weekly demand and forecast, process flow, number of workstations, number of shop-floor operators, and operating time at each workstation. In addition, parameters of some of the input data such as RFID tagging time, information retrieving time, or system updating time are estimated from a pilot study and from the subject matter experts. Figure 8.14 presents the process view of the simulation model where specific simulation commands are implemented and coded. Figures 8.15 and 8.16 present the standard report view and pivot grid report of the simulation model. The standard report and pivot grid format provide a very quick method to find specific statistical results such as average, percent, total, maximum, or minimum values of variables assigned and captured as an output of the simulation model.

Results

The results of the simulation suggest that an RFID-based scheduling rule generates better performance compared to traditional scheduling rules with regard to processing time, production time, resource utilization, backlogs, and productivity. The company can take these productivity gains and perform cost/benefit analyses in making the final investment decisions.

FIGURE 8.13 Simio Interface View of the Simulation System. Source: Used with permission from Simio LLC.

FIGURE 8.14 Process View of the Simulation Model. Source: Used with permission from Simio LLC.

FIGURE 8.15 Standard Report View. Source: Used with permission from Simio LLC.

Questions for Discussion

1. In situations such as what this case depicts, what other approaches can one take to analyze investment decisions?

2. How would one save time if an RFID chip can tell the exact location of a product in process?

3. Research to learn about the applications of RFID sensors in other settings. Which one do you find most interesting?

Source: Based on Chongwatpol, J., & Sharda, R. (2013). RFID-enabled track and traceability in job-shop scheduling environment. European Journal of Operational Research, 227(3), 453–463, http://dx.doi.org/10.1016/j.ejor.2013.01.009.

For information about simulation software, see the Society for Modeling and Simulation International (scs.org) and the annual software surveys at ORMS Today (https://www.informs.org/ORMS-Today/).

FIGURE 8.16 Pivot Grid Report from a Simio Run. Source: Used with permission from Simio LLC.

SECTION 8.10 REVIEW QUESTIONS

1. Define visual simulation and compare it to conventional simulation.
2. Describe the features of VIS (i.e., VIM) that make it attractive for decision makers.
3. How can VIS be used in operations management?
4. How is an animated film like a VIS application?

Chapter Highlights

• Models play a major role in DSS because they are used to describe real decision-making situations. There are several types of models.

• Models can be static (i.e., a single snapshot of a situation) or dynamic (i.e., multiperiod).

• Analysis is conducted under assumed certainty (which is most desirable), risk, or uncertainty (which is least desirable).

• Influence diagrams graphically show the interrelationships of a model. They can be used to enhance the use of spreadsheet technology.

• Spreadsheets have many capabilities, including what-if analysis, goal seeking, programming, database management, optimization, and simulation.

• Decision tables and decision trees can model and solve simple decision-making problems.

• Mathematical programming is an important optimization method.

• LP is the most common mathematical programming method. It attempts to find an optimal allocation of limited resources under organizational constraints.

• The major parts of an LP model are the objective function, the decision variables, and the constraints.

• Multicriteria decision-making problems are difficult but not impossible to solve.

• What-if and goal seeking are the two most common methods of sensitivity analysis.

• Many DSS development tools include built-in quantitative models (e.g., financial, statistical) or can easily interface with such models.

• Simulation is a widely used DSS approach that involves experimentation with a model that represents the real decision-making situation.

• Simulation can deal with more complex situations than optimization, but it does not guarantee an optimal solution.

• There are many different simulation methods. Some that are important for decision making include Monte Carlo simulation and discrete event simulation.

• VIS/VIM allows a decision maker to interact directly with a model and shows results in an easily understood manner.

Key Terms

certainty
decision analysis
decision table
decision tree
decision variable
discrete event simulation
dynamic models
environmental scanning and analysis
forecasting
goal seeking
influence diagram
intermediate result variable
linear programming (LP)
mathematical programming
Monte Carlo simulation
multidimensional analysis (modeling)
multiple goals
optimal solution
parameter
quantitative model
result (outcome) variable
risk
risk analysis
sensitivity analysis
simulation
static models
uncertainty
uncontrollable variable
visual interactive modeling (VIM)
visual interactive simulation (VIS)
what-if analysis

Questions for Discussion

1. How does prescriptive analytics relate to descriptive and predictive analytics?

2. Explain the differences between static and dynamic models. How can one evolve into the other?

3. What is the difference between an optimistic approach and a pessimistic approach to decision making under assumed uncertainty?

4. Explain why solving problems under uncertainty sometimes involves assuming that the problem is to be solved under conditions of risk.

5. Excel is probably the most popular spreadsheet software for PCs. Why? What can we do with this package that makes it so attractive for modeling efforts?

6. Explain how decision trees work. How can a complex problem be solved by using a decision tree?

7. Explain how LP can solve allocation problems.

8. What are the advantages of using a spreadsheet package to create and solve LP models? What are the disadvantages?

9. What are the advantages of using an LP package to create and solve LP models? What are the disadvantages?

10. What is the difference between decision analysis with a single goal and decision analysis with multiple goals (i.e., criteria)? Explain the difficulties that may arise when analyzing multiple goals.

11. Explain how multiple goals can arise in practice.

12. Compare and contrast what-if analysis and goal seeking.

13. Describe the general process of simulation.

14. List some of the major advantages of simulation over optimization and vice versa.

15. Many computer games can be considered visual simulation. Explain why.

16. Explain why VIS is particularly helpful in implementing recommendations derived by computers.

Exercises

Teradata University Network (TUN) and Other Hands-on Exercises

1. Explore teradatauniversitynetwork.com, and determine how models are used in the BI cases and papers.

2. Create the spreadsheet models shown in Figures 8.3 and 8.4.

a. What is the effect of a change in the interest rate from 8% to 10% in the spreadsheet model shown in Figure 8.3?

b. For the original model in Figure 8.3, what interest rate is required to decrease the monthly payments by 20%? What change in the loan amount would have the same effect?

c. In the spreadsheet shown in Figure 8.4, what is the effect of a prepayment of $200 per month? What prepayment would be necessary to pay off the loan in 25 years instead of 30 years?

3. Solve the MBI product-mix problem described in this chapter, using either Excel’s Solver or a student version of an LP solver, such as Lindo. Lindo is available from Lindo Systems, Inc., at lindo.com; others are also available (search the Web). Examine the solution (output) reports for the answers and sensitivity report. Did you get the same results as reported in this chapter? Try the sensitivity analysis outlined in the chapter; that is, lower the right-hand side of the CC-8 marketing constraint by one unit, from 200 to 199. What happens to the solution when you solve this modified problem? Eliminate the CC-8 lower-bound constraint entirely (this can be done easily by either deleting it in Solver or setting the lower limit to zero) and re-solve the problem. What happens? Using the original formulation, try modifying the objective function coefficients and see what happens.

4. Investigate via a Web search how models and their solutions are used by the U.S. Department of Homeland Security in the “war against terrorism.” Also investigate how other governments or government agencies are using models in their missions.

5. This problem was contributed by Dr. Rick Wilson of Oklahoma State University.

The recent drought has hit farmers hard. Cows are eating candy corn!

You are interested in creating a feed plan for the next week for your cattle using the following seven nontraditional feeding products: Chocolate Lucky Charms cereal, Butterfinger bars, Milk Duds, vanilla ice cream, Cap’n Crunch cereal, candy corn (because the real corn is all dead), and Chips Ahoy cookies.

Product              $/lb    Choc   Protein   TDN   Calcium
Choc Lucky Charms    2.15    YES    75        12    3
Butterfinger         7       YES    80        20    4
Milk Duds            4.25    YES    45        18    4.5
Vanilla Ice Cream    6.35    NO     65        6     12
Cap’n Crunch         5.25    NO     72        11    2
Candy Corn           4       NO     26        8     1
Chips Ahoy           6.75    YES    62        12    5

The table shows each product's cost per pound, along with the protein units, the total digestible nutrients (TDN), and the calcium units it contributes per pound.

You estimate that the nontraditional feeding products must together contribute the following amounts of nutrients: at least 20,000 units of protein, at least 4,025 units of TDN, and at least 1,000 but no more than 1,200 units of calcium.

There are some other miscellaneous requirements as well.

• The chocolate in your overall feed plan (in pounds) cannot exceed the amount of nonchocolate poundage. Whether a product is considered chocolate or not is shown in the table (YES = chocolate, NO = not chocolate).

• No one feeding product can make up more than 25% of the total pounds needed to create an acceptable feed mix.

• There are two cereals (Chocolate Lucky Charms and Cap’n Crunch). Combined, they can be no more than 40% (in pounds) of the total mix required to meet the mix requirements.

Determine the optimal levels of the seven products to create your weekly feed plan that minimizes cost. Note that product amounts must be whole numbers of pounds (no fractional values).

6. This exercise was also contributed by Dr. Rick Wilson of Oklahoma State University to illustrate the modeling capabilities of Excel Solver.

National signing day for rugby recruiting season 2018 has been completed. Now, as the recruiting coordinator for the San Diego State University Aztec rugby team, it is time to analyze the results and plan for 2019.

You’ve developed complex analytics and data collection processes and applied them for the past few recruiting seasons to help you develop a plan for 2019. Basically, you have divided the area in which you actively recruit rugby players into eight different regions. Each region has a per-target cost, a “star rating” (average recruit “star” ranking, from 0 to 5, similar to what Rivals uses for football), a yield or acceptance rate percentage (the percentage of targeted recruits who come to SDSU), and a visibility measure, which represents a measure of how much publicity SDSU gets for recruiting in that region, measured per target (increased visibility will enhance future recruiting efforts).

Region     Cost/target   Avg star rating   Acceptance rate %   Visibility per target
Region1    125           3                 40                  0
Region2    89            2.5               42                  0
Region3    234           3.25              25                  2
Region4    148           3.1               30                  3
Region5    321           3.5               22                  7
Region6    274           3.45              20                  4
Region7    412           3.76              17                  5
Region8    326           3.2               18                  5.5

Your goal is to create a LINEAR mathematical model that determines the number of target recruits you should pursue in each region in order to have an estimated yield (expected number) of at least 25 rugby recruits for next year while minimizing cost. (For example, Region 1 has a yield of 40%: if we target 10 people, the expected number that will come is 0.4 * 10 = 4.0.)

In determining the optimal number of targets in each region (which, not surprisingly, should be integer values), you must also satisfy the following conditions:

• No more than 20% of the total targets (not the expected number of recruits) should be from any one region.

• Each region should have at least 4% of the total targets (again, not the expected number of recruits, but the number of targets).

• The average star rating of the targets must be at least equal to 3.3.

• The average visibility value of the targets must be at least equal to 3.5.

Off on the recruiting trail you go!

7. This exercise was also contributed by Dr. Rick Wilson of Oklahoma State University.

You are the Water Resources Manager for Thirstiville, OK, and are working out the details for next year’s contracts with three different entities to supply water to your town. Each water source (A, B, C) provides water of different quality. The quality assessment is aggregated together in two values P1 and P2, representing a composite of contaminants, such as THMs, HAAs, and so on. The sources each have a maximum of water that they can provide (measured in thousands of gallons), a minimum that we must purchase from them, and a per-thousand-gallon cost.

            MIN     MAX     P1     P2     COST
Source A    400     1000    4      1      0.25
Source B    1000    2500    3.5    3      0.175
Source C    0       775     5      2.5    0.20

On the product end, you must procure water such that you can provide three distinct water products for next year (this is all being done at the aggregate “city” level). You must provide drinking water to the city, and then water to two different wholesale clients (this is commonly done by municipalities). The table below shows requirements for these three products, and the “sales” or revenue that you get from each customer (by thousand gallons, same scale as the earlier cost).

For each of the three water products/customers, MIN is the minimum that we have to provide, MAX is the maximum that we can provide (it is reasonable to be given a targeted range of product to provide to our customers), P1 and P2 are the maximum weighted averages allowed for each quality “category” (the contaminants) in the water blended together for that customer, and SALES is the sales price.

            MIN         MAX         P1         P2      SALES
Drinking    1500        1700        3.75       2.25    0.35
WSale 1     250         325         No Req.    2.75    0.4
WSale 2     No limit    No limit    4          2       0.425

Yes, the second wholesale customer (WSale 2) will take as much water as you can blend together for them.

Obviously, water from all three sources will need to be blended together to meet the Thirstiville customer requirements. There is one more requirement: for each of the three products (drinking water and the two wholesale clients), Source A and Source B both individually (yes, separately) must make up at least 20% of the total amount of the production of that particular water type. We do not have such a requirement for Source C.

Create an appropriate LP model that determines how to meet customer water demand for next year while maximizing profit (sales less costs). Summarize your results (something more than telepathy; say, some sort of table of data beyond the model solution). The summary must use words and indicate how much water we should promise to buy from our three sources. Integers are not required.
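When building models like the ones in these exercises, it can help to check a candidate answer against the stated constraints independently of the solver. The sketch below does this for the feed plan of Exercise 5, using the data from that exercise's table. It is a feasibility checker only, not an optimizer, and it does not give the exercise's solution. The function names plan_cost and violations are hypothetical, and the percentage limits are interpreted (as an assumption) as shares of the plan's own total pounds.

```python
# Data from the Exercise 5 table:
# name -> (cost $/lb, is_chocolate, protein/lb, TDN/lb, calcium/lb)
PRODUCTS = {
    "Choc Lucky Charms": (2.15, True, 75, 12, 3),
    "Butterfinger": (7.00, True, 80, 20, 4),
    "Milk Duds": (4.25, True, 45, 18, 4.5),
    "Vanilla Ice Cream": (6.35, False, 65, 6, 12),
    "Cap'n Crunch": (5.25, False, 72, 11, 2),
    "Candy Corn": (4.00, False, 26, 8, 1),
    "Chips Ahoy": (6.75, True, 62, 12, 5),
}
CEREALS = ("Choc Lucky Charms", "Cap'n Crunch")


def plan_cost(plan):
    """Total cost of a plan given as {product name: pounds}."""
    return sum(PRODUCTS[name][0] * lbs for name, lbs in plan.items())


def violations(plan):
    """Return a list of violated requirements (an empty list means feasible)."""
    problems = []
    if any(lbs < 0 or lbs != int(lbs) for lbs in plan.values()):
        problems.append("amounts must be nonnegative whole pounds")

    total = sum(plan.values())

    def nutrient(index):
        return sum(PRODUCTS[name][index] * lbs for name, lbs in plan.items())

    protein, tdn, calcium = nutrient(2), nutrient(3), nutrient(4)
    if protein < 20000:
        problems.append("protein below 20,000 units")
    if tdn < 4025:
        problems.append("TDN below 4,025 units")
    if not 1000 <= calcium <= 1200:
        problems.append("calcium outside 1,000 to 1,200 units")

    chocolate = sum(lbs for name, lbs in plan.items() if PRODUCTS[name][1])
    if chocolate > total - chocolate:
        problems.append("chocolate pounds exceed nonchocolate pounds")

    # Percentage limits are compared in integer form to avoid rounding issues.
    for name, lbs in plan.items():
        if 100 * lbs > 25 * total:
            problems.append(f"{name} exceeds 25% of the mix")
    cereal = sum(plan.get(name, 0) for name in CEREALS)
    if 100 * cereal > 40 * total:
        problems.append("cereals exceed 40% of the mix")
    return problems


# Arbitrary trial plan (an assumption for demonstration): 50 lb of everything.
trial = {name: 50 for name in PRODUCTS}
print(plan_cost(trial), violations(trial))
```

A checker like this is useful for catching a mistyped constraint in a spreadsheet or LP package: any plan the solver reports should come back with an empty list.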

References

“Canadian Football League Uses Frontline Solvers to Optimize Scheduling in 2016.” www.solver.com/news/canadian-football-league-uses-frontline-solvers-optimize-scheduling-2016 (accessed September 2018).

“Risk-Based Planning and Scheduling (RPS) with Simio.” www.simio.com, Simio LLC, www.simio.com/about-simio/why-simio/simio-RPS-risk-based-planning-and-scheduling.php (accessed September 2018).

Arsham, H. (2006a). “Modeling and Simulation Resources.” home.ubalt.edu/ntsbarsh/Business-stat/RefSim.htm (accessed November 2018).

Arsham, H. (2006b). “Decision Science Resources.” home.ubalt.edu/ntsbarsh/Business-stat/Refop.htm (accessed November 2018).

Arthur, Molly. “Simulation-Based Production Scheduling System.” www.simio.com, Simio LLC, 2014, www.simio.com/case-studies/A-Steel-Tubing-Manufacturer-Expects-More-Market-Share/A-Steel-Tubing-Manufacturer-Expects-More-Market-Share.pdf (accessed September 2018).

Bailey, M. J., J. Snapp, S. Yetur, J. S. Stonebraker, S. A. Edwards, A. Davis, & R. Cox. (2011). “Practice Summaries: American Airlines Uses Should-Cost Modeling to Assess the Uncertainty of Bids for Its Full-Truckload Shipment Routes.” Interfaces, 41(2), 194–196.

Banks, J., & Gibson, R. R. (2009). “Seven Sins of Simulation Practice.” INFORMS Analytics, 24–27. www.analytics-magazine.org/summer-2009/193-strategic-problems-modeling-the-market-space (accessed September 2018).

Bowers, M. R., C. E. Noon, W. Wu, & J. K. Bass. (2016). “Neonatal Physician Scheduling at the University of Tennessee Medical Center.” Interfaces, 46(2), 168–182.

Chongwatpol, J., & R. Sharda. (2013). “RFID-Enabled Track and Traceability in Job-Shop Scheduling Environment.” European Journal of Operational Research, 227(3), 453–463, http://dx.doi.org/10.1016/j.ejor.2013.01.009.

Christiansen, M., K. Fagerholt, G. Hasle, A. Minsaas, & B. Nygreen. (2009, April). “Maritime Transport Optimization: An Ocean of Opportunities.” OR/MS Today, 36(2), 26–31.

Clemen, R. T., & Reilly, T. (2000). Making Hard Decisions with Decision Tools Suite. Belmont, MA: Duxbury Press.

Dilkina, B. N., & W. S. Havens. “The U.S. National Football League Scheduling Problem.” Intelligent Systems Lab, www.cs.cornell.edu/~bistra/papers/NFLsched1.pdf (accessed September 2018).

Farasyn, I., K. Perkoz, & W. Van de Velde. (2008, July/August). “Spreadsheet Models for Inventory Target Setting at Procter & Gamble.” Interfaces, 38(4), 241–250.

Goodwin, P., & Wright, G. (2000). Decision Analysis for Management Judgment, 2nd ed. New York: Wiley.

Hurley, W. J., & M. Balez. (2008, July/August). “A Spreadsheet Implementation of an Ammunition Requirements Planning Model for the Canadian Army.” Interfaces, 38(4), 271–280.

ingrammicrocommerce.com. “Customers.” https://www.ingrammicrocommerce.com/customers/ (accessed July 2016).

Kearns, G. S. (2004, January–March). “A Multi-Objective, Multicriteria Approach for Evaluating IT Investments: Results from Two Case Studies.” Information Resources Management Journal, 17(1), 37–62.

Kelly, A. (2002). Decision Making Using Game Theory: An Introduction for Managers. Cambridge, UK: Cambridge University Press.

Knight, F. H. (1933). Risk, Uncertainty and Profit: With an Additional Introductory Essay Hitherto Unpublished. London School of Economics and Political Science.

Koksalan, M., & S. Zionts. (Eds.). (2001). Multiple Criteria Decision Making in the New Millennium. Berlin: Springer-Verlag.

Kontoghiorghes, E. J., B. Rustem, & S. Siokos. (2002). Computational Methods in Decision Making, Economics, and Finance. Boston: Kluwer.

Kostuk, Kent J., and K. A. Willoughby. (2012). “A Decision Support System for Scheduling the Canadian Football League.” Interfaces, 42(3), 286–295.

Manikas, A. S., J. R. Kroes, & T. F. Gattiker. (2016). “Metro Meals on Wheels Treasure Valley Employs a Low-Cost Routing Tool to Improve Deliveries.” Interfaces, 46(2), 154–167.

Mookherjee, R., J. Martineau, L. Xu, M. Gullo, K. Zhou, A. Hazlewood, X. Zhang, F. Griarte, & N. Li. (2016). “End-to-End Predictive Analytics and Optimization in Ingram Micro’s Two-Tier Distribution Business.” Interfaces, 46(1), 49–73.

Ovchinnikov, A., & J. Milner. (2008, July/August). “Spreadsheet Model Helps to Assign Medical Residents at the University of Vermont’s College of Medicine.” Interfaces, 38(4), 311–323.

Simio.com. “Cosan Case Study—Optimizing Agricultural Logistics Operations.” http://www.simio.com/case-studies/Cosan-agricultural-logistics-simulation-software-case-study/index.php (accessed September 2018).

Slaugh, V. W., M. Akan, O. Kesten, & M. U. Unver. (2016). “The Pennsylvania Adoption Exchange Improves Its Matching Process.” Interfaces, 46(2), 133–154.

Solver.com. “Optimizing Vendor Contract Awards Gets an A+.” solver.com/news/optimizing-vendor-contract-awards-gets (accessed September 2018).

Turban, E., & J. Meredith. (1994). Fundamentals of Management Science, 6th ed. Richard D. Irwin, Inc.

Wikipedia.com. Cosan. https://en.wikipedia.org/wiki/Cosan (accessed November 2018).