Discussion: Artificial Neural Network


Deep Learning and Cognitive Computing

LEARNING OBJECTIVES

■■ Learn what deep learning is and how it is changing the world of computing

■■ Know the placement of deep learning within the broad family of artificial intelligence (AI) learning methods

■■ Understand how traditional “shallow” artificial neural networks (ANN) work

■■ Become familiar with the development and learning processes of ANN

■■ Develop an understanding of the methods to shed light into the ANN black box

■■ Know the underlying concept and methods for deep neural networks

■■ Become familiar with different types of deep learning methods

■■ Understand how convolutional neural networks (CNN) work

■■ Learn how recurrent neural networks (RNN) and long short-term memory networks (LSTM) work

■■ Become familiar with the computer frameworks for implementing deep learning

■■ Know the foundational details about cognitive computing

■■ Learn how IBM Watson works and what types of applications it can be used for

CHAPTER 6

Part II • Predictive Analytics/Machine Learning

Artificial intelligence (AI) is making a re-entrance into the world of computing and into our lives, this time far stronger and much more promising than before. This unprecedented re-emergence and the new level of expectations can largely be attributed to deep learning and cognitive computing. These two latest buzzwords define the leading edge of AI and machine learning today. Evolving out of the traditional artificial neural networks (ANN), deep learning is changing the very foundation of how machine learning works. Thanks to large collections of data and improved computational resources, deep learning is making a profound impact on how computers can discover complex patterns using the self-extracted features from the data (as opposed to a data scientist providing the feature vector to the learning algorithm). Cognitive computing (first popularized by IBM Watson and its success against the best human players in the game show Jeopardy!) makes it possible to deal with a new class of problems, the type of problems that are thought to be solvable only by human ingenuity and creativity, ones that are characterized by ambiguity and uncertainty. This chapter covers the concepts, methods, and application of these two cutting-edge AI technology trends.

6.1 Opening Vignette: Fighting Fraud with Deep Learning and Artificial Intelligence
6.2 Introduction to Deep Learning
6.3 Basics of “Shallow” Neural Networks
6.4 Process of Developing Neural Network–Based Systems
6.5 Illuminating the Black Box of ANN
6.6 Deep Neural Networks
6.7 Convolutional Neural Networks
6.8 Recurrent Networks and Long Short-Term Memory Networks
6.9 Computer Frameworks for Implementation of Deep Learning
6.10 Cognitive Computing

6.1 OPENING VIGNETTE: Fighting Fraud with Deep Learning and Artificial Intelligence

THE BUSINESS PROBLEM

Danske Bank is a Nordic universal bank with strong local roots and bridges to the rest of the world. Founded in October 1871, Danske Bank has helped people and businesses in the Nordics realize their ambitions for over 145 years. Its headquarters is in Denmark, with core markets in Denmark, Finland, Norway, and Sweden.

Mitigating fraud is a top priority for banks. According to the Association of Certified Fraud Examiners, businesses lose more than $3.5 trillion each year to fraud. The problem is pervasive across the financial industry and is becoming more prevalent and sophisticated each month. As customers conduct more banking online across a wider variety of channels and devices, there are more opportunities for fraud to occur. Adding to the problem, fraudsters are becoming more creative and technologically savvy (they are also using advanced technologies such as machine learning), and new schemes to defraud banks are evolving rapidly.

Old methods for identifying fraud, such as using human-written rules engines, catch only a small percentage of fraud cases and produce a very high number of false positives. While false negatives end up costing the bank money, chasing after a large number of false positives not only costs time and money but also blemishes customer trust and satisfaction. To improve probability predictions and identify a much higher percentage of actual cases of fraud while reducing false alarms, banks need new forms of analytics. This includes using artificial intelligence.

Danske Bank, like other global banks, is seeing a seismic shift in customer interactions. In the past, most customers handled their transactions in a bank branch. Today, almost all interactions take place digitally through a mobile phone, tablet, ATM, or call center. This provides more “surface area” for fraud to occur. The bank needed to modernize its fraud detection defenses. It struggled with a low 40 percent fraud detection rate and was managing up to 1,200 false positives per day, and 99.5 percent of all cases the bank was investigating were not fraud related. That large number of false alarms required a substantial investment of people, time, and money to investigate what turned out to be dead ends. Working with Think Big Analytics, a Teradata company, Danske Bank made a strategic decision to apply innovative analytic techniques, including AI, to better identify instances of fraud while reducing false positives.
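To make the terms in the preceding paragraph concrete, the following minimal Python sketch computes a detection rate (true positive rate), a false positive rate, and the share of investigated cases that are real fraud (precision) from confusion-matrix counts. The function name and the daily counts are hypothetical; the counts were chosen only so that the resulting ratios resemble the 40 percent and 99.5 percent figures quoted above, and they are not Danske Bank data.

```python
def detection_metrics(tp, fp, fn, tn):
    """Return common confusion-matrix rates as a dictionary."""
    return {
        # Share of actual fraud cases that were flagged (detection rate).
        "true_positive_rate": tp / (tp + fn),
        # Share of legitimate transactions that were wrongly flagged.
        "false_positive_rate": fp / (fp + tn),
        # Share of flagged (investigated) cases that were actually fraud.
        "precision": tp / (tp + fp),
    }


# Hypothetical daily counts (an assumption for illustration only):
# 1,200 flagged cases, of which 6 are fraud; 9 fraud cases are missed.
metrics = detection_metrics(tp=6, fp=1194, fn=9, tn=98791)

print(f"Detection rate: {metrics['true_positive_rate']:.1%}")
print(f"Flagged cases that are not fraud: {1 - metrics['precision']:.1%}")
```

With these assumed counts, 6 of 15 fraud cases are caught (a 40 percent detection rate), while 1,194 of the 1,200 flagged cases (99.5 percent) are false alarms.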


THE SOLUTION: DEEP LEARNING ENHANCES FRAUD DETECTION

Danske Bank integrated deep learning with graphics processing unit (GPU) appliances that were also optimized for deep learning. The new software system helps the analytics team to identify potential cases of fraud while intelligently avoiding false positives. Operational decisions are shifted from users to AI systems. However, human intervention is still necessary in some cases. For example, the model can identify anomalies, such as debit card purchases taking place around the world, but analysts are needed to determine whether that is fraud or a bank customer simply made an online purchase that sent a payment to China and then bought an item the next day from a retailer based in London.

Danske Bank’s analytic approach employs a “champion/challenger” methodology. With this approach, deep learning systems compare models in real time to determine which one is most effective. Each challenger processes data in real time, learning as it goes which traits are more likely to indicate fraud. If a process dips below a certain threshold, the model is fed more data, such as the geolocation of customers or recent ATM transactions. When a challenger outperforms other challengers, it transforms into a champion, giving the other models a roadmap to successful fraud detection.
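The champion/challenger idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two models, the field names (`amount`, `foreign`), the four labeled transactions, and the use of plain accuracy as the comparison metric are all hypothetical, and the real-time learning described above is not modeled. The source does not say which metric or models Danske Bank actually used.

```python
def select_champion(models, labeled_transactions):
    """Score every model on labeled transactions; return (best name, scores)."""
    def accuracy(predict):
        correct = sum(
            1 for features, is_fraud in labeled_transactions
            if predict(features) == is_fraud
        )
        return correct / len(labeled_transactions)

    scores = {name: accuracy(predict) for name, predict in models.items()}
    champion = max(scores, key=scores.get)
    return champion, scores


# Hypothetical models: a fixed rule and a challenger that uses more data.
models = {
    "rules_engine": lambda t: t["amount"] > 1000,
    "challenger": lambda t: t["amount"] > 500 and t["foreign"],
}

# Hypothetical labeled transactions: (features, is_fraud).
transactions = [
    ({"amount": 1200, "foreign": False}, False),
    ({"amount": 700, "foreign": True}, True),
    ({"amount": 50, "foreign": False}, False),
    ({"amount": 2000, "foreign": True}, True),
]

champion, scores = select_champion(models, transactions)
print(champion, scores)
```

On this toy data the challenger classifies all four transactions correctly and the rule only two, so the challenger would be promoted to champion.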

THE RESULTS

Danske Bank implemented a modern enterprise analytic solution leveraging AI and deep learning, and it has paid big dividends. The bank was able to:

• Realize a 60 percent reduction in false positives with an expectation to reach as high as 80 percent.

• Increase true positives by 50 percent.

• Focus resources on actual cases of fraud.

The following graph (see Figure 6.1) shows how true and false positive rates improved with advanced analytics (including deep learning). The red dot represents the old rules engine, which caught only about 40 percent of all fraud. Deep learning improved significantly upon machine learning, allowing Danske Bank to better detect fraud with much lower false positives.

Enterprise analytics is rapidly evolving and moving into new learning systems enabled by AI. At the same time, hardware and processors are becoming more powerful and specialized, and algorithms more accessible, including those available through open source. This gives banks the powerful solutions needed to identify and mitigate fraud. As Danske Bank learned, building and deploying an enterprise-grade analytics solution that meets its specific needs and leverages its data sources delivers more value than traditional off-the-shelf tools could have provided. With AI and deep learning, Danske Bank now has the ability to better uncover fraud without being burdened by an unacceptable number of false positives. The solution also allows the bank’s engineers, data scientists, lines of business, and investigative officers from Interpol, local police, and other agencies to collaborate to uncover fraud, including sophisticated fraud rings. With its enhanced capabilities, the enterprise analytic solution is now being used across other business areas of the bank to deliver additional value.

Because these technologies are still evolving, implementing deep learning and AI solutions can be difficult for companies to achieve on their own. They can benefit by partnering with a company that has the proven capabilities to implement technology-enabled solutions that deliver high-value outcomes. As shown in this case, Think Big Analytics, a Teradata company, has the expertise to configure specialized hardware and software frameworks to enable new operational processes. The project entailed integrating open-source solutions, deploying production models, and then applying deep learning analytics to extend and improve the models. A framework was created to manage and track the models in the production system and to make sure the models could be trusted. These models enabled the underlying system to make autonomous decisions in real time that aligned with the bank’s procedural, security, and high-availability guidelines. The solution provided new levels of detail, such as time series and sequences of events, to better assist the bank with its fraud investigations. The entire solution was implemented very quickly, going from kickoff to live in only five months. Figure 6.2 shows a generalized framework for AI and deep learning–based enterprise-level analytics solutions.

In summary, Danske Bank undertook a multi-step project to productionize machine-learning techniques while developing deep learning models to test those techniques. The integrated models helped identify the growing problem of fraud. For a visual summary, watch the video (https://www.teradata.com/Resources/Videos/Danske-Bank-Innovating-in-Artificial-Intelligence) and/or read the blog (http://blogs.teradata.com/customers/danske-bank-innovating-artificial-intelligence-deep-learning-detect-sophisticated-fraud/).

[Figure 6.1 plots True Positive Rate (vertical axis, 0.0 to 1.0) against False Negative Rate (horizontal axis, 0.0 to 0.10). Annotations on the plot: Deep Learning, Classic Machine Learning, Rules Engine, Random prediction. Legend: Ensemble (area = 0.89), CNN (area = 0.95), ResNet (area = 0.94), LSTM (area = 0.90), Rule Engine, Random predictions.]

FIGURE 6.1 Deep Learning Improves Both True Positives and True Negatives.

QUESTIONS FOR THE OPENING VIGNETTE

1. What is fraud in banking?
2. What are the types of fraud that banking firms are facing today?
3. What do you think are the implications of fraud on banks and on their customers?
4. Compare the old and new methods for identifying and mitigating fraud.
5. Why do you think deep learning methods provided better prediction accuracy?
6. Discuss the trade-off between false positive and false negative (type 1 and type 2 errors) within the context of predicting fraudulent activities.

WHAT WE CAN LEARN FROM THIS VIGNETTE

As you will see in this chapter, AI in general and the methods of machine learning in particular are evolving and advancing rapidly. The use of large digitized data sources, both from inside and outside the organization, both structured and unstructured, along with advanced computing systems (software and hardware combinations), has paved the way toward dealing with problems that were thought to be unsolvable just a few years ago. Deep learning and cognitive computing (as the ramifications of the cutting edge in AI systems) are helping enterprises to make accurate and timely decisions by harnessing the rapidly expanding Big Data resources. As shown in this opening vignette, this new generation of AI systems is capable of solving problems much better than its older counterparts. In the domain of fraud detection, traditional methods have always been only marginally useful, having higher than desired false positive rates and causing unnecessary investigations and thereby dissatisfaction for customers. As difficult as problems such as fraud detection are, new AI technologies like deep learning are making them solvable with a high level of accuracy and applicability.

Source: Teradata Case Study. “Danske Bank Fights Fraud with Deep Learning and AI.” https://www.teradata.com/Resources/Case-Studies/Danske-Bank-Fight-Fraud-With-Deep-Learning-and-AI (accessed August 2018). Used with permission.

[Figure 6.2 depicts a four-part framework (parts numbered 1 to 4) carried out by cross-functional teams. Labels in the diagram include Analyze Data, Engineer, Simulate, Validate, Integrate, Test, Live Test, Production, Go Live, Investigate, Insights, Handover, Training, Mentoring, Initial Wins, and Leverageable APIs. The four parts are described as follows.]

AI as-a-Service: Manage iterative, stage-gate process for analytic models from development to handover to operations.

AI Strategy: Analyze business priorities and identify AI use cases. Review key enterprise AI capabilities and provide recommendations and next steps for customers to successfully get value from AI.

AI Rapid Analytic Consulting Engagement™ (RACE): Use AI exploration to test use cases and provide a proof of value for AI approaches.

AI Foundation: Operationalize use cases through data science and engineering; build and deploy a deep learning platform, integrating data sources, models, and business processes.

FIGURE 6.2 A Generalized Framework for AI and Deep Learning–Based Analytics Solutions.

6.2 INTRODUCTION TO DEEP LEARNING

About a decade ago, conversing with an electronic device (in human language, intelligently) would have been inconceivable, something that could only be seen in sci-fi movies. Today, however, thanks to the advances in AI methods and technologies, almost everyone has experienced this unthinkable phenomenon. You probably have already asked Siri or Google Assistant several times to dial a number from your phone address book or to find an address and give you the specific directions while you were driving. Sometimes when you were bored in the afternoon, you may have asked Google Home or Amazon’s Alexa to play some music in your favorite genre on the device or your TV. You might have been surprised at times when you uploaded a group photo of your friends on Facebook and observed its tagging suggestions where the name tags often exactly match your friends’ faces in the picture. Translating a manuscript from a foreign language does not require hours of struggling with a dictionary; it is as easy as taking a picture of that manuscript in the Google Translate mobile app and giving it a fraction of a second. These are only a few of the many, ever-increasing applications of deep learning that have promised to make life easier for people.

Deep learning, as the newest and perhaps at this moment the most popular member of the AI and machine-learning family, has a goal similar to those of the other machine-learning methods that came before it: mimic the thought process of humans by using mathematical algorithms to learn from data in much the same way that humans learn. So, what is really different (and advanced) in deep learning? Here is the most commonly cited differentiating characteristic of deep learning over traditional machine learning. The performance of traditional machine-learning algorithms such as decision trees, support vector machines, logistic regression, and neural networks relies heavily on the representation of the data. That is, only if we (analytics professionals or data scientists) provide those traditional machine-learning algorithms with relevant and sufficient pieces of information (a.k.a. features) in proper format are they able to “learn” the patterns and thereby perform their prediction (classification or estimation), clustering, or association tasks with an acceptable level of accuracy. In other words, these algorithms need humans to manually identify and derive features that are theoretically and/or logically relevant to the objectives of the problem at hand and feed these features into the algorithm in a proper format. For example, in order to use a decision tree to predict whether a given customer will return (or churn), the marketing manager needs to provide the algorithm with information such as the customer’s socioeconomic characteristics (income, occupation, educational level, and so on), along with demographic and historical interactions/transactions with the company. But the algorithm itself is not able to define such socioeconomic characteristics and extract such features, for instance, from survey forms completed by the customer or obtained from social media.
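As a small illustration of what “manually providing features” means, the Python sketch below turns a raw customer record into a numeric feature vector. Everything in it is a hypothetical assumption made for illustration: the field names, the education encoding, and the choice of three features. The point is only that a person decides which fields become features, and how they are encoded, before a classic algorithm such as a decision tree ever sees the data.

```python
def handcrafted_features(customer):
    """Map a raw customer record to a fixed-order numeric feature vector."""
    # Hypothetical ordinal encoding chosen by the analyst, not learned from data.
    education_levels = {"high_school": 0, "bachelor": 1, "graduate": 2}
    return [
        customer["income"] / 1000.0,              # income in thousands
        education_levels[customer["education"]],  # encoded educational level
        customer["purchases_last_year"],          # historical transactions
    ]


# Hypothetical customer record.
record = {"income": 52000, "education": "bachelor", "purchases_last_year": 7}
vector = handcrafted_features(record)
print(vector)
```

A deep learning model, by contrast, would be expected to derive useful intermediate features from less processed input on its own.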

While such a structured, human-mediated machine-learning approach has been working fine for rather abstract and formal tasks, it is extremely challenging to have the approach work for some informal, yet seemingly easy (to humans), tasks such as face identification or speech recognition since such tasks require a great deal of knowledge about the world (Goodfellow et al., 2016). It is not straightforward, for instance, to train a machine-learning algorithm to accurately recognize the real meaning of a sentence spoken by a person just by manually providing it with a number of grammatical or semantic features. Accomplishing such a task requires a “deep” knowledge about the world that is not easy to formalize and explicitly present. What deep learning has added to the classic machine-learning methods is in fact the ability to automatically acquire the knowledge required to accomplish such informal tasks and consequently extract some advanced features that contribute to the superior system performance.

To develop an intimate understanding of deep learning, one should learn where it fits in the big picture of the broader AI family of methods. A simple hierarchical relationship diagram, or a taxonomy-like representation, may in fact provide such a holistic understanding. In an attempt to do this, Goodfellow and his colleagues (2016) categorized deep learning as part of the representation learning family of methods. Representation learning techniques entail one type of machine learning (which is also a part of AI) in which the emphasis is on learning and discovering features by the system in addition to discovering the mapping from those features to the output/target. Figure 6.3 uses a Venn diagram to illustrate the placement of deep learning within the overarching family of AI-based learning methods.

Figure 6.4 highlights the differences in the steps/tasks that need to be performed when building a typical deep learning model versus the steps/tasks performed when building models with classic machine-learning algorithms. As shown in the top two workflows, knowledge-based systems and classic machine-learning methods require data scientists to manually create the features (i.e., the representation) to achieve the desired output. The bottommost workflows show that deep learning enables the computer to derive some complex features from simple concepts that would be very effort intensive (or perhaps impossible in some problem situations) to be discovered by humans manually, and then it maps those advanced features to the desired output.

From a methodological viewpoint, although deep learning is generally believed to be a new area in machine learning, its initial idea goes back to the late 1980s, just a few decades after the emergence of artificial neural networks, when LeCun and colleagues (1989) published an article about applying backpropagation networks for recognizing handwritten ZIP codes. In fact, as it is being practiced today, deep learning seems to be nothing but an extension of neural networks with the idea that deep learning is able to deal with more complicated tasks with a higher level of sophistication by employing many layers of connected neurons along with much larger data sets to automatically characterize variables and solve the problems, but only at the expense of a great deal of computational effort. This very high computational requirement and the need for very large data sets were the two main reasons why the initial idea had to wait more than two decades until some advanced computational and technological infrastructure emerged for deep learning’s practical realization. Although the scale of neural networks has dramatically increased in the past decade by the advancement of related technologies, it is still estimated that having artificial deep neural networks with a number of neurons and level of complexity comparable to those of the human brain will take several more decades.

[Figure 6.3 is a Venn diagram of nested sets: Artificial Intelligence contains Machine Learning, which contains Representation Learning, which contains Deep Learning. Example methods labeled in the diagram: CNN, RNN, LSTM, autoencoders, decision trees, logistic regression, clustering, PCA/ICA, robotics, fuzzy logic, and knowledge-based/expert systems.]

FIGURE 6.3 A Venn Diagram Showing the Placement of Deep Learning within the Overarching AI-Based Learning Methods.

In addition to the computer infrastructures, as mentioned, the availability of large and feature-rich digitized data sets was another key reason for the development of successful deep learning applications in recent years. Obtaining good performance from a deep learning algorithm used to be a very difficult task that required extensive skills and experience/understanding to design task-specific networks, and therefore, not many were able to develop deep learning for practical and/or research purposes. Large training data sets, however, have greatly compensated for the lack of intimate knowledge and reduced the level of skill needed for implementing deep neural networks. Nevertheless, although the size of available data sets has exponentially increased in recent years, a great challenge, especially for supervised learning of deep networks, is now the labeling of the cases in these huge data sets. As a result, a great deal of research is ongoing, focusing on how we can take advantage of large quantities of unlabeled data for semisupervised or unsupervised learning or how we can develop methods to label examples in bulk in a reasonable time.

The following section of this chapter provides a general introduction to neural networks, from which deep learning originated. Following the overview of these “shallow” neural networks, the chapter introduces different types of deep learning architectures and how they work, some common applications of these deep learning architectures, and some popular computer frameworks to use in implementing deep learning in practice. Since, as mentioned, the basics of deep learning are the same as those of artificial neural networks, in the following section, we provide a brief coverage of the neural network architecture (namely, multilayered perceptron [MLP]-type neural networks, which were omitted in the neural network section in Chapter 5 because they were to be covered here) to focus on their mathematical principles and then explain how the various types of deep learning architectures/approaches were derived from these foundations. Application Case 6.1 provides an interesting example of what deep learning and advanced analytics techniques can achieve in the field of football.

[Figure 6.4 compares four workflows, each running from input to output. Knowledge-based systems: Input → Manually Created Representation → Output. Classic machine learning: Input → Manually Created Features → Mapping from Features → Output. Generic representation learning: Input → Auto-Created Features → Mapping from Features → Output. Deep learning: Input → Simple Features → More Advanced Features → Mapping from Features → Output.]

FIGURE 6.4 Illustration of the Key Differences between Classic Machine-Learning Methods and Representation Learning/Deep Learning (shaded boxes indicate components that are able to learn directly from data).

Application Case 6.1 Finding the Next Football Star with Artificial Intelligence

Football. Soccer. The beautiful game. Whatever you call it, the world’s most popular sport is being transformed by a Dutch start-up bringing AI to the pitch. SciSports, founded in 2012 by two self-proclaimed football addicts and data geeks, is innovating on the edge of what is possible. The sports analytics company uses streaming data and applies machine learning, deep learning, and AI to capture and analyze these data, making way for innovations in everything from player recruitment to virtual reality for fans.

Player Selection Goes High Tech

In the era of eight-figure contracts, player recruitment is a high-stakes game. The best teams are not those with the best players but the best combination of players. Scouts and coaches have used observation, rudimentary data, and intuition for decades, but savvy clubs now are using advanced analytics to identify rising stars and undervalued players. “The SciSkill Index evaluates every professional football player in the world in one universal index,” says SciSports founder and CEO Giels Brouwer. The company uses machine-learning algorithms to calculate the quality, talent, and value of more than 200,000 players. This helps clubs find talent, look for players who fit a certain profile, and analyze their opponents.

Every week, more than 1,500 matches in 210 leagues are analyzed by the SciSkill technology. Armed with this insight, SciSports partners with elite football clubs across Europe and other continents to help them sign the right players. This has led to several unexpected (and in some cases lucrative) player acquisitions. For example, a second-division Dutch player did not want to renew his contract, so he went out as a free agent. A new club reviewed the SciSkill index and found his data intriguing. That club was not too sure at first because it thought he looked clumsy in scouting, but the data told the true story. The club signed him as the third striker, and he quickly moved into a starting role and became its top goal scorer. His rights were sold at a large premium within two years, and now he is one of the top goal scorers in Dutch professional football.

Real-Time 3D Game Analysis

Traditional football data companies generate data only on players who have the ball, leaving everything else undocumented. This provides an incomplete picture of player quality. Seeing an opportunity to capture the immense amount of data regarding what happens away from the ball, SciSports developed a camera system called BallJames.

BallJames is a real-time tracking technology that automatically generates 3D data from video. Fourteen cameras placed around a stadium record every movement on the field. BallJames then generates data such as the precision, direction, and speed of the passing, sprinting strength, and jumping strength. “This forms a complete picture of the game,” says Brouwer. “The data can be used in lots of cool ways, from allowing fans to experience the game from any angle using virtual reality, to sports betting and fantasy sports.” He added that the data can even help coaches on the bench. “When they want to know if a player is getting tired, they can substitute players based on analytics.”

Machine Learning and Deep Learning

SciSports models on-field movements using machine-learning algorithms, which by nature improve on performing a task as the player gains more experience. On the pitch, BallJames works by automatically assigning a value to each action, such as a corner kick. Over time, these values change based on their success rate. A goal, for example, has a high value, but a contributing action (which may have previously had a low value) can become more valuable as the platform masters the game. Wouter Roosenburg, SciSports chief technology officer, says AI and machine learning will play an important role in the future of SciSports and football analytics in general. “Existing mathematical models model existing knowledge and insights in football, while artificial intelligence and machine learning will make it possible to discover new connections that people wouldn’t make themselves.”

To accurately compile 3D images, BallJames must distinguish between players, referees, and the ball. SAS Event Stream Processing enables real-time image recognition using deep learning models. “By combining our deep learning models into SAS Viya, we can train our models in-memory in the cloud, on our cameras or wherever our resources are,” says Roosenburg. The ability to deploy deep learning models in memory onto cameras and then do the inferencing in real time is cutting-edge science. “Having one uniform platform to manage the entire 3-D production chain is invaluable,” says Roosenburg. “Without SAS Viya, this project would not be possible.”

Adding Oomph to Open Source

Previously SciSports exclusively used open source to build models. It now benefits from an end-to-end platform that allows analytical teams to work in their language of choice and share a single, managed analytical asset inventory across the organization. According to Brouwer, this enables the firm to attract employees with different open-source skills yet still manage the production chain using one platform. “My CTO tells me he loves that our data scientists can do all the research in open source and he doesn’t have to worry about the production of the models,” says Brouwer. “What takes 100 lines of code in Python only takes five in SAS. This speeds our time to market, which is crucial in sports analytics.”

SciSports - Facts & Figures

• 1 universal index with every professional football player

• 200,000 players analyzed in SciSkill Index

• 14 cameras around the pitch enable real-time analysis

SECTION 6.2 REVIEW QUESTIONS

1. What is deep learning? What can deep learning do?
2. Compared to traditional machine learning, what is the most prominent difference of deep learning?
3. List and briefly explain different learning methods in AI.
4. What is representation learning, and how does it relate to deep learning?

6.3 BASICS OF “SHALLOW” NEURAL NETWORKS

Artificial neural networks are essentially simplified abstractions of the human brain and its complex biological networks of neurons. The human brain has a set of billions of interconnected neurons that facilitate our thinking, learning, and understanding of the world around us. Theoretically speaking, learning is nothing but the establishment and adaptation of new or existing interneuron connections. In artificial neural networks, however, neurons are processing units (also called processing elements [PEs]) that perform a set of predefined mathematical operations on the numerical values coming from the input variables or from the outputs of other neurons to create and push out their own outputs. Figure 6.5 shows a schematic representation of a single-input and single-output neuron (more accurately, the processing element in artificial neural networks).

In this figure, p represents a numerical input. Each input goes into the neuron with an adjustable weight w and a bias term b. A multiplication weight function applies the weight to the input to produce the weighted input z = wp, and a net input function shown by Σ adds the bias term to the weighted input z. The output of the net input function (n = wp + b, known as the net input) then goes through another function called the transfer (a.k.a. activation) function (shown by f ) for conversion and the production of the actual output a. In other words:

a = f (wp + b)

Since its inception, SciSports has quickly become one of the world’s fastest-growing sports analytics companies. Brouwer says the versatility of the SAS Platform has also been a major factor. “With SAS, we’ve got the ability to scale processing power up or down as needed, put models into production in real time, develop everything in one platform and integrate with open source. Our ambition is to bring real-time data analytics to billions of soccer fans all over the world. By partnering with SAS, we can make that happen.”

Questions for Case 6.1

1. What does SciSports do? Look at its Web site for more information.

2. How can advanced analytics help football teams?

3. What is the role of deep learning in solutions provided by SciSports?

Sources: SAS Customer Stories. “Finding the Next Football Star with Artificial Intelligence.” www.sas.com/en_us/customers/scisports.html (accessed August 2018). Copyright (c) 2018 SAS Institute Inc., Cary, NC, USA. All Rights Reserved. Used with permission.

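FIGURE 6.5 General Single-Input Artificial Neuron Representation. [Diagram: the input p is multiplied by the weight w to give z, the summation Σ adds the bias b to give the net input n, and the transfer function f produces the output a = f(wp + b).]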


A numerical example: if w = 2, p = 3, and b = -1, then a = f (2 * 3 - 1) = f (5).

Various types of transfer functions are commonly used in the design of neural networks. Table 6.1 shows some of the most common transfer functions and their corresponding operations. Note that in practice, selection of proper transfer functions for a network requires a broad knowledge of neural networks, including the characteristics of the data as well as the specific purpose for which the network is created.

Just to provide an illustration, if in the previous example we had a hard limit transfer function, the actual output a would be a = hardlim(5) = 1. There are some guidelines for choosing the appropriate transfer function for each set of neurons in a network. These guidelines are especially robust for the neurons located at the output layer of the network. For example, if the nature of the output for a model is binary, we are advised to use Sigmoid transfer functions at the output layer so that it produces an output between 0 and 1, which represents the conditional probability of y = 1 given x, or P(y = 1 | x). Many neural network textbooks provide and elaborate on those guidelines at different layers in a neural network with some consistency and much disagreement, suggesting that the best practices should (and usually do) come from experience.
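The single-input neuron and its numerical example can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `hardlim` and `single_input_neuron` are chosen here for convenience and do not come from any particular library.

```python
def hardlim(n):
    """Hard limit transfer function: 1 for a positive net input, otherwise 0."""
    return 1 if n > 0 else 0


def single_input_neuron(p, w, b, f):
    """Compute a = f(wp + b) for one input p, weight w, bias b, and transfer function f."""
    z = w * p   # weighted input
    n = z + b   # net input
    return f(n)


# The numerical example from the text: w = 2, p = 3, b = -1, so n = 5.
a = single_input_neuron(p=3, w=2, b=-1, f=hardlim)
print(a)  # 1
```

Passing the transfer function as an argument keeps the neuron itself unchanged when a different function is selected.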

TABLE 6.1 Common Transfer (Activation) Functions in Neural Networks

Transfer Function | Form | Operation
Hard limit | a = hardlim(n) | a = +1 if n > 0; a = 0 if n < 0
Linear | a = purelin(n) | a = n
Log-Sigmoid | a = logsig(n) | a = 1 / (1 + e^(-n))
Positive linear (a.k.a. rectified linear or ReLU) | a = poslin(n) | a = n if n > 0; a = 0 if n < 0
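The four operations in Table 6.1 translate almost directly into code. The sketch below reuses the table's function names as plain Python functions. One assumption is made: the table does not say what happens at exactly n = 0, so this sketch returns 0 there for the hard limit and positive linear functions.

```python
import math


def hardlim(n):
    """Hard limit: 1 if n > 0, otherwise 0 (n = 0 mapped to 0 by assumption)."""
    return 1.0 if n > 0 else 0.0


def purelin(n):
    """Linear: the output equals the net input."""
    return n


def logsig(n):
    """Log-sigmoid: squashes any net input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-n))


def poslin(n):
    """Positive linear (ReLU): passes positive values, zeroes out the rest."""
    return n if n > 0 else 0.0


# Compare the four functions on a few sample net inputs.
for n in (-2.0, 0.5, 5.0):
    print(n, hardlim(n), purelin(n), round(logsig(n), 4), poslin(n))
```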


Typically, a neuron has more than a single input. In that case, each individual input p_i can be shown as an element of the input vector p. Each of the individual input values would have its own adjustable weight w_i of the weight vector W. Figure 6.6 shows a multiple-input neuron with R individual inputs.

For this neuron, the net input n can be expressed as:

n = w1,1 p1 + w1,2 p2 + w1,3 p3 + . . . + w1,R pR + b

Considering the input vector p as an R × 1 vector and the weight vector W as a 1 × R vector, n can be written in matrix form as:

n = Wp + b

where Wp is a scalar (i.e., 1 × 1 vector).

FIGURE 6.6 Typical Multiple-Input Neuron with R Individual Inputs.

Moreover, each neural network is typically composed of multiple neurons connected to each other and structured in consecutive layers so that the outputs of a layer work as the inputs to the next layer. Figure 6.7 shows a typical neural network with four neurons at the input (i.e., first) layer, four neurons at the hidden (i.e., middle) layer, and a single neuron at the output (i.e., last) layer. Each of the neurons has its own weight, weighting function, bias, and transfer function and processes its own input(s) as described.

FIGURE 6.7 Typical Neural Network with Three Layers and Eight Neurons.
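The matrix form n = Wp + b and the idea of feeding one layer's outputs into the next can be sketched as follows. This is a minimal illustration. All weight, bias, and input values are hypothetical, the log-sigmoid transfer function is an arbitrary choice, and the input layer is assumed to pass its four values through unchanged.

```python
import math


def logsig(n):
    """Log-sigmoid transfer function from Table 6.1."""
    return 1.0 / (1.0 + math.exp(-n))


def neuron(p, w, b, f):
    """Multiple-input neuron: n = Wp + b, then a = f(n)."""
    n = sum(w_i * p_i for w_i, p_i in zip(w, p)) + b
    return f(n)


def layer(p, W, b, f):
    """Apply one neuron per row of W to the same input vector p."""
    return [neuron(p, w_row, b_j, f) for w_row, b_j in zip(W, b)]


# Hypothetical values, chosen only for illustration.
p = [1.0, 0.5, -1.0, 2.0]              # four inputs
W_hidden = [
    [0.2, -0.1, 0.4, 0.0],
    [0.5, 0.3, -0.2, 0.1],
    [-0.3, 0.8, 0.1, -0.4],
    [0.0, 0.2, 0.6, 0.3],
]                                       # four hidden neurons, four weights each
b_hidden = [0.1, -0.2, 0.0, 0.05]
W_output = [[0.7, -0.5, 0.2, 0.4]]      # a single output neuron
b_output = [-0.1]

hidden = layer(p, W_hidden, b_hidden, logsig)       # outputs of the hidden layer
output = layer(hidden, W_output, b_output, logsig)  # they become the next layer's inputs
print(output)
```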

While the inputs, weighting functions, and transfer functions in a given network are fixed, the values of the weights and biases are adjustable. The process of adjusting weights and biases in a neural network is what is commonly called training. In fact, in practice, a neural network cannot be used effectively for a prediction problem unless it is well trained by a sufficient number of examples with known actual outputs (a.k.a. targets). The goal of the training process is to adjust network weights and biases such that the network output for each set of inputs (i.e., each sample) is adequately close to its corresponding target value.

Application Case 6.2 provides a case where computer gaming companies are using advanced analytics to better understand and engage with their customers.

Application Case 6.2 Gaming Companies Use Data Analytics to Score Points with Players

Video gamers are a special breed. Sure, they spend a lot of time playing games, but they’re also building social networks. Like sports athletes, video game players thrive on competition. They play against other gamers online. Those who earn first place, or even second or third place, have bragging rights. And like athletes who invest a lot of time training, video gamers take pride in the number of hours they spend playing. Furthermore, as games increase in complexity, gamers take pride in developing unique skills to best their compatriots.

A New Level of Gaming

Video gaming has evolved from the days of PAC-MAN and arcades. The widespread availability of the Internet has fueled the popularity of video games by bringing them into people’s homes via a wide range of electronics such as the personal computer and mobile devices. The world of computer games is now a powerful and profitable business.

According to NewZoo’s Global Games Market Report from April 2017, the global games market in 2017 saw:

• $109 billion in revenues.

• 7.8 percent increase from the previous year.

• 2.2 billion gamers globally.

• 42 percent of the market being mobile.

Video game companies can tap into this environment and learn valuable information about their customers, especially their behaviors and the underlying motivations. These customer data enable companies to improve the gaming experience and better engage players.

Traditionally, the gaming industry appealed to its customers, the gamers, by offering striking graphics and captivating visualizations. As technology advanced, the graphics became more vivid with hi-def renditions. Companies have continued to use technology in highly creative ways to develop games that attract customers and capture their interests, which results in more time spent playing and higher affinity levels. What video game companies have not done as well is to fully utilize technology to understand the contextual factors that drive sustained brand engagement.

Know the Players

In today’s gaming world, creating an exciting product is no longer enough. Games must strongly appeal to the visual and auditory senses in an era when people expect cool graphics and cutting-edge sound effects. Games must also be properly marketed to reach highly targeted player groups. There are also opportunities to monetize gaming characters in the form of commercially available merchandise (e.g., toy store characters) or movie rights. Making a game successful requires programmers, designers, scenarists, musicians, and marketers to work together and share information. That is where gamer and gaming data come into play.

For example, the size of a gamer’s network (the number and types of people a gamer plays with or against) usually correlates with more time spent playing and more money that is spent. The more relationships gamers have, the higher the likelihood they will play more games with more people because they enjoy the experience. Network effects amplify engagement volumes.

These data also help companies better understand the types of games each individual likes to play. These insights enable the company to recommend additional games across other genres that will likely exert a positive impact on player engagement and satisfaction. Companies can also use these data in marketing campaigns to target new gamers or entice existing gamers to upgrade their memberships, for example, to premium levels.

Monetize Player Behaviors

Collaborative filtering (cFilter) is an advanced analytic function that makes automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The cFilter function supposes that if User A has the same opinion as User B on one issue, then User A is more likely to have User B’s opinion on a different issue when compared to a random user. This shows that predictions are specific to a gamer based on data from many other gamers.

Filtering systems are often used by online retailers to make product recommendations. The analytics can determine products that a customer will like based on what other shoppers who made similar purchases also bought, liked, or rated highly. There are many examples across other industries such as healthcare, finance, manufacturing, and telecommunication.

The cFilter analytic function offers several benefits to online video game companies:

• Marketers can run more effective campaigns. Connections between gamers naturally form to create clusters. Marketers can isolate common player characteristics and leverage those insights for campaigns. Conversely, they can isolate players who do not belong to a cluster and determine what unique characteristics contribute to their nonconforming behaviors.

• Companies can improve player retention. A strong membership in a community of gamers decreases the chances of churn. The greater the incentives for gamers to belong to a group of active participants, the more desire they have to engage in competitions. This increases the “stickiness” of customers and can lead to more game subscriptions.

• Data insights lead to improved customer satisfaction. Clusters indicate a desire for certain types of games that correspond to distinct gamer interests and behaviors. Companies can create gaming experiences that are unique to each player. Enticing more people to play and play longer enhances gamer satisfaction.

Once they understand why customers want to play games and uncover their relationships with other gamers, companies can create the right incentives for players to keep returning. This ensures a sustained customer base and stable revenue streams.

Boost Loyalty and Revenue

Regardless of the genre, each video game has passionate players who seek each other for competitions. The thrill of a conquest attracts avid engagement. Over time, distinct networks of gamers are formed, with each participant constructing social relationships that often lead to more frequent and intense gaming interactions.

The gaming industry is now utilizing data analytics and visualizations to discern customer behaviors better and uncover player motivations. Looking at customer segments is no longer enough. Companies are now looking at microsegments that go beyond traditional demographics like age or geographic location to understand customer preferences such as favorite games, preferred levels of difficulty, or game genres.

By gaining analytic insights into gamer strategies and behaviors, companies can create unique gaming experiences that are attuned to these behaviors. By engaging players with the games and features they desire, video game companies gain a devoted following, grow profits, and develop new revenue streams through merchandising ventures.

For a visual treat, watch a short video (https://www.teradata.com/Resources/Videos/Art-of-Analytics-The-Sword) to see how companies can use analytics to decipher gamer relationships that drive user behaviors and lead to better games.

Questions for Case 6.2

1. What are the main challenges for gaming companies?

2. How can analytics help gaming companies stay competitive?

3. What types of data can gaming companies obtain and use for analytics?

Source: Teradata Case Study. https://www.teradata.com/Resources/Case-Studies/Gaming-Companies-Use-Data-Analytics (accessed August 2018).

Technology Insight 6.1 briefly describes the common components (or elements) of a typical artificial neural network along with their functional relationships.

TECHNOLOGY INSIGHT 6.1 Elements of an Artificial Neural Network

A neural network is composed of processing elements that are organized in different ways to form the network’s structure. The basic processing unit in a neural network is the neuron. A number of neurons are then organized to establish a network of neurons. Neurons can be organized in a number of different ways; these various network patterns are referred to as topologies or network architectures (some of the most common architectures are summarized in Chapter 5). One of the most popular approaches, known as the feedforward-multilayered perceptron, allows all neurons to link the output in one layer to the input of the next layer, but it does not allow any feedback linkage (Haykin, 2009).

Processing Element (PE)

The PE of an ANN is an artificial neuron. Each neuron receives inputs, processes them, and delivers a single output as shown in Figure 6.5. The input can be raw input data or the output of other processing elements. The output can be the final result (e.g., 1 means yes, 0 means no), or it can be input to other neurons.

Network Structure

Each ANN is composed of a collection of neurons that are grouped into layers. A typical structure is shown in Figure 6.8. Note the three layers: input, intermediate (called the hidden layer), and output. A hidden layer is a layer of neurons that takes input from the previous layer and converts those inputs into outputs for further processing. Several hidden layers can be placed between the input and output layers, although it is common to use only one hidden layer. In that case, the hidden layer simply converts inputs into a nonlinear combination and passes the transformed inputs to the output layer. The most common interpretation of the hidden layer is as a feature-extraction mechanism; that is, the hidden layer converts the original inputs in the problem into a higher-level combination of such inputs.

In ANN, when information is processed, many of the processing elements perform their computations at the same time. This parallel processing resembles the way the human brain works, and it differs from the serial processing of conventional computing.


FIGURE 6.8 Neural Network with One Hidden Layer. PE: processing element (an artificial representation of a biological neuron); Xi: inputs to a PE; Y: output generated by a PE; Σ: summation function; and f: activation/transfer function. [Diagram: inputs X1, X2, and X3 enter PEs in the input layer, which connect to PEs in the hidden layer and then to the output layer producing Y1; each PE applies a weighted sum (Σ) followed by a transfer function (f).]

Input

Each input corresponds to a single attribute. For example, if the problem is to decide on approval or disapproval of a loan, attributes could include the applicant’s income level, age, and home ownership status. The numeric value, or the numeric representation of non-numeric value, of an attribute is the input to the network. Several types of data, such as text, picture, and voice, can be used as inputs. Preprocessing may be needed to convert symbolic or non-numeric data into meaningful numeric (scaled) inputs.

Outputs

The output of a network contains the solution to a problem. For example, in the case of a loan application, the output can be “yes” or “no.” The ANN assigns numeric values to the output, which may then need to be converted into categorical output using a threshold value so that the results would be 1 for “yes” and 0 for “no.”

Connection Weights

Connection weights are the key elements of an ANN. They express the relative strength (or mathematical value) of the input data or the many connections that transfer data from layer to layer. In other words, weights express the relative importance of each input to a processing element and, ultimately, to the output. Weights are crucial in that they store learned patterns of information. It is through repeated adjustments of weights that a network learns.

Summation Function

The summation function computes the weighted sums of all input elements entering each processing element. A summation function multiplies each input value by its weight and totals the values for a weighted sum. The formula for n inputs (represented with X) in one processing element is shown in Figure 6.9a, and for several processing elements, the summation function formulas are shown in Figure 6.9b.

FIGURE 6.9 Summation Function for (a) a Single Neuron/PE and (b) Several Neurons/PEs. [(a) Single neuron: Y = X1W1 + X2W2. (b) Multiple neurons: Y1 = X1W11 + X2W21; Y2 = X1W12 + X2W22; Y3 = X2W23. PE: processing element (or neuron).]

Transfer Function

The summation function computes the internal stimulation, or activation level, of the neuron. Based on this level, the neuron may or may not produce an output. The relationship between the internal activation level and the output can be linear or nonlinear. The relationship is expressed by one of several types of transformation (transfer) functions (see Table 6.1 for a list of commonly used activation functions). Selection of the specific activation function affects the network’s operation. Figure 6.10 shows the calculation for a simple sigmoid-type activation function example.

FIGURE 6.10 Example of ANN Transfer Function. [Inputs X1 = 3, X2 = 1, X3 = 2 with weights W1 = 0.2, W2 = 0.4, W3 = 0.1. Summation function: Y = 3(0.2) + 1(0.4) + 2(0.1) = 1.2. Transfer function: YT = 1/(1 + e^(-1.2)) = 0.77.]

The transformation modifies the output levels to fit within a reasonable range of values (typically between 0 and 1). This transformation is performed before the output reaches the next level. Without such a transformation, the value of the output becomes very large, especially when there are several layers of neurons. Sometimes a threshold value is used instead of a transformation function. A threshold value is a hurdle value for the output of a neuron to trigger the next level of neurons. If an output value is smaller than the threshold value, it will not be passed to the next level of neurons. For example, any value of 0.5 or less becomes 0, and any value above 0.5 becomes 1. A transformation can occur at the output of each processing element, or it can be performed only at the final output nodes.
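The calculation in Figure 6.10 and the threshold rule just described can be reproduced with a short Python sketch. The function names are illustrative only; the inputs, weights, and the 0.5 cutoff are the values given in the text.

```python
import math


def weighted_sum(inputs, weights):
    """Summation function: multiply each input by its weight and total the products."""
    return sum(x * w for x, w in zip(inputs, weights))


def sigmoid(y):
    """Sigmoid transfer function: maps the activation level into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))


def threshold(value, cutoff=0.5):
    """Threshold rule from the text: values above the cutoff become 1, all others 0."""
    return 1 if value > cutoff else 0


X = [3, 1, 2]          # X1, X2, X3 from Figure 6.10
W = [0.2, 0.4, 0.1]    # W1, W2, W3 from Figure 6.10

Y = weighted_sum(X, W)   # 3(0.2) + 1(0.4) + 2(0.1)
YT = sigmoid(Y)          # 1 / (1 + e^(-1.2))

print(round(Y, 2), round(YT, 2))  # 1.2 0.77
print(threshold(YT))              # 1
```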


Application Case 6.3 provides an interesting use case where advanced analytics and deep learning are being used to prevent the extinction of rare animals.

Application Case 6.3 Artificial Intelligence Helps Protect Animals from Extinction

“There are some people who want to kill animals like the lions and cheetahs. We would like to teach them, there are not many left,” say WildTrack officials. The more we can study their behavior, the more we can help to protect them and sustain the earth’s biodiversity that supports us all. Their tracks tell a collective story that holds incredible value in conservation. Where are they going? How many are left? There is much to be learned by monitoring footprints of endangered species like the cheetah.

WildTrack, a nonprofit organization, was founded in 2004 by Zoe Jewell and Sky Alibhai, a veterinarian and a wildlife biologist, respectively, who had been working for many years in Africa monitoring black and white rhinos. While in Zimbabwe, in the early 1990s, they collected and presented data to show that invasive monitoring techniques used for black rhinos were negatively impacting female fertility and began to develop a footprint identification technique. Interest from researchers around the world who needed a cost-effective and noninvasive approach to wildlife monitoring sparked WildTrack.

Artificial intelligence may help people recreate some of the skills used by indigenous trackers. WildTrack researchers are exploring the value AI can bring to conservation. They think that AI solutions are designed to enhance human efforts, not replace them. With deep learning, given enough data, a computer can be trained to perform humanlike tasks such as identifying footprint images and recognizing patterns in a similar way to indigenous trackers, but with the added ability to apply these concepts at a much larger scale and more rapid pace. Analytics really underpins the whole thing, potentially giving insights into species populations that WildTrack never had before.

The WildTrack footprint identification technique is a tool for noninvasive monitoring of endangered species through digital images of footprints. Measurements from these images are analyzed by customized mathematical models that help to identify the species, individual, sex, and age class. AI could add the ability to adapt through progressive learning algorithms and tell an even more complete story.

Obtaining crowdsourcing data is the next important step toward redefining what conservation looks like in the future. Ordinary people would not necessarily be able to dart a rhino, but they can take an image of a footprint. WildTrack has data coming in from everywhere, too much to manage traditionally. That’s really where AI comes in. It can automate repetitive learning through data, performing frequent, high-volume, computerized tasks reliably and without fatigue.

“Our challenge is how to harness artificial intelligence to create an environment where there’s room for us, and all species in this world,” says Alibhai.

Questions for Case 6.3

1. What is WildTrack and what does it do?

2. How can advanced analytics help WildTrack?

3. What are the roles that deep learning plays in this application case?

Source: SAS Customer Story. “Can Artificial Intelligence Help Protect These Animals from Extinction? The Answer May Lie in Their Footprints.” https://www.sas.com/en_us/explore/analytics-in-action/impact/WildTrack.html (accessed August 2018); WildTrack.org.

SECTION 6.3 REVIEW QUESTIONS

1. How does a single artificial neuron (i.e., PE) work?
2. List and briefly describe the most commonly used ANN activation functions.
3. What is MLP, and how does it work?
4. Explain the function of weights in ANN.
5. Describe the summation and activation functions in MLP-type ANN architecture.


6.4 PROCESS OF DEVELOPING NEURAL NETWORK–BASED SYSTEMS

Although the development process of ANN is similar to the structured design methodologies of traditional computer-based information systems, some phases are unique or have some unique aspects. In the process described here, we assume that the preliminary steps of system development, such as determining information requirements, conducting a feasibility analysis, and gaining a champion in top management for the project, have been completed successfully. Such steps are generic to any information system.

As shown in Figure 6.11, the development process for an ANN application includes nine steps. In step 1, the data to be used for training and testing the network are collected. Important considerations are that the particular problem is amenable to a neural network solution and that adequate data exist and can be obtained. In step 2, training data must be identified, and a plan must be made for testing the performance of the network.

FIGURE 6.11 Development Process of an ANN Model. [Flowchart steps: (1) collect, organize and format the data; (2) separate data into training, validation, and testing sets; (3) decide on a network architecture and structure; (4) select a learning algorithm; (5) set network parameters and initialize their values; (6) initialize weights and start training (and validation); (7) stop training, freeze the network weights; (8) test the trained network; (9) deploy the network for use on unknown new cases. Feedback paths: get more data or reformat data; re-separate data into subsets; change network architecture; change learning algorithm; change network parameters; reset and restart the training.]

In steps 3 and 4, a network architecture and a learning method are selected. The availability of a particular development tool or the capabilities of the development personnel may determine the type of neural network to be constructed. Also, certain problem types have demonstrated high success rates with certain configurations (e.g., multilayer feedforward neural networks for bankruptcy prediction [Altman (1968), Wilson and Sharda (1994), and Olson, Delen, and Meng (2012)]). Important considerations are the exact number of neurons and the number of layers. Some packages use genetic algorithms to select the network design.

There are several parameters for tuning the network to the desired learning performance level. Part of the process in step 5 is the initialization of the network weights and parameters followed by the modification of the parameters as training performance feedback is received. Often, the initial values are important in determining the efficiency and length of training. Some methods change the parameters during training to enhance performance.

Step 6 transforms the application data into the type and format required by the neural network. This may require writing software to preprocess the data or performing these operations directly in an ANN package. Data storage and manipulation techniques and processes must be designed for conveniently and efficiently retraining the neural network when needed. The application data representation and ordering often influence the efficiency and possibly the accuracy of the results.

In steps 7 and 8, training and testing are conducted iteratively by presenting input and desired or known output data to the network. The network computes the outputs and adjusts the weights until the computed outputs are within an acceptable tolerance of the known outputs for the input cases. The desired outputs and their relationships to input data are derived from historical data (i.e., a portion of the data collected in step 1).

In step 9, a stable set of weights is obtained. Then the network can reproduce the desired outputs given inputs such as those in the training set. The network is ready for use as a stand-alone system or as part of another software system where new input data will be presented to it and its output will be a recommended decision.

Learning Process in ANN

In supervised learning, the learning process is inductive; that is, connection weights are derived from existing cases. The usual process of learning involves three tasks (see Figure 6.12):

1. Compute temporary outputs.
2. Compare outputs with desired targets.
3. Adjust the weights and repeat the process.

Like any other supervised machine-learning technique, neural network training is usually done by defining a performance function (F) (a.k.a. cost function or loss function) and optimizing (minimizing) that function by changing model parameters. Usually, the performance function is nothing but a measure of error (i.e., the difference between the network output and the target) across all inputs of a network. There are several types of error measures (e.g., sum square errors, mean square errors, cross entropy, or even custom measures), all of which are designed to capture the difference between the network outputs and the desired outputs.
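The three error measures named above can be sketched as simple Python functions. This is an illustrative sketch: the function names, the sample outputs and targets, and the small clipping constant `eps` (added here only to avoid taking the logarithm of zero) are all assumptions, and the cross-entropy form shown is the one for binary targets.

```python
import math


def sum_squared_error(outputs, targets):
    """Sum of squared differences between network outputs and targets."""
    return sum((t - y) ** 2 for y, t in zip(outputs, targets))


def mean_squared_error(outputs, targets):
    """Sum of squared errors averaged over the number of samples."""
    return sum_squared_error(outputs, targets) / len(outputs)


def binary_cross_entropy(outputs, targets, eps=1e-12):
    """Average cross entropy for binary targets and outputs in (0, 1)."""
    total = 0.0
    for y, t in zip(outputs, targets):
        y = min(max(y, eps), 1.0 - eps)  # clip to keep log() defined
        total += -(t * math.log(y) + (1 - t) * math.log(1.0 - y))
    return total / len(outputs)


# Hypothetical network outputs and their known targets.
outputs = [0.77, 0.20, 0.90]
targets = [1, 0, 1]

print(sum_squared_error(outputs, targets))
print(mean_squared_error(outputs, targets))
print(binary_cross_entropy(outputs, targets))
```

All three measures shrink toward 0 as the outputs approach the targets, which is why any of them can serve as the function to be minimized.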

The training process begins by calculating outputs for a given set of inputs using some random weights and biases. Once the network outputs are on hand, the performance function can be computed. The difference between the actual output (Y or YT) and the desired output (Z) for a given set of inputs is an error called delta (in calculus, the Greek symbol delta, ∆, means “difference”).

The objective is to minimize delta (i.e., reduce it to 0 if possible), which is done by adjusting the network’s weights. The key is to change the weights in the proper direction, making changes that reduce delta (i.e., error). Different ANNs compute delta in different ways, depending on the learning algorithm being used. Hundreds of learning algorithms are available for various situations and configurations of ANN.

Backpropagation for ANN Training

The optimization of performance (i.e., minimization of the error or delta) in the neural network is usually done by an algorithm called stochastic gradient descent (SGD), which is an iterative gradient-based optimizer used for finding the minimum (i.e., the lowest point) in performance functions, as in the case of neural networks. The idea behind the SGD algorithm is that the derivative of the performance function with respect to each current weight or bias indicates the amount of change in the error measure by each unit of change in that weight or bias element. These derivatives are referred to as network gradients. Calculation of network gradients in neural networks requires an algorithm called backpropagation, the most popular neural network learning algorithm, which applies the chain rule of calculus to compute the derivatives of functions formed by composing other functions whose derivatives are known [more on the mathematical details of this algorithm can be found in Rumelhart, Hinton, and Williams (1986)].
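To make the gradient idea concrete, the sketch below applies stochastic gradient descent to the simplest possible case: a single linear neuron a = wp + b with a squared-error performance function F = (t - a)^2 / 2. Everything specific here is an assumption made for illustration, including the learning rate, the number of passes, the fixed random seed, and the training data, which are generated from w = 2 and b = -1 so that the expected result is known.

```python
import random


def sgd_linear_neuron(samples, rate=0.05, epochs=200, seed=0):
    """Fit a = w*p + b by updating w and b after every single sample."""
    rng = random.Random(seed)
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random starting point
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)                 # "stochastic": visit samples in random order
        for p, t in data:
            a = w * p + b                 # forward computation
            error = t - a
            grad_w = -error * p           # dF/dw for F = (t - a)^2 / 2
            grad_b = -error               # dF/db
            w -= rate * grad_w            # step against the gradient
            b -= rate * grad_b
    return w, b


# Hypothetical training data generated from the target relationship t = 2p - 1.
samples = [(p, 2 * p - 1) for p in (-2.0, -1.0, 0.0, 1.0, 2.0)]
w, b = sgd_linear_neuron(samples)
print(round(w, 3), round(b, 3))
```

Each update moves a weight in the direction opposite to its gradient, so the error measure falls a little with every sample that is presented.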

FIGURE 6.12 Supervised Learning Process of an ANN. [Flowchart: the ANN model computes the output; if the desired output is achieved, stop the learning and freeze the weights; if not, adjust the weights and compute the output again.]


Backpropagation (short for back-error propagation) is the most widely used supervised learning algorithm in neural computing (Principe, Euliano, and Lefebvre, 2000). By using the SGD mentioned previously, the implementation of backpropagation algorithms is relatively straightforward. A neural network with backpropagation learning includes one or more hidden layers. This type of network is considered feedforward because there are no interconnections between the output of a processing element and the input of a node in the same layer or in a preceding layer. Externally provided correct patterns are compared with the neural network’s output during (supervised) training, and feedback is used to adjust the weights until the network has categorized all training patterns as correctly as possible (the error tolerance is set in advance).

Starting with the output layer, errors between network-generated actual output and the desired outputs are used to correct/adjust the weights for the connections between the neurons (see Figure 6.13). For any output neuron j, the error (delta) = (Zj - Yj) (df/dx), where Z and Y are the desired and actual outputs, respectively. Using the sigmoid function, f = [1 + exp(-x)]^(-1), where x is proportional to the sum of the weighted inputs to the neuron, is an effective way to compute the output of a neuron in practice. With this function, the derivative is df/dx = f (1 - f ), so the error is a simple function of the desired and actual outputs. The factor f (1 - f ) is the derivative of the logistic function, which serves to keep the error correction well bounded. The weight of each input to the jth neuron is then changed in proportion to this calculated error. A more complicated expression can be derived to work backward in a similar way from the output neurons through the hidden layers to calculate the corrections to the associated weights of the inner neurons. This complicated method is an iterative approach to solving a nonlinear optimization problem that is very similar in meaning to the one characterizing multiple linear regression.
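The error expression for an output neuron can be checked numerically. The sketch below reuses the inputs and weights of Figure 6.10, assumes a desired output Z = 1 and a proportionality constant (learning rate) of 0.5, and performs one weight correction. Both assumed values are hypothetical and serve only to illustrate the calculation.

```python
import math


def sigmoid(x):
    """Sigmoid function f = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(X, W):
    """Actual output Y of the neuron for inputs X and weights W."""
    return sigmoid(sum(x * w for x, w in zip(X, W)))


X = [3, 1, 2]           # inputs from Figure 6.10
W = [0.2, 0.4, 0.1]     # weights from Figure 6.10
Z = 1.0                 # desired output (assumed for this sketch)
rate = 0.5              # proportionality constant (assumed for this sketch)

Y = forward(X, W)                          # about 0.77
delta = (Z - Y) * Y * (1.0 - Y)            # (Z - Y) * df/dx with df/dx = f(1 - f)

# Change each weight in proportion to the error and to the input it carries.
W_new = [w + rate * delta * x for w, x in zip(W, X)]
Y_new = forward(X, W_new)

print(round(delta, 4))
print(round(Y, 4), round(Y_new, 4))   # the new output is closer to Z
```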

In backpropagation, the learning algorithm includes the following procedures:

1. Initialize weights with random values and set other parameters.
2. Read in the input vector and the desired output.
3. Compute the actual output via the calculations, working forward through the layers.
4. Compute the error.
5. Change the weights by working backward from the output layer through the hidden layers.
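The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of any particular tool: the function name `train_backprop`, the single hidden layer of four sigmoid neurons, the learning rate, the bias terms, the error tolerance, and the XOR training data are all hypothetical choices made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_backprop(X, Z, n_hidden=4, lr=0.5, epochs=5000, tol=0.01, seed=0):
    """Train a one-hidden-layer sigmoid network; all settings are illustrative."""
    rng = np.random.default_rng(seed)
    # Step 1: initialize weights with random values and set other parameters.
    W1 = rng.normal(0.0, 1.0, (X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 1.0, (n_hidden, Z.shape[1]))
    b2 = np.zeros(Z.shape[1])
    history = []
    for _ in range(epochs):
        # Steps 2-3: read the inputs and compute the actual output, working forward.
        H = sigmoid(X @ W1 + b1)
        Y = sigmoid(H @ W2 + b2)
        # Step 4: compute the error between desired (Z) and actual (Y) outputs.
        err = Z - Y
        mse = float(np.mean(err ** 2))
        history.append(mse)
        if mse < tol:  # stop once the outputs agree within the tolerance
            break
        # Step 5: change the weights, working backward from the output layer.
        delta_out = err * Y * (1.0 - Y)                 # (Z - Y) * f(1 - f)
        delta_hid = (delta_out @ W2.T) * H * (1.0 - H)  # error passed to hidden layer
        W2 += lr * H.T @ delta_out
        b2 += lr * delta_out.sum(axis=0)
        W1 += lr * X.T @ delta_hid
        b1 += lr * delta_hid.sum(axis=0)
    return (W1, b1, W2, b2), history

# Hypothetical training set: the XOR function.
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Z_xor = np.array([[0], [1], [1], [0]], dtype=float)
params, history = train_backprop(X_xor, Z_xor)
print(f"first error {history[0]:.4f}, last error {history[-1]:.4f}")
```

Here the whole training set is run forward before each weight change, which keeps the sketch short; updating after every single case would follow the same five steps.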

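[Figure 6.13: a single neuron (processing element, PE) with inputs $X_1, X_2, \ldots, X_n$ and weights $W_1, W_2, \ldots, W_n$; summation $S = \sum_{i=1}^{n} X_i W_i$; transfer function $Y = f(S)$; the error $\alpha (Z_i - Y_i)$ is fed back to adjust the weights.]

FIGURE 6.13 Backpropagation of Error for a Single Neuron.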


This procedure is repeated for the entire set of input vectors until the desired output and the actual output agree within some predetermined tolerance. Given the calculation requirements for one iteration, training a large network can take a very long time; therefore, in one variation, a set of cases is run forward and an aggregated error is fed backward to speed the learning. Sometimes, depending on the initial random weights and network parameters, the network does not converge to a satisfactory performance level. When this is the case, new random weights must be generated, and the network parameters, or even its structure, may have to be modified before another attempt is made. Current research is aimed at developing algorithms and using parallel computers to improve this process. For example, genetic algorithms (GA) can be used to guide the selection of the network parameters to maximize the performance of the desired output. In fact, most commercial ANN software tools are now using GA to help users “optimize” the network parameters in a semiautomated manner.

A central concern in the training of any type of machine-learning model is overfitting. It happens when the trained model is highly fitted to the training data set but performs poorly with regard to external data sets. Overfitting causes serious issues with respect to the generalizability of the model. A large group of strategies known as regularization strategies is designed to prevent models from overfitting by making changes or defining constraints for the model parameters or the performance function.

In the classic ANN models of small size, a common regularization strategy to avoid overfitting is to assess the performance function for a separate validation data set as well as the training data set after each iteration. Whenever the performance stopped improving for the validation data, the training process would be stopped. Figure 6.14 shows a typical graph of the error measure by the number of iterations of training. As shown, in the beginning, the error decreases in both training and validation data by running more and more iterations; but from a specific point (shown by the dashed line), the error starts increasing in the validation set while still decreasing in the training set. It means that beyond that number of iterations, the model becomes overfitted to the data set with which it is trained and cannot necessarily perform well when it is fed with some external data. That point actually represents the recommended number of iterations for training a given neural network.

[Figure 6.14: error (y-axis) versus training iterations (x-axis), with one curve for error reduction on the training set and one for error reduction on the validation set; the best model is marked at the dashed line.]

FIGURE 6.14 Illustration of Overfitting in ANN: Gradually Changing Error Rates in the Training and Validation Data Sets as the Number of Iterations Increases.
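The stopping rule described above can be sketched as a small function over the validation errors recorded after each iteration. The name `best_stopping_point` and the `patience` parameter (how many non-improving iterations to tolerate before stopping) are assumptions added for illustration; the text itself simply stops when the validation performance stops improving. The error values are made up.

```python
def best_stopping_point(validation_errors, patience=3):
    """Return (best_iteration, stop_iteration) for a list of validation errors.

    best_iteration is where the validation error was lowest so far;
    stop_iteration is where training halts after `patience` iterations
    without improvement (or the last iteration if that never happens).
    """
    best_iter, best_err = 0, float("inf")
    for i, err in enumerate(validation_errors):
        if err < best_err:
            best_iter, best_err = i, err      # validation error still improving
        elif i - best_iter >= patience:
            return best_iter, i               # no improvement for `patience` iterations
    return best_iter, len(validation_errors) - 1

# Hypothetical validation errors: falling at first, then rising as overfitting sets in.
val_errors = [0.90, 0.70, 0.50, 0.40, 0.35, 0.36, 0.38, 0.41, 0.45]
print(best_stopping_point(val_errors))  # (4, 7)
```

In practice the weights saved at `best_iteration` would be kept as the final model, which corresponds to the point marked as the best model in Figure 6.14.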

Technology Insight 6.2 discusses some of the popular neural network software and offers some Web links to more comprehensive ANN-related software sites.

TECHNOLOGY INSIGHT 6.2 ANN Software

Many tools are available for developing neural networks (see this book’s Web site and the resource lists at PC AI, pcai.com). Some of these tools function like software shells. They provide a set of standard architectures, learning algorithms, and parameters, along with the ability to manipulate the data. Some development tools can support several network paradigms and learning algorithms.

Neural network implementations are also available in most of the comprehensive predictive analytics and data mining tools, such as the SAS Enterprise Miner, IBM SPSS Modeler (formerly Clementine), and Statistica Data Miner. Weka, RapidMiner, Orange, and KNIME are open-source free data mining software tools that include neural network capabilities. These free tools can be downloaded from their respective Web sites; simple Internet searches on the names of these tools should lead you to the download pages. Also, most of the commercial software tools are available for download and use for evaluation purposes (usually they are limited on time of availability and/or functionality).

Many specialized neural network tools make the building and deployment of a neural network model an easier undertaking in practice. Any listing of such tools would be incomplete. Online resources such as Wikipedia (en.wikipedia.org/wiki/Artificial_neural_network), Google’s or Yahoo!’s software directory, and the vendor listings on pcai.com are good places to locate the latest information on neural network software vendors. Some of the vendors that have been around for a while and have reported industrial applications of their neural network software include California Scientific (BrainMaker), NeuralWare, NeuroDimension Inc., Ward Systems Group (Neuroshell), and Megaputer. Again, the list can never be complete.

Some ANN development tools are spreadsheet add-ins. Most can read spreadsheet, database, and text files. Some are freeware or shareware. Some ANN systems have been developed in Java to run directly on the Web and are accessible through a Web browser interface. Other ANN products are designed to interface with expert systems as hybrid development products.

Developers may instead prefer to use more general programming languages, such as C, C#, C++, Java, and so on, readily available R and Python libraries, or spreadsheets to program the model, perform the calculations, and deploy the results. A common practice in this area is to use a library of ANN routines. Many ANN software providers and open-source platforms provide such programmable libraries. For example, hav.Software (hav.com) provides a library of C++ classes for implementing stand-alone or embedded feedforward, simple recurrent, and random-order recurrent neural networks. Computational software such as MATLAB also includes neural network–specific libraries.


SECTION 6.4 REVIEW QUESTIONS

1. List the nine steps in conducting a neural network project.
2. What are some of the design parameters for developing a neural network?
3. Draw and briefly explain the three-step process of learning in ANN.
4. How does backpropagation learning work?
5. What is overfitting in ANN learning? How does it happen, and how can it be mitigated?
6. Describe the different types of neural network software available today.

6.5 ILLUMINATING THE BLACK BOX OF ANN

Neural networks have been used as an effective tool for solving highly complex real-world problems in a wide range of application areas. Even though ANN have been proven to be superior predictors and/or cluster identifiers in many problem scenarios (compared to their traditional counterparts), in some applications, there exists an additional need to know “how the model does what it does.” ANN are typically known as black boxes, capable of solving complex problems but lacking the explanation of their capabilities. This lack of transparency is commonly referred to as the “black-box” syndrome.

It is important to be able to explain a model’s “inner being”; such an explanation offers assurance that the network has been properly trained and will behave as desired once deployed in a business analytics environment. Such a need to “look under the hood” might be attributable to a relatively small training set (as a result of the high cost of data acquisition) or a very high liability in case of a system error. One example of such an application is the deployment of airbags in vehicles. Here, both the cost of data acquisition (crashing vehicles) and the liability concerns (danger to human lives) are rather significant. Another representative example for the importance of explanation is loan-application processing. If an applicant is refused a loan, he or she has the right to know why. Having a prediction system that does a good job on differentiating good and bad applications may not be sufficient if it does not also provide the justification of its predictions.

A variety of techniques have been proposed for analysis and evaluation of trained neural networks. These techniques provide a clear interpretation of how a neural network does what it does; that is, specifically how (and to what extent) the individual inputs factor into the generation of specific network output. Sensitivity analysis has been the front-runner of the techniques proposed for shedding light into the black-box characterization of trained neural networks.

Sensitivity analysis is a method for extracting the cause-and-effect relationships among the inputs and the outputs of a trained neural network model. In the process of performing sensitivity analysis, the trained neural network’s learning capability is disabled so that the network weights are not affected. The basic procedure behind sensitivity analysis is that the inputs to the network are systematically perturbed within the allowable value ranges, and the corresponding change in the output is recorded for each and every input variable (Principe et al., 2000). Figure 6.15 shows a graphical illustration of this process. The first input is varied between its mean plus and minus a user-defined number of standard deviations (or for categorical variables, all of its possible values are used) while all other input variables are fixed at their respective means (or modes). The network output is computed for a user-defined number of steps above and below the mean. This process is repeated for each input. As a result, a report is generated to summarize the variation of each output with respect to the variation in each input. The generated report often contains a column plot (along with numeric values presented on the x-axis), reporting the relative sensitivity values for each input variable. A representative example of sensitivity analysis on ANN models is provided in Application Case 6.4.

[Figure 6.15: systematically perturbed inputs enter the trained ANN (the “black box”), and the change in the outputs is observed.]

FIGURE 6.15 A Figurative Illustration of Sensitivity Analysis on an ANN Model.

Application Case 6.4 Sensitivity Analysis Reveals Injury Severity Factors in Traffic Accidents

According to the National Highway Traffic Safety Administration (NHTSA), over 6 million traffic accidents claim more than 41,000 lives each year in the United States. Causes of accidents and related injury severity are of special interest to traffic safety researchers. Such research is aimed at reducing not only the number of accidents but also the severity of injury. One way to accomplish the latter is to identify the most profound factors that affect injury severity. Understanding the circumstances under which drivers and passengers are more likely to be severely injured (or killed) in a vehicle accident can help improve the overall driving safety situation. Factors that potentially elevate the risk of injury severity of vehicle occupants in the event of an accident include demographic and/or behavioral characteristics of the person (e.g., age, gender, seatbelt usage, use of drugs or alcohol while driving), environmental factors, and/or roadway conditions at the time of the accident (e.g., surface conditions, weather or light conditions, direction of the impact, vehicle orientation in the crash, occurrence of a rollover), as well as technical characteristics of the vehicle itself (e.g., age, body type).

In an exploratory data mining study, Delen et al. (2006) used a large sample of data (30,358 police-reported accident records obtained from the General Estimates System of NHTSA) to identify which factors become increasingly more important in escalating the probability of injury severity during a traffic crash. Accidents examined in this study included a geographically representative sample of multiple-vehicle collision accidents, single-vehicle fixed-object collisions, and single-vehicle noncollision (rollover) crashes.

Contrary to many of the previous studies conducted in this domain, which have primarily used regression-type generalized linear models where the functional relationships between injury severity and crash-related factors are assumed to be linear (which is an oversimplification of the reality in most real-world situations), Delen and his colleagues (2006) decided to go in a different direction. Because ANN are known to be superior in capturing highly nonlinear complex relationships between the predictor variables (crash factors) and the target variable (severity level of the injuries), they decided to use a series of ANN models to estimate the significance of the crash factors on the level of injury severity sustained by the driver.

From a methodological standpoint, Delen et al. (2006) followed a two-step process. In the first step, they developed a series of prediction models (one for each injury severity level) to capture the in-depth relationships between the crash-related factors and a specific level of injury severity. In the second step, they conducted sensitivity analysis on the trained neural network models to identify the prioritized importance of crash-related factors as they relate to different injury severity levels. In the formulation of the study, the five-class prediction problem was decomposed into a number of binary classification models to obtain the granularity of information needed to identify the “true” cause-and-effect relationships between the crash-related factors and different levels of injury severity. As shown in Figure 6.16, eight different neural network models have been developed and used in the sensitivity analysis to identify the key determinants of increased injury severity levels.

The results revealed considerable differences among the models built for different injury severity levels. This implies that the most influential factors in prediction models highly depend on the level of injury severity. For example, the study revealed that the variable seatbelt use was the most important determinant for predicting higher levels of injury severity (such as incapacitating injury or fatality), but it was one of the least significant predictors for lower levels of injury severity (such as non-incapacitating injury and minor injury). Another interesting finding involved gender: The driver’s gender was among the significant predictors for lower levels of injury severity, but it was not among the significant factors for higher levels of injury severity, indicating that more serious injuries do not depend on the driver being a male or a female. Another interesting and somewhat intuitive finding of the study indicated that age becomes an increasingly more significant factor as the level of injury severity increases, implying that older people are more likely to incur severe injuries (and fatalities) in serious vehicle crashes than younger people.

Questions for Case 6.4

1. How does sensitivity analysis shed light on the black box (i.e., neural networks)?

2. Why would someone choose to use a black-box tool such as neural networks over theoretically sound, mostly transparent statistical tools like logistic regression?

3. In this case, how did neural networks and sensitivity analysis help identify injury-severity factors in traffic accidents?

Sources: Delen, D., R. Sharda, & M. Bessonov. (2006). “Identifying Significant Predictors of Injury Severity in Traffic Accidents Using a Series of Artificial Neural Networks.” Accident Analysis and Prevention, 38(3), pp. 434–444; Delen, D., L. Tomak, K. Topuz, & E. Eryarsoy (2017). “Investigating Injury Severity Risk Factors in Automobile Crashes with Predictive Analytics and Sensitivity Analysis Methods.” Journal of Transport & Health, 4, pp. 118–131.

[Figure 6.16: eight binary model configurations, labeled 1.1 to 1.4 and 2.1 to 2.4, each assigning the five injury severity levels, No Injury (35.4%), Probable Injury (23.6%), Non-Incapacitating (19.6%), Incapacitating (17.8%), and Fatal Injury (3.6%), to binary category label 0 or binary category label 1.]

FIGURE 6.16 Graphical Representation of the Eight Binary ANN Model Configurations.
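The sensitivity analysis procedure described in this section can be sketched as follows. This is a minimal sketch under several assumptions: the function name `sensitivity_analysis` and its parameters are hypothetical, only numeric inputs are handled (categorical inputs, which the text treats by trying all possible values, are left out), the output range is used as the measure of variation, and a simple fixed formula stands in for a trained ANN whose weights are no longer updated.

```python
import numpy as np

def sensitivity_analysis(predict, X, n_std=1.0, n_steps=5):
    """Relative sensitivity of a model output to each numeric input.

    predict : callable taking an (n, d) array and returning n outputs;
              it plays the role of the trained network with learning disabled.
    X       : (n, d) array of observed inputs, used for means and std. deviations.
    """
    means = X.mean(axis=0)
    stds = X.std(axis=0)
    ranges = []
    for j in range(X.shape[1]):
        # Vary input j from mean - n_std*std to mean + n_std*std,
        # with n_steps values below and n_steps values above the mean.
        values = np.linspace(means[j] - n_std * stds[j],
                             means[j] + n_std * stds[j],
                             2 * n_steps + 1)
        probes = np.tile(means, (values.size, 1))  # all other inputs fixed at their means
        probes[:, j] = values
        outputs = np.asarray(predict(probes))
        ranges.append(outputs.max() - outputs.min())  # variation of the output
    ranges = np.array(ranges)
    total = ranges.sum()
    return ranges / total if total > 0 else ranges

# Stand-in for a trained model (hypothetical): input 0 matters most, input 1 not at all.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(200, 3))
stand_in_model = lambda P: 3.0 * P[:, 0] + 0.0 * P[:, 1] + 1.0 * P[:, 2]
relative_sensitivity = sensitivity_analysis(stand_in_model, X_demo)
print(np.round(relative_sensitivity, 3))
```

The returned values sum to 1, so they can be shown directly as the column plot of relative sensitivities mentioned in the text.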


SECTION 6.5 REVIEW QUESTIONS

1. What is the so-called black-box syndrome?
2. Why is it important to be able to explain an ANN’s model structure?
3. How does sensitivity analysis work in ANN?
4. Search the Internet to find other methods to explain ANN methods. Report the results.

6.6 DEEP NEURAL NETWORKS

Until recently (before the advent of the deep learning phenomenon), most neural network applications involved network architectures with only a few hidden layers and a limited number of neurons in each layer. Even in relatively complex business applications of neural networks, the number of neurons in networks hardly exceeded a few thousand. In fact, the processing capability of computers at the time was such a limiting factor that central processing units (CPUs) were hardly able to run networks involving more than a couple of layers in a reasonable time. In recent years, development of graphics processing units (GPUs) along with the associated programming languages (e.g., CUDA by NVIDIA) that enable people to use them for data analysis purposes has led to more advanced applications of neural networks. GPU technology has enabled us to successfully run neural networks with over a million neurons. These larger networks are able to go deeper into the data features and extract more sophisticated patterns that could not be detected otherwise.

While deep networks can handle a considerably larger number of input variables, they also need relatively larger data sets to be trained satisfactorily; using small data sets for training deep networks typically leads to overfitting of the model to the training data and to poor and unreliable results when the model is applied to external data. Thanks to the Internet- and Internet of Things (IoT)-based data-capturing tools and technologies, larger data sets are now available in many application domains for deeper neural network training.

The input to a regular ANN model is typically an array of size R × 1, where R is the number of input variables. In the deep networks, however, we are able to use tensors (i.e., N-dimensional arrays) as input. For example, in image recognition networks, each input (i.e., image) can be represented by a matrix indicating the color codes used in the image pixels; or for video processing purposes, each video can be represented by several matrices (i.e., a 3D tensor), each representing an image involved in the video. In other words, tensors provide us with the ability to include additional dimensions (e.g., time, location) in analyzing the data sets.
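The difference between these input shapes is easy to see with NumPy arrays. The sizes below (four input variables, 28 × 28 pixel images, 10 video frames) are made-up values chosen only to illustrate the idea.

```python
import numpy as np

R = 4                                # hypothetical number of input variables
x_vector = np.zeros((R, 1))          # regular ANN input: an R x 1 array
image = np.zeros((28, 28))           # one image: a matrix of pixel color codes
video = np.zeros((10, 28, 28))       # one video: 10 image matrices, i.e., a 3D tensor

for name, array in [("vector", x_vector), ("image", image), ("video", video)]:
    print(name, "dimensions:", array.ndim, "shape:", array.shape)
```

Each frame of the video tensor, for example `video[0]`, has the same shape as a single image, so the first axis acts as the added time dimension.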

Except for these general differences, the different types of deep networks involve various modifications to the architecture of standard neural networks that equip them with distinct capabilities of dealing with particular data types for advanced purposes. In the following section, we discuss some of these special network types and their characteristics.

Feedforward Multilayer Perceptron (MLP)-Type Deep Networks

MLP deep networks, also known as deep feedforward networks, are the most general type of deep networks. These networks are simply large-scale neural networks that can contain many layers of neurons and handle tensors as their input. The types and characteristics of the network elements (i.e., weight functions, transfer functions) are pretty much the same as in the standard ANN models. These models are called feedforward because the information that goes through them always flows forward and no feedback connections (i.e., connections in which outputs of a model are fed back to itself) are allowed. The neural networks in which feedback connections are allowed are called recurrent neural networks (RNN). General RNN architectures, as well as a specific variation of RNNs called long short-term memory networks, are discussed in later sections of this chapter.


Generally, a sequential order of layers has to be held between the input and the output layers in the MLP-type network architecture. This means that the input vector has to pass through all layers sequentially and cannot skip any of them; moreover, it cannot be directly connected to any layer except for the very first one; the output of each layer is the input to the subsequent layer. Figure 6.17 demonstrates a vector representation of the first three layers of a typical MLP network. As shown, there is only one vector going into each layer, which is either the original input vector (p for the first layer) or the output vector from the previous hidden layer in the network architecture ($a_{i-1}$ for the ith layer). There are, however, some special variations of MLP network architectures designed for specialized purposes in which these principles can be violated.
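The sequential flow described above, in which each layer computes its output from the previous layer's output only (a1 = f1(w1 p + b1), a2 = f2(w2 a1 + b2), and so on, as in Figure 6.17), can be sketched as a short loop. The function name `forward`, the layer sizes, the random weights, and the choice of tanh and identity transfer functions are assumptions made for the example.

```python
import numpy as np

def forward(p, layers):
    """Pass input vector p through the layers in order.

    layers is a list of (W, b, f) tuples: weight matrix, bias vector,
    and transfer function. Returns the output vector of every layer.
    """
    a = p
    outputs = []
    for W, b, f in layers:
        a = f(W @ a + b)      # the only input to a layer is the previous output
        outputs.append(a)
    return outputs

# Hypothetical three-layer network: 5 inputs -> 4 -> 3 -> 2 neurons.
rng = np.random.default_rng(0)
p = rng.normal(size=(5, 1))
layers = [
    (rng.normal(size=(4, 5)), rng.normal(size=(4, 1)), np.tanh),
    (rng.normal(size=(3, 4)), rng.normal(size=(3, 1)), np.tanh),
    (rng.normal(size=(2, 3)), rng.normal(size=(2, 1)), lambda n: n),
]
a1, a2, a3 = forward(p, layers)
print(a1.shape, a2.shape, a3.shape)  # (4, 1) (3, 1) (2, 1)
```

Because the loop never reaches back to an earlier output or forward past the next layer, it has no skip or feedback connections, which is the defining property of the MLP-type architecture.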

Impact of Random Weights in Deep MLP

Optimization of the performance (loss) function in many real applications of deep MLPs is a challenging issue. The common gradient-based training algorithms with random initialization of weights and biases are, most of the time, very efficient at finding the optimal set of parameters in shallow neural networks; applied to deep MLPs, however, they can get stuck in locally optimal solutions rather than reaching the global optimum values for the parameters. As the depth of the network increases, the chances of reaching a global optimum using random initializations with the gradient-based algorithms decrease. In such cases, pretraining the network parameters using some unsupervised deep learning methods such as deep belief networks (DBNs) can usually be helpful (Hinton, Osindero, and Teh, 2006). DBNs belong to a large class of deep neural networks called generative models. The introduction of DBNs in 2006 is considered the beginning of the current deep learning renaissance (Goodfellow et al., 2016), since prior to that, deep models were considered too difficult to optimize. In fact, the primary application of DBNs today is to improve classification models by pretraining of their parameters.

[Figure 6.17: the input vector p passes through three layers with weights $w_1, w_2, w_3$, biases $b_1, b_2, b_3$, and transfer functions $f_1, f_2, f_3$: $a_1 = f_1(w_1 p + b_1)$, $a_2 = f_2(w_2 a_1 + b_2)$, $a_3 = f_3(w_3 a_2 + b_3) = f_3(w_3 f_2(w_2 f_1(w_1 p + b_1) + b_2) + b_3)$.]

FIGURE 6.17 Vector Representation of the First Three Layers in a Typical MLP Network.

Using these unsupervised learning methods, we can train the MLP layers, one at a time, starting from the first layer, and use the output of each layer as the input to the subsequent layer and initialize that layer with an unsupervised learning algorithm. At the end, we will have a set of initialized values for the parameters across the whole network. Those pretrained parameters, instead of randomly initialized parameters, then can be used as the initial values in the supervised learning of the MLP. This pretraining procedure has been shown to cause significant improvements to the deep classification applications. Figure 6.18 illustrates the classification errors that resulted from training a deep MLP network with (black triangles) and without (blue circles) pretraining of parameters (Bengio, 2009). In this example, the blue line represents the observed error rates of testing a classification model (on 1,000 held-out examples) trained using a purely supervised approach with 10 million examples, whereas the black line indicates the error rates on the same testing data set when 2.5 million examples were initially used for unsupervised training of network parameters (using DBN) and then the other 7.5 million examples along with the initialized parameters were used to train a supervised classification model. The diagrams clearly show a significant improvement in terms of the classification error rate in the model pretrained by a deep belief network.
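The layer-by-layer idea described above can be sketched with a much simpler unsupervised learner than a DBN. The example below swaps in a small autoencoder for each layer (an assumption made to keep the code short; it is not the method of Hinton, Osindero, and Teh), and every name, layer size, and training setting in it is hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pretrain_layer(X, n_hidden, rng, epochs=200, lr=0.1):
    """Fit a one-hidden-layer autoencoder to X (no labels); return its encoder."""
    n, n_in = X.shape
    W = rng.normal(0.0, 0.1, (n_in, n_hidden))   # encoder weights
    b = np.zeros(n_hidden)
    V = rng.normal(0.0, 0.1, (n_hidden, n_in))   # decoder weights
    c = np.zeros(n_in)
    for _ in range(epochs):
        H = sigmoid(X @ W + b)                   # encode the input
        err = (H @ V + c) - X                    # reconstruction error
        dH = (err @ V.T) * H * (1.0 - H)         # error passed back to the encoder
        V -= lr * (H.T @ err) / n
        c -= lr * err.mean(axis=0)
        W -= lr * (X.T @ dH) / n
        b -= lr * dH.mean(axis=0)
    return W, b

def greedy_pretrain(X, layer_sizes, seed=0):
    """Pretrain the layers one at a time, starting from the first layer."""
    rng = np.random.default_rng(seed)
    params, layer_input = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_layer(layer_input, n_hidden, rng)
        params.append((W, b))
        # The output of this layer becomes the input to the subsequent layer.
        layer_input = sigmoid(layer_input @ W + b)
    return params

# Hypothetical unlabeled data: 50 examples with 8 input variables.
X_unlabeled = np.random.default_rng(42).random((50, 8))
pretrained = greedy_pretrain(X_unlabeled, layer_sizes=[6, 4, 3])
print([W.shape for W, _ in pretrained])  # [(8, 6), (6, 4), (4, 3)]
```

The returned weights and biases would then replace random values as the starting point for supervised training of the whole network.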

More Hidden Layers versus More Neurons?

An important question regarding the deep MLP models is “Would it make sense (and produce better results) to restructure such networks with only a few layers, but many neurons in each?” In other words, the question is why we need deep MLP networks with many layers when we can include the same number of neurons in just a few layers (i.e., wide networks instead of deep networks). According to the universal approximation theorem (Cybenko, 1989; Hornik, 1991), an MLP network with a single, sufficiently large hidden layer will be able to approximate any function. Although theoretically founded, such a layer with many neurons may be prohibitively large and hence may fail to learn the underlying patterns correctly. A deeper network can reduce the number of neurons required at each layer and hence decrease the generalization error. Whereas theoretically it is still an open research question, practically using more layers in a network seems to be more effective and computationally more efficient than using many neurons in a few layers.

Like typical artificial neural networks, multilayer perceptron networks can also be used for various prediction, classification, and clustering purposes. Especially when a large number of input variables are involved or in cases that the nature of input has to be an N-dimensional array, a deep multilayer network design needs to be employed.

Application Case 6.5 provides an excellent case for the use of advanced analytics to better manage traffic flows in crowded cities.

[Figure 6.18: classification error (y-axis, logarithmic scale from 10^-4 to 10^0) versus number of examples seen (x-axis, ×10^6, from 0 to 10).]

FIGURE 6.18 The Effect of Pretraining Network Parameters on Improving Results of a Classification-Type Deep Neural Network.

The Background

When the Georgia Department of Transportation (GDOT) wanted to optimize the use of Big Data and advanced analytics to gain insight into transportation, it worked with Teradata to develop a proof of concept evaluation of GDOT’s variable speed limit (VSL) pilot project.

The VSL concept has been adopted in many parts of the world, but it is still relatively new in the United States. As GDOT explains,

VSL are speed limits that change based on road, traffic, and weather conditions. Electronic signs slow down traffic ahead of congestion or bad weather to smooth out flow, diminish stop-and-go conditions, and reduce crashes. This low-cost, cutting edge technology alerts drivers in real time to speed changes due to conditions down the road. More consistent speeds improve safety by helping to prevent rear-end and lane changing collisions due to sudden stops.

Quantifying the customer service, safety, and efficiency benefits of VSL is extremely important to GDOT. This fits within a wider need to understand the effects of investments in intelligent transportation systems as well as other transportation systems and infrastructures.

VSL Pilot Project on I-285 in Atlanta

GDOT conducted a VSL pilot project on the northern half, or “top end,” of I-285 that encircles Atlanta. This 36-mile stretch of highway was equipped with 88 electronic speed limit signs that adjusted speed limits in 10 mph increments from 65 miles per hour (mph) to the minimum of 35 mph. The objectives were twofold:

1. Analyze speeds on the highway before versus after implementation of VSL.

2. Measure the impact of VSL on driving conditions.

To obtain an initial view of the traffic, the Teradata data science solution identified the locations and durations of “persistent slowdowns.” If highway speeds are above “reference speed,” then traffic is considered freely flowing. Falling below the reference speed at any point on the highway is considered a slowdown. When slowdowns persist across multiple consecutive minutes, a persistent slowdown can be defined.
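The definition of a persistent slowdown can be turned into a short routine over minute-by-minute speed readings at one location. This is only an illustrative sketch, not the Teradata solution: the function name, the `min_minutes` threshold for what counts as persistent, the reference speed, and the speed readings are all assumptions.

```python
def persistent_slowdowns(speeds, reference_speed, min_minutes=5):
    """Find persistent slowdowns in per-minute speeds at one highway location.

    Returns a list of (start_minute, duration_in_minutes) for every run of
    consecutive minutes below the reference speed lasting at least min_minutes.
    """
    runs, start = [], None
    for minute, speed in enumerate(speeds):
        if speed < reference_speed:          # slowdown: below the reference speed
            if start is None:
                start = minute
        elif start is not None:              # traffic is freely flowing again
            if minute - start >= min_minutes:
                runs.append((start, minute - start))
            start = None
    if start is not None and len(speeds) - start >= min_minutes:
        runs.append((start, len(speeds) - start))   # slowdown still ongoing at the end
    return runs

# Hypothetical speeds (mph), one reading per minute, reference speed 55 mph.
speeds = [60, 62, 40, 61, 45, 44, 43, 42, 41, 63, 30, 30]
print(persistent_slowdowns(speeds, reference_speed=55))  # [(4, 5)]
```

The one-minute dip at minute 2 and the two-minute dip at the end are fleeting and are ignored; only the five-minute run starting at minute 4 is reported.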

By creating an analytic definition of slowdowns, it is possible to convert voluminous and highly variable speed data into patterns to support closer investigation. The early analyses of the data revealed that the clockwise and counterclockwise directions of the same highway may show significantly different frequency and duration of slowdowns. To better understand how slowdowns affect highway traffic, it is useful to take our new definition and zoom in on a specific situation. Figure 6.19 shows a specific but typical Atlanta afternoon on I-285, at a section of highway where traffic is moving clockwise, from west to east, between mile markers MM10 in the west to the east end at MM46.

The first significant slowdown occurred at 3:00 p.m. near MM32. The size of the circles represents duration (measured in minutes). The slowdown at MM32 was nearly four hours long. As the slowdown “persisted,” traffic speed diminished behind it. The slowdown formed at MM32 became a bottleneck that caused traffic behind it to slow down as well. The “comet trail” of backed-up traffic at the top left of Figure 6.19 illustrates the sequential formation of slowdowns at MM32 and then farther west, each starting later in the afternoon and not lasting as long.

Measuring Highway Speed Variability

The patterns of slowdowns on the highway as well as their different timings and locations led us to question their impact on drivers. If VSL could help drivers better anticipate the stop-and-go nature of the slowdowns, then being able to quantify the impact would be of interest to GDOT. GDOT was particularly concerned about what happens when a driver first encounters a slowdown. “While we do not know what causes the slowdown, we do know that drivers have made speed adjustments. If the slowdown was caused by an accident, then the speed reduction could be quite sudden; alternatively, if the slowdown was just caused by growing volumes of traffic, then the speed reduction might be much more gradual.”

Application Case 6.5 Georgia DOT Variable Speed Limit Analytics Help Solve Traffic Congestions


Identifying Bottlenecks and Traffic Turbulence

A bottleneck starts as a slowdown at a particular location. Something like a “pinch point” occurs on the highway. Then, over a period of time, traffic slows down behind the original pinch point. A bottleneck is a length of highway where traffic falls below 60 percent of reference speed and can stay at that level for miles. Figure 6.20 shows a conceptual representation of a bottleneck.
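The 60 percent rule above can be applied to a snapshot of speeds along the highway to locate bottleneck stretches. The sketch below is an assumption-laden illustration rather than the method used in the project: the function name, the data layout (one speed per mile marker at a single moment), and the sample readings are all hypothetical.

```python
from itertools import groupby

def bottleneck_segments(speed_by_marker, reference_speed, fraction=0.6):
    """Return (first_marker, last_marker) for each stretch below the threshold.

    speed_by_marker is a list of (mile_marker, speed) pairs ordered along
    the direction of travel.
    """
    threshold = fraction * reference_speed
    segments = []
    # Group consecutive mile markers by whether they are below the threshold.
    for congested, group in groupby(speed_by_marker, key=lambda ms: ms[1] < threshold):
        if congested:
            markers = [marker for marker, _ in group]
            segments.append((markers[0], markers[-1]))
    return segments

# Hypothetical snapshot: (mile marker, speed in mph), reference speed 60 mph.
snapshot = [(28, 62), (29, 30), (30, 25), (31, 33), (32, 20), (33, 58)]
print(bottleneck_segments(snapshot, reference_speed=60))  # [(29, 32)]
```

With a reference speed of 60 mph the threshold is 36 mph, so mile markers 29 through 32 form one bottleneck, while the readings at markers 28 and 33 count as normal traffic.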

While bottlenecks are initiated by a pinch point, or slowdown, that forms the head of the queue, it is the end of the queue that is the most interesting. The area at the back of a queue is where traffic encounters a transition from free flow to slowly moving congested conditions. In the worst conditions, the end of the queue can experience a rapid transition. Drivers moving at highway speed may unexpectedly encounter slower traffic. This condition is ripe for accidents and is the place where VSL can deliver real value.

[Figure 6.19: time of day (2 PM to 7 PM; minute of bottleneck suspected, December 11, 2014) versus pseudo mile marker, with traffic moving from west to east; circle size shows slowdown duration.]

FIGURE 6.19 Traffic Moving Clockwise during the Afternoon.

[Figure 6.20: traffic speed (mph) along the direction of travel, showing normal traffic, the bottleneck (queuing traffic) below 60% of reference speed, the bottleneck ends, the zone of influence, and the turbulence reduction opportunity.]

FIGURE 6.20 Graphical Depiction of a Bottleneck on a Highway.

Powerful New Insight on Highway Congestion

The availability of new Big Data sources that describe the “ground truth” of traffic conditions on highways provides rich new opportunities for developing and analyzing highway performance metrics. Using just a single data source on detailed highway speeds, we produced two new and distinctive metrics using Teradata advanced data science capabilities.

First, by defining and measuring persistent slowdowns, we helped traffic engineers understand the frequency and duration of slow speed locations on a highway. The distinction of measuring a persistent slowdown versus a fleeting one is uniquely challenging and requires data science. It provides the ability to compare the number, duration, and location of slowdowns in a way that is more informative and compelling than simple averages, variances, and outliers in highway speeds.

The second metric was the ability to measure turbulence caused by bottlenecks. By identifying where bottlenecks occur and then narrowing in on their very critical zones of influence, we can make measurements of speeds and traffic deceleration turbulence within those zones. Data science and analytics capabilities demonstrated reduced turbulence when VSL is active in the critical zone of a bottleneck.

There is much more that could be explored within this context. For example, it is natural to assume that because most traffic is on the road during rush hours, VSL provides the most benefits during these high-traffic periods. However, the opposite may be true, which could provide a very important benefit of the VSL program.

Although this project was small in size and was just a proof of concept, a combination of similar projects beyond just transportation is underway around the United States and abroad under the name of “smart cities.” The goal is to use a variety of data, from sensors to multimedia and from rare event reports to satellite images, along with advanced analytics that include deep learning and cognitive computing, to transform the dynamic nature of cities for the better for all stakeholders.

Questions for Case 6.5

1. What was the nature of the problems that GDOT was trying to solve with data science?

2. What type of data do you think was used for the analytics?

3. What were the data science metrics developed in this pilot project? Can you think of other metrics that can be used in this context?

Source: Teradata Case Study. “Georgia DOT Variable Speed Limit Analytics Help Solve Traffic Congestion.” https://www.teradata.com/Resources/Case-Studies/Georgia-DOT-Variable-Speed-Limit-Analytics (accessed July 2018); “Georgia DOT Variable Speed Limits.” www.dot.ga.gov/DriveSmart/SafetyOperation/Pages/VSL.aspx (accessed August 2018). Used with permission from Teradata.

In the next section, we discuss a very popular variation of deep MLP architecture called convolutional neural network (CNN) specifically designed for computer vision applications (e.g., image recognition, handwritten text processing).

SECTION 6.6 REVIEW QUESTIONS

1. What is meant by “deep” in deep neural networks? Compare deep neural networks to shallow neural networks.

2. What is GPU? How does it relate to deep neural networks?

3. How does a feedforward multilayer perceptron-type deep network work?

4. Comment on the impact of random weights in developing deep MLP.

5. Which strategy is better: more hidden layers versus more neurons?

6.7 CONVOLUTIONAL NEURAL NETWORKS

CNNs (LeCun et al., 1989) are among the most popular types of deep learning methods. CNNs are in essence variations of the deep MLP architecture, initially designed for computer vision applications (e.g., image processing, video processing, text recognition) but also applicable to nonimage data sets.

The main characteristic of the convolutional networks is having at least one layer involving a convolution weight function instead of general matrix multiplication. Figure 6.21 illustrates a typical convolutional unit.

Convolution, typically shown by the ⊛ symbol, is a linear operation that essentially aims at extracting simple patterns from sophisticated data patterns. For instance, in processing an image containing several objects and colors, convolution functions can extract simple patterns like the existence of horizontal or vertical lines or edges in different parts of the picture. We discuss convolution functions in more detail in the next section.

A layer containing a convolution function in a CNN is called a convolution layer. This layer is often followed by a pooling (a.k.a. subsampling) layer. Pooling layers are in charge of consolidating the large tensors to one with a smaller size and reducing the number of model parameters while keeping their important features. Different types of pooling layers are also discussed in the following sections.

Convolution Function

In the description of MLP networks, it was said that the weight function is generally a matrix manipulation function that multiplies the weight vector into the input vector to produce the output vector in each layer. Having a very large input vector/tensor, which is the case in most deep learning applications, we need a large number of weight parameters so that each single input to each neuron could be assigned a single weight parameter. For instance, in an image-processing task using a neural network for images of size 150 × 150 pixels, each input matrix will contain 22,500 (i.e., 150 times 150) integers, each of which should be assigned its own weight parameter per each neuron it goes into throughout the network. Therefore, having even only a single layer requires thousands of weight parameters to be defined and trained. As one might guess, this fact would dramatically increase the required time and processing power to train a network, since in each training iteration, all of those weight parameters have to be updated by the SGD algorithm. The solution to this problem is the convolution function.
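The arithmetic above can be made concrete with a short sketch. The layer sizes below (100 neurons in the fully connected layer and a single 3 × 3 kernel in the convolution layer) are assumptions chosen only for illustration; they do not come from the text, and bias terms are ignored.

```python
# Weight counts for one layer applied to a 150 x 150 single-channel image.
height, width = 150, 150
inputs_per_neuron = height * width            # 22,500 pixel values

# Hypothetical fully connected layer with 100 neurons (assumed size):
# every neuron needs its own weight for every input.
dense_neurons = 100
dense_weights = inputs_per_neuron * dense_neurons

# Hypothetical convolution layer with one shared 3 x 3 kernel (assumed size):
# the same 9 weights are reused at every position of the image.
kernel_rows, kernel_cols = 3, 3
conv_weights = kernel_rows * kernel_cols

print(inputs_per_neuron)  # 22500
print(dense_weights)      # 2250000
print(conv_weights)       # 9
```

Even with these modest assumed sizes, the fully connected layer needs millions of weights, whereas the shared kernel needs only nine.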

[Figure 6.21 labels: Input p, weight w, bias b, net input n, transfer function f, output a; Convolutional Unit: a = f(w ⊛ p + b).]

FIGURE 6.21 Typical Convolutional Network Unit.


The convolution function can be thought of as a trick to address the issue defined in the previous paragraph. The trick is called parameter sharing, which in addition to computational efficiency provides additional benefits. Specifically, in a convolution layer, instead of having a weight for each input, there is a set of weights referred to as the convolution kernel or filter, which is shared between inputs and moves around the input matrix to produce the outputs. The kernel is typically represented as a small matrix $W_{r \times c}$; for a given input matrix $V$, the convolution function can then be stated as:

$$z_{i,j} = \sum_{k=1}^{r} \sum_{l=1}^{c} w_{k,l} \, v_{i+k-1,\; j+l-1}$$

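For example, assume that the input matrix to a layer and the convolution kernel are

$$V = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 & 1 & 1 \\ 1 & 1 & 0 & 0 & 0 & 1 \end{bmatrix}, \qquad W = \begin{bmatrix} 0 & 1 \\ 1 & 1 \end{bmatrix}$$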

Figure 6.22 illustrates how the convolution output can be computed. As shown, each element of the output matrix results from summing up the element-by-element products of the kernel elements and a corresponding r × c (in this example, 2 × 2 because the kernel is 2 × 2) subset of the input matrix elements. So, in the example shown, the element at the second column of the first row of the output matrix is in fact 0(0) + 1(1) + 1(1) + 1(0) = 2.

It can be seen that the magnitude of each element in the output matrix depends directly on how closely the kernel (the 2 × 2 matrix) matches the part of the input matrix involved in the calculation of that element. For example, the element at the fourth column of the first row of the output matrix is the result of convolving the kernel with a part of the input matrix that is exactly the same as the kernel (shown in Figure 6.23). This suggests that by applying the convolution operation, we are in effect converting the input matrix into an output in which the parts that have a particular feature (reflected by the kernel) stand out with the largest values.
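The computation described above can be reproduced with a few lines of NumPy. This is a minimal sketch rather than an optimized implementation. The values of V and W below are one reading of the example matrices, chosen because it agrees with both worked values in the text (the element in row 1, column 2 equals 2, and the window at row 1, column 4 equals the kernel); the function name `convolve2d_valid` is made up for this illustration.

```python
import numpy as np

def convolve2d_valid(v, w):
    """Compute z[i, j] = sum_k sum_l w[k, l] * v[i + k, j + l] (0-based indices)."""
    r, c = w.shape
    out_rows = v.shape[0] - r + 1
    out_cols = v.shape[1] - c + 1
    z = np.zeros((out_rows, out_cols), dtype=v.dtype)
    for i in range(out_rows):
        for j in range(out_cols):
            # Element-by-element product of the kernel and the current window, summed.
            z[i, j] = np.sum(w * v[i:i + r, j:j + c])
    return z

# Assumed reading of the example input matrix and kernel.
V = np.array([[1, 0, 1, 0, 1, 1],
              [1, 1, 0, 1, 1, 1],
              [1, 1, 0, 0, 0, 1]])
W = np.array([[0, 1],
              [1, 1]])

Z = convolve2d_valid(V, W)
print(Z)
# [[2 2 1 3 3]
#  [3 1 1 1 2]]
```

The output has 2 rows and 5 columns, one row and one column fewer than the 3 × 6 input. The value 3 is the largest possible here because the kernel contains three ones; it appears wherever the window covers every position in which the kernel has a one.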

[Figure 6.22 panels: Input matrix (V), Kernel (W), and Output matrix (Z).]

FIGURE 6.22 Convolution of a 2 × 2 Kernel by a 3 × 6 Input Matrix.

FIGURE 6.23 The Output of Convolution Operation Is Maximized When the Kernel Exactly Matches the Part of the Input Matrix Being Convolved.

This characteristic of convolution functions is especially useful in practical image-processing applications. For instance, if the input matrix represents the pixels of an image, a particular kernel representing a specific shape (e.g., a diagonal line) may be convolved with that image to extract parts of the image involving that specific shape. Figure 6.24, for example, shows the result of applying a 3 × 3 horizontal line kernel to a 15 × 15 image of a square.

Clearly, the horizontal kernel produces an output in which the location of horizontal lines (as a feature) in the original input image is identified.

Convolution using a kernel of size r × c will reduce the number of rows and columns in the output by r − 1 and c − 1, respectively. In the preceding example, for instance, using a 2 × 2 kernel for convolution, the output matrix has 1 row and 1 column less than the input matrix. To prevent this change of size, we can pad the outside of the input matrix with zeros before convolving, that is, add r − 1 rows and c − 1 columns of zeros to the input matrix. On the other hand, if we want the output matrix to be even smaller, we can have the kernel take larger strides, or kernel movements. Normally, the kernel is moved one step at a time (i.e., stride = 1) when performing the convolution. By increasing this stride to 2, the size of the output matrix is reduced by a factor of about 2 in each dimension.
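The effect of padding and stride on the output size can be checked with a small helper. This is a sketch under the usual convention that the output has floor((n + padding − kernel) / stride) + 1 positions along each dimension; the function name and its parameters are assumptions made for this illustration, and `pad_rows` / `pad_cols` denote the total number of zero rows and columns added.

```python
def conv_output_shape(in_rows, in_cols, r, c, stride=1, pad_rows=0, pad_cols=0):
    """Rows and columns produced by sliding an r x c kernel over a zero-padded input.

    pad_rows / pad_cols are the TOTAL numbers of zero rows / columns added.
    """
    out_rows = (in_rows + pad_rows - r) // stride + 1
    out_cols = (in_cols + pad_cols - c) // stride + 1
    return out_rows, out_cols

# 2 x 2 kernel on a 3 x 6 input: one row and one column are lost.
print(conv_output_shape(3, 6, 2, 2))                              # (2, 5)
# Adding r - 1 = 1 row and c - 1 = 1 column of zeros keeps the size.
print(conv_output_shape(3, 6, 2, 2, pad_rows=1, pad_cols=1))      # (3, 6)
# 3 x 3 kernel on a padded 15 x 15 image, stride 1 versus stride 2.
print(conv_output_shape(15, 15, 3, 3, pad_rows=2, pad_cols=2))    # (15, 15)
print(conv_output_shape(15, 15, 3, 3, stride=2, pad_rows=2, pad_cols=2))  # (8, 8)
```

With stride 2 the 15 × 15 output shrinks to 8 × 8, which is why the reduction is only approximately a factor of 2.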

Although the main benefit of employing convolution in deep networks is parameter sharing, which effectively reduces the required time and processing power to train the network by reducing the number of weight parameters, it brings some other benefits as well. A convolution layer in a network has a property called equivariance to translation (Goodfellow et al., 2016). It simply means that shifting the input shifts the output in the same way. For instance, moving an object in the input image by 10 pixels in a particular direction will lead to moving its representation in the output image by 10 pixels in the same direction. Apart from image-processing applications, this feature is especially useful for analyzing time-series data using convolutional networks, where convolution can produce a kind of timeline that shows when each feature appears in the input.
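Translation equivariance is easy to observe on a one-dimensional signal, such as a short time series. The sketch below uses made-up values for the signal and the kernel, and NumPy's `np.correlate` as the sliding-window operation (it multiplies and sums without flipping the kernel, which matches the formula used in this section).

```python
import numpy as np

kernel = np.array([1, 2, 1])

signal = np.zeros(20, dtype=int)
signal[4:7] = [1, 2, 1]              # a feature matching the kernel, starting at position 4

shift = 10
shifted = np.roll(signal, shift)     # the same feature moved 10 steps to the right

out = np.correlate(signal, kernel, mode="valid")
out_shifted = np.correlate(shifted, kernel, mode="valid")

# The strongest response moves by exactly the same 10 steps.
print(int(np.argmax(out)), int(np.argmax(out_shifted)))   # 4 14
```

The position of the peak in the output acts as the "timeline" mentioned above: it records where in the input the feature occurs.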

It should be noted that in almost all of the practical applications of convolutional networks, many convolution operations are used in parallel to extract various kinds of features from the data, because a single feature is hardly enough to fully describe the inputs for classification or recognition purposes. Also, as noted before, in most real-world applications, we have to represent the inputs as multi-dimensional tensors. For instance, in the processing of color images as opposed to grayscale pictures, instead of having 2D tensors (i.e., matrices) that represent the shade of each pixel (i.e., between black and white), one will have to use 3D tensors because each pixel should be defined using the intensities of red, green, and blue.
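The two points above (tensor dimensionality and several kernels used in parallel) can be sketched as follows. The image contents are random stand-ins, and the two 3 × 3 kernels are hypothetical examples of horizontal-edge and vertical-edge detectors; none of these values come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
gray = rng.integers(0, 256, size=(15, 15))       # 2D tensor: one intensity per pixel
color = rng.integers(0, 256, size=(15, 15, 3))   # 3D tensor: red, green, blue per pixel

# Two hypothetical 3 x 3 kernels applied in parallel to the grayscale image.
kernels = np.array([
    [[1, 1, 1], [0, 0, 0], [-1, -1, -1]],   # responds to horizontal edges
    [[1, 0, -1], [1, 0, -1], [1, 0, -1]],   # responds to vertical edges
])

def convolve2d_valid(v, w):
    """Slide w over v (stride 1, no padding) and sum the element-wise products."""
    r, c = w.shape
    out = np.zeros((v.shape[0] - r + 1, v.shape[1] - c + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w * v[i:i + r, j:j + c])
    return out

# One output matrix (feature map) per kernel, stacked into a 3D tensor.
feature_maps = np.stack([convolve2d_valid(gray, k) for k in kernels])

print(gray.ndim, color.ndim)   # 2 3
print(feature_maps.shape)      # (2, 13, 13)
```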

[Figure 6.24 panels: Input Image (15 × 15), Horizontal Kernel, and Output Image (15 × 15).]

FIGURE 6.24 Example of Using Convolution for Extracting Features (Horizontal Lines in This Example) from Images.


Pooling

Most of the time, a convolution layer is followed by another layer known as the pooling (a.k.a. subsampling) layer. The purpose of a pooling layer is to consolidate elements in the input matrix to produce a smaller output matrix while maintaining the important features. Normally, a pooling function involves an r × c consolidation window (similar to a kernel in the convolution function) that moves around the input matrix and in each move calculates some summary statistic of the elements involved in the consolidation window so that it can be put in the output image. For example, a particular type of pooling function called average pooling takes the average of the input matrix elements involved in the consolidation window and puts that average value as an element of the output matrix in the corresponding location. Similarly, the max pooling function (Zhou et al.) takes the maximum of the values in the window as the output element. Unlike convolution, for the pooling function, given the size of the consolidation window (i.e., r and c), the stride should be carefully selected so that there are no overlaps in the consolidations. The pooling operation using an r × c consolidation window reduces the number of rows and columns of the input matrix by a factor of r and c, respectively. For example, using a 3 × 3 consolidation window, a 15 × 15 matrix will be consolidated to a 5 × 5 matrix.
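Non-overlapping pooling, as described above, can be sketched in NumPy by cutting the input into r × c blocks and summarizing each block. The function name `pool2d` and the small 6 × 6 test matrix are assumptions made for illustration; rows or columns that do not fill a complete window are simply dropped in this sketch.

```python
import numpy as np

def pool2d(x, r, c, reducer=np.max):
    """Non-overlapping pooling: the stride equals the window size (r, c)."""
    rows, cols = x.shape[0] // r, x.shape[1] // c
    # Trim leftover rows/columns, then view the matrix as (rows, r, cols, c) blocks.
    blocks = x[:rows * r, :cols * c].reshape(rows, r, cols, c)
    return reducer(blocks, axis=(1, 3))

x = np.arange(36).reshape(6, 6)          # values 0..35, row by row
max_pooled = pool2d(x, 3, 3, np.max)
avg_pooled = pool2d(x, 3, 3, np.mean)

print(max_pooled)
# [[14 17]
#  [32 35]]
print(avg_pooled)
# [[ 7. 10.]
#  [25. 28.]]
print(pool2d(np.zeros((15, 15)), 3, 3).shape)   # (5, 5)
```

The last line confirms the size reduction stated in the text: a 3 × 3 window turns a 15 × 15 matrix into a 5 × 5 matrix.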

Pooling, in addition to reducing the number of parameters, is especially useful in the image-processing applications of deep learning in which the critical task is to determine whether a feature (e.g., a particular animal) is present in an image while the exact spatial location of that feature in the picture is not important. However, if the location of features is important in a particular context, applying a pooling function could potentially be misleading.

You can think of pooling as an operation that summarizes large inputs whose features are already extracted by the convolution layer and shows us just the important parts (i.e., features) in each small neighborhood in the input space. For instance, in the case of the image-processing example shown in Figure 6.24, if we place a max pooling layer after the convolution layer using a 3 × 3 consolidation window, the output will be like what is shown in Figure 6.25. As shown, the 15 × 15 already convolved image is consolidated into a 5 × 5 image while the main features (i.e., horizontal lines) are maintained therein.

Sometimes pooling is used just to modify the size of matrices coming from the previous layer and convert them to a specified size required by the following layer in the network.

[Figure 6.25 panels: Horizontal Convoluted Square (15 × 15), Max Pooling, and Output.]

FIGURE 6.25 An Example of Applying Max Pooling on an Output Image to Reduce Its Size.

There are various types of pooling operations such as max pooling, average pooling, the L2 norm of a rectangular neighborhood, and weighted average pooling. The choice of proper pooling operation as well as the decision to include a pooling layer in the network at all depends highly on the context and properties of the problem that the network is solving. There are some guidelines in the literature to help the network designers in making such decisions (Boureau et al., 2011; Boureau, Ponce, and LeCun, 2010; Scherer, Müller, and Behnke, 2010).

Image Processing Using Convolutional Networks

Real applications of deep learning in general and CNNs in particular highly depend on the availability of large, annotated data sets. Theoretically, CNNs can be applied to many practical problems, and today there are many large and feature-rich databases for such applications available. Nevertheless, the biggest challenge is that in supervised learning applications, one needs an already annotated (i.e., labeled) data set to train the model before it can be used for prediction/identification of other unknown cases. Whereas extracting features of data sets using CNN layers is an unsupervised task, the extracted features will not be of much use without having labeled cases to develop a classification network in a supervised learning fashion. That is why image classification networks traditionally involve two pipelines: visual feature extraction and image classification.

ImageNet (http://www.image-net.org) is an ongoing research project that provides researchers with a large database of images, each linked to a set of synonym words (known as synset) from WordNet (a word hierarchy database). Each synset represents a particular concept in the WordNet. Currently, WordNet includes more than 100,000 synsets, each of which is supposed to be illustrated by an average of 1,000 images in the ImageNet. ImageNet is a huge database for developing image processing–type deep networks. It contains more than 15 million labeled images in 22,000 categories. Because of its sheer size and proper categorization, ImageNet is by far the most widely used benchmarking data set to assess the efficiency and accuracy of deep networks designed by deep learning researchers.

[Figure 6.26 labels: five convolution layers (C1 to C5) followed by three fully connected layers (FC6 and FC7 with 4,096 units each, and FC8 with 1,000 units).]

FIGURE 6.26 Architecture of AlexNet, a Convolutional Network for Image Classification.

One of the first convolutional networks designed for image classification using the ImageNet data set was AlexNet (Krizhevsky, Sutskever, and Hinton, 2012). It was composed of five convolution layers followed by three fully connected (a.k.a. dense) layers (see Figure 6.26 for a schematic representation of AlexNet). One of the contributions of this relatively simple architecture that made its training remarkably faster and computationally efficient was the use of rectified linear unit (ReLU) transfer functions in the convolution layers instead of the traditional sigmoid functions. By doing so, the designers addressed the issue called the vanishing gradient problem, caused by the very small derivatives of sigmoid functions in some regions of their input range (i.e., where the function saturates). The other important contribution of this network that has a dramatic role in improving the efficiency of deep networks was the introduction of the concept of dropout layers to the CNNs as a regularization technique to reduce overfitting. A dropout layer typically comes after the fully connected layers and randomly switches off each neuron with a given probability, making the network sparser.
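The difference between the two transfer functions can be seen by comparing their derivatives at a few input values. This is an illustrative sketch only; the sample inputs are arbitrary, and the derivative of ReLU at exactly zero is set to 0 here by convention.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: s(x) * (1 - s(x)); never larger than 0.25.
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # Derivative of max(0, x): 1 for positive inputs, 0 otherwise.
    return (x > 0).astype(float)

x = np.array([-10.0, -2.0, 0.5, 2.0, 10.0])
print(sigmoid_grad(x))   # very small at -10 and 10, where the sigmoid saturates
print(relu_grad(x))      # [0. 0. 1. 1. 1.]
```

When many layers are chained, gradients are multiplied together during backpropagation; factors far below 1 (as with a saturated sigmoid) shrink the product toward zero, while a factor of exactly 1 (ReLU on positive inputs) does not.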

In recent years, in addition to a large number of data scientists who showcase their deep learning capabilities, a number of well-known industry-leading companies such as Microsoft, Google, and Facebook have participated in the annual ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The goal in the ILSVRC classification task is to design and train networks that are capable of classifying 1.2 million input images into one of the 1,000 image categories. For instance, GoogLeNet (a.k.a. Inception), a deep convolutional network architecture designed by Google researchers, was the winning architecture of ILSVRC 2014 with a 22-layer network and only a 6.66 percent classification error rate, only slightly worse than the human-level classification error (5.1%) (Russakovsky et al., 2015). The main contribution of the GoogLeNet architecture was to introduce a module called Inception. The idea of Inception is that because one would have no idea of the size of convolution kernel that would perform best on a particular data set, it is better to include multiple convolutions and let the network decide which one to use. Therefore, as shown in Figure 6.27, in each convolution layer, the data coming from the previous layer is passed through multiple types of convolution and the outputs are concatenated before going to the next layer. Such architecture allows the model to take into account both local features via smaller convolutions and highly abstracted features via larger ones.
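The core idea of running several convolutions side by side and concatenating their outputs can be sketched in plain NumPy. This is a heavily simplified illustration, not the GoogLeNet implementation: it works on a single 8 × 8 feature map, uses random numbers in place of learned kernel weights, and omits the additional 1 × 1 convolutions that the actual module places around the larger branches. The helper names `conv_same` and `max_pool_same` are made up for this sketch.

```python
import numpy as np

def conv_same(x, w):
    """Stride-1 convolution with zero padding so the output keeps x's shape (odd kernel sizes)."""
    r, c = w.shape
    padded = np.pad(x, ((r // 2, r // 2), (c // 2, c // 2)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(w * padded[i:i + r, j:j + c])
    return out

def max_pool_same(x, size=3):
    """Stride-1 max pooling, padded with -inf so the output keeps x's shape."""
    p = size // 2
    padded = np.pad(x, p, constant_values=-np.inf)
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = padded[i:i + size, j:j + size].max()
    return out

rng = np.random.default_rng(0)
x = rng.random((8, 8))                        # stand-in for the previous layer's output

branches = [
    conv_same(x, rng.standard_normal((1, 1))),    # 1 x 1 convolution
    conv_same(x, rng.standard_normal((3, 3))),    # 3 x 3 convolution
    conv_same(x, rng.standard_normal((5, 5))),    # 5 x 5 convolution
    max_pool_same(x, 3),                          # 3 x 3 max pooling
]

# "Filter concatenation": stack the branch outputs along a new channel axis.
stacked = np.stack(branches, axis=0)
print(stacked.shape)                          # (4, 8, 8)
```

Because every branch preserves the spatial size, the outputs can be stacked directly; the layers that follow then learn how much weight to give each branch.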

[Figure 6.27 labels: the Previous Layer feeds parallel branches of 1 × 1 Convolutions, 3 × 3 Convolutions, 5 × 5 Convolutions, and 3 × 3 Max Pooling, whose outputs are merged by Filter Concatenation.]

FIGURE 6.27 Conceptual Representation of the Inception Feature in GoogLeNet.

Google recently launched a new service, Google Lens, that uses deep learning artificial neural network algorithms (along with other AI techniques) to deliver information about the images captured by users from their nearby objects. This involves identifying the objects, products, plants, animals, and locations and providing information about them on the Internet. Some other features of this service are the capability of saving contact information from a business card image on the phone, identifying type of plants and breed of animals, identifying books and movies from their cover photos, and providing information (e.g., stores, theaters, shopping, reservations) about them. Figure 6.28 shows two examples of using the Google Lens app on an Android mobile device.

Even though more accurate networks have been developed since then (e.g., He, Zhang, Ren, & Sun, 2015), in terms of efficiency and processing requirements (i.e., smaller number of layers and parameters), GoogLeNet is considered to be one of the best architectures to date. Apart from AlexNet and GoogLeNet, several other convolutional network architectures such as Residual Networks (ResNet), VGGNet, and Xception have been developed and have contributed to the image-processing area, all relying on the ImageNet database.

In a May 2018 effort to address the labor-intensive task of labeling images on a large scale, Facebook published a weakly supervised training image recognition deep learning project (Mahajan et al., 2018). This project used hashtags made by the users on the images posted on Instagram as labels and trained a deep learning image recognition model based on that. The model was trained using 3.5 billion Instagram images labeled with around 17,000 hashtags using 336 GPUs working in parallel; the training procedure took a few weeks to be accomplished. A preliminary version of the model (trained using only 1 billion images and 1,500 hashtags) was then tested on the ImageNet benchmark data set and is reported to have outperformed the state-of-the-art models in terms of accuracy by more than 2 percent. This big achievement by Facebook surely will open doors to a new world of image processing using deep learning since it can dramatically increase the size of available image data sets that are labeled for training purposes.

Use of deep learning and advanced analytics methods to classify images has evolved into the recognition of human faces and has become a very popular application for a variety of purposes. It is discussed in Application Case 6.6.

FIGURE 6.28 Two Examples of Using the Google Lens, a Service Based on Convolutional Deep Networks for Image Recognition. Source: ©2018 Google LLC, used with permission. Google and the Google logo are registered trademarks of Google LLC.


Face recognition, although seemingly similar to image recognition, is a much more complicated undertaking. The goal of face recognition is to identify the individ- ual as opposed to the class it belongs to (human), and this identification task needs to be performed on a nonstatic (i.e., moving person) 3D environment. Face recognition has been an active research field in AI for many decades with limited success until recently. Thanks to the new generation of algorithms (i.e., deep learning) coupled with large data sets and computa- tional power, face recognition technology is starting to make a significant impact on real-world applications. From security to marketing, face recognition and the variety of applications/use cases of this technology are increasing at an astounding pace.

Some of the premier examples of face recogni- tion (both in advancements in technology and in the creative use of the technology perspectives) come from China. Today in China, face recognition is a very hot topic both from business development and from application development perspectives. Face recognition has become a fruitful ecosystem with hundreds of start-ups in China. In personal and/or business settings, people in China are widely using and relying on devices whose security is based on automatic recognition of their faces.

As perhaps the largest scale practical applica- tion case of deep learning and face recognition in the world today, the Chinese government recently started a project known as “Sharp Eyes” that aims at establishing a nationwide surveillance system based on face recognition. The project plans to integrate security cameras already installed in public places with private cameras on buildings and to utilize AI and deep learning to analyze the videos from those cameras. With millions of cameras and billions of lines of code, China is building a high-tech authori- tarian future. With this system, cameras in some cit- ies can scan train and bus stations as well as airports to identify and catch China’s most wanted suspected criminals. Billboard-size displays can show the faces of jaywalkers and list the names and pictures of peo- ple who do not pay their debts. Facial recognition scanners guard the entrances to housing complexes.

An interesting example of this surveillance system is the “shame game” (Mozur, 2018). An

intersection south of Changhong Bridge in the city of Xiangyang previously was a nightmare. Cars drove fast, and jaywalkers darted into the street. Then, in the summer of 2017, the police put up cameras linked to facial recognition technology and a big out- door screen. Photos of lawbreakers were displayed alongside their names and government identifica- tion numbers. People were initially excited to see their faces on the screen until propaganda outlets told them that this was a form of punishment. Using this, citizens not only became a subject of this shame game but also were assigned negative citizenship points. Conversely, on the positive side, if people are caught on camera showing good behavior, like pick- ing up a piece of trash from the road and putting it into a trash can or helping an elderly person cross an intersection, they get positive citizenship points that can be used for a variety of small awards.

China already has an estimated 200 million sur- veillance cameras—four times as many as the United States. The system is mainly intended to be used for tracking suspects, spotting suspicious behavior, and predicting crimes. For instance, to find a criminal, the image of a suspect can be uploaded to the system, matching it against millions of faces recognized from videos of millions of active security cameras across the country. This can find individuals with a high degree of similarity. The system also is merged with a huge database of information on medical records, travel bookings, online purchases, and even social media activities of every citizen and can monitor practically everyone in the country (with 1.4 billion people), tracking where they are and what they are doing each moment (Denyer, 2018). Going beyond narrowly defined security purposes, the govern- ment expects Sharp Eyes to ultimately assign every individual in the country a “social credit score” that specifies to what extent she or he is trustworthy.

While such an unrestricted application of deep learning (i.e., spying on citizens) is against the privacy and ethical norms and regulations of many western countries, including the United States, it is becoming a common practice in countries with less restrictive privacy laws and concerns as in China. Even western countries have begun to plan on employing similar technologies in limited scales only for security and

Application Case 6.6 From Image Recognition to Face Recognition

Chapter 6 • Deep Learning and Cognitive Computing 357

Text Processing Using Convolutional Networks

In addition to image processing, which was in fact the main reason for the popularity and development of convolutional networks, they have been shown to be useful in some large-scale text mining tasks as well. Especially since 2013, when Google published its word2vec project (Mikolov et al., 2013; Mikolov, Sutskever, Chen, Corrado, and Dean, 2013), the applications of deep learning for text mining have increased remarkably.

Word2vec is a two-layer neural network that gets a large text corpus as the input and converts each word in the corpus to a numeric vector of any given size (typically ranging from 100 to 1,000) with very interesting features. Although word2vec itself is not a deep learning algorithm, its outputs (word vectors also known as word embeddings) already have been widely used in many deep learning research and commercial projects as inputs.

One of the most interesting properties of word vectors created by the word2vec algorithm is maintaining the words’ relative associations. For example, vector operations

vector (‘King’) - vector (‘Man’) + vector (‘Woman’)

and

vector (‘London’) - vector (‘England’) + vector (‘France’)

will result in a vector very close to vector (‘Queen’) and vector (‘Paris’), respectively. Figure 6.29 shows a simple vector representation of the first example in a two-dimensional vector space.

Moreover, the vectors are specified in such a way that those of a similar context are placed very close to each other in the n-dimensional vector space. For instance, in the word2vec model pretrained by Google using a corpus including about 100 billion words (taken from Google News), the closest vectors to the vector (‘Sweden’) in terms of cosine distance, as shown in Table 6.2, identify European country names near the Scandinavian region, the same region in which Sweden is located.

Additionally, since word2vec takes into account the contexts in which a word has been used and the frequency of using it in each context in guessing the meaning of the word, it enables us to represent each term with its semantic context instead of just the syntactic/symbolic term itself. As a result, word2vec addresses several word variation issues that used to be problematic in traditional text mining activities. In other words,

crime prevention purposes. The FBI’s Next Generation Identification System, for instance, is a lawful appli- cation of facial recognition and deep learning that compares images from crime scenes with a national database of mug shots to identify potential suspects.

Questions for Case 6.6

1. What are the technical challenges in face recognition?

2. Beyond security and surveillance purposes, where else do you think face recognition can be used?

3. What are the foreseeable social and cultural problems with developing and using face recog- nition technology?

Sources: Mozur, P. (2018, June 8). “Inside China’s Dystopian Dreams: A.I., Shame and Lots of Cameras.” The New York Times. https://www.nytimes.com/2018/07/08/business/china- surveillance-technology.html; Denyer, S. (2018, January). “Beijing Bets on Facial Recognition in a Big Drive for Total Surveillance.” The Washington Post. https://www.washing- tonpost.com/news/world/wp/2018/01/07/feature/in- china-facial-recognition-is-sharp-end-of-a-drive-for-total- surveillance/?noredirect=on&utm_term=.e73091681b31.

word2vec is able to handle and correctly represent words including typos, abbreviations, and informal conversations. For instance, the words Frnce, Franse, and Frans would all get roughly the same word embeddings as their original counterpart France. Word embeddings are also able to determine other interesting types of associations such as distinction of entities (e.g., vector[‘human’] − vector[‘animal’] ≈ vector[‘ethics’]) or geopolitical associations (e.g., vector[‘Iraq’] − vector[‘violence’] ≈ vector[‘Jordan’]).
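The vector arithmetic behind such associations can be sketched with a classic analogy (king − man + woman ≈ queen). The vectors below are hand-made toy values chosen so that the arithmetic works out; they are an assumption for illustration and are not taken from any trained word2vec model.

```python
import math

# Invented toy vectors: (royalty, femininity, "is a person"); not real embeddings.
toy = {
    "king":  [0.9, 0.1, 0.5],
    "queen": [0.9, 0.9, 0.5],
    "man":   [0.1, 0.1, 0.5],
    "woman": [0.1, 0.9, 0.5],
    "apple": [0.5, 0.5, -0.8],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vector[a] - vector[b] + vector[c]."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))

print(analogy("king", "man", "woman", toy))
```

With real embeddings the match is only approximate, which is why the nearest remaining vector is selected rather than an exact equality being tested.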

By providing such a meaningful representation of textual data, in recent years, word2vec has driven many deep learning–based text mining projects in a wide range of contexts (e.g., medical, computer science, social media, marketing), and various types of deep networks have been applied to the word embeddings created by this algorithm to accomplish different objectives. Particularly, a large group of studies has developed convolutional networks applied to the word embeddings with the aim of relation extraction from textual data sets. Relation extraction is one of the subtasks of natural language processing (NLP) that focuses on determining whether two or more named entities recognized in the text form specific relationships (e.g., “A causes B”; “B is caused by A”). For instance, Zeng et al. (2014) developed a deep convolutional network (see Figure 6.30) to classify relations between specified entities in sentences. To this end, these researchers

[Figure 6.29: two-dimensional plot of the word vectors King, Queen, Man, and Woman, with the offset King − Man marked.]

FIGURE 6.29 Typical Vector Representation of Word Embeddings in a Two-Dimensional Space

TABLE 6.2 Example of the word2vec Project Indicating the Closest Word Vectors to the Word “Sweden”

Word           Cosine Distance
Norway         0.760124
Denmark        0.715460
Finland        0.620022
Switzerland    0.588132
Belgium        0.585635
Netherlands    0.574631
Iceland        0.562368
Estonia        0.547621
Slovenia       0.531408

used a matrix format to represent each sentence. Each column of the input matrices is in fact the word embedding (i.e., vector) associated with one of the words involved in the sentence. Zeng et al. then used a convolutional network, shown in the right box in Figure 6.30, to automatically learn the sentence-level features and concatenate those features (i.e., the output vector of the CNN) with some basic lexical features (e.g., the order of the two words of interest within the sentence and the left and right tokens for each of them). The concatenated feature vector then is fed into a classification layer with a softmax transfer function, which determines the type of relationship between the two words of interest among multiple predefined types. The softmax transfer function is the most common type of function to be used for classification layers, especially when the number of classes is more than two. For classification problems with only two outcome categories, log-sigmoid transfer functions are also very popular. The proposed approach by Zeng et al. was shown to correctly classify the relation between the marked terms in sentences of a sample data set with an 82.7 percent accuracy.
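As a minimal sketch of what such a classification layer computes, the code below turns a vector of raw class scores into probabilities with softmax and shows the log-sigmoid alternative for a two-class output. The relation names and score values are hypothetical, and the learned weights that would produce the scores are omitted.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_sigmoid(score):
    """Logistic (log-sigmoid) transfer function for a two-class output."""
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical raw scores for three relation types produced by the last layer.
relation_types = ["cause-effect", "component-whole", "other"]
scores = [2.0, 0.5, -1.0]
probs = softmax(scores)
predicted = relation_types[probs.index(max(probs))]

print(predicted, [round(p, 3) for p in probs])
print(log_sigmoid(0.0))
```

The class with the largest probability is reported as the predicted relation type; with only two classes, a single log-sigmoid output above or below 0.5 serves the same purpose.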

In a similar study, Nguyen and Grishman (2015) used a four-layer convolutional network with multiple kernel sizes in each convolution layer fed by the real-valued vectors of words included in sentences to classify the type of relationship between the two marked words in each sentence. In the input matrix, each row was the word embedding associated with the word at the same position in the sentence as the row number. In addition, these researchers added two more columns to the input matrices to represent the relative position of each word (either positive or negative) with regard to each of the marked terms. The automatically extracted features then were passed through a classification layer with a softmax function to determine the type of relationship. Nguyen and Grishman trained their model using 8,000 annotated examples (with 19 predefined classes of relationships), tested the trained model on a set of 2,717 validation examples, and achieved a classification accuracy of 61.32 percent (i.e., more than 11 times better than the roughly 5 percent expected from guessing uniformly among 19 classes).
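The relative-position columns can be illustrated with a short sketch. It reuses the example sentence from Figure 6.30 and assumes, for illustration, that the two marked terms are the first and last words. The function name is hypothetical, and the word embeddings that would fill the remaining columns are left out.

```python
def relative_positions(tokens, entity1_index, entity2_index):
    """For each token, return (token, offset from entity 1, offset from entity 2)."""
    return [
        (token, i - entity1_index, i - entity2_index)
        for i, token in enumerate(tokens)
    ]

tokens = "People have been moving back into downtown".split()
rows = relative_positions(tokens, entity1_index=0, entity2_index=6)

for token, d1, d2 in rows:
    print(f"{token:10s} {d1:+d} {d2:+d}")
```

Words to the left of a marked term get negative offsets and words to the right get positive ones, which matches the "either positive or negative" description above.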

Such text mining approaches using convolutional deep networks can be extended to various practical contexts. Again, the big challenge here, just as in image processing, is the lack of sufficiently large annotated data sets for supervised training of deep networks. A distant supervision method of training has been proposed (Mintz et al., 2009) to address this challenge. It suggests that large amounts of training data can be produced by aligning knowledge base (KB) facts with texts. In fact, this approach is based on the assumption that if a particular type of relation exists between an entity pair (e.g., “A” is a component of “B”) in the KB, then every text document containing the mention of the

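[Figure 6.30: the example sentence “[People] have been moving back into [downtown]” passes through word representation and feature extraction, which combines lexical level features with sentence level features, before the output layer (W3x). The sentence level features are obtained from the WF and PF inputs through window processing, a convolution (W1), max over times, and tanh(W2x).]

FIGURE 6.30 CNN Architecture for Relation Extraction Task in Text Mining.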

entity pair would express that relation. However, since this assumption was not very realistic, Riedel, Yao, and McCallum (2010) later relaxed it by modeling the problem as a multi-instance learning problem. They suggest assigning labels to a bag of instances rather than to a single instance, which can reduce the noise of the distant supervision method and create more realistic labeled training data sets (Kumar, 2017).
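The difference between labeling every sentence and labeling a bag can be sketched as follows. The knowledge-base facts, the sentences, and the naive substring matching are all assumptions made for illustration; real systems rely on proper entity recognition and much larger knowledge bases.

```python
from collections import defaultdict

# Hypothetical knowledge-base facts: (entity A, entity B) -> relation type.
kb_facts = {
    ("engine", "car"): "component-of",
    ("Stockholm", "Sweden"): "capital-of",
}

# Hypothetical unlabeled sentences.
sentences = [
    "The engine of the car stalled on the highway.",
    "She parked the car and checked the engine oil.",
    "Stockholm is the largest city in Sweden.",
    "The weather was pleasant all week.",
]

def build_bags(facts, texts):
    """Group sentences that mention both entities of a KB fact into one bag per fact."""
    bags = defaultdict(list)
    for (a, b), relation in facts.items():
        for text in texts:
            if a in text and b in text:  # naive substring match (assumption)
                bags[(a, b, relation)].append(text)
    return dict(bags)

bags = build_bags(kb_facts, sentences)
for (a, b, relation), bag in bags.items():
    print(f"{relation}({a}, {b}): {len(bag)} sentence(s)")
```

Under the original assumption, both sentences in the engine/car bag would be labeled "component-of", even though the second one does not clearly express that relation. In the multi-instance view, only the bag as a whole carries the label, so such noisy sentences are less damaging.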

SECTION 6.7 REVIEW QUESTIONS

1. What is CNN?

2. For what type of applications can CNN be used?

3. What is convolution function in CNN and how does it work?

4. What is pooling in CNN? How does it work?

5. What is ImageNet and how does it relate to deep learning?

6. What is the significance of AlexNet? Draw and describe its architecture.

7. What is GoogLeNet? How does it work?

8. How does CNN process text? What are word embeddings, and how do they work?

9. What is word2vec, and what does it add to traditional text mining?

6.8 RECURRENT NETWORKS AND LONG SHORT-TERM MEMORY NETWORKS

Human thinking and understanding to a great extent relies on context. It is crucial for us, for example, to know that a particular speaker uses very sarcastic language (based on his previous speeches) to fully catch all the jokes that he makes. Likewise, trying to understand the real meaning of the word fall (i.e., either the season or to collapse) in the sentence “It is a nice day of fall” without knowledge about the other words in the surrounding sentences would only be guessing, not necessarily understanding. Knowledge of context is typically formed based on observing events that happened in the past. In fact, human thoughts are persistent, and we use every piece of information we previously acquired about an event in the process of analyzing it rather than throwing away our past knowledge and thinking from scratch every time we face similar events or situations. Hence, there seems to be a recurrence in the way humans process information.

While deep MLP and convolutional networks are specialized for processing a static grid of values like an image or a matrix of word embeddings, sometimes the sequence of input values is also important to the operation of the network to accomplish a given task and hence should be taken into account. Another popular type of neural network is the recurrent neural network (RNN) (Rumelhart et al., 1986), which is specifically designed to process sequential inputs. An RNN basically models a dynamic system where (at least in one of its hidden neurons) the state of the system (i.e., output of a hidden neuron) at each time point t depends on both the inputs to the system at that time and its state at the previous time point t − 1. In other words, RNNs are the type of neural networks that have memory and that apply that memory to determine their future outputs. For instance, in designing a neural network to play chess, it is important to take into account several previous moves while training the network, because a wrong move by a player can lead to the eventual loss of the game in the subsequent 10–15 plays. Also, to understand the real meaning of a sentence in an essay, sometimes we need to rely on the information portrayed in the previous several sentences or paragraphs. That is, for a true understanding, we need the context built sequentially and collectively over time. Therefore, it is crucial to consider a memory element for the neural network that takes into account the effect of prior moves (in the chess example) and prior sentences and paragraphs (in the essay example) to determine the best output. This memory portrays and creates the context required for the learning and understanding.

In static networks like MLP-type CNNs, we are trying to find some functions (i.e., network weights and biases) that map the inputs to some outputs that are as close as possible to the actual target. In dynamic networks like RNNs, on the other hand, both inputs and outputs are sequences (patterns). Therefore, a dynamic network is a dynamic system rather than a function because its output depends not only on the input but also on the previous outputs. Most of the RNNs use the following general equation to define the values of their hidden units (Goodfellow et al., 2016).

a(t) = f(a(t − 1), p(t), θ)

In this equation, a(t) represents the state of the system at time t, and p(t) and θ represent the input to the unit at time t and the parameters, respectively. Applying the same general equation for calculating the state of the system at time t − 1, we will have:

a(t − 1) = f(a(t − 2), p(t − 1), θ)

In other words:

a(t) = f(f(a(t − 2), p(t − 1), θ), p(t), θ)

And this equation can be extended multiple times for any given sequence length. Graphically, a recurrent unit in a network can be depicted in a circuit diagram like the one shown in Figure 6.31. In this figure, D represents the tap delay lines, or simply the delay element of the network that, at each time point t, contains a(t), the previous output value of the unit. Sometimes instead of just one value, we store several previous output values in D to account for the effect of all of them. Also, iw and lw represent the weight vectors applied to the input and the delay, respectively.
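The recurrence can be sketched for a single scalar neuron as follows. This is a minimal illustration, assuming scalar weights iw and lw, a bias b, and a tanh transfer function. It follows the timing of the general equation above, in which the delayed value is the output of the previous step.

```python
import math

def recurrent_unit(inputs, iw, lw, b, a0=0.0, f=math.tanh):
    """Run a single scalar recurrent neuron over a sequence.

    Each step computes a(t) = f(iw * p(t) + lw * a(t - 1) + b), where the
    stored previous output plays the role of the delay element D.
    """
    a_prev = a0
    outputs = []
    for p in inputs:
        a = f(iw * p + lw * a_prev + b)
        outputs.append(a)
        a_prev = a  # store the output in the delay for the next step
    return outputs

sequence = [1.0, 0.0, 0.0]
with_memory = recurrent_unit(sequence, iw=1.0, lw=0.5, b=0.0)
without_memory = recurrent_unit(sequence, iw=1.0, lw=0.0, b=0.0)

print([round(a, 4) for a in with_memory])
print([round(a, 4) for a in without_memory])
```

With lw set to zero the unit behaves like a static neuron: once the input returns to zero, so does the output. With a nonzero lw, the first input keeps influencing later outputs through the feedback.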

Technically speaking, any network with feedback can actually be called a deep network, because even with a single layer, the loop created by the feedback can be thought of as a static MLP-type network with many layers (see Figure 6.32 for a graphical illustration of this structure). However, in practice, each recurrent neural network would involve dozens of layers, each with feedback to itself, or even to the previous layers, which makes a recurrent neural network even deeper and more complicated.

Because of the feedbacks, computation of gradients in the recurrent neural networks would be somewhat different from the general backpropagation algorithm used

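[Figure 6.31: the input p(t), weighted by iw, and the bias b feed a summation (Σ) that produces the net input n(t); the transfer function f produces the output a(t), and a delay element D feeds the output back with weight lw. Recurrent neuron: a(t) = f(iw·p(t) + lw·a(t) + b).]

FIGURE 6.31 Typical Recurrent Unit.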

for the static MLP networks. There are two alternative approaches for computing the gradients in the RNNs, namely, real-time recurrent learning (RTRL) and backpropagation through time (BPTT), whose explanation is beyond the scope of this chapter. Nevertheless, the general purpose remains the same; once the gradients have been computed, the same procedures are applied to optimize the learning of the network parameters.

The LSTM networks (Hochreiter & Schmidhuber, 1997) are variations of recurrent neural networks that today are known as the most effective sequence modeling technique and are the base of many practical applications. In a dynamic network, the weights are called the long-term memory while the feedbacks play the role of the short-term memory.

In essence, only the short-term memory (i.e., feedbacks; previous events) provides a network with the context. In a typical RNN, the information in the short-term memory is continuously replaced as new information is fed back into the network over time. That is why RNNs perform well when the gap between the relevant information and the place where it is needed is small. For instance, for predicting the last word in the sentence “The referee blew his whistle,” we just need to know a few words back (i.e., the referee) to correctly predict. Since in this case the gap between the relevant information (i.e., the referee) and where it is needed (i.e., to predict whistle) is small, an RNN network can easily perform this learning and prediction task.

However, sometimes the relevant information required to perform a task is far away from where it is needed (i.e., the gap is large). Therefore, it is quite likely that it would have already been replaced by other information in the short-term memory by the time it is needed for the creation of the proper context. For instance, to predict the last word in “I went to a carwash yesterday. It cost $5 to wash my car,” there is a relatively larger gap between the relevant information (i.e., carwash) and where it is needed. Sometimes we may even need to refer to the previous paragraphs to reach the relevant information for predicting the true meaning of a word. In such cases, RNNs usually do not perform well since they cannot keep the information in their short-term memory for a long enough time. Fortunately, LSTM networks do not have such a shortcoming. The term long short-term memory network then refers to a network in which we are trying to remember what happened in the past (i.e., feedbacks; previous outputs of the layers) for a long enough time so that it can be used/leveraged in accomplishing the task when needed.
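The effect of the gap can be illustrated with a small numerical experiment on a single scalar recurrent unit. The weights (iw = 1.0, lw = 0.5) and the tanh transfer function are arbitrary assumptions, and a trained network would behave differently in detail. The sketch only shows how quickly an early input can fade from the fed-back state.

```python
import math

def final_state(inputs, iw=1.0, lw=0.5, b=0.0):
    """Final output of a scalar tanh recurrent unit after reading the whole sequence."""
    a = 0.0
    for p in inputs:
        a = math.tanh(iw * p + lw * a + b)
    return a

def influence_of_first_input(gap):
    """How much the final state changes when only the first input changes.

    The first input is followed by `gap` zero inputs before the state is read.
    """
    with_signal = final_state([1.0] + [0.0] * gap)
    without_signal = final_state([0.0] + [0.0] * gap)
    return abs(with_signal - without_signal)

for gap in (1, 5, 20):
    print(gap, f"{influence_of_first_input(gap):.2e}")
```

In this setup the trace of the first input shrinks by at least half at every step, so after a gap of 20 steps it is practically gone. That is the situation in which a plain RNN struggles and a gated memory becomes useful.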

From an architectural viewpoint, the memory concept (i.e., remembering “what happened in the past”) is incorporated in LSTM networks by adding four additional layers to the typical recurrent network architecture: three gate layers, namely input gate, forget (a.k.a. feedback) gate, and output gate, and an additional layer called Constant Error Carousel (CEC), also known as the state unit, which integrates those gates and connects them with the other layers. Each gate is nothing but a layer with two inputs, one from the network input and the other a feedback from the final output of the whole network. The gates involve log-sigmoid transfer functions. Therefore, their outputs will be between 0 and 1 and describe how much of each component (either input, feedback, or output) should be let through the network. Also, CEC is a layer that falls between the

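[Figure 6.32: the recurrent unit unfolded over time. The same function f is applied at every step, taking the inputs ..., x(t), x(t+1), x(t+2), x(t+3), ... and passing the states ..., a(t−1), a(t), a(t+1), a(t+2), ... from one step to the next.]

FIGURE 6.32 Unfolded View of a Typical Recurrent Network.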

input and the output layers in a recurrent network architecture and applies the gates’ outputs to make the short-term memory long.

To have a long short-term memory means that we want to keep the effect of previous outputs for a longer time. However, we typically do not want to indiscriminately remember everything that has happened in the past. Therefore, gating provides us with the capability of remembering prior outputs selectively. The input gate will allow selective inputs to the CEC; the forget gate will clear the CEC from the unwanted previous feedbacks; and the output gate will allow selective outputs from the CEC. Figure 6.33 shows a simple depiction of a typical LSTM architecture.

In summary, the gates in the LSTM are in charge of controlling the flow of information through the network and dynamically change the time scale of integration based on the input sequence. As a result, LSTM networks are able to learn long-term dependencies among the sequence of inputs more easily than the regular RNNs.
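The gating mechanism can be sketched for a single scalar cell as shown below. This follows a commonly used LSTM formulation, which may differ in detail from the architecture drawn in the chapter's figure. All weights are hand-picked assumptions, whereas in practice they are learned.

```python
import math

def sigmoid(x):
    """Log-sigmoid transfer function; output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(p, a_prev, c_prev, w):
    """One time step of a scalar LSTM cell.

    p      : current input p(t)
    a_prev : previous output (the feedback)
    c_prev : previous CEC (cell state) value
    w      : dict of hypothetical (input weight, feedback weight, bias) triples
    """
    def layer(name, transfer):
        wi, wf, b = w[name]
        return transfer(wi * p + wf * a_prev + b)

    i = layer("input_gate", sigmoid)    # how much new information enters the CEC
    f = layer("forget_gate", sigmoid)   # how much of the old CEC content is kept
    o = layer("output_gate", sigmoid)   # how much of the CEC content is emitted
    candidate = layer("candidate", math.tanh)

    c = f * c_prev + i * candidate      # CEC update
    a = o * math.tanh(c)                # gated output
    return a, c

# Hand-picked weights (assumptions): large biases hold gates nearly open or closed.
weights = {
    "input_gate":  (0.0, 0.0, -10.0),   # nearly closed: ignore new inputs
    "forget_gate": (0.0, 0.0, 10.0),    # nearly open: keep the stored value
    "output_gate": (0.0, 0.0, 10.0),
    "candidate":   (1.0, 0.0, 0.0),
}

a, c = 0.0, 0.8                         # pretend 0.8 was stored earlier
for p in [0.3, -0.7, 0.5] * 10:         # 30 steps of distracting inputs
    a, c = lstm_step(p, a, c, weights)

print(round(c, 3))
```

Because the forget gate stays close to 1 and the input gate close to 0, the stored value survives 30 steps of unrelated inputs almost unchanged. In a trained network the gates open and close depending on the input sequence instead of being fixed.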

Application Case 6.7 illustrates the use of text processing in the context of understanding customer opinions and sentiments toward innovatively designing and developing new and improved products and services.

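[Figure 6.33: LSTM architecture showing the input layer, input gate, forget (feedback) gate, output gate, CEC, and output layer. Each gate receives the network input p(t) and the fed-back output a7(t); the intermediate signals are labeled a1(t) through a6(t) and are combined through multiplication (×) nodes.]

FIGURE 6.33 Typical Long Short-Term Memory (LSTM) Network Architecture.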

Application Case 6.7 Deliver Innovation by Understanding Customer Sentiments

Analyzing product and customer behavior provides valuable insights into what consumers want, how they interact with products, and where they encounter usability issues. These insights can lead to new feature designs and development or even new products.

Understanding customer sentiment and knowing what consumers truly think about products or a brand are traditional pain points. Customer journey analytics provides insights into these areas, yet these solutions are not all designed to integrate vital sources of unstructured data such as call center notes or social media feedback. In today’s world, unstructured notes are part of core communications in virtually every industry, for example:

• Medical professionals record patient observations.

• Auto technicians write down safety information.

• Retailers track social media for consumer comments.

• Call centers monitor customer feedback and take notes.

Bringing together notes, which are usually available as free-form text, with other data for analysis has been difficult. That is because each industry has its own specific terms, slang, shorthand, and acronyms embedded in the data. Finding meaning and business insights first requires the text to be changed into a structured form. This manual process is expensive, time consuming, and prone to errors, especially as data scales to ever-increasing volumes. One way that companies can leverage notes without codifying the text is to use text clustering. This analytic technique quickly identifies common words or phrases for rapid insights.

Text and Notes Can Lead to New and Improved Products

Leveraging the insights and customer sentiment uncovered during a text and sentiment analysis can spark innovation. Companies such as vehicle manufacturers can use the intelligence to improve customer service and deliver an elevated customer experience. By learning what customers like and dislike about current products, companies can improve their design, such as adding new features to a vehicle to enhance the driving experience.

Forming word clusters also allows companies to identify safety issues. If an auto manufacturer sees that numerous customers are expressing negative sentiments about black smoke coming from their vehicle, the company can respond. Likewise, manufacturers can address safety issues that are a concern to customers. With comments grouped into buckets, companies have the ability to focus on specific customers who experienced a similar problem. This allows a company to, for instance, offer a rebate or special promotion to those who experienced black smoke.

Understanding sentiments can better inform a vehicle manufacturer’s policies. For example, customers have different lifetime values. A customer who complains just once but has a very large lifetime value can be a more urgent candidate for complaint resolution than a customer with a lower lifetime value with multiple issues. One may have spent $5,000 buying the vehicle from a used vehicle lot. Another may have a history of buying new cars from the manufacturer and spent $30,000 to buy the vehicle on the showroom floor.

Analyzing Notes Enables High-Value Business Outcomes

Managing the life cycle of products and services continues to be a struggle for most companies. The massive volumes of data now available have complicated life cycle management, creating new challenges for innovation. At the same time, the rapid rise of consumer feedback through social media has left businesses without a strategy for digesting, measuring, or incorporating the information into their product innovation cycle, meaning they miss a crucial amount of intelligence that reflects a customer’s actual thoughts, feelings, and emotions.

Text and sentiment analysis is one solution to this problem. Deconstructing topics from masses of text allows companies to see what common issues, complaints, or positive or negative sentiments customers have about products. These insights can lead to high-value outcomes, such as improving products or creating new ones that deliver a better user experience, responding promptly to safety issues, and identifying which product lines are most popular with consumers.

Example: Visualizing Auto Issues with “The Safety Cloud”

The Teradata Art of Analytics uses data science, Teradata® Aster® Analytics, and visualization techniques to turn data into one-of-a-kind artwork. To demonstrate the unique insights offered by text clustering, data scientists used the Art of Analytics to create “The Safety Cloud.”

The scientists used advanced analytics algorithms on safety inspector and call center notes from an automobile manufacturer. The analytics identified and systematically extracted common words and phrases embedded in the data volumes. The blue cluster represents power steering failure. The pink is engine stalls. Yellow is black smoke in the exhaust. Orange is brake failure. The manufacturer can use this information to gauge how big the problem is and whether it is safety related, and if so, then take actions to fix it.

For a visual summary, you can watch the video (http://www.teradata.com/Resources/Videos/Art-of-Analytics-Safety-Cloud).

Questions for Case 6.7

1. Why do you think sentiment analysis is gaining overwhelming popularity?

2. How does sentiment analysis work? What does it produce?

3. In addition to the specific examples in this case, can you think of other businesses and industries that can benefit from sentiment analysis? What is common among the companies that can benefit greatly from sentiment analysis?

Source: Teradata Case Study. “Deliver Innovation by Understanding Customer Sentiments.” http://assets.teradata.com/resourceCenter/downloads/CaseStudies/EB9859.pdf (accessed August 2018). Used with permission.

LSTM Networks Applications

Since their emergence in the late 1990s (Hochreiter & Schmidhuber, 1997), LSTM networks have been widely used in many sequence modeling applications, including image captioning (i.e., automatically describing the content of images) (Vinyals, Toshev, Bengio, and Erhan, 2017, 2015; Xu et al., 2015), handwriting recognition and generation (Graves, 2013; Graves and Schmidhuber, 2009; Keysers et al., 2017), parsing (Liang et al., 2016; Vinyals, Kaiser, et al., 2015), speech recognition (Graves and Jaitly, 2014; Graves, Jaitly, and Mohamed, 2013; Graves, Mohamed, and Hinton, 2013), and machine translation (Bahdanau, Cho, and Bengio, 2014; Sutskever, Vinyals, and Le, 2014).

Currently, we are surrounded by multiple deep learning solutions working on the basis of speech recognition, such as Apple’s Siri, Google Now, Microsoft’s Cortana, and Amazon’s Alexa, several of which we deal with on a daily basis (e.g., checking on the weather, asking for a Web search, calling a friend, and asking for directions on the map). Note taking is not a difficult, frustrating task anymore since we can easily record a speech or lecture, upload the digital recording to one of the several cloud-based speech-to-text service providers’ platforms, and download the transcript in a few seconds. The Google cloud-based speech-to-text service, for example, supports 120 languages and their variants and has the ability to convert speech to text either in real time or using recorded audio. The Google service automatically handles the noise in the audio; accurately punctuates the transcripts with commas, question marks, and periods; and can be customized by the user to a specific context by getting a set of terms and phrases that are very likely to be used in a speech and recognizing them appropriately.

Machine translation refers to a subfield of AI that employs computer programs to translate speech or text from one language to another. One of the most comprehensive machine translation systems is Google’s Neural Machine Translation (GNMT) platform. GNMT is basically an LSTM network with eight encoder and eight decoder layers designed by a group of Google researchers in 2016 (Wu et al., 2016). GNMT is specialized for translating whole sentences at a time as opposed to the former version of the Google Translate platform, which was a phrase-based translator. This network is capable of naturally handling the translation of rare words (which previously was a challenge in machine translation) by dividing the words into a set of common subword units. GNMT currently supports automatic sentence translations between more than 100 languages. Figure 6.34 shows how a sample sentence was translated from French to English by GNMT and a human translator. It also indicates how closely the GNMT translations between different language pairs were ranked by the human speakers compared with translations made by humans.
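The idea of dividing a rare word into common subword units can be sketched with a greedy longest-match segmentation. This is a toy stand-in and not GNMT's actual procedure: the vocabulary below is invented, whereas a production system learns its subword units from data and may segment words differently.

```python
def split_into_subwords(word, vocabulary):
    """Greedy longest-match segmentation of a word into known subword units.

    Falls back to single characters when no longer unit matches.
    """
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocabulary or end == start + 1:
                pieces.append(piece)
                start = end
                break
    return pieces

# Hypothetical subword vocabulary; a real system learns its units from data.
vocab = {"un", "translat", "able", "ing", "trans", "late"}

print(split_into_subwords("untranslatable", vocab))
```

Even if the full word never appeared in the training data, each of its pieces may have, so the network can still build a representation for it.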

[Figure 6.34: a sample French-to-English translation, together with a chart of human ratings (6 = perfect translation) for the phrase-based, neural network, and human translation methods between English and Spanish, French, and Chinese in both directions.]

Input sentence: Pour l’ancienne secrétaire d’Etat, il s’agit de faire oublier un mois de cafouillages et de convaincre l’auditoire que M. Trump n’a pas l’étoffe d’un président

Phrase based: For the former secretary of state, this is to forget a month of bungling and convince the audience that Mr. Trump has not the makings of a president

Neural network: For the former secretary of state, it is a question of forgetting a month of muddles and convincing the audience that Mr. Trump does not have the stuff of a president

H: The former secretary of state has to put behind her a month of setbacks and convince the audience that Mr. Trump does not have what it takes to be a president

FIGURE 6.34 Example Indicating the Close-to-Human Performance of the Google Neural Machine Translator (GNMT)

Although machine translation has been revolutionized by virtue of LSTMs, it encounters challenges that keep it far from fully automated high-quality translation. As in image-processing applications, there is a lack of sufficient training data (manually translated by humans) for many language pairs on which the network can be trained. As a result, translations between rare languages are usually done through a bridging language (mostly English), which may result in higher chances of error.

In 2014, Microsoft launched its Skype Translator service, a free voice translation service involving both speech recognition and machine translation that is able to translate real-time conversations in 10 languages. Using this service, people speaking different languages can talk to each other in their own languages via a Skype voice or video call, and the system recognizes their voices and translates their every sentence through a translator bot in near real time for the other party. To provide more accurate translations, the deep networks used in the backend of this system were trained using conversational language (i.e., using materials such as translated Web pages, movie subtitles, and casual phrases taken from people’s conversations in social networking Web sites) rather than the formal language commonly used in documents. The output of the speech recognition module of the system then goes through TrueText, a Microsoft technology for normalizing text that is capable of identifying mistakes and disfluencies (e.g., pauses during the speech or repeating some parts of speech, or adding fillers like “um” and “ah” when speaking) that people commonly make in their conversations and accounting for them to produce better translations. Figure 6.35 shows the four-step process involved in the Skype Translator by Microsoft, each of which relies on the LSTM type of deep neural networks.
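The kind of cleanup described for the text normalization step can be illustrated with a toy, rule-based sketch. This is not Microsoft's TrueText algorithm, whose internals are not described here. The filler list and the function name are assumptions, and only two simple disfluencies (fillers and immediate word repetitions) are handled.

```python
FILLERS = {"um", "uh", "ah", "er"}  # hypothetical filler list

def clean_transcript(text):
    """Drop filler words and immediate word repetitions from a raw transcript."""
    cleaned = []
    for word in text.lower().split():
        if word in FILLERS:
            continue
        if cleaned and cleaned[-1] == word:
            continue  # skip a stuttered repeat such as "can can"
        cleaned.append(word)
    return " ".join(cleaned)

print(clean_transcript("um can can you hear me"))
```

Passing a cleaner sentence to the translation step avoids translating the stutter or the filler literally.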

SECTION 6.8 REVIEW QUESTIONS

1. What is RNN? How does it differ from CNN?

2. What is the significance of “context,” “sequence,” and “memory” in RNN?

3. Draw and explain the functioning of a typical recurrent neural network unit.

4. What is the LSTM network, and how does it differ from RNNs?

5. List and briefly describe three different types of LSTM applications.

6. How do Google’s Neural Machine Translation and Microsoft Skype Translator work?

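[Figure 6.35: pipeline Speech → Automatic Speech Recognition → TrueText → Machine Translation → Text to Speech → Speech, illustrated with the utterance “Can you hear me?”, whose raw transcript (“can can you here me”) is cleaned up by removing the repeated word and correcting “here” to “hear” before translation.]

FIGURE 6.35 Four-Step Process of Translating Speech Using Deep Networks in the Microsoft Skype Translator.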

6.9 COMPUTER FRAMEWORKS FOR IMPLEMENTATION OF DEEP LEARNING

Deep learning owes its recent popularity, to a great extent, to advances in the software and hardware infrastructure required for its implementation. In the past few decades, GPUs have been revolutionized to support the playing of high-resolution videos as well as advanced video games and virtual reality applications. However, GPUs’ huge processing potential had not been effectively utilized for purposes other than graphics processing up until a few years ago. Thanks to software libraries such as Theano (Bergstra et al., 2010), Torch (Collobert, Kavukcuoglu, and Farabet, 2011), Caffe (Jia et al., 2014), PyLearn2 (Goodfellow et al., 2013), TensorFlow (Abadi et al., 2016), and MXNet (Chen et al., 2015), developed with the purpose of programming GPUs for general-purpose processing (just like CPUs), and particularly for deep learning and analysis of Big Data, GPUs have become a critical enabler for modern-day analytics. The operation of these libraries mostly relies on a parallel computing platform and application programming interface (API) developed by NVIDIA called Compute Unified Device Architecture (CUDA), which enables software developers to use GPUs made by NVIDIA for general-purpose processing. In fact, each deep learning framework consists of a high-level scripting language (e.g., Python, R, Lua) and a library of deep learning routines usually written in C (for using CPUs) or CUDA (for using GPUs).

We next introduce some of the most popular software libraries used for deep learning by researchers and practitioners, including Torch, Caffe, TensorFlow, Theano, and Keras, and discuss some of their specific properties.

Torch

Torch (Collobert et al., 2011) is an open-source scientific computing framework (available at www.torch.ch) for implementing machine-learning algorithms using GPUs. The Torch framework is a library based on LuaJIT, a just-in-time compiler for the popular Lua programming language (www.lua.org). In fact, Torch adds a number of valuable features to Lua that make deep learning analyses possible; it enables supporting n-dimensional arrays (i.e., tensors), whereas tables (i.e., associative arrays) normally are the only data-structuring method used by Lua. Additionally, Torch includes routine libraries for manipulating (i.e., indexing, slicing, transposing) tensors, linear algebra, neural network functions, and optimization. More importantly, while Lua by default uses the CPU to run programs, Torch enables use of GPUs for running programs written in the Lua language.

The easy and extremely fast scripting properties of LuaJIT along with its flexibility have made Torch a very popular framework for practical deep learning applications such that today its latest version, Torch7, is widely used by a number of big companies in the deep learning area, including Facebook, Google, and IBM, in their research labs, as well as for their commercial applications.

Caffe

Caffe is another open-source deep learning framework (available at http://caffe.berkeleyvision.org) created by Yangqing Jia (2013), a PhD student at the University of California–Berkeley, and then further developed by Berkeley AI Research (BAIR). Caffe offers multiple options for high-level scripting, including command line, Python, and MATLAB interfaces. The deep learning libraries in Caffe are written in the C++ programming language.

In Caffe, everything is done using text files instead of code. That is, to implement a network, generally we need to prepare two text files with the .prototxt extension, which the Caffe engine reads in a human-readable, JSON-like format (the Protocol Buffers text format).

The first text file, known as the architecture file, defines the architecture of the network layer by layer, where each layer is defined by a name, a type (e.g., data, convolution, output), the names of its previous (bottom) and next (top) layers in the architecture, and some required parameters (e.g., kernel size and stride for a convolutional layer). The second text file, known as the solver file, specifies the properties of the training algorithm, including the learning rate, maximum number of iterations, and processing unit (CPU or GPU) to be used for training the network.
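As a rough illustration of what the two files contain, the following sketch renders one convolutional layer and a small set of solver settings as prototxt-style text. It is written in Python purely for demonstration: the `render` function and the `Bare` class are hypothetical helpers, the file name and numeric values are made up, and the exact keys accepted by a given Caffe version should be checked against its documentation.

```python
class Bare(str):
    """A string rendered without quotes (for enumeration-style values)."""


def render(fields, indent=0):
    """Render a dict as 'key: value' lines; nested dicts become blocks."""
    pad = "  " * indent
    lines = []
    for key, value in fields.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key} {{")
            lines.extend(render(value, indent + 1))
            lines.append(f"{pad}}}")
        elif isinstance(value, Bare):
            lines.append(f"{pad}{key}: {value}")
        elif isinstance(value, str):
            lines.append(f'{pad}{key}: "{value}"')
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


# Architecture file: one layer with a name, a type, its bottom (input)
# and top (output) connections, and layer-specific parameters.
conv_layer = {
    "layer": {
        "name": "conv1",
        "type": "Convolution",
        "bottom": "data",
        "top": "conv1",
        "convolution_param": {"num_output": 20, "kernel_size": 5, "stride": 1},
    }
}

# Solver file: properties of the training algorithm (illustrative values).
solver = {
    "net": "architecture.prototxt",  # hypothetical file name
    "base_lr": 0.01,                 # learning rate
    "max_iter": 10000,               # maximum number of iterations
    "solver_mode": Bare("GPU"),      # processing unit used for training
}

architecture_text = "\n".join(render(conv_layer))
solver_text = "\n".join(render(solver))
print(architecture_text)
print(solver_text)
```

The point of the sketch is only that a network and its training setup are described declaratively, as nested key-value text, rather than programmed step by step.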

While Caffe supports multiple types of deep network architectures like CNN and LSTM, it is particularly known to be an efficient framework for image processing due to its speed in processing image files. According to its developers, it is able to process over 60 million images per day (i.e., roughly 1 ms/image) using a single NVIDIA K40 GPU. In 2017, Facebook released an improved version of Caffe called Caffe2 (www.caffe2.ai) with the aim of making the original framework effective for deep learning architectures other than CNN, and with a special emphasis on portability for performing cloud and mobile computations while maintaining scalability and performance.

TensorFlow

Another popular open-source deep learning framework is TensorFlow. It was originally developed and written in Python and C++ by the Google Brain Group in 2011 as DistBelief, and it was further developed into TensorFlow in 2015. At the time of writing, TensorFlow is the only deep learning framework that, in addition to CPUs and GPUs, supports Tensor Processing Units (TPUs), a type of processor developed by Google in 2016 for the specific purpose of neural network machine learning. In fact, TPUs were specifically designed by Google for the TensorFlow framework.

Although Google has not yet made TPUs available to the market, it is reported that it has used them in a number of its commercial services such as Google search, Street View, Google Photos, and Google Translate with significant improvements reported. A detailed study performed by Google shows that TPUs deliver 30 to 80 times higher performance per watt than contemporary CPUs and GPUs (Sato, Young, and Patterson, 2017). For example, it has been reported (Ung, 2016) that in Google Photos, an individual TPU can process over 100 million images per day (i.e., 0.86 ms/image). Such a unique feature will probably put TensorFlow way ahead of the other alternative frameworks in the near future as soon as Google makes TPUs commercially available.
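The per-image times quoted for the K40 GPU and the TPU follow from dividing the number of milliseconds in a day by the daily image count. The short sketch below carries out that arithmetic; the function name `ms_per_image` is an illustrative choice. Because both sources say "over" a given number of images per day, the computed values are best read as upper bounds on the average time per image.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # 86,400,000 milliseconds in a day


def ms_per_image(images_per_day):
    """Average processing time per image, in milliseconds."""
    return MS_PER_DAY / images_per_day


k40_ms = ms_per_image(60_000_000)   # Caffe on a single NVIDIA K40 GPU
tpu_ms = ms_per_image(100_000_000)  # a single TPU in Google Photos

print(f"K40: {k40_ms:.2f} ms/image")  # 1.44
print(f"TPU: {tpu_ms:.3f} ms/image")  # 0.864
```

At exactly 60 million images per day the average works out to 1.44 ms/image, so the rounder figure of 1 ms/image corresponds to a rate of about 86 million images per day.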

Another interesting feature of TensorFlow is its visualization module, TensorBoard. Implementing a deep neural network is a complex and confusing task. TensorBoard is a Web application comprising a handful of visualization tools for displaying network graphs and plotting quantitative network metrics, with the aim of helping users to better understand what is going on during training procedures and to debug possible issues.

Theano

In 2007, the Deep Learning Group at the University of Montreal developed the initial version of a Python library, Theano (http://deeplearning.net/software/theano), to define, optimize, and evaluate mathematical expressions involving multidimensional arrays (i.e., tensors) on CPU or GPU platforms. Theano was one of the first deep learning frameworks and later became a source of inspiration for the developers of TensorFlow. Theano and TensorFlow follow a similar procedure in that a typical network implementation involves two sections: in the first section, a computational graph is built by defining the network variables and the operations to be done on them; the second section runs that graph (in Theano by compiling the graph into a function and in TensorFlow by creating a session). In fact, what happens in these libraries is that the user defines the structure of the network using simple, symbolic syntax understandable even for beginners in programming, and the library automatically generates appropriate code in either C (for processing on CPU) or CUDA (for processing on GPU) to implement the defined network. Hence, users without any knowledge of programming in C or CUDA and with just a minimum knowledge of Python are able to efficiently design and implement deep learning networks on GPU platforms.
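The two-section workflow described above (first define a symbolic graph, then run it) can be illustrated with a toy example. The `Node` class and the `variable` and `compile_graph` functions below are hypothetical and are not part of Theano or TensorFlow; the sketch also simply interprets the graph in Python, whereas the real libraries generate optimized C or CUDA code.

```python
class Node:
    """A symbolic node: a named variable or an operation on other nodes."""

    def __init__(self, op, inputs=(), name=None):
        self.op = op
        self.inputs = inputs
        self.name = name

    def __add__(self, other):
        return Node("add", (self, other))

    def __mul__(self, other):
        return Node("mul", (self, other))


def variable(name):
    """Create a symbolic variable that has no value yet."""
    return Node("var", name=name)


def compile_graph(output, arg_names):
    """Turn a symbolic graph into an ordinary Python function."""

    def evaluate(node, env):
        if node.op == "var":
            return env[node.name]
        left, right = (evaluate(n, env) for n in node.inputs)
        return left + right if node.op == "add" else left * right

    def run(*values):
        return evaluate(output, dict(zip(arg_names, values)))

    return run


# Section 1: build the graph. Nothing is computed at this point.
x, w, b = variable("x"), variable("w"), variable("b")
y = x * w + b

# Section 2: compile the graph and run it with concrete numbers.
f = compile_graph(y, ["x", "w", "b"])
result = f(3.0, 2.0, 1.0)  # 3.0 * 2.0 + 1.0
```

Separating definition from execution is what lets a framework inspect the whole graph before running it, for example to choose between CPU and GPU code.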

Theano also includes some built-in functions to visualize computational graphs as well as to plot the network performance metrics even though its visualization features are not comparable to TensorBoard.

Keras: An Application Programming Interface

While all of the deep learning frameworks described above require users to be familiar with their own syntax (through reading their documentation) to be able to successfully train a network, fortunately there are some easier, more user-friendly ways to do so. Keras (https://keras.io/) is an open-source neural network library written in Python that functions as a high-level application programming interface (API) and is able to run on top of various deep learning frameworks including Theano and TensorFlow. In essence, Keras takes the key properties of the network building blocks (i.e., type of layers, transfer functions, and optimizers) via an extremely simple syntax, automatically generates syntax in one of the deep learning frameworks, and runs that framework in the backend. While Keras is efficient enough to build and run general deep learning models in just a few minutes, it does not provide several advanced operations provided by TensorFlow or Theano. Therefore, in dealing with special deep network models that require advanced settings, one still needs to use those frameworks directly instead of using Keras (or other APIs such as Lasagne) as a proxy.

SECTION 6.9 REVIEW QUESTIONS

1. Despite the short tenure of deep learning implementation, why do you think there are several different computing frameworks for it?

2. Define CPU, NVIDIA, CUDA, and deep learning, and comment on the relationship between them.

3. List and briefly define the characteristics of different deep learning frameworks.

4. What is Keras, and how is it different from the other frameworks?

6.10 COGNITIVE COMPUTING

We are witnessing a significant acceleration in the pace at which technology is evolving. Things that once took decades are now taking months, and the things that we see only in sci-fi movies are becoming reality, one after another. Therefore, it is safe to say that in the next decade or two, technological advancements will transform how people live, learn, and work in a rather dramatic fashion. The interactions between humans and technology will become intuitive, seamless, and perhaps transparent. Cognitive computing will have a significant role to play in this transformation. Generally speaking, cognitive computing refers to computing systems that use mathematical models to emulate (or partially simulate) the human cognition process to find solutions to complex problems and situations where the potential answers can be imprecise. While the term cognitive computing is often used interchangeably with AI and smart search engines, the phrase itself is closely associated with IBM’s cognitive computer system Watson and its success on the television show Jeopardy! Details on Watson’s success on Jeopardy! can be found in Application Case 6.8.

According to the Cognitive Computing Consortium (2018), cognitive computing makes a new class of problems computable. It addresses highly complex situations that are characterized by ambiguity and uncertainty; in other words, it handles the kinds of problems that are thought to be solvable by human ingenuity and creativity. In today’s dynamic, information-rich, and unstable situations, data tend to change frequently, and they often conflict. The goals of users evolve as they learn more and redefine their objectives. To respond to the fluid nature of users’ understanding of their problems, the cognitive computing system offers a synthesis not just of information sources but also of influences, contexts, and insights. To achieve such a high level of performance, cognitive systems often need to weigh conflicting evidence and suggest an answer that is “best” rather than “right.” Figure 6.36 illustrates a general framework for cognitive computing where data and AI technologies are used to solve complex real-world problems.

How Does Cognitive Computing Work?

As one would guess from the name, cognitive computing works much like a human thought process, reasoning mechanism, and cognitive system. These cutting-edge computation systems can find and synthesize data from various information sources and weigh context and conflicting evidence inherent in the data to provide the best possible answers to a given question or problem. To achieve this, cognitive systems include self-learning technologies that use data mining, pattern recognition, deep learning, and NLP to mimic the way the human brain works.

[Figure: Structured data (POS, transactions, OLAP, CRM, SCM, external, etc.), unstructured data (social media, multimedia, IoT, literature, etc.), complex problems (health, economic, humanitarian, social, etc.), and AI algorithms with software/hardware (machine learning, NLP, search, cloud, GPU, etc.) feed cognitive computing through a build, test, and validate cycle. Outcomes: saved lives, improved economy, better security, engaged customers, higher revenues, reduced risks, improved living.]

FIGURE 6.36 Conceptual Framework for Cognitive Computing and Its Promises.

Using computer systems to solve the types of problems that humans are typically tasked with requires vast amounts of structured and unstructured data fed to machine-learning algorithms. Over time, cognitive systems are able to refine the way in which they learn and recognize patterns and the way they process data to become capable of anticipating new problems and modeling and proposing possible solutions.

To achieve those capabilities, cognitive computing systems must have the following key attributes as defined by the Cognitive Computing Consortium (2018):

• Adaptive: Cognitive systems must be flexible enough to learn as information changes and goals evolve. The systems must be able to digest dynamic data in real time and make adjustments as the data and environment change.

• Interactive: Human-computer interaction (HCI) is a critical component in cognitive systems. Users must be able to interact with cognitive machines and define their needs as those needs change. The technologies must also be able to interact with other processors, devices, and cloud platforms.

• Iterative and stateful: Cognitive computing technologies can also identify problems by asking questions or pulling in additional data if a stated problem is vague or incomplete. The systems do this by maintaining information about similar situations that have previously occurred.

• Contextual: Understanding context is critical in thought processes, so cognitive systems must understand, identify, and mine contextual data, such as syntax, time, location, domain, requirements, and a specific user’s profile, tasks, or goals. Cognitive systems may draw on multiple sources of information, including structured and unstructured data and visual, auditory, or sensor data.

How Does Cognitive Computing Differ from AI?

Cognitive computing is often used interchangeably with AI, the umbrella term used for technologies that rely on data and scientific methods/computations to make (or help/support in making) decisions. But there are differences between the two terms, which can largely be found within their purposes and applications. AI technologies include, but are not limited to, machine learning, neural computing, NLP, and, most recently, deep learning. With AI systems, especially machine-learning systems, data are fed into the algorithm for processing (an iterative and time-demanding process that is often called training) so that the systems “learn” variables and the interrelationships among those variables and can then produce predictions (or characterizations) about a given complex problem or situation. Applications based on AI and cognitive computing include intelligent assistants, such as Amazon’s Alexa, Google Home, and Apple’s Siri. A simple comparison between cognitive computing and AI is given in Table 6.3 (Reynolds and Feldman, 2014; CCC, 2018).

As can be seen in Table 6.3, the differences between AI and cognitive computing are rather marginal. This is expected because cognitive computing is often characterized as a subcomponent of AI or an application of AI technologies tailored for a specific purpose. AI and cognitive computing both utilize similar technologies and are applied to similar industry segments and verticals. The main difference between the two is the purpose: while cognitive computing is aimed at helping humans to solve complex problems, AI is aimed at automating processes that are performed by humans; at the extreme, AI is striving to replace humans with machines for tasks requiring “intelligence,” one at a time.

In recent years, cognitive computing typically has been used to describe AI systems that aim to simulate the human thought process. Human cognition involves real-time analysis of environment, context, and intent among many other variables that inform a person’s ability to solve problems. A number of AI technologies are required for a computer system to build cognitive models that mimic human thought processes, including machine learning, deep learning, neural networks, NLP, text mining, and sentiment analysis.

In general, cognitive computing is used to assist humans in their decision-making process. One example of a cognitive computing application is supporting medical doctors in their treatment of disease. IBM Watson for Oncology, for example, has been used at Memorial Sloan Kettering Cancer Center to provide oncologists with evidence-based treatment options for cancer patients. When medical staff input questions, Watson generates a list of hypotheses and offers treatment options for doctors to consider. Whereas AI relies on algorithms to solve a problem or to identify patterns hidden in data, cognitive computing systems have the loftier goal of creating algorithms that mimic the human brain’s reasoning process to help humans solve an array of problems as the data and the problems constantly change.

In dealing with complex situations, context is important, and cognitive computing systems make context computable. They identify and extract context features such as time, location, task, history, or profile to present a specific set of information that is appropriate for an individual or for a dependent application engaged in a specific process at a specific time and place. According to the Cognitive Computing Consortium, they provide machine-aided serendipity by wading through massive collections of diverse information to find patterns and then apply those patterns to respond to the needs of the user at a particular moment. In a sense, cognitive computing systems aim at redefining the nature of the relationship between people and their increasingly pervasive digital environment. They may play the role of assistant or coach for the user, and they may act virtually autonomously in many problem-solving situations. The boundaries of the processes and domains these systems can affect are still elastic and emergent. Their output may be prescriptive, suggestive, instructive, or simply entertaining.

In the short time of its existence, cognitive computing has proved to be useful in many domains and complex situations and is evolving into many more. The typical use cases for cognitive computing include the following:

• Development of smart and adaptive search engines

• Effective use of natural language processing

• Speech recognition

• Language translation

• Context-based sentiment analysis

• Face recognition and facial emotion detection

• Risk assessment and mitigation

• Fraud detection and mitigation

• Behavioral assessment and recommendations

TABLE 6.3 Cognitive Computing versus Artificial Intelligence (AI)

Characteristic | Cognitive Computing | Artificial Intelligence (AI)
Technologies used | Machine learning; natural language processing; neural networks; deep learning; text mining; sentiment analysis | Machine learning; natural language processing; neural networks; deep learning
Capabilities offered | Simulate human thought processes to assist humans in finding solutions to complex problems | Find hidden patterns in a variety of data sources to identify problems and provide potential solutions
Purpose | Augment human capability | Automate complex processes by acting like a human in certain situations
Industries | Customer service, marketing, healthcare, entertainment, service sector | Manufacturing, finance, healthcare, banking, securities, retail, government

Cognitive analytics is a term that refers to cognitive computing–branded technology platforms, such as IBM Watson, that specialize in processing and analyzing large, unstructured data sets. Typically, word processing documents, e-mails, videos, images, audio files, presentations, Web pages, social media, and many other data formats need to be manually tagged with metadata before they can be fed into a traditional analytics engine and Big Data tools for computational analyses and insight generation. The principal benefit of utilizing cognitive analytics over those traditional Big Data analytics tools is that for cognitive analytics such data sets do not need to be pretagged. Cognitive analytics systems can use machine learning to adapt to different contexts with minimal human supervision. These systems can be equipped with a chatbot or search assistant that understands queries, explains data insights, and interacts with humans in human languages.

Cognitive Search

Cognitive search is the new generation of search methods that use AI (advanced indexing, NLP, and machine learning) to return results that are much more relevant to users. Forrester defines cognitive search and knowledge discovery solutions as “a new generation of enterprise search solutions that employ AI technologies such as natural language processing and machine learning to ingest, understand, organize, and query digital content from multiple data sources” (Gualtieri, 2017). Cognitive search creates searchable information out of nonsearchable content by leveraging cognitive computing algorithms to create an indexing platform.

Searching for information is a tedious task. Although current search engines do a very good job of finding relevant information in a timely manner, their sources are limited to publicly available data on the Internet. Cognitive search proposes the next generation of search tailored for use in enterprises. It is different from traditional search because, according to Gualtieri (2017), it:

• Can handle a variety of data types. Search is no longer just about unstructured text contained in documents and in Web pages. Cognitive search solutions can also accommodate structured data contained in databases and even nontraditional enterprise data such as images, video, audio, and machine-/sensor-generated logs from IoT devices.

• Can contextualize the search space. In information retrieval, the context is important. Context takes the traditional syntax-/symbol-driven search to a new level where it is defined by semantics and meaning.

• Employs advanced AI technologies. The distinguishing characteristic of cognitive search solutions is that they use NLP and machine learning to understand and organize data, predict the intent of the search query, improve the relevancy of results, and automatically tune the relevancy of results over time.

• Enables developers to build enterprise-specific search applications. Search is not just about a text box on an enterprise portal. Enterprises build search applications that embed search in customer 360 applications, pharma research tools, and many other business process applications. Virtual digital assistants such as Amazon Alexa, Google Now, and Siri would be useless without powerful searches behind the scenes. Enterprises wishing to build similar applications for their customers will also benefit from cognitive search solutions. Cognitive search solutions provide software development kits (SDKs), APIs, and/or visual design tools that allow developers to embed the power of the search engine in other applications.
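The difference between a syntax-/symbol-driven search and one that takes meaning into account can be hinted at with a very small sketch. It contrasts literal keyword lookup over an inverted index with a lookup that first expands the query using related terms. Everything here is an assumption made for illustration: the three sample documents, the `related_terms` table (a hand-written stand-in for what NLP and machine-learning components would learn from enterprise content), and the function names. Real cognitive search solutions are far more sophisticated.

```python
from collections import defaultdict

# Hypothetical sample content.
documents = {
    1: "quarterly revenue report for the retail division",
    2: "earnings summary and sales forecast",
    3: "employee onboarding checklist",
}

# Hand-written stand-in for relationships a learned model would supply.
related_terms = {"revenue": {"earnings", "sales"}}


def build_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def keyword_search(index, query):
    """Return ids of documents containing any literal query word."""
    hits = set()
    for word in query.lower().split():
        hits |= index.get(word, set())
    return sorted(hits)


def expanded_search(index, query):
    """Keyword search after adding related terms to the query."""
    words = set(query.lower().split())
    for word in list(words):
        words |= related_terms.get(word, set())
    return keyword_search(index, " ".join(sorted(words)))


index = build_index(documents)
literal_hits = keyword_search(index, "revenue")    # matches document 1 only
expanded_hits = expanded_search(index, "revenue")  # also finds document 2
```

The literal search misses the second document because it never uses the word "revenue", while the expanded search retrieves it through the related terms.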

Figure 6.37 shows the progressive evolution of search methods from good old keyword search to modern-day cognitive search on two dimensions: ease of use and value proposition.

IBM Watson: Analytics at Its Best

IBM Watson is perhaps the smartest computer system built to date. Since the emergence of computers and subsequently AI in the late 1940s, scientists have compared the performance of these “smart” machines with human minds. Accordingly, in the mid- to late-1990s, IBM researchers built a smart machine and used the game of chess (generally credited as the game of smart humans) to test its ability against the best of human players. On May 11, 1997, an IBM computer called Deep Blue beat the world chess champion in a six-game match: two wins for Deep Blue, one for the champion, and three draws. The match lasted several days and received massive media coverage around the world. It was the classic plot line of human versus machine. Beyond the chess contest, the intention of developing this kind of computer intelligence was to make computers able to handle the kinds of complex calculations needed to help discover new drugs, to do the broad financial modeling needed to identify trends and do risk analysis, to handle large database searches, and to perform the massive calculations needed in advanced fields of science.

After a couple of decades, IBM researchers came up with another idea that was perhaps more challenging: a machine that could not only play the American TV quiz show Jeopardy! but also beat the best of the best. Compared to chess, Jeopardy! is much more challenging. While chess is well structured and has very simple rules and therefore is a very good match for computer processing, Jeopardy! is neither simple nor structured. Jeopardy! is a game designed to test human intelligence and creativity. Therefore, a computer designed to play the game needed to be a cognitive computing system that can work and think like a human. Making sense of the imprecision inherent in human language was the key to success.

[Figure: Search methods positioned along two axes, Ease of Use and Value Proposition. Stages shown: Keyword Search, Semantic Search, Contextual Search, Cognitive Search. Technologies labeled: Indexing, Natural Language Processing (NLP), Machine Learning, Natural Human Interaction (NHI).]

FIGURE 6.37 Progressive Evolution of Search Methods.

In 2010, an IBM research team developed Watson, an extraordinary computer system—a novel combination of advanced hardware and software—designed to answer questions posed in natural human language. The team built Watson as part of the DeepQA project and named it after IBM’s first president, Thomas J. Watson. The team that built Watson was looking for a major research challenge: one that could rival the scientific and popular interest of Deep Blue and would have clear relevance to IBM’s business interests. The goal was to advance computational science by exploring new ways for computer technology to affect science, business, and society at large. Accordingly, IBM research undertook a challenge to build Watson as a computer system that could compete at the human champion level in real time on Jeopardy! The team wanted to create a real-time automatic contestant on the show capable of listening, understanding, and responding, not merely a laboratory exercise. Application Case 6.8 provides some of the details on IBM Watson’s participation in the game show.

Application Case 6.8 IBM Watson Competes against the Best at Jeopardy!

In 2011, to test its cognitive abilities, Watson competed on the quiz show Jeopardy! in the first-ever human-versus-machine matchup for the show. In a two-game, combined-point match (broadcast in three Jeopardy! episodes during February 14–16), Watson beat Brad Rutter, the highest all-time money winner on Jeopardy!, and Ken Jennings, the record holder for the longest championship streak (75 days). In these episodes, Watson consistently outperformed its human opponents on the game’s signaling device, but it had trouble responding to a few categories, notably those having short clues containing only a few words. Watson had access to 200 million pages of structured and unstructured content, consuming four terabytes of disk storage. During the game, Watson was not connected to the Internet.

Meeting the Jeopardy! challenge required advancing and incorporating a variety of text mining and NLP technologies, including parsing, question classification, question decomposition, automatic source acquisition and evaluation, entity and relationship detection, logical form generation, and knowledge representation and reasoning. Winning at Jeopardy! required accurately computing confidence in answers. The questions and content are ambiguous and noisy, and none of the individual algorithms is perfect. Therefore, each component must produce a confidence in its output, and individual component confidences must be combined to compute the overall confidence of the final answer. The final confidence is used to determine whether the computer system should risk choosing to answer at all. In Jeopardy! this confidence is used to determine whether the computer will “ring in” or “buzz in” for a question. The confidence must be computed during the time the question is read and before the opportunity to buzz in. This is roughly between one and six seconds with an average around three seconds.

Watson was an excellent example of the rapid advancement of computing technology and what it is capable of doing. Although still not as creatively/natively smart as human beings, computer systems like Watson are evolving to change the world we are living in, hopefully for the better.

Questions for Case 6.8

1. In your opinion, what are the most unique features about Watson?

2. In what other challenging games would you like to see Watson compete against humans? Why?

3. What are the similarities and differences between Watson’s and humans’ intelligence?

Sources: Ferrucci, D., E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, D. Kalyanpur, A. Lally, J. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty. (2010). “Building Watson: An Overview of the DeepQA Project.” AI Magazine, 31(3), pp. 59–79; IBM Corporation. (2011). “The DeepQA Project.” https://researcher.watson.ibm.com/researcher/view_group.php?id=2099 (accessed May 2018).

How Does Watson Do It?

What is under the hood of Watson? How does it do what it does? The system behind Watson, which is called DeepQA, is a massively parallel, text mining–focused, probabilistic evidence–based computational architecture. For the Jeopardy! challenge, Watson used more than 100 different techniques for analyzing natural language, identifying sources, finding and generating hypotheses, finding and scoring evidence, and merging and ranking hypotheses. Far more important than any particular technique the IBM team used was how it combined them in DeepQA such that overlapping approaches could bring their strengths to bear and contribute to improvements in accuracy, confidence, and speed.

DeepQA is an architecture with an accompanying methodology that is not specific to the Jeopardy! challenge. These are the overarching principles in DeepQA:

• Massive parallelism. Watson needed to exploit massive parallelism in the consideration of multiple interpretations and hypotheses.

• Many experts. Watson needed to be able to integrate, apply, and contextually evaluate a wide range of loosely coupled probabilistic questions and content analytics.

• Pervasive confidence estimation. No component of Watson committed to an answer; all components produced features and associated confidences, scoring different question and content interpretations. An underlying confidence-processing substrate learned how to stack and combine the scores.

• Integration of shallow and deep knowledge. Watson needed to balance the use of strict semantics and shallow semantics, leveraging many loosely formed ontologies.
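The principle of pervasive confidence estimation, together with the decision of whether to buzz in, can be sketched in a few lines. The example below assumes a simple logistic (weighted-sum) combination of three evidence scores per candidate answer. The weights, bias, threshold, candidate names, and function names are all hypothetical; in a real system the weights would be learned from training questions, and this should not be read as the actual DeepQA implementation.

```python
import math

# Hypothetical weights for three evidence scorers, set by hand here.
WEIGHTS = [2.0, 1.5, 1.0]
BIAS = -2.0
BUZZ_THRESHOLD = 0.5  # assumed cut-off for deciding to answer at all


def combined_confidence(scores):
    """Logistic combination of per-scorer evidence scores (each in 0..1)."""
    z = BIAS + sum(w * s for w, s in zip(WEIGHTS, scores))
    return 1.0 / (1.0 + math.exp(-z))


def rank_candidates(candidates):
    """Sort (answer, scores) pairs by combined confidence, best first."""
    scored = [(answer, combined_confidence(s)) for answer, s in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Each candidate answer carries the scores from the individual components.
candidates = [
    ("candidate B", [0.3, 0.2, 0.4]),
    ("candidate A", [0.9, 0.8, 0.7]),
]

ranking = rank_candidates(candidates)
best_answer, best_confidence = ranking[0]
should_buzz = best_confidence >= BUZZ_THRESHOLD
```

No single scorer decides the outcome: each contributes a score, the combination produces one confidence per candidate, and the system answers only when the best confidence clears the threshold.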

Figure 6.38 illustrates the DeepQA architecture at a very high level. More technical details about the various architectural components and their specific roles and capabilities can be found in Ferrucci et al. (2010).

What Is the Future for Watson?

The Jeopardy! challenge helped IBM address requirements that led to the design of the DeepQA architecture and the implementation of Watson. After three years of intense research and development by a core team of about 20 researchers, as well as a significant R&D budget, Watson managed to perform at human expert levels in terms of precision, confidence, and speed on the Jeopardy! quiz show.

[Figure: A question in natural language is translated to digital form, analyzed (decomposition), and passed to primary search over answer sources for candidate generation. Each hypothesis (1 through n) goes through soft filtering and evidence scoring, supported by support evidence retrieval from evidence sources and deep evidence scoring. Synthesis (combining) and merging and ranking then produce an answer with a level of confidence.]

FIGURE 6.38 A High-Level Depiction of DeepQA Architecture

After the show, the big question was “So what now?” Was developing Watson all for a quiz show? Absolutely not! Showing the rest of the world what Watson (and the cognitive system behind it) could do became an inspiration for the next generation of intelligent information systems. For IBM, it was a demonstration of what is possible with cutting-edge analytics and computational sciences. The message is clear: If a smart machine can beat the best of the best in humans at what they are the best at, think about what it can do for your organizational problems.

The innovative and futuristic technologies that made Watson one of the most acclaimed technological advances of this decade are being leveraged as the computational foundation for several tools to analyze and characterize unstructured data for prediction-type problems. These experimental tools include Tone Analyzer and Personality Insights. Using textual content, these tools have shown the ability to predict outcomes of complex social events and globally popular competitions.

WATSON PREDICTS THE WINNER OF 2017 EUROVISION SONG CONTEST. A tool developed on the foundations of IBM Watson, Watson Tone Analyzer, uses computational linguistics to identify tone in written text. Its broader goal is to have business managers use the Tone Analyzer to understand posts, conversations, and communications of target customer populations and to respond to their needs and wants in a timely manner. One could, for example, use this tool to monitor social media and other Web-based content, including wall posts, tweets, product reviews, and discussion boards as well as longer documents such as articles and blog posts. Or one could use it to monitor customer service interactions and support-related conversations. Although it may sound like any other text-based detection system built on sentiment analysis, Tone Analyzer differs from these systems in how it analyzes and characterizes textual content. Watson Tone Analyzer measures social tendencies and opinions, using a version of the Big-5, the five categories of personality traits (i.e., openness, agreeableness, conscientiousness, extroversion, and neuroticism), along with other emotional categories to detect the tone in a given textual content. As an example, Slowey (2017b) used IBM’s Watson Tone Analyzer to predict the winner of the 2017 Eurovision Song Contest. Using nothing but the lyrics of the previous years’ competitions, Slowey discovered a pattern that suggested most winners had high levels of agreeableness and conscientiousness. The results (produced before the contest) indicated that Portugal would win the contest, and that is exactly what happened. Try it out yourself:

• Go to Watson Tone Analyzer (https://tone-analyzer-demo.ng.bluemix.net).

• Copy and paste your own text in the provided text entry field.

• Click “Analyze.”

• Observe the summary results as well as the specific sentences where specific tones are the strongest.

Another tool built on the linguistic foundations of IBM Watson is Watson Personality Insights, which seems to work quite similarly to Watson Tone Analyzer. In another fun application case, Slowey (2017a) used Watson Personality Insights to predict the winner of the best picture category at the 2017 Academy Awards (the Oscars). Using the scripts of the movies from the past years, Slowey developed a generalized profile for winners and then compared that profile to those of the newly nominated movies to identify the upcoming winner. Although in this case, Slowey incorrectly predicted Hidden Figures as the winner, the methodology she followed was unique and innovative and hence deserves credit. To try the Watson Personality Insights tool yourself, just go to https://personality-insights-demo.ng.bluemix.net/, copy and paste your own textual content into the “Body of Text” section, and observe the outcome.
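The profile-matching idea behind both prediction cases (build a generalized profile of past winners, then compare new entries against it) can be sketched as follows. This is an illustrative assumption, not Slowey's actual procedure and not the Watson API: the trait scores are invented, the entry names are hypothetical, and Euclidean distance to the mean profile is one simple way to make the comparison.

```python
import math

TRAITS = ("openness", "conscientiousness", "extroversion", "agreeableness", "neuroticism")


def mean_profile(profiles):
    """Average each Big-5 trait over a list of trait-score dictionaries."""
    return {t: sum(p[t] for p in profiles) / len(profiles) for t in TRAITS}


def distance(a, b):
    """Euclidean distance between two trait profiles."""
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in TRAITS))


def closest_to_profile(target, nominees):
    """Return the nominee whose profile is nearest to the target profile."""
    return min(nominees, key=lambda name: distance(target, nominees[name]))


# Invented trait scores (0..1), assumed to come from some text-analysis tool.
past_winners = [
    {"openness": 0.6, "conscientiousness": 0.8, "extroversion": 0.5,
     "agreeableness": 0.9, "neuroticism": 0.3},
    {"openness": 0.4, "conscientiousness": 0.9, "extroversion": 0.5,
     "agreeableness": 0.8, "neuroticism": 0.2},
]
nominees = {
    "Entry A": {"openness": 0.5, "conscientiousness": 0.8, "extroversion": 0.6,
                "agreeableness": 0.8, "neuroticism": 0.3},
    "Entry B": {"openness": 0.9, "conscientiousness": 0.3, "extroversion": 0.5,
                "agreeableness": 0.4, "neuroticism": 0.7},
}

winner_profile = mean_profile(past_winners)
predicted = closest_to_profile(winner_profile, nominees)
print(predicted)
```

As the Oscar case shows, such a profile match is only a heuristic; a close profile does not guarantee a correct prediction.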


One of the worthiest endeavors for Watson (or Watson-like large-scale cognitive computing systems) is to help doctors and other medical professionals to diagnose diseases and identify the best treatment options that would work for an individual patient. Although Watson is new, this very novel and worthy task is not new to the world of computing. In the early 1970s, several researchers at Stanford University developed a computer system, MYCIN, to identify bacteria causing severe infections, such as bacteremia and meningitis, and to recommend antibiotics with the dosage adjusted for the specifics of an individual patient (Buchanan and Shortliffe, 1984). This six-year effort relied on a rule-based expert system, a type of AI system, where the diagnoses and treatment knowledge nuggets/rules were elicited from a large number of experts (i.e., doctors with ample experience in the specific medical domain). The resulting system was then tested on new patients, and its performance was compared to that of the experienced doctors used as the knowledge sources/experts. The results favored MYCIN, providing a clear indication that properly designed and implemented AI-based computer systems can meet and often exceed the effectiveness and efficiency of even the best medical experts. After more than four decades, Watson is now trying to pick up where MYCIN left off in the mission of using smart computer systems to improve the health and well-being of humans by helping doctors with the contextual information that they need to better and more quickly diagnose and treat their patients.

The first industry targeted to utilize Watson was healthcare, followed by security, finance, retail, education, public services, and research. The following sections provide short descriptions of what Watson can do (and, in many cases, is doing) for these industries.

HEALTHCARE AND MEDICINE The challenges that healthcare is facing today are rather big and multifaceted. With the aging U.S. population, which may be partially attributed to better living conditions and advanced medical discoveries fueled by a variety of technological innovations, demand for healthcare services is increasing faster than the supply of resources. As we all know, when there is an imbalance between demand and supply, prices go up and quality suffers. Therefore, we need cognitive systems like Watson to help decision makers optimize the use of their resources in both clinical and managerial settings.

According to healthcare experts, only 20 percent of the knowledge that physicians use to diagnose and treat patients is evidence based. Considering that the amount of medical information available is doubling every five years and that much of these data are unstructured, physicians simply do not have time to read every journal that can help them keep up-to-date with the latest advances. Given the growing demand for services and the complexity of medical decision making, how can healthcare providers address these problems? The answer could be to use Watson or similar cognitive systems that have the ability to help physicians in diagnosing and treating patients by analyzing large amounts of data (both structured data coming from electronic medical record databases and unstructured text coming from physician notes and published literature) to provide evidence for faster and better decision making. First, the physician and the patient can describe symptoms and other related factors to the system in natural language. Watson can then identify the key pieces of information and mine the patient’s data to find relevant facts about family history, current medications, and other existing conditions. It can then combine that information with current findings from tests and then can form and test hypotheses for potential diagnoses by examining a variety of data sources: treatment guidelines, electronic medical record data, doctors’ and nurses’ notes, and peer-reviewed research and clinical studies. Next, Watson can suggest potential diagnoses and treatment options with a confidence rating for each suggestion.


Watson also has the potential to transform healthcare by intelligently synthesizing fragmented research findings published in a variety of outlets. It can dramatically change the way medical students learn. It can help healthcare managers to be proactive about upcoming demand patterns, optimally allocate resources, and improve processing of payments. Early examples of leading healthcare providers that use Watson-like cognitive systems include MD Anderson, The Cleveland Clinic, and Memorial Sloan Kettering.

SECURITY As the Internet expands into every facet of our lives (e-commerce, e-business, smart grids for energy, smart homes for remote control of residential gadgets and appliances) to make things easier to manage, it also opens up the potential for ill-intended people to intrude in our lives. We need smart systems like Watson that are capable of constantly monitoring for abnormal behavior and, when it is identified, preventing people from accessing our lives and harming us. This could be at the corporate or even national security system level; it could also be at the personal level. Such a smart system could learn who we are and become a digital guardian that could make inferences about activities related to our life and alert us whenever abnormal things happen.

FINANCE The financial services industry faces complex challenges. Regulatory measures as well as social and governmental pressures for financial institutions to be more inclusive have increased. And the customers the industry serves are more empowered, demanding, and sophisticated than ever before. With so much financial information generated each day, it is difficult to properly harness the appropriate information on which to act. Perhaps the solution is to create smarter client engagement by better understanding risk profiles and the operating environment. Major financial institutions are already working with Watson to infuse intelligence into their business processes. Watson is tackling data-intensive challenges across the financial services sector, including banking, financial planning, and investing.

RETAIL The retail industry is rapidly changing according to customers’ needs and wants. Empowered by mobile devices and social networks that give them easier access to more information faster than ever before, customers have high expectations for products and services. While retailers are using analytics to keep up with those expectations, their bigger challenge is efficiently and effectively analyzing the growing mountain of real-time insights that could give them a competitive advantage. Watson’s cognitive computing capabilities related to analyzing massive amounts of unstructured data can help retailers reinvent their decision-making processes around pricing, purchasing, distribution, and staffing. Because of Watson’s ability to understand and answer questions in natural language, Watson is an effective and scalable solution for analyzing and responding to social sentiment based on data obtained from social interactions, blogs, and customer reviews.

EDUCATION With the rapidly changing characteristics of students (who are more visually oriented/stimulated, constantly connected to social media and social networks, and with increasingly shorter attention spans), what should the future of education and the classroom look like? The next generation of educational systems should be tailored to fit the needs of the new generation with customized learning plans, personalized textbooks (digital ones with integrated multimedia: audio, video, animated graphs/charts, etc.), dynamically adjusted curriculum, and perhaps smart digital tutors and 24/7 personal advisors. Watson seems to have what it takes to make all this happen. With its NLP capability, students can converse with it just as they do with their teachers, advisors, and friends. This smart assistant can answer students’ questions, satisfy their curiosity, and help them keep up with the endeavors of the educational journey.

GOVERNMENT For local, regional, and national governments, the exponential rise of Big Data presents an enormous dilemma. Today’s citizens are more informed and empowered than ever before, and that means they have high expectations for the value of the public sector serving them. And government organizations can now gather enormous volumes of unstructured, unverified data that could serve their citizens, but only if those data can be analyzed efficiently and effectively. IBM Watson’s cognitive computing may help make sense of this data deluge, speeding governments’ decision-making processes and helping public employees to focus on innovation and discovery.

RESEARCH Every year, hundreds of billions of dollars are spent on research and development, most of it documented in patents and publications, creating an enormous amount of unstructured data. To contribute to the extant body of knowledge, one needs to sift through these data sources to find the outer boundaries of research in a particular field. This is very difficult, if not impossible, work if it is done with traditional means, but Watson can act as a research assistant to help collect and synthesize information to keep people updated on recent findings and insights. For instance, the New York Genome Center is using the IBM Watson cognitive computing system to analyze the genomic data of patients diagnosed with a highly aggressive and malignant brain cancer and to more rapidly deliver personalized, life-saving treatment to patients with this disease (Royyuru, 2014).

SECTION 6.10 REVIEW QUESTIONS

1. What is cognitive computing, and how does it differ from other computing paradigms?

2. Draw a diagram and explain the conceptual framework of cognitive computing. Make sure to include inputs, enablers, and expected outcomes in your framework.

3. List and briefly define the key attributes of cognitive computing.

4. How does cognitive computing differ from ordinary AI techniques?

5. What are the typical use cases for cognitive analytics?

6. Explain what the terms cognitive analytics and cognitive search mean.

7. What is IBM Watson and what is its significance to the world of computing?

8. How does Watson work?

9. List and briefly explain five use cases for IBM Watson.

Chapter Highlights

• Deep learning is among the latest trends in AI that come with great expectations.

• The goal of deep learning is similar to that of the other machine-learning methods, which is to use sophisticated mathematical algorithms to learn from data similar to the way that humans learn.

• What deep learning has added to the classic machine-learning methods is the ability to automatically acquire the features required to accomplish highly complex and unstructured tasks.

• Deep learning belongs to the representation learning branch of the AI learning family of methods.

• The recent emergence and popularity of deep learning can largely be attributed to very large data sets and rapidly advancing computing infrastructures.

• Artificial neural networks emulate the way the human brain works. The basic processing unit is a neuron. Multiple neurons are grouped into layers and linked together.


• In a neural network, knowledge is stored in the weight associated with the connections between neurons.

• Backpropagation is the most popular learning paradigm of feedforward neural networks.

• An MLP-type neural network consists of an input layer, an output layer, and a number of hidden layers. The nodes in one layer are connected to the nodes in the next layer.

• Each node at the input layer typically represents a single attribute that may affect the prediction.

• The usual process of learning in a neural network involves three steps: (1) compute temporary outputs based on inputs and random weights, (2) compare outputs with desired targets, and (3) adjust the weights and repeat the process.

• Developing neural network–based systems requires a step-by-step process. It includes data preparation and preprocessing, training and testing, and conversion of the trained model into a production system.

• Neural network software allows for easy experimentation with many models. Although neural network modules are included in all major data mining software tools, specific neural network packages are also available.

• Neural network applications abound in almost all business disciplines as well as in virtually all other functional areas.

• Overfitting occurs when neural networks are trained for a large number of iterations with relatively small data sets. To prevent overfitting, the training process is controlled by an assessment process using a separate validation data set.

• Neural networks are known as black-box models. Sensitivity analysis is often used to shed light into the black box to assess the relative importance of input features.

• Deep neural networks broke the generally accepted notion of “no more than two hidden layers are needed to formulate complex prediction problems.” They promote increasing the number of hidden layers to arbitrarily large numbers to better represent the complexity in the data set.

• MLP deep networks, also known as deep feedforward networks, are the most general type of deep networks.

• The impact of random weights in the learning process of deep MLP is shown to be a significant issue. Nonrandom assignment of the initial weights seems to significantly improve the learning process in deep MLP.

• Although there is no generally accepted theoretical basis for this, it is believed and empirically shown that in deep MLP networks, multiple layers perform better and converge faster than few layers with many neurons.

• CNNs are arguably the most popular and most successful deep learning methods.

• CNNs were initially designed for computer vision applications (e.g., image processing, video processing, text recognition) but also have been shown to be applicable to nonimage or non-text data sets.

• The main characteristic of the convolutional networks is having at least one layer involving a convolution weight function instead of general matrix multiplication.

• The convolution function is a method to address the issue of having too many network weight parameters by introducing the notion of parameter sharing.

• In CNN, a convolution layer is often followed by another layer known as the pooling (a.k.a. subsampling) layer. The purpose of a pooling layer is to consolidate elements in the input matrix in order to produce a smaller output matrix while maintaining the important features.

• ImageNet is an ongoing research project that provides researchers with a large database of images, each linked to a set of synonym words (known as synset) from WordNet (a word hierarchy database).

• AlexNet is one of the first convolutional networks designed for image classification using the ImageNet data set. Its success rapidly popularized the use and reputation of CNNs.

• GoogLeNet (a.k.a. Inception), a deep convolutional network architecture designed by Google researchers, was the winning architecture at ILSVRC 2014.

• Google Lens is an app that uses deep learning artificial neural network algorithms to deliver information about the images captured by users from their nearby objects.

• Google’s word2vec project remarkably increased the use of CNN-type deep learning for text mining applications.

• RNN is another deep learning architecture designed to process sequential inputs.

• RNNs have memory to remember previous information in determining context-specific, time-dependent outcomes.

• A variation of RNN, the LSTM network is today known as the most effective sequence modeling technique and is the base of many practical applications.

• Two emerging LSTM applications are Google Neural Machine Translator and Microsoft Skype Translator.

• Deep learning implementation frameworks include Torch, Caffe, TensorFlow, Theano, and Keras.

• Cognitive computing makes a new class of problems computable by addressing highly complex situations that are characterized by ambiguity and uncertainty; in other words, it handles the kinds of problems that are thought to be solvable by human ingenuity and creativity.

• Cognitive computing finds and synthesizes data from various information sources and weighs the context and conflicting evidence inherent in the data in order to provide the best possible answers to a given question or problem.

• The key attributes of cognitive computing include adaptability, interactivity, being iterative, stateful, and contextual.

• Cognitive analytics is a term that refers to cognitive computing–branded technology platforms, such as IBM Watson, that specialize in the processing and analysis of large unstructured data sets.

• Cognitive search is the new generation of search method that uses AI (advanced indexing, NLP, and machine learning) to return results that are much more relevant to the user than traditional search methods.

• IBM Watson is perhaps the smartest computer system built to date. It has coined and popularized the term cognitive computing.

• IBM Watson beat the best of men (the two most winning competitors) at the quiz game Jeopardy!, showcasing the ability of computers to do tasks that are designed for human intelligence.

• Watson and systems like it are now in use in many application areas including healthcare, finance, security, and retail.
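The three-step learning process listed in the highlights above (compute outputs, compare with targets, adjust weights and repeat) can be sketched with a single sigmoid neuron. This is a minimal illustration under stated assumptions, not the chapter's own code: the OR data set, the learning rate, the epoch count, and the function names are all chosen here for demonstration.

```python
import math
import random


def sigmoid(z):
    """Logistic activation (transfer) function."""
    return 1.0 / (1.0 + math.exp(-z))


def train_neuron(samples, epochs=5000, rate=0.5, seed=0):
    """Train one sigmoid neuron with the delta rule; returns (weights, bias)."""
    rng = random.Random(seed)  # fixed seed so the random start is repeatable
    n_inputs = len(samples[0][0])
    weights = [rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
    bias = rng.uniform(-0.5, 0.5)
    for _ in range(epochs):
        for inputs, target in samples:
            # Step 1: compute a temporary output from the current weights.
            output = sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            # Step 2: compare the output with the desired target.
            error = target - output
            # Step 3: adjust the weights in proportion to the error, then repeat.
            delta = error * output * (1.0 - output)
            weights = [w + rate * delta * x for w, x in zip(weights, inputs)]
            bias += rate * delta
    return weights, bias


def predict(weights, bias, inputs):
    """Output of the trained neuron for one input vector."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)


# Logical OR as a tiny, linearly separable training set (illustrative only).
or_samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights, bias = train_neuron(or_samples)
predictions = [round(predict(weights, bias, x)) for x, _ in or_samples]
print(predictions)
```

A multilayer network follows the same loop; backpropagation extends step 3 by passing the error backward through the hidden layers.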

Key Terms

activation function
artificial intelligence (AI)
artificial neural networks (ANN)
backpropagation
black-box syndrome
Caffe
cognitive analytics
cognitive computing
cognitive search
connection weight
constant error carousel (CEC)
convolution function
convolutional neural network (CNN)
deep belief network (DBN)
deep learning
deep neural network
DeepQA
Google Lens
GoogLeNet
Google Neural Machine Translator (GNMT)
graphics processing unit (GPU)
hidden layer
IBM Watson
ImageNet
Keras
long short-term memory (LSTM)
machine learning
Microsoft Skype Translator
multilayer perceptron (MLP)
MYCIN
network structure
neural network
neuron
overfitting
perceptron
performance function
pooling
processing element (PE)
recurrent neural network (RNN)
representation learning
sensitivity analysis
stochastic gradient descent (SGD)
summation function
supervised learning
TensorFlow
Theano
threshold value
Torch
transfer function
word embeddings
word2vec

Questions for Discussion

1. What is deep learning? What can deep learning do that traditional machine-learning methods cannot?

2. List and briefly explain different learning paradigms/methods in AI.

3. What is representation learning, and how does it relate to machine learning and deep learning?

4. List and briefly describe the most commonly used ANN activation functions.

5. What is MLP, and how does it work? Explain the function of summation and activation weights in MLP-type ANN.

6. List and briefly describe the nine-step process in conducting a neural network project.


7. Draw and briefly explain the three-step process of learning in ANN.

8. How does the backpropagation learning algorithm work?

9. What is overfitting in ANN learning? How does it happen, and how can it be prevented?

10. What is the so-called black-box syndrome? Why is it important to be able to explain an ANN’s model structure?

11. How does sensitivity analysis work in ANN? Search the Internet to find other methods to explain ANN methods.

12. What is meant by “deep” in deep neural networks? Compare deep neural network to shallow neural network.

13. What is GPU? How does it relate to deep neural networks?

14. How does a feedforward multilayer perceptron–type deep network work?

15. Comment on the impact of random weights in developing deep MLP.

16. Which strategy is better: more hidden layers versus more neurons?

17. What is CNN?

18. For what type of applications can CNN be used?

19. What is the convolution function in CNN, and how does it work?

20. What is pooling in CNN? How does it work?

21. What is ImageNet, and how does it relate to deep learning?

22. What is the significance of AlexNet? Draw and describe its architecture.

23. What is GoogLeNet? How does it work?

24. How does CNN process text? What are word embeddings, and how do they work?

25. What is word2vec, and what does it add to the traditional text mining?

26. What is RNN? How does it differ from CNN?

27. What is the significance of context, sequence, and memory in RNN?

28. Draw and explain the functioning of a typical recurrent neural network unit.

29. What is an LSTM network, and how does it differ from RNNs?

30. List and briefly describe three different types of LSTM applications.

31. How do Google’s Neural Machine Translation and Microsoft Skype Translator work?

32. Despite its short tenure, why do you think deep learning implementation has several different computing frameworks?

33. Define and comment on the relationship between CPU, NVIDIA, CUDA, and deep learning.

34. List and briefly define the characteristics of different deep learning frameworks.

35. What is Keras, and how does it differ from other frameworks?

36. What is cognitive computing and how does it differ from other computing paradigms?

37. Draw a diagram and explain the conceptual framework of cognitive computing. Make sure to include inputs, enablers, and expected outcomes in your framework.

38. List and briefly define the key attributes of cognitive computing.

39. How does cognitive computing differ from ordinary AI techniques?

40. What are the typical use cases for cognitive analytics?

41. What is cognitive analytics? What is cognitive search?

42. What is IBM Watson, and what is its significance to the world of computing?

43. How does IBM Watson work?

44. List and briefly explain five use cases for IBM Watson.

Exercises

Teradata University Network (TUN) and Other Hands-On and Internet Exercises

1. Go to the Teradata University Network Web site (teradatauniversitynetwork.com). Search for teaching and learning materials (e.g., articles, application cases, white papers, videos, exercises) on deep learning, cognitive computing, and IBM Watson. Read the material you have found. If needed, also conduct a search on the Web to enhance your findings. Write a report on your findings.

2. Deep learning is relatively new to the world of analytics. Its application cases and success stories are just starting to emerge on the Web. Conduct a comprehensive search on your school’s digital library resources to identify at least five journal articles where interesting deep learning applications are described. Write a report on your findings.

3. Most of the applications of deep learning today are developed using R- and/or Python-based open-source computing resources. Identify those resources (frameworks such as Torch, Caffe, TensorFlow, Theano, Keras) available for building deep learning models and applications. Compare and contrast their capabilities and limitations. Based on your findings and understanding of these resources, if you were to develop a deep learning application, which one would you choose to employ? Explain and justify/defend your choice.

4. Cognitive computing has become a popular term to define and characterize the extent of the ability of machines/computers to show “intelligent” behavior. Thanks to IBM Watson and its success on Jeopardy!, cognitive computing and cognitive analytics are now part of many real-world intelligent systems. In this exercise, identify at least three application cases where cognitive computing was used to solve complex real-world problems. Summarize your findings in a professionally organized report.

5. Download the KNIME analytics platform, one of the most popular free/open-source software tools, from knime.org. Identify the deep learning examples (where Keras is used to build some exemplary prediction/classification models) in its example folder. Study the models in detail. Understand what they do and how exactly they do it. Then, using a different but similar data set, build and test your own deep learning prediction model. Report your findings and experiences in a written document.

6. Search for articles related to “cognitive search.” Identify at least five pieces of written material (a combination of journal articles, white papers, blog posts, application cases, etc.). Read and summarize your findings. Explain your understanding of cognitive search and how it differs from regular search methods.

7. Go to Teradata.com. Search and find application case studies and white papers on deep learning and/or cognitive computing. Write a report to summarize your findings, and comment on the capabilities and limitations (based on your understanding) of these technologies.

8. Go to SAS.com. Search and find application case studies and white papers on deep learning and/or cognitive computing. Write a report to summarize your findings, and comment on the capabilities and limitations (based on your understanding) of these technologies.

9. Go to IBM.com. Search and find application case studies and white papers on deep learning and/or cognitive computing. Write a report to summarize your findings, and comment on the capabilities and limitations (based on your understanding) of these technologies.

10. Go to TIBCO.com or some other advanced analytics company Web site. Search and find application case studies and white papers on deep learning and/or cognitive computing. Write a report to summarize your findings, and comment on the capabilities and limitations (based on your understanding) of these technologies.

References

Abadi, M., P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, . . . M. Isard. (2016). “TensorFlow: A System for Large-Scale Machine Learning.” OSDI, 16, pp. 265–283.

Altman, E. I. (1968). “Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy.” The Journal of Finance, 23(4), pp. 589–609.

Bahdanau, D., K. Cho, & Y. Bengio. (2014). “Neural Machine Translation by Jointly Learning to Align and Translate.” ArXiv Preprint ArXiv:1409.0473.

Bengio, Y. (2009). “Learning Deep Architectures for AI.” Foundations and Trends® in Machine Learning, 2(1), pp. 1–127.

Bergstra, J., O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, . . . Y. Bengio. (2010). “Theano: A CPU and GPU Math Compiler in Python.” Proceedings of the Ninth Python in Science Conference, Vol. 1.

Bi, R. (2014). “When Watson Meets Machine Learning.” www.kdnuggets.com/2014/07/watson-meets-machine-learning.html (accessed June 2018).

Boureau, Y.-L., N. Le Roux, F. Bach, J. Ponce, & Y. LeCun. (2011). “Ask the Locals: Multi-Way Local Pooling for Image Recognition.” Proceedings of the International Computer Vision (ICCV’11) IEEE International Conference, pp. 2651–2658.

Boureau, Y.-L., J. Ponce, & Y. LeCun. (2010). “A Theoretical Analysis of Feature Pooling in Visual Recognition.” Proceedings of International Conference on Machine Learning (ICML’10), pp. 111–118.

Buchanan, B. G., & E. H. Shortliffe. (1984). Rule Based Expert Systems: The MYCIN Experiments of the Stanford Heuristic Programming Project. Reading, MA: Addison-Wesley.

Cognitive Computing Consortium. (2018). https://cognitivecomputingconsortium.com/resources/cognitive-computing-defined/#1467829079735-c0934399-599a (accessed July 2018).

Chen, T., M. Li, Y. Li, M. Lin, N. Wang, M. Wang, . . . Z. Zhang. (2015). “Mxnet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems.” ArXiv Preprint ArXiv:1512.01274.

Collobert, R., K. Kavukcuoglu, & C. Farabet. (2011). “Torch7: A Matlab-like Environment for Machine Learning.” BigLearn, NIPS workshop.

Cybenko, G. (1989). “Approximation by Superpositions of a Sigmoidal Function.” Mathematics of Control, Signals and Systems, 2(4), 303–314.

DeepQA. (2011). “DeepQA Project: FAQ, IBM Corporation.” https://researcher.watson.ibm.com/researcher/view_group.php?id=2099 (accessed May 2018).

Delen, D., R. Sharda, & M. Bessonov. (2006). “Identifying Significant Predictors of Injury Severity in Traffic Accidents Using a Series of Artificial Neural Networks.” Accident Analysis & Prevention, 38(3), 434–444.

Denyer, S. (2018, January). “Beijing Bets on Facial Recognition in a Big Drive for Total Surveillance.” The Washington Post. https://www.washingtonpost.com/news/world/wp/2018/01/07/feature/in-china-facial-recognition-is-sharp-end-of-a-drive-for-total-surveillance/?noredirect=on&utm_term=.e73091681b31.