Balancing relevancy across expert systems for a Conversational AI personality

Gautam Prasad

September 2019

School of Mathematics, Cardiff University

A dissertation submitted in partial fulfilment of the requirements for MSc (in Data Science and Analytics) by taught programme.

CANDIDATE’S ID NUMBER 1821536

CANDIDATE’S TITLE (please circle as appropriate) Mr / Miss / Ms / Mrs / Rev / Dr / Other ……………..........

CANDIDATE’S SURNAME PRASAD

CANDIDATE’S FULL FORENAMES GAUTAM

DECLARATION

This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree.

Signed ……………………………………………. (candidate) Date 06-09-2019

STATEMENT 1

This dissertation is being submitted in partial fulfilment of the requirements for the degree of MSc.

Signed ……………………………………………. (candidate) Date 06-09-2019

STATEMENT 2

This dissertation is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by footnotes giving explicit references. A Bibliography is appended.

Signed ……………………………………………. (candidate) Date 06-09-2019

STATEMENT 3

I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing, and for the title and summary to be made available to outside organisations.

Signed ……………………………………………. (candidate) Date 06-09-2019

STATEMENT 4 - BAR ON ACCESS APPROVED

I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing after expiry of a bar on access approved by the Graduate Development Committee.

Signed ……………………………………………. (candidate) Date 06-09-2019 Gautam Prasad

Executive Summary

Companies want to interact with their customers in a way that is not limited by time, human resource availability or language. They need to pre-empt the needs of their clientele in order to keep them satisfied and thereby reduce customer churn. Businesses such as Vodafone, Royal Bank of Scotland (RBS) and NatWest, along with others in the telecom and finance domains, are investing in chatbots to build and maintain relationships while looking to lower overheads, costs and training time.

There are three significant types of chatbot expert system in use: chitchat bots, short tail (task-focused) bots, and long tail search or FAQ bots. These have primarily been used individually, based on business requirements. Chitchat systems focus on having the most natural conversation with a user based on their inputs. Short tail systems help the user complete a small number of regularly performed tasks; they require high training effort to scale but tend to provide consistent results. Long tail systems focus on information retrieval; they require a more significant training effort at the start and provide a wider variety of answers at lower confidences, but they scale more efficiently. Developing a mixture of experts system that can combine these three technologies into a single personality, balancing the relevancy of the candidate responses, is of high interest to organisations. It would enable them to better leverage their existing investments in brand personality (chitchat systems) and human-readable material (long tail systems) while rapidly developing new task-focused (short tail) systems across a wide variety of business domains using modern API-based conversational short tail systems. This project looks at the best method to bring these disparate types of systems together into a coherent conversational personality.

The analysis looked at different scenarios: using unmodified confidence scores from the expert systems, building a high-level classifier to determine the best system to answer, and simulating the fallback rules used in systems like IBM Watson. The outputs from all the scenarios were optimised using evolutionary algorithms on a prepared data set, covering a single topic that was applicable to all three system types, before being compared using an accuracy metric to determine the most successful strategy. The additional effect of concurrency between each user utterance was then evaluated against these strategies.

The study concluded that merging the chitchat and short tail training data to reduce the number of expert systems from three to two, and then using the fallback rules, works best, provided the confidence level for fallback is optimised for the expected data set. If the training sets of the chitchat and short tail systems cannot be merged, or three separate systems must be kept, a weighted high-level classifier performed best. Optimising the confidence levels improved the performance of the fallback rules by a considerable margin. The literature review suggested that concurrency would be a crucial aspect to investigate, but its overall effect on this data was shown to be small. Recommended next steps are beta testing with real user data, to avoid any cognitive bias in the test and training sets, and gauging the change in performance when the number of expert systems is increased beyond three, where a high-level classifier is expected to outperform fallback rules.

Acknowledgements

I would like to thank my supervisor, Professor Alexander Balinsky, for his timely help and for pointing me in the right direction throughout this project. I am grateful to Mr Stephen Broadhurst of ThinJetty Ltd for providing me with an opportunity to pursue this project at his organisation and for the continuous mentoring and support. I would also like to acknowledge the advice and assistance of Ms Joanna Emery and the moral support provided by my colleagues on the MSc course.

Contents

Executive Summary
Acknowledgements
1. Introduction
2. Literature Review
    2.1. Expert Systems
        2.1.1. Overview
        2.1.2. Typical Architecture for an expert system
        2.1.3. Applications
    2.2. Mixture of Experts
        2.2.1. Background/ Overview
        2.2.2. Applications
    2.3. Conversational Agents and the application of Expert Systems
        2.3.1. What are Conversational Agents?
        2.3.2. How are expert systems used in chatbots?
        2.3.3. Comparison of toolkits for building conversational agents
3. Methodology & Implementation
    3.1. Overview
    3.2. Tools setup and initialisation
    3.3. Knowledgebase
    3.4. Testing framework & Dataset optimisation
    3.5. Scenarios
        3.5.1. Overview
        3.5.2. Experiment 1: High-level classifier
        3.5.3. Experiment 2: Weighted High-Level Classifier
        3.5.4. Experiment 3: Based on unmodified confidence scores
        3.5.5. Experiment 4: Weighted confidence scores
        3.5.6. Experiment 5: Emulating fallback logic
        3.5.7. Experiment 6: Effect of concurrency
4. Results
5. Discussion & Conclusion
    5.1. Qualitative evaluation
    5.2. Possibility for future work
    5.3. Recommendations & Findings
References
Appendices

1. Introduction

This project investigates the use of a mixture of experts system in the conversational artificial intelligence (AI) domain and aims to devise a rule-based or machine-learnt technique that can balance relevancy across the experts. The expert systems are each specialised in a domain and are used to respond to a conversational turn in a robust and precise manner, while accounting for concurrency and minimising errors. The aim is to have a mechanism that enables the rules to adapt during the conversation, based on parameters relating to the conversational state or features of the user utterance, to best judge which system must respond.

2. Literature Review

2.1. Expert Systems

2.1.1. Overview

Expert systems are a branch of AI concerned with developing machines which, in a specific domain, have problem-solving abilities similar to those displayed by a human expert in the same field, or which simulate human expert behaviour (Tzafestas et al., 1993). An expert system differs from other forms of AI in that it solves problems using domain-specific approaches at an expert knowledge level and also provides evidence for the conclusions drawn (Tzafestas et al., 1993). Several advantages over human experts have also been highlighted over time (Ignizio, 1991):

• No human-like bias is involved in obtaining solutions or prescribing strategies.

• There is minimal chance of errors in calculations.

• The system is available and performs reliably on a near-constant basis.

2.1.2. Typical Architecture for an expert system

Figure 1: Typical architecture of an expert system (Forsyth, 1984; Tzafestas et al., 1993; Yazdani, 1989)

An expert system consists of the following high-level modules, shown in Figure 1, which form the core of the system's operation (Forsyth, 1984, n.d.; Ignizio, 1990; Tripathi, 2011; Tzafestas et al., 1993):

2.1.2.1. Knowledge Base

The knowledge base encompasses the domain-specific, expert-level knowledge that the expert system requires for comprehending user requirements, formulating strategies and obtaining the rule-based solutions that are then passed on to the inference engine for further processing. It consists of both factual knowledge, which is the most commonly shared form of knowledge, and heuristic knowledge, which is less widely shared, considerably more individualistic, and acts as the reasoning behind the solutions obtained.

2.1.2.2. Inference Engine

The inference engine acts as the intelligence behind the expert system and draws inferences from user requests and utterances. It analyses and processes the rules obtained from the knowledge base in order to arrive at a solution, together with the logical reasoning behind it. In short, it controls the interpretation and reasoning methodology of the expert system. The two most widely used reasoning strategies are forward chaining and backward chaining. Forward chaining starts with the data at hand and uses the inference rules to arrive at a solution, whereas backward chaining starts with the list of goals to be attained and works backwards to see whether any available data supports those goals.
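
The two reasoning strategies described above can be illustrated with a minimal sketch. The rule base, fact names and function names below are hypothetical and chosen only for illustration (loosely themed on a car configurator); they are not taken from any system discussed in this dissertation.

```python
# Hypothetical rule base: each rule maps a set of premises to one conclusion.
RULES = [
    ({"needs_seven_seats", "has_children"}, "suggest_mpv"),
    ({"suggest_mpv", "low_budget"}, "suggest_used_mpv"),
]


def forward_chain(facts, rules):
    """Start from the known facts and fire rules until nothing new is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= derived and conclusion not in derived:
                derived.add(conclusion)
                changed = True
    return derived


def backward_chain(goal, facts, rules, seen=frozenset()):
    """Start from a goal and work backwards to check whether the facts support it."""
    if goal in facts:
        return True
    if goal in seen:  # guard against circular rules
        return False
    seen = seen | {goal}
    return any(
        all(backward_chain(premise, facts, rules, seen) for premise in premises)
        for premises, conclusion in rules
        if conclusion == goal
    )


facts = {"needs_seven_seats", "has_children", "low_budget"}
all_facts = forward_chain(facts, RULES)
goal_reached = backward_chain("suggest_used_mpv", facts, RULES)
```

Forward chaining derives every conclusion the facts allow, while backward chaining only explores the rules that could establish the requested goal.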

2.1.2.3. User Interface

A user interface is built to interact with the user by receiving user inputs in the form of utterances and returning the system's response in a form the user can understand.

2.1.3. Applications

Expert systems have varied applications, chiefly in classification tasks, in fields such as medical diagnosis, information retrieval and allied services, engineering, human-computer interaction, the military and robotics (Forsyth, 1984, n.d.; Ignizio, 1990; Tzafestas et al., 1993).

2.2. Mixture of Experts

2.2.1. Background/ Overview

Mixture of experts is a method that was introduced almost 30 years ago by Jacobs and co-workers (R. Jacobs et al., 1991). They investigated the use of a different error function in a mixture of experts system, and their approach has been highly popular in a wide range of applications (Yuksel et al., 2012). Over the years, around 20 different studies have been conducted on the principles, workings and applications of such systems, and the problem was to an extent even considered completely solved; recently, however, there has been a resurgence of interest in using a mixture of experts for several new-age problems (Yuksel et al., 2012).

Mixture of experts is widely regarded as a combining method which, when used in machine learning tasks, can lead to better performance and improved results (Masoudnia and Ebrahimpour, 2014). The critical aspect of a mixture of experts model is to employ specialised expert systems that return correct answers for topics falling within their own knowledge base, together with a gating network across all the expert systems that helps to reduce errors (R. Jacobs et al., 1991). The basic principle is that the gating network assigns a new input to one expert system, and only that system's weights are changed if the output is found to be incorrect, which removes any chance of interference with the other expert systems (R. Jacobs et al., 1991). This has the further implication, and possible advantage, that each expert system is assigned only a small subset of the possible input cases (R. Jacobs et al., 1991).

Figure 2: A system of expert and gating systems (R. Jacobs et al., 1991)

The gating network is assumed to be a stochastic one-out-of-n selector, unlike in Hampshire and Waibel (1992) and R. A. Jacobs et al. (1991). Minimal interference is thereby achieved in a more straightforward manner, by reformulating the error function so that the expert systems compete with one another rather than collaborate (R. Jacobs et al., 1991). An evaluative comparison was performed between standard backpropagation networks with a single hidden layer and a mixture of experts on a multi-speaker vowel recognition task (R. Jacobs et al., 1991). The number of parameters in the models was kept approximately equal by adjusting the number of hidden units in the backpropagation network (R. Jacobs et al., 1991). The mixture of experts model reached the error criterion (average squared error of 0.08) in considerably fewer epochs, and it continued to scale as the number of experts in the system increased (R. Jacobs et al., 1991; Masoudnia and Ebrahimpour, 2014).
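
To make the difference between a collaborative and a competitive error function concrete, the following minimal sketch compares the two for a single training case. It assumes a softmax gating network and the negative log-likelihood form of the competitive error commonly given in the mixture of experts literature; the example outputs, gating scores and function names are illustrative assumptions rather than values from the cited study.

```python
import math


def softmax(scores):
    """Turn raw gating scores into selection probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def squared_distance(target, output):
    return sum((t - o) ** 2 for t, o in zip(target, output))


def collaborative_error(target, outputs, gate_probs):
    """Error of the blended output: every expert shares in the residual."""
    blended = [
        sum(p * out[k] for p, out in zip(gate_probs, outputs))
        for k in range(len(target))
    ]
    return squared_distance(target, blended)


def competitive_error(target, outputs, gate_probs):
    """Negative log-likelihood form: the best-matching expert dominates."""
    likelihood = sum(
        p * math.exp(-0.5 * squared_distance(target, out))
        for p, out in zip(gate_probs, outputs)
    )
    return -math.log(likelihood)


def responsibilities(target, outputs, gate_probs):
    """Share of the case attributed to each expert; drives whose weights change most."""
    weighted = [
        p * math.exp(-0.5 * squared_distance(target, out))
        for p, out in zip(gate_probs, outputs)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


# Illustrative case: expert 0 is nearly right, expert 1 is badly wrong.
target = [1.0, 0.0]
outputs = [[0.9, 0.1], [0.0, 1.0]]
gate_probs = softmax([0.0, 0.0])  # gate is initially undecided
shares = responsibilities(target, outputs, gate_probs)
```

Under the competitive form, the expert whose output is closest to the target receives the largest responsibility, so learning concentrates on that expert and the others are left largely undisturbed.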

2.2.2. Applications

Several applications have been devised over the years for mixture of experts systems, such as (Yuksel et al., 2012):

• Prediction of climate (Lu, 2006), electricity demand (Weigend et al., 1995), stock prices (Versace et al., 2004) and currency exchange rates (Coelho et al., 2003), amongst others.

• Machine-learnt classification tasks involved in:

    o Classification of:

        ▪ Text (Estabrooks and Japkowicz, 2001), and

        ▪ Audio signals (Harb et al., 2004).

    o Recognition of:

        ▪ Handwriting (Ebrahimpour, 2009),

        ▪ Speech (Mossavat et al., 2010; Peng et al., 1996), and

        ▪ 3D objects (Walter et al., 1999).

2.3. Conversational Agents and the application of Expert Systems

2.3.1. What are Conversational Agents?

A piece of software that enables a machine to converse with a human user in a natural language such as English is called a conversational agent (Io and Lee, 2017; Weizenbaum, 1966). Since the initial research in the field in the 1960s, the most significant challenge has been equipping the machine with the intelligence needed to facilitate such interactions (Shum et al., 2018; Turing, 1950).

In a typical human conversation with a chatbot, the input from the user takes the form of one or more natural language utterances, which the system analyses to gauge the user's requirements before producing a response it deems appropriate to the analysed input (Weizenbaum, 1966). This response can be derived using several techniques. Rules were predominantly used in the earlier stages of chatbot development: the user utterance is searched for keywords and, based on their presence, the associated rules are invoked to transform the utterance into a response (Shum et al., 2018; Weizenbaum, 1966).
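
The keyword-and-rule approach described above can be sketched in a few lines. The patterns, response templates and function name are hypothetical examples in the spirit of early rule-based chatbots, not a reproduction of any particular system's script.

```python
import re

# Hypothetical keyword rules in priority order: (pattern, response template).
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bmother\b", re.IGNORECASE), "Tell me more about your family."),
]
DEFAULT = "Please go on."


def respond(utterance):
    """Return the response of the first rule whose keyword pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Reuse captured text from the utterance, minus trailing punctuation.
            captured = [group.rstrip(" .!?") for group in match.groups()]
            return template.format(*captured)
    return DEFAULT
```

For instance, `respond("I need a holiday.")` reuses the captured phrase in its reply, while an utterance that matches no rule falls through to the default prompt.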

2.3.1.1. Types of Conversational Agents

2.3.1.1.1. Chitchat

Several of the earliest systems developed, such as “Eliza”, “ALICE” and “Parry” (Colby KM, 1975; Shieber, 1994; Wallace, 2009; Weizenbaum, 1966), were designed as chitchat bots for conversing with users through media such as text and audio (Shum et al., 2018). These systems used rule-based pattern matching to respond to the user’s input (Shum et al., 2018; Weizenbaum, 1966).

The chatbots were given different personalities, such as a “Rogerian Psychotherapist” (Shum et al., 2018; Weizenbaum, 1966) or a paranoid person (Colby KM, 1975; Shum et al., 2018), but were severely limited in their ability to sustain a conversation for a prolonged duration and were confined to narrowly specified domains, which further reduced their performance (Colby KM, 1975; Shieber, 1994; Wallace, 2009). These limitations were partly due to the technology with which the systems were built, such as AIML for “ALICE”, which in turn led to their failure in several evaluations such as the “Ultimate Turing Test” (Shum et al., 2018; Wallace, 2009).

2.3.1.1.2. Task-Completion

Task-completion conversational agents were built with a focus on accomplishing specific tasks within constrained domains (Shum et al., 2018; Walker et al., 2001; Wang et al., n.d.). Such an agent typically gives a single, short, high-confidence answer. Commonly seen domains include hotel or flight booking, weather forecasting and information gathering. In general, the system tries to gauge the user’s ‘intent’ and then responds with actions that will fulfil that intent or goal (Shum et al., 2018; Walker et al., 2001). Further improvements included the ability to comprehend complex dialogues with inherent variability and state tracking (Shum et al., 2018; Williams and Young, 2007). These systems were evaluated on several parameters, including (Walker et al., 2001):

• User satisfaction

• Task completion

• Task duration

• Accuracy

Telecom giant Vodafone introduced a chatbot, ‘TOBi’, a couple of years ago to help users with basic tasks such as checking account details, troubleshooting and purchasing new connections (Koehler, 2017). TOBi delivered the following metrics (Davis, 2018):

• An increase in conversion rate of more than 100% compared to their website.

• A decrease in transaction time of around 50% compared to their website.

• A usability score of 90, among the highest ever received.

Another example is ‘Cora’, a chatbot employed by RBS and NatWest, both in the banking domain, to answer basic banking-related queries from users (“NatWest begins testing AI driven ‘digital human’ in banking first,” n.d.; Rumney, 2018). This has helped in identifying the most frequently asked questions and in significantly cutting down queuing times.

2.3.1.1.3. Long tail or Question Answering

QA conversational agents process natural language queries raised by the user and provide concise and relevant answers to them, thereby improving the overall interaction between the user and the intelligent system (Simmons, 1970; Waltinger et al., 2012). Such an agent returns a number of lower-confidence, longer sections of text that are mined from the corpus rather than configured by hand. In general, the question is first analysed and a search performed, which results in candidate answers, each with supporting evidence and a score (Ferrucci et al., 2010; Setiaji and Wibowo, 2016). The answers thus obtained are then ranked in order of relevance before being presented back to the user (Ferrucci et al., 2010; Setiaji and Wibowo, 2016). To improve the response to the user, other facets such as topic identification, context recognition and keyword detection are also used in tandem (Niranjan et al., 2012; Setiaji and Wibowo, 2016; Waltinger et al., 2012). Question classification, in which the inherent ‘type’ of the question posed by the user is obtained, has also been identified as a component which can improve the accuracy of long tail agents (Suzuki et al., 2003; Waltinger et al., 2012; Zhang and Lee, 2003).
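
The analyse, search, score and rank sequence described above can be sketched as follows. This is a deliberately simple illustration that scores passages by term overlap; the passages and function names are hypothetical, and production QA services use far richer relevance models than this.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, passage):
    """Fraction of distinct query terms that also occur in the passage (0.0 to 1.0)."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    passage_terms = set(tokenize(passage))
    return len(query_terms & passage_terms) / len(query_terms)


def rank(query, passages, top_n=3):
    """Return the top_n (score, passage) pairs, most relevant first."""
    scored = [(score(query, passage), passage) for passage in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Hypothetical brochure passages.
PASSAGES = [
    "Test drives can be booked online.",
    "The estate has a boot capacity of 560 litres.",
    "Metallic paint is available as an option.",
]
results = rank("What is the boot capacity of the estate?", PASSAGES)
```

Each result carries its score alongside the passage, which mirrors the way a long tail system returns several candidate answers with differing confidences rather than one configured reply.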

2.3.2. How are expert systems used in chatbots?

2.3.2.1. Expert systems in chatbots

Traditionally, spoken dialogue systems and conversational agents have employed mechanisms that control the dialogue flow so as to limit the user's responses to a set of pre-defined choices (M. O’Neill et al., 2004). In more advanced systems that allow multi-domain interaction between user and agent, a component identifies the domain or topic from the user input and performs the action needed to fulfil the requirement (M. O’Neill et al., 2004). Over time, several ‘plan-based dialogue modelling schemes’ were put forward as a basis for such systems, the premise being that behind every user-system interaction lies a particular requirement or goal of the user, which the system has to recognise and act upon (Lin et al., 1999). To facilitate multi-domain conversations, the entire system is configured as multiple ‘domain-specific experts’, each capable of completing transactions in a particular domain, which work in association with each other and switch between themselves based on user input (M. O’Neill et al., 2004; Nakano et al., 2008). A middle layer in the system is responsible for evaluating each user utterance across all the ‘experts’ present and determining which one has to respond to that particular utterance (Hartikainen et al., 2004; Komatani et al., 2006; Lin et al., 1999; M. O’Neill et al., 2004).
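
The role of the middle layer can be sketched as a simple dispatcher that polls every expert and lets the most confident one respond. The three toy experts, their keyword triggers and their confidence values below are hypothetical stand-ins for real domain experts and are chosen only to make the example runnable.

```python
def chitchat_expert(utterance):
    """Toy social expert: confident only when it sees a greeting."""
    confident = "hello" in utterance.lower()
    return "Hi there!", 0.9 if confident else 0.1


def short_tail_expert(utterance):
    """Toy task expert: confident only for a test drive request."""
    confident = "test drive" in utterance.lower()
    return "Sure, which day suits you?", 0.8 if confident else 0.2


def long_tail_expert(utterance):
    """Toy search expert: always has an answer, at modest confidence."""
    return "Here is a passage from the brochure...", 0.4


EXPERTS = {
    "chitchat": chitchat_expert,
    "short_tail": short_tail_expert,
    "long_tail": long_tail_expert,
}


def route(utterance, experts=EXPERTS):
    """Ask every expert, then return the name and reply of the most confident."""
    results = {name: expert(utterance) for name, expert in experts.items()}
    winner = max(results, key=lambda name: results[name][1])
    return winner, results[winner][0]
```

Selecting on raw confidence is only the simplest possible policy; it takes for granted that the experts' scores are comparable with one another, which need not hold when the experts are built on different technologies.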

2.3.2.2. Problems faced in addressing multi-domain conversations

A few problems have arisen in attempts to handle multi-domain conversations in a concurrent and flexible manner, such as:

• Identifying how to handle errors in comprehending user input (Lin et al., 1999).

• Determining whether the user or the system should take the initiative to carry on with the conversation (Lin et al., 1999).

• Tackling user initiatives in a proper and ‘consistent’ manner (Lin et al., 1999).

• Diminishing efficiency due to multiple systems working simultaneously (Hartikainen et al., 2004).

• Inability to handle concurrent topics (Lin et al., 1999).

2.3.3. Comparison of toolkits for building conversational agents

Different toolkits specialised in building conversational agents were considered, selected primarily on whether they offer both search (long tail) and typical conversational agent functionality out of the box. IBM Watson, Google Dialogflow and Microsoft LUIS each had both capabilities, with their long tail functionality provided by Watson Discovery, Knowledge Connectors and Microsoft QnA Maker respectively. To select the best possible toolkit, the study conducted by Liu et al. (2019) was used. Liu et al. (2019) compared state-of-the-art conversational toolkits on the metrics of precision, recall and F1 score, as given in Table 1.

                    Intent
Toolkit       Precision   Recall   F1
Rasa          0.863       0.863    0.863
Dialogflow    0.87        0.859    0.864
LUIS          0.855       0.855    0.855
Watson        0.884       0.881    0.882

Table 1: Comparison of specialised toolkits (Liu et al., 2019)

As seen in Table 1, IBM Watson returns the highest F1 score for intent classification, although there is no significant difference between the scores of the other three toolkits (Liu et al., 2019).
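
As a worked check, the F1 column of Table 1 can be recomputed from the precision and recall columns using the standard harmonic-mean definition, F1 = 2PR / (P + R). The dictionary and function names below are illustrative; the figures are those reported in Table 1.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per toolkit, as reported in Table 1.
TABLE_1 = {
    "Rasa": (0.863, 0.863),
    "Dialogflow": (0.87, 0.859),
    "LUIS": (0.855, 0.855),
    "Watson": (0.884, 0.881),
}

f1_by_toolkit = {
    name: round(f1_score(precision, recall), 3)
    for name, (precision, recall) in TABLE_1.items()
}
best_toolkit = max(f1_by_toolkit, key=f1_by_toolkit.get)
```

The recomputed values agree with the F1 column to three decimal places, with Watson ranked highest.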

3. Methodology & Implementation

3.1. Overview

The aim of the project is to create a rule-based or machine-learnt algorithm for weighting the confidence and evidence returned by the expert systems, in order to determine for each conversational turn which system is best placed to respond. These rules should be able to adapt during the conversation, based on parameters relating to the conversational state or features of the user utterance, to best judge which system must respond. By doing this, we also aim to recommend how businesses can best combine ‘off the shelf’, out-of-the-box (OOTB) chitchat with existing human-readable corpora (long tail) and then rapidly develop domain-specific functionality.

It was decided to build a car configurator bot (a multi-domain expert system/conversational agent) to perform the tests and analysis. The bot would be able to engage in chitchat with the user and perform tasks involved in configuring a car, such as gathering general requirements, booking test drives and handling similar queries. It would also act as a question-and-answer bot, to which users can pose natural language queries whose answers could help them narrow their search or enhance their knowledge of a vehicle they have in mind. The short tail queries would be put to the system at a higher frequency, and the long tail ones at a significantly lower frequency. Each user utterance is passed to all three expert systems, and the corresponding confidence scores are retrieved. Metrics such as accuracy, precision and recall are derived and used as a baseline score against which to compare the different simulated scenarios. The whole experiment and analysis were carried out using the IBM Watson tools Assistant and Discovery, as discussed in Section 2.3.3, with data extraction and manipulation coded in Python and optimisation tasks performed in Excel using Solver.
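
The baseline metrics mentioned above can be derived from two parallel lists: the expert that should have answered each utterance and the expert that actually did. The following is a minimal sketch; the labels and the five sample turns are hypothetical and do not come from the project's test set.

```python
def accuracy(expected, predicted):
    """Share of turns answered by the expert that was expected to answer."""
    hits = sum(e == p for e, p in zip(expected, predicted))
    return hits / len(expected)


def precision_recall(expected, predicted, label):
    """Precision and recall for one expert, treating it as the positive class."""
    pairs = list(zip(expected, predicted))
    true_pos = sum(e == label and p == label for e, p in pairs)
    false_pos = sum(e != label and p == label for e, p in pairs)
    false_neg = sum(e == label and p != label for e, p in pairs)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical routing results for five conversational turns.
EXPECTED = ["short_tail", "short_tail", "chitchat", "long_tail", "short_tail"]
PREDICTED = ["short_tail", "chitchat", "chitchat", "long_tail", "short_tail"]

overall_accuracy = accuracy(EXPECTED, PREDICTED)
short_tail_metrics = precision_recall(EXPECTED, PREDICTED, "short_tail")
chitchat_metrics = precision_recall(EXPECTED, PREDICTED, "chitchat")
```

Reporting precision and recall per expert, in addition to overall accuracy, shows whether errors come from an expert answering too eagerly or from it being passed over.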

3.2. Tools setup and initialisation

An account was set up in IBM Watson to use Assistant and Discovery for building the conversational agent. The Watson Developer Cloud Python SDK is used to communicate with the Assistant and Discovery services through the application programming interface (API) provided, and its dependencies and packages were installed. Access to the online services is gained using a combination of usernames, API keys, environment IDs, collection IDs and similar credentials. Python is also used to extract, manipulate and further analyse the outputs from the services’ APIs. Microsoft Excel is used to create and store the data used for training and testing, as Excel worksheets (.xlsx), comma-separated value files (.csv) and tab-separated value files (.tsv), depending on requirements and features.

Toward the latter end of the analysis, the Solver add-in of Excel is used to optimise metric values such as precision, recall or accuracy (the objective) based on weights or other parameters (the constraints) as necessary. Solver has three solving methods, which are used throughout our experiments according to the requirements and the improvement each method brings to the metrics. Where more than one method is used, the results can also be compared. The solver methods are as follows:

• Simplex LP: Used where the problem is linear, which restricts its applications (“Excel Solver,” 2016). One of its benefits, however, is that the solutions obtained are always globally optimal (“Excel Solver,” 2016).

• Generalised Reduced Gradient (GRG) Non-Linear: This method is the fastest of the non-linear methods, but has the disadvantage that the solution obtained might not be a global optimum and is highly dependent on the initial conditions (“Excel Solver,” 2016). It is used for smooth non-linear problems (“Excel Solver,” 2016).

• Evolutionary method: Based on the theory of natural selection, it may converge to a solution either because the solution is the global optimum or because the population has lost its diversity (“Excel Solver,” 2016). It is used for non-smooth problems.
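
To illustrate why an evolutionary method suits this kind of objective, the sketch below searches for per-expert weights that maximise routing accuracy, which changes in steps rather than smoothly as the weights vary. It is a generic evolutionary search written for illustration and is not the algorithm implemented by Excel Solver; the sample confidences, population settings and function names are all assumptions.

```python
import random

# Hypothetical test set: raw confidences from (chitchat, short tail, long tail)
# and the index of the expert that should have answered.
SAMPLES = [
    ((0.90, 0.30, 0.20), 0),
    ((0.55, 0.50, 0.10), 1),
    ((0.20, 0.85, 0.30), 1),
    ((0.40, 0.35, 0.30), 2),
    ((0.10, 0.20, 0.25), 2),
    ((0.60, 0.58, 0.05), 1),
]


def accuracy(weights, samples=SAMPLES):
    """Share of samples where the highest weighted confidence picks the right expert."""
    hits = 0
    for confidences, expected in samples:
        weighted = [w * c for w, c in zip(weights, confidences)]
        if weighted.index(max(weighted)) == expected:
            hits += 1
    return hits / len(samples)


def evolve(generations=40, population_size=20, seed=0):
    """Keep the fitter half each generation and refill with mutated copies."""
    rng = random.Random(seed)
    # Seed the population with the unweighted baseline plus random candidates.
    population = [[1.0, 1.0, 1.0]] + [
        [rng.uniform(0.0, 2.0) for _ in range(3)]
        for _ in range(population_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=accuracy, reverse=True)
        parents = population[: population_size // 2]
        children = [
            [max(0.0, w + rng.gauss(0.0, 0.1)) for w in rng.choice(parents)]
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=accuracy)


baseline_accuracy = accuracy([1.0, 1.0, 1.0])
best_weights = evolve()
best_accuracy = accuracy(best_weights)
```

Because the fitter half of the population is carried over unchanged and the unweighted baseline is part of the starting population, the best accuracy found can never fall below the baseline. As noted for the Evolutionary method above, however, convergence does not guarantee that the weights found are globally optimal.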

3.3. Knowledgebase

A corpus of data had to be created to train and test the conversational agent. Since the conversational agent has three experts, namely chitchat (social), short tail (task-oriented) and long tail (Q&A), a corpus was created for each of them. The data were acquired in the following manner:

• Chitchat: A collection of close to 890 example user utterances and 59 intents was obtained for the chitchat expert system by combining the data from:

    o Watson Assistant: The inbuilt ‘General’ content catalogue, which contains ten unique intents and close to 200 example utterances.

    o Google Dialogflow: The inbuilt ‘smalltalk’ agent was exported, which contains 86 intents and around 1500 example utterances.

• Short tail: A corpus of 16 intents with close to 150 example user utterances was created manually for the short tail expert system of the car configurator use case. It included intents such as ‘#GeneralRequirements’ and ‘#BookTestDrive’, together with their corresponding example user utterances.

• Long tail: 118 car brochures spread across different vehicle types, makes and models were obtained and collated to be used as the corpus for the long tail expert system.

The data for the chitchat and short tail expert systems were ingested into two separate Watson Assistants, and the data for the long tail system was ingested into Watson Discovery, for use in testing and analysis. After ingestion into Discovery, the search was optimised for relevancy using the out-of-the-box (OOTB) relevancy training available in Watson Discovery. The process entailed posing natural language queries to Discovery and marking the results from the service as ‘relevant’ or ‘not relevant’ based on their contents.

For testing the conversational agent, a dataset of 50 example user utterances across five different conversations was created manually, mimicking users who would use the chatbot to configure a car, book a test drive or make similar requests. The utterances were written to be as close to a real conversation as possible, with a sample response for each, and were drawn from different expert systems so that the multi-domain conversational agent could be tested.

3.4. Testing framework & Dataset optimisation

In general, the data for conversational agents, which comprises example user utterances and intents, is created by ‘subject matter experts’ based on ground truth (Freed, 2018). Utterances are created, marked with ‘entities’ and labelled with the expected ‘intents’; the corpus procured for the chitchat workspace from the Watson and Google services was created in the same fashion, by employing APIs to crawl the web. A corpus procured in this way needs to be tested to identify any hidden patterns and weaknesses in it, which can then be remedied (Freed, 2018).

Testing is achieved by submitting the utterances to the classifier and checking whether its output matches the set ‘ground truth’ (Freed, 2018). The data is split into training and testing (blind) datasets using the k-fold cross-validation technique. After the classifier is trained on the training data, the held-out (validation) dataset is used to evaluate it and obtain the required parameters. Precision and recall are the metrics used to evaluate and compare the performance of the classifier.
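As a rough illustration of the split described above, the following Python sketch partitions a labelled corpus into k folds using only the standard library. The function name `k_fold_splits`, the fold count and the toy corpus are assumptions made for illustration; the dissertation does not state which tooling or value of k was used.

```python
import random


def k_fold_splits(items, k, seed=0):
    """Yield (train, test) lists for k-fold cross-validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Fold i takes every k-th item starting at offset i,
    # so fold sizes differ by at most one.
    folds = [shuffled[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test_fold


# Hypothetical labelled utterances: (text, intent).
corpus = [(f"utterance {n}", f"intent_{n % 3}") for n in range(10)]

splits = list(k_fold_splits(corpus, k=5))
for train, test in splits:
    print(len(train), "training /", len(test), "held-out")
```

Each utterance lands in exactly one held-out fold, so every example is used for evaluation once and for training in the remaining rounds.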

The testing was done on the chitchat corpus, and the metrics were obtained at intent level. To improve performance, the intents were sorted by recall, and those with the lowest values were selected to be removed or fixed. The misclassified utterances were either moved to different intents or edited to enable better classification. This process was carried out for all the utterances falling under the intents with the lowest recall values. After a complete overhaul, the data was re-ingested and the testing was carried out again. The entire process was repeated multiple times, improving the overall metric values and boosting the intents that initially had low recall. The final dataset for the chitchat expert system, which had a precision of 90.32% and was reduced to 50 unique intents and 859 utterances, was then ingested back into the Watson Assistant service.
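The intent-level triage described above can be sketched as follows. This is a minimal illustration under stated assumptions: the intent names, the six example labels and the helper `per_intent_metrics` are hypothetical and are not taken from the dissertation's corpus or tooling.

```python
from collections import Counter


def per_intent_metrics(expected, predicted):
    """Return {intent: (precision, recall)} from parallel label lists."""
    true_pos, pred_count, true_count = Counter(), Counter(), Counter()
    for exp, pred in zip(expected, predicted):
        true_count[exp] += 1
        pred_count[pred] += 1
        if exp == pred:
            true_pos[exp] += 1
    metrics = {}
    for intent in true_count:
        predicted_n = pred_count[intent]
        precision = true_pos[intent] / predicted_n if predicted_n else 0.0
        recall = true_pos[intent] / true_count[intent]
        metrics[intent] = (precision, recall)
    return metrics


# Hypothetical ground truth and classifier output for six utterances.
expected = ["greeting", "greeting", "thanks", "thanks", "goodbye", "goodbye"]
predicted = ["greeting", "thanks", "thanks", "thanks", "greeting", "goodbye"]

metrics = per_intent_metrics(expected, predicted)
# Intents with the lowest recall come first: these are the candidates
# for editing, moving or removing utterances.
worst_first = sorted(metrics, key=lambda intent: metrics[intent][1])
for intent in worst_first:
    precision, recall = metrics[intent]
    print(f"{intent}: precision={precision:.2f} recall={recall:.2f}")
```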

3.5. Scenarios

3.5.1. Overview

The primary aim is to create an adaptable rule-based or machine-learnt algorithm that weights the confidence and evidence returned by the expert systems and determines, for each conversational turn, which system is best placed to respond. To fulfil this aim, different scenarios have to be devised, analysed and compared. The outcome of the comparison indicates the best algorithm to implement in order to handle concurrency and switch between the expert systems in an efficient but logical manner based on user utterances. The scenarios thus formulated are as follows:

1. High-level classifier

2. Weighted high-level classifier

3. Based on pure-confidence values

4. Weighted system confidences

5. Emulating fallback logic

6. Testing for concurrency

The testing data created earlier, which consists of 50 unique user utterances, is input to the mixture of experts’ system, and the confidences returned are used as a baseline to perform the experiments and simulate the scenarios. To compare the scenarios on a quantitative basis, a metric similar to accuracy was devised: the number of correct classifications divided by the total number of classifications performed. Investigating this metric after each simulation gives a clear insight into which expert system should be utilised to respond to a given utterance.
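The metric can be written down in a few lines of Python. The function name and the five example labels below are illustrative assumptions; only the definition (correct selections divided by total selections) comes from the text.

```python
def selection_accuracy(selected, golden):
    """Fraction of turns where the selected system matches the golden system."""
    if len(selected) != len(golden):
        raise ValueError("selected and golden must have the same length")
    if not golden:
        raise ValueError("at least one utterance is required")
    correct = sum(s == g for s, g in zip(selected, golden))
    return correct / len(golden)


# Hypothetical selections for five conversational turns.
selected = ["short_tail", "chitchat", "long_tail", "chitchat", "short_tail"]
golden = ["short_tail", "chitchat", "long_tail", "short_tail", "short_tail"]

score = selection_accuracy(selected, golden)
print(score)  # 4 of the 5 turns match
```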

3.5.2. Experiment 1: High-level classifier

The purpose of this experiment is to set up a high-level classifier which acts as the ‘selector’ in the mixture of experts’ system and is trained on a sample of the example utterances spread across the short-tail, chitchat and long-tail expert systems. The data corpus must be sampled to obtain an equal number of utterances from each class, fixed at 50 for the use case, to avoid any possible bias in that regard. The ‘Pandas’ module in Python and its inbuilt ‘group by’ method are used for sampling the utterances of the three expert systems. The separate sampled datasets are concatenated to obtain a single corpus of 150 utterances labelled with the three classes of short-tail, chitchat and long-tail.
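The balanced sampling step can be sketched as below. Note that this is a standard-library stand-in, not the Pandas ‘group by’ code used in the dissertation; the corpus sizes, the seed and the function name `balanced_sample` are assumptions chosen for illustration.

```python
import random
from collections import Counter, defaultdict


def balanced_sample(rows, per_class, seed=0):
    """Draw the same number of (text, label) rows from every label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in rows:
        by_label[label].append((text, label))
    sample = []
    for label, members in by_label.items():
        if len(members) < per_class:
            raise ValueError(f"not enough rows for label {label!r}")
        sample.extend(rng.sample(members, per_class))
    rng.shuffle(sample)  # avoid leaving the classes in contiguous blocks
    return sample


# Hypothetical corpus with unequal class sizes.
rows = (
    [(f"chitchat utterance {i}", "chitchat") for i in range(60)]
    + [(f"short-tail utterance {i}", "short_tail") for i in range(80)]
    + [(f"long-tail utterance {i}", "long_tail") for i in range(100)]
)

sample = balanced_sample(rows, per_class=50)
print(Counter(label for _, label in sample))
```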

The dataset is further split on an 80:20 ratio to be used for training and testing purposes

of the classifiers built. Different algorithms are used as the base for the classifier to

allow for the selection of the best possible classifier and are compared based on the

testing accuracy values, the algorithms being:

• Naïve Bayes: An algorithm based on Bayes’ theorem which assumes independence between the predictors used in the classification.

• Linear Classifier – Logistic Regression: It uses the logistic function at its core

to determine the relationship between the dependent variable and several inde-

pendent variables.

• Support Vector Machine (SVM): SVM is a supervised algorithm which at-

tempts to extract the best possible hyperplanes for classifying the data.

• Bagging Method - Random Forest (RF): The random forest method constructs numerous decision trees during training, and the output class is the mode of the classes predicted by the individual trees (the mean is used instead for regression). It reduces the overfitting found in decision tree classification.

• Boosting Method – eXtreme Gradient Boosting Model (XGBoost): A super-

vised machine learning algorithm which uses an ensemble of other weaker mod-

els/ algorithms to reduce bias and variance.

Along with these algorithms, Google BERT (Bidirectional Encoder Representations from Transformers), a language representation model pre-trained in an unsupervised manner, was also used to build a classifier (Devlin et al., 2018). BERT uses bidirectional encoding and is distributed as several pre-trained models released by Google, which can be further fine-tuned to suit the application or requirement (Devlin et al., 2018). Since it is bidirectional, it takes into account the context of a word from both its left and right sides (Devlin et al., 2018). BERT is built for binary classification out of the box and must be modified to work with our use case of three classes (Devlin et al., 2018). Also, the training, validation and testing data must be formatted to suit the input requirements of BERT, which is done using a combination of Python coding and Excel.

The classifier was built with each of the machine learning algorithms and tools listed above on the three expert system corpora and then tested. The resulting accuracy, which is the fraction of correctly classified samples, is as follows:

Algorithm                 Accuracy
Naïve Bayes               0.77
Linear Classification     0.74
Support Vector Machine    0.77
Bagging Model             0.67
Boosting Model            0.69
Google BERT               0.87

Table 2: High-level classifiers comparison

Google BERT gave the best accuracy values for the test data and was selected as the

best algorithm to use as the high-level classifier and to build the simulation for the

scenario. The simulation is carried out in the following manner:

• The test corpus created, which consists of 50 unique utterances across five con-

versations, is inputted to the BERT based classifier, and the output is obtained.

The output is a confidence score for every utterance for each expert system.

Utterance      Short tail Confidence   Chitchat Confidence   Long tail Confidence
Utterance 1    0.27420458              0.530363              0.19543229
Utterance 2    0.13647898              0.6893639             0.17415714
Utterance 3    0.25295562              0.4700206             0.27702382
Utterance 4    0.3946402               0.36452472            0.24083503
Utterance 5    0.3252571               0.38989067            0.2848522
Utterance 6    0.31253842              0.24711472            0.44034687
.              .                       .                     .
.              .                       .                     .

Table 3: Sample high-level classifier output

• The expert system which returns the highest confidence is selected as the system

which is best placed to respond to the utterance at that conversational turn.

• The output of the scenario is obtained per utterance by verifying whether the expert system selected by the classifier is the same as the ‘golden system’, i.e. whether the classification was performed as expected. The ‘golden system’ is part of the testing data and has been labelled by the subject matter expert based on the logical response expected.

• The recall metric for the scenario is obtained for comparison purposes during

the analysis stage.
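The selection and verification steps above can be sketched in Python using the six rows shown in Table 3. The confidence values are copied from that table; the ‘golden system’ labels are hypothetical, since the dissertation does not list them for these utterances.

```python
SYSTEMS = ("short_tail", "chitchat", "long_tail")

# Confidence scores from Table 3: (short tail, chitchat, long tail).
confidences = [
    (0.27420458, 0.530363, 0.19543229),
    (0.13647898, 0.6893639, 0.17415714),
    (0.25295562, 0.4700206, 0.27702382),
    (0.3946402, 0.36452472, 0.24083503),
    (0.3252571, 0.38989067, 0.2848522),
    (0.31253842, 0.24711472, 0.44034687),
]

# Hypothetical golden systems, for illustration only.
golden = ["chitchat", "chitchat", "long_tail", "short_tail", "chitchat", "long_tail"]


def select_system(scores):
    """Return the name of the system with the highest confidence."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return SYSTEMS[best_index]


selected = [select_system(row) for row in confidences]
matches = [s == g for s, g in zip(selected, golden)]
accuracy = sum(matches) / len(matches)

for number, (system, ok) in enumerate(zip(selected, matches), start=1):
    print(f"Utterance {number}: {system} ({'match' if ok else 'mismatch'})")
```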

3.5.3. Experiment 2: Weighted High-Level Classifier

In this experiment, the confidences obtained from the high-level classifier built on

Google BERT (as in Experiment 1) are weighted (as in Experiment 4) to see if this

brings about a beneficial change in the output of the mixture of experts’ system.

The experiment is carried out in the following manner:

• The confidence scores from the classifier were obtained as in Experiment 1 and

were further weighted (giving a bias to the confidence scores obtained from

each system, as in Experiment 4), and the results were obtained for an equal

weight of 1 across all systems.

• The expert system which returns the highest weighted confidence is selected as

the system which is best placed to respond to the utterance at that conversational

turn.

• The output of the scenario is obtained per utterance by verifying whether the expert system selected after weighting is the same as the ‘golden system’, which is part of the testing data and has been labelled by the subject matter expert based on the logical response expected.

• The metric for the scenario (the objective) is obtained and then optimised using Solver, which maximises it by varying the weights assigned to the expert systems (the decision variables).

• The optimisation is performed using both GRG Non-Linear and Evolutionary methods to allow for comparison.

• The weights obtained after optimisation are further normalised to make them easier to interpret. Normalisation is done by fixing the weight of one system at 1; short tail is selected here because the business problem is to focus on the short tail and bring in the other expert systems without changing its confidence. The normalised weights for the other expert systems are then obtained by dividing their current values by the pre-normalised short-tail weight.
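A minimal sketch of the normalisation step, assuming it is the optimised weights that are rescaled. The weight and confidence values are hypothetical; the point of the example is that dividing every weight by the same positive number leaves the selected system unchanged.

```python
def normalise_weights(weights, reference="short_tail"):
    """Rescale the weights so that the reference system's weight equals 1."""
    ref = weights[reference]
    if ref <= 0:
        raise ValueError("the reference weight must be positive")
    return {system: weight / ref for system, weight in weights.items()}


def select_system(confidences, weights):
    """Return the system with the highest weighted confidence."""
    return max(confidences, key=lambda s: confidences[s] * weights[s])


# Hypothetical optimised weights and one hypothetical set of confidences.
raw_weights = {"short_tail": 2.0, "chitchat": 1.5, "long_tail": 3.0}
confidences = {"short_tail": 0.40, "chitchat": 0.45, "long_tail": 0.15}

normalised = normalise_weights(raw_weights)
print(normalised)
print(select_system(confidences, raw_weights), select_system(confidences, normalised))
```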

3.5.4. Experiment 3: Based on unmodified confidence scores

This experiment aims to simulate a scenario wherein the system used to respond is selected using only the unmodified confidence scores returned by the three expert systems: chitchat and short tail, hosted in Watson Assistant, and long tail, hosted in Watson Discovery. It also aims to decide whether this approach is suitable for enabling the algorithm to adapt to changes during the conversation, based on parameters related to the conversational state or features of the user utterance, so as to best judge which system must respond.

The experiment is carried out in the following manner:

• The mixture of experts’ system is tested using the testing data comprising 50 example user utterances, by posing these utterances to all three expert systems individually.

• The response of the systems in the form of confidence scores is retrieved and is

stored across the utterances.

• The expert system which returns the highest confidence is selected as the system

which is best placed to respond to the utterance at that conversational turn.

• The next stage of the simulation is carried out in Excel. A check is performed per utterance to verify whether the ‘golden system’ matches the system selected on the basis of the confidence scores, i.e. whether the classification was performed as expected; the result of this check is the output of the scenario for that utterance.

• The metric for the scenario is obtained for comparison purposes during the anal-

ysis stage.

3.5.5. Experiment 4: Weighted confidence scores

In this experiment, the confidences obtained from the expert systems are weighted in

an attempt to see if this brings about a beneficial change in the output of the mixture of

experts’ system.

The experiment is carried out in the following manner:

• The confidence scores are obtained from the three expert systems as in Experi-

ment 3.

• The scores are then weighted in order to better classify the input utterances. The

weights are assigned an equal value of 1 to start with.

• A check is performed per utterance to verify whether the ‘golden system’ matches the system selected on the basis of the weighted confidences, i.e. whether the classification was performed as expected; the result of this check is the output of the scenario for that utterance.

• The metric for the scenario (the objective) is obtained and then optimised using Solver, which maximises it by varying the weights assigned to the expert systems (the decision variables).

• The optimisation is performed using both GRG Non-Linear and Evolutionary methods to allow for comparison.

• The weights obtained after optimisation are further normalised to make them easier to interpret. Normalisation is done by fixing the weight of one system at 1; short tail is selected here because the business problem is to focus on the short tail and bring in the other expert systems without changing its confidence. The normalised weights for the other expert systems are then obtained by dividing their current values by the pre-normalised short tail weight.
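The following sketch illustrates why the choice of optimisation method can matter for this objective. The accuracy metric only changes when a weight crosses a point at which some utterance switches system, so it is a step function of the weights: a small nudge to the starting weights leaves it unchanged, which gives a gradient-based method nothing to follow. The data are hypothetical, and the coarse grid search is only a simple derivative-free stand-in for illustration; it is not Solver's Evolutionary algorithm.

```python
from itertools import product

SYSTEMS = ("short_tail", "chitchat", "long_tail")

# Hypothetical confidences (short tail, chitchat, long tail) and golden systems.
ROWS = [
    ((0.60, 0.30, 0.10), "short_tail"),
    ((0.20, 0.70, 0.10), "chitchat"),
    ((0.50, 0.40, 0.10), "chitchat"),
    ((0.30, 0.25, 0.45), "long_tail"),
    ((0.55, 0.35, 0.10), "short_tail"),
]


def accuracy(weights):
    """Fraction of rows whose highest weighted confidence is the golden system."""
    correct = 0
    for scores, golden in ROWS:
        weighted = [score * weight for score, weight in zip(scores, weights)]
        if SYSTEMS[weighted.index(max(weighted))] == golden:
            correct += 1
    return correct / len(ROWS)


baseline = accuracy((1.0, 1.0, 1.0))
nudged = accuracy((1.0, 1.01, 1.0))  # a small step changes nothing

# Derivative-free search: short tail weight fixed at 1, others on a coarse grid.
grid = [0.25 * i for i in range(1, 13)]  # 0.25, 0.50, ..., 3.00
candidates = [(1.0, w_chitchat, w_long) for w_chitchat, w_long in product(grid, grid)]
best_weights = max(candidates, key=accuracy)
best = accuracy(best_weights)

print(baseline, nudged, best_weights, best)
```

Fixing the short tail weight at 1 during the search mirrors the normalisation described above and removes one redundant degree of freedom.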

3.5.6. Experiment 5: Emulating fallback logic

This experiment attempts to mimic the fallback logic employed by Watson in the out-of-the-box cloud service. The logic can be defined as follows, with three significant variations:

• Version 1:

o if ‘short tail confidence’ > ‘threshold confidence’:

▪ the short tail system must respond

o else if ‘chitchat confidence’ > ‘threshold confidence’:

▪ the chitchat system must respond

o else:

▪ the long tail system must respond

• Version 2:

o if ‘short tail confidence’ > ‘threshold confidence’:

▪ the short tail system must respond

o else if ‘long tail confidence’ > ‘threshold confidence’:

▪ the long tail system must respond

o else:

▪ the chitchat system must respond

• Version 3 (2 expert systems):

o if ‘combined confidence’ > ‘threshold confidence’:

▪ the combined system must respond

o else if ‘long tail confidence’ > ‘threshold confidence’:

▪ the long tail system must respond
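The three variations can be expressed as small Python functions. This is a sketch of the rules exactly as listed; the function names, dictionary keys and example confidences are assumptions. Version 3 as stated has no final ‘else’ branch, so the sketch returns `None` in that case rather than guessing a behaviour.

```python
DEFAULT_THRESHOLD = 0.2  # the Watson default quoted in the text


def fallback_v1(conf, threshold=DEFAULT_THRESHOLD):
    """Version 1: short tail, then chitchat, otherwise long tail."""
    if conf["short_tail"] > threshold:
        return "short_tail"
    if conf["chitchat"] > threshold:
        return "chitchat"
    return "long_tail"


def fallback_v2(conf, threshold=DEFAULT_THRESHOLD):
    """Version 2: short tail, then long tail, otherwise chitchat."""
    if conf["short_tail"] > threshold:
        return "short_tail"
    if conf["long_tail"] > threshold:
        return "long_tail"
    return "chitchat"


def fallback_v3(conf, threshold=DEFAULT_THRESHOLD):
    """Version 3: combined (short tail + chitchat) system, then long tail."""
    if conf["combined"] > threshold:
        return "combined"
    if conf["long_tail"] > threshold:
        return "long_tail"
    return None  # this case is not specified in the rules above


# Hypothetical confidences for a single utterance.
conf = {"short_tail": 0.15, "chitchat": 0.50, "long_tail": 0.60}
print(fallback_v1(conf), fallback_v2(conf))
print(fallback_v3({"combined": 0.50, "long_tail": 0.60}))
```

The example shows how the ordering alone changes the outcome: with identical confidences, Version 1 answers from chitchat while Version 2 answers from long tail.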

The experiment for versions 1 & 2 is conducted in the following way:

• After ingestion and training, the mixture of experts’ system is tested using the

test corpus consisting of 50 example user utterances created earlier.

• The response of the systems in the form of confidence scores is retrieved and

stored across the utterances.

• The logic is then simulated using Excel, and the outputs are obtained for both

the variations with the threshold set at 0.2, which is the default used by Watson.

• After obtaining the outputs, the metric value is calculated. The metric (the objective) is then optimised using Solver, which maximises it by varying the threshold value (the decision variable).
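The threshold optimisation can be illustrated with a simple sweep. This is a sketch under stated assumptions: the four rows of confidences and golden systems are hypothetical, only Version 1 of the logic is shown, and an exhaustive sweep over a 0.01 grid stands in for Excel Solver.

```python
def fallback_v1(conf, threshold):
    """Version 1 ordering: short tail, then chitchat, otherwise long tail."""
    if conf["short_tail"] > threshold:
        return "short_tail"
    if conf["chitchat"] > threshold:
        return "chitchat"
    return "long_tail"


# Hypothetical confidences and golden systems for four utterances.
ROWS = [
    ({"short_tail": 0.90, "chitchat": 0.10, "long_tail": 0.30}, "short_tail"),
    ({"short_tail": 0.30, "chitchat": 0.80, "long_tail": 0.10}, "chitchat"),
    ({"short_tail": 0.25, "chitchat": 0.10, "long_tail": 0.70}, "long_tail"),
    ({"short_tail": 0.60, "chitchat": 0.50, "long_tail": 0.20}, "short_tail"),
]


def accuracy_at(threshold):
    """Metric value obtained when the given threshold is used for every row."""
    hits = sum(fallback_v1(conf, threshold) == golden for conf, golden in ROWS)
    return hits / len(ROWS)


thresholds = [i / 100 for i in range(101)]  # 0.00, 0.01, ..., 1.00
best_threshold = max(thresholds, key=accuracy_at)  # first maximiser on the grid

print("default 0.2:", accuracy_at(0.2))
print("best:", best_threshold, accuracy_at(best_threshold))
```

With a single decision variable, a sweep of this kind is cheap and cannot get stuck at the starting value, which makes it a useful cross-check on any optimiser's result.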

The experiment for version 3 is conducted in the following two ways:

1. On the cloud service by manipulating the training data for the created assistants:

• The training data consisting of utterances is manipulated so that the utter-

ances and intents falling under short tail and chitchat expert systems are

combined into a single system and long tail is kept as a separate system. The

data is re-ingested into the Watson Assistant service for testing purposes.

• After ingestion, the mixture of experts’ system is tested using the same testing data comprising 50 example user utterances, by posing these utterances to both expert systems individually.

• The response of the systems in the form of confidence scores is retrieved

and is stored across the utterances.

• The logic is then simulated with the threshold value set at 0.2 (Watson de-

fault value) using Excel and the output for the variation is obtained.

• After obtaining the outputs, the metric value is calculated. The metric value (the objective) is then optimised using Solver, which maximises it by varying the threshold value (the decision variable).

2. Building a high-level classifier with the training data mimicking the OOTB

Watson (Conversation AI toolkit) logic:

• The training data consisting of utterances is manipulated so that the utter-

ances and intents falling under short tail and chitchat expert systems are

combined into a single system and long tail is kept as a separate system. The

data is used to build a binary classifier using Google BERT as done in Ex-

periment 1.

• The test corpus created, which consists of 50 unique utterances across five

conversations, is inputted to the BERT based classifier, and the output is

obtained. The output is a confidence score for every utterance concerning

either class of data.

• The logic is then simulated in Excel with the threshold value set at 0.2 (the Watson default value), and the output is obtained for the variation.

• After obtaining the outputs, the metric value is calculated. The metric value (the objective) is then optimised using Solver, which maximises it by varying the threshold value (the decision variable).

3.5.7. Experiment 6: Effect of concurrency

This experiment investigates how the systems deal with concurrency, which was identified as a vital area of the problem in the literature review in section 2.3.3.2. The aim is to gauge the effect of concurrency in the user utterances within a conversation on the confidence values and output of the mixture of experts’ system. In order to do so, an updated test corpus has to be created which mimics the effect of concurrency.

The experiment is carried out in the following manner:

• An updated test corpus of 70 unique utterances across four conversations is cre-

ated. It is done in a manner which incorporates the occurrence of utterances

which fall under the domain of the same expert system (effect of concurrency)

within conversations.

• This test data is then input to the mixture of experts’ system. The response, which comprises confidence scores and intents, is retrieved and stored across the utterances.

• A new parameter is introduced to vary the effect of concurrency on the output.

This parameter boosts the confidence returned from a particular expert system

if it is concurrent to the preceding utterance.

• The expert system which returns the highest confidence at the end of the boost-

ing is selected as the system which is best placed to respond to the utterance at

that conversational turn.

• A check is performed per utterance to verify whether the ‘golden system’ matches the system selected on the basis of the boosted confidences, i.e. whether the classification was performed as expected; the result of this check is the output of the scenario for that utterance.

• The metric for the scenario (the objective) is obtained and then optimised using Solver, which maximises it by varying the boosting values (the decision variables).

• The optimisation is performed using both GRG Non-Linear and Evolutionary

methods to allow for comparison.
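The boosting step can be sketched as follows. The sketch assumes that ‘concurrent to the preceding utterance’ means the system that was selected on the previous turn, and that a single additive boost is shared by all systems; both are interpretations made for illustration, as are the confidence values.

```python
def select_with_boost(turns, boost):
    """Select a system per turn, boosting the system chosen on the previous turn."""
    selections = []
    previous = None
    for confidences in turns:
        adjusted = dict(confidences)
        if previous is not None:
            adjusted[previous] += boost  # reward staying with the same system
        choice = max(adjusted, key=adjusted.get)
        selections.append(choice)
        previous = choice
    return selections


# Hypothetical confidences for three consecutive turns of one conversation.
turns = [
    {"short_tail": 0.70, "chitchat": 0.20, "long_tail": 0.10},
    {"short_tail": 0.40, "chitchat": 0.45, "long_tail": 0.15},
    {"short_tail": 0.20, "chitchat": 0.70, "long_tail": 0.10},
]

without_boost = select_with_boost(turns, boost=0.0)
with_boost = select_with_boost(turns, boost=0.1)
print(without_boost)
print(with_boost)
```

In the second turn the boost keeps the conversation with the short tail system despite a marginally higher chitchat confidence, while the clear chitchat utterance in the third turn still switches system.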

4. Results

Throughout the experiments, several rule-based systems were examined, together with parameters that could make these rules adaptable to the requirements and the conversational state, in order to gauge the system best placed to respond to the user utterance at a given conversational turn. These were simulated as variations within and across scenarios, as described in section 3.5.

The results obtained from the experiments conducted can be analysed as follows:

1. Use of unmodified confidence scores: In the experiments where the confidence scores

were used to create a rule or logic for selection of an expert system within the mixture

of experts’ conversational agent, the test corpus was used to retrieve confidence scores

from the system and then used for further simulations. The confidences thus obtained

were used directly and also in a weighted manner to observe the changes being brought

about as can be seen below in Table 4.

Scenario                                 Number of experts   Metric   Comments
Based on unmodified confidence scores    3                   0.82
Weighted confidence scores               3                   0.82     Initial weights of 1 each
Weighted confidence scores               3                   0.82     GRG optimised weights
Weighted confidence scores               3                   0.84     Evolutionary optimised weights

Table 4: Results from simulations based on the use of unmodified confidence scores

2. Use of a high-level classifier: In these scenarios, a high-level classifier was modelled

to mimic the working of the expert system in the Watson cloud service and use the

output confidences derived by testing using the test corpus as a base to formulate further

rules. The confidences were both used as an unmodified value and also as a weighted

component, as seen in Table 5.

Scenario                                 Number of experts   Metric   Notes
High-level classifier (BERT)             3                   0.72
Weighted high-level classifier (BERT)    3                   0.72     Initial weights of 1 each
Weighted high-level classifier (BERT)    3                   0.72     GRG optimised weights
Weighted high-level classifier (BERT)    3                   0.90     Evolutionary optimised weights

Table 5: Results from simulations based on the use of a high-level classifier

3. Emulating Watson fallback logic: The inbuilt logic and rules employed by the Watson Assistant service when acting as a multiple-domain system were simulated. The simulations were carried out in two ways:

a. Method 1: By using the confidence scores from Watson, first with a combined chitchat (CC) and short tail (ST) corpus and long tail (LT) kept apart, and then by varying the logic while keeping all three separate.

b. Method 2: By using a high-level classifier, also built on the combined data, to investigate its performance on the test data. The confidence scores were used both as unmodified values and as weighted components.

The outputs as follows in Table 6 can be analysed to gauge if any of the simulations

would be a good fit for creating the rules for the mixture of experts’ system.

Scenario                          Number of experts   Metric   Notes
Emulating fallback logic          2                   0.86     Threshold set at 0.2 (default)
Emulating fallback logic          2                   0.98     Optimised threshold of 0.55
Emulating fallback logic          3                   0.50     Version 1 with the threshold set at 0.2 (default)
Emulating fallback logic          3                   0.86     Version 1 with an optimised threshold of 0.69
Emulating fallback logic          3                   0.50     Version 2 with the threshold set at 0.2 (default)
Emulating fallback logic          3                   0.70     Version 2 with an optimised threshold of 0.67
High-level classifier             2                   0.96
Weighted high-level classifier    2                   0.96     Initial weights of 1 each
Weighted high-level classifier    2                   0.96     GRG optimised weights
Weighted high-level classifier    2                   0.96     Evolutionary optimised weights

Table 6: Results from simulations mimicking the Watson default logic

4. Testing for effect of concurrency:

A simulation was conducted to test the effect of concurrency being used as a weight in the

selection of an expert system to respond to the utterance at a particular conversational turn.

The presence of concurrency was modelled using a ‘boosting value’ which was used to

boost the value of the corresponding system’s confidence score. The testing for this sce-

nario was conducted using the extended and modified test corpus, which had situations for

concurrency incorporated into it; the output of which can be seen in Table 7.

Scenario                 Number of experts   Metric   Comments
Effect of concurrency    3                   0.74     Initial boosting value of 0.1
Effect of concurrency    3                   0.74     GRG optimised boost values
Effect of concurrency    3                   0.77     Evolutionary optimised boost values

Table 7: Results from simulation incorporating the effect of concurrency

The values of the accuracy metric were found to be improved upon by using Excel Solver for

optimisation as can be seen in Table 8 for a 2-system architecture and Table 9 for a 3-system

architecture.

Type                                            Original metric   Improved metric   Change (%)
2 system with weights - Weighted classifier     0.94              0.94              0.00
2 system with confidence - OOTB Watson Rules    0.86              0.98              13.95

Table 8: Effect of optimisation on the metric for 2 expert systems

Type                                                      Original metric   Improved metric   Change (%)
3 system with weights - Weighted high-level classifier    0.72              0.90              25.00
3 system with weights - Weighted confidences              0.82              0.84              2.44
3 system with confidence - OOTB Rules – Version 2         0.50              0.70              40.00
3 system with confidence - OOTB Rules – Version 1         0.50              0.86              72.00
Testing for concurrency                                   0.74              0.77              4.05

Table 9: Effect of optimisation on the metric for 3 expert systems
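The ‘Change’ figures in Tables 8 and 9 are consistent with a relative percentage change, (improved - original) / original × 100. The short Python check below recomputes them from the original and improved metrics in the two tables; the dictionary keys are abbreviated labels introduced here for readability.

```python
def pct_change(original, improved):
    """Relative change of the metric, in percent."""
    return (improved - original) / original * 100


# (original metric, improved metric) pairs from Tables 8 and 9.
two_system = {
    "weighted classifier": (0.94, 0.94),
    "OOTB Watson rules": (0.86, 0.98),
}
three_system = {
    "weighted high-level classifier": (0.72, 0.90),
    "weighted confidences": (0.82, 0.84),
    "OOTB rules, version 2": (0.50, 0.70),
    "OOTB rules, version 1": (0.50, 0.86),
    "testing for concurrency": (0.74, 0.77),
}


def mean_change(rows):
    """Average relative change across the rows of one table."""
    changes = [pct_change(original, improved) for original, improved in rows.values()]
    return sum(changes) / len(changes)


for name, (original, improved) in {**two_system, **three_system}.items():
    print(f"{name}: {pct_change(original, improved):.2f}%")
print(f"mean, 2 systems: {mean_change(two_system):.1f}%")
print(f"mean, 3 systems: {mean_change(three_system):.1f}%")
```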

5. Discussion & Conclusion

5.1. Qualitative evaluation

A better understanding of the work carried out can be gained by looking at the strengths

and weaknesses of the project on a qualitative basis. This should further help in identifying

opportunities for future work and any associated possibilities in the domain.

Strengths:

1. Uniqueness: Minimal work has been carried out on employing expert systems, and maintaining a balance between them, in the area of conversational agents, which makes this work unique in that respect. The research will also help the business make an informed decision on the technologies and rules to use when building a multi-domain conversational agent.

2. Achieving research goals: The literature review showed that handling concurrency and handling user inputs robustly are drawbacks of present systems. Concurrency was therefore investigated as a facet of the problem, by examining how the expert systems respond in its presence and absence.

3. Test corpus creation: Two datasets were made from scratch for testing purposes in the ‘Car Configurator’ use case, to be put to use in the experiments. The dataset used for extensive testing consisted of 50 user utterances across five conversations, and the dataset used for testing concurrency comprised 70 utterances across four conversations. Both can be reused in future work with minimal modifications based on requirements.

Weaknesses:

1. Cognitive Bias in testing data: Since both the training (short tail expert system) and

testing datasets were created by me during the process of carrying out the experi-

ments, there is a high likelihood of cognitive bias being introduced into the same.

Cognitive bias is something which is faced by any entity trying to create a corpus

of data for use in a data science problem. This may have led to the introduction of

noise in data and the presence of unwanted filters which can sometimes cause cru-

cial aspects in the domain to be missed out. The presence of such cognitive bias can

lead to misclassification by the expert systems.

2. Small data for training short tail: The corpus used for training the short tail expert system, which consisted of 16 intents and approximately 150 example utterances, is significantly smaller than the corpora used for training the chitchat and long-tail expert systems. However, this is precisely the problem businesses face when trying to incorporate a short tail domain into their conversational agent, and it is the problem I have aimed to address with this research. They have massive datasets available for long-tail and chitchat but want to have a short tail facet ready with minimal effort and a much smaller corpus.

3. Lack of ‘real’ user data: Another drawback which can be taken into account is the

lack of real user testing data which can be put to use for testing the performance of

the created mixture of experts’ system.

5.2. Possibility for future work

1. Two different datasets were used for testing the scenarios. The initial set had 50

utterances across five conversations, and the one updated to account for concur-

rency had 70 utterances across four conversations. If there were more time avail-

able, the possible first step to be taken would be to conduct all experiments and

test all scenarios on the updated dataset as it accounts for concurrency and may

lead to results closer to the ‘ground reality’ in user conversations.

2. Run a beta test on real user data by opening up the platform to a small section of users and thereby obtaining real user conversations upon which further testing,

and performance metrics can be obtained. This could lead to a more thorough in-

vestigation of the problem at hand and its possible solutions.

3. Currently, the mixture of expert systems has been built using only two and three experts based on the requirements. Another aspect to investigate in the future

would be to identify which scenario would work best and how the accuracy will

change with the introduction of n-systems, for example when a fourth one is in-

troduced and to perhaps model a relationship between the same.

5.3. Recommendations & Findings

• A comparative analysis was performed between traditional machine learning algorithms used predominantly for classification tasks and Google BERT, a state-of-the-art model devised for several natural language processing tasks. The results were highly favourable for BERT in this particular case, which had a small training data corpus.

• Google BERT was also put to the test against a traditional ‘Conversational Ar-

tificial Intelligence (AI)’ configuring toolkit such as Watson Assistant using the

default logic utilised while creating a multi-domain conversational system. This

was done as BERT was found to be optimised for small datasets. The classifier

was found to have a slightly lower performance in comparison to the toolkit.

• The scenarios for rule building while creating a multi-domain mixture of ex-

perts’ system in the conversational agent’s space can be divided into two based

on the data available and the manipulations possible on the same as follows:

o If the training data for chitchat and short tail can be merged into a single

corpus while keeping long-tail separate, the best scenario or rule to adopt

would be the Watson default logic as done in Version 3 of the emulating

fallback logic. This would entail building a mixture of experts’ system

with two domain-specific experts.

o Building a mixture of experts’ system using three separate domain-specific experts (short tail, chitchat and long tail) was found to be the best possible scenario for cases wherein data merging is not possible. Among the three-expert scenarios, a high-level classifier with normalised and weighted confidences gave the best results.

• Another important finding was the improvement in the result metrics when parameters such as the weights for the system confidences or the cut-off confidences were optimised using Solver. The average relative improvement was around 7% for the 2 expert system scenarios and 29% for the 3 expert system scenarios. Thus, it can be postulated that the need for optimisation possibly increases with an increase in the number and variety of expert systems and their weighting.

References Colby KM, 1975. Artificial Paranoia - 1st Edition. Pergamon Press INC. Maxwell House,

New York, NY, England. Davis, B., 2018. Vodafone’s chatbot is delivering double the conversion rate of its website –

Econsultancy [WWW Document]. URL https://econsultancy.com/vodafones-chatbot- is-delivering-twice-the-conversion-rate-of-its-website/ (accessed 8.31.19).

Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2018. BERT: Pre-training of Deep Bidi- rectional Transformers for Language Understanding. ArXiv181004805 Cs.

Ebrahimpour, 2009. Recognition of Persian handwritten digits using Characterization Loci and Mixture of Experts. Int. J. Digit. Content Technol. Its Appl. 3. https://doi.org/10.4156/jdcta.vol3.issue3.5

Estabrooks, A., Japkowicz, N., 2001. A Mixture-of-experts Framework for Text Classifica- tion, in: Proceedings of the 2001 Workshop on Computational Natural Language Learning - Volume 7, ConLL ’01. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 9:1–9:8. https://doi.org/10.3115/1117822.1117828

Excel Solver: Which Solving Method Should I Choose?, 2016. . EngineerExcel. URL https://www.engineerexcel.com/excel-solver-solving-method-choose/ (accessed 8.16.19).

Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A.A., Lally, A., Murdock, J.W., Nyberg, E., Prager, J., Schlaefer, N., Welty, C., 2010. Building Wat- son: An Overview of the DeepQA Project. AI Mag. 31, 59–79. https://doi.org/10.1609/aimag.v31i3.2303

Forsyth, R., 1984. Expert systems : principles and case studies. London ; New York : Chap- man and Hall ; New York, NY : Methuen.

Forsyth, R., n.d. The architecture of expert systems 7. Freed, A.R., 2018. Testing Strategies for Chatbots (Part 1)— Testing Their Classifiers

[WWW Document]. Medium. URL https://medium.com/ibm-watson/testing-strate- gies-for-chatbots-part-1-testing-their-classifiers-20becaf5f211 (accessed 8.14.19).

Hampshire, J.B., Waibel, A., 1992. The Meta-Pi network: building distributed knowledge representations for robust multisource pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 14, 751–769. https://doi.org/10.1109/34.142911

Harb, H., Chen, L., Auloge, J.-, 2004. Mixture of experts for audio classification: an applica- tion to male female classification and musical genre recognition, in: 2004 IEEE Inter- national Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763). Presented at the 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763), pp. 1351-1354 Vol.2. https://doi.org/10.1109/ICME.2004.1394479

Hartikainen, M., Turunen, M., Hakulinen, J., Salonen, E.-P., Adam Funk, J., 2004. Flexible dialogue management using distributed and dynamic dialogue control.

Ignizio, J.P., 1991. Introduction to expert systems: the development and implementation of rule-based expert systems. McGraw-Hill, New York.

Ignizio, J.P., 1990. A brief introduction to expert systems. Comput. Oper. Res. 17, 523–533. https://doi.org/10.1016/0305-0548(90)90058-F

Io, H.N., Lee, C.B., 2017. Chatbots and conversational agents: A bibliometric analysis, in: 2017 IEEE International Conference on Industrial Engineering and Engineering Man- agement (IEEM). Presented at the 2017 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), pp. 215–219. https://doi.org/10.1109/IEEM.2017.8289883

Jacobs, R., Jordan, M., J. Nowlan, S., E. Hinton, G., 1991. Adaptive Mixture of Local Expert. Neural Comput. 3, 78–88. https://doi.org/10.1162/neco.1991.3.1.79

Jacobs, R.A., Jordan, M.I., Barto, A.G., 1991. Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cogn. Sci. 15, 219–250. https://doi.org/10.1016/0364-0213(91)80006-Q

Koehler, A., 2017. Meet TOBi the chatbot: The latest addition to our customer service team [WWW Document]. Vodafone Soc. Off. Vodafone UK Blog. URL https://blog.voda- fone.co.uk/2017/04/12/meet-tobi-chatbot-latest-addition-vodafone-uks-customer-ser- vice-team/ (accessed 8.31.19).

Komatani, K., Kanda, N., Nakano, M., Nakadai, K., Tsujino, H., Ogata, T., Okuno, H.G., 2006. Multi-domain Spoken Dialogue System with Extensibility and Robustness Against Speech Recognition Errors, in: Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, SigDIAL ’06. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 9–17.

Lin, B.-S., Wang, H., Fen, Q., 1999. Consistent Dialogue Across Concurrent Topics Based On An Expert System Model.

Liu, X., Eshghi, A., Swietojanski, P., Rieser, V., 2019. Benchmarking Natural Language Un- derstanding Services for building Conversational Agents. ArXiv190305566 Cs.

Lu, Z., 2006. A regularized minimum cross-entropy algorithm on mixtures of experts for time series prediction and curve detection. Pattern Recognit. Lett. 27, 947–955. https://doi.org/10.1016/j.patrec.2005.12.002

M. O’Neill, I., Hanna, P., Liu, X., Mctear, M., 2004. Cross domain dialogue modelling: an object-based approach.

Masoudnia, S., Ebrahimpour, R., 2014. Mixture of experts: a literature survey. Artif. Intell. Rev. 42, 275–293. https://doi.org/10.1007/s10462-012-9338-y

Mossavat, S.I., Amft, O., Vries, B. de, Petkov, P.N., Kleijn, W.B., 2010. A bayesian hierar- chical mixture of experts approach to estimate speech quality, in: 2010 Second Inter- national Workshop on Quality of Multimedia Experience (QoMEX). Presented at the 2010 Second International Workshop on Quality of Multimedia Experience (QoMEX), pp. 200–205. https://doi.org/10.1109/QOMEX.2010.5516203

Nakano, M., Funakoshi, K., Hasegawa, Y., Tsujino, H., 2008. A Framework for Building Conversational Agents Based on a Multi-expert Model, in: Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, SIGdial ’08. Association for Compu- tational Linguistics, Stroudsburg, PA, USA, pp. 88–91.

NatWest begins testing AI driven ‘digital human’ in banking first [WWW Document], n.d. URL https://www.rbs.com/rbs/news/2018/02/natwest-begins-testing-ai-driven-digital- human-in-banking-first.html (accessed 8.31.19).

Niranjan, M., Saipreethy, M.S., Kumar, T.G., 2012. An intelligent question answering con- versational agent using Naïve Bayesian classifier, in: 2012 IEEE International Confer- ence on Technology Enhanced Education (ICTEE). Presented at the 2012 IEEE Inter- national Conference on Technology Enhanced Education (ICTEE), pp. 1–5. https://doi.org/10.1109/ICTEE.2012.6208614

Peng, F., Jacobs, R.A., Tanner, M.A., 1996. Bayesian Inference in Mixtures-of-Experts and Hierarchical Mixtures-of-Experts Models with an Application to Speech Recognition. J. Am. Stat. Assoc. 91, 953–960. https://doi.org/10.1080/01621459.1996.10476965

Rumney, E., 2018. British bank RBS hires “digital human” Cora on probation. Reuters. Setiaji, B., Wibowo, F.W., 2016. Chatbot Using a Knowledge in Database: Human-to-Ma-

chine Conversation Modeling, in: 2016 7th International Conference on Intelligent Systems, Modelling and Simulation (ISMS). Presented at the 2016 7th International

Conference on Intelligent Systems, Modelling and Simulation (ISMS), pp. 72–77. https://doi.org/10.1109/ISMS.2016.53

Shieber, S.M., 1994. Lessons from a Restricted Turing Test. Commun ACM 37, 70–78. https://doi.org/10.1145/175208.175217

Shum, H., He, X., Li, D., 2018. From Eliza to XiaoIce: challenges and opportunities with so- cial chatbots. Front. Inf. Technol. Electron. Eng. 19, 10–26. https://doi.org/10.1631/FITEE.1700826

Simmons, R.F., 1970. Natural Language Question-answering Systems: 1969. Commun ACM 13, 15–30. https://doi.org/10.1145/361953.361963

Suzuki, J., Taira, H., Sasaki, Y., Maeda, E., 2003. Question Classification using HDAG Ker- nel, in: Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering. Association for Computational Linguistics, Sapporo, Japan, pp. 61–68. https://doi.org/10.3115/1119312.1119320

Tripathi, K.P., 2011. A Review on Knowledge-based Expert System: Concept and Architec- ture. Artif. Intell. Tech. 5.

Turing, A.M., 1950. I.—COMPUTING MACHINERY AND INTELLIGENCE. Mind LIX, 433–460. https://doi.org/10.1093/mind/LIX.236.433

Tzafestas, S.G., Kokkinaki, A.I., Valavanis, K.P., 1993. An Overview of Expert Systems, in: Tzafestas, S. (Ed.), Expert Systems in Engineering Applications. Springer Berlin Hei- delberg, Berlin, Heidelberg, pp. 3–24. https://doi.org/10.1007/978-3-642-84048-7_1

Versace, M., Bhatt, R., Hinds, O., Shiffer, M., 2004. Predicting the exchange traded fund DIA with a combination of genetic algorithms and neural networks. Expert Syst. Appl. 27, 417–425. https://doi.org/10.1016/j.eswa.2004.05.018

Walker, M., S. Aberdeen, J., Boland, J., Bratt, E., S. Garofolo, J., Hirschman, L., N. Le, A., Lee, S., Narayanan, S., Papineni, K., L. Pellom, B., Polifroni, J., Potamianos, A., Prabhu, P., Rudnicky, A., Sanders, G., Seneff, S., Stallard, D., Whittaker, S., 2001. DARPA communicator dialog travel planning systems: the june 2000 data collection. pp. 1371–1374.

Wallace, R.S., 2009. The Anatomy of A.L.I.C.E., in: Epstein, R., Roberts, G., Beber, G. (Eds.), Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for the Thinking Computer. Springer Netherlands, Dordrecht, pp. 181–210. https://doi.org/10.1007/978-1-4020-6710-5_13

Walter, P., Elsen, I., Muller, H., Kraiss, K.-, 1999. 3D object recognition with a specialized mixtures of experts architecture, in, IJCNN’99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339). Presented at the IJCNN’99. In- ternational Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339), pp. 3563–3568 vol.5. https://doi.org/10.1109/IJCNN.1999.836243

Waltinger, U., Breuing, A., Wachsmuth, I., 2012. Connecting Question Answering and Con- versational Agents. KI - Künstl. Intell. 26, 381–390. https://doi.org/10.1007/s13218- 012-0208-1

Wang, Z., Ahmadvand, A., Choi, J.I., Karisani, P., Agichtein, E., n.d. Emersonbot: Infor- mation-Focused Conversational AI Emory University at the Alexa Prize 2017 Chal- lenge 11.

Weigend, A.S., Mangeas, M., Srivastava, A.N., 1995. Nonlinear gated experts for time series: discovering regimes and avoiding overfitting. Int. J. Neural Syst. 06, 373–399. https://doi.org/10.1142/S0129065795000251

Weizenbaum, J., 1966. ELIZA—a Computer Program for the Study of Natural Language Communication Between Man and Machine. Commun ACM 9, 36–45. https://doi.org/10.1145/365153.365168

Williams, J.D., Young, S., 2007. Partially observable Markov decision processes for spoken dialog systems. Comput. Speech Lang. 21, 393–422. https://doi.org/10.1016/j.csl.2006.06.008

Yazdani, M., 1989. Expert Systems Principles and Case Studies, in: Forsyth, R. (Ed.), . Chap- man & Hall, Ltd., London, UK, UK, pp. 173–183.

Yuksel, S.E., Wilson, J.N., Gader, P.D., 2012. Twenty Years of Mixture of Experts. IEEE Trans. Neural Netw. Learn. Syst. 23, 1177–1193. https://doi.org/10.1109/TNNLS.2012.2200299

Zhang, D., Lee, W.S., 2003. Question Classification Using Support Vector Machines, in: Pro- ceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, SIGIR ’03. ACM, New York, NY, USA, pp. 26–32. https://doi.org/10.1145/860435.860443

Appendices

A. Code for obtaining output from IBM Watson Services

'''
This code reads multiple user utterances from a file and passes each one to all
three expert systems. The confidences obtained from the systems are then stored
as a matrix, one row per user utterance, in the output Excel file.
'''

import ibm_watson
import pandas as pd
from ibm_watson import DiscoveryV1

# Configuration for Watson Assistant (short-tail and chitchat experts)
api_version_assistant = ''
apikey_assistant = ''
assistant_url = ''
shorttail_workspace = ''
chitchat_workspace = ''

assistant = ibm_watson.AssistantV1(
    version=api_version_assistant,
    iam_apikey=apikey_assistant,
    url=assistant_url
)

# Configuration for the long-tail expert (Watson Discovery service)
api_version_discovery = ''
apikey_discovery = ''
discovery_url = ''
environment_id = ''
collection_id = ''

discovery = DiscoveryV1(
    version=api_version_discovery,
    iam_apikey=apikey_discovery,
    url=discovery_url
)

# Read the CSV file with the input test utterances (column 'message')
df = pd.read_csv('inputfilepath')  # give path to file containing test utterances

# Columns in the dataframe that store the confidence values and intents
df['shorttail_confidence'] = 0.0
df['shorttail_intent'] = '0'
df['chitchat_confidence'] = 0.0
df['chitchat_intent'] = '0'
df['longtail_confidence'] = 0.0
df['lt_conf1'] = 0.0
df['lt_conf2'] = 0.0
df['lt_conf3'] = 0.0

# Pass each utterance to the chitchat system
for i in range(len(df)):
    text = df.at[i, 'message']
    response_cc = assistant.message(
        workspace_id=chitchat_workspace,
        input={'text': text}
    ).get_result()
    if not response_cc['intents']:
        df.at[i, 'chitchat_confidence'] = 0.0
        df.at[i, 'chitchat_intent'] = 'Invalid'
    else:
        df.at[i, 'chitchat_confidence'] = response_cc['intents'][0]['confidence']
        df.at[i, 'chitchat_intent'] = response_cc['intents'][0]['intent']

# Pass each utterance to the short-tail system
for i in range(len(df)):
    text = df.at[i, 'message']
    response_st = assistant.message(
        workspace_id=shorttail_workspace,
        input={'text': text}
    ).get_result()
    if not response_st['intents']:
        df.at[i, 'shorttail_confidence'] = 0.0
        df.at[i, 'shorttail_intent'] = 'Invalid'
    else:
        df.at[i, 'shorttail_confidence'] = response_st['intents'][0]['confidence']
        df.at[i, 'shorttail_intent'] = response_st['intents'][0]['intent']

# Pass each utterance to the long-tail system
for i in range(len(df)):
    user_input = df.at[i, 'message']
    input_text = 'text:' + user_input
    query_ex = discovery.query(
        environment_id,
        collection_id,
        query=input_text,
        passages=True,
        count=3,
        highlight=True,
        passages_count=3
    )
    results = query_ex.result['results']
    # Keep the confidences of the top three results from the long-tail system
    for j, result in enumerate(results[:3]):
        df.at[i, 'lt_conf' + str(j + 1)] = result['result_metadata']['confidence']
    # The overall long-tail confidence is that of the best result
    if results:
        df.at[i, 'longtail_confidence'] = results[0]['result_metadata']['confidence']

df.to_excel('outputfilepath', index=False)  # give path to store the output as an Excel file for further manipulation
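The handling of an empty intent list, which the chitchat and short-tail loops in Appendix A both need, could be factored into one small pure function so that it can be checked without calling the Watson service. The sketch below is illustrative only: the helper name `top_intent` and the stubbed responses are hypothetical, and it assumes, as the listing does, that the first entry of `intents` is the best match.

```python
def top_intent(response):
    """Return (confidence, intent) of the first intent in a response.

    Falls back to (0.0, 'Invalid') when no intent was recognised,
    mirroring the default values used in the Appendix A listing.
    """
    intents = response.get("intents") or []
    if not intents:
        return 0.0, "Invalid"
    best = intents[0]
    return best["confidence"], best["intent"]


# Stubbed responses (hypothetical values) standing in for assistant.message(...).get_result()
recognised = {"intents": [{"intent": "greeting", "confidence": 0.93}]}
unrecognised = {"intents": []}

print(top_intent(recognised))
print(top_intent(unrecognised))
```

Each loop would then call the helper with its own response object, which removes the risk of one loop checking the response variable that belongs to the other expert.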

B. Test Data Types

a. Type 1

Conversation 1
Hi
Good Morning
How are you?
I am looking to buy a car
Can you show me red sedans please?
Does it come in black?
Go back one step
Does the car have abs and ebd?
I want to know how many mpg it gives
Let's buy that

Conversation 2
Hello
what's up
can we chat
can you help me configure a car
I am looking for something in the mid 20k pound range
I like green SUV's with a sunroof
Yes please
Does it have driver assist?
How much is the insurance going to cost?
Ok, let's buy that

Conversation 3
Good evening
What's up?
How are you?
I wanna buy a car
I am looking for a sedan with manual transmission
What is the power of that car?
Does it have 6 airbags?
I like it
What is the tax liability
Ok, let's go ahead and send the quote

Conversation 4
Greetings
Good afternoon
Describe yourself
How can you help
I want to configure a car
I am looking for a blue automatic sedan
What's the wheelbase of the car?
does the car have bluetooth in it?
can you book a test drive for me?
I'll be back later

Conversation 5
are you here?
Hey
Good Morning
I want to customise a car
Looking for something in the 40k pound range with automatic transmission
Does it have dsg transmission?
Perfect!
Exactly what I am looking for
Can I get a test drive for that nearby
I am ready to buy. Please send configuration to dealer

b. Type 2

Conversation 1
Hi
What are you
I am looking to buy a car
Can you show me an automatic red sedan please?
I like the Audi a4
Does it come in black?
Go back one step
Does the car have abs and ebd?
I want to know how many mpg it gives
This is exactly what I am looking for
Can I get a quote of what it's costing now
Would it be possible to get a test drive tomorrow?
Excellent
Goodbye for now

Conversation 2
Hello
can we chat
help me configure a car
I am looking for a something in the mid 20k pound range
I want a green hatchback with a sunroof
I like the kia rio
I want to see that with 17-inch alloys
It's beginning to move towards exactly what I am looking for
Does it have driver assist?
What is the power of that car?
Does it have 6 airbags?
How much is the insurance going to cost?
How much is the cost now?
Ok, let's go ahead and send the quote
I'll be back later
Bye

Conversation 3
Greetings
Describe yourself
How can you help
I want to configure a car
I am looking for a diesel SUV
I want it in blue with black alloys
Start again
I want to experiment
I am looking to buy a petrol convertible with automatic transmission
I love the mercedes amg-gt
I want to see the car in a light green colour instead of black
That's perfect
What's the tax liability?
Does it have dsg transmission?
What's the wheelbase of the car?
does the car have bluetooth in it?
can you book a test drive on Friday at 4pm for me?
Great Work
You're funny
Are you real?
I'll be back later
bye

Conversation 4
are you here?
Good afternoon
I want to customise a car
Looking for something in the 40k pound range with automatic transmission
Which of those have Apple car in them?
I'll go for the hyundai i30
I want it in dark blue
How many airbags does it have?
Does it have ebd?
What's the mileage in mpg?
That's really nice
Can I get a test drive for that nearby now
Yes please
I am ready to buy. Please send configuration to dealer
I am bored!
Nah
Talk to you later
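The script in Appendix A reads its test utterances from a CSV file with a `message` column. A minimal sketch of how conversations such as those above could be flattened into that layout is given below. Only the `message` column is taken from the Appendix A listing; the `conversation` and `turn` columns, the function name and the conversation key are assumptions added so that each utterance can be traced back to its source.

```python
import csv
import io


def conversations_to_csv(conversations, handle):
    """Write one row per utterance, keeping conversation id and turn order."""
    writer = csv.writer(handle)
    writer.writerow(["conversation", "turn", "message"])
    for conv_id, utterances in conversations.items():
        for turn, text in enumerate(utterances, start=1):
            writer.writerow([conv_id, turn, text])


# First four utterances of Type 1, Conversation 1, used as a short sample.
sample = {
    "type1_conv1": [
        "Hi",
        "Good Morning",
        "How are you?",
        "I am looking to buy a car",
    ]
}

buffer = io.StringIO()  # an open file handle would be used in practice
conversations_to_csv(sample, buffer)
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[0])
print(len(rows) - 1, "utterances written")
```

Keeping the turn number alongside each message preserves the conversational order, which would be lost if the utterances were stored as an unordered list.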