Balancing relevancy across expert systems for a Conversational AI personality
Gautam Prasad
September 2019
School of Mathematics, Cardiff University
A dissertation submitted in partial fulfilment of the requirements for MSc (in Data Science and Analytics) by taught programme.
CANDIDATE’S ID NUMBER 1821536
CANDIDATE’S SURNAME PRASAD
CANDIDATE’S FULL FORENAMES GAUTAM
DECLARATION
This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree.
Signed ……………………………………………. (candidate) Date 06-09-2019
STATEMENT 1
This dissertation is being submitted in partial fulfilment of the requirements for the degree of MSc
Signed ……………………………………………. (candidate) Date 06-09-2019
STATEMENT 2
This dissertation is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by footnotes giving explicit references. A Bibliography is appended.
Signed ……………………………………………. (candidate) Date 06-09-2019
STATEMENT 3 –
I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing, and for the title and summary to be made available to outside organisations.
Signed ……………………………………………. (candidate) Date 06-09-2019
STATEMENT 4 - BAR ON ACCESS APPROVED
I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing after expiry of a bar on access approved by the Graduate Development Committee.
Signed ……………………………………………. (candidate) Date 06-09-2019 Gautam Prasad
Executive Summary
Companies want to interact with their customers in a way that is not limited by time, human
resource availability or language. They need to pre-empt the needs of their clientele in order to
keep them satisfied and thereby reduce customer churn. Businesses in the telecom and finance
domains, such as Vodafone, Royal Bank of Scotland (RBS) and NatWest, are investing in
chatbots to build and maintain relationships while lowering overheads, costs and training time.
There are three significant types of chatbot expert system in use: chitchat bots, short-tail
(task-focused) bots, and long-tail search or FAQ bots. These have primarily been used
individually, depending on business requirements. Chitchat systems focus on having the most
natural conversation with a user based on their inputs. Short-tail systems help the user complete
a small number of regularly performed tasks; they require a high training effort to scale but
tend to provide consistent results. Long-tail systems focus on information retrieval; they require
a more significant training effort at the start and provide a wider variety of answers at lower
confidence, but they scale more efficiently. A mixture-of-experts system that can combine these
three technologies into a single personality, balancing the relevancy of each system’s response,
is of high interest to organisations. It will enable them to make better use of their existing
investments in brand personality (chitchat systems) and human-readable material (long-tail
systems) while allowing them to rapidly develop new task-focused (short-tail) systems across a
wide variety of business domains using modern API-based conversational short-tail systems.
This project looks at the best method of bringing these disparate types of system together into
a coherent conversational personality.
The analysis looked at different scenarios: using unmodified confidence scores from the expert
systems, building a high-level classifier to determine the best system to answer, and simulating
fallback rules used in systems like IBM Watson. The outputs from all the scenarios were
optimised using evolutionary algorithms on a prepared data set, covering a single topic that
was applicable to all three system types, before being compared using an accuracy metric
to determine the most successful strategy. The additional effect of concurrency between each
user utterance was then evaluated against these strategies.
The study concluded that merging the chitchat and short-tail training data, which reduces the
number of expert systems from three to two, and then applying the fallback rules works best,
provided the confidence level for fallback is optimised for the expected data set. If the training
sets of the chitchat and short-tail systems cannot be merged, or three separate systems must be
kept, a weighted high-level classifier performed best. Optimising the confidence levels
improved the performance of the fallback rules by a considerable margin. The literature review
suggested that concurrency would be a crucial aspect to investigate, but its overall effect on
this data was shown to be small. Recommended next steps are beta testing with real user data,
to avoid any cognitive bias in the test and training sets, and gauging the change in performance
when the number of expert systems is increased beyond three, where a high-level classifier is
expected to outperform fallback rules.
Acknowledgements
I would like to thank my supervisor, Professor Alexander Balinsky, for his timely help and for
pointing me in the right direction throughout this project. I am grateful to Mr Stephen
Broadhurst of ThinJetty Ltd for providing me with an opportunity to pursue this project at his
organisation and for the continuous mentoring and support. I would also like to acknowledge
the advice and assistance of Ms Joanna Emery and the moral support provided by my
colleagues on the MSc course.
Contents
Executive Summary
Acknowledgements
1. Introduction
2. Literature Review
2.1. Expert Systems
2.1.1. Overview
2.1.2. Typical Architecture for an expert system
2.1.3. Applications
2.2. Mixture of Experts
2.2.1. Background/ Overview
2.2.2. Applications
2.3. Conversational Agents and the application of Expert Systems
2.3.1. What are Conversational Agents?
2.3.2. How are expert systems used in chatbots?
2.3.3. Comparison of toolkits for building conversational agents
3. Methodology & Implementation
3.1. Overview
3.2. Tools setup and initialisation
3.3. Knowledgebase
3.4. Testing framework & Dataset optimisation
3.5. Scenarios
3.5.1. Overview
3.5.2. Experiment 1: High-level classifier
3.5.3. Experiment 2: Weighted High-Level Classifier
3.5.4. Experiment 3: Based on unmodified confidence scores
3.5.5. Experiment 4: Weighted confidence scores
3.5.6. Experiment 5: Emulating fallback logic
3.5.7. Experiment 6: Effect of concurrency
4. Results
5. Discussion & Conclusion
5.1. Qualitative evaluation
5.2. Possibility for future work
5.3. Recommendations & Findings
References
Appendices
1. Introduction
The project investigates the use of a mixture-of-experts system in the conversational artificial
intelligence (AI) domain and aims to devise a rule-based or machine-learnt technique that
can balance relevancy across the experts. The expert systems are specialised at a domain level
and are used to respond to a conversational turn in a robust and precise manner, while
accounting for concurrency and minimising errors. The aim is to have a mechanism that enables
the rules to adapt during the conversation, based on parameters of the conversational state or
features of the user utterance, to best judge which system must respond.
2. Literature Review
2.1. Expert Systems
2.1.1. Overview
Expert systems are a branch of AI concerned with developing machines which, in a
specific domain, have problem-solving abilities similar to those displayed by a human expert
in the same field, or which simulate human expert behaviour (Tzafestas et al., 1993). An expert
system differs from other forms of AI because it solves problems using domain-specific
approaches at an expert knowledge level and also provides evidence for the
conclusions drawn (Tzafestas et al., 1993). Several advantages over human experts have also
been highlighted over time (Ignazio, 1991):
• No human-like bias is involved in obtaining solutions or prescribing strategies.
• The chance of errors in calculations is minimal.
• The system is available and performs reliably on a near-constant basis.
2.1.2. Typical Architecture for an expert system
Figure 1: Typical architecture of an expert system (Forsyth, 1984; Tzafestas et al., 1993; Yazdani, 1989)
An expert system consists of the following high-level modules, shown in Figure 1,
which form the core of its operation (Forsyth, 1984, n.d.; Ignizio, 1990; Tripathi, 2011;
Tzafestas et al., 1993):
2.1.2.1. Knowledge Base
The knowledge base encompasses the domain-specific, expert-level knowledge
that the expert system requires for comprehending user requirements, formulating
strategies and obtaining the rule-based solutions that are then passed
on to the inference engine for further processing. It consists of both factual knowledge,
the most commonly shared form of knowledge, and heuristic knowledge,
the less widely shared and considerably more individualistic form,
which provides the reasoning behind the solutions obtained.
2.1.2.2. Inference Engine
The inference engine acts as the intelligence behind the expert system and draws
inferences from user requests or utterances. It analyses and processes
the rules obtained from the knowledge base in order to arrive at a solution together with
the logical reasoning behind it. In short, it controls the interpretation and reasoning
methodology of the expert system. The two most widely used reasoning strategies
are forward chaining and backward chaining. Forward chaining starts with the data
at hand and applies the inference rules to arrive at a solution, whereas backward
chaining starts with the list of goals to be attained and works backwards to see whether
the available data supports those goals.
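The two chaining strategies can be illustrated with a minimal Python sketch. The rules and facts below are hypothetical and invented purely for illustration; they are not taken from any system discussed in this dissertation, and a production inference engine would be considerably more elaborate.

```python
def forward_chain(facts, rules):
    """Derive every fact reachable from `facts` using `rules`.

    Each rule is a (premises, conclusion) pair; a rule fires once all of
    its premises are known.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


def backward_chain(goal, facts, rules, _seen=None):
    """Return True if `goal` can be proved from `facts` using `rules`."""
    seen = _seen or set()
    if goal in facts:
        return True
    if goal in seen:  # guard against circular rules
        return False
    seen = seen | {goal}
    return any(
        all(backward_chain(p, facts, rules, seen) for p in premises)
        for premises, conclusion in rules
        if conclusion == goal
    )


# Hypothetical rules for a car-recommendation expert (illustration only).
RULES = [
    (("has_children", "needs_boot_space"), "wants_family_car"),
    (("wants_family_car", "low_budget"), "recommend_estate"),
]
FACTS = {"has_children", "needs_boot_space", "low_budget"}

derived = forward_chain(FACTS, RULES)                         # data-driven
provable = backward_chain("recommend_estate", FACTS, RULES)   # goal-driven
```

Forward chaining derives everything that follows from the data, whereas backward chaining only explores the rules that are relevant to the goal being tested.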
2.1.2.3. User Interface
The user interface interacts with the user by receiving inputs in the form
of utterances and returning an output that the user can understand.
2.1.3. Applications
Expert systems have varied applications, chiefly in classification tasks, in fields such as
medical diagnosis, information retrieval and related services, engineering, human-computer
interaction, the military and robotics (Forsyth, 1984, n.d.; Ignizio, 1990; Tzafestas et al., 1993).
2.2. Mixture of Experts
2.2.1. Background/ Overview
Mixture of experts is a method that was introduced almost 30 years ago by Jacobs and
co-workers (R. Jacobs et al., 1991). They investigated the use of a different error function in a
mixture-of-experts system, and their approach has been widely adopted across a broad range of
applications (Yuksel et al., 2012). Over the years, around 20 different studies have
been conducted on the principles, workings and applications of such systems, and the problem
was to an extent even considered to be completely solved; recently, however, there has been a
resurgence of interest in using a mixture of experts for several new-age problems (Yuksel et
al., 2012).
Mixture of experts is widely regarded as a combining method which, when used
in machine learning tasks, can lead to better performance (Masoudnia and Ebrahimpour,
2014). The critical aspect of a mixture-of-experts model in any application
is to employ specialised expert systems that return correct answers for topics falling
within their knowledge bases, and to use a gating network across all the expert systems to help
reduce errors (R. Jacobs et al., 1991). The basic principle is that the gating
network assigns a new input to an expert system, and only the weights of that system are changed
if the output is found to be incorrect, which removes any chance of interference with the other
expert systems (R. Jacobs et al., 1991). This also has the possible added advantage
that each expert system is assigned only a small, well-suited subset of the input cases
(R. Jacobs et al., 1991).
Figure 2: A system of expert and gating systems (R. Jacobs et al., 1991)
Unlike in (Hampshire and Waibel, 1992; R. A. Jacobs et al., 1991), the gating network is
assumed to be a stochastic one-out-of-n selector. This achieves minimal interference
in a more straightforward manner, although it requires the error function to be redefined
so that the expert systems compete with one another, making the whole network
competitive rather than collaborative (R. Jacobs et al., 1991). Standard backpropagation
networks with a single hidden layer were compared with a mixture of experts on a
multi-speaker vowel recognition task (R. Jacobs et al., 1991). The number of parameters in
the models was kept approximately equal by adjusting the number
of hidden units in the backpropagation network (R. Jacobs et al., 1991). The results showed
that the mixture-of-experts model reached the error criterion (average squared error
of 0.08) in considerably fewer epochs, and therefore faster, while also scaling well as the
number of experts used in the system increased (R. Jacobs et al., 1991; Masoudnia and
Ebrahimpour, 2014).
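The competitive error function can be made concrete with a small sketch. It assumes the negative log-likelihood form given by Jacobs et al. (1991), E = -log Σ_i p_i exp(-½‖d - o_i‖²), where p_i is the softmax output of the gating network, o_i is the output of expert i and d is the target. The numeric values below are toy assumptions chosen only to show the effect.

```python
import math


def softmax(scores):
    """Turn raw gating scores into selection probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def squared_distance(target, output):
    """Squared Euclidean distance between target and expert output."""
    return sum((t - o) ** 2 for t, o in zip(target, output))


def competitive_error(gate_scores, expert_outputs, target):
    """E = -log sum_i p_i * exp(-0.5 * ||d - o_i||^2)."""
    priors = softmax(gate_scores)
    likelihood = sum(
        p * math.exp(-0.5 * squared_distance(target, out))
        for p, out in zip(priors, expert_outputs)
    )
    return -math.log(likelihood)


def responsibilities(gate_scores, expert_outputs, target):
    """Posterior share of the training case attributed to each expert."""
    priors = softmax(gate_scores)
    joint = [
        p * math.exp(-0.5 * squared_distance(target, out))
        for p, out in zip(priors, expert_outputs)
    ]
    total = sum(joint)
    return [j / total for j in joint]


# Toy values (assumed): two experts, equal gating scores, 1-D target.
target = [1.0]
outputs = [[0.9], [-1.0]]   # expert 0 is close to the target, expert 1 is not
gates = [0.0, 0.0]

post = responsibilities(gates, outputs, target)
err = competitive_error(gates, outputs, target)
```

Because the expert closest to the target receives the largest posterior share, it also receives most of the weight update, which is what makes the experts compete rather than cooperate.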
2.2.2. Applications
Several applications have been devised over the years for mixture-of-experts systems,
such as (Yuksel et al., 2012):
• Prediction of climate (Lu, 2006), electricity demand (Weigend et
al., 1995), stock prices (Versace et al., 2004) and currency exchange rates (Coelho
et al., 2003), amongst others.
• Machine-learnt classification tasks involving
  o Classification of
    ▪ Text (Estabrooks and Japkowicz, 2001) and
    ▪ Audio signals (Harb et al., 2004)
  o Recognition of
    ▪ Handwriting (Ebrahimpour, 2009),
    ▪ Speech (Mossavat et al., 2010; Peng et al., 1996), and
    ▪ 3D objects (Walter et al., 1999).
2.3. Conversational Agents and the application of Expert Systems
2.3.1. What are Conversational Agents?
A piece of software that enables a machine to converse with a human user in a natural
language such as English is called a conversational agent (Io and Lee,
2017; Weizenbaum, 1966). Since the initial research in the field in
the 1960s, the most significant challenge has been giving the machine the
intelligence needed to facilitate such interactions (Shum et al., 2018; Turing, 1950).
In a typical human conversation with a chatbot, input from the user takes the form of
one or more natural language utterances, which the system analyses to gauge the
requirements of the user before producing a response it deems appropriate
(Weizenbaum, 1966). The response can be derived using
several techniques. Rules were predominantly used in the earlier stages
of chatbot development: the user utterance is searched for keywords and,
based on their presence, associated rules are invoked to transform the utterance into a
response (Shum et al., 2018; Weizenbaum, 1966).
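A minimal sketch of this keyword-and-rule approach is given below. The patterns and reply templates are hypothetical; they only imitate the style of early rule-based chatbots and are not the actual scripts of any of the systems cited.

```python
import re

# Hypothetical keyword rules: (pattern, reply template). Order sets priority.
RULES = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\b(mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
]
DEFAULT_REPLY = "Please go on."


def respond(utterance):
    """Return the reply of the first rule whose keyword pattern matches."""
    text = utterance.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Reuse the captured fragment of the utterance in the reply.
            return template.format(*match.groups())
    return DEFAULT_REPLY
```

For example, `respond("I need a holiday.")` returns `"Why do you need a holiday?"`, while an utterance that matches no keyword falls through to the default reply.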
2.3.1.1. Types of Conversational Agents
2.3.1.1.1. Chitchat
Several of the earliest systems, such as “Eliza”, “ALICE” and “Parry”
(Colby KM, 1975; Shieber, 1994; Wallace, 2009; Weizenbaum, 1966), were
chitchat bots built to converse with users
through text, audio and other media (Shum et al., 2018). These systems
used rule-based pattern matching to respond to the user’s input (Shum et
al., 2018; Weizenbaum, 1966).
The chatbots were given different personalities, such as a “Rogerian
Psychotherapist” (Shum et al., 2018; Weizenbaum, 1966), a paranoid person (Colby KM,
1975; Shum et al., 2018) and so on, but they were severely limited in their
ability to sustain a conversation for a prolonged duration, and their narrowly
specified domains further reduced their performance (Colby KM,
1975; Shieber, 1994; Wallace, 2009). These limitations were partly due to the
technology with which the systems were built, such as AIML for “ALICE”,
which in turn led to their failure in several evaluations such as the “Ultimate
Turing Test” (Shum et al., 2018; Wallace, 2009).
2.3.1.1.2. Task-Completion
Task-completion conversational agents are built with a focus on carrying out
specific tasks within constrained domains (Shum et al., 2018; Walker et
al., 2001; Wang et al., n.d.). They typically give a single, short, high-confidence
answer. Commonly seen domains include hotel or flight
booking, weather forecasts and information gathering. In general, the
system tries to gauge the user’s ‘intent’ and then responds with actions that will
fulfil that intent or goal (Shum et al., 2018; Walker et al., 2001). Further
improvements included the ability to comprehend complex dialogues with
inherent variability, and state tracking (Shum et al., 2018; Williams and Young,
2007). These systems were evaluated on several parameters, including
(Walker et al., 2001):
• User satisfaction
• Task completion
• Task duration
• Accuracy
Telecom giant Vodafone introduced a chatbot, ‘TOBi’, a couple of years ago
to help users with basic tasks such as checking account details,
troubleshooting and purchasing new connections (Koehler, 2017). TOBi
delivered the following metrics (Davis, 2018):
• A conversion rate more than 100% higher than that of their website.
• A transaction time around 50% shorter than on their website.
• A usability score of 90, among the highest ever received.
Another example is ‘Cora’, a chatbot employed by RBS
and NatWest, both in the banking domain, to answer basic banking-related
queries from users (“NatWest begins testing AI driven ‘digital human’ in
banking first,” n.d.; Rumney, 2018). This has helped in identifying the most
frequently asked questions and in significantly cutting queuing times.
2.3.1.1.3. Long tail or Question Answering
QA conversational agents process natural language queries raised by the user
and provide concise and relevant answers to them, thereby improving the overall
interaction between the user and the intelligent system (Simmons, 1970; Waltinger
et al., 2012). They return a number of lower-confidence, longer passages of
text that are mined from the corpus rather than configured. In general, the question is
first analysed and a search performed, which results in answers with
supporting evidence and a score (Ferrucci et al., 2010; Setiaji and Wibowo, 2016).
The answers thus obtained are then ranked in order of relevance before being
presented to the user (Ferrucci et al., 2010; Setiaji and Wibowo, 2016). To
improve the response to the user, other facets such as topic identification,
context recognition and keyword detection are also used in tandem
(Niranjan et al., 2012; Setiaji and Wibowo, 2016; Waltinger et al., 2012).
Question classification, in which the inherent ‘type’ of the question posed by the user
is determined, has also been identified as a component that can improve the
accuracy of long-tail agents (Suzuki et al., 2003; Waltinger et al., 2012; Zhang
and Lee, 2003).
2.3.2. How are expert systems used in chatbots?
2.3.2.1. Expert systems in chatbots
Traditionally, spoken dialogue systems/conversational agents have employed
mechanisms to control the dialogue flow, limiting the responses from the
user to a set of pre-defined choices (M. O’Neill et al., 2004). In advanced
systems that allow multi-domain interaction between user and agent, a
component identifies the domain or topic from the user input and
performs the action needed to fulfil the requirement (M. O’Neill et al., 2004).
Over time, several ‘plan-based dialogue modelling schemes’ were put forward to
build systems upon, the premise being that behind every user-system
interaction lies a particular requirement or goal of the user, which the system has to
recognise and act upon (Lin et al., 1999). To facilitate multi-domain
conversations, the entire system is configured as multiple ‘domain-specific experts’,
each capable of completing transactions in a particular domain, which work
in association with each other and switch between themselves based on
user input (M. O’Neill et al., 2004; Nakano et al., 2008). A middle layer
in the system is responsible for evaluating user utterances across all the ‘experts’
and determining which one has to respond to a particular utterance
(Hartikainen et al., 2004; Komatani et al., 2006; Lin et al., 1999; M. O’Neill et al.,
2004).
2.3.2.2. Problems faced in addressing multi-domain conversations
A few problems have arisen in attempts to handle
multi-domain conversations in a concurrent and flexible manner, such as:
• Identifying how to handle errors in comprehending user input (Lin et al.,
1999).
• Determining whether the user or the system should take the initiative in carrying on
the conversation (Lin et al., 1999).
• Tackling user initiatives in a proper and ‘consistent’ manner (Lin et al.,
1999).
• Diminishing efficiency due to multiple systems working simultaneously
(Hartikainen et al., 2004).
• Inability to handle concurrent topics (Lin et al., 1999).
2.3.3. Comparison of toolkits for building conversational agents
Different toolkits specialised in building conversational agents were considered,
selected primarily for offering both a search (long-tail) and a typical
conversational agent functionality out of the box. IBM Watson, Google Dialogflow
and Microsoft LUIS have both capabilities, their long-tail functionalities being
Watson Discovery, Knowledge Connectors and Microsoft QnA Maker respectively.
In order to select the best possible toolkit, the study conducted by Liu et al. (2019) was
used. Liu et al. (2019) compared state-of-the-art conversational toolkits on
the metrics of precision, recall and F1 score, as given in Table 1.
                     Intent
Toolkit/ Metric      Precision   Recall   F1
Rasa                 0.863       0.863    0.863
Dialogflow           0.87        0.859    0.864
LUIS                 0.855       0.855    0.855
Watson               0.884       0.881    0.882
Table 1: Comparison of specialised toolkits (Liu et al., 2019)
As seen in Table 1, IBM Watson returns the highest F1 score for intent classification,
although there is no significant difference among the scores of the other three toolkits
(Liu et al., 2019).
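As a quick consistency check on Table 1, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the precision and recall values listed in the table; the variable names are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall for intent classification as listed in Table 1.
TABLE_1 = {
    "Rasa": (0.863, 0.863),
    "Dialogflow": (0.870, 0.859),
    "LUIS": (0.855, 0.855),
    "Watson": (0.884, 0.881),
}

f1_by_toolkit = {
    name: round(f1_score(p, r), 3) for name, (p, r) in TABLE_1.items()
}
best_toolkit = max(f1_by_toolkit, key=f1_by_toolkit.get)
```

The recomputed values agree with the F1 column of Table 1 to three decimal places, and `best_toolkit` evaluates to `"Watson"`.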
3. Methodology & Implementation
3.1. Overview
The aim of the project is to create a rule-based or machine-learnt algorithm for weighting
the confidence and evidence returned by the expert systems, in order to determine for each
conversational turn which system is best placed to respond, and to enable these rules to adapt
during the conversation based on parameters of the conversational state or features
of the user utterance. By doing this, we also
aim to recommend how businesses can best combine ‘off the shelf’/ out-of-the-box
(OOTB) chitchat with existing human-readable corpora (long tail) and then rapidly develop
domain-specific functionality.
It was decided to build a car configurator bot (a multi-domain expert system/conversational
agent) to perform the tests and analysis. The bot would be able to engage in chitchat
with the user and perform tasks involved in configuring a car, such as gathering general
requirements, booking test drives and handling similar queries. It would also act as a
question-and-answer bot to which users can pose natural language queries, the answers to
which could further help them narrow their search or enhance their knowledge of a vehicle
they have in mind. Short-tail queries would be put to the system by the user at a higher
frequency, and long-tail ones at a significantly lower frequency. Each user
utterance is passed to all three expert systems, and the corresponding confidence scores are
retrieved. Metrics such as accuracy, precision and recall are derived and used as a baseline
against which the different simulated scenarios are compared. The whole experiment and
analysis were devised to be completed using the IBM Watson Assistant and Discovery tools, as
discussed in Section 2.3.3, with data extraction and manipulation coded in
Python and optimisation tasks performed in Excel using Solver.
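The routing step, in which every utterance is scored by all three expert systems and one is chosen to respond, can be sketched as follows. This is a simplified illustration rather than the implementation used in the project: the confidence values, weights, fallback order and threshold are all assumed for the example.

```python
def pick_by_confidence(scores, weights=None):
    """Choose the expert with the highest (optionally weighted) confidence."""
    weights = weights or {}
    return max(scores, key=lambda name: scores[name] * weights.get(name, 1.0))


def pick_by_fallback(scores, order, threshold):
    """Walk the experts in `order`; take the first one at or above `threshold`.

    The last expert in `order` answers if no earlier one is confident enough.
    """
    for name in order[:-1]:
        if scores[name] >= threshold:
            return name
    return order[-1]


# Assumed confidence scores returned for one user utterance.
scores = {"chitchat": 0.35, "short_tail": 0.62, "long_tail": 0.70}

by_raw = pick_by_confidence(scores)                            # unmodified scores
by_weighted = pick_by_confidence(scores, {"short_tail": 1.2})  # weighted scores
by_fallback = pick_by_fallback(
    scores, ["short_tail", "chitchat", "long_tail"], threshold=0.5
)
```

With these assumed numbers the unmodified scores select the long-tail system, whereas both the weighted scores and the fallback rule select the short-tail system, which shows why the choice of strategy and its parameters matters.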
3.2. Tools setup and initialisation
An account was set up in IBM Watson to use Assistant and Discovery for building the
conversational agent. The Watson Developer Cloud Python SDK is used to communicate
with the Assistant and Discovery services through the application programming interface
(API) provided; its dependencies and packages are also installed. Access to the
online services is gained using a combination of usernames, API keys, environment IDs,
collection IDs, etc. Python is also used to extract, manipulate and further analyse the outputs
from the services’ APIs. Microsoft Excel is used to create and store the data used for
training and testing, as Excel worksheets (.xlsx), comma-separated value
files (.csv) or tab-separated value files (.tsv), depending on requirements and features.
Toward the latter end of the analysis, the Solver add-in of Excel is used to optimise metric
values such as precision, recall or accuracy (objective) based on weights or other parameters
(constraints), as necessary. Solver has three solving methods, which are used
throughout our experiments depending on the requirements and on the improvement each
method brings to the metrics. Where more than one method is used, the results
can also be compared. The Solver methods are as follows:
• Simplex LP: Used where the problem is linear, which restricts its
applications (“Excel Solver,” 2016). One of its benefits, however, is
that the solutions obtained are always globally optimal (“Excel Solver,” 2016).
• Generalised Reduced Gradient (GRG) Nonlinear: This method is the fastest of the
non-linear methods, but the solution obtained might not be
a global optimum and is highly dependent on the initial conditions (“Excel
Solver,” 2016). It is used for smooth non-linear problems (“Excel Solver,” 2016).
• Evolutionary method: Based on the theory of natural selection, it may converge on
a solution either because that solution is the global optimum or because the population
has lost its diversity (“Excel Solver,” 2016). It is used for non-smooth problems.
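To illustrate why an evolutionary method suits this kind of problem, the sketch below optimises per-expert confidence weights against routing accuracy, which is a non-smooth objective because it only changes when a weight change flips a routing decision. This is a generic evolutionary search written for illustration; it is not the algorithm implemented by Excel Solver, and the labelled data, population size and mutation scale are assumptions.

```python
import random

EXPERTS = ("chitchat", "short_tail", "long_tail")

# Toy labelled data (assumed): confidence per expert and the expert that
# should have answered.
DATA = [
    ({"chitchat": 0.80, "short_tail": 0.30, "long_tail": 0.60}, "chitchat"),
    ({"chitchat": 0.40, "short_tail": 0.55, "long_tail": 0.65}, "short_tail"),
    ({"chitchat": 0.20, "short_tail": 0.50, "long_tail": 0.58}, "short_tail"),
    ({"chitchat": 0.30, "short_tail": 0.25, "long_tail": 0.45}, "long_tail"),
    ({"chitchat": 0.50, "short_tail": 0.60, "long_tail": 0.70}, "short_tail"),
    ({"chitchat": 0.10, "short_tail": 0.20, "long_tail": 0.50}, "long_tail"),
]


def accuracy(weights):
    """Share of utterances routed to the correct expert under `weights`."""
    hits = 0
    for scores, truth in DATA:
        chosen = max(EXPERTS, key=lambda e: scores[e] * weights[e])
        hits += chosen == truth
    return hits / len(DATA)


def evolve(generations=40, population_size=20, sigma=0.2, seed=0):
    """Keep the fittest half each generation and refill with mutated copies."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    unit = {e: 1.0 for e in EXPERTS}
    population = [unit] + [
        {e: rng.uniform(0.5, 1.5) for e in EXPERTS}
        for _ in range(population_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=accuracy, reverse=True)
        parents = population[: population_size // 2]
        children = [
            {e: max(0.01, w + rng.gauss(0.0, sigma)) for e, w in parent.items()}
            for parent in parents
        ]
        population = parents + children
    return max(population, key=accuracy)


baseline = accuracy({e: 1.0 for e in EXPERTS})  # unmodified confidence scores
best_weights = evolve()
best = accuracy(best_weights)
```

Because the fittest individuals are always carried over and the unweighted solution is included in the initial population, `best` can never fall below `baseline`. As the text notes, however, such a search may stop improving simply because the population has lost its diversity.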
3.3. Knowledgebase
A corpus of data had to be created to train and test the conversational agent. Since the
conversational agent has three experts, namely chitchat (social), short-tail (task-oriented)
and long-tail (Q&A), a corpus was created for each of them. The data were acquired in the
following manner:
• Chitchat: A collection of close to 890 example user utterances and 59 intents was
obtained for the chitchat expert system by combining the data from:
  o Watson Assistant: The inbuilt ‘General’ content catalogue, which contains
  ten unique intents and close to 200 example utterances.
  o Google Dialogflow: The inbuilt ‘smalltalk’ agent was exported, which
  contains 86 intents and around 1500 example utterances.
• Short-tail: A corpus of 16 intents with close to 150 example user utterances was
created manually for the short-tail expert system of the car configurator use case,
including intents such as ‘#GeneralRequirements’ and ‘#BookTestDrive’ and
their corresponding example user utterances.
• Long-tail: 118 car brochures spread across different vehicle types, makes and
models were obtained and collated to be used as the corpus for the long-tail expert
system.
The data for both the chitchat and short-tail expert systems were ingested into two separate
Watson Assistants, and the data for long-tail was ingested into Watson Discovery for testing
and analysis purposes. After ingestion into Discovery, the search was optimised for relevancy
by using the out-of-the-box (OOTB) relevancy training available in Watson Discovery. The
process entailed posing natural language queries to Discovery and marking the results from
the service as ‘relevant’ or ‘not relevant’ based on their contents.
For testing the conversational agent, a dataset consisting of 50 example user utterances
across five different conversations was manually created, mimicking users who would use the
chatbot to configure a car, book a test drive or make similar requests. The utterances were
written to be as close to a real conversation as possible, with a sample response for each,
and were drawn from different expert systems to enable testing of the multi-domain
conversational agent.
3.4. Testing framework & Dataset optimisation
In general, the data for conversational agents, which comprises example user utterances
and intents, is created by ‘subject matter experts’ based on ground truth (Freed, 2018).
Utterances are created and then marked with ‘entities’ and labelled with expected ‘intents’;
the corpus procured for the chitchat workspace from the Watson and Google services has
been created in the same fashion by employing APIs to crawl the web. Data procured in this
way needs to be tested to identify any hidden patterns and weaknesses present in it, which
can then be remedied (Freed, 2018).
Testing is achieved by submitting the utterances to the classifier and investigating the out-
put of the classifier to see if it matches the set ‘ground truth’ (Freed, 2018). The data is split
into training and testing/blind datasets using the k-fold cross-validation technique. After
training the classifier on the training data, the validation dataset is used to evaluate the
classifier and obtain the required parameters. Precision and recall metrics are intended to
be put into use while evaluating and comparing the performance of the classifier.
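The k-fold split described above can be illustrated with a minimal sketch. The dissertation does not state the value of k or whether the data were shuffled first, so the function name `k_fold_indices`, the choice of k = 5 and the contiguous folds below are assumptions made purely for illustration.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds.

    Assumes the samples were shuffled beforehand (not shown here).
    """
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Illustrative call: 10 utterances split into 5 folds (values are hypothetical).
folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each utterance appears in exactly one test fold, so every utterance is classified once by a model that was not trained on it.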
The testing was done on the chitchat corpus, and the metrics were obtained at intent level.
In order to improve the performance, the intents were sorted by recall value and the
ones with the lowest values were selected to be removed or fixed. The misclassified utterances
were either moved to different intents or edited to enable better classification. This
process was carried out for all the utterances falling under intents with the lower recall
values. After a complete overhaul, the data was re-ingested, and the testing was carried out
again. The entire process was repeated multiple times, thereby improving the overall metric
values and boosting the intents with initially low recall values. The final dataset for the
chitchat expert system, which had a precision value of 90.32% and was reduced to 50 unique
intents and 859 utterances, was then ingested back into the Watson Assistant service.
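As a rough illustration of the intent-level analysis, the sketch below computes precision and recall per intent from golden and predicted labels and then lists the intents from lowest to highest recall. The intent names and labels are hypothetical and are not taken from the chitchat corpus.

```python
from collections import Counter


def per_intent_metrics(golden, predicted):
    """Return {intent: {"precision": p, "recall": r}} from paired label lists."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(golden, predicted):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted intent received a wrong utterance
            fn[gold] += 1  # golden intent missed one of its utterances
    metrics = {}
    for intent in set(golden) | set(predicted):
        p_den = tp[intent] + fp[intent]
        r_den = tp[intent] + fn[intent]
        metrics[intent] = {
            "precision": tp[intent] / p_den if p_den else 0.0,
            "recall": tp[intent] / r_den if r_den else 0.0,
        }
    return metrics


# Hypothetical labels for six test utterances.
golden = ["#greeting", "#greeting", "#thanks", "#goodbye", "#goodbye", "#goodbye"]
predicted = ["#greeting", "#thanks", "#thanks", "#goodbye", "#greeting", "#goodbye"]

metrics = per_intent_metrics(golden, predicted)
# Intents with the lowest recall come first: these are the candidates to fix or remove.
worst_first = sorted(metrics, key=lambda intent: metrics[intent]["recall"])
print(worst_first)
```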
3.5. Scenarios
3.5.1. Overview
The primary aim is to create an adaptable rule-based or machine-learnt algorithm that
weights the confidence and evidence returned by the expert systems to determine, for each
conversational turn, which system is best placed to respond. To fulfil this aim, different
scenarios have to be devised, analysed and compared. The outcome of the comparison gives
the best algorithm to implement in order to obtain the best concurrency and switching
between the expert systems in an efficient but logical manner based on user utterances.
The scenarios thus formulated are as follows:
1. High-level classifier
2. Weighted high-level classifier
3. Based on pure-confidence values
4. Weighted system confidences
5. Emulating fallback logic
6. Testing for concurrency
The testing data created earlier, which consists of 50 unique user utterances, is input
to the mixture of experts’ system, and the confidences returned are used as a baseline to
perform the experiments and simulate the scenarios. A metric similar to accuracy was
devised to compare the scenarios on a quantitative basis; it is obtained by dividing the
number of correct classifications by the total number of classifications performed.
Investigating the metric obtained after each simulation gives a clear insight into which
expert system should be used to respond to a given utterance.
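The metric can be written down in a few lines. The sketch below is only an illustration: the function name `scenario_metric` and the example labels are assumptions, not part of the original simulation, which was carried out in Excel.

```python
def scenario_metric(selected, golden):
    """Fraction of utterances for which the selected system equals the golden system."""
    if not golden or len(selected) != len(golden):
        raise ValueError("selected and golden must be non-empty and of equal length")
    correct = sum(s == g for s, g in zip(selected, golden))
    return correct / len(golden)


# Hypothetical example: four of five selections match the golden system.
selected = ["short_tail", "chitchat", "long_tail", "chitchat", "short_tail"]
golden = ["short_tail", "chitchat", "long_tail", "short_tail", "short_tail"]
metric = scenario_metric(selected, golden)
print(metric)
```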
3.5.2. Experiment 1: High-level classifier
The purpose of this experiment is to set up a high-level classifier which acts as the
‘selector’ in the mixture of experts’ system and is trained on a sample of the example
utterances spread across the short-tail, chitchat and long-tail expert systems. The data
corpus must be sampled to obtain an equal number of utterances from each class, fixed at 50
for the use case, to avoid any possible bias in that regard. The ‘Pandas’ module in
Python and its built-in ‘group by’ method are used for sampling the utterances of all three
separate expert systems. The separate sampled datasets are concatenated in order to
obtain a single corpus of 150 utterances labelled with the three classes of short-tail,
chitchat and long-tail.
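The balanced sampling step can be sketched as follows. The dissertation used Pandas and its group-by method; to keep the example self-contained, this sketch swaps in the Python standard library instead. The function name `sample_per_class`, the field names and the placeholder corpus sizes are assumptions for illustration only.

```python
import random
from collections import defaultdict


def sample_per_class(rows, label_key, n_per_class, seed=0):
    """Group rows by label and draw the same number of rows from every group."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    sampled = []
    for label in sorted(by_class):
        group = by_class[label]
        if len(group) < n_per_class:
            raise ValueError(f"class {label!r} has fewer than {n_per_class} rows")
        sampled.extend(rng.sample(group, n_per_class))  # sampling without replacement
    return sampled


# Placeholder corpus: sizes are illustrative, not the real corpus sizes.
corpus = [
    {"utterance": f"{label} example {i}", "expert": label}
    for label, size in [("chitchat", 200), ("short_tail", 150), ("long_tail", 120)]
    for i in range(size)
]

balanced = sample_per_class(corpus, label_key="expert", n_per_class=50)
print(len(balanced))  # 3 classes x 50 utterances
```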
The dataset is further split on an 80:20 ratio to be used for training and testing purposes
of the classifiers built. Different algorithms are used as the base for the classifier to
allow for the selection of the best possible classifier and are compared based on the
testing accuracy values, the algorithms being:
• Naïve Bayes: An algorithm which is based on Bayes’ theorem and assumes independence
between the predictors used in the classification.
• Linear Classifier – Logistic Regression: It uses the logistic function at its core
to determine the relationship between the dependent variable and several independent
variables.
• Support Vector Machine (SVM): SVM is a supervised algorithm which attempts to find
the best possible hyperplanes for classifying the data.
• Bagging Method – Random Forest (RF): The random forest method constructs numerous
decision trees during training. For classification, the output class is the mode of the
classes predicted by the individual trees (the mean is used for regression). It reduces
the overfitting found in decision tree classification.
• Boosting Method – eXtreme Gradient Boosting Model (XGBoost): A supervised machine
learning algorithm which uses an ensemble of other weaker models/algorithms to reduce
bias and variance.
Along with these algorithms, Google BERT (Bidirectional Encoder Representations
from Transformers), which is pre-trained in an unsupervised manner on unlabelled text, was
also used to build a classifier (Devlin et al., 2018). BERT uses bidirectional encoding and
is released by Google as several pre-trained models, which can be further fine-tuned to suit
the application or requirement (Devlin et al., 2018). Since it is bidirectional, it takes into
account the context of a word from both its left and right sides (Devlin et al., 2018).
The BERT classification set-up used here was binary out of the box and had to be modified
to work with our use case of three classes. Also, the training, validation
and testing data must be formatted to suit the input requirements of BERT, which is
done using a combination of Python coding and Excel.
A classifier was built with each of the above machine learning algorithms/tools on the
three expert system corpora and tested. The resulting accuracy, which is the fraction
of correctly classified samples, is as follows:
Algorithm | Accuracy
Naïve Bayes | 0.77
Linear Classification | 0.74
Support Vector Machine | 0.77
Bagging Model | 0.67
Boosting Model | 0.69
Google BERT | 0.87
Table 2: High-level classifiers comparison
Google BERT gave the best accuracy values for the test data and was selected as the
best algorithm to use as the high-level classifier and to build the simulation for the
scenario. The simulation is carried out in the following manner:
• The test corpus created, which consists of 50 unique utterances across five con-
versations, is inputted to the BERT based classifier, and the output is obtained.
The output is a confidence score for every utterance for each expert system.
Utterance | Short tail confidence | Chitchat confidence | Long tail confidence
Utterance 1 | 0.27420458 | 0.530363 | 0.19543229
Utterance 2 | 0.13647898 | 0.6893639 | 0.17415714
Utterance 3 | 0.25295562 | 0.4700206 | 0.27702382
Utterance 4 | 0.3946402 | 0.36452472 | 0.24083503
Utterance 5 | 0.3252571 | 0.38989067 | 0.2848522
Utterance 6 | 0.31253842 | 0.24711472 | 0.44034687
... | ... | ... | ...
Table 3: Sample high-level classifier output
• The expert system which returns the highest confidence is selected as the system
which is best placed to respond to the utterance at that conversational turn.
• The output of the scenario is obtained per utterance by checking whether the expert
system selected after classification is the same as the ‘golden system’. The golden
system is part of the testing data and has been labelled by the subject matter expert
based on the logical response expected.
• The recall metric for the scenario is obtained for comparison purposes during
the analysis stage.
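The selection step can be sketched with the six sample rows from Table 3. The function name `select_expert` and the system labels are illustrative assumptions; the original simulation was not necessarily implemented this way.

```python
SYSTEMS = ("short_tail", "chitchat", "long_tail")

# Confidence scores copied from Table 3 (short tail, chitchat, long tail).
table3 = [
    (0.27420458, 0.530363, 0.19543229),
    (0.13647898, 0.6893639, 0.17415714),
    (0.25295562, 0.4700206, 0.27702382),
    (0.3946402, 0.36452472, 0.24083503),
    (0.3252571, 0.38989067, 0.2848522),
    (0.31253842, 0.24711472, 0.44034687),
]


def select_expert(confidences):
    """Return the name of the expert system with the highest confidence."""
    best_index = max(range(len(confidences)), key=lambda i: confidences[i])
    return SYSTEMS[best_index]


selected = [select_expert(row) for row in table3]
for number, system in enumerate(selected, start=1):
    print(f"Utterance {number}: {system}")
```

For these six rows the selector picks chitchat for utterances 1, 2, 3 and 5, short tail for utterance 4 and long tail for utterance 6; each choice would then be compared with the golden system for that utterance.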
3.5.3. Experiment 2: Weighted High-Level Classifier
In this experiment, the confidences obtained from the high-level classifier built on
Google BERT (as in Experiment 1) are weighted (as in Experiment 4) to see if this
brings about a beneficial change in the output of the mixture of experts’ system.
The experiment is carried out in the following manner:
• The confidence scores from the classifier were obtained as in Experiment 1 and
were then weighted, giving a bias to the confidence scores obtained from each system
(as in Experiment 4). The initial results were obtained with an equal weight of 1 across
all systems.
• The expert system which returns the highest weighted confidence is selected as
the system which is best placed to respond to the utterance at that conversational turn.
• The output of the scenario is obtained per utterance by checking whether the expert
system selected after classification is the same as the ‘golden system’, which is part
of the testing data and has been labelled by the subject matter expert based on the
logical response expected.
• The metric for the scenario (objective) is obtained and is then optimised using Solver
to obtain the maximum value possible by varying the weights assigned to the expert
systems (decision variables).
• The optimisation is performed using both GRG Non-Linear and Evolutionary
methods to allow for comparison.
• The weights obtained after optimisation are normalised so that they are easier to
interpret. Normalisation is done by fixing the weight of one system at 1. In this case,
short tail is selected, since the business problem is to focus on the short tail and bring
in the other expert systems without changing its confidence. The normalised weights for
the other expert systems are then obtained by dividing their current values by the
pre-normalised short-tail weight.
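A minimal sketch of the weighting and normalisation steps is given below. It assumes that each weight multiplies the corresponding confidence, which the text does not state explicitly; the weight values and function names are hypothetical. The confidences are those of Utterance 4 in Table 3.

```python
def select_weighted(confidences, weights):
    """Pick the system with the highest weight * confidence (assumed multiplicative)."""
    scores = {system: confidences[system] * weights[system] for system in confidences}
    return max(scores, key=scores.get)


def normalise_weights(weights, reference="short_tail"):
    """Rescale all weights so that the reference system has a weight of 1."""
    ref = weights[reference]
    if ref == 0:
        raise ValueError("reference weight must be non-zero")
    return {system: weight / ref for system, weight in weights.items()}


confidences = {"short_tail": 0.3946402, "chitchat": 0.36452472, "long_tail": 0.24083503}
equal = {"short_tail": 1.0, "chitchat": 1.0, "long_tail": 1.0}
optimised = {"short_tail": 2.0, "chitchat": 3.0, "long_tail": 1.0}  # hypothetical values

normalised = normalise_weights(optimised)
print(select_weighted(confidences, equal))       # selection with equal weights
print(select_weighted(confidences, optimised))   # selection with hypothetical weights
print(normalised)
```

Because normalisation divides every weight by the same positive constant, it changes how the weights read but not which system is selected.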
3.5.4. Experiment 3: Based on unmodified confidence scores
This experiment aims to simulate a scenario wherein the system used to respond is selected
using only the unmodified confidence scores returned by the three expert systems: chitchat
and short tail, hosted in Watson Assistant, and long tail, hosted in Watson Discovery. It
also aims to decide whether this approach is suitable for enabling the algorithm to adapt
to changes during the conversation, based on parameters related to the conversational state
or features from the user utterance, in order to best judge which system must respond.
The experiment is carried out in the following manner:
• The mixture of experts’ system is tested using the testing data comprising 50
example user utterances by posing these utterances to all three expert systems
individually.
• The response of the systems in the form of confidence scores is retrieved and
stored across the utterances.
• The expert system which returns the highest confidence is selected as the system
which is best placed to respond to the utterance at that conversational turn.
• The next stage of the simulation is carried out in Excel. The output of the scenario
is obtained per utterance by checking whether the ‘golden system’ matches the system
selected based on the confidence calculation.
• The metric for the scenario is obtained for comparison purposes during the
analysis stage.
3.5.5. Experiment 4: Weighted confidence scores
In this experiment, the confidences obtained from the expert systems are weighted in
an attempt to see if this brings about a beneficial change in the output of the mixture of
experts’ system.
The experiment is carried out in the following manner:
• The confidence scores are obtained from the three expert systems as in Experiment 3.
• The scores are then weighted in order to better classify the input utterances. The
weights are assigned an equal value of 1 to start with.
• The output of the scenario is obtained per utterance by checking whether the
‘golden system’ matches the system selected based on the weighted confidence
calculation.
• The metric for the scenario (objective) is obtained and is then optimised using Solver
to obtain the maximum value possible by varying the weights assigned to the expert
systems (decision variables).
• The optimisation is performed using both GRG Non-Linear and Evolutionary
methods to allow for comparison.
• The weights obtained after optimisation are normalised so that they are easier to
interpret. Normalisation is done by fixing the weight of one system at 1. In this case,
short tail is selected, since the business problem is to focus on the short tail and bring
in the other expert systems without changing its confidence. The normalised weights for
the other expert systems are then obtained by dividing their current values by the
pre-normalised short tail weight.
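The optimisation itself was performed with Excel Solver (GRG Non-Linear and Evolutionary). The sketch below is not a reimplementation of either method; it swaps in a plain grid search, purely to illustrate the idea of treating the metric as the objective and the weights as the values being varied. The confidences, golden labels and grid are hypothetical, and the short tail weight is held at 1 as in the normalisation step.

```python
from itertools import product

SYSTEMS = ("short_tail", "chitchat", "long_tail")


def metric(rows, golden, weights):
    """Fraction of utterances whose highest weighted confidence matches the golden system."""
    correct = 0
    for confidences, gold in zip(rows, golden):
        scores = [c * w for c, w in zip(confidences, weights)]
        if SYSTEMS[scores.index(max(scores))] == gold:
            correct += 1
    return correct / len(golden)


def grid_search(rows, golden, grid):
    """Try every (chitchat, long tail) weight pair on the grid; short tail stays at 1."""
    best_weights = (1.0, 1.0, 1.0)
    best_metric = metric(rows, golden, best_weights)
    for chitchat_w, long_tail_w in product(grid, repeat=2):
        candidate = (1.0, chitchat_w, long_tail_w)
        value = metric(rows, golden, candidate)
        if value > best_metric:  # keep the first candidate that strictly improves
            best_weights, best_metric = candidate, value
    return best_weights, best_metric


# Hypothetical confidences (short tail, chitchat, long tail) and golden systems.
rows = [(0.60, 0.30, 0.10), (0.45, 0.40, 0.15), (0.20, 0.50, 0.30), (0.40, 0.10, 0.35)]
golden = ["short_tail", "chitchat", "chitchat", "long_tail"]

baseline = metric(rows, golden, (1.0, 1.0, 1.0))
best_weights, best_metric = grid_search(rows, golden, grid=[0.5, 1.0, 1.5, 2.0])
print(baseline, best_weights, best_metric)
```

The metric is a step function of the weights, since it only changes when a selection flips. That is one plausible reason a gradient-based method such as GRG Non-Linear may stay at its starting point while a population-based method finds an improvement.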
3.5.6. Experiment 5: Emulating fallback logic
This experiment attempts to mimic the fallback logic employed by Watson in the
out-of-the-box cloud service. The logic can be defined as follows with three significant
variations:
• Version 1:
o if ‘short tail confidence’ > ‘threshold confidence’:
▪ the short tail system must respond
o else if ‘chitchat confidence’ > ‘threshold confidence’:
▪ the chitchat system must respond
o else:
▪ the long tail system must respond
• Version 2:
o if ‘short tail confidence’ > ‘threshold confidence’:
▪ the short tail system must respond
o else if ‘long tail confidence’ > ‘threshold confidence’:
▪ the long tail system must respond
o else:
▪ the chitchat system must respond
• Version 3 (2 expert systems):
o if ‘combined confidence’ > ‘threshold confidence’:
▪ the combined system must respond
o else if ‘long tail confidence’ > ‘threshold confidence’:
▪ the long tail system must respond
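The three variations above can be sketched as small Python functions. The function names and dictionary keys are illustrative assumptions. Version 3 as defined has no final branch, so the sketch returns `None` when neither confidence exceeds the threshold rather than guessing a behaviour.

```python
def fallback_v1(conf, threshold=0.2):
    """Version 1: short tail first, then chitchat, otherwise long tail."""
    if conf["short_tail"] > threshold:
        return "short_tail"
    if conf["chitchat"] > threshold:
        return "chitchat"
    return "long_tail"


def fallback_v2(conf, threshold=0.2):
    """Version 2: short tail first, then long tail, otherwise chitchat."""
    if conf["short_tail"] > threshold:
        return "short_tail"
    if conf["long_tail"] > threshold:
        return "long_tail"
    return "chitchat"


def fallback_v3(conf, threshold=0.2):
    """Version 3: combined (short tail + chitchat) system first, then long tail."""
    if conf["combined"] > threshold:
        return "combined"
    if conf["long_tail"] > threshold:
        return "long_tail"
    return None  # no final branch is defined for this version


# Hypothetical confidence scores for a single utterance.
three_way = {"short_tail": 0.15, "chitchat": 0.35, "long_tail": 0.60}
two_way = {"combined": 0.10, "long_tail": 0.60}

print(fallback_v1(three_way), fallback_v2(three_way), fallback_v3(two_way))
```

The example shows why the order of the checks matters: with the same scores, Version 1 answers from chitchat while Version 2 answers from long tail.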
The experiment for versions 1 & 2 is conducted in the following way:
• After ingestion and training, the mixture of experts’ system is tested using the
test corpus consisting of 50 example user utterances created earlier.
• The response of the systems in the form of confidence scores is retrieved and
stored across the utterances.
• The logic is then simulated using Excel, and the outputs are obtained for both
the variations with the threshold set at 0.2, which is the default used by Watson.
• After obtaining the outputs, the metric value is calculated. The metric (objective)
is then optimised using Solver to obtain the maximum value possible by varying the
threshold value (decision variable).
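The effect of varying the threshold can be illustrated with a simple sweep over candidate values. This is a stand-in for the Solver optimisation, not a reproduction of it; the confidences, golden labels and the step size of 0.01 are assumptions made for the example, and only the Version 1 logic is shown.

```python
def fallback_v1(conf, threshold):
    """Version 1 logic: short tail first, then chitchat, otherwise long tail."""
    if conf["short_tail"] > threshold:
        return "short_tail"
    if conf["chitchat"] > threshold:
        return "chitchat"
    return "long_tail"


def metric_at(threshold, rows, golden):
    """Fraction of utterances routed to their golden system at this threshold."""
    hits = sum(fallback_v1(conf, threshold) == gold for conf, gold in zip(rows, golden))
    return hits / len(golden)


def sweep_thresholds(rows, golden, steps=100):
    """Return the lowest threshold in [0, 1] that maximises the metric."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda t: metric_at(t, rows, golden))


# Hypothetical confidences and golden systems for four utterances.
rows = [
    {"short_tail": 0.70, "chitchat": 0.10, "long_tail": 0.20},
    {"short_tail": 0.30, "chitchat": 0.80, "long_tail": 0.10},
    {"short_tail": 0.25, "chitchat": 0.15, "long_tail": 0.60},
    {"short_tail": 0.40, "chitchat": 0.90, "long_tail": 0.05},
]
golden = ["short_tail", "chitchat", "long_tail", "chitchat"]

default_metric = metric_at(0.2, rows, golden)
best_threshold = sweep_thresholds(rows, golden)
print(default_metric, best_threshold, metric_at(best_threshold, rows, golden))
```

At the default threshold of 0.2 the short tail system answers almost everything in this toy data, which mirrors how a low threshold can let the first system in the chain dominate.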
The experiment for version 3 is conducted in the following two ways:
1. On the cloud service by manipulating the training data for the created assistants:
• The training data consisting of utterances is manipulated so that the utterances
and intents falling under the short tail and chitchat expert systems are
combined into a single system and long tail is kept as a separate system. The
data is re-ingested into the Watson Assistant service for testing purposes.
• After ingestion, the mixture of experts’ system is tested using the same testing
data comprising 50 example user utterances by posing these utterances to the
expert systems individually.
• The response of the systems in the form of confidence scores is retrieved
and stored across the utterances.
• The logic is then simulated using Excel with the threshold value set at 0.2 (Watson
default value), and the output for the variation is obtained.
• After obtaining the outputs, the metric value is calculated. The metric value
(objective) is then optimised using Solver to obtain the maximum value possible
by varying the threshold value (decision variable).
2. Building a high-level classifier with the training data mimicking the OOTB
Watson (Conversation AI toolkit) logic:
• The training data consisting of utterances is manipulated so that the utterances
and intents falling under the short tail and chitchat expert systems are
combined into a single system and long tail is kept as a separate system. The
data is used to build a binary classifier using Google BERT as done in
Experiment 1.
• The test corpus created, which consists of 50 unique utterances across five
conversations, is input to the BERT-based classifier, and the output is
obtained. The output is a confidence score for every utterance for each of the
two classes.
• The logic is then simulated using Excel with the threshold value set at 0.2 (Watson
default value), and the output for the variation is obtained.
• After obtaining the outputs, the metric value is calculated. The metric value
(objective) is then optimised using Solver to obtain the maximum value possible
by varying the threshold value (decision variable).
3.5.7. Experiment 6: Effect of concurrency
This experiment attempts to investigate how systems deal with concurrency, which was
a vital area of the problem, as observed in the literature review in section 2.3.3.2. The
aim is to gauge the effect of concurrency in the user input utterances within a conversation
on the confidence values and output of the mixture of experts’ system. In order to do so,
an updated test corpus has to be created which mimics the effect of concurrency.
The experiment is carried out in the following manner:
• An updated test corpus of 70 unique utterances across four conversations is created.
It is built to incorporate utterances within conversations which fall under the domain
of the same expert system (effect of concurrency).
• This test data is then input to the mixture of experts’ system. The response,
which comprises confidence scores and intents, is retrieved and stored
across the utterances.
• A new parameter is introduced to vary the effect of concurrency on the output.
This parameter boosts the confidence returned from a particular expert system
if it is concurrent to the preceding utterance.
• The expert system which returns the highest confidence at the end of the boost-
ing is selected as the system which is best placed to respond to the utterance at
that conversational turn.
• A check is performed to verify if the ‘golden system’ matches the system ob-
tained based on the confidence calculation. If the classification was performed
as expected, the output of the scenario is obtained per utterance.
• The metric for the scenario (objective) is obtained and is then optimised using Solver
to obtain the maximum value possible by varying the boost values applied to the
expert systems (decision variables).
• The optimisation is performed using both GRG Non-Linear and Evolutionary
methods to allow for comparison.
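The boosting step can be sketched as follows. Two assumptions are made here that the text does not confirm: the boost is added to the confidence (rather than multiplied), and the "preceding" system is the one selected at the previous turn. The confidences are hypothetical.

```python
def select_with_concurrency(turns, boost=0.1):
    """Select a system per turn, boosting the system selected at the previous turn.

    Assumes an additive boost and that concurrency refers to the previously
    selected system; both are illustrative assumptions.
    """
    selected = []
    previous = None
    for confidences in turns:
        scores = dict(confidences)  # copy so the raw confidences stay untouched
        if previous is not None:
            scores[previous] += boost
        previous = max(scores, key=scores.get)
        selected.append(previous)
    return selected


# Hypothetical confidences for three consecutive turns of one conversation.
turns = [
    {"short_tail": 0.70, "chitchat": 0.20, "long_tail": 0.10},
    {"short_tail": 0.42, "chitchat": 0.45, "long_tail": 0.13},
    {"short_tail": 0.20, "chitchat": 0.30, "long_tail": 0.50},
]

without_boost = select_with_concurrency(turns, boost=0.0)
with_boost = select_with_concurrency(turns, boost=0.1)
print(without_boost)
print(with_boost)
```

In the second turn the boost keeps the conversation with the short tail system even though chitchat has a marginally higher raw confidence, while the clear long tail win in the third turn is unaffected.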
4. Results
Throughout the experiments, several rule-based systems and parameters were examined that
could make these rules adaptable to the requirements and conversational state, in order to
gauge the system best placed to respond to the user utterance at a given conversational
turn. These were simulated as variations within and across scenarios, as described in
section 3.5.
The results obtained from the experiments conducted can be analysed as follows:
1. Use of unmodified confidence scores: In the experiments where the confidence scores
were used to create a rule or logic for selection of an expert system within the mixture
of experts’ conversational agent, the test corpus was used to retrieve confidence scores
from the system and then used for further simulations. The confidences thus obtained
were used directly and also in a weighted manner to observe the changes being brought
about as can be seen below in Table 4.
Scenario | Number of experts | Metric | Comments
Based on unmodified confidence scores | 3 | 0.82 |
Weighted confidence scores | 3 | 0.82 | Initial weights of 1 each
Weighted confidence scores | 3 | 0.82 | GRG optimised weights
Weighted confidence scores | 3 | 0.84 | Evolutionary optimised weights
Table 4: Results from simulations based on the use of unmodified confidence scores
2. Use of a high-level classifier: In these scenarios, a high-level classifier was modelled
to mimic the working of the expert system in the Watson cloud service and use the
output confidences derived by testing using the test corpus as a base to formulate further
rules. The confidences were both used as an unmodified value and also as a weighted
component, as seen in Table 5.
Scenario | Number of experts | Metric | Notes
High-level classifier (BERT) | 3 | 0.72 |
Weighted high-level classifier (BERT) | 3 | 0.72 | Initial weights of 1 each
Weighted high-level classifier (BERT) | 3 | 0.72 | GRG optimised weights
Weighted high-level classifier (BERT) | 3 | 0.90 | Evolutionary optimised weights
Table 5: Results from simulations based on the use of a high-level classifier
3. Emulating Watson fallback logic: The inbuilt Watson logic and rules employed by the
Watson Assistant service while acting as a multiple domain system was simulated. The
simulations were carried out in two ways:
a. Method 1: By using the confidence scores from Watson with a combined chitchat
(CC) and short tail (ST) corpus and long tail (LT) kept apart, and also by varying the
logic while keeping all three separate.
b. Method 2: By using a high-level classifier, also built on the combined data, to
investigate its performance using the test data. The confidence scores were used both
as unmodified values and as weighted components.
The outputs as follows in Table 6 can be analysed to gauge if any of the simulations
would be a good fit for creating the rules for the mixture of experts’ system.
Scenario | Number of experts | Metric | Notes
Emulating fallback logic | 2 | 0.86 | Threshold set at 0.2 (default)
Emulating fallback logic | 2 | 0.98 | Optimised threshold of 0.55
Emulating fallback logic | 3 | 0.50 | Version 1 with the threshold set at 0.2 (default)
Emulating fallback logic | 3 | 0.86 | Version 1 with an optimised threshold of 0.69
Emulating fallback logic | 3 | 0.50 | Version 2 with the threshold set at 0.2 (default)
Emulating fallback logic | 3 | 0.70 | Version 2 with an optimised threshold of 0.67
High-level classifier | 2 | 0.96 |
Weighted high-level classifier | 2 | 0.96 | Initial weights of 1 each
Weighted high-level classifier | 2 | 0.96 | GRG optimised weights
Weighted high-level classifier | 2 | 0.96 | Evolutionary optimised weights
Table 6: Results from simulations mimicking the Watson default logic
4. Testing for effect of concurrency:
A simulation was conducted to test the effect of concurrency being used as a weight in the
selection of an expert system to respond to the utterance at a particular conversational turn.
The presence of concurrency was modelled using a ‘boosting value’ which was used to
boost the value of the corresponding system’s confidence score. The testing for this sce-
nario was conducted using the extended and modified test corpus, which had situations for
concurrency incorporated into it; the output of which can be seen in Table 7.
Scenario | Number of experts | Metric | Comments
Effect of concurrency | 3 | 0.74 | Initial boosting value of 0.1
Effect of concurrency | 3 | 0.74 | GRG optimised boost values
Effect of concurrency | 3 | 0.77 | Evolutionary optimised boost values
Table 7: Results from simulation incorporating the effect of concurrency
The values of the accuracy metric improved after optimisation with Excel Solver, as can
be seen in Table 8 for a 2-system architecture and Table 9 for a 3-system architecture.
Type | Original metric | Improved metric | % Change
2 system with weights - Weighted classifier | 0.94 | 0.94 | 0.00
2 system with confidence - OOTB Watson Rules | 0.86 | 0.98 | 13.95
Table 8: Effect of optimisation on the metric for 2 expert systems
Type | Original metric | Improved metric | % Change
3 system with weights - Weighted high-level classifier | 0.72 | 0.90 | 25.00
3 system with weights - Weighted confidences | 0.82 | 0.84 | 2.44
3 system with confidence - OOTB Rules – Version 2 | 0.50 | 0.70 | 40.00
3 system with confidence - OOTB Rules – Version 1 | 0.50 | 0.86 | 72.00
Testing for concurrency | 0.74 | 0.77 | 4.05
Table 9: Effect of optimisation on the metric for 3 expert systems
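The ‘% Change’ column in Tables 8 and 9 is consistent with the usual relative-change formula. The symbols below are introduced here for illustration only, and the worked value uses the OOTB Watson Rules row of Table 8.

```latex
\[
  \%\,\text{change}
  = \frac{m_{\text{improved}} - m_{\text{original}}}{m_{\text{original}}} \times 100,
  \qquad
  \frac{0.98 - 0.86}{0.86} \times 100 \approx 13.95
\]
```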
5. Discussion & Conclusion
5.1. Qualitative evaluation
A better understanding of the work carried out can be gained by looking at the strengths
and weaknesses of the project on a qualitative basis. This should further help in identifying
opportunities for future work and any associated possibilities in the domain.
Strengths:
1. Uniqueness: Minimal work has been carried out on employing expert systems and
maintaining a balance between them in the area of conversational agents, which makes
this work unique. The research will also help the business make an informed decision
on the technologies and rules to use when building a multi-domain conversational
agent.
2. Achieving research goals: Concurrency and robust handling of user inputs emerged
as drawbacks of present systems during the literature review. Concurrency was
therefore investigated to see how its presence or absence affects the way the expert
system responds.
3. Test corpus creation: Two datasets were made from scratch for testing purposes in
the ‘Car Configurator’ use case to be put to use in the experiments. The dataset
used for extensive testing consisted of 50 user utterances across five conversations,
and the dataset used for testing concurrency comprised 70 utterances across four
conversations. These can be used for future work with minimal modifications based
on requirements.
Weaknesses:
1. Cognitive bias in testing data: Since both the training (short tail expert system) and
testing datasets were created by me while carrying out the experiments, there is a
high likelihood that cognitive bias was introduced into them. Cognitive bias is faced
by any entity trying to create a corpus of data for use in a data science problem. It
may have led to the introduction of noise in the data and to unwanted filters, which
can cause crucial aspects of the domain to be missed. The presence of such cognitive
bias can lead to misclassification by the expert systems.
2. Small data for training short tail: The corpus used for training the short tail
expert system, consisting of 16 intents and approximately 150 example utterances,
is significantly smaller than the corpora used for training the chitchat
and long-tail expert systems. However, this is precisely the problem businesses
face when trying to incorporate a short tail domain into their conversational agent,
and it is the problem I have aimed to address with this research. They have massive
datasets available for long-tail and chitchat but want to have a short tail facet ready
with minimal effort and a much smaller corpus.
3. Lack of ‘real’ user data: Another drawback which can be taken into account is the
lack of real user testing data which can be put to use for testing the performance of
the created mixture of experts’ system.
5.2. Possibility for future work
1. Two different datasets were used for testing the scenarios. The initial set had 50
utterances across five conversations, and the one updated to account for concurrency
had 70 utterances across four conversations. If more time were available, the first
step would be to conduct all experiments and test all scenarios on the updated dataset,
as it accounts for concurrency and may lead to results closer to the ‘ground reality’
in user conversations.
2. Run a beta test on real user data by opening up the platform to a small section of
users and thereby obtaining real user conversations on which further testing can be
carried out and performance metrics obtained. This could lead to a more thorough
investigation of the problem at hand and its possible solutions.
3. Currently, the mixture of experts’ system has been built using only two and three
experts based on the requirements. Another aspect to investigate in the future
would be to identify which scenario works best and how the accuracy changes with the
introduction of n systems, for example when a fourth one is introduced, and perhaps
to model a relationship between the two.
5.3. Recommendations & Findings
• A comparative analysis was performed between traditional machine learning
algorithms used predominantly for classification tasks and Google BERT, a
state-of-the-art model devised for several natural language processing tasks.
The results were highly favourable for BERT in this particular case, which had a
small training data corpus.
• Google BERT was also put to the test against a traditional ‘Conversational
Artificial Intelligence (AI)’ configuring toolkit, Watson Assistant, using the
default logic applied when creating a multi-domain conversational system. This
was done because BERT was found to perform well on a small dataset. The classifier
was found to have a slightly lower performance in comparison to the toolkit.
• The scenarios for rule building while creating a multi-domain mixture of experts’
system in the conversational agents space can be divided into two, based
on the data available and the manipulations possible on it, as follows:
o If the training data for chitchat and short tail can be merged into a single
corpus while keeping long-tail separate, the best scenario or rule to adopt
would be the Watson default logic as done in Version 3 of the emulating
fallback logic. This would entail building a mixture of experts’ system
with two domain-specific experts.
o Building a mixture of experts’ system using three separate domain-specific
experts of short tail, chitchat and long tail was found to be the best
possible scenario for cases wherein data merging is not possible. Among
those, a high-level classifier with normalised and weighted confidences
gave the best results.
• Another important finding was the improvement in result metrics when parameters
such as the weights for the system confidences or the cut-off confidences
were optimised using Solver. The average improvement was around 7% for the
2 expert system scenarios and 29% for the 3 expert system scenarios. Thus, it
can be postulated that the need for optimisation possibly increases with an
increase in the number and variety of expert systems and their weighting.
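The two averages quoted above can be checked against Tables 8 and 9 with a short calculation. The metric pairs are copied from those tables; the function names are illustrative.

```python
def pct_change(original, improved):
    """Relative improvement of the metric, in percent."""
    return (improved - original) / original * 100


def mean_pct_change(pairs):
    """Average percentage change over (original, improved) metric pairs."""
    return sum(pct_change(original, improved) for original, improved in pairs) / len(pairs)


# (original metric, improved metric) pairs from Table 8 and Table 9.
two_system = [(0.94, 0.94), (0.86, 0.98)]
three_system = [(0.72, 0.90), (0.82, 0.84), (0.50, 0.70), (0.50, 0.86), (0.74, 0.77)]

avg_two = mean_pct_change(two_system)
avg_three = mean_pct_change(three_system)
print(f"2 expert systems: {avg_two:.2f}%")
print(f"3 expert systems: {avg_three:.2f}%")
```

Rounded to whole numbers, the results agree with the figures of about 7% and 29% given in the text.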
References Colby KM, 1975. Artificial Paranoia - 1st Edition. Pergamon Press INC. Maxwell House,
New York, NY, England. Davis, B., 2018. Vodafone’s chatbot is delivering double the conversion rate of its website –
Econsultancy [WWW Document]. URL https://econsultancy.com/vodafones-chatbot- is-delivering-twice-the-conversion-rate-of-its-website/ (accessed 8.31.19).
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2018. BERT: Pre-training of Deep Bidi- rectional Transformers for Language Understanding. ArXiv181004805 Cs.
Ebrahimpour, 2009. Recognition of Persian handwritten digits using Characterization Loci and Mixture of Experts. Int. J. Digit. Content Technol. Its Appl. 3. https://doi.org/10.4156/jdcta.vol3.issue3.5
Estabrooks, A., Japkowicz, N., 2001. A Mixture-of-experts Framework for Text Classification, in: Proceedings of the 2001 Workshop on Computational Natural Language Learning - Volume 7, ConLL ’01. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 9:1–9:8. https://doi.org/10.3115/1117822.1117828
Excel Solver: Which Solving Method Should I Choose?, 2016. EngineerExcel. URL https://www.engineerexcel.com/excel-solver-solving-method-choose/ (accessed 8.16.19).
Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A.A., Lally, A., Murdock, J.W., Nyberg, E., Prager, J., Schlaefer, N., Welty, C., 2010. Building Watson: An Overview of the DeepQA Project. AI Mag. 31, 59–79. https://doi.org/10.1609/aimag.v31i3.2303
Forsyth, R., 1984. Expert systems: principles and case studies. London; New York: Chapman and Hall; New York, NY: Methuen.
Forsyth, R., n.d. The architecture of expert systems 7.
Freed, A.R., 2018. Testing Strategies for Chatbots (Part 1)— Testing Their Classifiers [WWW Document]. Medium. URL https://medium.com/ibm-watson/testing-strategies-for-chatbots-part-1-testing-their-classifiers-20becaf5f211 (accessed 8.14.19).
Hampshire, J.B., Waibel, A., 1992. The Meta-Pi network: building distributed knowledge representations for robust multisource pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 14, 751–769. https://doi.org/10.1109/34.142911
Harb, H., Chen, L., Auloge, J.-, 2004. Mixture of experts for audio classification: an application to male female classification and musical genre recognition, in: 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763). Presented at the 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763), pp. 1351–1354 Vol.2. https://doi.org/10.1109/ICME.2004.1394479
Hartikainen, M., Turunen, M., Hakulinen, J., Salonen, E.-P., Adam Funk, J., 2004. Flexible dialogue management using distributed and dynamic dialogue control.
Ignizio, J.P., 1991. Introduction to expert systems: the development and implementation of rule-based expert systems. McGraw-Hill, New York.
Ignizio, J.P., 1990. A brief introduction to expert systems. Comput. Oper. Res. 17, 523–533. https://doi.org/10.1016/0305-0548(90)90058-F
Io, H.N., Lee, C.B., 2017. Chatbots and conversational agents: A bibliometric analysis, in: 2017 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM). Presented at the 2017 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), pp. 215–219. https://doi.org/10.1109/IEEM.2017.8289883
Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E., 1991. Adaptive Mixtures of Local Experts. Neural Comput. 3, 78–88. https://doi.org/10.1162/neco.1991.3.1.79
Jacobs, R.A., Jordan, M.I., Barto, A.G., 1991. Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cogn. Sci. 15, 219–250. https://doi.org/10.1016/0364-0213(91)80006-Q
Koehler, A., 2017. Meet TOBi the chatbot: The latest addition to our customer service team [WWW Document]. Vodafone Soc. Off. Vodafone UK Blog. URL https://blog.vodafone.co.uk/2017/04/12/meet-tobi-chatbot-latest-addition-vodafone-uks-customer-service-team/ (accessed 8.31.19).
Komatani, K., Kanda, N., Nakano, M., Nakadai, K., Tsujino, H., Ogata, T., Okuno, H.G., 2006. Multi-domain Spoken Dialogue System with Extensibility and Robustness Against Speech Recognition Errors, in: Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, SigDIAL ’06. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 9–17.
Lin, B.-S., Wang, H., Fen, Q., 1999. Consistent Dialogue Across Concurrent Topics Based On An Expert System Model.
Liu, X., Eshghi, A., Swietojanski, P., Rieser, V., 2019. Benchmarking Natural Language Understanding Services for building Conversational Agents. ArXiv190305566 Cs.
Lu, Z., 2006. A regularized minimum cross-entropy algorithm on mixtures of experts for time series prediction and curve detection. Pattern Recognit. Lett. 27, 947–955. https://doi.org/10.1016/j.patrec.2005.12.002
O’Neill, I.M., Hanna, P., Liu, X., Mctear, M., 2004. Cross domain dialogue modelling: an object-based approach.
Masoudnia, S., Ebrahimpour, R., 2014. Mixture of experts: a literature survey. Artif. Intell. Rev. 42, 275–293. https://doi.org/10.1007/s10462-012-9338-y
Mossavat, S.I., Amft, O., Vries, B. de, Petkov, P.N., Kleijn, W.B., 2010. A bayesian hierarchical mixture of experts approach to estimate speech quality, in: 2010 Second International Workshop on Quality of Multimedia Experience (QoMEX). Presented at the 2010 Second International Workshop on Quality of Multimedia Experience (QoMEX), pp. 200–205. https://doi.org/10.1109/QOMEX.2010.5516203
Nakano, M., Funakoshi, K., Hasegawa, Y., Tsujino, H., 2008. A Framework for Building Conversational Agents Based on a Multi-expert Model, in: Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, SIGdial ’08. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 88–91.
NatWest begins testing AI driven ‘digital human’ in banking first [WWW Document], n.d. URL https://www.rbs.com/rbs/news/2018/02/natwest-begins-testing-ai-driven-digital-human-in-banking-first.html (accessed 8.31.19).
Niranjan, M., Saipreethy, M.S., Kumar, T.G., 2012. An intelligent question answering conversational agent using Naïve Bayesian classifier, in: 2012 IEEE International Conference on Technology Enhanced Education (ICTEE). Presented at the 2012 IEEE International Conference on Technology Enhanced Education (ICTEE), pp. 1–5. https://doi.org/10.1109/ICTEE.2012.6208614
Peng, F., Jacobs, R.A., Tanner, M.A., 1996. Bayesian Inference in Mixtures-of-Experts and Hierarchical Mixtures-of-Experts Models with an Application to Speech Recognition. J. Am. Stat. Assoc. 91, 953–960. https://doi.org/10.1080/01621459.1996.10476965
Rumney, E., 2018. British bank RBS hires “digital human” Cora on probation. Reuters.
Setiaji, B., Wibowo, F.W., 2016. Chatbot Using a Knowledge in Database: Human-to-Machine Conversation Modeling, in: 2016 7th International Conference on Intelligent Systems, Modelling and Simulation (ISMS). Presented at the 2016 7th International Conference on Intelligent Systems, Modelling and Simulation (ISMS), pp. 72–77. https://doi.org/10.1109/ISMS.2016.53
Shieber, S.M., 1994. Lessons from a Restricted Turing Test. Commun ACM 37, 70–78. https://doi.org/10.1145/175208.175217
Shum, H., He, X., Li, D., 2018. From Eliza to XiaoIce: challenges and opportunities with social chatbots. Front. Inf. Technol. Electron. Eng. 19, 10–26. https://doi.org/10.1631/FITEE.1700826
Simmons, R.F., 1970. Natural Language Question-answering Systems: 1969. Commun ACM 13, 15–30. https://doi.org/10.1145/361953.361963
Suzuki, J., Taira, H., Sasaki, Y., Maeda, E., 2003. Question Classification using HDAG Kernel, in: Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering. Association for Computational Linguistics, Sapporo, Japan, pp. 61–68. https://doi.org/10.3115/1119312.1119320
Tripathi, K.P., 2011. A Review on Knowledge-based Expert System: Concept and Architecture. Artif. Intell. Tech. 5.
Turing, A.M., 1950. I.—COMPUTING MACHINERY AND INTELLIGENCE. Mind LIX, 433–460. https://doi.org/10.1093/mind/LIX.236.433
Tzafestas, S.G., Kokkinaki, A.I., Valavanis, K.P., 1993. An Overview of Expert Systems, in: Tzafestas, S. (Ed.), Expert Systems in Engineering Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 3–24. https://doi.org/10.1007/978-3-642-84048-7_1
Versace, M., Bhatt, R., Hinds, O., Shiffer, M., 2004. Predicting the exchange traded fund DIA with a combination of genetic algorithms and neural networks. Expert Syst. Appl. 27, 417–425. https://doi.org/10.1016/j.eswa.2004.05.018
Walker, M., Aberdeen, J.S., Boland, J., Bratt, E., Garofolo, J.S., Hirschman, L., Le, A.N., Lee, S., Narayanan, S., Papineni, K., Pellom, B.L., Polifroni, J., Potamianos, A., Prabhu, P., Rudnicky, A., Sanders, G., Seneff, S., Stallard, D., Whittaker, S., 2001. DARPA communicator dialog travel planning systems: the June 2000 data collection. pp. 1371–1374.
Wallace, R.S., 2009. The Anatomy of A.L.I.C.E., in: Epstein, R., Roberts, G., Beber, G. (Eds.), Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for the Thinking Computer. Springer Netherlands, Dordrecht, pp. 181–210. https://doi.org/10.1007/978-1-4020-6710-5_13
Walter, P., Elsen, I., Muller, H., Kraiss, K.-, 1999. 3D object recognition with a specialized mixtures of experts architecture, in: IJCNN’99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339). Presented at the IJCNN’99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339), pp. 3563–3568 vol.5. https://doi.org/10.1109/IJCNN.1999.836243
Waltinger, U., Breuing, A., Wachsmuth, I., 2012. Connecting Question Answering and Conversational Agents. KI - Künstl. Intell. 26, 381–390. https://doi.org/10.1007/s13218-012-0208-1
Wang, Z., Ahmadvand, A., Choi, J.I., Karisani, P., Agichtein, E., n.d. Emersonbot: Information-Focused Conversational AI Emory University at the Alexa Prize 2017 Challenge 11.
Weigend, A.S., Mangeas, M., Srivastava, A.N., 1995. Nonlinear gated experts for time series: discovering regimes and avoiding overfitting. Int. J. Neural Syst. 06, 373–399. https://doi.org/10.1142/S0129065795000251
Weizenbaum, J., 1966. ELIZA—a Computer Program for the Study of Natural Language Communication Between Man and Machine. Commun ACM 9, 36–45. https://doi.org/10.1145/365153.365168
Williams, J.D., Young, S., 2007. Partially observable Markov decision processes for spoken dialog systems. Comput. Speech Lang. 21, 393–422. https://doi.org/10.1016/j.csl.2006.06.008
Yazdani, M., 1989. Expert Systems Principles and Case Studies, in: Forsyth, R. (Ed.). Chapman & Hall, Ltd., London, UK, pp. 173–183.
Yuksel, S.E., Wilson, J.N., Gader, P.D., 2012. Twenty Years of Mixture of Experts. IEEE Trans. Neural Netw. Learn. Syst. 23, 1177–1193. https://doi.org/10.1109/TNNLS.2012.2200299
Zhang, D., Lee, W.S., 2003. Question Classification Using Support Vector Machines, in: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’03. ACM, New York, NY, USA, pp. 26–32. https://doi.org/10.1145/860435.860443
Appendices
A. Code for obtaining output from IBM Watson Services

'''
This code reads multiple user utterances from a file and passes each of
them to all 3 expert systems. The confidences obtained from the systems
are then stored as a matrix, one row per user utterance, in the
output Excel file.
'''
import json
import ibm_watson
import pandas as pd
from ibm_watson import DiscoveryV1

# setting configurations for Watson Assistant
api_version_assistant = ''
apikey_assistant = ''
assistant_url = ''
shorttail_workspace = ''
chitchat_workspace = ''
assistant = ibm_watson.AssistantV1(
    version=api_version_assistant,
    iam_apikey=apikey_assistant,
    url=assistant_url
)

# setting configurations for long tail / Watson Discovery service
api_version_discovery = ''
apikey_discovery = ''
discovery_url = ''
environment_id = ''
collection_id = ''
discovery = DiscoveryV1(
    version=api_version_discovery,
    iam_apikey=apikey_discovery,
    url=discovery_url
)

# read csv file with input test utterances
df = pd.read_csv('inputfilepath')  # give path to file containing test utterances

# defining columns in dataframe to store confidence values
df['shorttail_confidence'] = '0'
df['shorttail_intent'] = '0'
df['chitchat_confidence'] = '0'
df['chitchat_intent'] = '0'
df['longtail_confidence'] = '0'
df['lt_conf1'] = '0'
df['lt_conf2'] = '0'
df['lt_conf3'] = '0'

# passing utterances to chitchat system
for i in range(0, len(df)):
    utterance = df['message'].loc[i]  # renamed from `input`, which shadows the built-in
    response_cc = assistant.message(
        workspace_id=chitchat_workspace,
        input={'text': utterance}
    ).get_result()
    # df.loc[row, column] avoids the chained assignment of df.column.iloc[i]
    if response_cc['intents'] == []:
        df.loc[i, 'chitchat_confidence'] = '0'
        df.loc[i, 'chitchat_intent'] = 'Invalid'
    else:
        df.loc[i, 'chitchat_confidence'] = response_cc['intents'][0]['confidence']
        df.loc[i, 'chitchat_intent'] = response_cc['intents'][0]['intent']

# passing utterances to shorttail system
for i in range(0, len(df)):
    utterance = df['message'].loc[i]
    response_st = assistant.message(
        workspace_id=shorttail_workspace,
        input={'text': utterance}
    ).get_result()
    # check the short tail response here (the original checked response_cc)
    if response_st['intents'] == []:
        df.loc[i, 'shorttail_confidence'] = '0'
        df.loc[i, 'shorttail_intent'] = 'Invalid'
    else:
        df.loc[i, 'shorttail_confidence'] = response_st['intents'][0]['confidence']
        df.loc[i, 'shorttail_intent'] = response_st['intents'][0]['intent']

# passing utterances to longtail system
for i in range(0, len(df)):
    user_input = df['message'].loc[i]
    input_text = "text:" + user_input
    query_ex = discovery.query(
        environment_id, collection_id, filter=None, query=input_text,
        natural_language_query=None, passages=True, aggregation=None,
        count=3, return_fields=None, offset=None, sort=None,
        highlight=True, passages_fields=None, passages_count=3,
        passages_characters=None, deduplicate=None, deduplicate_field=None,
        collection_ids=None, similar=None, similar_document_ids=None,
        similar_fields=None, bias=None, logging_opt_out=None)
    for j in range(0, len(query_ex.result['results'])):
        if j < 3:  # get top 3 results from discovery / longtail system
            df.loc[i, 'longtail_confidence'] = query_ex.result['results'][0]['result_metadata']['confidence']
            df.loc[i, 'lt_conf' + str(j + 1)] = query_ex.result['results'][j]['result_metadata']['confidence']
        else:
            break

# give path to store the output as an excel file for further manipulation
df.to_excel('outputfilepath', index=False)
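The response handling in the listing above can also be factored into small pure functions that are testable without calling the Watson services. The following is a minimal sketch under stated assumptions: the helper names top_intent and top_confidences are hypothetical, and the sample dictionaries only mimic the response fields that the listing reads ('intents', 'results', 'result_metadata', 'confidence').

```python
def top_intent(response):
    """Return (intent, confidence) for the first intent in an
    Assistant-style response dict, or ('Invalid', 0.0) if there is none."""
    intents = response.get("intents") or []
    if not intents:
        return "Invalid", 0.0
    best = intents[0]
    return best["intent"], float(best["confidence"])


def top_confidences(query_result, k=3):
    """Return the confidences of the first k Discovery-style results,
    padded with 0.0 when fewer than k results come back."""
    results = query_result.get("results", [])[:k]
    confs = [float(r["result_metadata"]["confidence"]) for r in results]
    return confs + [0.0] * (k - len(confs))


# Hypothetical responses shaped like the fields used in the listing.
sample_assistant = {"intents": [{"intent": "buy_car", "confidence": 0.92}]}
sample_empty = {"intents": []}
sample_discovery = {
    "results": [
        {"result_metadata": {"confidence": 0.31}},
        {"result_metadata": {"confidence": 0.12}},
    ]
}

if __name__ == "__main__":
    print(top_intent(sample_assistant))
    print(top_intent(sample_empty))
    print(top_confidences(sample_discovery))
```

Keeping the parsing separate from the API calls means the same two helpers could serve the chitchat and short tail loops, so a check against the wrong response variable in one loop is less likely.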
B. Test Data Types
a. Type 1
Conversation 1
Hi Good Morning How are you? I am looking to buy a car Can you show me red sedans please? Does it come in black? Go back one step Does the car have abs and ebd? I want to know how many mpg it gives Let's buy that
Conversation 2
Hello what's up can we chat can you help me configure a car I am looking for something in the mid 20k pound range I like green SUV's with a sunroof Yes please Does it have driver assist? How much is the insurance going to cost? Ok, let's buy that
Conversation 3
Good evening What's up? How are you? I wanna buy a car I am looking for a sedan with manual transmission What is the power of that car? Does it have 6 airbags? I like it What is the tax liability Ok, let's go ahead and send the quote
Conversation 4
Greetings Good afternoon Describe yourself How can you help I want to configure a car I am looking for a blue automatic sedan What's the wheelbase of the car? does the car have bluetooth in it? can you book a test drive for me? I'll be back later
Conversation 5
are you here? Hey Good Morning I want to customise a car Looking for something in the 40k pound range with automatic transmission Does it have dsg transmission? Perfect! Exactly what I am looking for Can I get a test drive for that nearby I am ready to buy. Please send configuration to dealer
b. Type 2
Conversation 1
Hi What are you I am looking to buy a car Can you show me an automatic red sedan please? I like the Audi a4 Does it come in black? Go back one step Does the car have abs and ebd? I want to know how many mpg it gives This is exactly what I am looking for Can I get a quote of what it's costing now Would it be possible to get a test drive tomorrow? Excellent Goodbye for now
Conversation 2
Hello can we chat help me configure a car I am looking for a something in the mid 20k pound range I want a green hatchback with a sunroof I like the kia rio I want to see that with 17-inch alloys It's beginning to move towards exactly what I am looking for Does it have driver assist? What is the power of that car? Does it have 6 airbags? How much is the insurance going to cost? How much is the cost now? Ok, let's go ahead and send the quote I'll be back later Bye
Conversation 3
Greetings Describe yourself How can you help I want to configure a car I am looking for a diesel SUV I want it in blue with black alloys Start again I want to experiment I am looking to buy a petrol convertible with automatic transmission I love the mercedes amg-gt I want to see the car in a light green colour instead of black That's perfect What's the tax liability? Does it have dsg transmission? What's the wheelbase of the car? does the car have bluetooth in it? can you book a test drive on Friday at 4pm for me? Great Work You're funny Are you real? I'll be back later bye
Conversation 4
are you here? Good afternoon I want to customise a car Looking for something in the 40k pound range with automatic transmission Which of those have Apple car in them? I'll go for the hyundai i30 I want it in dark blue How many airbags does it have? Does it have ebd? What's the mileage in mpg? That's really nice Can I get a test drive for that nearby now Yes please I am ready to buy. Please send configuration to dealer I am bored! Nah Talk to you later