Balancing relevancy across expert systems for a Conversational AI personality
Gautam Prasad
September 2019
School of Mathematics, Cardiff University
A dissertation submitted in partial fulfilment of the requirements for MSc (in Data Science and Analytics) by taught programme.
CANDIDATE’S ID NUMBER 1821536
CANDIDATE’S SURNAME PRASAD
CANDIDATE’S FULL FORENAMES GAUTAM
DECLARATION
This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree.
Signed ……………………………………………. (candidate) Date 06-09-2019
STATEMENT 1
This dissertation is being submitted in partial fulfilment of the requirements for the degree of MSc.
Signed ……………………………………………. (candidate) Date 06-09-2019
STATEMENT 2
This dissertation is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by footnotes giving explicit references. A Bibliography is appended.
Signed ……………………………………………. (candidate) Date 06-09-2019
STATEMENT 3
I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing, and for the title and summary to be made available to outside organisations.
Signed ……………………………………………. (candidate) Date 06-09-2019
STATEMENT 4 - BAR ON ACCESS APPROVED
I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing after expiry of a bar on access approved by the Graduate Development Committee.
Signed ……………………………………………. (candidate) Date 06-09-2019
Executive Summary
Companies want to interact with their customers in a way that is not limited by time, human
resource availability or language. They need to pre-empt the needs of their clientele in order to
keep them satisfied and thereby reduce customer churn. Businesses in the telecom and finance
domains, such as Vodafone, Royal Bank of Scotland (RBS) and NatWest, are investing in chatbots
to build and maintain new relationships while looking to lower overheads, costs and training time.
There are three significant types of chatbot expert systems in use: chitchat bots, short-tail
(task-focused) bots, and long-tail search or FAQ bots. These have primarily been used individually,
depending on business requirements. Chitchat systems focus on having the most natural conversation
with a user based on their inputs. Short-tail systems help the user complete a small number of
regularly performed tasks; they require a high training effort to scale but tend to provide
consistent results. Long-tail systems focus on information retrieval; they require a more
significant training effort at the start and provide a wider variety of answers at lower
confidences, but scale more efficiently. A mixture-of-experts system that combines these three
technologies into a single personality, balancing their relevancies to decide which response is
used, is of high interest to organisations. It would enable them to make better use of their
existing investments in brand personality (chitchat systems) and human-readable material
(long-tail systems), while allowing them to rapidly develop new task-focused (short-tail) systems
in a wide variety of business domains using modern API-based conversational short-tail systems.
This project looks at the best method of bringing these disparate types of systems together into
a coherent conversational personality.
The analysis looked at different scenarios: using unmodified confidence scores from the expert
systems, building a high-level classifier to determine the best system to answer, and simulating
the fallback rules used in systems like IBM Watson. The outputs from all the scenarios were
optimised using evolutionary algorithms on a prepared data set, on the same topic, that was
applicable to all three system types, before being compared using an accuracy metric to
determine the most successful strategy. The additional effect of concurrency between each user
utterance was then evaluated against these strategies.
The study concluded that merging the chitchat and short-tail training data, to reduce the number
of expert systems from three to two, and then using the fallback rules works best, provided the
confidence level for fallback is optimised for the expected data set. If the training sets of the
chitchat and short-tail systems cannot be merged, or there is a requirement to keep three separate
systems, a weighted high-level classifier performed best. Optimising the confidence levels used
improved the performance of the fallback rules by a considerable margin. The literature review
suggested that concurrency would be a crucial aspect to investigate, but its overall effect on
this data was shown to be small. Recommended next steps are beta testing with real user data, to
avoid any cognitive bias in the test and train sets, and gauging the change in performance when
the number of expert systems is increased beyond three, where a high-level classifier is expected
to perform increasingly well compared to fallback rules.
Acknowledgements
I would like to thank my supervisor, Professor Alexander Balinsky, for his timely help and for
pointing me in the right direction throughout this project. I am grateful to Mr Stephen
Broadhurst of ThinJetty Ltd for providing me with an opportunity to pursue this project at his
organisation and for the continuous mentoring and support. I would also like to acknowledge
the advice and assistance of Ms Joanna Emery and the moral support provided by my colleagues
on the MSc course.
Contents
Executive Summary
Acknowledgements
1. Introduction
2. Literature Review
2.1. Expert Systems
2.1.1. Overview
2.1.2. Typical Architecture for an expert system
2.1.3. Applications
2.2. Mixture of Experts
2.2.1. Background/ Overview
2.2.2. Applications
2.3. Conversational Agents and the application of Expert Systems
2.3.1. What are Conversational Agents?
2.3.2. How are expert systems used in chatbots?
2.3.3. Comparison of toolkits for building conversational agents
3. Methodology & Implementation
3.1. Overview
3.2. Tools setup and initialisation
3.3. Knowledgebase
3.4. Testing framework & Dataset optimisation
3.5. Scenarios
3.5.1. Overview
3.5.2. Experiment 1: High-level classifier
3.5.3. Experiment 2: Weighted High-Level Classifier
3.5.4. Experiment 3: Based on unmodified confidence scores
3.5.5. Experiment 4: Weighted confidence scores
3.5.6. Experiment 5: Emulating fallback logic
3.5.7. Experiment 6: Effect of concurrency
4. Results
5. Discussion & Conclusion
5.1. Qualitative evaluation
5.2. Possibility for future work
5.3. Recommendations & Findings
References
Appendices
1. Introduction
The project investigates the use of a mixture-of-experts system in the conversational artificial
intelligence (AI) domain and devises a rule-based or machine-learnt technique which can balance
the relevancy across the experts. The expert systems are specialised at a domain level and are
used to respond to a conversational turn in a robust and precise manner, all the while accounting
for concurrency and minimising errors. The aim is to have a mechanism that enables the rules to
adapt during the conversation, based on parameters of the conversational state or features of the
user utterance, to best judge which system must respond.
2. Literature Review
2.1. Expert Systems
2.1.1. Overview
The field of expert systems is a branch of AI which deals with developing machines which, in a
specific domain, have problem-solving abilities similar to those displayed by a human expert
in the same field, or which simulate human expert behaviour (Tzafestas et al., 1993). An expert
system differs from other forms of AI because it solves problems using domain-specific
approaches at an expert knowledge level and also provides evidence for the conclusion drawn
(Tzafestas et al., 1993). Several advantages over human experts, such as the following, have
also been highlighted over the course of time (Ignizio, 1991):
• No human-like bias involved while obtaining solutions or prescribing strategies.
• Minimal chances for occurrences of errors in calculations.
• Serves the purpose without fail on a near-constant basis.
2.1.2. Typical Architecture for an expert system
Figure 1: Typical architecture of an expert system (Forsyth, 1984; Tzafestas et al., 1993; Yazdani, 1989)
An expert system consists of the following high-level modules, as shown in Figure 1, which
form the crux of the operations of the system (Forsyth, 1984, n.d.; Ignizio, 1990; Tripathi,
2011; Tzafestas et al., 1993):
2.1.2.1. Knowledge Base
The knowledge base encompasses the domain-specific expert-level knowledge that the expert
system requires for comprehending user requirements, formulating strategies and obtaining the
necessary rule-based solutions, which can then be passed on to the inference engine for
further processing. It consists of both factual knowledge, which is the most commonly shared
form of knowledge, and heuristic knowledge, which is less widely shared, considerably more
individualistic, and acts as the reasoning for the solutions obtained.
2.1.2.2. Inference Engine
The inference engine acts as the intelligence behind the expert system and draws inferences
from user requests/utterances. It analyses and processes the rules obtained from the knowledge
base in order to arrive at a solution together with the logical reasoning behind it. In short,
it controls the interpretation and reasoning methodology of the expert system. The two most
widely used reasoning strategies are forward chaining and backward chaining. Forward chaining
starts with the data at hand and uses the inference rules to arrive at a solution, whereas
backward chaining starts with the list of goals to be attained and works its way backwards to
see if there is any data available to solve the problem and attain the goals.
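The two reasoning strategies can be illustrated with a minimal Python sketch. The rule base, fact names and function names below are hypothetical and chosen only for illustration; they are not taken from any of the cited systems.

```python
# Hypothetical rule base: each rule is (set of premises, conclusion).
RULES = [
    ({"engine_cranks", "no_fuel"}, "refuel"),
    ({"battery_flat"}, "engine_does_not_crank"),
    ({"engine_does_not_crank"}, "charge_battery"),
]


def forward_chain(facts, rules):
    """Start from the known facts and fire rules until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def backward_chain(goal, facts, rules, seen=frozenset()):
    """Start from a goal and work backwards to check whether the facts support it."""
    if goal in facts:
        return True
    if goal in seen:  # guard against circular rules on the current path
        return False
    seen = seen | {goal}
    return any(
        all(backward_chain(premise, facts, rules, seen) for premise in premises)
        for premises, conclusion in rules
        if conclusion == goal
    )


print(forward_chain({"battery_flat"}, RULES))
print(backward_chain("charge_battery", {"battery_flat"}, RULES))
```

Forward chaining derives every conclusion reachable from the data, while backward chaining only explores the rules relevant to the goal being queried.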
2.1.2.3. User Interface
A user interface is built to interact with the user by receiving user inputs in the form
of utterances and allowing the system to respond with a user-identifiable output.
2.1.3. Applications
Expert systems have varied applications, mainly in classification tasks, in the fields of
medical diagnosis, information retrieval and aligned services, engineering, human-computer
interaction, the military and robotics, amongst others (Forsyth, 1984, n.d.; Ignizio, 1990;
Tzafestas et al., 1993).
2.2. Mixture of Experts
2.2.1. Background/ Overview
Mixture of experts is a method that was introduced almost 30 years ago by Jacobs and
co-workers (R. Jacobs et al., 1991). They investigated the use of a different error function in a
mixture-of-experts system, and their approach has been highly popular in a wide range of
applications (Yuksel et al., 2012). Over the years, around 20 different studies have been
conducted on the principles, working and applications of expert systems, and to an extent the
problem was even considered to be completely solved; however, recently there has been a
resurgence of interest in using a mixture of experts for several new-age problems (Yuksel et
al., 2012).
Mixture of experts has been widely regarded as a combining method which, when put to use in
machine learning tasks, can lead to better performance and improved results (Masoudnia and
Ebrahimpour, 2014). The critical aspect of a mixture-of-experts model in any application is to
employ specialised expert systems, each returning correct answers for topics which fall under
its knowledge base, and to use a gating network across all the expert systems which helps in
reducing the errors (Jacobs et al., 1991). The basic principle is that the gating network
assigns a new input to one expert system, and the weights of only this system are changed if
the output is found to be incorrect, which removes any chance of interference with the other
expert systems (R. Jacobs et al., 1991). This also has the implication, and possible added
advantage, that each expert system is assigned only a small set of the feasible input cases
(R. Jacobs et al., 1991).
Figure 2: A system of expert and gating systems (R. Jacobs et al., 1991)
The gating network is assumed to be a stochastic one-out-of-n selector, unlike in (Hampshire
and Waibel, 1992; R. A. Jacobs et al., 1991), which is how minimal interference is achieved in
a much more straightforward manner, albeit by reconceptualising the error function so that the
expert systems challenge one another, making the whole network competitive rather than
collaborative in nature (R. Jacobs et al., 1991). An evaluative comparison was performed
between standard backpropagation networks with a single hidden layer and a mixture of experts
on a multi-speaker vowel recognition task (R. Jacobs et al., 1991). The parameters of the
models were kept approximately equal by adjusting the number of hidden layers in the
backpropagation network (R. Jacobs et al., 1991). The results showed that the mixture-of-experts
model reached the error criterion (average squared error of 0.08) considerably faster, needing
fewer epochs, while also remaining scalable as the number of experts in the system increased
(R. Jacobs et al., 1991; Masoudnia and Ebrahimpour, 2014).
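The interplay between experts and a gating network can be sketched in a few lines of Python. This is a simplified illustration for scalar outputs, not the authors' implementation: the softmax gate and the negative-log-likelihood form of the competitive error are written in the spirit of Jacobs et al. (1991), and all function names and numbers are assumptions made for the example.

```python
import math


def softmax(scores):
    """Turn raw gate scores into selection probabilities that sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_output(expert_outputs, gate_scores):
    """Blend scalar expert outputs using the gating probabilities."""
    probs = softmax(gate_scores)
    return sum(p * o for p, o in zip(probs, expert_outputs))


def competitive_error(target, expert_outputs, gate_scores):
    """Negative log of the gate-weighted likelihood of the target.

    The error is small when the gate puts its probability on an expert
    whose output is close to the target, so experts compete for cases.
    """
    probs = softmax(gate_scores)
    likelihood = sum(
        p * math.exp(-0.5 * (target - o) ** 2)
        for p, o in zip(probs, expert_outputs)
    )
    return -math.log(likelihood)


outputs = [1.0, 5.0]  # hypothetical outputs of two experts for one input
print(competitive_error(1.0, outputs, gate_scores=[5.0, 0.0]))  # gate favours the accurate expert
print(competitive_error(1.0, outputs, gate_scores=[0.0, 5.0]))  # gate favours the inaccurate expert
```

The first call yields a much smaller error than the second, which is the pressure that drives the gate towards the expert best suited to each input.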
2.2.2. Applications
Several applications have been devised over the years for mixture-of-experts systems, such as
(Yuksel et al., 2012):
• Prediction of climate (Lu, 2006), electricity demand (Weigend et al., 1995), stock prices
(Versace et al., 2004) and currency exchange rates (Coelho et al., 2003), amongst others.
• Machine-learnt classification tasks involved in
o Classification of
▪ Text (Estabrooks and Japkowicz, 2001), and
▪ Audio signals (Harb et al., 2004)
o Recognition of
▪ Handwriting (Ebrahimpour, 2009),
▪ Speech (Mossavat et al., 2010; Peng et al., 1996), and
▪ 3D objects (Walter et al., 1999).
2.3. Conversational Agents and the application of Expert Systems
2.3.1. What are Conversational Agents?
A piece of software or program that enables a machine to converse with a human user in a
natural language such as English is called a conversational agent (Io and Lee, 2017;
Weizenbaum, 1966). Since the initial research in the field in the 1960s, the most significant
challenge has been to give the machine the intelligence needed to facilitate such interactions
(Shum et al., 2018; Turing, 1950).
In a typical human conversation with a chatbot, the input from the user takes the form of one
or more natural language utterances, which the system analyses to gauge the requirements of
the user before producing a response it deems appropriate to the analysed input (Weizenbaum,
1966). This response is derived using several techniques, of which rules were predominantly
used in the earlier stages of chatbot development: the user utterance is searched for keywords
and, based on their presence, associated rules are invoked to transform the utterance (Shum et
al., 2018; Weizenbaum, 1966).
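The keyword-and-rule approach described above can be sketched as follows. The patterns and response templates are hypothetical examples written for this illustration, loosely in the style of such early systems; they are not taken from any actual script.

```python
import re

# Hypothetical keyword rules: (pattern to search for, response template).
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bmother\b", re.IGNORECASE), "Tell me more about your family."),
]
DEFAULT_RESPONSE = "Please go on."


def respond(utterance):
    """Return the response of the first rule whose keyword pattern matches."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Captured text from the utterance is reused in the response.
            return template.format(*match.groups())
    return DEFAULT_RESPONSE


print(respond("I need a holiday"))
print(respond("The weather is nice"))
```

Because the rules are tried in order and the first match wins, rule ordering acts as a crude priority scheme, which is one reason such systems struggle outside their scripted domain.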
2.3.1.1. Types of Conversational Agents
2.3.1.1.1. Chitchat
Several of the earliest systems developed, such as “Eliza”, “ALICE” and “Parry”
(Colby KM, 1975; Shieber, 1994; Wallace, 2009; Weizenbaum, 1966), were focussed on
performing as chitchat bots, conversing with users through text and audio, amongst
other media (Shum et al., 2018). These systems used rule-based pattern matching to
respond to the user’s input (Shum et al., 2018; Weizenbaum, 1966).
The chatbots were given different personalities, such as a “Rogerian Psychotherapist”
(Shum et al., 2018; Weizenbaum, 1966), a paranoid person (Colby KM, 1975; Shum et al.,
2018) and so on, but were severely limited in their capability to sustain a conversation
for a prolonged duration and had highly specific domains, which further reduced their
performance (Colby KM, 1975; Shieber, 1994; Wallace, 2009). These limitations were
partly due to the technology with which the systems were built, such as AIML for
“ALICE”, which in turn led to their failure in several evaluations such as the
“Ultimate Turing Test” (Shum et al., 2018; Wallace, 2009).
2.3.1.1.2. Task-Completion
Task-completion conversational agents were built with a focus on realising specific
tasks which fall under constrained domains (Shum et al., 2018; Walker et al., 2001;
Wang et al., n.d.). Such an agent typically gives a short, single, high-confidence
answer. A few of the most commonly seen domains were hotel or flight booking, weather
forecasting and information gathering, amongst others. In general, the system tries to
gauge the user’s ‘intents’ and then responds with actions that will complete said
intent or goal (Shum et al., 2018; Walker et al., 2001). Further improvements also
included the ability to comprehend complex dialogues with inherent variability, and
state tracking (Shum et al., 2018; Williams and Young, 2007). These systems were
evaluated on several parameters, including but not limited to (Walker et al., 2001):
• User satisfaction
• Task completion
• Task duration
• Accuracy
Telecom giant Vodafone introduced a chatbot, ‘TOBi’, a couple of years ago, which
could help users with basic tasks such as checking account details, troubleshooting
and purchasing new connections (Koehler, 2017). TOBi delivered the following metrics
(Davis, 2018):
• A conversion rate increase of more than 100% compared to their website.
• A reduction in transaction time of around 50% compared to their website.
• A usability score of 90, among the highest ever received.
Another such example is ‘Cora’, a chatbot employed by RBS and NatWest, both in the
banking domain, to answer basic banking-related queries from the user (“NatWest begins
testing AI driven ‘digital human’ in banking first,” n.d.; Rumney, 2018). This has
helped in identifying the most frequently asked questions and in significantly cutting
down queuing times.
2.3.1.1.3. Long tail or Question Answering
QA conversational agents process natural language queries raised by the user and
provide concise and relevant answers to them, thereby improving the overall
interaction between the user and the intelligent system (Simmons, 1970; Waltinger
et al., 2012). Such an agent returns a number of lower-confidence, more extended
sections of text that are mined from the corpus rather than configured. In general,
the question is first analysed and a search performed, which results in answers with
supporting evidence and a score (Ferrucci et al., 2010; Setiaji and Wibowo, 2016).
The answers thus obtained are then ranked in order of relevance before being
presented back to the user (Ferrucci et al., 2010; Setiaji and Wibowo, 2016). In
order to respond better to the user, other facets such as topic identification,
context recognition and keyword detection, amongst others, are also used in tandem
(Niranjan et al., 2012; Setiaji and Wibowo, 2016; Waltinger et al., 2012). Question
classification, in which the inherent ‘type’ of the question posed by the user is
obtained, has also been identified as a component which can improve the accuracy of
long-tail agents (Suzuki et al., 2003; Waltinger et al., 2012; Zhang and Lee, 2003).
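The search-then-rank flow described above can be illustrated with a deliberately simple Python sketch. Scoring by term overlap is an assumption made purely for illustration; it is not the relevance model used by any of the cited systems, and the passages are invented examples.

```python
import re


def tokenize(text):
    """Lower-case the text and return its set of alphanumeric terms."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_passages(question, passages):
    """Score each passage by the fraction of question terms it contains, best first."""
    question_terms = tokenize(question)
    scored = []
    for passage in passages:
        overlap = question_terms & tokenize(passage)
        score = len(overlap) / len(question_terms) if question_terms else 0.0
        scored.append((score, passage))
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical corpus passages.
PASSAGES = [
    "Test drives can be booked online.",
    "The hatchback has a boot capacity of 380 litres.",
]

for score, passage in rank_passages("What is the boot capacity of the hatchback?", PASSAGES):
    print(round(score, 2), passage)
```

Even this toy scorer shows why long-tail answers tend to arrive with lower, more graded confidences than intent classifiers: the score reflects partial evidence rather than a match to a configured answer.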
2.3.2. How are expert systems used in chatbots?
2.3.2.1. Expert systems in chatbots
Traditionally, spoken dialogue systems/conversational agents have employed
mechanisms to control the dialogue flow in order to limit the responses from the
user to a set of pre-defined or limited choices (M. O’Neill et al., 2004). In advanced
systems which could allow multi-domain interaction between user and agent, a
component was used to identify the domain or topic based on the user input and
perform the necessary action to fulfil the requirement (M. O’Neill et al., 2004).
Over time, several ‘plan-based dialogue modelling schemes’ were put forward to
build systems upon, the premise being that behind every user-system interaction
lies a particular requirement or goal of the user, which the system has to recognise
and act upon accordingly (Lin et al., 1999). To facilitate multi-domain
conversations, the entire system is configured as multiple ‘domain-specific
experts’, each capable of completing transactions in a particular domain, which work
in association with each other while switching between themselves based on user
input (M. O’Neill et al., 2004; Nakano et al., 2008). A middle layer in the system
is responsible for evaluating user utterances across all the ‘experts’ present and
determining which one has to respond to a particular utterance (Hartikainen et al.,
2004; Komatani et al., 2006; Lin et al., 1999; M. O’Neill et al., 2004).
2.3.2.2. Problems faced in addressing multi-domain conversations
A few problems have arisen while attempting to handle multi-domain conversations in
a concurrent and flexible manner, such as:
• Identifying how to handle errors in comprehending user input (Lin et al., 1999).
• Determining whether the user or the system should take the initiative to carry on
with the conversation (Lin et al., 1999).
• Tackling user initiatives in a proper and ‘consistent’ manner (Lin et al., 1999).
• Diminishing efficiency due to multiple systems working simultaneously
(Hartikainen et al., 2004).
• Inability to handle concurrent topics (Lin et al., 1999).
2.3.3. Comparison of toolkits for building conversational agents
Different toolkits which specialise in building conversational agents were looked
at, selected primarily on their capability of providing, out of the box, both a search
(long-tail) and a typical conversational agent functionality. IBM Watson, Google
Dialogflow and Microsoft LUIS had both these capabilities, with their long-tail
functionalities being Watson Discovery, Knowledge Connectors and Microsoft QnA Maker,
respectively.
In order to select the best possible toolkit, the study conducted by Liu et al. (2019) was
used. Liu et al. (2019) compared state-of-the-art conversational toolkits on the metrics
of precision, recall and F1 score, as given in Table 1.

                   Intent
Toolkit/ Metric    Precision    Recall    F1
Rasa               0.863        0.863     0.863
Dialogflow         0.87         0.859     0.864
LUIS               0.855        0.855     0.855
Watson             0.884        0.881     0.882
Table 1: Comparison of specialised toolkits (Liu et al., 2019)

As seen in Table 1, IBM Watson returns the highest F1 score for intent classification,
although there is no significant difference in the scores of the other three toolkits
(Liu et al., 2019).
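As a quick consistency check on Table 1, the F1 values can be recomputed from the reported precision and recall. This sketch assumes that each F1 score is the harmonic mean of the precision and recall in the same row; the cited study may have aggregated its scores differently.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as listed in Table 1.
TABLE_1 = {
    "Rasa": (0.863, 0.863, 0.863),
    "Dialogflow": (0.87, 0.859, 0.864),
    "LUIS": (0.855, 0.855, 0.855),
    "Watson": (0.884, 0.881, 0.882),
}

for toolkit, (precision, recall, reported) in TABLE_1.items():
    recomputed = round(f1_score(precision, recall), 3)
    print(f"{toolkit}: recomputed F1 = {recomputed}, reported F1 = {reported}")
```

Under that assumption, the recomputed values agree with the reported ones to three decimal places.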
3. Methodology & Implementation
3.1. Overview
The aim of the project is to create a rule-based or machine-learnt algorithm for weighting
the confidence and evidence returned by the expert systems, in order to determine for each
conversational turn which system is best placed to respond, and to enable these rules to
adapt during the conversation based on parameters of the conversational state or features
of the user utterance. By doing this, we also aim to recommend how businesses can best
combine ‘off the shelf’/ out-of-the-box (OOTB) chitchat with existing human-readable corpora
(long tail) and then rapidly develop domain-specific functionality.
It was decided to build a car configurator bot (a multi-domain expert system/conversational
agent) to perform the tests and analysis. The bot would be able to engage in chitchat with
the user and perform tasks involved in configuring a car, such as gathering general
requirements, booking test drives and handling other such queries. It would also act as a
question and answer bot, to which users can pose natural language queries whose answers
could help them narrow their search or enhance their knowledge of a vehicle they have in
mind. Short-tail queries would be put to the system at a higher frequency, and long-tail
ones at a significantly lower frequency. Each user utterance is passed to all three expert
systems, and the corresponding confidence scores are retrieved. Metrics such as accuracy,
precision and recall are derived and used as a baseline score to compare and to simulate
different scenarios. The whole experiment and analysis were devised to be completed using
the IBM Watson tools Assistant and Discovery, as discussed in Section 2.3.3, along with data
extraction and manipulation coded in Python and optimisation tasks carried out in Excel
using Solver.
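The step of passing each utterance to all three expert systems and collecting their confidence scores can be sketched as below. The expert functions are hypothetical stubs with hard-coded confidences standing in for calls to the real services, and picking the highest raw confidence is only the simplest possible selection rule, shown here for illustration.

```python
from typing import Callable, Dict, Tuple

# An expert takes an utterance and returns (response, confidence).
Expert = Callable[[str], Tuple[str, float]]


def route(utterance: str, experts: Dict[str, Expert]) -> Tuple[str, str, float]:
    """Send the utterance to every expert and return the most confident answer."""
    results = {name: expert(utterance) for name, expert in experts.items()}
    best = max(results, key=lambda name: results[name][1])
    response, confidence = results[best]
    return best, response, confidence


# Hypothetical stand-ins for the chitchat, short-tail and long-tail systems.
def chitchat(utterance: str) -> Tuple[str, float]:
    return "Hello there!", 0.9 if "hello" in utterance.lower() else 0.1


def short_tail(utterance: str) -> Tuple[str, float]:
    return "Let's book your test drive.", 0.8 if "test drive" in utterance.lower() else 0.2


def long_tail(utterance: str) -> Tuple[str, float]:
    return "Here is an extract from the brochure.", 0.4


EXPERTS = {"chitchat": chitchat, "short_tail": short_tail, "long_tail": long_tail}

print(route("I would like a test drive", EXPERTS))
```

Keeping the experts behind a common callable interface means alternative selection rules, such as weighted scores or fallback thresholds, can replace the `max` step without touching the experts themselves.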
3.2. Tools setup and initialisation
An account was set up in IBM Watson for using Assistant and Discovery to build the
conversational agent. The Watson Developer Cloud Python SDK is used to communicate with the
Assistant and Discovery services through the application programming interface (API)
provided, and the required dependencies and packages are also installed. Access to the
online services is gained using a combination of usernames, API keys, environment IDs,
collection IDs etc. Python is also used to extract, manipulate and further analyse the
outputs from the services’ APIs. Microsoft Excel is used to create and store the data used
for training and testing purposes, as Excel worksheets (.xlsx), comma-separated value
files (.csv) and tab-separated value files (.tsv), depending on requirements and features.
Towards the latter end of the analysis, the Solver add-in of Excel is used to optimise
metric values such as precision, recall or accuracy (the objective) based on weights or
other parameters (the constraints) as necessary. Solver has three solving methods, which
are used throughout our experiments depending on the requirements and on the improvement
in the metrics that each method brings about. In cases where more than one method is used,
a comparison is also made possible. The solver methods are as follows:
• LP Simplex: Used in cases where the problems are linear, which in turn means its
applications are restricted (“Excel Solver,” 2016). However, one of its benefits is
that the solutions obtained are always globally optimised (“Excel Solver,” 2016).
• Generalised Reduced Gradient (GRG) Non-Linear: This method is the fastest of the
non-linear methods but has a disadvantage that the solution obtained might not be
a global optimum and is also highly dependent on the initial conditions (“Excel
Solver,” 2016). It is used for smooth non-linear problems (“Excel Solver,” 2016).
• Evolutionary method: Based on the theory of natural selection, it may converge to
a solution if either the solution is the global optimum or if the population has lost
its diversity (“Excel Solver,” 2016). It is used in cases of non-smooth problems.
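To illustrate why an evolutionary method suits non-smooth objectives such as accuracy, the sketch below evolves a single confidence threshold on a small invented data set. It is a generic selection-and-mutation loop written for this example, not the algorithm implemented by Excel Solver, and the data, function names and parameter values are all assumptions.

```python
import random

# Hypothetical (confidence, answer_was_correct) pairs for one expert system.
DATA = [
    (0.92, True), (0.81, True), (0.77, True), (0.64, False),
    (0.58, True), (0.41, False), (0.33, False), (0.20, False),
]


def accuracy(threshold):
    """Accept an answer when confidence >= threshold; count correct decisions."""
    hits = sum((confidence >= threshold) == correct for confidence, correct in DATA)
    return hits / len(DATA)


def evolve(generations=50, population_size=10, mutation_scale=0.1, seed=0):
    """Evolve a threshold in [0, 1] that maximises the step-shaped accuracy."""
    rng = random.Random(seed)
    population = [rng.random() for _ in range(population_size)]
    for _ in range(generations):
        # Selection: keep the fitter half, so the best candidate is never lost.
        population.sort(key=accuracy, reverse=True)
        parents = population[: population_size // 2]
        # Mutation: perturb each parent and clamp the child to [0, 1].
        children = [
            min(1.0, max(0.0, parent + rng.gauss(0, mutation_scale)))
            for parent in parents
        ]
        population = parents + children
    return max(population, key=accuracy)


best = evolve()
print(best, accuracy(best))
```

Accuracy changes only when the threshold crosses a data point, so its gradient is zero almost everywhere; a population-based search does not need gradients and can therefore still make progress on such a surface.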
3.3. Knowledgebase
A corpus of data had to be created to train and test the conversational agent. Since the
conversational agent has three experts, namely chitchat (social), short-tail (task-oriented)
and long-tail (Q&A), a corpus was created for each of them. The data were acquired in the
following manner:
• Chitchat: A collection of close to 890 example user utterances and 59 intents was
obtained for the chitchat expert system by combining the data from:
o Watson Assistant: The inbuilt ‘General’ content catalogue, which contains
ten unique intents and close to 200 example utterances.
o Google Dialogflow: The inbuilt ‘smalltalk’ agent was exported, which contains
86 intents and around 1500 example utterances.
• Short-tail: A corpus of 16 intents with close to 150 example user utterances was
created manually for the short-tail expert system of the car configurator use case,
which included intents such as ‘#GeneralRequirements’, ‘#BookTestDrive’ etc. and
their corresponding example user utterances.
• Long-tail: 118 car brochures spread across different vehicle types, makes and models
were obtained and collated to be used as the corpus for the long-tail expert system.
The data for the chitchat and short-tail expert systems were ingested into two separate
Watson Assistants, and the data for the long-tail system was ingested into Watson Discovery,
to be used for testing and analysis. After ingestion into Discovery, the search was
optimised for relevancy by using the out-of-the-box (OOTB) relevancy training available in
Watson Discovery. The process entailed posing natural language queries to Discovery and
marking the results from the service as ‘relevant’ or ‘not relevant’ based on their
contents.
For testing the conversational agent, a dataset of 50 example user utterances across five
conversations was created manually, mimicking users who would use the chatbot to configure
a car, book a test drive or make similar requests. The utterances were written to be as
close to a real conversation as possible, with a sample response for each, and were drawn
from different expert systems so that the multi-domain conversational agent could be tested.
3.4. Testing framework & Dataset optimisation
In general, the data for conversational agents, which comprises example user utterances
and intents, is created by ‘subject matter experts’ based on ground truth (Freed, 2018).
Utterances are created and then marked with ‘entities’ and labelled with expected ‘intents’;
the corpus procured for the chit-chat workspace from the Watson and Google services was
created in the same fashion by employing APIs to crawl the web. A corpus procured in this
way needs to be tested to identify any hidden patterns and weaknesses in it, which can
then be remedied (Freed, 2018).
Testing is achieved by submitting the utterances to the classifier and checking whether
its output matches the set ‘ground truth’ (Freed, 2018). The data is split into training
and testing (blind) datasets using the k-fold cross-validation technique. After training
the classifier on the training folds, the held-out fold is used to evaluate the classifier
and obtain the required metrics. Precision and recall are used to evaluate and compare
the performance of the classifier.
The testing was done on the chitchat corpus, and the metrics were obtained at intent level.
To improve performance, the intents were sorted by recall, and those with the lowest values
were selected to be removed or fixed. The misclassified utterances were either moved to
different intents or edited to enable better classification. This process was carried out
for all the utterances falling under the intents with the lowest recall values. After a
complete overhaul, the data was re-ingested, and the testing was carried out again. The
entire process was repeated multiple times, thereby improving the overall metric values and
boosting the intents that initially had low recall. The final dataset for the chit-chat
expert system, which had a precision of 90.32% and had been reduced to 50 unique intents
and 859 utterances, was then ingested back into the Watson Assistant service.
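The intent-level review described above can be sketched in a few lines of Python. This is a minimal illustration only: the function names, intent names and labels are hypothetical and are not taken from the dissertation corpus or from the Watson tooling.

```python
from collections import Counter

def per_intent_metrics(gold, predicted):
    """Return {intent: (precision, recall)} from two parallel label lists."""
    true_pos = Counter()
    gold_count = Counter(gold)
    pred_count = Counter(predicted)
    for g, p in zip(gold, predicted):
        if g == p:
            true_pos[g] += 1
    metrics = {}
    for intent in set(gold) | set(predicted):
        precision = true_pos[intent] / pred_count[intent] if pred_count[intent] else 0.0
        recall = true_pos[intent] / gold_count[intent] if gold_count[intent] else 0.0
        metrics[intent] = (precision, recall)
    return metrics

def lowest_recall_intents(metrics, n):
    """Intents sorted by ascending recall (ties broken by name); the first n are review candidates."""
    return sorted(metrics, key=lambda intent: (metrics[intent][1], intent))[:n]

# Hypothetical ground truth and classifier output, for illustration only.
gold = ["#greeting", "#greeting", "#thanks", "#thanks", "#goodbye", "#goodbye"]
predicted = ["#greeting", "#thanks", "#thanks", "#thanks", "#goodbye", "#greeting"]

metrics = per_intent_metrics(gold, predicted)
worst = lowest_recall_intents(metrics, 2)
```

In this toy data, ‘#thanks’ has perfect recall but absorbs a ‘#greeting’ utterance, which lowers its precision; this is the kind of pattern that would prompt moving or editing utterances.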
3.5. Scenarios
3.5.1. Overview
The primary aim is to create an adaptable rule-based or machine-learnt algorithm that
weights the confidence and evidence returned by the expert systems to determine, for
each conversational turn, which system is best placed to respond. To fulfil this aim,
different scenarios have to be devised, analysed and compared. The outcome of the
comparison indicates the best algorithm to implement in order to obtain the best
concurrency and switching between the expert systems in an efficient but logical manner
based on user utterances. The scenarios thus formulated are as follows:
1. High-level classifier
2. Weighted high-level classifier
3. Based on pure-confidence values
4. Weighted system confidences
5. Emulating fallback logic
6. Testing for concurrency
The testing data created earlier, which consists of 50 unique user utterances, is input
to the mixture of experts’ system, and the confidences returned are used as a baseline to
perform the experiments and simulate the scenarios. A metric similar to accuracy was
devised to compare the scenarios on a quantitative basis; it is obtained by dividing the
number of correct classifications by the total number of classifications performed.
Investigating the metric obtained after each simulation gives a clear insight into which
expert system should be used to respond to a given utterance.
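As a minimal sketch, the metric can be written as a single function. The name `scenario_metric` and the five-turn example are hypothetical and only illustrate the calculation described above.

```python
def scenario_metric(selected, golden):
    """Fraction of turns on which the selected system equals the golden system."""
    if len(selected) != len(golden):
        raise ValueError("selected and golden must have the same length")
    if not golden:
        raise ValueError("at least one utterance is required")
    correct = sum(1 for s, g in zip(selected, golden) if s == g)
    return correct / len(golden)

# Hypothetical five-turn conversation: the third turn is routed to the wrong system.
golden = ["chitchat", "short_tail", "short_tail", "long_tail", "chitchat"]
selected = ["chitchat", "short_tail", "chitchat", "long_tail", "chitchat"]
metric = scenario_metric(selected, golden)
```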
3.5.2. Experiment 1: High-level classifier
The purpose of this experiment is to set up a high-level classifier which acts as the
‘selector’ in the mixture of experts’ system and is trained on a sample of the example
utterances spread across the short-tail, chitchat and long-tail expert systems. The data
corpus must be sampled to obtain an equal number of utterances from each class, fixed at
50 for the use case, to avoid class-imbalance bias. The ‘Pandas’ module in Python and its
built-in ‘groupby’ method are used for sampling the utterances of the three expert
systems. The sampled datasets are concatenated to obtain a single corpus of 150
utterances labelled with the three classes of short-tail, chitchat and long-tail.
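The balanced sampling step can be illustrated without any dependencies, as below. This sketch uses only the standard library rather than Pandas (with Pandas, the same step would typically be a `groupby` on the label column followed by per-group sampling); the corpus contents, field names and class sizes are invented for illustration.

```python
import random
from collections import Counter, defaultdict

def balanced_sample(rows, label_key, n_per_class, seed=0):
    """Draw n_per_class rows at random from each class and concatenate the samples."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    sample = []
    for label in sorted(by_class):
        group = by_class[label]
        if len(group) < n_per_class:
            raise ValueError(f"class {label!r} has only {len(group)} rows")
        sample.extend(rng.sample(group, n_per_class))
    return sample

# Hypothetical corpus with unequal class sizes.
corpus = (
    [{"utterance": f"chitchat example {i}", "system": "chitchat"} for i in range(120)]
    + [{"utterance": f"short tail example {i}", "system": "short_tail"} for i in range(60)]
    + [{"utterance": f"long tail example {i}", "system": "long_tail"} for i in range(90)]
)
balanced = balanced_sample(corpus, "system", n_per_class=50)
class_counts = Counter(row["system"] for row in balanced)
```

Sampling a fixed number per class, rather than a fixed fraction, is what yields the 150-utterance corpus with 50 utterances per class.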
The dataset is further split on an 80:20 ratio to be used for training and testing purposes
of the classifiers built. Different algorithms are used as the base for the classifier to
allow for the selection of the best possible classifier and are compared based on the
testing accuracy values, the algorithms being:
• Naïve Bayes: An algorithm which is based on Bayes’ theorem and assumes
independence between the predictors used in the classification.
• Linear Classifier – Logistic Regression: It uses the logistic function at its core
to determine the relationship between the dependent variable and several
independent variables.
• Support Vector Machine (SVM): SVM is a supervised algorithm which attempts
to find the best possible hyperplanes for separating the classes.
• Bagging Method - Random Forest (RF): The random forest method constructs
numerous decision trees during training, and the output class is the mode of the
classes predicted by the individual trees (the mean is used for regression). It
reduces the overfitting found in single decision tree classification.
• Boosting Method – eXtreme Gradient Boosting Model (XGBoost): A supervised
machine learning algorithm which uses an ensemble of weaker models to reduce
bias and variance.
Along with these algorithms, Google BERT (Bidirectional Encoder Representations
from Transformers), which is pre-trained in an unsupervised manner on unlabelled text,
was also used to build a classifier (Devlin et al., 2018). Google has released several
pre-trained BERT models, which can be further fine-tuned to suit the application or
requirement (Devlin et al., 2018). Since it is bidirectional, BERT takes into account
the context of a word from both its left and right sides (Devlin et al., 2018). The BERT
classification setup used here was configured for binary classification out of the box
and had to be modified to work with our use case of three classes (Devlin et al., 2018).
Also, the training, validation and testing data must be formatted to suit the input
requirements of BERT, which is done using a combination of Python coding and Excel.
The classifier was built using the following machine learning algorithms/ tools on the
3 expert system corpora and tested. The resultant accuracy metric, which is the fraction
of correctly classified samples is as follows:
| Algorithm              | Accuracy |
|------------------------|----------|
| Naïve Bayes            | 0.77     |
| Linear Classification  | 0.74     |
| Support Vector Machine | 0.77     |
| Bagging Model          | 0.67     |
| Boosting Model         | 0.69     |
| Google BERT            | 0.87     |
Table 2: High-level classifiers comparison
Google BERT gave the best accuracy values for the test data and was selected as the
best algorithm to use as the high-level classifier and to build the simulation for the
scenario. The simulation is carried out in the following manner:
• The test corpus created, which consists of 50 unique utterances across five con-
versations, is inputted to the BERT based classifier, and the output is obtained.
The output is a confidence score for every utterance for each expert system.
| Utterance   | Short tail Confidence | Chitchat Confidence | Long tail Confidence |
|-------------|-----------------------|---------------------|----------------------|
| Utterance 1 | 0.27420458            | 0.530363            | 0.19543229           |
| Utterance 2 | 0.13647898            | 0.6893639           | 0.17415714           |
| Utterance 3 | 0.25295562            | 0.4700206           | 0.27702382           |
| Utterance 4 | 0.3946402             | 0.36452472          | 0.24083503           |
| Utterance 5 | 0.3252571             | 0.38989067          | 0.2848522            |
| Utterance 6 | 0.31253842            | 0.24711472          | 0.44034687           |
| ...         | ...                   | ...                 | ...                  |
Table 3: Sample high-level classifier output
• The expert system which returns the highest confidence is selected as the system
which is best placed to respond to the utterance at that conversational turn.
• The output of the scenario is obtained per utterance by verifying whether the
expert system selected by the classifier is the same as the ‘golden system’,
which is part of the testing data and has been labelled by the subject matter
expert based on the logical response expected.
• The recall metric for the scenario is obtained for comparison purposes during
the analysis stage.
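The selection and checking steps above can be sketched as follows, using the six confidence rows shown in Table 3. The golden labels in this sketch are hypothetical (the dissertation does not list them next to Table 3), so the resulting metric is illustrative only.

```python
SYSTEMS = ("short_tail", "chitchat", "long_tail")

# Confidence rows copied from Table 3, in the order (short tail, chitchat, long tail).
table3 = [
    (0.27420458, 0.530363, 0.19543229),
    (0.13647898, 0.6893639, 0.17415714),
    (0.25295562, 0.4700206, 0.27702382),
    (0.3946402, 0.36452472, 0.24083503),
    (0.3252571, 0.38989067, 0.2848522),
    (0.31253842, 0.24711472, 0.44034687),
]

def select_system(confidences):
    """Return the name of the expert system with the highest confidence."""
    best_index = max(range(len(confidences)), key=lambda i: confidences[i])
    return SYSTEMS[best_index]

selected = [select_system(row) for row in table3]

# Hypothetical golden labels, assumed here purely to demonstrate the check.
golden = ["chitchat", "chitchat", "short_tail", "short_tail", "chitchat", "long_tail"]
matches = [s == g for s, g in zip(selected, golden)]
metric = sum(matches) / len(matches)
```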
3.5.3. Experiment 2: Weighted High-Level Classifier
In this experiment, the confidences obtained from the high-level classifier built on
Google BERT (as in Experiment 1) are weighted (as in Experiment 4) to see if this
brings about a beneficial change in the output of the mixture of experts’ system.
The experiment is carried out in the following manner:
• The confidence scores from the classifier were obtained as in Experiment 1 and
were further weighted (giving a bias to the confidence scores obtained from
each system, as in Experiment 4), and the results were obtained for an equal
weight of 1 across all systems.
• The expert system which returns the highest weighted confidence is selected as
the system which is best placed to respond to the utterance at that conversational
turn.
• The output of the scenario is obtained per utterance by verifying whether the
expert system selected after weighting is the same as the ‘golden system’, which
is part of the testing data and has been labelled by the subject matter expert
based on the logical response expected.
• The metric for the scenario (the objective) is obtained and is then maximised
using Solver by varying the weights assigned to the expert systems (the
decision variables).
• The optimisation is performed using both GRG Non-Linear and Evolutionary
methods to allow for comparison.
• The weights obtained after optimisation are further normalised to make them
easier to interpret. Normalisation is done by fixing the weight of one system at
1 (in this case short tail, since the business problem is to focus on the short
tail and bring in the other expert systems without changing its confidence). The
normalised weights for the other expert systems are then obtained by dividing
their current values by the pre-normalisation short-tail weight.
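A small sketch of the weighting and normalisation steps is given below. It assumes that a weight multiplies the confidence of its system, which the text does not state explicitly; the weight values are hypothetical, and the confidences are those of Utterance 3 in Table 3.

```python
def weighted_select(confidences, weights):
    """Pick the system with the highest weight * confidence (multiplicative weighting is an assumption)."""
    return max(confidences, key=lambda system: weights[system] * confidences[system])

def normalise_weights(weights, reference="short_tail"):
    """Rescale all weights so that the reference system has a weight of exactly 1."""
    ref = weights[reference]
    if ref <= 0:
        raise ValueError("reference weight must be positive")
    return {system: w / ref for system, w in weights.items()}

# Hypothetical optimised weights (not values reported in the dissertation).
weights = {"short_tail": 0.8, "chitchat": 0.6, "long_tail": 1.2}
normalised = normalise_weights(weights)
equal = {system: 1.0 for system in weights}

# Utterance 3 from Table 3.
utterance = {"short_tail": 0.25295562, "chitchat": 0.4700206, "long_tail": 0.27702382}

choice_equal = weighted_select(utterance, equal)
choice_weighted = weighted_select(utterance, weights)
choice_normalised = weighted_select(utterance, normalised)
```

Under this multiplicative assumption, dividing every weight by the same positive number rescales all weighted scores equally, so normalisation changes how the weights read but not which system is selected.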
3.5.4. Experiment 3: Based on unmodified confidence scores
This experiment aims to simulate a scenario wherein the system used to respond is
selected using only the unmodified confidence scores returned by the three expert
systems: chitchat and short tail, hosted in Watson Assistant, and long tail, hosted in
Watson Discovery. It also aims to decide whether this approach is suitable for enabling
the algorithm to adapt to changes during the conversation, based on parameters related
to the conversational state or features of the user utterance, to best judge which
system must respond.
The experiment is carried out in the following manner:
• The mixture of experts’ system is tested using the testing data comprising 50
example user utterances by posing these utterances to all three expert systems
individually.
• The response of the systems in the form of confidence scores is retrieved and is
stored across the utterances.
• The expert system which returns the highest confidence is selected as the system
which is best placed to respond to the utterance at that conversational turn.
• The next stage of the simulation is carried out in Excel. A check is performed
to verify whether the ‘golden system’ matches the system selected on the basis
of the confidence scores; the result of this check is the output of the scenario
for each utterance.
• The metric for the scenario is obtained for comparison purposes during the anal-
ysis stage.
3.5.5. Experiment 4: Weighted confidence scores
In this experiment, the confidences obtained from the expert systems are weighted in
an attempt to see if this brings about a beneficial change in the output of the mixture of
experts’ system.
The experiment is carried out in the following manner:
• The confidence scores are obtained from the three expert systems as in Experi-
ment 3.
• The scores are then weighted in order to better classify the input utterances. The
weights are assigned an equal value of 1 to start with.
• A check is performed to verify whether the ‘golden system’ matches the system
selected on the basis of the weighted confidence scores; the result of this
check is the output of the scenario for each utterance.
• The metric for the scenario (the objective) is obtained and is then maximised
using Solver by varying the weights assigned to the expert systems (the
decision variables).
• The optimisation is performed using both GRG Non-Linear and Evolutionary
methods to allow for comparison.
• The weights obtained after optimisation are further normalised to make them
easier to interpret. Normalisation is done by fixing the weight of one system at
1 (in this case short tail, since the business problem is to focus on the short
tail and bring in the other expert systems without changing its confidence). The
normalised weights for the other expert systems are then obtained by dividing
their current values by the pre-normalisation short tail weight.
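The optimisation step can be illustrated with a simple random search over the weights. This is a stand-in written for illustration and is not the algorithm implemented by Excel Solver's Evolutionary or GRG Non-Linear methods; the confidence rows, golden labels, search range and multiplicative weighting are all assumptions.

```python
import random

SYSTEMS = ("short_tail", "chitchat", "long_tail")

def accuracy(rows, golden, weights):
    """Share of utterances for which the weighted arg-max equals the golden system."""
    correct = 0
    for confidences, gold in zip(rows, golden):
        scores = [w * c for w, c in zip(weights, confidences)]
        if SYSTEMS[scores.index(max(scores))] == gold:
            correct += 1
    return correct / len(rows)

def random_search(rows, golden, trials=2000, seed=0):
    """Keep the best of `trials` random weight vectors, starting from equal weights."""
    rng = random.Random(seed)
    best_weights = (1.0, 1.0, 1.0)
    best_metric = accuracy(rows, golden, best_weights)
    for _ in range(trials):
        candidate = tuple(rng.uniform(0.1, 2.0) for _ in SYSTEMS)
        metric = accuracy(rows, golden, candidate)
        if metric > best_metric:
            best_weights, best_metric = candidate, metric
    return best_weights, best_metric

# Hypothetical confidences (short tail, chitchat, long tail) and golden systems.
rows = [
    (0.60, 0.30, 0.10), (0.40, 0.45, 0.15), (0.20, 0.70, 0.10),
    (0.30, 0.35, 0.50), (0.35, 0.40, 0.25), (0.10, 0.80, 0.30),
]
golden = ["short_tail", "short_tail", "chitchat", "long_tail", "short_tail", "chitchat"]

baseline = accuracy(rows, golden, (1.0, 1.0, 1.0))
best_weights, best_metric = random_search(rows, golden)
```

The metric only changes when a change in weights flips which system wins for some utterance, so it is flat almost everywhere. A gradient-based method started from equal weights may therefore find no improving direction, whereas a population-based or random search can still move to a better region; this is offered as a possible explanation of differences between the two Solver methods, not as a claim about Solver's internals.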
3.5.6. Experiment 5: Emulating fallback logic
This experiment attempts to mimic the fallback logic employed by Watson in the
out-of-the-box cloud service. The logic can be defined as follows, with three
significant variations:
• Version 1:
  o if ‘short tail confidence’ > ‘threshold confidence’:
    ▪ the short tail system must respond
  o else if ‘chitchat confidence’ > ‘threshold confidence’:
    ▪ the chitchat system must respond
  o else:
    ▪ the long tail system must respond
• Version 2:
  o if ‘short tail confidence’ > ‘threshold confidence’:
    ▪ the short tail system must respond
  o else if ‘longtail confidence’ > ‘threshold confidence’:
    ▪ the long tail system must respond
  o else:
    ▪ the chitchat system must respond
• Version 3 (2 expert systems):
  o if ‘combined confidence’ > ‘threshold confidence’:
    ▪ the combined system must respond
  o else if ‘longtail confidence’ > ‘threshold confidence’:
    ▪ the long tail system must respond
The experiment for versions 1 & 2 is conducted in the following way:
• After ingestion and training, the mixture of experts’ system is tested using the
test corpus consisting of 50 example user utterances created earlier.
• The response of the systems in the form of confidence scores is retrieved and
stored across the utterances.
• The logic is then simulated using Excel, and the outputs are obtained for both
the variations with the threshold set at 0.2, which is the default used by Watson.
• After obtaining the outputs, the metric value is calculated. The metric (the
objective) is then maximised using Solver by varying the threshold value (the
decision variable).
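Versions 1 and 2 of the fallback logic, together with the threshold optimisation, can be sketched as below. The grid search is a simple stand-in for Solver, and the confidence values and golden labels are hypothetical; only the rule structure and the 0.2 default come from the text.

```python
def fallback_select(confidences, threshold, checked, default):
    """Return the first system in `checked` whose confidence exceeds the threshold, else `default`."""
    for system in checked:
        if confidences[system] > threshold:
            return system
    return default

# (systems checked in priority order, system used when none exceeds the threshold)
VERSIONS = {
    1: (("short_tail", "chitchat"), "long_tail"),
    2: (("short_tail", "long_tail"), "chitchat"),
}

def metric_for(rows, golden, threshold, version):
    """Accuracy-like metric of one fallback version at one threshold."""
    checked, default = VERSIONS[version]
    hits = sum(
        fallback_select(row, threshold, checked, default) == gold
        for row, gold in zip(rows, golden)
    )
    return hits / len(rows)

def best_threshold(rows, golden, version, steps=100):
    """Grid search over 0.00, 0.01, ..., 1.00; ties are resolved towards the lowest threshold."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda t: (metric_for(rows, golden, t, version), -t))

# Hypothetical confidences and golden systems, for illustration only.
rows = [
    {"short_tail": 0.90, "chitchat": 0.10, "long_tail": 0.30},
    {"short_tail": 0.30, "chitchat": 0.80, "long_tail": 0.20},
    {"short_tail": 0.25, "chitchat": 0.15, "long_tail": 0.60},
    {"short_tail": 0.70, "chitchat": 0.30, "long_tail": 0.40},
    {"short_tail": 0.345, "chitchat": 0.60, "long_tail": 0.10},
]
golden = ["short_tail", "chitchat", "long_tail", "short_tail", "chitchat"]

default_metric = metric_for(rows, golden, 0.2, version=1)
tuned = best_threshold(rows, golden, version=1)
tuned_metric = metric_for(rows, golden, tuned, version=1)
```

Version 3 fits the same function with `checked=("combined", "long_tail")`; since the text gives no final `else` branch for it, `default=None` would be one way to represent the case where neither system exceeds the threshold.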
The experiment for version 3 is conducted in the following two ways:
1. On the cloud service by manipulating the training data for the created assistants:
• The training data consisting of utterances is manipulated so that the utter-
ances and intents falling under short tail and chitchat expert systems are
combined into a single system and long tail is kept as a separate system. The
data is re-ingested into the Watson Assistant service for testing purposes.
• After ingestion, the mixture of experts’ system is tested using the same
testing data comprising 50 example user utterances by posing these
utterances to both expert systems (the combined system and long tail)
individually.
• The response of the systems in the form of confidence scores is retrieved
and is stored across the utterances.
• The logic is then simulated with the threshold value set at 0.2 (Watson de-
fault value) using Excel and the output for the variation is obtained.
• After obtaining the outputs, the metric value is calculated. The metric (the
objective) is then maximised using Solver by varying the threshold value
(the decision variable).
2. Building a high-level classifier with the training data mimicking the OOTB
Watson (Conversation AI toolkit) logic:
• The training data consisting of utterances is manipulated so that the utter-
ances and intents falling under short tail and chitchat expert systems are
combined into a single system and long tail is kept as a separate system. The
data is used to build a binary classifier using Google BERT as done in Ex-
periment 1.
• The test corpus created, which consists of 50 unique utterances across five
conversations, is inputted to the BERT based classifier, and the output is
obtained. The output is a confidence score for every utterance concerning
either class of data.
• The logic is then simulated with the threshold value set at 0.2 (Watson
default value) using Excel, and the output is obtained for the variation.
• After obtaining the outputs, the metric value is calculated. The metric (the
objective) is then maximised using Solver by varying the threshold value
(the decision variable).
3.5.7. Experiment 6: Effect of concurrency
This experiment investigates how the system deals with concurrency, which was a vital
area of the problem, as observed in the literature review in section 2.3.3.2. The aim
is to gauge the effect of concurrency in the user input utterances within a conversation
on the confidence values and output of the mixture of experts’ system. In order to do
so, an updated test corpus has to be created which mimics the effect of concurrency.
The experiment is carried out in the following manner:
• An updated test corpus of 70 unique utterances across four conversations is
created. It is constructed so that successive utterances within a conversation
fall under the domain of the same expert system (the effect of concurrency).
• This test data is then input to the mixture of experts’ system. The response,
which comprises confidence scores and intents, is retrieved and stored across
the utterances.
• A new parameter is introduced to vary the effect of concurrency on the output.
This parameter boosts the confidence returned from a particular expert system
if it is concurrent to the preceding utterance.
• The expert system which returns the highest confidence at the end of the boost-
ing is selected as the system which is best placed to respond to the utterance at
that conversational turn.
• A check is performed to verify whether the ‘golden system’ matches the system
selected on the basis of the boosted confidence scores; the result of this check
is the output of the scenario for each utterance.
• The metric for the scenario (the objective) is obtained and is then maximised
using Solver by varying the boosting values applied to the expert systems (the
decision variables).
• The optimisation is performed using both GRG Non-Linear and Evolutionary
methods to allow for comparison.
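The boosting step can be sketched as follows. The sketch assumes that the boost is added to the confidence of whichever system was selected for the preceding turn, and that a single boost value is shared by all systems; the text does not specify either detail, and the conversation data is hypothetical.

```python
def select_with_boost(conversation, boost):
    """Select a system per turn, boosting the system that answered the preceding turn."""
    selected = []
    previous = None
    for confidences in conversation:
        adjusted = dict(confidences)
        if previous is not None:
            adjusted[previous] += boost  # additive boost: an assumption of this sketch
        choice = max(adjusted, key=adjusted.get)
        selected.append(choice)
        previous = choice
    return selected

# Hypothetical four-turn conversation.
conversation = [
    {"short_tail": 0.60, "chitchat": 0.30, "long_tail": 0.20},
    {"short_tail": 0.40, "chitchat": 0.45, "long_tail": 0.20},
    {"short_tail": 0.20, "chitchat": 0.30, "long_tail": 0.70},
    {"short_tail": 0.30, "chitchat": 0.20, "long_tail": 0.28},
]

without_boost = select_with_boost(conversation, boost=0.0)
with_boost = select_with_boost(conversation, boost=0.1)
```

With the boost, turns 2 and 4 stay with the system that handled the preceding turn, because a narrow lead by another system is no longer enough to trigger a switch.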
4. Results
Throughout the experiments, several rule-based systems were examined, together with
parameters that could make these rules adaptable to the requirements and the
conversational state, in order to gauge the system best placed to respond to the user
utterance at a given conversational turn. These were simulated as variations within and
across scenarios, as described in section 3.5.
The results obtained from the experiments conducted can be analysed as follows:
1. Use of unmodified confidence scores: In the experiments where the confidence scores
were used to create a rule or logic for selection of an expert system within the mixture
of experts’ conversational agent, the test corpus was used to retrieve confidence scores
from the system and then used for further simulations. The confidences thus obtained
were used directly and also in a weighted manner to observe the changes being brought
about as can be seen below in Table 4.
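| Scenario                              | Number of experts | Metric | Comments                       |
|---------------------------------------|-------------------|--------|--------------------------------|
| Based on unmodified confidence scores | 3                 | 0.82   |                                |
| Weighted confidence scores            | 3                 | 0.82   | Initial weights of 1 each      |
| Weighted confidence scores            | 3                 | 0.82   | GRG optimised weights          |
| Weighted confidence scores            | 3                 | 0.84   | Evolutionary optimised weights |
Table 4: Results from simulations based on the use of unmodified confidence scores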
2. Use of a high-level classifier: In these scenarios, a high-level classifier was modelled
to mimic the working of the expert system in the Watson cloud service and use the
output confidences derived by testing using the test corpus as a base to formulate further
rules. The confidences were both used as an unmodified value and also as a weighted
component, as seen in Table 5.
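| Scenario                              | Number of experts | Metric | Notes                          |
|---------------------------------------|-------------------|--------|--------------------------------|
| High level classifier (BERT)          | 3                 | 0.72   |                                |
| Weighted high-level classifier (BERT) | 3                 | 0.72   | Initial weights of 1 each      |
| Weighted high-level classifier (BERT) | 3                 | 0.72   | GRG optimised weights          |
| Weighted high-level classifier (BERT) | 3                 | 0.90   | Evolutionary optimised weights |
Table 5: Results from simulations based on the use of a high-level classifier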
3. Emulating Watson fallback logic: The inbuilt Watson logic and rules employed by the
Watson Assistant service while acting as a multiple domain system was simulated. The
simulations were carried out in two ways:
a. Method 1: By using the confidence scores from Watson with a combined chitchat
(CC) and short tail (ST) corpus and long tail (LT) kept apart, and also by
varying the logic while keeping all three separate.
b. Method 2: By using a high-level classifier, also built on the combined data, to
investigate its performance on the test data. The confidence scores were used
both as unmodified values and as weighted components.
The outputs in Table 6 can be analysed to gauge whether any of the simulations
would be a good fit for creating the rules for the mixture of experts’ system.
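| Scenario                       | Number of experts | Metric | Notes                                             |
|--------------------------------|-------------------|--------|---------------------------------------------------|
| Emulating fallback logic       | 2                 | 0.86   | Threshold set at 0.2 (default)                    |
| Emulating fallback logic       | 2                 | 0.98   | Optimised threshold of 0.55                       |
| Emulating fallback logic       | 3                 | 0.50   | Version 1 with the threshold set at 0.2 (default) |
| Emulating fallback logic       | 3                 | 0.86   | Version 1 with an optimised threshold of 0.69     |
| Emulating fallback logic       | 3                 | 0.50   | Version 2 with the threshold set at 0.2 (default) |
| Emulating fallback logic       | 3                 | 0.70   | Version 2 with an optimised threshold of 0.67     |
| High-level classifier          | 2                 | 0.96   |                                                   |
| Weighted high-level classifier | 2                 | 0.96   | Initial weights of 1 each                         |
| Weighted high-level classifier | 2                 | 0.96   | GRG optimised weights                             |
| Weighted high-level classifier | 2                 | 0.96   | Evolutionarily optimised weights                  |
Table 6: Results from simulations mimicking the Watson default logic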
4. Testing for effect of concurrency:
A simulation was conducted to test the effect of concurrency being used as a weight in the
selection of an expert system to respond to the utterance at a particular conversational turn.
The presence of concurrency was modelled using a ‘boosting value’ which was used to
boost the value of the corresponding system’s confidence score. The testing for this sce-
nario was conducted using the extended and modified test corpus, which had situations for
concurrency incorporated into it; the output of which can be seen in Table 7.
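| Scenario              | Number of Experts | Metric | Comments                            |
|-----------------------|-------------------|--------|-------------------------------------|
| Effect of concurrency | 3                 | 0.74   | Initial boosting value of 0.1       |
| Effect of concurrency | 3                 | 0.74   | GRG optimised boost values          |
| Effect of concurrency | 3                 | 0.77   | Evolutionary optimised boost values |
Table 7: Results from simulation incorporating the effect of concurrency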
Using Excel Solver for optimisation was found to improve the values of the accuracy
metric, as can be seen in Table 8 for the 2-system architecture and Table 9 for the
3-system architecture.
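| Type                                         | Original metric | Improved metric | % Change |
|----------------------------------------------|-----------------|-----------------|----------|
| 2 system with weights - Weighted classifier  | 0.94            | 0.94            | 0.00     |
| 2 system with confidence - OOTB Watson Rules | 0.86            | 0.98            | 13.95    |
Table 8: Effect of optimisation on the metric for 2 expert systems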
| Type                                                   | Original metric | Improved metric | % Change |
|--------------------------------------------------------|-----------------|-----------------|----------|
| 3 system with weights - Weighted high-level classifier | 0.72            | 0.90            | 25.00    |
| 3 system with weights - Weighted confidences           | 0.82            | 0.84            | 2.44     |
| 3 system with confidence - OOTB Rules – Version 2      | 0.50            | 0.70            | 40.00    |
| 3 system with confidence - OOTB Rules – Version 1      | 0.50            | 0.86            | 72.00    |
| Testing for concurrency                                | 0.74            | 0.77            | 4.05     |
Table 9: Effect of optimisation on the metric for 3 expert systems
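The ‘% Change’ column is the relative improvement of the metric, that is, (improved - original) / original × 100. The short sketch below recomputes the column and its per-architecture averages from the metric values in Tables 8 and 9; the function and variable names are illustrative.

```python
def percent_change(original, improved):
    """Relative improvement of the metric, in percent."""
    return (improved - original) / original * 100

# (original metric, improved metric) pairs copied from Table 8 and Table 9.
two_system = {
    "Weighted classifier": (0.94, 0.94),
    "OOTB Watson Rules": (0.86, 0.98),
}
three_system = {
    "Weighted high-level classifier": (0.72, 0.90),
    "Weighted confidences": (0.82, 0.84),
    "OOTB Rules - Version 2": (0.50, 0.70),
    "OOTB Rules - Version 1": (0.50, 0.86),
    "Testing for concurrency": (0.74, 0.77),
}

changes_2 = {name: percent_change(*pair) for name, pair in two_system.items()}
changes_3 = {name: percent_change(*pair) for name, pair in three_system.items()}

average_2 = sum(changes_2.values()) / len(changes_2)
average_3 = sum(changes_3.values()) / len(changes_3)
```

The two averages come out at roughly 7% and 29% respectively.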
5. Discussion & Conclusion
5.1. Qualitative evaluation
A better understanding of the work carried out can be gained by looking at the strengths
and weaknesses of the project on a qualitative basis. This should further help in identifying
opportunities for future work and any associated possibilities in the domain.
Strengths:
1. Uniqueness: Minimal work has been carried out on employing expert systems, and
maintaining a balance between them, in the area of conversational agents, which
makes this work unique in that respect. The research will also help the business
make an informed decision on the technologies and rules to use when building a
multi-domain conversational agent.
2. Achieving research goals: Concurrency and the robust handling of user inputs
emerged as drawbacks of present systems during the literature review, and the
effect of the presence and absence of concurrency on how the expert system
responds was therefore investigated.
3. Test corpus creation: Two datasets were made from scratch for testing purposes in
the use case of a ‘Car Configurator’ to be put to use in the experiments. The dataset
used for extensive testing consisted of 50 user utterances across five conversations,
and the dataset used for testing concurrency comprised 70 utterances across four
conversations. These can be used for future work with minimal modifications based
on requirements.
Weaknesses:
1. Cognitive Bias in testing data: Since both the training (short tail expert system) and
testing datasets were created by me during the process of carrying out the experi-
ments, there is a high likelihood of cognitive bias being introduced into the same.
Cognitive bias is something which is faced by any entity trying to create a corpus
of data for use in a data science problem. This may have led to the introduction of
noise in data and the presence of unwanted filters which can sometimes cause cru-
cial aspects in the domain to be missed out. The presence of such cognitive bias can
lead to misclassification by the expert systems.
2. Small data for training short tail: The corpus used for training the short tail
expert system, consisting of 16 intents and approximately 150 example utterances,
is significantly smaller than the corpora used for training the chitchat and
long-tail expert systems. However, this is precisely the problem businesses are
facing when trying to incorporate a short tail domain for their conversational agent
and is the problem I have aimed to address with this research. They have massive
datasets available for long-tail and chitchat but want to have a short tail facet ready
with minimal effort and a much smaller corpus.
3. Lack of ‘real’ user data: Another drawback which can be taken into account is the
lack of real user testing data which can be put to use for testing the performance of
the created mixture of experts’ system.
5.2. Possibility for future work
1. Two different datasets were used for testing the scenarios. The initial set had 50
utterances across five conversations, and the one updated to account for concurrency
had 70 utterances across four conversations. If there were more time available, the
possible first step to be taken would be to conduct all experiments and test all
scenarios on the updated dataset as it accounts for concurrency and may lead to
results closer to the ‘ground reality’ in user conversations.
2. Run a beta test on real user data by opening up the platform to a small section of
users and thereby obtaining real user conversations upon which further testing can be
carried out and performance metrics can be obtained. This could lead to a more
thorough investigation of the problem at hand and its possible solutions.
3. Currently, the mixture of expert systems has been built using only two and three
experts based on the requirements. Another aspect to investigate in the future
would be to identify which scenario would work best and how the accuracy will
change with the introduction of n-systems, for example when a fourth one is
introduced, and to perhaps model a relationship between the same.
5.3. Recommendations & Findings
• A comparative analysis was performed between traditional machine learning
algorithms used predominantly for classification tasks and Google BERT, a
state-of-the-art model devised for several natural language processing tasks.
The results were highly favourable for BERT in this particular case, which had
a small training data corpus.
• Google BERT was also put to the test against a traditional ‘Conversational
Artificial Intelligence (AI)’ configuring toolkit, Watson Assistant, using the
default logic utilised while creating a multi-domain conversational system. This
was done because BERT was found to perform well on small datasets. The classifier
was found to have a slightly lower performance than the toolkit (0.96 against
0.98 in Table 6).
• The scenarios for rule building while creating a multi-domain mixture of ex-
perts’ system in the conversational agent’s space can be divided into two based
on the data available and the manipulations possible on the same as follows:
o If the training data for chitchat and short tail can be merged into a single
corpus while keeping long-tail separate, the best scenario or rule to adopt
would be the Watson default logic as done in Version 3 of the emulating
fallback logic. This would entail building a mixture of experts’ system
with two domain-specific experts.
o Building a mixture of experts’ system using three separate domain-specific
experts (short tail, chitchat and long tail) was found to be the best
possible scenario for cases wherein data merging is not possible. Among
those, a high-level classifier with normalised and weighted confidences
gave the best results.
• Another important finding was the improvement in the result metrics when
parameters such as the weights for the system confidences or the cut-off
confidences were optimised using Solver. The average improvement was around 7%
for the 2 expert system scenarios and 29% for the 3 expert system scenarios.
Thus, it can be postulated that the need for optimisation possibly increases with
an increase in the number and variety of expert systems and their weighting.
References Colby KM, 1975. Artificial Paranoia - 1st Edition. Pergamon Press INC. Maxwell House,
New York, NY, England. Davis, B., 2018. Vodafone’s chatbot is delivering double the conversion rate of its website –
Econsultancy [WWW Document]. URL https://econsultancy.com/vodafones-chatbot- is-delivering-twice-the-conversion-rate-of-its-website/ (accessed 8.31.19).
Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2018. BERT: Pre-training of Deep Bidi- rectional Transformers for Language Understanding. ArXiv181004805 Cs.
Ebrahimpour, 2009. Recognition of Persian handwritten digits using Characterization Loci and Mixture of Experts. Int. J. Digit. Content Technol. Its Appl. 3. https://doi.org/10.4156/jdcta.vol3.issue3.5
Estabrooks, A., Japkowicz, N., 2001. A Mixture-of-experts Framework for Text Classifica- tion, in: Proceedings of the 2001 Workshop on Computational Natural Language Learning - Volume 7, ConLL ’01. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 9:1–9:8. https://doi.org/10.3115/1117822.1117828
Excel Solver: Which Solving Method Should I Choose?, 2016. . EngineerExcel. URL https://www.engineerexcel.com/excel-solver-solving-method-choose/ (accessed 8.16.19).
Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A.A., Lally, A., Murdock, J.W., Nyberg, E., Prager, J., Schlaefer, N., Welty, C., 2010. Building Wat- son: An Overview of the DeepQA Project. AI Mag. 31, 59–79. https://doi.org/10.1609/aimag.v31i3.2303
Forsyth, R., 1984. Expert systems : principles and case studies. London ; New York : Chap- man and Hall ; New York, NY : Methuen.
Forsyth, R., n.d. The architecture of expert systems 7. Freed, A.R., 2018. Testing Strategies for Chatbots (Part 1)— Testing Their Classifiers
[WWW Document]. Medium. URL https://medium.com/ibm-watson/testing-strate- gies-for-chatbots-part-1-testing-their-classifiers-20becaf5f211 (accessed 8.14.19).
Hampshire, J.B., Waibel, A., 1992. The Meta-Pi network: building distributed knowledge representations for robust multisource pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 14, 751–769. https://doi.org/10.1109/34.142911
Harb, H., Chen, L., Auloge, J.-, 2004. Mixture of experts for audio classification: an applica- tion to male female classification and musical genre recognition, in: 2004 IEEE Inter- national Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763). Presented at the 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763), pp. 1351-1354 Vol.2. https://doi.org/10.1109/ICME.2004.1394479
Hartikainen, M., Turunen, M., Hakulinen, J., Salonen, E.-P., Adam Funk, J., 2004. Flexible dialogue management using distributed and dynamic dialogue control.
Ignizio, J.P., 1991. Introduction to expert systems: the development and implementation of rule-based expert systems. McGraw-Hill, New York.
Ignizio, J.P., 1990. A brief introduction to expert systems. Comput. Oper. Res. 17, 523–533. https://doi.org/10.1016/0305-0548(90)90058-F
Io, H.N., Lee, C.B., 2017. Chatbots and conversational agents: A bibliometric analysis, in: 2017 IEEE International Conference on Industrial Engineering and Engineering Man- agement (IEEM). Presented at the 2017 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), pp. 215–219. https://doi.org/10.1109/IEEM.2017.8289883
29
Jacobs, R., Jordan, M., J. Nowlan, S., E. Hinton, G., 1991. Adaptive Mixture of Local Expert. Neural Comput. 3, 78–88. https://doi.org/10.1162/neco.1991.3.1.79
Jacobs, R.A., Jordan, M.I., Barto, A.G., 1991. Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cogn. Sci. 15, 219–250. https://doi.org/10.1016/0364-0213(91)80006-Q
Koehler, A., 2017. Meet TOBi the chatbot: The latest addition to our customer service team [WWW Document]. Vodafone Soc. Off. Vodafone UK Blog. URL https://blog.voda- fone.co.uk/2017/04/12/meet-tobi-chatbot-latest-addition-vodafone-uks-customer-ser- vice-team/ (accessed 8.31.19).
Komatani, K., Kanda, N., Nakano, M., Nakadai, K., Tsujino, H., Ogata, T., Okuno, H.G., 2006. Multi-domain Spoken Dialogue System with Extensibility and Robustness Against Speech Recognition Errors, in: Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, SigDIAL ’06. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 9–17.
Lin, B.-S., Wang, H., Fen, Q., 1999. Consistent Dialogue Across Concurrent Topics Based On An Expert System Model.
Liu, X., Eshghi, A., Swietojanski, P., Rieser, V., 2019. Benchmarking Natural Language Un- derstanding Services for building Conversational Agents. ArXiv190305566 Cs.
Lu, Z., 2006. A regularized minimum cross-entropy algorithm on mixtures of experts for time series prediction and curve detection. Pattern Recognit. Lett. 27, 947–955. https://doi.org/10.1016/j.patrec.2005.12.002
M. O’Neill, I., Hanna, P., Liu, X., Mctear, M., 2004. Cross domain dialogue modelling: an object-based approach.
Masoudnia, S., Ebrahimpour, R., 2014. Mixture of experts: a literature survey. Artif. Intell. Rev. 42, 275–293. https://doi.org/10.1007/s10462-012-9338-y
Mossavat, S.I., Amft, O., Vries, B. de, Petkov, P.N., Kleijn, W.B., 2010. A bayesian hierar- chical mixture of experts approach to estimate speech quality, in: 2010 Second Inter- national Workshop on Quality of Multimedia Experience (QoMEX). Presented at the 2010 Second International Workshop on Quality of Multimedia Experience (QoMEX), pp. 200–205. https://doi.org/10.1109/QOMEX.2010.5516203
Nakano, M., Funakoshi, K., Hasegawa, Y., Tsujino, H., 2008. A Framework for Building Conversational Agents Based on a Multi-expert Model, in: Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, SIGdial ’08. Association for Compu- tational Linguistics, Stroudsburg, PA, USA, pp. 88–91.
NatWest begins testing AI driven ‘digital human’ in banking first [WWW Document], n.d. URL https://www.rbs.com/rbs/news/2018/02/natwest-begins-testing-ai-driven-digital- human-in-banking-first.html (accessed 8.31.19).
Niranjan, M., Saipreethy, M.S., Kumar, T.G., 2012. An intelligent question answering con- versational agent using Naïve Bayesian classifier, in: 2012 IEEE International Confer- ence on Technology Enhanced Education (ICTEE). Presented at the 2012 IEEE Inter- national Conference on Technology Enhanced Education (ICTEE), pp. 1–5. https://doi.org/10.1109/ICTEE.2012.6208614
Peng, F., Jacobs, R.A., Tanner, M.A., 1996. Bayesian Inference in Mixtures-of-Experts and Hierarchical Mixtures-of-Experts Models with an Application to Speech Recognition. J. Am. Stat. Assoc. 91, 953–960. https://doi.org/10.1080/01621459.1996.10476965
Rumney, E., 2018. British bank RBS hires “digital human” Cora on probation. Reuters. Setiaji, B., Wibowo, F.W., 2016. Chatbot Using a Knowledge in Database: Human-to-Ma-
chine Conversation Modeling, in: 2016 7th International Conference on Intelligent Systems, Modelling and Simulation (ISMS). Presented at the 2016 7th International
30
Conference on Intelligent Systems, Modelling and Simulation (ISMS), pp. 72–77. https://doi.org/10.1109/ISMS.2016.53
Shieber, S.M., 1994. Lessons from a Restricted Turing Test. Commun ACM 37, 70–78. https://doi.org/10.1145/175208.175217
Shum, H., He, X., Li, D., 2018. From Eliza to XiaoIce: challenges and opportunities with so- cial chatbots. Front. Inf. Technol. Electron. Eng. 19, 10–26. https://doi.org/10.1631/FITEE.1700826
Simmons, R.F., 1970. Natural Language Question-answering Systems: 1969. Commun ACM 13, 15–30. https://doi.org/10.1145/361953.361963
Suzuki, J., Taira, H., Sasaki, Y., Maeda, E., 2003. Question Classification using HDAG Ker- nel, in: Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering. Association for Computational Linguistics, Sapporo, Japan, pp. 61–68. https://doi.org/10.3115/1119312.1119320
Tripathi, K.P., 2011. A Review on Knowledge-based Expert System: Concept and Architec- ture. Artif. Intell. Tech. 5.
Turing, A.M., 1950. I.—COMPUTING MACHINERY AND INTELLIGENCE. Mind LIX, 433–460. https://doi.org/10.1093/mind/LIX.236.433
Tzafestas, S.G., Kokkinaki, A.I., Valavanis, K.P., 1993. An Overview of Expert Systems, in: Tzafestas, S. (Ed.), Expert Systems in Engineering Applications. Springer Berlin Hei- delberg, Berlin, Heidelberg, pp. 3–24. https://doi.org/10.1007/978-3-642-84048-7_1
Versace, M., Bhatt, R., Hinds, O., Shiffer, M., 2004. Predicting the exchange traded fund DIA with a combination of genetic algorithms and neural networks. Expert Syst. Appl. 27, 417–425. https://doi.org/10.1016/j.eswa.2004.05.018
Walker, M., S. Aberdeen, J., Boland, J., Bratt, E., S. Garofolo, J., Hirschman, L., N. Le, A., Lee, S., Narayanan, S., Papineni, K., L. Pellom, B., Polifroni, J., Potamianos, A., Prabhu, P., Rudnicky, A., Sanders, G., Seneff, S., Stallard, D., Whittaker, S., 2001. DARPA communicator dialog travel planning systems: the june 2000 data collection. pp. 1371–1374.
Wallace, R.S., 2009. The Anatomy of A.L.I.C.E., in: Epstein, R., Roberts, G., Beber, G. (Eds.), Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for the Thinking Computer. Springer Netherlands, Dordrecht, pp. 181–210. https://doi.org/10.1007/978-1-4020-6710-5_13
Walter, P., Elsen, I., Muller, H., Kraiss, K.-, 1999. 3D object recognition with a specialized mixtures of experts architecture, in, IJCNN’99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339). Presented at the IJCNN’99. In- ternational Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339), pp. 3563–3568 vol.5. https://doi.org/10.1109/IJCNN.1999.836243
Waltinger, U., Breuing, A., Wachsmuth, I., 2012. Connecting Question Answering and Con- versational Agents. KI - Künstl. Intell. 26, 381–390. https://doi.org/10.1007/s13218- 012-0208-1
Wang, Z., Ahmadvand, A., Choi, J.I., Karisani, P., Agichtein, E., n.d. Emersonbot: Infor- mation-Focused Conversational AI Emory University at the Alexa Prize 2017 Chal- lenge 11.
Weigend, A.S., Mangeas, M., Srivastava, A.N., 1995. Nonlinear gated experts for time series: discovering regimes and avoiding overfitting. Int. J. Neural Syst. 06, 373–399. https://doi.org/10.1142/S0129065795000251
Weizenbaum, J., 1966. ELIZA—a Computer Program for the Study of Natural Language Communication Between Man and Machine. Commun ACM 9, 36–45. https://doi.org/10.1145/365153.365168
31
Williams, J.D., Young, S., 2007. Partially observable Markov decision processes for spoken dialog systems. Comput. Speech Lang. 21, 393–422. https://doi.org/10.1016/j.csl.2006.06.008
Yazdani, M., 1989. Expert Systems Principles and Case Studies, in: Forsyth, R. (Ed.), . Chap- man & Hall, Ltd., London, UK, UK, pp. 173–183.
Yuksel, S.E., Wilson, J.N., Gader, P.D., 2012. Twenty Years of Mixture of Experts. IEEE Trans. Neural Netw. Learn. Syst. 23, 1177–1193. https://doi.org/10.1109/TNNLS.2012.2200299
Zhang, D., Lee, W.S., 2003. Question Classification Using Support Vector Machines, in: Pro- ceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, SIGIR ’03. ACM, New York, NY, USA, pp. 26–32. https://doi.org/10.1145/860435.860443
32
Appendices
A. Code for obtaining output from IBM Watson Services

'''
This code reads multiple user utterances from a file and passes each one to
all 3 expert systems. The confidences obtained from the systems are then
stored as a matrix, one row per user utterance, in the output Excel file.
'''
import json

import ibm_watson
import pandas as pd
from ibm_watson import DiscoveryV1

# setting configurations for Watson Assistant
api_version_assistant = ''
apikey_assistant = ''
assistant_url = ''
shorttail_workspace = ''
chitchat_workspace = ''
assistant = ibm_watson.AssistantV1(
    version=api_version_assistant,
    iam_apikey=apikey_assistant,
    url=assistant_url
)

# setting configurations for long tail / Watson Discovery service
api_version_discovery = ''
apikey_discovery = ''
discovery_url = ''
environment_id = ''
collection_id = ''
discovery = DiscoveryV1(
    version=api_version_discovery,
    iam_apikey=apikey_discovery,
    url=discovery_url
)

# read csv file with input test utterances
df = pd.read_csv('inputfilepath')  # give path to file containing test utterances

# defining columns in dataframe to store confidence values
df['shorttail_confidence'] = '0'
df['shorttail_intent'] = '0'
df['chitchat_confidence'] = '0'
df['chitchat_intent'] = '0'
df['longtail_confidence'] = '0'
df['lt_conf1'] = '0'
df['lt_conf2'] = '0'
df['lt_conf3'] = '0'

# passing utterances to chitchat system
for i in range(0, len(df)):
    user_input = df['message'].loc[i]
    response_cc = assistant.message(
        workspace_id=chitchat_workspace,
        input={'text': user_input}
    ).get_result()
    if response_cc['intents'] == []:
        df.loc[i, 'chitchat_confidence'] = '0'
        df.loc[i, 'chitchat_intent'] = 'Invalid'
    else:
        df.loc[i, 'chitchat_confidence'] = response_cc['intents'][0]['confidence']
        df.loc[i, 'chitchat_intent'] = response_cc['intents'][0]['intent']

# passing utterances to shorttail system
for i in range(0, len(df)):
    user_input = df['message'].loc[i]
    response_st = assistant.message(
        workspace_id=shorttail_workspace,
        input={'text': user_input}
    ).get_result()
    if response_st['intents'] == []:
        df.loc[i, 'shorttail_confidence'] = '0'
        df.loc[i, 'shorttail_intent'] = 'Invalid'
    else:
        df.loc[i, 'shorttail_confidence'] = response_st['intents'][0]['confidence']
        df.loc[i, 'shorttail_intent'] = response_st['intents'][0]['intent']

# passing utterances to longtail system
for i in range(0, len(df)):
    user_input = df['message'].loc[i]
    input_text = "text:" + user_input
    query_ex = discovery.query(
        environment_id, collection_id, filter=None, query=input_text,
        natural_language_query=None, passages=True, aggregation=None, count=3,
        return_fields=None, offset=None, sort=None, highlight=True,
        passages_fields=None, passages_count=3, passages_characters=None,
        deduplicate=None, deduplicate_field=None, collection_ids=None,
        similar=None, similar_document_ids=None, similar_fields=None,
        bias=None, logging_opt_out=None)
    for j in range(0, len(query_ex.result['results'])):
        if j < 3:  # get top 3 results from discovery / longtail system
            df.loc[i, 'longtail_confidence'] = query_ex.result['results'][0]['result_metadata']['confidence']
            df.loc[i, 'lt_conf' + str(j + 1)] = query_ex.result['results'][j]['result_metadata']['confidence']
        else:
            break

# give path to store the output as an excel file for further manipulation
df.to_excel('outputfilepath', index=False)
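The two Watson Assistant loops in this appendix repeat the same check for an empty list of intents. One way to factor that out, offered as a sketch rather than as part of the original script, is a small helper that can be exercised without network access. The name `top_intent` and the sample intent `buy_car` are hypothetical; the response shape is assumed from the way the appendix code indexes the result.

```python
from typing import Any, Dict, Tuple


def top_intent(response: Dict[str, Any]) -> Tuple[str, float]:
    """Return (intent, confidence) for the first intent in a response.

    Falls back to ('Invalid', 0.0) when the 'intents' list is empty or
    missing, mirroring the defaults written to the dataframe.
    """
    intents = response.get("intents", [])
    if not intents:
        return "Invalid", 0.0
    first = intents[0]
    return first["intent"], float(first["confidence"])


# Stubbed responses, shaped like the dictionaries indexed in the appendix code.
matched = {"intents": [{"intent": "buy_car", "confidence": 0.92}]}
unmatched = {"intents": []}

print(top_intent(matched))
print(top_intent(unmatched))
```

Keeping the parsing in one function means a mistake in the empty-intents check cannot differ between the chitchat and short tail loops.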
B. Test Data Types
a. Type 1
Conversation 1
Hi Good Morning How are you? I am looking to buy a car Can you show me red sedans please? Does it come in black? Go back one step Does the car have abs and ebd? I want to know how many mpg it gives Let's buy that
Conversation 2
Hello what's up can we chat can you help me configure a car I am looking for something in the mid 20k pound range I like green SUV's with a sunroof Yes please Does it have driver assist? How much is the insurance going to cost? Ok, let's buy that
Conversation 3
Good evening What's up? How are you? I wanna buy a car I am looking for a sedan with manual transmission What is the power of that car? Does it have 6 airbags? I like it What is the tax liability Ok, let's go ahead and send the quote
Conversation 4
Greetings Good afternoon Describe yourself How can you help I want to configure a car I am looking for a blue automatic sedan What's the wheelbase of the car? does the car have bluetooth in it? can you book a test drive for me? I'll be back later
Conversation 5
are you here? Hey Good Morning I want to customise a car Looking for something in the 40k pound range with automatic transmission Does it have dsg transmission? Perfect! Exactly what I am looking for
Can I get a test drive for that nearby I am ready to buy. Please send configuration to dealer
b. Type 2
Conversation 1
Hi What are you I am looking to buy a car Can you show me an automatic red sedan please? I like the Audi a4 Does it come in black? Go back one step Does the car have abs and ebd? I want to know how many mpg it gives This is exactly what I am looking for Can I get a quote of what it's costing now Would it be possible to get a test drive tomorrow? Excellent Goodbye for now
Conversation 2
Hello can we chat help me configure a car I am looking for a something in the mid 20k pound range I want a green hatchback with a sunroof I like the kia rio I want to see that with 17-inch alloys It's beginning to move towards exactly what I am looking for Does it have driver assist? What is the power of that car? Does it have 6 airbags? How much is the insurance going to cost? How much is the cost now? Ok, let's go ahead and send the quote I'll be back later Bye
Conversation 3
Greetings Describe yourself How can you help I want to configure a car I am looking for a diesel SUV
I want it in blue with black alloys Start again I want to experiment I am looking to buy a petrol convertible with automatic transmission I love the mercedes amg-gt I want to see the car in a light green colour instead of black That's perfect What's the tax liability? Does it have dsg transmission? What's the wheelbase of the car? does the car have bluetooth in it? can you book a test drive on Friday at 4pm for me? Great Work You're funny Are you real? I'll be back later bye
Conversation 4
are you here? Good afternoon I want to customise a car Looking for something in the 40k pound range with automatic transmission Which of those have Apple car in them? I'll go for the hyundai i30 I want it in dark blue How many airbags does it have? Does it have ebd? What's the mileage in mpg? That's really nice Can I get a test drive for that nearby now Yes please I am ready to buy. Please send configuration to dealer I am bored! Nah Talk to you later
Machine Learning Dissertation(1).pdf
High Frequency Oscillation Detection Using Wavelet Analysis and
Convolutional Neural Networks
Joe Morris
September 2019
School of Mathematics, Cardiff University
A dissertation submitted in partial fulfilment of the requirements for MSc (in Data Science and Analytics) by taught programme.
Acknowledgements
First and foremost, I would like to thank my project supervisor Alexia Zoumpoulaki, who has
provided fantastic guidance throughout this process. Her insights and advice have been
instrumental in bringing this project to fruition.
I would also like to thank Miguel Navarrete, whose expertise in this field has been
invaluable to this project. I am immensely grateful for the time he has taken to produce the
simulated data for the project.
I would also like to thank Supercomputing Wales for allowing me access to their facilities.
List of Acronyms
HFO High frequency oscillation
CNN Convolutional Neural Network
EEG Electroencephalogram
iEEG Intracranial Electroencephalogram
IPSP Inhibitory Postsynaptic Potential
EPSP Excitatory Postsynaptic Potential
Table of Figures
Figure 1: Taking windows of a signal
Figure 2: Naively structured inception module
Figure 3: 1×1 convolutions for depth reduction
Figure 4: Inception module structure with 1×1 convolutional blocks
Figure 5: Residual connection
Figure 6: Wavelet analysis of 3 waves
Figure 7: CNN structure proposed by Lai et al
Figure 8: Stem of newly proposed models
Figure 9: Module 1 of Model A
Figure 10: Module 2 of Model A
Figure 11: Module 3 of Model A
Figure 12: Module 4 of Model A
Figure 13: Module 5 of Model A
Figure 14: Module 1 of Model B
Figure 15: Module 2 of Model B
Figure 16: Module 3 of Model B
Figure 17: Module 4 of Model B
Figure 19: Module 5 of Model B
Figure 18: Module 6 of Model B
Figure 20: Module 1 of Model C
Figure 21: Module 2 of Model C
Figure 22: Module 3 of Model C
Figure 23: Module 4 of Model C
Figure 24: Module 5 of Model C
Figure 25: Confusion matrix for Model B's application on the test set
Figure 26: Confusion matrix for the Industry Standard's application on the test set
Figure 27: Venn diagram comparing the performance of Models
Figure 28: Model B's performance on Ripples over distance
Figure 29: Industry Standard's performance on Ripples over distance
Figure 30: Model B's performance on Fast Ripples over distance
Figure 31: Industry Standard's performance on Fast Ripples over distance
Figure 32: Model B's performance on Spikes over distance
Figure 33: Industry Standard's performance on Spikes over distance
Figure 34: Model B's performance on Ripple-FastRipples over distances
Figure 35: Industry Standard's performance on Ripple-FastRipples over distances
Figure 36: Incorrect predictions of Model B on Ripple-FastRipples
Figure 37: Incorrect predictions of the Industry Standard on Ripple-FastRipples
Figure 38: Model B's performance on Spike-Ripples over distances
Figure 39: Industry Standard's performance on Spike-Ripples over distances
Figure 40: Incorrect predictions of Model B on Spike-Ripples
Figure 41: Incorrect predictions of the Industry Standard on Spike-Ripples
Figure 42: Distribution of incorrect predictions by Model B on Spike-Ripples
Figure 43: Distribution of incorrect predictions by the Industry Standard on Spike-Ripples
Figure 44: Model B's performance on Spike-FastRipples over distances
Figure 45: Industry Standard's performance on Spike-FastRipples over distances
Figure 46: Incorrect predictions of Model B on Spike-FastRipples
Figure 47: Incorrect predictions of the Industry Standard on Spike-FastRipples
Figure 48: Distribution of incorrect predictions by Model B on Spike-FastRipples
Figure 49: Distribution of incorrect predictions by the Industry Standard on Spike-FastRipples
Summary
This study applies to the field of automated high frequency oscillation (HFO) detection.
HFOs are a biomarker of epileptogenic tissue. Therefore, models capable of automatically
detecting these behaviours could improve the success rates of epileptic tissue removal
surgery, as well as significantly reduce costs for health service providers. This is a
challenging task, however, chiefly because of the disruptive effects of noise and
non-HFO behaviours within the iEEG signal. Such signal behaviours are easily
misclassified as HFOs.
This dissertation project utilises depth-wise stacks of time-frequency plots, which may allow
for the encoding of accessory information. Once the frequency behaviour over time of the
iEEG signal is captured within these plots, CNN models are used to classify them as either
HFO or non-HFO behaviour. We present new model structures specifically designed to
capture the intricate details within these time-frequency plots. CNN models with these
specific structures have never been utilised on this problem and this study shows the
applicability of these alternative structures to this task.
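To make the notion of a time-frequency plot concrete, the sketch below computes one for a synthetic signal using NumPy only. It is an illustration under assumptions, not the pipeline used in this project: the complex Morlet wavelet, the `n_cycles` value, the sampling rate, the analysis frequencies and the function name `morlet_scalogram` are all chosen for the example.

```python
import numpy as np


def morlet_scalogram(signal, fs, freqs, n_cycles=6.0):
    """Return wavelet magnitudes with shape (len(freqs), len(signal)).

    Each row is the magnitude of the signal convolved with a complex Morlet
    wavelet centred on one frequency. The signal is assumed to be longer
    than the widest wavelet, so that mode="same" keeps the signal length.
    """
    signal = np.asarray(signal, dtype=float)
    magnitudes = np.empty((len(freqs), signal.size))
    for k, f in enumerate(freqs):
        sigma_t = n_cycles / (2.0 * np.pi * f)  # temporal width of the wavelet
        t = np.arange(-4.0 * sigma_t, 4.0 * sigma_t, 1.0 / fs)
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2.0 * sigma_t**2))
        wavelet /= np.abs(wavelet).sum()  # make rows comparable across frequencies
        magnitudes[k] = np.abs(np.convolve(signal, wavelet, mode="same"))
    return magnitudes


# Synthetic example: one second of a 250 Hz oscillation sampled at 2 kHz.
fs = 2000
time = np.arange(0, 1, 1 / fs)
sig = np.sin(2 * np.pi * 250 * time)
freqs = [100.0, 250.0, 400.0]

tf_plot = morlet_scalogram(sig, fs, freqs)
print(tf_plot.shape)
print(freqs[int(np.argmax(tf_plot.mean(axis=1)))])
```

Several arrays of this kind could be stacked along a channel axis with `np.stack(..., axis=-1)` to form a depth-wise input for a CNN. Which quantities are stacked in this project is not specified in the summary, so that step is left out of the sketch.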
Three new alternative models are constructed. First, cross-validation is performed in order to
test the stability of each model’s performance. Once an optimal model structure is chosen,
said model and a re-created version of the industry standard model are applied to a final test
set. The model proposed in our research provides more accurate results than the industry
standard model on this task. In fact, the model proposed here performs more accurately on all
wave types simulated.
An analysis of the relationship between the predictive power of the models and the distance
to behaviours is also conducted. We conclude that the smaller the displacement between
electrodes and the position of HFOs, the more accurate the predictions. Conversely, we find
that the smaller the distance to disruptive non-HFO behaviours such as spikes, the larger the
disruptive effects.
Contents
Acknowledgements
List of Acronyms
Table of Figures
Summary
1. Introduction
2. Background
2.1 EEG
2.2 HFOs
2.3 Electrical Behaviour at the Cellular Level
2.4 Fourier Analysis and Filtering
2.5 Time Frequency Analysis and Wavelets
3. Literature Review and History of the Field
3.1 Overview of Previous Methods
3.1.1 Early Papers
3.1.2 Clustering Methods
3.1.3 Linear SVMs
3.1.4 Neural Networks
3.1.5 Convolutional Neural Networks
3.2 Current Industry Standard
4. Introduction to Convolutional Neural Networks
4.1 Overview of Basic Principles
4.1.1 How Images are Interpreted by a CNN
4.1.2 Convolutional Layers
4.1.3 Pooling Layers
4.2 Inception Networks
4.3 Residual Connections
5. Data Pre-processing
5.1 Wave Simulations
5.2 Simulating the Effects of Distance
5.3 Time-Frequency Plot Creation
5.4 Creation of Sets for Cross Validation and Final Testing
5.5 Limitations of Simulated Data
6. Construction of Models
6.1 Reconstruction of the Industry Standard Model
6.2 Construction of New Models
6.2.1 Model A
6.2.2 Model B
6.2.3 Model C
6.3 Applicability of Deep Learning and CNNs to HFO Detection
7. Applying the Models
7.1 Cross-Validation Results
7.2 Test Set Results
7.2.1 Overall Performance
7.2.2 Accuracy Breakdown by Wave Type
7.3 Predictive Power of Models over Distance
7.3.1 Simple Waveforms – Ripples, Fast Ripples and Spikes
7.3.2 Complex Waveforms – Pairings Between Ripples, Fast Ripples and Spikes
7.4 Performance on Small Datasets
8. Final Discussion
8.1 Limitations to Conclusions
8.2 Possible Next Steps
9. Appendices
10. Bibliography
1. Introduction
Epilepsy is a widespread neurological disorder. For many sufferers, the use of medication can
reduce or even stop the occurrence of seizures. However, in some cases, this course of action
is not feasible. This can be due to a variety of reasons such as the medication being
ineffective or causing particularly adverse side effects. In cases where medicine is not a viable
treatment option, one alternative is the surgical removal of the brain tissue that is responsible.
Of course, the success of these surgical procedures is dependent on the correct identification
of the tissue that is to be removed. Intracranial electroencephalography (iEEG) is a method
carried out pre-surgery in order to locate these epileptogenic zones. iEEG utilises
microelectrodes placed within the exposed brain tissue of the patient. Measurements of
voltage at each electrode site are taken. These continuous voltage signals can give us an
insight into the behaviour of the brain at these locations. Specifically, high frequency
oscillations (HFOs) are of interest, as these signal behaviours have proven to be a biomarker
for the epileptogenic zones of the brain (Jacobs et al. 2008).
The current methods for detecting and subsequently classifying these behaviours are centred
around visual inspection by reviewers. For health care services, the application of automated
techniques promises a more optimal use of their time and financial resources, as well as
removing human error from the process. However, researchers have struggled to find a
method capable of classifying HFOs from the noisy background of iEEG data.
Recent research indicates that time-frequency plots are a valuable tool for identification of
HFOs from within iEEG data (Liu et al. 2016). These plots allow for an exceptionally
accurate representation of the frequency content of a signal and how such frequencies change over
time. In addition, the application of Convolutional Neural Networks (CNNs) to classify these
plots has led to promising results (Lai et al. 2019).
In this research, simulations are used to recreate the scenario of an HFO detection problem.
We then build upon the work of Lai et al. and investigate the applicability of new CNN model
structures to this classification problem.
2. Background
When conducting research in a field such as this, it is vital to consider the underlying
scientific foundations. While this dissertation primarily focuses on the development of new
machine learning methods, the science behind the data on which these methods are applied is
important for context.
2.1 EEG
The electroencephalogram (EEG) is a method of measuring the electrical activity of the brain
using electrodes placed on the scalp. However, this method suffers from certain limitations.
For instance, the low conductivity of the skull makes this approach poor at detecting the
electrical activity of the inner brain. The intracranial electroencephalogram (iEEG) is a
variation of the EEG in which the electrical activity of the brain is instead measured through
electrodes placed within the exposed surface of the brain.
At each electrode position, the output of an iEEG is a continuous measurement of the changes
in voltage plotted against time (Luck, 2014, p. 4). More precisely, individual measurements
of voltage are taken at such a high frequency that the wave can be treated as continuous in
practice. The brain is of course always active, which leads to an ever-changing signal
measurement at each electrode.
2.2 HFOs
iEEG signal oscillations can occur at a variety of frequencies. Over the last few decades, the
development of broad-band digital EEG has increased the frequency measurement capacity to
over 500Hz (Navarrete et al. 2016a).
In this field of study, signals are often categorized by their frequency. In fact, it is common
practice to assign a Greek letter to each frequency band. For example, Delta waves are
usually of relatively low frequency (2–4 Hz), while Gamma waves denote oscillations with a
fairly high frequency (30-100Hz) (Whittingstall et al. 2009). To further complicate matters,
researchers’ definitions of what exactly constitutes an HFO vary. One such definition is that
HFOs are oscillatory activities that have a power increase inside the 40-800Hz band and have
durations in the tens of milliseconds (Navarrete et al. 2016a).
Certain HFOs have been found to be physiological in nature. For instance, HFOs have been
measured from several regions of the brain when participants were exposed to visual stimuli
in the form of images (Kucewicz et al. 2014). While many HFOs are manifestations of
physiological processes of the brain, some have proven to be pathological in nature. Of
particular interest is the research linking the appearance of certain HFOs to the seizure onset
zone (Jacobs et al. 2008). Indeed, research strongly links the removal of brain tissue
associated with these HFOs to the success of epilepsy surgery (Wu et al. 2010; Jacobs et al.
2010).
In this field of epileptic tissue identification, two of the most commonly studied HFOs are
ripples and fast ripples. Ripples are HFOs that occur with a frequency range of 120 - 240 Hz,
while fast ripples are HFOs that occur with a frequency exceeding this range (Navarrete et al.
2016a).
2.3 Electrical Behaviour at the Cellular Level
The human brain is composed of billions of cells known as neurons. These are highly
specialized cells that take on a variety of structures in order to carry out specific tasks. To this
end, neurons are classified based on factors such as their function, location, and shape (Squire
et al. 2008, p.4). Despite this wide range of different cell types, there are structures common
to almost all neurons: a cell body (soma), dendrites and an axon. The soma contains the
nucleus and various other organelles that are vital for the neuron’s functionality. Dendrites
extend from this cell body and branch in often elaborate patterns; they are responsible for the
transmission of electrochemical signals that are received from other cells to the soma. This
signal is then passed to other cells via the axon (Squire et al. 2008, p.4).
There are numerous contributors to the measurable electrical activity of the brain. The two
most prominent of these are action potentials and postsynaptic potentials; moreover, these
phenomena are interdependent. Action potentials are voltage spikes that are transmitted
through the cell from the soma to the axon (Luck, 2014, p. 39). Postsynaptic potentials either
encourage or inhibit action potentials and their origin is a little more complex.
Synapses can be thought of as the junction between neurons, across which these cells
communicate. When an action potential occurs within a neuron, neurotransmitters are
released; these neurotransmitters then bind to sites on the postsynaptic neuron. When a
neurotransmitter binds to the postsynaptic cell it has one of two possible effects. It may
hyperpolarize the membrane of the postsynaptic cell. This is known as an inhibitory
postsynaptic potential (IPSP), which decreases the probability of the postsynaptic neuron firing
an action potential. Alternatively, a neurotransmitter may depolarize the membrane of the
postsynaptic neuron. This is known as an excitatory postsynaptic potential (EPSP), which
increases the probability of the postsynaptic cell firing an action potential.
If several neurons have action potentials that occur simultaneously, and the axons of the cells
are orientated in parallel to each other, then this voltage may aggregate. However, if one of
these conditions is not met, then this will lead to signal cancellation. The short duration of an
action potential means it is unlikely for these conditions of both timing and orientation to be
met. A large number of summed action potentials are needed to create a voltage capable of
successfully propagating through the resistive brain tissue and scalp. Therefore, electrical
activities originating from action potentials are not always measurable by a classical EEG
(Luck, 2014, p. 39). Conversely, postsynaptic potentials have a longer duration, and the
probability of voltage summation is higher, meaning postsynaptic potentials can usually be
measured from a greater distance (Luck, 2014, p. 40). This means they are more likely to be
measurable using classical EEG.
Research suggests that HFOs, and in particular HFOs of a pathological nature, originate
from a number of simultaneously occurring action potentials from groups of suitably
orientated neurons (Cendes et al. 2018). Since iEEG is taken intra-cranially, the resistive
effect of the scalp is removed and the distance between the electrical activity and the
measuring electrodes is reduced. This results in iEEG being a far more sensitive measuring
instrument to certain electrical behaviours. Indeed, the HFOs originating from these summed
action potentials are far more readily measured using iEEG than EEG.
2.4 Fourier Analysis and Filtering
Filtering is a commonly used technique in the field of EEG and iEEG analysis. It is used to
suppress signal behaviours that fall within certain frequency bands (Luck, 2014, p. 226). In
the field of HFO detection, filtering is predominantly applied to iEEG signals as a data
pre-processing method aimed at suppressing low frequency behaviour and isolating oscillations
occurring at higher frequencies. Despite the relative success of this technique within this
field, research also suggests application of filtering in scenarios where it is not suitable to do
so may lead to distortive effects (Bénar et al. 2010).
Filter types are categorised by the frequencies they let pass or suppress. For example, high
pass filters let high frequency oscillations pass, while attenuating low frequencies. Low pass
filters allow low frequencies to pass while attenuating high frequencies. Band pass filters are
used to attenuate high and low frequencies simultaneously at the user’s discretion.
The filtering process is based on the underlying concept of Fourier analysis. Fourier analysis
allows for the deconstruction of a continuous signal in the time domain to a number of sine
waves with various frequencies, amplitudes and phases (Luck, 2014, p. 220). The Fourier
transform, created by the mathematician Joseph Fourier, is the function that allows for this
mapping from the time domain onto the frequency domain.
It’s worth noting that a non-zero measurement on a signal’s chart on the frequency domain
does not necessarily reflect that oscillations occurred at that specific frequency. Rather, the
respective chart of a continuous signal in the frequency domain reflects the sine waves and
their respective properties that are needed to reconstruct the original signal. However, in
practice, oscillations at a particular frequency in the raw signal will usually manifest as strong
power at that frequency within a chart on said frequency domain (Luck, 2014, p. 225).
Once the raw signal is translated from the time domain onto the frequency domain by use of this
transformation, each frequency’s power is then multiplied by some pre-defined gain value.
Each gain value is a number between 0 and 1. Subsequently, frequencies are attenuated or
passed based on the corresponding gain value that is to act upon them. For example, in a high
pass filter, high frequencies are multiplied by high gains in order to preserve them, while low
frequencies are inhibited through multiplication by a small gain. The gain values to be
utilized on each specific frequency band are defined by a frequency response function. Once
suitable attenuations have been carried out, this now filtered version of the signal, which lies
on the frequency domain, is translated back onto the time domain by the inverse Fourier
transform to re-form a continuous signal.
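The filtering procedure described above can be illustrated with a minimal NumPy sketch. It is an assumption-laden illustration rather than the filter used in this research: the function names `fft_filter` and `band_pass`, the 2000 Hz sampling rate and the test signal are all hypothetical, the gains are applied to the complex Fourier coefficients, and the frequency response is an idealised "brick-wall" band pass.

```python
import numpy as np

def fft_filter(signal, fs, gain_fn):
    """Filter a real signal by scaling its Fourier coefficients with gains in [0, 1]."""
    spectrum = np.fft.rfft(signal)                         # time domain -> frequency domain
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)       # frequency (Hz) of each coefficient
    gains = gain_fn(freqs)                                 # frequency response function
    return np.fft.irfft(spectrum * gains, n=len(signal))   # back to the time domain

def band_pass(low_hz, high_hz):
    """Idealised frequency response: gain 1 inside the band, gain 0 outside it."""
    return lambda f: ((f >= low_hz) & (f <= high_hz)).astype(float)

# Hypothetical example: a 10 Hz background oscillation plus a weaker 150 Hz component.
fs = 2000                                   # assumed sampling rate in Hz
t = np.arange(fs) / fs                      # one second of samples
slow = np.sin(2 * np.pi * 10 * t)
fast = 0.2 * np.sin(2 * np.pi * 150 * t)
filtered = fft_filter(slow + fast, fs, band_pass(80, 250))
```

In this example `filtered` recovers the 150 Hz component and suppresses the 10 Hz background. Practical filters generally use smoother frequency responses than the abrupt cut-off assumed here.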
2.5 Time Frequency Analysis and Wavelets
Time frequency analysis is another commonly used technique in the field of EEG analysis.
Again, this is based upon the deconstruction of waves into the wave frequencies needed to re-
create the original signal. However, in this method, information concerning the time in which
these frequencies occur is also maintained.
When considering a method to properly examine how the frequency of a wave changes over
time, the Fourier transform proves itself to be ineffective. In fact, taking a Fourier
transform of an entire iEEG signal would mean the loss of all information as to which
frequencies occurred at which point in time.
A possible solution to this problem could be to take several sample points over the section of
signal we are interested in. At each sample point chosen, a window of pre-specified size
could be centred on the point. We could then conduct Fourier analysis within these windows
and attribute the results as happening at the point corresponding to the centre of a window.
To visualize how such a method may be carried out, figure 1 shows how several windows
could be used to examine two seconds of signal that we would like to analyse for the presence of
an HFO.
Figure 1: Taking windows of a signal (labels: raw signal, windows, 2 seconds of signal)
Note that while figure 1 utilizes only four windows, each of which is disconnected from all
other windows, there is nothing to stop us from increasing the number of windows and
ensuring that these windows overlap to give a more in-depth analysis of the signal at all
points in time. Unfortunately, there are huge limitations imposed by using such a method.
One such problem is that the relevance of all oscillatory behaviour within a window is
weighted equally by the Fourier transform, and yet the behaviour is attributed to a single
point in time. This is not ideal, as there is potential for interesting oscillatory behaviour that
occurs on the outer border of a window to be attributed to a point in time despite the
behaviour occurring before/after that point in time.
Another limitation is imposed by the use of a constant window size. This is because of the
balancing act of localizing wave behaviour in terms of both time and frequency
simultaneously. Different window sizes will have specific benefits for the task. For instance,
small window sizes mean we are likely to have a high resolution in terms of time, while we
will find it difficult to pick up low frequency behaviour that takes a longer time to complete
oscillations. On the other hand, larger windows are suited to picking up on the lower
frequency activity that occurs over larger time intervals. However, by using larger windows,
we are less certain of where exactly this behaviour is occurring, i.e. we lose resolution in
terms of time.
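The window-size trade-off can be made concrete with a short sketch of windowed Fourier analysis. The helper name `windowed_spectrum`, the 2000 Hz sampling rate and the 5 Hz test signal are assumptions made purely for illustration. The spacing between the frequencies that a window can resolve is the sampling rate divided by the window length, so a short window gives coarse frequency bins.

```python
import numpy as np

def windowed_spectrum(signal, fs, centre_idx, window_len):
    """Amplitude spectrum of a window of `window_len` samples centred on `centre_idx`."""
    start = centre_idx - window_len // 2
    segment = signal[start:start + window_len]
    freqs = np.fft.rfftfreq(window_len, d=1.0 / fs)
    return freqs, np.abs(np.fft.rfft(segment))

fs = 2000                                  # assumed sampling rate in Hz
t = np.arange(2 * fs) / fs                 # two seconds of signal
signal = np.sin(2 * np.pi * 5 * t)         # hypothetical 5 Hz background oscillation

for window_len in (100, 1000):             # 50 ms and 500 ms windows
    freqs, _ = windowed_spectrum(signal, fs, centre_idx=fs, window_len=window_len)
    print(f"{window_len / fs * 1000:.0f} ms window: frequency bins {freqs[1]:.0f} Hz apart")
```

Under these assumptions the 50 ms window resolves frequencies only in 20 Hz steps, so a 5 Hz oscillation cannot be isolated, whereas the 500 ms window has 2 Hz steps but attributes its result to a half-second stretch of signal.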
A solution to these issues is wavelet analysis. There are many types of wavelets, each with
their own benefits for signal processing. The wavelets utilised in this research are Gabor
wavelets, which are formed by multiplying sine waves of particular frequencies by a
Gaussian function.
Rather than deconstructing a wave into sine waves of varying frequencies, as the Fourier
transform does, wavelet analysis deconstructs a signal into many wavelets of different
frequencies. This is achieved by a process of convolution between the signal and the different
wavelets. In practical applications, wavelets of low frequency and therefore longer time
duration are convolved with the signal to pick up low frequency, long duration activities of
the raw signal, while short duration, high frequency wavelets are convolved with the signal
to pick up high frequency activity.
By using a mix of wavelet frequencies, we escape the limitations of fixed window sizes.
Additionally, since we are considering a Gabor wavelet, which is the product of a sine wave
and a Gaussian bell curve, when convolution takes place between the wavelet and signal within
certain windows
of time, the points in the outer region of such a window are attenuated according to their
relative position in time. This means that the more central a point in a window, the more
strongly it is attributed to that point in time.
A time-frequency image is simply a 2D matrix in which the rows stand for each frequency
considered in a wavelet analysis, and the columns the specific times at which these
frequencies occurred. Wavelet analysis allows for the deconstruction of signal into the
frequencies that occur while also maintaining a high temporal resolution.
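As a rough sketch of how such a time-frequency image could be computed, the code below convolves a signal with complex Gabor wavelets, one per frequency. It is not the implementation used in this research: the function names, the choice of six cycles per wavelet (`n_cycles`), the normalisation, the 2000 Hz sampling rate and the simulated 250 Hz burst are all assumptions.

```python
import numpy as np

def gabor_wavelet(freq_hz, fs, n_cycles=6):
    """Complex sinusoid at `freq_hz` multiplied by a Gaussian envelope."""
    sigma_t = n_cycles / (2 * np.pi * freq_hz)        # envelope narrows as frequency rises
    t = np.arange(-4 * sigma_t, 4 * sigma_t, 1.0 / fs)
    wavelet = np.exp(2j * np.pi * freq_hz * t) * np.exp(-t ** 2 / (2 * sigma_t ** 2))
    return wavelet / np.sum(np.abs(wavelet))          # keep wavelets comparable across frequencies

def time_frequency_image(signal, fs, freqs_hz):
    """2D matrix: one row per frequency, one column per time point, entries are power."""
    rows = [np.abs(np.convolve(signal, gabor_wavelet(f, fs), mode="same")) ** 2
            for f in freqs_hz]
    return np.array(rows)

# Hypothetical example: a short 250 Hz burst centred at t = 1 s.
fs = 2000                                             # assumed sampling rate in Hz
t = np.arange(2 * fs) / fs
burst = np.sin(2 * np.pi * 250 * t) * np.exp(-(t - 1.0) ** 2 / (2 * 0.02 ** 2))
freqs_hz = np.arange(80, 501, 10)
tf_image = time_frequency_image(burst, fs, freqs_hz)
peak_row, peak_col = np.unravel_index(np.argmax(tf_image), tf_image.shape)
```

Here `freqs_hz[peak_row]` and `peak_col / fs` give the frequency and time at which the image is brightest, which in this example fall on the simulated burst.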
3. Literature Review and History of the Field
Early research was able to identify HFOs and link these oscillations to the seizure onset zone
(Fisher et al. 1992). Before attempts at automation were proposed, researchers deployed
manual review to identify oscillations of interest. This was a monotonous, time-consuming
task and required the efforts of highly trained experts in the field. Their excellent work laid
the foundations of the field today.
Since then, numerous research papers have aimed to create a method for the automated
detection of HFOs. As the field has progressed, a range of machine learning techniques have
been applied in hope of finding a suitable model. Due to the large number of papers
published in the field, a full overview of each method applied is beyond the scope of this
dissertation. However, in this chapter, an overview that highlights some of the most notable
and successful papers is given.
It should be noted that most HFO detection techniques can be thought of as a three-step
process. Firstly, filtering is applied to the raw signal in order to bring into focus frequencies of
interest. Secondly, putative HFOs are identified by the application of a threshold based on
selected characteristics of the filtered wave. Lastly, more advanced machine learning models
are used to distinguish true HFOs from background noise and errors (Navarrete et al. 2016a).
3.1 Overview of Previous Methods
3.1.1 Early Papers
The earliest attempt at automated HFO detection was a process put forward in 2002 by Staba
et al. (Staba et al. 2002). Raw signal was first filtered to attenuate all frequency behaviour
outside of the 100-500Hz range. A 3-millisecond sliding window was then passed over this
filtered signal, and the root mean square (RMS) value was calculated for each window.
Successive RMS values calculated to be greater than 5 standard deviations above the mean
RMS value of the entire signal, with duration of 6ms or longer, were selected as putative
HFOs. These potential HFOs were then subjected to an additional rule: there must be a
minimum of 6 peaks greater than 3 standard deviations above the mean value of the rectified
band pass signal. The method, while not providing a completely effective solution to
automated HFO detection, provided further evidence for the existence of different forms of
HFOs, specifically the existence of ripples and fast ripples.
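The two rules described above can be sketched as follows. This is an illustrative reading of the description in this section, not the original authors' code: the function names, the moving-average implementation of the sliding window, the definition of a peak as a sample larger than both neighbours, and the simulated noise and burst are all assumptions.

```python
import numpy as np

def rms_detector(filtered, fs, window_ms=3, n_sd=5, min_duration_ms=6):
    """Return (start, end) sample indices of putative HFOs in a band-passed signal."""
    win = max(1, int(round(window_ms * fs / 1000)))
    # Sliding-window RMS: square, moving average, square root.
    rms = np.sqrt(np.convolve(filtered ** 2, np.ones(win) / win, mode="same"))
    above = rms > rms.mean() + n_sd * rms.std()
    # Find where runs of consecutive above-threshold samples begin and end.
    padded = np.concatenate(([False], above, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = edges[::2], edges[1::2]
    min_len = int(round(min_duration_ms * fs / 1000))
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_len]

def has_enough_peaks(filtered, start, end, n_peaks=6, n_sd=3):
    """Second rule: at least `n_peaks` rectified peaks above mean + n_sd standard deviations."""
    rectified = np.abs(filtered)
    threshold = rectified.mean() + n_sd * rectified.std()
    seg = rectified[start:end]
    is_peak = (seg[1:-1] > seg[:-2]) & (seg[1:-1] > seg[2:]) & (seg[1:-1] > threshold)
    return int(is_peak.sum()) >= n_peaks

# Hypothetical data: ten seconds of noise with one 40 ms, 250 Hz burst inserted at 5 s.
rng = np.random.default_rng(0)
fs = 2000                                              # assumed sampling rate in Hz
signal = 0.1 * rng.standard_normal(10 * fs)
burst_start = 5 * fs
signal[burst_start:burst_start + 80] += 3 * np.sin(2 * np.pi * 250 * np.arange(80) / fs)
events = [ev for ev in rms_detector(signal, fs) if has_enough_peaks(signal, *ev)]
```

Because both thresholds are computed from the statistics of the entire signal, the number and size of events in a recording influence which events are detected.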
In Crépon et al, a Hilbert transform was applied to the high pass filtered signal in order to
obtain a signal envelope (Crépon et al. 2009). Local maxima of this signal envelope were
considered putative HFOs. Unfortunately, the classification methods detected many false
positives. This led to the research team having to manually review potential false positives
using visual inspection of both the raw signal and time-frequency maps. Therefore, this could
only be considered a semi-automated procedure.
3.1.2 Clustering Methods
As the field of automated HFO detection progressed, so did the application of statistical
techniques and algorithms that have been successful in other fields. A multitude of papers
have applied clustering methods to this classification problem.
Research conducted by Blanco et al. was able to classify HFOs into three distinct classes through
the use of k-medoid clustering, with a fourth group of artifacts also formed (Blanco et al. 2010).
Their methods included the use of the RMS based method for pre-selecting putative
HFOs proposed by Staba et al. The output of these clustering methods gave further evidence
for the existence of distinct sub-classes of HFO, such as ripples, fast ripples, and mixed
frequency events. This research was particularly notable as the first paper to apply
unsupervised methods to the problem.
In 2019, both semi-supervised k-means and mean shift clustering algorithms were utilised
(Du et al. 2019). Several pre-processing steps were taken. Firstly, data normalization was
conducted using min-max normalization, before several filtering methods were applied. The
Teager energy operator and wavelet entropy were used as features in the semi-supervised
k-means clustering algorithm to separate pathological and physiological HFOs. Previously
labelled data was used to initialise the k-means clustering algorithm. Remaining data was then
labelled based on its position relative to these pre-created cluster centres and was used to
iteratively calculate new centroids. The group labelled as pathological HFOs then had an
unsupervised mean shift clustering algorithm applied to further divide into sub-classes.
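The seeding idea behind semi-supervised k-means can be sketched in a few lines. This is a generic illustration and not the procedure of Du et al.: the function name `seeded_kmeans`, the decision to keep the labelled points in every centroid update, the two-dimensional feature vectors and the simulated data are all assumptions.

```python
import numpy as np

def seeded_kmeans(features, labelled_features, labels, n_iter=20):
    """k-means whose initial centroids are the class means of already-labelled points."""
    classes = np.unique(labels)
    centroids = np.array([labelled_features[labels == c].mean(axis=0) for c in classes])
    for _ in range(n_iter):
        # Label each remaining point by its nearest cluster centre.
        dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
        assigned = dists.argmin(axis=1)
        # Recompute each centroid from its labelled seeds plus the points assigned to it.
        centroids = np.array([
            np.vstack([labelled_features[labels == c], features[assigned == k]]).mean(axis=0)
            for k, c in enumerate(classes)
        ])
    return classes[assigned], centroids

# Hypothetical two-dimensional features forming two well-separated groups.
rng = np.random.default_rng(0)
unlabelled = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
seeds = np.array([[0.1, -0.2], [-0.3, 0.2], [5.2, 4.9], [4.7, 5.1]])
seed_labels = np.array(["physiological", "physiological", "pathological", "pathological"])
predicted, centroids = seeded_kmeans(unlabelled, seeds, seed_labels)
```

Initialising the centroids from labelled examples ties each cluster to a known class from the outset, so the resulting clusters carry meaningful labels.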
3.1.3 Linear SVMs
Research by Matsumoto et al. provided an in-depth analysis that led to a further understanding
of how event frequency, duration and amplitude are linked to physiological and pathological
HFOs (Matsumoto et al. 2013). This paper proposed a linear SVM for HFO classification.
Visual scanning and finger movement exercises were used to elicit physiological HFOs
within iEEG recordings of patients. These physiological events were then compared to the
pathological HFOs also recorded. The linear SVM provided extremely mixed results. For
instance, specificity ranged between 32.61% and 99.38% according to which patient the HFOs
were recorded from.
Jrad et al proposed the use of a multi-class linear SVM to classify HFOs from artifacts and
noise, as well as classify HFOs themselves into 4 distinct sub-groups (Jrad et al. 2016). Gabor
atoms were utilized to deconstruct raw signal into specific frequency bands, before energy
ratios and temporal information were used as features for input to the multiclass SVM. A
particularly interesting part of this study was the use of simulated data to test their model. As
well as using recorded data from epilepsy patients, this paper simulated HFO data by
inserting real-life events into background activity. This is a technique similar to the one
utilized in this dissertation and provides support for the possible applicability of simulated
data in HFO detection.
3.1.4 Neural Networks
The first attempt at the application of neural networks to the HFO classification problem was
proposed by Dümpelmann et al (Dümpelmann et al. 2012). In this paper, a radial basis
function (RBF) network was proposed as a classifier. Pre-processing of the raw signal was
conducted via the application of a high pass filter. Three features were extracted and used to
form input vectors for the network: short time energy, short time line length and short time
instantaneous frequency. Unfortunately, the results showed particularly low sensitivity and
specificity of 49.1% and 36.3% respectively.
López-Cuevas et al. investigated the use of an artificial recurrent neural network (ARNN) to
classify HFOs in rats (López-Cuevas et al. 2013). Feature extraction was carried out by the
use of approximate entropy to highlight points of high unpredictability in the raw signal.
While the paper mentions that the ARNN was trained and tested on this data, the results were
not released.
3.1.5 Convolutional Neural Networks
The first application of a CNN to this classification task was proposed by Zuo et al. (Zuo et al.
2019). In this paper, data pre-processing consisted of applying band-pass filtering on the raw
signal. The raw signal was then divided into one second periods. Oscillatory behaviour
considered to be noise and artifacts was removed. Greyscale images in which a higher
amplitude of the signal was plotted using darker colours were used as an input for the
network. In this paper, the CNN’s ability to classify HFOs from non-HFOs was measured
against commonly used methods such as a Short Time Energy Detector, Short Line Length
Detector, Hilbert Detector and MNI Detector that were implemented in the RIPPLELAB
application (Navarrete et al. 2016b). In most instances, the proposed CNN structures were
able to gain more accurate results than these highly respected models.
3.2 Current Industry Standard
A true evaluation of which approach and specific paper has been the most successful in the
field of automated HFO detection is impossible. This is primarily because papers test their
models on sets of putative HFOs generated by their own specific patients, pre-processing
steps, and opinions of what exactly constitutes an HFO. These datasets then offer varying
levels of difficulty to models and make a direct comparison of accuracies between papers
unsuitable in most situations. The propensity of papers to focus on sub-problems of the field,
such as only considering ripples or fast ripples but not both, also makes direct comparisons
between papers laborious and often inappropriate.
The industry standard model in this dissertation is taken to be a method proposed by Lai et al.
This uses short time energy calculations to identify putative HFOs, before a CNN is
developed to classify their time-frequency plots (Lai et al. 2019).
iEEG data was collected from five patients using subdural electrodes. The raw signal was
first visually inspected and examples of extreme noise and artifacts were removed. The
continuous signal was partitioned into 15-minute segments before bandpass filtering was
applied to highlight events in the range of 80-250Hz (corresponding to ripples) and 250-
500Hz (corresponding to fast ripple activity). After filtering, each 15-minute segment
containing events of these frequencies was normalized. These segments were further split
into 10ms windows in which the short time energy (STE) was calculated. If STE values were above a
certain threshold for three successive windows, then a 2 second window centred on this
region was deemed a putative HFO.
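The short time energy step can be sketched as below. Several details are assumptions, since they are not specified in this section: STE is taken to be the sum of squared samples in each non-overlapping 10ms window, the threshold is supplied by the caller, and only one event is reported per run of supra-threshold windows. The function name and the simulated signal are hypothetical.

```python
import numpy as np

def ste_putative_events(filtered, fs, threshold, window_ms=10, n_successive=3, event_s=2.0):
    """Return (start, end) sample ranges of windows centred on supra-threshold regions."""
    win = int(round(window_ms * fs / 1000))
    n_windows = len(filtered) // win
    # Short time energy: sum of squared samples within each non-overlapping window.
    ste = (filtered[:n_windows * win].reshape(n_windows, win) ** 2).sum(axis=1)
    half = int(event_s * fs / 2)
    events, run = [], 0
    for i, is_above in enumerate(ste > threshold):
        run = run + 1 if is_above else 0
        if run == n_successive:                     # e.g. three successive windows above threshold
            centre = (i - n_successive // 2) * win + win // 2
            events.append((max(0, centre - half), min(len(filtered), centre + half)))
    return events

# Hypothetical example: a silent signal containing one 50 ms, 250 Hz burst at 5 s.
fs = 2000                                           # assumed sampling rate in Hz
filtered = np.zeros(10 * fs)
filtered[10000:10100] = np.sin(2 * np.pi * 250 * np.arange(100) / fs)
events = ste_putative_events(filtered, fs, threshold=5.0)
```

In this example a single 2 second window surrounding the burst is returned as a putative HFO.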
For each putative HFO, a time-frequency plot was then created. These plots were used as the
input for the CNN. Two specialist reviewers visually marked each plot as either an HFO or
non-HFO in a double-blind trial, providing a labelled dataset for the CNN to predict against.
The dataset consisted of 14,998 HFOs; however, the number of non-HFOs used in this
dataset is unspecified. 10% of this data was taken as the test set, with the remaining data
split again into a training set (80%) and a validation set (20%). The results on the test set were
sensitivities of 88.16% and 93.37% on Ripples and Fast Ripples respectively.
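To make the proportions of this split explicit, the sketch below divides a set of sample indices in the same way. The function name, the use of random shuffling and the example of 1,000 samples are assumptions; how the original split was drawn is not stated here.

```python
import numpy as np

def split_indices(n_samples, test_frac=0.10, val_frac=0.20, seed=0):
    """Hold out a test set, then split the remainder into training and validation sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)                   # assumed: random shuffling
    n_test = int(round(n_samples * test_frac))
    n_val = int(round((n_samples - n_test) * val_frac))
    test = order[:n_test]
    val = order[n_test:n_test + n_val]
    train = order[n_test + n_val:]
    return train, val, test

train, val, test = split_indices(1000)                   # hypothetical dataset size
```

With these fractions the training, validation and test sets make up 72%, 18% and 10% of the full dataset respectively.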
The reasons behind taking this methodology as the industry standard are three-fold. Firstly,
the use of time-frequency plots using wavelet analysis has proven a promising method in
HFO detection (Liu et al. 2016). Secondly, there are a multitude of reasons why CNN
classifiers have a high applicability to HFO detection, as discussed in section 4.4. Finally,
the results themselves are promising. While, as suggested earlier, direct comparisons of
results should be taken with a pinch of salt in this field, this method does provide more
accurate results than most other papers, including the research by Zuo et al., which in turn
showed more accurate results than several well-respected automated detection algorithms.
4. Introduction to Convolutional Neural Networks
4.1 Overview of Basic Principles
CNNs are often difficult to interpret. Therefore, it was deemed necessary in this dissertation to
give a brief overview of the key foundational principles.
4.1.1 How Images are Interpreted by a CNN
Coloured images are 3-dimensional tensors. Two of these dimensions specify the horizontal
and vertical positions of pixels, while the third defines the colour channel. When an image is
used as input for a CNN, each pixel is seen as an individual element of this tensor.
4.1.2 Convolutional Layers
CNNs are layered structures. Several different layer types are used to carry out specific tasks
in a network. Convolutional layers are a foundational component of any CNN. In these
layers, the mathematical process of convolution occurs between the input and a pre-defined
number of learnable kernels (also known as filters). Kernels convolve across the input, dot
products are taken at each position, and the results of these dot products are the entries to a 2-
dimensional activation map. These activation maps stacked in the depth dimension are the
output of the convolutional layer. Consequently, the number of kernels defined in a layer
affects the shape of the output. For example, a convolutional layer that operates with n
kernels would result in a depth wise stack of n 2-dimensional activation maps.
The spatial dimensions of the kernels are another hyperparameter to consider, while the
stride hyperparameter controls how the kernels move across the input. For instance, defining
a stride of 2 means the kernels move 2 pixels at a time over the input. Consequently, the size
of each activation map is determined by the stride and kernel size.
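A convolutional layer's forward pass can be written out directly, which makes the role of the number of kernels, the kernel size and the stride visible in the output shape. The sketch below is a plain NumPy illustration with hypothetical names and sizes. It assumes no padding, so the output size also depends on the size of the input, and, as is usual in deep learning libraries, the kernels are not flipped.

```python
import numpy as np

def conv_layer(image, kernels, stride=1):
    """image: (H, W, C); kernels: (n, kh, kw, C). Returns (H_out, W_out, n), no padding."""
    h, w, _ = image.shape
    n, kh, kw, _ = kernels.shape
    h_out = (h - kh) // stride + 1
    w_out = (w - kw) // stride + 1
    out = np.zeros((h_out, w_out, n))
    for i in range(h_out):
        for j in range(w_out):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw, :]
            # Dot product between the patch and each of the n kernels at this position.
            out[i, j, :] = np.tensordot(kernels, patch, axes=([1, 2, 3], [0, 1, 2]))
    return out

# Hypothetical sizes: a 32 x 32 colour image and eight 3 x 3 kernels moved with a stride of 2.
image = np.ones((32, 32, 3))
kernels = np.ones((8, 3, 3, 3))
activation_maps = conv_layer(image, kernels, stride=2)   # shape (15, 15, 8)
```

The final dimension of the output equals the number of kernels, giving the depth-wise stack of activation maps described above.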
4.1.3 Pooling Layers
Pooling layers serve to reduce the size of an input. In these layers, a pooling window of pre-
specified size passes over the input with a designated stride. The most common form of
pooling is max pooling, in which the maximum value from each pooling window is selected
to be part of the output. However, average pooling is also commonplace, where the mean of
the elements within each window is calculated.
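Both forms of pooling can be illustrated on a single activation map. The function name `pool2d` and the 4 × 4 example are assumptions made for illustration.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Pool a 2D activation map with a square window of side `size`."""
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    reduce = np.max if mode == "max" else np.mean
    return np.array([[reduce(x[i * stride:i * stride + size, j * stride:j * stride + size])
                      for j in range(w_out)] for i in range(h_out)])

x = np.arange(16).reshape(4, 4)
max_pooled = pool2d(x, mode="max")        # [[5, 7], [13, 15]]
avg_pooled = pool2d(x, mode="average")    # [[2.5, 4.5], [10.5, 12.5]]
```

With a 2 × 2 window and a stride of 2, each spatial dimension of the input is halved.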
4.2 Inception Networks
One method for increasing the performance of CNNs in classification tasks is to increase the
computational complexity of a model. In a sequential CNN model this can be achieved by
extending depth (additional layers) or width (the number of filters at each convolutional
layer). Deep sequential models have proven their merits in fields such as image classification
(Krizhevsky et al. 2012). However, the consequence of increasing the number of layers/filters
is that we also increase the computational cost of a model, which in turn leads to longer
training times. Moreover, a sequential structure is limited due to the requirement that we must
define the filter size at each layer. Such an approach stipulates that only features of an exact
spatial size can be learned at each point in a model.
Due to the repercussions associated with deeper sequential models, their usability in HFO
detection and the medical field in general is limited. Ideally, the aim is to create models that
can identify more fine detailed features while also maintaining a reasonably low
computational requirement and training time.
One such CNN model structure that may satisfy these requirements is the inception network.
Inception networks do not rely on a simple sequential structure; rather, they utilise sub-structures known as
inception modules. Inception modules are acyclic in structure and consist of towers orientated
in parallel to each other.
A sequential structure would only permit the application of either a convolutional or a
pooling process with a fixed window size, one after another. An inception module instead
applies this variety of processes in parallel. Consequently, an inception module can learn
filters with a variety of spatial sizes simultaneously, while also considering the
applicability of pooling at each stage.
Figure 2 gives an example of a generalised inception
module. An input is taken, and various procedures are applied in parallel. The outputs of
these parallel towers are then concatenated into a single tensor and act as an input for the next
inception module.
Of course, this is an extremely
computationally expensive solution. For this
reason, convolutional processes with 1×1
filter sizes, a step size of 1 and a smaller
number of filters than the depth of the input
are used to reduce the size of a tensor just
before it is acted upon by computationally
expensive convolutional processes.
A 1×1 convolution process undertaken with a step size of 1 and k filters will of course
maintain the shape of any input spatially, but the depth of the output shape will be of size k.
Importantly, the amount of information lost during this process has been found to be minimal
in practice (Szegedy et al. 2015). This reduction method allows us to amplify a model both in
terms of depth and width while maintaining a comparatively low computational cost.
Figure 2: Naively structured inception module (an input feeds parallel towers of convolution with m × m filters, convolution with n × n filters and pooling, whose outputs are concatenated)
Figure 3: 1×1 convolutions for depth reduction
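The saving from 1×1 depth reduction can be illustrated with a rough multiplication count. The sketch below is illustrative only: the input size (29 × 29 × 192), the 5 × 5 filter size and the filter counts are assumed values chosen for the example rather than figures taken from any model in this dissertation, and padding that preserves spatial size with a step size of 1 is assumed throughout.

```python
def conv_mults(height, width, in_depth, filters, kernel):
    """Multiplications for a standard convolution that preserves spatial size (step size 1)."""
    return height * width * filters * kernel * kernel * in_depth


# Assumed example values (hypothetical, for illustration only).
HEIGHT, WIDTH, DEPTH = 29, 29, 192
FILTERS, KERNEL, REDUCED_DEPTH = 32, 5, 16

# A 5x5 convolution applied directly to the full-depth input.
direct = conv_mults(HEIGHT, WIDTH, DEPTH, FILTERS, KERNEL)

# A 1x1 convolution with k = REDUCED_DEPTH filters, followed by the same 5x5 convolution.
reduced = (conv_mults(HEIGHT, WIDTH, DEPTH, REDUCED_DEPTH, 1)
           + conv_mults(HEIGHT, WIDTH, REDUCED_DEPTH, FILTERS, KERNEL))

print(f"direct 5x5:   {direct:,} multiplications")
print(f"1x1 then 5x5: {reduced:,} multiplications")
print(f"ratio:        {direct / reduced:.1f}")
```

Under these assumed sizes the reduced path needs roughly a tenth of the multiplications, which is the kind of budget that can then be spent on extra depth or width.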
4.3 Residual Connections
Residual networks were first put forward by He et al. (2016) as a solution to the
degradation of accuracy observed when training very deep CNNs.
In this method, the tensor formed before a convolutional block is added to the output of said
convolutional block. The intuition in this idea is that the effects of a convolutional block can
be thought of as a function H acting upon an input x. It is proposed that since a block is
capable of learning a complicated function H(x), it makes sense that it could also learn the
residual function, i.e. H(x) − x (He et al. 2016). Ergo, residual connections mean that a
convolutional block learns a residual f(x) where H(x) = x + f(x). The input information x
skips the effects of a convolutional block and is added to the output of a block despite the
relevant function not being applied to it.
Figure 4: Inception module structure with 1×1 convolutional blocks (parallel towers containing convolutions with m × m, n × n and p × p filters and a pooling process, together with 1 × 1 convolutions, whose outputs are concatenated)
This type of connection helps to solve issues such as a vanishing gradient and has allowed for
the learning of deeper networks. The implementation of this connection adds no extra
parameters and requires no extra computational power other than the inconsequential amount
required for the element-wise addition of the tensors.
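The relation H(x) = f(x) + x can be sketched in a few lines. This is a minimal illustration using plain Python lists in place of tensors; `residual_block` and `toy_block` are hypothetical names, and `toy_block` is only a stand-in for a real convolutional block.

```python
def residual_block(x, f):
    """Return H(x) = f(x) + x, where f plays the role of the convolutional block."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("f(x) must match the shape of x for element-wise addition")
    # The only extra work a residual connection adds is this element-wise sum.
    return [xi + fi for xi, fi in zip(x, fx)]


def toy_block(x):
    """Hypothetical stand-in for a learned block: any shape-preserving function."""
    return [0.5 * xi for xi in x]


x = [1.0, -2.0, 4.0]
h = residual_block(x, toy_block)
print(h)  # [1.5, -3.0, 6.0]
```

If the block learns f(x) = 0, the output is simply x, which is why the skip connection makes an identity mapping easy to represent.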
5. Data Pre-processing
5.1 Wave Simulations
The data used within most HFO detection papers are sourced from epileptic patients.
However, due to several limiting factors, this study uses artificially created data produced by
Miguel Navarrete, an expert in this field. Factors include the unavailability of iEEG data
obtained in such a medical trial, the ethical issues surrounding the recording of new data and
the time constraints imposed by this being an MSc dissertation rather than a full research
paper. In practical terms, we are simulating putative HFOs already identified by some
method. All data was simulated using MATLAB.
Since the objective is to produce artificial data that replicates the task of HFO classification,
it makes sense to set up a system that properly represents how iEEG data is recorded. In a
real-life situation, measuring apparatuses are implanted into a patient's exposed brain tissue at
various positions. Each apparatus has a grouping of electrodes that record voltage. Our setup
considers a generalised version of such a measuring device on which 4 electrodes are
positioned equidistant from one another.
Figure 5: Residual connection (normal connection: x passes through a convolutional block to give H(x); residual connection: x is also added to the block output, so that H(x) = f(x) + x)
While there is no available iEEG data that contains HFOs, there does exist iEEG data without
these events present. These signals are used as a baseline within which we implant oscillatory
behaviours of interest. Specifically, baseline signals are separated into 2 second intervals, and
oscillatory behaviours are implanted into the centre of these segments. To give a realistic and
demanding challenge, nine distinct behaviours are formulated. Each type of signal behaviour
is a waveform often encountered by HFO detectors.
Each simulated wave is formulated with its own distinct properties such as frequency and
duration. These properties are randomly set to values between pre-defined limits normally
exhibited by that type of waveform. Below is a summary of each simulated waveform:
• Ripples: This is an HFO event in which signal oscillates at a relatively high
frequency. Frequency is randomly set between 120Hz and 240Hz. This wave may
repeat anything between 8 and 20 times.
• Fast Ripples: This is another HFO event, like a ripple, where signal is oscillating
with high frequency. However, the frequency of this waveform is greater than that of
a ripple. Fast ripple frequency is randomly set within 240Hz to 450Hz. This wave
may repeat anything between 5 and 15 times.
• Spike: This is a non-HFO event. The signal reaches a high amplitude before suddenly
falling. Because this activity is rapid and contains high-frequency content, it is
especially difficult for any HFO detector to distinguish this non-pathological spiking
activity from an HFO. Instead of being defined in terms of how
many times a wave repeats, spikes are defined over a randomised time interval.
Spikes may be anything from 0.025 to 0.08 milliseconds in duration.
• Ripple-FastRipple: This is a simulated event in which a ripple and fast ripple are set
to occur simultaneously. The challenge for an HFO detection model is to disregard
any disruptive effects both HFOs may have on each other.
• Spike-Ripple: This is an event in which an HFO in the form of a ripple and a non-HFO
behaviour in the form of a spike are simulated simultaneously. Models
must flag this behaviour as an HFO despite the presence of signal-distorting non-HFO
behaviour.
• Spike-FastRipple: In this instance we simulate an HFO in the form of a fast ripple,
as well as a non-HFO in the form of a spike to occur simultaneously.
• Baseline: No wave is simulated; this is simply the raw baseline signal without any
distinctive behaviour within it. Of course, raw signal, even without specific waves
within it, may contain a fair amount of noise that has the potential to be misidentified
as HFO activity.
• Noise: This is a simulation of the extremely noisy signal often encountered in iEEG data.
No specific type of wave is simulated; instead, the signal is distorted. Noise is defined
over 0.2 seconds and is created using a pink noise function in MATLAB. This gives the
signal in this region of time added irregularity, which may be mistaken for HFO
behaviour.
• Artifact: This is an anomalous activity not associated with the electrical behaviour of
the brain but rather with external sources. Artifacts are created by passing a
Gaussian wave through a step function, which replicates the large jumps
in the signal created by external sources.
Any signal in which a Ripple or Fast Ripple occurs should be classified as an HFO; that is,
Ripples, Fast Ripples, Ripple-FastRipples, Spike-Ripples and Spike-FastRipples are labelled
as HFOs, while Spikes, Baseline, Noise and Artifacts are labelled as non-HFOs.
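The labelling rule and the randomised parameters described above can be sketched as follows. This is not the MATLAB simulation code, which is not reproduced in this dissertation; the names `OSCILLATION_RANGES`, `is_hfo` and `sample_oscillation` are hypothetical, and uniform sampling is an assumption, since the text only states that values are set randomly between limits.

```python
import random

# Frequency (Hz) and repetition ranges quoted in the waveform summary above.
OSCILLATION_RANGES = {
    "Ripple": {"frequency_hz": (120.0, 240.0), "cycles": (8, 20)},
    "Fast Ripple": {"frequency_hz": (240.0, 450.0), "cycles": (5, 15)},
}

HFO_TYPES = {"Ripple", "Fast Ripple", "Ripple-FastRipple", "Spike-Ripple", "Spike-FastRipple"}
NON_HFO_TYPES = {"Spike", "Baseline", "Noise", "Artifact"}


def is_hfo(wave_type):
    """Any event containing a Ripple or Fast Ripple is labelled as an HFO."""
    if wave_type in HFO_TYPES:
        return True
    if wave_type in NON_HFO_TYPES:
        return False
    raise ValueError(f"unknown wave type: {wave_type}")


def sample_oscillation(wave_type, rng):
    """Draw a frequency and repetition count (uniform sampling is an assumption)."""
    ranges = OSCILLATION_RANGES[wave_type]
    frequency = rng.uniform(*ranges["frequency_hz"])
    cycles = rng.randint(*ranges["cycles"])  # inclusive at both ends
    # Duration follows from the two sampled values: cycles divided by cycles per second.
    return {"frequency_hz": frequency, "cycles": cycles, "duration_s": cycles / frequency}


rng = random.Random(0)
event = sample_oscillation("Ripple", rng)
print(event, is_hfo("Spike-Ripple"), is_hfo("Noise"))
```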
5.2 Simulating the Effects of Distance
The use of simulations gives the opportunity to study HFO detection from an alternative
perspective. Specifically, we would like to investigate whether the distance between an
electrode and the source of an event influences detection accuracy. Intuitively, we would
expect behaviours that occur further from the point of measurement to be more challenging to
correctly identify.
In order to properly simulate the effects of distance, the behaviours from the previous section
are set to occur within a 3-dimensional coordinate system. Within this system, the electrode
positions are constant. Ripples, Fast Ripples and Spikes are set to occur no closer than 0.01,
0.005 and 0.005 metres to an electrode respectively. The maximum distance between these
behaviours and any electrode is upper bounded at 0.04 metres.
In order to simulate the resistive properties of brain tissue, we calculate the expected effects
of travelling through grey matter across the pre-specified distance using methods derived by
Logothetis et al. (2007), where the average impedance for grey matter in a medio-lateral
direction is taken to be 75 ohms.
Through the application of the methods put forward in this research, the attenuating effects of
distance on each event can be calculated. These scaled waves are then projected onto the
baseline signal. The higher the distance, the less prominent the behaviour within the signal,
meaning it should be more difficult to correctly identify.
5.3 Time-Frequency Plot Creation
Section 2.5 covered the fundamentals of time-frequency analysis. Time-frequency plots
derived from wavelet analysis have proven to be a valuable method for predicting HFOs (Liu
et al. 2016; Lai et al. 2019). This project builds upon such research and utilizes a new way to
encode time-frequency data into 3-dimensional plots. These plots contain not only the
information that would be available within a regular 2-dimensional time-frequency plot, but
also information on how an electrode's signal varies from the electrodes in its immediate
vicinity.
In a real-life case, groups of electrodes on a measuring apparatus are implanted to a small
depth within the brain. From a logical standpoint it makes sense to take advantage of this
grouped structure to calculate the probability of an HFO having occurred, rather than rely on
each electrode to work independently.
This new method can be thought of as a stacking of three time-frequency plots where each
layer is sourced from a particular signal. Since we have three 2-dimensional time-frequency
plots stacked in the depth dimension, we can visualise this matrix as an RGB image, where the first
time-frequency plot corresponds to red, the second to green and the third to blue.
The first signal, from which we create a time-frequency plot that corresponds to red in the
final image, is simply the normal signal measured from that specific electrode.
The second signal is formed by taking the difference between the specified electrode and the
most closely positioned electrode below it in the coordinate system. In the case of the
electrode positioned at the bottom, the signal of the electrode positioned immediately above it
is subtracted. This signal is then a representation of the voltage difference measured between
that electrode and a closely positioned electrode.
The third signal is formed by first taking the mean over the signals measured from each of the
4 electrodes and then subtracting it from that electrode's signal. The middle 0.5 seconds of
each signal is used to create the time-frequency plots.
Once all three 2-dimensional time-frequency images are created, these are stacked to form a
3-dimensional matrix. An example of such a matrix, with the signals used to form it, is
visualised using the RGB configuration in figure 6.
The time-frequency plots are created using a wavelet analysis algorithm. Specifically, Gabor
wavelet transforms are used in this instance to create plots that encode the behaviour over a
duration of 0.5 seconds.
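The three signals that feed the red, green and blue layers can be sketched as below. This is a minimal illustration, not the original implementation: `channel_signals` is a hypothetical name, the electrodes are assumed to be listed from top (index 0) to bottom, and the wavelet transform that turns each signal into a time-frequency plot is deliberately left out.

```python
def channel_signals(signals, index):
    """Return (raw, neighbour difference, mean-referenced) signals for one electrode.

    signals: equal-length voltage sequences, ordered top to bottom (assumed ordering).
    """
    target = signals[index]

    # Subtract the nearest electrode below; the bottom electrode uses the one above it.
    neighbour_index = index + 1 if index < len(signals) - 1 else index - 1
    neighbour = signals[neighbour_index]

    # Mean over all electrodes at each time step.
    mean = [sum(values) / len(signals) for values in zip(*signals)]

    raw = list(target)
    difference = [t - n for t, n in zip(target, neighbour)]
    mean_referenced = [t - m for t, m in zip(target, mean)]
    return raw, difference, mean_referenced


# Toy voltages for 4 electrodes over 2 time steps (hypothetical values).
signals = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
top = channel_signals(signals, 0)
bottom = channel_signals(signals, 3)
print(top)
print(bottom)
```

Each of the three returned signals would then be passed through the wavelet analysis and the resulting plots stacked in the depth dimension.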
Figure 6: Wavelet analysis of 3 waves (axes: 0.5 seconds of time and increasing frequency)
5.4 Creation of Sets for Cross Validation and Final Testing
The end result of these simulations is a total of 512,760 time-frequency matrices. 285,400 of
these are HFOs, while 227,360 are non-HFOs. Python is used to extract matrices from the
MATLAB files and save each as a .png file.
When creating sets for training and testing of models, stratified random sampling is used.
HFOs and non-HFOs are taken as the strata, which ensures that the ratio of HFOs to non-
HFOs is maintained in each set. 10% of the data is taken as the test set, which is composed of
28,540 HFOs and 22,736 non-HFOs. From the remaining 90% of data, a 4-fold cross-
validation is composed. A 4-fold cross-validation stipulates training models on 75% of a set
and testing on 25%. Consequently, in each fold models are trained using 192,645 HFOs and
153,468 non-HFOs, before being tested on 64,215 HFOs and 51,156 non-HFOs.
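The set sizes quoted above follow directly from applying the same fractions to each stratum. The short check below reproduces them; `split_counts` is a hypothetical helper, and it only counts instances rather than performing the random sampling itself.

```python
def split_counts(total, folds=4, test_divisor=10):
    """Sizes of the test set and of one cross-validation fold for a single stratum."""
    test = total // test_divisor          # 10% hold-out test set
    remaining = total - test              # 90% used for cross-validation
    validation = remaining // folds       # 25% of the remainder tested in each fold
    training = remaining - validation     # 75% of the remainder trained on in each fold
    return {"test": test, "remaining": remaining,
            "fold_train": training, "fold_test": validation}


hfo = split_counts(285_400)
non_hfo = split_counts(227_360)
print(hfo)
print(non_hfo)
```

Because both strata are divided by the same fractions, the ratio of HFOs to non-HFOs is identical in every set.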
This dissertation proposes three different model structures. First, the 4-fold cross-validation is
conducted for these models. This provides a reliable measure of the stability of each model's
predictive power. An optimally designed model is then chosen based upon the mean accuracy
obtained over the folds of the cross-validation.
The data previously used to make the cross-validation folds is then combined into a set of
256,860 HFOs and 204,624 non-HFOs. The chosen model and the re-created industry
standard model are trained on this set before being applied to the hold-out test set. An
in-depth analysis of the results can then be conducted.
5.5 Limitations of Simulated Data
Simulated data provides a reasonable alternative for this project, especially when considering
said data is created by an expert in this particular field. Additionally, simulations allow for a
proper study of distance effects on HFO detection, something that has not been conducted
before. Of course, it is vital we understand the potential pitfalls of these simulation methods,
and the limitations this imposes on any conclusions that are made in this research.
One implication that must be considered is that this experimental simulative setup may lead
to mis-labelling by wave diminishment. Consider the simulation of a Ripple. It is possible
that this Ripple is created with a small amplitude as well as a large distance from the
electrode. This could mean its magnitude within the baseline signal is diminished to such an
extent that we must consider the question: when is a Ripple truly considered a Ripple? The
wave considered may be so minute that an expert would no longer consider this type of
behaviour an HFO; despite this, our system labels it as an HFO. This is a
by-product of the simulation setup, and although very rare, it is possible that some instances of
the dataset may replicate this situation.
A similar issue is that the simulation may lead to mis-labelling by overly distortive wave
effects. In this simulation method, hybrid waves formed from both an HFO and non-HFO are
created. These simulations are all labelled as HFOs, since we would like a model to detect the
appearance of this behaviour despite the effects of the disruptive non-HFO activity. However,
this may lead to instances where the non-HFO behaviour is so disruptive, and the HFO
behaviour so diminished and affected by noise, that the final wave no longer properly
resembles any kind of HFO behaviour. Again, this is a possible occurrence in which we label
the behaviour as an HFO, despite the fact that upon visual inspection, experts may no longer
classify the wave as an HFO.
Finally, it is important to consider how the relative cardinality of the wave subsets produced
leads to a disproportioned dataset. In these simulations, each type of wave is given an equal
probability to occur. This produces a dataset with roughly equal numbers of each type of
wave. In a real-life study, it is highly unlikely that these proportions would actually occur: we
may see a very small or very large amount of a certain wave type, meaning this
behaviour is over- or under-represented in our dataset. This is not as much of an issue when
comparing the performance of models against each other within this dissertation, since the
models being compared are exposed to the same data. However, it would be unsuitable to make a
side-by-side comparison between the overall accuracies obtained here and papers that derive
their data from a real-life medical study.
6. Construction of Models
6.1 Reconstruction of the Industry Standard Model
This dissertation includes graphs of interconnected blocks to visualise network
sub-structures. Each block represents a separate process within a structure. An explanation of how
to read these diagrams can be found in the appendices.
In this study, we re-create the CNN proposed by Lai et al (2019) in order to give a baseline
level of performance. This model is a simple sequential structure with two convolutional
layers and two pooling layers. Refer to figure 7 for a visualisation of this structure.
Unfortunately, the paper gives no information about several key design choices, therefore
some presumptions are made in order to re-create the model. For instance, in the fully
connected layers of the model, both the number of layers and the number of neurons within
each layer are unspecified. We make the presumption that there are 16 neurons in the first
layer and 10 in the second layer. This assumption is derived from the model diagram within
the original paper.
The optimization method used is also left unspecified, and so is assumed to be standard
stochastic gradient descent. This is the same optimization method used by our models. The
learning rate and its decay settings are also unspecified. In practical runs of the model it was
found that a large initial learning rate often led to poor results. An initial learning rate of
0.01 was found to be optimal and this was halved every 2 epochs. A batch size of 250 is used,
the same as for our proposed models. The finalised model has a total of 3,844,512 parameters, all
of which are trainable.
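The step decay described above can be written as a small function. This is a sketch under two assumptions not stated in the text: epochs are counted from zero, and the halving takes effect at the start of every second epoch. The name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, initial_rate=0.01, halve_every=2):
    """Learning rate for a zero-indexed epoch when the rate is halved every `halve_every` epochs."""
    return initial_rate * 0.5 ** (epoch // halve_every)


# Schedule over the 10 training epochs, starting from 0.01.
schedule = [learning_rate(epoch) for epoch in range(10)]
for epoch, rate in enumerate(schedule):
    print(f"epoch {epoch}: {rate:.6f}")
```

The same function covers any other starting value by changing `initial_rate`.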
6.2 Construction of New Models
This dissertation proposes three new models, each of which follows a different structure. We
label these models A, B and C. Each model is optimized via stochastic gradient descent with
a batch size of 250. The initial learning rate is 0.05 and is set to halve every 2 epochs. Due to
the time regulations surrounding access to certain facilities on the Supercomputing Wales
cluster, the maximum number of epochs is restricted to 10. For each convolutional block,
non-linearity is achieved by the application of a rectified linear activation function. Batch
normalization is then applied to this output.
Figure 7: CNN structure proposed by Lai et al (blocks shown: Input; Conv 16 (3,3) 1 N, twice; Max Pool (3,3) 2 N; Max Pool (3,3) 1 N; Dense 12; Dense 10; Softmax)
Each model uses depth-wise separable convolution in place of the classical convolutional
process, as this has been shown to provide a modest upgrade in accuracy whilst
simultaneously reducing the number of parameters (Chollet et al. 2017). Another
architectural feature common to all models is the use of overlapping pooling, meaning that all
pooling procedures are carried out with a step size smaller than the corresponding pool size.
This has been found to produce more accurate results (Krizhevsky et al. 2012).
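The parameter saving offered by depth-wise separable convolution can be illustrated with a simple weight count. The sketch below ignores bias terms and uses assumed layer sizes (a 3 × 3 filter, 64 input channels, 128 filters) chosen purely for illustration; the function names are hypothetical.

```python
def standard_conv_params(kernel, in_depth, filters):
    """Weights in a classical convolution: one kernel x kernel x in_depth filter per output channel."""
    return kernel * kernel * in_depth * filters


def separable_conv_params(kernel, in_depth, filters):
    """Weights in a depth-wise separable convolution (biases ignored)."""
    depthwise = kernel * kernel * in_depth   # one spatial filter per input channel
    pointwise = in_depth * filters           # 1x1 convolution that mixes the channels
    return depthwise + pointwise


# Assumed example sizes (hypothetical).
standard = standard_conv_params(3, 64, 128)
separable = separable_conv_params(3, 64, 128)
print(standard, separable, round(standard / separable, 1))
```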
Each model uses the same simple, sequentially structured stem. This serves not only to
extract low-level features of interest, but also to reduce the size of the input tensor before more
complex processes are applied.
Each model includes a final pooling procedure positioned after the inception modules. In
practice, tests of the way in which these models optimized showed a need to reduce the
interdependence between neurons in the network. Therefore, dropout with a rate of 0.2 is used
to promote the learning of more robust features and minimize overfitting. Specifically, this
dropout is employed for all weights between the output of the final pooling layer and the
softmax layer.
6.2.1 Model A
Model A applies filter factorisation and expansive filter banks to put a focus on capturing
large spatial features in a computationally efficient manner. While convolution with large
filter sizes may potentially extract features of great importance to this specific task, this
comes at the expense of computational efficiency. For example, each time a 7×7 filter is
applied, there are 7×7=49 multiplicative operations required. A 3×3 filter on the other hand
requires 3×3=9 operations. The 7×7 filter is therefore 49/9 ≈ 5.44 times more
expensive. A proposed solution to this problem is to replace convolutional layers that depend
upon large filter sizes with a series of convolutional processes with smaller filter sizes. Not
only does this reduce computational expense, theory suggests this method is able to extract
high dimensional features in a similar way to that of large filter sizes (Szegedy et al. 2016).
Figure 8: Stem of newly proposed models (blocks shown: Input; Conv 64 (3,3) 1 N; Max Pool (3,3) 2 N; Conv 32 (3,3) 2 N; Conv 128 (3,3) 1 N; Conv 192 (3,3) 2 N)
Taking this idea further, we can employ asymmetric factorization of convolutional layers, i.e.
the replacement of single convolutional layers that employ n×n filters with two layers that
employ 1×n and n×1 filter sizes respectively. Again, this reduces computational cost and we
can still capture high dimensional features. In the original paper proposing this method of
asymmetric factorization, it was found to be particularly effective when acting upon inputs of
size m×m where m ∈ [12,20]. Conversely, this method was less effective on earlier layers
(Szegedy et al. 2016).
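The operation counts behind both forms of factorization can be compared directly. The sketch below counts multiplications per filter application for a single channel, in the same way as the 7×7 versus 3×3 comparison above; it ignores channel depths, and the function names are hypothetical. The stacked case relies on the fact that (n − 1)/2 successive 3×3 convolutions with a step size of 1 cover an n×n region for odd n.

```python
def square_cost(n):
    """Multiplications for one application of an n x n filter."""
    return n * n


def asymmetric_cost(n):
    """Multiplications for a 1 x n filter followed by an n x 1 filter."""
    return n + n


def stacked_3x3_cost(n):
    """Multiplications for the stack of 3 x 3 filters that covers an n x n region (odd n >= 3)."""
    layers = (n - 1) // 2
    return 9 * layers


for n in (3, 5, 7):
    print(n, square_cost(n), stacked_3x3_cost(n), asymmetric_cost(n))
```

For n = 7 this gives 49 multiplications for the single filter, 27 for three stacked 3×3 filters and 14 for the 1×7 and 7×1 pair.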
Another method employed in this model is the use of filter bank expansion. This is a method
in which a convolutional layer of a tower within an inception module is replaced by two
convolutional processes to be considered in parallel. Szegedy et al (2016) promoted the use
of this method when applied to a spatially small input.
The 2nd and 4th inception modules act without padding and employ a step size of 2 within one
process of each tower. This serves to reduce the spatial size of their inputs. In contrast, the 1st,
3rd and 5th modules employ padding throughout and a step size of 1 in order to extract
features while maintaining spatial size for the next module.
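The effect of padding and step size on spatial size can be checked with the usual output-size formula. In the sketch below, `output_size` is a hypothetical helper; the unpadded case uses floor((n − window) / step) + 1, and the padded case assumes the "same" padding convention of Keras and TensorFlow, where the output size is ceil(n / step). Whether the padding used in the models follows exactly this convention is an assumption.

```python
def output_size(n, window, step, padding):
    """Spatial output size of a convolution or pooling process along one dimension."""
    if padding:
        # 'same'-style padding: size depends only on the step (ceiling division).
        return -(-n // step)
    # No padding: only windows that fit entirely inside the input are used.
    return (n - window) // step + 1


# A 3x3 window with a step size of 2 and no padding, as in the reducing modules.
after_module_2 = output_size(29, 3, 2, padding=False)
after_module_4 = output_size(after_module_2, 3, 2, padding=False)

# A 3x3 window with a step size of 1 and padding, as in the size-preserving modules.
preserved = output_size(29, 3, 1, padding=True)

print(after_module_2, after_module_4, preserved)  # 14 6 29
```

Starting from a 29 × 29 input, two such reductions give 14 × 14 and then 6 × 6.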
An overview of the structure of Model A is given in the table below. The finalised model has
3,442,717 parameters, 3,427,805 of which are trainable and 14,912 of which are non-trainable.
Figures 9-13 give a graphical representation of the inception modules used in the model.
Process       Output Shape
Stem          (29, 29, 192)
Module 1      (29, 29, 256)
Module 2      (14, 14, 224)
Module 3      (14, 14, 768)
Module 4      (6, 6, 1280)
Module 5      (6, 6, 2048)
Final Pool    (1, 1, 2048)
Softmax       (0, 0, 2)
Figure 9: Module 1 of Model A
Figure 10: Module 2 of Model A
Figure 11: Module 3 of Model A
Figure 12: Module 4 of Model A
6.2.2 Model B
Model B is a more classical inception network design. A maximum filter size of 5×5 is
employed, but redistribution of its computational budget is used to create a deeper network
than models A and C. In Model B, similarly structured inception modules are repeated with
increasing filter numbers at each stage.
Again, spatial size of the input tensor is decreased throughout the model. However, in
this case, this reduction is achieved by intermediate pooling layers located just after the
2nd and 4th inception modules. The finalised model has 1,738,573 total parameters,
1,728,541 of which are trainable and 10,032 of which are non-trainable.
Process       Output Shape
Stem          (29, 29, 192)
Module 1      (29, 29, 288)
Module 2      (29, 29, 480)
Module 3      (14, 14, 512)
Module 4      (14, 14, 512)
Module 5      (6, 6, 832)
Module 6      (6, 6, 1024)
Final Pool    (1, 1, 1024)
Softmax       (1, 1, 2)
Figure 13: Module 5 of Model A
Figure 14: Module 1 of Model B
Figure 15: Module 2 of Model B
Figure 16: Module 3 of Model B
Figure 17: Module 4 of Model B
Figure 18: Module 5 of Model B
Figure 19: Module 6 of Model B
6.2.3 Model C
Model C employs residual connections within an inception network structure. As discussed in
section 4.3, residual connections were first proposed as a method for training extremely deep
networks (He et al. 2016). Although the models proposed in this paper lack such depth,
the applicability of residual connections to this task remains worth investigating.
Moreover, residual connections have been deployed in inception architectures for image
recognition with great effect (Szegedy et al. 2017).
Inception modules are designed with less width than the modules proposed in models A and
B; again, this is in keeping with the designs that have been successful for inception networks
(Szegedy et al. 2017). In this model, the 1st and 3rd modules employ residual connections,
while others are classical inception modules that act to reduce spatial dimensions.
Factorization, as used in Model A, is utilised.
It should be noted that after the depth-wise concatenation in modules that employ residual
connections, this output is often of reduced depth when compared to the input of the module.
Due to this discrepancy, an element-wise addition would be impossible. Therefore a 1×1
convolutional layer with a step size of 1 is utilised, where the number of filters is equal to the
depth of the input tensor. This method is used in order to scale up the output from the
module.
Additionally, in a residual connection, before the processed tensor is added to the tensor from
the previous layer, the values are multiplied by a scaling factor of 0.2. This scaling is a
technique proposed to fix the training instability that can be exhibited when the number of
kernels within a module becomes exceptionally large (Szegedy et al. 2017). It proved a necessary
design choice in Model C, as early practical applications without scaling showed the model
was unable to properly optimize. The finalised model has 3,427,005 total parameters, 3,411,069
of which are trainable and 15,936 of which are non-trainable.
Process       Output Shape
Stem          (29, 29, 192)
Module 1      (29, 29, 192)
Module 2      (14, 14, 960)
Module 3      (14, 14, 960)
Module 4      (6, 6, 1856)
Module 5      (6, 6, 1856)
Final Pool    (1, 1, 1856)
Softmax       (0, 0, 2)
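The scaled addition used in the residual modules can be sketched as follows. This is a minimal illustration with plain lists standing in for tensors; `scaled_residual_add` is a hypothetical name, and the processed tensor is assumed to have already been brought back to the input depth by the 1×1 convolution described above.

```python
SCALE = 0.2  # scaling factor applied to the processed tensor before the addition


def scaled_residual_add(x, processed, scale=SCALE):
    """Return x + scale * processed, element by element."""
    if len(x) != len(processed):
        raise ValueError("tensors must have matching shapes for element-wise addition")
    return [xi + scale * pi for xi, pi in zip(x, processed)]


# Toy values (hypothetical): the module output is damped before it is added to the input.
x = [1.0, 2.0]
processed = [10.0, -5.0]
print(scaled_residual_add(x, processed))
```

With a scale of 0.2, the module contributes only a fifth of its raw output to the sum, which limits how far any single module can move the activations.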
Figure 18: Module 1 of Model C
Figure 19: Module 2 of Model C
Figure 20: Module 3 of Model C
Figure 21: Module 4 of Model C
6.3 Applicability of Deep Learning and CNNs to HFO Detection
While this area of research is an active one, models thus far fall short of the accuracies
required to apply an automated technique in the medical field. The challenge of HFO
classification lies in the intricacy of the data. Not only can HFOs occur at a variety of
frequencies and durations, but non-HFO electrical behaviours also commonly lead to signal
distortion. Consequently, the range of signal behaviours that can constitute an HFO
is vast.
Less complex methods such as SVMs and clustering, while showing reasonable accuracies,
have proven unable to truly capture the intricate factors at play in HFO detection. Deep
learning promises a way to build models with the ability to properly encapsulate the difficulty
of the task.
CNNs are highly applicable to this problem. Their ability to learn features automatically,
rather than rely on manual feature extraction, distinguishes them from more traditional
methods. Classical predictive models are only as strong as the features they are fed, and
those features must be chosen in the hope that they carry statistical discriminatory value.
From a logical standpoint, the creation of an HFO classification model can be split into two
steps. The first of these is the way in which the raw signal is encoded to make it interpretable
to models. The second is how the model is structured to find patterns of interest from this
data. The methodology presented in this dissertation therefore provides a comprehensive
solution. Wavelet analysis allows for a more accurate encoding of frequency and time
information, while inception networks provide a model type more appropriately designed to
learn the intricate and subtle features of time-frequency plots than the CNN proposed by Lai
et al. (2019).
Figure 22: Module 5 of Model C
7. Applying the Models
We now apply the models to our data. Cross-validation is first used to test the stability of
each model's predictive power. Once an optimally designed structure has been established,
we apply this model and the industry standard to a final test set. Models were trained on the
Supercomputing Wales cluster. Specifically, training was distributed over 2 GPUs
simultaneously. Model scripts as well as an example bash script are given in the appendices.
7.1 Cross-Validation Results
Table 7.1.1 gives the results of the cross-validation. To give some appreciation of a baseline
performance, the industry standard is also applied at this stage. All three
proposed models perform more accurately than the industry standard across all four folds.
Furthermore, the time taken to complete the 10 epochs is presented in the format
days:hours:minutes. Models show negligible difference from the industry standard on this metric.
Model B obtains the highest mean accuracy across the folds of the cross validation. Due to
this performance, we take this to be the optimally designed model and select it for application
on the test set.
Accuracy
                     Fold 1     Fold 2     Fold 3     Fold 4     Mean
Model A              96.74%     96.95%     96.82%     96.93%     96.86%
Model B              96.93%     96.81%     97.00%     96.96%     96.925%
Model C              96.36%     96.19%     96.44%     96.33%     96.33%
Industry Standard    95.81%     95.60%     95.52%     95.87%     95.70%
Time taken (days:hours:minutes)
                     Fold 1     Fold 2     Fold 3     Fold 4     Mean
Model A              01:17:22   01:17:58   01:18:07   01:17:05   01:17:38
Model B              01:19:50   01:18:23   01:17:42   01:18:01   01:18:29
Model C              01:17:20   01:17:53   01:17:32   01:18:30   01:17:48
Industry Standard    01:19:04   01:17:15   01:16:41   01:17:39   01:17:40
35
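The selection rule used in this section, choosing the model with the highest mean accuracy across folds, can be checked directly from the reported fold accuracies. The Python sketch below recomputes the means; the names `fold_accuracy`, `summarise` and `best_model` are illustrative and are not taken from the original scripts, and the fold-to-fold range is included only as a rough indication of stability.

```python
from statistics import mean

# Fold accuracies (%) as reported in the cross-validation results.
fold_accuracy = {
    "Model A": [96.74, 96.95, 96.82, 96.93],
    "Model B": [96.93, 96.81, 97.00, 96.96],
    "Model C": [96.36, 96.19, 96.44, 96.33],
    "Industry Standard": [95.81, 95.60, 95.52, 95.87],
}


def summarise(folds):
    """Return the mean accuracy and the fold-to-fold range for one model."""
    return mean(folds), max(folds) - min(folds)


summary = {name: summarise(folds) for name, folds in fold_accuracy.items()}

# Select the model with the highest mean accuracy across the folds.
best_model = max(summary, key=lambda name: summary[name][0])

for name, (avg, spread) in summary.items():
    print(f"{name:<18} mean={avg:.3f}%  range={spread:.2f}")
print("Selected:", best_model)
```

The range across folds is below half a percentage point for every model, which is consistent with the stability that the cross validation is intended to test.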
7.2 Test Set Results
Model B and the Industry Standard model are now applied to the test set in order to simulate
a real-life HFO detection task. As discussed in section 5.4, the data utilised for the 4-fold
cross validation now becomes the training set. Batch size, number of epochs, and
optimisation details are maintained from the previous section.
We give an overall analysis of predictive performance and also investigate accuracy with
respect to each individual wave type. We further inspect whether the distance
between the waveform and electrode has a meaningful effect on predictive power.
7.2.1 Overall Performance
Figures 25 and 26 show the confusion matrices obtained by applying Model B and the
Industry Standard to the test set. Model B has a higher number of true positives and true
negatives, indicating better performance.
Figure 25: Confusion matrix for Model B's application on the test set
Figure 26: Confusion matrix for the Industry Standard's application on the test set
To obtain a more formal measure of performance, sensitivity and specificity are calculated
using the equations below. Accuracy and the time taken to complete the 10 training epochs
are also presented.
$$\text{Sensitivity} = \frac{TP}{TP + FN}$$
$$\text{Specificity} = 1 - \frac{FP}{TP + FP}$$
Model B outperforms the Industry Standard in terms of overall sensitivity and specificity.
The Industry Standard model is faster to complete the 10 epochs; however, such a small
difference in training time is arguably negligible in this situation.
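As a worked check, the summary measures can be recomputed from confusion-matrix counts. The Python sketch below uses the "Total" rows of the wave-type breakdowns in section 7.2.2 as counts, with HFO as the positive class. The function name and dictionary keys are illustrative. Note that the specificity expression as printed above, 1 − FP/(TP + FP), is algebraically equal to TP/(TP + FP); the more common definition TN/(TN + FP) is computed alongside it for comparison only.

```python
def detection_metrics(tp, fn, tn, fp):
    """Summary metrics from confusion-matrix counts (HFO = positive class)."""
    return {
        "sensitivity": tp / (tp + fn),
        # Expression as printed in the text: 1 - FP / (TP + FP).
        "specificity_as_printed": 1 - fp / (tp + fp),
        # Common textbook definition, shown for comparison.
        "specificity_tn_based": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }


# Counts taken from the "Total" rows of the HFO and non-HFO breakdowns.
model_b = detection_metrics(tp=27501, fn=1039, tn=22272, fp=464)
industry = detection_metrics(tp=27060, fn=1480, tn=21959, fp=777)

for name, metrics in (("Model B", model_b), ("Industry Standard", industry)):
    print(name)
    for key, value in metrics.items():
        print(f"  {key:<24}{100 * value:6.2f}%")
```

For Model B this reproduces the reported sensitivity (96.36%), specificity (98.34%, using the expression as printed) and accuracy (97.07%). The TN-based value for Model B is 97.96%, which coincides with the overall non-HFO accuracy reported in section 7.2.2.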
Below is a visualisation of the predictive behaviour of each model on the 51,276
instances of the test set and how this behaviour relates to the other model. Each circle of
the Venn diagram represents the set of test instances accurately predicted by one of the
models.
                 Model B     Industry Standard
Training Time    01:21:23    01:20:47
Sensitivity      96.36%      94.81%
Specificity      98.34%      97.03%
Accuracy         97.07%      95.59%
[Venn diagram: Model B only 1256; both models 48517; Industry Standard only 502; outside both circles 1001]
Figure 23: Venn diagram comparing the performance of Models
48,517 of the instances within the test set are accurately predicted by both models. 1256
instances are accurately predicted by Model B that are inaccurately predicted by the Industry
Standard model. Conversely, 502 instances are correctly predicted by the Industry Standard
model that are incorrectly predicted by Model B. 1001 instances are incorrectly predicted by
both models.
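The four regions of such a Venn diagram can be obtained by comparing each model's predictions with the true labels instance by instance. The Python sketch below shows one way to do this; the function name, the key names and the toy labels are hypothetical. It also checks that the reported region sizes sum to the size of the test set and agree with the overall accuracy of Model B.

```python
def venn_counts(y_true, pred_first, pred_second):
    """Count instances by which of two models classified them correctly."""
    counts = {"both": 0, "first_only": 0, "second_only": 0, "neither": 0}
    for truth, first, second in zip(y_true, pred_first, pred_second):
        first_ok = first == truth
        second_ok = second == truth
        if first_ok and second_ok:
            counts["both"] += 1
        elif first_ok:
            counts["first_only"] += 1
        elif second_ok:
            counts["second_only"] += 1
        else:
            counts["neither"] += 1
    return counts


# Toy example (hypothetical labels): 1 = HFO, 0 = non-HFO.
y_true = [1, 1, 0, 0, 1, 0]
pred_first = [1, 1, 0, 1, 0, 0]
pred_second = [1, 0, 0, 1, 1, 0]
toy = venn_counts(y_true, pred_first, pred_second)
print(toy)

# Region sizes reported in the text (first = Model B, second = Industry Standard).
reported = {"both": 48517, "first_only": 1256, "second_only": 502, "neither": 1001}
total = sum(reported.values())
model_b_accuracy = (reported["both"] + reported["first_only"]) / total
print(total, f"{100 * model_b_accuracy:.2f}%")
```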
7.2.2 Accuracy Breakdown by Wave Type
While Model B performs more effectively when taking an overall view of the test set, it is
important to understand the reasons for this performance. A more in-depth analysis of
each model’s performance by wave type is necessary. By evaluating model performance by
wave type, we can examine whether Model B identifies better patterns across all waves,
or simply performs better on a certain sub-task of the problem.
The table below gives a breakdown by wave type of the performance of both Model B and the
Industry Standard on HFOs of the test set. Interestingly, Model B has more accurate results
for all types of wave. This suggests Model B has developed more meaningful features than the
Industry Standard, independent of the type of HFO on which a prediction is made.
Both models can identify fast ripples with more accuracy than ripples, which is in line with
the findings of the industry standard paper (Lai et al. 2019). Both models find situations in
which a ripple and fast ripple wave occur simultaneously to be the simplest wave type to predict.
                              Model B                           Industry Standard
Wave                 N        Correct   Incorrect   Accuracy    Correct   Incorrect   Accuracy
Ripple               5651     5520      131         97.68%      5412      239         95.77%
Fast Ripples         5768     5655      113         98.04%      5606      162         97.19%
FastRipple-Ripple    5637     5627      10          99.82%      5625      12          99.79%
Spike-FastRipple     5836     5438      398         93.18%      5355      481         91.76%
Spike-Ripple         5648     5261      387         93.15%      5062      586         89.62%
Total                28540    27501     1039        96.36%      27060     1480        94.81%
Perhaps in line with intuition, situations in which a ripple or fast ripple occurs with disruptive
non-HFO behaviour in the form of a spike present the most difficult task for the models.
There is a significant drop in both ripple and fast ripple detection accuracy when spiking
behaviour is set to occur simultaneously.
The table below gives a breakdown by wave type of each model’s performance on non-HFO
events. Interestingly, both models have a higher accuracy when evaluating non-HFO activity
compared to HFO activity. Both models perform well on artifacts and noise; Model B in fact
scores 100% when predicting on artifacts. Spikes present the most challenging waveform to
predict for both models. Model B again performs more accurately than the Industry Standard
on all sub-types of non-HFO behaviour.
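A breakdown of this kind amounts to grouping the test instances by wave type and counting correct and incorrect predictions within each group. The following Python sketch illustrates the calculation under the assumption that each test instance carries a wave-type label; the function name and the toy data are hypothetical and are not taken from the original scripts.

```python
from collections import defaultdict


def accuracy_by_group(groups, y_true, y_pred):
    """Return {group: (n, correct, incorrect, accuracy)} for each group label."""
    tally = defaultdict(lambda: [0, 0])  # group -> [n, number correct]
    for group, truth, pred in zip(groups, y_true, y_pred):
        tally[group][0] += 1
        tally[group][1] += int(truth == pred)
    return {g: (n, c, n - c, c / n) for g, (n, c) in tally.items()}


# Hypothetical toy data: wave type, true label (1 = HFO) and predicted label.
waves = ["Ripple", "Ripple", "Spike", "Spike", "Spike", "Noise"]
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0]

breakdown = accuracy_by_group(waves, y_true, y_pred)
for wave, (n, correct, incorrect, acc) in breakdown.items():
    print(f"{wave:<8} N={n} correct={correct} incorrect={incorrect} accuracy={acc:.2%}")
```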
7.3 Predictive Power of Models over Distance
Of interest in this field is the performance of a model with respect to the distance between the
electrodes and the behaviour it measures. The challenge in a real-life scenario is locating
specific epileptogenic brain tissue. Ergo, future solutions may look to utilise predictions from
multiple electrodes, all with varying distance to the behaviour, in order to more precisely
locate defective tissue. This means an understanding of how HFO detection accuracy varies
over distance is of great importance. It should be noted that distance measurements are given
in metres within this report.
                      Model B                           Industry Standard
Wave         N        Correct   Incorrect   Accuracy    Correct   Incorrect   Accuracy
Artifacts    5663     5663      0           100%        5652      11          99.81%
Baseline     5639     5508      131         97.68%      5431      208         96.31%
Noise        5679     5678      1           99.98%      5662      17          99.70%
Spike        5755     5423      332         94.23%      5214      541         90.60%
Total        22736    22272     464         97.96%      21959     777         96.58%
7.3.1 Simple Waveforms – Ripples, Fast Ripples and Spikes
We first investigate HFO accuracy over distance when models are evaluating simple
waveforms. By simple waveforms, we mean waveforms without additional disruptive waves
set to coincide with them. Specifically, we first study simple ripples, fast ripples and spikes.
Figures 28-31 show plots of the output probabilities from a model against the distance to the
respective wave. For example, figure 28 shows how Model B predicts on the ripples within
the test set. By plotting the probability output against the distance from each ripple, we can
analyse how the model’s performance differs with respect to the distance from the ripples. In
the case of figure 28, we can see that the distribution of incorrectly predicted ripples is
shifted towards larger distances.
The output probability simply reflects the probability with which a model predicts an instance
to be an HFO. Therefore, in plots like these, points that lie towards the central area
of the y-axis correspond to instances where the model is less certain about which label should
be allocated to the observation. Comparing figures 28 and 29, we can clearly see that not only
does the Industry Standard incorrectly predict more instances, but there is also far less
certainty in its predictions.
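The reading of these plots can be made concrete with a small calculation: instances are split into correct and incorrect predictions at the 0.5 threshold, and each group is summarised by its median distance and by how far its output probabilities lie from 0.5. The Python sketch below assumes a 0.5 decision threshold, as implied by the plots; the function name and all numerical values are hypothetical and serve only to illustrate the calculation.

```python
from statistics import median


def summarise_predictions(probs, distances, is_hfo, threshold=0.5):
    """Split one wave type's instances into correct/incorrect and summarise each."""
    correct, incorrect = [], []
    for p, d in zip(probs, distances):
        predicted_hfo = p >= threshold
        (correct if predicted_hfo == is_hfo else incorrect).append((p, d))

    def describe(rows):
        if not rows:
            return None
        return {
            "n": len(rows),
            "median_distance": median(d for _, d in rows),
            # Distance of the output from the threshold: 0 means maximally uncertain.
            "median_certainty": median(abs(p - threshold) for p, _ in rows),
        }

    return describe(correct), describe(incorrect)


# Hypothetical outputs for six ripples (true class: HFO); distances in metres.
probs = [0.98, 0.95, 0.91, 0.62, 0.41, 0.30]
distances = [0.002, 0.003, 0.004, 0.008, 0.009, 0.010]
correct, incorrect = summarise_predictions(probs, distances, is_hfo=True)
print("correct:  ", correct)
print("incorrect:", incorrect)
```

In this toy example the incorrectly predicted ripples sit at larger distances and closer to the threshold than the correctly predicted ones, which is the pattern described for figure 28.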
Figure 28: Model B's performance on Ripples over distance
Figure 29: Industry Standard's performance on Ripples over distance
Figure 30: Model B's performance on Fast Ripples over distance
From observation of each plot, it is evident that increased distance has some obscuring
effect on ripple and fast ripple detection for both models. Comparing figures 30 and 31, it
is clear that the Industry Standard predicts fast ripples far less accurately than
Model B.
Spiking is a non-HFO wave that both models found relatively challenging to predict. An
incorrect prediction on this wave type means labelling the spike as an HFO, so
incorrect points are this time located above 0.5 on the y-axis. Figures 32 and 33
compare the performance of the models on spikes of the test set. There appears to be a high
amount of uncertainty in the predictions from both models, which underlines the difficulty of
predicting this specific waveform. There also appears to be a small shift in the distribution of
the incorrect predictions towards smaller distances: the closer a spike is to the electrode, the
stronger the propensity of both models to confuse it with HFO activity.
Figure 31: Industry Standard's performance on Fast Ripples over distance
Figure 32: Model B's performance on Spikes over distance
7.3.2 Complex Waveforms – Pairings Between Ripples, Fast Ripples and Spikes
Classifying HFO activity when a ripple and a fast ripple occur simultaneously is an
important task for an HFO detection model. Another important task is detecting HFOs
in the presence of simultaneously occurring non-HFO activity.
In these mixed waves, two specific events occur, each with its own respective
distance. Therefore, we first consider each model’s accuracy over both these distances
simultaneously using 3D-plots.
In figures 34 and 35 we can see the predictions of Model B and the Industry Standard model
plotted against the distance from both the ripple and the fast ripple. From our previous
analysis, we have already seen that both models perform relatively well on this wave type,
and so examples of incorrect predictions are sporadic. However, most incorrect instances
appear to occur at larger distances.
Figure 33: Industry Standard's performance on Spikes over distance
Figure 34: Model B's performance on Ripple-FastRipples over distances
Figure 35: Industry Standard's performance on Ripple-FastRipples over distances
Figure 36: Incorrect predictions of Model B on Ripple-FastRipples
Figure 37: Incorrect predictions of the Industry Standard on Ripple-FastRipples
To make this pattern a little easier to observe, figures 36 and 37 show 2D plots of all incorrect
predictions of this wave type. These can be thought of as projections of incorrect predictions
onto the floor of figures 34 and 35. This allows us to clearly see where these incorrect
observations occur with respect to the distance from both the ripple and the fast ripple.
The models’ success on this wave type is to the detriment of the
analysis, since there are so few incorrect predictions to analyse. Despite the relatively small
volume of instances, it seems incorrect predictions for this wave type occur when both the
distance from the ripple and the distance from the fast ripple are reasonably large. Intuitively,
this makes sense, as predictive errors only seem to occur when both distances are large enough
to make the recognition of both a ripple and a fast ripple relatively difficult.
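The projection onto the floor of a 3D plot described above simply keeps the misclassified instances and discards the probability axis. A minimal Python sketch follows, assuming each instance is stored as a triple of two distances and an output probability and that 0.5 is the decision threshold; the function name and the values are hypothetical.

```python
def incorrect_projection(points, is_hfo=True, threshold=0.5):
    """Keep misclassified instances and drop the probability axis.

    points: iterable of (distance_1, distance_2, output_probability).
    Returns the (distance_1, distance_2) pairs of the incorrect predictions,
    i.e. their projection onto the floor of the 3D plot.
    """
    return [
        (d1, d2)
        for d1, d2, p in points
        if (p >= threshold) != is_hfo
    ]


# Hypothetical Ripple-FastRipple instances (true class: HFO); distances in metres.
points = [
    (0.002, 0.003, 0.99),
    (0.009, 0.010, 0.35),
    (0.004, 0.009, 0.80),
    (0.010, 0.008, 0.48),
]
floor_points = incorrect_projection(points)
print(floor_points)
```

For a non-HFO wave type such as a spike, the same function can be called with `is_hfo=False`, so that outputs above the threshold are the ones retained.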
From our previous analysis, waveforms consisting of both HFO and non-HFO behaviour
proved to be more challenging for both models. The occurrence of spiking activity seems to
hinder each model’s ability to extract the relevant HFO behaviour from the wave.
Figure 38: Model B's performance on Spike-Ripples over distances
Figure 39: Industry Standard's performance on Spike-Ripples over distances
Figures 38 and 39 visualise how the distances from both the ripple and the spiking activity
affect the predictive ability of the models. As previously discussed, Model B predicts more
accurately on this wave type. Not only are there fewer incorrect predictions from Model B, but
the cluster of correctly predicted instances is also more concentrated towards the high
probabilities, meaning there is more certainty regarding these observations. The Industry
Standard shows less certainty in this respect.
How the predictive power of these models changes over distance is again difficult to fully
interpret. We therefore take 2D projections of all incorrectly predicted instances onto the
floors of figures 38 and 39.
The plots suggest that as the distance to the ripple activity increases, the number of
incorrectly predicted observations increases. In contrast, most incorrectly predicted
observations seem to occur in situations where the spike occurs a smaller distance away. To
further analyse these effects, boxplots are given in figures 42 and 43. These show the
distributions of the incorrectly predicted instances in terms of the distance from both the
ripple and the spike.
Figure 40: Incorrect predictions of Model B on Spike-Ripples
Figure 41: Incorrect predictions of the Industry Standard on Spike-Ripples
The boxplots show that incorrect predictions are more likely to appear in situations where the
distance from the ripple is large, and the distance from the spike is small. This is in line with
logical reasoning. As expected, the smaller the distance from a spike, the more likely it is to
have disruptive effects. Meanwhile, ripple activity is easier to detect and classify when the
wave is less attenuated.
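The quantities drawn by such boxplots are the minimum, quartiles and maximum of each distance among the incorrectly predicted instances. The Python sketch below computes them with the standard library; the distance values are hypothetical and serve only to illustrate the comparison between the two distributions.

```python
from statistics import quantiles


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, q2, q3, max(values)


# Hypothetical distances (metres) for incorrectly predicted Spike-Ripple instances.
ripple_distances = [0.006, 0.007, 0.008, 0.009, 0.010]
spike_distances = [0.001, 0.002, 0.002, 0.003, 0.006]

ripple_summary = five_number_summary(ripple_distances)
spike_summary = five_number_summary(spike_distances)
print("ripple:", ripple_summary)
print("spike: ", spike_summary)
```

A higher median for the ripple distances than for the spike distances corresponds to the pattern described above, where errors concentrate at large ripple distances and small spike distances.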
We now consider waves composed of simultaneously occurring fast ripples and spikes. 3D-
plots of distance to both the fast ripple and the spike are given in figures 44 and 45. Again,
Model B predicts more accurately. Additionally, the cluster of correctly predicted
observations is more highly concentrated in the upper region, meaning the correct
predictions are made more confidently.
Figure 42: Distribution of incorrect predictions by Model B on Spike-Ripples
Figure 43: Distribution of incorrect predictions by the Industry Standard on Spike-Ripples
2D projections of all incorrectly predicted instances onto the floors of figures 44 and 45 are
given in figures 46 and 47 below. Again, this allows us to closely study the distribution of
incorrectly predicted observations in terms of both distances simultaneously.
Figure 44: Model B's performance on Spike-FastRipples over distances
Figure 45: Industry Standard's performance on Spike-FastRipples over distances
Figure 46: Incorrect predictions of Model B on Spike-FastRipples
Figure 47: Incorrect predictions of the Industry Standard on Spike-FastRipples
The plots for each model suggest that as the distance to the fast ripple activity increases, we
in turn see an increase in the number of incorrectly predicted results. There appears to be a
small shift in the plotted values towards regions where the distance from spiking activity is
smaller. This is a difficult pattern to interpret and so we inspect how the incorrect instances
are distributed over the distance to each kind of wave behaviour using boxplots.
Indeed, the plots show that increasing distance from the fast ripple activity has a disruptive
effect on the accuracy of both models. The opposite seems to be true for the spiking
behaviour. These conclusions again reinforce the behaviour we would expect.
Measurements in which a fast ripple and a spike occur simultaneously, and where the spiking
behaviour occurs far closer to an electrode than the fast ripple of interest, are prime
candidates to be misclassified. This is due to the highly distortive effects of the spike and
the relatively weak measurement of the fast ripple.
Figure 48: Distribution of incorrect predictions by Model B on Spike-FastRipples
Figure 49: Distribution of incorrect predictions by the Industry Standard on Spike-FastRipples
7.4 Performance on Small Datasets
This research utilizes simulated data, which allows for a vast dataset to be generated. In a
real-life application, data is sourced from pre-surgery epileptic patients, making it far scarcer.
Therefore, it is important to inspect how the proposed model adapts to small datasets.
From the final training set, a stratified random sample is drawn to create a small
dataset of just 20,000 images, half of which are HFOs and half of which are non-HFOs.
This set is used to train the Industry Standard and Model B. Optimization details are
unchanged from previous sections. These models were then applied to the test set.
Model B gives an accuracy of 94.58%, while the Industry Standard has an accuracy of 79.61%.
Model B is able to maintain a relatively high predictive power despite the reduced training set
size. The Industry Standard shows a far sharper drop-off in performance.
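The sampling step described in this section can be sketched as follows, assuming equal allocation per class as stated (10,000 HFO and 10,000 non-HFO images). The function name, the seed and the stand-in label list are hypothetical; the original sampling code is not given in the text, so this is one possible implementation rather than the procedure actually used.

```python
import random


def balanced_sample(labels, n_total, seed=0):
    """Return indices of a random sample with equal numbers of each class."""
    rng = random.Random(seed)
    classes = sorted(set(labels))
    per_class = n_total // len(classes)
    chosen = []
    for cls in classes:
        members = [i for i, y in enumerate(labels) if y == cls]
        # Sample without replacement within each class (stratum).
        chosen.extend(rng.sample(members, per_class))
    rng.shuffle(chosen)
    return chosen


# Hypothetical stand-in for the training labels: 1 = HFO, 0 = non-HFO.
labels = [1] * 30_000 + [0] * 25_000
subset = balanced_sample(labels, n_total=20_000)
n_hfo = sum(labels[i] for i in subset)
print(len(subset), n_hfo)
```

Fixing the seed makes the reduced training set reproducible, so that both models are trained on exactly the same 20,000 instances.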
8. Final Discussion
In conclusion, we present a model that is more accurate than the industry standard on
simulated data. Not only does the model obtain more accurate results overall, but upon further
analysis it also performs more accurately on all types of signal behaviour simulated.
Testing has only been conducted on simulated data. Even so, such successful results
give great promise that this model and similar structures will be similarly effective when acting
upon real-life data. If such performance is replicable on real-life data, this could
potentially be a significant breakthrough in the field.
Beyond presenting a successful new model, the analysis of how distance to behaviours
affects the classification accuracy of models is also valuable for the field. This analysis shows
how increasing distance from HFO behaviour negatively affects a model’s ability to detect
such behaviour. Furthermore, an investigation into signal behaviours derived from both HFO
and non-HFO events showed the disruptive effect of non-HFO behaviour and how it
increases with closer proximity. While the conclusions from this analysis may seem obvious,
this is the first research to analyse such effects in the field of HFO detection.
8.1 Limitations to Conclusions
While there are many positives to the work presented in this research, there are of course
limitations to what can be concluded. Specifically, the issues surrounding the use of
simulated data, as discussed in section 5.5, mean that it would be unwise to directly compare
the accuracies obtained within this report with accuracies obtained by papers from the field.
Rather, this research should simply show the applicability of these models, and act as a
motivation to test these models on real-life data.
8.2 Possible Next Steps
Results are promising, and a logical next step would be to test the model structures on real-
life data, although considerations would have to be made to record the relevant iEEG from
pre-surgery patients.
The depth and width of the models proposed here could also be increased. Design
choices such as the number of layers and the number of filters in said layers were restricted
in order to build models trainable within reasonable time limits. Far deeper and more complex
structures could be considered. Changes to other model hyperparameters, such as the
optimization method and number of epochs, could also be made to possibly yield more accurate
results.
It is not unreasonable to suggest that with the application of even more complex structures
than researched here, inception architectures may be able to approach the classification
accuracies necessary to make the utilisation of these models in the medical field practical.
9. Appendices
A. Diagram Explanation
This paper includes graphs to represent the network sub-structures. Each block in a graph represents a separate process. Below is an explanation of how to read the blocks that make up these graphs.
[Block layout: upper section "Layer type & filters used"; bottom-left "(x,y)"; bottom-middle "s"; bottom-right "pad"]
The upper section of a block defines the type of process applied at this stage. ‘Conv’ is used to show that this is a convolutional layer, and in this case the number of kernels used is indicated as a small number to the right of this. Alternatively, ‘Avg Pool’ and ‘Max Pool’ are used to indicate that this is an average pooling or max pooling layer.
The bottom-left section contains a tuple that defines the filter/pool size used in this convolutional/pooling layer.
The bottom-middle section gives an integer ‘s’ that corresponds to the step size used in this layer.
The bottom-right section defines information on the use of padding. A label of ‘Y’ indicates that the spatial dimensions of the input are maintained using zero-padding. A label ‘N’ indicates that no padding is used.
B. Bash Script Example
Example of a bash script used to distribute model training on the Supercomputing Wales cluster. This example relates to the final training of the Industry Standard model before application to the test set.
C. Learning Rate Example
Code defining function to control learning rate. This specific example relates to the industry standard model.
D. Inception Module Example
Example of an inception module structure defined in Keras. This example is of the 1st inception module of model B.
E. Example of Calling a Model in Keras
Calling a model for training. Example relates to calling model B for final training before application to the test set.
10. Bibliography
Bénar, C., Chauvière, L., Bartolomei, F., Wendling, F. 2010. Pitfalls of high-pass filtering for detecting epileptic oscillations: a technical note on “false” ripples. Clinical Neurophysiology 121(3), pp. 301-310.
Blanco, J.A., Stead, M., Krieger, A., Viventi, J., Marsh, R. et al. 2010. Unsupervised Classification of High-Frequency Oscillations in Human Neocortical Epilepsy and Control Patients. Journal of Neurophysiology 104(5), pp. 2900–2912.
Cendes, F. and Meador, K. 2018. Searching for the good and bad high-frequency oscillations. Neurology 90(8), pp. 347-348.
Chollet, F. 2017. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1251-1258).
Crépon, B., Navarro, V., Hasboun, D., Clemenceau, S., Martinerie, J. et al. 2009. Mapping interictal oscillations greater than 200 Hz recorded with intracranial macroelectrodes in human epilepsy. Brain 133(1), pp. 33-45.
Du, Y., Sun, B., Lu, R., Zhang, C., Wu, H. et al. 2019. A method for detecting high-frequency oscillations using semi-supervised k-means and mean shift clustering. Neurocomputing 350, pp. 102-107.
Dümpelmann, M., Jacobs, J., Kerber, K., Schulze-Bonhage, A. 2012. Automatic 80–250 Hz “ripple” high frequency oscillation detection in invasive subdural grid and strip recordings in epilepsy by a radial basis function neural network. Clinical Neurophysiology 123(9), pp. 1721-1731.
Fisher, R., Webber, W., Lesser., R., Arroyo., Uematsu, S. 1992. High-frequency EEG activity at the start of seizures. Journal of Clinical Neurophysiology 9(3), pp. 441-448.
He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).
Jacobs, J., LeVan, P., Chander, R., Hall, J., Dubeau, F. et al. 2008. Interictal high-frequency oscillations (80-500 Hz) are an indicator of seizure onset areas independent of spikes in the human epileptic brain. Epilepsia 49(11), pp. 1893–1907.
Jacobs, J., Zijlmans, M., Zelmann, R., Chatillon, C., Hall., J. et al. 2010. High-frequency electroencephalographic oscillations correlate with outcome of epilepsy surgery. Annals of Neurology 67(2), pp. 209–220.
Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).
Logothetis, N.K., Kayser, C. and Oeltermann, A., 2007. In vivo measurement of cortical impedance spectrum in monkeys: implications for signal propagation. Neuron, 55(5), pp. 809-823.
Matsumoto, A., Brinkmann, B., Stead, M., Matsumoto, J., Kucewicz. et al. 2013. Pathological and physiological high-frequency oscillations in focal human epilepsy. Journal of Neurophysiology 110(8), pp. 1958–1964.
Jefferys, J., Menendez de la Prida, L., Wendling, F., Bragin, A. et al. 2012. Mechanisms of physiological and epileptic HFO generation. Progress in Neurobiology 98(3), pp. 250-264.
Jrad, N., Kachenoura, A., Merlet, I., Bartolomei, F., Nica, A. et al. 2016. Automatic detection and classification of high-frequency oscillations in depth-EEG signals. IEEE Transactions on Biomedical Engineering, 64(9), pp.2230-2240.
Kucewicz, M., Cimbalnik, J., Matsumoto, J., Brinkmann, B., Bower, M. et al. 2014. High frequency oscillations are associated with cognitive processing in human recognition memory. Brain 137(8), pp. 2231-2244.
Lai, D., Zhang, X., Z., Ma, K., M., Chen, Z., Chen W. et al. 2019. Automated Detection of High Frequency Oscillations in Intracranial EEG Using the Combination of Short-Time Energy and Convolutional Neural Networks. IEEE Access 7, pp. 82501-82511.
Liu, S., Sha, Z., Sencer., A., Aydoseli, A., Bebek, N. et al. 2016. Exploring the time–frequency content of high frequency oscillations for automated identification of seizure onset zone in epilepsy. Journal of Neural Engineering 13(2).
López-Cuevas, A., Castillo-Toledo, B., Medina-Ceja, L., Ventura-Mejía, C., Pardo-Peña, K. 2013. An algorithm for on-line detection of high frequency oscillations related to epilepsy. Computer Methods and Programs in Biomedicine 110(3), pp. 354-360.
Luck, S. (2014). An Introduction to The Event-Related Potential Technique. 2nd ed. Cambridge, Mass: MIT Press.
Navarrete, M., Pyrzowski, J., Corlier, J., Valderrama, M. and Le Van Quyen, M., 2016a. Automated detection of high-frequency oscillations in electrophysiological signals: Methodological advances. Journal of Physiology-Paris, 110(4), pp.316-326.
Navarrete, M., Alvarado-Rojas, C., Le Van Quyen, M., Valderrama, M. 2016b. RIPPLELAB: A Comprehensive Application for the Detection, Analysis and Classification of High Frequency Oscillations in Electroencephalographic Signals. PLoS ONE 11(6).
Schirrmeister, R.T., Springenberg, J., Dominique, L., Fiederer, J., Glasstetter, M. et al. 2017. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping 38(11), pp. 5391–5420.
Squire, L., Bloom, F., Spitzer, N., Du Lac, S., Ghosh, A., Berg, D. (2008). Fundamental Neuroscience. 3rd ed. San Diego: Academic Press.
Staba, R.J., Wilson, C.L., Bragin., A., Fried, Engel, J. 2002. Quantitative Analysis of High-Frequency Oscillations (80–500 Hz) Recorded in Human Epileptic Hippocampus and Entorhinal Cortex. Journal of Neurophysiology 88(4), pp. 1743–1752.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A., 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. and Wojna, Z., 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818-2826).
Szegedy, C., Ioffe, S., Vanhoucke, V. and Alemi, A.A., 2017, February. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.
Whittingstall, K., Logothetis, N. 2009. Frequency-band coupling in surface EEG reflects spiking activity in monkey visual cortex. Neuron 64(2), pp. 281-289.
Worrell, G., Parish, L., Cranstoun, S., Jonas, R., Baltuch, G. et al. 2004. High‐frequency oscillations and seizure generation in neocortical epilepsy. Brain 127(7), pp. 1496-1506.
Wu, J., Sankar, R., Lerner, J., Matsumoto, J., Vinters, H. et al. 2010. Removing interictal fast ripples on electrocorticography linked with seizure freedom in children. Neurology 75(19), pp. 1686-1694.
Zijlmans, M., Jiruska, P., Zelmann, R., Leijten, F., Jefferys, J. et al. 2012. High-frequency oscillations as a new biomarker in epilepsy. Annals of Neurology 71(2), pp. 169–178.
Zuo, R., Wei, J., Li, X., Li, C., Zhao, C., Ren, Z., Liang, Y., Geng, X., Jiang, C., Yang, X. and Zhang, X., 2019. Automated Detection of High Frequency Oscillations in Epilepsy Based on a Convolutional Neural Network. Frontiers in computational neuroscience, 13.
Modelling Tail Dependence of Stock Indexes by GARCH-Copula Model(1).pdf
Modelling Tail Dependence of Stock
Indexes by GARCH-Copula Model
Chenghao Li
September 2019
School of Mathematics,
Cardiff University
A dissertation submitted in partial fulfilment of the
requirements for MSc (in Operational Research, Applied Statistics and Financial Risk) by taught programme.
CANDIDATE’S ID NUMBER
1869787
CANDIDATE’S SURNAME
Please circle as appropriate Mr / Miss / Ms/ Mrs / Rev / Dr / Other ……Li….......
CANDIDATE’S FULL FORENAMES
Chenghao
DECLARATION
This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree.
Signed ……………………………………………. (candidate) Date …………………………
STATEMENT 1
This dissertation is being submitted in partial fulfilment of the requirements for the degree of ………MSc…………(insert MA, MSc,MBA, etc, as appropriate)
Signed ……………………………………………. (candidate) Date …………………………
STATEMENT 2
This dissertation is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by footnotes giving explicit references. A Bibliography is appended.
Signed ……………………………………………. (candidate) Date …………………………
STATEMENT 3 –
I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing, and for the title and summary to be made available to outside organisations.
Signed ……………………………………………. (candidate) Date …………………………
STATEMENT 4 - BAR ON ACCESS APPROVED
I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing after expiry of a bar on access approved by the Graduate Development Committee.
Signed ……………………………………………. (candidate) Date …………………………
Executive Summary
Abstract
Economic globalization has increased the linkage between financial markets. Estimating
the correlation between the log returns of global stock markets has strong practical
significance for financial asset pricing and financial risk management. In this paper, the
GARCH-copula model is used to estimate the correlation and tail dependence between the
log returns of FTSE100, S&P500, Nikkei225 and HS300.
Methodology
In this paper, the GARCH(1,1) model is used to fit the log return of each stock index and
obtain the marginal distributions of the log returns. For each pair of indexes, the two
marginal distributions are then joined into a bivariate joint distribution using the Gaussian
copula, T copula and Clayton copula functions, respectively. The correlation between the log
returns is examined through the parameters of the copula function. The tail dependence
coefficients between the log returns are investigated using the tail dependence properties of
the T copula and Clayton copula.
Results
The results show a clear correlation between the log returns of the stock indexes, and
strong tail dependence between each pair of log returns. This shows that
there is linkage between global financial markets, and that in extreme cases their correlation
increases. It was also found that the correlation between the log returns of FTSE100 and
S&P500 is the strongest, and that the correlation between HS300 and the log returns of the
other indices is relatively weaker. To a certain extent, this reflects the greater freedom of
capital flow between the United Kingdom and the United States, while China maintains
controls on capital flows.
Acknowledgements
Throughout the writing of this dissertation I have received a great deal of support and assistance. I
would first like to thank my supervisor, Dr. Anqi Liu, whose expertise was invaluable in the formulating of
the research topic and methodology in particular.
I would like to acknowledge my friend Jiliang Zhu who provided me with a cloud server account, which
enabled my code to run for a long time on the cloud server to get the research results.
Contents
Abstract
1 Introduction
2 Literature Review
2.1 Techniques Used in Modelling Financial Returns
2.2 Copula Function Estimation Methods
2.3 Tail Dependence
3 The Model
3.1 Model Index Returns
3.2 Estimate Copula Function
3.2.1 Estimate Gaussian Copula
3.2.2 Estimate T Copula
3.2.3 Estimate Clayton Copula
3.3 Tail Dependence
4 Empirical Calibration and Results
4.1 The Data
4.2 Results
4.2.1 Marginal Distributions
4.2.2 Estimated Copula Functions
4.2.3 Tail Dependence
5 Conclusion
Reference
Modelling Tail Dependence of Stock Indexes by
GARCH-Copula Model
Abstract
With the increasing degree of economic globalization, the linkage between
national stock markets is also growing stronger. Especially under extremely
bad conditions, global stock markets are more likely to fall at the same time, which
is the so-called tail dependence. This paper fits copula models to stock market indexes
and uses them to investigate tail dependence between markets.
1 Introduction
Financial markets are interrelated, and describing the relationship between
financial products or financial markets can help solve financial asset pricing problems
and financial risk management issues. To understand the correlation between financial
markets, traditional correlation measures (such as Pearson’s rho) are often insufficient;
see, e.g., Embrechts et al. (2002). The most complete way to characterize the dependence between
the variables of financial markets or financial assets is to obtain a joint distribution of
the variables. But directly estimating the joint distribution is
difficult. The copula theory proposed by Sklar (1959) provides a solution.
Sklar's theorem states that any joint distribution can be represented by its marginal
distributions and an appropriate copula function. When the joint distribution is
continuous, the copula function is unique. To illustrate this theorem, consider a
two-dimensional joint distribution function $H(X_1, X_2)$ with marginal distributions
$F_1(X_1)$ and $F_2(X_2)$. Then a copula function $C$ can be
found that joins the univariate marginal distributions into the joint
distribution:
$$H(X_1, X_2) = C(F_1(X_1), F_2(X_2))$$
If $H(X_1, X_2)$ is continuous, the copula function $C(U_1, U_2)$ is unique. Conversely,
if the two-dimensional copula function $C(U_1, U_2)$ and the marginal distributions $F_1(X_1)$
and $F_2(X_2)$ are known, then $H(X_1, X_2) = C(F_1(X_1), F_2(X_2))$ is a bivariate distribution function
whose margins are $F_1(X_1)$ and $F_2(X_2)$.
Therefore, with the help of a copula function, the estimation of a joint distribution
function can generally be separated into two steps: 1) estimate the marginal
distributions; 2) choose the copula function and estimate the parameters of the copula
function. In this sense, the copula function contains all the dependence information.
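The construction in Sklar's theorem can be illustrated with a short sketch. It is not the model estimated in this paper: the Clayton copula parameter and the two normal margins are hypothetical values chosen only for illustration, and the names `clayton_copula`, `F1`, `F2` and `joint_cdf` are introduced here.

```python
from statistics import NormalDist


def clayton_copula(u1: float, u2: float, alpha: float) -> float:
    """Clayton copula C(u1, u2) = (u1^-alpha + u2^-alpha - 1)^(-1/alpha), alpha > 0."""
    if u1 <= 0.0 or u2 <= 0.0:
        return 0.0
    return (u1 ** -alpha + u2 ** -alpha - 1.0) ** (-1.0 / alpha)


# Hypothetical margins, chosen only for illustration.
F1 = NormalDist(mu=0.0, sigma=1.0).cdf
F2 = NormalDist(mu=0.5, sigma=2.0).cdf


def joint_cdf(x1: float, x2: float, alpha: float = 1.0) -> float:
    """H(x1, x2) = C(F1(x1), F2(x2)), the construction in Sklar's theorem."""
    return clayton_copula(F1(x1), F2(x2), alpha)


# Sending one argument far into the upper tail recovers the other margin.
print(joint_cdf(0.3, 1e6), F1(0.3))
print(joint_cdf(1e6, -1.0), F2(-1.0))
```

Each printed pair should agree up to floating-point error, which is the statement that $F_1$ and $F_2$ are the margins of $H$.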
Although copula theory was proposed early, it was not applied in the financial
field until around the turn of the 21st century, some 40 years after Sklar's
theorem. In 1999, David X. Li first proposed using the copula function
to model the dependence between loan defaults and applied it to the pricing of credit
derivatives. This first application in the financial field is, however, widely regarded as a failure. Wall
Street adopted the method proposed by Li and applied the Gaussian copula to the pricing
of Collateralized Debt Obligations (CDOs), which sped up the issuance of CDOs.
The CDO market subsequently collapsed: a large number of
mortgage defaults destroyed much of the value of CDOs, and investors
suffered heavy losses. Felix Salmon (2009) described the Gaussian copula as a
‘recipe for disaster’ in his article. The main reason for this failed application is that
Wall Street did not choose the right copula function. The Gaussian copula has no tail
dependence, whereas in extreme market conditions the correlation between asset prices
increases. Therefore, whenever copulas are applied in finance, whether
for financial risk management or financial asset pricing, it is necessary to consider
tail dependence and to choose an appropriate copula function.
This paper will establish GARCH-copula models for the log returns of stock
market indexes. The GARCH model is used to fit the log returns because financial
time series exhibit fat tails and volatility clustering. Traditional
econometric models could not handle these features well until Engle (1982) proposed
the autoregressive conditional heteroscedasticity (ARCH) model and Bollerslev (1986) proposed
the generalized ARCH (GARCH) model. In section 2, we review methods for fitting
financial time series in more detail and give the expression of the GARCH model.
The copula model is used to build the joint distribution and to better describe the
dependence between log returns, especially their tail dependence. As
mentioned earlier, the correlation between financial assets increases significantly
during a financial crisis. This is called “correlation breakdown”. If financial asset
pricing or financial risk management does not take into account the increase in the
correlation of financial assets in an extreme market environment, losses can result.
Tail dependence coefficients capture the dependence between financial assets in
extreme market conditions, so calculating tail dependence has strong practical
significance.
This paper studies the log returns of four indexes, FTSE100, S&P500,
Nikkei225 and HS300, and examines their tail dependence. It finds that
there is lower tail dependence between the indices, which indicates that the pricing
of financial products and the management of financial risks should take
the existence of tail dependence into account. In addition, the lower tail dependence coefficient of
FTSE100 and S&P500 is the largest, and the lower tail dependence between HS300
and the other three indexes is relatively weak. To a certain extent, this reflects the
relatively weak linkage between China's capital market and the world's major capital
markets.
This paper is organized as follows. Section 2 is a literature review. The literature
review includes three aspects: 1) commonly used models in financial time series
modelling; 2) estimation methods of copula functions; 3) tail dependence. In section
3, the model is described. Section 4 shows the empirical results. Section 5 concludes
this paper.
2 Literature Review
2.1 Techniques Used in Modelling Financial Returns
Before using the copula function to establish a joint distribution, a good estimate
of the marginal distributions is required. Financial time series, such as stock returns,
often exhibit volatility clustering, fat tails and a financial leverage effect. Volatility
clustering refers to the variation of the conditional variance over time: in financial time series,
periods of high volatility tend to cluster in certain time intervals.
Fat tails are another common feature of financial returns: extreme returns occur
more often than the normal distribution would imply.
The financial leverage effect is the phenomenon that positive and negative information
have an asymmetric influence on the variance of the time series.
Many financial return series therefore cannot be assumed to be
normal. The key to accurate modelling of financial returns is volatility
modelling. Currently, a broad class of GARCH processes with fat-tailed innovations
is used to model the volatility of financial returns. Some well-known GARCH-type
processes are listed below.
The GARCH model was created by Bollerslev (1986). The financial time series
of returns {𝑅𝑡 } is said to be modelled with GARCH(p,q) when
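$$
\begin{cases}
R_t = \mu + \varepsilon_t \\
\varepsilon_t = \sigma_t v_t \\
\sigma_t^2 = \omega + \sum_{i=1}^{p} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2
\end{cases}
\tag{2.1}
$$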
In the above formula, $v_t$ denotes the innovations. The advantage of the GARCH
model is that it can deal with conditional heteroscedasticity: in formula (2.1) the
conditional variance is described by
$\sigma_t^2 = \omega + \sum_{i=1}^{p} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2$.
Formula (2.1) is a constant-mean GARCH model, and it can be extended
to an AR-GARCH or ARMA-GARCH model when autoregressive effects are
considered. A time series $\{R_t\}$ modelled with the AR(k)-GARCH(p,q) model can be
expressed as
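$$
\begin{cases}
R_t = \sum_{l=1}^{k} \varphi_l R_{t-l} + \varepsilon_t \\
\varepsilon_t = \sigma_t v_t \\
\sigma_t^2 = \omega + \sum_{i=1}^{p} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2
\end{cases}
\tag{2.2}
$$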
The basic GARCH model does not take the financial leverage effect into account.
The leverage effect means that good and bad
information have asymmetric effects on the variance of financial returns.
Good information corresponds to a positive return, and bad information
to a negative return. For stock market index returns, a
negative return typically increases the variance more than a positive return of the
same size. Several ways to handle the financial leverage effect can be
found in the literature. The EGARCH model
of Nelson (1991) addresses this limitation of the GARCH model. In the
EGARCH(p,q) model, the noise process $\{\varepsilon_t\}$ satisfies $\varepsilon_t = \sigma_t v_t$ and the
following equations:
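$$
\begin{cases}
\ln(\sigma_t^2) = \alpha_0 + \sum_{i=1}^{p} \alpha_i\, g(v_{t-i}) + \sum_{j=1}^{q} \beta_j \ln(\sigma_{t-j}^2) \\
g(v_t) = \theta v_t + \delta\left(|v_t| - E|v_t|\right)
\end{cases}
\tag{2.3}
$$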
GJR-GARCH, proposed by Glosten et al. (1993), is another popular model that captures the
financial leverage effect. The noise process $\{\varepsilon_t\}$ follows the
GJR-GARCH(p,q) model when it satisfies $\varepsilon_t = \sigma_t v_t$ and
$$
\sigma_t^2 = \alpha_0 + \sum_{i=1}^{p} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2 + \sum_{i=1}^{p} \gamma_i\, \mathbf{I}_{\{\varepsilon_{t-i}<0\}}(\varepsilon_{t-i})\, \varepsilon_{t-i}^2
\tag{2.4}
$$
In the GJR-GARCH model, $\mathbf{I}_{\{x<0\}}$ denotes the indicator function, that is,
$\mathbf{I}_{\{x<0\}}(x) = 1$ when $x < 0$ and $\mathbf{I}_{\{x<0\}}(x) = 0$ when $x \ge 0$. Equation
(2.4) shows that the parameter $\gamma_i$ determines the sensitivity of the conditional
variance to negative returns.
Regarding the distribution of $\{v_t\}$, the commonly used distributions are the normal,
Student's t (as in Bollerslev, 1987), skewed t (as in Patton, 2004) and GED (generalized
error distribution). The latter three are more suitable for $\{v_t\}$ since
they have heavier tails than the normal distribution. Ferenstein and Gasowski (2004)
modelled stock returns with an AR-GARCH model and found that the GED and Student's t
distributions fit the innovations best.
In general, fitting financial returns requires taking into account these three
characteristics of financial time series (volatility clustering, fat tails and the
leverage effect) and then selecting a model that captures them.
2.2 Copula Function Estimation Methods
Generally, copula estimation methods can be classified into two types:
parametric estimation and non-parametric estimation (Cherubini et al., 2004).
There are three parametric estimation methods: 1) the exact maximum likelihood
method; 2) the inference for the margins (IFM) method; 3) the canonical maximum
likelihood method.
The exact maximum likelihood method is based on the canonical representation of the joint density:
$$
f(x_1, x_2, \dots, x_n) = c\big(F_1(x_1), F_2(x_2), \dots, F_n(x_n)\big) \times \prod_{j=1}^{n} f_j(x_j)
\tag{2.5}
$$
Let $\aleph = \{x_{1t}, x_{2t}, \dots, x_{nt}\}_{t=1}^{T}$ be the sample data matrix. The
log-likelihood function is then
$$
l(\theta) = \sum_{t=1}^{T} \ln c\big(F_1(x_{1t}), F_2(x_{2t}), \dots, F_n(x_{nt})\big) + \sum_{t=1}^{T} \sum_{j=1}^{n} \ln f_j(x_{jt})
\tag{2.6}
$$
Maximizing this log-likelihood function gives the maximum likelihood
estimator:
$$
\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta}\, l(\theta)
\tag{2.7}
$$
The exact maximum likelihood method estimates the parameters of the marginal
distributions and the parameters of the copula function at the same time, which can be
very computationally intensive, especially in high dimensions. From
(2.6), the log-likelihood function is composed of two terms: one
involving the copula function parameters and one involving the parameters of the
marginal distributions. The parameters can therefore be estimated separately rather than
simultaneously to reduce the computational load. Based on this idea, the IFM method (Joe
and Xu, 1996) was proposed. The IFM method estimates the parameters in two steps:
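Step 1: The parameter $\boldsymbol{\theta}_1$ of the marginal distributions is estimated by
$$
\hat{\boldsymbol{\theta}}_1 = \arg\max_{\boldsymbol{\theta}_1} \sum_{t=1}^{T} \sum_{j=1}^{n} \ln f_j(x_{jt}; \boldsymbol{\theta}_1)
\tag{2.8}
$$
Step 2: Then, given $\hat{\boldsymbol{\theta}}_1$, the copula parameter $\boldsymbol{\theta}_2$ is estimated by
$$
\hat{\boldsymbol{\theta}}_2 = \arg\max_{\boldsymbol{\theta}_2} \sum_{t=1}^{T} \ln c\big(F_1(x_{1t}), F_2(x_{2t}), \dots, F_n(x_{nt}); \boldsymbol{\theta}_2, \hat{\boldsymbol{\theta}}_1\big)
\tag{2.9}
$$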
Compared with the exact ML method, the IFM method is highly efficient (Joe, 1997).
The canonical maximum likelihood (CML) method estimates the copula parameters without
specifying the marginal distributions. It proceeds in the following steps:
Step 1: Estimate the marginal distributions by their empirical distribution functions,
namely $\hat{F}_i(x_{it})$ with $i = 1, \dots, n$.
Step 2: Estimate the copula parameters via MLE:
$$
\hat{\boldsymbol{\theta}}_2 = \arg\max_{\boldsymbol{\theta}_2} \sum_{t=1}^{T} \ln c\big(\hat{F}_1(x_{1t}), \hat{F}_2(x_{2t}), \dots, \hat{F}_n(x_{nt}); \boldsymbol{\theta}_2\big)
\tag{2.10}
$$
Non-parametric estimation no longer assumes a particular parametric copula.
Empirical copula and Kernel copula are two commonly used non-parametric
estimation methods.
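Step 1 of the canonical maximum likelihood method described above can be sketched as follows. This is a minimal illustration rather than the procedure used in this paper: it assumes the common rank/(T+1) rescaling of the empirical distribution function, and the function name and sample values are made up.

```python
def pseudo_observations(sample):
    """Map a sample to rank-based pseudo-observations in (0, 1).

    Assumption: ranks are divided by T + 1 rather than T so that no value
    equals 1, where many copula densities are unbounded. Ties are not
    treated specially.
    """
    T = len(sample)
    order = sorted(range(T), key=lambda k: sample[k])
    u = [0.0] * T
    for rank, k in enumerate(order, start=1):
        u[k] = rank / (T + 1)
    return u


returns = [0.4, -1.2, 0.1, 2.3, -0.3]  # made-up values
print(pseudo_observations(returns))
```

The resulting values play the role of $\hat{F}_i(x_{it})$ in (2.10) and can be passed to any copula log-likelihood.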
2.3 Tail Dependence
Whether financial markets become more interdependent during a financial
crisis is a concern because it relates to asset allocation and risk management. Many
studies have pointed out that there is a correlation breakdown between financial
markets; namely, in crash periods there is a statistically significant increase in the
correlation between financial markets. Bertero and Mayer (1989) and King and
Wadhwani (1990) find an increase in the correlation of stock returns at the time of the 1987 crash.
Calvo and Reinhart (1996) find evidence of correlation shifts during the Mexican crisis.
Baig and Goldfajn (1999) show evidence of contagion in the currency and equity
markets between Malaysia, Indonesia, Korea and the Philippines during the Asian crisis.
Given this phenomenon in financial markets, we want to understand
the dependence between financial markets in extreme cases. For example,
when a financial asset suffers a huge loss, we want to know whether another
asset will also suffer a huge loss, how strongly the two
assets are related, and which measure should be used to quantify the relationship accurately. Some
empirical studies, such as Ane and Kharoubi (2003), indicate that tail dependence is a
useful tool for describing dependence in extreme cases.
The most common definition of tail dependence, discussed by Sibuya (1960) and
Joe (1997) among others, is as follows. Let (X, Y) be a random vector
with joint distribution function F, and let the marginal distribution functions of X
and Y be G and H, respectively.
Then the upper tail dependence coefficient is
$$
\lambda_U = \lim_{t \to 1^-} P\{G(X) > t \mid H(Y) > t\}
\tag{2.11}
$$
and the lower tail dependence coefficient is
$$
\lambda_L = \lim_{t \to 0^+} P\{G(X) \le t \mid H(Y) \le t\}
\tag{2.12}
$$
By this definition, the tail dependence coefficient is the limiting conditional
probability that one variable exceeds a high threshold (or falls below a low threshold)
given that the other variable does so too. As Juri (2002) pointed out, the aim
of studying the tail dependence between random variables is to know the probability
that one random variable will change similarly when the other changes.
The tail dependence coefficients can also be defined through the copula function:
$$
\lambda_U = \lim_{u \to 1^-} \frac{1 - 2u + C(u,u)}{1 - u}
\tag{2.13}
$$
$$
\lambda_L = \lim_{u \to 0^+} \frac{C(u,u)}{u}
\tag{2.14}
$$
The tail dependence coefficient defined through the copula function depends only on
the form and the parameters of the copula function itself and is independent of the
marginal distributions. The copula function is therefore a convenient tool for studying
tail dependence: it suffices to select an appropriate copula function and estimate
its parameters to obtain the tail dependence coefficient. Patton (2006)
considered an extension of the theory of copulas and found evidence of asymmetric
exchange rate dependence. Rodriguez (2007) used a copula approach and found that
tail dependence is more prevalent in times of financial turmoil. Aloui et al. (2011)
employed a multivariate copula approach to capture the tail dependence between four
emerging markets and the US markets.
This paper will use the copula function to model the stock market indexes and
examine their tail dependence.
3 The Model
The modelling proceeds in three steps: 1) fit the log return of each index
with the GARCH(1,1) model to obtain the conditional distribution of the log return; 2)
estimate the parameters of the copula function, given the marginal distributions
and a chosen parametric copula family; 3) calculate the tail dependence based
on the estimated copula model. These three steps are described separately below.
3.1 Model Index Returns
Let $\{S_t\}$ denote the index closing price and $\{R_t\}$ the log return, so that
$$
R_t = \ln \frac{S_t}{S_{t-1}}
\tag{3.1}
$$
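When the GARCH(1,1) model is used to fit $\{R_t\}$, it can be expressed as
$$
\begin{cases}
R_t = \mu + \varepsilon_t \\
\varepsilon_t = \sigma_t v_t \\
\sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2
\end{cases}
\tag{3.2}
$$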
In formula (3.2), $\{v_t\}$ are i.i.d. random variables with zero mean and unit
variance. The distribution of $\{v_t\}$ determines the form of the marginal
distribution. Commonly used distributions for $\{v_t\}$ are the Gaussian distribution, the T
distribution, the skewed T distribution and the GED (generalized error distribution). The latter
three can better characterize the fat tails of financial
time series. This paper uses the Student's t distribution for
$\{v_t\}$, namely $v_t \sim t_{df}$. There is therefore one more parameter to be estimated, the
degrees of freedom.
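The parameters to be estimated for the t-GARCH(1,1) model are $\mu$, $df$, $\alpha_0$, $\alpha_1$
and $\beta$. After estimating these parameters, the conditional distribution of $R_{T+1}$ can be
obtained:
$$
\begin{aligned}
F(r) &= P(R_{T+1} \le r \mid I_T) = P(\varepsilon_{T+1} \le r - \mu \mid I_T) = P(\sigma_{T+1} v_{T+1} \le r - \mu \mid I_T) \\
&= P\left(v_{T+1} \le \frac{r - \mu}{\sqrt{\alpha_0 + \alpha_1 \varepsilon_T^2 + \beta \sigma_T^2}}\right)
= t_{df}\left(\frac{r - \mu}{\sqrt{\alpha_0 + \alpha_1 \varepsilon_T^2 + \beta \sigma_T^2}}\right)
\end{aligned}
\tag{3.3}
$$
In formula (3.3), $I_T$ is the set of information up to time T and $t_{df}$ denotes the distribution function of $v_t$.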
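The recursion behind (3.2) and (3.3) can be sketched as a simple filter. This is an illustration under stated assumptions, not the fitting code used for this paper: the parameters are taken as already estimated, the recursion is started at the unconditional variance (an assumption), and the function name is hypothetical.

```python
import math


def garch11_filter(returns, mu, alpha0, alpha1, beta):
    """Run the GARCH(1,1) variance recursion of (3.2) over a return series.

    Returns the conditional variances sigma_t^2 (one per observation), the
    standardised residuals v_t = (r_t - mu) / sigma_t, and the one-step-ahead
    variance sigma_{T+1}^2 that appears under the square root in (3.3).
    """
    if alpha1 + beta >= 1.0:
        raise ValueError("alpha1 + beta must be below 1 for a finite long-run variance")
    # Assumption: start the recursion at the unconditional variance.
    sigma2 = alpha0 / (1.0 - alpha1 - beta)
    variances, std_resid = [], []
    for r in returns:
        eps = r - mu
        variances.append(sigma2)
        std_resid.append(eps / math.sqrt(sigma2))
        sigma2 = alpha0 + alpha1 * eps ** 2 + beta * sigma2  # variance for the next step
    return variances, std_resid, sigma2


# Made-up returns and parameters, for illustration only.
variances, std_resid, next_var = garch11_filter([1.0, -2.0], 0.0, 0.1, 0.1, 0.8)
print(variances, std_resid, next_var)
```

Applying the distribution function of the innovations to each standardised residual then gives the probability-integral transform used as copula input; that step needs a Student's t distribution function from a statistics library and is not shown here.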
3.2 Estimate Copula Function
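Given the conditional distributions and the historical observations of the four
index log returns, we can compute
$$
u_{i,t} = t_{df_i}\left(\frac{r_{i,t} - \mu_i}{\sigma_{i,t}}\right), \qquad
\sigma_{i,t}^2 = \alpha_{i,0} + \alpha_{i,1} \varepsilon_{i,t-1}^2 + \beta_i \sigma_{i,t-1}^2, \qquad
t = 1, \dots, T, \quad i = 1, 2, 3, 4,
$$
and then estimate bivariate copula functions.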
One point should be noted before estimating the copula function.
Since the indexes are not traded on exactly the same days, the sample
$\boldsymbol{r}_t = \{r_{i,t}, r_{j,t}\}_{t=1}^{T}$, $1 \le i \ne j \le 4$, does not contain a complete observation vector for every day. Since
the estimation of each marginal distribution involves only the sample of that index,
non-trading days can simply be dropped as holidays. However, this cannot be done
when estimating the copula function, so the data need to be pre-processed. The
approach taken in this paper is to keep only the days on which both indices were traded,
that is, to exclude those sample points for which either index has no data.
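The alignment step described above can be sketched as follows. The data layout (a dictionary from ISO date string to value), the function name and the sample values are assumptions made for illustration.

```python
def align_on_common_dates(series_a, series_b):
    """Keep only the dates present in both series.

    Each series is a dict mapping an ISO date string to a value; the result
    is a list of (date, value_a, value_b) tuples sorted by date.
    """
    common = sorted(series_a.keys() & series_b.keys())
    return [(d, series_a[d], series_b[d]) for d in common]


# Made-up values for two indexes with different non-trading days.
u_ftse = {"2019-07-01": 0.61, "2019-07-02": 0.35, "2019-07-03": 0.48}
u_sp500 = {"2019-07-01": 0.70, "2019-07-03": 0.52, "2019-07-05": 0.44}
print(align_on_common_dates(u_ftse, u_sp500))
```

ISO date strings sort chronologically, so no date parsing is needed for this step.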
In this paper, the maximum likelihood estimation method is used to estimate the
three different parametric copula functions, namely Gaussian copula, T copula and
Clayton copula.
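3.2.1 Estimate Gaussian Copula
The density of the n-dimensional Gaussian copula with correlation matrix $R$ is:
$$
c(u_1, u_2, \dots, u_n) =
\frac{\dfrac{1}{(2\pi)^{n/2} |R|^{1/2}} \exp\left(-\dfrac{1}{2} \boldsymbol{x}' R^{-1} \boldsymbol{x}\right)}
{\prod_{j=1}^{n} \left(\dfrac{1}{\sqrt{2\pi}} \exp\left(-\dfrac{1}{2} x_j^2\right)\right)}
\tag{3.4}
$$
In formula (3.4), $x_j = \Phi^{-1}(u_j)$ and $\boldsymbol{x} = (x_1, \dots, x_n)'$. Simplifying gives
$$
c(u_1, u_2, \dots, u_n) = \frac{1}{|R|^{1/2}} \exp\left(-\frac{1}{2} \boldsymbol{\varsigma}' (R^{-1} - I) \boldsymbol{\varsigma}\right)
\tag{3.5}
$$
In formula (3.5), $\boldsymbol{\varsigma} = (\Phi^{-1}(u_1), \Phi^{-1}(u_2), \dots, \Phi^{-1}(u_n))'$. The log
likelihood function is then
$$
l(\boldsymbol{\theta}) = -\frac{T}{2} \ln|R| - \frac{1}{2} \sum_{t=1}^{T} \boldsymbol{\varsigma}_t' (R^{-1} - I) \boldsymbol{\varsigma}_t
\tag{3.6}
$$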
In the bivariate case the only parameter is the correlation coefficient $\rho$. The
log likelihood function in the bivariate case is
$$
l = -\frac{T}{2} \ln(1 - \rho^2) - \frac{1}{2} \sum_{t=1}^{T} \left[
\frac{(\Phi^{-1}(u_{1,t}))^2 + (\Phi^{-1}(u_{2,t}))^2 - 2\rho\, \Phi^{-1}(u_{1,t}) \Phi^{-1}(u_{2,t})}{1 - \rho^2}
- (\Phi^{-1}(u_{1,t}))^2 - (\Phi^{-1}(u_{2,t}))^2 \right]
\tag{3.7}
$$
The value of the log likelihood function is obtained by substituting $\{u_{1,t}\}$, $\{u_{2,t}\}$ and $\rho$
into the above equation. The parameter $\rho$ is estimated by iterating over values of
$\rho$ until the value of the log likelihood function is maximized.
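The estimation of $\rho$ by maximizing (3.7) can be sketched as below. This is a minimal illustration, not the code used for the results in this paper: the grid search stands in for a proper numerical optimiser, the data are simulated from a Gaussian copula with an assumed correlation of 0.6, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def normal_scores(u):
    """Transform uniforms to standard normal quantiles, x = Phi^{-1}(u)."""
    return [STD_NORMAL.inv_cdf(p) for p in u]


def gaussian_copula_loglik(rho, x1, x2):
    """Bivariate Gaussian copula log likelihood (3.7), given normal scores."""
    quad = sum(
        (a * a + b * b - 2.0 * rho * a * b) / (1.0 - rho * rho) - a * a - b * b
        for a, b in zip(x1, x2)
    )
    return -0.5 * len(x1) * math.log(1.0 - rho * rho) - 0.5 * quad


def fit_rho(u1, u2, grid_size=199):
    """Grid search for rho on (-1, 1); a simple stand-in for a numerical optimiser."""
    x1, x2 = normal_scores(u1), normal_scores(u2)
    grid = [-0.99 + 1.98 * k / (grid_size - 1) for k in range(grid_size)]
    return max(grid, key=lambda rho: gaussian_copula_loglik(rho, x1, x2))


def simulate_gaussian_copula(rho, n, seed=0):
    """Draw n pairs (u1, u2) from a Gaussian copula with correlation rho."""
    rng = random.Random(seed)
    u1, u2 = [], []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        u1.append(STD_NORMAL.cdf(z1))
        u2.append(STD_NORMAL.cdf(z2))
    return u1, u2


u1, u2 = simulate_gaussian_copula(rho=0.6, n=2000, seed=1)
print("estimated rho:", fit_rho(u1, u2))
```

With the default grid the estimate is resolved to steps of 0.01; a finer grid or a one-dimensional optimiser would refine it.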
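3.2.2 Estimate T Copula
The density of the bivariate t copula is
$$
c(u_1, u_2) = \frac{1}{\sqrt{1 - \rho^2}} \cdot
\frac{\Gamma\left(\frac{v+2}{2}\right) \Gamma\left(\frac{v}{2}\right)
\left(1 + \dfrac{t_v^{-1}(u_1)^2 + t_v^{-1}(u_2)^2 - 2\rho\, t_v^{-1}(u_1)\, t_v^{-1}(u_2)}{v(1 - \rho^2)}\right)^{-\frac{v+2}{2}}}
{\Gamma^2\left(\frac{v+1}{2}\right)
\left(1 + \dfrac{t_v^{-1}(u_1)^2}{v}\right)^{-\frac{v+1}{2}}
\left(1 + \dfrac{t_v^{-1}(u_2)^2}{v}\right)^{-\frac{v+1}{2}}}
\tag{3.8}
$$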
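In the above expression, $v$ is the degrees of freedom of the T copula. It can
be deduced that the log likelihood function is
$$
\begin{aligned}
l(\rho, v) = {} & -\frac{T}{2} \ln(1 - \rho^2) + T \ln \Gamma\left(\frac{v+2}{2}\right) + T \ln \Gamma\left(\frac{v}{2}\right) \\
& - \frac{v+2}{2} \sum_{t=1}^{T} \ln\left(1 + \frac{t_v^{-1}(u_{1,t})^2 + t_v^{-1}(u_{2,t})^2 - 2\rho\, t_v^{-1}(u_{1,t})\, t_v^{-1}(u_{2,t})}{v(1 - \rho^2)}\right) \\
& - 2T \ln \Gamma\left(\frac{v+1}{2}\right)
+ \frac{v+1}{2} \sum_{t=1}^{T} \ln\left(1 + \frac{t_v^{-1}(u_{1,t})^2}{v}\right)
+ \frac{v+1}{2} \sum_{t=1}^{T} \ln\left(1 + \frac{t_v^{-1}(u_{2,t})^2}{v}\right)
\end{aligned}
\tag{3.9}
$$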
By iterating the value of 𝜌 and v, the estimated parameters can be obtained by
maximizing the log likelihood function value.
3.2.3 Estimate Clayton Copula
The Clayton copula is an Archimedean copula. In the bivariate case, its copula
density is
$$
c(u_1, u_2) = (1 + \alpha)(u_1 u_2)^{-\alpha - 1} \left(u_1^{-\alpha} + u_2^{-\alpha} - 1\right)^{-2 - \frac{1}{\alpha}}
\tag{3.10}
$$
In formula (3.10), $\alpha$ is the parameter of the Clayton copula.
The log likelihood function corresponding to the bivariate Clayton copula is
$$
l(\alpha) = T \ln(1 + \alpha) - (\alpha + 1) \sum_{t=1}^{T} \ln(u_{1,t}\, u_{2,t})
- \left(\frac{1}{\alpha} + 2\right) \sum_{t=1}^{T} \ln\left(u_{1,t}^{-\alpha} + u_{2,t}^{-\alpha} - 1\right)
\tag{3.11}
$$
As with the Gaussian and T copulas, the parameter $\alpha$ is
estimated by iterating over $\alpha$ and maximizing the log likelihood function (3.11).
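The Clayton estimation can be sketched in the same spirit. The log likelihood below is the log of density (3.10) summed over the sample. The sampler (conditional inversion), the grid search and the value `alpha=1.0` are assumptions introduced for the illustration and are not taken from this paper's code.

```python
import math
import random


def clayton_loglik(alpha, u1, u2):
    """Log likelihood of the bivariate Clayton copula: log of density (3.10), summed over t."""
    log_prod = sum(math.log(a) + math.log(b) for a, b in zip(u1, u2))
    log_sum = sum(math.log(a ** -alpha + b ** -alpha - 1.0) for a, b in zip(u1, u2))
    return (len(u1) * math.log(1.0 + alpha)
            - (alpha + 1.0) * log_prod
            - (2.0 + 1.0 / alpha) * log_sum)


def sample_clayton(alpha, n, seed=0):
    """Draw n pairs from a Clayton copula by inverting the conditional distribution of u2 given u1."""
    rng = random.Random(seed)
    u1, u2 = [], []
    for _ in range(n):
        a = 1.0 - rng.random()  # uniform on (0, 1]
        w = 1.0 - rng.random()
        b = (a ** -alpha * (w ** (-alpha / (1.0 + alpha)) - 1.0) + 1.0) ** (-1.0 / alpha)
        u1.append(a)
        u2.append(b)
    return u1, u2


def fit_clayton(u1, u2, lo=0.05, hi=5.0, step=0.05):
    """Grid search for alpha; a simple stand-in for a numerical optimiser."""
    grid = [lo + step * k for k in range(int(round((hi - lo) / step)) + 1)]
    return max(grid, key=lambda alpha: clayton_loglik(alpha, u1, u2))


u1, u2 = sample_clayton(alpha=1.0, n=2000, seed=1)
print("estimated alpha:", fit_clayton(u1, u2))
```

On simulated data the estimate should land near the value used for sampling, which gives a quick sanity check of the likelihood before it is applied to real pseudo-observations.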
3.3 Tail Dependence
The tail dependence coefficient of each copula is calculated from its closed-form
formula, since the tail dependence
coefficient defined through the copula function depends only on the form and parameters
of the copula function itself. The following table gives the formulas for the
tail dependence coefficients of the Gaussian copula, T copula and Clayton copula,
together with those of the Gumbel and Frank copulas.
Table 1: Tail dependence coefficient formula
Copula Function    $\lambda_{up}$                                                $\lambda_{low}$
Gaussian Copula    0                                                             0
T Copula           $2t_{v+1}\left(-\sqrt{v+1}\sqrt{\frac{1-\rho}{1+\rho}}\right)$    $2t_{v+1}\left(-\sqrt{v+1}\sqrt{\frac{1-\rho}{1+\rho}}\right)$
Clayton Copula     0                                                             $2^{-\frac{1}{\alpha}}$
Gumbel Copula      $2 - 2^{\frac{1}{\alpha}}$                                    0
Frank Copula       0                                                             0
Gaussian copula has no tail dependence, namely its upper and lower tail
dependence are equal to zero. T copula has upper and lower tail dependence and they
are equal. Clayton copula has only lower tail dependence.
4 Empirical Calibration and Results
In this section, the daily closing prices of the stock market indexes
FTSE100, S&P500, HS300 and Nikkei225 are modelled. To make the numerical fitting
more stable, the log returns are scaled by 100, i.e. $\{100 \times R_t\}$ is the series that is
modelled. The sample period is from January 1, 2009 to July 31, 2019.
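The scaled log returns can be computed as in the sketch below, which assumes the usual convention $R_t = \ln(S_t / S_{t-1})$. The function name and the closing prices are made up for illustration.

```python
import math


def scaled_log_returns(closes, scale=100.0):
    """scale * ln(S_t / S_{t-1}) for consecutive closing prices."""
    return [scale * math.log(curr / prev) for prev, curr in zip(closes, closes[1:])]


closes = [7400.0, 7450.0, 7390.0]  # made-up closing prices
print(scaled_log_returns(closes))
```

A series of T+1 prices gives T returns, and the returns sum to the scaled log of the overall price ratio.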
4.1 The Data
Financial time series have some characteristic features. The
following descriptive statistics show that the log returns of all indexes are
negatively skewed. The log return of FTSE100 has a kurtosis
of less than 3 and is therefore platykurtic, while the log returns of the other indices are
leptokurtic, that is, their kurtosis is greater than 3. This indicates that financial time
series often have fat tails.
Table 2: Descriptive statistics
Indexes      Mean       Median     Std        Skewness    Kurtosis
FTSE100      0.000189   0.000341   0.010012   -0.159446   2.839389
S&P500       0.000436   0.000663   0.010293   -0.329508   5.184206
Nikkei225    0.000324   0.000513   0.013499   -0.477822   4.778648
HS300        0.000276   0.000552   0.015492   -0.623943   4.107534
Some basic features of the financial time series can also be found through time
series plots. The following figure shows the time series plot of {𝑅𝑡 } of FTSE100.
Figure 1: The left is the time series plot of $\{R_t\}$ of FTSE100. The right is the time series plot of $\{R_t^2\}$ of FTSE100.
The time series plot at the top left shows
volatility clustering in the log return sequence of FTSE100: volatility is
clearly not constant but changes over time. The ACF and PACF in (a) suggest
that the log return is not autocorrelated. The QQ plot at the bottom left shows
that the distribution of the log return has fat tails. The ACF and PACF on the right
show significant autocorrelation in $\{R_t^2\}$. The log returns of
the S&P500, HS300 and Nikkei225 have similar characteristics.
Figure 2: The left is the time series plot of $\{R_t\}$ of S&P500. The right is the time series plot of $\{R_t^2\}$ of S&P500. There is volatility clustering and a fat tail in $\{R_t\}$, and $\{R_t^2\}$ is autocorrelated.
Figure 3: The left is the time series plot of $\{R_t\}$ of Nikkei225. The right is the time series plot of $\{R_t^2\}$ of Nikkei225. There is volatility clustering and a fat tail in $\{R_t\}$, and $\{R_t^2\}$ is autocorrelated.
Figure 4: The left is the time series plot of $\{R_t\}$ of HS300. The right is the time series plot of $\{R_t^2\}$ of HS300. There is volatility clustering and a fat tail in $\{R_t\}$, and $\{R_t^2\}$ is autocorrelated.
According to the above analysis, it can be found that there are fat tail and volatility
clustering in the log return series. Therefore, it is reasonable to use the GARCH model
with the T-distributed innovation to fit the marginal distributions.
4.2 Results
4.2.1 Marginal Distributions
The GARCH(1,1) model was fitted to FTSE100, S&P500, Nikkei225 and
HS300 respectively. The estimated parameters are as follows.
Table 3: Estimated parameters for GARCH models
Parameters/Indexes   FTSE100                S&P500                 Nikkei225              HS300
$\mu$                0.0415*** (t=2.842)    0.0838*** (t=6.902)    0.0818*** (t=4.117)    0.0507** (t=2.415)
$\alpha_0$           0.0254*** (t=3.084)    0.0169*** (t=3.364)    0.0465*** (t=3.133)    9.3504e-03** (t=2.261)
$\alpha_1$           0.1207*** (t=5.389)    0.1407*** (t=6.422)    0.1162*** (t=5.117)    0.0550*** (t=6.017)
$\beta$              0.8580*** (t=33.318)   0.8540*** (t=42.385)   0.8635*** (t=35.189)   0.9443*** (t=112.036)
$v$                  6.8268*** (t=8.025)    4.9861*** (t=9.763)    5.9979*** (t=8.189)    4.7577*** (t=10.532)
Note: The t statistics are in parentheses. '*' means significant at the 10% significance level, '**' means
significant at the 5% significance level and '***' means significant at the 1% significance level.
With the parameter estimates in the above table, the marginal
distributions of the log returns of FTSE100, S&P500, Nikkei225 and HS300 can
be obtained from (3.3).
The parameter estimates of the GARCH(1,1) models show that
the $\beta$ value of each model exceeds 0.85. This indicates that
the conditional variance of all four log returns is strongly autocorrelated. In addition,
$\alpha_1 + \beta$ of each GARCH model is close to 1. $\alpha_1 + \beta$ is called the persistence,
as it determines the speed at which shocks to the variance revert to the long-run value.
The persistence of these models is thus very strong: if the variance is
increased by a shock, it takes a long time to return to its long-run average level.
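The persistence implied by Table 3 can be worked out directly. The sketch below uses the reported $\alpha_0$, $\alpha_1$ and $\beta$ values. The long-run variance $\alpha_0 / (1 - \alpha_1 - \beta)$ and the half-life $\ln(0.5) / \ln(\alpha_1 + \beta)$ are standard GARCH(1,1) quantities that are not reported in this paper and are added here only as an illustration; the function name is hypothetical.

```python
import math

# (alpha0, alpha1, beta) as reported in Table 3, for returns scaled by 100.
garch_params = {
    "FTSE100": (0.0254, 0.1207, 0.8580),
    "S&P500": (0.0169, 0.1407, 0.8540),
    "Nikkei225": (0.0465, 0.1162, 0.8635),
    "HS300": (9.3504e-03, 0.0550, 0.9443),
}


def persistence_summary(alpha0, alpha1, beta):
    """Return persistence, long-run variance and shock half-life (in trading days)."""
    persistence = alpha1 + beta
    long_run_var = alpha0 / (1.0 - persistence)
    # Number of steps for the expected deviation from the long-run variance to halve.
    half_life = math.log(0.5) / math.log(persistence)
    return persistence, long_run_var, half_life


for name, params in garch_params.items():
    p, v, h = persistence_summary(*params)
    print(f"{name}: persistence={p:.4f}, long-run variance={v:.3f}, half-life={h:.1f} days")
```

Because the persistence is so close to 1, the long-run variance and half-life are very sensitive to rounding in the reported parameters and should be read as rough orders of magnitude.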
4.2.2 Estimated Copula Functions
The parameters of the copula functions estimated by the MLE method are
shown in the following table.
Table 4: Estimated parameters of copula functions
Pair                      Parameter   Gaussian Copula   T Copula   Clayton Copula
FTSE100 vs. S&P500        ρ           0.670             0.666      /
                          df          /                 5.406      /
                          α           /                 /          1.112
FTSE100 vs. Nikkei225     ρ           0.374             0.358      /
                          df          /                 12.043     /
                          α           /                 /          0.450
FTSE100 vs. HS300         ρ           0.308             0.273      /
                          df          /                 6.506      /
                          α           /                 /          0.349
S&P500 vs. Nikkei225      ρ           0.332             0.286      /
                          df          /                 5.151      /
                          α           /                 /          0.352
S&P500 vs. HS300          ρ           0.257             0.212      /
                          df          /                 5.804      /
                          α           /                 /          0.250
Nikkei225 vs. HS300       ρ           0.422             0.411      /
                          df          /                 18.984     /
                          α           /                 /          0.512
The estimation results show that all
estimated parameters lie within their admissible ranges ($-1 < \rho < 1$, $df > 0$, $\alpha > 0$).
For the Gaussian copula and T copula, the greater the ρ, the greater the correlation
between the two log return sequences. For the Clayton copula, the larger the α, the greater
the correlation between the two log return sequences. The parameter table shows
that the parameters of the Gaussian copula, T copula and Clayton copula are
consistent, that is, the ranking of the pairs by correlation is the same in each copula model.
4.2.3 Tail Dependence
The tail dependence between the log returns can first be inspected visually through
three-dimensional histograms.
Figure 5: 3D histograms
In the above three-dimensional histograms, the
bar in the lower-left corner of each graph is taller than its neighbours, that is, the two log returns
take small values at the same time with relatively high frequency, which indicates that
there is lower tail dependence between the two log returns.
The tail dependence coefficient can quantify the magnitude of the tail dependence.
After estimating the parameters of the copula function, the tail dependence coefficient
can be obtained according to the formula of the tail dependence coefficient. The table
below summarizes the tail dependence coefficient.
Table 5: Tail dependence coefficients
Pair                      T Copula              Clayton Copula
FTSE100 vs. S&P500        λ_U = λ_L = 0.297     λ_U = 0, λ_L = 0.536
FTSE100 vs. Nikkei225     λ_U = λ_L = 0.027     λ_U = 0, λ_L = 0.214
FTSE100 vs. HS300         λ_U = λ_L = 0.074     λ_U = 0, λ_L = 0.138
S&P500 vs. Nikkei225      λ_U = λ_L = 0.113     λ_U = 0, λ_L = 0.140
S&P500 vs. HS300          λ_U = λ_L = 0.075     λ_U = 0, λ_L = 0.063
Nikkei225 vs. HS300       λ_U = λ_L = 0.009     λ_U = 0, λ_L = 0.258
Since the Clayton copula is designed to capture lower tail dependence, we focus on its
lower tail dependence coefficient. It is positive for every pair of indexes,
ranging from 0.063 to 0.536, so there is lower tail dependence between the log returns of every pair. This
means that when one index falls sharply, the probability that the other index also falls sharply is not negligible.
The lower tail dependence coefficient between FTSE100 and S&P500 is the
largest. It seems that UK and USA stock markets have the highest level of financial
contagion. HS300 has relatively low tail dependence coefficients with other indexes
compared to other indexes. This may be explained by China's capital control over the
capital market. Capital control makes the circulation of funds unfree, and the linkage
between stock markets declines.
5 Conclusion
This paper uses the GARCH-copula method to establish bivariate joint
distribution models between stock index log returns. First, the log returns of FTSE100,
S&P500, Nikkei225 and HS300 are each fitted with a GARCH(1,1) model. The GARCH
model captures the volatility clustering and fat tails in the log return sequences.
The GARCH parameters show that the β coefficient of each model is relatively large
(larger than 0.85) and that the persistence of each model is close to 1, which means
that the variance of each log return sequence takes a long time to revert to its
long-run value after a shock.
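The link between persistence and slow mean reversion can be made concrete. For a GARCH(1,1) model with constant ω, ARCH coefficient α and GARCH coefficient β, the persistence is α + β, the long-run variance is ω / (1 - α - β), and the half-life of a variance shock is ln(0.5) / ln(α + β) periods. The sketch below is illustrative only: the function names and the parameter values are hypothetical and are not the estimates obtained in this paper.

```python
import math


def garch_persistence(alpha, beta):
    """Persistence of a GARCH(1,1) model: the sum of the ARCH and GARCH terms."""
    return alpha + beta


def garch_half_life(alpha, beta):
    """Number of periods for a variance shock to decay by half."""
    persistence = garch_persistence(alpha, beta)
    if not 0 < persistence < 1:
        raise ValueError("half-life is only defined for 0 < alpha + beta < 1")
    return math.log(0.5) / math.log(persistence)


def garch_long_run_variance(omega, alpha, beta):
    """Unconditional (long-run) variance of a stationary GARCH(1,1) model."""
    persistence = garch_persistence(alpha, beta)
    if persistence >= 1:
        raise ValueError("long-run variance requires alpha + beta < 1")
    return omega / (1 - persistence)


# Hypothetical coefficients, chosen only so that beta > 0.85 and
# the persistence is close to 1, as described in the text.
alpha, beta = 0.08, 0.90
print(garch_persistence(alpha, beta))
print(garch_half_life(alpha, beta))
```

With these assumed values, a shock to the variance takes several dozen periods to halve, which is the sense in which a persistence close to 1 implies slow reversion to the long-run level.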
Second, the Gaussian copula, T copula and Clayton copula are estimated. Finally,
the tail dependence between each pair of index log returns is calculated, and a
strong lower tail dependence between the indices is found. The analysis shows that
the copulas fitted to FTSE100 and S&P500 have the largest correlation parameters,
which indicates a strong correlation. The tail dependence coefficient of this pair
is also relatively large. In contrast, the correlation between HS300 and the other
three indices is much weaker. This shows that the linkage between the UK and US
stock markets is strong, which is inseparable from a financial system in which
capital flows freely. China’s control over capital flows has always been strict,
which inevitably weakens the linkage between its stock market and developed
international stock markets. However, on September 10, 2019, China's State
Administration of Foreign Exchange announced the removal of the investment quota
limits for QFII (Qualified Foreign Institutional Investor) and RQFII (RMB Qualified
Foreign Institutional Investor). This move will strengthen the linkage between the
Chinese stock market and international stock markets in the future. Human behaviour
is guided and constrained by institutional systems. Therefore, when diversifying
investments or pricing financial products, we should take into account institutional
changes, which alter human behaviour and in turn change market correlations.
21
Reference [1] Aloui, R., Aïssa, M.S.B. and Nguyen, D.K., 2011. Global financial crisis, extreme
interdependences, and contagion effects: The role of economic structure?. Journal of
Banking & Finance, 35(1), pp.130-141.
[2] Ane, T. and Kharoubi, C., 2003. Dependence structure and risk measure. The
journal of business, 76(3), pp.411-438.
[3] Baig, T. and Goldfajn, I., 1999. Financial market contagion in the Asian crisis. IMF
staff papers, 46(2), pp.167-195.
[4] Bertero, E., & Mayer, C., 1990. Structure and performance: Global
interdependence of stock markets around the crash of October 1987. European
Economic Review, 34(6), pp.1155-1180.
[5] Bollerslev, T., 1986. Generalized autoregressive conditional
heteroskedasticity. Journal of econometrics, 31(3), pp.307-327.
[6] Bollerslev, T., 1987. A conditionally heteroskedastic time series model for
speculative prices and rates of return. Review of economics and statistics, 69(3),
pp.542-547.
[7] Calvo, S., 1999. Capital flows to Latin America: is there evidence of contagion
effects?. The World Bank.
[8] Cherubini, U., Luciano, E. and Vecchiato, W., 2004. Copula methods in finance.
John Wiley & Sons.
[9]Engle, R.F., 1982. Autoregressive conditional heteroscedasticity with estimates of
the variance of United Kingdom inflation. Econometrica: Journal of the Econometric
Society, pp.987-1007.
[10] Ferenstein, E. and Gasowski, M., 2004. Modelling stock returns with AR-
GARCH processes. SORT-Statistics and Operations Research Transactions, 28(1),
pp.55-68.
[11] Frahm, G., Junker, M. and Schmidt, R., 2005. Estimating the tail-dependence
coefficient: properties and pitfalls. Insurance: mathematics and Economics, 37(1),
pp.80-100.
22
[12] Glosten, L.R., Jagannathan, R. and Runkle, D.E., 1993. On the relation between
the expected value and the volatility of the nominal excess return on stocks. The
journal of finance, 48(5), pp.1779-1801.
[13] Joe, H. and Xu, J.J., 1996. The estimation method of inference functions for
margins for multivariate models.
[14] Joe, H., 1997. Multivariate models and multivariate dependence concepts. CRC
Press.
[15] Juri, A. and Wüthrich, M.V., 2002. Copula convergence theorems for tail
events. Insurance: Mathematics and Economics, 30(3), pp.405-420.
[16] King, M. A., & Wadhwani, S., 1990. Transmission of volatility between stock
markets. The Review of Financial Studies, 3(1), pp.5-33.
[17] Li, D.X., 2000. On default correlation: A copula function approach. The Journal
of Fixed Income, 9(4), pp.43-54.
[18] Nelson, D.B., 1991. Conditional heteroskedasticity in asset returns: A new
approach. Econometrica: Journal of the Econometric Society, pp.347-370.
[19] Rodriguez, J.C., 2007. Measuring financial contagion: A copula approach. Journal
of empirical finance, 14(3), pp.401-423.
[20] Patton, A. J., 2004. On the out-of-sample importance of skewness and asymmetric
dependence for asset allocation. Journal of Financial Econometrics, 2(1), pp.130-168.
[21] Patton, A.J., 2006. Modelling asymmetric exchange rate dependence.
International economic review, 47(2), pp.527-556.
[22] Salmon, F., 2009. A formula for disaster. Wired, March, pp.74-79.
[23] Sklar, M., 1959. Fonctions de repartition an dimensions et leurs marges. Publ.
inst. statist. univ. Paris, 8, pp.229-231.
ONS Escaping Poor Performance Dissertation(1).pdf
Executive Summary
As identified by Her Majesty’s Chief Inspector of Education, Children’s Services and Skills,
there is a group of around 450 state-funded schools in England that have had poor inspection
results in every inspection they have had since 2005. These ‘stuck’ schools are receiving
increased levels of attention, since a poor inspection result is meant to instigate an improvement
in school performance.
This project aimed to investigate the data around stuck schools, to see if it was possible to
generate a machine learning model that could accurately predict whether a poorly performing
school would remain ‘stuck’ or would ‘escape’ the cycle of poor performance by itself.
To do this, a dataset needed to be created. This was done using input variables selected from a
number of publicly available sources. Many were provided by the Department for Education,
whilst details of every school inspection since 2005 were provided by Ofsted. The resulting
dataset had a row for each of the 21,900 open, state-funded schools in England and around 80
columns.
The definition of ‘stuck’ was updated slightly for this project, in consultation with the
Department for Education. The schools that met these new criteria were identified, along with
a different subset – those that had been performing poorly but had recently had one or more
good inspections and ‘escaped’ stuck. Schools that fitted in neither category were not used
further. The created dataset now had binary labels – ‘stuck’ and ‘escaped’ – from which a
binary classifier could be built and tested.
Six different binary classifiers were built in this project, using the Scikit-learn package in
Python: Random Forests, Support Vector Machines, Neural Networks, Gaussian Naïve Bayes,
Logistic Regression and K-Nearest Neighbours. For each model type, the optimum
combination of input features and model hyperparameters was found using an iterative process
involving random grid searches for model hyperparameters and sequential feature selection.
The best models predicted the future class of a school with 75% accuracy and area
under the ROC curve values of 75%. Where higher confidence in a prediction of a stuck school
is required, a Support Vector Machine model was correct in 88% of its predictions of stuck
schools (precision), although it identified only 40% of all stuck schools (recall). Logistic
Regression and Gaussian Naïve Bayes generally provided inferior results to the other four
model types, of which K-Nearest Neighbours was found to perform the best overall.
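The trade-off between precision and recall described above can be illustrated with a short calculation. The sketch below is not the project's code, and the counts are hypothetical: they were chosen only so that the two metrics come out at the 88% and 40% quoted, and they are not the project's confusion matrix.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) for the positive class, here 'stuck'."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0
    return precision, recall


# Hypothetical counts: 110 schools are truly stuck. The model flags 50 schools
# as stuck, of which 44 really are stuck and 6 are not; 66 stuck schools are missed.
precision, recall = precision_recall(true_positives=44,
                                     false_positives=6,
                                     false_negatives=66)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

A model like this is cautious: when it does label a school as stuck it is usually right, but it leaves most stuck schools unflagged.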
Accuracy values of 75% are useful, and show that the models can predict future inspection
results, but they are unlikely to be high enough to be of direct use to Ofsted and the Department
for Education. Given the selection of well-known machine learning techniques used and the
large range of input features and hyperparameters tested, it is considered unlikely that
significant improvement upon these scores can be achieved without a step change in the quality
of the input data or the use of more advanced machine learning techniques that are beyond the
scope of this project.
An analysis of the most important features used by the models in making the classification was
also carried out, showing that different groups of features were being used by the different
models. Features concerning school financial balance were shown to be of high importance to
the Random Forest model in an assessment of its feature importance values.
This project was carried out with input from the two primary stakeholders: Ofsted and the
Department for Education. The results have been presented to them in person, with an
explanation of how they were achieved. They are now in a position to decide the future
direction of this work.
Acknowledgements
My thanks in this project go to the EMU team at ONS who have been a pleasure to work with this summer – particularly to Joe for his help in all things from laptop setup to showing me how to get from the bike sheds to the showers and for his interest, support and guidance throughout; to Tim for giving me great flexibility, allowing me to concentrate solely on my project and being on hand if I needed anything; to Rebecca for keeping on finding errors in my list of stuck schools.
David Marshall, my supervisor at Cardiff University, has also been extremely helpful in this project, giving me useful practical advice on carrying out the work and assisting with the writing of this dissertation. I appreciate your willingness to meet on Skype on your day off.
I would also like to thank George and Louise at Ofsted and Pennie and Pippa at the Department for Education for their input during the project and their helpful comments and feedback in the presentation.
Contents
1. Introduction
1.1. School Inspections
1.2. Stuck Schools
1.3. Ofsted and the Department for Education
1.4. Tools Used
1.5. Project Aims
1.6. Process Plan
2. Literature Review
2.1. Previous Analysis of Schools Data
2.2. Machine Learning Models
2.2.1. Logistic Regression
2.2.2. Random Forests
2.2.3. Support Vector Machines
2.2.4. Neural Networks
2.2.5. Naïve Gaussian Bayes Networks
2.2.6. K-Nearest Neighbours
2.3. Methods for Refining the Models
2.3.1. Iterative imputation
2.3.2. Sequential Forward and Backward Selection
2.3.3. Recursive Feature Elimination
2.3.4. Oversampling
2.4. Measuring Model Effectiveness
2.4.1. Cross validation
2.4.2. Accuracy
2.4.3. Precision and recall
2.4.4. ROC curves
3. Input Data and Pre-Processing
3.1. Datasets Available
3.1.1. Ofsted data
3.1.2. DfE data
3.1.3. Get Information About Schools data
3.1.4. List of Academies
3.2. Data Stitching
3.3. Data Cleaning
3.3.1. Missing data
3.3.2. Data types
3.3.3. Normalising
3.3.4. Imputation
4. Machine Learning Implementation
4.1. Feature Generation
4.1.1. Categorical data
4.1.2. Changes over time
4.1.3. Performance data
4.2. Initial Feature Selection
4.3. Stuck School Labelling
4.3.1. Options for labels
4.3.2. Updated definition of Stuck schools
4.3.3. Data selection and labelling
4.3.4. Labelling method
4.4. Data Summary
4.5. Principal Component Analysis
5. Results
5.1. Assessment Methods Used
5.2. Computing Set up for Modelling
5.3. Experiment 1: Finding the optimum machine learning classification model
5.3.1. Experiment 1.1: Finding the best model type
5.3.2. Experiment 1.2: Finding the optimum model hyperparameters
5.4. Experiment 2: Finding the optimum features to input to the model
5.4.1. Experiment 2.1: Investigating whether Sequential Forward Selection or Sequential Backward Selection give better results
5.4.2. Experiment 2.2: Investigating whether Recursive Feature Elimination improves the set of features selected
5.4.3. Key features selected in most effective model
5.5. Experiment 3: Finding the optimum operations on the input data
5.5.1. Experiment 3.1: Determining whether oversampling improves model performance
5.6. Attributes of Most Effective Model
6. Discussion
7. Conclusions
8. References
9. Appendices
9.1. Appendix 1: Input Variables Used
9.1.1. School Financial Balance
9.1.2. School performance data
9.1.3. Pupil population and absence data
9.1.4. Spine
9.1.5. Workforce data
9.1.6. School finances data
9.1.7. Generated variables
9.2. Appendix 2: Variables selected for models
9.2.1. Neural Network
9.2.2. Support Vector Machine
9.2.3. Random Forest
9.2.4. Gaussian Naïve Bayes
9.2.5. Logistic Regression
9.2.6. K-Nearest Neighbours
List of Figures
Figure 1: A multi layer perceptron with 1 hidden layer containing k units (Scikit-learn, no date b)
Figure 2: Confusion Matrix
Figure 3: Frequency of each overall inspection history
Figure 4: Histograms showing distribution between 'Stuck' schools, 'Escaped' schools and schools that fall into neither category
Figure 5: Principal Component Analysis - Cumulative explained variance versus number of principal components used
Figure 6: The results of Principal Component Analysis - the two classes and the ‘other’ remaining schools that fall in neither class are plotted on Principal Component 1 vs Principal Component 2 axes
Figure 7: Process for selecting optimum model for each model type
Figure 8: Performance of the best version of each model type. Note that different measures of the same six models are plotted, sorted in order of decreasing area under the ROC curve score.
Figure 9: ROC curve for model with highest area under the ROC curve value for each model type. The lines plotted are the mean scores over the 5 folds of cross validation.
Figure 10: ROC curves for models in Table 3. Each of the five cross validation folds is plotted, along with a mean and the range of one standard deviation
Figure 11: Precision for Stuck class - The best model from each model type in terms of precision for identifying stuck schools. The same six models are plotted in each chart, sorted in order of precision score.
Figure 12: Variation of AUC and Accuracy with different hyperparameters for Support Vector Machines using features listed in Section 9.2
Figure 13: Variation of AUC and Accuracy with different hyperparameters for Neural Networks using features listed in Section 9.2
Figure 14: Variation of AUC and Accuracy with different hyperparameters for Random Forests using features listed in Section 9.2
Figure 15: Variation of AUC and Accuracy with different hyperparameters for K-Nearest Neighbours using features listed in Section 9.2
Figure 16: Sequential Forward/Backward Selection results: Model accuracy for different numbers of features
Figure 17: Recursive Feature Elimination - For each of the six different models, for each level of RFE, the highest score of Area Under the ROC curve and Accuracy are plotted
Figure 18: Oversampling - For each model type, the run with the highest accuracy and area under the ROC curve are shown for the group where oversampling was used on the training data and the group which did not use oversampling.
Figure 19: ROC curves for the selected models, with randomness disabled in the cross validation process. Fold 1 is therefore the first 20% of the data points, Fold 2 is 20-40% etc.
List of Tables
Table 1: Numbers of each category of school
Table 2: Minimum values of each metric to qualify.
Table 3: Characteristics of the best model for each model type, measured by area under the ROC curve and subject to the minimum constraints of Table 2
Table 4: Difference in training and test set accuracy when the kernel is changed. Each run is otherwise identical, using the parameters shown in Table 3
Table 5: Features that appear in more than one of the selected 6 models
Table 6: Importance of each feature to the selected Random Forest model
Table 7: Importance of the top 30 features when the parameters selected of the Random Forest model are applied to all features
Table 8: Confusion matrix for best model
Table 9: Properties of best model
1. Introduction
This project is based on the inspection results of all state-funded schools in England and, in
particular, a small subset of these schools that are considered to be ‘stuck’ in a cycle of poor
performance. The aim of this project is to use machine learning to investigate whether it is
possible to predict the future performance of a school that has been repeatedly receiving poor
inspection results. It has been carried out at the Office for National Statistics in Newport, South
Wales.
1.1. School Inspections
School inspections are one of the primary measures by which schools are judged. State-funded
schools in England can be inspected at any time, with a minimum of 15 minutes’ notice given
that an inspector is due to arrive at the premises (Office for Standards in Education, 2018). The
inspector is then required to be granted immediate access to everything they deem necessary
to investigate how the school is performing in a range of areas, such as how the school is being
managed and the effectiveness of the teaching in the classroom.
There are different types of inspection, such as ‘full’ Section 5 inspections that last for two
school days and ‘short’ Section 8 inspections that can last for one day. Section 5 ‘full’
inspections are the ones investigated in this project and will be referred to simply as
inspections from this point onwards.
An inspection will result in the school receiving ratings and feedback on a number of criteria.
The headline figure, which is what is considered in this project, is the ‘Overall Effectiveness’
rating. There are four available ratings:
- Category 1: Outstanding
- Category 2: Good
- Category 3: Requires improvement
- Category 4: Inadequate (subcategories: Serious Weakness and Special Measures)
Whilst it is clear that all schools would want to be ‘Outstanding’, the distinction applied in this
work is whether a school is ‘Good or better’ (Category 1 or 2), or ‘less than Good’ (Category
3 or 4).
The frequency with which a school is inspected can vary greatly. An average school could be
inspected once every four years whilst a poorly performing school may be inspected far more
frequently. In 2012, the government announced that it was stopping routine inspections of
schools which had received an Outstanding rating (Fowler, 2012), although it would monitor
their data and would inspect if it was deemed necessary. At present, this exemption remains in
place although it has been announced that routine inspections of Outstanding schools will be
reinstated (Department for Education, 2019).
There are different types of state-funded schools in England, such as grammar schools and
comprehensives. Recent government policy has led to the creation of academies. These are
schools where funding is received directly from the government, as opposed to non-academy
schools which receive funding from their Local Authority, who are in turn funded by the
government. The creation of these academies has either been voluntary or forced – any school
rated as Inadequate is legally obliged to become an academy.
From September 2019 onwards, the Education Inspection Framework (Ofsted, 2019) will be
in place, which will change how Ofsted carries out inspections.
1.2. Stuck Schools
‘Stuck’ is a phrase used to describe a school with a repeating pattern of poor inspections. As
used by Ofsted (Spielman, 2018), the criteria for a school to be labelled as stuck are:
- Having had at least four inspections since 2005.
- Every inspection rated ‘less than Good’ (Category 3 or 4).
When a school closes and reopens under a new name, whether to become an academy or not,
the closed school is said to be a predecessor of the opened school. The stuck school criteria
consider all inspections of predecessor schools, so a school can be labelled stuck if it has never
been inspected but its predecessor(s) have and meet the criteria.
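The two criteria can be expressed as a short check. The following is a minimal sketch rather than the labelling code used in this project: the function name `is_stuck`, the `(year, rating)` tuple layout and the decision to count only inspections from 2005 onwards are assumptions made for illustration. Inspections of predecessor schools are assumed to have been merged into the same list beforehand.

```python
from typing import List, Tuple

# Ofsted categories 3 (Requires improvement) and 4 (Inadequate)
LESS_THAN_GOOD = {3, 4}


def is_stuck(inspections: List[Tuple[int, int]],
             since_year: int = 2005,
             min_inspections: int = 4) -> bool:
    """Return True if a combined inspection history meets the stuck criteria.

    `inspections` holds (year, overall_effectiveness) pairs for the school
    and all of its predecessors (hypothetical layout).
    """
    ratings = [rating for year, rating in inspections if year >= since_year]
    has_enough = len(ratings) >= min_inspections
    all_poor = all(rating in LESS_THAN_GOOD for rating in ratings)
    return has_enough and all_poor
```

A school whose history is `[(2006, 3), (2009, 4), (2012, 3), (2016, 3)]` would be labelled stuck under this sketch, whereas a single ‘Good’ inspection anywhere in the list would prevent the label.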
Ofsted has identified stuck schools as a priority area for improvement and is
working with the Department for Education to investigate them and establish what can be
done to help them improve (Spielman, 2018).
1.3. Ofsted and the Department for Education
These are two government organisations that are key stakeholders in this work.
The Office for Standards in Education, Children’s Services and Skills is better known as
Ofsted. They are responsible for carrying out inspections on services providing education and
skills, as well as those that care for children and young people. They then publish the reports
of their findings and inform policymakers of the effectiveness of the services. In this work, the
focus is on school inspections carried out and reported by Ofsted.
The Department for Education (DfE) is responsible for the schooling system as a whole. It is
capable of carrying out ‘interventions’ on schools which it believes would benefit from extra
support. These interventions can take many forms, such as increasing funding or providing a
management consultant.
1.4. Tools Used
This work was carried out entirely using the Python programming language. The data
processing was carried out using the pandas (McKinney, 2010) and numpy (Walt, Colbert and
Varoquaux, 2011) libraries whilst the Scikit-learn library (Pedregosa
et al., 2011) was used for the machine learning models. Matplotlib (Hunter, 2007) was used to
generate the plots. Sequential Feature Selection was implemented using mlxtend (Raschka,
2018). Oversampling was implemented using imblearn (Lemaitre, Nogueira and Aridas, 2017).
A GitHub repository was used to store the code used throughout this project. It can be accessed
at github.com/crees00/SchoolsData.
1.5. Project Aims
The primary aim of this work is to make a model that can accurately predict, for a school with
three consecutive ‘less than Good’ (Category 3 or 4) inspections, whether the next inspection
will be ‘Good or better’ (Category 1 or 2) or ‘less than Good’ (Category 3 or 4). This binary
classification is to be achieved with the greatest accuracy possible.
The second aim is to find the important features in making this prediction and assign an
importance value to them. This is useful in many ways – it allows insight to improve the
accuracy of the model. It also helps to back up the results of the model, providing more weight
to them. It makes the model more accessible and less ‘black box’, allowing the DfE to trust it
more and ensure that evidence is available if a school were to question the model’s output. As
a result, it is important that each variable has an understandable meaning and principal
components cannot be used for modelling.
In order to achieve the two project aims stated above, a dataset must first be created. This
dataset must contain data on every currently open, state-funded school in England, with enough
variables to allow modelling to be carried out. The data must be cleaned and formatted in a
way that is amenable to modelling, with binary class labels generated and added to the dataset.
1.6. Process Plan
1. Create dataset from multiple data sources.
2. Label data.
3. Preliminary analysis of data.
4. Iteratively run hyperparameter searches and Sequential Feature Selection to identify
best combination of hyperparameters and features for each model type.
5. Analyse results and present findings to Ofsted and DfE.
2. Literature Review
2.1. Previous Analysis of Schools Data
Analysis of schools data has been completed in a number of different forms for different
purposes. The prediction of secondary school performance using machine learning has been
carried out in Tunisia (Rebai, Ben Yahia and Essid, 2019). Here, Random Forests and
regression trees were used to identify variables associated with strong exam performance for a
school. School size, the male/female split and class size are among the key variables, of which
it is noted that there is a ‘high non-linearity of the relationships between these key factors and
school performance’. Dummy variables are used for the different geographical regions, but this
is shown to provide few positive results as the information is contained in another variable
(urban/rural). The paper does, however, use relatively few variables which would not appear
to cover the range of inputs required.
The complexity of the link between school level data and exam performance is again noted in
a paper which uses regression trees to generate feature importance for predicting school exam
performance (Masci, Johnes and Agasisti, 2018). The percentage of disadvantaged students,
school funding and student truancy are listed as key variables. It is noted that, while these
variables are linked to exam results, exam results themselves are to be used as a variable in this
project. Student exam results have been predicted with relatively poor accuracy using deep
learning (Tanuar et al., 2018).
Student exam results have also been predicted using, among others, Naïve Bayes, Support
Vector Machines, Random Forests and Logistic Regression as it is recognised that no one
algorithm works best for every problem (Canagareddy, Subarayadu and Hurbungs, 2019). In
this paper, a subset of the variables, such as student age and gender, are removed prior to
modelling because they ‘do not have any impact on the predictions’. Their results appear to
show that their classifier performance decreases as a result, suggesting that variables should be
tested in the model before concluding that they are ineffective and removing them from the
analysis.
As noted above, the uses of machine learning prediction identified in the literature apply to
exam performance as opposed to school inspection performance. They also mainly use
individual student data and make predictions on an individual student level. Some analyses
have been undertaken to characterise stuck schools (Spielman, 2018; Thomson, 2019) but these
do little beyond providing summary statistics.
There therefore exists an opportunity to investigate whether machine learning techniques can
be used to predict future inspection performance of schools.
2.2. Machine Learning Models
Each open state-funded school in England forms a data point in this analysis, with many
features and a single binary class: Stuck or Escaped. The aim of this work is to establish a
binary classifier that can accurately predict, for an unlabelled school, whether it will be Stuck
or Escaped.
There are many available machine learning models which can be used to build a binary
classifier. It is assumed that the reader is familiar with the relatively well-known techniques
used in this work and, as such, only a brief description of each is provided. The Scikit-learn
implementations of these models (Pedregosa et al., 2011) have hyperparameters which
can be tuned to improve the model performance. The key hyperparameters for each model are
described under sub-headings, together with the options available for them in Scikit-learn. The
remainder of this chapter draws heavily on the Scikit-learn documentation for each of the
tools named.
2.2.1. Logistic Regression
Logistic Regression is a well known binary classifier, described based on the Machine Learning
Journal article (Lin, Yu and Huang, 2011). In Logistic Regression, the conditional probabilities
describing the possible outcomes for a single data point are modelled using a logistic function:
\[ P(y \mid \boldsymbol{x}) \equiv \frac{1}{1 + e^{-y \boldsymbol{w}^{T} \boldsymbol{x}}} \]
where 𝒙 is the data point, 𝑦 is the class label and 𝒘 is the weight vector. Given binary (two
class) training data with 𝑙 points, Logistic Regression minimises the following cost function:
\[ P(\boldsymbol{w}) = C \sum_{i=1}^{l} \log\left(1 + e^{-y_i \boldsymbol{w}^{T} \boldsymbol{x}_i}\right) + \frac{1}{2} \boldsymbol{w}^{T} \boldsymbol{w} \]
where 𝐶 > 0 is a penalty parameter. In this project, 𝑙2 regularisation as shown above is used,
with the Limited-memory BFGS (L-BFGS) solver selected for its stability.
Logistic Regression is a useful classifier because it is relatively simple, well known and can
provide insight into the relative importance of the input features.
The class sklearn.linear_model.LogisticRegression from Scikit-learn was used for this analysis.
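To make the two expressions above concrete, the following minimal sketch evaluates the logistic probability and the regularised cost in plain Python. It is illustrative only and is not the Scikit-learn solver: the helper names are hypothetical, and the labels are assumed to take the values -1 and +1, which is the convention under which the two formulas are consistent.

```python
import math
from typing import Sequence


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    """Inner product w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def probability(y: int, w: Sequence[float], x: Sequence[float]) -> float:
    """P(y | x) for a label y in {-1, +1} (assumed label encoding)."""
    return 1.0 / (1.0 + math.exp(-y * dot(w, x)))


def cost(w: Sequence[float], X: Sequence[Sequence[float]],
         Y: Sequence[int], C: float = 1.0) -> float:
    """C * sum of logistic losses + (1/2) w^T w, the l2-regularised cost."""
    loss = sum(math.log(1.0 + math.exp(-y * dot(w, x))) for x, y in zip(X, Y))
    return C * loss + 0.5 * dot(w, w)
```

With a zero weight vector, every point receives probability 0.5 and each term of the sum equals log 2. A larger value of C places more weight on fitting the training data relative to keeping the weights small.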
2.2.2. Random Forests
Random Forests are a type of ensemble classifier based on the decision tree. A known issue
with decision trees is the possibility of overfitting the data due to the tree depth being too great
(Shalev-Shwartz and Ben-David, 2014), but Random Forests can mitigate this issue. The
Random Forest algorithm is a perturb-and-combine technique, where a diverse set of classifiers
is created by introducing randomness into the classifier construction. The prediction of the
ensemble is given as the averaged prediction of the individual classifiers.
Each tree in the ensemble is built from a random sample drawn with replacement (a bootstrap
sample) from the training set. When splitting each node during the construction of a tree, the
best split is found either from all input features or a random subset of specified size. The
purpose of these two sources of randomness is to decrease the variance of the model. Random
Forests achieve a reduced variance by combining a diverse selection of trees although this can
increase the bias. The Scikit-learn implementation of Random Forests used was
sklearn.ensemble.RandomForestClassifier. This combines classifiers by averaging their
probabilistic predictions instead of letting each classifier vote for a single class.
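The two ideas described above, bootstrap sampling and the averaging of probabilistic predictions, can be sketched in a few lines. This is an illustration of the principle only, not the Scikit-learn implementation; the function names are hypothetical and the per-tree probabilities are assumed to be supplied by already-trained trees.

```python
import random
from typing import List, Sequence


def bootstrap_sample(rows: Sequence, rng: random.Random) -> List:
    """Draw len(rows) rows with replacement from the training set."""
    return [rows[rng.randrange(len(rows))] for _ in rows]


def average_probabilities(per_tree_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the class-probability vectors predicted by each tree."""
    n_trees = len(per_tree_probs)
    n_classes = len(per_tree_probs[0])
    return [sum(p[c] for p in per_tree_probs) / n_trees for c in range(n_classes)]


def predict(per_tree_probs: Sequence[Sequence[float]]) -> int:
    """Return the class with the highest averaged probability."""
    avg = average_probabilities(per_tree_probs)
    return max(range(len(avg)), key=avg.__getitem__)
```

Averaging can differ from a majority vote. For three trees predicting `[0.9, 0.1]`, `[0.4, 0.6]` and `[0.4, 0.6]`, two of the three trees favour class 1, but the averaged probability favours class 0 because the first tree is far more confident.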
2.2.2.1. Number of Estimators
The number of trees making up the Random Forest. The accuracy of the model converges to a
limit as the number of trees in the forest becomes large. There is therefore a trade-off between
model accuracy and processing time as larger models will take longer to compute.
2.2.2.2. Maximum Number of Features to Consider
The size of the random subsets of features to consider when splitting a node. The lower the
number, the greater the reduction of variance, but also the greater the increase in bias.
2.2.2.3. Criterion
The function used to measure the quality of the split at each node in a tree. Either the Gini
impurity or the information gain (entropy) can be used in the model.
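The two split criteria can be computed directly from the class labels at a node. The sketch below uses the standard definitions of Gini impurity and entropy; the function names are hypothetical, and the information gain is shown for a binary split only.

```python
import math
from collections import Counter
from typing import Sequence


def gini(labels: Sequence) -> float:
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels: Sequence) -> float:
    """Shannon entropy (in bits) of the class proportions."""
    n = len(labels)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent: Sequence, left: Sequence, right: Sequence) -> float:
    """Reduction in entropy achieved by splitting parent into left and right."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children
```

A node holding two points of each class has a Gini impurity of 0.5 and an entropy of 1 bit; a split that separates the classes perfectly recovers the full 1 bit as information gain.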
2.2.2.4. Bootstrap
Whether each tree added to the Random Forest is built from a bootstrap sample of the data
points (drawn with replacement) or from all of the data points.
2.2.3. Support Vector Machines
The following description is largely based on Chapter 5 of Pattern Classification (Duda, Hart
and Stork, 2001) and Chapter 16.5 of Numerical Recipes (Press et al., 2007). Support Vector
Machines are a form of linear discriminant function that specialise in separating data where the
pattern is in a higher dimension.
If the data is linearly separable, there exists a hyperplane in 𝑛 dimensions, an 𝑛 − 1
dimensional surface, given by
\[ f(\boldsymbol{x}) \equiv \boldsymbol{w} \cdot \boldsymbol{x} + b = 0 \]
that completely separates the training data 𝒙. All that remains is to find 𝒘 (a normal vector to
the hyperplane) and 𝑏 (an offset), then 𝑓(𝒙) will be the decision rule – if 𝑓(𝒙) > 0 then Class=1
and the point is on one side of the hyperplane, if 𝑓(𝒙) < 0 then Class=0 and the point is on the
other side of the hyperplane.
The margin is the perpendicular distance to points nearest to the hyperplane on both sides. The
goal in training a Support Vector Machine is to find the separating hyperplane with the largest
margin. The larger the margin, the better the classifier is expected to perform on unseen test
cases.
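The decision rule and the perpendicular distance used to define the margin can be sketched as follows. This is an illustrative sketch with hypothetical function names, assuming that 𝒘 and 𝑏 have already been found; assigning points that lie exactly on the hyperplane to class 0 is an arbitrary choice made here, as the text does not define that case.

```python
import math
from typing import Sequence


def decision_function(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """f(x) = w . x + b"""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Class 1 if f(x) > 0, otherwise class 0 (ties assigned to 0 by assumption)."""
    return 1 if decision_function(w, b, x) > 0 else 0


def distance_to_hyperplane(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Perpendicular distance |f(x)| / ||w|| from x to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_function(w, b, x)) / norm
```

For a separable training set, the margin of a candidate hyperplane is the smallest value of `distance_to_hyperplane` over the training points, and training seeks the 𝒘 and 𝑏 that make this value as large as possible.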
The support vectors are the training points that lie closest to the hyperplane, at a distance
equal to the margin. They define the optimal separating hyperplane and are the most difficult
points to classify as they are nearest to the hyperplane.
An important benefit of the Support Vector Machine is that the complexity of the resulting
classifier is based on the number of support vectors rather than the dimensionality of the
transformed space. Support Vector Machines therefore tend to be less prone to problems of
overfitting than some other methods. As they are based only on the support vectors, outliers
have less impact.
A disadvantage of the Support Vector Machine is that it does not lend itself to providing
probability estimates and it does not indicate the importance of different input features.
The sklearn.svm.SVC class was used to implement Support Vector Machines.
2.2.3.1. C value
The optimal decision boundary described above is that which maximises the distance between
the nearest training points and the decision boundary, assuming that the data is linearly
separable. If the data cannot be separated by a hyperplane, then training data points will be
misclassified and the model is said to have a ‘soft margin’. This misclassification is penalised
by the C value in the optimisation process, with a higher C value more strongly penalising
misclassification, with a penalty related to the distance between the misclassified point and the
hyperplane.
Setting C to a high value therefore favours achieving perfect separation of the data, at the risk
of generating an overly complex model. Setting a lower C value favours increasing the margin,
which will perform worse on the training data but could be more robust to variation in the test
data.
2.2.3.2. Kernel function
Support Vector Machines can utilise the ‘kernel trick’. If an embedding function exists that
maps the n-dimensional feature vectors to a much higher N-dimensional space, it may be
possible that a very non-linear separating surface in the n-dimensional space maps into a linear
hyperplane in the N-dimensional space. This potentially highly complex mapping does not
need to be computed explicitly; instead, a kernel is computed that could have come from the mapping.
The three kernel functions available in the model are linear, polynomial and the Gaussian radial
basis function (‘rbf’). The polynomial kernel seeks smoother, more global solutions whilst the
Gaussian radial basis function is more influenced by local nearest neighbour effects.
2.2.3.3. Degree of polynomial
If a polynomial kernel is selected, this is the order of the polynomial.
2.2.3.4. Gamma value
The gamma value is a parameter used in the polynomial and Gaussian radial basis function kernels.
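The three kernels, and the role of the degree and gamma parameters within them, can be written out explicitly. The forms below follow those given in the Scikit-learn documentation for its SVM kernels; the function names are hypothetical, and `coef0` (an additive constant in the polynomial kernel) is not discussed in this work and is included only so that the expression is complete.

```python
import math
from typing import Sequence


def linear_kernel(x: Sequence[float], z: Sequence[float]) -> float:
    """K(x, z) = <x, z>"""
    return sum(xi * zi for xi, zi in zip(x, z))


def polynomial_kernel(x: Sequence[float], z: Sequence[float], degree: int = 3,
                      gamma: float = 1.0, coef0: float = 0.0) -> float:
    """K(x, z) = (gamma * <x, z> + coef0) ** degree"""
    return (gamma * linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float = 1.0) -> float:
    """K(x, z) = exp(-gamma * ||x - z||^2)"""
    squared_distance = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * squared_distance)
```

In the radial basis function kernel, a larger gamma makes the kernel value fall away more quickly with distance, so each training point influences a smaller neighbourhood.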
2.2.4. Neural Networks
The following is based largely on Chapter 4.4 of Machine Learning (Mitchell, 1997).
A single perceptron unit with a binary threshold takes a vector of input values 𝒙, calculates a
linear combination of these inputs and outputs a 1 if the result is greater than some threshold
and -1 otherwise. The weights 𝑎𝑖 and bias 𝑏 are real valued constants that are learned so that
the perceptron produces the correct (-1 or +1) output for each of the given training examples.
\[ o(\boldsymbol{x}) = \begin{cases} 1 & \text{if } b + a_1 x_1 + \dots + a_n x_n > 0 \\ -1 & \text{otherwise} \end{cases} \]
A single perceptron can only express a linear decision surface. Adding in layers of
perceptrons (Figure 1) allows the model to learn a more complex, non-linear decision surface.
Each layer of perceptrons contains a pre-defined number of perceptron units, each of which
can learn different weights (𝒂) and bias (𝑏). The layers of perceptron units are fully
connected – each unit in one layer is connected to all units in the previous layer and all units
in the next layer. The inputs to the second layer are the outputs from the first layer, and so on.
A multi-layer perceptron model may be classed as a feed-forward Neural Network.
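The effect of adding a layer can be illustrated with the XOR function, which a single perceptron cannot represent because it is not linearly separable. In the minimal sketch below, the weights and biases are chosen by hand for illustration rather than learned, the function names are hypothetical, and inputs are encoded as -1 and +1 to match the perceptron output.

```python
from typing import List, Sequence


def perceptron(x: Sequence[float], a: Sequence[float], b: float) -> int:
    """Output +1 if b + a1*x1 + ... + an*xn > 0, otherwise -1."""
    return 1 if b + sum(ai * xi for ai, xi in zip(a, x)) > 0 else -1


def layer(x: Sequence[float], weights: Sequence[Sequence[float]],
          biases: Sequence[float]) -> List[int]:
    """Apply every perceptron unit in a fully connected layer to the same input."""
    return [perceptron(x, a, b) for a, b in zip(weights, biases)]


def xor_network(x1: int, x2: int) -> int:
    """Two-layer network computing XOR on inputs in {-1, +1}."""
    # Hidden layer: the first unit acts as OR, the second as NAND.
    hidden = layer([x1, x2], weights=[[1, 1], [-1, -1]], biases=[1, 1])
    # Output unit: AND of the two hidden outputs.
    return perceptron(hidden, a=[1, 1], b=-1)
```

Each hidden unit still draws a single linear boundary; it is the combination of the two boundaries by the output unit that produces the non-linear decision surface.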
Figure 1: A multi layer perceptron with 1 hidden layer containing k units (Scikit-learn, no date b)
The backpropagation algorithm learns the weights for a multilayer network. It employs gradient descent to attempt to minimise the squared error between the network output values and the target values for these outputs (the training data labels). For each training example, it applies the network to the example, calculates the error of the network output for this example, computes the gradient of this error with respect to the weights, then updates all weights in the network.
The number of hidden units governs the complexity of the decision boundary (Duda, Hart and Stork, 2001) so, if the data classes are highly interspersed, more hidden units are required.
Well trained Neural Networks are capable of accurately classifying highly complex non-linear datasets. A disadvantage of multi-layer perceptrons with hidden layers, including the Scikit-learn implementation, is that the loss function is non-convex and has more than one local minimum. Different random weight initialisations can therefore lead to variation between runs with the same data. Neural Networks also require the tuning of a number of hyperparameters, such as the number of hidden units, layers and iterations.
sklearn.neural_network.MLPClassifier was the implementation of Neural Networks used in this project.
2.2.4.1. Activation function
For a perceptron unit, this is a binary threshold function. In this work, the rectified linear unit
(ReLU) was used due to its constant gradient (for positive values), ensuring that learning
was not unnecessarily slow for large input values.
2.2.4.2. Number of layers
The number of hidden layers in the network. It is unlikely that more than three layers would be
required, as it would be necessary to have ‘special problem conditions or requirements to
recommend the use of more than three layers’ (Duda, Hart and Stork, 2001).
2.2.4.3. Number of nodes per layer
In order to simplify the parameter search, the decision was taken to have, for a single Neural
Network, the same number of units in each layer. For example, the following setups were
possible: [2,2,2] or [5,5,5,5].
2.2.4.4. Solver
Three solvers are available in the Scikit-learn implementation:
- Stochastic Gradient Descent – Updates parameters using the gradient of the loss
function with respect to a parameter that needs adaptation.
- Adam – This is also a stochastic optimiser but it can automatically adjust the amount to
update parameters based on adaptive estimates of lower-order moments.
- L-BFGS – This approximates the Hessian matrix which represents the second-order
partial derivative of a function. It then approximates the inverse of the Hessian matrix
to perform parameter updates.
2.2.4.5. Alpha
Alpha is a regularisation term which helps avoid overfitting by penalising weights with large
magnitudes.
2.2.5. Naïve Gaussian Bayes Networks
This description is based on ‘The Optimality of Naïve Bayes’ (Zhang, 2004).
Naïve Gaussian Bayes networks are based on applying Bayes’ theorem with the ‘naïve’
assumption of conditional independence between every pair of features given the value of the
class label 𝑦. Bayes’ theorem states the following relationship, given class label 𝑦 and
dependent feature vector 𝒙:
\[ P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)} \]
Using the naïve conditional independence assumption for all features, this relationship is
simplified to:
\[ P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)} \]
As 𝑃(𝑥1, … , 𝑥𝑛 ) is constant given the input, we can use the following classification rule:
\[ P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) \]
\[ \Rightarrow \hat{y} = \arg\max_{y} P(y) \prod_{i=1}^{n} P(x_i \mid y) \]
Maximum A Posteriori estimation can then be used to estimate 𝑃(𝑦) and 𝑃(𝑥𝑖 |𝑦), with 𝑃(𝑦)
then being the relative frequency of class label 𝑦 in the training set.
In the Gaussian Naïve Bayes algorithm used, the likelihood of the features is assumed to be
Gaussian:
\[ P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^2}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma_y^2}\right) \]
The parameters 𝜎𝑦 and 𝜇𝑦 are estimated using maximum likelihood. There are no
hyperparameters to tune in this model.
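The classification rule and the Gaussian likelihood can be combined into a very small classifier. The sketch below is illustrative only and is not the Scikit-learn implementation: the function names are hypothetical, a separate mean and variance is assumed for each class and each feature, log-probabilities are summed rather than multiplying probabilities to avoid numerical underflow, and a small constant is added to each variance (an assumption made here) to avoid division by zero.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Model = Dict[int, Tuple[float, List[float], List[float]]]


def fit(X: Sequence[Sequence[float]], y: Sequence[int]) -> Model:
    """Estimate P(y) and the per-class, per-feature mean and variance."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model: Model = {}
    for label, rows in rows_by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Maximum likelihood variance, plus a small assumed smoothing constant.
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, variances)
    return model


def log_gaussian(x: float, mean: float, variance: float) -> float:
    """Logarithm of the Gaussian likelihood P(x_i | y)."""
    return -0.5 * math.log(2 * math.pi * variance) - (x - mean) ** 2 / (2 * variance)


def predict(model: Model, x: Sequence[float]) -> int:
    """Return the class maximising log P(y) + sum_i log P(x_i | y)."""
    def log_posterior(label: int) -> float:
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(xi, m, v) for xi, m, v in zip(x, means, variances))
    return max(model, key=log_posterior)
```

Because each feature contributes its own one-dimensional term to the sum, fitting the model requires only a mean and a variance per class and feature.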
The conditional independence assumption rarely holds in real-world applications.
Despite this, there are numerous scenarios where these models have been shown to be
effective. They require a small amount of training data to estimate the necessary parameters
and are extremely fast compared to more sophisticated methods. The decoupling of the class
conditional feature distributions means that each distribution can be independently estimated
as a one-dimensional distribution. This then helps avoid problems with the curse of
dimensionality.
The class used in this project was sklearn.naive_bayes.GaussianNB.
2.2.6. K-Nearest Neighbours
The K-Nearest Neighbours algorithm assumes that all data points correspond to points in
n-dimensional space (Mitchell, 1997). The nearest neighbours of an instance are defined in terms
of a distance measure as specified by the user. When k is 1, the algorithm assigns a label to the
test point corresponding to the nearest training point in the space. For larger values of k, the
algorithm assigns the most common class label of the k nearest training points.
Unless weightings are applied, all features are treated equally. As such, if there are large
numbers of unimportant features in the model, the classification can be dominated by these
unimportant features. Also, each test point must be compared to every training point so this
can be computationally intensive.
The implementation of K-Nearest Neighbours used was
sklearn.neighbors.KNeighborsClassifier.
2.2.6.1. K value
The larger the value of k, the larger the space being investigated and the less prone the
classifier is to error due to noise.
2.2.6.2. P measure
The order of the Minkowski distance used to measure the distance between points. For example,
p=1 is the Manhattan distance and p=2 is the Euclidean distance.
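The roles of k and p can be seen in a brute-force version of the algorithm. This is a minimal sketch with hypothetical function names, not the Scikit-learn implementation; it compares the test point with every training point, which is the computational cost noted above.

```python
from collections import Counter
from typing import Sequence


def minkowski(a: Sequence[float], b: Sequence[float], p: float = 2) -> float:
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(ai - bi) ** p for ai, bi in zip(a, b)) ** (1.0 / p)


def knn_predict(train_X: Sequence[Sequence[float]], train_y: Sequence[int],
                x: Sequence[float], k: int = 3, p: float = 2) -> int:
    """Return the most common label among the k training points nearest to x."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: minkowski(pair[0], x, p))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]
```

For the points (0, 0) and (3, 4), the Manhattan distance is 7 and the Euclidean distance is 5, so the choice of p can change which training points count as nearest.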
2.3. Methods for Refining the Models
Aside from selecting a model type and tuning its parameters (Section 2.2), there are many ways
in which a model can be further improved. The methods used in this project are described
below.
2.3.1. Iterative imputation
The models required complete sets of data to function, yet a substantial quantity of the input
data was missing (Section 3.3.1). Data imputation was therefore required. The
imputation method selected was the IterativeImputer from Scikit-learn. This models each
feature with missing values as a function of other features, and uses that estimate for
imputation. This is achieved by fitting a regressor to the input data and using this to predict the
missing values in an iterative, round-robin fashion. This imputation method was selected to
provide a more ‘accurate’ prediction of the missing value than simply imputing the mean or
the mode, although this has not been verified. It also maintains a level of variance within the
data.
2.3.2. Sequential Forward and Backward Selection
Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) are described
based on the information on the mlxtend website (mlxtend, no date). These techniques are
greedy methods of selecting a strong, though not necessarily optimal, combination of features to
use in the model. For forward selection, the selected feature set is initially empty and the
available features are placed in an input list. In each iteration, each feature in the input list is
added in turn to the selected feature set, the model is trained and tested (using cross validation)
on the new feature set, the score is recorded and the feature is removed again. When all features
in the input list have been tested, the feature that resulted in the best score is removed from the
input list and added to the selected feature set. This continues until all features have been
selected, unless an early stopping criterion is used. Backward selection works in the reverse fashion – the
selected feature set initially contains all features and one feature is removed in each iteration.
The scoring measure can be selected from standard measures such as model accuracy and the
area under the ROC curve. This is computationally intensive as the model must be trained and
tested a large number of times for each feature selection run.
In this work, the module mlxtend.feature_selection.SequentialFeatureSelector was used to
implement sequential feature selection in the SFS.py script.
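The greedy forward procedure described above can be sketched independently of any particular model. The following is not the mlxtend implementation or the SFS.py script: the function name and the `score` callback are hypothetical, with `score` standing in for training the model on a feature subset and returning its mean cross-validated score.

```python
from typing import Callable, List, Sequence


def sequential_forward_selection(features: Sequence[str],
                                 score: Callable[[List[str]], float],
                                 k_features: int) -> List[str]:
    """Greedily add the feature that gives the best score until k are selected."""
    selected: List[str] = []
    remaining = list(features)
    while remaining and len(selected) < k_features:
        # Try each remaining feature in turn alongside those already selected.
        best = max(remaining, key=lambda f: score(selected + [f]))
        remaining.remove(best)
        selected.append(best)
    return selected
```

Selecting k features from n in this way requires on the order of n × k evaluations of `score`, each of which involves a full cross-validated fit, which is why the procedure is computationally intensive.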
2.3.3. Recursive Feature Elimination
Recursive Feature Elimination is another method of selecting features and is described in the
Scikit-learn documentation. This is achieved by recursively removing the features which score
lowest in a Logistic Regression until the number of required features is reached. It was implemented
using the Scikit-learn class sklearn.feature_selection.RFE.
2.3.4. Oversampling
When using different subsets of the data, with different criteria for positive and negative labels,
it is possible that there can be a large difference between the sizes of the two classes for a binary
classifier. As a result, the classifier can be distorted by having few training points in the less
populous ‘minor’ class. This can be mitigated through the generation of synthetic cases of the
minor class in a technique called oversampling.
An oversampling technique called SMOTE (Chawla et al., 2002) is used in this work. The
following description is based on the documentation of imbalanced-learn (Lemaitre, Nogueira
and Aridas, 2017) which provided the implementation of SMOTE used. This module is
imblearn.over_sampling.SMOTE. The regular SMOTE algorithm takes a point of the minor
class, selects one of its nearest neighbours in the same class and generates a new point by linear
interpolation between the two existing points, giving the new point the label of the minor class.
The result of using this technique is that the training set will have the same number of each
class in it. The more populous ‘major’ class of the training set and all of the test set remain
unchanged.
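The interpolation step at the heart of SMOTE can be sketched as follows. This is a simplified illustration with hypothetical function names rather than the imbalanced-learn implementation; it assumes the minor class contains at least two points and uses squared Euclidean distance to find neighbours.

```python
import random
from typing import List, Optional, Sequence


def interpolate(x: Sequence[float], neighbour: Sequence[float],
                rng: random.Random) -> List[float]:
    """Generate a point at a random position on the line between x and neighbour."""
    gap = rng.random()  # uniform in [0, 1)
    return [xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)]


def oversample(minority: Sequence[Sequence[float]], n_new: int, k: int = 5,
               rng: Optional[random.Random] = None) -> List[List[float]]:
    """Create n_new synthetic minor-class points by linear interpolation."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Rank the other minor-class points by distance from x.
        others = [m for m in minority if m is not x]
        others.sort(key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))
        neighbour = rng.choice(others[:k])
        synthetic.append(interpolate(x, neighbour, rng))
    return synthetic
```

Every synthetic point lies on a line segment between two existing minor-class points. As described above, such points would be generated from, and added to, the training set only.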
2.4. Measuring Model Effectiveness
A key consideration when building a machine learning model is how the results from the model
should be assessed to identify which are most effective and how they can be improved.
2.4.1. Cross validation
In order to provide an indication of the accuracy of the model, it must be tested on previously
unseen data. As a result, for each run, a subset of the data should be removed and held as a test
set, the remainder being the training set. The model can then be trained solely on the training
set and its performance judged on how the model performs when applied to the test set. The
reported performance of the model is then dependent on the training/test split used – some
splits would result in better performance on the test set than others. To reduce this effect, 𝑘-fold
cross validation can be implemented as described in Section 9.6.2 of Pattern Classification
(Duda, Hart and Stork, 2001):
The training set is randomly divided into 𝑘 disjoint sets of equal size 𝑛/𝑘, where 𝑛 is the total
number of data points. The classifier is trained 𝑘 times, each time with a different set held out
as a test set. The estimated performance is the mean of these 𝑘 scores. The number of folds
traditionally used is 𝑘 = 10.
The sklearn.model_selection.KFold class was used to implement cross validation
because the full implementations of cross validation in Scikit-learn were not compatible with
the way the models and data had been set up. Setting up the different training and test sets is
demonstrated in the oneFullRun(..) function in genericModelClass.py as shown in Section 5.2.
This generates output for each of the 𝑘 folds separately, so separate post processing is required
to calculate the average output.
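The splitting and averaging steps can be sketched without reference to any particular model. This is not the oneFullRun(..) function or Scikit-learn's KFold: the function names are hypothetical, and the data are assumed to be addressed by integer index.

```python
import random
from typing import List, Sequence, Tuple


def k_fold_splits(n: int, k: int, seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices 0..n-1 and return k (train, test) index pairs."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k disjoint folds.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j in range(k) if j != i for idx in folds[j]]
        splits.append((train, test))
    return splits


def mean_score(fold_scores: Sequence[float]) -> float:
    """Post-processing step: average the per-fold scores."""
    return sum(fold_scores) / len(fold_scores)
```

Each data point appears in exactly one test fold, so every point is used for testing once and for training 𝑘 − 1 times.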
2.4.2. Accuracy
Accuracy is the percentage of the test points that were correctly classified by the model.
2.4.3. Precision and recall
The confusion matrix (Figure 2) summarises a classifier’s performance.

              True 1            True 0
Predicted 1   True positives    False positives
Predicted 0   False negatives   True negatives

Figure 2: Confusion Matrix
Precision shows what proportion of the points classified as class 1 are actually in class 1:
\[ \text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}} \]
Recall shows what proportion of the points that are labelled as class 1 are predicted to be in class 1:
\[ \text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}} \]
These measures are particularly useful when there is an uneven split between the two classes.
The F1 score is the harmonic mean of precision and recall, giving equal weight to each:
\[ F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \]
The confusion matrices, precision and recall were calculated using the classification_report and confusion_matrix functions within sklearn.metrics.
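These measures follow directly from the four cells of the confusion matrix. The sketch below uses hypothetical function names and is independent of Scikit-learn; it does not guard against division by zero, which would occur if, for example, no points were predicted to be in class 1.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)


def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)


def f1_score(prec: float, rec: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * prec * rec / (prec + rec)
```

As a worked example, with 2 true positives, 1 false positive and 2 false negatives, the precision is 2/3, the recall is 1/2 and the F1 score is 4/7.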
2.4.4. ROC curves
An ROC curve (see examples in Figure 10 in Results section) is built upon a model’s false
positive rate and true positive rate – the proportion of points actually in class 0 and class 1,
respectively, that the model has classed as class 1. By calculating these rates for a range of
classifier threshold values, a Receiver Operating Characteristic (ROC) curve can be plotted. If
the threshold is set high enough, all points are predicted to be in class 0, so both rates are zero,
and the opposite holds for when the threshold is set to a minimum value. The method of
obtaining the values for ROC curves for the different model types is set out in Chapter 3 of
‘ROC Analysis of Classifier in Machine Learning: A Survey’ (Majnik and Bosnic, 2011).
The area under an ROC curve can be calculated, for which a perfect classifier would have a
value of 1. An Introduction to ROC Analysis (Fawcett, 2006) states that “The AUC has an
important statistical property: the AUC of a classifier is equivalent to the probability that the
classifier will rank a randomly chosen positive instance higher than a randomly chosen negative
instance.” A classifier which assigns classes at random would expect to achieve an AUC of
0.5. Therefore, for a model to be useful it must have an AUC score exceeding 0.5.
The datapoints and AUC results for the ROC curves plotted in this report were generated using
sklearn.metrics.roc_curve and sklearn.metrics.roc_auc_score respectively.
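The construction of the curve, and the ranking interpretation of the AUC quoted above, can be checked on a small example. The sketch below is illustrative and uses hypothetical function names rather than the Scikit-learn functions; it assumes that both classes are present and that a point is predicted to be in class 1 when its score is at or above the threshold.

```python
from typing import List, Sequence, Tuple


def roc_points(y_true: Sequence[int],
               scores: Sequence[float]) -> List[Tuple[float, float]]:
    """(false positive rate, true positive rate) for each threshold, highest first."""
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing classed as 1
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the curve by the trapezium rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def rank_probability(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For labels `[1, 1, 0, 1, 0, 0]` with scores `[0.9, 0.8, 0.7, 0.6, 0.4, 0.2]`, eight of the nine positive-negative pairs are ranked correctly, and the area under the resulting curve is likewise 8/9.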
3. Input Data and Pre-Processing
A key part of this project was assembling the data into a useable format compatible with the
machine learning models.
3.1. Datasets Available
This project used publicly available data provided by Ofsted and the DfE.
3.1.1. Ofsted data
Ofsted provided a record of every inspection that had occurred between 1st September 2005
and 31st August 2018, of which there are approximately 100,000. This data came in the form
of .csv files. The period until 31st August 2015 was covered by a single .csv file provided by
Ofsted, with the remaining data available in a separate file per term from the government
website. As stated in Section 1.1, these 100,000 inspections are evenly distributed neither
between schools nor over time.
Each inspection had around 50 variables; the key ones are shown below.
• URN – Each school has a Unique Reference Number which is the primary identification
of the school. If a school reopens under a different name, or becomes an academy, it is
given a new URN.
• LA/ESTAB Number – The Local Authority / Establishment number is another unique
identifier that was used in the cases where URN was not available.
• Inspection Number – This is the unique reference for the inspection.
• Overall Effectiveness rating – This is the category (1/2/3/4) that is provided.
• Inspection Start Date – This is used for providing the year of the inspection.
3.1.2. DfE data
The data provided by the Department for Education had been acquired by the ONS before the
start of this project.
A list of the variables used is provided in the appendix (Section 9.1).
3.1.3. Get Information About Schools data
This service, formerly known as EduBase, provides a list of all state-funded schools which are
currently open in England. This analysis focuses solely on these schools, although any
inspections that a current school’s predecessors had are considered in the current school’s
inspection history.
3.1.4. List of Academies
This list of schools made a link between a current school and any predecessor school(s) it had.
This could be used in combining the inspection histories of schools.
3.2. Data Stitching
An initial aim of the project was to create a single large dataset containing all relevant data for
each open school in England. The list of all currently open schools was used to ensure that
there was one row for each school, then columns were added in batches. The
combineInspData.py, creatingAMonster.py, GDIhelper.py and genericDataIn.py scripts were
used to carry out the following work.
A series of helper functions were created to perform the tasks. For each data source, a dictionary
was filled with the .csv file paths, the column names containing each data type and a selection
of other required inputs. The dictionaries were then passed sequentially to the helper functions
to add columns to the existing dataframe. During this process, the data types were corrected.
Some data was in separate .csv files for different subsets of the data for a given year, with
inconsistent names for equivalent columns. Here, the corresponding column names of the
required columns were identified and placed in lists. The lists of the two input dataframes were
then used to merge the individual columns to provide a combined dataframe that could be
joined to the primary dataframe. A similar operation was carried out on the examination
performance data which is described in section 4.1.3.
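The merging of equivalently named columns described above could be sketched as follows. This is a minimal illustration rather than the code in the scripts named above; the function combine_subsets, the column names and the URN values are all hypothetical.

```python
import pandas as pd

def combine_subsets(frames, column_lists, common_names):
    """Rename each frame's equivalent columns to shared names and stack them.

    frames       : list of DataFrames, one per subset file
    column_lists : column_lists[i][j] is the name used in frames[i] for the
                   j-th required column
    common_names : the shared names to give the required columns
    """
    renamed = []
    for frame, cols in zip(frames, column_lists):
        mapping = dict(zip(cols, common_names))
        renamed.append(frame[cols].rename(columns=mapping))
    return pd.concat(renamed, ignore_index=True)

# Hypothetical example: two subset files that name the same columns differently.
subset_a = pd.DataFrame({"URN": [100001, 100002], "PupilCount": [210, 180]})
subset_b = pd.DataFrame({"urn": [100003], "NumberOfPupils": [950]})

combined = combine_subsets(
    [subset_a, subset_b],
    [["URN", "PupilCount"], ["urn", "NumberOfPupils"]],
    ["URN", "Pupils"],
)

# Left-join onto the one-row-per-open-school frame so that no school is dropped.
schools = pd.DataFrame({"URN": [100001, 100002, 100003, 100004]})
schools = schools.merge(combined, on="URN", how="left")
```

A left join keeps every open school in the primary dataframe, so a school with no matching row simply receives a missing value for the new column.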
The final dataset contained one row for each open, state-funded school in England. The URN
(Section 3.1.1) was the unique identifier, followed by columns for each selected variable. A
‘Class’ column, indicating the binary classification assigned to the school (Section 4.3.3), was
added as a final column.
3.3. Data Cleaning
Data retrieved from different sources needed to be cleaned so that all data input to the models
was in a compatible format, whilst losing as little information as possible.
3.3.1. Missing data
The level of data completeness varied between the data sources. The completeness
of each variable was assessed to ascertain which had sufficient data to be useful in the analysis.
This was partly carried out using the makePickColsToUse(..) function within
pickColsToUse.py. Some variables were replicated between different data sources, with
different levels of completeness. In this case, the more complete source was used.
The data stitching technique was a further source of missing data: if a school had a predecessor,
the predecessor URN was not used to find data from when the predecessor was open. The cleaning
process also removed some values that were incorrect, leading to more missing data.
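A completeness check of this kind could be sketched as follows. This is an illustrative sketch only and is not the makePickColsToUse(..) function; the helper names, the column names and the 0.5 cut-off are assumptions.

```python
import numpy as np
import pandas as pd

def completeness(df):
    """Return the fraction of non-missing values in each column."""
    return df.notna().mean()

def pick_more_complete(df, candidates):
    """Of several columns holding the same variable, return the name of the fullest."""
    return completeness(df[candidates]).idxmax()

# Hypothetical data: the same variable from two sources, plus a sparse column.
df = pd.DataFrame({
    "fsm_source_a": [0.12, np.nan, 0.30, np.nan],
    "fsm_source_b": [0.12, 0.25, 0.30, np.nan],
    "budget": [np.nan, np.nan, np.nan, 1.2e6],
})

scores = completeness(df)
# The 0.5 threshold is an arbitrary cut-off chosen for illustration.
usable = scores[scores >= 0.5].index.tolist()
best_fsm = pick_more_complete(df, ["fsm_source_a", "fsm_source_b"])
```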
3.3.2. Data types
All data was required to be in numeric form, either as a float or an integer, to be compatible
with the models. The input data contained multiple types and formats, often with multiple types
within a single column. A selection of helper functions was required to deal with specific
formats such as percentages and currency. Where data was in an inappropriate format, such as
text in a numerical variable, it was replaced with the missing value marker np.nan. This was largely
carried out using the GDIhelper.py and genericDataIn.py scripts.
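A conversion helper of the kind described might look like the sketch below. It is not taken from GDIhelper.py or genericDataIn.py; the function name parse_value is hypothetical, and expressing percentages as proportions is an assumption made for the example.

```python
import numpy as np
import pandas as pd

def parse_value(value):
    """Return a float for a raw cell, or np.nan if it cannot be interpreted."""
    text = str(value).strip()
    is_percent = text.endswith("%")
    # Remove currency symbols, thousands separators and percent signs.
    for symbol in ("£", "$", ",", "%"):
        text = text.replace(symbol, "")
    try:
        number = float(text)
    except ValueError:
        return np.nan            # e.g. free text in a numerical variable
    # Assumption for this sketch: percentages are stored as proportions.
    return number / 100.0 if is_percent else number

# Hypothetical column containing several formats at once.
raw = pd.Series(["45%", "£1,250,000", "0.3", "SUPP", None])
clean = raw.map(parse_value)
```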
3.3.3. Normalising
The data values used range in magnitude from large school budgets to proportions smaller than
one. They also follow different distributions. Some of the models used in this analysis,
such as Support Vector Machines, are not scale invariant, so a feature with larger values would
have a disproportionate impact on the results. Not normalising the data could also lead
to numerical difficulties during the calculation, as kernel values usually depend on the inner
products of feature vectors (Hsu, Chang and Lin, 2008).
In order to allow the models to work effectively, the data in all variables used in the analysis
was normalised using the normalise(..) and normaliseSDcol(..) functions in pickColsToUse.py.
This resulted in each variable having a mean value of 0 and a standard deviation of 1.
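Scaling to zero mean and unit standard deviation can be sketched as below. This is a generic z-score illustration, not the normalise(..) or normaliseSDcol(..) functions; the function name standardise and the example columns are hypothetical, and the use of the population standard deviation is an assumption.

```python
import numpy as np
import pandas as pd

def standardise(df):
    """Scale every column to mean 0 and standard deviation 1, ignoring NaN."""
    # ddof=0 selects the population standard deviation, so the scaled
    # column's population standard deviation is 1.
    return (df - df.mean()) / df.std(ddof=0)

# Hypothetical features on very different scales, with one missing value.
features = pd.DataFrame({
    "budget": [1.2e6, 3.4e6, 0.8e6, np.nan],
    "fsm_proportion": [0.10, 0.35, 0.22, 0.18],
})
scaled = standardise(features)
```

Missing values are skipped when the mean and standard deviation are computed and remain missing afterwards, so this step can run before imputation.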
3.3.4. Imputation
The models used in this analysis required that there were no missing values. As discussed in
section 3.3.1, the input data contained many missing values, with no single variable being
complete. In order to run the models using this data, the missing data had to be imputed.
The imputation method described in Section 2.3.1 was implemented as shown in the extract
from pickColsToUse.py below.
Imputing data will clearly introduce inaccuracy into the dataset because imputed values are
generated by the imputer rather than being measured. It is expected that the imputation would
cause the results to be worse than those obtained if the complete, original data were available.
This risk, however, is accepted due to the necessity of passing complete input data to the
models.
4. Machine Learning Implementation
4.1. Feature Generation
As well as the features immediately available in the input data, a selection of extra features
were created based on the existing features.
4.1.1. Categorical data
Categorical features must be converted to numerical values for use in the models, which was
achieved using fixCategoricalcols(..) in pickColsToUse.py. For some variables, such as
‘Boarding’ and ‘HasBoys’, this was achieved by grouping categories together to give a binary
output. When there were more than two outcomes for a variable, ‘one-hot encoding’ was
implemented using the pandas get_dummies() function. This ensured that all categories were
treated equally by the model, without imposing an artificial ordering on them.
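The two encodings can be illustrated with the short sketch below. It is not the fixCategoricalcols(..) function; the category labels and the ‘Phase’ column are hypothetical.

```python
import pandas as pd

# Hypothetical raw categorical data.
schools = pd.DataFrame({
    "Boarding": ["No boarders", "Boarding school", "Not applicable"],
    "Phase": ["Primary", "Secondary", "All-through"],
})

# Binary variable: group the categories into two outcomes and store as 1/0.
schools["Boarding"] = (schools["Boarding"] == "Boarding school").astype(int)

# More than two outcomes: one-hot encode into one 0/1 column per category.
encoded = pd.get_dummies(schools, columns=["Phase"], dtype=int)
```

Each row has exactly one of the new ‘Phase_...’ columns set to 1, so no category is numerically larger than another.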
4.1.2. Changes over time
Many of the datasets used in this work are published annually, with data available for around
5 to 10 years. This was accounted for by generating variables equal to the difference between
the current value and the corresponding value a specified number of years ago. Where such a
difference was generated, it was added as a separate variable alongside the original.
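A change-over-time feature could be generated as in the following sketch. The function add_change_feature, the column naming scheme and the values are assumptions for illustration only.

```python
import pandas as pd

def add_change_feature(df, variable, current_year, years_ago):
    """Add a column holding the current value minus the value `years_ago` earlier."""
    now = f"{variable}_{current_year}"
    then = f"{variable}_{current_year - years_ago}"
    out = df.copy()
    # The original columns are kept; the difference is an additional variable.
    out[f"{variable}_change_{years_ago}y"] = out[now] - out[then]
    return out

# Hypothetical annual data for two schools.
df = pd.DataFrame({
    "URN": [100001, 100002],
    "supply_spend_2014": [20000.0, 35000.0],
    "supply_spend_2018": [26000.0, 30000.0],
})
df = add_change_feature(df, "supply_spend", current_year=2018, years_ago=4)
```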
4.1.3. Performance data
School examination performance in England is published for pupils sitting examinations at the
end of Key Stage 2 (age 11), Key Stage 4 (age 16) and Key Stage 5 (age 18). Schools are
generally either for pupils aged up to 11 (primary) or for older children (secondary).
Conducting separate primary and secondary analyses was beyond the scope of this project, so
a means of comparing the examination performance of primary and secondary schools was
implemented using the fixPerfCol(..) function in genericDataIn.py as described below.
The Key Stage 2 data selected shows the percentage of children that reach the ‘expected
standard’ in reading, writing and maths. Two main Key Stage 4 metrics were available for
assessing school examination performance: ‘Attainment 8’ and ‘Progress 8’. Attainment 8 is
an integer measure of a pupil’s performance across 8 core subjects and Progress 8 is this same
measure but relative to a pupil’s previous performance (Department for Education (DfE),
2016). Whilst the Progress 8 measure arguably provides a better indication of the school’s
performance, it was not selected because there is not a comparable Key Stage 2 measure which
measures improvement. The Key Stage 4 measure selected shows the average Attainment 8
score per pupil.
For each set of data, the applicable schools were ranked by their performance. The performance
rankings were then converted to percentages, so the median school would receive a score of
50% and the best would receive 100%. The new column added to the main dataset was then
the percentage ranking score for that school, whether it was for Key Stage 2 or Key Stage 4. If
a school had results for both Key Stage 2 and Key Stage 4, a mean was taken of their two
scores.
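The ranking step could be sketched as below. This is not the fixPerfCol(..) function. The formula 100 × (rank − 1) / (n − 1) is an assumption: it is one way to give the median school 50% and the best school 100%, and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

def percent_rank(scores):
    """Map scores to 0-100: worst school 0, median 50, best 100. NaN stays NaN."""
    ranks = scores.rank(method="average")
    n = scores.notna().sum()
    return 100.0 * (ranks - 1) / (n - 1)

# Hypothetical data: some schools have Key Stage 2 results, some Key Stage 4,
# and some have both.
schools = pd.DataFrame({
    "URN": [1, 2, 3, 4, 5, 6],
    "ks2_pct_expected": [55.0, 70.0, 62.0, np.nan, np.nan, 80.0],
    "ks4_attainment8": [np.nan, np.nan, 41.0, 38.0, 52.0, 47.0],
})

ks2 = percent_rank(schools["ks2_pct_expected"])
ks4 = percent_rank(schools["ks4_attainment8"])

# The row-wise mean skips NaN, so single-phase schools keep their one score
# and schools with both sets of results receive the mean of the two.
schools["performance"] = pd.concat([ks2, ks4], axis=1).mean(axis=1)
```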
4.2. Initial Feature Selection
The initial data sources contained over one thousand variables. Many of these were duplicates,
variants of others or not useful. When these were removed, there were still more features than
were necessary for a model with close to 22,000 data points. The initial feature selection was
carried out after consultation with colleagues at the ONS who have worked with these datasets
previously. The strategy used was to keep more features than were expected to be optimal, then
to narrow down the feature space once the initial models had been run.
4.3. Stuck School Labelling
The dataset created was unlabelled: it did not contain the school classification labels that
would be needed for modelling. This section describes the process of adding the labels to the
dataset. Selecting the best definition of the classes was
important to the success of the project. Two different labelling methods were used, both of
which were binary.
4.3.1. Options for labels
The initial basis of the project was to investigate schools classed as ‘Stuck’ by Ofsted (Section
1.2). This group of schools had already received extra focus from the authorities and analysis
had been carried out on them, such as the Ofsted Annual Report (Spielman, 2018). This
definition, however, had some clear flaws.
The first flaw in the existing stuck school definition is that the start date of the time window in
which inspections count towards a school being stuck is fixed at 1st September 2005. As time
passes, this window of time will increase and the effective meaning will change.
A further issue is that a school can only be termed stuck if it has had zero ‘Good or better’
inspections since the starting date. This means that if a school had one ‘Good’ inspection in
September 2005 then it could never be classed as stuck, even if it subsequently received ten
consecutive ‘Inadequate’ (category 4) ratings. Given the expected variation within inspection
results, a single ‘Good’ inspection in 13 years is not unlikely if a school is inspected regularly,
even if it generally receives poor ratings. This would prevent a school from being labelled as
stuck even if it fits the intent of the stuck definition.
4.3.2. Updated definition of Stuck schools
Given the flaws in the existing stuck schools definition, a new definition was agreed based on
input from the Department for Education. This new definition states that, to be stuck, a school
must have:
- At least four previous inspections (including those of predecessors).
- An Overall Effectiveness rating of ‘less than Good’ (category 3 or 4) in each of its four
most recent inspections.
4.3.3. Data selection and labelling
With this new definition of stuck schools, it was decided that the modelling should proceed by
dividing the data into three groups:
- Currently Stuck schools (Class=1): As defined in Section 4.3.2. Examples of ratings of
schools that would fall into this category (from first inspection to most recent):
  - 4,4,3,4,3
  - 2,2,4,4,4,3
- ‘Escaped Stuck’ schools (Class=0): Schools that have had at least 3 consecutive ‘less
than Good’ inspections, with all inspections following this (minimum of one) rated as
‘Good or better’. For example:
  - 4,3,4,2
  - 2,2,3,3,3,2
  - 2,2,4,3,4,4,1,1,2
- All other schools: These schools do not fall into either category and so were removed
from the dataset.
These labels allowed a specific question to be asked: If a school has three consecutive ‘less
than Good’ inspections, can we predict whether its next inspection will also be ‘less than
Good’?
The new class definitions clearly defined the training data labels required and the subset of
schools that the model could be used to predict in future. They also ensured that there was no
overlap between the classes and resulted in a relatively even split between the two classes. A
significant drawback of this approach is that a large proportion of the input data was being
removed and not used in the modelling.
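The three-way split defined in this section can be expressed as a small function. The sketch below is an illustration of the written definitions and is not taken from schoolClass.py; the function name classify is hypothetical. Ratings are listed from first inspection to most recent, as in the examples above.

```python
def classify(ratings):
    """Label an inspection history given oldest-first ratings (1 to 4).

    Returns 1 for 'Stuck', 0 for 'Escaped Stuck' and None for all other schools.
    """
    poor = [r >= 3 for r in ratings]          # True where 'less than Good'

    # Stuck: at least four inspections, the four most recent all 'less than Good'.
    if len(ratings) >= 4 and all(poor[-4:]):
        return 1

    # Escaped: count the trailing run of 'Good or better' inspections ...
    i = len(ratings)
    while i > 0 and not poor[i - 1]:
        i -= 1
    trailing_good = len(ratings) - i

    # ... then the run of 'less than Good' inspections immediately before it.
    run = 0
    while i - run > 0 and poor[i - run - 1]:
        run += 1

    if trailing_good >= 1 and run >= 3:
        return 0
    return None

# Examples from the text.
stuck_examples = [[4, 4, 3, 4, 3], [2, 2, 4, 4, 4, 3]]
escaped_examples = [[4, 3, 4, 2], [2, 2, 3, 3, 3, 2], [2, 2, 4, 3, 4, 4, 1, 1, 2]]
labels = [classify(r) for r in stuck_examples + escaped_examples]
```

Because a stuck school must end on a ‘less than Good’ inspection and an escaped school must end on a ‘Good or better’ inspection, the two classes cannot overlap.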
4.3.4. Labelling method
The Stuck school definitions of Sections 1.2 and 4.3.2 needed to be applied to the dataset. The
selected definition could then be appended onto the dataset as a binary ‘Class’ column which
could be used as the data labels for training and testing the models.
The list of schools which fit into either definition of stuck was not available, so the
schoolClass.py script was written to identify which schools met the different criteria. This was
achieved by first defining an Inspection class in Python and generating an instance for each of
the ~100,000 inspections in the Ofsted data with the attributes listed in Section 3.1.1. Next, a
School class was defined and an instance generated for each school (open or closed) whose
URN matched the URN attribute of an Inspection, populating the School instance with
information from the dataset.
In order to incorporate inspections from predecessor schools, the predecessor URNs were first
identified and added to the relevant School instance using addPredecessorURNsFromDF(..).
For each School, the addPredecessorInspections(..) function was called recursively to work
through the predecessor schools identified and read in all of their Inspection instances. This
ensured that, even if a school’s predecessor itself had a predecessor, then the inspections would
be counted correctly. This then made it easy to work out which schools were stuck (old
definition) using calcStuck(..).
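The recursive collection of predecessor inspections could be sketched as follows. This is a simplified illustration, not the School class or the addPredecessorInspections(..) function from schoolClass.py; the attribute names, the all_inspections helper and the guard against circular links are assumptions.

```python
class School:
    """Simplified stand-in for the School class described above."""

    def __init__(self, urn):
        self.urn = urn
        self.predecessor_urns = []   # URNs of direct predecessors
        self.inspections = []        # (year, rating) tuples for this URN only

def all_inspections(school, schools_by_urn, seen=None):
    """Collect a school's inspections plus those of every predecessor, recursively."""
    seen = set() if seen is None else seen
    if school.urn in seen:           # guard against circular predecessor links
        return []
    seen.add(school.urn)
    collected = list(school.inspections)
    for urn in school.predecessor_urns:
        predecessor = schools_by_urn.get(urn)
        if predecessor is not None:
            collected += all_inspections(predecessor, schools_by_urn, seen)
    return sorted(collected)         # tuples sort by year first

# Hypothetical chain: school 1 became school 2, which became school 3.
grandparent = School(1)
grandparent.inspections = [(2006, 4)]
parent = School(2)
parent.predecessor_urns = [1]
parent.inspections = [(2010, 3), (2013, 4)]
current = School(3)
current.predecessor_urns = [2]
current.inspections = [(2017, 3)]

schools_by_urn = {s.urn: s for s in (grandparent, parent, current)}
history = all_inspections(current, schools_by_urn)
ratings = [rating for _, rating in history]
```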
Further functions could then identify which schools were open, sort the inspections by year and
carry out further analysis, such as plotting the total inspection history of each open school as
shown in the following section (Figure 3).
4.4. Data Summary
A summary of the number of schools in each category is shown in Table 1.
Table 1: Numbers of each category of school
All inspected schools   All open schools   Class=1 ‘Stuck’   Class=0 ‘Escaped’
32131                   21943              805               908
The overall inspection histories of all schools are shown in Figure 3. The figure shows, for
example, that approximately 3,500 schools have an inspection history of 3 ‘Good or better’
inspections and 0 ‘less than Good’ inspections.
Figure 3: Frequency of each overall inspection history
A series of histograms are plotted in Figure 4 from classComparisonForEachCol.py. Each plot
shows three overlaid histograms which represent the distribution of values between the classes,
with the orange 'Escaped' bars being partially transparent to see the blue 'Stuck' bars behind.
Variable descriptions are provided in Section 9.1. The histogram y-axes are normalised to show
the relative frequencies of each class as opposed to absolute values. They are plotted before the
dataset was normalised so that the x-axis is on the original scale of the variable. For binary
values, 1 represents True and 0 represents False.
Some basic, generalised observations from the plots are as follows:
- Both stuck and escaped schools have much greater rates of pupils eligible for free
school meals, low achievement in key stage 1 and English as an additional language
than the average English state-funded school.
- A much higher proportion of stuck schools than of escaped schools are secondary, and
both proportions are above the overall average. The same applies for academies.
- Stuck schools tend to have generally worse exam performance than escaped schools,
with both classes having generally inferior performance to all other schools.
- Stuck schools tend to have a lower Pupil:Teacher ratio than escaped schools.
- Stuck schools tend to have a greater pupil absence rate than escaped schools. Both
groups have frequencies of absence of 0.05% or greater that are considerably higher
than average.
- Stuck schools tend to have more pupils than escaped schools.
- Stuck schools tend to have had a greater increase in supply staff spend since 4 years
ago than escaped schools.
- There is not a clear link between teacher pay and a school being stuck or escaped.
Figure 4: Histograms showing distribution between 'Stuck' schools, 'Escaped' schools and schools that fall into neither category
4.5. Principal Component Analysis
A principal component analysis was run with all input features using
sklearn.decomposition.PCA from Scikit-learn, and the results are shown in Figure 5. The plot
shows that 25 principal components are required to capture 80% of the variance and 39 principal
components are required to capture 90% of the variance.
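The number of components needed for a given share of the variance can be read from the cumulative explained variance ratio, as in the sketch below. The data is synthetic and the helper components_needed is hypothetical; the figures of 25 and 39 components quoted above come from the school dataset and are not reproduced by this example.

```python
import numpy as np
from sklearn.decomposition import PCA

def components_needed(X, target):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches `target` (a value below 1, for example 0.8)."""
    pca = PCA().fit(X)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted returns the first index where cumulative >= target.
    return int(np.searchsorted(cumulative, target) + 1)

# Synthetic stand-in data: 12 correlated features built from 4 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))
X = latent @ rng.normal(size=(4, 12)) + 0.3 * rng.normal(size=(300, 12))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # normalise before PCA

n_80 = components_needed(X, 0.80)
n_90 = components_needed(X, 0.90)
```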
Figure 5: Principal Component Analysis - Cumulative explained variance versus number of principal components used
Stuck, escaped and all other schools are plotted on Principal Component 1 vs Principal
Component 2 axes in Figure 6 using plotClassesOnPCA(..) in PCAallFeatures.py. Note that
points are plotted in the order shown, with the grey points ‘beneath’ the other points if they are
coincident, and the orange points for escaped schools are on top.
The scatter plot shows that the classes are heavily intermingled in terms of principal
components, suggesting that building a classifier to separate them could be challenging. The
first two principal components do, however, only account for 18% of the variance in the data
so there is a large quantity of information that is not contained in the plot.
Figure 6: The results of Principal Component Analysis - the two classes and the ‘other’ remaining schools that fall in neither class are plotted on Principal Component 1 vs Principal Component 2 axes
5. Results
During this work, a series of experiments and sub-experiments were undertaken, with the
overall aim of finding the optimum predictive model for this dataset. The model components
investigated in this chapter are all interlinked. For example, the best hyperparameters for a
Support Vector Machine model will depend on the features input to the model whilst the best
choice of features for a Support Vector Machine model will depend on the hyperparameters
set. The approach taken was therefore iterative.
5.1. Assessment Methods Used
The number of cross validation folds (Section 2.4.1), 𝑘, used in this project is 5. This provides
a balance between providing enough folds to smooth out the results sufficiently and the
processing time required to effectively run each model five times. All results in this report have
used 5-fold cross validation, where the result provided is the average of the individual scores
of the 5 folds.
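Averaging scores over 5 folds can be sketched with Scikit-learn as below. The data is synthetic and the choice of a K-Nearest Neighbours classifier is only an example. Note that cross_validate uses stratified folds for classifiers by default, which is an assumption here and may differ from the splitting used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the school data, with roughly a 47:53 class split.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.53, 0.47], random_state=0)

# cv=5 fits and scores the model five times, once per held-out fold.
results = cross_validate(KNeighborsClassifier(n_neighbors=41, p=1), X, y,
                         cv=5, scoring=["accuracy", "roc_auc"])

# The reported result is the mean of the five per-fold scores.
mean_accuracy = results["test_accuracy"].mean()
mean_auc = results["test_roc_auc"].mean()
```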
Many different metrics were considered when assessing the performance of a single model
(Section 2.4). The simplest measure, accuracy, was used heavily as this is the basic requirement
of a classifier. It was noted that the split of the data between the two classes was 47:53, so
reported accuracy values would not be misleading as a result of class imbalance.
When accuracy values were not at a sufficiently high level, the area under the ROC curve was
considered more strongly. This is because, for models which allow the decision threshold to be
tuned, such as Random Forests and Logistic Regression, the operating point of the model could
be changed. For a model with low accuracy, the class boundary decision threshold could be
increased. This would reduce the number of schools predicted to be in the positive class (and
therefore reduce recall), but precision would be expected to increase as a higher proportion of
these positive predictions would be correct. It was noted that there is a clear correlation
between model accuracy and area under the ROC curve, so choosing between the two measures had a
relatively small impact on the results.
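The effect of moving the decision threshold can be seen in a small worked example. The labels and predicted probabilities below are invented for illustration and do not come from the models in this work.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical true labels and predicted probabilities of class 1 ('Stuck').
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 1, 0, 0])
p_stuck = np.array([0.95, 0.85, 0.70, 0.55, 0.60, 0.45, 0.30, 0.40, 0.20, 0.10])

def precision_recall_at(threshold):
    """Precision and recall for class 1 when predicting 1 at or above `threshold`."""
    y_pred = (p_stuck >= threshold).astype(int)
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)

p_default, r_default = precision_recall_at(0.5)   # five schools predicted stuck
p_strict, r_strict = precision_recall_at(0.8)     # two schools predicted stuck
```

In this example the stricter threshold predicts fewer schools to be stuck, so recall falls while precision rises. Precision is not guaranteed to rise for every dataset and threshold.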
For some of the measures of performance, the results could be misleading. For example, in the
case with the original definition of stuck schools, a model would achieve over 98% accuracy
by simply predicting every school not to be stuck. In the following results, therefore, minimum
values for a selection of measures are set (Table 2) to reduce the chances of an unhelpful
classifier being selected.
Table 2: Minimum values of each metric to qualify.
Accuracy   AUC   F1     F0     Recall 1   Recall 0   Precision 1
0.6        0.6   0.25   0.25   0.1        0.1        0.1
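Applying the minimum values of Table 2 before choosing a model can be sketched as below. The run names and scores are invented for illustration; only the minimum values are taken from Table 2. The first row mimics a classifier that predicts every school to be in class 0 on a heavily imbalanced dataset.

```python
import pandas as pd

# Minimum values from Table 2, keyed by the table's own column headings.
MINIMUMS = {"Accuracy": 0.6, "AUC": 0.6, "F1": 0.25, "F0": 0.25,
            "Recall 1": 0.1, "Recall 0": 0.1, "Precision 1": 0.1}

def qualifying(results):
    """Keep only the runs that meet every minimum value."""
    meets_all = (results[list(MINIMUMS)] >= pd.Series(MINIMUMS)).all(axis=1)
    return results[meets_all]

# Hypothetical results for three runs.
results = pd.DataFrame([
    {"run": "always_class_0", "Accuracy": 0.98, "AUC": 0.50, "F1": 0.00,
     "F0": 0.99, "Recall 1": 0.00, "Recall 0": 1.00, "Precision 1": 0.00},
    {"run": "candidate_a", "Accuracy": 0.70, "AUC": 0.76, "F1": 0.66,
     "F0": 0.73, "Recall 1": 0.62, "Recall 0": 0.77, "Precision 1": 0.71},
    {"run": "candidate_b", "Accuracy": 0.68, "AUC": 0.72, "F1": 0.64,
     "F0": 0.71, "Recall 1": 0.60, "Recall 0": 0.75, "Precision 1": 0.69},
]).set_index("run")

# Select the qualifying run with the highest area under the ROC curve.
best_run = qualifying(results)["AUC"].idxmax()
```

The trivial classifier has the highest accuracy of the three but is excluded because its recall and precision for class 1 are zero.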
5.2. Computing Set up for Modelling
The models were set up and run using genericModelClass.py. First, an instance of ModelData
was generated for a run. This contained the data, carried out the train/test split and ran
oversampling and/or recursive feature elimination if specified. Next, an instance of the
specified model type, for example RandomForest, was generated. This inherits from the Model
parent class which has numerous get..(self) and set..(self) methods as well as plotROC(self).
Each instance of a child class of Model is initialised with the ‘dataName’, a ModelData instance
and a dictionary of run parameters.
ModelData instances were stored in modelDataDict and Model instances in modelDict. The
naming of the ModelData and Model instances with generated, descriptive names ensured that
each ModelData instance was only generated once and could be reused, avoiding wasting
computing resource. The fitModel(self, [params]) method for each Model instance was only
called once it had been checked that it had not previously been run.
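The reuse of named instances can be illustrated with a dictionary used as a cache. This sketch is not the code in genericModelClass.py; the name format, the helper functions and the stand-in build function are all hypothetical.

```python
model_data_dict = {}     # generated name -> data object
build_calls = []         # records how many times data was actually built

def data_name(n_rfe_features, oversample):
    """Build a descriptive, reproducible name from the data options."""
    return f"RFE{n_rfe_features}_OS{oversample}"

def build_model_data(n_rfe_features, oversample):
    """Stand-in for the expensive step (train/test split, oversampling, RFE)."""
    build_calls.append((n_rfe_features, oversample))
    return {"rfe": n_rfe_features, "oversample": oversample}

def get_model_data(n_rfe_features, oversample):
    """Return the stored data for these options, building it only once."""
    name = data_name(n_rfe_features, oversample)
    if name not in model_data_dict:
        model_data_dict[name] = build_model_data(n_rfe_features, oversample)
    return model_data_dict[name]

first = get_model_data(5, True)
second = get_model_data(5, True)    # same options: served from the dictionary
other = get_model_data(None, False)
```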
The runAGroup(..) function was set up to work through each combination of the data inputs
(RFE options and whether or not to use oversampling) for each model type and pass inputs to
runsForModels(..). In the extract of runAGroup(..) below, the function runs the post-processing
and clears modelDict if it has more than 400 entries, to save memory.
Every combination of the parameter dictionary of lists was generated using itertools in the two
lines of code indicated below (Rees, 2017) in runsForModels(..), and a random subset (using
random.sample) of the selected size was passed sequentially to oneFullRun(..).
The oneFullRun(..) function then generates the data name, the train/test sets for cross
validation, instances of ModelData and Model and fits the model to the data. The regular
expression re module was used here.
Once the runs were completed, the individual .csv files generated were recombined into one
file in analyseAvgCSVs.py. Columns for the different parameters and run settings (all identified
from the run names using regular expressions) were added to allow analysis. The plots were
then generated using plotBestModel.py.
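Recovering run settings from a run name with a regular expression could look like the sketch below. The run name format shown is hypothetical, as the actual format used in this work is not given here, and parse_run_name is not a function from analyseAvgCSVs.py.

```python
import re

# Hypothetical run name format: <model>_RFE<n or None>_OS<True or False>
RUN_NAME = re.compile(
    r"^(?P<model>[A-Za-z]+)_RFE(?P<rfe>\d+|None)_OS(?P<oversample>True|False)$"
)

def parse_run_name(name):
    """Split a run name into the settings that would become .csv columns."""
    match = RUN_NAME.match(name)
    if match is None:
        raise ValueError(f"Unrecognised run name: {name}")
    fields = match.groupdict()
    return {
        "model": fields["model"],
        "rfe": None if fields["rfe"] == "None" else int(fields["rfe"]),
        "oversample": fields["oversample"] == "True",
    }

settings = parse_run_name("SVM_RFE5_OSTrue")
```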
5.3. Experiment 1: Finding the optimum machine learning classification model
The optimum machine learning classification model depended on many aspects, each of
which was investigated in the following sub-experiments.
5.3.1. Experiment 1.1: Finding the best model type
There is a range of well-known models available for binary classification, of which six were
selected for consideration at the outset (Section 2.2). In order to compare the effectiveness of
the models, each model type was first optimised as described below and illustrated in Figure
7.
An initial hyperparameter search was carried out for each model type, using an initial
selection of approximately half of the available variables, chosen because they were expected
to be of high importance to the model. The set of hyperparameters for each model that resulted
in the highest average accuracy score was selected. These six models were then run through
Sequential Forward and Backward Selection. For each of the six models, the run (forwards or
backwards) that led to the highest accuracy score was used. A further hyperparameter search was
then carried out for each model. This used the set of features that had resulted in the highest
accuracy from Sequential Forward or Backward Selection for that model and searched a narrower
hyperparameter search space based on the results of the initial hyperparameter search.
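Sequential Forward Selection can be sketched as a greedy loop that adds one feature at a time. The version below is a generic illustration on synthetic data, scored by mean cross-validated accuracy; it is not the implementation used in this work, and the function name forward_selection is hypothetical. Backward selection follows the same pattern, starting from all features and removing one at a time.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(model, X, y, cv=5):
    """Greedy forward selection; returns the best-scoring subset seen and its score."""
    remaining = list(range(X.shape[1]))
    selected, best_subset, best_score = [], [], -np.inf
    while remaining:
        # Score each remaining feature when added to the current selection.
        scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=cv,
                                     scoring="accuracy").mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        selected.append(f_best)
        remaining.remove(f_best)
        # Remember the subset with the highest accuracy over all steps.
        if scores[f_best] > best_score:
            best_subset, best_score = list(selected), scores[f_best]
    return best_subset, best_score

# Synthetic stand-in data with 6 features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)
best_subset, best_score = forward_selection(LogisticRegression(max_iter=1000), X, y)
```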
Figure 7: Process for selecting optimum model for each model type
The performance of the six optimised models was then compared across different measures to
ascertain which was the most effective.
5.3.1.1. Results
The best performing model for each of the six model types was selected. The models selected
were the models with the highest area under the ROC curve, subject to minimum threshold
scores as specified in Table 2.
Figure 8 shows that the model with the highest area under the ROC curve was the K-Nearest
Neighbours model, which also had the highest accuracy. Neural Networks, Support Vector
Machines and Random Forests all have a similar accuracy whilst the Logistic Regression model
has a higher recall for class 1 and a lower recall for class 0. The K-Nearest Neighbours model
has the highest precision for class 1, whilst the Neural Network and Logistic Regression have
the highest precision for class 0. Gaussian Naïve Bayes has the lowest score for five of the six
measures.
[Figure 7 flowchart, steps in order:
1. Initial hyperparameter search: initial feature set, wide hyperparameter range.
2. Sequential Forward/Backward selection: all features, best hyperparameters from initial search.
3. Refined hyperparameter search: feature set selected from SFS or SBS, narrower range of hyperparameters.
4. Sequential Forward/Backward selection: all features, best hyperparameters from refined search.
5. Final hyperparameter search: feature set selected from second round of SFS or SBS, narrower range of hyperparameters.
6. Model used for analysis: final model features and hyperparameters.]
Figure 8: Performance of the best version of each model type. Note that different measures of the same six models are plotted, sorted in order of decreasing area under the ROC curve score.
The parameters used in each model type are shown in Table 3.
Table 3: Characteristics of the best model for each model type, measured by area under the ROC curve and subject to the minimum constraints of Table 2
Model type               RFE / oversampling                                   Features          Hyperparameters
K-Nearest Neighbours     No RFE or oversampling                               28                K = 41; P = 1
Neural Network           No RFE or oversampling                               19                5 layers; 12 nodes per layer; Adam solver; Alpha = 0.0667
Support Vector Machine   No RFE or oversampling                               7                 C = 1.455; RBF kernel; Gamma = 0.244
Random Forest            No RFE or oversampling                               8                 12 estimators; 6 features; Entropy criterion; Bootstrapping used
Logistic Regression      Oversampling used                                    9                 (none listed)
Gaussian Naïve Bayes     Oversampling and RFE with 5 best features selected   10 (before RFE)   (none listed)
The ROC curves for the six models in Table 3 are plotted in Figure 9. The K-Nearest
Neighbours model’s ROC curve is above the other models’ curves for almost all values of
threshold. All curves exhibit a similar shape, although the Logistic Regression and Gaussian
Naïve Bayes models have a much greater false positive rate than the other models for lower
values of true positive rate.
Figure 9: ROC curve for model with highest area under the ROC curve value for each model type. The lines plotted are the mean scores over the 5 folds of cross validation.
For each model in Table 3, separate ROC curves are plotted in Figure 10. The mean values
correspond to those plotted in Figure 9. The K-Nearest Neighbours model has the greatest range
in performance between folds, with a wider band of one standard deviation from the mean. The
K-Nearest Neighbours, Neural Network, Random Forest and Support Vector Machine models
are all relatively consistent up to a true positive rate of around 0.4, although there is a wider
spread for the Neural Network. The curves for K-Nearest Neighbours are smoother because
there are fewer (k) threshold values from which to plot the curve.
Figure 10: ROC curves for models in Table 3. Each of the five cross validation folds is plotted, along with a mean and the range of one standard deviation
An alternative measure of the best classifier is to select the classifier, subject to the minimum
thresholds in Table 2, with the best precision for data points in class 1: Stuck schools. The
results in Figure 11 show that the Support Vector Machine model has a significantly higher
precision than the other models. It has the lowest recall for class 1 of any of the models, with
a high recall for class 0. It is therefore predicting that a much higher proportion of schools will
be in class 0 than in class 1.
Figure 11: Precision for Stuck class - The best model from each model type in terms of precision for identifying stuck schools. The same six models are plotted in each chart, sorted in order of precision score.
5.3.1.2. Conclusion
The performance of the best model of five of the six model types is similar, with Gaussian
Naïve Bayes having an appreciably lower performance. K-Nearest Neighbours produces the
model with the highest area under the ROC curve and the best accuracy. If a high precision for
predicting stuck schools is required, then the Support Vector Machine model plotted in Figure
11 is best. Gaussian Naïve Bayes is again the poorest performer by this measure.
The ROC curve of the K-Nearest Neighbours model in Figure 9 shows that it is the best
performing classifier because its ROC curve is above the other curves for almost all threshold
values. For K-Nearest Neighbours, the Neural Network, Support Vector Machine and the Random
Forest, there is little to choose between the classifiers if the thresholds are set to achieve a
true positive rate of 0.4.
If the alternative measure of precision for stuck schools is used, a Support Vector Machine
model is superior to the other model types. It only identifies 40% of the schools that will
become stuck but around 88% of those that it selects will become stuck. If a smaller, higher
confidence selection of schools is to be identified as likely to become stuck then this model is
superior to the others.
5.3.2. Experiment 1.2: Finding the optimum model hyperparameters
Many of the machine learning models available have parameters that can be
adjusted to fine-tune the working of the model. When running these models, it is important to
consider that the default settings may not provide the best results for the input data supplied.
The parameters for consideration in this experiment, selected based on their perceived
likelihood of improving the model and availability in the Scikit-learn model implementation,
are described for each model in Section 2.2. Logistic Regression and Gaussian Naïve Bayes
did not have any tuneable parameters so do not feature in this experiment.
In order to find the optimum parameters for each model type, a series of parameter searches
were conducted, as described in the following paragraphs. Figure 7 shows how these steps
fitted into the overall process used.
Round 1
For categorical parameters, each value of the parameter was added to the parameter search
space. For each numeric parameter, an initial maximum and minimum value were selected
based on judgement from past experience and literature. This range was made wide enough
that the optimum value was considered highly likely to fall in the range. Between these two
values were then added a series of intermediate values on either a linear or logarithmic scale,
depending on the values involved.
For each model type, all possible combinations of the run parameter values were generated.
Due to the finite computer processing time available, not all combinations could be
investigated. As a result, for each run, the number of parameter combinations to investigate,
e.g. 50, was input. The program would then run the model with 50 different randomly selected
combinations of the parameters. This effectively implemented a subset of the runs available
through a full grid search. It was decided that this would allow a wider investigation of the
parameter space, giving a higher probability of finding optimal parameters than a full grid
search through a narrower search space. It did, however, lead to the possibility of
missing the optimum combination of parameters from within the ranges selected.
For each combination of parameter values selected, the model was run and its cross validation
results were recorded.
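The Round 1 procedure of building a grid and running a random subset of it could be sketched as follows. The parameter ranges are hypothetical examples for a Support Vector Machine and are not the ranges used in this work; the fixed random seed is included only to make the example repeatable.

```python
import itertools
import random

import numpy as np

# Hypothetical search space.
search_space = {
    "kernel": ["rbf", "poly"],               # categorical: every value included
    "C": np.logspace(-2, 3, 6).tolist(),     # intermediate values on a log scale
    "gamma": np.logspace(-4, 1, 6).tolist(),
}

names = list(search_space)
# Every combination of the parameter values (the full grid).
grid = [dict(zip(names, values))
        for values in itertools.product(*(search_space[n] for n in names))]

# Run only a random subset of the grid, sampled without replacement.
random.seed(0)
n_runs = 50
chosen = random.sample(grid, min(n_runs, len(grid)))
```

Each dictionary in chosen would then be passed to one model run, with its cross validation results recorded.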
Rounds 2 and 3
For each parameter, the accuracy and AUC scores for the most effective models from the
previous round were plotted against the parameter values. From inspection of these plots, the
range of values which was most likely to contain the optimum was identified. The parameter
values in the search space were then adjusted, reducing the range and increasing the resolution
within the range. The parameter search was then run again on the narrower search space.
5.3.2.1. Results
The results of the final hyperparameter search for each of the four models with tuneable
parameters are shown in the following figures. They are not subject to the minimum threshold
values specified in Table 2. Recursive Feature Elimination and Oversampling are not used in
this section. The final selected parameters for each model type are shown in Table 3.
Figure 12: Variation of AUC and Accuracy with different hyperparameters for Support Vector Machines using features listed in Section 9.2
The Support Vector Machine results in Figure 12 show that accuracy and AUC appear to peak
at around a C value of 10. The gamma value has been heavily investigated at values less than
0.1 but there appears to be a peak in AUC between 0.1 and 1. This is not reflected in the values
of accuracy, where the peak is between 0.01 and 0.1. The higher AUC values with gamma
greater than 0.1 largely coincide with C values between 1 and 10. The radial basis function
kernel outperformed the polynomial kernels of degree 1, 2 and 3. Of the polynomial kernels, the
best performance was with a polynomial kernel of degree 2. This was investigated further, as
shown in Table 4. Higher order polynomials were trained and tested in the same way as the
other models, and using the same C and gamma values as in the Support Vector Machine in
Table 3. High order polynomial kernels were not used throughout the experiment due to the
lack of sufficient available computing power.
Table 4: Difference in training and test set accuracy when the kernel is changed. Each run is otherwise identical, using the parameters shown in Table 3
Kernel                 Training accuracy   Test accuracy
RBF, Gamma=0.2442      85%                 72%
Polynomial, Degree 1   74%                 72%
Polynomial, Degree 2   79%                 72%
Polynomial, Degree 3   86%                 71%
Polynomial, Degree 4   87%                 70%
Polynomial, Degree 5   87%                 70%
Polynomial, Degree 6   87%                 69%
Polynomial, Degree 7   86%                 68%
Polynomial, Degree 8   85%                 66%
Figure 13: Variation of AUC and Accuracy with different hyperparameters for Neural Networks using features listed in Section 9.2
The Neural Network results in Figure 13 show that the Adam solver scored best on both
measures. There is little variation in the peak results when the number of layers and nodes per
layer are varied, with a clear decrease in accuracy and AUC when there are fewer than 3 nodes
per layer. Varying alpha over a large logarithmic scale appears to have a negligible effect on
the model performance.
Figure 14: Variation of AUC and Accuracy with different hyperparameters for Random Forests using features listed in Section 9.2
The accuracy and AUC of a Random Forest model are shown to be generally higher when
using entropy as the scoring criterion. The best scores do not appear to be attained when large
numbers of estimators are used, with higher scores achieved when fewer than 250 estimators
were used. It is noted, however, that the highest scores do coincide with the most densely tested
area of the parameter search space. A peak in both AUC and accuracy appears to exist when
the maximum number of features to consider is set between 5 and 10. The use of bootstrapping
tends to result in higher scores of AUC and accuracy, but the small number of highest AUC
scores occurred when bootstrapping was not used.
Figure 15: Variation of AUC and Accuracy with different hyperparameters for K-Nearest Neighbours using features listed in Section 9.2
The value of k used in K-Nearest Neighbours (Figure 15) has a clear influence on both AUC
and accuracy, with peaks in the range from 25 to 45. The best scores are achieved with a p
parameter value of 1.
5.3.2.2. Conclusion
For the K-Nearest Neighbours model, the two parameters trained both have a clear impact on
model performance. The model works best when it takes the most common class label amongst
the nearest 25-45 points. For values of k below 4, there is a noticeable drop in performance,
to a level well below that obtained when k is 50. This suggests that there is noise in the data, as just taking
the single nearest neighbour is much worse than averaging over a larger number. Increasing
the value of k smooths the decision boundary. The best value of p to use from the results is 1.
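The effect of k on a noisy data set can be seen in a toy example. The sketch below is a plain-Python K-Nearest Neighbours classifier using the Minkowski distance with parameter p (p=1 is the Manhattan distance); the data points are invented for illustration and are not taken from the project data.

```python
from collections import Counter


def minkowski(a, b, p):
    """Minkowski distance between two points; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_X, train_y, query, k, p=1):
    """Return the most common class label amongst the k nearest training points."""
    order = sorted(range(len(train_X)), key=lambda i: minkowski(train_X[i], query, p))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data: one class-1 point sits inside the class-0 cluster (a noisy point).
train_X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
           (1.0, 1.0), (1.1, 1.0), (1.0, 1.1)]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]
query = (0.06, 0.06)

label_k1 = knn_predict(train_X, train_y, query, k=1)  # follows the single noisy neighbour
label_k5 = knn_predict(train_X, train_y, query, k=5)  # majority vote smooths the noise
```

With k=1 the prediction is determined by the isolated class-1 point, whereas with k=5 the surrounding class-0 points outvote it.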
The Adam solver proved to be the most effective for the Neural Network models, which was
expected due to its widespread popularity. There appears to be little variation in performance
when the number of layers is varied between 2 and 7, and the number of units per layer varied
between 3 and 19. If this behaviour is representative of the reality, a smaller network would be
preferred to reduce processing times and complexity. Alpha does not appear to affect
performance in the range tested.
The Support Vector Machine optimum model has a C value of just greater than 1. This implies
that a balance between fitting the data points accurately, but without overfitting, is required as
the penalty for misclassifying a point is not too great. The kernel that gave the best results was
the radial basis function kernel. This agrees with the expectation that it is a useful default
kernel. It is perhaps more surprising that the polynomial kernel of degree 2 performs better
than the kernel of degree 3. This indicates that the best decision boundary found in these
experiments is not overly complex and is discussed further.
Given the highly interspersed nature of the data (Figure 6), the most complex kernel function
was expected to have the most success. This result, and the accuracy rates of around 70%,
suggests that the best Support Vector Machine investigated does not have a highly complex
boundary (in the original, non-transformed feature space) and simply finds areas of higher
density of each class and separates them. It also suggests that, if high levels of complexity do
not lead to better results, there is a tendency for the model not to generalise well, i.e. to
overfit the training data.
On further investigation, Table 4 shows that increasing model complexity does result in an
increased accuracy score on the training set. This suggests that, to achieve higher training
scores, a more complex model would be required than those that have been used in this
experiment. Higher order polynomials were found to increase the training accuracy slightly up
to a maximum of 87% before diminishing. The table shows, however, that increasing the degree
of the polynomial decreases the test set accuracy, meaning that the model is becoming more
overfitted to the training data. For the same accuracy on the training data, the radial basis
function kernel performs better on the test data.
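The overfitting trend can be read directly from Table 4 by computing the gap between training and test accuracy for each kernel. The short sketch below uses the percentages from Table 4; the variable names are chosen for illustration.

```python
# (kernel, training accuracy %, test accuracy %) as reported in Table 4.
table4 = [
    ("RBF", 85, 72),
    ("Polynomial degree 1", 74, 72),
    ("Polynomial degree 2", 79, 72),
    ("Polynomial degree 3", 86, 71),
    ("Polynomial degree 4", 87, 70),
    ("Polynomial degree 5", 87, 70),
    ("Polynomial degree 6", 87, 69),
    ("Polynomial degree 7", 86, 68),
    ("Polynomial degree 8", 85, 66),
]

# A larger gap between training and test accuracy indicates stronger overfitting.
gaps = {name: train - test for name, train, test in table4}

# Test accuracy of the polynomial kernels, in order of increasing degree.
poly_test = [test for name, _, test in table4 if name.startswith("Polynomial")]

for name, gap in gaps.items():
    print(f"{name:<22} gap = {gap} percentage points")
```

The gap for the polynomial kernels grows from 2 percentage points at degree 1 to 19 at degree 8, while the RBF kernel has a gap of 13 at the same training accuracy as the degree 8 kernel.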
For Random Forests, the more estimators used, the better the model is expected to perform,
with the performance reaching an asymptotic limit. It is therefore required to find the smallest
number which gives an appropriate performance level. Small numbers of estimators are not
shown because they were ruled out in earlier rounds of the hyperparameter search as being
ineffective. The results show that the model performance varies little with the number of
estimators, so using large numbers of estimators appears to be a waste of computing power.
The highest scores, however, are for models with fewer than 250 estimators, which defies
expectations. This can be explained by the random nature of the Random Forest.
The highest scores are located where the parameter has been tested the most. Random Forests
use randomness in two ways to build the trees (Section 2.2.2). Each Random Forest, even if
generated using exactly the same hyperparameters, will therefore likely generate different
results. By repeatedly testing in a confined search space, it is to be expected that a greater range
of outcomes is achieved. As this experiment is concentrating on the highest scoring models,
this distorts the results, as results which are unusually high are reported as the optimal solutions.
If the same density of testing was carried out over the range from 250 to 2000 estimators, it is
expected that results similar to (or slightly better than) those for below 250 estimators would
be attained.
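The distortion described above, in which the best observed score rises simply because more runs were carried out, can be demonstrated with a small simulation. The underlying score of 0.72 and the noise level of 0.01 are assumptions chosen for illustration only; they are not measurements from this project.

```python
import random


def best_of_n(n_runs, rng, true_score=0.72, noise_sd=0.01):
    """Best score observed over n_runs noisy evaluations of the same model."""
    return max(rng.gauss(true_score, noise_sd) for _ in range(n_runs))


rng = random.Random(42)
n_repeats = 2000

# Sparsely tested region: 3 runs per repeat. Densely tested region: 30 runs.
avg_best_sparse = sum(best_of_n(3, rng) for _ in range(n_repeats)) / n_repeats
avg_best_dense = sum(best_of_n(30, rng) for _ in range(n_repeats)) / n_repeats

print(f"average best score over 3 runs:  {avg_best_sparse:.4f}")
print(f"average best score over 30 runs: {avg_best_dense:.4f}")
```

Both regions have the same underlying performance in this simulation, yet the densely tested region reports the higher best score, which is the effect suspected for models with fewer than 250 estimators.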
The maximum number of features to consider peaks between 5 and 10, so a value is selected
from that range. Bootstrapping and the decision criterion selected do not appear to make too
much difference to the results.
Overall, there can be reasonable confidence that, for the parameters that were available, the
models are near their optimum. A more thorough search could be completed with more
computing power, which would have three effects. The first is that the parameter values could
be further refined, for small potential gains in performance. This could be achieved by using a
finer and finer resolution grid search, reducing the range of values tested each time. This would
also be helped by completing the full searches as opposed to random subsets.
More computing power could also be used to investigate the random nature of the results more
thoroughly, to characterise the noise and identify which results are due to noise and which are
due to the effect itself. This could most easily be done by simply repeating runs multiple times.
The final way these hyperparameters could be optimised if more computing power were
available would be to investigate a wider range of parameters. Many parameters could not be
investigated as thoroughly as desired because of the computing power required. For example,
neither high order polynomial kernels nor high C values for Support Vector Machines could
be used as they took too long to compute. Deeper Neural Networks with more combinations of
hidden layers could also be investigated.
5.4. Experiment 2: Finding the optimum features to input to the model
The second aim of the project was to identify which features had the greatest importance in
predicting the future performance of a poorly performing school. For some of the models used,
such as Logistic Regression, the model can easily generate a measure of the importance of each
feature in the model. For other models, such as Support Vector Machines and Neural Networks,
there is no direct method of determining the importance of an individual feature to a model. As
shown in Section 5.3.1, Support Vector Machines and Neural Networks have proven to be
some of the more successful models and, as a result, the selection of input features for these
models is an important consideration.
One way to infer a ranking of feature importance to the models is by using Sequential Forward
or Backward Selection (Section 2.3.2). This can be used on any classification model as it
simply trains and tests the model with different combinations of features and records which
features result in the best scores. This is, however, extremely computationally intensive and so
can only be used sparingly.
5.4.1. Experiment 2.1: Investigating whether Sequential Forward Selection or Sequential Backward Selection give better results
Sequential Selection is a greedy algorithm for selecting the optimum combination of features
for a given model (see Section 2.3.2) and was implemented in SFS.py. As a single measure of
model effectiveness must be used, and input prior to running the algorithm, the accuracy was
selected. On each iteration of the algorithm, the feature whose addition (forward selection) or
removal (backward selection) resulted in the highest accuracy would be added to or removed from
the selected feature set.
Because the algorithm is greedy and does not necessarily reach an optimal solution, the results
of forward and backward selection can be different.
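A minimal sketch of the two greedy procedures is given below to show how they can arrive at different feature sets. It is not the implementation in SFS.py: the function names, the feature names and the scoring function `toy_score` are hypothetical, with `toy_score` standing in for the cross-validated accuracy of a model trained on the given features.

```python
def forward_selection(features, score, n_select):
    """Greedily add the feature that gives the highest score at each step."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected


def backward_selection(features, score, n_select):
    """Greedily remove the feature whose removal gives the highest score."""
    selected = list(features)
    while len(selected) > n_select:
        drop = max(selected, key=lambda f: score([g for g in selected if g != f]))
        selected.remove(drop)
    return selected


def toy_score(subset):
    """Hypothetical accuracy: 'c' and 'd' are only useful when used together."""
    chosen = set(subset)
    value = 0.60
    if "a" in chosen:
        value += 0.05
    if "b" in chosen:
        value += 0.04
    if {"c", "d"} <= chosen:
        value += 0.12
    return value - 0.01 * len(chosen)  # small penalty for each feature used


features = ["a", "b", "c", "d"]
forward_pair = forward_selection(features, toy_score, n_select=2)
backward_pair = backward_selection(features, toy_score, n_select=2)
```

In this toy case forward selection never discovers the pair that is only useful in combination, whereas backward selection retains it, so the two procedures end at different local maxima.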
5.4.1.1. Results
Figure 16 shows the results of the experiment, which was the second round of sequential feature
selection (see Figure 7). In each graph, a single set of model hyperparameters was used. The
model accuracy is plotted for each number of features, for both forward and backward
selection. The forward selection was limited to the first 50 features and both forward and
backward selection had a minimum of three features. A dotted line indicates the highest
accuracy achieved during the process.
Figure 16: Sequential Forward/Backward Selection results: Model accuracy for different numbers of features
In each of the plots, the maximum accuracy is achieved with forward selection. The difference
between the maximum accuracy for forward and backward selection ranges from 1% to 5%.
For most sizes of feature set, forward selection selects features which lead to a higher accuracy.
The only exceptions to this are for feature sets in the range 30-50 features for the Neural
Network and Support Vector Machine. In these cases, backward selection gave an accuracy
that was similar to or better than that from forward selection.
K-Nearest Neighbours shows that there is not much change in performance from feature sets
of size 15 to 45 features for forward selection. For the other models, a peak in the range of 10
to 20 features is followed by a clear decrease in accuracy when further features are added.
5.4.1.2. Conclusion
As shown by the results of this experiment, for the models selected, sequential forward
selection is more effective than sequential backward selection. The maximum accuracy
achieved is consistently higher for forward selection and, for a given size of feature set, the
model accuracy is almost always higher when forward selection is used.
For most models, a clear peak exists which indicates a possible feature set to select for use.
With K-Nearest Neighbours, however, the accuracy stays within a range of around 0.5% whilst
a further 30 features are added.
The plots show a difference in performance between forward and backward selection. This is
due to the algorithms being greedy and heading for local maxima. It therefore cannot be
concluded with confidence that the optimum set of features is that which results in the highest
accuracy score in this sequential feature selection process.
5.4.2. Experiment 2.2: Investigating whether Recursive Feature Elimination improves the set of features selected
Recursive Feature Elimination (Section 2.3.3) is another technique for feature selection. A
large number of runs were carried out in which RFE was incorporated at different levels. Each
run had a specified number of features to use – 5, 10, 15, 20, 25, All (no RFE). The features
selected by RFE would be used, the others removed from the input data to the model prior to
model fitting. The input feature sets used were those selected as optimum from the second
Sequential Forward Selection run in Section 5.4.1. Where the number of features for selection
in RFE is greater than the number of features in the input feature set, the RFE process has no
effect and the full input feature set is used. The hyperparameters used were the ranges of
hyperparameters tested in the final hyperparameter search.
The aim of this experiment was to find out if the set of features input to the model would prove
to be the best, or whether a subset of the features would result in an improved model.
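The elimination loop, including the case where the requested number of features is at least the size of the input set, can be sketched as follows. The function `toy_importance` and its weights are hypothetical; in practice the importance values would come from refitting the model on the features still retained.

```python
def recursive_feature_elimination(features, importance, n_keep):
    """Repeatedly drop the least important feature until n_keep remain."""
    kept = list(features)
    if n_keep >= len(kept):
        return kept  # RFE has no effect: the full input feature set is used
    while len(kept) > n_keep:
        weights = importance(kept)  # feature -> importance for the current set
        least = min(kept, key=lambda f: weights[f])
        kept.remove(least)
    return kept


# Hypothetical, fixed importance values used purely for illustration.
toy_weights = {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.05}


def toy_importance(kept):
    return {f: toy_weights[f] for f in kept}


top_two = recursive_feature_elimination(["a", "b", "c", "d"], toy_importance, 2)
unchanged = recursive_feature_elimination(["a", "b"], toy_importance, 5)
```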
5.4.2.1. Results
Figure 17 shows the results of the experiments for Recursive Feature Elimination.
Figure 17: Recursive Feature Elimination - For each of the six different models, for each level of RFE, the highest score of Area Under the ROC curve and Accuracy are plotted
The results show that applying recursive feature elimination to a set of features which has been
selected through sequential feature selection does not result in an increase to the area under the
ROC curve or the model accuracy. The K-Nearest Neighbours plots show that the more features
that are used, the higher the accuracy of the model. The Gaussian Naïve Bayes plot does not
have a value for where the RFE value is 5 because no run met the minimum thresholds set (see
Section 5.1).
5.4.2.2. Conclusion
The plots show that applying recursive feature elimination to take a subset of the input features,
which were themselves a subset of the initial features, does not improve the performance of the
model. This adds confidence that the features selected using SFS are a good selection. As the
RFE process and SFS process are different, the fact that they do not contradict each other
increases confidence in the results obtained.
5.4.3. Key features selected in most effective model
The features selected for each model are shown in the appendices (Section 9.2).
Of the six feature sets selected, the features that are common to at least two sets are shown in
Table 5. As the most frequently occurring features only occur in half of the models, there is
clearly variation in which features work best for different models.
Table 5: Features that appear in more than one of the selected 6 models
Feature name                        Frequency
Energy_2yrDiff                      3
Supply.Staff_2yrDiff                3
TotalRevBalance Change 7yr          3
ISSECONDARY                         2
Other_2yrDiff                       2
Energy_4yrDiff                      2
Total revenue balance (1) 2017-18   2
HasBoys                             2
Learning.Resources.2018             2
Special                             2
HasGirls                            2
Back.Office_4yrDiff                 2
Energy.2018                         2
TotalRevBalance Change 4yr          2
Premises_4yrDiff                    2
The importance of each feature can be calculated from a Random Forest as the decrease in node
impurity weighted by the probability of reaching that node (Ronaghan, 2018). This was first
calculated for the features in the Random Forest model used, with the results shown in Table
6.
Table 6: Importance of each feature to the selected Random Forest model
Feature name                 Feature importance
TotalRevBalance Change 4yr   0.31334
TotalRevBalance Change 7yr   0.253867
Catering.2018                0.126941
Teaching.Staff.2018          0.107894
Self.Income_2yrDiff          0.095085
AGEL                         0.041989
ISSECONDARY                  0.026097
ISPRIMARY                    0.016324
GOR_North West               0.00784
GOR_East Midlands            0.005651
Special                      0.003589
HasGirls                     0.001383
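The impurity-based importance described above can be illustrated for a single split. The sketch assumes Gini impurity and invented class counts; the contribution of a node is its impurity, weighted by the proportion of samples reaching it, minus the similarly weighted impurities of its two children. A feature's importance is then obtained by summing these contributions over every node that splits on the feature, normalising, and averaging over the trees in the forest.

```python
def gini(class_counts):
    """Gini impurity of a node, given the number of samples in each class."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def node_importance(parent, left, right, n_total):
    """Weighted impurity decrease achieved by splitting parent into left and right."""
    n_parent, n_left, n_right = sum(parent), sum(left), sum(right)
    return (
        (n_parent / n_total) * gini(parent)
        - (n_left / n_total) * gini(left)
        - (n_right / n_total) * gini(right)
    )


# Hypothetical root node: 50 samples of each class, split into two purer children.
decrease = node_importance(parent=[50, 50], left=[40, 10], right=[10, 40], n_total=100)
print(f"weighted impurity decrease: {decrease:.2f}")
```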
The selected Random Forest model was then trained and tested using all of the original input
features. The feature importance was then calculated and is shown in Table 7.
Table 7: Importance of the top 30 features when the selected Random Forest parameters are applied to all features
Feature name                                                           Feature importance
Total revenue balance (1) 2017-18                                      0.126383
TotalRevBalance Change 7yr                                             0.118881
TotalRevBalance Change 4yr                                             0.099824
TotalRevBalance Change 2yr                                             0.078365
Total.Spend.pp_4yrDiff                                                 0.023139
Supply.Staff.2018                                                      0.01918
Supply.Staff_2yrDiff                                                   0.019165
Total revenue balance (1) as a % of total revenue income (6) 2017-18   0.018994
PERCTOT                                                                0.016633
Supply.Staff_4yrDiff                                                   0.0146
PerformancePctRank                                                     0.014318
Total.Income.pp_4yrDiff                                                0.011931
AcademyNew                                                             0.011606
Energy.2018                                                            0.011453
Catering.2018                                                          0.011407
Catering_4yrDiff                                                       0.011356
Learning.Resources_2yrDiff                                             0.011065
Back.Office_2yrDiff                                                    0.011061
Teaching.Staff_2yrDiff                                                 0.010531
Premises_4yrDiff                                                       0.010522
ICT.2018                                                               0.010342
Back.Office.2018                                                       0.010228
Consultancy.2018                                                       0.009657
PTFSM6CLA1A__18                                                        0.009642
Mean Gross FTE Salary of All Teachers (£s)                             0.009461
Total.Income.pp.2018                                                   0.009431
ICT_2yrDiff                                                            0.009284
Energy_2yrDiff                                                         0.009065
AGEH                                                                   0.008857
Consultancy_2yrDiff                                                    0.008537
Table 7 shows that the Random Forest classifier places a high value on financial data, with the
top 8 features being of this type. The level of pupil absence is also an important feature, as are
school examination results. The spend on supply teachers also appears to be an important
factor.
5.5. Experiment 3: Finding the optimum operations on the input data
The performance of any model is reliant on the data that is input. A series of experiments were
set up to determine which operations on the input data would result in the most effective model.
5.5.1. Experiment 3.1: Determining whether oversampling improves model performance
Each run in the parameter search was run both with and without oversampling and the results
recorded. This allows direct comparison between the sets of runs as each group (oversampled
or not oversampled) contains runs with identical parameters aside from whether oversampling
is used. The SMOTE oversampling algorithm was implemented as described in Section 2.3.4.
This experiment tests whether the use of this oversampling technique caused an improvement
in the model results.
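The core idea of SMOTE, creating a synthetic minority-class point by interpolating between an existing point and one of its nearest minority-class neighbours, can be sketched as follows. This is a simplified illustration and not the implementation used in this work; the function names, the value of k and the data points are assumptions.

```python
import random


def interpolate(point, neighbour, rng):
    """Return a random point on the line segment between point and neighbour."""
    u = rng.random()
    return tuple(p + u * (q - p) for p, q in zip(point, neighbour))


def oversample_minority(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from the minority-class samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        point = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to point.
        others = sorted(
            (m for m in minority if m is not point),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(point, m)),
        )
        neighbour = rng.choice(others[:k])
        synthetic.append(interpolate(point, neighbour, rng))
    return synthetic


# Hypothetical minority-class samples with two features each.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = oversample_minority(minority, n_new=6)
```

To balance a training set, `n_new` would be set to the difference between the sizes of the majority and minority classes, which is small when the classes are already close to an even split.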
5.5.1.1. Results
The results of the experiment are shown in Figure 18.
Figure 18: Oversampling - For each model type, the run with the highest accuracy and area under the ROC curve are shown for the group where oversampling was used on the training data and the group which did not use oversampling.
The plots show that the best runs with and without oversampling have very similar scores for
both accuracy and area under the ROC curve. Logistic Regression shows a larger best accuracy
when oversampling is used.
5.5.1.2. Conclusion
The results show that applying oversampling makes little difference to the best model found
for each model type. In the case of Logistic Regression, an increase in accuracy is observed
when oversampling is applied. The minimal effect of oversampling on this dataset can largely
be explained by the fact that the dataset has a relatively even split between the two classes
(stuck and escaped stuck). As a result, there are few oversampled data points required to make
the split exactly 50:50.
This technique is expected to be far more effective when there is a more uneven split between
the classes, such as when using data with the initial definition of stuck school (see Section 4.5).
5.6. Attributes of Most Effective Model
The most effective model overall is the K-Nearest Neighbours model with k=41 and p=1. The
confusion matrix is shown in Table 8, its scores across different measures are shown in Table
9 and its parameters were specified in Table 3.
Table 8: Confusion matrix for best model
              True 1   True 0
Predicted 1   474      119
Predicted 0   331      789
Table 9: Properties of best model
Accuracy                0.737
AUC                     0.759
Precision for class 0   0.694
Precision for class 1   0.746
Recall for class 0      0.889
Recall for class 1      0.514
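For reference, the standard definitions of these measures can be applied to the counts in Table 8, as in the sketch below. The accuracy computed from the single matrix reproduces the 0.737 reported in Table 9. How the per-class values in Table 9 were obtained is not stated; if they were averaged over cross-validation folds (an assumption), they need not coincide with values computed from one pooled matrix.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard measures derived from a binary confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision_1": tp / (tp + fp),  # of those predicted 1, proportion truly 1
        "recall_1": tp / (tp + fn),     # of those truly 1, proportion predicted 1
        "precision_0": tn / (tn + fn),
        "recall_0": tn / (tn + fp),
    }


# Counts read from Table 8 (rows are predictions, columns are true labels).
metrics = classification_metrics(tp=474, fp=119, fn=331, tn=789)
print(f"accuracy = {metrics['accuracy']:.3f}")
```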
6. Discussion
The results show that the best models found have AUC and accuracy scores of between 70%
and 75%. The results are relatively self-consistent in that the results appear to be heading
towards a limit of around 75%.
The majority of the work has been carried out using the newer definition of stuck schools,
which has led to a more challenging classification task. As is seen in Figure 6, the two classes
are strongly overlapped and have proved challenging to separate. Judging by the consistency
of the results achieved in this project, it appears that increasing the accuracy and AUC scores
significantly would require a major change in the techniques used, since those used in this work
have been used thoroughly. There are, however, many areas in which the existing techniques
used could be optimised further.
School examination results were suspected to be an indicator of likely school inspection
success, yet they only feature in one of the models. One reason why they do not feature more widely could
be that primary and secondary schools have completely different scoring systems, which
were reconciled in Section 4.1.3. This leads to a more general issue with the data: that it is
measured and recorded in different ways over the years. A large amount of the data used is
solely for the most recent academic year, so is relatively self-consistent. Many other data
points, though, are time based, showing a change over the years.
Time is an important aspect of this analysis which is difficult to bring in. Firstly, school
inspections take place at irregular intervals (Section 1.1). When labelling schools based on their
histories, the two simple ways of doing this are to arrange them in order, most recent first, or
to arrange them by the date that they took place. Each has its advantages, but also causes a loss
of information. Given that inspection order was selected, a possible downside is that the data
used takes into account some differences from 7 years previously, whilst all of the relevant
inspections could have taken place in the last 3 years.
The second issue with time is the use of current data to predict the future performance of
schools. The data used is based on the current year, or a change from a number of years
previous. The class labels, as noted, are not being considered against time.
A possible area of weakness in the modelling is that schools are linked to their predecessors.
Generally, it is assumed that this is a sensible decision as schools do not change completely if
they close and reopen – it is usually still the same building with the same teachers. Sometimes,
though, schools have a list of many predecessors, where many smaller schools have been
merged into a larger school. All of the predecessor schools’ inspection results count towards
the new school’s results, so if four small schools with a recent poor inspection each merge into
a large new school, that school is immediately labelled as stuck.
The way that the data has been merged also means that data for the current school is used, but
for predecessors it is not. Therefore, if the school opened in 2016 then it will not have any data
in the combined dataset for dates before 2016. This is likely a source of significant missing
data and could be remedied by writing a program to look up predecessor school data if there is
no data available for the currently open school. This would need to be done carefully, however,
given that the predecessor is not necessarily a direct representation of the current school.
Missing data is an important area of uncertainty. Given that none of the features were complete
and the models required fully complete data, there was little option but to impute the missing
data (Section 3.3.4). The selected imputer uses other values in the row and column of the
missing data point to infer what value to insert. As a result, the imputed value depends on the
other features in the dataset. If the missing data in a new column is imputed as soon as the
column is added to the dataset, the missing values will be imputed with different values to those
if all of the columns were imputed at the same time, once they had all been added. Whichever
imputation technique is selected, it will add error to the model which has not been
characterised.
The data also vary greatly, with a clear difference in model
performance between different sections of the data. If the data is sorted by URN (i.e. in
approximate order of when the school was opened) and the random selection option of the cross
validation is disabled then the plot in Figure 19 can be generated. Moving from Fold 1 through
to Fold 5 is then moving approximately from the oldest 20% of schools to the newest 20%. The
performance of the model appears to improve drastically from one fold to the next. For
example, the area under the curve in some of the models varies from 55% in the first fold to
over 85% in the fourth fold. These results appear counterintuitive because, as explained above,
the most complete data is expected to be for the oldest schools because their data would not be
lost due to being assigned to a predecessor. Given that the effect is so pronounced, there appears
to be good reason to investigate this further, to see if a significant improvement could be made
to the model.
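The contiguous, unshuffled folds used for this check can be sketched as below. The sample count of 1000 is hypothetical and the function name is chosen for illustration; with the rows sorted by URN, each fold is then a consecutive block of schools.

```python
def contiguous_folds(n_samples, n_folds=5):
    """Split row indices 0..n_samples-1 into consecutive, unshuffled folds."""
    # The first (n_samples % n_folds) folds receive one extra row.
    sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    folds, start = [], 0
    for size in sizes:
        folds.append(range(start, start + size))
        start += size
    return folds


folds = contiguous_folds(1000, n_folds=5)
for number, test_rows in enumerate(folds, start=1):
    print(f"Fold {number}: rows {test_rows.start} to {test_rows.stop - 1}")
```

Each fold in turn acts as the test set, with the remaining rows used for training.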
Figure 19: ROC curves for the selected models, with randomness disabled in the cross validation process. Fold 1 is therefore the first 20% of the data points, Fold 2 is 20-40% etc.
Cross validation was used throughout this work, with all results reported being averages over 5-fold cross validation. Given the clear variation in the data, it would likely make sense to increase to 10 folds. This would reduce the distortion of the results caused by the unevenness of the data. Re-running the same experiment generates different results due to the differing cross validation splits, introducing noise into the results.
Using accuracy and AUC as the measures for determining the performance of the model assumes that false positives and false negatives have equal cost (Adams and Hand, 2000). In this work, these measures have been used heavily as the costs are assumed equal. Further work and discussions with the relevant stakeholders may lead to adjusted measures being used.
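One possible adjusted measure is the average misclassification cost, in which false positives and false negatives are weighted differently. The sketch below uses the counts from Table 8; the cost values are hypothetical and would need to be agreed with the relevant stakeholders.

```python
def average_cost(fp, fn, n_total, cost_fp=1.0, cost_fn=1.0):
    """Average misclassification cost per school for the given error costs."""
    return (cost_fp * fp + cost_fn * fn) / n_total


n_total = 474 + 119 + 331 + 789  # all schools in the confusion matrix of Table 8

# With equal unit costs the measure reduces to the error rate, 1 - accuracy.
equal_costs = average_cost(fp=119, fn=331, n_total=n_total)

# Hypothetical case: a false negative is three times as costly as a false positive.
weighted_costs = average_cost(fp=119, fn=331, n_total=n_total, cost_fn=3.0)
```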
The main focus of this work has been on the updated definition of stuck schools, and modelling them against schools that have ‘escaped’ being stuck (Section 4.3). Using the original definition of stuck schools, and modelling against all of the rest of the dataset, the results are much improved. One reason for this is that there are more data points. By modelling just the two subsets of the data, 90% of the data is removed and not used. On top of this, a school in the original stuck category is likely very different from a randomly selected other school. Using the new definition and subsets, however, the two classes that are to be distinguished are likely to be very similar. They have both had a run of three consecutive poor inspections, with potentially a single inspection different between them. There are approximately 430 schools in class 1 under the original stuck definition, with approximately 800 schools in class 1 under the new definition. The old class 1 is a subset of the new class 1.
The techniques used for selecting the hyperparameters and the features were both computationally intensive, requiring the training and testing of large numbers of models. The nature of the searches meant that sub-optimal combinations would be tried, some taking far longer than other combinations of hyperparameters and features. Throughout this work, computing power was at a premium. The code was set up to work best in these conditions, for example routinely dumping results to .csv files and Python ‘pickles’ to free up memory, limiting hyperparameter ranges to those that processed faster and designing the code to run unattended for hours/days. Further information on this topic is provided in Section 5.2. This project was also undertaken on four different laptops, so file locations, version control and module versions were watched closely.
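The practice of dumping intermediate results to disk can be sketched with the standard library alone. The file names, the function `dump_results` and the example rows are hypothetical, and a temporary directory is used so that the example leaves no files behind.

```python
import csv
import pickle
import tempfile
from pathlib import Path


def dump_results(rows, objects, out_dir, tag):
    """Write result rows to a .csv file and arbitrary Python objects to a pickle."""
    out_dir = Path(out_dir)
    csv_path = out_dir / f"results_{tag}.csv"
    with open(csv_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    pickle_path = out_dir / f"objects_{tag}.pkl"
    with open(pickle_path, "wb") as handle:
        pickle.dump(objects, handle)
    return csv_path, pickle_path


# Hypothetical results from two runs of a parameter search.
rows = [
    {"model": "knn", "accuracy": 0.73, "auc": 0.75},
    {"model": "svm", "accuracy": 0.72, "auc": 0.74},
]
with tempfile.TemporaryDirectory() as tmp:
    csv_path, pickle_path = dump_results(rows, {"best": "knn"}, tmp, tag="round1")
    header_line = csv_path.read_text().splitlines()[0]
    with open(pickle_path, "rb") as handle:
        restored = pickle.load(handle)
```

Once the results are on disk, the in-memory copies can be deleted, and an unattended run that is interrupted loses only the work done since the last dump.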
The use of sequential feature selection is a useful technique because it allows feature selection for models which do not generate a feature importance value. It is a greedy algorithm which will not necessarily find the optimal combination of features. It does, however, provide confidence that adding or removing a single feature from the feature set would not improve performance because if it did, the feature would have been selected by the algorithm. The only manual input into the feature selection was done at the start, in selecting and preparing the features to add to the dataset. Once they had been added to the dataset, the features were selected by the algorithms in the process.
Some of the more complex models, such as the Neural Network, are unlikely to be optimal. There are many parameters and architectures available which can be altered to potentially improve the model, which would take a large quantity of time and computing power. For example, the number of iterations and using layers with varying numbers of units have not been investigated. Neural Networks were not originally considered for this work due to the small quantity of training data available, so shallow networks are expected to work best. The models could also be optimised by making a more thorough investigation into performance results when the model is applied to the
training data. This would give more insight into the overfitting/underfitting nature of the model and how it could be improved.
7. Conclusions
The primary aim of this project was to create a dataset and a resulting model that could
accurately predict the future performance of a school in terms of inspection results. These aims
have been achieved to a good level, given the time and resources available, and the complexity
of the problem.
Firstly, a literature review was carried out in Section 2 to identify and assess previous work in
the field and to investigate techniques to be used in this work. It does not appear that work to
use machine learning to predict school inspection results has been published in the literature,
although a number of studies have been done to predict examination results. Although related
to inspection results, the two are clearly distinct. The literature provided information on six
machine learning models to implement in this project and how they may be optimised.
The dataset creation has been achieved by generating a dataset of every currently open state-
funded school in England, and is explained in Sections 3 and 4. This dataset has over 80
variables which have come from a range of input sources that are of multiple formats. Some of
the variables have been generated through feature engineering. The variables selected have
been narrowed down from over 1000 in the input data. The data have been cleaned, normalised
and the missing values imputed. The data have also been assigned labels using an updated
definition of the binary classes, and data not relevant have been removed from the dataset.
Many models have been trained and tested (Section 5) to find the best performing model for
this task. For each of the six model types identified in the literature survey, an optimised
combination of hyperparameters and features has been found. This was achieved iteratively
using random hyperparameter grid searches and sequential feature selection. The feature sets
found differ for each model type, and feature importances are shown for the Random
Forest models.
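A random search over a hyperparameter grid of the kind mentioned above can be sketched as follows. This is an illustrative outline only, not the project code: the helper names `parameter_grid` and `random_search`, the search space in `options` and the scoring function `toy_score` are assumptions, with `toy_score` standing in for a cross-validated model score.

```python
import itertools
import random


def parameter_grid(options):
    """Return every combination from a dictionary of candidate value lists."""
    names = list(options)
    return [dict(zip(names, values))
            for values in itertools.product(*(options[n] for n in names))]


def random_search(options, evaluate, n_trials, seed=0):
    """Evaluate a random sample of grid points and return the best one."""
    rng = random.Random(seed)
    grid = parameter_grid(options)
    trials = rng.sample(grid, min(n_trials, len(grid)))
    return max(trials, key=evaluate)


# Hypothetical search space for a Support Vector Machine.
options = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1]}


def toy_score(params):
    # Stand-in for a cross-validated score; it peaks at C=10, gamma=0.01.
    return -abs(params["C"] - 10) - 100 * abs(params["gamma"] - 0.01)


best_params = random_search(options, toy_score, n_trials=6)
```

Sampling only some of the grid points trades completeness for run time; setting `n_trials` to the full grid size turns the same function into an exhaustive grid search.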
The classification accuracy and the area under the ROC curve of the best performing models
overall are around 75%, which is considered a good result when the complexity of the task, the
time available and the quality of the input data are taken into account. The two classes to be
distinguished are very similar, and the available training set is only a small subset of the
complete input data. If a higher confidence in the prediction of which schools will become
stuck is required, an 88% precision has been achieved for a Support Vector Machine with 40%
recall.
This work has been presented to both the Department for Education and Ofsted to explain what
has been done and the results achieved. It has been agreed that it is unlikely that further work
to model this data will provide a significant gain in predictive power beyond that of the
classifiers described in this report. At the time of writing, it is not yet known if this work will
be continued within the ONS beyond the project period.
An initial step in terms of further work would be to determine whether an increase in
classification accuracy was possible and, if so, by how much. If there is room for improvement,
an investigation would be required to find out why the models are not producing optimum
results. This would start with an analysis of training and test data to see the overfitting
tendencies of the models.
Should an improvement in classifier performance be sought, there are many possibilities for
work that would provide this. Firstly, there are likely to be techniques to extract high quality
classifications from the classifiers presented. For example, a more thorough investigation of
varying the thresholds could provide a classifier with an excellent precision, even if the recall
is low. The classifiers may also be effective for subsets of the data. For example, the classifier
has not been run on just primary schools. There is known to be a difference in the data between
primary and secondary schools, and this work has considered the two together.
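The trade-off between precision and recall when varying the decision threshold can be illustrated with a short sketch. The labels and classifier scores below are invented for illustration and are not results from this project; the function name `precision_recall` is likewise an assumption.

```python
def precision_recall(y_true, scores, threshold):
    """Compute precision and recall when scores >= threshold are predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, y_true) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    return precision, recall


# Hypothetical labels (1 = stuck school) and classifier scores.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.70, 0.55, 0.30, 0.80, 0.45, 0.40, 0.20, 0.10]

for threshold in (0.5, 0.85):
    print(threshold, precision_recall(y_true, scores, threshold))
```

On this toy data a threshold of 0.5 gives a precision and recall of 0.8 each, while raising the threshold to 0.85 gives a precision of 1.0 at a recall of 0.4, which mirrors the kind of high-precision, low-recall operating point discussed above.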
The techniques described in this report could readily be extended if more computing power
were available. A more exhaustive hyperparameter search, investigating more hyperparameters
over a larger range, would be trivial to implement using the code written for this project.
Further sequential feature selection runs could also be carried out on a wider range of models,
using randomly generated starting feature sets to try to find better results than those generated
by the greedy algorithm.
The input data are clearly an area for improvement, which could in turn improve model
performance. More features could be added given more time, but the key step would be to
reduce the amount of missing data. One part of this would be to incorporate data
for predecessor schools. Another would be to investigate whether more complete data sources
are available. The technique for imputation used, whilst likely to be the best available, also
could be investigated as it has a large bearing on the results and imputes different values
depending on what other features are in the dataset.
If a classifier of high quality is not attained, there is likely a large quantity of information to
gain from carrying out a more traditional statistical analysis. This would also tie in with the
more qualitative work that has been completed by Ofsted and the Department for Education.
Working on this project has been an excellent opportunity to carry out a machine learning
project in the ‘real world’, testing out a large variety of the skills acquired this year and in
previous years. It has been interesting to try out many different machine learning techniques
and find out for myself what works best. Working with the ONS, Ofsted and the Department
for Education has also given me some experience of how carrying out a machine learning
project on data can lead to insight and to changes in government policy.
Given that only three months were available between first arriving at the ONS and completing
this report, only a finite quantity of work could be achieved. Many steps of the project
could easily have been allocated more than three months themselves so, as a result, there are
areas where more thorough work could have been done. This would, however, have directly
resulted in not completing some of the later work.
The aim of the project was to carry out a full, from start to finish, data science project, with a
real risk that the end was not reached in time. For example, creating the dataset proved to be a
real challenge that put the project behind schedule. The techniques used had not been attempted
before by the team, who were interested to know if they would be of any use. There was
therefore uncertainty in how long the models would take to implement and whether they would
generate anything useful. Given that meaningful results have been generated and the results
have been communicated to the stakeholders within the allotted time, the goals of this project
are considered to have been achieved.
8. References
Adams, N. M. and Hand, D. J. (2000) ‘Improving the practice of classifier performance
assessment’, Neural Computation, 12(2), pp. 305–311. doi: 10.1162/089976600300015808.
Canagareddy, D., Subarayadu, K. and Hurbungs, V. (2019) ‘A Machine Learning Model to
Predict the Performance of University Students’, in Fleming, P. et al. (eds) Smart and
Sustainable Engineering for Next Generation Applications. Cham: Springer International
Publishing, pp. 313–322.
Chawla, N. et al. (2002) ‘SMOTE: Synthetic Minority Over-sampling Technique’, J. Artif.
Intell. Res. (JAIR), 16, pp. 321–357. doi: 10.1613/jair.953.
Department for Education (2019) New drive to continue boosting standards in schools -
GOV.UK. Available at: https://www.gov.uk/government/news/new-drive-to-continue-
boosting-standards-in-schools (Accessed: 19 September 2019).
Department for Education (DfE) (2016) ‘Progress 8: How Progress 8 and Attainment 8
measures are calculated’, pp. 1–5. Available at:
https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data
/file/561021/Progress_8_and_Attainment_8_how_measures_are_calculated.pdf.
Duda, R. O., Hart, P. E. and Stork, D. G. (2001) Pattern classification. 2nd ed. New York ;
Chichester: Wiley (A Wiley-Interscience publication).
Fawcett, T. (2006) ‘An introduction to ROC analysis’, Pattern Recognition Letters, 27(8), pp.
861–874. doi: 10.1016/j.patrec.2005.10.010.
Fowler, J. (2012) ‘Ofsted Inspection of Outstanding schools - The Education ( Exemption
from School Inspection ) ( England ) Regulations 2012’, (January), pp. 0–2. Available at:
https://www.lgiu.org.uk/wp-content/uploads/2012/05/Ofsted-Inspection-of-Outstanding-
schools-The-Education-Exemption-from-School-Inspection-England-Regulations-2012.pdf.
Hsu, C.-W., Chang, C.-C. and Lin, C.-J. (2008) ‘A Practical Guide to Support Vector
Classification’, BJU international, 101(1), pp. 1396–400. Available at:
http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.
Hunter, J. D. (2007) ‘Matplotlib: A 2D Graphics Environment’, Computing in Science &
Engineering, 9(3), pp. 90–95. doi: 10.1109/MCSE.2007.55.
Lemaitre, G., Nogueira, F. and Aridas, C. K. (2017) ‘Imbalanced-learn: A Python Toolbox to
Tackle the Curse of Imbalanced Datasets in Machine Learning’, Journal of Machine
Learning Research, 18(17), pp. 1–5. Available at: http://jmlr.org/papers/v18/16-365.html.
Lin, C., Yu, H. and Huang, F. (2011) ‘Dual Coordinate Descent Methods for Logistic
Regression and Maximum Entropy Models’, Machine Learning, 2(85), pp. 41–75.
Majnik, M. and Bosnic, Z. (2011) ROC Analysis of Classifiers in Machine Learning : A
Survey Technical report MM-1 / 2011.
Masci, C., Johnes, G. and Agasisti, T. (2018) ‘Student and school performance across
countries: A machine learning approach’, European Journal of Operational Research.
Elsevier B.V., 269(3), pp. 1072–1085.
McKinney, W. (2010) ‘Data Structures for Statistical Computing in Python’, in van der Walt,
S. and Millman, J. (eds) Proceedings of the 9th Python in Science Conference, pp. 51–56.
Mitchell, T. M. (1997) Machine learning. New York: McGraw-Hill.
mlxtend (no date) Sequential Feature Selector - mlxtend. Available at:
http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/
(Accessed: 12 September 2019).
Office for Standards in Education (2018) ‘School inspection handbook’, Ofsted School
Inspection Handbook, (September). doi: 10.4324/9780203416242_chapter_2.
Ofsted (2019) ‘Education inspection framework for September 2019’, (May), pp. 1–14.
Available at: www.legislation.gov.uk/uksi/2014/3283/contents/made;
Pedregosa, F. et al. (2011) ‘Scikit-learn: Machine Learning in
Python’, Journal of Machine Learning Research, 12, pp. 2825–2830. Available at:
http://Scikit-learn.sourceforge.net.
Press, W. H. et al. (2007) Numerical Recipes : The Art of Scientific Computing. 3rd ed.
Cambridge ; New York: Cambridge University Press.
Raschka, S. (2018) ‘MLxtend: Providing machine learning and data science utilities and
extensions to Python’s scientific computing stack’, Journal of Open Source Software, 3(24),
p. 638. doi: 10.21105/joss.00638.
Rebai, S., Ben Yahia, F. and Essid, H. (2019) ‘A graphically based machine learning
approach to predict secondary schools performance in Tunisia’, Socio-Economic Planning
Sciences. Elsevier, (June), p. 100724. doi: 10.1016/j.seps.2019.06.009.
Rees, G. (2017) List all possible permutations from a python dictionary of lists - Code
Review Stack Exchange. Available at:
https://codereview.stackexchange.com/questions/171173/list-all-possible-permutations-from-
a-python-dictionary-of-lists (Accessed: 19 September 2019).
Ronaghan, S. (2018) The Mathematics of Decision Trees, Random Forest and Feature
Importance in Scikit-learn and Spark. Available at: https://medium.com/@srnghn/the-
mathematics-of-decision-trees-random-forest-and-feature-importance-in-Scikit-learn-and-
spark-f2861df67e3 (Accessed: 15 September 2019).
Scikit-learn (no date a) 1.1. Generalized Linear Models — Scikit-learn 0.21.3 documentation.
Available at: https://Scikit-learn.org/stable/modules/linear_model.html#linear-model
(Accessed: 11 September 2019).
Scikit-learn (no date b) Neural Network models (supervised) - 1.17.1 Multi-layer Perceptron.
Available at: https://Scikit-learn.org/stable/modules/neural_networks_supervised.html
(Accessed: 11 September 2019).
Shalev-Shwartz, S. and Ben-David, S. (2014) Understanding Machine Learning - From
Theory to Algorithms. 1st Ed. Cambridge University Press.
Spielman, A. (2018) The Annual Report of Her Majesty’s Chief Inspector of Education,
Children’s Services and Skills 2017/18 - GOV.UK. Available at:
https://www.gov.uk/government/publications/ofsted-annual-report-201718-education-
childrens-services-and-skills/the-annual-report-of-her-majestys-chief-inspector-of-education-
childrens-services-and-skills-201718 (Accessed: 12 September 2019).
Tanuar, E. et al. (2018) ‘Using Machine Learning Techniques to Earlier Predict Student’s
Performance’, in 2018 Indonesian Association for Pattern Recognition International
Conference (INAPR), pp. 85–89. doi: 10.1109/INAPR.2018.8626856.
Thomson, D. (2019) A look at Ofsted’s ‘stuck’ schools - FFT Education Datalab. Available
at: https://ffteducationdatalab.org.uk/2019/01/a-look-at-ofsteds-stuck-schools/ (Accessed: 18
September 2019).
Walt, S. van der, Colbert, S. C. and Varoquaux, G. (2011) ‘The NumPy Array: A Structure
for Efficient Numerical Computation’, Computing in Science & Engineering, 13(2), pp. 22–
30. doi: 10.1109/MCSE.2011.37.
Zhang, H. (2004) ‘The Optimality of Naive Bayes’, Aa, 1(2), p. 3.
9. Appendices
9.1. Appendix 1: Input Variables Used
The variables used in this project are shown below, along with a brief description of their
meaning. Variable names with a year suffix mean that the value was recorded in that year. For
example, VariableName_16 would correspond to the value VariableName had in the academic
year 2015-16.
9.1.1. School Financial Balance
Total revenue balance (1) 2017-18 – Float
Total revenue balance (1) as a % of total revenue income (6) 2017-18 – Float
TotalRevBalance Change 7yr – Float (calculated) – Difference between Total revenue
balance in 2010-11 and 2017-18.
TotalRevBalance Change 4yr – Float (calculated) – Difference between Total revenue
balance in 2013-14 and 2017-18.
TotalRevBalance Change 2yr – Float (calculated) – Difference between Total revenue
balance in 2015-16 and 2017-18.
9.1.2. School performance data
TOTPUPS__18 – Integer – Total number of pupils in the school.
PTKS1GROUP_L__18 – Float – Values taken from columns with the following names:
“Percentage of pupils at the end of key stage 4 with low prior attainment at the end of key stage
2”, “Percentage of pupils in cohort with low KS1 attainment”, “% pupils in cohort with low
KS1 attainment”
PTKS1GROUP_M__18 – Float – Values taken from columns with the following names:
“Percentage of pupils at the end of key stage 4 with medium prior attainment at the end of key
stage 2”, “Percentage of pupils in cohort with medium KS1 attainment”, “% pupils in cohort
with medium KS1 attainment”.
PTKS1GROUP_H__18 – Float – Values taken from columns with the following names:
“Percentage of pupils at the end of key stage 4 with high prior attainment at the end of key
stage 2”, “Percentage of pupils in cohort with high KS1 attainment”, “% pupils in cohort with
high KS1 attainment”.
PTFSM6CLA1A__18 – Float – Percentage of pupils in school eligible for free school meals.
Values taken from columns with the following names: “Percentage of pupils at the end of key
stage 4 who are disadvantaged”, “Percentage of pupils who are disadvantaged”.
PTMOBN__18 – Float – Values taken from columns with the following names: “Percentage
of pupils at the end of Key Stage 4 who are non-mobile”, “Percentage of eligible pupils
classified as non-mobile”.
PSENELSE__18 – Float – Percentage of pupils with Special Educational Needs. Values taken
from columns with the following names: “Percentage of eligible pupils with SEN with
Statement or EHC plan”, “Percentage of Pupils with statements or supported at school action
plus”, “Percentage of pupils at the end of key stage 4 with special educational needs (SEN)
with a statement or Education, health and care (EHC) plan”, “Percentage of key stage 4 pupils
with statements of SEN (Special Educational Need) or on School Action Plus”.
PerformancePctRank – Float (calculated) – Percentage ranking of schools based on exam
performance. A full explanation is provided in 4.1.3. Values taken from columns with the
following names: “Average Attainment 8 score per pupil”, “Percentage of pupils reaching the
expected standard in reading, writing and maths”.
9.1.3. Pupil population and absence data
PNUMEAL – Float – Percentage of pupils whose first language is not English.
PNUMFSM – Float – Percentage of pupils eligible for free school meals.
PERCTOT – Float – Percentage of overall absence (authorised and unauthorised) for the full
2017/18 academic year.
9.1.4. Spine
ISPRIMARY – Binary – Whether the school is primary.
ISSECONDARY – Binary – Whether the school is secondary.
ISPOST16 – Binary – Whether the school is for pupils aged over 16.
AGEL – Integer – The lowest age of pupils.
AGEH – Integer – The highest age of pupils.
9.1.5. Workforce data
Pupil : Teacher Ratio – Float.
Mean Gross FTE Salary of All Teachers (£s) – Integer – Mean Full Time Equivalent salary
of teachers in school.
9.1.6. School finances data
These variables are an annual total amount of money. Each of the titles in this section represents
three variables:
- The variable value in 2018
- The difference between the variable values in 2016 and 2018
- The difference between the variable values in 2014 and 2018.
Self.Income - Integer – Annual self generated income
Total.Income.pp – Integer – Annual total income per pupil
Teaching.Staff – Integer – Annual spend on teaching staff
Supply.Staff – Integer – Annual spend on supply staff
Ed.Support.Staff – Integer – Annual spend on educational support staff
Premises – Integer – Annual spend on premises
Back.Office – Integer – Annual spend on back office staff
Catering – Integer – Annual spend on catering
Other.Staff – Integer – Annual spend on other staff
Energy – Integer – Annual spend on energy
Learning.Resources – Integer – Annual spend on learning resources
ICT – Integer – Annual spend on IT equipment
Consultancy – Integer – Annual spend on external consultants
Other – Integer – Other annual spend
Total.Spend.pp – Integer – Total spend per pupil
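As an illustration of how each title in this section expands into three variables, the sketch below builds them for one title. The suffixes `.2018`, `_2yrDiff` and `_4yrDiff` follow the variable names listed in Appendix 2; the function name `finance_features`, the sign convention (later year minus earlier year) and the figures are assumptions made for this example.

```python
def finance_features(name, yearly):
    """Build the three variables for one finance title from values keyed by year."""
    return {
        f"{name}.2018": yearly[2018],
        # Assumed sign convention: 2018 value minus the earlier value.
        f"{name}_2yrDiff": yearly[2018] - yearly[2016],
        f"{name}_4yrDiff": yearly[2018] - yearly[2014],
    }


# Invented annual spend figures, for illustration only.
supply_staff = {2014: 40000, 2016: 52000, 2018: 47000}
features = finance_features("Supply.Staff", supply_staff)
```

For the invented figures above, `features` holds a 2018 value of 47000, a two-year difference of -5000 and a four-year difference of 7000.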
9.1.7. Generated variables
Boarding – Binary (converted from text) - Whether the school has pupils who stay at the school
overnight. Positive cases: “Boarding school”, “Children’s home (Boarding school)”, “College
/ FE residential accommodation”. All other cases negative.
SixthForm – Binary (converted from text) – Whether the school has a sixth form.
HasBoys – Binary (converted from text) – School has male students. Positive cases: “Mixed”,
“Boys”.
HasGirls – Binary (converted from text) – School has female students. Positive cases:
“Mixed”, “Girls”.
Maintained – Binary – School is a local authority maintained school
Academy – Binary – School is an academy
Special – Binary – School is a special school
GOR_East Midlands / GOR_East of England / GOR_London / GOR_North East /
GOR_North West / GOR_South East / GOR_South West / GOR_West Midlands /
GOR_Yorkshire and the Humber – Binary variables – Whether school is in the Government
Office Region.
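The text-to-binary conversions described in this section can be sketched as below. The positive cases and region names are taken from the descriptions above; the function name `generated_variables`, its argument names and the example input values are hypothetical and do not reflect the actual column names in the source data.

```python
BOARDING_POSITIVE = {
    "Boarding school",
    "Children’s home (Boarding school)",
    "College / FE residential accommodation",
}
REGIONS = [
    "East Midlands", "East of England", "London", "North East", "North West",
    "South East", "South West", "West Midlands", "Yorkshire and the Humber",
]


def generated_variables(boarding, gender, region):
    """Convert text fields for one school into the binary variables listed above."""
    row = {
        "Boarding": int(boarding in BOARDING_POSITIVE),
        "HasBoys": int(gender in {"Mixed", "Boys"}),
        "HasGirls": int(gender in {"Mixed", "Girls"}),
    }
    # One binary indicator per Government Office Region.
    for name in REGIONS:
        row[f"GOR_{name}"] = int(region == name)
    return row


# Hypothetical school record.
row = generated_variables("No boarders", "Girls", "London")
```

Each school therefore has at most one region indicator set to 1, and a mixed school has both `HasBoys` and `HasGirls` set to 1.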
9.2. Appendix 2: Variables selected for models
Each optimised model used a different subset of the available variables. The variables used for
each model are shown in this appendix.
9.2.1. Neural Network
'Total revenue balance (1) 2017-18',
'Other.Staff_2yrDiff',
'Premises_4yrDiff',
'Mean Gross FTE Salary of All Teachers (£s)',
'Supply.Staff_2yrDiff',
'BoardingNew',
'TotalRevBalance Change 7yr',
'TotalRevBalance Change 2yr',
'Supply.Staff_4yrDiff',
'PNUMEAL',
'GOR_West Midlands',
'MaintainedNew',
'GOR_North East',
'ISPOST16',
'Supply.Staff.2018',
'Other.2018',
'Learning.Resources_2yrDiff',
'TotalRevBalance Change 4yr',
'ICT.2018',
9.2.2. Support Vector Machine
'TotalRevBalance Change 7yr',
'Total revenue balance (1) as a % of total revenue income (6) 2017-18',
'Supply.Staff_2yrDiff',
'Mean Gross FTE Salary of All Teachers (£s)',
'Consultancy_4yrDiff',
'Ed.Support.Staff_4yrDiff',
'Consultancy.2018',
9.2.3. Random Forest
'TotalRevBalance Change 4yr',
'PerformancePctRank',
'Supply.Staff_4yrDiff',
'ISSECONDARY',
'TotalRevBalance Change 7yr',
'PTKS1GROUP_H__18',
'Total revenue balance (1) as a % of total revenue income (6) 2017-18',
'AGEL',
9.2.4. Gaussian Naïve Bayes
'TotalRevBalance Change 4yr',
'Total.Income.pp.2018',
'Supply.Staff.2018',
'Total revenue balance (1) 2017-18',
'Self.Income.2018',
'Mean Gross FTE Salary of All Teachers (£s)',
'PSENELSE__18',
'TotalRevBalance Change 7yr',
'Ed.Support.Staff.2018',
'Consultancy_2yrDiff'
9.2.5. Logistic Regression
'Total revenue balance (1) 2017-18',
'Supply.Staff_2yrDiff',
'TotalRevBalance Change 7yr',
'Other_4yrDiff',
'Teaching.Staff_4yrDiff',
'Other.Staff_4yrDiff',
'Other.Staff_2yrDiff',
'Supply.Staff.2018',
'Ed.Support.Staff.2018'
9.2.6. K-Nearest Neighbours
'Total revenue balance (1) 2017-18',
'TotalRevBalance Change 4yr',
'HasGirlsNew',
'HasBoysNew',
'ISPRIMARY',
'Catering.2018',
'GOR_North East',
'Ed.Support.Staff_4yrDiff',
'BoardingNew',
'Back.Office_2yrDiff',
'Consultancy.2018',
'Consultancy_2yrDiff',
'SpecialNew',
'Ed.Support.Staff_2yrDiff',
'PERCTOT',
'Other.Staff_4yrDiff',
'Learning.Resources.2018',
'Other.2018',
'Other_4yrDiff',
'Energy_4yrDiff',
'TotalRevBalance Change 7yr',
'Teaching.Staff.2018',
'Total revenue balance (1) as a % of total revenue income (6) 2017-18',
'Learning.Resources_4yrDiff',
'ISSECONDARY',
'SixthFormNew',
'ISPOST16',
'Catering_2yrDiff',
Tata Steel Optimisation Dissertation(1).pdf
Steelmaking Continuous Casting Production Schedule Optimization for TATA Steel
Sotirios Filippou
September 2019
School of Mathematics, Cardiff University
A dissertation submitted in partial fulfilment of the requirements for MSc (in Operational Research,
Applied Statistics and Financial Risk) by taught programme.
CANDIDATE’S ID NUMBER
1872049
CANDIDATE’S SURNAME
Mr. Filippou
CANDIDATE’S FULL FORENAMES
Sotirios
DECLARATION
This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree.
Signed ……………………………………………. (candidate) Date 19/09/2019
STATEMENT 1
This dissertation is being submitted in partial fulfilment of the requirements for the degree of MSc
Signed ……………………………………………. (candidate) Date 19/09/2019
STATEMENT 2
This dissertation is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by footnotes giving explicit references. A Bibliography is appended.
Signed ……………………………………………. (candidate) Date 19/09/2019
STATEMENT 3 –
I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing, and for the title and summary to be made available to outside organisations.
Signed ……………………………………………. (candidate) Date 19/09/2019
STATEMENT 4 - BAR ON ACCESS APPROVED
I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing after expiry of a bar on access approved by the Graduate Development Committee.
Signed ……………………………………………. (candidate) Date 19/09/2019
Executive Summary
Iron and steel production is one of the most significant industries worldwide. The steel
manufacturing process is divided into three stages: ironmaking, steelmaking-continuous
casting and production of finished products. In this paper, we focus on the
steelmaking-continuous casting stage, in which liquid iron is converted into steel through the
addition of required alloys and the removal of impurities, and is then solidified into slabs.
The steelmaking-continuous casting process has been described as the bottleneck of the
steel production process. Thus, optimal scheduling of this process could result in many
advantages. However, it is a combinatorial problem with complex practical constraints
and strict requirements and is considered one of the most difficult industrial planning
and scheduling problems.
Across the literature several methods attempting to solve this problem can be found. Each
steelmaking-continuous casting scheduling problem may differ from the others
considering the facilities, the way demand is translated into production planning and any
other additional constraints. As a result, a universal method that can be applied to solve
such a problem does not exist.
The process followed by TATA Steel Port Talbot Works consists of five stages with
multiple machines at all stages. Furthermore, it includes a variety of constraints that result
in high complexity in mathematically formulating a scheduling problem for this process.
Attempting to solve this problem, three different mathematical models were developed.
The first model schedules the last stage of the process, and the second one uses the results
of the first one to schedule the preceding stage. The third model uses the results of the
first one and attempts to schedule all remaining stages. It should be pointed out that the second
or third model may not be able to obtain a feasible solution for every possible output of the first
one.
The three models were tested on a realistic scenario. A complete schedule was acquired
for the two last stages from the first two models. However, a solution for the third model
could not be acquired in a reasonable time frame. Thus, it was tested on a smaller scale
problem for which a feasible solution was acquired.
Several suggestions on how the three models could be improved and used as part of a
complete scheduling system are included in the paper. The models could be combined
with different methods to obtain a complete feasible schedule. Furthermore, different
solution methods should be tested, since they may prove more advantageous for
business purposes.
Acknowledgments
I would like to thank my sponsor supervisor at TATA Steel, Mr. James Watson, for his
guidance, assistance, suggestions and patience throughout the process of writing this
dissertation.
I would also like to thank my university supervisor, Dr. Tony Lewins, for his expertise,
support and guidance on how to approach this research topic. Furthermore, I would like
to thank Dr. Jonathan Thompson for his contribution, willingness to help and always
responding to my questions and queries.
I need to express my gratitude to my parents for their continuous support and
encouragement. In addition, I would like to acknowledge my fellow postgraduate students
and friends at Cardiff University. Your advice and friendship helped me complete this
dissertation. I would like to single out Artemis Giannakopoulou for her support and
sympathetic ear.
Table of Contents
Executive Summary
Acknowledgments
Table of Figures
Table of Tables
Abstract
1. Introduction
2. Literature Review
3. Process Description and Constraints
4. Mathematical Formulation
4.1. Model 1: Continuous Casting Scheduling
4.1.1. Notation
4.1.2. Model Constraints
4.1.3. Objective Function
4.2. Model 2: Treatment Units Scheduling
4.2.1. Notation
4.2.2. Constraints
4.2.3. Objective Function
4.3. Model 3: Basic Oxygen and Secondary Steelmaking Scheduling
4.3.1. Notation
4.3.2. Constraints
4.3.3. Objective Function
5. Experimental Tests
6. Recommendations for Further Development
7. Conclusion
References
Appendix A: Flying Tundish Change (FTC) between Products
Appendix B: Flying Tundish Change (FTC) between Products
Appendix C: Xpress Code Model 1
Appendix D: Xpress Code Model 2
Appendix E: Xpress Code Model 3
Table of Figures
Figure 1. Steel Manufacturing Process
Figure 2. Steelmaking-Continuous Casting Process at TATA Steel
Figure 3. Allocation of sequences to continuous casting machines – Case Study 1
Figure 4. Treatment Unit 1 (RH) Schedule – Case Study 1
Figure 5. Treatment Unit 2 (RD) Schedule – Case Study 1
Figure 6. Treatment Unit 3 (CAS1) Schedule – Case Study 1
Figure 7. Treatment Unit 4 (CAS2) Schedule – Case Study 1
Figure 8. Heats Schedule – Case Study 1
Figure 9. Heats Schedule – Case Study 2
Figure 10. Proposed Heuristic
Table of Tables
Table 1. Sequences – Case Study 1
Table 2. Availability of the continuous casting machines – Case Study 1
Table 3. Machine ID Number – Figure 8
Table 4. Heats – Case Study 2
Table 5. Machine ID Number – Figure 9
Abstract
The purpose of this paper was to develop a mixed integer linear programming model for
scheduling the steelmaking continuous casting process at TATA Steel Port Talbot Works.
Several formulations and solution methods exist in the literature. However, the
formulation of a model highly depends on the considered process and the respective
constraints. Three models attempting to schedule different stages of the process were
developed. The models were tested using experimental data and feasible solutions were
acquired. Further research on solution methods and improvement to the current models
are required for the development of a complete scheduling system.
1. Introduction
Iron and steel production is one of the most significant industries worldwide since its
products are used as primary materials for a variety of other industries such as automobile,
construction and manufacturing (Missbauer et al., 2009; Tang et al., 2014).
Input materials, such as iron ore and scrap, are used to manufacture steel products in a
process that can be divided into three stages:
(1) Ironmaking: iron ore, coke and a fluxing agent are transformed into molten iron
(also called hot metal) in blast furnaces.
(2) Steelmaking-continuous casting: through melting, refining and continuous
casting, the molten iron is converted into solid slabs with a specified chemical
composition.
(3) Production of finished products: slabs are shaped into coils by hot and cold rolling
and take their final form via various processes including continuous or batch
annealing, electro-galvanizing and continuous galvanizing (Missbauer et al.,
2009; Tang and Wang, 2008).
In Figure 1, a representation of the described process can be found.
Figure 1. Steel Manufacturing Process
This industry is characterized by high-temperature high-weight material flow,
sophisticated technological procedures, major investment and high energy consumption.
Additionally, the steelmaking continuous casting (SCC) stage is regularly described as
the bottleneck of the iron and steel making process since it has high-cost energy and
equipment requirements, runs continuously and its total capacity is smaller than the
capacity of the rest of the stages in the iron and steelmaking process (Tang et al., 2002).
As a result, effective scheduling of the SCC processes can be advantageous in several
ways, including minimizing material and energy requirements, increasing profit, reducing
costs and superior response to customer demand (Li et al., 2012). However, scheduling of
such a process is a combinatorial problem with complex practical constraints and strict
requirements on material and flow continuity subject to processing times at the different
stages and transportation and waiting times between them (Li et al., 2012; Tang et al.,
2000). According to Zhu et al. (2010), it is an NP-complete problem and can be described
as a “specific hybrid flow shop scheduling problem” that includes multiple jobs,
operations and machines.
As already mentioned, the SCC stage consists of three sub-stages: steelmaking,
refining and continuous casting. Starting with the steelmaking phase, oxygen combustion
taking place in a converter or an electric arc furnace decreases the impurity components
(carbon, sulphur, silicon, etc.) of the molten iron to acceptable levels converting it to
molten steel that contains the major alloy contents. A basic production unit in the SCC
stage is termed a charge and refers to the simultaneous smelting in a single converter
(Tang et al., 2002). The words job and heat are alternative terms for charge and are used
interchangeably in this paper. Several slabs for different orders can be cast from a single
charge, but they need to have the same steel grade (Tang et al., 2002).
Afterwards, the output is placed into ladles and transferred to refining furnaces. During
refining, any remaining impurities are removed from the molten steel or additional alloy
elements are added. If all refining furnaces are occupied, charges must be held
until the processing of the preceding charges is completed. Waiting causes a decline in the
temperature of the molten steel and reheating is required. The longer the waiting time, the
larger the temperature drop and the higher the energy requirements for reheating (Tang
et al., 2002).
Continuous casting follows refining. The molten steel is poured into a tundish from where
it is tapped into the caster and solidifies into slabs at the bottom of the caster. A series of
jobs successively cast on the same caster without any interruptions is named a cast.
Between consecutive jobs, the casting machine does not require any setup time, but if the
caster must be changed between two casts a significantly long setup time is needed.
Furthermore, a removal time is required for cleaning the equipment between the casts.
However, setup and removal time is not included in the duration of the operation since
only the equipment and not the charges are involved (Tang et al., 2002).
Several machines are usually available for the same stage of the process. The goal of the
SCC production scheduling problem is to establish on which machine each job will be
processed at each production stage, determine the respective processing time and the
sequence of jobs on each machine (Tang et al., 2014). A defined SCC problem includes
two types of data:
(1) Production data: information on job grouping (cast) and jobs. This information is
acquired from solving a batch problem that links required slab production to
required jobs and arranges jobs into casts.
(2) Process data: include information on the available machines, processing times,
transportation times, process route and casting speed (Tang et al., 2014).
The SCC scheduling problem is considered as one of the most difficult industrial planning
and scheduling problems (Sbihi et al., 2014). Across the literature several methods
attempting to solve this problem can be found. Among the most common ones are
mathematical programming, heuristics, simulation, expert systems and artificial
intelligence. Each SCC scheduling problem may differ from the others considering the
facilities, the way demand is translated into production planning and any other additional
constraints. A solution may be obtained either optimally or heuristically. In this paper,
the focus is on mathematical programming, and specifically mixed integer
programming.
This paper discusses a problem presented by TATA Steel concerning the SCC process at
their steel plant in Port Talbot, UK. The specifics of the problem are discussed in Section
3. Currently, humans schedule the production without the use of any optimization
methods. An attempt to identify possible solution methods and related challenges and
model part of the process using mixed integer programming is presented in this paper.
Furthermore, recommendations on improving the scheduling process are given.
The rest of the paper is organized as follows. In Section 2, a brief review of similar
problems found in literature and their proposed solution methods is presented. Section 3
introduces the SCC scheduling problem as described by Tata Steel Strip Products UK Port
Talbot Works. In Section 4, three mixed integer linear programming models are described,
one for scheduling the continuous casting stage of this problem, one for scheduling the
refining stage and one for scheduling all stages preceding continuous casting. In Section
5, examples of the implementation of these models are presented. In Section 6,
recommendations on how a complete scheduling system could be developed are
discussed. Finally, Section 7 draws the conclusions.
2. Literature Review
In this section, literature related to this paper is reviewed. In particular, several examples
of the SCC scheduling problem and their proposed solution methods are presented. The
various solution methods that have been presented by researchers can be categorized into
mathematical programming, heuristics and artificial intelligence. However, many
examples of combining these methods exist. A complete review of mathematical
programming methods was prepared by Tang et al. (2001). Also, Dutta and Fourer (2001)
reviewed mathematical programming applications in the integrated steel industry.
The SCC scheduling problem can be described as a hybrid flowshop problem with
parallel machines at one or more stages (Zhao et al., 2011; Li et al., 2014; Atighehchian
et al., 2009; Tang et al., 2002; Pan et al., 2013; Missbauer et al., 2009). It is one of the
most complex and difficult hybrid flowshop problems due to the additional practical
constraints (e.g. job sequencing, precedence) and the more complicated scheduling
criteria (Li et al., 2014; Atighehchian et al., 2009; Tang et al., 2002). Pan et al. (2013)
and Li et al. (2014) described the problem as a realistic hybrid flowshop problem. The
traditional hybrid flowshop problem involves a process with multiple stages and multiple
machines at one or more stages in which all jobs go through the required stages in the
same order. The realistic hybrid flowshop problem is a generalization of the traditional
one that includes realistic considerations and constraints (Pan et al., 2013). The realistic
hybrid flowshop problem has been studied extensively due to its significant industrial
applications (Pan et al., 2013). However, the differences in the complex production
constraints, the production mode and the objectives of different steel producers result in
the need to develop scheduling systems adapted to the specifications of each
manufacturer. Additionally, although Gupta (1988) proved that the two-stage hybrid
flowshop problem is NP-complete and several publications on solving a hybrid flowshop
problem exist, these methods cannot be used for the SCC scheduling problem due to the
additional practical constraints (Tang et al., 2002). As a result, researchers have studied
several solution methods specifically for the SCC scheduling problem.
Harjunkoski and Grossmann (2001) suggested a decomposition method instead of
modelling one large-scale and unsolvable MILP. They considered a problem of a four-stage
process, with two machines in the first stage and a single machine at each of the remaining
three stages, multiple product types and a predetermined processing time of each product at
each stage. The problem was divided into the following sub-problems: (1) grouping jobs
into sequences; (2) scheduling each sequence individually; (3) combining all individual
schedules; (4) an LP-improvement problem which attempts to make improving changes to
the schedule developed in the previous phase. The model was tested using real-world data
and solved using GAMS-19.5/XPRESS-MP.
Missbauer et al. (2009) created a computerized scheduling system for a steel plant in
Austria. They modelled the SCC problem as a mixed integer linear program (MILP) and
used a heuristic algorithm for solving it. They divided the problem into four sub-problems:
(1) creating a schedule for the continuous casters; (2) assigning the jobs to the parallel
machines at the steelmaking and refining stages; (3) sequencing the jobs at the
steelmaking and refining machines; (4) determining the timing of each operation. Their
heuristic algorithm attempts to solve the problem in three stages: (1) scheduling the
continuous casters taking into account the capacity of the steelmaking and refining stages
and the supply of liquid metal; (2) scheduling all jobs at the remaining stages; (3) solving
a linear problem to make improvements on the schedule. In other words, stages (1) and
(2) are heuristics that fix the values of the binary variables of their MILP model and
determine the initial values for the continuous variables. Based on these fixed values, the
final values of the continuous variables are calculated at stage (3). A complete
computerized scheduling program was developed based on their model, but it was not
implemented on existing software; instead, a software vendor developed customized software.
Tang et al. (2000) proposed a mathematical programming model for overcoming machine
conflicts in SCC scheduling. Like Missbauer et al. (2009), they divided the
whole SCC problem into four sub-problems: (1) Cast sequencing. Casts are scheduled at
the continuous casters prioritizing those with the nearest delivery time. Resource
constraints are not considered at this stage and the problem is formulated as a single
machine sequence problem. (2) Creating sub-schedules. Scheduling the jobs of each cast
at the rest of the stages based on time progress. (3) Creating a “rough” schedule combining
the sub-schedules from the previous step. (4) Elimination of machine conflicts. Since
resource constraints are not considered in the previous steps, machine conflicts exist in
the “rough” schedule that must be eliminated to obtain a feasible schedule. The first three
steps are completed using human-computer interaction while the last one is solved using
a mathematical model. It is a non-linear program which can be transformed into a linear
problem and solved by standard software packages. A combination of human-computer
interaction and the proposed model resulted in the development of a scheduling system in
MS C 6.0 language and SYBASE database system.
Bellabdaoui and Teghem (2006) presented a MILP model for scheduling the SCC process
of an Arcelor Group site in Belgium. The considered process consisted of three stages and
two parallel machines at each stage. They also accounted for the transportation time
between the machines. Their objective was to create a production schedule given one job
sequence for each casting machine and their initial condition (i.e. the instant a machine
becomes available) while minimizing the completion time for all sequences. Processing
times at the first two stages and transportation times were fixed and entered as input
parameters while the processing time of a charge on the casters was a decision variable
that varied between a lower and an upper bound. The model was implemented in
OMPartners software.
Fanti et al. (2016) developed a MILP model for a more complicated process. They
considered a four-stage process (melting, refining, degassing, casting/stripping) with
multiple machines. The last stage could be performed on two different types of machines
(continuous casting or ingot casting machines). They divided the problem into four sub-
problems that they modelled separately, and then, connected them introducing a set of
additional constraints. Thus, the solution was a global optimum. The four sub-models were
defined as: (1) SCC flow, establishing the sequence of the machines on which each charge
would be processed, starting from melting and ending with casting. (2) Ladle scheduling,
assigning ladles to charges. (3) Continuous casting machines scheduling. (4) Ingot casting
machines scheduling. The optimization model considered deterministic processing times
for each machine and its objective was to minimize the completion time of the last job.
The model was implemented in C++.
Sbihi et al. (2014) attempted to introduce a generalized formulation of the SCC problem.
They modelled a three stage process as a MILP without directly considering any material
handling resources, such as ladles or transportation machines. For each caster, the number
of sequences and the number of charges assigned to each sequence was predetermined.
The objective was to schedule the jobs of each sequence at all stages while maximizing
productivity. The processing times at the first two stages of the SCC process and the
transportation times between machines were constant while the processing time of a
charge at the third stage was determined by the model and it was required to be between
an upper and a lower bound. The model was tested on CPLEX software.
Tang et al. (2002) modelled the scheduling problem as an integer program and used
Lagrangian relaxation for solving it. They modelled a three-stage process. The number of
casts and charges was predetermined as well as the processing and transportation times of
each charge at each stage. They proposed a solution method that incorporated Lagrangian
relaxation, dynamic programming and heuristics. After relaxing machine capacity
constraints in their formulation, the problem could be divided into simpler sub-problems.
Using dynamic programming, these simpler problems were solved at a low level while
the Lagrangian multipliers were updated iteratively at a high level. After iteration
completion, a heuristic was used to modify the sub-problem solutions in order for a
feasible global solution to be obtained. Visual C++ language was used for the model
implementation.
Tang and Liu (2007) created a deterministic mixed integer program model for scheduling
production orders based on data from Baosteel in Shanghai, China. Considering the SCC
and the rolling stages of the steel making process, the objective was to schedule each
production order at each stage under capacity constraints while minimizing the weighted
completion time of all orders. As Tang et al. (2002) proposed, the solution method
presented was based on a combination of Lagrangian relaxation, linear programming and
heuristics and implemented using Visual C++.
Similarly, Mao et al. (2014) formulated the SCC problem as a MILP and used Lagrangian
relaxation to solve it. They modelled it as a hybrid flowshop problem, and their objective
was to minimize earliness/tardiness of the completion of charge processing. As in Tang
et al. (2002), relaxation of the machine capacity constraints resulted in two simpler
sub-problems for which several solution algorithms were tested and compared. It was
concluded that their proposed Lagrangian relaxation method results in better quality
solutions in less time than conventional Lagrangian relaxation techniques. Algorithms
were implemented in C#.
Researchers have also studied different versions of the SCC scheduling problem. For
example, Naphade et al. (2001) formulated a batching problem. Considering the size of
the charges, received orders and their delivery time and different product characteristics,
they developed a MILP that determined which charges should be used for each order and
processing times. The objective was to minimize tardiness of delivery time and waste.
Due to the computational difficulty of the problem and time restrictions, they developed
a two-level heuristic algorithm that decomposed the problem into simpler sub-problems
to solve it. The algorithm was implemented by using C++ language. Additionally, Tan
and Liu (2013) formulated the SCC scheduling problem considering a variable electricity
price. The objective was to obtain a daily schedule minimizing electricity and production
costs. The proposed solution method consisted of two stages. At the first one, using
mathematical programming, a relative schedule for each cast was acquired without
considering the electricity price. At the second stage, a scheduling problem for all casts
with resource constraints and variable electricity was modelled and solved by using a
combination of heuristics and constraint propagation. The model was tested using real-
world data and implemented using JAVA language.
Pacciarelli and Pranzo (2004) presented a different heuristic approach. Their model was
based on the alternative graph, a generalization of the disjunctive graph of Roy and
Sussman. It was solved using a beam search procedure and implemented in C language.
Furthermore, Zhao et al. (2011) described a two-step solution approach. In the first step,
tabu search algorithm arranged the jobs on the machines. In the second one, starting and
ending times of the processing of jobs at all stages were determined solving a linear
programming model. Their model was implemented using Visual C++.
Using artificial intelligence to solve the SCC scheduling problem has also been
researched. For instance, Pan et al. (2013) described an artificial bee colony algorithm
that scheduled a multiple-stage, multiple-machine SCC process with predefined casts and
processing times at each stage. Furthermore, Li et al. (2014) proposed a fruit fly
optimisation algorithm for solving the SCC scheduling problem, formulated as a hybrid
flowshop problem with predefined casts and processing times. Atighehchian et al. (2009)
presented an approach that combined ant colony optimization and non-linear optimization
techniques for solving a similar version of the problem. The solution method consisted of
two stages: (1) assigning jobs to machines and determining sequencing; (2) determining
timings of the jobs on the machines.
Several formulations and solution approaches exist in literature; however, there is not a
universal model or methodology that can be applied in all cases. Each problem is highly
dependent on the structure of the considered steel plant and the desired objective. This is
obvious when comparing the problem presented by TATA Steel (described in Section 3)
to the literature findings above.
3. Process Description and Constraints
In this section, the SCC process followed at TATA Steel, Port Talbot and all the
restrictions applied to it are presented.
The steel and slab process consists of three different parts: basic oxygen steelmaking,
secondary steelmaking and continuous casting. Liquid iron is supplied by the blast
furnaces at a variable rate. The liquid iron is transferred using torpedoes. During the
basic oxygen steelmaking process, the liquid iron is poured into iron ladles. The content of
a ladle is termed a heat and each heat is used to produce slabs of a specific product type
and width. The filled ladle is transferred to the next stage during which desulphurisation
happens. To complete the basic oxygen steelmaking process, the liquid iron is transferred
to one of the two available vessels which have already been charged with scrap (approx.
80% (270-310 t) of the charge is hot metal and 20% (60-90 t) is scrap). After the vessel
has been loaded, a copper-tipped, water-cooled lance is lowered into the vessel and oxygen
is blown at a rate of 1000 m³/min.
The next part of the process is called dwell and it involves tapping and transfer, treatment
and floatation. The vessels are tapped into ladles which are transferred to one of the four
available treatment units where the secondary steelmaking process takes place. Depending on
the type of products being produced, different treatment units may need to
be used. Then, floatation follows and ladles are transferred to casting machines. The last
step is continuous casting where the hot metal is solidified into slabs in one of the casting
machines.
Figure 2 illustrates the different paths that can be followed by a heat before it is
transformed into slabs.
Figure 2. Steelmaking-Continuous Casting Process at TATA Steel
Before attempting to create a production schedule for this process, the constraints related
to this process must be introduced. Firstly, the hot metal stock needs to be kept between
an upper and a lower limit to avoid any disturbances in the production. As mentioned
above, the hot metal arrives at a variable rate. Additionally, several constraints related
to operation timings exist. The processing times for several parts of the process are not
fixed, but they need to lie within a predetermined range. The total dwell time, but also its
three different parts, specifically, tap and transfer, treatment and floatation can be
adjusted, but their duration needs to be between a minimum and a maximum value that
are predetermined. For the treatment and the whole dwell process, these values depend on
the product type while the tap & transfer and the floatation times are independent of the
product and they need to be in the range of 15-25 minutes. Similarly, the processing time
of a heat in the casting machines is calculated based on the casting speed that is controlled
by adjusting the speed utilization percentage. The casting speed depends on the product,
width and the casting machine that is used, and the speed utilization ratio is usually in the
range of 70% to 90%. From the casting speed, the emptying time (processing time) is
determined. The duration of the remaining steps of the steelmaking process (e.g. vessel
blowing) is fixed and independent of any other variables.
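The dwell-time windows described above can be sketched as a simple feasibility check. This is a minimal illustration, not the plant's procedure: the 15-25 minute windows for tap & transfer and floatation come from the text, while the product label "A", its treatment and total dwell windows, and the names `Dwell` and `dwell_violations` are hypothetical.

```python
from dataclasses import dataclass

# Windows in minutes taken from the text: independent of the product type.
TAP_TRANSFER_RANGE = (15, 25)
FLOATATION_RANGE = (15, 25)

# Hypothetical product-dependent windows (minutes); real values are plant data.
TREATMENT_RANGE = {"A": (20, 40)}
DWELL_RANGE = {"A": (60, 85)}


@dataclass
class Dwell:
    product: str
    tap_transfer: float
    treatment: float
    floatation: float

    @property
    def total(self) -> float:
        # Total dwell is the sum of its three parts.
        return self.tap_transfer + self.treatment + self.floatation


def within(value, bounds):
    low, high = bounds
    return low <= value <= high


def dwell_violations(d: Dwell) -> list:
    """Return the names of the dwell windows that the timings violate."""
    checks = {
        "tap_transfer": within(d.tap_transfer, TAP_TRANSFER_RANGE),
        "treatment": within(d.treatment, TREATMENT_RANGE[d.product]),
        "floatation": within(d.floatation, FLOATATION_RANGE),
        "total_dwell": within(d.total, DWELL_RANGE[d.product]),
    }
    return [name for name, ok in checks.items() if not ok]
```

Note that each part can sit inside its own window while the total dwell still falls outside its range, which is why the total is checked separately.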
Additionally, there are several constraints related to the casting machines. Casting needs
to be continuous in all machines with a gap of two to four minutes between the ending
time of the previous heat and the arrival of the next one. Any interruption would result in
a two-hour break in the steelmaking process. Sequence length is defined as the
number of consecutive heats of the same product and width. For each combination of
product, width and caster, there is a minimum and a maximum value for the sequence
length.
Lastly, there are constraints in the order of sequences. Some products cannot be
manufactured on the same caster after specific products due to restrictions in the flying
tundish changes. A table presenting permitted flying tundish changes is available in
Appendix A. Furthermore, changes equal to or less than 200mm are allowed regarding
width between two consecutive sequences. For example, a sequence of product 319x with
width 3600mm can be placed after a sequence of product 319x with width 3400mm, but
not after one of product 319x with width 3200mm. If two sequences do not satisfy these
conditions and they are scheduled one after the other on the same continuous casting
machine, a two-hour gap must exist between the ending time of the preceding and the
starting time of the following sequence.
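The succession rule above can be sketched as a small function. This is an illustration under stated assumptions: the 200mm limit and the two-hour gap come from the text, but the set `PERMITTED_CHANGES` is a hypothetical stand-in for the Appendix A table, the product "P2" is invented, and the names `Sequence` and `required_break` are not from the original.

```python
from dataclasses import dataclass

MAX_WIDTH_CHANGE_MM = 200   # largest width change allowed without a break
BREAK_MINUTES = 120         # two-hour gap when the rule is violated

# Hypothetical excerpt of the permitted flying tundish changes; the real table
# is in Appendix A. Pairs are (preceding product, following product).
PERMITTED_CHANGES = {("319x", "319x"), ("319x", "P2"), ("P2", "P2")}


@dataclass(frozen=True)
class Sequence:
    product: str
    width_mm: int


def required_break(prev: Sequence, nxt: Sequence) -> int:
    """Minutes of break needed between two consecutive sequences on one caster."""
    width_ok = abs(nxt.width_mm - prev.width_mm) <= MAX_WIDTH_CHANGE_MM
    change_ok = (prev.product, nxt.product) in PERMITTED_CHANGES
    return 0 if (width_ok and change_ok) else BREAK_MINUTES
```

With these assumptions, the example from the text behaves as described: a 3600mm sequence of product 319x may directly follow a 3400mm one, but needs a two-hour break after a 3200mm one.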
4. Mathematical Formulation
In this section, three MILP models are described. The first model schedules the heats at
the continuous casting stage, and the second one uses the results of the first one to
schedule the heats on the treatment units. The third one uses the results of the first one
and attempts to schedule the rest of the stages.
4.1. Model 1: Continuous Casting Scheduling
A MILP model for scheduling a predetermined number of casts (sequences) at the final
stage of the process (continuous casting) was formulated. For the development of this
model, the following assumptions were made:
• The length, product type and width of all sequences are predetermined.
• The processing time of all heats in the same sequence is constant but not fixed.
• All sequences can be processed on any casting machine independently of their
width and product type.
• Not all continuous casting machines are available to start operating at time 0.
• The minimum acceptable processing time of a heat is 70% of its respective
maximum time.
• Each machine has a determined number of positions, and each sequence processed on
that machine occupies a position. In reality, such a parameter does not exist, but it
was added in the formulation to avoid machine conflicts (i.e. no more than one
sequence is processed on the same machine at a time). The number of
positions is set equal to the total number of sequences, so in case all sequences
need to be assigned to one machine due to restrictions, a solution can be obtained.
However, this number can be modified. For instance, if the same number of
sequences needed to be scheduled on all machines, the number of possible
positions could be adjusted accordingly.
4.1.1. Notation
Several notations for defining the problem sets, indices, parameters and variables are
required to formulate the model:
Sets, Indices and Parameters
$P$            Set of product types.
$S$            Set of sequences.
$M$            Set of continuous casting machines.
$V$            Set of possible positions in a continuous casting machine ($v_k = N$ is the last position).
$p, p'$        Indices of product types.
$s, s'$        Indices of sequences.
$m$            Index of continuous casting machines.
$v$            Index of positions.
$N$            Number of total sequences.
$J_s$          Number of heats in sequence $s$.
$W_s$          Width of slabs that will be produced from the heats of sequence $s$.
$Y_{s,p}$      =1 if the heats of sequence $s$ will be used to produce slabs of product type $p$; 0 otherwise.
$MT_p$         Maximum processing time of a heat of product $p$.
$R_{p,p'}$     =1 if set-up time is required for a sequence of product type $p'$ to be processed after a sequence of product type $p$ on the same machine; 0 otherwise.
$AV_m$         Starting time of continuous casting machine $m$.

Decision Variables
$ST_{v,m}$     Starting time of position $v$ on machine $m$.
$ET_{v,m}$     Ending time of position $v$ on machine $m$.
$X_{s,v,m}$    =1 if sequence $s$ is processed on position $v$ of machine $m$; 0 otherwise.
$SR_{s,s',v,m}$ =1 if sequence $s$ is assigned to the $v$-th position of machine $m$, sequence $s'$ to the $(v+1)$-th position, and set-up time is required between sequences $s$ and $s'$; 0 otherwise.
4.1.2. Model Constraints
The constraints developed represent the relationships a solution must meet considering
machine conflicts, production continuity, sequence succession allowance, etc. A
solution must satisfy the following constraints:
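\[ \sum_{m \in M} \sum_{v \in V} X_{s,v,m} = 1 \qquad \text{for all } s \in S \tag{1} \]

\[ \sum_{s \in S} X_{s,v,m} \le 1 \qquad \text{for all } m \in M,\ v \in V \tag{2} \]

\[ X_{s,v,m} + X_{s',v+1,m} - 1 \le SR_{s,s',v,m} \qquad \text{for all } m \in M,\ p, p' \in P,\ s, s' \in S,\ v = 1, \dots, v_k - 1 \text{ with } R_{p,p'} = 1,\ Y_{s,p} = 1,\ Y_{s',p'} = 1 \tag{3} \]

\[ X_{s,v,m} + X_{s',v+1,m} - 1 \le SR_{s,s',v,m} \qquad \text{for all } m \in M,\ s, s' \in S,\ v = 1, \dots, v_k - 1 \text{ with } W_s - W_{s'} > 200 \tag{4} \]

\[ X_{s,v,m} + X_{s',v+1,m} - 1 \le SR_{s,s',v,m} \qquad \text{for all } m \in M,\ s, s' \in S,\ v = 1, \dots, v_k - 1 \text{ with } W_{s'} - W_s > 200 \tag{5} \]

\[ ST_{v,m} = ET_{v-1,m} + 120 \sum_{s \in S} \sum_{s' \in S} SR_{s,s',v-1,m} \qquad \text{for all } m \in M,\ v = 2, \dots, v_k \tag{6} \]

\[ ET_{v,m} \le ST_{v,m} + \sum_{s \in S} \sum_{p \in P} Y_{s,p} \, J_s \, MT_p \, X_{s,v,m} \qquad \text{for all } m \in M,\ v \in V \tag{7} \]

\[ ET_{v,m} \ge ST_{v,m} + \sum_{s \in S} \sum_{p \in P} Y_{s,p} \, J_s \, 0.7 \, MT_p \, X_{s,v,m} \qquad \text{for all } m \in M,\ v \in V \tag{8} \]

\[ \sum_{s \in S} X_{s,v,m} - \sum_{s \in S} X_{s,v+1,m} \ge 0 \qquad \text{for all } m \in M,\ v = 1, \dots, v_k - 1 \tag{9} \]

\[ ST_{1,m} \ge AV_m \qquad \text{for all } m \in M \tag{10} \]

\[ X_{s,v,m} \in \{0, 1\} \qquad \text{for all } s \in S,\ m \in M,\ v \in V \tag{11} \]

\[ SR_{s,s',v,m} \in \{0, 1\} \qquad \text{for all } s, s' \in S,\ m \in M,\ v \in V \tag{12} \]

\[ ST_{v,m} \ge 0 \qquad \text{for all } m \in M,\ v \in V \tag{13} \]

\[ ET_{v,m} \ge 0 \qquad \text{for all } m \in M,\ v \in V \tag{14} \]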
Constraint (1) ensures that each sequence is assigned to exactly one position on exactly one
machine. Constraint (2) means that at most one sequence is assigned to every position of a
machine (i.e. no more than one sequence is processed on the same machine at a time).
Constraints (3) - (5) determine if set-up time is required between two sequences.
Constraint (3) checks if set-up time is required between two sequences because their
assigned product types cannot be processed on the same machine in succession.
Constraints (4) - (5) check if set-up time is required between two sequences due to the
difference in their assigned widths. Constraint (6) sets starting time of a position to be
equal to the ending time of the previous one on the same machine plus any set-up time
if it is required. Thus, continuous casting is ensured.
Constraints (7) – (8) ensure that the processing time of each sequence lies within an
allowed range based on the number of heats it consists of and the minimum and
maximum allowed processing times of the heats of their assigned product type.
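As a worked illustration of constraints (7) and (8), take a hypothetical sequence with $J_s = 4$ heats of a product whose maximum processing time is $MT_p = 50$ minutes (both values are assumed for illustration and are not plant data). If the sequence occupies position v of machine m, the two constraints bound its duration as follows:

```latex
\[
0.7 \cdot 4 \cdot 50 \;\le\; ET_{v,m} - ST_{v,m} \;\le\; 4 \cdot 50
\quad\Longrightarrow\quad
140 \;\le\; ET_{v,m} - ST_{v,m} \;\le\; 200 .
\]
```

For an empty position both bounds are zero, so its starting and ending times coincide.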
Constraint (9) requires the positions of every machine to be filled without leaving any empty
position between two occupied ones. Otherwise, a position could be left unoccupied and
would have the same starting and ending time, equal to the ending time of
the previous position and the starting time of the following one. In this case, if set-up time was
required between two successive sequences that were not occupying two successive
positions, it would not be included in the calculations. For example, consider two
sequences s and s’ whose width difference is more than 200mm, with s’ scheduled to
be processed on machine m right after s. Because of their width difference, set-up time is
required between the processing of the two sequences. Without
constraint (9), s could be assigned to position v and s’ to v+2 while position v+1 remained
empty. In this scenario, constraints (6) – (8) give $ET_{v,m} = ST_{v+1,m} = ET_{v+1,m} = ST_{v+2,m}$,
since constraints (3) – (5) only link adjacent positions and therefore do not force any
$SR$ variable to 1. So, the set-up time is not added and the obtained schedule
is not feasible.
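The effect can be made concrete with assumed numbers. Suppose, hypothetically, that s ends at $ET_{v,m} = 200$ minutes and that s’ requires set-up time after s:

```latex
\begin{align*}
\text{without (9), } s' \text{ at position } v+2:\quad & ST_{v+2,m} = ET_{v+1,m} = ST_{v+1,m} = ET_{v,m} = 200,\\
\text{with (9), } s' \text{ at position } v+1:\quad & ST_{v+1,m} = ET_{v,m} + 120 \cdot SR_{s,s',v,m} = 200 + 120 = 320.
\end{align*}
```

The first schedule starts s’ 120 minutes too early, which is why the empty position must be ruled out.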
Constraint (10) forces the starting time of the first position of each machine to be greater
than or equal to the time the respective machine can begin operating. Constraints (11) – (14)
define the domains of the decision variables.
4.1.3. Objective Function
The aim of the model is to determine a feasible schedule while maximizing productivity.
The objective function is to minimize the total completion time of the sequences. All
sequences on machine m finish with the end of the last position at time $ET_{N,m}$. Thus, the
objective function is formulated as:
\[ \min Z_1 = \sum_{m \in M} ET_{N,m} \tag{15} \]
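To see how constraints (6) – (10) and objective (15) interact, the sketch below evaluates a fixed assignment of sequences to machines. It is not the MILP itself and does not search for an optimum. It assumes every sequence is cast at its fastest allowed rate (0.7 of the maximum time), takes constraint (10) with equality, and decides set-up time from the width rule only (the product rule $R_{p,p'}$ is omitted). The machine names, sequence data and the function names `evaluate` and `needs_setup` are hypothetical.

```python
SETUP_MINUTES = 120   # set-up time added by constraint (6)
MIN_FRACTION = 0.7    # lower bound on processing time, constraint (8)

# Hypothetical instance: three sequences, two casting machines.
AVAILABLE = {"CC1": 0, "CC2": 30}           # AV_m
N_HEATS = {"s1": 4, "s2": 2, "s3": 3}       # J_s
MAX_TIME = {"s1": 50, "s2": 50, "s3": 40}   # MT_p of each sequence's product
WIDTH = {"s1": 3400, "s2": 3600, "s3": 3000}


def needs_setup(s, s_next):
    # Width rule of constraints (4)-(5); the product rule is left out here.
    return abs(WIDTH[s] - WIDTH[s_next]) > 200


def evaluate(schedule, available, n_heats, max_time, setup_rule):
    """Return ({(position, machine): (ST, ET)}, Z1) for a fixed assignment.

    schedule maps each machine to its sequences in position order, with no
    empty position between occupied ones, as constraint (9) requires.
    """
    times = {}
    z1 = 0.0
    for m, order in schedule.items():
        clock = available[m]              # ST_{1,m} = AV_m
        prev = None
        for v, s in enumerate(order, start=1):
            if prev is not None and setup_rule(prev, s):
                clock += SETUP_MINUTES    # set-up term of constraint (6)
            start = clock
            end = start + MIN_FRACTION * n_heats[s] * max_time[s]
            times[(v, m)] = (start, end)
            clock = end
            prev = s
        # Unused trailing positions have zero length, so ET_{N,m} = clock.
        z1 += clock
    return times, z1
```

Calling `evaluate({"CC1": ["s1", "s2"], "CC2": ["s3"]}, AVAILABLE, N_HEATS, MAX_TIME, needs_setup)` avoids any set-up time, whereas placing s3 directly after s1 on the same machine adds the 120-minute break and increases $Z_1$ accordingly.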
4.2. Model 2: Treatment Units Scheduling
A MILP model for scheduling the heats at the treatment units after they have been
scheduled at the continuous casting stage is presented. It should be highlighted that it is
not certain that a solution exists for any output of the previous model. The following
assumptions were made while formulating this model:
• As with the casting machines in the previous model, treatment units have a fixed
number of positions, and each heat processed on a unit occupies one position. In
reality, no such parameter exists, but it was added to the formulation to avoid
machine conflicts. The number of positions is set equal to the total number of jobs,
but this number can be modified.
• No transportation time is considered between the treatment units and the casting
machines. In reality, the transportation time lies within the range of 15-25
minutes.
4.2.1. Notation
The following notation was used in formulating the model:

Sets, Indices and Parameters
$P$   Set of product types.
$J$   Set of heats.
$U$   Set of treatment units.
$K$   Set of possible positions on a treatment unit ($k_l = L$ is the last position).
$p$   Index of product types.
$j$   Index of heats.
$u$   Index of treatment units.
$k$   Index of positions.
$L$   Number of total heats.
$SCC_j$   Starting time of heat j at the continuous casting stage.
$D_{j,p}$   =1 if heat j will be used to produce slabs of product type p; 0 otherwise.
$MaxT_p$   Maximum processing time of a heat of product p.
$MinT_p$   Minimum processing time of a heat of product p.
$MR_{p,u}$   =1 if a heat of product type p cannot be processed on treatment unit u; 0 otherwise.

Decision Variables
$BT_{k,u}$   Starting time of position k on treatment unit u.
$FT_{k,u}$   Ending time of position k on treatment unit u.
$Q_{j,k,u}$   =1 if heat j is processed on position k of treatment unit u.
4.2.2. Constraints
The constraints developed represent the relationships a solution must meet considering
machine conflicts, machine restrictions, etc. A solution must satisfy the following
constraints:
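$$\sum_{u \in U} \sum_{k \in K} Q_{j,k,u} = 1 \quad \text{for all } j \in J \tag{16}$$

$$\sum_{j \in J} Q_{j,k,u} \le 1 \quad \text{for all } u \in U,\ k \in K \tag{17}$$

$$FT_{k,u} = \sum_{j \in J} Q_{j,k,u} \cdot SCC_j \quad \text{for all } u \in U,\ k \in K \tag{18}$$

$$Q_{j,k,u} = 0 \quad \text{for all } u \in U,\ p \in P,\ j \in J,\ k \in K \text{ with } MR_{p,u} = 1,\ D_{j,p} = 1 \tag{19}$$

$$FT_{k,u} \le BT_{k,u} + \sum_{j \in J} \sum_{p \in P} D_{j,p} \cdot MaxT_p \cdot Q_{j,k,u} \quad \text{for all } u \in U,\ k \in K \tag{20}$$

$$FT_{k,u} \ge BT_{k,u} + \sum_{j \in J} \sum_{p \in P} D_{j,p} \cdot MinT_p \cdot Q_{j,k,u} \quad \text{for all } u \in U,\ k \in K \tag{21}$$

$$FT_{k,u} \le BT_{k+1,u} \quad \text{for all } u \in U,\ k = 1, \dots, L-1 \tag{22}$$

$$BT_{k,u} \ge 0 \quad \text{for all } u \in U,\ k \in K \tag{23}$$

$$FT_{k,u} \ge 0 \quad \text{for all } u \in U,\ k \in K \tag{24}$$

$$Q_{j,k,u} \in \{0, 1\} \quad \text{for all } j \in J,\ u \in U,\ k \in K \tag{25}$$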
Constraint (16) ensures that each heat is processed on one and only one machine.
Constraint (17) means that at most one heat is assigned to every position of a machine.
Constraint (18) sets the ending time of a position equal to the starting time at the
continuous casting stage of the heat assigned to this position. Constraint (19) prevents
heats from being treated on units that cannot process their assigned product types.
Constraints (20) – (21) ensure that the processing time of each heat is within the allowed
range for its product type. Constraint (22) requires the ending time of a position to be
less than or equal to the starting time of the next position on the same machine. Constraints
(23) – (25) are added to strengthen the formulation.
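
Constraints (16) – (22) can also be checked directly against a candidate schedule. The following Python sketch is illustrative only and is not the Xpress implementation given in the appendices. The function name `check_model2`, the dictionary-based data layout, the use of a heat-to-product map `ptype` in place of $D_{j,p}$, the set `forbidden` in place of $MR_{p,u}$, and the two-heat instance are all assumptions made for this example.

```python
from itertools import product


def check_model2(Q, BT, FT, SCC, ptype, MaxT, MinT, forbidden,
                 heats, units, positions, tol=1e-9):
    """Return the sorted numbers of violated constraints (16)-(22).

    Q[(j, k, u)] = 1 if heat j occupies position k of unit u (missing keys = 0).
    ptype[j] is the product type of heat j (stands in for D[j, p] = 1).
    forbidden holds (p, u) pairs with MR[p, u] = 1.
    positions must be listed in increasing order.
    """
    violated = set()

    def q(j, k, u):
        return Q.get((j, k, u), 0)

    # (16): every heat occupies exactly one position on exactly one unit.
    for j in heats:
        if sum(q(j, k, u) for k in positions for u in units) != 1:
            violated.add(16)

    for k, u in product(positions, units):
        assigned = [j for j in heats if q(j, k, u) == 1]
        # (17): at most one heat per position.
        if len(assigned) > 1:
            violated.add(17)
        # (18): the position ends when its heat starts casting.
        if abs(FT[k, u] - sum(SCC[j] for j in assigned)) > tol:
            violated.add(18)
        # (19): machine restrictions by product type.
        if any((ptype[j], u) in forbidden for j in assigned):
            violated.add(19)
        # (20)-(21): processing time within [MinT, MaxT] of the product type.
        duration = FT[k, u] - BT[k, u]
        if duration > sum(MaxT[ptype[j]] for j in assigned) + tol:
            violated.add(20)
        if duration < sum(MinT[ptype[j]] for j in assigned) - tol:
            violated.add(21)

    # (22): a position ends before the next one on the same unit starts.
    for u in units:
        for k, k_next in zip(positions, positions[1:]):
            if FT[k, u] > BT[k_next, u] + tol:
                violated.add(22)

    return sorted(violated)


# Hypothetical two-heat instance on a single unit.
common = dict(
    SCC={1: 100.0, 2: 150.0}, ptype={1: "300x", 2: "300x"},
    MaxT={"300x": 45.0}, MinT={"300x": 35.0}, forbidden=set(),
    heats=[1, 2], units=["RH"], positions=[1, 2],
)
Q = {(1, 1, "RH"): 1, (2, 2, "RH"): 1}
FT = {(1, "RH"): 100.0, (2, "RH"): 150.0}

ok = check_model2(Q, {(1, "RH"): 60.0, (2, "RH"): 110.0}, FT, **common)
bad = check_model2(Q, {(1, "RH"): 60.0, (2, "RH"): 95.0}, FT, **common)
print(ok, bad)
```

In the second call the treatment of heat 2 starts at 95, which makes it longer than the maximum of 45 and lets it overlap the previous position, so constraints (20) and (22) are reported.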
4.2.3. Objective Function
The objective function minimizes the processing time of all heats. It is formulated as
the sum of the difference between the ending and the starting time of all positions:
$$\min Z_2 = \sum_{k \in K} \sum_{u \in U} \left( FT_{k,u} - BT_{k,u} \right) \tag{26}$$
4.3. Model 3: Basic Oxygen and Secondary Steelmaking Scheduling
A MILP model for scheduling the heats at the remaining stages after they have been
scheduled at the continuous casting stage is described. This model was developed after
modifying the sub-model of steelmaking and casting flow discussed in Fanti et al. (2016).
It should be highlighted that it is not certain that a solution exists for any output of Model
1. The following assumptions were made while formulating this model:
• No transportation time is considered between the machines at the different stages.
4.3.1. Notation
The following notation was used in formulating the model:

Sets, Indices and Parameters
$P$   Set of product types.
$J$   Set of heats.
$M$   Set of machines.
$I$   Set of stages.
$p$   Index of product types.
$j, k$   Indices of heats.
$m, u$   Indices of machines.
$i$   Index of stages ($i_l$ is the last stage).
$L$   Number of total heats.
$G$   A large number.
$SCC_j$   Starting time of heat j at the continuous casting stage.
$D_{j,p}$   =1 if heat j will be used to produce slabs of product type p; 0 otherwise.
$MaxT_p$   Maximum processing time of a heat of product p at the treatment units.
$MinT_p$   Minimum processing time of a heat of product p at the treatment units.
$MR_{p,m}$   =1 if a heat of product type p cannot be processed on machine m; 0 otherwise.
$SM_{i,m}$   =1 if machine m can be used at stage i; 0 otherwise.
$PT_m$   Processing time of a heat at machine m (this is set to 0 for the treatment units).

Decision Variables
$BT_{j,i,m}$   Starting time of heat j at stage i on machine m.
$FT_{j,i,m}$   Ending time of heat j at stage i on machine m.
$x_{j,i,m}$   =1 if heat j is processed on machine m at stage i.
$y_{j,k,i,m}$   =1 if heats j and k are both processed on machine m at stage i and heat j precedes k.
4.3.2. Constraints
The following constraints represent the relationships a solution must meet considering
machine conflicts, machine restrictions, etc.:
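$$\sum_{m \in M} x_{j,i,m} = 1 \quad \text{for all } i \in I,\ j \in J \tag{27}$$

$$FT_{j,i_l,m} = x_{j,i_l,m} \cdot SCC_j \quad \text{for all } j \in J,\ m \in M \tag{28}$$

$$x_{j,i,m} \le 0 \quad \text{for all } m \in M,\ p \in P,\ j \in J,\ i \in I \text{ with } MR_{p,m} = 1,\ D_{j,p} = 1 \tag{29}$$

$$x_{j,i,m} \le SM_{i,m} \quad \text{for all } m \in M,\ j \in J,\ i \in I \tag{30}$$

$$FT_{j,i_l,m} \ge BT_{j,i_l,m} + \sum_{p \in P} D_{j,p} \cdot MinT_p \cdot x_{j,i_l,m} \quad \text{for all } m \in M,\ j \in J \tag{31}$$

$$FT_{j,i_l,m} \le BT_{j,i_l,m} + \sum_{p \in P} D_{j,p} \cdot MaxT_p \cdot x_{j,i_l,m} \quad \text{for all } m \in M,\ j \in J \tag{32}$$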
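$$FT_{j,i,m} + BT_{j,i,m} \le x_{j,i,m} \cdot G \quad \text{for all } m \in M,\ j \in J,\ i \in I \tag{33}$$

$$FT_{j,i,m} = BT_{j,i,m} + x_{j,i,m} \cdot PT_m \quad \text{for all } m \in M,\ j \in J,\ i = 1, \dots, i_l - 1 \tag{34}$$

$$BT_{j,i+1,u} \le FT_{j,i,m} + (2 - x_{j,i,m} - x_{j,i+1,u}) \cdot G \quad \text{for all } m, u \in M,\ j \in J,\ i = 1, \dots, i_l - 1 \tag{35}$$

$$BT_{j,i+1,u} \ge FT_{j,i,m} - (2 - x_{j,i,m} - x_{j,i+1,u}) \cdot G \quad \text{for all } m, u \in M,\ j \in J,\ i = 1, \dots, i_l - 1 \tag{36}$$

$$BT_{k,i,m} \ge FT_{j,i,m} - (1 - y_{j,k,i,m}) \cdot G \quad \text{for all } m \in M,\ j, k \in J,\ i \in I \tag{37}$$

$$y_{j,k,i,m} + y_{k,j,i,m} \ge x_{j,i,m} + x_{k,i,m} - 1 \quad \text{for all } m \in M,\ j, k \in J,\ i \in I,\ k \ne j \tag{38}$$

$$y_{j,k,i,m} \le x_{j,i,m} \quad \text{for all } m \in M,\ j, k \in J,\ i \in I \tag{39}$$

$$y_{j,k,i,m} \le x_{k,i,m} \quad \text{for all } m \in M,\ j, k \in J,\ i \in I \tag{40}$$

$$BT_{j,i,m} \ge 0 \quad \text{for all } m \in M,\ i \in I,\ j \in J \tag{41}$$

$$FT_{j,i,m} \ge 0 \quad \text{for all } m \in M,\ i \in I,\ j \in J \tag{42}$$

$$x_{j,i,m} \in \{0, 1\} \quad \text{for all } m \in M,\ i \in I,\ j \in J \tag{43}$$

$$y_{j,k,i,m} \in \{0, 1\} \quad \text{for all } m \in M,\ i \in I,\ j, k \in J \tag{44}$$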
Constraint (27) ensures that each heat is processed on one and only one machine at every
stage. Constraint (28) sets the ending time of a heat at the secondary steelmaking stage
equal to its starting time at the continuous casting stage. Constraint (29) prevents heats from
being assigned to machines that cannot process their product type. Constraint (30) forces
heats to be treated on the appropriate machines at each stage. In other words, if machine
m is a treatment unit, then Constraint (30) ensures it is used only for processing heats at
the secondary steelmaking stage.
Constraints (31) – (32) ensure that the processing time of each heat at the secondary
steelmaking stage is within the allowed range for its product type. Constraint (33) sets
the starting and the ending time of a heat on a machine equal to zero if it is not treated on
that machine. Constraint (34) sets the ending time of a heat on a specific machine equal to
its starting time plus the respective processing time at the stages preceding secondary
steelmaking.
Constraints (35) – (36) ensure that the starting time of a heat at a specific stage is equal to
the ending time of that heat at the previous stage. Constraint (37) forces the starting time
of a heat to be greater than or equal to the ending time of all the preceding heats treated on the same
machine. Constraint (38) guarantees that if two heats are processed on the same machine,
then one precedes the other. Constraints (39) – (44) are used to strengthen the formulation.
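
The timing logic of constraints (35) – (38) can be illustrated outside the solver. The Python sketch below is an assumption-laden illustration rather than part of the implemented model: the function name `check_model3_timing`, the schedule layout (one list of `(machine, start, end)` tuples per heat, ordered by stage) and the numbers are hypothetical. It checks that each heat moves to its next stage without waiting and that heats sharing a machine do not overlap.

```python
def check_model3_timing(schedule, tol=1e-9):
    """schedule[j] = [(machine, start, end), ...] ordered by stage for heat j."""
    problems = []

    # (35)-(36): the start at stage i+1 equals the end at stage i.
    for j, ops in schedule.items():
        for (_, _, end_prev), (_, start_next, _) in zip(ops, ops[1:]):
            if abs(start_next - end_prev) > tol:
                problems.append(("no-wait", j))

    # (37)-(38): heats on the same machine are ordered and do not overlap.
    by_machine = {}
    for j, ops in schedule.items():
        for machine, start, end in ops:
            by_machine.setdefault(machine, []).append((start, end, j))
    for machine, ops in by_machine.items():
        ops.sort()  # order by start time
        for (_, end_a, a), (start_b, _, b) in zip(ops, ops[1:]):
            if start_b < end_a - tol:
                problems.append(("overlap", machine, a, b))

    return problems


# Hypothetical two-stage example: vessel blow followed by a treatment unit.
good = check_model3_timing({
    1: [("Vessel1", 0.0, 30.0), ("CAS1", 30.0, 60.0)],
    2: [("Vessel1", 30.0, 60.0), ("CAS1", 60.0, 90.0)],
})
bad = check_model3_timing({
    1: [("Vessel1", 0.0, 30.0), ("CAS1", 30.0, 60.0)],
    2: [("Vessel1", 20.0, 50.0), ("CAS1", 55.0, 85.0)],
})
print(good)
print(bad)
```

Only neighbouring operations (after sorting by start time) are compared, which is enough to detect whether any overlap exists on a machine, although not every overlapping pair is listed.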
4.3.3. Objective Function
The objective function minimizes the processing time of all heats at the treatment units
(the last stage). It is formulated as the sum of the difference between the ending and the
starting time of all jobs at the last stage:
$$\min Z_3 = \sum_{m \in M} \sum_{j \in J} \left( FT_{j,i_l,m} - BT_{j,i_l,m} \right) \tag{45}$$
5. Experimental Tests
The models presented in Section 4 were implemented in the FICO Xpress-Optimizer Solver.
The code developed for the three models can be found in Appendices C, D and E. This
section discusses the results of a case study. A set of six sequences consisting of 48 heats
in total was considered. A detailed description of the data is given in Tables 1 and 2.
Table 1. Sequences - Case Study 1
| Sequence | Heats | Product Type | Width (mm) | Possible Treatment Units | Max Treatment Time (min) | Min Treatment Time (min) | Max Casting Time (min) |
|---|---|---|---|---|---|---|---|
| 1 | 6 | 300x | 3200 | RH, RD | 45 | 35 | 118 |
| 2 | 10 | 319x | 2000 | CAS1, CAS2, RH, RD | 35 | 25 | 72.6 |
| 3 | 8 | 336x | 1800 | CAS1, CAS2, RH, RD | 35 | 25 | 72.6 |
| 4 | 8 | 345x | 2200 | CAS1, CAS2 | 35 | 25 | 72.6 |
| 5 | 8 | 319x | 1800 | CAS1, CAS2, RH, RD | 35 | 25 | 72.6 |
| 6 | 8 | 336x | 2000 | CAS1, CAS2, RH, RD | 35 | 25 | 72.6 |
Table 2. Availability of the continuous casting machines - Case Study 1
| Continuous Casting Machine | Starting Time |
|---|---|
| CC1 | 0 |
| CC2 | 20 |
| CC3 | 0 |
Figure 3 visualizes the results obtained from Model 1 (after they were combined
with the results of Model 2, so the starting times of machines 1 and 3 are not 0). The allocation
of the sequences to the continuous casting machines and their respective starting and
ending times are presented in the graph. It was verified that all constraints included in
the model were satisfied: no more than one sequence is processed on a machine
at a time, all sequences are treated, casting is continuous, restrictions on the order
of sequences on the machines due to product types and width are satisfied, and all casters
start operating after the time they become available.
Figure 3. Allocation of sequences to continuous casting machines – Case Study 1
The results acquired from Model 1 were used as input for Model 2. However, data
preprocessing was required before attempting to solve Model 2. The starting time of each
heat at the continuous casting stage was required. If sequence s was assigned to position
v of machine m, then the starting times of the heats in this sequence were determined as follows:
$$SCC_j = ST_{v,m} + (j - 1) \cdot \frac{ET_{v,m} - ST_{v,m}}{J_s} \quad \text{for all } j \in J_s \tag{46}$$
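
As a small illustration of equation (46), the Python sketch below spreads the heats of one sequence evenly over the time its position occupies on the caster. It is a sketch under assumptions: the function name `heat_start_times` and the numbers are hypothetical, and j is taken to be the position of the heat within its sequence.

```python
def heat_start_times(st, et, n_heats):
    """Casting start time of each heat of a sequence, following equation (46).

    st, et  : starting and ending time of the position holding the sequence
    n_heats : number of heats in the sequence (J_s)
    """
    per_heat = (et - st) / n_heats
    return [st + (j - 1) * per_heat for j in range(1, n_heats + 1)]


# Hypothetical sequence of 8 heats occupying a position from t = 20 to t = 420.
starts = heat_start_times(st=20.0, et=420.0, n_heats=8)
print(starts)
```

Each heat is given the same casting duration, which matches the assumption of Model 1 that all heats of a sequence share one processing time.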
Using this data and Model 2, the schedules shown in Figures 4 to 7 were obtained for the
four treatment units. Again, all constraints considered in Model 2 were satisfied. All heats
are treated on the appropriate units based on product type, no more than one heat is
processed on a unit at any instant, and the ending times of the heats at the treatment
stage are equal to their respective starting times at the continuous casting stage.
[Plot for Figure 3: horizontal axis Time (0 to 1200); vertical axis CC Machines (CC1, CC2, CC3); bars labelled Sequence 1 to Sequence 6.]
Figure 4. Treatment Unit 1 (RH) Schedule – Case Study 1
Figure 5. Treatment Unit 2 (RD) Schedule – Case Study 1
Figure 6. Treatment Unit 3 (CAS1) Schedule – Case Study 1
[Plots for Figures 4 to 6: horizontal axis Time; vertical axis Heats.]
Figure 7. Treatment Unit 4 (CAS2) Schedule – Case Study 1
Furthermore, the results of Models 1 and 2 were combined and the schedule presented in
Figure 8 was obtained. Each line (color) represents a heat, each horizontal segment of a
line represents the processing of that heat at the respective unit/machine, and the bullets
mark the starting and ending times on each machine. The vertical segments connect the
machines on which a heat is treated at the different stages. Table 3 indicates which
machine corresponds to each number on the vertical axis of the graph. It is observed that
a feasible schedule for the continuous casting machines and the treatment units combined
can be obtained from Models 1 and 2.
Figure 8. Heats Schedule – Case Study 1
[Plot for Figure 7: horizontal axis Time; vertical axis Heats. Plot for Figure 8: horizontal axis Time; vertical axis Machines (1 to 7).]
Table 3. Machine ID Number – Figure 8
| ID Number | Machine |
|---|---|
| 1 | Treatment Unit 1 (CAS1) |
| 2 | Treatment Unit 2 (CAS2) |
| 3 | Treatment Unit 3 (RH) |
| 4 | Treatment Unit 4 (RD) |
| 5 | Continuous Caster 1 (CC1) |
| 6 | Continuous Caster 2 (CC2) |
| 7 | Continuous Caster 3 (CC3) |
The same data as in Model 2 were used as input for Model 3. Two attempts were made to solve it.
However, both runs were terminated by the user before a solution was obtained due to
time restrictions. The first run was terminated after 76043.6 s (approx. 21 hours) and the
second one after 15204.8 s (approx. 4.2 hours). To check the validity of Model 3, a set of
10 heats was used as input. More details of the input data can be found in Table 4.
Table 4. Heats – Case Study 2
| Heat | Continuous Casting Starting Time | Product Type |
|---|---|---|
| 1 | 50.82 | 336x |
| 2 | 101.64 | 336x |
| 3 | 152.46 | 300x |
| 4 | 203.28 | 300x |
| 5 | 254.1 | 336x |
| 6 | 304.92 | 336x |
| 7 | 355.74 | 319x |
| 8 | 406.56 | 319x |
| 9 | 457.38 | 345x |
| 10 | 508.2 | 345x |
The schedule obtained after solving Model 3 is shown in Figure 9. Table 5 shows which
machine corresponds to each number on the vertical axis of the graph. It can be observed
that all constraints (machine conflicts, processing times, continuity, etc.) were satisfied.
However, due to time restrictions, further research is required on whether this model could
be used for business purposes.
Figure 9. Heats Schedule - Case Study 2
Table 5. Machine ID Number – Figure 9
| ID Number | Machine |
|---|---|
| 1 | Hot Metal Pouring (HM1) |
| 2 | Hot Metal Pouring (HM2) |
| 3 | Desulphurization (DeS1) |
| 4 | Desulphurization (DeS2) |
| 5 | Vessel Blow (Vessel1) |
| 6 | Vessel Blow (Vessel2) |
| 7 | Treatment Unit 1 (CAS1) |
| 8 | Treatment Unit 2 (CAS2) |
| 9 | Treatment Unit 3 (RH) |
| 10 | Treatment Unit 4 (RD) |
[Plot for Figure 9: horizontal axis Time (0 to 550); vertical axis Machines (1 to 10).]
6. Recommendations for Further Development
The SCC scheduling problem presented by TATA Steel proved to be challenging and an
end-to-end solution could not be developed in the required time frame. Further research
and investigation into whether other solution methods can be applied effectively is required.
In this section, the challenges faced in formulating a mathematical model for the process
are discussed. Furthermore, recommendations on how this project could be continued are
given.
The SCC process at TATA Steel Port Talbot involves numerous constraints and decision
variables while the predetermined parameters are limited. It was determined that the
decision variables needed to be reduced to formulate a MILP model. Hence, a model in
which the number of sequences to be scheduled, the number of heats per sequence and
their product type are input parameters was developed. The majority of literature
examples studied follow a similar approach. The grouping of heats (sequences) is an input
parameter determined by solving a batching problem that considers demand (or received
orders and their delivery time). Thus, it is proposed that a different model is developed
for this purpose, so its output can be used as input for the continuous casting model.
As already discussed in Section 2, several approaches found in literature decompose the
SCC scheduling problem into sub-problems to obtain a solution. Due to the size of the
considered problem and the large number of constraints, a similar approach is
recommended. Combining a batching problem for grouping heats and the two models
introduced in section 4 with additional ones could lead to obtaining a schedule for the
whole SCC process. The two developed sub-models schedule the last two stages of the
SCC process. One or more additional sub-models that schedule the rest of the stages are
necessary for obtaining an end-to-end solution. The rest of the stages could be modelled
as a single-stage hybrid flowshop problem since processing times are constant for all
heats, there are only two parallel machines and no machine-product type restrictions exist.
However, in this case, a feasible schedule cannot be obtained if one of the parallel
machines is unavailable.
Another challenge is combining the different sub-models. This difficulty derives from
accounting for both machine conflicts and varying processing times. Considering the sub-models
presented in this paper, two possible ways of connecting them are introducing
nonlinear constraints or developing a heuristic. For example, consider Models 1, 2 and 3
formulated in Section 4 along with the following parameters:
$F_{j,s}$   =1 if heat j belongs to sequence s; 0 otherwise.
$HV_j$   The position of heat j within the sequence s for which $F_{j,s} = 1$.
And the following decision variable:
$HP_{j,p}$   =1 if heat j is assigned to a sequence of product type p.
Then, Model 1 can be connected with Model 2 or 3 by adding the nonlinear
constraint (47) and constraint (48):
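$$SCC_j = \sum_{s \in S} F_{j,s} \cdot \left\{ \sum_{m \in M} \sum_{v \in V} X_{s,v,m} \cdot \left[ ST_{v,m} + (HV_j - 1) \cdot \frac{ET_{v,m} - ST_{v,m}}{J_s} \right] \right\} \quad \text{for all } j \in J \tag{47}$$

$$HP_{j,p} = \sum_{s \in S} F_{j,s} \cdot Y_{s,p} \quad \text{for all } j \in J,\ p \in P \tag{48}$$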
In this way, if Model 2 (or 3) cannot find a feasible solution for the output of the
continuous casting sub-model the processing time of the heats at the continuous casting
stage will be adjusted accordingly. Appropriate methods for solving a nonlinear problem
or converting such a model to a linear model should be investigated.
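
One standard route to a linear model, offered here as a sketch rather than a tested extension of the models, is to replace each product of a binary variable and a bounded continuous variable by an auxiliary variable. In constraint (47) the nonlinear terms are the products $X_{s,v,m} \cdot ST_{v,m}$ and $X_{s,v,m} \cdot ET_{v,m}$. Assuming that the large constant $G$ of Model 3 is an upper bound on every starting time, and introducing a new, hypothetical continuous variable $A_{s,v,m}$ that stands for $X_{s,v,m} \cdot ST_{v,m}$, the following linear constraints force $A_{s,v,m} = ST_{v,m}$ when $X_{s,v,m} = 1$ and $A_{s,v,m} = 0$ otherwise:

```latex
\begin{aligned}
A_{s,v,m} &\le G \cdot X_{s,v,m}, \\
A_{s,v,m} &\le ST_{v,m}, \\
A_{s,v,m} &\ge ST_{v,m} - G \cdot (1 - X_{s,v,m}), \\
A_{s,v,m} &\ge 0
\end{aligned}
\qquad \text{for all } s \in S,\ v \in V,\ m \in M
```

The same construction with a second auxiliary variable handles $X_{s,v,m} \cdot ET_{v,m}$. Constraint (47) then becomes linear in the auxiliary variables, because $F_{j,s}$, $HV_j$ and $J_s$ are parameters. The price is a larger number of variables and big-M constraints, so whether this is practical for the full problem would still need to be tested.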
As for the heuristic option, an algorithm that manipulates the continuous casting solution
so that a feasible solution for the treatment units scheduling model can be obtained is
suggested. This can be achieved by finding feasible solutions of the casting problem with
increased cost (i.e. total completion time) and using them as input for Model 2 or Model
3 until a feasible solution is obtained. A way of finding feasible solutions of the casting
scheduling problem while the cost increases gradually must be established. This process
could be repeated for a specified number of iterations and, if a solution is not obtained, then
the heat grouping could be changed. A visualization of such a process is presented
in Figure 10.
Figure 10. Proposed Heuristic
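
The control flow of the proposed heuristic can be sketched in Python as follows. This is a minimal outline under stated assumptions, not an implementation of Models 1 to 3: `connect_models`, `toy_casting_solutions` and `toy_solve_treatment` are hypothetical names, and the two toy functions only stand in for the casting model (returning schedules in order of increasing cost) and for Model 2 or 3 (returning a schedule, or `None` when infeasible).

```python
def connect_models(groupings, casting_solutions, solve_treatment, max_iterations):
    """Try casting schedules of increasing cost until the upstream model is feasible.

    groupings         : candidate heat groupings, tried in order
    casting_solutions : function(grouping) -> iterable of casting schedules,
                        ordered by non-decreasing total completion time
    solve_treatment   : function(casting) -> schedule, or None if infeasible
    max_iterations    : casting schedules tried per grouping before regrouping
    """
    for grouping in groupings:
        for iteration, casting in enumerate(casting_solutions(grouping)):
            if iteration >= max_iterations:
                break  # give up on this grouping and change the heat grouping
            treatment = solve_treatment(casting)
            if treatment is not None:
                return grouping, casting, treatment
    return None  # no feasible combination found


# Toy stand-ins (assumptions for illustration only).
def toy_casting_solutions(grouping):
    base_cost = 1000 + 10 * len(grouping)
    for extra in range(0, 100, 20):
        yield {"grouping": grouping, "cost": base_cost + extra}


def toy_solve_treatment(casting):
    return {"feasible": True} if casting["cost"] >= 1070 else None


result = connect_models(
    groupings=[("a", "b"), ("a", "b", "c")],
    casting_solutions=toy_casting_solutions,
    solve_treatment=toy_solve_treatment,
    max_iterations=3,
)
print(result)
```

In this toy run the first grouping exhausts its three iterations without success, so the second grouping is tried and accepted at its third casting schedule.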
If Model 2 is used, the next step would be to schedule the remaining stages of the
process. One or more sub-models could be developed for this purpose. Connecting any
additional sub-models with the existing ones could be achieved as discussed above.
However, attempting to schedule the whole SCC process by solving a single MILP problem
may have several disadvantages. Due to the size and complexity of the considered problem,
such an approach could require an amount of time that is impractical for business purposes
or result in an NP-hard problem. Although the SCC scheduling problem can be
considered a realistic hybrid flowshop problem and it has been proven that a two-stage
hybrid flowshop problem is NP-complete (Pan et al., 2013; Li et al., 2014; Gupta, 1988),
further investigation is required to determine whether this approach would be practical for TATA
Steel. Their SCC process consists of five stages, without considering transportation
between the machines, and several extra practical constraints. For this reason, different
methods such as artificial intelligence and heuristics could be researched further. As
mentioned in Section 2, many researchers have used these techniques to solve similar
problems. Formulating linear or nonlinear models may prove useful as a first step towards
understanding the specifics of the problem; however, feasible solutions could be obtained in
less time using these methods.
Furthermore, additional improvements to the existing models can be made. The proposed
continuous casting scheduling model assumes that the processing time for all heats of a
sequence is the same; however, it can vary as long as it is within the respective acceptable
range. Finding a way to integrate this into the model would give more flexibility to the
treatment unit scheduling model and result in solutions with reduced cost. In the
formulated models, transportation time between two stages is not considered; including
it is required for obtaining a schedule that is feasible in practice. It should be highlighted that the
transportation times between the vessel blow and the treatment stages, and between the treatment
and the continuous casting stages, are not fixed, but they must be within specified bounds.
Adjusting these values could result in reduced-cost solutions, but it could not be
determined how this could be integrated into the model. The transportation times between
the rest of the stages are constant and independent of product type and of which machine is
used.
Although the processing and transportation times at several stages need to lie within a
specified range and are not predetermined, target values have been set for all of them.
Another improvement would be to force the actual times to be as close as possible to the
target values. This could be achieved by introducing new variables that represent the
difference of the actual processing times from their target values and adding them to the
objective function. Larger differences would result in increased cost. If this factor is
not considered as important as the total completion time, a weighted objective function
could be used. Thus, it can be controlled which factors influence the cost the most.
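
One possible way of writing this down, given as a sketch in the notation of Model 2, uses a target processing time $T_p$ for each product type, two non-negative deviation variables $d^{+}_{k,u}$ and $d^{-}_{k,u}$ per position, and weights $\alpha$ and $\beta$. All four symbols are hypothetical additions that do not appear in the formulated models, and $Z$ stands for the original cost term (for example, the total completion time).

```latex
\begin{aligned}
& FT_{k,u} - BT_{k,u} - \sum_{j \in J} \sum_{p \in P} D_{j,p} \cdot T_p \cdot Q_{j,k,u}
  = d^{+}_{k,u} - d^{-}_{k,u}
  && \text{for all } u \in U,\ k \in K \\
& d^{+}_{k,u} \ge 0, \quad d^{-}_{k,u} \ge 0
  && \text{for all } u \in U,\ k \in K \\
& \min \ \alpha \cdot Z + \beta \cdot \sum_{k \in K} \sum_{u \in U} \left( d^{+}_{k,u} + d^{-}_{k,u} \right)
\end{aligned}
```

With $\beta > 0$, at most one of the two deviation variables is positive at an optimum, so their sum equals the absolute deviation from the target while the model stays linear. The ratio of $\alpha$ to $\beta$ expresses how much weight the target values receive relative to the original cost.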
Additionally, a significant constraint was not included in the models formulated in this
paper. As mentioned in Section 3, hot metal is produced at a variable rate and the hot
metal stock must constantly remain between an acceptable lower and upper bound. The difficulty
with adding such a constraint is that a time variable is not included, only the starting and
ending times of processing at the different stages. As a result, the hot metal stock cannot
be calculated at all instants.
If a complete scheduling system is developed, the next stage is to integrate the maintenance
schedule into it. This means that one or more machines are not available for several time
intervals. If the timing of maintenance of a machine is known, then a possible way of
formulating this constraint is to require every heat assigned to that machine either to end
before the maintenance starts or to start after the maintenance ends.
If the maintenance of a machine happens after a specific
number of heats has been processed, integrating the maintenance schedule into the
production schedule is more complicated. Also, more input data, such as when each
machine was last maintained, are required.
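
For the case of a known maintenance window, a big-M formulation in the notation of Model 3 could look as follows. This is a sketch under assumptions: $MS_m$ and $ME_m$ (start and end of the maintenance of machine m) and the binary variable $z_{j,i,m}$ (=1 if heat j is processed on machine m before the maintenance) are hypothetical symbols that are not part of the formulated models.

```latex
\begin{aligned}
FT_{j,i,m} &\le MS_m + G \cdot (1 - z_{j,i,m}) + G \cdot (1 - x_{j,i,m}) \\
BT_{j,i,m} &\ge ME_m - G \cdot z_{j,i,m} - G \cdot (1 - x_{j,i,m}) \\
z_{j,i,m} &\in \{0, 1\}
\end{aligned}
\qquad \text{for all } m \in M,\ j \in J,\ i \in I
```

If heat j is processed on machine m ($x_{j,i,m} = 1$), then either it ends before the maintenance starts ($z_{j,i,m} = 1$) or it starts after the maintenance ends ($z_{j,i,m} = 0$). If the heat is not processed on that machine, both constraints are relaxed by the term $G \cdot (1 - x_{j,i,m})$.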
The use of different software or a programming language to further develop the
scheduling system is required. The models formulated in this paper were implemented in
the FICO Xpress-Optimizer Solver. However, designing a complete scheduling system in it
seems inefficient due to its limited capabilities. Furthermore, the majority of the models
discussed in Section 2 were implemented in programming languages such as C++ and C#.
As a result, shifting from the FICO Xpress-Optimizer Solver to a
programming language is recommended, especially if a heuristic or artificial intelligence method is
attempted. Moreover, the FICO Xpress Optimizer has interfaces through a library
accessible from C/C++, .NET, Java and Visual Basic for Applications (VBA). Thus, the
models presented in this paper could be integrated into a scheduling system developed in
one of these languages.
7. Conclusion
In conclusion, the SCC scheduling problem as presented by TATA Steel Port Talbot Works
is challenging and significantly more complex than similar problems found in the
literature. The main difficulty derives from the fact that various product types need to be
considered and heats must be scheduled in sequences at the continuous casting stage.
Combining the above with machine conflict resolution and non-fixed processing times at
various stages results in great difficulties that must be overcome when mathematically
formulating the problem.
It is suggested that the problem be divided into simpler sub-problems that can be solved separately and
combined at the end. In this paper, three models are presented. Model 1 attempts to
schedule a set of sequences at the continuous casting stage, while Model 2 uses the results
of Model 1 to schedule the heats at the secondary steelmaking stage. Model 3 attempts to
schedule the heats at all stages before continuous casting based on the results of Model 1.
It should be highlighted that a feasible solution of Model 2 or 3 is not guaranteed to exist for every
output of Model 1. Additionally, during an attempt to schedule a total of 48 heats at all
stages, results could not be obtained from Model 3 in a reasonable time frame. However,
a feasible solution was obtained from both Models 1 and 2.
Further research on the mathematical formulation and solution methods of the problem is
necessary. The models discussed in the paper could serve as part of a larger scheduling
system or as guidance for developing a different solution method. It is suggested that the
application of heuristic methods be studied, since a MILP model may be inefficient for
business purposes due to the large scale of the problem and the numerous constraints.
References
Atighehchian, A., Bijari, M. and Tarkesh, H. (2009). A novel hybrid algorithm for scheduling steel-making continuous casting production. Computers & Operations Research, [online] 36(8), pp.2450-2461. Available at: https://www.sciencedirect.com/science/article/pii/S0305054808001937 [Accessed 21 Aug. 2019].
Bellabdaoui, A. and Teghem, J. (2006). A mixed-integer linear programming model for the continuous casting planning. International Journal of Production Economics, [online] 104(2), pp.260-270. Available at: https://www.sciencedirect.com/science/article/pii/S0925527304004384 [Accessed 18 Aug. 2019].
Dutta, G. and Fourer, R. (2001). A Survey of Mathematical Programming Applications in Integrated Steel Plants. Manufacturing & Service Operations Management, [online] 3(4), pp.387-400. Available at: https://www.scholars.northwestern.edu/en/publications/a-survey-of-mathematical-programming-applications-in-integrated-s [Accessed 22 Aug. 2019].
Fanti, M., Rotunno, G., Stecco, G., Ukovich, W. and Mininel, S. (2016). An Integrated System for Production Scheduling in Steelmaking and Casting Plants. IEEE Transactions on Automation Science and Engineering, [online] 13(2), pp.1112-1128. Available at: https://www.researchgate.net/publication/282427141_An_Integrated_System_for_Production_Scheduling_in_Steelmaking_and_Casting_Plants [Accessed 18 Aug. 2019].
Gupta, J. (1988). Two-Stage, Hybrid Flowshop Scheduling Problem. The Journal of the Operational Research Society, [online] 39(4), p.359. Available at: https://www.jstor.org/stable/2582115 [Accessed 22 Aug. 2019].
Harjunkoski, I. and Grossmann, I. (2001). A decomposition approach for the scheduling of a steel plant production. Computers & Chemical Engineering, [online] 25(11-12), pp.1647-1660. Available at: https://www.sciencedirect.com/science/article/pii/S0098135401007293 [Accessed 22 Aug. 2019].
Li, J., Pan, Q., Mao, K. and Suganthan, P. (2014). Solving the steelmaking casting problem using an effective fruit fly optimisation algorithm. Knowledge-Based Systems, [online] 72, pp.28-36. Available at: https://www.sciencedirect.com/science/article/pii/S0950705114003220#b0105 [Accessed 21 Aug. 2019].
Li, J., Xiao, X., Tang, Q. and Floudas, C. (2012). Production Scheduling of a Large-Scale Steelmaking Continuous Casting Process via Unit-Specific Event-Based Continuous-Time Models: Short-Term and Medium-Term Scheduling. Industrial & Engineering Chemistry Research, [online] 51(21), pp.7300-7319. Available at: https://pubs.acs.org/doi/full/10.1021/ie2015944 [Accessed 15 Aug. 2019].
Mao, K., Pan, Q., Pang, X. and Chai, T. (2014). A novel Lagrangian relaxation approach for a hybrid flowshop scheduling problem in the steelmaking-continuous casting process. European Journal of Operational Research, [online] 236(1), pp.51-60. Available at: https://www.sciencedirect.com/science/article/pii/S0377221713009090 [Accessed 21 Aug. 2019].
Missbauer, H., Hauber, W. and Stadler, W. (2009). A scheduling system for the steelmaking-continuous casting process. A case study from the steel-making industry. International Journal of Production Research, 47(15), pp.4147-4172.
Naphade, K., Wu, S., Storer, R. and Doshi, B. (2001). Melt Scheduling to Trade Off Material Waste and Shipping Performance. Operations Research, [online] 49(5), pp.629-645. Available at: https://pubsonline.informs.org/doi/abs/10.1287/opre.49.5.629.10611 [Accessed 21 Aug. 2019].
Pacciarelli, D. and Pranzo, M. (2004). Production scheduling in a steelmaking-continuous casting plant. Computers & Chemical Engineering, [online] 28(12), pp.2823-2835. Available at: https://www.sciencedirect.com/science/article/pii/S0098135404002637 [Accessed 22 Aug. 2019].
Pan, Q., Wang, L., Mao, K., Zhao, J. and Zhang, M. (2013). An Effective Artificial Bee Colony Algorithm for a Real-World Hybrid Flowshop Problem in Steelmaking Process. IEEE Transactions on Automation Science and Engineering, 10(2), pp.307- 322.
Sbihi, A., Bellabdaoui, A. and Teghem, J. (2014). Solving a mixed integer linear program with times setup for the steel-continuous casting planning and scheduling problem. International Journal of Production Research, 52(24), pp.7276-7296.
Tang, L. and Liu, G. (2007). A mathematical programming model and solution for scheduling production orders in Shanghai Baoshan Iron and Steel Complex. European Journal of Operational Research, [online] 182(3), pp.1453-1468. Available at: https://www.sciencedirect.com/science/article/pii/S0377221706009830 [Accessed 21 Aug. 2019].
Tang, L., Liu, J., Rong, A. and Yang, Z. (2000). A mathematical programming model for scheduling steelmaking-continuous casting production. European Journal of Operational Research, [online] 120(2), pp.423-435. Available at: https://www.sciencedirect.com/science/article/pii/S0377221799000417 [Accessed 15 Aug. 2019].
Tang, L., Luh, P., Liu, J. and Fang, L. (2002). Steel-making process scheduling using Lagrangian relaxation. International Journal of Production Research, [online] 40(1), pp.55-70. Available at: https://www.tandfonline.com/doi/abs/10.1080/00207540110073000 [Accessed 16 Aug. 2019].
Tang, L. and Wang, G. (2008). Decision support system for the batching problems of steelmaking and continuous-casting production. Omega, [online] 36(6), pp.976-991. Available at: https://www.sciencedirect.com/science/article/pii/S0305048307001223 [Accessed 15 Aug. 2019].
Tang, L., Zhao, Y. and Liu, J. (2014). An Improved Differential Evolution Algorithm for Practical Dynamic Scheduling in Steelmaking-Continuous Casting Production. IEEE Transactions on Evolutionary Computation, [online] 18(2), pp.209-225. Available at: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6473881 [Accessed 15 Aug. 2019].
Tan, Y. and Liu, S. (2013). Models and optimisation approaches for scheduling steelmaking–refining–continuous casting production under variable electricity price. International Journal of Production Research, 52(4), pp.1032-1049.
Zhao, Y., Jia, F., Wang, G. and Wang, L. (2011). A hybrid tabu search for steelmaking-continuous casting production scheduling problem. In: 2011 International Symposium on Advanced Control of Industrial Processes (ADCONIP). [online] IEEE, pp.535-540. Available at: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5930486&isnumber=5930387 [Accessed 22 Aug. 2019].
Zhu, D., Zheng, Z. and Gao, X. (2010). Intelligent Optimization-Based Production Planning and Simulation Analysis for Steelmaking and Continuous Casting Process. Journal of Iron and Steel Research International, [online] 17(9), pp.19-24. Available at: https://link.springer.com/content/pdf/10.1016/S1006-706X%2810%2960136-7.pdf [Accessed 15 Aug. 2019].
Appendix A: Flying Tundish Change (FTC) between Products
Appendix B: Flying Tundish Change (FTC) between Products
The casting time depends on the casting speed, which is calculated from a variety of factors such as the continuous casting machine and the product type.
Appendix C: Xpress Code Model 1
!@encoding CP1252
model Model1
uses "mmxprs"; !gain access to the Xpress-Optimizer solver

!optional parameters section
parameters
  ! SAMPLEPARAM1='c:\test\'
  ! SAMPLEPARAM2=false
  PROJECTDIR='' ! for when file is added to project
end-parameters

!sample declarations section
declarations
  products=1..4
  totalsequences=6
  totalpositions=6
  sequences=1..totalsequences
  seqj:array(sequences)of integer
  restr:array(products,products)of integer
  positions=1..totalpositions
  machines=1..3
  seqw:array(sequences)of integer
  seqp:array(sequences,products)of integer
  MR:array(products,machines)of integer
  maxtime:array(products)of real
  AV:array(machines)of real
  ST:array(positions, machines) of mpvar
  ET:array(positions, machines) of mpvar
  w:array(sequences,positions,machines) of mpvar
  SR:array(sequences,sequences,positions,machines)of mpvar
  ! ...
  Objective:linctr
end-declarations

seqw::[3200,2000,1800,2200,1800,2000]
seqj::[6,10,8,8,8,8]
seqp::[1,0,0,0,
       0,1,0,0,
       0,0,1,0,
       0,0,0,1,
       0,1,0,0,
       0,0,1,0]
maxtime::[118,72.6,72.6,72.6]
restr::[0,0,0,0,
        0,0,0,0,
        0,0,0,0,
        0,0,0,0]
AV::[0,20,0]

forall(s in sequences,v in positions, m in machines)w(s,v,m)is_binary
forall(s1 in sequences,s2 in sequences,v in positions, m in machines)SR(s1,s2,v,m)is_binary

! each sequence takes exactly one position on one machine
forall(s in sequences)sum(m in machines,v in positions)w(s,v,m)=1
! each position on a machine holds at most one sequence
forall(m in machines,v in positions)sum(s in sequences)w(s,v,m)<=1
! positions on a machine are filled consecutively, without gaps
forall(m in machines,v in 1..(totalpositions-1)) do
  sum(s in sequences)w(s,v,m)-sum(s in sequences)w(s,v+1,m)>=0
end-do

! SR(s1,s2,v,m)=1 when s2 directly follows s1 and the product pair is flagged in restr
forall(p1 in products, p2 in products,s1 in sequences,s2 in sequences)do
  if restr(p1,p2)=1 and seqp(s1,p1)=1 and seqp(s2,p2)=1 then
    forall(m in machines,v in 1..(totalpositions-1))do
      w(s1,v,m)+w(s2,v+1,m)-1<=SR(s1,s2,v,m)
    end-do
  end-if
end-do

! SR(s1,s2,v,m)=1 when s2 directly follows s1 and their seqw values differ by more than 200
forall(p1 in products, p2 in products,s1 in sequences,s2 in sequences)do
  if seqw(s1)-seqw(s2)>200 or seqw(s2)-seqw(s1)>200 then
    forall(m in machines,v in 1..(totalpositions-1))do
      w(s1,v,m)+w(s2,v+1,m)-1<=SR(s1,s2,v,m)
    end-do
  end-if
end-do

forall(m in machines)ST(1,m)>=AV(m)
forall(m in machines, v in positions)ST(v,m)>=0
! a position starts when the previous one ends, plus 120 for each SR set at the previous position
forall(m in machines,v in 2..totalpositions)ST(v,m)=ET(v-1,m)+120*sum(s1 in sequences) sum(s2 in sequences)SR(s1,s2,v-1,m)
! duration of a position lies between 70% and 100% of seqj(s)*maxtime(p);
! the two bounds were interchanged in the extracted listing, which made the model infeasible
forall(m in machines,v in positions)ET(v,m)>=ST(v,m)+sum(s in sequences)sum(p in products)seqp(s,p)*seqj(s)*(0.7*maxtime(p))*w(s,v,m)
forall(m in machines,v in positions)ET(v,m)<=ST(v,m)+sum(s in sequences)sum(p in products)seqp(s,p)*seqj(s)*maxtime(p)*w(s,v,m)

totaltime:=sum(m in machines)ET(totalpositions,m)

if PROJECTDIR <> '' then
  setparam('workdir', PROJECTDIR)
  writeln("Project directory: " + PROJECTDIR)
end-if

writeln("Begin running model")
minimize(totaltime)

! print one line per charge: sequence, start, end, machine
forall(m in machines,v in positions,s in sequences)do
  if getsol(w(s,v,m))=1 then
    forall(i in 1..seqj(s))writeln(s,", ",getsol(ST(v,m))+(i-1)*((getsol(ET(v,m))-getsol(ST(v,m)))/seqj(s)),", ",getsol(ST(v,m))+i*((getsol(ET(v,m))-getsol(ST(v,m)))/seqj(s)),", ",m)
  end-if
end-do
!...
writeln("End running model")
end-model
Appendix D: Xpress Code Model 2
!@encoding CP1252
model Model2
uses "mmxprs"; !gain access to the Xpress-Optimizer solver

!optional parameters section
parameters
  ! SAMPLEPARAM1='c:\test\'
  ! SAMPLEPARAM2=false
  PROJECTDIR='' ! for when file is added to project
end-parameters

!sample declarations section
declarations
  totaljobs=48
  jobs=1..totaljobs
  products=1..4
  machines=1..4
  positions=1..totaljobs
  SCC:array(jobs)of real
  JP:array(jobs,products)of integer
  MaxT:array(products)of integer
  MinT:array(products)of integer
  MR:array(products,machines)of integer
  ST:array(positions,machines)of mpvar
  ET:array(positions,machines)of mpvar
  x:array(jobs,positions,machines)of mpvar
  ! ...
  Objective:linctr
end-declarations

SCC::[2000, 2050.82, 2101.64, 2152.46, 2203.28, 2254.1, 2304.92, 2355.74,
      2406.56, 2457.38, 2508.2, 2559.02, 2609.84, 2660.66, 2711.48, 2762.3,
      2813.12, 2863.94, 2914.76, 2965.58, 3016.4, 3067.22, 3118.04, 3168.86,
      2020, 2102.6, 2185.2, 2267.8, 2350.4, 2433,
      2000, 2050.82, 2101.64, 2152.46, 2203.28, 2254.1, 2304.92, 2355.74,
      2406.56, 2457.38, 2508.2, 2559.02, 2609.84, 2660.66, 2711.48, 2762.3,
      2813.12, 2863.94]
JP::[0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0,
     0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0,
     0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0,
     0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0,
     0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0,
     0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0,
     1,0,0,0, 1,0,0,0, 1,0,0,0, 1,0,0,0,
     1,0,0,0, 1,0,0,0,
     0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0,
     0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0,
     0,1,0,0, 0,1,0,0,
     0,0,0,1, 0,0,0,1, 0,0,0,1, 0,0,0,1,
     0,0,0,1, 0,0,0,1, 0,0,0,1, 0,0,0,1]
MaxT::[45,35,35,35]
MinT::[35,25,25,25]
MR::[1,1,0,0,
     0,0,0,0,
     0,0,1,1,
     0,0,0,0]

forall(v in positions,m in machines,j in jobs)x(j,v,m)is_binary
! each job takes exactly one position on one machine
forall(j in jobs)sum(m in machines,v in positions)x(j,v,m)=1
! each position on a machine holds at most one job
forall(v in positions,m in machines)sum(j in jobs)x(j,v,m)<=1
! the end time of a position equals SCC(j) of the job assigned to it
forall(v in positions,m in machines)ET(v,m)=sum(j in jobs)x(j,v,m)*SCC(j)

! a job of product p cannot be placed on machine m when MR(p,m)=1
forall(p in products,m in machines,j in jobs)do
  if MR(p,m)=1 and JP(j,p)=1 then
    forall(v in positions)x(j,v,m)=0
  end-if
end-do

! processing time of the assigned job lies between MinT(p) and MaxT(p)
forall(m in machines,v in positions)ET(v,m)<=ST(v,m)+sum(j in jobs,p in products)x(j,v,m)*MaxT(p)*JP(j,p)
forall(m in machines,v in positions)ET(v,m)>=ST(v,m)+sum(j in jobs,p in products)x(j,v,m)*MinT(p)*JP(j,p)
forall(m in machines,v in positions)ST(v,m)>=0
forall(m in machines,v in positions)ET(v,m)>=0
! consecutive positions on a machine do not overlap
forall(m in machines,v in 1..totaljobs-1)ET(v,m)<=ST(v+1,m)

obj:=sum(v in positions,m in machines)ET(v,m)-sum(v in positions,m in machines)ST(v,m)

if PROJECTDIR <> '' then
  setparam('workdir', PROJECTDIR)
  writeln("Project directory: " + PROJECTDIR)
end-if

writeln("Begin running model")
minimize(obj)

! print: job, start, end, machine, position
forall(m in machines,v in positions, j in jobs)do
  if getsol(x(j,v,m))=1 then
    writeln(j," ",getsol(ST(v,m))," ",getsol(ET(v,m))," ",m," ",v)
  end-if
end-do
!...
writeln("End running model")
end-model
Appendix E: Xpress Code Model 3
!@encoding CP1252
model Model3
uses "mmxprs"; !gain access to the Xpress-Optimizer solver

!optional parameters section
parameters
  ! SAMPLEPARAM1='c:\test\'
  ! SAMPLEPARAM2=false
  PROJECTDIR='' ! for when file is added to project
end-parameters

!sample declarations section
declarations
  M=10000
  stages=1..4
  totaljobs=10
  jobs=1..totaljobs
  products=1..4
  machines=1..10
  SCC:array(jobs)of real
  JP:array(jobs,products)of integer
  MaxT:array(products)of integer
  MinT:array(products)of integer
  MR:array(products,machines)of integer
  SM:array(stages,machines)of integer
  PT:array(machines)of integer
  ST:array(jobs,stages,machines)of mpvar
  ET:array(jobs,stages,machines)of mpvar
  x:array(jobs,stages,machines)of mpvar
  y:array(jobs,jobs,stages,machines)of mpvar
  ! ...
  Objective:linctr
end-declarations

SCC::[2050.82, 2101.64, 2152.46, 2203.28, 2254.1, 2304.92, 2355.74, 2406.56, 2457.38, 2508.2]
JP::[0,0,1,0,
     0,0,1,0,
     1,0,0,0,
     1,0,0,0,
     0,0,1,0,
     0,0,1,0,
     0,1,0,0,
     0,1,0,0,
     0,0,0,1,
     0,0,0,1]
MaxT::[45,35,35,35]
MinT::[35,25,25,25]
MR::[0,0,0,0,0,0,1,1,0,0,
     0,0,0,0,0,0,0,0,0,0,
     0,0,0,0,0,0,0,0,1,1,
     0,0,0,0,0,0,1,0,0,0]
SM::[1,1,0,0,0,0,0,0,0,0,
     0,0,1,1,0,0,0,0,0,0,
     0,0,0,0,1,1,0,0,0,0,
     0,0,0,0,0,0,1,1,1,1]
PT::[18,18,25,25,23,23,0,0,0,0]

forall(i in stages,m in machines,j in jobs)x(j,i,m)is_binary
forall(i in stages,m in machines,j in jobs,k in jobs)y(j,k,i,m)is_binary
! each job uses exactly one machine at every stage
forall(j in jobs,i in stages)sum(m in machines)x(j,i,m)=1
! at stage 4 the end time equals SCC(j) on the chosen machine
forall(m in machines,j in jobs)ET(j,4,m)=x(j,4,m)*SCC(j)

! a job of product p cannot use machine m when MR(p,m)=1
forall(p in products,m in machines,j in jobs)do
  if MR(p,m)=1 and JP(j,p)=1 then
    forall(i in stages)x(j,i,m)<=0
  end-if
end-do

! a machine can only be used at the stage it belongs to
forall(i in stages,m in machines)do
  forall(j in jobs)x(j,i,m)<=SM(i,m)
end-do

! stage-4 processing time lies between MinT(p) and MaxT(p)
forall(m in machines,j in jobs)ET(j,4,m)<=ST(j,4,m)+sum(p in products)(x(j,4,m)*MaxT(p)*JP(j,p))
forall(m in machines,j in jobs)ET(j,4,m)>=ST(j,4,m)+sum(p in products)(x(j,4,m)*MinT(p)*JP(j,p))
forall(m in machines,i in stages,j in jobs)ST(j,i,m)>=0
forall(m in machines,i in stages,j in jobs)ET(j,i,m)>=0
! start and end times are forced to zero on machines the job does not use
forall(m in machines,i in stages,j in jobs)ST(j,i,m)+ET(j,i,m)<=x(j,i,m)*M
! stages 1 to 3 have the fixed processing time PT(m)
forall(i in 1..3,j in jobs,m in machines)ET(j,i,m)=ST(j,i,m)+PT(m)*x(j,i,m)
! a job starts stage i+1 exactly when it ends stage i (big-M pair, active when both machines are chosen)
forall(i in 1..3,j in jobs,m in machines,v in machines)ST(j,i+1,v)<=ET(j,i,m)+(2-x(j,i,m)-x(j,i+1,v))*M
forall(i in 1..3,j in jobs,m in machines,v in machines)ST(j,i+1,v)>=ET(j,i,m)-(2-x(j,i,m)-x(j,i+1,v))*M
! if y(j,v,i,m)=1, job v starts on machine m only after job j has ended
forall(j in jobs,v in jobs,i in stages,m in machines)ST(v,i,m)>=ET(j,i,m)-(1-y(j,v,i,m))*M

! two different jobs sharing a machine at a stage must be ordered one way or the other
forall(a in jobs,b in jobs,i in stages,m in machines)do
  if a<>b then
    y(a,b,i,m)+y(b,a,i,m)>=x(a,i,m)+x(b,i,m)-1
  end-if
end-do
forall(a in jobs,b in jobs,i in stages,m in machines)y(a,b,i,m)<=x(a,i,m)
forall(a in jobs,b in jobs,i in stages,m in machines)y(a,b,i,m)<=x(b,i,m)

obj:=sum(j in jobs,m in machines)ET(j,4,m)-sum(j in jobs, m in machines)ST(j,4,m)

if PROJECTDIR <> '' then
  setparam('workdir', PROJECTDIR)
  writeln("Project directory: " + PROJECTDIR)
end-if

writeln("Begin running model")
minimize(obj)
!...
writeln("End running model")
end-model