Balancing relevancy across expert systems for a Conversational AI personality

Gautam Prasad

September 2019

School of Mathematics, Cardiff University

A dissertation submitted in partial fulfilment of the requirements for MSc (in Data Science and Analytics) by taught programme.

CANDIDATE’S ID NUMBER 1821536

CANDIDATE’S SURNAME PRASAD

CANDIDATE’S FULL FORENAMES GAUTAM

DECLARATION

This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree.

Signed ……………………………………………. (candidate) Date 06-09-2019

STATEMENT 1

This dissertation is being submitted in partial fulfilment of the requirements for the degree of MSc.

Signed ……………………………………………. (candidate) Date 06-09-2019

STATEMENT 2

This dissertation is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by footnotes giving explicit references. A Bibliography is appended.

Signed ……………………………………………. (candidate) Date 06-09-2019

STATEMENT 3

I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing, and for the title and summary to be made available to outside organisations.

Signed ……………………………………………. (candidate) Date 06-09-2019

STATEMENT 4 - BAR ON ACCESS APPROVED

I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing after expiry of a bar on access approved by the Graduate Development Committee.

Signed ……………………………………………. (candidate) Date 06-09-2019 Gautam Prasad


Executive Summary

Companies want to interact with their customers in a way that is not limited by time, human resource availability or language. They need to pre-empt the needs of their clientele in order to keep them satisfied and thereby reduce customer churn. Businesses in the telecom and finance domains, such as Vodafone, Royal Bank of Scotland (RBS) and NatWest, are investing in chatbots to build and maintain new relationships while lowering overheads, costs and training time.

There are three significant types of chatbot expert systems in use: chitchat bots, short-tail (task-focused) bots, and long-tail search or FAQ bots. These have primarily been used individually, based on business requirements. Chitchat systems focus on having the most natural conversation with a user based on their inputs. Short-tail systems help the user complete a small number of regularly performed tasks; they require high training effort to scale but tend to provide consistent results. Long-tail systems focus on information retrieval; they require a more significant training effort at the start and provide a wider variety of answers at lower confidences, but scale more efficiently. A mixture-of-experts system that can combine these three technologies into a single personality, balancing the relevancy of the candidate responses to decide which one is used, is of high interest to organisations. It will enable them to better leverage their existing investments in brand personality (chitchat systems) and human-readable material (long-tail systems), while allowing them to rapidly develop new task-focused (short-tail) systems in a wide variety of their business domains using modern API-based conversational short-tail systems. This project looks at the best method to bring these disparate types of systems together into a coherent conversational personality.

The analysis looked at different scenarios: using unmodified confidence scores from the expert systems, building a high-level classifier to determine the best system to answer, and simulating the fallback rules used in systems like IBM Watson. The outputs from all the scenarios were optimised using evolutionary algorithms on a prepared data set, covering a single topic applicable to all three system types, before being compared using an accuracy metric to determine the most successful strategy. The additional effect of concurrency between user utterances was then evaluated against these strategies.

The study concluded that merging the chitchat and short-tail training data, which reduces the number of expert systems from three to two, and then using the fallback rules works best, provided the confidence level for fallback is optimised for the expected data set. If the training sets of the chitchat and short-tail systems cannot be merged, or there is a requirement to keep three separate systems, a weighted high-level classifier performed best. Optimising the confidence levels improved the performance of the fallback rules by a considerable margin. Based on the recommendations of the literature review, the effect of concurrency was thought to be a crucial aspect to investigate, but its overall effect on this data was shown to be small. Recommended next steps are beta testing with real user data, to avoid any cognitive bias in the test and training sets, and gauging the change in performance when the number of expert systems is increased beyond three, where a high-level classifier is expected to outperform fallback rules.


Acknowledgements

I would like to thank my supervisor, Professor Alexander Balinsky, for his timely help and for pointing me in the right direction throughout this project. I am grateful to Mr Stephen Broadhurst of ThinJetty Ltd for providing me with the opportunity to pursue this project at his organisation and for his continuous mentoring and support. I would also like to acknowledge the advice and assistance of Ms Joanna Emery and the moral support provided by my colleagues on the MSc course.

Contents

Executive Summary
Acknowledgements
1. Introduction
2. Literature Review
2.1. Expert Systems
2.1.1. Overview
2.1.2. Typical Architecture for an expert system
2.1.3. Applications
2.2. Mixture of Experts
2.2.1. Background/ Overview
2.2.2. Applications
2.3. Conversational Agents and the application of Expert Systems
2.3.1. What are Conversational Agents?
2.3.2. How are expert systems used in chatbots?
2.3.3. Comparison of toolkits for building conversational agents
3. Methodology & Implementation
3.1. Overview
3.2. Tools setup and initialisation
3.3. Knowledgebase
3.4. Testing framework & Dataset optimisation
3.5. Scenarios
3.5.1. Overview
3.5.2. Experiment 1: High-level classifier
3.5.3. Experiment 2: Weighted High-Level Classifier
3.5.4. Experiment 3: Based on unmodified confidence scores
3.5.5. Experiment 4: Weighted confidence scores
3.5.6. Experiment 5: Emulating fallback logic
3.5.7. Experiment 6: Effect of concurrency
4. Results
5. Discussion & Conclusion
5.1. Qualitative evaluation
5.2. Possibility for future work
5.3. Recommendations & Findings
References
Appendices

1. Introduction

This project investigates the use of a mixture-of-experts system in the conversational artificial intelligence (AI) domain and aims to devise a rule-based or machine-learnt technique that can balance relevancy across the experts. The expert systems are specialised at a domain level and are used to respond to a conversational turn in a robust and precise manner, while accounting for concurrency and minimising errors. The aim is a mechanism that enables the rules to adapt during the conversation, based on parameters of the conversational state or features of the user utterance, to best judge which system must respond.

2. Literature Review

2.1. Expert Systems

2.1.1. Overview

Expert systems are a branch of AI concerned with developing machines which, in a specific domain, have problem-solving abilities similar to those displayed by a human expert in the same field, or which simulate human expert behaviour (Tzafestas et al., 1993). An expert system differs from other forms of AI because it solves problems using domain-specific approaches at an expert knowledge level and also provides evidence for the conclusions drawn (Tzafestas et al., 1993). Several advantages over human experts, such as the following, have also been highlighted over time (Ignizio, 1991):

• No human-like bias involved while obtaining solutions or prescribing strategies.

• Minimal chances for occurrences of errors in calculations.

• Serves the purpose without fail on a near-constant basis.

2.1.2. Typical Architecture for an expert system

Figure 1: Typical architecture of an expert system (Forsyth, 1984; Tzafestas et al., 1993; Yazdani, 1989)

An expert system consists of the following high-level modules, shown in Figure 1, which form the crux of the operations of the system (Forsyth, 1984, n.d.; Ignizio, 1990; Tripathi, 2011; Tzafestas et al., 1993):

2.1.2.1. Knowledge Base

The knowledge base encompasses the domain-specific, expert-level knowledge that the expert system requires for comprehending user requirements, formulating strategies and obtaining the necessary rule-based solutions, which can then be passed on to the inference engine for further processing. It consists of both factual knowledge, which is the most commonly shared form of knowledge, and heuristic knowledge, which is less widely shared, considerably more individualistic, and acts as the reasoning for the solutions obtained.

2.1.2.2. Inference Engine

The inference engine acts as the intelligence behind the expert system and draws inferences from user requests or utterances. It analyses and processes the rules obtained from the knowledge base in order to arrive at a solution, with logical reasoning to support it. In short, it controls the interpretation and reasoning methodology of the expert system. The two most widely used reasoning strategies are forward chaining and backward chaining. Forward chaining starts with the data at hand and applies the inference rules to arrive at a solution, whereas backward chaining starts with the list of goals to be attained and works backwards to see whether any available data supports those goals.
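The two reasoning strategies described above can be illustrated with a minimal sketch. It assumes rules are stored as (premises, conclusion) pairs; the rule and fact names are hypothetical, taken from a made-up car-configurator scenario rather than from the cited works.

```python
# Hypothetical rules: each is (premises, conclusion).
RULES = [
    (("has_licence", "wants_test_drive"), "eligible_for_test_drive"),
    (("eligible_for_test_drive", "dealer_nearby"), "offer_booking"),
]
FACTS = {"has_licence", "wants_test_drive", "dealer_nearby"}


def forward_chain(facts, rules):
    """Start from the data: fire rules until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)
                changed = True
    return known


def backward_chain(goal, facts, rules, _visiting=frozenset()):
    """Start from the goal: it holds if it is a known fact, or if some rule
    concludes it and all of that rule's premises can be proven in turn."""
    if goal in facts:
        return True
    if goal in _visiting:  # guard against circular rules
        return False
    visiting = _visiting | {goal}
    return any(
        all(backward_chain(p, facts, rules, visiting) for p in premises)
        for premises, conclusion in rules
        if conclusion == goal
    )


print(sorted(forward_chain(FACTS, RULES)))
print(backward_chain("offer_booking", FACTS, RULES))
```

Forward chaining derives everything that follows from the facts, while backward chaining only explores the rules relevant to the goal being asked about.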

2.1.2.3. User Interface

A user interface is built to interact with the user, receiving user inputs in the form of utterances and returning output from the system in a form the user can understand.

2.1.3. Applications

Expert systems have varied applications, mainly in classification tasks, in fields such as medical diagnosis, information retrieval and aligned services, engineering, human-computer interaction, the military and robotics (Forsyth, 1984, n.d.; Ignizio, 1990; Tzafestas et al., 1993).

2.2. Mixture of Experts

2.2.1. Background/ Overview

Mixture of experts is a method that was introduced almost 30 years ago by Jacobs and co-workers (R. Jacobs et al., 1991). They investigated the use of a different error function in a mixture-of-experts system, and their approach has been highly popular in a wide range of applications (Yuksel et al., 2012). Over the years, around 20 different studies have been conducted on the principles, workings and applications of such systems, and the problem was, to an extent, even considered completely solved; recently, however, there has been a resurgence of interest in using a mixture of experts for several new-age problems (Yuksel et al., 2012).

Mixture of experts is widely regarded as a combining method which, when used in machine learning tasks, can lead to better performance and improved results (Masoudnia and Ebrahimpour, 2014). The critical aspect of a mixture-of-experts model in any application is to employ specialised expert systems that return correct answers for topics falling under their knowledge base, and to use a gating network across all the expert systems, which helps to reduce errors (Jacobs et al., 1991). The basic principle is that the gating network assigns a new input to one expert system, and only the weights of this system are changed if the output is found to be incorrect, which removes any chance of interference with the other expert systems (R. Jacobs et al., 1991). A further implication, and possible advantage, is that each expert system is assigned only a small subset of the possible input cases (R. Jacobs et al., 1991).
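The role of the gating network can be sketched in a few lines. This is a minimal illustration, assuming scalar expert outputs and a softmax over gating scores; the function names are hypothetical, and the hard selection shown is a deterministic simplification of a stochastic selector, not the procedure of the cited paper.

```python
import math


def softmax(scores):
    """Turn raw gating scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_output(expert_outputs, gate_scores):
    """Soft combination: blend expert outputs by their gating probabilities."""
    probs = softmax(gate_scores)
    blended = sum(p * o for p, o in zip(probs, expert_outputs))
    return blended, probs


def select_expert(gate_scores):
    """Hard one-out-of-n selection: index of the most probable expert."""
    probs = softmax(gate_scores)
    return max(range(len(probs)), key=probs.__getitem__)


blended, probs = mixture_output([0.2, 0.9, 0.4], [0.1, 2.0, -1.0])
print(probs, blended, select_expert([0.1, 2.0, -1.0]))
```

With hard selection only the chosen expert would receive a weight update for that input, which is what keeps the experts from interfering with one another.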

Figure 2: A system of expert and gating systems (R. Jacobs et al., 1991)

The gating network is assumed to be a stochastic one-out-of-n selector, unlike in (Hampshire and Waibel, 1992; R. A. Jacobs et al., 1991). This achieves minimal interference in a much more straightforward manner, although it requires reconceptualising the error function so that the expert systems compete with one another, making the whole network competitive rather than collaborative (R. Jacobs et al., 1991). A comparative evaluation was performed between standard backpropagation networks with a single hidden layer and a mixture of experts on a multi-speaker vowel recognition task (R. Jacobs et al., 1991). The number of parameters in the models was kept approximately equal by adjusting the number of hidden units in the backpropagation network (R. Jacobs et al., 1991). The results showed that the mixture-of-experts model reached the error criterion (an average squared error of 0.08) considerably faster, needing fewer epochs, while continuing to scale as the number of experts in the system increased (R. Jacobs et al., 1991; Masoudnia and Ebrahimpour, 2014).
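For reference, the change of error function discussed above can be written out as follows. The notation is an assumption made for this sketch and should be checked against Jacobs et al. (1991): for training case $c$, $\mathbf{d}^c$ is the desired output, $\mathbf{o}_i^c$ the output of expert $i$, and $p_i^c$ the gating probability assigned to expert $i$.

```latex
% Cooperative form: the blended output is compared with the target,
% so every expert is adjusted to cancel the residual error of the others.
E^{c}_{\text{coop}} = \Bigl\lVert \mathbf{d}^{c} - \sum_{i} p_{i}^{c}\, \mathbf{o}_{i}^{c} \Bigr\rVert^{2}

% Competitive form: each expert is compared with the target on its own,
% and the gate is rewarded for favouring the expert with the smallest error.
E^{c}_{\text{comp}} = -\log \sum_{i} p_{i}^{c}\,
    \exp\!\Bigl( -\tfrac{1}{2} \bigl\lVert \mathbf{d}^{c} - \mathbf{o}_{i}^{c} \bigr\rVert^{2} \Bigr)
```

In the competitive form, an expert whose output is already close to the target receives most of the credit, which encourages each expert to specialise on its own subset of cases.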

2.2.2. Applications

Several applications have been devised over the years for mixture-of-experts systems, such as (Yuksel et al., 2012):

• Prediction of climate (Lu, 2006), electricity demand (Weigend et al., 1995), stock prices (Versace et al., 2004) and currency exchange rates (Coelho et al., 2003), amongst others.
• Machine-learnt classification tasks involving
o Classification of
▪ Text (Estabrooks and Japkowicz, 2001),
▪ Audio signals (Harb et al., 2004) and
o Recognition of
▪ Handwriting (Ebrahimpour, 2009),
▪ Speech (Mossavat et al., 2010; Peng et al., 1996), and
▪ 3D objects (Walter et al., 1999).

2.3. Conversational Agents and the application of Expert Systems

2.3.1. What are Conversational Agents?

A piece of software that enables a machine to converse with a human user in a natural language such as English is called a conversational agent (Io and Lee, 2017; Weizenbaum, 1966). Since the initial research in the field in the 1960s, the most significant challenge has been giving the machine the intelligence needed to facilitate such interactions (Shum et al., 2018; Turing, 1950).

In a typical human conversation with a chatbot, the input from the user takes the form of one or more natural language utterances, which the system analyses to gauge the requirements of the user before producing a response that it deems appropriate to the analysed input (Weizenbaum, 1966). This response can be derived using several techniques. In the earlier stages of chatbot development, rules were predominantly used: the user utterance is searched for keywords and, based on their presence, associated rules are invoked to transform the utterance into a response (Shum et al., 2018; Weizenbaum, 1966).
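The keyword-and-rule approach described above can be sketched as follows. The patterns and reply templates are hypothetical and only written in the spirit of early rule-based chatbots; they are not taken from any of the cited systems.

```python
import re

# Hypothetical rules: (keyword pattern, reply template). The first match wins.
RULES = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bcar\b", re.IGNORECASE), "Tell me more about the car you have in mind."),
]
DEFAULT_REPLY = "Please go on."


def respond(utterance):
    """Search the utterance for a keyword pattern and transform it into a reply."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            # Reuse the captured fragment of the user's own words in the reply.
            fragments = [g.rstrip(".!?") for g in match.groups()]
            return template.format(*fragments)
    return DEFAULT_REPLY


print(respond("I need a new car"))
print(respond("Hello"))
```

Because the reply is assembled from the user's own words, such systems can appear attentive without any model of meaning, which also explains why they struggle to sustain longer conversations.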

2.3.1.1. Types of Conversational Agents

2.3.1.1.1. Chitchat

Several of the earliest systems developed, such as “Eliza”, “ALICE” and “Parry” (Colby KM, 1975; Shieber, 1994; Wallace, 2009; Weizenbaum, 1966), were focussed on performing as chitchat bots that converse with users through media such as text and audio (Shum et al., 2018). These systems used rule-based pattern matching to respond to the user’s input (Shum et al., 2018; Weizenbaum, 1966).

The chatbots were given different personalities, such as a “Rogerian Psychotherapist” (Shum et al., 2018; Weizenbaum, 1966) or a paranoid person (Colby KM, 1975; Shum et al., 2018), but were severely limited in their ability to sustain a conversation for a prolonged duration and had narrowly specified domains, which further reduced their performance (Colby KM, 1975; Shieber, 1994; Wallace, 2009). These limitations were partly due to the technology with which the systems were built, such as AIML for “ALICE”, which in turn led to their failure in several evaluations such as the “Ultimate Turing Test” (Shum et al., 2018; Wallace, 2009).

2.3.1.1.2. Task-Completion

Task-completion conversational agents were built with a focus on realising specific tasks which fall under constrained domains (Shum et al., 2018; Walker et al., 2001; Wang et al., n.d.). Such an agent typically gives a single, short, high-confidence answer. Commonly seen domains include hotel or flight booking, weather forecasts and information gathering. In general, the system tries to gauge the user’s ‘intents’ and then responds with actions that will complete the said intent or goal (Shum et al., 2018; Walker et al., 2001). Further improvements included state tracking and the ability to comprehend complex dialogues with inherent variability (Shum et al., 2018; Williams and Young, 2007). These systems were evaluated on several parameters, including (Walker et al., 2001):

• User satisfaction

• Task completion

• Task duration

• Accuracy

A couple of years ago, the telecom giant Vodafone introduced a chatbot, ‘TOBi’, which could help users with basic tasks such as checking account details, troubleshooting and purchasing new connections (Koehler, 2017). TOBi delivered the following results (Davis, 2018):

• A conversion rate increase of more than 100% compared to their website.

• A decrease in transaction time of around 50% compared to their website.

• A usability score of 90, among the highest ever received.

Another example is ‘Cora’, a chatbot employed by RBS and NatWest, both in the banking domain, to answer basic banking-related queries from users (“NatWest begins testing AI driven ‘digital human’ in banking first,” n.d.; Rumney, 2018). This has helped in identifying the most frequently asked questions and significantly cutting down queuing times.

2.3.1.1.3. Long tail or Question Answering

QA conversational agents process natural language queries raised by the user and provide concise and relevant answers to them, thereby improving the overall interaction between the user and the intelligent system (Simmons, 1970; Waltinger et al., 2012). Such an agent returns a number of lower-confidence, longer sections of text that are mined from the corpus rather than configured. In general, the question is first analysed and a search performed, which results in candidate answers, each with supporting evidence and a score (Ferrucci et al., 2010; Setiaji and Wibowo, 2016). The answers thus obtained are then ranked in order of relevance before being presented back to the user (Ferrucci et al., 2010; Setiaji and Wibowo, 2016). To aid a better response to the user, other facets such as topic identification, context recognition and keyword detection are also used in tandem (Niranjan et al., 2012; Setiaji and Wibowo, 2016; Waltinger et al., 2012). Question classification, in which the inherent ‘type’ of the question posed by the user is obtained, has also been identified as a component which can improve the accuracy of long-tail agents (Suzuki et al., 2003; Waltinger et al., 2012; Zhang and Lee, 2003).
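The analyse, search, score and rank steps described above can be illustrated with a toy retrieval sketch. It assumes a simple term-overlap score; real question-answering systems use far richer scoring, and the passages below are invented for illustration only.

```python
import re

# Hypothetical corpus passages, as might be mined from car brochures.
PASSAGES = [
    "The hatchback has a boot capacity of 380 litres.",
    "Test drives can be booked at any dealership.",
    "The estate model offers a boot capacity of 600 litres and roof rails.",
]


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, passage):
    """Share of distinct query terms that also occur in the passage (0.0 to 1.0)."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    return len(query_terms & set(tokenize(passage))) / len(query_terms)


def rank(query, passages, top_k=3):
    """Score every passage and return the top_k as (score, passage), best first."""
    scored = [(score(query, p), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


for confidence, passage in rank("boot capacity of the estate", PASSAGES):
    print(round(confidence, 2), passage)
```

Note that several passages come back with intermediate scores, which mirrors the behaviour described for long-tail systems: more candidate answers, each at a lower confidence than a configured short-tail reply.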

2.3.2. How are expert systems used in chatbots?

2.3.2.1. Expert systems in chatbots

Traditionally, spoken dialogue systems and conversational agents have employed mechanisms to control the dialogue flow, limiting the responses from the user to a set of pre-defined choices (M. O’Neill et al., 2004). In advanced systems that allow multi-domain interaction between user and agent, a component identifies the domain or topic from the user input and performs the necessary action to fulfil the requirement (M. O’Neill et al., 2004). Over time, several ‘plan-based dialogue modelling schemes’ were put forward to build systems upon, the premise being that behind every user-system interaction lies a particular requirement or goal of the user, which the system has to recognise and act upon (Lin et al., 1999). To facilitate multi-domain conversations, the entire system is configured as multiple ‘domain-specific experts’, each capable of completing transactions in a particular domain, which work in association with each other and switch between themselves based on user input (M. O’Neill et al., 2004; Nakano et al., 2008). A middle layer in the system is responsible for evaluating user utterances across all the ‘experts’ present and determining which one has to respond to a particular utterance (Hartikainen et al., 2004; Komatani et al., 2006; Lin et al., 1999; M. O’Neill et al., 2004).
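The middle layer described above can be sketched as a small dispatcher. This is an illustrative assumption about how such a layer might look, not the design of any cited system: each toy expert reports a confidence based on keyword hits, and the most confident expert responds. All names and keywords are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Expert:
    name: str
    # Returns (response text, confidence between 0 and 1) for an utterance.
    handle: Callable[[str], Tuple[str, float]]


def keyword_expert(name: str, keywords: List[str], reply: str) -> Expert:
    """Build a toy expert whose confidence is the share of its keywords present."""
    def handle(utterance: str) -> Tuple[str, float]:
        words = set(utterance.lower().split())
        hits = sum(1 for k in keywords if k in words)
        return reply, hits / len(keywords)
    return Expert(name, handle)


def dispatch(utterance: str, experts: List[Expert]) -> Tuple[str, str, float]:
    """Ask every expert, then let the most confident one respond."""
    results = [(e.name, *e.handle(utterance)) for e in experts]
    return max(results, key=lambda r: r[2])


EXPERTS = [
    keyword_expert("chitchat", ["hello", "thanks"], "Hello! How can I help?"),
    keyword_expert("short_tail", ["book", "test", "drive"], "Let us book your test drive."),
    keyword_expert("long_tail", ["boot", "capacity", "warranty"], "Here is what the brochure says."),
]

print(dispatch("can i book a test drive", EXPERTS))
```

Choosing by raw confidence is the simplest possible policy; it assumes the scores of different experts are directly comparable, which is exactly the assumption a relevancy-balancing layer has to examine.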

2.3.2.2. Problems faced in addressing multi-domain conversations

Several problems have arisen in attempts to handle multi-domain conversations in a concurrent and flexible manner, such as:

• Identifying how to handle errors in comprehending user input (Lin et al., 1999).

• Determining whether the user or the system should take the initiative in carrying on the conversation (Lin et al., 1999).

• Tackling user initiatives in a proper and ‘consistent’ manner (Lin et al., 1999).

• Diminishing efficiency due to multiple systems working simultaneously (Hartikainen et al., 2004).

• Inability to handle concurrent topics (Lin et al., 1999).

2.3.3. Comparison of toolkits for building conversational agents

Different toolkits specialised in building conversational agents were considered, selected primarily for offering both a search (long-tail) capability and a typical conversational agent capability out of the box. IBM Watson, Google Dialogflow and Microsoft LUIS each had both capabilities, with their long-tail functionality provided by Watson Discovery, Knowledge Connectors and Microsoft QnA Maker respectively.

In order to select the best possible toolkit, the study conducted by Liu et al. (2019) was used. Liu et al. (2019) compared state-of-the-art conversational toolkits on the metrics of precision, recall and F1 score, as given in Table 1.

Intent classification

Toolkit      Precision   Recall   F1
Rasa         0.863       0.863    0.863
Dialogflow   0.87        0.859    0.864
LUIS         0.855       0.855    0.855
Watson       0.884       0.881    0.882

Table 1: Comparison of specialised toolkits (Liu et al., 2019)

As seen in Table 1, IBM Watson returns the highest F1 score for intent classification, although the difference from the scores of the other three toolkits is not large (Liu et al., 2019).
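As a quick consistency check on Table 1, the F1 score can be recomputed as the harmonic mean of precision and recall. This is a sketch only: the figures reported by Liu et al. (2019) may have been averaged differently, so agreement to within rounding is the most that should be expected.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from Table 1: (precision, recall, reported F1).
TABLE_1 = {
    "Rasa": (0.863, 0.863, 0.863),
    "Dialogflow": (0.87, 0.859, 0.864),
    "LUIS": (0.855, 0.855, 0.855),
    "Watson": (0.884, 0.881, 0.882),
}

for toolkit, (precision, recall, reported) in TABLE_1.items():
    print(toolkit, round(f1_score(precision, recall), 3), reported)
```

Each recomputed value agrees with the reported F1 to within 0.001.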

3. Methodology & Implementation

3.1. Overview

The aim of the project is to create a rule-based or machine-learnt algorithm for weighting the confidence and evidence returned by the expert systems, in order to determine which system is best placed to respond at each conversational turn. These rules should be able to adapt during the conversation, based on parameters of the conversational state or features of the user utterance. By doing this, we also aim to recommend how businesses can best combine ‘off-the-shelf’, out-of-the-box (OOTB) chitchat with existing human-readable corpora (long tail) and then rapidly develop domain-specific functionality.

It was decided to build a car configurator bot (a multi-domain expert system/conversational agent) on which to perform the tests and analysis. The bot would be able to engage in chitchat with the user and perform tasks involved in configuring a car, such as gathering general requirements and booking test drives. It would also act as a question-and-answer bot, to which users can pose natural language queries whose answers could help them narrow their search or enhance their knowledge of a vehicle they have in mind. Short-tail queries would be put to the system at a higher frequency, and long-tail queries at a significantly lower frequency. Each user utterance is passed to all three expert systems, and the corresponding confidence scores are retrieved. Metrics such as accuracy, precision and recall are derived and used as a baseline against which to compare the different simulated scenarios. The whole experiment and analysis were carried out using the IBM Watson tools Assistant and Discovery, as discussed in Section 2.3.3, along with data extraction and manipulation in Python and optimisation tasks in Excel using Solver.
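The baseline metrics mentioned above can be computed from two parallel lists: the expert system expected to answer each utterance and the one actually selected. The sketch below uses invented labels and is not the evaluation code or data of this study.

```python
def accuracy(expected, predicted):
    """Share of utterances routed to the expected expert system."""
    correct = sum(e == p for e, p in zip(expected, predicted))
    return correct / len(expected)


def precision_recall(expected, predicted, label):
    """Precision and recall for one expert system, treated as the positive class."""
    pairs = list(zip(expected, predicted))
    tp = sum(e == label and p == label for e, p in pairs)
    fp = sum(e != label and p == label for e, p in pairs)
    fn = sum(e == label and p != label for e, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical routing results for five utterances.
EXPECTED = ["short_tail", "short_tail", "chitchat", "long_tail", "short_tail"]
PREDICTED = ["short_tail", "chitchat", "chitchat", "long_tail", "short_tail"]

print(accuracy(EXPECTED, PREDICTED))
for system in ("chitchat", "short_tail", "long_tail"):
    print(system, precision_recall(EXPECTED, PREDICTED, system))
```

Accuracy gives a single figure for the whole agent, while per-system precision and recall show which expert is over-selected or under-selected.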

3.2. Tools setup and initialisation

An account was set up with IBM Watson in order to use Assistant and Discovery for building the conversational agent. The Watson Developer Cloud Python SDK is used to communicate with the Assistant and Discovery services through the application programming interface (API) provided, and its dependencies and packages are also installed. Access to the online services is gained using a combination of usernames, API keys, environment IDs, collection IDs and so on. Python is also used to extract, manipulate and further analyse the outputs from the services’ APIs. Microsoft Excel is used to create and store the data used for training and testing, as Excel worksheets (.xlsx), comma-separated value files (.csv) and tab-separated value files (.tsv), depending on requirements and features.

Towards the end of the analysis, the Solver add-in of Excel is used to optimise metric values such as precision, recall or accuracy (the objective) based on weights or other parameters (constraints) as necessary. Solver has three solving methods, which are used throughout our experiments depending on the requirements and on the improvement in the metrics that each method brings. Where more than one method is used, the methods can also be compared. The solver methods are as follows:

• Simplex LP: Used where the problem is linear, which restricts its applications (“Excel Solver,” 2016). One of its benefits, however, is that the solutions obtained are always globally optimal (“Excel Solver,” 2016).

• Generalised Reduced Gradient (GRG) Nonlinear: Used for smooth non-linear problems (“Excel Solver,” 2016). It is the fastest of the non-linear methods, but has the disadvantage that the solution obtained might not be a global optimum and is highly dependent on the initial conditions (“Excel Solver,” 2016).

• Evolutionary: Used for non-smooth problems. Based on the theory of natural selection, it may converge to a solution either because the solution is the global optimum or because the population has lost its diversity (“Excel Solver,” 2016).
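Accuracy as a function of weights is a step function, so it is non-smooth, which is why an evolutionary method suits it better than a gradient-based one. The sketch below illustrates the idea with a very simple mutate-and-keep-the-best loop in Python. It is an assumption-laden stand-in, not Excel Solver's algorithm and not the optimisation performed in this study; the confidence values and expected labels are invented.

```python
import random

EXPERTS = ["chitchat", "short_tail", "long_tail"]

# Hypothetical (confidence per expert, expected expert) pairs; not data from the study.
SAMPLES = [
    ({"chitchat": 0.90, "short_tail": 0.40, "long_tail": 0.30}, "chitchat"),
    ({"chitchat": 0.70, "short_tail": 0.65, "long_tail": 0.20}, "short_tail"),
    ({"chitchat": 0.60, "short_tail": 0.30, "long_tail": 0.55}, "long_tail"),
    ({"chitchat": 0.50, "short_tail": 0.80, "long_tail": 0.10}, "short_tail"),
    ({"chitchat": 0.75, "short_tail": 0.20, "long_tail": 0.70}, "long_tail"),
]


def accuracy(weights):
    """Share of samples where the highest weighted confidence picks the expected expert."""
    correct = 0
    for confidences, expected in SAMPLES:
        chosen = max(EXPERTS, key=lambda e: weights[e] * confidences[e])
        correct += chosen == expected
    return correct / len(SAMPLES)


def evolve(generations=200, offspring=10, sigma=0.2, seed=0):
    """Mutate the best weights found so far and keep a child only if it scores higher."""
    rng = random.Random(seed)
    best = {e: 1.0 for e in EXPERTS}  # start from unmodified confidences
    best_fit = accuracy(best)
    for _ in range(generations):
        for _ in range(offspring):
            child = {e: max(0.0, w + rng.gauss(0.0, sigma)) for e, w in best.items()}
            fit = accuracy(child)
            if fit > best_fit:
                best, best_fit = child, fit
    return best, best_fit


baseline = accuracy({e: 1.0 for e in EXPERTS})
weights, fitness = evolve()
print(baseline, fitness, weights)
```

Because a child replaces the current best only when it scores strictly higher, the final accuracy can never fall below the unweighted baseline; whether it reaches the global optimum depends on the mutation size and the number of generations.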

3.3. Knowledgebase

A corpus of data had to be created to train and test the conversational agent. Since the conversational agent has three experts, namely chitchat (social), short-tail (task-oriented) and long-tail (Q&A), a corpus was created for each of them. The data were acquired in the following manner:

• Chitchat: A collection of close to 890 example user utterances and 59 intents was obtained for the chitchat expert system by combining the data from:
o Watson Assistant: the inbuilt ‘General’ content catalogue, which contains ten unique intents and close to 200 example utterances.
o Google Dialogflow: the inbuilt ‘smalltalk’ agent, which was exported and contains 86 intents and around 1500 example utterances.

• Short tail: A corpus of 16 intents with close to 150 example user utterances was created manually for the short-tail expert system of the car configurator use case, including intents such as ‘#GeneralRequirements’ and ‘#BookTestDrive’ with their corresponding example user utterances.

• Long tail: 118 car brochures, spread across different vehicle types, makes and models, were obtained and collated to be used as the corpus for the long-tail expert system.

The data for the chitchat and short-tail expert systems were ingested into two separate Watson Assistants, and the data for the long-tail system was ingested into Watson Discovery, for use in testing and analysis. After ingestion into Discovery, the search was optimised for relevancy using the out-of-the-box (OOTB) relevancy training available in Watson Discovery. The process entailed posing natural language queries to Discovery and marking the results returned by the service as ‘relevant’ or ‘not relevant’ based on their contents.

For testing the conversational agent, a dataset consisting of 50 example user utterances across five different conversations was manually created, mimicking users who would use the chatbot to configure a car, book a test drive or make similar requests. The utterances were written to be as close to a real conversation as possible, each with a sample response, and were drawn from different expert systems to enable testing of the multi-domain conversational agent.

3.4. Testing framework & Dataset optimisation

In general, the data for conversational agents, which comprises example user utterances and intents, is created by ‘subject matter experts’ based on ground truth (Freed, 2018). Utterances are created and then marked with ‘entities’ and labelled with expected ‘intents’; the corpus procured for the chitchat workspace from the Watson and Google services was created in the same fashion by employing APIs to crawl the web. A corpus procured in this way needs to be tested to identify any hidden patterns and weaknesses in it, which can then be remedied (Freed, 2018).

Testing is achieved by submitting the utterances to the classifier and checking whether the output of the classifier matches the set ‘ground truth’ (Freed, 2018). The data is split into training and testing/blind datasets using the k-fold cross-validation technique. After training the classifier on the training data, the held-out (blind) dataset is used to evaluate the classifier and obtain the required metrics. Precision and recall are used to evaluate and compare the performance of the classifier.
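The k-fold split described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the tooling used in the project; the function name `k_fold_indices` and the sample size are assumptions made for the example, and in practice the utterances would normally be shuffled before splitting.

```python
from typing import List, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs; each sample is tested exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        folds.append((train, test))
        start += size
    return folds


# Hypothetical corpus of 10 utterances split into 5 folds.
folds = k_fold_indices(n_samples=10, k=5)
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

Each fold trains on the remaining utterances and evaluates on the held-out ones, so every utterance contributes to the precision and recall figures exactly once.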

The testing is done on the chitchat corpus, and the metrics are obtained at intent level. To improve performance, the intents were sorted by recall value and those with the lowest values were selected to be fixed or removed. The misclassified utterances were either moved to different intents or edited to enable better classification. This process was carried out for all the utterances falling under the intents with the lowest recall values. After a complete overhaul, the data was re-ingested, and the testing was carried out again. The entire process was repeated multiple times, thereby improving the overall metric values and boosting the intents with initially low recall. The final dataset for the chitchat expert system, which had a precision value of 90.32% and was reduced to 50 unique intents and 859 utterances, was then ingested back into the Watson Assistant service.
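The intent-level evaluation described above could look roughly like the sketch below, which computes precision and recall per intent and ranks the intents so that those with the lowest recall are reviewed first. The intent names and predictions are hypothetical placeholders, and `per_intent_metrics` is an illustrative helper rather than part of any Watson tooling.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_intent_metrics(gold: List[str], predicted: List[str]) -> Dict[str, Tuple[float, float]]:
    """Map each intent to (precision, recall) computed from paired gold/predicted labels."""
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)
    gold_count = Counter(gold)
    pred_count = Counter(predicted)
    metrics = {}
    for intent in sorted(set(gold) | set(predicted)):
        precision = true_pos[intent] / pred_count[intent] if pred_count[intent] else 0.0
        recall = true_pos[intent] / gold_count[intent] if gold_count[intent] else 0.0
        metrics[intent] = (precision, recall)
    return metrics


# Hypothetical intents and classifier outputs, for illustration only.
gold = ["greeting", "greeting", "thanks", "thanks", "goodbye", "goodbye"]
predicted = ["greeting", "thanks", "thanks", "thanks", "goodbye", "greeting"]

metrics = per_intent_metrics(gold, predicted)
# Intents with the lowest recall come first: these are the candidates to fix or remove.
worst_first = sorted(metrics, key=lambda intent: metrics[intent][1])
```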

3.5. Scenarios

3.5.1. Overview

The primary aim is to create an adaptable rule-based or machine-learnt algorithm that weights the confidence and evidence returned by the expert systems to determine, for each conversational turn, which system is best placed to respond. To fulfil this aim, different scenarios have to be devised, analysed and compared. The outcome of the comparison gives the best algorithm to implement in order to obtain the best concurrency and switching between the expert systems in an efficient but logical manner based on user utterances. The scenarios thus formulated are as follows:

1. High-level classifier

2. Weighted high-level classifier

3. Based on pure-confidence values

4. Weighted system confidences

5. Emulating fallback logic

6. Testing for concurrency

The testing data created earlier, which consists of 50 unique user utterances, is input to the mixture of experts’ system, and the confidences returned are used as a baseline to perform the experiments and simulate the scenarios. A metric similar to accuracy was devised to compare the scenarios on a quantitative basis; it is obtained by dividing the number of correct classifications by the total number of classifications performed. Investigating the metric obtained after each simulation gives a clear insight into which expert system should be used to respond to a given utterance.
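As a small illustration of the metric defined above, the sketch below divides the number of correct selections by the total number of selections. The function name `scenario_metric` and the example labels are assumptions introduced here; they are not taken from the test corpus.

```python
from typing import Sequence


def scenario_metric(selected: Sequence[str], golden: Sequence[str]) -> float:
    """Fraction of utterances for which the selected expert matches the golden system."""
    if not golden or len(selected) != len(golden):
        raise ValueError("selected and golden must be non-empty and of equal length")
    correct = sum(s == g for s, g in zip(selected, golden))
    return correct / len(golden)


# Hypothetical labels for five conversational turns (four of them correct).
golden = ["short tail", "chitchat", "long tail", "short tail", "chitchat"]
selected = ["short tail", "chitchat", "long tail", "chitchat", "chitchat"]
metric = scenario_metric(selected, golden)
```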

3.5.2. Experiment 1: High-level classifier

The purpose of this experiment is to set up a high-level classifier which acts as the ‘selector’ in the mixture of experts’ system and is trained on a sample of the example utterances spread across the short-tail, chitchat and long-tail expert systems. The data corpus must be sampled to obtain an equal number of utterances from each class, fixed at 50 for the use case, to avoid any possible class bias. The ‘Pandas’ module in Python and its inbuilt ‘group by’ method are used for sampling the utterances of all three separate expert systems. The separately sampled datasets are concatenated to obtain a single corpus of 150 utterances labelled with the three classes of short-tail, chitchat and long-tail.
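The balanced sampling step can be sketched as below. The project used the Pandas ‘group by’ method; this sketch swaps in the Python standard library to stay dependency-free, so it shows the idea rather than the original code. The placeholder corpus, the helper name `balanced_sample` and the fixed seed are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (utterance, class label)


def balanced_sample(rows: List[Row], per_class: int, seed: int = 0) -> List[Row]:
    """Draw `per_class` rows from every class and concatenate them into one corpus."""
    by_label: Dict[str, List[Row]] = defaultdict(list)
    for utterance, label in rows:
        by_label[label].append((utterance, label))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample: List[Row] = []
    for label in sorted(by_label):
        if len(by_label[label]) < per_class:
            raise ValueError(f"not enough utterances for class {label!r}")
        sample.extend(rng.sample(by_label[label], per_class))
    return sample


# Hypothetical placeholder corpus; the class sizes here are made up.
corpus = (
    [(f"chitchat utterance {i}", "chitchat") for i in range(80)]
    + [(f"short-tail utterance {i}", "short-tail") for i in range(60)]
    + [(f"long-tail utterance {i}", "long-tail") for i in range(70)]
)
sampled = balanced_sample(corpus, per_class=50)
```

Sampling the same number of utterances per class keeps the selector from favouring whichever expert system happens to have the largest corpus.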

The dataset is further split in an 80:20 ratio into training and testing sets for the classifiers. Different algorithms are used as the base for the classifier so that the best one can be selected by comparing testing accuracy values. The algorithms are:

• Naïve Bayes: An algorithm based on Bayes’ theorem that assumes independence between the predictors used in the classification.

• Linear Classifier – Logistic Regression: It uses the logistic function at its core

to determine the relationship between the dependent variable and several inde-

pendent variables.

• Support Vector Machine (SVM): SVM is a supervised algorithm which at-

tempts to extract the best possible hyperplanes for classifying the data.

• Bagging Method - Random Forest (RF): The random forest method constructs numerous decision trees during training, and the output is the mode (for classification) or the mean (for regression) of the individual trees’ predictions. It reduces the overfitting found in single decision tree classification.

• Boosting Method – eXtreme Gradient Boosting Model (XGBoost): A super-

vised machine learning algorithm which uses an ensemble of other weaker mod-

els/ algorithms to reduce bias and variance.

Along with these algorithms, Google BERT (Bidirectional Encoder Representations from Transformers), a language model pre-trained on unlabelled text, was also used to build a classifier (Devlin et al., 2018). BERT uses bidirectional encoding and is available as several pre-trained models released by Google, which can be further fine-tuned to suit the application or requirement (Devlin et al., 2018). Because it is bidirectional, it takes into account the context of a word from both its left and right sides (Devlin et al., 2018).

BERT is built for binary classification out of the box and must be modified to work

with our use case of three classes (Devlin et al., 2018). Also, the training, validation

and testing data must be formatted to suit the input requirements of BERT, which is

done using a combination of Python coding and Excel.

The classifier was built using each of the machine learning algorithms/tools above on the three expert system corpora and tested. The resultant accuracy metric, which is the fraction of correctly classified samples, is as follows:

Algorithm                 Accuracy
Naïve Bayes               0.77
Linear Classification     0.74
Support Vector Machine    0.77
Bagging Model             0.67
Boosting Model            0.69
Google BERT               0.87

Table 2: High-level classifiers comparison

Google BERT gave the best accuracy values for the test data and was selected as the

best algorithm to use as the high-level classifier and to build the simulation for the

scenario. The simulation is carried out in the following manner:

• The test corpus created, which consists of 50 unique utterances across five con-

versations, is inputted to the BERT based classifier, and the output is obtained.

The output is a confidence score for every utterance for each expert system.

Utterance     Short tail Confidence   Chitchat Confidence   Long tail Confidence
Utterance 1   0.27420458              0.530363              0.19543229
Utterance 2   0.13647898              0.6893639             0.17415714
Utterance 3   0.25295562              0.4700206             0.27702382
Utterance 4   0.3946402               0.36452472            0.24083503
Utterance 5   0.3252571               0.38989067            0.2848522
Utterance 6   0.31253842              0.24711472            0.44034687
...           ...                     ...                   ...

Table 3: Sample high-level classifier output

• The expert system which returns the highest confidence is selected as the system

which is best placed to respond to the utterance at that conversational turn.

• The output of the scenario is obtained per utterance by checking whether the expert system selected after classification is the same as the ‘golden system’, which is part of the testing data and has been labelled by the subject matter expert based on the logical response expected.

• The recall metric for the scenario is obtained for comparison purposes during

the analysis stage.
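The selection step in the list above amounts to taking the expert with the highest confidence for each utterance. A minimal sketch using the six rows shown in Table 3 is given below; the dictionary layout and the helper name `select_expert` are assumptions made for illustration.

```python
from typing import Dict

# Confidence scores copied from the rows shown in Table 3.
table_3 = {
    "Utterance 1": {"short tail": 0.27420458, "chitchat": 0.530363, "long tail": 0.19543229},
    "Utterance 2": {"short tail": 0.13647898, "chitchat": 0.6893639, "long tail": 0.17415714},
    "Utterance 3": {"short tail": 0.25295562, "chitchat": 0.4700206, "long tail": 0.27702382},
    "Utterance 4": {"short tail": 0.3946402, "chitchat": 0.36452472, "long tail": 0.24083503},
    "Utterance 5": {"short tail": 0.3252571, "chitchat": 0.38989067, "long tail": 0.2848522},
    "Utterance 6": {"short tail": 0.31253842, "chitchat": 0.24711472, "long tail": 0.44034687},
}


def select_expert(confidences: Dict[str, float]) -> str:
    """Return the expert system with the highest confidence for one utterance."""
    return max(confidences, key=confidences.get)


selected = {utterance: select_expert(scores) for utterance, scores in table_3.items()}
for utterance, expert in selected.items():
    print(utterance, "->", expert)
```

Comparing each selected expert with the corresponding ‘golden system’ label then gives the per-utterance output of the scenario.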

3.5.3. Experiment 2: Weighted High-Level Classifier

In this experiment, the confidences obtained from the high-level classifier built on

Google BERT (as in Experiment 1) are weighted (as in Experiment 4) to see if this

brings about a beneficial change in the output of the mixture of experts’ system.

The experiment is carried out in the following manner:

• The confidence scores from the classifier were obtained as in Experiment 1 and

were further weighted (giving a bias to the confidence scores obtained from

each system, as in Experiment 4), and the results were obtained for an equal

weight of 1 across all systems.

• The expert system which returns the highest weighted confidence is selected as

the system which is best placed to respond to the utterance at that conversational

turn.

• The output of the scenario is obtained per utterance by checking whether the expert system selected after classification is the same as the ‘golden system’, which is part of the testing data and has been labelled by the subject matter expert based on the logical response expected.

• The metric for the scenario (the objective) is then optimised using Solver to obtain its maximum possible value by varying the weights assigned to the expert systems (the decision variables).

• The optimisation is performed using both GRG Non-Linear and Evolutionary methods to allow for comparison.

• The weights obtained after optimisation are further normalised to make them easier to interpret. Normalisation is done by fixing the weight of one system at 1. Short tail is selected here because the business problem is to focus on the short tail and bring in the other expert systems without changing its confidence. The normalised weights for the other expert systems are then obtained by dividing their optimised weights by the pre-normalised short-tail weight.
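The weighting and normalisation steps above can be sketched as follows. The confidences are those of Utterance 4 in Table 3; the weights are hypothetical values chosen for illustration and are not the optimised weights from the experiment. The helper names are also assumptions.

```python
from typing import Dict


def select_weighted(confidences: Dict[str, float], weights: Dict[str, float]) -> str:
    """Pick the expert whose weighted confidence (weight * confidence) is highest."""
    return max(confidences, key=lambda expert: weights[expert] * confidences[expert])


def normalise(weights: Dict[str, float], reference: str = "short tail") -> Dict[str, float]:
    """Divide every weight by the reference weight so that the reference becomes 1."""
    ref = weights[reference]
    if ref <= 0:
        raise ValueError("the reference weight must be positive")
    return {expert: weight / ref for expert, weight in weights.items()}


# Utterance 4 from Table 3.
confidences = {"short tail": 0.3946402, "chitchat": 0.36452472, "long tail": 0.24083503}
unit_weights = {"short tail": 1.0, "chitchat": 1.0, "long tail": 1.0}
weights = {"short tail": 0.8, "chitchat": 1.2, "long tail": 1.0}  # hypothetical values

normalised = normalise(weights)
print(select_weighted(confidences, unit_weights))
print(select_weighted(confidences, weights))
print(select_weighted(confidences, normalised))
```

Because normalisation divides every weight by the same positive number, it rescales all weighted confidences equally and leaves the selected expert unchanged; it only makes the weights easier to compare.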

3.5.4. Experiment 3: Based on unmodified confidence scores

This experiment aims to simulate a scenario in which the responding system is chosen using only the unmodified confidence scores returned by the three expert systems: chitchat and short tail, hosted in Watson Assistant, and long tail, hosted in Watson Discovery. It also aims to decide whether this approach is suitable for enabling the algorithm to adapt to changes during the conversation, based on parameters related to the conversational state or features from the user utterance, to best judge which system must respond.

The experiment is carried out in the following manner:

• The mixture of experts’ system is tested using the testing data comprising 50 example user utterances by posing these utterances to all three expert systems individually.

• The response of the systems in the form of confidence scores is retrieved and is

stored across the utterances.

• The expert system which returns the highest confidence is selected as the system

which is best placed to respond to the utterance at that conversational turn.

• The next stage of the simulation is carried out in Excel. For each utterance, a check is performed to verify whether the system selected from the confidence scores matches the ‘golden system’; the outcome of this check is the output of the scenario for that utterance.

• The metric for the scenario is obtained for comparison purposes during the anal-

ysis stage.

3.5.5. Experiment 4: Weighted confidence scores

In this experiment, the confidences obtained from the expert systems are weighted in

an attempt to see if this brings about a beneficial change in the output of the mixture of

experts’ system.

The experiment is carried out in the following manner:

• The confidence scores are obtained from the three expert systems as in Experi-

ment 3.

• The scores are then weighted in order to better classify the input utterances. The

weights are assigned an equal value of 1 to start with.

• For each utterance, a check is performed to verify whether the system selected from the weighted confidence scores matches the ‘golden system’; the outcome of this check is the output of the scenario for that utterance.

• The metric for the scenario (the objective) is then optimised using Solver to obtain its maximum possible value by varying the weights assigned to the expert systems (the decision variables).

• The optimisation is performed using both GRG Non-Linear and Evolutionary methods to allow for comparison.

• The weights obtained after optimisation are further normalised to make them easier to interpret. Normalisation is done by fixing the weight of one system at 1. Short tail is selected here because the business problem is to focus on the short tail and bring in the other expert systems without changing its confidence. The normalised weights for the other expert systems are then obtained by dividing their optimised weights by the pre-normalised short tail weight.
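The optimisation of the weights can be illustrated with a simple exhaustive grid search. This is a stand-in for Excel Solver, not a reproduction of its GRG Non-Linear or Evolutionary methods; the confidence rows, golden labels and grid values are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

EXPERTS = ("short tail", "chitchat", "long tail")
Row = Tuple[Dict[str, float], str]  # (confidences per expert, golden system)


def accuracy(rows: List[Row], weights: Dict[str, float]) -> float:
    """Fraction of rows where the highest weighted confidence matches the golden system."""
    correct = 0
    for confidences, golden in rows:
        selected = max(EXPERTS, key=lambda e: weights[e] * confidences[e])
        correct += selected == golden
    return correct / len(rows)


def grid_search(rows: List[Row], grid: Sequence[float]) -> Tuple[Dict[str, float], float]:
    """Try every combination of grid weights and keep the first one with the best accuracy."""
    best_weights = dict.fromkeys(EXPERTS, 1.0)
    best_score = accuracy(rows, best_weights)
    for combo in product(grid, repeat=len(EXPERTS)):
        candidate = dict(zip(EXPERTS, combo))
        score = accuracy(rows, candidate)
        if score > best_score:
            best_weights, best_score = candidate, score
    return best_weights, best_score


# Hypothetical confidences and golden labels, for illustration only.
rows: List[Row] = [
    ({"short tail": 0.60, "chitchat": 0.30, "long tail": 0.40}, "short tail"),
    ({"short tail": 0.50, "chitchat": 0.45, "long tail": 0.20}, "chitchat"),
    ({"short tail": 0.20, "chitchat": 0.70, "long tail": 0.30}, "chitchat"),
    ({"short tail": 0.35, "chitchat": 0.10, "long tail": 0.30}, "long tail"),
]
baseline = accuracy(rows, dict.fromkeys(EXPERTS, 1.0))
best_weights, best_score = grid_search(rows, grid=[0.5, 0.75, 1.0, 1.25, 1.5])
```

The accuracy metric is piecewise constant in the weights: small changes to a weight usually leave every selection unchanged. A gradient-based method therefore sees no slope to follow from most starting points, which may be one reason a population-based search behaves differently on this objective.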

3.5.6. Experiment 5: Emulating fallback logic

This experiment attempts to mimic the fallback logic employed by Watson in the out-of-the-box cloud service. The logic can be defined as follows, with three significant variations:

• Version 1:

o if ‘short tail confidence’ > ‘threshold confidence’:

▪ the short tail system must respond

o else if ‘chitchat confidence’ > ‘threshold confidence’:

▪ the chitchat system must respond

o else:

▪ the long tail system must respond

• Version 2:

o if ‘short tail confidence’ > ‘threshold confidence’:

▪ the short tail system must respond

o else if ‘long tail confidence’ > ‘threshold confidence’:

▪ the long tail system must respond

o else:

▪ the chitchat system must respond

• Version 3 (2 expert systems):

o if ‘combined confidence’ > ‘threshold confidence’:

▪ the combined system must respond

o else if ‘long tail confidence’ > ‘threshold confidence’:

▪ the long tail system must respond
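The three versions above translate directly into code. The sketch below is an illustrative transcription; the function names are assumptions, and because the text gives no branch for Version 3 when neither confidence exceeds the threshold, the sketch returns `None` in that case as a placeholder.

```python
from typing import Optional

WATSON_DEFAULT_THRESHOLD = 0.2  # default threshold quoted in the text


def fallback_v1(short_tail: float, chitchat: float,
                threshold: float = WATSON_DEFAULT_THRESHOLD) -> str:
    """Version 1: short tail first, then chitchat, with long tail as the fallback."""
    if short_tail > threshold:
        return "short tail"
    if chitchat > threshold:
        return "chitchat"
    return "long tail"


def fallback_v2(short_tail: float, long_tail: float,
                threshold: float = WATSON_DEFAULT_THRESHOLD) -> str:
    """Version 2: short tail first, then long tail, with chitchat as the fallback."""
    if short_tail > threshold:
        return "short tail"
    if long_tail > threshold:
        return "long tail"
    return "chitchat"


def fallback_v3(combined: float, long_tail: float,
                threshold: float = WATSON_DEFAULT_THRESHOLD) -> Optional[str]:
    """Version 3: combined (short tail + chitchat) system first, then long tail."""
    if combined > threshold:
        return "combined"
    if long_tail > threshold:
        return "long tail"
    return None  # assumption: the text does not define this case


print(fallback_v1(short_tail=0.1, chitchat=0.9))
print(fallback_v2(short_tail=0.1, long_tail=0.1))
print(fallback_v3(combined=0.1, long_tail=0.3))
```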

The experiment for versions 1 & 2 is conducted in the following way:

• After ingestion and training, the mixture of experts’ system is tested using the

test corpus consisting of 50 example user utterances created earlier.

• The response of the systems in the form of confidence scores is retrieved and

stored across the utterances.

• The logic is then simulated using Excel, and the outputs are obtained for both

the variations with the threshold set at 0.2, which is the default used by Watson.

• After obtaining the outputs, the metric value is calculated. The metric (the objective) is then optimised using Solver to obtain its maximum possible value by varying the threshold value (the decision variable).
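The threshold optimisation in the last step can be illustrated with a one-dimensional sweep over candidate thresholds, shown here for the Version 1 logic. The sweep is a stand-in for Solver, and the confidence values and golden labels are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Tuple[float, float, str]  # (short tail confidence, chitchat confidence, golden system)


def respond_v1(short_tail: float, chitchat: float, threshold: float) -> str:
    """Version 1 fallback logic for a single utterance."""
    if short_tail > threshold:
        return "short tail"
    if chitchat > threshold:
        return "chitchat"
    return "long tail"


def metric(rows: List[Row], threshold: float) -> float:
    """Fraction of utterances routed to their golden system at the given threshold."""
    correct = sum(respond_v1(st, cc, threshold) == golden for st, cc, golden in rows)
    return correct / len(rows)


def best_threshold(rows: List[Row], candidates: Sequence[float]) -> Tuple[float, float]:
    """Return the first candidate threshold that achieves the highest metric."""
    best_t, best_score = candidates[0], metric(rows, candidates[0])
    for t in candidates[1:]:
        score = metric(rows, t)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Hypothetical confidences and golden labels, for illustration only.
rows: List[Row] = [
    (0.90, 0.10, "short tail"),
    (0.30, 0.80, "chitchat"),
    (0.25, 0.10, "long tail"),
    (0.75, 0.40, "short tail"),
    (0.35, 0.60, "chitchat"),
]
default_score = metric(rows, 0.2)
best_t, best_score = best_threshold(rows, [i / 100 for i in range(100)])
```

With a low threshold the short tail branch fires for almost every utterance, so raising the threshold lets the later branches take over where short tail confidence is weak.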

The experiment for version 3 is conducted in the following two ways:

1. On the cloud service by manipulating the training data for the created assistants:

• The training data consisting of utterances is manipulated so that the utter-

ances and intents falling under short tail and chitchat expert systems are

combined into a single system and long tail is kept as a separate system. The

data is re-ingested into the Watson Assistant service for testing purposes.

• After ingestion, the mixture of experts’ system is tested using the same testing data comprising 50 example user utterances by posing these utterances to both expert systems individually.

• The response of the systems in the form of confidence scores is retrieved

and is stored across the utterances.

• The logic is then simulated with the threshold value set at 0.2 (Watson de-

fault value) using Excel and the output for the variation is obtained.

• After obtaining the outputs, the metric value is calculated. The metric (the objective) is then optimised using Solver to obtain its maximum possible value by varying the threshold value (the decision variable).

2. Building a high-level classifier with the training data mimicking the OOTB

Watson (Conversation AI toolkit) logic:

• The training data consisting of utterances is manipulated so that the utter-

ances and intents falling under short tail and chitchat expert systems are

combined into a single system and long tail is kept as a separate system. The

data is used to build a binary classifier using Google BERT as done in Ex-

periment 1.

• The test corpus created, which consists of 50 unique utterances across five

conversations, is inputted to the BERT based classifier, and the output is

obtained. The output is a confidence score for every utterance concerning

either class of data.

• The logic is then simulated with the threshold value set at 0.2 (Watson default value) using Excel, and the output is obtained for the variation.

• After obtaining the outputs, the metric value is calculated. The metric (the objective) is then optimised using Solver to obtain its maximum possible value by varying the threshold value (the decision variable).

3.5.7. Experiment 6: Effect of concurrency

This experiment investigates how systems deal with concurrency, which was identified as a vital area of the problem in the literature review in section 2.3.3.2. The aim is to gauge the effect of concurrency in the user input utterances within a conversation on the confidence values and output of the mixture of experts’ system. To do so, an updated test corpus has to be created which mimics the effect of concurrency.

The experiment is carried out in the following manner:

• An updated test corpus of 70 unique utterances across four conversations is created. It incorporates successive utterances within a conversation that fall under the domain of the same expert system (the effect of concurrency).

• This test data is then input to the mixture of experts’ system. The response, which comprises confidence scores and intents, is retrieved and stored across the utterances.

• A new parameter is introduced to vary the effect of concurrency on the output.

This parameter boosts the confidence returned from a particular expert system

if it is concurrent to the preceding utterance.

• The expert system which returns the highest confidence at the end of the boost-

ing is selected as the system which is best placed to respond to the utterance at

that conversational turn.

• For each utterance, a check is performed to verify whether the system selected from the boosted confidence scores matches the ‘golden system’; the outcome of this check is the output of the scenario for that utterance.

• The metric for the scenario (the objective) is then optimised using Solver to obtain its maximum possible value by varying the boost values applied to the expert systems (the decision variables).

• The optimisation is performed using both GRG Non-Linear and Evolutionary

methods to allow for comparison.
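The boosting step can be sketched as below. Two details are assumptions, since the text does not specify them: the boost is applied additively, and "concurrent to the preceding utterance" is taken to mean the expert selected on the previous turn. The confidences are hypothetical; the boost of 0.1 is the initial value reported in the results.

```python
from typing import Dict, List, Optional


def select_with_boost(conversation: List[Dict[str, float]],
                      boost: Dict[str, float]) -> List[str]:
    """Select an expert per turn, boosting the expert that answered the previous turn."""
    selected: List[str] = []
    previous: Optional[str] = None
    for confidences in conversation:
        boosted = {
            expert: score + (boost.get(expert, 0.0) if expert == previous else 0.0)
            for expert, score in confidences.items()
        }
        previous = max(boosted, key=boosted.get)
        selected.append(previous)
    return selected


# Hypothetical confidences for three consecutive turns of one conversation.
conversation = [
    {"short tail": 0.60, "chitchat": 0.30, "long tail": 0.10},
    {"short tail": 0.40, "chitchat": 0.45, "long tail": 0.15},
    {"short tail": 0.20, "chitchat": 0.70, "long tail": 0.10},
]
no_boost = select_with_boost(conversation, boost={})
with_boost = select_with_boost(
    conversation, boost={"short tail": 0.1, "chitchat": 0.1, "long tail": 0.1}
)
```

In the second turn the boost keeps the conversation with the short tail expert, while in the third turn the chitchat confidence is high enough to override it.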

4. Results

Throughout the experiments, several rule-based systems were examined, together with parameters that make these rules adaptable to the requirements and the conversational state, in order to gauge the system best placed to respond to the user utterance at a given conversational turn. These were simulated as variations within and across scenarios, as mentioned in section 3.5.

The results obtained from the experiments conducted can be analysed as follows:

1. Use of unmodified confidence scores: In the experiments where the confidence scores

were used to create a rule or logic for selection of an expert system within the mixture

of experts’ conversational agent, the test corpus was used to retrieve confidence scores

from the system and then used for further simulations. The confidences thus obtained

were used directly and also in a weighted manner to observe the changes being brought

about as can be seen below in Table 4.

Scenario                                Number of experts   Metric   Comments
Based on unmodified confidence scores   3                   0.82
Weighted confidence scores              3                   0.82     Initial weights of 1 each
Weighted confidence scores              3                   0.82     GRG optimised weights
Weighted confidence scores              3                   0.84     Evolutionary optimised weights

Table 4: Results from simulations based on the use of unmodified confidence scores

2. Use of a high-level classifier: In these scenarios, a high-level classifier was modelled

to mimic the working of the expert system in the Watson cloud service and use the

output confidences derived by testing using the test corpus as a base to formulate further

rules. The confidences were both used as an unmodified value and also as a weighted

component, as seen in Table 5.

Scenario                                Number of experts   Metric   Notes
High-level classifier (BERT)            3                   0.72
Weighted high-level classifier (BERT)   3                   0.72     Initial weights of 1 each
Weighted high-level classifier (BERT)   3                   0.72     GRG optimised weights
Weighted high-level classifier (BERT)   3                   0.90     Evolutionary optimised weights

Table 5: Results from simulations based on the use of a high-level classifier

3. Emulating Watson fallback logic: The inbuilt Watson logic and rules employed by the

Watson Assistant service while acting as a multiple domain system was simulated. The

simulations were carried out in two ways:

a. Method 1: Using the confidence scores from Watson, first with a combined chitchat (CC) and short tail (ST) corpus and long tail (LT) kept apart, and then with all three systems kept separate while varying the logic.

b. Method 2: Using a high-level classifier also built on the combined data to investigate its performance using the test data. The confidence scores were used both as unmodified values and as weighted components.

The outputs as follows in Table 6 can be analysed to gauge if any of the simulations

would be a good fit for creating the rules for the mixture of experts’ system.

Scenario                         Number of experts   Metric   Notes
Emulating fallback logic         2                   0.86     Threshold set at 0.2 (default)
Emulating fallback logic         2                   0.98     Optimised threshold of 0.55
Emulating fallback logic         3                   0.50     Version 1 with the threshold set at 0.2 (default)
Emulating fallback logic         3                   0.86     Version 1 with an optimised threshold of 0.69
Emulating fallback logic         3                   0.50     Version 2 with the threshold set at 0.2 (default)
Emulating fallback logic         3                   0.70     Version 2 with an optimised threshold of 0.67
High-level classifier            2                   0.96
Weighted high-level classifier   2                   0.96     Initial weights of 1 each
Weighted high-level classifier   2                   0.96     GRG optimised weights
Weighted high-level classifier   2                   0.96     Evolutionary optimised weights

Table 6: Results from simulations mimicking the Watson default logic

4. Testing for effect of concurrency:

A simulation was conducted to test the effect of concurrency being used as a weight in the

selection of an expert system to respond to the utterance at a particular conversational turn.

The presence of concurrency was modelled using a ‘boosting value’ which was used to

boost the value of the corresponding system’s confidence score. The testing for this sce-

nario was conducted using the extended and modified test corpus, which had situations for

concurrency incorporated into it; the output of which can be seen in Table 7.

Scenario                Number of experts   Metric   Comments
Effect of concurrency   3                   0.74     Initial boosting value of 0.1
Effect of concurrency   3                   0.74     GRG optimised boost values
Effect of concurrency   3                   0.77     Evolutionary optimised boost values

Table 7: Results from simulation incorporating the effect of concurrency

The values of the accuracy metric were found to be improved upon by using Excel Solver for

optimisation as can be seen in Table 8 for a 2-system architecture and Table 9 for a 3-system

architecture.

Type                                           Original metric   Improved metric   Change (%)
2 system with weights - Weighted classifier    0.94              0.94              0.00
2 system with confidence - OOTB Watson Rules   0.86              0.98              13.95

Table 8: Effect of optimisation on the metric for 2 expert systems

Type                                                     Original metric   Improved metric   Change (%)
3 system with weights - Weighted high-level classifier   0.72              0.90              25.00
3 system with weights - Weighted confidences             0.82              0.84              2.44
3 system with confidence - OOTB Rules – Version 2        0.50              0.70              40.00
3 system with confidence - OOTB Rules – Version 1        0.50              0.86              72.00
Testing for concurrency                                  0.74              0.77              4.05

Table 9: Effect of optimisation on the metric for 3 expert systems
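The ‘Change’ column in Tables 8 and 9 is consistent with a relative change expressed as a percentage of the original metric. The short check below recomputes it from the table values and averages it per architecture; the variable and function names are introduced here for illustration.

```python
from typing import Dict, Tuple


def percent_change(original: float, improved: float) -> float:
    """Relative improvement of the metric, as a percentage of the original value."""
    return (improved - original) / original * 100


# (original metric, improved metric) pairs copied from Tables 8 and 9.
table_8: Dict[str, Tuple[float, float]] = {
    "Weighted classifier": (0.94, 0.94),
    "OOTB Watson Rules": (0.86, 0.98),
}
table_9: Dict[str, Tuple[float, float]] = {
    "Weighted high-level classifier": (0.72, 0.90),
    "Weighted confidences": (0.82, 0.84),
    "OOTB Rules - Version 2": (0.50, 0.70),
    "OOTB Rules - Version 1": (0.50, 0.86),
    "Testing for concurrency": (0.74, 0.77),
}


def average_change(table: Dict[str, Tuple[float, float]]) -> float:
    """Mean percentage change over all rows of a table."""
    changes = [percent_change(original, improved) for original, improved in table.values()]
    return sum(changes) / len(changes)


print(round(average_change(table_8), 2))
print(round(average_change(table_9), 2))
```

The two averages come out at roughly 7% for the 2-system architecture and roughly 29% for the 3-system architecture, matching the figures quoted in section 5.3.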

5. Discussion & Conclusion

5.1. Qualitative evaluation

A better understanding of the work carried out can be gained by looking at the strengths

and weaknesses of the project on a qualitative basis. This should further help in identifying

opportunities for future work and any associated possibilities in the domain.

Strengths:

1. Uniqueness: There has been minimal work which has been carried out in the domain

of employing expert systems and maintaining a balance between them in the area

of conversational agents. The research carried out will also help the business make

an informed decision on the technologies and rules to use while attempting to build

a multi-domain conversational agent. This makes the work carried out unique in its

aspect.

2. Achieving research goals: The literature review identified concurrency and the robust handling of user inputs as drawbacks of present systems. Concurrency was therefore investigated as a facet in its own right, examining how the expert systems respond in its presence and absence.

3. Test corpus creation: Two datasets were made from scratch for testing purposes in the ‘Car Configurator’ use case and put to use in the experiments. The dataset used for extensive testing consisted of 50 user utterances across five conversations, and the dataset used for testing concurrency comprised 70 utterances across four conversations. These can be used for future work with minimal modifications based on requirements.

Weaknesses:

1. Cognitive Bias in testing data: Since both the training (short tail expert system) and

testing datasets were created by me during the process of carrying out the experi-

ments, there is a high likelihood of cognitive bias being introduced into the same.

Cognitive bias is something which is faced by any entity trying to create a corpus

of data for use in a data science problem. This may have led to the introduction of

noise in data and the presence of unwanted filters which can sometimes cause cru-

cial aspects in the domain to be missed out. The presence of such cognitive bias can

lead to misclassification by the expert systems.

2. Small data for training short tail: The corpus used for training the short tail expert system, consisting of 16 intents and approximately 150 example utterances, is significantly smaller than the corpora used for training the chitchat and long-tail expert systems. However, this is precisely the problem businesses face when trying to incorporate a short tail domain into their conversational agent, and it is the problem I have aimed to address with this research. They have massive datasets available for long-tail and chitchat but want to have a short tail facet ready with minimal effort and a much smaller corpus.

3. Lack of ‘real’ user data: Another drawback which can be taken into account is the

lack of real user testing data which can be put to use for testing the performance of

the created mixture of experts’ system.

5.2. Possibility for future work

1. Two different datasets were used for testing the scenarios. The initial set had 50 utterances across five conversations, and the one updated to account for concurrency had 70 utterances across four conversations. If more time were available, the first step would be to conduct all experiments and test all scenarios on the updated dataset, as it accounts for concurrency and may lead to results closer to the ‘ground reality’ of user conversations.

2. Run a beta test on real user data by opening up the platform to a small section of users, thereby obtaining real user conversations on which further testing can be carried out and performance metrics obtained. This could lead to a more thorough investigation of the problem at hand and its possible solutions.

3. Currently, the mixture of expert systems has been built using only two and three experts based on the requirements. Another aspect to investigate in the future would be to identify which scenario works best and how the accuracy changes as more systems are introduced, for example when a fourth one is added, and perhaps to model the relationship between the number of systems and accuracy.

5.3. Recommendations & Findings

• A comparative analysis was performed between traditional machine learning algorithms used predominantly for classification tasks and Google BERT, a state-of-the-art model devised for several natural language processing tasks. The results were highly favourable for BERT in this particular case, which had a small training data corpus.

• Google BERT was also put to the test against a traditional ‘Conversational Artificial Intelligence (AI)’ configuring toolkit, Watson Assistant, using the default logic utilised while creating a multi-domain conversational system. This was done because BERT was found to perform well on small datasets. The classifier was found to have a slightly lower performance in comparison to the toolkit.

• The scenarios for rule building while creating a multi-domain mixture of ex-

perts’ system in the conversational agent’s space can be divided into two based

on the data available and the manipulations possible on the same as follows:

o If the training data for chitchat and short tail can be merged into a single

corpus while keeping long-tail separate, the best scenario or rule to adopt

would be the Watson default logic as done in Version 3 of the emulating

fallback logic. This would entail building a mixture of experts’ system

with two domain-specific experts.

o Building a mixture of experts’ system using three separate domain-specific experts of short tail, chitchat and long tail was found to be the best possible scenario for cases wherein data merging is not possible. Among those, a high-level classifier with normalised and weighted confidences gave the best results.

• Another important finding was the improvement in result metrics when parameters such as the weights for the system confidences or the cut-off confidences were optimised using Solver. The average improvement was around 7% for the 2 expert system scenarios and 29% for the 3 expert system scenarios. Thus, it can be postulated that the need for optimisation possibly increases with an increase in the number and variety of expert systems and their weighting.

References

Colby, K.M., 1975. Artificial Paranoia - 1st Edition. Pergamon Press INC. Maxwell House, New York, NY, England.

Davis, B., 2018. Vodafone’s chatbot is delivering double the conversion rate of its website – Econsultancy [WWW Document]. URL https://econsultancy.com/vodafones-chatbot-is-delivering-twice-the-conversion-rate-of-its-website/ (accessed 8.31.19).

Devlin, J., Chang, M.-W., Lee, K., Toutanova, K., 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv181004805 Cs.

Ebrahimpour, 2009. Recognition of Persian handwritten digits using Characterization Loci and Mixture of Experts. Int. J. Digit. Content Technol. Its Appl. 3. https://doi.org/10.4156/jdcta.vol3.issue3.5

Estabrooks, A., Japkowicz, N., 2001. A Mixture-of-experts Framework for Text Classification, in: Proceedings of the 2001 Workshop on Computational Natural Language Learning - Volume 7, ConLL ’01. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 9:1–9:8. https://doi.org/10.3115/1117822.1117828

Excel Solver: Which Solving Method Should I Choose?, 2016. EngineerExcel. URL https://www.engineerexcel.com/excel-solver-solving-method-choose/ (accessed 8.16.19).

Ferrucci, D., Brown, E., Chu-Carroll, J., Fan, J., Gondek, D., Kalyanpur, A.A., Lally, A., Murdock, J.W., Nyberg, E., Prager, J., Schlaefer, N., Welty, C., 2010. Building Watson: An Overview of the DeepQA Project. AI Mag. 31, 59–79. https://doi.org/10.1609/aimag.v31i3.2303

Forsyth, R., 1984. Expert systems: principles and case studies. London; New York: Chapman and Hall; New York, NY: Methuen.

Forsyth, R., n.d. The architecture of expert systems 7.

Freed, A.R., 2018. Testing Strategies for Chatbots (Part 1)— Testing Their Classifiers [WWW Document]. Medium. URL https://medium.com/ibm-watson/testing-strategies-for-chatbots-part-1-testing-their-classifiers-20becaf5f211 (accessed 8.14.19).

Hampshire, J.B., Waibel, A., 1992. The Meta-Pi network: building distributed knowledge representations for robust multisource pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 14, 751–769. https://doi.org/10.1109/34.142911

Harb, H., Chen, L., Auloge, J.-, 2004. Mixture of experts for audio classification: an application to male female classification and musical genre recognition, in: 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763). Presented at the 2004 IEEE International Conference on Multimedia and Expo (ICME) (IEEE Cat. No.04TH8763), pp. 1351-1354 Vol.2. https://doi.org/10.1109/ICME.2004.1394479

Hartikainen, M., Turunen, M., Hakulinen, J., Salonen, E.-P., Adam Funk, J., 2004. Flexible dialogue management using distributed and dynamic dialogue control.

Ignizio, J.P., 1991. Introduction to expert systems: the development and implementation of rule-based expert systems. McGraw-Hill, New York.

Ignizio, J.P., 1990. A brief introduction to expert systems. Comput. Oper. Res. 17, 523–533. https://doi.org/10.1016/0305-0548(90)90058-F

Io, H.N., Lee, C.B., 2017. Chatbots and conversational agents: A bibliometric analysis, in: 2017 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM). Presented at the 2017 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), pp. 215–219. https://doi.org/10.1109/IEEM.2017.8289883

Jacobs, R.A., Jordan, M.I., Nowlan, S.J., Hinton, G.E., 1991. Adaptive Mixtures of Local Experts. Neural Comput. 3, 79–87. https://doi.org/10.1162/neco.1991.3.1.79

Jacobs, R.A., Jordan, M.I., Barto, A.G., 1991. Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cogn. Sci. 15, 219–250. https://doi.org/10.1016/0364-0213(91)80006-Q

Koehler, A., 2017. Meet TOBi the chatbot: The latest addition to our customer service team [WWW Document]. Vodafone Soc. Off. Vodafone UK Blog. URL https://blog.vodafone.co.uk/2017/04/12/meet-tobi-chatbot-latest-addition-vodafone-uks-customer-service-team/ (accessed 8.31.19).

Komatani, K., Kanda, N., Nakano, M., Nakadai, K., Tsujino, H., Ogata, T., Okuno, H.G., 2006. Multi-domain Spoken Dialogue System with Extensibility and Robustness Against Speech Recognition Errors, in: Proceedings of the 7th SIGdial Workshop on Discourse and Dialogue, SigDIAL ’06. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 9–17.

Lin, B.-S., Wang, H., Fen, Q., 1999. Consistent Dialogue Across Concurrent Topics Based On An Expert System Model.

Liu, X., Eshghi, A., Swietojanski, P., Rieser, V., 2019. Benchmarking Natural Language Understanding Services for building Conversational Agents. ArXiv190305566 Cs.

Lu, Z., 2006. A regularized minimum cross-entropy algorithm on mixtures of experts for time series prediction and curve detection. Pattern Recognit. Lett. 27, 947–955. https://doi.org/10.1016/j.patrec.2005.12.002

M. O’Neill, I., Hanna, P., Liu, X., Mctear, M., 2004. Cross domain dialogue modelling: an object-based approach.

Masoudnia, S., Ebrahimpour, R., 2014. Mixture of experts: a literature survey. Artif. Intell. Rev. 42, 275–293. https://doi.org/10.1007/s10462-012-9338-y

Mossavat, S.I., Amft, O., Vries, B. de, Petkov, P.N., Kleijn, W.B., 2010. A bayesian hierarchical mixture of experts approach to estimate speech quality, in: 2010 Second International Workshop on Quality of Multimedia Experience (QoMEX). Presented at the 2010 Second International Workshop on Quality of Multimedia Experience (QoMEX), pp. 200–205. https://doi.org/10.1109/QOMEX.2010.5516203

Nakano, M., Funakoshi, K., Hasegawa, Y., Tsujino, H., 2008. A Framework for Building Conversational Agents Based on a Multi-expert Model, in: Proceedings of the 9th SIGdial Workshop on Discourse and Dialogue, SIGdial ’08. Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 88–91.

NatWest begins testing AI driven ‘digital human’ in banking first [WWW Document], n.d. URL https://www.rbs.com/rbs/news/2018/02/natwest-begins-testing-ai-driven-digital-human-in-banking-first.html (accessed 8.31.19).

Niranjan, M., Saipreethy, M.S., Kumar, T.G., 2012. An intelligent question answering conversational agent using Naïve Bayesian classifier, in: 2012 IEEE International Conference on Technology Enhanced Education (ICTEE). Presented at the 2012 IEEE International Conference on Technology Enhanced Education (ICTEE), pp. 1–5. https://doi.org/10.1109/ICTEE.2012.6208614

Peng, F., Jacobs, R.A., Tanner, M.A., 1996. Bayesian Inference in Mixtures-of-Experts and Hierarchical Mixtures-of-Experts Models with an Application to Speech Recognition. J. Am. Stat. Assoc. 91, 953–960. https://doi.org/10.1080/01621459.1996.10476965

Rumney, E., 2018. British bank RBS hires “digital human” Cora on probation. Reuters.

Setiaji, B., Wibowo, F.W., 2016. Chatbot Using a Knowledge in Database: Human-to-Machine Conversation Modeling, in: 2016 7th International Conference on Intelligent Systems, Modelling and Simulation (ISMS). Presented at the 2016 7th International Conference on Intelligent Systems, Modelling and Simulation (ISMS), pp. 72–77. https://doi.org/10.1109/ISMS.2016.53

Shieber, S.M., 1994. Lessons from a Restricted Turing Test. Commun ACM 37, 70–78. https://doi.org/10.1145/175208.175217

Shum, H., He, X., Li, D., 2018. From Eliza to XiaoIce: challenges and opportunities with social chatbots. Front. Inf. Technol. Electron. Eng. 19, 10–26. https://doi.org/10.1631/FITEE.1700826

Simmons, R.F., 1970. Natural Language Question-answering Systems: 1969. Commun ACM 13, 15–30. https://doi.org/10.1145/361953.361963

Suzuki, J., Taira, H., Sasaki, Y., Maeda, E., 2003. Question Classification using HDAG Kernel, in: Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering. Association for Computational Linguistics, Sapporo, Japan, pp. 61–68. https://doi.org/10.3115/1119312.1119320

Tripathi, K.P., 2011. A Review on Knowledge-based Expert System: Concept and Architecture. Artif. Intell. Tech. 5.

Turing, A.M., 1950. I.—COMPUTING MACHINERY AND INTELLIGENCE. Mind LIX, 433–460. https://doi.org/10.1093/mind/LIX.236.433

Tzafestas, S.G., Kokkinaki, A.I., Valavanis, K.P., 1993. An Overview of Expert Systems, in: Tzafestas, S. (Ed.), Expert Systems in Engineering Applications. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 3–24. https://doi.org/10.1007/978-3-642-84048-7_1

Versace, M., Bhatt, R., Hinds, O., Shiffer, M., 2004. Predicting the exchange traded fund DIA with a combination of genetic algorithms and neural networks. Expert Syst. Appl. 27, 417–425. https://doi.org/10.1016/j.eswa.2004.05.018

Walker, M., S. Aberdeen, J., Boland, J., Bratt, E., S. Garofolo, J., Hirschman, L., N. Le, A., Lee, S., Narayanan, S., Papineni, K., L. Pellom, B., Polifroni, J., Potamianos, A., Prabhu, P., Rudnicky, A., Sanders, G., Seneff, S., Stallard, D., Whittaker, S., 2001. DARPA communicator dialog travel planning systems: the june 2000 data collection. pp. 1371–1374.

Wallace, R.S., 2009. The Anatomy of A.L.I.C.E., in: Epstein, R., Roberts, G., Beber, G. (Eds.), Parsing the Turing Test: Philosophical and Methodological Issues in the Quest for the Thinking Computer. Springer Netherlands, Dordrecht, pp. 181–210. https://doi.org/10.1007/978-1-4020-6710-5_13

Walter, P., Elsen, I., Muller, H., Kraiss, K.-, 1999. 3D object recognition with a specialized mixtures of experts architecture, in: IJCNN’99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339). Presented at the IJCNN’99. International Joint Conference on Neural Networks. Proceedings (Cat. No.99CH36339), pp. 3563–3568 vol.5. https://doi.org/10.1109/IJCNN.1999.836243

Waltinger, U., Breuing, A., Wachsmuth, I., 2012. Connecting Question Answering and Conversational Agents. KI - Künstl. Intell. 26, 381–390. https://doi.org/10.1007/s13218-012-0208-1

Wang, Z., Ahmadvand, A., Choi, J.I., Karisani, P., Agichtein, E., n.d. Emersonbot: Information-Focused Conversational AI Emory University at the Alexa Prize 2017 Challenge 11.

Weigend, A.S., Mangeas, M., Srivastava, A.N., 1995. Nonlinear gated experts for time series: discovering regimes and avoiding overfitting. Int. J. Neural Syst. 06, 373–399. https://doi.org/10.1142/S0129065795000251

Weizenbaum, J., 1966. ELIZA—a Computer Program for the Study of Natural Language Communication Between Man and Machine. Commun ACM 9, 36–45. https://doi.org/10.1145/365153.365168

Williams, J.D., Young, S., 2007. Partially observable Markov decision processes for spoken dialog systems. Comput. Speech Lang. 21, 393–422. https://doi.org/10.1016/j.csl.2006.06.008

Yazdani, M., 1989. Expert Systems Principles and Case Studies, in: Forsyth, R. (Ed.). Chapman & Hall, Ltd., London, UK, pp. 173–183.

Yuksel, S.E., Wilson, J.N., Gader, P.D., 2012. Twenty Years of Mixture of Experts. IEEE Trans. Neural Netw. Learn. Syst. 23, 1177–1193. https://doi.org/10.1109/TNNLS.2012.2200299

Zhang, D., Lee, W.S., 2003. Question Classification Using Support Vector Machines, in: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval, SIGIR ’03. ACM, New York, NY, USA, pp. 26–32. https://doi.org/10.1145/860435.860443

Appendices

A. Code for obtaining output from IBM Watson Services

'''
This code reads multiple user utterances from a file and passes each of them
to all 3 expert systems. The confidences obtained from the systems are then
stored as a matrix across single user utterances in the output Excel file.
'''

import ibm_watson
import pandas as pd
from ibm_watson import DiscoveryV1

# setting configurations for Watson Assistant
api_version_assistant = ''
apikey_assistant = ''
assistant_url = ''
shorttail_workspace = ''
chitchat_workspace = ''

assistant = ibm_watson.AssistantV1(
    version=api_version_assistant,
    iam_apikey=apikey_assistant,
    url=assistant_url
)

# setting configurations for long tail / Watson Discovery service
api_version_discovery = ''
apikey_discovery = ''
discovery_url = ''
environment_id = ''
collection_id = ''

discovery = DiscoveryV1(
    version=api_version_discovery,
    iam_apikey=apikey_discovery,
    url=discovery_url
)

# read csv file with input test utterances
df = pd.read_csv('inputfilepath')  # give path to file containing test utterances

# defining columns in dataframe to store confidence values
df['shorttail_confidence'] = '0'
df['shorttail_intent'] = '0'
df['chitchat_confidence'] = '0'
df['chitchat_intent'] = '0'
df['longtail_confidence'] = '0'
df['lt_conf1'] = '0'
df['lt_conf2'] = '0'
df['lt_conf3'] = '0'

# passing utterances to chitchat system
for i in range(len(df)):
    utterance = df['message'].loc[i]
    response_cc = assistant.message(
        workspace_id=chitchat_workspace,
        input={'text': utterance}
    ).get_result()
    if response_cc['intents'] == []:
        # no intent recognised by the chitchat expert
        df.loc[i, 'chitchat_confidence'] = '0'
        df.loc[i, 'chitchat_intent'] = 'Invalid'
    else:
        df.loc[i, 'chitchat_confidence'] = response_cc['intents'][0]['confidence']
        df.loc[i, 'chitchat_intent'] = response_cc['intents'][0]['intent']

# passing utterances to shorttail system
for i in range(len(df)):
    utterance = df['message'].loc[i]
    response_st = assistant.message(
        workspace_id=shorttail_workspace,
        input={'text': utterance}
    ).get_result()
    # check the short-tail response here (the original listing tested response_cc)
    if response_st['intents'] == []:
        df.loc[i, 'shorttail_confidence'] = '0'
        df.loc[i, 'shorttail_intent'] = 'Invalid'
    else:
        df.loc[i, 'shorttail_confidence'] = response_st['intents'][0]['confidence']
        df.loc[i, 'shorttail_intent'] = response_st['intents'][0]['intent']

# passing utterances to longtail system
for i in range(len(df)):
    user_input = df['message'].loc[i]
    input_text = "text:" + user_input
    query_ex = discovery.query(
        environment_id, collection_id, filter=None, query=input_text,
        natural_language_query=None, passages=True, aggregation=None, count=3,
        return_fields=None, offset=None, sort=None, highlight=True,
        passages_fields=None, passages_count=3, passages_characters=None,
        deduplicate=None, deduplicate_field=None, collection_ids=None,
        similar=None, similar_document_ids=None, similar_fields=None,
        bias=None, logging_opt_out=None)
    results = query_ex.result['results']
    for j in range(len(results)):
        if j < 3:  # get top 3 results from discovery / longtail system
            df.loc[i, 'longtail_confidence'] = results[0]['result_metadata']['confidence']
            df.loc[i, 'lt_conf' + str(j + 1)] = results[j]['result_metadata']['confidence']
        else:
            break

# give path to store the output as an excel file for further manipulation
df.to_excel('outputfilepath', index=False)

B. Test Data Types

a. Type 1

Conversation 1

Hi Good Morning How are you? I am looking to buy a car Can you show me red sedans please? Does it come in black? Go back one step Does the car have abs and ebd? I want to know how many mpg it gives Let's buy that

Conversation 2

Hello what's up can we chat can you help me configure a car I am looking for something in the mid 20k pound range I like green SUV's with a sunroof Yes please Does it have driver assist? How much is the insurance going to cost? Ok, let's buy that

Conversation 3

Good evening What's up? How are you? I wanna buy a car I am looking for a sedan with manual transmission What is the power of that car? Does it have 6 airbags? I like it What is the tax liability Ok, let's go ahead and send the quote

Conversation 4

Greetings Good afternoon Describe yourself How can you help I want to configure a car I am looking for a blue automatic sedan What's the wheelbase of the car? does the car have bluetooth in it? can you book a test drive for me? I'll be back later

Conversation 5

are you here? Hey Good Morning I want to customise a car Looking for something in the 40k pound range with automatic transmission Does it have dsg transmission? Perfect! Exactly what I am looking for Can I get a test drive for that nearby I am ready to buy. Please send configuration to dealer

b. Type 2

Conversation 1

Hi What are you I am looking to buy a car Can you show me an automatic red sedan please? I like the Audi a4 Does it come in black? Go back one step Does the car have abs and ebd? I want to know how many mpg it gives This is exactly what I am looking for Can I get a quote of what it's costing now Would it be possible to get a test drive tomorrow? Excellent Goodbye for now

Conversation 2

Hello can we chat help me configure a car I am looking for a something in the mid 20k pound range I want a green hatchback with a sunroof I like the kia rio I want to see that with 17-inch alloys It's beginning to move towards exactly what I am looking for Does it have driver assist? What is the power of that car? Does it have 6 airbags? How much is the insurance going to cost? How much is the cost now? Ok, let's go ahead and send the quote I'll be back later Bye

Conversation 3

Greetings Describe yourself How can you help I want to configure a car I am looking for a diesel SUV I want it in blue with black alloys Start again I want to experiment I am looking to buy a petrol convertible with automatic transmission I love the mercedes amg-gt I want to see the car in a light green colour instead of black That's perfect What's the tax liability? Does it have dsg transmission? What's the wheelbase of the car? does the car have bluetooth in it? can you book a test drive on Friday at 4pm for me? Great Work You're funny Are you real? I'll be back later bye

Conversation 4

are you here? Good afternoon I want to customise a car Looking for something in the 40k pound range with automatic transmission Which of those have Apple car in them? I'll go for the hyundai i30 I want it in dark blue How many airbags does it have? Does it have ebd? What's the mileage in mpg? That's really nice Can I get a test drive for that nearby now Yes please I am ready to buy. Please send configuration to dealer I am bored! Nah Talk to you later

Machine Learning Dissertation(1).pdf

High Frequency Oscillation Detection Using Wavelet Analysis and

Convolutional Neural Networks

Joe Morris

September 2019

School of Mathematics, Cardiff University

A dissertation submitted in partial fulfilment of the requirements for MSc (in Data Science and Analytics) by taught programme.

Acknowledgements

First and foremost, I would like to thank my project supervisor Alexia Zoumpoulaki, who has

provided fantastic guidance throughout this process. Her insights and advice have been

instrumental in bringing this project to fruition.

I would also like to thank Miguel Navarrete, whose expertise in this field has been

invaluable to this project. I am immensely grateful for the time he has taken to produce the

simulated data for the project.

I would also like to thank Supercomputing Wales for allowing me access to their facilities.


List of Acronyms

HFO High frequency oscillation

CNN Convolutional Neural Network

EEG Electroencephalogram

iEEG Intracranial Electroencephalogram

IPSP Inhibitory Postsynaptic Potential

EPSP Excitatory Postsynaptic Potential

Table of Figures

Figure 1: Taking windows of a signal
Figure 2: Naively structured inception module
Figure 3: 1×1 convolutions for depth reduction
Figure 4: Inception module structure with 1×1 convolutional blocks
Figure 5: Residual connection
Figure 6: Wavelet analysis of 3 waves
Figure 7: CNN structure proposed by Lai et al
Figure 8: Stem of newly proposed models
Figure 9: Module 1 of Model A
Figure 10: Module 2 of Model A
Figure 11: Module 3 of Model A
Figure 12: Module 4 of Model A
Figure 13: Module 5 of Model A
Figure 14: Module 1 of Model B
Figure 15: Module 2 of Model B
Figure 16: Module 3 of Model B
Figure 17: Module 4 of Model B
Figure 19: Module 5 of Model B
Figure 18: Module 6 of Model B
Figure 20: Module 1 of Model C
Figure 21: Module 2 of Model C
Figure 22: Module 3 of Model C
Figure 23: Module 4 of Model C
Figure 24: Module 5 of Model C
Figure 25: Confusion matrix for Model B's application on the test set
Figure 26: Confusion matrix for the Industry Standard's application on the test set
Figure 27: Venn diagram comparing the performance of Models
Figure 28: Model B's performance on Ripples over distance
Figure 29: Industry Standard's performance on Ripples over distance
Figure 30: Model B's performance on Fast Ripples over distance
Figure 31: Industry Standard's performance on Fast Ripples over distance
Figure 32: Model B's performance on Spikes over distance
Figure 33: Industry Standard's performance on Spikes over distance
Figure 34: Model B's performance on Ripple-FastRipples over distances
Figure 35: Industry Standard's performance on Ripple-FastRipples over distances
Figure 36: Incorrect predictions of Model B on Ripple-FastRipples
Figure 37: Incorrect predictions of the Industry Standard on Ripple-FastRipples
Figure 38: Model B's performance on Spike-Ripples over distances
Figure 39: Industry Standard's performance on Spike-Ripples over distances
Figure 40: Incorrect predictions of Model B on Spike-Ripples
Figure 41: Incorrect predictions of the Industry Standard on Spike-Ripples
Figure 42: Distribution of incorrect predictions by Model B on Spike-Ripples
Figure 43: Distribution of incorrect predictions by the Industry Standard on Spike-Ripples
Figure 44: Model B's performance on Spike-FastRipples over distances
Figure 45: Industry Standard's performance on Spike-FastRipples over distances
Figure 46: Incorrect predictions of Model B on Spike-FastRipples
Figure 47: Incorrect predictions of the Industry Standard on Spike-FastRipples
Figure 48: Distribution of incorrect predictions by Model B on Spike-FastRipples
Figure 49: Distribution of incorrect predictions by the Industry Standard on Spike-FastRipples

Summary

This study applies to the field of automated high frequency oscillation (HFO) detection.

HFOs are a biomarker of epileptogenic tissue. Therefore, models capable of automatically

detecting these behaviours could improve the success rates of epileptic tissue removal

surgery, as well as significantly reduce costs for health service providers. Of course, this is a

challenging task. In particular, the difficulty of this problem relates to the disruptive effects of

noise and non-HFO behaviours within the iEEG signal. Such signal behaviours are easily

misclassified as HFOs.

This dissertation project utilizes depth-wise stacks of time-frequency plots, which may allow

for the encoding of accessory information. Once the frequency behaviour over time of the

iEEG signal is captured within these plots, CNN models are used to classify them as either

HFO or non-HFO behaviour. We present new model structures specifically designed to

capture the intricate details within these time-frequency plots. CNN models with these

specific structures have never been utilised on this problem and this study shows the

applicability of these alternative structures to this task.
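As a rough illustration of the time-frequency step described above, the following Python sketch builds a wavelet scalogram from a one-dimensional signal and stacks several scalograms depth-wise into a multi-channel image. Everything specific in it is an assumption made for illustration: the toy signal (a 10 Hz background with a short 160 Hz burst), the sampling rate, the frequency grid, the `morlet_scalogram` helper, and the choice of stacking scalograms computed with different numbers of wavelet cycles. It is not the pre-processing pipeline used in this project.

```python
import numpy as np


def morlet_scalogram(signal, fs, freqs, n_cycles=6.0):
    """Magnitude of a complex Morlet wavelet transform.

    Returns an array of shape (len(freqs), len(signal)).
    The signal is assumed to be longer than the longest wavelet.
    """
    power = np.empty((len(freqs), len(signal)))
    for row, f in enumerate(freqs):
        sigma_t = n_cycles / (2.0 * np.pi * f)       # width of the Gaussian envelope (s)
        t = np.arange(-4.0 * sigma_t, 4.0 * sigma_t, 1.0 / fs)
        wavelet = np.exp(2j * np.pi * f * t) * np.exp(-t**2 / (2.0 * sigma_t**2))
        wavelet /= np.sum(np.abs(wavelet))           # amplitude normalisation
        power[row] = np.abs(np.convolve(signal, wavelet, mode="same"))
    return power


# Toy signal (hypothetical): slow background plus a brief high-frequency burst.
fs = 2000                                            # sampling rate in Hz
time = np.arange(fs) / fs                            # one second of samples
signal = np.sin(2 * np.pi * 10 * time)
burst = (time >= 0.4) & (time < 0.6)
signal[burst] += 0.5 * np.sin(2 * np.pi * 160 * time[burst])

freqs = np.arange(80, 501, 20)                       # 80 Hz to 500 Hz in 20 Hz steps

# Depth-wise stack: one channel per wavelet width (an illustrative choice).
stack = np.stack(
    [morlet_scalogram(signal, fs, freqs, n_cycles=c) for c in (4.0, 6.0, 8.0)],
    axis=-1,
)
print(stack.shape)                                   # (frequencies, time samples, channels)
print(freqs[np.argmax(stack[:, 1000, 1])])           # dominant frequency at the burst centre
```

An array of this shape can be passed to a CNN in the same way as a multi-channel image, with the depth axis carrying whatever accessory information is chosen for each channel.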

Three new alternative models are constructed. First, cross-validation is performed in order to

test the stability of each model’s performance. Once an optimal model structure is chosen,

said model and a re-created version of the industry standard model are applied to a final test

set. The model proposed in our research provides more accurate results than the industry

standard model on this task. In fact, the model proposed here performs more accurately on all

wave types simulated.

An analysis of the relationship between the predictive power of the models and the distance

to behaviours is also conducted. We conclude that the smaller the displacement between

electrodes and the position of HFOs, the more accurate the predictions. Conversely, we find

that the smaller the distance to disruptive non-HFO behaviours such as spikes, the larger the

disruptive effects.


Contents

Acknowledgements
List of Acronyms
Table of Figures
Summary
1. Introduction
2. Background
2.1 EEG
2.2 HFOs
2.3 Electrical Behaviour at the Cellular Level
2.4 Fourier Analysis and Filtering
2.5 Time Frequency Analysis and Wavelets
3. Literature Review and History of the Field
3.1 Overview of Previous Methods
3.1.1 Early Papers
3.1.2 Clustering Methods
3.1.3 Linear SVMs
3.1.4 Neural Networks
3.1.5 Convolutional Neural Networks
3.2 Current Industry Standard
4. Introduction to Convolutional Neural Networks
4.1 Overview of Basic Principles
4.1.1 How Images are Interpreted by a CNN
4.1.2 Convolutional Layers
4.1.3 Pooling Layers
4.2 Inception Networks
4.3 Residual Connections
5. Data Pre-processing
5.1 Wave Simulations
5.2 Simulating the Effects of Distance
5.3 Time-Frequency Plot Creation
5.4 Creation of Sets for Cross Validation and Final Testing
5.5 Limitations of Simulated Data
6. Construction of Models
6.1 Reconstruction of the Industry Standard Model
6.2 Construction of New Models
6.2.1 Model A
6.2.2 Model B
6.2.3 Model C
6.3 Applicability of Deep Learning and CNNs to HFO Detection
7. Applying the Models
7.1 Cross-Validation Results
7.2 Test Set Results
7.2.1 Overall Performance
7.2.2 Accuracy Breakdown by Wave Type
7.3 Predictive Power of Models over Distance
7.3.1 Simple Waveforms – Ripples, Fast Ripples and Spikes
7.3.2 Complex Waveforms – Pairings Between Ripples, Fast Ripples and Spikes
7.4 Performance on Small Datasets
8. Final Discussion
8.1 Limitations to Conclusions
8.2 Possible Next Steps
9. Appendices
10. Bibliography

1. Introduction

Epilepsy is a widespread neurological disorder. For many sufferers, the use of medication can

reduce or even stop the occurrence of seizures. However, in some cases, this course of action

is not feasible. This can be due to a variety of reasons such as the medication being

ineffective or causing particularly adverse side effects. In cases where medicine is not a viable

treatment option, one alternative is the surgical removal of the brain tissue that is responsible.

Of course, the success of these surgical procedures is dependent on the correct identification

of the tissue that is to be removed. Intracranial electroencephalography (iEEG) is a method

carried out pre-surgery in order to locate these epileptogenic zones. iEEG utilises

microelectrodes placed within the exposed brain tissue of the patient. Measurements of

voltage at each electrode site are taken. These continuous voltage signals can give us an

insight into the behaviour of the brain at these locations. Specifically, high frequency

oscillations (HFOs) are of interest, as these signal behaviours have proven to be a biomarker

for the epileptogenic zones of the brain (Jacobs et al. 2008).

The current methods for detecting and subsequently classifying these behaviours are centred

around visual inspection by reviewers. For health care services, the application of automated

techniques promises a more optimal use of their time and financial resources, as well as

removing human error from the process. However, researchers have struggled to find a

method capable of classifying HFOs from the noisy background of iEEG data.

Recent research indicates that time-frequency plots are a valuable tool for identification of

HFOs from within iEEG data (Liu et al. 2016). These plots allow for an exceptionally

accurate representation of the frequencies present in a signal and how such frequencies change over

time. In addition, the application of Convolutional Neural Networks (CNNs) to classify these

plots has led to promising results (Lai et al. 2019).

In this research, simulations are used to recreate the scenario of an HFO detection problem.

We then build upon the work of Lai et al. and investigate the applicability of new CNN model

structures to this classification problem.

2. Background

When conducting research in a field such as this, it is vital to consider the underlying

scientific foundations. While this dissertation primarily focuses on the development of new

machine learning methods, the science behind the data on which these methods are applied is

important for context.

2.1 EEG

The electroencephalogram (EEG) is a method of measuring the electrical activity of the brain

using electrodes placed on the scalp. However, this method suffers from certain limitations.

For instance, the low conductivity of the skull makes this approach poor at detecting the

electrical activity of the inner brain. The intracranial electroencephalogram (iEEG) is a

variation of the EEG in which the electrical activity of the brain is instead measured through

electrodes placed within the exposed surface of the brain.

At each electrode position, the output of an iEEG is a continuous measurement of the changes

in voltage plotted against time (Luck, 2014, p. 4). More precisely, individual measurements

of voltage are taken at such a high frequency that the wave can be treated as continuous in

practice. The brain is of course always active, which leads to an ever-changing signal

measurement at each electrode.

2.2 HFOs

iEEG signal oscillations can occur at a variety of frequencies. Over the last few decades, the

development of broad-band digital EEG has increased the frequency measurement capacity to

over 500Hz (Navarrete et al. 2016a).

In this field of study, signals are often categorized by their frequency. In fact, it is common

practice to assign each frequency band a Greek letter. For example, Delta waves are

usually of relatively low frequency (2–4 Hz), while Gamma waves denote oscillations with a

fairly high frequency (30–100 Hz) (Whittingstall et al. 2009). To further complicate matters,

researchers' definitions of what exactly constitutes an HFO vary. One such definition is that

HFOs are oscillatory activities that have a power increase inside the 40-800Hz band and have

durations in the tens of milliseconds (Navarrete et al. 2016a).

Certain HFOs have been found to be physiological in nature. For instance, HFOs have been

measured from several regions of the brain when participants were exposed to visual stimuli

in the form of images (Kucewicz et al. 2014). While many HFOs are manifestations of

physiological processes of the brain, some have proven to be pathological in nature. Of

particular interest is the research linking the appearance of certain HFOs to the seizure onset

zone (Jacobs et al. 2008). Indeed, research strongly links the removal of brain tissue

associated with these HFOs to the success of epilepsy surgery (Wu et al. 2010) (Jacobs et al.

2010).

In this field of epileptic tissue identification, two of the most commonly studied HFOs are

ripples and fast ripples. Ripples are HFOs that occur with a frequency range of 120 - 240 Hz,

while fast ripples are HFOs that occur with a frequency exceeding this range (Navarrete et al.

2016a).

2.3 Electrical Behaviour at the Cellular Level

The human brain is composed of billions of cells known as neurons. These are highly

specialized cells that take on a variety of structures in order to carry out specific tasks. To this

end, neurons are classified based on factors such as their function, location, and shape (Squire

et al. 2008, p.4). Despite this wide range of different cell types, there are structures common

to almost all neurons: a cell body (soma), dendrites and an axon. The soma contains the

nucleus and various other organelles that are vital for the neuron’s functionality. Dendrites

extend from this cell body and branch in often elaborate patterns; they are responsible for the

transmission of electrochemical signals that are received from other cells to the soma. This

signal is then passed to other cells via the axon (Squire et al. 2008, p.4).

There are numerous contributors to the measurable electrical activity of the brain. The two

most prominent of these are action potentials and postsynaptic potentials; moreover, these

phenomena are interdependent. Action potentials are voltage spikes that are transmitted

through the cell from the soma to the axon (Luck, 2014, p. 39). Postsynaptic potentials either

encourage or inhibit action potentials and their origin is a little more complex.

Synapses can be thought of as the junction between neurons, across which these cells

communicate. When an action potential occurs within a neuron, neurotransmitters are

released; these neurotransmitters then bind to sites on the postsynaptic neuron. When a

neurotransmitter binds to the postsynaptic cell it has one of two possible effects. It may

hyperpolarize the membrane of the postsynaptic cell. This is known as an inhibitory

postsynaptic potential (IPSP), which decreases the probability of the postsynaptic neuron firing

an action potential. Alternatively, a neurotransmitter may depolarize the membrane of the

postsynaptic neuron. This is known as an excitatory postsynaptic potential (EPSP), which

increases the probability of the postsynaptic cell firing an action potential.

If several neurons have action potentials that occur simultaneously, and the axons of the cells

are orientated in parallel to each other, then this voltage may aggregate. However, if one of

these conditions is not met, then this will lead to signal cancellation. The short duration of an

action potential means it is unlikely for these conditions of both timing and orientation to be

met. A large number of summed action potentials are needed to create a voltage capable of

successfully propagating through the resistive brain tissue and scalp. Therefore, electrical

activities originating from action potentials are not always measurable by a classical EEG

(Luck, 2014, p. 39). Conversely, postsynaptic potentials have a longer duration, and the

probability of voltage summation is higher, meaning postsynaptic potentials can usually be

measured from a greater distance (Luck, 2014, p. 40). This means they are more likely to be

measurable using classical EEG.

Research suggests that HFOs, and in particular HFOs of a pathological nature, originate

from a number of simultaneously occurring action potentials from groups of suitably

orientated neurons (Cendes et al. 2018). Since iEEG is taken intra-cranially, the resistive

effect of the scalp is removed and the distance between the electrical activity and the

measuring electrodes is reduced. This results in iEEG being a far more sensitive measuring

instrument to certain electrical behaviours. Indeed, the HFOs originating from these summed

action potentials are far more measurable using iEEG rather than EEG.

2.4 Fourier Analysis and Filtering

Filtering is a commonly used technique in the field of EEG and iEEG analysis. It is used to

suppress signal behaviours that fall within certain frequency bands (Luck, 2014, p. 226). In

the field of HFO detection, filtering is predominantly applied to iEEG signals as a data pre-

processing method aimed at suppressing low frequency behaviour and isolating oscillations

occurring at higher frequencies. Despite the relative success of this technique within this

field, research also suggests application of filtering in scenarios where it is not suitable to do

so may lead to distortive effects (Bénar et al. 2010).

Filter types are categorised by the frequencies they let pass or suppress. For example, high

pass filters let high frequency oscillations pass, while attenuating low frequencies. Low pass

filters allow low frequencies to pass while attenuating high frequencies. Band pass filters are

used to attenuate high and low frequencies simultaneously at the user’s discretion.

The filtering process is based on the underlying concept of Fourier analysis. Fourier analysis

allows for the deconstruction of a continuous signal in the time domain to a number of sine

waves with various frequencies, amplitudes and phases (Luck, 2014, p. 220). The Fourier

transform, created by the mathematician Joseph Fourier, is the function that allows for this

mapping from the time domain onto the frequency domain.

It’s worth noting that a non-zero measurement on a signal’s chart on the frequency domain

does not necessarily reflect that oscillations occurred at that specific frequency. Rather, the

respective chart of a continuous signal in the frequency domain reflects the sine waves and

their respective properties that are needed to reconstruct the original signal. However, in

practice, oscillations at a particular frequency in the raw signal will usually manifest as strong

power at that frequency within a chart on said frequency domain. (Luck, 2014, p. 225)

Once the raw signal is translated from the time domain onto the frequency domain by use of this

transformation, each frequency's power is then multiplied by some pre-defined gain value.

Each gain value is a number between 0 and 1. Subsequently, frequencies are attenuated or

passed based on the corresponding gain value that is to act upon them. For example, in a high

pass filter, high frequencies are multiplied by high gains in order to preserve them, while low

frequencies are inhibited through multiplication by a small gain. The gain values to be

utilized on each specific frequency band are defined by a frequency response function. Once

suitable attenuations have been carried out, this now filtered version of the signal, which lies

on the frequency domain, is translated back onto the time domain by the inverse Fourier

transform to re-form a continuous signal.
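
The gain-based procedure described above can be illustrated with a minimal sketch. This is not the filter used in this research: the sampling rate, the 80Hz cutoff, the toy signal and the "brick wall" frequency response are all assumptions chosen for illustration, and practical filters use smoother response functions to limit distortion. Here the gains are applied to the Fourier coefficients themselves.

```python
import numpy as np

def gain_filter(signal, fs, gain_fn):
    """Filter `signal` by scaling each Fourier coefficient by a gain in [0, 1]."""
    spectrum = np.fft.rfft(signal)                        # time domain -> frequency domain
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)      # frequency (Hz) of each coefficient
    gains = np.clip(gain_fn(freqs), 0.0, 1.0)             # frequency response function
    return np.fft.irfft(spectrum * gains, n=len(signal))  # back to the time domain

def ideal_high_pass(cutoff_hz):
    """Hypothetical 'brick wall' response: gain 1 at or above the cutoff, 0 below."""
    return lambda freqs: (freqs >= cutoff_hz).astype(float)

fs = 2000                                  # assumed sampling rate in Hz
t = np.arange(fs) / fs                     # one second of samples
slow = np.sin(2 * np.pi * 10 * t)          # 10 Hz background oscillation
fast = 0.2 * np.sin(2 * np.pi * 200 * t)   # 200 Hz oscillation of interest
filtered = gain_filter(slow + fast, fs, ideal_high_pass(80))
```

In this toy case the filtered output matches the 200Hz component, since the 10Hz component is multiplied by a gain of zero.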

2.5 Time Frequency Analysis and Wavelets

Time frequency analysis is another commonly used technique in the field of EEG analysis.

Again, this is based upon the deconstruction of waves into the wave frequencies needed to re-

create the original signal. However, in this method, information concerning the time in which

these frequencies occur is also maintained.

When considering a method to properly examine how the frequency of a wave changes over

time, the Fourier transform proves itself to be extremely ineffective. In fact, taking a Fourier

transform of an entire iEEG signal would mean the loss of all information as to what

frequencies occurred at which point in time.

A possible solution to this problem could be to take several sample points over the section of

signal we are interested in. At each sample point chosen, a window of pre-specified size

could be centred on the point. We could then conduct Fourier analysis within these windows

and attribute the results as happening at the point corresponding to the centre of a window.
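
A minimal sketch of this windowed approach is given below, assuming a toy two-second signal whose frequency changes half way through. The sampling rate, window length and window centres are illustrative assumptions, not values used in this research.

```python
import numpy as np

def windowed_spectra(signal, fs, window_len, centres):
    """Return (freqs, power) with one row of power per window centre."""
    half = window_len // 2
    freqs = np.fft.rfftfreq(window_len, d=1.0 / fs)
    rows = []
    for c in centres:
        segment = signal[c - half : c - half + window_len]  # window centred on sample c
        rows.append(np.abs(np.fft.rfft(segment)) ** 2)      # power at each frequency
    return freqs, np.array(rows)

fs = 1000                                   # assumed sampling rate in Hz
t = np.arange(2 * fs) / fs                  # two seconds of signal
# Toy signal: 50 Hz during the first second, 200 Hz during the second.
signal = np.where(t < 1.0, np.sin(2 * np.pi * 50 * t), np.sin(2 * np.pi * 200 * t))

centres = [250, 750, 1250, 1750]            # four non-overlapping windows
freqs, power = windowed_spectra(signal, fs, window_len=500, centres=centres)
dominant = freqs[power.argmax(axis=1)]      # strongest frequency in each window
```

Each row of `power` is attributed to the centre of its window, so the change in dominant frequency from 50Hz to 200Hz is only located to within one window length.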

To visualize how such a method may be carried out, figure 1 shows how several windows

could be used to analyse two seconds of signal for the presence of

an HFO.

Figure 1: Taking windows of a signal (a raw signal of 2 seconds with the windows marked)

Note that while figure 1 utilizes only four windows, each of which is disconnected from all

other windows, there is nothing to stop us from increasing the number of windows and

ensuring that these windows overlap to give a more in-depth analysis of the signal at all

points in time. Unfortunately, there are huge limitations imposed by using such a method.

One such problem is that the relevance of all oscillatory behaviour within a window is

weighted equally by the Fourier transform, and yet the behaviour is attributed to a single

point in time. This is not ideal, as there is potential for interesting oscillatory behaviour that

occurs on the outer border of a window to be attributed to a point in time despite the

behaviour occurring before/after that point in time.

Another limitation is imposed by the use of a constant window size. This is because of the

balancing act of localizing wave behaviour in terms of both their time and frequency

simultaneously. Different window sizes will have specific benefits for the task. For instance,

small window sizes mean we are likely to have a high resolution in terms of time, while we

will find it difficult to pick up low frequency behaviour that takes a longer time to complete

oscillations. On the other hand, larger windows are suited to picking up on the lower

frequency activity that occurs over larger time intervals. However, by using larger windows,

we are less certain of where exactly this behaviour is occurring, i.e. we lose resolution in

terms of time.

A solution to these issues is wavelet analysis. There are many types of wavelets, each with

their own benefits for signal processing. The wavelets utilised in this research are Gabor

wavelets, which are formed by multiplying sine waves of particular frequencies by a

Gaussian function.

Rather than deconstructing a wave into sine waves of varying frequencies, as the Fourier

transform does, wavelet analysis deconstructs a signal into many wavelets of different

frequencies. This is achieved by a process of convolution between the signal and the different

wavelets. In practical applications, wavelets of low frequency and therefore longer time

duration are convolved with the signal to pick up low frequency, long duration activities of

the raw signal, while short duration, high frequency wavelets are convolved with the signal

to pick up high frequency activity.

By using a mix of wavelet frequencies, we escape the limitations of fixed window sizes.

Additionally, since we are considering a Gabor wavelet, which is the product of a sine wave and a Gaussian bell

curve, when convolution takes place between the wavelet and signal within certain windows

of time, the points in the outer region of such a window are attenuated according to their

relative position in time. This means that the more central a point in a window, the more

strongly it is attributed to that point in time.

A time-frequency image is simply a 2D matrix in which the rows stand for each frequency

considered in a wavelet analysis, and the columns the specific times at which these

frequencies occurred. Wavelet analysis allows for the deconstruction of signal into the

frequencies that occur while also maintaining a high temporal resolution.
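
To make the construction of a time-frequency image concrete, the following sketch convolves a toy signal with Gabor wavelets of several frequencies. It is an illustration under stated assumptions rather than the procedure used later in this dissertation: the complex form of the wavelet, the `n_cycles` rule for the width of the Gaussian, the normalisation and the frequency grid are all assumed.

```python
import numpy as np

def gabor_wavelet(freq_hz, fs, n_cycles=6.0):
    """Complex sinusoid at `freq_hz` multiplied by a Gaussian envelope."""
    sigma = n_cycles / (2 * np.pi * freq_hz)          # envelope width in seconds (assumed rule)
    t = np.arange(-4 * sigma, 4 * sigma, 1.0 / fs)    # support of +/- 4 standard deviations
    gaussian = np.exp(-t ** 2 / (2 * sigma ** 2))
    return gaussian * np.exp(2j * np.pi * freq_hz * t)

def time_frequency_image(signal, fs, freqs_hz):
    """Rows are frequencies, columns are time samples, entries are power."""
    image = np.empty((len(freqs_hz), len(signal)))
    for row, f in enumerate(freqs_hz):
        w = gabor_wavelet(f, fs)
        w = w / np.abs(w).sum()                       # crude normalisation across frequencies
        image[row] = np.abs(np.convolve(signal, w, mode="same")) ** 2
    return image

fs = 2000                                             # assumed sampling rate in Hz
t = np.arange(fs) / fs
signal = np.zeros_like(t)
burst = (t > 0.45) & (t < 0.55)                       # 100 ms burst centred at 0.5 s
signal[burst] = np.sin(2 * np.pi * 250 * t[burst])    # toy 250 Hz event

freqs_hz = np.arange(50, 501, 50)                     # 50, 100, ..., 500 Hz
image = time_frequency_image(signal, fs, freqs_hz)
row, col = np.unravel_index(image.argmax(), image.shape)
```

The largest entry of `image` falls in the 250Hz row and within the time span of the burst, which is the kind of localisation in both time and frequency that the fixed-window approach struggles to provide. Note that low frequency rows use longer wavelets and high frequency rows use shorter ones.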

3. Literature Review and History of the Field

Early research was able to identify HFOs and link these oscillations to the seizure onset zone

(Fisher et al. 1992). Before attempts at automation were proposed, researchers deployed

manual review to identify oscillations of interest. This was a monotonous, time-consuming

task and required the efforts of highly trained experts in the field. Their excellent work laid

the foundations of the field today.

Since then, numerous research papers have aimed to create a method for the automated

detection of HFOs. As the field has progressed, a range of machine learning techniques have

been applied in hope of finding a suitable model. Due to the large number of papers

published in the field, a full overview of each method applied is beyond the scope of this

dissertation. However, in this chapter, an overview that highlights some of the most notable

and successful papers is given.

It should be noted that most HFO detection techniques can be thought of as a three-step

process. Initially, filtering is applied to the raw signal in order to bring to focus frequencies of

interest. Secondly, putative HFOs are identified by the application of a threshold based on

selected characteristics of the filtered wave. Lastly, more advanced machine learning models

are used to distinguish true HFOs from background noise and errors (Navarrete et al. 2016a).

3.1 Overview of Previous Methods

3.1.1 Early Papers

The earliest attempt at automated HFO detection was a process put forward in 2002 by Staba

et al. (Staba et al. 2002). Raw signal was first filtered to attenuate all frequency behaviour

outside of the 100-500Hz range. This filtered signal was then passed over by a 3-millisecond

sliding window, and for each window the root mean square (RMS) amplitude was calculated.

Successive RMS values calculated to be greater than 5 standard deviations above the mean

RMS value of the entire signal, with duration of 6ms or longer, were selected as putative

HFOs. These potential HFOs were then subjected to an additional rule: there must be a

minimum of 6 peaks greater than 3 standard deviations above the mean value of the rectified

band pass signal. The method, while not providing a completely effective solution to

automated HFO detection, provided further evidence for the existence of different forms of

HFOs, specifically the existence of ripples and fast ripples.
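
The first stage of this detector can be sketched as follows. This is a simplified illustration and not the implementation of Staba et al.: the sampling rate and the toy signal are assumptions, the input is taken to be already band-pass filtered, and the second rule concerning the number of peaks in the rectified signal is omitted.

```python
import numpy as np

def rms_candidates(filtered, fs, window_ms=3.0, n_sd=5.0, min_duration_ms=6.0):
    """Return (start, end) sample indices of runs where the sliding RMS exceeds the threshold."""
    win = max(1, int(round(window_ms * fs / 1000)))
    # Sliding RMS: square, moving average, square root.
    rms = np.sqrt(np.convolve(filtered ** 2, np.ones(win) / win, mode="same"))
    above = rms > rms.mean() + n_sd * rms.std()
    # Locate the start and (exclusive) end of each run of above-threshold samples.
    padded = np.concatenate(([False], above, [False])).astype(int)
    edges = np.flatnonzero(np.diff(padded))
    starts, ends = edges[::2], edges[1::2]
    min_len = int(round(min_duration_ms * fs / 1000))
    return [(int(s), int(e)) for s, e in zip(starts, ends) if e - s >= min_len]

fs = 2000                                            # assumed sampling rate in Hz
rng = np.random.default_rng(0)
t = np.arange(5 * fs) / fs                           # five seconds of toy signal
filtered = 0.05 * rng.normal(size=t.size)            # stand-in for filtered background
filtered[4000:4100] += np.sin(2 * np.pi * 250 * t[4000:4100])  # 50 ms burst at 250 Hz
candidates = rms_candidates(filtered, fs)
```

On this toy signal a single candidate is returned, spanning approximately the samples occupied by the inserted burst.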

In Crépon et al, a Hilbert transform was applied to the high pass filtered signal in order to

obtain a signal envelope (Crépon et al. 2009). Local maxima of this signal envelope were

considered putative HFOs. Unfortunately, the classification methods detected many false

positives. This led to the research team having to manually review potential false positives

using visual inspection of both the raw signal and time-frequency maps. Therefore, this could

only be considered a semi-automated procedure.
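
As an illustration of the envelope step, the sketch below computes a signal envelope as the magnitude of the analytic signal, using an FFT-based Hilbert transform. The toy signal and the amplitude threshold of 0.5 are assumptions for demonstration only and are not taken from Crépon et al.

```python
import numpy as np

def envelope(signal):
    """Magnitude of the analytic signal, computed with an FFT-based Hilbert transform."""
    n = len(signal)
    spectrum = np.fft.fft(signal)
    weights = np.zeros(n)
    weights[0] = 1.0                       # keep the DC term
    if n % 2 == 0:
        weights[n // 2] = 1.0              # keep the Nyquist term
        weights[1 : n // 2] = 2.0          # double the positive frequencies
    else:
        weights[1 : (n + 1) // 2] = 2.0
    # Negative frequencies keep a weight of zero and are therefore removed.
    return np.abs(np.fft.ifft(spectrum * weights))

fs = 2000                                                # assumed sampling rate in Hz
t = np.arange(fs) / fs
amplitude = np.exp(-((t - 0.5) / 0.02) ** 2)             # slowly varying burst amplitude
signal = amplitude * np.cos(2 * np.pi * 200 * t)         # toy 200 Hz event
env = envelope(signal)

# Local maxima of the envelope above a hypothetical amplitude threshold.
is_peak = (env[1:-1] > env[:-2]) & (env[1:-1] > env[2:]) & (env[1:-1] > 0.5)
peaks = np.flatnonzero(is_peak) + 1
```

Here the envelope recovers the slowly varying amplitude of the burst, and its single local maximum above the threshold marks the centre of the event.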

3.1.2 Clustering Methods

As the field of automated HFO detection progressed, so did the application of statistical

techniques and algorithms that have been successful in other fields. A multitude of papers

have applied clustering methods to this classification problem.

Research conducted by Blanco et al. was able to classify HFOs into three distinct classes through

the use of k-medoid clustering, with a fourth group of artifacts also formed (Blanco et al.

2010). Their methods included the use of the RMS-based method for pre-selecting putative

HFOs proposed by Staba et al. The output of these clustering methods gave further evidence

for the existence of distinct sub-classes of HFO, such as ripples, fast ripples, and mixed

frequency events. This research was particularly notable as the first paper to apply

unsupervised methods to the problem.

In 2019, both semi-supervised k-means and mean shift clustering algorithms were utilised

(Du et al. 2019). Several pre-processing steps were taken. Firstly, data normalization was

conducted using min-max normalization, before several filtering methods were applied. The

Teager energy operator and wavelet entropy were used as features in the semi-supervised

k-means clustering algorithm to separate pathological and physiological HFOs. Previously

labelled data was used to initiate the k-means clustering algorithm. Remaining data was then

labelled based on relative position to these pre-created cluster centres and was used to

iteratively calculate new centroids. The group labelled as pathological HFOs then had an

unsupervised mean shift clustering algorithm applied to further divide into sub-classes.
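
The seeding idea behind semi-supervised k-means can be sketched as below. This is a generic illustration and not the exact algorithm of Du et al.: the two-dimensional toy features stand in for quantities such as Teager energy and wavelet entropy, and retaining the labelled points in their original clusters when updating the centroids is an assumption.

```python
import numpy as np

def seeded_kmeans(labelled_x, labelled_y, unlabelled_x, n_iter=20):
    """k-means whose initial centroids are the class means of the labelled data."""
    classes = np.unique(labelled_y)
    centroids = np.array([labelled_x[labelled_y == c].mean(axis=0) for c in classes])
    for _ in range(n_iter):
        # Assign every unlabelled point to its nearest centroid.
        dists = np.linalg.norm(unlabelled_x[:, None, :] - centroids[None, :, :], axis=2)
        assigned = dists.argmin(axis=1)
        # Recompute centroids from labelled and newly assigned points together.
        updated = []
        for k, c in enumerate(classes):
            members = np.vstack([labelled_x[labelled_y == c], unlabelled_x[assigned == k]])
            updated.append(members.mean(axis=0))
        updated = np.array(updated)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return classes[assigned], centroids

# Hypothetical feature vectors; class 0 and class 1 stand in for the two HFO types.
labelled_x = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
labelled_y = np.array([0, 0, 1, 1])
unlabelled_x = np.array([[0.1, 0.2], [0.95, 0.9], [0.3, 0.1], [0.8, 1.2]])
labels, centroids = seeded_kmeans(labelled_x, labelled_y, unlabelled_x)
```

The labelled examples fix which cluster corresponds to which class, so the labels assigned to the remaining data can be interpreted directly.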

3.1.3 Linear SVMs

Research by Matsumoto et al. provided an in-depth analysis that led to a further understanding

of how event frequency, duration and amplitude are linked to physiological and pathological

HFOs (Matsumoto et al. 2013). This paper proposed a linear SVM for HFO classification.

Visual scanning and finger movement exercises were used to yield physiological HFOs

within iEEG recordings of patients. These physiological events were then compared to the

pathological HFOs also recorded. The linear SVM provided extremely mixed results; for

instance specificity ranged between 32.61% and 99.38% according to which patient the HFOs

were recorded from.

Jrad et al proposed the use of a multi-class linear SVM to classify HFOs from artifacts and

noise, as well as classify HFOs themselves into 4 distinct sub-groups (Jrad et al. 2016). Gabor

atoms were utilized to deconstruct raw signal into specific frequency bands, before energy

ratios and temporal information were used as features for input to the multiclass SVM. A

particularly interesting part of this study was the use of simulated data to test their model. As

well as using recorded data from epilepsy patients, this paper simulated HFO data by

inserting real-life events into background activity. This is a technique similar to the one

utilized in this dissertation and provides support for the possible applicability of simulated

data in HFO detection.

3.1.4 Neural Networks

The first attempt at the application of neural networks to the HFO classification problem was

proposed by Dümpelmann et al (Dümpelmann et al. 2012). In this paper, a radial basis

function (RBF) network was proposed as a classifier. Pre-processing of the raw signal was

conducted via the application of a high pass filter. Three features were extracted and used to

form input vectors for the network: short time energy, short time line length and short time

instantaneous frequency. Unfortunately, results showed particularly low sensitivity and

specificity of 49.1% and 36.3% respectively.

López-Cuevas et al. investigated the use of an artificial recurrent neural network (ARNN) to

classify HFOs in rats (López-Cuevas et al. 2013). Feature extraction was carried out by the

use of approximate entropy to highlight points of high unpredictability in the raw signal.

While the paper mentions that the ARNN was trained and tested on this data, the results were

not released.

3.1.5 Convolutional Neural Networks

The first application of a CNN to this classification task was proposed by Zuo et al. (Zuo et al.

2019). In this paper, data pre-processing consisted of applying band-pass filtering on the raw

signal. The raw signal was then divided into one second periods. Oscillatory behaviour

considered to be noise and artifacts was removed. Greyscale images in which a higher

amplitude of the signal was plotted using darker colours were used as an input for the

network. In this paper, the CNN's ability to classify HFOs from non-HFOs was measured

against commonly used methods such as a Short Time Energy detector, Short Line Length

detector, Hilbert detector and MNI detector that were implemented in the RIPPLELAB

application (Navarrete et al. 2016b). In most instances, the proposed CNN structures were

able to gain more accurate results than these highly respected models.

3.2 Current Industry Standard

A true evaluation of which approach and specific paper has been the most successful in the

field of automated HFO detection is impossible. This is primarily because papers test their

models on sets of putative HFOs generated by their own specific patients, pre-processing

steps, and opinions of what exactly constitutes an HFO. These datasets then offer varying

levels of difficulty to models and make a direct comparison of accuracies between papers

unsuitable in most situations. The propensity of papers to focus on sub-problems of the field,

such as only considering ripples or fast ripples but not both, also makes direct comparisons

between papers laborious and often inappropriate.

The industry standard model in this dissertation is taken to be a method proposed by Lai et al.

This uses short time energy calculations to identify putative HFOs, before a CNN is

developed to classify their time-frequency plots (Lai et al. 2019).

iEEG data was collected from five patients using subdural electrodes. The raw signal was

first visually inspected and examples of extreme noise and artifacts were removed. The

continuous signal was partitioned into 15-minute segments before bandpass filtering was

applied to highlight events in the range of 80-250Hz (corresponding to ripples) and 250-

500Hz (corresponding to fast ripple activity). After filtering, each 15-minute segment

containing events of these frequencies were normalized. These segments were further split

into 10ms windows in which the short time energy is calculated. If STE values were above a

certain threshold for three successive windows, then a 2 second window centred on this

region was deemed a putative HFO.
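
The short time energy stage described above can be sketched as follows. This is an illustration only and not the code of Lai et al.: short time energy is assumed here to be the sum of squared samples in each window, and the threshold, sampling rate and toy signal are all assumptions, since the exact threshold is not restated in this section.

```python
import numpy as np

def ste_candidates(segment, fs, threshold, window_ms=10.0, n_successive=3, event_s=2.0):
    """Return (start, end) sample ranges of putative events around high-energy regions."""
    win = int(round(window_ms * fs / 1000))
    n_windows = len(segment) // win
    # Short time energy of each non-overlapping window: sum of squared samples.
    ste = (segment[: n_windows * win].reshape(n_windows, win) ** 2).sum(axis=1)
    above = ste > threshold
    half = int(round(event_s * fs / 2))
    events, i = [], 0
    while i < n_windows:
        if not above[i]:
            i += 1
            continue
        j = i
        while j < n_windows and above[j]:
            j += 1                                    # extend the run of high-energy windows
        if j - i >= n_successive:
            centre = (i + j) * win // 2               # middle of the run, in samples
            events.append((max(0, centre - half), min(len(segment), centre + half)))
        i = j
    return events

fs = 2000                                             # assumed sampling rate in Hz
rng = np.random.default_rng(0)
t = np.arange(6 * fs) / fs                            # six seconds of toy signal
segment = 0.05 * rng.normal(size=t.size)              # stand-in for filtered background
segment[6000:6100] += np.sin(2 * np.pi * 250 * t[6000:6100])  # 50 ms burst at 250 Hz
events = ste_candidates(segment, fs, threshold=1.0)
```

The burst occupies five successive 10ms windows, so a single 2 second window centred on it is returned as a putative HFO.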

For each putative HFO, a time-frequency plot was then created. These plots were used as the

input for the CNN. Two specialist reviewers visually marked each plot as either an HFO or

non-HFO in a double-blind trial, providing a labelled dataset for the CNN to predict against.

The dataset consisted of 14,998 HFOs; however, the number of non-HFOs used in this

dataset is unspecified. 10% of this data is taken as the test set, with the remaining dataset

again split into a training set (80%) and a validation set (20%). The results on the test set are

sensitivities of 88.16% and 93.37% on Ripples and Fast Ripples respectively.
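
The two-stage split described above corresponds to 72% of the data for training, 18% for validation and 10% for testing. A short sketch of the arithmetic is given below; the total of 10,000 samples is purely illustrative, since the true size of the dataset is not stated.

```python
def split_sizes(n_samples, test_frac=0.10, val_frac_of_rest=0.20):
    """Sizes of the (train, validation, test) sets for a two-stage split."""
    n_test = round(n_samples * test_frac)
    n_val = round((n_samples - n_test) * val_frac_of_rest)
    n_train = n_samples - n_test - n_val
    return n_train, n_val, n_test

# Illustrative total only: the number of non-HFO examples is unspecified.
print(split_sizes(10_000))   # (7200, 1800, 1000)
```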

The reasons behind taking this methodology as the industry standard are three-fold. Firstly,

the use of time-frequency plots using wavelet analysis has proven a promising method in

HFO detection (Liu et al. 2016). Secondly, there are a multitude of reasons why CNN

classifiers have a high applicability to HFO detection, as discussed in section 4.4. And

finally, the results themselves are promising. While, as suggested earlier, direct results

comparisons should be taken with a pinch of salt in this field, this method does provide more

accurate results than most other papers, including the research by Zuo et al, which in turn

showed more accurate results than several well-respected automated detection algorithms.

4. Introduction to Convolutional Neural Networks

4.1 Overview of Basic Principles

CNNs are often difficult to interpret; therefore, it was deemed necessary in this dissertation to

give a brief overview of the key foundational principles.

4.1.1 How Images are Interpreted by a CNN

Coloured images are 3-dimensional tensors. Two of these dimensions specify the horizontal

and vertical positions of pixels, while the third defines the colour channel. When an image is

used as input for a CNN, each pixel is seen as an individual element of this tensor.

4.1.2 Convolutional Layers

CNNs are layered structures. Several different layer types are used to carry out specific tasks

in a network. Convolutional layers are a foundational component of any CNN. In these

layers, the mathematical process of convolution occurs between the input and a pre-defined

number of learnable kernels (also known as filters). Kernels convolve across the input, dot

products are taken at each position, and the results of these dot products are the entries to a 2-

dimensional activation map. These activation maps stacked in the depth dimension are the

output of the convolutional layer. Consequently, the number of kernels defined in a layer

affects the shape of the output. For example, a convolutional layer that operates with n

kernels would result in a depth wise stack of n 2-dimensional activation maps.

The spatial dimensions of the kernels are another hyperparameter to consider, while the

stride hyperparameter controls how the kernels move across the input. For instance, defining

a stride of 2 means the kernels move 2 pixels at a time over the input. Consequently, for an input of a given size, the size

of each activation map is determined by the stride and kernel size.
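
The relationship between input size, kernel size, stride and the shape of the output can be written as a short calculation. The sketch below assumes square kernels and the standard output-size formula; the 224 × 224 input and the optional `padding` argument are illustrative assumptions.

```python
def conv_output_shape(height, width, kernel, stride, n_kernels, padding=0):
    """Output (height, width, depth) of a convolutional layer with square kernels."""
    out_h = (height + 2 * padding - kernel) // stride + 1
    out_w = (width + 2 * padding - kernel) // stride + 1
    return out_h, out_w, n_kernels

print(conv_output_shape(224, 224, kernel=3, stride=1, n_kernels=32))  # (222, 222, 32)
print(conv_output_shape(224, 224, kernel=3, stride=2, n_kernels=32))  # (111, 111, 32)
```

The depth of the output equals the number of kernels, while increasing the stride from 1 to 2 roughly halves each spatial dimension.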

4.1.3 Pooling Layers

Pooling layers serve to reduce the size of an input. In these layers, a pooling window of pre-

specified size passes over the input with a designated stride. The most common form of

pooling is max pooling, in which the maximum value from each pooling window is selected

to be part of the output. However, average pooling is also commonplace, where the mean of

the elements within each window is calculated.
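
Both forms of pooling can be demonstrated on a small example. The sketch below is a plain loop-based illustration on a single 2-dimensional array, with a 2 × 2 window and a stride of 2 chosen as assumptions; deep learning libraries provide far more efficient equivalents.

```python
import numpy as np

def pool2d(x, size=2, stride=2, mode="max"):
    """Apply max or average pooling to a 2-D array."""
    out_h = (x.shape[0] - size) // stride + 1
    out_w = (x.shape[1] - size) // stride + 1
    out = np.empty((out_h, out_w))
    reducer = np.max if mode == "max" else np.mean
    for i in range(out_h):
        for j in range(out_w):
            window = x[i * stride : i * stride + size, j * stride : j * stride + size]
            out[i, j] = reducer(window)
    return out

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 1, 7, 8]], dtype=float)
max_pooled = pool2d(x, mode="max")     # [[4, 2], [2, 8]]
avg_pooled = pool2d(x, mode="mean")    # [[2.5, 1], [1, 6.5]]
```

In both cases the 4 × 4 input is reduced to a 2 × 2 output.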

4.2 Inception Networks

A method for increasing the performance of CNNs in classification tasks is to increase the

computational complexity of a model. In a sequential CNN model this can be achieved by

extending depth (additional layers) or width (the number of filters at each convolutional

layer). Deep sequential models have proven their merits in fields such as image classification

(Krizhevsky et al. 2012). However, the consequence of increasing the number of layers/filters

is that we also increase the computational cost of a model, which in turn leads to longer

training times. Moreover, a sequential structure is limited due to the requirement that we must

define the filter size at each layer. Such an approach stipulates that only features of an exact

spatial size can be learned at each point in a model.

Due to the repercussions associated with deeper sequential models, their usability in HFO

detection and the medical field in general is limited. Ideally, the aim is to create models that

can identify finer features while also maintaining a reasonably low

computational requirement and training time.

One CNN model structure that may satisfy these requirements is the inception network.

Inception networks do not rely on a simple sequential structure; rather, they utilise sub-structures known as

inception modules. Inception modules are acyclic in structure and consist of towers orientated

in parallel to each other.

A sequential structure only permits the application of either a convolutional or a pooling

process, each with a fixed window size, one after another. An inception module instead

applies a variety of these processes in parallel.

Consequently, an inception module can

learn filters with a variety of spatial sizes

simultaneously, while also considering the

applicability of pooling at each stage.

Figure 2 gives an example of a generalised inception

module. An input is taken, and various procedures are applied in parallel. The outputs of

these parallel towers are then concatenated into a single tensor and act as an input for the next

inception module.

Of course, this is an extremely

computationally expensive solution. For this

reason, convolutional processes with 1×1

filter sizes, a step size of 1 and a smaller

number of filters than the depth of the input

are used to reduce the size of a tensor just

before it is acted upon by computationally

expensive convolutional processes.

[Figure 2: Naively structured inception module. An input feeds a convolution with m × m filters, a convolution with n × n filters and a pooling process in parallel; their outputs are concatenated.]

[Figure 3: 1×1 convolutions for depth reduction, with tensor dimensions labelled x, y, z and k.]

A 1×1 convolution process undertaken with a step size of 1 and k filters will of course

maintain the shape of any input spatially, but the depth of the output shape will be of size k.

Importantly, the amount of information lost during this process has been found to be minimal

in practice (Szegedy et al. 2015). This reduction method allows us to amplify a model both in

terms of depth and width while maintaining a comparatively low computational cost.
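The effect of a 1×1 convolution on tensor shape, and the saving it gives before a more expensive convolution, can be sketched as follows. The tensor sizes and filter counts are hypothetical values chosen for illustration, and the cost function counts only the multiplications per output position.

```python
import numpy as np

def conv1x1(x: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Apply a 1x1 convolution with a step size of 1.

    x has shape (height, width, depth_in); weights has shape (depth_in, k).
    Each pixel's depth vector is multiplied by the same weight matrix.
    """
    return x @ weights

def multiplications(kernel: int, depth_in: int, filters: int) -> int:
    """Multiplications needed per output position of a convolution."""
    return kernel * kernel * depth_in * filters

rng = np.random.default_rng(0)
x = rng.normal(size=(29, 29, 192))              # hypothetical input tensor
reduced = conv1x1(x, rng.normal(size=(192, 16)))

# Cost of a 5x5 convolution with 32 filters, with and without the reduction.
direct = multiplications(5, 192, 32)
with_reduction = multiplications(1, 192, 16) + multiplications(5, 16, 32)
```

Under these assumed sizes, the spatial shape is unchanged while the depth falls from 192 to 16, and the per-position cost falls from 153,600 to 15,872 multiplications.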

4.3 Residual Connections

Residual networks are an idea that was first put forward by He et al. They propose a solution

to the degradation of accuracy observed when training very deep CNNs (He et al. 2016).

In this method, the tensor formed before a convolutional block is added to the output of said

convolutional block. The intuition behind this idea is that the effect of a convolutional block can

be thought of as a function H acting upon an input x. It is proposed that, since a block is

capable of learning a complicated function H(x), it could equally learn the

residual function, i.e. H(x) − x (He et al. 2016). Residual connections therefore mean that a

convolutional block learns a residual f(x) = H(x) − x, so that H(x) = x + f(x). The input information x

skips the effects of a convolutional block and is added to the output of a block despite the

relevant function not being applied to it.
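A residual connection amounts to a single element-wise addition, as the sketch below shows. The `block` argument stands in for any convolutional block whose output has the same shape as its input; the two lambda functions are purely illustrative and do not represent learned blocks.

```python
import numpy as np

def residual_connection(x: np.ndarray, block) -> np.ndarray:
    """Return H(x) = x + f(x), where `block` computes the residual f."""
    fx = block(x)
    if fx.shape != x.shape:
        raise ValueError("element-wise addition needs matching shapes")
    return x + fx

x = np.ones((4, 4, 8))
# A block that has learned f(x) = 0 leaves the input unchanged,
# so the identity mapping is trivial to represent.
identity_out = residual_connection(x, lambda t: np.zeros_like(t))
# A stand-in for a learned block (purely illustrative).
shifted_out = residual_connection(x, lambda t: 0.5 * t)
```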

[Figure 4: Inception module structure with 1×1 convolutional blocks. Block labels: Input; Convolution with 1 × 1 filters; Convolution with m × m filters; Convolution with n × n filters; Convolution with p × p filters; Pooling; Concatenate.]

This type of connection helps to solve issues such as a vanishing gradient and has allowed for

the learning of deeper networks. The implementation of this connection adds no extra

parameters and requires no extra computational power other than the inconsequential amount

required for the element-wise addition of the tensors.

5. Data Pre-processing

5.1 Wave Simulations

The data used within most HFO detection papers are sourced from epileptic patients.

However, due to several limiting factors, this study uses artificially created data produced by

Miguel Navarrete, an expert in this field. Factors include the unavailability of iEEG data

obtained in such a medical trial, the ethical issues surrounding the recording of new data and

the time constraints imposed by this being an MSc dissertation rather than a full research

paper. In practical terms, we are simulating putative HFOs already identified by some

method. All data were simulated using MATLAB.

Since the objective is to produce artificial data that replicates the task of HFO classification,

it makes sense to set up a system that properly represents how iEEG data is recorded. In a

real-life situation, measuring apparatuses are implanted into a patient’s exposed brain tissue at

various positions. Each apparatus has a grouping of electrodes that record voltage. Our setup

considers a generalised version of such a measuring device on which 4 electrodes are

positioned equidistant from one another.

[Figure 5: Residual connection. Normal connection: x passes through a convolutional block to give H(x). Residual connection: x passes through a convolutional block and is also added to its output, giving H(x) = f(x) + x.]

While there is no available iEEG data that contains HFOs, there does exist iEEG data without

these events present. These signals are used as a baseline within which we implant oscillatory

behaviours of interest. Specifically, baseline signals are separated into 2 second intervals, and

oscillatory behaviours are implanted into the centre of these segments. To give a realistic and

demanding challenge, nine distinct behaviours are formulated. Each type of signal behaviour

is a waveform often encountered by HFO detectors.

Each simulated wave is formulated with its own distinct properties such as frequency and

duration. These properties are randomly set to values between pre-defined limits normally

exhibited by that type of waveform. Below is a summary of each simulated waveform:

• Ripples: This is an HFO event in which signal oscillates at a relatively high

frequency. Frequency is randomly set between 120Hz and 240Hz. This wave may

repeat anything between 8 and 20 times.

• Fast Ripples: This is another HFO event, like a ripple, where signal is oscillating

with high frequency. However, the frequency of this waveform is greater than that of

a ripple. Fast ripple frequency is randomly set within 240Hz to 450Hz. This wave

may repeat anything between 5 and 15 times.

• Spike: This is a non-HFO event. Signal reaches a high amplitude before suddenly

falling. Because this is quick activity with high-frequency content, it is especially

difficult for any HFO detector to distinguish this non-pathological spiking activity

from an HFO. Instead of being defined in terms of how

many times a wave repeats, spikes are defined over a randomised time interval.

Spikes may be anything from 0.025 to 0.08 milliseconds in duration.

• Ripple-FastRipple: This is a simulated event in which a ripple and fast ripple are set

to occur simultaneously. The challenge for an HFO detection model is to disregard

any disruptive effects both HFOs may have on each other.

• Spike-Ripple: This is an event in which both an HFO in the form of a ripple and

a non-HFO behaviour in the form of a spike are simulated simultaneously. Models

must flag this behaviour as an HFO despite the signal-distorting

non-HFO behaviour.

• Spike-FastRipple: In this instance we simulate an HFO in the form of a fast ripple,

as well as a non-HFO in the form of a spike to occur simultaneously.

• Baseline: No wave is simulated; this is simply the raw baseline signal without any

distinctive behaviour within it. Of course, raw signal, even without specific waves

within it, may contain a fair amount of noise that has the potential to be misidentified

as HFO activity.

• Noise: This is a simulation of extremely noisy signal often encountered in iEEG data.

No specific type of wave is simulated; instead, signal is distorted. Noise is defined

over 0.2 seconds and is created using a pink noise function in MATLAB. This gives the

signal in this region of time added irregularity, which may be mistaken for HFO

behaviour.

• Artifact: This is an anomalous activity not associated with the electrical behaviour of

the brain but rather with behaviour from external sources. Artifacts are created by passing a

Gaussian wave through a step function, which replicates the large jumps

within signal created by external sources.
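As an illustration of how one of the events listed above might be generated, the sketch below simulates a Ripple in Python rather than MATLAB. The frequency and cycle ranges are those given for Ripples; the 2048 Hz sampling rate, the uniform sampling of both properties, the Hann-window envelope, the unit amplitude and the white-noise stand-in for the baseline are all assumptions made for this example and are not taken from the original simulation code.

```python
import numpy as np

FS = 2048  # assumed sampling rate in Hz (not stated in the text)

def simulate_ripple(rng: np.random.Generator, fs: int = FS) -> np.ndarray:
    """Return one Ripple: 8-20 cycles at a frequency between 120 and 240 Hz."""
    freq = rng.uniform(120.0, 240.0)
    cycles = int(rng.integers(8, 21))       # upper bound is exclusive
    n = int(round(cycles / freq * fs))      # number of samples in the event
    t = np.arange(n) / fs
    return np.hanning(n) * np.sin(2 * np.pi * freq * t)

def implant_in_centre(baseline: np.ndarray, event: np.ndarray) -> np.ndarray:
    """Add an event to the centre of a baseline segment."""
    out = baseline.copy()
    start = (len(baseline) - len(event)) // 2
    out[start:start + len(event)] += event
    return out

rng = np.random.default_rng(42)
baseline = 0.1 * rng.normal(size=2 * FS)    # stand-in for a 2 second iEEG segment
segment = implant_in_centre(baseline, simulate_ripple(rng))
```

The other wave types would follow the same pattern with their own ranges, and hybrid events could be formed by implanting two events into the same segment.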

Any signal in which a Ripple or Fast Ripple occurs should be classified as an HFO. That is,

Ripples, Fast Ripples, Ripple-FastRipples, Spike-Ripples and Spike-FastRipples are labelled

as HFOs, while Spikes, Baseline, Noise and Artifacts are labelled as non-HFOs.
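The labelling rule can be expressed directly in code. The wave-type strings below are hypothetical identifiers chosen to mirror the names used in this section; the rule itself is the one stated above.

```python
WAVE_TYPES = [
    "Ripple", "FastRipple", "Spike",
    "Ripple-FastRipple", "Spike-Ripple", "Spike-FastRipple",
    "Baseline", "Noise", "Artifact",
]

def is_hfo(wave_type: str) -> bool:
    """A segment is an HFO if it contains a Ripple or a Fast Ripple."""
    components = wave_type.split("-")
    return "Ripple" in components or "FastRipple" in components

# 1 for HFO, 0 for non-HFO
labels = {wave: int(is_hfo(wave)) for wave in WAVE_TYPES}
```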

5.2 Simulating the Effects of Distance

The use of simulations gives the opportunity to study HFO detection from an alternative

perspective. Specifically, we would like to investigate whether the distance between an

electrode and the source of an event influences detection accuracy. Intuitively, we would

expect behaviours that occur further from the point of measurement to be more challenging to

correctly identify.

In order to properly simulate the effects of distance, the behaviours from the previous section

are set to occur within a 3-dimensional coordinate system. Within this system, the electrode

positions are constant. Ripples, Fast Ripples and Spikes are set to occur no closer than 0.01,

0.005 and 0.005 metres to an electrode respectively. The maximum distance between these

behaviours and any electrode is upper bounded at 0.04 metres.

In order to simulate the resistive properties of brain tissue, we calculate the expected effects

of travelling through grey matter across the pre-specified distance using methods derived by

Logothetis et al. (2007), where the average impedance for grey matter in a medio-lateral

direction is taken to be 75 ohms.

Through the application of the methods put forward in this research, the attenuating effects of

distance on each event can be calculated. These scaled waves are then projected onto the

baseline signal. The greater the distance, the less prominent the behaviour within the signal,

meaning it should be more difficult to correctly identify.

5.3 Time-Frequency Plot Creation

Section 2.5 covered the fundamentals of time-frequency analysis. Time-frequency plots

derived from wavelet analysis have proven to be a valuable method for predicting HFOs (Liu

et al. 2016) (Lai et al. 2019). This project builds upon such research and utilizes a new way to

encode time-frequency data into 3-dimensional plots. These plots contain not only the

information that would be available within a regular 2-dimensional time-frequency plot, but

also information on how an electrode’s signal varies from those of the electrodes in its immediate

vicinity.

In a real-life case, groups of electrodes on a measuring apparatus are implanted to a small

depth within the brain. From a logical standpoint it makes sense to take advantage of this

grouped structure to calculate the probability of an HFO having occurred, rather than rely on

each electrode to work independently.

This new method can be thought of as a stacking of three time-frequency plots where each

layer is sourced from a particular signal. Since we have three 2-d time frequency plots

stacked in the depth dimension, we can visualise this matrix as an RGB image, where the first

time-frequency plot corresponds to red, the second to green and the third to blue.

The first signal, from which we create a time-frequency plot that corresponds to red in the

final image, is simply the normal signal measured from that specific electrode.

The second signal is formed by taking the difference between the specified electrode and the

most closely positioned electrode below it in the coordinate system. In the case of the

electrode positioned at the bottom, the signal of the electrode positioned immediately above it

is subtracted. This signal is then a representation of the voltage difference measured between

that electrode and a closely positioned electrode.

The third signal is formed by first taking the mean over the signals measured from each of the

4 electrodes and then subtracting it from that electrode’s signal. The middle 0.5 seconds of

each signal is used to create the time-frequency plots.

Once all three 2-dimensional time-frequency images are created, these are stacked to form a

3-dimensional matrix. An example of such a matrix, with the signals used to form it, is

visualised using the RGB configuration in figure 6.
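The construction of the three signals for a given electrode can be sketched as below. The array layout (electrodes ordered from the top of the apparatus to the bottom), the function names and the 2048 Hz sampling rate are assumptions made for illustration; the random recording is a stand-in for real signals.

```python
import numpy as np

def three_signals(electrodes: np.ndarray, index: int) -> np.ndarray:
    """Build the three signals for one electrode.

    electrodes has shape (n_electrodes, n_samples), ordered from the top
    of the apparatus to the bottom (an assumed convention).
    """
    own = electrodes[index]
    # Nearest electrode below; the bottom electrode uses the one above it.
    is_bottom = index == electrodes.shape[0] - 1
    neighbour = electrodes[index - 1] if is_bottom else electrodes[index + 1]
    return np.stack([
        own,                               # red: the electrode's own signal
        own - neighbour,                   # green: difference to its neighbour
        own - electrodes.mean(axis=0),     # blue: deviation from the group mean
    ])

def middle_window(signals: np.ndarray, fs: int, seconds: float = 0.5) -> np.ndarray:
    """Keep the central `seconds` of each signal."""
    n = int(round(seconds * fs))
    start = (signals.shape[-1] - n) // 2
    return signals[..., start:start + n]

fs = 2048                                  # assumed sampling rate in Hz
recording = np.random.default_rng(0).normal(size=(4, 2 * fs))
channels = middle_window(three_signals(recording, index=3), fs)
```

Each of the three rows of `channels` would then be passed through the wavelet analysis to give one layer of the final matrix.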

The time-frequency plots are created using a wavelet analysis algorithm. Specifically, Gabor

wavelet transforms are used in this instance to create plots that encode the behaviour over a

duration of 0.5 seconds.
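A Gabor wavelet is a complex sinusoid multiplied by a Gaussian envelope, and a time-frequency plot can be formed by convolving the signal with one such wavelet per frequency of interest. The sketch below illustrates the idea in NumPy; the number of cycles per wavelet, the frequency grid, the normalisation and the 2048 Hz sampling rate are assumptions and do not reflect the settings used to produce the plots in this dissertation.

```python
import numpy as np

def gabor_scalogram(signal, fs, freqs, n_cycles=6.0):
    """Magnitude of a Gabor wavelet transform, shape (len(freqs), len(signal))."""
    signal = np.asarray(signal, dtype=float)
    rows = []
    for f in freqs:
        sigma = n_cycles / (2 * np.pi * f)            # envelope width in seconds
        half = int(np.ceil(3 * sigma * fs))           # cover +/- 3 sigma
        t = np.arange(-half, half + 1) / fs
        envelope = np.exp(-t**2 / (2 * sigma**2))
        wavelet = envelope * np.exp(2j * np.pi * f * t)
        wavelet /= envelope.sum()                     # comparable gain across frequencies
        rows.append(np.abs(np.convolve(signal, wavelet, mode="same")))
    return np.array(rows)

fs = 2048                                             # assumed sampling rate in Hz
t = np.arange(int(0.5 * fs)) / fs                     # a 0.5 second window
freqs = np.arange(80, 500, 20)                        # assumed frequency grid in Hz
plot = gabor_scalogram(np.sin(2 * np.pi * 200 * t), fs, freqs)
```

For the 200 Hz test tone used here, the row of `plot` with the largest average magnitude is the one for 200 Hz, which is the behaviour expected of a time-frequency representation.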

[Figure 6: Wavelet analysis of 3 waves. Each signal spans 0.5 seconds; the vertical axis of the time-frequency plot shows increasing frequency.]

5.4 Creation of Sets for Cross Validation and Final Testing

The end result of these simulations is a total of 512,760 time-frequency matrices. 285,400 of

these are HFOs, while 227,360 are non-HFOs. Python is used to extract matrices from the

MATLAB files and save each as a .png file.

When creating sets for training and testing of models, stratified random sampling is used.

HFOs and non-HFOs are taken as the strata, which ensures that the ratio of HFOs to non-

HFOs is maintained in each set. 10% of the data is taken as the test set, which is composed of

28,540 HFOs and 22,736 non-HFOs. From the remaining 90% of data, a 4-fold cross-

validation is composed. A 4-fold cross-validation stipulates training models on 75% of a set

and testing on 25%. Consequently, in each fold models are trained using 192,645 HFOs and

153,468 non-HFOs, before being tested on 64,215 HFOs and 51,156 non-HFOs.
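The set sizes quoted above follow from applying the same split to each stratum separately. The short sketch below reproduces the arithmetic; the function and key names are illustrative assumptions.

```python
def stratified_counts(total: int, test_fraction: float = 0.10, folds: int = 4) -> dict:
    """Per-stratum counts for a hold-out test set followed by k-fold cross-validation."""
    test = round(total * test_fraction)
    remaining = total - test
    validation = remaining // folds          # size of one fold
    return {
        "test": test,
        "cv_pool": remaining,
        "fold_train": remaining - validation,
        "fold_validation": validation,
    }

hfo = stratified_counts(285_400)
non_hfo = stratified_counts(227_360)
```

Here `hfo` gives 28,540 test items, 192,645 training items and 64,215 validation items per fold, and `non_hfo` gives 22,736, 153,468 and 51,156 respectively, matching the figures in the text.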

This dissertation proposes three different model structures. First, the 4-fold cross-validation is

conducted for these models. This provides a reliable measure of the stability of each model’s

predictive power. An optimally designed model is then chosen based upon the mean accuracy

obtained over the folds of the cross validation.

The data previously used to make the cross-validation folds is then combined into a set of

256,860 HFOs and 204,624 non-HFOs. The chosen model and the re-created industry

standard models are trained on this set before being applied to the hold-out test set. An

in-depth analysis of the results can then be conducted.

5.5 Limitations of Simulated Data

Simulated data provides a reasonable alternative for this project, especially when considering

said data is created by an expert in this particular field. Additionally, simulations allow for a

proper study of distance effects on HFO detection, something that has not been conducted

before. Of course, it is vital we understand the potential pitfalls of these simulation methods,

and the limitations this imposes on any conclusions that are made in this research.

One implication that must be considered is that this experimental simulative setup may lead

to mis-labelling by wave diminishment. Consider the simulation of a Ripple. It is possible

that this Ripple is created with a small amplitude as well as a large distance from the

electrode. This could mean its magnitude within the baseline signal is diminished to such an

extent that we must consider the question: when is a Ripple truly considered a Ripple? The

wave considered may be so minute that an expert would no longer consider this type of

behaviour an HFO. Despite this, our system still labels it as an HFO. This is a by-product

of the simulation setup and, although very rare, it is possible that some instances of

the dataset may replicate this situation.

A similar issue is that the simulation may lead to mis-labelling by overly distortive wave

effects. In this simulation method, hybrid waves formed from both an HFO and non-HFO are

created. These simulations are all labelled as HFOs, since we would like a model to detect the

appearance of this behaviour despite the effects of the disruptive non-HFO activity. However,

this may lead to instances where the non-HFO behaviour is so disruptive, and the HFO

behaviour so diminished and affected by noise, that the final wave no longer properly

resembles any kind of HFO behaviour. Again, this is a possible occurrence in which we label

the behaviour as an HFO, despite the fact that upon visual inspection, experts may no longer

classify the wave as an HFO.

Finally, it is important to consider how the relative cardinality of wave subsets produced

leads to a disproportioned dataset. In these simulations, each type of wave is given an equal

probability to occur. This produces a dataset with roughly equal numbers of each type of

wave. In a real-life study, it is highly unlikely that these proportions actually occur. In a

real-life situation we may see a very small or very large number of a certain wave type occur, meaning this

behaviour is over- or under-represented in our dataset. This is not as much of an issue when

comparing the performance of models against each other within this dissertation, since all

models are exposed to the same data. However, it would be unsuitable to make a side-by-side

comparison between the overall accuracies obtained here and papers that derive their data

from a real-life medical study.

6. Construction of Models

6.1 Reconstruction of the Industry Standard Model

This dissertation includes graphs of interconnected blocks to visualise network

sub-structures. Each block represents a separate process within a structure. An explanation of how

to read these diagrams can be found in the appendices.

In this study, we re-create the CNN proposed by Lai et al. (2019) in order to give a baseline

level of performance. This model is a simple sequential structure with two convolutional

layers and two pooling layers. Refer to figure 7 for a visualisation of this structure.

Unfortunately, the paper gives no information about several key design choices, therefore

some presumptions are made in order to re-create the model. For instance, in the fully

connected layers of the model, both the number of layers and the number of neurons within

each layer are unspecified. We make the presumption that there are 16 neurons in the first

layer and 10 in the second layer. This assumption is derived from the model diagram within

the original paper.

The optimization method used is also left unspecified, and so is assumed to be a standard

stochastic gradient descent. This is the same optimization method used by our models. The

learning rate and its decay settings are also unspecified. In practical runs of the model it was

found that a large initial learning rate often led to poor results. An initial learning rate of

0.01 was found to be optimal and this was halved every 2 epochs. A batch size of 250 is used,

the same as our proposed models. The finalised model has a total of 3,844,512 parameters, all

of which are trainable.

6.2 Construction of New Models

This dissertation proposes three new models, each of which follows a different structure. We

label these models A, B and C. Each model is optimized via stochastic gradient descent with

a batch size of 250. The initial learning rate is 0.05 and is set to halve every 2 epochs. Due to

the time regulations surrounding access to certain facilities on the Supercomputing Wales

cluster, the maximum number of epochs is restricted to 10. For each convolutional block,

non-linearity is achieved by the application of a rectified linear activation function. Batch

normalization is then applied to this output.
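The learning rate schedule described above is a step decay. A minimal sketch follows, assuming epochs are counted from 0 so that the first halving takes effect at the third epoch; the function name and this indexing convention are assumptions.

```python
def learning_rate(epoch: int, initial: float = 0.05, halve_every: int = 2) -> float:
    """Step decay: halve the learning rate every `halve_every` epochs (epoch counted from 0)."""
    return initial * 0.5 ** (epoch // halve_every)

# Learning rate used in each of the 10 epochs.
schedule = [learning_rate(epoch) for epoch in range(10)]
```

Calling `learning_rate(epoch, initial=0.01)` gives the corresponding schedule for the re-created industry standard model.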

[Figure 7: CNN structure proposed by Lai et al. Block labels: Input; Conv 16 (3,3) 1 N; Max Pool (3,3) 2 N; Max Pool (3,3) 1 N; Dense 12; Dense 10; Softmax.]

Each model uses depth-wise separable convolution in place of the classical convolutional

process as this has been shown to provide a modest upgrade in accuracy, whilst

simultaneously reducing the number of parameters (Chollet et al. 2017). Another

architectural feature common to all models is the use of overlapping pooling, meaning that all

pooling procedures are carried out with a step size smaller than the corresponding pool size.

This has been found to produce more accurate results (Krizhevsky et al. 2012).
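The parameter saving offered by depth-wise separable convolution can be seen by counting weights. In the sketch below, biases are ignored and the layer sizes are hypothetical; the counts follow from the usual definition of a depth-wise step (one filter per input channel) followed by a 1×1 point-wise step.

```python
def standard_conv_params(kernel: int, depth_in: int, filters: int) -> int:
    """Weights in a standard convolution (biases ignored)."""
    return kernel * kernel * depth_in * filters

def separable_conv_params(kernel: int, depth_in: int, filters: int) -> int:
    """Weights in a depth-wise separable convolution (biases ignored).

    One kernel x kernel filter per input channel, followed by a
    1x1 convolution that mixes the channels.
    """
    depthwise = kernel * kernel * depth_in
    pointwise = depth_in * filters
    return depthwise + pointwise

# Hypothetical 3x3 layer mapping 128 input channels to 192 filters.
standard = standard_conv_params(3, 128, 192)
separable = separable_conv_params(3, 128, 192)
```

For this assumed layer the standard convolution needs 221,184 weights and the separable version 25,728.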

Each model uses the same simple, sequentially structured stem. This serves not only to

extract low-level features of interest but also to reduce the size of the input tensor before more

complex processes are applied.

Each model includes a final pooling procedure positioned after the inception modules. In

practice, tests of the way in which these models optimized showed a need to reduce the

interdependence between neurons in the network. Therefore, dropout with a rate of 0.2 is used

to promote the learning of more robust features and minimize overfitting. Specifically, this

dropout is employed for all weights between the output of the final pooling layer and the

softmax layer.

6.2.1 Model A

Model A applies filter factorisation and expansive filter banks to put a focus on capturing

large spatial features in a computationally efficient manner. While convolution with large

filter sizes may potentially extract features of great importance to this specific task, this

comes at the expense of computational efficiency. For example, each time a 7×7 filter is

applied, there are 7×7=49 multiplicative operations required. A 3×3 filter on the other hand

requires 3×3=9 operations. The 7×7 filter is disproportionally 49/9 ≈ 5.44 times more

expensive. A proposed solution to this problem is to replace convolutional layers that depend

upon large filter sizes with a series of convolutional processes with smaller filter sizes. Not

only does this reduce computational expense, theory suggests this method is able to extract

high dimensional features in a similar way to that of large filter sizes (Szegedy et al. 2016).

[Figure 8: Stem of newly proposed models. Block labels: Input; Conv 64 (3,3) 1 N; Max Pool (3,3) 2 N; Conv 32 (3,3) 2 N; Conv 128 (3,3) 1 N; Conv 192 (3,3) 2 N.]

Taking this idea further, we can employ asymmetric factorization of convolutional layers, i.e.

the replacement of single convolutional layers that employ n×n filters with two layers that

employ 1×n and n×1 filter sizes respectively. Again, this reduces computational cost and we

can still capture high dimensional features. In the original paper proposing

asymmetric factorization, the method was particularly effective when acting upon inputs of

size m×m where m ∈ [12,20]. Conversely, it was less effective on earlier layers

(Szegedy et al. 2016).
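The savings from both kinds of factorization can be checked with simple counting. The sketch below uses the same simplification as the 7×7 versus 3×3 comparison earlier in this section, counting multiplications per filter application on a single channel; the function names are illustrative.

```python
def symmetric_cost(n: int) -> int:
    """Multiplications per application of a single n x n filter."""
    return n * n

def stacked_cost(small: int, layers: int) -> int:
    """Multiplications for a stack of `layers` filters of size small x small."""
    return layers * small * small

def stacked_receptive_field(small: int, layers: int) -> int:
    """Receptive field of stacked convolutions with a step size of 1."""
    return 1 + layers * (small - 1)

def asymmetric_cost(n: int) -> int:
    """Multiplications for a 1 x n filter followed by an n x 1 filter."""
    return 2 * n

full = symmetric_cost(7)                 # one 7x7 filter
stacked = stacked_cost(3, 3)             # three 3x3 filters in series
field = stacked_receptive_field(3, 3)    # region of the input seen by the stack
asymmetric = asymmetric_cost(7)          # 1x7 followed by 7x1
```

Three stacked 3×3 filters cover the same 7×7 region of the input for 27 multiplications rather than 49, and the asymmetric pair needs only 14.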

Another method employed in this model is the use of filter bank expansion. This is a method

in which a convolutional layer of a tower within an inception module is replaced by two

convolutional processes to be considered in parallel. Szegedy et al (2016) promoted the use

of this method when applied to a spatially small input.

The 2nd and 4th inception modules act without padding and employ a step size of 2 within one

process of each tower. This serves to reduce the spatial size of their inputs. In contrast, the 1st,

3rd and 5th modules employ padding throughout and a step size of 1 in order to extract

features while maintaining spatial size for the next module.

An overview of the structure of Model A is given in the table below. The finalised model has

3,442,717 parameters, 3,427,805 of which are trainable and 14,912 of which are non-trainable.

Figures 9-13 give a graphical representation of the inception modules used in the model.

Process       Output Shape
Stem          (29, 29, 192)
Module 1      (29, 29, 256)
Module 2      (14, 14, 224)
Module 3      (14, 14, 768)
Module 4      (6, 6, 1280)
Module 5      (6, 6, 2048)
Final Pool    (1, 1, 2048)
Softmax       (0, 0, 2)

[Figures 9 and 10: Modules 1 and 2 of Model A. Block labels: Prev Layer; Conv 192 (1,1) 1 Y; Conv 64 (3,3) 1 Y; Conv 64 (1,1) 1 Y; Max Pool (3,3) 1 Y; Conv 192 (1,1) 1 Y; Depth Concat; Conv 64 (1,1) 1 Y; Conv 96 (3,3) 1 Y; Conv 64 (1,1) 1 Y; Conv 96 (3,3) 2 N; Max Pool (3,3) 2 N; Conv 64 (1,1) 1 Y; Depth Concat; Conv 64 (3,3) 2 N; Conv 64 (3,3) 1 Y.]

[Figure 11: Module 3 of Model A. Block labels: Prev Layer; Conv 128 (1,1) 1 Y; Conv 192 (1,1) 1 Y; Conv 128 (1,1) 1 Y; Avg Pool (3,3) 1 Y; Conv 192 (7,1) 1 Y; Conv 192 (1,7) 1 Y; Conv 192 (7,1) 1 Y; Conv 192 (1,1) 1 Y; Conv 128 (1,7) 1 Y; Conv 192 (7,1) 1 Y; Depth Concat.]

[Figure 12: Module 4 of Model A. Block labels: Prev Layer; Avg Pool (3,3) 2 N; Conv 192 (1,7) 1 Y; Conv 192 (7,1) 1 Y; Conv 128 (1,1) 1 Y; Conv 192 (3,3) 2 N; Conv 192 (1,1) 1 Y; Conv 320 (3,3) 2 N; Depth Concat.]

6.2.2 Model B

Model B is a more classical inception network design. A maximum filter size of 5×5 is

employed, but redistribution of its computational budget is used to create a deeper network

than models A and C. In Model B, similarly structured inception modules are repeated with

increasing filter numbers at each stage.

Again, spatial size of the input tensor is

decreased throughout the model. However, in

this case, this reduction is achieved by

intermediate pooling layers located just after the

2nd and 4th inception modules. The finalised

model has 1,738,573 total parameters,

1,728,541 of which are trainable and 10,032 of

which are non-trainable.

Process       Output Shape
Stem          (29, 29, 192)
Module 1      (29, 29, 288)
Module 2      (29, 29, 480)
Module 3      (14, 14, 512)
Module 4      (14, 14, 512)
Module 5      (6, 6, 832)
Module 6      (6, 6, 1024)
Final Pool    (1, 1, 1024)
Softmax       (1, 1, 2)

[Figure 13: Module 5 of Model A. Block labels: Prev Layer; Conv 320 (1,1) 1 Y; Conv 192 (1,1) 1 Y; Avg Pool (3,3) 1 Y; Conv 384 (1,3) 1 Y; Depth Concat; Conv 384 (3,1) 1 Y; Conv 384 (3,3) 1 Y; Conv 448 (1,1) 1 Y; Conv 384 (1,3) 1 Y; Conv 384 (1,1) 1 Y; Conv 384 (3,1) 1 Y.]

[Figure 14: Module 1 of Model B. Block labels: Conv 16 (1,1) 1 Y; Conv 32 (5,5) 1 Y; Depth Concat; Conv 96 (1,1) 1 Y; Conv 128 (3,3) 1 Y; Conv 64 (1,1) 1 Y; Prev Layer; Max Pool (3,3) 1 Y; Conv 64 (1,1) 1 Y.]

[Figure 15: Module 2 of Model B. Block labels: Conv 32 (1,1) 1 Y; Conv 96 (5,5) 1 Y; Depth Concat; Conv 128 (1,1) 1 Y; Conv 192 (3,3) 1 Y; Prev Layer; Conv 64 (1,1) 1 Y; Max Pool (3,3) 1 Y; Conv 128 (1,1) 1 Y.]

[Figure 16: Module 3 of Model B. Block labels: Max Pool (3,3) 2 N; Conv 16 (1,1) 1 Y; Conv 48 (5,5) 1 Y; Depth Concat; Conv 96 (1,1) 1 Y; Conv 208 (3,3) 1 Y; Conv 64 (1,1) 1 Y; Prev Layer; Max Pool (3,3) 1 Y; Conv 192 (1,1) 1 Y.]

[Figure 17: Module 4 of Model B. Block labels: Conv 24 (1,1) 1 Y; Conv 64 (5,5) 1 Y; Depth Concat; Conv 112 (1,1) 1 Y; Conv 224 (3,3) 1 Y; Conv 64 (1,1) 1 Y; Prev Layer; Max Pool (3,3) 1 Y; Conv 160 (1,1) 1 Y.]

[Figure 18: Module 5 of Model B. Block labels: Max Pool (3,3) 2 N; Conv 32 (1,1) 1 Y; Conv 128 (5,5) 1 Y; Depth Concat; Conv 160 (1,1) 1 Y; Conv 320 (3,3) 1 Y; Conv 128 (1,1) 1 Y; Prev Layer; Max Pool (3,3) 1 Y; Conv 256 (1,1) 1 Y.]

[Figure 19: Module 6 of Model B. Block labels: Conv 48 (1,1) 1 Y; Conv 128 (5,5) 1 Y; Depth Concat; Conv 198 (1,1) 1 Y; Conv 384 (3,3) 1 Y; Conv 128 (1,1) 1 Y; Prev Layer; Max Pool (3,3) 1 Y; Conv 384 (1,1) 1 Y.]

6.2.3 Model C

Model C employs residual connections within an inception network structure. As discussed in

section 4.3, residual connections were first proposed as a method for training extremely deep

networks (He et al. 2016). Although the models proposed in this paper lack such depth,

the applicability of residual connections to this task remains important to investigate.

Moreover, residual connections have been deployed in inception architectures for image

recognition with great effect (Szegedy et al. 2017).

Inception modules are designed with less width than the modules proposed in models A and

B. Again, this is in keeping with the designs that have been successful for inception networks

(Szegedy et al. 2017). In this model, the 1st, 3rd and 5th modules employ residual connections,

while the others are classical inception modules that act to reduce spatial dimensions.

Factorization, as used in Model A, is utilised.

It should be noted that after the depth-wise concatenation in modules that employ residual

connections, this output is often of reduced depth when compared to the input of the module.

Due to this discrepancy, an element-wise addition would be impossible. Therefore a 1×1

convolutional layer with a step size of 1 is utilised, where the number of filters is equal to the

depth of the input tensor. This method is used in order to scale up the output from the

module.

Additionally, in a residual connection, before the processed tensor is added to the tensor from

the previous layer, the values are multiplied by a scaling factor of 0.2. This scaling is a

technique proposed to fix the training instability that can be exhibited when the number of

kernels within a module becomes exceptionally

large (Szegedy et al. 2017). It proved a necessary

design choice in Model C, as early practical

applications without scaling showed the model

was unable to properly optimize. The finalised

model has 3,427,005 total parameters, 3,411,069

of which are trainable and 15,936 of which are

non-trainable.
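The scaled residual addition used in Model C, including the 1×1 convolution that restores the input depth, can be sketched as follows. The tensor shapes and random weights are illustrative assumptions; only the scaling factor of 0.2 and the order of operations are taken from the description above.

```python
import numpy as np

def scaled_residual(x, module_out, projection, scale=0.2):
    """Residual addition with a 1x1 projection and a scaling factor.

    x:          module input, shape (height, width, depth_in)
    module_out: concatenated tower outputs, shape (height, width, depth_mod)
    projection: 1x1 convolution weights, shape (depth_mod, depth_in)
    """
    projected = module_out @ projection      # restore the input depth
    return x + scale * projected

rng = np.random.default_rng(0)
x = rng.normal(size=(29, 29, 192))
module_out = rng.normal(size=(29, 29, 64))   # illustrative concatenated depth
projection = rng.normal(size=(64, 192))
out = scaled_residual(x, module_out, projection)
```

Because the projection returns the tensor to the depth of the module input, the output shape of such a module matches its input shape.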

Process       Output Shape
Stem          (29, 29, 192)
Module 1      (29, 29, 192)
Module 2      (14, 14, 960)
Module 3      (14, 14, 960)
Module 4      (6, 6, 1856)
Module 5      (6, 6, 1856)
Final Pool    (1, 1, 1856)
Softmax       (0, 0, 2)

[Figure 18: Module 1 of Model C. Block labels: Prev Layer; Conv 32 (1,1) 1 Y; Conv 32 (3,3) 1 Y; Depth Concat; Conv 32 (1,1) 1 Y; Conv 32 (3,3) 1 Y; Conv 192 (1,1) 1 Y; Scale Addition.]

[Figure 19: Module 2 of Model C. Block labels: Max Pool (3,3) 2 N; Conv 384 (3,3) 2 N; Depth Concat; Conv 384 (3,3) 2 N; Conv 192 (3,3) 1 Y.]

[Figure 20: Module 3 of Model C. Block labels: Prev Layer; Conv 128 (1,1) 1 Y; Conv 128 (1,7) 1 Y; Conv 960 (1,1) 1 Y; Conv 128 (7,1) 1 Y; Scale Addition; Depth Concat.]

[Figure 21: Module 4 of Model C. Block labels: Prev Layer; Max Pool (3,3) 2 N; Conv 384 (3,3) 2 N; Conv 256 (1,1) 1 Y; Conv 256 (3,3) 2 N; Depth Concat; Conv 256 (1,1) 1 Y; Conv 256 (3,3) 2 N; Conv 256 (3,3) 1 Y.]

6.3 Applicability of Deep Learning and CNNs to HFO Detection

While this area of research is an active one, models thus far fall short of the accuracies

required to apply an automated technique in the medical field. The challenge of HFO

classification is in the intricacy of the data. Not only can HFOs occur in a variety of

frequencies and durations, non-HFO related electrical behaviours commonly lead to signal

distortion. Consequently, the range of signal behaviours that can constitute an HFO

is vast.

Less complex methods such as SVMs and Clustering, while showing reasonable accuracies,

have proven unable to truly capture the intricate factors at play in HFO detection. Deep

learning promises a way to build models with the ability to properly encapsulate the difficulty

of the task.

CNNs have a huge applicability to this problem. Their ability to learn features automatically,

rather than rely on manual feature extraction, distinguishes them from more traditional

methods. Classical predictive models are only as strong as the hand-crafted features they

are fed, and the discriminatory value of those features is not guaranteed.

From a logical standpoint, the creation of an HFO classification model can be split into two

steps. The first of these is the way in which the raw signal is encoded to make it interpretable

to models. The second is how the model is structured to find patterns of interest from this

data. The methodology presented in this dissertation therefore provides a comprehensive

solution. Wavelet analysis allows for a more accurate encoding of frequency and time

information, while inception networks provide a model type more appropriately designed to

learn the intricate and subtle features of time-frequency plots than the CNN proposed by Lai

et al. (2019).

[Figure 22: Module 5 of Model C. Block labels: Prev Layer; Conv 192 (1,1) 1 Y; Conv 128 (1,3) 1 Y; Conv 1856 (1,1) 1 Y; Conv 128 (3,1) 1 Y; Scale Addition; Depth Concat.]

7. Applying the Models

We now apply the models to our data. Cross validation is first used to test the stability of each model’s predictive power. Once an optimally designed structure has been established, we apply this model and the industry standard to a final test set. Models were trained on the Supercomputing Wales cluster, with training distributed over 2 GPUs simultaneously. Model scripts as well as an example bash script are given in the appendices.
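As a minimal sketch of how a 4-fold cross validation partitions the data, the snippet below builds the index splits in plain Python. The function name, the random seed and the choice to shuffle before splitting are assumptions made for illustration; only the number of folds is taken from the text.

```python
import random

def k_fold_indices(n_samples, k=4, seed=0):
    """Return a list of (train_idx, val_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, reproducibly
    # Fold i takes every k-th shuffled index starting at offset i,
    # so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, val_idx in enumerate(folds):
        # Training data is everything outside the held-out fold.
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, val_idx))
    return splits

splits = k_fold_indices(n_samples=10, k=4)
for fold_number, (train_idx, val_idx) in enumerate(splits, start=1):
    print(f"Fold {fold_number}: {len(train_idx)} training, {len(val_idx)} validation")
```

Each instance is held out exactly once, so the four validation accuracies together cover the whole cross-validation set.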

7.1 Cross-Validation Results

Table 7.1.1 gives the results of the cross validation. To provide a baseline, the industry standard is also applied at this stage. All of the proposed models are more accurate than the industry standard across all four folds. The time taken to complete the 10 epochs is also presented, in the format days:hours:minutes. The models show negligible difference from the industry standard on this metric.

Model B obtains the highest mean accuracy across the folds of the cross validation. We therefore take it to be the optimally designed model and select it for application on the test set.

Accuracy
                     Fold 1      Fold 2      Fold 3      Fold 4      Mean
Model A              96.74%      96.95%      96.82%      96.93%      96.86%
Model B              96.93%      96.81%      97.00%      96.96%      96.925%
Model C              96.36%      96.19%      96.44%      96.33%      96.33%
Industry Standard    95.81%      95.60%      95.52%      95.87%      95.70%

Time taken (days:hours:minutes)
                     Fold 1      Fold 2      Fold 3      Fold 4      Mean
Model A              01:17:22    01:17:58    01:18:07    01:17:05    01:17:38
Model B              01:19:50    01:18:23    01:17:42    01:18:01    01:18:29
Model C              01:17:20    01:17:53    01:17:32    01:18:30    01:17:48
Industry Standard    01:19:04    01:17:15    01:16:41    01:17:39    01:17:40
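The selection of Model B can be reproduced from the per-fold accuracies with a few lines of Python. This is a sketch for illustration only: the variable names are ours, and the figures are copied from the cross-validation results above.

```python
from statistics import mean

# Per-fold accuracies (%) copied from the cross-validation results.
fold_accuracy = {
    "Model A": [96.74, 96.95, 96.82, 96.93],
    "Model B": [96.93, 96.81, 97.00, 96.96],
    "Model C": [96.36, 96.19, 96.44, 96.33],
    "Industry Standard": [95.81, 95.60, 95.52, 95.87],
}

# Mean accuracy per model and the model with the highest mean.
mean_accuracy = {name: mean(acc) for name, acc in fold_accuracy.items()}
best_model = max(mean_accuracy, key=mean_accuracy.get)

# Check, fold by fold, whether each proposed model beats the baseline.
baseline = fold_accuracy["Industry Standard"]
beats_baseline_every_fold = {
    name: all(a > b for a, b in zip(acc, baseline))
    for name, acc in fold_accuracy.items()
    if name != "Industry Standard"
}

print(best_model, round(mean_accuracy[best_model], 3))
print(beats_baseline_every_fold)
```

Note that Model B has the highest mean without being the best model in every individual fold; the selection rests on the mean alone.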

7.2 Test Set Results

Model B and the Industry Standard model are now applied to the test set in order to simulate a real-life HFO detection task. As discussed in section 5.4, the data utilised for the 4-fold cross validation now becomes the training set. Batch size, number of epochs, and optimisation details are maintained from the previous section.

In addition to an overall analysis of predictive performance, we investigate accuracy with respect to each individual wave type. We also inspect whether the distance between the waveform and electrode has a meaningful effect on predictive power.

7.2.1 Overall Performance

Figures 25 and 26 show the confusion matrices obtained by applying Model B and the Industry Standard to the test set. Model B has a higher number of true positives and true negatives, indicating a more effective performance.


Figure 25: Confusion matrix for Model B's application on the test set

Figure 26: Confusion matrix for the Industry Standard's application on the test set

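To obtain a more formal measure of performance, sensitivity and specificity are calculated using the equations below. Accuracy and the time taken to complete the 10 training epochs are also presented.

\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]

\[ \text{Specificity} = 1 - \frac{FP}{TP + FP} \]

Here TP, FP and FN denote the numbers of true positives, false positives and false negatives. As written, the second expression simplifies to TP / (TP + FP), a quantity more commonly termed precision; specificity is conventionally defined as TN / (TN + FP), where TN is the number of true negatives.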

Model B performs more effectively than the Industry Standard in terms of overall sensitivity and specificity. The Industry Standard model completes the 10 epochs faster; however, such a small difference in training time is arguably negligible in this situation.
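The calculation can be sketched in Python as follows. We assume that HFO is the positive class and take Model B’s counts from the Total rows of the wave-type breakdowns in section 7.2.2 (TP = 27501, FN = 1039, TN = 22272, FP = 464). The function names are illustrative. The conventional true negative rate is computed alongside the second expression given in the text, so that the two can be compared.

```python
def sensitivity(tp, fn):
    """TP / (TP + FN): share of true HFOs that are detected."""
    return tp / (tp + fn)

def one_minus_fp_share(tp, fp):
    """The second expression given in the text, 1 - FP / (TP + FP)."""
    return 1 - fp / (tp + fp)

def true_negative_rate(tn, fp):
    """TN / (TN + FP): the conventional definition of specificity."""
    return tn / (tn + fp)

def accuracy(tp, tn, fp, fn):
    """Share of all instances that are labelled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Model B counts, assuming HFO is the positive class (totals from section 7.2.2).
tp, fn = 27501, 1039
tn, fp = 22272, 464

model_b = {
    "sensitivity": round(100 * sensitivity(tp, fn), 2),
    "1 - FP/(TP+FP)": round(100 * one_minus_fp_share(tp, fp), 2),
    "TN/(TN+FP)": round(100 * true_negative_rate(tn, fp), 2),
    "accuracy": round(100 * accuracy(tp, tn, fp, fn), 2),
}
print(model_b)
```

Under these assumptions the sensitivity (96.36%) and accuracy (97.07%) match the reported values for Model B, and the second expression in the text gives 98.34%, whereas the conventional true negative rate computed from the same counts is 97.96%.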

The Venn diagram below visualises the predictive behaviour of each model on the 51,276 instances of the test set and how this behaviour relates to that of the other model. Each circle represents the set of test instances predicted correctly by one of the models.

                 Model B     Industry Standard
Training Time    01:21:23    01:20:47
Sensitivity      96.36%      94.81%
Specificity      98.34%      97.03%
Accuracy         97.07%      95.59%

Figure 23: Venn diagram comparing the performance of Models (regions: Model B only, 1256; Industry Standard only, 502; both models, 48517; neither model, 1001)

Both models correctly predict 48,517 of the instances within the test set. A further 1,256 instances are correctly predicted by Model B but incorrectly predicted by the Industry Standard model. Conversely, 502 instances are correctly predicted by the Industry Standard model but incorrectly predicted by Model B. The remaining 1,001 instances are incorrectly predicted by both models.
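The four regions of such a diagram can be counted directly from two per-instance correctness vectors. The sketch below shows one way to do this; the function name and the toy vectors are illustrative, while the region counts in the second half are those reported in the text.

```python
def agreement_regions(correct_a, correct_b):
    """Count the four Venn regions from two per-instance correctness lists."""
    regions = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for a, b in zip(correct_a, correct_b):
        if a and b:
            regions["both"] += 1
        elif a:
            regions["only_a"] += 1
        elif b:
            regions["only_b"] += 1
        else:
            regions["neither"] += 1
    return regions

# Toy illustration with five hypothetical instances.
toy = agreement_regions(
    [True, True, False, False, True],
    [True, False, True, False, True],
)

# Region counts reported in the text (a = Model B, b = Industry Standard).
reported = {"both": 48517, "only_a": 1256, "only_b": 502, "neither": 1001}
total = sum(reported.values())
accuracy_a = (reported["both"] + reported["only_a"]) / total
accuracy_b = (reported["both"] + reported["only_b"]) / total
print(toy, total, round(100 * accuracy_a, 2), round(100 * accuracy_b, 2))
```

The four reported regions sum to the 51,276 test instances, and the implied accuracies (about 97.07% and 95.60%) agree with the reported overall accuracies to within rounding.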

7.2.2 Accuracy Breakdown by Wave Type

While Model B performs more effectively when taking an overall view of the test set, it is important to understand the reasons for this performance. A more in-depth analysis of each model’s performance by wave type is therefore necessary. By evaluating performance by wave type, we can examine whether Model B identifies better patterns across all wave types, or simply performs better on a particular sub-task of the problem.

The table below gives a breakdown by wave type of the performance of both Model B and the Industry Standard on the HFOs of the test set. Model B is more accurate for every type of wave. This suggests Model B has developed more meaningful features than the Industry Standard, independent of the type of HFO on which the prediction is made.

Both models identify fast ripples more accurately than ripples, which is in line with the findings of the industry standard paper (Lai et al. 2019). Both models find waves in which a ripple and a fast ripple occur simultaneously to be the simplest wave type to predict.

                               Model B                              Industry Standard
Wave                 N         Correct    Incorrect    Accuracy     Correct    Incorrect    Accuracy
Ripple               5651      5520       131          97.68%       5412       239          95.77%
Fast Ripples         5768      5655       113          98.04%       5606       162          97.19%
FastRipple-Ripple    5637      5627       10           99.82%       5625       12           99.79%
Spike-FastRipple     5836      5438       398          93.18%       5355       481          91.76%
Spike-Ripple         5648      5261       387          93.15%       5062       586          89.62%
Total                28540     27501      1039         96.36%       27060      1480         94.81%

Perhaps in line with intuition, situations in which a ripple or fast ripple occurs with disruptive

non-HFO behaviour in the form of a spike present the most difficult task for the models.

There is a significant drop in both ripple and fast ripple detection accuracy when spiking

behaviour is set to occur simultaneously.
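The per-wave comparison can be reproduced from the raw counts. The following sketch, with illustrative names and the counts copied from the HFO breakdown, computes each accuracy and identifies the wave type that is hardest for Model B and the one with the largest gap between the two models.

```python
# (N, correct by Model B, correct by Industry Standard) per HFO wave type.
hfo_counts = {
    "Ripple": (5651, 5520, 5412),
    "Fast Ripple": (5768, 5655, 5606),
    "FastRipple-Ripple": (5637, 5627, 5625),
    "Spike-FastRipple": (5836, 5438, 5355),
    "Spike-Ripple": (5648, 5261, 5062),
}

def accuracy_table(counts):
    """Return {wave: (Model B accuracy, Industry Standard accuracy)} in percent."""
    return {
        wave: (100 * correct_b / n, 100 * correct_is / n)
        for wave, (n, correct_b, correct_is) in counts.items()
    }

acc = accuracy_table(hfo_counts)
hardest_for_model_b = min(acc, key=lambda wave: acc[wave][0])
largest_gap = max(acc, key=lambda wave: acc[wave][0] - acc[wave][1])

for wave, (acc_b, acc_is) in acc.items():
    print(f"{wave:18s} {acc_b:6.2f}% {acc_is:6.2f}% gap {acc_b - acc_is:5.2f}")
```

On these counts both answers are the Spike-Ripple wave type, which is consistent with the observation that spiking activity is the main source of difficulty.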

The table below gives a breakdown by wave type of each model’s performance on non-HFO events. Both models are more accurate when evaluating non-HFO activity than HFO activity. Both models perform well on artifacts and noise; Model B in fact scores 100% when predicting on artifacts. Spikes are the most challenging waveform to predict for both models. Model B is again more accurate than the Industry Standard on all sub-types of non-HFO behaviour.

7.3 Predictive Power of Models over Distance

Of interest in this field is how a model performs with respect to the distance between the electrode and the behaviour it measures. The challenge in a real-life scenario is locating specific epileptogenic brain tissue. Future solutions may therefore look to utilise predictions from multiple electrodes, each at a different distance from the behaviour, in order to locate defective tissue more precisely. An understanding of how HFO detection accuracy varies over distance is consequently of great importance. Distance measurements are given in metres within this report.

                       Model B                              Industry Standard
Wave         N         Correct    Incorrect    Accuracy     Correct    Incorrect    Accuracy
Artifacts    5663      5663       0            100%         5652       11           99.81%
Baseline     5639      5508       131          97.68%       5431       208          96.31%
Noise        5679      5678       1            99.98%       5662       17           99.70%
Spike        5755      5423       332          94.23%       5214       541          90.60%
Total        22736     22272      464          97.96%       21959      777          96.58%

7.3.1 Simple Waveforms – Ripples, Fast Ripples and Spikes

We first investigate HFO accuracy over distance when models are evaluating simple

waveforms. By simple waveforms, we mean waveforms without additional disruptive waves

set to coincide with them. Specifically, we first study simple ripples, fast ripples and spikes.

Figures 28-31 plot a model’s output probabilities against the distance to the respective wave. For example, figure 28 shows how Model B predicts on the ripples within the test set. By plotting the probability output against the distance from each ripple, we can analyse how the model’s performance varies with the distance from the ripple. In figure 28, the distribution of incorrectly predicted ripples is shifted towards larger distances.

The output probability is the probability with which a model predicts an instance to be an HFO. In plots like these, points towards the centre of the y-axis therefore correspond to instances where the model is less certain about which label to assign. Comparing figures 28 and 29, we can clearly see that the Industry Standard not only predicts more instances incorrectly, but is also far less certain in its predictions.
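The reading of these plots can be made concrete with a short sketch. It assumes a decision threshold of 0.5 on the output probability; the certainty score, which rescales the distance from 0.5 to the range [0, 1], is our own illustrative measure rather than one used in the analysis, and the example points are hypothetical.

```python
def predict_label(p_hfo, threshold=0.5):
    """Label an instance as an HFO when its output probability exceeds the threshold."""
    return p_hfo > threshold

def certainty(p_hfo):
    """Illustrative score: 0 at p = 0.5 (undecided), 1 at p = 0 or p = 1."""
    return abs(p_hfo - 0.5) * 2

def is_correct(p_hfo, is_hfo, threshold=0.5):
    """Compare the thresholded prediction with the true label."""
    return predict_label(p_hfo, threshold) == is_hfo

# Hypothetical (output probability, distance in metres) pairs for true ripples.
ripples = [(0.98, 0.002), (0.91, 0.004), (0.55, 0.007), (0.31, 0.009)]

for p_hfo, distance in ripples:
    status = "correct" if is_correct(p_hfo, is_hfo=True) else "incorrect"
    print(f"distance {distance:.3f}: p = {p_hfo:.2f}, {status}, certainty {certainty(p_hfo):.2f}")
```

For a non-HFO wave such as a spike the same functions apply with is_hfo=False, in which case the incorrect points are those above the threshold.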

Figure 28: Model B's performance on Ripples over distance

Figure 29: Industry Standard's performance on Ripples over distance

Figure 30: Model B's performance on Fast Ripples over distance

From observation of each plot, it is evident that increased distance has some obscuring effect on ripple and fast ripple detection for both models. Comparing figures 30 and 31, it is clear that the Industry Standard predicts fast ripples far less accurately than Model B.

Spiking is a non-HFO wave that both models found relatively challenging to predict. An incorrect prediction on this wave type means labelling the spike as an HFO, so incorrect points are this time located above 0.5 on the y-axis. Figures 32 and 33 compare the performance of the models on the spikes of the test set. There appears to be a high degree of uncertainty in the predictions from both models, which underlines the difficulty of predicting this waveform. There also appears to be a small shift in the distribution of the incorrect predictions towards smaller distances: the closer a spike is to the electrode, the stronger the propensity of both models to confuse it with HFO activity.

Figure 31: Industry Standard's performance on Fast Ripples over distance

Figure 32: Model B's performance on Spikes over distance

7.3.2 Complex Waveforms – Pairings Between Ripples, Fast Ripples and Spikes

Classifying HFO activity when a ripple and a fast ripple occur simultaneously is an important task for an HFO detection model. Equally important is the ability to detect HFOs when non-HFO activity occurs simultaneously.

In these mixed waves, two specific events occur, each with its own distance. We therefore first consider each model’s accuracy over both distances simultaneously using 3D plots.

Figures 34 and 35 show the predictions of Model B and the Industry Standard model against the distance from both the ripple and the fast ripple. From our previous analysis, we have already seen that both models perform relatively well on this wave type, and so examples of incorrect predictions are sporadic. Nevertheless, most incorrect instances appear to occur at larger distances.

Figure 33: Industry Standard's performance on Spikes over distance

Figure 34: Model B's performance on Ripple-FastRipples over distances

Figure 35: Industry Standard's performance on Ripple- FastRipples over distances


Figure 36: Incorrect predictions of Model B on Ripple-FastRipples

Figure 37: Incorrect predictions of the Industry Standard on Ripple-FastRipples

To make this pattern a little easier to observe, figures 36 and 37 show 2D plots of all incorrect predictions of this wave type. These can be thought of as projections of the incorrect predictions onto the floor of figures 34 and 35. This allows us to see clearly where these incorrect observations occur with respect to the distance from both the ripple and the fast ripple.
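The projection amounts to filtering the incorrect predictions and discarding the probability axis. A minimal sketch, assuming a threshold of 0.5 and using hypothetical data points and an illustrative function name:

```python
def incorrect_projection(points, is_hfo=True, threshold=0.5):
    """Keep wrongly predicted instances and drop the probability axis.

    points: iterable of (distance_1, distance_2, p_hfo) triples.
    Returns the (distance_1, distance_2) pairs of the incorrect predictions.
    """
    return [
        (d1, d2)
        for d1, d2, p_hfo in points
        if (p_hfo > threshold) != is_hfo
    ]

# Hypothetical (distance to ripple, distance to fast ripple, output probability)
# triples for Ripple-FastRipple instances, all of which are true HFOs.
points = [
    (0.002, 0.003, 0.99),
    (0.004, 0.008, 0.93),
    (0.009, 0.008, 0.42),
    (0.010, 0.009, 0.18),
]

floor_points = incorrect_projection(points)
print(floor_points)
```

In this toy example only the two instances with both distances large fall below the threshold, mirroring the pattern described for this wave type.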

The models’ success on this wave type is to the detriment of the analysis, since there are so few incorrect predictions to analyse. Despite the relatively small number of instances, it seems that incorrect predictions for this wave type occur when the distances from both the ripple and the fast ripple are reasonably large. Intuitively, this makes sense: predictive errors only seem to occur when both distances are large enough to make recognition of both the ripple and the fast ripple relatively difficult.

From our previous analysis, waveforms consisting of both HFO and non-HFO behaviour

proved to be more challenging for both models. The occurrence of spiking activity seems to

hinder each model’s ability to extract the relevant HFO behaviour from the wave.

Figure 38: Model B's performance on Spike-Ripples over distances

Figure 39: Industry Standard's performance on Spike-Ripples over distances

Figures 38 and 39 visualise how the distances from both a ripple and spiking activity affect the predictive ability of the models. As previously discussed, Model B predicts more accurately on this wave type. Not only does Model B make fewer incorrect predictions, but its cluster of correctly predicted instances is also more concentrated towards the high probabilities, meaning there is more certainty regarding these observations. The Industry Standard shows less certainty in this respect.

How the predictive power of these models changes over distance is again difficult to interpret fully from the 3D plots. We therefore take 2D projections of all incorrectly predicted instances onto the floors of figures 38 and 39.

The plots suggest that as the distance to the ripple activity increases, the number of incorrectly predicted observations increases. In contrast, most incorrectly predicted observations seem to occur in situations where the spike is a smaller distance away. To analyse these effects further, boxplots are given in figures 42 and 43. These show the distributions of the incorrectly predicted instances in terms of the distance from both the ripple and the spike.

Figure 40: Incorrect predictions of Model B on Spike-Ripples

Figure 41: Incorrect predictions of the Industry Standard on Spike-Ripples

The boxplots show that incorrect predictions are more likely to appear in situations where the distance from the ripple is large and the distance from the spike is small. This is as expected: the smaller the distance from a spike, the more likely it is to have disruptive effects, while ripple activity is easier to detect and classify when the wave is less attenuated.
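The statistics summarised by such boxplots can be computed with the standard library. The sketch below returns the extremes and quartiles of a set of distances; the distances are hypothetical and the function name is illustrative. Plotting libraries often draw whiskers using a 1.5 × IQR rule rather than the raw extremes, which this sketch does not attempt to reproduce.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

# Hypothetical distances (metres) for incorrectly predicted Spike-Ripple instances.
ripple_distance = [0.006, 0.007, 0.008, 0.009, 0.010]
spike_distance = [0.001, 0.002, 0.003, 0.004, 0.005]

ripple_summary = five_number_summary(ripple_distance)
spike_summary = five_number_summary(spike_distance)
print("ripple:", ripple_summary)
print("spike: ", spike_summary)
```

A higher median distance to the ripple than to the spike among the incorrect predictions is the pattern the boxplots are used to confirm.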

We now consider waves composed of simultaneously occurring fast ripples and spikes. 3D plots of the distance to both the fast ripple and the spike are given in figures 44 and 45. Again, Model B predicts more accurately. Additionally, its cluster of correctly predicted observations is more highly concentrated in the upper region, meaning the correct predictions are made more confidently.

Figure 42: Distribution of incorrect predictions by Model B on Spike-Ripples

Figure 43: Distribution of incorrect predictions by the Industry Standard on Spike-Ripples


2D projections of all incorrectly predicted instances onto the floor of the previous plots are given in figures 46 and 47. Again, this allows us to study the distribution of incorrectly predicted observations in terms of both distances simultaneously.

Figure 44: Model B's performance on Spike-FastRipples over distances

Figure 45: Industry Standard's performance on Spike-FastRipples over distances

Figure 46: Incorrect predictions of Model B on Spike-FastRipples

Figure 47: Incorrect predictions of the Industry Standard on Spike-FastRipples

The plots for each model suggest that the number of incorrect predictions increases with the distance to the fast ripple activity. There also appears to be a small shift in the plotted values towards regions where the distance from the spiking activity is smaller. This pattern is difficult to interpret, and so we use boxplots to inspect how the incorrect instances are distributed over the distance to each kind of wave behaviour.

Indeed, the boxplots show that increasing distance from the fast ripple activity has a disruptive effect on the accuracy of both models. The opposite seems to be true for the spiking behaviour. These conclusions again match the behaviour we would expect. Measurements in which a fast ripple and a spike occur simultaneously, and where the spike occurs far closer to the electrode than the fast ripple of interest, are prime candidates for misclassification. This is due to the highly distortive effect of the spike and the relatively weak measurement of the fast ripple.

Figure 48: Distribution of incorrect predictions by Model B on Spike-FastRipples

Figure 49: Distribution of incorrect predictions by the Industry Standard on Spike-FastRipples

7.4 Performance on Small Datasets

This research utilizes simulated data, which allows for a vast dataset to be generated. In a

real-life application, data is sourced from pre-surgery epileptic patients, making it far scarcer.

Therefore, it is important to inspect how the proposed model adapts to small datasets.

From the final training set, stratified random sampling is carried out to create a small dataset of just 20,000 images, half of which are HFOs and half of which are non-HFOs. This set is used to train the Industry Standard and Model B. Optimisation details are unchanged from previous sections. The trained models are then applied to the test set.
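A stratified sample of this kind can be sketched as follows. The function name, the seed and the toy label vector are assumptions for illustration; in the setting described above the call would draw 10,000 indices per class.

```python
import random

def stratified_sample(labels, n_per_class, seed=0):
    """Draw n_per_class indices from each class, without replacement."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    sample = []
    for label in sorted(by_class):
        # rng.sample raises ValueError if a class has too few instances.
        sample.extend(rng.sample(by_class[label], n_per_class))
    return sample

# Hypothetical label vector: 1 = HFO, 0 = non-HFO.
labels = [1] * 60 + [0] * 40
subset = stratified_sample(labels, n_per_class=10)
print(len(subset), sum(labels[i] for i in subset))
```

Sampling a fixed number per class, rather than a fixed fraction, guarantees the balanced half-and-half split regardless of the class proportions in the full training set.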

Model B gives an accuracy of 94.58%, while the Industry Standard has an accuracy of 79.61%. Model B maintains a relatively high predictive power despite the reduced training set size, whereas the Industry Standard shows a far sharper drop-off in performance.

8. Final Discussion

In conclusion, we present a model that is more accurate than the industry standard on simulated data. Not only does the model obtain more accurate results overall, but upon further analysis it is also more accurate on all types of signal behaviour simulated.

Testing has only been conducted on simulated data. Even so, such successful results give promise that this model and similar structures will be similarly effective on real-life data. If such performance is replicable on real-life data, this could potentially be a significant breakthrough in the field.

As well as presenting a successful new model, this work provides an analysis of how the distance to behaviours affects the classification accuracy of models, which is also valuable for the field. This analysis shows how increasing distance from HFO behaviour negatively affects a model’s ability to detect such behaviour. Furthermore, an investigation into signal behaviours derived from both HFO and non-HFO events showed the disruptive effect of non-HFO behaviour and how it increases with closer proximity. While the conclusions from this analysis may seem obvious, this is the first research to analyse such effects in the field of HFO detection.

8.1 Limitations to Conclusions

While there are many positives to the work presented in this research, there are of course limitations to what can be concluded. Specifically, the issues surrounding the use of simulated data, as discussed in section 5.5, mean that it would be unwise to directly compare the accuracies obtained within this report with accuracies obtained by papers from the field. Rather, this research should be taken to show the applicability of these models, and to act as motivation to test them on real-life data.

8.2 Possible Next Steps

Results are promising, and a logical next step would be to test the model structures on real-life data, although arrangements would have to be made to record the relevant iEEG from pre-surgery patients.

The depth and width of the models proposed here could also be increased. Design choices such as the number of layers and the number of filters in those layers were restricted in order to build models trainable within reasonable time limits. Far deeper and more complex structures could be considered. Other model hyperparameters, such as the optimisation method and number of epochs, could also be changed to possibly yield more accurate results.

It is not unreasonable to suggest that with the application of even more complex structures

than researched here, inception architectures may be able to approach the classification

accuracies necessary to make the utilisation of these models in the medical field practical.

9. Appendices

A. Diagram Explanation

This paper includes graphs to represent the network sub-structures. Each block in a graph represents a separate process. Below is an explanation of how to read the blocks that make up these graphs.

[Example block: upper section labelled ‘Layer type & filters used’; lower sections labelled ‘(x,y)’ and ‘pad’]

The upper section of a block defines the type of process applied at this stage. ‘Conv’ is used to show that this is a convolutional layer, and in this case the number of kernels used is indicated as a small number to the right of this. Alternatively, ‘Avg Pool’ and ‘Max Pool’ are used to indicate that this is an average pooling or max pooling layer.

The bottom-left section contains a tuple that defines the filter/pool size used in this convolutional/pooling layer.

The bottom-middle section gives an integer ‘s’ that corresponds to the step size used in this layer.

The bottom-right section defines information on the use of padding. A label of ‘Y’ indicates that the spatial dimensions of the input are maintained using zero-padding. A label of ‘N’ indicates that no padding is used.

B. Bash Script Example

Example of a bash script used to distribute model training on the Supercomputing Wales cluster. This example relates to the final training of the Industry Standard model before application to the test set.

C. Learning Rate Example

Code defining a function to control the learning rate. This specific example relates to the industry standard model.
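As an illustration of the kind of function such code defines, the sketch below implements a simple step-decay schedule. It is not the schedule used for the industry standard model: the initial rate, decay factor and interval are hypothetical placeholders, and the function name is illustrative. The (epoch, lr) signature is chosen so that the function could be passed to the Keras LearningRateScheduler callback, which calls it with the epoch index and the current learning rate.

```python
def step_decay(epoch, lr=None, initial_lr=1e-3, drop=0.5, epochs_per_drop=3):
    """Return the learning rate for a zero-based epoch index.

    The rate starts at initial_lr and is multiplied by drop every
    epochs_per_drop epochs. All default values are hypothetical.
    The lr argument (the current rate) is accepted but not used.
    """
    return initial_lr * drop ** (epoch // epochs_per_drop)

# Learning rates over 10 epochs under the hypothetical defaults.
schedule = [step_decay(epoch) for epoch in range(10)]
for epoch, rate in enumerate(schedule):
    print(f"epoch {epoch}: learning rate {rate:.6f}")
```

A schedule that depends only on the epoch index is easy to reproduce across runs, which is one reason step decay is a common default.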

D. Inception Module Example

Example of an inception module structure defined in Keras. This example is of the 1st inception module of Model B.

E. Example of Calling a Model in Keras

Calling a model for training. This example relates to calling Model B for final training before application to the test set.

10. Bibliography

Bénar, C., Chauvière, L., Bartolomei, F., Wendling, F. 2010. Pitfalls of high-pass filtering for detecting epileptic oscillations: a technical note on “false” ripples. Clinical Neurophysiology 121(3), pp. 301-310.

Blanco, J.A., Stead, M., Krieger, A., Viventi, J., Marsh, R. et al. 2010. Unsupervised Classification of High-Frequency Oscillations in Human Neocortical Epilepsy and Control Patients. Journal of Neurophysiology 104(5), pp. 2900–2912.

Cendes, F. and Meador, K. 2018. Searching for the good and bad high-frequency oscillations. Neurology 90(8), pp. 347-348.

Chollet, F. 2017. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1251-1258).

Crépon, B., Navarro, V., Hasboun, D., Clemenceau, S., Martinerie, J. et al. 2009. Mapping interictal oscillations greater than 200 Hz recorded with intracranial macroelectrodes in human epilepsy. Brain 133(1), pp. 33-45.

Du, Y., Sun, B., Lu, R., Zhang, C., Wu, H. et al. 2019. A method for detecting high-frequency oscillations using semi-supervised k-means and mean shift clustering. Neurocomputing 350, pp. 102-107.

Dümpelmann, M., Jacobs, J., Kerber, K., Schulze-Bonhage, A. 2012. Automatic 80–250 Hz “ripple” high frequency oscillation detection in invasive subdural grid and strip recordings in epilepsy by a radial basis function neural network. Clinical Neurophysiology 123(9), pp. 1721-1731.

Fisher, R., Webber, W., Lesser, R., Arroyo, Uematsu, S. 1992. High-frequency EEG activity at the start of seizures. Journal of Clinical Neurophysiology 9(3), pp. 441-448.

He, K., Zhang, X., Ren, S. and Sun, J., 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778).

Jacobs, J., LeVan, P., Chander, R., Hall, J., Dubeau, F. et al. 2008. Interictal high-frequency oscillations (80-500 Hz) are an indicator of seizure onset areas independent of spikes in the human epileptic brain. Epilepsia 49(11), pp. 1893–1907.

Jacobs, J., Zijlmans, M., Zelmann, R., Chatillon, C., Hall, J. et al. 2010. High-frequency electroencephalographic oscillations correlate with outcome of epilepsy surgery. Annals of Neurology 67(2), pp. 209–220.

Krizhevsky, A., Sutskever, I. and Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105).

Logothetis, N.K., Kayser, C. and Oeltermann, A., 2007. In vivo measurement of cortical impedance spectrum in monkeys: implications for signal propagation. Neuron, 55(5), pp. 809-823.

Matsumoto, A., Brinkmann, B., Stead, M., Matsumoto, J., Kucewicz. et al. 2013. Pathological and physiological high-frequency oscillations in focal human epilepsy. Journal of Neurophysiology 110(8), pp. 1958–1964.

Jefferys, J., Menendez de la Prida, L., Wendling, F., Bragin, A. et al. 2012. Mechanisms of physiological and epileptic HFO generation. Progress in Neurobiology 98(3), pp. 250-264.

Jrad, N., Kachenoura, A., Merlet, I., Bartolomei, F., Nica, A. et al. 2016. Automatic detection and classification of high-frequency oscillations in depth-EEG signals. IEEE Transactions on Biomedical Engineering, 64(9), pp.2230-2240.

Kucewicz, M., Cimbalnik, J., Matsumoto, J., Brinkmann, B., Bower, M. et al. 2014. High frequency oscillations are associated with cognitive processing in human recognition memory. Brain 137(8), pp. 2231-2244.

Lai, D., Zhang, X., Z., Ma, K., M., Chen, Z., Chen W. et al. 2019. Automated Detection of High Frequency Oscillations in Intracranial EEG Using the Combination of Short-Time Energy and Convolutional Neural Networks. IEEE Access 7, pp. 82501-82511.

Liu, S., Sha, Z., Sencer, A., Aydoseli, A., Bebek, N. et al. 2016. Exploring the time–frequency content of high frequency oscillations for automated identification of seizure onset zone in epilepsy. Journal of Neural Engineering 13(2).

López-Cuevas, A., Castillo-Toledo, B., Medina-Ceja, L., Ventura-Mejía, C., Pardo-Peña, K. 2013. An algorithm for on-line detection of high frequency oscillations related to epilepsy. Computer Methods and Programs in Biomedicine 110(3), pp. 354-360.

Luck, S. (2014). An Introduction to The Event-Related Potential Technique. 2nd ed. Cambridge, Mass: MIT Press.

Navarrete, M., Pyrzowski, J., Corlier, J., Valderrama, M. and Le Van Quyen, M., 2016a. Automated detection of high-frequency oscillations in electrophysiological signals: Methodological advances. Journal of Physiology-Paris, 110(4), pp.316-326.

Navarrete, M., Alvarado-Rojas, C., Le Van Quyen, M., Valderrama, M. 2016b. RIPPLELAB: A Comprehensive Application for the Detection, Analysis and Classification of High Frequency Oscillations in Electroencephalographic Signals. PLoS ONE 11(6).

Schirrmeister, R.T., Springenberg, J., Dominique, L., Fiederer, J., Glasstetter, M. et al. 2017. Deep learning with convolutional neural networks for EEG decoding and visualization. Human Brain Mapping 38(11), pp. 5391–5420.

Squire, L., Bloom, F., Spitzer, N., Du Lac, S., Ghosh, A., Berg, D. (2008). Fundamental Neuroscience. 3rd ed. San Diego: Academic Press.

Staba, R.J., Wilson, C.L., Bragin, A., Fried, Engel, J. 2002. Quantitative Analysis of High-Frequency Oscillations (80–500 Hz) Recorded in Human Epileptic Hippocampus and Entorhinal Cortex. Journal of Neurophysiology 88(4), pp. 1743–1752.

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V. and Rabinovich, A., 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1-9).

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J. and Wojna, Z., 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818-2826).

Szegedy, C., Ioffe, S., Vanhoucke, V. and Alemi, A.A., 2017, February. Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence.

Whittingstall, K., Logothetis, N. 2009. Frequency-band coupling in surface EEG reflects spiking activity in monkey visual cortex. Neuron 64(2), pp. 281-289.

Worrell, G., Parish, L., Cranstoun, S., Jonas, R., Baltuch, G. et al. 2004. High‐frequency oscillations and seizure generation in neocortical epilepsy. Brain 127(7), pp. 1496-1506.

Wu, J., Sankar, R., Lerner, J., Matsumoto, J., Vinters, H. et al. 2010. Removing interictal fast ripples on electrocorticography linked with seizure freedom in children. Neurology 75(19), pp. 1686-1694.

Zijlmans, M., Jiruska, P., Zelmann, R., Leijten, F., Jefferys, J. et al. 2012. High-frequency oscillations as a new biomarker in epilepsy. Annals of Neurology 71(2), pp. 169–178.

Zuo, R., Wei, J., Li, X., Li, C., Zhao, C., Ren, Z., Liang, Y., Geng, X., Jiang, C., Yang, X. and Zhang, X., 2019. Automated Detection of High Frequency Oscillations in Epilepsy Based on a Convolutional Neural Network. Frontiers in Computational Neuroscience, 13.

Modelling Tail Dependence of Stock Indexes by GARCH-Copula Model(1).pdf

Modelling Tail Dependence of Stock

Indexes by GARCH-Copula Model

Chenghao Li

September 2019

School of Mathematics,

Cardiff University

A dissertation submitted in partial fulfilment of the

requirements for MSc (in Operational Research, Applied Statistics and Financial Risk) by taught programme.

CANDIDATE’S ID NUMBER

1869787

CANDIDATE’S SURNAME

Please circle as appropriate Mr / Miss / Ms/ Mrs / Rev / Dr / Other ……Li….......

CANDIDATE’S FULL FORENAMES

Chenghao

DECLARATION

This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree.

Signed ……………………………………………. (candidate) Date …………………………

STATEMENT 1

This dissertation is being submitted in partial fulfilment of the requirements for the degree of ………MSc…………(insert MA, MSc, MBA, etc, as appropriate)

Signed ……………………………………………. (candidate) Date …………………………

STATEMENT 2

This dissertation is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by footnotes giving explicit references. A Bibliography is appended.

Signed ……………………………………………. (candidate) Date …………………………

STATEMENT 3 –

I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing, and for the title and summary to be made available to outside organisations.

Signed ……………………………………………. (candidate) Date …………………………

STATEMENT 4 - BAR ON ACCESS APPROVED

I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing after expiry of a bar on access approved by the Graduate Development Committee.

Signed ……………………………………………. (candidate) Date …………………………

Executive Summary

Abstract

Economic globalization has increased the linkage between financial markets. Estimating the correlation between the log returns of global stock markets has strong practical significance for financial asset pricing and financial risk management. In this paper, the GARCH-copula model is used to estimate the correlation and tail dependence between the log returns of FTSE100, S&P500, Nikkei225 and HS300.

Methodology

In this paper, the GARCH(1,1) model is used to fit the log return of each stock index and obtain the marginal distributions of the log returns. The marginal distributions of each pair of indexes are then joined into a bivariate joint distribution using the Gaussian copula, T copula and Clayton copula functions, respectively. The correlation between the log returns is examined through the parameters of the copula functions. The tail dependence coefficients between the log returns are obtained from the tail dependence properties of the T copula and the Clayton copula.

Results

There is a clear correlation between the log returns of the stock indexes, and there is strong tail dependence between each pair of log returns. This shows that global financial markets are linked and that their correlation increases in extreme cases. It was also found that the correlation between the log returns of FTSE100 and S&P500 is the strongest, while the correlation between the log returns of HS300 and those of the other indexes is relatively weak. To a certain extent, this reflects the greater freedom of capital flows between the United Kingdom and the United States, whereas China maintains controls on capital flows.

Acknowledgements

Throughout the writing of this dissertation I have received a great deal of support and assistance. I would first like to thank my supervisor, Dr. Anqi Liu, whose expertise was invaluable in formulating the research topic and, in particular, the methodology.

I would like to acknowledge my friend Jiliang Zhu who provided me with a cloud server account, which

enabled my code to run for a long time on the cloud server to get the research results.

Contents

Abstract

1 Introduction

2 Literature Review

2.1 Techniques Used in Modelling Financial Returns

2.2 Copula Function Estimation Methods

2.3 Tail Dependence

3 The Model

3.1 Model Index Returns

3.2 Estimate Copula Function

3.2.1 Estimate Gaussian Copula

3.2.2 Estimate T Copula

3.2.3 Estimate Clayton Copula

3.3 Tail Dependence

4 Empirical Calibration and Results

4.1 The Data

4.2 Results

4.2.1 Marginal Distributions

4.2.2 Estimated Copula Functions

4.2.3 Tail Dependence

5 Conclusion

Reference

Modelling Tail Dependence of Stock Indexes by

GARCH-Copula Model

Abstract

With the increasing degree of economic globalization, the linkage between national stock markets is growing stronger. In extremely bad conditions in particular, global stock markets are more likely to fall at the same time, which is the so-called tail dependence. This paper uses stock market indexes to build copula models and investigates the tail dependence between markets.

1 Introduction

Financial markets are interrelated, and characterizing the relationship between financial products or financial markets helps with financial asset pricing and financial risk management. Traditional correlation measures (such as Pearson's rho) are often insufficient for understanding the correlation between financial markets; see, e.g., Embrechts et al. (2002). The most complete way to characterize the dependence between the variables of financial markets or financial assets is to obtain their joint distribution, but estimating the joint distribution directly is difficult. The copula theory proposed by Sklar (1959) provides a solution. Sklar's theorem states that any joint distribution can be represented by its marginal distributions and an appropriate copula function. When the joint distribution is continuous, the copula function is unique. To illustrate this theorem, consider a two-dimensional joint distribution function $H(X_1, X_2)$ with corresponding marginal distributions $F_1(X_1)$ and $F_2(X_2)$. Then a copula function $C$ can be found that joins the univariate marginal distributions into the joint distribution:

$$H(X_1, X_2) = C(F_1(X_1), F_2(X_2))$$

If $H(X_1, X_2)$ is continuous, the copula function $C(U_1, U_2)$ is unique. Conversely, if the two-dimensional copula function $C(U_1, U_2)$ and the marginal distributions $F_1(X_1)$ and $F_2(X_2)$ are known, we can find a bivariate distribution function $H(X_1, X_2)$ such that $F_1(X_1)$ and $F_2(X_2)$ are the margins of $H(X_1, X_2)$.

Therefore, with the help of the copula function, the estimation of the joint distribution function can generally be separated into two steps: 1) estimate the marginal distributions; 2) choose the copula function and estimate its parameters. The copula function can thus be said to contain all the dependence information.

Although copula theory was proposed early, it was not applied in the financial field until around the turn of the 21st century, some 40 years after it was proposed. In 1999, David X. Li first proposed using the copula function to model the correlation of loan defaults and showed how to apply it to the pricing of credit derivatives. However, this first application in the financial field can be said to have failed. Wall Street adopted the method proposed by Li and applied the Gaussian copula to the pricing of Collateralized Debt Obligations (CDOs), which sped up the issuance of CDOs. The CDO market subsequently collapsed: a large number of mortgage defaults devastated the value of CDOs, and investors suffered heavy losses. Felix Salmon (2009) described the Gaussian copula as a ‘recipe for disaster’ in his article. The main reason for this failed application is that Wall Street did not choose the right copula function. The Gaussian copula has no tail dependence, but in extreme market conditions the correlation between asset prices increases. Therefore, any application of copulas to finance, whether in financial risk management or financial asset pricing, must consider tail dependence, and it is very important to choose an appropriate copula function.

This paper establishes GARCH-copula models for the log returns of stock market indexes. The GARCH model is used to fit the log returns because financial time series exhibit fat tails and volatility clustering. Traditional econometric models could not handle these features well until Engle (1982) proposed the autoregressive conditionally heteroscedastic (ARCH) model and Bollerslev (1986) proposed the generalized ARCH (GARCH) model. Section 2 introduces the methods for fitting financial time series and the expression of the GARCH model in more detail. The copula model is used to build the joint distribution and to better describe the dependence between log returns, especially their tail dependence. As mentioned earlier, the correlation between financial assets increases significantly during a financial crisis. This is called “correlation breakdown”. If financial asset pricing or financial risk management does not take into account the increase in the correlation of financial assets in an extreme market environment, losses will follow. Tail dependence coefficients capture the dependence between financial assets in extreme market conditions, so calculating tail dependence has strong practical significance.

This paper studies the log returns of four indexes, FTSE100, S&P500, Nikkei225 and HS300, and examines their tail dependence. It finds that there is lower tail dependence between the indexes, which indicates that the pricing of financial products and the management of financial risks should take tail dependence into account. In addition, the lower tail dependence coefficient of FTSE100 and S&P500 is the largest, and the lower tail dependence between HS300 and the other three indexes is relatively weak. To a certain extent, this reflects the relatively weak linkage between China's capital market and the world's major capital markets.

This paper is organized as follows. Section 2 is a literature review. The literature

review includes three aspects: 1) commonly used models in financial time series

modelling; 2) estimation methods of copula functions; 3) tail dependence. In section

3, the model is described. Section 4 shows the empirical results. Section 5 concludes

this paper.

2 Literature Review

2.1 Techniques Used in Modelling Financial Returns

Before using the copula function to establish a joint distribution, a good estimate

of the marginal distribution is required. In financial time series, such as stock returns,

there are often effects of volatility clustering, fat tail and financial leverage. Volatility

clustering is the variability in time of the conditional variance. In financial time series,

it can be usually observed that the high volatility is clustered in some time interval.

Fat tail is another common phenomenon in financial returns which means the spread

of returns is significantly larger than that corresponding to the normal distribution.

Financial leverage effect is the phenomenon that positive and negative information

have asymmetric influence on the whole variance of time series.

Many financial return series cannot be assumed to be normally distributed. The key to precise modelling of financial returns is volatility modelling. Currently, a broad class of GARCH processes with fat-tailed innovations is used to model the volatility of financial returns. Some well-known GARCH-type processes are listed below.

The GARCH model was introduced by Bollerslev (1986). A financial time series of returns $\{R_t\}$ is said to follow a GARCH(p,q) model when

$$
\begin{cases}
R_t = \mu + \varepsilon_t \\
\varepsilon_t = \sigma_t v_t \\
\sigma_t^2 = \omega + \sum_{i=1}^{p} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2
\end{cases}
\tag{2.1}
$$

In the above formula, $v_t$ denotes the innovations. The advantage of the GARCH model is that it can deal with conditional heteroscedasticity: in formula (2.1) the conditional variance is described by $\sigma_t^2 = \omega + \sum_{i=1}^{p} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2$. Formula (2.1) is a constant-mean GARCH model; it can be extended to an AR-GARCH or ARMA-GARCH model when autoregressive effects are considered. A time series $\{R_t\}$ modelled with an AR(k)-GARCH(p,q) model can be expressed as

$$
\begin{cases}
R_t = \sum_{l=1}^{k} \varphi_l R_{t-l} + \varepsilon_t \\
\varepsilon_t = \sigma_t v_t \\
\sigma_t^2 = \omega + \sum_{i=1}^{p} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2
\end{cases}
\tag{2.2}
$$
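The recursion in (2.1) can be illustrated with a minimal simulation sketch for the case p = q = 1. The function name `simulate_garch11`, the parameter values and the use of standard normal innovations are assumptions made for this illustration only; they are not the specification estimated in this paper.

```python
import math
import random


def simulate_garch11(n, mu, omega, alpha1, beta1, seed=0):
    """Simulate n returns from a constant-mean GARCH(1,1) process.

    The innovations v_t are standard normal here (an assumption made for
    brevity; a fat-tailed distribution would replace rng.gauss).
    """
    if alpha1 + beta1 >= 1.0:
        raise ValueError("alpha1 + beta1 must be < 1 for a finite long-run variance")
    rng = random.Random(seed)
    # Start the recursion at the unconditional (long-run) variance.
    sigma2 = omega / (1.0 - alpha1 - beta1)
    eps_prev = 0.0
    returns, variances = [], []
    for _ in range(n):
        # sigma_t^2 = omega + alpha1 * eps_{t-1}^2 + beta1 * sigma_{t-1}^2
        sigma2 = omega + alpha1 * eps_prev ** 2 + beta1 * sigma2
        # eps_t = sigma_t * v_t
        eps_prev = math.sqrt(sigma2) * rng.gauss(0.0, 1.0)
        # R_t = mu + eps_t
        returns.append(mu + eps_prev)
        variances.append(sigma2)
    return returns, variances


# Hypothetical parameter values, chosen only for illustration.
returns, variances = simulate_garch11(1000, mu=0.0, omega=0.05, alpha1=0.10, beta1=0.85)
```

Plotting `variances` against time would show the clustering of high-volatility periods that motivates the model.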

The basic GARCH model does not take the financial leverage effect into account. The leverage effect means that good and bad information have asymmetric effects on the variance of financial returns. Good information corresponds to a positive return and bad information to a negative return. For stock market index returns specifically, a negative return increases the variance of the index returns more than a positive return of the same size. Several ways to handle the leverage effect can be found in the literature. The EGARCH model proposed by Nelson (1991) addresses this shortcoming of the GARCH model. In the EGARCH(p,q) model, the noise process $\{\varepsilon_t\}$ satisfies $\varepsilon_t = \sigma_t v_t$ and the following equations:

$$
\begin{cases}
\ln(\sigma_t^2) = \alpha_0 + \sum_{i=1}^{p} \alpha_i\, g(v_{t-i}) + \sum_{j=1}^{q} \beta_j \ln(\sigma_{t-j}^2) \\
g(v_t) = \theta v_t + \delta\left(|v_t| - E|v_t|\right)
\end{cases}
\tag{2.3}
$$

The GJR-GARCH model proposed by Glosten et al. (1993) is another popular model for the financial leverage effect. The noise process $\{\varepsilon_t\}$ follows the GJR-GARCH(p,q) model when it satisfies $\varepsilon_t = \sigma_t v_t$ and

$$
\sigma_t^2 = \alpha_0 + \sum_{i=1}^{p} \alpha_i \varepsilon_{t-i}^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2 + \sum_{i=1}^{p} \gamma_i\, \mathbf{I}_{\{\varepsilon_{t-i}<0\}}\, \varepsilon_{t-i}^2
\tag{2.4}
$$

In the GJR-GARCH model, $\mathbf{I}_{\{x<0\}}$ denotes the indicator function, that is, $\mathbf{I}_{\{x<0\}}(x) = 1$ when $x < 0$ and $\mathbf{I}_{\{x<0\}}(x) = 0$ when $x \ge 0$. Equation (2.4) shows that the parameter $\gamma_i$ determines the sensitivity of the conditional volatility function to negative returns.
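As a small worked illustration of (2.4) with p = q = 1, the sketch below compares the next-period variance after a negative and a positive shock of the same size. The function name and all parameter values are hypothetical.

```python
def gjr_garch11_variance(eps_prev, sigma2_prev, alpha0, alpha1, beta1, gamma1):
    """One step of the GJR-GARCH(1,1) variance recursion, i.e. (2.4) with p = q = 1."""
    # Indicator I{eps_{t-1} < 0}: switches on the extra gamma term after bad news.
    indicator = 1.0 if eps_prev < 0 else 0.0
    return (alpha0
            + alpha1 * eps_prev ** 2
            + beta1 * sigma2_prev
            + gamma1 * indicator * eps_prev ** 2)


# Hypothetical parameter values, chosen only for illustration.
params = dict(alpha0=0.02, alpha1=0.05, beta1=0.85, gamma1=0.10)
after_bad_news = gjr_garch11_variance(eps_prev=-2.0, sigma2_prev=1.0, **params)
after_good_news = gjr_garch11_variance(eps_prev=2.0, sigma2_prev=1.0, **params)
```

With these values the two results differ by exactly the term $\gamma_1 \varepsilon_{t-1}^2 = 0.10 \times 4$, which is the asymmetry the model is designed to capture.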

Regarding the distribution of $\{v_t\}$, the commonly used distributions are the normal, Student's t (as in Bollerslev, 1987), skewed t (as in Patton, 2004) and GED (generalized error distribution). The latter three are more suitable for $\{v_t\}$ since they have heavier tails than the normal distribution. Ferenstein and Gasowski (2004) modelled stock returns with an AR-GARCH model and found that the GED and Student's t distributions fit the innovations best.

In general, fitting the financial returns requires taking into account the three

characteristics of the financial time series and then selecting the appropriate model to

characterize the three features.

2.2 Copula Function Estimation Methods

Generally, copula function estimation method can be classified into two types,

parametric estimation and non-parametric estimation (Cherubini et al., 2004).

There are three parametric estimation methods: 1) the exact maximum likelihood method; 2) the inference for the margins (IFM) method; 3) the canonical maximum likelihood method.

The exact maximum likelihood method is based on the canonical representation:

$$
f(x_1, x_2, \dots, x_n) = c\big(F_1(x_1), F_2(x_2), \dots, F_n(x_n)\big) \times \prod_{j=1}^{n} f_j(x_j)
\tag{2.5}
$$

Let $\aleph = \{x_{1t}, x_{2t}, \dots, x_{nt}\}_{t=1}^{T}$ be the sample data matrix. The log-likelihood function is then

$$
l(\theta) = \sum_{t=1}^{T} \ln c\big(F_1(x_{1t}), F_2(x_{2t}), \dots, F_n(x_{nt})\big) + \sum_{t=1}^{T} \sum_{j=1}^{n} \ln f_j(x_{jt})
\tag{2.6}
$$

Maximizing the above log-likelihood function gives the maximum likelihood estimator:

$$
\hat{\theta}_{MLE} = \arg\max_{\theta \in \Theta}\, l(\theta)
\tag{2.7}
$$

The exact maximum likelihood method estimates the parameters of the marginal distributions and the parameters of the copula function at the same time, which can be very computationally intensive, especially in high dimensions. According to (2.6), the log-likelihood function is composed of two terms, one involving the copula parameters and one involving the parameters of the marginal distributions, so the parameters can be estimated separately rather than simultaneously to reduce the computational load. Based on this idea, the IFM method (Joe and Xu, 1996) was proposed. The IFM method estimates the parameters in two steps:

Step 1: The parameter $\boldsymbol{\theta}_1$ of the marginal distributions is estimated by

$$
\hat{\boldsymbol{\theta}}_1 = \arg\max_{\boldsymbol{\theta}_1} \sum_{t=1}^{T} \sum_{j=1}^{n} \ln f_j(x_{jt}; \boldsymbol{\theta}_1)
\tag{2.8}
$$

Step 2: Then, given $\hat{\boldsymbol{\theta}}_1$, the parameter $\boldsymbol{\theta}_2$ is estimated by

$$
\hat{\boldsymbol{\theta}}_2 = \arg\max_{\boldsymbol{\theta}_2} \sum_{t=1}^{T} \ln c\big(F_1(x_{1t}), F_2(x_{2t}), \dots, F_n(x_{nt}); \boldsymbol{\theta}_2, \hat{\boldsymbol{\theta}}_1\big)
\tag{2.9}
$$

Compared with the exact ML method, the IFM method is highly efficient (Joe, 1997).

The canonical maximum likelihood method estimates the copula parameters without specifying the marginal distributions. It consists of the following steps:

Step 1: Estimate the marginal distributions by their empirical distributions, namely $\hat{F}_i(x_{it})$ with $i = 1, \dots, n$.

Step 2: Estimate the copula parameters via MLE:

$$
\hat{\boldsymbol{\theta}}_2 = \arg\max_{\boldsymbol{\theta}_2} \sum_{t=1}^{T} \ln c\big(\hat{F}_1(x_{1t}), \hat{F}_2(x_{2t}), \dots, \hat{F}_n(x_{nt}); \boldsymbol{\theta}_2\big)
\tag{2.10}
$$

Non-parametric estimation no longer assumes a particular parametric copula.

Empirical copula and Kernel copula are two commonly used non-parametric

estimation methods.
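Step 1 of the canonical maximum likelihood method, which replaces each marginal distribution with its empirical counterpart, can be sketched as follows. The function name `pseudo_observations` and the division by T + 1 are assumptions for this illustration (the rescaling is a common convention that keeps every value strictly inside (0, 1)); the sample values are hypothetical.

```python
def pseudo_observations(sample):
    """Map a sample to (0, 1) with a rescaled empirical distribution function.

    Each value is replaced by rank / (T + 1). Ties are broken by order of
    appearance, which is a simplification.
    """
    T = len(sample)
    # Indices of the observations sorted from smallest to largest value.
    order = sorted(range(T), key=lambda t: sample[t])
    u = [0.0] * T
    for rank, t in enumerate(order, start=1):
        u[t] = rank / (T + 1)
    return u


# Hypothetical log returns, used only to show the transformation.
u = pseudo_observations([0.3, -1.2, 0.8, 0.1])
```

The resulting values play the role of $\hat{F}_i(x_{it})$ in (2.10) and can be passed directly to a copula log-likelihood.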

2.3 Tail Dependence

Whether financial markets become more interdependent during a financial crisis is a concern because it affects asset allocation and risk management. Many studies have pointed out that there is a correlation breakdown between financial markets; namely, in crash periods there is a statistically significant increase in the correlation between financial markets. Bertero and Mayer (1989) and King and Wadhwani (1990) find an increase in the correlation of stock returns at the time of the 1987 crash. Calvo and Reinhart (1996) find evidence of correlation shifts during the Mexican crisis. Baig and Goldfajn (1999) show evidence of contagion in the currency and equity markets of Malaysia, Indonesia, Korea and the Philippines during the Asian crisis.

Since there is such a phenomenon in the financial market, we want to understand

the dependence between financial markets in extreme cases. For example, in an

extreme case, when a financial asset has a huge loss, we want to know whether another

asset will also suffer a huge loss, what is the degree of correlation between the two

assets and which method should be used to accurately measure the correlation. Some

empirical studies, such as Ane and Kharoubi (2003), indicate that tail dependence is a

useful tool to describe correlations in extreme cases.

The most common definition of tail dependence, discussed by Sibuya (1960) and Joe (1997) among others, is the following. Let (X, Y) be a random vector with joint distribution function F, and let G and H be the marginal distribution functions of X and Y, respectively.

Then its upper tail dependence coefficient is

$$
\lambda_U = \lim_{t \to 1^-} P\{G(X) > t \mid H(Y) > t\}
\tag{2.11}
$$

and its lower tail dependence coefficient is

$$
\lambda_L = \lim_{t \to 0^+} P\{G(X) \le t \mid H(Y) \le t\}
\tag{2.12}
$$

By this definition, the tail dependence coefficient is the limiting conditional probability that one variable exceeds a high threshold (or falls below a low threshold) given that the other variable does so too. As Juri (2002) pointed out, the aim of studying the tail dependence between random variables is to know the probability that one random variable changes similarly when another random variable changes.

The tail dependence coefficients can also be defined through the copula function:

$$
\lambda_U = \lim_{u \to 1^-} \frac{1 - 2u + C(u,u)}{1 - u}
\tag{2.13}
$$

$$
\lambda_L = \lim_{u \to 0^+} \frac{C(u,u)}{u}
\tag{2.14}
$$
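The limit in (2.14) can be checked numerically. The sketch below evaluates $C(u,u)/u$ for the bivariate Clayton copula, written here in its standard form $C(u_1,u_2) = (u_1^{-\alpha} + u_2^{-\alpha} - 1)^{-1/\alpha}$, and compares it with the closed-form lower tail dependence coefficient $2^{-1/\alpha}$. The value of $\alpha$ is hypothetical.

```python
def clayton_cdf(u1, u2, alpha):
    """Bivariate Clayton copula C(u1, u2) for alpha > 0 (standard form)."""
    return (u1 ** (-alpha) + u2 ** (-alpha) - 1.0) ** (-1.0 / alpha)


alpha = 1.5  # hypothetical parameter value, for illustration only
closed_form = 2.0 ** (-1.0 / alpha)

# C(u, u) / u for u approaching 0 from above, as in (2.14).
ratios = [clayton_cdf(u, u, alpha) / u for u in (1e-1, 1e-2, 1e-4)]
```

The ratios decrease towards `closed_form` as `u` shrinks, which illustrates that the coefficient depends only on the copula and its parameter.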

The tail dependence coefficient defined by the copula function depends only on the form and the parameters of the copula function itself and is independent of the marginal distributions. The copula function is therefore a convenient tool for studying tail dependence: it suffices to select an appropriate copula function and estimate its parameters to obtain the tail dependence coefficient. Patton (2006) considered an extension of the theory of copulas and found evidence of asymmetric exchange rate dependence. Rodriguez (2007) used a copula approach and found that tail dependence is more prevalent in times of financial turmoil. Aloui et al. (2011) employed a multivariate copula approach to capture the tail dependence between four emerging markets and the US market.

This paper will use the copula function to model the stock market indexes and

examine their tail dependence.

3 The Model

The modelling is divided into three steps: 1) fit the log return of each index with the GARCH(1,1) model to obtain the conditional distribution of the log return; 2) estimate the parameters of the copula function, given the marginal distributions and a chosen parametric copula family; 3) calculate the tail dependence based on the fitted copula model. These three steps are described separately below.

3.1 Model Index Returns

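Let $\{S_t\}$ denote the index closing price and $\{R_t\}$ the log return, so that

$$
R_t = \ln\frac{S_t}{S_{t-1}}
\tag{3.1}
$$

When the GARCH(1,1) model is used to fit $\{R_t\}$, it can be expressed as

$$
\begin{cases}
R_t = \mu + \varepsilon_t \\
\varepsilon_t = \sigma_t v_t \\
\sigma_t^2 = \alpha_0 + \alpha_1 \varepsilon_{t-1}^2 + \beta \sigma_{t-1}^2
\end{cases}
\tag{3.2}
$$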

In formula (3.2), $\{v_t\}$ are i.i.d. random variables with zero mean and unit variance. The distribution of $\{v_t\}$ determines the form of the marginal distribution. Commonly used distributions for $\{v_t\}$ are the Gaussian distribution, the t distribution, the skewed t distribution and the GED (generalized error distribution). The latter three can better characterize the fat tails of financial time series. This paper uses the Student's t distribution for $\{v_t\}$, namely $v_t \sim t_{df}$. Therefore, there is one more parameter to be estimated, the degrees of freedom.

The parameters to be estimated for the t-GARCH(1,1) model are $\mu$, $df$, $\alpha_0$, $\alpha_1$ and $\beta$. After estimating these parameters, the conditional distribution of $R_{T+1}$ can be obtained:

$$
\begin{aligned}
F(r) &= P(R_{T+1} \le r \mid I_T) = P(\varepsilon_{T+1} \le r - \mu \mid I_T) = P(\sigma_{T+1} v_{T+1} \le r - \mu \mid I_T) \\
&= P\left(v_{T+1} \le \frac{r - \mu}{\sqrt{\alpha_0 + \alpha_1 \varepsilon_T^2 + \beta \sigma_T^2}}\right) = t_{df}\left(\frac{r - \mu}{\sqrt{\alpha_0 + \alpha_1 \varepsilon_T^2 + \beta \sigma_T^2}}\right)
\end{aligned}
\tag{3.3}
$$

In formula (3.3), $I_T$ is the set of information up to time T.
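A minimal sketch of how (3.3) turns a return series into values in (0, 1) is given below. It follows (3.3) literally and uses the plain Student t distribution function, computed here by Simpson integration so that only the standard library is needed. The function names, the initialisation of the recursion at the long-run variance and the three return values are assumptions for this illustration; the parameter values are the FTSE100 estimates reported in Table 3.

```python
import math


def student_t_cdf(x, df, steps=2000):
    """Student t CDF via Simpson integration of the density (steps must be even)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(z):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(z * z / df))

    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    area = total * h / 3.0  # integral of the density over [0, |x|]
    return 0.5 + area if x > 0 else 0.5 - area


def garch11_pit(returns, mu, alpha0, alpha1, beta, df):
    """Return u_t = t_df((r_t - mu) / sigma_t), with sigma_t^2 from the GARCH(1,1) recursion."""
    # Assumption: the recursion is started at the long-run variance.
    sigma2 = alpha0 / (1.0 - alpha1 - beta)
    u = []
    for r in returns:
        eps = r - mu
        u.append(student_t_cdf(eps / math.sqrt(sigma2), df))
        # Variance for the next period: alpha0 + alpha1 * eps_t^2 + beta * sigma_t^2
        sigma2 = alpha0 + alpha1 * eps ** 2 + beta * sigma2
    return u


# Hypothetical scaled log returns (in percent); parameters are the FTSE100 row of Table 3.
u = garch11_pit([0.0415, -1.5, 2.0], mu=0.0415, alpha0=0.0254,
                alpha1=0.1207, beta=0.8580, df=6.8268)
```

A return equal to the mean maps to 0.5, a large negative return to a value near 0 and a large positive return to a value near 1.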

3.2 Estimate Copula Function

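Given the conditional distributions and the historical observations of the four indexes' log returns, we can compute

$$
\left\{ u_{i,t} = t_{df_i}\left( \frac{r_{i,t} - \mu_i}{\sqrt{\alpha_{i,0} + \alpha_{i,1} \varepsilon_{i,t-1}^2 + \beta_i \sigma_{i,t-1}^2}} \right) \right\}_{t=1}^{T}, \quad i = 1, 2, 3, 4
$$

and estimate bivariate copula functions.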

One point must be noted before estimating the copula function. Since the markets have different non-trading days, the samples do not form a complete observation vector $\boldsymbol{r}_t = (r_{i,t}, r_{j,t})$, $1 \le i \ne j \le 4$, on every day $t = 1, \dots, T$. Because the estimation of a marginal distribution involves only the sample of that index, non-trading days can simply be dropped as holidays. This cannot be done when estimating the copula function, so the data need to be pre-processed. The approach taken in this paper is to keep only the days on which both indexes were traded, that is, to exclude the sample points for which either index has no data.
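The pre-processing step described above amounts to an intersection of trading dates. A minimal sketch follows, in which the function name, the dictionary layout (ISO date string mapped to a value) and all values are hypothetical.

```python
def align_on_common_dates(series_a, series_b):
    """Keep only the dates on which both indexes have an observation.

    Each series is a dict mapping an ISO date string to a value.
    """
    common = sorted(set(series_a) & set(series_b))
    return common, [series_a[d] for d in common], [series_b[d] for d in common]


# Hypothetical values; each series is missing one date that the other one has.
u_ftse = {"2019-01-02": 0.41, "2019-01-03": 0.18, "2019-01-04": 0.77}
u_hs300 = {"2019-01-02": 0.35, "2019-01-04": 0.81, "2019-01-07": 0.52}
dates, ua, ub = align_on_common_dates(u_ftse, u_hs300)
```

Only the two dates shared by both series survive, so the paired vectors `ua` and `ub` have equal length and can be passed to a bivariate copula likelihood.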

In this paper, the maximum likelihood estimation method is used to estimate the

three different parametric copula functions, namely Gaussian copula, T copula and

Clayton copula.

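3.2.1 Estimate Gaussian Copula

The density of the n-dimensional Gaussian copula with correlation matrix $R$ is

$$
c(u_1, u_2, \dots, u_n) = \frac{\dfrac{1}{(2\pi)^{n/2} |R|^{1/2}} \exp\left(-\dfrac{1}{2} \boldsymbol{x}' R^{-1} \boldsymbol{x}\right)}{\prod_{j=1}^{n} \left( \dfrac{1}{\sqrt{2\pi}} \exp\left(-\dfrac{1}{2} x_j^2\right) \right)}
\tag{3.4}
$$

In formula (3.4), $x_j = \Phi^{-1}(u_j)$. Simplifying gives

$$
c(u_1, u_2, \dots, u_n) = \frac{1}{|R|^{1/2}} \exp\left(-\frac{1}{2} \boldsymbol{\varsigma}' (R^{-1} - I) \boldsymbol{\varsigma}\right)
\tag{3.5}
$$

In formula (3.5), $\boldsymbol{\varsigma} = (\Phi^{-1}(u_1), \Phi^{-1}(u_2), \dots, \Phi^{-1}(u_n))'$. The log-likelihood function is then

$$
l(\boldsymbol{\theta}) = -\frac{T}{2} \ln|R| - \frac{1}{2} \sum_{t=1}^{T} \boldsymbol{\varsigma}_t' (R^{-1} - I) \boldsymbol{\varsigma}_t
\tag{3.6}
$$

In the bivariate case the only parameter is the correlation coefficient $\rho$, and the log-likelihood function becomes

$$
l = -\frac{T}{2} \ln(1 - \rho^2) - \frac{1}{2} \sum_{t=1}^{T} \left[ \frac{(\Phi^{-1}(u_{1,t}))^2 + (\Phi^{-1}(u_{2,t}))^2 - 2\rho\, \Phi^{-1}(u_{1,t}) \Phi^{-1}(u_{2,t})}{1 - \rho^2} - (\Phi^{-1}(u_{1,t}))^2 - (\Phi^{-1}(u_{2,t}))^2 \right]
\tag{3.7}
$$

The value of the likelihood function is obtained by substituting $\{u_{1,t}\}$, $\{u_{2,t}\}$ and $\rho$ into the above equation. The parameter $\rho$ is estimated by iterating over values of $\rho$ until the log-likelihood function is maximized.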

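3.2.2 Estimate T Copula

The density of the bivariate t copula is

$$
c(u_1, u_2) = \frac{1}{\sqrt{1 - \rho^2}} \cdot \frac{\Gamma\left(\frac{v+2}{2}\right) \Gamma\left(\frac{v}{2}\right) \left(1 + \dfrac{t_v^{-1}(u_1)^2 + t_v^{-1}(u_2)^2 - 2\rho\, t_v^{-1}(u_1)\, t_v^{-1}(u_2)}{v(1 - \rho^2)}\right)^{-\frac{v+2}{2}}}{\Gamma^2\left(\frac{v+1}{2}\right) \left(1 + \dfrac{t_v^{-1}(u_1)^2}{v}\right)^{-\frac{v+1}{2}} \left(1 + \dfrac{t_v^{-1}(u_2)^2}{v}\right)^{-\frac{v+1}{2}}}
\tag{3.8}
$$

In the above expression, $v$ is the degrees of freedom of the T copula. The log-likelihood function follows as

$$
\begin{aligned}
l(\rho, v) = {} & -\frac{T}{2} \ln(1 - \rho^2) + T \ln \Gamma\left(\frac{v+2}{2}\right) + T \ln \Gamma\left(\frac{v}{2}\right) \\
& - \frac{v+2}{2} \sum_{t=1}^{T} \ln\left(1 + \frac{t_v^{-1}(u_{1,t})^2 + t_v^{-1}(u_{2,t})^2 - 2\rho\, t_v^{-1}(u_{1,t})\, t_v^{-1}(u_{2,t})}{v(1 - \rho^2)}\right) \\
& - 2T \ln \Gamma\left(\frac{v+1}{2}\right) + \frac{v+1}{2} \sum_{t=1}^{T} \ln\left(1 + \frac{t_v^{-1}(u_{1,t})^2}{v}\right) + \frac{v+1}{2} \sum_{t=1}^{T} \ln\left(1 + \frac{t_v^{-1}(u_{2,t})^2}{v}\right)
\end{aligned}
\tag{3.9}
$$

By iterating over the values of $\rho$ and $v$, the parameter estimates are obtained by maximizing the log-likelihood function.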

3.2.3 Estimate Clayton Copula

The Clayton copula is an Archimedean copula. In the bivariate case, its copula density is

$$
c(u_1, u_2) = (1 + \alpha)(u_1 u_2)^{-\alpha - 1} \left(u_1^{-\alpha} + u_2^{-\alpha} - 1\right)^{-2 - \frac{1}{\alpha}}
\tag{3.10}
$$

In formula (3.10), $\alpha$ is the parameter of the Clayton copula.

The log-likelihood function corresponding to the bivariate Clayton copula is

$$
l(\alpha) = T \ln(1 + \alpha) - (\alpha + 1) \sum_{t=1}^{T} \left(\ln u_{1,t} + \ln u_{2,t}\right) - \left(\frac{1}{\alpha} + 2\right) \sum_{t=1}^{T} \ln\left(u_{1,t}^{-\alpha} + u_{2,t}^{-\alpha} - 1\right)
\tag{3.11}
$$

As with the Gaussian and T copulas, the parameter $\alpha$ is estimated by iterating over $\alpha$ and maximizing the log-likelihood function (3.11).
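The maximization of (3.11) can be sketched with a one-dimensional golden-section search, which assumes the log-likelihood has a single maximum on the search interval. The search bounds, the function names and the simulated sample (drawn by inverting the conditional distribution of the Clayton copula) are assumptions for this illustration.

```python
import math
import random


def clayton_loglik(alpha, u1, u2):
    """Bivariate Clayton copula log-likelihood, following (3.11)."""
    T = len(u1)
    sum_logs = sum(math.log(a) + math.log(b) for a, b in zip(u1, u2))
    sum_core = sum(math.log(a ** (-alpha) + b ** (-alpha) - 1.0) for a, b in zip(u1, u2))
    return T * math.log(1 + alpha) - (alpha + 1) * sum_logs - (1 / alpha + 2) * sum_core


def fit_clayton(u1, u2, lo=0.01, hi=20.0, tol=1e-6):
    """Golden-section search for the alpha that maximises the log-likelihood."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - inv_phi * (b - a), a + inv_phi * (b - a)
        if clayton_loglik(c, u1, u2) > clayton_loglik(d, u1, u2):
            b = d  # the maximum lies in [a, d]
        else:
            a = c  # the maximum lies in [c, b]
    return (a + b) / 2


# Simulated Clayton sample with a known parameter, used only to check the estimator.
rng = random.Random(7)
true_alpha = 1.1
u1, u2 = [], []
for _ in range(2000):
    v1 = 1.0 - rng.random()  # uniform on (0, 1]
    w = 1.0 - rng.random()
    # Invert the conditional distribution of u2 given u1 = v1.
    v2 = (v1 ** (-true_alpha) * (w ** (-true_alpha / (1 + true_alpha)) - 1.0) + 1.0) ** (-1.0 / true_alpha)
    u1.append(v1)
    u2.append(v2)

alpha_hat = fit_clayton(u1, u2)
lambda_lower = 2.0 ** (-1.0 / alpha_hat)  # implied lower tail dependence coefficient
```

The search re-evaluates both interior points in every iteration, which is wasteful but keeps the sketch short.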

3.3 Tail Dependence

The tail dependence coefficient is calculated from the closed-form formula of each copula, since the tail dependence coefficient defined by the copula function depends only on the form and parameters of the copula function itself. The following table shows the formulas for the tail dependence coefficients of the Gaussian copula, T copula and Clayton copula, together with the Gumbel and Frank copulas.

Table 1: Tail dependence coefficient formula

| Copula Function | $\lambda_{up}$ | $\lambda_{low}$ |
|---|---|---|
| Gaussian Copula | 0 | 0 |
| T Copula | $2t_{v+1}\left(-\sqrt{v+1}\sqrt{\dfrac{1-\rho}{1+\rho}}\right)$ | $2t_{v+1}\left(-\sqrt{v+1}\sqrt{\dfrac{1-\rho}{1+\rho}}\right)$ |
| Clayton Copula | 0 | $2^{-\frac{1}{\alpha}}$ |
| Gumbel Copula | $2 - 2^{\frac{1}{\alpha}}$ | 0 |
| Frank Copula | 0 | 0 |

The Gaussian copula has no tail dependence, namely its upper and lower tail dependence coefficients are equal to zero. The T copula has upper and lower tail dependence, and the two coefficients are equal. The Clayton copula has only lower tail dependence.
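The T copula and Clayton copula formulas in Table 1 can be evaluated as in the sketch below. The inputs are the FTSE100 vs. S&P500 estimates reported in Table 4 (Section 4.2.2). The Student t distribution function is computed by Simpson integration so that only the standard library is needed; this numerical shortcut and the function names are assumptions for this illustration.

```python
import math


def student_t_cdf(x, df, steps=2000):
    """Student t CDF via Simpson integration of the density (steps must be even)."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(z):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(z * z / df))

    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps
    total = pdf(0.0) + pdf(a)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    area = total * h / 3.0
    return 0.5 + area if x > 0 else 0.5 - area


def t_copula_tail_dependence(rho, v):
    """Upper (= lower) tail dependence coefficient of the T copula, as in Table 1."""
    arg = -math.sqrt(v + 1) * math.sqrt((1 - rho) / (1 + rho))
    return 2 * student_t_cdf(arg, v + 1)


def clayton_lower_tail_dependence(alpha):
    """Lower tail dependence coefficient of the Clayton copula, as in Table 1."""
    return 2 ** (-1 / alpha)


# FTSE100 vs. S&P500 parameter estimates, taken from Table 4.
lam_t = t_copula_tail_dependence(rho=0.666, v=5.406)
lam_clayton = clayton_lower_tail_dependence(alpha=1.112)
```

Up to rounding and the accuracy of the numerical integration, both values should be close to the FTSE100 vs. S&P500 row of Table 5.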

4 Empirical Calibration and Results

In this section, the daily closing prices of four stock market indexes, FTSE100, S&P500, HS300 and Nikkei225, are modelled. To make the fitting process numerically smoother, the log return is scaled by 100, i.e. $\{100 \times R_t\}$ is the series to be modelled. The sample period is from January 1, 2009 to July 31, 2019.

4.1 The Data

Financial time series have some characteristics of their own. The following descriptive statistics show that the log returns of all indexes are negatively skewed. The log return of FTSE100 has a kurtosis of less than 3 (platykurtic), while the log returns of the other indexes are leptokurtic, that is, their kurtosis is greater than 3. This indicates that financial time series often have fat tails.

Table 2: Descriptive statistics

| Indexes | Mean | Median | Std | Skewness | Kurtosis |
|---|---|---|---|---|---|
| FTSE100 | 0.000189 | 0.000341 | 0.010012 | -0.159446 | 2.839389 |
| S&P500 | 0.000436 | 0.000663 | 0.010293 | -0.329508 | 5.184206 |
| Nikkei225 | 0.000324 | 0.000513 | 0.013499 | -0.477822 | 4.778648 |
| HS300 | 0.000276 | 0.000552 | 0.015492 | -0.623943 | 4.107534 |

Some basic features of the financial time series can also be found through time

series plots. The following figure shows the time series plot of {𝑅𝑡 } of FTSE100.

Figure 1: The left is the time series plot of {𝑅𝑡 } of FTSE100. The right is the time series plot of {𝑅𝑡2} of FTSE100.

The time series plot at the top left shows volatility clustering in the log return series of FTSE100: the volatility is clearly not constant but changes over time. The ACF and PACF in panel (a) suggest that the log return is not autocorrelated. The QQ plot at the bottom left shows that the distribution of the log return has fat tails. The ACF and PACF on the right show significant autocorrelation in $\{R_t^2\}$. The log returns of the S&P500, HS300 and Nikkei225 have similar characteristics.

Figure 2: The left is the time series plot of $\{R_t\}$ of S&P500. The right is the time series plot of $\{R_t^2\}$ of S&P500. Volatility clustering and fat tails are present in $\{R_t\}$, and $\{R_t^2\}$ is autocorrelated.

Figure 3: The left is the time series plot of $\{R_t\}$ of Nikkei225. The right is the time series plot of $\{R_t^2\}$ of Nikkei225. Volatility clustering and fat tails are present in $\{R_t\}$, and $\{R_t^2\}$ is autocorrelated.

Figure 4: The left is the time series plot of $\{R_t\}$ of HS300. The right is the time series plot of $\{R_t^2\}$ of HS300. Volatility clustering and fat tails are present in $\{R_t\}$, and $\{R_t^2\}$ is autocorrelated.

According to the above analysis, it can be found that there are fat tail and volatility

clustering in the log return series. Therefore, it is reasonable to use the GARCH model

with the T-distributed innovation to fit the marginal distributions.

4.2 Results

4.2.1 Marginal Distributions

The GARCH(1,1) model was established for FTSE100, S&P500, Nikkei225 and HS300 respectively. The estimated parameters are as follows.

Table 3: Estimated parameters for GARCH models

| Parameters/Indexes | FTSE100 | S&P500 | Nikkei225 | HS300 |
|---|---|---|---|---|
| $\mu$ | 0.0415*** (t=2.842) | 0.0838*** (t=6.902) | 0.0818*** (t=4.117) | 0.0507** (t=2.415) |
| $\alpha_0$ | 0.0254*** (t=3.084) | 0.0169*** (t=3.364) | 0.0465*** (t=3.133) | 9.3504e-03** (t=2.261) |
| $\alpha_1$ | 0.1207*** (t=5.389) | 0.1407*** (t=6.422) | 0.1162*** (t=5.117) | 0.0550*** (t=6.017) |
| $\beta$ | 0.8580*** (t=33.318) | 0.8540*** (t=42.385) | 0.8635*** (t=35.189) | 0.9443*** (t=112.036) |
| $v$ | 6.8268*** (t=8.025) | 4.9861*** (t=9.763) | 5.9979*** (t=8.189) | 4.7577*** (t=10.532) |

Note: The t statistics are in parentheses. ‘*’ means significant at the 10% significance level, ‘**’ means significant at the 5% significance level and ‘***’ means significant at the 1% significance level.

With the parameter estimates in the above table, the marginal distributions of the four log returns of FTSE100, S&P500, Nikkei225 and HS300 can be obtained by (3.3).

The estimates show that the $\beta$ value of each GARCH(1,1) model exceeds 0.85, which indicates strong serial dependence in the conditional variance of all four log returns. In addition, $\alpha_1 + \beta$ of each model is close to 1. $\alpha_1 + \beta$ is called the persistence, as it determines the speed at which shocks to the variance revert to their long-run value. The persistence of these models is therefore very strong: if the variance is increased by a shock, it takes a long time to return to its long-run average level.
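The persistence can be computed directly from the estimates in Table 3. The sketch below also reports a half-life, i.e. the number of days h solving $(\alpha_1 + \beta)^h = 0.5$; this derived quantity is added here only as an illustration and is not one of the reported results.

```python
import math

# (alpha_1, beta) estimates from Table 3.
estimates = {
    "FTSE100": (0.1207, 0.8580),
    "S&P500": (0.1407, 0.8540),
    "Nikkei225": (0.1162, 0.8635),
    "HS300": (0.0550, 0.9443),
}

results = {}
for name, (alpha1, beta) in estimates.items():
    persistence = alpha1 + beta
    # Days needed for a variance shock to decay by half: persistence ** h = 0.5
    half_life = math.log(0.5) / math.log(persistence)
    results[name] = (persistence, half_life)

for name, (persistence, half_life) in results.items():
    print(f"{name}: persistence = {persistence:.4f}, half-life = {half_life:.1f} days")
```

The closer the persistence is to 1, the longer the half-life, which is the sense in which a variance shock takes a long time to die out.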

4.2.2 Estimated Copula Functions

The parameters of the copula functions estimated with the MLE method are shown in the following table.

Table 4: Estimated parameters of copula functions

| Index pair | Parameter | Gaussian Copula | T Copula | Clayton Copula |
|---|---|---|---|---|
| FTSE100 vs. S&P500 | ρ | 0.670 | 0.666 | / |
| | df | / | 5.406 | / |
| | α | / | / | 1.112 |
| FTSE100 vs. Nikkei225 | ρ | 0.374 | 0.358 | / |
| | df | / | 12.043 | / |
| | α | / | / | 0.450 |
| FTSE100 vs. HS300 | ρ | 0.308 | 0.273 | / |
| | df | / | 6.506 | / |
| | α | / | / | 0.349 |
| S&P500 vs. Nikkei225 | ρ | 0.332 | 0.286 | / |
| | df | / | 5.151 | / |
| | α | / | / | 0.352 |
| S&P500 vs. HS300 | ρ | 0.257 | 0.212 | / |
| | df | / | 5.804 | / |
| | α | / | / | 0.250 |
| Nikkei225 vs. HS300 | ρ | 0.422 | 0.411 | / |
| | df | / | 18.984 | / |
| | α | / | / | 0.512 |

From the estimation results of the copula functions, it can be seen that all estimated parameters lie within their admissible ranges: each ρ is between -1 and 1, and each df and α is positive.

For Gaussian copula and T copula, the greater the ρ, the greater the correlation

between the two log return sequences. For Clayton copula, the larger the α, the greater

the correlation between the two log return sequences. From the parameter table, it can

be found that the parameters of Gaussian copula, T copula and Clayton copula show

consistency, that is, the correlation ranking is consistent in each copula model.
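The consistency of the ranking can be checked directly from Table 4. The sketch below is an illustration rather than part of the original analysis. It converts each parameter to Kendall's tau using the standard closed forms (tau = (2/π) arcsin ρ for the Gaussian and T copulas, tau = α / (α + 2) for the Clayton copula) so that the three models are on a common scale, then compares the ordering of the six pairs. The variable and function names are assumptions made for the example.

```python
import math

# Parameters copied from Table 4.
copula_params = {
    "FTSE100 vs. S&P500": {"gaussian_rho": 0.670, "t_rho": 0.666, "clayton_alpha": 1.112},
    "FTSE100 vs. Nikkei225": {"gaussian_rho": 0.374, "t_rho": 0.358, "clayton_alpha": 0.450},
    "FTSE100 vs. HS300": {"gaussian_rho": 0.308, "t_rho": 0.273, "clayton_alpha": 0.349},
    "S&P500 vs. Nikkei225": {"gaussian_rho": 0.332, "t_rho": 0.286, "clayton_alpha": 0.352},
    "S&P500 vs. HS300": {"gaussian_rho": 0.257, "t_rho": 0.212, "clayton_alpha": 0.250},
    "Nikkei225 vs. HS300": {"gaussian_rho": 0.422, "t_rho": 0.411, "clayton_alpha": 0.512},
}


def tau_elliptical(rho):
    """Kendall's tau implied by the correlation of a Gaussian or T copula."""
    return 2.0 / math.pi * math.asin(rho)


def tau_clayton(alpha):
    """Kendall's tau implied by the Clayton copula parameter."""
    return alpha / (alpha + 2.0)


taus = {
    pair: {
        "gaussian": tau_elliptical(p["gaussian_rho"]),
        "t": tau_elliptical(p["t_rho"]),
        "clayton": tau_clayton(p["clayton_alpha"]),
    }
    for pair, p in copula_params.items()
}


def ranking(model):
    """Pairs ordered from strongest to weakest dependence under one copula."""
    return sorted(taus, key=lambda pair: taus[pair][model], reverse=True)


rankings = {model: ranking(model) for model in ("gaussian", "t", "clayton")}
for model, order in rankings.items():
    print(model, "->", ", ".join(order))
```

Because each conversion is monotone in its parameter, ranking by Kendall's tau gives the same order as ranking by ρ or α; the conversion only makes the magnitudes comparable across copula families.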

4.2.3 Tail Dependence

The tail dependence between the log returns can be visually observed first through the three-dimensional histogram.

Figure 5: 3D histograms

From the above three-dimensional histograms, it can be seen that the bar in the lower-left corner of each graph is higher; that is, the two log returns take small values at the same time with relatively high frequency, which indicates lower tail dependence between the two log returns.

The tail dependence coefficient can quantify the magnitude of the tail dependence.

After estimating the parameters of the copula function, the tail dependence coefficient

can be obtained according to the formula of the tail dependence coefficient. The table

below summarizes the tail dependence coefficient.

Table 5: Tail dependence coefficients

Pair                     T Copula              Clayton Copula
FTSE100 vs. S&P500       𝜆𝑈 = 𝜆𝐿 = 0.297       𝜆𝑈 = 0, 𝜆𝐿 = 0.536
FTSE100 vs. Nikkei225    𝜆𝑈 = 𝜆𝐿 = 0.027       𝜆𝑈 = 0, 𝜆𝐿 = 0.214
FTSE100 vs. HS300        𝜆𝑈 = 𝜆𝐿 = 0.074       𝜆𝑈 = 0, 𝜆𝐿 = 0.138
S&P500 vs. Nikkei225     𝜆𝑈 = 𝜆𝐿 = 0.113       𝜆𝑈 = 0, 𝜆𝐿 = 0.140
S&P500 vs. HS300         𝜆𝑈 = 𝜆𝐿 = 0.075       𝜆𝑈 = 0, 𝜆𝐿 = 0.063
Nikkei225 vs. HS300      𝜆𝑈 = 𝜆𝐿 = 0.009       𝜆𝑈 = 0, 𝜆𝐿 = 0.258
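The coefficients in Table 5 can be recomputed from the Table 4 parameters. The sketch below assumes the standard closed forms, since the formula used in this paper is not restated in this section: for the T copula, 𝜆𝑈 = 𝜆𝐿 = 2 t(−√((df + 1)(1 − ρ)/(1 + ρ)); df + 1), where t(·; ν) is the Student t distribution function with ν degrees of freedom, and for the Clayton copula 𝜆𝐿 = 2^(−1/α). The Student t distribution function is evaluated with a simple Simpson's rule so that only the standard library is needed; all function names are illustrative.

```python
import math


def student_t_cdf(x, nu, steps=2000):
    """Student t CDF with nu degrees of freedom, by Simpson's rule on [0, |x|].

    `steps` must be even. This is a numerical approximation for illustration.
    """
    a = abs(x)
    if a == 0.0:
        return 0.5
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))

    def pdf(u):
        return math.exp(log_c - (nu + 1) / 2 * math.log1p(u * u / nu))

    h = a / steps
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # integral of the density from 0 to |x|
    return 0.5 + area if x > 0 else 0.5 - area


def t_copula_tail_dependence(rho, df):
    """Symmetric tail dependence coefficient of a T copula."""
    arg = -math.sqrt((df + 1) * (1 - rho) / (1 + rho))
    return 2 * student_t_cdf(arg, df + 1)


def clayton_lower_tail_dependence(alpha):
    """Lower tail dependence coefficient of a Clayton copula (alpha > 0)."""
    return 2 ** (-1 / alpha)


# (T copula rho, T copula df, Clayton alpha) copied from Table 4.
table4 = {
    "FTSE100 vs. S&P500": (0.666, 5.406, 1.112),
    "FTSE100 vs. Nikkei225": (0.358, 12.043, 0.450),
    "FTSE100 vs. HS300": (0.273, 6.506, 0.349),
    "S&P500 vs. Nikkei225": (0.286, 5.151, 0.352),
    "S&P500 vs. HS300": (0.212, 5.804, 0.250),
    "Nikkei225 vs. HS300": (0.411, 18.984, 0.512),
}

coefficients = {
    pair: (t_copula_tail_dependence(rho, df), clayton_lower_tail_dependence(alpha))
    for pair, (rho, df, alpha) in table4.items()
}

for pair, (lam_t, lam_clayton) in coefficients.items():
    print(f"{pair:22s} T: {lam_t:.3f}  Clayton lower: {lam_clayton:.3f}")
```

Small differences from Table 5 are to be expected, because the inputs here are the rounded parameter values printed in Table 4.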

Since the Clayton copula is better suited to capturing lower tail dependence, we focus on its lower tail dependence coefficient. The coefficients show lower tail dependence between the log returns of every pair of indexes, ranging from 0.063 (S&P500 vs. HS300) to 0.536 (FTSE100 vs. S&P500). This means that when one index falls sharply, the probability that the other index also falls sharply is elevated.

The lower tail dependence coefficient between FTSE100 and S&P500 is the largest, which suggests that the UK and US stock markets have the highest level of financial contagion. The pairs involving HS300 mostly have low tail dependence coefficients compared with the other pairs. This may be explained by China's capital controls. Capital controls restrict the free movement of funds, which weakens the linkage between stock markets.

5 Conclusion

This paper uses the GARCH-copula method to establish bivariate joint distribution models between stock index log returns. First, the log returns of FTSE100, S&P500, Nikkei225 and HS300 are each fitted by a GARCH(1,1) model. The GARCH model accommodates the volatility clustering and fat tails in the log return series. From the parameters of the GARCH models, we find that the 𝛽 value of each model is relatively large (larger than 0.85), and the persistence of each model is close to 1, which means that the variance of each log return series takes a long time to return to its long-run value after a shock.

Secondly, the Gaussian copula, T copula and Clayton copula are estimated. Finally, the tail dependence between each pair of index log returns is calculated, and it is found that there is lower tail dependence between the indices, which is strong for some pairs. The analysis results show that the copula functions of FTSE100 and S&P500 have the largest correlation parameters, which indicates a strong correlation. At the same time, the tail dependence coefficient of this pair is also relatively large. In contrast, the correlation between HS300 and the FTSE100 and S&P500 is much weaker. This shows that the linkage between the UK and US stock markets is strong, which is inseparable from a financial system in which capital flows freely. China’s control over capital flows has always been strict, which weakens the linkage between its stock market and developed international stock markets. However, on September 10, 2019, China's State Administration of Foreign Exchange announced the cancellation of the investment quota limits for QFII (Qualified Foreign Institutional Investor) and RQFII (RMB Qualified Foreign Institutional Investor). This move can be expected to strengthen the linkage between the Chinese stock market and international stock markets in the future. Human behaviour is guided and constrained by institutions. Therefore, when diversifying investments or pricing financial products, we should take into account institutional changes, which alter behaviour and in turn change market correlations.

Reference [1] Aloui, R., Aïssa, M.S.B. and Nguyen, D.K., 2011. Global financial crisis, extreme

interdependences, and contagion effects: The role of economic structure?. Journal of

Banking & Finance, 35(1), pp.130-141.

[2] Ane, T. and Kharoubi, C., 2003. Dependence structure and risk measure. The

journal of business, 76(3), pp.411-438.

[3] Baig, T. and Goldfajn, I., 1999. Financial market contagion in the Asian crisis. IMF

staff papers, 46(2), pp.167-195.

[4] Bertero, E., & Mayer, C., 1990. Structure and performance: Global

interdependence of stock markets around the crash of October 1987. European

Economic Review, 34(6), pp.1155-1180.

[5] Bollerslev, T., 1986. Generalized autoregressive conditional

heteroskedasticity. Journal of econometrics, 31(3), pp.307-327.

[6] Bollerslev, T., 1987. A conditionally heteroskedastic time series model for

speculative prices and rates of return. Review of economics and statistics, 69(3),

pp.542-547.

[7] Calvo, S., 1999. Capital flows to Latin America: is there evidence of contagion

effects?. The World Bank.

[8] Cherubini, U., Luciano, E. and Vecchiato, W., 2004. Copula methods in finance.

John Wiley & Sons.

[9]Engle, R.F., 1982. Autoregressive conditional heteroscedasticity with estimates of

the variance of United Kingdom inflation. Econometrica: Journal of the Econometric

Society, pp.987-1007.

[10] Ferenstein, E. and Gasowski, M., 2004. Modelling stock returns with AR-

GARCH processes. SORT-Statistics and Operations Research Transactions, 28(1),

pp.55-68.

[11] Frahm, G., Junker, M. and Schmidt, R., 2005. Estimating the tail-dependence

coefficient: properties and pitfalls. Insurance: mathematics and Economics, 37(1),

pp.80-100.

[12] Glosten, L.R., Jagannathan, R. and Runkle, D.E., 1993. On the relation between

the expected value and the volatility of the nominal excess return on stocks. The

journal of finance, 48(5), pp.1779-1801.

[13] Joe, H. and Xu, J.J., 1996. The estimation method of inference functions for

margins for multivariate models.

[14] Joe, H., 1997. Multivariate models and multivariate dependence concepts. CRC

Press.

[15] Juri, A. and Wüthrich, M.V., 2002. Copula convergence theorems for tail

events. Insurance: Mathematics and Economics, 30(3), pp.405-420.

[16] King, M. A., & Wadhwani, S., 1990. Transmission of volatility between stock

markets. The Review of Financial Studies, 3(1), pp.5-33.

[17] Li, D.X., 2000. On default correlation: A copula function approach. The Journal

of Fixed Income, 9(4), pp.43-54.

[18] Nelson, D.B., 1991. Conditional heteroskedasticity in asset returns: A new

approach. Econometrica: Journal of the Econometric Society, pp.347-370.

[19] Rodriguez, J.C., 2007. Measuring financial contagion: A copula approach. Journal

of empirical finance, 14(3), pp.401-423.

[20] Patton, A. J., 2004. On the out-of-sample importance of skewness and asymmetric

dependence for asset allocation. Journal of Financial Econometrics, 2(1), pp.130-168.

[21] Patton, A.J., 2006. Modelling asymmetric exchange rate dependence.

International economic review, 47(2), pp.527-556.

[22] Salmon, F., 2009. A formula for disaster. Wired, March, pp.74-79.

[23] Sklar, M., 1959. Fonctions de repartition an dimensions et leurs marges. Publ.

inst. statist. univ. Paris, 8, pp.229-231.

ONS Escaping Poor Performance Dissertation(1).pdf

Executive Summary

As identified by Her Majesty’s Chief Inspector of Education, Children’s Services and Skills,

there is a group of around 450 state-funded schools in England that have had poor inspection

results in every inspection they have had since 2005. These ‘stuck’ schools are receiving

increased levels of attention, since a poor inspection result is meant to instigate an improvement

in school performance.

This project aimed to investigate the data around stuck schools, to see if it was possible to

generate a machine learning model that could accurately predict whether a poorly performing

school would remain ‘stuck’ or would ‘escape’ the cycle of poor performance by itself.

To do this, a dataset needed to be created. This was done using input variables selected from a

number of publicly available sources. Many were provided by the Department for Education,

whilst details of every school inspection since 2005 were provided by Ofsted. The resulting

dataset had a row for each of the 21,900 open, state-funded schools in England and around 80

columns.

The definition of ‘stuck’ was updated slightly for this project, in consultation with the

Department for Education. The schools that met these new criteria were identified, along with

a different subset – those that had been performing poorly but had recently had one or more

good inspections and ‘escaped’ stuck. Schools that fitted in neither category were not used

further. The created dataset now had binary labels – ‘stuck’ and ‘escaped’ – from which a

binary classifier could be built and tested.
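The labelling idea described above can be sketched as follows. This is only an illustration: the project's updated definition of ‘stuck’ is not spelled out in this summary, so the rules in the code are assumptions. Ratings are taken to be the four Overall Effectiveness categories (1 to 4, where 1 and 2 count as ‘Good or better’), listed oldest first, and the function name `label_school` is hypothetical.

```python
GOOD_OR_BETTER = {1, 2}  # Outstanding, Good
LESS_THAN_GOOD = {3, 4}  # Requires improvement, Inadequate


def label_school(ratings):
    """Return 'stuck', 'escaped' or None for a school's inspection history.

    `ratings` lists the Overall Effectiveness grades since 2005, oldest first.
    Assumed rules (not the project's exact criteria):
      - 'stuck':   every inspection was less than Good;
      - 'escaped': a run of less-than-Good inspections followed by one or
                   more Good-or-better inspections up to the present;
      - None:      anything else, so the school is not used further.
    """
    if not ratings:
        return None
    if all(r in LESS_THAN_GOOD for r in ratings):
        return "stuck"

    # Count the most recent consecutive Good-or-better inspections.
    trailing_good = 0
    for r in reversed(ratings):
        if r in GOOD_OR_BETTER:
            trailing_good += 1
        else:
            break

    earlier = ratings[:len(ratings) - trailing_good]
    if trailing_good >= 1 and earlier and all(r in LESS_THAN_GOOD for r in earlier):
        return "escaped"
    return None


examples = [[4, 3, 3], [3, 4, 2], [3, 3, 2, 1], [2, 3, 2], [1, 2]]
for history in examples:
    print(history, "->", label_school(history))
```

A real implementation would also need the thresholds agreed with the Department for Education, such as the minimum number of inspections required, which this sketch leaves out.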

Six different binary classifiers were built in this project, using the Scikit-learn package in

Python: Random Forests, Support Vector Machines, Neural Networks, Gaussian Naïve Bayes,

Logistic Regression and K-Nearest Neighbours. For each model type, the optimum

combination of input features and model hyperparameters was found using an iterative process

involving random grid searches for model hyperparameters and sequential feature selection.

The best models were found to predict the future class of a school with 75% accuracy and area

under the ROC curve values of 75%. If a higher confidence in the prediction of a stuck school

is required, a Support Vector Machine model was correct in 88% of its predictions of stuck

schools (precision), although it only identified 40% of all stuck schools (recall). Logistic

Regression and Gaussian Naïve Bayes generally provided inferior results to the other four

model types, of which K-Nearest Neighbours was found to perform the best overall.
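For readers less familiar with the two measures quoted above, the following short sketch shows how precision and recall for the stuck class are computed from prediction counts. The counts are hypothetical, chosen only so that the result matches the quoted 88% precision and 40% recall; they are not taken from the project's confusion matrix.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision and recall for the positive ('stuck') class."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Hypothetical counts (not project data): 50 schools predicted stuck, of which
# 44 really are stuck, out of 110 truly stuck schools in total.
tp, fp, fn = 44, 6, 66
precision, recall = precision_recall(tp, fp, fn)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A model with this profile rarely raises a false alarm but misses most stuck schools, which is the trade-off described for the Support Vector Machine.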

Accuracy values of 75% are useful, and show that the models can predict future inspection

results, but they are unlikely to be high enough to be of direct use to Ofsted and the Department

for Education. Given the selection of well-known machine learning techniques used and the

large range of input features and hyperparameters tested, it is considered unlikely that

significant improvement upon these scores can be achieved without a step change in the quality

of the input data or the use of more advanced machine learning techniques that are beyond the

scope of this project.

An analysis of the most important features used by the models in making the classification was

also carried out, showing that different groups of features were being used by the different

models. Features concerning school financial balance were shown to be of high importance to

the Random Forest model in an assessment of its feature importance values.

This project was carried out with input from the two primary stakeholders: Ofsted and the

Department for Education. The results have been presented to them in person, with an

explanation of how they were achieved. They are now in a position to decide the future

direction of this work.

Acknowledgements

My thanks in this project go to the EMU team at ONS who have been a pleasure to work with this summer – particularly to Joe for his help in all things from laptop setup to showing me how to get from the bike sheds to the showers and for his interest, support and guidance throughout; to Tim for giving me great flexibility, allowing me to concentrate solely on my project and being on hand if I needed anything; to Rebecca for keeping on finding errors in my list of stuck schools.

David Marshall, my supervisor in Cardiff University, has also been extremely helpful in this project, giving me useful practical advice on carrying out the work and for his assistance with writing this dissertation. I appreciate your willingness to meet on Skype on your day off.

I would also like to thank George and Louise at Ofsted and Pennie and Pippa at the Department for Education for their input during the project and their helpful comments and feedback in the presentation.

Contents

1. Introduction
1.1. School Inspections
1.2. Stuck Schools
1.3. Ofsted and the Department for Education
1.4. Tools Used
1.5. Project Aims
1.6. Process Plan
2. Literature Review
2.1. Previous Analysis of Schools Data
2.2. Machine Learning Models
2.2.1. Logistic Regression
2.2.2. Random Forests
2.2.3. Support Vector Machines
2.2.4. Neural Networks
2.2.5. Naïve Gaussian Bayes Networks
2.2.6. K-Nearest Neighbours
2.3. Methods for Refining the Models
2.3.1. Iterative imputation
2.3.2. Sequential Forward and Backward Selection
2.3.3. Recursive Feature Elimination
2.3.4. Oversampling
2.4. Measuring Model Effectiveness
2.4.1. Cross validation
2.4.2. Accuracy
2.4.3. Precision and recall
2.4.4. ROC curves
3. Input Data and Pre-Processing
3.1. Datasets Available
3.1.1. Ofsted data
3.1.2. DfE data
3.1.3. Get Information About Schools data
3.1.4. List of Academies
3.2. Data Stitching
3.3. Data Cleaning
3.3.1. Missing data
3.3.2. Data types
3.3.3. Normalising
3.3.4. Imputation
4. Machine Learning Implementation
4.1. Feature Generation
4.1.1. Categorical data
4.1.2. Changes over time
4.1.3. Performance data
4.2. Initial Feature Selection
4.3. Stuck School Labelling
4.3.1. Options for labels
4.3.2. Updated definition of Stuck schools
4.3.3. Data selection and labelling
4.3.4. Labelling method
4.4. Data Summary
4.5. Principal Component Analysis
5. Results
5.1. Assessment Methods Used
5.2. Computing Set up for Modelling
5.3. Experiment 1: Finding the optimum machine learning classification model
5.3.1. Experiment 1.1: Finding the best model type
5.3.2. Experiment 1.2: Finding the optimum model hyperparameters
5.4. Experiment 2: Finding the optimum features to input to the model
5.4.1. Experiment 2.1: Investigating whether Sequential Forward Selection or Sequential Backward Selection give better results
5.4.2. Experiment 2.2: Investigating whether Recursive Feature Elimination improves the set of features selected
5.4.3. Key features selected in most effective model
5.5. Experiment 3: Finding the optimum operations on the input data
5.5.1. Experiment 3.1: Determining whether oversampling improves model performance
5.6. Attributes of Most Effective Model
6. Discussion
7. Conclusions
8. References
9. Appendices
9.1. Appendix 1: Input Variables Used
9.1.1. School Financial Balance
9.1.2. School performance data
9.1.3. Pupil population and absence data
9.1.4. Spine
9.1.5. Workforce data
9.1.6. School finances data
9.1.7. Generated variables
9.2. Appendix 2: Variables selected for models
9.2.1. Neural Network
9.2.2. Support Vector Machine
9.2.3. Random Forest
9.2.4. Gaussian Naïve Bayes
9.2.5. Logistic Regression
9.2.6. K-Nearest Neighbours

List of Figures

Figure 1: A multi layer perceptron with 1 hidden layer containing k units (Scikit-learn, no date b)
Figure 2: Confusion Matrix
Figure 3: Frequency of each overall inspection history
Figure 4: Histograms showing distribution between 'Stuck' schools, 'Escaped' schools and schools that fall into neither category
Figure 5: Principal Component Analysis - Cumulative explained variance versus number of principal components used
Figure 6: The results of Principal Component Analysis - the two classes and the ‘other’ remaining schools that fall in neither class are plotted on Principal Component 1 vs Principal Component 2 axes
Figure 7: Process for selecting optimum model for each model type
Figure 8: Performance of the best version of each model type. Note that different measures of the same six models are plotted, sorted in order of decreasing area under the ROC curve score.
Figure 9: ROC curve for model with highest area under the ROC curve value for each model type. The lines plotted are the mean scores over the 5 folds of cross validation.
Figure 10: ROC curves for models in Table 3. Each of the five cross validation folds is plotted, along with a mean and the range of one standard deviation
Figure 11: Precision for Stuck class - The best model from each model type in terms of precision for identifying stuck schools. The same six models are plotted in each chart, sorted in order of precision score.
Figure 12: Variation of AUC and Accuracy with different hyperparameters for Support Vector Machines using features listed in Section 9.2
Figure 13: Variation of AUC and Accuracy with different hyperparameters for Neural Networks using features listed in Section 9.2
Figure 14: Variation of AUC and Accuracy with different hyperparameters for Random Forests using features listed in Section 9.2
Figure 15: Variation of AUC and Accuracy with different hyperparameters for K-Nearest Neighbours using features listed in Section 9.2
Figure 16: Sequential Forward/Backward Selection results: Model accuracy for different numbers of features
Figure 17: Recursive Feature Elimination - For each of the six different models, for each level of RFE, the highest score of Area Under the ROC curve and Accuracy are plotted
Figure 18: Oversampling - For each model type, the run with the highest accuracy and area under the ROC curve are shown for the group where oversampling was used on the training data and the group which did not use oversampling.
Figure 19: ROC curves for the selected models, with randomness disabled in the cross validation process. Fold 1 is therefore the first 20% of the data points, Fold 2 is 20-40% etc.

List of Tables

Table 1: Numbers of each category of school
Table 2: Minimum values of each metric to qualify.
Table 3: Characteristics of the best model for each model type, measured by area under the ROC curve and subject to the minimum constraints of Table 2
Table 4: Difference in training and test set accuracy when the kernel is changed. Each run is otherwise identical, using the parameters shown in Table 3
Table 5: Features that appear in more than one of the selected 6 models
Table 6: Importance of each feature to the selected Random Forest model
Table 7: Importance of the top 30 features when the parameters selected of the Random Forest model are applied to all features
Table 8: Confusion matrix for best model
Table 9: Properties of best model

1. Introduction

This project is based on the inspection results of all state-funded schools in England and, in

particular, a small subset of these schools that are considered to be ‘stuck’ in a cycle of poor

performance. The aim of this project is to use machine learning to investigate whether it is

possible to predict the future performance of a school that has been repeatedly receiving poor

inspection results. It has been carried out at the Office for National Statistics in Newport, South

Wales.

1.1. School Inspections

School inspections are one of the primary measures by which schools are judged. State-funded

schools in England can be inspected at any time, with a minimum of 15 minutes’ notice given

that an inspector is due to arrive at the premises (Office for Standards in Education, 2018). The

inspector is then required to be granted immediate access to everything they deem necessary

to investigate how the school is performing in a range of areas, such as how the school is being

managed and the effectiveness of the teaching in the classroom.

There are different types of inspection, such as ‘full’ Section 5 inspections that last for two

school days and ‘short’ Section 8 inspections that can last for one day. Section 5 ‘full’

inspections are what is investigated in this project and will be referred to simply as an

inspection from this point onwards.

An inspection will result in the school receiving ratings and feedback on a number of criteria.

The headline figure, which is what is considered in this project, is the ‘Overall Effectiveness’

rating. There are four available ratings:

- Category 1: Outstanding

- Category 2: Good

- Category 3: Requires improvement

- Category 4: Inadequate (subcategories: Serious Weakness and Special Measures)

Whilst it is clear that all schools would want to be ‘Outstanding’, the distinction applied in this

work is whether a school is ‘Good or better’ (Category 1 or 2), or ‘less than Good’ (Category

3 or 4).

The frequency with which a school is inspected can vary greatly. An average school could be

inspected once every four years whilst a poorly performing school may be inspected far more

frequently. In 2012, the government announced that it was stopping routine inspections of

schools that had received an Outstanding rating (Fowler, 2012), although their data would still be

monitored and an inspection carried out if deemed necessary. At present, this exemption remains in

place although it has been announced that routine inspections of Outstanding schools will be

reinstated (Department for Education, 2019).

There are different types of state-funded schools in England, such as grammar schools and

comprehensives. Recent government policy has led to the creation of academies. These are

schools where funding is received directly from the government, as opposed to non-academy

schools which receive funding from their Local Authority, who are in turn funded by the

government. The creation of these academies has either been voluntary or forced – any school

rated as Inadequate is legally obliged to become an academy.

From September 2019 onwards, the Education Inspection Framework (Ofsted, 2019) will be

in place, which will change how Ofsted carries out inspections.

1.2. Stuck Schools

‘Stuck’ is a phrase used to describe a school with a repeating pattern of poor inspections. As

used by Ofsted (Spielman, 2018), the criteria for a school to be labelled as stuck are:

- Having had at least four inspections since 2005.

- Every inspection rated ‘less than Good’ (Category 3 or 4).

When a school closes and reopens under a new name, whether to become an academy or not,

the closed school is said to be a predecessor of the opened school. The stuck school criteria

consider all inspections of predecessor schools, so a school can be labelled stuck if it has never

been inspected but its predecessor(s) have and meet the criteria.
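The two criteria above can be sketched as a short labelling function. This is a minimal illustration rather than the code used in the project: the function name, the school names and the inspection histories are hypothetical, and each history is assumed to already include the inspections of any predecessor schools.

```python
def is_stuck(ratings, min_inspections=4):
    """Return True if an inspection history meets the 'stuck' criteria.

    `ratings` holds the Overall Effectiveness category (1-4) of every
    inspection since 2005, including those of predecessor schools.
    """
    # Criterion 1: at least four inspections since 2005.
    if len(ratings) < min_inspections:
        return False
    # Criterion 2: every inspection rated 'less than Good' (Category 3 or 4).
    return all(r in (3, 4) for r in ratings)


# Hypothetical inspection histories, oldest inspection first.
histories = {
    "school_a": [3, 4, 3, 3],  # four inspections, all less than Good
    "school_b": [3, 3, 2, 3],  # one Good inspection breaks the pattern
    "school_c": [4, 3, 3],     # too few inspections to qualify
}
stuck = {name: is_stuck(r) for name, r in histories.items()}
print(stuck)
```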

Ofsted has identified stuck schools as a priority area for improvement and is working with the Department for Education to investigate them and determine how they can be helped to improve (Spielman, 2018).

1.3. Ofsted and the Department for Education

These are two government organisations that are key stakeholders in this work.

The Office for Standards in Education, Children’s Services and Skills is better known as

Ofsted. They are responsible for carrying out inspections on services providing education and

skills, as well as those that care for children and young people. They then publish the reports

of their findings and inform policymakers of the effectiveness of the services. In this work, the

focus is on school inspections carried out and reported by Ofsted.

The Department for Education (DfE) is responsible for the schooling system as a whole. It is

capable of carrying out ‘interventions’ on schools which it believes would benefit from extra

support. These interventions can take many forms, such as increasing funding or providing a

management consultant.

1.4. Tools Used

This work was carried out entirely using the Python programming language. The data

processing was carried out using the pandas (McKinney, 2010) and numpy (Walt, Colbert and

Varoquaux, 2011) libraries whilst the Scikit-learn library (Pedregosa

et al., 2011) was used for the machine learning models. Matplotlib (Hunter, 2007) was used to

generate the plots. Sequential Feature Selection was implemented using mlxtend (Raschka,

2018). Oversampling was implemented using imblearn (Lemaitre, Nogueira and Aridas, 2017).

A GitHub repository was used to store the code used throughout this project. It can be accessed

at github.com/crees00/SchoolsData.

1.5. Project Aims

The primary aim of this work is to make a model that can accurately predict, for a school with

three consecutive ‘less than Good’ (Category 3 or 4) inspections, whether the next inspection

will be ‘Good or better’ (Category 1 or 2) or ‘less than Good’ (Category 3 or 4). This binary

classification is to be achieved with the greatest accuracy possible.

The second aim is to find the important features in making this prediction and assign an

importance value to them. This is useful in many ways – it allows insight to improve the

accuracy of the model. It also helps to back up the results of the model, providing more weight

to them. It makes the model more accessible and less ‘black box’, allowing the DfE to trust it

more and ensure that evidence is available if a school were to question the model’s output. As

a result, it is important that each variable has an understandable meaning and principal

components cannot be used for modelling.

In order to achieve the two project aims stated above, a dataset must first be created. This

dataset must contain data on every currently open, state-funded school in England, with enough

variables to allow modelling to be carried out. The data must be cleaned and formatted in a

way that is amenable to modelling, with binary class labels generated and added to the dataset.

1.6. Process Plan

1. Create dataset from multiple data sources.

2. Label data.

3. Preliminary analysis of data.

4. Iteratively run hyperparameter searches and Sequential Feature Selection to identify

best combination of hyperparameters and features for each model type.

5. Analyse results and present findings to Ofsted and DfE.

2. Literature Review

2.1. Previous Analysis of Schools Data

Analysis of schools data has been completed in a number of different forms for different

purposes. The prediction of secondary school performance using machine learning has been

carried out in Tunisia (Rebai, Ben Yahia and Essid, 2019). Here, Random Forests and

regression trees were used to identify variables associated with strong exam performance for a

school. School size, the male/female split and class size are among the key variables, of which

it is noted that there is a ‘high non-linearity of the relationships between these key factors and

school performance’. Dummy variables are used for the different geographical regions, but this

is shown to provide few positive results as the information is contained in another variable

(urban/rural). The paper does, however, use relatively few variables which would not appear

to cover the range of inputs required.

The complexity of the link between school level data and exam performance is again noted in

a paper which uses regression trees to generate feature importance for predicting school exam

performance (Masci, Johnes and Agasisti, 2018). The percentage of disadvantaged students,

school funding and student truancy are listed as key variables. It is noted that, while these

variables are linked to exam results, exam results themselves are to be used as a variable in this

project. Student exam results have been predicted with relatively poor accuracy using deep

learning (Tanuar et al., 2018).

Student exam results have also been predicted using, among others, Naïve Bayes, Support

Vector Machines, Random Forests and Logistic Regression as it is recognised that no one

algorithm works best for every problem (Canagareddy, Subarayadu and Hurbungs, 2019). In

this paper, a subset of the variables, such as student age and gender, are removed prior to

modelling because they ‘do not have any impact on the predictions’. Their results appear to

show that their classifier performance decreases as a result, suggesting that variables should be

tested in the model before concluding that they are ineffective and removing them from the

analysis.

As noted above, the uses of machine learning prediction identified in the literature apply to

exam performance as opposed to school inspection performance. They also mainly use

individual student data and make predictions on an individual student level. Some analyses

have been undertaken to characterise stuck schools (Spielman, 2018; Thomson, 2019) but these

do little beyond providing summary statistics.

There therefore exists an opportunity to investigate whether machine learning techniques can

be used to predict future inspection performance of schools.

2.2. Machine Learning Models

Each open state-funded school in England forms a data point in this analysis, with many

features and a single binary class: Stuck or Escaped. The aim of this work is to establish a

binary classifier that can accurately predict, for an unlabelled school, whether it will be Stuck

or Escaped.

There are many available machine learning models which can be used to build a binary

classifier. It is assumed that the reader is familiar with the relatively well-known techniques

used in this work and, as such, only a brief description is provided. The Scikit-learn implementations

of these models (Pedregosa et al., 2011) have hyperparameters which

can be tuned to improve the model performance. The key hyperparameters for each model are

described under sub-headings, together with the options available for them in Scikit-learn. The

remainder of this chapter uses information from the Scikit-learn website heavily, in the

documentation related to the Scikit-learn tool named.

2.2.1. Logistic Regression

Logistic Regression is a well known binary classifier, described based on the Machine Learning

Journal article (Lin, Yu and Huang, 2011). In Logistic Regression, the conditional probabilities

describing the possible outcomes for a single data point are modelled using a logistic function:

$$P(y \mid \mathbf{x}) \equiv \frac{1}{1 + e^{-y\mathbf{w}^{T}\mathbf{x}}}$$

where 𝒙 is the data point, 𝑦 is the class label and 𝒘 is the weight vector. Given binary (two

class) training data with 𝑙 points, Logistic Regression minimises the following cost function:

$$P(\mathbf{w}) = C \sum_{i=1}^{l} \log\left(1 + e^{-y_i \mathbf{w}^{T}\mathbf{x}_i}\right) + \frac{1}{2}\mathbf{w}^{T}\mathbf{w}$$

where 𝐶 > 0 is a penalty parameter. In this project, 𝑙2 regularisation as shown above is used,

with the Limited-memory BFGS (L-BFGS) solver selected for its stability.
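The logistic function and the regularised cost function can be evaluated directly for a small example. The sketch below is illustrative only: it assumes class labels coded as -1 and +1 (as the formulas require), uses hypothetical toy data, and does not perform the L-BFGS optimisation itself.

```python
import math


def dot(w, x):
    """Inner product w^T x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def prob(y, w, x):
    """P(y | x) for a label y in {-1, +1}."""
    return 1.0 / (1.0 + math.exp(-y * dot(w, x)))


def cost(w, X, Y, C=1.0):
    """C * sum_i log(1 + exp(-y_i w^T x_i)) + 0.5 * w^T w."""
    loss = sum(math.log(1.0 + math.exp(-y * dot(w, x))) for x, y in zip(X, Y))
    return C * loss + 0.5 * dot(w, w)


# Hypothetical toy data: two points per class.
X = [[1.0, 2.0], [2.0, 0.5], [-1.0, -1.5], [-2.0, -0.5]]
Y = [1, 1, -1, -1]

w_zero = [0.0, 0.0]  # uninformative weights: every probability is 0.5
w_good = [1.0, 1.0]  # weights that separate the two classes

print(cost(w_zero, X, Y), cost(w_good, X, Y))
```

With zero weights the cost is simply the number of points multiplied by log 2; weights that separate the classes reduce the loss term at the price of a larger penalty term.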

Logistic Regression is a useful classifier because it is relatively simple, well known and can

provide insight into the relative importance of the input features.

The sklearn.linear_model.LogisticRegression class was used for this analysis.

2.2.2. Random Forests

Random Forests are a type of ensemble classifier based on the decision tree. A known issue

with decision trees is the possibility of overfitting the data due to the tree depth being too great

(Shalev-Shwartz and Ben-David, 2014), but Random Forests can remove this issue. The

Random Forest algorithm is a perturb-and-combine technique, where a diverse set of classifiers

is created by introducing randomness into the classifier construction. The prediction of the

ensemble is given as the averaged prediction of the individual classifiers.

Each tree in the ensemble is built from a random sample drawn with replacement (a bootstrap

sample) from the training set. When splitting each node during the construction of a tree, the

best split is found either from all input features or a random subset of specified size. The

purpose of these two sources of randomness is to decrease the variance of the model. Random

Forests achieve a reduced variance by combining a diverse selection of trees although this can

increase the bias. The Scikit-learn implementation of Random Forests used was

sklearn.ensemble.RandomForestClassifier. This combines classifiers by averaging their

probabilistic prediction instead of letting each classifier vote for a single class.
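The difference between averaging probabilistic predictions and letting each tree vote can be seen in a small sketch. The per-tree probabilities below are hypothetical, and the two helper functions are illustrative rather than part of Scikit-learn.

```python
def average_probability_prediction(tree_probs, threshold=0.5):
    """Average the class-1 probabilities of the trees, then threshold."""
    mean_p = sum(tree_probs) / len(tree_probs)
    return (1 if mean_p > threshold else 0), mean_p


def majority_vote_prediction(tree_probs, threshold=0.5):
    """Let each tree vote for a single class, then take the majority."""
    votes = [1 if p > threshold else 0 for p in tree_probs]
    return 1 if sum(votes) > len(votes) / 2 else 0


# Hypothetical class-1 probabilities from three trees for one school.
tree_probs = [0.95, 0.40, 0.45]

averaged_class, mean_p = average_probability_prediction(tree_probs)
voted_class = majority_vote_prediction(tree_probs)
print(averaged_class, voted_class)
```

Here one confident tree outweighs two marginal ones under averaging, whereas a majority vote would have discarded that confidence.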

2.2.2.1. Number of Estimators

The number of trees making up the Random Forest. The accuracy of the model converges to a

limit as the number of trees in the forest becomes large. There is therefore a trade-off between

model accuracy and processing time as larger models will take longer to compute.

2.2.2.2. Maximum Number of Features to Consider

The size of the random subsets of features to consider when splitting a node. The lower the

number, the greater the reduction of variance, but also the greater the increase in bias.

2.2.2.3. Criterion

The function used to measure the quality of the split at each node in a tree. Either the Gini

impurity or the information gain (entropy) can be used in the model.

2.2.2.4. Bootstrap

Whether each tree added to the Random Forest is built from a bootstrap sample of the data points or from all of the data points.

2.2.3. Support Vector Machines

The following description is largely based on Chapter 5 of Pattern Classification (Duda, Hart

and Stork, 2001) and Chapter 16.5 of Numerical Recipes (Press et al., 2007). Support Vector

Machines are a form of linear discriminant function that specialise in separating data where the

pattern is in a higher dimension.

If the data is linearly separable, there exists a hyperplane in 𝑛 dimensions, an 𝑛 − 1

dimensional surface, given by

$$f(\mathbf{x}) \equiv \mathbf{w} \cdot \mathbf{x} + b = 0$$

that completely separates the training data 𝒙. All that remains is to find 𝒘 (a normal vector to

the hyperplane) and 𝑏 (an offset), then 𝑓(𝒙) will be the decision rule – if 𝑓(𝒙) > 0 then Class=1

and the point is on one side of the hyperplane, if 𝑓(𝒙) < 0 then Class=0 and the point is on the

other side of the hyperplane.

The margin is the perpendicular distance to points nearest to the hyperplane on both sides. The

goal in training a Support Vector Machine is to find the separating hyperplane with the largest

margin. The larger the margin, the better the classifier is expected to perform on unseen test

cases.

The support vectors are the training samples that lie nearest to the hyperplane, at equal distance on either side of it. They define the optimal separating hyperplane and are the most difficult points to classify as they are nearest to the hyperplane.

An important benefit of the Support Vector Machine is that the complexity of the resulting

classifier is based on the number of support vectors rather than the dimensionality of the

transformed space. Support Vector Machines therefore tend to be less prone to problems of

overfitting than some other methods. As they are based only on the support vectors, outliers

have less impact.

A disadvantage of the Support Vector Machine is that it does not lend itself to providing

probability estimates and it does not indicate the importance of different input features.

The sklearn.svm.SVC class was used to implement Support Vector Machines.

2.2.3.1. C value

The optimal decision boundary described above is that which maximises the distance between

the nearest training points and the decision boundary, assuming that the data is linearly

separable. If the data cannot be separated by a hyperplane, then training data points will be

misclassified and the model is said to have a ‘soft margin’. This misclassification is penalised

by the C value in the optimisation process, with a higher C value more strongly penalising

misclassification, with a penalty related to the distance between the misclassified point and the

hyperplane.

Setting C to a high value therefore favours achieving perfect separation of the data, at the risk

of generating an overly complex model. Setting a lower C value favours increasing the margin,

which will perform worse on the training data but could be more robust to variation in the test

data.

2.2.3.2. Kernel function

Support Vector Machines can utilise the ‘kernel trick’. If an embedding function exists that

maps the n-dimensional feature vectors to a much higher N-dimensional space, it may be

possible that a very non-linear separating surface in the n-dimensional space maps into a linear

hyperplane in the N-dimensional space. This potentially highly complex mapping does not

need to be computed; instead, a kernel is computed that could have come from the mapping.

The three kernel functions available in the model are linear, polynomial and the Gaussian radial

basis function (‘rbf’). The polynomial kernel seeks smoother, more global solutions whilst the

Gaussian radial basis function is more influenced by local nearest neighbour effects.

2.2.3.3. Degree of polynomial

If a polynomial kernel is selected, this is the order of the polynomial.

2.2.3.4. Gamma value

The gamma value is a parameter used in the polynomial and Gaussian radial basis function kernels.
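To show where the degree and gamma parameters enter, the three kernels can be written out explicitly. The sketch below follows the kernel formulas given in the Scikit-learn documentation (where `coef0` is an additional constant term in the polynomial kernel); the function names and input vectors are hypothetical, and this is not the library's own implementation.

```python
import math


def linear_kernel(x, z):
    """<x, z>"""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, degree=3, gamma=1.0, coef0=0.0):
    """(gamma * <x, z> + coef0) ** degree"""
    return (gamma * linear_kernel(x, z) + coef0) ** degree


def rbf_kernel(x, z, gamma=1.0):
    """exp(-gamma * ||x - z||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


# Hypothetical two-dimensional feature vectors.
x, z = [1.0, 2.0], [2.0, 0.0]

print(linear_kernel(x, z))
print(polynomial_kernel(x, z, degree=2, gamma=1.0, coef0=1.0))
print(rbf_kernel(x, z, gamma=0.5))
```

A larger gamma makes the radial basis function kernel decay faster with distance, so each training point influences a smaller neighbourhood.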

2.2.4. Neural Networks

The following is based largely on Chapter 4.4 of Machine Learning (Mitchell, 1997).

A single perceptron unit with a binary threshold takes a vector of input values 𝒙, calculates a

linear combination of these inputs and outputs a 1 if the result is greater than some threshold

and -1 otherwise. The weights 𝑎𝑖 and bias 𝑏 are real valued constants that are learned so that

the perceptron produces the correct (-1 or +1) output for each of the given training examples.

$$\text{Output } o(\mathbf{x}) = \begin{cases} 1 & \text{if } b + a_1 x_1 + \dots + a_n x_n > 0 \\ -1 & \text{otherwise} \end{cases}$$
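The threshold rule for a single perceptron unit can be sketched in a few lines. The weights and bias below are hand-picked for illustration (they are not learned) so that the unit reproduces a logical AND of two binary inputs.

```python
def perceptron_output(x, a, b):
    """Return 1 if b + a_1*x_1 + ... + a_n*x_n > 0, otherwise -1."""
    activation = b + sum(ai * xi for ai, xi in zip(a, x))
    return 1 if activation > 0 else -1


# Hypothetical hand-picked weights and bias implementing a logical AND.
a, b = [1.0, 1.0], -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = {x: perceptron_output(x, a, b) for x in inputs}
print(outputs)
```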

A single perceptron can only express a linear decision surface. Adding in layers of

perceptrons (Figure 1) allows the model to learn a more complex, non-linear decision surface.

Each layer of perceptrons contains a pre-defined number of perceptron units, each of which

can learn different weights (𝒂) and bias (𝑏). The layers of perceptron units are fully

connected – each unit in one layer is connected to all units in the previous layer and all units

in the next layer. The inputs to the second layer are the outputs from the first layer, and so on.

A multi-layer perceptron model may be classed as a feed-forward Neural Network.

Figure 1: A multi layer perceptron with 1 hidden layer containing k units (Scikit-learn, no date b)

The backpropagation algorithm learns the weights for a multilayer network. It employs gradient descent to attempt to minimise the squared error between the network output values and the target value for these outputs (the training data labels). For each training example, it applies the network to the example, calculates the error of the network output for this example, computes the gradient with respect to the error on this example, then updates all weights in the network.

The number of hidden units governs the complexity of the decision boundary (Duda, Hart and Stork, 2001) so, if the data classes are highly interspersed, more hidden units are required.

Well trained Neural Networks are capable of accurately classifying highly complex non-linear datasets. A disadvantage of the Scikit-learn implementation is that there is a non-convex loss function with more than one local minimum. This can lead to variation between runs with the same data. Neural Networks also require the tuning of a number of hyperparameters, such as the number of hidden units, layers and iterations.

sklearn.neural_network.MLPClassifier was the implementation of Neural Networks used in this project.

2.2.4.1. Activation function

For a perceptron unit, this is a binary threshold function. In this work, the rectified linear unit

(ReLU) was used due to its constant gradient (for positive values), ensuring that the learning

rate was not unnecessarily slow for large values.

2.2.4.2. Number of layers

The number of hidden layers in the network. It is unlikely that more than three layers would be

required, as it would be necessary to have ‘special problem conditions or requirements to

recommend the use of more than three layers’ (Duda, Hart and Stork, 2001).

2.2.4.3. Number of nodes per layer

In order to simplify the parameter search, the decision was taken to have, for a single Neural

Network, the same number of units in each layer. For example, the following setups were

possible: [2,2,2] or [5,5,5,5].

2.2.4.4. Solver

Three solvers are available in the Scikit-learn implementation:

- Stochastic Gradient Descent – Updates parameters using the gradient of the loss

function with respect to a parameter that needs adaptation.

- Adam – This is also a stochastic optimiser but it can automatically adjust the amount to

update parameters based on adaptive estimates of lower-order moments.

- L-BFGS – This approximates the Hessian matrix which represents the second-order

partial derivative of a function. It then approximates the inverse of the Hessian matrix

to perform parameter updates.

2.2.4.5. Alpha

Alpha is a regularisation term which helps avoid overfitting by penalising weights with large

magnitudes.

2.2.5. Naïve Gaussian Bayes Networks

This description is based on ‘The Optimality of Naïve Bayes’ (Zhang, 2004).

Naïve Gaussian Bayes networks are based on applying Bayes’ theorem with the ‘naïve’

assumption of conditional independence between every pair of features given the value of the

class label 𝑦. Bayes’ theorem states the following relationship, given class label 𝑦 and

dependent feature vector 𝒙:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y)\,P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}$$

Using the naïve conditional independence assumption for all features, this relationship is

simplified to:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i \mid y)}{P(x_1, \dots, x_n)}$$

As 𝑃(𝑥1, … , 𝑥𝑛 ) is constant given the input, we can use the following classification rule:

$$P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

$$\Rightarrow \hat{y} = \underset{y}{\operatorname{arg\,max}} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

Maximum A Posteriori estimation can then be used to estimate 𝑃(𝑦) and 𝑃(𝑥𝑖 |𝑦), with 𝑃(𝑦)

then being the relative frequency of class label 𝑦 in the training set.

In the Gaussian Naïve Bayes algorithm used, the likelihood of the features is assumed to be

Gaussian:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma_y^{2}}} \exp\left(-\frac{(x_i - \mu_y)^{2}}{2\sigma_y^{2}}\right)$$

The parameters 𝜎𝑦 and 𝜇𝑦 are estimated using maximum likelihood. There are no

hyperparameters to tune in this model.
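The classification rule and the Gaussian likelihood can be combined into a short from-scratch sketch. It is illustrative only and is not the Scikit-learn implementation: the data are hypothetical, a separate mean and variance are assumed for each feature within each class, and log-probabilities are summed rather than multiplying probabilities, to avoid numerical underflow.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, Y):
    """Estimate the class prior and per-feature mean and variance per class."""
    by_class = defaultdict(list)
    for x, y in zip(X, Y):
        by_class[y].append(x)
    model = {}
    for y, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # Maximum likelihood variance estimate (divides by n, not n - 1).
        variances = [sum((v - m) ** 2 for v in col) / n
                     for col, m in zip(columns, means)]
        model[y] = (n / len(X), means, variances)
    return model


def log_gaussian(x, mean, var):
    """Logarithm of the Gaussian likelihood P(x_i | y)."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def predict(model, x):
    """Return the class maximising log P(y) + sum_i log P(x_i | y)."""
    def log_posterior(y):
        prior, means, variances = model[y]
        return math.log(prior) + sum(
            log_gaussian(xi, m, v) for xi, m, v in zip(x, means, variances))
    return max(model, key=log_posterior)


# Hypothetical training data: two features, two classes.
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 0.5], [3.2, 0.7], [2.8, 0.4]]
Y = [0, 0, 0, 1, 1, 1]

model = fit_gaussian_nb(X, Y)
print(predict(model, [1.1, 2.0]), predict(model, [3.1, 0.6]))
```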

The conditional independence assumption rarely holds in real-world applications.

Despite this, there are numerous scenarios where these models have proved to be

effective. They require a small amount of training data to estimate the necessary parameters

and are extremely fast compared to more sophisticated methods. The decoupling of the class

conditional feature distributions means that each distribution can be independently estimated

as a one-dimensional distribution. This then helps avoid problems with the curse of

dimensionality.

The class used in this project was sklearn.naive_bayes.GaussianNB.

2.2.6. K-Nearest Neighbours

The K-Nearest Neighbours algorithm assumes that all data points correspond to points in n-dimensional space (Mitchell, 1997). The nearest neighbours of an instance are defined in terms

of a distance measure as specified by the user. When k is 1, the algorithm assigns a label to the

test point corresponding to the nearest training point in the space. For larger values of k, the

algorithm assigns the most common class label of the k nearest training points.

Unless weightings are applied, all features are treated equally. As such, if there are large

numbers of unimportant features in the model, the classification can be dominated by these

unimportant features. Also, each test point must be compared to every training point so this

can be computationally intensive.

The implementation of K-Nearest Neighbours used was sklearn.neighbors.KNeighborsClassifier.

2.2.6.1. K value

The larger the value of k, the larger the space being investigated and the less prone to error due

to noise.

2.2.6.2. P measure

The order of the Minkowski distance used to measure the distance between points. For example, p=1 is the Manhattan distance and p=2 is the Euclidean distance.
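The roles of k and p can be illustrated with a brute-force sketch of the algorithm. The training points below are hypothetical and the functions are illustrative rather than the Scikit-learn implementation; every training point is compared with the test point, which is why the method can be computationally intensive.

```python
from collections import Counter


def minkowski(x, z, p=2):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, z)) ** (1.0 / p)


def knn_predict(train_X, train_y, x, k=3, p=2):
    """Assign the most common label among the k nearest training points."""
    # Compare the test point with every training point, then keep the k nearest.
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: minkowski(pair[0], x, p))
    nearest_labels = [label for _, label in ranked[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated clusters.
train_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = [0, 0, 0, 1, 1, 1]

print(minkowski([0.0, 0.0], [3.0, 4.0], p=1))  # Manhattan distance
print(minkowski([0.0, 0.0], [3.0, 4.0], p=2))  # Euclidean distance
print(knn_predict(train_X, train_y, [0.5, 0.5], k=3))
```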

2.3. Methods for Refining the Models

Aside from selecting a model type and tuning its parameters (Section 2.2), there are many ways

in which a model can be further improved. The methods used in this project are described

below.

2.3.1. Iterative imputation

The models required complete sets of data to function, yet a substantial quantity of the input

data was missing (Section 3.3.1). It was therefore required to use data imputation. The

imputation method selected was the IterativeImputer from Scikit-learn. This models each

feature with missing values as a function of other features, and uses that estimate for

imputation. This is achieved by fitting a regressor to the input data and using this to predict the

missing values in an iterative, round-robin fashion. This imputation method was selected to

provide a more ‘accurate’ prediction of the missing value than simply imputing the mean or

the mode, although this has not been verified. It also maintains a level of variance within the

data.

2.3.2. Sequential Forward and Backward Selection

Sequential Forward Selection (SFS) and Sequential Backward Selection (SBS) are described based on the information on the mlxtend website (mlxtend, no

date). This technique is a greedy method of selecting the optimum combination of features to

use in the model. For forward selection, the selected feature set is initially empty and the

available features are input into a list. In each iteration, each feature in the input list is added

in turn to the selected feature set: the model is trained and tested (using cross validation) on the new

feature set, the score is recorded and the feature is removed again. When all features in the input list

have been tested, the feature that resulted in the best score is removed from the input list and

added to the selected feature set. This is continued until all features have been selected if an

early stopping criterion is not used. Backward selection works in the reverse fashion – the

selected feature set initially contains all features and one feature is removed in each iteration.

The scoring measure can be selected from standard measures such as model accuracy and the

area under the ROC curve. This is computationally intensive as the model must be trained and

tested a large number of times for each feature selection run.

In this work, the module mlxtend.feature_selection.SequentialFeatureSelector was used to

implement sequential feature selection in the SFS.py script.
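The greedy forward selection loop described above can be sketched as follows. This is a simplified illustration, not the mlxtend implementation: the feature names are hypothetical and `toy_score` is a stand-in for the cross-validated model score, which in practice requires training and testing the model for every candidate feature.

```python
def sequential_forward_selection(features, score, n_select):
    """Greedily add the feature that gives the best score at each iteration."""
    selected = []
    remaining = list(features)
    while remaining and len(selected) < n_select:
        # Try each remaining feature in turn and keep the best-scoring one.
        best = max(remaining, key=lambda f: score(selected + [f]))
        remaining.remove(best)
        selected.append(best)
    return selected


# Hypothetical contribution of each feature to the score; in the real
# procedure this would be a cross-validated accuracy or AUC.
USEFUL = {"absence": 0.3, "funding": 0.2, "pupils": 0.1}


def toy_score(subset):
    """Stand-in scoring function: unknown features slightly reduce the score."""
    return sum(USEFUL.get(f, -0.05) for f in subset)


chosen = sequential_forward_selection(
    ["noise", "pupils", "absence", "funding"], toy_score, n_select=2)
print(chosen)
```

Because the search is greedy, a feature chosen early is never reconsidered, so the final set is not guaranteed to be the best possible combination.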

2.3.3. Recursive Feature Elimination

Recursive Feature Elimination is another method of selecting features and is described in the

Scikit-learn documentation. This is achieved by recursively removing the features which score

lowest in a Logistic Regression until the number of required features is reached. It was achieved

using the Scikit-learn class sklearn.feature_selection.RFE.

2.3.4. Oversampling

When using different subsets of the data, with different criteria for positive and negative labels,

it is possible that there can be a large difference between the sizes of the two classes for a binary

classifier. As a result, the classifier can be distorted by having few training points in the less

populous ‘minor’ class. This can be mitigated through the generation of synthetic cases of the

minor class in a technique called oversampling.

An oversampling technique called SMOTE (Chawla et al., 2002) is used in this work. The

following description is based on the documentation of imbalanced-learn (Lemaitre, Nogueira

and Aridas, 2017) which provided the implementation of SMOTE used. This module is

imblearn.over_sampling.SMOTE. The regular SMOTE algorithm takes a point of the minor

class, selects one of its nearest neighbours in the same class and generates a new point by linear

interpolation between the two existing points, giving the new point the label of the minor class.

The result of using this technique is that the training set will have the same number of each

class in it. The more populous ‘major’ class of the training set and all of the test set remain

unchanged.
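The interpolation step of regular SMOTE can be sketched as below. This is a simplified illustration and not the imbalanced-learn implementation: the minor-class points are hypothetical, and the function generates a single synthetic point from one chosen starting point.

```python
import random


def smote_point(x, minority_points, k=2, rng=random):
    """Generate one synthetic minor-class point by linear interpolation."""
    # Rank the other minor-class points by squared distance from x.
    others = [p for p in minority_points if p is not x]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
    # Select one of the k nearest neighbours at random.
    neighbour = rng.choice(others[:k])
    # Pick a random position on the segment between x and the neighbour.
    u = rng.random()
    return [a + u * (b - a) for a, b in zip(x, neighbour)]


rng = random.Random(0)  # fixed seed so that the sketch is repeatable

# Hypothetical minor-class points; the last one is far from the others.
minority = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [9.0, 9.0]]
synthetic = [smote_point(minority[0], minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Each synthetic point lies on a line segment between two existing minor-class points, so the distant point at (9, 9) is never used as a neighbour of the first point when k=2.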

2.4. Measuring Model Effectiveness

A key consideration when building a machine learning model is how the results from the model

should be assessed to identify which are most effective and how they can be improved.

2.4.1. Cross validation

In order to provide an indication of the accuracy of the model, it must be tested on previously

unseen data. As a result, for each run, a subset of the data should be removed and held as a test

set, the remainder being the training set. The model can then be trained solely on the training

set and its performance judged on how the model performs when applied to the test set. The

reported performance of the model is then dependent on the training/test split used – some

splits would result in better performance on the test set than others. To reduce this effect, 𝑘

fold cross validation can be implemented as described in section 9.6.2 of Pattern Classification

(Duda, Hart and Stork, 2001):

The training set is randomly divided into 𝑘 disjoint sets of equal size 𝑛/𝑘, where 𝑛 is the total

number of data points. The classifier is trained 𝑘 times, each time with a different set held out

as a test set. The estimated performance is the mean of these 𝑘 scores. The number of folds

traditionally used is 𝑘 = 10.
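The splitting procedure can be sketched without any library support. The code below is an illustration of the idea rather than the project's oneFullRun(..) function or Scikit-learn's KFold: the function names are hypothetical and `score_fold` stands in for training a model on the training indices and scoring it on the held-out fold.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k disjoint folds of n points."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Slice the shuffled indices into k folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n, k, score_fold):
    """Return the mean of the k per-fold scores."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n, k)]
    return sum(scores) / len(scores)


splits = list(k_fold_indices(n=20, k=5))
for train, test in splits:
    print(len(train), len(test))
```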

The sklearn.model_selection.KFold class was used to implement cross validation

because the full implementations of cross validation in Scikit-learn were not compatible with

the way the models and data had been set up. Setting up the different training and test sets is

demonstrated in the oneFullRun(..) function in genericModelClass.py as shown in Section 5.2.

This generates output for each of the 𝑘 folds separately, so separate post processing is required

to calculate the average output.

2.4.2. Accuracy

Accuracy is the percentage of the test points that were correctly classified by the model.

2.4.3. Precision and recall

The confusion matrix (Figure 2) summarises a classifier’s performance.

| | True 1 | True 0 |
|---|---|---|
| Predicted 1 | True positives | False positives |
| Predicted 0 | False negatives | True negatives |

Figure 2: Confusion Matrix

Precision shows what proportion of the points classified as class 1 are actually in class 1:

$$\text{Precision} = \frac{\text{True positives}}{\text{True positives} + \text{False positives}}$$

Recall shows what proportion of the points that are labelled as class 1 are predicted to be in class 1:

$$\text{Recall} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}}$$

These measures are particularly useful when there is an uneven split between the two classes.

The F1 score is the harmonic mean of precision and recall, giving equal weight to each:

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

The confusion matrices, precision and recall were calculated using the classification_report and confusion_matrix functions within sklearn.metrics.
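As a worked example of these measures, the sketch below computes them from a set of confusion matrix counts. The counts are hypothetical and were chosen to show an uneven class split, where accuracy looks high even though a substantial share of class 1 is missed.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and the F1 score from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for an imbalanced test set of 1,000 points.
tp, fp, fn, tn = 30, 10, 20, 940

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(accuracy, precision, recall, f1)
```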

2.4.4. ROC curves

An ROC curve (see examples in Figure 10 in Results section) is built upon a model’s false

positive rate and true positive rate – the proportion of the points actually in class 0 and class 1,

respectively, that the model has classed as class 1. By calculating these rates for a range of

classifier threshold values, a Receiver Operating Characteristic (ROC) curve can be plotted. If

the threshold is set high enough, all points are predicted to be in class 0, so both rates are zero,

and the opposite holds for when the threshold is set to a minimum value. The method of

obtaining the values for ROC curves for the different model types is set out in Chapter 3 of

‘ROC Analysis of Classifier in Machine Learning: A Survey’ (Majnik and Bosnic, 2011).

The area under an ROC curve can be calculated, for which a perfect classifier would have a

value of 1. An Introduction to ROC Analysis (Fawcett, 2006) states that “The AUC has an

important statistical property: the AUC of a classifier is equivalent to the probability that the

classifier will rank a randomly chosen positive instance higher than a randomly chosen negative

instance.” A classifier which assigns classes at random would expect to achieve an AUC of

0.5. Therefore, for a model to be useful it must have an AUC score exceeding 0.5.

The datapoints and AUC results for the ROC curves plotted in this report were generated using

sklearn.metrics.roc_curve and sklearn.metrics.roc_auc_score respectively.
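A minimal sketch of these two calls is given here, using four made-up scores rather than project data. It also checks the ranking interpretation of the AUC quoted from Fawcett (2006) by comparing every positive and negative pair directly.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true classes and classifier scores, for illustration only
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (false positive rate, true positive rate) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

# Ranking interpretation: the fraction of (positive, negative) pairs in
# which the positive point has the higher score (no tied scores here)
pos = scores[y_true == 1]
neg = scores[y_true == 0]
pairwise = np.mean([p > n for p in pos for n in neg])
```

The curve starts at (0, 0), where the threshold is high enough that every point is predicted to be in class 0, and ends at (1, 1).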

3. Input Data and Pre-Processing A key part of this project was assembling the data into a useable format compatible with the

machine learning models.

3.1. Datasets Available This project used publicly available data provided by Ofsted and the DfE.

3.1.1. Ofsted data Ofsted provided a record of every inspection that had occurred between 1st September 2005

and 31st August 2018, of which there are approximately 100,000. This data came in the form

of .csv files. The period until 31st August 2015 was covered by a single .csv file provided by

Ofsted, with the remaining data available in a separate file per term from the government

website. As stated in Section 1.1, these 100,000 inspections are neither evenly distributed

between schools nor by time.

Each inspection had around 50 variables; the key ones are listed below.

• URN – Each school has a Unique Reference Number which is the primary identification

of the school. If a school reopens under a different name, or becomes an academy, it is

given a new URN.

• LA/ESTAB Number – The Local Authority / Establishment number is another unique

identifier that was used in the cases where URN was not available.

• Inspection Number – This is the unique reference for the inspection.

• Overall Effectiveness rating – This is the category (1/2/3/4) that is provided.

• Inspection Start Date – This is used for providing the year of the inspection.

3.1.2. DfE data The data provided by the Department for Education had been acquired by the ONS before the

start of this project.

A list of the variables used is provided in the appendix (Section 9.1).

3.1.3. Get Information About Schools data This service, formerly known as EduBase, provides a list of all state-funded schools which are

currently open in England. This analysis focuses solely on these schools, although any

inspections that a current school’s predecessors had are considered in the current school’s

inspection history.

3.1.4. List of Academies This list of schools made a link between a current school and any predecessor school(s) it had.

This could be used in combining the inspection histories of schools.

3.2. Data Stitching An initial aim of the project was to create a single large dataset containing all relevant data for

each open school in England. The list of all currently open schools was used to ensure that

there was one row for each school, then columns were added in batches. The

combineInspData.py, creatingAMonster.py, GDIhelper.py and genericDataIn.py scripts were

used to carry out the following work.

A series of helper functions were created to perform the tasks. For each data source, a dictionary

was filled with the .csv file paths, the column names containing each data type and a selection

of other required inputs. The dictionaries were then passed sequentially to the helper functions

to add columns to the existing dataframe. During this process, the data types were corrected.

Some data was in separate .csv files for different subsets of the data for a given year, with

inconsistent names for equivalent columns. Here, the corresponding column names of the

required columns were identified and placed in lists. The lists of the two input dataframes were

then used to merge the individual columns to provide a combined dataframe that could be

joined to the primary dataframe. A similar operation was carried out on the examination

performance data which is described in section 4.1.3.
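The column matching step described above could be sketched as follows. This is an illustrative outline under stated assumptions, not the code in GDIhelper.py or genericDataIn.py: the dataframes, the column names and the common names are all hypothetical.

```python
import pandas as pd

# Hypothetical primary dataframe with one row per open school
primary = pd.DataFrame({"URN": [100, 101, 102]})

# Two hypothetical subsets for the same year with inconsistent column names
subset_a = pd.DataFrame({"URN": [100], "PupilCount": [210]})
subset_b = pd.DataFrame({"urn": [101], "NumPupils": [950]})

# Lists of corresponding columns in each input, in matching order
cols_a = ["URN", "PupilCount"]
cols_b = ["urn", "NumPupils"]
common = ["URN", "Pupils"]

# Rename each subset to the common names, then stack them into one dataframe
combined = pd.concat(
    [
        subset_a[cols_a].rename(columns=dict(zip(cols_a, common))),
        subset_b[cols_b].rename(columns=dict(zip(cols_b, common))),
    ],
    ignore_index=True,
)

# A left join keeps exactly one row per school in the primary dataframe;
# schools with no matching record are left with a missing value
stitched = primary.merge(combined, on="URN", how="left")
```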

The final dataset contained one row for each open, state funded school in England. The URN

(Section 3.1.1) was the unique identifier, followed by columns for each selected variable. A

‘Class’ column, indicating the binary classification assigned to the school (Section 4.3.3), was

added as a final column.

3.3. Data Cleaning Data retrieved from different sources needed to be cleaned so that no information was

lost and all data input to the models was in a compatible format.

3.3.1. Missing data Between the data sources there was a range of levels of data completeness. The completeness

of each variable was assessed to ascertain which had sufficient data to be useful in the analysis.

This was partly carried out using the makePickColsToUse(..) function within

pickColsToUse.py. Some variables were replicated between different data sources, with

different levels of completeness. In this case, the more complete source was used.

The data stitching technique also meant that if a school had a predecessor, the predecessor URN

was not used to find data from when the predecessor was open. The cleaning process also

removed some values that were incorrect, leading to more missing data.

3.3.2. Data types All data was required to be in numeric form, either as a float or an integer, to be compatible

with the models. The input data contained multiple types and formats, often with multiple types

within a single column. A selection of helper functions were required to deal with specific

formats such as percentages and currency. Where data was in an inappropriate format, such as

text in a numerical variable, this was replaced with the blank value np.nan. This was largely

carried out using the GDIhelper.py and genericDataIn.py scripts.
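A helper of the kind described might look like the sketch below. The function name to_number is hypothetical, and the choice to express percentages as fractions is an assumption made for the example; the project scripts may have handled these formats differently.

```python
import numpy as np
import pandas as pd

def to_number(value):
    """Convert a raw cell to a float, returning np.nan if it cannot be parsed."""
    if isinstance(value, (int, float)):
        return float(value)
    # Strip thousands separators and the currency symbol before parsing
    text = str(value).strip().replace(",", "").replace("£", "")
    is_percent = text.endswith("%")
    if is_percent:
        text = text[:-1]
    try:
        number = float(text)
    except ValueError:
        # Text in a numerical variable is replaced with the blank value
        return np.nan
    # Assumption for this sketch: percentages are stored as fractions
    return number / 100 if is_percent else number

# A made-up column mixing several formats within one variable
raw = pd.Series(["12.5%", "£1,250,000", "SUPP", 3, None])
clean = raw.map(to_number)
```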

3.3.3. Normalising The data values used range in magnitude from large school budgets to proportions smaller than

one. They are also spread over different distributions. Some of the models used in this analysis,

such as Support Vector Machines, are not scale invariant, so a feature with larger values would

have a disproportionate impact on the results given. Not normalising the data could also lead

to numerical difficulties during the calculation as kernel values usually depend on the inner

products of feature vectors (Hsu, Chang and Lin, 2008).

In order to allow the models to work effectively, the data in all variables used in the analysis

was normalised using the normalise(..) and normaliseSDcol(..) functions in pickColsToUse.py.

This resulted in each variable having a mean value of 0 and a standard deviation of 1.
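The transformation can be sketched in a few lines of pandas. This is not the normalise(..) function itself; the column names and values are invented, and the use of the population standard deviation (ddof=0) is an assumption.

```python
import numpy as np
import pandas as pd

# Made-up variables on very different scales, one with a missing value
df = pd.DataFrame({
    "budget": [1.2e6, 3.4e6, 2.0e6, np.nan],
    "fsm_rate": [0.10, 0.35, 0.22, 0.18],
})

# Subtract each column's mean and divide by its standard deviation;
# pandas skips missing values when computing both statistics
normalised = (df - df.mean()) / df.std(ddof=0)
```

After this step each column has a mean of 0 and a standard deviation of 1, so no variable dominates a distance or kernel calculation purely because of its units.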

3.3.4. Imputation The models used in this analysis required that there were no missing values. As discussed in

section 3.3.1, the input data contained many missing values, with no single variable being

complete. In order to run the models using this data, the missing data had to be imputed.

The imputation method described in Section 2.3.1 was implemented as shown in the extract

from pickColsToUse.py below.

Imputing data will clearly introduce inaccuracy into the dataset because it has been generated

by the imputer rather than being a measured value. It is expected that the imputation would

cause the results to be worse than those obtained if the complete, original data were available.

This risk, however, is accepted due to the necessity of passing complete input data to the

models.

4. Machine Learning Implementation 4.1. Feature Generation As well as the features immediately available in the input data, a selection of extra features

were created based on the existing features.

4.1.1. Categorical data Categorical features must be converted to numerical values for use in the models, which was

achieved using fixCategoricalcols(..) in pickColsToUse.py. For some variables, such as

‘Boarding’ and ‘HasBoys’, this was achieved by grouping categories together to give a binary

output. When there were more than two outcomes for a variable, ‘one-hot encoding’ was

implemented using the pandas get_dummies() function. This ensured that all categories were

treated equally by the model, without imposing an artificial ordering on them.
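Both approaches are illustrated in the sketch below. The category values and the ‘Phase’ column are hypothetical examples, and the code is not an extract of fixCategoricalcols(..).

```python
import pandas as pd

# Made-up categorical data; the category names are illustrative only
schools = pd.DataFrame({
    "Boarding": ["No boarders", "Boarding school", "Not applicable"],
    "Phase": ["Primary", "Secondary", "All-through"],
})

# Two effective outcomes: group the categories into a single binary column
schools["Boarding"] = (schools["Boarding"] == "Boarding school").astype(int)

# More than two outcomes: one-hot encode into one 0/1 column per category
encoded = pd.get_dummies(schools, columns=["Phase"], dtype=int)
```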

4.1.2. Changes over time Many of the datasets used in this work are published annually, with data available for around

5 to 10 years. This was accounted for by generating variables that were a difference between

the current value and the corresponding value a specified number of years earlier. Where such a

difference was generated, it was added as a new variable alongside the original.
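A sketch of such a difference feature is shown below. The helper name add_change_feature and the column naming pattern (variable stem followed by the year) are assumptions made for the example.

```python
import pandas as pd

def add_change_feature(df, stem, current_year, years_back):
    """Add a column holding the change in a variable over a number of years."""
    new_col = f"{stem}_change_{years_back}y"
    out = df.copy()
    # Current value minus the value the specified number of years earlier
    out[new_col] = df[f"{stem}_{current_year}"] - df[f"{stem}_{current_year - years_back}"]
    return out

# Made-up annual values for two schools
df = pd.DataFrame({
    "supply_spend_2018": [50.0, 20.0],
    "supply_spend_2014": [30.0, 25.0],
})
df = add_change_feature(df, "supply_spend", 2018, 4)
```

The original yearly columns are kept, so the model can use both the level and the change.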

4.1.3. Performance data School examination performance in England is published for pupils sitting examinations at the

end of Key Stage 2 (age 11), Key Stage 4 (age 16) and Key Stage 5 (age 18). Schools are

generally either for pupils aged up to age 11 (primary) or for older children (secondary).

Conducting separate primary and secondary analyses was beyond the scope of this project, so

a means of comparing the examination performance of primary and secondary schools was

implemented using the fixPerfCol(..) function in genericDataIn.py as described below.

The Key Stage 2 data selected shows the percentage of children that reach the ‘expected

standard’ in reading, writing and maths. Two main Key Stage 4 metrics were available for

assessing school examination performance: ‘Attainment 8’ and ‘Progress 8’. Attainment 8 is

an integer measure of a pupil’s performance across 8 core subjects and Progress 8 is this same

measure but relative to a pupil’s previous performance (Department for Education (DfE),

2016). Whilst the Progress 8 measure arguably provides a better indication of the school’s

performance, it was not selected because there is not a comparable Key Stage 2 measure which

measures improvement. The Key Stage 4 measure selected shows the average Attainment 8

score per pupil.

For each set of data, the applicable schools were ranked by their performance. The performance

rankings were then converted to percentages, so the median school would receive a score of

50% and the best would receive 100%. The new column added to the main dataset was then

the percentage ranking score for that school, whether it was for Key Stage 2 or Key Stage 4. If

a school had results for both Key Stage 2 and Key Stage 4, a mean was taken of their two

scores.
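The ranking approach can be sketched with the pandas rank method. This is an illustration rather than the fixPerfCol(..) implementation; the school identifiers and scores are invented, and rank(pct=True) is one of several possible ways to turn a ranking into a percentage.

```python
import pandas as pd

# Made-up results, indexed by a hypothetical school identifier
ks2 = pd.Series({"A": 55.0, "B": 70.0, "C": 85.0}, name="ks2")
ks4 = pd.Series({"C": 40.0, "D": 52.0, "E": 46.0, "F": 61.0}, name="ks4")

# rank(pct=True) scales each ranking so that the best school scores 1.0
# and a school in the middle of the ranking scores roughly 0.5
scores = pd.concat([ks2.rank(pct=True), ks4.rank(pct=True)], axis=1)

# Schools with one set of results keep that score; schools with both
# (school C here) receive the mean of the two, as missing values are skipped
perf = scores.mean(axis=1) * 100
```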

4.2. Initial Feature Selection The initial data sources contained over one thousand variables. Many of these were duplicates,

variants of others or not useful. When these were removed, there were still more features than

were necessary for a model with close to 22,000 data points. The initial feature selection was

carried out after consultation with colleagues at the ONS who have worked with these datasets

previously. The strategy used was to keep more features than were expected to be optimal, then

to narrow down the feature space once the initial models had been run.

4.3. Stuck School Labelling The dataset created was unlabelled – it did not contain the school classification labels that

would be needed for modelling. This section describes the process of adding the labels to the

dataset. Selecting the best definition of the classes to be classified by the model would be

important to the success of the project. Two different labelling methods were used, both of

which were binary.

4.3.1. Options for labels The initial basis of the project was to investigate schools classed as ‘Stuck’ by Ofsted (Section

1.2). This group of schools had already received extra focus from the authorities and analysis

had been carried out on them, such as the Ofsted Annual Report (Spielman, 2018). This

definition, however, had some clear flaws.

The first flaw in the existing stuck school definition is that the start date of the time window in

which inspections count towards a school being stuck is fixed at 1st September 2005. As time

passes, this window of time will increase and the effective meaning will change.

A further issue is that a stuck school can only be termed stuck if it has had zero ‘Good or better’

inspections since the starting date. This means that if a school had one ‘Good’ inspection in

September 2005 then it could never be classed as stuck, even if it subsequently received ten

consecutive ‘Inadequate’ (category 4) ratings. Given the expected variation within inspection

results, a single ‘Good’ inspection in 13 years is not unlikely if a school is inspected regularly,

even if it generally receives poor ratings. This would prevent a school from being labelled as

stuck even if it would fit into the thinking behind the stuck definition.

4.3.2. Updated definition of Stuck schools Given the flaws in the existing stuck schools definition, a new definition was agreed based on

input from the Department for Education. This new definition states that, to be stuck, a school

must have:

- At least four previous inspections (including those of predecessors).

- An Overall Effectiveness rating of ‘less than Good’ (category 3 or 4) in each of its four

most recent inspections.

4.3.3. Data selection and labelling With this new definition of stuck schools, it was decided that the modelling should proceed by

dividing the data into three groups:

- Currently Stuck schools (Class=1): As defined in Section 4.3.2. Examples of ratings of schools that would fall into this category (from first inspection to most recent):
    - 4,4,3,4,3
    - 2,2,4,4,4,3

- ‘Escaped Stuck’ schools (Class=0): Schools that have had at least 3 consecutive ‘less than Good’ inspections, with all inspections following this (minimum of one) rated as ‘Good or better’. For example:
    - 4,3,4,2
    - 2,2,3,3,3,2
    - 2,2,4,3,4,4,1,1,2

- All other schools: These schools do not fall into either category and so were removed from the dataset.

These labels allowed a specific question to be asked: If a school has three consecutive ‘less

than Good’ inspections, can we predict whether its next inspection will also be ‘less than

Good’?
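The three groups can be expressed as a short function over a school's rating history. The sketch below is one reading of the definitions above, not the code in schoolClass.py; the function name label_school is hypothetical, and it assumes that the run of ‘less than Good’ inspections must immediately precede the final run of ‘Good or better’ inspections.

```python
def label_school(ratings):
    """Return 1 (stuck), 0 (escaped) or None (neither).

    ratings: Overall Effectiveness categories (1-4), oldest inspection first.
    """
    less_than_good = [r >= 3 for r in ratings]

    # Stuck: at least four inspections and the four most recent all 3 or 4
    if len(ratings) >= 4 and all(less_than_good[-4:]):
        return 1

    # Count the trailing run of 'Good or better' inspections
    i = len(ratings) - 1
    n_good = 0
    while i >= 0 and not less_than_good[i]:
        n_good += 1
        i -= 1

    # Count the run of 'less than Good' inspections immediately before it
    n_bad = 0
    while i >= 0 and less_than_good[i]:
        n_bad += 1
        i -= 1

    # Escaped: three or more consecutive poor ratings, then only good ones
    if n_good >= 1 and n_bad >= 3:
        return 0
    return None
```

Applied to the examples listed above, the function returns 1 for both stuck histories and 0 for all three escaped histories, while a history such as 2,2,2 returns None and would be removed from the dataset.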

The new class definitions clearly defined the training data labels required and the subset of

schools that the model could be used to predict in future. They also ensured that there was no

overlap between the classes and resulted in a relatively even split between the two classes. A

significant drawback of this approach is that a large proportion of the input data was being

removed and not used in the modelling.

4.3.4. Labelling method The Stuck school definitions of Section 1.2 and 4.3.2 needed to be applied to the dataset. The

selected definition could then be appended onto the dataset as a binary ‘Class’ column which

could be used as the data labels for training and testing the models.

The list of schools which fit into either definition of stuck was not available, so the

schoolClass.py script was written to identify which schools met the different criteria. This was

achieved by first defining an Inspection class in Python and generating an instance for each of

the ~100,000 inspections in the Ofsted data with the attributes listed in Section 3.1.1. Next, a

School class was defined and an instance generated for each school (open or closed) whose

URN matched the URN attribute of an Inspection, populating the School instance with

information from the dataset.

In order to incorporate inspections from predecessor schools, the predecessor URNs were first

identified and added to the relevant School instance using addPredecessorURNsFromDF(..).

For each School, the addPredecessorInspections(..) function was called recursively to work

through the predecessor schools identified and read in all of their Inspection instances. This

ensured that, even if a school’s predecessor itself had a predecessor, then the inspections would

be counted correctly. This then made it easy to work out which schools were stuck (old

definition) using calcStuck(..).
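The recursive collection of predecessor inspections can be sketched as below. The class attributes and the function all_inspections are hypothetical simplifications introduced for illustration; they are not the Inspection and School classes or the addPredecessorInspections(..) function from schoolClass.py.

```python
from dataclasses import dataclass, field

@dataclass
class Inspection:
    urn: int
    year: int
    rating: int

@dataclass
class School:
    urn: int
    inspections: list = field(default_factory=list)
    predecessor_urns: list = field(default_factory=list)

def all_inspections(urn, schools, seen=None):
    """Collect a school's inspections and those of all its predecessors."""
    seen = set() if seen is None else seen
    # Guard against unknown URNs and against visiting a school twice
    if urn in seen or urn not in schools:
        return []
    seen.add(urn)
    school = schools[urn]
    found = list(school.inspections)
    # Recurse so that predecessors of predecessors are also included
    for pred in school.predecessor_urns:
        found.extend(all_inspections(pred, schools, seen))
    return sorted(found, key=lambda insp: insp.year)

# Made-up chain: school 3 replaced school 2, which replaced school 1
schools = {
    1: School(1, [Inspection(1, 2007, 4)]),
    2: School(2, [Inspection(2, 2011, 3)], [1]),
    3: School(3, [Inspection(3, 2016, 2)], [2]),
}
history = [insp.rating for insp in all_inspections(3, schools)]
```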

Further functions could then identify which schools were open, sort the inspections by year and

carry out further analysis, such as plotting the total inspection history of each open school as

shown in the following section (Figure 3).

4.4. Data Summary A summary of the number of schools in each category is shown in Table 1.

Table 1: Numbers of each category of school

All inspected schools | All open schools | Class=1 ‘Stuck’ | Class=0 ‘Escaped’
32131 | 21943 | 805 | 908

The overall inspection histories of all schools are shown in Figure 3. The figure shows, for

example, that approximately 3,500 schools have an inspection history of 3 ‘Good or better’

inspections and 0 ‘less than Good’ inspections.

Figure 3: Frequency of each overall inspection history

A series of histograms are plotted in Figure 4 from classComparisonForEachCol.py. Each plot

shows three overlaid histograms which represent the distribution of values between the classes,

with the orange 'Escaped' bars being partially transparent to see the blue 'Stuck' bars behind.

Variable descriptions are provided in Section 9.1. The histogram y-axes are normalised to show

the relative frequencies of each class as opposed to absolute values. They are plotted before the

dataset was normalised so that the x-axis is on the original scale of the variable. For binary

values, 1 represents True and 0 represents False.

Some basic, generalised observations from the plots are as follows:

- Both stuck and escaped schools have much greater rates of pupils eligible for free

school meals, low achievement in key stage 1 and English as an additional language

than the average English state-funded school.

- A much higher proportion of stuck schools than escaped schools are secondary, and both

proportions are above the overall average. The same applies for academies.

- Stuck schools tend to have generally worse exam performance than escaped schools,

with both classes having generally inferior performance to all other schools.

- Stuck schools tend to have a lower Pupil:Teacher ratio than escaped schools.

- Stuck schools tend to have a greater pupil absence rate than escaped schools. Both

groups have frequencies of absence of 0.05% or greater that are considerably higher

than average.

- Stuck schools tend to have more pupils than escaped schools.

- Stuck schools tend to have had a greater increase in supply staff spend since 4 years

ago than escaped schools.

- There is not a clear link between teacher pay and a school being stuck or escaped.

Figure 4: Histograms showing distribution between 'Stuck' schools, 'Escaped' schools and schools that fall into neither category

4.5. Principal Component Analysis A principal component analysis was run with all input features using

sklearn.decomposition.PCA and the results are shown in Figure 5. The plot shows that 25 principal

components are required to capture 80% of the variance and 39 principal components are

required to capture 90% of the variance.
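The number of components needed to reach a given share of the variance can be read from the cumulative explained variance ratio. The sketch below uses randomly generated data in place of the school features, so its component counts will not match those reported here.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic correlated features standing in for the normalised dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10)) @ rng.normal(size=(10, 10))

# Fit with all components retained, then accumulate the variance ratios
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# searchsorted returns the first index at which the cumulative share
# reaches the target; adding 1 converts the index to a component count
n_80 = int(np.searchsorted(cumulative, 0.80) + 1)
n_90 = int(np.searchsorted(cumulative, 0.90) + 1)
```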

Figure 5: Principal Component Analysis - Cumulative explained variance versus number of principal components used

Stuck, escaped and all other schools are plotted on Principal Component 1 vs Principal

Component 2 axes in Figure 6 using plotClassesOnPCA(..) in PCAallFeatures.py. Note that

points are plotted in the order shown, with the grey points ‘beneath’ the other points if they are

coincident, and the orange points for escaped schools are on top.

The scatter plot shows that the classes are heavily intermingled in terms of principal

components, suggesting that building a classifier to separate them could be challenging. The

first two principal components do, however, only account for 18% of the variance in the data

so there is a large quantity of information that is not contained in the plot.

Figure 6: The results of Principal Component Analysis - the two classes and the ‘other’ remaining schools that fall in neither class are plotted on Principal Component 1 vs Principal Component 2 axes

5. Results During this work, a series of experiments and sub-experiments were undertaken, with the

overall aim of finding the optimum predictive model for this dataset. The model components

investigated in this chapter are all interlinked. For example, the best hyperparameters for a

Support Vector Machine model will depend on the features input to the model whilst the best

choice of features for a Support Vector Machine model will depend on the hyperparameters

set. The approach taken was therefore iterative.

5.1. Assessment Methods Used The number of cross validation folds (Section 2.4.1), 𝑘, used in this project is 5. This provides

a balance between providing enough folds to smooth out the results sufficiently and the

processing time required to effectively run each model five times. All results in this report have

used 5-fold cross validation, where the result provided is the average of the individual scores

of the 5 folds.
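A sketch of 5-fold cross validation with averaging over the folds is given below. It uses synthetic data and a K-Nearest Neighbours classifier as stand-ins, and the stratified, shuffled split is an assumption rather than a documented setting of the project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Five folds; stratification keeps the class split similar in each fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = cross_validate(
    KNeighborsClassifier(n_neighbors=5), X, y,
    cv=cv, scoring=["accuracy", "roc_auc"],
)

# Scores are returned per fold, so the average is taken as a separate step
mean_scores = {
    name: float(np.mean(values))
    for name, values in results.items()
    if name.startswith("test_")
}
```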

Many different metrics were considered when assessing the performance of a single model

(Section 2.4). The simplest measure, accuracy, was used heavily as this is the basic requirement

of a classifier. The split of the data between the two classes was 47:53, so the classes were

close enough to balanced for a reported accuracy value not to be misleading.

When accuracy values were not at a sufficiently high level, the area under the ROC curve was

considered more strongly. This is because, for models which allow tuning such as Random

Forests and Logistic Regression, the operating point of the model could be changed. For a

model with low accuracy, the class boundary decision threshold could be increased. This would

reduce the number of schools predicted to be in the positive class (and therefore reduce recall),

but precision would increase as a higher proportion of these positive predictions would be

correct. It was noted that there is a clear correlation between model accuracy and area under

the ROC curve, so choosing between the two measures had a relatively small impact on the

results.
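The effect of moving the decision threshold can be seen in a small example. The probabilities below are invented for illustration; in practice they would come from a fitted model's predict_proba output.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up true classes and predicted probabilities of class 1
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
proba = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1])

def scores_at(threshold):
    """Precision and recall for class 1 at a given decision threshold."""
    y_pred = (proba >= threshold).astype(int)
    return precision_score(y_true, y_pred), recall_score(y_true, y_pred)

# Default operating point, then a stricter one
p_default, r_default = scores_at(0.5)
p_strict, r_strict = scores_at(0.75)
```

With these values, raising the threshold from 0.5 to 0.75 lifts precision from 0.75 to 1.0 while recall falls from 0.75 to 0.5, which is the trade-off described above.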

For some of the measures of performance, the results could be misleading. For example, in the

case with the original definition of stuck schools, a model would achieve over 98% accuracy

by simply predicting every school not to be stuck. In the following results, therefore, minimum

values for a selection of measures are set (Table 2) to reduce the chances of an unhelpful

classifier being selected.

Table 2: Minimum values of each metric to qualify.

Accuracy | AUC | F1 | F0 | Recall 1 | Recall 0 | Precision 1
0.6 | 0.6 | 0.25 | 0.25 | 0.1 | 0.1 | 0.1

5.2. Computing Set up for Modelling The models were set up and run using genericModelClass.py. First, an instance of ModelData

was generated for a run. This contained the data, carried out the train/test split and ran

oversampling and/or recursive feature elimination if specified. Next, an instance of the

specified model type, for example RandomForest, was generated. This inherits from the Model

parent class which has numerous get..(self) and set..(self) methods as well as plotROC(self).

Each instance of a child class of Model is initialised with the ‘dataName’, a ModelData instance

and a dictionary of run parameters.

ModelData instances were stored in modelDataDict and Model instances in modelDict. The

naming of the ModelData and Model instances with generated, descriptive names ensured that

each ModelData instance was only generated once and could be reused, avoiding wasting

computing resource. The fitModel(self, [params]) method for each Model instance was only

called once it had been checked that it had not previously been run.

The runAGroup(..) function was set up to work through each combination of the data inputs

(RFE options and whether or not to use oversampling) for each model type and pass inputs to

runsForModels(..). In the extract of runAGroup(..) below, the model postprocesses and clears

the modelDict if it has greater than 400 entries to save memory.

Every combination of the parameter dictionary of lists was generated using itertools in the two

lines of code indicated below (Rees, 2017) in runsForModels(..), and a random subset (using

random.sample) of the selected size was passed sequentially to oneFullRun(..).
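The general technique of expanding a dictionary of lists and sampling from the result can be sketched as follows. This is not the cited code from runsForModels(..); the parameter names and values are examples, and the fixed seed is only there to make the sketch repeatable.

```python
import itertools
import random

# Example dictionary of lists: one list of candidate values per parameter
param_lists = {
    "n_neighbors": [5, 21, 41],
    "p": [1, 2],
    "weights": ["uniform", "distance"],
}

# Every combination of the values, rebuilt as one dictionary per run
keys = list(param_lists)
all_combos = [
    dict(zip(keys, values))
    for values in itertools.product(*param_lists.values())
]

# Draw a random subset of the selected size, without replacement
random.seed(0)
n_runs = min(5, len(all_combos))
selected = random.sample(all_combos, n_runs)
```

Each dictionary in selected can then be passed to a model run in turn, which gives a random search over the grid rather than a full grid search.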

The oneFullRun(..) function then generates the data name, the train/test sets for cross

validation, instances of ModelData and Model and fits the model to the data. The regular

expression re module was used here.

Once the runs were completed, the individual .csv files generated were recombined into one

file in analyseAvgCSVs.py. Columns for the different parameters and run settings (all identified

from the run names using regular expressions) were added to allow analysis. The plots were

then generated using plotBestModel.py.

5.3. Experiment 1: Finding the optimum machine learning classification model The optimum machine learning classification model used depended on many aspects, each of

which was investigated in the following sub-experiments.

5.3.1. Experiment 1.1: Finding the best model type There are a range of well known models available for binary classification, of which six were

selected for consideration at the outset (Section 2.2). In order to compare the effectiveness of

the models, each model type was first optimised as described below and illustrated in Figure 7.

An initial hyperparameter search was carried out for each model type, based on an initial

selection of approximately half of the available variables, chosen because they were expected to be of high

importance to the model. The set of hyperparameters for each model that resulted in the highest

average accuracy score was selected. These six models were then run through Sequential

Forward and Backward Selection. For each of the six models, the run (forwards or backwards)

that led to the highest accuracy score was used. A further hyperparameter search was then

carried out for each model. This used the set of features that had resulted in the highest accuracy

from Sequential Forward or Backward Selection for that model and searched a narrower

hyperparameter search space based on the results of the initial hyperparameter search.
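The forward and backward selection step can be sketched with Scikit-learn's SequentialFeatureSelector. This class and the fixed target of three features are assumptions made for the example; the dissertation does not state which implementation was used or how the number of features was chosen. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the school features
X, y = make_classification(n_samples=200, n_features=8, n_informative=3,
                           random_state=0)
model = LogisticRegression(max_iter=1000)

best = {}
for direction in ("forward", "backward"):
    # Add (forward) or remove (backward) one feature at a time, keeping
    # the change that gives the best cross-validated accuracy
    sfs = SequentialFeatureSelector(model, n_features_to_select=3,
                                    direction=direction,
                                    scoring="accuracy", cv=5)
    sfs.fit(X, y)
    mask = sfs.get_support()
    score = cross_val_score(model, X[:, mask], y, cv=5,
                            scoring="accuracy").mean()
    best[direction] = (score, mask)

# Keep whichever direction led to the higher accuracy
chosen_direction = max(best, key=lambda d: best[d][0])
```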

Figure 7: Process for selecting optimum model for each model type

The performance of the six optimised models was then compared across different measures to

ascertain which was the most effective.

5.3.1.1. Results The best performing model for each of the six model types was selected. The models selected

were the models with the highest area under the ROC curve, subject to minimum threshold

scores as specified in Table 2.

Figure 8 shows that the model with the highest area under the ROC curve was the K-Nearest

Neighbours model, which also had the highest accuracy. Neural Networks, Support Vector

Machines and Random Forests all have a similar accuracy whilst the Logistic Regression model

has a higher recall for class 1 and a lower recall for class 0. The K-Nearest Neighbours model

has the highest precision for class 1, whilst the Neural Network and Logistic Regression have

the highest precision for class 0. Gaussian Naïve Bayes has the lowest score for five of the six

measures.

[Figure 7 flow chart stages, in order: (1) Initial hyperparameter search: initial feature set, wide hyperparameter range; (2) Sequential Forward/Backward selection: all features, best hyperparameters from initial search; (3) Refined hyperparameter search: feature set selected from SFS or SBS, narrower range of hyperparameters; (4) Sequential Forward/Backward selection: all features, best hyperparameters from refined search; (5) Final hyperparameter search: feature set selected from second round of SFS or SBS, narrower range of hyperparameters; (6) Model used for analysis: final model features and hyperparameters.]

Figure 8: Performance of the best version of each model type. Note that different measures of the same six models are plotted, sorted in order of decreasing area under the ROC curve score.

The parameters used in each model type are shown in Table 3.

Table 3: Characteristics of the best model for each model type, measured by area under the ROC curve and subject to the minimum constraints of Table 2

K-Nearest Neighbours | Neural Network | Support Vector Machine | Random Forest | Logistic Regression | Gaussian Naïve Bayes
No RFE or oversampling | No RFE or oversampling | No RFE or oversampling | No RFE or oversampling | Oversampling used | Oversampling and RFE with 5 best features selected
28 Features | 19 Features | 7 Features | 8 Features | 9 Features | 10 Features (before RFE)
K = 41 | 5 layers | C = 1.455 | 12 estimators | |
P = 1 | 12 nodes per layer | RBF kernel | 6 features | |
 | Adam solver | Gamma = 0.244 | Entropy criterion | |
 | Alpha = 0.0667 | | Bootstrapping used | |

The ROC curves for the six models in Table 3 are plotted in Figure 9. The K-Nearest

Neighbours model’s ROC curve is above the other models’ curves for almost all values of

threshold. All curves exhibit a similar shape, although the Logistic Regression and Gaussian

Naïve Bayes models have a much greater false positive rate than the other models for lower

values of true positive rate.

Figure 9: ROC curve for model with highest area under the ROC curve value for each model type. The lines plotted are the mean scores over the 5 folds of cross validation.

For each model in Table 3, separate ROC curves are plotted in Figure 10. The mean values

correspond to those plotted in Figure 9. The K-Nearest Neighbours model has the greatest range

in performance between folds, with a wider band of one standard deviation from the mean. The

K-Nearest Neighbours, Neural Network, Random Forest and Support Vector Machine models

are all relatively consistent up to a true positive rate of around 0.4, although there is a wider

spread for the Neural Network. The curves for K-Nearest Neighbours are smoother because

there are fewer (k) threshold values from which to plot the curve.

Figure 10: ROC curves for models in Table 3. Each of the five cross validation folds is plotted, along with a mean and the range of one standard deviation

An alternative measure of the best classifier is to select the classifier, subject to the minimum

thresholds in Table 2, with the best precision for data points in class 1: Stuck schools. The

results in Figure 11 show that the Support Vector Machine model has a significantly higher

precision than the other models. It has the lowest recall for class 1 of any of the models, with

a high recall for class 0. It is therefore predicting that a much higher proportion of schools will

be in class 0 than in class 1.

Figure 11: Precision for Stuck class - The best model from each model type in terms of precision for identifying stuck schools. The same six models are plotted in each chart, sorted in order of precision score.

5.3.1.2. Conclusion The performance of the best model of five of the six model types is similar, with Gaussian

Naïve Bayes having an appreciably lower performance. K-Nearest Neighbours produces the

model with the highest area under the ROC curve and the best accuracy. If a high precision for

predicting stuck schools is required, then the Support Vector Machine model plotted in Figure

11 is best. Gaussian Naïve Bayes is again the poorest performer by this measure.

The ROC curve of the K-Nearest Neighbours model in Figure 9 shows that it is the best

performing classifier because its ROC curve is above the other curves for almost all threshold values. For

K-Nearest Neighbours, the Neural Network, Support Vector Machine and the Random Forest,

there is little to choose between the classifiers if the thresholds are set to achieve a true positive

rate of 0.4.

If the alternative measure of precision for stuck schools is used, a Support Vector Machine

model is superior to the other model types. It only identifies 40% of the schools that will

become stuck but around 88% of those that it selects will become stuck. If a smaller, higher

confidence selection of schools is to be identified as likely to become stuck then this model is

superior to the others.

5.3.2. Experiment 1.2: Finding the optimum model hyperparameters

Within many of the machine learning models available there are many parameters that can be

adjusted to fine-tune the working of the model. When running these models, it is important to

consider that the default settings may not provide the best results for the input data supplied.

The parameters for consideration in this experiment, selected based on their perceived

likelihood of improving the model and availability in the Scikit-learn model implementation,

are described for each model in Section 2.2. Logistic Regression and Gaussian Naïve Bayes

did not have any tuneable parameters so do not feature in this experiment.

In order to find the optimum parameters for each model type, a series of parameter searches

were conducted, as described in the following paragraphs. Figure 7 shows how these steps

fitted into the overall process used.

Round 1

For categorical parameters, each value of the parameter was added to the parameter search

space. For each numeric parameter, an initial maximum and minimum value were selected

based on judgement from past experience and literature. This range was made wide enough

that the optimum value was considered highly likely to fall in the range. Between these two

values were then added a series of intermediate values on either a linear or logarithmic scale,

depending on the values involved.

For each model type, all possible combinations of the run parameter values were generated.

Due to the finite computer processing time available, not all combinations could be

investigated. As a result, for each run, the number of parameter combinations to investigate,

e.g. 50, was input. The program would then run the model with 50 different randomly selected

combinations of the parameters. This effectively implemented a subset of the runs available

through a full grid search. It was decided that this would allow a wider investigation of the

parameter space than a full grid search, giving a higher probability of finding optimal parameters

than a full search through a narrower search space. It did, however, lead to the possibility of

missing the optimum combination of parameters from within the ranges selected.
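The random subset of the grid described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the program used in this work. The function name sample_parameter_combinations and the values in the example search space are assumptions made for illustration.

```python
import itertools
import random


def sample_parameter_combinations(search_space, n_runs, seed=0):
    """Return n_runs distinct, randomly chosen combinations from the full grid."""
    names = sorted(search_space)
    # Every possible combination of the listed values (the full grid).
    grid = list(itertools.product(*(search_space[name] for name in names)))
    rng = random.Random(seed)
    chosen = rng.sample(grid, min(n_runs, len(grid)))
    return [dict(zip(names, combo)) for combo in chosen]


# Hypothetical Support Vector Machine search space (illustrative values only).
svm_space = {
    "kernel": ["rbf", "poly"],
    "C": [10 ** e for e in range(-2, 4)],      # logarithmic scale
    "gamma": [10 ** e for e in range(-4, 2)],  # logarithmic scale
}

combos = sample_parameter_combinations(svm_space, n_runs=50)
print(len(combos), "of", 2 * 6 * 6, "grid points sampled")
```

Each dictionary in combos would then be passed to the model and scored by cross validation. Sampling without replacement means that no combination is run twice within a round.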

For each combination of parameter values selected, the model was run and its cross validation

results were recorded.

Rounds 2 and 3

For each parameter, the accuracy and AUC scores for the most effective models from the

previous round were plotted against the parameter values. From inspection of these plots, the

range of values which was most likely to contain the optimum was identified. The parameter

values in the search space were then adjusted, reducing the range and increasing the resolution

within the range. The parameter search was then run again on the narrower search space.
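In this work the narrowing of each range was done by inspecting plots. As a rough automated stand-in for that judgement, the sketch below takes the parameter values of the best-scoring runs and builds a finer logarithmic grid between them. The function name refine_log_range, the choice of top_n and the example scores are all assumptions for illustration and are not results from this project.

```python
import math


def refine_log_range(results, top_n=5, n_points=8):
    """Build a finer log-spaced grid spanning the values of the top_n runs.

    results is a list of (parameter_value, score) pairs from the previous round.
    """
    best = sorted(results, key=lambda r: r[1], reverse=True)[:top_n]
    lo = min(value for value, _ in best)
    hi = max(value for value, _ in best)
    if lo == hi or n_points < 2:
        return [lo]
    step = (math.log10(hi) - math.log10(lo)) / (n_points - 1)
    return [10 ** (math.log10(lo) + i * step) for i in range(n_points)]


# Hypothetical (C value, accuracy) pairs from an earlier round.
previous_round = [(0.01, 0.68), (0.1, 0.71), (1, 0.74),
                  (10, 0.75), (100, 0.72), (1000, 0.66)]
next_grid = refine_log_range(previous_round, top_n=3, n_points=5)
print([round(value, 2) for value in next_grid])
```

The new grid covers a reduced range at a higher resolution, which matches the intent of Rounds 2 and 3, although a plot-based choice can also take the spread of scores into account.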

5.3.2.1. Results

The results of the final hyperparameter search for each of the four models with tuneable

parameters are shown in the following figures. They are not subject to the minimum threshold

values specified in Table 2. Recursive Feature Elimination and Oversampling are not used in

this section. The final selected parameters for each model type are shown in Table 3.

Figure 12: Variation of AUC and Accuracy with different hyperparameters for Support Vector Machines using features listed in section 9.2

The Support Vector Machine results in Figure 12 show that accuracy and AUC appear to peak

at around a C value of 10. The gamma value has been heavily investigated at values less than

0.1 but there appears to be a peak in AUC between 0.1 and 1. This is not reflected in the values

of accuracy, where the peak is between 0.01 and 0.1. The higher AUC values with gamma

greater than 0.1 largely coincide with C values between 1 and 10. The radial basis function

kernel outperformed the polynomial kernel of degree 1, 2 and 3. Of the polynomial kernels, the

best performance was with a polynomial kernel of degree 2. This was investigated further, as

shown in Table 4. Higher order polynomials were trained and tested in the same way as the

other models, and using the same C and gamma values as in the Support Vector Machine in

Table 3. High order polynomial kernels were not used throughout the experiment due to the

lack of sufficient available computing power.

Table 4: Difference in training and test set accuracy when the kernel is changed. Each run is otherwise identical, using the parameters shown in Table 3

Kernel                    Training accuracy    Test accuracy
RBF (Gamma=0.2442)        85%                  72%
Polynomial, Degree 1      74%                  72%
Polynomial, Degree 2      79%                  72%
Polynomial, Degree 3      86%                  71%
Polynomial, Degree 4      87%                  70%
Polynomial, Degree 5      87%                  70%
Polynomial, Degree 6      87%                  69%
Polynomial, Degree 7      86%                  68%
Polynomial, Degree 8      85%                  66%

Figure 13: Variation of AUC and Accuracy with different hyperparameters for Neural Networks using features listed in Section 9.2

The Neural Network results in Figure 13 show that the Adam solver scored best on both

measures. There is little variation in the peak results when the number of layers and nodes per

layer are varied, with a clear decrease in accuracy and AUC when there are fewer than 3 nodes

per layer. Varying alpha over a large logarithmic scale appears to have a negligible effect on

the model performance.

Figure 14: Variation of AUC and Accuracy with different hyperparameters for Random Forests using features listed in Section 9.2

The accuracy and AUC of a Random Forest model are shown to be generally higher when

using entropy as the scoring criterion. The best scores do not appear to be attained when large

numbers of estimators are used, with higher scores achieved when fewer than 250 estimators

were used. It is noted, however, that the highest scores do coincide with the most densely tested

area of the parameter search space. A peak in both AUC and accuracy appears to exist when

the maximum number of features to consider is set between 5 and 10. The use of bootstrapping

tends to result in higher scores of AUC and accuracy, but the small number of highest AUC

scores occurred when bootstrapping was not used.

Figure 15: Variation of AUC and Accuracy with different hyperparameters for K-Nearest Neighbours using features listed in Section 9.2

The value of k used in K-Nearest Neighbours (Figure 15) has a clear influence on both AUC

and accuracy, with peaks in the range from 25 to 45. The best scores are achieved with a p

parameter value of 1.

5.3.2.2. Conclusion

For the K-Nearest Neighbours model, the two parameters trained both have a clear impact on

model performance. The model works best when it takes the most common class label amongst

the nearest 25-45 points. For values of k below 4, there is a noticeable drop in performance,

to a level well below that achieved when k is 50. This suggests that there is noise in the model, as just taking

the single nearest neighbour is much worse than averaging over a larger number. Increasing

the value of k smooths the decision boundary. The best value of p to use from the results is 1.
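The effect of k and p can be seen in a small sketch of the K-Nearest Neighbours vote. This is a standard-library illustration rather than the Scikit-learn implementation used in this work. The toy points, including one deliberately noisy class 1 point placed inside the class 0 cluster, are assumptions for illustration.

```python
from collections import Counter


def minkowski(a, b, p):
    """Minkowski distance: p=1 is Manhattan distance, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_points, train_labels, query, k, p=1):
    """Return the most common label amongst the k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: minkowski(train_points[i], query, p))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy data: a class 0 cluster, a class 1 cluster and one noisy class 1 point.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5), (0.2, 0.2)]
labels = [0, 0, 0, 1, 1, 1, 1]
query = (0.3, 0.3)

print("k=1:", knn_predict(points, labels, query, k=1))  # follows the noisy point
print("k=3:", knn_predict(points, labels, query, k=3))  # outvoted by the cluster
```

With k=1 the prediction follows the single noisy neighbour, whereas with k=3 the surrounding class 0 points outvote it. This is the smoothing behaviour described above.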

The Adam solver proved to be the most effective for the Neural Network models, which was

expected due to its widespread popularity. There appears to be little variation in performance

when the number of layers is varied between 2 and 7, and the number of units per layer varied

between 3 and 19. If this behaviour is representative of the reality, a smaller network would be

preferred to reduce processing times and complexity. Alpha does not appear to affect

performance in the range tested.

The Support Vector Machine optimum model has a C value of just greater than 1. This implies

that a balance between fitting the data points accurately, but without overfitting, is required as

the penalty for misclassifying a point is not too great. The kernel that gave the best results was

the radial basis function kernel. This agrees with the expectation that it is a useful default

kernel. It is perhaps more surprising that the polynomial kernel of degree 2 performs better

than the kernel of degree 3. This indicates that the best decision boundary found in these

experiments is not overly complex; this is discussed further below.

Given the highly interspersed nature of the data (Figure 6), the most complex kernel function

was expected to have the most success. This result, and the accuracy rates of around 70%,

suggests that the best Support Vector Machine investigated does not have a highly complex

boundary (in the original, non-transformed feature space) and simply finds areas of higher

density of each class and separates them. It also suggests that, if high levels of complexity do

not lead to better results, there is a tendency for the model not to generalise well, i.e. to

overfit the training data.

On further investigation, Table 4 shows that increasing model complexity does result in an

increased accuracy score on the training set. This suggests that, to achieve higher training

scores, a more complex model would be required than those that have been used in this

experiment. Higher order polynomials were found to increase the training accuracy slightly up

to a maximum of 87% before diminishing. The table shows, however, that increasing the degree

of the polynomial decreases the test set accuracy, meaning that the model is becoming more

overfitted to the training data. For the same accuracy on the training data, the radial basis

function kernel performs better on the test data.
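The overfitting trend can be made explicit by computing the gap between training and test accuracy for each row of Table 4. The short sketch below does only this arithmetic. The accuracy values are copied from Table 4 and the dictionary keys are shorthand labels introduced here.

```python
# (training accuracy %, test accuracy %) copied from Table 4.
table_4 = {
    "RBF": (85, 72),
    "Polynomial 1": (74, 72),
    "Polynomial 2": (79, 72),
    "Polynomial 3": (86, 71),
    "Polynomial 4": (87, 70),
    "Polynomial 5": (87, 70),
    "Polynomial 6": (87, 69),
    "Polynomial 7": (86, 68),
    "Polynomial 8": (85, 66),
}

# Gap between training and test accuracy, in percentage points.
gaps = {kernel: train - test for kernel, (train, test) in table_4.items()}

for kernel, gap in gaps.items():
    print(f"{kernel:<13} gap = {gap} percentage points")
```

The gap grows from 2 percentage points for the degree 1 polynomial to 19 for degree 8. The radial basis function kernel and the degree 8 polynomial both reach 85% on the training set, but their gaps are 13 and 19 points respectively.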

For Random Forests, the more estimators used, the better the model is expected to perform,

with the performance reaching an asymptotic limit. It is therefore desirable to find the smallest

number which gives an appropriate performance level. Small numbers of estimators are not

shown because they were ruled out in earlier rounds of the hyperparameter search as being

ineffective. The results show that the model performance varies little with the number of

estimators, so using large numbers of estimators appears to be a waste of computing power.

The highest scores, however, are for models with fewer than 250 estimators, which defies

expectations. This can be explained by the random nature of the Random Forest.

The highest scores are located where the parameter has been tested the most. Random Forests

use randomness in two ways to build the trees (Section 2.2.2). Each Random Forest, even if

generated using exactly the same hyperparameters, will therefore likely generate different

results. By repeatedly testing in a confined search space, it is to be expected that a greater range

of outcomes is achieved. As this experiment is concentrating on the highest scoring models,

this distorts the results, as results which are unusually high are reported as the optimal solutions.

If the same density of testing was carried out over the range from 250 to 2000 estimators, it is

expected that results similar to (or slightly better than) those for below 250 estimators would

be attained.
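The distortion described above, where the best score is higher simply because a region has been tested more often, can be illustrated with a small simulation. This is a sketch under stated assumptions and not a re-run of the experiment: every run is given the same true score, and the true score, the noise level and the run counts are all hypothetical.

```python
import random
import statistics


def best_of(n_runs, rng, true_score=0.72, noise_sd=0.01):
    """Highest observed score among n_runs noisy evaluations of one true score."""
    return max(rng.gauss(true_score, noise_sd) for _ in range(n_runs))


rng = random.Random(0)
n_trials = 500

# Sparsely tested region: 10 runs. Densely tested region: 200 runs.
sparse_best = [best_of(10, rng) for _ in range(n_trials)]
dense_best = [best_of(200, rng) for _ in range(n_trials)]

sparse_mean = statistics.mean(sparse_best)
dense_mean = statistics.mean(dense_best)
print(f"mean best score, 10 runs:  {sparse_mean:.4f}")
print(f"mean best score, 200 runs: {dense_mean:.4f}")
```

Even though both regions have identical underlying performance in this simulation, the densely tested region reports the higher best score on average, because the maximum of many noisy draws is larger than the maximum of few.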

The maximum number of features to consider peaks between 5 and 10, so a value is selected

from that range. Bootstrapping and the decision criterion selected do not appear to make too

much difference to the results.

Overall, there can be reasonable confidence that, for the parameters that were available, the

models are near their optimum. A more thorough search could be completed with more

computing power, which would have three effects. The first is that the parameter values could

be further refined, for small potential gains in performance. This could be achieved by using a

finer and finer resolution grid search, reducing the range of values tested each time. This would

also be helped by completing the full searches as opposed to random subsets.

More computing power could also be used to investigate the random nature of the results more

thoroughly, to characterise the noise and identify which results are due to noise and which are

due to the effect itself. This could most easily be done by simply repeating runs multiple times.

The final way these hyperparameters could be optimised if more computing power were

available would be to investigate a wider range of parameters. Many parameters could not be

investigated as thoroughly as desired because of the computing power required. For example,

neither high order polynomial kernels nor high C values for Support Vector Machines could

be used as they took too long to compute. Deeper Neural Networks with more combinations of

hidden layers could also be investigated.

5.4. Experiment 2: Finding the optimum features to input to the model

The second aim of the project was to identify which features had the greatest importance in

predicting the future performance of a poorly performing school. For some of the models used,

such as Logistic Regression, the model can easily generate a measure of the importance of each

feature in the model. For other models, such as Support Vector Machines and Neural Networks,

there is no direct method of determining the importance of an individual feature to a model. As

shown in Section 5.3.1, Support Vector Machines and Neural Networks have proven to be

some of the more successful models and, as a result, the selection of input features for these

models is an important consideration.

One way to infer a ranking of feature importance to the models is by using Sequential Forward

or Backward Selection (Section 2.3.2). This can be used on any classification model as it

simply trains and tests the model with different combinations of features and records which

features result in the best scores. This is, however, extremely computationally intensive and so

can only be used sparingly.

5.4.1. Experiment 2.1: Investigating whether Sequential Forward Selection or Sequential Backward Selection gives better results

Sequential Selection is a greedy algorithm for selecting the optimum combination of features

for a given model (see Section 2.3.2) and was implemented in SFS.py. As a single measure of

model effectiveness must be used, and input prior to running the algorithm, the accuracy was

selected. On each iteration of the algorithm, the feature that resulted in the highest accuracy

would be added to or removed from the selected feature set.

Because the algorithm is greedy and does not necessarily reach an optimal solution, the results

of forward and backward selection can be different.
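The greedy forward variant can be sketched as below. This is not the code in SFS.py, which is not reproduced here. The function sequential_forward_selection and the toy scoring function are hypothetical, with the toy score standing in for the cross-validated accuracy of a fitted model.

```python
def sequential_forward_selection(features, score, max_features=None):
    """Greedily add the feature that gives the highest score at each step.

    score is any callable that maps a list of feature names to a number.
    Returns the best subset seen, its score and the full selection history.
    """
    selected = []
    remaining = list(features)
    history = []
    limit = max_features or len(remaining)
    while remaining and len(selected) < limit:
        best_feature = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best_feature)
        remaining.remove(best_feature)
        history.append((list(selected), score(selected)))
    best_subset, best_score = max(history, key=lambda h: h[1])
    return best_subset, best_score, history


# Hypothetical stand-in for cross-validated accuracy: a base rate plus a fixed
# contribution per feature (feature "c" is harmful).
weights = {"a": 0.10, "b": 0.05, "c": -0.02, "d": 0.01}


def toy_score(subset):
    return 0.60 + sum(weights[f] for f in subset)


best_subset, best_score, history = sequential_forward_selection(list(weights), toy_score)
print(best_subset, round(best_score, 2))
```

In the toy example each feature contributes independently, so the greedy path is optimal. With a real model the contributions interact, which is why forward and backward selection can end at different feature sets.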

5.4.1.1. Results

Figure 16 shows the results of the experiment, which was the second round of sequential feature

selection (see Figure 7). In each graph, a single set of model hyperparameters was used. The

model accuracy is plotted for each number of features, for both forward and backward

selection. The forward selection was limited to the first 50 features and both forward and

backward selection had a minimum of three features. A dotted line indicates the highest

accuracy achieved during the process.

Figure 16: Sequential Forward/Backward Selection results: Model accuracy for different numbers of features

In each of the plots, the maximum accuracy is achieved with forward selection. The difference

between the maximum accuracy for forward and backward selection ranges from 1% to 5%.

For most sizes of feature set, forward selection selects features which lead to a higher accuracy.

The only exceptions to this are for feature sets in the range 30-50 features for the Neural

Network and Support Vector Machine. In these cases, backward selection gave an accuracy

that was similar to or better than that from forward selection.

K-Nearest Neighbours shows that there is not much change in performance from feature sets

of size 15 to 45 features for forward selection. For the other models, a peak in the range of 10

to 20 features is followed by a clear decrease in accuracy when further features are added.

5.4.1.2. Conclusion

As shown by the results of this experiment, for the models selected, sequential forward

selection is more effective than sequential backward selection. The maximum accuracy

achieved is consistently higher for forward selection and, for a given size of feature set, the

model accuracy is almost always higher when forward selection is used.

For most models, a clear peak exists which indicates a possible feature set to select for use.

With K-Nearest Neighbours, however, the accuracy stays within a range of around 0.5% whilst

a further 30 features are added.

The plots show a difference in performance between forward and backward selection. This is

due to the algorithms being greedy and heading for local maxima. It therefore cannot be

concluded with confidence that the optimum set of features is that which results in the highest

accuracy score in this sequential feature selection process.

5.4.2. Experiment 2.2: Investigating whether Recursive Feature Elimination improves the set of features selected

Recursive Feature Elimination (Section 2.3.3) is another technique for feature selection. A

large number of runs were carried out in which RFE was incorporated at different levels. Each

run had a specified number of features to use: 5, 10, 15, 20, 25 or All (no RFE). The features

selected by RFE were used and the others were removed from the input data to the model prior to

model fitting. The input feature sets used were those selected as optimum from the second

Sequential Forward Selection run in Section 5.4.1. Where the number of features for selection

in RFE is greater than the number of features in the input feature set, the RFE process has no

effect and the full input feature set is used. The hyperparameters used were the ranges of

hyperparameters tested in the final hyperparameter search.

The aim of this experiment was to find out if the set of features input to the model would prove

to be the best, or whether a subset of the features would result in an improved model.

5.4.2.1. Results

Figure 17 shows the results of the experiments for Recursive Feature Elimination.

Figure 17: Recursive Feature Elimination - For each of the six different models, for each level of RFE, the highest score of Area Under the ROC curve and Accuracy are plotted

The results show that applying recursive feature elimination to a set of features which has been

selected through sequential feature selection does not result in an increase to the area under the

ROC curve or the model accuracy. The K-Nearest Neighbours plots show that the more features

that are used, the higher the accuracy of the model. The Gaussian Naïve Bayes plot does not

have a value for where the RFE value is 5 because no run met the minimum thresholds set (see

Section 5.1).

5.4.2.2. Conclusion

The plots show that applying recursive feature elimination to take a subset of the input features,

which were themselves a subset of the initial features, does not improve the performance of the

model. This adds confidence that the features selected using SFS are a good selection. As the

RFE process and SFS process are different, the fact that they do not contradict each other

increases confidence in the results obtained.

5.4.3. Key features selected in most effective model

The features selected for each model are shown in the appendices (Section 9.2).

Of the six feature sets selected, the features that are common to at least two sets are shown in

Table 5. As the most frequently occurring features only occur in half of the models, there is

clearly variation in which features work best for different models.

Table 5: Features that appear in more than one of the selected 6 models

Feature name                          Frequency
Energy_2yrDiff                        3
Supply.Staff_2yrDiff                  3
TotalRevBalance Change 7yr            3
ISSECONDARY                           2
Other_2yrDiff                         2
Energy_4yrDiff                        2
Total revenue balance (1) 2017-18     2
HasBoys                               2
Learning.Resources.2018               2
Special                               2
HasGirls                              2
Back.Office_4yrDiff                   2
Energy.2018                           2
TotalRevBalance Change 4yr            2
Premises_4yrDiff                      2

The importance of each feature can be calculated from a Random Forest as the decrease in node

impurity weighted by the probability of reaching that node (Ronaghan, 2018). This was first

calculated for the features in the Random Forest model used, with the results shown in Table 6.

Table 6: Importance of each feature to the selected Random Forest model

Feature name                  Feature importance
TotalRevBalance Change 4yr    0.31334
TotalRevBalance Change 7yr    0.253867
Catering.2018                 0.126941
Teaching.Staff.2018           0.107894
Self.Income_2yrDiff           0.095085
AGEL                          0.041989
ISSECONDARY                   0.026097
ISPRIMARY                     0.016324
GOR_North West                0.00784
GOR_East Midlands             0.005651
Special                       0.003589
HasGirls                      0.001383
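The importance measure described before Table 6, the decrease in node impurity weighted by the probability of reaching the node, can be worked through by hand for a single small tree. The sketch below assumes Gini impurity and uses a hypothetical two-split tree on 100 samples. The feature names and class counts are invented for illustration and are unrelated to the values in Table 6. In a Random Forest the per-tree values would then be averaged over all trees.

```python
from collections import defaultdict


def gini(counts):
    """Gini impurity of a node given its class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, each node weighted by its share of samples."""
    def weight(counts):
        return sum(counts) / n_total

    return (weight(parent) * gini(parent)
            - weight(left) * gini(left)
            - weight(right) * gini(right))


# Hypothetical tree: class counts are given as [class 0, class 1].
n_total = 100
splits = [
    {"feature": "feature_a", "parent": [50, 50], "left": [40, 10], "right": [10, 40]},
    {"feature": "feature_b", "parent": [40, 10], "left": [40, 0], "right": [0, 10]},
]

raw = defaultdict(float)
for split in splits:
    raw[split["feature"]] += weighted_impurity_decrease(
        split["parent"], split["left"], split["right"], n_total)

# Normalise so that the importances sum to 1, as in Table 6.
total = sum(raw.values())
importance = {name: value / total for name, value in raw.items()}
print({name: round(value, 3) for name, value in importance.items()})
```

The root split removes more impurity in absolute terms because every sample reaches it, which is why features used near the top of the trees tend to receive the highest importance.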

The selected Random Forest model was then trained and tested using all of the original input

features. The feature importance was then calculated and is shown in Table 7.

Table 7: Importance of the top 30 features when the selected parameters of the Random Forest model are applied to all features

Feature name                                                            Feature importance
Total revenue balance (1) 2017-18                                       0.126383
TotalRevBalance Change 7yr                                              0.118881
TotalRevBalance Change 4yr                                              0.099824
TotalRevBalance Change 2yr                                              0.078365
Total.Spend.pp_4yrDiff                                                  0.023139
Supply.Staff.2018                                                       0.01918
Supply.Staff_2yrDiff                                                    0.019165
Total revenue balance (1) as a % of total revenue income (6) 2017-18    0.018994
PERCTOT                                                                 0.016633
Supply.Staff_4yrDiff                                                    0.0146
PerformancePctRank                                                      0.014318
Total.Income.pp_4yrDiff                                                 0.011931
AcademyNew                                                              0.011606
Energy.2018                                                             0.011453
Catering.2018                                                           0.011407
Catering_4yrDiff                                                        0.011356
Learning.Resources_2yrDiff                                              0.011065
Back.Office_2yrDiff                                                     0.011061
Teaching.Staff_2yrDiff                                                  0.010531
Premises_4yrDiff                                                        0.010522
ICT.2018                                                                0.010342
Back.Office.2018                                                        0.010228
Consultancy.2018                                                        0.009657
PTFSM6CLA1A__18                                                         0.009642
Mean Gross FTE Salary of All Teachers (£s)                              0.009461
Total.Income.pp.2018                                                    0.009431
ICT_2yrDiff                                                             0.009284
Energy_2yrDiff                                                          0.009065
AGEH                                                                    0.008857
Consultancy_2yrDiff                                                     0.008537

Table 7 shows that the Random Forest classifier places a high value on financial data, with the

top 8 features being of this type. The level of pupil absence is also an important feature, as are

school examination results. The spend on supply teachers also appears to be an important

factor.

5.5. Experiment 3: Finding the optimum operations on the input data

The performance of any model is reliant on the data that is input. A series of experiments were

set up to determine which operations on the input data would result in the most effective model.

5.5.1. Experiment 3.1: Determining whether oversampling improves model performance

Each run in the parameter search was run both with and without oversampling and the results

recorded. This allows direct comparison between the sets of runs as each group (oversampled

or not oversampled) contains runs with identical parameters aside from whether oversampling

is used. The SMOTE oversampling algorithm was implemented as described in Section 2.3.4.

This experiment tests whether the use of this oversampling technique caused an improvement

in the model results.
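The central idea of SMOTE is to create each synthetic minority point on the line segment between an existing minority point and one of its nearest minority neighbours. The sketch below is a simplified standard-library illustration of that idea and is not the implementation described in Section 2.3.4. The function name smote_like, the toy points and the class counts are assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority points."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # The k nearest minority-class neighbours of the chosen point.
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        u = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(b + u * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


# Toy training set: 5 minority points and a hypothetical 8 majority points,
# so 3 synthetic points are needed for a 50:50 split.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
majority_count = 8
new_points = smote_like(minority, n_new=majority_count - len(minority))
print(len(minority) + len(new_points), "minority points after oversampling")
```

Only the training data would be oversampled in this way, so that the test folds still reflect the original class balance.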

5.5.1.1. Results

The results of the experiment are shown in Figure 18.

Figure 18: Oversampling - For each model type, the run with the highest accuracy and area under the ROC curve are shown for the group where oversampling was used on the training data and the group which did not use oversampling.

The plots show that the best runs with and without oversampling have very similar scores for

both accuracy and area under the ROC curve. Logistic Regression shows a higher best accuracy

when oversampling is used.

5.5.1.2. Conclusion

The results show that applying oversampling makes little difference to the best model found

for each model type. In the case of Logistic Regression, an increase in accuracy is observed

when oversampling is applied. The minimal effect of oversampling on this dataset can largely

be explained by the fact that the dataset has a relatively even split between the two classes

(stuck and escaped stuck). As a result, there are few oversampled data points required to make

the split exactly 50:50.

This technique is expected to be far more effective when there is a more uneven split between

the classes, such as when using data with the initial definition of stuck school (see Section 4.5).

5.6. Attributes of Most Effective Model

The most effective model overall is the K-Nearest Neighbours model with k=41 and p=1. The

confusion matrix is shown in Table 8, its scores across different measures are shown in Table

9 and its parameters were specified in Table 3.

Table 8: Confusion matrix for best model

               True 1    True 0
Predicted 1    474       119
Predicted 0    331       789

Table 9: Properties of best model

Accuracy                 0.737
AUC                      0.759
Precision for class 0    0.694
Precision for class 1    0.746
Recall for class 0       0.889
Recall for class 1       0.514

6. Discussion

The results show that the best models found have AUC and accuracy scores of between 70%

and 75%. The results are relatively self-consistent in that they appear to be heading

towards a limit of around 75%.

The majority of the work has been carried out using the newer definition of stuck schools,

which has led to a more challenging classification task. As is seen in Figure 6, the two classes

are strongly overlapped and have proved challenging to separate. Judging by the consistency

of the results achieved in this project, it appears that increasing the accuracy and AUC scores

significantly would require a major change in the techniques used, since those used in this work

have been used thoroughly. There are, however, many areas in which the existing techniques

used could be optimised further.

School examination results were suspected to be an indicator of likely school inspection

success, yet they only feature in one of the models. One reason why they do not feature could

be the fact that primary and secondary schools have completely different scoring systems which

were resolved in Section 4.1.3. This leads to a more general issue with the data: that it is

measured and recorded in different ways over the years. A large amount of the data used is

solely for the most recent academic year, so is relatively self-consistent. Many other data

points, though, are time based, showing a change over the years.

Time is an important aspect of this analysis which is difficult to bring in. Firstly, school

inspections take place at irregular intervals (Section 1.1). When labelling schools based on their

histories, the two simple ways of doing this are to arrange them in order, most recent first, or

to arrange them by the date that they took place. Each has its advantages, but also causes a loss

of information. Given that inspection order was selected, a possible downside is that the data

used takes into account some differences from 7 years previously, whilst all of the relevant

inspections could have taken place in the last 3 years.

The second issue with time is the use of current data to predict the future performance of

schools. The data used is based on the current year, or a change from a number of years

previously. The class labels, as noted, are not being considered against time.

A possible area of weakness in the modelling is that schools are linked to their predecessors.

Generally, it is assumed that this is a sensible decision as schools do not change completely if

they close and reopen – it is usually still the same building with the same teachers. Sometimes,

though, schools have a list of many predecessors, where many smaller schools have been

merged into a larger school. All of the predecessor schools’ inspection results count towards

the new school’s results, so if four small schools with a recent poor inspection each merge into

a large new school, that school is immediately labelled as stuck.

The way that the data has been merged also means that data for the current school is used, but

for predecessors it is not. Therefore, if the school opened in 2016 then it will not have any data

in the combined dataset for dates before 2016. This is likely a source of significant missing

data and could be remedied by writing a program to look up predecessor school data if there is

no data available for the currently open school. This would need to be done carefully, however,

given that the predecessor is not necessarily a direct representation of the current school.

Missing data is an important area of uncertainty. Given that none of the features were complete

and the models required fully complete data, there was little option but to impute the missing

data (Section 3.3.4). The selected imputer uses other values in the row and column of the

missing data point to infer what value to insert. As a result, the imputed value depends on the

other features in the dataset. If the missing data in a new column is imputed as soon as the

column is added to the dataset, the missing values will be imputed with different values to those

if all of the columns were imputed at the same time, once they had all been added. Whichever

imputation technique is selected, it will add error to the model which has not been

characterised.

The data also shows great variation throughout, with a clear difference in model

performance between different sections of the data. If the data is sorted by URN (i.e. in

approximate order of when the school was opened) and the random selection option of the cross

validation is disabled then the plot in Figure 19 can be generated. Moving from Fold 1 through

to Fold 5 is then moving approximately from the oldest 20% of schools to the newest 20%. The

performance of the model appears to improve drastically from one fold to the next. For

example, the area under the curve in some of the models varies from 55% in the first fold to

over 85% in the fourth fold. These results appear counterintuitive because, as explained above,

the most complete data is expected to be for the oldest schools because their data would not be

lost due to being assigned to a predecessor. Given that the effect is so pronounced, there appears

to be good reason to investigate this further, to see if a significant improvement could be made

to the model.

Figure 19: ROC curves for the selected models, with randomness disabled in the cross validation process. Fold 1 is therefore the first 20% of the data points, Fold 2 is 20-40% etc.
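The unshuffled split behind Figure 19 can be sketched as follows: once the rows are sorted by URN, each fold is simply a contiguous block of row positions. The function name contiguous_folds is hypothetical, and the sample count of 1713 is used only as an example (it is the sum of the four cells in Table 8).

```python
def contiguous_folds(n_samples, n_folds=5):
    """Split row positions 0..n_samples-1 into consecutive, near-equal folds."""
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for i in range(n_folds):
        # The first `extra` folds take one additional row each.
        size = base + (1 if i < extra else 0)
        folds.append(range(start, start + size))
        start += size
    return folds


folds = contiguous_folds(1713, n_folds=5)
for number, fold in enumerate(folds, start=1):
    print(f"Fold {number}: rows {fold.start} to {fold.stop - 1} ({len(fold)} schools)")
```

Each fold in turn is held out as the test set while the remaining rows are used for training, so Fold 1 tests the model on the schools with the lowest URNs and Fold 5 on those with the highest.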

Cross validation was used throughout this work, with all results reported being averages over 5-fold cross validation. Given the clear variation in the data, it would likely make sense to increase to 10 folds. This would reduce the distortion of the results from the unevenness of the data. Re-running the same experiment generates different results due to the differing cross validation splits, introducing noise into the results.

Using accuracy and AUC as the measures for determining the performance of the model assumes that false positives and false negatives have equal cost (Adams and Hand, 2000). In this work, these measures have been used heavily as the costs are assumed equal. Further work and discussions with the relevant stakeholders may lead to adjusted measures being used.

The main focus of this work has been on the updated definition of stuck schools, and modelling them against schools that have ‘escaped’ being stuck (Section 4.3). Using the original definition of stuck schools, and modelling against all of the rest of the dataset, the results are much improved. One reason for this is that there are more data points. By modelling just the two subsets of the data, 90% of the data is removed and not used. On top of this, a school in the original stuck category is likely very different from a randomly selected other school. Using the new definition and subsets, however, the two classes that are to be distinguished are likely to be very similar. They have both had a run of three consecutive poor inspections, with potentially a single inspection different between them. There are approximately 430 schools in class 1 under the original stuck definition, with approximately 800 schools in class 1 under the new definition. The old class 1 is a subset of the new class 1.

The techniques used for selecting the hyperparameters and the features were both computationally intensive, requiring the training and testing of large numbers of models. The nature of the searches meant that sub-optimal combinations would be tried, some of which took far longer to run than other combinations of hyperparameters and features. Throughout this work, computing power was at a premium. The code was set up to work best in these conditions, for example by routinely dumping results to .csv files and Python ‘pickles’ to free up memory, limiting hyperparameter ranges to those that processed faster, and designing the code to run unattended for hours or days. Further information on this topic is provided in Section 5.2. This project was also undertaken on four different laptops, so file locations, version control and module versions were watched closely.
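The practice of dumping intermediate results to disk can be sketched as follows. This is a minimal illustration using the standard `csv` and `pickle` modules, not the project code; the helper names, file names and column names are assumptions, and the result values are placeholders.

```python
import csv
import os
import pickle
import tempfile

def append_result(csv_path, row, fieldnames):
    """Append one result row to a CSV file, writing the header if the file is new."""
    is_new = not os.path.exists(csv_path)
    with open(csv_path, "a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow(row)

def dump_object(pickle_path, obj):
    """Write an object to disk so the in-memory copy can be released."""
    with open(pickle_path, "wb") as handle:
        pickle.dump(obj, handle)

def load_object(pickle_path):
    """Read back an object written by dump_object."""
    with open(pickle_path, "rb") as handle:
        return pickle.load(handle)

out_dir = tempfile.mkdtemp()
results_csv = os.path.join(out_dir, "search_results.csv")
fields = ["model", "params", "mean_auc"]
# Placeholder rows only, to show the file layout.
append_result(results_csv, {"model": "svm", "params": "C=1", "mean_auc": 0.5}, fields)
append_result(results_csv, {"model": "svm", "params": "C=10", "mean_auc": 0.5}, fields)

state_path = os.path.join(out_dir, "state.pkl")
dump_object(state_path, {"completed": 2})
restored = load_object(state_path)
```

Appending one row per trained model means that an interrupted unattended run loses at most the model in progress.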

Sequential feature selection is a useful technique because it allows feature selection for models which do not generate a feature importance value. It is a greedy algorithm, so it will not necessarily find the optimal combination of features. It does, however, provide confidence that adding or removing a single feature from the selected feature set would not improve performance because, if it did, the algorithm would have made that change. The only manual input into the feature selection was at the start, in selecting and preparing the features to add to the dataset. Once they had been added to the dataset, the features were selected by the algorithms in the process.
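The greedy behaviour described above can be sketched for the forward (adding) step only. This is a minimal standard-library illustration, not the mlxtend implementation used in this work; `forward_select` and `toy_score` are hypothetical names, and the scoring function is a toy stand-in for a cross-validated model score.

```python
def forward_select(features, score, max_features=None):
    """Greedy sequential forward selection.

    `score` maps a tuple of feature names to a number (higher is better),
    for example a mean cross-validated AUC.
    """
    selected = []
    best_score = float("-inf")
    remaining = list(features)
    limit = max_features or len(remaining)
    while remaining and len(selected) < limit:
        # Try adding each remaining feature and keep the best single addition.
        candidate_scores = [(score(tuple(selected + [f])), f) for f in remaining]
        top_score, top_feature = max(candidate_scores, key=lambda pair: pair[0])
        if top_score <= best_score:
            break  # no single addition improves the score
        selected.append(top_feature)
        remaining.remove(top_feature)
        best_score = top_score
    return selected, best_score

# Toy scoring function (an assumption for illustration): "a" and "b" help on
# their own, but the pair "c" + "d" beats any set the greedy search reaches.
def toy_score(subset):
    s = set(subset)
    if s == {"c", "d"}:
        return 0.9
    return 0.6 * ("a" in s) + 0.1 * ("b" in s) - 0.05 * len(s - {"a", "b"})

chosen, chosen_score = forward_select(["a", "b", "c", "d"], toy_score)
```

In this toy case the search stops at the set containing "a" and "b", because no single addition improves the score, even though the pair "c" and "d" scores higher. This is the sense in which the result is locally but not necessarily globally optimal.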

Some of the more complex models, such as the Neural Network, are unlikely to be optimal. There are many parameters and architectures available which can be altered to potentially improve the model, which would take a large quantity of time and computing power. For example, the number of iterations and using layers with varying numbers of units have not been investigated. Neural Networks were not originally considered for this work due to the small quantity of training data available, so shallow networks are expected to work best. The models could also be optimised by making a more thorough investigation into performance results when the model is applied to the training data. This would give more insight into the overfitting/underfitting nature of the model and how it could be improved.

7. Conclusions

The primary aim of this project was to create a dataset and a resulting model that could

accurately predict the future performance of a school in terms of inspection results. These aims

have been achieved to a good level, given the time and resources available, and the complexity

of the problem.

Firstly, a literature review was carried out in Section 2 to identify and assess previous work in

the field and to investigate techniques to be used in this work. It does not appear that work to

use machine learning to predict school inspection results has been published in the literature,

although a number of studies have been done to predict examination results. Although related

to inspection results, the two are clearly distinct. The literature provided information on six

machine learning models to implement in this project and how they may be optimised.

The dataset creation has been achieved by generating a dataset of every currently open state-

funded school in England, and is explained in Sections 3 and 4. This dataset has over 80

variables which have come from a range of input sources that are of multiple formats. Some of

the variables have been generated through feature engineering. The variables selected have

been narrowed down from over 1000 in the input data. The data have been cleaned, normalised

and the missing values imputed. The data have also been assigned labels using an updated

definition of the binary classes, and data not relevant have been removed from the dataset.

Many models have been trained and tested (Section 5), to find the best performing model for

this task. For each of the six model types identified in the literature survey, the optimum

combination of hyperparameters and features has been found. This was achieved iteratively

using random hyperparameter grid searches and sequential feature selection. The feature sets

found are shown to differ for each model type, with the feature importance shown for Random

Forest models.

The classification accuracy and the area under the ROC curve of the best performing models

overall are around 75%, which is considered a good result when the complexity of the task, the

time available and the quality of the input data are taken into account. The two classes to be

distinguished are very similar, and the available training set is only a small subset of the

complete input data. If a higher confidence in the prediction of which schools will become

stuck is required, an 88% precision has been achieved for a Support Vector Machine with 40%

recall.

This work has been presented to both the Department for Education and Ofsted to explain what

has been done and the results achieved. It has been agreed that it is unlikely that further work

to model this data will provide a significant gain in predictive power beyond that of the

classifiers described in this report. At the time of writing, it is not yet known if this work will

be continued within the ONS beyond the project period.

An initial step in terms of further work would be to determine whether an increase in

classification accuracy was possible and, if so, by how much. If there is room for improvement,

an investigation would be required to find out why the models are not producing optimum

results. This would start with an analysis of training and test data to see the overfitting

tendencies of the models.

Should an improvement in classifier performance be sought, there are many possibilities for

work that would provide this. Firstly, there are likely to be techniques to extract high quality

classifications from the classifiers presented. For example, a more thorough investigation of

varying the thresholds could provide a classifier with an excellent precision, even if the recall

is low. The classifiers may also be effective for subsets of the data. For example, the classifier

has not been run on just primary schools. There is known to be a difference in the data between

primary and secondary schools, and this work has considered the two together.
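The trade-off between precision and recall as the decision threshold is varied can be sketched as follows. The labels and scores are made up for illustration and are not results from this project, and the function name `precision_recall_at` is an assumption.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for scores >= threshold."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p)
    fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p)
    fn = sum(1 for t, p in zip(y_true, predicted) if t == 1 and not p)
    # Precision is undefined when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else None
    recall = tp / (tp + fn) if (tp + fn) else None
    return precision, recall

# Made-up labels and classifier scores, for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.6, 0.4, 0.7, 0.5, 0.3, 0.1]

sweep = {t: precision_recall_at(y_true, scores, t) for t in (0.35, 0.55, 0.75)}
```

In this toy data, raising the threshold from 0.35 to 0.75 increases precision while recall falls, which is the pattern behind the high-precision, low-recall Support Vector Machine result reported above.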

The techniques described in this report could simply be extended with more computing power

available. A more exhaustive hyperparameter search, investigating more hyperparameters over

a larger range would be trivial to implement using the code written for this project. More

sequential feature selection runs, with more different models could also be carried out, with

randomly generated starting feature sets to try to find better results than those generated by the

greedy algorithm.
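A random hyperparameter search of the kind described can be sketched in a few lines. This is not the code written for this project; `random_search`, the search space and the stand-in objective are assumptions for illustration. In practice `evaluate` would train a model and return a cross-validated score.

```python
import random

def random_search(param_space, evaluate, n_iter=20, seed=0):
    """Sample parameter combinations at random and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        result = evaluate(params)
        if result > best_score:
            best_params, best_score = params, result
    return best_params, best_score

# Hypothetical search space; widening these lists extends the search range.
space = {"C": [0.01, 0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

# Stand-in objective used only so that the example runs without a model.
def stand_in_objective(params):
    return -abs(params["C"] - 10) - abs(params["gamma"] - 0.1)

best, best_value = random_search(space, stand_in_objective, n_iter=200, seed=1)
```

Extending the search then amounts to adding entries to the parameter lists and increasing `n_iter`, at the cost of proportionally more computing time.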

The input data are clearly an area where improvements could lead to better

model performance. More features could be added given more time, but the

key priority would be to reduce the missing data. One part of this would be to incorporate data

for predecessor schools. Another would be to investigate whether more complete data sources

are available. The imputation technique used, whilst likely to be the best available,

could also be investigated, as it has a large bearing on the results and imputes different values

depending on what other features are in the dataset.

If a classifier of high quality is not attained, there is likely a large quantity of information to

gain from carrying out a more traditional statistical analysis. This would also tie in with the

more qualitative work that has been completed by Ofsted and the Department for Education.

Working on this project has been an excellent opportunity to carry out a machine learning

project in the ‘real world’, testing out a large variety of the skills acquired this year and in

previous years. It has been interesting to try out many different machine learning techniques

and find out what works best for myself. Working with the ONS, Ofsted and the Department

for Education has also given me some experience of how carrying out a machine learning

project on data can lead to insight and the change of government policy.

Given that there were three months between first arriving at the ONS and completing this

report, only a finite quantity of work could be achieved. Many steps of the project

could easily have been allocated more than three months themselves so, as a result, there are

areas where more thorough work could have been done. This would, however, have directly

resulted in not completing some of the later work.

The aim of the project was to carry out a full, from start to finish, data science project, with a

real risk that the end was not reached in time. For example, creating the dataset proved to be a

real challenge that put the project behind schedule. The techniques used had not been attempted

before by the team, who were interested to know if they would be of any use. There was

therefore uncertainty in how long the models would take to implement and whether they would

generate anything useful. Given that meaningful results have been generated and the results

have been communicated to the stakeholders within the allotted time, the goals of this project

are considered to have been achieved.

8. References

Adams, N. M. and Hand, D. J. (2000) ‘Improving the practice of classifier performance

assessment’, Neural Computation, 12(2), pp. 305–311. doi: 10.1162/089976600300015808.

Canagareddy, D., Subarayadu, K. and Hurbungs, V. (2019) ‘A Machine Learning Model to

Predict the Performance of University Students’, in Fleming, P. et al. (eds) Smart and

Sustainable Engineering for Next Generation Applications. Cham: Springer International

Publishing, pp. 313–322.

Chawla, N. et al. (2002) ‘SMOTE: Synthetic Minority Over-sampling Technique’, J. Artif.

Intell. Res. (JAIR), 16, pp. 321–357. doi: 10.1613/jair.953.

Department for Education (2019) New drive to continue boosting standards in schools -

GOV.UK. Available at: https://www.gov.uk/government/news/new-drive-to-continue-

boosting-standards-in-schools (Accessed: 19 September 2019).

Department for Education (DfE) (2016) ‘Progress 8: How Progress 8 and Attainment 8

measures are calculated’, pp. 1–5. Available at:

https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data

/file/561021/Progress_8_and_Attainment_8_how_measures_are_calculated.pdf.

Duda, R. O., Hart, P. E. and Stork, D. G. (2001) Pattern classification. 2nd ed. New York ;

Chichester: Wiley (A Wiley-Interscience publication).

Fawcett, T. (2006) ‘An introduction to ROC analysis’, Pattern Recognition Letters, 27(8), pp.

861–874. doi: 10.1016/j.patrec.2005.10.010.

Fowler, J. (2012) ‘Ofsted Inspection of Outstanding schools - The Education (Exemption

from School Inspection) (England) Regulations 2012’, (January), pp. 0–2. Available at:

https://www.lgiu.org.uk/wp-content/uploads/2012/05/Ofsted-Inspection-of-Outstanding-

schools-The-Education-Exemption-from-School-Inspection-England-Regulations-2012.pdf.

Hsu, C.-W., Chang, C.-C. and Lin, C.-J. (2008) ‘A Practical Guide to Support Vector

Classification’, BJU international, 101(1), pp. 1396–400. Available at:

http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf.

Hunter, J. D. (2007) ‘Matplotlib: A 2D Graphics Environment’, Computing in Science &

Engineering, 9(3), pp. 90–95. doi: 10.1109/MCSE.2007.55.

Lemaitre, G., Nogueira, F. and Aridas, C. K. (2017) ‘Imbalanced-learn: A Python Toolbox to

Tackle the Curse of Imbalanced Datasets in Machine Learning’, Journal of Machine

Learning Research, 18(17), pp. 1–5. Available at: http://jmlr.org/papers/v18/16-365.html.

Lin, C., Yu, H. and Huang, F. (2011) ‘Dual Coordinate Descent Methods for Logistic

Regression and Maximum Entropy Models’, Machine Learning, 2(85), pp. 41–75.

Majnik, M. and Bosnic, Z. (2011) ROC Analysis of Classifiers in Machine Learning : A

Survey Technical report MM-1 / 2011.

Masci, C., Johnes, G. and Agasisti, T. (2018) ‘Student and school performance across

countries: A machine learning approach’, European Journal of Operational Research.

Elsevier B.V., 269(3), pp. 1072–1085.

McKinney, W. (2010) ‘Data Structures for Statistical Computing in Python’, in van der Walt,

S. and Millman, J. (eds) Proceedings of the 9th Python in Science Conference, pp. 51–56.

Mitchell, T. M. (1997) Machine learning. New York: McGraw-Hill.

mlxtend (no date) Sequential Feature Selector - mlxtend. Available at:

http://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/

(Accessed: 12 September 2019).

Office for Standards in Education (2018) ‘School inspection handbook’, Ofsted School

Inspection Handbook, (September). doi: 10.4324/9780203416242_chapter_2.

Ofsted (2019) ‘Education inspection framework for September 2019’, (May), pp. 1–14.

Available at: www.legislation.gov.uk/uksi/2014/3283/contents/made.

Pedregosa, F. et al. (2011) ‘Scikit-learn: Machine Learning in

Python’, Journal of Machine Learning Research, 12, pp. 2825–2830. Available at:

http://Scikit-learn.sourceforge.net.

Press, W. H. et al. (2007) Numerical Recipes : The Art of Scientific Computing. 3rd ed.

Cambridge ; New York: Cambridge University Press.

Raschka, S. (2018) ‘MLxtend: Providing machine learning and data science utilities and

extensions to Python’s scientific computing stack’, Journal of Open Source Software, 3(24),

p. 638. doi: 10.21105/joss.00638.

Rebai, S., Ben Yahia, F. and Essid, H. (2019) ‘A graphically based machine learning

approach to predict secondary schools performance in Tunisia’, Socio-Economic Planning

Sciences. Elsevier, (June), p. 100724. doi: 10.1016/j.seps.2019.06.009.

Rees, G. (2017) List all possible permutations from a python dictionary of lists - Code

Review Stack Exchange. Available at:

https://codereview.stackexchange.com/questions/171173/list-all-possible-permutations-from-

a-python-dictionary-of-lists (Accessed: 19 September 2019).

Ronaghan, S. (2018) The Mathematics of Decision Trees, Random Forest and Feature

Importance in Scikit-learn and Spark. Available at: https://medium.com/@srnghn/the-

mathematics-of-decision-trees-random-forest-and-feature-importance-in-Scikit-learn-and-

spark-f2861df67e3 (Accessed: 15 September 2019).

Scikit-learn (no date a) 1.1. Generalized Linear Models — Scikit-learn 0.21.3 documentation.

Available at: https://Scikit-learn.org/stable/modules/linear_model.html#linear-model

(Accessed: 11 September 2019).

Scikit-learn (no date b) Neural Network models (supervised) - 1.17.1 Multi-layer Perceptron.

Available at: https://Scikit-learn.org/stable/modules/neural_networks_supervised.html

(Accessed: 11 September 2019).

Shalev-Shwartz, S. and Ben-David, S. (2014) Understanding Machine Learning - From

Theory to Algorithms. 1st Ed. Cambridge University Press.

Spielman, A. (2018) The Annual Report of Her Majesty’s Chief Inspector of Education,

Children’s Services and Skills 2017/18 - GOV.UK. Available at:

https://www.gov.uk/government/publications/ofsted-annual-report-201718-education-

childrens-services-and-skills/the-annual-report-of-her-majestys-chief-inspector-of-education-

childrens-services-and-skills-201718 (Accessed: 12 September 2019).

Tanuar, E. et al. (2018) ‘Using Machine Learning Techniques to Earlier Predict Student’s

Performance’, in 2018 Indonesian Association for Pattern Recognition International

Conference (INAPR), pp. 85–89. doi: 10.1109/INAPR.2018.8626856.

Thomson, D. (2019) A look at Ofsted’s ‘stuck’ schools - FFT Education Datalab. Available

at: https://ffteducationdatalab.org.uk/2019/01/a-look-at-ofsteds-stuck-schools/ (Accessed: 18

September 2019).

Walt, S. van der, Colbert, S. C. and Varoquaux, G. (2011) ‘The NumPy Array: A Structure

for Efficient Numerical Computation’, Computing in Science & Engineering, 13(2), pp. 22–

30. doi: 10.1109/MCSE.2011.37.

Zhang, H. (2004) ‘The Optimality of Naive Bayes’, Aa, 1(2), p. 3.

9. Appendices

9.1. Appendix 1: Input Variables Used

The variables used in this project are shown below, along with a brief description of their

meaning. Variable names with a year suffix mean that the value was recorded in that year. For

example, VariableName_16 would correspond to the value VariableName had in the academic

year 2015-16.

9.1.1. School Financial Balance

Total revenue balance (1) 2017-18 – Float

Total revenue balance (1) as a % of total revenue income (6) 2017-18 – Float

TotalRevBalance Change 7yr – Float (calculated) – Difference between Total revenue

balance in 2010-11 and 2017-18.

TotalRevBalance Change 4yr – Float (calculated) – Difference between Total revenue

balance in 2013-14 and 2017-18.

TotalRevBalance Change 2yr – Float (calculated) – Difference between Total revenue

balance in 2015-16 and 2017-18.

9.1.2. School performance data

TOTPUPS__18 – Integer – Total number of pupils in the school.

PTKS1GROUP_L__18 – Float – Values taken from columns with the following names:

“Percentage of pupils at the end of key stage 4 with low prior attainment at the end of key stage

2”, “Percentage of pupils in cohort with low KS1 attainment”, “% pupils in cohort with low

KS1 attainment”

PTKS1GROUP_M__18 – Float – Values taken from columns with the following names:

“Percentage of pupils at the end of key stage 4 with medium prior attainment at the end of key

stage 2”, “Percentage of pupils in cohort with medium KS1 attainment”, “% pupils in cohort

with medium KS1 attainment”.

PTKS1GROUP_H__18 – Float – Values taken from columns with the following names:

“Percentage of pupils at the end of key stage 4 with high prior attainment at the end of key

stage 2”, “Percentage of pupils in cohort with high KS1 attainment”, “% pupils in cohort with

high KS1 attainment”.

PTFSM6CLA1A__18 – Float – Percentage of pupils in school eligible for free school meals.

Values taken from columns with the following name: “Percentage of pupils at the end of key

stage 4 who are disadvantaged”, “Percentage of pupils who are disadvantaged”.

PTMOBN__18 – Float – Values taken from columns with the following names: “Percentage

of pupils at the end of Key Stage 4 who are non-mobile”, “Percentage of eligible pupils

classified as non-mobile”.

PSENELSE__18 – Float – Percentage of pupils with Special Educational Needs. Values taken

from columns with the following names: “Percentage of eligible pupils with SEN with

Statement or EHC plan”, “Percentage of Pupils with statements or supported at school action

plus”, “Percentage of pupils at the end of key stage 4 with special educational needs (SEN)

with a statement or Education, health and care (EHC) plan”, “Percentage of key stage 4 pupils

with statements of SEN (Special Educational Need) or on School Action Plus”.

PerformancePctRank – Float (calculated) – Percentage ranking of schools based on exam

performance. A full explanation is provided in 4.1.3. Values taken from columns with the

following names: “Average Attainment 8 score per pupil”, “Percentage of pupils reaching the

expected standard in reading, writing and maths”.

9.1.3. Pupil population and absence data

PNUMEAL – Float – Percentage of pupils whose first language is not English.

PNUMFSM – Float – Percentage of pupils eligible for free school meals.

PERCTOT – Float – Percentage of overall absence (authorised and unauthorised) for the full

2017/18 academic year.

9.1.4. Spine

ISPRIMARY – Binary – Whether the school is primary.

ISSECONDARY – Binary – Whether the school is secondary.

ISPOST16 – Binary – Whether the school is for pupils aged over 16.

AGEL – Integer – The lowest age of pupils.

AGEH – Integer – The highest age of pupils.

9.1.5. Workforce data

Pupil : Teacher Ratio – Float.

Mean Gross FTE Salary of All Teachers (£s) – Integer – Mean Full Time Equivalent salary

of teachers in school.

9.1.6. School finances data

These variables are an annual total amount of money. Each of the titles in this section represents

three variables:

- The variable value in 2018

- The difference between the variable values in 2016 and 2018

- The difference between the variable values in 2014 and 2018.

Self.Income - Integer – Annual self generated income

Total.Income.pp – Integer – Annual total income per pupil

Teaching.Staff – Integer – Annual spend on teaching staff

Supply.Staff – Integer – Annual spend on supply staff

Ed.Support.Staff – Integer – Annual spend on educational support staff

Premises – Integer – Annual spend on premises

Back.Office – Integer – Annual spend on back office staff

Catering – Integer – Annual spend on catering

Other.Staff – Integer – Annual spend on other staff

Energy – Integer – Annual spend on energy

Learning.Resources – Integer – Annual spend on learning resources

ICT – Integer – Annual spend on IT equipment

Consultancy – Integer – Annual spend on external consultants

Other – Integer – Other annual spend

Total.Spend.pp – Integer – Total spend per pupil
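The three variables derived from each finance heading can be sketched as follows. The output names follow the pattern seen in Appendix 2 (for example 'Supply.Staff.2018', 'Supply.Staff_2yrDiff' and 'Supply.Staff_4yrDiff'). The sign convention (2018 value minus the earlier value) is an assumption, as are the function name `finance_features` and the figures used.

```python
def finance_features(name, yearly_values):
    """Build the three variables for one finance heading from yearly totals."""
    v2018 = yearly_values[2018]
    return {
        f"{name}.2018": v2018,
        # Assumed sign convention: later value minus earlier value.
        f"{name}_2yrDiff": v2018 - yearly_values[2016],
        f"{name}_4yrDiff": v2018 - yearly_values[2014],
    }

# Made-up annual totals, for illustration only.
example = finance_features("Supply.Staff", {2014: 40000, 2016: 52000, 2018: 47000})
```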

9.1.7. Generated variables

Boarding – Binary (converted from text) – Whether the school has pupils who stay at the school

overnight. Positive cases: “Boarding school”, “Children’s home (Boarding school)”, “College

/ FE residential accommodation”. All other cases negative.

SixthForm – Binary (converted from text) – Whether the school has a sixth form.

HasBoys – Binary (converted from text) – School has male students. Positive cases: “Mixed”,

“Boys”.

HasGirls – Binary (converted from text) – School has female students. Positive cases:

“Mixed”, “Girls”.

Maintained – Binary – School is a local authority maintained school

Academy – Binary – School is an academy

Special – Binary – School is a special school

GOR_East Midlands / GOR_East of England / GOR_London / GOR_North East /

GOR_North West / GOR_South East / GOR_South West / GOR_West Midlands /

GOR_Yorkshire and the Humber – Binary variables – Whether school is in the Government

Office Region.
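The conversion from text to binary for the Boarding, HasBoys and HasGirls variables can be sketched using the positive cases listed above. The function name `generated_flags` and its parameter names are assumptions; the names of the source text columns are not given in this appendix.

```python
# Positive cases for Boarding, as listed in this appendix.
BOARDING_POSITIVE = {
    "Boarding school",
    "Children’s home (Boarding school)",
    "College / FE residential accommodation",
}

def generated_flags(boarders, gender):
    """Convert the text fields for one school into binary variables."""
    return {
        "Boarding": int(boarders in BOARDING_POSITIVE),  # all other cases negative
        "HasBoys": int(gender in {"Mixed", "Boys"}),
        "HasGirls": int(gender in {"Mixed", "Girls"}),
    }

# Example inputs; "No boarders" is a made-up negative case.
girls_boarding = generated_flags("Boarding school", "Girls")
mixed_day = generated_flags("No boarders", "Mixed")
```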

9.2. Appendix 2: Variables selected for models

Each optimised model used a different subset of the available variables. The variables used for

each model are shown in this appendix.

9.2.1. Neural Network

'Total revenue balance (1) 2017-18',

'Other.Staff_2yrDiff',

'Premises_4yrDiff',

'Mean Gross FTE Salary of All Teachers (£s)',

'Supply.Staff_2yrDiff',

'BoardingNew',

'TotalRevBalance Change 7yr',

'TotalRevBalance Change 2yr',

'Supply.Staff_4yrDiff',

'PNUMEAL',

'GOR_West Midlands',

'MaintainedNew',

'GOR_North East',

'ISPOST16',

'Supply.Staff.2018',

'Other.2018',

'Learning.Resources_2yrDiff',

'TotalRevBalance Change 4yr',

'ICT.2018',

9.2.2. Support Vector Machine

'TotalRevBalance Change 7yr',

'Total revenue balance (1) as a % of total revenue income (6) 2017-18',

'Supply.Staff_2yrDiff',

'Mean Gross FTE Salary of All Teachers (£s)',

'Consultancy_4yrDiff',

'Ed.Support.Staff_4yrDiff',

'Consultancy.2018',

9.2.3. Random Forest

'TotalRevBalance Change 4yr',

'PerformancePctRank',

'Supply.Staff_4yrDiff',

'ISSECONDARY',

'TotalRevBalance Change 7yr',

'PTKS1GROUP_H__18',

'Total revenue balance (1) as a % of total revenue income (6) 2017-18',

'AGEL',

9.2.4. Gaussian Naïve Bayes

'TotalRevBalance Change 4yr',

'Total.Income.pp.2018',

'Supply.Staff.2018',

'Total revenue balance (1) 2017-18',

'Self.Income.2018',

'Mean Gross FTE Salary of All Teachers (£s)',

'PSENELSE__18',

'TotalRevBalance Change 7yr',

'Ed.Support.Staff.2018',

'Consultancy_2yrDiff'

9.2.5. Logistic Regression

'Total revenue balance (1) 2017-18',

'Supply.Staff_2yrDiff',

'TotalRevBalance Change 7yr',

'Other_4yrDiff',

'Teaching.Staff_4yrDiff',

'Other.Staff_4yrDiff',

'Other.Staff_2yrDiff',

'Supply.Staff.2018',

'Ed.Support.Staff.2018'

9.2.6. K-Nearest Neighbours

'Total revenue balance (1) 2017-18',

'TotalRevBalance Change 4yr',

'HasGirlsNew',

'HasBoysNew',

'ISPRIMARY',

'Catering.2018',

'GOR_North East',

'Ed.Support.Staff_4yrDiff',

'BoardingNew',

'Back.Office_2yrDiff',

'Consultancy.2018',

'Consultancy_2yrDiff',

'SpecialNew',

'Ed.Support.Staff_2yrDiff',

'PERCTOT',

'Other.Staff_4yrDiff',

'Learning.Resources.2018',

'Other.2018',

'Other_4yrDiff',

'Energy_4yrDiff',

'TotalRevBalance Change 7yr',

'Teaching.Staff.2018',

'Total revenue balance (1) as a % of total revenue income (6) 2017-18',

'Learning.Resources_4yrDiff',

'ISSECONDARY',

'SixthFormNew',

'ISPOST16',

'Catering_2yrDiff',


Steelmaking Continuous Casting Production Schedule Optimization for TATA Steel

Sotirios Filippou

September 2019

School of Mathematics, Cardiff University

A dissertation submitted in partial fulfilment of the requirements for MSc (in Operational Research,

Applied Statistics and Financial Risk) by taught programme.

CANDIDATE’S ID NUMBER

1872049

CANDIDATE’S SURNAME

Mr. Filippou

CANDIDATE’S FULL FORENAMES

Sotirios

DECLARATION

This work has not previously been accepted in substance for any degree and is not concurrently submitted in candidature for any degree.

Signed ……………………………………………. (candidate) Date 19/09/2019

STATEMENT 1

This dissertation is being submitted in partial fulfilment of the requirements for the degree of MSc

Signed ……………………………………………. (candidate) Date 19/09/2019

STATEMENT 2

This dissertation is the result of my own independent work/investigation, except where otherwise stated. Other sources are acknowledged by footnotes giving explicit references. A Bibliography is appended.

Signed ……………………………………………. (candidate) Date 19/09/2019

STATEMENT 3

I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing, and for the title and summary to be made available to outside organisations.

Signed ……………………………………………. (candidate) Date 19/09/2019

STATEMENT 4 - BAR ON ACCESS APPROVED

I hereby give consent for my dissertation, if accepted, to be available for photocopying and for public viewing after expiry of a bar on access approved by the Graduate Development Committee.

Signed ……………………………………………. (candidate) Date 19/09/2019

Executive Summary

Iron and steel production is one of the most significant industries worldwide. The steel

manufacturing process is divided into three stages: ironmaking, steelmaking-continuous

casting and production of finished products. In this paper, we focus on the

steelmaking-continuous casting stage, in which liquid iron is refined into steel through the

addition of required alloys and the removal of impurities, and then solidifies into slabs.

The steelmaking-continuous casting process has been described as the bottleneck of the

steel production process. Thus, optimal scheduling of this process could result in many

advantages. However, it is a combinatorial problem with complex practical constraints

and strict requirements and is considered as one of the most difficult industrial planning

and scheduling problems.

Across the literature several methods attempting to solve this problem can be found. Each

steelmaking-continuous casting scheduling problem may differ from the others

considering the facilities, the way demand is translated into production planning and any

other additional constraints. As a result, a universal method that can be applied to solve

such a problem does not exist.

The process followed by TATA Steel Port Talbot Works consists of five stages with

multiple machines at all stages. Furthermore, it includes a variety of constraints that result

in high complexity in mathematically formulating a scheduling problem for this process.

Attempting to solve this problem, three different mathematical models were developed.

The first model schedules the last stage of the process, and the second one uses the results

of the first one to schedule the preceding stage. The third model uses the results of the

first one and attempts to schedule all remaining stages. It should be pointed out that the

second or the third model may not be able to find a feasible solution for every output of the

first one.

The three models were tested on a realistic scenario. A complete schedule was acquired

for the two last stages from the first two models. However, a solution for the third model

could not be acquired in a reasonable time frame. Thus, it was tested on a smaller scale

problem for which a feasible solution was acquired.

Several suggestions on how the three models could be improved and used as part of a

complete scheduling system are included in the paper. The models could be combined

with different methods, to obtain a complete feasible schedule. Furthermore, different

solution methods should be tested since they may prove more advantageous for

business purposes.

Acknowledgments

I would like to thank my sponsor supervisor at TATA Steel, Mr. James Watson, for his

guidance, assistance, suggestions and patience throughout the process of writing this

dissertation.

I would also like to thank my university supervisor, Dr. Tony Lewins, for his expertise,

support and guidance on how to approach this research topic. Furthermore, I would like

to thank Dr. Jonathan Thompson for his contribution, willingness to help and always

responding to my questions and queries.

I need to express my gratitude to my parents for their continuous support and

encouragement. In addition, I would like to acknowledge my fellow postgraduate students

and friends at Cardiff University. Your advice and friendship helped me complete this

dissertation. I would like to single out Artemis Giannakopoulou for her support and

sympathetic ear.

Table of Contents

Executive Summary

Acknowledgments

Table of Figures

Table of Tables

Abstract

1. Introduction

2. Literature Review

3. Process Description and Constraints

4. Mathematical Formulation

4.1. Model 1: Continuous Casting Scheduling

4.1.1. Notation

4.1.2. Model Constraints

4.1.3. Objective Function

4.2. Model 2: Treatment Units Scheduling

4.2.1. Notation

4.2.2. Constraints

4.2.3. Objective Function

4.3. Model 3: Basic Oxygen and Secondary Steelmaking Scheduling

4.3.1. Notation

4.3.2. Constraints

4.3.3. Objective Function

5. Experimental Tests

6. Recommendations for Further Development

7. Conclusion

References

Appendix A: Flying Tundish Change (FTC) between Products

Appendix B: Flying Tundish Change (FTC) between Products

Appendix C: Xpress Code Model 1

Appendix D: Xpress Code Model 2

Appendix E: Xpress Code Model 3

Table of Figures

Figure 1. Steel Manufacturing Process
Figure 2. Steelmaking-Continuous Casting Process at TATA Steel
Figure 3. Allocation of sequences to continuous casting machines – Case Study 1
Figure 4. Treatment Unit 1 (RH) Schedule – Case Study 1
Figure 5. Treatment Unit 2 (RD) Schedule – Case Study 1
Figure 6. Treatment Unit 3 (CAS1) Schedule – Case Study 1
Figure 7. Treatment Unit 4 (CAS2) Schedule – Case Study 1
Figure 8. Heats Schedule – Case Study 1
Figure 9. Heats Schedule – Case Study 2
Figure 10. Proposed Heuristic

Table of Tables

Table 1. Sequences – Case Study 1
Table 2. Availability of the continuous casting machines – Case Study 1
Table 3. Machine ID Number – Figure 8
Table 4. Heats – Case Study 2
Table 5. Machine ID Number – Figure 9

Abstract

The purpose of this paper was to develop a mixed integer linear programming model for

scheduling the steelmaking continuous casting process at TATA Steel Port Talbot Works.

Several formulations and solution methods exist in the literature. However, the

formulation of a model highly depends on the considered process and the respective

constraints. Three models attempting to schedule different stages of the process were

developed. The models were tested using experimental data and feasible solutions were

acquired. Further research on solution methods and improvements to the current models

are required for the development of a complete scheduling system.

1. Introduction

Iron and steel production is one of the most significant industries worldwide since its

products are used as primary materials for a variety of other industries such as automobile,

construction and manufacturing (Missbauer et al., 2009; Tang et al., 2014).

Input materials, such as iron ore and scrap, are used to manufacture steel products in a

process that can be divided into three stages:

(1) Ironmaking: iron ore, coke and a fluxing agent are transformed into molten iron

(also called hot metal) in blast furnaces.

(2) Steelmaking-continuous casting: through melting, refining and continuous

casting, the molten iron is converted into solid slabs with a specified chemical

composition.

(3) Production of finished products: slabs are shaped into coils by hot and cold rolling

and take their final form via various processes including continuous or batch

annealing, electro-galvanizing and continuous galvanizing (Missbauer et al.,

2009; Tang and Wang, 2008).

In Figure 1, a representation of the described process can be found.

Figure 1. Steel Manufacturing Process

This industry is characterized by high-temperature high-weight material flow,

sophisticated technological procedures, major investment and high energy consumption.

Additionally, the steelmaking continuous casting (SCC) stage is regularly described as

the bottleneck of the iron and steel making process since it has high-cost energy and

equipment requirements, runs continuously and its total capacity is smaller than the

capacity of the rest of the stages in the iron and steelmaking process (Tang et al., 2002).

As a result, effective scheduling of the SCC processes can be advantageous in several

ways, including minimizing material and energy requirements, increasing profit, reducing

costs and improving the response to customer demand (Li et al., 2012). However, scheduling of

such a process is a combinatorial problem with complex practical constraints and strict

requirements on material and flow continuity subject to processing times at the different

stages and transportation and waiting times between them (Li et al., 2012; Tang et al.,

2000). According to Zhu et al. (2010), it is an NP-complete problem and can be described

as a “specific hybrid flow shop scheduling problem” that includes multiple jobs,

operations and machines.

As already mentioned, the SCC stage consists of three sub-stages: steelmaking,

refining and continuous casting. Starting with the steelmaking sub-stage, oxygen combustion

taking place in a converter or an electric arc furnace decreases the impurity components

(carbon, sulphur, silicon, etc.) of the molten iron to acceptable levels converting it to

molten steel that contains the major alloy contents. A basic production unit in the SCC

stage is termed a charge and refers to the simultaneous smelting in a single converter

(Tang et al., 2002). The words job and heat are alternative terms for charge and are used

interchangeably in this paper. Several slabs for different orders can be cast from a single

charge, but they need to have the same steel grade (Tang et al., 2002).

Afterwards, the output is placed into ladles and transferred to refining furnaces. During

refining, any remaining impurities are removed from the molten steel or additional alloy

elements are added. If all refining furnaces are occupied, charges must be held

until the processing of the preceding charges is completed. Waiting causes a decline in the

temperature of the molten steel and reheating is required. The longer the waiting time, the

larger the temperature drop and the higher the energy requirements for reheating (Tang

et al., 2002).

Continuous casting follows refining. The molten steel is poured into a tundish from where

it is tapped into the caster and solidifies into slabs at the bottom of the caster. A series of

jobs successively cast on the same caster without any interruptions is named a cast.

Between consecutive jobs, the casting machine does not require any setup time, but if the

caster must be changed between two casts a significantly long setup time is needed.

Furthermore, a removal time is required for cleaning the equipment between the casts.

However, setup and removal time is not included in the duration of the operation since

only the equipment and not the charges are involved (Tang et al., 2002).

Several machines are usually available for the same stage of the process. The goal of the

SCC production scheduling problem is to establish on which machine each job will be

processed at each production stage, determine the respective processing time and the

sequence of jobs on each machine (Tang et al., 2014). A defined SCC problem includes

two types of data:

(1) Production data: information on job grouping (cast) and jobs. This information is

acquired from solving a batch problem that links required slab production to

required jobs and arranges jobs into casts.

(2) Process data: include information on the available machines, processing times,

transportation times, process route and casting speed (Tang et al., 2014).

The SCC scheduling problem is considered one of the most difficult industrial planning

and scheduling problems (Sbihi et al., 2014). Across the literature several methods

attempting to solve this problem can be found. Among the most common ones are

mathematical programming, heuristics, simulation, expert systems and artificial

intelligence. Each SCC scheduling problem may differ from the others considering the

facilities, the way demand is translated into production planning and any other additional

constraints. A solution may be obtained either optimally or heuristically. In this paper,

the focus is on mathematical programming, and specifically mixed integer

programming.

This paper discusses a problem presented by TATA Steel considering the SCC process at

their steel plant in Port Talbot, UK. The specifics of the problem are discussed in Section

3. Currently, production is scheduled manually without the use of any optimization

methods. An attempt to identify possible solution methods and related challenges and

model part of the process using mixed integer programming is presented in this paper.

Furthermore, recommendations on improving the scheduling process are given.

The rest of the paper is organized as follows. In Section 2, a brief review of similar

problems found in literature and their proposed solution methods is presented. Section 3

introduces the SCC scheduling problem as described by Tata Steel Strip Products UK Port

Talbot Works. In Section 4, three mixed integer linear programming models are described,

one for scheduling the continuous casting stage of this problem, one for scheduling the

refining stage and one for scheduling all stages preceding continuous casting. In Section

5, examples of the implementation of these models are presented. In Section 6,

recommendations on how a complete scheduling system could be developed are

discussed. Finally, Section 7 draws the conclusions.

2. Literature Review

In this section, literature related to this paper is reviewed. In particular, several examples

of the SCC scheduling problem and their proposed solution methods are presented. The

various solution methods that have been presented by researchers can be categorized into

mathematical programming, heuristics and artificial intelligence. However, many

examples of combining these methods exist. A complete review of mathematical

programming methods was prepared by Tang et al. (2001). Also, Dutta and Fourer (2001)

reviewed mathematical programming applications in the integrated steel industry.

The SCC scheduling problem can be described as a hybrid flowshop problem with

parallel machines at one or more stages (Zhao et al., 2011; Li et al., 2014; Atighehchian

et al., 2009; Tang et al., 2002; Pan et al., 2013; Missbauer et al., 2009). It is one of the

most complex and difficult hybrid flowshop problems due to the additional practical

constraints (e.g. job sequencing, precedence) and the more complicated scheduling

criteria (Li et al., 2014; Atighehchian et al., 2009; Tang et al., 2002). Pan et al. (2013)

and Li et al. (2014) described the problem as a realistic hybrid flowshop problem. The

traditional hybrid flowshop problem involves a process with multiple stages and multiple

machines at one or more stages in which all jobs go through the required stages in the

same order. The realistic hybrid flowshop problem is a generalization of the traditional

one that includes realistic considerations and constraints (Pan et al., 2013). The realistic

hybrid flowshop problem has been studied extensively due to its significant industrial

applications (Pan et al., 2013). However, the differences in the complex production

constraints, the production mode and the objectives of different steel producers result in

the need to develop scheduling systems adapted to the specifications of each

manufacturer. Additionally, although Gupta (1988) proved that the two-stage hybrid

flowshop problem is NP-complete and several publications on solving a hybrid flowshop

problem exist, these methods cannot be used for the SCC scheduling problem due to the

additional practical constraints (Tang et al., 2002). As a result, researchers have studied

several solution methods specifically for the SCC scheduling problem.

Harjunkoski and Grossmann (2001) suggested a decomposition method instead of

modelling one large-scale and unsolvable MILP. They considered a problem of a four-stage

process, with two machines in the first stage and a single machine at each of the remaining three

stages, multiple product types and a predetermined processing time of each product at

each stage. The problem was divided into the following sub-problems: (1) grouping jobs

into sequences; (2) scheduling each sequence individually; (3) combining all individual

schedules; (4) an LP-improvement problem which attempts to make improving changes to

the schedule developed in the previous phase. The model was tested using real-world data

and solved using GAMS-19.5/XPRESS-MP.

Missbauer et al. (2009) created a computerized scheduling system for a steel plant in

Austria. They modelled the SCC problem as a mixed integer linear program (MILP) and

used a heuristic algorithm for solving it. They divided the problem into four sub-problems:

(1) creating a schedule for the continuous casters; (2) assigning the jobs to the parallel

machines at the steelmaking and refining stages; (3) sequencing the jobs at the

steelmaking and refining machines; (4) determining the timing of each operation. Their

heuristic algorithm attempts to solve the problem in three stages: (1) scheduling the

continuous casters taking into account the capacity of the steelmaking and refining stages

and the supply of liquid metal; (2) scheduling all jobs at the remaining stages; (3) solving

a linear program to make improvements on the schedule. In other words, stages (1) and

(2) are heuristics that fix the values of the binary variables of their MILP model and

determine the initial values for the continuous variables. Based on these fixed values, the

final values of the continuous variables are calculated at stage (3). A complete

computerized scheduling program was developed based on their model, but it was not

implemented in existing software. Instead, a software vendor developed customized software.

Tang et al. (2000) proposed a mathematical programming model for overcoming machine

conflicts in SCC scheduling. Like Missbauer et al. (2009), they divided the

whole SCC problem into four sub-problems: (1) Cast sequencing. Casts are scheduled at

the continuous casters prioritizing those with the nearest delivery time. Resource

constraints are not considered at this stage and the problem is formulated as a single

machine sequence problem. (2) Creating sub-schedules. Scheduling the jobs of each cast

at the rest of the stages based on time progress. (3) Creating a “rough” schedule combining

the sub-schedules from the previous step. (4) Elimination of machine conflicts. Since

resource constraints are not considered in the previous steps, machine conflicts exist in

the “rough” schedule that must be eliminated to obtain a feasible schedule. The first three

steps are completed using human-computer interaction while the last one is solved using

a mathematical model. It is a non-linear program which can be transformed into a linear

program and solved by standard software packages. A combination of human-computer

interaction and the proposed model resulted in the development of a scheduling system in

MS C 6.0 language and SYBASE database system.

Bellabdaoui and Teghem (2006) presented a MILP model for scheduling the SCC process

of an Arcelor Group site in Belgium. The considered process consisted of three stages and

two parallel machines at each stage. They also accounted for the transportation time

between the machines. Their objective was to create a production schedule given one job

sequence for each casting machine and their initial condition (i.e. the instant a machine

becomes available) while minimizing the completion time for all sequences. Processing

times at the first two stages and transportation times were fixed and entered as input

parameters while the processing time of a charge on the casters was a decision variable

that varied between a lower and an upper bound. The model was implemented in

OMPartners software.

Fanti et al. (2016) developed a MILP model for a more complicated process. They

considered a four-stage process (melting, refining, degassing, casting/stripping) with

multiple machines. The last stage could be performed on two different types of machines

(continuous casting or ingot casting machines). They divided the problem into four sub-

problems that they modelled separately, and then, connected them introducing a set of

additional constraints. Thus, the solution was a global optimum. The four sub-models were

defined as: (1) SCC flow, establishing the sequence of the machines on which each charge

would be processed starting from melting and ending with casting. (2) Ladle scheduling,

assigning ladles to charges. (3) Continuous casting machines scheduling. (4) Ingot casting

machines scheduling. The optimization model considered deterministic processing times

for each machine and its objective was to minimize the completion time of the last job.

The model was implemented in C++.

Sbihi et al. (2014) attempted to introduce a generalized formulation of the SCC problem.

They modelled a three stage process as a MILP without directly considering any material

handling resources, such as ladles or transportation machines. For each caster, the number

of sequences and the number of charges assigned to each sequence was predetermined.

The objective was to schedule the jobs of each sequence at all stages while maximizing

productivity. The processing times at the first two stages of the SCC process and the

transportation times between machines were constant while the processing time of a

charge at the third stage was determined by the model and it was required to be between

an upper and a lower bound. The model was tested on CPLEX software.

Tang et al. (2002) modelled the scheduling problem as an integer program and used

Lagrangian relaxation for solving it. They modelled a three-stage process. The number of

casts and charges was predetermined as well as the processing and transportation times of

each charge at each stage. They proposed a solution method that incorporated Lagrangian

relaxation, dynamic programming and heuristics. After relaxing machine capacity

constraints in their formulation, the problem could be divided into simpler sub-problems.

Using dynamic programming, these simpler problems were solved at a low level while

Lagrangian multipliers were updated iteratively at a high level. After the iterations were

completed, a heuristic was used to modify the sub-problem solutions in order for a

feasible global solution to be obtained. Visual C++ language was used for the model

implementation.
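
The interplay between relaxed sub-problems, multiplier updates and a repair heuristic can be illustrated with a deliberately small sketch. It is not the model of Tang et al. (2002): it assumes a hypothetical toy problem in which each job needs one time slot on a single machine, the slot capacity constraints are relaxed, and each job's sub-problem is solved by a simple minimum rather than by dynamic programming. The cost matrix, step size and function names are illustrative assumptions.

```python
def lagrangian_bound(cost, iterations=50, step=1.0):
    """Subgradient method on a toy slot-assignment problem (illustrative only).

    cost[j][t] is the hypothetical cost of running job j in slot t.
    The relaxed constraint is "at most one job per slot".
    Returns the best lower bound found and the final multipliers.
    """
    n_slots = len(cost[0])
    lam = [0.0] * n_slots            # one multiplier per relaxed capacity constraint
    best_bound = float("-inf")
    for k in range(1, iterations + 1):
        # Low level: with capacity relaxed, each job picks its cheapest priced slot.
        choice = [min(range(n_slots), key=lambda t: row[t] + lam[t]) for row in cost]
        bound = sum(row[t] + lam[t] for row, t in zip(cost, choice)) - sum(lam)
        best_bound = max(best_bound, bound)
        # High level: raise the price of overloaded slots, lower it for unused ones.
        load = [choice.count(t) for t in range(n_slots)]
        lam = [max(0.0, l + (step / k) * (load[t] - 1)) for t, l in enumerate(lam)]
    return best_bound, lam


def repair(cost, lam):
    """Greedy repair: give each job the cheapest still-free slot (priced with lam).

    Assumes there are at least as many slots as jobs.
    """
    free = set(range(len(cost[0])))
    assignment = []
    for row in cost:
        t = min(free, key=lambda s: row[s] + lam[s])
        free.remove(t)
        assignment.append(t)
    return assignment


# Hypothetical data: three jobs, three slots.
cost = [[1, 4, 5], [2, 3, 6], [2, 9, 1]]
bound, lam = lagrangian_bound(cost)
plan = repair(cost, lam)
print("lower bound:", bound, "feasible plan:", plan)
```

The lower bound can never exceed the cost of the repaired plan, so the gap between the two gives a rough measure of solution quality in this toy setting.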

Tang and Liu (2007) created a deterministic mixed integer program model for scheduling

production orders based on data from Baosteel in Shanghai, China. Considering the SCC

and the rolling stages of the steel making process, the objective was to schedule each

production order at each stage under capacity constraints while minimizing the weighted

completion time of all orders. As in Tang et al. (2002), the solution method

presented was based on a combination of Lagrangian relaxation, linear programming and

heuristics and implemented using Visual C++.

Similarly, Mao et al. (2014) formulated the SCC problem as a MILP and used Lagrangian

relaxation to solve it. They modelled it as a hybrid flowshop problem, and their objective

was to minimize earliness/tardiness of the completion of charge processing. As in Tang

et al. (2002), relaxation of the machine capacity constraints resulted in two simpler

sub-problems for which several solution algorithms were tested and compared. It was

concluded that their proposed Lagrangian relaxation method results in better quality

solutions in less time than conventional Lagrangian relaxation techniques. Algorithms

were implemented in C#.

Researchers have also studied different versions of the SCC scheduling problem. For

example, Naphade et al. (2001) formulated a batching problem. Considering the size of

the charges, received orders and their delivery time and different product characteristics,

they developed a MILP that determined which charges should be used for each order and

processing times. The objective was to minimize tardiness of delivery time and waste.

Due to the computational difficulty of the problem and time restrictions, they developed

a two-level heuristic algorithm that decomposed the problem into simpler sub-problems

to solve it. The algorithm was implemented in C++. Additionally, Tan

and Liu (2013) formulated the SCC scheduling problem considering a variable electricity

price. The objective was to obtain a daily schedule minimizing electricity and production

costs. The proposed solution method consisted of two stages. At the first one, using

mathematical programming, a relative schedule for each cast was acquired without

considering the electricity price. At the second stage, a scheduling problem for all casts

with resource constraints and variable electricity was modelled and solved by using a

combination of heuristics and constraint propagation. The model was tested using real-

world data and implemented using JAVA language.

Pacciarelli and Pranzo (2004) presented a different heuristic approach. Their model was

based on the alternative graph, a generalization of the disjunctive graph of Roy and

Sussman. It was solved using a beam search procedure and implemented in C language.

Furthermore, Zhao et al. (2011) described a two-step solution approach. In the first step,

tabu search algorithm arranged the jobs on the machines. In the second one, starting and

ending times of the processing of jobs at all stages were determined solving a linear

programming model. Their model was implemented using Visual C++.

Using artificial intelligence to solve the SCC scheduling problem has also been

researched. For instance, Pan et al. (2013) described an artificial bee colony algorithm

that scheduled a multiple-stage, multiple-machine SCC process with predefined casts and

processing times at each stage. Furthermore, Li et al. (2014) proposed a fruit fly

optimisation algorithm for solving the SCC scheduling problem, formulated as a hybrid

flowshop problem with predefined casts and processing times. Atighehchian et al. (2009)

presented an approach that combined ant colony optimization and non-linear optimization

techniques for solving a similar version of the problem. The solution method consisted of

two stages: (1) assigning jobs to machines and determining sequencing; (2) determining

timings of the jobs on the machines.

Several formulations and solution approaches exist in literature; however, there is not a

universal model or methodology that can be applied in all cases. Each problem is highly

dependent on the structure of the considered steel plant and the desired objective. This is

obvious when comparing the problem presented by TATA Steel (described in Section 3)

to the literature findings above.

3. Process Description and Constraints

In this section, the SCC process followed at TATA Steel, Port Talbot and all the

restrictions applied to it are presented.

The steel and slab process consists of three different parts: basic oxygen steelmaking,

secondary steelmaking and continuous casting. Liquid iron is supplied by the blast

furnaces at an inconstant rate. The liquid iron is transferred using torpedoes. During the

basic oxygen steelmaking process, the liquid iron is poured into iron ladles. The content of

a ladle is termed a heat and each heat is used to produce slabs of a specific product type

and width. The filled ladle is transferred to the next stage during which desulphurisation

happens. To complete the basic oxygen steelmaking process, the liquid iron is transferred

to one of the two available vessels which have already been charged with scrap (approx.

80% (270-310 t) of the charge is hot metal and 20% (60-90 t) is scrap). After the vessel

has been loaded, a copper-tipped, water-cooled lance is lowered into the vessel and oxygen

is blown at a rate of 1000 m³/min.

The next part of the process is called dwell and it involves tapping and transfer, treatment

and floatation. The vessels are tapped into ladles which are transferred to one of the four

available treatment units where the secondary steelmaking process takes place. Depending on

the type of product being produced, different treatment units may need to

be used. Then, floatation follows and ladles are transferred to casting machines. The last

step is continuous casting where the hot metal is solidified into slabs in one of the casting

machines.

Figure 2 illustrates the different paths that can be followed by a heat before it is

transformed into slabs.

Figure 2. Steelmaking-Continuous Casting Process at TATA Steel

Before attempting to create a production schedule for this process, the constraints related

to this process must be introduced. Firstly, the hot metal stock needs to be kept between

an upper and a lower limit to avoid any disturbances in the production. As mentioned

above, the hot metal arrives at an inconstant rate. Additionally, several constraints related

to operation timings exist. The processing times for several parts of the process are not

fixed, but they need to lie within a predetermined range. The total dwell time and its

three parts (tap and transfer, treatment and floatation) can be

adjusted, but each duration needs to lie between predetermined minimum and maximum

values. For the treatment and the whole dwell process, these values depend on

the product type, while the tap and transfer and the floatation times are independent of the

product and need to be in the range of 15-25 minutes. Similarly, the processing time

of a heat in the casting machines is calculated based on the casting speed that is controlled

by adjusting the speed utilization percentage. The casting speed depends on the product,

width and the casting machine that is used, and the speed utilization ratio is usually in the

range of 70% to 90%. From the casting speed, the emptying time (processing time) is

determined. The duration of the remaining steps of the steelmaking process (e.g. vessel

blowing) is fixed and independent of any other variables.
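
The timing rules in this paragraph amount to simple interval checks. As a minimal sketch, assuming hypothetical function and parameter names, that the total dwell time is the sum of its three parts, and that the product-dependent treatment and dwell windows are supplied by the caller (their actual values are not given in this section), a dwell plan for one heat could be checked as follows:

```python
TAP_TRANSFER_RANGE = (15, 25)   # minutes, independent of the product
FLOATATION_RANGE = (15, 25)     # minutes, independent of the product


def within(value, bounds):
    """True if value lies inside the closed interval bounds = (low, high)."""
    low, high = bounds
    return low <= value <= high


def dwell_is_valid(tap_transfer, treatment, floatation, treatment_range, dwell_range):
    """Check one heat's dwell plan (all times in minutes).

    treatment_range and dwell_range are product-dependent (min, max) pairs;
    the values passed in below are hypothetical.
    """
    total = tap_transfer + treatment + floatation   # assumption: dwell is the sum of its parts
    return (within(tap_transfer, TAP_TRANSFER_RANGE)
            and within(treatment, treatment_range)
            and within(floatation, FLOATATION_RANGE)
            and within(total, dwell_range))


# Hypothetical product: treatment 20-40 min, total dwell 50-90 min.
print(dwell_is_valid(20, 30, 15, treatment_range=(20, 40), dwell_range=(50, 90)))
```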

Additionally, there are several constraints related to the casting machines. Casting needs

to be continuous in all machines with a gap of two to four minutes between the ending

time of the previous heat and the arrival of the next one. Any interruption would result in

a two-hour break in the steelmaking process. The sequence length is defined as the

number of consecutive heats of the same product and width. For each combination of

product, width and caster, there is a minimum and a maximum value for the sequence

length.

Lastly, there are constraints on the order of sequences. Some products cannot be

manufactured on the same caster after specific products due to restrictions in the flying

tundish changes. A table presenting permitted flying tundish changes is available in

Appendix A. Furthermore, the width may change by at most 200 mm between two

consecutive sequences. For example, a sequence of product 319x with

width 3600 mm can be placed after a sequence of product 319x with width 3400 mm, but

not after one of product 319x with width 3200 mm. If two sequences do not satisfy these

conditions and they are scheduled one after the other on the same continuous casting

machine, a two-hour gap must exist between the ending time of the preceding and the

starting time of the following sequence.
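
The sequence-order rules can be expressed as a small check. The sketch below is illustrative only: the function name and data layout are assumptions, and the set of forbidden product changes is a hypothetical stand-in for the flying tundish change table in Appendix A.

```python
MAX_WIDTH_CHANGE_MM = 200   # largest width change allowed without a gap
SETUP_GAP_MIN = 120         # two-hour gap, in minutes


def gap_required(prev, nxt, forbidden_changes):
    """Gap (minutes) needed between two consecutive sequences on one caster.

    prev and nxt are (product, width_mm) pairs. forbidden_changes is a set of
    (previous_product, next_product) pairs that the flying tundish change table
    does not permit; its contents here are hypothetical.
    """
    (prev_product, prev_width), (next_product, next_width) = prev, nxt
    width_ok = abs(next_width - prev_width) <= MAX_WIDTH_CHANGE_MM
    ftc_ok = (prev_product, next_product) not in forbidden_changes
    return 0 if (width_ok and ftc_ok) else SETUP_GAP_MIN


# The width example from the text: 3600 mm may follow 3400 mm but not 3200 mm.
print(gap_required(("319x", 3400), ("319x", 3600), set()))   # 0
print(gap_required(("319x", 3200), ("319x", 3600), set()))   # 120
```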

4. Mathematical Formulation

In this section, three MILP models are described. The first model schedules the heats at

the continuous casting stage, and the second one uses the results of the first one to

schedule the heats on the treatment units. The third one uses the results of the first one

and attempts to schedule the rest of the stages.

4.1. Model 1: Continuous Casting Scheduling

A MILP model for scheduling a predetermined number of casts (sequences) at the final

stage of the process (continuous casting) was formulated. For the development of this

model, the following assumptions were made:

• The length, product type and width of all sequences are predetermined.

• The processing time of all heats in the same sequence is constant but not fixed.

• All sequences can be processed on any casting machine independently of their width and product type.

• Not all continuous casting machines are available to start operating at time 0.

• The minimum acceptable processing time of a heat is 70% of its respective maximum time.

• Each machine has a determined number of positions, and each sequence processed on that machine occupies a position. In reality, such a parameter does not exist, but it was added in the formulation to avoid machine conflicts (i.e. no more than one sequence is processed on the same machine simultaneously). The number of positions is set equal to the total number of sequences, so that a solution can still be obtained if all sequences need to be assigned to one machine due to restrictions. However, this number can be modified. For instance, if the same number of sequences needed to be scheduled on all machines, the number of possible positions could be adjusted accordingly.

4.1.1. Notation

Several notations for defining the problem sets, indices, parameters and variables are

required to formulate the model:

Sets, Indices and Parameters

$P$    Set of product types.
$S$    Set of sequences.
$M$    Set of continuous casting machines.
$V$    Set of possible positions in a continuous casting machine ($v_k = N$ is the last position).
$p, p'$    Indices of product types.
$s, s'$    Indices of sequences.
$m$    Index of continuous casting machines.
$v$    Index of positions.
$N$    Number of total sequences.
$J_s$    Number of heats in sequence $s$.
$W_s$    Width of slabs that will be produced from the heats of sequence $s$.
$Y_{s,p}$    $=1$ if the heats of sequence $s$ will be used to produce slabs of product type $p$; 0 otherwise.
$MT_p$    Maximum processing time of a heat of product $p$.
$R_{p,p'}$    $=1$ if set-up time is required in order for a sequence of product type $p'$ to be processed after a sequence of product type $p$ on the same machine; 0 otherwise.
$AV_m$    Starting time of continuous casting machine $m$.

Decision Variables

$ST_{v,m}$    Starting time of position $v$ on machine $m$.
$ET_{v,m}$    Ending time of position $v$ on machine $m$.
$X_{s,v,m}$    $=1$ if sequence $s$ is processed on position $v$ of machine $m$; 0 otherwise.
$SR_{s,s',v,m}$    $=1$ if sequence $s'$ is assigned to the $(v+1)$th position of machine $m$, sequence $s$ to the $v$th position and set-up time is required between sequences $s$ and $s'$; 0 otherwise.

4.1.2. Model Constraints

The constraints developed represent the relationships a solution must meet considering machine conflicts, production continuity, sequence succession allowance, etc. A solution must satisfy the following constraints:

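$$\sum_{m \in M} \sum_{v \in V} X_{s,v,m} = 1 \quad \text{for all } s \in S \tag{1}$$

$$\sum_{s \in S} X_{s,v,m} \le 1 \quad \text{for all } m \in M,\ v \in V \tag{2}$$

$$X_{s,v,m} + X_{s',v+1,m} - 1 \le SR_{s,s',v,m} \quad \text{for all } m \in M,\ p, p' \in P,\ s, s' \in S,\ v = 1, \dots, v_k - 1 \text{ with } R_{p,p'} = 1,\ Y_{s,p} = 1,\ Y_{s',p'} = 1 \tag{3}$$

$$X_{s,v,m} + X_{s',v+1,m} - 1 \le SR_{s,s',v,m} \quad \text{for all } m \in M,\ s, s' \in S,\ v = 1, \dots, v_k - 1 \text{ with } W_s - W_{s'} > 200 \tag{4}$$

$$X_{s,v,m} + X_{s',v+1,m} - 1 \le SR_{s,s',v,m} \quad \text{for all } m \in M,\ s, s' \in S,\ v = 1, \dots, v_k - 1 \text{ with } W_{s'} - W_s > 200 \tag{5}$$

$$ST_{v,m} = ET_{v-1,m} + 120 \sum_{s \in S} \sum_{s' \in S} SR_{s,s',v-1,m} \quad \text{for all } m \in M,\ v = 2, \dots, v_k \tag{6}$$

$$ET_{v,m} \le ST_{v,m} + \sum_{s \in S} \sum_{p \in P} Y_{s,p} \, J_s \, MT_p \, X_{s,v,m} \quad \text{for all } m \in M,\ v \in V \tag{7}$$

$$ET_{v,m} \ge ST_{v,m} + \sum_{s \in S} \sum_{p \in P} Y_{s,p} \, J_s \cdot 0.7 \cdot MT_p \, X_{s,v,m} \quad \text{for all } m \in M,\ v \in V \tag{8}$$

$$\sum_{s \in S} X_{s,v,m} - \sum_{s \in S} X_{s,v+1,m} \ge 0 \quad \text{for all } m \in M,\ v = 1, \dots, v_k - 1 \tag{9}$$

$$ST_{1,m} \ge AV_m \quad \text{for all } m \in M \tag{10}$$

$$X_{s,v,m} \in \{0, 1\} \quad \text{for all } s \in S,\ m \in M,\ v \in V \tag{11}$$

$$SR_{s,s',v,m} \in \{0, 1\} \quad \text{for all } s, s' \in S,\ m \in M,\ v \in V \tag{12}$$

$$ST_{v,m} \ge 0 \quad \text{for all } m \in M,\ v \in V \tag{13}$$

$$ET_{v,m} \ge 0 \quad \text{for all } m \in M,\ v \in V \tag{14}$$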

Constraint (1) ensures that each sequence is assigned to exactly one position on exactly one machine. Constraint (2) means that at most one sequence is assigned to every position of a machine (i.e. no more than one sequence is processed on the same machine at a time). Constraints (3) - (5) determine if set-up time is required between two sequences.

Constraint (3) checks if set-up time is required between two sequences because their assigned product types cannot be processed on the same machine in succession. Constraints (4) - (5) check if set-up time is required between two sequences due to the difference in their assigned widths. Constraint (6) sets the starting time of a position equal to the ending time of the previous position on the same machine plus any set-up time required. Thus, continuous casting is ensured.

Constraints (7) – (8) ensure that the processing time of each sequence lies within an allowed range based on the number of heats it consists of and the minimum and maximum allowed processing times of the heats of its assigned product type.

Constraint (9) requires the positions of each machine to be filled without leaving any empty position between two occupied ones. Otherwise, a position could remain unoccupied, and its starting and ending times would coincide and equal both the ending time of the previous position and the starting time of the following one. In this case, if set-up time were required between two successive sequences that do not occupy two successive positions, it would not be included in the calculations. For example, consider two sequences s and s’ whose widths differ by more than 200 mm, where s’ is scheduled to be treated on machine m right after s. Because of their width difference, set-up time is required between the processing of the two sequences. Without constraint (9), s could be assigned to position v and s’ to position v+2 while position v+1 remained empty. In this scenario, it follows from constraints (6) – (8) that 𝐸𝑇𝑣,𝑚 = 𝑆𝑇𝑣+1,𝑚 = 𝐸𝑇𝑣+1,𝑚 = 𝑆𝑇𝑣+2,𝑚 since 𝑆𝑅𝑠,𝑠′,𝑣+2,𝑚 = 0. So, the set-up time is not added and the obtained schedule is not feasible in practice.

Constraint (10) forces the starting time of the first position of each machine to be greater than or equal to the time at which the respective machine can begin operating. Constraints (11) – (14) define the domains of the decision variables and are introduced to strengthen the formulation.
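The timing logic of constraints (3) – (8) and (10) can be illustrated for a single machine once the order of its sequences is fixed. The following is a minimal Python sketch, not the Xpress implementation from the appendices; the names `Sequence`, `setup_required` and `machine_timeline`, the `fraction` argument and the empty set of incompatible product pairs are assumptions made for illustration only.

```python
from dataclasses import dataclass

SETUP_MIN = 120      # set-up duration used in constraint (6)
WIDTH_LIMIT = 200    # width-difference threshold of constraints (4)-(5)
MIN_FRACTION = 0.7   # minimum processing time as a share of the maximum, constraint (8)


@dataclass
class Sequence:
    name: str
    heats: int            # J_s
    width: float          # W_s
    product: str          # product type p with Y_{s,p} = 1
    max_heat_time: float  # MT_p


def setup_required(a, b, incompatible):
    """True if a set-up is needed when b directly follows a (constraints (3)-(5)).

    `incompatible` is a set of (p, p') pairs with R_{p,p'} = 1 (hypothetical input).
    """
    return (a.product, b.product) in incompatible or abs(a.width - b.width) > WIDTH_LIMIT


def machine_timeline(order, available_at, incompatible, fraction=MIN_FRACTION):
    """Return (name, start, end) for each position of one machine.

    `fraction` in [0.7, 1] selects a processing time inside the range allowed
    by constraints (7)-(8); the machine starts exactly when it becomes available.
    """
    if not MIN_FRACTION <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0.7 and 1")
    timeline = []
    clock = available_at          # constraint (10), taken with equality
    previous = None
    for seq in order:
        if previous is not None and setup_required(previous, seq, incompatible):
            clock += SETUP_MIN    # constraint (6)
        start = clock
        end = start + seq.heats * seq.max_heat_time * fraction
        timeline.append((seq.name, start, end))
        clock = end
        previous = seq
    return timeline


# Sequences 1 and 4 of Table 1 on a machine that becomes available at time 20.
seq1 = Sequence("1", 6, 3200, "300x", 118)
seq4 = Sequence("4", 8, 2200, "345x", 72.6)
for row in machine_timeline([seq1, seq4], available_at=20, incompatible=set()):
    print(row)
```

Because the widths of the two sequences differ by 1000, the sketch inserts one 120-minute set-up between them. It only evaluates a given order; choosing the assignment and the order is what the MILP itself does.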

4.1.3. Objective Function

The aim of the model is to determine a feasible schedule while maximizing productivity. The objective is to minimize the total completion time of the sequences. All sequences on machine m are finished by the end of the last position, at time $ET_{N,m}$. Thus, the objective function is formulated as:

$$\min Z_1 = \sum_{m \in M} ET_{N,m} \tag{15}$$

4.2. Model 2: Treatment Units Scheduling

A MILP model for scheduling the heats at the treatment units after they have been scheduled at the continuous casting stage is presented. It should be highlighted that a solution is not guaranteed to exist for every output of the previous model. The following assumptions were made while formulating this model:

• As with the casting machines in the previous model, treatment units have a fixed number of positions, and each heat processed on a unit occupies one position. No such parameter exists in reality; it was added to the formulation to avoid machine conflicts. The number of positions is set equal to the total number of jobs, but this number can be modified.

• No transportation time is considered between the treatment units and the casting machines. In reality, the transportation time lies within the range of 15-25 minutes.

4.2.1. Notation

The following notation was used in formulating the model:

Sets, Indices and Parameters

$P$: Set of product types.
$J$: Set of heats.
$U$: Set of treatment units.
$K$: Set of possible positions in a treatment unit ($k_l = L$ is the last position).
$p$: Index of product types.
$j$: Index of heats.
$u$: Index of treatment units.
$k$: Index of positions.
$L$: Number of total heats.
$SCC_j$: Starting time of heat $j$ at the continuous casting stage.
$D_{j,p}$: = 1 if heat $j$ will be used to produce slabs of product type $p$; 0 otherwise.
$MaxT_p$: Maximum processing time of a heat of product $p$.
$MinT_p$: Minimum processing time of a heat of product $p$.
$MR_{p,u}$: = 1 if a heat of product type $p$ cannot be processed on treatment unit $u$; 0 otherwise.

Decision Variables

$BT_{k,u}$: Starting time of position $k$ on treatment unit $u$.
$FT_{k,u}$: Ending time of position $k$ on treatment unit $u$.
$Q_{j,k,u}$: = 1 if heat $j$ is processed on position $k$ of treatment unit $u$.

4.2.2. Constraints

The constraints developed represent the relationships a solution must meet considering machine conflicts, machine restrictions, etc. A solution must satisfy the following constraints:

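$$\sum_{u \in U} \sum_{k \in K} Q_{j,k,u} = 1 \quad \text{for all } j \in J \tag{16}$$

$$\sum_{j \in J} Q_{j,k,u} \le 1 \quad \text{for all } u \in U,\ k \in K \tag{17}$$

$$FT_{k,u} = \sum_{j \in J} Q_{j,k,u} \, SCC_j \quad \text{for all } u \in U,\ k \in K \tag{18}$$

$$Q_{j,k,u} = 0 \quad \text{for all } u \in U,\ p \in P,\ j \in J,\ k \in K \text{ with } MR_{p,u} = 1,\ D_{j,p} = 1 \tag{19}$$

$$FT_{k,u} \le BT_{k,u} + \sum_{j \in J} \sum_{p \in P} D_{j,p} \, MaxT_p \, Q_{j,k,u} \quad \text{for all } u \in U,\ k \in K \tag{20}$$

$$FT_{k,u} \ge BT_{k,u} + \sum_{j \in J} \sum_{p \in P} D_{j,p} \, MinT_p \, Q_{j,k,u} \quad \text{for all } u \in U,\ k \in K \tag{21}$$

$$FT_{k,u} \le BT_{k+1,u} \quad \text{for all } u \in U,\ k = 1, \dots, L - 1 \tag{22}$$

$$BT_{k,u} \ge 0 \quad \text{for all } u \in U,\ k \in K \tag{23}$$

$$FT_{k,u} \ge 0 \quad \text{for all } u \in U,\ k \in K \tag{24}$$

$$Q_{j,k,u} \in \{0, 1\} \quad \text{for all } j \in J,\ k \in K,\ u \in U \tag{25}$$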

Constraint (16) ensures that each heat is processed on one and only one treatment unit. Constraint (17) means that at most one heat is assigned to every position of a unit. Constraint (18) sets the ending time of a position equal to the starting time at the continuous casting stage of the heat assigned to this position. Constraint (19) prevents heats from being treated on units that cannot process their assigned product types.

Constraints (20) – (21) ensure that the processing time of each heat lies within the allowed range for its product type. Constraint (22) requires the ending time of a position to be smaller than or equal to the starting time of the next position on the same unit. Constraints (23) – (25) are added to strengthen the formulation.
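Constraints (18) – (22) can also be read as a checklist for one treatment unit. The following is a minimal Python sketch of such a check, assuming the positions of a unit are given in order as dictionaries; the function name `check_unit_schedule`, the dictionary keys and the example times are hypothetical and not part of the dissertation's implementation.

```python
def check_unit_schedule(slots, eps=1e-9):
    """Check the occupied positions of one treatment unit, given in position order.

    Each slot is a dict with keys: begin (BT), finish (FT), scc (SCC_j),
    min_t (MinT_p), max_t (MaxT_p) and allowed (False if MR_{p,u} = 1).
    Returns a list of (position, violated constraint number) pairs.
    """
    violations = []
    for k, slot in enumerate(slots, start=1):
        duration = slot["finish"] - slot["begin"]
        if abs(slot["finish"] - slot["scc"]) > eps:
            violations.append((k, 18))  # treatment must end when casting starts
        if not slot["allowed"]:
            violations.append((k, 19))  # unit cannot process this product type
        if duration > slot["max_t"] + eps:
            violations.append((k, 20))  # longer than the maximum treatment time
        if duration < slot["min_t"] - eps:
            violations.append((k, 21))  # shorter than the minimum treatment time
    for k in range(1, len(slots)):
        if slots[k - 1]["finish"] > slots[k]["begin"] + eps:
            violations.append((k, 22))  # position k overlaps position k + 1
    return violations


# Two heats with casting start times 50.82 and 101.64 and a 25-35 minute range.
schedule = [
    {"begin": 20.82, "finish": 50.82, "scc": 50.82, "min_t": 25, "max_t": 35, "allowed": True},
    {"begin": 70.0, "finish": 101.64, "scc": 101.64, "min_t": 25, "max_t": 35, "allowed": True},
]
print(check_unit_schedule(schedule))
```

A checker of this kind does not replace the MILP, but it could be used to validate solver output or candidate schedules produced by a heuristic.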

4.2.3. Objective Function

The objective is to minimize the total processing time of all heats. It is formulated as the sum of the differences between the ending and starting times of all positions:

$$\min Z_2 = \sum_{k \in K} \sum_{u \in U} \left( FT_{k,u} - BT_{k,u} \right) \tag{26}$$

4.3. Model 3: Basic Oxygen and Secondary Steelmaking Scheduling

A MILP model for scheduling the heats at the remaining stages after they have been scheduled at the continuous casting stage is described. This model was developed by modifying the sub-model of steelmaking and casting flow discussed in Fanti et al. (2016). It should be highlighted that a solution is not guaranteed to exist for every output of Model 1. The following assumption was made while formulating this model:

• No transportation time is considered between the machines at the different stages.

4.3.1. Notation

The following notation was used in formulating the model:

Sets, Indices and Parameters

$P$: Set of product types.
$J$: Set of heats.
$M$: Set of machines.
$I$: Set of stages.
$p$: Index of product types.
$j, k$: Indices of heats.
$m, u$: Indices of machines.
$i$: Index of stages ($i_l$ is the last stage).
$L$: Number of total heats.
$G$: A large number.
$SCC_j$: Starting time of heat $j$ at the continuous casting stage.
$D_{j,p}$: = 1 if heat $j$ will be used to produce slabs of product type $p$; 0 otherwise.
$MaxT_p$: Maximum processing time of a heat of product $p$ at the treatment units.
$MinT_p$: Minimum processing time of a heat of product $p$ at the treatment units.
$MR_{p,m}$: = 1 if a heat of product type $p$ cannot be processed on machine $m$; 0 otherwise.
$SM_{i,m}$: = 1 if machine $m$ can be used at stage $i$; 0 otherwise.
$PT_m$: Processing time of a heat at machine $m$ (set to 0 for the treatment units).

Decision Variables

$BT_{j,i,m}$: Starting time of heat $j$ at stage $i$ on machine $m$.
$FT_{j,i,m}$: Ending time of heat $j$ at stage $i$ on machine $m$.
$x_{j,i,m}$: = 1 if heat $j$ is processed on machine $m$ at stage $i$.
$y_{j,k,i,m}$: = 1 if heats $j$ and $k$ are both processed on machine $m$ at stage $i$ and heat $j$ precedes $k$.

4.3.2. Constraints

The following constraints represent the relationships a solution must meet considering

machine conflicts, machine restrictions, etc.:

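$$\sum_{m \in M} x_{j,i,m} = 1 \quad \text{for all } i \in I,\ j \in J \tag{27}$$

$$FT_{j,i_l,m} = x_{j,i_l,m} \, SCC_j \quad \text{for all } j \in J,\ m \in M \tag{28}$$

$$x_{j,i,m} \le 0 \quad \text{for all } m \in M,\ p \in P,\ j \in J,\ i \in I \text{ with } MR_{p,m} = 1,\ D_{j,p} = 1 \tag{29}$$

$$x_{j,i,m} \le SM_{i,m} \quad \text{for all } m \in M,\ j \in J,\ i \in I \tag{30}$$

$$FT_{j,i_l,m} \ge BT_{j,i_l,m} + \sum_{p \in P} D_{j,p} \, MinT_p \, x_{j,i_l,m} \quad \text{for all } m \in M,\ j \in J \tag{31}$$

$$FT_{j,i_l,m} \le BT_{j,i_l,m} + \sum_{p \in P} D_{j,p} \, MaxT_p \, x_{j,i_l,m} \quad \text{for all } m \in M,\ j \in J \tag{32}$$

$$FT_{j,i,m} + BT_{j,i,m} \le x_{j,i,m} \, G \quad \text{for all } m \in M,\ j \in J,\ i \in I \tag{33}$$

$$FT_{j,i,m} = BT_{j,i,m} + x_{j,i,m} \, PT_m \quad \text{for all } m \in M,\ j \in J,\ i = 1, \dots, i_l - 1 \tag{34}$$

$$BT_{j,i+1,u} \le FT_{j,i,m} + (2 - x_{j,i,m} - x_{j,i+1,u}) \, G \quad \text{for all } m, u \in M,\ j \in J,\ i = 1, \dots, i_l - 1 \tag{35}$$

$$BT_{j,i+1,u} \ge FT_{j,i,m} - (2 - x_{j,i,m} - x_{j,i+1,u}) \, G \quad \text{for all } m, u \in M,\ j \in J,\ i = 1, \dots, i_l - 1 \tag{36}$$

$$BT_{k,i,m} \ge FT_{j,i,m} - (1 - y_{j,k,i,m}) \, G \quad \text{for all } m \in M,\ j, k \in J,\ i \in I \tag{37}$$

$$y_{j,k,i,m} + y_{k,j,i,m} \ge x_{j,i,m} + x_{k,i,m} - 1 \quad \text{for all } m \in M,\ j, k \in J,\ i \in I,\ k \ne j \tag{38}$$

$$y_{j,k,i,m} \le x_{j,i,m} \quad \text{for all } m \in M,\ j, k \in J,\ i \in I \tag{39}$$

$$y_{j,k,i,m} \le x_{k,i,m} \quad \text{for all } m \in M,\ j, k \in J,\ i \in I \tag{40}$$

$$BT_{j,i,m} \ge 0 \quad \text{for all } m \in M,\ i \in I,\ j \in J \tag{41}$$

$$FT_{j,i,m} \ge 0 \quad \text{for all } m \in M,\ i \in I,\ j \in J \tag{42}$$

$$x_{j,i,m} \in \{0, 1\} \quad \text{for all } m \in M,\ i \in I,\ j \in J \tag{43}$$

$$y_{j,k,i,m} \in \{0, 1\} \quad \text{for all } m \in M,\ i \in I,\ j, k \in J \tag{44}$$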

Constraint (27) ensures that each heat is processed on one and only one machine at every stage. Constraint (28) sets the ending time of a heat at the secondary steelmaking stage equal to its starting time at the continuous casting stage. Constraint (29) prevents heats from being assigned to machines that cannot process their product type. Constraint (30) forces heats to be treated on the appropriate machines at each stage. In other words, if machine m is a treatment unit, then Constraint (30) ensures it is used only for processing heats at the secondary steelmaking stage.

Constraints (31) – (32) ensure that the processing time of each heat at the secondary steelmaking stage lies within the allowed range for its product type. Constraint (33) sets the starting and ending times of a heat on a machine to zero if the heat is not treated on that machine. Constraint (34) sets the ending time of a heat on a specific machine equal to its starting time plus the respective processing time at the stages preceding secondary steelmaking.

Constraints (35) - (36) ensure that the starting time of a heat at a specific stage is equal to the ending time of that heat at the previous stage. Constraint (37) forces the starting time of a heat to be greater than or equal to the ending time of all the preceding heats treated on the same machine. Constraint (38) guarantees that if two heats are processed on the same machine, then one precedes the other. Constraints (39) – (44) are used to strengthen the formulation.
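To see how the big-M constraints (37), (38) and (40) resolve a machine conflict, it may help to trace them for two heats. The following worked reading assumes that $G$ is at least as large as any ending time in the schedule; it restates the constraints for one case and adds nothing to the model.

```latex
% Case A: heats j and k are both assigned to machine m at stage i,
% i.e. x_{j,i,m} = x_{k,i,m} = 1.
\begin{align*}
(38):&\quad y_{j,k,i,m} + y_{k,j,i,m} \ge 1 + 1 - 1 = 1
      && \text{at least one ordering is chosen} \\
(37) \text{ with } y_{j,k,i,m} = 1:&\quad BT_{k,i,m} \ge FT_{j,i,m}
      && \text{heat } k \text{ starts after heat } j \text{ ends} \\
(37) \text{ with } y_{j,k,i,m} = 0:&\quad BT_{k,i,m} \ge FT_{j,i,m} - G
      && \text{non-binding when } G \ge FT_{j,i,m}
\end{align*}
% Case B: heat k is processed on a different machine, i.e. x_{k,i,m} = 0.
\begin{align*}
(40):&\quad y_{j,k,i,m} \le x_{k,i,m} = 0
      && \text{no ordering is imposed on machine } m
\end{align*}
```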

4.3.3. Objective Function

The objective is to minimize the total processing time of all heats at the treatment units (the last stage). It is formulated as the sum of the differences between the ending and starting times of all jobs at the last stage:

$$\min Z_3 = \sum_{m \in M} \sum_{j \in J} \left( FT_{j,i_l,m} - BT_{j,i_l,m} \right) \tag{45}$$

5. Experimental Tests

The models presented in Section 4 were implemented in FICO Xpress-Optimizer Solver. The code developed for the three models can be found in Appendices C, D and E. This section discusses the results of a case study. A set of six sequences consisting of 48 heats in total was considered. A detailed description of the data is given in Tables 1 and 2.

Table 1. Sequences - Case Study 1

Sequence  Heats  Product Type  Width (mm)  Possible Treatment Units  Max Treatment Time (min)  Min Treatment Time (min)  Max Casting Time (min)
1         6      300x          3200        RH, RD                    45                        35                        118
2         10     319x          2000        CAS1, CAS2, RH, RD        35                        25                        72.6
3         8      336x          1800        CAS1, CAS2, RH, RD        35                        25                        72.6
4         8      345x          2200        CAS1, CAS2                35                        25                        72.6
5         8      319x          1800        CAS1, CAS2, RH, RD        35                        25                        72.6
6         8      336x          2000        CAS1, CAS2, RH, RD        35                        25                        72.6

Table 2. Availability of the continuous casting machines - Case Study 1

Continuous Casting Machine  Starting Time
CC1                         0
CC2                         20
CC3                         0

Figure 3 is a visualization of the results obtained from Model 1 (after they were associated with the results of Model 2, so the starting times of machines 1 and 3 are not 0). The allocation of the sequences to the continuous casting machines and their respective starting and ending times are presented on the graph. It was determined that all constraints included in the model were satisfied: no more than one sequence is processed on a machine at a time, all sequences are treated, casting is continuous, the restrictions on the order of sequences on the machines due to product types and width are satisfied, and all casters start operating after the time they become available.

Figure 3. Allocation of sequences to continuous casting machines – Case Study 1

The results acquired from Model 1 were used as input for Model 2. However, data

preprocessing was required before attempting to solve Model 2. The starting time of each

heat at the continuous casting stage was required. If sequence s was assigned to position

v of machine m, then the starting times of the heats in this sequence were determined as follows:

$$SCC_j = ST_{v,m} + (j - 1) \cdot \frac{ET_{v,m} - ST_{v,m}}{J_s} \quad \text{for all } j = 1, \dots, J_s \tag{46}$$
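Equation (46) simply divides the time of a position equally among the heats of its sequence. A minimal Python sketch of this preprocessing step is given below; the function name `heat_start_times` and the example values (a position starting at 0 and ending at 406.56 with 8 heats) are assumptions chosen for illustration.

```python
def heat_start_times(st, et, n_heats):
    """Return SCC_j for j = 1..J_s according to equation (46).

    st and et are the starting and ending times of the position occupied by
    the sequence; n_heats is J_s. Every heat gets an equal share of the time.
    """
    if n_heats <= 0:
        raise ValueError("a sequence needs at least one heat")
    per_heat = (et - st) / n_heats
    return [st + (j - 1) * per_heat for j in range(1, n_heats + 1)]


# Hypothetical position: 8 heats cast between time 0 and time 406.56.
for j, scc in enumerate(heat_start_times(0.0, 406.56, 8), start=1):
    print(j, round(scc, 2))
```

The last heat starts one share before the ending time of the position, so the casting of the sequence is complete at $ET_{v,m}$.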

Using these data and Model 2, the schedules shown in Figures 4 to 7 were acquired for the four treatment units. Again, all constraints considered in Model 2 were satisfied. All heats are treated on the appropriate units based on product type, no more than one heat is processed on a unit at any instant, and the ending times of the heats at the treatment stage are equal to their respective starting times at the continuous casting stage.

[Figure 3 chart: horizontal axis Time (0 to 1200); vertical axis CC Machines (CC1, CC2, CC3); bars labelled Sequence 1 to Sequence 6.]

Figure 4. Treatment Unit 1 (RH) Schedule – Case Study 1

Figure 5. Treatment Unit 2 (RD) Schedule – Case Study 1

Figure 6. Treatment Unit 3 (CAS1) Schedule – Case Study 1

Figure 7. Treatment Unit 4 (CAS2) Schedule – Case Study 1

Furthermore, the results of Model 1 and 2 were combined and the schedule presented in

Figure 8 was obtained. Each line (color) represents a heat, each horizontal segment of a

line represents the processing of that heat at the respective unit/machine and the bullets

the starting and ending times on each machine. The vertical segments connect the

machines on which a heat is treated at the different stages. Table 3 indicates which

machine corresponds to each number on the vertical axis of the graph. It is observed that

a feasible schedule for the continuous casting machines and the treatment units combined

can be acquired from Models 1 and 2.

Figure 8. Heats Schedule – Case Study 1


Table 3. Machine ID Number – Figure 8

ID Number  Machine
1          Treatment Unit 1 (CAS1)
2          Treatment Unit 2 (CAS2)
3          Treatment Unit 3 (RH)
4          Treatment Unit 4 (RD)
5          Continuous Caster 1 (CC1)
6          Continuous Caster 2 (CC2)
7          Continuous Caster 3 (CC3)

The same data as in Model 2 were used as input for Model 3, and two attempts were made to solve it. However, both runs were terminated by the user before a solution was obtained due to time restrictions. The first run was terminated after 76,043.6 s (approx. 21 hours) and the second one after 15,204.8 s (approx. 4.2 hours). To check the validity of Model 3, a set of 10 heats was used as input. More details of the input data can be found in Table 4.

Table 4. Heats – Case Study 2

Heat  Continuous Casting Starting Time  Product Type
1     50.82                             336x
2     101.64                            336x
3     152.46                            300x
4     203.28                            300x
5     254.1                             336x
6     304.92                            336x
7     355.74                            319x
8     406.56                            319x
9     457.38                            345x
10    508.2                             345x

The schedule acquired after solving Model 3 is shown in Figure 9. Table 5 shows which

machine corresponds to each number on the vertical axis of the graph. It can be observed

that all constraints (machine conflicts, processing times, continuity, etc.) were satisfied.

However, further research on whether this model could be used for business purposes is

required due to time restrictions.

Figure 9. Heats Schedule - Case Study 2

Table 5. Machine ID Number – Figure 9

ID Number  Machine
1          Hot Metal Pouring (HM1)
2          Hot Metal Pouring (HM2)
3          Desulphurization (DeS1)
4          Desulphurization (DeS2)
5          Vessel Blow (Vessel1)
6          Vessel Blow (Vessel2)
7          Treatment Unit 1 (CAS1)
8          Treatment Unit 2 (CAS2)
9          Treatment Unit 3 (RH)
10         Treatment Unit 4 (RD)

6. Recommendations for Further Development

The SCC scheduling problem presented by TATA Steel proved to be challenging and an

end-to-end solution could not be developed in the required time frame. Further research

and investigation on whether other solution methods can be effectively applied is required.

In this section, the challenges faced in formulating a mathematical model for the process

are discussed. Furthermore, recommendations on how this project could be continued are

given.

The SCC process at TATA Steel Port Talbot involves numerous constraints and decision

variables while the predetermined parameters are limited. It was determined that the

decision variables needed to be reduced to formulate a MILP model. Hence, a model in

which the number of sequences to be scheduled, the number of heats per sequence and

their product type are input parameters was developed. The majority of literature

examples studied follow a similar approach. The grouping of heats (sequences) is an input

parameter determined by solving a batching problem that considers demand (or received

orders and their delivery time). Thus, it is proposed that a different model is developed

for this purpose, so its output can be used as input for the continuous casting model.

As already discussed in Section 2, several approaches found in literature decompose the

SCC scheduling problem into sub-problems to obtain a solution. Due to the size of the

considered problem and the large number of constraints, a similar approach is

recommended. Combining a batching problem for grouping heats and the two models

introduced in section 4 with additional ones could lead to obtaining a schedule for the

whole SCC process. The two developed sub-models schedule the last two stages of the

SCC process. One or more additional sub-models that schedule the rest of the stages are

necessary for obtaining an end-to-end solution. The rest of the stages could be modelled

as a single-stage hybrid flowshop problem since processing times are constant for all

heats, there are only two parallel machines and no machine-product type restrictions exist.

However, in this case, a feasible schedule cannot be obtained if one of the parallel

machines is unavailable.

Another challenge is combining the different sub-models. This difficulty derives from accounting for both machine conflicts and varying processing times. Considering the sub-models presented in this paper, two possible ways of connecting them are introducing non-linear constraints or developing a heuristic. For example, consider Models 1, 2 and 3 formulated in Section 4 along with the following parameters:

$F_{j,s}$: = 1 if heat $j$ belongs to sequence $s$; 0 otherwise.

$HV_j$: The position of heat $j$ within the sequence $s$ for which $F_{j,s} = 1$.

And the following decision variable:

$HP_{j,p}$: = 1 if heat $j$ is assigned to a sequence of product type $p$.

Then, Model 1 can be connected with Model 2 or 3 by adding the nonlinear constraint (47) and constraint (48):

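$$SCC_j = \sum_{s \in S} F_{j,s} \left\{ \sum_{m \in M} \sum_{v \in V} X_{s,v,m} \left[ ST_{v,m} + (HV_j - 1) \cdot \frac{ET_{v,m} - ST_{v,m}}{J_s} \right] \right\} \quad \text{for all } j \in J \tag{47}$$

$$HP_{j,p} = \sum_{s \in S} F_{j,s} \, Y_{s,p} \quad \text{for all } j \in J,\ p \in P \tag{48}$$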

In this way, if Model 2 (or 3) cannot find a feasible solution for the output of the

continuous casting sub-model the processing time of the heats at the continuous casting

stage will be adjusted accordingly. Appropriate methods for solving a nonlinear problem

or converting such a model to a linear model should be investigated.
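One standard device that could be investigated for the conversion is the big-M linearization of a product between a binary variable and a bounded continuous expression. The sketch below applies it to constraint (47). The auxiliary variable $T_{j,v,m}$ and the shorthand $s(j)$ for the sequence with $F_{j,s} = 1$ are assumptions introduced here, and $G$ is the large number of Model 3, assumed to be an upper bound on all ending times.

```latex
% E_{j,v,m} is linear in ST and ET because HV_j and J_{s(j)} are parameters.
\begin{align*}
E_{j,v,m} &= ST_{v,m} + (HV_j - 1)\,\frac{ET_{v,m} - ST_{v,m}}{J_{s(j)}} \\
% T_{j,v,m} is a new continuous variable meant to equal X_{s(j),v,m} * E_{j,v,m}.
T_{j,v,m} &\le G \, X_{s(j),v,m} \\
T_{j,v,m} &\le E_{j,v,m} \\
T_{j,v,m} &\ge E_{j,v,m} - G \, (1 - X_{s(j),v,m}) \\
T_{j,v,m} &\ge 0 \\
SCC_j &= \sum_{m \in M} \sum_{v \in V} T_{j,v,m}
\end{align*}
```

If $X_{s(j),v,m} = 1$, the second and third inequalities force $T_{j,v,m} = E_{j,v,m}$; if it is 0, the first and fourth force $T_{j,v,m} = 0$. This relies on $0 \le E_{j,v,m} \le G$. Note that once $SCC_j$ becomes a variable, the product $Q_{j,k,u} \, SCC_j$ in constraint (18) is also bilinear and would presumably need the same treatment.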

As for the heuristic option, an algorithm is suggested that manipulates the continuous casting solution so that a feasible solution for the treatment units scheduling model can be obtained. This can be achieved by finding feasible solutions of the casting problem with increased cost (i.e. total completion time) and using them as input for Model 2 or Model 3 until a feasible solution is obtained. A way of finding feasible solutions of the casting scheduling problem while the cost increases gradually must be established. This process could be repeated for a specified number of iterations and, if a solution is not obtained, then the heat grouping could be changed. A visualization of such a process is presented in Figure 10.

Figure 10. Proposed Heuristic
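The control flow of the proposed heuristic can be sketched in a few lines of Python. This is only an outline under stated assumptions: `casting_candidates` is assumed to yield casting schedules in order of non-decreasing cost, and `solve_treatment` is a hypothetical callable that returns a treatment schedule or `None` when Model 2 (or 3) is infeasible for the given casting schedule. How such candidates are generated is exactly the open question raised above.

```python
from typing import Callable, Iterable, Optional, Tuple, TypeVar

C = TypeVar("C")  # type of a casting schedule
T = TypeVar("T")  # type of a treatment schedule


def connect_models(
    casting_candidates: Iterable[C],
    solve_treatment: Callable[[C], Optional[T]],
    max_iterations: int,
) -> Optional[Tuple[C, T, int]]:
    """Try casting schedules one by one until the treatment model is feasible.

    Returns (casting schedule, treatment schedule, iterations used), or None
    if no feasible pair is found within max_iterations; in that case the
    caller would change the heat grouping and start again.
    """
    for iteration, casting in enumerate(casting_candidates, start=1):
        if iteration > max_iterations:
            break
        treatment = solve_treatment(casting)
        if treatment is not None:
            return casting, treatment, iteration
    return None


# Toy stand-ins: a casting "schedule" is represented by its cost only, and the
# treatment model is pretended to be feasible from cost 110 upwards.
candidates = [100, 110, 125]
print(connect_models(candidates, lambda cost: "feasible" if cost >= 110 else None, 5))
```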

If Model 2 were used, the next step would be to schedule the remaining stages of the process. One or more sub-models could be developed for this purpose. Connecting any additional sub-models with the existing ones could be achieved as discussed above.

However, attempting to schedule the whole SCC process by solving a single MILP problem may have several disadvantages. Due to the size and complexity of the considered problem, such an approach could require too much time to be practical for business purposes or result in an NP-hard problem. The SCC scheduling problem can be regarded as a realistic hybrid flowshop problem, and it has been proven that a two-stage hybrid flowshop problem is NP-complete (Pan et al., 2013; Li et al., 2014; Gupta, 1988); nevertheless, further investigation is required to determine if this approach would be practical for TATA Steel. Their SCC process consists of five stages, without considering transportation between the machines, and involves several extra practical constraints. For this reason, different methods like artificial intelligence and heuristics could be further researched. As mentioned in Section 2, many researchers have used these techniques to solve similar problems. Formulating linear or nonlinear models may prove useful as a first step to understand the specifics of the problem; however, feasible solutions could be obtained in less time using these methods.

Furthermore, additional improvements to the existing models can be made. The proposed continuous casting scheduling model assumes that the processing time for all heats of a sequence is the same; however, it can vary as long as it is within the respective acceptable range. Finding a way to integrate this into the model would give more flexibility to the treatment unit scheduling model and result in solutions with reduced cost. In the formulated models, transportation time between two stages is not considered, so it must be included to obtain a schedule that is feasible in practice. It should be highlighted that the transportation times between the vessel blow and the treatment stages, and between the treatment and the continuous casting stages, are not fixed, but they must be within specified bounds. Adjusting these values could result in reduced-cost solutions, but it could not be determined how this could be integrated into the model. The transportation times between the rest of the stages are constant and independent of the product type and of which machine is used.

Although the processing and transportation times at several stages need to lie within a specified range and are not predetermined, target values have been set for all of them. Another improvement would be to force the actual times to be as close as possible to the target values. This could be achieved by introducing new variables that represent the difference of the actual processing times from their target values and adding them to the objective function. Larger differences would result in increased cost. In case this factor is not considered as important as the total completion time, a weighted objective function could be used. In this way, it can be controlled which factors influence the cost the most.
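As an illustration of this idea, deviation variables could be added to Model 2 as sketched below. The target treatment time $TT_p$, the deviation variables $d^{+}_{k,u}$ and $d^{-}_{k,u}$ and the weights $w_1, w_2$ are hypothetical symbols introduced here; $Z_2$ is the objective of equation (26), and the same pattern could be applied to the objective of another model.

```latex
\begin{align*}
% Deviation of the actual processing time of position k on unit u from the
% target of the heat assigned to it, split into overrun and underrun.
FT_{k,u} - BT_{k,u} - \sum_{j \in J} \sum_{p \in P} D_{j,p} \, TT_p \, Q_{j,k,u}
   &= d^{+}_{k,u} - d^{-}_{k,u} && \text{for all } u \in U,\ k \in K \\
d^{+}_{k,u},\ d^{-}_{k,u} &\ge 0 && \text{for all } u \in U,\ k \in K \\
% Weighted objective: w_1 and w_2 set the relative importance of the two terms.
\min\ Z &= w_1 \, Z_2 + w_2 \sum_{u \in U} \sum_{k \in K} \left( d^{+}_{k,u} + d^{-}_{k,u} \right)
\end{align*}
```

For an empty position, constraints (20) – (21) give $FT_{k,u} = BT_{k,u}$ and the target term is zero, so both deviations can be zero and the position adds no cost.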

Additionally, a significant constraint was not included in the models formulated in this paper. As mentioned in Section 3, hot metal is produced at a variable rate and the hot metal stock must constantly remain between an acceptable lower and upper bound. The difficulty with adding such a constraint is that the models do not include a time variable, only the starting and ending times of processing at the different stages. As a result, the hot metal stock cannot be calculated at all instants.

If a complete scheduling system is developed, the next stage is to integrate the maintenance schedule into it. This means that one or more machines are not available for several time intervals. If the timing of the maintenance of a machine is known, a possible way of formulating this constraint is to require that each heat assigned to that machine either ends before the maintenance starts or starts after the maintenance ends. If the maintenance of a machine happens after a specific number of heats has been processed, integrating the maintenance schedule into the production schedule is more complicated. Also, more input data are required, such as the time at which each machine was last maintained.
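For a maintenance window with known timing, the either-or condition could be written with one extra binary variable per heat, in the same big-M style as constraint (37). In the sketch below, which uses the notation of Model 3, the maintenance start $MS_m$, the maintenance end $ME_m$ and the binary variable $z_{j,i,m}$ (equal to 1 if heat $j$ is processed before the maintenance) are hypothetical symbols, not part of the formulated models.

```latex
\begin{align*}
% If z = 1, the heat must end before the maintenance of machine m starts.
FT_{j,i,m} &\le MS_m + (1 - z_{j,i,m}) \, G
   && \text{for all } j \in J,\ i \in I,\ m \in M \\
% If z = 0 and the heat is on machine m, it must start after the maintenance ends.
BT_{j,i,m} &\ge ME_m - (z_{j,i,m} + 1 - x_{j,i,m}) \, G
   && \text{for all } j \in J,\ i \in I,\ m \in M \\
z_{j,i,m} &\in \{0, 1\}
   && \text{for all } j \in J,\ i \in I,\ m \in M
\end{align*}
```

When heat $j$ is not processed on machine $m$ ($x_{j,i,m} = 0$), constraint (33) sets both times to zero and both inequalities can be satisfied, so the window restricts only the heats actually assigned to the machine.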

The use of different software or a programming language is required to further develop the scheduling system. The models formulated in this paper were implemented in FICO Xpress-Optimizer Solver. However, designing a complete scheduling system in it seems inefficient due to its limited capabilities. Furthermore, the majority of the models discussed in Section 2 were implemented in programming languages such as C++ and C#. As a result, shifting from FICO Xpress-Optimizer Solver to a programming language is recommended, especially if a heuristic or artificial intelligence method is attempted. Moreover, the FICO Xpress Optimizer has interfaces through a library accessible from C/C++, .NET, Java and Visual Basic for Applications (VBA). Thus, the models presented in this paper could be integrated into a scheduling system developed in one of these languages.

7. Conclusion

In conclusion, the SCC scheduling problem as presented by TATA Steel Port Talbot Works is challenging and significantly more complex than similar problems found in the literature. The main difficulty derives from the fact that various product types need to be considered and heats must be scheduled in sequences at the continuous casting stage. Combining the above with the resolution of machine conflicts and variable processing times at various stages results in great difficulties that must be overcome when mathematically formulating the problem.

It is suggested that the problem is divided into simpler ones that can be solved separately and combined in the end. In this paper, three models are presented. Model 1 attempts to schedule a set of sequences at the continuous casting stage, while Model 2 uses the results of Model 1 to schedule the heats at the secondary steelmaking stage. Model 3 attempts to schedule the heats at all stages before continuous casting based on the results of Model 1. It should be highlighted that a feasible solution of Model 2 or 3 may not exist for every output of Model 1. Additionally, during an attempt to schedule a total of 48 heats at all stages, results could not be acquired from Model 3 in a reasonable time frame. However, a feasible solution was obtained from both Models 1 and 2.

Further research on the mathematical formulation and solution methods of the problem is necessary. The models discussed in the paper could serve as part of a larger scheduling system or as guidance for developing a different solution method. It is suggested that the application of heuristic methods is studied, since a MILP model may be inefficient for business purposes due to the large scale of the problem and the numerous constraints.

References

Atighehchian, A., Bijari, M. and Tarkesh, H. (2009). A novel hybrid algorithm for scheduling steel-making continuous casting production. Computers & Operations Research, [online] 36(8), pp.2450-2461. Available at: https://www.sciencedirect.com/science/article/pii/S0305054808001937 [Accessed 21 Aug. 2019].

Bellabdaoui, A. and Teghem, J. (2006). A mixed-integer linear programming model for the continuous casting planning. International Journal of Production Economics, [online] 104(2), pp.260-270. Available at: https://www.sciencedirect.com/science/article/pii/S0925527304004384 [Accessed 18 Aug. 2019].

Dutta, G. and Fourer, R. (2001). A Survey of Mathematical Programming Applications in Integrated Steel Plants. Manufacturing & Service Operations Management, [online] 3(4), pp.387-400. Available at: https://www.scholars.northwestern.edu/en/publications/a-survey-of-mathematical-programming-applications-in-integrated-s [Accessed 22 Aug. 2019].

Fanti, M., Rotunno, G., Stecco, G., Ukovich, W. and Mininel, S. (2016). An Integrated System for Production Scheduling in Steelmaking and Casting Plants. IEEE Transactions on Automation Science and Engineering, [online] 13(2), pp.1112-1128. Available at: https://www.researchgate.net/publication/282427141_An_Integrated_System_for_Production_Scheduling_in_Steelmaking_and_Casting_Plants [Accessed 18 Aug. 2019].

Gupta, J. (1988). Two-Stage, Hybrid Flowshop Scheduling Problem. The Journal of the Operational Research Society, [online] 39(4), p.359. Available at: https://www.jstor.org/stable/2582115 [Accessed 22 Aug. 2019].

Harjunkoski, I. and Grossmann, I. (2001). A decomposition approach for the scheduling of a steel plant production. Computers & Chemical Engineering, [online] 25(11-12), pp.1647-1660. Available at: https://www.sciencedirect.com/science/article/pii/S0098135401007293 [Accessed 22 Aug. 2019].

Li, J., Pan, Q., Mao, K. and Suganthan, P. (2014). Solving the steelmaking casting problem using an effective fruit fly optimisation algorithm. Knowledge-Based Systems, [online] 72, pp.28-36. Available at: https://www.sciencedirect.com/science/article/pii/S0950705114003220#b0105 [Accessed 21 Aug. 2019].

Li, J., Xiao, X., Tang, Q. and Floudas, C. (2012). Production Scheduling of a Large-Scale Steelmaking Continuous Casting Process via Unit-Specific Event-Based Continuous-Time Models: Short-Term and Medium-Term Scheduling. Industrial & Engineering Chemistry Research, [online] 51(21), pp.7300-7319. Available at: https://pubs.acs.org/doi/full/10.1021/ie2015944 [Accessed 15 Aug. 2019].

Mao, K., Pan, Q., Pang, X. and Chai, T. (2014). A novel Lagrangian relaxation approach for a hybrid flowshop scheduling problem in the steelmaking-continuous casting process. European Journal of Operational Research, [online] 236(1), pp.51-60. Available at: https://www.sciencedirect.com/science/article/pii/S0377221713009090 [Accessed 21 Aug. 2019].

Missbauer, H., Hauber, W. and Stadler, W. (2009). A scheduling system for the steelmaking-continuous casting process. A case study from the steel-making industry. International Journal of Production Research, 47(15), pp.4147-4172.

Naphade, K., Wu, S., Storer, R. and Doshi, B. (2001). Melt Scheduling to Trade Off Material Waste and Shipping Performance. Operations Research, [online] 49(5), pp.629-645. Available at: https://pubsonline.informs.org/doi/abs/10.1287/opre.49.5.629.10611 [Accessed 21 Aug. 2019].

Pacciarelli, D. and Pranzo, M. (2004). Production scheduling in a steelmaking-continuous casting plant. Computers & Chemical Engineering, [online] 28(12), pp.2823-2835. Available at: https://www.sciencedirect.com/science/article/pii/S0098135404002637 [Accessed 22 Aug. 2019].

Pan, Q., Wang, L., Mao, K., Zhao, J. and Zhang, M. (2013). An Effective Artificial Bee Colony Algorithm for a Real-World Hybrid Flowshop Problem in Steelmaking Process. IEEE Transactions on Automation Science and Engineering, 10(2), pp.307-322.

Sbihi, A., Bellabdaoui, A. and Teghem, J. (2014). Solving a mixed integer linear program with times setup for the steel-continuous casting planning and scheduling problem. International Journal of Production Research, 52(24), pp.7276-7296.

Tang, L. and Liu, G. (2007). A mathematical programming model and solution for scheduling production orders in Shanghai Baoshan Iron and Steel Complex. European Journal of Operational Research, [online] 182(3), pp.1453-1468. Available at: https://www.sciencedirect.com/science/article/pii/S0377221706009830 [Accessed 21 Aug. 2019].

Tang, L., Liu, J., Rong, A. and Yang, Z. (2000). A mathematical programming model for scheduling steelmaking-continuous casting production. European Journal of Operational Research, [online] 120(2), pp.423-435. Available at: https://www.sciencedirect.com/science/article/pii/S0377221799000417 [Accessed 15 Aug. 2019].

Tang, L., Luh, P., Liu, J. and Fang, L. (2002). Steel-making process scheduling using Lagrangian relaxation. International Journal of Production Research, [online] 40(1), pp.55-70. Available at: https://www.tandfonline.com/doi/abs/10.1080/00207540110073000 [Accessed 16 Aug. 2019].

Tang, L. and Wang, G. (2008). Decision support system for the batching problems of steelmaking and continuous-casting production. Omega, [online] 36(6), pp.976-991. Available at: https://www.sciencedirect.com/science/article/pii/S0305048307001223 [Accessed 15 Aug. 2019].

Tang, L., Zhao, Y. and Liu, J. (2014). An Improved Differential Evolution Algorithm for Practical Dynamic Scheduling in Steelmaking-Continuous Casting Production. IEEE Transactions on Evolutionary Computation, [online] 18(2), pp.209-225. Available at: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6473881 [Accessed 15 Aug. 2019].

Tan, Y. and Liu, S. (2013). Models and optimisation approaches for scheduling steelmaking–refining–continuous casting production under variable electricity price. International Journal of Production Research, 52(4), pp.1032-1049.

Zhao, Y., Jia, F., Wang, G. and Wang, L. (2011). A hybrid tabu search for steelmaking-continuous casting production scheduling problem. In: 2011 International Symposium on Advanced Control of Industrial Processes (ADCONIP). [online] IEEE, pp.535-540. Available at: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=5930486&isnumber=5930387 [Accessed 22 Aug. 2019].

Zhu, D., Zheng, Z. and Gao, X. (2010). Intelligent Optimization-Based Production Planning and Simulation Analysis for Steelmaking and Continuous Casting Process. Journal of Iron and Steel Research International, [online] 17(9), pp.19-24. Available at: https://link.springer.com/content/pdf/10.1016/S1006-706X%2810%2960136-7.pdf [Accessed 15 Aug. 2019].

Appendix A: Flying Tundish Change (FTC) between Products

Appendix B: Flying Tundish Change (FTC) between Products

The casting time depends on the casting speed, which is calculated from a variety of factors such as the continuous casting machine and the product type.
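As a rough illustration of this dependence, the Python sketch below looks up a casting speed by machine and product type and converts it into a casting time as cast length divided by speed. The speed table, the machine and product labels, the cast length and the function name are all hypothetical values chosen for illustration; they are not taken from the plant data used in this dissertation.

```python
# Hypothetical casting speeds in metres per minute, keyed by (machine, product type).
# These numbers are illustrative assumptions, not plant data.
CASTING_SPEED = {
    ("CC1", "slab"): 1.2,
    ("CC1", "bloom"): 0.8,
    ("CC2", "slab"): 1.5,
}


def casting_time_minutes(machine: str, product: str, cast_length_m: float) -> float:
    """Return the casting time in minutes as cast length divided by casting speed."""
    if cast_length_m < 0:
        raise ValueError("cast length must be non-negative")
    speed = CASTING_SPEED[(machine, product)]  # KeyError if the pair is unknown
    return cast_length_m / speed


if __name__ == "__main__":
    # Same cast length on two machines: the faster machine needs less time.
    print(casting_time_minutes("CC1", "slab", 60.0))
    print(casting_time_minutes("CC2", "slab", 60.0))
```

Under these assumed speeds, the same cast length takes less time on the faster machine, which is why the casting time has to be tabulated per machine and product rather than per product alone.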

Appendix C: Xpress Code Model 1

!@encoding CP1252
model Model1
uses "mmxprs"; !gain access to the Xpress-Optimizer solver

!optional parameters section
parameters
  ! SAMPLEPARAM1='c:\test\'
  ! SAMPLEPARAM2=false
  PROJECTDIR='' ! for when file is added to project
end-parameters

!sample declarations section
declarations
  products=1..4
  totalsequences=6
  totalpositions=6
  sequences=1..totalsequences
  seqj:array(sequences)of integer
  restr:array(products,products)of integer
  positions=1..totalpositions
  machines=1..3
  seqw:array(sequences)of integer
  seqp:array(sequences,products)of integer
  MR:array(products,machines)of integer
  maxtime:array(products)of real
  AV:array(machines)of real
  ST:array(positions, machines) of mpvar
  ET:array(positions, machines) of mpvar
  w:array(sequences,positions,machines) of mpvar
  SR:array(sequences,sequences,positions,machines)of mpvar
  ! ...
  Objective:linctr
end-declarations

seqw::[3200,2000,1800,2200,1800,2000]
seqj::[6,10,8,8,8,8]
seqp::[1,0,0,0,
       0,1,0,0,
       0,0,1,0,
       0,0,0,1,
       0,1,0,0,
       0,0,1,0]
maxtime::[118,72.6,72.6,72.6]
restr::[0,0,0,0,
        0,0,0,0,
        0,0,0,0,
        0,0,0,0]
AV::[0,20,0]

forall(s in sequences,v in positions, m in machines)w(s,v,m)is_binary
forall(s1 in sequences,s2 in sequences,v in positions, m in machines)SR(s1,s2,v,m)is_binary

! each sequence takes exactly one position on one machine
forall(s in sequences)sum(m in machines,v in positions)w(s,v,m)=1
! at most one sequence per position on each machine
forall(m in machines,v in positions)sum(s in sequences)w(s,v,m)<=1
! a position can be used only if the previous position on the machine is used
forall(m in machines,v in 1..(totalpositions-1)) do
  sum(s in sequences)w(s,v,m)-sum(s in sequences)w(s,v+1,m)>=0
end-do

! SR(s1,s2,v,m)=1 when s1 is directly followed by s2 and their products are a restricted pair
forall(p1 in products, p2 in products,s1 in sequences,s2 in sequences)do
  if restr(p1,p2)=1 and seqp(s1,p1)=1 and seqp(s2,p2)=1 then
    forall(m in machines,v in 1..(totalpositions-1))do
      w(s1,v,m)+w(s2,v+1,m)-1<=SR(s1,s2,v,m)
    end-do
  end-if
end-do

! SR(s1,s2,v,m)=1 when s1 is directly followed by s2 and their widths differ by more than 200
forall(p1 in products, p2 in products,s1 in sequences,s2 in sequences)do
  if seqw(s1)-seqw(s2)>200 or seqw(s2)-seqw(s1)>200 then
    forall(m in machines,v in 1..(totalpositions-1))do
      w(s1,v,m)+w(s2,v+1,m)-1<=SR(s1,s2,v,m)
    end-do
  end-if
end-do

! the first position cannot start before the machine becomes available
forall(m in machines)ST(1,m)>=AV(m)
forall(m in machines, v in positions)ST(v,m)>=0
! a position starts when the previous one ends, plus 120 for each setup
forall(m in machines,v in 2..totalpositions)ST(v,m)=ET(v-1,m)+
  120*sum(s1 in sequences)sum(s2 in sequences)SR(s1,s2,v-1,m)
! the duration of a position lies between 70% and 100% of the maximum casting time of its sequence
forall(m in machines,v in positions)ET(v,m)>=ST(v,m)+
  sum(s in sequences)sum(p in products)seqp(s,p)*seqj(s)*(0.7*maxtime(p))*w(s,v,m)
forall(m in machines,v in positions)ET(v,m)<=ST(v,m)+
  sum(s in sequences)sum(p in products)seqp(s,p)*seqj(s)*maxtime(p)*w(s,v,m)

! objective: sum of the end times of the last position on every machine
totaltime:=sum(m in machines)ET(totalpositions,m)

if PROJECTDIR <> '' then
  setparam('workdir', PROJECTDIR)
  writeln("Project directory: " + PROJECTDIR)
end-if

writeln("Begin running model")
minimize(totaltime)

! print sequence, start, end and machine for every job, splitting each position evenly among its jobs
forall(m in machines,v in positions,s in sequences)do
  if getsol(w(s,v,m))=1 then
    forall(i in 1..seqj(s))writeln(s,", ",
      getsol(ST(v,m))+(i-1)*((getsol(ET(v,m))-getsol(ST(v,m)))/seqj(s)),", ",
      getsol(ST(v,m))+i*((getsol(ET(v,m))-getsol(ST(v,m)))/seqj(s)),", ",m)
  end-if
end-do
!...
writeln("End running model")
end-model
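The changeover rule in Model 1 can be checked outside the solver. The Python sketch below reuses the seqw, seqp and restr data from the listing and reports whether a setup of 120 time units is triggered between two consecutive sequences, either by a restricted product pair or by a width difference of more than 200. The function names and the example orders are assumptions made for illustration and are not part of the Xpress model.

```python
from typing import Sequence

# Data copied from the Model 1 listing (sequences and products are numbered from 1).
SEQ_WIDTH = [3200, 2000, 1800, 2200, 1800, 2000]  # seqw
SEQ_PRODUCT = [1, 2, 3, 4, 2, 3]                  # product flagged in each row of seqp
RESTR = [[0, 0, 0, 0] for _ in range(4)]          # restr: no restricted product pairs
WIDTH_TOLERANCE = 200                             # largest width change without a setup
SETUP_TIME = 120                                  # time added when SR(s1,s2,v,m) = 1


def needs_setup(s1: int, s2: int) -> bool:
    """True if sequence s2 directly after s1 forces the setup variable SR to 1."""
    p1, p2 = SEQ_PRODUCT[s1 - 1], SEQ_PRODUCT[s2 - 1]
    restricted = RESTR[p1 - 1][p2 - 1] == 1
    width_gap = abs(SEQ_WIDTH[s1 - 1] - SEQ_WIDTH[s2 - 1])
    return restricted or width_gap > WIDTH_TOLERANCE


def total_setup_time(order: Sequence[int]) -> int:
    """Total setup time when the sequences in `order` run back to back on one machine."""
    return SETUP_TIME * sum(needs_setup(a, b) for a, b in zip(order, order[1:]))


if __name__ == "__main__":
    print(total_setup_time([1, 2, 3, 4, 5, 6]))
    print(total_setup_time([3, 5, 2, 6, 4, 1]))
```

Placing all six sequences on one machine in the order 3, 5, 2, 6, 4, 1 sorts them by width and leaves a single forced setup, between sequence 4 (width 2200) and sequence 1 (width 3200).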

Appendix D: Xpress Code Model 2

!@encoding CP1252
model Model2
uses "mmxprs"; !gain access to the Xpress-Optimizer solver

!optional parameters section
parameters
  ! SAMPLEPARAM1='c:\test\'
  ! SAMPLEPARAM2=false
  PROJECTDIR='' ! for when file is added to project
end-parameters

!sample declarations section
declarations
  totaljobs=48
  jobs=1..totaljobs
  products=1..4
  machines=1..4
  positions=1..totaljobs
  SCC:array(jobs)of real
  JP:array(jobs,products)of integer
  MaxT:array(products)of integer
  MinT:array(products)of integer
  MR:array(products,machines)of integer
  ST:array(positions,machines)of mpvar
  ET:array(positions,machines)of mpvar
  x:array(jobs,positions,machines)of mpvar
  ! ...
  Objective:linctr
end-declarations

SCC::[2000, 2050.82, 2101.64, 2152.46, 2203.28, 2254.1, 2304.92, 2355.74, 2406.56, 2457.38, 2508.2, 2559.02, 2609.84, 2660.66, 2711.48, 2762.3, 2813.12, 2863.94, 2914.76, 2965.58, 3016.4, 3067.22, 3118.04, 3168.86, 2020, 2102.6,
      2185.2, 2267.8, 2350.4, 2433, 2000, 2050.82, 2101.64, 2152.46, 2203.28, 2254.1, 2304.92, 2355.74, 2406.56, 2457.38, 2508.2, 2559.02, 2609.84, 2660.66, 2711.48, 2762.3, 2813.12, 2863.94]
JP::[0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,0,1,0, 0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0, 1,0,0,0, 1,0,0,0, 1,0,0,0, 1,0,0,0, 1,0,0,0, 1,0,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0,
     0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0, 0,1,0,0, 0,0,0,1, 0,0,0,1, 0,0,0,1, 0,0,0,1, 0,0,0,1, 0,0,0,1, 0,0,0,1, 0,0,0,1]
MaxT::[45,35,35,35]
MinT::[35,25,25,25]
MR::[1,1,0,0,
     0,0,0,0,
     0,0,1,1,
     0,0,0,0]

forall(v in positions,m in machines,j in jobs)x(j,v,m)is_binary

! each job takes exactly one position on one machine
forall(j in jobs)sum(m in machines,v in positions)x(j,v,m)=1
! at most one job per position on each machine
forall(v in positions,m in machines)sum(j in jobs)x(j,v,m)<=1
! a position ends at the SCC value of the job assigned to it
forall(v in positions,m in machines)ET(v,m)=sum(j in jobs)x(j,v,m)*SCC(j)

! a job of product p cannot be placed on machine m when MR(p,m)=1
forall(p in products,m in machines,j in jobs)do
  if MR(p,m)=1 and JP(j,p)=1 then
    forall(v in positions)x(j,v,m)=0
  end-if
end-do

! the duration of a position lies between MinT and MaxT of the product of its job
forall(m in machines,v in positions)ET(v,m)<=ST(v,m)+sum(j in jobs,p in products)x(j,v,m)*MaxT(p)*JP(j,p)
forall(m in machines,v in positions)ET(v,m)>=ST(v,m)+sum(j in jobs,p in products)x(j,v,m)*MinT(p)*JP(j,p)
forall(m in machines,v in positions)ST(v,m)>=0
forall(m in machines,v in positions)ET(v,m)>=0
! consecutive positions on a machine do not overlap
forall(m in machines,v in 1..totaljobs-1)ET(v,m)<=ST(v+1,m)

! objective: total time between start and end over all positions and machines
obj:=sum(v in positions,m in machines)ET(v,m)-sum(v in positions,m in machines)ST(v,m)

if PROJECTDIR <> '' then
  setparam('workdir', PROJECTDIR)
  writeln("Project directory: " + PROJECTDIR)
end-if

writeln("Begin running model")
minimize(obj)

! print job, start, end, machine and position of every assignment
forall(m in machines,v in positions, j in jobs)do
  if getsol(x(j,v,m))=1 then
    writeln(j," ",getsol(ST(v,m))," ",getsol(ET(v,m))," ",m," ",v)
  end-if
end-do
!...
writeln("End running model")
end-model

Appendix E: Xpress Code Model 3

!@encoding CP1252
model Model3
uses "mmxprs"; !gain access to the Xpress-Optimizer solver

!optional parameters section
parameters
  ! SAMPLEPARAM1='c:\test\'
  ! SAMPLEPARAM2=false
  PROJECTDIR='' ! for when file is added to project
end-parameters

!sample declarations section
declarations
  M=10000
  stages=1..4
  totaljobs=10
  jobs=1..totaljobs
  products=1..4
  machines=1..10
  SCC:array(jobs)of real
  JP:array(jobs,products)of integer
  MaxT:array(products)of integer
  MinT:array(products)of integer
  MR:array(products,machines)of integer
  SM:array(stages,machines)of integer
  PT:array(machines)of integer
  ST:array(jobs,stages,machines)of mpvar
  ET:array(jobs,stages,machines)of mpvar
  x:array(jobs,stages,machines)of mpvar
  y:array(jobs,jobs,stages,machines)of mpvar
  ! ...
  Objective:linctr
end-declarations

SCC::[2050.82, 2101.64, 2152.46, 2203.28, 2254.1, 2304.92, 2355.74, 2406.56, 2457.38, 2508.2]
JP::[0,0,1,0,
     0,0,1,0,
     1,0,0,0,
     1,0,0,0,
     0,0,1,0,
     0,0,1,0,
     0,1,0,0,
     0,1,0,0,
     0,0,0,1,
     0,0,0,1]
MaxT::[45,35,35,35]
MinT::[35,25,25,25]
MR::[0,0,0,0,0,0,1,1,0,0,
     0,0,0,0,0,0,0,0,0,0,
     0,0,0,0,0,0,0,0,1,1,
     0,0,0,0,0,0,1,0,0,0]
SM::[1,1,0,0,0,0,0,0,0,0,
     0,0,1,1,0,0,0,0,0,0,
     0,0,0,0,1,1,0,0,0,0,
     0,0,0,0,0,0,1,1,1,1]
PT::[18,18,25,25,23,23,0,0,0,0]

forall(i in stages,m in machines,j in jobs)x(j,i,m)is_binary
forall(i in stages,m in machines,j in jobs,k in jobs)y(j,k,i,m)is_binary

! each job uses exactly one machine at every stage
forall(j in jobs,i in stages)sum(m in machines)x(j,i,m)=1
! stage 4 ends at the SCC value of the job on the chosen machine
forall(m in machines,j in jobs)ET(j,4,m)=x(j,4,m)*SCC(j)

! a job of product p cannot use machine m when MR(p,m)=1
forall(p in products,m in machines,j in jobs)do
  if MR(p,m)=1 and JP(j,p)=1 then
    forall(i in stages)x(j,i,m)<=0
  end-if
end-do

! machine m can be used at stage i only when SM(i,m)=1
forall(i in stages,m in machines)do
  forall(j in jobs)x(j,i,m)<=SM(i,m)
end-do

! the stage 4 duration lies between MinT and MaxT of the product of the job
forall(m in machines,j in jobs)ET(j,4,m)<=ST(j,4,m)+sum(p in products)(x(j,4,m)*MaxT(p)*JP(j,p))
forall(m in machines,j in jobs)ET(j,4,m)>=ST(j,4,m)+sum(p in products)(x(j,4,m)*MinT(p)*JP(j,p))
forall(m in machines,i in stages,j in jobs)ST(j,i,m)>=0
forall(m in machines,i in stages,j in jobs)ET(j,i,m)>=0
! start and end times are zero on machines the job does not use
forall(m in machines,i in stages,j in jobs)ST(j,i,m)+ET(j,i,m)<=x(j,i,m)*M
! stages 1 to 3 have the fixed processing time PT(m)
forall(i in 1..3,j in jobs,m in machines)ET(j,i,m)=ST(j,i,m)+PT(m)*x(j,i,m)
! stage i+1 starts exactly when stage i ends on the machines chosen for the job
forall(i in 1..3,j in jobs,m in machines,v in machines)ST(j,i+1,v)<=ET(j,i,m)+(2-x(j,i,m)-x(j,i+1,v))*M
forall(i in 1..3,j in jobs,m in machines,v in machines)ST(j,i+1,v)>=ET(j,i,m)-(2-x(j,i,m)-x(j,i+1,v))*M
! if y(j,v,i,m)=1, job v starts on machine m at stage i only after job j has ended
forall(j in jobs,v in jobs,i in stages,m in machines)ST(v,i,m)>=ET(j,i,m)-(1-y(j,v,i,m))*M

! two different jobs on the same machine must be ordered one way or the other
forall(a in jobs,b in jobs,i in stages,m in machines)do
  if a<>b then
    y(a,b,i,m)+y(b,a,i,m)>=x(a,i,m)+x(b,i,m)-1
  end-if
end-do
! y(a,b,i,m) can be 1 only if both jobs use machine m at stage i
forall(a in jobs,b in jobs,i in stages,m in machines)y(a,b,i,m)<=x(a,i,m)
forall(a in jobs,b in jobs,i in stages,m in machines)y(a,b,i,m)<=x(b,i,m)

! objective: total time between start and end at stage 4 over all jobs and machines
obj:=sum(j in jobs,m in machines)ET(j,4,m)-sum(j in jobs, m in machines)ST(j,4,m)

if PROJECTDIR <> '' then
  setparam('workdir', PROJECTDIR)
  writeln("Project directory: " + PROJECTDIR)
end-if

writeln("Begin running model")
minimize(obj)
!...
writeln("End running model")
end-model
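The sequencing constraint ST(v,i,m) >= ET(j,i,m) - (1 - y(j,v,i,m))*M in Model 3 is a big-M disjunction: it binds only when y(j,v,i,m) = 1 and is slack otherwise. The Python sketch below evaluates that inequality for a single stage and machine, using M = 10000 from the listing. The function name and the sample times are illustrative assumptions and are not part of the Xpress model.

```python
BIG_M = 10000  # same constant M as in the Model 3 declarations


def sequencing_constraint_holds(start_v: float, end_j: float, y_jv: int) -> bool:
    """Evaluate ST(v) >= ET(j) - (1 - y(j,v)) * M for one stage and one machine."""
    if y_jv not in (0, 1):
        raise ValueError("y must be binary")
    return start_v >= end_j - (1 - y_jv) * BIG_M


if __name__ == "__main__":
    # y = 1: job v must wait for job j, so starting at 30 before j ends at 40 is rejected.
    print(sequencing_constraint_holds(start_v=30.0, end_j=40.0, y_jv=1))
    # y = 0: the constraint is relaxed and the same start time is accepted.
    print(sequencing_constraint_holds(start_v=30.0, end_j=40.0, y_jv=0))
```

With y = 1 the inequality forces job v to start no earlier than job j ends. With y = 0 the right-hand side drops by M, so any non-negative start time satisfies it, provided M exceeds every end time in the schedule.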