Transport data analysis
Institute for Transport Studies FACULTY OF ENVIRONMENT
TRAN5032M
Transport Data Collection & Analysis
Data Collection – Railway stream
Dr Chiara Calastri
This lecture
• Coursework briefing
o Overview
o Data sources
o Structure and format
• Other rail data sources
Coursework briefing
• Learning outcomes (or assessment criteria...):
• Reflection on what research questions can be answered with different
types of data
• Data analysis, including choosing adequate statistical tools
• Clear presentation of a report where a reader can follow your
reasoning and understand your results easily
• Reflection on implications of obtained results and what the next steps
should be
Coursework briefing
• Where to find it:
Learning Resources > Coursework > Coursework MSc Railway OMP
• Individual work, no team activity!
• Two datasets:
• Primary data: data collected during the fieldwork (intercept
questionnaire)
• Secondary data: Network Rail Cancellations and Significant Lateness
(CaSL) multi-year data, provided by the Office of Rail and Road.
Primary data
• Introduce the dataset
• Which technique was used to collect the data?
• Who answered the survey?
• Analyse the data
• Present some general descriptive statistics
• Include graphs if and where appropriate
• State clearly what you will analyse and why: what is your hypothesis and how
will you test it? Why is it relevant?
• Discuss and conclude
• Present and interpret your results clearly, also comparing what you obtained
with your hypothesis/any relevant literature
• Identify strengths and weaknesses in data, the usefulness of results and
potential suggestions for future analyses/data collection
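As an illustration of the "general descriptive statistics" step above, here is a minimal sketch using only the Python standard library. The variable names (`journey_minutes`, `journey_purpose`) and all values are made up for illustration; they are not the fieldwork data, and your own questionnaire variables will differ.

```python
import statistics
from collections import Counter

# Hypothetical survey answers (made-up values, NOT the fieldwork data).
journey_minutes = [25, 40, 35, 60, 45, 30, 90, 35]
journey_purpose = ["commute", "leisure", "commute", "business",
                   "commute", "leisure", "business", "commute"]


def describe(values):
    """Return basic descriptive statistics for a numeric variable."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


def shares(categories):
    """Return the share of each category as a fraction of all answers."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


summary = describe(journey_minutes)
purpose_shares = shares(journey_purpose)
print(summary)
print(purpose_shares)
```

Reporting the median alongside the mean is useful when a few very long journeys skew the distribution, as the single 90-minute value does in this toy example.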
Primary data - analysis
• Identify one or more variables of interest to you
• Select a statistical technique for your analysis
• The statistical tools and analysis need to be suitable for
the research question
• The research question is chosen by you
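One possible pairing of research question and technique, sketched under stated assumptions: if both variables of interest are categorical (for example journey purpose and ticket type), a Pearson chi-square test of independence may be suitable. The counts below are made up for illustration, the function name `chi_square_independence` is hypothetical, and no continuity correction is applied. Other questions will call for other tools.

```python
# Hypothetical 2x2 table of counts (made-up numbers, NOT the fieldwork data):
# rows = journey purpose (commute, leisure),
# columns = ticket type (season, single).
observed = [
    [30, 10],
    [10, 30],
]


def chi_square_independence(table):
    """Return (Pearson chi-square statistic, degrees of freedom)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            # Expected count if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


statistic, dof = chi_square_independence(observed)

# 3.841 is the 5% critical value of the chi-square distribution
# with 1 degree of freedom; it only applies to a 2x2 table.
CRITICAL_5PCT_DOF1 = 3.841
reject_independence = dof == 1 and statistic > CRITICAL_5PCT_DOF1
print(statistic, dof, reject_independence)
```

With small expected counts in any cell, this approximation becomes unreliable, which is itself a point worth discussing in the report.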
Primary data - Handling missing data
• Your dataset might have missing data
o Respondents who dropped out
o Respondents who did not answer certain questions
o Errors in transferring the data from forms to Excel: can you run checks?
• Missing data occur frequently in real datasets
• If too many data points are missing, you can consider
excluding the respondent as not reliable
• If only one or two answers are missing, the data is still
usable. You can either:
o Not use the specific observation
o Treat missing observations as a separate category
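The options for handling missing answers can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the responses are made up, the question names are hypothetical, and the 50% threshold (`MAX_MISSING_SHARE`) is an assumption that you would need to choose and justify yourself.

```python
# Hypothetical responses (made-up); None marks an unanswered question.
responses = [
    {"id": 1, "purpose": "commute", "ticket": "season", "age_band": "25-34"},
    {"id": 2, "purpose": None, "ticket": "single", "age_band": "35-44"},
    {"id": 3, "purpose": None, "ticket": None, "age_band": None},
]
QUESTIONS = ["purpose", "ticket", "age_band"]
MAX_MISSING_SHARE = 0.5  # assumed threshold, not a rule from the briefing


def missing_share(response):
    """Share of questions this respondent left unanswered."""
    missing = sum(1 for question in QUESTIONS if response[question] is None)
    return missing / len(QUESTIONS)


# Exclude respondents with too many unanswered questions.
kept = [r for r in responses if missing_share(r) <= MAX_MISSING_SHARE]

# Alternative A: skip only the observations where the variable
# being analysed is missing.
purpose_values = [r["purpose"] for r in kept if r["purpose"] is not None]

# Alternative B: treat missing answers as a separate category.
recoded = [
    {key: ("Missing" if value is None else value) for key, value in r.items()}
    for r in kept
]

print([r["id"] for r in kept])
print(purpose_values)
print(recoded)
```

Whichever option you use, state it in the report along with how many respondents or observations were affected.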
Primary data - reporting
• Use your experience (fieldwork, previous experience?)
• Use background knowledge/lectures/relevant literature
• Build the report with a clear structure that shows not only
your understanding but also your ability to communicate your
work: help is available!
Secondary data
• Introduce the dataset
• Who collected the data?
• Which technique was used to collect the data?
• Is it a representative study?
• Why was the data collected?
• Analyse the data
• Present your research question and descriptive statistics
• Include graphs if and where appropriate
• How will you test your hypotheses? Why are they relevant?
• Discuss and conclude
• Present and interpret your results clearly, also comparing what you obtained
with your hypothesis/any relevant literature
• Identify strengths and weaknesses in data, the usefulness of results and
potential suggestions for future analyses/data collection
The Office of Rail and Road
• The Office of Rail and Road (ORR) is the independent
economic and safety regulator for Britain’s railways, and
monitor of performance and efficiency for England’s
Strategic Road Network. They:
• Regulate & set targets for Network Rail
• Report on performance
• Regulate health & safety standards across rail
• Oversee competition and consumer rights
• Regulate HS1 (the link to the Channel Tunnel)
Network Rail is the owner and infrastructure manager of most of the
railway network in Great Britain
ORR Statistics and reports
• ORR publish a range of statistics about railway performance,
rail usage and safety
• Data on many topics are presented as reports
• They give you an idea of how data can be described/visualised
ORR datasets
• ORR also publishes source data, which might/might not be
used in some reports. The datasets are accessible here:
https://tinyurl.com/y44j8v7s
The dataset (1/2)
• The file contains 5 sheets:
1. Delay minutes by Category of Delay and Train Operating Company
2. PPM* failures by Category of Delay and Train Operating Company
3. CaSL** failures by Category of Delay and Train Operating Company
4. Full Cancellations by Category of Delay and Train Operating Company
5. Part Cancellations by Category of Delay and Train Operating
Company
*: Public Performance Measure
**: Cancelled or Significantly Late
The dataset (2/2)
• Data run from 2011-12 to 2018-19 and are divided into periods
• Broken down by train operating company (TOC) and
category of delay
• Plenty of opportunities for analysis:
• Analysis at year or period level
• Comparison between different TOCs
• Change in different types of delays over time
• Types of delays within a given TOC
• .....
• Up to you!
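To make the analysis options above more concrete, here is a minimal sketch of aggregating delay minutes by year and TOC, or by year and delay category. It assumes the sheet has first been reshaped into "long" rows, which may not match the layout of the ORR file. The `DelayRecord` type, the TOC names and every value are made up for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class DelayRecord(NamedTuple):
    """One hypothetical long-format row (layout is an assumption)."""
    year: str            # financial year, e.g. "2017-18"
    period: int          # reporting period within the year
    toc: str             # train operating company
    category: str        # category of delay
    delay_minutes: float


# Made-up values for illustration only; NOT taken from the ORR file.
records = [
    DelayRecord("2017-18", 1, "TOC A", "Infrastructure", 1200.0),
    DelayRecord("2017-18", 2, "TOC A", "Infrastructure", 900.0),
    DelayRecord("2017-18", 1, "TOC B", "Infrastructure", 400.0),
    DelayRecord("2018-19", 1, "TOC A", "Infrastructure", 1500.0),
    DelayRecord("2018-19", 1, "TOC A", "Weather", 300.0),
    DelayRecord("2018-19", 1, "TOC B", "Infrastructure", 600.0),
]


def total_by(rows, key_func):
    """Sum delay minutes over groups defined by key_func(row)."""
    totals = defaultdict(float)
    for row in rows:
        totals[key_func(row)] += row.delay_minutes
    return dict(totals)


def percent_change(old, new):
    """Percentage change from old to new."""
    return (new - old) / old * 100


by_year_toc = total_by(records, lambda r: (r.year, r.toc))
by_year_category = total_by(records, lambda r: (r.year, r.category))

change_toc_a = percent_change(
    by_year_toc[("2017-18", "TOC A")],
    by_year_toc[("2018-19", "TOC A")],
)
print(by_year_toc)
print(by_year_category)
print(round(change_toc_a, 1))
```

Note that the toy data covers a different number of periods in each year, so the year totals are not directly comparable; checking this kind of coverage issue in the real file is part of reflecting on data quality.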
Format
• Typed report including:
o Front cover
o Index/List of contents
o Structured sections
o List of (used) references
• Accuracy matters
o Clear captions on figures and equation numbers
o Clear reference to figures, tables and cited work in the text
o Specify units of measurement
• Make sure you address all requirements in the text of the
task assignment
Marking criteria
• Marks will be awarded as follows:
1. Applying statistics (30%)
2. Reflecting on data quality (30%)
3. Recommending future work or improvements (30%)
4. Presentation of coursework (10%)
Referencing
• Make sure you support your statements either with your
experience or existing work.
• As a rule of thumb, include between 10 and 20 references
• References can be books, conference and journal articles,
government/technical/project reports.
• Use the Leeds Harvard referencing style in the References
section.
• References ≠ Bibliography!
Sourcing journal articles via Google Scholar
• Use “Advanced search”...
• ...or use the regular search tool and refine later
Additional search criteria
• Be specific
• Consider if different names are used in the literature
• Depending on the subject, time matters!
Citing and link to PDF versions
• Copy the citation to include in your references
• List of papers citing this one
• Direct link to a PDF version
Word count
• Maximum word count is 2000 (not including figures, tables
and references)
• You should not bypass the word count by making extra text
into a figure or table. Figures and tables have to be
explained in the text!
• You can write fewer than 2000 words, but based on previous experience
this is the right word count for the task. If you have a lot less
than this, go back over your work and check that you have
made enough distinct points/arguments
Make it yours
• You will find reports and commentaries on this data
• Remember you shouldn’t simply report descriptive statistics
like most reports do
• Tip: do your analysis first
• Try to be creative with your hypothesis
• You might find something that existing analyses have not found!
Other rail data sources
A quick overview
Network Rail
• Network Rail owns and operates the railway infrastructure
in England, Wales and Scotland. This includes
• 20,000 miles of track
• 30,000 bridges, tunnels and viaducts
• thousands of signals, level crossings and stations.
• They manage 20 of the UK's largest stations while all the
others, over 2,500, are managed by the country’s train
operating companies.
Network Rail data
• On its website, Network Rail provides many reports as well
as datasets
Network Rail datasets and reports (among others)
• Annual expenditure (2012-2018)
• Bridge strikes
• Business expenses (also travel)
• Business performance
• Cable theft report
• Carbon and energy use
• Close calls
• Environmental incidents
• Equality, diversity and inclusion, and family policies
• Lost property
• Public and passenger accidents
• Safety
• Spend on information technology
• Suicide statistics
• Workforce count
Station                  2013   2014   2015   2016
Birmingham New Street     136    555    601    666
Edinburgh                 265    764    784    837
London Euston             870    864    934    899
Glasgow                   275    655    651    661
Kings Cross               132    231    424    325
Leeds                     154    526    526    566
Liverpool                  66    257    233    219
Liverpool Street          431    416     53    329
Manchester                144    618    646    641
Paddington                296    401    454    389
London Victoria           910   1695   1418   1527
Totals                   3679   6982   6724   7059
Suicide figures 2009-2019
Office of Rail and Road: wide range of datasets
Rail ticket sales data
• The rail industry’s central ticketing system is LENNON
• LENNON holds information on the vast majority of national
rail tickets purchased in Great Britain
• This dataset is used for (among others):
o Disaggregate demand analysis
o Time series analysis
o Rail passenger kilometres
• LENNON is a fundamental tool for rail forecasting (see
Passenger Demand Forecasting Handbook)
• It is not publicly available
Timetabling
• Rail Delivery Group (National Rail)
o Fares data
o Routeing data
o Timetable data
• These data are of a “technical” nature and not open-source
but can be accessed
• Detailed documentation to help users
• Reports to learn the basic statistics and workings of the
system
National Rail Travel Survey (NRTS)
• Self-completion questionnaire handed out in rail stations
• Main survey component is “at station”
• Small number of surveys conducted on-train
• Some of the information collected:
o origin and destination station;
o true origin and destination (postcode);
o time of departure and arrival;
o ticket type;
o season tickets and railcards held;
o journey purpose;
o day of week of journey; and
o access and egress mode to/from the station.
National Travel Survey
• Ongoing survey collecting information on all journeys made
in a week by a sample of households
• Household characteristics also collected
• Limited information on long journeys
• Valuable data to assess modal competition
Other government data
• Living Costs and Food Survey
• International Passenger Survey
• Both are conducted over time
• These surveys collect information such as:
o journey length
o number of rail trips made
o mode share
o expenditure on travel
o access modes to airports