Data science
Dr Nina Dethlefs [email protected]
16 September 2021
Assignment - Understanding Artificial Intelligence
771763 - 2021/2022
Description:
The assessment for this module is via a portfolio of work that will be assembled over the
course of our three lab sessions. Topics will include (1) a data analytics, interpretation +
visualisation task (lab session 1, 4/5 Nov - 18/19 Nov), (2) a computer vision task (lab session
2, 02/03 Dec), and (3) an ethical analysis (based on the lecture and materials in Week 5, w/c
29 Nov). You will receive formative support on all of these items during our lab sessions.
Specific topics that should be included in the portfolio are as follows.
Component 1 — Water quality analysis:
This component uses CEFAS’ 2021 data on biotoxins and phytoplankton (see
https://www.cefas.co.uk/data-and-publications/habs/england-and-wales-biotoxins-and-phytoplankton-results-2021/)
to find patterns of higher or lower concentration of either (or both) according to the features
provided. You should read the data (the second tab, on phytoplankton) into a program, clean it,
and then train a multi-layer feed-forward neural network to predict
from a set of input features whether the phytoplankton level detected is above the threshold
specified (see end of file). You will need to make a range of decisions in your analysis on data
cleaning, network architecture and evaluation setup.
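As a rough illustration of the cleaning and labelling step, the sketch below parses a few made-up rows, drops rows without a usable count, and derives a binary target from a threshold. The column names (`site`, `date`, `cells_per_litre`), the example cell values and the threshold are assumptions for illustration only; the real column names and threshold come from the CEFAS spreadsheet, and how to treat values such as "<40" is one of the cleaning decisions you need to make and justify.

```python
import csv
import io

# Made-up rows standing in for the phytoplankton tab; not real CEFAS data.
RAW = """site,date,cells_per_litre
A,2021-01-04,120
B,2021-01-04,
A,2021-01-11,ND
C,2021-01-11,<40
B,2021-01-18,2500
"""

THRESHOLD = 1000.0  # hypothetical; use the value given at the end of the file


def parse_count(text):
    """Return the count as a float, or None if the cell is unusable."""
    text = text.strip()
    if text.startswith("<"):      # treat "<40" as 40 (one possible choice)
        text = text[1:]
    try:
        return float(text)
    except ValueError:            # empty cells, "ND", other free text
        return None


def clean_and_label(raw_csv, threshold):
    """Drop unusable rows and add a 0/1 'above_threshold' target."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_csv)):
        count = parse_count(record["cells_per_litre"])
        if count is None:
            continue              # drop rows without a usable measurement
        rows.append({
            "site": record["site"],
            "date": record["date"],
            "count": count,
            "above_threshold": int(count > threshold),
        })
    return rows


cleaned = clean_and_label(RAW, THRESHOLD)
for row in cleaned:
    print(row)
```

Dropping rows is only one option; imputing missing values or keeping a "below detection limit" flag are alternatives that may affect the accuracy you report.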
You should answer the following questions:
• Specify the accuracy you achieved across 3 architectural modifications (e.g. different
numbers of layers, different hyperparameters, etc.)
• Why do you think your accuracy is not higher / lower?
DATA ANALYSIS AND VISUALISATION 1
• What effect does the optimisation function have on network performance?
• What happens if you include more than 4 (hidden) layers?
• What is the effect of the data size on your accuracy?
Generate and include in your report the most suitable graphical plot of the data.
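One way to organise the comparison of three architectural variants is sketched below. It is a minimal from-scratch feed-forward network in NumPy trained on synthetic data, not the lab code and not the CEFAS data; the data generator, layer sizes, learning rate and number of epochs are all assumptions chosen for illustration. In your own work you would normally use the framework from the lab sessions and your cleaned features.

```python
import numpy as np


def make_data(n=400, seed=0):
    """Synthetic two-feature binary problem (a stand-in, not the CEFAS data)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] * X[:, 1] > 0).astype(float)   # XOR-like, not linearly separable
    return X, y


def init_params(sizes, rng):
    """He-initialised weights and zero biases for consecutive layer sizes."""
    return [(rng.normal(size=(m, k)) * np.sqrt(2.0 / m), np.zeros(k))
            for m, k in zip(sizes[:-1], sizes[1:])]


def forward(params, X):
    """Return the activations of every layer; ReLU hidden, sigmoid output."""
    acts = [X]
    for i, (W, b) in enumerate(params):
        z = acts[-1] @ W + b
        last = i == len(params) - 1
        acts.append(1.0 / (1.0 + np.exp(-z)) if last else np.maximum(z, 0.0))
    return acts


def train(X, y, hidden, epochs=500, lr=0.1, seed=0):
    """Full-batch gradient descent on binary cross-entropy."""
    rng = np.random.default_rng(seed)
    params = init_params([X.shape[1], *hidden, 1], rng)
    for _ in range(epochs):
        acts = forward(params, X)
        # Gradient of the loss w.r.t. the output pre-activation.
        delta = (acts[-1] - y[:, None]) / len(X)
        for i in reversed(range(len(params))):
            W, b = params[i]
            grad_W = acts[i].T @ delta
            grad_b = delta.sum(axis=0)
            if i > 0:
                # Back-propagate through the ReLU of the previous layer.
                delta = (delta @ W.T) * (acts[i] > 0)
            params[i] = (W - lr * grad_W, b - lr * grad_b)
    return params


def accuracy(params, X, y):
    """Fraction of examples whose thresholded output matches the label."""
    preds = forward(params, X)[-1][:, 0] > 0.5
    return float(np.mean(preds == (y > 0.5)))


X, y = make_data()
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

# Three hypothetical variants: narrow, wide, and deeper.
results = {}
for hidden in [(4,), (16,), (16, 16)]:
    params = train(X_train, y_train, hidden)
    results[hidden] = accuracy(params, X_test, y_test)
    print(hidden, round(results[hidden], 3))
```

Reporting the results as a small table (architecture against held-out accuracy) gives the kind of quantitative evidence the marking criteria ask for. Repeating each run with several seeds would show how much of the difference is noise.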
Component 2 — Multi-object recogniser:
Download the “vehicles” dataset from here and adapt your CNN from the lab session to
recognise the 4 object types in the dataset. Generate a graphical plot of your training and
validation accuracy during training. Then answer the following questions:
• How long does the network need to train until reaching an accuracy of 95% (or
does it not reach this level at all)?
• What is the tradeoff between using many layers (i.e. having a “deeper” network) and
accuracy? And between the number of layers and training time?
• What is the effect of changing the pooling mechanism, e.g. average vs max?
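To make the pooling question concrete, the sketch below applies non-overlapping 2x2 max pooling and average pooling to a small made-up feature map. The function `pool2d` and the example values are illustrative assumptions, not part of the lab code; in a CNN framework you would swap the pooling layer type instead.

```python
import numpy as np


def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping 2D pooling over a single-channel feature map."""
    h, w = feature_map.shape
    # Trim so the map divides evenly into size x size windows.
    trimmed = feature_map[:h - h % size, :w - w % size]
    windows = trimmed.reshape(h // size, size, w // size, size)
    if mode == "max":
        return windows.max(axis=(1, 3))
    if mode == "average":
        return windows.mean(axis=(1, 3))
    raise ValueError(f"unknown mode: {mode}")


fmap = np.array([[1, 2, 0, 0],
                 [3, 4, 0, 8],
                 [0, 0, 5, 5],
                 [0, 0, 5, 5]], dtype=float)

max_pooled = pool2d(fmap, mode="max")          # window maxima: 4, 8, 0, 5
avg_pooled = pool2d(fmap, mode="average")      # window means: 2.5, 2, 0, 5
print(max_pooled)
print(avg_pooled)
```

In the top-right window a single strong activation (8) survives max pooling unchanged but is diluted to 2 by averaging, which is the kind of difference worth relating to your accuracy results.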
As a follow-on part, collect your own dataset of images containing the four object
categories above. Make sure that they occur in different contexts, e.g. close-up, far-away, in a
busy visual context, in an isolated image, etc. It is up to you how you collect these images: you
can either take photos yourself or collect images from the internet. You should collect 20
images and copy these into your report, so I can see them.
• How well does your network do at classifying these images?
• Does fine-tuning make a difference?
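With only 20 images, overall accuracy can hide which categories fail, so a per-class breakdown is useful. The sketch below computes overall accuracy, per-class accuracy and confusion counts from two label lists. The class names and the predictions are made up for illustration; substitute the four categories from the dataset and your model's actual outputs.

```python
from collections import Counter

# Hypothetical class names; replace with the four categories in the dataset.
CLASSES = ["class_a", "class_b", "class_c", "class_d"]


def evaluate(true_labels, predicted_labels, classes):
    """Return overall accuracy, per-class accuracy and confusion counts."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    # Keys are (true, predicted) pairs; missing pairs count as zero.
    confusion = Counter(zip(true_labels, predicted_labels))
    correct = sum(confusion[(c, c)] for c in classes)
    per_class = {}
    for c in classes:
        total = sum(confusion[(c, p)] for p in classes)
        per_class[c] = confusion[(c, c)] / total if total else None
    return correct / len(true_labels), per_class, confusion


# Made-up example: 20 images, 5 per class.
true = ["class_a"] * 5 + ["class_b"] * 5 + ["class_c"] * 5 + ["class_d"] * 5
pred = (["class_a"] * 4 + ["class_b"]
        + ["class_b"] * 5
        + ["class_c"] * 3 + ["class_d"] * 2
        + ["class_d"] * 5)

overall, per_class, confusion = evaluate(true, pred, CLASSES)
print(overall)
print(per_class)
```

Running the same evaluation before and after fine-tuning gives a direct, like-for-like answer to the second question.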
Extra challenge - integrate explainability methods, such as tf-explain
(https://github.com/sicara/tf-explain), to visualise how your model makes predictions for a
small set of example images.
Component 3 — Discussion of Ethics in AI:
Choose one of these three research papers to discuss:
1. Energy and Policy Considerations for Deep Learning in NLP
2. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification
3. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜
In 800 words: highlight briefly the ethical challenge described and the researchers’
approach to uncovering and addressing it. Discuss in more detail areas of applied AI where
you speculate similar challenges may occur and what incentives can be provided to AI
researchers to tread carefully around ethical challenges.
This part of your portfolio should use a formal academic writing style and references
in Harvard style, see here for guidance.
Marking and components
Portfolio 100%, with each component being worth 1/3 of the overall mark.
DO NOT include programming code in the report, e.g. as screenshots or similar. If you
want to present an algorithm, neural network architecture etc., then use pseudocode, a
diagram or some other presentation that is not copy-pasted code.
Code submission:
You will need to submit your code alongside your report. It will not be marked
separately but will be checked to ensure that it supports the functionality described in the
report and is not plagiarised.
Hand-in and deadline:
The portfolio is due: 14 December 2021, 2pm
Hand-in will be via Canvas.
Marking criteria:
Portfolio marking criteria and weighting:
Criteria are listed per component for each grade band: DISTINCTION, MERIT, PASS.

Component 1 - Neural Net

DISTINCTION (100 points max):
• All questions are answered (correctly) and quantitative evidence is provided to support the answers (e.g. a table of results, learning plot, reasoned explanation + reference).
• Evidence of 3 architectural variants is provided.
• The data is fully visualised in an appropriate plot.
• Code is submitted and fully replicates the results.
• Top mark >90% - a small discussion paragraph is written that relates your own findings to the background literature on the topic (note: you’ll need to identify this literature yourself).

MERIT (69 points max):
• Most questions are answered (correctly) and some evidence is provided to support the answers.
• Evidence of more than 1 architectural variant is provided.
• A visualisation of the input data is provided.
• Code is submitted and fully replicates the results.

PASS (59 points max):
• Some questions are answered (correctly) and some evidence is provided to support the answers.
• Code is submitted and fully replicates the results.
Component 2 - Computer vision

DISTINCTION (100 points max):
• A CNN is successfully trained for the multi-label object recognition task, and a learning plot and results are provided as evidence.
• All questions are answered (correctly) and quantitative evidence is provided to support the answers.
• A dataset is gathered and shown in the report as evidence. The dataset is varied and includes multiple visual perspectives.
• The code successfully transfers to the new data (accuracy is not important here).
• Code is submitted and fully replicates the results.
• Top mark >90% - the extra challenge is fully completed.

MERIT (69 points max):
• A CNN is successfully trained for the multi-label object recognition task, and a learning plot and results are provided as evidence.
• Most questions are answered (correctly) and quantitative evidence is provided to support the answers.
• A dataset is gathered and shown in the report as evidence.
• The code transfers to the new data (accuracy is not important here).
• Code is submitted and fully replicates the results.

PASS (59 points max):
• A CNN is trained for the multi-label object recognition task, and some evidence is provided for this.
• Some questions are answered (correctly) and quantitative evidence is provided to support the answers.
• A dataset is gathered and shown in the report as evidence.
• Code is submitted and fully replicates the results.
Component 3 - Ethics

DISTINCTION (100 points max):
• The research question and methodology of the academic paper are clearly stated.
• The ethical dilemma is identified and stated clearly.
• At least 3 real-world applications of the research are proposed and ethical consequences are discussed in a manner that is analytical, critical and reflective.
• Top mark >90% - a set of novel recommendations is generated from your review that could influence policy making.

MERIT (69 points max):
• The research question and methodology of the academic paper are stated.
• The ethical dilemma is identified and stated.
• At least 1 real-world application of the research is proposed and ethical consequences are discussed in a manner that is analytical, critical and reflective.

PASS (59 points max):
• The research question and methodology of the academic paper are stated.
• The ethical dilemma is identified and stated.
• At least 1 real-world application of the research is proposed and ethical consequences are discussed.