Data mining and neural networks using Python

CT3_2021.pdf

MA4022/MA7022 DATA MINING and NEURAL NETWORKS

Computational Task 3, 2021

Due date 17.04.2021, 23:59

For this task you need to download 4 time series from the Yahoo! Finance website. Each student should have their own unique set of time series!

Please collect the available data for the three years 2018-2020.

Note that for your analysis the time points must be sorted from oldest to newest.

Use the daily closing price.

1. Data evaluation and elementary preprocessing. Analyse the completeness of the data. Are there missing data (besides weekends)? How many data points are missing in each of your time series? Are the dates of the missing values the same for all your time series? What may be the reasons for the missing values? How can you handle the missing values in your data (explain at least three approaches)? Use the simple rule: fill in a missing value with the closest earlier existing value. Plot the results. Normalise to the z-score (zero mean and unit standard deviation). Plot the results. (15 marks)
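The fill rule and the z-score normalisation in task 1 could be sketched as follows. This is only an illustration under stated assumptions: the helper names `fill_missing` and `zscore` are hypothetical, the toy prices are made up, and every Monday-Friday date is treated as an expected observation (real exchanges also close on holidays, which this calendar would report as missing).

```python
import pandas as pd

def fill_missing(close: pd.Series) -> pd.Series:
    """Put the series on a weekday calendar and forward-fill the gaps."""
    close = close.sort_index()  # oldest to newest
    # Assumption: every Monday-Friday date in the range is an expected observation.
    calendar = pd.bdate_range(close.index.min(), close.index.max())
    # ffill copies the closest earlier existing value into each gap
    return close.reindex(calendar).ffill()

def zscore(series: pd.Series) -> pd.Series:
    """Rescale to zero mean and unit (population) standard deviation."""
    return (series - series.mean()) / series.std(ddof=0)

# Made-up toy prices, deliberately unsorted; Wednesday 2018-01-03 is absent.
dates = pd.to_datetime(["2018-01-05", "2018-01-01", "2018-01-04", "2018-01-02"])
prices = pd.Series([12.0, 10.0, 13.0, 11.0], index=dates)

filled = fill_missing(prices)
# Count the calendar dates that had no value before filling
n_missing = int(prices.reindex(filled.index).isna().sum())
z = zscore(filled)

print("missing points:", n_missing)
print(z.round(3))
```

Comparing the missing dates across the four series then amounts to comparing the index positions where `isna()` is true for each of them.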

3. Segmentation. Prepare the bottom-up piecewise linear segmentation for the transformed and normalised log-return time series. Use the following tolerance levels (thresholds) for the mean square error: 1%, 5%, 10%. Plot the results. Are the segments similar for the different time series you analysed? (25 marks)
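A minimal sketch of bottom-up piecewise linear segmentation is given below. It starts from the finest segmentation (pairs of points) and repeatedly merges the adjacent pair of segments whose joint least-squares line has the smallest mean square error, stopping when that error would exceed the threshold. The function names are hypothetical, and reading the 1%, 5%, 10% levels as MSE thresholds of 0.01, 0.05, 0.10 on a unit-variance series is an assumption. Costs are recomputed on every pass, which is simple but slow for long series.

```python
import numpy as np

def fit_mse(y: np.ndarray, lo: int, hi: int) -> float:
    """MSE of the least-squares line through the points y[lo:hi]."""
    if hi - lo <= 2:
        return 0.0  # a line passes exactly through one or two points
    t = np.arange(lo, hi)
    coeffs = np.polyfit(t, y[lo:hi], 1)
    resid = y[lo:hi] - np.polyval(coeffs, t)
    return float(np.mean(resid ** 2))

def bottom_up(y, max_mse: float):
    """Return segments as (start, end) index pairs, end exclusive."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    # Segment k is y[bounds[k]:bounds[k + 1]]; start with pairs of points.
    bounds = list(range(0, n, 2)) + [n]
    while len(bounds) > 2:
        # Cost of merging segment k with segment k + 1
        costs = [fit_mse(y, bounds[k], bounds[k + 2])
                 for k in range(len(bounds) - 2)]
        k = int(np.argmin(costs))
        if costs[k] > max_mse:
            break  # even the cheapest merge violates the tolerance
        del bounds[k + 1]  # remove the shared boundary, i.e. merge
    return [(bounds[k], bounds[k + 1]) for k in range(len(bounds) - 1)]

# Made-up stand-in for a z-scored log-return series
rng = np.random.default_rng(0)
series = rng.standard_normal(60)
for tol in (0.01, 0.05, 0.10):  # assumed reading of 1%, 5%, 10%
    print(tol, len(bottom_up(series, tol)), "segments")
```

For plotting, each returned `(start, end)` pair can be refitted with `np.polyfit` and drawn over the series.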

4. Prediction. Choose one of the transformed and normalised time series as the target 𝑔(𝑡) and the other 3 as supporting data 𝑑1(𝑡), 𝑑2(𝑡), 𝑑3(𝑡), where 𝑡 = 1, … , 𝑇. Provide scatter diagrams of (𝑔(𝑡), 𝑔(𝑡 + 1)).

Evaluate the error of the “next-day forecast”, ĝ(𝑡 + 1) = 𝑔(𝑡).

Use the data for 2018 as the training set and find the predictor of 𝑔(𝑡 + 1) (the next-day value) as a linear function Ψ of 𝑔(𝑡), 𝑑1(𝑡), 𝑑2(𝑡), 𝑑3(𝑡):

ĝ(𝑡 + 1) = Ψ(𝑔(𝑡), 𝑑1(𝑡), 𝑑2(𝑡), 𝑑3(𝑡)) (1)

(linear regression). Evaluate the training set error. Use the data for 2019 as a test set and evaluate the test set error. Also use the data for 2020 as a test set and evaluate the test set error. Compare these errors with each other and with the errors of the “next-day forecast”. Comment.

Provide plots of 𝑔(𝑡), ĝ(𝑡), and the residual. Present the (𝑔(𝑡), ĝ(𝑡)) scatter diagram. (30 marks)
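The workflow of task 4 (fit on 2018, test on 2019 and 2020, compare with the next-day forecast) might look like the sketch below. The data are synthetic and made up purely so the example runs; the helper names are hypothetical. Two modelling choices are assumptions rather than part of the task statement: the linear function includes an intercept, and each input/output pair is assigned to the year of its input day t.

```python
import numpy as np

def design(g, d):
    """Rows (1, g(t), d1(t), d2(t), d3(t)) paired with targets g(t+1)."""
    A = np.column_stack([np.ones(len(g) - 1), g[:-1], d[:-1]])
    return A, g[1:]

def mse(y, y_hat):
    return float(np.mean((y - y_hat) ** 2))

# Synthetic stand-in data (made up): 3 "years" of 250 trading days each.
rng = np.random.default_rng(0)
n = 750
years = np.repeat([2018, 2019, 2020], 250)
d = rng.standard_normal((n, 3))
g = np.zeros(n)
for t in range(n - 1):
    g[t + 1] = 0.5 * g[t] + 0.3 * d[t, 0] + 0.1 * rng.standard_normal()

A, y = design(g, d)
year_of_input = years[:-1]  # assumption: a pair belongs to the year of day t

# Least-squares fit on the 2018 pairs only
train = year_of_input == 2018
coef, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)

errors = {}
for yr in (2018, 2019, 2020):
    m = year_of_input == yr
    errors[yr] = {
        "regression": mse(y[m], A[m] @ coef),
        "next_day": mse(y[m], A[m][:, 1]),  # column 1 holds g(t)
    }
    print(yr, errors[yr])

g_hat = A @ coef      # predictions of g(t+1) for every pair
residual = y - g_hat  # what the residual plot would show
```

The arrays `y`, `g_hat` and `residual` are the quantities to plot, and `(y, g_hat)` gives the scatter diagram.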

5. Adaptive predictors. For each given value of the “frame width”, Δ = 5, 10, 30, create and test the following adaptive predictor. For every 𝑇 > Δ create the training set with Δ input vectors (𝑔(𝑡), 𝑑1(𝑡), 𝑑2(𝑡), 𝑑3(𝑡)) (𝑡 = 𝑇 − Δ, … , 𝑇 − 1) and the corresponding outputs 𝑔(𝑡 + 1).

In more detail, the input vectors 𝒙𝑖 and the output values 𝑦𝑖 for a given T are

𝒙1 = (𝑔(𝑇 − Δ), 𝑑1(𝑇 − Δ), 𝑑2(𝑇 − Δ), 𝑑3(𝑇 − Δ)), 𝑦1 = 𝑔(𝑇 − Δ + 1)

…

𝒙𝑖 = (𝑔(𝑇 − Δ + 𝑖 − 1), 𝑑1(𝑇 − Δ + 𝑖 − 1), 𝑑2(𝑇 − Δ + 𝑖 − 1), 𝑑3(𝑇 − Δ + 𝑖 − 1)), 𝑦𝑖 = 𝑔(𝑇 − Δ + 𝑖),

where 𝑖 = 1, 2, … , Δ. Find the linear regression (1) for each 𝑇 > Δ. Test this linear regression for the next time value, 𝑡 = 𝑇 + 1.

In more detail, for each T there is one test example with the input vector 𝒙𝑡𝑒𝑠𝑡 and output value 𝑦𝑡𝑒𝑠𝑡:

𝒙𝑡𝑒𝑠𝑡 = (𝑔(𝑇), 𝑑1(𝑇), 𝑑2(𝑇), 𝑑3(𝑇)), 𝑦𝑡𝑒𝑠𝑡 = 𝑔(𝑇 + 1)

Note that this example does not belong to the training set for this value of 𝑇.

Find the residuals at these test time moments. Plot these residuals and the values 𝑔(𝑡), ĝ(𝑡). Present the (𝑔(𝑡), ĝ(𝑡)) scatter diagram (𝑡 = 𝑇 + 1). Calculate the mean square error. Compare to the previous task. Comment. (30 marks)
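One possible reading of the adaptive predictor is sketched below with 0-based array indices: for each T the regression is refitted on the Δ most recent input/output pairs and then applied to the single test input at T. The function name `adaptive_forecast` is hypothetical, the data are synthetic, and including an intercept is an assumption. With an intercept there are 5 coefficients, so for Δ = 5 the fit has as many parameters as training pairs and reproduces the training outputs exactly, which can make its forecasts erratic.

```python
import numpy as np

def adaptive_forecast(g, d, delta: int):
    """Refit on a sliding frame of delta pairs and forecast one step ahead.

    Returns the arrays (g(T+1), g_hat(T+1)) over all admissible T.
    """
    g = np.asarray(g, dtype=float)
    d = np.asarray(d, dtype=float)
    # Row t is (1, g(t), d1(t), d2(t), d3(t)); the intercept is an assumption.
    X = np.column_stack([np.ones(len(g)), g, d])
    targets, preds = [], []
    for T in range(delta, len(g) - 1):
        A = X[T - delta:T]              # inputs at t = T - delta, ..., T - 1
        y = g[T - delta + 1:T + 1]      # outputs g(t + 1)
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        preds.append(X[T] @ coef)       # test input at T, not in the frame
        targets.append(g[T + 1])
    return np.array(targets), np.array(preds)

# Synthetic stand-in data (made up) so that the sketch runs
rng = np.random.default_rng(0)
n = 300
d = rng.standard_normal((n, 3))
g = np.zeros(n)
for t in range(n - 1):
    g[t + 1] = 0.5 * g[t] + 0.3 * d[t, 0] + 0.1 * rng.standard_normal()

for delta in (5, 10, 30):
    actual, forecast = adaptive_forecast(g, d, delta)
    residual = actual - forecast
    print(delta, float(np.mean(residual ** 2)))
```

The `residual`, `actual` and `forecast` arrays for each Δ are the quantities to plot and to compare with the fixed 2018 regression.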