Week4 :Q/A data analytics

rav
Chapter_82.pptx

Data Science and Big Data Analytics

Chap 8: Advanced Analytical Theory and Methods:

Time Series Analysis

Chapter Sections

8.1 Overview of Time Series Analysis

8.1.1 Box-Jenkins Methodology

8.2 ARIMA Model

8.2.1 Autocorrelation Function (ACF)

8.2.2 Autoregressive Models

8.2.3 Moving Average Models

8.2.4 ARMA and ARIMA Models

8.2.5 Building and Evaluating an ARIMA Model

8.2.6 Reasons to Choose and Cautions

8.3 Additional Methods

Summary

8 Time Series Analysis

This chapter’s emphasis is on

Identifying the underlying structure of the time series

Fitting an appropriate Autoregressive Integrated Moving Average (ARIMA) model

Time series analysis attempts to model the underlying structure of observations over time

A time series, denoted $y_t$ for $t = 1, 2, \ldots, n$, is an ordered sequence of equally spaced values over time

The analyses presented in this chapter are limited to equally spaced time series of one variable

8.1 Overview of Time Series Analysis

The time series below plots the monthly number of airline passengers over 144 months (12 years)

The goals of time series analysis are

Identify and model the structure of the time series

Forecast future values in the time series

Time series analysis has many applications in finance, economics, biology, engineering, retail, and manufacturing

8.1 Overview of Time Series Analysis 8.1.1 Box-Jenkins Methodology

A time series can consist of the components:

Trend – long-term movement in a time series, increasing or decreasing over time – for example,

Steady increase in sales month over month

Annual decline of fatalities due to car accidents

Seasonality – describes the fixed, periodic fluctuation in the observations over time

Usually related to the calendar – e.g., airline passenger example

Cyclic – also periodic but not as fixed

E.g., retail sales versus the boom-bust cycle of the economy

Random – is what remains

Often an underlying structure remains but usually with significant noise

This structure is what is modeled to obtain forecasts
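
As a rough illustration of these components, the sketch below applies base R's decompose() function to the built-in AirPassengers dataset. Using AirPassengers is an assumption: it is a 144-month airline passenger series that resembles the one described earlier, but the slides do not name their data source. Classical decomposition estimates only the trend, seasonal, and random parts (it does not isolate a cyclic component), and it is one simple way to look at the structure, not a step of the Box-Jenkins methodology itself.

```r
# Classical additive decomposition of a monthly series (base R only).
# AirPassengers is an assumed stand-in for the passenger series in the slides.
ap <- log(AirPassengers)      # log scale keeps the seasonal swings roughly constant
parts <- decompose(ap)        # additive model: ap = trend + seasonal + random

# The three estimated components, each a ts object of the same length as ap
trend_part    <- parts$trend      # moving-average estimate of the trend
seasonal_part <- parts$seasonal   # one 12-month pattern, repeated every year
random_part   <- parts$random     # what remains after removing the other two

plot(parts)                   # panels: observed, trend, seasonal, random
```

The first and last six values of the trend and random parts are NA because the centered moving average cannot be computed at the edges of the series.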

The Box-Jenkins methodology has three main steps:

Step 1: Condition data and select a model

Identify/account for trends/seasonality in the time series

Examine the remaining time series to determine a model

Step 2: Estimate the model parameters

Step 3: Assess the model, return to Step 1 if necessary

This chapter uses the Box-Jenkins methodology to apply an ARIMA model to a given time series

The remainder of the chapter is rather advanced and will not be covered in this course

The remaining slides have not been finalized but can be reviewed by those interested in time series analysis

8.2 ARIMA Model ARIMA = Autoregressive Integrated Moving Average

Step 1: remove any trends/seasonality in time series

Achieve a time series with certain properties to which autoregressive and moving average models can be applied

Such a time series is known as a stationary time series

A time series, $y_t$ for $t = 1, 2, 3, \ldots$, is a stationary time series if the following three conditions are met

(1) The expected value (mean) of $y_t$ is constant for all values of $t$

(2) The variance of $y_t$ is finite

(3) The covariance of $y_t$ and $y_{t+h}$ depends only on the value of $h = 0, 1, 2, \ldots$ for all $t$

The covariance of $y_t$ and $y_{t+h}$ is a measure of how the two variables, $y_t$ and $y_{t+h}$, vary together

If two variables are independent, covariance is zero.

If the variables change together in the same direction, cov is positive; conversely, if the variables change in opposite directions, cov is negative

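A stationary time series, by condition (1), has a constant mean, say $m$, so the covariance simplifies to

$\mathrm{cov}(h) = E[(y_t - m)(y_{t+h} - m)]$

By condition (3), the covariance between two points can be nonzero, but it is a function of $h$ only. For example, for $h = 3$: $\mathrm{cov}(3) = \mathrm{cov}(y_1, y_4) = \mathrm{cov}(y_2, y_5) = \ldots$

If $h = 0$, $\mathrm{cov}(0) = \mathrm{cov}(y_t, y_t) = \mathrm{var}(y_t)$ for all $t$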

A plot of a stationary time series

8.2 ARIMA Model 8.2.1 Autocorrelation Function (ACF)

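From the figure, it appears that each point is somewhat dependent on the past points, but the plot does not provide insight into the covariance and its structure

The plot of the autocorrelation function (ACF) provides this insight

For a stationary time series, the ACF is defined as

$\mathrm{ACF}(h) = \frac{\mathrm{cov}(y_t, y_{t+h})}{\sqrt{\mathrm{cov}(y_t, y_t)\,\mathrm{cov}(y_{t+h}, y_{t+h})}} = \frac{\mathrm{cov}(h)}{\mathrm{cov}(0)}$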

Because $\mathrm{cov}(0)$ is the variance,

the ACF is analogous to the correlation function of two variables, $\mathrm{corr}(y_t, y_{t+h})$, and

the value of the ACF falls between -1 and 1

Thus, the closer the absolute value of $\mathrm{ACF}(h)$ is to 1, the more useful $y_t$ can be as a predictor of $y_{t+h}$
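
The relationship between the autocovariance and the ACF can be checked numerically. The sketch below is only an illustration on simulated data: the AR coefficient 0.7, the series length, and the seed are arbitrary assumptions, and acf() is the base R function for sample autocovariances and autocorrelations.

```r
# Sample autocovariance and autocorrelation with base R's acf().
set.seed(1)                                        # arbitrary seed, for reproducibility
y <- arima.sim(model = list(ar = 0.7), n = 500)    # assumed example of a stationary series

# acf() returns a 3-d array; [, 1, 1] extracts the values for lags 0..5
cov_h <- acf(y, lag.max = 5, type = "covariance",  plot = FALSE)$acf[, 1, 1]
acf_h <- acf(y, lag.max = 5, type = "correlation", plot = FALSE)$acf[, 1, 1]

# Element 1 corresponds to lag 0, element 2 to lag 1, and so on
ratio <- cov_h / cov_h[1]     # cov(h) / cov(0)
print(cbind(lag = 0:5, ratio = ratio, acf = acf_h))   # the last two columns agree
```

Note that acf() divides by n rather than n - 1, so cov_h[1] is slightly smaller than var(y); the ratio that defines the ACF is unaffected.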

Using the dataset plotted above, the ACF plot is

By convention, the quantity $h$ in the ACF is referred to as the lag, the difference between the time points $t$ and $t + h$.

At lag 0, the ACF provides the correlation of every point with itself

According to the ACF plot, at lag 1 the correlation between $y_t$ and $y_{t-1}$ is approximately 0.9, which is very close to 1, so $y_{t-1}$ appears to be a good predictor of the value of $y_t$

The same argument applies at each later lag where the autocorrelation remains high. In other words, a model can be considered that would express $y_t$ as a linear sum of its previous 8 terms. Such a model is known as an autoregressive model of order 8

8.2 ARIMA Model 8.2.2 Autoregressive Models

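For a stationary time series, $y_t$, $t = 1, 2, 3, \ldots$, an autoregressive model of order $p$, denoted AR(p), is

$y_t = \delta + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$

where $\delta$ is a constant for a nonzero-centered time series, $\phi_j$ is a constant for $j = 1, 2, \ldots, p$ with $\phi_p \neq 0$, and $\varepsilon_t \sim N(0, \sigma_\varepsilon^2)$ for all $t$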

Thus, a particular point in the time series can be expressed as a linear combination of the prior $p$ values, $y_{t-j}$ for $j = 1, 2, \ldots, p$, of the time series plus a random error term, $\varepsilon_t$.

The $\varepsilon_t$ time series is often called a white noise process that represents random, independent fluctuations that are part of the time series

In the earlier example, the autocorrelations are quite high for the first several lags.

Although an AR(8) model might be good, examining an AR(1) model provides further insight into the ACF and the $p$ value to choose

An AR(1) model, centered around $\delta = 0$, yields

$y_t = \phi_1 y_{t-1} + \varepsilon_t$
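
For a stationary AR(1) model (that is, with $|\phi_1| < 1$), the theoretical autocorrelation at lag $h$ is $\phi_1^h$, which is why the ACF of an autoregressive series decays gradually rather than cutting off. The sketch below checks this with the base R function ARMAacf(); the coefficient 0.9 is an assumed illustrative value, chosen only because it resembles the lag-1 autocorrelation mentioned earlier.

```r
# Theoretical ACF of an AR(1) model versus the closed form phi^h.
phi <- 0.9                                  # assumed illustrative AR(1) coefficient
rho <- ARMAacf(ar = phi, lag.max = 5)       # named vector for lags 0..5
closed_form <- phi^(0:5)                    # 1, phi, phi^2, ...

print(rbind(ARMAacf = rho, closed_form = closed_form))   # the two rows agree
```

With a coefficient this close to 1, the autocorrelation is still about 0.59 at lag 5, so many lags show large values even though the model has a single autoregressive term.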

8.2 ARIMA Model 8.2.3 Moving Average Models

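For a time series, $y_t$, centered at zero, a moving average model of order $q$, denoted MA(q), is expressed as

$y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$

where $\theta_k$ is a constant for $k = 1, 2, \ldots, q$ with $\theta_q \neq 0$, and $\varepsilon_t \sim N(0, \sigma_\varepsilon^2)$ for all $t$

In an MA(q) model, the value of a time series is a linear combination of the current white noise term and the prior $q$ white noise terms, so earlier random shocks directly affect the current value of the time series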

For MA(q) models, the behavior of the ACF and partial autocorrelation function (PACF) plots is somewhat swapped from the behavior of these plots for AR(p) models.

For a simulated MA(3) time series of the form $y_t = \varepsilon_t - 0.4\varepsilon_{t-1} + 1.1\varepsilon_{t-2} - 2.5\varepsilon_{t-3}$ where $\varepsilon_t \sim N(0, 1)$, the scatterplot of the simulated data over time is

The ACF plot of the simulated MA(3) series is shown below

ACF(0) = 1, because any variable correlates perfectly with itself. At higher lags, the absolute values of the terms decay

In an autoregressive model, the ACF slowly decays, but for an MA(3) model, the ACF cuts off abruptly after lag 3, and this pattern extends to any MA(q) model.

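To understand this, examine the MA(3) model equations

$y_t = \varepsilon_t - 0.4\varepsilon_{t-1} + 1.1\varepsilon_{t-2} - 2.5\varepsilon_{t-3}$

$y_{t-1} = \varepsilon_{t-1} - 0.4\varepsilon_{t-2} + 1.1\varepsilon_{t-3} - 2.5\varepsilon_{t-4}$

$y_{t-2} = \varepsilon_{t-2} - 0.4\varepsilon_{t-3} + 1.1\varepsilon_{t-4} - 2.5\varepsilon_{t-5}$

$y_{t-3} = \varepsilon_{t-3} - 0.4\varepsilon_{t-4} + 1.1\varepsilon_{t-5} - 2.5\varepsilon_{t-6}$

$y_{t-4} = \varepsilon_{t-4} - 0.4\varepsilon_{t-5} + 1.1\varepsilon_{t-6} - 2.5\varepsilon_{t-7}$

Because $y_t$ shares specific white noise variables with $y_{t-1}$ through $y_{t-3}$, those three variables are correlated to $y_t$. However, the expression of $y_t$ does not share white noise variables with the expression of $y_{t-4}$, so the theoretical correlation between $y_t$ and $y_{t-4}$ is zero. Of course, when dealing with a particular dataset, the theoretical autocorrelations are unknown, but the observed autocorrelations should be close to zero for lags greater than $q$ when working with an MA(q) model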
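
The cutoff after lag 3 can be reproduced numerically. The sketch below uses the MA(3) coefficients from the simulated model in the text; the seed and the series length of 5000 are arbitrary assumptions. ARMAacf() gives the theoretical autocorrelations, and arima.sim() with acf() gives sample values for comparison (arima.sim() draws standard normal white noise by default).

```r
# Theoretical and sample ACF of the MA(3) model
# y_t = e_t - 0.4 e_{t-1} + 1.1 e_{t-2} - 2.5 e_{t-3}
theta <- c(-0.4, 1.1, -2.5)                      # MA coefficients in R's sign convention
rho <- ARMAacf(ma = theta, lag.max = 6)          # theoretical ACF for lags 0..6
print(round(rho, 3))                             # lags 4, 5, and 6 are zero

set.seed(42)                                     # arbitrary seed
y <- arima.sim(model = list(ma = theta), n = 5000)
sample_rho <- acf(y, lag.max = 6, plot = FALSE)$acf[, 1, 1]
print(round(sample_rho, 3))                      # beyond lag 3: near zero, not exactly zero
```

The sample values past lag 3 differ from zero only by sampling noise, which is the pattern to look for when choosing q from an ACF plot.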

8.2 ARIMA Model 8.2.4 ARMA and ARIMA Models

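In general, we don’t need to choose between an AR(p) and an MA(q) model; rather, we can combine these two representations into an Autoregressive Moving Average model, ARMA(p,q),

$y_t = \delta + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$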
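
As a minimal sketch of the combined model, the code below simulates an ARMA(1,1) series and fits it with the base R arima() function. The coefficients (0.6 and 0.4), the series length, and the seed are assumptions made only for this illustration. In arima(), the `order` argument is c(p, d, q), so c(1, 0, 1) requests one AR term, no differencing, and one MA term.

```r
# Simulate an ARMA(1,1) series and recover its coefficients with arima().
set.seed(7)                                                # arbitrary seed
y <- arima.sim(model = list(ar = 0.6, ma = 0.4), n = 1000) # assumed true coefficients

fit <- arima(y, order = c(1, 0, 1))    # p = 1, d = 0, q = 1
est <- coef(fit)                       # named vector: ar1, ma1, intercept
print(round(est, 3))                   # ar1 and ma1 should be near 0.6 and 0.4
```

Setting the first or last element of `order` to zero reduces the fit to a pure MA(q) or a pure AR(p) model, respectively.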

If $p \neq 0$ and $q = 0$, then the ARMA(p,q) model is simply an AR(p) model. Similarly, if $p = 0$ and $q \neq 0$, then the ARMA(p,q) model is an MA(q) model

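Although the time series must be stationary to apply an ARMA model, many series exhibit a trend over time – e.g., an increasing linear trend. Such a trend can be removed by differencing, which is the "Integrated" part of ARIMA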
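
As a minimal sketch of how differencing handles a linear trend, the code below builds an artificial series and applies base R's diff(), which computes $d_t = y_t - y_{t-1}$. The intercept 5, slope 2, noise level, and seed are all assumed values for illustration only.

```r
# First differencing removes a linear trend.
t_idx <- 1:100
set.seed(3)                              # arbitrary seed
y <- 5 + 2 * t_idx + rnorm(100)          # assumed example: linear trend plus noise

d <- diff(y)                             # d_t = y_t - y_{t-1}; one value shorter than y
print(mean(y[1:50]))                     # the level of y keeps rising over time
print(mean(y[51:100]))
print(mean(d))                           # the differenced series hovers near the slope, 2
```

In arima(), the middle element of the `order` argument, d, requests this differencing; for example, order = c(p, 1, q) fits an ARMA(p,q) model to the once-differenced series.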

8.2 ARIMA Model 8.2.5 Building and Evaluating an ARIMA Model

For a large country, monthly gasoline production (millions of barrels) was obtained for 240 months (20 years).

A market research firm requires some short-term gasoline production forecasts

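library(forecast)

# read in the gasoline production time series (millions of barrels per month)
gas_prod_input <- as.data.frame(read.csv("c:/data/gas_prod.csv"))

# create a time series object from the second column
gas_prod <- ts(gas_prod_input[, 2])

# examine the time series
plot(gas_prod, xlab = "Time (months)", ylab = "Gasoline production (millions of barrels)")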

Comparing Fitted Time Series Models

The arima() function in R uses Maximum Likelihood Estimation (MLE) to estimate the model coefficients. In the R output for an ARIMA model, the log-likelihood (logL) value is provided. The values of the model coefficients are determined such that the value of the log-likelihood function is maximized. Based on the logL value, the R output provides several measures that are useful for comparing the appropriateness of one fitted model against another fitted model.

AIC (Akaike Information Criterion)

AICc (Akaike Information Criterion, corrected)

BIC (Bayesian Information Criterion)
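
The comparison can be sketched as follows. Because the gasoline dataset is not available here, the example assumes the built-in AirPassengers series as a stand-in, and the two candidate model orders are arbitrary choices for illustration, not the models from the chapter. Only base R functions are used: arima(), AIC(), and BIC(). For each criterion, the model with the smaller value is preferred.

```r
# Compare two fitted ARIMA models by log-likelihood, AIC, and BIC.
lap <- log(AirPassengers)     # assumed stand-in dataset, not the gasoline data

seas <- list(order = c(0, 1, 1), period = 12)               # shared seasonal part
fit_a <- arima(lap, order = c(0, 1, 1), seasonal = seas)    # candidate A
fit_b <- arima(lap, order = c(1, 1, 0), seasonal = seas)    # candidate B

comparison <- data.frame(
  model = c("A", "B"),
  logL  = c(fit_a$loglik, fit_b$loglik),
  AIC   = c(AIC(fit_a), AIC(fit_b)),
  BIC   = c(BIC(fit_a), BIC(fit_b))
)
print(comparison)             # smaller AIC/BIC indicates the preferred model
```

Both criteria start from -2 logL and add a penalty for the number of estimated parameters, so a model is not favored merely for having more coefficients.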

Normality and Constant Variance

Forecasting

8.2 ARIMA Model 8.2.6 Reasons to Choose and Cautions

One advantage of ARIMA modeling is that the analysis can be based simply on historical time series data for the variable of interest. By contrast, as observed in the chapter about regression (Chapter 6), a regression model requires various input variables to be considered and evaluated for inclusion in the model for the outcome variable

8.3 Additional Methods

Autoregressive Moving Average with Exogenous inputs (ARMAX)

Used to analyze a time series that is dependent on another time series.

For example

Retail demand for products can be modeled based on the previous demand combined with a weather-related time series such as temperature or rainfall.
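
A rough sketch of the retail example is shown below. All data are simulated and every number is a hypothetical assumption. The code uses the `xreg` argument of base R's arima(), which fits a regression on the exogenous series with ARMA errors; this is closely related to, but not identical to, a strict ARMAX formulation, so it should be read as an approximation of the idea rather than a definitive ARMAX implementation.

```r
# Demand modeled from its own past plus an exogenous temperature series.
set.seed(11)                                               # arbitrary seed
n <- 300
temp   <- 20 + 5 * sin(2 * pi * (1:n) / 12) + rnorm(n)     # hypothetical temperature
noise  <- arima.sim(model = list(ar = 0.6), n = n)         # autocorrelated disturbances
demand <- 100 + 0.5 * temp + noise                         # hypothetical demand series

# AR(1) errors plus a regression term for temperature
fit <- arima(demand, order = c(1, 0, 0), xreg = cbind(temp = temp))
est <- coef(fit)                                           # ar1, intercept, temp
print(round(est, 3))                                       # temp should be near 0.5
```

Forecasting from such a model requires future values of the exogenous series as well, supplied through the `newxreg` argument of predict().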

Spectral analysis is commonly used for signal processing and other engineering applications.

Speech recognition software uses such techniques to separate the signal for the spoken words from the overall signal that may include some noise.

Generalized Autoregressive Conditionally Heteroscedastic (GARCH)

A useful model for addressing time series with nonconstant variance or volatility.

Used for modeling stock market activity and price fluctuations.

Kalman filtering

Useful for analyzing real-time inputs about a system that can exist in certain states.

Typically, there is an underlying model of how the various components of the system interact and affect each other.

Processes the various inputs,

Attempts to identify the errors in the input, and

Predicts the current state.

For example

A Kalman filter in a vehicle navigation system can

Process various inputs, such as speed and direction, and

Update the estimate of the current location.
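
The predict-and-update cycle described above can be sketched with a one-dimensional filter. This is a deliberately simplified illustration, not a navigation-grade implementation: the function name kalman_1d, the motion model (position advances by a known displacement each step), and all noise levels are assumptions.

```r
# Minimal 1-D Kalman filter: track position from known displacements (u)
# and noisy position measurements (z). q = process variance, r = measurement variance.
kalman_1d <- function(z, u, q, r, x0 = 0, p0 = 1) {
  n <- length(z)
  x <- numeric(n)                 # filtered state estimates
  p <- numeric(n)                 # variance of each estimate
  x_prev <- x0
  p_prev <- p0
  for (k in seq_len(n)) {
    # Predict: apply the motion model; uncertainty grows by q
    x_pred <- x_prev + u[k]
    p_pred <- p_prev + q
    # Update: blend prediction and measurement using the Kalman gain
    gain <- p_pred / (p_pred + r)
    x[k] <- x_pred + gain * (z[k] - x_pred)
    p[k] <- (1 - gain) * p_pred
    x_prev <- x[k]
    p_prev <- p[k]
  }
  list(state = x, variance = p)
}

set.seed(5)                                    # arbitrary seed
steps <- 50
u <- rep(1, steps)                             # assumed: one unit of travel per step
truth <- cumsum(u + rnorm(steps, sd = 0.1))    # true position with small process noise
z <- truth + rnorm(steps, sd = 2)              # noisy position measurements
est <- kalman_1d(z, u, q = 0.1^2, r = 2^2)
print(mean(abs(z - truth)))                    # average error of the raw measurements
print(mean(abs(est$state - truth)))            # average error after filtering
```

The gain weights each measurement by how uncertain the prediction is relative to the sensor, which is how the filter accounts for errors in its inputs.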

Multivariate time series analysis

Examines multiple time series and their effect on each other.

Vector ARIMA (VARIMA)

Extends ARIMA by considering a vector of several time series at a particular time, t.

Can be used in marketing analyses

Examine the time series related to a company’s price and sales volume as well as related time series for the competitors.

Summary

Time series analysis is different from other statistical techniques in the sense that most statistical analyses assume the observations are independent of each other. Time series analysis implicitly addresses the case in which any particular observation is somewhat dependent on prior observations.

Using differencing, ARIMA models allow nonstationary series to be transformed into stationary series to which seasonal and nonseasonal ARMA models can be applied. The importance of using the ACF and PACF plots to evaluate the autocorrelations was illustrated in determining ARIMA models to consider fitting. Akaike and Bayesian Information Criteria can be used to compare one fitted ARIMA model against another. Once an appropriate model has been determined, future values in the time series can be forecasted.
