week 8-data science
Data Science and Big Data Analytics
Chap 8: Advanced Analytical Theory and Methods:
Time Series Analysis
Chapter Sections
8.1 Overview of Time Series Analysis
8.1.1 Box-Jenkins Methodology
8.2 ARIMA Model
8.2.1 Autocorrelation Function (ACF)
8.2.2 Autoregressive Models
8.2.3 Moving Average Models
8.2.4 ARMA and ARIMA Models
8.2.5 Building and Evaluating an ARIMA Model
8.2.6 Reasons to Choose and Cautions
8.3 Additional Methods
Summary
8 Time Series Analysis
This chapter’s emphasis is on
Identifying the underlying structure of the time series
Fitting an appropriate Autoregressive Integrated Moving Average (ARIMA) model
8.1 Overview of Time Series Analysis
Time series analysis attempts to model the underlying structure of observations over time
A time series, $y_t$ for $t = 1, 2, \ldots, n$, is an ordered sequence of equally spaced values over time
The analyses presented in this chapter are limited to equally spaced time series of one variable
The time series below plots the number of airline passengers against time in months (144 months, or 12 years)
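A plot of this kind can be reproduced in R. The sketch below assumes the figure is based on R's built-in AirPassengers dataset (monthly airline passenger totals, 144 observations); the slide does not name the dataset, so treat that choice, and the axis labels, as assumptions.

```r
# Assumption: the slide's figure corresponds to R's built-in AirPassengers series.
passengers <- AirPassengers          # monthly ts object
n_months <- length(passengers)       # number of monthly observations
plot(passengers,
     xlab = "Time (months)",
     ylab = "Number of passengers (thousands)")
```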
The goals of time series analysis are
Identify and model the structure of the time series
Forecast future values in the time series
Time series analysis has many applications in finance, economics, biology, engineering, retail, and manufacturing
8.1.1 Box-Jenkins Methodology
A time series can consist of the components:
Trend – long-term movement in a time series, increasing or decreasing over time – for example,
Steady increase in sales month over month
Annual decline of fatalities due to car accidents
Seasonality – describes the fixed, periodic fluctuation in the observations over time
Usually related to the calendar – e.g., airline passenger example
Cyclic – also periodic but not as fixed
E.g., retail sales versus the boom-bust cycle of the economy
Random – what remains after the other three components are accounted for
Often an underlying structure remains but usually with significant noise
This structure is what is modeled to obtain forecasts
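The components listed above can be separated with a classical decomposition. This is a minimal sketch, assuming the built-in AirPassengers series stands in for the airline passenger example. Note that decompose() estimates trend, seasonal, and random parts only; it does not isolate a separate cyclic component.

```r
# Assumption: AirPassengers stands in for the airline passenger example.
# The seasonal swings in this series grow with the level, so a multiplicative
# decomposition (series = trend * seasonal * random) is used here.
parts <- decompose(AirPassengers, type = "multiplicative")
seasonal_indices <- parts$figure    # one seasonal factor per calendar month
plot(parts)                         # panels: observed, trend, seasonal, random
```

A seasonal factor above 1 marks a month that typically runs above the trend, and a factor below 1 marks a month that runs below it.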
The Box-Jenkins methodology has three main steps:
Condition data and select a model
Identify/account for trends/seasonality in time series
Examine remaining time series to determine a model
Estimate the model parameters.
Assess the model, return to Step 1 if necessary
This chapter uses the Box-Jenkins methodology to apply an ARIMA model to a given time series
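The three steps can be sketched in a few lines of base R. The series (AirPassengers), the log transform, and the model orders below are illustrative assumptions, not choices made in the slides.

```r
# Step 1: condition the data and select a model.
# Assumption: AirPassengers is used only as an illustrative series.
y <- log(AirPassengers)                    # stabilise the growing variance
# One regular and one seasonal difference (period 12) account for the
# trend and the yearly seasonality.
# Step 2: estimate the model parameters.
fit <- arima(y, order = c(0, 1, 1),
             seasonal = list(order = c(0, 1, 1), period = 12))
# Step 3: assess the model. A large p-value means no evidence of leftover
# autocorrelation in the residuals; otherwise return to Step 1.
check <- Box.test(residuals(fit), lag = 24, type = "Ljung-Box", fitdf = 2)
print(fit)
print(check)
```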
The remainder of the chapter is rather advanced and will not be covered in this course
The remaining slides have not been finalized but can be reviewed by those interested in time series analysis
8.2 ARIMA Model
ARIMA = Autoregressive Integrated Moving Average
Step 1: remove any trends/seasonality in time series
Achieve a time series with certain properties to which autoregressive and moving average models can be applied
Such a time series is known as a stationary time series
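One common way to remove a trend is differencing, i.e. replacing each value with its change from the previous value. The sketch below uses a hypothetical simulated series (linear trend plus noise) that does not come from the slides.

```r
set.seed(42)                                  # reproducible simulated data
t_index <- 1:200
# Hypothetical series: linear trend plus white noise (not from the slides).
trending <- ts(5 + 0.3 * t_index + rnorm(200))
differenced <- diff(trending)                 # y_t - y_(t-1), one value shorter
# After differencing, the mean no longer drifts with time:
first_half_mean  <- mean(differenced[1:99])
second_half_mean <- mean(differenced[100:199])
```

Both half-sample means should sit near the slope of the trend (0.3 here), which is what a constant mean looks like in practice.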
A time series, $y_t$ for $t = 1, 2, 3, \ldots$, is a stationary time series if the following three conditions are met
(1) The expected value (mean) of $y_t$ is constant for all values of $t$
(2) The variance of $y_t$ is finite
(3) The covariance of $y_t$ and $y_{t+h}$ depends only on the value of $h = 0, 1, 2, \ldots$ for all $t$
The covariance of $y_t$ and $y_{t+h}$ is a measure of how the two variables, $y_t$ and $y_{t+h}$, vary together
If two variables are independent, covariance is zero.
If the variables change together in the same direction, cov is positive; conversely, if the variables change in opposite directions, cov is negative
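A stationary time series, by condition (1), has a constant mean, say $\mu$, so the covariance simplifies to $\mathrm{cov}(y_t, y_{t+h}) = E[(y_t - \mu)(y_{t+h} - \mu)]$
By condition (3), the covariance between two points can be nonzero, but it is a function of $h$ only, written $\mathrm{cov}(h)$; e.g., $\mathrm{cov}(3)$ is the covariance between any two points three time steps apart
If $h = 0$, $\mathrm{cov}(0) = \mathrm{cov}(y_t, y_t) = \mathrm{var}(y_t)$ for all $t$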
A plot of a stationary time series
8.2.1 Autocorrelation Function (ACF)
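From the figure, it appears that each point is somewhat dependent on the past points, but the figure does not provide insight into the covariance and its structure
The plot of the autocorrelation function (ACF) provides this insight
For a stationary time series, the ACF is defined as $\mathrm{ACF}(h) = \dfrac{\mathrm{cov}(y_t, y_{t+h})}{\sqrt{\mathrm{cov}(y_t, y_t)\,\mathrm{cov}(y_{t+h}, y_{t+h})}} = \dfrac{\mathrm{cov}(h)}{\mathrm{cov}(0)}$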
Because $\mathrm{cov}(0)$ is the variance,
the ACF is analogous to the correlation function of two variables, $\mathrm{corr}(y_t, y_{t+h})$, and
the value of the ACF falls between -1 and 1
Thus, the closer the absolute value of ACF(h) is to 1, the more useful $y_t$ can be as a predictor of $y_{t+h}$
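The ratio of the lag-h covariance to the lag-0 covariance can be checked numerically against R's acf() function. The series below is a hypothetical simulated one, and the sample covariances divide by n, which is the convention acf() uses.

```r
set.seed(7)
# Hypothetical stationary series with strong lag-1 dependence.
y <- as.numeric(arima.sim(model = list(ar = 0.9), n = 300))
n <- length(y)
m <- mean(y)
h <- 1                                                      # lag to inspect
cov_h <- sum((y[1:(n - h)] - m) * (y[(1 + h):n] - m)) / n   # sample cov(h)
cov_0 <- sum((y - m)^2) / n                                 # sample cov(0), the variance
acf_manual <- cov_h / cov_0
# acf() stores lag 0 at index 1, so lag h sits at index h + 1.
acf_builtin <- acf(y, lag.max = h, plot = FALSE)$acf[h + 1]
```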
Using the dataset plotted above, the ACF plot is
By convention, the quantity h in the ACF is referred to as the lag, the difference between the time points t and t + h
At lag 0, the ACF provides the correlation of every point with itself
According to the ACF plot, at lag 1 the correlation between $y_t$ and $y_{t-1}$ is approximately 0.9, which is very close to 1, so $y_{t-1}$ appears to be a good predictor of the value of $y_t$
The autocorrelations remain high over the next several lags, so a model can be considered that expresses $y_t$ as a linear sum of its previous 8 terms. Such a model is known as an autoregressive model of order 8
8.2.2 Autoregressive Models
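For a stationary time series, $y_t$, $t = 1, 2, 3, \ldots$, an autoregressive model of order p, denoted AR(p), is $y_t = \delta + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t$, where $\delta$ is a constant, the $\phi_j$ are constant coefficients with $\phi_p \neq 0$, and $\varepsilon_t$ is a random error term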
Thus, a particular point in the time series can be expressed as a linear combination of the prior p values, $y_{t-j}$ for $j = 1, 2, \ldots, p$, of the time series plus a random error term, $\varepsilon_t$
The $\varepsilon_t$ time series is often called a white noise process that represents random, independent fluctuations that are part of the time series
In the earlier example, the autocorrelations are quite high for the first several lags
Although an AR(8) model might be good, examining an AR(1) model provides further insight into the ACF and the value of p to choose
An AR(1) model, centered around $\delta = 0$, yields $y_t = \phi_1 y_{t-1} + \varepsilon_t$
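For an AR(1) model with a zero constant term, the theoretical ACF at lag h equals the coefficient raised to the power h. This is a standard result, not one stated on the slide. The sketch below checks it with ARMAacf() for a hypothetical coefficient of 0.9, and also computes the partial autocorrelations, which are nonzero only at lag 1.

```r
phi1 <- 0.9                                    # hypothetical AR(1) coefficient
lags <- 0:8
acf_theory <- ARMAacf(ar = phi1, lag.max = 8)  # theoretical ACF at lags 0..8
acf_closed <- phi1^lags                        # closed form for an AR(1) model
# With pacf = TRUE the result covers lags 1..8 (there is no lag-0 entry).
pacf_theory <- ARMAacf(ar = phi1, lag.max = 8, pacf = TRUE)
```

The slow decay of the ACF means that even an AR(1) series shows sizeable autocorrelation at many lags, which is why the partial autocorrelations are the more direct guide to the order p.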
8.2.3 Moving Average Models
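For a time series, $y_t$, centered at zero, a moving average model of order q, denoted MA(q), is expressed as $y_t = \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$, where the $\theta_k$ are constant coefficients with $\theta_q \neq 0$
That is, the value of a time series is a linear combination of the current white noise term and the prior q white noise terms, so earlier random shocks directly affect the current value of the time series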
The behavior of the ACF and partial autocorrelation function (PACF) plots is somewhat swapped relative to the behavior of these plots for AR(p) models
Consider a simulated MA(3) time series of the form $y_t = \varepsilon_t - 0.4\,\varepsilon_{t-1} + 1.1\,\varepsilon_{t-2} - 2.5\,\varepsilon_{t-3}$, where $\varepsilon_t \sim N(0, 1)$, plotted as a scatterplot of the simulated data over time
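A series of this form can be simulated with arima.sim(). The series length and the seed below are assumptions, since the slide gives neither.

```r
set.seed(123)
# arima.sim() uses the convention y_t = e_t + theta_1 * e_(t-1) + ...,
# so the coefficients keep the signs shown in the model equation.
theta <- c(-0.4, 1.1, -2.5)
ma3_sim <- arima.sim(model = list(ma = theta), n = 10000)  # e_t ~ N(0, 1) by default
plot(ma3_sim, type = "p", xlab = "Time", ylab = "Simulated MA(3) series")
# With unit-variance white noise, var(y_t) = 1 + sum of squared coefficients.
theory_var <- 1 + sum(theta^2)
```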
The ACF plot of the simulated MA(3) series is shown below
ACF(0) = 1, because any variable correlates perfectly with itself. At higher lags, the absolute values of the terms decay
In an autoregressive model, the ACF slowly decays, but for an MA(3) model, the ACF cuts off abruptly after lag 3, and this pattern extends to any MA(q) model.
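The cutoff can be seen in the theoretical ACF of the MA(3) model with coefficients -0.4, 1.1, and -2.5. This sketch uses ARMAacf(), which computes the autocorrelations implied by the coefficients, not estimates from data.

```r
theta <- c(-0.4, 1.1, -2.5)                   # MA(3) coefficients
acf_ma3 <- ARMAacf(ma = theta, lag.max = 6)   # theoretical ACF at lags 0..6
# Index 1 is lag 0, so lags 4, 5, 6 are at indices 5, 6, 7.
beyond_q <- acf_ma3[5:7]
print(round(acf_ma3, 3))
```

The entries for lags 4 through 6 are exactly zero, whereas a sample ACF from simulated data would only be close to zero at those lags.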
To understand this, examine the MA(3) model equations
Because $y_t$ shares specific white noise variables with $y_{t-1}$ through $y_{t-3}$, those three variables are correlated with $y_t$. However, the expression of $y_t$ does not share white noise variables with $y_{t-4}$ in Equation 8-14, so the theoretical correlation between $y_t$ and $y_{t-4}$ is zero. Of course, when dealing with a particular dataset, the theoretical autocorrelations are unknown, but the observed autocorrelations should be close to zero for lags greater than q when working with an MA(q) model
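The argument can be followed term by term by writing the general MA(3) model at successive time points. The layout below is a reconstruction from the MA(q) definition; the equation numbers of the source text are not reproduced.

```latex
\begin{aligned}
y_t     &= \varepsilon_t     + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \theta_3 \varepsilon_{t-3} \\
y_{t-1} &= \varepsilon_{t-1} + \theta_1 \varepsilon_{t-2} + \theta_2 \varepsilon_{t-3} + \theta_3 \varepsilon_{t-4} \\
y_{t-2} &= \varepsilon_{t-2} + \theta_1 \varepsilon_{t-3} + \theta_2 \varepsilon_{t-4} + \theta_3 \varepsilon_{t-5} \\
y_{t-3} &= \varepsilon_{t-3} + \theta_1 \varepsilon_{t-4} + \theta_2 \varepsilon_{t-5} + \theta_3 \varepsilon_{t-6} \\
y_{t-4} &= \varepsilon_{t-4} + \theta_1 \varepsilon_{t-5} + \theta_2 \varepsilon_{t-6} + \theta_3 \varepsilon_{t-7}
\end{aligned}
```

The first line contains $\varepsilon_{t-3}$, which also appears in the lines for $y_{t-1}$, $y_{t-2}$, and $y_{t-3}$, while the line for $y_{t-4}$ involves only $\varepsilon_{t-4}$ and earlier terms, none of which occur in $y_t$.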
8.2.4 ARMA and ARIMA Models
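In general, there is no need to choose between an AR(p) and an MA(q) model; the two representations can be combined into an Autoregressive Moving Average model, ARMA(p,q): $y_t = \delta + \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q}$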
If $p \neq 0$ and $q = 0$, then the ARMA(p,q) model is simply an AR(p) model. Similarly, if $p = 0$ and $q \neq 0$, then the ARMA(p,q) model is an MA(q) model
Although an ARMA model requires a stationary time series, many series exhibit a trend over time, e.g., an increasing linear trend. The "Integrated" part of ARIMA handles this by differencing the series until it is stationary
8.2.5 Building and Evaluating an ARIMA Model
For a large country, monthly gasoline production (millions of barrels) was obtained for 240 months (20 years).
A market research firm requires some short-term gasoline production forecasts
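library(forecast)  # load the forecast package
# Read the monthly gasoline production data from a CSV file
gas_prod_input <- as.data.frame(read.csv("c:/data/gas_prod.csv"))
# Use the second column as the series and convert it to a time series object
gas_prod <- ts(gas_prod_input[, 2])
plot(gas_prod, xlab = "Time (months)", ylab = "Gasoline production (millions of barrels)")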
Comparing Fitted Time Series Models
The arima() function in R uses Maximum Likelihood Estimation (MLE) to estimate the model coefficients. In the R output for an ARIMA model, the log-likelihood (logL) value is provided. The values of the model coefficients are determined such that the value of the log-likelihood function is maximized. Based on the logL value, the R output provides several measures that are useful for comparing the appropriateness of one fitted model against another fitted model.
AIC (Akaike Information Criterion)
AICc (Akaike Information Criterion, corrected)
BIC (Bayesian Information Criterion)
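The three criteria can be computed directly from the log-likelihood of a fitted model. The sketch below uses a hypothetical simulated AR(1) series and a hypothetical helper, info_criteria(); the AICc line uses the usual small-sample correction, which is an assumption about how the reported value is defined.

```r
set.seed(2024)
y <- arima.sim(model = list(ar = 0.7), n = 300)   # hypothetical AR(1) data
fit_ar1 <- arima(y, order = c(1, 0, 0))           # candidate 1: AR(1)
fit_ma1 <- arima(y, order = c(0, 0, 1))           # candidate 2: MA(1)

# Hypothetical helper: derive the criteria from the log-likelihood.
info_criteria <- function(fit) {
  k <- length(fit$coef) + 1        # estimated coefficients plus the noise variance
  n <- fit$nobs                    # observations used in the fit
  aic <- -2 * fit$loglik + 2 * k
  c(AIC  = aic,
    AICc = aic + 2 * k * (k + 1) / (n - k - 1),   # small-sample correction
    BIC  = -2 * fit$loglik + k * log(n))
}
comparison <- rbind(AR1 = info_criteria(fit_ar1), MA1 = info_criteria(fit_ma1))
print(comparison)                  # smaller values indicate the preferred model
```

Because the data were simulated from an AR(1) process, the AR(1) row should show the smaller values on all three criteria.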
Normality and Constant Variance
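Residual checks of this kind are usually graphical. The sketch below, on a hypothetical simulated series, plots the residuals over time to look for constant spread and draws a normal Q-Q plot; the Shapiro-Wilk test is added as an optional formal complement and is not mentioned in the slides.

```r
set.seed(11)
y <- arima.sim(model = list(ar = 0.6), n = 400)   # hypothetical data
fit <- arima(y, order = c(1, 0, 0))
res <- residuals(fit)
plot(res, ylab = "Residuals")        # look for constant spread around zero
abline(h = 0, lty = 2)
qqnorm(res)                          # points near the line suggest normality
qqline(res)
normality <- shapiro.test(as.numeric(res))   # optional formal test
```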
Forecasting
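Once a model has been fitted, forecasts and their standard errors can be obtained with predict(). The series below is a hypothetical simulated one, and the interval uses a normal approximation (plus or minus 1.96 standard errors), which is an assumption.

```r
set.seed(5)
y <- arima.sim(model = list(ar = 0.8), n = 240)   # hypothetical monthly series
fit <- arima(y, order = c(1, 0, 0))
fc <- predict(fit, n.ahead = 12)     # list with point forecasts ($pred) and $se
lower <- fc$pred - 1.96 * fc$se      # approximate 95% prediction interval
upper <- fc$pred + 1.96 * fc$se
ts.plot(y, fc$pred, lower, upper,
        lty = c(1, 1, 2, 2),
        col = c("black", "blue", "grey40", "grey40"))
```

The interval widens as the forecast horizon grows, which is one reason ARIMA forecasts are typically used for the short term.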
8.2.6 Reasons to Choose and Cautions
One advantage of ARIMA modeling is that the analysis can be based simply on historical time series data for the variable of interest. By contrast, as observed in the chapter about regression (Chapter 6), various input variables need to be considered and evaluated for inclusion in a regression model for the outcome variable
8.3 Additional Methods
Autoregressive Moving Average with Exogenous inputs (ARMAX)
Used to analyze a time series that is dependent on another time series.
For example
Retail demand for products can be modeled based on the previous demand combined with a weather-related time series such as temperature or rainfall.
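A closely related model can be sketched with the xreg argument of arima(). Note the caveat: arima() with xreg fits a regression on the external series with ARMA errors, which is similar to, but not identical to, the classical ARMAX formulation. The demand and temperature series below are hypothetical simulated data.

```r
set.seed(99)
n <- 300
# Hypothetical external driver and autocorrelated noise (not from the slides).
temperature <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))
noise <- as.numeric(arima.sim(model = list(ar = 0.6), n = n))
demand <- 20 + 2 * temperature + noise            # true slope on temperature is 2
# Regression on temperature with AR(1) errors.
fit <- arima(demand, order = c(1, 0, 0),
             xreg = cbind(temperature = temperature))
slope_hat <- coef(fit)["temperature"]             # estimated effect of temperature
```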
Spectral analysis is commonly used for signal processing and other engineering applications.
Speech recognition software uses such techniques to separate the signal for the spoken words from the overall signal that may include some noise.
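The idea of separating a signal from noise by frequency can be illustrated with a raw periodogram. This is a minimal sketch on a hypothetical signal, a sinusoid with a 20-step period plus noise, using spec.pgram() from base R.

```r
set.seed(3)
n <- 200
t_index <- 1:n
# Hypothetical signal: sinusoid with a 20-step period buried in noise.
signal <- sin(2 * pi * t_index / 20) + rnorm(n, sd = 0.5)
pgram <- spec.pgram(ts(signal), taper = 0, detrend = FALSE, plot = FALSE)
peak_freq <- pgram$freq[which.max(pgram$spec)]   # dominant frequency, cycles per step
peak_period <- 1 / peak_freq                     # corresponding period in steps
```

The periodogram peak recovers the period of the hidden sinusoid even though the noise obscures it in the time-domain plot.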
Generalized Autoregressive Conditionally Heteroscedastic (GARCH)
A useful model for addressing time series with nonconstant variance or volatility.
Used for modeling stock market activity and price fluctuations.
Kalman filtering
Useful for analyzing real-time inputs about a system that can exist in certain states.
Typically, there is an underlying model of how the various components of the system interact and affect each other.
Processes the various inputs,
Attempts to identify the errors in the input, and
Predicts the current state.
For example
A Kalman filter in a vehicle navigation system can
Process various inputs, such as speed and direction, and
Update the estimate of the current location.
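The predict-and-update cycle can be sketched for the simplest case, a single state tracked from noisy readings. Everything below is a hypothetical one-dimensional example: the state follows a random walk, and the noise variances are assumed known, which a real navigation system would not have for free.

```r
set.seed(8)
n <- 100
true_pos <- cumsum(rnorm(n, sd = 0.5))        # hypothetical hidden state (random walk)
obs <- true_pos + rnorm(n, sd = 2)            # noisy sensor readings
q <- 0.5^2                                    # process noise variance (assumed known)
r <- 2^2                                      # measurement noise variance (assumed known)
est <- numeric(n)
x_hat <- 0                                    # initial state guess
p <- 10                                       # variance of the initial guess
for (i in seq_len(n)) {
  p <- p + q                                  # predict: uncertainty grows
  k_gain <- p / (p + r)                       # Kalman gain: trust placed in the reading
  x_hat <- x_hat + k_gain * (obs[i] - x_hat)  # update the estimate with the new reading
  p <- (1 - k_gain) * p                       # uncertainty shrinks after the update
  est[i] <- x_hat
}
rmse_raw <- sqrt(mean((obs - true_pos)^2))       # error of the raw readings
rmse_filtered <- sqrt(mean((est - true_pos)^2))  # error of the filtered estimate
```

Comparing the two error values shows how much of the measurement noise the filter removes.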
Multivariate time series analysis
Examines multiple time series and their effect on each other.
Vector ARIMA (VARIMA)
Extends ARIMA by considering a vector of several time series at a particular time, t.
Can be used in marketing analyses
Examine the time series related to a company’s price and sales volume as well as related time series for the competitors.
Summary
Time series analysis is different from other statistical techniques in the sense that most statistical analyses assume the observations are independent of each other. Time series analysis implicitly addresses the case in which any particular observation is somewhat dependent on prior observations.
Using differencing, ARIMA models allow nonstationary series to be transformed into stationary series to which seasonal and nonseasonal ARMA models can be applied. The importance of using the ACF and PACF plots to evaluate the autocorrelations was illustrated in determining ARIMA models to consider fitting. Akaike and Bayesian Information Criteria can be used to compare one fitted ARIMA model against another. Once an appropriate model has been determined, future values in the time series can be forecasted