# Time Series 101

So, you want to predict the future? This is the missing Time Series guide using Python I have been looking for on the entire internet. Source code link is at the bottom.

A Time Series is a set of observations recorded at different times. Reason being why Time Series analysis is very important is that based on those observations over time, a pattern may emerge and predictions could be made about the future. There are four components of Time Series, two of which Trend and Seasonality are discussed here, and the rest Cyclical and Noise/Random can be referred to this Wikipedia article. This whole Time Series thing is fully a statistical business. So, brace yourself.

## Python Housekeeping

First thing’s first. You may need to perform a bit of housekeeping before you start reading the CSVs and feeding into your time series pipeline. Begin with declaring the source code encoding of the Python script, so that you may use text data encoded in Unicode because the default encoding is 7-bit ASCII.

```
#!/usr/bin/python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as mp
from statsmodels.tsa.arima_model import ARMA
from statsmodels.tsa.stattools import adfuller, arma_order_select_ic
np.set_printoptions(threshold=np.inf)
```

A few obligatory imports whenever you would like to play with data: Numpy, Pandas and Matplotlib. In addition to those, a statistical library named statsmodels. If you are on OS X and having trouble with Matploblib, check out my other post OpenCV 3.0 Up & Running where I showed how to make Matplotlib work for you.

## Climate Change: The Dataset

Climate Change is a hot topic now, so let’s see how hot it is going to be using Time Series analysis. You may download and read general description and instruction of the dataset here. Data points start from as early as 1750 for pretty much possibly all countries and cities. Some countries I did not even hear about before, and some countries have names in Unicode. In this post, I focused only on the `GlobalLandTemperaturesByCountry.csv`

file which contains Country-wise temperature data.

Here are top five records of the dataset:

```
dt AverageTemperature AverageTemperatureUncertainty Country
0 1743-11-01 4.384 2.294 Åland
1 1743-12-01 NaN NaN Åland
2 1744-01-01 NaN NaN Åland
3 1744-02-01 NaN NaN Åland
4 1744-03-01 NaN NaN Åland
```

The indices began with 0, followed by the `dt`

column which is essentially the Dates of the recordings, Average temperature and the Country name. For this post, we do not care about AverageTemperatureUncertainty.

## Preparing the Dataset

Let us begin with dropping a column we won’t care about namely AverageTemperatureUncertainty. Let us also filter by a country, eg. Canada, until we intend to analyze temperatures of the entire globe, which won’t be a feasible scope for this post.

```
df = pd.read_csv('GlobalLandTemperaturesByCountry.csv', sep=',',
skipinitialspace=True, encoding='utf-8')
df = df.drop('AverageTemperatureUncertainty', axis=1)
df = df[df.Country == 'Canada']
df = df.drop('Country', axis=1)
print df.shape
df = df[df.AverageTemperature.notnull()]
print df.shape
```

Because the country is also chosen here, the Country column can be safely deleted as well. Here’s the possible output of the above snippet:

```
(2941, 3)
(2504, 3)
```

This means that 437 rows have missing values. We will get back to the missing values shortly. For now, let us change the traditional indices `0...n-1`

to the dates. Pandas has enough batteries to do just that:

```
df.index = pd.to_datetime(df.dt)
df = df.drop('dt', axis=1)
df = df.ix['1820-01-01':] # disable this, if you use fillna below
df = df.sort_index()
df.AverageTemperature.fillna(method='pad', inplace=True)
```

It was noticed that for Canada, the dataset has no missing values from as early as 1820, so we can either opt for that or we can fill the missing values. However, the dataframe needs to be sorted in ascending order in order for us to work on prediction. Now fillna can be applied to replace the `NaN`

values with the last known value. Here is an example:

```
... before fillna ...
1769-07-01 00:00:00 13.953
1769-08-01 00:00:00 11.621
1769-09-01 00:00:00 NaN
1769-11-01 00:00:00 NaN
... after fillna ...
1769-07-01 00:00:00 13.953
1769-08-01 00:00:00 11.621
1769-09-01 00:00:00 11.621
1769-10-01 00:00:00 11.621
```

**Bummer:** If you intend to feed any data into a Series before 1900, for example 1820, here’s what you’d get: *year=1820 is before 1900; the datetime strftime() methods require year >= 1900*. Therefore, you may have to either stick to 1900 and later or a suitable hack that gracefully degrades this issue.

## Visualizing the Dataset

I think we have done much, but seen little. Let us take a look at the state of the data as of now:

```
df.AverageTemperature.fillna(method='pad', inplace=True)
mp.plot(df.AverageTemperature)
mp.show()
```

Now look at the massive flat lines, which was caused due to `fillna`

. I am going to cowardly ignore them, and stick to data points on or after 1820. Although I am cowardly backing out there are several techniques that could be used to resolve this matter and create artificial spikes and falls by doing groupbys, resampling, filling with mean, median and what not. But, for such a large range of missing values, I do not dare. Still makes no sense? Let’s try a smaller date range and check to see what is going on there. Let’s look at last 10-year data:

**Seasonal:** As you can see every year there is a pattern from the beginning of the year to the end. This characteristic is known as seasonality.

### Rolling Mean/Moving Average

Running a rolling mean aka. moving average through the data is almost obligatory just to get a more comprehensive insight into the data. The rolling mean window is kept 12 just to signify that we are interested in 12-month data points per year.

```
df.AverageTemperature.plot.line(style='b', legend=True,
grid=True, label='Avg. Temperature (AT)')
ax = df.AverageTemperature.rolling(window=12).mean().plot.line(style='r',
legend=True, label='Mean AT')
ax.set_xlabel('Date')
mp.legend(loc='best')
mp.title('Weather timeseries visualization')
mp.show()
```

**Trend:** It essentially shows the trend. As you can see, for the most part of the recorded history, the average temperatures throughout the year were usually under -5C until 1994, where it went above -5C and it now still stays there. This trend emerged in a very large timespan, otherwise it would hurt the stationarity of the Time Series and forecasting might not be possible. More on that later.

## Is the Data any Good?

Predictions are as good as the data. The dataset must be stationary, otherwise there is very little chance that the predictions to be made would be of any meaning. What is data stationarity anyway? If the dataset has its key statistical properties unchanged over time such as mean value, the variance and the autocorrelation, it is said to be stationary. I find this to be a fine read on this. There is a test called Augmented Dickey–Fuller test, which helps to find stationary properties in a Time Series. The following function was inspired by this:

```
# Usage: test_stationarity(df.AverageTemperature)
def test_stationarity(df):
print 'Results of Dickey-Fuller Test:'
dftest = adfuller(df)
indices = ['Test Statistic', 'p-value',
'No. Lags Used', 'Number of Observations Used']
output = pd.Series(dftest[0:4], index=indices)
for key, value in dftest[4].items():
output['Critical Value (%s)' % key] = value
print output
```

The output turns out to be quite great - meaning that the Time Series is very stationary:

```
Results of Dickey-Fuller Test:
Test Statistic -4.572483
p-value 0.000145
No. Lags Used 24.000000
Number of Observations Used 2300.000000
Critical Value (5%) -2.862797
Critical Value (1%) -3.433196
Critical Value (10%) -2.567439
```

One way to interpret the results is that the Test Statistic here is lower than Critical Value (1%), so it can be concluded with 99% confidence that the Time Series is stationary. What to do if the Time Series is non-stationary? One way is to apply Autoregressive integrated moving average (ARIMA) to the series to make it stationary. A bit more on that later.

## Prediction Theories

Now that we have prepared our data it is time to fit a curve or learn and predict from data. There are various ways to attempt that, but I am going to cover only a few. Brace yourself as the theories are back with vengence.

### Moving Average (MA) Model

Moving Average and Moving Average Models are different yet similar. The former is basically what we did earlier when we ran rolling mean, and the latter is used to fit/model usually an univariate Time Series and denotes that the output depends linearly on the current and various past values of a stochastic or imperfectly predictable term. MA model is always stationary. If you intend to pick only `q`

previous terms, it can be specified by the following:

MA(q) = X_{t} = μ + ε_{t} + θ_{1}ε_{t-1} + … + θ_{q}ε_{t-q}, where μ is the mean of the series, θ are parameters and ε is the White Noise/Error.

According to Gauss–Markov theorem, ε has the following properties:

- Mean, E(ε
_{t}) = 0, ∀t - All have the same finite variance, Var(ε
_{t})=σ^{2}, ∀t - Error terms are not correlated, Cov(ε
_{t}, ε_{s}) = 0 forall t≠s

A simpler intuition of ε_{i} is that it’s the difference between the moving average at i^{th} instance and the actual value, hence the name error.

### Autoregressive (AR) Model

The Autoregressive Model is exactly the same as MA model, except that AR Model is not always stationary. If you intend to pick only `p`

previous terms, it can be specified as below:

AR(p) = X_{t} = c + φ_{1}X_{t-1} + … + φ_{p}X_{t-p} + ε_{t}, where c is a constant, φ are parameters and the random variable ε_{t} is white noise.

**Lag operator:** returns the value of a previous case. Therefore, if a lag operation is performed on an element, it returns its predecessor. Lag is used in both AR and MA models.

### Autoregressive–moving-average (ARMA) Model

It’s the union of AR and MA model, and specified as below (from Wikipedia):

ARMA(p, q) =

However, question remains what are the optimal values for `p`

and `q`

? And how can we find them?

#### Finding the Best `p`

and `q`

Many use Autocorrelation function (ACF) and Partial autocorrelation function (PACF) to try to determine visually as to what could be good `p`

and `q`

estimates. One may argue, it’s quite vague and subjective. Quoting from Matlab’s documentation: *The ARMA lags cannot be selected solely by looking at the ACF and PACF…* What Matlab does is that it uses an approach based on Akaike information criterion (AIC) and Bayesian information criterion (BIC) in order to automatically try with different `p`

and `q`

values and proposes a systematic and measurable way to choose your set. Both of them have massive literatures which are totally centered around statistics, but for mere mortals like ourselves, we have excellent third-party Python packages to deal with them, such as statsmodels:

```
print arma_order_select_ic(df.AverageTemperature, ic=['aic', 'bic'], trend='nc',
max_ar=4, max_ma=4, fit_kw={'method': 'css-mle'})
```

This will print out this large matrix laying all possible combination of `p`

and `q`

within the range we specified earlier in `max_ar`

and `max_ma`

.

```
{'bic_min_order': (3, 3), 'aic_min_order': (3, 3),
'bic': 0 1 2 3 4
0 NaN 9551.833892 8627.365113 8074.089428 7696.441328
1 9055.274290 8272.865241 7839.717253 8535.844432 8010.955486
2 7069.951373 NaN 6633.176005 6500.003441 6380.696664
3 6836.470928 6838.489737 NaN 5223.201762 NaN
4 6829.801936 6849.877710 NaN 5802.381136 5246.509542,
'aic': 0 1 2 3 4
0 NaN 9541.396072 8611.708384 8053.213790 7670.346779
1 9044.836471 8257.208512 7818.841614 8509.749884 7979.642028
2 7054.294644 NaN 6607.081457 6468.689982 6344.164296
3 6815.595289 6812.395189 NaN 5186.669394 NaN
4 6803.707387 6818.564252 NaN 5760.629858 5199.539355}
```

Now the question is how to interpret this matrix? You can take either BIC or AIC, but we computed both just to show that multiple ways exist. Let us assume we are taking BIC into consideration. We find that at `(p=3, q=3)`

location `(p=row#, q=col#)`

, BIC is the lowest. That’s the sweet spot.

## Predict the Future, Already!

We have everything we need. Now let’s fit the curve, in other words, learn the data points. `ARMA`

method can be invoked directly with a `method`

which indicates what algorithm to use while calculating the ARMA. You may use Kalman filter, Conditional Sum of Squares, etc. There are variations of ARMA, such as ARIMA, ARMAX, etc. ARIMA is important in making a Time Series stationary, which takes an extra parameter which is known as `d`

, differencing by eliminating trend and seasonality. As usual, statsmodels has fantastic API, ARIMA. In this case, I have used Conditional Sum of Squares Maximum Likelihood Estimation (css-mle) in other words a variation of Kalman filter.

```
# Fit the model
ts = pd.Series(df.AverageTemperature, index=df.index)
model = ARMA(ts, order=(3, 3))
results = model.fit(trend='nc', method='css-mle')
print(results.summary2())
# Plot the model
fig, ax = mp.subplots(figsize=(10, 8))
fig = results.plot_predict('01/01/2003', '12/01/2023', ax=ax)
ax.legend(loc='lower left')
mp.title('Weather Time Series prediction')
mp.show()
predictions = results.predict('01/01/2003', '12/01/2023')
# You can manipulate/print the predictions after this
```

The green line here is the 10-year historic weather data and the blue line is the predicted next 10-year predicted data. The gray area as you can see is the potential area where the predicted line could peak or fall over time with 95% confidence. Meaning that the line is very unlikely to go beyond the gray areas. If you have taken a closer look, you find that the blue line has been trying to fit itself with the historic data from the beginning.

## Conclusion

There you have it! In this post, I have shown how you can teach a computer program to follow data points, learn from it, and then predict the future. The source code is available here.