So, you want to predict the future? This is the missing Python Time Series guide I had been looking for all over the internet. The source code link is at the bottom.

A Time Series is a set of observations recorded at different points in time. The reason Time Series analysis is so important is that a pattern may emerge from those observations over time, and based on that pattern predictions can be made about the future. There are four components of a Time Series; two of them, Trend and Seasonality, are discussed here, while the other two, Cyclical and Noise/Random, are covered in this Wikipedia article. This whole Time Series thing is fully a statistical business. So, brace yourself.

Python Housekeeping

First things first. You may need to perform a bit of housekeeping before you start reading the CSVs and feeding them into your time series pipeline. Begin by declaring the source code encoding of the Python script, so that you can use text data encoded in Unicode; the default encoding is 7-bit ASCII.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as mp
from statsmodels.tsa.arima_model import ARMA
from statsmodels.tsa.stattools import adfuller, arma_order_select_ic
np.set_printoptions(threshold=np.inf)

A few obligatory imports whenever you would like to play with data: Numpy, Pandas and Matplotlib. In addition to those, a statistical library named statsmodels. If you are on OS X and having trouble with Matplotlib, check out my other post OpenCV 3.0 Up & Running where I showed how to make Matplotlib work for you.

Climate Change: The Dataset

Climate Change is a hot topic now, so let's see how hot it is going to get using Time Series analysis. You may download the dataset and read its general description and instructions here. Data points start from as early as 1750 for nearly all countries and cities. Some of the countries I had not even heard of before, and some have names in Unicode. In this post, I focus only on the GlobalLandTemperaturesByCountry.csv file, which contains country-wise temperature data.

Here are the top five records of the dataset:

           dt  AverageTemperature  AverageTemperatureUncertainty Country
0  1743-11-01               4.384                          2.294   Åland
1  1743-12-01                 NaN                            NaN   Åland
2  1744-01-01                 NaN                            NaN   Åland
3  1744-02-01                 NaN                            NaN   Åland
4  1744-03-01                 NaN                            NaN   Åland

The index begins at 0, followed by the dt column, which holds the dates of the recordings, the average temperature, and the country name. For this post, we do not care about AverageTemperatureUncertainty.

Preparing the Dataset

Let us begin by dropping the column we won't care about, namely AverageTemperatureUncertainty. Let us also filter by a single country, e.g. Canada, since analyzing temperatures of the entire globe is not a feasible scope for this post.

df = pd.read_csv('GlobalLandTemperaturesByCountry.csv', sep=',', 
        skipinitialspace=True, encoding='utf-8')

df = df.drop('AverageTemperatureUncertainty', axis=1)
df = df[df.Country == 'Canada']
df = df.drop('Country', axis=1)
print df.shape
df = df[df.AverageTemperature.notnull()]
print df.shape

Because a single country has been chosen here, the Country column can safely be deleted as well. Here's the output of the above snippet:

(2941, 2)
(2504, 2)

This means that 437 rows have missing values. We will get back to the missing values shortly. For now, let us change the default 0...n-1 index to the dates. Pandas has enough batteries to do just that:

df.index = pd.to_datetime(df.dt)
df = df.drop('dt', axis=1)
df = df.ix['1820-01-01':] # disable this, if you use fillna below
df = df.sort_index()
df.AverageTemperature.fillna(method='pad', inplace=True)

It turns out that for Canada the dataset has no missing values from as early as 1820 onwards, so we can either start there or fill in the missing values. Either way, the dataframe needs to be sorted in ascending order for us to work on prediction. Then fillna can be applied to replace the NaN values with the last known value. Here is an example:

... before fillna ...
1769-07-01 00:00:00              13.953
1769-08-01 00:00:00              11.621
1769-09-01 00:00:00                 NaN
1769-10-01 00:00:00                 NaN
... after fillna ...
1769-07-01 00:00:00              13.953
1769-08-01 00:00:00              11.621
1769-09-01 00:00:00              11.621
1769-10-01 00:00:00              11.621

Bummer: If you intend to feed any data before 1900, for example 1820, into a Series, here's what you'd get: year=1820 is before 1900; the datetime strftime() methods require year >= 1900. Therefore, you may have to either stick to 1900 and later, or find a suitable hack that gracefully works around the issue.
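
One minimal workaround, assuming all you need is an ISO-style date string, is to sidestep strftime() entirely:

from datetime import date

d = date(1820, 1, 1)
# strftime() refuses years before 1900 on older Pythons,
# but isoformat() and manual formatting work fine:
print(d.isoformat())                                 # 1820-01-01
print('%04d-%02d-%02d' % (d.year, d.month, d.day))   # 1820-01-01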

Visualizing the Dataset

I think we have done much, but seen little. Let us take a look at the state of the data as of now:

df.AverageTemperature.fillna(method='pad', inplace=True)
mp.plot(df.AverageTemperature)
mp.show()

Time Series data after fillna

Now look at the massive flat lines, which were caused by fillna. I am going to cowardly ignore them and stick to data points on or after 1820. Although I am backing out here, there are several techniques that could be used to resolve this matter and create artificial spikes and dips: groupbys, resampling, filling with the mean or median, and what not. But for such a large range of missing values, I do not dare. Still makes no sense? Let's try a smaller date range and see what is going on there. Let's look at the last 10 years of data:
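
If you would like to reproduce a similar zoomed-in view, a sketch along these lines should do; the exact date range is just an assumption on my part:

# Zoom into (roughly) the last 10 years of the series
last_decade = df.ix['2003-01-01':]
mp.plot(last_decade.AverageTemperature)
mp.title('Average temperature, 2003 onwards')
mp.show()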

From 2003 to 2013

Seasonal: As you can see, every year there is a repeating pattern from the beginning of the year to the end. This characteristic is known as seasonality.

Rolling Mean/Moving Average

Running a rolling mean, a.k.a. moving average, through the data is almost obligatory just to get a more comprehensive insight into the data. The rolling window is kept at 12 because we have 12 monthly data points per year.

df.AverageTemperature.plot.line(style='b', legend=True, 
    grid=True, label='Avg. Temperature (AT)')
ax = df.AverageTemperature.rolling(window=12).mean().plot.line(style='r', 
    legend=True, label='Mean AT')
ax.set_xlabel('Date')
mp.legend(loc='best')
mp.title('Weather timeseries visualization')
mp.show()

Weather timeseries visualization: average temperature with a 12-month rolling mean

Trend: This essentially shows the trend. As you can see, for most of the recorded history the yearly average temperature stayed below -5C, until around 1994, when it went above -5C and has stayed there since. This trend emerged over a very large timespan; otherwise it would hurt the stationarity of the Time Series and forecasting might not be possible. More on that later.
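
If you would rather back that observation with numbers than eyeball the plot, one possible sanity check (not part of the original pipeline) is to resample to annual means and see which years rose above -5C:

# Resample the monthly series to annual means ('A' = year-end frequency)
annual = df.AverageTemperature.resample('A').mean()
# Print the years whose annual mean temperature rose above -5 C
print(annual[annual > -5.0].index.year)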

Is the Data any Good?

Predictions are only as good as the data. The dataset must be stationary, otherwise there is very little chance that the predictions would be of any meaning. What is stationarity anyway? If the key statistical properties of a dataset, such as the mean, the variance and the autocorrelation, remain unchanged over time, it is said to be stationary. I find this to be a fine read on the topic. There is a test called the Augmented Dickey–Fuller test, which helps determine whether a Time Series is stationary. The following function was inspired by this:

# Usage: test_stationarity(df.AverageTemperature)
def test_stationarity(df):
    print 'Results of Dickey-Fuller Test:'
    dftest = adfuller(df)
    indices = ['Test Statistic', 'p-value', 
                'No. Lags Used', 'Number of Observations Used']
    output = pd.Series(dftest[0:4], index=indices)
    for key, value in dftest[4].items():
        output['Critical Value (%s)' % key] = value
    print output

The output turns out to be quite encouraging, meaning that the Time Series is very much stationary:

Results of Dickey-Fuller Test:
Test Statistic                  -4.572483
p-value                         0.000145
No. Lags Used                   24.000000
Number of Observations Used     2300.000000
Critical Value (5%)             -2.862797
Critical Value (1%)             -3.433196
Critical Value (10%)            -2.567439

One way to interpret the results is that the Test Statistic is lower than the Critical Value (1%), so we can conclude with 99% confidence that the Time Series is stationary. What to do if the Time Series is non-stationary? One common way is differencing, the "I" (integrated) step of an Autoregressive integrated moving average (ARIMA) model, which removes trend and can make the series stationary. A bit more on that later.
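
Purely as an illustration of what that differencing step looks like (our series is already stationary, so it is not needed here), you could difference once and re-run the test:

# First-order differencing: subtract the previous value from each value.
# This is the 'd' step that ARIMA performs internally (here d=1).
diffed = df.AverageTemperature.diff().dropna()
test_stationarity(diffed)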

Prediction Theories

Now that we have prepared our data, it is time to fit a curve, or in other words learn from the data points and predict from them. There are various ways to attempt that, but I am going to cover only a few. Brace yourself: the theories are back with a vengeance.

Moving Average (MA) Model

The Moving Average and the Moving Average Model are different yet similar. The former is basically what we did earlier when we ran the rolling mean; the latter is used to model, usually, a univariate Time Series, and states that the output depends linearly on the current and various past values of a stochastic (imperfectly predictable) term. An MA model is always stationary. If you pick only the q previous terms, it can be specified as follows:

MA(q): $X_t = \mu + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q}$, where μ is the mean of the series, the θ's are the parameters and ε is the white noise/error term.

According to Gauss–Markov theorem, ε has the following properties:

  • Mean: $E(\varepsilon_t) = 0$, $\forall t$
  • All have the same finite variance: $Var(\varepsilon_t) = \sigma^2$, $\forall t$
  • Error terms are not correlated: $Cov(\varepsilon_t, \varepsilon_s) = 0$, $\forall t \neq s$

A simpler intuition of $\varepsilon_i$ is that it is the difference between the moving average at the i-th instance and the actual value, hence the name error.
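
To make the ε terms a little less abstract, here is a tiny simulation of an MA(1) process; the θ value is made up purely for illustration:

# Simulate an MA(1) process: X_t = mu + e_t + theta * e_{t-1}
np.random.seed(0)
mu, theta = 0.0, 0.6                  # made-up values for illustration
e = np.random.normal(0, 1, 500)       # white noise: zero mean, constant variance
x = mu + e[1:] + theta * e[:-1]
print(np.mean(x), np.var(x))          # roughly mu and (1 + theta**2) * sigma**2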

Autoregressive (AR) Model

The Autoregressive Model is similar in form to the MA model, but it regresses Xt on the series' own previous values rather than on previous error terms, and unlike the MA model, an AR model is not always stationary. If you pick only the p previous terms, it can be specified as below:

AR(p): $X_t = c + \varphi_1 X_{t-1} + \dots + \varphi_p X_{t-p} + \varepsilon_t$, where c is a constant, the φ's are the parameters and the random variable $\varepsilon_t$ is white noise.

Lag operator: returns the value of the previous observation. Therefore, if a lag operation is performed on an element, it returns its predecessor: $L X_t = X_{t-1}$. The lag is used in both the AR and MA models.
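
Similarly, a quick AR(1) simulation shows the dependence on the series' own lagged values; again, φ is made up for illustration:

# Simulate an AR(1) process: X_t = c + phi * X_{t-1} + e_t
np.random.seed(1)
c, phi, n = 0.0, 0.8, 500             # made-up constant and parameter
e = np.random.normal(0, 1, n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = c + phi * x[t - 1] + e[t]  # each value depends on its lagged predecessor
print(np.var(x))                      # should be near 1 / (1 - phi**2), i.e. ~2.78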

Autoregressive–moving-average (ARMA) Model

It's the union of the AR and MA models, and is specified as below (from Wikipedia):

ARMA(p, q): $X_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i X_{t-i} + \sum_{j=1}^{q} \theta_j \varepsilon_{t-j}$
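
If you want to play with a synthetic ARMA series, statsmodels can generate one for you; the coefficients below are arbitrary and only serve as an illustration:

from statsmodels.tsa.arima_process import arma_generate_sample

# Lag polynomials include the zero-lag coefficient, and the AR side is negated,
# so this corresponds to AR parameters (0.75, -0.25) and MA parameters (0.65, 0.35)
ar_params = np.array([1, -0.75, 0.25])
ma_params = np.array([1, 0.65, 0.35])
sample = arma_generate_sample(ar_params, ma_params, nsample=500)
print(sample[:5])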

However, the question remains: what are the optimal values for p and q, and how can we find them?

Finding the Best p and q

Many use the Autocorrelation function (ACF) and the Partial autocorrelation function (PACF) to visually estimate what good p and q values might be. One may argue that this is quite vague and subjective. Quoting from Matlab's documentation: The ARMA lags cannot be selected solely by looking at the ACF and PACF… What Matlab does instead is use an approach based on the Akaike information criterion (AIC) and the Bayesian information criterion (BIC): it automatically tries different p and q values and proposes a systematic, measurable way to choose your set. Both criteria have massive literatures centered entirely around statistics, but for mere mortals like ourselves there are excellent third-party Python packages to deal with them, such as statsmodels:

print arma_order_select_ic(df.AverageTemperature, ic=['aic', 'bic'], trend='nc',
            max_ar=4, max_ma=4, fit_kw={'method': 'css-mle'})

This will print out a large matrix laying out all possible combinations of p and q within the ranges we specified in max_ar and max_ma.

{'bic_min_order': (3, 3), 'aic_min_order': (3, 3), 
'bic':      0            1            2            3            4
0          NaN  9551.833892  8627.365113  8074.089428  7696.441328
1  9055.274290  8272.865241  7839.717253  8535.844432  8010.955486
2  7069.951373          NaN  6633.176005  6500.003441  6380.696664
3  6836.470928  6838.489737          NaN  5223.201762          NaN
4  6829.801936  6849.877710          NaN  5802.381136  5246.509542, 
'aic':      0            1            2            3            4
0          NaN  9541.396072  8611.708384  8053.213790  7670.346779
1  9044.836471  8257.208512  7818.841614  8509.749884  7979.642028
2  7054.294644          NaN  6607.081457  6468.689982  6344.164296
3  6815.595289  6812.395189          NaN  5186.669394          NaN
4  6803.707387  6818.564252          NaN  5760.629858  5199.539355}

Now the question is how to interpret this matrix. You can take either BIC or AIC; we computed both just to show that multiple ways exist. Let us take BIC into consideration. We find that the lowest BIC is at location (p=3, q=3), where p is the row number and q is the column number. That's the sweet spot.
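
Instead of reading the matrix by eye, you can also pull the minimum-IC orders out programmatically; a small sketch, assuming the return value is kept in a variable:

res = arma_order_select_ic(df.AverageTemperature, ic=['aic', 'bic'], trend='nc',
            max_ar=4, max_ma=4, fit_kw={'method': 'css-mle'})
print(res['bic_min_order'])   # (3, 3) -- the (p, q) pair with the lowest BIC
print(res['aic_min_order'])   # (3, 3)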

Predict the Future, Already!

We have everything we need. Now let's fit the curve, in other words, learn the data points. The ARMA class can be invoked with a method argument which indicates what algorithm to use while fitting the model, e.g. conditional sum of squares, exact maximum likelihood, or a combination of both. There are variations of ARMA, such as ARIMA, ARMAX, etc. ARIMA is important for making a Time Series stationary; it takes an extra parameter known as d, the order of differencing, which eliminates trend. As usual, statsmodels has a fantastic API for it, ARIMA. In this case, I have used css-mle, which uses the conditional sum of squares to find starting values and then exact maximum likelihood, computed via the Kalman filter, to fit the model.

# Fit the model
ts = pd.Series(df.AverageTemperature, index=df.index)
model = ARMA(ts, order=(3, 3))
results = model.fit(trend='nc', method='css-mle')
print(results.summary2())

# Plot the model
fig, ax = mp.subplots(figsize=(10, 8))
fig = results.plot_predict('01/01/2003', '12/01/2023', ax=ax)
ax.legend(loc='lower left')
mp.title('Weather Time Series prediction')
mp.show()

predictions = results.predict('01/01/2003', '12/01/2023')
# You can manipulate/print the predictions after this

10-year weather prediction from 10-year past data

The green line here is the 10-year historical weather data and the blue line is the predicted data for the next 10 years. The gray area is the region where the predicted line could peak or dip over time with 95% confidence, meaning that the line is very unlikely to go beyond the gray area. If you take a closer look, you will find that the blue line has been trying to fit itself to the historical data from the beginning.
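
If you want the numbers behind the gray band rather than the picture, the fitted results also expose a forecast method; a short sketch with an arbitrary horizon:

# Forecast 24 months beyond the end of the observed series; returns the point
# forecasts, their standard errors and the 95% confidence intervals
forecast, stderr, conf_int = results.forecast(steps=24)
print(forecast[:5])
print(conf_int[:5])   # lower/upper bounds of the gray band, month by month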

Conclusion

There you have it! In this post, I have shown how you can teach a computer program to follow data points, learn from them, and then predict the future. The source code is available here.