In [1]:
# jupyter nbconvert --execute --to html '.\Step-by-Step Guide to Time Series Analysis.ipynb'
from IPython.display import HTML

contents_css = """
.contents ol{
    text-align: center;
    list-style-position: inside;
}
.contents a {
    color: var(--ultraviolet) !important;
    font-size: 150% !important;
}
.contents a:hover {
    color: var(--electron) !important;
}
.contents li::marker {
    color: var(--molten);
    font-size:150% !important;
}
.contents li {
    margin-top: 15px;
    margin-bottom: 15px;
    text-align: left;
    margin-left: 30%;
}

.contents li p {display:none}

/*
.contents li:hover p {
    display:inline;
}

.contents > li > ol > li > ol{display:none}
.contents > li > ol > li:hover > ol{
    display:inline;
}*/

.contents li li {
    margin-left: 30px;
}

.contents li li {
    font-size: 90%;
}

.contents li li li {
    font-size: 80%;
}

/* dropdown arrow hack with checkbox */
.arrow {
  border: solid var(--molten);
  border-width: 0 3px 3px 0;
  display: inline-block;
  padding: 3px;
  margin: 0 0 3px 5px;
}

.left {
  transform: rotate(135deg);
  -webkit-transform: rotate(135deg);
}

.down {
  transform: rotate(45deg);
  -webkit-transform: rotate(45deg);
}

input:checked + label.arrow.left {
    transform: rotate(45deg);
  -webkit-transform: rotate(45deg);
}

input:checked + label + br + p {
    display: inline
}

.contents li ol {
    display:none
}

input:checked + label + br + p + ol {
    display:inline
}

input {
    display:inline
}

input[type="checkbox"] {
    display:none
}
"""

htmlstr = """
<html>
<head>
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=Rubik&display=swap" rel="stylesheet">
    <style>
        :root {
                --ultraviolet: #4d61f4;
                --molten: #ff7c66;
                --supernova: #ffe352;
                --white-dwarf: #f4f4f4;
                --electron: #64abff;
                --deep-space: #242456;
                --aurora: #59d8a1;
            }
        body {
            font-family: Rubik, sans-serif;
            }
        div {font-size: 103%; color: var(--deep-space) !important;}
        div.CodeMirror-code {font-family: monospace; color: var(--deep-space) !important;}
        .rendered_html table, div.output_area pre {color: var(--deep-space) !important;}
        h1 {font-family: "Times New Roman", serif; color:var(--ultraviolet); font-size: 300% !important;}
        h2 {font-family: "Times New Roman", serif; color:var(--ultraviolet); font-size: 200% !important;}
        h3 {font-family: "Times New Roman", serif; color:var(--electron); font-size: 150% !important;}
        h4 {font-family: "Times New Roman", serif; color:var(--molten); font-size: 125% !important;}
        .rendered_html h4 {
            margin-top: 0;
        }
        span#author {color:var(--molten) !important;}
        a#return_to_contents {
            font-family: "Times New Roman", serif;
            color:var(--molten);
            font-size: 120% !important;
        }
        /*img.resize {width:80%; height:80%;}
        img.small {width:40%; height:40%;}*/
        """ + contents_css + """
    </style>
</head>
<script>
    $('div#notebook-container > div.code_cell:nth-of-type(1)').hide();
</script>
</html>
"""
HTML(htmlstr)
Out[1]:

Time Series Analysis

A Step-by-Step Guide


By H Gulliver for Multiverse


This guide is meant to walk you through the steps you might follow when analysing time series data and forecasting future values. It starts with an overview of the key steps to take, then gives a more detailed run-through, with a running example and Python code. The overview of key steps provides hyperlinks to the relevant sections of the detailed run-through.

Time series analysis is a huge subject. This guide covers only the most common steps you should follow and a few of the most common forecasting models. It is certainly not a complete guide to the subject, but should help you get started with analysing time series data.

The idea is you should follow along with this guide while analysing a time series of your own. The guide will be a reference for what steps to take, and the code to use. No two time series are the same, and the steps needed to analyse a time series vary. Sometimes you might need to skip some of these steps, sometimes you might need extra steps. Again, this guide is a starting point, not comprehensive.

The guide is based around using Python to analyse your time series, and example code is shown, with outputs. The main steps will be the same in any language, but the code will of course be different. An excellent resource for time series concepts in general, and their implementation in R, is this site, which is referenced several times below.

Overview

Click the dropdown arrow by a heading to see a description and subheadings. Click the (sub)section heading to go to the detailed walkthrough of that section.

  1. Import your time series data

    Read your time series into Pandas, convert the time column to a datetime, and set it as the index.

  2. Clean your data

    Check for odd or missing values. Look at summary statistics and the head and tail and sense-check them.

    1. Fill in nulls

      Either "pad" missing values (copy the last known value forward), or fill them in with zeroes or another suitable value, depending on context.

    2. Resample to fill in missing dates

      If there could be entire timestamps missing, insert those and fill in their values as you did for nulls. You can also resample to a different level of granularity - such as looking annually instead of monthly.

  3. Exploratory data analysis

    Plot your time series and identify key features. Apply a transformation if needed (such as a logarithm) and plot the autocorrelation function to confirm key features.

    1. Plot your time series

      Plot the raw time series. Look for outliers, trend, seasonality, cyclic behaviour, and heteroskedasticity. Identify any sudden changes in the series' behaviour over time and try to identify what might have caused them.

    2. If necessary, transform your time series

      If your series is heteroskedastic (its variance changes over time) you will need to transform it and work with the transformed series. A logarithmic transformation is often effective. Graph the transformed series and look for features, as you did with the raw series.

    3. Plot the autocorrelation function for your time series

      Look at the correlation of your series with past values of itself (use the transformed series, if you have applied a transformation). Confirm your observations from plotting the series itself by looking for trend (which will show as a slow decrease towards 0 in the ACF) and seasonality (which will show as regular spikes at multiples of the seasonal period).

  4. Decompose your time series

    Study the trend and seasonal (if there is one!) components of your time series more closely to better understand the behaviour. You might compare one (or both) of these components to other time series to identify relationships that can explain the behaviour of your data (for example, the trend component of sales data might relate to unemployment rates).

    1. For a seasonal time series

      Use a seasonal decomposition function to split your (possibly transformed) time series into trend, seasonal, and residual components. Explain the behaviour of the trend and seasonal components to better understand your series - maybe compare them with other time series. Use the residuals to check the quality of the decomposition (the residuals should show no trend, seasonality, or heteroskedasticity).

    2. For a non-seasonal time series

      Use a rolling mean to extract the trend component from your time series and separate it from the residuals. Use the residuals to check the quality of the decomposition (the residuals should show no trend, seasonality, or heteroskedasticity). Study the trend component and explain its behaviour to better understand your time series - maybe compare it with another time series.

  5. Model your time series

    Choose an appropriate model for your time series. We will look at SARIMAX models here, but other options exist. Once you have chosen a model (or several candidates), fit it to a training subset of your data and test it on a testing subset. Once you have settled on a final model, retrain it on your whole dataset and produce forecasts. You might want to build convenience functions to let you automatically update the model and forecasts as new data come in.

    1. Select an appropriate model

      Choose one or more models to fit to your data. There are many types of model available, but we will focus on choosing SARIMAX models.

      1. Differencing and stationarity

        Check if your time series is stationary. If it isn't, use differencing (seasonal differencing at most once to remove seasonality and simple differencing to remove trends) to get a (roughly) stationary time series. Make a note of how many times you differenced, both seasonally and simply, as you will need these numbers to fit your model.

      2. Determining the number of AR terms

        Look at the ACF and PACF plots of your differenced time series to identify the number of AR terms. If the PACF drops suddenly to 0 after a small number of lags and/or the ACF is positive at lag 1, then use AR terms - the number of AR terms is the number of lags before the PACF drops to 0. For seasonal AR terms, use the same approach, but looking only at the lags which are multiples of the seasonal period. Make a note of the numbers of simple and seasonal AR terms - you will need these to fit your model.

      3. Determining the number of MA terms

        Look at the ACF plot of your differenced time series to identify the number of MA terms. If the ACF drops suddenly to 0 after a small number of lags and/or the ACF is negative at lag 1, then use MA terms - the number of MA terms is the number of lags before the ACF drops to 0. For seasonal MA terms, use the same approach, but looking only at the lags which are multiples of the seasonal period. Make a note of the numbers of simple and seasonal MA terms - you will need these to fit your model.

      4. Determining the order of your SARIMAX model

        Bring together the number of seasonal and simple AR and MA terms, the seasonal period, and the number of simple and seasonal differencings to find the orders of the SARIMAX model. This is just collecting the results of the previous steps.

      5. Exogenous regressors (other variables)

        If your time series is related to other time series which are easier to forecast accurately (especially ones which are known perfectly in advance, like whether a day will be a public holiday or not), then you can use these as exogenous regressors in your model. Make sure you have these as a column in your dataframe alongside your main (endogenous) time series.

    2. Train-test split

      Split your time series into training and testing sets. The training set should usually be the first 70-80% of the data, and the rest should be the test set. Do not split your data near a sudden change in the behaviour of the series. Split the actual time series, not the differenced version (but, if you've applied a transformation, split the transformed version, not the raw version).

    3. Build your model

      Determine the orders of your model. For a SARIMAX model, the order is (p, d, q), where p is the number of simple AR terms, d the number of times you simply differenced, and q the number of simple MA terms. The seasonal order is (P, D, Q, m), where P is the number of seasonal AR terms, D the number of times you seasonally differenced, Q the number of seasonal MA terms, and m the seasonal period. Create a SARIMAX object by specifying these orders, and fit it to your training data and any exogenous variables you are using.

    4. Evaluate your model

      Use your fitted model to forecast the time series for the test set, and compare with the actual test set values. Look at a graph of the forecasts and confidence intervals vs the true test values, compute an error metric, and look at the residuals to evaluate your model. If you had multiple candidate models, choose the best performing candidate to be your deployed model.

    5. Deploy your model

      Refit your model to the whole dataset and forecast. Undo any transformations you did to get the actual forecast values.

      1. Refit your model and forecast

        Refit your chosen model on the whole time series (both the training and test sets) and forecast from the refit model into the future. If you applied a transformation to your series in the EDA phase, apply the inverse transformation to your forecasts to get the actual forecast values.

      2. Automate the update process for when new data come in

        If you will need to keep updating your model and forecasts as new data come in, set up a script to automate this process. Build in an error metric checker to alert you if the fit of your model gets worse in future.

    6. Some warnings

      Do not try to forecast too far into the future. Confidence intervals are a guideline, not a guarantee. Expect the unexpected (just because your series has always behaved a certain way in the past doesn't mean it will always continue to do the same). Domain knowledge is crucial. Beware of exogenous regressors which aren't known perfectly in advance.

All the imports from libraries appear in the example code when they're first used, but they're also all collected here in one place for your convenience. Depending on your use case, you might not need all of these, and you might want other imports as well.

In [2]:
import pandas as pd
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
import datetime

Import your time series data

Load your time series data into pandas and view it:

In [3]:
import pandas as pd

airline = pd.read_csv('./Data/airline.csv')
airline.head()
Out[3]:
Month International airline passengers in thousands
0 1949-01 112.0
1 1949-02 118.0
2 1949-03 132.0
3 1949-04 129.0
4 1949-05 121.0

Convert the time column to datetime and set as the index:

In [4]:
airline.Month = pd.to_datetime(airline.Month, format='%Y-%m')
airline.set_index('Month', inplace=True)
airline.head()
Out[4]:
International airline passengers in thousands
Month
1949-01-01 112.0
1949-02-01 118.0
1949-03-01 132.0
1949-04-01 129.0
1949-05-01 121.0

Note: the format keyword argument tells Pandas how the dates are formatted; "%Y" means "4-digit year" and "%m" means "2-digit month", so '%Y-%m' means "a 4-digit year, followed by a hyphen, followed by a 2-digit month." If the dates were formatted like "01/1949", say, we'd use "format='%m/%Y'". For a full list of the % codes, see https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

You can leave out the format argument, and Pandas will try to guess. If the date format is something sensible (like year-month-day), this is fine, but it might get confused with some formats. Best practice is to manually inspect the data to see what format is used, and specify it with the format argument.
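For example, if the dates in your file were written as month/year, you could parse them like this (a made-up example, just to show the format codes in action):

In [ ]:
# hypothetical dates written as "month/year" need format='%m/%Y'
pd.to_datetime(pd.Series(['01/1949', '02/1949', '03/1949']), format='%m/%Y')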

Return to Overview

Clean your data

As you would for any dataset, look at the head and tail of your data and use the .info() and .describe() methods to identify anything odd.

In [5]:
airline.head()
Out[5]:
International airline passengers in thousands
Month
1949-01-01 112.0
1949-02-01 118.0
1949-03-01 132.0
1949-04-01 129.0
1949-05-01 121.0
In [6]:
airline.tail()
Out[6]:
International airline passengers in thousands
Month
1960-08-01 606.0
1960-09-01 508.0
1960-10-01 461.0
1960-11-01 390.0
1960-12-01 432.0
In [7]:
airline.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 143 entries, 1949-01-01 to 1960-12-01
Data columns (total 1 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   International airline passengers in thousands  142 non-null    float64
dtypes: float64(1)
memory usage: 2.2 KB
In [8]:
airline.describe()
Out[8]:
International airline passengers in thousands
count 142.000000
mean 281.394366
std 120.192632
min 104.000000
25% 180.250000
50% 265.500000
75% 361.500000
max 622.000000

Note from the .info() method that there are 143 entries, but 142 non-null values - so we have a missing value somewhere in our data.

The min and max are quite far apart, which might be suspicious. Looking at the head and tail, the numbers seem smaller at the start, and bigger at the end, suggesting the data have increased over the time measured. The data do span a 12 year period, so total airline passengers could have increased a lot in that time. So the min and max values don't seem totally unreasonable. You should do this sort of sense-checking of your data to make sure the summary statistics from .describe() make sense.

Return to Overview

Fill in nulls

A common method to fill in nulls in time series data is "padding" - the last non-null value is repeated where the null is. This means that at every timestamp you have the most recent measured value.

In [9]:
airline.fillna(method='pad', inplace=True)

Another option is to fill in the missing values with 0. This can be appropriate if, for instance, the data are sales data and the missing values are on bank holidays - the data are probably missing because no sales took place on those days! You can do this by simply putting "0" in place of "method='pad'" in the above code.
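For example (shown commented out, since we have already padded the nulls above):

In [ ]:
# alternative to padding: fill missing values with 0
# airline.fillna(0, inplace=True)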

Return to Overview

Resample to fill in missing dates

Filling in nulls only fills in values for dates that appear in the data but don't have values recorded. If a date isn't even included at all, .fillna() won't add it in. Some time series tools expect all times to be at regular intervals - so with monthly data, for instance, there shouldn't be any missing months. So we also need to "resample" the data to make sure that all dates within the range are included. This will create null values, so again we pad them or put in 0.

In [10]:
airline = airline.resample("M").pad()
In [11]:
airline.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 144 entries, 1949-01-31 to 1960-12-31
Freq: M
Data columns (total 1 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   International airline passengers in thousands  144 non-null    float64
dtypes: float64(1)
memory usage: 2.2 KB

Note that there are now 144 entries - before there were 143. So there was a missing date in the original data! Resampling is a good idea as a "just-in-case" step to make sure you don't have any missing dates.

You can also resample to a bigger stretch of time. For example, if you have daily data, but that's too granular and you just want monthly totals, you can resample to monthly with:

In [12]:
# data.resample("M").sum()
# Instead of .sum(), you can use .mean(), etc. Instead of "M", you can use "Y" for yearly resampling, etc.

Exploratory Data Analysis

Plot your time series

In [13]:
airline.plot();

Look out for trends, seasonality, and cyclic behaviour (the difference between seasonality and cyclic behaviour is that seasonality happens at fixed intervals, whereas cycles happen at irregular intervals). Also look out for heteroskedasticity - when the variance of the series changes over time.

Here, we have a seasonal pattern and a steady upwards trend. The seasonal pattern seems to repeat every year (between 1949 and 1959, I count 10 repetitions), which makes sense (things like Christmas and summer holidays will affect flight numbers in regular ways each year). Since the data are monthly, yearly seasonality means a period of 12 (i.e., the seasonal pattern repeats every 12 data points). The data are heteroskedastic, as the size of the variations in the data get bigger as time goes on, instead of staying roughly constant.

For comparison, here are some other time series plots and descriptions of their features.

Top left: upwards trend, levelling off at the end. Some seasonal behaviour. The series is homoskedastic (the variance does not change over time).

Top right: slight upwards trend, an extreme outlier, and a smaller outlier near the end. No seasonality. Heteroskedasticity is hard to assess - there may be a little, but that could be exaggerated by the outliers.

Bottom left: no particular trend. Some cyclical behaviour - occasional high peaks, but because these are at irregular intervals, this is not seasonality. Homoskedastic.

Bottom right: strong upwards trend, then reverses into a downward trend. No seasonality. Homoskedastic.

If necessary, transform your time series

The decomposition and modelling methods we will use later require a homoskedastic time series. If your time series is strongly heteroskedastic, you will need to transform it (you can then reverse the transformation after decomposition and modelling).

The most common cause of heteroskedasticity is when the variation in a time series is proportional instead of absolute. In other words, when fluctuations are as a percentage of the value of the time series, not a fixed number. You can see this in the airline data:

In [14]:
airline.plot();

In the first season, the lowest point is about 100 and the highest about 150, so the highest is roughly 50% above the lowest. In the last season, the lowest point is about 400 and the highest about 600 - again, the highest is roughly 50% above the lowest. The pattern is the same in all the other seasons. So every season, the difference between the highest and lowest point is about 50% of the lowest value, rather than a fixed number.

When you have this sort of heteroskedasticity, you can remove it by using the logarithm function available in the NumPy library:

In [15]:
import numpy as np
In [16]:
airline['log'] = np.log(airline['International airline passengers in thousands'])
airline['log'].plot();

We see that the log number of airline passengers is homoskedastic - the fluctuations are all (roughly) the same size. We can then work with this log time series and safely use tools and techniques that require homoskedasticity. We can convert back from the log series to the original series with the exponential function, np.exp(). So, for example, when we forecast, we will forecast the log series, then take np.exp() of the forecast to get the forecast for the original series.
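As a quick sanity check, exponentiating the log series should recover the original values:

In [ ]:
# np.exp() undoes np.log(), so this should print True
print(np.allclose(np.exp(airline['log']), airline['International airline passengers in thousands']))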

Sometimes neither the original series nor the log series is homoskedastic. In this case, there are other transformations you can use, called Box-Cox transformations, which can deal with this. They are beyond the scope of this guide, but you can read about them and explore their effects with an interactive tool here. For their implementation in Python, see here.
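For reference, here is a minimal sketch of a Box-Cox transformation using SciPy (not used elsewhere in this guide; scipy.stats.boxcox chooses the transformation parameter automatically):

In [ ]:
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Box-Cox requires strictly positive data; the passenger counts qualify
transformed, lam = boxcox(airline['International airline passengers in thousands'])
# inv_boxcox(transformed, lam) undoes the transformation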

Return to Overview

Plot the autocorrelation function for your time series

In [17]:
from statsmodels.graphics.tsaplots import plot_acf
In [18]:
plot_acf(airline['log'], lags=36);

The ACF (AutoCorrelation Function) has two key uses at this stage: checking for trend and seasonality. It can also help determine what the seasonal period is (the number of observations before the seasonal pattern repeats) as this can sometimes be hard to tell exactly from the time plot. You should already have an idea of the trend and seasonality of your time series from plotting the series itself, so this step should confirm what you know.

Trend shows up in an ACF plot by a slow decrease. Seasonality shows up by peaks at multiples of the seasonal period.

Here, there is a slow overall decrease in the ACF, showing the trend of the data, and the ACF peaks at 12, 24, and 36, showing the seasonality with period 12. This confirms what we saw from the plot of the time series itself.

Below is the ACF plot of a series with no trend or seasonality (in fact, it is the bottom left time series from the Plot your time series section). Note that the rapid drop in the ACF shows the lack of trend, and the lack of regular peaks shows the lack of seasonality.

Decompose your time series

Time series decomposition lets you split your time series into trend, seasonal, and residual components to study each of these separately. Looking at each component in isolation can help you understand that component more clearly. For time series without seasonality, of course there is no seasonal component, so you can decompose to just trend and residuals.

In both cases, the residuals are simply what is left over once the other components are removed. They are the "noise" in the series that cannot be explained as part of seasonal patterns or a long-term trend.

Jump to seasonal decomposition

Jump to non-seasonal decomposition

For a seasonal time series

To take the decomposition of a seasonal time series, you need to know the seasonal period. You should know this from the time plot and ACF plot.

If you have applied a transformation to your time series, you should decompose the transformed series, not the original series.

In [19]:
from statsmodels.tsa.seasonal import seasonal_decompose
In [20]:
# need to include the period. Notice we use the transformed time series (the log series)
decomp = seasonal_decompose(airline['log'], period=12)
decomp.plot();

Once you have your decomposition, you should look at the residuals to check how good the decomposition is:

In [21]:
decomp.resid.plot();

The residuals plot is very important for understanding how good your decomposition is. It should look like random noise. If there is a trend in the residuals, it means the trend component of your decomposition is not accurately capturing all of the trend information and you can't rely on it. If there is seasonality in the residuals, then the seasonal component of your decomposition is not reliable. Here, the residuals have neither trend nor seasonality, suggesting the decomposition is reasonably good.

The residuals should also be homoskedastic. If they aren't, you might need to transform your data. Here, there is some slight heteroskedasticity - the variance is lower in the middle than at the ends. This is not ideal, but not a complete disaster - you never expect perfection in time series analysis. Compare with the residuals plot of the untransformed data:

In [22]:
seasonal_decompose(airline['International airline passengers in thousands'], period=12).resid.plot();

The heteroskedasticity is much worse here, and there even seems to be seasonality left in the residuals. So the decomposition of the log-transformed data is much more reliable than the decomposition of the raw data would be, and we should stick with the transformed version.

Once we have used the residuals to check our decomposition is good, we can access individual components of the decomposition to study them more clearly:

In [23]:
# plot the trend component
decomp.trend.plot();

There is an overall upwards trend, as we could see from the original series, but what is more obvious from this plot is that it levels off slightly at two points. We could then investigate this further by looking into what happened at those times - were there any world events in 1953 and 1957 that would have prevented the number of airline passengers from growing? This is where domain knowledge is crucial.

If we are using a transformed series, as here, we might want to undo the transformation on the component before plotting it. The general shape is the same, but the y-axis is then in meaningful units (thousands of passengers rather than log-passengers).

In [24]:
import matplotlib.pyplot as plt
plt.plot(np.exp(decomp.trend)); # use np.exp() to undo the np.log() transformation

We should also try to understand what drives the overall trend. Why is the number of airline passengers increasing, and why at this rate? One reasonable guess is that population growth might be responsible, so we could try comparing this with data on world population. Generally, once you have decomposed your time series, it is a good idea to look at relationships with other time series that might be driving the behaviour. Correlation is the main thing to check when comparing two time series to find a relationship - along with, of course, plotting them on the same graph.

In [25]:
# read in and tidy up world population data
world_pop = pd.read_csv('./Data/world-population.csv')
world_pop.Year = pd.to_datetime(world_pop.Year, format='%Y')
world_pop.set_index('Year', inplace=True)
world_pop.sort_values('Year', inplace=True)
world_pop = world_pop.resample('Y').sum()
world_pop = world_pop.astype('float64')
In [26]:
# plot against airline trend data. Note: world_pop is only available from 1951, so the start of the airline data graph is earlier
# also note we use the detransformed trend data
fig, ax1 = plt.subplots()
ax1.plot(np.exp(decomp.trend)) # plot detransformed trend data
ax2 = ax1.twinx() # the two series are on different scales, so to plot them on the same graph we need two y-axes sharing the same x-axis; the twinx() method allows this
ax2.plot(world_pop, 'r');

We see here that the trend in the airline data closely follows the trend in world population, suggesting that population growth is the main driver of increasing airline numbers. For extra confirmation, we calculate the correlation. The world population data is yearly, not monthly, and covers a shorter time period, so we need to resample the airline trend to yearly and slice the relevant rows:

In [27]:
from scipy.stats import pearsonr
resampled_trend = np.exp(decomp.trend.resample('Y').mean())
pearsonr(world_pop.Population, resampled_trend.loc['1951':])
Out[27]:
(0.9971727710677084, 2.785786178487446e-10)

We have a correlation of 0.997 with a p-value of less than 1 in a billion, so we can very confidently conclude that airline passenger numbers tracked world population very closely and it seems likely that world population was the main driver of the trend.

We now turn to the seasonal component.

In [28]:
# plot the seasonal component
decomp.seasonal.plot();

We can now more clearly see the shape of the seasonal pattern. Every time it rises, it does it in 3 stages (up a lot, down a bit, up a lot, down a bit, up a lot), but every time it falls it does it all in one go. We could look deeper into the data to understand why this is. Again, domain knowledge is crucial here.

To get a clearer idea of the seasonal pattern, we can plot a single season:

In [29]:
decomp.seasonal.iloc[:12].plot();

We can now see that the small drops during the rise are in February, and again in April and May. The February drop is probably due to calendar effects - February has 28 days (most years), while January and March have 31 each, so we should expect a lower value in February. The February drop is a common feature of monthly time series.

The April/May drop is less clear. More investigation of possible causes would be needed.

For a non-seasonal time series

In [30]:
# using different example data for this, so need to load it:
euretail = pd.read_csv('./Data/euro_cleaned.csv')
euretail.Date = pd.to_datetime(euretail.Date)
euretail.set_index('Date', inplace=True)
euretail.plot();

For a non-seasonal time series, you can estimate the trend with a centred rolling average, then subtract the trend to get the residuals:

In [31]:
# use an odd number for the window size. Larger numbers will give smoother trends, smaller numbers will give more jagged trends. Try a few different values
euretail['trend'] = euretail.value.rolling(window=7, center=True).mean() # rolling mean for trend
euretail['resid'] = euretail['value'] - euretail['trend'] # residuals = value minus trend
In [32]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(6,12))
euretail.value.plot(ax=ax[0]);
euretail.trend.plot(ax=ax[1]);
euretail.resid.plot(ax=ax[2]);

The residuals plot is very important for understanding how good your decomposition is. It should look like random noise. If there is a trend in the residuals, it means the trend component of your decomposition is not accurately capturing all of the trend information and you can't rely on it. If there is seasonality in the residuals, then there may have been seasonality in your original time series, and you should recheck this and consider trying a seasonal decomposition. Here, the residuals have neither trend nor seasonality, suggesting the decomposition is reasonably good.

The residuals should also be homoskedastic, as they are here. If they aren't, you might need to transform your data - see here.

Now that we have a decomposition, we would examine the trend component and try to understand the factors driving it, like we did with the seasonal decomposition by comparing the airline trend to world population data.

Model your time series

Modelling your time series allows you to forecast future values. There are many different modelling methods. We will focus here on a very versatile and popular family of methods called SARIMAX: Seasonal AutoRegressive Integrated Moving Averages with eXogenous regressors. Don't worry, it's not as horrible as it sounds!

SARIMAX is actually a combination of several related modelling methods, so the first step is to decide which particular SARIMAX model to use. Sometimes it is unclear which SARIMAX model will be best. In these cases, you can train 2 or 3 different models, and then compare their performance to decide which to use.

Select an appropriate model

For a list of rules for choosing SARIMA models (without the X - we discuss that here), see here. For a more detailed walkthrough with an example, read on.

Differencing and stationarity

A time series is stationary if it shows broadly the same behaviour at all times. In particular, a stationary time series has no trend or seasonality and is homoskedastic. Cyclic behaviour can occur in a stationary time series - because it is unpredictable, it doesn't change the expected value (mean) or variance of the series. For some examples of stationary and non-stationary time series, see here.

SARIMAX is built on a simpler class of models called ARMA, which only work for stationary time series. So the first step in SARIMAX is to convert your time series to a stationary one if it isn't already. You do this with differencing: taking the difference from one observation to the next. The opposite of differencing (which has to be done at the end to get back to the original series) is called integrating, and this is what the I in SARIMAX stands for.

If your time series is seasonal, use seasonal differencing first to remove the seasonality. In rare cases, your time series might start with seasonality, but this fades over time and you expect that in the future the seasonality will die out entirely. In this case, do not use seasonal differencing. But if the seasonality remains strong and you expect it to continue into the future, you must use seasonal differencing.

Differencing cannot remove heteroskedasticity. If your time series is heteroskedastic, you will need to transform it before modelling it with SARIMAX.

In our airline example, there is strong, consistent seasonality, so we use seasonal differencing, with period 12.

In [33]:
# the airline time series is seasonal with period 12, so use seasonal differencing:
seasonally_differenced = airline['log'].diff(12) # 12 for seasonal period
seasonally_differenced.plot();

Once you have a non-seasonal time series - either because it was non-seasonal to begin with, or because you seasonally differenced it - check for stationarity. Do this by looking at the time plot (above) and looking for trend or heteroskedasticity. It can also help to look at the ACF plot - for a stationary series, it should drop sharply towards 0 after a small number of lags. Here, there are several sections of the series that are much lower than the rest, so this series does not seem to be stationary. We plot the ACF to check.

Note: when plotting the ACF of a differenced series, you will need to drop null values - the first few values of a differenced series will be null, because there were no earlier values to difference them against. If you get a warning about "converting masked element to nan" when plotting ACF, it usually means you need to trim missing values. Use the .dropna() method for this.

In [34]:
plot_acf(seasonally_differenced.dropna());

The ACF decays fairly slowly towards 0, without a sudden drop, suggesting the time series still has some trend and is not stationary. It certainly drops towards 0 much faster than the original series did though.

If your time series is non-seasonal, but also non-stationary, use simple differencing. If the seasonally differenced time series is non-stationary, use simple differencing as well (that is, apply simple differencing to the seasonally differenced series).

Here, we have seasonally differenced, and it is still non-stationary, so we will use simple differencing on the seasonally differenced series:

In [35]:
# simple differencing too:
twice_differenced = seasonally_differenced.diff() # .diff() for simple differencing - or can put .diff(1), it's the same
twice_differenced.plot();
In [36]:
plot_acf(twice_differenced.dropna());

Now there is no real pattern in our twice differenced series, and the ACF drops down to near 0 suddenly. This series is now (close to) stationary. There is still some heteroskedasticity (from about 1955 to 1959, the variance is lower than everywhere else), but you never get a perfectly stationary time series.

If, after simple differencing, your time series is still non-stationary, you can do a second simple differencing. Only do this if you have not seasonally differenced! You should not difference more than twice in total (either twice simply, or once seasonally and once simply). If you can get away with only differencing once (or not at all!) that is usually better.

If you aren't sure how many times you should difference, you can try both - train a model with one level of differencing, and another with two, for instance, and then compare their performance at the end to decide which to use.

You should make a note of how many times you have seasonally differenced (0 or 1) and how many times you have simply differenced (0, 1, or 2). You will need these numbers when setting up the SARIMAX model. Here: simple differencing: 1, seasonal differencing: 1.

Beware of overdifferencing. It's worth repeating: do not difference more than twice in total, and even stick to once if you can. If after differencing once, the time series is not very badly non-stationary, consider trying a model with that as well as the twice-differenced version, then comparing them. We're in that case here: the seasonally differenced time series is non-stationary, but not too badly. So as well as fitting a model with both seasonal and simple differencing, we'll fit a model to just the seasonally-differenced data, and then compare the two models at the end to decide which is best.

Determining the number of AR terms

Once you have a stationary time series - either because it was stationary in the first place, or because you have differenced it enough - the next step in choosing a SARIMAX model is determining the number of "AR" terms. There can be both simple AR terms and seasonal AR terms. To find these, you use the ACF and PACF (Partial AutoCorrelation Function). You will need to plot them for several multiples of the seasonal period, but you should not plot them for more than about a quarter of the length of your time series, as the calculations can become inaccurate.

Here, our time series has 144 points, but we lose 13 through seasonal differencing followed by simple differencing, leaving 131 points. 131/4 = 32.75, and the seasonal period is 12, so plotting the ACF and PACF for 36 lags is a good compromise, giving us 3 full seasons and not being much more than a quarter of the length of the time series.
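If you want to check this arithmetic in code:

In [ ]:
# points remaining after seasonal + simple differencing, and roughly a quarter of that
n_usable = len(twice_differenced.dropna())
print(n_usable, n_usable // 4)  # 131 32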

In [37]:
from statsmodels.graphics.tsaplots import plot_pacf # we've already imported plot_acf, so don't need to import it again, but we also need plot_pacf
plot_acf(twice_differenced.dropna(), lags=36);
In [38]:
plot_pacf(twice_differenced.dropna(), lags=36);

For the simple AR terms, we first look at whether the ACF is positive or negative at lag 1. Here, ACF(1) is negative (about -0.4). This generally suggests you should not use any simple AR terms - if ACF(1) were positive, we would consider using simple AR terms. If you are going to use simple AR terms, you look at the PACF plot and see where it drops suddenly towards 0. Here, PACF(1) is quite far from 0, and PACF(2) is very close to 0, so if we were to use simple AR terms, we would use 1 AR term. But the next value, PACF(3), is away from 0 again, so this doesn't really fit an AR model, especially with the negative ACF(1). So we will not use any simple AR terms here.

Here: simple AR-terms: 0.

The number of simple AR terms to use (if any) is the number of lags before the PACF drops to 0. To illustrate, if PACF(1), PACF(2), and PACF(3) were significantly non-zero (outside the blue band) and PACF(4) were close to 0, we would use 3 AR terms. The number of AR terms used is the number of non-zero PACF spikes (ignoring the one at 0, which is always 1) before it drops to 0.

The process for the seasonal AR terms is similar, but we look at multiples of the seasonal period. Note: if the original time series was not seasonal, do not include any seasonal AR terms!

Here, the seasonal period is 12, so we look at ACF(12) and see if it is positive or negative. Here, ACF(12) is negative (about -0.4), suggesting we should not use any seasonal AR terms. This is usually the case when seasonal differencing has been used.

If we were going to use seasonal AR terms, we would find how many from the PACF plot, by looking at how many multiples of the seasonal period there are before it goes to near 0. Here, PACF(12) is non-zero, PACF(24) is close to zero, so if we used seasonal AR terms, we would use 1.

Here: seasonal AR-terms: 0. Note this down with the differencing: seasonal differencing: 1, simple differencing: 1, seasonal AR: 0, simple AR: 0.

Because the seasonally differenced series was not badly non-stationary, we are also going to try a model with no simple differencing. So we need to find the number of AR terms for that model, from the ACF and PACF plots for the seasonally differenced data:

In [39]:
plot_acf(seasonally_differenced.dropna(), lags=36);
In [40]:
plot_pacf(seasonally_differenced.dropna(), lags=36);

The ACF is positive at lag 1 and the PACF is outside the blue band at lags 1 and 2, then drops to 0. This suggests 2 simple AR terms.

Looking seasonally, the ACF is (slightly) negative at lag 12 and the PACF is not significantly different from 0 at lag 12, suggesting no seasonal AR terms.

So for the seasonally differenced series, we will use simple AR: 2, seasonal AR: 0.

Determining the number of MA terms

Once you have a stationary series and have decided how many AR terms to use, you do a similar process to determine the MA terms. Like with AR terms, there can be both simple and seasonal MA terms. Again, you use the ACF, but you do not need the PACF for MA terms. We could just refer back to the ACF plot above, but for convenience it is plotted again below.

In [41]:
plot_acf(twice_differenced.dropna(), lags=36);

For the simple MA terms, look at ACF(1) (about -0.4 here). If it is negative, you might want to include some simple MA terms; if it is positive, you probably won't. Since it is negative here, we should use simple MA terms. We look at when the ACF drops to 0 to decide how many simple MA terms to use. Here, ACF(1) is non-zero, but ACF(2) is close to 0 (inside the blue band), so we use 1 simple MA term. The number of simple MA terms to use (if any) is the number of lags before the ACF drops to 0.

For seasonal MA terms, we do the same, but looking at multiples of the seasonal period. ACF(12) is negative, suggesting we should use seasonal MA terms (this is usually the case when seasonal differencing has been used). ACF(24) is very close to 0, suggesting we should use 1 seasonal MA term. The number of seasonal MA terms to use is the number of multiples of the seasonal period before the ACF drops to 0.

Here: simple MA-terms: 1, seasonal MA-terms: 1.

We also need to do this for the series with only seasonal differencing:

In [42]:
plot_acf(seasonally_differenced.dropna(), lags=36);

The ACF is positive at lag 1, suggesting no simple MA terms. It is also not clear how many we would use if we did include any - there's no sudden drop to 0.

Looking seasonally, the ACF is (slightly) negative at lag 12, but still inside the blue band. It then drops even closer to 0 at lag 24. So we could maybe justify trying a single seasonal MA term, but it's not clear we should. Again, we can try fitting two models (bringing us up to 3 in total now!) and see at the end which is best.

It should be clear by now that fitting a SARIMAX model is not an exact science, and a degree of judgement and trial-and-error is needed!

To summarise: for our seasonally differenced series, we will not include simple MA terms, and will include either 0 or 1 seasonal MA terms.

Determining the order of the SARIMAX model

Now we bring together what we've done. We know how many orders of simple and seasonal differencing we've used (here, 1 of each), how many simple and seasonal AR terms to use (here, none of either), and how many simple and seasonal MA terms to use (here, 1 of each). These numbers define the order and seasonal order of the SARIMAX model. The order is (p,d,q) where p is the number of simple AR terms, d is the number of simple differencings, and q is the number of simple MA terms. The seasonal order is (P,D,Q,m), where P is the number of seasonal AR terms, D is the number of seasonal differencings, Q is the number of seasonal MA terms, and m is the seasonal period.

In our example, we have identified 3 possible models to try. For the first (using the twice-differenced data) the order is (0,1,1) and the seasonal order is (0,1,1,12). For the models using seasonal differencing only, the order is (2,0,0), and the seasonal order is either (0,1,0,12) or (0,1,1,12). We will use these different possible orders and seasonal orders to build our SARIMAX models. We can describe a model as SARIMAX(p,d,q)(P,D,Q,m). For non-seasonal models, the seasonal order will be (0,0,0,0), so we often drop the S and the seasonal order and just write ARIMAX(p,d,q).

So our 3 models to try are SARIMAX(0,1,1)(0,1,1,12), SARIMAX(2,0,0)(0,1,0,12), and SARIMAX(2,0,0)(0,1,1,12).
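One way to keep track of the candidates is to note them down as (order, seasonal_order) pairs, matching the arguments we will pass to SARIMAX later:

In [ ]:
# the three candidate models identified above, as (order, seasonal_order) pairs
candidate_orders = [
    ((0, 1, 1), (0, 1, 1, 12)),   # twice-differenced model
    ((2, 0, 0), (0, 1, 0, 12)),   # seasonal differencing only, no MA terms
    ((2, 0, 0), (0, 1, 1, 12)),   # seasonal differencing only, one seasonal MA term
]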

When comparing models, note that smaller values for the order and seasonal order are usually better. For instance, if you are choosing between an ARIMAX(2,1,0) model and an ARIMAX(1,1,0) model, you would try both, and if the ARIMAX(2,1,0) model performed clearly better, you would use that, but if there was little difference, you would default to the simpler ARIMAX(1,1,0). If in doubt, the simpler model is better!

In particular, you should never use a model with large numbers of both AR and MA terms - even more than one of each is risky. These models are likely to overfit the data. If you do try a model with 2 or more AR and MA terms (or even 2 of one and 1 of the other), also try a model with one fewer AR term, and a model with one fewer MA term, and compare them. This applies to both simple AR/MA terms and seasonal AR/MA terms.

Exogenous regressors (other variables)

The X in SARIMAX is for "eXogenous regressors", which is a fancy way of saying "other variables". If you have another time series, with the same index (so recorded at the same times/dates), which helps explain the behaviour of your model, you can include this.

Note: in order to forecast from a model with exogenous regressors, you will need to forecast the exogenous regressors first. So if the time series you want to forecast is related to another time series you can more easily forecast, you can use that second series as an exogenous regressor to the first. But if forecasting the second time series is just as hard as forecasting the first, it won't help you.

Using exogenous regressors is entirely optional, but if there is a variable whose behaviour you can predict very accurately and which influences your time series, it is often a good idea to include it as an exogenous regressor, as it will tend to improve your forecasts.

Sometimes, you have a time series where you know exactly what it will do in advance. For instance, the dates of public holidays are known well in advance. So a time series which is 0 when it's not a public holiday and 1 when it is can be perfectly forecast, and so makes an excellent exogenous regressor for something like supermarket sales, which will be affected by public holidays.
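For instance, a public-holiday regressor for daily data might look something like this (a hypothetical sketch - the dates and index here are made up purely for illustration):

In [ ]:
# hypothetical example: a 0/1 "is this a public holiday?" series for daily data
holiday_dates = pd.to_datetime(['2023-12-25', '2023-12-26', '2024-01-01'])
daily_index = pd.date_range('2023-12-01', '2024-01-31', freq='D')
is_holiday = pd.Series(daily_index.isin(holiday_dates).astype(int), index=daily_index)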

In our airline example, we have already noticed from studying the seasonal decomposition that the length of each month seems to have an effect, with a dip in February. So we can form a time series of "number of days per month", which can be forecast with no uncertainty at all, and use that as an exogenous regressor. It is probably not a useful thing to do in this case, since the seasonal pattern (including the February dip) will be modelled anyway, but we will do it to show the use of exogenous regressors.

We could probably look up a list of days per month, but it isn't hard to code directly. We will add a column 'days_in_month' to our dataframe recording the values of the exogenous regressor.

In [43]:
def days_in_month(date):
    is_leap_year = (date.year % 4 == 0) # divisible by 4 is enough for 1949-1960 (the full rule also excludes most century years)
    long_months = [1, 3, 5, 7, 8, 10, 12] # numbers of months with 31 days
    month = date.month
    if month in long_months:
        return 31
    elif month == 2:
        return 29 if is_leap_year else 28
    else:
        return 30

airline['days_in_month'] = airline.index.map(days_in_month)
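Incidentally, pandas can compute the same values directly from the DatetimeIndex, which avoids hand-coding the leap-year rule; either approach gives the same column:

In [ ]:
# equivalent, using pandas' built-in attribute on the DatetimeIndex
airline['days_in_month'] = airline.index.days_in_month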

Train-test split

Before building your model, you should train-test split your data. Then you will train on the training data, forecast over the period of the test data, and compare the forecast to what actually happened to evaluate your model. This is how you will pick between competing models.

Note: you only use the differenced version of your time series for choosing the MA and AR terms. Once you know the order and seasonal order of your SARIMAX model, go back to using the original series - when you fit the SARIMAX model, it will take account of the differencing order you tell it.

Typically 70-80% of your data is a good size for a training set. You should take the first part of your data set as training, and the last part for testing, not a random split like you would for a regression or classification model.

Here, we have 144 data points, so 70-80% is 100-115. 108 is a multiple of 12, and in the middle of this range, so we'll use that (there's no need to use a multiple of the seasonal period for the train-test split, but I like to).

In [44]:
train_size = 108

train = airline['log'].iloc[:train_size]
test = airline['log'].iloc[train_size:]

# we also need to train-test split any exogenous regressors we are using:
exog_train = airline['days_in_month'].iloc[:train_size]
exog_test = airline['days_in_month'].iloc[train_size:]
In [45]:
# visualise the train-test split to check it seems reasonable
fig, ax = plt.subplots(1, 1)
train.plot(ax=ax)
test.plot(ax=ax)
plt.show();

Warning: You should never do a train-test split near a sudden large change in your data. Even if it means moving away from a 70-80% split, make the split away from any sudden changes in behaviour. If there is a sudden change in behaviour, make sure you understand why it happened before trying to forecast - if your series shows sudden large changes and you don't understand why, you're unlikely to be able to trust your forecasts. After all, if it's changed suddenly and dramatically before, and you don't understand why, how do you know whether or not it'll happen again?

For instance, I would never try to forecast the time series below without understanding what caused the huge spike and the smaller one near the end, and if I did, I would make sure the train-test split was not right next to a spike.

Build your model

Import the SARIMAX model from the statsmodels library and fit it to your training data. Remember to specify the order and seasonal_order of your model, as you found above.

If you have multiple candidates for the order or seasonal order, create and fit multiple models - and make sure you give them clear names! Here, we have 3 models to test: one with 2 orders of differencing, one with seasonal differencing only and a seasonal MA term, and one with seasonal differencing only and no MA terms.

In [46]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
twice_differenced_model = SARIMAX(train, order=(0,1,1), seasonal_order=(0,1,1,12), exog=exog_train).fit()
seasonal_ma_model = SARIMAX(train, order=(2,0,0), seasonal_order=(0,1,1,12), exog=exog_train).fit()
non_ma_model = SARIMAX(train, order=(2,0,0), seasonal_order=(0,1,0,12), exog=exog_train).fit()

Evaluate your model

Use your model to forecast over the period of the test set, and plot that with the actual test data, to see how well your model performs.

You can also choose a confidence interval for your forecasts, to plot that as well. If doing this for multiple models, do it on separate plots to avoid overcrowding the plot. Here, we will choose a 90% confidence interval, which we do by setting alpha=0.1 (to convert an alpha value into a confidence level, multiply alpha by 100 and subtract from 100. To convert a confidence level to an alpha value, subtract the confidence level from 100 and divide by 100).
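In code, that conversion is simply:

In [ ]:
confidence_level = 90                      # per cent
alpha = (100 - confidence_level) / 100     # 0.1, as used below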

If using exogenous regressors, as here, the test values of those need to be input into the forecast function.

In [47]:
# find size of test set, to know how many points to forecast
test_size = airline.shape[0] - train_size
# get the forecasts
twice_differenced_forecast = twice_differenced_model.get_forecast(test_size, exog=exog_test).summary_frame(alpha=0.1) # alpha = 0.1 for 90% confidence level
seasonal_ma_forecast = seasonal_ma_model.get_forecast(test_size, exog=exog_test).summary_frame(alpha=0.1)
non_ma_forecast = non_ma_model.get_forecast(test_size, exog=exog_test).summary_frame(alpha=0.1)
In [48]:
# for the twice_differenced model
fig, ax = plt.subplots(figsize=(15, 5))

# plot the original data
ax.plot(airline.index, airline['log'])

# plot the forecasts
ax.plot(airline.index[-test_size:], twice_differenced_forecast['mean'])
# plot the confidence intervals on the forecasts
ax.fill_between(airline.index[-test_size:], twice_differenced_forecast['mean_ci_lower'], twice_differenced_forecast['mean_ci_upper'], color='k', alpha=0.1);

We see that the forecast follows the shape of the test data quite well, though it does consistently overestimate it. The grey confidence intervals include the test values. Note also how the confidence intervals get broader as the forecast moves further away from the end of the training set. This is normal - it is generally easier to predict the near future accurately than the distant future!

We should also plot the residuals of the forecast - the differences between the forecast values and the true test-set values. As with the residuals in the decomposition section, residuals should be random noise, with no apparent trend, seasonality, or heteroskedasticity. They should also be centred on 0.

In [49]:
twice_differenced_resid = twice_differenced_forecast['mean'] - test
twice_differenced_resid.plot();

There is no sign of trend or seasonality, and the residuals are homoskedastic. However, they are centred higher than 0 (this corresponds to the consistent overestimation we saw in the previous graph). This is a problem with our forecast.

We also compute the mean squared error for the forecasts. Note: the actual values of the mean squared error are very small here, because we are using log-transformed data, which tends to lead to much smaller values. The actual values of the mean squared error don't matter so much as the comparison between those values for different models.

In [50]:
from sklearn.metrics import mean_squared_error
In [51]:
mean_squared_error(test, twice_differenced_forecast['mean'])
Out[51]:
0.006844109610484803

We now repeat this for our other two models:

In [52]:
# for the seasonal_ma model
fig, ax = plt.subplots(figsize=(15, 5))

# plot the original data
ax.plot(airline.index, airline['log'])

# plot the forecasts
ax.plot(airline.index[-test_size:], seasonal_ma_forecast['mean'])
# plot the confidence intervals on the forecasts
ax.fill_between(airline.index[-test_size:], seasonal_ma_forecast['mean_ci_lower'], seasonal_ma_forecast['mean_ci_upper'], color='k', alpha=0.1);
In [53]:
# plot residuals
seasonal_ma_resid = seasonal_ma_forecast['mean'] - test
seasonal_ma_resid.plot();

The residuals have no seasonality and are homoskedastic. They are also better centred around 0 than the previous model's residuals. However, they may have a slight downward trend, suggesting that this model overestimates in the short term and then underestimates in the longer term.

In [54]:
mean_squared_error(test, seasonal_ma_forecast['mean'])
Out[54]:
0.0023300595946771515
In [55]:
# for the non_ma model
fig, ax = plt.subplots(figsize=(15, 5))

# plot the original data
ax.plot(airline.index, airline['log'])

# plot the forecasts
ax.plot(airline.index[-test_size:], non_ma_forecast['mean'])
# plot the confidence intervals on the forecasts
ax.fill_between(airline.index[-test_size:], non_ma_forecast['mean_ci_lower'], non_ma_forecast['mean_ci_upper'], color='k', alpha=0.1);
In [56]:
# plot residuals
non_ma_resid = non_ma_forecast['mean'] - test
non_ma_resid.plot();

Here the residuals are non-seasonal, homoskedastic, and centred around 0, but there is a clear downward trend. This model overestimates in the near future and underestimates further ahead.

In [57]:
mean_squared_error(test, non_ma_forecast['mean'])
Out[57]:
0.004182004451632522

Visually, the forecast seems best for the seasonal_ma model. This model also has the lowest mean squared error, at 0.002, compared with 0.004 and 0.007 for the other models. Finally, the residual plot for the seasonal_ma model is closer to random noise around 0 than either of the others: the first model's residuals were all greater than 0, and the last model's residuals had a clear downward trend.
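To see the comparison at a glance, you could also gather the three scores into one place; a minimal sketch using the objects already defined above:

# collect the mean squared errors computed above into a single, sorted comparison
pd.Series({
    'twice_differenced': mean_squared_error(test, twice_differenced_forecast['mean']),
    'seasonal_ma': mean_squared_error(test, seasonal_ma_forecast['mean']),
    'non_ma': mean_squared_error(test, non_ma_forecast['mean']),
}).sort_values()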

So all indicators are telling us to select the seasonal_ma model as the best model for our data. This was the SARIMAX(2,0,0)(0,1,1,12) model.

Deploy your model

Testing has told us which model to use: in our case, the SARIMAX(2,0,0)(0,1,1,12) model. So far we have only fit this model to the training data, which is not our entire dataset; fitting to just the training data is an intermediate step for comparing and assessing candidate models. We should now refit the chosen model to the entire dataset.

Refit your model and forecast

In [58]:
# refit model to entire dataset
model = SARIMAX(airline['log'], order=(2,0,0), seasonal_order=(0,1,1,12), exog=airline['days_in_month']).fit()
C:\Users\Harry Gulliver\anaconda3\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "

The above warning indicates a problem that can sometimes occur with SARIMAX and other machine learning algorithms: the algorithm that fits the model struggles to arrive at a final answer. There are alternative methods that can be used when fitting the model, and often trying a different method will produce better results. You should not proceed with a model that has produced a ConvergenceWarning; try different methods until you find one that converges. The available methods are:

  • ‘newton’ for Newton-Raphson
  • ‘nm’ for Nelder-Mead
  • ‘bfgs’ for Broyden-Fletcher-Goldfarb-Shanno (BFGS)
  • ‘lbfgs’ for limited-memory BFGS with optional box constraints
  • ‘powell’ for modified Powell’s method
  • ‘cg’ for conjugate gradient
  • ‘ncg’ for Newton-conjugate gradient
  • ‘basinhopping’ for global basin-hopping solver

You do not need to know what any of these mean. If you get a ConvergenceWarning like we have here, just try different methods in the fit() call until you find one that does not give you a warning.
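For example, here is a minimal sketch of how you could try methods automatically until one fits without a warning. It uses Python's built-in warnings module to detect the problem; the import path for statsmodels' ConvergenceWarning class is an assumption, and disp=False simply silences the optimiser's printed output:

import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning

for candidate_method in ['lbfgs', 'powell', 'nm', 'bfgs', 'cg']:
    # record any warnings raised while fitting with this method
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        candidate = SARIMAX(airline['log'], order=(2,0,0), seasonal_order=(0,1,1,12),
                            exog=airline['days_in_month']).fit(method=candidate_method, disp=False)
    # keep the first fit that did not raise a ConvergenceWarning
    if not any(issubclass(w.category, ConvergenceWarning) for w in caught):
        print(f"method '{candidate_method}' fit without a ConvergenceWarning")
        model = candidate
        break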

In [59]:
# refit model to entire dataset - trying a different method
model = SARIMAX(airline['log'], order=(2,0,0), seasonal_order=(0,1,1,12), exog=airline['days_in_month']).fit(method='powell')
Optimization terminated successfully.
         Current function value: -1.691982
         Iterations: 4
         Function evaluations: 256

Note that when you specify a method in fit(), it prints a short report. Pay attention to the number of Iterations: the default maximum is 50, and when the algorithm reaches that limit without converging it stops and gives the ConvergenceWarning we saw earlier. If the number of iterations is close to 50, you might still want to try a different method. Here, the algorithm fit the model in only 4 iterations, suggesting this method has worked well.

Now that we have successfully fit our model, we can use it to forecast. This is very similar to the evaluation step, except that now we are forecasting from our whole dataset into the unknown, whereas before we were forecasting from our training set over the period of the test set. If we applied any transformations to the data, we must remember to reverse those transformations on the forecasts. Be warned that your confidence intervals may be less reliable if you have calculated them on transformed data and then detransformed them. In any case, confidence intervals should only ever be used as a guideline, never as definitive.

When using exogenous regressors, as we are, you need to supply the get_forecast method with forecasts of the exogenous regressor. Since our exogenous regressor is just the number of days in each month, it is easy to forecast it with certainty.

When using exogenous regressors that are NOT determined with certainty in advance, you need to be careful. You should consider carefully how you will forecast those regressors and how confident you can be in your forecast. Often a useful approach is to forecast a "worst-case scenario", "best-case scenario", and "most likely scenario" for the regressor, and then use each of those three scenarios in turn to produce forecasts of your main series of interest. That way, you have a range of forecasts that you can reasonably expect to include the true future values.

Note that the confidence intervals on the forecast assume the exogenous forecast is perfect. So if there is any uncertainty in your exogenous forecast, your final forecast of the main series will be less certain than it appears from the confidence intervals. This is another reason for considering several different scenarios for the exogenous variable, rather than a single forecast. It is also another reason for only using an exogenous variable you can forecast accurately.
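As a purely illustrative sketch: days_in_month is known exactly, so scenarios add nothing here, but if your regressor were uncertain you could do something along these lines. The three scenario arrays below are hypothetical numbers invented for illustration, not real forecasts:

# hypothetical scenario values for the exogenous regressor over a 24-month forecast horizon
hypothetical_scenarios = {
    'worst case':  np.full(24, 29.0),
    'most likely': np.full(24, 30.4),
    'best case':   np.full(24, 31.0),
}

# one forecast of the main series per scenario; together they give a range you can
# reasonably expect to bracket the true future values
scenario_forecasts = {
    name: model.get_forecast(24, exog=exog_values).summary_frame(alpha=0.1)['mean']
    for name, exog_values in hypothetical_scenarios.items()
}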

In [60]:
# check the last date in our dataset
airline.index[-1]
Out[60]:
Timestamp('1960-12-31 00:00:00', freq='M')
In [61]:
# set up the forecast period. We will forecast 2 years (24 months)
forecast_period = pd.date_range(start="1961-01-31", periods=24, freq="M") # freq="M" for monthly data

# create our forecast of the exogenous variable. This will be different depending on what your exogenous variable is and how you will forecast it. See above
exog_forecast = forecast_period.map(days_in_month).values
In [62]:
# get the forecast
forecast = model.get_forecast(24, exog=exog_forecast).summary_frame(alpha=0.1) # alpha = 0.1 for 90% confidence level
In [63]:
# plot the forecast
fig, ax = plt.subplots(figsize=(15, 5))

# plot the original data. note: we plot the original time series now, not the transformed version
ax.plot(airline.index, airline['International airline passengers in thousands'])

# plot the forecasts. apply np.exp() to undo the effects of the log transformation we applied
ax.plot(forecast_period, np.exp(forecast['mean'])) # if you didn't have to transform, remove the np.exp
# plot the confidence intervals on the forecasts
ax.fill_between(forecast_period, np.exp(forecast['mean_ci_lower']), np.exp(forecast['mean_ci_upper']), color='k', alpha=0.1); # apply np.exp() to undo the transformation
In [64]:
# we can also plot just the forecast, to get a clearer view of it
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(forecast_period, np.exp(forecast['mean'])) # apply np.exp() to undo the effects of the log transformation we applied
# plot the confidence intervals on the forecasts
ax.fill_between(forecast_period, np.exp(forecast['mean_ci_lower']), np.exp(forecast['mean_ci_upper']), color='k', alpha=0.1); # apply np.exp() to undo the transformation

We now have a forecast of our original time series, with 90% confidence intervals. We are in a position to predict things like "in 1961, total monthly airline passengers should peak at around 650 000" (the original data are in thousands, which is why the y-axis only shows hundreds).

We can also print the values for a more precise view than reading off the graph:

In [65]:
np.exp(forecast)
Out[65]:
mean   mean_se   mean_ci_lower   mean_ci_upper
1961-01-31 447.358304 1.038010 420.733062 475.668470
1961-02-28 412.779281 1.045755 383.493927 444.300999
1961-03-31 470.926957 1.054473 431.581971 513.858812
1961-04-30 486.811911 1.061277 441.444737 536.841459
1961-05-31 502.742526 1.067532 451.505396 559.794079
1961-06-30 574.032791 1.073140 511.106853 644.705982
1961-07-31 659.965413 1.078308 582.994316 747.098786
1961-08-31 655.104409 1.083094 574.500000 747.017905
1961-09-30 546.283348 1.087566 475.832300 627.165278
1961-10-31 486.942994 1.091767 421.463597 562.595395
1961-11-30 419.481061 1.095732 360.914631 487.551198
1961-12-31 464.659820 1.099489 397.541124 543.110474
1962-01-31 480.740456 1.110149 404.822916 570.895018
1962-02-28 443.314506 1.117323 369.372582 532.058307
1962-03-31 505.412261 1.124761 416.542098 613.243066
1962-04-30 522.117930 1.131483 426.113390 639.752563
1962-05-31 538.847956 1.137888 435.703325 666.410154
1962-06-30 614.857889 1.143921 492.857684 767.057581
1962-07-31 706.445730 1.149657 561.633570 888.596402
1962-08-31 700.794254 1.155116 552.816316 888.382945
1962-09-30 584.013529 1.160328 457.295436 745.845628
1962-10-31 520.248220 1.165316 404.501807 669.114960
1962-11-30 447.893675 1.170098 345.906922 579.950071
1962-12-31 495.827167 1.174691 380.466253 646.166586

We can use the confidence intervals to give upper and lower bounds on our predictions. For instance, in July 1961, we expect at least 583 000 passengers and no more than 747 000.
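If you prefer to pull those bounds out programmatically rather than reading them off the table, a quick sketch (assuming the forecast index contains the month-end dates shown above):

# back-transformed 90% interval for July 1961
np.exp(forecast.loc['1961-07-31', ['mean_ci_lower', 'mean_ci_upper']])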

Automating forecasts

Sometimes you might only want a single forecast from one point in time. But often you will want to keep making and updating forecasts as time goes on and new data come in. You don't want to have to redo a lot of work every time a new data point is measured in your time series, so an important part of deploying a forecasting model that will continue to be used is automating the future work.

Every time a new data point becomes available, you will want to retrain your model taking account of the new data, and re-forecast based on that new information. That way you always have access to the forecasts from the most up-to-date data available. You can write functions to do all of this for you. Some examples of functions you might want to write are below.

In [66]:
def import_and_clean():
    '''
    this function imports the data and applies the cleaning processes to it.
    It assumes that every time a new data point becomes available, it is added
    to the file with path ./Data/airline.csv relative to this notebook; you will
    of course want to modify that for your use.
    You will also want to modify the cleaning steps to match what you had to do
    with your data
    Since we used an exogenous regressor for the model, which was calculated from
    the data, we also add that in in this function. In other cases, you might
    need to read in the new value of the exogenous regressor from a separate
    data source and join it on
    '''
    # import and set index
    airline = pd.read_csv('./Data/airline.csv')
    airline.Month = pd.to_datetime(airline.Month, format='%Y-%m')
    airline.set_index('Month', inplace=True)
    
    # pad null values and resample
    airline.fillna(method='pad', inplace=True)
    airline = airline.resample("M").pad()
    
    # apply transform
    airline['log'] = np.log(airline['International airline passengers in thousands'])
    
    # add exogenous regressor column
    airline['days_in_month'] = airline.index.map(days_in_month) # depends on the days_in_month function which we defined earlier,
    # if putting this in a separate script, will need to put the definition of days_in_month in that script too
    
    return airline
In [67]:
def reforecast(data):
    '''
    this function takes the dataframe and refits the model, then forecasts
    it has the particular model we selected hardcoded: SARIMAX(2,0,0)(0,1,1,12),
    with days_in_month as an exogenous regressor.
    '''
    # refit model
    model = SARIMAX(data['log'], order=(2,0,0), seasonal_order=(0,1,1,12), exog=data['days_in_month']).fit(method='powell')
    
    # set up the forecast period. We will forecast 2 years (24 months)
    end_of_data = data.index[-1]
    forecast_period = pd.date_range(start=end_of_data, periods=25, freq="M")[1:] # freq="M" for monthly data
    # date_range includes the last month of the data itself, so we ask for 25 periods and drop the first
    # with [1:], leaving the 24 months AFTER the end of the data to match the 24 steps in get_forecast below

    # create our forecast of the exogenous variable. This will be different depending on what your exogenous variable is and how you will forecast it
    exog_forecast = forecast_period.map(days_in_month).values
    
    return model.get_forecast(24, exog=exog_forecast).summary_frame(alpha=0.1) # alpha = 0.1 for 90% confidence level
In [68]:
def plot_forecast(data, forecast):
    """
    this function plots the forecast and saves it as a png file
    """
    # plot the forecast
    fig, ax = plt.subplots(figsize=(15, 5))

    # plot the original data. note: we plot the original time series now, not the transformed version
    ax.plot(data.index, data['International airline passengers in thousands'])
    
    forecast_period = forecast.index

    # plot the forecasts. apply np.exp() to undo the effects of the log transformation we applied
    ax.plot(forecast_period, np.exp(forecast['mean'])) # if you didn't have to transform, remove the np.exp
    # plot the confidence intervals on the forecasts
    ax.fill_between(forecast_period, np.exp(forecast['mean_ci_lower']), np.exp(forecast['mean_ci_upper']), color='k', alpha=0.1); # apply np.exp() to undo the transformation
    
    # get today's date as a string to go in the file name - so when we save a new forecast plot, we won't overwrite the previous one
    from datetime import date
    today = date.today()
    today = today.strftime('%Y-%m-%d')
    
    # save the plot as a png file
    plt.savefig(f'./airline_forecast_on_{today}.png')
In [69]:
def update():
    """
    combines the previous functions to update the forecast and produce and save the plot
    """
    updated_data = import_and_clean()
    updated_forecast = reforecast(updated_data)
    plot_forecast(updated_data, updated_forecast)

You can then simply call the update() function every time a new data point is added to produce and save a graph of the updated forecast, taking into account the new data.

You could also write a function to use the previous version of the model (fit to all data points except the newly available one) to produce a one-step-ahead forecast of the new data point, and check the error. It could then notify you if the error is bigger than a tolerance you specify. In this way, you can automatically monitor your model as time goes by to check if it becomes less accurate. After all, the behaviour of the time series could change, and maybe the model you picked is no longer the best, so you might need to repeat the entire modelling process. Having some automatic monitoring to tell you if your predictions start getting worse can alert you to this.
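A minimal sketch of such a monitoring function, reusing import_and_clean() from above; the default tolerance and the choice to measure the error on the log scale are assumptions you would adapt to your own data:

def check_latest_forecast_error(tolerance=0.05):
    """
    Refit the chosen model on all data EXCEPT the newest point, make a one-step-ahead
    forecast of that point, and warn if the absolute error (on the log scale) exceeds
    the tolerance. Purely a sketch - adapt the model spec and tolerance to your case.
    """
    data = import_and_clean()
    history, newest = data.iloc[:-1], data.iloc[-1]

    # refit the selected model on everything except the newest observation
    previous_model = SARIMAX(history['log'], order=(2,0,0), seasonal_order=(0,1,1,12),
                             exog=history['days_in_month']).fit(method='powell', disp=False)

    # one-step-ahead forecast, supplying the (known) exogenous value for the new month
    one_step = previous_model.get_forecast(1, exog=[[newest['days_in_month']]]).summary_frame()
    error = abs(one_step['mean'].iloc[0] - newest['log'])

    if error > tolerance:
        print(f"Warning: one-step-ahead error {error:.4f} exceeds tolerance {tolerance}")
    return error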

There are many options for automating the deployment of your time series modelling, limited really only by your imagination and your use case! You can also create a Python script (rather than a Jupyter notebook) and schedule it to run at regular intervals - so you don't even need to open a notebook and run the update() function every time a new data point comes in. How to do this is beyond the scope of this guide, but Google will help you!

Some warnings

Do not try to forecast too far into the future. The further into the future you forecast, the less reliable your forecasts become. There are no hard rules for how far is too far, as it depends a lot on how volatile the data are and how confident you are in your understanding of the factors driving the time series. Just bear in mind that your predictions become much less certain as you look further and further ahead.

The confidence intervals in the forecast are guidelines only. They do not guarantee that the time series will remain within those bounds in future. They are estimates, and are also based on certain assumptions, which may not always be valid. In particular, if you forecast from a transformed series, your confidence intervals on the detransformed series might not be accurate.

Unexpected changes can happen. For example, just because your time series has shown a steady upward trend for all its values so far doesn't mean it will continue to. Maybe something will happen to change that, or maybe there is some natural limit to how high it can grow before its growth must slow. Always remember that your forecasts might be wrong, and maybe even drastically wrong.

Domain knowledge is crucial. Often the best forecasts are made by combining statistical methods like SARIMAX with the judgement of a range of experts. Any time series analysis should be accompanied by careful, critical examination of the situation, the data, and the conclusions.

If using exogenous regressors, the reliability of your forecasts for the main series depends on your forecasts for the regressors. This is not a problem for exactly known regressors, like our days_in_month example, but any uncertainty in the forecast of a regressor makes your main forecast more uncertain. In particular, the confidence intervals will be narrower than they should be - do not trust the confidence intervals if using uncertain exogenous regressors.