In [1]:
# jupyter nbconvert --execute --to html '.\Step-by-Step Guide to Time Series Analysis.ipynb'
from IPython.display import HTML

contents_css = """
.contents ol{
    text-align: center;
    list-style-position: inside;
}
.contents a {
    color: var(--ultraviolet) !important;
    font-size: 150% !important;
}
.contents a:hover {
    color: var(--electron) !important;
}
.contents li::marker {
    color: var(--molten);
    font-size:150% !important;
}
.contents li {
    margin-top: 15px;
    margin-bottom: 15px;
    text-align: left;
    margin-left: 30%;
}

.contents li p {display:none}

/*
.contents li:hover p {
    display:inline;
}

.contents > li > ol > li > ol{display:none}
.contents > li > ol > li:hover > ol{
    display:inline;
}*/

.contents li li {
    margin-left: 30px;
}

.contents li li {
    font-size: 90%;
}

.contents li li li {
    font-size: 80%;
}

/* dropdown arrow hack with checkbox */
.arrow {
  border: solid var(--molten);
  border-width: 0 3px 3px 0;
  display: inline-block;
  padding: 3px;
  margin: 0 0 3px 5px;
}

.left {
  transform: rotate(135deg);
  -webkit-transform: rotate(135deg);
}

.down {
  transform: rotate(45deg);
  -webkit-transform: rotate(45deg);
}

input:checked + label.arrow.left {
    transform: rotate(45deg);
  -webkit-transform: rotate(45deg);
}

input:checked + label + br + p {
    display: inline
}

.contents li ol {
    display:none
}

input:checked + label + br + p + ol {
    display:inline
}

input {
    display:inline
}

input[type="checkbox"] {
    display:none
}
"""

htmlstr = """
<html>
<head>
    <link rel="preconnect" href="https://fonts.googleapis.com">
    <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
    <link href="https://fonts.googleapis.com/css2?family=Rubik&display=swap" rel="stylesheet">
    <style>
        :root {
                --ultraviolet: #4d61f4;
                --molten: #ff7c66;
                --supernova: #ffe352;
                --white-dwarf: #f4f4f4;
                --electron: #64abff;
                --deep-space: #242456;
                --aurora: #59d8a1;
            }
        body {
            font-family: Rubik, sans-serif;
            }
        div {font-size: 103%; color: var(--deep-space) !important;}
        div.CodeMirror-code {font-family: monospace; color: var(--deep-space) !important;}
        .rendered_html table, div.output_area pre {color: var(--deep-space) !important;}
        h1 {font-family: "Times New Roman", serif; color:var(--ultraviolet); font-size: 300% !important;}
        h2 {font-family: "Times New Roman", serif; color:var(--ultraviolet); font-size: 200% !important;}
        h3 {font-family: "Times New Roman", serif; color:var(--electron); font-size: 150% !important;}
        h4 {font-family: "Times New Roman", serif; color:var(--molten); font-size: 125% !important;}
        .rendered_html h4 {
            margin-top: 0;
        }
        span#author {color:var(--molten) !important;}
        a#return_to_contents {
            font-family: "Times New Roman", serif;
            color:var(--molten);
            font-size: 120% !important;
        }
        /*img.resize {width:80%; height:80%;}
        img.small {width:40%; height:40%;}*/
        """ + contents_css + """
    </style>
</head>
<script>
    $('div#notebook-container > div.code_cell:nth-of-type(1)').hide();
</script>
</html>
"""
HTML(htmlstr)
Out[1]:

Time Series Analysis

A Step-by-Step Guide


By H Gulliver for Multiverse


This guide is meant to walk you through the steps you might follow when analysing time series data and forecasting future values. It starts with an overview of the key steps to take, then gives a more detailed run-through, with a running example and Python code. The overview of key steps provides hyperlinks to the relevant sections of the detailed run-through.

Time series analysis is a huge subject. This guide covers only the most common steps you should follow and a few of the most common forecasting models. It is certainly not a complete guide to the subject, but should help you get started with analysing time series data.

The idea is you should follow along with this guide while analysing a time series of your own. The guide will be a reference for what steps to take, and the code to use. No two time series are the same, and the steps needed to analyse a time series vary. Sometimes you might need to skip some of these steps, sometimes you might need extra steps. Again, this guide is a starting point, not comprehensive.

The guide is based around using Python to analyse your time series, and example code is shown, with outputs. The main steps will be the same in any language, but the code will of course be different. An excellent resource for time series concepts in general, and their implementation in R, is this site, which is referenced several times below.

Overview

Click the dropdown arrow by a heading to see a description and subheadings. Click the (sub)section heading to go to the detailed walkthrough of that section.

  1. Import your time series data

    Read your time series into Pandas, convert the time column to a datetime, and set it as the index.

  2. Clean your data

    Check for odd or missing values. Look at summary statistics and the head and tail and sense-check them.

    1. Fill in nulls

      Either "pad" missing values (copy the last known value forward), or fill them in with zeroes or another suitable value, depending on context.

    2. Resample to fill in missing dates

      If there could be entire timestamps missing, insert those and fill in their values as you did for nulls. You can also resample to a different level of granularity - such as looking annually instead of monthly.

  3. Exploratory data analysis

    Plot your time series and identify key features. Apply a transformation if needed (such as a logarithm) and plot the autocorrelation function to confirm key features.

    1. Plot your time series

      Plot the raw time series. Look for outliers, trend, seasonality, cyclic behaviour, and heteroskedasticity. Identify any sudden changes in the series' behaviour over time and try to identify what might have caused them.

    2. If necessary, transform your time series

      If your series is heteroskedastic (its variance changes over time) you will need to transform it and work with the transformed series. A logarithmic transformation is often effective. Graph the transformed series and look for features, as you did with the raw series.

    3. Plot the autocorrelation function for your time series

      Look at the correlation of your series with past values of itself (use the transformed series, if you have applied a transformation). Confirm your observations from plotting the series itself by looking for trend (which will show as a slow decrease towards 0 in the ACF) and seasonality (which will show as regular spikes at multiples of the seasonal period).

  4. Decompose your time series

    Study the trend and seasonal (if there is one!) components of your time series more closely to better understand the behaviour. You might compare one (or both) of these components to other time series to identify relationships that can explain the behaviour of your data (for example, the trend component of sales data might relate to unemployment rates).

    1. For a seasonal time series

      Use a seasonal decomposition function to split your (possibly transformed) time series into trend, seasonal, and residual components. Explain the behaviour of the trend and seasonal components to better understand your series - maybe compare them with other time series. Use the residuals to check the quality of the decomposition (the residuals should show no trend, seasonality, or heteroskedasticity).

    2. For a non-seasonal time series

      Use a rolling mean to extract the trend component from your time series and separate it from the residuals. Use the residuals to check the quality of the decomposition (the residuals should show no trend, seasonality, or heteroskedasticity). Study the trend component and explain its behaviour to better understand your time series - maybe compare it with another time series.

  5. Model your time series

    Choose an appropriate model for your time series. We will look at SARIMAX models here, but other options exist. Once you have chosen a model (or several candidates), fit it to a training subset of your data and test it on a testing subset. Once you have settled on a final model, retrain it on your whole dataset and produce forecasts. You might want to build convenience functions to let you automatically update the model and forecasts as new data come in.

    1. Select an appropriate model

      Choose one or more models to fit to your data. There are many types of model available, but we will focus on choosing SARIMAX models.

      1. Differencing and stationarity

        Check if your time series is stationary. If it isn't, use differencing (seasonal differencing at most once to remove seasonality and simple differencing to remove trends) to get a (roughly) stationary time series. Make a note of how many times you differenced, both seasonally and simply, as you will need these numbers to fit your model.

      2. Determining the number of AR terms

        Look at the ACF and PACF plots of your differenced time series to identify the number of AR terms. If the PACF drops suddenly to 0 after a small number of lags and/or the ACF is positive at lag 1, then use AR terms - the number of AR terms is the number of lags before the PACF drops to 0. For seasonal AR terms, use the same approach, but looking only at the lags which are multiples of the seasonal period. Make a note of the numbers of simple and seasonal AR terms - you will need these to fit your model.

      3. Determining the number of MA terms

        Look at the ACF plot of your differenced time series to identify the number of MA terms. If the ACF drops suddenly to 0 after a small number of lags and/or the ACF is negative at lag 1, then use MA terms - the number of MA terms is the number of lags before the ACF drops to 0. For seasonal MA terms, use the same approach, but looking only at the lags which are multiples of the seasonal period. Make a note of the numbers of simple and seasonal MA terms - you will need these to fit your model.

      4. Determining the order of your SARIMAX model

        Bring together the number of seasonal and simple AR and MA terms, the seasonal period, and the number of simple and seasonal differencings to find the orders of the SARIMAX model. This is just collecting the results of the previous steps.

      5. Exogenous regressors (other variables)

        If your time series is related to other time series which are easier to forecast accurately (especially ones which are known perfectly in advance, like whether a day will be a public holiday or not), then you can use these as exogenous regressors in your model. Make sure you have these as a column in your dataframe alongside your main (endogenous) time series.

    2. Train-test split

      Split your time series into training and testing sets. The training set should usually be the first 70-80% of the data, and the rest should be the test set. Do not split your data near a sudden change in the behaviour of the series. Split the actual time series, not the differenced version (but, if you've applied a transformation, split the transformed version, not the raw version).

    3. Build your model

      Determine the orders of your model. For a SARIMAX model, the order is (p, d, q), where p is the number of simple AR terms, d the number of times you simply differenced, and q the number of simple MA terms. The seasonal order is (P, D, Q, m), where P is the number of seasonal AR terms, D the number of times you seasonally differenced, Q the number of seasonal MA terms, and m the seasonal period. Create a SARIMAX object by specifying these orders, and fit it to your training data and any exogenous variables you are using.

    4. Evaluate your model

      Use your fitted model to forecast the time series for the test set, and compare with the actual test set values. Look at a graph of the forecasts and confidence intervals vs the true test values, compute an error metric, and look at the residuals to evaluate your model. If you had multiple candidate models, choose the best performing candidate to be your deployed model.

    5. Deploy your model

      Refit your model to the whole dataset and forecast. Undo any transformations you did to get the actual forecast values.

      1. Refit your model and forecast

        Refit your chosen model on the whole time series (both the training and test sets) and forecast from the refit model into the future. If you applied a transformation to your series in the EDA phase, apply the inverse transformation to your forecasts to get the actual forecast values.

      2. Automate the update process for when new data come in

        If you will need to keep updating your model and forecasts as new data come in, set up a script to automate this process. Build in an error metric checker to alert you if the fit of your model gets worse in future.

    6. Some warnings

      Do not try to forecast too far into the future. Confidence intervals are a guideline, not a guarantee. Expect the unexpected (just because your series has always behaved a certain way in the past doesn't mean it will always continue to do the same). Domain knowledge is crucial. Beware of exogenous regressors which aren't known perfectly in advance.

All the imports from libraries appear in the example code when they're first used, but they're also all collected here in one place for your convenience. Depending on your use case, you might not need all of these, and you might want other imports as well.

In [2]:
import pandas as pd
import numpy as np
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error
import datetime

Import your time series data

Load your time series data into pandas and view it:

In [3]:
import pandas as pd

airline = pd.read_csv('./Data/airline.csv')
airline.head()
Out[3]:
Month International airline passengers in thousands
0 1949-01 112.0
1 1949-02 118.0
2 1949-03 132.0
3 1949-04 129.0
4 1949-05 121.0

Convert the time column to datetime and set as the index:

In [4]:
airline.Month = pd.to_datetime(airline.Month, format='%Y-%m')
airline.set_index('Month', inplace=True)
airline.head()
Out[4]:
International airline passengers in thousands
Month
1949-01-01 112.0
1949-02-01 118.0
1949-03-01 132.0
1949-04-01 129.0
1949-05-01 121.0

Note: the format keyword argument tells Pandas how the dates are formatted; "%Y" means "4-digit year" and "%m" means "2-digit month", so '%Y-%m' means "a 4-digit year, followed by a hyphen, followed by a 2-digit month." If the dates were formatted like "01/1949", say, we'd use "format='%m/%Y'". For a full list of the % codes, see https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes

You can leave out the format argument, and Pandas will try to guess. If the date format is something sensible (like year-month-day), this is fine, but it might get confused with some formats. Best practice is to manually inspect the data to see what format is used, and specify it with the format argument.
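For example, if the dates in your file were written as month/year, you could parse them like this (a made-up example, just to show the format codes in action):

In [ ]:
# hypothetical dates written as "month/year" need format='%m/%Y'
pd.to_datetime(pd.Series(['01/1949', '02/1949', '03/1949']), format='%m/%Y')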

Return to Overview

Clean your data

As you would for any dataset, look at the head and tail of your data and use the .info() and .describe() methods to identify anything odd.

In [5]:
airline.head()
Out[5]:
International airline passengers in thousands
Month
1949-01-01 112.0
1949-02-01 118.0
1949-03-01 132.0
1949-04-01 129.0
1949-05-01 121.0
In [6]:
airline.tail()
Out[6]:
International airline passengers in thousands
Month
1960-08-01 606.0
1960-09-01 508.0
1960-10-01 461.0
1960-11-01 390.0
1960-12-01 432.0
In [7]:
airline.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 143 entries, 1949-01-01 to 1960-12-01
Data columns (total 1 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   International airline passengers in thousands  142 non-null    float64
dtypes: float64(1)
memory usage: 2.2 KB
In [8]:
airline.describe()
Out[8]:
International airline passengers in thousands
count 142.000000
mean 281.394366
std 120.192632
min 104.000000
25% 180.250000
50% 265.500000
75% 361.500000
max 622.000000

Note from the .info() method that there are 143 entries, but 142 non-null values - so we have a missing value somewhere in our data.

The min and max are quite far apart, which might be suspicious. Looking at the head and tail, the numbers seem smaller at the start, and bigger at the end, suggesting the data have increased over the time measured. The data do span a 12 year period, so total airline passengers could have increased a lot in that time. So the min and max values don't seem totally unreasonable. You should do this sort of sense-checking of your data to make sure the summary statistics from .describe() make sense.

Return to Overview

Fill in nulls

A common method to fill in nulls in time series data is "padding" - the last non-null value is repeated where the null is. This means that at every timestamp you have the most recent measured value.

In [9]:
airline.fillna(method='pad', inplace=True)

Another option is to fill in the missing values with 0. This can be appropriate if, for instance, the data are sales data and the missing values are on bank holidays - the data are probably missing because no sales took place on those days! You can do this by simply putting "0" in place of "method='pad'" in the above code.
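For example (shown commented out, since we have already padded the nulls above):

In [ ]:
# alternative to padding: fill missing values with 0
# airline.fillna(0, inplace=True)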

Return to Overview

Resample to fill in missing dates

Filling in nulls only fills in values for dates that appear in the data but don't have values recorded. If a date isn't even included at all, .fillna() won't add it in. Some time series tools expect all times to be at regular intervals - so with monthly data, for instance, there shouldn't be any missing months. So we also need to "resample" the data to make sure that all dates within the range are included. This will create null values, so again we pad them or put in 0.

In [10]:
airline = airline.resample("M").pad()
In [11]:
airline.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 144 entries, 1949-01-31 to 1960-12-31
Freq: M
Data columns (total 1 columns):
 #   Column                                         Non-Null Count  Dtype  
---  ------                                         --------------  -----  
 0   International airline passengers in thousands  144 non-null    float64
dtypes: float64(1)
memory usage: 2.2 KB

Note that there are now 144 entries - before there were 143. So there was a missing date in the original data! Resampling is a good idea as a "just-in-case" step to make sure you don't have any missing dates.

You can also resample to a bigger stretch of time. For example, if you have daily data, but that's too granular and you just want monthly totals, you can resample to monthly with:

In [12]:
# data.resample("M").sum()
# Instead of .sum(), you can use .mean(), etc. Instead of "M", you can use "Y" for yearly resampling, etc.

Exploratory Data Analysis

Plot your time series

In [13]:
airline.plot();

Look out for trends, seasonality, and cyclic behaviour (the difference between seasonality and cyclic behaviour is that seasonality happens at fixed intervals, whereas cycles happen at irregular intervals). Also look out for heteroskedasticity - when the variance of the series changes over time.

Here, we have a seasonal pattern and a steady upwards trend. The seasonal pattern seems to repeat every year (between 1949 and 1959, I count 10 repetitions), which makes sense (things like Christmas and summer holidays will affect flight numbers in regular ways each year). Since the data are monthly, yearly seasonality means a period of 12 (i.e., the seasonal pattern repeats every 12 data points). The data are heteroskedastic, as the size of the variations in the data get bigger as time goes on, instead of staying roughly constant.

For comparison, here are some other time series plots and descriptions of their features.

Top left: upwards trend, levelling off at the end. Some seasonal behaviour. The series is homoskedastic (the variance does not change over time).

Top right: slight upwards trend, an extreme outlier, and a smaller outlier near the end. No seasonality. Heteroskedasticity is hard to assess - there may be a little, but that could be exaggerated by the outliers.

Bottom left: no particular trend. Some cyclical behaviour - occasional high peaks, but because these are at irregular intervals, this is not seasonality. Homoskedastic.

Bottom right: strong upwards trend, then reverses into a downward trend. No seasonality. Homoskedastic.

If necessary, transform your time series

The decomposition and modelling methods we will use later require a homoskedastic time series. If your time series is strongly heteroskedastic, you will need to transform it (you can then reverse the transformation after decomposition and modelling).

The most common cause of heteroskedasticity is when the variation in a time series is proportional instead of absolute. In other words, when fluctuations are as a percentage of the value of the time series, not a fixed number. You can see this in the airline data:

In [14]:
airline.plot();

In the first season, the lowest point is about 100 and the highest about 150, so the highest is roughly 50% above the lowest. In the last season, the lowest point is about 400 and the highest about 600 - again, the highest is roughly 50% above the lowest. The pattern is the same in all the other seasons. So every season, the difference between the highest and lowest point is about 50% of the lowest value, rather than a fixed number.

When you have this sort of heteroskedasticity, you can remove it by using the logarithm function available in the NumPy library:

In [15]:
import numpy as np
In [16]:
airline['log'] = np.log(airline['International airline passengers in thousands'])
airline['log'].plot();

We see that the log number of airline passengers is homoskedastic - the fluctuations are all (roughly) the same size. We can then work with this log time series and safely use tools and techniques that require homoskedasticity. We can convert back from the log series to the original series with the exponential function, np.exp(). So, for example, when we forecast, we will forecast the log series, then take np.exp() of the forecast to get the forecast for the original series.
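As a quick sanity check, exponentiating the log series should recover the original values:

In [ ]:
# np.exp() undoes np.log(), so this should print True
print(np.allclose(np.exp(airline['log']), airline['International airline passengers in thousands']))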

Sometimes neither the original series nor the log series is homoskedastic. In this case, there are other transformations you can use, called Box-Cox transformations, which can deal with this. They are beyond the scope of this guide, but you can read about them and explore their effects with an interactive tool here. For their implementation in Python, see here.
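For reference, here is a minimal sketch of a Box-Cox transformation using SciPy (not used elsewhere in this guide; scipy.stats.boxcox chooses the transformation parameter automatically):

In [ ]:
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Box-Cox requires strictly positive data; the passenger counts qualify
transformed, lam = boxcox(airline['International airline passengers in thousands'])
# inv_boxcox(transformed, lam) undoes the transformation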

Return to Overview

Plot the autocorrelation function for your time series

In [17]:
from statsmodels.graphics.tsaplots import plot_acf
In [18]:
plot_acf(airline['log'], lags=36);

The ACF (AutoCorrelation Function) has two key uses at this stage: checking for trend and seasonality. It can also help determine what the seasonal period is (the number of observations before the seasonal pattern repeats) as this can sometimes be hard to tell exactly from the time plot. You should already have an idea of the trend and seasonality of your time series from plotting the series itself, so this step should confirm what you know.

Trend shows up in an ACF plot by a slow decrease. Seasonality shows up by peaks at multiples of the seasonal period.

Here, there is a slow overall decrease in the ACF, showing the trend of the data, and the ACF peaks at 12, 24, and 36, showing the seasonality with period 12. This confirms what we saw from the plot of the time series itself.

Below is the ACF plot of a series with no trend or seasonality (in fact, it is the bottom left time series from the Plot your time series section). Note that the rapid drop in the ACF shows the lack of trend, and the lack of regular peaks shows the lack of seasonality.

Decompose your time series

Time series decomposition lets you split your time series into trend, seasonal, and residual components to study each of these separately. Looking at each component in isolation can help you understand that component more clearly. For time series without seasonality, of course there is no seasonal component, so you can decompose to just trend and residuals.

In both cases, the residuals are simply what is left over once the other components are removed. They are the "noise" in the series that cannot be explained as part of seasonal patterns or a long-term trend.

Jump to seasonal decomposition

Jump to non-seasonal decomposition

For a seasonal time series

To take the decomposition of a seasonal time series, you need to know the seasonal period. You should know this from the time plot and ACF plot.

If you have applied a transformation to your time series, you should decompose the transformed series, not the original series.

In [19]:
from statsmodels.tsa.seasonal import seasonal_decompose
In [20]:
# need to include the period. Notice we use the transformed time series (the log series)
decomp = seasonal_decompose(airline['log'], period=12)
decomp.plot();

Once you have your decomposition, you should look at the residuals to check how good the decomposition is:

In [21]:
decomp.resid.plot();

The residuals plot is very important for understanding how good your decomposition is. It should look like random noise. If there is a trend in the residuals, it means the trend component of your decomposition is not accurately capturing all of the trend information and you can't rely on it. If there is seasonality in the residuals, then the seasonal component of your decomposition is not reliable. Here, the residuals have neither trend nor seasonality, suggesting the decomposition is reasonably good.

The residuals should also be homoskedastic. If they aren't, you might need to transform your data. Here, there is some slight heteroskedasticity - the variance is lower in the middle than at the ends. This is not ideal, but not a complete disaster - you never expect perfection in time series analysis. Compare with the residuals plot of the untransformed data:

In [22]:
seasonal_decompose(airline['International airline passengers in thousands'], period=12).resid.plot();

The heteroskedasticity is much worse here, and there even seems to be seasonality left in the residuals. So the decomposition of the log-transformed data is much more reliable than the decomposition of the raw data would be, and we should stick with the transformed version.

Once we have used the residuals to check our decomposition is good, we can access individual components of the decomposition to study them more clearly:

In [23]:
# plot the trend component
decomp.trend.plot();

There is an overall upwards trend, as we could see from the original series, but what is more obvious from this plot is that it levels off slightly at two points. We could then investigate this further by looking into what happened at those times - were there any world events in 1953 and 1957 that would have prevented the number of airline passengers from growing? This is where domain knowledge is crucial.

If we are using a transformed series, as here, we might want to undo the transformation on the component before plotting it. The general shape is the same, but the y-axis is then in meaningful units (thousands of passengers rather than log-passengers).

In [24]:
import matplotlib.pyplot as plt
plt.plot(np.exp(decomp.trend)); # use np.exp() to undo the np.log() transformation

We should also try to understand what drives the overall trend. Why is the number of airline passengers increasing, and why at this rate? One reasonable guess is that population growth might be responsible, so we could try comparing this with data on world population. Generally, once you have decomposed your time series, it is a good idea to look at relationships with other time series that might be driving the behaviour. Correlation is the main thing to check when comparing two time series to find a relationship - along with, of course, plotting them on the same graph.

In [25]:
# read in and tidy up world population data
world_pop = pd.read_csv('./Data/world-population.csv')
world_pop.Year = pd.to_datetime(world_pop.Year, format='%Y')
world_pop.set_index('Year', inplace=True)
world_pop.sort_values('Year', inplace=True)
world_pop = world_pop.resample('Y').sum()
world_pop = world_pop.astype('float64')
In [26]:
# plot against airline trend data. Note: world_pop is only available from 1951, so the start of the airline data graph is earlier
# also note we use the detransformed trend data
fig, ax1 = plt.subplots()
ax1.plot(np.exp(decomp.trend)) # plot detransformed trend data
ax2 = ax1.twinx() # the two series are on different scales, so to plot them on the same graph we need two y-axes sharing the same x-axis; the twinx() method allows this
ax2.plot(world_pop, 'r');

We see here that the trend in the airline data closely follows the trend in world population, suggesting that population growth is the main driver of increasing airline numbers. For extra confirmation, we calculate the correlation. The world population data is yearly, not monthly, and covers a shorter time period, so we need to resample the airline trend to yearly and slice the relevant rows:

In [27]:
from scipy.stats import pearsonr
resampled_trend = np.exp(decomp.trend.resample('Y').mean())
pearsonr(world_pop.Population, resampled_trend.loc['1951':])
Out[27]:
(0.9971727710677084, 2.785786178487446e-10)

We have a correlation of 0.997 with a p-value of less than 1 in a billion, so we can very confidently conclude that airline passenger numbers tracked world population very closely and it seems likely that world population was the main driver of the trend.

We now turn to the seasonal component.

In [28]:
# plot the seasonal component
decomp.seasonal.plot();

We can now more clearly see the shape of the seasonal pattern. Every time it rises, it does it in 3 stages (up a lot, down a bit, up a lot, down a bit, up a lot), but every time it falls it does it all in one go. We could look deeper into the data to understand why this is. Again, domain knowledge is crucial here.

To get a clearer idea of the seasonal pattern, we can plot a single season:

In [29]:
decomp.seasonal.iloc[:12].plot();

We can now see that the small drops during the rise are in February, and again in April and May. The February drop is probably due to calendar effects - February has 28 days (most years), while January and March have 31 each, so we should expect a lower value in February. The February drop is a common feature of monthly time series.

The April/May drop is less clear. More investigation of possible causes would be needed.

For a non-seasonal time series

In [30]:
# using different example data for this, so need to load it:
euretail = pd.read_csv('./Data/euro_cleaned.csv')
euretail.Date = pd.to_datetime(euretail.Date)
euretail.set_index('Date', inplace=True)
euretail.plot();

For a non-seasonal time series, you can estimate the trend with a centred rolling average, then subtract the trend to get the residuals:

In [31]:
# use an odd number for the window size. Larger numbers will give smoother trends, smaller numbers will give more jagged trends. Try a few different values
euretail['trend'] = euretail.value.rolling(window=7, center=True).mean() # rolling mean for trend
euretail['resid'] = euretail['value'] - euretail['trend'] # residuals = value minus trend
In [32]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots(nrows=3, ncols=1, figsize=(6,12))
euretail.value.plot(ax=ax[0]);
euretail.trend.plot(ax=ax[1]);
euretail.resid.plot(ax=ax[2]);

The residuals plot is very important for understanding how good your decomposition is. It should look like random noise. If there is a trend in the residuals, it means the trend component of your decomposition is not accurately capturing all of the trend information and you can't rely on it. If there is seasonality in the residuals, then there may have been seasonality in your original time series, and you should recheck this and consider trying a seasonal decomposition. Here, the residuals have neither trend nor seasonality, suggesting the decomposition is reasonably good.

The residuals should also be homoskedastic, as they are here. If they aren't, you might need to transform your data - see here.

Now that we have a decomposition, we would examine the trend component and try to understand the factors driving it, like we did with the seasonal decomposition by comparing the airline trend to world population data.

Model your time series

Modelling your time series allows you to forecast future values. There are many different modelling methods. We will focus here on a very versatile and popular family of methods called SARIMAX: Seasonal AutoRegressive Integrated Moving Averages with eXogenous regressors. Don't worry, it's not as horrible as it sounds!

SARIMAX is actually a combination of several related modelling methods, so the first step is to decide which particular SARIMAX model to use. Sometimes it is unclear which SARIMAX model will be best. In these cases, you can train 2 or 3 different models, and then compare their performance to decide which to use.

Select an appropriate model

For a list of rules for choosing SARIMA models (without the X - we discuss that here), see here. For a more detailed walkthrough with an example, read on.

Differencing and stationarity

A time series is stationary if it shows broadly the same behaviour at all times. In particular, a stationary time series has no trend or seasonality and is homoskedastic. Cyclic behaviour can occur in a stationary time series - because it is unpredictable, it doesn't change the expected value (mean) or variance of the series. For some examples of stationary and non-stationary time series, see here.

SARIMAX is built on a simpler class of models called ARMA, which only work for stationary time series. So the first step in SARIMAX is to convert your time series to a stationary one if it isn't already. You do this with differencing: taking the difference from one observation to the next. The opposite of differencing (which has to be done at the end to get back to the original series) is called integrating, and this is what the I in SARIMAX stands for.

If your time series is seasonal, use seasonal differencing first to remove the seasonality. In rare cases, your time series might start with seasonality, but this fades over time and you expect that in the future the seasonality will die out entirely. In this case, do not use seasonal differencing. But if the seasonality remains strong and you expect it to continue into the future, you must use seasonal differencing.

Differencing cannot remove heteroskedasticity. If your time series is heteroskedastic, you will need to transform it before modelling it with SARIMAX.

In our airline example, there is strong, consistent seasonality, so we use seasonal differencing, with period 12.

In [33]:
# the airline time series is seasonal with period 12, so use seasonal differencing:
seasonally_differenced = airline['log'].diff(12) # 12 for seasonal period
seasonally_differenced.plot();

Once you have a non-seasonal time series - either because it was non-seasonal to begin with, or because you seasonally differenced it - check for stationarity. Do this by looking at the time plot (above) and looking for trend or heteroskedasticity. It can also help to look at the ACF plot - for a stationary series, it should drop sharply towards 0 after a small number of lags. Here, there are several sections of the series that are much lower than the rest, so this series does not seem to be stationary. We plot the ACF to check.

Note: when plotting the ACF of a differenced series, you will need to drop null values - the first few values of a differenced series will be null, because there were no earlier values to difference them against. If you get a warning about "converting masked element to nan" when plotting ACF, it usually means you need to trim missing values. Use the .dropna() method for this.

In [34]:
plot_acf(seasonally_differenced.dropna());

The ACF decays fairly slowly towards 0, without a sudden drop, suggesting the time series still has some trend and is not stationary. It certainly drops towards 0 much faster than the original series did though.

If your time series is non-seasonal, but also non-stationary, use simple differencing. If the seasonally differenced time series is non-stationary, use simple differencing as well (that is, apply simple differencing to the seasonally differenced series).

Here, we have seasonally differenced, and it is still non-stationary, so we will use simple differencing on the seasonally differenced series:

In [35]:
# simple differencing too:
twice_differenced = seasonally_differenced.diff() # .diff() for simple differencing - or can put .diff(1), it's the same
twice_differenced.plot();
In [36]:
plot_acf(twice_differenced.dropna());

Now there is no real pattern in our twice differenced series, and the ACF drops down to near 0 suddenly. This series is now (close to) stationary. There is still some heteroskedasticity (from about 1955 to 1959, the variance is lower than everywhere else), but you never get a perfectly stationary time series.

If, after simple differencing, your time series is still non-stationary, you can do a second simple differencing. Only do this if you have not seasonally differenced! You should not difference more than twice in total (either twice simply, or once seasonally and once simply). If you can get away with only differencing once (or not at all!) that is usually better.

If you aren't sure how many times you should difference, you can try both - train a model with one level of differencing, and another with two, for instance, and then compare their performance at the end to decide which to use.

You should make a note of how many times you have seasonally differenced (0 or 1) and how many times you have simply differenced (0, 1, or 2). You will need these numbers when setting up the SARIMAX model. Here: simple differencing: 1, seasonal differencing: 1.

Beware of overdifferencing. It's worth repeating: do not difference more than twice in total, and even stick to once if you can. If after differencing once, the time series is not very badly non-stationary, consider trying a model with that as well as the twice-differenced version, then comparing them. We're in that case here: the seasonally differenced time series is non-stationary, but not too badly. So as well as fitting a model with both seasonal and simple differencing, we'll fit a model to just the seasonally-differenced data, and then compare the two models at the end to decide which is best.

Determining the number of AR terms

Once you have a stationary time series - either because it was stationary in the first place, or because you have differenced it enough - the next step in choosing a SARIMAX model is determining the number of "AR" terms. There can be both simple AR terms and seasonal AR terms. To find these, you use the ACF and PACF (Partial AutoCorrelation Function). You will need to plot them for several multiples of the seasonal period, but you should not plot them for more than about a quarter of the length of your time series, as the calculations can become inaccurate.

Here, our time series has 144 points, but we lose 13 through seasonal differencing followed by simple differencing, leaving 131 points. 131/4 = 32.75, and the seasonal period is 12, so plotting the ACF and PACF for 36 lags is a good compromise, giving us 3 full seasons and not being much more than a quarter of the length of the time series.
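If you want to check this arithmetic in code:

In [ ]:
# points remaining after seasonal + simple differencing, and roughly a quarter of that
n_usable = len(twice_differenced.dropna())
print(n_usable, n_usable // 4)  # 131 32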

In [37]:
from statsmodels.graphics.tsaplots import plot_pacf # we've already imported plot_acf, so don't need to import it again, but we also need plot_pacf
plot_acf(twice_differenced.dropna(), lags=36);
In [38]:
plot_pacf(twice_differenced.dropna(), lags=36);

For the simple AR terms, we first look at whether the ACF is positive or negative at lag 1. Here, ACF(1) is negative (about -0.4). This generally suggests you should not use any simple AR terms - if ACF(1) were positive, we would consider using simple AR terms. If you are going to use simple AR terms, you look at the PACF plot and see where it drops suddenly towards 0. Here, PACF(1) is quite far from 0, and PACF(2) is very close to 0, so if we were to use simple AR terms, we would use 1 AR term. But the next value, PACF(3), is away from 0 again, so this doesn't really fit an AR model, especially with the negative ACF(1). So we will not use any simple AR terms here.

Here: simple AR-terms: 0.

The number of simple AR terms to use (if any) is the number of lags before the PACF drops to 0. To illustrate, if PACF(1), PACF(2), and PACF(3) were significantly non-zero (outside the blue band) and PACF(4) were close to 0, we would use 3 AR terms. The number of AR terms used is the number of non-zero PACF spikes (ignoring the one at 0, which is always 1) before it drops to 0.

The process for the seasonal AR terms is similar, but we look at multiples of the seasonal period. Note: if the original time series was not seasonal, do not include any seasonal AR terms!

Here, the seasonal period is 12, so we look at ACF(12) and see if it is positive or negative. Here, ACF(12) is negative (about -0.4), suggesting we should not use any seasonal AR terms. This is usually the case when seasonal differencing has been used.

If we were going to use seasonal AR terms, we would find how many from the PACF plot, by looking at how many multiples of the seasonal period there are before it goes to near 0. Here, PACF(12) is non-zero, PACF(24) is close to zero, so if we used seasonal AR terms, we would use 1.

Here: seasonal AR-terms: 0. Note this down with the differencing: seasonal differencing: 1, simple differencing: 1, seasonal AR: 0, simple AR: 0.

Because the seasonally differenced series was not badly non-stationary, we are also going to try a model with no simple differencing. So we need to find the number of AR terms for that model, from the ACF and PACF plots for the seasonally differenced data:

In [39]:
plot_acf(seasonally_differenced.dropna(), lags=36);
In [40]:
plot_pacf(seasonally_differenced.dropna(), lags=36);

The ACF is positive at lag 1 and the PACF is outside the blue band at lags 1 and 2, then drops to 0. This suggests 2 simple AR terms.

Looking seasonally, the ACF is (slightly) negative at lag 12 and the PACF is not significantly different from 0 at lag 12, suggesting no seasonal AR terms.

So for the seasonally differenced series, we will use simple AR: 2, seasonal AR: 0.

Determining the number of MA terms

Once you have a stationary series and have decided how many AR terms to use, you do a similar process to determine the MA terms. Like with AR terms, there can be both simple and seasonal MA terms. Again, you use the ACF, but you do not need the PACF for MA terms. We could just refer back to the ACF plot above, but for convenience it is plotted again below.

In [41]:
plot_acf(twice_differenced.dropna(), lags=36);

For the simple MA terms, look at ACF(1) (about -0.4 here). If it is negative, you might want to include some simple MA terms; if it is positive, you probably won't. Since it is negative here, we should use simple MA terms. We look at when the ACF drops to 0 to decide how many simple MA terms to use. Here, ACF(1) is non-zero, but ACF(2) is close to 0 (inside the blue band), so we use 1 simple MA term. The number of simple MA terms to use (if any) is the number of lags before the ACF drops to 0.

For seasonal MA terms, we do the same, but looking at multiples of the seasonal period. ACF(12) is negative, suggesting we should use seasonal MA terms (this is usually the case when seasonal differencing has been used). ACF(24) is very close to 0, suggesting we should use 1 seasonal MA term. The number of seasonal MA terms to use is the number of multiples of the seasonal period before the ACF drops to 0.

Here: simple MA-terms: 1, seasonal MA-terms: 1.

We also need to do this for the series with only seasonal differencing:

In [42]:
plot_acf(seasonally_differenced.dropna(), lags=36);

The ACF is positive at lag 1, suggesting no simple MA terms. It is also not clear how many we would use if we did include any - there's no sudden drop to 0.

Looking seasonally, the ACF is (slightly) negative at lag 12, but still inside the blue band. It then drops even closer to 0 at lag 24. So we could maybe justify trying a single seasonal MA term, but it's not clear we should. Again, we can try fitting two models (bringing us up to 3 in total now!) and see at the end which is best.

It should be clear by now that fitting a SARIMAX model is not an exact science, and a degree of judgement and trial-and-error is needed!

To summarise: for our seasonally differenced series, we will not include simple MA terms, and will include either 0 or 1 seasonal MA terms.

Determining the order of the SARIMAX model

Now we bring together what we've done. We know how many orders of simple and seasonal differencing we've used (here, 1 of each), how many simple and seasonal AR terms to use (here, none of either), and how many simple and seasonal MA terms to use (here, 1 of each). These numbers define the order and seasonal order of the SARIMAX model. The order is (p,d,q) where p is the number of simple AR terms, d is the number of simple differencings, and q is the number of simple MA terms. The seasonal order is (P,D,Q,m), where P is the number of seasonal AR terms, D is the number of seasonal differencings, Q is the number of seasonal MA terms, and m is the seasonal period.

In our example, we have identified 3 possible models to try. For the first (using the twice-differenced data) the order is (0,1,1) and the seasonal order is (0,1,1,12). For the models using seasonal differencing only, the order is (2,0,0), and the seasonal order is either (0,1,0,12) or (0,1,1,12). We will use these different possible orders and seasonal orders to build our SARIMAX models. We can describe a model as SARIMAX(p,d,q)(P,D,Q,m). For non-seasonal models, the seasonal order will be (0,0,0,0), so we often drop the S and the seasonal order and just write ARIMAX(p,d,q).

So our 3 models to try are SARIMAX(0,1,1)(0,1,1,12), SARIMAX(2,0,0)(0,1,0,12), and SARIMAX(2,0,0)(0,1,1,12).
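One way to keep track of the candidates is to note them down as (order, seasonal_order) pairs, matching the arguments we will pass to SARIMAX later:

In [ ]:
# the three candidate models identified above, as (order, seasonal_order) pairs
candidate_orders = [
    ((0, 1, 1), (0, 1, 1, 12)),   # twice-differenced model
    ((2, 0, 0), (0, 1, 0, 12)),   # seasonal differencing only, no MA terms
    ((2, 0, 0), (0, 1, 1, 12)),   # seasonal differencing only, one seasonal MA term
]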

When comparing models, note that smaller values for the order and seasonal order are usually better. For instance, if you are choosing between an ARIMAX(2,1,0) model and an ARIMAX(1,1,0) model, you would try both, and if the ARIMAX(2,1,0) model performed clearly better, you would use that, but if there was little difference, you would default to the simpler ARIMAX(1,1,0). If in doubt, the simpler model is better!

In particular, you should never use a model with large numbers of both AR and MA terms - even more than one of each is risky. These models are likely to overfit the data. If you do try a model with 2 or more AR and MA terms (or even 2 of one and 1 of the other), also try a model with one fewer AR term, and a model with one fewer MA term, and compare them. This applies to both simple AR/MA terms and seasonal AR/MA terms.

Exogenous regressors (other variables)

The X in SARIMAX is for "eXogenous regressors", which is a fancy way of saying "other variables". If you have another time series, with the same index (so recorded at the same times/dates), which helps explain the behaviour of your model, you can include this.

Note: in order to forecast from a model with exogenous regressors, you will need to forecast the exogenous regressors first. So if the time series you want to forecast is related to another time series you can more easily forecast, you can use that second series as an exogenous regressor to the first. But if forecasting the second time series is just as hard as forecasting the first, it won't help you.

Using exogenous regressors is entirely optional, but if there is a variable whose behaviour you can predict very accurately and which influences your time series, it is often a good idea to include it as an exogenous regressor, as it will tend to improve your forecasts.

Sometimes, you have a time series where you know exactly what it will do in advance. For instance, the dates of public holidays are known well in advance. So a time series which is 0 when it's not a public holiday and 1 when it is can be perfectly forecast, and so makes an excellent exogenous regressor for something like supermarket sales, which will be affected by public holidays.
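For instance, a public-holiday regressor for daily data might look something like this (a hypothetical sketch - the dates and index here are made up purely for illustration):

In [ ]:
# hypothetical example: a 0/1 "is this a public holiday?" series for daily data
holiday_dates = pd.to_datetime(['2023-12-25', '2023-12-26', '2024-01-01'])
daily_index = pd.date_range('2023-12-01', '2024-01-31', freq='D')
is_holiday = pd.Series(daily_index.isin(holiday_dates).astype(int), index=daily_index)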

In our airline example, we have already noticed from studying the seasonal decomposition that the length of each month seems to have an effect, with a dip in February. So we can form a time series of "number of days per month", which can be forecast with no uncertainty at all, and use that as an exogenous regressor. It is probably not a useful thing to do in this case, since the seasonal pattern (including the February dip) will be modelled anyway, but we will do it to show the use of exogenous regressors.

We could probably look up a list of days per month, but it isn't hard to code directly. We will add a column 'days_in_month' to our dataframe recording the values of the exogenous regressor.

In [43]:
def days_in_month(date):
    is_leap_year = (date.year % 4 == 0) # divisible by 4 is enough for 1949-1960 (the full rule also excludes most century years)
    long_months = [1, 3, 5, 7, 8, 10, 12] # numbers of months with 31 days
    month = date.month
    if month in long_months:
        return 31
    elif month == 2:
        return 29 if is_leap_year else 28
    else:
        return 30

airline['days_in_month'] = airline.index.map(days_in_month)
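Incidentally, pandas can compute the same values directly from the DatetimeIndex, which avoids hand-coding the leap-year rule; either approach gives the same column:

In [ ]:
# equivalent, using pandas' built-in attribute on the DatetimeIndex
airline['days_in_month'] = airline.index.days_in_month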

Train-test split

Before building your model, you should train-test split your data. Then you will train on the training data, forecast over the period of the test data, and compare the forecast to what actually happened to evaluate your model. This is how you will pick between competing models.

Note: you only use the differenced version of your time series for choosing the MA and AR terms. Once you know the order and seasonal order of your SARIMAX model, go back to using the original series - when you fit the SARIMAX model, it will take account of the differencing order you tell it.

Typically 70-80% of your data is a good size for a training set. You should take the first part of your data set as training, and the last part for testing, not a random split like you would for a regression or classification model.

Here, we have 144 data points, so 70-80% is 100-115. 108 is a multiple of 12, and in the middle of this range, so we'll use that (there's no need to use a multiple of the seasonal period for the train-test split, but I like to).

In [44]:
train_size = 108

train = airline['log'].iloc[:train_size]
test = airline['log'].iloc[train_size:]

# we also need to train-test split any exogenous regressors we are using:
exog_train = airline['days_in_month'].iloc[:train_size]
exog_test = airline['days_in_month'].iloc[train_size:]
In [45]:
# visualise the train-test split to check it seems reasonable
fig, ax = plt.subplots(1, 1)
train.plot(ax=ax)
test.plot(ax=ax)
plt.show();

Warning: You should never do a train-test split near a sudden large change in your data. Even if it means moving away from a 70-80% split, make the split away from any sudden changes in behaviour. If there is a sudden change in behaviour, make sure you understand why it happened before trying to forecast - if your series shows sudden large changes and you don't understand why, you're unlikely to be able to trust your forecasts. After all, if it's changed suddenly and dramatically before, and you don't understand why, how do you know whether or not it'll happen again?

For instance, I would never try to forecast the time series below without understanding what caused the huge spike and the smaller one near the end, and if I did, I would make sure the train-test split was not right next to a spike.

Build your model

Import the SARIMAX model from the statsmodels library and fit it to your training data. Remember to specify the order and seasonal_order of your model, as you found above.

If you have multiple candidates for the order or seasonal order, create and fit multiple models - and make sure you give them clear names! Here, we have 3 models to test: one with 2 orders of differencing, one with seasonal differencing only and a seasonal MA term, and one with seasonal differencing only and no MA terms.

In [46]:
from statsmodels.tsa.statespace.sarimax import SARIMAX
twice_differenced_model = SARIMAX(train, order=(0,1,1), seasonal_order=(0,1,1,12), exog=exog_train).fit()
seasonal_ma_model = SARIMAX(train, order=(2,0,0), seasonal_order=(0,1,1,12), exog=exog_train).fit()
non_ma_model = SARIMAX(train, order=(2,0,0), seasonal_order=(0,1,0,12), exog=exog_train).fit()

Evaluate your model

Use your model to forecast over the period of the test set, and plot that with the actual test data, to see how well your model performs.

You can also choose a confidence interval for your forecasts, to plot that as well. If doing this for multiple models, do it on separate plots to avoid overcrowding the plot. Here, we will choose a 90% confidence interval, which we do by setting alpha=0.1 (to convert an alpha value into a confidence level, multiply alpha by 100 and subtract from 100. To convert a confidence level to an alpha value, subtract the confidence level from 100 and divide by 100).
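In code, that conversion is simply:

In [ ]:
confidence_level = 90                      # per cent
alpha = (100 - confidence_level) / 100     # 0.1, as used below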

If using exogenous regressors, as here, the test values of those need to be input into the forecast function.

In [47]:
# find size of test set, to know how many points to forecast
test_size = airline.shape[0] - train_size
# get the forecasts
twice_differenced_forecast = twice_differenced_model.get_forecast(test_size, exog=exog_test).summary_frame(alpha=0.1) # alpha = 0.1 for 90% confidence level
seasonal_ma_forecast = seasonal_ma_model.get_forecast(test_size, exog=exog_test).summary_frame(alpha=0.1)
non_ma_forecast = non_ma_model.get_forecast(test_size, exog=exog_test).summary_frame(alpha=0.1)
In [48]:
# for the twice_differenced model
fig, ax = plt.subplots(figsize=(15, 5))

# plot the original data
ax.plot(airline.index, airline['log'])

# plot the forecasts
ax.plot(airline.index[-test_size:], twice_differenced_forecast['mean'])
# plot the confidence intervals on the forecasts
ax.fill_between(airline.index[-test_size:], twice_differenced_forecast['mean_ci_lower'], twice_differenced_forecast['mean_ci_upper'], color='k', alpha=0.1);

We see that the forecast follows the shape of the test data quite well, though it does consistently overestimate it. The grey confidence intervals include the test values. Note also how the confidence intervals get broader as the forecast moves further away from the end of the training set. This is normal - it is generally easier to predict the near future accurately than the distant future!

We should also plot the residuals of the forecast - the differences between the forecast values and the true test-set values. As with the residuals in the decomposition section, residuals should be random noise, with no apparent trend, seasonality, or heteroskedasticity. They should also be centred on 0.

In [49]:
twice_differenced_resid = twice_differenced_forecast['mean'] - test
twice_differenced_resid.plot();

There is no sign of trend or seasonality, and the residuals are homoskedastic. However, they are centred higher than 0 (this corresponds to the consistent overestimation we saw in the previous graph). This is a problem with our forecast.

We also compute the mean squared error for the forecasts. Note: the actual values of the mean squared error are very small here, because we are using log-transformed data, which tends to lead to much smaller values. The actual values of the mean squared error don't matter so much as the comparison between those values for different models.

In [50]:
from sklearn.metrics import mean_squared_error
In [51]:
mean_squared_error(test, twice_differenced_forecast['mean'])
Out[51]:
0.006844109610484803

We now repeat this for our other two models:

In [52]:
# for the seasonal_ma model
fig, ax = plt.subplots(figsize=(15, 5))

# plot the original data
ax.plot(airline.index, airline['log'])

# plot the forecasts
ax.plot(airline.index[-test_size:], seasonal_ma_forecast['mean'])
# plot the confidence intervals on the forecasts
ax.fill_between(airline.index[-test_size:], seasonal_ma_forecast['mean_ci_lower'], seasonal_ma_forecast['mean_ci_upper'], color='k', alpha=0.1);
In [53]:
# plot residuals
seasonal_ma_resid = seasonal_ma_forecast['mean'] - test
seasonal_ma_resid.plot();

The residuals have no seasonality and are homoskedastic. They are also better centred around 0 than the previous model's residuals. However, they may have a slight downward trend, suggesting that this model overestimates in the short term and then underestimates in the longer term.

In [54]:
mean_squared_error(test, seasonal_ma_forecast['mean'])
Out[54]:
0.0023300595946771515
In [55]:
# for the non_ma model
fig, ax = plt.subplots(figsize=(15, 5))

# plot the original data
ax.plot(airline.index, airline['log'])

# plot the forecasts
ax.plot(airline.index[-test_size:], non_ma_forecast['mean'])
# plot the confidence intervals on the forecasts
ax.fill_between(airline.index[-test_size:], non_ma_forecast['mean_ci_lower'], non_ma_forecast['mean_ci_upper'], color='k', alpha=0.1);
In [56]:
# plot residuals
non_ma_resid = non_ma_forecast['mean'] - test
non_ma_resid.plot();

Here the residuals are non-seasonal, homoskedastic, and centred around 0, but there is a clear downward trend. This model overestimates in the near future and underestimates further ahead.

In [57]:
mean_squared_error(test, non_ma_forecast['mean'])
Out[57]:
0.004182004451632522

Visually, the forecast seems best for the seasonal_ma model. This model also has the lowest mean squared error, at 0.002, compared with 0.004 and 0.007 for the other models. Finally, the residual plot for the seasonal_ma model is closer to random noise around 0 than either of the others: the first model's residuals were all greater than 0, and the last model's residuals had a clear downward trend.
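To see the comparison at a glance, you could also gather the three scores into one place; a minimal sketch using the objects already defined above:

# collect the mean squared errors computed above into a single, sorted comparison
pd.Series({
    'twice_differenced': mean_squared_error(test, twice_differenced_forecast['mean']),
    'seasonal_ma': mean_squared_error(test, seasonal_ma_forecast['mean']),
    'non_ma': mean_squared_error(test, non_ma_forecast['mean']),
}).sort_values()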

So all indicators are telling us to select the seasonal_ma model as the best model for our data. This was the SARIMAX(2,0,0)(0,1,1,12) model.

Deploy your model

Testing has told us which model to use: in our case, the SARIMAX(2,0,0)(0,1,1,12) model. So far we have only fit this model to the training data, which is not our entire dataset; fitting to just the training data is an intermediate step for comparing and assessing candidate models. We should now refit the chosen model to the entire dataset.

Refit your model and forecast

In [58]:
# refit model to entire dataset
model = SARIMAX(airline['log'], order=(2,0,0), seasonal_order=(0,1,1,12), exog=airline['days_in_month']).fit()
C:\Users\Harry Gulliver\anaconda3\lib\site-packages\statsmodels\base\model.py:566: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
  warnings.warn("Maximum Likelihood optimization failed to "

The above warning indicates a problem that can sometimes occur with SARIMAX and other machine learning algorithms: the algorithm that fits the model struggles to arrive at a final answer. There are alternative methods that can be used when fitting the model, and often trying a different method will produce better results. You should not proceed with a model that has produced a ConvergenceWarning; try different methods until you find one that converges. The available methods are:

  • ‘newton’ for Newton-Raphson
  • ‘nm’ for Nelder-Mead
  • ‘bfgs’ for Broyden-Fletcher-Goldfarb-Shanno (BFGS)
  • ‘lbfgs’ for limited-memory BFGS with optional box constraints
  • ‘powell’ for modified Powell’s method
  • ‘cg’ for conjugate gradient
  • ‘ncg’ for Newton-conjugate gradient
  • ‘basinhopping’ for global basin-hopping solver

You do not need to know what any of these mean. If you get a ConvergenceWarning like we have here, just try different methods in the fit() call until you find one that does not give you a warning.
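For example, here is a minimal sketch of how you could try methods automatically until one fits without a warning. It uses Python's built-in warnings module to detect the problem; the import path for statsmodels' ConvergenceWarning class is an assumption, and disp=False simply silences the optimiser's printed output:

import warnings
from statsmodels.tools.sm_exceptions import ConvergenceWarning

for candidate_method in ['lbfgs', 'powell', 'nm', 'bfgs', 'cg']:
    # record any warnings raised while fitting with this method
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        candidate = SARIMAX(airline['log'], order=(2,0,0), seasonal_order=(0,1,1,12),
                            exog=airline['days_in_month']).fit(method=candidate_method, disp=False)
    # keep the first fit that did not raise a ConvergenceWarning
    if not any(issubclass(w.category, ConvergenceWarning) for w in caught):
        print(f"method '{candidate_method}' fit without a ConvergenceWarning")
        model = candidate
        break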

In [59]:
# refit model to entire dataset - trying a different method
model = SARIMAX(airline['log'], order=(2,0,0), seasonal_order=(0,1,1,12), exog=airline['days_in_month']).fit(method='powell')
Optimization terminated successfully.
         Current function value: -1.691982
         Iterations: 4
         Function evaluations: 256

Note that when you specify a method in fit(), it prints a short report. Pay attention to the number of Iterations: the default maximum is 50, and when the algorithm reaches that limit without converging it stops and gives the ConvergenceWarning we saw earlier. If the number of iterations is close to 50, you might still want to try a different method. Here, the algorithm fit the model in only 4 iterations, suggesting this method has worked well.

Now that we have successfully fit our model, we can use it to forecast. This is very similar to the evaluation step, except that now we are forecasting from our whole dataset into the unknown, whereas before we were forecasting from our training set over the period of the test set. If we applied any transformations to the data, we must remember to reverse those transformations on the forecasts. Be warned that your confidence intervals may be less reliable if you have calculated them on transformed data and then detransformed them. In any case, confidence intervals should only ever be used as a guideline, never as definitive.

When using exogenous regressors, as we are, you need to supply the get_forecast method with forecasts of the exogenous regressor. Since our exogenous regressor is just the number of days in each month, it is easy to forecast it with certainty.

When using exogenous regressors that are NOT determined with certainty in advance, you need to be careful. You should consider carefully how you will forecast those regressors and how confident you can be in your forecast. Often a useful approach is to forecast a "worst-case scenario", "best-case scenario", and "most likely scenario" for the regressor, and then use each of those three scenarios in turn to produce forecasts of your main series of interest. That way, you have a range of forecasts that you can reasonably expect to include the true future values.

Note that the confidence intervals on the forecast assume the exogenous forecast is perfect. So if there is any uncertainty in your exogenous forecast, your final forecast of the main series will be less certain than it appears from the confidence intervals. This is another reason for considering several different scenarios for the exogenous variable, rather than a single forecast. It is also another reason for only using an exogenous variable you can forecast accurately.
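As a purely illustrative sketch: days_in_month is known exactly, so scenarios add nothing here, but if your regressor were uncertain you could do something along these lines. The three scenario arrays below are hypothetical numbers invented for illustration, not real forecasts:

# hypothetical scenario values for the exogenous regressor over a 24-month forecast horizon
hypothetical_scenarios = {
    'worst case':  np.full(24, 29.0),
    'most likely': np.full(24, 30.4),
    'best case':   np.full(24, 31.0),
}

# one forecast of the main series per scenario; together they give a range you can
# reasonably expect to bracket the true future values
scenario_forecasts = {
    name: model.get_forecast(24, exog=exog_values).summary_frame(alpha=0.1)['mean']
    for name, exog_values in hypothetical_scenarios.items()
}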

In [60]:
# check the last date in our dataset
airline.index[-1]
Out[60]:
Timestamp('1960-12-31 00:00:00', freq='M')
In [61]:
# set up the forecast period. We will forecast 2 years (24 months)
forecast_period = pd.date_range(start="1961-01-31", periods=24, freq="M") # freq="M" for monthly data

# create our forecast of the exogenous variable. This will be different depending on what your exogenous variable is and how you will forecast it. See above
exog_forecast = forecast_period.map(days_in_month).values
In [62]:
# get the forecast
forecast = model.get_forecast(24, exog=exog_forecast).summary_frame(alpha=0.1) # alpha = 0.1 for 90% confidence level
In [63]:
# plot the forecast
fig, ax = plt.subplots(figsize=(15, 5))

# plot the original data. note: we plot the original time series now, not the transformed version
ax.plot(airline.index, airline['International airline passengers in thousands'])

# plot the forecasts. apply np.exp() to undo the effects of the log transformation we applied
ax.plot(forecast_period, np.exp(forecast['mean'])) # if you didn't have to transform, remove the np.exp
# plot the confidence intervals on the forecasts
ax.fill_between(forecast_period, np.exp(forecast['mean_ci_lower']), np.exp(forecast['mean_ci_upper']), color='k', alpha=0.1); # apply np.exp() to undo the transformation
In [64]:
# we can also plot just the forecast, to get a clearer view of it
fig, ax = plt.subplots(figsize=(15, 5))
ax.plot(forecast_period, np.exp(forecast['mean'])) # apply np.exp() to undo the effects of the log transformation we applied
# plot the confidence intervals on the forecasts
ax.fill_between(forecast_period, np.exp(forecast['mean_ci_lower']), np.exp(forecast['mean_ci_upper']), color='k', alpha=0.1); # apply np.exp() to undo the transformation

We now have a forecast of our original time series, with 90% confidence intervals. We are in a position to predict things like "in 1961, total monthly airline passengers should peak at around 650 000" (the original data are in thousands, which is why the y-axis only shows hundreds).

We can also print the values for a more precise view than reading off the graph:

In [65]:
np.exp(forecast)
Out[65]:
mean   mean_se   mean_ci_lower   mean_ci_upper
1961-01-31 447.358304 1.038010 420.733062 475.668470
1961-02-28 412.779281 1.045755 383.493927 444.300999
1961-03-31 470.926957 1.054473 431.581971 513.858812
1961-04-30 486.811911 1.061277 441.444737 536.841459
1961-05-31 502.742526 1.067532 451.505396 559.794079
1961-06-30 574.032791 1.073140 511.106853 644.705982
1961-07-31 659.965413 1.078308 582.994316 747.098786
1961-08-31 655.104409 1.083094 574.500000 747.017905
1961-09-30 546.283348 1.087566 475.832300 627.165278
1961-10-31 486.942994 1.091767 421.463597 562.595395
1961-11-30 419.481061 1.095732 360.914631 487.551198
1961-12-31 464.659820 1.099489 397.541124 543.110474
1962-01-31 480.740456 1.110149 404.822916 570.895018
1962-02-28 443.314506 1.117323 369.372582 532.058307
1962-03-31 505.412261 1.124761 416.542098 613.243066
1962-04-30 522.117930 1.131483 426.113390 639.752563
1962-05-31 538.847956 1.137888 435.703325 666.410154
1962-06-30 614.857889 1.143921 492.857684 767.057581
1962-07-31 706.445730 1.149657 561.633570 888.596402
1962-08-31 700.794254 1.155116 552.816316 888.382945
1962-09-30 584.013529 1.160328 457.295436 745.845628
1962-10-31 520.248220 1.165316 404.501807 669.114960
1962-11-30 447.893675 1.170098 345.906922 579.950071
1962-12-31 495.827167 1.174691 380.466253 646.166586

We can use the confidence intervals to give upper and lower bounds on our predictions. For instance, in July 1961, we expect at least 583 000 passengers and no more than 747 000.
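If you prefer to pull those bounds out programmatically rather than reading them off the table, a quick sketch (assuming the forecast index contains the month-end dates shown above):

# back-transformed 90% interval for July 1961
np.exp(forecast.loc['1961-07-31', ['mean_ci_lower', 'mean_ci_upper']])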

Automating forecasts

Sometimes you might only want a single forecast from one point in time. But often you will want to keep making and updating forecasts as time goes on and new data come in. You don't want to have to redo a lot of work every time a new data point is measured in your time series, so an important part of deploying a forecasting model that will continue to be used is automating the future work.

Every time a new data point becomes available, you will want to retrain your model taking account of the new data, and re-forecast based on that new information. That way you always have access to the forecasts from the most up-to-date data available. You can write functions to do all of this for you. Some examples of functions you might want to write are below.

In [66]:
def import_and_clean():
    '''
    this function imports the data and applies the cleaning processes to it.
    It assumes that every time a new data point becomes available, it is added
    to the file with path ./Data/airline.csv relative to this notebook; you will
    of course want to modify that for your use.
    You will also want to modify the cleaning steps to match what you had to do
    with your data
    Since we used an exogenous regressor for the model, which was calculated from
    the data, we also add that in in this function. In other cases, you might
    need to read in the new value of the exogenous regressor from a separate
    data source and join it on
    '''
    # import and set index
    airline = pd.read_csv('./Data/airline.csv')
    airline.Month = pd.to_datetime(airline.Month, format='%Y-%m')
    airline.set_index('Month', inplace=True)
    
    # pad null values and resample
    airline.fillna(method='pad', inplace=True)
    airline = airline.resample("M").pad()
    
    # apply transform
    airline['log'] = np.log(airline['International airline passengers in thousands'])
    
    # add exogenous regressor column
    airline['days_in_month'] = airline.index.map(days_in_month) # depends on the days_in_month function which we defined earlier,
    # if putting this in a separate script, will need to put the definition of days_in_month in that script too
    
    return airline
In [67]:
def reforecast(data):
    '''
    this function takes the dataframe and refits the model, then forecasts
    it has the particular model we selected hardcoded: SARIMAX(2,0,0)(0,1,1,12),
    with days_in_month as an exogenous regressor.
    '''
    # refit model
    model = SARIMAX(data['log'], order=(2,0,0), seasonal_order=(0,1,1,12), exog=data['days_in_month']).fit(method='powell')
    
    # set up the forecast period. We will forecast 2 years (24 months)
    end_of_data = data.index[-1]
    forecast_period = pd.date_range(start=end_of_data, periods=25, freq="M")[1:] # freq="M" for monthly data
    # date_range includes the last month of the data itself, so we ask for 25 periods and drop the first
    # with [1:], leaving the 24 months AFTER the end of the data to match the 24 steps in get_forecast below

    # create our forecast of the exogenous variable. This will be different depending on what your exogenous variable is and how you will forecast it
    exog_forecast = forecast_period.map(days_in_month).values
    
    return model.get_forecast(24, exog=exog_forecast).summary_frame(alpha=0.1) # alpha = 0.1 for 90% confidence level
In [68]:
def plot_forecast(data, forecast):
    """
    this function plots the forecast and saves it as a png file
    """
    # plot the forecast
    fig, ax = plt.subplots(figsize=(15, 5))

    # plot the original data. note: we plot the original time series now, not the transformed version
    ax.plot(data.index, data['International airline passengers in thousands'])
    
    forecast_period = forecast.index

    # plot the forecasts. apply np.exp() to undo the effects of the log transformation we applied
    ax.plot(forecast_period, np.exp(forecast['mean'])) # if you didn't have to transform, remove the np.exp
    # plot the confidence intervals on the forecasts
    ax.fill_between(forecast_period, np.exp(forecast['mean_ci_lower']), np.exp(forecast['mean_ci_upper']), color='k', alpha=0.1); # apply np.exp() to undo the transformation
    
    # get today's date as a string to go in the file name - so when we save a new forecast plot, we won't overwrite the previous one
    from datetime import date
    today = date.today()
    today = today.strftime('%Y-%m-%d')
    
    # save the plot as a png file
    plt.savefig(f'./airline_forecast_on_{today}.png')
In [69]:
def update():
    """
    combines the previous functions to update the forecast and produce and save the plot
    """
    updated_data = import_and_clean()
    updated_forecast = reforecast(updated_data)
    plot_forecast(updated_data, updated_forecast)

You can then simply call the update() function every time a new data point is added to produce and save a graph of the updated forecast, taking into account the new data.

You could also write a function to use the previous version of the model (fit to all data points except the newly available one) to produce a one-step-ahead forecast of the new data point, and check the error. It could then notify you if the error is bigger than a tolerance you specify. In this way, you can automatically monitor your model as time goes by to check if it becomes less accurate. After all, the behaviour of the time series could change, and maybe the model you picked is no longer the best, so you might need to repeat the entire modelling process. Having some automatic monitoring to tell you if your predictions start getting worse can alert you to this.
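A minimal sketch of such a monitoring function, reusing import_and_clean() from above; the default tolerance and the choice to measure the error on the log scale are assumptions you would adapt to your own data:

def check_latest_forecast_error(tolerance=0.05):
    """
    Refit the chosen model on all data EXCEPT the newest point, make a one-step-ahead
    forecast of that point, and warn if the absolute error (on the log scale) exceeds
    the tolerance. Purely a sketch - adapt the model spec and tolerance to your case.
    """
    data = import_and_clean()
    history, newest = data.iloc[:-1], data.iloc[-1]

    # refit the selected model on everything except the newest observation
    previous_model = SARIMAX(history['log'], order=(2,0,0), seasonal_order=(0,1,1,12),
                             exog=history['days_in_month']).fit(method='powell', disp=False)

    # one-step-ahead forecast, supplying the (known) exogenous value for the new month
    one_step = previous_model.get_forecast(1, exog=[[newest['days_in_month']]]).summary_frame()
    error = abs(one_step['mean'].iloc[0] - newest['log'])

    if error > tolerance:
        print(f"Warning: one-step-ahead error {error:.4f} exceeds tolerance {tolerance}")
    return error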

There are many options for automating the deployment of your time series modelling, limited really only by your imagination and your use case! You can also create a Python script (rather than a Jupyter notebook) and schedule it to run at regular intervals - so you don't even need to open a notebook and run the update() function every time a new data point comes in. How to do this is beyond the scope of this guide, but Google will help you!

Some warnings

Do not try to forecast too far into the future. The further into the future you forecast, the less reliable your forecasts become. There are no hard rules for how far is too far, as it depends a lot on how volatile the data are and how confident you are in your understanding of the factors driving the time series. Just bear in mind that your predictions become much less certain as you look further and further ahead.

The confidence intervals in the forecast are guidelines only. They do not guarantee that the time series will remain within those bounds in future. They are estimates, and are also based on certain assumptions, which may not always be valid. In particular, if you forecast from a transformed series, your confidence intervals on the detransformed series might not be accurate.

Unexpected changes can happen. For example, just because your time series has shown a steady upward trend for all its values so far doesn't mean it will continue to. Maybe something will happen to change that, or maybe there is some natural limit to how high it can grow before its growth must slow. Always remember that your forecasts might be wrong, and maybe even drastically wrong.

Domain knowledge is crucial. Often the best forecasts are made by combining statistical methods like SARIMAX with the judgement of a range of experts. Any time series analysis should be accompanied by careful, critical examination of the situation, the data, and the conclusions.

If using exogenous regressors, the reliability of your forecasts for the main series depends on your forecasts for the regressors. This is not a problem for exactly known regressors, like our days_in_month example, but any uncertainty in the forecast of a regressor makes your main forecast more uncertain. In particular, the confidence intervals will be narrower than they should be - do not trust the confidence intervals if using uncertain exogenous regressors.