Time Series Forecasting With Autoregression

Forecasting (predicting future values) with time series can be tricky. This is because time series data may exhibit behavior that violates the assumptions of many modeling methods. Because of this, there are a few special considerations you need to make when working with time series data. This post serves as an introduction to, and reference for, some of the behaviors you should look for when modeling time series data with autoregression.

Why We Need To Be Careful

First, we should note that time series behavior is of particular concern for parametric models (models for which we make an assumption about the functional form of the process that generates the series). For non-parametric models (models where we don't make assumptions about the form of our series' generative process, such as neural networks and tree based methods), we may not need to worry about time's effect on our model parameters.

We'll be talking about considerations you need to make when using a parametric model to forecast a response variable $Y$ with time series data. This matters because the behavior of the time series determines which functional form we choose to model $Y$ with (its characteristic equation).

Let's assume we're going to use an autoregressive forecast model, fitted using Ordinary Least Squares (OLS). A key assumption of OLS is that the observations in our dataset are independent. This means that the characteristic equation of $Y$ is based on a random variable that follows a probability distribution whose moments (mean, variance, skew, kurtosis, etc.) are constant. In other words: none of the parameters of the underlying distribution of $Y$ are a function of time. A time series that meets these criteria is said to be stationary.

Formally, this means that the characteristic equation we've chosen does not have a unit root. If the characteristic equation of $Y$ has a unit root, then at least one of the moments of its underlying probability distribution is a function of time. Thus, the key assumptions of our regression are violated and the regression model will not accurately describe the data. A time series whose characteristic equation has a unit root is non-stationary.

Using linear models with stationary time series, though imperfect, works reasonably well. We can transform a non-stationary time series into a stationary one by identifying its problematic behavior and applying the appropriate transformation(s).
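For instance, a series with a linear trend can often be stationarized by taking its first difference. Here's a minimal sketch with pandas - the data is synthetic and purely illustrative, not the series used later in this post:

```python
import numpy as np
import pandas as pd

# a toy non-stationary series: a linear trend (slope 0.5) plus noise
rng = np.random.default_rng(0)
y = pd.Series(np.arange(100) * 0.5 + rng.normal(scale=1.0, size=100))

# first difference: y_t - y_{t-1} removes the linear trend
y_diff = y.diff().dropna()

# the differenced series now hovers around the constant slope (~0.5)
print(round(y_diff.mean(), 2))
```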

Stationary Time Series

Let's examine the characteristics of a stationary time series. Here's an example below that I generated using random data:


By itself, it doesn't look like much. There are no patterns in the data over time - and that's exactly what we want. Patterns over time generally indicate a dependence on it. Let's add a rolling mean and a rolling standard deviation to see what the absence of time-dependent characteristics looks like:
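As a sketch of how such rolling statistics could be computed, here's an example using a synthetic white-noise series with pandas (the 30-observation window size is an arbitrary choice for illustration):

```python
import numpy as np
import pandas as pd

# white noise: i.i.d. draws from a fixed normal distribution,
# so its mean and variance do not depend on time
rng = np.random.default_rng(42)
series = pd.Series(rng.normal(loc=0.0, scale=1.0, size=500))

# 30-observation rolling statistics; for a stationary series,
# both should stay roughly flat over time
rolling_mean = series.rolling(window=30).mean()
rolling_std = series.rolling(window=30).std()
```

Plotting `series`, `rolling_mean`, and `rolling_std` on the same axes reproduces the kind of diagnostic plot shown above.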


Note how the rolling mean and rolling standard deviation are relatively constant. This indicates that $Y$ does not depend on a function of time. We can show this more clearly by looking at both the residuals of the data (each value $y_i$ minus the sample mean $\bar{y}$) and a best fit line fit on the residuals (the line that minimizes the RSS):


Our best fit line has a near-zero slope, and the residuals appear to be independently distributed and centered around 0 with constant variance. Note that examining how $Y$ varies around its mean in this way is closely related to taking its first difference (each value $y_t$ minus the previous value $y_{t-1}$).
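As an illustration, here's a sketch of how the residuals and their best fit line could be computed with NumPy - the data is a synthetic stationary series, not the one plotted above:

```python
import numpy as np
import pandas as pd

# synthetic stationary series for illustration
rng = np.random.default_rng(42)
series = pd.Series(rng.normal(size=500))

# residuals: each value minus the sample mean
residuals = series - series.mean()

# best fit line on the residuals (the line minimizing the RSS)
t = np.arange(len(residuals))
slope, intercept = np.polyfit(t, residuals, 1)

# for a stationary series the fitted slope should be near zero
print(round(slope, 4))
```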

Dickey-Fuller Test

How can we determine if the characteristic equation of $Y$ has a unit root? With a unit root test. There are a multitude of unit root tests (Dickey-Fuller, Phillips-Perron, KPSS, etc.), but the specific test you should use depends on your null hypothesis about the underlying process of $Y$. In this post we'll examine the Augmented Dickey-Fuller test, which is appropriate when the characteristic equation we've chosen is an autoregressive process.

Let's take a moment to recall what an autoregressive process is. It's a way to represent sequential data in which each response variable $y_t$ depends on one or more of its previous, or "lagged", values $y_{t-1}, y_{t-2}, \ldots, y_{t-n}$ and a stochastic (random) term. For example, a simple autoregressive model is represented by:

$$y_t = \rho y_{t-1} + \mu_t$$

Where $\rho$ is a coefficient representing the effect of the lagged value $y_{t-1}$ on $y_t$ and $\mu_t$ is an error term. In this case, a unit root is present if $\rho = 1$; if $\rho \geq 1$, $Y$ may grow or shrink without bound. If $|\rho| < 1$, then $Y$ will oscillate around the mean of its probability distribution, which satisfies the assumptions we've made about $Y$.
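To make this concrete, here's a small simulation of the AR(1) process above for a stationary value of $\rho$ and a unit-root value (the seed and sample size are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
eps = rng.normal(size=1000)

def simulate_ar1(rho, eps):
    """Simulate y_t = rho * y_{t-1} + eps_t starting from y_0 = 0."""
    y = np.zeros(len(eps))
    for t in range(1, len(eps)):
        y[t] = rho * y[t - 1] + eps[t]
    return y

# rho < 1: oscillates around 0 with bounded variance
stationary = simulate_ar1(0.5, eps)

# rho = 1 (a unit root): a random walk whose variance grows with t
random_walk = simulate_ar1(1.0, eps)
```

Plotting the two series side by side shows the stationary one hugging zero while the random walk wanders far from it.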

We'll use a slightly more complicated characteristic equation:

$$\Delta y_t = \alpha + \beta t + \gamma y_{t-1} + \delta_1 \Delta y_{t-1} + \cdots + \delta_{p-1} \Delta y_{t-p+1} + \varepsilon_t,$$

In which $\alpha$ is a constant, $\beta$ is the coefficient on a time trend $t$, $\gamma$ is analogous to $\rho$, and $p$ is the number of lag terms to include. This form allows us to model problematic behavior such as drift and seasonality. One thing to note: when performing a unit root test, our null hypothesis is that a unit root is present, while our alternative can vary. Generally, it's that the series is either stationary or trend-stationary (a unit root is not present but a trend is).


Based on the data above, we should use the Augmented Dickey-Fuller test with an alternative hypothesis that the series is stationary with no trend. Using statsmodels.tsa.stattools.adfuller to test this (regression='c' includes a constant but no trend), we obtain the following p-value:

from statsmodels.tsa.stattools import adfuller

results = adfuller(df.data, regression='c', autolag='AIC')
print(f"P-value: {results[1]}")

>>> P-value: 0.007923477202111397

Since the p-value is well below .05, we can reject the null hypothesis that a unit root is present. We can treat this series as stationary, and we should feel comfortable using a trendless autoregression model to forecast it.

Non-Stationary Time Series

Let's take a look at a non-stationary time series and examine how its behavior violates the assumptions of a linear model. The plot below shows the number of homes sold in Seattle each month from June 2008 through May 2016.


Off the bat, we can tell there are some patterns. Here's the same plot, with a 6 month rolling mean and standard deviation:


There's clearly an upward trend in both mean and standard deviation. Taking a look at the first difference confirms this:


However, there also appears to be a cyclical effect that repeats itself every 12 months. This type of pattern is referred to as seasonality, and it's a common behavior of time series. To confirm its presence, we can take a seasonal difference of the data based on the cycle period length (find each value of $y_i - y_{i-12}$):


Note how the best fit line is now horizontal - this indicates that we've removed the seasonal trend from the data, and may have succeeded in stationarizing it. Its non-zero intercept indicates that there's a constant upward trend, as the plot of the first difference also suggested.

We can test for a unit root in the presence of a constant upward trend with a variant of the augmented Dickey-Fuller test:

# get seasonal difference
seasonal_diff = df['Homes Sold'] - df['Homes Sold'].shift(12)
seasonal_diff = seasonal_diff[~seasonal_diff.isnull()]

results = adfuller(seasonal_diff, regression='c', autolag='AIC')
print(f"P-value: {round(results[1], 4)}")

>>> P-value: 0.0694

We can't reject our null at a significance level of .05, which indicates that there may be some behavior we still haven't accounted for. However, for the purpose of this blog post we'll simply cheat and use a level of .1 :)


Assuming we've stationarized the data, the plots above tell us that we should use a seasonal AR model with a constant upward trend, a lag of 12, and a seasonal period of 12 (for an annual cycle). We'll use a basic train-test split to assess our model performance, using the first 6 years to train the forecast model and the last 2 to test our forecast.

Below we setup and train the model, forecast values over the 2 year period from 2014-06-30 to 2016-05-31, and plot the original data along with the model's predictions over its training period and the forecasted values.

import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX

# get train, test split
train = df['Homes Sold'].iloc[:72]
test = df['Homes Sold'].iloc[72:]

# fit model, get results
# (settings follow the discussion above: an AR lag of 12, a seasonal
# difference with period 12, and a constant time trend)
model = SARIMAX(train, order=(12, 0, 0), seasonal_order=(0, 1, 0, 12), trend='t')
sarimax_fit = model.fit()

# forecast the 2 year (24 month) test period
forecast = sarimax_fit.forecast(steps=24)

# setup figure, axis
fig = plt.figure(figsize=(12, 4))
ax = plt.subplot(111)

# plot original data and in-sample predictions
data = ax.plot(df['Homes Sold'], alpha=0.67)[0]
y_pred = ax.plot(train.index, sarimax_fit.fittedvalues, alpha=0.67)[0]

# plot forecasted values
y_pred_forecast = ax.plot(forecast)[0]

# set title, legend
ax.set_title('Seattle Monthly Home Sales SARIMAX Forecast')
ax.legend([data, y_pred, y_pred_forecast],
          ['Homes Sold', 'In-Sample Prediction', 'Forecast'], loc=0)




Our model appears to follow the data reasonably well, with some caveats. Below is a plot of the residuals of the model on its training data (referred to as the in-sample residuals - the errors from the predictions made using the training sample), shown in blue, along with a best fit line on those errors; and the residuals of the forecasted values, shown in green, with their own best fit line.


The best fit line on the training values is near zero, which implies that the model has accounted for the data's non-stationary behavior. However, note that the best fit line of the forecast residuals has a noticeable negative slope: the model's errors are growing in a negative direction over time, which signals non-stationarity. Should we be concerned about this?

Yes and no. The reason is that the amplitude (the range between max and min) of the annual cycles in the test data (2015 and 2016) is greater than in our training data. We can see this change in volatility by calculating the coefficient of variation (standard deviation divided by mean) for sales in each year:

Year Coefficient of Variation
2009 0.26
2010 0.22
2011 0.17
2012 0.16
2013 0.19
2014 0.18
2015 0.25
2016 0.28

The average from 2009 through 2014 is 0.197, and from 2015 through 2016 it's 0.265. Unfortunately, since there's no increasing trend in volatility in the training data, our model has not captured this behavior.
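As a sketch, the per-year coefficient of variation can be computed with a pandas groupby. The real dataset isn't reproduced here, so this example uses a synthetic stand-in with the same assumed shape (a DatetimeIndex and a 'Homes Sold' column):

```python
import numpy as np
import pandas as pd

# synthetic stand-in for the monthly home sales data
rng = np.random.default_rng(0)
idx = pd.date_range('2008-06-01', periods=96, freq='MS')
df = pd.DataFrame({'Homes Sold': rng.normal(loc=2000, scale=300, size=96)},
                  index=idx)

# coefficient of variation = standard deviation over mean, by year
yearly = df['Homes Sold'].groupby(df.index.year)
cv = yearly.std() / yearly.mean()
print(cv.round(2))
```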

However, this doesn't mean our forecasts are bad. Here's a plot of the residuals of our SARIMAX forecast compared to those of a simple linear regression model and a standard ARIMA model:


The Root Mean Squared Error (RMSE) of the models:

Model RMSE
SLR 1104.026
ARIMA 1022.765
SARIMAX 867.245
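For reference, RMSE is straightforward to compute directly with NumPy; here's a minimal sketch (the values in the usage example are made up, not the actual forecast residuals):

```python
import numpy as np

def rmse(actual, predicted):
    """Root Mean Squared Error: sqrt of the mean squared residual."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.sqrt(np.mean((actual - predicted) ** 2))

# toy usage example
print(rmse([3, 5, 7], [2, 5, 9]))
```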

You can see that our SARIMAX model performs significantly better than the others. If you look closely at the graph of the residuals above, you'll see that the residuals from the ARIMA and SARIMAX models start relatively close and begin to diverge. This is because the ARIMA model does not take drift into account - it has no constant time trend. If we were to extend our forecast further out, the difference in RMSE between the ARIMA and SARIMAX models would likely keep increasing.

Closing Remarks

I hope this post was helpful for you. However, if something about my explanation didn't click, here's a great intro post on the subject. And if you have any questions, concerns or corrections for me, please feel free to reach out to me on LinkedIn. Thanks for reading!
