Whether you’re a scientist analyzing earthquake data to predict the next “big one” or a healthcare analyst studying patient wait times to better staff your ER, understanding time series data is crucial to making better, data-informed decisions.
This gentle introduction to time series will help you understand the components that make up a series, such as trend, noise, and seasonality. It will also cover how to remove some of these components and why you would want to. We’ll explain some common statistical and machine learning models for forecasting and anomaly detection, and briefly dive into how neural networks can provide better results for some types of analysis.
A time series is a sequence of data points in chronological order, most often gathered at regular intervals. Time series analysis can be applied to any variable that changes over time and, generally speaking, data points that are closer together in time are more similar than those further apart.
Components of Time Series Data
Most often, the components of time series data will include a trend, seasonality, noise or randomness, cycles, and the level. Before we define these terms, it’s important to note that not all time series data will include every one of these components. For instance, audio files recorded in sequence are examples of time series data, but they won’t contain a seasonal component (although they would have periodic cycles). On the other hand, most business data will likely contain seasonality, such as retail sales peaking in the fourth quarter.
Here are the various components that can occur in time series data:
Level: When you read about the “level” or the “level index” of time series data, it’s referring to the mean of the series.
Noise: All time series data will have noise or randomness in the data points that aren’t correlated with any explained trends. Noise is unsystematic and is short term.
Seasonality: If there are regular and predictable fluctuations in the series that are correlated with the calendar (quarterly, weekly, or even by day of the week), then the series includes a seasonality component. It’s important to note that seasonality is domain specific. For example, real estate sales are usually higher in the summer months than in the winter months, while regular retail usually peaks at the end of the year. Also, not all time series have a seasonal component, as mentioned for audio or video data.
Trend: When referring to the “trend” in time series data, we mean that the data has a long-term trajectory, which can be either positive or negative. An example of a trend would be a long-term increase in a company’s sales data or network usage.
Cycle: Repeating periods that are not related to the calendar. Examples include business cycles such as economic downturns or expansions, salmon run cycles, and even the cycles in audio files, none of which are tied to the calendar in the weekly, monthly, or yearly sense.
Time series data can be easily graphed, but sometimes it’s difficult to identify the different components that make up a series. Below, we have a figure that displays Zillow’s jumbo 30-year fixed mortgage rates from data.world, with the level displayed in black at 4.20%:
Because it’s hard to see seasonality and other components in the data, we use a technique called decomposition to identify the various parts of our series. We’ll cover it in a bit, but first we should understand what type of model could fit our series.
Trend, Seasonality, and Noise Interactivity
When a series contains a trend, seasonality, and noise, then you can define that series by the way those components interact with each other. These interactions can be reduced to what is called either a multiplicative or additive time series.
In a multiplicative time series, the fluctuations increase over time and depend on the level of the series:
Multiplicative Model: Time series = t (trend) * s (seasonality) * n (noise)
Therefore, the seasonality of the model would increase with the level over time. In the graph below, you can see that the seasonality of airplane passengers increases as the level increases:
In an additive model, the fluctuations in the time series stay constant over time:
Additive Model: Time series = t (trend) + s (seasonality) + n (noise)
So an additive model’s seasonality should be constant from year to year and not related to the increase or decrease in the level over time. Below is a graph of births in New York where you can see that although the level increases, the seasonality stays the same:
Knowing whether your series is multiplicative or additive is important if you want to decompose your data into its various parts, such as trend or seasonality. Sometimes it’s enough to graph your data to discover whether it’s an additive or multiplicative series. When it’s not, you can decompose your data both ways and compare the autocorrelation function (ACF) values of what remains, to see how correlated the leftover data points are under the additive model versus the multiplicative model.
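To make the distinction concrete, here is a minimal sketch that builds one additive and one multiplicative series from the same made-up trend and seasonal wave, then measures the swing within each year. The function names, the synthetic values, and the 12-point period are assumptions for illustration only, and noise is left neutral (0 when adding, 1 when multiplying) so the effect is easy to see.

```python
import math


def compose(trend, seasonal, noise, model="additive"):
    """Combine components point by point using the chosen model."""
    if model == "additive":
        return [t + s + n for t, s, n in zip(trend, seasonal, noise)]
    if model == "multiplicative":
        return [t * s * n for t, s, n in zip(trend, seasonal, noise)]
    raise ValueError("model must be 'additive' or 'multiplicative'")


def swing_per_period(series, period):
    """Max minus min inside each complete block of `period` points."""
    starts = range(0, len(series) - period + 1, period)
    return [max(series[i:i + period]) - min(series[i:i + period]) for i in starts]


# Synthetic monthly data over four years (illustrative values only).
n_points, period = 48, 12
trend = [100 + 2 * i for i in range(n_points)]
wave = [math.sin(2 * math.pi * i / period) for i in range(n_points)]

additive = compose(trend, [10 * w for w in wave], [0.0] * n_points, "additive")
multiplicative = compose(
    trend, [1 + 0.1 * w for w in wave], [1.0] * n_points, "multiplicative"
)

additive_swings = swing_per_period(additive, period)
multiplicative_swings = swing_per_period(multiplicative, period)
```

In this toy data the yearly swing of the additive series stays the same from year to year, while the swing of the multiplicative series grows as the level rises.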
Decomposition is the deconstruction of the series data into its various components: trend, cycle, noise, and seasonality when those exist. Two different types of classic decomposition include multiplicative and additive decomposition.
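As a rough illustration of additive decomposition, the sketch below estimates the trend with a centered moving average, averages what is left by position in the cycle to get the seasonal pattern, and treats the rest as the remainder. This is a simplified version of the classic approach written for this article, not the implementation of any particular package; the function names and the toy series are assumptions, and the series is assumed to cover at least two full periods.

```python
def centered_moving_average(series, period):
    """Estimate the trend with a centered moving average.

    For an even period the two end points of the window get half weight so
    the average stays centered on an observed point. Positions too close to
    either end of the series are left as None.
    """
    half = period // 2
    trend = [None] * len(series)
    for i in range(half, len(series) - half):
        window = series[i - half:i + half + 1]
        if period % 2 == 0:
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[i] = total / period
    return trend


def decompose_additive(series, period):
    """Split a series into trend, seasonal, and remainder (additive model)."""
    trend = centered_moving_average(series, period)
    detrended = [None if t is None else x - t for x, t in zip(series, trend)]

    # Average the detrended values that share a position in the cycle.
    means = []
    for position in range(period):
        values = [d for d in detrended[position::period] if d is not None]
        means.append(sum(values) / len(values))
    offset = sum(means) / period  # center the seasonal pattern on zero
    seasonal = [means[i % period] - offset for i in range(len(series))]

    remainder = [
        None if t is None else x - t - s
        for x, t, s in zip(series, trend, seasonal)
    ]
    return trend, seasonal, remainder


# Toy quarterly series: a straight-line trend plus a repeating pattern.
pattern = [3.0, -1.0, -2.0, 0.0]
series = [10 + 0.5 * i + pattern[i % 4] for i in range(16)]

trend, seasonal, remainder = decompose_additive(series, period=4)
seasonally_adjusted = [x - s for x, s in zip(series, seasonal)]
```

Because the toy series has no noise, the recovered seasonal component matches the pattern that was put in and the remainder is effectively zero. Real data will leave a visible remainder.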
The purpose of decomposition is to isolate the various components so you can view them each individually and perform analysis or forecasting without the influence of noise or seasonality. For example, if you wanted to only view the trend of a real estate series, you would need to remove the seasonality found in the data, the noise due to randomness, and any cycles such as economic expansion. Below is a figure showing the various components of the mortgage time series including the original data, the seasonality, trend, and noise or “remainder”:
While it was difficult to initially see the trend in the original observed data, once our series is decomposed into its various parts, it’s much easier to see the seasonality, trend, and noise or randomness in our data. To get seasonally adjusted data from an additive decomposition, you subtract the seasonal component from the original series (for a multiplicative decomposition, you divide by it instead). Most statistical packages have functions to do this for you. If you want to remove seasonality in your data via an API, check out the Remove Seasonality algorithm, where you can get your trend without seasonality. With the mortgage rate series, the result would look like this:
We already talked about how there is randomness, or noise in our data, and how to separate it with decomposition, but sometimes we simply want to reduce the noise in our data and smooth out the fluctuations from the noise in order to better forecast future data points.
A moving average model leverages the average of the data points that exist in a specific overlapping subsection of the series. An average is taken from the first subset of the data, and then it is moved forward to the next data point while dropping out the initial data point. A moving average can give you information about the current trends, and reduce the amount of noise in your data. Often, it is a preprocessing step for forecasting.
Moving averages are used by investors and traders for analyzing short-term trends in stock market data. SMAs are also used in healthcare to better understand current trends in surgeries, and even to analyze quality control among healthcare providers.
While there are many different moving average models, we’ll cover the simple moving average (SMA), the cumulative moving average, the exponential moving average (EMA), and, for forecasting, the autoregressive integrated moving average (ARIMA).
Simple Moving Average:
An SMA takes the subset of the data described above, adds the data points together, and divides by the number of points in the subset. It can help identify the direction of trends in your data, and identify levels of resistance, which in business or trading data are price ceilings that can’t be broken through. For instance, if you’re trying to identify the point past which you can’t charge more for a product, or in investing why a stock can’t move past a certain price point, you can identify that ceiling with a moving average.
Here’s a snippet of our loan rate data (this data is grouped by day):
   Date        Daily Average Rates
1  2011-06-01  4.896667
2  2011-06-02  4.857500
3  2011-06-03  4.947500
4  2011-06-06  4.908571
5  2011-06-07  4.946250
6  2011-06-08  4.861250
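Using the six daily rates listed above, a simple moving average can be sketched in a few lines. The function name and the choice of windows are ours; the rates are copied from the snippet.

```python
def simple_moving_average(values, window):
    """Return the mean of every full run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    for end in range(window, len(values) + 1):
        averages.append(sum(values[end - window:end]) / window)
    return averages


daily_rates = [4.896667, 4.857500, 4.947500, 4.908571, 4.946250, 4.861250]

# With a window of six there is exactly one average: about 4.902956.
first_sma_6 = simple_moving_average(daily_rates, window=6)[0]

# A shorter window slides along the data and yields four averages.
sma_3 = simple_moving_average(daily_rates, window=3)
```

A larger window smooths more aggressively but reacts more slowly to changes, which is the trade-off to keep in mind when choosing n.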
If we wanted the subset of the data to be, say, n = 6, then the first SMA for our jumbo loan mortgage rates would be 4.902956, the mean of the six rates in the snippet. It’s up to you to choose the period of time (n data points) that you want to average over. Here is the SMA of the mortgage rate data graphed, where you can see how the noise is smoothed out to better detect the trend:
Cumulative Moving Average:
In a cumulative moving average, the mean is calculated over all of the data up to the current data point, and it is updated with each new data point that comes in from the stream. This is often used to smooth data when analyzing stock market data, network data, and even social media data that has some numeric value such as sentiment scores. With smoothing, it becomes easier to detect trends. The key difference between the cumulative moving average and a simple moving average is that the simple moving average looks at a fixed-size window and drops the oldest point as it moves, while the cumulative moving average never drops data. That makes it a natural fit for a data stream, where the average is updated as each new data point arrives.
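Here is a minimal sketch of how such a running average can be updated one value at a time without storing the whole history. The class name and the sample values are assumptions made for illustration.

```python
class CumulativeMovingAverage:
    """Running mean that is updated one data point at a time."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        """Fold a new value into the mean without keeping past values."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


stream = [4.0, 6.0, 8.0, 2.0]  # stand-in for values arriving from a stream
cma = CumulativeMovingAverage()
running_means = [cma.update(value) for value in stream]
# running_means is [4.0, 5.0, 6.0, 5.0]
```

Only the count and the current mean are kept, so memory use stays constant no matter how long the stream runs.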
Our Zillow mortgage rate data isn’t streaming data, but here is an example of a cumulative average plotted:
In the above graph you can see the actual data points, along with the blue line that represents the cumulative moving average. Recall that the average gets updated with each new data point. Check out Algorithmia’s Cumulative Moving Average algorithm to get started integrating your own data stream, even one from Spark Streaming.
Exponential Moving Average:
Unlike the simple moving average, which moves forward one data point and drops the oldest data point from the average, an exponential moving average (EMA) uses all data points up to the current data point. Each data point is given a weight, and the weights decrease exponentially the further a point is from the current one, so the most recent data points count the most. When the most current data has the most influence, you can better determine the trends that matter most in real time, which is why the EMA is used in evaluating stock prices. The exponential moving average is also used in trading to identify support and resistance levels of stock prices, as mentioned earlier in our discussion of moving averages. But this model is not only used in trading. In healthcare, the EMA has been used for identifying baselines for influenza outbreaks.
Below is a graph of our mortgage series data that shows the daily mortgage rates between 2011 and 2017 and includes the EMA for 50 and 100 days, which can be used to detect long-term trends. If you want to detect short-term trends, you can use shorter periods.
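The exponentially decreasing weights fall out of a simple recursion, sketched below. Two choices here are assumptions rather than anything this article prescribes: the smoothing factor is derived from a period as alpha = 2 / (span + 1), a common convention, and the average is seeded with the first observation.

```python
def exponential_moving_average(values, span):
    """EMA with smoothing factor alpha = 2 / (span + 1)."""
    if span < 1:
        raise ValueError("span must be at least 1")
    alpha = 2 / (span + 1)
    ema = []
    for value in values:
        if not ema:
            ema.append(float(value))  # seed with the first observation
        else:
            # The newest value gets weight alpha; everything older is
            # scaled down by (1 - alpha) once more at every step.
            ema.append(alpha * value + (1 - alpha) * ema[-1])
    return ema


prices = [10.0, 12.0, 11.0, 13.0]  # made-up values
ema_3 = exponential_moving_average(prices, span=3)
# With span=3, alpha is 0.5 and ema_3 is [10.0, 11.0, 11.0, 12.0]
```

A longer span gives a smaller alpha, so older data fades more slowly and the line is smoother.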
If you have data that you want to test an EMA on, such as a stock series, fisheries time series, or even sentiment score time series, check out Exponential Moving Average on Algorithmia.
Forecasting is one of the most relevant tasks when working with time series data, but it’s hard to know where to get started. Although you can forecast with SMA or EMA, another moving average model called Autoregressive Integrated Moving Average is popular for fairly accurate and quick forecasting of time series.
Autoregressive Integrated Moving Average:
The Autoregressive Integrated Moving Average, or ARIMA model, is a univariate linear function that is used for predicting future data points based on past data. ARIMA uses the series’ own past data points to determine future points, unlike a linear regression model, which would rely on an independent variable to predict the dependent variable, such as using treasury note rates to predict mortgage rates. Because of ARIMA’s reliance on its own past data, a longer series is preferable to get more accurate results.
ARIMA relies on a stationary series, meaning that statistical properties such as the mean and variance must remain the same over the course of the series. To achieve stationarity, you would need to remove the trend, the seasonality, and the business cycle if one exists. Differencing is the method used to do this: you subtract the previous data point from the current one, or subtract the data point a given lag earlier. Note that the lag is simply the number of time steps separating the two values being compared, so a seasonal lag compares values from the same point in the season.
For instance, if you have monthly data with yearly seasonality, you would take each data point and subtract from it the data point for the same month of the previous year, and so on, in order to difference your data and create a stationary series.
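Differencing is easy to sketch directly. The helper below and the made-up monthly values are assumptions for illustration: a lag of 1 compares each point with the one before it, and a lag of 12 compares each month with the same month a year earlier.

```python
def difference(series, lag=1):
    """Subtract from each value the value `lag` steps before it."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


# Three years of made-up monthly values: steady growth of 1 per month
# plus a pattern that repeats every 12 months.
yearly_pattern = [5, 3, 0, -2, -4, -5, -4, -2, 0, 2, 4, 3]
monthly = [100 + i + yearly_pattern[i % 12] for i in range(36)]

first_diff = difference(monthly)              # removes the straight-line trend
seasonal_diff = difference(monthly, lag=12)   # removes the yearly pattern
```

In this toy series the seasonal difference is the same value (12) at every point, because the yearly pattern cancels out and only the growth over twelve months is left.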
In an ARIMA model you’ll find that you can adjust various parameters which are:
p (AR), which is the number of autoregressive terms, that is, how many past values are used,
d (I), which is the number of times the series is differenced, and
q (MA), which is the number of past forecast errors used. Together, these make up (AR, I, MA).
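To show how the parameters fit together, here is a deliberately tiny sketch of the special case p = 1, d = 1, q = 0: difference the series once, fit a single autoregressive coefficient by least squares, forecast on the differenced scale, and add the differences back. It is a simplification written for this article (no intercept, no moving average term, no parameter search), not how a statistical package fits ARIMA, and all names and values are assumptions.

```python
def first_difference(series):
    """The d = 1 step: change from each point to the next."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(values):
    """Least-squares phi for values[t] = phi * values[t - 1] (no intercept)."""
    numerator = sum(cur * prev for prev, cur in zip(values, values[1:]))
    denominator = sum(prev * prev for prev in values[:-1])
    if denominator == 0:
        raise ValueError("cannot fit AR(1) to an all-zero series")
    return numerator / denominator


def forecast_arima_110(series, steps):
    """Forecast `steps` points ahead with p = 1, d = 1, q = 0."""
    if len(series) < 3:
        raise ValueError("need at least three observations")
    diffs = first_difference(series)
    phi = fit_ar1(diffs)
    forecasts = []
    last_value, last_diff = series[-1], diffs[-1]
    for _ in range(steps):
        last_diff = phi * last_diff          # autoregressive step
        last_value = last_value + last_diff  # undo the differencing
        forecasts.append(last_value)
    return forecasts


history = [0.0, 8.0, 12.0, 14.0, 15.0]  # made-up values; changes halve each step
next_two = forecast_arima_110(history, steps=2)
# phi is 0.5 here, so next_two is [15.5, 15.75]
```

For real data you would let a statistical package estimate the coefficients and choose p, d, and q, since a hand-rolled fit like this ignores the error terms entirely.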
Here is an example of an ARIMA forecast using our mortgage data:
Many statistical packages allow you to either set the ARIMA parameters manually or use an automated function that finds the best fit for your data. If you choose to select your parameters manually, you can estimate the appropriate ones with an autocorrelation function and a partial autocorrelation function (ACF, PACF). If you aren’t familiar with these functions, check out this online resource from PSU that has more information on the ACF and PACF.
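For a feel of what the autocorrelation function measures, the sketch below computes the sample autocorrelation at each lag: the series is centered on its mean, multiplied by a shifted copy of itself, and scaled so that lag 0 equals 1. The function name and the toy series are assumptions, and the partial autocorrelation is not shown.

```python
def autocorrelation(series, max_lag):
    """Sample autocorrelation for lags 0 through max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    total = sum(c * c for c in centered)
    if total == 0:
        raise ValueError("series is constant, autocorrelation is undefined")
    acf = []
    for lag in range(max_lag + 1):
        shifted = sum(centered[i] * centered[i - lag] for i in range(lag, n))
        acf.append(shifted / total)
    return acf


alternating = [1.0, -1.0] * 4  # flips sign at every step
acf_values = autocorrelation(alternating, max_lag=2)
# acf_values is [1.0, -0.875, 0.75]
```

The strongly negative value at lag 1 and the positive value at lag 2 reflect a pattern that repeats every two steps, which is the kind of signal you look for when picking parameters by hand.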
Luckily, Algorithmia has a forecasting ARIMA model that condenses all the steps into one API call with the AutoArima algorithm. It also provides the AIC, AICc, and BIC which are all indicators of fitness of the model to your data and contains all other necessary metadata in the output of the algorithm so you can create any forecasting operations that you wish. If you’re interested in checking out other forecasting algorithms on Algorithmia, here are a few including a generative forecasting deep learning algorithm:
Anomaly detection is a technique used to find abnormal behavior or data points in a series. Anomalies can range from spikes or dips that deviate sharply from normal patterns to larger, longer-term abnormal trends.
For example, data points that deviate sharply from the measures of central tendency could indicate credit card fraud, because the data point or pattern is abnormal for a particular customer (check out the Fast Anomaly Detection algorithm). Other times, whole patterns are analyzed, for instance with IoT data tracking blood pressure or other sensitive measurements where you would want to detect health anomalies.
Different types of anomaly detection include statistical methods, unsupervised learning, and supervised learning. You can also use a method based on the ARIMA model that is described in Chen and Liu’s 1993 paper and is available in most statistical packages.
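As a minimal sketch of the statistical approach, the function below flags points that sit more than a chosen number of standard deviations from the mean. The function name, the sample readings, and the threshold of 2.5 are assumptions for illustration. The threshold is a judgment call, and in a short series a single spike also inflates the standard deviation, which limits how large its score can get.

```python
import statistics


def zscore_anomalies(values, threshold=2.5):
    """Indices of values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # a constant series has no outliers by this measure
    return [
        index
        for index, value in enumerate(values)
        if abs(value - mean) / spread > threshold
    ]


# Made-up sensor readings with one obvious spike at index 6.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 25.0, 10.1, 9.9, 10.0]
flagged = zscore_anomalies(readings)
```

With these readings only the spike is flagged. For series with a trend or seasonality you would typically apply the same idea to the remainder after decomposition, or over a rolling window, rather than to the raw values.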
Some popular unsupervised methods for anomaly detection include clustering algorithms such as K-Means, which groups data points together based on each data point’s Euclidean distance to the nearest of K centroids (where K is chosen by you). Even neural nets such as Generative Adversarial Networks can be used to detect anomalies in high-dimensional data such as image frames taken from a security video.
One type of supervised machine learning algorithm for anomaly detection is the Support Vector Machine (SVM) model. Like all supervised algorithms, SVMs need training data with data points that are labeled as abnormal in order to classify unseen data, although note that SVMs can also be used for unsupervised clustering when you don’t have labeled data. SVMs are a non-probabilistic supervised method that relies on a hyperplane to separate the binary classes, and they work well with high-dimensional data.
For instance, with the credit card fraud example, a labeled data set that indicates which data points were cases of fraud and which were not could be used to train a support vector machine model to classify new unseen data. The features of each data point place it in a vector space, and the model uses that position to indicate whether it is associated with credit card fraud or not. A hyperplane divides the feature space into fraud and not fraud, and the SVM model optimizes on the data points that are closest to the hyperplane rather than on all data points.
We’ll pretend that one of the features in this case is the amount the user spends (it could also be what state they usually spend in). In our labeled data, data points with a high dollar amount that were known to be fraud are labeled as such. But let’s say that we have a data point with an amount that comes close to the hyperplane in our vector space. It’s not quite high enough to be cleanly labeled as fraud, so it’s an example of the type of data point that the SVM would focus on when optimizing the hyperplane.
Below is an example of multiple hyperplanes and the data points they are trying to optimize on:
Neural Nets in Time Series
Lastly, neural nets are used for anomaly detection and forecasting in time series. They are particularly useful when there are non-linear relationships to be discovered, when data has missing values, or when lags aren’t regular in duration or length between events such as outliers.
If you haven’t been introduced to neural network concepts, check out our Intro to Deep Learning post and for a more detailed explanation of neural nets and machine learning check out Intro to Machine and Deep Learning.
While more traditional methods such as ARIMA and SVMs are helpful when trying to forecast future time series, find anomalies, and classify data, they tend to rely on more recent data, or focus only on data points that aren’t outliers (as in SVMs). More recently, the deep learning model called the Long Short-Term Memory model, or LSTM, has been used for its ability to recall older information in the data, which can create a more accurate forecast or recognize anomalies that might have gone unnoticed by more traditional methods.
LSTMs are a variation of a Recurrent Neural Network, or RNN, which is a great deal more complex than one of the simpler deep learning models, the feedforward network. Unlike a feedforward neural net, which does classification or prediction based on a single forward pass of the input through the nodes in the network, an RNN relies on two inputs. One input is the new data the network hasn’t seen before, and the other input is the information retained from previous data that has already passed through the network.
In other words, RNNs consume their own output as input and use this maintained awareness to inform their perception of new data. This is great for time series data, since patterns are often formed from previous lags that the neural net can learn from, thanks to the persistence of recently learned information. However, when it comes to older information, say from 5 lags ago, RNNs aren’t as powerful, since they struggle to retain long-term dependencies. Retaining them is crucial for important tasks such as detecting network attacks or fraudulent activities for banks.
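The two-input idea can be sketched with a single recurrent unit that works on plain numbers. Everything here is an assumption made for illustration: the weights are fixed by hand rather than learned, there is one unit instead of a layer, and the inputs are made up. The point is only to show how the state from earlier inputs feeds into each new step, and how the influence of an old input shrinks along the way.

```python
import math


def rnn_step(new_input, previous_state, w_input=0.5, w_state=0.5, bias=0.0):
    """One recurrent step: combine the new input with the carried-over state."""
    return math.tanh(w_input * new_input + w_state * previous_state + bias)


def run_rnn(inputs, initial_state=0.0):
    """Feed the inputs through one at a time and return the final state."""
    state = initial_state
    for value in inputs:
        state = rnn_step(value, state)
    return state


baseline = run_rnn([0.0] * 10)
changed_oldest = run_rnn([1.0] + [0.0] * 9)  # signal only in the first input
changed_newest = run_rnn([0.0] * 9 + [1.0])  # signal only in the last input

effect_of_oldest = abs(changed_oldest - baseline)
effect_of_newest = abs(changed_newest - baseline)
```

In this toy setup the signal placed in the oldest input has almost no effect on the final state, while the same signal in the newest input has a large one.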
Luckily, LSTMs can recall older information from previous passes through the network.
One way to use LSTMs in anomaly detection is to create a classifier model and train it on known anomalies, like we talked about for SVMs. Another way is to create a prediction or forecasting model based on known normal data. For instance, if you wanted to detect fraudulent activity, whether in bank data or network data, you could train your LSTM on data containing normal activity and then use it to predict the next n data points. If the real data is significantly different from what was forecasted, then an anomaly is detected. In either case, whether you’re building an anomaly detection system based on a classification model or a prediction model, LSTMs perform very well thanks to their memory of older dependencies. Note, however, that because of neural nets’ unbounded output space, predictions for data points too far into the future aren’t as accurate, so it’s best to be conservative about how far ahead you forecast.
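The compare-to-forecast step does not depend on which model makes the prediction, so it can be sketched without an LSTM. In the example below, `naive_forecast` is a hypothetical stand-in for a trained model, and the function names, the sample data, the warm-up length, and the tolerance are all assumptions chosen for illustration.

```python
def naive_forecast(history):
    """Hypothetical stand-in for a trained model: mean of the last three values."""
    recent = history[-3:]
    return sum(recent) / len(recent)


def detect_by_forecast(series, predict, warmup, tolerance):
    """Flag indices where the actual value is far from the one-step prediction."""
    anomalies = []
    for i in range(warmup, len(series)):
        predicted = predict(series[:i])  # the model only sees earlier points
        if abs(series[i] - predicted) > tolerance:
            anomalies.append(i)
    return anomalies


# Made-up request counts with a burst at index 6.
traffic = [50, 51, 49, 50, 52, 50, 90, 51, 50, 49]
suspicious = detect_by_forecast(traffic, naive_forecast, warmup=3, tolerance=20)
```

Here only the burst is flagged. Note that this simple stand-in keeps using the anomalous point in its later predictions, whereas a model trained only on normal data, as described above, would not be skewed in the same way.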
If you want to train your own neural net time series forecast model, check out Generative Forecast, a multivariate incremental forecaster. And don’t worry, you won’t have to wait days for your model to be trained on your data, because it updates the model without doing backpropagation.
No matter what domain you’re in, whether you are applying these models to stock data for trading, patient data for healthcare optimization, or even working with audio data files, you can easily create insightful analysis in a few lines of code with Algorithmia. Not only do you get the insights provided by these time series models, you get them in a productionized environment that scales on demand through our REST APIs. If you want to host your own time series model, check out our documentation on hosting your algorithms and models.
Here is a sample of time series algorithms available on Algorithmia:
- Time Series Summary Statistics
- Fast Anomaly Detection
- Outlier Detection
- Auto Correlate
- Tensorflow LSTM predictor
- Remove Seasonality