Tuesday, September 4, 2012

Time Series Analysis and Forecasting. Programming Approach - thoughts

"Certain things are impossible... 
Until an ignoramus appears, who is not aware of that".



Time Series - a sequence of data points, measured typically at successive time instants spaced at uniform time intervals. 

There are quite a lot of things that fit this definition. For example, air temperature changes throughout the day (let's say, measured hourly), or the distance from the Earth to the Moon (which changes slightly throughout the lunar month). Even which political party holds the presidential chair after the elections (which depends on the "history" of the previous president, etc.). We can go on with the list of examples until the server's storage is full. As you may see, the examples above have a cyclic nature, but so does everything (or at least almost everything) related to time series (of course, within certain deviations).

It is in the nature of mankind to want to know the future (although sometimes it is better not to know). Attempts are being made to predict, or let's use a more politically correct term - forecast, where certain series will go in the future. The best example may be shamans predicting rain or drought. These days there are complex (and not so complex) algorithms to forecast time series (e.g. noise reduction in digital signal processing). But the loudest and most scandalous argument is about stock market analysis and forecasting. Many of you may have heard about William Gann - some say genius, some say charlatan. I personally tend to take the first side, although there may be facts that I am not aware of.

Mr. Gann died almost 60 years ago. Quite a long period of time. Imagine how many time series forecasting (read stock market forecasting) techniques have been born and how many have vanished. Since the advent of chaos theory, more and more people tend to say that "stock market forecasting is impossible due to its fractal nature". This makes sense if you look at the problem from the chaos theory perspective. However, do not forget that chaos theory is accepted as the one that fits the situation best, not as the one that fully explains it. In my perception, this tiny difference leaves a tiny space for hope ;-)

Well, we've had enough of science this far. Let us get to practice. Let me try to simplify things as much as possible, to demonstrate a simpler, yet effective approach from a developer's point of view.


Software

From a software perspective, not much is needed for successful forecasts - just an expert system. Smart people use different software packages and programming languages targeted at expert systems development, but being an ignoramus (as I decided to be for this article), I decided to use what I have and what I know - the C language, GCC and the Geany text editor as an IDE.


Data

There are several (graphical) ways to represent stock/forex market data. The best known one is candlesticks: a sequence of simple graphic figures, each of which represents the variation of the price over a certain period of time (open, high, low and close values). We, however, are not going to consider any of them, simply because we do not need that. Instead, we are going to concentrate on the raw row of numbers for a given period (let's say one year) measured hourly, which gives us a sequence of more than 8000 items (we are only paying attention to one value - either open, high, low or close).

If you try to plot this sequence (e.g. in Excel) you will get a curvy line. Take another look at it and you will notice that there are similar segments (within certain deviations, of course), just like a set of similar images. This brings up one of the best approaches for image recognition - Artificial Neural Networks (especially perceptrons). There is nothing new in using ANNs for stock/forex market analysis: there are tons of commercial software products that provide the end user with different indicators telling him/her whether to buy, sell or hold the current position. However, I personally have not seen a lot of attempts to actually make long term (e.g. 24 hours for an hourly measured sequence) forecasts. There is also a lot of uncertainty as to what data should be used as the ANN's input and how much data should be fed in each time. Unfortunately, no one has the exact answer to this question. It is just trial and error. The same applies to the number of hidden neurons in the ANN.
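To make the "how much data" question concrete, here is a minimal sketch of one common way to turn the raw sequence into training samples: a sliding window, where a run of consecutive values is the input and the value right after it is the target. This is an assumption for illustration, not necessarily how my program slices the data, and the names sample_count and make_sample are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of (input window, target) pairs a series of length n yields. */
size_t sample_count(size_t n, size_t window)
{
    return n > window ? n - window : 0;
}

/* Copy sample k: the inputs are series[k .. k+window-1], the target is
 * the value that immediately follows the window. */
void make_sample(const double *series, size_t n, size_t window, size_t k,
                 double *inputs, double *target)
{
    assert(k < sample_count(n, window));
    for (size_t i = 0; i < window; i++)
        inputs[i] = series[k + i];
    *target = series[k + window];
}
```

The window width is the single knob that the trial and error turns: a wider window gives the network more context per sample, but leaves fewer samples and more weights to train.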

Another big question is how the data should be preprocessed - prepared for the ANN. Some use complex algorithms (the Fourier transform, for example), others tend to use simpler ones. The idea is that the data should be in the range of 0.0 - 1.0 and should be as varied as possible. But remember - if you feed an ANN garbage, you get garbage in response, meaning that you have to carefully select your algorithm for data preprocessing (normalization). I use a custom normalization algorithm, which is quite simple. Sorry to disappoint you, but I am not going to give it here for now as it is still not completely defined (although it already produces good results).
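For readers who want a starting point, the textbook way to bring values into the 0.0 - 1.0 range is min-max scaling. To be clear, this is NOT my custom algorithm, just a generic sketch; the Range type and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: keeps the range found in the data so that the
 * same mapping can later be reversed on the network's output. */
typedef struct {
    double min;
    double max;
} Range;

/* Scan the series once and record its smallest and largest value. */
Range find_range(const double *x, size_t n)
{
    assert(n > 0);
    Range r = { x[0], x[0] };
    for (size_t i = 1; i < n; i++) {
        if (x[i] < r.min) r.min = x[i];
        if (x[i] > r.max) r.max = x[i];
    }
    return r;
}

/* Map a price into 0.0 - 1.0. A flat series (max == min) maps to 0.5. */
double normalize(double v, Range r)
{
    double span = r.max - r.min;
    return span > 0.0 ? (v - r.min) / span : 0.5;
}

/* Inverse mapping: turn a network output back into a price. */
double denormalize(double y, Range r)
{
    return r.min + y * (r.max - r.min);
}
```

One known weakness of this scheme: a future price outside the range seen so far maps outside 0.0 - 1.0, which is one reason people look for something smarter.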

The bottom line for this paragraph - data preprocessing is not very important, it is the MOST important.


Instruments

My programming solution for this problem is quite simple - a console program that reads the input (the whole sequence of price values for the specified period), trains an artificial neural network (in my case the topology was 8x24x1 - 8 inputs, 24 hidden neurons and one output neuron), and then produces a long term forecast (at least 7 entries into the future), where each step of the forecast is made using the previously generated values.
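The last part, forecasting each step from previously generated values, can be sketched roughly as follows. This is an illustration under assumptions rather than my actual code: the predict_fn signature is hypothetical, and naive_predict is only a stand-in (plain linear extrapolation) so that the loop can be tried without a trained network.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW 8   /* matches the 8 inputs of the 8x24x1 network */

/* Hypothetical predictor signature: WINDOW inputs in, one value out. */
typedef double (*predict_fn)(const double *inputs, void *model);

/* Stand-in predictor (NOT a neural network): continues the last step. */
double naive_predict(const double *in, void *model)
{
    (void)model;
    return in[WINDOW - 1] + (in[WINDOW - 1] - in[WINDOW - 2]);
}

/* Produce `steps` future values. Each prediction is shifted into the
 * window, so later steps are computed from earlier forecasts instead
 * of real data. */
void forecast(const double *history, size_t n, size_t steps,
              predict_fn predict, void *model, double *out)
{
    assert(n >= WINDOW);
    double window[WINDOW];
    for (size_t i = 0; i < WINDOW; i++)
        window[i] = history[n - WINDOW + i];

    for (size_t s = 0; s < steps; s++) {
        double next = predict(window, model);
        out[s] = next;
        for (size_t i = 0; i + 1 < WINDOW; i++)   /* drop the oldest value */
            window[i] = window[i + 1];
        window[WINDOW - 1] = next;                /* append the forecast */
    }
}
```

Keep in mind that any error made at one step is fed back in as input, so errors tend to accumulate the further the forecast reaches.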

The ANN is a simple multilayer perceptron with 8 inputs, 24 hidden neurons and 1 output neuron. In other words, we do not perform many calculations ourselves, if any. An ANN is a perfect implementation of a learning paradigm, able to find hidden dependencies and rules. Therefore, if you ask me - there is no better solution than using ANNs for time series forecasting.
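For those who have never seen one, the forward pass of such an 8x24x1 perceptron fits in a few lines. The sketch below rests on assumptions: the Mlp layout, the bias terms and the squashing function are illustrative choices, not taken from my program. In particular, the "fast sigmoid" is used only to keep the example free of math library calls; the logistic function 1/(1+exp(-x)) is the more common choice. Training (e.g. backpropagation) is not shown.

```c
#include <stddef.h>

#define N_IN  8
#define N_HID 24

/* Weights of an 8x24x1 perceptron; the last element of each row (and
 * of the output vector) is the bias. */
typedef struct {
    double hidden[N_HID][N_IN + 1];
    double output[N_HID + 1];
} Mlp;

/* Squashing function with outputs in (0, 1). Assumed "fast sigmoid";
 * see the note above about the usual logistic function. */
static double squash(double x)
{
    double ax = x < 0.0 ? -x : x;
    return 0.5 * x / (1.0 + ax) + 0.5;
}

/* Forward pass: weighted sums plus bias, squashed at both layers. */
double mlp_forward(const Mlp *net, const double in[N_IN])
{
    double sum_out = net->output[N_HID];          /* output bias */
    for (size_t h = 0; h < N_HID; h++) {
        double sum = net->hidden[h][N_IN];        /* hidden bias */
        for (size_t i = 0; i < N_IN; i++)
            sum += net->hidden[h][i] * in[i];
        sum_out += net->output[h] * squash(sum);
    }
    return squash(sum_out);
}
```

Because the output lies between 0 and 1, it has to be mapped back to a price with the inverse of whatever normalization was applied to the inputs.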


Test

So, I implemented an ANN (in C this time, not in Assembly) and got the dataset (EUR/USD price values for every hour of the past year). The next move was to give it a try and test it live. I decided to do that during the weekend as I was not sure how much time would be required to train the network. Surprisingly, I got an acceptably low error after only about 30,000 epochs (several minutes). The following picture shows what I got:

EUR/USD forecast

Test set - data not included in the ANN training process. Used as a pattern for error calculation.
Test forecast - forecast on data from the past, which was not included in the training set.
Real forecast - forecast of the future values. This was done on Saturday at least 24 hours before the opening of the next trading session.
Real data - real values obtained Monday early morning after the new trading session began.

As you can see, such a simple system was even able to forecast the gap between the two sessions.
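The error calculation mentioned for the test set can be as simple as comparing the test forecast with the held-out values. I do not state above which error measure was used, so the mean squared error below is an assumption, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Mean squared error between forecast values and held-out real values. */
double mean_squared_error(const double *predicted, const double *actual,
                          size_t n)
{
    assert(n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = predicted[i] - actual[i];
        sum += diff * diff;
    }
    return sum / (double)n;
}
```

The point of measuring on data that took no part in training is that a low error there says something about forecasting ability, not just about memorization.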


P.S. Although this article contains no source code and no description of any interesting programming technique whatsoever, it goes to show that each problem has a (not necessarily complicated) solution. Most of the time, the most important thing is to take a look at a problem from another angle.