Dec 10
Smoothing data prior to training a neural net is important because it filters out unwanted noise that may be present in the inputs. In addition to removing noise, however, smoothing inevitably introduces time lag, which degrades any correlation the input may have with the target variable. If we over-smooth, the correlation may diminish so much that the neural net is unable to learn important relationships between the input and the target.
Let’s consider the following example:
We want to predict the next value of a noisy sine wave. For this we shall use a 2:3:1 neural net (2 inputs, 3 hidden nodes and 1 output).

Our noisy input data consists of 400 samples with a further 100 samples used for testing the neural net’s prediction.

Experiment 1:
We first investigate the predictive quality with both input1 and input2 set to 15-bar moving averages of the training data.


MSE = 0.2342
Experiment 2:
We now investigate with input1 set to 5-bar and input2 set to 15-bar moving averages of the training data.


MSE = 0.1335
Key Points:
- Excessive filtering of the inputs in experiment 1 removed noise but also degraded the correlation with the target, which shows up in the quality of the forecast output.
- In experiment 2 we changed one of the inputs to a 5-bar MA and left the other as a 15-bar MA. Because the 5-bar MA input suffered less correlation degradation, the forecast quality was much better than in experiment 1 (MSE of 0.1335 against 0.2342).
- Essentially, what the neural net does during training is identify correlations between the inputs and the target data. If, for instance, we have reason to believe that the target is influenced by an N-bar moving average, then supplying an N-bar smoothed input will do no harm. But if our purpose is solely to filter noise in the input, then we have to be aware of the change in correlation and its consequences.
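The loss of correlation can be checked directly, without training a network. The sketch below is only an illustration under assumed settings: the sine period (40 samples), the noise level (0.3) and the use of a trailing simple moving average are assumptions, not the values used in the experiments above. It measures how well each smoothed input at time t correlates with the raw value at time t+1.

```python
import math
import random

def moving_average(series, window):
    """Trailing simple moving average; the first window-1 points are skipped."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def correlation_with_next_value(series, window):
    """Correlation between the window-bar MA at time t and the raw value at t+1."""
    ma = moving_average(series, window)  # ma[j] belongs to time j + window - 1
    inputs = ma[:-1]                     # the last MA point has no "next" value
    targets = series[window:]            # raw value one step after each MA point
    return pearson(inputs, targets)

random.seed(0)
PERIOD = 40      # assumed number of samples per sine cycle
NOISE_STD = 0.3  # assumed noise level
noisy_sine = [math.sin(2 * math.pi * t / PERIOD) + random.gauss(0, NOISE_STD)
              for t in range(400)]

corr_5 = correlation_with_next_value(noisy_sine, 5)
corr_15 = correlation_with_next_value(noisy_sine, 15)
print(f"5-bar MA vs next value:  {corr_5:.3f}")
print(f"15-bar MA vs next value: {corr_15:.3f}")
```

With these settings the 15-bar average lags the signal by roughly a fifth of a cycle, so its correlation with the next value comes out well below that of the 5-bar average.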
Oct 21
Part 8: Trading strategy results and Wrap-up
Our trading strategy takes the following form:
if(predicted5DayChange > 1.5)
trade = long(5);
elseif(predicted5DayChange < -1.5)
trade = short(5);
end
% the argument to long/short (5 here) is the number of days after which the trade is automatically closed
The performance of this strategy is compared against that of a buy-and-hold strategy. Transaction costs, slippage and other caveats are ignored. The plot below shows the cumulative profit of the trading strategy over different test sets.
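A minimal sketch of how such a comparison might be computed is given here. It assumes that a new trade may be opened every day (so trades can overlap), that profit is measured in index points, and that costs are ignored; the function names and the toy numbers are hypothetical and are not the data behind the plot.

```python
def strategy_profit(predicted_changes, prices, threshold=1.5, hold_days=5):
    """Cumulative profit curve (index points) of the threshold rule.

    predicted_changes[t] is the forecast % change from day t to day t + hold_days.
    A trade opened on day t is closed hold_days later. Overlapping trades are
    allowed and transaction costs are ignored (both are simplifying assumptions).
    """
    cumulative, curve = 0.0, []
    for t in range(len(prices) - hold_days):
        move = prices[t + hold_days] - prices[t]
        if predicted_changes[t] > threshold:       # forecast rise: go long
            cumulative += move
        elif predicted_changes[t] < -threshold:    # forecast fall: go short
            cumulative -= move
        curve.append(cumulative)
    return curve

def buy_and_hold_profit(prices):
    """Profit from buying on the first day and selling on the last."""
    return prices[-1] - prices[0]

# Toy data, purely illustrative (a 2-day hold keeps the example short).
prices = [100, 102, 104, 103, 101, 99, 98, 100]
predicted = [2.0, 0.5, -2.0, -2.0, 0.0, 1.6]
curve = strategy_profit(predicted, prices, hold_days=2)
print(curve, buy_and_hold_profit(prices))
```

In this toy series buy-and-hold earns nothing because the index ends where it started, while the rule profits from both the rise and the fall.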

A comparative summary is given in the table below:

We have shown that a simple MLP neural network can generate profits in the absence of trading costs. In summary:
- Predictive quality and hence return on investment degrades as the time-series moves further away from the training set.
- Creating a committee of networks improves predictive quality.
- In the absence of all other costs, the model generates greater profits than a simple buy-and-hold strategy.
I would say these results should be considered a lower bound on the profit-making ability of a neural network trading system, as the model can be improved in a number of ways:
- Evolve the weights via a Particle Swarm algorithm or some other evolutionary algorithm.
- Modify the network to predict market turning points rather than absolute values.
- Modify the network to detect divergence of other linked markets.
- Allow the number of days to stay in a trade to vary.
I am currently in the process of improving this system so that I can deploy it at Collective2. Nonetheless this experiment ends here and we shall next look at something to do with Genetic Algorithms.
Oct 17
Part 7: Application to trading
Let's recap what we have done so far by looking at the plot below:

We used 889 trading days of the FTSE 100 index for training our stack of twenty-five 9:6:1 MLP neural networks. We then chose 3 sets of testing data consisting of 125 trading days each. Our results showed that the committee model performed best over test set 1; that is, the quality of the predicted 5-day forecasts degraded as the data points moved further away from the training set. The plot below tries to reflect this:

A trading strategy can be something like this:
if(predicted5DayChange > 1.5)
trade = long(n);
elseif(predicted5DayChange < -1.5)
trade = short(n);
end
% n is an integer number of days which needs to be found so as to maximise profit
Our aim now is to find the best value of n.
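One simple way to look for n is a brute-force search over candidate holding periods. The sketch below assumes profit is measured in index points with costs ignored; `total_profit`, `best_holding_period` and the toy numbers are hypothetical names and data used only for illustration.

```python
def total_profit(predicted_changes, prices, hold_days, threshold=1.5):
    """Total profit (index points) of the threshold rule for one holding period."""
    profit = 0.0
    for t in range(len(prices) - hold_days):
        move = prices[t + hold_days] - prices[t]
        if predicted_changes[t] > threshold:       # long(n)
            profit += move
        elif predicted_changes[t] < -threshold:    # short(n)
            profit -= move
    return profit

def best_holding_period(predicted_changes, prices, candidates=range(1, 11)):
    """Try every candidate n and return the most profitable one with all results."""
    profits = {n: total_profit(predicted_changes, prices, n) for n in candidates}
    best_n = max(profits, key=profits.get)
    return best_n, profits

# Toy data, purely illustrative.
prices = [100, 102, 104, 103, 101, 99, 98, 100]
predicted = [2.0, 0.5, -2.0, -2.0, 0.0, 1.6, 0.0, 0.0]
best_n, profits = best_holding_period(predicted, prices, candidates=range(1, 4))
print(best_n, profits)
```

A value of n chosen this way is fitted to the data it was searched on, so it would be safer to pick it on one period and judge it on another.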
Oct 14
Part 6: Network performance measures
Tests are performed with three out-of-sample data sets consisting of 125 trading days each.
- Test set 1: 01/07/95 to 31/12/95
- Test set 2: 01/01/96 to 28/06/96
- Test set 3: 01/07/96 to 31/12/96
The figure below shows the mean square error over the entire test set range, with a smaller MSE value indicating better performance.

There are two visible trends here. First, the committee model outperforms the average of the individual models across all test sets, although not by as much in the third set. Second, the predictive quality of the models degrades over time, since the error of both the committee and the individual models increases. This is further investigated in terms of the correlation coefficient, which provides an indication of how similar the predictions are to the actual time-series.

Predictions made by the committee stack show higher correlation (i.e. similarity) to the actual five-day change than the average correlation of the individual models. As expected, the correlation coefficient degrades over time, thus confirming that forecasts made on out-of-sample data further away from the training set are prone to being less accurate.
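For reference, the two measures used above can be computed as in the sketch below. It assumes the committee forecast is the plain average of the members' forecasts, which is an assumption rather than a confirmed detail of the model; the numbers are made up.

```python
import math

def mse(predictions, actual):
    """Mean square error between forecasts and actual values."""
    return sum((p - a) ** 2 for p, a in zip(predictions, actual)) / len(actual)

def correlation(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) *
                           sum((b - my) ** 2 for b in y))

def committee_forecast(member_predictions):
    """Average the members' forecasts day by day (assumed combination rule)."""
    return [sum(day) / len(day) for day in zip(*member_predictions)]

# Made-up 5-day % changes and three made-up member forecasts.
actual = [1.0, -0.5, 2.0, 0.0]
members = [
    [1.5, -1.0, 1.0, 0.5],
    [0.5, 0.0, 2.5, -0.5],
    [1.0, -1.5, 3.0, 0.5],
]
committee = committee_forecast(members)
committee_mse = mse(committee, actual)
average_member_mse = sum(mse(m, actual) for m in members) / len(members)
print(committee_mse, average_member_mse, correlation(committee, actual))
```

For an averaging committee the MSE can never exceed the average MSE of its members, because squared error is a convex function of the forecast.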
Oct 13
One of the problems associated with trained networks is that they can sometimes suffer from overfitting: the network is too complex and, as a result, learns non-linear relationships present in the underlying noise of the training set. This can be dangerous because predictions can fall beyond the range of the training data, which could result in wild forecasts. A common way to avoid overfitting is to use lots of training data, although this can cause problems in terms of computational time. Other methods involve:
Smoothing provides a way of improving the signal-to-noise ratio of your inputs by filtering out the noise. This comes at the expense of lag, which can be minimised by using custom designed zero-lag filters. Smoothed input data enables the network to learn mostly general trends. Trends that are too general can lead to network underfitting, so the decision on how much filtering to apply to the inputs should be guided by a pre-defined signal-to-noise ratio.
Regularisation involves modifying the error performance function of the network. In most cases the error is taken to be the sum of the squares of the network error over the training set. The performance function can, for instance, be modified to include the mean of the squared network weights, which penalises large weights and so introduces a measure of generality. A common approach is Bayesian Regularisation, which works very well with the Levenberg-Marquardt learning algorithm.
Early stopping involves comparing the error measure from a validation set with the error measure from the training set. During training the errors from the validation and training sets initially decrease in a similar fashion. However, when the network starts to overfit the data, the validation error will typically begin to rise. When this happens, the training can be stopped and the weights at the minimum of the validation error are taken to be the best-case values.
Using uncorrelated inputs has not been thoroughly investigated, but some research has shown that it reduces overfitting. Finding such inputs can be a problem because inputs generally tend to be correlated to a certain extent.
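Of the methods above, stopping training when the validation error starts to rise is the easiest to sketch in code. The helper below is a hypothetical illustration: the `patience` parameter (how many epochs without improvement to tolerate) and the simulated error history are assumptions, and the training step itself is left out.

```python
def early_stopping(epochs, patience=3):
    """Return (best_weights, best_epoch, best_validation_error).

    `epochs` yields (weights, validation_error) after each training epoch.
    Training stops once the validation error has failed to improve for
    `patience` consecutive epochs; the weights at the minimum are kept.
    """
    best_weights, best_epoch, best_error = None, -1, float("inf")
    epochs_without_improvement = 0
    for epoch, (weights, validation_error) in enumerate(epochs):
        if validation_error < best_error:
            best_weights, best_epoch, best_error = weights, epoch, validation_error
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation error keeps rising: assume overfitting
    return best_weights, best_epoch, best_error

# Simulated history: the validation error falls, then rises as overfitting sets in.
history = [("w0", 0.90), ("w1", 0.60), ("w2", 0.40), ("w3", 0.35),
           ("w4", 0.37), ("w5", 0.40), ("w6", 0.45), ("w7", 0.30)]
print(early_stopping(history, patience=3))
```

The last entry in the simulated history is never reached, which shows the trade-off in choosing `patience`: too small a value can stop training before a later improvement.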
Oct 11
Part 5: Creating a committee of networks.
In Part 4 I talked about using a single network to obtain forecasts. We can improve upon that by using a committee of networks so as to improve overall accuracy of forecasts.
Previous research [Zhang et al] shows that a group of between 20 and 30 networks is sufficient to improve on the results obtained from a single network. The reason for this can be understood as follows:
Consider a neural network with only 2 taps (or weights). If we change the value of these weights with respect to each other and measure the error produced at the output of the network and plot it, we get a surface with multiple minima similar to that shown below:

During training, the learning algorithm (Levenberg-Marquardt in this case) attempts to find the minimum points on this surface because it aims to minimise the error at the output of the network. Notice that there are four minima on the surface, two of which are global (the dark blue ones) and two of which are local. Local search algorithms (like Levenberg-Marquardt) follow the path of steepest slope on the cost surface, but they need to be told where to start their search (the initialisation point). If we use only one initialisation point then there is a possibility of the network getting stuck at a local minimum. Using multiple initialisation points improves the chance of reaching a global minimum. This is why we create a committee of networks, as each one will be initialised at a different point on the cost surface. We can therefore expect certain members of the committee to perform better than others, provided the cost surface has local as well as global minima. Our committee structure looks something like:

A stack of 25 networks will be used to provide a forecast which is expected to be more accurate than that of a single network.
In summary:
- We use a committee because 25 brains are better than one.. duh!
- Put more formally, certain networks in the committee will perform better than others because their weights will correspond to those at a global minimum on the cost surface.
- We need to avoid at all times networks that have weights corresponding to local minima as this can mean the difference between a good forecast and a bad forecast.
- We could have used tournament selection, but that is more applicable to genetic algorithms, something we shall discuss in due course.
More on optimising cost surfaces later on.
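The effect of the initialisation point can be reproduced on a toy two-weight cost surface. The sketch below is an illustration only: the surface is invented (it has one global and three local minima rather than the two and two in the plot above), and plain gradient descent stands in for Levenberg-Marquardt.

```python
import random

def g(w):
    """One-dimensional double well; the tilt makes the well near w = -1 deeper."""
    return (w * w - 1.0) ** 2 + 0.3 * w

def cost(w1, w2):
    """Toy cost surface over two weights, with four minima near (+/-1, +/-1)."""
    return g(w1) + g(w2)

def gradient(w1, w2):
    dg = lambda w: 4.0 * w * (w * w - 1.0) + 0.3
    return dg(w1), dg(w2)

def local_search(w1, w2, rate=0.01, steps=2000):
    """Plain gradient descent from a given initialisation point."""
    for _ in range(steps):
        d1, d2 = gradient(w1, w2)
        w1, w2 = w1 - rate * d1, w2 - rate * d2
    return w1, w2

rng = random.Random(0)
starts = [(rng.uniform(-2, 2), rng.uniform(-2, 2)) for _ in range(25)]  # 25 members
finals = [local_search(*start) for start in starts]
final_costs = [cost(*w) for w in finals]
best_cost = min(final_costs)
distinct_minima = {(round(w1), round(w2)) for w1, w2 in finals}
print(f"best cost {best_cost:.3f}, minima reached: {sorted(distinct_minima)}")
```

Each member ends up in whichever minimum its starting point drains into, so the spread of final costs across the 25 members reflects the spread of minima on the surface.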
Oct 09
Part 4: Network training
The network structure we will use is a 9:6:1 multi-layer perceptron. That is, there are 9 input nodes, 6 hidden nodes and 1 output node. There is no particular reason why we use 6 nodes in the hidden layer. We could have chosen a different number, but trial and error shows that 6 nodes learn the time-series well without loss of generality. The network structure is shown below:

Each of the nodes is activated by a tan-sigmoid transfer function. The training algorithm employed is the Levenberg-Marquardt algorithm, which is a very powerful gradient-based algorithm. It is important to point out that there are a number of other neural network training algorithms which we could have used. Each has its own advantages and disadvantages, which need to be weighed up when deciding which one to use for a particular problem:
- Resilient backpropagation
- Random order incremental update
- Polak-Ribiere conjugate gradient descent
- Powell-Beale conjugate gradient descent
- Bayesian regularization
These algorithms very rarely get stuck at saddle points because they have a random disturbance which alleviates the problems associated with the attraction of saddles. Once initialised, they descend until they reach a true minimum point, which might not necessarily be global. Evolutionary algorithms have the special feature of ensuring that a global minimum is reached (subject to certain constraints) during training.
Evolutionary algorithms truly enable one to have a predictive edge, as we shall see later on. Next we shall look at the forecasting ability of the trained network.
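To make the 9:6:1 structure concrete, here is a minimal sketch of a forward pass through such a network with tan-sigmoid (tanh) activations on every node. The weights are random placeholders rather than trained values, the helper names are hypothetical, and the Levenberg-Marquardt training step is not shown.

```python
import math
import random

def init_layer(n_inputs, n_nodes, rng):
    """Random placeholder weights and biases for one fully connected layer."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_inputs)] for _ in range(n_nodes)]
    biases = [rng.uniform(-1, 1) for _ in range(n_nodes)]
    return weights, biases

def layer_output(inputs, weights, biases):
    """tanh of the weighted input sum plus bias, for each node in the layer."""
    return [math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, hidden, output):
    """9 inputs -> 6 hidden nodes -> 1 output node."""
    return layer_output(layer_output(inputs, *hidden), *output)[0]

rng = random.Random(0)
hidden = init_layer(9, 6, rng)   # 9*6 weights + 6 biases
output = init_layer(6, 1, rng)   # 6*1 weights + 1 bias
n_parameters = 9 * 6 + 6 + 6 * 1 + 1

inputs = [rng.uniform(-1, 1) for _ in range(9)]  # inputs assumed already scaled
forecast = forward(inputs, hidden, output)
print(n_parameters, forecast)
```

Because the output node is also a tanh, the forecast always lies between -1 and 1, so the target has to be scaled into that range before training and scaled back afterwards.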
Oct 07
Part 1: Setting project goals
Part 2: Analysing the target variable
Part 3: Input data selection and preprocessing
We shall use a variety of indicators to drive our neural network to forecast the % 5-day change of the FTSE100 index. Although there may be thousands of such indicators that we can choose from, our aim should be to pick the ones that have a significant bearing on the target variable being forecast. In this example we use a range of technical, fundamental and intermarket indicators, namely:
- 5-day lagged % change of the FTSE100
- 20-day lagged % change of the FTSE100
- ** 10-day 5-day convergence divergence of the FTSE100
- ** 20-day 10-day convergence divergence of the FTSE100
- GBP/USD exchange rate
- S&P 500 composite index
- Brent crude (US$ per barrel)
- LIBOR 1-month deposit rate
- LIBOR 12-month deposit rate
Inputs 1 and 2 provide the neural network model with a measure of momentum in the market, giving it the added ability to discern whether a short-run 5-day trend agrees with the longer 20-day trend. Input 3 was calculated as the ratio of the 10-day smoothed to the 5-day smoothed time-series of the FTSE100 index, and input 4 likewise from the 20-day and 10-day smoothed series. The smoothing was accomplished using the ** Zero-lag filter that was designed earlier. The GBP/USD exchange rate is included because changes of the GBP against a major currency can be expected to impact the domestic and overseas earnings of companies in the FTSE100 index. For similar reasons the price of Brent Crude is also included. The S&P 500 index is included because it is often said that “when the US sneezes the rest of the world catches a cold”. We can safely assume there is correlation between the two indices (in fact I know there is!). Two measures of interest rates are provided, namely the LIBOR 1-month and 12-month rates. Interest rates affect share prices by altering the rate of return that can be earned on competing instruments such as bonds and bank deposits, and by changing the borrowing costs of firms in the FTSE100 index. The plot below shows the input data drawn from the range 01-Jan-1992 to 15-June-1995, a total of 889 trading days.
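The derived inputs (lagged % changes and ratios of smoothed series) might be computed along the lines of the sketch below. The zero-lag filter itself is not reproduced here, so a trailing simple moving average is used purely as a stand-in, and all function names and prices are hypothetical.

```python
def pct_change(series, lag):
    """% change over `lag` days; entry i compares day i + lag with day i."""
    return [100.0 * (series[i + lag] - series[i]) / series[i]
            for i in range(len(series) - lag)]

def moving_average(series, window):
    """Trailing simple moving average (stand-in for the zero-lag filter)."""
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

def smoothed_ratio(series, slow, fast):
    """Ratio of the slow-smoothed to the fast-smoothed series on matching days."""
    slow_ma = moving_average(series, slow)
    fast_ma = moving_average(series, fast)[slow - fast:]  # align end dates
    return [s / f for s, f in zip(slow_ma, fast_ma)]

# Made-up index levels, purely illustrative.
index = [3000.0 + 10.0 * day for day in range(30)]
change_5 = pct_change(index, 5)            # input 1: 5-day % change
change_20 = pct_change(index, 20)          # input 2: 20-day % change
ratio_10_5 = smoothed_ratio(index, 10, 5)  # input 3: 10-day vs 5-day smoothing
ratio_20_10 = smoothed_ratio(index, 20, 10)  # input 4: 20-day vs 10-day smoothing
print(change_5[0], ratio_10_5[0])
```

In a rising market the slower average sits below the faster one, so both ratios come out below 1; in a falling market they rise above 1.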

This range for training was deliberately chosen because of the profound effects of Black Wednesday, which are visibly present in all the input data as well as the target variable. A deviation of this kind should enable our network to learn black swans in addition to the normal behaviour of the index. To further illustrate the importance of having outliers in your training set, we have the plot below.

Assuming there is correlation between the S&P500 and the FTSE100 (which of course there is!), the features present in the training set are representative of pretty much the whole data set. This is good because our network will be in a position to predict a fall in the FTSE100 should an event similar to Black Wednesday occur. A training set which is not representative of all possible events, including black swans, is one of the reasons why neural networks fail in their forecasting ability.
Key points:
- We have selected a range of fundamental, technical and intermarket indicators to use as inputs.
- We have selected a time-series range which includes a black swan event.
- We avoid the use of a moving average for smoothing because of its terrible lag characteristics. Instead we used the custom designed zero-lag filter.
- We will normalise all inputs and the target to the range (-1, 1) in order to satisfy the training constraints of the MLP neural network.
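The normalisation step in the last point can be sketched as a min-max scaling. This is an assumption about the method (the exact scaling used is not stated), and the function names are hypothetical; note that min-max scaling maps the smallest and largest training values exactly onto -1 and 1.

```python
def fit_minmax(values):
    """Smallest and largest value, taken from the training data only."""
    return min(values), max(values)

def normalise(values, lo, hi):
    """Map [lo, hi] linearly onto [-1, 1]."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]

def denormalise(scaled, lo, hi):
    """Inverse mapping, used to turn network outputs back into real units."""
    return [(s + 1.0) * (hi - lo) / 2.0 + lo for s in scaled]

training = [10.0, 15.0, 20.0]
lo, hi = fit_minmax(training)
scaled = normalise(training, lo, hi)
restored = denormalise(scaled, lo, hi)
print(scaled, restored)
```

Test-set values that fall outside the training range will scale to values beyond -1 or 1, which is one more reason to want a training set that covers extreme events.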
Oct 03
Part 1: Setting project goals
Part 2: Analysing the target variable
Our case study is to examine the ability of our MLP Neural Networks to predict the five-day percentage change of the UK’s FTSE100 index during the period July 1995 to December 1996. We shall draw our training data from the range January 1992 to June 1995. Let's have a look at a few perspectives on the 5-day % change of the FTSE100 index.

The returns series shows that while most changes are in the range +/-2%, larger changes are not uncommon. In fact there are a few positive black swans, which are not characteristic of a normal distribution. What we see is a leptokurtic distribution, which is a general characteristic of price changes of equities. Leptokurtic or fat-tailed distributions exhibit more frequent large positive or negative price changes than would be expected if price changes followed a normal distribution. Distributions of financial data also exhibit higher peaks than would be expected if they followed a normal distribution. Hence price changes are not normally distributed.
Autocorrelation is a measure of how similar a time-series is to itself when time shifted. The plot shows the returns series is most similar to itself when it is not time shifted at all. At other lags the series exhibits slight similarity, but it is negligible. If we were to see peaks at other lags then the returns series would be classified as cyclic, which is not uncommon in financial data.
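Both properties discussed above can be quantified with a few lines of code. The sketch below uses the standard population formulas for excess kurtosis (zero for a normal distribution, positive for a fat-tailed one) and for the autocorrelation at a given lag; the sample data is invented to show the effect of two outliers.

```python
def excess_kurtosis(x):
    """Fourth standardised moment minus 3 (0 for a normal distribution)."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m4 / m2 ** 2 - 3.0

def autocorrelation(x, lag):
    """Similarity of the series to itself when shifted by `lag` steps."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x)
    cov = sum((x[t] - mean) * (x[t + lag] - mean) for t in range(n - lag))
    return cov / var

# Invented returns: mostly flat, with one large rise and one large fall.
returns = [0.0] * 98 + [10.0, -10.0]
print(excess_kurtosis(returns))
print([round(autocorrelation(returns, lag), 3) for lag in range(4)])
```

A strongly positive excess kurtosis on real returns would support the fat-tailed picture, and autocorrelations near zero at all non-zero lags would support the absence of cycles.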
In Part 1 I mentioned briefly the performance measures we would need to use. Elaborating further, [7, 8, 9] make mention of “thick modelling”, which I believe implies that a different statistical approach to, let's say, the Sharpe ratio is required to measure performance correctly. In fact we have the Diebold-Mariano error test, cross-validation and the Harvey-Leybourne correction test, which are commonly used to compare forecasting models. Perhaps these would be useful for comparison within a set of different NN models. So we have a rough idea about testing.
Oct 01
Part 1: Setting project goals
It may seem obvious to any market player that one of the objectives is to predict the future value of an asset. This doesn’t necessarily have to be the case, because predicting the actual price is extremely difficult, particularly for volatile asset classes. Let's say the FTSE100 index is currently at 6500 and that it changes by an average of 50 points per day. If your objective is to forecast the future value of the index to within 5 points, your model will have to present predictions with an accuracy of about 0.08% (5 out of 6500), which is tough, but not impossible. Now let's say the goal of the model is to predict the one-day change in the index rather than the absolute value: the required accuracy drops to 10% (5 out of 50). Changing the predictive target can substantially affect the ease of the predictive task. The choice is not limited to one kind of predictive target and many others can be used. The key is to make it as simple as possible for your model to make forecasts, without compromising on the quality of predictions.
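The arithmetic behind the two targets is simply the tolerance expressed as a percentage of the quantity being forecast. A small sketch, using the figures from the example above (the variable names are just for illustration):

```python
index_level = 6500.0           # current index level, in points
average_daily_change = 50.0    # typical one-day move, in points
tolerance = 5.0                # how close the forecast must be, in points

# Forecasting the level: 5 points relative to 6500.
accuracy_for_level = 100.0 * tolerance / index_level
# Forecasting the one-day change: 5 points relative to 50.
accuracy_for_change = 100.0 * tolerance / average_daily_change

print(f"level target:  {accuracy_for_level:.3f}% relative accuracy needed")
print(f"change target: {accuracy_for_change:.1f}% relative accuracy needed")
```

The same 5-point tolerance is over a hundred times less demanding in relative terms once the target is the change rather than the level.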
I would like to experiment with FTSE 100 index data to build an MLP neural network model that helps predict the weekly percentage change. I shall use the model to backtest over different out-of-sample periods in order to assess its profit-making capability. It's a little tricky to decide what my performance measures should be, given that there are so many to choose from. But that is to be discussed later. Let us focus on building the model first.