Quantcast aiQUANT » 2007 » November

Archive for November, 2007

Financial applications of Artificial Immune Systems

Artificial Immune Systems 9 Comments »

Isystems.pngAlgorithms inspired by biological immune systems also find their application in Finance. Although not as popular as the Neural Net or GA and their derivatives, recent research has shown that these algorithms can provide optimisation and classification capabilities comparable to conventional methods [Gonzalez, Dasgupta]. The aim in designing artificial immune systems (AIS) is not to produce exact models of the natural immune system, but rather extract ideas and metaphors that can help solve real-world problems. There are two main categories of the Artificial Immune system namely Negative Selection and Clonal Selection which lend their inspiration to Classification and Optimisation respectively.

immuApplic.png

Background

The natural immune system is comprised of a network of specialised cells and chemical molecules which work together to defend the host organism from foreign objects (Pathogens) such as bacteria, fungi, chemical compound etc. The three main capabilities of the Natural Immune system are:

  1. Recognise a foreign body (pathogens) such as bacteria, fungi, chemical compound etc.
  2. Destroy the pathogen.
  3. Remember an almost unlimited number of pathogens.

ImmuneImage.jpg

There are two types of cells comprising the immune system: phagocytes and lymphocytes. The first group belongs to the innate immune system while the latter group mediates adaptive immunity. A major component of the population of lymphocytes is made up of B and T cells, whereas that of phagocytes comprises of B cells only. These cells are capable of recognising and responding to certain antigen (foreign body) patterns present on the surface of pathogens. The picture above tries to illustrate this.

Although the real workings of the immune system is complicated, the immunity process can be summarised as follows:

  1. Antigen-secreting pathogen enters the body
  2. B cells are activated by the foreign antigen
  3. With the help of T cells, B cells undergo cloning and mutation
  4. Plasma B cells secrete immunoglobulins which attach to the antigen
  5. Marked antigens are attacked by the immune system
  6. Memory of the antigen is maintained by B memory cells

The key point is that immune systems demonstrate self-organisation and adaptive behavior. From a modelling perspective, it can be considered as a sophisticated information processing system which possesses powerful pattern recognition and classification abilities. It also has the capability to adapt to new circumstances (problems), and can remember solutions to problems it has previously encountered. In designing artificial immune algorithms we wish to draw metamorphical inspiration from the workings of the natural immune system to design algorithms to solve computational problems. The negative selection algorithm and clonal expansion and selection algorithm are examples of applications of such metaphors.

Negative Selection Algorithm

The basis of the negative selection algorithm is the ability of the immune system to distinguish between two system states i.e. normal and abnormal. [Forrest] developed the negative selection algorithm and has been modified since by a number of researchers. The negative selection algorithm can be summarised as:

  1. Detector set D is empty
  2. Repeat
  3. Create a random vector x drawn from the valid range
  4. For every self element s in the self vector
  5. Calculate the euclidean distance between s and x
  6. If distance < threshold go to step 2
  7. Else add x (a valid non-self detector) to set D.
  8. Until set D contains the required number of detectors (assume N)

The flowchart below shows the above steps

detCreaTrain.PNG

To understand this let us assume we would like to classify corporate bonds so as to differentiate investment grade from junk grade. Assume that the diagnosis will be made on the basis of ratio information drawn from the financial statements of the companies and that we also have a data-set containing example ratios for financially healthy companies and also for companies which later defaulted on their bonds.

Self is defined as financially healthy companies. Next a set of detectors (of size D) is randomly created. Each of these detectors consists of a vector of real numbers, corresponding to financial data obtained from the financial statements (such as accounting ratios). The negative selection process is then applied, whereby a series of vectors of ratios corresponding to healthy companies is presented to the detectors. Any detector which is identical or similar (the degree of similarity could be measured using the euclidean distance) to a data vector corresponding to a healthy company is discarded. As detectors are discarded, new detectors are created to to build up the size of the population D, which are subject to the same negative selection process. If a detector is created which fails to match any vector in the training set of healthy companies, it is a potentially useful detector of an unhealthy company. This process is repeated until all the detectors in the out-of-sample set is exhausted.

NIBRcomp2.png

The main challenges in applying this algorithm are:

  • It ignores the fact that examples of past failing corporate bonds exist
  • The task of generating a population of valid detectors will grow rapidly as the size of self increases.

These issues depend on on whether a good seed is used in generating the random number which depend on historical values of the accounting ratios. A good seed inevitably leads to early convergence which improves classification at steady state.

Clonal Expansion and Selection Algorithm

Unlike the negative selection algorithm, the clonal selection algorithm aims to mimic the creation of a large number of antibodies which will bind strongly to a specific antigen. This metaphor is adopted to design an optimisation algorithm in which the antibody can be considered as a potential solution, where the degree of the binding or fit between the antibody and the antigen represents the quality of the solution. The objective, therefore, is to start from an initial population of solutions, rest them against the antigens, using the algorithm iteratively. The steps in the clonal expansion and selection algorithm can be summarised as:

  1. Create an initial random population P of solution vectors (antibodies)
  2. Select a subset F of the solutions from P, biasing the selection process towards better solutions
  3. For each member of F (parents), create a set of clones, with better members of F producing more clones
  4. Mutate each of these clones, in inverse proportion to the parent’s fitness. Better solutions are mutated less (hypermutation)
  5. Select a subset of newly generated solutions S
  6. Create a number of newly created random solutions R
  7. Replace poorer members of P with better solutions from S and R
  8. Repeat steps 2 - 7 until termination condition is reached

The key difference between the clonal selection algorithm and other evolutionary algorithms is that this method in which variety of candidate solutions is generated when seeking to iteratively improve solutions.

Analogy.png

As an example I apply the clonal selection algorithm to finding the maximum value of a two dimensional multi-modal function. The multi-modal function could for instance describe the fitness of an asset allocation problem. I set my parameters as follows:

  • Number of iterations: 25
  • Number of antibodies: 100
  • Number of clones per antibody: 10
  • Clone Mutation rate: 0.010

Our initial population of 2-dimensional solution antibodies occupies different locations of the function:

initiPopuCS.png

After 25 iterations we see the clonal algorithm has managed to converge to the maximum points on the function

finalPopuCS.png

We have shown that the clonal expansion and selection can be used to optimise a 2-dimensional multi-modal function. This example can be extended to problems of n-dimension such as optimising trading rules or tactically allocating a portfolio of stocks to meet certain risk /return requirements. We will hopefully discuss such problems in due course. I hope this post gives the reader a rough idea of how Artificial Immune Systems can be applied to finance. Off-course complete understanding of these algorithms comes from the workings of the actual biological process from which the metaphors are drawn, which the reader may be keen to explore further before attempting to apply these algorithms.

    Creating “Complex” price from “Real” price

    Hilbert Transform, Technical Indicators No Comments »

    There are many benefits of converting price data in complex number form. The presence of an Imaginary component gives one the freedom to create other kinds of indicators which can provide further insight into the statistical properties of the asset price. For example, indicators that fall into the categories below can only be derived from complex price data:

    • Signal to Noise ratio (SNR)
    • Cycle Period
    • Phasor diagrams
    • Predictive Indicators (i.e. filters with Negative Lag)
    • Power Spectral Density of the price

    A complex number takes the form

    a + ib where i=sqrt{-1}

    Real price data only have an a component, aka the InPhase component. The aim is to derive the corresponding Quadrature component using the InPhase data of the price at a given bar. The Hilbert Transform provides a way of doing this, and an excellent description can be found at wikipedia.

    To complement the notes we can think of the Hilbert Transform as a control system, the simplest of which looks like:

    HilTransf.png

    Note that z^{-n} represents the z-domain delay operator. It remembers its input value for n samples into the future.

    We shall discuss Indicators based on Complex Price Data in due course.

    Forecasting via Genetic Algorithm: Part 6

    Genetic Algorithm No Comments »

    Part 6: Simulation Results and wrap-up

    A fitness landscape is a 3-D (or sometimes 4-D) plot which shows how the fitness value varies with respect to other variables that it depends on. To obtain the fitness landscape for our problem I altered the population size from 2 to 20 for mutation probabilities within the range 0 to 1. I then ran the GA for 25 generations and recorded the average fitness of the entire population so as to ascertain optimum mutation rate and population size to use. Alternatively we could have recorded the fitness of the fittest organism within the population, but this would not have given us an indication of the overall quality of solutions evolved by the algorithm. Hence the reason for taking the overall population fitness into account. The plot below shows how the average fitness changes for different population size and mutation rate.

    fLandSurf.png

    We can also view the 3-D plot in the form of a contour plot:

    fLandContour.png

    We are interested in the dark regions because the value of the fitness function in these locations is closest to zero. It is important to point out that ideally the fittest organism in our problem should have a fitness of 0 as this represents that there in no difference between the predicted value and the actual value. We see that a mutation rate of alteast 0.5 and population size greater than 9 will yield fittest solutions within the first 25 generations.

    mRate2.pngUsing the minimum mutation rate and population size (0.5 and 10 respectively) we run the GA for mating rates varying from 0.1 to 1 for 25 generations. This time we are interested in seeing how the fitness value depends on the mating rate The graph shows that for mating rates upto about 0.7, the algorithm produces fit organisms. For rates above 0.8 we see poor solutions as the overall fitness rapidly increases. The explanation for this is that organisms that high mating rates almost always results in child genes being dominated by that of the parents which gives little room for population diversity. Hence the population remains generally unfit for many generations into the future.

    Hence for our experiment we use the following parameters:-

    Population size: 10
    Mutation rate: 0.5
    Mating rate: 0.5

    We let the GA run for 500 generations and get the fittest genome which we use in our prediction function. Recall that our prediction function is:

    eq1.PNG

    The paremeters of the fittest genome evolved by the GA after 500 generations are:

    \alpha=0.1,\hspace{1}\beta=-0.2,\hspace{1}\gamma=0.3,\hspace{1}\delta=-0.2,\hspace{1}\theta=0.1,\hspace{1}\tau=-0.1,\hspace{1}\mu=6.6

    a=25,\hspace{1}b=22,\hspace{1}c=7,\hspace{1}d=11,\hspace{1}e=55,\hspace{1}f=47,\hspace{1}

    We can now apply our evolved prediction function to the 3 data sets of the FTSE 100 time series. However, to give more meaning to our results we can compare with the Random Walk model of an asset price, which takes the form:

    P_{t+1}=P_{t}+\mu+Z_{t}

    where \small Z_{t} is a normally distributed random variable and \small\mu is drift.

    Comparative results are summarised below:

    results.png

    Data set 2 is the true out-of sample range with we intended to forecast via our prediction function. Clearly it under-performs relative to the random walk model suggesting that our prediction function provides a poor approximation of the FTSE 100 index. Note that it is not the GA which is at fault here, but the prediction function.

    Key Points:

    • We have shown that the GA can evolve optimum parameters for a pre-defined prediction function
    • The performance of the GA is totally independent to the forecasting ability of the prediction function. The GA will almost always find the best solutions for the prediction function, which might be a poor approximator of the asset price as in our case.
    • GAs are very powerful tools for global search and optimisation. However, there is a trade-off between search coverage and simulation time. Analysing fitness landscapes based on a small number of evolved generations can help estimate suitable population size as well as mutation rate and mating rate hence reducing simulation time for larger generations.

    Possible improvements:

    • Let the algorithm search for its own prediction function. This leads to Genetic Programming which helps generate very sophisticated prediction functions.
    • Use a range of other fundamental and technical indicators as inputs to the prediction function.
    • Apply filtering so as to reduce noise over-fitting.

    Towards a C++ Genetic Algorithm library

    GAlib 3 Comments »

    The current GA model we are experimenting with in Matlab is proving to be a pain in terms of simulation time. I find myself having to wait tens of minutes for a few hundred generations to evolve. Clearly this will not help when we move on to more sophisticated problems such as Evolving Trading Rules as we will be dealing with complicated objective functions. I feel it’s about time we moved on to a faster C++ model which would be beneficial in terms of execution time and portability across systems. I recently came across an open source Genetic Algorithm library written by a PhD student at MIT.

    GAlib: http://lancet.mit.edu/ga/

    From the documentation it appears to be pretty sophisticated with some great functionality. Most of the leg-work is already done which would reduce a lot of programming for problems we would like to solve.

    I shall start by compiling example code supplied with it, but there are 2 options I have:

    1. Compile it under MS Visual C++ 2008 Beta which would inevitably lead to some thorough compile-time debug
    2. Compile it under UNIX using the g++ compiler which is a smarter compiler than MS Visual Studio

    At present I’m undecided between the two but in any case will start to post my endeavours with the library over the next few days. In parallel we shall continue with the Forecasting via GA experiment and hopefully by the time we finish it we would have gained sufficient experience with GAlib in-order to do some serious work.

    Writing a Matlab MEX function

    Matlab No Comments »

    The good thing with Matlab is that it provides excellent library of functions and great felxibility to implement something in very less time. One of the other great things is that you can write your code in another language and run it in the Matlab environment. Matlab’s MEX (Matlab-EXecutable) feature allows C code to compile into a Matlab compatible executable so that it runs much faster than its M-file equivalent. Under Windows the MEX function is compiled into a .dll or win32 binary file. If you are running Matlab in other environments then the function would need to be compiled into an appropriate format:

    mexExts.png

    Usually the Matlab compiler automatically identifies the appropriate format of the target file.

    In this example I show a MEX function for calculating the mean of an array. We will then use this function in our Genetic Algorithm to show that it improves the speed of our code. I keep the explanation brief as there are numerous resources elsewhere which explain how to write MEX functions.

    The standard interface for a MEX function written C takes the form:

    mexInt.jpg

    • nlhs (Type = int): This paramter represents the number of “left hand side” arguments.
    • plhs (Type = array of pointers to mxArrays): This parameter is the actual output arguments. As we will see in our example code that, an mxArray is MATLAB’s structure for holding data and each element in plhs holds an mxArray of data.
    • nrhs (Type = int): Similar to nlhs, this paramter holds the number of “right hand side” arguments.
    • prhs (Type = const array of pointers to mxArrays): This array hold all of the pointers to the mxArrays of input data.

    In a similar fashion our MEX function code for calculating the mean of an array is:

    meanMEX.png

    compProf.png

    The resulting .dll file runs 40% faster than its .m file equivalent which slightly improves the speed of our Genetic Algorithm.

    The bigger window shows profiling with the MEXed code.

    • .m file = approx 1 second per generation
    • .dll MEX = approx 0.6 seconds per generation

    We will move to a complete C++ model in a couple of weeks when we look at a more challenging problem using a self-evolving Prediction Function.

    Resources:

    How to Optimally Reward a Trader

    Trading No Comments »

    I haven’t yet started writing the Matlab MEX function but I hope to start today and post results by Saturday latest. In the mean time have a look at this paper which gives a fairly mathematical treatment about Rewarding a Trader.  Note that the figures and diagrams are shown at the end of the paper.

    Abstract:

    Traders are compensated by bonuses, in addition to their basic salary. However, little s known about how to optimally reward a trader. In this article we build a framework for the study of this problem and explore a variety of possible compensation structures.

    You can freely download it from here.

    Matlab memory optimisation issues

    Matlab 1 Comment »

    As a Matlab user I’ve found writing code that make use of very large data sets can sometimes run very slow.  Memory performance is not increased at the same rate as CPU performance if the structure of the code is such that it is ”memory-bound”.  But with a little knowledge about how Matlab stores and accesses data, we can avoid inefficient memory usage and hence improve execution time.

    As an example let us consider two similar snippets of code below:

    A1A2.PNG

    Code A1 time = 56.56 seconds.  Code A2 time = 0.032 seconds

    We see a very big difference in execution times for both codes which do the same thing!  So why is code A2 faster?  Matlab does not require the user to declare the types and sizes of variables before they are used.  The drawback of this is that each time you use an array variable for instance, Matlab must allocate memory for the a new larger array and copy existing data into it.  This can be very memory intensive and hence slow down your code as in A1.  Code A2 has a section of memory already allocated before it enters the for loop and as a result runs 1750% faster.  Try it!

    Another example is to do with the fact that Matlab favours Columns over Rows:

    B1B2.PNG

     Code B1 time = 1.34 seconds.  Code B2 time = 1.01 seconds

    Code B2 is faster because it traverses the elements of the 2-D array by going down the columns in the inner loop, which is similar to the ”fast cache” mechanism CPUs use to reduce the time taken to access buffers in main memory.  Similar results are noted for arrays of N-dimension.

    Main points:

    • Preallocate large vectors and matrices before populating them with data they need to hold.
    • When processing 2-D or N-D arrays, access your data in columns rather than rows.
    • Avoid creating unnecessary variables unless they are essential to your algorithm.
    • Try to do most of your processing within variables that already exist.

    Forecasting via Genetic Algorithm: Part 5

    Evolutionary Computation, Genetic Algorithm, Matlab No Comments »

    Part 5: Programming Issues

    I wrote code for the Genetic Algorithm in MATLAB and it is terribly slow.  A run for 50 Generations averages to about 1 second per generation.  Here are some performance measures obtained using Matlab Profiler:

    prof1.png

    The FitnessFunction function is the biggest culprit and takes almost 100% of the total simulation time.  Evaluation of the fitness function is the most time consuming part in the Genetic Algorithm.  This is generally the case even with simpler fitness functions because on a relative scale, operators such as crossover and mutation require very little algebraic manipulation.  Looking at the function itself we get more detail:

    prof2.png

    We find that evaluations for alpha and beta take the most time.  I shall attempt to speed up the algorithm by writing a MEX function.  In doing so we expect the simulation time to drop by many orders of magnitude as compiled MEX code would not need to run under Matlab’s interpreted mode.

    Forecasting via Genetic Algorithm: Part 4

    Forecasting, Genetic Algorithm, Model Development No Comments »

    Part 4: Roulette Selection

    So we have initialised our population of chromosomes each of which is made up of 13 genes.  We can represent our population in the from of a matrix where individual chromosomes are represented vertically and it’s genes horizontally.  We have something like:

    matrixFitn.png

    Each member of the population is assigned a fitness value based on the fitness function, which in our case is defined as

    eq2.PNG

     

    Why Roulette Selection?

    Roulette wheel sampling was first suggested by Goldberg and provides a method of global selection when applied to members of a population.  In order for genes of fit chromosomes to be passed on to future generations, there must be some form of selection method to decide which members of the population will be selected to breed and which ones will be replaced.  It seems sensible to breed from genes belonging to the fittest chromosomes and replace the least fit ones, but this has a draw back: the GA can converge to local minimum as the search space is restricted to the genes covered by the fittest chromosomes only.  One of the mechanisms that enable genes of weaker organism to be passed on while at the same time favouring fitter chromosomes is Roulette Selection.  To understand this consider the following example:

    Assume we have a population of 4 chromosomes each of which has their fitness value evaluated. To construct a Roulette wheel for this population we calculate the Roulette probability of each chromosome as a ratio of its fitness to the fitness of the entire population. That is
    RouletteProbability=\frac{chromosomeFitness}{populationFitness}
    The illustration below shows the Roulette wheel for our hypothetical 4-chromosome population:

    rouWeeEg.png

    Extending the method to the problem at hand, we have:

    rouWee.png

     Note that our Roulette wheel represents a population size of 10.  Chromosome 2 for instance has a higher probability of being picked compared to to Chromosome 6.  This off-course doesn’t mean that Chromosome 2 will always be picked.  It all depends on where the Roulette wheel “stops”.  Also note that different population size would have a different representation to its Roulette wheel because every chromosome is allocated a section corresponding to its fitness value.

    Main Points:

    • Roulette Selection is an efficient way of allowing excellent genes embedded in a poor performing chromosome to be passed on to future generations.
    • Roulette Selection enables convergence to global minimum.
    • Think of this method as chromosomes represented on a Roulette wheel which is “spun” whenever a parent needs to be selected for crossover/breeding.
    Close
    E-mail It