Intro to Genetic Algorithms in Scala

This is intended to be quite a simple high level approach to using some of these ideas and techniques for searching solution spaces. As part of my dissertation back when I was in university, I used this approach to find optimal configurations for Neural Networks. More recently, I have applied this approach to finding good configurations for training an OpenNLP model, with pretty good results - which is really what made me think to write a little about the topic.

So what is it?

Genetic algorithms are a subset of evolutionary computing that borrow techniques from classic evolution theories to help find solutions to problems within a large search space.

So, what does that mean? It sounds a little like a bunch of academic rhetoric that make nothing any clearer, right? But actually, the concept is quite simple, and works like the evolutionary theory of "Survival of the fittest" - that is, you generate a large set of random possible solutions to the problem and see how they all perform at solving the problem, and from there, you "evolve" the solutions allowing the best solutions to progress and evolve, cutting out the worst performing solutions.

How it works

In practical terms, at a high level, it involves the following steps:
  1. Create an encoded representation for the problem solution - This might sound complicated, but this just means capture all the possible inputs or configuration for a solution to the problem.

  2. Randomly generate a set of possible solutions - that is, a set of configurations/inputs (this is called a "population")

  3. Asses the performance of these solutions - to do this, you need a concept of "fitness" or how well a solution performs - and rank the population

  4. Evolve the population of solutions - this can involve a range of techniques of varying complexity, often keeping the top n candidates, dropping the bottom m candidates and then evolving/mutating the rest of the new population

  5. Repeat the rank & evolve cycle (depending on how it performs)

To understand it lets first consider a simple, hypothetical numeric problem, let's say you have some function f(x, y, z) that returns an integer, and we want to find a combination of x, y and z that gives the a value closest to 10,000 (e.g. you want a cuboid with volume of 10,000 and you want to find dimensions for the width, depth and height that achieve this - like I said, this is just hypothetical). The first step would be to encode the solution, which is pretty simple - we have three inputs, so our candidate looks like this:

case class Candidate(x: Int, y: Int, z: Int)

We will also need a fitness function, that will asses how a candidate performs (really, this will just be evaluation of the inputs against our function f)

def fitness(candidate: Candidate): Int = Maths.abs(10000 - (x * y * z))

Now we have a representation of the solution, in generic terms, and we have a function that can measure how well they perform, we can randomly generate a population - the size of the population can be chosen as suits, the bigger the population the longer an evolution cycle will take, but of course gives you a wider gene pool for potential solutions which improves your chances of a better solution.

Once we have our initial population, we can evaluate them all against our fitness functions and rank them - from there, we can evolve!

Survival of the fit, only the strong survive

So how do we evolve our candidates? There are a few techniques, we will look at the simplest here:

  • Survival of the fittest - the simple and sensible approach that the top performing candidate in your population always survives as is (in the scenario where by we have stumbled upon the optimal solution in the first random generation, we want to make sure this is preserved)

  • Mutation - much like you might imagine genes get mutated in normal evolution, we can take a number of candidates and slightly adjust the values in our candidate - for numeric "genes" its easy - we can just adjust the number, a simple popular technique for doing this is Gaussian mutation, which ensures that in most cases the the value will be close to the original value, but there are chances that it could be a larger deviation (it's a bell curve distribution, whereby the peak of the bell curve is the original value).

  • Cross breeding - again, like normal reproduction, randomly select two parent candidates (probably from the top x% of the candidate pool) and take some attributes from each parent to make up a child

Between mutation and cross breeding you can create a new pool of candidates and continue the search for a solution.

Some code

Partly because I wanted to learn a bit more about the Shapeless library (more of that in another post) and partly for fun, I created a simple GitHub project with some code that has very simple evolutionary search machinery that allows you to try and use these techniques.

The shapeless stuff was to support generic encoding of our solution - so the library can easily be used to define a solution representation and a fitness function.

Following on from the example above to find three dimensions that have a volume of 10,000 I have the following configuration:
In the above, we first define the case class ThreeNumbers - this can be any case class you like, but must have Gene[A] as its attributes - these are custom types that I created that have defined generation and mutation methods, and have additional configuration (such as max and min values for numbers etc).  We then pass all the configuration in, along with a fitness function of the form (a: B => Double) and run it.

It varies on each run, as the initial pool of candidates is generated at random, but in most cases this example solves the problem (finds an error rate of zero) within 10 generations:

As mentioned, I have used this approach to train OpenNLP NER models - the different training data generation parameters and the training configuration was all encoded as a candidate class and then used this method to evolve to find the best training configuration for named entity recognition, but it can be applied to all sorts of problems!

Feel free to try it out and let me know if it works well for any problem domains - you can also checkout the code and make it more sophisticated in terms of evolution/mutation techniques and introducing a learning rate.