Generic programming in Scala with Shapeless

In my last post about evolutionary computing, I mentioned I started building the project partly just so I could have a play with Shapeless. Shapeless is a generic programming library for Scala, which is growing in popularity - but can be fairly complicated to get started with.

I had used Shapeless from a distance up until this point - using libraries like Circe or Spray-Json-Shapeless that use Shapeless under the hood to do things like JSON de/serialisation without the boilerplate overhead of having to hand-write serialisation for all the different case classes/sealed traits that we want to serialise.

You can read the previous post for more details (and the code) if you want to understand exactly what the goal was, but to simplify: the first thing I needed was, for a given case class, to generate a brand new instance of it.

Now I wanted the client of the library to be able to pass in any case class they wished, so I needed to be able to generically inspect all the attributes of any given case class* and then decide how to generate a new instance of it.

Thinking about this problem in traditional JVM terms, you might reach for reflection - which is certainly one option, although some people still have a strong dislike for reflection on the JVM, and it has the downside of not being type safe: if someone passes in an attribute you don't want to support (a DateTime, UUID, etc.) it has to be handled at runtime, which means more exception handling code as well as losing compile time safety. Likewise, you could just ask clients of the code to pass in a Map of the attributes, which dodges the need for reflection but still has no type safety either.



Enter Shapeless

Shapeless allows us to represent case classes and sealed traits in a generic structure that can be easily inspected, transformed and used to build new instances.

A case class, product and HList walk into a bar

A case class can be thought of as a product (in the functional programming, algebraic data type sense): case class Person(name: String, age: Int, living: Boolean) is the product of name AND age AND living - that is, every time you have an instance of that case class you will always have an instance of name AND age AND living. As well as being a product, the case class in this example also carries semantic meaning in the type itself - for this instance, as well as having these three attributes, we know an additional piece of information: that this is a Person. This is of course super helpful, and central to the idea of a type system!

But sometimes we want to be able to say, generically (but still type safely!), "I have some code that doesn't care about the semantics of what something is, but if it has three attributes of String, Int and Boolean, then I can handle it" - and that's what Shapeless provides. It has a structure called a HList (a heterogeneous list) that allows you to define a type as a list of different types, in a compile time checked way.

Rather than a standard List in Scala (or Java), where we define the single type that the list will contain, with a HList we can define the list with specific types at each position, for example:
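import shapeless.{::, HNil}

// a minimal sketch: a list that must contain exactly a String, an Int
// and a Boolean, in that order - anything else is a compile time error
val example: String :: Int :: Boolean :: HNil = "hello" :: 42 :: true :: HNil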
The above example allows us to define our HList with specific types, which provides our compile time safety - the :: notation is syntactic sugar provided by Shapeless to mirror normal Scala list behaviour, and HNil is the type of an empty HList.

Case class to HList and back again

Ok, cool - we have a way to explicitly define very specific heterogeneous lists of types - how can this help us? We might not want to be explicitly defining these lists everywhere. Once again Shapeless steps in and provides a type class called Generic.

The Generic type class allows conversion of a case class to a HList, and back again, for example:
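import shapeless.Generic

case class Person(name: String, age: Int, living: Boolean)

// Shapeless materialises this instance for us at compile time
val personGeneric = Generic[Person]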
Through compile time macros, Shapeless provides these Generic instances for any case class. We can then use them to convert to and from case classes and HLists:
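// case class to HList
val hlist = personGeneric.to(Person("Bob", 32, living = true))
// hlist: String :: Int :: Boolean :: HNil = Bob :: 32 :: true :: HNil

// and back again
val person = personGeneric.from(hlist)
// person: Person = Person(Bob,32,true)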



Back to the problem at hand

So, we now have the means to convert any given case class to a HList (which can be inspected, traversed, etc.) - which is great, as it means we can allow our client to pass in any case class and we just need to be able to handle generating to and from a HList.

To start with, we will define our type class for generation:
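// our type class: an instance of Generator[A] knows how to produce a new A
trait Generator[A] {
  def generate: A
}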

Now, in the companion object we will define a helper method that can be called without needing anything explicitly passed in - the necessary Generator instance will be provided implicitly (and if there is no appropriate implicit, we get a compile time error, which is our desired behaviour):
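object Generator {
  // summons the Generator for A from implicit scope - if no instance
  // exists for A, this simply won't compile
  def generate[A](implicit generator: Generator[A]): A = generator.generate
}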

As you can see, we can just call the helper method with an appropriate type argument and, as long as there is an implicit Generator in scope for that type, it will invoke that instance's generate method and *hopefully* produce a new instance of that type.

Next we need to define some specific generators - for now, I will just add implicits to support some basic types:
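import scala.util.Random

implicit val intGenerator: Generator[Int] = new Generator[Int] {
  def generate: Int = Random.nextInt()
}

implicit val doubleGenerator: Generator[Double] = new Generator[Double] {
  def generate: Double = Random.nextDouble()
}

implicit val booleanGenerator: Generator[Boolean] = new Generator[Boolean] {
  def generate: Boolean = Random.nextBoolean()
}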
All simple so far - if we call Generator.generate[Int] then it gets the implicit Generator[Int] and generates a random int (not that useful, perhaps, but it works!)

Now comes the part where our generate method can handle a case class being passed in, and subsequently the Shapeless HList representation of a case class (we will recursively traverse the HList structure using the appropriate implicit generators). First up, let's look at how we can implicitly handle a case class being passed in - at first this may look daunting, but it's really not that bad... honestly:
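import shapeless.{Generic, HList}

// for any type T that Generic can map to some HList L, and for which a
// Generator[L] exists, we can generate a T: generate the HList, convert back
implicit def genericGenerator[T, L <: HList](
    implicit generic: Generic.Aux[T, L],
    lGen: Generator[L]): Generator[T] = new Generator[T] {
  def generate: T = generic.from(lGen.generate)
}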
So, what is going on there? The implicit is bound by two type parameters, [T, L <: HList], which are then used in the implicit arguments generic: Generic.Aux[T, L] and lGen: Generator[L] - this matches any type T for which there exists in scope a Shapeless Generic instance that can convert it into some HList L, and for which an implicit Generator[L] also exists. We know from our earlier look at Shapeless that it provides a Generic instance for any case class (converting it to and from a HList), so this implicit will be used whenever we pass in a case class, as long as we have an implicit Generator in scope that can handle a HList - so let's add that next!

Let's start with the base case: as we saw earlier, HNil is the type of an empty HList, so let's handle that one first:
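import shapeless.HNil

implicit val hnilGenerator: Generator[HNil] = new Generator[HNil] {
  // the base case: an empty HList generates an empty HList
  def generate: HNil = HNil
}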
Remember, the goal is, from a given case class (and therefore a given HList), to generate a new one - so for an empty HList, we just want to return another empty HList.

Next we need to handle a generator for a non-empty HList:
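import shapeless.{::, HList}

// generate the head with its own Generator, recursively generate the tail,
// then cons the two back together
implicit def hconsGenerator[H, T <: HList](
    implicit hGen: Generator[H],
    tGen: Generator[T]): Generator[H :: T] = new Generator[H :: T] {
  def generate: H :: T = hGen.generate :: tGen.generate
}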
This case is also relatively simple: because a HList is essentially a linked list, we can handle it recursively. The type parameters [H, T <: HList] represent the type of the head of the HList (probably a simple type, in which case the implicit will resolve to one of our original Int/Boolean/Double generators) and the tail of the HList, which will be another HList (HNil if we are at the end of the list, otherwise it will resolve recursively to this generator again). We grab the implicit Generator for the head and the tail, then simply call generate on both - and that's really all there is to it! We have defined the implicit for the top level case class (which converts it to the generic HList and then calls generate on that), the implicit to recursively traverse the HList structure, and finally the implicits to handle the simple types that might be in our case class.

Now we can simply define a case class, pass it to our helper method and it will generate a brand new instance for us:
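// any case class whose attributes are covered by our simple generators
case class Config(count: Int, ratio: Double, enabled: Boolean)

val instance = Generator.generate[Config]
// e.g. Config(1652209508, 0.374, true) - a brand new, randomly generated instance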

I was using this approach to generate candidates for my evolutionary algorithm, but exactly the same approach could easily be used to generate test data (and much more). I later used the same approach documented here to transform instances of case classes as well.


Here is the completed code put together:
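import scala.util.Random
import shapeless.{::, Generic, HList, HNil}

trait Generator[A] {
  def generate: A
}

object Generator {

  def generate[A](implicit generator: Generator[A]): A = generator.generate

  // simple type generators
  implicit val intGenerator: Generator[Int] = new Generator[Int] {
    def generate: Int = Random.nextInt()
  }

  implicit val doubleGenerator: Generator[Double] = new Generator[Double] {
    def generate: Double = Random.nextDouble()
  }

  implicit val booleanGenerator: Generator[Boolean] = new Generator[Boolean] {
    def generate: Boolean = Random.nextBoolean()
  }

  // base case: an empty HList generates an empty HList
  implicit val hnilGenerator: Generator[HNil] = new Generator[HNil] {
    def generate: HNil = HNil
  }

  // recursive case: generate the head, then the tail, and cons them together
  implicit def hconsGenerator[H, T <: HList](
      implicit hGen: Generator[H],
      tGen: Generator[T]): Generator[H :: T] = new Generator[H :: T] {
    def generate: H :: T = hGen.generate :: tGen.generate
  }

  // top level case: convert a case class to its HList, generate, convert back
  implicit def genericGenerator[T, L <: HList](
      implicit generic: Generic.Aux[T, L],
      lGen: Generator[L]): Generator[T] = new Generator[T] {
    def generate: T = generic.from(lGen.generate)
  }
}

case class Config(count: Int, ratio: Double, enabled: Boolean)

val instance = Generator.generate[Config]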


Footnote

I have mentioned so far that Shapeless provides the means to transform both case classes and sealed traits into generic structures, but have only really talked about case classes. The sealed trait is a slightly different shape to a case class (it is analogous to the coproduct in algebraic data type terms) and accordingly Shapeless has a different structure to handle it: rather than a HList, it has a type called Coproduct. The above code can be extended to also handle sealed traits - you just need to add an implicit to handle the empty coproduct (CNil) and another to recursively handle the Coproduct.
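As a rough illustration, those two implicits might look something like the following sketch - note this naive version always generates the first alternative of the coproduct (a more useful implementation might choose an alternative at random):

import shapeless.{:+:, CNil, Coproduct, Inl}

// CNil is uninhabited, so this can never actually be invoked - it exists
// only as the base case for the recursion
implicit val cnilGenerator: Generator[CNil] = new Generator[CNil] {
  def generate: CNil = throw new IllegalStateException("CNil has no values")
}

// a naive sketch: always generate the head alternative of the coproduct
implicit def coproductGenerator[H, T <: Coproduct](
    implicit hGen: Generator[H]): Generator[H :+: T] = new Generator[H :+: T] {
  def generate: H :+: T = Inl(hGen.generate)
}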


Intro to Genetic Algorithms in Scala

This is intended to be quite a simple high level approach to using some of these ideas and techniques for searching solution spaces. As part of my dissertation back when I was in university, I used this approach to find optimal configurations for Neural Networks. More recently, I have applied this approach to finding good configurations for training an OpenNLP model, with pretty good results - which is really what made me think to write a little about the topic.

So what is it?

Genetic algorithms are a subset of evolutionary computing that borrow techniques from classic evolution theories to help find solutions to problems within a large search space.

So, what does that mean? It sounds a little like a bunch of academic rhetoric that makes nothing any clearer, right? But actually the concept is quite simple, and works like the evolutionary theory of "survival of the fittest": you generate a large set of random possible solutions to the problem and see how they all perform at solving it; from there, you "evolve" the solutions, allowing the best ones to progress and evolve while cutting out the worst performers.

How it works

In practical terms, at a high level, it involves the following steps:
  1. Create an encoded representation of the problem solution - this might sound complicated, but it just means capturing all the possible inputs or configuration for a solution to the problem.

  2. Randomly generate a set of possible solutions - that is, a set of configurations/inputs (this is called a "population")

  3. Assess the performance of these solutions - to do this, you need a concept of "fitness", i.e. how well a solution performs - and rank the population

  4. Evolve the population of solutions - this can involve a range of techniques of varying complexity, often keeping the top n candidates, dropping the bottom m candidates and then evolving/mutating the rest of the new population

  5. Repeat the rank & evolve cycle for as many iterations as needed (depending on how it performs) - a minimal sketch of this loop follows below
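In code, the whole cycle might look something like this minimal sketch (here generate, fitness and evolve are assumed to be supplied for the problem at hand, with a lower fitness score meaning a better solution):

def run[A](populationSize: Int, generations: Int)(
    generate: () => A,               // step 2: random candidate solutions
    fitness: A => Double,            // step 3: lower score = fitter candidate
    evolve: Seq[A] => Seq[A]): A = { // step 4: produce the next population
  val initial = Seq.fill(populationSize)(generate())
  val last = (1 to generations).foldLeft(initial) { (population, _) =>
    evolve(population.sortBy(fitness)) // rank, then evolve (step 5: repeat)
  }
  last.minBy(fitness)
}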


To understand it, let's first consider a simple, hypothetical numeric problem. Say you have some function f(x, y, z) that returns an integer, and we want to find a combination of x, y and z that gives a value closest to 10,000 (e.g. you want a cuboid with a volume of 10,000 and you want to find dimensions for the width, depth and height that achieve this - like I said, this is just hypothetical). The first step is to encode the solution, which is pretty simple - we have three inputs, so our candidate looks like this:

case class Candidate(x: Int, y: Int, z: Int)

We will also need a fitness function that will assess how a candidate performs (really, this is just an evaluation of the inputs against our function f):

def fitness(candidate: Candidate): Int = Math.abs(10000 - (candidate.x * candidate.y * candidate.z))


Now that we have a generic representation of the solution and a function that can measure how well a candidate performs, we can randomly generate a population. The size of the population can be chosen to suit: the bigger the population, the longer an evolution cycle will take, but it also gives you a wider gene pool of potential solutions, which improves your chances of finding a better one.

Once we have our initial population, we can evaluate them all against our fitness function and rank them - and from there, we can evolve!

Survival of the fit, only the strong survive

So how do we evolve our candidates? There are a few techniques, we will look at the simplest here:

  • Survival of the fittest - the simple and sensible approach whereby the top performing candidate in your population always survives as is (in the scenario where we have stumbled upon the optimal solution in the first random generation, we want to make sure it is preserved)

  • Mutation - much like genes get mutated in normal evolution, we can take a number of candidates and slightly adjust the values in each. For numeric "genes" it's easy - we can just adjust the number. A simple, popular technique for doing this is Gaussian mutation, which ensures that in most cases the value will be close to the original, but with a chance of a larger deviation (it's a bell curve distribution, whereby the peak of the bell curve is the original value).

  • Cross breeding - again, like normal reproduction, randomly select two parent candidates (probably from the top x% of the candidate pool) and take some attributes from each parent to make up a child

Between mutation and cross breeding you can create a new pool of candidates and continue the search for a solution.
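For the Candidate class above, mutation and cross breeding might look something like this sketch (the sigma parameter is an assumed knob controlling how far a mutated value tends to stray from the original):

import scala.util.Random

// Gaussian mutation: most mutated values land close to the original, with
// an occasional larger deviation out on the tails of the bell curve
def mutate(value: Int, sigma: Double): Int =
  (value + Random.nextGaussian() * sigma).round.toInt

// cross breeding: each gene of the child comes from a randomly chosen parent
def crossBreed(mum: Candidate, dad: Candidate): Candidate =
  Candidate(
    if (Random.nextBoolean()) mum.x else dad.x,
    if (Random.nextBoolean()) mum.y else dad.y,
    if (Random.nextBoolean()) mum.z else dad.z
  )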

Some code

Partly because I wanted to learn a bit more about the Shapeless library (more of that in another post) and partly for fun, I created a simple GitHub project with some code that has very simple evolutionary search machinery that allows you to try and use these techniques.

The Shapeless stuff was there to support generic encoding of the solution - so the library can easily be used to define a solution representation and a fitness function.

Following on from the example above, to find three dimensions that have a volume of 10,000, I have the following configuration:
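Something along these lines - note this is an illustrative sketch rather than the project's exact API, with Gene[Int] (and its value member) standing in for the library's actual gene types:

// hypothetical sketch: Gene is a stand-in for the library's gene types,
// which carry their own generation/mutation logic and min/max configuration
case class ThreeNumbers(x: Gene[Int], y: Gene[Int], z: Gene[Int])

// fitness: how far the product of the three numbers is from 10,000
val fitness: ThreeNumbers => Double =
  c => Math.abs(10000 - (c.x.value * c.y.value * c.z.value))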
In the above, we first define the case class ThreeNumbers - this can be any case class you like, but it must have Gene[A] attributes - these are custom types I created that have defined generation and mutation methods, and carry additional configuration (such as max and min values for numbers etc.). We then pass all the configuration in, along with a fitness function of the form B => Double, and run it.

It varies on each run, as the initial pool of candidates is generated at random, but in most cases this example solves the problem (finds an error rate of zero) within 10 generations.



As mentioned, I have used this approach to train OpenNLP NER models - the different training data generation parameters and the training configuration were all encoded as a candidate class, and this method was then used to evolve towards the best training configuration for named entity recognition - but it can be applied to all sorts of problems!

Feel free to try it out and let me know if it works well for any problem domains - you can also check out the code and make it more sophisticated in terms of evolution/mutation techniques and introducing a learning rate.


Agile Machine Learning: From Theory to Production

Later this year, Sumanas and I will be co-presenting a talk about researching Machine Learning in an agile development environment at the JAXLondon conference. This is a high level overview of some of the topics we will be presenting (we will also try to get some cool ML demos in there too, just to make things a bit more interesting).

So what’s the problem?

Artificial Intelligence (AI) and Machine Learning (ML) are all the rage right now - at Google's recent I/O 2017 event they put ML front and center, with plans to bake it into all their products, and lots of other large companies are positioning themselves as Machine Learning companies as a natural progression from Big Data.

According to a recent Narrative Science survey, 38% of enterprises surveyed were already using AI, with 62% expecting to be using it by 2018. So it would be understandable if many companies might be feeling the pressure to invest in an AI strategy, before fully understanding what they are aiming to achieve, let alone how it might fit into a traditional engineering delivery team.

We have spent the last 12 months taking a new product to market, trying to go from a simple idea to a production ML system. Along the way we have had to integrate open ended academic research tasks with our existing agile development process and project planning, as well as work out how to deliver the ML system to a production setting in a repeatable, robust way, with all the considerations expected of a normal software project.

Here are a few things you might consider if you are planning your ML roadmap (topics we will cover in more detail in the JAXLondon session in October).

Machine Learning != your product

Machine Learning is a powerful tool to enhance a product - whether by reducing the cost of human curation or by powering better understanding in your voice/natural language interface - but Machine Learning shouldn't be considered the selling point of the product. Think of the end result Product First: is there a market for the product regardless of whether it is powered by ML or human oversight? Consider whether it makes sense to build a fully non-ML version of the product first, to start proving the market fit and delivering value to customers.

Start small and build

The Lean Startup principles of MVP and fast iterations still apply here. Following on from the point above, start with a non-ML product; then, as you begin to apply ML techniques, even a small increase in performance (better recommendations, reduced human effort/cost, improved user experience - replacing a human process with ML for just 5% of cases can start to realise cost benefits) can start adding value straight away. From a small start you can begin proving the value being added whilst also getting the ML infrastructure tested and proven.

Tie into development sprint cycles

You may be hiring a new R&D team, or you may be using members of your existing engineering team; either way, it is helpful to have them working in similar development sprint cycles (if you work in sprints). It allows both sides to understand what is happening and how work is progressing - product and engineering changes and issues might be useful in informing the direction of R&D, and likewise there may be data features or feedback from the R&D team that could easily be engineered and would make research simpler. Whilst research is ongoing, and often a time consuming task, having fortnightly (or whatever the sprint length is) checkpoints where ideas can be discussed and demoed is good for the whole team's understanding, as well as being a positive motivator.

Don’t forget Clean Code!

Whilst experimenting and researching different ideas it can be pretty easy to fall into hacking mode, just rattling out rough scripts to prove an initial concept or idea - and there is definitely a place for this - but as your team progresses it is more beneficial to invest in good coding principles. Whilst one-off scripts make sense to hack out, as the team works across several ideas, having code that is re-usable and organised sensibly, with proper separation of concerns, makes future research easier, as well as reducing the cost of productionisation. Some machinery to make experiments easily testable (and to benchmark different solutions) is worth investing in from the start.

Recommended Reading

While the interwebs are awash with Machine Learning articles, tutorials and click-baitey links guaranteed to reduce the error of your model in 3 quick steps, the following is a small list of resources that we think are worth browsing.



For the academically inclined, the following is a list of papers, both recent and not so recent:




During the session at JAXLondon later this year, we will go into more detail on these ideas, as well as others, including technical and architectural considerations for building and deploying an ML stack.
