Generic programming in Scala with Shapeless

In my last post about evolutionary computing, I mentioned I started building the project partly just so I could have a play with Shapeless. Shapeless is a generic programming library for Scala, which is growing in popularity - but can be fairly complicated to get started with.

I had used Shapeless from a distance up until this point - using libraries like Circe or Spray-Json-Shapeless that use Shapeless under the hood to do things like JSON de/serialisation without the boilerplate overhead of hand-writing serialisers for all the different case classes/sealed traits that we want to serialise.

You can read the previous post for more details (and the code) if you want to understand exactly what the goal was, but to simplify: the first thing I needed to achieve was, for a given case class, to generate a brand new instance.

Now I wanted the client of the library to be able to pass in any case class they so wished, so I needed to be able to generically inspect all attributes of any given case class and then decide how to generate a new instance for it.

Thinking about this problem in traditional JVM terms, you may be thinking you could use reflection - which is certainly one option, although some people still have a strong dislike for reflection on the JVM, and it also has the downside that it's not type safe - that is, if someone passes in some attribute that you don't want to support (a DateTime, UUID, etc) then it would have to be handled at runtime (which means more exception handling code, as well as losing the compile time safety). Likewise, you could just ask clients of the code to pass in a Map of the attributes, which dodges the need for reflection but still has no type safety either.

Enter Shapeless

Shapeless allows us to represent case classes and sealed traits in a generic structure that can be easily inspected, transformed and turned into new instances.

A case class, product and HList walk into a bar

A case class can be thought of as a product (in the functional programming, algebraic data type sense) - that is, case class Person(name: String, age: Int, living: Boolean) is the product of name AND age AND living: every time you have an instance of that case class you will always have an instance of name AND age AND living. Great.

So, as well as being a product, the case class in this example also carries semantic meaning in the type itself - for this instance, as well as having these three attributes, we also know an additional piece of information: that this is a Person. This is of course super helpful, and kinda central to the idea of a type system!

But maybe sometimes we want to be able to just generically (but still type safe!) say: I have some code that doesn't care about the semantics of what something is, but if it has three attributes of String, Int, Boolean, then I can handle it. That's what Shapeless provides - a structure called a HList (a Heterogeneous List) that allows you to define a type as a list of different types, but in a compile time checked way.

Rather than a standard List in Scala (or Java), where we define the single type that the list will contain, with a HList we can give each element its own specific type, for example:
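A minimal sketch (assuming Shapeless 2.x is on the classpath):

```scala
import shapeless._

// a list whose type tracks each element's type, in order
val hlist: String :: Int :: Boolean :: HNil = "Hello" :: 42 :: true :: HNil
```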
The above example allows us to define our HList with specific types, which provides our compile time safety - the :: notation is some syntactical sugar provided by Shapeless to mirror normal Scala list behaviour, and HNil is the type for an empty HList.

Case class to HList and back again

Ok, cool - we have a way to explicitly define very specific heterogeneous lists of types - how can this help us? We might not want to be explicitly defining these lists everywhere. Once again Shapeless steps in and provides a type class called Generic.

The Generic type class allows conversion of a case class to a HList, and back again. Through compile time macros, Shapeless provides these Generic instances for any case class, and we can use them to convert to and from case classes and HLists:
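For example, with the Person case class from earlier (a sketch assuming Shapeless 2.x on the classpath):

```scala
import shapeless._

case class Person(name: String, age: Int, living: Boolean)

// Shapeless materialises this instance at compile time
val personGen = Generic[Person]

// case class -> HList
val asHList = personGen.to(Person("Alice", 31, living = true))

// HList -> case class
val asPerson = personGen.from("Bob" :: 25 :: true :: HNil)
```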

Back to the problem at hand

So, we now have the means to convert any given case class to a HList (that can be inspected, traversed etc) - which is great, as this means we can allow our client to pass in any case class and we just need to be able to handle the generation of/from a HList.

To start with, we will define our type class for generation:
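The type class itself is tiny - a sketch:

```scala
// a type class whose instances know how to generate a new A
trait Generator[A] {
  def generate: A
}
```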

Now, in the companion object we will define a helper method that can be called simply, without needing anything passed in explicitly - the necessary Generator instance will be provided implicitly (and if there isn't an appropriate implicit in scope, then we get our compile time error, which is our desired behaviour):
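A sketch of that helper (the type class is repeated so the snippet stands alone):

```scala
trait Generator[A] { def generate: A }

object Generator {
  // summon the implicit Generator for the requested type and invoke it -
  // if no implicit instance exists, this is a compile time error
  def generate[A](implicit gen: Generator[A]): A = gen.generate
}
```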

As you can see, we can just call the helper method with an appropriate type argument and as long as there is an implicit Generator in scope for that type, then it will invoke that generate method and *hopefully* generate a new instance of that type.

Next, we need to define some specific generators, for now, I will just add the implicits to support some basic types:
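Something like the following sketch (the value ranges here are arbitrary choices; the type class is repeated so the snippet stands alone):

```scala
import scala.util.Random

trait Generator[A] { def generate: A }

object Generator {
  def generate[A](implicit gen: Generator[A]): A = gen.generate

  implicit val intGenerator: Generator[Int] = new Generator[Int] {
    def generate: Int = Random.nextInt(100) // bounded just to keep example values small
  }
  implicit val doubleGenerator: Generator[Double] = new Generator[Double] {
    def generate: Double = Random.nextDouble()
  }
  implicit val booleanGenerator: Generator[Boolean] = new Generator[Boolean] {
    def generate: Boolean = Random.nextBoolean()
  }
}
```

Putting the instances in the companion object means they are found automatically during implicit resolution, without any extra imports.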
All simple so far - if we call Generator.generate[Int] then it gets the implicit Generator[Int] and generates a random int (not that useful, perhaps, but it works!)

Now comes the part where our generate method can handle a case class being passed in, and subsequently the Shapeless HList representation of a case class (we will recursively traverse the HList structure using appropriate implicit generators). First up, let's look at how we can implicitly handle a case class being passed in - at first this may look daunting, but it's really not that bad.. honestly:
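A sketch of that implicit (restating the type class so this snippet stands alone; assumes Shapeless 2.x):

```scala
import shapeless._

trait Generator[A] { def generate: A }

// For any T that Shapeless can convert to some HList L (via Generic),
// and for which a Generator[L] exists, we can build a Generator[T]:
// generate an L, then convert it back into a T
implicit def caseClassGenerator[T, L <: HList](
    implicit generic: Generic.Aux[T, L],
    lGen: Generator[L]
): Generator[T] = new Generator[T] {
  def generate: T = generic.from(lGen.generate)
}
```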
So, what is going on there? The implicit is bound by two types: [T, L <: HList] - which are then used in the implicit arguments passed into the method: generic: Generic.Aux[T, L] and lGen: Generator[L]. This matches any type T for which there exists in scope a Shapeless Generic instance that can convert it into some HList L, and for which an implicit Generator[L] exists. Now, we know from our earlier look at Shapeless that it will provide a Generic instance for any case class (and case classes will be converted into a HList), so this implicit will be used in the scenario whereby we pass in a case class instance - as long as we have an implicit Generator in scope that can handle a HList. So let's add that next!

Let's start with the base case: as we saw earlier, HNil is the type for an empty HList, so let's start with that one:
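A sketch of the HNil instance (the type class is repeated so the snippet stands alone):

```scala
import shapeless._

trait Generator[A] { def generate: A }

// base case: generating an empty HList just returns HNil
implicit val hnilGenerator: Generator[HNil] = new Generator[HNil] {
  def generate: HNil = HNil
}
```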
Remember, the goal is, from a given case class (and therefore a given HList), to generate a new one - so for an empty HList, we just want to return the same.

Next we need to handle a generator for a non-empty HList:
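A sketch of the recursive case (again restating the type class so it stands alone):

```scala
import shapeless._

trait Generator[A] { def generate: A }

// recursive case: generate the head, generate the tail, cons them together
implicit def hconsGenerator[H, T <: HList](
    implicit hGen: Generator[H],
    tGen: Generator[T]
): Generator[H :: T] = new Generator[H :: T] {
  def generate: H :: T = hGen.generate :: tGen.generate
}
```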
This case is also relatively simple: because our HList is essentially like a linked list, we will handle it recursively. The type parameters [H, T <: HList] represent the type of the head of the HList (probably a simple type - in which case it will try to resolve the implicit using our original Int/Boolean/Double generators) and the tail of the HList, which will be another HList (possibly a HNil if we are at the end of the list, otherwise this will resolve recursively to this generator again). We grab the implicit Generators for the head and tail, then simply call generate on both - and that's really all there is to it! We have defined the implicit for the top level case class (which converts it to the generic HList and then calls generate on that), we have the implicit to recursively traverse the HList structure, and finally we have the implicits to handle the simple types that might be in our case class.

Now we can simply define a case class, pass it to our helper method, and it will generate a brand new instance for us.

I was using this approach to generate candidates for my evolutionary algorithm, however the exact same approach could be used easily to generate test data (or much more). I used the same approach documented here to later transform instances of case classes as well.

Here is the completed code put together:
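A sketch of the whole thing assembled (assuming Shapeless 2.x; the random ranges are arbitrary choices):

```scala
import shapeless._
import scala.util.Random

trait Generator[A] {
  def generate: A
}

object Generator {
  def generate[A](implicit gen: Generator[A]): A = gen.generate

  // simple type instances
  implicit val intGenerator: Generator[Int] = new Generator[Int] {
    def generate: Int = Random.nextInt(100)
  }
  implicit val doubleGenerator: Generator[Double] = new Generator[Double] {
    def generate: Double = Random.nextDouble()
  }
  implicit val booleanGenerator: Generator[Boolean] = new Generator[Boolean] {
    def generate: Boolean = Random.nextBoolean()
  }

  // base case: an empty HList
  implicit val hnilGenerator: Generator[HNil] = new Generator[HNil] {
    def generate: HNil = HNil
  }

  // recursive case: generate the head, then the rest of the list
  implicit def hconsGenerator[H, T <: HList](
      implicit hGen: Generator[H],
      tGen: Generator[T]
  ): Generator[H :: T] = new Generator[H :: T] {
    def generate: H :: T = hGen.generate :: tGen.generate
  }

  // any case class with a Generic representation can be generated
  implicit def caseClassGenerator[T, L <: HList](
      implicit generic: Generic.Aux[T, L],
      lGen: Generator[L]
  ): Generator[T] = new Generator[T] {
    def generate: T = generic.from(lGen.generate)
  }
}

case class Person(age: Int, height: Double, living: Boolean)

val person: Person = Generator.generate[Person] // a Person with random field values
```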


I have mentioned so far that Shapeless provides the means to transform case classes and sealed traits to generic structures, but have only really talked about case classes. The sealed trait is a slightly different shape to a case class (it is analogous to the algebraic data type co-product) and accordingly, Shapeless has a different structure to handle it, rather than a HList it has a type called Coproduct. The above code can be extended to also handle sealed traits, to do so you just need to add the implicit to handle the empty coproduct (CNil) and recursively handle the Coproduct.


Intro to Genetic Algorithms in Scala

This is intended to be quite a simple high level approach to using some of these ideas and techniques for searching solution spaces. As part of my dissertation back when I was in university, I used this approach to find optimal configurations for Neural Networks. More recently, I have applied this approach to finding good configurations for training an OpenNLP model, with pretty good results - which is really what made me think to write a little about the topic.

So what is it?

Genetic algorithms are a subset of evolutionary computing that borrow techniques from classic evolution theories to help find solutions to problems within a large search space.

So, what does that mean? It sounds a little like a bunch of academic rhetoric that makes nothing any clearer, right? But actually, the concept is quite simple, and works like the evolutionary theory of "survival of the fittest" - that is, you generate a large set of random possible solutions to the problem and see how they all perform at solving it, and from there you "evolve" the solutions, allowing the best solutions to progress and evolve while cutting out the worst performing ones.

How it works

In practical terms, at a high level, it involves the following steps:
  1. Create an encoded representation for the problem solution - this might sound complicated, but it just means capturing all the possible inputs or configuration for a solution to the problem.

  2. Randomly generate a set of possible solutions - that is, a set of configurations/inputs (this is called a "population")

  3. Assess the performance of these solutions - to do this, you need a concept of "fitness", or how well a solution performs - and rank the population

  4. Evolve the population of solutions - this can involve a range of techniques of varying complexity, often keeping the top n candidates, dropping the bottom m candidates and then evolving/mutating the rest of the new population

  5. Repeat the rank & evolve cycle (depending on how it performs)

To understand it, let's first consider a simple, hypothetical numeric problem: say you have some function f(x, y, z) that returns an integer, and we want to find a combination of x, y and z that gives a value closest to 10,000 (e.g. you want a cuboid with a volume of 10,000 and you want to find dimensions for the width, depth and height that achieve this - like I said, this is just hypothetical). The first step would be to encode the solution, which is pretty simple - we have three inputs, so our candidate looks like this:

case class Candidate(x: Int, y: Int, z: Int)

We will also need a fitness function that will assess how a candidate performs (really, this will just be evaluation of the inputs against our function f):

def fitness(candidate: Candidate): Int = Math.abs(10000 - (candidate.x * candidate.y * candidate.z))

Now that we have a representation of the solution in generic terms, and a function that can measure how well candidates perform, we can randomly generate a population. The size of the population can be chosen as suits: the bigger the population, the longer an evolution cycle will take, but of course it gives you a wider gene pool for potential solutions, which improves your chances of finding a better one.

Once we have our initial population, we can evaluate them all against our fitness functions and rank them - from there, we can evolve!

Survival of the fit, only the strong survive

So how do we evolve our candidates? There are a few techniques; we will look at the simplest here:

  • Survival of the fittest - the simple and sensible approach whereby the top performing candidate in your population always survives as is (in the scenario where we have stumbled upon the optimal solution in the first random generation, we want to make sure it is preserved)

  • Mutation - much like genes get mutated in normal evolution, we can take a number of candidates and slightly adjust their values - for numeric "genes" it's easy: we can just adjust the number. A simple, popular technique for doing this is Gaussian mutation, which ensures that in most cases the value will be close to the original, but with a chance of a larger deviation (it's a bell curve distribution, whereby the peak of the bell curve is the original value).

  • Cross breeding - again, like normal reproduction, randomly select two parent candidates (probably from the top x% of the candidate pool) and take some attributes from each parent to make up a child

Between mutation and cross breeding you can create a new pool of candidates and continue the search for a solution.
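Putting those techniques together, a minimal evolve step might look like this sketch (the survivor fraction, mutation spread and selection strategy are all arbitrary choices here):

```scala
import scala.util.Random

// the candidate and fitness function from earlier, repeated so this stands alone
case class Candidate(x: Int, y: Int, z: Int)

def fitness(c: Candidate): Int = Math.abs(10000 - (c.x * c.y * c.z))

// Gaussian mutation: nudge each gene, usually by only a small amount
def mutate(c: Candidate): Candidate = {
  def nudge(v: Int): Int = Math.max(1, v + (Random.nextGaussian() * 2).toInt)
  Candidate(nudge(c.x), nudge(c.y), nudge(c.z))
}

// cross breeding: the child takes some attributes from each parent
def crossover(a: Candidate, b: Candidate): Candidate =
  Candidate(a.x, b.y, if (Random.nextBoolean()) a.z else b.z)

def evolve(population: Seq[Candidate]): Seq[Candidate] = {
  // rank by fitness (lower error is better) and keep the top quarter as-is,
  // so the best candidate found so far always survives
  val survivors = population.sortBy(fitness).take(population.size / 4)
  // refill the pool by breeding and mutating random pairs of survivors
  val children = Seq.fill(population.size - survivors.size) {
    val a = survivors(Random.nextInt(survivors.size))
    val b = survivors(Random.nextInt(survivors.size))
    mutate(crossover(a, b))
  }
  survivors ++ children
}
```

Repeating evolve over the ranked population gives you the rank & evolve cycle from step 5 above.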

Some code

Partly because I wanted to learn a bit more about the Shapeless library (more of that in another post) and partly for fun, I created a simple GitHub project with some code that has very simple evolutionary search machinery that allows you to try and use these techniques.

The shapeless stuff was to support generic encoding of our solution - so the library can easily be used to define a solution representation and a fitness function.

Following on from the example above to find three dimensions that have a volume of 10,000 I have the following configuration:
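The configuration looks something like the sketch below - note the Gene type here is a simplified stand-in for the library's own (which also carries its generation/mutation logic), just to show the shape of the solution representation and fitness function:

```scala
// simplified stand-in for the library's Gene type (illustrative only):
// a value plus the min/max configuration used for generation and mutation
case class Gene[A](value: A, min: A, max: A)

// the solution representation: any case class, with Gene[A] attributes
case class ThreeNumbers(x: Gene[Int], y: Gene[Int], z: Gene[Int])

// the fitness function has the shape B => Double
def fitness(c: ThreeNumbers): Double =
  Math.abs(10000.0 - (c.x.value * c.y.value * c.z.value))
```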
In the above, we first define the case class ThreeNumbers - this can be any case class you like, but it must have Gene[A] attributes - these are custom types that I created that have defined generation and mutation methods, and carry additional configuration (such as max and min values for numbers etc). We then pass all the configuration in, along with a fitness function of the form B => Double, and run it.

It varies on each run, as the initial pool of candidates is generated at random, but in most cases this example solves the problem (finds an error rate of zero) within 10 generations:

As mentioned, I have used this approach to train OpenNLP NER models - the different training data generation parameters and the training configuration were all encoded as a candidate class, and this method was then used to evolve and find the best training configuration for named entity recognition - but it can be applied to all sorts of problems!

Feel free to try it out and let me know if it works well for any problem domains - you can also checkout the code and make it more sophisticated in terms of evolution/mutation techniques and introducing a learning rate.


Agile Machine Learning: From Theory to Production

Later this year, Sumanas and I will be co-presenting a talk about researching Machine Learning in an agile development environment at the JAXLondon conference. This is a high level overview of some of the topics we will be presenting (we will also try to get some cool ML demos in there too, just to make things a bit more interesting).

So what’s the problem?

Artificial Intelligence (AI) and Machine Learning (ML) are all the rage right now - at Google's recent I/O 2017 event they really put ML front and centre with plans to bake it into all their products, and lots of other large companies are aligning themselves as Machine Learning companies as a natural progression from Big Data.

According to a recent Narrative Science survey, 38% of enterprises surveyed were already using AI, with 62% expecting to be using it by 2018. So it would be understandable if many companies might be feeling the pressure to invest in an AI strategy, before fully understanding what they are aiming to achieve, let alone how it might fit into a traditional engineering delivery team.

We have spent the last 12 months taking a new product to market and trying to go from a simple idea to a production ML system. Along the way we have had to integrate open ended academic research tasks with our existing agile development process and project planning, as well as work out how to deliver the ML system to a production setting in a repeatable, robust way, with all the considerations expected of a normal software project.

Here are a few things you might consider if you are planning your ML roadmap (and topics we will cover in more detail in the JAXLondon session in October).

Machine Learning != your product

Machine Learning is a powerful tool to enhance a product - whether by reducing the costs of human curation, or by powering a better understanding for your voice/natural language interface - however, Machine Learning shouldn't be considered the selling point of the product. Think of the end result Product First - that is, is there a market for the product regardless of whether it is powered by ML or human oversight? Consider if it makes sense to build a fully non-ML version of the product to start proving the market fit and delivering value to customers.

Start small and build

The Lean Startup principles of MVP and fast iterations still apply here. Following on from the point above, if you start with a non-ML product and are then able to leverage ML techniques to get even a small increase in performance (better recommendations, reduced human effort/cost, improved user experience - replacing a human process with ML for just 5% of cases can start to realise cost benefits), that can start adding value straight away. From a small start you are able to prove the value being added whilst also getting the ML infrastructure tested and proven.

Tie into development sprint cycles

You may be hiring a new R&D team, or you may be using members of your existing engineering team; either way, it is helpful to have them working in similar development sprint cycles (if you work in sprints). It allows both parts of the team to understand what is happening and how work is progressing - product and engineering changes and issues might be useful in informing the direction of R&D, and likewise, there may be data features or feedback from the R&D team that could be easily engineered and would make research simpler. Whilst research is ongoing, and often a time consuming task, having fortnightly (or whatever the sprint length is) checkpoints where ideas can be discussed and demoed is good for the whole team's understanding, as well as being a positive motivator.

Don’t forget Clean Code!

Whilst experimenting and researching different ideas it can be pretty easy to fall into hacking mode, just rattling out rough scripts to prove an initial concept or idea - and there is definitely a place for this, but as your team progresses it will be more beneficial to invest in good coding principles. Whilst one-off scripts make sense to be hacked out, as the team works across several ideas, having code that is re-usable and organised sensibly with proper separation of concerns makes future research easier, as well as reducing the cost when it comes to production-isation. Building some machinery to make experiments easily testable (and to benchmark different solutions) will be very beneficial from the start.

Recommended Reading

While the interwebs are awash with Machine Learning articles, tutorials and click-baitey links guaranteed to reduce the error of your model in 3 quick steps, the following is a small list of resources that we think are worth browsing.

For the academically inclined, the following is a list of papers, both recent and not so recent:

During the session in JAXLondon later this year, we will go into more detail with these ideas, as well as others including technical and architectural considerations for building and deploying an ML stack.


Building a RESTful API like a Product Manager

Something I have been thinking about for a while is the idea of APIs as your Product - the idea that you should design, build and test your API like it's a public facing product (as it might actually be, if it's a public API), and what lessons we should be learning from Product Managers about how we achieve this.

In the latest instalment of their "Technology Radar", ThoughtWorks recommended the concept "API as a Product" as ready to trial (although I think it's far beyond trial stage - I can see no real risk of adopting the lessons from Product Management for your API).

I have generally been interested in Product Management for several years - I think it's hard not to get interested in the science of Product Management when you are building side projects, as you naturally have to think about the product you are building - sure, with side projects they can easily turn into vanity projects, where you just build what you think is cool, but I think it's still fun to think about real users and good Product Management techniques that can be applied even to side-projects.

Having tried to gain some understanding, I have turned to some classic material on the subject of Product Management: The Lean Startup, Don't Make Me Think, How to Build a Billion Dollar App. I have also been lucky enough to attend the Mind The Product conference twice (through volunteering - but managed to watch lots of the talks), and to have worked with several great Product Managers.

So what lessons can we start applying? I think, if one statement sums it up, it's an idea from Kathy Sierra:

Does your product make your users awesome?
This is a great way to think about APIs (I'm mostly talking about REST/GraphQL type APIs here - but the same thinking applies to the design of normal class/interface/library APIs too).

As engineers, I'm sure we have all used APIs that don't achieve this - APIs that make us have to think, that make us feel stupid, that seem overly complicated (for example, the old Java Date library - the number of times I had to google just to find out how to simply create a specific date for something simple like a unit test is ridiculous). And hopefully, we have all used APIs that make us feel awesome - APIs that make us productive, that let us do what we expect to be able to do, and can (with the help of the IDE's auto-complete) let us guess what the method calls are.

Another idea Kathy Sierra mentions in that talk, also made famous by the title of the Product Management classic, is:

Don't make me think

I have written before about the StackOverflow and GitHub APIs, but they deserve a mention again - if you have ever written an integration with them, you will know. The APIs are designed in a way that sensibly models the domain.

For example, take StackOverflow, if we were going to describe their domain model, what first class objects would we have? I guess Questions, Answers, Comments etc - and sure enough, their API models these:

/questions - will get you all the questions on the site

/answers - will get you all the answers

Now, how do you think you might get all answers for a given question? Well, a list of answers should probably be a child of a question, so something that represents that, something like:

/questions/{Question ID}/answers

Sure enough, that's the URL - and the behaviour is consistent throughout the design; I'm sure you can guess how you might retrieve all comments for a given answer.

I'll be honest, developing against an API where I can guess the endpoints makes me feel pretty awesome.

MVP and the Lean Product

The idea of a Minimum Viable Product is a core concept in Lean Startup - the idea that you build the minimum version of the product that can get you feedback as to whether you are heading in the right direction before investing more effort in building out a fuller fledged product.

Whilst I think an appreciation of the MVP is good awareness for any engineer - I have generally found it useful in facilitating conversation when building a new product - if as an engineer you recognise some new feature is going to be considerable effort, it's helpful to think about what hypothesis the feature is trying to test, and whether there is a simpler engineering solution to test that hypothesis.

However, the principle of not over-engineering the API is also a good one - once you have a well designed domain model, it's quite tempting to get carried away with REST design best practices and start building endpoints for all different variations of the model - however, unless you need them, it's better to start with less (if you are building a public REST API from the start then it might be harder to know what isn't needed, but you can still certainly build an MVP version of the API to test the hypothesis - e.g. whether people actually want to use your API).

User Testing

Another important part of building a product is collecting user feedback - whilst a REST API may not be well suited to A/B testing (your consumers won't be very happy if you switch the API design on them after they have implemented against a specific variation!), listening to user feedback is still important. Because the consumers will likely be engineers (writing the client code), they will hopefully be expecting REST standards and good API design principles - another way to think about this is the importance of not making breaking changes to your API.

A technical way of thinking about this is Consumer Driven Contract Testing - as RESTful APIs are de-coupled from the client code, the API doesn't have any idea what clients are using or doing, so this approach allows consumers to define what they expect from the API; the API can then always test against these contracts, to be sure that any change is still in keeping with existing consumers, or users, of the API.


I think, in general, all engineers can benefit from thinking about how good products are built and what concepts are important, and from being able to think about the products or services they are building in this way - both in terms of the actual implementation being a product that someone else will be the user of (externally if it's a RESTful API, but also internally if it's cross-team, or even just someone coming to take over your code once you have moved on).


First of the year


Ignoring global events, 2016 was an interesting year for me. Personally, I managed to achieve, and more importantly learn, quite a bit: I switched jobs, started learning Scala (full time), started learning and working with machine learning, and entered my first BBQ competition and my first chilli cook-off competition. I also submitted my first proposal for a conference talk - it was not selected in the end, but I was pleased to have at least submitted it, and one of my goals for 2017 is to start public speaking (conferences/meetups), so I will hopefully continue to submit talk proposals in the hope that one gets selected by the end of the year.

There are probably lots of things that I had in mind to do at the start of the year that I didn't manage - such as finishing my education side project, but never mind.

The new tech stuff I have been able to learn has been a product of moving jobs, so that's cool, and I'm very happy with what we are doing there at the moment. As for the two cooking competitions: I signed up and paid to enter both of them early in the year on a whim, and that provided the motivation and deadline to force me to practice, prepare and actually complete them.


So here are a few goals I am aiming for in this coming year:

Machine learning - let's start with a realistic one. I want to become a lot more competent and experienced in the ML field. It's an exciting area right now, and something that is only going to get more prevalent, so I am really happy to be somewhere I can be involved in it from the start of a project and get to know it better.

Writing - a little different this time. As a goal to aim for, and also just to see the wider process, I want to write a book. About meat. I have lots of ideas already; it will be relatively short and, at first at least, an e-book - a whistlestop tour of the hows and whys of cooking meat.

Kickstarter - I am currently working on an Android app for my boy (5 years old), and it made me think about how I could make it more accessible for other kids to make their own apps. So I might put it on Kickstarter once I have made some progress - again, mostly for the experience of running a Kickstarter project rather than actually expecting to get the money to do it.

Public speaking - As mentioned, I would like to get into this. So if I manage to get a proposal accepted before the end of the year, that would be cool.

On to the next one!


Managing your personal finance (a Java webapp)

"Cash Rules Everything Around Me" - Method Man

A few years ago I was fed up of how useless all the banks were at helping me understand my transactions/spending patterns with their online banking. Basically, they all just seemed to let you log on, and then see a list of transactions (often just referencing codes) of payments in and out by date. Nothing else. It seemed like there were lots of simple things missing that could make it a lot more useful, to name a few:

  • Smarter tagging/categorisation - where some direct debit was just listed by the code of the recipient rather than meaningful info, or being able to group all the spending in coffee shops
  • Alerting on changes/unexpected behaviour - for example when a fixed-price contract comes to an end and the price changes - detecting a change in a regular fixed amount/recipient should be easy

So this is what I started building. A lot of the functionality has since been built into banks such as Monzo - the aim of this webapp was simply to allow you to import transaction histories (most online banks I use provide the ability to export transactions as CSV) so you could view/filter/tag any existing bank data. It only supports Santander at the moment, and the alerting stuff never actually got built, but I thought I would chuck it on GitHub rather than have it just sit around gathering dust in a private repo.

I wrote about the project a while ago, when it was first underway, but as it has not been worked on for a while, I thought I would move it to GitHub - hence the post here today. The app and all the code can be downloaded here:

Building & Running

It's a simple Java, Maven app (it was written several years ago, so it has not been migrated to Spring Boot, Gradle, Groovy etc) - and it builds a WAR file. It also uses LESS for the styling, which is also built by Maven (although watch out for silent failures if you make changes!). If you are using an Eclipse based IDE you can follow the guide here: to get incremental build support for LESS (e.g. just change the LESS source, hit refresh in the browser and it will reload).

You can run the WAR file in any Tomcat/IDE server, and the backend currently expects a MySQL DB (but it can easily be switched for a different DB driver if required).


Future plans for the project:

  • Migrate to Spring Boot, Groovy, Gradle
  • Add further bank transaction importers
  • Add alerting framework


(I loaded in a generated transaction history - which is why some of the charts look a little weird)