Using OpenNLP for Named-Entity-Recognition in Scala

A common challenge in Natural Language Processing (NLP) is Named Entity Recognition (NER) - the process of extracting specific pieces of data from a body of text, most commonly people, places and organisations (for example, extracting the names of all the people mentioned in a Wikipedia article). NER is a problem that has been tackled many times over the evolution of NLP, moving from dictionary-based, to rule-based, to statistical models and, more recently, to neural networks.

Whilst there have been recent attempts to crack the problem without it, the crux of the issue is that for an approach to learn it needs a large corpus of marked-up training data. There are some marked-up corpora available, but the problem is still quite domain specific, so training on WSJ data might not perform particularly well against your own domain-specific data, and finding a set of 100,000 marked-up sentences is no easy feat. There are approaches that tackle this by generating training data, but it can be hard to generate truly representative data, so this always risks over-fitting to the generated data.

Having previously looked at Stanford's NLP library for some sentiment analysis, this time I am looking at the OpenNLP library. Stanford's library is often referred to as the benchmark for several NLP problems; however, those benchmarks are always against the data it was trained on, so out of the box we likely won't get amazing results against a custom dataset. Further to this, the Stanford library is licensed under the GPL, which makes it harder to use in a commercial/startup setting. The OpenNLP library has been around for several years, and one of its strengths is its API - it's pretty well documented, easy to get up and running with, and very extensible.


Training a custom NER

Once again, for this exercise we are going back to the BBC recipe archive for the source data - we are going to try and train an OpenNLP model that can identify ingredients.

To train the model we need some example sentences - the OpenNLP documentation recommends at least 15,000 marked-up sentences to train a model - so for this I annotated a bunch of the recipe steps and ended up with somewhere in the region of 45,000 sentences.
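For reference, OpenNLP's name finder expects one sentence per line, with each entity wrapped in START/END tags - a couple of made-up lines for an "ingredient" entity type would look something like this:

    Melt the <START:ingredient> butter <END> in a pan with the <START:ingredient> sugar <END> .
    Sprinkle the chopped <START:ingredient> rosemary <END> over the <START:ingredient> caramel <END> .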

As you can see in the above example, the marked-up sentences are quite straightforward - we just wrap each ingredient in the tags (although note that if the word itself isn't padded by a space on either side inside the tags, training will fail!).

Once we have our training data, we can easily set up some code to feed it in and train our model:
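Roughly speaking, the training step looks like the following Scala sketch (assuming OpenNLP 1.6+; the file names and the "ingredient" entity type here are just placeholders):

    import java.io.{BufferedOutputStream, File, FileOutputStream}
    import opennlp.tools.namefind.{NameFinderME, NameSampleDataStream, TokenNameFinderFactory}
    import opennlp.tools.util.{MarkableFileInputStreamFactory, PlainTextByLineStream, TrainingParameters}

    object TrainIngredientModel extends App {
      // stream the marked-up sentences in, line by line
      val inputStreamFactory = new MarkableFileInputStreamFactory(new File("ingredient-training.txt"))
      val sampleStream = new NameSampleDataStream(new PlainTextByLineStream(inputStreamFactory, "UTF-8"))

      // train a maximum entropy name finder model for the "ingredient" entity type
      val model = NameFinderME.train("en", "ingredient", sampleStream,
        TrainingParameters.defaultParams(), new TokenNameFinderFactory())

      // write the trained model to disk for future use
      val out = new BufferedOutputStream(new FileOutputStream("ingredient-model.bin"))
      try model.serialize(out) finally out.close()
    }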

This is a very simple example of how you can do it, not always paying attention to engineering best practices, but you get the idea of what's going on: we get an input stream of our training data set, then instantiate the maximum entropy name finder class and ask it to train a model, which we can then write to disk for future use.

When we want to use the model, we can simply load it back into the OpenNLP name finder class and use that to parse the input text we want to check:
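A rough sketch of loading the model back in and running it over a sentence (again, the tokenizer choice and file name are just illustrative):

    import java.io.FileInputStream
    import opennlp.tools.namefind.{NameFinderME, TokenNameFinderModel}
    import opennlp.tools.tokenize.SimpleTokenizer

    object FindIngredients extends App {
      // load the previously trained model back in
      val model = new TokenNameFinderModel(new FileInputStream("ingredient-model.bin"))
      val nameFinder = new NameFinderME(model)

      // tokenize the input sentence and ask the name finder for any ingredient spans
      val tokens = SimpleTokenizer.INSTANCE.tokenize("Melt the butter and sugar then stir in the rosemary")
      val spans = nameFinder.find(tokens)

      spans.foreach { span =>
        println(tokens.slice(span.getStart, span.getEnd).mkString(" "))
      }
    }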

So, once I had created some training data in the required format and trained a model, I wanted to see how well it actually worked. Obviously, I didn't want to run it against one of the original recipes, as they were used to train the model, so I selected this recipe for rosemary-caramel millionaire shortbread to see how it performed. Here are the ingredients it found:

  • butter
  • sugar
  • rosemary
  • caramel
  • shortbread


All in all, pretty good - it missed some ingredients, but given the training data was created in about 20 minutes by manipulating the original recipe set with some Groovy, that's to be expected really, and it did well in not returning false positives.


In conclusion, if you have a decent training set, or have the means to generate some data with a decent range, you can get some pretty good results using the library. As usual, the code for the project is on GitHub (although it is little more than the code shown in this post).


Turn your GitHub Page into a Personalised Resume


A little while ago I decided I wanted to update my CV, and figured that, given I was in tech, it made sense for my CV to be online. I was aware of GitHub Pages, which give you a nice-looking URL and seemed like the perfect location for my tech CV.


http://robhinds.github.io/


Once I had it looking pretty decent, and updated to modern Bootstrap styling so it was fully responsive, I thought I would stick it on GitHub as a GitHub Page. GitHub provides everyone with a free hosted page for normal HTML/JS resources etc (which is pretty nice of them!) and gives you a nice, shareable URL like http://{username}.github.io.

Whilst I was reading about GitHub Pages, I noticed that they have native support for Jekyll - a static HTML generator tool for building websites - which is when I had my second realisation of the day: I could make my CV open-sourceable by making it a configurable Jekyll project that lets users just update some config in their GitHub account and, hey presto, have a nicely styled, personalised tech CV!

So I started porting it over to Jekyll, which just involved moving the configurable, user-specific items into a config file (_config.yml) and then breaking the HTML sections into fragments to make it more manageable to understand what is going on. The idea of Jekyll is pretty straightforward - it's just a simple tokenised/template approach to putting together static HTML output - but it works well and I really didn't find myself wanting for anything in the process. The GitHub native support was also really nice: all I needed to do was upload the source of the project to my GitHub account and GitHub handled the build and serving of the site out of the box!

And that's all the configuration it takes! The YAML format is pretty readable - it largely just works with indenting - and hopefully, looking over the nested sections of data, it's fairly easy to understand how you can modify parts to customise it.


You can see my GitHub CV page here: robhinds.github.io - out of the box you can configure lots of aspects: custom text blocks, key skills, blogs, apps, GitHub projects, StackOverflow, etc.

How can you have a custom GitHub CV?

It really is super simple to get your own GitHub CV:
  1. Create a Github account (if you don't already have one)

  2. Go to the project repository and fork the repository

  3. Change the name of the repository (in the settings menu) to {{yourusername}}.github.io

  4. Edit the /_config.yml file in your repository - it should be pretty straightforward as to which links/details you need to add.

  5. Visit your new profile page: {{yourusername}}.github.io and start sharing it!



Unsupervised Learning in Scala using Word2Vec

A pretty cool thing to come out of recent machine learning advancements is the idea of "word embedding", specifically the advances made by Tomas Mikolov and his team at Google with the Word2Vec approach. Word embedding is a language modelling approach that involves mapping words to vectors of numbers. If you imagine we are modelling every word in a given body of text as an N-dimensional vector (it might be easier to visualise this as 2 dimensions - each word is a pair of co-ordinates that can be plotted on a graph), then that could be useful in plotting words and starting to understand the relationships between words given their proximity. What's more, if we can map words to sets of numbers, we can start thinking about interesting arithmetic that we could perform on the words.

Sounds cool, right? Now of course, the tricky bit is how can you convert a word to a vector of numbers in such a way that it encapsulates the details behind this relationship? And how can we do it without painstaking manual work and trying to somehow indicate semantic relationships and meaning in the words?


Unsupervised Learning

Word2Vec relies on neural networks and trains on a large, un-labelled piece of text in a technique known as "unsupervised" learning.

In contrast to the last neural network I discussed, which was a "supervised" exercise (i.e. for every input record we had the expected output/answer), Word2Vec uses a completely "unsupervised" approach - in other words, the neural network simply takes a massive block of text with no markup or labels (usually broken into sentences or lines) and uses that to train itself.

This kind of unsupervised learning can seem a little unbelievable at first. Getting your head around the idea that a network could train itself without even knowing the "answers" seemed a little strange to me the first time I heard the concept, especially as a fundamental requirement for an NN to converge on an optimum solution is a "cost function" (i.e. something we can use after each feed-forward step to tell us how right we are, and whether our NN is heading in the right direction).

But really, if we think back to the literal biological comparison with the brain, as people we learn through this unsupervised approach all the time - it's basically trial and error.


It's child's play

Imagine a toddler attempting to learn to use a smartphone or tablet: they likely aren't shown explicitly how to press an icon or swipe to unlock, but they might try combinations of power buttons, volume controls and swiping, and see what happens (and whether it does what they are ultimately trying to do). They get feedback from the device - not direct feedback about what the correct gesture is, or how wrong they were, just the feedback that it doesn't do what they want. And if you have ever lived with a toddler who has got to grips with touchscreens, you may have noticed that when they then encounter a TV or laptop, they instinctively attempt to touch or swipe the things on the screen that they want (in NN terms this would be known as "over-fitting" - they have trained on too specific a set of data, so are poor at generalising - luckily, the introduction of a non-touch screen such as a TV expands their training set and they continue to improve their NN, getting better at generalising!).

So, this is basically how Word2Vec works. Which is pretty amazing if you think about it (well, I think it's neat).


Word2Vec approaches

So how does this apply to Word2Vec? Well, just as a smartphone gives implicit, indirect feedback to a toddler, so the input data can provide feedback to itself. There are broadly two techniques when training the network:

Continuous Bag of Words (CBOW)

Our NN has a large body of text broken up into sentences/lines, and just like in our last NN example, we take the first row from the training set - but we don't push the whole sentence into the NN (after all, sentences are variable length, which would confuse our input neurons). Instead we take a set number of words, referred to as the "window size" - let's say 5 - and feed those into the network. In this approach, the goal is for the NN to correctly guess the middle word in that window: given a phrase of 5 words, the NN attempts to guess the word at position 3.

[It was ___ of those] days, not much to do

So it's unsupervised learning, as we haven't had to go through any data and label things, or do any additional pre-processing - we can simply feed in any large body of text and the network can try to guess the words given their context.

Skip-gram

The Skip-gram approach is similar, but the inverse - that is, given the word at position n, it attempts to guess the words at position n-2, n-1, n+1, n+2.

[__ ___ one __ _____] days, not much to do

The network is trying to work out which word(s) are missing, and just looks to the data itself to see if it can guess it correctly.


Word2Vec with DeepLearning4J

One popular deep-learning and Word2Vec implementation on the JVM is DeepLearning4J. It is pretty simple to use to get a feel for what is going on, and is pretty well documented (along with some good high-level overviews of some core topics). You can get up and running, playing with the library and some example datasets, pretty quickly by following their guide. Their NN setup is equally simple and worth playing with; their MNIST hello-world tutorial lets you get up and running with that dataset pretty quickly.

Food2Vec

A little while ago, I wrote a web crawler for the BBC food recipe archive, so I happened to have several thousand recipes sitting around and thought it might be fun to feed those recipes into Word2Vec to see if it could give any interesting results or if it was any good at recommending food pairings based on the semantic features the Word2Vec NN extracts from the data.

The first thing I tried was just using the ingredient list as a sentence - hoping that it would be better for extracting the relationship between ingredients, with each complete list of ingredients being input as a sentence.  My hope was that if I queried the trained model for X is to Beef, as Rosemary is to Lamb, I would start to get some interesting results - or at least be able to enter an ingredient and get similar ingredients to help identify possible substitutions.

Looking at the nearest words it returned, it had managed to extract some meaning from the data - for both pork and lamb, the nearest words do seem to be related to the target word, but not closely enough to be really useful. Although this in itself is pretty exciting - it has taken an unlabelled body of text and has been able to learn some pretty accurate relationships between words.

Actually, on reflection, a list of ingredients isn't that great an input: it isn't a natural sentence structure and there is no natural ordering of the words - a lot of meaning is captured in phrases rather than in plain lists of words.

So next up, I used the instructions for the recipes - each step in the recipe became a sentence for input, and only minimal cleanup was needed. Even with some basic tweaking, however, the results weren't really that much better (it's quite possible that if I played more with the Word2Vec configuration I could have got improved results), and for the same lamb and pork search this was the output:

Again, it's still impressive to see that some meaning has been found in these words, but is it better than the raw ingredient list? I think not - the pork result seems wrong, as it has very much aligned pork with poultry (although maybe that is some meaningful insight that conventional wisdom just hasn't taught us yet!?).

Arithmetic

Whilst this is pretty cool, there is further fun to be had in the form of simple arithmetic. A simple, often quoted example is the case of countries and their capital cities - in well-trained Word2Vec models, the relationship between a country and its capital city shows up as a roughly constant vector offset:

(graph taken from DeepLearning4J Word2Vec intro)

So could we extract similar relationships between foodstuffs? The short answer, with the models trained so far, was: kind of.

Word2Vec supports the idea of positive and negative matches when looking for nearest words - that allows you to find these kinds of relationships. So what we are looking for is something like "X is to lamb as thigh is to chicken" (i.e. hopefully this should find a part of the lamb), and hopefully we can use this to extract further information about ingredient relationships that could be useful in thinking about food.
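In DeepLearning4J terms, that kind of analogy query is just a nearest-words lookup with positive and negative terms - something along these lines (a sketch assuming an already-trained Word2Vec model; the food words and method name are illustrative of my query, not taken from the original code):

    import org.deeplearning4j.models.word2vec.Word2Vec
    import scala.collection.JavaConverters._

    object FoodAnalogy {
      // "X is to lamb as thigh is to chicken": positive terms pull towards, negative terms pull away
      def lambEquivalentOfChickenThigh(vec: Word2Vec): java.util.Collection[String] =
        vec.wordsNearest(List("lamb", "thigh").asJava, List("chicken").asJava, 5)
    }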

So, I ran that arithmetic against my two models.
The instructions-based model returned the following output:

Which is a pretty good effort - I think if I had to name a lamb equivalent of chicken thigh, lamb shank is probably what I would have gone for (top of the leg, both slow-twitch muscle and both the more gamey, flavourful pieces of the animal - I will stop there, as we are getting into food-nerd territory).

I also ran the same query against the ingredients-based model (which, remember, did better on the basic nearest-words test):

Which, interestingly, doesn't seem as good. It has shin, which isn't too bad insofar as it's the leg of the animal, but it's not quite as good a match as the previous one.


Let us play

Once you have the input data, Word2Vec is super easy to get up and running. As always, the code is on GitHub if you want to see the build setup (I did have to fudge some dependencies and exclude some things to get it running on Ubuntu - you may get errors about javacpp or jnind4j not being available - but the build file has the required workarounds to get it running), but the interesting bit is as follows:
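The setup itself is only a handful of builder calls - the following is a rough Scala sketch of it (the file name, stop words and parameter values here are illustrative, not the exact ones I used):

    import java.io.File
    import org.deeplearning4j.models.word2vec.Word2Vec
    import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator
    import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor
    import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory
    import scala.collection.JavaConverters._

    object RecipeWord2Vec extends App {
      // each line of the input file is treated as a sentence
      val sentences = new LineSentenceIterator(new File("recipe-steps.txt"))

      val tokenizer = new DefaultTokenizerFactory()
      tokenizer.setTokenPreProcessor(new CommonPreprocessor())

      val vec = new Word2Vec.Builder()
        .stopWords(List("tbsp", "tsp", "g", "ml", "cup").asJava) // measurement words to ignore
        .minWordFrequency(5)   // ignore words seen fewer than 5 times
        .iterations(1)         // training iterations per batch
        .layerSize(300)        // dimensions of the vector produced per word
        .seed(42)              // seed the random initialisation for repeatable results
        .windowSize(5)         // number of words fed in at a time (CBOW/skip-gram window)
        .iterate(sentences)
        .tokenizerFactory(tokenizer)
        .build()

      vec.fit()
      println(vec.wordsNearest("lamb", 10))
    }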
If we run through what we are setting up here:

  1. Stop words - these are words we know we want to ignore - I originally used these to rule out measurements of ingredients, as I didn't want them to take on too much meaning.
  2. Line iterator and tokenizer - these are just core DL4J classes that take care of processing the text line by line, word by word. This makes things much easier for us, as we don't have to worry about that stuff.
  3. Min word frequency - this is the threshold for a word to be interesting to us - if a word appears fewer than this number of times in the text then we don't include the mapping (as we aren't confident we have a strong enough signal for it).
  4. Iterations - how many training cycles we are going to loop for.
  5. Layer size - this is the size of the vector that we will produce for each word - in this case we are saying we want to map each word to a 300-dimension vector. You can consider each element of the vector a "feature" of the word that is being learnt; this is a part of the network that will really need to be tuned to each specific problem.
  6. Seed - this is just used to "seed" the random numbers used in the network setup; setting this helps us get more repeatable results.
  7. Window size - this is the number of words to use as input to our NN each time - it relates to the CBOW/skip-gram approaches described above.

And that's all you need to really get your first Word2Vec model up and running! So find some interesting data, load it in and start seeing what interesting stuff you can find.

So go have fun - try to find some interesting text datasets you can feed in, see what you can work out about the relationships, and feel free to comment here with anything interesting you find.


Machine (re)learning: Neural Networks from scratch

Having recently changed roles, I am now in the enviable position of starting to do some work with machine learning (ML) tools. My undergraduate degree was actually in Artificial Intelligence, but that was over a decade ago, which is a long time in computer science in general, let alone the field of Machine Learning and AI which has progressed massively in the last few years.

So needless to say, coming back to it there has been a lot to learn. In the future I will write a bit more about some of the available tools and libraries that exist these days (both expanding on the traditional Stanford AI libraries I have mentioned previously with my tweet sentiment analysis, and covering newer frameworks for "deep learning").

Anyway, inspired by this post, I thought it would be a fun Sunday night refresher to write my own neural network. The first, and last, time I wrote a neural network was for my final year dissertation (and that code is long gone), so I was writing from first principles. The rest of this post will be a very straightforward introduction to the ideas and the code for a basic single layer neural network with a simple sigmoid activation function. I am training and testing on a very simple labelled data set, and with the current configuration it scores 90% on the unseen classification tests. I am by no means any kind of expert in this field, and there are plenty of resources and papers written by people who are inventing this stuff as we speak, so this will be more the musings of someone re-learning this stuff.

As always, all the code is on GitHub (and, as per my change in roles, this time it is all written in Scala).

A brief overview

The basic idea is that a Neural Network (NN) attempts to mimic the parallel architecture of the human brain. The human brain is a massive network of billions of simple neural cells that are interconnected; given a stimulus they each either "fire" or don't, and it's these firing neural cells and synapses that enable us to learn (excuse my crude explanation, I'm clearly not a biologist..).

So that is what we are trying to build: A connected network of neurons that given some stimuli either "fire" or don't.

A Neuron

Ok, so this seems like a sensible place to start, right? We all know what a network is, so what are these nodes in our network that we are connecting? These are the decision points, and in themselves they are incredibly simple. Each is basically a function that takes n input values, multiplies them by a pre-defined weight (one per input), adds a bias and then runs the result through an activation function (think of this as our fire/don't-fire function):



In terms of our code, this is pretty simple (don't worry about the sigmoid derivative values we are setting, we are just doing this to save time later):
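As a rough sketch (the field and method names here are made up, not the ones from the actual project), a neuron along those lines might look like:

    import scala.util.Random

    trait ActivationFunction {
      def apply(x: Double): Double
      def derivative(fx: Double): Double
    }

    class Neuron(numInputs: Int, activation: ActivationFunction) {
      // one randomly initialised weight per input, plus a bias
      var weights: Array[Double] = Array.fill(numInputs)(Random.nextDouble() - 0.5)
      var bias: Double = Random.nextDouble() - 0.5

      // remembered from the last feed-forward pass, to save recalculating during back propagation
      var output: Double = 0.0
      var outputDerivative: Double = 0.0

      def feedForward(inputs: Array[Double]): Double = {
        val weightedSum = inputs.zip(weights).map { case (in, w) => in * w }.sum + bias
        output = activation(weightedSum)
        outputDerivative = activation.derivative(output)
        output
      }
    }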

As you can see, the neuron holds the state about the weights of the different inputs (an NN is normally fixed in terms of its number of neurons, so once initialised at the start we know that we will get the same number of inputs).

You may have also noticed the first line of the class, where we require an ActivationFunction to be applied - as I mentioned, this is our final processing of the output. In this example I am just using the Sigmoid function:
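A minimal sketch of that sigmoid, using the hypothetical ActivationFunction trait from the sketch above:

    object Sigmoid extends ActivationFunction {
      // squashes any input into the range (0, 1)
      def apply(x: Double): Double = 1.0 / (1.0 + math.exp(-x))

      // the derivative can be expressed in terms of the sigmoid output itself: f'(x) = f(x) * (1 - f(x))
      def derivative(fx: Double): Double = fx * (1.0 - fx)
    }

Handily, the sigmoid's derivative can be calculated from its own output, which is why the neuron can cache that value during the forward pass for use later in back propagation.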

As you can see, it's a pretty simple function. Much like the brain, the neurons are very simple processors, and it's only the combination of them that makes the brain so powerful (or deep learning, for that matter).

The network

Ok, so a sinple "shallow" NN has an input layer, a single hidden layer and an output layer (deep NNs will have multiple hidden layers).  The network is fully connected between layers - that is to say, each node is connected to every node in the next layer up, and thats it - no neurons are connected to other layers and no connections are missed.



The exact setup of the network is largely problem dependent, but there are some observations:

  1. - The input layer has to correspond to your inputs: if you have a dataset with two input values - say house price data with the number of rooms and the square footage - then you would have two input neurons.

  2. - Similarly, if you were using the NN for a classification problem and you know you have a fixed number of classifications, the output neurons would correspond to that. For example, if you were using the NN to recognise handwritten digits (0-9) then you would likely have 10 output neurons, one per possible output.

 The number of hidden neurons, or the number of layers is a lot more problem dependent and is something that needs to be tuned per problem.

If you have ever worked with graph type structures in code before, then setting up a simple network of these neurons is also relatively straight forward given the uniform structure of the network.

The weights

Oh right, yes. So all these neurons are connected, but it's a weighted graph, in CS terms - the connection between each pair of nodes is assigned a weight. As a simple example, if we take the house price dataset again, after looking at the data we might determine that the number of rooms is a more significant factor in the end result, so that connection should have a greater weighting than the other input. Now that is not exactly what happens, but in simple terms it's a good way of thinking about it.

Given that the end goal is for the NN to learn from the data, it really doesn't matter too much what we initialise the weights for all the connections to, so at start-up we can just assign random values. Just think of our NN as a newborn baby - it has no idea how important different inputs are, or how important the different neural cells that fire in response to the stimuli are - everything is just firing off all over the place as they slowly start to learn!

So what next?

Ok, so we have our super-simple neurons, that just mimic a single brain cell and we have them connected in a structured but randomly weighted fashion, so how does the network learn? Well, this is the interesting bit..

When it comes to training the network we need a pretty large dataset to allow it to be able to learn enough to start generalising - but at the same time, we don't want to train it on all the data, as we want to hold some back to test it at the end, just to see just how smart it really is.

In AI terms, we are going to be performing "supervised" learning - this just means that we will train it with a dataset where we know the correct answer, so we can make adjustments based on how well (or badly) the network is doing - this is different to "unsupervised" learning, where we have lots of data, but we don't have the right answer for each data point.

Training: Feed forward


The first step is the "feed-forward" step - this is where we grab the first record from our training data set and feed the inputs into our randomly initialised NN, that input is fed through the all of the neurons in the NN until we get to the output layer and we have the networks attempt at the answer. As the weighting is all random, this is going to be way off (imagine a toddlers guess the first time they ever see a smart phone).

As you can see, the code is really simple, just iterate through the layers calculating the output for each neuron. Now, as this is supervised learning we also have the expected output for the dataset, and this means we can work out how far off the network is and attempt to adjust the weights in the network so that next time it performs a bit better.

Training: Back propagation

This is where the magic, and the maths, comes in.  The TL;DR overview for this is basically, we know how wrong the network was in its guess, so we attempt to update each of the connection weights based on that error, and to do so we use differentiation to work out the direction to go to minimise the error.

This took a while of me staring at equations and racking my brain trying to remember my A-level maths stuff to get this, and even so, I wouldn't like to have to go back and attempt to explain this to my old maths teacher, but I'm happy I have enough of a grasp of what is going on, having now coded it up.

In a very broad, rough overview, here is what we are going to do now:

  1. - Calculate the squared error of the outputs. That is a fairly simple equation:
    1/2 * (target - output)^2

  2. - From there, we work out the derivative of this function. A lot of the error calculations we do from now on use differentiation and derivatives - this takes a little bit of calculus, but the basic reason is relatively straightforward if you think about what differentiation is for.

  3. - We work back through the network, calculating the error for every neuron, combining the error, the derivatives and the connection weights (the weight determines how much a particular connection contributed to an error, so it also needs to be considered when correcting that weight).

Once we have adjusted the weight for each connection based on its contribution to the error, we start again and run the next training record. Given the variation in the dataset, the next training record might be quite different from the previous one, so the adjustment might be quite different too - but after lots (maybe millions) of iterations over the data, incrementally tweaking and adjusting the weights for the different cases, hopefully it starts to converge on something like reasonable performance.
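As a flavour of what that weight update looks like, here is a heavily simplified sketch of the output-layer step only (a real implementation also propagates the error back through the hidden layers; the learning rate value is arbitrary):

    object BackPropStep {
      val learningRate = 0.1

      def updateOutputNeuron(neuron: Neuron, inputs: Array[Double], target: Double): Unit = {
        // how wrong we were, scaled by the slope of the activation function at this output
        val delta = (target - neuron.output) * neuron.outputDerivative

        // nudge each weight (and the bias) in the direction that reduces the error
        neuron.weights = neuron.weights.zip(inputs).map { case (w, in) => w + learningRate * delta * in }
        neuron.bias += learningRate * delta
      }
    }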

Maths fun: Why the derivative of the errors?

So, why is calculating the derivative relevant in adjusting the errors?

If you imagine the network as a function of all the different weights, it's pretty complicated, but if we were to reduce this, for the sake of easier visualisation, to just a 3-d space of possible points (e.g. we have just 3 weights to adjust, and those weights are plotted on a 3-d graph) - now imagine our function plots a graph something like this:


(taken from the wikipedia page on partial derivatives)

The derivative of the function allows us to work out the direction of the slope in the graph from a given point, so with our three current weights (co-ordinates) we can use the derivative to work out the direction in which we need to adjust these weights.

Conclusion

The whole NN in Scala is currently on GitHub, and there isn't much more code than I have included here. I have found it pretty helpful to code it up from scratch and actually have to think about it - and it has been fun coding it up and seeing it train up to 90% accuracy on the unseen data set I was using (just a simple two-in, two-out dataset, so not that impressive).

Next up I will see how the network performs against the MNIST dataset (the hello-world benchmark for machine learning: classification of handwritten digits).

Oh, and if you are interested as to why the images look like a dodgy photocopy, they are photos of the original diagrams that I included in my final year dissertation at university, just for old times' sake!


RESTful API Design: An opinionated guide

This is very much an opinionated rant about APIs, so it's fine if you have a different opinion. These are just my opinions. Most of the examples I talk through are from the Stack Exchange or GitHub API - this is mostly just because I consider them to be well designed APIs that are well documented, have non-authenticated public endpoints and should be familiar domains to a lot of developers.

URL formats

Resources

Ok, let's get straight to one of the key aspects: your API is a collection of URLs that represent resources in your system that you want to expose. The API should expose these as simply as possible - to the point that if someone just read the top-level URLs they would get a good idea of the primary resources that exist in your data model (i.e. any object that you consider a first-class entity in itself). The Stack Exchange API is a great example of this. If you read through the top-level URLs exposed, you will probably find they match the kind of domain model you would have guessed:

/users
/questions
/answers
/tags
/comments
etc

And whilst there is no expectation that anyone will attempt to guess your URLs, I would say these are pretty obvious. What's more, if I were a client using the API I could probably have a fair shot at understanding these URLs without any further documentation of any kind.

Identifying resources

To select a specific resource based on a unique identifier (an ID, a username, etc.), the identifier should be part of the URL. Here we are not attempting to search or query for something; rather, we are attempting to access a specific resource that we believe should exist. For example, if I attempt to access the GitHub API for my username, https://api.github.com/users/robhinds, I am expecting the concrete resource to exist.

The pattern is as follows (elements in square braces are optional):
/RESOURCE/[RESOURCE IDENTIFIER]

Including an identifier will return just the identified resource, assuming one exists, else a 404 Not Found (so this differs from filtering or searching, where we might return a 200 OK and an empty list). This can be flexible - if you prefer to also return an empty list for identified resources that don't exist, that is a reasonable approach too, as long as it is consistent across the API. (The reason I go for a 404 if the ID is not found is that normally, if our system is making a request with an ID, it believes that the ID is valid, and if it isn't then it's an unexpected exception - compared to filtering users by sign-up date, where it's perfectly reasonable to expect the scenario where no user is found.)

Subresources

A lot of the time our data model will have natural hierarchies - for example, StackOverflow questions might have several child answers. These nested hierarchies should be reflected in the URL hierarchy; for example, looking at the Stack Exchange API for the previous example:
/questions/{ids}/answers

Again, the URL is (hopefully) clear without further documentation what the resource is: it is all answers that belong to the identified questions.

This approach naturally allows as many levels of nesting as necessary, but as many resources are top-level entities as well, you shouldn't need to go much further than the second level. To illustrate, let's say we wanted to extend the query for all answers to a given question to instead query all comments for an identified answer - we could naturally extend the previous URL pattern as follows:
/questions/{ids}/answers/{ids}/comments

But as you have probably recognised, we have /answers as a top-level URL, so the additional prefix of /questions/{ids} is surplus to our identification of the resource (and actually, supporting the unnecessary nesting would also mean additional code and validation to ensure that the identified answers are actually children of the identified questions).

There is one scenario where you may need this additional nesting, and that is when a child resource's identifier is only unique in the context of its parent. A good example of this is GitHub's user and repository pairing. My GitHub username is a globally unique identifier, but the names of my repositories are only unique to me (someone else could have a repository with the same name as one of mine - as is frequently the case when a repository is forked). There are two good options for representing these resources:

  1. The nested approach described above, so for the Github example the URL would look like:
    /users/{username}/repos/{reponame}

    I like this as it is consistent with the recursive pattern defined previously, and it is clear what each of the variable identifiers relates to.

  2. Another viable option, the approach that Github actually uses is as follows:
    /repos/{username}/{reponame}

    This changes the repeating pattern of {RESOURCE}/{IDENTIFIER} (unless you just consider the two URL sections as the combined identifier), however the advantage is that the top level entity is what you are actually fetching - in other words, the URL is serving a repository, so that is the top level entity.

Both are reasonable options and really come down to preference, as long as it's consistent across your API then either is ok.

Filtering & additional parameters

Hopefully the above is fairly clear and provides a high-level pattern for defining resource URLs. Sometimes we want to go beyond this and filter our resources - for example, we might want to filter StackOverflow questions by a given tag. As hinted at earlier, we are not sure of any resource's existence here, we are simply filtering - so unlike with an incorrect identifier, we don't want to return a 404 Not Found, but rather an empty list.
Filtering controls should be passed as URL query parameters (i.e. after the first ? in the URL). Parameter names should be specific, understandable and lower case. For example:
/questions?tagged=java&site=stackoverflow

All the parameters are clear and make it easy for the client to understand what is going on (also worth noting that https://api.stackexchange.com/2.2/questions?tagged=awesomeness&site=stackoverflow for example returns an empty list, not a 404 Not Found). You should also keep your parameter names consistent across the API - for example if you support common functions such as sorting or paging on multiple endpoints, make sure the parameter names are the same.

Verbs

As should be obvious in the previous sections, we don’t want verbs in our URLs, so you shouldn’t have URLs like /getUsers or /users/list etc. The reason for this is the URL defines a resource not an action. Instead, we use the HTTP methods to describe the action: GET, POST, PUT, HEAD, DELETE etc.

Versioning

Like many of the RESTful topics, this is hotly debated and pretty divisive. Very broadly speaking, the two approaches to API versioning are:
  • Part of the URL
  • Not part of the URL
Including the version in the URL will largely make it easier for developers to map their endpoints to versions, but for clients consuming the API it can make things harder (often they will have to find-and-replace API URLs to upgrade to a new version). It can also make HTTP caching harder: if a client POSTs to /v2/users then the underlying data will change, so the cache for GET-ting users from /v2/users is now invalid; however, the API versioning doesn't affect the underlying data, so that same POST has also invalidated the cache for /v1/users etc. The Stack Exchange API uses this approach (as of writing, their API is based at https://api.stackexchange.com/2.2/).

If you choose not to include the version in your URLs then two possible approaches are HTTP request headers or content negotiation. This can be trickier for the API developers (depending on framework support etc.), and can also have the side effect of clients being upgraded without knowing it (e.g. if they don't realise they can specify the version in the header, they will default to the latest). The GitHub API uses this approach: https://developer.github.com/v3/media/#request-specific-version

I think this sums it up quite nicely:


Response format

JSON is the RESTful standard response format. If required you can also provide other formats (XML/YAML etc), which would normally be managed using content negotiation.

I always aim to return a consistent response message structure across an API. This is for ease of consumption and understanding across calling clients.

Normally when I build an API, my standard response structure looks something like this:

[ code: "200", response: [ /** some response data **/ ] ]

This does mean that any client always needs to navigate down one layer to access the payload, but I prefer the consistency this provides, and also leaves room for other metadata to be provided at the top level (for example, if you have rate limiting and want to provide information regarding remaining requests etc, this is not part of the payload but can consistently sit at the top level without polluting the resource data).

This consistent approach also applies to error messages - the code (mapping to HTTP Status codes) reflects the error, and the response in this case is the error message returned.

Error handling

Make appropriate use of the HTTP status codes for errors: 2xx status codes for successful requests, 3xx status codes for redirects, 4xx codes for client errors and 5xx codes for server errors (you should avoid ever intentionally returning a 500 error code - these should be reserved for when unexpected things go wrong within your application).

I combine the status code with the consistent JSON format described above.


Groovy Retrospective: An Addendum - Memory usage & PermGen

I can't really have a Groovy retrospective without mentioning memory.

Over the last four years I have spent more time than any sane person should have to investigating memory leaks in production Groovy code. The dynamic nature of Groovy, and its dynamic meta-programming, present different considerations for memory management compared to Java, simply because perm gen is no longer a fixed size. Java has a fixed number of classes that would normally be loaded into memory (hot-reloading in long-living containers aside), whereas Groovy can easily change or create new classes on the fly. As a result, permgen GC is not as sophisticated (pre moving permgen onto the normal heap, anyway), and in Java, if you experienced an Out-Of-Memory permgen exception, you would largely just increase your permgen size to support the number of classes being loaded by the application.

To be fair, the majority of the problems encountered were due to trying to hot-reload our code, coupled with the setup of the application container (in this case Tomcat) and having Groovy on the server classpath rather than bundled with the application (much like you wouldn't bundle Java itself with an application; bundling Groovy with your application, however, is recommended).

A couple of points of interest, if you are considering Groovy (especially relevant in a long running process where you attempt to reload your code):


The MetaClassRegistry

When you meta-program a Groovy class, e.g. dynamically add a method to a class, its MetaClass is added to the MetaClassRegistry, which is a static object on the GroovySystem class. This means that any dynamically programmed class creates a tie back to the core Groovy classes.

The main consideration to keep in mind when meta-programming in a Groovy environment is that if you want to reload your classes, you now have a link between your custom code and the core Groovy code, so you must either 1) explicitly clear out the MetaClassRegistry, or 2) reload the core Groovy classes as well (throw everything out on reload).

Coming from a Java environment, where you would likely use the JAVA_HOME on the server for long-running applications, it can often seem logical to have a similar server classpath entry for Groovy too - but actually, the easiest approach is to bundle the Groovy classes with your application, so Groovy becomes a normal candidate for reloading.

If you decide not to reload Groovy, you can add explicit code to clear out the registry - this is pretty simple code, but a word of warning: without just throwing everything away, there are still plenty of ways you can leave links that stop your classes being collected (which was the case for me for a long time, until I made all the third party classes - Groovy included - candidates to be thrown away and reloaded). Even when throwing away all third party libraries you can still be caught out if you use shutdown hooks (e.g. a JVM shutdown hook to clean up connections will tie your classes back to your underlying JRE, meaning that no classes can be collected until you restart your application!).
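The original clearing snippet was Groovy, but the same idea in Scala (the language used for the code elsewhere in these posts) is only a few lines, assuming you have a handle on the application's GroovyClassLoader:

    import groovy.lang.{GroovyClassLoader, GroovySystem}

    object MetaClassCleaner {
      // remove the cached MetaClass for every class the GroovyClassLoader has loaded,
      // breaking the tie between our reloadable classes and the core Groovy classes
      def clear(classLoader: GroovyClassLoader): Unit = {
        val registry = GroovySystem.getMetaClassRegistry
        classLoader.getLoadedClasses.foreach(c => registry.removeMetaClass(c))
      }
    }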

Anyway, above is code to clear the registry. It assumes you have access to the GroovyClassLoader, but you can also follow the same approach by grabbing the MetaClassRegistry from any arbitrary Groovy class and iterating through that. Give it a try - if you play around a little you will probably find it's quite easy to create a leak if you want to!


Anonymous Classes

Another thing to keep in mind in Groovy applications is the generation of classes by your application. As you would expect, as you load Groovy classes into your application they will be compiled to class files (or loaded from a pre-compiled JAR or similar), which are added to PermGen. However, in addition to this, the code being executed may also result in additional (possibly anonymous) classes being created and added to PermGen - so without care, these can start to fill up that space and cause OOM exceptions (although generated classes are often very small, so it might take a while before it actually errors).

An example of what might do this is loading Groovy config files - if you are loading sensibly and just doing it once then it won't be an issue, but if you find yourself re-loading the config on every request/execution then it can keep adding classes to PermGen. Another example of where this happens (surprisingly) is if you are using Groovy templating. Consider the following code:

(Taken from the SimpleTemplateEngine JavaDocs)

The example is a simple case of binding a Groovy template with a map of values - maybe something you would do to send an email or create a customised document for someone - but behind the scenes Groovy will create a class that is added to permgen for each execution. Now this isn't a lot, but if you are dealing with high throughput it can certainly add up pretty quickly.
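For illustration, a rough Scala equivalent of that kind of templating call (the template text and binding values are made up) - each createTemplate call compiles the template into a new generated class, which is what slowly eats permgen under high throughput:

    import groovy.text.SimpleTemplateEngine

    object TemplateDemo extends App {
      val engine = new SimpleTemplateEngine()

      val templateText = "Dear ${firstname}, thanks for your order of ${item}."

      val binding = new java.util.HashMap[String, Any]()
      binding.put("firstname", "Sam")
      binding.put("item", "shortbread")

      // createTemplate compiles the template into a generated class each time it is called
      val output = engine.createTemplate(templateText).make(binding).toString
      println(output)
    }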


Lazy Garbage Collection

Another interesting behaviour that I observed over the last few years is that, in the JVM implementation I was using, the PermGen garbage collector collected lazily. As I mentioned, because nothing interesting traditionally happened in the permanent generation in the JVM, the garbage collectors didn't do anything interesting there. Furthermore, because it was always assumed that the contents of perm gen were fairly static (as the name suggests), the collection happens fairly infrequently, and often only kicks in as part of a full collection (which is more costly). What this means is that even if every time you reload you free up lots of classes for collection (say, your entire application), permgen might not be GC'd for several reloads, as a full collection isn't required, and the JVM will just lazily perform the collection when permgen is almost full.



If you look at the diagram above, it shows a common pattern I observed - each little step in the upswing of the chart represents a full application reload, but you will see that there are multiple reloads before the permgen usage approaches the limit, and it is only when the usage is close to the limit that it actually performs the collection.

The challenge this can present is that if a reload pushes permgen close to the limit without triggering a full collection, then the following reload can once again spike permgen (because the entire application and its associated third party code is being reloaded into memory), pushing it over the limit and causing OOM exceptions. This was not something I ever saw on production environments, but it was fairly common in desktop/development environments where fewer resources were available.

(Another interesting observation in this particular pattern is that the GC seems to be intermittently collecting fewer classes. I never got to the bottom of that question mark: there was no difference in activity between application reloads each time, so there is no change in application behaviour to trigger a leak, and the middle period of reloading maintains a constant level of memory usage - which also shows no leak behaviour.)



Hopefully someone smarter than me about all this stuff is reading this and can shed some better insight into it; otherwise, hopefully it's helpful if you are about to start doing crazy things with perm gen.


Groovy: A Retrospective

I have been using Groovy, in earnest, for just over 4 years now, and as I am possibly moving to using something else (Scala) I wanted to write up some notes about it all (not getting into memory leaks, as that's a whole other story!).

Over the years I have read people saying that it is not production ready, or is only really a language designed for scripting and testing, but I have been using it in a production setting for the last four years and generally find that it is now stable enough for production use (yes, it is still slower than Java, even with @CompileStatic, so if you know from the start that you are building some low-latency stuff, then maybe go with something else).

Java to Groovy

I started using Groovy 4 years ago when I joined a team that was already using it as the primary development language, and I came straight from an enterprise Java background. The nice thing about Groovy (certainly versus something like Scala) is that it is, for the most part, very similar to Java syntax, and if you can write Java then you can also write Groovy - which is great if you have an existing Java team or a large Java hiring pool and you want to start using Groovy. The downside is that it takes considerable discipline and code review to ensure a consistently styled code base, as from my experience, whilst the Java-to-Groovy switch can start in an instant, the journey is a gradual one.

From my experience, both of making the transition myself and from hiring other Java developers into the team, there are three main stages of the journey:

  1. I like the return keyword! - When making an immediate switch to a Groovy environment, most former Java devs (myself included) likely end up just writing Java in *.groovy files - most people find dropping the semicolon the easiest first step, but lots of Java developers won't fully embrace idiomatic Groovy, a good example of this being use of the return keyword.

    I remember early code review conversations being advised that I didn't need the return, to which I insisted that I liked it and found it clearer to read - a conversation I have since been on the opposite end of with several newer colleagues. The effect of this is fairly minimal - it means that as you bring more Java developers on board, you will find patches of your codebase which are distinctly Java in style (return keyword, occasional semicolons, getters and setters, for-each loops etc.) - still readable and functional Groovy, just not idiomatic.

  2. I can type less!? - The second phase that appears to be common is the opposite end of the spectrum - the discovery of dynamic typing and the ability to define everything as def (or to not include the type at all in the case of method arguments).

    I'm not sure if this is a result of laziness (fewer characters to type, and less thinking required about specific types) - which I don't really buy - or whether it is just perceived as the idiomatic way. Another common trait here is to overuse scripts (they have a use, but they are harder to unit test and, in my opinion, don't offer much over a Groovy class with a simple main method that handles the CliBuilder and instantiates the class to run).

    Again, in my opinion, typing is better - I still don't buy that typing def rather than the actual type is that much of a saving, and it often ends up being a code smell (even if it didn't start out as one). The official Groovy recommendation is to type things, with the exception of local, method-scoped variables - but even in that case I cannot see the benefit: there is no real gain to the developer in saving those precious few characters, and future readability suffers as other developers have to track through the code more to be sure of what a variable is doing.

  3. Idiomatic, typed Groovy - In a circle-of-life analogy, the resulting code is actually quite close to Java (well, it's statically typed and I don't focus on brevity of code at the cost of readability), but makes full use of the best Groovy features - automatic getters and setters with dot notation for access (where brevity is good!), all the collection methods (each, findAll, collect, inject), explicit Map and List declaration, optional (but typed!) method arguments, truthy-ness, etc.

    I suspect the evolution to this stage, in part, comes from working on a sufficiently large code base for long enough to have to start dealing with the fallout of nothing being accurately typed - for me this was having to port some Groovy code to Java and finding several methods with several un-typed (sometimes defaulted) arguments, where the method body was just a large if block checking the types of the passed data and handling them differently.

    A week or so ago I put together a small web-scraping project in Groovy - it's just a few classes, but it gives a good picture of how I code Groovy now, after 4 years.

The good bits

I think it was probably a combination of the above journey and Groovy maturing (I started using it at version 1.7), but it took probably 2 1/2 years before I wanted to start using Groovy at home and for my own projects, continuing to use Java in those first few years. But now, I am happy to recommend Groovy (performance concerns noted); it's nice syntactic sugar that can really complement the way you would normally program in Java, and there are some things that I really miss when going back to Java.

  • The explicit map and list definition is awesome - this is a good example of where brevity and readability come together, and it is kind of puzzling that this hasn't been natively supported in Java anyway. The closest thing I can think of that Java has is its Arrays.asList(..) method, which is a nice helper, but still more clunky than it needs to be. When I go back to Java I definitely miss the ability to just natively define nested Map/List structures.
  • The extended collection methods with closure support, for iterating, collecting, filtering etc. are also great, and probably the biggest part of Groovy that I miss when going back to Java (I know that Java streams have started to address this area, and my limited experience of them has been pretty nice, but not quite as good). This isn't just nice syntax for iterating or filtering (although, again, how has there not been better native support for find/findAll in Java until now?), but also the support for more functional programming techniques such as map-reduce.
  • Another simple, nice feature that adds readability and conciseness - being able to simply assert if ( list ) or if ( map ) to determine whether a collection has any values is great.
  • Automatic getters and setters - this is a nice one in theory. I agree that it makes no sense to always have to create getters and setters for everything (even if it is normally the IDE generating them), and I like the dot notation for accessing them. At first glance it feels like it breaks the encapsulation of the object (and maybe they should have implemented this feature as an annotation rather than messing with the standard Java access modifier behaviour) - but really it's just doing what you would do by hand: if you leave your variables with the default modifier (and don't add your own getters and setters) then Groovy will add them at compile time, and you are free to use the getter and setter methods throughout the code. The idea is that you then use the private modifier for variables that you want to keep for internal use only and not expose. However, there is a long-standing bug that means this doesn't work, and you actually get getters and setters for everything! Still, a nice idea.


It's not quite a farewell, as I will continue to use Groovy (partly because I still enjoy using the Spring framework so much) for the time being, and only time will tell whether Scala and some other frameworks take its place as my go-to hobbyist platform of choice!


Spring-Security: Different AuthenticationEntryPoint for API vs webpage

This is just a real quick post on a little bit of Spring that I came across today. It's a very simple thing but, in my opinion, beautiful in its simplicity.

I found myself working on some Spring Security stuff, and an app where I needed to define my AuthenticationEntryPoint (I am in the process of adding the security stuff, so this is not done yet). Simple enough - normally in config you can just add it to the exception handling setup. However, this time I wanted to define two different entry points: one for when a user attempts to access an API (JSON) and one for normal site pages.

It's not unusual to have an API baked into an app (maybe under /api/** etc), and the ideal behaviour would be to return an appropriate HTTP status code for the API (401) plus a JSON payload, whereas for a web page requiring login the user would be bounced to the login page before continuing.


Having dealt with this split for error handling, controller routing and security elsewhere, I assumed I would have to implement a custom AuthenticationEntryPoint, chuck in a few IF statements checking the logged-in user and requested URL, and either redirect or respond with the status appropriately. However, Spring has us covered with its DelegatingAuthenticationEntryPoint - which is exactly what it sounds like, and super simple to use. Probably best demonstrated with the code (because it's just that simple!):
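The original config was Groovy; the sketch below is an approximate Scala equivalent of the same wiring, assuming Spring Security 4.1+ (the matcher path and the particular entry point implementations are illustrative choices):

    import java.util.LinkedHashMap
    import org.springframework.http.HttpStatus
    import org.springframework.security.web.AuthenticationEntryPoint
    import org.springframework.security.web.authentication.{DelegatingAuthenticationEntryPoint, HttpStatusEntryPoint, LoginUrlAuthenticationEntryPoint}
    import org.springframework.security.web.util.matcher.{AntPathRequestMatcher, RequestMatcher}

    object EntryPoints {
      def delegating(): DelegatingAuthenticationEntryPoint = {
        val entryPoints = new LinkedHashMap[RequestMatcher, AuthenticationEntryPoint]()
        // API calls get a bare 401 rather than a redirect
        entryPoints.put(new AntPathRequestMatcher("/api/**"), new HttpStatusEntryPoint(HttpStatus.UNAUTHORIZED))

        val delegating = new DelegatingAuthenticationEntryPoint(entryPoints)
        // anything else falls through to the normal login page redirect
        delegating.setDefaultEntryPoint(new LoginUrlAuthenticationEntryPoint("/login"))
        delegating
      }
    }

    // in the security config: http.exceptionHandling().authenticationEntryPoint(EntryPoints.delegating())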

In our normal configure method we just set the entry point as usual, but in the DelegatingAuthenticationEntryPoint we simply initialise it with a map of RequestMatcher to AuthenticationEntryPoint (the original was defined in Groovy, so we get the nice Map definition - it would be slightly more verbose in Java). The RequestMatcher can be any implementation you like, but simple path matchers will probably work fine; for the AuthenticationEntryPoint there are also lots of really nice Spring implementations, including the two used in the example above, which provide exactly what I need.


This genuinely elicited an "awww yeah" from me.


Sentiment analysis of stock tweets

Having previously wired up a simple Spring app with Twitter to consume their tweet stream relating to last year's Rugby World Cup - mostly just to experiment with the event-driven programming model in Spring and Reactor - I thought, on a whim: why not see if I can find some nice sentiment analysis tools to analyse the tweets? Then, rather than just counting the number of tweets about a given topic, I could also analyse whether they were positive or not.


Now, that probably sounded like a fairly glib comment. And to be honest, it was: sentiment analysis is very hard, and the last time I looked most efforts were not up to much. Added to that, to make it actually effective you need some pretty specific training data - for example, if you had a model trained on this blog and then tried to apply it to another sort of text - say, tweets - then it's most likely not going to perform well.  Tweets are particularly different, as people use different language, grammar and colloquialisms on Twitter (in part due to the 140-character limit) compared to normal writing.

But still, I had my laptop on my commute home on the train, so I figured why not see if there are any simple sentiment analysis libraries that I could just drop in and run the tweets through.  Sure, the resulting scores would likely be way off, but it would be an interesting experiment to see how easy it was (and, if it worked, we could then look for a decent training set to re-train the model so it was more accurate at analysing tweets).


A quick google later and I came across Stanford's Core NLP (Natural Language Processing) library, via the snappily titled "Twitter Sentiment Analysis in less than 100 lines of code!" (which seemed just as flippant as my original suggestion, so seemed like a good fit!).  Surprisingly, it was actually just as easy as I had hoped it might have been! The libraries are readily available in the Maven repo, come with a pre-trained model (albeit trained on film reviews) and are written in Java.  A lot of the code is taken from the approach outlined in the above article and the Stanford Core NLP sample class, but it's pretty simple, and having set it all up on my commute I managed to process a few thousand tweets last night and analyse their sentiment (producing wildly inaccurate sentiment scores - but who's to know, right?!)

(I switched to streaming stock related tweets - mostly just so I could include references to Eddie Murphy in Trading Places)

Updating our dependencies

I will skip the normal app setup and Twitter connection stuff, as I was just building this on top of the app I had previously put together for the RWC (which already connected to the Twitter streaming API and persisted info to Redis).

All we need to do here is add the two Stanford dependencies - you can see I also added a dependency for Twitter's open-source text library, which provides tweet cleanup/processing utilities and is really just used here to extract "cashtags" (like a hashtag, but starting with a $, used on Twitter to indicate stock symbols, e.g. $GOOGL).
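For reference, the build changes were along these lines (a Gradle example; the version numbers are just placeholders for whatever was current at the time):

dependencies {
    // Stanford CoreNLP itself, plus the separate artifact containing the pre-trained models
    compile 'edu.stanford.nlp:stanford-corenlp:3.6.0'
    compile 'edu.stanford.nlp:stanford-corenlp:3.6.0:models'

    // Twitter's open-source text library, used here just to pull cashtags out of tweets
    compile 'com.twitter:twitter-text:1.14.7'
}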


Spring configuration

Next up, as we are using Spring, it's super easy to add the configuration so we can let Spring manage our Stanford NLP objects and inject them into the service class that will hold the code to analyse the sentiment.
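Something like the following - this is a sketch, so the config class and bean method names are my own, but the annotator properties match what is described below:

import com.twitter.Extractor
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration

@Configuration
class NlpConfig {

    @Bean
    StanfordCoreNLP stanfordCoreNLP() {
        // Equivalent to an annotators entry in a properties file - sentiment needs the parse annotator
        Properties props = new Properties()
        props.setProperty('annotators', 'tokenize, ssplit, parse, sentiment')
        new StanfordCoreNLP(props)
    }

    @Bean
    Extractor twitterExtractor() {
        // Twitter's Extractor pulls cashtags/hashtags/mentions out of raw tweet text
        new Extractor()
    }
}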

Now we have told Spring to manage the main Stanford class we need along with Twitter's simple Extractor class. For the StanfordCoreNLP class we pass in some properties describing which text analysis steps we want to run (this can usually be done with a properties file, but I was feeling lazy so did it programmatically - you can see details of which annotators are available here: http://stanfordnlp.github.io/CoreNLP/annotators.html )


Next up, based on the code examples we have seen, we need a little bit of code to analyse a piece of text and return a score, so I created a simple Spring service called SentimentService that I later wire into my event listener.
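The service ends up looking something along these lines (a sketch following the approach from the article and Stanford sample class mentioned above - the exact names here are illustrative, and the sentiment annotation class name varies slightly between CoreNLP versions):

import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.stereotype.Service

@Service
class SentimentService {

    @Autowired
    StanfordCoreNLP pipeline

    /**
     * Scores a piece of text from 0 (very negative) to 4 (very positive),
     * taking the score of the longest sentence as the overall score.
     */
    int analyse(String text) {
        int mainSentiment = 2   // default to neutral
        int longest = 0
        Annotation annotation = pipeline.process(text)
        annotation.get(CoreAnnotations.SentencesAnnotation).each { sentence ->
            def tree = sentence.get(SentimentCoreAnnotations.SentimentAnnotatedTree)
            int sentiment = RNNCoreAnnotations.getPredictedClass(tree)
            String sentenceText = sentence.toString()
            if (sentenceText.length() > longest) {
                mainSentiment = sentiment
                longest = sentenceText.length()
            }
        }
        mainSentiment
    }
}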

That's mostly it really: in my event listener, instead of just persisting the tweet along with its labels, I also run the analysis and save the score.


 (Analysis of a couple of thousand tweets - an average score plus the number of tweets for each symbol)

As always, all the code is available on GitHub, so feel free to fork it and play yourself (and if you manage to find a training set to accurately analyse tweets then let me know!)


Mobile: Web vs Native (again)

It seems like by now pretty much everyone has weighed in on the native vs web conversation for mobile development, and seeing as I haven't yet, I thought why not? The age-old question being: should you make a native mobile app or just go with a standard responsive website? (Or just use some common technology to wrap up your mobile website in an app.)

The reason I started thinking about this again was that an article was circulated at work which seemed to be saying that you should do both native and web, as they have different strategic values. Whilst I largely disagree with that, I think the underlying point being made was that there are different reasons for going down either path.

A quick caveat:

I will get this out of the way up front - I'm not talking about any scenario where you are in a very mobile-centric business. If you want to make use of phone hardware like the camera, or you are specifically a mobile-first type of app, or one that is intrinsically linked with the mobile as an identity, then yes, native is the only option.  This discussion is more for normal existing businesses that might (or might not) already have a website and are at the point of inflection of whether to build a native app or not.

--- 

So, native or responsive web?

My general rule of thumb is as follows: Don't go native.

You would be right in thinking that's a fairly sweeping rule, but I think it's probably fair - and why's that? I think there are two primary concerns that make the cost and effort of building a native app not worthwhile:

1. Discoverability

The web is a great place for discovering things - just the word "web" portrays it quite nicely: from any given point there are undoubtedly loads of little threads (I'm talking links) that can be followed at no real cost. Assuming it's not a dodgy-looking link, the barrier to someone following a link is practically non-existent. So you read a nice featured article about a new company/product on a blog, at the end of the article they link to their site, and you click it. I mean, why not, right? If it ends up looking crappy the back button saves you, and what have you lost? A few seconds? Easy.

This is something that is clearly missing in the mobile app ecosystem. Assuming you aren't some behemoth of a company that millions of people want (or need) to interact with, a mobile app isn't going to help increase your customer base. Sure, it might give you a richer experience for the handful (I'm talking in web scale here) of customers you have, but it's not going to help grow your customer base, it will cost you time and money to produce and maintain, and that's before you start having to work on directing existing customers to your mobile app.

So again, if you are a bank with millions of customers who have specific, regular needs to transact with you, or if you are a hot new social-local-sharing company, then sure, go for that enhanced UX.  But an app is only going to be any good if users already have the intent to transact with you. If a user is browsing the web on a mobile, your conversion rate is going to be a lot higher with a link through to a responsive website than to an app download.


2. Scalability

For me, this is the deal breaker. I'm not talking about technical scalability - whether your servers will be able to withstand the undoubted, soon-to-be-approaching mass of people who will rush out and download your app as soon as it's released (see my previous point) - I'm talking about customer-to-app scalability.

Imagine you're a regular business - you have thousands of happy customers, maybe your website even gets tens or hundreds of thousands of uniques a day - so let's build an app, right?

The problem is, there is a limit to how many apps a user will have on their phone. On a standard Android phone you will likely have two screens' worth of app icons when you unbox it, so there is a limit to how many more apps a user will install. Again, this is where it differs from the web: visiting a website is free, but there is a much greater barrier to installing and keeping an app (let alone using it) - and when the phone is getting full, or sluggish, or the user wants a bit more space to download a show from BBC iPlayer, the apps that aren't regularly used are going to get the chop.

When you are competing for limited resources against the big players - Facebook, messaging, banks, geo-based stuff (maps, Uber etc) - it's hard to make a compelling case for the phone user to keep, or even install, your app.  This competition makes it even harder to convert your loyal web customers to mobile - and weighed up against the fact that you could make an awesome (and consistent) responsive web experience instead, the choice looks pretty clear to me.


Benedict Evans says you should build an app if people are going to put your app on their homescreen, which seems like pretty sound advice - and given the size of a phone homescreen, that makes for fairly few companies.


Final caveat:

If you are building the app because it came out of a side-project organically, or a hackathon or something similar, then by all means - there's lots of fun to be had and lessons to be learnt in building, testing and launching a mobile app - so if you don't mind the potential cost then definitely go for it!


