Using OpenNLP for Named-Entity-Recognition in Scala
A common challenge in Natural Language Processing (NLP) is Named Entity Recognition (NER) - the process of extracting specific pieces of data from a body of text, most commonly people, places and organisations (for example, extracting the names of all the people mentioned in a Wikipedia article). NER has been tackled many times over the evolution of NLP: dictionary-based approaches, then rule-based approaches, then statistical models and, more recently, neural networks. Whilst there have been recent attempts to crack the problem without it, the crux of the issue is that for an approach to learn, it needs a large corpus of marked-up training data. There are some marked-up corpora available, but the problem is still quite domain specific - a model trained on WSJ data might not perform particularly well against your domain-specific text - and finding a set of 100,000 marked-up sentences for your own domain is no easy feat. One way to tackle this is to generate training data, but it can be hard to generate truly representative data, so this approach always risks over-fitting to the generated set.
Having previously looked at Stanford's NLP library for some sentiment analysis, this time I am looking at the OpenNLP library. Stanford's library is often referred to as the benchmark for several NLP problems; however, those benchmarks are always against the data it was trained on, so out of the box we likely won't get amazing results against a custom dataset. Further to this, the Stanford library is licensed under the GPL, which makes it harder to use in any kind of commercial/startup setting. The OpenNLP library has been around for several years, and one of its strengths is its API - it's pretty well documented, quick to get up and running with, and very extensible.
Training a custom NER
Once again, for this exercise we are going back to the BBC recipe archive for the source data - we are going to try to train an OpenNLP model that can identify ingredients. To train the model we need some example sentences - the OpenNLP documentation recommends at least 15,000 marked-up sentences - so I annotated a bunch of the recipe steps and ended up with somewhere in the region of 45,000 sentences, in the format below:
Bring a large pan of salted water to the boil, then add the <START:ingredient> cauliflower <END> and cook for two minutes.
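Hand-marking 45,000 sentences isn't practical, but since each recipe in the archive comes with its ingredient list, the markup can be generated. A rough sketch of the idea in Scala - the annotateStep helper and its naive token matching are illustrative only, the actual data was produced with a similar one-off Groovy script:

// Illustrative only: wraps known ingredient mentions in OpenNLP's
// <START:ingredient> ... <END> markup so a recipe step can be used as training data
def annotateStep(step: String, ingredients: Set[String]): String =
  step.split(" ").map { token =>
    val word = token.toLowerCase.filter(_.isLetter) // strip punctuation before matching
    if (ingredients.contains(word)) s"<START:ingredient> $token <END>" else token
  }.mkString(" ")

annotateStep(
  "Bring a large pan of salted water to the boil, then add the cauliflower and cook for two minutes.",
  Set("cauliflower"))
// => "... then add the <START:ingredient> cauliflower <END> and cook for two minutes."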
Once we have our training data, we can easily set up some code to feed it in and train our model:
import java.io.{BufferedOutputStream, FileInputStream, FileOutputStream}
import java.nio.charset.Charset

import opennlp.tools.ml.maxent.quasinewton.QNTrainer
import opennlp.tools.namefind.{NameFinderME, NameSampleDataStream, TokenNameFinderFactory}
import opennlp.tools.util.{ObjectStream, PlainTextByLineStream, TrainingParameters}

def trainModel() = {
  val charset = Charset.forName("UTF-8")
  val lineStream: ObjectStream[String] =
    new PlainTextByLineStream(new FileInputStream("src/main/resources/trainingdata.txt"), charset)
  val sampleStream = new NameSampleDataStream(lineStream)

  // Train a maxent model (quasi-Newton trainer) from the annotated samples
  val model =
    try {
      val params = TrainingParameters.defaultParams()
      params.put(TrainingParameters.ALGORITHM_PARAM, QNTrainer.MAXENT_QN_VALUE)
      NameFinderME.train("en", "food", sampleStream, params, new TokenNameFinderFactory())
    } finally {
      sampleStream.close()
    }

  // Serialise the trained model so it can be reloaded later
  val modelOut = new BufferedOutputStream(new FileOutputStream("src/main/resources/en-ingredients-finder.bin"))
  try {
    model.serialize(modelOut)
  } finally {
    modelOut.close()
  }
}
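Before relying on the model, it's worth measuring it against some held-out sentences. OpenNLP provides a TokenNameFinderEvaluator for this; a minimal sketch, reusing the model from above and assuming an evaldata.txt file (an illustrative path) in the same markup as the training data:

import java.io.FileInputStream
import java.nio.charset.Charset
import opennlp.tools.namefind.{NameFinderME, NameSampleDataStream, TokenNameFinderEvaluator}
import opennlp.tools.util.{ObjectStream, PlainTextByLineStream}

// evaldata.txt is an illustrative path: held-out sentences in the same
// <START:ingredient> ... <END> format, kept out of the training set
val evalLines: ObjectStream[String] = new PlainTextByLineStream(
  new FileInputStream("src/main/resources/evaldata.txt"), Charset.forName("UTF-8"))
val evaluator = new TokenNameFinderEvaluator(new NameFinderME(model))
evaluator.evaluate(new NameSampleDataStream(evalLines))
println(evaluator.getFMeasure) // prints precision, recall and F-measure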
When we want to use the model, we can simply load it back into the OpenNLP name finder class and use that to parse the input text we want to check. One thing to note is that the name finder works on a single tokenized sentence at a time, rather than on raw text.
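A minimal sketch of that tokenization step, using OpenNLP's built-in whitespace tokenizer (the recipe text here is illustrative):

import opennlp.tools.tokenize.WhitespaceTokenizer

// NameFinderME.find expects an Array[String] of tokens, one sentence at a time
val sampleRecipe: Array[String] = WhitespaceTokenizer.INSTANCE.tokenize(
  "Melt the butter and sugar together , then stir in the rosemary .") // illustrative text

With the recipe tokenized, we load the saved model and run the finder over it: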
import java.io.FileInputStream
import opennlp.tools.namefind.{NameFinderME, TokenNameFinderModel}

val modelIn = new FileInputStream("src/main/resources/en-ingredients-finder.bin")
val model = new TokenNameFinderModel(modelIn)
val nameFinder = new NameFinderME(model)
// Each Span holds start/end token indices into the input sentence
val matches = nameFinder.find(sampleRecipe)
matches.foreach { m =>
  sampleRecipe.slice(m.getStart, m.getEnd).foreach(println)
}
Running this over one of the recipes from the archive, it picked out:
- butter
- sugar
- rosemary
- caramel
- shortbread
All in all, pretty good - it missed some ingredients, but given that the training data was created in about 20 minutes by manipulating the original recipe set with some Groovy, that's to be expected. More importantly, it did well in not returning false positives.
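If false positives do start creeping in on other data, the name finder also exposes a probability per match, which gives a simple way to trade recall for precision; a small sketch (the 0.8 threshold is an arbitrary illustration):

// probs() returns the model's confidence for each span found
val spans = nameFinder.find(sampleRecipe)
val confident = spans.zip(nameFinder.probs(spans)).collect {
  case (span, prob) if prob > 0.8 => span
}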
In conclusion, if you have a decent training set, or have the means to generate some data with a decent range, you can get some pretty good results using the library. As usual, the code for the project is on GitHub (although it is little more than the code shown in this post).