Machine Learning with AWS & Scala

Recently, in an attempt to start learning React, I began building an akka-http backend API as a starting point. I quickly got distracted by the backend and ended up integrating with both the Twitter streaming API and the AWS Comprehend sentiment analysis API - which is what this post will be about.

This is similar to an old idea of mine, where I built an app consuming tweets about the 2015 Rugby World Cup. This time my app was consuming tweets about the FIFA World Cup in Russia - splitting tweets by country and recording the sentiment for each one (and so a rolling average sentiment for each team).


Overview

The premise was simple:

  1. Connect to the Twitter streaming API (aka the firehose), filtering on World Cup related keywords
  2. Pass the body of the tweet to AWS Comprehend to get the sentiment score
  3. Update the in-memory store of stats (count and average sentiment) for each country

In terms of technology used:
  1. Scala & Akka-Http
  2. Twitter4s Scala client
  3. AWS Java SDK

As always, all the code is on GitHub. To run it locally, you will need a Twitter dev API key (add an application.conf as per the readme on the Twitter4s GitHub) and an AWS key/secret. The code will look for credentials stored locally, but you can also just set them in environment variables before starting. The free tier supports up to 50,000 Comprehend API requests in the first 12 months, and as you can imagine, plugging this directly into Twitter can result in lots of calls, so make sure you restrict it (or at least monitor it) before you leave it running!


Consuming Tweets

Consuming tweets is really simple with the Twitter4s client - we just define a partial function that will handle the incoming tweet. 

The other functions for parsing countries/teams are excluded for brevity, and you can see it's quite simple: for each inbound tweet we make a call to the Sentiment Service (we will look at that later), then pass the tweet with the additional data to our update service, which stores it in memory. You will also see it is ridiculously easy to start the Twitter streaming client filtering by keywords.
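
As a rough illustration of the handler shape described above, here is a minimal, library-free sketch. It is not the code from the repo: the `StreamingMessage`, `Tweet` and `Warning` types are hypothetical stand-ins for the classes Twitter4s provides, and the keyword table, `countriesMentioned` and `handler` are names made up for this example.

```scala
object TweetHandlerSketch {
  // Hypothetical stand-ins for the message types a streaming client delivers.
  sealed trait StreamingMessage
  final case class Tweet(text: String) extends StreamingMessage
  final case class Warning(message: String) extends StreamingMessage

  // Assumed keyword table; a real one would cover every team.
  val countryKeywords: Map[String, Set[String]] = Map(
    "Poland"  -> Set("poland", "#pol"),
    "Senegal" -> Set("senegal", "#sen")
  )

  // Every country with at least one keyword in the tweet text (case-insensitive).
  def countriesMentioned(text: String): Set[String] = {
    val lower = text.toLowerCase
    countryKeywords.collect {
      case (country, words) if words.exists(w => lower.contains(w)) => country
    }.toSet
  }

  // The partial function is only defined for tweets, so other stream
  // messages (warnings, deletes and so on) are simply not handled.
  def handler(onTweet: (Tweet, Set[String]) => Unit): PartialFunction[StreamingMessage, Unit] = {
    case tweet: Tweet =>
      val countries = countriesMentioned(tweet.text)
      if (countries.nonEmpty) onTweet(tweet, countries)
  }
}
```

In the real app the `onTweet` callback would be where the sentiment call and the stats update happen. Note that a tweet naming two teams is reported against both of them.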


Detecting Sentiment

Because I wanted to be able to stub out the sentiment analysis without being tied to AWS, you will notice I am using a self-type annotation on my Twitter class above, which requires a SentimentModule to be mixed in when the class is instantiated - I am using a simple cake pattern to manage all my dependencies here. In the GitHub repo there is also a Dummy implementation that just picks a random number for the score, so you can still see the rest of the API working. The interesting part, though, is the AWS integration.

Once again, the SDK makes the integration really painless. In my code I simplify the actual results a lot, down to a much cruder Positive/Neutral/Negative rating (plus a numeric score from -100 to 100).
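
To make the simplification step concrete, here is one way it could look. Comprehend's sentiment response includes confidence values for positive, negative, neutral and mixed; the `Scores` case class below is a hypothetical stand-in for that part of the response, and both the formula (positive minus negative, scaled to -100..100) and the threshold of 25 are assumptions for this sketch, not necessarily what the repo does.

```scala
object SentimentMappingSketch {
  // Stand-in for the four confidence values (each between 0.0 and 1.0).
  final case class Scores(positive: Double, negative: Double, neutral: Double, mixed: Double)

  sealed trait Rating
  case object Positive extends Rating
  case object Neutral extends Rating
  case object Negative extends Rating

  final case class Sentiment(rating: Rating, score: Int)

  // Assumed formula: the gap between the positive and negative confidences,
  // scaled to -100..100. Neutral and mixed confidences are ignored here.
  def simplify(s: Scores, threshold: Int = 25): Sentiment = {
    val score = math.round((s.positive - s.negative) * 100).toInt
    val rating: Rating =
      if (score >= threshold) Positive
      else if (score <= -threshold) Negative
      else Neutral
    Sentiment(rating, score)
  }
}
```

Anything whose score falls inside the threshold band is treated as Neutral, which also swallows mixed tweets; that is a deliberate crudeness in this sketch.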

The AWSCredentials class is the bit that is going to look in the normal places for an AWS key.
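
The self-type wiring mentioned at the start of this section can be sketched roughly as follows. All names here (`SentimentService`, `DummySentimentModule`, `TweetConsumer`) are illustrative and may not match the repo; the dummy module mirrors the idea of picking a random score so the rest of the app can run without AWS.

```scala
object CakeSketch {
  trait SentimentService {
    // Score in the range -100..100.
    def score(text: String): Int
  }

  // The module trait exposes the dependency without fixing an implementation.
  trait SentimentModule {
    def sentimentService: SentimentService
  }

  // Dummy module: a random score, no AWS calls.
  trait DummySentimentModule extends SentimentModule {
    private val random = new scala.util.Random()
    val sentimentService: SentimentService = new SentimentService {
      def score(text: String): Int = random.nextInt(201) - 100
    }
  }

  // The self-type means this class cannot be instantiated unless some
  // SentimentModule is mixed in; the compiler enforces it.
  class TweetConsumer { this: SentimentModule =>
    def process(text: String): Int = sentimentService.score(text)
  }

  // Wiring happens at the point of instantiation.
  val consumer = new TweetConsumer with DummySentimentModule
}
```

Swapping in an AWS-backed module would then be a one-line change where the consumer is created, and `new TweetConsumer` on its own does not compile.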


Storing and updating our stats

So now we have our inbound tweets and a way to assess their sentiment score. I then set up a very simple akka actor to manage the state and just stored the API data in memory (if you restart the app, the store gets reset and the API stops serving data).

Again, this is very simple out-of-the-box stuff for akka, but it allows easy and thread-safe management of the in-memory data store. I also track a rolling list of the last twenty tweets processed, which is managed by a second, almost identical, actor.
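
An actor keeps this safe by handling one message at a time, so the interesting part is really just the state transition it applies. Below is a library-free sketch of that transition as immutable data; the `CountryStats` and `Store` names are hypothetical, and for brevity it folds the per-country stats and the last-twenty list into one structure, whereas the app described here uses two actors.

```scala
object StatsStoreSketch {
  final case class CountryStats(count: Long, averageSentiment: Double) {
    // Incremental mean, so individual scores never need to be kept.
    def add(score: Int): CountryStats = {
      val newCount = count + 1
      CountryStats(newCount, averageSentiment + (score - averageSentiment) / newCount)
    }
  }

  final case class Store(stats: Map[String, CountryStats] = Map.empty,
                         latest: Vector[String] = Vector.empty) {
    // Applies one tweet to every country it mentions and keeps only the
    // most recent `keep` tweet texts, newest first.
    def record(text: String, countries: Set[String], score: Int, keep: Int = 20): Store = {
      val updated = countries.foldLeft(stats) { (acc, country) =>
        acc.updated(country, acc.getOrElse(country, CountryStats(0, 0.0)).add(score))
      }
      Store(updated, (text +: latest).take(keep))
    }
  }
}
```

Inside an actor, the current `Store` would be the only mutable reference, replaced with the result of `record` on each incoming message.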


The results

I ran the app during several games; below are some sample outputs from the API. The response from the stats API is fairly boring reading (just numbers), but the example tweets show two examples of a positive and neutral tweet correctly identified (apologies for the expletives in the tweet about Poland - I guess that fan wasn't too happy about being beaten by the Senegalese!).

You will also notice that the app captures the countries being mentioned, which exposes one flaw of the design. In the negative tweet from the Polish fan losing two goals to Senegal, it correctly identifies the sentiment as negative, but we have no way to determine the subject. As both teams are mentioned, the app naively assigns it as a negative tweet to both teams, whereas on reading it is clearly negative with regard to Poland (I wasn't too concerned for my experiment, of course; it's just an observation worth noting).

Sample tweet from the latest API:

Sample response from the stats API:

When I finally did get around to starting to learn React, I just plugged in the APIs and paid no attention to styling, which is a roundabout way of apologising for the horrible appearance of the screenshot below (I'm really sorry about the CSS gradient)!




