Creating a Java URL shortener

A long time ago I posted a brief article about using Google's URL shortening API to easily create shortened URLs for your application (e.g. if you wanted to post stuff server side to twitter and cared about character usage, you could fire your URL over to Google's API and just use the result).

However, recently I have had to think about how to shorten URLs myself, and really, it's pretty easy to implement yourself.


Creating a unique, repeatable identifier for a URL
I think a lot of people's first instinct might be to go for hashing the URL string - this isn't a good idea for a few reasons though:
  • Length - most normal hashing algorithms (md5/sha-*) produce long strings, which kind of goes against the point of a url shortener
  • Unique-ness - Obviously, if this is going to be a URL identifier then it needs to be unique, and hashes by their very nature are not unique - which means you would need to handle the scenario where a URL creates an already used hash and has an alternative
  • Look-up - as hashes are not (easily) reversible, you would need to look up the URL using the hash as the db key - which may not be ideal given a very large set of URLs (imagine number of URLs bitly.com has)


Thankfully there is a viable, easy solution available.


Lets first think about our database structure for persisting our URLs - In the simplest case we could probably get by with two columns:
  • id (DB generated sequence ID)
  • url - text field to capture the URL value

Generating the identifier from a DB
  1. Now, if you provide a String URL value, your code just needs to insert it into the table, this will create the row and the unique ID.
  2. Next, fetch that unique numeric ID, and convert it to base-62 (this will convert the numeric value into the base-62 representation (rather than normal base10, it will allow 0-9, a-z, A-Z as characters.  This gives you both an identifier in the form of "1jLPSIv" but also provides a massive ID space that can fit into relatively few characters (that will be part of the shortened URL) - 6 characters of base 62 provides a possible 6^62 different unique combinations (1.7594524073e+48 in total)

You may not want to directly leak the DB ids to the URL, in which case you can easily salt the IDs as you choose appropriate.



Bit.ly example

Looking at bit.ly shortening URLs it would appear that they follow the same pattern. I shortened two of my previous blog post URLs one after another, and the URLs generated as follows:

bit.ly/1oYgrsG

bit.ly/1tR63Zb

If we look at those two url identifiers, they look a lot like base-6, and if we convert the identifiers to base 62(and let's look at base-64 as well, just for funsies they may be using that)

URL - Base64>Base10 - Base62>Base10
1oYgrsG - 3685493160710 - 103119489480
1tR63Zb - 3690751293019 - 107587946123

As you can see, they look reasonable - there's a chance that there were a few URLs created between my two URLs, but I suspect there is a reasonable amount of salting going on so it is just a one-by-one increment. (although interestingly, if you get a bit.ly url and then just increase the last char by one - assuming base62 - then it will likely provide a new url - so you can have your own game of manual-webpage-roulette!)

Algorithm analysis & complexity: MergeSort - part 1

I have just signed up for the Coursera Algorithm Design & Analysis course from Stanford. I have been meaning to take it up for a while but kept missing when it was running. As it happens, this time I have all but missed it, with the final assessments due already, but I have signed up and am going through the material and lectures on my own.

I am only on week one so far, but it is all good - the lecturer seems good and articulates the content well (so far the presentation and sophistication of the course is better than the Scala & FP course and the Logic courses I have previously done on Coursera).

After I completed the Functional Programming & Scala course, I wrote up some notes on some of the techniques and problems in Groovy and thought it might be fun to write up some groovy notes from the algos course.

Week one is covers an introduction to algorithm analysis, divide and conquer (with merge sort) and some other stuff.


Mergesort - Analysis

Mergesort has three key steps
  1. Split the list in to two halves
  2. Recursively sort the first half
  3. Recursively sort the second hald
  4. Merge the two sorted halves
This is a well known example of the divide&conquer paradigm, the recursive sorting continues to recurse and divide the list into two halves until sorting is trivial (e.g. base case of one item in each list).


Let's look at the merge step:

(pseudocode taken from the course)
result = output [length = n]
leftList = 1st  sorted array [n/2]
rightList = 2nd  sorted array [n/2]
i = 1

j = 1
for k = 1 to n
  if leftList(i) < rightList(j) 

    result(k) = leftList(i)
    i++ 

  else [rightList(j) < leftList(i)]
    result(k) = rightList(j)
    j++
end



The above should be fairly straight forward to understand:
  • result stores our final merged list, and will be length n (number of elements in starting array)
  • leftList/rightList - these are the two inputs to the merge, and are simple the two halves of the list that need to be merged. This assumes that both lists are already sorted prior to merging
  • i/j - These are pointers to what we consider the "head" of each list. Rather than removing items from lists we will simply keep track of a pointer for each list to track how many items of each list have been merged
  • We then iterate through all "n" items, comparing the "head" of our two lists, the smaller value getting added to our result list


Complexity analysis of merge step:

In the course, the lecturer states that the worst case running time = 4n + 2 which he then further simplifies to = 6n  (this simplification is simply because n must always be at least 1, so 4n + 2 will always be no worse than 6n.

Let's look at how we get to that figure (we will then later look at what the asymptotic run time is). This figure is simply the number of operations that need to be executed when this algorithm is run so let's count those:

result = output [length = n]
leftList = 1st  sorted array [n/2]
rightList = 2nd  sorted array [n/2]
i = 1

j = 1
In the above, the only operations that are actually set are i & j (the others are just describing the function input/output). So that costs us 2 (constant regardless of how big "n" is)

for k = 1 to n
  if leftList(i) < rightList(j) 

    result(k) = leftList(i)
    i++ 

  else [rightList(j) < leftList(i)]
    result(k) = rightList(j)
    j++
end



In the above, we have one condition checking/incrementing "k" in the for statement (executed every iteration, so this will be done "n" times); we then have an IF conditional check, this will also be done in every iteration (so this is another "n" operations); then depending on which branch of that condition is executed, there are always a further two operations executed (the assignment of the head to result and the increment of the head pointer) - another two operations executed "n" times.

So the above block executes 4 operations for each of the "n" iterations. The first block executes 2 operations =

4n + 2

Seems straight forward enough.  Really though, this is just a fairly simplified piece of pseudo code, and there is a bunch of additional code that is needed to handle other stuff, edge cases etc (e.g. what happens if you have added all of the left list to the result and only have the right list left? obviously that can just be appended to the results, but the above code doesn't handle this). Then there are further questions about different language implementations that might have more operations to do this (not to mention difference when it gets compiled down to machine code) - we will get more into this later when we talk about asymptotic performance. For the time being we will just say the run time (worst case) for merging two lists is 4n + 2.


Now, let's look at recursively sorting..

This part of the algorithm is also relatively simple, we just keep recursing through the list, splitting the list in to two halves until each half of the list is just one element (one element is already sorted!), then we can merge those lists easily.

So, if we are recursing through and splitting the list into two halves, how many times will we have to perform the merge?

Let's step through this, with an example input list of 8 elements (n=8)
  1. One list of 8 elements
  2. Two lists of 4 elements
  3. Four lists of 2 elements
  4. Eight lists of 1 elements
this is just a basic binary tree - see the illustration below taken from wikipedia:


As you can see in the above tree, it splits down into individual elements, then it re-combines them using the merge technique. To work out the complexity of the algorithm we need to know two things:
  1. How deep is our tree going to be?
  2. What is the runtime of a given depth in the tree (e.g. for a given depth "j" in the tree, what is the cost of re-merging all the lists at that level

The first question is relatively easy, as it is a binary tree we know the depth is simply going to be log2(n)+1 - How do we know that? the log of a  number is simply how many times it needs to be divided by 2 (assuming 2 is the logarithm base) before the number is <= 1 - which is exactly what we are doing when we are recursively dividing our lists in half until we reach the single element lists.

E.g. the log2(8) = 3 (8/2=4; 4/2=2; 2/2=1)



So now lets calculate the running time for any given level of the tree.  Lets consider a list of length "n" and we want to know a few things:

How many merge operations do we need to perform for depth "j" of a list "n"?
In a similar way to how we calculated the depth, we can work out the number of merges (or number of leaves at the level of the tree).  If you have halved the lists each step 0..j then the number of leaves will be 2 to the power j.

E.g. The first level down, after just halving the original list, we have j=1, so expect 2 power 1 = 2 leaves.
Second level down, we again halve each of those two lists, we have j=2, so expect 2 power 2 = 4 leaves

Note that j cannot exceed the depth of the tree, which we already know.


What is the complexity of each of these merge operations?
We know the complexity of a merge from our earlier analysis of the merge step as 6m (where m is the size of the halved lists - not the original value "n" of the starting list size), and we know we have 2 to the power j of these merges, which means we know at any level of the tree, the run time performance is:

(2 power j) * (6m)

Now the value of "m" will also be dependent on the value of "j" - because as we get further down the tree, we have a greater number of merges to perform, but the size of the lists are also getting smaller (so the actual value of "m" is decreasing), so what is the value of 6m when we are at level "j" of the tree?  We can work that out in a similar fashion - we know the original "n" (size of starting list) has been halved "j" times  - so it is simply

(n)/(2 power j)

So now we can combine those and resolve the algebra.. we can see that the 2 power j cancels each other out (intuitively really, as we are increasing the number of merges, but at the same time and at the same rate reducing the size of the lists to be merged:

(2 power j) * 6 * (n/(2 power j)) = 6n




What is the overall complexity of the tree?
So we now already know the complexity of any given level of the tree (independent of the depth) and we know the depth of the tree (number of levels):

6n * (log2(n)+1)  =  6nlog2(n) + 6n

So there we have it - we have calculated the worst case running time for the merge sort.


There are a few further things that we will need to consider in later posts: In one post I will have a quick look at asymptotic analysis (e.g. "Big-Oh" notation) and what this means; in another post I will look at a Groovy implementation of it and the analysis of that.








Groovy bug - Stackoverflow calling super methods

I recently stumbled upon a bug in the version of Groovy that I was working with (Groovy 2.1.5), it's a strange one, that seems to have been solved in Groovy 2.2 but I thought I would post it here in case it was useful to anyone else.

The problem is demonstrated in the code below, but the exact use case seems to be having an inheritance structure > 2 classes deep, whereby the method return type changes.



The result in running the above code in Groovy 2.1.5 is a StackOverflow - it seems as though Groovy just recursively calls the doStuff() method on Child class until it errors rather than calling the method on the parent class.  Running on Groovy 2.2.0+ appears to run correctly.

I posted the question on stackoverflow.com, and it was suggested that it was maybe related to this bug, which has a very re-assuring resolution comment:

looks like the issue was fixed somewhen between 2.2.2 and 2.3.0

Yo: One reason it doesn't suck, completely

There has been a lot of talk about Yo recently. Initially it seems like the talk was all driven by the fact that this seemingly pointless app had raised $1million in angel funding (a good way to generate publicity and hype I guess if you happen to have a milli lying around). A lot of people speculated this as further proof of the tech bubble they had been long predicted. Others just talked about how crappy it was (albeit indirectly)

At a glance, neither the app nor the funding round seem particularly interesting - The app sounds distinctly like a novelty app that has only generated interest (and therefore users) on account of the funding, and seems destined to be a blip in the tech-history books much like Chat Roulette etc (although I can't really see it hitting he heights of Chat Roulette - at least people still make an occasional joke about Chat Roulette - I think give it three months and no one will be talking about Yo). The funding is less interesting when you hear the full story: according to Forbes, the app was created by a chap called Or Abel, after his former boss, Moshe Hogeg apparently asked him to make the app so he could buzz his PA without having to call, when Or Abel then switched up the idea a little bit Moshe lead the funding round (so a guy invested in what was basically his own idea, probably wasn't the world's hardest pitch).

Context based messaging

One reason cited for the app not being that lame, is apparently that it is actually trying to fill in a gap - providing context-based-messaging e.g. when you ping someone a "yo" they know the context of the message so you don't need to know anything else.

I don't agree. At least with the examples cited so far, it still seems pretty lame. The main example that has been used is the World Cup - you could subscribe to WORLDCUP by yo-ing them and then you will get a "yo" back everytime a goal is scored. Which honestly doesn't sound that great. I have been following the WC pretty avidly, but getting intermittent messages saying someone scored doesn't sound useful, and the main problem is that I would then have to launch another app to actually get the context (e.g. which team scored).  There are lots of other solutions to this that provide simple notifications including the context.


Some other examples:

Wanna say "good morning"? just Yo. 
Wanna say "Baby I'm thinking about you"? -- Yo. 
"I've finished my meeting, come by my office" -- Yo. 
"Are you up?" -- Yo.

These all make sense, but what next? How do you continue the conversation? how do you confirm/agree that all important context of the message? All these simple "yo"s all drive users to other apps - which means more effort/taps so why bother starting the conversation in Yo at all? why not just use WhatsApp/SMS/Facebook/GChat/etc from the start?


Ok, so clearly Context-Free messaging is a bad idea..

However, in my opinion, there is a glimmer of hope for the app, but the question for me is really just does it have enough runway to execute it before it just becomes another novelty app in the history books.

A few years back, just before Twitter went all dick-ish with their API and started locking out the entire developer community and third party eco-system there was a few good articles discussing an alternative business model for Twitter, and rather than becoming a media company (that it has become, rich content including pics/videos & trying to drive all eyeballs onto twitter.com or official apps - not via third party clients) it should become a global messaging system - it had the infrastructure made to be a massive pub-sub/notification system (I think at the time there was a better article that I read, but can't find the link now, so that one will have to do - if anyone thinks they know the one I mean then please add to the comments!).


This is a space that I think Yo could step up and fill. And its possible that they are thinking the same - having already announced their API, they already suggest some simple notification systems that could utilise it - In which case, it could become a really interesting platform.


So who knows, there is potential for it to become something pretty neat. Odds are on it will just end up a passing gimmick though.




The Idiot Box: Disrupting TV

There has been a lot written in recent months about the changing role of content on the web and devices, particularly recently with Apple's acquisition of Beats (things just aint the same for gangsters) - and Benedict Evans recently wrote questioning whether content is actually still king. And I think we can agree he makes a good point, when it comes to music, it is no longer a USP, its just expected. Between YouTube, SoundCloud, Spotify, Google's Play services etc there is no reason people can't just stream any music they like, fairly seamlessly, and switch between providers/apps just as easily.  Apple had invested massively in iTunes, but the iTunes buy-download model isn't what people want any more (hence buying Beats, actually for their streaming service maybe).


A more interesting area is Television - content is still a key factor, NetFlix, Amazon Instant(formerly LoveFilm) etc are largely compared entirely based on their content - as a service there is little between them, other than content.  So really it's no suprise to see all the normal big players getting involved in the market Google (ChromeCast etc), Amazon Instant, Apple TV.

Within 5years I think we will see a massive shift in viewing patterns, with pretty much all new TVs sold today being web-enabled I think its an inevitability that people will move to all on-demand services rather than being dependent on scheduled programming. Within a further 5years I wouldn't be surprised if scheduled programming was all but dead and gone.

We recently bought a NowTV box - at just £10 its a pretty low barrier to web-enabling your TV - and we have pretty much switched over to entirely on-demand service - despite having a PVR we still go for the on demand options.


I think there will be some interesting things that come from this:

Platform Fragmentation

I think this is the biggest problem facing the market at the moment, and to me looks like a massive opportunity. At the moment there is so much fragmentation across the platform: Playstations, Xbox, Android, iOS, TVs (Sony, Samsung, etc) - all have their own platforms, and any on-demand service that wants to offer an app on every platform needs work from either the platform owners or the service providers - if its the platform owner then the provider looses control of UI/UX app features and consistency across platforms, and if its the service provider then they have a lot of work to do to support the different platforms, and they will inevitably have to make decisions whether or not to bother with each platform.

At the moment, in the UK, if you buy a web-enabled device then you can't guarantee that it will have the basic UK free-to-air OD services (BBC, ITV, Channel4, Channel5) - When I bought the NowTV it didn't have Channel4 apps and still doesn't have ITV (and that's a platform backed by BSkyB, which is a fairly large organisation and platform). There are already some Sony TVs that are no longer supported and provider apps are no longer being developed/maintained/supported. 

Android and iOS aside, all platforms will suffer this problem - that is until someone comes with a platform standard/OS that is open and can be re-used across devices. And given Android's prevalence, and Google's investment in TV it would seem like the best placed candidate to tackle that, but let's see!


What is driving creation production?

With scheduled TV there are quiet times, early hours of the morning, working week daytime - and content is created for that specifically. Providers have to each fill their schedules for these hours, so these are commissioned/bought/run.  However, if scheduling was to end, and it was all on-demand then would people still make this content? I'm sure there will still be a demand for some of this content, but people who watch it just because its on will obviously diminish. Students in the UK have had a tradition of watching daytime TV, whether it be Countdown, Diagnosis Murder or Quincy - but in the era of on-demand why would they search out this content? 


Ideas

So this is just a quick note on some things that I am personally finding really interesting right now

Education 

as mentioned, I think this is going to be cracked soon, and maybe Khan academy will do it. Either way, I think whoever does it would need to be a "full stack" startup. I have thought about trying some things - like a GitHub type system for open-sourcing education materials and resources, creating open-standards for curriculum and educational texts etc - but they have always just been tools or things around the periphery. I don't think I have the resources right now to be thinking full stack!


Android first

Android is the most pervasive mobile OS (yes, there are some caveats about the stats, such as numbers coming from China, and the value of the customer compared to other iOS), and Google continue to widen their reach (recently purchasing Nest etc) I think we are going to seeing our first truly Android first apps. If Instagram was built today, would it be iOS first? Probably, but I think that landscape is changing. I think coupled with the following areas this could be a big one.  If I'm building a mobile app it's going to be Android first. Silicon valley is under-invested in Android - There is also growing appetite for it:



Smartphone as your social graph

I have mentioned this one a few times here. Smartphone usage is continuing to increase (even whilst tablet sales falter) and more and more existing mobile users across the world upgrade. Further to this, your smartphone is really where you social graph is. Your address book has your contacts in it, your phone number gives you a unique identifier, and as WhatsApp proved, it gives great power to mobile first startups to disrupt the social network incumbents. It could be argued that Google+ was never going to succeed to usurp Facebook as king of the social networks because people are lazy, and essentially creatures of habbit - if all your friends are on Facebook, why try to convince your entire network to switch to g+? But the smartphone takes away that power, as your network is on your device, not a particular platform.  Couple that with the fact that so many use their phones to double up as cameras it now means most of our photos/videos are also on the device.

Television

I was recently talking with some colleagues about the future of TV and I speculated that in 5-10 years we may see the end of scheduled television, and everything will be only on demand. This would leave some interesting questions/problems:

  • One big problem as I see it is the fragmentation of device software - if you are creating an on demand app, you need to think about Android, iOS, browsers, Xbox, PS, not to mention all the TV manufacturers that have their own software running on their web enabled TV. This means at the moment, if you buy a new web enabled TV or set top box, the on demand apps may not be available, and may not be consistent. I think this could be a good market for Android and wouldn't be surprised if they do make some bold moves in this area (yes, bolder than Chrome Cast) - I would think it would make sense for them to be linking up with TV manufacturers to as the de-facto TV OS (would benefit from android app eco system - even if it would mean a LOT more work for app developers to fix up for another range of screen sizes
  • Another interesting implication of such a switch would be would we see a decline in produced content. At the moment there is a lot of content created specifically for quieter times of the viewing schedule (mon-fri afternoons, early morning, etc) - what might be considered as "filler", but we might see a decline in this as consumers will have complete control of what they want to watch, and there won't be any watch it because its on mentality.
I think its an interesting areas where battles are really going strong, with big players in the content production/distribution space (Netflix/Amazon/YouTube) as well as Google/Apple etc taking on the incumbents in hardware.  I just today saw a link for a site called Glass that is dedicated to ongoing conversation about this topic.