However, recently I have had to think about how to shorten URLs myself, and really, it's pretty easy to implement yourself.
Creating a unique, repeatable identifier for a URL
I think a lot of people's first instinct might be to go for hashing the URL string - this isn't a good idea for a few reasons though:
- Length - most normal hashing algorithms (md5/sha-*) produce long strings, which kind of goes against the point of a url shortener
- Unique-ness - Obviously, if this is going to be a URL identifier then it needs to be unique, and hashes by their very nature are not unique - which means you would need to handle the scenario where a URL creates an already used hash and has an alternative
- Look-up - as hashes are not (easily) reversible, you would need to look up the URL using the hash as the db key - which may not be ideal given a very large set of URLs (imagine number of URLs bitly.com has)
Thankfully there is a viable, easy solution available.
Lets first think about our database structure for persisting our URLs - In the simplest case we could probably get by with two columns:
- id (DB generated sequence ID)
- url - text field to capture the URL value
Generating the identifier from a DB
- Now, if you provide a String URL value, your code just needs to insert it into the table, this will create the row and the unique ID.
- Next, fetch that unique numeric ID, and convert it to base-62 (this will convert the numeric value into the base-62 representation (rather than normal base10, it will allow 0-9, a-z, A-Z as characters. This gives you both an identifier in the form of "1jLPSIv" but also provides a massive ID space that can fit into relatively few characters (that will be part of the shortened URL) - 6 characters of base 62 provides a possible 6^62 different unique combinations (1.7594524073e+48 in total)
You may not want to directly leak the DB ids to the URL, in which case you can easily salt the IDs as you choose appropriate.
Looking at bit.ly shortening URLs it would appear that they follow the same pattern. I shortened two of my previous blog post URLs one after another, and the URLs generated as follows:
If we look at those two url identifiers, they look a lot like base-6, and if we convert the identifiers to base 62(and let's look at base-64 as well, just for funsies they may be using that)
URL - Base64>Base10 - Base62>Base10
1oYgrsG - 3685493160710 - 103119489480
1tR63Zb - 3690751293019 - 107587946123
As you can see, they look reasonable - there's a chance that there were a few URLs created between my two URLs, but I suspect there is a reasonable amount of salting going on so it is just a one-by-one increment. (although interestingly, if you get a bit.ly url and then just increase the last char by one - assuming base62 - then it will likely provide a new url - so you can have your own game of manual-webpage-roulette!)