An overview: How to implement Google's Structured Data Schemas

As a result of working on web products (both personal and professional), I have become quite au fait with SEO from a technical point of view, and with the purely technical things you can do to put yourself in good standing. Really, all of these are just web best practices for being a good web citizen - things we should be doing to make the web better: proper use of HTTP headers, mobile-friendly designs, good performance for people on slower connections, and so on.

One thing that goes a bit above and beyond that is Google's Structured Data. I will talk about what it is and what it does below, but if you are dynamically building webpages (that is, your website isn't just static HTML on a web server, but is either served from a server-side application or is an API-driven JS application), then you are most likely well placed to start implementing it easily and immediately.

1. What is Structured Data?

Google supports a set of schemas for Structured Data on websites (largely based on the schema.org vocabulary). These schemas allow better definition of key data points on a given website. It's a sensible move by Google and a natural progression for their search engine.

Think about it: there are millions of websites out there made up of sprawling HTML code and content. Whilst HTML is a standard (more or less!) across the web, there are probably millions of different ways people use it. It would be nice if everyone used the H1 etc. heading tags consistently, or if everyone used the <em> tag the same way (emphasis vs italics), but the reality is they don't - some sites use HTML as intended, but many, many more undoubtedly just rely on <span> or <div> tags combined with CSS classes to re-define every element they might need.

This is all fine - Google is smart enough to pull out the content for indexing. Yes, if you use span elements with custom styling for headings on your website rather than the H1+ tags then Google may penalise you, but it won't stop Google reading and indexing the site. What's more, it's getting smarter all the time - I'd probably back Google to be able to pull out relevant clips or question/answer pairs directly in a fairly reliable way. However, they are Google, and much like the Microsoft/IE of the 90s, they have the dominant market share, so they can define their own standards for the web if they want to. That's exactly what Structured Data is.

It's Google saying: 

Hey, if you provide some data in your code that looks like this, then we will read that, so we don't have to guess or work stuff out. Or we can just keep trying to work it out from your HTML content… it's your call.

As mentioned earlier, if you have control over the source code and the data on your website, then this is really powerful. You can define specific (keyword-heavy) snippets of content and explicitly tell Google about them. What's more, the Structured Data schema lets you define content such as FAQs or how-to directions - so you can go beyond keyword-heavy snippets and actually create FAQs for questions you have identified from Google search trends (or whatever SEO tools you might use).

Hopefully you get the picture by now - this is a pretty powerful tool.

2. Schema tags: FAQ and HowTo

Two specific parts of the Structured Data schema stood out to me as quite universally useful for websites, and as producing decent, tangible results: the FAQ schema and the HowTo schema.

  • FAQ schema allows high level Q&A points to be provided in the metadata for Google - generally useful as most sites will have some element of their pages that could be presented as FAQ

  • HowTo schema allows step-by-step how to guides - less widely applicable, but if you have any website that provides how-to guides or anything with instructions this is applicable.

What exactly do these tags do, and why do we care? Well, as well as trying to win favour with the all-seeing Google search bot, if the markup gets picked up it also means we get more search real estate and increased accessibility for our content, which should increase the chance of click-through conversion.

If you have ever seen search results with expandable FAQ entries or step-by-step instructions displayed directly beneath a result, those points are being picked up by Google from the site's schema tags. If you can automate the creation and inclusion of these for your site, then you are in a pretty good position to improve your SEO (a relatively low number of sites implement these schemas so far).

3. Automating your schema tags

As I have mentioned a couple of times, and as you have hopefully realised, if you have a dynamic website then you are likely already taking structured data (from your database, for example - so reliably structured!) and building HTML pages (either server-side, or sending the data as JSON to a JavaScript app for page creation client-side). Either way, we are starting off with structured data, and Google's Structured Data wants… you got it, structured data! So if you have the content, it really is a simple, generic transform between structured data formats.

Below is some example code. It is based on Jekyll, as that's what my most recent personal project has used, but it's also pretty close to pseudocode, so hopefully you can easily translate it to whatever tech you use:
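The snippet below is a sketch of the sort of thing I mean, assuming (hypothetically) that each page's FAQ entries live in its front matter under a faqs key - Jekyll's jsonify filter takes care of escaping the strings for JSON:

```liquid
{% if page.faqs %}
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {% for faq in page.faqs %}
    {
      "@type": "Question",
      "name": {{ faq.question | jsonify }},
      "acceptedAnswer": {
        "@type": "Answer",
        "text": {{ faq.answer | jsonify }}
      }
    }{% unless forloop.last %},{% endunless %}
    {% endfor %}
  ]
}
</script>
{% endif %}
```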

As you can see, it's a simple JSON-based data structure, and you just fill in the relevant parts of the object with your data.

You can see the full code in my Jekyll food website over on GitHub - likewise, you can see the end result in action on the website (hosted by GitHub Pages, of course). The project was a food-science site covering science and recipes, so a perfect match for FAQ (science pages) and HowTo (recipe pages). For example, if you go to a chilli recipe page, which naturally has structured data for step-by-step instructions, and view the page source, you will see the JSON schema at the top of the page, using the HowTo schema elements to lay out the resources required and then the steps to follow. Likewise, in the page source of the Science of humidity in cooking you will see the JSON schema with the FAQ:
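The rendered output in the page source looks along these lines (the question and answer here are illustrative, not the actual page content):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "Why does humidity affect baking?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "Flour absorbs moisture from the air, so on humid days dough can end up wetter than the recipe intends."
    }
  }]
}
</script>
```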

4. Conclusion

If you have control of the data being loaded onto the page (a custom site - something like WordPress or another off-the-shelf platform might make this harder, but at the same time there are undoubtedly lots of plugins for your given platform that already handle this, making it even easier!), and you are even vaguely interested in increasing organic traffic and search ranking, then I'd recommend implementing this. It's very easy to add, it's primarily behind-the-scenes tech (it doesn't impact the visuals or design of the pages it's added to), and it can only be beneficial.

As with anything aimed at improving SEO, it's always hard to measure and get concrete details on, but it's relatively low-cost to add, with minimal ongoing maintenance, so I think it's worth a shot.

If you have had experiences with Google's Structured data, good or bad, then I'd love to hear about it!

How to make your Jekyll website blazing fast!

A few years back I used Jekyll and GitHub Pages to create a one-page CV/resume template site; it turned out to be one of my most forked GitHub projects, and the blog post I wrote about it one of the more popular articles here.

So recently, when I was looking to create a site about food science, I thought I would have another look at Jekyll and GitHub hosting. One of the things I knew I wanted was a nice dynamic-feeling site, but one that was really quick and that I could really optimise for SEO, and I was confident that Jekyll would deliver on that front (it's a static site generator, so it builds complete HTML pages up front, with no overly heavy or slow client-side rendering needed).

I quickly forked a Jekyll project, added a couple of old posts to test the content, and gave it a whirl (serving it directly from the GitHub Pages URL). Out of the box I had a nice clean site, already mobile responsive - so time to see how it performed!

A quick test on Google PageSpeed Insights (which is powered by Lighthouse), and straight out of the gate it was scoring pretty well - but there were some sensible recommendations within the results that made sense to implement.

Serve images in next-gen formats

I was using reasonably sized JPG images for the most part, but there are further optimisations available in the form of the WebP format. Grabbing the cwebp CLI tool made it super quick to convert all my images over to WebP. I had to update the image HTML a little to support WebP whilst falling back to JPG where browser support dictated, like so:
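Something like the following - the file paths are placeholders for whatever your image naming scheme is:

```html
<picture>
  <!-- browsers that understand webp will pick this source -->
  <source srcset="/images/chilli.webp" type="image/webp">
  <!-- everything else falls back to the jpg -->
  <img src="/images/chilli.jpg" alt="Chilli con carne">
</picture>
```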

The above basically sets the source of the image as the WebP variation of the file, and falls back to the JPG if WebP is not supported.

Google font optimisation

The basic Jekyll template I was using relied on Google web fonts, which presented two different performance issues:

  1. Flash of invisible text (FOIT) - A while ago, Google introduced a change to the CSS loading snippet for their fonts: the addition of the query param “&display=swap” on the end of the CSS URL.
    This small change, broadly supported by modern browsers, basically fixes the FOIT and instead falls back to a flash of unstyled text (FOUT) - so on slower connections, whilst the font is loading there is at least content displayed, albeit not as pretty as you’d like (there is a cool tool here to compare fonts, which you can play with to try and find a native font that isn’t too much of a leap from your Google web font, so the swap isn’t too jarring).
    This improved my PageSpeed Insights time to First Contentful Paint, as content appears on the page sooner.

  2. CSS loading synchronously, so still blocking page load - thankfully people have solved this through a variety of tricks and techniques: moving the CSS load to async, and also utilising preconnect and preload to warm up the endpoints (a very detailed write-up can be seen here!)
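Putting the two fixes together looks something like this - the font family is just an example, and the media="print"/onload swap is one of the common async-loading tricks referenced above:

```html
<!-- warm up the connection to Google's font server -->
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<!-- load the stylesheet asynchronously: media="print" is non-blocking,
     then we swap to all media once the CSS has loaded -->
<link rel="stylesheet"
      href="https://fonts.googleapis.com/css?family=Lato&display=swap"
      media="print" onload="this.media='all'">
```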

Configuring Gradle publishing with AWS S3

Using S3 as a remote maven repository can provide a cost effective alternative to some of the hosted solutions available - admittedly you won’t get tooling like Artifactory sitting on top of the repository, but if you are used to using your own privately hosted remote maven repositories, then the S3 solution is a good option if you want to move it off premises.

Thankfully, more modern versions of Gradle come with good support for AWS built in - if you are using Gradle pre-version 3 or so, then some of these features may not be available, but given we are approaching Gradle 7, it’s probably a good time to upgrade your tooling.

Out of the box, publishing to an S3 bucket is incredibly easy, but of course you will want to secure your bucket so it’s not open for the world to push their artifacts to - and Gradle combined with the AWS SDK makes this pretty simple.

If you want Gradle to push to S3 using the normal profile credential chain (checking env vars before using default credentials etc.) then it’s really easy:
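A sketch of the publishing config - the bucket name and path are illustrative; AwsImAuthentication is Gradle's built-in way of saying "use the standard AWS credential provider chain":

```groovy
publishing {
    publications {
        maven(MavenPublication) {
            from components.java
        }
    }
    repositories {
        maven {
            url "s3://my-company-maven-repo/releases"
            // use the default AWS credential provider chain
            // (env vars, then the default profile, etc.)
            authentication {
                awsIm(AwsImAuthentication)
            }
        }
    }
}
```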
With the above approach, running “gradle publish” will attempt to publish the artifact to a Maven repo in the named S3 bucket, and will look locally on the host machine for AWS credentials - this requires that you have default credentials set.

Now this might not always be ideal - if, for example, you are running on a machine where the default credentials are set to some other account (not the dev tooling account hosting the Maven repo), you don’t want to have to change the default credentials just to push artifacts to Maven. As you will have seen in the AWS docs on the profile credential chain, you can override the default profile using the environment variable AWS_PROFILE - a possible solution, but not ideal for a few reasons:

  1. Users have to remember to set it if they want to avoid using default credentials

  2. It isn’t very granular - setting the env var at Gradle runtime (by exporting/setting the variable on the command line before running the Gradle command) sets the variable for the entire Gradle task(s) being run. There may be situations where you want finer control and need to use different AWS credentials for different jobs (I have seen this requirement in real-world code)

  3. Environment variables don’t have great support in Gradle - some tasks can set them easily, others not so much (publish, for example, doesn’t support it)

Thankfully, there is still a trivial mechanism to override this - although it took me a while to stumble upon the solution, as most Stack Overflow questions and Gradle issue discussions have examples using the environment variables.

The following will use the named credentials “buildtool” on the host machine to publish. Of course, this is a hardcoded string, but it could be passed in using other custom arguments, sidestepping the need for env vars if you want to override it (create a custom -D JVM argument, for example - which is much more granular, specific to your usage, and much better supported by Gradle).

Note that you also need to import the AWS SDK to use this one:
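A sketch along those lines - the bucket name and SDK version are illustrative, and the “buildtool” profile is assumed to exist in your ~/.aws/credentials:

```groovy
import com.amazonaws.auth.profile.ProfileCredentialsProvider

buildscript {
    repositories { mavenCentral() }
    dependencies {
        // the AWS SDK is needed for ProfileCredentialsProvider
        classpath "com.amazonaws:aws-java-sdk-core:1.11.700"
    }
}

publishing {
    repositories {
        maven {
            url "s3://my-company-maven-repo/releases"
            // resolve the named profile "buildtool" from the host machine
            def awsCreds = new ProfileCredentialsProvider("buildtool").credentials
            credentials(AwsCredentials) {
                accessKey awsCreds.AWSAccessKeyId
                secretKey awsCreds.AWSSecretKey
            }
        }
    }
}
```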

Re-thinking the Visitor pattern with Scala, Shapeless & polymorphic functions

In this article I will look at a relatively boilerplate free way to traverse tree structures in Scala, using polymorphic functions along with Shapeless' everything function.

Over the course of my career, a problem that I have had to face fairly repeatedly is dealing with nested tree like structures with arbitrary depth. From XML to directory structures to building data models, nested trees or documents are a common and pretty useful way to model data.

Early in my career (classic Java/J2EE/Spring days) I tackled them using the classic Visitor pattern from the Gang of Four, and have probably had more than my fair share of implementing that pattern. Then, whilst working in Groovy, I re-imagined the pattern a little to make it more idiomatic (dealing mostly with Maps and Lists), and now that I am working in Scala the problem has arisen once again.

There are lots of things that Scala handles well - I generally like its type system, and everyone always raves about the pattern matching (which is undeniably useful) - but it has always irked me a bit that when dealing with child classes I have to match on every implementation to do something. I always feel like it's something I should be able to do with type classes, and I inevitably end up a little sad every time I remember I can't. Let me explain with a quick example: let's imagine we are modelling a structure like XML (I will assume we all know XML, but the format essentially allows you to define nested tree structures of elements - an element can be a complex type, e.g. like a directory/object that holds further child elements, or a simple type, e.g. a string element that holds a string).
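A sketch of the sort of model I mean (the names are illustrative):

```scala
sealed trait Element

// a complex element holds a list of further child elements
case class ComplexElement(name: String, value: List[Element]) extends Element

// simple elements each hold a single primitive value
case class StringElement(name: String, value: String) extends Element
case class BooleanElement(name: String, value: Boolean) extends Element
case class DoubleElement(name: String, value: Double) extends Element
```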

Above is a basic setup to model a tree structure - we have our sealed trait for the generic element, a class for the complex element (an element that can have a further list of child elements), and then a couple of basic classes for the simple elements (String/Boolean/Double).

Now, when we have a ComplexElement and we want to process its children, a List[Element], ideally type classes would come to our rescue, like this:
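A sketch of the type class approach (the validation rules themselves are just placeholders - note how the ComplexElement case needs a ValidatorTypeClass[Element] in scope to handle its children):

```scala
trait ValidatorTypeClass[A] {
  def validate(value: A): Boolean
}

object ValidatorTypeClass {
  implicit val stringValidator: ValidatorTypeClass[StringElement] =
    (e: StringElement) => e.value.nonEmpty

  implicit val booleanValidator: ValidatorTypeClass[BooleanElement] =
    (_: BooleanElement) => true

  implicit val doubleValidator: ValidatorTypeClass[DoubleElement] =
    (e: DoubleElement) => !e.value.isNaN

  // recursively hand each child off to the type class for Element
  implicit def complexValidator(
      implicit ev: ValidatorTypeClass[Element]): ValidatorTypeClass[ComplexElement] =
    (e: ComplexElement) => e.value.forall(ev.validate)
}
```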

Above we have a simple ValidatorTypeClass, for which we define implementations for all the different types we care about. From there, it looks relatively simple to traverse a nested structure - the type class for ComplexElement simply iterates through the children and recursively passes each child to its type class to handle the logic. (Note: I will use validation as an example throughout this article, but that is just for the sake of a simple illustration - there are many better ways to perform simple attribute validation in Scala - it just provides an example context for the problem.)

However, if you run the above code, you will get an error like this:
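The exact wording varies by Scala version, but it is along these lines:

```
error: could not find implicit value for parameter ev: ValidatorTypeClass[Element]
```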

The reason is that it's looking for an implicit type class to handle the parent type Element (ComplexElement's value attribute is defined as List[Element]), which we haven't defined. Sure, we could define that type class, ValidatorTypeClass[Element], and simply pattern match the input across all the implemented types, but at that point there's no point having type classes - you just end up with a big old pattern-matching block. Which is fine, but it feels kind of verbose, especially as the blocks inevitably end up repeated throughout the code, since you have to handle the tree structure in several different places/ways.

So I wanted to find a better way, and having written about Shapeless a couple of times before, once again…

Enter Shapeless

The good news is, Shapeless has some tools that can help improve this - the bad news is, there isn't really any documentation on some of the features (beyond reading the source code and unit tests), and some of it just doesn't seem to be mentioned anywhere at all! I had previously used a function that Shapeless provides called everywhere - even this function isn't really explicitly called out in the docs, but I stumbled upon it in an article about what was new in Shapeless 2.0, where it was used in an example piece of code without any mention or explanation. everywhere allows in-place editing of tree-like structures (or any structures, really) and comes from the ideas laid out in the Scrap Your Boilerplate (SYB) paper, on which large parts of Shapeless are based.

As well as everywhere, Shapeless also provides a function called everything, which is also from the SYB paper; instead of editing, it lets you simply traverse, or visit, generic data structures. It's pretty simple conceptually, but finding any mention of it in docs or footnotes was hard (I found it by reading the source code), so let's go through it.

everything takes three arguments:
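The call shape looks like this, where validates and combine are the polymorphic functions described next, and complex is the root of our nested model:

```scala
import shapeless._

val allValid: Boolean = everything(validates)(combine)(complex)
```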

The first argument is a polymorphic function that we want to process every step of the data structure, combine is a polymorphic function to combine the results, and complex (the third argument above) is our input - in this case the root of our nested data model.

So let's start with our polymorphic function for validating every step - this will be every attribute on each class, including lists, maps and other classes, which will then get traversed as well (you can find out more about polymorphic functions and how they are implemented with Shapeless here):
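A sketch of the two polymorphic functions - the split across a parent trait and a child object is what controls the implicit priority, as discussed below:

```scala
import shapeless._

// lower-priority default case: any type without a ValidatorTypeClass
// (Lists, plain attributes, etc.) is simply passed over as valid
trait LowPriorityValidates extends Poly1 {
  implicit def default[T]: Case.Aux[T, Boolean] = at[T](_ => true)
}

// higher-priority case: any type with a ValidatorTypeClass in scope
// gets validated by it
object validates extends LowPriorityValidates {
  implicit def whenValidatable[T](
      implicit v: ValidatorTypeClass[T]): Case.Aux[T, Boolean] =
    at[T](v.validate)
}
```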

So what is happening here? And why do we have two polymorphic functions? Well, let's start with the second one, validates, which is going to be handling the validation. Remember the lovely and simple type class we defined earlier? We are going to use it here: in this polymorphic function we simply define an implicit case that will match on any attribute it finds that has an implicit ValidatorTypeClass in scope, and run the validation (in our simple example, returning a boolean result for whether it passes or fails).

Now, there are also going to be other types in our structure that we essentially want to ignore - they might be simple attributes (Strings, etc.) or they might be Lists that we want to continue to traverse but, as types in themselves, can just pass over. For these we need a polymorphic function case that is essentially a no-op and returns true. As the cases in the polymorphic function are implicits, we need to have the default case in the parent class so it is resolved at a lower priority than our validating implicit.

So, everything is going to handle the generic traversal of our data structure, whatever that might look like, and this polymorphic function is going to return a boolean for each element to indicate whether it is ok - now, as mentioned, we need to combine all these results from our structure.

To do that, we just define another polymorphic function, with arity 2, to define how the results are combined - which in the case of booleans is really very simple:
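A minimal sketch of the combinator:

```scala
import shapeless._

object combine extends Poly2 {
  // fold the per-element results together: everything must be true
  implicit val caseBoolean: Case.Aux[Boolean, Boolean, Boolean] =
    at[Boolean, Boolean](_ && _)
}
```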

This combinator will simply combine the booleans, and as soon as one element fails, the overall answer will be false.

And that's it! Shapeless' everything handles the boilerplate, and with the addition of those minimal polymorphic functions we don't need to worry about traversing anything or pattern matching on parent types - so it ends up really quite nice. A handful of extra lines of code, and our type class approach works after all!

Footnote 1: Further removing boilerplate

If you found yourself writing code like this a lot, you could simplify it further by changing our implicit ValidatorTypeClass to a broader VisitorTypeClass and providing a common set of combinators for the combine polymorphic function; then all you would need to do each time is provide the specific implementation of VisitorTypeClass, and it would just work as if by magic.

Footnote 2: A better validation

As mentioned, the validation example was purely illustrative, as it's a simple domain to understand, and there are other, better ways to perform simple validation (at time of construction, other libraries, etc.). But if we were to have this perform validation, rather than return booleans we could use something like Validated from Cats - this would allow us to accumulate meaningful failures throughout the traversal. It is really simple to drop in: all we need to do is implement the combine polymorphic function for the ValidatedNel class:
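A sketch, assuming our failures accumulate in a ValidatedNel with a String error type on the left (both type parameters are illustrative):

```scala
import cats.Semigroup
import cats.data.ValidatedNel
import cats.syntax.semigroup._
import shapeless._

object combine extends Poly2 {
  // ValidatedNel has a Semigroup instance (given a Semigroup for the
  // right-hand side), so combining results - and accumulating
  // failures - is just |+|
  implicit def caseValidated[A: Semigroup]
      : Case.Aux[ValidatedNel[String, A], ValidatedNel[String, A], ValidatedNel[String, A]] =
    at[ValidatedNel[String, A], ValidatedNel[String, A]](_ |+| _)
}
```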

Thankfully, Cats' ValidatedNel is a Semigroup implementation, so it already provides the combine method itself - all we need to do is call it! (Note: you will need to provide a Semigroup implementation for whatever right-hand side you choose to use for Validated, but that's trivial for most types.)

Your API as a Product: Thinking like a Product Manager [VIDEO]

The video for the talk I gave at the 2018 API Conference is now available online.

I have talked about it a bit before, as well as sharing the slides, but one of my main take-aways is that we are all (mostly) in the business of building products. On a daily basis - whether we are writing code, docs, tests, change requests, specifications or designs - there is almost always an end product of our work, and the product decisions we make whilst building it have a direct impact on the end users (people will have to read/amend your code, read your specifications, translate your designs, consume your APIs, etc.). With that in mind, it seems sensible to look at what lessons we can take from the discipline of Product Management to help us make smart decisions in our day-to-day work.

Kubernetes & Prometheus: Getting started

I have recently started working on a migration process to move our company deployments over to Kubernetes (from Fleet, if you were interested - at the time of deployment a pretty cutting-edge technology, but it is pretty low-level, and you have to provide things like load balancing and DNS yourself).

A colleague of mine had already done the hard work of actually spinning up a Kubernetes cluster on AWS (using EKS), and generally most of the boilerplate around service deployment. Having had a general intro and deployed my first service (a single microservice deployed as a Kubernetes “service” running inside a “pod”), which mostly just involved copying and pasting from my colleague's examples, my next goal was to deploy our monitoring setup. We currently use Prometheus & Grafana, and those still seem to be best-in-class monitoring systems, especially with Kubernetes.

The setup process is actually pretty simple to get up and running (at least if you are using Helm), but it did catch me out a couple of times, so here are some notes. You will need:

  1. A cluster running Kubernetes (as mentioned, we are using an AWS cluster on EKS)
  2. Kubectl & Helm running locally and connecting correctly to your Kubernetes cluster (“kubectl version” should display the client and server version details ok)

Let’s get started by getting our cluster ready to use Helm (Helm is a Kubernetes package manager that can be used to install pre-packaged "charts"), to do this we need to install the server side element of Helm, called Tiller, onto our Kubernetes cluster:

$ kubectl -n kube-system create serviceaccount tiller
$ kubectl apply -f tiller-role-binding.yml
$ helm init --service-account tiller

The above does three things:
  1. It creates a Service Account in your cluster called “tiller” in the standard kubernetes namespace “kube-system” - this will be used as the account for all the tiller operations
  2. Next we apply the role binding for the cluster - here we define a new ClusterRoleBinding for the new Service Account
  3. Finally we initialise Helm/Tiller, referencing our new Service Account. This step effectively installs Tiller on our kubernetes cluster
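The tiller-role-binding.yml referenced above would look something like this - binding the service account to the cluster-admin role, which is fine for experimenting but you may want something more restrictive:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: tiller
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
  - kind: ServiceAccount
    name: tiller
    namespace: kube-system
```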

Straightforward enough. For Prometheus we will be using the Helm-packaged Prometheus Operator. A Kubernetes Operator is an approach that allows packaging of an application so it can be deployed on Kubernetes and also managed by the Kubernetes API - you can read more about operators here, and there are operators already created for a wide range of applications.

As I found myself repeatedly updating the config for the install, I preferred to use the Helm “upgrade” method rather than install (upgrade works even if it has never been installed):

helm upgrade -f prometheus-config.yml \
      prometheus-operator stable/prometheus-operator \
      --namespace monitoring --install

The above command upgrades/installs the stable/prometheus-operator package (provided by CoreOS) into the “monitoring” namespace and names the install release as “prometheus-operator”.

At the start, the config was simply:
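A sketch of that initial prometheus-config.yml - the key names are from the stable/prometheus-operator chart of the time, so check the values.yaml of your chart version:

```yaml
kubelet:
  serviceMonitor:
    # scrape the kubelet over HTTPS on port 10250, as the read-only
    # port 10255 is disabled on newer Kubernetes versions
    https: true
prometheus:
  prometheusSpec:
    # an empty selector lets Prometheus pick up ServiceMonitors from
    # all namespaces, including our dev-apps application namespace
    serviceMonitorNamespaceSelector: {}
```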

This config (which could have been passed as CLI arguments using “--set”, but was moved to a dedicated file simply because later on we will add a bunch more config) addresses two challenges we faced:

Adding Namespaces:

We use a dedicated namespace for our applications (in the case above, “dev-apps”), and we wanted to monitor the applications themselves as well as core Kubernetes health, so we had to add that namespace so Prometheus could monitor it as well.

Monitoring the right port:

The next one was more of a head-scratcher, and used up a lot more time to figure out. With the stable/prometheus-operator Helm install, we noticed on the Prometheus targets page that monitoring/prometheus-operator-kubelet/0 and monitoring/prometheus-operator-kubelet/1 were both showing as 0/3 up.

Our targets were showing 
monitoring/prometheus-operator-kubelet/0 (0/3 up)
monitoring/prometheus-operator-kubelet/1 (0/3 up)

They all looked correct, much like the other targets that were reported as being up, and were hitting the expected endpoints (such as /metrics/cadvisor), but all showed the error “connect: connection refused”.

Initial googling revealed this is a fairly common symptom; however, rather misleadingly, all the issues documented were around particular flags that needed to be set, and issues with auth (the errors listed were 401/403 rather than our “connect: connection refused”) - this is also covered in the Prometheus Operator troubleshooting section.

After much digging, what actually seemed to have caught us out was some conflicting default behaviour.

The Prometheus Operator defines three different ports to monitor on:
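A sketch of the named ports on the kubelet Service the operator creates (port numbers are the usual kubelet defaults):

```yaml
ports:
  - name: https-metrics   # the kubelet's secure port
    port: 10250
  - name: http-metrics    # the kubelet's read-only port
    port: 10255
  - name: cadvisor        # container metrics via cAdvisor
    port: 4194
```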

Of the three ports defined by the Prometheus Operator source code, the Helm chart is currently set to default to “http-metrics”, i.e. port 10255.

However, more recently that read-only port, 10255, has been disabled and is no longer open to monitor against. This meant we had conflicting default behaviour across the software, so we had to explicitly override the default behaviour on the Prometheus Operator by setting the kubelet.serviceMonitor.https flag to true.
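Paraphrased from the chart's kubelet ServiceMonitor template, the defaulting flips the scraped port based on that flag:

```yaml
{{- if .Values.kubelet.serviceMonitor.https }}
port: https-metrics
scheme: https
{{- else }}
port: http-metrics
scheme: http
{{- end }}
```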

As you can see in the above defaulting, it switches between the http-metrics and https-metrics ports based on the serviceMonitor.https flag. Explicitly including that in our config overrode the default value, it switched to monitoring on port 10250, and all was fine.

I expect the default behaviour will be changed soon, so this gotcha will hopefully be short lived, but in case it helps anyone else I will leave it here.

Next up, I will attempt to explain some of the magic behind the Prometheus configuration and how it can be set up to easily monitor all (or any) of your Kubernetes services.