Machine (re)learning: Neural Networks from scratch
Having recently changed roles, I am now in the enviable position of starting to do some work with machine learning (ML) tools. My undergraduate degree was actually in Artificial Intelligence, but that was over a decade ago, which is a long time in computer science in general, let alone the field of Machine Learning and AI which has progressed massively in the last few years.So needless to say, coming back to it there has been a lot to learn. In the future I will write a bit more about some of the available tools and libraries that exist these days (both expanding on the traditional AI Stanford libraries I have mentioned previously with my tweet sentiment analysis, plus newer frameworks that cover "deep learning").
Anyway, inspired by this post, I thought it would be a fun Sunday night refresher to write my own neural network. The first, and last, time that I wrote a neural network was for my final year dissertation (and that code is long gone), so was writing from first principals. The rest of this post will be a very straight forward introduction to the ideas and the code for a basic single layer neural network with a simple sigmoid activation function. I am training and testing on a very simple labeled data set and with the current configuration it scores 90% on the un-seen classification tests. I am by no means any kind of expert in this field, and there are plenty of resources and papers written by people who are inventing this stuff as we speak, but this will be more the musings of someone re-learning this stuff.
As always, all the code is on GitHub (and, as per my change in roles, this time it is all written in Scala).
A brief overview
The basic idea is that a Neural Network(NN) attempts to mimic the parallel architecture of the human brain. The human brain is a massive network of billions of simple neural cells that are interconnected, and given a stimulus they each either "fire" or don't, and its these firing neural cells and synapses that enable us to learn (excuse my crude explanation, I'm clearly not a biologist..).So that is what we are trying to build: A connected network of neurons that given some stimuli either "fire" or don't.
A Neuron
Ok, so this seems like a sensible place to start, right? We all know what a network is, so what are these nodes in our network that we are connecting? These are the decision points, and in themselves are incredibly simple. They are basically a function that takes n input values and multiplies them by a pre-defined weight (per input), adds a bias and then runs it through an activation function (think of this as our fire/don't-fire function):In terms of our code, this is pretty simple (don't worry about the sigmoid derivative values we are setting, we are just doing this to save time later):
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
class Neuron(var weights: List[Double], bias: Double) { | |
self: ActivationFunction => | |
var biasedOutput = 0.0 | |
var activationOutput = 0.0 | |
var sigmoidDerivative = 0.0 | |
var sigmoidDerivativeByWeight: List[Double] = Nil | |
var inputs: List[Double] = Nil | |
def calculateOutput(in: List[Double]): Double = { | |
inputs = in | |
val weightedInputs = (weights zip inputs) map ( a => a._1 * a._2 ) | |
val summedWeightedInput = weightedInputs.foldLeft(0.0)(_ + _) | |
biasedOutput = summedWeightedInput + bias | |
activationOutput = activation(biasedOutput) | |
sigmoidDerivative = activationOutput * (1.0 - activationOutput) | |
sigmoidDerivativeByWeight = weights map ( _ * sigmoidDerivative ) | |
activationOutput | |
} | |
} |
You may have also noticed the first line of the class, where we require an ActivationFunction to be applied - as I mentioned, this is our final processing of the output. In this example I am just using the Sigmoid function:
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
trait ActivationFunction { | |
def activation(x: Double): Double | |
} | |
trait SigmoidActivation extends ActivationFunction { | |
override def activation(x: Double) = 1 / (1 + Math.exp(-x)) | |
} | |
trait NoOpActivation extends ActivationFunction { | |
override def activation(x: Double) = x | |
} |
The network
Ok, so a sinple "shallow" NN has an input layer, a single hidden layer and an output layer (deep NNs will have multiple hidden layers). The network is fully connected between layers - that is to say, each node is connected to every node in the next layer up, and thats it - no neurons are connected to other layers and no connections are missed.The exact setup of the network is largely problem dependent, but there are some observations:
- - The input layer has to correlate to your inputs: If you have a data set that has two input values, let's say you have a dataset that contains house price data, and you have the input values number of rooms and square foot then you would have to have two input neurons.
- - Similarly, if you were using the NN for a classification problem and you know you have a fixed number of classifications, the output neurons would correlate to that. For example, consider you were using the NN to recognise hand written digits (0-9) then you would likely have 10 output neurons to group those possible outputs.
The number of hidden neurons, or the number of layers is a lot more problem dependent and is something that needs to be tuned per problem.
If you have ever worked with graph type structures in code before, then setting up a simple network of these neurons is also relatively straight forward given the uniform structure of the network.
The weights
Oh right, yes. So all these neurons are connected, but its a weighted graph, in CS terms. That is, the connection between each node is assigned a weight - as a simple example, if we take the house price dataset as an example, after looking at the data we might determine that the number of rooms is a more significant factor in the end result, so that connection should have a greater weighting than the other input. Now that is not exactly what happens, but in simple terms it's a good way of thinking about it.Given that the end goal is for the NN to learn from the data, it really doesn't matter too much what we initialise the weights for all the connections to, so at start-up we can just assign random values. Just think of our NN as a newborn baby - it has no idea how important different inputs are, or how important the different neural cells that fire in response to the stimuli are - everything is just firing off all over the place as they slowly start to learn!
So what next?
Ok, so we have our super-simple neurons, that just mimic a single brain cell and we have them connected in a structured but randomly weighted fashion, so how does the network learn? Well, this is the interesting bit..When it comes to training the network we need a pretty large dataset to allow it to be able to learn enough to start generalising - but at the same time, we don't want to train it on all the data, as we want to hold some back to test it at the end, just to see just how smart it really is.
In AI terms, we are going to be performing "supervised" learning - this just means that we will train it with a dataset where we know the correct answer, so we can make adjustments based on how well (or badly) the network is doing - this is different to "unsupervised" learning, where we have lots of data, but we don't have the right answer for each data point.
Training: Feed forward
The first step is the "feed-forward" step - this is where we grab the first record from our training data set and feed the inputs into our randomly initialised NN, that input is fed through the all of the neurons in the NN until we get to the output layer and we have the networks attempt at the answer. As the weighting is all random, this is going to be way off (imagine a toddlers guess the first time they ever see a smart phone).
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
1 to iterations foreach { i => | |
//randomly select an input/output row to train on | |
val trainingIndex = new Random().nextInt(trainingInputs.size) | |
val trainingInput = inputs(trainingIndex) | |
val trainingOutput = outputs(trainingIndex) | |
//run the feed forward pass | |
feedForward(trainingInput) | |
// run the back prop calcs | |
backPropogation(trainingOutput) | |
} | |
def feedForward(inputs: List[Double]): List[Double] ={ | |
assert(inputs.size == inputNeurons.size) | |
val hiddenLayerOutputs = hiddenNeurons map ( _.calculateOutput(inputs) ) | |
outputNeurons map ( _.calculateOutput(hiddenLayerOutputs) ) | |
} |
Training: Back propogation
This is where the magic, and the maths, comes in. The TL;DR overview for this is basically, we know how wrong the network was in its guess, so we attempt to update each of the connection weights based on that error, and to do so we use differentiation to work out the direction to go to minimise the error.This took a while of me staring at equations and racking my brain trying to remember my A-level maths stuff to get this, and even so, I wouldn't like to have to go back and attempt to explain this to my old maths teacher, but I'm happy I have enough of a grasp of what is going on, having now coded it up.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
def backPropogation(targetOutputs: List[Double]): Unit = { | |
val outputNeuronsWithTargets = outputNeurons zip targetOutputs | |
val squaredErrorsDerivativePerOutput = outputNeuronsWithTargets map { n => | |
//Output of each neuron - the target output | |
n._1.activationOutput - n._2 | |
} | |
//Derivative errors for hidden layer | |
val sqErrorDerivativeWithOutputNeurons = outputNeurons zip squaredErrorsDerivativePerOutput | |
val derivativeErrorsPerHidden = 0 to hiddenNeurons.size-1 map { hiddenNeuronIndex => | |
val derivativeErrs = sqErrorDerivativeWithOutputNeurons map { | |
case (neuron, sqErrDerivative) => | |
neuron.sigmoidDerivativeByWeight(hiddenNeuronIndex) * sqErrDerivative | |
} | |
derivativeErrs.foldLeft(0.0)(_ + _) | |
} | |
//iterate over all hidden layer neurons and calculate the weight error, | |
//then we multiply by learning rate and apply new weight | |
val derivativeErrorsWithHidden = derivativeErrorsPerHidden zip hiddenNeurons | |
derivativeErrorsWithHidden foreach { case (derivativeError, neuron) => | |
//iterate over all weights pointing to this hidden neuron to calc error | |
neuron.weights = neuron.inputs.zipWithIndex.map{ case (input, index) => | |
val derivedWeightError = derivativeError * neuron.sigmoidDerivative * input | |
neuron.weights(index) - (derivedWeightError * learningRate) | |
} | |
} | |
//now do similar for the output layer, but used squared errors | |
val squaredErrorsWithOutputs = squaredErrorsDerivativePerOutput zip outputNeurons | |
squaredErrorsWithOutputs foreach { case (squaredError, neuron) => | |
//iterate over all weights pointing to this output neuron to calc error | |
neuron.weights = neuron.inputs.zipWithIndex.map{ case (input, index) => | |
val derivedWeightError = squaredError * neuron.sigmoidDerivative * input | |
neuron.weights(index) - (derivedWeightError * learningRate) | |
} | |
} | |
} |
- - Cacluate the squared error of the outputs. That is a fairly simple equation of 1/2 * (target - output)^2
- - From there, we will work out the derivative of this function. A lot of the errors we will start to work with now uses differentiation and the derivatives of the result - and this takes a little bit of calculus, but the basic reason is relatively straight forward if you think about what differentiation is for.
- - We work back through the network, calculating the errors for every neuron combining the error, derivatives and the weighting (to determine how much a particular connection played in an error, it also needs to be considered when correcting the weighting.
Once we have adjusted the weight for each connection based on the weight, we start again and run the next training data record. Given the variation in the dataset, the next training record might be quite different to the previous record trained on, so the adjustment might be quite different, so after lots (maybe millions) of iterations over the data, incrementally tweaking and adjusting the weights for the different cases, hopefully it starts to converge on something like a reasonable performance.
Maths fun: Why the derivative of the errors?
So, why is calculating the derivative relevant in adjusting the errors?If you imagine the network as a function of all the different weights, its pretty complicated, but if we were to reduce this, for the sake of easier visualisation, to just be a 3-d space of possible points (e.g. we have just 3 weights to adjust, and those weights are plotted on a 3-d graph) - now imagine our function plots a graph something like this:
(taken from the wikipedia page on partial derivatives)
The derivative of the function allows us to work out the direction of the slope in the graph from a given point, so with our three current weights (co-ordinates) we can use the derivative to work out the direction in which we need to adjust these weights.
Conclusion
The whole NN in scala is currently on Github, and there isn't much more code than I have included here. I have found it pretty helpful to code it up from scratch and actually have to think about it - and has been fun coding it up and seeing it train up to getting 90% accuracy on the unseen data set I was using (just a simple two-in two-out dataset, so not that impressive).Next up I will see how the network performs against the MNIST dataset (like the Hello-World benchmark for machine learning, and is the classification of handwritten digits).
Oh, and if you are interested as to why the images look like a dodgy photocopy, they are photos of original diagrams that I included in my final year dissertation in university, just for old time sake!
1 comments: