Direct Approach Neural Networks
Recently I wanted to learn about artificial neural networks (ANN).
Learning is about connecting trees. Each of us has a unique background and set of experiences, which form a personal knowledge tree. A new piece of knowledge or a new concept is like a new node (a leaf) that has not been attached to your knowledge tree (yet). Learning is reaching out and trying to connect this new node onto your existing tree.
Learning, or its conjugate, teaching, is relatively simple when we are dealing with kids, whose individual knowledge trees are short and simple; it is often straightforward to figure out the best way to connect new knowledge to their trees. Take addition: we check whether they know counting, then connect the new knowledge to their counting branch. Calculus? Oh, you are not ready, but let's try constructing some simple algebra branches first.
As our trees grow, learning and teaching become complicated. It is difficult to get a grasp of a grown tree -- not just from the teacher's point of view; it is hard even to grasp one's own tree. And if one cannot grasp the existing tree, then how best to connect a new node to it becomes obscured.
This is what I found out about neural networks. Most teaching material spends a large amount of effort trying to establish the branches that may connect to this knowledge. However, at least in my case, this knowledge was actually right beside my existing tree, and I could connect to it directly.
Unfortunately, I didn't know that. For many years, artificial neural networks remained magic to me, which posed a barrier to starting to learn, or even to seeing where to start. When I finally decided to learn about them recently, it forced me to spend weeks plowing through many unfamiliar branches, until I suddenly realized that all those branches were just there to confuse me (from the viewpoint of my existing tree; do not generalize) and that all I needed was to understand where this new knowledge node sits relative to my own tree -- then I could trivially connect to it. I drew the following picture to illustrate my problem:
Typical material constructs the blue branches (an abstract math branch, a programming-code branch, or the biology branch -- as the name "neural network" literally suggests, though it turns out we know very little about how our brains work, so that connection goes nowhere), while all I needed was the green branch. I didn't know that, and I guess no one would have known. So I think it is worthwhile to write down some notes on neural networks -- from the perspective of my own knowledge tree. In case some readers share a similar background with me, they may find the notes useful, or they may even give me pointers on how to improve my own tree.
[ The tree diagram illustrates another problem: my knowledge tree needs pruning and some re-connections, and some parts are ready to be turned into a net. Pruning and reconnecting is how wisdom grows -- a task grown trees need much more than acquiring new nodes. ]
Define the problem
Defining the problem is of paramount importance. With a good definition, one can start to look at the problem (the new knowledge) from the perspective of one's own tree and make all kinds of judgements on how to approach it. It is similar to how seeing the mountain helps one walk toward it even when the paths are not yet clear. Most material I have read introduces neural networks as a way to develop artificial intelligence, such as teaching a computer to play Atari games. I bet that with this type of description, you would have no idea how to connect it to your existing tree, unless you are an artificial-intelligence major.
In reality, "artificial intelligence" is simply a fancy umbrella name for fields that are likely adjacent to our individual knowledge trees. Let's ignore the words and define a simpler version of the problem:
Problem: Find a function $f$ with input $(x_1, x_2, \dots, x_n)$ and output $(y_1, y_2, \dots, y_m)$ that best fits a set of available samples (a training set).
What I described is a general data-fitting problem, and it readily connects to my personal knowledge tree: I have a basic college math background and I am experienced in linear regression; that is all I needed to connect to this new knowledge node.
Chances are low that anyone reading this has the exact same background as me, so for convenience, I will fill in some background around this definition.
The XOR Net
We all know what XOR is:

    x1 | x2 | output
    ---+----+-------
     0 |  0 |   0
     0 |  1 |   1
     1 |  0 |   1
     1 |  1 |   0

In terms of the problem, we have two inputs, one output, and a training set of 4 samples.
The reason we love the XOR net, in addition to its simplicity, is that it represents a class of problems that have two inputs and one output, so we can visualize it as a 2D map, where the yellow dots are our training samples.
In general, any function that maps 2 inputs to one output can be plotted as a 2D image, with the x and y coordinates as the two inputs and color as the output.
Our XOR samples specify only a discrete function, with just the four points marked as the four yellow dots. In fact, any finite training set describes only discrete points. The idea of neural networks is to build a continuous function that maps points beside the sample points to sensible results. In the case of the XOR net, ideally we want our function to give sensible answers even outside the given training set, e.g. is it on or off at (0.2, 0.4)?
With our basic college math background, we'll do some abstraction -- generalization, so we can apply our understanding to a much broader set of problems. If there are 3 inputs and one output, we should imagine a mapping of points in 3-dimensional space. (We have to imagine, because there is no easy way to visualize it now.) If we have 3 inputs and 2 outputs, we should imagine each point in 3-dimensional space mapping to two values simultaneously.
We can now try to extrapolate to problems with an arbitrary number of inputs and an arbitrary number of outputs, at which point we might as well give up on imagination altogether and go back to the mathematical description:
To find an approximate function

$$\bar{y} = f(\bar{x})$$

that fits a set of samples

$$\{(\bar{x}_k, \bar{y}_k),\ k = 1 \dots N\},$$

where the vector notation $\bar{x}$ is equivalent to writing $(x_1, x_2, \dots, x_n)$.
Method of Fitting: Linear Regression
Now we are in familiar territory -- at least I am. To make my notes comprehensible to a wider audience, I'll describe some of my understanding of this topic.
An arbitrary function is inapproachable, so we focus on a subset of the problem, where the function is described by some parameters:

$$\bar{y} = f(\bar{x}; \bar{w})$$
To find the parameters that best fit the function to the sample data, we seek to minimize the difference between the function output and the sample output, typically measured by a cost function such as the squared difference:

$$E(\bar{w}) = \sum_k \left| f(\bar{x}_k; \bar{w}) - \bar{y}_k \right|^2$$
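In code, the squared-difference cost is just a loop over the samples. Here is a minimal Python sketch (Python rather than the Perl/MyDef used later, purely for illustration; `f`, `params`, and `samples` are made-up names):

```python
def cost(f, params, samples):
    """Sum of squared differences between model output and sample output."""
    total = 0.0
    for x, y in samples:
        total += (f(x, params) - y) ** 2
    return total

# Example: fitting y = w*x to samples drawn from y = 2x;
# the cost vanishes at the best parameter w = 2.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
linear = lambda x, p: p[0] * x
print(cost(linear, [2.0], samples))  # 0.0
print(cost(linear, [1.0], samples))  # 14.0
```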
If we plot the cost function $E$ against the parameters, an obvious thing to do is to adjust the parameters in the direction of the downhill slope and gradually approach the minimum point, where the function approximates the sample points with the least difference. Given a model function, we can calculate the direction of the slope and mathematically follow it -- which is what is known as gradient descent.
One should understand that I only drew a simplistic cost-function curve. When we are modeling a nonlinear function, the actual cost curve (a curve in one dimension, a surface in two, a hypersurface in n) can be anything. It might even resemble a mountain range, where we can see the local slope for gradient descent but that slope may bear little relation to the direction of the desired global minimum. When the problem is small and well behaved, there is the classic Levenberg-Marquardt method. But the problems for which we resort to artificial neural networks are so general (or ambiguous) that gradient descent turns out to be more practical -- just go down as far as we can see, then look again from there. What if we get trapped in a local minimum? We can build up some momentum on the way down and hope to roll over the local minimum with it -- all common-sense talking :)
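To make the downhill idea concrete, here is a Python sketch of gradient descent with momentum on a toy one-dimensional cost, (w - 3)^2, whose minimum is at w = 3. The names `rho` (step size) and `alpha` (momentum) anticipate the code later in this article:

```python
def grad(w):
    return 2.0 * (w - 3.0)  # slope of the cost (w - 3)^2

w, delta = 0.0, 0.0         # start far from the minimum at w = 3
rho, alpha = 0.1, 0.9       # step size and momentum factor
for _ in range(200):
    delta = alpha * delta - rho * grad(w)  # accumulate momentum
    w += delta                             # step downhill
print(round(w, 2))  # converges near 3.0, the minimum
```

With a momentum this heavy the parameter overshoots and oscillates around the minimum before settling, which is exactly the behavior the article alludes to.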
The Neural Network Model
Ideally, we would like a model based on physics (then the result would mean something), but the method of neural networks is the method we seek when we are lost. Like a newborn baby, a simple general model function assumes no specific knowledge. Naturally, we start with the most simplistic form we can think of:

$$y = \tanh\Big(\sum_j w_j x_j + \theta\Big)$$
The tanh function is an S-shaped curve that saturates at -1 and +1, which has the nice property of turning most values into a yes/no answer. In neural-network jargon, we have just described a single-layer neural network.
However, the function above is essentially linear: tanh only rescales the output, so the decision boundary -- where the output crosses zero -- is a point in one dimension, a line in two dimensions, a plane in three, and a hyperplane beyond. A single line can never separate our XOR samples (try dividing the XOR map with one straight line). So next, we produce a more complicated version:

$$y = \tanh\Big(\sum_i \alpha_i \tanh\Big(\sum_j w_{ij} x_j + \theta_i\Big) + \theta\Big)$$

To relate the network graph with the equation: $x_j$ are the inputs, $w_{ij}$ are the first-layer weights, $\theta_i$ are the weights on the bias, and $\alpha_i$ are the second-layer weights. The equation describes a single output; add another subscript $k$ to turn it into multiple outputs.
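Here is a Python sketch of that two-layer function. The weights below are hand-picked (not trained) so that the network computes XOR on the {-1, 1} encoding -- the first hidden node acts roughly like OR, the second like AND, and the output layer combines them:

```python
import math

def forward(x, w, theta, alpha, theta_out):
    # hidden node i: tanh(sum_j w[i][j]*x[j] + theta[i])
    hidden = [math.tanh(sum(wi[j] * x[j] for j in range(len(x))) + ti)
              for wi, ti in zip(w, theta)]
    # output: tanh(sum_i alpha[i]*hidden[i] + theta_out)
    return math.tanh(sum(a * h for a, h in zip(alpha, hidden)) + theta_out)

# Hand-picked weights: hidden node 0 ~ OR, hidden node 1 ~ AND,
# and the output computes OR AND (NOT AND), which is XOR.
w, theta = [[5.0, 5.0], [5.0, 5.0]], [5.0, -5.0]
alpha, theta_out = [5.0, -5.0], -5.0
for x in ([-1, -1], [-1, 1], [1, -1], [1, 1]):
    print(x, round(forward(x, w, theta, alpha, theta_out)))
# [-1,-1] -> -1, [-1,1] -> 1, [1,-1] -> 1, [1,1] -> -1
```

The point of the sketch is that one extra layer is enough to bend the linear boundary around the XOR samples; training is "merely" finding such weights automatically.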
Learn Directly With Code
At this point, do you feel illuminated or confused? I myself find it complicated. I have been committing the same error as all the other teachers -- drawing branches without understanding my readers' existing knowledge trees. Let me restart: if you know code, let me try to describe neural networks directly in code.
[ For my own convenience, I'll show code in Perl, and I'll use a meta layer called MyDef. If you are not familiar with either Perl or MyDef, try reading the code as pseudo-code, and pay more attention to the comments. ]
First, the problem
page: test
    #-- 2 inputs, 3 hidden nodes, 1 output
    $call init_ann, 2, 3, 1
    #-- XOR training set, mapped to [-1, 1]
    my @training_set=(
        -1, -1, -1,
        -1,  1,  1,
         1, -1,  1,
         1,  1, -1,
    );
    #-- learning rates for the output layer and hidden layer
    $global $rho_o=0.1, $rho_h=0.05
    #-- momentum
    $global $alpha=0.9
    #---- for each sample from the training set
    $for $i=0:@training_set:3
        #-- forward calculates the node values
        ann_forward($training_set[$i], $training_set[$i+1])
        #-- backward calculates errors and adjusts weights
        ann_backward($training_set[$i+2])
The learning rates and the momentum term are what make the neural network mysterious; we'll see how they are applied shortly.
Data structure: init_ann
#-- note: this is a meta-layer macro; the code will be placed into the main code directly
subcode: init_ann(@plist)
    $global @layer_counts
    $global @layer_nodes
    $global @layer_weights
    $global @layer_weight_deltas
    my $i=0
    $(for:p in $(plist))
        $layer_nodes[$i]=[]
        $(if:i=0)
            #-- input layer ----
            $layer_counts[$i]=$(p)+1
            #-- bias node, fixed at 1.0 ----
            $layer_nodes[$i]->[$(p)]=1.0
        $(else)
            #-- hidden and output layers ----
            $layer_counts[$i]=$(p)
            #-- initialize weights ----
            $layer_weights[$i-1]=[]
            $layer_weight_deltas[$i-1]=[]
            $call set_layer
            $for $j=0:$n_input*$n_output
                $weight->[$j]=rand(1)-0.5
                $delta->[$j]=0
        $i++
    $global $n_layers
    $n_layers=$i
set_layer sets up the context one layer at a time (so additional hidden layers come at no extra cost in code):
subcode: set_layer
    my $n_input=$layer_counts[$i-1]
    my $n_output=$layer_counts[$i]
    my $input =$layer_nodes[$i-1]
    my $output=$layer_nodes[$i]
    my $weight=$layer_weights[$i-1]
    my $delta=$layer_weight_deltas[$i-1]
ann_forward: calculate network output
fncode: ann_forward
    #-- copy input vector to first layer nodes ----
    $loop($layer_counts[0]-1) $layer_nodes[0]->[i]=$_[i]
    $for $i=1:$n_layers
        $call set_layer
        my @out
        #-- out[j] = sum ( w[i,j]*input[i] )
        $sum($n_input, $n_output) $out[j] = $weight->[i,j] * $input->[i]
        #-- sigmoid, so we get a yes/no output
        $loop($n_output) $output->[i] = sigmoid($out[i])
$sum and $loop are meta-layer notation for loops. You know what they mean, right? Or you may read this (hopefully, later).
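If the macros are opaque, here is roughly what the `$sum($n_input, $n_output)` line of ann_forward expands to, assuming it generates a plain nested loop (my reading of the code, not MyDef documentation):

```python
def weighted_sum(weight, inp, n_input, n_output):
    """out[j] = sum_i weight[i][j] * inp[i] -- the $sum line of ann_forward."""
    out = [0.0] * n_output
    for j in range(n_output):
        for i in range(n_input):
            out[j] += weight[i][j] * inp[i]
    return out

print(weighted_sum([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], 2, 2))  # [4.0, 6.0]
```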
ann_backward: training the network
One of the problems with new knowledge is that it comes wrapped in words such as "neural" and "training", which are difficult to relate to one's existing knowledge base. Here, "training" simply means adjusting the parameters to minimize the fitting error.
fncode: ann_backward
    #-- start at the last layer, the output layer
    my $i=$n_layers-1
    $call set_layer
    my @error
    #-- calculate the error
    $loop($n_output) $error[i] = $_[i] - $output->[i]
    #-- adjust the weights layer by layer
    my $error=\@error
    my $rho=$rho_o
    #-- meta-layer notation
    $while $i>0; $i--
        my @error_prop
        $call set_layer
        #-- back propagation ----
        $if $i>1
            $sum($n_input, $n_output) $error_prop[i] = $error->[j] * sigmoid_prime($output->[j]) * $weight->[i,j]
        #-- calculate adjustment ----
        $loop($n_input, $n_output) $delta->[i,j] += $rho * $error->[j] * sigmoid_prime($output->[j]) * $input->[i]
        $for $j=0:$n_input*$n_output
            #-- commit adjustment
            $weight->[$j] += $delta->[$j]
            #-- apply momentum
            $delta->[$j] *= $alpha
        #-- prepare for the next layer
        $error=\@error_prop
        $rho=$rho_h
Some math equations are in order to clarify the code. Each layer of the neural network is linear before the sigmoid, so the weight adjustments are:

$$\Delta w_{ij} = \rho \, e_j \, \sigma'(o_j) \, x_i$$

and the error propagated back to node $i$ of the previous layer is:

$$e_i^{\mathrm{prev}} = \sum_j e_j \, \sigma'(o_j) \, w_{ij}$$

The adjustment is multiplied by $\rho$, which sets the step size and effectively becomes the descent rate. Each round of adjustment is accumulated onto the last round's adjustment (multiplied by $\alpha$), which provides a sense of momentum. The momentum term accelerates the descent when the direction is consistent, and stabilizes the direction when the adjustments get into oscillation.
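In Python, the commit-then-decay update of one weight layer would look like this (a sketch of my reading of the loop above; `sp` holds the sigmoid_prime values of the layer outputs):

```python
def adjust_layer(weight, delta, rho, alpha, error, sp, inp):
    """Gradient step with momentum: accumulate, commit, then decay by alpha."""
    for i in range(len(inp)):
        for j in range(len(error)):
            delta[i][j] += rho * error[j] * sp[j] * inp[i]  # calculate adjustment
            weight[i][j] += delta[i][j]                     # commit adjustment
            delta[i][j] *= alpha                            # keep a fraction as momentum

weight, delta = [[0.0]], [[0.0]]
adjust_layer(weight, delta, rho=0.1, alpha=0.9, error=[1.0], sp=[1.0], inp=[2.0])
print(weight, delta)  # weight gains 0.2; delta keeps 0.18 for the next round
```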
Oh, before I forget, here is the sigmoid function:
fncode: sigmoid($x)
    #-- tanh(x) ----
    my $e_x=exp($x)
    my $e_nx=1.0/$e_x
    return ($e_x-$e_nx)/($e_x+$e_nx)

fncode: sigmoid_prime($s)
    #-- figured out with pen and paper
    return (1-$s**2)
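The pen-and-paper step is the standard tanh identity: since the forward pass already stores $s = \tanh(x)$ in each node, the derivative can be computed from $s$ alone:

$$\frac{d}{dx}\tanh(x) = 1 - \tanh^2(x) = 1 - s^2$$

This is why sigmoid_prime takes the node output $s$ rather than the original input $x$.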
And that is it, the entire artificial neural network implementation!
See Some Results
Now that we have the code, we can get some hands-on experience with the network. To do that, we need some kind of visual result, so we'll add a plot function that draws a 2D map of our network function.
subcode: _autoload
    #-- my old utility function
    use savebmp

fncode: plot($savename)
    my @data
    #-- let's draw a 100 x 100 map
    my $n=100
    $for $i=0:$n
        $for $j=0:$n
            #-- normalize to [-2,2] x [-2,2], so we cover the sample points
            my $x=$i/$n*4-2
            my $y=$j/$n*4-2
            #-- calculate with the network
            ann_forward($x, $y)
            #-- get the output
            push @data, $layer_nodes[-1]->[0]
    #-- plot it!
    savebmp::savebmp(\@data, $n, $n, -1, 1, "$savename.bmp")
We need to adjust the main function to call the plot:
$for $j=0:50
    $print Iteration $j
    $if $j % 20 == 0
        plot($j)
    $for $i=0:@training_set:3
        ann_forward($training_set[$i], $training_set[$i+1])
        ann_backward($training_set[$i+2])
We added some iterations (you don't expect to train a network with a dozen parameters on only 4 samples in a single pass, do you?), plotting the function map every 20 iterations to check its progress.
Following are the results (I repeated 4 times):
Bear in mind that the XOR training points sit at the center of each quadrant. In all four runs, the neural network was successfully trained to recognize the XOR samples. It is interesting to note that not only were the convergence speeds different each time, depending on the random initial weights, but the resulting networks' outputs on inputs other than the training set were also different. So if we feed the network the input (0, 1), a point outside our training set, the output is highly unpredictable -- each time certain, yet unpredictable, just like our brain.
We can greatly expand the number of hidden nodes and add more hidden layers (is that what "deep" means?), and achieve magic-like results. However, it is worthwhile to revisit the simple XOR net and reflect on what its output means -- if it means anything. Just like our brain: when we see an unfamiliar face in a bad neighborhood, we make seemingly certain judgements, but do those judgements mean anything other than reflecting who we are?
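For readers who would rather see everything in one familiar language, here is my own compact Python re-implementation of the same network (a sketch, not the MyDef-generated code): 2 inputs, 3 hidden nodes, 1 output, tanh activation, and the same learning-rate/momentum scheme.

```python
import math, random

random.seed(1)
N_IN, N_HID, N_OUT = 2, 3, 1
# w1[i][j]: input node i -> hidden node j; the last input index is the bias (1.0)
w1 = [[random.uniform(-0.5, 0.5) for _ in range(N_HID)] for _ in range(N_IN + 1)]
w2 = [[random.uniform(-0.5, 0.5) for _ in range(N_OUT)] for _ in range(N_HID + 1)]
d1 = [[0.0] * N_HID for _ in range(N_IN + 1)]
d2 = [[0.0] * N_OUT for _ in range(N_HID + 1)]
rho_o, rho_h, alpha = 0.1, 0.05, 0.9   # learning rates and momentum

def forward(x1, x2):
    inp = [x1, x2, 1.0]                # bias node fixed at 1.0
    hid = [math.tanh(sum(inp[i] * w1[i][j] for i in range(N_IN + 1)))
           for j in range(N_HID)] + [1.0]
    out = [math.tanh(sum(hid[i] * w2[i][j] for i in range(N_HID + 1)))
           for j in range(N_OUT)]
    return inp, hid, out

def backward(inp, hid, out, target):
    err = target - out[0]
    # error propagated back through the output weights
    err_h = [err * (1 - out[0] ** 2) * w2[i][0] for i in range(N_HID)]
    for i in range(N_HID + 1):         # output layer: adjust with momentum
        d2[i][0] += rho_o * err * (1 - out[0] ** 2) * hid[i]
        w2[i][0] += d2[i][0]
        d2[i][0] *= alpha
    for i in range(N_IN + 1):          # hidden layer: same scheme
        for j in range(N_HID):
            d1[i][j] += rho_h * err_h[j] * (1 - hid[j] ** 2) * inp[i]
            w1[i][j] += d1[i][j]
            d1[i][j] *= alpha

training = [(-1, -1, -1), (-1, 1, 1), (1, -1, 1), (1, 1, -1)]

def total_error():
    return sum((t - forward(a, b)[2][0]) ** 2 for a, b, t in training)

err_before = total_error()
for _ in range(500):
    for a, b, t in training:
        backward(*forward(a, b), t)
err_after = total_error()
print(err_before, "->", err_after)
```

The convergence behavior varies with the random initial weights, matching the observation above that each of the four runs converged at a different speed.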
