There is a rich green, vigorous ginseng bonsai tree in my office. Every time I look at it, my brain fires with other pictures, words and situations related to it. If you and I met, we could have a decent conversation about the shape of bonsais. You could say that I can handle bonsai situations. The perception coming from my retina is linked with the visual cortex, and with other areas in my brain, for instance where language is processed. In the end, fully wired, I am nothing but a composite neural net. Now, would I recognize all kinds of bonsai trees through generalization?
What if I had never been to this room before and you sent me the picture above? Would the leaves point me in the right direction, that this is the crown of a ginseng bonsai tree? I would guess some kind of indoor plant at first. But a bonsai? Mh, not sure, maybe. The task of classification can be quite tricky, because for natural results, we need a generalizing intelligence. To me, learning new things is something I just do. Even when I think about things, I just think about things. And when I have learned something, I gracefully apply it, learn new things, and so on. It's all somehow implicit to me, and this is the great miracle of nature. But miracles are hard to formalize and code. What can we do? Generalization means connecting dots between similar objects, and one option to achieve this is to let data flow through neural structures, which enforce compression and decompression.
Their neural networks also were the first artificial pattern recognizers to achieve human-competitive or even superhuman performance on important benchmarks such as traffic sign recognition [...], or the MNIST handwritten digits problem of Yann LeCun at NYU. (Source: https://en.wikipedia.org/wiki/Artificial_neural_network)
If you group small neural networks to learn simple representations of the input, a technique known as Convolution, you have a good tool for visual and auditory recognition in high resolutions, given our current hardware limitations. Further, independent neural nets can be chained together to form composite organisms, which can learn tasks on their own, e. g. using reinforcement learning algorithms on top. Before we can advance to such architectures, we need to inhale the mathematical structure of a feed-forward net, because it is the basis for modern architectures.
The net above has three layers. On the left, there is the input layer with two neurons. It is simply the raw, untouched input. In the middle we have the Dense layer with three neurons. Here is the area where dots are joined, to learn. The neurons carry activators, plain functions, which can be considered as the cell's ability to fire. A net can stack many layers, it can be deep. On the right we have the output layer, with one neuron. A net can have multiple input and output neurons, depending on the dimension of our data. The layers are densely connected through synapses, called weights, since the thickness of a synapse determines the amplification between neurons. How to compute the net? Studying biological neural networks, e. g. nets found in dead animals, shows that the connectivity patterns can be modelled through ordinary matrix multiplication. To compute a layer, we multiply the inputs by the weights and apply the cell's activator to fire. This result is the input of the next layer. If we do this recursively until we reach the output layer, our result is a number, usually between 0 and 1, depending on the activator's range. This number can be seen as the answer of the net with respect to the given input. For the sake of brevity, we express this forward pass in matrix notation. The matrices contain the left and right weights coming in and out of the dense layer. Isn't it somewhat calming to know that behind the complex sounding term Artificial Neural Network there is just a couple of matrices and nested function calls?
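To make the forward pass tangible, here is a minimal sketch in Scala using Breeze, the linear algebra library NeuroFlow builds on. The weight values are made up purely for illustration, and I already use the sigmoid (discussed further below) as the activator:
import breeze.linalg.{DenseMatrix, DenseVector}
import breeze.numerics.sigmoid

// hypothetical weights of a 2 -> 3 -> 1 net, values chosen arbitrarily
val w1 = DenseMatrix((0.5, -0.3), (0.8, 0.1), (-0.6, 0.9))   // 3 x 2: input -> dense layer
val w2 = DenseMatrix((0.7, -0.2, 0.4))                       // 1 x 3: dense layer -> output

def forward(x: DenseVector[Double]): DenseVector[Double] = {
  val hidden = sigmoid(w1 * x)   // multiply inputs by weights, let the cells fire
  sigmoid(w2 * hidden)           // same again for the output layer
}

forward(DenseVector(1.0, 0.0))   // a single number between 0 and 1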
In theory, a full batch is more precise, yet it may require adjusting the learning rate dynamically during training, to soften or amplify gradients varying in intensity over time. When training data is grouped into batches, each batch contributes one step towards the minimum of the loss per iteration. Since Big Data implies a certain degree of redundancy, in practice, using mini-batches often gives fast convergence and a more stable loss. Further, memory and performance considerations arise regarding a good batch size. One important factor is to pack as much training data as possible together with the weights into large batches, leading to densely packed matrices. This way the inherent parallelism of a feed-forward net can be fully harnessed by multicore processors, on both CPU and GPU. However, since this is a rather informal blog post, I don't want to go deeper into the mechanics here. If you are curious how the analytical gradients can be constructed algorithmically for batched gradient descent, have a look at NeuroFlow for Scala. Personally, I find functional code easier to understand than curly LaTeX math equations. We use this implementation later.
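As a rough illustration of the batching idea (a sketch, not NeuroFlow's actual internals), one pass over the data with mini-batch gradient descent could look like this; the weights are simplified to a flat vector and the gradient function is passed in from the outside:
import breeze.linalg.DenseVector

type Sample = (DenseVector[Double], DenseVector[Double])   // (input, target)

// one epoch: every batch contributes one descent step towards the minimum
def epoch(
  weights: DenseVector[Double],
  data: Seq[Sample],
  batchSize: Int,
  learningRate: Double,
  gradient: (DenseVector[Double], Seq[Sample]) => DenseVector[Double]
): DenseVector[Double] =
  data.grouped(batchSize).foldLeft(weights) { (w, batch) =>
    w - gradient(w, batch) * learningRate
  }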
The bold points (0, 1) and (1, 0) stand for a bonsai. I picked this example on purpose, because you can't separate it linearly. No matter how hard you try, you will never be able to separate (0, 1) and (1, 0) from (0, 0) and (1, 1) using only one line. It is not possible, thus the classification of a bonsai can't be solved through a linear function. Interestingly, our 'bonsai function' is nothing else but the logical XOR function, which formulates binary addition, so we can make our net learn to add modulo 2 as a side effect. To be a universal approximator, our net must be able to learn such non-linearities, because the patterns we humans produce are most likely non-linear. Looking at our net equation, we immediately conclude that if we insert linear activator functions for g and h, we simply get another linear function, since the matrix multiplications won't bring any non-linearity either. Using linear activators, our complex net is not able to recognize bonsais. We can't change the way matrix multiplication is defined, because it mimics the neural connectivity patterns found in nature. Consequently, in order for any neural net to recognize non-linear patterns, we need to find more suitable activators for the cells. There are a few such functions, like Tanh, Sigmoid, or the Rectified Linear Unit. We use the Sigmoid σ(x) = 1 / (1 + e^(-x)), which is a classic and fits our numeric range [0, 1]. Because the function depends on the negative natural exponential function as denominator, it is non-linear, and we find a smooth step characteristic, softly firing a neuron, or not.
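Before moving on, the claim that linear activators collapse the net can be made explicit with a one-line derivation. Writing the two weight matrices as W_1 and W_2 and the linear activators as g(x) = ax and h(x) = bx (my own shorthand, not the original notation), we get:

\[ \text{net}(x) \;=\; h\big(W_2\, g(W_1 x)\big) \;=\; b\, W_2\, (a\, W_1 x) \;=\; (a\, b\, W_2 W_1)\, x \]

which is just one matrix applied to x, so stacking linear layers can never bend a single separating line around our bonsai points.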
Now, if we use the sigmoid for g and h, the net takes the functional form of a nested sigmoid, and the complete loss function follows from it, as sketched below. Theoretically, using this non-linear functional form and gradient descent training, we should be able to separate our non-linear bonsai space.
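Written out with the squared error we plug into the net below, and summing over our four training pairs (x_i, y_i), the loss reads (a sketch, assuming the usual factor of one half):

\[ L(W_1, W_2) \;=\; \frac{1}{2} \sum_{i=1}^{4} \Big( y_i - \sigma\big(W_2\, \sigma(W_1 x_i)\big) \Big)^2 \]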
import neuroflow.application.plugin.Notation._
import neuroflow.core.Activator._
import neuroflow.core._
import neuroflow.dsl._
import neuroflow.nets.cpu.DenseNetwork._
// breed the initial weights randomly between -1 and 1
implicit val weights = WeightBreeder[Double].random(-1, 1)

// sigmoid activators for the hidden and the output layer
val (g, h) = (Sigmoid, Sigmoid)

val net = Network(
  layout = Vector(2) :: Dense(3, g) :: Dense(1, h) :: SquaredError(),
  settings = Settings(
    learningRate = { case (_, _) => 1.0 },
    iterations = 100000
  )
)
Now we have a net, initialized with random weights between -1 and 1, a learning rate and a maximum number of iterations for gradient descent. When no batch size is defined, the net assumes a full batch. Then, we define the training data using inline vector notation and start training:
val xs = Seq(->(0.0, 0.0), ->(0.0, 1.0), ->(1.0, 0.0), ->(1.0, 1.0))
val ys = Seq(->(0.0), ->(1.0), ->(1.0), ->(0.0))
net.train(xs, ys)
... ah, time for a fresh cup of sencha ...
_ __ ________
/ | / /__ __ ___________ / ____/ /___ _ __
/ |/ / _ \/ / / / ___/ __ \/ /_ / / __ \ | /| / /
/ /| / __/ /_/ / / / /_/ / __/ / / /_/ / |/ |/ /
/_/ |_/\___/\__,_/_/ \____/_/ /_/\____/|__/|__/
1.5.6
Network : neuroflow.nets.cpu.DenseNetwork
Weights : 9 (≈ 6,86646e-05 MB)
Precision : Double
Loss : neuroflow.core.SquaredError
Update : neuroflow.core.Vanilla
Layout : 2 Vector
         3 Dense (σ)
         1 Dense (σ)
INFO neuroflow.nets.cpu.DenseNetworkDouble - [08.02.2018 13:57:49:216] Training with 4 samples, batch size = 4, batches = 1.
INFO neuroflow.nets.cpu.DenseNetworkDouble - [08.02.2018 13:57:49:263] Breeding batches ...
INFO neuroflow.nets.cpu.DenseNetworkDouble - [08.02.2018 13:57:49:803] Iteration 1.1, Avg. Loss = 0,503172, Vector: 0.5031724735606108
INFO neuroflow.nets.cpu.DenseNetworkDouble - [08.02.2018 13:57:49:823] Iteration 2.1, Avg. Loss = 0,502110, Vector: 0.5021102644775862
INFO neuroflow.nets.cpu.DenseNetworkDouble - [08.02.2018 13:57:49:824] Iteration 3.1, Avg. Loss = 0,501510, Vector: 0.5015098477591278
INFO neuroflow.nets.cpu.DenseNetworkDouble - [08.02.2018 13:57:49:825] Iteration 4.1, Avg. Loss = 0,501152, Vector: 0.5011517002396553
INFO neuroflow.nets.cpu.DenseNetworkDouble - [08.02.2018 13:57:49:826] Iteration 5.1, Avg. Loss = 0,500920, Vector: 0.5009203492807744
...
INFO neuroflow.nets.cpu.DenseNetworkDouble - [08.02.2018 13:58:00:200] Iteration 99999.1, Avg. Loss = 8,66563e-05, Vector: 8.66563188141294E-5
INFO neuroflow.nets.cpu.DenseNetworkDouble - [08.02.2018 13:58:00:200] Iteration 100000.1, Avg. Loss = 8,66554e-05, Vector: 8.6655441891917E-5
INFO neuroflow.nets.cpu.DenseNetworkDouble - [08.02.2018 13:58:00:200] Took 100000 of 100000 iterations.
Network was:
---
5.972841198278272 7.031856941751971 -4.693156686110289
5.958082325044351 -4.709170682392037 6.984931986901868
---
19.752926278797165
-14.472887673690817
-14.474769706032983
Clearly our weights have changed and the loss is small, so it looks like everything worked out. To check if our net can recognize bonsais, we feed it with all inputs:
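The outputs below come from applying the trained net to each input vector, roughly like this (a sketch, assuming the net's apply method runs a forward pass; the exact print format may differ):
xs.foreach { x =>
  println(s"Input: $x Output: ${net(x)}")   // forward pass through the trained net
}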
Input: DenseVector(0.0, 0.0) Output: DenseVector(0.009977792013595178)
Input: DenseVector(0.0, 1.0) Output: DenseVector(0.9940081702899719)
Input: DenseVector(1.0, 0.0) Output: DenseVector(0.9940077326864107)
Input: DenseVector(1.0, 1.0) Output: DenseVector(0.0013940967243219436)
As we can see, feeding our net with the vectors (0, 1) and (1, 0) leads to a number close to 1, whereas feeding it with the vectors (0, 0) and (1, 1) leads to a number close to 0. We finally did it! Our net can recognize bonsais, and what's more, it can add binary numbers modulo two. Another interesting property of our trained net is its plot.
To separate the space, the net draws a mountain landscape, and if we look at the colored surface, we see that only a bonsai gets the peak. Thank you for reading. :-)