I’m trying to implement a neural network architecture in Haskell, and use it on MNIST. I’m using the `hmatrix` package for linear algebra, and my training framework is built using the `pipes` package.
My code compiles and doesn’t crash. But the problem is that certain combinations of layer size (say, 1000), minibatch size, and learning rate give rise to `NaN` values in the computations. After some inspection, I see that extremely small values (on the order of `1e-100`) eventually appear in the activations. But even when that doesn’t happen, the training still doesn’t work: there’s no improvement in its loss or accuracy. I’ve checked and rechecked my code, and I’m at a loss as to what the root of the problem could be.
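For what it’s worth, here is a minimal sketch of the kind of check I’ve been running to catch the bad values (the helper names and the `1e-50` threshold are just illustrative, not part of my actual code):

```haskell
import Numeric.LinearAlgebra (Matrix, toLists)

-- A value is "bad" if it is NaN, infinite, or absurdly small in magnitude.
badValue :: Double -> Bool
badValue x = isNaN x || isInfinite x || (x /= 0 && abs x < 1e-50)

-- Scan a whole activation matrix for bad values.
hasBadValues :: Matrix Double -> Bool
hasBadValues = any badValue . concat . toLists
```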
Here’s the backpropagation training, which computes the deltas for each layer:
```haskell
backward lf n (out, tar) das = do
  let δout = tr (derivate lf (tar, out)) -- dE/dy
      deltas = scanr (\(l, a') δ ->
                 let w = weights l
                 in (tr a') * (w <> δ)) δout (zip (tail $ toList n) das)
  return (deltas)
```
`lf` is the loss function, `n` is the network (`weight` matrix and `bias` vector for each layer), `out` and `tar` are the actual output of the network and the target (desired) output, and `das` are the activation derivatives of each layer.

In batch mode, `out` and `tar` are matrices (rows are output vectors), and `das` is a list of matrices.
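For reference, the recurrence I’m aiming for is the standard backpropagation one (at least as I understand it), with the transposes coming from my rows-are-samples layout; $\odot$ is the elementwise product and $\sigma'$ the activation derivative:

$$\delta^{(\mathrm{out})} = \frac{\partial E}{\partial y}, \qquad \delta^{(l)} = \sigma'\bigl(z^{(l)}\bigr) \odot \bigl(W^{(l+1)} \, \delta^{(l+1)}\bigr)$$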
Here’s the actual gradient computation:
```haskell
grad lf (n, (i,t)) = do
  -- Forward propagation: compute layer outputs and activation derivatives
  let (as, as') = unzip $ runLayers n i
      (out) = last as
  (ds) <- backward lf n (out, t) (init as') -- Compute deltas with backpropagation
  let r = fromIntegral $ rows i -- Size of minibatch
  let gs = zipWith (\δ a -> tr (δ <> a)) ds (i:init as) -- Gradients for weights
  return $ GradBatch ((recip r .*) <$> gs, (recip r .*) <$> squeeze <$> ds)
```
Here, lf
and n
are the same as above, i
is the input, and t
is the target output (both in batch form, as matrices).
squeeze
transforms a matrix into a vector by summing over each row. That is, ds
is a list of matrices of deltas, where each column corresponds to the deltas for a row of the minibatch. So, the gradients for the biases are the average of the deltas over all the minibatch. The same thing for gs
, which corresponds to the gradients for the weights.
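To make the intent concrete, here is a minimal sketch of what `squeeze` is supposed to do (the real definition is in the repo; this version assumes hmatrix’s `Numeric.LinearAlgebra`):

```haskell
import Numeric.LinearAlgebra (Matrix, Vector, cols, konst, (#>))

-- Sum each row of the matrix into a single entry, collapsing the
-- minibatch dimension (the columns) into one vector of summed deltas.
squeeze :: Matrix Double -> Vector Double
squeeze m = m #> konst 1 (cols m)
```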
Here’s the actual update code:
```haskell
move lr (n, (i,t)) (GradBatch (gs, ds)) = do
  -- Update function
  let update = (\(FC w b af) g δ -> FC (w + (lr).*g) (b + (lr).*δ) af)
      n' = Network.fromList $ zipWith3 update (Network.toList n) gs ds
  return (n', (i,t))
```
`lr` is the learning rate. `FC` is the layer constructor, and `af` is the activation function for that layer. The gradient descent algorithm makes sure to pass in a negative value for the learning rate. The actual code for the gradient descent is simply a loop around a composition of `grad` and `move`, with a parameterized stop condition.
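Roughly, that loop looks like this (a simplified sketch, not my actual code: `stop` stands in for the parameterized stop condition, and I keep the minibatch fixed here just for illustration):

```haskell
-- Simplified sketch of the descent loop: compute the gradient, apply the
-- update with a negated learning rate, and repeat until `stop` fires.
descend stop lr state = do
  g      <- grad mse state
  state' <- move (negate lr) state g
  if stop state'
    then return state'
    else descend stop lr state'
```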
Finally, here’s the code for a mean square error loss function:
```haskell
mse :: (Floating a) => LossFunction a a
mse = let f  (y, y') = let gamma = y' - y in gamma ** 2 / 2
          f' (y, y') = (y' - y)
      in Evaluator f f'
```
`Evaluator` just bundles a loss function and its derivative (for calculating the delta of the output layer).
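For completeness, `Evaluator` looks roughly like this (the actual definition is in the repo; `derivate` is the accessor that `backward` calls, while the other field name is illustrative):

```haskell
-- A loss function bundled with its derivative; `derivate` yields the
-- value used as the output-layer delta.
data Evaluator a b = Evaluator
  { evaluate :: (a, b) -> b
  , derivate :: (a, b) -> b
  }

type LossFunction a b = Evaluator a b
```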
The rest of the code is up on GitHub: NeuralNetwork.
So, if anyone has an insight into the problem, or even just a sanity check that I’m correctly implementing the algorithm, I’d be grateful.