Extremely small or NaN values appear when training a neural network

I’m trying to implement a neural network architecture in Haskell and use it on MNIST.

I’m using the hmatrix package for linear algebra.
My training framework is built using the pipes package.

My code compiles and doesn’t crash. The problem is that certain combinations of layer size (say, 1000), minibatch size, and learning rate give rise to NaN values in the computations. After some inspection, I see that extremely small values (on the order of 1e-100) eventually appear in the activations. But even when that doesn’t happen, the training still doesn’t work: there’s no improvement in loss or accuracy.

I checked and rechecked my code, and I’m at a loss as to what the root of the problem could be.

Here’s the backpropagation training, which computes the deltas for each layer:

backward lf n (out,tar) das = do
    let δout = tr (derivate lf (tar, out)) -- dE/dy
        deltas = scanr (\(l, a') δ ->
                         let w = weights l
                         in (tr a') * (w <> δ)) δout (zip (tail $ toList n) das)
    return deltas

lf is the loss function, n is the network (a weight matrix and bias vector for each layer), out and tar are the actual output of the network and the target (desired) output, and das is the list of activation derivatives for each layer.

In batch mode, out and tar are matrices (rows are output vectors), and das is a list of matrices.
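For reference, the textbook backpropagation recurrence I’m trying to implement is (in LaTeX notation, with \odot the element-wise product):

    \delta^{(L)} = \nabla_y E \odot f'(z^{(L)})
    \delta^{(l)} = \left( W^{(l+1)\top} \delta^{(l+1)} \right) \odot f'(z^{(l)})

where f' is the activation derivative and z^{(l)} is the pre-activation input of layer l.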

Here’s the actual gradient computation:

grad lf (n, (i, t)) = do
    -- Forward propagation: compute layer outputs and activation derivatives
    let (as, as') = unzip $ runLayers n i
        out = last as
    ds <- backward lf n (out, t) (init as') -- Compute deltas with backpropagation
    let r  = fromIntegral $ rows i -- Size of minibatch
        gs = zipWith (\δ a -> tr (δ <> a)) ds (i:init as) -- Gradients for weights
    return $ GradBatch ((recip r .*) <$> gs, (recip r .*) . squeeze <$> ds)

Here, lf and n are the same as above, i is the input, and t is the target output (both in batch form, as matrices).

squeeze transforms a matrix into a vector by summing each row. ds is a list of delta matrices, where each column corresponds to the deltas for one row of the minibatch. So the gradients for the biases are the deltas averaged over the whole minibatch, and likewise for gs, the gradients for the weights.
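For clarity, here’s a minimal sketch of what squeeze does, written against hmatrix’s standard API (the real definition is in the repo, so treat this as an approximation):

import Numeric.LinearAlgebra

-- Sum each row of a matrix by multiplying with a ones vector: an
-- (n × batch) matrix of deltas becomes a vector of n summed deltas.
squeeze :: Matrix Double -> Vector Double
squeeze m = m #> konst 1 (cols m)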

Here’s the actual update code:

move lr (n, (i,t)) (GradBatch (gs, ds)) = do
    -- Apply one gradient step to the weights and biases of every layer
    let update = \(FC w b af) g δ -> FC (w + lr .* g) (b + lr .* δ) af
        n' = Network.fromList $ zipWith3 update (Network.toList n) gs ds
    return (n', (i,t))

lr is the learning rate. FC is the layer constructor, and af is the activation function for that layer.

The gradient descent algorithm makes sure to pass in a negative value for the learning rate. The actual code for the gradient descent is simply a loop around a composition of grad and move, with a parameterized stop condition.
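As an illustration, here’s a simplified, hypothetical version of that loop, stripped of the pipes plumbing (done stands in for the parameterized stop condition; these names aren’t the actual ones from the repo):

-- Hypothetical sketch of the descent loop: for each minibatch, compute
-- the gradient and apply it, negating the learning rate so that move
-- steps downhill.
trainLoop _ _ _ n [] = return n
trainLoop done lf lr n (b:bs)
  | done n    = return n
  | otherwise = do
      g       <- grad lf (n, b)
      (n', _) <- move (negate lr) (n, b) g
      trainLoop done lf lr n' bs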

Finally, here’s the code for a mean square error loss function:

mse :: (Floating a) => LossFunction a a
mse = let f  (y, y') = let gamma = y' - y in gamma**2 / 2
          f' (y, y') = y' - y
      in  Evaluator f f'

Evaluator just bundles a loss function and its derivative (for calculating the delta of the output layer).
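Roughly, its shape is this (simplified; the evaluate field name is approximate, while derivate is the accessor used in backward):

-- A loss function bundled with its derivative, both taking a
-- (target, output) pair. The actual definition is in the repo.
data Evaluator a b = Evaluator { evaluate :: (b, b) -> a
                               , derivate :: (b, b) -> a }

type LossFunction a b = Evaluator a b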

The rest of the code is up on GitHub: NeuralNetwork.

So, if anyone has an insight into the problem, or even just a sanity check that I’m correctly implementing the algorithm, I’d be grateful.
