I have found that rectified linear units trained with SGD perform very well when supplemented with proper regularization such as dropout and weight decay. I was trying to see whether second-order optimization techniques improve training on rectifier networks. However, I found that the GGN matrix values (Gv products) become vanishingly small and eventually evaluate to NaNs, which stops training. I tried very large values of the damping coefficient to see if that alleviates the problem, but I've had no luck so far.

I was wondering if anyone has some insight into why Hessian-free (HF) optimization behaves this way for rectifier networks and what can be done to avoid it. The optimization works fine for sigmoid nets.
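For context, each HF iteration solves the damped Gauss-Newton system from Martens (2010) with conjugate gradients, so NaN Gv products poison the inner solve regardless of how large the damping is:

    (G + \lambda I) d = -\nabla E(\theta)

where G is the Gauss-Newton approximation to the Hessian, \lambda the damping coefficient, and d the update direction.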

asked Sep 23 '13 at 11:44


Siddharth Sigtia


2 Answers:

I haven't tried this myself, but here's an idea:

HF optimization relies on the Hessian being smooth, but with rectified linear units even the first derivative of the error function is discontinuous.

You can approximate the rectified activation function using a smooth function such as

epsilon * log(1 + exp(x / epsilon))

The smaller the epsilon, the closer it is to the rectified activation function.

You could try to choose a reasonable epsilon, and/or lower it as the training progresses.
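A minimal NumPy sketch of this smoothed rectifier and its derivative, written in a numerically stable form (the default epsilon and the function names are illustrative choices, not from the answer):

    import numpy as np

    def smooth_relu(x, eps=0.1):
        # eps * log(1 + exp(x / eps)), computed stably so exp() never overflows.
        z = x / eps
        return eps * (np.maximum(z, 0.0) + np.log1p(np.exp(-np.abs(z))))

    def smooth_relu_grad(x, eps=0.1):
        # Derivative w.r.t. x is a sigmoid of x / eps: smooth everywhere.
        z = x / eps
        e = np.exp(-np.abs(z))  # always in (0, 1], no overflow
        return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

Annealing could then be as simple as multiplying eps by a decay factor after each epoch, as suggested above.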

answered Oct 12 '13 at 17:24


Max

edited Oct 12 '13 at 21:22

I think Max is on the right track. In Schraudolph's paper (referenced in Martens 2010), whether the direction suggested by the Gauss-Newton approximation G to the Hessian is a good one depends on H_{L∘M} (see section 3):

"As long as G is positive semi-definite ... the extended Gauss-Newton algorithm will not take steps in uphill directions."

I haven't worked out the form of H_{L∘M} for ReLU units and a sum-of-squared-errors loss, but it might be that this matrix isn't PSD.
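For reference, the extended Gauss-Newton matrix in Schraudolph's section 3 factors as

    G = J^T H_{L∘M} J

where J is the Jacobian of the network outputs with respect to the parameters, so G is guaranteed to be PSD whenever H_{L∘M} is.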

You might be able to use the Fisher information matrix, which relies solely on gradient information, to approximate the Hessian. But this would likely defeat the point of using a second-order method: you'd be ignoring the curvature information provided by the second-derivative terms (see section 3 of Schraudolph again).
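To make "gradient information only" concrete, here is a rough sketch of accumulating an empirical Fisher matrix from per-example gradients (the grad_fn interface and variable names are assumptions for illustration):

    import numpy as np

    def empirical_fisher(grad_fn, params, data):
        # Average of outer products of per-example loss gradients.
        # grad_fn(params, x, y) should return the gradient of the per-example loss.
        d = params.size
        F = np.zeros((d, d))
        for x, y in data:
            g = grad_fn(params, x, y).reshape(-1)
            F += np.outer(g, g)
        return F / len(data)

An HF-style update would then solve (F + damping * I) p = -gradient with conjugate gradients, which only ever needs Fisher-vector products rather than the full matrix.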

If you're married to ReLU activation functions, try a first-order method like NAG while paying careful attention to initialization, momentum, and learning-rate annealing. This paper is a good start.
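A minimal sketch of the Nesterov accelerated gradient update being suggested (the learning rate, momentum value, and grad_fn interface are illustrative assumptions):

    import numpy as np

    def nag_step(params, velocity, grad_fn, lr=0.01, momentum=0.9):
        # Evaluate the gradient at the look-ahead point params + momentum * velocity,
        # then update the velocity and take the step (Sutskever-style NAG).
        g = grad_fn(params + momentum * velocity)
        velocity = momentum * velocity - lr * g
        return params + velocity, velocity

In practice the momentum is usually ramped up over training (e.g. from 0.5 toward 0.9 and above) while the learning rate is annealed, which is the careful tuning referred to above.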

answered Jun 10 '14 at 18:09


LeeZamparo

edited Jun 10 '14 at 19:43
