I have found that rectified linear units trained with SGD perform very well when supplemented with proper regularization such as dropout and weight decay. I was trying to see whether second-order optimization improves training on rectifier networks. However, I found that the generalized Gauss-Newton (GGN) matrix-vector products (Gv) become vanishingly small and eventually evaluate to NaNs, which halts training. I tried very large values of the damping coefficient to see if that alleviates the problem, but I've had no luck so far. I was wondering if anyone has insight into why Hessian-free (HF) optimization behaves this way on rectifier networks and what can be done to avoid it. The optimization works fine for sigmoid nets.
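For reference, by damping I mean the usual Tikhonov term from Martens (2010): CG solves (G + lambda*I) d = -gradient, so every curvature product inside CG is Gv + lambda*v. A minimal sketch of that inner product, assuming that setup; gauss_newton_vector_product is a hypothetical stand-in for the actual R-op/backprop routine:

    import numpy as np

    def damped_gv(gauss_newton_vector_product, v, lam):
        # (G + lam*I) v, the product CG actually sees under Tikhonov damping
        gv = gauss_newton_vector_product(v)
        if not np.all(np.isfinite(gv)):
            # this is where things go wrong: if the raw Gv is already NaN,
            # adding lam * v afterwards cannot rescue it, since NaN propagates
            raise FloatingPointError("Gv product is not finite")
        return gv + lam * v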
I haven't tried this myself, but here's an idea: HF optimization relies on the Hessian being smooth, but with rectified linear units even the first derivative of the error function is discontinuous. You can approximate the rectified activation function with a smooth one such as epsilon * log(1 + exp(x / epsilon)). The smaller the epsilon, the closer it is to the rectifier. You could try choosing a reasonable epsilon, and/or lowering it as training progresses.
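A minimal NumPy sketch of that smoothing; using logaddexp for numerical stability is my own addition:

    import numpy as np

    def smooth_relu(x, eps=1.0):
        # softplus approximation eps * log(1 + exp(x / eps)) of max(0, x);
        # logaddexp(0, z) computes log(1 + exp(z)) without overflow
        return eps * np.logaddexp(0.0, x / eps)

    def smooth_relu_grad(x, eps=1.0):
        # derivative is sigmoid(x / eps): smooth everywhere,
        # unlike the rectifier's step-function derivative
        return 1.0 / (1.0 + np.exp(-x / eps))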
I think Max is on the right track. In Schraudolph's paper (referenced in Martens 2010), the quality of the direction suggested by the Gauss-Newton approximation G to the Hessian depends on H_{L∘M} (see section 3).
I haven't worked out the form of H_{L∘M} for ReLU units and a sum-of-squared-errors loss, but it may be that this matrix isn't PSD. You might be able to use the Fisher information matrix, which relies solely on gradient information, to approximate the Hessian, but that would likely defeat the point of using a second-order method: you'd be ignoring the curvature information carried by the second-derivative terms (see section 3 of Schraudolph again). If you're married to ReLU activations, try a first-order method like Nesterov's accelerated gradient (NAG) while paying careful attention to initialization, momentum and learning-rate annealing. This paper is a good start.
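For concreteness, a minimal sketch of one NAG step in the lookahead form; grad_fn, lr and mu are placeholders for your gradient routine, learning rate and momentum schedule:

    import numpy as np

    def nag_step(w, v, grad_fn, lr, mu):
        # Nesterov momentum: evaluate the gradient at the lookahead
        # point w + mu*v rather than at the current parameters w
        g = grad_fn(w + mu * v)
        v_new = mu * v - lr * g   # update the velocity
        return w + v_new, v_new   # take the step, return new (w, v)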