I am using AdaGrad and AdaDelta in my current project for training a deep neural network. AdaGrad is very sensitive to its initial learning rate: if the initial learning rate is chosen poorly, its performance is very poor. Moreover, AdaGrad's learning rate always decays over time, which is bad for non-stationary tasks. In particular, I think different layers of a deep neural network should have different learning rate schedules, and the learning rate of each parameter should adapt to the current state of the network.

AdaDelta's performance is often comparable to, or even better than, the best tuned AdaGrad, and it is insensitive to its hyperparameters; I don't need to re-tune them for different learning tasks and datasets. However, AdaDelta's learning rate never decays over time, so it never fully converges to a local optimum. In my project (a classification task), an important prediction quality measure is the difference between the empirical positive rate and the predicted positive rate, averaged over the entire training or test data. If the model converges, the two rates should agree (at least on the training data), but I found that for AdaDelta-trained models they differ a lot, which is problematic. AdaDelta has a window-size hyperparameter and computes the approximate expected squared gradient and expected squared parameter update (Delta x) over that window; a minimal sketch of the update I am referring to is included below. I don't know whether increasing the window size would help convergence.

I have also considered the algorithms presented in "No More Pesky Learning Rates" and "Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients". However, these algorithms require either computing the diagonal Hessian or doing two forward/backward passes per mini-batch, so they don't fit into my current learning framework. Moreover, since I am using the ReLU activation function, the loss is non-smooth, so the algorithm of "No More Pesky Learning Rates", which needs the diagonal Hessian, does not apply. The algorithm of "Adaptive learning rates and parallelization for stochastic, sparse, non-smooth gradients", on the other hand, has only been tested on extremely simple one-dimensional tasks and has not been compared against other adaptive learning rate algorithms.

Can anyone recommend a better adaptive learning rate algorithm? Some properties are desired:

- insensitive to its hyperparameters, so I don't have to re-tune it per task or per dataset;
- per-parameter (and hence per-layer) adaptive learning rates;
- effective learning rates that decay enough for the model to actually converge;
- no Hessian computation and only one forward/backward pass per mini-batch;
- works with non-smooth losses such as those arising from ReLU activations.
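For concreteness, here is roughly the update I am referring to: a minimal NumPy sketch of AdaDelta, where rho is the decay rate that plays the role of the window size (closer to 1 means a longer effective window) and eps is the usual small stabilizer. The variable names are mine, not from any particular library.

    import numpy as np

    def adadelta_step(x, grad, Eg2, Edx2, rho=0.95, eps=1e-6):
        # Decaying average of squared gradients; rho acts as the "window size".
        Eg2 = rho * Eg2 + (1.0 - rho) * grad ** 2
        # Per-parameter update: RMS of past updates over RMS of recent gradients.
        dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad
        # Decaying average of squared updates.
        Edx2 = rho * Edx2 + (1.0 - rho) * dx ** 2
        return x + dx, Eg2, Edx2

    # Eg2 and Edx2 start as zero arrays with the same shape as the parameters x.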
Any ideas? Thanks.
It's a good question (+1) without, to the best of my knowledge, very good answers in ML. Justin mentioned RPROP. Unfortunately, it is meant to work only with full batches (very big mini-batches are essentially indistinguishable from full batches, so they would work too, but that defeats the purpose of mini-batches). However, there is a new, relatively unknown, and currently unpublished (AFAIK) extension of RPROP to mini-batch learning called RMSPROP, which Geoff Hinton says is currently his favorite optimization method. The only public implementation I know of is in https://github.com/BRML/climin (a project started by Justin, incidentally, so I'm surprised he didn't know about RMSPROP).

A few things to note: the documentation is wrong as of this writing; RPROP and RMSPROP are "adaptive" methods, unlike standard SGD, but they are not free of hyperparameters. In fact, RMSPROP has 5 of them! The idea, however, is that the algorithm should either be relatively insensitive to them, or the same choice of hyperparameters should work almost everywhere. (A minimal sketch of the core RMSPROP update is given after the comments below.)

RMSPROP looks like a combination of several techniques. I'll try it. Thanks!
(Dec 03 '13 at 06:52)
x10001year
I did not mention RMSPROP because it is sensitive to hyperparameters. :) (1) The learning rate is almost as difficult to get right as with SGD, and the decay can also change a lot. (2) I have used this RMSPROP implementation successfully for deep nets, and it's the way I optimize them; the tests do not include big models because they need to run fast and are only sanity checks. (3) Thanks for the pointer regarding the documentation. (4) I want to contradict you on the mini-batch point: RPROP is fine for the mini-batch sizes you typically use with deep non-convolutional nets, such as ~200.
(Dec 06 '13 at 15:21)
Justin Bayer
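For reference, the core RMSPROP update discussed in the answer above is just SGD with each parameter's gradient divided by a running root-mean-square of its recent gradients. A minimal NumPy sketch of that core rule follows; the extra hyperparameters in the climin implementation presumably come from adding momentum and step-rate adaptation on top of it, and the names below are illustrative only.

    import numpy as np

    def rmsprop_step(x, grad, ms, step_rate=1e-3, decay=0.9, eps=1e-8):
        # Decaying mean of squared gradients (the "RMS" part).
        ms = decay * ms + (1.0 - decay) * grad ** 2
        # Scale each parameter's step by the inverse RMS of its recent gradients.
        x = x - step_rate * grad / (np.sqrt(ms) + eps)
        return x, ms

    # ms starts as a zero array with the same shape as the parameters x.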
You might want to try resilient propagation (RPROP), although it does not strictly fit your requirements.
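In case it helps, the core idea of RPROP is to ignore the gradient's magnitude entirely and adapt a per-parameter step size from the sign agreement of successive (full-batch) gradients; this reliance on gradient signs is also why small, noisy mini-batches are problematic for it, as discussed above. Below is a rough NumPy sketch of the simplest variant (without weight backtracking); the constants are the commonly cited defaults and are used here only for illustration.

    import numpy as np

    def rprop_step(x, grad, prev_grad, step,
                   eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
        # Grow the step where the gradient sign is stable, shrink it where it
        # flips; the magnitude of the gradient is never used.
        agree = grad * prev_grad
        step = np.where(agree > 0, np.minimum(step * eta_plus, step_max), step)
        step = np.where(agree < 0, np.maximum(step * eta_minus, step_min), step)
        # Move each parameter by its own step size, opposite the gradient sign.
        x = x - np.sign(grad) * step
        return x, step, grad  # the current gradient becomes prev_grad next call

    # prev_grad starts at zero; step starts at a small constant (e.g. 0.01) per parameter.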