|
The tutorial at http://deeplearning.net/tutorial/mlp.html#mlp trains a multi-layer perceptron with a constant learning rate. Is it possible to implement an adaptive rule such as backtracking in this scheme (in Theano, of course)? If so, how? |
|
I've heard of researchers trying backtracking schemes, but no one has argued for using them. So how do you pick a learning rate? Basically brute force, with heuristics to keep the computation under control. Leon Bottou has often suggested picking a small number of data points (e.g. 1000) and doing that many stochastic updates to evaluate a learning rate. Do a line search over the learning rate, using your favorite method and this evaluation measure, to find the most effective learning rate on the small sample. Then make the learning rate a little smaller (because you're going to use it on all your data) by dividing it by, say, 2 or 3, and do SGD on the whole dataset. You might combine this algorithm with an annealing schedule as well. Check out Leon Bottou's Svm Asgd implementation. |
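The procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Bottou's code: the toy 1-D regression data, the log-spaced grid standing in for "your favorite line search", and all function names are made up for the example. The divisor 2 is one of the values suggested above.

```python
import random

def make_data(n, rng):
    """Synthetic data y = 3x + 0.5 + noise (an illustrative assumption)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + 0.5 + rng.gauss(0.0, 0.1)))
    return data

def mean_loss(w, b, data):
    """Mean squared-error loss of the linear model on the given data."""
    total = 0.0
    for x, y in data:
        err = w * x + b - y
        total += err * err
    return total / (2.0 * len(data))

def sgd_pass(data, lr, w=0.0, b=0.0):
    """One stochastic update per example, visiting the examples in order."""
    for x, y in data:
        err = w * x + b - y
        w -= lr * err * x
        b -= lr * err
    return w, b

def pick_learning_rate(sample, candidates):
    """Return the candidate with the lowest sample loss after one pass."""
    best_lr, best_loss = None, float("inf")
    for lr in candidates:
        w, b = sgd_pass(sample, lr)
        loss = mean_loss(w, b, sample)
        # Diverged runs give inf or nan, which never compare as smaller.
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr

rng = random.Random(0)
data = make_data(20000, rng)
sample = data[:1000]                            # small evaluation subset
candidates = [10.0 ** k for k in range(-5, 2)]  # coarse grid: 1e-5 ... 10
best = pick_learning_rate(sample, candidates)
learning_rate = best / 2.0                      # shrink before the full run
w, b = sgd_pass(data, learning_rate)
print(best, learning_rate, w, b)
```

A finer search around the best grid point, or an annealing schedule on top of the chosen rate, would slot into `pick_learning_rate` and `sgd_pass` respectively.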
|
In the MLP code from the tutorial, learning_rate is a constant, but it doesn't have to be. If you want to vary it during learning, you can replace it with a Theano shared variable (created with theano.shared) and then update its value during training (for example with set_value) according to whichever scheme you prefer. Not sure if this is the kind of answer you were looking for... I am not very familiar with adaptive learning rate schemes myself, but this is how I would implement it. |
|
What do you mean by backtracking in this context? A rule that is often used is a decaying learning-rate schedule. A common form is a/(b + c * t), where a, b and c are constants and t is the update step. Note that this form decays like 1/t, not exponentially.
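As a small illustration of the a/(b + c * t) form above, here is a sketch in plain Python. The function name and the constant values are arbitrary choices for the example, not recommendations.

```python
def inverse_decay(t, a=0.1, b=1.0, c=0.01):
    """Learning rate at update step t for the schedule a / (b + c * t)."""
    return a / (b + c * t)

# Minimise f(w) = (w - 2)^2 with plain gradient steps and the decaying rate.
w = 0.0
for t in range(1000):
    grad = 2.0 * (w - 2.0)
    w -= inverse_decay(t) * grad

print(inverse_decay(0), inverse_decay(100), inverse_decay(900), w)
```

The ratio a/b sets the initial rate, and c controls how quickly the rate falls off with the step count.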
For instance, the Armijo rule (see http://en.wikipedia.org/wiki/Wolfe_conditions, or http://www.math.umbc.edu/~aa5/articles/fall2006.pdf).
(Jan 18 at 04:07)
Mateusz
It is possible to do line search, but AFAIK it is not very common. There was a paper recently using LBFGS for neural networks. I think line search is not very popular in non-convex optimization because the cost is very high compared to the benefit. The line search is itself a non-convex optimization, and even if you do it, I don't think you can guarantee that you will never need to search in this direction again.
(Jan 18 at 05:20)
Andreas Mueller
Just on a side note: LBFGS can give you 0 training error on MNIST with an MLP. I know that's not the holy grail (since it's MNIST and you never want 0 training error), but it shows how powerful that optimizer is.
(Mar 11 at 13:57)
Justin Bayer
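For reference, the Armijo rule mentioned in the comments above can be sketched as a backtracking step on a deterministic (full-batch) objective. This is a minimal plain-Python illustration: the toy quadratic, the function names and the constants (`alpha0`, `shrink`, `c`) are assumptions made for the example. With minibatch gradients, both sides of the test would be noisy estimates, which is part of why the comments above are skeptical of line search for neural networks.

```python
def armijo_step(f, grad, x, alpha0=1.0, shrink=0.5, c=1e-4, max_halvings=50):
    """Take one gradient step whose size is found by Armijo backtracking.

    Accepts the first alpha with f(x - alpha*g) <= f(x) - c * alpha * ||g||^2.
    Returns (new_x, alpha); alpha is 0.0 if no step size was accepted.
    """
    g = grad(x)
    fx = f(x)
    g_sq = sum(gi * gi for gi in g)
    alpha = alpha0
    for _ in range(max_halvings):
        candidate = [xi - alpha * gi for xi, gi in zip(x, g)]
        if f(candidate) <= fx - c * alpha * g_sq:
            return candidate, alpha
        alpha *= shrink  # step too long: shrink and retry
    return x, 0.0

def f(x):
    """Toy badly scaled quadratic with its minimum at the origin."""
    return x[0] ** 2 + 10.0 * x[1] ** 2

def grad(x):
    return [2.0 * x[0], 20.0 * x[1]]

x = [1.0, 1.0]
history = [f(x)]
for _ in range(100):
    x, alpha = armijo_step(f, grad, x)
    history.append(f(x))

print(x, history[-1])
```

Each accepted step satisfies the sufficient-decrease condition, so the recorded objective values never increase.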
|
|
After I have figured out how the deep learning algorithms in Theano work, I mean to try out an RProp implementation. RProp is a simple, surprisingly effective adaptive learning-rate GD method that relies only on observing the change of sign in successive partial derivatives. It's robust and computationally cheap. The fundamental idea is that if the sign of the partial derivative changes after a batch presentation, the learning rate is too high; if it doesn't change, the learning rate is too low. During my PhD research I found that, on a number of problems and network architectures, and using default settings, RProp significantly outperformed fixed-rate GD, GD with momentum and even Levenberg-Marquardt for the same number of epochs (while being cheaper to compute than LM). It does require batch training to work, presumably because the second-order information it uses is buried by noise during online training.
My experience with RProp is that it has been worse than minibatched stochastic gradient descent on everything I tried, mainly because it usually operates on larger batches.
(Mar 09 at 15:54)
gdahl ♦
RPROP is great for some problems, especially if you can afford a rather big batch size (e.g. 500 or 1000). My theory is that it works fine if you have discontinuous error landscapes (e.g. because of a max in your model/loss).
(Mar 11 at 03:35)
Justin Bayer
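The sign-based rule from the RProp answer above can be sketched as follows. This is a plain-Python illustration on a toy full-batch problem, not a Theano implementation. It follows one common variant, in which a weight's update is skipped on the iteration where its gradient sign flips. The constants (growth 1.2, shrink 0.5, step bounds, initial step 0.1) are commonly quoted defaults and should be treated as assumptions here.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def rprop_minimize(grad_fn, weights, n_iters=200, step_init=0.1,
                   eta_plus=1.2, eta_minus=0.5,
                   step_min=1e-6, step_max=50.0):
    """Minimise using only gradient signs, with one step size per weight."""
    weights = list(weights)
    steps = [step_init] * len(weights)
    prev = [0.0] * len(weights)
    for _ in range(n_iters):
        grads = grad_fn(weights)
        for i, g in enumerate(grads):
            agreement = g * prev[i]
            if agreement > 0:
                # Same sign as last time: step was too small, so grow it.
                steps[i] = min(steps[i] * eta_plus, step_max)
            elif agreement < 0:
                # Sign flipped: step was too large, so shrink it and
                # skip this weight's update for this iteration.
                steps[i] = max(steps[i] * eta_minus, step_min)
                g = 0.0
            weights[i] -= sign(g) * steps[i]
            prev[i] = g
    return weights

def grad_fn(w):
    """Gradient of the toy objective (w0 - 3)^2 + 10 * (w1 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]

w = rprop_minimize(grad_fn, [0.0, 0.0])
print(w)
```

Because only the sign of each partial derivative is used, the different curvature of the two coordinates does not affect the updates. That also hints at why noisy minibatch gradients, whose signs flip at random, are a poor fit.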
|