Hi, I'm referring to the code here: http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html It's from Geoff Hinton's paper on DBNs applied to MNIST. They are using a softmax classifier to classify the digits.
In the fine-tuning method, the gradient computed over a minibatch is not divided by the minibatch size. Is anyone here familiar with this code and able to elaborate on this? Thanks.
The MATLAB code is not using stochastic gradient descent for the supervised phase; it uses a nonlinear conjugate gradient algorithm. The non-averaged gradient is correct, and for nonlinear CG it isn't necessary to divide by the minibatch size, since the algorithm does line searches. For minibatched SGD, dividing by the minibatch size is recommended, since it makes it easier to keep a single learning rate while changing the minibatch size.
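To make the distinction concrete, here's a rough sketch in Python/NumPy (not the code from the paper; the toy linear model and all names are just for illustration) of a summed versus averaged minibatch gradient, with a plain SGD step on the averaged one:

    import numpy as np

    # Toy sketch, not the code from the paper: a linear model with squared
    # error, comparing the summed and the averaged minibatch gradient.

    def minibatch_gradient(w, X, y, average=True):
        """Gradient of 0.5 * ||X @ w - y||^2 over one minibatch.

        average=False sums over the minibatch, so the gradient's scale depends
        on the batch size (harmless for nonlinear CG, which picks the step
        length by line search). average=True divides by the batch size, so one
        SGD learning rate keeps working when the minibatch size changes.
        """
        grad = X.T @ (X @ w - y)      # summed over the minibatch
        if average:
            grad = grad / X.shape[0]  # divide by the minibatch size
        return grad

    def sgd_step(w, X, y, lr=0.1):
        # Plain SGD update on the averaged gradient.
        return w - lr * minibatch_gradient(w, X, y, average=True)

    # Quick check: feeding the same examples twice doubles the summed gradient
    # but leaves the averaged gradient unchanged.
    rng = np.random.default_rng(0)
    w, X, y = np.zeros(5), rng.normal(size=(10, 5)), rng.normal(size=10)
    X2, y2 = np.vstack([X, X]), np.concatenate([y, y])
    print(np.linalg.norm(minibatch_gradient(w, X, y, average=False)),
          np.linalg.norm(minibatch_gradient(w, X2, y2, average=False)))
    print(np.linalg.norm(minibatch_gradient(w, X, y)),
          np.linalg.norm(minibatch_gradient(w, X2, y2)))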
I understand, thanks. Btw, is this a good optimizer for DBNs? Simple SGD seems too slow in my experiments.
(Jan 11 '13 at 08:50)
rm9
Well-tuned SGD usually works pretty well and is what I use.
(Jan 11 '13 at 15:27)
gdahl ♦
What do you mean by "well-tuned"? Testing many learning rates?
(Jan 12 '13 at 05:57)
rm9
Learning rates, minibatch sizes, momentum, ... Decaying learning rate schedules are also often used. Usually it's a good idea to optimise these 'learning' parameters together with the hyperparameters of the model.
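For example (just a sketch, not the paper's code; grad_fn and all the default values are placeholders), those knobs show up in an SGD loop roughly like this:

    import numpy as np

    def sgd_with_momentum(w, X, y, grad_fn, lr0=0.1, decay=1e-3,
                          momentum=0.9, batch_size=100, n_epochs=10):
        # grad_fn(w, xb, yb) stands in for whatever computes the averaged
        # minibatch gradient of your model.
        velocity = np.zeros_like(w)
        step = 0
        for epoch in range(n_epochs):
            for start in range(0, len(y), batch_size):
                xb = X[start:start + batch_size]
                yb = y[start:start + batch_size]
                lr = lr0 / (1.0 + decay * step)   # decaying learning rate schedule
                velocity = momentum * velocity - lr * grad_fn(w, xb, yb)
                w = w + velocity                  # momentum update
                step += 1
        return w

lr0, decay, momentum and batch_size (and the schedule itself) are exactly the kind of 'learning' parameters worth tuning jointly with the model's hyperparameters.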
(Jan 12 '13 at 08:39)
Sander Dieleman
Is there any "smart" way of doing this tuning? If I use SGD with the above sample code (instead of CG), I get bad performance. I tried several learning rates, batch sizes and momentum values, but nothing seems to work. The example is on the MNIST dataset, so it should work somehow.
(Jan 13 '13 at 03:01)
rm9
Grid search is a popular approach to optimise hyperparameters, but anything 'deep' usually has rather a lot of hyperparameters, so the grid becomes very big very quickly. Interestingly, you can mitigate this problem by sampling the parameters randomly: http://jmlr.csail.mit.edu/papers/v13/bergstra12a.html
There has also been some interesting research lately on using Bayesian techniques for hyperparameter optimisation: http://nips.cc/Conferences/2011/Program/event.php?ID=2579 James Bergstra has an unreleased software package for this called 'hyperopt'. Lately he seems to be working on it again, and I think he mentioned at NIPS that he is planning to release it soon. You can find it here: https://github.com/jaberg/hyperopt
I usually use random search, because it's trivial to implement and works remarkably well.
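Random search really is only a few lines. A rough sketch (train_and_score and the parameter ranges are made up, just to show the idea of sampling scale parameters like the learning rate log-uniformly instead of on a grid):

    import numpy as np

    def random_search(train_and_score, n_trials=50, seed=0):
        # train_and_score(**params) is assumed to train a model and return a
        # validation score (higher is better).
        rng = np.random.default_rng(seed)
        best_score, best_params = -np.inf, None
        for _ in range(n_trials):
            params = {
                "lr": 10 ** rng.uniform(-4, -1),   # log-uniform learning rate
                "momentum": rng.uniform(0.5, 0.99),
                "batch_size": int(rng.choice([10, 50, 100, 200])),
            }
            score = train_and_score(**params)
            if score > best_score:
                best_score, best_params = score, params
        return best_params, best_score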
(Jan 13 '13 at 14:50)
Sander Dieleman
You must be doing SGD wrong then. One "smart" way of tuning it is to use Bayesian optimization, but if you can't get things to at least almost work by hand, something else might be wrong.
(Jan 24 '13 at 18:33)
gdahl ♦