
Hi machine learners,

I am fine-tuning a deep neural network. I would like to limit the impact of my choice of learning rate for the supervised training, because I already have so many hyper-parameters, and I am hoping that a decreasing learning rate will help with that. However, I am not sure which function is best to use. So far I have tried decreasing it linearly, but it doesn't seem to make much difference, since the learning rate always stays within the same order of magnitude. I will try an exponential decrease, but then I am worried that I will lose classification performance and/or increase the training time. Maybe I should try something in between linear and exponential?
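For reference, a minimal sketch of the two kinds of schedule in question (the function names and arguments are just illustrative, not from any particular library):

    def linear_decay(lr0, lr_final, t, t_max):
        """Interpolate linearly from lr0 down to lr_final over t_max updates."""
        frac = min(t / float(t_max), 1.0)
        return lr0 + frac * (lr_final - lr0)

    def exponential_decay(lr0, gamma, t):
        """Shrink lr0 geometrically; with gamma < 1 the rate can drop by orders of magnitude."""
        return lr0 * gamma ** t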

What do you think? Do you usually keep the learning rate constant for supervised learning, or do you have a favorite decreasing function?

Thanks,

Philippe Hamel

[email protected]

asked Jul 21 '10 at 09:21 by Philippe Hamel

edited Jul 21 '10 at 13:16 by ogrisel


One Answer:

You should absolutely read "Efficient BackProp" by Y. LeCun, L. Bottou, G. Orr and K. Müller, in Neural Networks: Tricks of the Trade.

To summarize the relevant part: use a schedule such as learning_rate(t) = lambda / (t0 + t), and select lambda and t0 empirically with a grid search on an exponential scale (on the first 1000 samples, for instance), keeping the values that decrease the objective function the fastest; then train your model on the full dataset with those values.
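A rough sketch of that procedure (the train_sgd helper and the grid values below are placeholders, not something prescribed by the paper):

    def lr_schedule(t, lam, t0):
        """Learning rate at step t for the schedule lr(t) = lam / (t0 + t)."""
        return lam / (t0 + t)

    def pick_schedule(X, y, train_sgd, lambdas, t0s, n_probe=1000):
        """Grid-search (lam, t0) on a small prefix of the data and keep the pair
        whose short SGD run drives the training objective down the fastest.

        train_sgd(X, y, lr_fn) is assumed to be supplied by the caller: it runs
        a short SGD pass using lr_fn(t) as the step size and returns the final
        value of the objective function.
        """
        X_probe, y_probe = X[:n_probe], y[:n_probe]
        best = None
        for lam in lambdas:      # e.g. [1e-4, 1e-3, 1e-2, 1e-1, 1.0] (exponential scale)
            for t0 in t0s:       # e.g. [1.0, 10.0, 100.0, 1000.0]
                obj = train_sgd(X_probe, y_probe, lambda t: lr_schedule(t, lam, t0))
                if best is None or obj < best[0]:
                    best = (obj, lam, t0)
        return best[1], best[2]  # lam, t0 to reuse for training on the full dataset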

answered Jul 21 '10 at 13:01 by ogrisel

Ok, thanks, that is exactly what I needed. I will give it a try. However, now I have two hyper-parameters instead of one :P. Oh well...

Thanks a lot for the reference! I will give it a good read. At first glance, it seemed to contain a lot of neat tricks for gradient descent.

(Jul 21 '10 at 14:40) Philippe Hamel

You only have to tune t0. According to Bottou, lambda should just be set to the current regularization constant (see http://leon.bottou.org/projects/sgd).

(Jul 22 '10 at 11:36) Frank

@Frank: that depends on the objective function you want to optimize: you can have no regularizer (updates are based only on the gradient of the loss function), a squared L2 penalty, or L1 plus squared L2 (elastic net, which has two hyperparameters, one for each regularizer component), and so on. So if you are training linear SVMs, lambda is the usual regularization hyperparameter, but this might not hold in the general case.

(Jul 22 '10 at 11:51) ogrisel

Efficient Backprop is a good paper, but a lot of what it suggests is complete overkill. Take a look at:

http://www.computer.org/portal/web/csdl/doi/10.1109/ICDAR.2003.1227801

This paper describes how the authors achieved excellent results with a convolutional neural net using almost none of LeCun's tricks. As far as the learning rate is concerned, picking a low one and multiplying it by some factor < 1 every so many epochs seems to work decently.
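For example, a minimal sketch of that kind of step decay (the constants below are just illustrative):

    def step_decay(lr0, factor, epoch, step_size):
        """Multiply the starting rate by `factor` once every `step_size` epochs."""
        return lr0 * factor ** (epoch // step_size)

    # e.g. start at 0.01 and halve the rate every 10 epochs:
    # lr = step_decay(0.01, 0.5, epoch, 10)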

Depending on your architecture, you can do quite a bit without any time-consuming-to-implement, complicated-to-keep-track-of tricks. Of course, if you need that little extra, they'll often come in handy. Just don't use them before you need to.

(Jul 22 '10 at 20:57) Jacob Jensen

I don't think this is overkill at all. For instance, I used SGD to train stacked denoising autoencoders, and I started with a fixed learning rate set by hand. Depending on the size (number of nodes per layer and number of layers) and the amount of input corruption, the optimal learning rate was often hard to find by hand. At some point I thought I had a convergence bug in my code. Once I had implemented the scheduling tricks from LeCun and Bottou, the convergence of my architectures was much more predictable, faster and better than with a manually tuned fixed learning rate.

(Jul 24 '10 at 07:59) ogrisel

It is true that this might introduce new hyperparameters. Different hyperparameter optimization techniques (like random sampling) have been discussed in other questions on this site. But if you simply want to reduce the number of hyperparameters, use a constant learning rate and don't decay it, keeping in mind that ogrisel says he encountered difficulties with that approach.

(Jul 24 '10 at 10:17) Joseph Turian ♦♦