This applies to any kind of deep network where you're using layer-by-layer pretraining for some kind of MLP. Each layer has multiple hyperparameters, but measures like validation-set classification performance are only available once you've constructed the whole model. If you want to try k different hyperparameter settings per layer and the network has depth D, you end up having to train k^D networks, which is far too expensive. I can think of a few ways around this, but I'm curious which ones other people are using in practice:

  • Randomly sample k sets of hyperparameters for the whole network, and train those k networks.
  • Randomly sample k sets of hyperparameters for layer N, train layer N with each, and pick the single best set before building layer N+1. The criterion for "best" could be the validation-set error of an MLP trained on only the layers built so far, or some measure of the invariance properties of the features that layer has learned.
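The two options above differ mainly in cost: the greedy one trains on the order of k * D layers instead of k^D whole networks. Here is a minimal sketch of the greedy variant, with a toy stand-in for layer training and for the per-layer selection criterion (all names, the random-projection "training", and the variance-based proxy score are purely illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))  # toy unlabeled data

def train_layer(data, n_hidden):
    # Stand-in for unsupervised pretraining of one layer: a random
    # projection plus a nonlinearity (a real setup would fit an RBM
    # or an autoencoder here).
    w = rng.normal(scale=1.0 / np.sqrt(data.shape[1]),
                   size=(data.shape[1], n_hidden))
    return np.tanh(data @ w)

def proxy_score(features):
    # Stand-in for a per-layer selection criterion (e.g. validation
    # error of a shallow classifier trained on these features);
    # lower is treated as better.
    return -float(features.var())

def greedy_layerwise_search(data, candidate_sizes, depth):
    """Pick one hyperparameter setting per layer greedily: k * depth
    layer trainings instead of the k ** depth networks of a full grid."""
    chosen = []
    for _ in range(depth):
        best_score, best_size, best_feats = None, None, None
        for size in candidate_sizes:
            feats = train_layer(data, size)
            score = proxy_score(feats)
            if best_score is None or score < best_score:
                best_score, best_size, best_feats = score, size, feats
        chosen.append(best_size)
        data = best_feats  # build the next layer on top of the winner
    return chosen

print(greedy_layerwise_search(X, [8, 32, 128], depth=3))
```

The greedy choice is only as good as the per-layer criterion, which is exactly what the question is about.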

asked Jun 30 '10 at 18:07


Ian Goodfellow

edited Jul 05 '10 at 16:20


Joseph Turian ♦♦

2 Answers:

Can't you get a proxy/lower bound for validation-set performance from the partial model by "adding" a label layer on top of it and training just that layer with gradient descent for a while (or by training an SVM on the features from your current topmost layer)? You could then use this lower-bound performance for model selection.
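A toy sketch of that first idea: fit only a logistic "label layer" by gradient descent on frozen features, then read off validation accuracy as the proxy. The data here is synthetic and all names are made up for illustration; in practice `feat_train`/`feat_val` would be the activations of your current topmost pretrained layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for activations of the topmost pretrained layer.
n_train, n_val, dim = 300, 100, 20
w_true = rng.normal(size=dim)
feat_train = rng.normal(size=(n_train, dim))
feat_val = rng.normal(size=(n_val, dim))
y_train = (feat_train @ w_true > 0).astype(float)
y_val = (feat_val @ w_true > 0).astype(float)

def proxy_accuracy(feat_train, y_train, feat_val, y_val,
                   lr=0.1, epochs=200):
    """Train only a logistic 'label layer' on frozen features and
    report validation accuracy as a cheap model-selection proxy."""
    w = np.zeros(feat_train.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(feat_train @ w + b)))
        grad_w = feat_train.T @ (p - y_train) / len(y_train)
        grad_b = (p - y_train).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    pred = (feat_val @ w + b) > 0
    return float((pred == y_val).mean())

print(proxy_accuracy(feat_train, y_train, feat_val, y_val))
```

Because only the top layer is trained, this underestimates what the finished network could do, which is why it is a lower bound rather than an estimate.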

Or, can't you use a Gibbs-sampling-like strategy, experimenting with adding or removing nodes after your whole network is set up? (This might really degrade your performance, though.)
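A rough sketch of that second idea as stochastic local search over layer sizes. The scoring function below is a toy stand-in for "retrain and measure validation error", and a closer analogue of Gibbs sampling would also accept worsening moves with some probability; everything here is illustrative:

```python
import random

random.seed(0)

def toy_score(sizes):
    # Illustrative stand-in for retraining + validation error:
    # pretend the ideal configuration is 200 units per layer.
    return sum((s - 200) ** 2 for s in sizes)

def perturb_search(sizes, steps=50, delta=25):
    """Repeatedly resize one randomly chosen layer and keep the
    change if the (stand-in) score improves."""
    best = toy_score(sizes)
    for _ in range(steps):
        i = random.randrange(len(sizes))
        cand = list(sizes)
        cand[i] = max(1, cand[i] + random.choice([-delta, delta]))
        s = toy_score(cand)
        if s < best:
            sizes, best = cand, s
    return sizes

print(perturb_search([100, 300, 150]))
```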

answered Jul 05 '10 at 16:26


Alexandre Passos ♦

I'm a bit late to the discussion, but this page bubbles up quickly in my searches (I was searching for methods to reduce the hyperparameter search space for deep nets). "Greedy layer-wise" choices seem enticing to me too, but I don't have much personal experience to share. I did, however, stumble on this paper:

"Unsupervised Layer-Wise Model Selection in Deep Neural Networks" (2010)

They experiment with choosing the number of hidden units in a stack of RBMs layer by layer, using reconstruction error as a guide (instead of energy/probability). They report that reconstruction error tends to hit a plateau (see their graphs) at about the same number of units (after training for 2 epochs on MNIST), independently of another parameter (batch size). Using that "minimum for the plateau", they get fairly similar classification performance in the end compared to using more units (up to 1000, 1000 for 2 layers).
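One simple way to operationalize that "minimum for the plateau" heuristic, assuming you have already measured reconstruction error for a few candidate layer sizes (the numbers below are made up, not from the paper):

```python
def smallest_at_plateau(units, errors, tol=0.02):
    """Return the smallest layer size whose reconstruction error lies
    within a relative tolerance `tol` of the best error observed --
    one reading of the 'minimum for the plateau' idea (assumes the
    errors are positive)."""
    best = min(errors)
    for n, e in sorted(zip(units, errors)):
        if e <= best * (1.0 + tol):
            return n

# Hypothetical reconstruction errors after a couple of epochs:
units = [50, 100, 200, 400, 1000]
errors = [9.1, 6.4, 5.2, 5.15, 5.1]
print(smallest_at_plateau(units, errors, tol=0.05))  # -> 200
```

The tolerance encodes how flat "flat" has to be; a looser `tol` trades a little reconstruction quality for a smaller layer.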

This is encouraging, even if maybe a bit limited in scope (what about choosing the learning rate, or using some form of early stopping to decide when to stop training each layer, etc.).

answered Nov 01 '10 at 11:11


Francois Savard


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.