|
This applies to any kind of deep network model where you're using layer-by-layer pretraining for some kind of MLP. Each layer has multiple hyperparameters, but measures like validation-set classification performance are only available once you've constructed the whole model. If you want to try k different hyperparameter settings per layer and the network has depth D, you would end up having to train k^D networks, which is far too expensive. I can think of a few ways around this, but I'm curious which ones other people are using in practice:
|
|
I'm a bit late to the discussion, but this page comes up quickly in my searches (I was looking for methods to reduce the hyperparameter search space for deep nets). "Greedy layer-wise" choices seem enticing to me too, but I don't have much personal experience to share. I did, however, stumble on this paper: "Unsupervised Layer-Wise Model Selection in Deep Neural Networks" (2010), http://hal.archives-ouvertes.fr/docs/00/48/83/38/PDF/ECAI-632.pdf. The authors experiment with choosing the number of hidden units in a stack of RBMs layer by layer, using reconstruction error as a guide (instead of energy/probability). They say that the reconstruction error tends to hit a plateau (see their graphs) at about the same number of units (after training for 2 epochs on MNIST), regardless of batch size. Using that "minimum for plateau" they end up with fairly similar classification performance compared to using more units (up to 1000 and 1000 for the 2 layers). This is encouraging, even if somewhat limited in scope (what about choosing the learning rate, or using some sort of early stopping to decide when to stop training each layer, etc.?). |
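To make the "minimum for plateau" idea concrete, here is a minimal sketch of one way to pick a layer size from a sweep of reconstruction errors. The selection rule (smallest size whose error is within a relative tolerance of the best error seen), the function name `smallest_plateau_size`, the `rel_tol` parameter, and the numbers in the example are all my own assumptions for illustration; the paper may define the plateau differently.

```python
def smallest_plateau_size(sizes, errors, rel_tol=0.05):
    """Return the smallest layer size whose reconstruction error is
    within rel_tol (relative) of the best error in the sweep.

    NOTE: this plateau rule is an assumption for illustration, not
    necessarily the criterion used in the paper.
    """
    pairs = sorted(zip(sizes, errors))      # ascending by layer size
    best = min(err for _, err in pairs)     # lowest error in the sweep
    threshold = best * (1.0 + rel_tol)
    for size, err in pairs:
        if err <= threshold:
            return size                     # first size on the plateau


# Made-up numbers: reconstruction error after a short training run
# for each candidate number of hidden units in one layer.
candidate_sizes = [50, 100, 200, 400, 800]
recon_errors = [0.30, 0.18, 0.101, 0.100, 0.099]

chosen = smallest_plateau_size(candidate_sizes, recon_errors)
print(chosen)  # 200
```

Applied greedily, you would run this once per layer, freeze the chosen size, and move on to the next layer, so a sweep of k sizes over D layers costs about k*D short training runs rather than k^D full ones.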
|
Can't you get a proxy/lower bound for the validation-set performance of the partial model by "adding" a label layer on top of it and training just this layer for a while with gradient descent (or by training an SVM on the features from your current topmost layer)? Then you could use this lower-bound performance for model selection. Or can't you use a "Gibbs sampling"-like strategy, experimenting with adding or removing nodes after your whole network is set up? (This might really degrade your performance, though.) |
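The "label layer on top" proxy could look roughly like the sketch below: freeze the partial model, take its topmost-layer features, and train only a softmax layer on them with plain gradient descent. Everything here is an assumption for illustration: the function name `linear_probe_accuracy`, the learning rate and step count, and the synthetic data standing in for the outputs of two candidate pretrained layers. It uses only NumPy.

```python
import numpy as np


def linear_probe_accuracy(feats_train, y_train, feats_val, y_val,
                          n_classes, lr=0.5, n_steps=200):
    """Train a softmax layer on frozen features with full-batch gradient
    descent and return validation accuracy (a proxy score for the layer)."""
    # Standardize features using training-set statistics only.
    mu = feats_train.mean(axis=0)
    sigma = feats_train.std(axis=0) + 1e-8
    x_tr = (feats_train - mu) / sigma
    x_va = (feats_val - mu) / sigma

    n, d = x_tr.shape
    weights = np.zeros((d, n_classes))
    bias = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y_train]

    for _ in range(n_steps):
        logits = x_tr @ weights + bias
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # d(cross-entropy)/d(logits)
        weights -= lr * (x_tr.T @ grad)
        bias -= lr * grad.sum(axis=0)

    pred = (x_va @ weights + bias).argmax(axis=1)
    return float((pred == y_val).mean())


# Synthetic stand-ins (hypothetical): 3 classes, 10-dimensional inputs.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=600)
centers = rng.normal(size=(3, 10)) * 3.0
inputs = centers[labels] + rng.normal(size=(600, 10))

# Candidate A: features from a layer that preserves class information.
feats_good = np.tanh(inputs @ (rng.normal(size=(10, 20)) * 0.1))
# Candidate B: features that carry no information about the labels.
feats_noise = rng.normal(size=(600, 20))

split = 400  # first 400 rows for training the probe, the rest for validation
acc_good = linear_probe_accuracy(feats_good[:split], labels[:split],
                                 feats_good[split:], labels[split:], 3)
acc_noise = linear_probe_accuracy(feats_noise[:split], labels[:split],
                                  feats_noise[split:], labels[split:], 3)
print(acc_good, acc_noise)
```

For model selection you would compute this score for each hyperparameter setting of the current layer and keep the best one before stacking the next layer. Since only the top layer is trained, the score ignores whatever fine-tuning of the lower layers would add later.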