|
Can we find (theoretically) the optimal number of hidden units for each hidden layer of a neural network ? (trade-off between a small number of parameters for a good training - few units - and a good code/representation in the hidden layer - many units). In practice, any regularization on the weights exists to get this optimal number ? I mean starting with a lot of units, and after training, we get weights equal to zero - units that are too many - and weights of the 'optimal' units. Any idea, reference ? |
|
Well, optimal subset selection, even in the linear case, is NP-complete, but there are some heuristics. L1 regularization is quite popular for enforcing sparsity, and such models can be learned online using a few tricks such as John Langford's truncated gradient method (I don't know if any theoretical guarantees can be made yet about applying this to neural network learning, but he provides proofs involving the squared and logistic loss in the "zero layer" case). There's also the elastic net penalty, which combines L1 and L2 penalty. This has the effect of shrinking some coefficients (in a linear or generalized linear model) to 0, but avoids making arbitrary decisions between highly correlated predictors. I don't think it's been explored for neural networks. 1
Indeed the L1 regularization craze of the last couple of years was mostly applied to online linear models and quadratic reconstruction with a sparsity constraint (dictionary learning); I haven't seen any paper studying it's application to neural nets with one or two hidden layers with non linear activation function. It's worth a try.
(Jun 23 '10 at 16:51)
ogrisel
3
Indeed an L0 criterion (minimizing the number of non-zeroed units) is combinatorially hard to optimize. An L1 criterion on the output weights of an MLP (with a potentially infinite number of hidden units) can be defined in a way that is similar to boosting, the solution of which is an MLP with a finite number of hidden units. Nicolas Le Roux and Yoshua Bengio showed that this yields in theory to a convex optimization problem, albeit with an infinite number of degrees of freedom (just like in boosting). By adding one hidden neuron at a time, the problem can in principle be solved exactly. The only practical problem is that in order to check that one is done (can't add anymore hidden units without making the total training criterion go up because of the L1 penalty) requires finding an optimal linear classifier (the one associated with a new hidden unit). Although we are used to thinking that linear classification is easy, finding the linear classifier with the smallest number of (weighted) classification error is also NP-hard. Approximations may work, though (and we lose the guarantee of finding the optimal solution to the overall optimization problem). See 'Convex Neural Networks' in NIPS'2005 (http://www.iro.umontreal.ca/~lisa/publications2/index.php/publications/show/19).
(Jul 05 '10 at 02:24)
Yoshua Bengio
I wonder if it's possible to formulate the "adding a neuron to reduce error" problem as a submodular optimization problem. Then selecting subsets could be done quickly and with high accuracy.
(Jul 05 '10 at 08:15)
Alexandre Passos ♦
@Alexandre Passos: There is a technique proposed in Perkins et al (jmlr.org/papers/volume3/perkins03a/perkins03a.pdf) that they call Grafting. They add units based upon which has the steepest gradient of the training objective.
(Jul 05 '10 at 18:42)
Joseph Turian ♦♦
|
|
This is tangentially related, but I think people seeing this question may find it useful. If you're dealing with a reinforcement learning task rather than a supervised learning task, you don't have to worry about these problems. The NEAT algorithm (and its variants like HyperNEAT) actually incorporate topology growth into the evolution, so you just setup your inputs and outputs, and NEAT evolves the hidden nodes. |
|
I assume you mean MLPs, not NNs in general. If you are content with a practical method, you can try a constructive learning algorithm, e.g.: Create a small net, and train it. Then add a new node and re-train (initializing the weights from the previous training). Keep doing this until test set performance starts to drop. (Sorry I forget how this concrete method is called.) I believe you are talking about "early stopping", which is typically measured on the validation set, not the test set. There is a technique proposed in Perkins et al (jmlr.org/papers/volume3/perkins03a/perkins03a.pdf) that they call Grafting. They add units based upon which has the steepest gradient of the training objective.
(Jun 30 '10 at 12:47)
Joseph Turian ♦♦
There is also a technique called Cascade Correlation where you start with a small network already trained, freeze its weights, and incremently add a new node. About the computation of the weights of this new unit: "For each new hidden unit, we attempt to maximize the magnitude of the correlation between the new unit’s output and the residual error signal we are trying to eliminate." http://www.cs.iastate.edu/~honavar/fahlman.pdf About computing any classification performance, I would like to stay in the unsupervised setting.
(Jun 30 '10 at 16:20)
Gregoire
For completeness, here are the the comp.ai.neural-nets FAQ on the number of hidden units (The corresponding answer for hidden layers seems more complete.)
(Jul 05 '10 at 10:59)
Srihari
|
|
There's also a technique called "Optimal Brain Damage", and a second-order extentsion "Optimal Brain Surgeon" that take a trained net and cull nodes that have small weights or are decorrelated with error in some way. This is early 90's stuff though, so I'm not sure about breadth of applicability. |
|
This depends on the hardware implementation of the neural network. I expect the optimal number for Von Neumann computers to be somewhat less than the optimal number for brains, for example. I would do some profiling, write a cost function and do a genetic algorithm rather than trying to find an analytic "optimal" which may be completely wrong a few years from now. |