I was wondering if there is any literature on the effect of different parameter initialisation strategies on RBM training. In my experience this aspect of the training procedure can considerably affect the results, and yet it tends to be neglected: RBM-related papers rarely mention which initialisation method was used.

Initialising the weights randomly by drawing samples from a Gaussian seems to be a common strategy (and the most intuitive one, in my opinion), but interestingly the deeplearning.net tutorial on RBMs suggests a uniform initialisation over a range that depends on the number of visible and hidden units. This initialisation strategy is taken from "Understanding the difficulty of training deep feedforward neural networks" by Glorot and Bengio. However, that paper discusses supervised training of deep neural networks (~5 layers) with backpropagation. If I understand it correctly, the initialisation strategy is based on a theoretical derivation that doesn't apply to unsupervised training with contrastive divergence (or indeed to any greedy layer-wise learning method).
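For concreteness, this is roughly what that initialisation looks like as a standalone numpy sketch (written from memory, so the constants, in particular the extra factor of 4 the tutorial applies for sigmoid units, may not match the tutorial code exactly):

    import numpy as np

    def normalized_init(n_visible, n_hidden, rng=np.random):
        # Uniform range that depends on the layer sizes, as in Glorot & Bengio.
        # The deeplearning.net RBM tutorial scales the bound by 4 for sigmoid
        # units; the paper derives the un-scaled bound for tanh activations.
        bound = 4.0 * np.sqrt(6.0 / (n_visible + n_hidden))
        return rng.uniform(low=-bound, high=bound, size=(n_visible, n_hidden))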

Since I use Theano often, I've based a lot of my own code on that of the tutorials, so I've also been using this initialisation without really understanding the reasoning behind it. Lately I've found that this 'normalised initialisation', as they call it, can lead to a considerably different learning trajectory and different end results than the Gaussian strategy. For the Gaussian strategy, moreover, choosing the right variance is crucial: otherwise there's a risk of saturating the binary hidden units through the sigmoid, which slows down training.
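The Gaussian alternative I have in mind is simply the following (sigma = 0.01 is just a placeholder value here; as I said, this choice is exactly what seems to matter):

    import numpy as np

    def gaussian_init(n_visible, n_hidden, sigma=0.01, rng=np.random):
        # Zero-mean Gaussian; sigma decides whether the hidden unit
        # pre-activations start out in the roughly linear part of the
        # sigmoid or in its saturated tails.
        return sigma * rng.randn(n_visible, n_hidden)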

So, to reiterate: is there any literature on this subject? And since quite a few people here have experience with RBM training, perhaps some of you could share your insights.

asked Apr 13 '12 at 09:53 by Sander Dieleman


One Answer:

Yoshua Bengio was kind enough to answer this question for me. My assumption that the derivation doesn't apply to greedy unsupervised training was wrong. His explanation was a bit more lucid than mine, but this is the gist of it.

The idea of the 'normalized initialization' described in the paper is to ensure that the variance of the features in each successive layer stays roughly constant. This is important in supervised training of deep networks, to prevent the gradient from vanishing, but it is also beneficial when training layer by layer. An autoencoder, for example, can be seen as a two-layer neural network, and keeping the variance in check across layers benefits forward and backward propagation there as well. So the initialisation does make sense for greedy layer-by-layer training too.
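A quick toy check of this variance argument (my own sketch: plain forward propagation through sigmoid layers with arbitrary sizes, not actual RBM training): the normalised initialisation keeps the pre-activation scale roughly constant from layer to layer, whereas a unit-variance Gaussian makes it grow with the fan-in and saturates the sigmoids.

    import numpy as np

    rng = np.random.RandomState(0)
    layer_sizes = [784, 500, 500, 500]  # arbitrary example sizes
    # Binary 'visible' data, just to have something to propagate.
    x = rng.binomial(1, 0.5, size=(1000, layer_sizes[0])).astype(float)

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    inits = {
        # Un-scaled Glorot bound, enough to illustrate the variance argument.
        'normalized': lambda n_in, n_out: rng.uniform(
            -np.sqrt(6.0 / (n_in + n_out)), np.sqrt(6.0 / (n_in + n_out)),
            size=(n_in, n_out)),
        'gaussian (sigma=1)': lambda n_in, n_out: rng.randn(n_in, n_out),
    }

    for name, init in inits.items():
        h = x
        print(name)
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
            a = h.dot(init(n_in, n_out))  # pre-activations of the next layer
            h = sigmoid(a)
            # Normalised: the std stays of order one across layers.
            # Gaussian with sigma=1: the std is of order sqrt(fan-in),
            # so the sigmoids saturate immediately.
            print('  pre-activation std: %.2f' % a.std())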

answered Jun 27 '12 at 08:22 by Sander Dieleman
