Something has bothered me for quite some time. I expected the answer would eventually dawn on me, but so far it eludes me.

Take a look at this code. It's Salakhutdinov & Hinton's example Matlab code for an RBM; in particular, look at lines 68 and 71. At each 'up' pass during the Gibbs sampling step, the hidden probabilities are calculated and then thresholded against uniform random numbers. These thresholded binary values are then used as the hidden states on line 71 when generating the visibles. Similarly, if you look at the binary-to-linear code, the hiddens are linear (Gaussian) and are sampled by adding a small amount of Gaussian noise.

You may notice that the visible units (in this case, it's a binary-to-binary RBM) don't receive similar treatment. Why?

I recently implemented an RBM with rectified linear units (RLUs) on the visibles and binary hiddens, and I found something quite puzzling. In a talk Hinton gave, he recommended sampling with an expression reminiscent of:

max(0, x + N(0, sigmoid(x))) = max(0, x + sqrt(sigmoid(x)) * randn(size(x)))

... which I took to apply when the RLU is latent; otherwise you'd just use max(0, x). However, in practice I found that the RBM didn't work at all unless I introduced that bit of noise on the visibles, as well as sampling (as noted earlier) the latent variables.

So, here's my two-part question:
|
|
I don't know if you noticed, but in the code you link to it isn't just that the visible unit reconstruction isn't sampled: the negative phase hidden states aren't sampled either! So the asymmetry is really between the positive phase and the negative phase, not between the visible and the hidden units, since the only visible reconstruction created in CD1 is the negative data.

Consider an RBM with Bernoulli visible units and Bernoulli hidden units. First I will present an argument for why you MUST sample the positive phase hidden activities before conditioning on them to create a reconstruction of the visible units. Then I will present an argument for why NOT sampling the negative phase hidden activities is useful.

If the positive phase hidden activation probabilities are used directly when reconstructing, then the reconstructed visible units can cheat and gain more than one bit of information about the data per hidden unit they condition on, by having the weights conspire to use the actual real values on [0,1] of the activation probabilities. The latent variables in the model then aren't binary at all; they are real valued, and the model can very easily learn trivial and useless weights if it has enough hidden units. Actually sampling the values prevents this cheating and adds noise that prevents pathological overfitting. If you imagine the CD procedure unrolled, you can see that once you sample at the first step of the block Gibbs sampling chain, the later steps will be unable to get any more information about the current data case than is contained in the sampled hidden activities. A similar argument works for Gaussian hidden units, since without added noise the reconstruction process can make use of inappropriately precise values.

Since the final negative phase hidden units in the code you linked to aren't used for anything except computing the statistics for the weight update, sampling them just adds noise to the weight update and slows down learning.
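To make the argument above concrete, here is a minimal NumPy sketch of one CD-1 update for a Bernoulli-Bernoulli RBM. It is not the linked Matlab code: the names `sigmoid` and `cd1_update`, the learning rate, and the array shapes are all hypothetical, and momentum and weight decay are left out. The point is only to show where sampling happens: the positive phase hidden states are sampled before the reconstruction conditions on them, while the reconstruction and the negative phase hiddens stay as probabilities. Using probabilities rather than sampled states in the positive statistics is an assumption of this sketch.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-x))


def cd1_update(v0, W, b_vis, b_hid, rng, lr=0.1):
    """One CD-1 step on a batch v0 of shape (n_cases, n_vis).

    W has shape (n_vis, n_hid). Returns updated parameters and the
    sampled positive phase hidden states (returned only for inspection).
    """
    n_cases = v0.shape[0]

    # Positive phase: hidden probabilities given the data.
    h0_prob = sigmoid(v0 @ W + b_hid)
    # Sample binary states so each hidden unit passes on at most one bit.
    h0_state = (h0_prob > rng.random(h0_prob.shape)).astype(float)

    # Negative phase: reconstruct from the SAMPLED states, then keep
    # probabilities (no sampling) for both visibles and hiddens.
    v1_prob = sigmoid(h0_state @ W.T + b_vis)
    h1_prob = sigmoid(v1_prob @ W + b_hid)

    # Sufficient statistics; probabilities in the positive statistics
    # are an assumption of this sketch.
    pos_stats = v0.T @ h0_prob
    neg_stats = v1_prob.T @ h1_prob

    W_new = W + lr * (pos_stats - neg_stats) / n_cases
    b_vis_new = b_vis + lr * (v0 - v1_prob).mean(axis=0)
    b_hid_new = b_hid + lr * (h0_prob - h1_prob).mean(axis=0)
    return W_new, b_vis_new, b_hid_new, h0_state


# Toy usage on random binary data (illustrative sizes only).
rng = np.random.default_rng(0)
data = (rng.random((8, 6)) > 0.5).astype(float)
W = 0.01 * rng.standard_normal((6, 4))
W, b_vis, b_hid, h0_state = cd1_update(data, W, np.zeros(6), np.zeros(4), rng)
```

Replacing `h0_state` with `h0_prob` in the reconstruction line gives the "cheating" variant described above, where the visibles see real-valued hidden activities.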
So if you are with me so far, you see why sampling the positive phase hiddens before conditioning on them is important, and why it might be better not to bother sampling the negative phase hiddens. For RBMs, the canonical statement of the CD algorithm samples everywhere, even though researchers from Geoff Hinton's group often don't do this, for the reasons I mentioned above. I believe some researchers outside of the Toronto group do actually sample everywhere (it can work either way).

So what about the negative phase visible reconstruction? Should that be sampled? Sampling it is certainly "correct", so when in doubt, sample it. The reconstruction is used to infer the negative phase hidden units, so the argument for not sampling them doesn't directly apply. For binary units, however, I think sampling the visible reconstructions will just add noise to the weight updates despite their use in computing the negative phase hidden activation probabilities; for other unit types it might be more important. I generally don't sample the visible reconstructions when using binary RBMs, though sampling them might even work better.

Thank you George, that makes a lot of sense.
(Feb 22 '12 at 22:08)
Brian Vandenberg