I'm a little confused about how to learn edge weights in a Boltzmann machine -- is the following correct? I have a set of visible units (corresponding, say, to pixels in an image) and a set of hidden units. After randomly initializing the weights, I alternate between the following two phases until some stopping rule is reached:
My two main points of confusion are:
Hi. First, the algorithm: I think the "right" thing to do is to only do stochastic updates of the units "until convergence". (I don't know how you want to judge that, or how to judge whether you are sampling from equilibrium; I would just do a fixed number of steps.) Then you stop updating and save w_d = x_i * x_j. Then you let the system run freely for some number of iterations, save w_m = x_i * x_j, and do the update W <- W + nu * (w_d - w_m). I think you should not change the weights using the positive (clamped) phase before sampling in the negative (free) phase, which is how I read your algorithm.

To the questions:

1) Yes. Always reset the visible units to the data you want to learn. As for the second part of the question, I am not quite sure what you mean by "done iterating with the first pattern". I would definitely not try to learn one pattern first and then another.

2) Definitely do not update the weights after each equilibrium step; update using both phases together. I would also do batch updates over the whole training set, or at least mini-batch updates if your training set is large.

Cheers, Andy

With regard to Andy's answer for #1, I'll add a caveat. In the basic algorithm, you'd reset the visible units to the original data. However, Tijmen Tieleman published a paper about a modification to the algorithm ("Using Fast Weights To Improve Persistent Contrastive Divergence") in which you wouldn't necessarily reset them.
(Feb 22 '11 at 17:05)
Brian Vandenberg
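The two-phase procedure from the answer above could be sketched roughly as follows. This is only an illustration, not anyone's reference implementation: it assumes binary 0/1 units, symmetric weights with no self-connections, no bias terms, a fixed number of Gibbs sweeps per phase, and a free phase that starts from the state left by the clamped phase. The names `gibbs_sweep`, `train_step`, `n_sweeps` and `lr` are made up for the example.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gibbs_sweep(state, W, units, rng):
    """Stochastically update each unit in `units` once, in place."""
    for i in units:
        activation = sum(W[i][j] * state[j] for j in range(len(state)) if j != i)
        state[i] = 1 if rng.random() < sigmoid(activation) else 0


def train_step(W, batch, n_hidden, n_sweeps, lr, rng):
    """One batch update: accumulate both phases, then change W once."""
    n_visible = len(batch[0])
    n = n_visible + n_hidden
    hidden_units = range(n_visible, n)
    w_d = [[0.0] * n for _ in range(n)]  # clamped-phase correlations
    w_m = [[0.0] * n for _ in range(n)]  # free-phase correlations
    for pattern in batch:
        # Positive phase: visibles are reset to the data and held fixed.
        state = list(pattern) + [rng.randint(0, 1) for _ in hidden_units]
        for _ in range(n_sweeps):
            gibbs_sweep(state, W, hidden_units, rng)
        for i in range(n):
            for j in range(n):
                w_d[i][j] += state[i] * state[j] / len(batch)
        # Negative phase: every unit runs freely
        # (assumption: the chain starts from the clamped-phase state).
        for _ in range(n_sweeps):
            gibbs_sweep(state, W, range(n), rng)
        for i in range(n):
            for j in range(n):
                w_m[i][j] += state[i] * state[j] / len(batch)
    # Single weight update after both phases; the diagonal is left at zero.
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i][j] += lr * (w_d[i][j] - w_m[i][j])


rng = random.Random(0)
batch = [[1, 0, 1, 0], [0, 1, 0, 1]]  # toy data with 4 visible units
n_units = 4 + 2                        # plus 2 hidden units
W = [[0.0] * n_units for _ in range(n_units)]
for _ in range(10):
    train_step(W, batch, n_hidden=2, n_sweeps=5, lr=0.1, rng=rng)
```

Note that the weights only change in the last loop of `train_step`, after statistics from both phases have been averaged over the whole batch.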
Just to confirm: when you say "update using both phases together", do you mean I don't update once after the clamped phase and then once again after the free phase, but only once after both phases have finished?
(Feb 23 '11 at 01:50)
grautur
@Grautur: yes, that's what I mean. @Brian: I usually use PCD without fast weights since it is very easy to implement. I think the paper is the one before the one you cite, called "Training Restricted Boltzmann Machines using approximations to the likelihood gradient" or something similar. I did some work on these methods and they are kind of unstable. They all diverge sooner or later. If you pick a small learning rate, they reach bad optima. If you pick a large learning rate, they diverge really quickly and are really unstable. Then it's not even possible to evaluate the model with AIS any more...

I haven't used PCD a lot, but from conversations I've had it sounds like many of Hinton's students use it (Tijmen Tieleman, Alexander Krizhevsky, and I think James Martens said something about using it as well). I'm a little surprised they'd stick with it if it isn't reliable.
(Feb 24 '11 at 10:55)
Brian Vandenberg
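For readers who have not seen PCD (without fast weights), here is a rough sketch of the idea for a restricted Boltzmann machine: the negative-phase chains are kept between weight updates instead of being reset to the data. It is a simplified illustration under several assumptions: binary 0/1 units, no bias terms, one block Gibbs step per update, and sampled (not mean-field) hidden states. `pcd_update`, `chains` and `lr` are names invented for the example.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sample_hidden(v, W, rng):
    """Sample all hidden units at once given the visibles (W is visible x hidden)."""
    probs = [sigmoid(sum(v[i] * W[i][j] for i in range(len(v)))) for j in range(len(W[0]))]
    return [1 if rng.random() < p else 0 for p in probs]


def sample_visible(h, W, rng):
    """Sample all visible units at once given the hiddens."""
    probs = [sigmoid(sum(W[i][j] * h[j] for j in range(len(h)))) for i in range(len(W))]
    return [1 if rng.random() < p else 0 for p in probs]


def pcd_update(W, batch, chains, lr, rng):
    """One PCD-style update; `chains` is modified in place and reused next call."""
    n_v, n_h = len(W), len(W[0])
    pos = [[0.0] * n_h for _ in range(n_v)]
    neg = [[0.0] * n_h for _ in range(n_v)]
    # Positive statistics: visibles clamped to the data.
    for v in batch:
        h = sample_hidden(v, W, rng)
        for i in range(n_v):
            for j in range(n_h):
                pos[i][j] += v[i] * h[j] / len(batch)
    # Negative statistics: advance the persistent chains by one Gibbs step,
    # continuing from wherever the previous update left them.
    for c in range(len(chains)):
        h = sample_hidden(chains[c], W, rng)
        chains[c] = sample_visible(h, W, rng)
        h = sample_hidden(chains[c], W, rng)
        for i in range(n_v):
            for j in range(n_h):
                neg[i][j] += chains[c][i] * h[j] / len(chains)
    for i in range(n_v):
        for j in range(n_h):
            W[i][j] += lr * (pos[i][j] - neg[i][j])


rng = random.Random(0)
batch = [[1, 0, 1, 0], [0, 1, 0, 1]]             # toy data with 4 visible units
W = [[0.0] * 3 for _ in range(4)]                 # 4 visible x 3 hidden
chains = [[rng.randint(0, 1) for _ in range(4)] for _ in range(2)]
for _ in range(10):
    pcd_update(W, batch, chains, lr=0.05, rng=rng)
```

The only structural difference from resetting to the data each time is that `chains` survives across calls to `pcd_update`.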
Well, I use it, too. It all depends on how you define reliable. You can find some settings that work, but there is not really a good way to do cross-validation or something like that. I had a paper on that at the deep learning workshop at the last NIPS and talked to some of Hinton's people about it. There are more and more learning methods coming out and more and more evaluation methods, too. But from my point of view, there is nothing that you can "just use" (yet).
(Feb 24 '11 at 11:33)
Andreas Mueller
Cool, I'll have to read your paper on it. I've been mulling over the idea of analyzing these models from the perspective of dynamical systems (bifurcations & all that fun stuff). In a restricted Boltzmann machine, I wouldn't expect there to be any odd behavior like you'd see in systems with feedback (RNNs, for example; because of the feedback, it's possible for the same set of parameters to exhibit different stability behavior depending on the history of the system), but a Boltzmann machine -- at least, as I understand them -- is like an RNN in that the connections can induce feedback in the system.
(Feb 24 '11 at 13:46)
Brian Vandenberg
I don't know very much about nonlinear systems, so I cannot really comment on RNNs. I just want to make it clear that there are two different issues here: inference and learning. Of course they are tightly coupled, but I think they should be analyzed separately. You were referring to something similar to inference, i.e. finding points of attraction in an RNN, which I guess is similar to finding the stationary distribution of a Markov chain. The instability I was referring to is during learning, i.e. wild jumps in performance when adjusting the parameters. In a paper that is currently under review, we argue that these instabilities in learning are caused by the inability to find the stationary distribution, so these issues have something to do with each other. Well, I don't really know how to sum up this comment, but I hope you can sort of get my point ;)
(Feb 25 '11 at 06:56)
Andreas Mueller
Can you be more specific on one point: is this for a Boltzmann machine, or a restricted Boltzmann machine?
This is for a general Boltzmann machine. (Are restricted Boltzmann machines trained differently? I thought the only difference was in the structure of restricted Boltzmann machines, and that they were trained the same way as general Boltzmann machines. As in, even though all the visible or all the hidden units can be updated in parallel for an RBM, the update rule is still the same.)
Well, there are many learning rules but they are basically the same. As you said, one can do block Gibbs sampling instead of sampling each node separately - which makes it a lot more efficient. Often Contrastive Divergence is used, which means just doing one step of Gibbs sampling between updates. But that is not such a good idea...
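To make the block Gibbs sampling and the "one step of Gibbs sampling between updates" concrete, here is a small sketch of a CD-1 update for a restricted Boltzmann machine. It is only meant as an illustration and makes simplifying assumptions: binary 0/1 units, no bias terms, and sampled hidden states in both phases. The function names and the `lr` parameter are invented for the example.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def sample_hidden(v, W, rng):
    """Block update: all hidden units are sampled at once given the visibles."""
    probs = [sigmoid(sum(v[i] * W[i][j] for i in range(len(v)))) for j in range(len(W[0]))]
    return [1 if rng.random() < p else 0 for p in probs]


def sample_visible(h, W, rng):
    """Block update: all visible units are sampled at once given the hiddens."""
    probs = [sigmoid(sum(W[i][j] * h[j] for j in range(len(h)))) for i in range(len(W))]
    return [1 if rng.random() < p else 0 for p in probs]


def cd1_update(W, batch, lr, rng):
    """One CD-1 update: a single Gibbs step away from the data per pattern."""
    n_v, n_h = len(W), len(W[0])
    grad = [[0.0] * n_h for _ in range(n_v)]
    for v0 in batch:
        h0 = sample_hidden(v0, W, rng)   # visibles clamped to the data
        v1 = sample_visible(h0, W, rng)  # one step of free running
        h1 = sample_hidden(v1, W, rng)
        for i in range(n_v):
            for j in range(n_h):
                grad[i][j] += (v0[i] * h0[j] - v1[i] * h1[j]) / len(batch)
    for i in range(n_v):
        for j in range(n_h):
            W[i][j] += lr * grad[i][j]


rng = random.Random(0)
batch = [[1, 0, 1, 0], [0, 1, 0, 1]]  # toy data with 4 visible units
W = [[0.0] * 3 for _ in range(4)]      # 4 visible x 3 hidden
cd1_update(W, batch, lr=0.1, rng=rng)
```

Because there are no visible-visible or hidden-hidden connections, each layer can be sampled in one pass, which is what makes this cheaper than updating every node separately.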
The learning rule has the same appearance. I was asking because I haven't studied general Boltzmann machines in any real depth [yet], and I didn't want to start speaking about a topic I'm not familiar with.