I'm a little confused on how to learn edge weights in a Boltzmann machine -- is the following correct?

I have a set of visible units (corresponding, say, to pixels in an image) and a set of hidden units. After randomly initializing the weights, I alternate between the following two phases until some stopping rule is reached:

  1. Clamped phase - the states of the visible units are fixed, so only the states of the hidden units are updated (according to the Boltzmann stochastic activation rule). We update until the network reaches equilibrium. Once we reach equilibrium, we continue updating N more times (for some predefined N), keeping track of the average of $x_i x_j$ (where $x_i, x_j$ are the states of nodes $i$ and $j$). After those N equilibrium updates are finished, we update $w_{ij} = w_{ij} + \frac{1}{C}\,\mathrm{Average}(x_i x_j)$.

  2. Free phase - the states of all the units are updated. Once we reach equilibrium, we similarly continue updating N' more times, but instead of adding correlations at the end, we subtract: $w_{ij} = w_{ij} - \frac{1}{C}\,\mathrm{Average}(x_i x_j)$.
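
For concreteness, the stochastic activation rule used in both phases can be sketched roughly as follows. This is only a minimal sketch under assumptions that are not stated above: units are binary (0/1), the temperature defaults to 1, there are no bias terms, and the names `sigmoid`, `update_unit` and `sweep` are hypothetical helpers.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def update_unit(states, weights, i, rng, temperature=1.0):
    """Resample unit i in place given all other units; returns its new state."""
    # Net input to unit i (self-connections are skipped).
    net = sum(weights[i][j] * states[j] for j in range(len(states)) if j != i)
    p_on = sigmoid(net / temperature)
    states[i] = 1 if rng.random() < p_on else 0
    return states[i]

def sweep(states, weights, free_units, rng):
    """One pass, in random order, over the units that are allowed to change."""
    order = list(free_units)
    rng.shuffle(order)
    for i in order:
        update_unit(states, weights, i, rng)
```

In the clamped phase `free_units` would contain only the hidden indices, so the visible states are never touched; in the free phase it would contain every index.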

My two main points of confusion are:

  1. When we're in the clamped phase, do we always reset the visible units to one of the patterns we want to learn? (Or do we somehow leave the visible units in the state they were in at the end of the free phase?) Does it matter whether we cycle through all the patterns before hitting the first pattern again, or do we only switch to the second pattern after we're completely done iterating with the first pattern?
  2. Do we do a batch update of the weights at the end of each phase, or do we update the weights after each equilibrium step? Or are both fine?

asked Feb 20 '11 at 02:14

grautur

edited Feb 20 '11 at 02:15

Can you be more specific on one point: is this for a Boltzmann machine, or a restricted Boltzmann machine?

(Feb 22 '11 at 16:57) Brian Vandenberg

This is for a general Boltzmann machine. (Are restricted Boltzmann machines trained differently? I thought the only difference was in the structure of restricted Boltzmann machines, and that they were trained the same as general Boltzmann machines. As in, even though all the visible/all the hidden units can be updated in parallel for an RBM, the update rule is still the same.)

(Feb 23 '11 at 01:53) grautur

Well, there are many learning rules but they are basically the same. As you said, one can do block Gibbs sampling instead of sampling each node separately - which makes it a lot more efficient. Often Contrastive Divergence is used, which means just doing one step of Gibbs sampling between updates. But that is not such a good idea...
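
To illustrate what block Gibbs sampling and a single Contrastive Divergence step look like for a restricted Boltzmann machine, here is a rough sketch. It assumes binary 0/1 units, no biases, a weight matrix `W` of shape (visible x hidden), and sampled states (rather than probabilities) in the statistics; the function names are made up for illustration. It only shows the mechanics and is not a recommendation to use one-step CD.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sample_hidden(v, W, rng):
    # Hidden units are conditionally independent given v, so the whole
    # layer can be sampled in one block.
    n_hidden = len(W[0])
    probs = [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(n_hidden)]
    return [1 if rng.random() < p else 0 for p in probs]

def sample_visible(h, W, rng):
    # Likewise, all visible units are sampled at once given h.
    probs = [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(W))]
    return [1 if rng.random() < p else 0 for p in probs]

def cd1_update(W, v_data, rng, lr=0.1):
    """One CD-1 update: a single Gibbs step away from the data, then update W."""
    h_data = sample_hidden(v_data, W, rng)      # positive (clamped) statistics
    v_model = sample_visible(h_data, W, rng)    # one block Gibbs step
    h_model = sample_hidden(v_model, W, rng)    # negative statistics
    for i in range(len(W)):
        for j in range(len(W[0])):
            W[i][j] += lr * (v_data[i] * h_data[j] - v_model[i] * h_model[j])
    return W
```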

(Feb 23 '11 at 11:00) Andreas Mueller

The learning rule has the same appearance. I was asking because I haven't studied general Boltzmann machines in any real depth [yet], and I didn't want to start speaking about a topic I'm not familiar with.

(Feb 24 '11 at 10:52) Brian Vandenberg

2 Answers:

Hi. So first to the algorithm: I think the "right" thing to do is to run stochastic updates of the units only "until convergence" (I don't know how you want to judge that, or how to judge whether you are sampling from equilibrium; I would just do a fixed number of steps). Then you stop updating and just save $w_d = x_i x_j$. Then you let the system run freely for some number of iterations, save $w_m = x_i x_j$, and do the update $w_{ij} = w_{ij} + \nu (w_d - w_m)$, with $\nu$ the learning rate. I think you should not change the weights using the positive (clamped) phase before sampling in the negative (free) phase (which is how I read your algorithm).

To the questions: 1) Yes. Always reset the visible units to the data you want to learn. To the second part of the question: I am not quite sure what you mean by "done iterating with the first pattern". I would definitely not try to learn one pattern first and then another. 2) Definitely do not update the weights after each equilibrium step; update once, using both phases together. Also, I would do batch updates over the whole training set, or at least mini-batch updates if your training set is large.
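
Putting these points together, one learning step could be sketched roughly like this. It is a minimal sketch, not a definitive implementation, and it rests on several assumptions: binary 0/1 units, no biases, a symmetric weight matrix `W` over all units with the visible units listed first, a fixed number of burn-in sweeps instead of a real convergence test, and a free phase that starts from a random state. All function and parameter names are hypothetical.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def gibbs_sweep(x, W, units, rng):
    # Resample each listed unit given the current state of all the others.
    for i in units:
        net = sum(W[i][j] * x[j] for j in range(len(x)) if j != i)
        x[i] = 1 if rng.random() < sigmoid(net) else 0

def mean_correlations(x, W, units, rng, burn_in, n_samples):
    # Fixed number of burn-in sweeps, then average x_i * x_j over n_samples sweeps.
    n = len(x)
    corr = [[0.0] * n for _ in range(n)]
    for _ in range(burn_in):
        gibbs_sweep(x, W, units, rng)
    for _ in range(n_samples):
        gibbs_sweep(x, W, units, rng)
        for i in range(n):
            for j in range(n):
                corr[i][j] += x[i] * x[j] / n_samples
    return corr

def learning_step(W, batch, n_hidden, rng, lr=0.05, burn_in=20, n_samples=20):
    """One weight update from a mini-batch of visible patterns (lists of 0/1)."""
    n_visible = len(batch[0])
    n = n_visible + n_hidden
    hidden = list(range(n_visible, n))
    all_units = list(range(n))
    delta = [[0.0] * n for _ in range(n)]
    for pattern in batch:
        # Clamped phase: visible units are reset to the data, only hidden units move.
        x = list(pattern) + [rng.randint(0, 1) for _ in hidden]
        w_d = mean_correlations(x, W, hidden, rng, burn_in, n_samples)
        # Free phase: all units are resampled (assumed start: a random state).
        x = [rng.randint(0, 1) for _ in all_units]
        w_m = mean_correlations(x, W, all_units, rng, burn_in, n_samples)
        for i in range(n):
            for j in range(n):
                delta[i][j] += (w_d[i][j] - w_m[i][j]) / len(batch)
    # One update from both phases together; W was held fixed while sampling.
    for i in range(n):
        for j in range(n):
            if i != j:
                W[i][j] += lr * delta[i][j]
    return W

rng = random.Random(0)
W = [[0.0] * 5 for _ in range(5)]  # 3 visible + 2 hidden units
learning_step(W, [[1, 0, 1], [1, 1, 0]], n_hidden=2, rng=rng)
```

Note that the weights are only changed in the final loop, after the statistics of both phases have been collected for every pattern in the batch.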

Cheers, Andy

answered Feb 20 '11 at 07:52

Andreas Mueller

With regard to Andy's answer for #1, I'll add a caveat. In the basic algorithm, you'd reset the visible units to the original data.

However, Tijmen Tieleman published a paper about a modification to the algorithm ("Using Fast Weights To Improve Persistent Contrastive Divergence") in which you wouldn't necessarily reset it.

(Feb 22 '11 at 17:05) Brian Vandenberg

Just to confirm: when you say "update using both phases together", do you mean I don't update once after the clamped phase and then once again after the free phase, but only once after both phases have finished?

(Feb 23 '11 at 01:50) grautur

@Grautur: yes, that's what I mean. @Brian: I usually use PCD without fast weights since it is very easy to implement. I think the paper is the one before the one you cite, called "Training Restricted Boltzmann Machines using approximations to the likelihood gradient" or something similar. I did some work on these methods and they are kind of unstable. They all diverge sooner or later. If you pick a small learning rate, they reach bad optima. If you pick a large learning rate, they diverge really quickly and are really unstable. Then it's not even possible to evaluate the model with AIS any more...
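
To make the difference from the basic algorithm concrete, here is a rough sketch of a PCD-style update for an RBM without fast weights. The key point is that the negative-phase chain is carried over between updates instead of being reset to the data. It assumes binary 0/1 units, no biases, a single persistent chain, and a weight matrix `W` of shape (visible x hidden); the names are hypothetical and this is not taken from the paper's own code.

```python
import math
import random

def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sample_hidden(v, W, rng):
    # Block-sample the hidden layer given the visible layer.
    probs = [sigmoid(sum(v[i] * W[i][j] for i in range(len(v))))
             for j in range(len(W[0]))]
    return [1 if rng.random() < p else 0 for p in probs]

def sample_visible(h, W, rng):
    # Block-sample the visible layer given the hidden layer.
    probs = [sigmoid(sum(W[i][j] * h[j] for j in range(len(h))))
             for i in range(len(W))]
    return [1 if rng.random() < p else 0 for p in probs]

def pcd_update(W, v_data, v_chain, rng, lr=0.01):
    """One PCD step; returns the new persistent visible state."""
    # Positive phase: statistics from the data, as usual.
    h_data = sample_hidden(v_data, W, rng)
    # Negative phase: continue the persistent chain, do NOT restart from v_data.
    h_chain = sample_hidden(v_chain, W, rng)
    v_chain = sample_visible(h_chain, W, rng)
    h_chain = sample_hidden(v_chain, W, rng)
    for i in range(len(W)):
        for j in range(len(W[0])):
            W[i][j] += lr * (v_data[i] * h_data[j] - v_chain[i] * h_chain[j])
    return v_chain

rng = random.Random(0)
W = [[0.0] * 2 for _ in range(3)]
chain = [rng.randint(0, 1) for _ in range(3)]   # initialised once, then reused
for v in ([1, 0, 1], [1, 1, 0], [1, 0, 1]):
    chain = pcd_update(W, v, chain, rng)
```

The learning rate matters a lot here, since the chain only stays close to the model distribution if the weights change slowly relative to the sampling.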

answered Feb 23 '11 at 11:03

Andreas Mueller

I haven't used PCD a lot, but from conversations I've had it sounds like many of Hinton's students use it (Tijmen Tieleman, Alexander Krizhevsky, and I think James Martens said something about using it as well). I'm a little surprised they'd stick with it if it isn't reliable.

(Feb 24 '11 at 10:55) Brian Vandenberg

Well, I use it, too. It all depends on how you define reliable. You can find some settings that work, but there is not really a good way to do cross-validation or something like that. I had a paper on that at the deep learning workshop at last year's NIPS and talked to some of Hinton's people about it. There are more and more learning methods coming out and more and more evaluation methods, too. But from my point of view, there is nothing that you can "just use" (yet).

(Feb 24 '11 at 11:33) Andreas Mueller

Cool, I'll have to read your paper on it.

I've been mulling over this idea in my head of analyzing these models from the perspective of dynamical systems (bifurcations & all that fun stuff).

In a restricted Boltzmann machine, I wouldn't expect there to be any odd behavior like you'd see in systems with feedback (RNNs, for example; because of the feedback, it's possible for the same set of parameters to exhibit different stability behavior depending on the history of the system), but a Boltzmann machine -- at least, as I understand them -- is like an RNN in that the connections can induce feedback in the system.

(Feb 24 '11 at 13:46) Brian Vandenberg

I don't know very much about nonlinear systems, so I cannot really comment on RNNs. I just want to make it clear that there are two different issues here: inference and learning. Of course they are tightly coupled, but I think they should be analyzed separately. You were referring to something similar to inference, i.e. finding points of attraction in an RNN - which I guess is similar to finding the stationary distribution of a Markov chain. The instability I was referring to is during learning, i.e. wild jumps in performance when adjusting the parameters. In a paper that is currently under review, we argue that these instabilities in learning are caused by the inability to find the stationary distribution, so these issues have something to do with each other.

Well, I don't really know how to sum up this comment but I hope you can sort of get my point ;)

(Feb 25 '11 at 06:56) Andreas Mueller

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.