In the parameter update for CD it appears that the terms in the log-likelihood that are constant in x (the random variable of interest) would cancel (see Notes on CD). Is this true? And if it is true, is it troubling?

asked Dec 13 '10 at 15:26

Eno Onmai

edited Dec 13 '10 at 23:02

I think you need to be a little more specific.

The parameter update takes the form:

dW_(i,j) = <v_i * h_j>_0 - <v_i * h_j>_n
or shorthand:
Dw = <x * y>_0 - <x * y>_n

... where <x * y> is the expected value of the product of a visible and a hidden unit. Since Cov(x,y) = E[xy] - E[x]E[y], this expectation is closely related to the covariance of the hidden and visible units.

<>_0, given known weights, is the expected value of that product when the visible units are driven by your data (your sample space).

<>_n, given known weights and a sufficiently large n, is the expected value of that product under the distribution the model has learned.

The difference between the two approximates the gradient of a difference of Kullback-Leibler divergences (which is where "contrastive divergence" gets its name), and following it should increase P(v) over your sample space.
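
To make that concrete, here is a minimal numpy sketch of those pairwise statistics. The arrays below are random placeholders standing in for states that would really come from the data and from the Gibbs chain, and the names (pair_stats, v0, hn, ...) are made up for illustration:

import numpy as np

def pair_stats(v, h):
    """Estimate <v_i * h_j>: average the outer product of a batch of
    visible states v and hidden states h (one state per row)."""
    return v.T @ h / v.shape[0]            # shape: (n_visible, n_hidden)

# Toy batch: 100 samples, 6 visible and 4 hidden binary units.
rng = np.random.default_rng(0)
v0 = rng.integers(0, 2, (100, 6))          # visible states driven by the data
h0 = rng.integers(0, 2, (100, 4))          # hidden states sampled given v0
vn = rng.integers(0, 2, (100, 6))          # visible states after n Gibbs steps
hn = rng.integers(0, 2, (100, 4))          # hidden states after n Gibbs steps

dW = pair_stats(v0, h0) - pair_stats(vn, hn)   # Dw = <x * y>_0 - <x * y>_n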

-Brian

(Dec 13 '10 at 16:05) Brian Vandenberg

2 Answers:

The paper A Fast Learning Algorithm For Deep Belief Nets (2006) gives a pretty good outline of how contrastive divergence learning works, and is the paper that helped me to finally grasp the concept. The guts of the theory are in the appendices of the paper.

I was hoping you'd provide something a little more concrete when you edited your question -- e.g., reference a specific equation (such as equation 16) and ask about it -- but I'll do what I can to answer based on what you've given.

I'm going to use some shorthand to keep my exposition terse. <>_0, <>_1, ..., <>_n should be fairly obvious; feel free to ask for clarification if not.

The weight update step for CD1, CD2, CD3, ..., CDn looks like this:

W_(t+1) = W_(t) + eta * ([<>_0 - <>_1])
W_(t+1) = W_(t) + eta * ([<>_0 - <>_1] + [<>_1 - <>_2] )
W_(t+1) = W_(t) + eta * ([<>_0 - <>_1] + [<>_1 - <>_2] + [<>_2 - <>_3] )
W_(t+1) = W_(t) + eta * ([<>_0 - <>_1] + [<>_1 - <>_2] + [<>_2 - <>_3] + ... + [<>_(n-1) - <>_n] )

Notice that all the intermediate terms cancel. The explanation of why this works is in the paper I linked. The result for CDn is:

W_(t+1) = W_(t) + eta * (<>_0 - <>_n)
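
For concreteness, here is a minimal numpy sketch of that CDn update for a binary RBM, assuming sigmoid conditionals and sampled binary states; the function and variable names (sample_h, cd_n_update, ...) are illustrative, not from the paper:

import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def sample_h(v, W, b_h):
    """Sample binary hidden states given visible states; p(h=1|v) is a sigmoid."""
    p = sigmoid(v @ W + b_h)
    return (rng.random(p.shape) < p).astype(float)

def sample_v(h, W, b_v):
    """Sample binary visible states given hidden states."""
    p = sigmoid(h @ W.T + b_v)
    return (rng.random(p.shape) < p).astype(float)

def cd_n_update(v0, W, b_v, b_h, n=1, eta=0.1):
    """CDn weight update: run the Gibbs chain for n steps, then use only
    the statistics at step 0 and step n (the intermediate terms cancel)."""
    h0 = sample_h(v0, W, b_h)
    vk, hk = v0, h0
    for _ in range(n):                      # Gibbs chain: v -> h -> v -> h ...
        vk = sample_v(hk, W, b_v)
        hk = sample_h(vk, W, b_h)
    stats_0 = v0.T @ h0 / v0.shape[0]       # <v_i h_j>_0
    stats_n = vk.T @ hk / v0.shape[0]       # <v_i h_j>_n
    return W + eta * (stats_0 - stats_n)

# Toy usage: 20 samples of a 6-unit visible layer, 4 hidden units.
v0 = rng.integers(0, 2, (20, 6)).astype(float)
W, b_v, b_h = rng.normal(0, 0.1, (6, 4)), np.zeros(6), np.zeros(4)
W = cd_n_update(v0, W, b_v, b_h, n=3)

Note that although the chain runs for n steps, only the statistics at step 0 and step n enter the update, exactly as in the telescoped sum above.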

-Brian


answered Dec 13 '10 at 23:55

Brian Vandenberg

There is a practical guide to training RBMs by Hinton; you should maybe have a look at that. I am not quite sure what your question is, but Brian is right: it is quite nicely explained in the original paper. To summarize CD in two sentences: the gradient of the log-likelihood of the training data can be expressed as the difference of an expected value with respect to the data and an expected value with respect to the model distribution. The second one can be calculated with an MCMC scheme, which Hinton proposes to approximate with just one step. Whether this is troubling or not, everyone should judge for themselves ;)
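
To see why an MCMC scheme is needed for the second expectation at all: computed exactly, it is a sum over every joint configuration of the visible and hidden units, which is only feasible for a toy model. A minimal numpy sketch under that assumption (the sizes and names here are made up):

import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n_v, n_h = 4, 3                                   # tiny RBM: 2^(4+3) = 128 states
W = rng.normal(0, 0.5, (n_v, n_h))
b_v, b_h = np.zeros(n_v), np.zeros(n_h)

# Exact <v_i h_j>_model: enumerate all joint states and weight each
# outer product by its (unnormalized) Boltzmann probability exp(-E(v,h)).
states_v = np.array(list(product([0.0, 1.0], repeat=n_v)))
states_h = np.array(list(product([0.0, 1.0], repeat=n_h)))
stats, Z = np.zeros((n_v, n_h)), 0.0
for v in states_v:
    for h in states_h:
        w = np.exp(v @ W @ h + b_v @ v + b_h @ h)   # unnormalized p(v, h)
        stats += w * np.outer(v, h)
        Z += w
exact_model_stats = stats / Z                     # intractable for realistic sizes

CD sidesteps this sum by starting the chain at a data vector and running it for only one (or a few) Gibbs steps.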

answered Dec 14 '10 at 10:16

Andreas Mueller

Actually, it is indeed troubling; the plain CD version should only be used if you know exactly what you're doing:

http://www.neuroinformatik.ruhr-uni-bochum.de/ini/PEOPLE/igel/EAotDoGSBLAfRBM.pdf

http://www.ais.uni-bonn.de/papers/nips10ws_schulz_mueller_behnke.pdf

(Dec 15 '10 at 14:27) Hannes S

He's not referring to the fact that CD is only an approximation to the true gradient, though. Those papers are certainly informative, but as far as I can tell he seems to be talking about the math being wrong in deriving the weight update.

But he hasn't returned to add anything to the discussion, so that's a wild guess on my part.

-Brian

(Dec 15 '10 at 14:40) Brian Vandenberg
