In the parameter update for CD, it appears that the terms in the log-likelihood that are constant in x (the random variable of interest) would cancel (see Notes on CD). Is this true? And if it is true, is it troubling?
The paper A Fast Learning Algorithm For Deep Belief Nets (2006) gives a pretty good outline of how contrastive divergence learning works, and is the paper that helped me finally grasp the concept. The guts of the theory are in the appendices of the paper. I was hoping you'd provide something a little more concrete when you edited your question -- e.g., reference a specific equation (say, equation 16) and ask about it -- but I'll do what I can to answer based on what you've given. I'm going to use some shorthand to keep the exposition terse: <>_0, <>_1, ..., <>_n denote the expected value of the visible/hidden product after 0, 1, ..., n steps of Gibbs sampling; feel free to ask for a clarification if that isn't obvious.

The weight update step for CD1, CD2, CD3, ..., CDn looks like this:

W_(t+1) = W_(t) + eta * [ (<>_0 - <>_1) + (<>_1 - <>_2) + ... + (<>_(n-1) - <>_n) ]

Notice all the terms that cancel. The explanation of why this occurs is in the paper I linked. The result for CDn is:

W_(t+1) = W_(t) + eta * (<>_0 - <>_n)

-Brian
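To see the cancellation concretely: the per-step terms form a telescoping sum, so only the endpoints <>_0 and <>_n survive. Below is a rough numpy sketch of that arithmetic for a toy binary RBM; the sizes, helper names, and the omission of bias terms are simplifications of my own, not anything taken from the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Toy binary RBM; W is (visible x hidden), biases omitted for brevity.
    n_vis, n_hid, batch = 6, 4, 32
    W = 0.01 * rng.standard_normal((n_vis, n_hid))
    v = (rng.random((batch, n_vis)) < 0.5).astype(float)  # stand-in for training data
    eta, n = 0.1, 3                                        # learning rate, CD-n steps

    # Run the Gibbs chain and record <>_k = <v_i h_j>_k at every step k = 0 .. n.
    stats = []
    for k in range(n + 1):
        ph = sigmoid(v @ W)                    # P(h = 1 | v)
        stats.append(v.T @ ph / batch)         # batch estimate of <v_i h_j>_k
        if k < n:
            h = (rng.random(ph.shape) < ph).astype(float)
            pv = sigmoid(h @ W.T)              # P(v = 1 | h)
            v = (rng.random(pv.shape) < pv).astype(float)

    # Summing the per-step terms eta * (<>_{k-1} - <>_k) telescopes ...
    summed = sum(eta * (stats[k - 1] - stats[k]) for k in range(1, n + 1))
    # ... so only the endpoints remain, which is exactly the CDn result:
    assert np.allclose(summed, eta * (stats[0] - stats[n]))
    W = W + eta * (stats[0] - stats[n])        # W_(t+1) = W_(t) + eta * (<>_0 - <>_n)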
There is A Practical Guide to Training Restricted Boltzmann Machines by Hinton; you should maybe have a look at that. I am not quite sure what your question is, but Brian is right, it is quite nicely explained in the original paper. To summarize CD in two sentences: the gradient of the log-likelihood of the training data can be expressed as the difference between an expected value with respect to the data and an expected value with respect to the model distribution. The second one can be estimated with an MCMC scheme, which Hinton proposes to approximate with just one step. Whether this is troubling or not, everyone should judge for themselves ;) Actually, it is indeed troubling; the plain CD version should only be used if you know exactly what you're doing: http://www.neuroinformatik.ruhr-uni-bochum.de/ini/PEOPLE/igel/EAotDoGSBLAfRBM.pdf http://www.ais.uni-bonn.de/papers/nips10ws_schulz_mueller_behnke.pdf
(Dec 15 '10 at 14:27)
Hannes S
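To make the two-sentence summary in the comment above concrete, here is a hedged numpy sketch of the one-step (CD-1) estimate: the first expectation is taken over the data, and the second is approximated with a single Gibbs step. The function name, sizes, and the omission of bias terms are simplifications of my own.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_gradient(v_data, W, rng):
        """CD-1 estimate of the log-likelihood gradient w.r.t. W: the data
        expectation of v*h minus a one-Gibbs-step stand-in for the model
        expectation. Bias terms are omitted to keep the sketch short."""
        batch = v_data.shape[0]
        ph0 = sigmoid(v_data @ W)               # P(h = 1 | v) on the data
        positive = v_data.T @ ph0 / batch       # expectation w.r.t. the data
        h0 = (rng.random(ph0.shape) < ph0).astype(float)
        pv1 = sigmoid(h0 @ W.T)                 # one MCMC (Gibbs) step ...
        v1 = (rng.random(pv1.shape) < pv1).astype(float)
        ph1 = sigmoid(v1 @ W)
        negative = v1.T @ ph1 / batch           # ... approximates the model expectation
        return positive - negative

    # Hypothetical usage with toy sizes:
    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((6, 4))
    v_data = (rng.random((32, 6)) < 0.5).astype(float)
    W += 0.1 * cd1_gradient(v_data, W, rng)     # take a small step up the likelihood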
He's not referring to the fact that CD is only an approximation to the true gradient, though. Those papers are certainly informative, but as far as I can tell he seems to be talking about the math being wrong in deriving the weight update. But he hasn't returned to add anything to the discussion, so that's a wild guess on my part. -Brian
(Dec 15 '10 at 14:40)
Brian Vandenberg
I think you need to be a little more specific.
The parameter update takes the form:
W_(t+1) = W_(t) + eta * (<xy>_0 - <xy>_n)
where <xy> is the expected value of the product of a visible and hidden unit. Cov(x,y) = E[xy] - E[x]E[y], so this expectation is related to the covariance of the hidden and visible units.
<xy>_0, given known weights, is the expected value of that product over your sample space (the training data).
<xy>_n, given known weights and a sufficiently large n, is the expected value of that product under the distribution the model has learned.
The difference between the two approximates the gradient of a difference of Kullback-Leibler divergences, and following it should increase P(v_i) over your sample space.
-Brian
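To illustrate the two expectations and the covariance remark above, here is a small numpy sketch of the <xy>_0 term under known weights and its relation to Cov(x,y); the sizes and data are made up, and bias terms are again omitted.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    W = 0.01 * rng.standard_normal((6, 4))            # known weights, toy sizes
    v = (rng.random((1000, 6)) < 0.5).astype(float)   # stand-in for the sample space

    # <xy>_0: expected product of visible unit i and hidden unit j, with h
    # drawn from P(h = 1 | v) under the known weights, averaged over the data.
    ph = sigmoid(v @ W)
    E_vh = v.T @ ph / len(v)

    # Cov(x,y) = E[xy] - E[x]E[y], so the same statistics give the covariance.
    E_v = v.mean(axis=0)
    E_h = ph.mean(axis=0)
    cov_vh = E_vh - np.outer(E_v, E_h)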