I started working through the problem of how to apply Hessian-free learning to RBMs, and ran into a hurdle I don't quite understand. To start with, I chose a simple binary-to-binary RBM. If I measure success in terms of the cross-entropy function on the visible units, I can apply the R operator to that, and I end up with an expression of the form:
... where v1 represents the data distribution, vN is the visibles after N steps of Gibbs sampling. This is where I run into a conundrum. vN comes from sampling a distribution. With binary units, it's usually sampled using a process of the form:
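Roughly, with u_j drawn uniformly from [0,1], σ the logistic sigmoid, c the visible biases, W the weights, and h^(n) the hidden states at step n (the notation here is mine), the step looks something like:

$$
v_j^{(n+1)} = \begin{cases} 1 & \text{if } u_j < \sigma\!\left(c_j + \sum_i W_{ij}\, h_i^{(n)}\right) \\ 0 & \text{otherwise} \end{cases}, \qquad u_j \sim \mathrm{Uniform}(0,1)
$$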
For something normally distributed, it'd look more like:
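For instance, with a fixed noise scale σ_j (again a sketch in my own notation rather than the exact expression):

$$
v_j^{(n+1)} = \mu_j + \sigma_j\,\epsilon_j, \qquad \mu_j = c_j + \sum_i W_{ij}\, h_i^{(n)}, \qquad \epsilon_j \sim \mathcal{N}(0,1)
$$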
... or some other similar sampling procedure. In every case, however, vN passes through some function that I haven't been able to describe in a form that makes it straightforward to say what R{vN} is. In the Gaussian case above:
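Treating the noise draw ε_j as independent of the parameters, that works out to something like:

$$
R\{v_j^{(n+1)}\} = R\{\mu_j\} = R\{c_j\} + \sum_i \left( R\{W_{ij}\}\, h_i^{(n)} + W_{ij}\, R\{h_i^{(n)}\} \right)
$$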
... whereas I'm not sure about the binary-binary case. The left side of the above is pretty straightforward -- it's just a linear combination. But the right side is something else entirely. It depends on the quantities being optimized (the weights and biases), so it can't be ignored, but it's not just a probability distribution either; it's a function for sampling from that distribution. I'll readily admit I'm not an expert on probability theory, so I'm somewhat out of my depth here. Anyone care to weigh in on this?
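To make the sticking point concrete, here is a small numpy sketch (my own illustration, nothing from a paper) of propagating the R quantities through the deterministic part of one visible-to-hidden Gibbs half-step for binary units; the thresholding against the uniform draw is exactly the step where R{v} stops being obvious:

```python
import numpy as np

# W: weights, b: hidden biases, c: visible biases (my notation).
# RW, Rb, Rc: the direction the Hessian is being multiplied against.

rng = np.random.default_rng(0)

n_vis, n_hid = 6, 4
W  = rng.normal(scale=0.1, size=(n_vis, n_hid))
b  = np.zeros(n_hid)
RW = rng.normal(size=(n_vis, n_hid))   # perturbation direction for W
Rb = rng.normal(size=n_hid)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

v0  = rng.integers(0, 2, size=n_vis).astype(float)  # data visible vector
Rv0 = np.zeros(n_vis)                                # the data does not depend on the parameters

# Hidden pre-activations and probabilities, plus their R quantities
# (chain rule through the sigmoid: R{sigmoid(a)} = sigmoid'(a) * R{a}).
a_h  = v0 @ W + b
Ra_h = v0 @ RW + Rv0 @ W + Rb
p_h  = sigmoid(a_h)
Rp_h = p_h * (1.0 - p_h) * Ra_h

# Sampling step: h = (u < p_h) with u ~ Uniform(0,1).
# This indicator is piecewise constant in the parameters, so there is
# no obvious R{h}; the question is what to propagate from here.
u = rng.uniform(size=n_hid)
h = (u < p_h).astype(float)

print("R{p_h} =", Rp_h)   # well-defined
print("h      =", h)      # R{h} is the unclear part
```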
Pearlmutter's "Fast exact multiplication by the Hessian" contains a section (4.3) on stochastic Boltzmann machines. After obtaining Hv = (R{pij+} - R{pij-})/T, he uses the relationship between the state probabilities and the energy to get an expression for R{pij}. Quoting the end of the section:
Thank you for pointing that out; I forgot his paper had that section on Boltzmann machines.
(Mar 09 '12 at 11:19)
Brian Vandenberg
Sorry for the silly request for clarification, but what is R{} supposed to be in general? Can you point to a paper with these equations?
@Alexandre: The R{} operator is defined as R{f(w)} = ∂/∂r f(w + r·v) |_{r=0}, where w and v are vectors. It comes from the Pearlmutter paper mentioned above, and it's used to compute the product of the Hessian with a vector for neural networks using a backprop-like algorithm.
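A quick numerical illustration of that definition, with f an arbitrary smooth function of a weight vector (nothing RBM-specific, just an example I made up):

```python
import numpy as np

# R{f(w)} = d/dr f(w + r*v) evaluated at r = 0, i.e. the directional
# derivative of f at w along v. For f(w) = sigmoid(x . w) this is
# sigmoid'(x . w) * (x . v), which we can check with a finite difference.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = rng.normal(size=5)
w = rng.normal(size=5)
v = rng.normal(size=5)   # direction of the Hessian-vector product

def f(w):
    return sigmoid(x @ w)

# Exact R{f(w)} via the chain rule
s = sigmoid(x @ w)
R_exact = s * (1.0 - s) * (x @ v)

# Finite-difference approximation of d/dr f(w + r*v) at r = 0
r = 1e-6
R_approx = (f(w + r * v) - f(w - r * v)) / (2 * r)

print(R_exact, R_approx)   # should agree to several decimal places
```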
@Brian: I don't have time to work out the math completely, but I would actually start with applying the operator to the likelihood gradient itself. The first part should just be R{-dF/dW} - R{d log Z/dW}, where F is the free energy and Z the partition function. You will have to approximate the term that depends on Z. I foresee some numerical trouble when it comes to actually using these very noisy estimates in the remainder of the algorithm, though, and the gradient itself will be based on sampling as well...
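Spelling that out a bit (standard RBM identities, with θ standing for any weight or bias):

$$
\log p(v) = -F(v) - \log Z, \qquad
\frac{\partial \log p(v)}{\partial \theta} = -\frac{\partial F(v)}{\partial \theta} + \mathbb{E}_{p(v')}\!\left[\frac{\partial F(v')}{\partial \theta}\right],
$$

so applying R gives R{-∂F(v)/∂θ}, which can be propagated exactly, plus the R of the model-expectation term, and it is that second term that has to be approximated by sampling.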
@Philemon: I think you're on the right track there. Earlier this morning it occurred to me that in the RBM setting, the model derivation didn't depend on Gibbs sampling; that was just a tool to reach an approximation of the model's distribution. The results thus obtained are then used to calculate an expectation. I should be applying R{} to the expectation to get the R_2 pass (if I'm doing the Hessian instead of the Gauss-Newton approximation).
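One subtlety with applying R{} to that expectation: the distribution itself depends on the parameters, so the perturbation has to go through both the integrand and the probabilities, roughly

$$
R\!\left\{\mathbb{E}_{p_\theta}\!\left[g_\theta(v)\right]\right\}
= \mathbb{E}_{p_\theta}\!\left[R\{g_\theta(v)\}\right]
+ \sum_{v} R\{p_\theta(v)\}\, g_\theta(v),
$$

which is presumably where the expression for R{pij} from Pearlmutter's paper comes in.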
Now the only thing I'm unsure of is what the RBM counterpart of the F_0 pass is.