In "Learning a generative model of images by factoring appearance and shape", Le Roux et al. build a Beta RBM using the formulation from Welling et al.'s exponential family harmonium paper. They conclude that the resulting RBM is very difficult to train, so they modify the energy function to make it symmetric in the hidden units.

From the paper (page 8): "Unfortunately, training such an RBM proved very difficult, as turning a hidden unit on could only increase the precision of the conditional distribution. Furthermore, there is no easy way of enforcing the positivity constraint on the parameters of the beta distributions (enforcing all the elements of a, b, W, and U to be positive resulted in too hard a constraint)."

Emphasis is mine. The way I see it, they're saying there are two problems with this model: the fact that turning on a hidden unit can only increase precision (= decrease variance), and the fact that constraining all parameters to be positive is too restrictive.

I understand the second problem, but I don't understand why the first problem is actually a problem; why does the fact that turning on a hidden unit can only increase precision, not decrease it, make training the model harder?

I understand that making the energy function symmetric solves both of these problems (it allows for weaker parameter constraints, and makes it possible for a hidden unit to decrease the precision when turned on).

(By the way, I'm pretty sure they intend the word 'precision' to mean 'the inverse of the variance' in this context - but I might be wrong).
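(To make that concrete, here is a little sketch I did using the standard Beta mean/concentration identity. It shows why adding a positive amount to both Beta parameters, which is roughly what a hidden unit with positive weights does in this parameterisation, shrinks the variance; the exact parameterisation in the paper may differ slightly.)

    # Sketch: for a Beta(alpha, beta) unit, var = mu * (1 - mu) / (alpha + beta + 1)
    # with mean mu = alpha / (alpha + beta). A hidden unit with positive weights adds
    # to both alpha and beta, i.e. it raises the concentration alpha + beta, which at
    # a fixed mean can only shrink the variance (raise the precision).
    mu = 0.3
    for concentration in (2.0, 5.0, 20.0, 100.0):
        alpha, beta = mu * concentration, (1.0 - mu) * concentration
        var = alpha * beta / ((alpha + beta) ** 2 * (alpha + beta + 1.0))
        print(f"alpha + beta = {concentration:6.1f}   variance = {var:.5f}")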

asked Mar 16 '12 at 20:56


Sander Dieleman


One Answer:

Suppose a, b, W, and U in equation 2.8 of the paper you mention are positive. Consider the terms in E(v,h) involving h. They are:

  • -log(v)^T W h
  • -log(e - v)^T U h
  • -c^T h

Every entry of -log(v) and -log(e - v) is always positive, because the entries of v lie in (0, 1) and e is the vector of all ones. Since W and U are positive and the hidden units are binary, if we neglect the bias term -c^T h, turning on a hidden unit can only increase the energy. The model therefore typically learns to turn off all of its hidden units, which is very undesirable: if they are always off, there is no point in having them. Since the hidden biases are also learned, the model can easily conspire to turn all the hidden units off during learning.
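A quick numerical check of this argument (toy numbers of my own, not from the paper; the h-dependent energy terms are the three listed above):

    import numpy as np

    # Sketch: with W and U positive and v in (0, 1), both -log(v) and -log(e - v)
    # have strictly positive entries, so (ignoring the hidden bias c) flipping any
    # hidden unit from 0 to 1 adds a positive amount to the energy.
    rng = np.random.default_rng(0)
    n_vis, n_hid = 4, 3
    v = rng.uniform(0.05, 0.95, size=n_vis)          # visible units in (0, 1)
    W = rng.uniform(0.1, 1.0, size=(n_vis, n_hid))   # positive parameters
    U = rng.uniform(0.1, 1.0, size=(n_vis, n_hid))
    c = np.zeros(n_hid)                              # drop the hidden bias, as in the argument

    def h_energy(v, h):
        # the terms of E(v, h) that involve h
        return -np.log(v) @ W @ h - np.log(1.0 - v) @ U @ h - c @ h

    h_off = np.zeros(n_hid)
    for j in range(n_hid):
        h_on = h_off.copy()
        h_on[j] = 1.0
        print(f"turning on h_{j}: energy change = {h_energy(v, h_on) - h_energy(v, h_off):+.3f}")

Every printed change is positive, so a unit that turns on is always penalised.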

answered Mar 17 '12 at 14:31


gdahl ♦

So by "turning on a hidden unit can only increase precision" they meant "turning on a hidden unit can only increase energy". Makes me wonder why they used that particular formulation... anyway, that's what I needed to know. Thanks :)

(Mar 17 '12 at 17:12) Sander Dieleman

I tested and confirmed your hypothesis for a Gamma-Bernoulli RBM. The model indeed learns to turn off all the hiddens. Although it is technically possible for log(v) to be positive in this case, in practice this only occurs very rarely, so the same reasoning applies.

Interestingly, symmetrising the energy function does not solve the problem. Instead, roughly half of the units are now always on, and the other half of the units are always off. Adding a 'sparsity' penalty with a target of 0.5 doesn't help. Has anybody encountered alternative approaches to tackle this issue? I guess it can come up whenever there's an energy function with constrained parameters somewhere.
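For reference, the penalty I tried is roughly of the following form (a sketch in the style of the usual RBM sparsity regulariser; the names and constants here are just illustrative):

    # Sketch of a sparsity penalty on the hidden activities: keep a running estimate q
    # of each hidden unit's mean activation and nudge the hidden biases c towards a
    # target activation (0.5 here). Names and constants are illustrative only.
    def sparsity_bias_update(hidden_probs, q, c, target=0.5, cost=0.1, decay=0.9):
        # hidden_probs: (batch, n_hid) NumPy array of p(h_j = 1 | v) for the current batch
        q = decay * q + (1.0 - decay) * hidden_probs.mean(axis=0)
        c = c + cost * (target - q)   # push units whose mean activity drifts from the target
        return q, c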

(Mar 19 '12 at 10:37) Sander Dieleman