I'm currently training a Markov Random Field (or actually a Boltzmann Machine) with many hidden units and a relatively small set of visible units. The large number of latent variables is necessary to capture complicated dependencies but seems to make learning very difficult. The problem is that the model tends to get stuck in very bad local minima with highly correlated hidden units and totally meaningless visible units.

This actually makes sense to me: the energy of the system is defined as a weighted sum of potential functions, and many more of those are functions of the hidden units than of the visible units. So the model seems to favour configurations in which as many hidden units as possible match, while caring less about the values of the visible units. I tried using different learning rates for the weights of the different types of potential functions, and also initializing the weights of the visible units to very high values, but this seems to just delay the problem to a later point in training.

Are there more principled approaches to combat this problem? Perhaps altering the potential functions (which are and-gates in a Boltzmann Machine) or how they are included in the energy function? It seems to me that this should be an issue in any undirected graphical model with latent variables that depend on each other.

For training I tried various versions of contrastive divergence and mean-field, or combinations of the two, with mixed results. After training for long enough I still end up stuck in configurations where the hidden variables are highly correlated.
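To make the imbalance between term types concrete, here is a minimal sketch that counts the pairwise terms in the energy. It assumes a fully connected Boltzmann machine with one pairwise potential per pair of units; the model in question may well be sparser. The function name `count_energy_terms` and the sizes used are illustrative, not taken from the actual setup.

```python
def count_energy_terms(n_visible, n_hidden):
    """Count pairwise energy terms by type.

    Assumption: every pair of units shares exactly one pairwise
    potential (fully connected Boltzmann machine).
    """
    visible_visible = n_visible * (n_visible - 1) // 2
    visible_hidden = n_visible * n_hidden
    hidden_hidden = n_hidden * (n_hidden - 1) // 2
    return {
        "visible-visible": visible_visible,
        "visible-hidden": visible_hidden,
        "hidden-hidden": hidden_hidden,
    }


# Illustrative sizes only: 10 visible units, 200 hidden units.
counts = count_energy_terms(10, 200)
print(counts)
# {'visible-visible': 45, 'visible-hidden': 2000, 'hidden-hidden': 19900}
```

With these sizes roughly nine out of ten pairwise terms involve only hidden units, which matches the intuition that the hidden-hidden potentials can dominate the energy.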
I think restricted Boltzmann machines (RBMs) were initially used because assuming conditionally independent latent variables gets around all these problems during training, and you can stack them (thereby breaking the conditional independence assumption) to improve modeling power.
Any special reason you're trying to train a Boltzmann machine and not stacked RBMs or something similar?
The model I'm trying to train is actually a variant of the deep Boltzmann machine originally proposed by Salakhutdinov. The reason I'm using dependent hidden variables is that I'm trying to do structured prediction. I'm starting to think the pre-training Salakhutdinov did might indeed more or less solve this problem, but I was wondering whether similar issues have been tackled in the MRF/CRF literature.
Could it be that the model with highest likelihood has highly correlated random variables?
This does seem to be the case. I guess I will just have to artificially scale down the importance of these correlations by multiplying the corresponding terms in the energy function by scaling factors...
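As a rough sketch of what such scaling could look like, the function below computes the energy of a binary Boltzmann machine with a separate multiplier on the visible-hidden and hidden-hidden terms. The names `scaled_energy`, `s_vh`, and `s_hh` are hypothetical, and the scaling factors are an illustration of the idea rather than part of the standard model or of any specific implementation.

```python
def scaled_energy(v, h, W, U, b, c, s_vh=1.0, s_hh=1.0):
    """Energy of a binary Boltzmann machine with per-type scaling factors.

    v, h       : lists of 0/1 states for visible and hidden units
    W          : W[i][j] couples visible unit i with hidden unit j
    U          : U[j][k] couples hidden units j and k (only j < k is read)
    b, c       : visible and hidden biases
    s_vh, s_hh : hypothetical multipliers on each term type
    """
    bias = sum(bi * vi for bi, vi in zip(b, v))
    bias += sum(cj * hj for cj, hj in zip(c, h))
    vis_hid = sum(
        v[i] * W[i][j] * h[j]
        for i in range(len(v))
        for j in range(len(h))
    )
    hid_hid = sum(
        h[j] * U[j][k] * h[k]
        for j in range(len(h))
        for k in range(j + 1, len(h))
    )
    return -(bias + s_vh * vis_hid + s_hh * hid_hid)


# Toy example with made-up weights: 2 visible units, 3 hidden units.
v = [1, 0]
h = [1, 1, 0]
W = [[0.5, -1.0, 2.0], [1.0, 1.0, 1.0]]
U = [[0.0, 2.0, 3.0], [0.0, 0.0, 4.0], [0.0, 0.0, 0.0]]
b = [0.1, 0.2]
c = [0.0, 0.0, 0.0]

print(scaled_energy(v, h, W, U, b, c))            # unscaled energy
print(scaled_energy(v, h, W, U, b, c, s_hh=0.5))  # hidden-hidden terms halved
```

One caveat: if the weights in `U` are learned freely, a fixed factor `s_hh` can be absorbed into them, so on its own it mainly changes the effective initial scale and step size of the hidden-hidden weights, much like a per-type learning rate.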