Hi, apologies in advance if this question is naive; I just want to ask around before spending weeks coding an experiment. Most of the papers I've read on "deep" architectures (which I take to mean a graphical model with several layers of latent variables) make use of the maximum likelihood / truncated-MCMC combination (is this really true?). How strong is the evidence that this is really the best method?

Has there been any experimentation with belief propagation or message-passing algorithms on deep architectures? I would not be surprised if simple loopy belief propagation failed to converge on a dense network with short interlocking loops, but has anyone tried any of the more sophisticated region-graph approximations described in the YFW paper?

I'm also curious whether there has been any work on deep architectures using loss functions other than the likelihood, such as the L1 loss or any of the variations on the pseudo-likelihood. It is not obvious to me how one would practically evaluate the gradient of the L1 distance, but I would be very interested in any relevant resources. I should also emphasize that I am interested in experimental results, unless there is a really strong theoretical reason why these methods would fail.
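To be concrete about what I mean by the maximum likelihood / truncated-MCMC combination, take the RBM as the standard example: the log-likelihood gradient for the weights is

$$\frac{\partial \log p(v)}{\partial W_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}},$$

and since the second expectation requires samples from the model, it is approximated by a Gibbs chain truncated after a few steps (CD-k) or by a persistent chain (PCD). That is the specific training recipe I am asking about alternatives to.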
I encourage you to explore the paper below. It examines the various principles which can be used to train models such as the Restricted Boltzmann Machine. The authors compare the following methods: approximate maximum-likelihood methods (PCD-style methods), Contrastive Divergence, Pseudo-Likelihood, Ratio Matching and Generalized Score Matching. They do a very good job of studying each method and analyzing how they differ from one another.

Thanks for this link. It was an interesting analysis: it seems that with pseudo-likelihood you pay a big computational penalty (10-15x over contrastive divergence or MCMC, according to the authors) for results that are sometimes worse, at least on MNIST, 20news and Caltech. I'm not sure I fully understand their evaluation methods, however. (The pseudo-likelihood objective I have in mind is sketched below.)
(Jul 07 '10 at 22:52)
jbowlan
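For reference, the generic pseudo-likelihood objective (I may be glossing over details of their exact formulation) replaces the log-likelihood of a $D$-dimensional binary vector $v$ with

$$\mathrm{PL}(\theta) = \sum_{i=1}^{D} \log p(v_i \mid v_{\setminus i}; \theta),$$

where each conditional normalizes over a single unit, so the intractable partition function drops out. For an RBM this amounts to comparing the free energy of $v$ with the free energy of each of its $D$ one-bit flips, which is presumably where the computational overhead over CD comes from.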
There is no need to use MCMC-style learning algorithms if no graphical models are involved (e.g. when using autoassociators as the building block instead of RBMs), although graphical models are effective in the sense that they tend to self-regularise through their stochastic nature. It is also fairly tricky to make CD-based algorithms work well (see, e.g., a recent implementation tutorial by Hinton). I believe there has been some work exploring sparsity, either in the parameters or in the posterior of the hidden units, and that work should involve some variant of L1 regularisation.
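As a rough illustration of the autoassociator route with an L1 sparsity penalty on the hidden activations, here is a minimal NumPy sketch; the layer sizes, learning rate and penalty weight are arbitrary placeholders, not taken from any particular paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SparseAutoencoder:
    """One-layer autoassociator with tied weights and an L1 penalty on the
    hidden activations -- an MCMC-free alternative to an RBM trained with CD."""

    def __init__(self, n_visible, n_hidden, l1_weight=1e-3, lr=0.01, seed=0):
        rng = np.random.RandomState(seed)
        self.W = 0.01 * rng.randn(n_visible, n_hidden)  # tied encoder/decoder weights
        self.b_hid = np.zeros(n_hidden)
        self.b_vis = np.zeros(n_visible)
        self.l1_weight = l1_weight
        self.lr = lr

    def train_step(self, v):
        """One SGD step on 0.5*||reconstruction - v||^2 + l1_weight * sum(|h|)."""
        # Forward pass: encode, then decode with the transposed weights.
        h = sigmoid(v @ self.W + self.b_hid)       # hidden activations
        r = sigmoid(h @ self.W.T + self.b_vis)     # reconstruction of v

        loss = 0.5 * np.sum((r - v) ** 2) + self.l1_weight * np.sum(np.abs(h))

        # Backward pass (plain chain rule). np.sign(h) is the L1 subgradient;
        # for sigmoid units h > 0, so the penalty simply pushes h towards 0.
        d_r = (r - v) * r * (1.0 - r)                        # at decoder pre-activation
        d_h = (d_r @ self.W + self.l1_weight * np.sign(h)) * h * (1.0 - h)

        grad_W = np.outer(v, d_h) + np.outer(d_r, h)         # two terms because of tied weights
        self.W -= self.lr * grad_W
        self.b_hid -= self.lr * d_h
        self.b_vis -= self.lr * d_r
        return loss

# Hypothetical usage on, say, binarised MNIST vectors of length 784:
# ae = SparseAutoencoder(n_visible=784, n_hidden=256)
# for v in training_vectors:
#     ae.train_step(v)
```

Stacking such layers greedily, each one trained on the previous layer's hidden activations, gives the usual deep pre-training recipe without any MCMC.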