
The performance of neural networks, especially deep neural networks, can be greatly improved by a good initialization of the weights prior to back-propagation. One way to generate these good initial weights is to use stacked autoencoders, a type of gradient-trained neural network that learns representations of its input in an unsupervised manner. Another way is to use restricted Boltzmann machines (RBMs), a type of network that learns to represent the statistical distribution of its inputs and is trained with a non-backpropagating algorithm.
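
To make the setup concrete, here is a rough numpy sketch of the kind of greedy layer-wise pretraining I have in mind: each layer is trained as a small autoencoder on the codes produced by the layer below, and the learned encoder weights are then used to initialize the deep network before back-propagation. All names, shapes, and hyperparameters here are just placeholders.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def pretrain_autoencoder_layer(data, n_hidden, lr=0.1, epochs=10, rng=None):
        """Train one tied-weight autoencoder layer with plain gradient descent
        on squared reconstruction error; returns encoder weights and biases."""
        if rng is None:
            rng = np.random.RandomState(0)
        n_visible = data.shape[1]
        W = rng.normal(0, 0.01, size=(n_visible, n_hidden))
        b_h = np.zeros(n_hidden)
        b_v = np.zeros(n_visible)
        for _ in range(epochs):
            h = sigmoid(data @ W + b_h)                # encode
            recon = sigmoid(h @ W.T + b_v)             # decode (tied weights)
            grad_recon = (recon - data) * recon * (1 - recon)  # through decoder sigmoid
            grad_h = (grad_recon @ W) * h * (1 - h)            # through encoder sigmoid
            W -= lr * (data.T @ grad_h + grad_recon.T @ h) / len(data)
            b_h -= lr * grad_h.mean(axis=0)
            b_v -= lr * grad_recon.mean(axis=0)
        return W, b_h

    # Greedy layer-wise pretraining: feed each layer's codes to the next layer.
    X = np.random.rand(500, 64)                        # placeholder data
    inputs, init_weights = X, []
    for n_hidden in [32, 16]:                          # placeholder layer sizes
        W, b = pretrain_autoencoder_layer(inputs, n_hidden)
        init_weights.append((W, b))                    # used to initialize the deep net
        inputs = sigmoid(inputs @ W + b)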

What are the differences between these two methods in practice? When would you want to use autoencoders, and when would you want to use RBMs? If you decided that you needed to pretrain a network for good performance, how would you choose which of these methods to use? Is there a significant difference in training speed or expressive power?

Thanks!

asked Dec 24 '12 at 23:14

Andrew Gibiansky


2 Answers:

In the context of unsupervised pretraining, I'd say it's largely a matter of preference. Autoencoders are conceptually a bit simpler, but RBMs have more bells and whistles, you could say. There's a plethora of RBM models with different parameterisations, different types of visible and hidden units, and so on, whereas research on autoencoders mostly seems to have been focused on regularisation (sparse / denoising / contractive autoencoders, etc.). At least that's the impression I got.

A nice thing about RBMs, and by extension deep belief networks, is that they are generative models, so you can sample from them. But if you're only interested in pretraining a neural network that doesn't really matter. And recently a nice method has been proposed to sample from contractive autoencoders as well.

I mostly use RBMs, because that's what I know best, but if you're new to both I guess autoencoders might be a bit easier to get started with. RBMs are probabilistic models and require approximations because the gradient of the log-likelihood is intractable, whereas autoencoders are deterministic and can be trained simply with gradient descent.

To clarify, both autoencoders and RBMs are typically trained with some form of stochastic (minibatch) gradient descent, so I'm not sure if the distinction you make in your question actually makes sense. The main difference is that for RBMs, the gradient is intractable, so it is approximated (that's essentially what contrastive divergence is).
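
For reference, here is roughly what a single CD-1 update for a binary RBM looks like. This is only a sketch (names and hyperparameters are made up), but it shows where the approximation to the log-likelihood gradient comes in:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(v0, W, b_h, b_v, lr=0.1, rng=np.random):
        """One contrastive divergence (CD-1) step for a binary RBM.
        v0: minibatch of visible vectors, shape (n_samples, n_visible)."""
        # Positive phase: hidden probabilities given the data.
        ph0 = sigmoid(v0 @ W + b_h)
        h0 = (rng.uniform(size=ph0.shape) < ph0).astype(v0.dtype)  # sample hiddens
        # Negative phase: one Gibbs step (reconstruct, then re-infer hiddens).
        pv1 = sigmoid(h0 @ W.T + b_v)
        ph1 = sigmoid(pv1 @ W + b_h)
        # Approximate log-likelihood gradient: <v h>_data - <v h>_reconstruction.
        n = v0.shape[0]
        W += lr * (v0.T @ ph0 - pv1.T @ ph1) / n
        b_h += lr * (ph0 - ph1).mean(axis=0)
        b_v += lr * (v0 - pv1).mean(axis=0)
        return W, b_h, b_v

In practice you would add momentum, weight decay and so on, but the approximation itself is just the difference of those two correlation terms.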

It seems that a lot of the deep learning folks now believe that the whole unsupervised pretraining business isn't as important as it was originally made out to be. Some impressive results with 'deep' models have recently been attained without pretraining. Networks with rectified linear units (basically, the activation function is max(0, x) instead of tanh(x) or sigmoid(x)) seem to suffer less from problems like vanishing gradients, and new regularisation methods like 'dropout' probably help too.
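
To illustrate those two ingredients, a toy forward pass with rectified linear hidden units and dropout might look like this (a sketch only, not any particular paper's implementation):

    import numpy as np

    def relu(x):
        # Rectified linear unit: max(0, x), applied elementwise.
        return np.maximum(0.0, x)

    def forward_with_dropout(x, W1, W2, p_drop=0.5, train=True, rng=np.random):
        """Toy two-layer forward pass with ReLU hidden units and dropout."""
        h = relu(x @ W1)
        if train:
            # Drop each hidden unit with probability p_drop; rescale the rest
            # so the expected activation matches the test-time forward pass.
            mask = (rng.uniform(size=h.shape) >= p_drop) / (1.0 - p_drop)
            h = h * mask
        return h @ W2

(The dropout paper halves the outgoing weights at test time instead; rescaling the kept units during training, as above, is equivalent in expectation.)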

To summarise: for unsupervised pretraining you can use either method. I don't think one has a definitive performance advantage over the other; it depends on the task. If there is any literature showing the contrary I would be very interested at any rate :)

answered Dec 25 '12 at 17:35

Sander Dieleman

edited Dec 26 '12 at 11:34

Can you elaborate more on the rectifier non linearity or point to a reference? This is an interesting development.

(Dec 26 '12 at 01:49) cdrn

The terminology in my answer was incorrect; they are called 'rectified linear units', or sometimes 'noisy rectified linear units' (NReLUs) in the context of RBMs. Sorry about that.

Deep Sparse Rectifier Neural Networks by Glorot, Bordes and Bengio is probably the best paper to check out: http://eprints.pascal-network.org/archive/00008596/ To my surprise I've never read it myself, so I'm going to do that now :)

(Dec 26 '12 at 11:50) Sander Dieleman

Also http://www.cs.toronto.edu/~hinton/absps/reluICML.pdf

(Dec 26 '12 at 23:02) gdahl ♦

Here's a reference for 'dropout' regularization: http://arxiv.org/abs/1207.0580

(Apr 03 '13 at 19:55) LeeZamparo

The debate about the actual usefulness of pre-training seems quite interesting. I am a deep-learning beginner who just follows the usual pre-training and fine-tuning steps. Can you point me to any paper or related material arguing that pre-training is not really useful?

(Sep 13 '13 at 02:52) Ken Kim

There is no extensive literature on this topic yet to my knowledge, but it is briefly discussed in these papers: http://eprints.pascal-network.org/archive/00008596/01/glorot11a.pdf http://www.cs.toronto.edu/~hinton/absps/googlerectified.pdf

(Sep 17 '13 at 08:30) Sander Dieleman

I don't have a solid reference for you, but I've read about this before in papers on Hessian-free optimization, which can train deep networks as well as the pre-trained versions.

(Oct 23 '14 at 17:18) Chet Corcos

Also, I've read that good initialization matters a lot. If your initial weights are too large, the "neurons" will be stuck in on-off positions.

(Oct 23 '14 at 17:19) Chet Corcos

Both styles of pre-training work well in practice, as Sander says.

Autoencoders are actually optimizing an objective function you can compute during training, which can have a lot of advantages. Perhaps the best part of actually optimizing something is that you can look at it during training and have it mean something. It is also easier to use funky activation functions in autoencoders, ones that don't make good probabilistic models. Any time you want to use a bizarre hidden unit nonlinearity, it will probably be easier with an autoencoder.
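
To make that concrete: the quantity below is cheap to compute and is (up to constants) exactly the thing being minimised, so you can print it every few updates and trust what it tells you. This is just a sketch with placeholder activations:

    import numpy as np

    def reconstruction_error(X, W_enc, b_enc, W_dec, b_dec):
        # Mean squared reconstruction error of an autoencoder on data X.
        # Since this is the training objective itself, watching it during
        # training tells you directly whether optimisation is working.
        h = np.tanh(X @ W_enc + b_enc)     # encoder (any nonlinearity)
        recon = h @ W_dec + b_dec          # linear decoder in this sketch
        return np.mean((recon - X) ** 2)

For an RBM the analogous quantity, the log-likelihood, is intractable, so you end up monitoring proxies (such as the reconstruction error after one Gibbs step) that are not the actual objective.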

RBMs are generative models, and if you want to end up with a Boltzmann machine or a proper DBN at the end of training, using them makes more sense. Or if you want to use general training algorithms meant for proper probabilistic models, such as pseudolikelihood or perturb-and-MAP or something other than contrastive divergence, the RBM's probabilistic model gives you something that fits into the framework assumed by those algorithms.

Autoencoders can break symmetry between the encoder and the decoder, which can sometimes be really, really useful for computational reasons, but that isn't especially important for basic deep neural net pre-training.
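
To illustrate what breaking that symmetry can look like: an RBM (or a tied-weight autoencoder) uses the same weight matrix in both directions, whereas an autoencoder's encoder and decoder can have completely separate parameters, and so can differ in size or cost to evaluate. A sketch with made-up shapes:

    import numpy as np

    rng = np.random.RandomState(0)
    n_visible, n_hidden = 784, 256
    X = rng.rand(10, n_visible)

    # Tied / symmetric: the decoder is the transpose of the encoder
    # (this is also the structure an RBM imposes).
    W = rng.normal(0, 0.01, (n_visible, n_hidden))
    recon_tied = np.tanh(X @ W) @ W.T

    # Untied / asymmetric: encoder and decoder are separate parameters,
    # so the two directions can have different shapes or architectures.
    W_enc = rng.normal(0, 0.01, (n_visible, n_hidden))
    W_dec = rng.normal(0, 0.01, (n_hidden, n_visible))
    recon_untied = np.tanh(X @ W_enc) @ W_dec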

answered Dec 25 '12 at 23:54

gdahl ♦

edited Jan 16 '13 at 13:33

Is it possible to use the weights learned from pretraining an autoencoder to initialize an RBM?

(Jan 17 '13 at 14:44) cdrn

Sure, why not. Although I don't really see the point: an RBM has only two layers by definition, so the issues that pretraining is usually supposed to solve don't occur. Maybe some more exotic flavours of RBMs could benefit from it :)

The opposite (initialising the weights of a deep autoencoder with those learnt in an RBM) has been done by Salakhutdinov and Hinton in their 'Semantic Hashing' paper, if I'm not mistaken.

Sparse RBMs have also been initialised with GMMs (which in turn were initialised with K-means) by Sohn et al.: http://web.eecs.umich.edu/~honglak/iccv2011-sparseConvLearning.pdf

(Jan 17 '13 at 17:37) Sander Dieleman

Can you explain a bit more about how breaking symmetry between the encoder and decoder can be so useful?

(Mar 22 '14 at 22:47) Jianbo Ye