|
Is a generative model required for deep learning? Since the goal is classification, why do we use unsupervised learning first?
|
The following explanation is an oversimplification, to give you some intuition into what is going on: The difficulty in training a deep architecture with standard backprop is that the gradient signal doesn't flow back to the lowest layers. The few supervised output units pass back a small gradient signal, which picks up more noise at each layer as it is propagated backwards. So the lower layers never get tuned effectively and essentially output random features, while the top layers overfit to whatever those features happen to be. For this reason, before 2006, no one knew how to train a deep architecture (besides Yann LeCun, with his convolutional architectures, but those were not as general purpose). The breakthrough came in 2006, when Hinton came up with the original DBN algorithm. Bengio et al. followed up by teasing apart the steps that matter when training a deep architecture:
When you train a single layer with an unsupervised criterion, the gradient signal is passed backwards through only one hidden layer (the layer you are constructing), and the output layer for an unsupervised criterion has as many units as the input. So the output layer passes back a strong gradient signal, and that signal doesn't have to travel far. Doing this unsupervised pretraining layer by layer gives the deep network a good initialization of its parameters; when you then finetune against the supervised criterion with backprop, you can find a better local minimum. The work of Erhan et al. shows that the effect of unsupervised pretraining is not only regularization but also improved optimization; take a look at their work for a great empirical study with large-scale experiments and pretty graphs.

Another motivation for unsupervised pretraining is that the representations learned under the generative model can be useful in their own right: if we conduct an expensive pretraining phase to learn a task-agnostic representation, that representation can then be adapted to quickly train a task-specific classifier.
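To make the layer-by-layer idea concrete, here is a minimal sketch of greedy unsupervised pretraining using tied-weight autoencoders (not the exact RBM-based DBN procedure from Hinton's paper); the layer sizes, learning rate, and toy data are illustrative assumptions:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def pretrain_autoencoder_layer(X, n_hidden, lr=0.1, epochs=50, seed=0):
        """Train one tied-weight autoencoder layer to reconstruct X.

        Returns (W, b_hidden) to initialize the corresponding layer of the
        deep supervised network. Hyperparameters are illustrative only.
        """
        rng = np.random.RandomState(seed)
        n_visible = X.shape[1]
        W = 0.01 * rng.randn(n_visible, n_hidden)
        b_h = np.zeros(n_hidden)   # hidden (encoder) bias
        b_v = np.zeros(n_visible)  # reconstruction (decoder) bias

        for _ in range(epochs):
            H = sigmoid(X @ W + b_h)        # encode
            R = sigmoid(H @ W.T + b_v)      # decode with tied weights
            err = R - X                     # reconstruction error
            # Backprop through decoder and encoder (squared-error loss).
            dR = err * R * (1 - R)
            dH = (dR @ W) * H * (1 - H)
            grad_W = X.T @ dH + dR.T @ H    # tied-weight gradient
            W -= lr * grad_W / X.shape[0]
            b_h -= lr * dH.mean(axis=0)
            b_v -= lr * dR.mean(axis=0)
        return W, b_h

    # Greedy layer-wise pretraining: each layer's gradient only has to pass
    # through the one hidden layer being constructed, and the reconstruction
    # target is as wide as the input, so the signal stays strong.
    X = np.random.rand(500, 64)             # toy unlabeled data
    layer_sizes = [32, 16]                  # illustrative architecture
    inits, inp = [], X
    for n_hidden in layer_sizes:
        W, b = pretrain_autoencoder_layer(inp, n_hidden)
        inits.append((W, b))
        inp = sigmoid(inp @ W + b)          # feed representation upward
    # `inits` would then initialize the deep net before supervised finetuning.

The point of the sketch is the structure of the loop: each layer is trained against a local reconstruction objective, and only afterwards is the whole stack finetuned with the supervised criterion.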
Maybe you can add the newest paper by Martens (that you mentioned elsewhere)? (Jun 30 '10 at 17:49)
osdf
|
|
Another way to think about it (which I learned from Geoff Hinton), in information-theoretic terms (and necessarily a little bit handwavy): if we think of each presentation of a training case in a supervised learning setup as sending a "message" to the learner, then with C classes there are at most log C bits per training case, and often much less when there is redundancy between cases. This compounds the problem mentioned in Joseph's answer of the gradient signal being attenuated by the time it backpropagates to the lower layers: we wish to learn a large number of parameters from a noisy supervision signal, and that supervision signal didn't contain much information in the first place! The unsupervised objective of modelling the input distribution addresses this in two ways: not only is the learning now local, but there is a lot more information in the learning signal, up to the entropy of the joint distribution of the input features. Geoff Hinton explains this intuition really well around minute 8 of the following Google tech talk: http://www.youtube.com/watch?v=VdIURAu1-aU (the rest of the lecture is worth listening to as well).
(Jun 23 '10 at 16:38)
ogrisel
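To put rough numbers on the bit-counting argument in the comment above, here is a back-of-the-envelope comparison; the 10-class label and 28x28 binary input are illustrative assumptions, and the input-side figure is only a crude upper bound that ignores redundancy between pixels:

    import math

    # Supervised signal: a label from C classes carries at most log2(C) bits
    # per training case (less if labels are redundant or predictable).
    C = 10                            # e.g. digit classification (illustrative)
    label_bits = math.log2(C)         # ~3.32 bits per case

    # Unsupervised signal: modelling the input itself. A crude upper bound
    # for a 28x28 binary image is 1 bit per pixel; the true amount is the
    # entropy of the input distribution, which is lower than this bound but
    # typically still far larger than log2(C).
    input_bits_upper_bound = 28 * 28  # 784 bits per case

    print(f"label: <= {label_bits:.2f} bits per case")
    print(f"input: <= {input_bits_upper_bound} bits per case (crude upper bound)")

Even allowing for heavy redundancy between pixels, the input carries orders of magnitude more information per case than the label, which is the intuition behind pretraining on the unsupervised objective.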
|