I have read a little bit about deep learning, but I find it hard to understand what the main differences between the various approaches are. I read that convolutional neural networks are good for 2D images because they encode some of the spatial relationships, but are sensitive to clutter. And I read that autoencoders are useful for 'big data' because they can be trained with unlabeled data. Are these 'impressions' correct? What parts did I miss? And the most important question: what literature do you recommend?
I would recommend reading some of Yoshua Bengio's review papers as a good overview of the field: http://arxiv.org/abs/1206.5538

There are a few different ways of categorizing deep learning algorithms. One way is whether they are purely supervised or semi-supervised (able to use unlabeled data). Nearly all deep learning algorithms are semi-supervised; in fact, the only deep learning method that works well without unlabeled data is the convolutional net, though it is also possible to make semi-supervised versions of the convolutional net.

Another way of categorizing them is whether their training procedure is probabilistic or deterministic. Methods like RBMs and DBMs generally involve Markov chain Monte Carlo approximations in their training algorithm, while autoencoders and predictive sparse decomposition can usually be trained just by taking the gradient of a deterministic function. So autoencoders are usually a lot easier to implement, and they make a good starting point for a beginner (see the minimal autoencoder sketch after the comments on this answer).

Finally, one other way of categorizing deep learning algorithms is whether they incorporate prior knowledge about the data. Convolutional networks incorporate the prior knowledge that the data has spatial structure, so they work well when the data has spatial structure, but they are not applicable if the data isn't structured like that. For most learning algorithms it is possible to make both a convolutional and a non-convolutional version.

"In fact, the only deep learning method that works well without unlabeled data is the convolutional net." Depending on what you mean by "works well", the Hessian-free faction might disagree. Also, the "lots of labeled data" faction might disagree. But I guess you mean that using unlabeled data always improves things.
(Nov 16 '12 at 16:03)
Justin Bayer
I guess by "work well" I meant getting good classification results. As far as I know, Hessian Free doesn't really help with that; the Hessian Free results are about reducing reconstruction error. Pretraining still seems to be crucial for getting good classification results, unless you have millions of labeled examples.
(Nov 16 '12 at 17:41)
Ian Goodfellow
In some cases dropout does well with no pretraining, IIRC. But it seems as if at least one of the three is always needed to get the best results: pretraining, lots of data, or prior knowledge wired into the architecture.
(Nov 17 '12 at 05:19)
Justin Bayer
These are the problems with deep learning: you have a lot of parameters to fit (lots of layers), and each extra layer of nonlinearity basically loses gradient information and introduces local optima. So you have two sources of problems: a) difficulty converging, leading to underfitting on the training data, and b) bad generalisation (i.e. poor performance on test data). Convolutional networks reduce the number of parameters by weight sharing and by localising receptive fields, which is likely to help both a) and b); see the parameter-count sketch after this comment. [The specific restrictions on the weights make sense for image data, so they do not themselves lead to underfitting!] So, just to echo Ian's comments (using other words): Hessian free is an optimisation routine, so it just improves a); no improvement on b) is expected, IMO. Dropout can be viewed as a form of extending the training data to encourage better generalisation, i.e. to get the same kind of benefit as the localisation that is hardwired into a convolutional net.
(Nov 17 '12 at 09:55)
SeanV
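To put rough numbers on the weight-sharing point above, here is a quick back-of-the-envelope sketch in Python. The layer sizes (a 32x32 single-channel input, 1024 fully connected hidden units, sixteen 5x5 filters) are made up purely for illustration, not taken from the discussion, and biases are ignored.

```python
# Hypothetical layer sizes, chosen only to illustrate the comparison.
height, width = 32, 32   # input image size (single channel)
n_hidden = 1024          # hidden units in a fully connected layer
k, n_filters = 5, 16     # 5x5 filters, 16 feature maps in a convolutional layer

# A fully connected layer connects every input pixel to every hidden unit.
fully_connected_weights = height * width * n_hidden   # 1,048,576 weights

# A convolutional layer re-uses one small filter across every image location,
# so its weight count does not grow with the image size.
convolutional_weights = k * k * n_filters              # 400 weights

print(f"fully connected layer: {fully_connected_weights:,} weights")
print(f"convolutional layer:   {convolutional_weights:,} weights")
```

The fully connected count also keeps growing as the image gets larger, while the convolutional count stays fixed, which is the sense in which the architecture hardwires prior knowledge about spatial structure.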
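And to make the "gradient of a deterministic function" point from Ian's answer concrete, here is a minimal autoencoder sketch using only NumPy. The toy random data, single hidden layer, tied weights, and plain batch gradient descent are all arbitrary illustrative choices, not something prescribed anywhere in the thread.

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(500, 20)                  # 500 unlabeled toy examples, 20 features each

n_vis, n_hid, lr = X.shape[1], 8, 0.1
W = rng.randn(n_vis, n_hid) * 0.1      # encoder weights; the decoder reuses W.T (tied weights)
b = np.zeros(n_hid)                    # hidden-layer bias
c = np.zeros(n_vis)                    # reconstruction bias

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for epoch in range(200):
    H = sigmoid(X @ W + b)             # deterministic encoding (no sampling step)
    R = H @ W.T + c                    # linear reconstruction
    err = R - X                        # gradient of 0.5 * ||R - X||^2 w.r.t. R
    dH = (err @ W) * H * (1 - H)       # backprop through the sigmoid encoder
    dW = X.T @ dH + err.T @ H          # W appears in both the encoder and the decoder
    W -= lr * dW / len(X)
    b -= lr * dH.sum(axis=0) / len(X)
    c -= lr * err.sum(axis=0) / len(X)

H = sigmoid(X @ W + b)
print("mean squared reconstruction error:", np.mean((H @ W.T + c - X) ** 2))
```

The whole training loop is ordinary gradient descent on a deterministic reconstruction loss, and it never needs labels, which is the sense in which autoencoders are both easy to implement and able to use unlabeled data; an RBM or DBM would instead need a Markov chain Monte Carlo approximation inside the training step.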
Ian's answer explains some differences between the approaches, but I would like to emphasize something else. Although conceptually the approaches might be quite different, in practice they all seem to work well. Different researchers and groups prefer different approaches for often subtle, nuanced, and sometimes even arbitrary reasons that in most cases aren't important to anyone else. If you want to understand the details and subtle differences, you will probably have to ask a more targeted and specific question.

The 'impressions' you mention don't fit with my knowledge or experience. Autoencoders and RBMs are both perfectly useful for large datasets. I have no idea why 'big data' would necessitate using unlabeled data more, but RBMs and autoencoders can both use unlabeled data.

Also, I have never actually found a local minimum during training; I never train that long. Adding more hidden units and parameters to each layer should make local minima harder to find. The difficulties in the optimization problem posed by large, deep neural nets seem to be dominated more by saddle points and narrow ravines with very low curvature in the direction of improvement than by actual local minima.

Nowadays, with GPUs and fast machines, it is possible to get good results without pre-training or Hessian-free optimization if you train long enough and carefully enough and use wide enough nets (another case where more parameters make the optimization easier). Although perhaps pre-training would have improved my results, I recently trained neural nets with 2-4 hidden layers on a relatively modestly sized chemical data set without using pre-training, and I was able to get good results.

I agree with George. In most cases where I know the performance with and without pretraining, pretraining does help significantly. But you can definitely get very respectable performance without it.
(Nov 19 '12 at 14:02)
Ian Goodfellow
Wrt the chemical data sets: did you try it with pretraining and did it not improve things or did you just not bother to?
(Nov 20 '12 at 03:29)
Justin Bayer
Excuse me, but where did you get the idea that convolutional nets are sensitive to clutter? AFAIK, they perform quite well on many recognition tasks on natural scene images, which surely involve a lot of clutter.