
Before I ask my question, I want to preface it with my thought process ... I hope you'll bear with me here:

Many of the papers out of Hinton's group, as well as some of the lectures I've seen Hinton give, indicate there is a lot of value in using deep networks for generative models, and potentially other types of models.

There was a paper I ran across around 8 months ago -- unfortunately, I don't recall which one, though I am trying to find it; I'll update my question if I do -- where they compared the performance of a very shallow network to deep networks on some task or another (again, the details are fuzzy).

They found, much to their (and my) surprise, that the shallow network was able to reach the same level of performance as the deep network on whatever task it was.

That said ... consider the case of an auto-encoder. From layer to layer, the activations should be binarized in some fashion, which prevents the network from passing more than 1 bit of information per unit (in the case of binary units).

On the one hand, if you view each layer individually, it can only pass on as much information as can be encoded in the number of binary units in that layer.

On the other hand, the weights in each layer could be thought of as a sort of dictionary or encyclopedia, storing information about the data it has been trained on.

My intuition, then, is that an auto-encoder would only benefit from a deep network if it's easier for the NN to do dimensionality reduction in small increments -- e.g., 1000 -> 900 -> 800 -> ... -> 100 produces better results than 1000 -> 100 simply because it's easier to find an efficient representation going from 1000 -> 900, and likewise from 900 -> 800.
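To make that concrete, here's a rough sketch of the two encoder shapes I have in mind (a sketch only, assuming PyTorch with sigmoid units; the layer sizes are just the illustrative ones above, and nothing is trained here):

```python
# Contrasts the two encoder shapes from my example:
# 1000 -> 100 in one step vs. 1000 -> 900 -> ... -> 100 in small increments.
# Assumes PyTorch; the sizes are illustrative, not taken from any paper.
import torch
import torch.nn as nn

def make_encoder(sizes):
    """Stack of affine layers with sigmoid units, one per consecutive pair of sizes."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(n_in, n_out), nn.Sigmoid()]
    return nn.Sequential(*layers)

shallow_encoder = make_encoder([1000, 100])                # one big reduction
deep_encoder = make_encoder(list(range(1000, 99, -100)))   # 1000, 900, ..., 100

x = torch.rand(32, 1000)            # a batch of 32 fake inputs
print(shallow_encoder(x).shape)     # torch.Size([32, 100])
print(deep_encoder(x).shape)        # torch.Size([32, 100])
```

The question is whether training the deep version finds a better 100-dimensional code than the single big step does.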

But that paper I alluded to earlier seemed to give the impression that I should get comparable results whether I used a deep or a shallow network.

Has anyone already gone down this research path? Does anyone have a good answer to this question (whether it's one based on intuition, or actual research performed)?

asked Apr 09 '12 at 02:14 by Brian Vandenberg


4 Answers:

I think the paper you are trying to remember is this one: http://ai.stanford.edu/~ang/papers/aistats11-AnalysisSingleLayerUnsupervisedFeatureLearning.pdf

An Analysis of Single-Layer Networks in Unsupervised Feature Learning, from Andrew Ng's group.

answered Mar 18 '13 at 16:46 by Dan Ryan

It depends what you mean by 'the same level of performance'. Some shallow methods can learn any target function; take the Gaussian-kernel SVM as an example. The caveat is that such shallow architectures may need many (perhaps even infinitely many) examples to achieve good performance. On the other hand, deep architectures can represent some classes of functions more compactly, and therefore better performance on fewer examples should be expected. However, in practice you often don't know whether your problem belongs to such a class of functions, and even if you knew, you still might not know how deep your network should be. In addition, there seem to be a lot of difficulties in training such deep networks, which can also cause some loss in performance. Given all that, it is often difficult to say a priori which architecture is better for your problem, and I guess you have to check different methods per problem to get an answer.
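As one crude way of 'checking different methods per problem', you can simply cross-validate a Gaussian-kernel SVM against networks of different depths on your own data. A minimal sketch, assuming scikit-learn; the dataset and hidden-layer sizes here are arbitrary placeholders, not a recommendation:

```python
# Compare a Gaussian (RBF) SVM with MLPs of increasing depth by cross-validation.
# Assumes scikit-learn; dataset and hidden-layer sizes are placeholders.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

candidates = {
    "rbf_svm": SVC(kernel="rbf", gamma="scale"),
    "mlp_1_layer": MLPClassifier(hidden_layer_sizes=(256,), max_iter=500),
    "mlp_3_layers": MLPClassifier(hidden_layer_sizes=(256, 128, 64), max_iter=500),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Which candidate wins will depend on the problem, which is exactly the point above.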

You can also read Learning Deep Architectures for AI for some insights.

Did you have one of these papers in mind? The Importance of Encoding Versus Training with Sparse Coding and Vector Quantization, or An Analysis of Single-Layer Networks in Unsupervised Feature Learning?

answered Apr 14 '12 at 18:20 by Mateusz (edited Apr 14 '12 at 18:40)

There are theoretical results showing that some functions can be computed by deep circuits much more efficiently than by shallow circuits. There is some discussion of this in the paper Shallow vs. Deep Sum-Product Networks.

answered Apr 10 '12 at 18:42 by alto

The best number of hidden layers to use in a neural network is problem-dependent. Sometimes 0 layers works best, sometimes 1 layer, sometimes 5 layers. Not only is the answer problem-dependent, it also interacts with the algorithm you use to set the weights of the neural network.

One thing I know for sure is that the depth of the model matters for many problems, given the training algorithms we have at our disposal. I would be highly suspicious of any paper suggesting that the number of layers is irrelevant for interesting problems (such as vision or speech tasks). Since deeper networks are often harder to train for a given layer size, as we get better at training them on problems of interest we may find that the optimal number of hidden layers is often not one or zero, but larger.

Intuitively, we would expect a single hidden layer neural network to be very inefficient statistically when the function being learned has componential or complicated structure caused by interactions that are hard to capture with simple template matching.

One particular problem I am personally familiar with is large vocabulary speech recognition. In ASR, people have used single hidden layer neural networks for acoustic modeling for many years -- in fact, they have used large ones with over 15,000 hidden units! Neural nets with several hidden layers produce results that are clearly much better than a single hidden layer neural net, even when the shallow net has more tunable parameters. Since shallow nets have historically been used much more, we would also expect typical researchers to have more experience training them, and as people train more neural nets with many layers we may see even larger gains from multiple hidden layers.
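To see why raw parameter counts do not settle the comparison in favor of the wide shallow net, here is a back-of-the-envelope sketch. The 15,000-unit figure is from above; the input/output sizes and the deep configuration are made-up placeholders, not numbers from any actual system:

```python
# Rough parameter counts for a wide shallow net vs. a deeper, narrower net.
# Only the 15,000-hidden-unit figure comes from the text above; the rest
# (input dimension, output size, deep layer sizes) are invented placeholders.
def n_params(layer_sizes):
    """Weights plus biases for a fully connected net with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_in, n_out = 360, 6000                # e.g. stacked acoustic frames in, tied states out

shallow = [n_in, 15000, n_out]         # one very wide hidden layer
deep = [n_in] + [2000] * 5 + [n_out]   # five modest hidden layers

print("shallow:", n_params(shallow))   # ~95.4 million parameters
print("deep:   ", n_params(deep))      # ~28.7 million parameters
```

The deeper configuration here has far fewer parameters, which is exactly the situation described above where it can still give clearly better results.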

answered Apr 09 '12 at 20:29 by gdahl ♦ (edited Apr 09 '12 at 20:35)
