Jacob Jensen · Jul 29 '10 at 00:20

Applications of Hessian-Free Deep Learning

This is a question about the (seemingly) marvelous development recently made by James Martens: a fast second-order gradient-based method that trains deep neural nets (in this case, deep autoencoders) better than layer-wise pretraining + fine-tuning. This is something people have been after pretty much since the early '90s. Until 2006, our closest thing to a decent deep architecture was Yann LeCun's convolutional net. Then RBMs and Deep Belief Nets came along, and they were really cool. But this new method seems to offer a way to train neural nets at least as deep as anything yet developed, and it could very possibly extend to recurrent nets and other exotic variations.

I do not have a complete grasp of Martens's method, though I hope to do an implementation as soon as I have a chance. However, from his paper, it seems to do everything standard backprop does, but better: faster convergence, and on deeper nets.
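To make the discussion a bit more concrete, here is my rough understanding of the skeleton of the approach, as a minimal sketch rather than Martens's actual algorithm: never form the curvature matrix explicitly, only compute matrix-vector products with it (here via finite differences of the gradient; Martens uses exact Gauss-Newton products via the R-operator), and feed those products to linear conjugate gradient to solve the damped Newton system for the update. The function names and the toy quadratic objective below are mine, purely for illustration; the real method adds Gauss-Newton curvature, Levenberg-Marquardt-style damping adjustment, CG backtracking, and mini-batching.

```python
import numpy as np

def hess_vec_product(grad_fn, theta, v, eps=1e-4):
    # Curvature-vector product without ever forming the matrix:
    # H v ~= (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps).
    # (Martens computes exact Gauss-Newton products with the R-operator instead.)
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2.0 * eps)

def hessian_free_step(grad_fn, theta, damping=1.0, cg_iters=50, tol=1e-10):
    # Approximately solve (H + damping*I) d = -g with linear conjugate gradient,
    # touching the curvature matrix only through matrix-vector products.
    g = grad_fn(theta)
    d = np.zeros_like(theta)
    r = -g                      # residual of (H + damping*I) d = -g at d = 0
    p = r.copy()
    rs_old = r @ r
    for _ in range(cg_iters):
        Ap = hess_vec_product(grad_fn, theta, p) + damping * p
        alpha = rs_old / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return d

# Toy usage: a badly conditioned quadratic, where plain gradient descent crawls.
A = np.diag([1.0, 100.0])
b = np.array([1.0, 1.0])
grad_fn = lambda th: A @ th - b     # gradient of 0.5*th'A th - b'th
theta = np.zeros(2)
for _ in range(5):
    theta = theta + hessian_free_step(grad_fn, theta, damping=0.1)
print(theta, "vs exact solution", np.linalg.solve(A, b))
```

Even on this toy ill-conditioned problem, the CG inner loop does in a handful of matrix-vector products what first-order descent would need hundreds of steps for, which is presumably why depth and ill-conditioning stop being such an obstacle.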

So I ask the question: if this method generalizes to any net that's small enough to be tractable on today's computers (meaning tens of millions of weights if you parallelize, still millions with a cruder implementation), what good does that do us? Imagination is welcome. Of course, I also want to know any limits of this method that we need to be cautious of. After all, second-order methods are perennial favorites, but (perhaps until now) nothing has truly beaten humble stochastic gradient descent.

I'll kick it off. Obviously, it'll be nice if we can get below 0.8% error or so on the MNIST digits without any pre-processing, but I think what's most promising is the chance to attack problems that are simply too ill-conditioned for a neural net trained with first-order backprop: for instance, intelligent feature extraction that goes layers and layers deep, incorporation of invariances beyond the scale and translation invariances of a convolutional net, or really excellent growth and pruning methods. On the opposite side of the spectrum, it'll make pre-made "plug-and-play" neural nets much, much more effective (all those kernel-lovers will be green with envy!).
