This is a question about the (seemingly) marvelous development recently made by James Martens: a fast second-order gradient-based method that trains deep neural nets (in this case, an autoencoder) more effectively than layer-wise pretraining plus fine-tuning. This is something people have been after pretty much since the early '90s. Until 2006, our closest thing to a decent deep architecture was Yann LeCun's convolutional net. Then RBMs and deep belief nets came along, and they were really cool. But this new method seems to offer a way to train neural nets at least as deep as anything yet developed, and it could very possibly extend to recurrent nets and other exotic variations.

I do not have a complete grasp of Martens' method, though as soon as I have a chance I hope to do an implementation. However, if you read his paper, it seems to do everything standard backprop does, only better: faster convergence, and on deeper nets.

So I ask: assuming this method generalizes to any net that's small enough to be tractable on today's computers (meaning tens of millions of weights if you parallelize, still millions with a cruder implementation), what good does that do us? Imagination is welcome. Of course, I also want to know about any limits of this method that we need to be cautious of. After all, second-order methods are perennial favorites, but (perhaps until now) nothing has truly beaten humble stochastic gradient descent.

I'll kick it off. Obviously, it'll be nice if we can get below 0.8% error or so on the MNIST digit set without any pre-processing, but I think what's most promising is the chance to attack problems that are implicitly too ill-conditioned for a neural net with first-order backprop: for instance, intelligent feature extraction that goes layers and layers deep, incorporation of invariances beyond the scale and translation invariances of a convolutional net, or really excellent growth and pruning methods. On the opposite side of the spectrum, it'll make pre-made "plug-and-play" neural nets much, much more effective (all those kernel-lovers will be green with envy!).

This question is marked "community wiki".

asked Jul 28 '10 at 23:12


Jacob Jensen

edited Jul 29 '10 at 00:20


4 Answers:

Hey all, I also want to put forward some results from Ilya Sutskever that he presented at CIFAR 2010. He basically used James Martens' method to train a recurrent neural network. I will not go into the details of the experiment, and I expect a paper to appear soon on the subject. What he did was train an RNN on Wikipedia, where the network got as input a one-hot encoding of a character and was supposed to predict the next character. He used 2000 units (quite a lot for an RNN) and trained it for one month on the GPU (maybe a year or more in CPU time, by Ilya's estimate). Note that this is not your ordinary experiment; I for one would not have the patience to let it run for 30 days or so.

Anyhow, the results are amazing. First of all, the RNN no longer seems to suffer from the vanishing gradient problem (I haven't seen conclusive results in this respect, and Ilya has just started playing with this; he also tested it on a task proposed by Schmidhuber et al. in their first paper about LSTM, a task known to be solvable only by LSTMs and meant to show that LSTM can deal with connecting events that are far apart in time). But the most amazing thing is that he could use the RNN to generate text that was coherent for a few words (it would also remember to close brackets in some cases, and so on). I think there are as yet no quantitative measures of the network's behaviour (or Ilya has not presented any so far), so these are very recent results. But we can of course speculate on how well this training method works. After talking to Ilya I also got a few samples generated by the network (I will post one here; hope he doesn't mind):

Shortly thereafter it was purchased by the army attacks on Bill World Service, like many of which recognize the Oscar's liberation of the nobility before he desired anything obey, instead of , I to be flooding during the civil war, has been fell.

I'm currently investigating James Martens' algorithm, as well as going over Yoshua's paper on vanishing gradients, trying to make sense of all this. My intent is also to implement the algorithm in Theano (which should make the GPU stuff transparent). If and when I manage to do that, I will definitely not mind sharing my thoughts and code.

This answer is marked "community wiki".

answered Aug 19 '10 at 11:24


pascanur

edited Jan 17 '11 at 13:55


David Warde Farley ♦

I just asked about the availability of Martens' algorithm in another thread. And I'm using Theano to implement my current models, so if this algorithm were available I would be very interested in applying it to my problem, which I believe has some of the deep learning characteristics.

Unfortunately, I currently don't have the time, and frankly not the background either, to implement it myself, but if it were out there (especially for Theano) I would be delighted.

(Aug 19 '10 at 12:24) antolikjan

This is where @antolikjan was asking for the code: http://metaoptimize.com/qa/questions/14/what-are-the-state-of-the-art-optimization-algorithms-for-problems-with-numerous-local-minima#60

(Aug 19 '10 at 13:41) Joseph Turian ♦♦

You also might be interested in this post from Justin Domke: http://justindomke.wordpress.com/2009/01/17/hessian-vector-products/ (in particular Barak Pearlmutter's answer). I wonder, does Theano already have all the stuff that Pearlmutter published about in recent years? By the way, Justin Domke's blog is quite a nice repository of recurring technical details (see the left column); maybe it could be linked somewhere visible?
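For anyone who wants the gist without Theano: a Hessian-vector product can be approximated by finite differences of the gradient, which is what Pearlmutter's R-operator computes exactly in a single pass. A minimal numpy sketch (my own toy example, not from any of the code discussed here):

    import numpy as np

    def hessian_vector_product(grad_f, w, v, eps=1e-5):
        # Central-difference approximation of H(w) @ v; Pearlmutter's
        # R-operator computes the same quantity exactly in one pass.
        return (grad_f(w + eps * v) - grad_f(w - eps * v)) / (2 * eps)

    # Toy check on f(w) = 0.5 * w'Aw, whose Hessian is exactly A.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    grad_f = lambda w: A @ w
    w, v = np.array([1.0, -1.0]), np.array([0.5, 2.0])
    print(hessian_vector_product(grad_f, w, v))  # ~ A @ v = [3.5, 4.5]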

(Sep 10 '10 at 08:04) osdf

This post by Sebastian Walter shows how to implement the R operator in Theano, unless I'm missing something:

http://groups.google.com/group/theano-users/msg/61dea76310b0bc25

However, in a couple of cases, I had to simplify the computation of my objective function to get it to work (e.g., replacing indexing with dot products with an indicator vector, removing calls to nnet.softmax, etc.).

(Jan 17 '11 at 04:00) Thouis Jones

I had a brief email conversation with James Martens about his paper, and came away with some info that may be useful if you're out to reproduce his algorithm.

The algorithm itself is rather simple in the sense that it doesn't require a ton of code, but wrapping your head around it might be a little more arduous depending on your background.

As reading material to become more familiar with some of the concepts, he recommended "Numerical Optimization" by Nocedal and Wright (ISBN 0387303030), which, among other things, gives good background on line search, trust region, conjugate gradient, and various other optimization techniques. Compared to other technically oriented books I find it rather readable (definitely not for the faint of heart, though; decent exposure to math recommended), though I'll freely admit I've only read through the first 4-5 chapters so far.

Although I cringe to think about the notation used in the book, you might also look at "Numerical Analysis" by Burden & Faires for some background if you're not ready to jump into Nocedal & Wright. There are probably far better books out there than B&F; it's just what we used in my numerical analysis class.

As for implementing the algorithm, go to Hinton's home page and download the Matlab source for their MNIST digit classifier (Hinton & Tieleman, ~2006 or 2007). Open backprop.m: this code does the backprop refinement after pre-training with contrastive divergence.

At first glance it might appear somewhat hairy, but when you peel back the layers, it's just backprop with a non-linear CG solver (minimize.m) computing the weight update, using 3 line searches.

Martens' method replaces the non-linear CG with a quadratic approximation of the error surface near the current point, parameterized by the weights (q(theta)). He then uses linear conjugate gradient (Burden & Faires have a simple and well-written, albeit notationally opaque, CG algorithm at the end of chapter 7), run for n < N steps, with a line search to determine how much to 'trust' the resulting solution.

For those less familiar with conjugate gradient, it's a fairly straightforward method. Chapter 7 of B&F starts out by covering some iterative methods for solving linear systems of the form Ax = b, explains heuristics for improving the runtime, and finally ends with CG.

At the heart of many iterative methods is the idea of a fixed point of a function: any x satisfying x = f(x) is a fixed point of f. Many iterative methods exploit this by setting up a linear system related to the original Ax = b, taking a form similar to x = Tx + c.
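To make that concrete, here's a toy Jacobi iteration (my own sketch, not taken from B&F verbatim), which rewrites Ax = b in exactly that x = Tx + c form and iterates to the fixed point:

    import numpy as np

    def jacobi(A, b, iters=100):
        # Split A into its diagonal D and remainder R, then iterate
        # x_{k+1} = D^{-1} (b - R x_k). Converges when A is diagonally
        # dominant, i.e. when the iteration map is a contraction.
        D = np.diag(A)
        R = A - np.diagflat(D)
        x = np.zeros_like(b)
        for _ in range(iters):
            x = (b - R @ x) / D
        return x

    A = np.array([[4.0, 1.0], [2.0, 5.0]])   # diagonally dominant
    b = np.array([1.0, 2.0])
    print(jacobi(A, b))                      # ~ np.linalg.solve(A, b)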

CG starts out with an initial approximation to the solution and takes a step in some direction that gets closer. On the 2nd step, a new direction is chosen that is conjugate (A-orthogonal) to the first parameter update, rather than merely perpendicular to it. The method successively chooses new directions that are conjugate to all prior parameter updates.

At worst, with linear CG you're pretty much guaranteed not to lose ground after any given iteration, and in most cases you'll see large improvements very early on. Early convergence of CG is usually extremely rapid even from a very poor initial starting point; this also depends heavily on how well conditioned the matrix A in Ax = b is. If it is poorly conditioned, well, you can look that one up if you're interested.

In a typical Ax = b problem, you want the most accurate estimate of x possible. In this case, we just want to move in a good direction. Provided the problem is well enough conditioned, CG is guaranteed (in exact arithmetic) to converge in N steps (where A is N x N). And in a number of steps often much smaller than N, CG can produce a pretty significant improvement, particularly considering we don't really need the most accurate solution to Ax = b, just an x that moves us in a good direction in our rough, local, quadratic approximation.
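Here's a bare-bones linear CG in numpy to pin down the idea (a textbook sketch, not Martens' code; his HF adds damping, preconditioning, and a stopping criterion on the quadratic objective). Note that it only touches A through products A @ p, which is exactly why HF can run it with Hessian-vector products instead of an explicit curvature matrix:

    import numpy as np

    def conjugate_gradient(A, b, max_iters=None, tol=1e-10):
        # Minimizes 0.5 x'Ax - b'x, i.e. solves Ax = b for SPD A.
        # Each new direction p is A-conjugate to all previous ones.
        n = b.shape[0]
        x = np.zeros(n)
        r = b - A @ x            # residual = negative gradient at x
        p = r.copy()             # first direction: steepest descent
        for _ in range(max_iters or n):
            Ap = A @ p
            alpha = (r @ r) / (p @ Ap)
            x += alpha * p
            r_new = r - alpha * Ap
            if np.linalg.norm(r_new) < tol:
                break            # early stopping: a partial solve suffices in HF
            beta = (r_new @ r_new) / (r @ r)
            p = r_new + beta * p # conjugate, not merely perpendicular
            r = r_new
        return x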

That is probably the single biggest win in Martens' method. Non-linear CG, as he and others point out, loses conjugacy after only a few iterations if the problem is highly non-linear. In short, producing a parameter change that is conjugate to prior changes and in roughly the same direction as the gradient is less effective than in linear CG, so you can't stop non-linear CG early and still hope for a good improvement to your parameters, whereas in linear CG you can almost always expect a very good improvement after only a few iterations.

I'll quiet down now. Hopefully I haven't made any gross errors in my attempt at helping.

-Brian

This answer is marked "community wiki".

answered Dec 01 '10 at 01:39


Brian Vandenberg

Are you sure there is any line search involved at all? I implemented the algorithm, and as far as I can tell the reliability check only looks at the quality of the final search direction as a quadratic approximation of the true objective function. Then again, I'm not getting great performance on larger datasets yet, so I might have made a mistake somewhere.

(Dec 01 '10 at 07:42) Philemon Brakel

I am really interested in this, but don't have much time to implement it right now. As far as I understand, only a modification/replacement of minimize.m is needed? If so, it would be great if you don't mind sharing that piece of code... Brian, have you implemented it?

(Dec 01 '10 at 11:17) Oliver Mitevski

When talking about Hinton & Tieleman's implementation, I was mostly trying to describe where the optimization happens in backprop. minimize.m was not written by them, as noted in the comments at the top of that file.

Read through backprop.m (or whatever its name is) and get a feel for how it works. This is roughly what is happening:

  • Set up 8(?) sets of weights, where the 2nd half are just transposes of the first half; a column of 1s is added to make it easy to deal with bias units.
  • Start the training loop.
  • Calculate training & validation set reconstruction errors.
  • Reshape samples from minibatches into larger batch sizes.
  • Do the feed-forward step, then call minimize.m, passing in the weights, the generated information necessary for the CG step, and the text string 'CG_MNIST'; the function referred to by that string is used to evaluate f(x) and f'(x) for the non-linear CG solver.
  • Update the weights (and biases? I don't recall) based on the results.

I don't have a solution implemented yet, but I'll be starting an implementation this week.

-Brian

(Dec 01 '10 at 11:48) Brian Vandenberg

Thanks Brian! I have actually translated that piece of code from MATLAB to Python (the RBMs and autoencoder with pretraining from Hinton, and minimize.m from Rasmussen). I did some refactoring to remove some redundancies in the original code, have read the papers, and spent some time experimenting with it. So I understand how it works, but honestly I haven't read Martens' paper completely (just the beginning). I'm not afraid of the heavy math; I've done some optimization algorithms in college. So I was wondering if some addition/modification to the code could reproduce Martens' results.

(Dec 01 '10 at 12:14) Oliver Mitevski

Philemon,

I won't pretend to be an expert on this; instead, I'll quote James Martens (from an email) on the subject:

[quote]

Line searches are often not done in the context of gradient descent because it takes more work to run the line search than it takes to actually compute the update vector p. Thus alpha is computed using some pre-determined schedule, or just set by hand. In HF it is very much not the case that the line-search will be more expensive than computing p, and being a 2nd-order method it's more important than it is with gradient descent to use a good line-search with HF.

Line-searches are a basic optimization theory concept, and are used all the time in 2nd-order optimization algorithms. Usually, when the 2nd-order optimizer is performing well, alpha is almost always chosen by the line-search algorithm to be 1.0. See any good optimization text like "Numerical Optimization" by Nocedal and Wright for details.

[end quote]
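For reference, the kind of line search he's describing is a standard backtracking (Armijo) search. Here's a generic sketch (my own, assuming numpy arrays; this is not code from the paper):

    def backtracking_line_search(f, grad, x, p, alpha=1.0, rho=0.5, c=1e-4):
        # Shrink alpha until the sufficient-decrease condition holds:
        #   f(x + alpha p) <= f(x) + c * alpha * grad(x)'p
        # As Martens notes, a good 2nd-order step usually accepts alpha = 1.
        fx = f(x)
        slope = grad(x) @ p      # negative for a descent direction
        while f(x + alpha * p) > fx + c * alpha * slope:
            alpha *= rho
        return alpha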

There's also some contextual information in the slides from his ICML talk that may help as well:

http://www.cs.toronto.edu/~jmartens/docs/HF_talk.pdf

-Brian

(Dec 01 '10 at 12:23) Brian Vandenberg

Thanks a lot for the information. I didn't find any indication of the use of line searches, and since the step size is close to 1 when the algorithm performs well, it could be that one is not necessary when carefully checking the quality of the approximation. Then again, the slides do mention the use of batches of data for line searches, and the original paper also didn't say anything about not updating the weights at all when the estimates are bad (which I had to do, or otherwise things got unstable once in a while).

For other people who are thinking of implementing this, I suggest having a look at the source of the fmin_ncg (Newton-CG) function in scipy. This optimizer is essentially Martens' algorithm without all the new changes he came up with for increased performance and for scaling it up to large datasets. Ironically, it came with a line search that I explicitly removed when refactoring the code. My own code is quite task-specific (read: ugly) and needs a GPU to run. It also doesn't work that well yet...
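A toy usage sketch of fmin_ncg on a quadratic, to show the interface (my own example, not from my actual code); note that it accepts a Hessian-vector product callback (fhess_p) directly, which is the same interface HF needs:

    import numpy as np
    from scipy.optimize import fmin_ncg

    # Toy quadratic f(x) = 0.5 x'Ax - b'x, minimized at A^{-1} b.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, 1.0])

    f       = lambda x: 0.5 * x @ A @ x - b @ x
    fprime  = lambda x: A @ x - b
    fhess_p = lambda x, p: A @ p   # Hessian-vector product, no full Hessian

    x_star = fmin_ncg(f, np.zeros(2), fprime, fhess_p=fhess_p, disp=False)
    print(x_star)                  # ~ np.linalg.solve(A, b) = [0.2, 0.4]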

(Dec 01 '10 at 12:44) Philemon Brakel

Thank you for bringing this article to my attention; I find the results and method very interesting. I am sure I share your disappointment at not yet having a reply from the very knowledgeable folks here at MO.

It's a pity there is very little actual code to go along with the paper. Are you aware of any source that would get us kick-started on replicating the findings of the paper?

For the applications where I use Hinton's DBMs, I particularly like the property that unclassified examples can be used to improve results. It appears that this work shares that ability through the application of pretraining. However, as of yet I have not seen any effort to quantify the effect of PT beyond improving the convergence rate.

As I said, this is extremely interesting work; I only wish the author were as generous as Hinton generally is with his code.

This answer is marked "community wiki".

answered Aug 01 '10 at 16:38


DwoaC

Well, I wouldn't be too harsh w/r/t the code: a) he'd probably share it given a legitimate request, and b) it might be "not ready for prime time" in some sense, for reasons of cleanliness, simplicity, or otherwise.

What matters is that it looks very impressive, and soon enough I'm sure it'll be more easily duplicable.

(Aug 01 '10 at 20:33) Jacob Jensen

You are right, of course, and by no means did I mean to give the impression that I was being hard on the author. The work indeed appears very valuable.

(Aug 03 '10 at 08:28) DwoaC

The implementation used in the paper is CUDA-based (i.e. GPU-based), which probably makes it fairly time-consuming to get into a releasable shape due to various library dependencies.

(Aug 03 '10 at 09:55) Andriy

Although, from the paper, it doesn't seem that much harder to implement than a well-tuned SGD + pretraining.

It seems to me that the main issue with second-order methods is that estimating the Hessian is not as easy as estimating the gradient, which is itself surprisingly hard. I tried looking at the componentwise variance of the stochastic gradient in a high-dimensional, highly nonlinear problem (for which SGD worked fine, but I was trying a variant of stochastic coordinate descent) and needed huge minibatch sizes to get it small enough to have high confidence in it; with the Hessian this should be even worse, hence the need for so many tunings and adjustments in the paper. It seems to me a good advance in uniting stochastic optimization with second-order methods, but there's a lot of ground left to explore there.
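To illustrate the kind of check I mean, here's a small numpy sketch on a made-up least-squares problem (my own toy, not the problem I described): the minibatch gradient is a mean of per-example gradients, so its componentwise variance shrinks like 1/m with minibatch size m:

    import numpy as np

    def minibatch_gradient_variance(per_example_grads):
        # Rows are per-example gradients; the mean over m of them has
        # componentwise variance var(g_i) / m, so m must grow with the
        # per-example variance to keep confidence in each coordinate.
        m = per_example_grads.shape[0]
        return per_example_grads.var(axis=0) / m

    # Toy linear least-squares: grad_i = (w.x_i - y_i) * x_i
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ np.ones(5) + rng.normal(size=1000)
    w = np.zeros(5)
    grads = (X @ w - y)[:, None] * X
    print(minibatch_gradient_variance(grads))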

(Aug 03 '10 at 22:56) Alexandre Passos ♦

Not hard to implement, Alexandre? Well, maybe not absurdly complicated, but a big step up from stochastic gradient descent on a regular neural net.

I want to implement this, but the complexity level seems rather high, and I have a slow PC and don't know how to do GPU programming, so I'm worried I might be running week-long tests, or that I just might not be able to get it to work because of some random bug. I want more flexibility than pre-training can provide me, though.

(Aug 07 '10 at 02:15) Jacob Jensen

"However as of yet I have not seen any effort to quantify the effect of PT beyound improving the convergence rate."

There has been an entire JMLR paper dedicated to this: Erhan et al., 2010.

(Jan 17 '11 at 13:56) David Warde Farley ♦

On another note, I'm curious to see some hybrid approaches.

Initially, the weights for the entire system are completely random. To some extent, I still expect the learning signal to be less effective for the earliest layers in a deep net (though I don't have any data to back that up).

There's a myriad of variations on this theme I've mulled over, but I'll point out two or three that seem at least worth thinking about.

With the type of architecture Martens used in his paper (a number of layers, with 'shadow' copies of those layers used to reconstruct the original data) as the end goal, start with only 3 layers: input, intermediate, and reconstructed input. Perform enough training to remove some of the randomness from the weights, then add a 2nd hidden layer and repeat the process until the entire net has been built.

Another approach could be a hybrid between contrastive divergence learning (on each set of weights in parallel) and the HF-optimized approach. This seems less likely to be as beneficial, but I may be overlooking something.

Linear and quadratic programming problems are often stated as "solve <X> subject to <some list of conditions>". CD learning's objective is to maximize the log probability of the joint distribution (equivalently, minimize the negative log probability), possibly subject to some minor constraints (small weight magnitudes, sparsity constraints, etc.).

We can always add more constraints to the mix, or modify the objective function. For example, rather than maximizing log P(V,H) directly, we might produce a quadratic approximation of the objective function and use Martens' HF optimization method to come up with a new variation on contrastive divergence.

For example, one of Yann LeCun's basic approaches is to minimize two (or more) separate objective functions. Each iteration, the weight update scheme might look like this (sketched in code after the list):

pass 1: update weights in an attempt to improve constraint #1
pass 2: update weights in an attempt to improve constraint #2
pass 3: update weights in an attempt to improve constraint #1
pass 4: update weights in an attempt to improve constraint #2
pass 5: update weights in an attempt to improve constraint #1
pass 6: update weights in an attempt to improve constraint #2
...
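A schematic of that alternation in code (my own sketch; grad_f1 and grad_f2 stand in for the gradients of the two objectives, and plain gradient steps stand in for whatever update rule you'd actually use):

    import numpy as np

    def alternating_descent(w, grad_f1, grad_f2, lr=0.01, n_passes=100):
        # One gradient step per pass, cycling between the two objectives.
        grads = [grad_f1, grad_f2]
        for k in range(n_passes):
            w = w - lr * grads[k % 2](w)   # even k: objective 1, odd k: objective 2
        return w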

Anyway, it's getting to be bedtime.

-Brian

This answer is marked "community wiki".

answered Dec 01 '10 at 02:12


Brian Vandenberg
