I'm working on an implementation of the algorithm James Martens published over the summer, and the following is a summary of what I get out of the paper. If anyone would like to critique it I would sincerely appreciate it. (...) EDIT: I finally had some extra time to jump back into this. Here's a mix of Matlab syntax and a pseudo-code sketch of the algorithm. Please pick it apart, or use at your own risk.
-Brian
This seems mostly reasonable. A few things:
In order, my responses:
-Brian
(Dec 05 '10 at 21:37)
Brian Vandenberg
Okay, but the way you wrote the code, it sounded like rho was being computed even before the first run of CG. It's probably easier to think of the lambda adjustment as something that occurs after CG instead of before. Stopping the CG runs early is "helpful" early on only in that it makes the error go down faster at first. But this does permanent damage that affects long-term performance. So it's never a good idea to do this, unless you don't plan to run the HF algorithm past the short term. Not NIPS, a paper in progress.
(Dec 05 '10 at 21:59)
James Martens
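For what it's worth, the "lambda adjustment after CG" that James describes can be sketched as follows. The 1/4 and 3/4 thresholds and the 3/2 and 2/3 multipliers are the Levenberg-Marquardt style constants I believe the ICML paper uses, but treat this as a sketch to check against the paper, not a canonical implementation:

```python
def update_lambda(lam, rho):
    # Levenberg-Marquardt style damping update, applied AFTER a CG run.
    # rho is the reduction ratio (f(theta + p) - f(theta)) / q(p):
    # actual improvement divided by the improvement the quadratic
    # model q predicted for the CG step p.
    if rho > 0.75:        # model was trustworthy: relax the damping
        lam *= 2.0 / 3.0
    elif rho < 0.25:      # model overestimated progress: tighten damping
        lam *= 3.0 / 2.0
    return lam
```

A very large lambda makes the curvature matrix look nearly diagonal, which is one reason heavily damped runs of CG terminate so quickly.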
I'm actually quite curious about some of the practical details like the hard stopping, etc. I know these things are very task-dependent, but I find it odd that CG almost always terminates after just a couple of iterations because the criterion in 4.4 barely changes, and that lambda always takes very high values like 1e6 or 1e7. It makes me think there is a mistake somewhere, but all the gradient/Gv-like stuff seems correct according to finite differences... I also found that line searches almost always result in alpha=1, btw. I didn't implement the preconditioned version yet, though...
(Dec 06 '10 at 04:53)
Philemon Brakel
It sounds like you have a bug or are initializing the network badly. Either use the random initialization scheme I described in the ICML paper or something else that's reasonable (e.g. Nguyen-Widrow). On the autoencoder tasks, preconditioned CG should want to run for ~20 iterations at the beginning, and hundreds at the end. Non-preconditioned CG should want to use even more iterations, since it tends to be less efficient per iteration. A very high lambda value will cause very short runs of CG, but the bug isn't necessarily in your lambda adjustment code.
(Dec 10 '10 at 21:35)
James Martens
Thanks for the response. I'm actually training a recurrent neural network, so I cannot directly compare the behavior of the algorithm to the results for the autoencoder. I will definitely go over my code again and have a look at the Nguyen-Widrow method.
(Dec 11 '10 at 07:35)
Philemon Brakel
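Regarding the stopping criterion in section 4.4 mentioned above: my reading of the paper's relative-progress test is roughly the following. The look-back window k = max(10, 0.1*i) and the tolerance epsilon = 5e-4 are the values I believe the paper uses, but double-check them against your copy:

```python
def cg_should_stop(phi_vals, eps=5e-4):
    # phi_vals[i] is the value of the quadratic objective
    # phi(x) = 0.5 * x'Ax - b'x after CG iteration i.
    i = len(phi_vals) - 1
    k = max(10, int(0.1 * i))        # look-back window grows with i
    if i < k or phi_vals[i] >= 0:
        return False
    # Stop when relative progress over the last k iterations is tiny.
    return (phi_vals[i] - phi_vals[i - k]) / phi_vals[i] < k * eps
```

If phi barely changes between iterations (as described above), this test fires almost immediately, so a nearly flat phi curve and a huge lambda tend to show up together.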
I guess this is a late response, but I've found a few errors in the code. While computing the Gv product at the beginning of the r1 pass, you backpropagate through M; that's an excessive computation, since you have already propagated through M in the f1 pass. So it has to be
Thanks a lot to Nic Schraudolph, who helped me to double-check it. Also, there are a few syntax errors where you use element-wise instead of matrix multiplication. And thank you for the code, Brian. It was really helpful :)

Understanding Nic's paper was very key to getting this all figured out, and thankfully he was more than willing to help me out as well. Thanks for the critiques; I'm glad the code is helping people.
(Nov 06 '14 at 13:15)
Brian Vandenberg
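As a sanity check on the f1/r1 structure discussed above: the Gauss-Newton product from Schraudolph's paper is Gv = J' H_L (J v), where J is the Jacobian of the network outputs with respect to the parameters and H_L is the Hessian of the loss with respect to the outputs. A tiny dense sketch of that identity (not how you'd implement it inside the passes, where J is never formed explicitly):

```python
import numpy as np

def gauss_newton_vec(J, HL, v):
    # Gv = J' * H_L * (J v).  In a real HF implementation J is never
    # built: Jv comes from a forward (R-) pass and the J'(.) part from
    # a single backward pass, which is why backpropagating through M
    # a second time in r1 is redundant.
    return J.T @ (HL @ (J @ v))
```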
I was under the impression that linear conjugate gradient gives an exact minimizer for q. Do you still perform a line search then? Or do you perform that line search on f itself?
I think the answer is yes, but I won't go so far as to say it's absolutely true.
I will, however, quote something from Wikipedia that at least lends some credence to doing it (also, from an email conversation I had with James I got the distinct impression it's necessary, though you're the 2nd person to question it).
"The line search approach first finds a descent direction along which the objective function f will be reduced and then computes a step size that decides how far x should move along that direction. The descent direction can be computed by various methods, such as gradient descent, Newton's method and quasi-Newton method. The step size can be determined either exactly or inexactly."
-Brian
Oh, and you're right about the quadratic approximation for q. Linear CG, if it's allowed to converge completely (though we don't want to let it go that far, since stopping early saves time), will find the optimal minimizer for q.
We don't want to minimize q, but rather the function it [locally] approximates, our target function f. Nevertheless, attempting to minimize q helps.
The line search attempts to find a coefficient a that determines how far to travel along the solution found by LCG. My intuition says I should go as far along the solution as possible, but shorter jumps are better when pathological curvature is encountered: q probably isn't going to model the more extreme curvature all that well, and that's what you'll be trying to minimize.
This is sort of like asking for directions at a gas station and getting vague responses. If they say to go 5 miles in 'that direction', would you trust that it's really 5 miles when the streets are a warren, or would you more likely go 2-4 miles and ask someone else?
-Brian
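A minimal sketch of the kind of backtracking search described above: start from the full CG step (alpha = 1) and shrink until the true objective f, not the quadratic model q, actually decreases. The shrink factor 0.8 and the bare "did f decrease" acceptance test are illustrative choices on my part, not taken from the paper:

```python
def backtrack_alpha(f, theta, p, shrink=0.8, max_tries=60):
    # f: the true objective; theta: current parameters (a scalar here
    # for simplicity, a vector in practice); p: the step found by CG.
    f0 = f(theta)
    alpha = 1.0
    for _ in range(max_tries):
        if f(theta + alpha * p) < f0:   # accept the first improving step
            return alpha
        alpha *= shrink
    return 0.0                          # no improving step found
```

When q models f well near the current point, the full step passes immediately and alpha = 1 comes out, which matches the "alpha=1 almost always" observation in the comments above.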
However, the backtracking of directions somehow takes care of the difference between q and f.
I guess the most important question is: does it work as advertised, doing deep learning with no need for layer-wise restrictions?
I tried to program this same algorithm in Java, but somehow it just wouldn't work. I'll take a look at your code and maybe attack it again!
I can't answer that yet; I don't have a fully functional implementation. That was more or less a sketch of the algorithm. However, one individual on these forums (the name escapes me) wrote an implementation for his master's program (I don't believe it was his thesis), and his results seemed to agree with Martens' results.
There are three issues (so far) with what I wrote (ignoring inefficiencies, redundant code, etc.). The conjugate-gradient code is written badly, there's one mistake in calculating the gradient, and it was written with a process more like you'd use for an RBM, not what James described:
-Brian
Thanks for putting this together, very helpful. One question though: can you explain what's going on with dphi and its role in the r0 phase?
Sure. In the abstract, dphi is the direct derivative of the activation function at a given layer with respect to its input. In my code, I've implemented it as a cell array, where each element of the cell contains the derivatives for the corresponding layer. dphi{kk} is only used to back-prop error information to layer kk-1, so dphi{1}=0. In my fully fleshed-out implementation, I set it to NaN to force an error condition if I ever accidentally used it. In the F0 pass, I'm calculating and then caching these direct derivatives to speed up calculations in the R1 and F1 passes.

Hi Brian,
I am also trying to learn the HF method using the Martens, Schraudolph, and Pearlmutter papers.
I notice in your code that the Hessian-vector product is noted as Hv; are you not finding the Gauss-Newton approximation to the Hessian, which I have seen written as Gv?
I am currently working on just computing Hv, because the documentation provided by Pearlmutter and Bishop is fairly clear. Once I have finished this I will try Gv, which is the end goal.
Also, there seems to be some confusion in the comments above. Conjugate gradient itself doesn't minimize q; it just finds the Newton direction to minimize q. Line search, I think, should still be used, since q is just an approximation to f, and we don't want to take such a large step away from our current location in the function domain that we land at a point where q is no longer an accurate approximation of f.
I know you know. Writing it out just helps me learn this stuff...
Will
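On validating Hv: a central finite difference of the gradient is the standard check for a Pearlmutter R-op implementation, and it's what the "seems correct according to finite differences" comment above refers to. A hedged sketch, where grad is assumed to be your analytic gradient function:

```python
import numpy as np

def hv_fd(grad, theta, v, eps=1e-6):
    # Central finite difference of the gradient along direction v:
    #   Hv ~= (grad(theta + eps*v) - grad(theta - eps*v)) / (2*eps)
    # Compare this against the Hv produced by your R-pass; they should
    # agree to several digits for reasonable eps.
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2.0 * eps)
```

For a quadratic f = 0.5 x'Ax the check is exact up to rounding, since the gradient Ax is linear; for a real network pick eps with the usual finite-difference care.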