
I'm working on an implementation of the algorithm James Martens published over the summer, and the following is a summary of what I got out of the paper. If anyone would like to critique it, I would sincerely appreciate it.

(...)

EDIT:

I finally had some extra time to jump back into this. Here's a sketch of the algorithm in a mix of Matlab syntax and pseudo-code. Please pick it apart, or use at your own risk.

layer_dimensions = [ 784 500 2000 2000 2000 500 784 ]
ld_size          = numel( layer_dimensions )
num_neurons      = sum( layer_dimensions )

% 'deal', as used here, just copies its argument, so x, rx, etc. all
% start as duplicates of the same empty cell array
% 
% rx = R{x}
% dx = derivative of the error w/respect to x
% dphi = y * (1 - y)
[x rx dx rdx y ry dy rdy dphi] = deal( cell( ld_size,   1 ) )
[W rW dW rdW]                  = deal( cell( ld_size-1, 1 ) )

for ii = 1:ld_size
  x{ii} = zeros( layer_dimensions( ii ), num_samples )  % num_samples: mini-batch size, assumed defined
  y{ii} = x{ii}

  if ( ii < ld_size )
    W{ii} = randn( layer_dimensions( ii ), layer_dimensions( ii+1 ) )
  end
end

% Initialize first layer
y{1}    = inputs
dphi{1} = zeros( size( y{1} ) )  % phi depends on x; since x{1}=0, phi{1} is
                                 % constant, so dphi{1} = 0
ry{1}   = dphi{1}                % y{1} receives no input from earlier layers

function hv = HV( V )
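  % note: reads/writes W, y, dphi, rx, ry, etc. from the enclosing scope
  % (pseudo-code; a real Matlab version would use a nested function)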
  % f1 pass begin
  for kk = 1:ld_size-1
    % Using material discussed near eqn (4.2) from Schraudolph's paper,
    % do a full f1 pass through M ...
    rx{kk + 1} = ( ry{kk}' * W{kk} + y{kk}' * V{kk} )'
    ry{kk + 1} = rx{kk + 1} .* dphi{kk + 1}
  end
  % f1 pass end

  % The matrix 'A' in eqn (4.2) is the identity matrix (section 2)
  %   ry{.} = A * JM * JN * V
  %
  % Now, set u = ry{ld_size} and backprop this; then we're done

  % r1 pass begin
  dy{ld_size}     = ry{ld_size}
  dx{ld_size}     = dphi{ld_size}  .* dy{ld_size}
  dW{ld_size - 1} = y{ld_size - 1} .* dx{ld_size}

  for kk = 1:ld_size-1
    dy{ld_size - kk}     = W{ld_size - kk} * dx{ld_size - kk + 1}
    dx{ld_size - kk}     = dphi{ld_size - kk} .* dy{ld_size - kk}
    dW{ld_size - 1 - kk} = y{ld_size - kk}    .* dx{ld_size - kk + 1}
  end
  % r1 pass end

  hv = dW
end

for ii = 1:numepochs
  for jj = 1:numbatches
    % f0 pass begin
    for kk = 1:ld_size-1
      x{kk + 1}    = ( y{kk}' * W{kk} )'
      y{kk + 1}    = s( x{kk + 1} )            % s() = the logistic sigmoid
      dphi{kk + 1} = y{kk + 1} .* (1 - y{kk + 1})
    end
    % f0 pass end

    % r1 pass begin
    dy{ld_size}     = y{ld_size} - y{1}
    dx{ld_size}     = dphi{ld_size}  .* dy{ld_size}
    dW{ld_size - 1} = y{ld_size - 1} .* dx{ld_size}

    for kk = 1:ld_size-1
      dy{ld_size - kk}     = W{ld_size - kk} * dx{ld_size - kk + 1}
      dx{ld_size - kk}     = dphi{ld_size - kk} .* dy{ld_size - kk}
      dW{ld_size - 1 - kk} = y{ld_size - kk}    .* dx{ld_size - kk + 1}
    end
    % r1 pass end

    % first candidate solution for conjugate-gradient:
    %   dW = J_LoMoN
    V = dW

    % used inside the CG loop
    normr_last = 0

    while ( not converged )
      hv = HV( V )

      % Initialize some stuff for conjugate-gradient
      if( this is the first iteration of the CG loop )
        r = V
        for kk = 1:ld_size-1
          r{kk}       = r{kk} - hv{kk}
          normr_last  = normr_last + r{kk}' * r{kk}
        end

        normr_last = sqrt( normr_last )

        p = r
      end

      % The rest from here on is a single pass of conjugate-gradient
      Hp = HV( p )
      normr = 0

      for kk = 1:ld_size-1        
        alpha   = sum(r{kk}.^2) / ( p{kk}' * Hp{kk} )
        V{kk}   = V{kk} + alpha .* p{kk}
        r{kk}   = r{kk} - alpha .* Hp{kk}
        normr   = normr + r{kk}' * r{kk}
      end

      normr = sqrt( normr )

      if( normr is small enough, or max # of iterations exceeded )
        set condition to exit CG loop
      else

        beta       = normr / normr_last
        normr_last = normr

        for kk = 1:ld_size-1
          p{kk} = r{kk} + beta .* p{kk}
        end
      end % if statement

      % At this point, V contains the next best guess for the next iteration
    end % CG-loop

    % At this point, hopefully the CG algorithm converged.
    % V should contain a good weight update
  end % mini-batches
end % epochs

-Brian

asked Dec 05 '10 at 20:13

Brian Vandenberg

edited Feb 02 '11 at 03:40

I was under the impression that linear conjugate gradient gives an exact minimum for q. Do you still perform a line search then? Or do you perform that line search on f itself?

(Dec 06 '10 at 02:33) Justin Bayer

I think the answer is yes, but I won't go so far as to say it's absolutely true.

I will, however, quote something from Wikipedia that at least lends some credence to doing it (also, from an email conversation I had with James I got the distinct impression it's necessary, though you're the 2nd person to question it).

"The line search approach first finds a descent direction along which the objective function f will be reduced and then computes a step size that decides how far should move along that direction. The descent direction can be computed by various methods, such as gradient descent, newton's method and Quasi-Newton method. The step size can be determined either exactly or inexactly."

-Brian

(Dec 06 '10 at 03:47) Brian Vandenberg

Oh, and you're right about the quadratic approximation for q. Linear CG, if it's allowed to converge completely (but we don't want to let it go that far; saves time), will find the optimal minimizer for q.

We don't want to minimize q, but rather the function it [locally] approximates, our target function f. Nevertheless, attempting to minimize q helps.

The line search attempts to find a coefficient a giving the best distance to travel along the solution found by linear CG. My intuition says I should go as far along the solution as possible, but shorter steps are better when pathological curvature is encountered: q probably isn't going to model the more extreme curvature all that well, and that's exactly where you'll be trying to minimize.

This is sort of like asking for directions at a gas station and getting vague responses. If they say go 5 miles in 'that direction', would you trust that it's really 5 miles when the streets are a warren, or would you more likely go 2-4 miles and ask someone else?
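Something like this simple backtracking sketch is what I have in mind (illustrative only; f_of is a hypothetical stand-in for evaluating the objective, theta is the flattened weights, and delta is the solution returned by CG):

% backtrack along the CG solution until the true objective f improves
a  = 1
f0 = f_of( theta )
while f_of( theta + a * delta ) > f0 && a > 1e-8
  a = a / 2      % shorter jumps when the full step overshoots
end
theta = theta + a * delta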

-Brian

(Dec 06 '10 at 04:01) Brian Vandenberg

However, the backtracking of directions somehow takes care of the difference between q and f.

(Dec 06 '10 at 06:00) Justin Bayer

I guess the most important question is: does it work as advertised, doing deep learning with no need for layer-wise pre-training?

I tried to program this same algorithm in Java, but somehow it just wouldn't work. I'll take a look at your code and maybe attack it again!

(Feb 02 '11 at 23:34) Jacob Jensen

I can't answer that yet; I don't have a fully functional implementation. That was more or less a sketch of the algorithm. However, one individual on these forums (the name escapes me) wrote an implementation for his master's program (I don't believe it was his thesis), and his results seemed to agree with Martens' results.

There are several issues (so far) with what I wrote (ignoring inefficiencies, redundant code, etc). The conjugate-gradient code is written badly, there's one mistake when calculating the gradient, and it was written with a process more like you'd use for an RBM rather than what James described:

  • The gradient should be estimated using a much larger portion of the data-set than is used in the CG algorithm. James used the entire data-set for the gradient, then averaged (or summed, either one) and used that to do the CG step.
  • The gradient is an outer product of some vector related to y and another related to x. Since there are multiple samples, the entity used is a matrix, so the math on that is being done wrong (see the sketch after this list).
  • It was written under the assumption that the gradient estimate wouldn't be averaged (or summed) first.
  • It was written using a more generic CG skeleton that used the norm of the residual as the termination condition instead of the conditions James outlined in his paper. I had to facepalm on that one; James was the one who pointed it out. In my defense, I was just trying to get an outline written, after which I was going to go over his paper and get the details right.
  • Although I believe I paid good attention to the dimensions and how to mix/match the multiplications & whatnot, I was still thinking of W & V as vectors instead of matrices while writing the CG code, so some of the code needs to be adapted to more appropriately come up with a new candidate solution.
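For the outer-product issue above, a minimal sketch of the corrected form (assuming, as in the code above, that y{kk} and dx{kk+1} each hold one column per sample; the matrix product then sums over the mini-batch automatically):

% gradient as an outer product, summed over the samples in the mini-batch:
% [d_k x n] * [n x d_{k+1}] = [d_k x d_{k+1}], matching the shape of W{kk}
dW{kk} = y{kk} * dx{kk + 1}'
% divide by num_samples here if you want the average rather than the sum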

-Brian

(Feb 03 '11 at 11:42) Brian Vandenberg

Thanks for putting this together, very helpful. One question though: can you explain what's going on with dphi and its role in the r0 phase?

(Mar 16 '12 at 03:20) louie

Sure. In the abstract, dphi is the direct derivative of the activation function at a given layer with respect to its input. In my code, I've implemented it as a cell, where each element contains the derivatives for the corresponding layer. dphi{kk} is only used to back-prop error information to layer kk-1, so dphi{1}=0. In my fully fleshed-out implementation, I set it to NaN to force an error condition if I ever accidentally used it. In the f0 pass, I'm calculating and then caching these direct derivatives to speed up calculations in the r1 and f1 passes.
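In code, the sentinel is just (same variable names as in the sketch above):

% poison dphi{1}: any accidental use of it surfaces as NaNs downstream
dphi{1} = NaN( size( y{1} ) )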

(Mar 16 '12 at 19:56) Brian Vandenberg

Hi Brian,

I am also trying to learn the HF method using the Martens, Schraudolph, and Pearlmutter papers.

I notice in your code that the Hessian-vector product is written as HV; are you not finding the Gauss-Newton approximation to the Hessian, which I have seen written as Gv?

I am currently working on just computing Hv, because the documentation provided by Pearlmutter and Bishop is fairly clear. Once I have finished this I will try Gv, which is the end goal.

Also, there seems to be some confusion in the comments above. Conjugate gradient itself doesn't minimize q; it just finds the Newton direction to minimize q. Line search, I think, should still be used, since q is just an approximation to f and we don't want to take such a large step away from our current location in the function domain that we land at a point where q is no longer an accurate approximation of f.

I know you know. Writing it out just helps me learn this stuff...
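For what it's worth, here's the finite-difference sanity check I'm using for Hv (a sketch; grad_at is a hypothetical helper that returns the gradient at a flattened parameter vector, and v is the flattened direction):

% central difference: Hv ~ ( g(theta + h*v) - g(theta - h*v) ) / (2*h)
h       = 1e-4
g_plus  = grad_at( theta + h * v )
g_minus = grad_at( theta - h * v )
hv_fd   = ( g_plus - g_minus ) / ( 2 * h )
rel_err = norm( hv_fd - hv ) / max( norm( hv_fd ), norm( hv ) )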

Will

(Jan 04 '13 at 13:43) will henry

2 Answers:

This seems mostly reasonable.

A few things:

  • It is useful to cache the activities of the units in order to compute the matrix-vector products, but not the gradients themselves. And you only need to cache the activities associated with the minibatch of training cases you will use for the matrix-vector products in the current HF iteration. The sum of entry-wise squared gradients that you need for the preconditioner can be easily computed in your back-prop code by adding a couple of lines.

  • rho can only be computed after CG is run, not before

  • What do you mean by "early stopping" of CG? The criterion in 4.4 should always be used.

  • Also, it can be helpful to hard-limit the number of steps CG is run for; a sketch combining this with the 4.4 criterion follows below. I didn't mention this in the paper, but I used a hard limit of 250 in all of the experiments (in addition to the criterion in 4.4, which will often cause CG to stop sooner). The reasons for this are complicated and will be part of the subject of my next paper. It still works without the hard limit, just not quite as well, and using a hard limit that is "too small" for your particular problem can do more harm than good, so be warned.
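In pseudo-code, combining the hard cap with the 4.4 test looks roughly like this (a sketch only; phi_val(i) holds the value of the quadratic phi(x) = 0.5*x'*A*x - b'*x at CG iteration i):

% stop CG when relative progress over the last k iterations stalls,
% or when the hard cap is reached
max_iters = 250
k         = max( 10, ceil( 0.1 * i ) )   % look-back window at iteration i
epsilon   = 5e-4                         % progress tolerance
if i >= max_iters || ( i > k && phi_val(i) < 0 && ...
     ( phi_val(i) - phi_val(i-k) ) / phi_val(i) < k * epsilon )
  % terminate CG
end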


answered Dec 05 '10 at 21:10

James Martens

In order, my responses:

  • caching may've been the wrong way to phrase that; I was more trying to explicitly point out optimizing the creation of M, though you probably gathered that based on that last sentence from point #1.
  • Hm. I assumed that since lambda was being updated prior to CG-Minimize, rho needed to be set as well. Based on what you've said, I assume round 1 wouldn't touch lambda, but rounds 2+ would, based on rho from previous steps.
  • I added that one based on your comments about early stopping being helpful early on but harmful later on. I'm only speculating, but if it helps early on and doesn't move us into an area of weight space that causes the algorithm to behave poorly later on (when early stopping of CG isn't used anymore), why not?
  • Is this for NIPS? I wish I could afford to go. Maybe next year. I look forward to reading it, nevertheless. Feel free to give us a sneak peek if you want any extra eyes on it :)

-Brian

(Dec 05 '10 at 21:37) Brian Vandenberg

Okay, but the way you wrote the code it sounded like rho was being computed even before the first run of CG. It's probably easier to think about the lambda adjustment as something that occurs after CG instead of before.
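Schematically, the adjustment after CG looks like this (a sketch; delta is the update CG returned, and q is the damped quadratic model):

% rho = actual reduction in f over the reduction predicted by q
rho = ( f( theta + delta ) - f( theta ) ) / q( delta )
if rho > 3/4
  lambda = (2/3) * lambda     % model is trustworthy here; damp less
elseif rho < 1/4
  lambda = (3/2) * lambda     % model is a poor fit; damp more
end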

Stopping the CG runs early is "helpful" early on only in that it makes the error go down faster at first. But this does permanent damage which affects long-term performance. So it's never a good idea to do this, unless you don't plan to run the HF algorithm past the short term.

Not NIPS, a paper in progress.

(Dec 05 '10 at 21:59) James Martens

I'm actually quite curious about some of the practical details, like the hard stopping etc. I know these things are very task-dependent, but I find it odd that CG almost always terminates after just a couple of iterations (because the criterion in 4.4 barely changes) and that lambda always takes very high values like 1e6 or 1e7. It makes me think there is a mistake somewhere, but all the gradient/Gv stuff seems correct according to finite differences... I also found that line searches almost always result in alpha=1, btw. I didn't implement the preconditioned version yet, though...

(Dec 06 '10 at 04:53) Philemon Brakel

It sounds like you have a bug or are initializing the network badly. Either use the random initialization scheme I described in the ICML paper or something else that's reasonable (e.g. Nguyen-Widrow). On the autoencoder tasks, preconditioned CG should want to run for ~20 iterations at the beginning, and hundreds at the end. Non-preconditioned CG should want to use even more, since it tends to be less efficient per iteration. A very high lambda value will cause very short runs of CG, but the bug isn't necessarily in your lambda-adjustment code.
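The sparse initialization amounts to something like the following (a rough, untested sketch using the variable names from the code above):

% each unit gets a small fixed number of nonzero Gaussian incoming
% weights; all other incoming weights start at exactly zero
num_conn = 15
W{ii} = zeros( layer_dimensions( ii ), layer_dimensions( ii+1 ) )
for jj = 1:layer_dimensions( ii+1 )
  idx = randperm( layer_dimensions( ii ) )
  W{ii}( idx(1:num_conn), jj ) = randn( num_conn, 1 )
end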

(Dec 10 '10 at 21:35) James Martens

Thanks for the response. I'm actually training a recurrent neural network, so I cannot directly compare the behavior of the algorithm to the results for the autoencoder. I will definitely go over my code again and have a look at the Nguyen-Widrow method.

(Dec 11 '10 at 07:35) Philemon Brakel

I guess this is a late response, but I've found a few errors in the code. While computing the Gv product at the beginning of the r1 pass, you backpropagate through M; that's an excessive computation, since you have already propagated through M in the f1 pass. So it has to be

% r1 pass begin
dW{ld_size - 1} = y{ld_size - 1} * ry{ld_size}'   % outer product over the mini-batch
...

Thanks a lot to Nic Schraudolph, who helped me double-check it. Also, there are a few syntax errors where you use element-wise instead of matrix multiplication.

And thank you for the code, Brian. It was really helpful :)

answered Dec 20 '13 at 18:45

Alex Vlasov

Understanding Nic's paper was key to getting this all figured out, and thankfully he was more than willing to help me out as well.

Thanks for the critiques, I'm glad the code is helping people.

(Nov 06 '14 at 13:15) Brian Vandenberg