|
I have a question about computing the Gauss-Newton matrix vector product using a method that is similar to backpropagation. Recently I have implemented a training method that is similar to the Hessian-free optimizer by Martens (2010). He argues that the use of the Gauss-Newton approximation works better than using the actual Hessian. So far I found the real Hessian indeed to be quite unstable. To compute the Hv product for a multilayer perceptron with softmax outputs, I used the R operator by Pearlmutter (1994) which leads to an altered backpropagation procedure. To obtain the actual Hv values is slightly less straightforward than for example obtaining the gradient with normal backprop and is based on the quantities R(DE/Dw) (gradient of the error with respect to the weights). According to Schraudolph (2002), the Gauss-Newton version of the product (Gv) can be obtained by first doing the forward computations in the same way as with the R method (but skipping the actual loss function at the end) and then backpropagating the result using the standard method. It is however not clear to me: which quantities now give the actual values of Gv? I'd be very happy if someone could enlighten me about this. |
|
The back-prop procedure outputs a scalar for each parameter in your network and concatenated into a large vector these give the Gv product. These are the same scalars that give the gradient when you feed the network's output errors into back-prop. I also have a practical suggestion: after you've implemented this be sure that the results are consistent with a finite-differences approximated version of the Gv product. I don't understand. What are you supposed to back-prop if not the errors? The gradient of the application of the network to v?
(Sep 28 '10 at 21:15)
Alexandre Passos ♦
2
Back-prop is basically just the multiplication of the J^T against a vector, where J is the Jacobian of the output units w.r.t. the parameters. You can backpropagate any vector you want. Read Schraudolph's paper for details.
(Sep 28 '10 at 22:13)
James Martens
1
To James Martens: Thanks a lot for the quick reply! What mainly confused me was the fact that for the Hv calculation the quantities for for example the hidden units are given by (off the top of my head...) R(dz)z^T + dzR(z)^T and not just dz*z^T. Apparently this is simpler for the Gv case. I used finite differences to check my Hv version but actually don't really know how to estimate Gv without calculating actual Jacobian matrices... To Alexandre Passos: As far as I understand you first calculate R(y) with the forward pass which should represent the change of the output as a function of adding an infinitesimal vector to the weights I think and backpropagate that back the normal way. I might be wrong about this and will read Schraudolphs paper more carefully myself to be sure :).
(Sep 29 '10 at 15:28)
Philemon Brakel
2
It's trickier to use finite differences to check the G matrix but still possible. Just compute the J matrix using finite differences and then form G = J'HJ where H is the "little hessian" which comes from the convex loss function (again, see Schraudolph if you don't know what I'm talking about).
(Sep 30 '10 at 00:56)
James Martens
Thanks again! I solved it now and learned a thing or two as well. It turned out I did most of it right already but I didn't realize that the biases should be included as R(1) = 0 and not just by concatenating 1 to the activation vectors. The other mistake I made is that I used the Jacobian of the network + output function and not the Jacobian of just the network to do the numerical estimation. All in all it took me a while to get into Schraudolph's notation style but now it all makes sense.
(Sep 30 '10 at 05:07)
Philemon Brakel
I was a bit confused with what R{y0} is for Pearlmutter's method. Is it simply the R{y0} = y0 (i.e. inputs)
(Nov 08 '10 at 09:41)
Subodh Iyengar
showing 5 of 6
show all
|