How does the performance of natural gradient ("Natural gradient works efficiently in learning", Amari; "Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons", Amari et al.) fare in comparison to pre-training + SGD or Martens' technique? It seems to use second-order information, but in an odd way for an optimization algorithm.
The second paragraph in Section 6 of Martens' HF paper briefly discusses your question.
The links between natural gradient and the standard Newton method are still relatively unclear. These methods are somewhat tied together through the Gauss-Newton approximation (an approximate Newton method whose formula looks very similar to that of natural gradient), but there are fundamental differences between the gradient covariance C and the Hessian H. Our recent ICML paper, "A fast natural Newton method" (Le Roux and Fitzgibbon), which builds on "Topmoumoute Online Natural Gradient Algorithm" (Le Roux et al.), tried to answer the question of the relationship between these two matrices, but we only barely scratched the surface. In the former, we show how to incorporate information from these two matrices to yield faster convergence and stability. On some datasets we get both, on some we get increased stability, and on some we get neither. The latter case seems to indicate that, despite the fundamental differences between these matrices, there are still some numerical similarities. Also, bear in mind that, though Amari uses the uncentered covariance matrix, our justification for natural gradient is based on the centered version.
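For concreteness, here is a minimal numpy sketch of a damped natural-gradient step built from per-example gradients (my own illustration, not code from either paper; the function name, learning rate, and damping constant are hypothetical). The `centered` flag switches between Amari's uncentered second moment and the centered covariance discussed above:

```python
import numpy as np

def natural_gradient_step(per_example_grads, lr=0.1, damping=1e-4, centered=True):
    """One damped natural-gradient step from a batch of per-example gradients.

    per_example_grads: (n, d) array, row i = gradient of example i's loss.
    centered=False uses the uncentered second moment E[g g^T] (Amari);
    centered=True subtracts the mean gradient first (centered covariance).
    """
    g = per_example_grads.mean(axis=0)            # mean gradient over the batch
    G = per_example_grads - g if centered else per_example_grads
    C = G.T @ G / G.shape[0]                      # d x d (un)centered covariance
    # Solve (C + damping * I) step = -lr * g rather than inverting C explicitly.
    d = g.shape[0]
    return -lr * np.linalg.solve(C + damping * np.eye(d), g)

# Toy usage: 32 random per-example gradients over 10 parameters.
rng = np.random.default_rng(0)
step = natural_gradient_step(rng.standard_normal((32, 10)))
```

The damping term is needed in practice because C is typically rank-deficient when the batch size is smaller than the number of parameters.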
As far as I know, most practical implementations of natural gradient use some kind of diagonal (or block-diagonal) approximation and thus simply won't be as good as a non-diagonal approach. When I tested the matrix used in natural gradient, I did it within my version of HF and was thus able to do the full non-diagonal thing. However, it didn't work, and I strongly suspect no amount of tweaking can fix it. One of the key differences between the empirical Fisher matrix and the Gauss-Newton matrix is that the former uses the error residuals and approaches zero as they do, while the latter has no such degeneracy. Also, the rank of the Fisher is much lower than the rank of the Gauss-Newton matrix for multi-output problems (see the sketch below).
James Martens (Dec 05 '10 at 21:24)
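To make those last two points concrete, here is a small numpy sketch (mine, not from the thread) for a least-squares model, assuming per-example Jacobians J_i and residuals r_i; all shapes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 5, 20, 3                     # examples, parameters, outputs
J = rng.standard_normal((n, k, d))     # per-example output Jacobians
r = rng.standard_normal((n, k))        # per-example residuals

# Gauss-Newton matrix: sum_i J_i^T J_i -- the residuals never enter.
GN = sum(J[i].T @ J[i] for i in range(n))

# Empirical Fisher: sum_i g_i g_i^T, with per-example gradients g_i = J_i^T r_i.
g = np.einsum('ikd,ik->id', J, r)
F = g.T @ g

print(np.linalg.matrix_rank(GN))       # up to n*k = 15 here
print(np.linalg.matrix_rank(F))        # at most n = 5: one rank-1 term per example

# Shrinking the residuals drives F toward zero but leaves GN untouched.
F_small = (0.01 * g).T @ (0.01 * g)
print(np.linalg.norm(F_small) / np.linalg.norm(F))   # ~1e-4
```

This shows both differences Martens mentions: the empirical Fisher contributes one rank-1 term per example regardless of the output dimension, and it degenerates to zero as the residuals vanish near a good fit, while the Gauss-Newton matrix does not.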
The first link doesn't work. In general, it would be nice if everyone cited/linked things using an anchor like "Author (Year)" instead of just "here". There are two reasons: (1) it would be good to know which papers are being discussed without having to download a whole PDF, and (2) the PDF might not work anymore in a month or a year, so it won't be clear what we were talking about.
Good point, I'll be more careful. Fixed the links and edited the text.
Frank, I really support this idea of using Author (Year). I wonder how hard it would be to coax other site users to do so.
@Joseph Turian: maybe flag a warning (with JavaScript) whenever a single-word link is posted, or add a mock-markdown template for papers (something like [[author][year][title]](url))?
Second paragraph in Section 6 of Martens' HF paper?
Thanks. I missed that when I first read it. Can you submit this as an answer so I can accept it?