
I'm reading Barak Pearlmutter's paper Fast exact multiplication by the Hessian (Pearlmutter, 1994), and ran across a phrase I'm having trouble finding a definition for. I'll quote the relevant portion below:

As is usual, quantities which occur on the left sides of the
equations are treated computationally as variables, and
*calculated in topological order*, which is assumed to exist
because the weights, regarded as a connection matrix, is
zero-diagonal and can be put into triangular form (Werbos, 1974)

I have a few questions on this sentence:

  • What is meant by "zero-diagonal"? I'm having a tough time locating Werbos' paper. Someone suggested it might just mean zero diagonal entries of the weight matrix, but if that's the case I don't see the logic behind why the diagonals would be zero. That seems arbitrary.
  • 'Topological order', I assume, refers to an ordering of the vertices of a directed acyclic graph. I'm not following what this imparts to the reader, though. Assuming a directed [acyclic] bipartite graph for the two layers ... am I missing something, or just overthinking it?
  • By triangular form, is that meant in the linear algebra sense (upper/lower triangular, etc.)? Hopefully I can find Werbos' paper; that'd probably help a lot with that piece.

-Brian

asked Dec 07 '10 at 23:25

Brian Vandenberg

edited Dec 07 '10 at 23:25

Werbos' paper wasn't published, except in his book, "The Roots of Backpropagation: (...)". Google books has a preview of it if anyone felt like diving into his thesis as I'm about to do:

http://books.google.com/books?id=WdR3OOM2gBwC&printsec=frontcover&dq=werbos+1974+backprop&source=bl&ots=M2qx3ZK1X6&sig=Yw7nEAYP-lly66-Y0cDsbZ1VmOQ&hl=en&ei=Pwr_TOy4NZSusAOpwfivCw&sa=X&oi=book_result&ct=result&resnum=3&ved=0CCgQ6AEwAg#v=onepage&q&f=false

(Dec 07 '10 at 23:35) Brian Vandenberg

2 Answers:

I think it just means that the network is not recurrent: no unit is connected to itself or to units below it in the hierarchy. These constraints are not required for the algorithm to work, but they simplify the general notation he uses, in which a single set of weights and a single set of variables describes the whole network. The most relevant part of the paper is how to compute the R() operator derivatives, and this is very similar to normal backpropagation.

Instead of lingering on this for too long, I recommend having a look at the relevant part of Bishop's 'Neural Networks for Pattern Recognition' for an example using more usual notation. Since you also might want to use the Gauss-Newton approximation instead of the true Hessian, I also recommend having a look at the paper by Schraudolph (I forget the precise title, but the reference is in Martens's paper). I found his notation style a bit awkward at first, but once I got it, it really made things very clear to me.
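To give a rough feel for what the R() operator buys you (this is my own toy illustration, not the paper's notation): the Hessian-vector product Hv is just the directional derivative of the gradient in the direction v, and Pearlmutter's trick computes that derivative exactly rather than by finite differences. On a small quadratic, where the Hessian is known in closed form, even the finite-difference version shows the idea:

    # Sketch: Hv as the directional derivative of the gradient, checked on a
    # small quadratic f(w) = 0.5 * w^T A w + b^T w, whose Hessian is simply A.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    A = rng.standard_normal((n, n))
    A = A + A.T                      # symmetric, so A is the Hessian of f
    b = rng.standard_normal(n)

    def grad(w):
        # gradient of f(w) = 0.5 * w^T A w + b^T w
        return A @ w + b

    w = rng.standard_normal(n)
    v = rng.standard_normal(n)

    eps = 1e-6
    hv_fd = (grad(w + eps * v) - grad(w - eps * v)) / (2 * eps)  # finite difference
    hv_exact = A @ v                                             # exact H v

    print(np.max(np.abs(hv_fd - hv_exact)))   # tiny; the two agree

The point of the paper is that the exact version of this directional derivative can be computed for a neural network with one extra forward/backward-style pass, instead of ever forming the Hessian.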

answered Dec 08 '10 at 04:28

Philemon Brakel

I've read through Schraudolph's paper once, but there were a few areas where he deferred to this paper so I decided to understand it before going back to Schraudolph's.

Thanks for the response, I appreciate it.

(Dec 08 '10 at 10:42) Brian Vandenberg

Mr. Pearlmutter was kind enough to respond to my questions (I'd also emailed him). I'm quoting his response for the benefit of anyone else who was curious:

if you take all the units in the network, regardless of which "layer"
they are in or if they are input units or output units or whatever,
and put their activities in one vector z, then we have

 z_i = f(y_i) + [leaving out driving input term here]

for each unit i, where y_i is the total input of unit i, and y is a
vector holding the total inputs.  And what is y?  It is

 y = W z

where W is an n*n matrix of weights.  Saying that the network is
feedforward is the same as saying that it is possible to number the
units so that the matrix W is zero diagonal and lower triangular, in
other words, that unit i has a nonzero incoming weight from unit j
only if j<i.

So, now I need to digest that. Assuming I'm not reading it wrong, a W in that sort of arrangement would also be singular, because it's strictly lower triangular with zeros on the diagonal. I was having a bit of trouble visualizing it, so I wrote out a small example to make the indexing click.
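Here's the sketch (my own numpy construction, not from the paper or from his email): all the units' activities sit in one vector z with the input units first, W is strictly lower triangular with a zero diagonal, and computing the units in index order is exactly the topological order he mentions.

    # Forward pass over a single activity vector z using a strictly
    # lower-triangular connection matrix W: unit i only receives input
    # from units j < i, so index order is a valid topological order.
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    n_in, n_hid, n_out = 2, 3, 1
    n = n_in + n_hid + n_out

    rng = np.random.default_rng(0)
    W = np.zeros((n, n))
    # hidden units (indices 2..4) receive from input units (0..1)
    W[n_in:n_in + n_hid, :n_in] = rng.standard_normal((n_hid, n_in))
    # output unit (index 5) receives from hidden units (2..4)
    W[n_in + n_hid:, n_in:n_in + n_hid] = rng.standard_normal((n_out, n_hid))

    assert np.allclose(W, np.tril(W, k=-1))   # strictly lower triangular

    x = np.array([0.3, -1.2])                 # driving input for the input units
    z = np.zeros(n)
    z[:n_in] = x                              # input units' activities
    for i in range(n_in, n):                  # topological (index) order
        y_i = W[i, :i] @ z[:i]                # only units j < i contribute
        z[i] = sigmoid(y_i)

    print(z[n_in + n_hid:])                   # the output unit's activity

And W really is singular here, but that never matters: the forward pass only multiplies by W row by row, it never inverts it.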

-Brian

answered Dec 08 '10 at 11:35

Brian Vandenberg

The properties of the actual matrix are not really that important in this case. Like he said, the main reason seems to be to define a feedforward network with just two types of variables (weights and unit activations). I personally think the equations can be easier to read when there are different symbols for, say, the input, hidden, and output units, but that is just because I'm more used to that type of notation. Bishop's 'Pattern Recognition and Machine Learning' book also includes the algorithm, by the way. Computing the R operator is almost the same as normal backpropagation, but with additional quantities you compute along the way. It might be easiest to start with an implementation of backprop and extend it afterwards; I wouldn't actually implement your network using the triangular matrix form.
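To sketch what "extend it afterwards" could look like (a rough reconstruction of mine in numpy, not Pearlmutter's code, using a one-hidden-layer tanh network with squared error as the example): each forward and backward quantity gets an R{} companion computed alongside it, and the R{} of the gradient is the Hessian-vector product. A finite-difference check on the gradient confirms it at the end.

    import numpy as np

    rng = np.random.default_rng(1)
    n_in, n_hid, n_out = 3, 4, 2
    x = rng.standard_normal(n_in)
    t = rng.standard_normal(n_out)

    W1 = rng.standard_normal((n_hid, n_in))
    W2 = rng.standard_normal((n_out, n_hid))
    V1 = rng.standard_normal((n_hid, n_in))   # direction v, same shapes as the weights
    V2 = rng.standard_normal((n_out, n_hid))

    def grads(W1, W2):
        # plain forward + backward pass; gradient of 0.5*||o - t||^2
        a = W1 @ x
        h = np.tanh(a)
        o = W2 @ h
        d_o = o - t
        d_h = W2.T @ d_o
        d_a = d_h * (1 - h**2)
        return np.outer(d_a, x), np.outer(d_o, h)

    # forward pass and its R{} pass
    a = W1 @ x
    h = np.tanh(a)
    o = W2 @ h
    Ra = V1 @ x                                  # R{a}, since R{x} = 0
    Rh = (1 - h**2) * Ra                         # R{tanh(a)} = tanh'(a) * R{a}
    Ro = W2 @ Rh + V2 @ h                        # product rule on W2 @ h

    # backward pass and its R{} pass
    d_o = o - t
    d_h = W2.T @ d_o
    d_a = d_h * (1 - h**2)
    Rd_o = Ro                                    # target t is constant
    Rd_h = W2.T @ Rd_o + V2.T @ d_o
    Rd_a = Rd_h * (1 - h**2) + d_h * (-2 * h * (1 - h**2)) * Ra

    HvW1 = np.outer(Rd_a, x)                     # R{gradient} = Hessian-vector product
    HvW2 = np.outer(Rd_o, h) + np.outer(d_o, Rh)

    # finite-difference check: (g(w + eps*v) - g(w - eps*v)) / (2*eps)
    eps = 1e-6
    gp1, gp2 = grads(W1 + eps * V1, W2 + eps * V2)
    gm1, gm2 = grads(W1 - eps * V1, W2 - eps * V2)
    print(np.max(np.abs(HvW1 - (gp1 - gm1) / (2 * eps))))   # ~1e-7 or smaller
    print(np.max(np.abs(HvW2 - (gp2 - gm2) / (2 * eps))))   # ~1e-7 or smaller

The layer-wise version above is what I'd actually implement; the single triangular matrix in the paper is just a compact way of writing the same computation for an arbitrary feedforward topology.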

(Dec 08 '10 at 14:05) Philemon Brakel

Yeah, that form would be unwieldy and expensive to work with, but it does provide a different perspective from which to think about the problem, and understanding the perspective he described is really what I'm after.

For example, one perspective on neural networks that I've seen little treatment of is describing them as a dynamical system and investigating features of the NN from that angle.

Given a particular model, does it have bifurcations? Are there stable, semi-stable, or unstable manifolds? Can those features be used to improve training, derive better models, create more stability, etc.?

-Brian

(Dec 08 '10 at 14:30) Brian Vandenberg

I see your point. There is actually quite a lot of literature on RNNs as dynamical systems, in relation to spiking networks and liquid state machines/reservoir computing, but you might already know about that.

(Dec 09 '10 at 04:13) Philemon Brakel

Actually, I didn't. I bought a book on spiking NNs years ago, but at the time my math skills were less than adequate for reading it.

-Brian

(Dec 09 '10 at 09:49) Brian Vandenberg