I have some questions on James and Ilya's paper and its supplementary materials; hopefully someone in the community can help shed some light on this.

Section 3.2 talks about the structural damping process, which amounts to treating the hidden units as outputs and introducing a non-linear, non-quadratic distance function D. Optimizing with D in the objective encourages the hidden units to resist changing state between time-steps.

I'm confused on a few points:

  • By the time section 4 begins, D has never actually been stated explicitly; it is only said to match the hidden-layer activation function (e.g., L matches g; analogously, D matches e)
  • Suppose, for example, that g is just logistic. The matching loss function for g is cross-entropy error ... but what is being compared? t_i to t_{i+1}? That would make logical sense until you get to t_T ... what do you compare that to? The only conclusion I can draw is that you iterate this over 1...T-1 instead of 1...T (supposing you actually want to calculate the error)
  • The end of page 5 shows an expression for the GN approximation to the Hessian. The Kronecker product is something I haven't had much exposure to, so I'm just not seeing why the second derivative of that expression boils down to the sum of two simpler Hadamard products.
  • Quoting: "Moreover, as long as we perform backpropagation starting from both s and t together (as we would when applying reverse-mode automatic differentiation) the required multiplication by the transpose Jacobian term [doesn't require any extra work] either."
    • This sort of dovetails with one of the earlier questions. Firstly, I'll reiterate something about Hessian-free: there are essentially two 'types' of backprop used. One backpropagates an error vector (output - expected output); the other backpropagates an arbitrary vector. For Hessian-free, this arbitrary vector ends up being the output of the R{forward pass} phase of the algorithm.
    • I'm a little confused about what is meant in the above quote. The only way I can see to start from t is to have a target end-state for it. Moreover, the supplementary material's algorithm looks exactly as I expected it to look (disregarding the line for damping) if you didn't start from t, leading me to wonder whether this particular bit of information applies to backprop of an error vector -- before the Gv product is ever begun.
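To pin down the distinction between the two 'types' of backprop, here is a minimal numpy sketch (not code from the paper; the tiny one-layer logistic model, shapes, and names are all invented for illustration). It shows both kinds: backpropagating the error vector (y - t) to get the gradient, and backpropagating the curvature-weighted output of the R-forward pass to get a Gauss-Newton vector product. For a single linear layer with a matching loss, the GN product coincides with the true Hessian-vector product, so it can be checked against finite differences:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal(4)              # input
t = np.array([1.0, 0.0, 1.0])           # targets
W = rng.standard_normal((3, 4))         # parameters
V = rng.standard_normal((3, 4))         # arbitrary vector to multiply by G

def grad(W):
    # backprop type 1: propagate the error vector (y - t)
    y = sigmoid(W @ x)
    return np.outer(y - t, x)

# backprop type 2: propagate the output of the R{forward pass}
y = sigmoid(W @ x)
Rz = V @ x                              # R-forward pass through the linear layer
GV = np.outer(y * (1 - y) * Rz, x)      # Gauss-Newton vector product G*V

# for a single linear layer with a matching loss, G*V equals the true
# Hessian-vector product, so check it with central finite differences
eps = 1e-5
fd = (grad(W + eps * V) - grad(W - eps * V)) / (2 * eps)
print(np.allclose(GV, fd, atol=1e-6))
```

Note that the targets t only enter the first kind of backprop; the Gv product never looks at them.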

That's about all I have for now. I may come up with some follow-ups, or others I didn't think of. Furthermore, if anyone else has questions, feel free to edit this one (I've made it community wiki) and add your question to the mix.

edit

Minor update: I haven't had a lot of time to tinker, but I did have one small bit to add: I think the key phrase to part of my question is this: "as we would when applying reverse-mode automatic differentiation". I'm guessing lines 11 and 12 are doing just that. I'm not familiar with automatic differentiation tools and techniques, but I think that sentence has more to do with cluing us in on how to derive appropriate expressions as opposed to any actual computations being performed.

asked Jun 30 '11 at 03:31


Brian Vandenberg

edited Jul 05 '11 at 20:09

Maybe you should have a fixed, hidden "end" state, just like people use with HMMs, to finish the sequence without ignoring the last element.

(Jun 30 '11 at 03:34) Alexandre Passos ♦

Well, that would certainly explain line 12 of the supplement. If it's fixed, then certainly its derivative is zero. However, when calculating the error of the system wouldn't that affect the error? Unless ... state at t=T+1 should remain the same as at t=T ... hm.

(Jun 30 '11 at 03:40) Brian Vandenberg

It just occurred to me ... D_e = D(e) ... is a scalar. So is L(g) ... so why would it need to be represented as a Kronecker product?

(Jun 30 '11 at 03:42) Brian Vandenberg

I'm mildly curious about something: within minutes of me posting this question there were already almost 100 hits, and after an hour there were over 200. I posted this at 2am, so either there are a lot of people who lurk here at 2am, there were a lot of geeks at ICML watching the board, or a large chunk of our audience is in a very different timezone from mine (US).

(Jun 30 '11 at 19:50) Brian Vandenberg

I'm west coast time. Did you ever get the hessian-free feedforward neural network working? And how did it perform?

(Jul 01 '11 at 02:05) Jacob Jensen

@Brian, I'm in the UK, and it was in the morning that I replied to you, probably.

(Jul 01 '11 at 03:35) Alexandre Passos ♦

@Jacob: I did get it working, and it worked pretty well. Training is amazingly fast on MNIST (one GeForce GTX 570). I haven't tried other datasets yet.

(Jul 01 '11 at 12:35) Brian Vandenberg

4 Answers:

Hi Brian, here are some answers to your questions:

(Note that g is the output nonlinearity rather than the hidden-hidden nonlinearity; the latter is e).

  • That's right: the intention is to make D match e so that the Gauss-Newton approximation to S is PSD. So if e(x) is logistic, then D is cross-entropy.

  • The point of structural damping is to penalize changes in the hidden state sequence as a function of the parameter update. Thus the objective is D(h(theta-n)||h(theta-n + proposed-parameter-update)), where theta-n is fixed and the proposed-parameter-update is variable. Here h(theta) is the concatenation (h-1,...,h-T) from eq. (2). So if D(h(theta-n)||h(theta-n+1)) is small, then the hidden state sequence hasn't changed much.

  • If you are comfortable with the regular Gauss-Newton approximation L(y-target, y(theta-n + d)), then it should be equally easy to imagine the Gauss-Newton approximation of L(y-target, y(theta-n+d)) + D(h(theta-n)||h(theta-n+d)) where d is the change in the parameters and theta-n is constant as before. Given that h must be computed in order to compute y(theta), we can get the result with a single run of backprop (rather than two), analogously to the manner in which a single forward pass computes both h and y.
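If it helps to see the shared backward pass in one place, here is a minimal numpy sketch of the damped Gauss-Newton vector product for a hypothetical one-step model (logistic hidden layer h = e(Wx), logistic output y = g(Uh), differentiating w.r.t. W only). All names, shapes, and the damping strength mu are my own invention, not the paper's code:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)
x = rng.standard_normal(4)                  # input
W = rng.standard_normal((5, 4))             # input-to-hidden weights (the parameters)
U = rng.standard_normal((3, 5))             # hidden-to-output weights (held fixed here)
V = rng.standard_normal((5, 4))             # the vector we multiply by G
mu = 0.1                                    # structural-damping strength

# forward pass: hidden state h and output y, both logistic
z = W @ x;  h = sigmoid(z)
u = U @ h;  y = sigmoid(u)

# R-forward pass for the perturbation V
Rz = V @ x
Rh = h * (1 - h) * Rz
Ru = U @ Rh

# one shared backward pass "starting from both s and t together":
# the damping curvature is injected at the hidden layer, and backprop
# then continues down to W exactly as it would without damping
delta_u = y * (1 - y) * Ru                  # matching-loss curvature at the output
delta_z = (h * (1 - h) * (U.T @ delta_u)
           + mu * h * (1 - h) * Rz)         # extra structural-damping term
GV = np.outer(delta_z, x)                   # (G_L + mu * G_D) applied to V
```

The damping contribution costs only the one extra term added into delta_z; everything below the hidden layer is shared with the undamped product.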

  • Backward automatic differentiation is simply backprop extended to arbitrary neural network architectures. Namely, you reverse the lines of the code that compute your function, and mindlessly differentiate each line. So for example, if you had a line X=WY, then you'd replace it with the lines

      dE/dW += outp(Y, dE/dX)
      dE/dY += mult(W.T, dE/dX)
      [edit: '+=' was previously '=', which is incorrect when a variable is used multiple times.]
    Note that previously Y and W were the inputs and X is the output. In the backward differentiation the situation is precisely reversed. Automatic differentiation is useful precisely because it's only slightly more expensive than the corresponding forward pass. And given that we share the computation of y and h, automatic differentiation will similarly share the computation in the backward pass.
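As a concrete sketch of the line-reversal recipe above (the scalar loss E and the shapes are my own choices, purely for illustration), here is the forward line X = WY followed by its two reversed lines, checked against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 4))
Y = rng.standard_normal((4, 2))

def loss(W):
    # forward line: X = W Y, then some scalar E downstream of X
    X = W @ Y
    return 0.5 * np.sum(X ** 2), X

E, X = loss(W)

# reversed line: given dE/dX, accumulate into dE/dW and dE/dY
dE_dX = X                          # derivative of 0.5 * ||X||^2 w.r.t. X
dE_dW = np.zeros_like(W)
dE_dY = np.zeros_like(Y)
dE_dW += dE_dX @ Y.T               # the outp(Y, dE/dX) line
dE_dY += W.T @ dE_dX               # the mult(W.T, dE/dX) line

# central-difference check on one entry of W
eps = 1e-6
Wp = W.copy(); Wp[0, 0] += eps
Wm = W.copy(); Wm[0, 0] -= eps
num = (loss(Wp)[0] - loss(Wm)[0]) / (2 * eps)
print(abs(num - dE_dW[0, 0]) < 1e-7)
```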

This answer is marked "community wiki".

answered Jul 08 '11 at 00:57


Ilya Sutskever

edited Jul 10 '11 at 14:22

Thanks, Ilya. I'm almost there. I have everything up to and including CG written (though I haven't had time to put it through its paces yet). Now I need to get the training code written and start working out all the bugs and get the synthetic tests setup.

(Jul 08 '11 at 16:21) Brian Vandenberg

Considering the function D, I guess we never need these target values: for the Gauss-Newton approximation we actually need to compute dNet/dw and the second derivative of the loss function with respect to x (the net input). Since for matching loss functions (e.g., logistic sigmoid/cross-entropy) the first derivative is y(x) - t, the second derivative with respect to x is just the dy/dx quantities. I think that's the reason why D is not stated explicitly in the paper: we just assume its first derivative is y - t, and we need the second, not the first.
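This is easy to check numerically; a small sketch for the scalar logistic/cross-entropy case (the particular values of x and t are arbitrary toy numbers):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def xent(x, t):
    # cross-entropy of a logistic unit, as a function of the net input x
    y = sigmoid(x)
    return -(t * np.log(y) + (1 - t) * np.log(1 - y))

x, t, eps = 0.7, 0.3, 1e-5
y = sigmoid(x)

# finite-difference first and second derivatives w.r.t. the net input x
d1 = (xent(x + eps, t) - xent(x - eps, t)) / (2 * eps)
d2 = (xent(x + eps, t) - 2 * xent(x, t) + xent(x - eps, t)) / eps ** 2

print(np.isclose(d1, y - t, atol=1e-8))        # first derivative is y - t
print(np.isclose(d2, y * (1 - y), atol=1e-4))  # second derivative is dy/dx; t drops out
```

Note that the second derivative y(1 - y) contains no t at all, which is the point: the Gauss-Newton curvature never needs the target.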

answered Dec 21 '13 at 10:51


Alex Vlasov

edited Dec 21 '13 at 11:38

Re: the Kronecker product issue, per James Martens (personal communication), that should be a sum rather than a Kronecker product. He intends to post a fixed version of the paper shortly.

answered Jul 06 '11 at 01:44


Brian Vandenberg

edited Feb 21 '12 at 16:13

I have an answer to part of this question: the distance function, D, is notated as:

D( h(theta), h(theta_n) )

... in the paper, which I take to mean all hidden units across all time steps. It also appears that D is a scalar; since it is a matching loss function, if h were modeled as logistic it would likely take the form:

D = -sum_{ii,jj} [ h_{ii,jj}(theta) * ln( h_{ii,jj}(theta_n) ) + (1 - h_{ii,jj}(theta)) * ln( 1 - h_{ii,jj}(theta_n) ) ]

Furthermore, derivatives are taken with respect to theta, not theta_n. Page 5 gives more pertinent details from there.
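As a sanity check on that formula, here is a small numpy sketch (the sequence length, unit count, and values are invented; logistic activations keep everything in (0, 1)):

```python
import numpy as np

# hypothetical hidden-state sequences: T=5 steps of 10 logistic units each
rng = np.random.default_rng(2)
H   = rng.uniform(0.05, 0.95, size=(5, 10))   # h(theta)
H_n = rng.uniform(0.05, 0.95, size=(5, 10))   # h(theta_n)

# the scalar D, summed over all time steps ii and units jj
D = -np.sum(H * np.log(H_n) + (1 - H) * np.log(1 - H_n))

# sanity check: as a cross-entropy, D is smallest (in its second argument)
# when the two sequences agree, by Gibbs' inequality
D_same = -np.sum(H * np.log(H) + (1 - H) * np.log(1 - H))
print(D > D_same)
```

Since the logistic h lies strictly inside (0, 1), the absolute values inside the logarithms are unnecessary.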

Comments? Critiques?

I'm sorry for the notation; I could make it far more readable in LaTeX.

answered Jul 06 '11 at 00:50


Brian Vandenberg

This also just occurred to me: line 17 of the supplementary algorithm is the only addition needed to compensate for structural damping. That it is the only addition is a consequence of D and its derivative both being zero when evaluated at theta = theta_n (where D is minimized); the quadratic approximation to D contributes only second-order terms, which, as they note, prevents q_theta_n from being biased and induces only the one change mentioned earlier, on line 17.

(Jul 06 '11 at 01:43) Brian Vandenberg

