I have some questions on James and Ilya's paper and its supplementary materials; hopefully someone in the community can help shed some light on this. Section 3.2 describes the structural damping process, which amounts to treating the hidden units as outputs and defining a non-linear, non-quadratic distance function D over them. Penalizing D encourages the hidden units to resist changing state between time-steps. I'm confused on a few points.
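For concreteness, here is how I currently read the structural damping penalty, as a minimal numpy sketch. The names (h_theta, h_theta_n, mu) and the logistic-hidden-unit assumption are mine, not the paper's notation, so treat this as a guess at the idea rather than the actual method:

    import numpy as np

    def structural_damping_penalty(h_theta, h_theta_n, mu):
        # Cross-entropy "distance" between the hidden-state trajectories:
        # h_theta under the candidate parameters, h_theta_n under the
        # current parameters (the latter acting as the target).
        eps = 1e-12  # guard against log(0)
        ce = -(h_theta_n * np.log(h_theta + eps)
               + (1.0 - h_theta_n) * np.log(1.0 - h_theta + eps))
        return mu * ce.sum()  # summed over units and time-steps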
That's about all I have for now. I may come up with some follow-ups, or others I didn't think of. Furthermore, if anyone else has questions, feel free to edit this post (I've made it community wiki) and add your question to the mix.

Edit (minor update): I haven't had much time to tinker, but I do have one small thing to add. I think the key phrase for part of my question is "as we would when applying reverse-mode automatic differentiation". I'm guessing lines 11 and 12 are doing just that. I'm not familiar with automatic differentiation tools and techniques, but I think that sentence is meant to clue us in on how to derive the appropriate expressions rather than describing any actual computation being performed.
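In case it helps anyone else, here is a toy illustration of what I understand reverse-mode accumulation to mean: run the forward pass storing intermediates, then propagate adjoints back through them in reverse order. This is just my own sketch for y = sum(tanh(W x)), not the computation in the supplement:

    import numpy as np

    def forward_backward(W, x):
        # Forward sweep, storing intermediates.
        a = W @ x            # pre-activations
        h = np.tanh(a)       # hidden activations
        y = h.sum()          # scalar output
        # Reverse sweep: propagate adjoints in the opposite order.
        h_bar = np.ones_like(h)          # dy/dh
        a_bar = h_bar * (1.0 - h ** 2)   # dy/da via tanh'
        W_bar = np.outer(a_bar, x)       # dy/dW
        x_bar = W.T @ a_bar              # dy/dx
        return y, W_bar, x_bar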
Hi Brian, here are some answers to your questions: (Note that g is the output nonlinearity rather than the hidden-hidden nonlinearity; the latter is e).
Thanks, Ilya. I'm almost there. I have everything up to and including CG written (though I haven't had time to put it through its paces yet). Now I need to write the training code, work out all the bugs, and set up the synthetic tests.
(Jul 08 '11 at 16:21)
Brian Vandenberg
Considering the function D, I don't think we ever need those target values. For the Gauss-Newton approximation we actually need to compute dNet/dw and the second derivative of the loss function with respect to x (the net input). For matching loss functions (e.g. logistic sigmoid with cross-entropy) the first derivative is y(x) - t, so the second derivative with respect to x is just dy/dx. I think that's the reason D is not stated explicitly in the paper: we simply assume its first derivative is y - t, and we only ever need the second derivative, not the first.
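To illustrate what I mean, here is my own sketch (with made-up names: J stands for the Jacobian dNet/dw, x for the net inputs, v for the CG direction). For a matching loss the curvature product needs only dy/dx, and the targets t never appear:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def gauss_newton_vector_product(J, x, v):
        # Gv = J^T H_L J v. For logistic units with a matching
        # (cross-entropy) loss, the Hessian of the loss w.r.t. the net
        # input x is diagonal with entries dy/dx = y*(1-y).
        y = sigmoid(x)
        h_l = y * (1.0 - y)
        return J.T @ (h_l * (J @ v))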
Re: the Kronecker product stuff, per James Martens (personal communication), that should be a sum rather than a Kronecker product. He intends to post a corrected version of the paper shortly.
I have an answer to part of this question. The distance function D is notated as D(h(theta), h(theta_n)) in the paper, which I take to mean it covers all hidden units across all time-steps. It also appears that D is a scalar, and since it is a matching loss function, if h were modeled as logistic it would likely take the form

    D = -sum_{i,j} [ h_ij(theta_n) * ln(h_ij(theta)) + (1 - h_ij(theta_n)) * ln(1 - h_ij(theta)) ]

with h(theta_n) playing the role of the target; derivatives are taken with respect to theta, not theta_n. Page 5 gives more pertinent details from there. Comments? Critiques?

This also just occurred to me: line 17 of the supplementary algorithm is the only addition made to compensate for structural damping, and the fact that it is the only addition is a consequence of the statement that D and its derivative are both zero when evaluated at theta = theta_n (where D is minimized); the quadratic approximation to D therefore contributes only second-order terms, which, as they note, prevents q_{theta_n} from being biased, and induces only the one change mentioned earlier on line 17.
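A quick sanity check of that claim (my own sketch, not from the paper): if D is the matching cross-entropy above, its first derivative with respect to the net input should be exactly y - t, which is easy to verify numerically:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def D(x, t):
        # Matching cross-entropy distance; t stands in for h(theta_n).
        y = sigmoid(x)
        return -(t * np.log(y) + (1.0 - t) * np.log(1.0 - y)).sum()

    # Finite-difference check that dD/dx = y - t, confirming we only
    # ever need the second derivative dy/dx in the curvature products.
    rng = np.random.default_rng(0)
    x = rng.standard_normal(5)
    t = sigmoid(rng.standard_normal(5))
    eps = 1e-6
    for i in range(5):
        xp = x.copy(); xp[i] += eps
        xm = x.copy(); xm[i] -= eps
        num = (D(xp, t) - D(xm, t)) / (2 * eps)
        assert np.isclose(num, sigmoid(x)[i] - t[i], atol=1e-5)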
(Jul 06 '11 at 01:43)
Brian Vandenberg
Maybe you should have a fixed, hidden "end" state, just like people use with HMMs, to finish the sequence without ignoring the last element.
Well, that would certainly explain line 12 of the supplement: if the end state is fixed, then certainly its derivative is zero. But wouldn't a fixed end state affect the error of the system when we compute it? Unless the state at t = T+1 is supposed to remain the same as at t = T ... hm.
It just occurred to me ... D∘e = D(e) is a scalar, and so is L(g). Why would either need to be represented as a Kronecker product?
I'm mildly curious about something: within minutes of my posting this question there were already almost 100 hits, and after an hour there were over 200. I posted this at 2 am, so either a lot of people lurk here at 2 am, there are a lot of geeks at ICML watching the board, or a large chunk of our audience is in a very different timezone from mine (US).
I'm on west coast time. Did you ever get the Hessian-free feedforward neural network working? And how did it perform?
@Brian, I'm in the UK, and it was in the morning that I replied to you, probably.
@Jacob: I did get it working, and it worked pretty well. The training time is amazingly fast on MNIST (one GeForce GTX 570). I haven't tried other datasets yet.