This question is in part based on a comment from an earlier question by Philemon Brakel. In his comment, he said:
I have a mostly functioning implementation of this algorithm, but I'm a little unsure how to handle the biases. Philemon's comment gives me the impression that concatenating ones to the activation vectors might be appropriate for intermediate layers but inappropriate for the output layer (this from the R(1) = 0 piece). If someone could help by pointing me in the right direction, I would sincerely appreciate it.
The backpropagation algorithm for computing the Hessian-vector (Hv) or Gauss-Newton-vector (Gv) product propagates two different quantities. If there is a layer of hidden units z in your network, you will also be propagating a vector R(z). A common trick for handling biases in a neural network is to concatenate the value 1 to the input of a weight layer and to add an extra column to the weight matrix itself. I made the mistake of concatenating 1 not only to the vector z but also to the vector R(z), whereas R(constant) should have the value 0. The easiest fix is to append a 0 to R(z) instead, or to ignore the corresponding column of the weight matrix. So the difference is not about which layer of the network you consider but about the type of quantity you are computing. For me this was one of those little bugs that you overlook for days while looking for mistakes in more complicated parts of the code...
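To make the fix concrete, here is a minimal sketch of the R{.}-forward pass for a single hidden layer of tanh units with linear outputs. The augmented matrices W1, W2 (bias in the last column) and the matching direction blocks V1, V2 are illustrative names and assumptions, not code from the question:

    import numpy as np

    def r_forward(x, W1, W2, V1, V2):
        # W1, W2: augmented weight matrices (biases in the last column);
        # V1, V2: matching blocks of the direction vector v, laid out the same way.
        x_aug = np.append(x, 1.0)       # concatenate 1 to the input for the bias trick

        a  = W1 @ x_aug                 # hidden pre-activations
        z  = np.tanh(a)
        Ra = V1 @ x_aug                 # R{x_aug} = 0: the input and the constant 1
        Rz = (1.0 - z ** 2) * Ra        # R{z} = tanh'(a) * R{a}

        z_aug  = np.append(z, 1.0)      # append 1 to the activations ...
        Rz_aug = np.append(Rz, 0.0)     # ... but append 0 to R(z): R(constant) = 0

        y  = W2 @ z_aug                 # linear outputs
        Ry = V2 @ z_aug + W2 @ Rz_aug   # the bias column of W2 meets the appended 0
        return y, z, Ry, Rz

Appending 0 (rather than 1) to R(z) is exactly the fix described above; equivalently, one can drop the bias column of W2 from the W2 @ Rz term.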
@Phil: Thank you, now I feel dumb for missing that :)
(Mar 16 '11 at 11:13)
Brian Vandenberg
@Phil: I have a follow-up on this, if you don't mind. I ran through the math treating the biases separately from the weight matrix, and I don't see any functional difference between having the biases as part of the weight matrix and keeping them as their own entity, other than potential inefficiencies in memory usage. The bias units still seem to affect R{f0} and R{r1}, but I think I see what you mean: R{z_i} = 0, where i is the bias unit, but the bias still has an effect on the other R{z_j}. Does that accurately sum it up? Are there any gotchas?
(Mar 18 '11 at 05:50)
Brian Vandenberg
I'm not sure if I understand your notation correctly. Schraudolph uses f0 and r1 to refer to the different types of algorithms. Pearlmutter defines R{.} as (∂/∂r) f(w + rv) |_{r=0}, which can be seen as the rate of change in f caused by adding rv to w, with r going to 0. R{w} is just v, for example. The catch is that R behaves as a standard differential operator, so R{constant} = 0. If we have a feedforward network with outputs y and hidden units z, R{y} is defined as W R{z} + V z. In this case there is a contribution from the column of V that corresponds to the biases (or from the part of the vector v that is added to the part of the parameter vector that represents the biases). There is, however, no contribution from the actual biases themselves that are part of W, because the corresponding R{z} value is 0. If you use the concatenation of 1 or 0, it just boils down to appending a 1 to z when appropriate and appending a 0 to R{z}. This is all assuming you are computing Hv. It might be a good idea to implement the Hv version first to be sure about the forward pass, as it is easier to verify with finite differences (a sketch of such a check follows this comment). Checking Gv is also possible but a bit more involved. I'm not sure if that answered your question...
(Mar 18 '11 at 08:04)
Philemon Brakel
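A minimal sketch of the finite-difference check mentioned in the comment above; loss_grad, the step size eps, and the tolerance are assumptions for illustration:

    import numpy as np

    def check_hv(loss_grad, w, v, hv, eps=1e-5):
        # loss_grad(w): gradient of the loss at (flattened) parameter vector w.
        # hv: the Hessian-vector product computed with the R{.} operator.
        # Central difference: Hv ~= (grad(w + eps*v) - grad(w - eps*v)) / (2*eps)
        hv_fd = (loss_grad(w + eps * v) - loss_grad(w - eps * v)) / (2.0 * eps)
        denom = np.linalg.norm(hv) + np.linalg.norm(hv_fd) + 1e-12
        return np.linalg.norm(hv - hv_fd) / denom   # small (e.g. < 1e-4) if the pass is correct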
I think it did. Sorry for the notational kludges. Basically, what I was trying to express is this: if I compute R_v{.} and R_c{.} -- where c is an update to the biases -- I don't see any functional difference between that and having c be a column of v. I was trying to ask whether there are any subtle consequences (other than what we've been discussing) with regard to the biases.
(Mar 18 '11 at 12:59)
Brian Vandenberg
In that case the answer is simply that there are no other subtle issues that I know of. There is indeed no functional difference between using separate biases and just using an additional column of the weight matrix (a quick numerical illustration follows this comment). That's just a trick to simplify notation or optimize the performance of the implementation.
(Mar 19 '11 at 08:11)
Philemon Brakel
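A quick numerical illustration of that equivalence, with arbitrary shapes and random values:

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((3, 5))     # weight matrix
    b = rng.standard_normal(3)          # separate bias vector
    z = rng.standard_normal(5)          # activations

    separate = W @ z + b                                        # explicit bias
    folded   = np.hstack([W, b[:, None]]) @ np.append(z, 1.0)   # bias as extra column
    assert np.allclose(separate, folded)                        # identical up to rounding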