I'm working on constructing an RNN along the lines of what James Martens describes in his paper Learning Recurrent Neural Networks with Hessian-Free Optimization, and I'm a little confused about how backprop works with RNNs.

When calculating the derivatives, it looks as though some of the errors propagate forward through time.

Here are the forward equations (quoted from the paper):

t_i    = Whx*x_i + Whh*h_(i-1) + bh
h_i    = e(t_i)
s_i    = Wyh*h_i + by
yhat_i = g(s_i)

... where h are hidden units, yhat are the output units, and t/s are their net inputs.
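
To make sure I'm reading the equations right, here's a minimal numpy sketch of the forward pass as I understand it (I'm assuming e = tanh and g = identity purely for concreteness; the variable names simply mirror the symbols above):

import numpy as np

def forward(x, h0, Whx, Whh, Wyh, bh, by):
    """Forward pass for the equations above; x has one column per time step."""
    n_hidden, T = Whh.shape[0], x.shape[1]
    h = np.zeros((n_hidden, T))
    yhat = np.zeros((Wyh.shape[0], T))
    h_prev = h0
    for i in range(T):
        t_i = Whx.dot(x[:, i]) + Whh.dot(h_prev) + bh   # net input to the hidden units
        h[:, i] = np.tanh(t_i)                          # e = tanh (my assumption)
        s_i = Wyh.dot(h[:, i]) + by                     # net input to the outputs
        yhat[:, i] = s_i                                # g = identity (my assumption)
        h_prev = h[:, i]
    return h, yhat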

When calculating the derivative with respect to Whx, since h_(i-1) also makes use of that weight, the derivative of the error at the i-th time-step depends on derivatives from prior time-steps.

Basically, my intuition tells me I'm doing this wrong, but the math seems to suggest otherwise. Comments? Suggestions?


edit: Adding an equation:

For just two time-steps with one output, where j indexes the input vector and k indexes the hidden-layer, this is the expression I came up with for dEdWhx:

dEdWhx_jk =    (yhat_1 - y_1)           // dE/dyhat_1
            .* g'(s_1)                  // dyhat_1/ds_1
            .* Wyh_k                    // ds_1/dh_{1,k}
            .* e'(t_{1,k})              // dh_{1,k}/dt_{1,k}
            .* [                        // dt_{1,k}/dWhx_{j,k}
                  x_{1,j}               // the part that depends only on the external input
                + d/dWhx_{j,k}( Whh*h_0 )  // the part that depends on the previous time-step
               ]
            + (...)                     // derivative of the error at the output of time-step 0
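
To convince myself one way or the other, I was planning to check the expression against a centered finite-difference estimate on a tiny example, something like the sketch below (again assuming e = tanh, g = identity, and a squared-error loss, which are my assumptions and not from the paper):

import numpy as np

def total_error(Whx, Whh, Wyh, bh, by, h0, x, y):
    """Squared error summed over all time steps, using the forward pass above."""
    h_prev, E = h0, 0.0
    for i in range(x.shape[1]):
        h_prev = np.tanh(Whx.dot(x[:, i]) + Whh.dot(h_prev) + bh)
        yhat_i = Wyh.dot(h_prev) + by
        E += 0.5 * np.sum((yhat_i - y[:, i]) ** 2)
    return E

def numerical_dWhx(row, col, eps, Whx, Whh, Wyh, bh, by, h0, x, y):
    """Centered difference for a single entry of Whx.  Perturbing that entry
    changes h_0 and therefore h_1 as well, so the prior-time-step term in the
    expression above has to be there for the analytic gradient to match."""
    Wp, Wm = Whx.copy(), Whx.copy()
    Wp[row, col] += eps
    Wm[row, col] -= eps
    return (total_error(Wp, Whh, Wyh, bh, by, h0, x, y)
            - total_error(Wm, Whh, Wyh, bh, by, h0, x, y)) / (2 * eps)

If I drop the d/dWhx(Whh*h_0) term from the expression, I'd expect the analytic value to stop agreeing with this estimate.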

asked Jun 03 '11 at 13:41 by Brian Vandenberg
edited Jun 03 '11 at 14:17


One Answer:

I'm not 100% sure what you mean, but if you mean that part of the derivative comes from previous time steps, then yes, it does, via the derivative of the activation function e. For example, the derivative of tanh is 1 - h_i^2, and because h_i depends on previous time steps, the final gradient does as well.

In the end, backprop in an RNN is the same as in a feedforward network after unfolding it over time, except that the weights are tied/shared, so you have to sum their individual gradients at the end. When you use matrix algebra to compute the gradients, this leads to matrix multiplications for the final quantities of interest that automatically include this summation over time. In matrix notation, the backprop algorithm should be something like:

dtanh = 1 - h.^2
pre_dh = Wyh^T * dy   (so this is a matrix now, one column per time step)
dh[:, end] = pre_dh[:, end] .* dtanh[:, end]
dh[:, i-1] = (pre_dh[:, i-1] + Whh^T * dh[:, i]) .* dtanh[:, i-1]   (for i = end, ..., 2)

dWyh = dy * h^T
dWhh = dh[:, 2:end] * h[:, 1:end-1]^T
dWhx = dh * x^T

Sorry for mixing matlab and numpy notation a bit...
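
If it helps, here is the same thing again as a plain numpy sketch. It assumes tanh hidden units, linear outputs and a squared-error loss (so dy is just yhat - y), with one column per time step, and it ignores the contribution of the initial state h_0 (i.e. h_0 is treated as zero):

import numpy as np

def bptt(x, y, h, yhat, Whh, Wyh):
    """Gradients for the network above: tanh hidden units, linear outputs,
    squared error.  h and yhat come from the forward pass; columns index time."""
    T = x.shape[1]
    dy = yhat - y                              # dE/ds at every time step
    pre_dh = Wyh.T.dot(dy)                     # error arriving at the hidden layer from the outputs
    dtanh = 1.0 - h ** 2                       # tanh'(t_i) written in terms of h_i
    dh = np.zeros_like(h)                      # dE/dt for the hidden units
    dh[:, -1] = pre_dh[:, -1] * dtanh[:, -1]   # last time step: no error from the future
    for i in range(T - 1, 0, -1):              # walk backwards through time
        dh[:, i - 1] = (pre_dh[:, i - 1] + Whh.T.dot(dh[:, i])) * dtanh[:, i - 1]
    # The weights are shared across time, so each gradient is a sum over time
    # steps; the matrix products below perform that summation implicitly.
    dWyh = dy.dot(h.T)
    dWhh = dh[:, 1:].dot(h[:, :-1].T)          # pairs dE/dt at step i with h at step i-1
    dWhx = dh.dot(x.T)
    dbh, dby = dh.sum(axis=1), dy.sum(axis=1)
    return dWhx, dWhh, dWyh, dbh, dby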

answered Jun 03 '11 at 13:56 by Philemon Brakel
edited Jun 03 '11 at 14:10

No worries, it's still readable. We really need LaTeX.

(Jun 03 '11 at 14:18) Brian Vandenberg

Just to briefly summarize: the reason the error propagates forward through time is that the weights are shared. In a typical back-prop trained network, the error only propagates backward through the network because the weights connect layers together in a strictly feed-forward fashion, and only connect two layers at a time. This architecture is different because the same set of weights is, in a sense, used for multiple layers in a large deep network.

(Jun 07 '11 at 01:43) Brian Vandenberg

The error does not propagate forward through time. It goes from t to t-1, t-2, ..., which is backward in time. Furthermore, this does not have anything to do with the weight sharing; it would apply even if you had different weights at each time step. The "through time" idea is that you can differentiate through those time steps, which is just the observation that something(t) can be differentiated with respect to somethingelse(t-1) if there is a relationship; time is just a man-made concept here.

(Jun 07 '11 at 02:11) Justin Bayer

@Justin - Yes, I stated that wrong. It isn't the error propagating forward through 'time', but rather some of the error associated with time step 'i' is derived from derivatives at earlier time steps.

(Jun 07 '11 at 15:58) Brian Vandenberg

If you want to go with the nomenclature of the community though, I'd drop the "forward". To make the backward direction more intuitive: first you activate the network, and the activations from t go to t+1 via W_hh. Afterwards, when you have the error at time step t', it depends on the outputs at time step t'-1, so the error has to flow back to t'-1 as well. So first you move forward through your connectivity graph, and once you have the error you go backward. It's like BACKprop. Through time. ;)

(Jun 08 '11 at 02:21) Justin Bayer