Hi. I'm trying to implement deep learning with convolution and stacked autoencoders. Without convolution (working on full images) backprop works fine, and after pre-training the network classifies MNIST well. But when I cut the images into small (say 5x5) patches and use convolution, backpropagation doesn't work as expected.

On the forward pass, after the convolution stage, I flatten the result into an input vector x.

Then I feed x to the softmax classifier and calculate the softmax delta, delta_softmax = hypothesis - y, where y is the label (as a vector).

And now here is the part I think I'm doing wrong, but I'd like confirmation: I calculate the next delta as softmax_theta' * delta_softmax .* fp(x).

Where fp(*) is the derivative of the sigmoid activation function.
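For concreteness, here is a minimal NumPy sketch of that step (the original code is in the pastie below and may be in another language). It assumes x holds the flattened sigmoid activations of the convolutional layer and softmax_theta has shape (n_classes, n_hidden); the function names are illustrative, not taken from my code:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))        # shift for numerical stability
    return e / e.sum()

def sigmoid_prime_from_activation(a):
    # fp(.) written in terms of the sigmoid output a = sigmoid(z): sigma'(z) = a * (1 - a)
    return a * (1.0 - a)

def output_and_hidden_deltas(softmax_theta, x, y):
    """x: flattened conv-layer activations, shape (n_hidden,)
       y: label as a one-hot vector, shape (n_classes,)
       softmax_theta: softmax weights, shape (n_classes, n_hidden)"""
    hypothesis = softmax(softmax_theta @ x)
    delta_softmax = hypothesis - y                      # softmax delta
    # The fully-connected rule from the question:
    #   delta = softmax_theta' * delta_softmax .* fp(x)
    delta_hidden = (softmax_theta.T @ delta_softmax) * sigmoid_prime_from_activation(x)
    return delta_softmax, delta_hidden
```

Since x was produced by flattening, delta_hidden is indexed by the flattened units and would still have to be mapped back to the feature-map layout before going further down into the convolution stage.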

This is what I would do if the previous layer were fully connected, and I figured that after cutting out the relevant parts it would work the same way in the convolutional case, but when I verify it against the gradient calculated numerically it comes out wrong.

So am I correct that you can't just propagate the error this way, or does the discrepancy come from somewhere else?

asked Jul 12 '13 at 11:41

Bartosz Witkowski

Could you provide a code snippet? It's a bit hard to understand what you are saying.

Did you check that the numeric failure was not just epsilon-size related? I.e., do you see the numerical differentiation converge as epsilon -> 0, but to a different answer?

(Jul 13 '13 at 11:23) SeanV

Yes, I've verified that the discrepancy doesn't come from numeric error (I changed epsilon from 0.01 to 10^-6, and it converges to values different from the ones I calculate with backprop).

The snippet is lengthy, so I've created a pastie: http://pastie.org/8140225 Thanks for the input!

(Jul 14 '13 at 12:28) Bartosz Witkowski

One Answer:

A high-level but maybe useful comment: if you're implementing gradients yourself (for backprop), it's always good to check them against the finite-difference method. Evaluate the function at a point, then perturb each dimension by epsilon and evaluate again, estimating each partial derivative as (f(x+eps) - f(x)) / eps.
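A hypothetical sketch of that check in NumPy, where f is the loss as a function of the flattened parameters and grad_f is the analytic gradient from backprop (both names are illustrative):

```python
import numpy as np

def gradient_check(f, grad_f, theta, eps=1e-4, n_checks=20, rng=None):
    """Compare an analytic gradient against one-sided finite differences.

    f      : callable, theta -> scalar loss
    grad_f : callable, theta -> analytic gradient, same shape as theta
    """
    rng = np.random.default_rng(0) if rng is None else rng
    analytic = grad_f(theta).ravel()
    base = f(theta)
    for _ in range(n_checks):
        i = rng.integers(theta.size)                 # pick a random coordinate
        perturbed = theta.copy().ravel()
        perturbed[i] += eps
        numeric = (f(perturbed.reshape(theta.shape)) - base) / eps
        rel_err = abs(numeric - analytic[i]) / max(1e-8, abs(numeric) + abs(analytic[i]))
        print(f"dim {i}: numeric={numeric:.6e} analytic={analytic[i]:.6e} rel_err={rel_err:.2e}")
```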

It might also be that your problem is in optimization. If your gradient computation is correct, you should always be able to set a sufficiently small learning rate and see the error go down when each gradient is computed on the entire data set (though I recommend using a subset for this kind of debugging). If that does happen, then maybe you need to tune your SGD code.
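A minimal sketch of that second check, assuming loss and grad are closures over a fixed data subset (again, the names are illustrative, not from the original code):

```python
import numpy as np

def descent_sanity_check(loss, grad, theta, lr=1e-3, n_steps=100):
    """Full-batch gradient descent on a small subset: with a correct gradient and a
    small enough learning rate, the loss should (nearly) monotonically decrease."""
    prev = loss(theta)
    for step in range(n_steps):
        theta = theta - lr * grad(theta)
        cur = loss(theta)
        if cur > prev:
            print(f"step {step}: loss went up ({prev:.6f} -> {cur:.6f}); "
                  "suspect the gradient, or lower the learning rate")
        prev = cur
    return theta
```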

answered Jul 12 '13 at 20:29

Alexandre Passos ♦

Thanks! Yes, this is what I meant when I said that I compared my backprop implementation to the gradient evaluated numerically. I know how to verify whether the implementation is correct, but I'm at a point where everything seems to check out EXCEPT propagating the error from the highest layer to the previous one.

I thought this might be because the next delta is wrong, and comparing the biases to the ones calculated with the finite-difference method shows that it is, but I don't know whether the bug exists because I calculate the delta incorrectly or because of something else.

(Jul 13 '13 at 07:28) Bartosz Witkowski