
Developed by Masci et al., the Convolutional Auto-Encoder algorithm is two-fold: feature map construction and reconstruction. Let us ignore pooling for brevity.

1) Feature Maps Construction (Understood)

Suppose x is an n x n matrix and W an m x m matrix. A feature map h^k is computed as

$$h^k = \sigma\left(x \ast W^k + b^k\right) \qquad \text{(Equation 1)}$$

where $\sigma$ is the activation function, $b^k$ is a bias, and $\ast$ denotes a 2D valid convolution.

So, h^k is of size (m-n+1)x(m-n+1) when using valid convolution.
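A minimal sketch of Equation 1 for a single feature map, assuming a sigmoid activation and scipy's valid-mode convolution (the variable names and toy sizes are my own):

```python
import numpy as np
from scipy.signal import convolve2d

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy sizes: n x n input, m x m filter
n, m = 8, 3
x = np.random.randn(n, n)   # input image
W = np.random.randn(m, m)   # encoder filter for one feature map
b = 0.1                     # scalar bias for this feature map

# Equation 1: valid convolution followed by the activation
h = sigmoid(convolve2d(x, W, mode='valid') + b)

print(h.shape)              # (6, 6) for these toy sizes
```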

2) Reconstruction (Problem lies here)

After obtaining the feature maps h, the original image is reconstructed by the following equation,

$$y = \sigma\left(\sum_{k \in H} h^k \ast \tilde{W}^k + c\right) \qquad \text{(Equation 2)}$$

where H is the set of feature maps, $\tilde{W}^k$ is the filter $W^k$ flipped over both dimensions, $c$ is a bias, and y is the reconstruction, which should equal the real input x. The problem lies here.

Knowing that

  • $\tilde{W}$ is a filter, perhaps of size m x m,
  • h is of size (m-n+1)x(m-n+1),
  • a valid convolution produces an output smaller than h,

how would y, which is larger than either h or $\tilde{W}$, be reconstructed?

Thanks in advance.

asked Mar 09 '14 at 16:40

Issam Laradji


One Answer:

As you mentioned, the convolution in the forward direction is a valid convolution, but the convolution for reconstruction should be a full convolution to get the same size as the input. I'm not familiar with this particular model, but in the case of convolutional RBMs, that's how it's done.

By the way, the size of h^k is (n - m + 1) x (n - m + 1); I think you switched m and n.

The output size of a full convolution is input_size + filter_size - 1, so in this case you get (n - m + 1) + m - 1 = n, as desired.
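A quick way to see the sizes work out is to run both directions on toy arrays; this is a minimal sketch assuming scipy's convolve2d, where the flip and the full mode play the roles of $\tilde{W}$ and the full convolution in Equation 2:

```python
import numpy as np
from scipy.signal import convolve2d

n, m = 8, 3
x = np.random.randn(n, n)                      # n x n input
W = np.random.randn(m, m)                      # m x m encoder filter

# Forward direction: valid convolution
h = convolve2d(x, W, mode='valid')
print(h.shape)                                 # (6, 6) = (n - m + 1, n - m + 1)

# Reconstruction direction: full convolution with the flipped filter
y = convolve2d(h, W[::-1, ::-1], mode='full')
print(y.shape)                                 # (8, 8): (n - m + 1) + m - 1 = n
```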

If you only have an implementation for the valid convolution, you can use it to do a full convolution by padding the input with m - 1 zeros on each side.
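For example, a sketch of that padding trick with scipy, assuming zero padding of m - 1 on every side of the input:

```python
import numpy as np
from scipy.signal import convolve2d

m = 3
h = np.random.randn(6, 6)                      # (n - m + 1) x (n - m + 1) feature map
W = np.random.randn(m, m)

full = convolve2d(h, W, mode='full')

# Pad h with m - 1 zeros on each side, then use the valid convolution
h_padded = np.pad(h, m - 1)
valid_on_padded = convolve2d(h_padded, W, mode='valid')

print(np.allclose(full, valid_on_padded))      # True
```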

answered Mar 10 '14 at 05:05

Sander Dieleman

I guessed the reconstruction involved full-convolution - thanks for confirming this. :)

(Mar 10 '14 at 06:37) Issam Laradji

What I don't understand is why W needs to be flipped.

(Mar 11 '14 at 05:11) Ng0323

In the default definition of 'convolution', it is required that the filters are flipped (I always assumed this is because it makes the math prettier). If you don't flip the filters, what you are computing is called a 'correlation'.

However, when you are learning the filters from data, it doesn't actually matter whether the filters are flipped or not (if you flip them in your code, the learnt filters will just be flipped as well).

People usually include this flip so what they do adheres to the standard definition of a convolution, but in this context there really is no need for that. In fact, I believe the convolutions in cuda-convnet don't do it, for example (so they are actually correlations).

(Mar 11 '14 at 06:14) Sander Dieleman
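A minimal numerical check of that relationship, assuming scipy's convolve2d and correlate2d: convolving with a filter gives the same result as correlating with the filter flipped over both axes.

```python
import numpy as np
from scipy.signal import convolve2d, correlate2d

x = np.random.randn(5, 5)
W = np.random.randn(3, 3)

conv = convolve2d(x, W, mode='valid')
corr = correlate2d(x, W[::-1, ::-1], mode='valid')

print(np.allclose(conv, corr))  # True: convolution is correlation with a flipped filter
```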

Maybe he is asking why the decoders are defined as the flipped version of the encoders... If that is the case, this is just a regularization choice.

(Mar 11 '14 at 20:06) eder

Thanks for the inputs! So then whether or not to flip the decoder W is another hyperparameter. I'm quite surprised to learn that in the code it can be either a convolution or a correlation.

(Mar 11 '14 at 22:13) Ng0323