Hello!

I am implementing the Convolutional Deep Belief Network described in the paper "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations" by H. Lee et al., ICML 2009.

I managed to implement the Convolutional RBM, but I am not sure how to do the stacking part. The paper says that after training the first layer, its weights are frozen and its activations are used as input to the next layer.

Let's say my inputs are real-valued grayscale images. If I am training k filters for the first layer, that means I will have k feature maps at the end of the first layer. How do I now use these feature maps, or more precisely their pooled versions, as the input to the next layer?

Do I use each of the k maps as an input on its own and train k_2 filters (of size filter_width x filter_height) for every map, where k_2 is the number of filters I want to learn for the second layer, which would give me k * k_2 feature maps at the end of the second layer? Or do I learn k_2 3D filters (of size filter_width x filter_height x k), which would give me k_2 feature maps at the end of layer 2? Or do I have to somehow combine the feature maps from the first layer to get a single input for the next layer?
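To make the two options concrete, here is how I am picturing the shapes (just an illustrative numpy sketch; the sizes and names are made up):

```python
import numpy as np

k, k_2 = 8, 12                      # filters in layer 1 and layer 2 (example values)
fh, fw = 5, 5                       # filter_height, filter_width
pooled = np.zeros((k, 32, 32))      # the k pooled feature maps from layer 1 (toy size)

# Option 1: a separate set of k_2 2D filters for every input map
filters_per_map = np.zeros((k, k_2, fh, fw))
# -> k * k_2 feature maps at the end of layer 2

# Option 2: k_2 3D filters, each spanning all k input maps at once
filters_3d = np.zeros((k_2, k, fh, fw))
# -> only k_2 feature maps at the end of layer 2
```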

In the paper "Gradient-based Learning Applied To Document Recognition" by LeCun et al., where convolutional neural networks are described, multiple feature maps from one layer are combined in a defined way (Table 1 in the paper I mentioned) before they are used as input to the next layer.

I wasn't able to find how this is done in the Convolutional DBN paper. Can someone explain the correct way of doing the stacking?

The other question I have is about the types of units I should use in the higher layers of the DBN. In my first layer, the visible units are real-valued and the hidden units are binary. After pooling, the units in the pooling layer are also binary, and they should be used as the input to the second layer. Should the input units of my second layer be binary, or should I use the probabilities of the pooling-layer units being on and have Gaussian units at the input of my next layer?

Thank you for your help!

asked Jan 25 '14 at 21:05

Petar Palasek

Have you found another answer to this question? I have the same problem.

(Sep 16 '14 at 03:16) Baptiste Wicht

One Answer:

I have not read the paper you mentioned yet, but concerning conv-nets: images are treated as tensors with dimensions channels x width x height. So for a 256x256 color image we have a 3x256x256 tensor (red, green and blue 256x256 images). The filters in the conv layers must have the same number of channels as their input "images".
To make it clearer, let's continue our example with a 256x256 color image. Assume you want 10 filters in the first conv-pool layer, and that after convolution and max-pooling your resulting maps are of size 128x128. Then the input to the second layer has dimensions 10x128x128, and the second-layer filters have to be of size 10 x width x height. Got it?
So if, after all the conv layers you want, you add, say, a logistic regression layer, you have to flatten the resulting 3D tensor into a 1D vector and feed it as input to the logistic regression or multilayer perceptron (these last two are also called fully connected layers).
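As a rough numpy sketch of the shapes involved (no actual convolution code; the filter sizes and the number of second-layer filters are just made-up examples):

```python
import numpy as np

image = np.zeros((3, 256, 256))         # color image: channels x height x width

# first conv-pool layer: 10 filters, each spanning all 3 input channels
filters1 = np.zeros((10, 3, 5, 5))      # 5x5 is just an example filter size
maps1 = np.zeros((10, 128, 128))        # after convolution + max-pooling

# second layer: the filters must have 10 channels to match their input
filters2 = np.zeros((20, 10, 5, 5))     # e.g. 20 filters in layer 2
maps2 = np.zeros((20, 64, 64))          # after the second conv-pool step

# to feed a fully connected layer, flatten the 3D tensor into a 1D vector
flat = maps2.reshape(-1)                # shape: (20 * 64 * 64,)
```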
If I had to translate this to an RBM, I would try to learn a separate conv-RBM for each channel and then combine everything with a regular (fully connected) RBM at the very end. You can try to combine things before the end, but keep in mind that each channel has its own convolution; if you use the same filter across all channels, that amounts to restricting/regularizing your net structure, which may or may not be good, depending on your data set.
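And roughly what I mean by combining everything at the end (shapes made up, no RBM training shown):

```python
import numpy as np

# hypothetical hidden/pooled outputs of a separate conv-RBM per color channel
hidden_r = np.zeros((10, 128, 128))
hidden_g = np.zeros((10, 128, 128))
hidden_b = np.zeros((10, 128, 128))

# concatenate the flattened maps and use the result as the visible layer
# of an ordinary fully connected RBM on top
visible = np.concatenate([hidden_r.ravel(), hidden_g.ravel(), hidden_b.ravel()])
n_visible = visible.size                # visible-layer size of the top RBM
```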

So, I hope I didn't end up answering a different question...

answered Jan 27 '14 at 09:47

eder

Thank you! Using tensors makes the most sense. I'll leave the question marked as unanswered for a while in case someone answers the second part too. Thanks again! :)

(Jan 27 '14 at 10:14) Petar Palasek

I'm sorry, I missed that part. So here goes my guess: remember what we do for the MNIST data set. The outputs of the layers are probabilities. In order to run contrastive divergence, we need to sample from those probabilities. So I think you're not binarizing the data, you're sampling from that "probability field" generated by the conv layer. Whether you sample from a binary (Bernoulli) or Gaussian distribution depends on your design choices.
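As a tiny sketch of what I mean by sampling (pure numpy, the shapes are made up):

```python
import numpy as np

rng = np.random.RandomState(0)
pool_probs = rng.uniform(size=(10, 64, 64))   # P(pooled unit = 1) from layer 1

# binary option: sample Bernoulli states and feed them to a binary visible layer
binary_states = (rng.uniform(size=pool_probs.shape) < pool_probs).astype(float)

# Gaussian option: treat the (real-valued) probabilities as the input
# to a Gaussian visible layer, optionally adding noise when sampling
gaussian_input = pool_probs
```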

(Jan 27 '14 at 22:08) eder

In this case, if the entire image is used for training to learn the weights of layer 1, can you please explain the sequence of events that takes place? I mean, if we use random patches to learn the weights, then convolving with the entire image makes sense, but if we learn the weights using the entire images, where does the convolution step come in? This question might be naive, but I am a little confused.

(Feb 10 '14 at 04:00) Sharath Chandra
