Hello! I am implementing the Convolutional Deep Belief Network described in the paper "Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations" by H. Lee et al., ICML 2009. I managed to implement the Convolutional RBM, but I am not sure how to do the stacking part. The paper says that after training the first layer, its weights are frozen and its activations are used as input to the next layer.

Let's say my inputs are real-valued grayscale images and I am training k filters for the first layer, which means I will have k feature maps at the end of the first layer. How do I now use these feature maps, or more precisely their pooled versions, as input to the next layer? Do I use each of the k maps as an input on its own and train k_2 filters (of size filter_width * filter_height) for every map, where k_2 is the number of filters I want to learn for the second layer, which results in k * k_2 feature maps at the end of the second layer? Or do I learn k_2 3D filters (of size filter_width * filter_height * k), which results in k_2 feature maps at the end of layer 2? Or do I have to somehow combine the feature maps from the first layer into a single input for the next layer? In the paper "Gradient-Based Learning Applied to Document Recognition" by LeCun et al., where convolutional neural networks are described, multiple feature maps from one layer are combined in a defined way (Table 1 in that paper) before they are used as input to the next layer. I wasn't able to find how this is done in the Convolutional DBN paper. Can someone explain the correct way of doing the stacking?

My other question is about the types of units I should use in the higher layers of the DBN. In my first layer the visible layer consists of real-valued units and the units in the hidden layer are binary. After pooling, the units in the pooling layer are also binary, and they should be used as input to the second layer. Should the input units of my second layer be binary, or should I use the probabilities of the pooling units being turned on and have Gaussian units in the input of my next layer?

Thank you for your help!
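To make the two options concrete, here is a rough NumPy shape sketch of how I picture them (all sizes are made up and nothing here is taken from the paper):

```python
import numpy as np

# made-up sizes, just to make the shapes concrete
k = 24                  # first-layer filters -> k pooled feature maps
k2 = 40                 # filters I want to learn in the second layer
p_h, p_w = 50, 50       # size of one pooled first-layer map
f_h, f_w = 5, 5         # second-layer filter size

pooled = np.random.rand(k, p_h, p_w)        # stand-in for the k pooled maps

# Option A: a separate 2D filter bank for every input map
filters_a = np.random.randn(k, k2, f_h, f_w)
shape_a = (k * k2, p_h - f_h + 1, p_w - f_w + 1)   # k * k2 output maps

# Option B: k2 3D filters that span all k input maps
filters_b = np.random.randn(k2, k, f_h, f_w)
shape_b = (k2, p_h - f_h + 1, p_w - f_w + 1)       # k2 output maps

print(shape_a, shape_b)
```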
I have not read the paper you mentioned yet, but concerning conv-nets, images are treated as tensors with dimensions channels x width x height. So for a 256x256 color image we have 3x256x256 (red, blue and green 256x256 images). The filters at the conv layers must have the same number of channels as their input "images" (there is a rough sketch of this below). I hope I didn't end up answering a different question...

Thank you! Using tensors makes the most sense. I'll leave the question marked as unanswered for a while so that someone maybe answers the second part too. Thanks again! :)
(Jan 27 '14 at 10:14)
Petar Palasek
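In code, the point about channels looks roughly like this: each filter spans all channels of its input, so convolving one filter with a multi-channel image produces a single 2D feature map. This is only a slow illustrative NumPy sketch with made-up sizes, not an efficient or paper-exact implementation:

```python
import numpy as np

def conv_valid(image, filt):
    """'Valid' cross-correlation of a multi-channel image with one
    multi-channel filter; the channel dimension is summed out."""
    c, ih, iw = image.shape
    fc, fh, fw = filt.shape
    assert c == fc, "filter must have as many channels as the image"
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[:, y:y + fh, x:x + fw] * filt)
    return out

image = np.random.rand(3, 64, 64)            # e.g. a small RGB image
filters = np.random.randn(16, 3, 7, 7)       # 16 filters, each 3 channels deep
feature_maps = np.stack([conv_valid(image, f) for f in filters])  # 16 x 58 x 58
```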
I'm sorry, I missed that part. So, here goes my guess: remember what we do for the MNIST data set. The outputs of the layers are probabilities. In order to compute contrastive divergence, we need to sample from those probabilities. So I think you're not binning the data, you're sampling from the "probability field" generated by the conv layer (see the sketch below). Whether you sample from a binary or Gaussian distribution depends on your design choices.
(Jan 27 '14 at 22:08)
eder
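A minimal sketch of the sampling step described above (NumPy, shapes made up; whether you then pass the binary samples or the probabilities upward is the design choice mentioned):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hidden(probs):
    """Draw binary hidden states from their activation probabilities,
    as needed for the Gibbs steps of contrastive divergence."""
    return (rng.random(probs.shape) < probs).astype(probs.dtype)

hidden_probs = rng.random((24, 58, 58))      # made-up: k maps of 58x58 probabilities
hidden_states = sample_hidden(hidden_probs)  # binary samples, same shape
```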
In this case, if the entire image is used to learn the weights of layer 1, can you please explain the sequence of events taking place? I mean, if we use random patches to learn the weights, then convolution with the entire image makes sense, but if we learn the weights using the entire image, where does the convolution step come in? This question might be naive, but I am a little confused.
(Feb 10 '14 at 04:00)
Sharath Chandra
Have you found another answer to this question? I have the same problem.