Short Version:

In a (Deep) Convolutional Neural Network, how do you determine which feature maps of one layer should be connected to which feature maps of the previous layer?

Long Version:

On deeplearning.net, CNNs are explained. Regarding the connectivity between consecutive layers, the text says:

"Notice how the receptive field spans all four input feature maps."

If I am not mistaken, this means that each feature map in layer n is connected to every feature map in the previous layer n-1.

Furthermore, the paper Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks describes an experiment on the public Street View House Numbers (SVHN) dataset. The authors state:

"The number of units at each spatial location in each layer is [48, 64, 128, 160] for the first four layers and 192 for all other locally connected layers."

Additionally, the authors say they used convolution kernels of size 5x5.

Do I understand correctly that even in such a large-scale CNN each feature map is connected to all previous feature maps? Take the last convolutional layer in this case: it consists of 192 units at each spatial location (i.e. 192 feature maps). Each of these feature maps is connected to all 192 feature maps of the previous layer, covering a 5x5 area. This results in 5 * 5 * 192 + 1 = 4801 parameters for each unit (including the bias). Thus, the whole layer consists of 4801 * 192 = 921792 weight parameters. Computationally, this means at each spatial location a 4801 x 192 matrix multiplication has to be performed. Is this correct?
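Here is how I arrive at those numbers, as a quick sanity check in a few lines of Python (the layer sizes are the ones quoted from the paper):

    # Parameter count for the last convolutional layer of the Street View model:
    # 192 input maps, 192 output maps, 5x5 kernels.
    kernel_h, kernel_w = 5, 5
    in_maps, out_maps = 192, 192

    params_per_map = kernel_h * kernel_w * in_maps + 1   # +1 for the bias
    total_params = params_per_map * out_maps

    print(params_per_map)  # 4801
    print(total_params)    # 921792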

I can't imagine how this is feasible for large input dimensions such as high-definition imagery (in this Street View example the inputs are just tiny 64x64 patches!).

If not, what is a good way to set up the connections between feature maps so that the number of parameters is reduced? (I am not talking about techniques like sparsity targets here, which merely drag some of the parameters towards zero.)

asked May 06 '14 at 05:31

Leon Schreon


One Answer:

The typical case is indeed that each feature map in layer n is connected to each feature map in the previous layer n-1.

However, "this means at each spatial location a 4801 x 192 matrix multiplication has to be performed" is a bit ambiguous.

To arrive at the output map values at a given spatial location, a 1x4801 vector (all the values in a 5x5 region across all 192 input maps, plus a 1 for the bias) is multiplied by a 4801x192 matrix.
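In other words, this is the "im2col" view of the convolution. A minimal numpy sketch with random data, just to illustrate the shapes (not how any real library implements it):

    import numpy as np

    # One spatial location: 192 input maps, a 5x5 receptive field, 192 output maps.
    in_maps, k, out_maps = 192, 5, 192

    patch = np.random.randn(1, k * k * in_maps)        # 1 x 4800 values from the 5x5x192 region
    patch = np.hstack([patch, np.ones((1, 1))])        # append a 1 for the bias -> 1 x 4801

    weights = np.random.randn(k * k * in_maps + 1, out_maps)  # 4801 x 192, last row holds the biases

    output = patch.dot(weights)                        # 1 x 192: one value per output map
    print(output.shape)                                # (1, 192)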

In practice, though, it is never implemented in this fashion. Convolutions can be computed much more efficiently than that, especially on GPUs. So while it's true that computing convolutions involves a lot of computation, in practice it's not nearly as bad as you make it sound :)

As you point out, once you go beyond 64x64 this does become a bit slow, even if you have GPUs at your disposal. In practice, people have 'solved' this issue by using strided convolutions: instead of shifting the filters by 1 pixel, you shift them by 4 or so (this is what Alex Krizhevsky did in his 2012 ImageNet paper). This considerably reduces the size of the output feature maps and the amount of computation needed.
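To get a feel for the effect of the stride, here is the output-size arithmetic for a 'valid' convolution with no padding (the input size below is just illustrative, not taken from either paper):

    # Output width of a 'valid' convolution: floor((input - kernel) / stride) + 1.
    def conv_output_size(input_size, kernel_size, stride):
        return (input_size - kernel_size) // stride + 1

    print(conv_output_size(224, 5, 1))  # 220 -> 220*220 spatial locations to evaluate
    print(conv_output_size(224, 5, 4))  # 55  -> roughly 16x fewer locations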

That said, Alex's library does seem to have a few implementations of what he calls 'sparse' convolutions, i.e. convolutions where each filter only looks at a subset of the input feature maps. I haven't seen this used in the wild yet, though. Have a look at the cuda-convnet documentation for more info.
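For completeness, the saving from such restricted connectivity is easy to work out. The sketch below just compares parameter counts for the 192-map example above, assuming each output map sees only 48 of the 192 input maps (that grouping is my own illustrative assumption, not necessarily how cuda-convnet does it):

    # Full connectivity vs. a hypothetical scheme where each output map
    # only looks at a fixed subset of the input maps.
    k, in_maps, out_maps, subset = 5, 192, 192, 48

    full_params = (k * k * in_maps + 1) * out_maps    # 921792
    sparse_params = (k * k * subset + 1) * out_maps   # 230592

    print(full_params, sparse_params)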

answered May 07 '14 at 15:13

Sander Dieleman

edited May 08 '14 at 19:35
