Short Version: In a (Deep) Convolutional Neural Network, how do you determine which feature maps of one layer should be connected to which feature maps of the previous layer?

Long Version: On deeplearning.net, CNNs are explained. Regarding the connectivity between subsequent layers, if I am not mistaken, the description there means that each feature map in a given layer is connected to all feature maps of the previous layer.

Furthermore, the paper Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks describes an experiment on the public Street View House Numbers dataset. The authors state they used convolution kernels of size 5x5.

Do I understand correctly that even in such a large-scale CNN each feature map is connected to all previous feature maps? Take the last convolutional layer in this case: it consists of 192 feature maps, each connected to all 192 feature maps of the previous layer, so at each spatial location a 4801 x 192 matrix multiplication has to be performed. I can't imagine how this is feasible for large input dimensions like high-definition imagery (in this Street View example the inputs are just tiny 64x64 patches!).

If not, what is a good way to initialize the connections between feature maps, so that the number of parameters gets reduced? (I am not speaking of techniques like sparsity targets here, which drag some of the parameters towards zero.)
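To make the numbers concrete, here is a rough back-of-the-envelope sketch in Python. The 192 input maps, 192 output maps and 5x5 kernels are my reading of the paper's last convolutional layer, so treat them as assumptions:

```python
# Rough parameter/compute count for a conv layer in which every output
# map is connected to every input map. Assumed figures: 192 input maps,
# 192 output maps, 5x5 kernels.
c_in, c_out, k = 192, 192, 5

weights_per_output_map = k * k * c_in          # 5*5*192 = 4800
params = c_out * (weights_per_output_map + 1)  # +1 bias per output map

print(f"{weights_per_output_map + 1} x {c_out} multiply at each spatial location")
print(f"{params:,} parameters in this single layer")  # ~922k
```

So the parameter count itself stays below a million; what worries me is that this per-location computation is repeated over the whole spatial extent, which grows with the input size.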
The typical case is indeed that each feature map in layer n is connected to each feature map in the previous layer n-1.

However, "this means at each spatial location a 4801 x 192 matrix multiplication has to be performed" is a bit ambiguous. To arrive at the output map values at a given spatial location, a 1x4801 vector (all the values in a 5x5 region across all 192 input maps, plus a 1 for the bias) is multiplied by a 4801x192 matrix.

However, in practice, this is never implemented in this fashion. Convolutions can be computed much more efficiently than that, especially on GPUs. So while it's true that computing convolutions involves a lot of computation, in practice it's not nearly as bad as you make it sound :)

As you point out, once you go beyond 64x64 inputs, this does become a bit slow, even if you have GPUs at your disposal. In practice, people have 'solved' this issue by performing strided convolutions: instead of shifting the filters by 1, you shift them by 4 or so (this is what Alex Krizhevsky did in his 2012 ImageNet paper). This considerably reduces the size of the output feature maps, and with it the amount of computation needed.

That said, Alex's library does seem to have a few implementations of what he calls 'sparse' convolutions, i.e. convolutions where each filter only looks at a subset of the input feature maps. I haven't seen this used in the wild yet, though. Have a look at the cuda-convnet documentation for more info.
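To illustrate why striding helps, here is a minimal sketch; the 64x64 input, 5x5 kernels and 192 input/output maps are carried over from the discussion above, and stride 4 is the ballpark value mentioned for the ImageNet case:

```python
# Output size and multiply-accumulate count of a "valid" convolution,
# with and without striding. Figures assumed from the discussion above:
# 64x64 input maps, 5x5 kernels, 192 input and 192 output maps.
def output_size(n, k, stride):
    # number of positions a k-wide filter fits along an n-wide axis
    return (n - k) // stride + 1

n, k, c_in, c_out = 64, 5, 192, 192

for stride in (1, 4):
    out = output_size(n, k, stride)
    macs = out * out * (k * k * c_in) * c_out  # one 4800x192 multiply per location
    print(f"stride {stride}: {out}x{out} output maps, "
          f"~{macs / 1e9:.1f}G multiply-accumulates")
```

Going from stride 1 to stride 4 shrinks the output from 60x60 to 15x15 locations, i.e. roughly a 16x reduction in work, which (together with efficient GPU convolution routines) is what makes layers of this size practical.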