Given sufficient data, CNNs trained with local optimisation converge to a minimum that is practically as good as the global one, regardless of the weight initialisation (i.e. with or without unsupervised pretraining). This is an empirical result, notably reported here. But why? I asked Yann LeCun about this at a conference a few weeks ago, and if I remember his answer correctly, it can be shown that the local extrema of a CNN's error surface are bunched together in the same region and have roughly the same values; the argument involves random matrix theory and polynomials. I can't find any papers on this - could anyone point me to some?
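
To make the claim concrete, here is a minimal toy experiment I put together (it is not from any of the papers I'm after; the network size, learning rate, synthetic data and number of seeds are all arbitrary assumptions on my part): train the same small two-layer network on a fixed regression task from several random initialisations and check whether the final losses land in a narrow band, which is roughly what "the minima are bunched up with similar values" would predict.

```python
# Toy sketch: same data, same architecture, different random initialisations.
# If local minima really have similar values, the final losses should cluster.
import numpy as np

rng_data = np.random.default_rng(0)
X = rng_data.normal(size=(200, 10))                              # fixed inputs
true_w = rng_data.normal(size=(10, 1))
y = np.tanh(X @ true_w) + 0.1 * rng_data.normal(size=(200, 1))   # fixed targets

def train(seed, hidden=32, lr=0.05, steps=2000):
    """Train a two-layer tanh network by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(10, hidden))
    W2 = rng.normal(scale=0.5, size=(hidden, 1))
    for _ in range(steps):
        H = np.tanh(X @ W1)            # hidden activations
        err = H @ W2 - y               # residual of the prediction
        # backprop of the mean-squared error through both layers
        dW2 = H.T @ err / len(X)
        dH = err @ W2.T * (1 - H ** 2)
        dW1 = X.T @ dH / len(X)
        W1 -= lr * dW1
        W2 -= lr * dW2
    return float(np.mean((np.tanh(X @ W1) @ W2 - y) ** 2))

final_losses = [train(seed) for seed in range(10)]
print("final losses:", np.round(final_losses, 4))
print("spread (max - min):", max(final_losses) - min(final_losses))
```

Obviously this tiny fully connected example is nowhere near a real CNN, so it only illustrates what the question is asking about, not the result I'd like a reference for.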