Kernel methods are interesting in that, given a small enough dataset, you can achieve interesting results. In the past I experimented with a convolutional architecture for an SVM, but ultimately abandoned it because the computational cost was immense. It worked, but it was painfully slow. Using GPUs and other tricks for improving training time only helps so much, since the training time and memory requirements appear to scale quadratically with the size of the training set. There may be efficient methods of winnowing out examples that don't require O(n^2) (or worse) time, but I moved on to unsupervised methods and stopped researching SVMs, so I'm not familiar with them if they exist.

My question is: could an unsupervised model not just learn the distribution of the inputs, but also be used to generate distinct (in a sense, 'conjugate') examples from the data distribution that could then serve as the basis for a kernel-based method of associating new data, e.g. clustering, classification, regression, etc.? Would it be better to use a latent representation and feed that into a kernel-based method, or is there any value in what I've described?

It seems to me that feeding the latent representation into a kernel method just shifts the problem from using the raw data to using another representation of the same samples. While you might end up with a terser representation, the computational burden isn't in the inner product of the samples with each other, but rather in the number of samples you need to evaluate to identify 'good' support vectors, and the memory requirements that go with that evaluation.

Brian
This question is marked "community wiki".
asked Dec 22 '10 at 15:49 Brian Vandenberg 
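[For concreteness, here is a minimal sketch (not from the original thread) of the "latent representation fed into a kernel method" option. PCA stands in for the unsupervised model, and the dataset, component count, and SVM hyperparameters are arbitrary placeholders.]

```python
# Hypothetical sketch: learn a latent representation, then feed it to a kernel SVM.
# PCA is a stand-in for the unsupervised model (an autoencoder/RBM encoder would do).
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Unsupervised step (latent representation) followed by a kernel method.
model = make_pipeline(PCA(n_components=16), SVC(kernel="rbf", C=1.0, gamma="scale"))
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```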
Interesting question, I've been thinking along these lines too. I think the goal of unsupervised learning is to discover better features in an absolute sense, not only with respect to some artificial set of labels. From these intrinsic features one should then be able to train a (possibly simpler) model in a supervised fashion, with respect to any reasonable set of labels for that dataset. This is much like kernel methods: first map the data into a higher-dimensional space, then implicitly train a linear model in that space. Distances in the intrinsic feature space should be more semantically meaningful, so I expect the resulting proximity matrix to be better suited for plugging into a kernel SVM. In terms of speed there's no gain, but there may be some gain in quality. Theory and intuition are one thing and practice another, though, and I can't support this with empirical evidence. But the discussion is interesting. I think Geoffrey Hinton discusses this at around 7:40 here.
answered Dec 22 '10 at 20:11 Oliver Mitevski
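[To illustrate the "proximity matrix from intrinsic features" idea, here is a sketch of my own (not Oliver's code): compute a Gram matrix on learned features and hand it to an SVM through scikit-learn's precomputed-kernel interface. The identity `encode` function and the RBF proximity measure are placeholders.]

```python
# Hypothetical sketch: build a proximity (Gram) matrix in a learned feature space
# and plug it into a kernel SVM with a precomputed kernel.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def encode(X):
    # Placeholder for the unsupervised feature extractor (autoencoder, RBM, ...);
    # identity here purely for illustration.
    return X

Z_train, Z_test = encode(X_train), encode(X_test)

# Proximity matrices in the learned feature space.
K_train = rbf_kernel(Z_train, Z_train, gamma=0.01)
K_test = rbf_kernel(Z_test, Z_train, gamma=0.01)

svm = SVC(kernel="precomputed", C=1.0).fit(K_train, y_train)
print("test accuracy:", svm.score(K_test, y_test))
```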
This sounds like radial basis function neural networks. I like the treatment in Bishop's neural networks book. In short, there are many ways of doing this, most of them non-convex, but even something as simple as k-means for picking the centers works well enough in some cases.
This answer is marked "community wiki".
answered Dec 22 '10 at 23:02 Alexandre Passos ♦ 
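[A rough sketch of that idea (mine, not Alexandre's code): pick RBF centers with k-means, expand the data in those basis functions, and fit a linear model on top. The number of centers, the width parameter, and the dataset are placeholders.]

```python
# Hypothetical sketch of an RBF network: k-means picks the centers,
# an RBF expansion around those centers feeds a plain linear classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

centers = KMeans(n_clusters=50, random_state=0).fit(X_train).cluster_centers_

def rbf_features(X, centers, gamma=0.01):
    # phi_j(x) = exp(-gamma * ||x - c_j||^2)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

clf = LogisticRegression(max_iter=1000).fit(rbf_features(X_train, centers), y_train)
print("test accuracy:", clf.score(rbf_features(X_test, centers), y_test))
```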
An idea occurred to me tonight: with a DBN autoencoder, the detail necessary to reconstruct the original input exactly is likely lost, but the resulting code (hash) could be used to sample from the distribution of other samples that are semantically similar. The model could then be used to generate a much smaller set of artificial samples, drawn from either the data distribution, the autoencoder's code distribution, or perhaps the highest layer of the DBN that has not undergone any dimensionality reduction. Brian
This answer is marked "community wiki".
answered Dec 24 '10 at 01:27 Brian Vandenberg
But you don't need a DBN for this purpose. An RBM will do just fine for generating ("dreaming up") some good samples of the distribution. Or, if you use an autoencoder, use a shallow one, one or two layers at most.
(Dec 24 '10 at 06:10)
Oliver Mitevski
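[To make the "dreaming up samples from an RBM" suggestion concrete, here is a minimal sketch of my own using scikit-learn's BernoulliRBM and a few Gibbs steps. The binarization threshold, number of Gibbs steps, and hyperparameters are arbitrary placeholders.]

```python
# Hypothetical sketch: train a single RBM and "dream up" samples with Gibbs steps.
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)
X = (X / 16.0 > 0.5).astype(float)   # crude binarization for a Bernoulli RBM

rbm = BernoulliRBM(n_components=64, learning_rate=0.05, n_iter=20, random_state=0)
rbm.fit(X)

# Start chains from real data and run a few Gibbs steps to obtain fantasy samples.
rng = np.random.RandomState(0)
v = X[rng.choice(len(X), size=100, replace=False)]
for _ in range(50):
    v = rbm.gibbs(v)

fantasy = v   # 100 "dreamed" samples that could seed a much smaller kernel machine
print(fantasy.shape)
```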
The reason for the autoencoder is to implicitly bound the size of the kernel matrix. For example, if the autoencoder's code layer is limited to 16 bits and is kept somewhat sparse (say, around 8 bits on at a time on average), then I'd hope to get on the order of 12,870 support vectors (the number of 16-bit codes with exactly 8 bits on, i.e. C(16, 8) = 12,870). These would obviously have to be winnowed down from a larger sample, but hopefully a process like this could select a smaller sample set than the original set.
(Dec 24 '10 at 16:00)
Brian Vandenberg
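[Just to spell out where the 12,870 comes from: it is the count of 16-bit codes with exactly 8 active bits, which a one-liner confirms.]

```python
# Number of 16-bit codes with exactly 8 bits set: C(16, 8) = 12,870.
from math import comb
print(comb(16, 8))  # 12870
```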

I suppose there's also the option of using the weights themselves as support vectors, provided the inner product used in the kernel method is [isomorphic?] to the linear combination of weights and data used in the unsupervised method.
For example, if the unsupervised method were a convolutional RBM, you might use the filters it learns as support vectors, as long as the kernel method also uses a convolutional inner product.
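[A rough sketch of what "filters as support vectors with a convolutional inner product" might look like (my own illustration, with 1-D signals and made-up filters): each sample is scored against every learned filter by its maximum cross-correlation, and a linear model is fit on those responses.]

```python
# Hypothetical sketch: treat learned filters as the "support vectors" and use a
# convolutional inner product (max cross-correlation) as the similarity measure.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)

# Stand-ins: 200 random 1-D signals with two classes, and 8 "learned" filters.
X = rng.randn(200, 64)
y = (X[:, :32].sum(axis=1) > 0).astype(int)
filters = rng.randn(8, 9)          # in practice: filters from a convolutional RBM

def conv_similarity(x, f):
    # Maximum response of the filter over all shifts (a convolutional inner product).
    return np.correlate(x, f, mode="valid").max()

def features(X, filters):
    return np.array([[conv_similarity(x, f) for f in filters] for x in X])

clf = LogisticRegression(max_iter=1000).fit(features(X, filters), y)
print("train accuracy:", clf.score(features(X, filters), y))
```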
Can you clarify the convolutional SVM architecture you mentioned? In answer to your question, there are some specialized algorithms for ultra-fast SVM solving (particularly in the linear-kernel case). People try all sorts of wacky hacks to "kernelize" these things, many of which seem to involve applying KPCA to get a set of "basis kernels" that approximate the real kernel. Using a subset of the original data, chosen by clustering and picking exemplars, is also done. Neither of these approaches seems fully robust when dealing with very-large-scale data and sufficiently high-dimensional or complex kernels, but they can get the job done.
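[For reference, one common way to pair a fast linear solver with an approximate kernel, in the spirit of the "basis kernels" hack above though not necessarily the exact method being described, is the Nyström approximation in scikit-learn. The kernel, component count, and dataset here are placeholders.]

```python
# Hypothetical sketch: approximate an RBF kernel with a Nystroem feature map,
# then train a fast linear SVM on the approximate kernel features.
from sklearn.kernel_approximation import Nystroem
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.01, n_components=100, random_state=0),
    LinearSVC(C=1.0, max_iter=5000),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```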