Kernel methods are interesting in that, given a small enough data-set, you can achieve impressive results. In the past I experimented with a convolutional architecture for an SVM, but ultimately abandoned it because the computational cost was immense. It worked, but it was painfully slow.

Using GPUs and other tricks to improve training time only helps so much, since the training time and memory requirements appear to scale quadratically with the size of the training set. There may be efficient methods of winnowing out examples that don't require O(n^2) (or worse) time; but I moved on to unsupervised methods and stopped researching SVMs, so I'm not familiar with them if they exist.
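The quadratic scaling is easy to make concrete: a kernel machine over n training points needs an n x n Gram matrix, so both memory and pairwise kernel evaluations grow as O(n^2). A minimal sketch (illustrative only, not from the thread):

```python
import numpy as np

def rbf_gram(X, gamma=0.1):
    """Full RBF Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    # Pairwise squared distances via the expansion ||a-b||^2 = a.a + b.b - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

X = np.random.default_rng(0).normal(size=(1000, 32))
K = rbf_gram(X)
print(K.shape)  # (1000, 1000): storage and evaluation grow quadratically with n
```

Doubling the training set quadruples both the matrix storage and the number of kernel evaluations, which is the bottleneck the question is about.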

My question is: could an unsupervised model not just learn the distribution of inputs, but also be used to generate distinct -- in a sense, 'conjugate' -- examples from the data distribution that could serve as the basis for a kernel-based method of associating new data (e.g. clustering, classification, regression, etc.)?

Would it be better to use a latent representation and feed that into a kernel-based method, or would there be any value in what I've described?

It seems to me that feeding the latent representation into a kernel method just shifts the problem from using raw data to using another representation of the same samples. While you might end up with a terser representation, the computational burden isn't in the inner product of the samples with each other, but rather in the number of samples you need to evaluate to identify 'good' support vectors, and the memory requirements that go with this evaluation.
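A sketch of this point (PCA stands in here for an unsupervised encoder; any latent map would behave the same way): projecting to a latent space shrinks per-sample dimensionality, but the kernel machine still sees the same n samples, so the Gram matrix is unchanged in size.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))          # n = 500 raw samples, 100 dims

# Latent representation via PCA (top 10 components) as a placeholder encoder.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ Vt[:10].T                       # (500, 10) latent codes

# Either way the Gram matrix is n x n -- the sample count dominates.
K_raw = X @ X.T
K_latent = Z @ Z.T
print(K_raw.shape, K_latent.shape)       # (500, 500) (500, 500)
```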


This question is marked "community wiki".

asked Dec 22 '10 at 15:49


Brian Vandenberg

I suppose there's also the option of using the weights as support vectors, provided the inner product used in the kernel method is [isomorphic?] to the linear combination of weights and data used in the unsupervised method.

For example, if the unsupervised method were a convolutional RBM, you might use the filters it learns as support vectors as long as the kernel method also uses a convolutional inner product.

(Dec 22 '10 at 15:56) Brian Vandenberg
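A sketch of the filters-as-support-vectors idea, under the assumption that a "convolutional inner product" means the maximum cross-correlation of a learned filter slid across a signal. The filters here are random placeholders, not trained RBM filters:

```python
import numpy as np

def conv_kernel(signal, filt):
    """Max cross-correlation of `filt` slid across `signal` (1-D)."""
    n, m = len(signal), len(filt)
    scores = [signal[i:i + m] @ filt for i in range(n - m + 1)]
    return max(scores)

rng = np.random.default_rng(0)
filters = rng.normal(size=(4, 8))        # 4 learned filters acting as "support vectors"
x = rng.normal(size=64)                  # one input signal

# Kernel responses of x against each filter -- a 4-dimensional feature vector.
phi = np.array([conv_kernel(x, f) for f in filters])
print(phi.shape)                          # (4,)
```

The appeal is that the number of filters, not the number of training samples, sets the cost of evaluating a new input.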

Can you clarify the convolutional SVM architecture you mentioned? In answer to your question, there are some specialized algorithms for ultra-fast SVM solving (particularly in the linear kernel case). People try all sorts of wacky hacks to "kernelize" these things, many of which seem to involve applying KPCA to get a set of "basis kernels" that approximate the real kernel. Using a subset of the original data, chosen by clustering and picking exemplars, is also done. Neither of these approaches seems fully robust when dealing with very-large-scale data and sufficiently high-dimensional or complex kernels, but they can get the job done.

(Jun 05 '11 at 02:46) Jacob Jensen

3 Answers:

Interesting question; I've been thinking along these lines too. I think the goal of unsupervised learning is to discover better features in an absolute sense, not only with respect to some artificial set of labels. From these intrinsic features one should then be able to train a possibly simpler model in a supervised fashion, with respect to any set of labels that is reasonable for that dataset. This mirrors kernel methods: first map the data into a higher-dimensional space, then implicitly train a linear model in that space. Distances in the intrinsic feature space should be more semantic, so I expect the resulting proximity matrix to be better suited for plugging into a kernel SVM. In terms of speed there's no gain, but there may be quality gains. Theory and intuition are one thing and practice another, though, and I can't support this with empirical evidence. But the discussion is interesting. I think Geoffrey Hinton discusses this at around 7:40 here
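The pipeline described here can be sketched minimally: learn features without labels, then fit a simple linear model on top. The "encoder" below is a random projection plus ReLU, a placeholder for whatever unsupervised model produces the intrinsic features:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(X, W):
    """Stand-in unsupervised feature map (assumed, for illustration)."""
    return np.maximum(X @ W, 0.0)

# Toy two-class data.
X = rng.normal(size=(200, 20))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

W = rng.normal(size=(20, 50))            # placeholder "learned" feature weights
Phi = encode(X, W)                       # (200, 50) feature space

# Linear readout via least squares -- the supervised part stays simple,
# as in the kernel-method analogy: a linear model in the lifted space.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
acc = np.mean(((Phi @ w) > 0.5) == (y > 0.5))
print(Phi.shape)
```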

answered Dec 22 '10 at 20:11


Oliver Mitevski

edited Dec 23 '10 at 05:55

This sounds like radial basis function neural networks. I like the treatment in Bishop's neural networks book. In short, there are many ways of doing it, most of them nonconvex, but even something as simple as k-means works well enough in some cases.
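The RBF-network recipe mentioned above can be sketched like this: pick centers with k-means (a simple, nonconvex but workable choice), then fit a linear readout on the resulting radial basis features:

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Plain Lloyd's algorithm; returns k centers."""
    C = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                C[j] = pts.mean(0)
    return C

X = rng.normal(size=(300, 2))
y = (X ** 2).sum(1)                      # toy regression target

C = kmeans(X, k=10)
d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
Phi = np.exp(-d2)                        # RBF features, one per center
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(Phi.shape)                          # (300, 10)
```

The centers play the role the question assigns to generated examples: a small, fixed set of basis points rather than one per training sample.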

This answer is marked "community wiki".

answered Dec 22 '10 at 23:02


Alexandre Passos ♦

An idea occurred to me tonight: if a model were constructed as a DBN auto-encoder, the detail necessary to reconstruct the original input would likely be lost, but the resulting code could then be used to sample from the distribution of other, semantically similar samples. The model could then be used to generate a much smaller set of artificial samples from either the data distribution, the auto-encoder distribution, or perhaps the highest layer in the DBN that has not undergone any dimensionality reduction.
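The sample-reduction mechanism here can be sketched with a much simpler stand-in: hash samples to short binary codes (a random hyperplane hash below, substituting for a trained DBN auto-encoder) and keep one representative per code. An 8-bit code is used for illustration so the reduction is visible:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 32))

H = rng.normal(size=(32, 8))             # 8-bit hash (placeholder encoder)
codes = X @ H > 0                        # boolean (2000, 8) codes

# One exemplar per distinct code: at most 2^8 = 256 representatives.
seen, exemplars = set(), []
for i, c in enumerate(map(tuple, codes)):
    if c not in seen:
        seen.add(c)
        exemplars.append(i)

print(len(exemplars) <= 256)             # True: a much-reduced candidate set
```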


This answer is marked "community wiki".

answered Dec 24 '10 at 01:27


Brian Vandenberg

But you don't need a DBN for this purpose. An RBM will do just fine for generating -- "dreaming up" -- some good samples of the distribution. Or, if you use an auto-encoder, use a shallow one: one or two layers at most.

(Dec 24 '10 at 06:10) Oliver Mitevski

The reason for the auto-encoder is to implicitly bound the size of the kernel matrix. For example, if the auto-encoder layer is limited to 16 bits and is kept somewhat sparse (say, around 8 bits on at a time, on average), then I'd hope to get approximately 12,870 support vectors. These would obviously have to be winnowed down from a larger sample, but hopefully a process like this could select a smaller sample set than the original set.

(Dec 24 '10 at 16:00) Brian Vandenberg
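The 12,870 figure above is the number of 16-bit codes with exactly 8 bits set, i.e. the binomial coefficient C(16, 8):

```python
from math import comb

# Number of ways to choose which 8 of 16 code bits are on.
print(comb(16, 8))   # 12870
```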


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.