|
I am trying to do classification with a very small number of training examples (e.g. 10). I believe that naive Bayes is generally considered superior to MaxEnt with so few examples, but I may be wrong. In any case, I am considering this form of semi-supervised learning: k-NN over LDA topic distributions, with the LDA model fit on both the labeled and the unlabeled data. Does this make sense? What would be an appropriate distance measure between LDA topic distributions in this case?
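Concretely, here is the kind of pipeline I have in mind, as a rough sketch in scikit-learn. The toy corpus, the number of topics, k, and the cosine metric are just placeholders, not settled choices:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins for the real corpus: a handful of labeled documents and a
# (normally much larger) unlabeled pool.
labeled_docs = [
    "the team won the football match last night",
    "the goalkeeper made a great save in extra time",
    "the stock market fell sharply after the announcement",
    "investors are worried about rising interest rates",
]
labels = ["sports", "sports", "finance", "finance"]
unlabeled_docs = [
    "the striker scored twice in the second half",
    "the central bank raised rates again this quarter",
]

# Fit LDA on labeled + unlabeled documents, so the topic space is learned
# from all the data even though only a few documents carry labels.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(labeled_docs + unlabeled_docs)
lda = LatentDirichletAllocation(n_components=5, random_state=0)
lda.fit(counts)

# Represent the labeled documents by their topic distributions and classify
# new documents with k-NN in topic space (cosine distance as a first guess).
train_theta = lda.transform(vectorizer.transform(labeled_docs))
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(train_theta, labels)

new_theta = lda.transform(vectorizer.transform(["the match ended in a draw"]))
print(knn.predict(new_theta))
```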
|
I suggest taking a look at the giant paper on probability product kernels. You could then define a distance function based on such a kernel, or even use a kernel method instead of nearest neighbors.
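For discrete topic distributions, my understanding is that the probability product kernel reduces to a simple sum; rho = 0.5 gives the Bhattacharyya kernel, whose induced distance is (up to a constant factor) the Hellinger distance mentioned below. A sketch of the discrete case only:

```python
import numpy as np

def prob_product_kernel(p, q, rho=0.5):
    """Probability product kernel between two discrete distributions.

    rho = 0.5 is the Bhattacharyya kernel, rho = 1.0 the expected-likelihood
    kernel. (Discrete case only; the paper treats the general setting.)
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum((p ** rho) * (q ** rho)))

def kernel_distance(p, q, rho=0.5):
    # Distance induced by the kernel: d(p, q)^2 = K(p, p) - 2 K(p, q) + K(q, q).
    d2 = (prob_product_kernel(p, p, rho)
          - 2.0 * prob_product_kernel(p, q, rho)
          + prob_product_kernel(q, q, rho))
    return float(np.sqrt(max(d2, 0.0)))
```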
|
Ironically, I think you should avoid the information-theoretic measures (KL divergence, etc.), because they are far more sensitive to the tail of the distribution than to the head, which is presumably where the information you actually want to capture lies. I recall Dan Ramage saying informally that he had tried a few variations, and plain cosine similarity of the LDA topic distributions worked well in every case he tried; moreover, something like 0.8 * tf-idf cosine + 0.2 * LDA cosine worked even better. I haven't tried this myself on data this small, but I agree with your intuition about avoiding discriminative methods. I'm curious to try Ramage's distance measure. Hanna Wallach recommended Hellinger distance to me today; I'll try that too.
— Joseph Turian (Sep 15 '12 at 02:54)
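For reference, the measures mentioned in the comment above look roughly like this in code, assuming each document has both a tf-idf vector and an LDA topic distribution theta; the 0.8/0.2 weights are just the informally reported ones and should be treated as tunable:

```python
import numpy as np

def cosine_sim(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def hellinger(p, q):
    # Hellinger distance between two topic distributions.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

def blended_sim(tfidf_a, tfidf_b, theta_a, theta_b, w_tfidf=0.8, w_lda=0.2):
    # 0.8 * tf-idf cosine + 0.2 * LDA cosine, as reported above.
    return (w_tfidf * cosine_sim(tfidf_a, tfidf_b)
            + w_lda * cosine_sim(theta_a, theta_b))
```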
|
See also: http://metaoptimize.com/qa/questions/6550/document-clustering-based-on-lda-topic-modeling