I am trying to do classification with a very small number of training examples (e.g. 10). I believe that naive Bayes is generally considered superior to maxent with so few examples, but I may be wrong.

Anyway, I am considering this form of semi-supervised learning: k-NN over LDA topic distributions.

Does this make sense? What would be the appropriate distance measure between LDA topic distributions in this case?
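
Concretely, the pipeline I have in mind looks roughly like the sketch below (assuming scikit-learn; the toy documents, topic count, and choice of k are placeholders, not recommendations):

```python
# A minimal sketch of the proposed pipeline: fit LDA on all documents
# (labels not needed), then run k-NN on the labeled topic distributions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.neighbors import KNeighborsClassifier

# Toy placeholder data.
labeled_docs = ["the cat chased the mouse across the kitchen floor",
                "my dog barks at every cat in the neighborhood",
                "the stock market fell sharply on weak earnings",
                "investors bought bonds as interest rates dropped"]
labels = ["pets", "pets", "finance", "finance"]
unlabeled_docs = ["a kitten played with a ball of yarn",
                  "the central bank raised rates again"]

# Fit LDA on labeled + unlabeled documents; this step is where the
# unlabeled data helps.
vectorizer = CountVectorizer(stop_words="english")
X_all = vectorizer.fit_transform(labeled_docs + unlabeled_docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X_all)

# Represent the labeled documents by their topic distributions, then k-NN.
theta_labeled = lda.transform(vectorizer.transform(labeled_docs))
knn = KNeighborsClassifier(n_neighbors=1, metric="cosine", algorithm="brute")
knn.fit(theta_labeled, labels)

# Classify a new document via its topic distribution.
theta_new = lda.transform(vectorizer.transform(["the dog slept on the mat"]))
print(knn.predict(theta_new))
```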

asked Aug 03 '12 at 03:01


Joseph Turian ♦♦

edited Sep 13 '12 at 04:49

See also: http://metaoptimize.com/qa/questions/6550/document-clustering-based-on-lda-topic-modeling

(Sep 13 '12 at 04:49) Joseph Turian ♦♦

2 Answers:

I suggest taking a look at this giant paper on probability product kernels. Then you can think of defining a distance function based on such a kernel or even using a kernel method instead of nearest neighbors.
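
For topic distributions, a standard member of that family is the Bhattacharyya kernel (the rho = 1/2 case). A minimal sketch of such a kernel and the distance it induces, assuming plain NumPy arrays that sum to one:

```python
# A minimal sketch of a probability product kernel on discrete topic
# distributions; rho = 0.5 is the Bhattacharyya special case.
import numpy as np

def prob_product_kernel(p, q, rho=0.5):
    """k(p, q) = sum_i p_i^rho * q_i^rho."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum((p ** rho) * (q ** rho)))

def kernel_distance(p, q, rho=0.5):
    """Distance induced by the kernel: d(p, q)^2 = k(p,p) + k(q,q) - 2*k(p,q)."""
    d2 = (prob_product_kernel(p, p, rho) + prob_product_kernel(q, q, rho)
          - 2.0 * prob_product_kernel(p, q, rho))
    return float(np.sqrt(max(d2, 0.0)))

p = [0.7, 0.2, 0.1]
q = [0.1, 0.3, 0.6]
print(prob_product_kernel(p, q))  # Bhattacharyya coefficient
print(kernel_distance(p, q))      # usable directly as a k-NN distance;
                                  # proportional to Hellinger distance at rho = 0.5
```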

answered Aug 03 '12 at 19:18


gdahl ♦

edited Aug 03 '12 at 19:19

Ironically, I think you should avoid the information-theoretic measures (KL divergence, etc.) because they are far more sensitive to the tail of the distribution than to the head (which is, presumably, where the actual information you want to capture lies). I recall Dan Ramage saying informally that he had tried a few variations, and plain cosine similarity of LDA topic distributions worked well in all the cases he tried; moreover, something like 0.8 * tf-idf cosine + 0.2 * LDA cosine worked even better.

I haven't tried this myself on data this small, but I agree with your intuition about avoiding discriminative methods.
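
A minimal sketch of that blended measure, with the 0.8 / 0.2 weights reported above; the tf-idf and topic vectors here are toy placeholders and would come from your own vectorizer and LDA model:

```python
# Blended similarity: w_tfidf * cosine(tf-idf vectors) + w_lda * cosine(topic vectors).
import numpy as np

def cosine(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def blended_similarity(tfidf_a, tfidf_b, theta_a, theta_b,
                       w_tfidf=0.8, w_lda=0.2):
    return w_tfidf * cosine(tfidf_a, tfidf_b) + w_lda * cosine(theta_a, theta_b)

# Toy example: tf-idf vectors over a 5-term vocabulary, 3-topic LDA vectors.
sim = blended_similarity(tfidf_a=[0.0, 0.4, 0.0, 0.9, 0.1],
                         tfidf_b=[0.1, 0.5, 0.0, 0.8, 0.0],
                         theta_a=[0.7, 0.2, 0.1],
                         theta_b=[0.6, 0.3, 0.1])
print(sim)
```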

answered Aug 04 '12 at 13:42


Alexandre Passos ♦

I'm curious to try this distance measure from Ramage. Hanna Wallach recommended Hellinger distance to me today. I'll try that too.

(Sep 15 '12 at 02:54) Joseph Turian ♦♦
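
For reference, a minimal sketch of Hellinger distance between two topic distributions (plain NumPy; the distributions are assumed to sum to one):

```python
# Hellinger distance: H(p, q) = (1 / sqrt(2)) * ||sqrt(p) - sqrt(q)||_2, in [0, 1].
import numpy as np

def hellinger(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))

print(hellinger([0.7, 0.2, 0.1], [0.1, 0.3, 0.6]))
```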