I ran LDA on a set of documents using Mallet and obtained the topic proportions. My objective, however, is to cluster this set of documents by topic. Can I treat the topic proportions as a vector of extracted features for each document and run standard clustering tools, such as K-means, on these features? Thanks.

asked Jul 06 '11 at 14:12

huaiyanggongzi

See also: http://metaoptimize.com/qa/questions/10747/how-to-perform-k-nn-on-lda-topics

(Sep 13 '12 at 04:49) Joseph Turian ♦♦

3 Answers:

You can view the topic proportions as discrete probabilities of each document's membership in the topics (somewhat like a mixture model).

So these proportions already give you a clustering (of sorts) of the documents by topic.

Example:

  • Doc1: Science:0.6, History:0.2, Politics:0.1, Sports:0.1
  • Doc2: Science:0.3, History:0.5, Politics:0.1, Sports:0.1
  • Doc3: Science:0.8, History:0.1, Politics:0.05, Sports:0.05
  • Doc4: Science:0.2, History:0.2, Politics:0.4, Sports:0.2

It would be fair to assume that Docs 1 and 3 are mostly about science (and thus part of that cluster), while Docs 2 and 4 are part of the history and politics clusters, respectively.
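To make that reading concrete, here is a minimal Python sketch that assigns each document to its highest-proportion topic. The numbers are the ones from the example; the names `doc_topics` and `dominant_topic` are made up for illustration and are not part of Mallet.

```python
# Topic proportions copied from the example above.
doc_topics = {
    "Doc1": {"Science": 0.6, "History": 0.2, "Politics": 0.1, "Sports": 0.1},
    "Doc2": {"Science": 0.3, "History": 0.5, "Politics": 0.1, "Sports": 0.1},
    "Doc3": {"Science": 0.8, "History": 0.1, "Politics": 0.05, "Sports": 0.05},
    "Doc4": {"Science": 0.2, "History": 0.2, "Politics": 0.4, "Sports": 0.2},
}


def dominant_topic(proportions):
    """Return the topic with the largest proportion for one document."""
    return max(proportions, key=proportions.get)


# Group documents by their dominant topic (a hard assignment).
clusters = {}
for doc, proportions in doc_topics.items():
    clusters.setdefault(dominant_topic(proportions), []).append(doc)

print(clusters)
# {'Science': ['Doc1', 'Doc3'], 'History': ['Doc2'], 'Politics': ['Doc4']}
```

Note that this hard assignment throws away the rest of the distribution: Doc4 is only 40% politics, so the label is a rough summary at best.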

Try looking into mixture models and how they relate to LDA

answered Jul 06 '11 at 18:21

Leon Palafox ♦

I would like to know whether the topic distribution that LDA produces for each document can be treated as a vector representing that document, and whether a standard clustering algorithm such as k-means can then be used to cluster the documents in the corpus.

Can you give me a direct answer, yes or no? Thanks a lot!

answered Jul 09 '11 at 01:56

joylin

I think my answer was pretty straightforward: LDA as defined by Blei is a mixture over topics, which basically means it is a clustering algorithm. Running K-means or some other algorithm would add computational load to something that has already solved your problem.

(Jul 09 '11 at 10:59) Leon Palafox ♦

Thanks for Leon's good answer. I want to know how LDA clusters the documents in a corpus. Maybe it is a foolish question, but it has confused me for a long time. Could you give me some advice on learning LDA? I would appreciate it very much.

(Jul 10 '11 at 04:22) joylin

Yes, you can. And Leon is absolutely correct. However, to clarify the Q/A a little, the additional computational burden of a secondary analysis is sometimes offset by the additional insight into the data. For example, cyber security folks use a type of online LDA augmented with additional clustering, which works much better than LDA alone.
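For anyone who wants to try the secondary clustering, here is a rough sketch of k-means run directly on the topic-proportion vectors. It is a plain-Python toy version written for illustration only (the `kmeans` helper, the random initialization, and the choice of k=2 are my assumptions); in practice you would probably load the per-document proportions from Mallet's output and use a library implementation. The vectors reuse the four example documents from the earlier answer, in the order Science, History, Politics, Sports.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=100, seed=0):
    """Toy Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: initialize centroids from k randomly chosen documents.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [-1] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Topic proportions per document: Science, History, Politics, Sports.
doc_vectors = [
    [0.6, 0.2, 0.1, 0.1],    # Doc1
    [0.3, 0.5, 0.1, 0.1],    # Doc2
    [0.8, 0.1, 0.05, 0.05],  # Doc3
    [0.2, 0.2, 0.4, 0.2],    # Doc4
]

labels, centroids = kmeans(doc_vectors, k=2)
for i, label in enumerate(labels, start=1):
    print(f"Doc{i} -> cluster {label}")
```

Unlike picking the single largest topic, this groups documents by their whole topic mix, which is where the extra insight can come from. Euclidean distance is just the simplest choice here; since the vectors are probability distributions, other distance measures may suit them better.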

answered Jul 09 '11 at 11:50

Aengus Robinson

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.