|
I ran LDA on a set of documents using Mallet and obtained the topic proportions. My objective, however, is to cluster these documents by topic. Can I treat the topic proportions as a vector of extracted features for each document and run a standard clustering tool, such as K-means, on these features? Thanks. |
|
You can view the topic proportions as discrete probabilities of each document's membership in the topics (kind of like a mixture model). So these proportions already give you a clustering (kind of) of the documents by topic. Example:
It would be fair to assume that Docs 1 and 3 are mostly about science (and thus part of that cluster), and that Docs 2 and 4 are part of the history and politics clusters, respectively. Try looking into mixture models and how they relate to LDA. |
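As a rough sketch of the idea above, each document can be assigned to the topic with its largest proportion. The numbers and the names `doc_topics` and `topic_names` are made up for illustration (they are not the values from the original example, and not Mallet output); they are only chosen to match the description of Docs 1 to 4.

```python
# Hypothetical topic proportions: one row per document, one column per topic.
# These values are invented for illustration only.
topic_names = ["science", "history", "politics"]
doc_topics = {
    "doc1": [0.80, 0.15, 0.05],
    "doc2": [0.10, 0.75, 0.15],
    "doc3": [0.70, 0.10, 0.20],
    "doc4": [0.05, 0.25, 0.70],
}


def dominant_topic(proportions):
    """Return the index of the topic with the largest proportion."""
    return max(range(len(proportions)), key=lambda k: proportions[k])


# Group documents by their dominant topic (a "hard" clustering).
clusters = {}
for doc, proportions in doc_topics.items():
    topic = topic_names[dominant_topic(proportions)]
    clusters.setdefault(topic, []).append(doc)

print(clusters)
```

Taking the largest proportion throws away the mixed-membership information, so a document split almost evenly between two topics is still forced into one cluster.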
|
I want to know whether the topic distribution that LDA produces for each document can be treated as a vector representing the document, and whether we can use a normal clustering algorithm such as k-means to cluster the documents of the corpus. Can you give me a direct answer: yes or no? Thanks a lot!
I think my answer was pretty straightforward: LDA as defined by Blei is a mixture over topics, which basically means it is a clustering algorithm. Running K-means or some other algorithm would add computational load to something that has already solved your problem.
(Jul 09 '11 at 10:59)
Leon Palafox ♦
Thanks for Leon's good answer. I want to know how LDA clusters the documents in a corpus. Maybe it is a foolish question, but it has confused me for a long time. Could you also give me some advice on learning LDA? I would appreciate it very much.
(Jul 10 '11 at 04:22)
joylin
|
|
Yes, you can. And Leon is absolutely correct. However, to clarify the Q/A a little: the additional computational burden of a secondary analysis is sometimes offset by the additional insight it gives into the data. For example, cyber security folks use a type of online LDA augmented with additional clustering, which works much better than LDA alone. |
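For anyone who wants to try the secondary clustering, here is a minimal sketch of K-means run directly on per-document topic-proportion vectors. It uses plain Python and a simple farthest-first initialisation so the run is deterministic; that initialisation is a choice made for this sketch, not something from Mallet or from the posts above. The vectors are made-up values, not real LDA output.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def farthest_first_init(vectors, k):
    """Start from the first vector, then repeatedly add the vector that is
    farthest from all centroids chosen so far (deterministic)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(vectors, k, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid update
    until the labels stop changing."""
    centroids = farthest_first_init(vectors, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return labels, centroids


# Hypothetical topic proportions for six documents over three topics.
doc_vectors = [
    [0.80, 0.15, 0.05],
    [0.10, 0.75, 0.15],
    [0.70, 0.10, 0.20],
    [0.05, 0.25, 0.70],
    [0.15, 0.70, 0.15],
    [0.10, 0.20, 0.70],
]

labels, centroids = kmeans(doc_vectors, k=3)
print(labels)
```

Euclidean distance is used here only because it is what plain K-means assumes. Since the vectors are probability distributions, a divergence-based distance may be a better fit, and in practice a library implementation with several restarts is the more robust choice.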
See also: http://metaoptimize.com/qa/questions/10747/how-to-perform-k-nn-on-lda-topics