(LDA = Latent Dirichlet allocation) It is straightforward to understand how to use SVD and random projection. Both create a lower-dimensional representation of each document that approximately preserves the cosine distance between documents, i.e. cos(Original[i], Original[j]) ~= cos(SVD_OR_RI_TRANSFORMED[i], SVD_OR_RI_TRANSFORMED[j]), and it is easy to see why: it is all linear algebra. But how do I get the same kind of approximation for LDA, where the result is obtained by probabilistic inference? Is it possible to use LDA to approximate the cosine? If not, what distance metric does LDA approximate? In short: what can I expect from cos(LDA_TOPICS[i], LDA_TOPICS[j]) relative to the original TF-IDF vectors?
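For concreteness, here is a minimal sketch (not from the original post) of the property described above, using scikit-learn's TruncatedSVD and GaussianRandomProjection. The matrix X is a made-up stand-in for a TF-IDF matrix, and the dimensions are chosen only for illustration:

```python
# Sketch: cosine similarities between documents are roughly preserved by
# truncated SVD and by random projection. X stands in for a TF-IDF
# (documents x terms) matrix; in practice it would come from something
# like TfidfVectorizer.
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.random_projection import GaussianRandomProjection
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((300, 2000))  # stand-in for a TF-IDF matrix

svd_docs = TruncatedSVD(n_components=100).fit_transform(X)
rp_docs = GaussianRandomProjection(n_components=100, random_state=0).fit_transform(X)

i, j = 0, 1
print(cosine_similarity(X[i:i+1], X[j:j+1]))                # cos(Original[i], Original[j])
print(cosine_similarity(svd_docs[i:i+1], svd_docs[j:j+1]))  # approximately the same after SVD
print(cosine_similarity(rp_docs[i:i+1], rp_docs[j:j+1]))    # approximately the same after RP
```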
LDA is not necessarily about linear algebra, nor does it try to directly approximate the cosine or anything like it. You can get a similarity function by multiplying the document-specific topic probability distributions for two documents as if they were vectors, and this is generally well-behaved.
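As a hedged illustration of that similarity function, the sketch below treats each document's inferred topic distribution as a plain vector and takes the dot product. The names theta_i and theta_j are assumptions: 1-D probability vectors obtained from whatever LDA inference procedure you use (gensim, variational inference, Gibbs sampling, etc.).

```python
import numpy as np

def topic_dot_similarity(theta_i, theta_j):
    """Similarity of two documents as the dot product of their
    document-topic probability distributions (both assumed to sum to 1)."""
    return float(np.dot(theta_i, theta_j))
```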
So there is no relation between cos(Original_TFIDF[i], Original_TFIDF[j]) and cos(LDA_TOPICS[i], LDA_TOPICS[j]), yet both work well?
There is no clear relation: because LDA tries to represent each document with a sparse mixture of topics, documents with a high cosine similarity can end up with a low LDA dot product, since the same words can be explained by different topics. LDA is not directly concerned with multiplying these document-topic representations, hence the lack of guarantees. However, the feature encoding provided by the LDA representation often turns out to be useful.
I think there's still an interesting question here, specifically: What is an appropriate distance measure in the LDA space?
Joseph Turian: I find that an interesting question for which I don't really have an answer. For example, if you have two similar topics (in the sense that they both assign high probability to some of the same words), the inference process will almost never assign both of them to the same document, because it can almost always find a sparser (if worse) solution by using only one of them (an example of such a pair is the neuroscience and neural-network topics in the NIPS data). So topics you would expect to be very similar (high dot product) turn out to be ones that never co-occur in the document-topic vectors, and vice versa.
Hanna Wallach (p.c.) recommended that I use the Hellinger distance to measure distances in topic space.
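A minimal sketch of the Hellinger distance on document-topic distributions, under the assumption that p and q are 1-D arrays of topic probabilities (non-negative, summing to one):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance: (1/sqrt(2)) * ||sqrt(p) - sqrt(q)||_2.
    Ranges from 0 (identical distributions) to 1 (disjoint support)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(0.5) * np.linalg.norm(np.sqrt(p) - np.sqrt(q)))
```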