Has anyone taken a corpus: 1) applied vanilla LDA, then 2) used the resulting phi and theta to generate an artificial corpus (assuming consistent alpha, beta, and vocabulary), 3) applied LDA to the generated corpus, and 4) compared the (statistical) similarity of the generated corpus against the original corpus? If someone has a link to something related to this, I'd appreciate a pointer. [FWIW - I've done it and the results are less than satisfying in terms of the similarity metric. I'm curious whether I'm doing something incorrect in implementing the generative model or whether there is a problem with my similarity measure(s). I'm hoping I've made a mistake somewhere.]
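For concreteness, the generation step in (2) looks roughly like this. This is only a minimal sketch: I'm assuming phi is a topics-by-vocabulary matrix and theta is a documents-by-topics matrix, and the document lengths are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_corpus(phi, theta, doc_lengths):
    """Sample an artificial corpus from fitted LDA parameters.

    phi         : (K, V) topic-word probabilities, rows sum to 1
    theta       : (D, K) document-topic proportions, rows sum to 1
    doc_lengths : number of tokens to draw for each generated document
    """
    K, V = phi.shape
    generated_docs = []
    for d, n_tokens in enumerate(doc_lengths):
        # For each token: draw a topic z from theta[d], then a word from phi[z].
        topics = rng.choice(K, size=n_tokens, p=theta[d])
        words = [rng.choice(V, p=phi[z]) for z in topics]
        generated_docs.append(words)
    return generated_docs
```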
I think you should see http://www.mblondel.org/journal/2010/08/21/latent-dirichlet-allocation-in-python/ . It's different from what you are asking for, but it points in a useful direction. The paper described in that post basically chooses a set of topics (predefined word probabilities), mixes them, and generates documents. Personally, I both liked and disliked the idea: the corpus is generated by the exact reverse of what LDA does, so it's a bit of cheating when you learn it back. One should test it with a mixture from other distributions as well. Once the LDA is learnt, the artificial documents are pushed through it again and the topic distributions are recovered. Since we know what topics we used to make the artificial docs, their recovery offers some consolation.

Thanks. I had completely forgotten about this. I suppose it's possible that I have made a mistake with the generation, and I can double check it with this; a rough sketch of that check follows below.
(Jul 18 '11 at 14:38)
Aengus Robinson
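Concretely, the double check would be something like the following. This is only a rough sketch using scikit-learn's LatentDirichletAllocation; `generated_docs`, `vocab_size`, and `n_topics` are placeholders for whatever the generation step above produced, and the recovered topics may come back permuted relative to the original theta, so a matching step would be needed before comparing them.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def docs_to_counts(docs, vocab_size):
    """Convert a corpus of word-id lists into a document-term count matrix."""
    X = np.zeros((len(docs), vocab_size), dtype=int)
    for d, doc in enumerate(docs):
        for w in doc:
            X[d, w] += 1
    return X

# Count matrix for the artificial corpus generated from the fitted phi/theta.
X_gen = docs_to_counts(generated_docs, vocab_size)

# Re-fit LDA on the artificial corpus and recover per-document topic mixtures.
lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
theta_recovered = lda.fit_transform(X_gen)

# theta_recovered can now be compared against the theta used for generation,
# after matching up the (possibly permuted) topics.
```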
I'm not sure why you would want to do this, but generating data from generative models is actually quite common. LDA should fit all the statistics that it incorporates, so if you measure word similarity you'll get high numbers. However, the output will be absolute nonsense. What similarity measure are you using?
If my tokens were words, I could understand how you might expect the results to be nonsense; however, my corpus is not text. Along with a couple of other measures, I've tried the Pearson correlation and the Spearman rank correlation coefficient, both pretty common similarity measures.
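For what it's worth, the kind of comparison I've been making looks roughly like this (a sketch only; comparing overall token-frequency vectors is just one of the measures I tried, and `original_docs`, `generated_docs`, and `vocab_size` are placeholders):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def token_frequencies(docs, vocab_size):
    """Overall token-frequency vector for a corpus of word-id lists."""
    counts = np.zeros(vocab_size)
    for doc in docs:
        for w in doc:
            counts[w] += 1
    return counts / counts.sum()

# Frequency profiles of the original corpus and the LDA-generated one.
freq_orig = token_frequencies(original_docs, vocab_size)
freq_gen = token_frequencies(generated_docs, vocab_size)

r, _ = pearsonr(freq_orig, freq_gen)
rho, _ = spearmanr(freq_orig, freq_gen)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```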