Has anyone taken a corpus, 1) applied vanilla LDA, 2) used the resulting phi and theta to generate an artificial corpus (assuming a consistent alpha, beta, and vocabulary), 3) applied LDA to the generated corpus, and 4) compared the (statistical) similarity of the generated corpus against the original corpus?

If someone has a link to something related to this I'd appreciate a pointer.

[FWIW - I've done it and the results are less than satisfying in terms of the similarity metric. I'm curious whether I'm doing something incorrect in implementing the generative model or whether there is a problem with my similarity measure(s). I'm hoping I've made a mistake somewhere.]
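
To make step 2 concrete, here is roughly what I mean by generating from phi and theta. This is only a sketch in plain numpy; `generate_corpus` and the matrix layouts are my own conventions, not anything standard:

    import numpy as np

    rng = np.random.default_rng(0)

    def generate_corpus(phi, theta, doc_lengths):
        """Sample one artificial document per row of theta.

        phi         : (K, V) topic-word probabilities, rows sum to 1
        theta       : (D, K) document-topic proportions, rows sum to 1
        doc_lengths : length-D list of token counts (e.g. the original lengths)
        """
        D, K = theta.shape
        V = phi.shape[1]
        corpus = []
        for d in range(D):
            # Draw a topic for each token, then a word from that topic.
            topics = rng.choice(K, size=doc_lengths[d], p=theta[d])
            words = [rng.choice(V, p=phi[z]) for z in topics]
            corpus.append(words)
        return corpus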

asked Jul 13 '11 at 18:28

Aengus Robinson

I'm not sure why you would want to do this, but generating data from generative models is actually pretty common. LDA should fit all the statistics it incorporates, so if you measure word-level similarity you'll get high numbers; however, the output will read as absolute nonsense. What similarity measure are you using?

(Jul 13 '11 at 19:43) Jacob Jensen

If my tokens were words, I can understand how you might expect the results to be nonsense. However, my corpus is not text. Along with a couple of others, I've tried the Pearson and Spearman rank correlation coefficients, both pretty common similarity measures.
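
By "similarity" I mean something along these lines: a rough sketch using scipy's pearsonr/spearmanr on overall token-frequency vectors. The `token_frequencies` helper and the integer-token-id corpus layout are just assumptions for illustration:

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    def token_frequencies(corpus, vocab_size):
        """corpus: list of documents, each a list of integer token ids."""
        counts = np.zeros(vocab_size)
        for doc in corpus:
            for w in doc:
                counts[w] += 1
        return counts / counts.sum()

    def corpus_similarity(original, generated, vocab_size):
        p = token_frequencies(original, vocab_size)
        q = token_frequencies(generated, vocab_size)
        return pearsonr(p, q)[0], spearmanr(p, q)[0]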

(Jul 14 '11 at 14:25) Aengus Robinson

One Answer:

I think you should see http://www.mblondel.org/journal/2010/08/21/latent-dirichlet-allocation-in-python/. It's different from what you are asking for, but it points in a useful direction. The paper that the post describes basically chooses a set of topics (i.e., predefined word probability distributions), mixes them, and generates documents. Personally, I both liked and disliked the idea: the corpus is generated by the exact reverse of what LDA does, so it's a bit of cheating when you learn it back. One should test it with a mixture from other distributions as well.

Once the LDA is learnt, the artificial documents are pushed through it again and the topic distributions are recovered. Since we know which topics were used to make the artificial docs, their recovery offers some consolation.
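
A rough sketch of that setup (the commented `fit_lda` call at the end stands in for whatever LDA implementation you use; vocabulary size, topic count, and the Dirichlet parameters are made-up values):

    import numpy as np

    rng = np.random.default_rng(1)

    V, K, D, doc_len = 1000, 5, 200, 100
    alpha = 0.1

    # Predefined "true" topics: each row is a word distribution over the vocab.
    true_phi = rng.dirichlet(np.ones(V) * 0.01, size=K)

    docs, true_theta = [], []
    for _ in range(D):
        theta_d = rng.dirichlet(np.ones(K) * alpha)    # mix the topics
        z = rng.choice(K, size=doc_len, p=theta_d)     # topic per token
        w = [rng.choice(V, p=true_phi[k]) for k in z]  # word per token
        docs.append(w)
        true_theta.append(theta_d)

    # learned_phi = fit_lda(docs, num_topics=K)   # your LDA of choice
    # Learned topics come back in arbitrary order, so match each one to its
    # closest true topic (e.g. by symmetric KL or cosine) before comparing.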

answered Jul 15 '11 at 05:45

kpx

Thanks. I had completely forgotten about this. I suppose it's possible that I have made a mistake in the generation, and I can double-check it against this.

(Jul 18 '11 at 14:38) Aengus Robinson