I am using the MALLET toolkit to do LDA-based topic modelling on my corpus. After many hours of computation, it comes up with a surprisingly good set of 32 topics. One of the output files maps each token to a topic number. (Remember that the mapping is per token, not per type.)

What I really want, though, is a topic per paragraph. I can think of many different ways to get there, so I'd love to hear which way is best.

For example, I could simply create a vector of length 32, where entry i is the number of words in the paragraph assigned to topic i. But should it be length sensitive? Maybe entry i should be the percentage of words in that category... Or maybe compromise: use log(n).
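To make the three weightings concrete, here is a minimal sketch. It assumes the per-token topic ids for one paragraph have already been read from the MALLET output into a plain list of integers; the function name `topic_vector` and the `mode` argument are made up for illustration, and log(1 + n) stands in for log(n) so that topics with zero words stay defined.

```python
import math
from collections import Counter

NUM_TOPICS = 32  # the number of topics mentioned in the question


def topic_vector(token_topics, num_topics=NUM_TOPICS, mode="count"):
    """Build a per-paragraph topic vector from per-token topic ids.

    token_topics: list of ints in [0, num_topics), one per token.
    mode: "count" (raw counts), "proportion" (length-normalised),
          or "log" (log(1 + count), an assumed stand-in for log(n)).
    """
    counts = Counter(token_topics)
    vec = [float(counts.get(k, 0)) for k in range(num_topics)]
    if mode == "count":
        return vec
    if mode == "proportion":
        total = sum(vec)
        # An empty paragraph keeps an all-zero vector.
        return [v / total for v in vec] if total else vec
    if mode == "log":
        return [math.log1p(v) for v in vec]
    raise ValueError("unknown mode: %r" % mode)
```

The proportion form makes paragraphs of different lengths comparable, while the log form only dampens the effect of length rather than removing it.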

Alternatively, maybe the key is to do some smoothing, so that each sentence is assigned a single topic, and then create a vector like the one above, but where entry i reflects the number of sentences instead of words. (Of course, that leaves open how to assign topics to sentences... Simple majority?)
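A rough sketch of the simple-majority variant, assuming each paragraph is already split into sentences and each sentence is a list of per-token topic ids. The helper names are hypothetical, and breaking ties in favour of the lowest topic id is an arbitrary choice made here, not something the question specifies.

```python
from collections import Counter


def majority_topic(token_topics):
    """Most frequent topic id in one sentence; ties go to the lowest id."""
    counts = Counter(token_topics)
    best = max(counts.values())
    return min(t for t, c in counts.items() if c == best)


def sentence_level_vector(sentences, num_topics=32):
    """Entry i counts the sentences whose majority topic is i."""
    vec = [0] * num_topics
    for sent in sentences:
        if sent:  # skip sentences with no tokens
            vec[majority_topic(sent)] += 1
    return vec
```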

Finally, once I have these vectors per paragraph, how do I get a unique topic? Maybe do clustering on all the paragraph vectors, so that you create a new hierarchical topic map on top of what LDA found?
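Before reaching for clustering, the simplest baseline for a unique topic is to take the largest entry of the paragraph vector. The sketch below does only that; it is not the clustering idea, and the function name and the `min_share` threshold are assumptions added for illustration.

```python
def dominant_topic(vec, min_share=0.0):
    """Index of the largest entry, or None if the paragraph is empty
    or the winning topic covers less than min_share of the total mass.
    Ties go to the lowest topic index."""
    total = sum(vec)
    if total == 0:
        return None
    best = max(range(len(vec)), key=lambda k: vec[k])
    return best if vec[best] / total >= min_share else None
```

Setting `min_share` above zero leaves mixed paragraphs unlabelled instead of forcing a topic onto them.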

Hopefully you see my problem. Thank you for your time.

asked Aug 31 '10 at 15:01


ian haking


One Answer:

These are very simple and interesting variations of the LDA algorithm. You can have a topic distribution theta per document, draw a single topic for each paragraph from theta, and then generate every word in the paragraph from that topic. If you look at sentences, you already have something very similar in Barzilay and Lee's content models, which use one topic per sentence, with the topics following each other in a hidden-Markov-model sense. I have code for that here if you want, and you can very easily adapt it to the case of one topic per paragraph with a document-specific topic distribution (which resembles LDA a bit more).
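Written out, the one-topic-per-paragraph model could look like the following. The symmetric Dirichlet priors and the symbols alpha, beta and phi are assumptions borrowed from the usual LDA setup, not something stated above.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta) && \text{for each topic } k = 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha) && \text{for each document } d \\
z_{d,p} &\sim \mathrm{Categorical}(\theta_d) && \text{for each paragraph } p \text{ in } d \\
w_{d,p,n} &\sim \mathrm{Categorical}(\phi_{z_{d,p}}) && \text{for each word } n \text{ in paragraph } p
\end{align*}
```

The only difference from LDA is that the topic indicator z is drawn once per paragraph rather than once per word.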

If you want the technical details, the model becomes something between LDA and Bayesian naive Bayes, and the sampler (assuming you're using Gibbs sampling for inference) has to adapt accordingly. Bob Carpenter describes very well how to sample both naive Bayes and LDA here, and you can probably see how that's relevant from the code linked above.
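For illustration only, here is a small collapsed Gibbs sampler for the one-topic-per-paragraph model. It is a sketch written for this thread, not the code linked above and not Bob Carpenter's derivation. It assumes symmetric Dirichlet priors with hypothetical hyperparameters `alpha` and `beta`, and it expects each document as a list of paragraphs, each paragraph being a list of integer word ids.

```python
import math
import random
from collections import Counter


def sample_paragraph_topics(docs, num_topics, vocab_size,
                            alpha=0.1, beta=0.01, iters=50, seed=0):
    """Return z, where z[d][p] is the sampled topic of paragraph p in doc d."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # paragraphs of doc d in topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # tokens of word w in topic k
    topic_total = [0] * num_topics                              # all tokens in topic k

    def shift(d, k, para, sign):
        """Add (sign=+1) or remove (sign=-1) one paragraph from the counts."""
        doc_topic[d][k] += sign
        topic_total[k] += sign * len(para)
        for w in para:
            topic_word[k][w] += sign

    # Random initial assignment.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for p, para in enumerate(doc):
            shift(d, z[d][p], para, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for p, para in enumerate(doc):
                shift(d, z[d][p], para, -1)  # hold this paragraph out
                word_counts = Counter(para)
                log_w = []
                for k in range(num_topics):
                    # LDA-like part: how often doc d already uses topic k.
                    lp = math.log(doc_topic[d][k] + alpha)
                    # Naive-Bayes-like part: Dirichlet-multinomial
                    # likelihood of the whole paragraph under topic k.
                    for w, c in word_counts.items():
                        for i in range(c):
                            lp += math.log(topic_word[k][w] + beta + i)
                    for j in range(len(para)):
                        lp -= math.log(topic_total[k] + vocab_size * beta + j)
                    log_w.append(lp)
                m = max(log_w)  # subtract the max before exponentiating
                weights = [math.exp(x - m) for x in log_w]
                z[d][p] = rng.choices(range(num_topics), weights=weights)[0]
                shift(d, z[d][p], para, +1)
    return z
```

The document-topic term is what the sampler shares with LDA, and the whole-paragraph likelihood is what it shares with naive Bayes, which is one way to read "something between" the two.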

answered Aug 31 '10 at 21:09


Alexandre Passos ♦

edited Aug 31 '10 at 21:11


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.