LDA is nice but unpredictable: it does not always give me the topics I want.

I am looking for something like LDA, but semi-supervised. That is, I would pick seed words for each topic and then run a system that finds words related to those seed words, then words related to those words, and so on. The result would be topics that are meaningful for my task, and I would know which topic covers which set of ideas without checking manually.

Is there something like this out there?

asked Jul 05 '10 at 03:03

Aditya Mukherji

3 Answers:

Two models I've worked on might be relevant:

  1. Topic-in-Set Knowledge lets you do "semi-supervised" LDA by forcing (or merely encouraging) seed words to be assigned to chosen topics (or subsets of topics).
  2. The Dirichlet Forest prior lets you "must-link" a set of seed words together so that their probabilities within each topic are encouraged to be similar (i.e., Must-Link(w1, w2) implies P(w1 | z) ≈ P(w2 | z) for all z).
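
To make the first idea concrete, here is a minimal sketch of a collapsed Gibbs sampler for LDA in which each seed word may only be assigned to topics from an allowed set. This is the hard-constraint ("forcing") variant only, written for illustration; it is not the authors' implementation, and the function name `seeded_lda`, its parameters, and the toy corpus are all assumptions of this sketch.

```python
import random
from collections import defaultdict

def seeded_lda(docs, num_topics, seeds, alpha=0.1, beta=0.01, iters=100, rng_seed=0):
    """Collapsed Gibbs sampling for LDA with hard seed-word constraints.

    docs:  list of token lists.
    seeds: dict mapping a seed word to the set of topic ids it may take.
    Words that are not seeds may take any topic.
    """
    rng = random.Random(rng_seed)
    vocab_size = len({w for doc in docs for w in doc})
    all_topics = list(range(num_topics))

    def allowed(word):
        return sorted(seeds[word]) if word in seeds else all_topics

    doc_topic = [[0] * num_topics for _ in docs]        # n(d, k)
    topic_word = [defaultdict(int) for _ in all_topics]  # n(k, w)
    topic_total = [0] * num_topics                       # n(k)

    def update(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] += delta
        topic_total[k] += delta

    # Random initialisation that already respects the constraints.
    assign = []
    for d, doc in enumerate(docs):
        assign.append([rng.choice(allowed(w)) for w in doc])
        for w, k in zip(doc, assign[d]):
            update(d, w, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, assign[d][i], -1)
                topics = allowed(w)  # the only change from plain LDA
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in topics
                ]
                assign[d][i] = rng.choices(topics, weights=weights)[0]
                update(d, w, assign[d][i], +1)
    return assign, topic_word

# Toy usage: pin "goal" to topic 0 and "vote" to topic 1.
docs = [
    ["goal", "match", "team", "score", "goal"],
    ["vote", "party", "election", "vote", "poll"],
    ["team", "score", "match", "goal"],
    ["election", "poll", "party", "vote"],
]
seeds = {"goal": {0}, "vote": {1}}
assign, topic_word = seeded_lda(docs, num_topics=2, seeds=seeds)
```

Because seed words can never leave their allowed topics, you know in advance which topic index corresponds to which concept. The "merely encouraging" variant would instead down-weight the disallowed topics rather than excluding them.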

If you want topics to be grounded to specific concepts you might also be interested in the Concept-Topic Model work by Chemudugunta et al.

answered Jul 05 '10 at 16:16

David Andrzejewski

edited Jul 05 '10 at 16:21

David Blei's original implementation supports this in a roundabout way. Run it for one iteration, edit the output model files to give the seed words a higher probability in their intended topics, and then run it again starting from the edited model.
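
A rough sketch of the editing step, assuming the model file stores one topic per line as whitespace-separated log probabilities over the vocabulary (check this against the files your version actually writes). The helper names `parse_rows`, `boost_seed_words`, and `format_rows`, and the `boost` factor, are made up for this example.

```python
import math

def parse_rows(text):
    """Read one topic per line, each a list of log probabilities."""
    return [[float(x) for x in line.split()] for line in text.splitlines() if line.strip()]

def boost_seed_words(log_beta, seed_ids_by_topic, boost=10.0):
    """Multiply the seed words' probabilities by `boost` in their topics,
    then renormalise each topic so its probabilities sum to 1 again."""
    out = []
    for k, row in enumerate(log_beta):
        seed_ids = seed_ids_by_topic.get(k, ())
        bumped = [lp + math.log(boost) if i in seed_ids else lp
                  for i, lp in enumerate(row)]
        # log-sum-exp for a numerically stable normaliser
        m = max(bumped)
        log_norm = m + math.log(sum(math.exp(lp - m) for lp in bumped))
        out.append([lp - log_norm for lp in bumped])
    return out

def format_rows(rows):
    """Write the rows back in the same one-topic-per-line layout."""
    return "\n".join(" ".join(f"{x:.10f}" for x in row) for row in rows) + "\n"

# Toy usage: 2 topics over a 4-word vocabulary, both uniform to start with.
text = "\n".join(" ".join([str(math.log(0.25))] * 4) for _ in range(2))
edited = boost_seed_words(parse_rows(text), {0: {0}, 1: {3}})
new_text = format_rows(edited)
```

Nothing stops later iterations from drifting away from the boosted values, so this nudges the topics rather than constraining them.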

answered Jul 05 '10 at 07:39

Alexandre Passos ♦

One hacky solution would be to add a fake document for each topic that contains only that topic's seed words. If possible, adjust the document parameters to encourage a tighter topic distribution for those documents, so that the posterior places those words in a single topic.
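
Building the augmented corpus could look something like this. It is only a sketch: `add_seed_documents` and the `repeats` knob are hypothetical names, and repeating the seed words is just one crude way to give the fake documents more weight.

```python
def add_seed_documents(docs, seed_words_by_topic, repeats=5):
    """Return a new corpus with one synthetic document per topic,
    each made up solely of that topic's seed words (repeated)."""
    fake_docs = [list(words) * repeats for words in seed_words_by_topic]
    return list(docs) + fake_docs

# Toy usage with two intended topics.
corpus = [["goal", "team", "score"], ["vote", "party", "election"]]
seed_words_by_topic = [["goal", "match"], ["vote", "ballot"]]
augmented = add_seed_documents(corpus, seed_words_by_topic, repeats=3)
```

After training on `augmented`, the topic that dominates each fake document tells you which learned topic corresponds to which seed set. The per-document tightening mentioned above is not covered here, since it depends on what the LDA implementation exposes.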

answered Jul 05 '10 at 09:25

aria42


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.