
I have a large text corpus that I would like to tag with word clusters or topic tags. I don't really care which method is used, as long as the tags/clusters are learned without supervision and group semantically related words together. Hierarchical topics would also make sense.

What are the best toolkits that give me such tags? (This has to work on corpora in different languages, each handled separately.)

asked Jul 26 '10 at 23:03

Frank


3 Answers:

MALLET is a good first thing to try. Another good option is the Stanford Topic Modeling Toolbox. Both tools will help you extract keywords and explore the topics. If you want to do more of the work yourself, any LDA implementation will do, and you will find plenty of them around the web.
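For anyone who does want to do more of the work themselves, here is a minimal sketch of LDA trained with collapsed Gibbs sampling, using only the Python standard library. It is not how MALLET or the Stanford toolbox implement it; the function names (`lda_gibbs`, `top_words`), the hyperparameter defaults, and the toy corpus are all made up for illustration, and it is far too slow for a large corpus.

```python
import random
from collections import Counter


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over a list of tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics

    # Start from a random topic assignment for every token.
    assign = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(num_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the token out of the counts before resampling it.
                z = assign[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1

                # p(z = k | everything else) is proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V * beta)
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]

                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    return doc_topic, topic_word


def top_words(topic_word, n=3):
    """Most frequent words assigned to each topic (the topic 'tags')."""
    return [[w for w, c in counts.most_common(n) if c > 0] for counts in topic_word]


# Toy corpus, purely illustrative.
docs = [
    "cat dog pet cat dog".split(),
    "dog pet cat pet".split(),
    "stock market trade stock".split(),
    "market trade stock market".split(),
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words(topic_word)):
    print(k, words)
```

The per-document rows of `doc_topic` can be used as topic tags for documents, and `top_words` gives the word clusters. Since nothing here is language-specific beyond whitespace tokenization, the same code can be run separately on each language's corpus, given a suitable tokenizer.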

answered Jul 26 '10 at 23:33

Alexandre Passos ♦

LingPipe has a good and efficient implementation of LDA. See this tutorial. The downside of using LingPipe is that the license is quite restrictive if you plan to use it for anything other than academic or personal work.

answered Jul 27 '10 at 01:31

yoavg

The gensim package for Python implements LSA using an incremental SVD, so it does not need to hold the whole (unlabeled) document corpus in RAM. Gensim also implements LDA and random indexing for inducing document representations. According to their workshop paper (Rehurek and Sojka, 2010), the authors used gensim to induce LSA and LDA models over 270 million word tokens, with a vocabulary of 300K word types.

answered Oct 09 '10 at 11:51

Joseph Turian ♦♦


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.