|
I have a large text corpus that I would like to tag with word clusters or topic tags. I don't much care which method is used, as long as the tags or clusters are learned without supervision and group semantically related words together. Hierarchical topics would also make sense. Which toolkits are best for producing such tags? (This has to work, separately, on corpora in different languages.) |
|
MALLET is a good place to start. Another good option is the Stanford Topic Modeling Toolbox. Both tools will help you extract keywords and explore the topics. If you want to do more of the work yourself, any LDA implementation will do, and there are many available on the web. |
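
To give a feel for what "doing more of the work yourself" involves, here is a minimal sketch of LDA fitted by collapsed Gibbs sampling, using only the Python standard library. It is not how MALLET or the Stanford toolbox are implemented. The function name `lda_gibbs`, the toy documents, and the hyperparameter values `alpha` and `beta` are all assumptions made for illustration, and a real corpus would need proper tokenisation and far more iterations.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA with collapsed Gibbs sampling (hypothetical helper, toy scale only).

    docs is a list of token lists. Returns (top_words, doc_topic), where
    top_words[k] lists the most frequent words assigned to topic k and
    doc_topic[d][k] counts the tokens of document d assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                k = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    top_words = []
    for k in range(num_topics):
        ranked = sorted(range(vocab_size), key=lambda v: -topic_word[k][v])
        top_words.append([vocab[v] for v in ranked[:5]])
    return top_words, doc_topic


# Toy documents, invented for this example.
docs = [
    "cat dog pet animal cat".split(),
    "dog pet animal vet dog".split(),
    "stock market trade price stock".split(),
    "market price bank trade bank".split(),
]
top_words, doc_topic = lda_gibbs(docs, num_topics=2, iterations=100)
for k, words in enumerate(top_words):
    print(f"topic {k}: {' '.join(words)}")
```

The top words per topic serve as the tags, and `doc_topic` gives the topic mix of each document. Nothing here is language-specific beyond the whitespace tokenisation, so the same code can be run separately on each corpus.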
|
The gensim package for Python implements LSA using an incremental SVD, so the document corpus does not need to be held in RAM. Gensim also implements LDA and random indexing for inducing document representations. According to their workshop paper (Rehurek and Sojka, 2010), the authors used gensim to induce LSA and LDA models over 270 million word tokens, with a vocabulary of 300K word types. |