I am applying text mining to song lyrics to cluster the songs by topic. I have formatted my data as a term-document matrix of dimensions 5500 × 6500 with binary term weights. When I run K-means clustering (using Weka) on the data, I get document clusters that are impossible to label manually. So I am planning to apply non-negative matrix factorization (NMF) to get term clusters instead. As far as I can tell, Weka doesn't support NMF. Can you please suggest some good tools for this?
I just wrote a new example in scikit-learn to test this idea and time it. The code is on GitHub: Topics Extraction with NMF. According to my timings, extracting 10 topics from your dataset should be doable in less than 1 h with the current implementation of NMF in scikit-learn, but the runtime increases very quickly with the dimensions of the problem. It may be worth doing some feature selection (removing stop words and cutting off the least frequent terms) and subsampling the dataset.
Why not use LDA? It sounds perfectly suited to your problem.
No matter what method you use, you might not get easily labeled clusters. For best results, though, make sure you preprocess your data as well as possible with stemming, stop word removal, and tf-idf reweighting. I also second topic models, implementations of which are available in many languages. As for NMF tools, see the list at http://en.wikipedia.org/wiki/Non-negative_matrix_factorization#Software
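To make the preprocessing steps concrete, here is a rough standard-library sketch of stop word removal, a document-frequency cut-off, and tf-idf weighting. The stop list, the `tokenize` and `tfidf` helpers, and the sample documents are all made up for illustration. Stemming is left out because it would need an extra library, and `tf * log(N / df)` is only one of several common tf-idf variants.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "and", "i", "you", "my", "in", "on", "is", "to"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def tfidf(docs, min_df=1):
    """Return one {term: weight} dict per document, using tf * log(N / df).

    Terms that occur in fewer than min_df documents are discarded.
    """
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    weighted = []
    for tokens in tokenized:
        tf = Counter(t for t in tokens if df[t] >= min_df)
        weighted.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weighted


docs = [
    "I love you and you love the rain",
    "The rain is on my car",
    "My car and the open road",
]
weights = tfidf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", sorted(w, key=w.get, reverse=True)[:3])
```

Compared with binary weighting, tf-idf gives more weight to terms that are frequent in one song but rare across the collection, which tends to make the extracted components easier to read.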
Thanks for the comments. I am trying out LSI using the Gensim Python library. @Jacob, I have done stemming, function word removal, and binary weighting. Thanks!
I suggest you check out scikits.learn. It's a Python library with many tools specifically for text analysis, and it can do NMF and many other factorization and clustering methods. As far as I know, it does not include LDA (yet).
@Jacob: In my experience it's easier to label topics extracted with LDA or NMF (they seem to produce sparse additive components) than with KMeans (probably because its hard assignment assumption does not match the intuitive assumption that a text document is softly assigned to multiple topics) or PCA (whose dense representation with both positive and negative weights is hard to interpret).
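The sign difference between NMF and PCA components is easy to see on a toy matrix. The sketch below assumes scikit-learn and NumPy are installed. The binary document-term matrix is hypothetical, with two groups of documents that use disjoint vocabularies.

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

# Hypothetical binary document-term matrix: 6 documents x 6 terms.
# Documents 0-2 only use terms 0-2, documents 3-5 only use terms 3-5.
X = np.array(
    [
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [0, 1, 1, 0, 0, 0],
        [0, 0, 0, 1, 1, 1],
        [0, 0, 0, 1, 1, 0],
        [0, 0, 0, 0, 1, 1],
    ],
    dtype=float,
)

nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500).fit(X)
pca = PCA(n_components=2).fit(X)

np.set_printoptions(precision=2, suppress=True)
print("NMF components (non-negative weights):")
print(nmf.components_)
print("PCA components (positive and negative weights):")
print(pca.components_)
```

An NMF row can be read as "this topic is made of these terms", whereas a PCA row contrasts one group of terms against another, which is harder to turn into a label.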
@aseembehl Since you have tried LSI with gensim, you should also give its LDA implementation a try and qualitatively compare the results in terms of how easy the extracted components are to label.