I am applying text mining to song lyrics in order to cluster the songs by topic. I have formatted my data as a term-document matrix of dimensions 5500 x 6500, with binary term weights.

If I run K-means clustering (using Weka) on the data, I get document clusters that are impossible to label manually. So I am planning to apply non-negative matrix factorization (NMF) to get term clusters instead. As far as I can tell, Weka doesn't support NMF. Can you please suggest some good tools for this?

asked May 28 '11 at 05:32

aseembehl

edited May 28 '11 at 05:34

Why not use LDA? It sounds perfectly suited to your problem.

(May 28 '11 at 12:04) Kevin Canini

No matter what method you use, you might not get easily labeled clusters. For best results, however, make sure you preprocess your data as well as possible with stemming, stop word removal and tf-idf reweighting. I also second topic models, implementations of which are available in many languages. As for NMF, see the software list at http://en.wikipedia.org/wiki/Non-negative_matrix_factorization#Software

(May 28 '11 at 12:14) Jacob Jensen
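
For readers following along, the tf-idf reweighting suggested in the comment above could be applied to an already binary matrix roughly as follows. This is only a minimal sketch: the function name `idf_reweight`, the toy matrix, the plain `log(N / df)` form of idf, and the rows-are-documents orientation are all assumptions made for illustration, not something taken from the thread.

```python
import math


def idf_reweight(binary_rows):
    """Scale a binary document-term matrix (list of 0/1 rows) by idf.

    Assumes rows are documents and columns are terms.
    """
    n_docs = len(binary_rows)
    n_terms = len(binary_rows[0])
    # Document frequency: how many documents contain each term.
    df = [sum(row[j] for row in binary_rows) for j in range(n_terms)]
    # Plain log(N / df); terms that never occur get weight 0.
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in binary_rows]


# Hypothetical toy data: 3 documents x 3 terms.
rows = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
weighted = idf_reweight(rows)
for row in weighted:
    print(row)
```

A term that occurs in every document (the first column here) ends up with weight zero, which is the intended effect: it cannot help separate topics.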

Thanks for the comments. I am trying out LSI using the Gensim Python library. @Jacob, I have done stemming, function word removal and binary weighting. Thanks!

(May 29 '11 at 02:32) aseembehl

I suggest you check out scikits.learn. It's a Python library with many tools specifically for text analysis, and it can do NMF and many other factorization and clustering methods. As far as I know, it does not include LDA (yet).

(May 29 '11 at 10:21) Andreas Mueller

@Jacob: In my experience it's easier to label topics extracted with LDA or NMF (they seem to produce sparse additive components) than clusters from KMeans (probably because its hard assignment assumption does not match the intuitive assumption of multiple soft topic assignments for text documents) or components from PCA (whose dense representation with both positive and negative weights is hard to interpret).

@aseembehl: as you have tried LSI with gensim, you should also give its LDA implementation a try and qualitatively compare the results in terms of how easy the extracted components are to label.

(Aug 29 '11 at 09:46) ogrisel

One Answer:

I just wrote a new example in scikit-learn to test this idea and measure timings. The code is on GitHub: Topics Extraction with NMF

According to my timings, I suspect that extracting 10 topics from your dataset should be doable in less than 1 hour with the current implementation of NMF in scikit-learn, but the runtime increases very quickly with the dimensions of the problem. It might be worth doing some kind of feature selection (removing stop words and cutting off the least frequent terms) and subsampling the dataset.
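
As a rough illustration of the approach described above (not the linked example itself), topic extraction with NMF in scikit-learn might look like the sketch below. The toy `lyrics` corpus, the `min_df=2` cutoff, the choice of 2 topics and the helper `top_terms` are assumptions made for this example; real lyrics data would need its own settings.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus standing in for the lyrics collection.
lyrics = [
    "love heart kiss love baby",
    "heart love tonight kiss",
    "baby love heart forever",
    "road car drive highway night",
    "drive highway car wheels road",
    "car road wheels drive fast",
]

# Feature selection: drop English stop words and terms that appear
# in fewer than 2 documents, then weight the rest with tf-idf.
vectorizer = TfidfVectorizer(stop_words="english", min_df=2)
X = vectorizer.fit_transform(lyrics)  # documents x terms, non-negative

n_topics = 2
nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0, max_iter=500)
W = nmf.fit_transform(X)  # documents x topics (soft assignments)
H = nmf.components_       # topics x terms (term clusters)

# Column index -> term, recovered from the fitted vocabulary.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def top_terms(topic_row, k=4):
    """Return the k terms with the largest weight in one topic."""
    order = topic_row.argsort()[::-1][:k]
    return [terms[i] for i in order]


topics = [top_terms(row) for row in H]
for i, words in enumerate(topics):
    print("Topic %d: %s" % (i, " ".join(words)))
```

The rows of `H` are the term clusters the question asks for, and each row of `W` gives a document's non-negative weight on every topic, so a song can belong to several topics at once.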

answered Aug 28 '11 at 09:55

ogrisel

edited Aug 28 '11 at 09:56

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.