
I'm currently trying to play around with NLTK and scikits-learn for text clustering news articles.

How do I extend the models to add scaling of features from a document (I'm doing some preprocessing on the text articles) so I can experiment with weighting?

I'm starting from this outline of document clustering:

https://github.com/ogrisel/scikit-learn/blob/master/examples/document_clustering.py

How do I approach this problem? Do I develop heuristics to help tune the parameters I give k-means? The document fields I want to weight are:

a. Title
b. Body Text
c. Links (anchor text and link)

Thanks.

asked Feb 18 '12 at 11:43

user9821

As nobody seems to have a better idea, you can simply multiply the values by a weight. As for how to do this in NLTK or scikits-learn, I'm not sure. The text extraction in scikits.learn is a little static at the moment, so you may be better off using the components directly. Have a look at the sklearn/feature_extraction/text.py file for details.

(Feb 19 '12 at 22:11) Robert Layton
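
A minimal sketch of the field-weighting idea from the comment above, assuming hypothetical title/body lists and the TfidfVectorizer and KMeans components from sklearn (the field names and weights are placeholders to experiment with):

```python
# Sketch: vectorize each field separately, scale each matrix by a weight,
# and stack them into one feature matrix for k-means.
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical parallel lists of fields extracted from the articles.
titles = ["Fed raises interest rates", "New phone released this week"]
bodies = ["The central bank lifted its benchmark rate ...",
          "The device ships with a larger screen ..."]

title_weight, body_weight = 2.0, 1.0   # weights to experiment with

title_vec = TfidfVectorizer()
body_vec = TfidfVectorizer()

# Multiply each field's tf-idf values by its weight, then concatenate.
X = hstack([title_vec.fit_transform(titles) * title_weight,
            body_vec.fit_transform(bodies) * body_weight]).tocsr()

km = KMeans(n_clusters=2, random_state=0).fit(X)
print(km.labels_)
```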

Thanks for the response. I'm using a bunch of tools for the text work; I've figured that part out.

(Feb 20 '12 at 19:24) user9821

One Answer:

Here is a principled approach to determine the relative weight:

Let's say you want to pick the relative weight between Title and Body Text.

Pick an information retrieval error measure, for example the accuracy with which the top result matches the search query.

Then, use the Link information as the search query. Search over the documents, where each document is represented by a weighted combination of its Title and Body Text, i.e. find the nearest document to the query. Choose the weighting that maximizes retrieval accuracy.
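
A minimal sketch of this selection procedure, under the assumption that the articles come as parallel lists `titles`, `bodies`, and `anchors` (the anchor text of article i is used as the query whose correct answer is article i) and that a single scalar `w` weights Title relative to Body Text:

```python
# Sketch: choose the Title weight w that maximizes top-1 retrieval accuracy
# when the anchor text of each article is used as the query.
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieval_accuracy(w, titles, bodies, anchors):
    # Represent each document (and each query) as [w * title_tfidf, body_tfidf].
    title_vec = TfidfVectorizer().fit(titles + anchors)
    body_vec = TfidfVectorizer().fit(bodies + anchors)
    docs = hstack([title_vec.transform(titles) * w,
                   body_vec.transform(bodies)]).tocsr()
    queries = hstack([title_vec.transform(anchors) * w,
                      body_vec.transform(anchors)]).tocsr()
    # Fraction of queries whose nearest document is the one they came from.
    nearest = cosine_similarity(queries, docs).argmax(axis=1)
    return np.mean(nearest == np.arange(len(anchors)))

# Hypothetical data; in practice these come from the preprocessed articles.
titles = ["Fed raises interest rates", "New phone released this week"]
bodies = ["The central bank lifted its benchmark rate ...",
          "The device ships with a larger screen ..."]
anchors = ["interest rate hike", "smartphone launch"]

best_w = max(np.linspace(0.1, 5.0, 50),
             key=lambda w: retrieval_accuracy(w, titles, bodies, anchors))
print("best title weight:", best_w)
```

The same grid search extends to a separate weight per field if you later include more fields in the document representation.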

answered Feb 22 '12 at 04:15

Joseph Turian ♦♦
