
Hi,

I am encoding a few features for a text classification task, and I accidentally encoded multiple features under the same names but with different values. The feature types were frequency of letters (A-Z) and character-level unigrams. Though the feature names are the same, the values are different: Boolean in one case, numerical in the other.

How does this affect a classifier? I know the answer depends on the type of classifier, but a short overview of the potential benefits and pitfalls would be great.

I use Python as my programming language, with NLTK and Scikits.Learn as my machine learning libraries.

asked May 11 '11 at 11:06

Dexter


2 Answers:

If each feature name:value pair is stored separately, it may not matter. But if there's any doubt, you should try to use separate feature names. You definitely want to avoid one feature value overwriting another of the same name.
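To make the overwriting risk concrete, here is a minimal sketch assuming both feature sets are kept as plain Python dicts keyed by character (as NLTK-style feature dicts usually are). The variable names and values are made up for illustration.

```python
# Illustrative values only: letter frequencies and boolean unigrams
# that happen to share the same feature names.
counts = {"a": 2, "b": 1}
present = {"a": True, "b": True}

# Merging into one dict keeps a single value per name, so whichever
# set is merged last silently replaces the other.
merged = dict(counts)
merged.update(present)

print(merged)
```

After the merge the counts are gone and only the boolean values remain, which is the situation to avoid.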

answered May 11 '11 at 13:09

Jacob Perkins

I think he is asking whether there is a problem if the same feature has multiple names.

(May 11 '11 at 13:27) Joseph Turian ♦♦

I myself am kind of confused here. Yes, it's stored separately, but how do I uniquely identify it? It may not matter in NLTK, but how do I convert it to a format accepted by libSVM or SVMLight? I need to have a unique list of features. Which feature should I go for: the one with the Boolean value or the one with the numerical value?

(May 11 '11 at 14:18) Dexter
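On the conversion question, a rough sketch of one way to do it, assuming the features already have unique names. The libSVM and SVMLight sparse format is one example per line, `<label> <index>:<value> ...`, with integer indices starting at 1 and listed in increasing order. The helper names `build_index` and `to_svmlight_line` and the example data are hypothetical, not part of NLTK or either SVM package.

```python
def build_index(feature_dicts):
    """Assign each distinct feature name a 1-based integer index."""
    names = sorted(set(name for feats in feature_dicts for name in feats))
    return dict((name, i) for i, name in enumerate(names, 1))

def to_svmlight_line(label, feats, index):
    """Format one example as '<label> <index>:<value> ...', indices ascending."""
    # float() turns True into 1.0, so boolean and numeric features share one format.
    pairs = sorted((index[name], float(value)) for name, value in feats.items())
    return "%s %s" % (label, " ".join("%d:%g" % (i, v) for i, v in pairs))

# Made-up examples with uniquely named features.
examples = [
    (1, {"a_count": 2, "a_bool": True, "b_bool": True}),
    (-1, {"b_count": 3, "b_bool": True}),
]

index = build_index(feats for _, feats in examples)
lines = [to_svmlight_line(label, feats, index) for label, feats in examples]
print("\n".join(lines))
```

The index should be built once from the training data and reused for test data. As written, a feature name that is missing from the index raises a `KeyError`, so unseen test features would need to be skipped explicitly.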

As you said in the question, it depends on the classifier. Some key on feature names, so they are likely to ignore either the first or the second occurrence. Others accept multiple values for a feature, so A:7 and A:true will be processed separately. If you want to use both, just give them unique names: A-count and A-bool, or something similar.

As for which is better, it really depends on the task at hand. Term frequency carries more information for the model to take advantage of. This probably helps, unless you don't have enough training data and the model overfits. Thinking about your task, you can usually guess whether it's occurrence or frequency that's more useful.

Your best bet is to run both against your test set, and see which performs better.

answered May 11 '11 at 14:39

Paul Barba

Paul, thanks for the reply. I am pretty sure I want to use both, so I should probably give them different names. The actual problem, though, is that it would make my code look "dirty". For example, character unigrams can be retrieved as follows:

    def get_char_unigrams(words):
        ''' Returns character unigrams from text '''
        return dict([(char, True) for word in words for char in word])

but now, because of the identically named features, I might have to change it to:

    def get_char_unigrams(words):
        ''' Returns character unigrams from text '''
        return dict([(char + "_bool", True) for word in words for char in word])

and it doesn't look intuitive; it feels rather un-Pythonic.

(May 11 '11 at 14:45) Dexter
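One possible way to keep the extractors themselves clean is to leave them untouched and add the suffix in a single place when the feature sets are combined. This is only a sketch: `get_letter_counts`, `combine`, and `EXTRACTORS` are hypothetical names, and the frequency extractor is a stand-in because the original one is not shown in the thread.

```python
from collections import Counter

def get_char_unigrams(words):
    """Boolean presence of each character, with no suffix in the extractor."""
    return dict((char, True) for word in words for char in word)

def get_letter_counts(words):
    """Hypothetical frequency extractor: how often each character occurs."""
    return dict(Counter(char for word in words for char in word))

def combine(words, extractors):
    """Run each extractor and suffix its feature names with the extractor's tag."""
    features = {}
    for tag, extractor in extractors:
        for name, value in extractor(words).items():
            features["%s_%s" % (name, tag)] = value
    return features

# The tag for each extractor is declared once, next to the extractor itself.
EXTRACTORS = [("bool", get_char_unigrams), ("count", get_letter_counts)]

features = combine(["aab", "b"], EXTRACTORS)
print(sorted(features.items()))
```

The naming convention then lives in `combine` alone, so adding another feature type only means adding one more entry to the list.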

In the world of data hacking, sometimes code has to get ugly in order to produce clean results.

(May 11 '11 at 17:07) Jacob Perkins

Jacob, Thanks! :-)

(May 12 '11 at 04:44) Dexter

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.