
Hi,

I used bag-of-words features for text classification with an SVM; the number of features is about 10,000. The result was pretty high: 90.03% accuracy. Then I added about 50 new features to the bag of words, mainly text statistics such as type-token ratio, average number of syllables per word, and so on. This improved the accuracy (90.15%), but not significantly. However, when I used information gain to see which features contributed most, I was surprised to find some of the new features (the text statistics) listed at the top. How can I explain this? If the newly added features are so informative, why is there only a small improvement in accuracy?
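
For reference, a minimal sketch of the feature setup described above (binary bag of words plus hand-crafted text statistics), in plain Python; the toy vocabulary and the statistic functions are simplified stand-ins for the real ~10,000-word setup:

```python
# Build a binary bag-of-words vector plus simple text statistics.
# (Hypothetical toy vocabulary; in the real setup the vocabulary
# comes from the training corpus and the result is fed to an SVM.)

def binary_bow(tokens, vocab):
    # 1 if the word occurs in the document, 0 otherwise
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

def type_token_ratio(tokens):
    # distinct word types divided by total tokens
    return len(set(tokens)) / len(tokens)

def avg_word_length(tokens):
    # average number of characters per token
    return sum(len(t) for t in tokens) / len(tokens)

vocab = ["milk", "sugar", "glass", "go", "back"]   # toy vocabulary
doc = "milk and sugar in a glass of milk".split()

# binary BOW features followed by the appended statistics
features = binary_bow(doc, vocab) + [type_token_ratio(doc), avg_word_length(doc)]
```

`features` is the combined vector that would go to the classifier, one row per document.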

I would be very grateful if you help me with my question.

This question is marked "community wiki".

asked Nov 19 '13 at 06:35

nour


2 Answers:

Thank you for your answer, but I don't think that is the reason, because I used the binary representation of bag of words, so the feature vector consists of 0s and 1s. The new features I added are independent of the bag of words, such as type-token ratio, average word length, average number of syllables per word, and number of named-entity tags. How can these features be learned from the binary bag-of-words representation?

I would be very grateful if you could explain in more detail.

This answer is marked "community wiki".

answered Nov 19 '13 at 13:30

nour

Some time ago I used a linear SVM to build a gender classifier based on images of human faces. On the very first attempt it gave 100% correct results, which was quite suspicious. After some investigation it turned out that there were only 28 different people, and the classifier had simply learned each of them (instead of the "simpler" classes of male and female). This shows how well modern ML models can learn properties of the data even when we humans can't see the process. In your particular case the SVM could learn properties of specific sets of words and construct its own features similar to your statistics. E.g. if your BOW is milk, sugar, glass, your average number of syllables per word is 4/3, but nothing stops the SVM from assigning such "weights" to these words that their combination produces exactly that same new feature. Try running the classifier with only your 50 statistics - most probably they will give relatively good results too (though lower than with the complete set of features). Also, please post comments instead of new answers, and mention people with @<username> - this helps people track updates to questions and their answers.
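
The milk/sugar/glass example can be made concrete: with a fixed vocabulary, a linear model can assign each word its syllable count as a weight, so the weighted sum over a binary BOW vector equals the total syllable count - exactly the raw material of the "average syllables per word" statistic. A toy sketch (not the actual classifier, just the arithmetic):

```python
# Toy demo: linear "weights" over binary BOW features can reproduce
# a derived statistic. Syllable counts per vocabulary word:
syllables = {"milk": 1, "sugar": 2, "glass": 1}
vocab = list(syllables)          # ["milk", "sugar", "glass"]

def bow(tokens):
    # binary presence vector over the fixed vocabulary
    present = set(tokens)
    return [1 if w in present else 0 for w in vocab]

# Weights an SVM could in principle learn: each word's syllable count
weights = [syllables[w] for w in vocab]

doc = ["milk", "sugar", "glass"]
x = bow(doc)
total_syllables = sum(wi * xi for wi, xi in zip(weights, x))
avg = total_syllables / sum(x)   # 4 / 3, as in the comment above
```

The point is only that the statistic is reachable from the BOW features; the SVM does not literally compute it, but it can weight the same underlying signal.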

(Nov 20 '13 at 16:26) ffriend

Thanks @ffriend for your explanation. I will try that.

(Nov 21 '13 at 04:18) nour

Oh, hell, it seems like I lied to you - using "@" doesn't work here (I'm too used to StackOverflow, apparently). Anyway, I'm glad you found the comment. Feel free to post additional questions if you have any.

(Nov 21 '13 at 17:03) ffriend

Your statistics are not really new features, but derivatives of other features (the words). For the sake of simplicity, consider a linear regression model instead of an SVM:

h(w; k) = k0 + k1*w1 + k2*w2 + ... + km*wm

where w1..wm are your words and k0..km are the model parameters. You have added some new features - statistics of the words (let's call them s1..sp, with corresponding parameters r1..rp). So your model now looks like this:

h(w, s; k, r) = k0 + k1*w1 + k2*w2 + ... + km*wm + r1*s1 + r2*s2 + ... + rp*sp

Cool! There are new features to learn! However, these statistics themselves depend on the words, e.g.:

s1 = s1(w) = f(w1, w2, ..., wm)

where f is a statistic computed from the words in the training document (e.g. average word length). f introduces some (possibly non-linear) transformation of the features, which is cool by itself. In fact, such transformations may be essential for this particular task. That's one possible reason why your statistics are rated as the most important features. However, your main model h() could have already learned most of these transformations in its own way, and that's why you got only a little improvement.
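
To illustrate s = f(w1, ..., wm): a statistic such as average word length is fully determined by the word counts - it is a ratio of two linear functions of them, i.e. a non-linear transformation of the original features. A toy sketch:

```python
# A statistic s = f(w1, ..., wm) computed purely from word counts.
# Average word length = (sum_i count_i * len(word_i)) / (sum_i count_i):
# a ratio of two linear functions of the counts.

vocab = ["milk", "sugar", "glass"]
counts = [2, 1, 1]   # w1..wm for one toy document

numerator = sum(c * len(w) for c, w in zip(counts, vocab))  # total characters
denominator = sum(counts)                                   # total tokens
avg_word_len = numerator / denominator
```

Because the numerator and denominator are both linear in the counts, the statistic adds no information that isn't already in the word features - only a different (non-linear) view of it.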

This looks like the most likely explanation to me, but of course there are other possibilities. Anyway, it's quite hard to draw conclusions about highly dependent features. Try adding more independent things, like the part of speech of each word (e.g. import_VERB, import_NOUN, etc.) or n-grams (e.g. not just the counts of the words "go" and "back" in the text, but also the count of the bigram "go_back"). This may change the classification results dramatically (or not :)).
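
A sketch of those two suggestions - POS-tagged tokens like import_VERB and bigram features like go_back. The tags here are hard-coded stand-ins; a real pipeline would get them from a POS tagger such as NLTK's or spaCy's:

```python
# Turn (word, POS) pairs into combined tokens, and build word bigrams.
# The (word, tag) pairs are hypothetical tagger output for one sentence.

tagged = [("import", "VERB"), ("the", "DET"), ("goods", "NOUN"),
          ("then", "ADV"), ("go", "VERB"), ("back", "ADV")]

# "import_VERB" and "import_NOUN" become distinct features
pos_tokens = [f"{w}_{t}" for w, t in tagged]

# adjacent word pairs: "go_back" is a feature in its own right
words = [w for w, _ in tagged]
bigrams = ["_".join(pair) for pair in zip(words, words[1:])]
```

Both `pos_tokens` and `bigrams` can then be counted exactly like ordinary bag-of-words features.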

This answer is marked "community wiki".

answered Nov 19 '13 at 08:20

ffriend


powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.