Hi, I used bag-of-words features for text classification with an SVM. The number of features is about 10,000, and the accuracy was already pretty high: 90.03%. Then I added about 50 new features, mainly text statistics such as type-token ratio, average number of syllables per word, and so on, on top of the bag-of-words features. This improved the accuracy to 90.15%, which is not a significant gain. But when I used information gain to see which features contributed most, I was surprised to find that some of the new features (the text statistics) were ranked at the top. How can I explain this? If the newly added features are so informative, why is there so little improvement in accuracy? I would be very grateful for your help.
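Roughly, the setup looks like this (a minimal sketch using scikit-learn; the toy corpus, labels and the single extra statistic are placeholders for my actual data, and mutual information stands in for information gain):

```python
# Minimal sketch of the setup above, assuming scikit-learn; the corpus,
# labels and the single extra statistic are toy placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

docs = [
    "the cat sat on the mat",
    "a dog chased the cat around",
    "cats and dogs sleep all day",
    "stocks fell sharply on monday",
    "markets rallied after the jobs report",
    "the index closed higher today",
]
labels = np.array([0, 0, 0, 1, 1, 1])  # toy classes: pets vs. finance

# Binary bag-of-words features (0/1 per word), as described above.
bow = CountVectorizer(binary=True).fit_transform(docs).toarray()

# One hand-made text statistic: type-token ratio per document.
ttr = np.array([[len(set(d.split())) / len(d.split())] for d in docs])

X = np.hstack([bow, ttr])  # BOW columns plus the statistic column

# Rank features; mutual information stands in for information gain.
mi = mutual_info_classif(X, labels, random_state=0)
print("MI of TTR column:", mi[-1], "best word MI:", mi[:-1].max())

# Accuracy with and without the extra statistic.
print(cross_val_score(LinearSVC(), bow, labels, cv=3).mean())
print(cross_val_score(LinearSVC(), X, labels, cv=3).mean())
```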
Thank you for your answer. But I don't think that is the reason, because I used the binary representation of bag of words, so the feature vector consists of 0s and 1s. The new features I added seem independent of the bag of words: type-token ratio, average word length, average syllables per word, number of named-entity tags. How can these features be learned from the binary bag-of-words representation? I would be very grateful if you could explain in more detail.
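To make it concrete, these are the kinds of statistics I mean (rough sketches; my exact definitions may differ, and the syllable count is only a crude vowel-group proxy):

```python
import re

# Rough sketches of the statistics mentioned above; exact definitions
# may differ, and the syllable count is only a crude vowel-group proxy.
def type_token_ratio(tokens):
    """Number of distinct word types divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def avg_word_length(tokens):
    return sum(len(t) for t in tokens) / len(tokens)

def avg_syllables_per_word(tokens):
    return sum(len(re.findall(r"[aeiouy]+", t.lower()))
               for t in tokens) / len(tokens)

tokens = "the quick brown fox jumps over the lazy dog".split()
print(type_token_ratio(tokens))        # 8/9, since 'the' repeats
print(avg_word_length(tokens))
print(avg_syllables_per_word(tokens))
```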
Some time ago I used a linear SVM to build a gender classifier based on images of human faces. On the very first attempt it gave 100% correct results, which was quite suspicious. After some investigation it turned out that there were only 28 different people in the data, and the classifier had simply learnt each of them (instead of the "simpler" classes of male and female). This shows how well modern ML models can learn properties of the data even when we humans can't see the process. In your particular case the SVM could learn properties of specific sets of words and construct its own features similar to your statistics. E.g. if you have BOW…
(Nov 20 '13 at 16:26)
ffriend
Thanks @ffriend for your explanation. I will try that.
(Nov 21 '13 at 04:18)
nour
Oh, hell, it seems like I lied to you - using "@" doesn't work here (I'm too used to StackOverflow, apparently). Anyway, I'm glad you found the comment. Feel free to post additional questions if you have any.
(Nov 21 '13 at 17:03)
ffriend
Your statistics are not really new features, but derivatives of other features (the words). For the sake of simplicity, consider a linear regression model instead of an SVM:

$y = w_1 x_1 + w_2 x_2 + \dots + w_n x_n$

where $x_1, \dots, x_n$ are your binary word features and $w_1, \dots, w_n$ are the learned weights. Now append your ~50 statistics $x_{n+1}, \dots, x_{n+50}$:

$y = w_1 x_1 + \dots + w_n x_n + w_{n+1} x_{n+1} + \dots + w_{n+50} x_{n+50}$

Cool! There are new features to learn! However, these statistics themselves depend on the words, e.g.:

$x_{n+1} = f(x_1, \dots, x_n)$

where $f$ is some function of the original word features. Whatever such a feature tells the model is thus already (approximately) expressible through a combination of the word weights, so it can rank high on information gain while adding almost nothing to accuracy. This looks like the most likely case to me, but of course there are other possibilities. Anyway, it's quite hard to draw conclusions about highly dependent features. Try using more independent things, like the part of speech of each word (e.g. counts of nouns, verbs and adjectives).
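To see this concretely with your binary representation (a toy sketch, not your data): setting every word's weight to 1 already makes a linear model compute the number of distinct word types per document, i.e. the numerator of your type-token ratio.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy illustration: with binary BOW, the weight vector w = (1, ..., 1)
# makes a linear model output the number of distinct word types per
# document -- the numerator of the type-token ratio.
docs = ["the cat sat on the mat", "the dog and the cat and the dog"]
X = CountVectorizer(binary=True).fit_transform(docs).toarray()

w = np.ones(X.shape[1])
print(X @ w)  # [5. 4.] = distinct word types per document

# So a "new" feature like type-token ratio is largely a function of the
# existing columns; it can rank high on information gain while giving
# the SVM little it could not already express.
```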