Revision history[back]
click to hide/show revision 1
Revision n. 1

Jul 02 '10 at 19:49

ogrisel's gravatar image

ogrisel
354464171

I would not say that style per se really matter, but vocaulary obvisously does. If you use a bag of words / tokens / ngrams as input features for you classifier, the classifier will not be able use the information present in tokens never seen before during the training phase while they might be very important for your test set.

I think to answer this question you need to try and train models on your specific task and measure the variance of the precision and recall performance by using cross validations with folds coming from your different datasets (textbooks, research papers, news, wikipedia, ...).

It is also probably very dependant on the size of your training corpus and the number and the nature of classes / labels you target.

click to hide/show revision 2
Revision n. 2

Jul 02 '10 at 20:19

ogrisel's gravatar image

ogrisel
354464171

I would not say that style per se really matter, but vocaulary vocabulary obvisously does. If you use a bag of words / tokens / ngrams as input features for you classifier, the classifier will not be able use the information present in tokens never seen before during the training phase while they might be very important for your test set.

I think to answer this question you need to try and train models on your specific task and measure the variance of the precision and recall performance by using cross validations with folds coming from your different datasets (textbooks, research papers, news, wikipedia, ...).

It is also probably very dependant dependent on the size of your training corpus and the number and the nature of classes / labels you target.

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.