Prior to building a classifier, is it necessary to weight terms (TF-IDF) in text categorization? Can we just stick to a Boolean representation of the terms, e.g. (w1:false), (w2:true), (w3:true), ...?
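For concreteness, here is a small sketch of the two representations the question contrasts (scikit-learn is my assumption for illustration, not something named in the question): the Boolean case records only the presence or absence of each term, while tf-idf assigns each term a real-valued weight.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["the cat sat on the mat", "the dog sat on the log"]

    # Boolean representation: each term is 0/1, i.e. (w1:false), (w2:true), ...
    bool_features = CountVectorizer(binary=True).fit_transform(docs)
    print(bool_features.toarray())

    # tf-idf representation: each term gets a real-valued weight.
    tfidf_features = TfidfVectorizer().fit_transform(docs)
    print(tfidf_features.toarray())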
If all documents are about the same length and you have a substantial amount of training data, within-document weights have relatively little impact. If document lengths are highly variable, then you'll want to use something like a log-tf weight, and perhaps a document length normalization. IDF weights are only important if you have relatively little training data.

As document length normalisation, doesn't using the proportion of values cover this? i.e. freq('a') / len(document)?
(Mar 11 '12 at 19:56)
Robert Layton
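A minimal sketch of the weightings discussed above (the function and variable names are mine, chosen for illustration): raw counts, the proportional normalisation from the comment (freq('a') / len(document)), and a log-tf weight followed by cosine-style document length normalisation.

    import math
    from collections import Counter

    def term_weights(tokens):
        counts = Counter(tokens)
        n = len(tokens)
        # Proportional normalisation from the comment: freq(term) / len(document).
        proportional = {t: c / n for t, c in counts.items()}
        # Log-tf damps the influence of terms repeated many times in long documents.
        logtf = {t: 1.0 + math.log(c) for t, c in counts.items()}
        # Cosine-style length normalisation of the log-tf weights.
        norm = math.sqrt(sum(w * w for w in logtf.values()))
        logtf_normalised = {t: w / norm for t, w in logtf.items()}
        return dict(counts), proportional, logtf_normalised

    print(term_weights("the cat sat on the mat the cat".split()))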
[1] McCallum, A. and Nigam, K. A comparison of event models for Naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization, Wisconsin, USA, 1998.
[2] Schneider, K.-M. On word frequency information and negative evidence in Naive Bayes text classification. In: Vicedo, J. L., Martínez-Barco, P., Muñoz, R. and Noeda, M. S. (eds.), Advances in Natural Language Processing. Berlin/Heidelberg: Springer, 2004, pp. 474-485.
Of course you can stick to a boolean representation, and you'll get a working classifier. It probably will not perform as well as a tf-idf weighted one, however, if years of experimentation are any guide. If you use a count of terms, as opposed to binary, in principle you could learn model weights that implicitly normalize for the IDF. In general, it's an empirical question.
(Mar 01 '12 at 00:23)
Joseph Turian ♦♦
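To illustrate the "empirical question" point, a hedged sketch (the dataset, classifier, and parameters below are my choices for illustration, not part of the answer): build the same pipeline once with Boolean features and once with tf-idf features, and let cross-validation decide which works better on your data.

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])

    # Same classifier, two feature representations; compare cross-validated accuracy.
    for name, vectorizer in [("boolean", CountVectorizer(binary=True)),
                             ("tf-idf", TfidfVectorizer())]:
        model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
        scores = cross_val_score(model, data.data, data.target, cv=5)
        print(name, scores.mean())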