Before building a classifier for text categorization, is it necessary to weight the terms (e.g. with TF-IDF)? Or can we just stick to a Boolean representation of the terms, e.g. (w1: false), (w2: true), (w3: true), ...?

asked Feb 29 '12 at 02:26

Fairuz Zaiyadi

edited Mar 01 '12 at 00:24

Joseph Turian ♦♦


4 Answers:

If all documents are about the same length and you have a substantial amount of training data, within-document weights have relatively little impact. If document lengths are highly variable, then you'll want to use something like a log-tf weight, and perhaps document length normalization. IDF weights are only important if you have relatively little training data.
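As a rough illustration of the representations mentioned here, the following is a minimal sketch in plain Python. The function names (`boolean_vector`, `log_tf_vector`, `l2_normalize`, `idf_table`) are hypothetical, and the specific choices (`1 + log(count)` for log-tf, L2 norm for length normalization, `log(N / df)` for IDF) are common variants assumed for the example, not something the answer prescribes.

```python
import math
from collections import Counter


def boolean_vector(tokens):
    """Boolean representation: a term is either present or not."""
    return {t: True for t in set(tokens)}


def log_tf_vector(tokens):
    """Log-scaled term frequency: 1 + log(count) for each term present."""
    counts = Counter(tokens)
    return {t: 1.0 + math.log(c) for t, c in counts.items()}


def l2_normalize(vec):
    """Document length normalization: scale the vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        return dict(vec)
    return {t: v / norm for t, v in vec.items()}


def idf_table(docs):
    """Inverse document frequency: log(N / df) over a list of token lists."""
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}


# Toy corpus, invented for the example.
docs = [["cheap", "cheap", "pills"], ["meeting", "notes", "cheap"]]
idf = idf_table(docs)
weighted = [
    l2_normalize({t: w * idf[t] for t, w in log_tf_vector(doc).items()})
    for doc in docs
]
```

In this toy corpus "cheap" occurs in every document, so its IDF is zero and it drops out of the weighted vectors, whereas the Boolean representation would keep it as a feature.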

answered Mar 11 '12 at 15:03

Dave Lewis

For document length normalisation, doesn't using the proportion of each term cover this, i.e. freq('a') / len(document)?
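The proportion described in this comment can be sketched as follows; `relative_tf` is a hypothetical name, and the document is assumed to be already tokenized into a list of strings.

```python
from collections import Counter


def relative_tf(tokens):
    """Relative term frequency: freq(term) / len(document)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}


short_doc = ["a", "a", "b", "c"]
long_doc = short_doc * 2  # same content repeated, twice the length

# Both documents get the same vector, so raw length no longer matters.
same = relative_tf(short_doc) == relative_tf(long_doc)
```

This removes the effect of document length, but the weight still grows linearly with the count, which is the part a log-tf weight is meant to dampen.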

(Mar 11 '12 at 19:56) Robert Layton

ok thanks everyone =)

answered Mar 06 '12 at 23:58

Fairuz Zaiyadi

[1] McCallum A. and Nigam K. A comparison of event models for Naive Bayes text classification. In: AAAI-98 Workshop on "Learning for Text Categorization"; 1998; Wisconsin, USA.

[2] Schneider K.-M. On Word Frequency Information and Negative Evidence in Naive Bayes Text Classification. In: J. L. Vicedo, P. Martínez-Barco, R. Muñoz and M. S. Noeda (eds.). Advances in Natural Language Processing. Berlin / Heidelberg: Springer, 2004, pp. 474-485.

answered Mar 05 '12 at 06:02

Arash Joorabchi

Of course you can stick to a Boolean representation, and you'll get a working classifier. However, if years of experimentation are any guide, it probably will not perform as well as a tf-idf weighted one.

answered Feb 29 '12 at 08:49

Alexandre Passos ♦

If you use term counts, as opposed to binary features, in principle you could learn model weights that implicitly normalize for the IDF. In general, it's an empirical question.
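One way to see the "in principle" part, assuming a linear classifier (the comment does not name a model): write the score of document d with per-term weights w_t on tf-idf features, and the IDF factor can be folded into the weights.

```latex
s(d) = \sum_{t} w_t \,\bigl(\mathrm{tf}_{t,d} \cdot \mathrm{idf}_t\bigr)
     = \sum_{t} \bigl(w_t \,\mathrm{idf}_t\bigr)\, \mathrm{tf}_{t,d}
     = \sum_{t} w'_t \,\mathrm{tf}_{t,d},
\qquad w'_t = w_t \,\mathrm{idf}_t .
```

So any linear model on tf-idf features has an equivalent linear model on raw counts. Whether training actually finds those weights depends on the amount of data and on regularization, which is not invariant to this rescaling, hence the empirical question.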

(Mar 01 '12 at 00:23) Joseph Turian ♦♦
