
I am wondering what influence document length has on text categorization. What is the typical average document length in standard tasks? Do I need to make any adjustments if the documents being analyzed are relatively short? I googled a lot and could not find any relevant paper. Please give me a hint. Thanks.

asked Oct 11 '10 at 11:29 by Jfly

3 Answers:

Document length does matter, and you will find many references to it in the literature. To deal with very short documents you might want to do dimensionality reduction to get a denser dataset, but even this is not necessary, and off-the-shelf naive Bayes or logistic regression classifiers should work fine. If you have documents of wildly varying lengths you might want to try different strategies for building the feature vectors. In general, most documents used in standard evaluations are the size of paper abstracts, email messages, or news stories, although there has been research on tweets that does not seem to need profoundly different techniques. I am not aware of work on classifying book-length documents, and it might be problematic, since you would have few data points with a lot of variability.
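
A minimal sketch of what such an off-the-shelf baseline might look like (assuming scikit-learn is available; the tiny corpus, the labels, and the choice of character n-grams to densify very short documents are placeholders, not something prescribed by the answer above):

    # Minimal sketch of an off-the-shelf baseline for short documents.
    # `docs` and `labels` are placeholder data.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    docs = ["cheap meds online", "meeting at noon", "win a free prize", "lunch tomorrow?"]
    labels = ["spam", "ham", "spam", "ham"]

    # sublinear_tf and L2 normalization damp the effect of raw document length;
    # character n-grams give denser feature vectors for very short documents.
    model = make_pipeline(
        TfidfVectorizer(sublinear_tf=True, norm="l2", analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(),
    )
    model.fit(docs, labels)
    print(model.predict(["free meds prize"]))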

answered Oct 11 '10 at 12:08 by Alexandre Passos ♦ (edited Oct 11 '10 at 18:21 by Jurgen)

Thanks. Your answer is very helpful to me.

(Oct 11 '10 at 12:16) Jfly

This is a surprisingly poorly understood and poorly investigated area in text classification. As has been mentioned, there are some standard heuristics (such as the b parameter in BM25, cosine normalization, and pivoted normalization), but they were mostly developed for text retrieval and have been force-fit somewhat awkwardly into text classification.
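
As a rough illustration of the pivoted-normalization idea mentioned above (a sketch in the spirit of pivoted length normalization, not an exact reproduction of any paper; the slope value and toy counts are invented):

    import numpy as np

    def pivoted_normalize(tf_vector, doc_len, avg_len, slope=0.2):
        # Scale raw term frequencies by a pivoted length factor
        # instead of a plain cosine norm.
        norm = (1.0 - slope) + slope * (doc_len / avg_len)
        return np.asarray(tf_vector, dtype=float) / norm

    short_doc = [3, 1, 0]    # term counts from a 4-word document
    long_doc = [30, 10, 0]   # same profile, ten times longer
    avg_len = 22.0
    print(pivoted_normalize(short_doc, 4, avg_len))
    print(pivoted_normalize(long_doc, 40, avg_len))

With a slope between 0 and 1, the long document is penalized less than full cosine normalization would penalize it, but more than no normalization at all.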

The effect of very long documents depends on the particular machine learning method used. In general you would expect that a longer document provides more evidence about its content and so should be classifiable more confidently. However, many machine learning algorithms (multinomial Naive Bayes most notoriously) will become far too confident when given a very long document. Many of the heuristic term-weighting methods developed in text retrieval are effectively trying to find a balance that lets one be more confident about long documents, but not too much more confident.
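
A toy illustration of that overconfidence (the per-class word distributions and counts below are made up): under multinomial Naive Bayes the log-odds between two classes grow roughly linearly with document length, so simply repeating a document's content drives the posterior toward certainty without adding any new information.

    import numpy as np

    p_word_given_a = np.array([0.6, 0.4])   # hypothetical per-class word distributions
    p_word_given_b = np.array([0.4, 0.6])
    counts = np.array([3, 2])               # term counts for a short document

    for repeat in (1, 10, 100):
        log_odds = np.sum(repeat * counts * (np.log(p_word_given_a) - np.log(p_word_given_b)))
        prob_a = 1.0 / (1.0 + np.exp(-log_odds))
        print(repeat, round(prob_a, 6))     # roughly 0.6 -> 0.98 -> ~1.0 as length grows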

One difficulty in studying this area is the lack of test collections with widely varying document lengths. The CDIP collection we used in the TREC Legal track is nice from this standpoint, however, and could support some good work in this area. (Of course, there are the OCR errors to reckon with...)

answered Oct 20 '10 at 18:05 by Dave Lewis

Take a look at S.E. Robertson, K. Sparck Jones (1994), Simple, proven approaches to text retrieval.

Although this is not text categorization per se, they discuss the document length parameter b in the BM25 score. The discussion illustrates the range of assumptions that one can express through a typical document-length hyperparameter:

The constant b, which ranges between 0 and 1, modifies the effect of document length. If b=1 the assumption is that documents are long simply because they are repetitive, while if b=0 the assumption is that they are long because they are multitopic. Thus setting b towards 1, e.g. b=.75, will reduce the effect of term frequency on the ground that it is primarily attributable to verbosity. If b=0 there is no length adjustment effect, so greater length counts for more, on the assumption that it is not predominantly attributable to verbosity. We have found (in TREC) that setting b=.75 is helpful.
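
A small sketch of the BM25 term-frequency component showing how b trades off length normalization (k1 and the toy counts and lengths are just illustrative defaults, not values from the paper):

    def bm25_tf(tf, doc_len, avg_len, k1=1.2, b=0.75):
        # Saturating term-frequency weight; b controls how strongly length is normalized.
        norm = 1.0 - b + b * (doc_len / avg_len)
        return (tf * (k1 + 1.0)) / (tf + k1 * norm)

    # The same term count is worth less in a document twice the average length
    # when b > 0, and is unaffected by length when b = 0.
    for b in (0.0, 0.75, 1.0):
        long_w = bm25_tf(tf=5, doc_len=2000, avg_len=1000, b=b)
        short_w = bm25_tf(tf=5, doc_len=500, avg_len=1000, b=b)
        print(b, round(long_w, 3), round(short_w, 3))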

answered Oct 12 '10 at 13:12 by Joseph Turian ♦♦
