I am wondering what the influence of document length is on text categorization. What is the typical average document length in common tasks? Do I need to make any adjustments if the documents being analyzed are relatively short? I googled a lot and could not find any relevant paper. Please give me a hint. Thanks.

Document length does matter, and you will find many references to it in the literature. To deal with very short documents you might want to do dimensionality reduction to get a denser dataset, but even this is not strictly necessary: off-the-shelf naive Bayes or logistic regression classifiers should work reasonably well. If you have documents of wildly varying lengths, you might want to try different strategies for building the feature vectors. In general, most documents used in standard evaluations are the size of paper abstracts, email messages, or news stories, although there has been research on tweets that does not seem to require profoundly different techniques. I am not aware of work on classifying book-length documents; it might be problematic, since you would have few data points with a lot of variability.
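
As a minimal sketch of the kind of off-the-shelf pipeline described above (the library, toy data, and parameter values here are my own illustration, not something specified in the answer), assuming scikit-learn is available:

    # Minimal sketch, assuming scikit-learn; data and parameters are illustrative.
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD
    from sklearn.linear_model import LogisticRegression

    docs = ["cheap meds online now", "team meeting moved to noon",
            "win a free prize today", "minutes from the noon meeting"]
    labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),           # sparse bag-of-words features
        ("svd", TruncatedSVD(n_components=2)),  # optional: densify short documents
        ("clf", LogisticRegression()),          # off-the-shelf linear classifier
    ])
    pipeline.fit(docs, labels)
    print(pipeline.predict(["free meds prize"]))  # classify a new short document

Dropping the SVD step gives the plain sparse pipeline; as the answer notes, the dimensionality reduction is only worth keeping if the documents are short enough that the raw vectors are extremely sparse.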

Thanks. Your answer is very helpful to me.
(Oct 11 '10 at 12:16) Jfly

This is a surprisingly poorly understood and poorly investigated area in text classification. As has been mentioned, there are some standard heuristics (such as the b parameter in BM25, cosine normalization, pivoted normalization, and others), but they were mostly developed for text retrieval and have been force-fit somewhat awkwardly into text classification.

The effect of very long documents depends on the particular machine learning method used. In general you would expect that a longer document provides more evidence of its content and should be classifiable with more confidence. However, many machine learning algorithms (multinomial naive Bayes most notoriously) will become far too confident when given a very long document. Many of the heuristic term weighting methods developed in text retrieval are effectively trying to find a balance that lets one be more confident about long documents, but not too much more confident.

One difficulty in studying this area is the lack of test collections with widely varying document lengths. The CDIP collection we used in the TREC Legal track is nice from this standpoint, however, and could support some good work in this area. (Of course, there are the OCR errors to reckon with...)
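
To see the multinomial naive Bayes overconfidence concretely, here is a small sketch (the library and training data are my own illustration, assuming scikit-learn) that scores the same evidence at two different document lengths:

    # Sketch: multinomial NB grows more confident as the same words are
    # repeated, i.e., purely as a function of document length.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    train = ["stock market shares profit", "football match goal score"]
    y = ["finance", "sports"]

    vec = CountVectorizer()
    nb = MultinomialNB().fit(vec.fit_transform(train), y)

    short_doc = "market profit goal"
    long_doc = " ".join([short_doc] * 20)  # same words, 20x longer

    for doc in (short_doc, long_doc):
        proba = nb.predict_proba(vec.transform([doc]))[0]
        print(len(doc.split()), dict(zip(nb.classes_, proba.round(4))))

The short document gets a mildly confident posterior, while the 20x copy of it is classified with near-certainty, even though it carries no genuinely new evidence; this is exactly the effect the term-weighting heuristics above try to tame.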

Take a look at S.E. Robertson and K. Sparck Jones (1994), "Simple, proven approaches to text retrieval". Although this is not text categorization per se, they discuss the document-length parameter b in the BM25 score. That discussion illustrates the range of assumptions one can express through a typical document-length hyperparameter.
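
For reference, the BM25 term-frequency component in which b appears is usually written as follows (this is the standard textbook form, not a quotation from the paper):

    w(t, d) = \mathrm{idf}(t) \cdot
              \frac{f(t, d)\,(k_1 + 1)}
                   {f(t, d) + k_1 \left(1 - b + b \, \frac{|d|}{\mathrm{avgdl}}\right)}

where f(t, d) is the frequency of term t in document d, |d| is the document's length, and avgdl is the average document length in the collection. Setting b = 0 ignores document length entirely, b = 1 normalizes term frequency fully by relative document length, and intermediate values interpolate between those two assumptions.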