Text classifiers trained by supervised learning can easily latch on to accidental characteristics of the training set. This is a particular problem in time-varying data streams such as newswire articles. Some classic examples are the "dead Jesuits" of the MUC-3 evaluations and the tendency for TDT systems to think that airliners will keep crashing in the same town. Does anyone know of an article that talks about this phenomenon in general, preferably with juicy examples?

asked Oct 05 '10 at 13:21
Dave Lewis

edited Apr 11 '11 at 04:44
Joseph Turian ♦♦

2 Answers:

This is an interesting question which surfaces daily in all sorts of organizations. On the one hand, we don't want to trust the computer "too much", as mechanical analysis can, as you note, latch onto things which are completely nonsensical. Hence, many organizations review the output of such analytics, from the point of view of "sanity checks". On the other hand, the whole point behind using computer-based analysis is that humans, too, are fallible, and likewise draw mistaken connections.

There is an old story about a large computer system that would fail regularly on a particular day of the week. None of its human handlers could think of a reason for this pattern. There were no especially high loads on the computer that day of the week, and other obvious causes were ruled out. As it turned out, the reason for these periodic failures was that the air intake for the computer's cooling apparatus was jammed by grass clippings kicked up during the weekly mowing of the lawn. I have wondered whether a data mining system, supplied with all the workings of this business, including the computer failure reports and the lawn care schedule, might not have found this correlation. Perhaps more importantly, would anyone have believed it if it had?

answered Jan 28 '11 at 07:13
Will Dwinnell

I don't know offhand of any work that discusses this particular phenomenon, but it seems that the underlying issue is that the training domain is not representative of the test data.

One technique for alleviating this would be to gather a large, representative, unlabelled corpus, and do semi-supervised learning. A hyperparameter could control the amount by which the unlabelled data is used to regularize the model.
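As a rough illustration (not from the original answer; the data, variable names, and the `pseudo_label_threshold` value below are all hypothetical, and self-training is only one of several semi-supervised setups one could use), a confidence threshold can play the role of that hyperparameter, controlling how much of the unlabelled corpus is allowed to influence the final model:

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical corpora: a small labelled set plus an unlabelled pool.
labeled_texts = ["plane crashes near smallville airport",
                 "markets rally on strong earnings"]
labels = np.array([0, 1])                      # e.g. 0 = disaster, 1 = finance
unlabeled_texts = ["another aircraft incident reported",
                   "stocks slide after weak report"]

# Hyperparameter: a higher threshold means the unlabelled data is used less.
pseudo_label_threshold = 0.9

vectorizer = TfidfVectorizer()
X_all = vectorizer.fit_transform(labeled_texts + unlabeled_texts)
X_l, X_u = X_all[: len(labeled_texts)], X_all[len(labeled_texts):]

# 1. Fit on the labelled data alone.
clf = LogisticRegression(max_iter=1000).fit(X_l, labels)

# 2. Pseudo-label only the unlabelled documents the model is confident about,
#    then refit on the augmented training set.
proba = clf.predict_proba(X_u)
confident = proba.max(axis=1) >= pseudo_label_threshold
if confident.any():
    X_aug = vstack([X_l, X_u[confident]])
    y_aug = np.concatenate([labels, proba[confident].argmax(axis=1)])
    clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)
```

Raising the threshold towards 1.0 effectively falls back to the purely supervised model; lowering it lets more of the unlabelled pool shape the decision boundary.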

answered Oct 06 '10 at 11:32
Joseph Turian ♦♦

Yes, there are various techniques one might use to address this. My concern in this question is not so much predictive accuracy on particular test data as the reaction a client has when actually looking at the model. Even if the training and test sets have the same distribution, and even if the bizarre features are helping, a client may distrust the model if it appears to be relying on stupid features.

(Oct 07 '10 at 08:44) Dave Lewis
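One concrete way to do the kind of "looking at the model" Dave describes is to list the highest-weight terms of a linear classifier and let a reviewer scan them for spurious features. A minimal sketch, assuming a fitted LogisticRegression and TfidfVectorizer such as those in the answer above (the function name is made up for illustration):

```python
import numpy as np

def show_top_features(clf, vectorizer, k=20):
    """Print the k terms with the largest absolute weights of a fitted
    linear text classifier, so a reviewer can spot suspicious features
    such as place names, bylines, or dates."""
    names = np.asarray(vectorizer.get_feature_names_out())  # scikit-learn >= 1.0
    weights = clf.coef_[0]               # binary case: a single weight vector
    top = np.argsort(np.abs(weights))[::-1][:k]
    for name, w in zip(names[top], weights[top]):
        print(f"{name:20s} {w:+.3f}")
```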

Can you give some examples? I'm seeing something that I think is similar, and I am wondering whether we are actually observing the same phenomenon.

(May 07 '11 at 01:18) Joseph Turian ♦♦