|
If I build a classification model on textbook materials, will the model work for research papers or newspaper articles? I ask because I think some materials are more convenient to classify manually. For example, a textbook can be treated as a collection of already-classified articles, which saves time when creating the training dataset. Thanks. |
|
Yes it does. Adapting a classifier trained on, say, newswire to other styles of data is an active research problem. Usually you do it either with a bit of supervision (a small amount of correctly labeled data in the new domain) or without (a lot of unlabeled data in the target domain). For supervised domain adaptation, the easiest approach I've seen (and implemented) is Daume's work. There will be a semi-supervised extension, but I can't yet find a PDF. He also has a paper that uses active learning for that purpose, but I haven't read it yet. I don't know the best reference on semi-supervised domain adaptation, though. You might check this year's ICML tutorial, "Domain Adaptation".
Thanks a lot!
(Jul 02 at 21:11)
Jfly
1
I'll also recommend John Blitzer's adaptation work. See this paper for an example application to sentiment analysis. The difference from the Daume work is that Blitzer doesn't assume labeled data in the new domain. His method uses only unlabeled data.
(Jul 02 at 22:27)
aria42
1
If you want to look at such problems beyond the field of NLP/IR, a more general term for this is "transfer learning".
(Jul 03 at 05:08)
zeno
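To make the supervised domain adaptation mentioned in the answer above more concrete: assuming the Daume work referred to is the feature-augmentation idea, each feature is duplicated into a shared copy and a domain-specific copy, and any standard classifier is then trained on the augmented features. The name `augment` and the `shared:` prefix are made up for illustration; this is a minimal sketch under that assumption, not the author's code.

```python
def augment(features, domain):
    """Duplicate every feature into a shared copy and a domain-specific copy.

    features: dict mapping feature name -> value (e.g. bag-of-words counts).
    domain:   hypothetical domain tag such as "textbook" or "news".
    """
    augmented = {}
    for name, value in features.items():
        augmented["shared:" + name] = value      # weight learned from all domains
        augmented[domain + ":" + name] = value   # weight learned from this domain only
    return augmented


# Usage: augment source and target examples with their own domain tag,
# then train one ordinary classifier on the union of both sets.
textbook_example = augment({"theorem": 2, "proof": 1}, "textbook")
news_example = augment({"market": 1, "proof": 1}, "news")
```

The shared copies let the classifier reuse what behaves the same way in both domains, while the domain-specific copies let it correct for words that behave differently.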
|
|
The short answer is that potentially everything matters. A supervised learning algorithm will sometimes latch on to peculiar characteristics of your training data in a fashion that doesn't transfer well to a new data set. This can happen even with two different news wires, though of course it's more likely the more different the two types of data are. Looking at the trained model as a sanity check is always a good idea. Fortunately, as others have mentioned, there's been a lot of work on domain adaptation and transfer learning. If the two types of data are very different, one possibility is to use the first data set to set a Bayesian prior for learning a model on the second. Dayanik, Madigan, Genkin, Menkov, and I had a paper in SIGIR 2006 on this. |
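One generic way to read "use the first data set to set a Bayesian prior" is MAP logistic regression with a Gaussian prior centered on weights learned from the first data set. The sketch below assumes that reading; it is not necessarily the formulation used in the SIGIR 2006 paper, and `fit_logreg`, `prior_mean` and `prior_strength` are illustrative names.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logreg(X, y, prior_mean, prior_strength=1.0, lr=0.1, epochs=500):
    """Gradient descent on: log-loss + (prior_strength / 2) * ||w - prior_mean||^2.

    X: list of feature vectors from the second (target) data set.
    y: list of 0/1 labels.
    prior_mean: weights learned on the first (source) data set.
    """
    w = list(prior_mean)          # start at the prior mean
    n = max(len(X), 1)            # avoid dividing by zero when there is no data
    for _ in range(epochs):
        # Gradient of the prior term pulls each weight back toward prior_mean.
        grad = [prior_strength * (wj - mj) for wj, mj in zip(w, prior_mean)]
        # Gradient of the log-loss on the target data.
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j, xj in enumerate(xi):
                grad[j] += (p - yi) * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w
```

With little target data the weights stay near the source model; features that never occur in the target data keep their source weight exactly, and more target data gradually overrides the prior.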
|
I would not say that style per se really matters, but vocabulary obviously does. If you use a bag of words / tokens / n-grams as input features for your classifier, the classifier will not be able to use the information in tokens it never saw during training, even though they might be very important for your test set. I think to answer this question you need to train models on your specific task and measure the variance of precision and recall by cross-validating with folds that come from your different datasets (textbooks, research papers, news, Wikipedia, ...). It also probably depends heavily on the size of your training corpus and on the number and nature of the classes / labels you target.
Theoretically, if the training corpus is big enough, the important terms should already be included. The problem is that different styles may have different types of noise. Am I right? Moreover, if I have the time to cross-validate one model on different datasets, maybe I could just develop different models for different datasets? Thanks.
(Jul 02 at 20:05)
Jfly
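The evaluation suggested in the answer above could be set up roughly as follows: hold out one dataset at a time, train on the rest, and also check how much of the held-out vocabulary was never seen in training. This is a minimal sketch with whitespace tokenization; `leave_one_domain_out` and `oov_rate` are hypothetical helper names, and the classifier itself is left out.

```python
def leave_one_domain_out(examples):
    """Yield (held_out_domain, train, test) splits.

    examples: list of (domain, text, label) triples, where domain is a tag
    such as "textbook", "paper" or "news".
    """
    domains = sorted({domain for domain, _, _ in examples})
    for held_out in domains:
        train = [(text, label) for d, text, label in examples if d != held_out]
        test = [(text, label) for d, text, label in examples if d == held_out]
        yield held_out, train, test


def oov_rate(train_texts, test_texts):
    """Fraction of test tokens that never occur in the training texts."""
    vocab = {tok for text in train_texts for tok in text.lower().split()}
    test_tokens = [tok for text in test_texts for tok in text.lower().split()]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return unseen / len(test_tokens)
```

For each split, train a classifier on `train`, score it on `test`, and compare precision and recall across held-out domains; a high `oov_rate` for a domain is a hint that vocabulary mismatch, rather than style, is what hurts there.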
|