When attempting to do transfer learning on a corpus of documents, do I take into account only the words (features) that are present in the training set, or do I also take into account the words in the test set?
One of the approaches is to actually identify "pivot features" that are good for both sets. See the paper "Domain adaptation with structural correspondence learning", which describes this approach.

Yes, that's an approach I considered. But right now, I'm trying to implement the algorithm described in this paper: "Transferring Naive Bayes Classifiers for Text Classification", where you perform EM on an existing Naive Bayes classifier. What I'm wondering about is whether to use only the words in the training set to build the Naive Bayes classifier, or whether I should also consider the words present in the test set. Ordinarily, it wouldn't make a difference, but this paper seems to use the total number of unique words in a few calculations.
(Aug 09 '10 at 01:34)
priya venkateshan
If the size of the weight vectors you are learning in the source and target domains is the same, then you should use the union of words from both domains (I think you are using a bag-of-words representation). In this case, the source-domain classifier would usually put zero weights on words/features present only in the target (and not the source); see the sketch below.
(Aug 09 '10 at 01:50)
spinxl39
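For illustration, here is a minimal sketch of the "union vocabulary" suggestion above, using a bag-of-words representation; the documents, labels, and variable names are made up, and this is not code from the thread or the cited papers.

```python
# Sketch: fit a shared vocabulary over both domains, train only on the source.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

source_docs = ["cheap meds online", "meeting at noon"]       # labelled (training) domain
source_labels = [1, 0]
target_docs = ["free crypto giveaway", "lunch rescheduled"]  # unlabelled (test/target) domain

# Fit the vocabulary on BOTH domains so the feature space is shared.
vectorizer = CountVectorizer()
vectorizer.fit(source_docs + target_docs)

X_source = vectorizer.transform(source_docs)
X_target = vectorizer.transform(target_docs)

# Words that occur only in the target domain give all-zero columns in
# X_source, so they contribute nothing to the fitted source model until
# something like EM re-estimates their weights from the target data.
clf = MultinomialNB().fit(X_source, source_labels)
print(clf.predict(X_target))
```

Fitting the vectorizer on both domains only fixes a shared feature space; the classifier itself is still trained on the labelled source documents alone.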
Sounds good, thanks. Also, is it standard practice to use the union of words from the training and test sets in text mining generally? I have seen this in academia, but is the same practice followed in industry as well, or do they have other concerns that keep them from using the union of words to build the feature vector?
(Aug 09 '10 at 02:00)
priya venkateshan
I don't really know what the practice in industry is. :) Using a union of words isn't the only approach, though. Different transfer learning algorithms use different ways to construct features. For example, the SCL paper I cited above doesn't actually use the union of words from both domains but does something very different (identifying pivot features); the "frustratingly easy domain adaptation" paper by Hal Daume does a simple feature augmentation of both source and target features (sketched below). So it depends on the method.
(Aug 09 '10 at 02:21)
spinxl39
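As a concrete (hedged) illustration of the feature-augmentation idea mentioned above, here is a small sketch of the Daume-style augmentation: each example keeps a shared copy of its features plus a domain-specific copy, with the other domain's slot zeroed. The data and function name are hypothetical.

```python
import numpy as np

def augment(X, domain):
    """Return [shared, source-only, target-only] copies of the feature block."""
    zeros = np.zeros_like(X)
    if domain == "source":
        return np.hstack([X, X, zeros])
    else:
        return np.hstack([X, zeros, X])

# Hypothetical 3-feature bag-of-words vectors.
X_src = np.array([[1, 0, 2], [0, 1, 0]])
X_tgt = np.array([[0, 2, 1]])
print(augment(X_src, "source").shape)  # (2, 9)
print(augment(X_tgt, "target").shape)  # (1, 9)
```

A standard linear learner trained on the augmented vectors can then put weight on the shared copy for features that behave the same way in both domains, and on the domain-specific copies otherwise.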
Sure, there are many different ways to select features in transfer learning and domain adaptation. This paper http://www.cs.ust.hk/~sinnopan/publications/TLsurvey_0822.pdf [pdf] summarizes them. What about text mining in general? What is generally done there with respect to the words?
(Aug 09 '10 at 02:32)
priya venkateshan
I don't know about specific text mining problems, but most generic approaches to transfer learning and domain adaptation, such as distribution matching of the source and target distributions (e.g., MMD), are kind of black-box in the sense that you don't need to know much about the features in the data (hence oblivious to the feature representation used). These approaches work reasonably well on text data too (a rough MMD sketch is below). There are other approaches, however, such as SCL (the "Domain adaptation with structural correspondence learning" paper), that work by a careful selection of useful features. IMO, this can often be tricky unless you have considerable domain knowledge to do the required feature engineering, but practitioners have found it useful on some problems.
(Aug 09 '10 at 03:47)
spinxl39
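To make the distribution-matching remark concrete, here is a rough, hedged sketch of a biased empirical MMD estimate with an RBF kernel. It only assumes you have feature vectors from the two domains, which is what makes this kind of method "black-box"; the data below is synthetic.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Pairwise squared Euclidean distances, then the Gaussian kernel.
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mmd2(X_source, X_target, gamma=1.0):
    # Biased empirical estimate: E[k(s,s')] + E[k(t,t')] - 2 E[k(s,t)].
    k_ss = rbf_kernel(X_source, X_source, gamma).mean()
    k_tt = rbf_kernel(X_target, X_target, gamma).mean()
    k_st = rbf_kernel(X_source, X_target, gamma).mean()
    return k_ss + k_tt - 2 * k_st

# Hypothetical dense document vectors from the two domains.
rng = np.random.default_rng(0)
print(mmd2(rng.normal(size=(5, 10)), rng.normal(size=(6, 10)) + 0.5))
```

A larger value suggests the two domains' feature distributions differ more, without the method ever needing to know what the individual features mean.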
Priya - As usual, it depends. In many applications, you have to train a model and apply it to streaming data, so it would be very unusual to make use of knowledge of the test "set" features. At the other extreme, in relevance-feedback-style applications of text mining, where you're doing learning on a more-or-less fixed set of documents, it would be routine to use knowledge of the set of test features, e.g. for computing IDF weights (a small sketch of that case follows). As for the original question of domain adaptation, I can see arguments both ways. Also, I haven't come across much explicit use of domain adaptation algorithms in industry, though one often sees heuristic kludges, like weighting some training examples less than others.
(Nov 09 '10 at 20:35)
Dave Lewis
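As a hedged illustration of the fixed-collection case Dave describes, the sketch below computes IDF statistics over the whole collection (training plus test documents) while training only on the labelled training documents; the documents, labels, and names are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

train_docs = ["grain prices rise", "new film released"]
train_labels = [0, 1]
test_docs = ["grain exports fall", "film festival opens"]

vectorizer = TfidfVectorizer()
vectorizer.fit(train_docs + test_docs)   # IDF weights come from the whole collection

X_train = vectorizer.transform(train_docs)
X_test = vectorizer.transform(test_docs)

# Labels are only used from the training documents.
model = LogisticRegression().fit(X_train, train_labels)
print(model.predict(X_test))
```

In a streaming setting you could not do this, since the "test" documents are not available when the model is fit.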
If you're doing EM, as you stated in the comments, then you should include the test words in your model; otherwise, you risk losing a bit of performance. What EM does is, after classifying the target documents with the source-domain model, adjust the feature weights of the classes to better reflect the new distribution, then reclassify, and so on. So if you have a feature that didn't appear in the source data, it will have zero weight, but if it correlates well with the target classes, EM should find it useful and use it to improve the model (a rough sketch of this loop is below).

Thanks, that helps.
(Aug 10 '10 at 01:09)
priya venkateshan
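Below is a minimal, hedged sketch of the EM loop described in this answer. It is not the exact algorithm from the "Transferring Naive Bayes Classifiers for Text Classification" paper (which weights source and target statistics differently and keeps the target posteriors soft); it only illustrates the classify / re-estimate / reclassify cycle on a union vocabulary, with made-up documents.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

source_docs = ["buy cheap pills", "project meeting tomorrow"]
source_labels = np.array([1, 0])
target_docs = ["cheap watches for sale", "schedule the team meeting"]

vec = CountVectorizer()
vec.fit(source_docs + target_docs)              # union vocabulary over both domains
Xs, Xt = vec.transform(source_docs), vec.transform(target_docs)

nb = MultinomialNB().fit(Xs, source_labels)     # initial source-only model
for _ in range(10):                             # EM-style iterations
    # E-step: class posteriors for the unlabelled target documents.
    resp = nb.predict_proba(Xt)
    # M-step: refit on the source labels plus labels for the target documents
    # (hardened here for brevity; an EM-for-NB treatment keeps them soft).
    hard = resp.argmax(axis=1)
    X_all = np.vstack([Xs.toarray(), Xt.toarray()])
    y_all = np.concatenate([source_labels, hard])
    nb = MultinomialNB().fit(X_all, y_all)

print(nb.predict(Xt))
```

Target-only words start with no influence, but once the target documents enter the re-estimation step they can pick up non-trivial weights, which is the effect described above.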
In general, it's best to use the test set if it's available.
Well... if you have access to the test set, it's best to use only the features in the test set; features which are present only in the training set won't help you at test time.
Doesn't that defeat the purpose of having a test set (overfitting!)? You want to build a model that is generalisable, not just really good on one particular set.