I am curious about this growing subfield of Transfer Learning. The trouble is I don't understand what makes it useful. In which domains do you have plenty of data lying around from a related problem? I also don't understand what transfer learning addresses that is not already addressed by hierarchical Bayesian models.
Hierarchical Bayesian models are one of the easiest ways of justifying transfer learning. In general, transfer learning works best in scenarios where you expect a hierarchical Bayesian model to do better than many independent models.

One early example of transfer learning can be found in Fei-Fei Li's "One-shot learning of object categories". In object recognition, every classifier has to learn which features are good for one individual (type of) object. Transfer learning can help you determine which features are never good (like background features) and which features are probably not as good (because they're very good indicators of other objects), and allow a classifier to focus more readily on the good features that are unique to its object. You can take this to the extreme with Palatucci et al.'s "Zero-shot learning with semantic output codes", where they train a mind-reading classifier from fMRI data for far more classes than they have labeled data for. They manage to do this with something akin to transfer learning: they use data from some classes to learn the structure of data-space (kind of), mapping semantic attributes shared by words to highlighted areas of the brain in the fMRI data. The performance on unseen labels is impressive.

Another area ripe for transfer learning is natural language processing. Most work on deep learning for natural language processing (like Turian et al., http://www.iro.umontreal.ca/~lisa/pointeurs/turian-wordrepresentations-acl10.pdf; for more, see Collobert's entertaining tutorial on deep learning for NLP) exploits the fact that there are many relationships between words that can be used to improve the performance of pretty much any classifier that works on words (things like synonymy, co-occurrence patterns, etc.). In this recent wave of deep learning papers, transfer learning (in the form of unsupervised pre-training) plays an important role by regularizing very complex models to have parameters that account for most of the variance in the data, in the hope that these parameters end up being useful for all sorts of different tasks.

Last, but not least, this can be seen as analogous to human/animal learning, where each new task is learned with full knowledge of all previously learned tasks, and transfer learning happens liberally. Some things are obvious: the more languages you know, the more easily you can pick up another one, and likewise with sports. It is indeed hard to think of a scenario where there isn't, and never will be, a similar related task to be performed on the same data, and a lot of recent research involves, for example, jointly learning two related tasks (like POS tagging and NP chunking, or parsing and translation) in a way that improves upon the baselines for both tasks.

But is hierarchical modeling a justification for transfer learning or a form of transfer learning?
(Nov 08 '10 at 19:05)
zaxtax ♦
Both. You can justify why transfer learning is useful with the hierarchical-models argument: while there are some problem-specific things, a lot can be shared. You can also very easily create transfer learning algorithms from specific (implicit or explicit) hierarchical models (a minimal sketch of this shared-prior construction follows below).
(Nov 08 '10 at 19:45)
Alexandre Passos ♦
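To make the shared-prior idea in this exchange concrete, here is a minimal sketch (my own illustration, not code from the thread) of a hierarchical regression model for transfer learning: each task keeps its own parameters, but all of them are shrunk toward a mean estimated from every task together, so tasks with little data borrow strength from the others.

    import numpy as np

    # Crude MAP/EM-style sketch of a hierarchical (shared-prior) regression:
    # theta_t ~ N(mu, tau2 * I) for each task t, and mu is pooled across tasks.
    # The function name and the fixed priors are my own choices for illustration.
    def fit_shared_prior(tasks, sigma2=1.0, tau2=10.0, n_iters=20):
        """tasks: list of (X, y) pairs, one per task; returns (mu, per-task thetas)."""
        d = tasks[0][0].shape[1]
        mu = np.zeros(d)
        thetas = [np.zeros(d) for _ in tasks]
        for _ in range(n_iters):
            # Each task's weights are a ridge solution shrunk toward the shared mean mu.
            for t, (X, y) in enumerate(tasks):
                A = X.T @ X / sigma2 + np.eye(d) / tau2
                b = X.T @ y / sigma2 + mu / tau2
                thetas[t] = np.linalg.solve(A, b)
            # The shared mean pools information from every task; this is where transfer happens.
            mu = np.mean(thetas, axis=0)
        return mu, thetas

A task with only a handful of examples gets parameters pulled strongly toward mu, the part every task agrees on, which is exactly the hierarchical-Bayesian justification for transfer learning discussed above; a full Bayesian treatment would also infer sigma2 and tau2 instead of fixing them.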
Transfer Learning is used whenever your training data is from a different distribution than your test data, but you want to leverage the similarities. For example, you might have a lot of data from blogs to train on for a sentiment classification task, but your test data is from instant messages. There are obvious differences in the data, but enough similarities too.
Because of this, transfer learning often has unlabelled test data available at training time. If you want to know more, this survey paper on transfer learning would be a good place to start: http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2009.191
This looks more like domain adaptation than transfer learning. For a deeper discussion of this difference, see this other question: http://metaoptimize.com/qa/questions/1139/difference-between-domain-adaptation-and-multitask-learning
(Nov 08 '10 at 16:06)
Alexandre Passos ♦
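For the setting described in this answer (lots of labelled source data, e.g. blogs, plus unlabelled data from the target/test domain, e.g. instant messages), one standard recipe is importance weighting. The sketch below is my own illustration with placeholder arguments, not something prescribed in the answer.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def importance_weighted_fit(X_src, y_src, X_tgt_unlabelled):
        # 1. Train a "domain classifier" that tells labelled source examples
        #    apart from the unlabelled target examples.
        X_dom = np.vstack([X_src, X_tgt_unlabelled])
        y_dom = np.concatenate([np.zeros(len(X_src)), np.ones(len(X_tgt_unlabelled))])
        dom_clf = LogisticRegression(max_iter=1000).fit(X_dom, y_dom)

        # 2. Weight each source example by how target-like it looks,
        #    w(x) ~ p(target | x) / p(source | x), so training emphasises the
        #    part of the source distribution that overlaps with the target.
        p_tgt = dom_clf.predict_proba(X_src)[:, 1]
        weights = p_tgt / np.clip(1.0 - p_tgt, 1e-6, None)

        # 3. Fit the actual task classifier on the reweighted source data.
        return LogisticRegression(max_iter=1000).fit(X_src, y_src, sample_weight=weights)

This only addresses the covariate-shift part of the problem; when the relationship between features and labels itself changes across domains, richer domain-adaptation methods (feature augmentation, shared representations, etc.) are needed.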
For me, transfer learning is useful because it allows us to reduce the cost of data acquisition. Different domains have different labeling costs: consider the problem of predicting the sentiment of political blog posts. Labeling blog posts is laborious and time-consuming. On the other hand, labeling product reviews is rather cheap (e.g. fetch reviews from Amazon, where the star ratings come for free). Arguably the two tasks are related, i.e. they share certain predictive structures (e.g. words such as excellent, good, awesome). Transferring those predictive structures to the target task allows us to reduce the number of labeled training examples needed for the target task and, thus, to reduce the deployment costs of our final model.

Personally, I often think of transfer learning as an alternative to semi-supervised learning (SSL): whereas SSL tries to reduce the cost of data acquisition by using unlabeled data, TL (or domain adaptation) does so by re-using existing labeled data.

PS: This view is certainly very biased towards a special case of Transfer Learning known as Domain Adaptation.
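As a toy illustration of re-using cheap source labels (my own sketch, with made-up mini datasets, not code from this answer): fit a sentiment classifier on plentiful labelled product reviews, then warm-start from those weights when fitting on the few labelled blog posts, so the shared predictive structure (words like "excellent" or "awesome") does not have to be relearned from scratch.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier

    # Tiny made-up datasets, purely for illustration.
    review_texts = ["excellent product, works great", "awful, it broke after a day",
                    "good value, awesome battery", "terrible support, very bad"]
    review_labels = [1, 0, 1, 0]   # cheap labels (e.g. derived from star ratings)
    blog_texts = ["an awesome, excellent policy speech", "a terrible, bad debate"]
    blog_labels = [1, 0]           # the few expensive labels we could afford

    # A shared feature space: both domains are represented over the same words.
    vectorizer = TfidfVectorizer()
    X_src = vectorizer.fit_transform(review_texts)
    X_tgt = vectorizer.transform(blog_texts)

    # Learn which words predict sentiment from the cheap source labels...
    clf = SGDClassifier(loss="log_loss", warm_start=True, random_state=0)
    clf.fit(X_src, review_labels)

    # ...then refit on the small target set, starting from the source weights
    # (warm_start=True) instead of from scratch.
    clf.fit(X_tgt, blog_labels)

How much the second fit helps with so few target labels obviously depends on how much predictive structure the two domains really share, which is exactly the assumption behind this view of transfer learning.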