
I am curious about this growing subfield of transfer learning. The trouble is that I don't understand what makes it useful. In which domains do you actually have plenty of data lying around for a related problem? I also don't understand what transfer learning addresses that isn't already addressed by hierarchical Bayesian models.

asked Nov 08 '10 at 14:52


zaxtax ♦

edited Nov 09 '10 at 14:06


3 Answers:

Hierarchical Bayesian models are one of the easiest ways of justifying transfer learning. In general, transfer learning works best in scenarios where you would expect a hierarchical Bayesian model to do better than many independent models.
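To make the connection concrete, here is a toy hierarchical regression (the notation is illustrative, not taken from any particular paper). Each task $t$ has its own weights, but all tasks share a common prior, so data from one task informs the others through the shared hyperparameters:

$$\theta_t \sim \mathcal{N}(\mu, \tau^2 I), \qquad y_{t,i} \mid x_{t,i}, \theta_t \sim \mathcal{N}(x_{t,i}^\top \theta_t, \sigma^2).$$

Estimating $\mu$ and $\tau$ from all tasks jointly is exactly the kind of sharing a transfer learning method exploits: as $\tau \to \infty$ you recover independent per-task models, and as $\tau \to 0$ a single pooled model.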

One early example of transfer learning can be found in Fei-Fei Li's "One-shot learning of object categories". In object recognition, every classifier has to learn which features are good for one individual (type of) object. Transfer learning can help you determine which features are never good (like background features) and which features are probably not as good (because they're very good indicators of other objects), and allows a classifier to focus more readily on the good features that are unique to its object.

You can take this to the extreme with Palatucci et al.'s "Zero-shot learning with semantic output codes", where they train a mind-reading classifier from fMRI data for far more classes than they have labeled data for. They manage this with something akin to transfer learning: they use data from some classes to learn the structure of the data space (roughly speaking), mapping semantic attributes shared by words to the brain areas highlighted in the fMRI data. The performance on unseen labels is impressive.
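To give a flavour of the mechanism, here is a toy sketch of zero-shot classification through a shared semantic attribute space (all data, shapes and names below are made up for illustration; this is the general recipe, not the authors' code):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.RandomState(0)

# Fake data: 100 training examples with 20 input features (a stand-in for fMRI voxels),
# each labeled with one of 5 "seen" classes, and each class described by 8 semantic attributes.
X_train = rng.randn(100, 20)
seen_attributes = rng.rand(5, 8)         # attribute vector per seen class
y_train = rng.randint(0, 5, size=100)

# Step 1: learn a map from inputs into the shared semantic attribute space.
attribute_regressor = Ridge(alpha=1.0)
attribute_regressor.fit(X_train, seen_attributes[y_train])

# Step 2: classify test points against attribute prototypes of *unseen* classes
# by nearest neighbour in attribute space -- no labeled examples of those classes needed.
unseen_attributes = rng.rand(3, 8)       # attribute vectors for 3 unseen classes
X_test = rng.randn(10, 20)
predicted_attributes = attribute_regressor.predict(X_test)
distances = np.linalg.norm(
    predicted_attributes[:, None, :] - unseen_attributes[None, :, :], axis=2)
predicted_unseen_class = distances.argmin(axis=1)
```

The transfer happens in step 1: data from the seen classes is used to learn a mapping that generalizes to classes with no labeled examples at all.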

Another area ripe for transfer learning is natural language processing. Most work on deep learning for natural language processing (like Turian et al., http://www.iro.umontreal.ca/~lisa/pointeurs/turian-wordrepresentations-acl10.pdf, and for more background see Collobert's entertaining tutorial on deep learning for NLP) exploits the fact that there are many relationships between words (synonymy, co-occurrence patterns, etc.) that can be used to improve the performance of pretty much any classifier that works on words.
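A minimal sketch of how such word representations get reused (the vectors here are fake placeholders; in practice you would load embeddings trained on a large corpus):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Pretend these came from unsupervised training on a huge unlabeled corpus.
word_vectors = {"good": np.array([0.9, 0.1]),
                "bad": np.array([0.1, 0.9]),
                "movie": np.array([0.5, 0.5])}

def featurize(sentence):
    # Represent a sentence as the average of its word vectors (unknown words are skipped).
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

X = np.array([featurize(s) for s in ["good movie", "bad movie"]])
y = np.array([1, 0])   # 1 = positive, 0 = negative
clf = LogisticRegression().fit(X, y)
print(clf.predict([featurize("good")]))
```

The point is that the word representations were learned once, on a different (unsupervised) task, and then transferred as features to whatever supervised problem comes along.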

In this recent wave of deep learning papers, transfer learning (in the form of unsupervised pre-training) plays an important role by regularizing very complex models towards parameters that capture most of the variance in the data (in the hope that these parameters end up being useful for all sorts of different tasks).
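The same idea can be illustrated in miniature without a deep network (this is only a sketch of the principle, using PCA in place of unsupervised pre-training of a deep model):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_unlabeled = rng.randn(5000, 50)   # plentiful unlabeled data
X_labeled = rng.randn(30, 50)       # scarce labeled data
y_labeled = rng.randint(0, 2, size=30)

# "Pre-training": learn a representation from unlabeled data alone.
representation = PCA(n_components=10).fit(X_unlabeled)

# Supervised stage: fit a simple classifier in the learned representation.
clf = LogisticRegression().fit(representation.transform(X_labeled), y_labeled)
print(clf.predict(representation.transform(rng.randn(3, 50))))
```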

And last, but not least, this can be seen as analogous to human/animal learning, where each new task is learned with full knowledge of all previously learned tasks, and transfer learning happens liberally. Some cases are obvious: the more languages you know, the more easily you can pick up another one, and likewise with sports. It is genuinely hard to think of a scenario where there isn't, and never will be, a similar related task to be performed on the same data, and a lot of recent research involves, for example, jointly learning two related tasks (like POS tagging and NP chunking, or parsing and translation) in a way that improves upon the baselines for both tasks.

answered Nov 08 '10 at 15:28


Alexandre Passos ♦

But is hierarchical modeling a justification for transfer learning or a form of transfer learning?

(Nov 08 '10 at 19:05) zaxtax ♦

Both. You can justify why transfer learning is useful with the hierarchical-models argument: while some things are problem-specific, a lot can be shared. You can also very easily derive transfer learning algorithms from specific (implicit or explicit) hierarchical models.

(Nov 08 '10 at 19:45) Alexandre Passos ♦

Transfer learning is used whenever your training data comes from a different distribution than your test data, but you want to leverage the similarities. For example, you might have a lot of blog data to train on for a sentiment classification task, while your test data comes from instant messages. There are obvious differences between the two, but enough similarities too. Because of this, transfer learning settings often have unlabelled test data available at training time.

A transfer learning technique might pick out the features that are shared between the training and test data and learn only on those, or learn a classifier on the training data and then tune it on the test data (a minimal version of the latter is sketched below).

Why would the training and test distributions differ in the first place? Often it is because you don't have enough labelled examples from the test distribution, while you have a large number of labelled examples from a somewhat different one. For example, in document categorization you might have too little labelled Spanish data to train a classifier, but plenty of labelled English data which, while not exactly the same, can be used to help with the task.

There is also multitask learning, which can sometimes be seen as another flavour of transfer learning. It is used when the data distribution is the same but the tasks/labels are different.
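Here is a minimal sketch of the "train on the source, then tune on a little target data" option. All the data and shapes are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)

# Plentiful labelled source-domain data (e.g. blog posts).
X_source = rng.randn(2000, 100)
y_source = (X_source[:, 0] > 0).astype(int)

# A handful of labelled target-domain examples (e.g. instant messages),
# drawn from a slightly shifted distribution.
X_target = rng.randn(20, 100) + 0.5
y_target = (X_target[:, 0] > 0.5).astype(int)

clf = SGDClassifier(random_state=0)
clf.partial_fit(X_source, y_source, classes=np.array([0, 1]))  # learn on the source first
for _ in range(5):
    clf.partial_fit(X_target, y_target)                        # then nudge toward the target
```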

If you want to know more, this survey paper on transfer learning would be a good place to start: http://www.computer.org/portal/web/csdl/doi/10.1109/TKDE.2009.191

answered Nov 08 '10 at 16:02


priya venkateshan

This looks more like domain adaptation than transfer learning. For a deeper discussion on this difference, see this other question: http://metaoptimize.com/qa/questions/1139/difference-between-domain-adaptation-and-multitask-learning

(Nov 08 '10 at 16:06) Alexandre Passos ♦

For me, transfer learning is useful because it allows us to reduce the cost of data acquisition.

Different domains have different labeling costs: consider the problem of predicting the sentiment of political blog posts. Labeling blog posts is laborious and time-consuming. On the other hand, labeling product reviews is rather cheap (e.g. fetch reviews from Amazon). Arguably the two tasks are related, i.e. they share certain predictive structures (e.g. words such as excellent, good, awesome). Transferring those predictive structures to the target task allows us to reduce the number of labeled training examples it needs and thus to reduce the deployment cost of our final model.

Personally, I often think of transfer learning as an alternative to semi-supervised learning (SSL): whereas SSL tries to reduce the cost of data acquisition by using unlabeled data, transfer learning (or domain adaptation) does so by re-using existing labeled data.

PS: This view is certainly very biased towards a special case of Transfer Learning known as Domain Adaptation.

answered Nov 10 '10 at 10:32


Peter Prettenhofer
