Let's say I have a task with only a small labeled corpus (inspired by this question, and by a small sentiment analysis corpus in a current project). I would like to do semi-supervised learning and supplement the small corpus with a representative large unlabeled corpus. Representative is the operative word. For example, in sentiment analysis it would not make sense to use Wikipedia as the large corpus, because Wikipedia's language is generally intended to be objective; for a biomedical application, it would not make sense to use a corpus of news articles. Given a particular small corpus, how do I select a representative large corpus?

A quick-and-dirty approach is to score the documents in a large superset corpus by how well they match the small corpus and keep only the best-matching documents, as in the sketch below. The devil is in the details, but even a naive implementation of this could select a more representative large corpus than uniform sampling of the superset corpus.
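One way to make that concrete, purely as an illustration, is to train a simple language model on the small labeled corpus, score every document in the superset with it, and keep the highest-scoring documents. The sketch below does this with an add-one-smoothed unigram model; the tokenizer, the `keep` fraction, and the toy corpora are assumptions for the example, not details prescribed by the answer above.

    import math
    import re
    from collections import Counter

    def tokenize(text):
        """Lowercase word tokenizer; good enough for a rough similarity score."""
        return re.findall(r"[a-z']+", text.lower())

    def train_unigram_lm(docs):
        """Unigram counts over the small labeled corpus."""
        counts = Counter()
        for doc in docs:
            counts.update(tokenize(doc))
        total = sum(counts.values())
        vocab = len(counts)
        return counts, total, vocab

    def avg_logprob(doc, counts, total, vocab):
        """Mean per-token log-probability under the small-corpus unigram model,
        with add-one smoothing so unseen words do not zero out the score."""
        tokens = tokenize(doc)
        if not tokens:
            return float("-inf")
        lp = sum(math.log((counts[t] + 1) / (total + vocab)) for t in tokens)
        return lp / len(tokens)

    def select_representative(small_corpus, superset_docs, keep=0.1):
        """Score every superset document against the small corpus and keep the
        top `keep` fraction, i.e. the documents the small-corpus LM prefers."""
        counts, total, vocab = train_unigram_lm(small_corpus)
        scored = sorted(superset_docs,
                        key=lambda d: avg_logprob(d, counts, total, vocab),
                        reverse=True)
        return scored[:max(1, int(keep * len(scored)))]

    # Toy usage: the movie-related superset documents should be selected
    # ahead of the finance and biomedical ones.
    small = ["this movie was great", "terrible acting, awful plot"]
    superset = ["the film was wonderful", "stock prices rose on tuesday",
                "worst movie I have seen", "the protein binds to the receptor"]
    print(select_representative(small, superset, keep=0.5))

In practice the unigram model could be replaced by anything that measures closeness to the small corpus (an n-gram language model, tf-idf cosine similarity, a domain classifier); the selection step stays the same.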
Maybe you should write a (workshop?) paper on this as a sort-of-general way of augmenting training data. Evaluating on CoNLL tasks would be nice, and you probably wouldn't need more than UKWAC for many of those.
— Alexandre Passos ♦ (Aug 09 '10 at 19:49)