
Suppose I have a task for which I have only a small corpus (this is inspired by this question, and by a small sentiment analysis corpus in a current project). I would like to do semi-supervised learning and supplement the small corpus with a large, representative unlabeled corpus.

Representative is the operative word. For example, in sentiment analysis, it would not make sense to use Wikipedia as a large corpus, because Wikipedia language is generally intended to be objective. For a biomedical application, it would not make sense to use a corpus of news articles.

Given a particular small corpus, how do I select a representative large corpus?

asked Aug 09 '10 at 13:15

Joseph Turian ♦♦

edited Dec 03 '10 at 07:08

Alexandre Passos ♦


One Answer:

A quick-and-dirty approach is:

  1. Find a very large (superset) corpus that will contain many documents of the desired type.

    For example, if you are looking for subjective sentences in English, the UKWAC (a 2 billion word corpus constructed from the Web limiting the crawl to the .uk domain) would be a good superset corpus.

  2. Perform retrieval over the superset corpus, using the documents of the smaller corpus as queries; i.e., draw a biased sample of the superset.

    For example, index the superset corpus using Lucene and, for each document in the smaller corpus, retrieve the top ten indexed documents using the small corpus document as the search query. I share source code implementing this approach in a repo called biased-text-sample.

The devil is in the details, but even a naive implementation of the above could select a more representative large corpus than uniform sampling of the superset corpus.
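The two steps above can be sketched in a few lines. This is only a minimal illustration, not the code from biased-text-sample: it assumes the superset fits in memory and swaps Lucene for a plain TF-IDF cosine ranking, and the names `tokenize`, `build_index`, and `biased_sample` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def normalize(vec):
    """Scale a sparse vector (dict) to unit L2 length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def build_index(superset_docs):
    """Return unit-length TF-IDF vectors and the IDF table for the superset."""
    counts = [Counter(tokenize(doc)) for doc in superset_docs]
    n = len(superset_docs)
    df = Counter()
    for c in counts:
        df.update(c.keys())
    # Smoothed IDF; an assumption here, not Lucene's scoring formula.
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = [normalize({t: tf * idf[t] for t, tf in c.items()}) for c in counts]
    return vectors, idf


def biased_sample(small_docs, superset_docs, top_k=10):
    """Use each small-corpus document as a query; keep its top_k matches."""
    vectors, idf = build_index(superset_docs)
    selected = set()
    for query in small_docs:
        q_counts = Counter(tokenize(query))
        # Terms unseen in the superset cannot match anything, so drop them.
        q = normalize({t: tf * idf[t] for t, tf in q_counts.items() if t in idf})
        scored = []
        for i, vec in enumerate(vectors):
            score = sum(w * vec.get(t, 0.0) for t, w in q.items())
            if score > 0:
                scored.append((score, i))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        selected.update(i for _, i in scored[:top_k])
    return sorted(selected)


# Toy data, invented for illustration only.
superset = [
    "The film was wonderful and I loved every minute",
    "The committee met on Tuesday to approve the budget",
    "I hated the ending, what a terrible film",
    "Rainfall totals for the region were recorded in March",
]
small = ["I loved this film", "terrible, I hated it"]
print(biased_sample(small, superset, top_k=1))  # prints [0, 2]
```

The returned indices are deduplicated, so a superset document retrieved by several queries is included only once. For a superset the size of UKWAC, a real inverted index such as Lucene would replace the linear scan over `vectors`.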

answered Aug 09 '10 at 13:27

Joseph Turian ♦♦

Maybe you should write a (workshop?) paper on this as a sort-of-general way of augmenting training data. Evaluating on CoNLL tasks would be nice, and you probably wouldn't need more than UKWAC for many of those.

(Aug 09 '10 at 19:49) Alexandre Passos ♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.