Many of the simpler active learning algorithms revolve around these general steps (a rough code sketch follows the list):

  1. Ask for labels for an initial random subset of the unlabeled data
  2. Train a classifier on the labeled data
  3. Apply the classifier to the remaining unlabeled data
  4. Ask for labels for the examples that have the most extreme (lowest/highest) uncertainty, margin, or other similar measure, plus a small random subsample to keep things unbiased.
  5. Goto 2
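
Here is a rough sketch of that loop, assuming scikit-learn, binary labels, and a hypothetical ask_user_for_labels() oracle; all names and parameter values are purely illustrative:

    import numpy as np
    from sklearn.svm import SVC

    def active_learning_loop(X_pool, ask_user_for_labels, n_init=50, n_query=50, n_iter=100):
        rng = np.random.RandomState(0)
        # 1. ask for labels for an initial random subset of the pool
        labeled_idx = list(rng.choice(len(X_pool), size=n_init, replace=False))
        y_labeled = list(ask_user_for_labels(labeled_idx))
        clf = None
        for _ in range(n_iter):
            # 2. train a classifier on the labeled data so far (fixed hyperparameters here)
            clf = SVC(kernel="rbf", C=1.0, gamma="scale")
            clf.fit(X_pool[labeled_idx], y_labeled)
            # 3. apply it to the remaining unlabeled pool
            unlabeled_idx = np.setdiff1d(np.arange(len(X_pool)), labeled_idx)
            if len(unlabeled_idx) < n_query:
                break
            margins = np.abs(clf.decision_function(X_pool[unlabeled_idx]))
            # 4. query the smallest-margin examples plus a few random ones to stay unbiased
            n_random = max(1, n_query // 10)
            uncertain = unlabeled_idx[np.argsort(margins)[:n_query - n_random]]
            random_extra = rng.choice(unlabeled_idx, size=n_random, replace=False)
            query = np.unique(np.concatenate([uncertain, random_extra])).tolist()
            labeled_idx += query
            y_labeled += list(ask_user_for_labels(query))
            # 5. go to 2
        return clf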

If the classifier is an SVM, you have at least one, if not two or three, hyperparameters to tweak. How would you go about finding the right hyperparameters in a practical setting? I can see a few alternatives:

  1. Set them to something reasonable beforehand and don't tweak them until you're done interacting with the user.
  2. Do a grid search every iteration and keep the best hyperparameters for the next classifier.
  3. Train many classifiers in parallel and use some combined metric to select samples (e.g., the average margin, weighted by each classifier's cross-validation performance).

Option 1 is hardly realistic given how fiddly SVMs can be. Option 2 is likely to be slow, and I also have concerns about overfitting, especially with such a small number of examples (k-fold cross-validation over a handful of examples?). I'm not really sure how option 3 would work, since many of the hyperparameter settings may just be plain wrong, unless you introduce some sort of weighting scheme based on each classifier's performance.
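
Concretely, option 2 would look something like the following with scikit-learn's GridSearchCV; the grid and fold count are placeholders, not recommendations:

    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Hypothetical grid; with only ~50 new labels per iteration it has to stay small,
    # and 3-fold CV on so few points is exactly the overfitting worry mentioned above.
    param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}

    def fit_with_grid_search(X_labeled, y_labeled):
        search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, n_jobs=-1)
        search.fit(X_labeled, y_labeled)
        # best_params_ can be reused as the starting point in the next iteration
        return search.best_estimator_, search.best_params_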

Any ideas?

Edit: Just to give a bit more context: there will be a single user, the number of labeled examples added per iteration will be on the order of 50 (maybe fewer, depending on how long training takes), and you can't expect the user to be willing to sit through more than 100 iterations. The data is fairly low-dimensional (15 to 200 dimensions).

asked Oct 31 '11 at 08:37 by yeastwars, edited Oct 31 '11 at 08:42


4 Answers:

Your problem is considered almost exactly in this paper:

A large-scale active learning system for topical categorization on the web

answered Nov 02 '11 at 22:58 by downer

While your options seem sensible, do keep in mind that the data obtained from active learning is biased, so cross-validation error on it is not an accurate estimate of held-out error. Hence I'd keep a small set of IID labeled points and use the validation error on these points to do model selection or, preferably, model averaging.
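
A minimal sketch of that idea, assuming scikit-learn; the hyperparameter grid and function names are illustrative:

    import numpy as np
    from sklearn.svm import SVC

    def select_by_iid_validation(X_train, y_train, X_val_iid, y_val_iid,
                                 Cs=(0.1, 1, 10, 100), gammas=(0.01, 0.1, 1)):
        # X_val_iid/y_val_iid is a small, randomly sampled (unbiased) labeled set,
        # kept out of the active learning loop entirely.
        best, best_score = None, -np.inf
        for C in Cs:
            for gamma in gammas:
                clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X_train, y_train)
                score = clf.score(X_val_iid, y_val_iid)  # accuracy on the IID points
                if score > best_score:
                    best, best_score = clf, score
        return best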

answered Oct 31 '11 at 13:24 by Alexandre Passos ♦

Interesting question. This is a very practical problem that is probably often glossed over in papers. If you receive enough labeled data in your step 1, I'd say you can get away with estimating the hyperparameters via cross-validation on that set.

I like your 3rd idea too. You could use something simpler than a combined metric, such as: 1) choose a classifier with uniform probability; 2) select an instance with your usual criterion (e.g., margin) using the chosen classifier; 3) receive the label; 4) update all classifiers with the new instance. This idea probably works best if you retrain the classifiers after each received label (or use an online algorithm).
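
In code, that could look roughly like this (assuming scikit-learn, binary labels, and an illustrative hyperparameter grid):

    import numpy as np
    from sklearn.svm import SVC

    # One SVM per hyperparameter setting; the grid is purely illustrative.
    classifiers = [SVC(kernel="rbf", C=C, gamma=g)
                   for C in (0.1, 1, 10) for g in (0.01, 0.1, 1)]

    def select_query(X_pool, rng=np.random):
        # 1) choose one of the (already trained) classifiers uniformly at random
        clf = classifiers[rng.randint(len(classifiers))]
        # 2) pick the pool instance closest to its decision boundary (smallest margin)
        return int(np.argmin(np.abs(clf.decision_function(X_pool))))

    def update_all(X_labeled, y_labeled):
        # 3)/4) after receiving the new label, retrain every classifier
        for clf in classifiers:
            clf.fit(X_labeled, y_labeled)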

Another idea is to use logistic regression instead of SVM, as it is supposedly less sensitive to the hyperparameter choice.

answered Oct 31 '11 at 11:55 by Mathieu Blondel, edited Oct 31 '11 at 14:28

If you're moving from an SVM towards an explicitly probabilistic classifier, I would argue for confidence-weighted learning (http://www.aclweb.org/anthology-new/P/P08/P08-2059.pdf).

(Oct 31 '11 at 12:22) Oscar Täckström
Oops, I didn't read the link, which turned out to mention confidence weighted learning :)

(Oct 31 '11 at 12:32) Oscar Täckström
Indeed, CW can be very useful in an active learning setting.

(Oct 31 '11 at 14:27) Mathieu Blondel

I would probably do a combination of 1 and 2. Since you're doing active learning, the number of instances should be reasonably low, so I don't see how a simple grid search could be a problem (it's also trivial to parallelize). I would also consider starting with a heavily regularized model and then gradually relaxing the regularization as more data arrives.
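
A rough sketch of the relaxation idea via the SVM's C parameter (smaller C = stronger regularization), assuming scikit-learn; the schedule constants are arbitrary illustrations:

    from sklearn.svm import SVC

    def classifier_for(n_labeled, c_min=0.01, c_max=10.0, growth=0.05):
        # Let C (inverse regularization strength) grow with the labeled set size.
        C = min(c_max, c_min * (1.0 + growth * n_labeled))
        return SVC(kernel="rbf", C=C, gamma=0.1)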

answered Oct 31 '11 at 11:08 by Oscar Täckström, edited Oct 31 '11 at 11:09
