High-level thought-experiment...!

I have a Support Vector Machine (SVM) that is already trained. I measure its performance and determine that I want to build a better SVM. I still have all the training data, so I want to add more training data to optimally improve performance.

Obviously, more training data is better, and I can manually select examples and manually label them.

Given that I only want to create N new samples to add to the training data, should I choose samples:

  • (1) so that I try to equalize the total number of positive and negative samples (get close to 50/50 ratio)
  • (2) so that I try to make the positive/negative sample ratio close to what it would be in real data
    • i.e. if normally 10% of samples are positive, I should try to train with a 10/90 positive/negative ratio in training samples
  • (3) that are close to the decision boundary
    • The SVM I currently have can return a 'confidence' measure indicating how close the sample is to the boundary
  • (4) that have a range of confidences
    • e.g.
    • c<-4 (SVM very confident it is negative)
    • c<-1 (SVM confident it is negative)
    • |c|<1 (SVM is uncertain)
    • c>1 (SVM confident it is positive)
    • c>4 (SVM very confident it is positive)
  • (5) using some other criteria...

Note: the labeling is the expensive part. I can automatically generate samples and get predictions/confidence values from the current SVM. If I can only do N labelings, I want to know how best to select the samples I should label.

Thanks!

asked Apr 05 '11 at 13:45

Ciarán

edited Apr 11 '11 at 04:46

Joseph Turian ♦♦


2 Answers:

There is a large subfield of machine learning called active learning that studies the best ways to choose samples to improve a given classifier. A very easy-to-implement solution, and one that is theoretically justified in some ways, is Léon Bottou's suggestion in the LASVM paper to select examples where the SVM is uncertain (small |c|).

The justification is that examples with small |c| will certainly change the classifier if they are added to the training set, while the same cannot be said of examples with a large |c|, as those may well already be correctly classified. Also, hopefully, adding these small-|c| examples will steer the hyperplane in a direction that helps it recognize gross misclassifications, by assigning them a small |c| in later iterations.
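As a rough illustration of this selection rule (a minimal sketch, not code from the LASVM paper): assume you already have the current SVM's decision value c for each unlabeled candidate. The function name `select_most_uncertain` and the example scores are made up for illustration.

```python
def select_most_uncertain(decision_values, n):
    """Return the indices of the n candidates with the smallest |c|."""
    # Rank candidate indices by distance from the decision boundary.
    ranked = sorted(range(len(decision_values)),
                    key=lambda i: abs(decision_values[i]))
    return ranked[:n]


# Hypothetical decision values c for six unlabeled candidates.
scores = [-4.2, 0.3, 1.7, -0.1, 5.0, -0.9]

# With a labeling budget of N = 3, send these candidates to the annotator.
chosen = select_most_uncertain(scores, 3)
print(chosen)
```

After labeling the chosen candidates you would retrain, recompute the decision values, and repeat until the budget is spent.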

answered Apr 05 '11 at 14:15

Alexandre Passos ♦

Thank you Alexandre, that was my intuition about the problem (option 3). Much appreciated.

(Apr 06 '11 at 10:52) Ciarán

As Alexandre says, select an example that is close to the decision boundary. If you are selecting a batch of examples for labeling at this step in active learning, you also want the examples to be diverse. For example, in "Incorporating Diversity in Active Learning with Support Vector Machines" (Brinker, 2003), about 200 examples close to the decision boundary are found, and then an incremental strategy is used to find a small number of examples that maximize the angle diversity.

(Apr 11 '11 at 04:52) Joseph Turian ♦♦
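A rough sketch of what such an incremental batch selection could look like, in the spirit of the Brinker approach rather than a faithful reimplementation: each step picks the candidate with the lowest cost, where the cost mixes closeness to the boundary (|c|) with the largest absolute cosine to anything already in the batch. The linear-kernel cosine, the `trade_off` parameter, and the toy data are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (linear kernel assumed)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_diverse_batch(points, decision_values, batch_size, trade_off=0.5):
    """Greedily build a batch that is near the boundary and angularly diverse."""
    selected = []
    remaining = list(range(len(points)))
    while remaining and len(selected) < batch_size:
        def cost(i):
            closeness = abs(decision_values[i])
            # Largest |cos| to any already-selected point; 0 for the first pick.
            redundancy = max(
                (abs(cosine(points[i], points[j])) for j in selected),
                default=0.0,
            )
            return trade_off * closeness + (1 - trade_off) * redundancy
        best = min(remaining, key=cost)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy candidates: points 0 and 1 point in the same direction.
points = [(1.0, 0.0), (2.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
scores = [0.1, 0.2, 0.3, 0.9]
batch = select_diverse_batch(points, scores, batch_size=2)
print(batch)
```

Plain uncertainty ranking would pick candidates 0 and 1 here, which are nearly redundant; the diversity term swaps the second pick for candidate 2.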

Selecting near the decision boundary is sensible, but you should importance-weight the data to remain asymptotically consistent. Check out http://hunch.net/?cat=22
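To make the idea concrete, here is a minimal sketch of importance-weighted selection. It is an illustration under stated assumptions, not the algorithm from the linked posts: each candidate is queried with a probability p that is high near the boundary but never below a floor `p_min`, and every labeled example is then trained with weight 1/p. The query rule `exp(-|c|)` and the value of `p_min` are arbitrary choices made for this example.

```python
import math
import random


def query_probability(c, p_min=0.1):
    """Probability of asking for a label; never below p_min (assumed rule)."""
    return max(p_min, math.exp(-abs(c)))


def sample_with_importance_weights(decision_values, rng, p_min=0.1):
    """Return (index, weight) pairs for the candidates chosen for labeling."""
    picked = []
    for i, c in enumerate(decision_values):
        p = query_probability(c, p_min)
        if rng.random() < p:
            # Weight 1/p compensates for how unlikely this pick was.
            picked.append((i, 1.0 / p))
    return picked


# Hypothetical decision values for five unlabeled candidates.
scores = [0.0, -0.5, 2.0, -6.0, 8.0]
picked = sample_with_importance_weights(scores, random.Random(0))
for index, weight in picked:
    print(index, round(weight, 2))
```

Because the floor keeps every candidate's query probability above zero, confident regions are still sampled occasionally, just with a larger weight. Note that this samples a random number of candidates rather than exactly N. The weights would then go to a trainer that accepts per-example weights, for instance the `sample_weight` argument of scikit-learn's `SVC.fit`.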

answered Apr 23 '11 at 01:08

Paul Mineiro


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.