Is there specific research on assessing the quality of data samples (in a test set)?

Given a classifier trained on a very large (millions of samples) and fairly noisy dataset, are there papers describing how well the algorithm does at finding abnormal samples in a test set? For example, an SVM classifier that misclassifies distorted data, together with a measure of how distorted the data is. I am especially interested in the case where the type of distortion is unknown beforehand.

asked Sep 16 '11 at 10:12

crdrn

edited Sep 16 '11 at 10:13

Wouldn't the abnormal samples on a test set be the misclassified samples? For an SVM classifier, the distance of a sample from the separating hyperplane would be a measure of distortion. In short, I think what you'd be looking for is some measure of the variance of your data.

(Sep 16 '11 at 17:24) Jonathan Purnell

2 Answers:

Do you want something related to outlier detection, density estimation, or one-class learning? You can start by reading the one-class SVM paper and then follow the papers that cite it, if it seems relevant to your problem. Something else that might be useful is the set of techniques for detecting domain shifts.

answered Sep 18 '11 at 20:42

Alexandre Passos ♦

By the way, "Online Passive-Aggressive Algorithms" by Koby Crammer et al. describes a nice one-class online algorithm. It's easy to understand and implement.
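For readers who want to try it, here is a minimal sketch of a one-class passive-aggressive update as I understand it: keep a center, and whenever a sample falls outside a ball of fixed radius, move the center toward the sample by exactly the amount of the violation. The function names, the fixed `radius`, and the toy points are assumptions made for illustration, not something taken from this thread.

```python
import math

def pa_one_class_fit(points, radius, center=None):
    """One pass of a passive-aggressive style one-class update (fixed radius)."""
    center = list(center) if center is not None else [0.0] * len(points[0])
    for x in points:
        dist = math.dist(center, x)
        loss = max(0.0, dist - radius)  # zero if x is already inside the ball
        if loss > 0.0:
            # Move the center toward x just far enough to put x on the boundary.
            step = loss / dist
            center = [c + step * (xi - c) for c, xi in zip(center, x)]
    return center

def is_outlier(center, x, radius):
    """Flag samples that fall outside the learned ball."""
    return math.dist(center, x) > radius

# Toy data clustered near (1, 1); hypothetical values.
train = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.0)]
radius = 0.5
center = pa_one_class_fit(train, radius)
far_flag = is_outlier(center, (5.0, 5.0), radius)
near_flag = is_outlier(center, (1.0, 1.0), radius)
```

The radius has to be chosen by hand in this sketch; in practice it would be tuned on held-out data.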

(Sep 20 '11 at 01:58) Mathieu Blondel

Thanks for the suggestion Alex, I think density estimation would be the closest thing I would like to study. I'm not sure how to interpret a binary SVM classifier I'm using as fitting a probability density because its outputs are classifications (positive class or negative class). It seems like the SVM would be separating the positive class' probability distribution from the negative class'.

Distortions or outliers could potentially occupy both or somewhere in between. I'm thinking there might be a specific distance metric I could use to make the SVM output more interpretable.

(Sep 21 '11 at 00:35) crdrn

@Cristopher: a one-class SVM is not at all a two-class SVM. It is more of a nonparametric kernel-based way to do density estimation (or, more honestly, density support estimation, as you don't estimate the density at each individual point, you just learn to discriminate between things that look like real points and things that don't). You should take a look at the paper, as I think it is very close to what you described.
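A rough sketch of what that looks like in practice, assuming scikit-learn and NumPy are available. The synthetic Gaussian training data and the `nu` and `gamma` settings are placeholder assumptions, not recommendations.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Synthetic stand-in for "real" points (assumption: 2-D standard Gaussian).
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# nu roughly bounds the fraction of training points treated as outliers.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

X_test = np.array([[0.0, 0.0],     # looks like the training data
                   [10.0, 10.0]])  # far from anything seen in training
labels = model.predict(X_test)            # +1 = inside the support, -1 = outside
scores = model.decision_function(X_test)  # signed score; lower = more abnormal
```

Note that the labels of the original supervised problem are never used here; the model only sees the inputs.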

(Sep 21 '11 at 07:44) Alexandre Passos ♦

I don't know what it's worth, but here's a rule of thumb I came up with recently. With an SVM or logistic regression, the prediction function is f(x) = w . x + b, and the confidence of the classifier in its prediction is |f(x)|. If the predictions for which the classifier is most confident happen to be wrong (i.e., y (w . x + b) < 0, with y in {-1, +1}), the corresponding instances are likely to be outliers or mislabeled (of course, not always). It seems to me that a classifier which "makes mistakes with great confidence" is an indicator of dataset difficulty.
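A minimal sketch of this rule of thumb, assuming a linear model with labels in {-1, +1} and counting a prediction as wrong when y * f(x) < 0. The function name and the toy weights and samples are made up for illustration.

```python
def rank_confident_mistakes(w, b, X, y):
    """Return (index, margin) pairs for misclassified samples,
    ordered from most to least confidently wrong."""
    mistakes = []
    for i, (x, label) in enumerate(zip(X, y)):
        score = sum(wj * xj for wj, xj in zip(w, x)) + b  # f(x) = w . x + b
        margin = label * score  # negative means the prediction is wrong
        if margin < 0:
            mistakes.append((i, margin))
    # Most negative margin first: largest |f(x)| among the mistakes.
    mistakes.sort(key=lambda pair: pair[1])
    return mistakes

# Hypothetical trained weights and a few labeled samples.
w, b = [1.0, -1.0], 0.0
X = [[2.0, 0.0], [0.0, 3.0], [5.0, 0.0], [0.2, 0.0]]
y = [1, -1, -1, -1]
suspects = rank_confident_mistakes(w, b, X, y)
suspect_indices = [i for i, _ in suspects]
```

Samples at the head of the list are the ones worth inspecting by hand first as possible outliers or labeling errors.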

answered Sep 19 '11 at 03:12

Mathieu Blondel

... or maybe the classifier isn't very good, or maybe there is a qualitative difference between the training and testing sets.

(Sep 22 '11 at 20:15) Art Munson

My answer assumes that you're using an appropriate kernel for the dataset. In the same fashion, at training time, the number of support vectors can be an indicator of dataset difficulty. By the way, by using a one-class SVM, we lose the fact that the original objective was a supervised one, so it may find outliers but not mislabeled instances.

(Sep 23 '11 at 02:26) Mathieu Blondel

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.