
I'm looking for general relevant results in this area.

asked Oct 17 '10 at 12:46

Alexandre Passos ♦

edited Dec 03 '10 at 07:16


5 Answers:

The main approaches, as far as I know, are:

  • The one-class SVM, originally proposed in Schölkopf et al., "Estimating the support of a high-dimensional distribution", which can be extended to many other settings.
  • Density-based approaches, where you fit a parametric or non-parametric model and cut off based on likelihood or a similar score (too many papers use this idea for me to single out one reference).
  • Reconstruction-based approaches, such as using a denoising auto-encoder on the training set and setting a threshold of reconstruction error above which a point is considered an outlier.
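
To make the density-based idea above concrete, here is a minimal sketch using only the Python standard library. It assumes an independent (diagonal) Gaussian model and picks the likelihood cutoff as a low quantile of the training scores. The function names, the `contamination` parameter, and the synthetic data are all illustrative assumptions, not taken from any of the papers mentioned here.

```python
import math
import random
import statistics


def fit_diag_gaussian(rows):
    """Estimate a per-feature mean and standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Guard against zero variance so the log-density stays finite.
    stds = [statistics.pstdev(col) or 1e-9 for col in columns]
    return means, stds


def log_likelihood(row, means, stds):
    """Log-density of one row under the independent-Gaussian model."""
    total = 0.0
    for x, mu, sigma in zip(row, means, stds):
        z = (x - mu) / sigma
        total += -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)
    return total


def fit_threshold(rows, means, stds, contamination=0.01):
    """Choose a cutoff so about `contamination` of training rows score below it."""
    scores = sorted(log_likelihood(r, means, stds) for r in rows)
    return scores[int(contamination * len(scores))]


def is_outlier(row, means, stds, threshold):
    """Flag a point whose log-likelihood falls below the cutoff."""
    return log_likelihood(row, means, stds) < threshold


# Synthetic training data (assumed for illustration only).
rng = random.Random(0)
train = [[rng.gauss(0.0, 1.0), rng.gauss(5.0, 2.0)] for _ in range(1000)]

means, stds = fit_diag_gaussian(train)
threshold = fit_threshold(train, means, stds, contamination=0.01)

print(is_outlier([0.1, 5.3], means, stds, threshold))    # close to the bulk of the data
print(is_outlier([8.0, -10.0], means, stds, threshold))  # far from the training data
```

The same pattern (score each point, then threshold at a quantile of the training scores) carries over to the reconstruction-based approach if you replace the log-likelihood with a negated reconstruction error.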

I found the Hodge and Austin paper, "A survey of outlier detection methodologies", to be useful.

answered Oct 17 '10 at 12:48

Alexandre Passos ♦

edited Oct 17 '10 at 20:17

You can refer to the work of David M. J. Tax, especially his Ph.D. thesis. He also has a MATLAB toolbox. See http://prlab.tudelft.nl/users/david-tax for details. It may be useful to you.

This answer is marked "community wiki".

answered Oct 20 '10 at 02:43

Shuo Xu

edited Oct 20 '10 at 02:44

There's a nice JMLR paper by Owen on logistic regression in the setting where the training set contains finitely many positive observations and infinitely many negative ones: "Infinitely Imbalanced Logistic Regression".

answered Oct 18 '10 at 05:05

Bob Durrant

I think the literature on Robust Statistics is relevant, though I am not familiar enough with it to give a more precise answer.

answered Oct 17 '10 at 13:51

ogrisel

This paper might be of some interest to you: "Regularized F-Measure Maximization for Feature Selection and Classification". I have not used the method myself, though I might have to (which is why I stumbled upon it). Supposedly it works well on unbalanced datasets, which I guess is what you have in mind.

answered Oct 18 '10 at 16:42

pascanur

The application I have in mind is not classification per se; I'm interested in the problem of detecting outliers, not necessarily dealing with them.

(Oct 18 '10 at 18:26) Alexandre Passos ♦

That's a classification problem though, isn't it (f(is_an_outlier)=0, f(!is_an_outlier)=1)?

(Oct 19 '10 at 04:27) Bob Durrant

@Durrant: This is true; I'm interested in detecting more than one sort of outlier, however, so it'd be hard to get reliable negative ("isn't outlier") examples. And with logistic regression at least, as per the paper you mentioned, the only relevant thing would be the average features of the "isn't outlier" class, which is too coarse.

(Oct 19 '10 at 04:37) Alexandre Passos ♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.