|
Outlier detection is a broad problem area, also known as anomaly detection, rare-class mining, etc. I have seen various survey papers in this area but haven't come across a study that clearly identifies a state-of-the-art algorithm for working with multivariate datasets. The diversity of the techniques and the limited availability of (often proprietary) datasets may be part of the problem, but does anyone know of 'state of the art' algorithms for outlier detection on multivariate datasets?
|
I don't know about state-of-the-art, but here is an easy thing to try on your dataset: use PCA to obtain a reduced-dimension representation of your data, map each reduced point back into the original space, and measure the Euclidean distance between that reconstruction and the actual point. Points whose reconstruction distance exceeds some threshold are flagged as outliers.
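
A minimal sketch of that idea, assuming scikit-learn is available; the number of components and the 99th-percentile cutoff are placeholders you would tune for your data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))             # stand-in for your dataset

pca = PCA(n_components=3).fit(X)
X_hat = pca.inverse_transform(pca.transform(X))

# Euclidean distance between each point and its reduced-form
# projection mapped back into the original space: large distances
# suggest outliers.
recon_error = np.linalg.norm(X - X_hat, axis=1)

threshold = np.quantile(recon_error, 0.99)  # e.g. flag the top 1%
outliers = np.where(recon_error > threshold)[0]
```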
|
How much non-anomalous data do you have? If you have a lot, say >100K instances, you could try inducing an unsupervised model over the data. For example, induce an auto-encoder over the data to minimize the reconstruction error. A new instance is flagged as "anomalous" if its reconstruction error is above some threshold, because its variations are not being properly modeled.
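
A rough sketch of this approach, using scikit-learn's MLPRegressor trained to reproduce its own input as a stand-in for a proper deep auto-encoder; the layer width, iteration count, and threshold quantile are all assumptions to tune:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_clean = rng.normal(size=(10_000, 20))     # stand-in for non-anomalous data

scaler = StandardScaler().fit(X_clean)
X_scaled = scaler.transform(X_clean)

# A narrow hidden layer forces a compressed representation; the
# network is trained to reconstruct its own input.
autoencoder = MLPRegressor(hidden_layer_sizes=(8,), max_iter=200, random_state=0)
autoencoder.fit(X_scaled, X_scaled)

def reconstruction_error(X_new):
    Z = scaler.transform(X_new)
    return np.linalg.norm(Z - autoencoder.predict(Z), axis=1)

# Calibrate a threshold on clean data, then flag new instances
# whose reconstruction error exceeds it.
threshold = np.quantile(reconstruction_error(X_clean), 0.999)
X_new = rng.normal(loc=4.0, size=(5, 20))   # shifted points should be flagged
print(reconstruction_error(X_new) > threshold)
```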
|
The key idea behind any outlier detection method (in any context) is working from the "inside out": the outlyingness of an observation must be judged by its distance from a subset of the data that is believed to be clean. This idea was introduced for univariate Gaussian data by Bernard Rosner in a 1975 paper in Technometrics, where it was called "backwards-stepping" (in contrast to "forwards-stepping," which tests observations from most to least extreme and therefore suffers from masking effects). The challenge in more complex situations, such as regression and multivariate data, is defining what is meant by the "inside" (least extreme) observations.

Ali Hadi and I (separately and together) were early contributors to this literature in the early 1990s, with papers in an IMA volume, the Journal of the American Statistical Association, and the Journal of the Royal Statistical Society, Series B. Ten years ago Anthony Atkinson and Marco Riani wrote a book for Springer discussing applications of this idea to regression, changing the terminology from "backwards-stepping" to "forward search." Most relevant for your question, they followed this up with a book, written with Andrea Cerioli, on applications to multivariate data. They have continued this research since then; a paper by the three of them on multivariate outlier identification appeared in JRSS-B in 2009. I hope that this helps.
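
This is not the forward search itself, but scikit-learn's minimum covariance determinant estimator gives a quick illustration of the same inside-out principle: fit location and scatter on a presumed-clean subset, then measure every point's robust Mahalanobis distance from it. The support fraction and chi-squared cutoff below are conventional choices, not prescriptions:

```python
import numpy as np
from scipy import stats
from sklearn.covariance import MinCovDet

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:10] += 6.0                               # plant a few multivariate outliers

# Fit location and scatter on a presumed-clean "inside" subset,
# then compute each point's robust (squared) Mahalanobis distance.
mcd = MinCovDet(support_fraction=0.75, random_state=0).fit(X)
d2 = mcd.mahalanobis(X)

# Compare against a chi-squared cutoff, as is conventional in the
# multivariate outlier-identification literature.
cutoff = stats.chi2.ppf(0.975, df=X.shape[1])
outliers = np.where(d2 > cutoff)[0]
print(outliers)                             # should include the planted indices 0..9
```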
|
I'm not aware of any general works on outlier detection, but I've seen some specific tricks for certain classes of models. In directed graphical models it is common to add a "background" distribution for some parameter and allow each observation to come either from the model or from the background; this technique appears in several published examples. In classification I've seen people use ramp losses to avoid giving very high weight to outlying examples. You could also read some novelty detection papers.
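
To make the ramp-loss point concrete, here is a small numeric sketch assuming the usual definition of a ramp loss as a hinge loss clipped at level 1 - s:

```python
import numpy as np

def hinge_loss(margin):
    return np.maximum(0.0, 1.0 - margin)

def ramp_loss(margin, s=-1.0):
    # Hinge loss clipped at 1 - s: once an example is badly
    # misclassified, its penalty stops growing, so a single outlier
    # cannot dominate the fit.
    return np.minimum(hinge_loss(margin), 1.0 - s)

margins = np.array([2.0, 0.5, -0.5, -5.0])  # y * f(x) for four examples
print(hinge_loss(margins))                  # [0.  0.5 1.5 6. ]  outlier dominates
print(ramp_loss(margins))                   # [0.  0.5 1.5 2. ]  bounded penalty
```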