It's come time for me to automate the task of detecting spammers on my website, and I'm trying to do so using a learning algorithm trained on data that I've collected over the last couple of years of manually moderating users.

The problem is that I've manually verified around 2000 users as legitimate but have only rejected around 100, so every algorithm I try skews towards accepting every user and often ignores the spammers in my test data. Algorithms perform even worse when I use a subset of the 2000 legitimate users to level the playing field.

The worst failure was the perceptron algorithm: my data does not seem to be linearly separable.

Recently I've found some success using a nearest-neighbor algorithm, but it's still not quite good enough. My research suggests that anomaly detection algorithms would probably be the best fit. Can anyone suggest a direction based on this information?

A tad more info: the features I'm using are all continuous (not categorical) at the moment.

Thanks, Adam

asked Jul 23 '12 at 17:19

Adam Jonz

What basis are you using for your research? A lot of the anti-spam literature already deals with these problems; everyone in this field has them, since positive (spam) samples are scarce.

(Jul 23 '12 at 23:08) Robert Layton

One Answer:

It is difficult to give advice without more information on the features that you use. The problem you describe can be countered by giving examples different weights (or, as a crude alternative, by resampling your data to achieve the same effect). For an example of instance weighting, see http://scikit-learn.org/stable/auto_examples/svm/plot_weighted_samples.html.
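A minimal sketch of the instance-weighting idea in scikit-learn, using synthetic placeholder features (the real moderation features and their distributions are unknown here, so the data below is fabricated purely to mimic the 2000-vs-100 imbalance):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)

# Fake data mimicking the imbalance: 2000 legitimate users (class 0)
# and 100 spammers (class 1), with continuous features as described.
X_legit = rng.normal(0.0, 1.0, size=(2000, 4))
X_spam = rng.normal(3.0, 1.0, size=(100, 4))
X = np.vstack([X_legit, X_spam])
y = np.array([0] * 2000 + [1] * 100)

# class_weight='balanced' weights each class inversely to its frequency,
# which here gives each spam example roughly 20x the influence of a
# legitimate one, so the classifier can no longer "win" by accepting everyone.
clf = SVC(kernel="rbf", class_weight="balanced")
clf.fit(X, y)

# Spammers should now mostly be flagged as class 1.
print(clf.predict(X_spam[:5]))
```

The same effect can be achieved per-example by passing a `sample_weight` array to `fit()`, which is what the linked scikit-learn example demonstrates.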

Perhaps scikit-learn's outlier detection can help you further? See http://scikit-learn.org/stable/auto_examples/applications/plot_outlier_detection_housing.html. What I would do is train a one-class SVM (start with just a linear kernel) on all the valid users. I know from experience that this works really well on features from EEG data. The l2 regularizer keeps the classification weights small, which promotes a model that detects most inliers while minimizing the number of false positives.

answered Jul 24 '12 at 03:12

Bwaas


powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.