|
Hello all, I'm testing the parameters of an experimental ANN training algorithm for binary classification. This procedure is expected to do well on highly unbalanced data IF most attributes are continuous (it relies heavily on continuous probability distributions). I need data to test it. Any suggestions?
This question is marked "community wiki".
|
|
One thing you can do is create your own artificial dataset that has these properties and looks as natural as possible. For example, I don't know what you mean by "relies heavily on discrete probability distributions" but if you can find a balanced dataset with that property you can create an unbalanced one by biased sampling.
This answer is marked "community wiki".
Relies on continuous probability distributions. I use characteristics of the training data's probability distributions to change the training process. As an example of the data I'm looking for: at work I analyze credit card fraud data. We have 30,000 legitimate transactions per fraudulent transaction, a pretty unbalanced situation. I'm looking for similar data, but in a domain where most attributes are continuous (most of my credit card variables are nominal). An artificial dataset is the easy way out; I'm looking at real-world data first.
(Feb 28 '12 at 07:59)
Lucas Gallindo
My suggestion still stands: pick a dataset where the attributes are continuous, use biased sampling to make it unbalanced, and see how you perform.
(Feb 28 '12 at 08:01)
Alexandre Passos ♦
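A minimal sketch of the biased-sampling idea, in case it helps. The function name `make_unbalanced`, the `ratio` parameter, and the synthetic Gaussian data are all assumptions made for illustration; in practice `X` and `y` would come from a real balanced dataset with continuous attributes.

```python
import numpy as np

def make_unbalanced(X, y, ratio, minority_label=1, seed=0):
    """Subsample the minority class so that roughly `ratio` majority
    examples remain per minority example (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)
    # Keep at least one minority example, and never more than exist.
    n_keep = min(len(minority_idx), max(1, len(majority_idx) // ratio))
    kept_minority = rng.choice(minority_idx, size=n_keep, replace=False)
    # Keep every majority example, then shuffle the combined index.
    idx = np.concatenate([majority_idx, kept_minority])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Synthetic stand-in for a balanced dataset with continuous attributes.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 1.0, size=(5000, 4)),
               rng.normal(1.0, 1.0, size=(5000, 4))])
y = np.concatenate([np.zeros(5000, dtype=int), np.ones(5000, dtype=int)])

X_u, y_u = make_unbalanced(X, y, ratio=100)
print(np.bincount(y_u))  # class counts after subsampling
```

Only the minority class is thinned here, so the majority distribution is untouched. Varying `ratio` gives a quick way to see how performance degrades as the imbalance grows.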
|
I don't have a dataset at hand but does ANN stand for Artificial Neural Networks or Approximate Nearest Neighbors?
How many dimensions / features can this algorithm scale to? 100, 10k, 1M? Does it handle sparse data (many zeros) efficiently?
ANN = Artificial Neural Network. In particular, MLPs.
I do not think the number of dimensions will be a problem, but sparse dimensions might cause trouble. I have extensions for sparse data in mind, but I want to test it on "simpler" data first. So far, I have tested it only on balanced data and artificial datasets.
Like ogrisel, I don't have a dataset in mind, but have a look at medical datasets -- often the number of positive cases is much smaller than the number of negatives.
Perhaps you should describe the real-world data-sets you have tested on; maybe that'll help spark an idea in people's minds.
@Brian Vandenberg, I tested it on three PROBEN1 datasets: Card, Cancer and Diabetes. I was comparing results with those of the IJCNN 2011 paper "PCA and Gaussian Noise in MLP Neural Network Training Improve Generalization in Problems with Small and Unbalanced Data Sets". There was a small but significant improvement.