
Hello all,

I'm testing the parameters of an experimental ANN training algorithm for binary classification. This procedure is expected to do well on highly unbalanced data IF most attributes are continuous (it relies heavily on continuous probability distributions).

I need data to test it. Any suggestions?


asked Feb 27 '12 at 09:04

Lucas Gallindo

I don't have a dataset at hand but does ANN stand for Artificial Neural Networks or Approximate Nearest Neighbors?

How many dimensions / features can this algorithm scale to: 100, 10k, 1M? Does it handle sparse data (many zeros) efficiently?

(Feb 27 '12 at 09:41) ogrisel

ANN = Artificial Neural Network. In particular, MLPs.

I do not think the number of dimensions will be a problem, but sparse dimensions might cause trouble. I have extensions for sparse data in mind, but I want to test it on "simpler" data first. So far, I have tested it only on balanced data and artificial datasets.

(Feb 27 '12 at 10:09) Lucas Gallindo

Like ogrisel, I don't have a dataset in mind, but have a look at medical datasets: often the number of positive cases is much smaller than the number of negatives.

(Feb 27 '12 at 20:07) Robert Layton

Perhaps you should describe the real-world data-sets you have tested on; maybe that'll help spark an idea in people's minds.

(Mar 01 '12 at 12:53) Brian Vandenberg

@Brian Vandenberg, I tested it on three PROBEN1 datasets: Card, Cancer and Diabetes. I was comparing results with those of the IJCNN 2011 paper "PCA and Gaussian Noise in MLP Neural Network Training Improve Generalization in Problems with Small and Unbalanced Data Sets". There was a small but significant improvement.

(Mar 01 '12 at 13:19) Lucas Gallindo

1 Answer:

One thing you can do is create your own artificial dataset that has these properties and looks as natural as possible. For example, I don't know what you mean by "relies heavily on discrete probability distributions" but if you can find a balanced dataset with that property you can create an unbalanced one by biased sampling.


answered Feb 28 '12 at 07:52

Alexandre Passos ♦

Relies on continuous probability distributions. I use characteristics of the training data probability distributions to change the training process.

As an example of the data I'm looking for: at work I analyze credit card fraud data. We have 30,000 fair transactions per fraudulent transaction, a pretty unbalanced situation. I'm looking for similar data, but in a domain where most attributes are continuous (most of my credit card variables are nominal). An artificial dataset is the easy way out; I'm looking first at real-world data.

(Feb 28 '12 at 07:59) Lucas Gallindo

My suggestion still stands: pick a dataset where the attributes are continuous, use biased sampling to make it unbalanced, and see how you perform.

(Feb 28 '12 at 08:01) Alexandre Passos ♦
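
The biased-sampling suggestion above can be sketched roughly as follows. This is only an illustration, not anyone's actual procedure from the thread: the function name `make_unbalanced`, the `neg_per_pos` parameter, and the 100:1 ratio are all assumptions chosen for the example, and the toy Gaussian data merely stands in for a real balanced dataset with continuous attributes.

```python
import numpy as np

def make_unbalanced(X, y, positive_label=1, neg_per_pos=100, seed=0):
    """Keep every negative example and subsample the positives so that
    there is roughly one positive per `neg_per_pos` negatives.
    (Hypothetical helper, written for illustration only.)"""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    pos_idx = np.flatnonzero(y == positive_label)
    neg_idx = np.flatnonzero(y != positive_label)
    if len(pos_idx) == 0:
        raise ValueError("no positive examples to sample from")
    # Number of positives to keep: at least one, at most all available.
    n_pos = max(1, min(len(pos_idx), len(neg_idx) // neg_per_pos))
    kept_pos = rng.choice(pos_idx, size=n_pos, replace=False)
    keep = np.concatenate([neg_idx, kept_pos])
    rng.shuffle(keep)  # avoid leaving all positives at the end
    return X[keep], y[keep]

# Toy balanced data with continuous attributes (synthetic placeholder;
# in practice X_bal, y_bal would come from a real balanced dataset).
rng = np.random.default_rng(42)
X_bal = np.vstack([rng.normal(0.0, 1.0, size=(5000, 4)),
                   rng.normal(1.5, 1.0, size=(5000, 4))])
y_bal = np.concatenate([np.zeros(5000, dtype=int), np.ones(5000, dtype=int)])

X_unb, y_unb = make_unbalanced(X_bal, y_bal, neg_per_pos=100)
print(np.bincount(y_unb))  # class counts: 5000 negatives, 50 positives
```

Subsampling only the minority class keeps the majority-class distribution untouched, so any change in performance can be attributed to the imbalance itself. If test data is drawn the same way, it should be split off before subsampling so that the held-out positives are not reused.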

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.