I am interested in ways of treating an imbalanced regression problem.

By imbalanced regression I mean that I have few samples for some regions of my input and target space and lots of samples for other regions. I don't know a priori what the exact distribution is or how the different regions contribute to the overall error. All I know is that the distribution I get as training data is not the distribution I expect at runtime.
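One common way to formalize this mismatch, assuming only the input distribution changes while the relation between inputs and targets stays the same (an assumption not stated above, usually called covariate shift), is an importance-weighted risk. The symbols $p_{\text{train}}$, $p_{\text{run}}$, $w$, $f$ and $\ell$ are introduced here for illustration only:

```latex
% Assumption: p(y | x) is identical at training time and at runtime.
\mathbb{E}_{x \sim p_{\text{run}}}\big[\ell(f(x), y)\big]
  = \mathbb{E}_{x \sim p_{\text{train}}}\big[w(x)\,\ell(f(x), y)\big],
\qquad
w(x) = \frac{p_{\text{run}}(x)}{p_{\text{train}}(x)}

% Empirical version on training samples (x_i, y_i), i = 1, ..., n:
\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} w(x_i)\,\ell(f(x_i), y_i)
```

If $p_{\text{run}}$ is unknown, taking it to be roughly uniform over the region of interest reduces $w(x)$ to something proportional to $1 / p_{\text{train}}(x)$.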

I'd be interested in ideas on how to treat this problem robustly, theoretical work as well as quick heuristics/hacks that you found work well.

So far I have been thinking about:

  1. Boosting, bagging.
  2. Binning the data and choosing N samples from each bin (e.g. by using K-Means or some hypercube measure).
  3. Minimizing a form of the squared loss in which each sample is weighted inversely to its probability under some model (e.g. a mixture of Gaussians), after removing outliers. A mixture of Student's t distributions might do fine here because it is less sensitive to outliers.
  4. Minimizing the error, but only down to some constant (e.g. don't minimize below 0.05 or so).
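Ideas 2 and 3 can be sketched roughly as follows. This is only a minimal sketch under several assumptions: scikit-learn is available, the toy data is invented, and the helper names `inverse_density_weights` and `balanced_bin_indices`, the number of mixture components, the clipping quantile and the bin sizes are all hypothetical choices. `Ridge` is used merely as an example of a regressor that accepts `sample_weight`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import Ridge
from sklearn.mixture import GaussianMixture


def inverse_density_weights(Z, n_components=3, clip_quantile=0.99, seed=0):
    """Idea 3: weights proportional to 1 / p(z) under a Gaussian mixture."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(Z)
    log_p = gmm.score_samples(Z)        # log density of every sample
    w = np.exp(log_p.min() - log_p)     # proportional to 1 / p, values in (0, 1]
    # Cap the largest weights so a handful of outliers cannot dominate the loss.
    w = np.minimum(w, np.quantile(w, clip_quantile))
    return w / w.mean()                 # normalize to mean 1


def balanced_bin_indices(Z, n_bins=10, per_bin=20, seed=0):
    """Idea 2: bin with K-Means, then draw per_bin samples from each bin."""
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_bins, n_init=10, random_state=seed).fit_predict(Z)
    picks = []
    for b in range(n_bins):
        members = np.flatnonzero(labels == b)
        if len(members) == 0:
            continue
        # Sample with replacement only when the bin is smaller than per_bin.
        picks.append(rng.choice(members, size=per_bin, replace=len(members) < per_bin))
    return np.concatenate(picks)


# Invented toy data: 950 samples in a dense region, 50 in a sparse one.
rng = np.random.default_rng(0)
x = np.concatenate([rng.uniform(0.0, 1.0, 950), rng.uniform(1.0, 5.0, 50)])
y = np.sin(x) + 0.1 * rng.normal(size=x.size)
rare = x > 1.0
X = x[:, None]
Z = np.column_stack([x, y])             # joint input/target space

# Idea 3: weighted squared loss.
w = inverse_density_weights(Z)
weighted_model = Ridge(alpha=1.0).fit(X, y, sample_weight=w)

# Idea 2: fit on a bin-balanced subsample instead.
idx = balanced_bin_indices(Z)
resampled_model = Ridge(alpha=1.0).fit(X[idx], y[idx])
```

Idea 4 is close in spirit to the epsilon-insensitive loss of support vector regression, which ignores errors smaller than a threshold; in scikit-learn this would be something like `sklearn.svm.SVR(epsilon=0.05)`.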

However, I have no practical experience. Has anyone encountered such a problem and solved it successfully?

asked Nov 19 '11 at 03:24

Justin Bayer

One Answer:

Maybe you can draw some inspiration from Smola's blog post on this topic?

answered Nov 19 '11 at 05:54

Bwaas
edited Nov 19 '11 at 08:15

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.