To make some predictions, I want to fit a probability distribution to a density like the one in blue. I've found that a log-normal fits reasonably well (in red), but the "spikes" create some errors in the predictions. How would you tackle the problem?

Any advice would be appreciated, especially if you can add some suggestions in R, Python or Ruby :D

asked Jul 07 '10 at 13:50

eduardofv


4 Answers:

Fitting distributions to data is a pretty wide-open area. There are numerous methods for doing so, with varying strengths and weaknesses. Generally, this is how I would go about it:

  1. You may have background knowledge to constrain your choice of families of distributions. If you do not have warrant to choose a particular family, you can perform a nonparametric fit with kernel density estimation (as per Alexandre's advice).
  2. Though fit is (too) often eyeballed, it is much preferable to score fit by formal means, like the Kolmogorov-Smirnov test.
  3. There is always a danger of over-fitting your data, which can be mitigated by model selection criteria (AIC, BIC, etc.) that trade off the fit of a model against the number of parameters used.
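
Since Python was also requested, steps 2 and 3 can be sketched with the standard library alone. This is only an illustration under stated assumptions: the `salaries` sample is synthetic, the exponential is an arbitrary competing family chosen for contrast, and all function names are made up for this example. Only the Kolmogorov-Smirnov statistic is computed, not a p-value, because the usual KS tables do not apply when the parameters are estimated from the same data.

```python
import math
import random

def fit_lognormal(xs):
    """Closed-form maximum likelihood: mean and std of log(x)."""
    logs = [math.log(x) for x in xs]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    return mu, sigma

def lognormal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def lognormal_loglik(xs, mu, sigma):
    return sum(
        -math.log(x * sigma * math.sqrt(2.0 * math.pi))
        - (math.log(x) - mu) ** 2 / (2.0 * sigma ** 2)
        for x in xs
    )

def fit_exponential(xs):
    """MLE rate of an exponential, plus its log-likelihood."""
    rate = len(xs) / sum(xs)
    return rate, sum(math.log(rate) - rate * x for x in xs)

def ks_statistic(xs, cdf):
    """Largest gap between the empirical CDF and a candidate CDF."""
    xs = sorted(xs)
    n = len(xs)
    gap = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        gap = max(gap, (i + 1) / n - f, f - i / n)
    return gap

def aic(loglik, n_params):
    return 2 * n_params - 2 * loglik

# Synthetic stand-in for the real data (an assumption of this sketch).
random.seed(0)
salaries = [random.lognormvariate(10.0, 0.5) for _ in range(2000)]

mu, sigma = fit_lognormal(salaries)
rate, exp_loglik = fit_exponential(salaries)

ks_lognormal = ks_statistic(salaries, lambda x: lognormal_cdf(x, mu, sigma))
ks_exponential = ks_statistic(salaries, lambda x: 1.0 - math.exp(-rate * x))
aic_lognormal = aic(lognormal_loglik(salaries, mu, sigma), 2)
aic_exponential = aic(exp_loglik, 1)

print("KS :", ks_lognormal, ks_exponential)
print("AIC:", aic_lognormal, aic_exponential)
```

Lower is better for both scores. The AIC comparison is where the parameter count enters: the lognormal pays for two parameters and the exponential for one.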

As for going about this in R, this is the best guide I have found, so far.

answered Jul 07 '10 at 14:55

John L Taylor

If it's that important to cover the spikes, you can try fitting a mixture model. The spikes look a lot like overfitting, though. For the mixture, see kernel density estimation (which amounts to fitting the variance of a mixture with one Gaussian per data point) or just use a mixture of lognormal models. I'm not familiar with R, so I don't know of any code for these tasks.
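
For what it's worth, the kernel density view is simple enough to sketch in plain Python. The sketch below assumes a Gaussian kernel and the normal-reference rule of thumb for the bandwidth (1.06 times the sample standard deviation times n^(-1/5)); the `gaussian_kde` helper and the synthetic `sample` are made up for this illustration and are not taken from any library.

```python
import math
import random
import statistics

def gaussian_kde(data, bandwidth=None):
    """Return a density function: one Gaussian bump per data point."""
    n = len(data)
    if bandwidth is None:
        # Normal-reference rule of thumb; an assumption, not a tuned value.
        bandwidth = 1.06 * statistics.stdev(data) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data
        )

    return density

# Synthetic stand-in for the real data (an assumption of this sketch).
random.seed(1)
sample = [random.lognormvariate(10.0, 0.5) for _ in range(500)]
density = gaussian_kde(sample)

# Sanity check: the estimate should integrate to roughly 1 (midpoint rule).
lo, hi = -0.5 * max(sample), 1.5 * max(sample)
steps = 2000
width = (hi - lo) / steps
area = sum(density(lo + (i + 0.5) * width) for i in range(steps)) * width
print("area under the estimate:", area)
```

A smaller bandwidth follows the spikes more closely and a larger one smooths them away, so the bandwidth is exactly where the overfitting question shows up.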

answered Jul 07 '10 at 13:58

Alexandre Passos ♦

Thanks, I'm heading to check that.

(Jul 07 '10 at 14:11) eduardofv

I believe I understand now.

Those bumps were not created by 'overfitting'! Please tell me if I am wrong about how you set about producing your graph.

You fitted a smoothed density function to the original data. You also generated a pseudo-random sample from a log-normal distribution and fitted a smoothed density to those data. These results are represented by the blue and red curves respectively.

Although you have asked only about how to fit density functions, you do say, "...but the 'spikes' create some errors in the predictions." As always, the approach that one takes depends on the use that one intends to make of the results. Would you mind saying what you mean by the errors in the predictions, please?

answered Jul 09 '10 at 12:19

Bill Bell

Thank you Bill, you are right about how I generated the graph. The data represent the probability of a job seeker applying to a job that offers a given salary (x). The spikes or bumps in the data represent salary levels to which job seekers are more likely to apply than to nearby levels. For instance, a job seeker is much more likely to apply to a job that offers $20,000 than to one offering $19,900 or $20,100, because $20,000 is one of those spikes.

Using only the lognormal fit to predict the mentioned case would underestimate the probability.

(Jul 09 '10 at 13:18) eduardofv
If you have a priori knowledge of what causes the "noise", just model it. In this case, I would consider a mix of periodic functions.

Just be careful of overfitting if you do it by hand. You should be able to argue why the function makes sense in a simple case like this.
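
One way to sketch "just model it" in Python is shown below. It is not the mix of periodic functions mentioned above but a simpler stand-in: a two-part mixture of a smooth lognormal and point masses at round salaries. The step of 1,000, the 30% share of round values in the synthetic data, and every name in the code are assumptions made for illustration only.

```python
import math
import random
from collections import Counter

STEP = 1000  # hypothetical spacing of the "round" salary values

def lognormal_cdf(x, mu, sigma):
    if x <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2.0))))

def fit_spiky_lognormal(xs, step=STEP):
    """Split the data into round values (spikes) and the smooth remainder."""
    round_values = [x for x in xs if x % step == 0]
    smooth_values = [x for x in xs if x % step != 0]  # assumed non-empty
    weight = len(round_values) / len(xs)  # mixture weight of the spike part
    logs = [math.log(x) for x in smooth_values]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / len(logs))
    counts = Counter(round_values)
    spikes = {v: c / len(round_values) for v, c in counts.items()}
    return weight, mu, sigma, spikes

def interval_probability(a, b, model):
    """P(a <= salary < b) under the spike-plus-lognormal mixture."""
    weight, mu, sigma, spikes = model
    smooth = lognormal_cdf(b, mu, sigma) - lognormal_cdf(a, mu, sigma)
    spike = sum(p for v, p in spikes.items() if a <= v < b)
    return (1.0 - weight) * smooth + weight * spike

# Synthetic data: lognormal salaries, 30% of them snapped to round values.
random.seed(2)
salaries = []
for _ in range(5000):
    x = random.lognormvariate(10.0, 0.5)
    if random.random() < 0.3:
        x = round(x / STEP) * STEP
    salaries.append(x)

model = fit_spiky_lognormal(salaries)
weight, mu, sigma, spikes = model
p_spike = interval_probability(19950, 20050, model)  # contains 20,000
p_left = interval_probability(19850, 19950, model)   # contains no spike
print("weight of round values:", weight)
print("P(around 20,000):", p_spike, " P(around 19,900):", p_left)
```

The argument for this form is the one the asker gave: people apply to round salaries more often, so extra mass is allowed only at those values, which keeps the parameter count and the overfitting risk low.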

(Jul 09 '10 at 17:24) rm999

There's something wrong with that 'log-normal', isn't there? A log-normal doesn't have bumps on it like that. (See http://www.wolframalpha.com/input/?i=lognormal) So, whether or not the log-normal is an appropriate density for your purposes you should be able to select one that is the best in some sense and display it as a bump-free curve overlaid on the original density function.

Something wrong with the calculations?
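
For reference, a bump-free curve comes from evaluating the fitted density directly on a grid of x values rather than smoothing random deviates. Assuming the usual parameterization by the mean and standard deviation of ln x, and maximum likelihood as the sense of "best", the density and estimates are:

```latex
f(x;\mu,\sigma) = \frac{1}{x\,\sigma\sqrt{2\pi}}
  \exp\!\left(-\frac{(\ln x-\mu)^2}{2\sigma^2}\right), \qquad x > 0,
\qquad
\hat\mu = \frac{1}{n}\sum_{i=1}^{n}\ln x_i, \qquad
\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}\left(\ln x_i-\hat\mu\right)^2 .
```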

answered Jul 07 '10 at 14:49

Bill Bell

The lognormal is the red line, and the density he's trying to fit is the blue line, no? If not, I got it really wrong.

(Jul 07 '10 at 14:51) Alexandre Passos ♦

Thanks both. Alexandre, you are right. The lognormal is in red. The small bumps on it are created by the random deviates used to "build" the curve.

(Jul 07 '10 at 14:57) eduardofv

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.