|
To make some predictions, I want to fit a probability distribution to a density like the one in blue. I've found that a log-normal fits reasonably well (in red), but the "spikes" create some errors in the predictions. How would you tackle the problem? Any advice would be appreciated, especially if you can add some suggestions in R, Python or Ruby :D
|
|
Fitting distributions to data is a pretty wide-open area. There are numerous methods for doing so, with varying strengths and weaknesses. Generally, this is how I would go about it:
As for going about this in R, this is the best guide I have found so far.
|
You can try fitting a mixture model, if it's that important to cover the spikes. They look a lot like overfitting, though. To use that, see kernel density estimation (where you fit the variance of a mixture of one Gaussian per data point in your sample) or just a mixture of log-normal models. I'm not familiar with R, so I don't know of any code for these tasks.
Thanks, I'm heading to check that.
(Jul 07 '10 at 14:11)
eduardofv
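For anyone who wants to try the kernel density idea from the answer above in Python, here is a minimal sketch using only the standard library. It builds an equal-weight mixture of one Gaussian per data point. The `salaries` sample is hypothetical (simulated log-normal draws standing in for the real data), and the bandwidth uses the normal reference rule of thumb, which is an assumption and just one of several possible choices.

```python
import math
import random
import statistics


def gaussian_kde(sample, bandwidth=None):
    """Return a density function: one Gaussian kernel per data point."""
    n = len(sample)
    if bandwidth is None:
        # Normal reference rule of thumb (an assumption, not the only option)
        bandwidth = 1.06 * statistics.stdev(sample) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample
        )

    return density


random.seed(0)
# Hypothetical stand-in for the real salary data
salaries = [random.lognormvariate(10.0, 0.4) for _ in range(500)]
kde = gaussian_kde(salaries)

# Numerical check that the estimate integrates to roughly 1
step = 500
area = sum(kde(x) for x in range(-20_000, 200_000, step)) * step
print(f"approximate area under the KDE: {area:.3f}")
```

A smaller bandwidth would follow the spikes more closely, at the cost of also following sampling noise, which is the overfitting concern raised in the answer.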
|
|
I believe I understand now. Those bumps were not created by 'overfitting'! Please tell me if I am wrong about how you set about producing your graph. You fitted a smoothed density function to the original data. You also generated a pseudo-random sample from a log-normal distribution and fitted a smoothed density to those data. These results are represented by the blue and red curves respectively. Although you have asked only about how to fit density functions, you do say that the "spikes" create some errors in the predictions. As always, the approach that one takes depends on the use that one intends to make of the results. Would you mind saying what you mean by the errors in the predictions, please?
Thank you Bill, you are right about how I generated the graph. The data represent the probability of a job seeker applying to a job that offers a given salary (x). The spikes or bumps in the data represent salary levels to which job seekers are more likely to apply than to levels in their vicinity. For instance, a job seeker is much more likely to apply to a job that offers $20,000 than to one offering $19,900 or $20,100, because $20,000 is one of those spikes. Using only the log-normal fit to predict this case would underestimate the probability.
(Jul 09 '10 at 13:18)
eduardofv
If you have a priori knowledge of what causes the "noise", just model it. In this case, I would consider a mix of periodic functions. Just be careful of overfitting if you do it by hand. You should be able to argue why the function makes sense in a simple case like this.
(Jul 09 '10 at 17:24)
rm999
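One way to "just model it", sketched in Python under loud assumptions: instead of the periodic functions suggested in the comment above, this uses a simpler stand-in, a log-normal baseline plus narrow Gaussian bumps at round salary figures. Every name and parameter value here (`make_spiked_density`, `spike_weight`, `spike_step`, `spike_sd`, `max_salary`, and the numbers passed in) is hypothetical and chosen by hand for illustration, not fitted to the data in the question.

```python
import math


def lognormal_pdf(x, mu, sigma):
    """Log-normal density with log-scale mean mu and log-scale sd sigma."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))


def normal_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def make_spiked_density(mu, sigma, spike_weight, spike_step, spike_sd, max_salary):
    """Mixture: (1 - spike_weight) * log-normal + spike_weight * bumps."""
    centers = list(range(spike_step, max_salary + 1, spike_step))
    # Weight each bump by the baseline density at its centre,
    # so the bumps follow the overall log-normal shape.
    raw = [lognormal_pdf(c, mu, sigma) for c in centers]
    total = sum(raw)
    weights = [r / total for r in raw]

    def density(x):
        bumps = sum(w * normal_pdf(x, c, spike_sd) for w, c in zip(weights, centers))
        return (1 - spike_weight) * lognormal_pdf(x, mu, sigma) + spike_weight * bumps

    return density


# Hypothetical parameters: bumps every $5,000, each about $50 wide
density = make_spiked_density(
    mu=10.0, sigma=0.4, spike_weight=0.2,
    spike_step=5_000, spike_sd=50.0, max_salary=200_000,
)

for salary in (19_900, 20_000, 20_100):
    print(salary, density(salary))
```

With only a handful of extra parameters (the share of mass in the bumps, their spacing and their width), a model like this is easy to argue for and harder to overfit than one with a free parameter per spike.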
|
|
There's something wrong with that 'log-normal', isn't there? A log-normal doesn't have bumps on it like that. (See http://www.wolframalpha.com/input/?i=lognormal) So, whether or not the log-normal is an appropriate density for your purposes, you should be able to select one that is the best in some sense and display it as a bump-free curve overlaid on the original density function. Something wrong with the calculations?
The lognormal is the red line, and the density he's trying to fit is the blue line, no? If not, I got it really wrong.
(Jul 07 '10 at 14:51)
Alexandre Passos ♦
Thanks both. Alexandre, you are right. The lognormal is in red. The small bumps on it are created by the random deviates used to "build" the curve.
(Jul 07 '10 at 14:57)
eduardofv
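To get the bump-free red curve asked for in the answer above, one option is to evaluate the fitted log-normal density directly rather than smoothing random deviates drawn from it. A minimal Python sketch, assuming a maximum-likelihood fit (mean and population standard deviation of the logged data); the `data` sample is simulated and purely hypothetical.

```python
import math
import random
import statistics


def fit_lognormal(sample):
    """Maximum-likelihood estimates of the log-scale mean and sd."""
    logs = [math.log(x) for x in sample]
    return statistics.fmean(logs), statistics.pstdev(logs)


def lognormal_pdf(x, mu, sigma):
    """Analytic log-normal density; smooth by construction."""
    if x <= 0:
        return 0.0
    z = (math.log(x) - mu) / sigma
    return math.exp(-0.5 * z * z) / (x * sigma * math.sqrt(2 * math.pi))


random.seed(1)
# Hypothetical stand-in for the real salary data
data = [random.lognormvariate(10.0, 0.4) for _ in range(2000)]

mu_hat, sigma_hat = fit_lognormal(data)

# Evaluate the fitted density on a grid; plot `curve` against `grid`
grid = [1000 * k for k in range(1, 101)]
curve = [lognormal_pdf(x, mu_hat, sigma_hat) for x in grid]

print(f"mu_hat={mu_hat:.3f}, sigma_hat={sigma_hat:.3f}")
```

Because `curve` comes from the closed-form density, it rises to a single peak and then falls, with none of the small bumps that appear when the curve is built from random draws.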
|
