|
I am using a MATLAB package called fastfit [1] to fit the parameters of a Dirichlet distribution. Since the distribution does not have a simple conjugate prior, the package only gives a maximum likelihood estimate of the parameters (no MAP). Unfortunately, I have quite a limited dataset, so I get a very peaky solution. I know the logistic normal distribution is an alternative to the Dirichlet distribution [2], but I was wondering how bad it is to rescale the ML parameters by multiplying them by a "hyperparameter" alpha. This effectively smooths the distribution, probably in a way similar to a prior. Is there any way to justify this? Would it be especially bad for some reason? [1] http://research.microsoft.com/en-us/um/people/minka/software/fastfit/ [2] http://andrewgelman.com/movabletype/mlm/logistic-normal%20dist%20properties%20and%20uses.pdf |
|
I haven't tried this myself and don't know of any principled justification, but Madsen et al. describe something like this in "Modeling Word Burstiness Using the Dirichlet Distribution", where they use the multivariate Polya distribution to model documents. They first estimate the parameters by maximum likelihood and then smooth them by adding a small constant (0.01 * min(w)) to each parameter, where min(w) is the value of the smallest parameter. My intuition with Dirichlets is also that adding a small constant is better than multiplying by 1 + epsilon: some parameters might go to zero, in which case the likelihood becomes infinite, and multiplication leaves a zero parameter at zero.
(Oct 19 '11 at 11:32)
Alexandre Passos ♦
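For illustration, the additive smoothing described in this comment could be sketched as follows. This is only a sketch: the function name `smooth_additive` and the example values are hypothetical, and using the smallest strictly positive parameter (so that the constant is non-zero even when some estimates are exactly zero) is an assumption rather than something stated in the comment.

```python
def smooth_additive(alpha, factor=0.01):
    """Add factor * (smallest positive parameter) to every parameter.

    Assumption: the minimum is taken over strictly positive entries,
    so the added constant is non-zero even if some estimates are 0.
    """
    positive = [a for a in alpha if a > 0.0]
    if not positive:
        raise ValueError("need at least one strictly positive parameter")
    constant = factor * min(positive)
    return [a + constant for a in alpha]


# Hypothetical ML estimates, one of which has collapsed to zero.
alpha_ml = [0.0, 2.0, 0.5]
alpha_smoothed = smooth_additive(alpha_ml)
print(alpha_smoothed)
```

Note that multiplying `alpha_ml` by any factor would leave the zero entry at zero, which is the point made above.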
I think they are facing a different problem in that paper. Their problem is that some of their alphas are 0 because they have many dimensions and some of their training vectors will have many zeros. My problem is that my training data is all very similar, and I expect more variation (like fitting a Gaussian to very few points, underestimating variance). My alphas are very big, 43e4, and I want to make them smaller.
(Oct 19 '11 at 12:26)
Roderick Nijs
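To make the effect of rescaling concrete: for a Dirichlet with parameters alpha and s = sum(alpha), the mean of component k is alpha_k / s and its variance is m_k (1 - m_k) / (s + 1). Multiplying every alpha by a constant c therefore leaves the mean unchanged and only changes the concentration, so c < 1 widens the distribution. A minimal sketch, with made-up parameter values and a hypothetical scale factor `c`:

```python
def dirichlet_mean_var(alpha):
    """Per-component mean and variance of a Dirichlet(alpha)."""
    s = sum(alpha)
    mean = [a / s for a in alpha]
    var = [m * (1.0 - m) / (s + 1.0) for m in mean]
    return mean, var


# Hypothetical, very large ML estimates (a "peaky" fit).
alpha_ml = [430000.0, 250000.0, 120000.0]
c = 1e-3  # hypothetical scale factor; c < 1 lowers the concentration
alpha_scaled = [c * a for a in alpha_ml]

mean_ml, var_ml = dirichlet_mean_var(alpha_ml)
mean_sc, var_sc = dirichlet_mean_var(alpha_scaled)

print("means:    ", mean_ml, mean_sc)
print("variances:", var_ml, var_sc)
```

So rescaling keeps the fitted proportions and only trades off how tightly the distribution concentrates around them; how to choose `c` is still left open.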
Ah, I see now what you're looking for. Did you consider just adding some appropriate noise to the data?
(Oct 19 '11 at 12:48)
Oscar Täckström
|
|
If you're optimizing numerically you can use any prior you want, not just a conjugate prior. With Dirichlets you probably want to regularize the alphas to be close to 1, so I'd just use a Gaussian prior with mean 1 and a variance chosen to avoid peakiness. Adding this prior just adds a penalty term to your objective function, namely -(alpha - 1)^2 / (2 * variance) for each alpha when you maximize the log-likelihood, which is always easy to do.

Good point. I think the implementation I found [1] does not allow for this so easily though, because it is not doing gradient ascent but an alternating fixed-point iteration instead. I did find a usable expression for the gradient [2], so this could be a nice alternative.
[1] http://research.microsoft.com/en-us/um/people/minka/papers/dirichlet/minka-dirichlet.pdf
[2] Maximum Likelihood Estimation of Dirichlet Distribution Parameters
(Oct 19 '11 at 12:39)
Roderick Nijs
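As a rough sketch of what the penalized objective could look like: the Dirichlet log-likelihood of N probability vectors is N [lgamma(sum alpha) - sum_k lgamma(alpha_k)] + sum_k (alpha_k - 1) sum_n log p_nk, and an independent Gaussian prior with mean 1 subtracts (alpha_k - 1)^2 / (2 * variance) for each k. The code below is not fastfit's API; all names (`log_posterior`, `grad_log_posterior`, `digamma`) and the toy data are hypothetical, and the digamma function is a standard recurrence plus asymptotic series written out to keep the example dependency-free.

```python
import math


def digamma(x):
    """Digamma via recurrence up to x >= 6, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def log_posterior(alpha, data, variance):
    """Dirichlet log-likelihood plus a Gaussian(1, variance) log-prior
    on each alpha (up to an additive constant)."""
    n = len(data)
    loglik = n * (math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha))
    for k, a in enumerate(alpha):
        loglik += (a - 1.0) * sum(math.log(p[k]) for p in data)
    penalty = sum((a - 1.0) ** 2 for a in alpha) / (2.0 * variance)
    return loglik - penalty


def grad_log_posterior(alpha, data, variance):
    """Gradient of log_posterior with respect to each alpha_k."""
    n = len(data)
    psi_sum = digamma(sum(alpha))
    grad = []
    for k, a in enumerate(alpha):
        g = n * (psi_sum - digamma(a)) + sum(math.log(p[k]) for p in data)
        grad.append(g - (a - 1.0) / variance)
    return grad


# Toy data: three very similar probability vectors (made up).
data = [[0.50, 0.30, 0.20], [0.52, 0.29, 0.19], [0.49, 0.31, 0.20]]
alpha = [5.0, 3.0, 2.0]
print(log_posterior(alpha, data, variance=4.0))
print(grad_log_posterior(alpha, data, variance=4.0))
```

These two functions can be handed to any gradient-based optimizer (keeping the alphas positive, for example by optimizing log alpha). A smaller `variance` pulls the alphas more strongly toward 1; picking it is the remaining modelling choice.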
|