|
People often use either the posterior mean or the posterior mode as a point estimator of parameters. Any idea when one is preferable? Edit: In particular, I'm wondering about theoretically grounded reasons for a preference. |
|
The posterior mode can change under reparametrization (see the common problems with MAP estimation in continuous distributions), so it's not a good idea in those cases (Dirichlet process mixtures are a good example off the top of my head). On the other hand, the posterior mean can be completely uninformative, as you can see in section 8.5.1 of the Koller & Friedman graphical models book (it talks about M-projection, but M-projection with a point estimate as the approximating distribution essentially gets the posterior mean). I'd say that theoretically they're both unsound, and you're better off doing something else. In practice, choose the one that is less flawed on the specific problem: for example, with unidentifiable latent variables choose the posterior mode; if your density function diverges or there is more than one interesting parametrization, choose the posterior mean.
True, continuous distributions are more complicated. AFAIK the mode is parametrization invariant when estimating parameters of discrete distributions, and the Laplace approximation is also done by expansion around the mode rather than the mean.
(Sep 21 '10 at 20:28)
Yaroslav Bulatov
The Laplace approximation is basically an I-projection of a Gaussian onto the true posterior, IIRC, precisely because of the thing we discussed on the other thread about having a good local approximation.
(Sep 21 '10 at 20:29)
Alexandre Passos ♦
Well, I-projection only in a very loose sense: it merely comes close to the mode rather than matching it. See Figure 8.1 on page 275 of K&F.
(Sep 22 '10 at 12:00)
Yaroslav Bulatov
That's because the Laplace approximation is "fit the mean (at the posterior mode) and then estimate the variance", while I-projection fits both jointly.
(Sep 22 '10 at 12:13)
Alexandre Passos ♦
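To make the reparametrization point from this answer concrete, here is a minimal sketch. It assumes a made-up Gamma(shape=3, rate=1) posterior over a positive parameter `theta` (not taken from the thread) and finds the mode on a grid, once in `theta` and once in `phi = log(theta)`. The names `log_density_theta`, `log_density_phi`, and the grid sizes are illustrative choices only.

```python
import numpy as np

a = 3.0  # assumed shape of a hypothetical Gamma(shape=a, rate=1) posterior

def log_density_theta(theta):
    # Unnormalized log density in the original parametrization.
    return (a - 1.0) * np.log(theta) - theta

def log_density_phi(phi):
    # Same distribution expressed in phi = log(theta).
    # The change of variables adds the log-Jacobian term, which is +phi.
    return log_density_theta(np.exp(phi)) + phi

theta_grid = np.linspace(1e-3, 20.0, 200_001)
phi_grid = np.linspace(-5.0, 4.0, 200_001)

# Mode found directly in theta.
map_theta = theta_grid[np.argmax(log_density_theta(theta_grid))]
# Mode found in phi, then mapped back to theta.
map_via_phi = np.exp(phi_grid[np.argmax(log_density_phi(phi_grid))])

print(map_theta)    # analytically a - 1
print(map_via_phi)  # analytically a
```

Both numbers describe the same distribution, yet the "most probable value" moves from a - 1 to a just by switching to the log scale. For a discrete parameter there is no Jacobian term, which is why the mode is invariant in that case.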
|
|
If the posterior is symmetric and unimodal, you can use the mode (which is also less expensive to compute than the mean, since the latter requires integration). For multimodal or skewed distributions, it's advisable to use the mean. Sometimes the posterior median is actually preferred over both the mode and the mean; it tends to work well for non-symmetric distributions. Edit: In theory, all three point estimates (mean, mode, median) are consistent (i.e., they converge to the true value as the sample gets larger), asymptotically unbiased, and efficient. But in practice their behavior may differ quite a bit. Apart from symmetry and multimodality, IMO the tail behavior of the distribution is also important. For one-sided tails, the mode may not be a good choice. The mean can be problematic for heavy-tailed distributions, since it can end up quite far from the locations of significant probability mass. The median is the most robust to tail behavior but can be a bit difficult to compute, since you have to solve $\int_{-\infty}^{m} p(\theta \mid x)\, d\theta = 1/2$ for the median $m$. So you may have to use numerical methods to find the median location.
An interesting approach I came across is the Bayes Point Machine, i.e., find a point estimate which gives the density closest to the Bayesian predictive density. This seems to be the most theoretically founded point estimator, but I couldn't find it used outside of a small number of classification domains.
(Sep 21 '10 at 18:41)
Yaroslav Bulatov
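As a rough illustration of the numerical median computation mentioned in the answer above, the sketch below tabulates an unnormalized posterior on a grid, builds the CDF with the trapezoid rule, and locates the point where it crosses 1/2. The density (Gamma(shape=2, rate=1), up to a constant) and the grid bounds are assumptions picked only because the density is skewed; a real posterior would need its own grid, and grids do not scale beyond a few dimensions.

```python
import numpy as np

def unnormalized_posterior(theta):
    # Hypothetical skewed posterior: Gamma(shape=2, rate=1) up to a constant.
    return theta * np.exp(-theta)

theta = np.linspace(0.0, 40.0, 400_001)
dens = unnormalized_posterior(theta)
step = theta[1] - theta[0]

# Cumulative trapezoid rule: unnormalized CDF at every grid point.
cdf = np.concatenate(([0.0], np.cumsum(0.5 * (dens[1:] + dens[:-1]) * step)))
norm = cdf[-1]
cdf = cdf / norm

# Mode: grid point with the highest density (no integration needed).
post_mode = theta[np.argmax(dens)]

# Mean: trapezoid rule applied to theta * density, then normalized.
weighted = theta * dens
post_mean = np.sum(0.5 * (weighted[1:] + weighted[:-1]) * step) / norm

# Median: first grid point where the CDF reaches 1/2.
post_median = theta[np.searchsorted(cdf, 0.5)]

print(post_mode, post_median, post_mean)
```

For this right-skewed example the three estimates come out ordered as mode < median < mean, which matches the remark that the mean gets pulled toward the tail.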
|
|
The most appropriate point estimate depends on your loss function. The posterior mean corresponds to the case where you have a squared loss (L2) between the true parameter value and your estimate. The posterior median corresponds to absolute loss (L1), and the posterior mode (MAP) corresponds to the L0 (zero-one) loss. Obviously this will depend on your parametrization: the posterior mean of log x is not the log of the posterior mean of x. For this reason, the MAP estimate is somewhat overrated, as the parametrization should not matter in the Bayesian context. See p. 306 of David MacKay's book for more information on this. For purposes of prediction, Bayesian theory says you should integrate over the full posterior. However, if you must use a point estimate, then the optimal one is called the Bayes' point: the point estimate that minimizes the expected (under the posterior) loss in prediction.
I've seen the Bayes Point paper and it sounds like a principled approach, although I don't really see anybody using it in practice.
(Dec 29 '10 at 16:14)
Yaroslav Bulatov
I haven't heard of many people using the Bayes' point in practice, either. I haven't looked at it very closely, and I am not sure what the typical computational difficulty is for finding the Bayes' point. I will ask Ed Snelson about it next time I see him ;) If you're willing to average over a collection of point estimates, you could try Max Welling's herding methods: Y. Chen, M. Welling and A. Smola (2010), "Super-Samples from Kernel Herding", UAI 2010.
(Dec 30 '10 at 01:32)
Ryan Turner
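The loss-function view in the answer above can be checked empirically. The following sketch draws samples from a made-up posterior (Gamma(shape=2, scale=1), an assumption for illustration), evaluates the expected loss of every candidate estimate on a grid, and picks the minimizer under squared, absolute, and zero-one loss. The zero-one loss is approximated with a tolerance window `eps`, which is also an assumed value; the exact MAP correspondence only holds as the window shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical posterior draws; any sampler output could be used instead.
samples = rng.gamma(shape=2.0, scale=1.0, size=5_000)

candidates = np.linspace(0.0, 8.0, 801)
# diff[i, j] = candidate i minus posterior sample j.
diff = candidates[:, None] - samples[None, :]

# Monte Carlo estimates of the posterior expected loss for each candidate.
sq_risk = np.mean(diff ** 2, axis=1)
abs_risk = np.mean(np.abs(diff), axis=1)
eps = 0.25  # assumed tolerance for the zero-one loss
zero_one_risk = np.mean(np.abs(diff) > eps, axis=1)

best_sq = candidates[np.argmin(sq_risk)]
best_abs = candidates[np.argmin(abs_risk)]
best_01 = candidates[np.argmin(zero_one_risk)]

print(best_sq, samples.mean())       # squared loss vs. sample mean
print(best_abs, np.median(samples))  # absolute loss vs. sample median
print(best_01)                       # zero-one loss, roughly near the mode
```

Up to the grid resolution, the squared-loss minimizer lands on the sample mean and the absolute-loss minimizer on the sample median. The zero-one minimizer is noisier, since it depends on both the finite sample and the window width, but it sits in the high-density region around the mode.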
|