I'm using a Metropolis sampler to draw from the unnormalized posterior p(X|theta)p(theta), i.e. the product of the data likelihood and a Gaussian prior over the parameters. When I use a fake prior that is always equal to 1, the algorithm behaves as I would expect: it initially finds some mode of high probability and then wanders around in that area. However, when I add the prior, I find that the objective goes down systematically after a while. Every time the sampler makes a step, all the new proposals have lower probability, and it rejects most of them until one is only slightly lower. This goes on until the algorithm starts behaving more like a random walk again, in a region of quite low probability...

This seems very odd to me, as the sampler should be equally likely to propose going back in the direction of higher probability it just came from (I'm using a spherical Gaussian to generate proposals). I'm quite sure the sampling algorithm is correct, as it worked well for a variety of tasks where I didn't include the prior. Even if the prior flattened the joint distribution a lot, I would still expect random-walk behavior and not the systematic decrease I find now. Could this be due to numerical issues or high dimensionality, or is this possible for certain peculiar types of distributions?

asked Dec 30 '10 at 13:33


Philemon Brakel


Are you including the prior in the computation of the likelihood that you use for the Metropolis step? I've had a similar bug in the past, and it was due to inconsistent use of the priors.

Does this still happen with less data? It sounds like something that shouldn't happen unless you're using a very bad proposal distribution that never proposes something in a good direction. Can you try stopping the simulation, proposing a better step by hand, and checking its acceptance probability?

(Dec 30 '10 at 20:33) Alexandre Passos ♦

Thanks for the help.

I am indeed including the prior in the computation of the likelihood that I use for the Metropolis step, by simply adding its log. Is this wrong somehow?

The problem seems to be independent of the number of datapoints, which I varied between 50 and 4000... I'm not really sure how to try your suggestion of manually proposing a better step. When I print the acceptance probabilities, they are 1 when the new likelihood is higher and less than 1 when it is lower, as they should be, but somehow the proposals are almost always lower than the current value. They are not much lower, just a bit, so they are easily accepted. After accepting one of these lower-probability steps, all the new proposals are immediately slightly lower than the value that has just been accepted. The acceptance rate also seems to be lower in the beginning than later on. This seems odd as well, because I'm more used to the opposite pattern (needing big steps for burn-in and smaller ones when sampling after convergence).
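For reference, a minimal sketch of the kind of step described here: the log prior is added to the log likelihood, and the sum is used as a single score in the acceptance test. This is not the poster's code. The names `log_gaussian_prior`, `make_log_posterior` and `metropolis_step` are hypothetical, and a zero-mean spherical Gaussian prior is assumed purely for illustration.

```python
import numpy as np

def log_gaussian_prior(theta, sigma=1.0):
    # Log density of a zero-mean spherical Gaussian, up to an additive constant.
    # (Assumed prior for illustration; the thread's actual prior is not shown.)
    return -0.5 * np.sum(theta ** 2) / sigma ** 2

def make_log_posterior(log_likelihood, log_prior):
    # Unnormalized log posterior: log p(X|theta) + log p(theta).
    return lambda theta: log_likelihood(theta) + log_prior(theta)

def metropolis_step(theta, log_post, step_size, rng):
    # Symmetric spherical Gaussian proposal centered on the current sample.
    proposal = theta + step_size * rng.standard_normal(theta.shape)
    log_ratio = log_post(proposal) - log_post(theta)
    # Accept with probability min(1, exp(log_ratio)); uphill moves always pass.
    if np.log(rng.random()) < log_ratio:
        return proposal, True
    return theta, False
```

The important detail is that the same `log_post` is evaluated for both the current point and the proposal, so the prior enters the acceptance ratio consistently. Calling `metropolis_step` in a loop with `rng = np.random.default_rng(0)` and collecting the returned states gives a chain.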

(Dec 31 '10 at 06:30) Philemon Brakel

That's interesting. So if the acceptance is right (1 with higher likelihood, smaller as the likelihood goes down), the problem can only be in your proposal distribution. Is it symmetrical? Is there a bug in computing it?

(Dec 31 '10 at 06:33) Alexandre Passos ♦

The proposal distribution is a spherical Gaussian that is centered on the last accepted sample. I'm using a standard implementation from Python's NumPy for this. I printed separate scores for the data likelihood and the prior, and noticed that the data likelihood always goes up until some sort of mode is reached, while the prior likelihood always goes down very slowly. This causes the total likelihood to go up first but then go systematically down forever after the data likelihood has maxed out. I'm starting to suspect the function that computes the prior likelihood. It seems to work well for two dimensions, though, and even if it were unstable I wouldn't expect it to be so systematic that Metropolis can't cope with it... I'll try some simpler (spherical) priors to see if that is where the problem comes from.

You mentioned having a similar problem due to inconsistent priors, and I'm starting to think this might indeed also be what causes the strange results I'm finding now.

(Dec 31 '10 at 09:05) Philemon Brakel

One Answer:

I've become convinced that the whole thing is simply due to an interaction between the high dimensionality of the problem I'm trying to solve and the strong differences in probability density in some regions. The bottom line seems to be that near a center of high density (for example the mean of a Gaussian), the set of directions that move to higher density is a lot smaller than the set of directions that keep the density equal or decrease it.

While I didn't do the math, I think the following makes sense. Take a spherical Gaussian proposal distribution and another spherical Gaussian that we wish to sample from. The set of 'good' directions is then approximated by the intersection of two hyperspheres: the first is centered on the mean of the target distribution, with the distance between the current point and that mean as its radius, and the second is centered on the current point, with a radius set by the scale (standard deviation) of the proposal distribution. If the two centers are very far apart, the first sphere will be very large and will cut the smaller proposal sphere almost exactly in half, along an almost flat hyperplane. In this case almost 50% of the proposals will be 'good'. When the centers are near each other (so the differences in density become larger), the first sphere will be smaller, and the intersecting volume will be smaller as well due to the curvature of the first sphere.

My intuition is that this volume becomes very small compared to the rest of the proposal sphere when the dimensionality increases, in areas where the density gradient is large. This would mean that if you start near a mode of the distribution, the chance of finding a direction of higher density is a lot smaller than the chance of accepting, once in a while, a nearby point with slightly lower density. This continues until one has moved so far away that the gradient becomes smaller and the chance of finding 'good' directions has increased again. This fits with the fact that most of the probability mass of a high-dimensional Gaussian is located at its 'edges' and not at its center.
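This intuition can be checked numerically. The sketch below is an illustration under stated assumptions, not something from the original experiment: the target is a standard spherical Gaussian centered at the origin, and the function name `uphill_fraction` and its parameters are hypothetical. It estimates by Monte Carlo what fraction of spherical Gaussian proposals land at higher target density, which for this target simply means landing closer to the mean.

```python
import numpy as np

def uphill_fraction(dim, radius, step_size, n_proposals=20000, seed=0):
    """Fraction of proposals with higher density under a standard
    spherical Gaussian target, starting at distance `radius` from its mean."""
    rng = np.random.default_rng(seed)
    # By rotational symmetry only the distance to the mean matters,
    # so the current point is placed on the first axis.
    x = np.zeros(dim)
    x[0] = radius
    proposals = x + step_size * rng.standard_normal((n_proposals, dim))
    # Higher density under the target is equivalent to a smaller squared norm.
    return np.mean(np.sum(proposals ** 2, axis=1) < radius ** 2)

if __name__ == "__main__":
    for dim in (2, 10, 100):
        for radius in (1.0, 10.0, 100.0):
            frac = uphill_fraction(dim, radius, step_size=0.1)
            print(f"dim={dim:4d}  radius={radius:6.1f}  uphill fraction={frac:.3f}")
```

With a fixed step size, the fraction approaches one half when the current point is far from the mean, and it shrinks toward zero as the dimension grows while the point stays close to the mean, which matches the two-spheres picture above.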

I'm curious if that made sense at all to anyone but for the moment I consider my problem solved. It seems I just need a lot of samples to compensate for this effect...

answered Jan 02 '11 at 04:31


Philemon Brakel

I'm not entirely convinced, because if the proposal distribution is symmetric and you've moved to a zone of low probability, then you should be more likely to move back. What you mentioned about the likelihood staying high while the prior goes down suggests that your sampler is somehow ignoring the prior when accepting proposals (otherwise it should be just as likely for the likelihood to go down and the prior to go up).

On the other hand, it is normal for a sampler to leave a zone of high probability and get lost if, as you described, that zone is very small and its probability is not much higher than that of the surrounding area (if the space is like a twisty valley that's not very steep, then it's easy to see, as you mentioned, that almost all proposals are just a little bit worse than the place you came from).

(Jan 02 '11 at 05:43) Alexandre Passos ♦

Thanks for thinking along again. One additional thing I noticed was that after sampling for long enough, the prior stops going down systematically and remains more or less stable. Of course my explanation above is a form of backwards reasoning, so I should remain a bit skeptical... Since I add the data and prior likelihoods immediately after computing them and treat them as a single score after that, I think the fact that they behave differently has to be caused by some property of the shape of the distributions. I guess it is also useful to know that I start sampling almost at the maximum of the prior (the zero vector in my case) and in a lower-density area of the data likelihood. The behavior looked more normal when I started at a place of low probability for both types of likelihood.
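One way to see why a chain started at the maximum of a high-dimensional Gaussian prior should drift down and then level off: exact samples from such a Gaussian essentially never sit near its mode. The sketch below assumes a standard spherical Gaussian with its mode at the zero vector (the actual prior in this thread may differ), and the function name `log_density_relative_to_mode` is hypothetical. It draws independent samples directly, with no MCMC involved, and reports their log density relative to the mode.

```python
import numpy as np

def log_density_relative_to_mode(dim, n_samples=5000, seed=0):
    """Log density of exact N(0, I) samples minus the log density at the mode.

    Returns the mean and the maximum over the drawn samples."""
    rng = np.random.default_rng(seed)
    samples = rng.standard_normal((n_samples, dim))
    # log p(x) - log p(0) = -||x||^2 / 2 for a standard spherical Gaussian.
    rel_log_density = -0.5 * np.sum(samples ** 2, axis=1)
    return rel_log_density.mean(), rel_log_density.max()

if __name__ == "__main__":
    for dim in (2, 10, 100):
        mean_gap, best_gap = log_density_relative_to_mode(dim)
        print(f"dim={dim:4d}  mean={mean_gap:8.2f}  max={best_gap:8.2f}")
```

Since the expected squared norm of a sample is equal to the dimension, the mean gap is about minus half the dimension. A correct sampler started at the zero vector therefore has to lose roughly that much prior log density before it reaches the region where the mass actually is, after which the score stabilizes.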

(Jan 02 '11 at 08:16) Philemon Brakel

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.