
I read somewhere that Bayesian Methods do not overfit. What exactly is meant by this statement and is it true?

asked Sep 26 '11 at 14:06

Lancelot

Also check out this related question: http://metaoptimize.com/qa/questions/4164/precise-definition-of-over-fitting, which has a couple of very good insights regarding overfitting.

(Oct 03 '11 at 09:49) Justin Bayer

Thanks for the link

(Oct 03 '11 at 09:54) Lancelot

2 Answers:

The important property of Bayesian methods to me is the marginalization, not the prior. Even when the prior has little influence on the results, the very act of integrating over model parameters can avoid overfitting. As Bishop's book says "the phenomenon of over-fitting is really an unfortunate property of maximum likelihood and does not arise when we marginalize over parameters in a Bayesian setting."

I think the most illuminating explanation of why Bayesian methods avoid overfitting shows up in the Gaussian Processes for Machine Learning book in chapter 5.

"It is primarily the marginal likelihood from eq. (5.4) involving the integral over the parameter space which distinguishes the Bayesian scheme of inference from other schemes based on optimization. It is a property of the marginal likelihood that it automatically incorporates a trade-off between model fit and model complexity."

I encourage you to read the section I quoted and look at figure 5.2 (page 110 in the book).

Very complicated models that can fit many hypothetical datasets accurately must spread their marginal likelihood over all of those datasets (the marginal likelihood is a valid distribution over datasets, so it integrates to 1). This smearing of probability density prevents the marginal likelihood of a complex model from ever attaining values as large as those a simpler model reaches on the datasets it fits well.
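To make the fit-versus-complexity trade-off concrete, here is a minimal sketch (not taken from the GPML book) that scores polynomial models of increasing degree by their log marginal likelihood on data that is truly linear. It assumes a zero-mean Gaussian prior on the weights with precision `alpha` and Gaussian noise with precision `beta`; both values, the synthetic data, and the helper name `log_evidence` are illustrative assumptions.

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """Log marginal likelihood of t, assuming w ~ N(0, I/alpha) and noise ~ N(0, 1/beta)."""
    n = len(t)
    # Integrating out the weights leaves t ~ N(0, C), C = I/beta + Phi Phi^T / alpha.
    C = np.eye(n) / beta + Phi @ Phi.T / alpha
    _, logdet = np.linalg.slogdet(C)
    quad = t @ np.linalg.solve(C, t)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
t = -0.5 + 1.5 * x + rng.normal(scale=0.2, size=x.size)  # synthetic, truly linear data

alpha, beta = 1.0, 1.0 / 0.2**2  # assumed prior and noise precisions
log_ev = {}
for degree in range(10):
    Phi = np.vander(x, degree + 1, increasing=True)  # columns 1, x, ..., x^degree
    log_ev[degree] = log_evidence(Phi, t, alpha, beta)
    print(degree, round(log_ev[degree], 2))
```

The constant model (degree 0) scores very poorly because it cannot explain the trend. Higher degrees always fit at least as well, but the log-determinant term charges them for the extra datasets they could have explained, so on data like this the evidence typically peaks at or near the true degree rather than at the most flexible model.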

answered Sep 28 '11 at 18:25

gdahl ♦

edited Sep 28 '11 at 18:33

As far as I understand it, the marginal likelihood scheme works great for model selection, which is also a Bayesian scheme. Bayesian Linear Regression, on the other hand, is exactly the example I showed ;).

(Sep 28 '11 at 18:40) Leon Palafox

Sorry for misunderstanding your answer then. When I read it, it did not mention integrating over the model parameters at all. I guess I should have looked at what section 3.3 actually is.

(Sep 28 '11 at 19:11) gdahl ♦

Actually both answers are right, since Gaussian Processes can be seen as a more general scheme where Bayesian linear regression fits very well.

(Sep 28 '11 at 21:44) Leon Palafox

I really don't understand what you are trying to say. I don't think answering the question by telling someone to read about a specific Bayesian technique is as helpful as it could be. My point was about Bayesian methods in more generality and that the integration that includes weighting by the marginal likelihood is the important thing, not the prior sometimes being like penalized maximum likelihood.

(Sep 29 '11 at 00:10) gdahl ♦

Both of you have confused me, so are both answers correct? Please help me with this.....

(Sep 29 '11 at 05:43) Lancelot

You should read Bishop's section 3.3. In its first paragraph you can see: "[...] a Bayesian treatment of linear regression, which will avoid the overfitting problem of maximum likelihood" (your question). In section 3.4 he goes on to Bayesian Model Comparison (which is what gdahl argues for), a more general perspective on Bayesian models in which the marginal likelihoods are valid scores for different models. With that, you can choose (automatically) the complexity of your model (how many parameters, their size, etc.). For example, with a simple 2-parameter Bayesian linear approach, you are bound to find only lines in a 2D space for your data, but model selection allows you to look at all the possible models you have (lines and higher-order curves). Overfitting is avoided since you are averaging rather than committing to a single fit, which is a property of Bayesian analysis: the product of an informative prior and a likelihood gives you a good posterior for your data.

(Sep 29 '11 at 07:19) Leon Palafox

Ashutosh: first, Bayesian methods can overfit in the loose sense (where this means "perform worse on test data than on training data"), as they can still be sensitive to domain shifts.

However, they overfit far less than non-Bayesian methods. The two main reasons for this are using a prior, as Leon said, and marginalizing, as gdahl said. Marginalizing probably has the stronger effect, since for most loss functions averaging can't make things worse, so gdahl's answer makes the stronger point, but both are right.

If you're still confused feel free to ask more questions :-)

(Sep 29 '11 at 08:22) Alexandre Passos ♦

Here is my summary. Although penalizing the likelihood can help control overfitting, the reason a fully Bayesian approach inherently avoids overfitting is because of the marginal-likelihood-weighted integral over model parameters.

(Oct 02 '11 at 21:24) gdahl ♦

I haven't read Chapter 5 of the GPML book yet (going to do so now!), but is "Bayesian methods are an averaging technique" another way of stating your answer that marginalization is the important property of Bayesian methods with respect to overfitting? (I usually think about marginalization as a way of getting rid of a parameter, but in this case it seems like it's better to think of it as a way of averaging over many different models with different values of a parameter. I think this is exactly what you're saying in your last paragraph, but want to check if you agree =).)

(Oct 03 '11 at 17:45) grautur

A simple example:

Take a simple linear regression. If you use standard squared-loss minimization, you get a nice closed-form solution. The problem with it is overfitting, so instead of fitting with the squared loss alone, you add a regularizer that controls how much the weights grow, so you can keep them in check.

This regularizer can take different forms, but for simplicity, we use an L2 norm.

Now, it can be proven that maximizing the likelihood times a Normal prior (that is, MAP estimation) is the same as minimizing the squared loss plus an L2 regularizer, so by choosing different priors, you are actually choosing different regularizers. For example, if you use a mixture of densities as a prior, you can account for a multimodal distribution.

For more on this, you can check chapter 3.3 of Bishop's PRML.
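As a quick numerical check of the equivalence between a Normal prior and an L2 regularizer, the sketch below compares the ridge solution with the MAP estimate for a Gaussian likelihood. The synthetic data, the degree-5 polynomial features, and the precisions `alpha` (prior) and `beta` (noise) are assumptions chosen for illustration; the penalty strength that matches them is `lam = alpha / beta`.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
t = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)  # synthetic data
Phi = np.vander(x, 6, increasing=True)  # degree-5 polynomial features

alpha = 2.0            # assumed precision of the Normal prior on the weights
beta = 1.0 / 0.2**2    # assumed precision of the Gaussian noise
lam = alpha / beta     # equivalent L2 penalty strength

I = np.eye(Phi.shape[1])
# Ridge regression: minimize ||t - Phi w||^2 + lam * ||w||^2
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * I, Phi.T @ t)
# MAP estimate: maximize N(t | Phi w, 1/beta) * N(w | 0, I/alpha)
w_map = beta * np.linalg.solve(alpha * I + beta * Phi.T @ Phi, Phi.T @ t)

# The gradient of the negative log posterior vanishes at the shared solution.
grad = -beta * Phi.T @ (t - Phi @ w_map) + alpha * w_map
print(np.allclose(w_ridge, w_map), np.abs(grad).max())
```

Both routes return the same single weight vector, which is why this correspondence is a statement about point estimates: nothing has been integrated over yet.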

answered Sep 27 '11 at 01:02

Leon Palafox

Thanks for the answer. So if the chosen prior is not correct, then even Bayesian methods can overfit, right?

(Sep 27 '11 at 03:58) Lancelot

If you choose your prior to be 1 (that is, a flat prior), then yes, you'll have overfitting. But you would seldom do that; the point of using Bayesian analysis is to avoid overfitting while worrying less about the regularization parameters. There would be no point in using Bayesian analysis if you still had overfitting.

(Sep 27 '11 at 05:05) Leon Palafox

Yeah..thanks for the reply...

(Sep 27 '11 at 05:50) Lancelot

If the prior does not reflect your beliefs about the problem then there is no reason to trust the results of Bayesian inference.

(Sep 28 '11 at 18:03) gdahl ♦

The example Leon describes isn't really Bayesian since an optimization procedure still picks a single value of the model parameters and thus it is still vulnerable to overfitting. A fully Bayesian approach that integrates over all the model parameters would avoid overfitting.

(Sep 28 '11 at 18:30) gdahl ♦

Please send a note to Bishop to remove that example from Bayesian Linear Regression in his book ;)

(Sep 28 '11 at 18:35) Leon Palafox

Leon, what you describe in your answer is NOT what Bishop calls a Bayesian method. You simply quote the relationship between doing MAP with some particular prior and a corresponding regularization term. As charming as your sarcasm is, I worry that it confuses the issue even more. By all means, Lancelot should read 3.3 in PRML since it does discuss integrating over the parameters of a model to form the predictive distribution for Bayesian linear regression. Yes, that section does mention in passing the relationship you describe in your answer, but it is ultimately irrelevant to the issue of overfitting. I can overfit with an L2 penalty, and I can also overfit with a uniform prior over the weights or all sorts of other priors, as long as I am picking a single set of weights that maximizes the posterior.
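The difference between picking a single set of weights and integrating over them can be sketched for Bayesian linear regression, where the integral has a closed form. The example assumes a zero-mean Gaussian prior with precision `alpha`, Gaussian noise with precision `beta`, cubic polynomial features, and six synthetic data points; all of these, and the helper name `features`, are illustrative choices rather than anything from PRML.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 6)
t = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)  # synthetic data

def features(x):
    return np.vander(np.atleast_1d(x), 4, increasing=True)  # 1, x, x^2, x^3

alpha, beta = 1.0, 1.0 / 0.2**2  # assumed prior and noise precisions
Phi = features(x)

# Gaussian posterior over the weights: N(w | m_N, S_N)
S_N = np.linalg.inv(alpha * np.eye(4) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = np.array([0.0, 1.0, 3.0])
Phi_new = features(x_new)
mean = Phi_new @ m_N
# Plug-in prediction with the single MAP weight vector: noise variance only
var_plugin = np.full(x_new.shape, 1.0 / beta)
# Integrating over w adds the parameter uncertainty phi^T S_N phi
var_bayes = 1.0 / beta + np.einsum("ij,jk,ik->i", Phi_new, S_N, Phi_new)

for xi, m, vp, vb in zip(x_new, mean, var_plugin, var_bayes):
    print(f"x={xi:+.1f} mean={m:+.3f} plug-in var={vp:.3f} Bayesian var={vb:.3f}")
```

In this Gaussian case the MAP weights coincide with the posterior mean, so both approaches give the same point prediction. What the plug-in approach throws away is the `phi^T S_N phi` term: it reports the same confidence at x = 3, far from any data, as it does inside the training range, whereas the integrated prediction widens there.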

(Apr 07 at 02:51) gdahl ♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.