I was watching John Shawe-Taylor's COLT tutorial on PAC-Bayes and saw that he went to a lot of trouble to use a PAC-Bayes bound for SVMs.

The standard PAC-Bayes bound states that E[TrueLoss(h)] <= E[EmpiricalLoss(h)] + sqrt(KL(posterior || prior) + log(1/delta)) with probability 1 - delta, where the expectations are over h drawn from the posterior (this specific form was given in the first part of the tutorial). Given this bound, he then chooses a Gaussian prior and a Gaussian posterior over the weights in order to compute it.

Why, however, didn't he just use a point mass as the posterior distribution? In that case the KL divergence would seem to reduce to log(1/prior(w)), which, for a Gaussian prior, is (up to constants) proportional to the squared norm of the weight vector divided by the C parameter, giving a bound that almost justifies the SVM optimization problem. What, exactly, is gained by using an artificial distribution as the posterior?
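For concreteness, here is a quick sketch of the KL term in the Gaussian-posterior case, using the closed-form KL between isotropic Gaussians (the weight vector here is made up purely for illustration):

```python
import math

def kl_isotropic_gaussians(mu1, var1, mu0, var0):
    """Closed-form KL( N(mu1, var1*I) || N(mu0, var0*I) )."""
    d = len(mu1)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu1, mu0))
    return 0.5 * (d * (var1 / var0 - 1.0 + math.log(var0 / var1))
                  + sq_dist / var0)

w = [3.0, 4.0]  # made-up "learned" weight vector
# Posterior N(w, I) vs. standard Gaussian prior N(0, I):
# the variance terms cancel and the KL collapses to ||w||^2 / 2
kl = kl_isotropic_gaussians(w, 1.0, [0.0, 0.0], 1.0)  # = 12.5
```

With both variances equal, the KL is exactly the squared-norm regularizer (here ||w||^2 / 2 = 12.5), which is why the Gaussian-posterior bound looks so close to the SVM objective.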

asked Sep 27 '12 at 18:53


Alexandre Passos ♦

edited Sep 27 '12 at 21:04


One Answer:

Mark Reid answered this on Twitter here, here, and here. It has to do with the KL divergence being ill-defined (infinite) between a point mass and a continuous distribution: a point mass assigns probability 1 to a set where a Gaussian prior assigns probability 0, so it is not absolutely continuous with respect to the prior and KL(posterior || prior) is infinite, making the bound vacuous, unless some special case applies.
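One can see the problem numerically by approximating a point mass with a Gaussian posterior of shrinking variance: the -log(variance) term in the Gaussian KL formula diverges. A small sketch (the weight vector is again made up for illustration):

```python
import math

def kl_isotropic(mu1, var1, mu0, var0):
    """Closed-form KL( N(mu1, var1*I) || N(mu0, var0*I) )."""
    d = len(mu1)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mu1, mu0))
    return 0.5 * (d * (var1 / var0 - 1.0 + math.log(var0 / var1))
                  + sq_dist / var0)

w = [1.0, -2.0]  # made-up weight vector
# Shrink the posterior variance toward a point mass:
# the log(var0/var1) term grows without bound
kls = [kl_isotropic(w, var, [0.0, 0.0], 1.0)
       for var in (1e-1, 1e-4, 1e-8)]
```

Each halving of the posterior variance adds a fixed amount to the KL, so in the point-mass limit the complexity term in the bound is infinite, which is why the tutorial keeps a genuinely spread-out Gaussian posterior.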

answered Sep 27 '12 at 21:00


Alexandre Passos ♦


powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.