I was watching John Shawe-Taylor's COLT tutorial on PAC-Bayes and noticed that he goes through a lot of trouble to apply a PAC-Bayes bound to SVMs. The standard PAC-Bayes bound, in the form given in the first part of the tutorial, states that with probability at least 1 - delta,

E_{h~Q}[TrueLoss(h)] <= E_{h~Q}[EmpiricalLoss(h)] + sqrt(KL(Q || P) + log(1/delta)),

where P is the prior and Q is the posterior over hypotheses. He then chooses a Gaussian prior and a Gaussian posterior over the weights in order to compute the bound.

Why didn't he just use a point mass at the learned weight vector as the posterior? In that case the KL divergence would reduce to log(1/prior(w)), which, in the case of a Gaussian prior, is just proportional to the squared norm of the weight vector divided by the C parameter, and that would give a bound that almost justifies the SVM optimization problem. What, exactly, is gained by using an artificial distribution as the posterior?
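
To make the step I have in mind explicit, here is the informal calculation behind the question (it assumes the discrete-case identity KL(point mass at w || P) = log(1/P(w)) carries over to a continuous prior by substituting the density p(w); sigma^2 and d below are the prior variance and the weight dimension, which are my notation, not the tutorial's):

$$
\mathrm{KL}\!\left(\delta_{w} \,\|\, \mathcal{N}(0,\sigma^2 I_d)\right) \;\overset{?}{=}\; \log\frac{1}{p(w)} \;=\; \frac{\|w\|^2}{2\sigma^2} + \frac{d}{2}\log\!\left(2\pi\sigma^2\right).
$$

Dropping the additive constant and identifying the prior variance sigma^2 with (a rescaling of) the SVM C parameter, the complexity term is roughly ||w||^2 / (2C), i.e. the SVM regularizer, which is why the point-mass posterior looked natural to me.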