For concreteness, suppose our task is to learn a probability distribution over $m$ binary variables. If several models fit the data equally well, there are rigorous ways to justify picking the model with the smallest number of adjustable parameters, i.e., in a large-dataset setting, such a model will also give the best performance on future data.

Are there any similarly rigorous (i.e., learning-theoretic, frequentist, or objective Bayesian) approaches to justify picking between models of the same dimension?

As an example, consider $m=2$, $x_i \in \{-1,1\}$, and the models $P_1(x_1,x_2) \propto \exp(a x_1 + a x_2 + b x_1 x_2)$ and $P_2(x_1,x_2) \propto \exp(a x_1 + b x_2)$. If we view $P_1$ and $P_2$ as 2-dimensional subsets of $\mathbb{R}^4$, $P_2$ has an area that is about twice as large, and the largest (Euclidean) distance from a worst-case distribution to $P_2$ is smaller. It seems analogues of these properties should have a statistical interpretation. Also, it seems to me that $P_2$ is not only better than $P_1$, but that it is the best "nice" 2-parameter model here in some sense. Are there ways to formalize this?
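
To make the geometric picture concrete, here is a rough numerical sketch (the parameter ranges, grid resolution, and Monte Carlo sample size are arbitrary choices of mine, not part of the claim): it embeds both families as point sets in the probability simplex inside $\mathbb{R}^4$ and estimates the worst-case Euclidean distance from an arbitrary distribution to each family.

```python
# Rough sketch: embed both 2-parameter families in the simplex over the 4 outcomes
# and estimate the worst-case Euclidean distance from an arbitrary distribution
# to each family. Grid ranges and sample sizes are arbitrary choices.
import itertools
import numpy as np

outcomes = list(itertools.product([-1, 1], repeat=2))  # (x1, x2) in {-1,1}^2

def family(theta_grid, energy):
    """Return an array of normalized distributions, one row per (a, b) pair."""
    dists = []
    for a, b in theta_grid:
        unnorm = np.array([np.exp(energy(a, b, x1, x2)) for x1, x2 in outcomes])
        dists.append(unnorm / unnorm.sum())
    return np.array(dists)

grid = [(a, b) for a in np.linspace(-3, 3, 61) for b in np.linspace(-3, 3, 61)]
P1 = family(grid, lambda a, b, x1, x2: a * x1 + a * x2 + b * x1 * x2)
P2 = family(grid, lambda a, b, x1, x2: a * x1 + b * x2)

# Worst-case distance: sample random points of the simplex (Dirichlet) and take
# the largest distance to the nearest member of each family.
rng = np.random.default_rng(0)
simplex_points = rng.dirichlet(np.ones(4), size=2000)

def worst_case_distance(family_points, points):
    nearest = np.array([np.linalg.norm(family_points - p, axis=1).min() for p in points])
    return nearest.max()

print("worst-case distance to P1:", worst_case_distance(P1, simplex_points))
print("worst-case distance to P2:", worst_case_distance(P2, simplex_points))
```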

asked Aug 28 '10 at 19:22

Yaroslav Bulatov

edited Aug 30 '10 at 13:46


2 Answers:

You can use margin / VC dimension as a way of deciding which is the best model. Vapnik's The Nature of Statistical Learning Theory gives a pretty good justification for that. More generally, choosing any margin loss (i.e., one that still penalizes some correctly classified instances) and minimizing some measure of complexity (such as the l2 norm of the parameters) plus the margin loss gives you a model selection criterion.
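
For instance, here is a minimal sketch of that kind of criterion (the hinge loss, the l2 penalty, the value of lambda, and the toy data are all illustrative choices, not a specific recipe from the book): compute the margin loss plus a complexity penalty for each candidate parameter setting and keep the one with the lower total.

```python
# Minimal sketch: "margin loss + complexity penalty" as a model selection score.
import numpy as np

def penalized_margin_objective(w, X, y, lam=0.1):
    """Hinge loss (penalizes points with margin < 1, even if correctly classified)
    plus an l2 complexity penalty on the parameters."""
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins).mean()
    return hinge + lam * np.dot(w, w)

# Toy data and two candidate parameter vectors; both separate the data,
# so the criterion prefers the one with the smaller norm.
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
candidates = {"w_small": np.array([0.4, 0.4]), "w_large": np.array([3.0, 3.0])}
best = min(candidates, key=lambda k: penalized_margin_objective(candidates[k], X, y))
print(best)  # -> w_small
```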

In your setting, P2 clearly has VC dimension 2 (it can shatter any set of 2 points in general position, but three points in a triangle away from the origin can't be shattered), while P1 doesn't seem to be able to shatter any two points (although I can't prove this right now), so in one sense P2 has more capacity than P1; on the other hand, if both fit the data equally well you should choose P1, since its VC generalization bound is smaller.

You can't talk generally of the best model with two parameters; Vapnik gives in the book an example of a model with one parameter that can shatter any set of points in general position (assuming the points lie in a bounded set): a sine function with a really high frequency. You can, however, say that P2 (or something like P2 but with a bias parameter) is the nicest model of its VC dimension if you look at things like the smoothness of the decision boundary, but this requires other sorts of justification that, in my opinion, come close to Bayesian methods.
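
To make the sine example concrete, here is a small numerical check (the construction with points x_i = 10^{-i} and the particular formula for omega are the standard textbook trick, written from memory rather than quoted from Vapnik): a single frequency parameter realizes every labeling of n such points.

```python
# Numerical check of the one-parameter classifier sign(sin(omega * x)):
# with points x_i = 10^{-i}, one frequency omega can realize any labeling.
import itertools
import numpy as np

n = 5
x = np.array([10.0 ** -(i + 1) for i in range(n)])

def omega_for_labels(labels):
    # labels in {-1, +1}; encode the desired labels in the decimal digits of omega/pi
    return np.pi * (1 + sum(10 ** (i + 1) for i, y in enumerate(labels) if y == -1))

all_realized = all(
    np.array_equal(np.sign(np.sin(omega_for_labels(labels) * x)), np.array(labels))
    for labels in itertools.product([-1, 1], repeat=n)
)
print("all 2^%d labelings realized:" % n, all_realized)
```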

answered Aug 28 '10 at 20:08

Alexandre Passos ♦

edited Aug 28 '10 at 20:18

OK, VC dimension seems to be a rigorous way to judge the complexity of classifiers, although I'm not sure it can be extended to density estimation.

(Aug 28 '10 at 20:26) Yaroslav Bulatov

Density estimation just has to be harder than classification: any notion of generalization ability must depend on either a smoothness assumption or a parametric assumption on the data distribution. You can, however, use a margin notion for something akin to density estimation with one-class SVMs, although that's pushing it.

(Aug 28 '10 at 20:30) Alexandre Passos ♦

Especially because any two model classes you pick should be equivalent modulo a nonlinear transformation of the data (otherwise there's a more economical parametrization of the same set, I think), and who's to say your data is in its "natural" form?

(Aug 28 '10 at 20:34) Alexandre Passos ♦

By the way, I found an interesting overview of VC and PAC-Bayes bounds... the bottom line is that VC bounds have almost no applicability in practice, because they look at the worst case and hence prefer overly simple models.

(Aug 30 '10 at 15:22) Yaroslav Bulatov

Yes, but if you have exactly the same fit you should prefer the simpler model, per the VC bound. This is only a problem if you're comparing models with different qualities of fit to the data.

(Aug 30 '10 at 15:27) Alexandre Passos ♦

One common way is to use the minimum description length (MDL). Related measures are the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and the minimum message length (MML). All these scores try to balance the likelihood against the model complexity.
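
As a minimal illustration of the bookkeeping (the log-likelihood values and sample size below are made-up placeholders, not fitted numbers):

```python
# Sketch of AIC/BIC-style comparison: penalize the fit by the number of parameters.
import math

def aic(log_likelihood, k):
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    return k * math.log(n) - 2 * log_likelihood

n = 1000  # number of observations (placeholder)
models = {
    "P1 (2 params)": {"loglik": -1385.0, "k": 2},
    "P2 (2 params)": {"loglik": -1382.0, "k": 2},
    "saturated (3 params)": {"loglik": -1381.5, "k": 3},
}
for name, m in models.items():
    print(name, "AIC =", aic(m["loglik"], m["k"]), "BIC =", bic(m["loglik"], m["k"], n))
# With equal dimension, both criteria reduce to comparing likelihoods; the extra
# parameter of the saturated model is kept only if it buys enough log-likelihood.
```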

answered Sep 01 '10 at 03:40

Yuval F
