Suppose $x \in X$ is discrete and we model $\log P(x)$ as a linear combination of some set of features. Suppose A and B are two feature sets of the same size, drawn from an arbitrary orthogonal basis for the space of functions over X. Are all A, B pairs equivalent from a statistical point of view?

They are certainly not all equivalent from a computational point of view. For instance, inference with the feature set $\{x_1x_2, x_2x_3, x_3x_4\}$ (a chain) has different complexity than with $\{x_1x_2, x_2x_3, x_3x_1\}$ (a cycle).
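The complexity gap between those two feature sets can be made concrete by tracking the largest intermediate factor that variable elimination creates. This is a minimal sketch (not from the original post); the function name `induced_width` and the specific elimination orders are my own choices for illustration:

```python
def induced_width(scopes, order):
    """Size of the largest intermediate factor created when eliminating
    variables in the given order from a model with the given factor scopes.
    Exact-inference cost is exponential in this quantity."""
    scopes = [set(s) for s in scopes]
    width = 0
    for v in order:
        touching = [s for s in scopes if v in s]
        if not touching:
            continue
        merged = set().union(*touching)     # factor formed when summing out v
        width = max(width, len(merged))
        scopes = [s for s in scopes if v not in s]
        remainder = merged - {v}
        if remainder:
            scopes.append(remainder)
    return width

chain = [{1, 2}, {2, 3}, {3, 4}]  # x1x2, x2x3, x3x4
cycle = [{1, 2}, {2, 3}, {3, 1}]  # x1x2, x2x3, x3x1
print(induced_width(chain, [1, 2, 3, 4]))  # 2: never touches more than 2 vars
print(induced_width(cycle, [1, 2, 3]))     # 3: summing out any var joins the rest
```

With pairwise features the parameter count is the same in both cases; only the structure of the factor graph (tree vs. loop) changes the elimination cost.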

In particular, I'm wondering whether one could justify considering sets with lots of "low-order" features like $x_1$ before sets of high-order ones like $x_1x_2x_3$, even when the base features $x_i$ have no intrinsic meaning.
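To see why the question is nontrivial, note that in the standard parity (Walsh) basis over $\{-1,+1\}^n$, a low-order feature like $x_1$ and a high-order one like $x_1x_2x_3$ are both unit-norm and mutually orthogonal under the uniform distribution, so linear algebra alone doesn't distinguish them. A small check (my own illustration, with a hypothetical `parity_feature` helper; assumes the $\pm 1$ encoding):

```python
import numpy as np
from itertools import product

# All 2^3 configurations of 3 binary variables, encoded as +/-1
X = np.array(list(product([-1, 1], repeat=3)))

def parity_feature(S):
    # Product of the variables indexed by S (parity/Walsh basis function)
    return np.prod(X[:, S], axis=1)

low = parity_feature([0])         # "x1": low-order
high = parity_feature([0, 1, 2])  # "x1x2x3": high-order

n = len(X)
print(low @ low / n, high @ high / n, low @ high / n)  # 1.0 1.0 0.0
```

Both features have norm 1 and inner product 0, so any preference for the low-order one must come from statistical or computational considerations, not from the geometry of the basis.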

asked Oct 13 '10 at 10:20


Yaroslav Bulatov

edited Oct 13 '10 at 10:21

Shouldn't this be addressed by methods for learning the structure of MRFs? After all, the feature sets considered are the factors of a factor graph (or the cliques of an MRF, if you prefer).

(Oct 13 '10 at 13:11) Alexandre Passos ♦

MRF structure learning gives you a particular feature set. What I'm asking is whether you can say that some feature sets are better than others without relying on a domain expert.

(Oct 13 '10 at 13:18) Yaroslav Bulatov

Any MRF structure learning method will give you a way of comparing feature sets, even if it's something as crude as a prior. What you describe sounds like a prior distribution that says feature i is used with probability inversely proportional to its clique size.

(Oct 13 '10 at 13:20) Alexandre Passos ♦

Sure, any structure learning algorithm imposes some ordering over feature sets, but that doesn't answer whether all feature sets are equivalent from a statistical point of view (i.e., worst-case error when modeling with a given feature set).

(Oct 13 '10 at 13:28) Yaroslav Bulatov