Suppose $x \in X$ is discrete and we model $\log P(x)$ as a linear combination of some set of features. Suppose $A$ and $B$ are sets of features of the same size, drawn from an arbitrary orthogonal basis for the space of functions over $X$. Are all $A, B$ pairs equivalent from a statistical point of view? They are certainly not all equivalent from a computational point of view: for instance, inference with the feature set $\{x_1x_2, x_2x_3, x_3x_4\}$ (a chain) has different complexity than with $\{x_1x_2, x_2x_3, x_3x_1\}$ (a cycle). In particular, I'm wondering whether one could justify considering sets with lots of low-order features like $x_1$ before sets of high-order ones.
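The chain-vs-cycle complexity difference can be made concrete with a small sketch (my own toy setup: binary variables in $\{0,1\}$, arbitrary weights). For the chain feature set, the partition function is computable by variable elimination in time linear in the number of variables; the cycle would need conditioning or treewidth-2 elimination, so here it is checked only by brute force:

```python
import itertools
import math

def log_potential(x, features, theta):
    # Log-linear score: sum_k theta_k * (product of the variables in feature k).
    return sum(t * math.prod(x[i] for i in f) for f, t in zip(features, theta))

def brute_force_logZ(n, features, theta):
    # Exact log partition function by enumerating all 2^n configurations.
    return math.log(sum(math.exp(log_potential(x, features, theta))
                        for x in itertools.product([0, 1], repeat=n)))

def chain_logZ(n, theta):
    # Variable elimination along the chain x1-x2-...-xn.
    # m[v] is the sum over all eliminated prefixes, given the current variable = v.
    m = [1.0, 1.0]
    for k in range(n - 1):
        # Absorb the pairwise feature x_k * x_{k+1} with weight theta[k].
        m = [sum(m[u] * math.exp(theta[k] * u * v) for u in (0, 1))
             for v in (0, 1)]
    return math.log(sum(m))

n = 4
chain = [(0, 1), (1, 2), (2, 3)]   # {x1 x2, x2 x3, x3 x4}
theta = [0.5, -1.0, 2.0]
assert abs(brute_force_logZ(n, chain, theta) - chain_logZ(n, theta)) < 1e-9

cycle = [(0, 1), (1, 2), (2, 0)]   # {x1 x2, x2 x3, x3 x1}: no linear-time chain pass
print(brute_force_logZ(3, cycle, theta))
```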
asked by Yaroslav Bulatov

Shouldn't this be addressed by the methods for learning the structure of MRFs? After all, the sets of features considered are the factors of a factor graph (or cliques of an MRF, if you prefer).

MRF structure learning gives you a particular feature set. What I'm asking is whether you can say that some feature sets are better than others without relying on a domain expert.

Any MRF structure learning method will give you a way of comparing feature sets, even if it's something as crude as a prior. What you describe sounds like a prior distribution that says feature $i$ is used with probability inversely proportional to its clique size.

Sure, any structure learning algorithm imposes some ordering over feature sets, but that doesn't answer whether all feature sets are the same from a statistical point of view (i.e., worst-case error when modeling with a given feature set).
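One concrete instantiation of "orthogonal basis for the space of functions over $X$" is the parity (Walsh) basis over $\{-1,1\}^n$. A minimal sketch (the target function and the two equal-size feature sets are my own illustrative choices) shows that for a *fixed* target, two same-size sets can have very different projection errors, which is why the equivalence question has to be posed in a worst-case (or average) sense over targets:

```python
import itertools

n = 3
xs = list(itertools.product([-1, 1], repeat=n))

def chi(S, x):
    # Parity feature: product of the coordinates indexed by subset S.
    out = 1
    for i in S:
        out *= x[i]
    return out

# All 2^n parities form an orthogonal basis for real functions on {-1,1}^n.
subsets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]

def coeffs(f):
    # Fourier-Walsh coefficients: <f, chi_S> averaged over the cube.
    return {S: sum(f(x) * chi(S, x) for x in xs) / len(xs) for S in subsets}

f = lambda x: x[0]          # target concentrated on a single low-order feature
c = coeffs(f)

def proj_error(feature_set):
    # Squared L2 error of projecting f onto span(feature_set), by Parseval.
    return sum(c[S] ** 2 for S in subsets if S not in feature_set)

low  = [(), (0,), (1,), (2,)]        # four low-order features
high = [(), (0, 1), (0, 2), (1, 2)]  # four features, same size, higher order
print(proj_error(low), proj_error(high))  # -> 0.0 1.0
```

By the symmetry of the hypercube there is of course a mirror-image target that the high-order set captures exactly, so per-target error alone doesn't settle the question; it only shows the two sets are not interchangeable for a given distribution.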