For sparse (e.g. L1-penalized) linear models, there is a growing body of theoretical results such as consistency and sparsistency (essentially extensions of compressed-sensing theory). Such results justify using the corresponding predictors for high-dimensional statistical inference: in inference settings the goal is not to predict a target but to identify a true model with some confidence (model identification). In the special case of sparse models, prediction is used as a means to perform subset selection.

My question is: are there other examples of predictors that can be used to perform inference in high-dimensional statistics, rather than a prediction task?
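
To make the setting concrete, here is a minimal sketch of what I mean, using scikit-learn's Lasso on synthetic data (the estimator, the data, and the regularization value alpha are illustrative placeholders, not part of the question itself):

```python
# Sketch: Lasso used for support recovery (model identification),
# not prediction. Assumes numpy and scikit-learn are available;
# the regularization strength alpha is chosen arbitrarily here.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n_samples, n_features, n_relevant = 100, 200, 5

# True model: only the first 5 coefficients are non-zero.
true_coef = np.zeros(n_features)
true_coef[:n_relevant] = rng.uniform(1.0, 2.0, n_relevant)

X = rng.randn(n_samples, n_features)
y = X @ true_coef + 0.1 * rng.randn(n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)
estimated_support = np.flatnonzero(lasso.coef_)
true_support = np.flatnonzero(true_coef)

# The question is about guarantees that these two sets coincide
# (sparsistency), not about the prediction error of `lasso`.
print("true support:     ", true_support)
print("estimated support:", estimated_support)
```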

asked Jan 31 '11 at 02:52

Gael Varoquaux

edited Jan 31 '11 at 13:00


2 Answers:

I'm not sure I understand your question.

  1. Are you asking about theory more focused on prediction than recovery? If so, you might like Andrew Ng's paper Feature selection, L1 vs L2 regularization, and rotational invariance, which proves that for L1-regularized models the sample complexity (the number of training examples needed to bring the test error below epsilon) grows only logarithmically in the number of irrelevant features, while any rotationally invariant model (which includes almost everything distance-based and/or L2-regularized) needs on the order of n examples, where n is the number of irrelevant features. A small simulation sketch of this contrast appears after this list.

  2. Are you asking about other ways of achieving sparsity or dealing with many irrelevant variables? In Bayesian methods you can use a spike-and-slab prior (see, for example, Ishwaran, Spike and slab variable selection: frequentist and Bayesian strategies); there is the Dantzig selector (which also comes from compressed-sensing theory; a rough sketch of it as a linear program appears below); there are methods that implicitly select a few features (such as boosted stumps); and there are methods that do not achieve sparsity but generalize very well (like hash kernels). Some Bayesian methods might be able to average out most of the irrelevant variables as well, but inference is usually very costly.
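
Here is the small simulation sketch for point 1, assuming scikit-learn; the sample sizes, feature counts, and regularization strength C are arbitrary choices, so treat it as an illustration of the trend rather than a reproduction of the paper's setting:

```python
# Sketch: test accuracy of L1- vs L2-regularized logistic regression
# as irrelevant features are added, at a fixed training-set size.
# All sizes and the regularization strength C are arbitrary choices.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n_train, n_test, n_relevant = 100, 1000, 3
w = rng.randn(n_relevant)  # true weights on the relevant features

for n_irrelevant in [0, 100, 1000]:
    n_features = n_relevant + n_irrelevant
    X = rng.randn(n_train + n_test, n_features)
    y = (X[:, :n_relevant] @ w > 0).astype(int)
    X_train, y_train = X[:n_train], y[:n_train]
    X_test, y_test = X[n_train:], y[n_train:]

    for penalty in ["l1", "l2"]:
        clf = LogisticRegression(penalty=penalty, C=1.0,
                                 solver="liblinear").fit(X_train, y_train)
        acc = clf.score(X_test, y_test)
        print(f"{n_irrelevant:5d} irrelevant features, {penalty}: "
              f"test accuracy {acc:.2f}")
```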
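
And the rough sketch of the Dantzig selector from point 2, written directly as the linear program "minimize ||beta||_1 subject to ||X'(y - X beta)||_inf <= lambda" (assuming scipy; lambda and the toy data are placeholders, not a recommended implementation):

```python
# Sketch: the Dantzig selector as a linear program,
#   minimize ||beta||_1  subject to  ||X'(y - X beta)||_inf <= lam.
# beta is split into positive and negative parts (u, v >= 0) so the
# objective becomes linear. lam and the toy data are arbitrary.
import numpy as np
from scipy.optimize import linprog

rng = np.random.RandomState(0)
n, p, lam = 50, 20, 0.5
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]
X = rng.randn(n, p)
y = X @ beta_true + 0.1 * rng.randn(n)

XtX, Xty = X.T @ X, X.T @ y
c = np.ones(2 * p)                  # sum of u_i + v_i = ||beta||_1
A_ub = np.block([[ XtX, -XtX],      #  X'X beta <=  X'y + lam
                 [-XtX,  XtX]])     # -X'X beta <= -X'y + lam
b_ub = np.concatenate([Xty + lam, lam - Xty])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))

beta_hat = res.x[:p] - res.x[p:]
print("non-zero estimated coefficients:",
      np.flatnonzero(np.abs(beta_hat) > 1e-6))
```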

This answer is marked "community wiki".

answered Jan 31 '11 at 04:50

Alexandre Passos ♦

I was asking about theory focused on recovery rather than prediction (or rather model identification, as this seems to be what statisticians like to call it). I wanted to know whether such theory was available for models other than sparse ones. I'll edit my question to make it clearer.

I really like the A. Ng reference, by the way. Thanks!

(Jan 31 '11 at 12:59) Gael Varoquaux

Isn't recovery in non-sparse settings the standard frequentist concept of an unbiased estimator (that is, one that approaches the "true" estimator as the number of data points grows)?

(Jan 31 '11 at 13:08) Alexandre Passos ♦

I guess so. I think the best concept to use is consistency, which is why my question mentioned it. However, I am interested in any use of learning techniques to do principled statistical inference.

(Jan 31 '11 at 13:47) Gael Varoquaux

I think it usually goes the other way around: statistical inference is one of the principles behind learning. Something else you might want to look at is PAC theory and VC theory, which essentially give finite-sample bounds on the convergence of estimators and conditions under which recovery is guaranteed.

I think most of what happens in learning tends to deal with the grittier computational side of things, where almost all problems are NP-complete and one has to use approximations and cut corners, so "principled" is not a word I'd use to describe most of machine learning (apart from learning theory, but most of that fits into either the PAC or the VC framework, with additions like Rademacher complexity and online learning).

(Jan 31 '11 at 13:54) Alexandre Passos ♦

Nice references again. I guess that PAC and VC theory are not really what I am interested in, as they give results on the minimization of an empirical risk, but not on the suitability of a set of hypotheses for the data.

The consistency of feature selection via sparse learners (often called sparsistency) is one example of what I am interested in, as it makes it possible to formulate a hypothesis about the nature of the data at hand, with the probability of this hypothesis being wrong going to zero.

Edit: actually, PAC learning defines the notion of PAC-learnable 'concepts', where a concept can lead to a family of testable hypotheses.

I guess that I am learning what my question is as we go. What do they call this, 'learning from negative examples'?

(Jan 31 '11 at 15:41) Gael Varoquaux

I think Bayesian methods with optimal inference have this property, but this is usually too computationally difficult for all but the simplest models (the integrals range from NP-hard to worse, for example when latent variables are involved). You can theoretically get around this with sampling, but sampling for model selection is far from a solved problem, and even annealed importance sampling, which is state of the art for this problem, can need a fair amount of tuning to get good results.

So the question is more about model selection? There is a very nice survey by Guyon et al. in JMLR, Model selection: beyond the Bayesian/frequentist divide, with many references therein on the sort of methodology usually employed, apart from the Bayesian ones. You might find some of them interesting.

(Jan 31 '11 at 16:34) Alexandre Passos ♦

There is interesting work by Bernardo on hypothesis testing using decision theory: http://www.isds.duke.edu/research/conferences/valencia/IntStatRev.pdf. I don't know if this is what you meant by "suitability of a set of hypotheses for the data".

(Feb 01 '11 at 10:31) Oscar Täckström

So, replying to my own question after a day of reading. :).

It seems that many predictors used in machine learning are at least consistent. For instance, I was surprised to hear that SVMs can be consistent: if I understand [Steinwart2005] correctly, estimators based on the minimization of a convex loss (with a few constraints: definition 2.1) and a regularization function with fairly weak constraints (definition 2.2) are consistent, including in kernel spaces.

To sum up, many estimators are consistent and perform feature selection in a certain representation:

  • sparse primal problems (Lasso, Dantzig selector): feature selection in the original set of features

  • sparse problems in a transformed space (kernel trick) [Steinwart2005]

  • mixed ℓ1/ℓ2 norms / group lasso [Bach2008], and thus structured sparsity

Thus, these estimators can be used to select a low-dimensional representation of the data (a small number of parameters) with theoretical guarantees. On the corresponding set of parameters, conventional statistics (estimation theory of unbiased estimators, statistical decision theory) can be applied.
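
For instance, here is a minimal sketch of this two-step use, assuming scikit-learn and statsmodels (the cross-validation choice is arbitrary, and the naive second step ignores the effect of the selection on the resulting p-values):

```python
# Sketch: sparse estimator for support selection, then conventional
# statistics on the selected low-dimensional representation.
# Assumes scikit-learn and statsmodels; the data are synthetic and
# the naive second step ignores the selection effect on p-values.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.RandomState(0)
n, p = 200, 50
beta = np.zeros(p)
beta[:4] = [3.0, -2.0, 1.5, 1.0]
X = rng.randn(n, p)
y = X @ beta + rng.randn(n)

# Step 1: model identification with a sparse, consistent estimator.
support = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)

# Step 2: classical OLS inference restricted to the selected features.
ols = sm.OLS(y, sm.add_constant(X[:, support])).fit()
print(ols.summary())
```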

[Bach2008] Bach, JMLR 2008, "Consistency of the group Lasso and multiple kernel learning"

[Steinwart2005] Steinwart, IEEE Trans. Information Theory 2005, "Consistency of support vector machines and other regularized kernel classifiers"

Edit, a couple of months later: another reference that I found interesting is:

[Mukherjee2006] Mukherjee et al. Adv Comp Math 2006, "Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization"

The paper basically gives conditions on optimization-based learners for consistency, and shows a very interesting link between stability and consistency under some conditions (such as convexity of the objective function) that, in my view, relates to the bootstrap estimates in the above-mentioned papers.
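
As a rough illustration of the stability side, here is a sketch in the spirit of bootstrap resampling of the sparse selection (assuming scikit-learn; the value of alpha and the number of resamples are arbitrary, and this is not the exact procedure of any of the papers above):

```python
# Sketch: bootstrap stability of Lasso feature selection.
# Features selected in most resamples are the "stable" ones.
# alpha and the number of resamples are arbitrary choices.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
n, p = 100, 30
beta = np.zeros(p)
beta[:3] = [2.0, -2.0, 1.5]
X = rng.randn(n, p)
y = X @ beta + rng.randn(n)

n_boot = 100
selection_counts = np.zeros(p)
for _ in range(n_boot):
    idx = rng.randint(0, n, n)          # bootstrap resample
    coef = Lasso(alpha=0.1).fit(X[idx], y[idx]).coef_
    selection_counts += coef != 0

print("selection frequency per feature:",
      np.round(selection_counts / n_boot, 2))
```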

answered Jan 31 '11 at 17:18

Gael Varoquaux

edited Apr 23 '11 at 07:21
