For sparse (eg L1-penalized) linear models there are more and more theoretical results such as consistency and sparsistency (basically extensions of the compressed sensing theory). Such results lead to the use of the corresponding predictors to perform high-dimensional statistical inference: in statistical inference settings the goal is not to predict a target but to find a true model with some confidence: model identification. In the special case of sparse models, prediction is used as a mean to perform subset-selection. My question is: are there other examples of predictors that can be used to perform some inference in high-dimensional statistics, rather than a prediction task? |
I'm not sure I understand your question.
I was asking for theory focused on recovery rather than prediction (or rather model identification, as it seems that this is what statisticians like to call it). I wanted to know whether such theory was available for models other than sparse models. I'll edit my question to make it clearer. I really like your reference by A. Ng, by the way. Thanks.
(Jan 31 '11 at 12:59)
Gael Varoquaux
Isn't recovery in non-sparse settings the standard frequentist concept of an unbiased estimator (that is, one that approaches the "true" parameter value as the number of data points grows)?
(Jan 31 '11 at 13:08)
Alexandre Passos ♦
I guess so. I think the best concept to use is consistency. And that's why my question mentioned consistency. However, I am interested in any use of learning techniques to do principled statistical inference.
(Jan 31 '11 at 13:47)
Gael Varoquaux
I think it usually goes the other way around, and statistical inference is one of the principles behind learning. Something else you should look at, maybe, is PAC theory and VC theory, which essentially give finite-sample bounds on the convergence of estimators and conditions under which recovery is guaranteed. I think most of what happens in learning tends to deal with the grittier computational side of things, where almost all problems are NP-complete and one has to use approximations and cut corners, so principled is not a word I'd use to describe most of machine learning (apart from learning theory, but most of that fits into either the PAC or the VC framework, with additions such as Rademacher complexity and online learning).
(Jan 31 '11 at 13:54)
Alexandre Passos ♦
Nice references again. I guess that PAC and VC are not really what I am interested in, as they give results on the minimization of an empirical risk, but not on the suitability of a set of hypotheses for data. The consistency of feature selection via sparse learners (often called sparsistency) is one example of what I am interested in, as it enables one to formulate a hypothesis about the nature of the data at hand, where the probability of this hypothesis being wrong goes to zero. Edit: actually, PAC learning defines the notion of PAC-learnable 'concepts', where a concept can lead to a family of testable hypotheses. I guess that I am learning what my question is as we go. What do they call this, 'learning from negative examples'?
(Jan 31 '11 at 15:41)
Gael Varoquaux
I think Bayesian methods with optimal inference have this property, but usually this is too computationally difficult for all but the simplest models (the integrals range from NP-hard to worse, for example when latent variables are involved). You can in theory get around this with sampling, but sampling for model selection is far from a solved problem, and even annealed importance sampling, which is the state of the art for this problem, can need a fair bit of tuning to get good results. So the question is more about model selection? There is a very nice survey by Guyon et al. in JMLR, "Model selection: beyond the Bayesian/frequentist divide", with many references therein on the sort of methodology usually employed, apart from the Bayesian ones. You might find some of those references interesting.
(Jan 31 '11 at 16:34)
Alexandre Passos ♦
There is interesting work by Bernardo on hypothesis testing using decision theory: http://www.isds.duke.edu/research/conferences/valencia/IntStatRev.pdf. I don't know if this is what you meant by "suitability of a set of hypotheses for data".
(Feb 01 '11 at 10:31)
Oscar Täckström
So, replying to my own question after a day of reading. :). It seems that many predictors used in machine learning show at least consistency. For instance, I was surprised to hear that SVM can be consistent: if I understand correctly [Steinwart2005], estimators based on the minimization of a convex loss (with a few constraints: definition 2.1) and regularization function with fairly weak constraints (definition 2.2) are consistent, including in kernel spaces. To sum up, many estimators show consistency and feature selection in a certain representation:
Thus, these estimators can be used to select a low dimensional representation of the data (small number of parameters) with theoretical guaranties. On the corresponding set of parameters, conventional statistics (estimation theory of unbiased estimators, statistical decision theory) can be applied. [Bach2008] Bach 2008, JMLR, 'Consistency of the group Lasso and multiple kernel learning' [Steinwart2005]: Steinwart, Trans Inf Proc 2005, "Consistency of support vector machines and other regularized kernel classifiers" Edit, A couple of months later- Another reference that I found interesting is: [Mukherjee2006] Mukherjee et al. Adv Comp Math 2006, "Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization" The paper basically gives some conditions on optimization-based learner for consistency, and shows a very interesting link between stability and consistency under some conditions (such as convexity of the objective function) that, in my view, relates to the bootstrap estimates in the papers above-mentioned. |
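To illustrate the two-stage recipe sketched above, here is a toy version of "select a low-dimensional set of parameters with a sparse learner, then apply conventional statistics on that set". This is only a rough sketch, not the exact procedure of the papers cited: it glosses over post-selection inference issues, and the LassoCV / OLS choices, the sizes, and the coefficients are arbitrary assumptions for the example.

    # Sketch of the two-stage recipe: (1) a sparse learner selects a low-dimensional
    # representation, (2) conventional statistics (here OLS t-tests) on the selected set.
    # Assumes scikit-learn, statsmodels, numpy; all sizes and settings are illustrative.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LassoCV

    rng = np.random.RandomState(42)
    n_samples, n_features = 300, 100
    X = rng.randn(n_samples, n_features)
    w = np.zeros(n_features)
    w[:4] = [1.5, -2.0, 1.0, 0.8]          # sparse ground truth
    y = X.dot(w) + rng.randn(n_samples)

    # Stage 1: feature selection with an L1-penalized estimator (penalty chosen by CV).
    selector = LassoCV(cv=5).fit(X, y)
    support = np.flatnonzero(selector.coef_)

    # Stage 2: ordinary least squares on the selected parameters,
    # with classical standard errors and p-values.
    ols = sm.OLS(y, sm.add_constant(X[:, support])).fit()
    print("selected variables:", support)
    print(ols.summary())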