|
I am a first-year PhD student (from a mathematics/statistics background). Many researchers seem very interested in theoretical results about the computational complexity of ML algorithms, but few seem interested in "sample complexity" results (results along the lines of VC theory, PAC-Bayes and the like). This seems particularly true of the applied ML researchers I encounter: most can tell you the current state of the art in computational complexity off the top of their head, but few could tell you much about the statistical learning theory (SLT) results for even their favourite algorithm. i) Is this indeed the case? ii) Why?
|
(I might expand on this later, if I have time.) I don't think sample complexity bounds are that useful in many cases, for the reasons given by gdahl. That doesn't mean learning theory is useless. Think of classifying spam, producing search-engine results, or displaying ads: these are adversarial environments, and you want to prove that your algorithm isn't going to be arbitrarily bad no matter what crazy stuff comes its way. In the bandit setting the sample complexity bounds are very useful, as they seem to correspond very closely to observed performance.
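To make that last point concrete, here is a minimal sketch (assuming numpy; the arm means, horizon, and constants are made up for illustration) of UCB1 on a two-armed Bernoulli bandit, whose cumulative regret tends to track the O(log T) growth that the regret analysis predicts.

```python
import numpy as np

# Minimal sketch: UCB1 on a two-armed Bernoulli bandit (illustrative only).
# The regret bounds for UCB1 predict cumulative regret growing like O(log T),
# and the empirical curve tends to match that shape closely.
rng = np.random.default_rng(0)
means = np.array([0.5, 0.6])   # true (unknown) arm means -- made up
T = 10_000

counts = np.zeros(2)           # number of pulls per arm
sums = np.zeros(2)             # total reward per arm
expected_regret = np.zeros(T)

for t in range(T):
    if t < 2:
        arm = t                # pull each arm once to initialise
    else:
        ucb = sums / counts + np.sqrt(2.0 * np.log(t) / counts)
        arm = int(np.argmax(ucb))
    counts[arm] += 1
    sums[arm] += rng.binomial(1, means[arm])
    expected_regret[t] = (t + 1) * means.max() - counts @ means

print("expected regret after T rounds:", expected_regret[-1])
print("log(T) for comparison:        ", np.log(T))
```

Plotting expected_regret against c * log(t) for a suitable constant c shows the two curves tracking each other closely, which is the sense in which the bandit bounds correspond to observed performance.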
|
I use machine learning mainly as an inference tool to study complex systems (I do neuroscience). As such, I am extremely interested in the sample complexity of the estimators: we never have enough data, and we are interested in statistical bounds. I guess that people interested only in prediction, and thus in the 'utility' of their classifiers, don't care as much: they know whether or not it works for the problem at hand.

I'm curious: how are machine learning methods useful if not in a predictive setting? In other words, is there any scenario in which you can make a decision based on sample complexity bounds but not on something like held-out performance?
(Jan 05 '12 at 16:24)
Alexandre Passos ♦
Predictive power is mostly used as a model-selection metric: if one model predicts better than another, it is more useful. Of course, there is a limit to this approach, as discriminative models may predict y better but tell us less about X. The setting we are in is pretty much a high-dimensional statistics one: we are trying to find a low-dimensional parametrization of our problem that predicts an external variable well. Machine learning is good at identifying such representations, whether with sparse models (in the primal, or in the dual as with SVMs) or by shrinking, and thus removing effective degrees of freedom from the estimated parameters.
(Jan 10 '12 at 03:29)
Gael Varoquaux
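As a concrete (entirely hypothetical) version of that workflow, here is a scikit-learn sketch: two shrinkage estimators are compared purely on held-out predictive performance, and the sparse one is then inspected for which features it keeps. The synthetic data, model choices, and regularisation strengths are placeholders.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

# Hypothetical high-dimensional setting: a few informative directions,
# many nuisance features (all sizes below are placeholders).
X, y = make_regression(n_samples=200, n_features=1000,
                       n_informative=10, noise=5.0, random_state=0)

models = {"lasso (sparse)": Lasso(alpha=1.0),
          "ridge (shrinkage)": Ridge(alpha=10.0)}

for name, model in models.items():
    # Model selection by held-out predictive performance (5-fold CV, R^2).
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name:20s} mean CV R^2 = {scores.mean():.3f}")

# The sparse model also tells us *which* features carry the signal.
lasso = Lasso(alpha=1.0).fit(X, y)
print("features kept by the lasso:", int(np.sum(lasso.coef_ != 0)))
```

The point is only that the selection criterion is predictive performance on held-out data, while the sparsity or shrinkage is what gives the low-dimensional parametrization something interpretable to say about X.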
|
|
I have yet to see a useful sample complexity bound, but computational complexity bounds are frequently useful. In general, a worst-case sample complexity bound that doesn't make strong assumptions about the data distribution isn't going to be useful, because we neither expect our machine learning algorithms to work on arbitrary distributions nor care about arbitrary distributions; we only care about distributions of data that arise in real problems of at least modest practical interest (or problems plausibly related to such problems). On the other hand, knowing that my Gaussian process regression program has time requirements that are asymptotically cubic in the training set size is very useful, because it shows me I can't run it on a very large dataset. Furthermore, when I plot the actual runtime, it quickly starts to look like the asymptotic bound.

Aren't computational complexity bounds also worst case? Don't the bounds apply regardless of the input you may receive? Do you think it might be possible to formalise what distributions one may encounter in "real problems" and perhaps derive bounds based on these?
(Jan 07 '12 at 21:34)
Avraham Ruderman
The computational complexity bounds can often fit the shape of the empirical runtime plots and thus are immediately useful. Also the "bounds" can be complexity statements about real algorithms. I see no hope whatsoever for usefully formalizing what data distributions exist in "natural" problems of the sort humans solve or parameterizing the "AI set" in a useful way. I would love to be wrong about this.
(Jan 08 '12 at 15:34)
gdahl ♦
For example, consider SVD. I can give you a worst-case bound on the runtime of in-core SVD by analyzing an actual SVD algorithm that I can then use. Often these algorithms will have the same typical case cost and worst case cost. Now my bound is actually a statement about a real thing I can run, and can be quite predictive of real runtimes assuming I didn't do shenanigans with the constants.
(Jan 08 '12 at 15:36)
gdahl ♦
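To illustrate the runtime-bound points above (the cubic cost of exact GP regression and of dense SVD), here is a rough timing sketch, assuming numpy; the matrix sizes are arbitrary and the RBF kernel is a placeholder. Both operations cost Theta(n^3), so doubling n should roughly multiply the wall-clock time by 8.

```python
import time
import numpy as np

rng = np.random.default_rng(0)

def gp_core_time(n):
    """Time the n x n Cholesky factorisation that dominates exact GP regression.
    (Placeholder RBF kernel on random 2-D inputs, plus a small jitter term.)"""
    X = rng.standard_normal((n, 2))
    sq = (X ** 2).sum(axis=1)
    K = np.exp(-0.5 * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    K[np.diag_indices_from(K)] += 1e-6
    t0 = time.perf_counter()
    np.linalg.cholesky(K)
    return time.perf_counter() - t0

def svd_time(n):
    """Time a dense SVD of a random n x n matrix (typical case ~ worst case)."""
    A = rng.standard_normal((n, n))
    t0 = time.perf_counter()
    np.linalg.svd(A, full_matrices=False)
    return time.perf_counter() - t0

# Both costs grow like n^3, so doubling n should multiply the time by ~8.
for label, f in [("GP cholesky", gp_core_time), ("dense SVD", svd_time)]:
    prev = None
    for n in (500, 1000, 2000):
        dt = f(n)
        ratio = "" if prev is None else f"  ({dt / prev:.1f}x vs 8x predicted)"
        print(f"{label:12s} n={n:4d}  {dt:.3f}s{ratio}")
        prev = dt
```

On most machines the later ratios land near 8, which is the sense in which the asymptotic bound quickly starts to look like the empirical runtime plot.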
|
|
I can't reply to (i), except to say that, yes, the more applied people have reasons why they don't do theory, just as the theory people have reasons why they don't really trust all the applied results. There are many, many problems with sample complexity bounds. First, the idea that you're trying to learn a function class is not really accepted these days. Second, it is known that for some tasks (say, classifying newswire by topic) hardly any examples are enough even with hundreds of thousands of features, while other, similar tasks (classifying newswire by sentiment) are much harder and need far more examples for the same number of features. This kind of distinction is not at all captured by the simple PAC bounds you learn about in class. Also, the generalization bounds (that is, what people actually care about) obtained from learning theory tend to be too loose to be practical ways of comparing algorithms.

Can you expand on "trying to learn a function class is not really accepted these days"?
(Jan 05 '12 at 15:03)
Noel Welsh
You're trying to do well on a practical problem: achieving something like high accuracy, low regret, or a low value of some loss function. This is much more natural than, say, saying you want to learn CNF logical rules or whatnot.
(Jan 05 '12 at 15:15)
Alexandre Passos ♦
(also, just to be clear, theorists in general like statistical learning theory)
(Jan 05 '12 at 15:29)
Alexandre Passos ♦
With regard to the formulation in terms of learning function classes, alternative points of view do exist, for example the stability framework proposed by Bousquet and Elisseeff: http://jmlr.csail.mit.edu/papers/volume2/bousquet02a/bousquet02a.pdf

With regard to worst-case analysis, wouldn't the same apply to computational complexity bounds? Don't some inputs yield faster running times than others even when using the same algorithm? In addition, there do exist data-dependent generalisation bounds that take into account the nature of the distribution you are actually trying to learn: Rademacher complexity bounds depend on the inputs you observe, and MDL and PAC-Bayesian bounds depend on both the inputs and the outputs, so the bounds are much tighter for "easy" problems.

With regard to looseness, my understanding is that both computational and sample complexity bounds typically ignore multiplicative constants and lower-order terms. Why do we require sample complexity bounds to be tight but not computational complexity bounds?
(Jan 07 '12 at 21:32)
Avraham Ruderman
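As a small illustration of that data dependence, here is a sketch (assuming numpy; the inputs and the norm bound B are placeholders) of a Monte Carlo estimate of the empirical Rademacher complexity of the norm-bounded linear class {x -> <w, x> : ||w||_2 <= B}, for which the supremum over w has the closed form (B/n) * ||sum_i sigma_i x_i||_2.

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, n_mc=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity of
    {x -> <w, x> : ||w||_2 <= B} on the observed inputs X.

    For this class the supremum over w has a closed form, so each draw of
    Rademacher signs sigma contributes (B / n) * ||sum_i sigma_i x_i||_2.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_mc, n))   # random sign vectors
    return (B / n) * np.linalg.norm(sigma @ X, axis=1).mean()

rng = np.random.default_rng(1)
n, d = 500, 50
X_easy = rng.standard_normal((n, d)) * 0.1      # tightly concentrated inputs
X_hard = rng.standard_normal((n, d)) * 10.0     # widely spread inputs

# Same class, same n: the complexity term (and hence the bound) depends on
# the inputs actually observed, not on a worst case over all distributions.
print("easy inputs:", empirical_rademacher_linear(X_easy))
print("hard inputs:", empirical_rademacher_linear(X_hard))
```

Because the estimate scales with the norms of the observed inputs, the same class gets a much smaller complexity term, and hence a tighter bound, on "easy" data; this is the sense in which such bounds depend on the data you actually see rather than on a worst case.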
I'm not arguing that statistical learning theory is useless or pointless; far from it! My point is just that for a practitioner there are usually better ways of deciding which algorithms to use, and for researchers there are other guiding intuitions about what good algorithms should look like. Regarding looseness: it does make computational complexity bounds useless in many cases too. Look up "galactic algorithms", for example; it is really arguable whether these provide an actual benefit apart from an increased understanding of the problem structure. My impression of learning theory bounds is exactly the same: they can provide very good insights into the problem structure and are useful for researchers, but not so much for practitioners.
(Jan 08 '12 at 00:59)
Alexandre Passos ♦
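To put a number on the looseness for a classification problem of the size discussed earlier in the thread, here is a back-of-the-envelope calculation. It uses one common textbook form of the VC generalization bound (constants differ between presentations), and the feature and sample counts are made up but realistic for bag-of-words text classification.

```python
import math

# One common textbook form of the VC generalization bound: with prob. >= 1 - delta,
#   true_err <= train_err + sqrt( (8/n) * (d * ln(2*e*n/d) + ln(4/delta)) )
# where d is the VC dimension (d = #features + 1 for linear classifiers).
def vc_gap(d, n, delta=0.05):
    return math.sqrt((8.0 / n) * (d * math.log(2 * math.e * n / d) + math.log(4 / delta)))

# Made-up but realistic sizes for a bag-of-words newswire classifier.
d = 1_000 + 1       # even a modest feature count...
n = 10_000          # ...with a decent number of documents
print(f"VC bound on the generalization gap: {vc_gap(d, n):.2f}")
# -> roughly 1.8, i.e. vacuous, since error rates cannot exceed 1.
```

A linear classifier on such data often achieves held-out error of a few percent, so the bound, while not wrong, is far too loose to guide the choice between algorithms.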
|