4
2

Does anyone have an exact definition of over fitting? Is there a way to test if a "black box" method is over-fitting?

Many ML researchers use the term often and have an intuitive idea of what is means. In Chapter 1 of Bishop, examples using polynomial regression are used. With polynomial regression over fitting is obvious since we can monitor the test set performance as we increase or decrease the number of basis functions. However, this is not a general definition (i.e. in the black box case) since there might not be an obvious complexity parameter to vary.

It is sometimes assumed that poor performance implies over-fitting, but poor performance could be due to under-fitting or poorly fitting.

Some possible black box definitions could be:

  1. Some statistic based on the difference between out of sample performance and in sample performance. This is similar to the gap in the PAC framework.
  2. Test set performance increases by adding noise to the data, synthetically increasing the data set size, or using re-sampling methods (e.g. bagging).
  3. The empirical test set error is larger than the predictive uncertainty (only applicable to methods that give predictive uncertainty)

Perhaps there are certain axioms a sane definition of over fitting should obey. For instance, performing exact inference on synthetic data should by definition not be doing over fitting.

Does anyone know of theoretically grounded work that does something allow these lines?

asked Jan 04 '11 at 22:25

Ryan%20Turner's gravatar image

Ryan Turner
2414812

I like Alexandre's definition -- expected difference between training set and test set errors. You can't really analyse this measure theoretically because it's defined in terms of "real world" distribution which which we don't have.

(Jan 05 '11 at 20:25) Yaroslav Bulatov

6 Answers:

The (paraphrased) definition of overfitting that the Bishop book uses is that overfitting occurs when the model starts to capture the idiosyncrasies (noise) of the training data rather than the function that generated the training data. (This may or may not be formulated quite the same way in the Bishop book, but I think that's where its from.)

Unfortunately, the most precise way to test this is to have access to the generating function -- Impossible in everything but toy cases.

So approaches to detecting overfitting look for consistency in performance across samples from the generating function. One way that this is measured is the difference between training and testing error. Another is to examine the consistency in models generated based on different samples of the training data. If the models, or model performances, are consistent, it is unlikely that overfitting has occurred. However, if the trained model parameters, or performances vary widely, then it's likely that the model is capturing idiosyncrasies of the training sample rather than a general form of the underlying function. (These apply to batch training. I agree with Alexandre that it's harder to detect this in online training.)

It's important to note that overfitting when viewed from this perspective is a phenomenon that is more or less independent of the type of model that is being considered. Bayesian models with more parameters than data points, or poorly selected priors can overfit. Also, exemplar based models may not overfit, and this can be detected -- consider cases where the classifier decisions are (more or less) consistent across multiple samples of the training data.

Regarding the difference between over- and underfitting: Underfitting is characterized by high model stability (under different training samples) and high generalization error. Overfitting has low model stability and high generalization error. Optimal modeling has high model stability and low generalization error.

(It's a little tough providing additional insight to a question that already has 5 good answers, but maybe this can contribute.)

answered Jan 06 '11 at 10:07

Andrew%20Rosenberg's gravatar image

Andrew Rosenberg
156252135

edited Jan 06 '11 at 10:21

I really like this answer, and would vote for it to be the accepted one.

So the key idea is consistency in performance. I think this also applies well to unsupervised learning, as the recent work analysing the consistency of clustering algorithms show (see the Von Luxburg survey on clustering stability for a deeper discussion of this w.r.t. unsupervised methods, http://www.kyb.tuebingen.mpg.de/bs/people/ule/publications/publication_downloads/Luxburg_Stability_2010.pdf ).

I'd just add that model stability should me measured in the areas of high density of examples; it's not really overfitting if the predictions of nonsensical examples are different between different samplings of the training set while the predictions on an i.i.d. test set stay the same (this is the same issue with active learning with artificial examples, I'd say, where a labeler can easily try too hard to specify a precise boundary in areas where no naturally occurring example would ever exist).

(Jan 06 '11 at 10:37) Alexandre Passos ♦

I also like this answer. A stable model is never overfit, and an unstable model is often overfit. Of course, an unstable model can simply mean the model is junk (for a contrived example, a random number generator), but for practical purposes I think model stability is the best measure of overfitting I have heard so far.

(Jan 06 '11 at 17:15) rm999

I'd go with something closer to your definition 1. Definition 2 seems too specific (after all there are many other methods of controlling overfitting), and definition 3 doesn't apply to most methods where people talk about overfitting.

The simplest definition I can think of for the batch supervised learning setting is that the amount of overfitting is the difference between the average performance on the training and on the test sets. This amount can be negligible, big, or even negative (as sometimes happens when you have artificially hardened training sets, for example). Of course, this depends on how you measure performance. If you have a probabilistic batch unsupervised model I think it makes sense to use the same definition (mean held-out log-likelihood far smaller than mean training set log-likelihood means something bad).

For the online setting I'd say it's quite harder, as it's hard to distinguish over-fitting from under-fitting, since both will lead to higher regret.

answered Jan 05 '11 at 08:01

Alexandre%20Passos's gravatar image

Alexandre Passos ♦
1896744214334

Something like definition 1 is fine for comparing the same type of model, but can be dangerous when comparing across types of models. Some models, for example many non-parametric models, have very low or zero training error, but this does not make them inherently over-fit.

(Jan 05 '11 at 13:09) rm999

For comparing different models from different types, shouldn't "has higher performance on test data" be the criterion, instead of "has more similar performance on training and test data (i.e., doesn't overfit)"?

Also, I think it's arguable that nonparametric models that necessarily get every training example right must overfit. Look at SVM with a gaussian kernel: while it can get everything right by definition, usually that's a bad idea, and you should trade off getting training examples wrong with a more generalizable model.

(Jan 05 '11 at 15:09) Alexandre Passos ♦

Yes, I agree that overfitting is tough to compare across models, that is all I wanted to point out. This is what I explain in my answer, that I think unless you can vary some parameter (which is why Ryan Turner brings up black box models, I believe) overfitting is an arbitrary concept.

Perhaps this is an example of the lack of precision in the term "overfitting", but I do not think nonparametric models that create perfect predictions on the test set are necessarily overfit. Any robust model that creates good predictions on new examples is fit well. Avoiding perfect test predictions can actually lead to under-fitting if the model suffers.

(Jan 05 '11 at 15:38) rm999

Using the classical bias-variance decomposition of mean square error, overfitting can be described as a characteristic of models with high variance and low bias. Especially high variance indicates how sensitive are the results to the specific choice of the training set.

answered Jan 05 '11 at 06:42

Stelios%20Sfakianakis's gravatar image

Stelios Sfakianakis
1113

I have heard people relate the bias-variance decomposition to overfitting. However, this is not completely general since it is based on square loss. We need an analysis that is more general and not applicable to one loss function.

If we take overfitting to mean high variance/low bias methods then that would suggest something along the lines of definition 2. We could use some sort of bootstrap methods to infer the variance/bias and detect overfitting.

(Jan 05 '11 at 14:53) Ryan Turner

I think this is a good question, but I don't know of a precise definition. I think it is often an empirical judgment and that the term could suffer if it formalized too much.

I consider over-fitting as any situation in which the model looks "good" on the training data but does not generalize well to the kinds of data it will have to predict (which can be approximated with a test set). The problem is, "good" and "generalize well" are relative terms. In situations where you can vary some parameter (like number of basis functions, or number of epochs), you can compare the errors across the parameter space and look for examples when the train error is relatively good but the test error is relatively poor. If you can't vary a parameter, I think over-fitting is irrelevant because you have no control over how well the model fits your data. Instead, model performance on your test sets is what matters (which is half my definition of over-fitting). Instead of varying parameters, you can vary models. Models that perform better on your test datasets are probably better than those that do not.

I would also like to add to your list that over-fitting and model complexity are often related. While it is not always true, complex models tend to over-fit more than simpler models.

answered Jan 05 '11 at 02:50

rm999's gravatar image

rm999
6125

I tend to think of over-fitting as a under-generalization with over-specialization that focuses on artifacts in the data. It doesn't encompass all that I've seen discussed as over-fitting, but I think it's a reasonably precise definition

To deal with it, you could try generating a held-out set that contains a lot of artifacts in the data. This wouldn't necessarily be a random sample; it could be a deliberately chosen set of samples that may cause problems.

If it's learning artifacts and noise in the data, assuming what you're working on uses an MCMC-style approach you could see how much it 'wanders' around before and/or after reaching the equilibrium distribution (gibbs sampling). You could also come up with a good way of visualizing the activations and/or measuring their behavior. If it's trying to understand spurious features the latent/state variables will likely be more unpredictable/volatile.

-Brian

This answer is marked "community wiki".

answered Jan 05 '11 at 04:18

Brian%20Vandenberg's gravatar image

Brian Vandenberg
644183444

edited Jan 05 '11 at 04:22

Theoretically, you can avoid over-fitting by doing full bayesian learning. That is, your model will be a mixture of all possible models. And the models will be weighted by their posterior probabilities: P(model|seen_data). You can use the Solomonoff universal prior for the model priors.

So over-fitting + other flaws is the choice of a sub-optimal model. I don't know how to universally distinguish over-fitting from the other flaws of an approximation.

answered Jan 05 '11 at 04:23

Ivo%20Danihelka's gravatar image

Ivo Danihelka
23551015

Bayesian methods can overfit less than nonbayesian methods, but they can easily overfit if the prior's too weak, for example.

(Jan 05 '11 at 07:52) Alexandre Passos ♦

This might be to my point in the question. It is sometimes said Bayesian methods can't (or can) overfit. Or that AdaBoost can't (or can) overfit. If we don't have a formal definition of overfitting how can we objectively prove or refute these claims?

(Jan 05 '11 at 14:49) Ryan Turner

For all three definitions both bayesian methods and adaboost overfit, and according to definitions 1 and 2 adaboost overfits rather badly (for example) if there's even a small amount of label noise (for definition 2, think subsampling training sets with different label noise added).

(Jan 05 '11 at 15:21) Alexandre Passos ♦
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.