
Let's say that I have a classifier which tends to overfit and absolutely must be trained with stochastic gradient descent and early stopping. To stop early, I can split my dataset into two parts: one for training, the other serving as a validation set to monitor the loss. However, this is wasteful, especially on very small datasets: the classifier will never have "seen" the examples in the validation set. What if those examples were crucial for doing a proper job on new data points?

My approach to this is to repeat the process n times over random splits. This creates n different models that will have seen different subsets of the dataset. To classify a new data point, I can then run those n classifiers and either take the average of their output (assuming that they are well-calibrated probabilities), or take a majority vote.

Is there a potential issue with doing something like this, other than the obvious computational cost of having to run n classifiers each time I want to classify a new data point?
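
For concreteness, a rough sketch of the scheme I have in mind (assuming a recent scikit-learn; the split ratio, number of models, learning rate, and patience are arbitrary placeholder choices, not a recommendation):

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    def fit_one(X_tr, y_tr, X_val, y_val, max_epochs=200, patience=5):
        """SGD with early stopping against a held-out validation split."""
        # loss="log_loss" in recent scikit-learn (older releases call it "log")
        clf = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01)
        classes = np.unique(y_tr)
        best_loss, best_params, wait = np.inf, None, 0
        for _ in range(max_epochs):
            clf.partial_fit(X_tr, y_tr, classes=classes)
            val = log_loss(y_val, clf.predict_proba(X_val), labels=classes)
            if val < best_loss:
                best_loss, wait = val, 0
                best_params = (clf.coef_.copy(), clf.intercept_.copy())
            else:
                wait += 1
                if wait >= patience:   # validation loss stopped improving
                    break
        clf.coef_, clf.intercept_ = best_params   # roll back to the best epoch
        return clf

    def fit_ensemble(X, y, n_models=10, val_frac=0.2):
        """Repeat the split / train / early-stop procedure over random splits."""
        models = []
        for seed in range(n_models):
            X_tr, X_val, y_tr, y_val = train_test_split(
                X, y, test_size=val_frac, random_state=seed, stratify=y)
            models.append(fit_one(X_tr, y_tr, X_val, y_val))
        return models

    def average_proba(models, X_new):
        """Average the (hopefully well-calibrated) probabilities of the n models."""
        return np.mean([m.predict_proba(X_new) for m in models], axis=0)

A majority vote would instead tally each model's predict(X_new) and pick the most frequent class.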

asked Oct 24 '12 at 18:18

yeastwars

edited Oct 24 '12 at 18:23

Why not just do leave-one-out or k-fold cross-validation? Your method only trains each learner on n/2 points, whereas leave-one-out trains on n-1.

Then for the "production" model, instead of averaging all your learners, simply train a single learner on the whole data set with the same number of SGD steps.

(Oct 24 '12 at 18:46) Douglas Colkitt

Why not just do leave-one-out [...] Then for the "production" model, instead of averaging all your learners, simply train a single learner on the whole data set with the same number of SGD steps.

That might work if training isn't too expensive.

Your method only trains each learner on n/2 points, whereas leave-one-out trains on n-1.

I only mentioned splitting the dataset into two parts. The parts don't have to be of equal size.

(Oct 25 '12 at 15:32) yeastwars
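
For reference, the retrain-on-everything suggestion above might look roughly like this (a sketch assuming a recent scikit-learn; the probe/retrain structure, loss, and hyperparameters are illustrative choices, not the commenter's):

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    def fit_production_model(X, y, max_epochs=200, val_frac=0.2):
        # Use one held-out split only to pick the number of SGD epochs...
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=val_frac, stratify=y)
        classes = np.unique(y)
        probe = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01)
        val_losses = []
        for _ in range(max_epochs):
            probe.partial_fit(X_tr, y_tr, classes=classes)
            val_losses.append(log_loss(y_val, probe.predict_proba(X_val), labels=classes))
        best_epochs = int(np.argmin(val_losses)) + 1   # epoch count chosen by early stopping

        # ...then train the "production" model on the whole data set for that many epochs.
        final = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01)
        for _ in range(best_epochs):
            final.partial_fit(X, y, classes=classes)
        return final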

2 Answers:

Your point is valid. In general, most people who do early stopping use only a small fraction of the data to measure the stopping criterion, while using a larger fraction for training.

If you want to compute an optimal number of data points to set aside for early stopping, you can probably work out the tradeoff using something like a Chernoff bound (assuming your loss function is bounded) to control the error with high probability, plus a union bound to account for the multiple comparisons you are doing during the early stopping process. This gives you a bound on how much worse you can expect your error to be on an unseen test set. The required number of samples ends up being quadratic in the inverse of your desired error margin and logarithmic in the number of comparisons, as in an oracle inequality for model selection.
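
As a rough illustration of the kind of bound being described (my notation, not the answer's: losses bounded in [0, 1], a held-out set of size m, T candidate stopping points, failure probability delta), Hoeffding's inequality plus a union bound gives:

    % Hoeffding + union bound over the T stopping points compared during training
    \Pr\left[ \max_{t \le T} \bigl| \hat{R}_m(h_t) - R(h_t) \bigr| > \varepsilon \right]
        \le 2\, T \exp\!\left( -2 m \varepsilon^2 \right)
    % so it suffices to hold out
    m \ge \frac{\ln(2T/\delta)}{2\varepsilon^2}
    % validation points: quadratic in 1/epsilon and logarithmic in T.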

Note, however, that regardless of your favorite method for model selection (regularization, etc.), you will have some hyperparameters to tune, and you will need to do so on fresh data; otherwise your test-set estimates are biased.

answered Oct 24 '12 at 20:06

Alexandre Passos ♦

Train your model while monitoring progress on a held-out validation set. Record the training error at the optimal stopping point. Retrain the model on all the data and stop when you reach that training error.

What you are doing is also fine and works well.
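
A minimal sketch of the retrain-to-a-matching-training-error recipe above, assuming a recent scikit-learn (the probe model, loss, and hyperparameters are placeholder choices of mine, not part of the answer):

    import numpy as np
    from sklearn.linear_model import SGDClassifier
    from sklearn.metrics import log_loss
    from sklearn.model_selection import train_test_split

    def fit_by_error_matching(X, y, max_epochs=200, val_frac=0.2):
        # Pass 1: train with a held-out validation set and record the training
        # loss at the epoch where the validation loss was lowest.
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=val_frac, stratify=y)
        classes = np.unique(y)
        probe = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01)
        train_losses, val_losses = [], []
        for _ in range(max_epochs):
            probe.partial_fit(X_tr, y_tr, classes=classes)
            train_losses.append(log_loss(y_tr, probe.predict_proba(X_tr), labels=classes))
            val_losses.append(log_loss(y_val, probe.predict_proba(X_val), labels=classes))
        target = train_losses[int(np.argmin(val_losses))]

        # Pass 2: retrain on all the data, stopping once that training loss is reached.
        final = SGDClassifier(loss="log_loss", learning_rate="constant", eta0=0.01)
        for _ in range(max_epochs):
            final.partial_fit(X, y, classes=classes)
            if log_loss(y, final.predict_proba(X), labels=classes) <= target:
                break
        return final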

answered Oct 24 '12 at 22:46

gdahl ♦

edited Oct 24 '12 at 22:47

I've heard this approach mentioned before, and it seems to work very well in practice. I am just curious: do you have any references for the theoretical underpinnings of this approach?

(May 09 '13 at 12:18) Dan Ryan