
I am currently comparing many learning algorithms on a specific dataset. My methodology is:

  • Split the dataset into a test set of size N/3 and a design set of size 2N/3
  • Use k-fold cross validation on the design set to select the best parameters for each algorithm
  • This gives one "best" set of parameters per algorithm, which can then be evaluated on the test set either by retraining on the full design set or by bagging all k models (a rough sketch of the whole protocol follows this list)
  • Ideally repeat with a new test/design split, but this is so expensive that I doubt I'll have the time to do it
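
Concretely, something like the following sketch (scikit-learn style; the dataset, the SVM, and the parameter grid are arbitrary placeholders, not the actual ones I'm using):

```python
# Sketch of the protocol above: N/3 test split, k-fold CV on the design set,
# then one evaluation of the selected model on the test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# Test set of size N/3, design set of size 2N/3.
X_design, X_test, y_design, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0, stratify=y)

# k-fold cross validation on the design set selects the "best" parameters.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1.0]}, cv=5)
grid.fit(X_design, y_design)

# refit=True (the default) retrains the winner on the full design set,
# which is then scored once on the held-out test set.
print(grid.best_params_, grid.best_estimator_.score(X_test, y_test))
```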

This works fine for, say, SVM and GMM, but breaks down with algorithms trained by gradient descent, such as neural networks. The usual way of training those is to do early stopping with a validation set. However, using the "real" validation set (the held-out fold that scores each parameter setting) for early stopping would risk overfitting to it and give overly optimistic results. I can see a few alternatives:

  • Do nested cross validation on the training set (too expensive; the particular type of neural network I am using already has twice as many hyperparameters as the other algorithms...)
  • Set the number of iterations as a hyperparameter and try a few different values (see the sketch after this list)
  • Add regularization like weight decay (yet another hyperparameter, and has never worked very well for me in terms of avoiding overfitting)
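
For the second option, the idea would be to fold the iteration budget into the same grid search, roughly like this (again with a placeholder estimator and placeholder values):

```python
# Sketch: treat the number of training iterations as an ordinary hyperparameter
# and let the same k-fold grid search pick it.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_design, X_test, y_design, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

param_grid = {
    "hidden_layer_sizes": [(16,), (32,)],
    "max_iter": [50, 100, 200, 400],   # iteration count as a hyperparameter
}
grid = GridSearchCV(MLPClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_design, y_design)
print(grid.best_params_)
```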

Another concern is with the initialization of the weights. I am currently initializing them by picking random values in the range +/- 1/sqrt(#inputs to layer) (as described here http://www.willamette.edu/~gorr/classes/cs449/precond.html). I tried varying the random number generator's seed and noticed that the range of final errors is quite large. How, then, am I to fairly compare this algorithm to the other ones that don't depend on a random seed? Try many seeds and average the final test error? Train with many different seeds and bag the models?
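
To make those last two options concrete, here is a rough sketch of the +/- 1/sqrt(#inputs) initialization plus the try-many-seeds-and-average idea (the layer sizes, seeds, and the error computation are placeholders):

```python
# Sketch: uniform initialization in +/- 1/sqrt(fan_in), repeated over several
# seeds, reporting the mean and spread of the final error.
import numpy as np

def init_layer(n_in, n_out, rng):
    r = 1.0 / np.sqrt(n_in)                  # range from the linked notes
    return rng.uniform(-r, r, size=(n_in, n_out))

def train_and_score(seed):
    rng = np.random.RandomState(seed)
    W1 = init_layer(20, 32, rng)              # hidden layer weights
    W2 = init_layer(32, 1, rng)               # output layer weights
    # ... train with gradient descent and compute the test error ...
    return float(np.abs(W1).mean() + np.abs(W2).mean())   # placeholder error

errors = [train_and_score(seed) for seed in range(10)]
print(np.mean(errors), np.std(errors))        # mean and spread over seeds
```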

Any suggestions would be appreciated.

asked Feb 05 '11 at 19:37

yeastwars

edited Feb 05 '11 at 19:43

Of your ideas, I think "Do nested cross validation on the training set" is the most rigorous, but expensive, as you note. In my opinion, the second best is: "Set the number of iterations as a hyperparameter and try a few different values", which is a little cruder, but not unreasonable. I would avoid "Add regularization like weight decay". Sorry, no deep insight, just some mild opinions.

(Feb 05 '11 at 20:12) Will Dwinnell

Thanks. Any particular reason for your avoidance of weight decay?

(Feb 05 '11 at 20:37) yeastwars

In neural networks, weight decay empirically underperforms early stopping. It matters more when optimizing simpler, linear models, especially with optimizers that find the global optimum or something close to it.

(Feb 05 '11 at 20:38) Alexandre Passos ♦

It is okay if you overfit a validation set a little bit (i.e., don't get early stopping just right) if you then vote the resulting models. (You write "bag" above, but I think you really meant just voting.) The extra variance from slight overfitting may actually be beneficial if it comes with a reduction in bias.
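
Roughly like this, assuming the k cross-validation models expose a predict method and non-negative integer class labels:

```python
# Sketch: majority-vote the k models from cross validation instead of relying
# on any single early-stopping point.
import numpy as np

def vote(models, X_test):
    # Stack each model's class predictions and take the per-example majority.
    preds = np.stack([m.predict(X_test) for m in models])    # shape (k, n_examples)
    return np.apply_along_axis(
        lambda col: np.bincount(col.astype(int)).argmax(), 0, preds)
```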

Also, the paper "The Cascade-Correlation Learning Architecture" by Fahlman and Lebiere might be useful to you. If you believe the results (I have no personal experience either way), this kind of neural network is much faster and easier to train.

(Feb 17 '11 at 18:34) Art Munson

One Answer:

You can solve this easily in the cross-validation setting: for the algorithms that need a validation set, divide the training data (i.e., the folds selected for training in cross-validation) into a training set and a validation set; the other algorithms train on the whole thing. This cleanly accounts for the fact that early stopping always needs some extra data that the other algorithms can incorporate directly into the training set.
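
As a minimal sketch, assuming a scikit-learn-style KFold loop and an arbitrary 90/10 inner split:

```python
# Sketch: inside each cross-validation fold, carve a small validation set out
# of the training portion, but only for the algorithms that need early stopping.
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=0)

for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_tr, X_val, y_tr, y_val = train_test_split(
        X[train_idx], y[train_idx], test_size=0.1, random_state=0)
    # Train the network on (X_tr, y_tr), early-stop on (X_val, y_val),
    # then score the parameter setting on the fold's held-out part X[test_idx].
```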

The usual thing to do is to try many different random seeds and pick the one that does best on the validation set (not on the actual test set) at each step of the way; otherwise your comparison becomes unrealistic. There are seeding strategies for neural networks that don't lead to such high variance in the results if you optimize properly (be aware that the validation error might go up momentarily during training, and you shouldn't stop at the first increase in validation error).
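
A rough sketch of that loop with patience-based stopping; init_model, train_one_epoch, and validation_error are hypothetical hooks into the real training code:

```python
# Sketch: one training run per seed, early-stopped with some patience rather
# than at the first uptick in validation error.
import numpy as np

def run(seed, init_model, train_one_epoch, validation_error,
        patience=10, max_epochs=500):
    # The three function arguments are hypothetical hooks into the real code.
    rng = np.random.RandomState(seed)
    model = init_model(rng)
    best_err, best_model, bad_epochs = np.inf, model, 0
    for _ in range(max_epochs):
        model = train_one_epoch(model)
        err = validation_error(model)
        if err < best_err:
            best_err, best_model, bad_epochs = err, model, 0
        else:
            bad_epochs += 1                 # validation error may rise briefly
            if bad_epochs >= patience:
                break
    return best_err, best_model

# Pick the seed that does best on the validation set (never the test set), e.g.:
# best = min((run(s, init, step, val_err) for s in range(5)), key=lambda p: p[0])
```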

answered Feb 05 '11 at 20:10

Alexandre Passos ♦

Thanks for your reply!

"You can solve this easily in the cross-validation setting: for the algorithms that need a validation set, divide the training data (i.e., the folds selected for training in cross-validation) into a training set and a validation set; the other algorithms train on the whole thing. This cleanly accounts for the fact that early stopping always needs some extra data that the other algorithms can incorporate directly into the training set."

This is what I meant by nested cross validation in my first option. The main problem with this is a lack of data and the extra computation cost incurred. I'm already spending more time on neural networks because they have so many more hyperparameters than the other algorithms I am looking at...

"There are seeding strategies for neural networks that don't lead to such high variance in the results if you optimize properly"

Do you have any references for those? My current initialization strategy is from the website I listed in the original post, and also shows up in LeCun's Efficient Backprop (http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf).

(Feb 05 '11 at 20:36) yeastwars

Be careful if you have very little data, however, as any cross-validation strategy will probably end up too biased towards overfitting more complex models (such as neural networks). If you can't afford the computational cost, think about switching to fixed training, development, and test sets, which can save you a lot of time.
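
For example, a single fixed split along these lines (the 60/20/20 proportions are arbitrary):

```python
# Sketch: one fixed train/development/test split as a cheaper fallback to
# repeated cross validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)
```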

(Feb 05 '11 at 20:40) Alexandre Passos ♦