|
I am currently comparing many learning algorithms on a specific dataset. My methodology is, roughly, to estimate each algorithm's test error by cross-validation and compare those estimates.

This works fine for, say, SVM and GMM, but it breaks down with algorithms trained by gradient descent, such as neural networks. The usual way of training those is to do early stopping with a validation set. However, using the "real" validation set for early stopping would risk overfitting and give overly optimistic results. I can see a few alternatives to this:

1. Do nested cross validation on the training set (i.e., split each training fold again into a training part and a validation part).
2. Set the number of iterations as a hyperparameter and try a few different values.
3. Add regularization like weight decay instead of early stopping.
Another concern is the initialization of the weights. I am currently initializing them by picking random values in the range +/- 1/sqrt(#inputs to layer) (as described here: http://www.willamette.edu/~gorr/classes/cs449/precond.html). I tried varying the random number generator's seed and noticed that the range of final errors is quite large. How, then, am I to fairly compare this algorithm to the ones that don't depend on a random seed? Try many seeds and average the final test errors? Train with many different seeds and bag the models? Any suggestions would be appreciated.
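For concreteness, here is roughly what I am doing for the initialization and what I mean by averaging over seeds (a minimal sketch, assuming NumPy; train_and_evaluate is just a placeholder for my training routine, which returns a final test error):

```python
import numpy as np

def init_layer_weights(n_inputs, n_outputs, rng):
    # Uniform in +/- 1/sqrt(#inputs to layer), as on the linked page / Efficient Backprop.
    limit = 1.0 / np.sqrt(n_inputs)
    return rng.uniform(-limit, limit, size=(n_inputs, n_outputs))

def seed_averaged_test_error(train_and_evaluate, seeds=range(10)):
    # Placeholder loop: train one network per seed and average the final test errors.
    errors = [train_and_evaluate(np.random.default_rng(s)) for s in seeds]
    return float(np.mean(errors)), float(np.std(errors))
```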
|
You can solve this easily in the cross-validation setting: for the algorithms that need a validation set, divide the training data (i.e., the folds selected for training in cross-validation) into a training set and a validation set, only for those algorithms. This cleanly accounts for the fact that early stopping always needs some extra data that the other algorithms can incorporate directly into their training set.

As for seeds, the usual thing to do is to try many different ones and pick the best on the validation set (not on the actual test set) at each step of the way; otherwise your comparison gets unrealistic. There are also seeding strategies for neural networks that don't lead to such high variance in the results if you optimize properly (be aware that the validation error might go up momentarily during training, so you shouldn't stop at the first increase in validation error).
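In code, the per-fold split looks something like this (a minimal sketch, assuming scikit-learn; fit_with_early_stopping and score stand in for your training and evaluation routines):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

def cv_error_with_early_stopping(X, y, fit_with_early_stopping, score, n_splits=5):
    fold_scores = []
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X):
        # Carve a validation set out of the *training* fold only; the test fold
        # is never seen during training or early stopping.
        X_tr, X_val, y_tr, y_val = train_test_split(
            X[train_idx], y[train_idx], test_size=0.2, random_state=0)
        model = fit_with_early_stopping(X_tr, y_tr, X_val, y_val)
        fold_scores.append(score(model, X[test_idx], y[test_idx]))
    return float(np.mean(fold_scores))
```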
Thanks for your reply! This is what I meant by nested cross validation in my first option. The main problem with this is the lack of data and the extra computational cost incurred. I'm already spending more time on neural networks because they have so many more hyperparameters than the other algorithms I am looking at...
Do you have any references for those? My current initialization strategy is from the website I listed in the original post, and it also shows up in LeCun's Efficient Backprop (http://yann.lecun.com/exdb/publis/pdf/lecun-98b.pdf).
yeastwars (Feb 05 '11 at 20:36)
Be careful if you have very little data, however, as any cross-validation strategy will probably end up biased towards overfitting more complex models (such as neural networks). If you can't afford the computational cost, think about switching to fixed training, development, and test sets, which can save you a lot of time.
Alexandre Passos ♦ (Feb 05 '11 at 20:40)
|
Of your ideas, I think "Do nested cross validation on the training set" is the most rigorous, but expensive, as you note. In my opinion, the second best is: "Set the number of iterations as a hyperparameter and try a few different values", which is a little cruder, but not unreasonable. I would avoid "Add regularization like weight decay". Sorry, no deep insight, just some mild opinions.
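If it helps, the "number of iterations as a hyperparameter" option might look roughly like this (a sketch, assuming scikit-learn with its MLPClassifier as a stand-in network; the candidate iteration counts are arbitrary):

```python
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

def pick_n_iterations(X_train, y_train, candidates=(50, 100, 200, 400)):
    # Treat the number of training iterations like any other hyperparameter and
    # compare a handful of values by inner cross-validation on the training data.
    scores = {n: cross_val_score(MLPClassifier(max_iter=n, random_state=0),
                                 X_train, y_train, cv=3).mean()
              for n in candidates}
    return max(scores, key=scores.get)
```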
Thanks. Any particular reason for avoiding weight decay?
In neural networks, weight decay empirically underperforms early stopping. It matters more when optimizing simpler, linear models, especially with optimizers that find the global optimum or something close to it.
It is okay to overfit the validation set a little bit (i.e., not to get early stopping exactly right) if you then vote the resulting models. (You write "bag" above, but I think you really meant just voting.) The extra variance from slight overfitting may actually be beneficial if it comes with a reduction in bias.
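A minimal sketch of that voting step, assuming NumPy and a list of already-trained models (e.g., one per random seed) whose predict method returns non-negative integer class labels:

```python
import numpy as np

def vote(models, X):
    # Each model votes with its predicted label; the majority wins, with ties
    # broken toward the smallest class label.
    preds = np.stack([m.predict(X) for m in models])  # shape: (n_models, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)
```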
Also, the paper "The Cascade-Correlation Learning Architecture" by Fahlman and Lebiere might be useful to you. If you believe the results (I have no personal experience either way), this kind of neural network is much faster and easier to train.