Hi all, which of the two methods is more effective at avoiding overfitting: regularization, or tuning the model's hyperparameters through cross-validation? Is one of these methods preferable for large/small data sets? Is there any experimental evidence concerning this question? Thanks.
The two methods are generally used together, since they serve different purposes: regularization constrains model complexity during training, while cross-validation estimates generalization error and hence lets you choose hyperparameters.
It is possible to combine the two by doing a cross-validated grid search for the optimal value of the regularization parameter (e.g. C in an SVM). Edit: arguably, if the algorithm is able to scale to large datasets (linear training time) and that data is available cheaply (generally not true for supervised learning), then regularization is less important, as the redundancy in the training set will generally act as a natural regularizer that prevents overfitting. It is still interesting to do cross-validation (maybe online cross-validation, to make it scalable) so as to measure the remaining amount of overfitting.

Thanks. "online cross validation"? What is this?
(Feb 10 '12 at 06:17)
Lucian Sasu
If you have so much data that you know your online algorithm will be fitted in a single pass, then you can buffer the incoming data in minibatches and use each minibatch twice: first for testing, then for training.
To limit the stochasticity of the test-error estimate, you can smooth it with an exponentially weighted averaging scheme. AFAIK the SGD model in Mahout does this, along with maintaining several online models in parallel (plus some kind of evolutionary algorithm to blend the best ones from time to time).
(Feb 10 '12 at 06:23)
ogrisel
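The test-then-train scheme described in the comment above can be sketched in a few lines. This is a minimal illustration, not Mahout's actual implementation: the perceptron learner, the toy 2-D data stream, and the batch size of 50 are all assumptions made for the example.

```python
import random

# Toy online learner: a plain perceptron on 2-D points
# (illustrative stand-in for any single-pass online algorithm).
class Perceptron:
    def __init__(self):
        self.w = [0.0, 0.0]
        self.b = 0.0

    def predict(self, x):
        return 1 if self.w[0] * x[0] + self.w[1] * x[1] + self.b > 0 else -1

    def update(self, x, y):
        # Standard perceptron rule: adjust weights only on a mistake.
        if self.predict(x) != y:
            self.w[0] += y * x[0]
            self.w[1] += y * x[1]
            self.b += y

def stream(n, seed=0):
    # Synthetic separable stream: label = sign(x0 + x1).
    rng = random.Random(seed)
    for _ in range(n):
        x = (rng.uniform(-1, 1), rng.uniform(-1, 1))
        yield x, 1 if x[0] + x[1] > 0 else -1

model = Perceptron()
ewma_error = None
alpha = 0.1   # smoothing factor for the exponentially weighted test error
batch = []
for example in stream(2000):
    batch.append(example)
    if len(batch) == 50:
        # 1) Test: measure error on the minibatch *before* training on it,
        #    so it acts as held-out data for the current model.
        err = sum(model.predict(x) != y for x, y in batch) / len(batch)
        ewma_error = err if ewma_error is None else (1 - alpha) * ewma_error + alpha * err
        # 2) Train: now consume the same minibatch for learning.
        for x, y in batch:
            model.update(x, y)
        batch = []

print(ewma_error)  # smoothed online estimate of the generalization error
```

Because each minibatch is scored before it is learned from, the smoothed error tracks generalization performance in a single pass, with no separate held-out set.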
There is no rule against using both of them together. Cross-validation is usually a good method to find the best value of your regularization parameter. You can fit the parameters by minimizing the training cost (which includes the regularization term), then evaluate them on the cross-validation cost (which does not). Repeating this for different values of the regularization parameter and comparing the cross-validation costs lets you pick the best one.
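The procedure in both answers — cross-validating over a grid of regularization strengths — can be sketched as follows. The 1-D ridge regressor with its closed-form fit, the noisy linear data, and the particular lambda grid are illustrative assumptions chosen to keep the example self-contained.

```python
import random

def fit_ridge(data, lam):
    # Closed-form 1-D ridge regression: w = sum(x*y) / (sum(x^2) + lambda).
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam)

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

# Synthetic data: y = 2x plus Gaussian noise.
rng = random.Random(0)
data = []
for _ in range(100):
    x = rng.uniform(-1, 1)
    data.append((x, 2.0 * x + rng.gauss(0, 0.3)))

# 5-fold cross-validated grid search over the regularization strength.
k = 5
folds = [data[i::k] for i in range(k)]
best_lam, best_err = None, float("inf")
for lam in [0.0, 0.01, 0.1, 1.0, 10.0]:
    err = 0.0
    for i in range(k):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        err += mse(fit_ridge(train, lam), folds[i])  # validate on held-out fold
    err /= k
    if err < best_err:
        best_lam, best_err = lam, err

print(best_lam, best_err)
```

The same pattern applies directly to C in an SVM or any other regularization hyperparameter: fit on k-1 folds with regularization included, score on the held-out fold without it, and keep the value with the lowest average validation error.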