I am puzzled by the idea of using cross-validation to select hyperparameters (for example, C and gamma for an SVM). The procedure, as I understood it, is the following:
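Roughly, as a Python/scikit-learn sketch (the particular grid of C and gamma values and the helper name are only placeholders):

```python
# A rough sketch of the per-fold procedure I had in mind: pick the best setting
# inside each fold, then somehow combine the k per-fold winners.
from collections import Counter
from sklearn.model_selection import KFold
from sklearn.svm import SVC

grid = [(C, gamma) for C in (0.1, 1, 10) for gamma in (0.01, 0.1, 1)]

def select_per_fold(X, y, k=5):
    chosen = []                                    # one (C, gamma) choice per fold
    for train_idx, val_idx in KFold(n_splits=k).split(X):
        best, best_score = None, -float("inf")
        for C, gamma in grid:                      # pick the best setting inside this fold
            clf = SVC(C=C, gamma=gamma).fit(X[train_idx], y[train_idx])
            score = clf.score(X[val_idx], y[val_idx])
            if score > best_score:
                best, best_score = (C, gamma), score
        chosen.append(best)
    # Aggregate the k per-fold winners, e.g. by majority vote -- this aggregation
    # step is exactly the part that is unclear to me.
    return Counter(chosen).most_common(1)[0][0]
```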
Two things bother me about this procedure. First, what is a reasonable way to aggregate the k hyperparameter choices made in the individual folds? Second, this procedure is a huge processing-time investment. The step above usually sits inside an outer k'-fold cross-validation that estimates the quality of the classifier itself, so it is repeated k' times! That investment in processing time could be justified if the resulting hyperparameters yielded a much higher-quality classifier, but that seems unlikely: if the hyperparameters chosen by cross-validation end up close to those chosen on a single validation set, the extra computation buys little. The reasonable alternative would be a standard validation-set procedure:
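Something along these lines (again just a sketch; the validation fraction and the reuse of `grid` from above are placeholders):

```python
# A sketch of the single-validation-set alternative: split once, score every
# setting on the held-out validation data, keep the best one.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def select_on_validation_set(X, y, grid, val_fraction=0.2):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=val_fraction)
    scores = {(C, gamma): SVC(C=C, gamma=gamma).fit(X_tr, y_tr).score(X_val, y_val)
              for C, gamma in grid}
    return max(scores, key=scores.get)             # the setting with the best validation score
```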
Does anyone have experience where obtaining the hyperparameters by cross-validation resulted in a classifier of much higher quality than using hyperparameters derived from a standard validation-set procedure?

========== ADDED after the answers by Alexandre Passos, digdug, SeanV and Leon Palafox ==========

There is no standard way of replying to all the answers at once, so I am adding this to the original post. Thank you all for the answers. I was really wrong about the place where the loop over all hyperparameters enters the procedure. According to the combined wisdom of the respondents, and so that it is documented for future readers, the correct procedure for cross-validation selection of hyperparameters is:
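In scikit-learn terms, roughly (a minimal sketch; the grid and the SVC model are placeholders, and scikit-learn's GridSearchCV wraps the same pattern):

```python
# Corrected procedure: the loop over hyperparameter settings is the OUTER loop,
# every setting is scored on all k folds, the fold scores are averaged, and the
# winning setting is retrained on all the training data.
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def select_by_cross_validation(X, y, grid, k=5):
    mean_scores = {
        (C, gamma): cross_val_score(SVC(C=C, gamma=gamma), X, y, cv=k).mean()
        for C, gamma in grid
    }
    best_C, best_gamma = max(mean_scores, key=mean_scores.get)
    return SVC(C=best_C, gamma=best_gamma).fit(X, y)   # final model on the full training set
```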
Isn't the whole point of cross-validation to train on different subsets of the training data? I can see how for some hyperparameters, like those of SVMs, it might seem like overkill, but let's apply your approach to K-Means.

The hyperparameter in K-Means is usually the number of clusters (or the support of the Dirichlet prior in a mixture of Gaussians). In your approach you are essentially using the same dataset to check for the best number of clusters, and you assume that this choice will generalize well to unseen data. While this might be true, it is a very strong assumption, and you run into the question of how large the training dataset would need to be in order to generalize well. If you use cross-validation, though, you make sure you have multiple passes over different parts of the data, and if you consistently get K clusters, you can assume that K is the right setting.

Another hyperparameter that comes to mind is the order of the polynomial in linear regression: if you use a 5th-order polynomial instead of a 3rd-order one on a particular dataset, you will be overfitting to that specific training set. Once you test it on your test set, I would bet you won't get that good a regression. Perhaps you would, but that only happens when the training data looks very much like the test data, which in regression (for predicting the future) usually doesn't happen.

Instead of aggregating, you could also take the hypothesis that had the smallest error on the held-out test set, then retrain the model using those parameters and use that as your final model.
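To make the polynomial-order point concrete, here is a rough sketch (the synthetic data and the set of degrees are arbitrary choices for illustration):

```python
# Score each polynomial degree with k-fold cross-validation instead of a single
# split: high-degree fits can look great on one split but not across folds.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=60)

for degree in (1, 3, 5, 9):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(degree, -scores.mean())      # mean cross-validated MSE per degree
```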
I think there are two potential problems with using a single validation set as opposed to cross-validation or repeated splitting. One, which Leon mentioned, is the size of the data: when you say "select a validation set V of appropriate size", in most (all?) cases you have no idea what an appropriate size is, and in many real-world cases you probably don't have enough data of "appropriate size" and have to work with what you have.

This ties into the other problem, which is variance. If you have 10M samples split evenly into training and validation and only 5 parameters to fit (exaggerating here), then the validation-set results will suffice, because the precision of your validation estimates will be very high (low variance); there is no real need for cross-validation, since you would just get the same results. But if you have higher-dimensional data and not enough of it, the validation estimate of AUC/R^2/MSE/whatever will likely be noisy and hence unreliable for picking hyperparameters. Repeating the split multiple times lets you average the performance estimates (somewhat like the bootstrap), reduce the variance of your estimates, and get more reliable results.
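As a rough illustration of the variance point (the dataset size, the SVC model and the split settings are arbitrary stand-ins):

```python
# Each ShuffleSplit repetition is one single-split (hold-out) estimate of the score;
# their spread shows how noisy one such estimate is, while their mean is the
# lower-variance, averaged estimate.
from sklearn.datasets import make_classification
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
scores = cross_val_score(SVC(C=1.0, gamma="scale"), X, y,
                         cv=ShuffleSplit(n_splits=50, test_size=0.3, random_state=0))
print("spread of single-split estimates (std):", scores.std())
print("averaged estimate over repeated splits:", scores.mean())
```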
I think your pseudocode for cross-validation hyperparameter selection is non-standard. In your code you select the best setting within each fold. What is usually done is, for each hyperparameter setting, evaluate it on all folds and pick the setting that is best overall. This avoids the issue you pointed out, where it is not clear how to aggregate the votes from the individual folds.

Secondly, about the processing-time investment: cross-validation is indeed a lot more expensive than using a fixed train/test split to select hyperparameters. I've only ever found it to be worthwhile in settings where the total amount of data is small. By a Chernoff-type bound, an estimate of accuracy on a test set of size n deviates from its true value by more than ε only with probability O(exp(-C n ε²)) for some constant C, and for regression one can get a similar bound by assuming that the variable is bounded. So unless there is so little data that the natural variance of the problem is smaller than the difference between this bound for n equal to the full dataset size and for n equal to a fixed fraction of it, you're better off using a fixed test set.

However, I've found that for many learning and optimization methods the variance in performance as you vary the hyperparameters is big enough that it's often a better use of a fixed computational budget to do hyperparameter selection on less data than to use more data with a rule-of-thumb hyperparameter. This is especially true of optimization methods like stochastic gradient descent, where the hyperparameters are not easy to set intuitively, there is often an order-of-magnitude difference in performance across settings, and there is also a phase transition of sorts: there is an ideal learning rate, underestimating it makes the algorithm slower, and overestimating it makes it not converge at all (Nesterov's accelerated gradient is an example of something of the sort).

So cross-validation is not to blame for slowness unless you have too much data and don't really need it, but grid search might be, as it is exponential in the number of hyperparameters. Recent papers have suggested that random search or Bayesian optimization are much better alternatives whenever one has more than a few hyperparameters to tune.
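For the random-search alternative mentioned at the end, a minimal scikit-learn sketch (the distributions for C and gamma and the use of SVC are arbitrary examples):

```python
# Random search samples a fixed number of settings instead of an exponential grid;
# each sampled setting is still scored on all cv folds.
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

search = RandomizedSearchCV(
    SVC(),
    param_distributions={"C": loguniform(1e-2, 1e2), "gamma": loguniform(1e-4, 1e0)},
    n_iter=30,   # number of sampled settings
    cv=5,        # k folds per setting
)
# search.fit(X_train, y_train) would then expose the winner as search.best_params_
# (X_train, y_train being whatever training data you have).
```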
To put the previous answers more strongly:

a) You don't aggregate the hyperparameters, you aggregate the error scores: for each hyperparameter value you calculate the mean [squared] error across the k folds.

b) You also get "confidence intervals" of a sort, by looking at the standard error of the mean estimates (i.e. the standard deviation of your k samples of the validation error).

c) k-fold rather than 1-fold is done to reduce the variance of your validation-error estimates: by using k folds the variance is reduced by roughly a factor of k. So, as digdug said, it is only useful if you have high variance in your estimate; but unless you do k folds it is hard to know what the variance of your estimates is, and therefore whether your hyperparameter-vs-validation-error curve is just noise.
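As a sketch of (a) and (b) (the one-dimensional grid of C values, the SVC model and the use of classification error rather than squared error are placeholders):

```python
# For each hyperparameter value, collect the k per-fold errors, then report their
# mean and standard error of the mean.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def cv_error_table(X, y, Cs=(0.1, 1.0, 10.0), k=5):
    table = {}
    for C in Cs:
        errors = 1.0 - cross_val_score(SVC(C=C), X, y, cv=k)            # k per-fold error estimates
        table[C] = (errors.mean(), errors.std(ddof=1) / np.sqrt(k))     # mean error, standard error
    return table
```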