|
Let's say I have a set of training examples but no prescribed validation set. How do I choose my hyperparameters, and how do I achieve the best generalization? When there is a validation set, we select hyperparameter values to maximize some objective on the held-out validation data. But what if there is no validation set?

One common approach is to partition the training set into a pseudo-training and a pseudo-validation set, say a 90/10 split. You train on pseudo-training and optimize (hyper-optimize) your hyperparameters (including the number of training epochs, regularization strength, etc.) to maximize some objective on pseudo-validation. Then you take these hyperparameters and use them blindly, training over the entire original training set (including pseudo-validation) with this choice of hyperparameters.

A colleague of mine objects that this blind use of hyperparameters on the full training set is inappropriate. However, I remember Leon Bottou recommending this technique to me. Is the blind retraining appropriate or not? Can I do better than this? For example, could I choose the hyperparameters over k folds and then take the mean (or median) over them?
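To make the procedure concrete, here is a rough sketch of what I mean; the classifier, the parameter grid, and scikit-learn itself are just placeholders for illustration, not part of the question.

```python
# Minimal sketch of the 90/10 pseudo-split workflow described above.
# The SVM and the small parameter grid are placeholders; any model and
# hyperparameter search would do.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, random_state=0)

# 1. Carve a pseudo-validation set out of the training data (90/10 split).
X_ptrain, X_pval, y_ptrain, y_pval = train_test_split(
    X, y, test_size=0.1, random_state=0)

# 2. Hyper-optimize on pseudo-training / pseudo-validation.
best_score, best_params = -np.inf, None
for C in (0.1, 1.0, 10.0):
    for gamma in (0.01, 0.1, 1.0):
        model = SVC(C=C, gamma=gamma).fit(X_ptrain, y_ptrain)
        score = model.score(X_pval, y_pval)
        if score > best_score:
            best_score, best_params = score, {"C": C, "gamma": gamma}

# 3. "Blindly" retrain on the entire original training set
#    with the chosen hyperparameters.
final_model = SVC(**best_params).fit(X, y)
```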
|
The standard way to deal with this is more or less what you have described. In practice you would probably perform multiple runs, as in k-fold cross-validation: split your data set into k parts, then repeatedly train on k-1 of them under a candidate hyperparameter setting and evaluate on the remaining part. Finally, you choose the best setting based on the mean/median/minimum score it achieved across the folds.

There are many variants of this. For example, you could perform nested cross-validation: split into k parts, then run another k-fold cross-validation on the pseudo-training portion of each outer fold. That way you can still evaluate your performance on the part of the data you set aside at the very beginning; I think this is also the standard way to evaluate a method on a data set.

As for whether or not it is OK to use the found parameters to retrain on the whole data set, the answer depends on a number of factors.
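As a sketch of the nested scheme described above (scikit-learn and the SVM grid are assumptions for illustration): the inner cross-validation selects the hyperparameters, while the outer one evaluates the whole selection procedure on held-out folds.

```python
# Rough sketch of nested cross-validation: GridSearchCV performs the inner
# hyperparameter search on each outer pseudo-training part, and
# cross_val_score scores the result on the data set aside by the outer split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=2)

search = GridSearchCV(SVC(), param_grid, cv=inner_cv)
outer_scores = cross_val_score(search, X, y, cv=outer_cv)
print("nested CV estimate of performance:", np.mean(outer_scores))
```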
|
|
For something that works with most machine learning techniques, I'd say k-fold cross-validation is your friend. However, you wouldn't take the mean of the hypers themselves: instead you search over your hypers, and for each setting you compute the k-fold cross-validation score. Then you select the hypers that have the lowest objective value. Depending on your problem, another idea would be to use a Bayesian method and put a prior on your hypers, then let Bayes' rule figure out a posterior over them.

Edit: just to make the above a bit clearer: when you train and test on each of your k folds, you indeed get k error scores. The average of these k scores is generally called the cross-validation estimate of the prediction error (see "Elements of Statistical Learning", 3e, p. 242). As one of the commenters suggested, in an outer loop you compute this cross-validation estimate for all your hypers, and you go for the hypers that minimize this quantity. Hope this helps a bit.

I don't understand the details of "instead you search over your hypers, and for each setting, you compute the k-fold cross validation score." Could you describe it in more detail?
(Jul 06 '10 at 12:42)
Joseph Turian ♦♦
Say you have k folds (e.g. k=5) and one hyperparameter combination (setting, e.g. C=1 and gamma=1 for an SVM with an RBF kernel). You train+test your model k times, each time using a different fold as test data and the other folds as training data. Then the average of the k results (I think this is what Jurgen meant by 'k-fold cross validation score') is an estimate of the quality of the given hyperparameter combination.
(Jul 06 '10 at 13:00)
zeno
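For the single setting zeno describes (C=1, gamma=1 for an RBF-kernel SVM), the per-setting cross-validation score could be computed as in the short sketch below; scikit-learn and the synthetic data are assumptions for illustration.

```python
# Sketch of zeno's example: the k-fold CV score for one hyperparameter
# setting (C=1, gamma=1 for an RBF SVM).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
scores = cross_val_score(SVC(C=1.0, gamma=1.0), X, y, cv=5)  # k = 5 folds
print(scores.mean())  # the cross-validation score for this setting
```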
How do I combine the hyperparams from the different folds? Do I average or median the different values, or what?
(Jul 06 '10 at 18:37)
Joseph Turian ♦♦
Joseph, the way to do this is to have the outermost for-loop iterate over your parameter settings and the innermost for-loop iterate over your folds. This way you get classification/regression results on all folds for a single parameter setting. Finally, you simply take the best-performing set of parameters.
(Jul 07 '10 at 03:46)
Michel Valstar
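Spelled out as code, the loop structure Michel describes might look like the following sketch (the model, the parameter grid, and scikit-learn are assumptions, not prescribed by the thread).

```python
# Outer loop over parameter settings, inner loop over folds;
# pick the setting with the best mean score across folds.
from itertools import product

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

results = {}
for C, gamma in product([0.1, 1.0, 10.0], [0.01, 0.1, 1.0]):  # outer: parameters
    fold_scores = []
    for train_idx, test_idx in folds:                          # inner: folds
        model = SVC(C=C, gamma=gamma).fit(X[train_idx], y[train_idx])
        fold_scores.append(model.score(X[test_idx], y[test_idx]))
    results[(C, gamma)] = np.mean(fold_scores)

best_setting = max(results, key=results.get)  # best-performing parameters
print(best_setting)
```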
|
|
A real meta-optimizing Bayesian would set up a hyperprior and infer the distribution of the hyperparameter. The question then becomes: what are good hyper-hyperparameters? We've tackled this in "A weakly informative default prior distribution for logistic and other regression models". The basic idea is that you infer hyperparameters from a corpus, not from the given dataset; if you're actually inferring them from the dataset, they aren't really "hyper". A hyperprior is what a statistician knows from having seen other datasets and is able to apply to a new dataset that has just come up.

In summary: Bayesians average over a number of choices for a hyperparameter, weighting them by a hyperprior (times prior times likelihood). Hyperpriors are inferred from a corpus.
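As a toy illustration of that averaging (not the method of the cited paper): for a ridge-style linear model one can put a hyperprior on the prior precision alpha, weight each candidate value by hyperprior times marginal likelihood (the prior and likelihood over the weights integrated out in closed form), and average the resulting predictions. The Gaussian model, the grid of values, and the Gamma hyperprior below are all assumptions made for illustration.

```python
import numpy as np
from scipy.stats import gamma as gamma_dist, multivariate_normal

rng = np.random.default_rng(0)
n, d, sigma2 = 50, 3, 0.25
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=np.sqrt(sigma2), size=n)
x_new = rng.normal(size=d)

alphas = np.logspace(-2, 2, 20)                        # candidate hyperparameter values
hyperprior = gamma_dist(a=2.0, scale=1.0).pdf(alphas)  # assumed hyperprior over alpha
                                                       # (standing in for corpus knowledge)
log_weights, preds = [], []
for a, hp in zip(alphas, hyperprior):
    # Marginal likelihood of y under w ~ N(0, I/a) and noise N(0, sigma2*I).
    cov = sigma2 * np.eye(n) + (1.0 / a) * X @ X.T
    log_ml = multivariate_normal(mean=np.zeros(n), cov=cov).logpdf(y)
    log_weights.append(np.log(hp) + log_ml)
    # Posterior-mean prediction at x_new for this value of alpha.
    m = np.linalg.solve(X.T @ X + sigma2 * a * np.eye(d), X.T @ y)
    preds.append(x_new @ m)

log_weights = np.array(log_weights)
weights = np.exp(log_weights - log_weights.max())
weights /= weights.sum()
averaged_prediction = np.dot(weights, preds)           # hyperprior-weighted average
print(averaged_prediction)
```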
Good answer, although the original question perhaps used "hyperparameter" in a loose way, meaning things such as learning rates, number of epochs, etc.
(Jul 08 '10 at 20:26)
Alexandre Passos ♦
I'd consider learning rates or the number of epochs to be (non-distributional) hyperparameters - the main challenge is to come up with a good hyperprior that carries the empirical finding, from many experiments, that too many epochs can actually be detrimental to the quality of the end result.
(Jul 08 '10 at 22:49)
Aleks Jakulin
|
|
K-fold cross-validation works, but if you want to stay in the realm of probability, here's a Bayesian approach. If you have some idea of how to put a reasonable prior on your hyperparameters, you can infer them (as Jurgen suggested) using Bayes' theorem. One simple way to do this is Gibbs sampling, where you sample from the conditional distribution of the hyperparameters given the other variables in your model. If that conditional is hard to sample from, you might want to use Metropolis-Hastings to infer the hyperparameter values.
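A toy Gibbs-sampling sketch of this idea, under a model assumed purely for illustration: observations y_i ~ N(mu, 1), a prior mu ~ N(0, 1/alpha), and a Gamma hyperprior on the precision alpha, which plays the role of the hyperparameter. Both conditionals happen to be tractable here; if the alpha conditional were not, that step could be swapped for a Metropolis-Hastings update.

```python
# Toy Gibbs sampler for a hyperparameter (the prior precision alpha):
#   y_i ~ N(mu, 1),  mu ~ N(0, 1/alpha),  alpha ~ Gamma(a0, rate=b0).
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=30)    # synthetic data
n, ybar = len(y), y.mean()
a0, b0 = 1.0, 1.0                              # assumed hyper-hyperparameters

mu, alpha = 0.0, 1.0
samples = []
for it in range(5000):
    # mu | alpha, y  ~  N(n*ybar / (alpha + n), 1 / (alpha + n))
    prec = alpha + n
    mu = rng.normal(n * ybar / prec, np.sqrt(1.0 / prec))
    # alpha | mu  ~  Gamma(a0 + 1/2, rate = b0 + mu^2 / 2)
    alpha = rng.gamma(shape=a0 + 0.5, scale=1.0 / (b0 + 0.5 * mu**2))
    samples.append((mu, alpha))

mu_draws, alpha_draws = np.array(samples[1000:]).T   # drop burn-in
print(mu_draws.mean(), alpha_draws.mean())
```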
|
Especially if your data set is small, you may want to consider using the bootstrap: generate multiple data sets by sampling with replacement from your original data set, and search for the hyperparameters that perform best across these samples.
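A minimal sketch of such a bootstrap search, with scikit-learn as an assumed toolkit and each setting scored on the out-of-bag points of each resample (the out-of-bag scoring is my assumption; the answer only says "perform best across these samples").

```python
# Resample the data with replacement, fit each candidate setting on the
# in-bag points, score it on the out-of-bag points, and keep the setting
# with the best average score.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, random_state=0)
settings = [{"C": c, "gamma": g} for c in (0.1, 1.0, 10.0) for g in (0.01, 0.1, 1.0)]

n_boot = 20
scores = {i: [] for i in range(len(settings))}
for _ in range(n_boot):
    in_bag = rng.integers(0, len(X), size=len(X))          # sample with replacement
    out_of_bag = np.setdiff1d(np.arange(len(X)), in_bag)   # points never drawn
    if len(out_of_bag) == 0:
        continue
    for i, params in enumerate(settings):
        model = SVC(**params).fit(X[in_bag], y[in_bag])
        scores[i].append(model.score(X[out_of_bag], y[out_of_bag]))

best = max(scores, key=lambda i: np.mean(scores[i]))
print("best setting:", settings[best])
```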
Shouldn't the mean over k folds be better than a single 90/10 split, since you're making many such splits and averaging over them? Although I suppose a 90/10 split isn't too risky in itself, since hyperparameters only change substantially when the scale of the problem changes (at least as far as my experience goes). I'm not posting this as an answer since I'm not adding anything concrete.
Do I want the mean over the k-folds? The median?
Are they ever different enough that this makes a difference (honest question here)?