|
In general, we want to select hyperparameter values to maximize some objective on held-out validation data. If I have just one hyperparameter, such as learning rate, I can optimize (hyper-optimize) the optimal value using a line search. But what if I have to tune several hyperparameters? If I have a handful of hyperparameter values that need to be chosen, how do I choose them? |
|
Or you can be fully Bayesian, and put priors on the hyperparameters and sample their values. With parameters that range over a finite interval you can use a flat prior, while with parameters that range over a half-closed interval you can use a "vague Gamma" distribution (MacKay's book has a nice description). In our grammar learning experiments with adaptor grammars we got a nice accuracy boost when we did this instead of a parametric sweep (http://aclweb.org/anthology-new/N/N09/N09-1036.pdf). One unexpected benefit is that inference actually got faster when we sampled the hyperparameters than when the hyperparameters were fixed. I think this is because at the start of inference when the model is lousy, the hyperparameters prefer innovation (i.e., generate from the base distribution), but as learning proceeds and the model gets better, the hyperparameters prefer reusing the cache. This is actually true for a model I'm doing now as well. Resampling the hyperparameters seems a lot better (and faster) than just a single value. Also, the [slice sampling paper][1] was useful in implementing a slice sampler for my model [1] ftp://ftp.white.toronto.edu/pub/radford/slc-samp.ps.Z
(Jun 30 '10 at 10:26)
Alexandre Passos ♦
Great point Mark. How/does this work in a variational setting? Best, Aria
(Jun 30 '10 at 15:58)
aria42
Yes, slice sampling is great -- it's now my work-horse 1-d sampling algorithm for situations where I don't have special knowledge of the distribution. For the paper above I tried lots of Metropolis-Hastings proposal distributions, but I couldn't find one that has a good acceptance rate (I think the text-books say a 66% acceptance rate is optimal for 1-d MH samplers, but I would have been happy with anything between 10% and 90%), but slice sampling works like a charm. Mark
(Jun 30 '10 at 18:18)
Mark Johnson
Hi Aria, I have much less experience with VB. Without conjugacy you'd have to either numerically integrate or make a MAP approximation and optimise, I think. I haven't tried either, so I don't really know. Comparing several strategies would probably make a nice ACL paper (smile) Mark
(Jun 30 '10 at 18:30)
Mark Johnson
On the hyperparameter setting, it is explored for LDA in http://www.gatsby.ucl.ac.uk/~ywteh/research/inference/AsuWelSmy2009a.pdf . Their conclusion is that different algorithms require really different settings of hyperparameters to perform similarly, and VB is hard to get right.
(Jul 05 '10 at 16:30)
Alexandre Passos ♦
|
|
Sample values from your hyper-parameters and keep the best as you go. No need to discretize the values (e.g. learning rates), and you get an answer at any time you want to stop, or even if some of the jobs have died before completing. I believe James Bergstra has some results that indicate that this technique is as good as grid search, and maybe even better. What are the pros and cons vs grid search?
(Jun 17 '10 at 15:04)
Joseph Turian ♦♦
I guess if your hyperparameter distribution is unimodal, this should find the mode with less sampling than grid search (since you're never sampling past a bad decision). On the other hand, if it's multimodal you might not find the best value.
(Jul 05 '10 at 16:36)
Alexandre Passos ♦
1
I have tried a bunch of grid search like techniques for this, mainly drawing from derivative free optimization methods (especially direct search methods, see, e.g., http://www.mat.uc.pt/~lnv/talks/siopt-2008.pdf) and found that including randomization was more important than the search method. If you have the compute cycles, I have found that random search (or a random grid hybrid) is usually hard to beat for many benchmarks. If you have less cycles, then something like Nelder-Mead might be worth looking at (e.g., https://www.aaai.org/Papers/FLAIRS/2005/Flairs05-071.pdf).
(Jul 05 '10 at 17:24)
Ryan McDonald
|
|
I have used recorded-step evolutionary techniques with good effect. This provides a good combination of random search with some conjugate gradient properties. http://arxiv.org/abs/0803.3838 My medium term plan is to use something like this in Mahout to optimize meta-parameters on the fly with a gradient descent learner. Since stochastic gradient descent learners are so often bottlenecked by reading and parsing data, it seems a good idea to run a few dozen of the learners on the same input data and periodically adjust meta-parameters. That has the secondary benefit that many hyper-parameters related to annealing schedules should just go away. I should also be able to do cross-validation on the fly. The full-on Bayesian approach works very well in clustering. Even if the only parameter is number of clusters, you get a large speedup because you are effectively testing many settings at the same time. |