I'm trying to learn by minimizing an empirical loss function on a training dataset. This minimization takes considerable time. The optimization procedure has several parameters, such as learning rates for different variables, regularization coefficients, ...

What is a commonly used strategy for choosing all these params?

The first strategy I am thinking about is

  • randomly choose small (for computational efficiency) training and testing subsets from my learning set
  • for every parameter combination in a grid (e.g. all pairs (learning rate, regularization coeff) from the Cartesian product [0.1, 0.2, 0.3, 0.5] × [0.01, 0.025, 0.05, 0.075]):
      • learn on the small training set
      • compute the loss on the small testing set
  • choose the best params
  • use them to learn from the entire learning set.
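The steps above can be sketched roughly as follows. This is only an illustration: the toy one-weight regression model and the helper names (`make_data`, `train`, `loss`, `grid_search`) are assumptions standing in for the real learner and learning set, which the question does not specify.

```python
import itertools
import random


def make_data(n, seed=0):
    """Hypothetical toy data, y = 3x + noise, standing in for the learning set."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))
    return data


def train(data, lr, reg, epochs=20, w=0.0):
    """Batch gradient descent on mean squared error plus the penalty reg * w**2."""
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * (grad + 2.0 * reg * w)
    return w


def loss(data, w):
    """Unregularized mean squared error, used only for evaluation."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def grid_search(learning_set, lrs, regs, n_train, n_test, seed=0):
    # Draw disjoint small training and testing subsets at random.
    rng = random.Random(seed)
    subset = rng.sample(learning_set, n_train + n_test)
    small_train, small_test = subset[:n_train], subset[n_train:]

    # Train once per grid point and record the loss on the small testing set.
    results = {}
    for lr, reg in itertools.product(lrs, regs):
        w = train(small_train, lr, reg)
        results[(lr, reg)] = loss(small_test, w)

    best = min(results, key=results.get)
    return best, results


learning_set = make_data(2000)
lrs = [0.1, 0.2, 0.3, 0.5]
regs = [0.01, 0.025, 0.05, 0.075]
best, results = grid_search(learning_set, lrs, regs, n_train=100, n_test=100)

# Retrain on the entire learning set with the selected pair.
best_lr, best_reg = best
final_w = train(learning_set, best_lr, best_reg)
print("best (lr, reg):", best, "final weight:", final_w)
```

The cost is one training run per grid point, so it grows with the product of the grid sizes; that is the reason for running the search on small subsets.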

The second idea is to set all params to some default values and then optimize them one at a time.
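A minimal sketch of this one-at-a-time idea, assuming some function that maps a parameter setting to a held-out loss. The names `coordinate_search` and `toy_validation_loss` are hypothetical, and the quadratic objective is only a placeholder for "train, then evaluate".

```python
def coordinate_search(objective, grids, defaults, sweeps=2):
    """Optimize one parameter at a time, holding the others fixed."""
    params = dict(defaults)
    best = objective(params)
    for _ in range(sweeps):
        for name, candidates in grids.items():
            for value in candidates:
                trial = dict(params)
                trial[name] = value
                score = objective(trial)
                # Keep a candidate only if it strictly improves the score.
                if score < best:
                    best, params = score, trial
    return params, best


def toy_validation_loss(p):
    # Hypothetical stand-in for training a model and measuring held-out loss.
    a = p["lr"] - 0.3
    b = p["reg"] - 0.025
    return a * a + b * b + 0.5 * a * b


grids = {
    "lr": [0.1, 0.2, 0.3, 0.5],
    "reg": [0.01, 0.025, 0.05, 0.075],
}
defaults = {"lr": 0.1, "reg": 0.01}
params, best = coordinate_search(toy_validation_loss, grids, defaults)
print(params, best)
```

Each sweep costs the sum of the grid sizes rather than their product, but the search can stall at a poor setting when the parameters interact strongly, which is why more than one sweep is allowed here.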

Is there a better way?

asked Jul 11 '10 at 16:27

bijey

closed Jul 11 '10 at 16:44

Joseph Turian ♦♦

The question has been closed for the following reason "Duplicate Question of: http://metaoptimize.com/qa/questions/551/how-do-i-choose-hyperparameters-if-i-only-have-a-training-set-and-no-validation-set-and-get-the-best-generalization and http://metaoptimize.com/qa/questions/32/how-do-i-optimize-hyperparameter-values But if you have a more specific question, or need one of those answers elaborated upon, please ask!" by Joseph Turian Jul 11 '10 at 16:44


One Answer:

You might want to look at this question and this question; although most of the answers focus on hyperparameters for Bayesian models, some of the methods are useful for your sort of problem.

Generally, what you suggest seems good, although you should worry about the relative size of the smaller set. The smaller the training set is (compared to the number of parameters in your model), the more regularization you need to perform well, so tuning on a small subset might bias you towards excessive regularization (which is less harmful than too little regularization, but still means you're throwing performance away). Since the best regularization shouldn't depend much on the learning rate, you could test many values for the learning rate on the small set and then use the full training set to set the regularization coefficient (using a few good values of the learning rate), probably following the strategy described by Bengio in this answer.

A good thing to keep in mind is that you don't make your life worse by changing the regularization coefficient of an already trained model (of course, you need to keep training it to reach the new optimum). Since the curve of performance versus regularization coefficient seems to be well behaved, you can use a grid search or some sort of line minimization algorithm to find the best value. I suggest, however, that you measure performance on a separate validation set to do this, instead of measuring the actual loss on your training set, to avoid overfitting.
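A rough sketch of searching over the regularization coefficient without resetting the model, scoring each value on a separate validation set. The toy one-weight regression and the names `make_data`, `train`, `loss` and `regularization_path` are assumptions for illustration, and sweeping from the strongest penalty to the weakest is a choice made here, not something the answer prescribes.

```python
import random


def make_data(n, seed):
    """Hypothetical toy data, y = 3x + noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 3.0 * x + rng.gauss(0.0, 0.1)))
    return data


def train(data, lr, reg, epochs, w):
    """Batch gradient descent on mean squared error plus reg * w**2, starting from w."""
    for _ in range(epochs):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * (grad + 2.0 * reg * w)
    return w


def loss(data, w):
    """Unregularized mean squared error."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def regularization_path(train_set, valid_set, lr, regs, epochs=30):
    w = 0.0
    path = []
    for reg in sorted(regs, reverse=True):
        # Warm start: continue from the previous weight instead of resetting.
        w = train(train_set, lr, reg, epochs, w)
        # Score on held-out data, not on the regularized training objective.
        path.append((reg, w, loss(valid_set, w)))
    best = min(path, key=lambda entry: entry[2])
    return best, path


regs = [0.01, 0.025, 0.05, 0.075]
train_set = make_data(500, seed=1)
valid_set = make_data(200, seed=2)
best, path = regularization_path(train_set, valid_set, lr=0.5, regs=regs)
best_reg, best_w, best_loss = best
print("best reg:", best_reg, "validation loss:", best_loss)
```

A line minimization routine could replace the fixed list of candidates, since only the ordering of validation losses matters for the choice.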

So, tl;dr: use a validation set to measure performance, instead of looking at your surrogate loss directly; optimize the learning rate on the smaller version of the training set if it saves you time; and search for the best value of the regularization coefficient using as much training data as you can, without necessarily resetting the model before trying another value. Always keep track of the performance on the validation set as well.

answered Jul 11 '10 at 16:53

Alexandre Passos ♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.