
I've been trying to maximize a family of complicated goal functions of ~50 variables. The function is somewhat expensive to calculate as it involves eigenvalue decomposition; 10^4 evaluations can be done in a second, and the complete maximization must be done on the order of seconds. The gradient is not available. The solution is known to lie in a hypercube centered on the origin.

  • Brute force is out of the question, as there are 20^50 different combinations.
  • Nelder-Mead's simplex method gets stuck in the numerous local optima.
  • Genetic and particle-swarm algorithms require too many evaluations to be feasible.
  • BFGS and other gradient-based methods won't work since the gradient is unknown, and approximating it in 50 dimensions would take too many evaluations.

The most promising methods I've looked at are the Efficient Global Optimization (EGO) family of algorithms, which interpolate the goal function from a few samples and solve another, cheaper optimization problem in each iteration in order to decide where to sample next. I've only read the original paper by Jones et al., but these methods seem to focus on much more expensive goal functions than mine -- the subproblem that must be solved in each iteration seems to take a second.

Here is one instantiation of the goal function in two dimensions:

[figure: a 2-D instance of the goal function]

The 50-dimensional version retains many of these properties (although I can't visualize it): there are hyperplanes with higher function values that intersect close to the maximum, there are a lot of local optima, and the function is "smooth" in the intuitive sense.

Have I overlooked some family of algorithms? What algorithm would fit such a problem?

asked Jun 15 '12 at 11:26

Anna

edited Jun 15 '12 at 11:32

Have you looked at automatic differentiation?

(Jul 22 '12 at 09:31) marshallp

6 Answers:

One thing you haven't mentioned is how good the computed optima must be. If the solution doesn't need to be extremely precise, you could give Simulated Annealing a try. You've mentioned that:

  1. The functions have many local optima.
  2. Gradients are out of the question.
  3. You have some clue about the search space.
  4. Time is of the essence.

Based on (1)-(3), a search method like SA could work: it requires no gradients, it draws from a generation distribution you can choose to match your search space (i.e. your hypercube), and, as time -> infinity, you get some guarantees about convergence to an optimal value.

However, SA is not known to be very fast, so (4) could be a problem. This really depends on (a) how you schedule the annealing process, and (b) how good your solutions need to be. If you can use some heuristics to restrict your search space, and tune the annealing schedule in an attempt to trade off time for accuracy, then you might be able to make it work. Perhaps take a look at Adaptive Simulated Annealing. It's hard to say more without seeing the function(s).
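For concreteness, here's a minimal sketch of such an SA loop. The linear cooling schedule, single-coordinate Gaussian proposal, and step size are all placeholder choices you would want to tune against the real problem:

```python
import math
import random

def simulated_annealing(f, x0, bounds, n_iter=20000, t0=1.0, seed=0):
    """Maximize f over a hypercube; bounds = (lo, hi) applies to every
    coordinate. Cooling schedule and proposal are illustrative only."""
    rng = random.Random(seed)
    lo, hi = bounds
    x = list(x0)
    cur = best = f(x)
    best_x = list(x)
    for k in range(n_iter):
        t = t0 * (1.0 - k / n_iter)          # linear cooling schedule
        i = rng.randrange(len(x))            # perturb one random coordinate
        cand = list(x)
        cand[i] = min(hi, max(lo, cand[i] + rng.gauss(0.0, 0.1 * (hi - lo))))
        val = f(cand)
        # always accept uphill moves; accept downhill with Boltzmann probability
        if val >= cur or rng.random() < math.exp((val - cur) / max(t, 1e-12)):
            x, cur = cand, val
            if cur > best:
                best, best_x = cur, list(x)
    return best_x, best
```

At 10^4 evaluations per second, a budget of 2*10^4 iterations fits the stated seconds-scale deadline, which is why the default above is sized that way.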

Note: An SA method might work, but given that your particle-swarm optimization was too slow, I wouldn't be too optimistic that SA will be faster. There are numerous hybrid approaches that pair up SA with, for example, Nelder-Mead's simplex algorithm (thus mitigating the local-optima problem). That might be an avenue worth exploring.

Finally, in case you haven't seen it already, Luis Rios (CMU) has a paper that does a quantitative comparison of many derivative free methods.

Anecdotal advice: In situations like yours, I have found that my effort is better spent finding ways to improve gradient approximations and/or reduce dimensionality, as opposed to searching for fancier algorithms that can deal with my extremely hard problem.

answered Jun 16 '12 at 19:18

Kyzyl Herzog

You should use Bayesian optimization. A lot of Frank Hutter's work in this area is relevant to your problem since you have a high budget of trials (you can do many trials per second). See his website: http://www.cs.ubc.ca/~hutter/ Unfortunately, I don't know the exact best paper for you to look at so you might have to read and skim a bunch of them. For example, take a look at http://www.cs.ubc.ca/~hutter/papers/11-LION5-SMAC.pdf and http://www.cs.ubc.ca/~hutter/papers/11-IJCAI-RCRA-HydraMIP.pdf although other papers may be more appropriate (and even his thesis). Basically, he replaces the typical Gaussian Process regression component of Bayesian optimization with random forests so it is more scalable to large numbers of trials.
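A stripped-down sketch of that idea, using scikit-learn's random forest as the surrogate and the spread of per-tree predictions as a cheap stand-in for GP uncertainty. The candidate-pool acquisition step and all budget numbers are illustrative, not SMAC's actual machinery:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_bayes_opt(f, dim, lo, hi, n_init=20, n_iter=30, n_cand=500, seed=0):
    """Bayesian-optimization loop with a random-forest surrogate.
    Picks the next sample by a UCB-style acquisition over a random
    candidate pool drawn from the hypercube."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, size=(n_init, dim))      # initial random design
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        rf = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)
        cand = rng.uniform(lo, hi, size=(n_cand, dim))
        # per-tree predictions give a cheap mean/std for a UCB acquisition
        per_tree = np.stack([t.predict(cand) for t in rf.estimators_])
        ucb = per_tree.mean(axis=0) + per_tree.std(axis=0)
        x_next = cand[np.argmax(ucb)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    i = int(np.argmax(y))
    return X[i], y[i]
```

Refitting the forest every iteration is the main overhead; since your function itself is cheap, batching several new samples per refit would likely pay off.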

answered Jul 24 '12 at 02:46

gdahl ♦

edited Jul 25 '12 at 01:59

If this sounds interesting to you, also have a look at http://probabilistic-optimization.org/Global.html

(Aug 07 '12 at 05:02) osdf

You could try the BOBYQA algorithm. It's described in the paper "The BOBYQA algorithm for bound constrained optimization without derivatives" by M. J. D. Powell.

It's similar to the Efficient Global Optimization you mentioned in that it samples points and uses them to build an interpolated model, except that it builds a simple quadratic model. It then uses this model in a trust-region framework, so its overhead is minimal. It's also been integrated into a number of open-source libraries, such as dlib.
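BOBYQA itself isn't in SciPy (dlib and NLopt ship implementations), but as a quick way to try a bound-constrained, derivative-free method of this general flavour, here's a sketch using SciPy's Powell method as a readily available stand-in:

```python
import numpy as np
from scipy.optimize import minimize

def maximize_powell(f, x0, lo, hi):
    """Maximize f on a hypercube by minimizing its negation with
    SciPy's derivative-free Powell method (a stand-in for BOBYQA)."""
    res = minimize(lambda x: -f(x), x0, method="Powell",
                   bounds=[(lo, hi)] * len(x0))
    return res.x, -res.fun
```

Like Nelder-Mead, Powell is a local method, so on a multimodal function you'd want random restarts from several points in the cube; the per-run cost is low enough that a few dozen restarts still fit a seconds-scale budget.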

answered Jul 22 '12 at 08:45

Davis King

The first thing I'd try in such a setting is SPSA.

http://www.jhuapl.edu/spsa/

Here's the original SPSA paper:

http://www.jhuapl.edu/spsa/PDF-SPSA/Spall_TAC92.pdf

The basic algorithm is described in Section 2.
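SPSA is only a few lines: two function evaluations per step estimate the full 50-dimensional gradient via a random simultaneous perturbation. Here's a textbook sketch; the gain-sequence exponents are Spall's commonly cited values, but the constants a and c definitely need tuning per problem:

```python
import numpy as np

def spsa_maximize(f, x0, lo, hi, n_iter=4000, a=0.1, c=0.1,
                  alpha=0.602, gamma=0.101, seed=0):
    """Basic SPSA ascent, projected onto the hypercube [lo, hi]^n.
    Note the perturbed evaluations may step slightly outside the cube."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(n_iter):
        ak = a / (k + 1) ** alpha            # decaying step size
        ck = c / (k + 1) ** gamma            # decaying perturbation size
        delta = rng.choice([-1.0, 1.0], size=x.shape)  # Rademacher directions
        # two evaluations -> estimate of the whole gradient
        g_hat = (f(x + ck * delta) - f(x - ck * delta)) / (2 * ck) * (1.0 / delta)
        x = np.clip(x + ak * g_hat, lo, hi)  # ascent step, kept in the cube
    return x
```

At two evaluations per iteration, 4000 iterations is 8000 evaluations, comfortably inside the stated 10^4-per-second budget.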

answered Aug 06 '12 at 23:39

Yisong Yue

Good question. Not an answer, just some comments:

1) Have you tried gridding x (snapping all x_i to multiples of, say, 0.01) with Nelder-Mead? Any method can be run with first a coarse, then a finer grid; the devil is in the details.

2) NLopt has a handful of algorithms that can easily be called from C, Python ... (what do you have?) to try them out. They include Powell's BOBYQA, and DIRECT and DIRECT_L, which sound like EGO.

3) Is there a problem like yours online anywhere? That would be useful for people who play with and refine optimizers.
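The coarse-to-fine gridding in (1) can be sketched by wrapping the goal function so every coordinate snaps to the current grid before evaluation. The refinement schedule below is illustrative; note the plateaus created by snapping can stall Nelder-Mead, so restarts from perturbed points may help:

```python
import numpy as np
from scipy.optimize import minimize

def snapped(f, step):
    """Wrap f so all coordinates are rounded to multiples of `step`
    before evaluation -- the search effectively runs on a grid."""
    return lambda x: f(np.round(np.asarray(x) / step) * step)

def coarse_to_fine(f, x0, steps=(0.1, 0.01, 0.0)):
    """Run Nelder-Mead repeatedly, refining the snap grid each pass;
    a step of 0.0 means no snapping on the final pass."""
    x = np.asarray(x0, dtype=float)
    for s in steps:
        g = snapped(f, s) if s > 0 else f
        x = minimize(lambda z: -g(z), x, method="Nelder-Mead").x
    return x
```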

answered Jul 22 '12 at 05:27

denis

edited Aug 01 '12 at 11:30

If that slice is representative of your function, then while it suffers from many flat areas and many local maxima, it seems to have only one area that is much higher than the others. That is a very good property for many global optimization algorithms, for example the one in the next paragraph.

Look at "Stochastic Simultaneous Optimistic Optimization" by Valko et al. Note these guys do theory. Being able to prove something about the performance of an algorithm across all problems may require more conservative behaviour than getting a good result on your particular problem, so you might want to take more risks than they do.

Another approach: don't give up on gradients too easily. Stochastic Gradient Descent (SGD) does not require a gradient, only an unbiased random estimate of one. You can get that pretty cheaply: choose a random direction, compute just two function values to estimate the directional derivative in that direction, normalize correctly, and Bob's your uncle. See "Optimal Algorithms for Online Convex Optimization with Multi-Point Bandit Feedback" for some hints. Note that none of the guarantees actually apply to you, since they are for convex problems, but SGD sometimes does just fine on non-convex problems too, so it's worth a try.
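A sketch of that two-point estimator plugged into a projected gradient-ascent loop; the step-size schedule and smoothing radius are illustrative, not from the paper:

```python
import numpy as np

def two_point_grad(f, x, mu, rng):
    """Random gradient estimate from two evaluations: pick a random unit
    direction u, estimate the directional derivative, and scale by the
    dimension (the standard two-point bandit estimator)."""
    u = rng.normal(size=x.shape)
    u /= np.linalg.norm(u)
    d = (f(x + mu * u) - f(x - mu * u)) / (2 * mu)
    return len(x) * d * u

def sgd_ascent(f, x0, lo, hi, n_iter=5000, lr=0.05, mu=1e-3, seed=0):
    """Gradient ascent with the two-point estimate, projected onto the
    hypercube [lo, hi]^n; learning rate decays over time."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for k in range(n_iter):
        g = two_point_grad(f, x, mu, rng)
        x = np.clip(x + lr / (1 + 0.01 * k) * g, lo, hi)
    return x
```

The variance of the estimate grows with the dimension, so in 50 dimensions expect noisy steps; averaging a few estimates per step trades evaluations for stability.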

answered Sep 17 '13 at 14:08

Daniel Vainsencher


powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.