# What are the state of the art optimization algorithms for problems with numerous local minima?

### Question (antolikjan, asked May 21 '10; community wiki)

I'm trying to fit a model whose fitness function turns out to have many local minima, and simple gradient methods do not find any good solutions (I know from simpler models fit to the same problem that good solutions must exist). From my reading, my only options seem to be stochastic methods such as evolutionary or Monte Carlo optimization. My question: what are the state-of-the-art optimization algorithms known to handle many local optima well? Ideally, I would like to combine stochastic search with gradient information, so that the first- and second-order gradients I have for the model are exploited.

*Comment (Joseph Turian):* Could you tell us more about the nature of the data?

*Comment (antolikjan):* The data consists of 1800 pairs of images (30x30 pixels) and the corresponding responses of a neuron to those images. I have a model of the neuron (inspired by the known anatomy of LGN and visual cortex) whose response is a sum of dot products between a number (<20) of difference-of-Gaussians-like linear filters and the input images, passed through a sigmoid non-linearity. The free parameters are the weights and positions (centers) of the difference-of-Gaussians within the 30x30-pixel input space, and the parameters of the sigmoid.

### Answer by aria42 (Jun 29 '10, 18 votes)

Most of the non-convex functions I've optimized have been done with L-BFGS with one crucial change: any time the optimizer thinks it has converged, dump the history cache, forcing it to flush the current approximation of the inverse Hessian and take a plain gradient step. Most of the Berkeley NLP papers since 2006 that do non-convex L-BFGS optimization have used this trick, and I believe they found it quite important.

*Comment (Joseph Turian):* Great trick. Is this technique published in any of those papers?

*Comment (Mark Johnson):* What about making some source code that does this?

*Comment (aria42):* Sure. I incorporated this into my (future) UMass shared group code. The project is at http://github.com/aria42/umass-nlp. The specific L-BFGS code, initially authored by Dan Klein but modified substantially by myself and many others, is at http://github.com/aria42/umass-nlp/blob/master/src/main/java/edu/umass/nlp/optimize/LBFGSMinimizer.java; search for "Dumping Cache".

*Comment (Aman):* Can someone point me to a proof that the formula for CRF conditional probability is a convex function? I am trying to train a model using L-BFGS which seems to get stuck at various points.

*Comment (Joseph Turian):* Aman, that sounds like a good question to post at the top level.

*Comment (Łukasz Lew):* This gets upvoted so much. Is it really an unwritten standard?

*Comment (Tristan):* Yeah, why is this upvoted? It seems very ad hoc. How does it compare to stochastic optimization methods?

*Comment (Joseph Turian):* I would be interested to hear if this trick is more widely known. This answer is getting upvoted so much because: (a) according to Aria, this technique is widely used in successful NLP applications but not documented in most (all?) of the relevant publications; (b) I've been linking to this answer when promoting the site, as an example of how this forum lets you document previously undocumented tips and tricks. I would definitely be interested in a controlled comparison between this technique and less ad hoc ones.

*Comment (Frank):* I think it's fine that it's not mentioned in the papers that use it. It's a little detail, just like many other things typically not mentioned, such as exactly which line-search algorithm is used within L-BFGS, or how many previous iterations it uses to compute its Hessian approximation. Frankly, I can't imagine it makes a big difference in accuracy on a task.

*Comment (Joseph Turian):* @Frank: You might be wrong. If this detail is unimportant, then why do they keep using it? Maybe out of habit, or maybe it is important. Part of the value of this site is discussing details which may or may not be important (but are otherwise not discussed), and then examining these details in our research.

*Comment:* I may be misunderstanding, but generally in a line-search algorithm such as L-BFGS, the algorithm iterates until ||grad f||
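To make the trick concrete, here is a minimal pure-Python sketch in the asker's language (Python): an L-BFGS loop (two-loop recursion plus a backtracking line search) that, on apparent convergence, flushes the curvature history and forces one plain gradient step before stopping. All names, constants, and the toy line search are my own illustration, not aria42's Java code; with an empty history the two-loop recursion reduces to steepest descent, which is exactly what the flush buys you.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def lbfgs_with_restarts(f, grad, x0, m=5, tol=1e-6, max_iter=500):
    """L-BFGS with the 'dump the history cache' trick: on apparent
    convergence, clear the curvature pairs and take a fresh gradient
    step; stop only if that step makes no further progress."""
    x = list(x0)
    g = grad(x)
    S, Y = [], []                      # stored curvature pairs (s_k, y_k)
    restarted = False
    for _ in range(max_iter):
        gnorm = math.sqrt(dot(g, g))
        if gnorm < tol:
            if restarted:
                break                  # converged even after a flush
            S, Y = [], []              # flush inverse-Hessian approximation
            restarted = True
        else:
            restarted = False
        # Two-loop recursion; with empty history this is steepest descent.
        q = list(g)
        alphas = []
        for s, y in zip(reversed(S), reversed(Y)):
            rho = 1.0 / dot(y, s)
            a = rho * dot(s, q)
            alphas.append((a, rho))
            q = [qi - a * yi for qi, yi in zip(q, y)]
        if S:                          # initial Hessian scaling
            gamma = dot(S[-1], Y[-1]) / dot(Y[-1], Y[-1])
            q = [gamma * qi for qi in q]
        for (a, rho), s, y in zip(reversed(alphas), S, Y):
            b = rho * dot(y, q)
            q = [qi + (a - b) * si for qi, si in zip(q, s)]
        d = [-qi for qi in q]
        # Backtracking (Armijo) line search.
        t, fx, gd = 1.0, f(x), dot(g, d)
        while f([xi + t * di for xi, di in zip(x, d)]) > fx + 1e-4 * t * gd:
            t *= 0.5
            if t < 1e-12:
                break
        x_new = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [u - v for u, v in zip(x_new, x)]
        y = [u - v for u, v in zip(g_new, g)]
        if dot(s, y) > 1e-10:          # keep pair only if curvature is positive
            S.append(s)
            Y.append(y)
            if len(S) > m:
                S.pop(0)
                Y.pop(0)
        x, g = x_new, g_new
    return x
```

In production one would of course use a mature L-BFGS implementation (e.g. SciPy's) and only wrap it with the flush-and-restart loop; the sketch inlines everything to show where the history lives.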
### Answer by Tristan (Jun 30 '10, 5 votes)

I have had good success with genetic optimization using derivatives. It is stochastic but also uses derivative information to speed convergence. There is an R package that is easy to use; I'm not sure how well it scales to problems with huge numbers of parameters, though.

*Comment (antolikjan):* This looks like exactly the sort of algorithm I'm looking for. Do you by any chance know of an implementation for Python, or a Python library that would be most suitable for implementing it? I was only able to find an implementation for R. Thanks.

*Comment (Tristan):* You could access it through rpy, or try to link to the C source directly.
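The general idea ("memetic" or Lamarckian hybrid optimization) is simple enough to sketch without the R package: run an ordinary evolutionary loop, but locally refine each offspring with a few gradient steps before selection. Everything below (population size, learning rate, (mu + lambda) selection) is my own illustrative choice, not the R package's algorithm.

```python
import random

def memetic_minimize(f, grad, dim, bounds=(-5.0, 5.0), pop_size=20,
                     generations=40, sigma=0.5, grad_steps=5, lr=0.01):
    """Toy genetic + gradient hybrid: Gaussian mutation, then each
    offspring is polished with a few gradient-descent steps so that
    derivative information speeds convergence (Lamarckian style)."""
    lo, hi = bounds
    pop = [[random.uniform(lo, hi) for _ in range(dim)]
           for _ in range(pop_size)]
    for _ in range(generations):
        offspring = []
        for parent in pop:
            child = [xi + random.gauss(0.0, sigma) for xi in parent]
            for _ in range(grad_steps):        # local gradient refinement
                g = grad(child)
                child = [xi - lr * gi for xi, gi in zip(child, g)]
            offspring.append(child)
        # (mu + lambda) selection: keep the best pop_size individuals.
        pop = sorted(pop + offspring, key=f)[:pop_size]
    return pop[0]
```

The mutation provides the global, basin-hopping behavior; the inner gradient steps provide the fast local convergence that pure evolutionary search lacks.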
### Answer by Delip Rao (Jun 30 '10, 0 votes)

If you are looking for non-gradient-based methods for 1-D functions, try Brent's optimization method. The method repeatedly fits a parabola through three points on the curve, evaluates the function at the minimum of that parabola, picks the next set of three points, and so on. Its advantage is that it requires no knowledge of the function being optimized (no gradients to compute) and can be implemented in a few lines.

*Comment (Janto):* Has something similar to Brent's method been applied to multivariate functions?

*Comment (Delip Rao):* Not as far as I know.
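The parabola-fitting step Delip describes can be sketched in a few lines. Note this is only the successive-parabolic-interpolation core; full Brent's method additionally falls back to golden-section steps for robustness, which this toy version omits. The bracketing points and the keep-best-three rule here are my own simplification.

```python
def parabolic_minimize(f, a, b, c, tol=1e-8, max_iter=100):
    """Successive parabolic interpolation (the core of Brent's method,
    minus the golden-section safeguard). a < b < c should bracket a
    minimum, with f(b) below f(a) and f(c)."""
    pts = sorted([a, b, c])
    for _ in range(max_iter):
        x0, x1, x2 = pts
        f0, f1, f2 = f(x0), f(x1), f(x2)
        # Vertex of the parabola through (x0,f0), (x1,f1), (x2,f2).
        num = (x1 - x0) ** 2 * (f1 - f2) - (x1 - x2) ** 2 * (f1 - f0)
        den = (x1 - x0) * (f1 - f2) - (x1 - x2) * (f1 - f0)
        if den == 0:
            break
        x_new = x1 - 0.5 * num / den
        if abs(x_new - x1) < tol:
            return x_new
        # Keep the three points with the smallest function values.
        pts = sorted(sorted(pts + [x_new], key=f)[:3])
    return min(pts, key=f)
```

On a quadratic the parabola is exact, so the vertex lands on the minimum in one step; on smooth functions convergence is superlinear.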
### Answer by osdf (Jun 30 '10, 2 votes)

You might want to look at PGPE. While the cited paper presents the main idea in the context of reinforcement learning (particularly policy gradients), PGPE is by no means limited to that field; e.g. an ICANN 2010 paper shows how it is used to break a cryptographic system, much faster than standard evolutionary search. You might also give Stochastic Search a try, though with parameter counts >> 50 you may find that PGPE is faster.

*Comment (antolikjan):* Stochastic Search looks very interesting. Do you by any chance know of an implementation, preferably in Python? Thanks.

*Comment (osdf):* A variant of Stochastic Search is in [PyBrain](http://www.pybrain.org); look for ExactNES. As I just learnt, you should also consider a very recent paper, [XNES](http://www.idsia.ch/~glasmachers/glasmachers2010b.pdf), which is also in PyBrain (look for XNES). By the way, PGPE (and CMA-ES) is also in PyBrain; check out the optimization module.

*Comment (antolikjan):* Wow, that sounds very promising, thank you! This may be getting too specific, but: I briefly tried PyBrain and realized I couldn't use it properly, since the kinds of models I have in mind cannot easily be expressed in its neural-network framework. In the end I settled on Theano, as it lets me change my models flexibly without recomputing all the first- and second-order derivatives. So my question (mainly to the Theano and PyBrain people who roam these forums): is there a way to combine Theano with PyBrain, e.g. to express my models in Theano but exploit the optimization algorithms in PyBrain (such as Stochastic Search)?

*Comment (Joseph Turian):* antolikjan: Maybe post that as a new question, and I will ask the Theano devs to take a look.

*Comment (Justin Bayer, Sep 05 '11):* You can do that. PyBrain's optimization API works on functions, similar to the SciPy API.
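For intuition, here is a minimal PGPE-style loop with symmetric sampling: keep a Gaussian search distribution (mu, sigma) over the parameters, evaluate mirrored samples mu ± eps, and update mu and sigma from the reward difference. This is a heavily simplified sketch; the learning rates, the running baseline, and the sigma floor are my own illustrative choices, not the paper's recommended settings (for real use, see the PGPE implementation in PyBrain).

```python
import random

def pgpe_minimize(f, dim, iters=300, samples=10,
                  alpha_mu=0.05, alpha_sigma=0.02, seed=0):
    """Toy PGPE with symmetric sampling: reward is defined as -loss,
    so following the reward gradient minimizes f."""
    rng = random.Random(seed)
    mu = [0.0] * dim
    sigma = [1.0] * dim
    baseline = None                        # running average reward
    for _ in range(iters):
        for _ in range(samples):
            eps = [rng.gauss(0.0, s) for s in sigma]
            r_plus = -f([m + e for m, e in zip(mu, eps)])
            r_minus = -f([m - e for m, e in zip(mu, eps)])
            r_pair = 0.5 * (r_plus + r_minus)
            baseline = (r_pair if baseline is None
                        else 0.99 * baseline + 0.01 * r_pair)
            for i in range(dim):
                # Mean update: symmetric-sampling gradient estimate.
                mu[i] += alpha_mu * 0.5 * (r_plus - r_minus) * eps[i]
                # Exploration update, floored so sigma stays positive.
                grad_s = (eps[i] ** 2 - sigma[i] ** 2) / sigma[i]
                sigma[i] = max(0.1, sigma[i]
                               + alpha_sigma * (r_pair - baseline) * grad_s)
    return mu
```

The key point for this thread: the update needs only function evaluations, so it tolerates many local minima, while sigma adapts the exploration radius per dimension.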
### Answer by Wesley Tansey (Jun 30 '10, 3 votes)

Before trying anything too complicated, just set up a quick evolutionary algorithm. If that does okay, add self-adaptation for the parameters' sigmas. I've seen plenty of problems with lots of local minima, and this will almost always find the global optimum. If you still have trouble, add speciation and explicit fitness sharing.

*Comment (antolikjan):* Sorry, I'm not particularly familiar with evolutionary algorithms beyond the basics. Could you elaborate on the "self-adaptation for the parameters' sigmas"?

*Comment (Wesley Tansey):* Sure. The standard approach is to use Gaussian mutation for each parameter (C# code):

```csharp
private double gaussianMutation(double mean, double stddev)
{
    // Box-Muller transform: two uniform samples -> one Gaussian sample
    double x1 = Rand.NextDouble();
    double x2 = Rand.NextDouble();
    double y1 = Math.Sqrt(-2.0 * Math.Log(x1)) * Math.Cos(2.0 * Math.PI * x2);
    return y1 * stddev + mean;
}
```

*Comment (Wesley Tansey):* The stddev you're using is usually referred to as sigma. A good heuristic is to set sigma to 1/6 of the variable's range. For example, if you're evolving a variable whose minimum value is 0 and maximum is 6, you would use a sigma of 1 as a rule of thumb. To do self-adaptation, you mutate the sigma as well:

```csharp
if (SelfAdaptive)
    Sigma = Math.Max(0, Sigma * Math.Exp(gaussianMutation(0, 1)));
val = gaussianMutation(val, Sigma);
```

*Comment (antolikjan):* Ah thanks, now it is perfectly clear.
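Since the asker works in Python, here is the same scheme translated: `random.gauss` replaces the hand-rolled Box-Muller transform. One deviation from the C# snippet, flagged here: I use a small positive floor instead of `Math.Max(0, ...)`, since a sigma that hits exactly zero can never mutate again.

```python
import math
import random

def gaussian_mutation(mean, stddev):
    # random.gauss plays the role of the Box-Muller code in the C# snippet.
    return random.gauss(mean, stddev)

def self_adaptive_mutate(val, sigma, floor=1e-6):
    """One self-adaptive mutation step: perturb sigma log-normally,
    then use the new sigma to mutate the value. Returns (new_val,
    new_sigma). The positive floor is my addition (see lead-in)."""
    sigma = max(floor, sigma * math.exp(random.gauss(0.0, 1.0)))
    return gaussian_mutation(val, sigma), sigma
```

In an evolutionary loop each individual carries its own sigma alongside its parameter value, so step sizes that produce good offspring are inherited and thereby adapt automatically.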
### Answer by Thomas Lahore (Jun 30 '10, 1 vote)

I've had good experiences with CMA-ES for neuro-evolution of recurrent neural networks. CMA-ES has two particularly nice features (among others): invariance to order-preserving transformations of the fitness function, and invariance to rigid-body transformations of the search space (including rotation, reflection, and translation). A note of caution: because it relies on a full covariance matrix, it will start to bog down if the dimensionality of the parameter space is significantly greater than about 100.

*Comment (osdf):* XNES (see my post/comment on this page) is similar to CMA-ES; see the abstract of the paper cited in my comment.
### Answer by Rueben (Jul 01 '10, 1 vote)

An alternative that I understand works relatively well when evolving neural networks is Novelty Search ( http://eplex.cs.ucf.edu/papers/lehman_ecj10.pdf ). The short version: rather than using the traditional fitness value to determine the next generation, you choose the genotypes whose behavior is most different from the others (and from an archive of the historically most unique). Although it was designed with neural nets in mind, it should be easily adaptable to other problem spaces, even if the "behavior" of a given genome is simply some measure of distance to the other genomes in the search space. If your goal is to eventually find the global minimum, you could alternate, say, 100 generations of Novelty Search with 100 generations of traditional fitness-based search, or any number of variations on the novelty-for-a-while/fitness-for-a-while scheme.
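The selection criterion Novelty Search substitutes for fitness is a "sparseness" score: the mean distance from an individual's behavior to its k nearest neighbors among the current population and the archive. A minimal sketch, assuming behaviors are plain numeric vectors and Euclidean distance (both my simplifications):

```python
def novelty_score(behavior, population_behaviors, archive, k=5):
    """Sparseness of a behavior: mean distance to its k nearest
    neighbors in the population plus the archive. Higher = more novel,
    so selection favors unexplored behaviors instead of raw fitness."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    others = [b for b in population_behaviors + archive
              if b is not behavior]
    nearest = sorted(dist(behavior, b) for b in others)[:k]
    return sum(nearest) / len(nearest) if nearest else 0.0
```

An individual whose behavior sits far from everything seen so far scores highest, which is what drives the search toward unexplored regions.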
### Answer by ElDuderino (Jul 01 '10, 1 vote)

If you're interested in biologically inspired techniques, I'd recommend taking a look at HyperNEAT. It is an extension of the NeuroEvolution of Augmenting Topologies (NEAT) algorithm that is particularly well suited to input domains with a specific geometric regularity, such as images: http://eplex.cs.ucf.edu/hyperNEATpage/HyperNEAT.html
### Answer by Janto (Jul 01 '10, 2 votes)

You might want to take a look at Graduated Non-Convexity, i.e. gradually changing your fitness function from convex to non-convex during training. I'm not sure whether you can apply it directly, but it might give you ideas.

*Comment (Joseph Turian):* This sounds like a continuation method. In a continuation method, you add constraints that make the space much smoother, with fewer local optima. You find the minimum of this over-constrained space, then relax the constraints slightly; this makes the space bumpier, and you find the next solution starting from where you currently sit. For example, you start by adding a very strong L2 penalty, and then gradually relax the strength of the penalty each time you converge. Look at this image: http://www.iro.umontreal.ca/~bengioy/yoshua_en/research_files/continuation.jpg

*Comment (Joseph Turian):* Here are Yoshua Bengio's thoughts on non-convex optimization using continuation methods: http://www.iro.umontreal.ca/~bengioy/yoshua_en/research.html#optimization
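Joseph's L2-penalty variant of the continuation idea can be sketched directly: minimize f(x) + lam*||x||^2 with a strong penalty (a heavily smoothed, near-convex problem), then re-solve from the previous solution while relaxing lam toward zero. The penalty schedule, learning rate, and inner gradient-descent solver below are all my own illustrative choices.

```python
def continuation_minimize(f, grad, x0,
                          lambdas=(10.0, 3.0, 1.0, 0.3, 0.1, 0.0),
                          lr=0.01, steps=2000):
    """Continuation via a decaying L2 penalty: each stage warm-starts
    plain gradient descent on f(x) + lam*||x||^2 from the previous
    stage's solution, with lam relaxed toward zero."""
    x = list(x0)
    for lam in lambdas:
        for _ in range(steps):
            g = grad(x)   # gradient of f; 2*lam*x is the penalty term
            x = [xi - lr * (gi + 2.0 * lam * xi)
                 for xi, gi in zip(x, g)]
    return x
```

On a bumpy 1-D function like sin(5x) + 0.1x^2, plain gradient descent from a poor start gets stuck in a distant local minimum, while the continuation path is funneled through the smoothed problem into the deep valley near the origin.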


Asked: May 21 '10 at 11:28

Seen: 13,411 times

Last updated: Sep 21 '11 at 20:20

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.