I'm trying to fit a model that turns out to have a fitness function with loads of local minima, and simple gradient methods do not seem to find any good solutions (and I know from simpler models fit to the same data that good solutions must exist).

After doing some reading, it seems my only options are various stochastic methods such as evolutionary algorithms or Monte Carlo optimization. My question is: what are the state-of-the-art optimization algorithms that are known to handle many local optima well? Ideally, I would like to combine the stochastic search with gradient information, so that the fact that I have the 1st- and 2nd-order gradients of the model is exploited.


asked May 21 '10 at 11:28

antolikjan


edited Dec 03 '10 at 07:06


Alexandre Passos ♦

Could you tell us more about the nature of the data?

(May 21 '10 at 11:36) Joseph Turian ♦♦

The data consist of 1800 pairs of images (30x30 pixels) and the corresponding responses of a neuron to these images. I then have a model of a neuron (inspired by the known anatomy of the LGN and visual cortex), where its response is a sum of dot products of a number (<20) of difference-of-Gaussians-like linear filters with the input image, passed through a sigmoid non-linearity.

The free parameters of the model are the weights and positions (centers) of the difference-of-Gaussians within the 30x30 pixel input space, and the parameters of the sigmoid.

(Jul 01 '10 at 07:55) antolikjan

10 Answers:

Martens, 2010 ("Deep learning via Hessian-free optimization") recommends Hessian-free optimization (a 2nd-order optimization method).

Martens specifically examines the problem of pathological curvature in deep networks. He argues that there are not many bad local minima in shallow and deep networks. Rather, there is pathological curvature in deep networks, which to first-order methods (like gradient descent) resembles bad local minima. But a strong optimizer like Hessian-free optimization can quickly detect and traverse these regions.

Whether HF optimization is appropriate for your particular problem depends on the shape of your space.
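The core idea is easy to sketch without any deep-learning machinery: it is Newton's method where the Hessian is never formed explicitly, only Hessian-vector products (approximated below by finite differences of the gradient; Martens uses exact R-operator/Gauss-Newton products), with the Newton system solved by truncated conjugate gradient. This is a minimal sketch of the principle under my own simplifications, not Martens' implementation:

```python
import math

def hvp(grad, x, v, eps=1e-6):
    # Hessian-vector product via finite differences of the gradient:
    # H v ~= (grad(x + eps*v) - grad(x)) / eps
    g0 = grad(x)
    g1 = grad([xi + eps * vi for xi, vi in zip(x, v)])
    return [(a - b) / eps for a, b in zip(g1, g0)]

def newton_cg(f, grad, x, outer=100, cg_iters=25, tol=1e-6):
    """'Hessian-free' Newton: solve H d = -g by conjugate gradient,
    using only Hessian-vector products, then line-search along d."""
    for _ in range(outer):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break
        # --- truncated CG on H d = -g ---
        d = [0.0] * len(x)
        r = [-gi for gi in g]          # residual for d = 0
        p = list(r)
        rr = sum(ri * ri for ri in r)
        for _ in range(cg_iters):
            Hp = hvp(grad, x, p)
            pHp = sum(pi * hi for pi, hi in zip(p, Hp))
            if pHp <= 1e-12:           # negative curvature: stop early
                break
            a = rr / pHp
            d = [di + a * pi for di, pi in zip(d, p)]
            r = [ri - a * hi for ri, hi in zip(r, Hp)]
            rr_new = sum(ri * ri for ri in r)
            if rr_new < 1e-12:
                break
            p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
            rr = rr_new
        if not any(d):                 # CG made no progress: gradient step
            d = [-gi for gi in g]
        # --- backtracking (Armijo) line search ---
        t, fx = 1.0, f(x)
        gd = sum(gi * di for gi, di in zip(g, d))
        while t > 1e-12 and f([xi + t * di for xi, di in zip(x, d)]) > fx + 1e-4 * t * gd:
            t *= 0.5
        x = [xi + t * di for xi, di in zip(x, d)]
    return x
```

The negative-curvature check in the CG loop is exactly where pathological curvature is "detected": CG is cut short rather than following an unreliable direction.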

answered Jun 23 '10 at 17:37


Joseph Turian ♦♦

I'm sure this will be interesting to others, but I should add that my problem doesn't (yet :-)) have the characteristics of deep learning.

(Jul 01 '10 at 07:56) antolikjan

The algorithm he created is just fine for any difficult gradient-based learning. I was really amazed by the results he got on auto-encoder NNs.

(Jul 06 '10 at 14:17) Łukasz Lew

There are also some exciting forthcoming results by Sutskever, Hinton, and Martens on training an RNN as a character-level language model. Hinton believes they could not have successfully trained the model without the Hessian-free optimization technique.

(Aug 18 '10 at 15:50) Joseph Turian ♦♦

Are there any implementations of this learning algorithm out there yet?

I would be very interested in testing it on my problem, as I have now transitioned to a 3-layer model and noticed that suddenly I cannot train it properly anymore. I guess the model now has the characteristics of deep learning, so the Martens algorithm could help?

(Aug 19 '10 at 07:36) antolikjan

I believe that some people in our lab are trying to implement it, but it's definitely not ready yet.

You could also try emailing James Martens and asking him to release his code. The more people that ask him, the more likely he is to do so.

(Aug 19 '10 at 10:11) Joseph Turian ♦♦

Update: people at UToronto are building some other applications on top of the Hessian-free optimization code, and James Martens (p.c.) will not release his source code until these projects are done.

(Aug 22 '10 at 19:11) Joseph Turian ♦♦

So most of the non-convex functions I've optimized have been done with LBFGS optimization with one crucial change: any time the optimizer thinks it's converged, dump the history cache to force it to flush the current approximation of the inverse Hessian, and take just a normal gradient step. Most of the Berkeley NLP papers since 2006 that do LBFGS non-convex optimization have used this trick, and I believe found it pretty important.
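For concreteness, here is my reading of that trick as a toy two-loop-recursion L-BFGS (this is a sketch under my own assumptions, not the Berkeley/UMass code): when the optimizer first appears converged, it flushes its curvature-pair history and takes a plain gradient step before accepting convergence.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def two_loop(g, hist):
    # standard L-BFGS two-loop recursion: implicit inverse-Hessian times g
    q, alphas = list(g), []
    for s, y, rho in reversed(hist):
        a = rho * dot(s, q)
        alphas.append(a)
        q = [qi - a * yi for qi, yi in zip(q, y)]
    gamma = dot(hist[-1][0], hist[-1][1]) / dot(hist[-1][1], hist[-1][1]) if hist else 1.0
    r = [gamma * qi for qi in q]
    for (s, y, rho), a in zip(hist, reversed(alphas)):
        b = rho * dot(y, r)
        r = [ri + (a - b) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]

def lbfgs_flush(f, grad, x, m=5, tol=1e-5, max_iter=1000):
    """Toy L-BFGS with the 'dump the cache' trick: on apparent
    convergence, discard the inverse-Hessian approximation and take
    one plain gradient step before declaring success."""
    hist = []                      # (s, y, rho) curvature pairs
    g, flushed = grad(x), False
    for _ in range(max_iter):
        if math.sqrt(dot(g, g)) < tol:
            if flushed:
                return x           # still converged after the flush: accept
            hist, flushed = [], True
            d = [-gi for gi in g]  # forced plain gradient step
        else:
            flushed = False
            d = two_loop(g, hist)
        # backtracking (Armijo) line search
        t, fx, gd = 1.0, f(x), dot(g, d)
        while t > 1e-12 and f([xi + t * di for xi, di in zip(x, d)]) > fx + 1e-4 * t * gd:
            t *= 0.5
        x_new = [xi + t * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        s = [b - a for a, b in zip(x, x_new)]
        y = [b - a for a, b in zip(g, g_new)]
        if dot(s, y) > 1e-10:      # keep only positive-curvature pairs
            hist.append((s, y, 1.0 / dot(s, y)))
            hist = hist[-m:]
        x, g = x_new, g_new
    return x
```

The point of the flush is that on a non-convex function the history can encode curvature from a region the iterate has already left; a fresh gradient step either confirms the stationary point or kicks the search back into motion.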

answered Jun 29 '10 at 21:48

aria42


Great trick. Is this technique published in any of those papers?

(Jun 29 '10 at 22:40) Joseph Turian ♦♦

What about making some source code that does this?

(Jun 29 '10 at 23:47) Mark Johnson

Sure. So, I incorporated this into my (future) UMass shared group code. The project is at http://github.com/aria42/umass-nlp. The specific LBFGS code, which was initially authored by Dan Klein, but modified substantially by myself and many others, is at http://github.com/aria42/umass-nlp/blob/master/src/main/java/edu/umass/nlp/optimize/LBFGSMinimizer.java. Search for "Dumping Cache".

(Jun 30 '10 at 14:55) aria42

Can someone point me to a proof that the CRF conditional probability is a convex function? I am trying to train a model using LBFGS, which seems to get stuck at various points.

(Jun 30 '10 at 17:36) Aman

Aman, that sounds like a good question to post at the top level.

(Jun 30 '10 at 17:38) Joseph Turian ♦♦

This gets upvoted so much. Is it really an unwritten standard?

(Jul 06 '10 at 14:59) Łukasz Lew

Yeah, why is this upvoted? It seems very ad hoc. How does it compare to stochastic optimization methods?

(Jul 06 '10 at 15:21) Tristan

I would be interested to hear if this trick is more widely known. This answer is getting upvoted so much because: a) according to Aria this technique is widely used in successful NLP applications, but not documented in most (all?) of the relevant publications. b) I've been linking to this answer when promoting the site, as an example of how this forum allows you to document previously undocumented tips and tricks.

I would definitely be interested in a controlled comparison between this technique and less ad-hoc ones.

(Jul 06 '10 at 16:39) Joseph Turian ♦♦

I think it's fine that it's not mentioned in the papers that use this. It's a little detail, just like many other things typically not mentioned, like exactly what line search algorithm is used within LBFGS, or how many of the previous iterations it uses to compute its Hessian approximation, etc. And frankly, I can't imagine it makes a big difference for the accuracy on a task.

(Jul 08 '10 at 00:23) Frank

@Frank: You might be wrong. If this detail is unimportant, then why do they keep using it? Maybe because of habit, or maybe it is important. I think part of the value of this site is discussing details which may or may not be important (but are otherwise not discussed), and then examining these details in our research.

(Jul 08 '10 at 02:49) Joseph Turian ♦♦

I may be misunderstanding, but generally a line-search algorithm such as L-BFGS iterates until ||grad f|| < eps (for unconstrained minimization). What do you mean by taking a normal gradient step? BFGS is run until the gradient vanishes, i.e., until it reaches a stationary point.

(Sep 21 '11 at 20:20) Jason Lee

I have had good success with genetic optimization using derivatives. It is stochastic and also uses derivative information to speed convergence. There is an R package which is easy to use. I'm not sure how well it scales to problems with huge numbers of parameters, though.
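The general recipe — a population search whose members are periodically polished with gradient descent, so evolution explores basins while the gradient exploits within them — can be sketched in a few lines. This is a generic hybrid of my own construction (not the R package's algorithm), demoed on the Rastrigin test function:

```python
import math
import random

def rastrigin(x):
    # classic multimodal test function; global minimum 0 at the origin
    return sum(xi * xi - 10 * math.cos(2 * math.pi * xi) + 10 for xi in x)

def rastrigin_grad(x):
    return [2 * xi + 20 * math.pi * math.sin(2 * math.pi * xi) for xi in x]

def hybrid_ga(f, grad, dim, pop_size=30, gens=60, sigma=0.5,
              polish_steps=50, lr=0.002, rng=random):
    """Genetic search with gradient polishing: mutation hops between
    basins, gradient descent pulls each individual to its basin floor."""
    pop = [[rng.uniform(-2, 2) for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        # local descent: polish every individual with a few gradient steps
        for ind in pop:
            for _ in range(polish_steps):
                g = grad(ind)
                for i in range(dim):
                    ind[i] -= lr * g[i]
        # selection: keep the best third, refill by Gaussian mutation
        pop.sort(key=f)
        elite = pop[: pop_size // 3]
        pop = [list(e) for e in elite]
        while len(pop) < pop_size:
            parent = rng.choice(elite)
            pop.append([p + rng.gauss(0, sigma) for p in parent])
    return min(pop, key=f)
```

The mutation sigma controls how readily offspring jump to neighboring basins, while the elitism guarantees that a good basin, once found, is never lost.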

answered Jun 30 '10 at 00:38

Tristan


Hello, this does look like exactly the sort of algorithm I'm looking for. Do you by any chance know of any implementation for Python, or perhaps any Python library that would be most suitable for implementing this? I was only able to find an implementation for R. Thanks.

(Jul 01 '10 at 07:58) antolikjan

You could access it through rpy, or try to link to the C source directly.

(Jul 01 '10 at 12:45) Tristan

If you are looking for non-gradient-based methods for 1-D functions, try Brent's optimization method. The method repeatedly fits a parabola through three points on the curve, evaluates the function at the minimum of the parabola, finds the next set of three points, and so on. The advantage of this method is that it does not require knowing anything about the function being optimized (no need to compute gradients) and can be implemented in a few lines.
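The fit-a-parabola step really is only a few lines. Below is a rough sketch of successive parabolic interpolation, assuming the three starting points bracket a minimum; note that full Brent's method additionally interleaves golden-section steps to guarantee convergence when the parabola fits are unreliable:

```python
def parabolic_minimize(f, x1, x2, x3, tol=1e-8, max_iter=100):
    """Successive parabolic interpolation: repeatedly jump to the vertex
    of the parabola through the three current points, dropping the worst.
    (Brent's method adds golden-section fallback steps for robustness.)"""
    pts = [x1, x2, x3]
    for _ in range(max_iter):
        a, b, c = sorted(pts)
        fa, fb, fc = f(a), f(b), f(c)
        denom = (b - a) * (fb - fc) - (b - c) * (fb - fa)
        if abs(denom) < 1e-14:
            break                  # degenerate parabola: give up
        # vertex of the interpolating parabola
        x = b - 0.5 * ((b - a) ** 2 * (fb - fc)
                       - (b - c) ** 2 * (fb - fa)) / denom
        if min(abs(x - p) for p in pts) < tol:
            pts.append(x)
            break                  # vertex no longer moving: converged
        # replace the worst of the three points with the vertex
        pts = sorted(pts, key=f)[:2] + [x]
    return min(pts, key=f)
```

On a quadratic the first parabola is exact, so the vertex lands on the minimum immediately; on smooth non-quadratic functions the convergence near the minimum is superlinear.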

answered Jun 30 '10 at 14:48


Delip Rao

Has something similar to Brent's method been applied to multivariate functions?

(Jul 01 '10 at 18:19) Janto

Not as far as I know.

(Jul 06 '10 at 11:13) Delip Rao

You might want to look at PGPE. While the cited paper presents the main idea in the context of Reinforcement Learning (particularly policy gradients), PGPE is by no means limited to that field. E.g. an ICANN 2010 paper shows how it was used to break a cryptographic system (much faster than standard evolutionary search). You might also want to give Stochastic Search a try, though you may find that with parameter counts >> 50, PGPE is faster.
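To give a feel for the method, basic PGPE with symmetric sampling fits in a few lines: perturb the parameters with per-dimension Gaussian noise, evaluate the mirrored pair of candidates, and use the reward difference to update the means and the reward level to adapt the exploration sigmas. This is a rough sketch following the published update rules; the hyperparameters and clamping are my own untuned guesses:

```python
import random

def pgpe_minimize(f, theta, iters=3000, alpha=0.05, alpha_sigma=0.01, rng=random):
    """Minimal PGPE with symmetric sampling. f is treated as a cost,
    so the reward is -f. Sigmas are clamped away from zero."""
    n = len(theta)
    sigma = [0.5] * n
    baseline = -f(theta)
    for _ in range(iters):
        eps = [rng.gauss(0, s) for s in sigma]
        r_plus = -f([t + e for t, e in zip(theta, eps)])
        r_minus = -f([t - e for t, e in zip(theta, eps)])
        # mean update: antithetic reward difference along the perturbation
        rd = 0.5 * (r_plus - r_minus)
        theta = [t + alpha * rd * e for t, e in zip(theta, eps)]
        # sigma update: did this spread pay off relative to the baseline?
        rs = 0.5 * (r_plus + r_minus) - baseline
        sigma = [max(1e-3, s + alpha_sigma * rs * (e * e - s * s) / s)
                 for s, e in zip(sigma, eps)]
        # running reward baseline
        baseline = 0.9 * baseline + 0.1 * 0.5 * (r_plus + r_minus)
    return theta
```

The symmetric (mirrored) sampling is what makes the mean update low-variance: even-order terms of the objective cancel in r_plus - r_minus, so the update behaves like a noisy finite-difference gradient step.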

answered Jun 30 '10 at 17:26

osdf


edited Jul 01 '10 at 16:51

Hello, the Stochastic Search looks very interesting. Do you by any chance know of any implementation, preferably in Python? Thanks.

(Jul 01 '10 at 08:02) antolikjan

A variant of Stochastic Search is in [PyBrain](http://www.pybrain.org); look for ExactNES. As I just learnt, you should also consider a very recent paper, [XNES](http://www.idsia.ch/~glasmachers/glasmachers2010b.pdf), which is also in PyBrain; look for XNES. Btw, PGPE (and CMA-ES) is also in PyBrain; check out the optimization module.

(Jul 01 '10 at 11:55) osdf

Wow, that sounds very promising, thank you very much for help!

I realize this might be getting a bit too specific, but:

I have in fact briefly tried PyBrain, but realized that I couldn't use it properly, as the kinds of models I have in mind cannot be easily expressed in the PyBrain neural networks framework. In the end I settled on Theano, as it allows me to flexibly change my models without recomputing all the 1st- and 2nd-order derivatives by hand. So my question is (mainly to all the Theano and PyBrain people who I know roam around these forums :-) ):

(Jul 01 '10 at 12:30) antolikjan

Is there a way to combine Theano with PyBrain, e.g. to let me express my models in Theano but exploit the optimization algorithms in PyBrain (such as Stochastic Search)?

(Jul 01 '10 at 12:30) antolikjan

antolikjan: Maybe post that as a new question, and I will ask the Theano devs to take a look.

(Jul 01 '10 at 13:08) Joseph Turian ♦♦

You can do that. PyBrain's optimization API works on functions, similar to the scipy API.

(Sep 05 '11 at 16:10) Justin Bayer

Before trying anything too complicated, just set up a quick evolutionary algorithm. If that does okay, add in self-adaptation for the parameters' sigmas. I've seen plenty of problems with lots of local minima, and almost always this will find the global optimum. If you still have trouble there, add in speciation and explicit fitness sharing.

answered Jun 30 '10 at 17:31


Wesley Tansey

Hello, sorry I'm not particularly familiar with evolutionary algorithms beyond the basics. Could you perhaps elaborate on the 'self-adaptation for the parameters' sigmas'?

(Jul 01 '10 at 08:00) antolikjan

Sure. The standard approach is to use Gaussian mutation for each parameter (C# code):

    private double gaussianMutation(double mean, double stddev)
    {
        // Box-Muller transform: two uniform samples -> one Gaussian sample
        double x1 = Rand.NextDouble();
        double x2 = Rand.NextDouble();
        double y1 = Math.Sqrt(-2.0 * Math.Log(x1)) * Math.Cos(2.0 * Math.PI * x2);
        return y1 * stddev + mean;
    }

(Splitting this into 2 comments for character length restrictions)

(Jul 01 '10 at 10:35) Wesley Tansey

The stddev you're using is usually referred to as sigma. A good heuristic is to set sigma to 1/6 of the range of the variable. For example, if you're evolving a variable whose minimum value is 0 and maximum is 6, a sigma of 1 is a good rule of thumb.

To do self-adaptation, you mutate the sigma as well:

    Sigma = Math.Max(0, Sigma * Math.Exp(gaussianMutation(0, 1)));
    val = gaussianMutation(val, Sigma);

(Jul 01 '10 at 10:35) Wesley Tansey

Ah thanks, now it is perfectly clear.

(Jul 01 '10 at 12:22) antolikjan

I've had good experiences with CMA-ES for doing neuro-evolution of recurrent neural networks. CMA-ES has two particularly nice features (among others):

  1. Invariance to order-preserving transformations of the fitness function
  2. Invariance to rigid body transformations of the search space (including rotation, reflection, and translation)

Note of caution: because it maintains a full covariance matrix, it will start to bog down if the dimensionality of the parameter space you are working in is significantly greater than about 100.

answered Jun 30 '10 at 18:54


Thomas Lahore

XNES (see my comment elsewhere on this page) is similar to CMA-ES; see the abstract of the paper cited in my comment.

(Jul 01 '10 at 12:14) osdf

An alternative that I understand works relatively well when evolving neural networks is Novelty Search ( http://eplex.cs.ucf.edu/papers/lehman_ecj10.pdf ). The short version of the algorithm is that, rather than using the traditional fitness value to determine the next generation, you choose the genotypes whose behavior is most different from the others (and from an archive of the historically most unique). Although it was designed with neural nets in mind, it should be easily adaptable to other problem spaces, even if the "behavior" of a given genome is simply some measure of distance to the other genomes in the search space.

If your goal is to eventually find the global minimum, perhaps you could perform 100 generations with Novelty Search and 100 generations with traditional, fitness-based search, or any number of variations on the novelty-for-a-while/fitness-for-a-while scheme.

answered Jul 01 '10 at 00:01

Rueben


edited Jul 01 '10 at 00:17

If you're interested in biologically-inspired techniques, I'd recommend taking a look at HyperNEAT. It's an extension of the NeuroEvolution of Augmenting Topologies (NEAT) algorithm that is particularly well suited to input domains with a specific geometric regularity, such as images.


answered Jul 01 '10 at 12:39

ElDuderino


You might want to take a look at Graduated Non-Convexity, i.e. gradually changing your fitness function from convex to non-convex during training. I'm not sure if you can apply it directly, but it might give you ideas.
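A minimal sketch of this family of ideas: embed the objective in a family f(x) + lambda*x^2, start with lambda large (so the problem is nearly convex), and relax lambda toward zero, warm-starting each stage from the previous solution. The 1-D test function and the lambda schedule below are toy choices of my own:

```python
def descend(g, x, lr=0.01, steps=3000):
    # plain gradient descent on whatever gradient function we hand it
    for _ in range(steps):
        x -= lr * g(x)
    return x

# toy tilted double well: global minimum near x = -1.03, local one near x = 0.96
f_grad = lambda x: 4 * x * (x * x - 1) + 0.3

# naive descent from the wrong basin gets stuck in the local minimum
x_naive = descend(f_grad, 1.2)

# continuation: add a strong convex l2 penalty, then relax it gradually,
# warm-starting each stage from the last solution
x = 1.2
for lam in [5.0, 2.0, 1.0, 0.5, 0.2, 0.05, 0.0]:
    x = descend(lambda t, l=lam: f_grad(t) + 2 * l * t, x)
x_continued = x
```

With the strong penalty the unique minimum sits near the origin; as the penalty relaxes, that minimum drifts continuously into the deeper (global) well, which the warm-started iterate tracks, while the naive run never escapes the shallow well it started in.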

answered Jul 01 '10 at 18:00

Janto


This sounds like a continuation method. In a continuation method, you add constraints that make the space much smoother, with fewer local optima. You find the minimum of this overconstrained space, and then relax the constraint slightly. This makes the space bumpier, and you find the next solution starting from where you currently sit. For example, you start by adding a very strong l2 penalty, and then gradually relax the strength of the penalty each time you converge. Look at this image: http://www.iro.umontreal.ca/~bengioy/yoshua_en/research_files/continuation.jpg

(Jul 01 '10 at 18:10) Joseph Turian ♦♦

Here are Yoshua Bengio's thoughts on non-convex optimization using continuation methods: http://www.iro.umontreal.ca/~bengioy/yoshua_en/research.html#optimization

(Jul 01 '10 at 18:10) Joseph Turian ♦♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.