For what problems are rectified linear units better than tanh? Why are they better?

asked Oct 30 '13 at 04:48

Max

edited Oct 30 '13 at 05:15


One Answer:

Most of the evidence is for them helping with speech, which seems to benefit from many layers much more than generic problems do. Image problems come second, since rectified linear units are (more or less) intensity invariant in certain settings (RBMs) and seem to work quite well in convolutional networks.

There are two main arguments as to why they're better. The first is that, since the unit is linear when active, whatever error gradient does flow through it is not attenuated, so it avoids the vanishing-gradient problem of the usual saturating activation functions, which helps when learning deep networks.
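A minimal numerical sketch of that first point (assuming NumPy; not part of the original answer), comparing the gradient of tanh with the gradient of a rectified linear unit:

    # The gradient of tanh shrinks toward 0 as the unit saturates; the gradient
    # of max(0, x) is exactly 1 wherever the unit is active. Illustrative only.
    import numpy as np

    x = np.linspace(-6, 6, 13)

    tanh_grad = 1.0 - np.tanh(x) ** 2      # derivative of tanh(x)
    relu_grad = (x > 0).astype(float)      # derivative of max(0, x), taken as 0 for x <= 0

    print(np.round(tanh_grad, 4))          # drops below 0.01 for |x| >= 3 (saturation)
    print(relu_grad)                       # 1.0 wherever x > 0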

The second is that a rectified linear unit can be seen as roughly equivalent to an infinite number of binary (sigmoid) units sharing the same learned weights but with different fixed bias offsets, suggesting it is a much more powerful representational unit (and also more neurally plausible).
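For the second point, a small sketch (NumPy; following the construction in Nair & Hinton, 2010, not part of the original answer) of how summing sigmoid units with a shared input but shifted biases approximates softplus, which is a smooth version of max(0, x):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0, 5.0])

    # Truncated version of the "infinite" sum of shifted binary units.
    stepped_sum = sum(sigmoid(x - i + 0.5) for i in range(1, 100))
    softplus = np.log1p(np.exp(x))
    relu = np.maximum(0.0, x)

    print(np.round(stepped_sum, 3))   # close to softplus
    print(np.round(softplus, 3))      # hugs max(0, x) away from 0
    print(relu)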

But they can be more difficult to train correctly, since they don't saturate, they have dead zones, and they're still pretty new, so there's much less information/code/support around them.

answered Oct 30 '13 at 16:35

Newmu

edited Oct 30 '13 at 17:01

+1 for intensity "invariance", but I find the other arguments unconvincing.

(Nov 01 '13 at 11:33) Max

I wouldn't say they are particularly difficult to train; if anything, they are a bit more forgiving with regard to the scale of the input, which is sometimes convenient.

To avoid the "dead zone" problem, it's usually sufficient to initialise the biases to a slightly positive value (I've used 1.0 or 0.1 in the past, although 0 often works just fine). Tanh/sigmoid units actually have a very similar problem: if they always saturate in one direction, the gradient is going to be too small to get them out of that state.
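A small sketch of the bias trick just described (NumPy only; the layer sizes and weight scale are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_out = 784, 512

    W = rng.normal(0.0, 0.01, size=(n_in, n_out))   # small random weights
    b = np.full(n_out, 0.1)                         # slightly positive biases (0.1, or 1.0)

    def relu_layer(x):
        # The positive bias makes it likely that each unit is active for at least
        # some inputs at initialisation, so no unit starts out "dead".
        return np.maximum(0.0, x @ W + b)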

As Newmu mentioned, they are piecewise linear, so networks with ReLUs are a lot easier to train from an optimization point of view. You don't need fancy second-order optimization strategies to get the most out of them; plain stochastic gradient descent (maybe with some momentum) will do.
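To make the "plain SGD with momentum" point concrete, here is a minimal sketch (NumPy; the toy data, single-layer ReLU model and hyperparameters are invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 10))
    true_w = rng.normal(size=10)
    y = np.maximum(0.0, X @ true_w)          # toy targets produced through a ReLU

    w = rng.normal(0.0, 0.1, size=10)        # small random init so some units start active
    v = np.zeros(10)
    lr, momentum = 0.01, 0.9

    for step in range(500):
        pred = np.maximum(0.0, X @ w)
        # Piecewise-linear model: the gradient is just the linear-regression
        # gradient masked by which units are currently active.
        grad = X.T @ ((pred - y) * (X @ w > 0)) / len(y)
        v = momentum * v - lr * grad         # classical momentum update
        w += v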

As an added bonus, computers find max(0,x) a lot easier to compute than tanh(x), so you get a nice speed boost because of that, and an additional boost due to the simpler optimization process.
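A quick timing sketch of that claim (NumPy; exact numbers depend on hardware):

    import numpy as np
    from timeit import timeit

    x = np.random.default_rng(0).normal(size=1_000_000)

    t_relu = timeit(lambda: np.maximum(0.0, x), number=100)
    t_tanh = timeit(lambda: np.tanh(x), number=100)
    print(f"relu: {t_relu:.3f}s  tanh: {t_tanh:.3f}s")   # max is typically much cheaper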

One place where they do seem to be a lot more difficult to handle is recurrent networks (I have no personal experience with this; a colleague told me). There, the fact that they don't saturate really is a problem.

(Nov 01 '13 at 17:47) Sander Dieleman