
What are the key points to keep in mind when selecting a transfer function for a neural network? Different transfer functions such as tansig, logsig and purelin are used in different classification settings. Can anybody please give me some intuition on how to choose among them?
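
For reference, here is a small sketch of the three functions named above (my own addition; I am assuming the names refer to the usual MATLAB Neural Network Toolbox definitions):

    import numpy as np

    # The three transfer functions named in the question, written out in
    # Python for readers unfamiliar with the MATLAB names.
    def tansig(x):            # hyperbolic tangent sigmoid; equivalent to np.tanh(x)
        return 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0

    def logsig(x):            # logistic sigmoid; output in (0, 1)
        return 1.0 / (1.0 + np.exp(-x))

    def purelin(x):           # linear transfer function; typical for regression outputs
        return x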

asked Feb 10 '12 at 09:28 by Kuri_kuri

edited Feb 17 '12 at 14:21 by Lucian Sasu

4 Answers:

While the following aren't absolutely necessary, they are at least things people often take into consideration:

  • Differentiability: if the function isn't differentiable, there are good, numerically stable algorithms for computing numerical approximations of the derivative (cf. Neural Smithing by Russell Reed and Robert Marks).
  • Extreme behavior: e.g., tanh and the logistic sigmoid (the latter solves the differential equation y' = y(1 - y); tanh is a scaled, shifted version of it) approach finite values at extreme inputs and change slowly there. On the other hand, a hard threshold (step) unit transitions abruptly from 0 to 1, so numerical methods are necessary when using it (this ties in with differentiability above).
  • The loss function: it isn't a requirement, but the loss function at the output layer is often chosen to "match" the transfer function. In short, a matched loss and transfer function make the arithmetic simpler (see the sketch after this list).
  • The types of values you want to represent: your output layer could be made up of binary, softmax, linear (often called Gaussian), tanh, or other more exotic units (e.g., rectified linear units). If the target is a real value, linear may be what you need. If it's an integer (e.g., a count), rectified linear units can work well. If the task is classification, binary and softmax units can both do the job, though they differ in the constraints they impose: softmax units are often used to assign membership to a single class among a group of classes and are essentially a combination of multiple binary units, but unlike multiple independent binary units, softmax adds the constraint that the probabilities over the classes must sum to 1.
  • Internal dynamics: tanh units can induce interesting dynamics when used as internal (hidden) units, though I'm unsure how they differ from logistic units when used as output units. Some recent papers from Hinton's group demonstrated that rectified linear units offer a very rich representation of latent variables, so they work very well in place of binary units.
  • Computability: how expensive is it to evaluate? Is it numerically stable? Can an approximation be used? What kind of bias does the approximation introduce (if any)?
  • Symmetries: e.g., should it be symmetric about an axis? Anti-symmetric?
  • Monotonicity: should it be monotonic, or even strictly monotonic?
  • Does it take extra parameters? For example, if the units are tied to a probability distribution that requires two parameters (e.g., the normal distribution requires a mean and a standard deviation), then you may need to train the sets of parameters in separate stages (akin to linear/quadratic programming problems).
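
As a small illustration of the "matching loss" point above (my own sketch, not part of the original answer): with a logistic output unit and a cross-entropy loss, the chain-rule gradient with respect to the pre-activation collapses to y - t.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cross_entropy(y, t):
        return -(t * np.log(y) + (1.0 - t) * np.log(1.0 - y))

    z, t = 0.7, 1.0                 # arbitrary pre-activation and binary target
    y = sigmoid(z)

    # Chain rule: dL/dz = dL/dy * dy/dz
    dL_dy = -(t / y) + (1.0 - t) / (1.0 - y)
    dy_dz = y * (1.0 - y)           # the sigmoid satisfies y' = y(1 - y)
    chain_rule = dL_dy * dy_dz

    # The simplified form you get when the loss "matches" the transfer function
    matched = y - t

    print(chain_rule, matched)      # the two agree to machine precision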

As an example of the computability issue, Rectified Linear Units are essentially linear units, where the [theoretical] transfer function could be defined as:

sum( sigmoid(x - i + 0.5), i = 1 ... N )

That's a rather expensive function to compute. However, in the limit as N goes to infinity, the transfer function approaches

log(1 + e^x) ~= max(0, x) (for inputs far from 0)
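
A quick numerical sketch (my addition, not part of the original answer) comparing the finite sum of shifted sigmoids, its softplus limit log(1 + e^x), and the cheap max(0, x) approximation:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_sum(x, N=50):
        # sum( sigmoid(x - i + 0.5), i = 1 ... N ) from the formula above
        i = np.arange(1, N + 1)
        return sigmoid(x[:, None] - i + 0.5).sum(axis=1)

    x = np.linspace(-5.0, 10.0, 7)
    softplus = np.log1p(np.exp(x))          # log(1 + e^x)
    relu = np.maximum(0.0, x)               # max(0, x)

    print(np.c_[x, sigmoid_sum(x), softplus, relu])
    # Away from 0 all three columns agree closely; the sum and softplus stay
    # smooth (and differentiable) near x = 0, while max(0, x) has a kink there.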

There are other considerations to make here (max(0, x) is not differentiable at 0, so how do you compute a gradient for it? Do you need to add a corrective term to account for the approximation? etc.), but I think I've probably given you enough to chew on.

answered Feb 21 '12 at 16:09 by Brian Vandenberg

edited Apr 09 '12 at 01:49

I suggest this paper: Efficient BackProp by LeCun et al. (SpringerLink). You can find a PDF version of the paper here. It gives a lot of interesting hints about the design of backpropagation neural networks.

answered Feb 15 '12 at 10:23 by Matteo De Felice

edited Feb 17 '12 at 14:22 by Lucian Sasu

Wow, I like that paper. Pretty easy reading, and a lot of good info.

(Feb 22 '12 at 15:54) Brian Vandenberg

For backprop, the transfer (activation) function must be differentiable.

answered Feb 13 '12 at 15:50 by Melipone Moody

Not entirely correct: they only have to be locally differentiable. For example, f(x) = max(0, x) works fine, although it has a non-differentiable point at x = 0.

(Feb 15 '12 at 16:37) Justin Bayer

thanks for the correction.

(Feb 17 '12 at 14:18) Melipone Moody

Further, even if it isn't locally differentiable you can still approximate the gradient using numerical methods.

(Feb 22 '12 at 15:55) Brian Vandenberg

Care to elaborate? I don't see how you can approximate the gradient at a non-differentiable location, mainly because it does not, by definition, exist. (I'm not referring to the gradient of an approximation, like log(1 + exp(x)).)

(Feb 22 '12 at 16:31) Justin Bayer

Sure. You can use a finite-difference approximation (or any other numerical method). For details, see "Neural Smithing" by Reed & Marks (1999), page 56, or the references they cite for the concept: "Neural Networks for Pattern Recognition" by C. M. Bishop (1995), and "Structural Risk Minimization for Character Recognition" by Guyon, Vapnik, Boser, Bottou, and Solla (1992).

(Feb 22 '12 at 17:20) Brian Vandenberg

In short, where you'd need d(transfer)/dx, you replace that by the finite-difference approximation.
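
For example, here is a minimal sketch of that idea for max(0, x) (my own, assuming a central difference with step h):

    import numpy as np

    def relu(x):
        return np.maximum(0.0, x)

    def finite_diff(f, x, h=1e-4):
        # central-difference approximation of f'(x)
        return (f(x + h) - f(x - h)) / (2.0 * h)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    print(finite_diff(relu, x))
    # Away from 0 this matches the exact derivative (0 or 1); at the kink x = 0
    # it simply returns 0.5, glossing over the fact that the true derivative
    # does not exist there.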

(Feb 22 '12 at 17:23) Brian Vandenberg

You're right that the derivative doesn't truly exist; the assertion is that it doesn't really matter -- you just gloss over that fact with the numerical method.

(Feb 22 '12 at 18:17) Brian Vandenberg

These are hyperparameters, and you should tune them empirically (by looking at validation error, for example).
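
For instance, a minimal sketch of that kind of search (using scikit-learn purely as an illustration; it is not mentioned in this answer) might compare validation error across a few candidate transfer functions:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    # Synthetic data split into a training set and a held-out validation set
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

    for activation in ["logistic", "tanh", "relu"]:   # candidate transfer functions
        clf = MLPClassifier(hidden_layer_sizes=(50,), activation=activation,
                            max_iter=500, random_state=0)
        clf.fit(X_tr, y_tr)
        print(activation, "validation error:", 1.0 - clf.score(X_val, y_val))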

answered Feb 10 '12 at 10:43 by Alexandre Passos
