What are the key points to keep in mind when selecting a transfer function for a neural network? Different transfer functions such as tansig, logsig, and purelin are used in different classification settings. Can anybody give me some intuition on how to choose among them?
While the following aren't absolutely necessary, they are at least things people often take into consideration:
As an example of the computability issue, Rectified Linear Units are essentially linear units, where the [theoretical] transfer function could be defined as sum(sigmoid(x - i + 0.5), i = 0 ... N). That's a rather expensive function to compute. However, in the limit as N goes to infinity, the transfer function approaches log(1 + e^x), which is ~= max(0, x) for sufficiently large x. There are other considerations to make here (max(0, x) is not differentiable, so how do you compute a gradient for it? Do you need to add a corrective term to account for the approximation? etc.), but I think I've probably given you enough to chew on.
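The approximation described above is easy to check numerically. A minimal Python sketch (note: the sum below starts at i = 1 rather than i = 0, since that is the indexing under which the stacked sigmoids converge to log(1 + e^x); everything else follows the text):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def stacked_sigmoids(x, n=50):
    # Sum of shifted sigmoids: sum over i = 1 ... N of sigmoid(x - i + 0.5).
    # Expensive: one exp() per term.
    return sum(sigmoid(x - i + 0.5) for i in range(1, n + 1))

def softplus(x):
    # Smooth limit of the sum as N goes to infinity: log(1 + e^x)
    return math.log(1.0 + math.exp(x))

def relu(x):
    # Cheap approximation of softplus for |x| not near 0
    return max(0.0, x)

for x in [-2.0, 0.0, 3.0, 10.0]:
    print(x, stacked_sigmoids(x), softplus(x), relu(x))
```

For x = 3 the stacked sum and softplus agree to within a few thousandths, and by x = 10 softplus and max(0, x) are indistinguishable to three decimal places, which is the sense in which the cheap max(0, x) stands in for the expensive sum.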
I suggest this paper: Efficient BackProp by LeCun et al. (SpringerLink). You can find a PDF version of the paper here. It gives a lot of interesting hints about designing BP NNs.
Wow, I like that paper. Pretty easy reading, and a lot of good info.
(Feb 22 '12 at 15:54)
Brian Vandenberg
For backprop, the transfer (activation) function must be differentiable.
Not entirely correct: it only has to be locally differentiable. For example, f(x) = max(0, x) works fine, although it has a non-differentiable point at x = 0.
(Feb 15 '12 at 16:37)
Justin Bayer
thanks for the correction.
(Feb 17 '12 at 14:18)
Melipone Moody
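The point about local differentiability can be sketched in a few lines. For ReLU, everywhere except x = 0 the derivative is well defined; at the kink, practitioners simply pick a convention (the choice of 0 below is a common convention, not something fixed by the math):

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Derivative is 1 for x > 0 and 0 for x < 0. At x = 0 the derivative
    # does not exist, so we pick a convention (here: 0), which is a valid
    # subgradient and works fine in practice.
    return 1.0 if x > 0 else 0.0
```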
Further, even if it isn't locally differentiable you can still approximate the gradient using numerical methods.
(Feb 22 '12 at 15:55)
Brian Vandenberg
Care to elaborate? I don't see how you can approximate the gradient at a non-differentiable point, mainly because, by definition, it does not exist. (I'm not referring to the gradient of an approximation, like log(1 + exp(x)).)
(Feb 22 '12 at 16:31)
Justin Bayer
1
Sure. You can use a finite-difference approximation (or any other numerical method). For details, see "Neural Smithing" by Reed & Marks (1999), page 56, or the references they cite for the concept: "Neural Networks for Pattern Recognition" by C.M. Bishop (1995), and "Structural Risk Minimization for Character Recognition" by Guyon, Vapnik, Boser, Bottou, and Solla (1992).
(Feb 22 '12 at 17:20)
Brian Vandenberg
In short, where you'd need d(transfer)/dx, you replace that by the finite-difference approximation.
(Feb 22 '12 at 17:23)
Brian Vandenberg
You're right that the derivative doesn't truly exist; the assertion is that it doesn't really matter -- you just gloss over that fact with the numerical method.
(Feb 22 '12 at 18:17)
Brian Vandenberg
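The replacement described in the comments above can be sketched as follows; `finite_diff` is an illustrative helper, not from any particular library:

```python
def finite_diff(f, x, h=1e-5):
    # Central finite-difference approximation of df/dx
    return (f(x + h) - f(x - h)) / (2.0 * h)

def relu(x):
    return max(0.0, x)

# Away from the kink this closely matches the true derivative.
print(finite_diff(relu, 2.0))   # ~1.0
print(finite_diff(relu, -2.0))  # ~0.0
# At x = 0 the derivative doesn't exist; the method just returns a value
# between the two one-sided derivatives (here 0.5), "glossing over" the
# non-differentiability as described above.
print(finite_diff(relu, 0.0))   # 0.5
```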
These are hyperparameters, and you should tune them empirically (for example, by looking at validation error).
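As a sketch of what that empirical tuning loop looks like: the `train_and_validate` helper below and its error values are entirely hypothetical stand-ins for a real train-and-validate run; only the selection logic is the point.

```python
# Hypothetical helper: a real version would train a network with the given
# transfer function and return its error on a held-out validation set.
# Here we stub it with fixed numbers purely to illustrate the loop.
def train_and_validate(activation):
    fake_validation_errors = {"logsig": 0.21, "tansig": 0.17, "purelin": 0.35}
    return fake_validation_errors[activation]

candidates = ["logsig", "tansig", "purelin"]
# Pick the transfer function with the lowest validation error.
best = min(candidates, key=train_and_validate)
print(best)  # tansig
```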