I am trying to understand fast dropout and its applicability to recurrent neural networks.

The author says: "Forward propagation through the non-linearity f for calculation of the post-synaptic activation can be approximated very well in the case of the logistic sigmoid and the tangent hyperbolicus and done exactly in case of the rectifier".

It seems quite natural to me to approximate tanh with the error function. So my first question: what are the drawbacks of using the error function directly as the activation function?

And second: how does one derive this integral

$$\mathbb{E}[\sigma(X)] = \int \sigma(x)\,\mathcal{N}(x \mid \mu, s^2)\,dx \;\approx\; \sigma\!\left(\frac{\mu}{\sqrt{1 + \pi s^2/8}}\right)$$

and this one

$$\operatorname{Var}[\sigma(X)] = \int \sigma(x)^2\,\mathcal{N}(x \mid \mu, s^2)\,dx \;-\; \mathbb{E}[\sigma(X)]^2\,?$$

asked Feb 18 '14 at 09:43 by Midas

edited Feb 28 '14 at 13:09

I think the solution for the integral is: erf is the integral of a Gaussian. Integrating by parts you get erf_1*erf_2 + integral(gaussian_1*gaussian_2) = erf_1*erf_2 + erf_3. I think it is easier if you choose your erf to match the Gaussian you have.

(Feb 28 '14 at 14:02) eder

Look here for details of the general solution. But I still do not understand how to use variance propagation in practice.

(Feb 28 '14 at 14:07) Midas

One Answer:

The answer to deriving the first integral is already here, so I will mostly give some relevant references. The trick is really integration by parts, which is suggested by the form of the integral, though the specific way to get the general transformed case is differentiation under the integral sign (a short derivation sketch follows the references below). This trick is used in:

MacKay, 1992, The Evidence Framework Applied to Classification Networks

Spiegelhalter and Lauritzen, 1990, Sequential Updating of Conditional Probabilities on Directed Graphical Structures. They have an appendix explaining this trick.

Chris Bishop's book (Pattern Recognition and Machine Learning), in the Bayesian logistic regression section, which is unfortunately not freely available. And the fast dropout paper itself.
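To make the trick concrete, here is a minimal sketch of a derivation (a probabilistic shortcut rather than the explicit integration by parts used in the references; the notation is mine). Use the standard probit approximation $\sigma(x) \approx \Phi(\lambda x)$ with $\lambda^2 = \pi/8$, where $\Phi$ is the standard normal CDF. For $X \sim \mathcal{N}(\mu, s^2)$ and an independent $Y \sim \mathcal{N}(0, 1)$,

$$
\begin{aligned}
\int \Phi(\lambda x)\,\mathcal{N}(x \mid \mu, s^2)\,dx
&= \Pr(Y \le \lambda X) = \Pr(Y - \lambda X \le 0) \\
&= \Phi\!\left(\frac{\lambda \mu}{\sqrt{1 + \lambda^2 s^2}}\right),
\qquad \text{since } Y - \lambda X \sim \mathcal{N}(-\lambda\mu,\, 1 + \lambda^2 s^2).
\end{aligned}
$$

Converting both sides back from $\Phi$ to $\sigma$ with the same approximation gives

$$\int \sigma(x)\,\mathcal{N}(x \mid \mu, s^2)\,dx \approx \sigma\!\left(\frac{\mu}{\sqrt{1 + \pi s^2/8}}\right).$$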

Section 4.1 of the fast dropout paper explains how to approximate the variance, which basically involves scaling/translating the logistic function so that it looks like the squared logistic function (see the plot of the two functions).
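As a rough illustration of what this mean/variance propagation could look like in code, here is a sketch in Python/NumPy based on my reading of Section 4.1 (not the authors' implementation; the constants a and b below come from matching the value and slope of sigmoid(x)^2 at its midpoint, which may differ slightly from the paper's fit):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def expect_sigmoid(mu, var):
        """E[sigmoid(X)] for X ~ N(mu, var), via sigmoid(x) ~ Phi(sqrt(pi/8) * x)."""
        return sigmoid(mu / np.sqrt(1.0 + np.pi * var / 8.0))

    def var_sigmoid(mu, var):
        """Approximate Var[sigmoid(X)] for X ~ N(mu, var).

        Uses sigmoid(x)**2 ~ sigmoid(a * (x - b)), with a, b chosen so that
        value and slope match at the point where sigmoid(x)**2 = 1/2.
        """
        a = 4.0 - 2.0 * np.sqrt(2.0)      # ~1.172
        b = np.log(np.sqrt(2.0) + 1.0)    # ~0.881
        mean = expect_sigmoid(mu, var)
        # E[sigmoid(a*(X - b))], where a*(X - b) ~ N(a*(mu - b), a^2 * var)
        second_moment = sigmoid(a * (mu - b) / np.sqrt(1.0 + np.pi * a**2 * var / 8.0))
        return np.maximum(second_moment - mean**2, 0.0)

    # Example: propagate the mean and variance of a dropout-noised pre-activation.
    mu, var = 0.5, 2.0
    print(expect_sigmoid(mu, var), var_sigmoid(mu, var))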

Finally, the first question: why not just use the error function? People do: see probit regression. As far as activation functions go, ReLU seems more popular today than either of these. Here is my shallow understanding of the differences:

  1. Logistic regression is the maximum-entropy model, which is optimal in that sense.
  2. The error function has a very thin tail, so if you use the log probit as your loss, it scales quadratically as a function of your margin of error. This can be more sensitive to outliers than either logistic regression or the SVM, whose losses scale linearly.
  3. Numerical issues, which I feel might be a big reason why probit models are less popular. For the loss function there are intuitive tricks to make the log-likelihood of the logistic model numerically safe (subtract the max); it is less intuitive how to do this for the probit log-likelihood (of course it can be done). A naive implementation becomes problematic much more quickly: log(normcdf(-8.5)) is already -inf in MATLAB. The vanishing-gradient issue also kicks in more rapidly if this is used as an activation function. (See the short numerical sketch after this list.)
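To illustrate point 3, here is a small Python/SciPy sketch (the exact input where the naive computation underflows depends on the implementation and precision, so the -40 below is just for illustration):

    import numpy as np
    from scipy.stats import norm
    from scipy.special import log_ndtr  # log of the standard normal CDF, computed stably

    x = -40.0

    # Naive probit log-likelihood term: norm.cdf(-40) underflows to 0 in double
    # precision, so the log is -inf.
    naive = np.log(norm.cdf(x))

    # Stable version: stays finite because it works in log space throughout.
    stable = log_ndtr(x)

    # Logistic counterpart: log sigmoid(x) = -log(1 + exp(-x)), stable via logaddexp.
    log_sigmoid = -np.logaddexp(0.0, -x)

    print(naive, stable, log_sigmoid)  # -inf  ~-804.6  -40.0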

answered Mar 03 '14 at 20:13 by sidaw

edited Mar 03 '14 at 20:15
