|
I am trying to understand fast dropout and its applicability to recurrent neural networks. The authors say: "Forward propagation through the non-linearity f for calculation of the post-synaptic activation can be approximated very well in the case of the logistic sigmoid and the tangent hyperbolicus and done exactly in case of the rectifier". It seems quite natural to me to approximate tanh with the error function. So my first question is: what are the drawbacks of using the error function directly as the activation function? And second: how does one derive the Gaussian expectation of the logistic sigmoid, $\int \sigma(x)\,\mathcal{N}(x \mid \mu, s^2)\,dx$, and of its square, $\int \sigma(x)^2\,\mathcal{N}(x \mid \mu, s^2)\,dx$ (which is needed for the variance)?
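
To make the premise concrete, here is a small numerical sketch of what I mean (my own code, not from the paper): erf with its slope at the origin matched to tanh is already very close to tanh.

```python
# My own sketch, not code from the paper: matching the slopes at the origin
# gives tanh(x) ~= erf(sqrt(pi)/2 * x), and the two curves nearly coincide.
import numpy as np
from scipy.special import erf

x = np.linspace(-4.0, 4.0, 801)
gap = np.max(np.abs(np.tanh(x) - erf(np.sqrt(np.pi) / 2.0 * x)))
print(gap)   # roughly 0.03 at worst, so the approximation is quite tight
```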
|
The answer to deriving the first integral is already given here. I will mostly give some relevant references. The trick is really integration by parts, which is suggested by the form of the integral, though the specific way to get the general transformed case is differentiation under the integral sign. This trick is used in:

- MacKay, 1992, "The Evidence Framework Applied to Classification Networks"
- Spiegelhalter and Lauritzen, 1990, "Sequential Updating of Conditional Probabilities on Directed Graphical Structures" (they have an appendix explaining it)
- Chris Bishop's book, in the Bayesian logistic regression section, which is unfortunately not freely available
- and the fast dropout paper itself.

(A quick numerical check of the resulting closed-form mean approximation is sketched at the end of this answer.)

Section 4.1 of the fast dropout paper tries to explain how to approximate the variance, which basically involves scaling/translating the logistic function so that it looks like the squared logistic function (compare a plot of the two functions).

Finally, the first question: why not just use the error function? People do; see probit regression. As far as activation functions go, the ReLU seems more popular today than either of these, and my understanding of the differences between them is admittedly shallow.
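
Here is the promised check (my own sketch, not code from either paper) of the closed form that the probit trick above yields for the mean, $E[\sigma(X)] \approx \sigma(\mu / \sqrt{1 + \pi s^2 / 8})$ for $X \sim \mathcal{N}(\mu, s^2)$; the values of mu and s are arbitrary.

```python
# Monte Carlo check of the probit-based closed form for E[sigma(X)], X ~ N(mu, s^2).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
mu, s = 1.5, 2.0                                   # pre-synaptic mean and std (arbitrary)
samples = rng.normal(mu, s, size=1_000_000)

mc_mean = sigmoid(samples).mean()                  # brute-force estimate
closed_form = sigmoid(mu / np.sqrt(1.0 + np.pi * s**2 / 8.0))

print(mc_mean, closed_form)   # the two agree to about two decimal places
```

The same machinery applied to the scaled/translated (squared) logistic is what gives the variance approximation of section 4.1.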
|
I think the solution for the integral is as follows: erf is the integral of a Gaussian. Integrating by parts, you get erf_1 * erf_2 + integral(gaussian_1 * gaussian_2) = erf_1 * erf_2 + erf_3. I think it is easier if you choose your erf to match the Gaussian you are given.
Look here for the details of the general solution. But I still do not understand how to use variance propagation in practice.
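
To make the general solution concrete, here is a numerical check (my own sketch; the constants are arbitrary) of the Gaussian-probit identity that the integration-by-parts argument leads to; this is also the building block for propagating the mean through the non-linearity.

```python
# Check numerically that
#   integral Phi(a + b*x) * N(x | mu, sigma^2) dx = Phi((a + b*mu) / sqrt(1 + b^2 * sigma^2)),
# where Phi is the standard normal CDF. The constants below are arbitrary.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

a, b, mu, sigma = 0.3, 1.7, -0.5, 0.8

numeric, _ = quad(lambda x: norm.cdf(a + b * x) * norm.pdf(x, mu, sigma),
                  -np.inf, np.inf)
closed_form = norm.cdf((a + b * mu) / np.sqrt(1.0 + (b * sigma) ** 2))

print(numeric, closed_form)   # agree to numerical precision
```

Propagating the variance as well additionally needs the expectation of the squared non-linearity, which is where the squared-logistic approximation from the other answer comes in.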