Hi, I recently started learning deep learning, beginning with Prof. Bengio's paper "Learning Deep Architectures for AI". In this paper, there is one paragraph about auto-encoders that puzzles me. I've searched quite a few places on the Internet, but had no luck. I hope I can find some help here. Below is a quote from Section 4.6 of the paper:

To achieve perfect reconstruction of continuous inputs, a one-hidden layer auto-encoder with non-linear hidden units needs very small weights in the first layer (to bring the non-linearity of the hidden units in their linear regime) and very large weights in the second layer.

Assume we have the input data x, first encoded as h(x) in the hidden layer, and the recovered x is denoted as \hat{x}:

h(x) = \mathrm{sigmoid}(W_1^T x + b_1), \quad \hat{x} = \mathrm{sigmoid}(W_2^T h(x) + b_2)

(http://mathurl.com/pxmar8w)

Can someone explain the above quoted paragraph?

Thanks!

asked Apr 03 '14 at 04:16 by mintaka, edited Apr 03 '14 at 05:08


2 Answers:

A continuous autoencoder typically does not have a sigmoid in the reconstruction step (otherwise it could not reconstruct inputs outside of the [0,1] interval). So it should be:

\hat{x} = W_2^T h(x) + b_2

If the weights in W_1 are very small, then:

h(x) = \mathrm{sigmoid}(W_1^T x + b_1) \approx W_1^T x + b_1 + 0.5

because the sigmoid function is approximately linear when its input is close to zero (its slope at zero is 1/4, a constant factor that can be absorbed into the scale of W_2, and sigmoid(0) = 0.5, hence the additional offset). Then:

\hat{x} \approx W_2^T (W_1^T x + b_1 + 0.5) + b_2 = W_2^T W_1^T x + W_2^T (b_1 + 0.5) + b_2

We can get rid of the constant term by setting b_1 = -0.5 and b_2 = 0. Then we can just choose W_1 and W_2 so the product of their transposes is the identity matrix:

W_1 = \epsilon I, \quad W_2 = \epsilon^{-1} I

where \epsilon is a small constant. Now we get \hat{x} \approx x. The smaller \epsilon, the better the approximation.

Of course this is undesirable, because you want the hidden units to capture useful structure in the data rather than simply copy it.
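To make this concrete, here is a small NumPy sketch of the construction (just an illustration, not part of the original argument); it uses b_1 = 0 with b_2 = -0.5 W_2^T e, as in the correction in the comments below, and puts a factor of 4 into W_2 to absorb the sigmoid's slope of 1/4 at zero:

import numpy as np

# Illustrative check of the small-W_1 / large-W_2 construction above.
# Hidden layer: sigmoid; reconstruction: linear, x_hat = W_2^T h + b_2.
# The factor of 4 in W_2 absorbs sigmoid'(0) = 1/4.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 5
x = rng.normal(size=d)                # an arbitrary continuous input

for eps in [1.0, 0.1, 0.01]:
    W1 = eps * np.eye(d)              # very small first-layer weights
    W2 = (4.0 / eps) * np.eye(d)      # very large second-layer weights
    b1 = np.zeros(d)
    b2 = -0.5 * W2.T @ np.ones(d)     # cancels the sigmoid(0) = 0.5 offset
    h = sigmoid(W1.T @ x + b1)        # hidden units stay in their linear regime
    x_hat = W2.T @ h + b2             # linear reconstruction
    print(eps, np.max(np.abs(x_hat - x)))

The printed reconstruction error shrinks as eps shrinks, which is the "smaller epsilon, better approximation" behaviour described above.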

Note that the same reasoning applies to an autoencoder with binary inputs (in that case there is a sigmoid in the reconstruction step as well); you just need even larger weights W_2 to drive that sigmoid into its saturation regions.
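A similar illustrative sketch for the binary case (again not from the original answer, with arbitrarily chosen constants): tiny first-layer weights keep the hidden units near 0.5 + \epsilon x / 4, and a very large W_2 plus a bias shift pushes the output sigmoid's pre-activation to roughly \pm C/8, which saturates to 0 or 1:

import numpy as np

# Illustrative binary-input variant: the reconstruction also goes through a
# sigmoid, so W_2 must be large enough to saturate it at 0 or 1.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d = 5
x = rng.integers(0, 2, size=d).astype(float)   # binary input in {0, 1}^d

eps, C = 0.01, 80.0                            # tiny first-layer scale, huge second-layer scale
W1 = eps * np.eye(d)
W2 = (C / eps) * np.eye(d)
b1 = np.zeros(d)
b2 = -(0.5 * C / eps + C / 8.0) * np.ones(d)   # centres the pre-activation near +/- C/8

h = sigmoid(W1.T @ x + b1)                     # approximately 0.5 + eps * x / 4
x_hat = sigmoid(W2.T @ h + b2)                 # pre-activation around +/- 10, so the sigmoid saturates
print(x)
print(np.round(x_hat, 4))                      # very close to the original binary x

Increasing C pushes x_hat even closer to exact zeros and ones, which is the "even bigger weights W_2" point above.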

answered Apr 03 '14 at 05:52 by Sander Dieleman

Now I got it. Thanks for your nice explanation, Sander!

(Apr 03 '14 at 11:28) mintaka

There's actually a small mistake: we can't set b_1 = -0.5 because then the sigmoid is no longer in its linear region. It needs to be b_1 = 0, and then we can set b_2 = - 0.5 W_2^T e (where e is a vector of all ones) to compensate for it. Apologies for the confusion.

(Apr 04 '14 at 04:38) Sander Dieleman

I have another question; could you help me? Thanks in advance. For an autoencoder, why is the function of the intermediate layer not a sigmoid?

answered Apr 09 '14 at 23:23 by LTW
