I have a question about auto-encoders that use squared error as the loss function. In Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion, the authors consider an affine + sigmoid encoder combined with either an affine decoder with squared error loss or an affine + sigmoid decoder with cross-entropy loss. My question is: what is the problem with the combination of an affine + sigmoid encoder and an affine + sigmoid decoder with squared error loss? And for real-valued data, why do people use a linear activation function in the decoder instead of a non-linear one such as tanh or sigmoid? Thanks
Intuitively: the sigmoid squashes your output, forcing it to lie between 0 and 1, so a sigmoid decoder cannot reconstruct real-valued data that falls outside that range.
The two versions you found are straightforward extensions of popular methods you may want to look at: you can view the auto-encoder with linear output as linear regression on top of a hidden layer, and the auto-encoder with cross-entropy loss as logistic regression on top of a hidden layer. Another possible interpretation, from which these combinations follow automatically, is to view the autoencoder loss as the negative conditional log-likelihood of the data given the hidden units (a sketch is given below). If this distribution is Gaussian with the variance fixed at 1, you get an MSE penalty + linear reconstruction. If it is Bernoulli-distributed, you get a cross-entropy penalty + sigmoid reconstruction.
(Mar 15 '13 at 06:28)
Sander Dieleman
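To make that correspondence concrete, here is a minimal sketch of the derivation described above. The notation (h for the hidden code, \hat{x} for the decoder's reconstruction, W' and b' for the decoder weights and bias, d for the data dimensionality) is assumed for illustration and does not appear in the original thread.

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% h is the hidden code, \hat{x} the reconstruction of the input x,
% d the data dimensionality, \sigma^2 the fixed output variance.
\begin{align*}
% Gaussian output model with an affine (linear) decoder \hat{x} = W'h + b':
-\log \mathcal{N}\!\left(x;\ \hat{x},\ \sigma^2 I\right)
  &= \frac{1}{2\sigma^2}\,\|x - \hat{x}\|^2 + \frac{d}{2}\log\!\left(2\pi\sigma^2\right)
  && \text{(squared error, up to scale and constant)} \\[4pt]
% Bernoulli output model with a sigmoid decoder \hat{x} = \mathrm{sigmoid}(W'h + b'):
-\log \prod_{i} \hat{x}_i^{\,x_i} (1 - \hat{x}_i)^{1 - x_i}
  &= -\sum_{i} \bigl[ x_i \log \hat{x}_i + (1 - x_i) \log(1 - \hat{x}_i) \bigr]
  && \text{(cross-entropy)}
\end{align*}
\end{document}

So maximizing the conditional likelihood under a Gaussian output model is minimizing squared error with a linear reconstruction, and under a Bernoulli output model it is minimizing cross-entropy with a sigmoid reconstruction.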
Actually that's not entirely accurate, apologies. The variance should be constant, but it doesn't matter which value, since it just scales the objective function.
(Mar 15 '13 at 09:11)
Sander Dieleman
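A one-line way to see why the particular value of the fixed variance does not matter, with the same assumed notation as the sketch above:

\documentclass{article}
\usepackage{amsmath}
\begin{document}
% For any fixed \sigma > 0, the 1/(2\sigma^2) factor and the additive constant
% in the Gaussian negative log-likelihood do not change where the minimum lies
% over the network parameters \theta, so \sigma only rescales the objective.
\[
\operatorname*{arg\,min}_{\theta}
  \left[ \frac{1}{2\sigma^2}\,\|x - \hat{x}_{\theta}\|^2 + \mathrm{const} \right]
  = \operatorname*{arg\,min}_{\theta}\; \|x - \hat{x}_{\theta}\|^2 .
\]
\end{document}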
Thanks. That helps a lot.
(Mar 15 '13 at 10:11)
pop0432