The notes for Andrew Ng's machine learning course (page 15) describe a sparsity term for an autoencoder based on KL divergence: sum over j: KL(r || r'[j]), where r'[j] is the activation of hidden unit j averaged over the training set, and r is a small target value, such as 0.05. Here, KL(r || r'[j]) is just a function that is 0 when r'[j] = r and greater than 0 otherwise. I can see how such a term encourages the average activations r'[j] to be close to r, but why would that force the activations to be sparse? The activations are not binary, so sparsity seems quite different from merely having a low average.
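In case it helps to make the term concrete, here is a minimal NumPy sketch of the penalty as I read it from the notes (the function name, variable names, and the Bernoulli form of the KL are my own rendering, not copied verbatim):

```python
import numpy as np

def kl_sparsity_penalty(hidden_activations, rho=0.05, eps=1e-12):
    """hidden_activations: array of shape (n_examples, n_hidden), values in (0, 1)."""
    rho_hat = hidden_activations.mean(axis=0)        # r'[j]: average activation of hidden unit j
    rho_hat = np.clip(rho_hat, eps, 1 - eps)         # guard against log(0)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return kl.sum()                                  # sum over hidden units j
```

The penalty is added to the reconstruction cost during training, so the gradient pushes each r'[j] toward r.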
On page 14 there is a paragraph which says:
Remember that the activation function is a sigmoid. Although the activations are not binary, as you remarked, almost all of them are very close to zero or to one. It is not impossible for a sparse unit (i.e. a sigmoid unit trained under the constraint that its average activation is small) to produce some values close to 0.5 rather than close to 0 or 1, but it is just not very likely to happen, due to the shape of the sigmoid: for most inputs its value is close to 0 or 1, and only for inputs very near 0 does it yield a value close to 0.5. So if a unit's activations are essentially near 0 or 1 and their average over the training set is constrained to be about 0.05, then the unit can be close to 1 on only roughly 5% of the training examples, which is exactly what sparsity means here.
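A small numeric sketch of that argument (the pre-activation values below are illustrative assumptions, not taken from the notes or from a trained network): a near-binary unit whose average activation is held near 0.05 can only be "on" rarely.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activations of one hidden unit over 20 training examples:
# one clearly positive input, the rest clearly negative.
z = np.array([4.0] + [-5.0] * 19)
a = sigmoid(z)

print(np.round(a, 3))        # one value near 1, the rest near 0
print(round(a.mean(), 3))    # the average activation comes out close to the 0.05 target

# To keep the average near 0.05 while the activations stay near 0 or 1,
# the unit can be "on" (close to 1) on only about 1 example in 20 -- i.e. it is sparse.
```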