The notes for Andrew Ng's machine learning course (page 15) describe a sparsity term for an autoencoder that is based on KL divergence:

sum over j: KL(r || r'[j])

where r'[j] is the activation of hidden unit j averaged over the training set, and r is a small target value, such as 0.05.

Here, KL(r || r'[j]) is just a function that is 0 if r = r'[j] and is greater than 0 otherwise.
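
(For concreteness, here is a minimal sketch of that penalty in Python, using the Bernoulli-KL form given in the notes; the function name and the example average activations are mine.)

    import numpy as np

    def kl_sparsity(r, r_hat):
        # KL divergence between Bernoulli(r) and Bernoulli(r_hat):
        # zero when r_hat == r, positive otherwise.
        return r * np.log(r / r_hat) + (1 - r) * np.log((1 - r) / (1 - r_hat))

    r = 0.05                                  # target sparsity
    r_hat = np.array([0.05, 0.2, 0.5, 0.9])   # hypothetical per-unit averages over the training set
    penalty = kl_sparsity(r, r_hat)           # approx. [0.00, 0.09, 0.49, 1.99]
    total = penalty.sum()                     # the "sum over j" term added to the loss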

I can see how such a term would encourage the average activations r' to be close to r, but why would the activations be forced to be sparse? The activations are not binary, and so sparsity is quite different from "low average".

asked Oct 12 '13 at 16:48

Max


One Answer:

On page 14 there is a paragraph which says:

Informally, we will think of a neuron as being "active" (or as "firing") if its output value is close to 1, or as being "inactive" if its output value is close to 0. We would like to constrain the neurons to be inactive most of the time. This discussion assumes a sigmoid activation function.

Remember that the activation function is a sigmoid. Although the activations are not binary, as you remarked, they are almost all very close to zero or to one.

It's not impossible that for a given sparse unit (i.e. a sigmoid unit trained with the constraint that its average activation is small) some of its values are close to 0.5 instead of close to 0 or 1. But it's just not very likely to happen, due to the nature of the sigmoid function: for almost any input from -inf to +inf its output is close to 0 or 1, and only for inputs very near 0 does it yield a value close to 0.5. Once the average activation is pushed down to r = 0.05, the bulk of the activations must therefore sit near 0, which is exactly sparsity.
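
(A small numeric sketch of this point in Python; the particular inputs and the 95/5 split are illustrative assumptions, not taken from the notes.)

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # The sigmoid saturates quickly: for |z| above roughly 4.6 its output is
    # already outside (0.01, 0.99), so activations pile up near 0 or 1.
    z = np.array([-10.0, -4.6, -1.0, 0.0, 1.0, 4.6, 10.0])
    print(np.round(sigmoid(z), 3))   # [0.    0.01  0.269 0.5   0.731 0.99  1.   ]

    # If almost every activation sits near 0 or 1, an average of 0.05 is only
    # reachable when roughly 95% of them are near 0 -- i.e. the unit is sparse.
    acts = np.array([0.0] * 95 + [1.0] * 5)
    print(acts.mean())               # 0.05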

answered Oct 14 '13 at 17:35

Saul Berardo
