Hi all,

I've implemented a DBN in C++* and it seems to work all right in principle (training with CD-1, basically following Geoffrey Hinton's practical guide to training RBMs on most of the basic points**). However, when trained on the MNIST data set, the first-level features learned by the network are a bit weird:

  • many (most) of the feature detectors aren't really used at all

  • some of them seem to brighten the entire visible layer

(see the linked images, which visualize the effect of each first-level hidden feature on the visible layer)

http://www.malte-probst.de/downloads/LrPerEx0.0015Mom0.9Ep20_BS20_EX30000.weights.jpg http://www.malte-probst.de/downloads/LrPerEx0.002Mom0.9Ep25_BS16_EX30000.weights.jpg
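
(For reference, a minimal sketch of how such a visualization can be produced: reshape each hidden unit's incoming weight vector into a 28×28 grayscale image and write it out, e.g. as a PGM file. The function name and weight layout here are illustrative only, not the actual code.)

```cpp
// Sketch: dump one hidden unit's incoming weights as a 28x28 grayscale PGM,
// linearly rescaled to [0, 255]. Assumes w holds 784 doubles (one per pixel).
#include <algorithm>
#include <fstream>
#include <string>
#include <vector>

void writeFeatureImage(const std::vector<double>& w, const std::string& path) {
    const auto [lo, hi] = std::minmax_element(w.begin(), w.end());
    std::ofstream out(path);
    out << "P2\n28 28\n255\n";                       // ASCII PGM header
    for (double v : w) {
        int gray = static_cast<int>(255.0 * (v - *lo) / (*hi - *lo + 1e-12));
        out << gray << '\n';
    }
}
```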

I've been playing around with the hyperparameters (learning rate, momentum, initial bias weights, number of epochs) for a while now, and I am not really sure whether there is still a bug in my setup, or whether it really takes that much fiddling around to learn a proper model.

The huge number of more or less unused detectors in particular is annoying. Maybe someone has a hint for me?

Cheers and thanks for your help!

Malte


(*) I've just found the ready-to-go Theano code, so I might switch to that implementation in the long run, but for now I would really like to understand what is going wrong here.

(**) Initializing the hidden bias weights to -4 (to encourage sparsity), and initializing the visible bias weight b_i to log[p_i / (1 − p_i)], where p_i is the fraction of training vectors in which unit i is on.
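
(For concreteness, a minimal sketch of that visible-bias initialization; the function name and data layout are made up for illustration.)

```cpp
// Sketch: set visible bias i to log(p_i / (1 - p_i)), where p_i is the
// fraction of training vectors in which pixel i is on (per Hinton's guide).
#include <algorithm>
#include <cmath>
#include <vector>

std::vector<double> initVisibleBiases(const std::vector<std::vector<double>>& data) {
    const std::size_t numVisible = data[0].size();
    std::vector<double> bias(numVisible, 0.0);
    for (std::size_t i = 0; i < numVisible; ++i) {
        double p = 0.0;
        for (const auto& example : data) p += example[i];
        p /= static_cast<double>(data.size());
        p = std::clamp(p, 1e-4, 1.0 - 1e-4);  // keep log finite when p is 0 or 1
        bias[i] = std::log(p / (1.0 - p));
    }
    return bias;
}
```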

asked Sep 02 '11 at 13:25


Malte Probst

edited Sep 02 '11 at 13:31

DBNs are known to take a lot of fiddling with the architecture and optimization hyperparameters to get right. Anecdotally, stacked autoencoders are easier to get going.

(Sep 02 '11 at 15:22) Alexandre Passos ♦

2 Answers:

Hey Malte. Actually, I think the first picture looks OK; it looks the way most filters trained with a sparsity target do. If you don't do the -4 initialization, your filters will look "denser". The second picture could be due to odd parameter settings, and tuning the parameters is very important. There is a fairly recent paper by KyungHyun Cho that discusses how to avoid having to find the right learning rate. For parameter settings, the defaults in our implementation are 500 hidden units, a learning rate of 0.01, and no weight decay, sparsity, or momentum.

What do samples from your model look like? I think this is a somewhat helpful way to see whether your algorithm works (a sketch of what I mean is below).
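
(For what it's worth, a minimal sketch of what drawing a fantasy sample from a single trained RBM could look like: run alternating Gibbs sampling from a random start and visualize the final visible probabilities. The Rbm struct and all names here are made up for illustration; for a full DBN you would run this kind of chain in the top-level RBM and then propagate down.)

```cpp
// Sketch: draw a fantasy sample from a trained binary RBM by alternating
// Gibbs sampling, starting from random visible states. Illustrative only.
#include <cmath>
#include <random>
#include <vector>

struct Rbm {                              // hypothetical container for learned weights
    std::vector<std::vector<double>> W;   // W[j][i]: hidden j <-> visible i
    std::vector<double> bVis, bHid;
};

std::vector<double> gibbsSample(const Rbm& rbm, int steps, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::vector<double> v(rbm.bVis.size()), pv(rbm.bVis.size());
    std::vector<double> h(rbm.bHid.size());
    for (auto& vi : v) vi = (u(rng) < 0.5);           // random start
    for (int t = 0; t < steps; ++t) {
        for (std::size_t j = 0; j < h.size(); ++j) {  // sample hidden | visible
            double a = rbm.bHid[j];
            for (std::size_t i = 0; i < v.size(); ++i) a += rbm.W[j][i] * v[i];
            h[j] = (u(rng) < 1.0 / (1.0 + std::exp(-a)));
        }
        for (std::size_t i = 0; i < v.size(); ++i) {  // sample visible | hidden
            double a = rbm.bVis[i];
            for (std::size_t j = 0; j < h.size(); ++j) a += rbm.W[j][i] * h[j];
            pv[i] = 1.0 / (1.0 + std::exp(-a));
            v[i] = (u(rng) < pv[i]);
        }
    }
    return pv;  // visualize the final visible probabilities as an image
}
```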

My lab also has some ready-made code. It is mainly intended for use with CUDA (as Theano is, I think) but can also be used on the CPU. If you want to work with RBMs, I suggest you try to get your hands on a recent NVIDIA card.

If you really want to know what your code is doing, you can also use the AIS and partition function calculations in our code to analyze how your model is learning.

answered Sep 02 '11 at 15:27


Andreas Mueller

Hi guys,

thanks for your answers. I had a major bug in the code that handles the top layer including the labels, so I don't have any proper samples yet (just random "strokes"). But the lowest-layer visualizations look all right now, once I stripped the learning of all tweaks like fancy weight initializations, momentum, etc., and changed the first downward pass in the CD-1 phase to stochastic sampling instead of deterministic reconstruction (see the sketch below). The feature list of your implementation also looks promising, so I will look into this as well!
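
(To make that last point concrete, here is a rough sketch of one CD-1 update with a stochastic downward pass. Names and data layout are illustrative only, not the actual code.)

```cpp
// Sketch: one CD-1 update on a single binary training vector, with a
// *stochastic* downward (reconstruction) pass. W[j][i] connects hidden j
// to visible i; v0 is the data vector; lr is the learning rate.
#include <cmath>
#include <random>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

static double sigmoid(double x) { return 1.0 / (1.0 + std::exp(-x)); }

void cd1Update(Mat& W, Vec& bVis, Vec& bHid, const Vec& v0,
               double lr, std::mt19937& rng) {
    std::uniform_real_distribution<double> u(0.0, 1.0);
    const std::size_t nv = v0.size(), nh = bHid.size();

    // Upward pass: hidden probabilities, plus *sampled* binary hidden states.
    Vec ph0(nh), h0(nh);
    for (std::size_t j = 0; j < nh; ++j) {
        double a = bHid[j];
        for (std::size_t i = 0; i < nv; ++i) a += W[j][i] * v0[i];
        ph0[j] = sigmoid(a);
        h0[j] = (u(rng) < ph0[j]);
    }

    // Downward pass: *sample* the reconstruction instead of using probabilities.
    Vec v1(nv);
    for (std::size_t i = 0; i < nv; ++i) {
        double a = bVis[i];
        for (std::size_t j = 0; j < nh; ++j) a += W[j][i] * h0[j];
        v1[i] = (u(rng) < sigmoid(a));
    }

    // Second upward pass: probabilities suffice for the negative statistics.
    Vec ph1(nh);
    for (std::size_t j = 0; j < nh; ++j) {
        double a = bHid[j];
        for (std::size_t i = 0; i < nv; ++i) a += W[j][i] * v1[i];
        ph1[j] = sigmoid(a);
    }

    // Gradient step: <v h>_data - <v h>_recon.
    for (std::size_t j = 0; j < nh; ++j)
        for (std::size_t i = 0; i < nv; ++i)
            W[j][i] += lr * (ph0[j] * v0[i] - ph1[j] * v1[i]);
    for (std::size_t i = 0; i < nv; ++i) bVis[i] += lr * (v0[i] - v1[i]);
    for (std::size_t j = 0; j < nh; ++j) bHid[j] += lr * (ph0[j] - ph1[j]);
}
```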

Cheers, Malte

answered Sep 08 '11 at 08:39


Malte Probst
