I am trying to reproduce Hinton's results from his paper "Improving neural networks by preventing co-adaptation of feature detectors". With an architecture of two hidden layers of 800 neurons each and 50% dropout in the hidden layers, he achieves a 1.3% error rate on the MNIST database. I tried to do the same using Pylearn2. I monitor with cross-validation and stop when the score hasn't improved for the past 100 iterations. It ran for 683 iterations. The final error rate on the test set is 2.02%. I just found an old post by Ian Goodfellow about this, but there is no indication that he succeeded. Here is my code:
The biggest problem is that you're using softplus. Softplus is terrible. Use RectifiedLinear instead. I did reproduce Hinton's results eventually: https://github.com/goodfeli/forgetting/tree/master/experiments/random_search_dropout_relu_mnist
I didn't use exactly the same hyperparameters reported in that paper, though. In particular, the learning rate they used in the paper seems too high.
PS: to clarify, I mean that Pylearn2 probably implements something slightly differently than they did, and the Pylearn2 implementation requires a lower learning rate. A lot of neural net algorithms are open to design choices that are equivalent modulo the choice of hyperparameters, and I must have made a different decision than the Toronto group at some point.
(Feb 18 '14 at 10:59)
Ian Goodfellow
PPS: In the old post you linked to, I had already succeeded in reproducing the results, but using a weird hack that Misha Denil accidentally introduced.
(Feb 18 '14 at 11:01)
Ian Goodfellow
Rectified linear units solved the issue. You used init_bias, which sounds like a good idea for ReLU. I looked at your workflow and find it really neat, but I don't understand why you prefer YAML files over Python scripts, which are more flexible. Is it less bug-prone? Or easier to modify, or to use with Jobman? PS: thanks for your work on Theano and Pylearn2. It is a really nice tool: easy to use, clean, efficient and state-of-the-art.
(Feb 25 '14 at 20:17)
Matthieu B
Matthieu, which hyperparameters are you using in the final ReLU model?
(Feb 25 '14 at 22:13)
eder
I used 2 hidden layers of 800 neurons each, with irange set to 0.05 on both. The gradient descent step size was 0.1, although 0.3 doesn't diverge either. Dropout was 50% on each layer. Training terminated when the cross-validated error stopped decreasing for 100 iterations.
(Feb 26 '14 at 17:25)
Matthieu B
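For reference, below is a rough sketch of the configuration Matthieu describes above, written as a plain Pylearn2 Python script rather than a YAML file. The class and argument names (RectifiedLinear, MonitorBased, the valid_y_misclass channel, the Dropout cost defaults, and so on) are recalled from the Pylearn2 API of that era and may not be exact; treat this as an illustration of the hyperparameters from the thread, not a verified script.

```python
# Sketch only: Pylearn2 class/argument names are from memory and may differ
# slightly from the real API; the point is the hyperparameters in the thread.
from pylearn2.datasets.mnist import MNIST
from pylearn2.models.mlp import MLP, RectifiedLinear, Softmax
from pylearn2.costs.mlp.dropout import Dropout
from pylearn2.training_algorithms.sgd import SGD
from pylearn2.termination_criteria import MonitorBased
from pylearn2.train import Train

# Two ReLU hidden layers of 800 units, irange 0.05, softmax output.
model = MLP(
    nvis=784,
    layers=[
        RectifiedLinear(layer_name='h0', dim=800, irange=0.05),
        RectifiedLinear(layer_name='h1', dim=800, irange=0.05),
        Softmax(layer_name='y', n_classes=10, irange=0.05),
    ],
)

# SGD with step size 0.1, the dropout cost (assuming its default include
# probability of 0.5, i.e. 50% dropout on each layer), and early stopping
# when validation misclassification has not improved for 100 epochs.
algorithm = SGD(
    learning_rate=0.1,
    batch_size=100,
    cost=Dropout(),
    monitoring_dataset={
        'valid': MNIST(which_set='train', start=50000, stop=60000),
    },
    termination_criterion=MonitorBased(
        channel_name='valid_y_misclass', prop_decrease=0.0, N=100),
)

Train(
    dataset=MNIST(which_set='train', start=0, stop=50000),
    model=model,
    algorithm=algorithm,
).main_loop()
```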
I also made a script to reproduce their 130-error (1.3%) and 110-error (1.1%) models. The only difference from their model (I believe) is that:
I am wondering what a standard speed is for this kind of problem. Mine takes around 17 seconds per iteration (using gnumpy and cudamat on a GTX 480 GPU). Here is a link to the script: github
keithzhou, I really like your code. Have you by any chance written a convnet implementation?
(Nov 19 '14 at 01:15)
michaelsb123
What are your learning rate and momentum annealing schemes?
The first thing that stood out to me is that you're using the softplus activation function. I'd always assumed that they used rectified linear units in that paper, but having skimmed it again I can't actually find any mention of the type of units they used. So that might not be it.
That said, in "Deep Sparse Rectifier Neural Networks" (2011), Glorot et al. showed that rectified linear units tend to outperform softplus units (and they're faster to train as well), so it might still be worth a try. They discuss the comparison on page 6 of the paper, 2nd column.
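For concreteness, here is a minimal NumPy sketch of the two activations being compared (the function names are just illustrative, not Pylearn2 API): softplus is a smooth, strictly positive approximation of the rectifier, while ReLU is exactly zero for negative inputs, which gives the sparse activations Glorot et al. argue for.

```python
import numpy as np

def softplus(x):
    # Smooth approximation of the rectifier: log(1 + exp(x)).
    # np.logaddexp(0, x) computes this without overflow for large x.
    return np.logaddexp(0.0, x)

def relu(x):
    # Rectified linear unit: max(0, x); exactly zero for negative inputs.
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(softplus(x))  # [0.1269 0.6931 2.1269] -- always strictly positive
print(relu(x))      # [0.     0.     2.    ]
```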
With rectified linear units I only get a 2.23% error rate. With irange=0.005 I get a 1.36% error rate after 279 iterations, so it can be considered consistent with Hinton's results. Thank you!