
I am trying to reproduce the results from Hinton's paper Improving neural networks by preventing co-adaptation of feature detectors. With an architecture of two hidden layers of 800 neurons each and 50% dropout in the hidden layers, he achieves a 1.3% error rate on the MNIST database.

I tried to do the same using Pylearn2. I monitor using cross-validation and stop when the score hasn't improved for the past 100 iterations. It ran for 683 iterations. The final error rate on the test database is 2.02%.

I just found an old post by Ian Goodfellow about it, but there is no indication that he succeeded.

Here is my code:

from __future__ import division
import os

import numpy as np

from pylearn2.train import Train
from pylearn2.datasets.mnist import MNIST
from pylearn2.models import mlp
from pylearn2.training_algorithms import sgd
from pylearn2.termination_criteria import MonitorBased
from pylearn2.train_extensions import best_params
from pylearn2.utils import serial
from pylearn2.costs.mlp.dropout import Dropout
from theano import function
from theano import tensor as T

h0 = mlp.Softplus(layer_name='h0', dim=800, sparse_init=40)
h1 = mlp.Softplus(layer_name='h1', dim=800, sparse_init=40)
#h2 = mlp.Softplus(layer_name='h2', dim=50, sparse_init=15)
ylayer = mlp.Softmax(layer_name='y', n_classes=10, irange=0)
layers = [h0, h1, ylayer]

model = mlp.MLP(layers, nvis=784)
train = MNIST('train', one_hot=1, start=0, stop=50000)
valid = MNIST('train', one_hot=1, start=50000, stop=60000)
test = MNIST('test', one_hot=1, start=0, stop=10000)

monitoring = dict(valid=valid)
termination = MonitorBased(channel_name="valid_y_misclass", N=100)
extensions = [best_params.MonitorBasedSaveBest(channel_name="valid_y_misclass",
save_path="train_best.pkl")]

algorithm = sgd.SGD(learning_rate=0.1, batch_size=100, cost=Dropout(),
                    monitoring_dataset=monitoring,
                    termination_criterion=termination)

save_path = "train_best.pkl"
if os.path.exists(save_path):
    model = serial.load(save_path)
else:
    print 'Running training'
    train_job = Train(train, model, algorithm, extensions=extensions, save_path="train.pkl", save_freq=1)
    train_job.main_loop()

X = model.get_input_space().make_batch_theano()
Y = model.fprop(X)

y = T.argmax(Y, axis=1)
f = function([X], y)
yhat = f(test.X)

y = np.where(test.get_targets())[1]
print 'accuracy', (y==yhat).sum() / y.size

asked Feb 13 '14 at 19:17

Matthieu B

What are your learning rate and momentum annealing schemes?

(Feb 13 '14 at 21:48) eder

The first thing that stood out to me is that you're using the softplus activation function. I'd always assumed that they used rectified linear units in that paper, but now that I skimmed it again I can't actually find any mention of the type of units they used. So that might not be it.

That said, in "Deep Sparse Rectifier Neural Networks (2011)" Glorot et al. showed that rectified linear units tend to outperform softplus units (and they're faster to train as well), so it might still be worth a try. They discuss the comparison on page 6 of the paper, 2nd column.
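For reference, the two activations differ only in how they behave around zero; a minimal NumPy comparison (just to illustrate, this is not Pylearn2 code):

```python
import numpy as np

def softplus(x):
    # smooth approximation of the rectifier: log(1 + e^x)
    return np.log1p(np.exp(x))

def relu(x):
    # rectified linear unit: max(0, x)
    return np.maximum(0.0, x)

x = np.array([-2.0, 0.0, 2.0])
print(softplus(x))  # smooth and strictly positive everywhere
print(relu(x))      # exactly zero for negative inputs -> sparse activations
```

The hard zero of the ReLU is what gives the sparse activations Glorot et al. argue for; softplus never switches units fully off.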

(Feb 14 '14 at 03:44) Sander Dieleman

With rectified linear units alone I still get a 2.23% error rate. With irange=0.005 I get a 1.36% error rate after 279 iterations, so it can be considered consistent with Hinton's results. Thank you!

(Feb 17 '14 at 14:56) Matthieu B

2 Answers:

The biggest problem is that you're using softplus. Softplus is terrible. Use RectifiedLinear instead.
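In terms of the question's code, the swap is a one-line change per layer; a sketch (the initialization values here are illustrative, not a prescription):

```python
# replace the Softplus layers with RectifiedLinear ones;
# a small uniform init (irange) is a common choice for ReLUs
h0 = mlp.RectifiedLinear(layer_name='h0', dim=800, irange=0.005)
h1 = mlp.RectifiedLinear(layer_name='h1', dim=800, irange=0.005)
```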

I did reproduce Hinton's results eventually: https://github.com/goodfeli/forgetting/tree/master/experiments/random_search_dropout_relu_mnist

I didn't use exactly the same hyperparameters reported in that paper though. In particular, the learning rate they used in the paper seems too high.

answered Feb 18 '14 at 10:56

Ian Goodfellow

PS: to clarify, I mean that probably Pylearn2 implements something slightly differently than they did, and the Pylearn2 implementation requires a lower learning rate. A lot of neural net algorithms are open to design choices that are equivalent modulo the choice of hyperparameters, and I must have made a different decision than the Toronto group at some point.

(Feb 18 '14 at 10:59) Ian Goodfellow

PPS: In the old post you linked to, I had already succeeded in reproducing the results, but using a weird hack that Misha Denil accidentally introduced.

(Feb 18 '14 at 11:01) Ian Goodfellow

Rectified linear units solved the issue. You used init_bias, which sounds like a good fit for ReLUs. I looked at your workflow and find it really neat, but I don't understand why you prefer YAML files over Python scripts, which are more flexible. Is it less bug-prone? Or easier to modify, or to use with Jobman?

PS: thanks for your work on Theano and Pylearn2. It is a really nice tool: easy to use, clean, efficient and state-of-the-art.

(Feb 25 '14 at 20:17) Matthieu B

Matthieu, which hyperparameters are you using in the final ReLU model?

(Feb 25 '14 at 22:13) eder

I used 2 layers of 800 neurons. irange was set to 0.05 on each. The gradient descent step size was 0.1, although 0.3 doesn't diverge either. Dropout was 50% on each layer. Training terminated when the cross-validated error stopped decreasing for 100 iterations.
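That termination rule is just patience-based early stopping; a framework-independent sketch (the `step` and `validate` callables stand in for whatever the framework provides):

```python
def train_with_patience(step, validate, patience=100):
    """Run training until the validation error fails to improve
    for `patience` consecutive iterations."""
    best_err = float('inf')
    since_best = 0
    iterations = 0
    while since_best < patience:
        step()               # one training iteration
        err = validate()     # current cross-validated error
        iterations += 1
        if err < best_err:
            best_err = err
            since_best = 0   # improvement: reset the patience counter
        else:
            since_best += 1
    return best_err, iterations
```

Pylearn2's MonitorBased criterion implements this idea on a monitoring channel; the sketch just makes the counting explicit.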

(Feb 26 '14 at 17:25) Matthieu B

I also made a script to reproduce their 130-error (1.3%) and 110-error (1.1%) models. The only differences from their model (I believe) are that I:

  1. used a 1.0 learning rate instead of 10;
  2. didn't use momentum.

I am wondering what a typical training speed is for this kind of problem. Mine takes around 17 seconds per iteration (using gnumpy and cudamat on a GTX 480 GPU).

Here is a link to the script: github
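The core dropout operation such a script implements can be sketched in NumPy. This is a generic "inverted dropout" sketch (scale the survivors at train time so the test-time forward pass needs no rescaling), not the actual code from the linked script:

```python
import numpy as np

def dropout_forward(activations, keep_prob=0.5, rng=np.random):
    """Zero each unit with probability 1 - keep_prob and scale the
    survivors by 1/keep_prob, so the expected activation matches
    the plain (mask-free) test-time forward pass."""
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

h = np.ones((4, 8))                      # a batch of hidden activations
dropped = dropout_forward(h, keep_prob=0.5)
# surviving entries are 2.0, the rest 0.0; the per-unit expectation stays 1.0
```

Note that Hinton's paper instead scales the weights by the keep probability at test time; the two conventions are equivalent in expectation.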

answered May 02 '14 at 22:38

keithzhou

keithzhou, I really like your code. Have you by any chance written a convnet implementation?

(Nov 19 '14 at 01:15) michaelsb123

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.