I have a problem understanding the Maxout Networks paper. The maxout activation function is given by

h_i(x) = max_{j in [1, k]} z_{ij}

where z_{ij} = x^T W_{ij} + b_{ij}.

So is it correct that for a single maxout unit I have to compute _k_ linear activations, and the weight update gets backpropagated only through the maximum one?

asked Nov 12 '13 at 13:43

Masala


One Answer:

Yes, exactly.

The maxout activation function works as follows: for each output unit you multiply the input vector by a matrix, which gives you a vector of k linear activations, and you pick its maximum coordinate. When backpropagating, only the weights that produced that maximum coordinate receive a gradient; the other pieces get zero gradient for that example.
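A minimal NumPy sketch of that forward/backward behavior for a single maxout unit (the shapes, variable names, and random values are my own illustration, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

d, k = 5, 3                      # input dimension, number of linear pieces per maxout unit
x = rng.standard_normal(d)       # input vector
W = rng.standard_normal((k, d))  # one weight vector per piece
b = rng.standard_normal(k)       # one bias per piece

# Forward pass: compute k linear activations, then take the maximum.
z = W @ x + b                    # shape (k,)
j_max = np.argmax(z)             # index of the winning piece
h = z[j_max]                     # maxout unit output

# Backward pass: given dL/dh, only the winning piece gets a gradient.
grad_h = 1.0                     # pretend upstream gradient
grad_W = np.zeros_like(W)
grad_b = np.zeros_like(b)
grad_W[j_max] = grad_h * x       # dL/dW_j = dL/dh * x for j = argmax, 0 otherwise
grad_b[j_max] = grad_h
grad_x = grad_h * W[j_max]       # gradient flows back only through the winning row
```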

answered Nov 12 '13 at 15:28

Alexandre Passos ♦

OK, thank you. May I ask one further question? Dropout happens on the x vector, is that right? So x becomes sparse, or equivalently we can write z_{ij} = x^T D W_{ij}, where D is a diagonal matrix encoding the dropout mask.

(Nov 12 '13 at 16:08) Masala

Yes, dropout makes x sparse, which is equivalent to z_{ij} = x^T D W_{ij} with a diagonal mask matrix D, as you pointed out.

(Nov 12 '13 at 19:25) Alexandre Passos ♦
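A small sketch of that equivalence, again with made-up names and a dropout probability chosen only for illustration: masking x elementwise gives the same pre-activations as multiplying by a diagonal mask matrix D.

```python
import numpy as np

rng = np.random.default_rng(1)

d, k = 5, 3
x = rng.standard_normal(d)
W = rng.standard_normal((k, d))
b = rng.standard_normal(k)

# Dropout on the input: zero each coordinate of x with probability p.
p = 0.5
mask = rng.random(d) > p                 # boolean keep-mask
x_dropped = x * mask                     # elementwise masking makes x sparse

# Equivalent formulation with a diagonal mask matrix D.
D = np.diag(mask.astype(float))
z_elementwise = W @ x_dropped + b
z_matrix = W @ (D @ x) + b               # per unit, this is x^T D W_{ij} + b_{ij}

assert np.allclose(z_elementwise, z_matrix)
```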