I have a problem understanding the Maxout Networks paper. The maxout activation function is given by h_i(x) = max_{j in [1, k]} z_{ij}, where z_{ij} = x^T W_{ij} + b_{ij}. Is it right that for a single maxout activation I have to compute k linear activations, and that the weight update gets backpropagated only to the maximum one?
Yes, exactly. The maxout activation works as follows: for each output you multiply the input vector by a matrix, which gives you a vector of k linear activations, and you take its maximum coordinate. When backpropagating, only the weights belonging to that maximum coordinate receive a gradient update.
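For concreteness, here is a minimal NumPy sketch of a single maxout unit along these lines; the shapes, variable names, and the dummy upstream gradient g are illustrative choices, not taken from the paper.

```python
# Minimal sketch of one maxout hidden unit: k linear pieces, take the max,
# and only the winning piece's parameters get a nonzero gradient.
import numpy as np

rng = np.random.default_rng(0)

d, k = 5, 3                      # input dimension, number of linear pieces (illustrative)
x = rng.normal(size=d)           # input vector
W = rng.normal(size=(d, k))      # column W[:, j] plays the role of W_ij
b = rng.normal(size=k)

# Forward pass: k linear activations z_ij = x^T W_ij + b_ij, then the max.
z = x @ W + b                    # shape (k,)
j_star = np.argmax(z)            # index of the winning linear piece
h = z[j_star]                    # maxout output h_i(x)

# Backward pass for an upstream gradient dL/dh = g: only the winning
# piece's weights and bias receive a nonzero gradient.
g = 1.0
dW = np.zeros_like(W)
db = np.zeros_like(b)
dW[:, j_star] = g * x
db[j_star] = g

print("winning piece:", j_star)
print("columns of W with nonzero gradient:", np.flatnonzero(dW.any(axis=0)))
```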
Ok, thank you. May I ask one further question? The dropout happens in the x vector, is that right? So x becomes sparse, and we can write z_{ij} = (Dx)^T W_{ij}, where D is a diagonal matrix encoding the dropout mask.
(Nov 12 '13 at 16:08)
Masala
Yes, dropout makes x sparse, which is equivalent to (Dx)^T W_{ij} as you pointed out.
(Nov 12 '13 at 19:25)
Alexandre Passos ♦
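As a small sanity check of this equivalence, the following sketch (with an illustrative drop probability p and my own variable names) compares masking x elementwise against multiplying by an explicit diagonal matrix D:

```python
# Dropout on the input x, written two ways: an elementwise Bernoulli mask on x,
# and an explicit diagonal matrix D applied to x before the linear map.
import numpy as np

rng = np.random.default_rng(1)

d, k = 5, 3
x = rng.normal(size=d)
W = rng.normal(size=(d, k))
b = rng.normal(size=k)

p = 0.5                                   # drop probability (illustrative)
mask = rng.random(d) > p                  # Bernoulli keep/drop mask on x
D = np.diag(mask.astype(float))           # the same mask as a diagonal matrix

z_masked = (mask * x) @ W + b             # dropout applied directly to x
z_diag = (D @ x) @ W + b                  # written as (D x)^T W_ij + b_ij

print(np.allclose(z_masked, z_diag))      # True: the two formulations agree
```

In practice, dropout implementations usually also rescale the kept units (or scale the weights at test time), which the sketch above omits for clarity.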