|
I am using a neural net (without hidden units, i.e. a linear model) with a softmax output for multiclass classification. I use stochastic gradient descent to update the parameters, and I am learning both the features and the weights simultaneously, so the input features are a vector of some fixed dimension. The problem I am facing is: if I keep the learning rate higher, like 0.01, 0.05, or 0.001, the softmax overflows and the parameters turn into NaN. If I keep a lower learning rate like 0.00001, my algorithm does not converge even after 50 or more epochs. I wonder how I could adjust the learning rate. Are there any methods to prevent softmax overflow? I have used the L2 norm with regularization parameter 0.01 to penalize large increases in both the features and the weights. Could you please suggest some literature that explains this kind of problem? [EDIT] The initial values of both the weights and the features are drawn with mean 0 and variance 0.01.
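Here is a toy example of the kind of overflow I am seeing (the logit values are made up just for illustration):

```python
import numpy as np

# With a larger learning rate the logits grow large, exp() overflows to inf,
# and the resulting probabilities (and then the parameters) become NaN.
logits = np.array([10.0, 50.0, 800.0])   # illustrative values only
e = np.exp(logits)                        # np.exp(800) overflows to inf
probs = e / e.sum()                       # inf / inf gives nan
print(probs)                              # [ 0.  0. nan]
```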
|
How do you initialize your weights? Are you using the fan-in rules from Efficient BackProp? That paper has many other important implementation tricks. Also, recent results by Schaul, Zhang and LeCun seem to show that it is possible to automatically tune the learning rate from the data in an online manner, although I have not yet implemented that technique myself.
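If it helps, here is a minimal sketch of what I mean by fan-in scaled initialization, roughly in the spirit of Efficient BackProp (the function name and shapes are just my own illustration):

```python
import numpy as np

def init_weights(fan_in, fan_out, rng=None):
    # Draw weights with mean 0 and standard deviation 1/sqrt(fan_in),
    # so that the pre-activations start out with roughly unit scale.
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(loc=0.0, scale=1.0 / np.sqrt(fan_in), size=(fan_in, fan_out))
```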
|
I've heard that setting per-coordinate learning rates with AdaGrad works really well. Essentially, you set the learning rate for each parameter to one constant divided by the square root of the sum of the squares of all the gradients you have seen for that parameter so far. This automatically compensates for things like different gradient magnitudes in different layers, and it's very efficient to implement (you keep one extra float per parameter, equal to the sum of the squares of all the previous gradients for that parameter). For best results you should reset this auxiliary vector every once in a while, if some learning rates have come down to zero.
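A rough sketch of the update I have in mind (the eps term is just a common safeguard against division by zero, not part of the description above):

```python
import numpy as np

def adagrad_update(w, grad, sq_grad_sum, base_lr=0.1, eps=1e-8):
    # sq_grad_sum is the one extra float per parameter: the running sum
    # of squared gradients seen so far for that parameter.
    sq_grad_sum += grad ** 2
    w -= base_lr * grad / (np.sqrt(sq_grad_sum) + eps)
    return w, sq_grad_sum
```

Resetting the auxiliary vector, as mentioned above, just means zeroing sq_grad_sum periodically.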
|
If you wish to solve a logistic regression problem and your dataset fits in memory, I recommend our latest algorithm, which converges quickly with a fixed learning rate (which can be set easily): A Stochastic Gradient Method with an Exponential Convergence Rate for Strongly-Convex Optimization with Finite Training Sets. On an unrelated note, be careful how you implement your softmax function. Since the output of the softmax does not change if you add a constant to all the elements of your vector, you might want to subtract the largest element from every entry before computing the exponentials.
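For reference, here is a rough sketch of the stochastic average gradient idea from that paper, as I would write it for binary logistic regression with labels in {-1, +1}; it is my own reconstruction rather than a reference implementation, and the step size should be chosen as described in the paper:

```python
import numpy as np

def sag_logistic(A, b, alpha, n_epochs=50, rng=None):
    # A: (n, d) data matrix, b: labels in {-1, +1}, alpha: fixed step size.
    rng = np.random.default_rng() if rng is None else rng
    n, d = A.shape
    w = np.zeros(d)
    grad_memory = np.zeros((n, d))   # last gradient seen for each example
    grad_sum = np.zeros(d)           # running sum of the stored gradients
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        margin = b[i] * A[i].dot(w)
        g = -b[i] * A[i] / (1.0 + np.exp(margin))   # gradient of the i-th loss
        grad_sum += g - grad_memory[i]              # swap out the old gradient
        grad_memory[i] = g
        w -= alpha * grad_sum / n                   # step along the average
    return w
```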
|
As Nicolas Le Roux alludes to, you might not be using a numerically stable softmax implementation. Below is an example in Python using NumPy. The axis parameter specifies which axis to do the normalization over.
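Here is a minimal sketch; the key step is subtracting the per-slice maximum before exponentiating, which leaves the result unchanged but keeps the exponentials from overflowing:

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtracting the maximum along the chosen axis does not change the
    # output (softmax is shift invariant) but prevents overflow in exp().
    z = np.asarray(z, dtype=float)
    z_shifted = z - np.max(z, axis=axis, keepdims=True)
    e = np.exp(z_shifted)
    return e / np.sum(e, axis=axis, keepdims=True)
```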
|