|
Hi, I have a log-linear model which I'm optimizing for minimum risk. I use stochastic gradient descent with an adaptive learning rate and mini-batches of randomly selected training instances. I don't use any regularization. When I plot the risk against the number of training iterations, it goes down and then plateaus, as expected. However, if I continue optimizing, after a while the risk starts increasing. I find this behavior very strange: since there is no regularization, shouldn't the model start overfitting? That is, I would expect the risk to keep going down or to plateau, but surely not to start increasing. Has anyone experienced something similar? |
|
What do you mean by optimizing for minimum risk? Do you mean min_params E_{x | params} L(x,x^*)? If so, for most model structures this objective is numerically difficult to optimize, so while a bug would be my first guess, you might just have a difficult function to optimize. Try the bump test: at a given input value, compute the gradient, take a tiny step along the negative gradient, and make sure your objective decreases. If it doesn't, you have some sort of gradient bug. If it does, then it's probably the case that the gradient is changing too quickly to optimize effectively. Just FYI, using SGD when you're not sure the code is correct seems like a bad idea, since online optimization adds its own sources of error. If you can, use a batch optimizer like L-BFGS or conjugate gradient to remove that source of uncertainty. |
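A minimal sketch of the bump test, in plain Python. The objective `f`, the two gradient functions, and the step size `1e-6` are made-up stand-ins for illustration, not anything from the original poster's model; the point is only that a correct gradient passes and a sign-flipped one fails.

```python
def bump_test(f, grad, x, step=1e-6):
    """Return True if a tiny step along the negative gradient lowers f."""
    g = grad(x)
    x_new = [xi - step * gi for xi, gi in zip(x, g)]
    return f(x_new) < f(x)


# Hypothetical toy objective: f(x) = sum_i (x_i - 1)^2, minimized at x_i = 1.
def f(x):
    return sum((xi - 1.0) ** 2 for xi in x)


def good_grad(x):
    # Correct gradient of f.
    return [2.0 * (xi - 1.0) for xi in x]


def buggy_grad(x):
    # Deliberate sign error, standing in for a gradient bug.
    return [-2.0 * (xi - 1.0) for xi in x]


x0 = [3.0, -2.0]
passes_with_good_grad = bump_test(f, good_grad, x0)
passes_with_buggy_grad = bump_test(f, buggy_grad, x0)
```

Note that the test is uninformative at a point where the gradient is (numerically) zero, so it is worth running at a few different inputs.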
|
If your model has reached the optimum and you have a large learning rate, then even batch gradient descent will flail around the optimum, attaining worse objective values rather than converging. With SGD this can be slightly worse. What happens is that, once it has reached a point close to a minimum of the average objective, each training example pushes your classifier towards classifying that example better at the expense of the others. Hence, the observed objective can get worse, but not by much. You can try using minibatches to improve this, or regularization, or a smaller learning rate. |
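A toy illustration of that effect, assuming a one-parameter least-squares objective rather than the log-linear model from the question. The data, the constant step of 0.5, and the 1/t schedule are all invented for the sketch, and the examples are visited in a fixed order (not randomly) so the result is reproducible.

```python
def objective(w, data):
    """Average of 0.5 * (w - x)^2 over the data; minimized at the mean."""
    return sum(0.5 * (w - x) ** 2 for x in data) / len(data)


def sgd(data, step_size, w=0.0):
    """One pass of SGD, one example per update; step_size maps t -> rate."""
    for t, x in enumerate(data, start=1):
        grad = w - x  # gradient of 0.5 * (w - x)^2 for this single example
        w -= step_size(t) * grad
    return w


# Two kinds of examples that pull w in opposite directions (mean is 2.0).
data = [0.0, 4.0] * 50

w_const = sgd(data, lambda t: 0.5)      # constant, fairly large step
w_decay = sgd(data, lambda t: 1.0 / t)  # decaying step

gap_const = objective(w_const, data) - objective(2.0, data)
gap_decay = objective(w_decay, data) - objective(2.0, data)
```

With the constant step, each example drags `w` towards itself, so `w` keeps bouncing between two values on either side of the mean and `gap_const` stays clearly positive. With the 1/t schedule the update is a running mean, so `gap_decay` is essentially zero.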
|
I think you're right to be suspicious -- what you describe sounds like a bug. Maybe your examples aren't shuffled? |
|
I've seen this behavior, where the objective keeps increasing. It can happen when the step size is (still) too large. You are near the bottom of a valley (but not exactly at the bottom point, where the gradient would be zero), you take a gradient step in the correct direction, but it's too large, so the objective function goes up. At the new location, the gradient points sharply back, so you take a relatively large step back, which raises the objective again; you go back to the other side of the valley, and so on. Try another learning rate (gain) schedule, one that decreases more sharply. I hate gradient descent for these reasons. Having said that, it can't hurt to check for a gradient computation bug. The best method is finite differences: compute ( f(x+eps)-f(x) ) / eps and compare it to your gradient at x.
You can always use a line search method such as backtracking line search to find a step size (between 0 and 1) to multiply with the gradient (or with the search direction from any optimization algorithm), to make sure that you don't overstep and reach a point where f(xNew) > f(xOld).
(Dec 06 '10 at 21:34)
Aman
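The two suggestions above (a finite-difference gradient check and a backtracking line search) can be sketched as follows. This is plain Python on a hypothetical quadratic; the function names, the `shrink` factor of 0.5, and the sufficient-decrease constant `c` are illustrative choices, not taken from the posts.

```python
def finite_difference_grad(f, x, eps=1e-6):
    """Approximate each partial derivative as (f(x + eps*e_i) - f(x)) / eps."""
    fx = f(x)
    approx = []
    for i in range(len(x)):
        bumped = list(x)
        bumped[i] += eps
        approx.append((f(bumped) - fx) / eps)
    return approx


def backtracking_step(f, x, g, alpha=1.0, shrink=0.5, c=1e-4, max_tries=50):
    """Shrink alpha until the step along -g gives a sufficient decrease in f."""
    fx = f(x)
    g_sq = sum(gi * gi for gi in g)
    for _ in range(max_tries):
        x_new = [xi - alpha * gi for xi, gi in zip(x, g)]
        if f(x_new) <= fx - c * alpha * g_sq:  # Armijo condition
            return x_new, alpha
        alpha *= shrink
    return x, 0.0  # no acceptable step found; stay put


# Hypothetical narrow valley: steep along x[0], shallow along x[1].
def f(x):
    return 10.0 * x[0] ** 2 + x[1] ** 2


def grad(x):
    return [20.0 * x[0], 2.0 * x[1]]


x_old = [1.0, 1.0]
fd = finite_difference_grad(f, x_old)
max_grad_error = max(abs(a - b) for a, b in zip(fd, grad(x_old)))

x_new, alpha = backtracking_step(f, x_old, grad(x_old))
```

On this example a full step (alpha = 1) overshoots badly to the other side of the valley, and the search halves alpha until the objective actually drops, so `f(x_new) < f(x_old)` holds by construction.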
|