I've just spent a week debugging an implementation of the MaxEnt classifier (aka multinomial logistic regression, exponential models, etc.) and finally "succeeded" by putting in random start values for its weights vector. The strange thing is, none of the implementations I could get my hands on use random initialization, nor do any of the textbooks and papers I've consulted mention it. If I start from an all-zero weight vector, the model converges to roughly the uniform distribution. (I'm using the LBFGS optimization algorithm, specifically the SciPy wrapper around Nocedal's original Fortran implementation, which to me is a complete black box.) Using SciPy's conjugate gradient method yielded the same results. I'm still wondering whether this is a bug in SciPy's optimizers or an oversight on my part; is there a smart way of initializing the weights in MaxEnt?
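For reference, here is a minimal self-contained sketch of the setup being described (synthetic data and illustrative function names, not the actual scikit-learn branch code): the multinomial logistic regression objective optimized with SciPy's LBFGS wrapper, once from a zero start and once from a small random start. Since the objective is convex, a correct gradient should take both starts to the same optimum.

```python
import numpy as np
from scipy.optimize import fmin_l_bfgs_b

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # shift rows for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def nll_and_grad(w, X, Y):
    """Negative log-likelihood of multinomial LR and its gradient."""
    d, k = X.shape[1], Y.shape[1]
    W = w.reshape(d, k)
    P = softmax(X @ W)
    loss = -np.sum(Y * np.log(P + 1e-12))
    grad = X.T @ (P - Y)                      # gradient of the log loss
    return loss, grad.ravel()

rng = np.random.RandomState(0)
X = rng.randn(100, 5)
Y = np.eye(3)[rng.randint(0, 3, size=100)]    # one-hot labels

results = {}
for name, w0 in [("zeros", np.zeros(5 * 3)),
                 ("random", 0.01 * rng.randn(5 * 3))]:
    # fmin_l_bfgs_b accepts a func that returns (value, gradient)
    w_opt, f_opt, info = fmin_l_bfgs_b(nll_and_grad, w0, args=(X, Y))
    results[name] = f_opt
    print(name, f_opt, info["warnflag"])
```

If the two final objective values differ noticeably, the gradient (or the objective) is suspect, not the initialization.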
Does the optimization converge completely? I have noticed that my logistic regression (LR) implementation has trouble converging with SciPy's conjugate gradient when the solution is far from the initial values. The trouble seems to originate in the line search step --- it cannot jump far enough. When it tells me it has converged, it displays the kind of performance I expect from a MaxEnt classifier. Otherwise, it terminates with "abnormal line search" and performs miserably. Unfortunately, the line search parameters can't be set, can they?
(Oct 18 '11 at 04:43)
larsmans
I get "Warning: Desired error not necessarily achieved due to precision loss" with fmin_cg. I don't think SciPy's line search can be adapted (and even if it could, how?). I was planning to write a Python CG implementation, since that should just work, but your LBFGS wrapper looks even more interesting :)
(Oct 18 '11 at 07:43)
Bwaas
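As a side note, the termination status can be inspected programmatically instead of relying on the printed warning: with `full_output=1`, `fmin_cg` returns a `warnflag` (0 = converged, 1 = max iterations, 2 = the "precision loss" case above). A sketch on a stand-in quadratic objective (not the real LR objective):

```python
import numpy as np
from scipy.optimize import fmin_cg

def f(w):
    # simple convex quadratic with minimum at the all-ones vector
    return 0.5 * np.dot(w, w) - np.dot(w, np.ones_like(w))

def g(w):
    return w - np.ones_like(w)

w0 = np.zeros(5)
w_opt, f_opt, func_calls, grad_calls, warnflag = fmin_cg(
    f, w0, fprime=g, full_output=True, disp=False)
print(warnflag)   # 0 here; 2 would signal the line-search/precision failure
```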
In the fully supervised case the logistic regression likelihood is a concave function, so there is only one optimum and there shouldn't be any problems with starting from the zero vector, as far as I know. When you use random initialization, are you sure it converges to the global optimum? Note that when you start with the zero vector, you start with the uniform label distribution for all instances.

It doesn't always find the global optimum, but usually it does. Starting from a zero vector just always fails miserably: LBFGS tries to get out of the uniform distribution (the gradient is non-zero there), but then apparently falls back to it because its line search can't find a better coefficient.
(Oct 18 '11 at 04:47)
larsmans
This does seem like a bug in the scipy LBFGS code. Did you contact the scipy people about this?
(Oct 18 '11 at 05:08)
Oscar Täckström
Nope, I haven't contacted the SciPy people. I'm first trying a different implementation of LBFGS (https://github.com/larsmans/pylbfgs), if that works I'll get in touch with them.
(Oct 18 '11 at 06:56)
larsmans
Isn't the gradient computation code small enough that you can paste it into the post? By the way, when I implemented a CRF with stochastic gradient optimization I had to explicitly normalize the marginal distributions to sum to one, because performing the normalization in the log domain left very small numerical errors that prevented the optimization from converging.
(Oct 18 '11 at 07:45)
Oscar Täckström
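The renormalization trick mentioned above can be sketched like this (made-up numbers, not CRF code): normalize in the log domain via log-sum-exp to avoid overflow, then renormalize once more in the linear domain so the probabilities sum to exactly one.

```python
import numpy as np

def log_normalize(log_scores):
    """Probabilities from unnormalized log-scores via log-sum-exp."""
    m = log_scores.max()
    log_z = m + np.log(np.exp(log_scores - m).sum())
    p = np.exp(log_scores - log_z)
    return p / p.sum()   # explicit renormalization in the linear domain

log_scores = np.array([1000.0, 1000.5, 999.0])   # naive exp() would overflow
p = log_normalize(log_scores)
print(p, p.sum())
```

Without the final division, `p.sum()` can differ from 1.0 by a few ulps, which is exactly the kind of tiny error that can stall an optimizer's convergence test.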
@Oscar: the gradient computation requires too much scikit-learn machinery for handling scipy.sparse matrices to post here, but it can be seen at https://github.com/larsmans/scikit-learn/blob/maxent/sklearn/maxent.py
(Oct 18 '11 at 10:22)
larsmans
The gradient you're computing is not that of the log loss but of the squared loss on the predicted probabilities, which may be why your optimizer is having difficulties. Instead of computing (Y_pred - Y) dot X, that line should compute P(Y not equal to the true Y) dot X.
(Oct 18 '11 at 11:13)
Alexandre Passos ♦
Or maybe the negative of what I said above, as I'm not sure whether you're maximizing or minimizing.
(Oct 18 '11 at 11:15)
Alexandre Passos ♦
There is something funny going on and you probably have a bug in your code. As Oscar said, there is no symmetry to break in multinomial logistic regression and the problem is convex, so you should always be able to converge to the optimum. How are you computing your objective function and your gradients? Have you checked your gradients by comparing them with a numeric gradient of your objective function?

+1 for checking your gradient against numerical gradients. For example, with the central finite difference you have f'(z) approx (f(z + epsilon) - f(z - epsilon)) / (2 epsilon), where epsilon is a small value.
(Oct 18 '11 at 07:13)
Mathieu Blondel
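The suggested check can be sketched as follows (illustrative function names; the softmax log loss stands in for the actual objective): compare the analytic gradient X^T (P - Y) against central finite differences, coordinate by coordinate.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def loss(w, X, Y):
    W = w.reshape(X.shape[1], Y.shape[1])
    return -np.sum(Y * np.log(softmax(X @ W)))

def grad(w, X, Y):
    W = w.reshape(X.shape[1], Y.shape[1])
    P = softmax(X @ W)
    return (X.T @ (P - Y)).ravel()

def numeric_grad(f, w, eps=1e-6):
    """Central difference: f'(z) ~ (f(z + eps) - f(z - eps)) / (2 eps)."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

rng = np.random.RandomState(0)
X = rng.randn(20, 4)
Y = np.eye(3)[rng.randint(0, 3, size=20)]
w = rng.randn(4 * 3)

g_analytic = grad(w, X, Y)
g_numeric = numeric_grad(lambda v: loss(v, X, Y), w)
print(np.max(np.abs(g_analytic - g_numeric)))   # should be tiny
```

If the maximum difference is not close to zero, the analytic gradient (or the objective it is supposed to match) is wrong.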