Are these two approaches to $\ell_2$-regularized logistic regression in scikit-learn equivalent?

From my understanding, scikit-learn has a regularized logistic regression module based on LibLinear (called LogisticRegression) and a second one based on stochastic gradient descent (called SGDClassifier).

The LibLinear-based solver solves the following problem:

$$ \min_{w} \; C \sum_{i} \log\left(1 + e^{-y_i X_i^T w}\right) + \frac{1}{2} w^T w $$

while the SGD solver minimizes $$ \min_{w} \; \sum_{i} \log\left(1 + e^{-y_i X_i^T w}\right) + \frac{\alpha}{2} w^T w, $$

where we are assuming that the labels $y_i \in \{-1, +1\}$.

If the above is correct regarding which problem each technique is solving, then they should be equivalent when $C \gets \frac{1}{\alpha}$.

However, when I run both of these solvers on even very small, low-dimensional classification problems, I don't get the same weights $\hat{w}$, nor does it appear that either objective is actually minimized.

I wrote a Python script that compares the solutions returned by the LibLinear approach, the SGD approach, and my own implementation in CVXPY.
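The comparison is essentially along these lines (a minimal sketch with made-up toy data, written against a recent scikit-learn API where the log loss is spelled "log_loss" and the solver keyword exists; alpha = 1/C reflects the equivalence I am assuming above, and the cvxpy piece is similar, see the comments below):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

# Toy data standing in for my actual 100-point, 1-feature problem.
rng = np.random.RandomState(0)
X = rng.randn(100, 1)
y = np.sign(X[:, 0] + 0.5 * rng.randn(100))

C = 1.0

def liblinear_objective(w, b, X, y, C):
    # C * sum_i log(1 + exp(-y_i (x_i^T w + b))) + 0.5 * ||w||^2
    margins = y * (X @ w + b)
    return C * np.sum(np.log1p(np.exp(-margins))) + 0.5 * np.dot(w, w)

liblin = LogisticRegression(C=C, solver="liblinear", tol=1e-10).fit(X, y)
sgd = SGDClassifier(loss="log_loss", penalty="l2",
                    alpha=1.0 / C,        # alpha = 1/C per the assumed equivalence
                    max_iter=10000, tol=1e-10).fit(X, y)

for name, est in [("LogisticRegression", liblin), ("SGDClassifier", sgd)]:
    w, b = est.coef_.ravel(), est.intercept_[0]
    print(name, (b, w), liblinear_objective(w, b, X, y, C))
```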

Here are the results for the three techniques:

  1. CVXPY

    • Coefficients=(array([-0.39929211]), array([[ 0.02260313]]))

    • Loss=59.0909466078

  2. LogisticRegression

    • Coefficients=(array([-0.03588511]), array([[ 0.03284387]]))

    • Loss=67.3132269213

  3. SGDClassifier

    • Coefficients=(array([-0.17192813]), array([[ 0.00436165]]))

    • Loss=62.4807117467

Now, I understand that gradient methods can get within a neighborhood of the global minimum fairly quickly but may require many iterations to converge to it.

I also understand that there may be additional parameter settings for SGD that lead to smaller losses. Still, I am a bit surprised that neither the LibLinear approach nor the SGD approach works well out of the box on what is essentially a tiny problem (there are only 2 parameters to learn and 100 training points!).

So I suspect either a programming error on my part, or something else important that I am missing.

asked Feb 16 '13 at 07:31 by Pads Niels

One Answer:

My favorite... Yes, the problems are equivalent, but as always, the devil is in the details.

LogisticRegression works with dual=True by default, so it does not optimize the objective above directly; it optimizes the dual (the optima are the same, but the path there differs). There is also a 'tol' parameter that controls the precision; try turning that down.
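Something like this is what I mean (dual, tol and C are all real LogisticRegression parameters; the data here is just a placeholder for yours):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data; substitute your own X, y.
X = np.random.randn(100, 1)
y = np.sign(X[:, 0] + 0.5 * np.random.randn(100))

# Solve the primal formulation with a much tighter tolerance.
clf = LogisticRegression(penalty="l2", dual=False, tol=1e-10, C=1.0).fit(X, y)
print(clf.intercept_, clf.coef_)
```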

The other issue might be the scaling of the penalty: whether C gets multiplied or divided by n_samples, which often serves as a source of great pleasure... not.

I have an implementation of a structured SVM in my pystruct package. It has two dual solvers based on cvxopt and a stochastic subgradient solver, and I could reproduce the behavior of LibLinear with both (for the hinge loss, but still) on binary and multi-class SVM problems (Crammer-Singer SVM).

In the stochastic subgradient version, I needed to multiply C by the number of samples to get the behavior of LibLinear back.

For SGDClassifier, convergence depends a lot on alpha and eta0, but with enough iterations I think you should not see much difference from LibLinear on such a toy problem.
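Roughly what I have in mind, as a sketch (the alpha = 1 / (C * n_samples) mapping and the concrete eta0 / iteration values are things to tune, not a guaranteed recipe):

```python
from sklearn.linear_model import SGDClassifier

C, n_samples = 1.0, 100
alpha = 1.0 / (C * n_samples)   # rescale the penalty by the number of samples

sgd = SGDClassifier(loss="log_loss",          # spelled "log" in older releases
                    penalty="l2",
                    alpha=alpha,
                    learning_rate="constant",
                    eta0=0.01,                # convergence is quite sensitive to this
                    max_iter=100000,
                    tol=1e-10)
# sgd.fit(X, y) on the same toy data as in the question
```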

Hth, Andy

answered Feb 16 '13 at 14:53 by Andreas Mueller

Oh, btw, liblinear is totally non-deterministic. Try varying the seed and have fun interpreting the curve... I did that for quite some time before finding my mistake.

(Feb 16 '13 at 14:54) Andreas Mueller

Hi Andy,

SGDClassifier actually seems to be outperforming LibLinear with the settings I used... in the post above, the losses came out as CVXPY < SGD < LibLinear.

Also, I'm setting the tolerance for LogisticRegression() (aka the LibLinear wrapper) to 1e-16. If the tolerance roughly maps to a duality gap, then after setting it that low I should be closer to the value returned by CVXPY.

(Feb 16 '13 at 18:17) Pads Niels

Yeah, tol should be the duality gap. So it might still be that the scaling of C is different. What is the advantage of cvxpy over cvxopt, btw? Are you solving the primal?

(Feb 16 '13 at 18:58) Andreas Mueller

Basically, the only reason I implemented logistic regression in cvxpy was that I didn't understand why LibLinear and SGD were giving such vastly different results, and I was hoping there would be agreement between cvxpy and one of the scikit-learn techniques. Alas...

I only chose to implement in cvxpy rather than cvxopt because I found it much faster to write the code. Cvxpy actually uses cvxopt under the hood.
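For what it's worth, the cvxpy model is essentially the following (rewritten here as a sketch against the current cvxpy API; my original code used the older cvxopt-backed interface, so the exact names differed):

```python
import numpy as np
import cvxpy as cp

# Toy data standing in for the 100-point problem above.
X = np.random.randn(100, 1)
y = np.sign(X[:, 0] + 0.5 * np.random.randn(100))
C = 1.0

w = cp.Variable(X.shape[1])
b = cp.Variable()

# LibLinear-style primal: C * sum_i log(1 + exp(-y_i (x_i^T w + b))) + 0.5 * ||w||^2
# cp.logistic(z) is log(1 + exp(z)), so feed it the negated margins.
loss = cp.sum(cp.logistic(-cp.multiply(y, X @ w + b)))
cp.Problem(cp.Minimize(C * loss + 0.5 * cp.sum_squares(w))).solve()

print(b.value, w.value)
```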

(Feb 16 '13 at 19:06) Pads Niels