• I am fitting a Logistic Regression model to a Kaggle dataset and get the following classification metrics
  • [0 means the driver was not alert, 1 means the driver was alert]

    C = 1.9

    Cross Validation Set metrics
                 precision    recall  f1-score   support

              0       0.83      0.73      0.78     51003
              1       0.82      0.89      0.85     69863

    avg / total       0.82      0.82      0.82    120866

    Test Set metrics
                 precision    recall  f1-score   support

              0       0.82      0.73      0.77     50590
              1       0.82      0.89      0.85     70276

    avg / total       0.82      0.82      0.82    120866
  • The dataset is normalized to zero mean and unit standard deviation

  • I iterate over C values from 0.1 to 10.0 in steps of 0.2, but the classification metrics stay the same:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

for i in np.arange(0.1, 10.0, 0.2):
    lr = LogisticRegression(C=i, penalty='l1')
    model = lr.fit(training[:, 0:-1], training[:, -1])
    cv_predict = model.predict(cv[:, 0:-1])      # predictions on the CV set
    test_predict = model.predict(test[:, 0:-1])  # predictions on the test set
    print 'C=', i
    print classification_report(cv[:, -1], cv_predict)
    print classification_report(test[:, -1], test_predict)
    print '----------------------------------------------------------'
  • What steps can I take to raise the precision/recall?

asked Feb 21 '12 at 23:55

daydreamer

One Answer:

There are different things you can do to improve Logistic Regression, depending on how messy you want to get with the code.

I see you are using "l1" as the penalty. Why? Is your dataset sparse? If not, you may get better results with l2, as in the sketch below.
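A minimal sketch of that comparison (reusing the training/cv arrays from the question; solver='liblinear' is an assumption here, chosen because it supports both penalties):

    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import f1_score

    # Fit the same model under both penalties and compare on the CV set.
    for penalty in ('l1', 'l2'):
        lr = LogisticRegression(C=1.9, penalty=penalty, solver='liblinear')
        model = lr.fit(training[:, 0:-1], training[:, -1])
        cv_predict = model.predict(cv[:, 0:-1])
        print("%s: f1 = %.3f" % (penalty, f1_score(cv[:, -1], cv_predict)))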

When you use Cross Validation, how many folds are you using? 2, 3, 4?
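As a rough sketch, you can score the same model at several fold counts (cross_val_score lives in sklearn.model_selection in recent scikit-learn; older releases used sklearn.cross_validation):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = training[:, 0:-1], training[:, -1]
    # More folds means each model trains on more of the data,
    # at the cost of more fits.
    for k in (2, 3, 5, 10):
        scores = cross_val_score(LogisticRegression(C=1.9), X, y, cv=k, scoring='f1')
        print("%d folds: mean f1 = %.3f" % (k, scores.mean()))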

You have to check whether you have a high-variance error or a high-bias error.

If you have high variance, using more data or a smaller set of features might be a good idea.

If you have high bias, you can try looking for more features. A learning curve, sketched below, helps tell the two apart.
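Here is one way to draw that diagnosis (a sketch; learning_curve is in sklearn.model_selection in recent versions, and X, y are the question's training features and labels):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import learning_curve

    X, y = training[:, 0:-1], training[:, -1]
    # A persistent gap between training and validation scores suggests
    # high variance; two low, converged curves suggest high bias.
    sizes, train_scores, val_scores = learning_curve(
        LogisticRegression(C=1.9), X, y,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=3)
    for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
        print("n=%d  train=%.3f  validation=%.3f" % (n, tr, va))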

You can also try modifying the weight of the "penalty" parameter (in scikit-learn this is C, the inverse of the regularization strength).
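Since C is usually searched on a log scale over a wide range, and with ~120k training rows the penalty may simply have little effect (which would explain the flat metrics), a grid search is worth a try. A sketch (GridSearchCV from sklearn.model_selection in recent versions):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV

    X, y = training[:, 0:-1], training[:, -1]
    # Search C on a log scale and both penalties at once.
    grid = GridSearchCV(LogisticRegression(solver='liblinear'),
                        {'C': np.logspace(-4, 4, 9), 'penalty': ['l1', 'l2']},
                        scoring='f1', cv=3)
    grid.fit(X, y)
    print(grid.best_params_)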

If you have access to the source code, you can also try using other optimizers.

Since it is Kaggle, you can always try other algorithms, like SVMs or Gaussian Processes.
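For example, a linear SVM is nearly a drop-in replacement on the same arrays (a sketch; LinearSVC scales to the ~120k rows here, while an RBF-kernel SVC would be slow at that size):

    from sklearn.svm import LinearSVC

    X, y = training[:, 0:-1], training[:, -1]
    # Same fit/predict/score interface as LogisticRegression.
    svm = LinearSVC(C=1.0).fit(X, y)
    print("CV accuracy: %.3f" % svm.score(cv[:, 0:-1], cv[:, -1]))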

Here is a link with some rules of thumb to improve your results in a classification setting.

answered Feb 22 '12 at 00:32

Leon Palafox ♦

What specifically does "penalty" mean, and when should we use it? What do you mean by folds in Cross Validation? Also, thank you Leon for the details.

(Feb 22 '12 at 00:50) daydreamer

In regression problems (linear or logistic), the weights can get out of control (keep growing) if you do a simple, unconstrained optimization. To prevent this, we add penalty (or regularization) terms, which "clamp" the growth of the weights. Without this clamping, your model would be a perfect fit for the training data but a terrible fit for your test data (that is called overfitting). http://en.wikipedia.org/wiki/Overfitting

The clamp keeps your model flexible enough to fit new data. We call this "preventing overfitting".

Basically, the model you end up with is not a perfect fit for your training data, but it will be a better fit for new data.

Common examples of penalties are l1 and l2 regularization, which measure the magnitude of your parameter set in two different spaces.
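Concretely (a sketch in LaTeX notation, writing $\lambda$ for the penalty weight; scikit-learn's C is its inverse), the two penalized objectives are:

    \min_w \; \mathrm{loss}(w) + \lambda \lVert w \rVert_1    \quad \text{(l1: sum of } |w_i| \text{)}
    \min_w \; \mathrm{loss}(w) + \lambda \lVert w \rVert_2^2  \quad \text{(l2: sum of } w_i^2 \text{)}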

I hope this is clear.

(Feb 22 '12 at 00:55) Leon Palafox ♦

Totally understood this, thank you Leon. I will try out the suggestions and share my results. Thank you again, much appreciated!

(Feb 22 '12 at 01:01) daydreamer

Sure, good luck

(Feb 22 '12 at 01:02) Leon Palafox ♦

BTW, I forgot: folds in Cross Validation are the number of parts you divide your data into for cross-checking. http://en.wikipedia.org/wiki/Cross-validation_(statistics) Usually libraries have a default of 2, but perhaps you can change that to more and see what happens.
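A quick sketch of what a 3-fold split does (KFold is in sklearn.model_selection in recent scikit-learn; the toy X here is just for illustration):

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(12).reshape(6, 2)  # 6 toy samples
    # Each row ends up in the validation set exactly once.
    for train_idx, val_idx in KFold(n_splits=3, shuffle=True).split(X):
        print("train rows: %s  validation rows: %s" % (train_idx, val_idx))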

(Feb 22 '12 at 01:04) Leon Palafox ♦

cool, thank you

(Feb 22 '12 at 01:13) daydreamer

Could you clarify what you mean by 'spaces' in "...are l1 and l2 regularization, which measure the magnitude of your parameter set in two different spaces." please?

(Feb 22 '12 at 16:36) Vam

L2 and L1 are metrics; a regularizer is basically the sum of all the magnitudes of your weights, and the magnitude of a vector is its distance to the origin.

L1 and L2 define different spaces with different metrics, meaning the distance is computed by a different operation in each.

There are as many Lp spaces as you want, but L1 and L2 are the ones commonly used in ML settings.

http://en.wikipedia.org/wiki/L1-norm
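A tiny numeric sketch of the two norms:

    import numpy as np

    w = np.array([3.0, -4.0])
    print("L1 norm: %.1f" % np.abs(w).sum())          # |3| + |-4| = 7.0
    print("L2 norm: %.1f" % np.sqrt((w ** 2).sum()))  # sqrt(9 + 16) = 5.0
    print("%.1f %.1f" % (np.linalg.norm(w, 1), np.linalg.norm(w, 2)))  # same via numpy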

(Feb 22 '12 at 17:48) Leon Palafox ♦
