Is there a better classifier (than scikit-learn's LogisticRegression) implemented in Python (in scikit-learn, mlpy, NLTK, Orange, etc.) that can (i) handle scipy.sparse matrices, (ii) produce (something close to) probabilities, and (iii) work with multiclass classification?

asked Jul 23 '13 at 15:37


turn chang

edited Jul 23 '13 at 15:39
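For reference, the three constraints in the question can be checked against the asker's current baseline with a minimal sketch (hypothetical toy data, standard scikit-learn API):

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression

# hypothetical toy data: sparse features, 3 classes
rng = np.random.RandomState(0)
X = sparse.random(60, 20, density=0.1, random_state=rng, format="csr")
y = rng.randint(0, 3, size=60)

clf = LogisticRegression()
clf.fit(X, y)                   # (i) accepts scipy.sparse input directly
proba = clf.predict_proba(X)    # (ii) probability estimates
# (iii) multiclass: one probability column per class
```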

Did you try sklearn? Their logistic regression can output probabilities.

They also have a couple of other classifiers in that vein.

(Jul 23 '13 at 17:24) Leon Palafox ♦

Hi Leon, thanks for your advice, but the point is that I'd rather use something better than logistic regression (scikit-learn's LogisticRegression is what I'm currently using, with parameters tuned via GridSearchCV). Before I realized I needed probabilities, I was using scikit-learn's LinearSVC, which gave a significant improvement over logistic regression. The naive-Bayes-based models work, but are dramatically less accurate. I'd like to use a NN/MLP/RBN or (ideally) some form of SVC/SVM, but I can't seem to find anything that works within my constraints. Any ideas?

(Jul 23 '13 at 17:31) turn chang

Again, try sklearn; it has all those niceties. Last time I checked, you could use SVMs in sklearn.

(Jul 23 '13 at 18:08) Leon Palafox ♦

sklearn == scikit-learn (== what I mean by 'scikit'). AFAIK, sklearn's LinearSVC, which uses liblinear, doesn't output probabilities, and sklearn's SVC uses libsvm and doesn't scale well to lots of training data and high-dimensional features.

(Jul 23 '13 at 18:59) turn chang
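As an editorial aside: scikit-learn versions released after this thread (0.16+) added `sklearn.calibration.CalibratedClassifierCV`, which wraps a non-probabilistic estimator like LinearSVC and calibrates its decision values into probabilities. A minimal sketch, assuming a modern scikit-learn and hypothetical toy data:

```python
import numpy as np
from scipy import sparse
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV

# hypothetical toy data: sparse features, 3 classes
rng = np.random.RandomState(0)
X = sparse.random(90, 20, density=0.2, random_state=rng, format="csr")
y = rng.randint(0, 3, size=90)

# Platt-style sigmoid calibration fitted on top of LinearSVC's decision values
clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3)
clf.fit(X, y)
proba = clf.predict_proba(X)  # calibrated probability estimates
```

This keeps liblinear's scaling behavior while still producing `predict_proba` output, at the cost of an internal cross-validation pass.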

Ah, OK. Just remember that "scikits" is a general term for many Python scientific libraries (http://scikits.appspot.com/scikits), so I wasn't sure what you were referring to.

As for large amounts of data: how large is large? I have used their logistic regression (which outputs probabilities) with data files of around 2 GB, 1000 features, and 6000 training examples.

I know that in NLP you may have far more features than this (most likely sparse); mine are dense matrices.

Matlab tends to handle sparse matrices and high-dimensional data well, since it looks for faster ways to compute inverses (which are usually the bottleneck) and pseudoinverses.

Check this thread http://comments.gmane.org/gmane.comp.python.scikit-learn/4985

Andreas proposes a way to hack the output to get probabilities. Also, if you are using a linear SVM, the results should be very similar to logistic regression, since both are linear classifiers.

(Jul 24 '13 at 12:38) Leon Palafox ♦
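The hack mentioned in the linked thread amounts to squashing LinearSVC's decision values into pseudo-probabilities. One common variant (a sketch on hypothetical toy data, not the exact code from the thread) is a softmax over the one-vs-rest decision scores:

```python
import numpy as np
from scipy import sparse
from sklearn.svm import LinearSVC

# hypothetical toy data: sparse features, 3 classes
rng = np.random.RandomState(0)
X = sparse.random(90, 20, density=0.2, random_state=rng, format="csr")
y = rng.randint(0, 3, size=90)

clf = LinearSVC().fit(X, y)
scores = clf.decision_function(X)  # shape (n_samples, n_classes) for 3 classes
scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
pseudo_proba = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
```

These values sum to 1 per row but are not calibrated, which is consistent with the accuracy drop reported in the next comment.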

Yea--good point : )

I have around 290k training examples and 150k features--but yeah, highly sparse. I've tried that hack, but it drastically lowers my accuracy.

At any rate, there's a chance I don't need probabilities--check out this post: http://stackoverflow.com/questions/17725461/ideal-classifiers-in-python-to-fit-sparse-high-dimensional-features-with-hierar --perhaps you have some idea of how I could use my hierarchical classes without propagating the probabilities out of each node to the leaves?

(Jul 24 '13 at 14:11) turn chang
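The "propagating probabilities to the leaves" idea mentioned above has a simple form: each leaf's probability is the product of the per-node probabilities along its path. A plain-Python sketch with a hypothetical two-level hierarchy (names and numbers invented for illustration):

```python
# Hypothetical hierarchy: root -> {A, B}, A -> {a1, a2}, B -> {b1}
p_top = {"A": 0.7, "B": 0.3}  # root classifier's class probabilities
p_children = {"A": {"a1": 0.6, "a2": 0.4}, "B": {"b1": 1.0}}

# Propagate: leaf probability = product of probabilities along its path
p_leaf = {leaf: p_top[parent] * p
          for parent, kids in p_children.items()
          for leaf, p in kids.items()}
```

This is exactly the step that requires calibrated probabilities at every node; with uncalibrated scores, the products are not comparable across branches.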

One Answer:

AFAIK sklearn can produce probability estimates with SVC: http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.predict_proba

I haven't used it myself though!
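For completeness, SVC only exposes probability estimates when fitted with `probability=True` (a sketch on hypothetical toy data; as the comments note, this is slow at scale because it adds an internal cross-validation for Platt scaling):

```python
import numpy as np
from scipy import sparse
from sklearn.svm import SVC

# hypothetical toy data: sparse features, 3 classes
rng = np.random.RandomState(0)
X = sparse.random(60, 10, density=0.3, random_state=rng, format="csr")
y = rng.randint(0, 3, size=60)

# probability=True enables Platt scaling inside libsvm (extra fitting cost)
clf = SVC(kernel="linear", probability=True, random_state=0).fit(X, y)
proba = clf.predict_proba(X)
```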

You probably want to investigate the development version (0.14), because the current version doesn't handle grid search/cross-validation with anything other than classification error. E.g., AUC is currently (wrongly) computed from the class output (i.e., 1/0) rather than the real-valued decision function/"probability estimate". This is fixed in 0.14.
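Once that fix landed, AUC-based model selection could be written directly as a scoring argument. A sketch against a modern scikit-learn (note the import path: in the 0.1x versions contemporary with this thread it lived in `sklearn.grid_search`; binary labels used here since plain `roc_auc` is a binary metric):

```python
import numpy as np
from scipy import sparse
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search in old versions

# hypothetical toy data: sparse features, binary labels
rng = np.random.RandomState(0)
X = sparse.random(80, 15, density=0.2, random_state=rng, format="csr")
y = rng.randint(0, 2, size=80)

# AUC is computed from predict_proba/decision_function, not hard 0/1 labels
grid = GridSearchCV(LogisticRegression(), {"C": [0.1, 1.0, 10.0]},
                    scoring="roc_auc", cv=3)
grid.fit(X, y)
```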

answered Jul 24 '13 at 16:54


SeanV

Hi Sean--sadly I've tried it, and it is both extremely slow and performs very poorly (I'm guessing because there are so many classes): "SVC and NuSVC implement the 'one-against-one' approach (Knerr et al., 1990) for multiclass classification". LinearSVC, LogisticRegression, and SGDClassifier, on the other hand, implement one-vs-all (which I need because I have so many classes and so much training data).

(Jul 24 '13 at 17:26) turn chang
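For reference, the one-vs-rest strategy this comment prefers can also be forced explicitly by wrapping any estimator, which fits n_classes binary problems instead of the n_classes*(n_classes-1)/2 pairwise ones of one-against-one. A sketch on hypothetical toy data:

```python
import numpy as np
from scipy import sparse
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# hypothetical toy data: sparse features, 3 classes
rng = np.random.RandomState(0)
X = sparse.random(90, 20, density=0.2, random_state=rng, format="csr")
y = rng.randint(0, 3, size=90)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
# one underlying LinearSVC per class, so cost grows linearly in n_classes
```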

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.