|
My classification task is simple: given a somewhat limited vocabulary, classify each document into one of ~30 classes. The classification itself should be easy; I'm guessing naive Bayes would work quite well. I need tens of thousands of these classifiers (one for each user), possibly hundreds of thousands (each user can have her own set of classes, and hence a separate classifier). I'd like the classifier to start guessing immediately (as soon as it has seen the first few examples) and improve over time. That is, I want to train it on a continuous basis. I've used nltk's naive Bayes before and liked it, but I'm not sure I can do continuous training with it: given new training data, I'd like to be able to train the classifier without loading all the old training data. I want it to learn incrementally. General requirements: Python-based, fast to load (I'll be loading one for each user), easy to save, simple. What do you recommend?
|
The Divmod Reverend package was the simplest pure-Python naive Bayes classifier I found that could be trained incrementally/online/continuously. It's written specifically for spam filtering, but I was able to tweak its code fairly easily for arbitrary classification. Since the Divmod company went out of business, the code has gotten a little hard to find, but it's out there. EDIT: To illustrate my "tweaks": they're not so much tweaks as defining a custom tokenizer. By default, Reverend expects to be given a chunk of text (e.g. an email), and the default tokenizer splits this into a list of words. I simply wanted to pass it a list of symbols and skip tokenizing, so I created a "pass-through" tokenizer that returns its input unchanged. Download and install Reverend 0.4 from my link above, and then the following code should work.
EDIT: One more note: since the above implementation stores a reference to an instance method, Python's default pickler can't serialize it. However, the following code explains how to pickle an instance method, so you can save your classifier by simply doing
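For anyone who can't track down Reverend, here is a rough sketch of the same idea in plain Python. To be clear, this is not Reverend's code or API: `OnlineNaiveBayes`, `train`, and `guess` are names made up for illustration. It assumes examples arrive already tokenized (the pass-through idea) and keeps only count dictionaries, so new examples can be folded in without reloading old data and the default pickler can save it.

```python
import math
import pickle
from collections import defaultdict


class OnlineNaiveBayes:
    """Hypothetical multinomial naive Bayes that stores only counts,
    so it can be updated one example at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                     # Laplace smoothing strength
        self.class_counts = defaultdict(int)   # label -> number of examples
        self.token_counts = defaultdict(dict)  # label -> {token: count}
        self.token_totals = defaultdict(int)   # label -> total tokens seen
        self.vocab = set()                     # every token seen so far

    def train(self, label, tokens):
        """Fold one pre-tokenized example into the counts."""
        self.class_counts[label] += 1
        counts = self.token_counts[label]
        for tok in tokens:
            counts[tok] = counts.get(tok, 0) + 1
            self.token_totals[label] += 1
            self.vocab.add(tok)

    def guess(self, tokens):
        """Return (label, log-score) pairs, best first."""
        n = sum(self.class_counts.values())
        v = len(self.vocab) or 1
        scores = []
        for label, c in self.class_counts.items():
            score = math.log(c / n)  # log prior
            denom = self.token_totals[label] + self.alpha * v
            counts = self.token_counts[label]
            for tok in tokens:
                # Smoothed log likelihood of each token given the label
                score += math.log((counts.get(tok, 0) + self.alpha) / denom)
            scores.append((label, score))
        return sorted(scores, key=lambda pair: pair[1], reverse=True)


clf = OnlineNaiveBayes()
clf.train("fruit", ["apple", "banana", "apple"])
clf.train("tool", ["hammer", "wrench"])
clf.train("fruit", ["banana", "cherry"])    # later update, old data not reloaded

restored = pickle.loads(pickle.dumps(clf))  # plain counts, default pickler works
best_label = restored.guess(["apple", "cherry"])[0][0]
```

Because the state is just a few small dictionaries, one pickled blob per user should load quickly, though I haven't measured it at the scale the question describes.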
Any chance you could publish your tweaked code? Not sure what the original Divmod license was, but if it's possible to publish the generalized classifier I'm sure it'd be appreciated by many.
(Oct 20 '11 at 23:07)
Parand
The license is LGPL.
(Oct 21 '11 at 11:49)
Cerin
Much thanks Cerin. I found the Reverend code and ran a few experiments, looks like it'll work well.
(Oct 24 '11 at 10:57)
Parand
|
|
I don't know much about Python ML libraries but I know that there are online versions of both Naive Bayes and Logistic Regression that would probably suit you nicely. Online Naive Bayes is available as NaiveBayesUpdateable in Weka or NaiveBayes in MOA. The source code for both of these is available and in the worst case you could use it to translate a "batch" naive bayes classifier in Python (such as the one given here) into an online classifier. Hope this helps.
Thanks Troy. I ended up with online naive bayes via the Reverend package.
(Oct 24 '11 at 10:58)
Parand
|
|
You should probably check out the new book Machine Learning for Email by Drew Conway and John Myles White. The book has online example code in R, which can be useful, as the setting is quite similar to what you're describing.
If this task is more of a preference-modelling problem than regular classification, it might be better to add user-specific features to your classifier rather than building a separate classifier for each user.
This is more of a search or recommendation problem, so look into those topics. The key is that you should be able to usefully carry information between users.
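A minimal sketch of what user-specific features could look like, assuming all users share one label set (which the question does not guarantee) and that tokens are plain strings. `add_user_features` and the `::` separator are hypothetical. Each token is emitted twice, once shared and once tagged with the user, so a single classifier can learn both global and per-user evidence.

```python
def add_user_features(user_id, tokens):
    """Return the shared tokens plus a user-tagged copy of each token."""
    tagged = ["%s::%s" % (user_id, tok) for tok in tokens]
    return list(tokens) + tagged


features = add_user_features("alice", ["invoice", "overdue"])
# features == ['invoice', 'overdue', 'alice::invoice', 'alice::overdue']
```

The shared copies let a new user benefit from everyone else's training data right away, while the tagged copies let the model pick up that user's individual habits as examples accumulate.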