|
I have a machine learning problem where I want to continually improve my classification (and/or regression) model with new data. Currently, I'm using SVMs to do batch learning, retraining the SVM whenever I get a new batch of data. Ideally, I'd like to do the learning online (update the model with each new sample I obtain). Can anyone point me to freely available libraries or source code for online learning? (C++/Python would be ideal.) Thanks!
|
I am reposting the original answer (from before the crash) for completeness of MetaOptimize: for a Python library, you can try Bolt, http://pprett.github.com/bolt/. I haven't used it myself, but I have used other code by the same author, and it has always been of excellent quality. The discussion concluded that Bolt does not do online learning with continuous partial fits, but rather batch learning aimed at large-scale problems. Thus Bolt does not answer the OP's needs.
|
How about John Langford's Vowpal Wabbit?
Your link is broken.
(May 03 '11 at 18:33)
Cerin
Fixed broken link.
(Oct 11 '11 at 17:45)
Oscar Täckström
One aspect of Vowpal Wabbit that might interest you is that you can set its hyper-parameters so that past updates are forgotten as new data arrives. This might be useful if you need to track a non-stationary distribution. Other libraries might support this as well, but VW explicitly describes how to get this behaviour somewhere in its documentation (a generic sketch of the behaviour is given below).
(Oct 11 '11 at 17:51)
Oscar Täckström
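To illustrate what that forgetting behaviour buys you, here is a toy numpy sketch (a generic illustration of the idea, not VW's actual mechanism or flags): a constant step size makes old observations decay geometrically, so the estimate can follow a drifting stream.

```python
import numpy as np

rng = np.random.default_rng(0)
# A non-stationary stream: the true mean jumps from 0 to 5 halfway through.
stream = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(5.0, 1.0, 500)])

mean = 0.0
step = 0.1  # constant step size => exponential forgetting of old samples
for x in stream:
    mean += step * (x - mean)

print(mean)  # close to 5.0: the estimate has tracked the shift
```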
|
|
Basic stochastic gradient descent (SGD) is very easy to implement yourself (and more flexible) if you're already familiar with a scientific library like numpy/scipy. I just did this for a fairly hairy online regression problem and it worked surprisingly well. Note that this works best with a smooth objective (e.g. logistic regression or Huber loss); you might need an additional check or two for non-smooth objectives like the SVM/hinge loss.
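To make that concrete, here is a minimal numpy sketch of the approach: online logistic regression with a per-sample SGD update. The class name, learning rate, and regularization constant are illustrative choices, not anything from the answer above.

```python
import numpy as np

class OnlineLogisticRegression:
    """Logistic regression trained one sample at a time with plain SGD."""

    def __init__(self, n_features, learning_rate=0.1, l2=1e-4):
        self.w = np.zeros(n_features)
        self.b = 0.0
        self.learning_rate = learning_rate
        self.l2 = l2  # L2 regularization strength

    def predict_proba(self, x):
        # Clip the margin to keep exp() from overflowing on extreme inputs.
        z = np.clip(self.w @ x + self.b, -30.0, 30.0)
        return 1.0 / (1.0 + np.exp(-z))

    def partial_fit(self, x, y):
        """Update the model with a single example; y must be 0 or 1."""
        p = self.predict_proba(x)
        g = p - y  # derivative of the log loss w.r.t. the linear score
        self.w -= self.learning_rate * (g * x + self.l2 * self.w)
        self.b -= self.learning_rate * g

# Usage: update the model as each new sample arrives.
model = OnlineLogisticRegression(n_features=2)
for x, y in [(np.array([1.0, 2.0]), 1), (np.array([-1.0, 0.5]), 0)]:
    model.partial_fit(x, y)
```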
I use SGD a lot (actually for almost all gradient-based optimization), but you often need to run multiple passes over your data set, as well as tune hyper-parameters on a held-out set, in order to get optimal performance. As a true online learning algorithm, I'm not sure it's the best choice.
(Oct 11 '11 at 17:54)
Oscar Täckström
|
|
Consider MOA (Massive Online Analysis), http://moa.cs.waikato.ac.nz/details/. This tool is open-source Java and closely related to Weka. It does not presently support SVMs, but online decision trees, bagging, etc. are supported.
|
sofia-ml is pretty nice. The C++ code is highly readable, and it supports classification, regression, and ranking with SGD, Pegasos, Passive-Aggressive, and all the good stuff. It is also the reference implementation of the minibatch k-means algorithm for "web-scale" clustering (a sketch of that algorithm is given below).
(Oct 14 '11 at 12:41)
ogrisel
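For illustration, here is a rough numpy sketch of the minibatch k-means idea from Sculley's "Web-scale k-means clustering" paper. This is not sofia-ml's actual code; the function name and parameter defaults are placeholders.

```python
import numpy as np

def minibatch_kmeans(X, k, batch_size=100, n_iters=100, seed=0):
    """Sketch of minibatch k-means with per-center decaying step sizes."""
    rng = np.random.default_rng(seed)
    # Initialize centers from k random data points.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)  # how many points each center has absorbed so far
    for _ in range(n_iters):
        batch = X[rng.choice(len(X), size=batch_size)]
        # Assign each point in the minibatch to its nearest center.
        dists = ((batch[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = dists.argmin(axis=1)
        for x, c in zip(batch, nearest):
            counts[c] += 1
            eta = 1.0 / counts[c]  # per-center decaying learning rate
            centers[c] += eta * (x - centers[c])  # move center toward x
    return centers

X = np.random.default_rng(1).normal(size=(1000, 2))
print(minibatch_kmeans(X, k=3))
```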
|
|
I use www.safaribooksonline.com. It's not free, but you get access to everything.
I think this question is more about programming libraries, since it mentions "(C++/Python would be ideal)".
(Oct 12 '11 at 07:13)
Atilla Ozgur
|
|
Fuzzy ARTMAP can perform online learning, solves the stability-plasticity dilemma, and is a universal approximator. See an implemented version here or a simplified Fuzzy ARTMAP implementation here (disclaimer: these links were found via Google, so I am not sure about the code quality; Fuzzy ARTMAP itself works fine, though).
|
Version 2.0 of Léon Bottou's SGD project introduced an implementation of Averaged SGD for both linear SVMs and CRFs, with a pretty amazing convergence rate on large-scale problems.
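As a rough illustration of the averaging idea, here is a toy numpy sketch of Polyak-Ruppert averaging on a least-squares stream. This is not the code from Bottou's project (which is far more optimized and uses a decaying rate schedule); the constants and problem setup are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(10_000, 3))
y = X @ w_true + rng.normal(scale=0.1, size=10_000)

w = np.zeros(3)      # the raw SGD iterate
w_avg = np.zeros(3)  # the running average actually used for prediction
for t, (x, target) in enumerate(zip(X, y), start=1):
    grad = (w @ x - target) * x   # squared-loss gradient for one sample
    w -= 0.01 * grad              # plain SGD step (constant rate for brevity)
    w_avg += (w - w_avg) / t      # running average of all iterates

print(w_avg)  # should end up close to w_true
```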
|
C++ and Python would be ideal for that. I don't think you'll find great code for free; most full versions are paid.