I am trying to fit a GLM with an exponentially distributed response variable on 128,000 training examples with 66 features per example. Is there any known quick way to do this? I need to fit it many times, since it is part of an EM algorithm where I must re-optimize the parameter values on every iteration, and I don't have a lot of computing resources available. The methods I've explored are batch gradient descent, IRLS, and stochastic gradient descent. Batch gradient descent gives me the best results but takes the longest. IRLS gives results similar to gradient descent in somewhat less time, but the SciPy solver diverges on occasion and it's still not fast enough for my needs. Stochastic gradient descent is the only method I've found with acceptable speed, but it performs slightly worse than the other two, so I wanted to check whether there is a better approach.
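For concreteness, here is roughly the objective and gradient involved, assuming an exponential response with a log link on the mean (a sketch, not my exact code; `X`, `y`, and `w` are placeholders):

```python
import numpy as np

def neg_log_likelihood(w, X, y):
    """Negative log-likelihood of an exponential GLM with a log link.

    With mu_i = exp(x_i . w), each example contributes
    log(mu_i) + y_i / mu_i = x_i . w + y_i * exp(-x_i . w).
    """
    eta = X @ w                        # linear predictor, shape (n,)
    return np.sum(eta + y * np.exp(-eta))

def gradient(w, X, y):
    """Gradient of the negative log-likelihood above, shape (d,)."""
    eta = X @ w
    return X.T @ (1.0 - y * np.exp(-eta))
```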
LBFGS is another option. It works like batch gradient descent -- you only need to give it the gradient function -- but it converges much faster. The conventional wisdom is that SGD makes faster progress early in the run, but to squeeze out the last little bit of accuracy, LBFGS is better. (There's a paper by Langford and collaborators describing a setup where they start with SGD and switch over to LBFGS later.) A good LBFGS library is LibLBFGS: http://www.chokkan.org/software/liblbfgs/. I think some of the published comparisons find that coordinate descent can sometimes be even better (I'm thinking of one done for L1-regularized logistic regression). The main appeal of LBFGS is that it's easy to apply to different models, since the only thing you have to derive is the gradient function.
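To give a sense of how little code this takes, here is a minimal sketch using SciPy's built-in L-BFGS; `neg_log_likelihood` and `gradient` stand in for whatever objective and gradient functions you already use for batch gradient descent:

```python
import numpy as np
from scipy.optimize import minimize

def fit_lbfgs(X, y, w0=None):
    """Fit by L-BFGS given an objective and its gradient (sketch only).

    neg_log_likelihood(w, X, y) and gradient(w, X, y) are assumed to be
    the same functions you already feed to batch gradient descent.
    """
    if w0 is None:
        w0 = np.zeros(X.shape[1])
    result = minimize(
        neg_log_likelihood, w0, args=(X, y),
        jac=gradient,              # supply the analytic gradient
        method="L-BFGS-B",
        options={"maxiter": 200},
    )
    return result.x
```

Since you are refitting inside EM, warm-starting `w0` from the previous EM iteration's estimate should also cut the number of L-BFGS iterations considerably.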
Coordinate descent or LARS should be faster than other algorithms when the number of features is small. Also, 128K examples is more than enough to estimate 66 parameters: you should be able to use only a subset of your data with almost no loss in accuracy, which may let your most accurate method run in acceptable time. Alternatively, if you are not already doing this, you may be able to improve the SGD result by using a variable learning rate that is inversely proportional to the number of steps and by turning off regularization. A decaying learning rate should improve accuracy, and the algorithm may converge well before the first pass over the data is complete; and with 128K examples for 66 parameters, regularization is far more likely to hurt accuracy than to help. Finally, if there is a non-iterative (analytic) algorithm for this type of problem, I would try that.
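To make the subsampling and decaying-learning-rate suggestions concrete, here is a rough sketch (it assumes the same exponential-GLM per-example gradient as in the question; `eta0` and the subsample size are values you would tune yourself):

```python
import numpy as np

rng = np.random.default_rng(0)

def subsample(X, y, size=20_000):
    """Random subset on which to run your most accurate solver."""
    idx = rng.choice(len(y), size=size, replace=False)
    return X[idx], y[idx]

def fit_sgd(X, y, eta0=0.01, n_passes=1):
    """SGD with a step size inversely proportional to the step count,
    and no regularization term."""
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(n_passes):
        for i in rng.permutation(n):
            t += 1
            eta = eta0 / t                                  # 1/t learning rate
            g = X[i] * (1.0 - y[i] * np.exp(-X[i] @ w))     # per-example gradient
            w -= eta * g
    return w
```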
Check this paper by Nicolas Le Roux; it might be a good fit for the problem you have. http://hal.inria.fr/docs/00/71/50/15/PDF/sag_arxiv.pdf
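For reference, the core idea in that paper (stochastic average gradient) is short enough to sketch; this assumes the same exponential-GLM per-example gradient as above and a hand-picked constant step size, not the step-size rules from the paper:

```python
import numpy as np

def fit_sag(X, y, alpha=1e-4, n_iters=500_000, seed=0):
    """Stochastic average gradient (SAG) sketch for the exponential GLM.

    Keeps the last-seen gradient of every example and steps along the
    average of that table, so each update costs one example's gradient
    but uses information from the whole data set.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    grad_table = np.zeros((n, d))   # last gradient computed for each example
    grad_sum = np.zeros(d)          # running sum of the rows of grad_table
    for _ in range(n_iters):
        i = rng.integers(n)
        g_new = X[i] * (1.0 - y[i] * np.exp(-X[i] @ w))
        grad_sum += g_new - grad_table[i]
        grad_table[i] = g_new
        w -= alpha * grad_sum / n
    return w
```

Note that for a GLM each per-example gradient is a scalar multiple of x_i, so the table can be stored as one scalar per example instead of a full n-by-d array if memory is a concern.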
Thanks for your answers, everyone. I am going to see how sampling, LBFGS, and the algorithm from the Le Roux paper work. Unfortunately there does not appear to be an analytic solution to the problem, and I am already using a learning rate that decreases over time. I am also not regularizing, as I figured there would not be much risk of overfitting given the size of the data set.