The answer might depend on which method we choose.

For example, suppose we want to minimize a loss of the form L(&lt;w, x&gt;, y), where w is the current solution and (x, y) is a sample, using plain gradient descent. The usual computational cost is O(md) per pass over m data points in d dimensions. However, if the data x is sparse, we can maintain a cached value z = &lt;w, x&gt; for each sample, obtain the partial derivative dL/dw_j from L(z, y), and then update z over the non-zero entries of x only, since for a loss of this form the gradient with respect to w_j is proportional to x_j and therefore vanishes wherever x_j = 0. This reduces the computational cost to O(ms), where s is the average number of non-zero entries per sample (the effective dimension). This trick is presented in http://jmlr.org/papers/volume12/shalev-shwartz11a/shalev-shwartz11a.pdf, where it is used in a stochastic version of coordinate descent.
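To make the trick concrete, here is a minimal sketch of how I understand it in the coordinate-descent setting (the squared loss, uniform coordinate sampling, and fixed step size are my own placeholder choices, not taken from the paper):

    import numpy as np
    from scipy.sparse import csc_matrix

    def scd_squared_loss(X, y, n_iters, step=0.1, seed=0):
        """Stochastic coordinate descent for (1/m) * sum_i (z_i - y_i)^2 / 2,
        with z_i = <w, x_i>, keeping a cached vector z so that each
        coordinate update costs O(nnz of that column) instead of O(m).

        A minimal sketch, not the paper's exact algorithm.
        """
        X = csc_matrix(X)            # column slicing is cheap in CSC format
        m, d = X.shape
        w = np.zeros(d)
        z = np.zeros(m)              # cached inner products z_i = <w, x_i>
        rng = np.random.default_rng(seed)
        for _ in range(n_iters):
            j = rng.integers(d)      # pick a random coordinate
            col = X.getcol(j)
            rows, vals = col.indices, col.data   # non-zero entries of column j
            # the partial derivative only involves samples where x_{i,j} != 0
            g = (z[rows] - y[rows]) @ vals / m
            delta = -step * g
            w[j] += delta
            z[rows] += delta * vals  # refresh the cache for those samples only
        return w

Each coordinate update touches only the non-zero entries of column j, so a full sweep over all d coordinates costs O(ms) rather than O(md).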

I have seen a similar trick used for gradient descent as an SVM solver, but I wasn't able to understand it very well (see the discussion at the end of Section 2 of this paper: http://eprints.pascal-network.org/archive/00007137/). Can somebody explain to me how this works?
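For concreteness, here is my current guess at what that section describes (I may well be misreading it): since the gradient of the L2 regularizer rescales every coordinate of w, the paper seems to store w implicitly as a scalar times a vector, so the shrink step becomes O(1) and the hinge-loss step touches only the non-zero entries of x. A rough sketch under that assumption:

    import numpy as np
    from scipy.sparse import csr_matrix

    def pegasos_sparse(X, y, lam, n_iters, seed=0):
        """SGD for the SVM objective lam/2 * ||w||^2 + hinge loss, storing
        w implicitly as a * v. The shrink step from the regularizer rescales
        every coordinate of w, which would cost O(d); folding it into the
        scalar a makes it O(1), so each iteration costs O(nnz(x)).

        A sketch of my reading of the trick; details like starting t at 2
        (to keep the scale a non-zero) are my own simplifications.
        """
        X = csr_matrix(X)                # row slicing is cheap in CSR format
        m, d = X.shape
        v = np.zeros(d)
        a = 1.0                          # w = a * v
        rng = np.random.default_rng(seed)
        for t in range(2, n_iters + 2):
            i = rng.integers(m)
            row = X.getrow(i)
            cols, vals = row.indices, row.data
            eta = 1.0 / (lam * t)
            margin = a * y[i] * (v[cols] @ vals)  # y_i * <w, x_i> in O(nnz)
            a *= 1.0 - eta * lam         # regularizer shrink: O(1), not O(d)
            if margin < 1.0:             # hinge subgradient, non-zeros only
                v[cols] += (eta * y[i] / a) * vals
        return a * v

Is this the right reading, or is there more to it?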

Also, what other implementation tricks exploit sparsity in gradient-based algorithms? And are there similar tricks for other optimization methods? Are there any survey papers on this?

asked May 30 '14 at 23:24

Cheng Tang