Are there any algorithms that are known to efficiently handle millions of features and billions of observations?

asked Dec 29 '12 at 23:52 by Nwb

It's actually not clear at all what the OP had in mind, i.e., classification, regression, etc.? And where are there billions of observations, CERN?

(Jan 02 '13 at 07:38) digdug

I don't have a specific data set in mind. I was thinking along the lines of mining large text corpora, word frequencies and the like. But the central idea is probing what types of algorithms are used in general for mining huge data sets.

(Jan 02 '13 at 16:41) blip1234

3 Answers:

I have found that linear models work exceptionally well. It's possible to code up a logistic regression or neural network trained with stochastic gradient descent in a few lines of C code that will train and test very quickly and use very little memory, even without dimensionality reduction, since only one instance needs to be held in memory at a time.
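To make that concrete, here is a minimal sketch of such a loop: logistic regression trained by SGD over a sparse feature space, with only one example in memory at a time. The table size, learning rate and toy examples are assumptions for illustration, not anything taken from a particular library.

```c
/* Minimal sketch (not from any particular library): logistic regression
 * trained by SGD over a sparse, pre-indexed feature space.  Only one
 * example is in memory at a time; the weight vector is the only
 * persistent state.  Sizes, learning rate and toy data are assumptions. */
#include <math.h>
#include <stdio.h>

#define NUM_WEIGHTS (1 << 20)   /* 2^20 weights, about 8 MB of doubles */
#define LEARNING_RATE 0.1

static double w[NUM_WEIGHTS];

/* One SGD step on a single sparse example:
 * indices[i]/values[i] are the non-zero features, label is +1 or -1. */
static void sgd_update(const int *indices, const double *values,
                       int nnz, int label)
{
    double margin = 0.0;
    for (int i = 0; i < nnz; i++)
        margin += w[indices[i]] * values[i];

    /* derivative of the logistic loss w.r.t. the margin: -y / (1 + exp(y*m)) */
    double g = -label / (1.0 + exp(label * margin));

    for (int i = 0; i < nnz; i++)
        w[indices[i]] -= LEARNING_RATE * g * values[i];
}

static double predict(const int *indices, const double *values, int nnz)
{
    double margin = 0.0;
    for (int i = 0; i < nnz; i++)
        margin += w[indices[i]] * values[i];
    return 1.0 / (1.0 + exp(-margin));   /* P(y = +1 | x) */
}

int main(void)
{
    /* Toy stream of three sparse examples (index/value pairs). */
    int    idx1[] = {3, 42};        double val1[] = {1.0, 0.5};
    int    idx2[] = {7, 42, 100};   double val2[] = {1.0, 1.0, 2.0};
    int    idx3[] = {3, 100};       double val3[] = {0.5, 1.0};

    for (int epoch = 0; epoch < 10; epoch++) {
        sgd_update(idx1, val1, 2, +1);
        sgd_update(idx2, val2, 3, -1);
        sgd_update(idx3, val3, 2, +1);
    }
    printf("P(+1) for example 1: %f\n", predict(idx1, val1, 2));
    return 0;
}
```

The whole model state is the single weight array, which is why memory use stays flat no matter how many examples stream past.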

If you're not a fan of C programming, John Langford has already done the work for you with a very full-featured program called Vowpal Wabbit. Installing Vowpal Wabbit can be a bit of a pain, depending on how frequently you do these sorts of things, but this link might be of some help. Vowpal Wabbit uses a neat dimensionality-reduction trick called feature hashing, which has been discussed in many other questions here... just search for it if you're interested.
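For reference, the core of the hashing trick is just to hash the raw feature name into a fixed-size index space instead of building a dictionary. A minimal sketch, assuming a simple FNV-1a hash (VW's own implementation differs in its hash function and other details):

```c
/* Sketch of the hashing trick: map arbitrary feature names (words, n-grams,
 * "namespace=value" strings, ...) to indices in a fixed-size weight array by
 * hashing.  A simple FNV-1a hash is used here for illustration; VW's own
 * implementation differs in its hash function and other details. */
#include <stdio.h>
#include <stdint.h>

#define HASH_BITS 18
#define NUM_BUCKETS (1u << HASH_BITS)

/* FNV-1a hash of a NUL-terminated string. */
static uint32_t fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (uint8_t)*s++;
        h *= 16777619u;
    }
    return h;
}

/* A raw feature name becomes an index into a 2^HASH_BITS weight array;
 * no dictionary of feature names ever has to be built or stored. */
static uint32_t feature_index(const char *name)
{
    return fnv1a(name) & (NUM_BUCKETS - 1);
}

int main(void)
{
    const char *features[] = {"word=cat", "word=dog", "bigram=the_cat"};
    for (int i = 0; i < 3; i++)
        printf("%-16s -> bucket %u\n", features[i],
               (unsigned)feature_index(features[i]));
    return 0;
}
```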

If VW doesn't meet your needs, there are a couple of other fast linear-model programs out there, notably LibLinear and LibSVM.

One thing worth noting is that Vowpal Wabbit, while it can train an SVM-like model with hinge loss, doesn't train kernel SVMs and as such will train a strictly linear model. LibLinear and LibSVM will build kernel models.

Hope this helps.

answered Dec 31 '12 at 11:07 by Troy Raeder

I would add that SVM-Perf (based on SVM-Light) is similar, at least in functionality, to LibSVM, but adds some features that can be helpful with large datasets, such as Joachims's Cutting-Plane Subspace Pursuit algorithm and other optimization improvements.

(Dec 31 '12 at 14:43) Daniel E Margolis

Not completely sure, but I don't think LibLinear supports kernel methods; it is meant for large scale linear models.

(Jan 02 '13 at 11:54) Rakesh Chalasani

libsvm is not scalable (the complexity is more than quadratic in the number of observations), and both libsvm and liblinear assume that the training set fits in memory, which is quite unlikely for billions of samples and millions of features unless they are very, very sparse. Only VW is likely to scale to very large data.

(Jan 04 '13 at 12:12) ogrisel

Linear methods, as suggested by Troy, are a very good candidate for your problem. Linear methods tend to scale well with the number of examples, and very high dimensionality makes it likely that your problem is linearly separable. However, with more than 10k training samples you should really use either primal-space SVM solvers (e.g. liblinear, Vowpal Wabbit, Pegasos and plain SGD) or approximate solvers like those based on core sets or the Nyström method. Avoid dual-space SVM solvers (e.g. libsvm, svmlight and most of the older SVM solvers) for data sets of this size. I have also found that for very sparse data, as I am guessing yours is, L2 regularization usually works better than L1, contrary to what is recommended for most applications.
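To illustrate what a primal-space solver does per example, here is a sketch of a Pegasos-style SGD step with hinge loss and L2 regularization; the constants, the simplified handling of the regularizer and the toy data are assumptions, not the published algorithm verbatim:

```c
/* Sketch of a Pegasos-style primal SVM step: SGD with hinge loss and L2
 * regularization on one sparse example.  For brevity the L2 shrinkage is
 * applied only to the weights of the active features (a real implementation
 * would shrink all weights, usually lazily); constants and the toy example
 * are assumptions for illustration. */
#include <stdio.h>

#define NUM_WEIGHTS (1 << 20)
#define LAMBDA 1e-5            /* L2 regularization strength */

static double w[NUM_WEIGHTS];

/* t is the 1-based iteration counter; label is +1 or -1. */
static void pegasos_step(const int *indices, const double *values,
                         int nnz, int label, long t)
{
    double eta = 1.0 / (LAMBDA * t);      /* Pegasos step size */
    double margin = 0.0;
    for (int i = 0; i < nnz; i++)
        margin += w[indices[i]] * values[i];

    /* L2 term: shrink the (touched) weights toward zero. */
    for (int i = 0; i < nnz; i++)
        w[indices[i]] *= (1.0 - eta * LAMBDA);

    /* Hinge loss contributes only when the example violates the margin. */
    if (label * margin < 1.0)
        for (int i = 0; i < nnz; i++)
            w[indices[i]] += eta * label * values[i];
}

int main(void)
{
    int    idx[] = {3, 42, 100};
    double val[] = {1.0, 0.5, 2.0};
    for (long t = 1; t <= 100; t++)
        pegasos_step(idx, val, 3, +1, t);
    printf("w[3] after 100 steps: %f\n", w[3]);
    return 0;
}
```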

For this kind of data the main alternative to linear methods is ensemble methods (e.g. random forests). They can be made to scale by training the base learners on manageably sized samples of the original data, considering only subsets of the features during individual steps of the algorithm, and parallelizing the training of the base learners.
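As a toy-scale illustration of that recipe (subsampled data, feature subsets, independently trainable base learners), here is a sketch that bags decision stumps; every size, the stump learner and the synthetic data are assumptions purely for illustration:

```c
/* Toy-scale sketch of the subsampling recipe: train B decision stumps,
 * each on a random subsample of the data and a random subset of features,
 * then combine them by majority vote.  All sizes, the stump learner and
 * the synthetic data are assumptions purely for illustration. */
#include <stdio.h>
#include <stdlib.h>

#define N    10000   /* total examples */
#define D    20      /* features */
#define B    25      /* base learners */
#define SUB  1000    /* subsample size per learner */
#define FSUB 5       /* features tried per learner */

typedef struct { int feat; double thresh; int sign; } Stump;

static double X[N][D];
static int    y[N];          /* labels in {-1, +1} */
static Stump  ensemble[B];

/* Fit one stump on a subsample, trying a few random features/thresholds. */
static Stump train_stump(const int *idx, int n)
{
    Stump best = {0, 0.0, 1};
    int best_correct = -1;
    for (int t = 0; t < FSUB; t++) {
        int f = rand() % D;                       /* random feature */
        for (int k = 0; k < n; k += 50) {         /* candidate thresholds */
            double th = X[idx[k]][f];
            for (int sign = -1; sign <= 1; sign += 2) {
                int correct = 0;
                for (int i = 0; i < n; i++) {
                    int pred = (X[idx[i]][f] > th) ? sign : -sign;
                    if (pred == y[idx[i]]) correct++;
                }
                if (correct > best_correct) {
                    best_correct = correct;
                    best.feat = f; best.thresh = th; best.sign = sign;
                }
            }
        }
    }
    return best;
}

static int predict(const double *x)
{
    int vote = 0;
    for (int b = 0; b < B; b++) {
        const Stump *s = &ensemble[b];
        vote += (x[s->feat] > s->thresh) ? s->sign : -s->sign;
    }
    return vote >= 0 ? 1 : -1;
}

int main(void)
{
    /* Synthetic data: the label depends on feature 3 only. */
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < D; j++) X[i][j] = (double)rand() / RAND_MAX;
        y[i] = X[i][3] > 0.5 ? 1 : -1;
    }

    int idx[SUB];
    for (int b = 0; b < B; b++) {                 /* could run in parallel */
        for (int i = 0; i < SUB; i++) idx[i] = rand() % N;   /* subsample */
        ensemble[b] = train_stump(idx, SUB);
    }

    int correct = 0;
    for (int i = 0; i < N; i++) if (predict(X[i]) == y[i]) correct++;
    printf("training accuracy: %.3f\n", (double)correct / N);
    return 0;
}
```

Each base learner only ever touches its own subsample, so the loop over the B learners is embarrassingly parallel.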

A useful technique that can be combined with almost any ML method is to use hashed representations or random projections (the only dimension-reduction techniques that are likely to scale to your data). This can make it feasible to use methods that would not be able to handle your data in its original form, and it may actually help accuracy even for methods that can handle the original data. However, the dimension reduction must not be too aggressive; you should probably keep at least a few thousand dimensions.
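As a sketch of the random-projection option, here is an Achlioptas-style sparse projection in which the projection matrix is regenerated on the fly from a hash, so it is never stored even when the input has millions of features; the hash, output dimension and toy example are assumptions:

```c
/* Sketch of an Achlioptas-style sparse random projection where the
 * projection matrix is never stored: each entry of R (in {+1, 0, -1} with
 * probabilities 1/6, 2/3, 1/6) is regenerated on the fly by hashing its
 * (row, column) pair, so the input dimension can be arbitrarily large.
 * The hash, the output dimension and the toy example are assumptions. */
#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define D_OUT 4096   /* projected dimensionality */

/* Cheap integer hash (splitmix64 finalizer). */
static uint64_t mix(uint64_t x)
{
    x += 0x9e3779b97f4a7c15ull;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ull;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebull;
    return x ^ (x >> 31);
}

/* Implicit matrix entry R[in_idx][out_idx] in {+1, 0, -1}. */
static int r_entry(uint32_t in_idx, uint32_t out_idx)
{
    uint64_t h = mix(((uint64_t)in_idx << 32) | out_idx) % 6;
    if (h == 0) return +1;
    if (h == 1) return -1;
    return 0;
}

/* Project one sparse example into a dense D_OUT-dimensional vector. */
static void project(const uint32_t *indices, const double *values, int nnz,
                    double *out)
{
    double scale = sqrt(3.0 / D_OUT);   /* preserves norms in expectation */
    for (int j = 0; j < D_OUT; j++) out[j] = 0.0;
    for (int i = 0; i < nnz; i++)
        for (int j = 0; j < D_OUT; j++)
            out[j] += scale * r_entry(indices[i], j) * values[i];
}

int main(void)
{
    uint32_t idx[] = {3, 999983, 12345678};   /* very high-dimensional input */
    double   val[] = {1.0, 0.5, 2.0};
    static double z[D_OUT];
    project(idx, val, 3, z);
    printf("z[0] = %f, z[1] = %f\n", z[0], z[1]);
    return 0;
}
```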

answered Dec 31 '12 at 17:45 by Daniel Mahler

Well, traditionally you could use Principal Component Analysis (or Factor Analysis) to reduce the high dimensionality, but if we assume a nonlinear feature space you would probably be best served by a manifold learning technique.

Assuming you manage to get the feature situation under control, handling a large number of observations really depends on what you want to do with the data. If you look around at the big-name conferences in recent years, many of the popular techniques for classification or clustering, such as SVM and k-means, have had their theory extended to large datasets.

answered Dec 30 '12 at 01:08 by Daniel E Margolis

Covariance-based methods are not a good choice for millions of features, since they need memory quadratic in the number of features.

(Jan 02 '13 at 04:43) Justin Bayer

Justin's comment is not true: you can just use an iterative method; see, for instance, the Wikipedia article on principal component analysis.

(Jan 03 '13 at 08:31) SeanV

You are right, there are ways to use PCA on big data. It's not trivial, however.

(Jan 06 '13 at 17:01) Justin Bayer
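To make SeanV's point concrete, here is a minimal sketch of the iterative approach: power iteration recovers the top principal component from matrix-vector products with the data matrix alone, so the features-by-features covariance matrix never has to be built. The dense toy data and the fixed iteration count are assumptions for illustration:

```c
/* Minimal sketch of the iterative idea: power iteration finds the leading
 * principal component using only products with the (centered) data matrix,
 * so the features-by-features covariance matrix is never formed.  The dense
 * toy data and sizes here are assumptions; with huge sparse data the two
 * passes per iteration would stream the data from disk instead. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define N 200   /* examples */
#define D 50    /* features */

static double X[N][D];   /* assumed (approximately) mean-centered */
static double v[D];      /* current estimate of the top component */

int main(void)
{
    /* Toy data: by construction the variance is largest along feature 0. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < D; j++)
            X[i][j] = ((double)rand() / RAND_MAX - 0.5) * (j == 0 ? 10.0 : 1.0);

    for (int j = 0; j < D; j++) v[j] = 1.0 / sqrt(D);   /* initial guess */

    for (int iter = 0; iter < 100; iter++) {
        double u[N];
        /* u = X v  (first pass over the data) */
        for (int i = 0; i < N; i++) {
            u[i] = 0.0;
            for (int j = 0; j < D; j++) u[i] += X[i][j] * v[j];
        }
        /* v <- X^T u, normalized  (second pass) */
        double vnew[D], norm = 0.0;
        for (int j = 0; j < D; j++) {
            vnew[j] = 0.0;
            for (int i = 0; i < N; i++) vnew[j] += X[i][j] * u[i];
            norm += vnew[j] * vnew[j];
        }
        norm = sqrt(norm);
        for (int j = 0; j < D; j++) v[j] = vnew[j] / norm;
    }
    printf("|v[0]| = %f (should dominate the other coordinates)\n", fabs(v[0]));
    return 0;
}
```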