I have a dataset of millions of examples, where each example has 128 continuous-valued features and a class label (a name). I'm trying to find a large, robust database/index to use as a k-NN classifier for this high-dimensional data. I tried Weka's IBk classifier, but it chokes on this much data, and even then everything has to be loaded into memory. Would Lucene, specifically through the PyLucene interface, be a possible alternative?

I've found Lire, which seems to use Lucene in a similar way, but after reviewing the code, I'm not sure how they're pulling it off, or if it's the same thing I'm trying to do.

I realize Lucene is designed as a text indexing tool, and not as a general purpose classifier, but is it possible to use it in this way?

asked Apr 06 '11 at 14:39 by Cerin

"128 continuous-value features" Do you mean there are 128 features total (i.e. a dense feature vector)? Or that every example has 128 non-zero features, but potentially many more zero features (i.e. that you have a sparse feature vector)?

(Apr 11 '11 at 04:25) Joseph Turian ♦♦

4 Answers:

As far as I know, exact KNN implementations that answer queries in time logarithmic in the number of items in your dataset pay for it with costs exponential in the dimensionality of your data vectors (the curse of dimensionality). I doubt that any such implementation will be able to handle your case.

There are, however, possibilities.

answered Apr 06 '11 at 17:27 by Justin Bayer

Can you explain why the complexity should be exponential in the dimensionality of the feature vectors? IMHO, it should be linear in the dimensionality.

(Apr 22 '11 at 19:15) Aman

If I understand it correctly, you get the exponential runtime w.r.t. the dimensionality only if you want logarithmic runtime w.r.t. the number of data points. A naive implementation that finds the nearest neighbour needs only n*d steps, of course (see the sketch below).

(May 19 '11 at 14:14) Justin Bayer
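
For reference, the naive linear scan mentioned above takes only a few lines of NumPy; this is a minimal sketch with placeholder data, assuming Euclidean distance:

```python
import numpy as np

n, d = 100000, 128
X = np.random.rand(n, d).astype(np.float32)  # placeholder dataset
q = np.random.rand(d).astype(np.float32)     # placeholder query vector

# One pass over the data: O(n * d) time, no index structure required.
dists = np.linalg.norm(X - q, axis=1)
nearest = int(np.argmin(dists))
print(nearest, dists[nearest])
```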

This is a very good approach, but only if your data is both high-dimensional and sparse (e.g. text token occurrences). Lucene lets you speed up the queries by trimming down the features that are deemed less important according to an ad-hoc IDF weighting: check out MoreLikeThisQuery for a somewhat higher-level wrapper and MoreLikeThis for the actual implementation.

You can also adjust the metric used by replacing the DefaultSimilarity instance (the classical IDF weighting) with something else.

In your case, though, try to further reduce the dimensionality with PCA or LSH and use an in-memory KD-tree or ball tree.
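
A minimal sketch of that last suggestion using scikit-learn (the data, the number of PCA components, and k are placeholders; pick them for your own dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

# Placeholder data: dense 128-dimensional feature vectors.
X = np.random.rand(100000, 128).astype(np.float32)

# Reduce the dimensionality first; 32 components is an arbitrary choice,
# inspect pca.explained_variance_ratio_ to pick a sensible value.
pca = PCA(n_components=32)
X_reduced = pca.fit_transform(X)

# Build an in-memory ball tree and query the 5 nearest neighbours of a point.
nn = NearestNeighbors(n_neighbors=5, algorithm='ball_tree')
nn.fit(X_reduced)
distances, indices = nn.kneighbors(X_reduced[:1])
print(indices)
```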

answered Apr 09 '11 at 07:40 by ogrisel, edited Apr 09 '11 at 07:43

I agree with @ogrisel that you should only use PyLucene for k-NN if your inputs are sparse and high-dimensional. (See my comment on your question. I am not sure if that is the case.)

If you truly have a dense and low-dimensional representation (128 features total), then you need a fast approximate nearest neighbors algorithm. Try FLANN. It implements several different fast k-NN algorithms.

You can run it on the original data set, without any preprocessing. Just make sure that FLANN is using the correct distance measure for your data set. You could also try running PCA on the data first, but I don't know if this will help.
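
A rough sketch of what that looks like with FLANN's Python bindings (pyflann); the data and parameter values here are assumptions, so check them against the FLANN manual:

```python
import numpy as np
from pyflann import FLANN

# Placeholder data: dense 128-dimensional vectors.
dataset = np.random.rand(100000, 128).astype(np.float32)
queries = np.random.rand(10, 128).astype(np.float32)

flann = FLANN()
# 'autotuned' asks FLANN to pick an index structure for a target precision;
# 0.9 is an illustrative value, not a recommendation.
flann.build_index(dataset, algorithm='autotuned', target_precision=0.9)

# Approximate 5-nearest-neighbour query: returns neighbour indices and distances.
indices, dists = flann.nn_index(queries, num_neighbors=5)
print(indices)
```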

answered Apr 11 '11 at 04:29 by Joseph Turian ♦♦

Do you have an NVIDIA graphics card? If so, you can try a CUDA library that we develop in our lab. The limiting factor at the moment is that all data has to fit into your graphics memory, but that could be changed quite easily. At the moment we have just nearest neighbour (k = 1). What would be a sensible value of k in your application? If nearest neighbour works for you, I might just hack the k-NN together...

You can find an example here: http://www.ais.uni-bonn.de/~schulz//2010/12/15/cuv_1nn.html, and the library is here: https://github.com/deeplearningais/CUV.

This answer is marked "community wiki".

answered Apr 09 '11 at 06:05 by Andreas Mueller

Actually, it would be interesting to see how this scales in comparison to sophisticated algorithms on the CPU, like KD-trees et al.

(Apr 09 '11 at 06:55) Justin Bayer

Is there a k-NN implementation in Python that works with KD-trees? And when you say scale, how many examples are you thinking about?

(Apr 10 '11 at 04:37) Andreas Mueller

A million items of dimensionality 128 each, maybe? I have no idea, but it would be interesting to see whether the curse of dimensionality can be overcome in this case via the GPU for practical use cases.

(Apr 10 '11 at 15:10) Justin Bayer

I will try running that and see how long it takes. At the moment we only have a single-GPU implementation, but using multiple GPUs shouldn't be too hard with CUDA 4.0. I'll report back when I have the timings :)

(Apr 11 '11 at 06:13) Andreas Mueller