Hello, Lets assume that I have to built engine for nearest neighbor search with very large number of samples and binary feature space. I can use 2 techniques: first is locality sensitive hashing, second is inverted index. The question is when LSH is better than Inverted Index? Because it seems to me that inverted index could be much faster than LSH, e.g. when samples have nonuniform distribution. LSH with samples that has nonuniform distribution could be quite slow because in one bucket could be a lot of samples to compute a distance (and other buckets are empty). In inverted index such situation is not a case. Inverted index is also much more predictable because it does not use random functions. So the first question is: when LSH is better than Inverted Index? And second question: can I use Inverted Index with other feature space, like Euclidean feature space?

asked Nov 04 '11 at 04:03

nightride's gravatar image

nightride
1025611

edited Nov 04 '11 at 04:08


One Answer:

Inverted indices are only efficient when your representation is sparse, e.g.: each query and each indexed document / observation has a few (100 to 1K) non zero word occurrences (or any other kind of features - those could be "visual words" or interest points cluster index in computer vision for instance) out of a vocabulary of say 100k to > 10M.

With most inverted indices implementations you can customize the similarity function used to rank the results. For instance by default Apache Lucene & Solr use a TF-IDF similarity but you can code your own in Java if you want to alter the induced structure of the feature space.

If your natural representation is dense then an inverted index won't work. You have two options:

1- find an alternative sparse representation by clustering your dense observation to find code-words (e.g. k-means cluster centers, singular vectors, Non Negative Matrix Factorization components...) and encode each observation into this code-word space using as few code word as possible (Orthogonal Matching Pursuit, Lasso or even thresholded projections might work depending on your data). Once this is done you can use an inverted index to perform efficient sparse nearest neighbors queries in that representation.

2- reduce the dimensionality to a Hamming space and use collisions as in LSH to speedup neighbors lookups

Both methods will introduce some approximation error. The theory will give you bounds for LSH but I don't know if they are useful in practice. The LSH approach is probably simpler to implement and debug.

answered Nov 04 '11 at 04:53

ogrisel's gravatar image

ogrisel
398464480

edited Nov 05 '11 at 07:58

Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.