Hello, I have implemented LSH algorithm for finding nearest neighbours in set with 14 milion elements. The problem is that when I feed my algorithm with 14 milion samples I've got in some buckets small amount of samples (10 - 30) and in some buckets very large amount of samples (20 000 - 30 000). When query sample falls into big buckets it takes a lot time to retrieve a few nearest neighbours. How can I speedup my algorithm? The implementation is based on article: http://research.yahoo.com/pub/2648 so I have few parameters to control amount of samples that fall into bucket. For example parameter w could be used to choose width of bucket. But when I decrease it, then buckets that have small amount of samples will have even smaller amount. How can I deal with such non-uniform distribution of samples?

asked Jul 02 '11 at 08:07

endless's gravatar image

endless
16223

1

Have you tried adding more buckets? Any extra bucket should significantly reduce the average number of items per bucket, which can probably speed things up, specially as it's not really that costly to add more buckets.

It looks as if your data is very clustered. Maybe you can improve the projection matrix to spread out these clusters, or you can do PCA on your data to make it isotropic (and maybe even centering and dividing by the variance of each component might do the trick).

(Jul 02 '11 at 12:25) Alexandre Passos ♦

Is LSH a metric (I can't determine confidently from searches)? If so, you can use the triangle inequality to reduce the number of comparisons needed. Something like "Optimizing All-Nearest-Neighbor Queries with Trigonometric Pruning" by Emrich et al. 2010.

(Jul 03 '11 at 19:11) Robert Layton

One Answer:

Have you tried adding more buckets? Any extra bucket should significantly reduce the average number of items per bucket, which can probably speed things up, specially as it's not really that costly to add more buckets. I cannot add more buckets. I can only control the width of buckets. When the width of bucket is small there will be more buckets with lower amount of samples. But this still does not solve my problem because there are buckets with huge amount of samples and bucket with small amount of samples.

It looks as if your data is very clustered Couly you please explain what does it mean "clustered"?

Is LSH a metric (I can't determine confidently from searches)? If so, you can use the triangle inequality to reduce the number of comparisons needed. Something like "Optimizing All-Nearest-Neighbor Queries with Trigonometric Pruning" by Emrich et al. 2010. LSH as far as I know is not a metric. I use euclidean distance as a metric.

answered Jul 04 '11 at 07:36

endless's gravatar image

endless
16223

Try this: http://www.aaai.org/ocs/index.php/AAAI/AAAI10/paper/download/1839/2032

(Jul 04 '11 at 18:49) Robert Layton
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.