I have implemented a random-hyperplane-based LSH algorithm (following Charikar 2002, "Similarity Estimation Techniques from Rounding Algorithms"). Because of the probabilistic nature of LSH, a larger number of randomly selected hash functions gives better results. Is there a theoretical method to determine the number of hash functions required for a particular problem?

In my case, I have 36,000 K-dimensional feature vectors (K=1900). The vectors are sparse: on average each one contains only 25 non-zero features, with a maximum of 144 and a minimum of 2. Currently I am using 128 random hash functions (d=128) and the results are very good, but is there any way to determine the value of d theoretically? If there are any good research papers on this question, please point me to them. Thank you.
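For context, here is a minimal sketch of the scheme I am describing (illustrative only: it assumes dense NumPy vectors rather than my actual sparse representation, and the variable names are just for this example):

```python
import numpy as np

K = 1900   # feature dimensionality
d = 128    # number of random hash functions (hyperplanes)

rng = np.random.default_rng(0)
# One random hyperplane per hash function, entries drawn from N(0, 1)
hyperplanes = rng.standard_normal((d, K))

def signature(x):
    """d-bit signature: the sign of the projection onto each hyperplane."""
    return (hyperplanes @ x) >= 0          # boolean array of length d

def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the fraction of agreeing bits,
    using Pr[h(u) = h(v)] = 1 - theta(u, v) / pi (Charikar 2002)."""
    agreement = np.mean(sig_a == sig_b)
    return np.cos((1.0 - agreement) * np.pi)
```

Each of the d signature bits agrees with probability 1 - theta(u, v)/pi (the Charikar result), which is why I expect d to control how accurately the angle, and hence the similarity, is estimated.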