Hi, all I implemented basics LSH algorithms, including 1) minhash for jaccard distance2) random projections for Euclidean and cosine distance. My aim is to understand the LSH algorithms and use it in my project hopefully. Two parameters are included in the algorithms, k(length of signature for each data), L(# of hash tables). I suppose k is responsible for precision of retrieved results and L is responsible for recall of retrieved results. My first question is: is there any other parameter I need to consider in LSH implementation? Take random projection method for example. I initialize k*L random vectors, and make k vectors a group. For each data, I compute signature of the data with each group of vectors(the sign the of the dot product is a bit of signature), i.e., I got L signature for each data, and each signature has a length of k bit. Then I compute a hash value of the k-bit vector and treat the value as a key in hash table. In this manner, I will have L hash table. I call this training phase. My question is: If the k-bit vector is compressed into a hash value, I can only retrieve data with exactly the same signature(100% matched k-bit vector). Is this what LSH does? Another question is: how can I evaluate my implementation? In particular, if I am focused on accuracy, how can I make sure that I tuned the optimal parameters? My current approach is to make the exact KNN results as baseline, and compute the precision/recall of the retrieved data compared with KNN results. Any problem with my method? Thanks for your notices. |
Regarding your first question, you might want to consider how many things you will evaluate at search time as a parameter as well, but you might not. What LSH is usually used for is not getting things which hash exactly to your input's hash, but getting everything which hashes sufficiently close (in terms of the number of differing bits) to your key. This allows you to look up a neighborhood of a point in time which theoretically depends sublinearly on the size of your training set. k and L let you trade off this time against the size of the neighborhood you're finding. You can be shallower than this if you want; for example, by computing how often your LSH implementation is able to find the K exact nearest neighbors.

Thanks very much. 1. You mean I need to consider the number of neighbors needed when setting k and L? 2. This means that I still need to compare all the signatures in a table, and get the similarity from the number of bits they match? I think that if there are a lot of buckets in a table, that is still linear search time, even though the comparison is now different from KNN. 3. This sounds very simple, but what if I retrieve a lot of data that covers the K exact NN?

By the way, I still have a question. Take the random projection method for Euclidean distance as an example: I generally calculate k integers. What kind of hash function do I need to hash these k values into a fixed-length signature, say 32 bits? I am not sure whether the hash function I choose will affect the comparison if I am computing the number of differing bits. Thanks
(Feb 03 '12 at 08:30)
DDX
The random-projections method for estimating Euclidean distances works by multiplying a vector with a random vector of zeros and ones and then squaring the result. You then do this many times, average, and get an estimate of the norm of the vector. To do locality-sensitive hashing with random projections the procedure is the same, except that instead of squaring the result of each projection you throw that information away and look only at its sign. Repeating this many times gives you a bit vector, which is your key. You can generate these random numbers with a hash function, by computing the hash of their index in the vector.
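A sketch of that last point, as I understand it (my own illustration, not Alexandre's code; any deterministic hash-based scheme would do): the "random" vector entries are derived on the fly from a hash of their index, so the random vectors never have to be stored.

    import hashlib
    import struct

    # Derive a pseudo-random entry in [-1, 1) for position `index` of the
    # random vector used for bit `bit` of table `table`, from a hash of the
    # indices. (Illustrative names; this is just one possible scheme.)
    def pseudo_random_entry(table, bit, index, seed=0):
        digest = hashlib.md5(("%d:%d:%d:%d" % (seed, table, bit, index)).encode()).digest()
        u = struct.unpack("<Q", digest[:8])[0] / float(2 ** 64)   # uniform in [0, 1)
        return 2.0 * u - 1.0

    def signature_bit(x, table, bit, seed=0):
        # Sign of the dot product with the implicitly-defined random vector.
        dot = sum(x[i] * pseudo_random_entry(table, bit, i, seed) for i in range(len(x)))
        return 1 if dot > 0 else 0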
(Feb 03 '12 at 08:37)
Alexandre Passos ♦
Thanks! Very useful. Actually, I am confused about the procedure of the search phase. I had assumed it is the same for all kinds of LSH functions, i.e., compare every h_i in <h_1, h_2, ..., h_k>. But from your answer it seems that this is not the case, right?
(Feb 03 '12 at 11:49)
DDX
I am going to break out my results into a separate answer, even though they are related to Alexandre's. First of all, you should sanity check your code as follows:
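For example, one simple check (a sketch of one possibility, assuming the sign-random-projection family for cosine distance): the fraction of signature bits on which two vectors agree should be close to 1 - theta/pi, where theta is the angle between them.

    import numpy as np

    # One possible sanity check: with many random hyperplanes, the empirical
    # bit-agreement rate between two vectors should be close to the
    # theoretical collision probability 1 - theta/pi.
    def sanity_check(dim=50, n_bits=5000, n_pairs=20, seed=0, tol=0.05):
        rng = np.random.RandomState(seed)
        planes = rng.randn(n_bits, dim)
        for _ in range(n_pairs):
            x, y = rng.randn(dim), rng.randn(dim)
            agree = np.mean((planes.dot(x) > 0) == (planes.dot(y) > 0))
            cos = x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))
            theta = np.arccos(np.clip(cos, -1.0, 1.0))
            expected = 1.0 - theta / np.pi
            assert abs(agree - expected) < tol, (agree, expected)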
Now that you know your code works correctly, figure out the desired F-measure for your results. (You measure the F-measure score (FMS) of the k-best LSH results vs. the exact k-NN results.) Also figure out the desired time in milliseconds (or nanoseconds) for each retrieval. Now you can tune your hyperparameters h and L. (I'm using h, not k, for the length of the hash, so as not to confuse it with the hyperparameter k from k-NN.) There is also one more hyperparameter, d: the maximum Hamming distance you will search. The choice of d controls how much you search in the neighborhood of the hash code. So if d=2, you will look at the items in the original hash bucket, at the h bit-strings at Hamming distance 1 from the hash, and at the roughly h^2/2 (h choose 2) bit-strings at Hamming distance 2 from the hash. I will give an example:
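For instance, probing a single table with d=2 might look like this (a sketch with illustrative names; `table` maps a packed h-bit key to a bucket of item ids):

    from itertools import combinations

    # Collect candidates from the bucket of `key` plus every bucket whose key
    # is within Hamming distance d of it (h = number of bits in the key).
    def probe(table, key, h, d):
        candidates = set(table.get(key, []))
        for dist in range(1, d + 1):
            for positions in combinations(range(h), dist):
                neighbor = key
                for p in positions:
                    neighbor ^= 1 << p            # flip one bit of the key
                candidates.update(table.get(neighbor, []))
        return candidates

With L tables you would take the union of the candidates from all tables, and then optionally rank them by their exact distance to the query.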
Does this make sense? So now you can optimize your hyperparameters h, L, and d so that the time and FMS are what you desire. [I will ping Olivier Grisel to jump in, because he is writing LSH code for scikits.learn right now. He might have other insight.] It probably makes sense to macro-average the FMS scores.
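A sketch of that evaluation (exact k-NN as ground truth, macro-averaged over queries; the names are illustrative):

    # F-measure of the LSH candidates for one query, against the exact k-NN set.
    def f_measure(retrieved, relevant):
        retrieved, relevant = set(retrieved), set(relevant)
        if not retrieved or not relevant:
            return 0.0
        hits = len(retrieved & relevant)
        precision = hits / float(len(retrieved))
        recall = hits / float(len(relevant))
        return 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)

    # Macro-average: compute the F-measure per query, then average over queries.
    def macro_fms(lsh_results, knn_results):
        scores = [f_measure(lsh_results[q], knn_results[q]) for q in knn_results]
        return sum(scores) / float(len(scores))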
(Feb 03 '12 at 15:51)
Joseph Turian ♦♦
Thanks for your answer. Does your method work for different kinds of LSH families? For example, h is the number of hash values for each data point, denoted by <hash_1, ..., hash_h>.
(Feb 03 '12 at 22:10)
DDX
If I'm not mistaken, hash_i is always a bit in LSH. (If you think otherwise, please provide a link.) The choice of the hashing function is what ensures that the Euclidean distance in the original space is (approximately) reflected by the Hamming distance in the hash space.
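(A note on the sign-of-random-projection family specifically, in case it helps: the standard guarantee there is stated in terms of the angle rather than the raw Euclidean distance, Pr[hash_i(x) = hash_i(y)] = 1 - theta(x, y)/pi, so the expected Hamming distance between two h-bit keys is h * theta(x, y)/pi.)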
(Feb 22 '12 at 04:47)
Joseph Turian ♦♦