I want to get started with Hashing , as in implement it for a few documents. But there are several kinds of hashing and i do not want to learn, for now, hashing used in cryptography. [Although that would be great too but not for now] . I want to learn about the hashing techniques which are used in IR . It will be awesome to have a list of resources to get started in hashing for a total newbie( i know what hashing is but nothing more than that ).
I am guessing that for finding out duplicate content and for storage, hashing is used widely. Also, like Udi Manber says , there are three algorithms used in Yahoo! . they are ,"hashing, hashing and hashing". Thanks.
asked Feb 04 '11 at 08:31
Two things are involved in hashing for information retrieval: a hash function and a way of doing lookups based on the hash.
Here you can use any dimensionality reduction technique that will give you a short real-valued or bit-valued (which one is better depends on your lookup algorithm) vector for each large data feature vector you have (this can be a tf-idf vector if you're indexing text, but it can really be anything, like a vector-quantized tf-idf for SIFT features for images, similar keypoint extractors for audio, etc). Some popular approaches are:
c. semantic hashing and other deep-learning approaches
I'm assuming you want to query for similarity. That is: you give a document and want to ask for similar documents (usually other approaches can be reduced to this). If you have a real-valued vector, what is usually done is an approximate nearest neighbor search. For bit-vector hashes you can do something far simpler, which is looking at all documents that are one-bit away from the query document. If you have an array of pointers such that the i-th position in the array corresponds to hash number i, this is very easy to do.
Of course, there are many more variations, but this should get you started.
answered Feb 09 '11 at 12:14
Alexandre Passos ♦
There are many different uses of hashing in IR, for example: