I want to get started with Hashing , as in implement it for a few documents. But there are several kinds of hashing and i do not want to learn, for now, hashing used in cryptography. [Although that would be great too but not for now] . I want to learn about the hashing techniques which are used in IR . It will be awesome to have a list of resources to get started in hashing for a total newbie( i know what hashing is but nothing more than that ).

I am guessing that for finding out duplicate content and for storage, hashing is used widely. Also, like Udi Manber says , there are three algorithms used in Yahoo! . they are ,"hashing, hashing and hashing". Thanks.

asked Feb 04 '11 at 08:31

crazyaboutliv's gravatar image


Hashing can mean meany things in many different contexts. You seem to be confusing/conflating these, and I'm not sure how to start answering this question.

(Feb 04 '11 at 12:16) Alexandre Passos ♦

Oh :( . Well i just want to know the basics and see how hashing is used in IR. Is it just like generic hashing or is it different due to the sheer number of words involved and say, someone wants to find similar words, is there some sorta hashing which can do that? That sort of stuff. I just want to have a nice survey of hashing methods as used in IR [or to constrain, as used for text]

(Feb 04 '11 at 12:20) crazyaboutliv

2 Answers:

Two things are involved in hashing for information retrieval: a hash function and a way of doing lookups based on the hash.

  1. The hash function

Here you can use any dimensionality reduction technique that will give you a short real-valued or bit-valued (which one is better depends on your lookup algorithm) vector for each large data feature vector you have (this can be a tf-idf vector if you're indexing text, but it can really be anything, like a vector-quantized tf-idf for SIFT features for images, similar keypoint extractors for audio, etc). Some popular approaches are:

a. Latent dirichlet allocation, latent semantic analysis, etc

b. random projections and locality sensitive hashing

c. semantic hashing and other deep-learning approaches

  1. The lookup technique

I'm assuming you want to query for similarity. That is: you give a document and want to ask for similar documents (usually other approaches can be reduced to this). If you have a real-valued vector, what is usually done is an approximate nearest neighbor search. For bit-vector hashes you can do something far simpler, which is looking at all documents that are one-bit away from the query document. If you have an array of pointers such that the i-th position in the array corresponds to hash number i, this is very easy to do.

Of course, there are many more variations, but this should get you started.

answered Feb 09 '11 at 12:14

Alexandre%20Passos's gravatar image

Alexandre Passos ♦

There are many different uses of hashing in IR, for example:

  • Hashing for duplicate detection (MD5 and variants).
  • Hash-kernels as used in machine learning to save memory and allow fast cross products of features.
  • Hashing for approximate similarity search (especially multimedia).
  • Distributed hashing for efficient key-value storage spread out over a large number of servers.
  • ...

answered Feb 04 '11 at 13:10

Oscar%20T%C3%A4ckstr%C3%B6m's gravatar image

Oscar Täckström

edited Feb 05 '11 at 16:53

Thank you ,gives me some direction .

(Feb 04 '11 at 13:16) crazyaboutliv

I thought the OP was asking for Hashing for approximate similarity, as it seems more specific for IR than the other general connotations for hashing.

(Feb 09 '11 at 11:46) probreasoning
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.