I would like to do similarity computations between documents in a corpus using tf-idf. More precisely, for each document, i would like to know the most similar other documents in the corpus. This is easy with cosine similarity and doing a linear scan/comparsion, however, I would like to do it in a faster than linear way. Since tf-idf vector has so many dimensions, using methods like kd-trees and hashing doesn't seem to work. Also, precomputing all the ranks is also infeasible. Any suggestions/pointers how to do this?

asked Jun 26 '11 at 06:31

Axs's gravatar image

Axs
46113

I would suggest that you use an online clustering algorithm. This would have two good properties: - you run the algorithm only once over all documents, then you can infer nearest neighbors easily for each of them - you can stop the clustering when needed to get an estimate of the most probable nearest neighbors.

I've used 'online spherical k-means' along with tf-idf. You'll find an implementation in the Seeks Project (http://www.seeks-project.info/). Look for the oskmeans.h / .cpp if you need to. Or get in touch if you go this path and need more information.

(Jun 26 '11 at 16:38) Emmanuel Benazera

One Answer:

You could vectorize all your calculations, if you haven't already, and then use C, Matlab or Numpy.

Precompute your Tf-IDF dictionary if you haven't.

dimensionality reduction is actually a pretty reasonably approach - Johnson-Lindenstrauss or Locally Sensitive Hashing are worth investigating.

answered Jun 26 '11 at 12:09

Jacob%20Jensen's gravatar image

Jacob Jensen
1644285360

Can you elaborate on Johnson-Lindenstrauss? I know the method from wikipedia, but do you have any pointers applying it in tf-idf like contexts. Same thing for LSH.

(Jun 26 '11 at 14:50) Axs

Actually, with tf-idf, it might be a bit of a hassle. I guess the key would be to re-weight the initial matrix by the sqrt of the tf-idf diagonal weight matrix, and then applying Johnson-Lindenstrauss ( a random matrix multiplication, basically), you could reduce dimension pretty significantly and then do normal similarity between each row.

It might not work well though, I'm not sure.

What scale data are you working with? I feel that tf-idf can be pretty fast in general.

(Jun 26 '11 at 22:20) Jacob Jensen

use inverted index. Most search engines use it to return the top K candidate (k-nearest-neighbor) in the first phase. You can use open source lucene search engine. Although the dimension of document vector space (unique number of words as indexes) is high, the number of unique words per document is small (less than 500 words in average document). If two documents do not share any word, the similarity is zero. Use inverted index to avoid scanning all documents in corpus

(Jul 24 '11 at 11:51) Jianxiong Dong
Your answer
toggle preview

powered by OSQA

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.