I would like to calculate "topic similarity" between two documents. Let's say I have document A and document B, each containing some words. The documents vary in length, up to a maximum of 3200 lines. I would like to calculate a "similarity" score between the two documents based on the topics (probably frequent n-grams?) within them.

I am looking for something more sophisticated than cosine similarity, the Jaccard coefficient, etc. I DON'T have a collection of documents.

asked Jun 24 '11 at 08:13


Dexter


2 Answers:

This might be a long shot, but you could try subdividing each document into paragraphs and then computing topics for those individual paragraphs (similar to what people are doing with tweets).

Then, from the paragraph-level topics, you can generalize to the whole documents.
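
A minimal sketch of this idea, under several assumptions that are not part of the answer itself: paragraphs are separated by blank lines, the topic model is a tiny collapsed Gibbs sampler for LDA written in plain Python, and a document's topic vector is taken to be the plain average of its paragraphs' topic distributions. The function names (`split_paragraphs`, `gibbs_lda`, `topic_similarity`) and all parameter values are illustrative only.

```python
import math
import random
import re

def split_paragraphs(text):
    # Assumption: blank lines separate paragraphs; each one becomes a pseudo-document.
    return [p for p in re.split(r"\n\s*\n", text) if p.strip()]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def gibbs_lda(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized pseudo-documents.
    Returns one topic distribution (list of floats summing to 1) per input."""
    rng = random.Random(seed)
    vocab = {w: i for i, w in enumerate(sorted({w for d in docs for w in d}))}
    n_words = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * n_words for _ in range(n_topics)]
    topic_total = [0] * n_topics
    # Start from a random topic assignment for every token.
    assign = [[rng.randrange(n_topics) for _ in doc] for doc in docs]
    for d, doc in enumerate(docs):
        for w, z in zip(doc, assign[d]):
            doc_topic[d][z] += 1
            topic_word[z][vocab[w]] += 1
            topic_total[z] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, z = vocab[w], assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][v] -= 1
                topic_total[z] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][k] + alpha) * (topic_word[k][v] + beta)
                    / (topic_total[k] + n_words * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][v] += 1
                topic_total[z] += 1
    return [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in counts]
        for doc, counts in zip(docs, doc_topic)
    ]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def topic_similarity(text_a, text_b, n_topics=2):
    paras_a = [tokenize(p) for p in split_paragraphs(text_a)]
    paras_b = [tokenize(p) for p in split_paragraphs(text_b)]
    # The paragraphs of both documents together form the "collection".
    thetas = gibbs_lda(paras_a + paras_b, n_topics)
    # Generalize to documents by averaging the paragraph-level distributions.
    mean = lambda rows: [sum(col) / len(rows) for col in zip(*rows)]
    return cosine(mean(thetas[:len(paras_a)]), mean(thetas[len(paras_a):]))
```

With only a handful of short paragraphs the topic estimates will be noisy, so treat the score as a rough signal; for real use a tested LDA implementation would probably be a better choice than this toy sampler.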

answered Jun 24 '11 at 21:24


Leon Palafox ♦

Leon, Thanks. Can you direct me to those papers on tweets?

(Jun 25 '11 at 00:10) Dexter

I don't have access to my computer right now, but try searching Google Scholar for: microblog, LDA. It was a pretty popular application last year.

(Jun 25 '11 at 00:13) Leon Palafox ♦

I have only read about LDA for tweets. There are more techniques, but since my research is in network analysis rather than NLP, I only know the general solutions.

(Jun 25 '11 at 00:15) Leon Palafox ♦

Leon: Thanks. But LDA requires some training. I need something closer to an online version of similarity matching.

(Jun 25 '11 at 05:08) Dexter

I meant that I only have a collection of 5 documents on the first run. Once the similarity measure is calculated, I may gain access to more documents.

(Jun 25 '11 at 05:11) Dexter

You could run LDA, unsupervised, on a generic corpus such as Wikipedia articles to extract term-weight vectors for n_topics ranging from 100 to 1000 topics. Then compute the cosine similarity of each of your documents to those topic vectors (and maybe keep only the top 10% of scores to get a sparse representation in this n_topics-dimensional space).

Then computing the cosine similarity of the documents in this topical space, rather than in the original space, should give you what you are looking for.
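
A rough sketch of the projection step, assuming the topic term-weight vectors are already available. The three tiny hand-written topics in `TOPICS` are purely hypothetical stand-ins for vectors that would come from LDA fitted on a generic corpus, and the helper names and the `keep_fraction` parameter are illustrative assumptions, not part of any library.

```python
import math
import re
from collections import Counter

# Hypothetical topic -> term-weight vectors, hand-written for illustration.
# In practice these would be extracted by LDA from a generic corpus.
TOPICS = {
    "sports": {"match": 0.4, "team": 0.3, "goal": 0.2, "league": 0.1},
    "finance": {"market": 0.4, "stock": 0.3, "bank": 0.2, "price": 0.1},
    "cooking": {"recipe": 0.4, "oven": 0.3, "salt": 0.2, "butter": 0.1},
}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def topic_vector(text, topics=TOPICS, keep_fraction=0.1):
    """Project a text into topic space: one cosine score per topic,
    keeping only the highest-scoring fraction to get a sparse vector."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {name: cosine(counts, weights) for name, weights in topics.items()}
    keep = max(1, math.ceil(keep_fraction * len(topics)))
    top = sorted(scores, key=scores.get, reverse=True)[:keep]
    return {name: scores[name] for name in top if scores[name] > 0}

def topic_similarity(text_a, text_b, topics=TOPICS, keep_fraction=0.1):
    """Cosine similarity computed in topic space instead of word space."""
    return cosine(
        topic_vector(text_a, topics, keep_fraction),
        topic_vector(text_b, topics, keep_fraction),
    )
```

Note that two texts sharing no words at all (for example, one mentioning "team" and "match", the other "goal" and "league") can still score as similar here, because both load on the same topic; that is the point of comparing in topic space rather than in the original word space.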

answered Jun 24 '11 at 08:24


ogrisel

edited Jun 24 '11 at 08:25

Olivier, thanks! My domain doesn't allow me to use a generic corpus, as one doesn't exist. I saw such an approach at WWW2008: www2008.org/papers/pdf/p91-phanA.pdf

Unfortunately, I don't have such a luxury.

(Jun 24 '11 at 08:27) Dexter

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.