|
I would like to calculate "topic similarity" between two documents. Let's say I have document A with a few words and document B with a few words. Both documents vary in length, up to a maximum of 3200 lines. I would like to calculate a "similarity" score between the two documents based on the topics (probably frequent n-grams?) within them. I am looking for something more sophisticated than cosine similarity, the Jaccard coefficient, and the like. I DON'T have a collection of documents. |
|
This might be a long shot, but you could try subdividing each document into paragraphs, computing topics for those individual paragraphs (as is done with tweets), and then generalizing from the paragraph-level topics to the whole documents.
Leon, thanks. Can you direct me to those papers on tweets?
(Jun 25 '11 at 00:10)
Dexter
I don't have access to my computer right now, but try searching Google Scholar for "microblog LDA". It was a pretty popular application last year.
(Jun 25 '11 at 00:13)
Leon Palafox ♦
I have only read about LDA for tweets. There are more techniques, but since my research is in network analysis rather than NLP, I only know the general solutions.
(Jun 25 '11 at 00:15)
Leon Palafox ♦
Leon: Thanks. But LDA requires some training. I need something closer to an online version of similarity matching.
(Jun 25 '11 at 05:08)
Dexter
I meant that I have a collection of just 5 documents on the first run. Once the similarity measure is calculated, I may gain access to more documents.
(Jun 25 '11 at 05:11)
Dexter
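To make the paragraph-subdivision idea above concrete, here is a minimal sketch in plain Python. It assumes paragraphs are separated by blank lines, fits a small LDA model by collapsed Gibbs sampling over the paragraphs of both documents together (so the topics are shared), and then averages the paragraph-level topic distributions into one mixture per document. The function names, the hyperparameters (`n_topics`, `alpha`, `beta`, `n_iter`), and the averaging step are all illustrative assumptions, not something specified in the thread.

```python
import random
from collections import Counter


def split_paragraphs(text):
    """Split on blank lines; return each non-empty paragraph as lowercase tokens."""
    paragraphs = [p.split() for p in text.lower().split("\n\n")]
    return [p for p in paragraphs if p]


def paragraph_topics(paragraphs, n_topics=2, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; returns one topic distribution per paragraph."""
    rng = random.Random(seed)
    vocab_size = len({w for p in paragraphs for w in p})
    para_topic = [[0] * n_topics for _ in paragraphs]   # count of topic k in paragraph d
    topic_word = [Counter() for _ in range(n_topics)]   # count of word w under topic k
    topic_total = [0] * n_topics                        # total tokens assigned to topic k
    assign = []                                         # current topic of every token

    # Random initial topic assignment for every token.
    for d, para in enumerate(paragraphs):
        zs = []
        for w in para:
            k = rng.randrange(n_topics)
            zs.append(k)
            para_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(n_iter):
        for d, para in enumerate(paragraphs):
            for i, w in enumerate(para):
                k = assign[d][i]
                para_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (para_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                para_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed, normalized topic proportions for each paragraph.
    return [
        [(c + alpha) / (len(para) + alpha * n_topics) for c in counts]
        for counts, para in zip(para_topic, paragraphs)
    ]


def document_mixtures(doc_a, doc_b, **lda_kwargs):
    """Average paragraph-level topics into one mixture per document (assumes both are non-empty)."""
    paras_a, paras_b = split_paragraphs(doc_a), split_paragraphs(doc_b)
    thetas = paragraph_topics(paras_a + paras_b, **lda_kwargs)

    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    return mean(thetas[:len(paras_a)]), mean(thetas[len(paras_a):])


doc_a = "the cat sat on the mat\n\nthe dog chased the cat"
doc_b = "stocks fell sharply today\n\nthe market rallied after stocks fell"
mix_a, mix_b = document_mixtures(doc_a, doc_b)
print(mix_a, mix_b)
```

The two mixtures can then be compared with any vector similarity. With only a handful of short paragraphs the sampled topics are likely to be noisy, so this is best read as an outline of the procedure rather than a tuned method.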
|
|
You could use LDA to extract term weight vectors for topics from a generic corpus. Then computing the cosine similarity of the documents in this topical space rather than the original space should give you what you are looking for.
Olivier, thanks! The domain doesn't allow me to use a generic corpus, as one doesn't exist. I saw such an approach at WWW2008: www2008.org/papers/pdf/p91-phanA.pdf Unfortunately, I don't have such a luxury.
(Jun 24 '11 at 08:27)
Dexter
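As a rough illustration of comparing documents in a topical space rather than the original word space, the sketch below assumes the per-topic term weights are already available (for instance, from a trained LDA model). The `TOPIC_TERM` table is a hand-made, hypothetical stand-in for such weights, and the projection used here (summing each topic's weight over the document's tokens) is a crude simplification, not proper LDA inference.

```python
import math

# Hypothetical topic-term weights, standing in for what a trained LDA model would provide.
TOPIC_TERM = {
    "animals": {"cat": 0.5, "dog": 0.4, "pet": 0.1},
    "finance": {"stock": 0.5, "market": 0.4, "bank": 0.1},
}


def topic_vector(tokens, topic_term):
    """Project a bag of words into topic space by summing each topic's weight per token."""
    return [
        sum(weights.get(tok, 0.0) for tok in tokens)
        for weights in topic_term.values()
    ]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def topical_similarity(text_a, text_b, topic_term=TOPIC_TERM):
    """Cosine similarity of two texts after projecting both into topic space."""
    vec_a = topic_vector(text_a.lower().split(), topic_term)
    vec_b = topic_vector(text_b.lower().split(), topic_term)
    return cosine(vec_a, vec_b)


# "cat" and "pet" share no words, yet both load only on the "animals" topic.
print(topical_similarity("cat", "pet"))
print(topical_similarity("cat dog", "stock market"))
```

The point of the example is that two texts with no words in common can still score as similar when their words belong to the same topic, which plain word-space cosine similarity cannot capture.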
|