I understand that LDA results can be used to measure the similarity between two documents from the same corpus (e.g. with Hellinger distance). Has any thought been given to measuring the similarity between corpora? And under what conditions can I even consider measuring that similarity? Must the corpora share a vocabulary? Use the same number of topics?

asked May 04 '11 at 18:45

Aengus Robinson

edited Jun 30 '11 at 15:43

ogrisel

Thanks to both responders for taking the time to provide some suggestions. I think I've gotten an initial handle on this using material from metabolic pathway searching. Lots more to do, but it's at least an initial step toward a statistical foundation.

(Jun 23 '11 at 22:37) Aengus Robinson

Don't forget to tick one as an answer if your problem was solved!

(Jun 24 '11 at 01:29) Robert Layton

3 Answers:

In case the issue comes up again, I thought I'd post a brief summary of what I found. I'm not sure I would have dug up this background if I had limited my search to the linguistic literature. FWIW, my current interest is subgraph isomorphism and the measurement of network similarity. I am focused on statistical measures of similarity rather than exact isomorphism. There's a pretty clear tie to linguistic analysis methods such as topic analysis and LDA.

In summary, I found that there was not a 'best' or accepted approach. Here are a few references that provide some basic background, and I leave the more recent references for those interested to dig up. I'm happy to send a more complete bibliography.

Dunning probably doesn't get enough credit for laying the foundation and I've also included a few older references. My recent work suggests that information measures (that are available from, for example, LDA) are generally better than the traditional methods. I've included two references that, IMHO, provide an interesting change in direction.

Foundational papers

Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics.

Kilgarriff, A., & Rose, T. (1998). Measures for corpus similarity and homogeneity. In Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing, Granada, Spain (pp. 46–52).

Strehl, A., Ghosh, J., & Mooney, R. (2000). Impact of similarity measures on web-page clustering. AAAI Tech Report WS-00-01. Workshop on Artificial Intelligence for Web.

K. Parsons and A. M. A. M. Butavicius, “Human Dimensions of Corpora Comparison: An Analysis of Kilgarriff's (2001) Approach,” pp. 1–62, Jul. 2009.

Topic Model related pubs (biased sample)

Konietzny, S. G., Dietz, L., & McHardy, A. C. (2011). Inferring functional modules of protein families with probabilistic topic models. BMC Bioinformatics, 12(1), 141. BioMed Central Ltd. doi:10.1186/1471-2105-12-141

Parkkinen, J. A., & Kaski, S. (2010). Searching for functional gene modules with interaction component models. BMC Systems Biology, 4, 4. doi:10.1186/1752-0509-4-4
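
For readers who haven't met the statistic from Dunning (1993), here is a minimal sketch of the log-likelihood ratio (G²) computed on a 2x2 contingency table for one word's counts in two corpora. The function name and the choice to apply it per word are assumptions made for illustration, not a method taken from any of the papers above.

```python
import math

def log_likelihood_ratio(count_a, total_a, count_b, total_b):
    """G^2 for one word: count_a of total_a tokens in corpus A,
    count_b of total_b tokens in corpus B (hypothetical helper)."""
    # Observed 2x2 table: rows = (word, all other words), cols = (A, B).
    observed = [
        [count_a, count_b],
        [total_a - count_a, total_b - count_b],
    ]
    n = total_a + total_b
    row_sums = [sum(row) for row in observed]
    col_sums = [total_a, total_b]

    g = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # treat 0 * log(0) as 0
                expected = row_sums[i] * col_sums[j] / n
                g += o * math.log(o / expected)
    return 2.0 * g

# Same relative frequency in both corpora: the statistic is (near) zero.
same_rate = log_likelihood_ratio(10, 100, 20, 200)
# Very different relative frequencies: the statistic is large.
different_rate = log_likelihood_ratio(50, 100, 5, 100)
print(same_rate, different_rate)
```

Larger values mean the word's frequency differs more between the two corpora than chance would suggest. Summing or ranking these per-word scores is one possible way to build a corpus-level comparison, but that aggregation step is a design choice, not something the sketch settles.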

answered Jun 30 '11 at 11:21

Aengus Robinson

edited Apr 19 at 17:35

You could train the parameters of a graphical model (like LDA) on one corpus, then feed in the other and measure its log-likelihood. This gives you a good relative distance from the original corpus, and you can compute pairwise distances the same way. It is reminiscent of KL divergence, since it asks for the probability of a sample y under a distribution inferred from a sample x, but it wouldn't necessarily be a useful general metric.
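
A minimal sketch of the train-on-one, score-the-other idea. To stay dependency-free it swaps in an additively smoothed unigram model in place of LDA, so it only illustrates the scoring step, not topic inference. The function names, the smoothing constant `alpha`, and the toy corpora are all assumptions for illustration.

```python
import math
from collections import Counter

def train_unigram(tokens, vocab, alpha=1.0):
    """Additively smoothed unigram probabilities over a shared vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {word: (counts[word] + alpha) / total for word in vocab}

def avg_log_likelihood(model, tokens):
    """Mean per-token log-probability of `tokens` under `model`."""
    return sum(math.log(model[word]) for word in tokens) / len(tokens)

def cross_corpus_scores(tokens_a, tokens_b, alpha=1.0):
    """Score each corpus under a model trained on itself and on the other."""
    # A shared vocabulary keeps every token's probability non-zero.
    vocab = set(tokens_a) | set(tokens_b)
    model_a = train_unigram(tokens_a, vocab, alpha)
    model_b = train_unigram(tokens_b, vocab, alpha)
    return {
        "a_under_a": avg_log_likelihood(model_a, tokens_a),
        "b_under_a": avg_log_likelihood(model_a, tokens_b),
        "a_under_b": avg_log_likelihood(model_b, tokens_a),
        "b_under_b": avg_log_likelihood(model_b, tokens_b),
    }

# Toy corpora, already tokenized (made up for this example).
corpus_a = "the cat sat on the mat the cat ate".split()
corpus_b = "stocks fell as the market closed lower".split()

scores = cross_corpus_scores(corpus_a, corpus_b)
for name, value in sorted(scores.items()):
    print(name, round(value, 3))
```

The gap between `a_under_a` and `b_under_a` plays the role of the relative distance. Note that `b_under_a` and `a_under_b` generally differ, which is one reason the result is not a metric in the strict sense.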

You could also look at the sorts of measures used in bioinformatics, like edit distances. Dot products of vocabulary vectors, or of a hierarchy of vocabulary vectors, would also be reasonable.
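
A rough sketch of the vocabulary-vector dot product, normalized to cosine similarity so that corpus size does not dominate the score. The normalization and the function name are assumptions added here, not part of the suggestion above.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two word-count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # Only words present in both corpora contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty corpus has no direction to compare
    return dot / (norm_a * norm_b)

# Toy corpora, already tokenized (made up for this example).
pets = "the cat sat on the mat".split()
markets = "stocks fell sharply today".split()

print(cosine_similarity(pets, pets))
print(cosine_similarity(pets, markets))
```

Raw counts let very frequent function words dominate, so in practice some reweighting (for example tf-idf) or stop-word filtering is likely to be needed before the scores say much about content.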

answered May 05 '11 at 00:52

Jacob Jensen

A good starting point is papers W00-0901 to W00-0906 at http://acl.ldc.upenn.edu/W/W00/

answered May 04 '11 at 20:43

Awais Athar

Thanks, I wasn't familiar with this. My hope is to make a statistical comparison, but this will be a good start.

(May 04 '11 at 21:15) Aengus Robinson

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.