It seems like most of the papers on key-phrase extraction (see below) start with a training corpus containing labeled key phrases and use that to train a classifier to decide whether candidate n-grams are key phrases or not (as you would with, say, faces in an image).

It seems to me that if the documents you're dealing with are sufficiently homogeneous, you could say that n-grams that occur significantly more frequently in this document than in the corpus as a whole are key phrases. Is there some reason why this wouldn't work? It would certainly be a lot faster, and could be computed incrementally without an up-front training step.
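Here is a minimal sketch of what I have in mind, assuming the document and background corpus are already tokenized; the function names, smoothing, and thresholds are illustrative choices, not a reference implementation:

```python
from collections import Counter
from math import log


def ngrams(tokens, n):
    """Return all n-grams of length n as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def candidate_phrases(doc_tokens, corpus_tokens, n=2, min_count=3):
    """Score n-grams by how much more frequent they are in the document
    than in the background corpus (log of the frequency ratio)."""
    doc_counts = Counter(ngrams(doc_tokens, n))
    corpus_counts = Counter(ngrams(corpus_tokens, n))
    doc_total = sum(doc_counts.values()) or 1
    corpus_total = sum(corpus_counts.values()) or 1

    scores = {}
    for gram, count in doc_counts.items():
        if count < min_count:
            continue
        p_doc = count / doc_total
        # add-one smoothing so n-grams unseen in the corpus don't divide by zero
        p_corpus = (corpus_counts[gram] + 1) / (corpus_total + 1)
        scores[gram] = log(p_doc / p_corpus)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical usage with naive whitespace tokenization:
doc = "machine learning models need training data for machine learning".split()
corpus = "a much larger pile of background text about many topics".split()
print(candidate_phrases(doc, corpus, n=2, min_count=1)[:10])
```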

links:

KEA

A Ranking Approach to Keyphrase Extraction

asked Dec 06 '12 at 15:14

bobpoekert


One Answer:

Key phrase extraction is not really my field, but I think I have seen at least as much work on unsupervised keyword extraction as on supervised. There is certainly nothing wrong with the idea. At one point Amazon used to show a set of statistically improbable phrases for books for which it had the text. Another approach to unsupervised key phrase mining is to look for words whose occurrences are correlated, i.e. word1 word2 occurs more frequently than the individual probabilities of word1 and word2 would suggest.
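A minimal sketch of that co-occurrence idea, assuming whitespace-tokenized text and scoring adjacent word pairs by pointwise mutual information (PMI); the function name and threshold are illustrative:

```python
from collections import Counter
from math import log


def pmi_bigrams(tokens, min_count=5):
    """Rank adjacent word pairs by pointwise mutual information:
    log P(w1, w2) / (P(w1) * P(w2)). High values mean the pair
    co-occurs far more often than the individual word frequencies
    would suggest."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values()) or 1

    scored = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scored.append(((w1, w2), log(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda kv: kv[1], reverse=True)
```

In practice a stop-word filter and a reasonable min_count matter a lot here, since PMI is noisy for rare pairs.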

answered Dec 10 '12 at 00:51

Daniel Mahler

