Hi all, I'm searching for software for automatic keyword extraction from a single web document. So far I've found KEA and TextRank, but I haven't tried them yet. I intend to use it to extract representative terms for an organization, given its homepage or Wikipedia page. For instance, for "Apple" and http://www.apple.com/ or http://en.wikipedia.org/wiki/Apple_inc, I would like to retrieve {"iphone","imac","steve jobs",...}

Any ideas? Thanks in advance,

asked Nov 01 '10 at 17:09

Damiano Spina

edited Nov 04 '10 at 03:12

Joseph Turian ♦♦

3 Answers:

Amazon has this famous feature that they call "Statistically Improbable Phrases". It basically looks like some sort of TF-IDF weighting.

Just googling these buzzwords together gives pretty good results on possible implementations. Check this Stack Overflow thread, for example: "How does Amazon's Statistically Improbable Phrases work?".
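To make the TF-IDF idea concrete, here is a minimal sketch. It is not Amazon's actual method, which isn't documented here; the function names and the plain tf * log(N/df) weighting are assumptions chosen for illustration, and the reference corpus is whatever background documents you have at hand.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of letters/digits (a deliberately crude tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_keywords(target, corpus, n=5):
    """Rank words of `target` by tf * idf; idf is computed over `corpus` plus the target."""
    docs = [set(tokenize(d)) for d in corpus]
    tf = Counter(tokenize(target))
    docs.append(set(tf))  # include the target so every word has df >= 1
    n_docs = len(docs)
    scores = {}
    for word, freq in tf.items():
        df = sum(1 for d in docs if word in d)
        scores[word] = freq * math.log(n_docs / df)
    return sorted(scores, key=scores.get, reverse=True)[:n]

corpus = ["the cat sat on the mat", "the dog and the bone"]
target = "the iphone and the ipad and the iphone"
print(tfidf_keywords(target, corpus, n=2))
```

Words that appear in every document get an idf of zero, so "the" drops out without any stoplist; the quality of the result depends heavily on how representative the reference corpus is.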

answered Nov 02 '10 at 23:33

ivank


Something as simple as the unigram probability statistics for the document would likely work well enough for the Wikipedia examples. Just take the top k words ranked by P(w_i) = n_i / N, where n_i is the number of occurrences of word w_i and N is the total number of tokens in the document. For something like apple.com it might be a bit harder, since the data is a lot more sparse and the page is always changing. You might need to look at a separate data source. However, looking at obvious things such as the meta "description" tag at apple.com, I see it contains: "Apple designs and creates iPod and iTunes, Mac laptop and desktop computers, the OS X operating system, and the revolutionary iPhone and iPad." There are some top words right there, though I'm sure that not all sites will work out that nicely. You could try the same approach described above and just see what kind of results you end up getting.
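A rough sketch of that unigram ranking, assuming a crude regex tokenizer and a hypothetical helper name `top_unigrams` (neither comes from any particular library):

```python
import re
from collections import Counter

def top_unigrams(text, n=5):
    """Return the n most probable words as (word, P(w)) pairs, with P(w) = count / total."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    total = len(tokens)
    if total == 0:
        return []
    counts = Counter(tokens)
    return [(word, count / total) for word, count in counts.most_common(n)]

# The meta "description" text quoted above.
description = ("Apple designs and creates iPod and iTunes, Mac laptop and desktop "
               "computers, the OS X operating system, and the revolutionary "
               "iPhone and iPad.")
print(top_unigrams(description, n=3))
```

On this short string the two highest-ranked entries are "and" and "the", so raw counts alone are dominated by function words; some filtering or reweighting is needed before the product names surface.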

Another question: are you looking at individual pages only, or at an entire site? For the Wikipedia examples I guess it's obviously just one page, but if you want to describe everything under apple.com you might need to use a crawler to fetch a bunch of pages on apple.com and then work with those. You could then try the simple method outlined above on that collection.

answered Nov 02 '10 at 11:24

Will Darling


To complement Will Darling's answer: that approach can fail trivially in simple scenarios. For most pages of English text, the most frequent word types are stopwords (the, and, as, if, a, an), so selecting by frequency will usually give you a list of stopwords. To avoid this you should either (a) use a stoplist (a manually curated list of words that are usually stopwords) or (b) have some background word frequency information and list as keywords the words that show the largest increase in frequency from what the background predicts to what is observed on the page.
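Both options can be sketched in a few lines. The stoplist below is a tiny illustrative sample, and scoring by the ratio of page probability to add-one-smoothed background probability is only one assumed reading of "largest increase"; the function names and the `alpha` parameter are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stoplist; a real one would be much longer.
STOPLIST = {"the", "and", "as", "if", "a", "an", "of", "to", "in", "is"}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def keywords_stoplist(text, n=5):
    """Option (a): most frequent words after dropping stoplist entries."""
    counts = Counter(t for t in tokenize(text) if t not in STOPLIST)
    return [word for word, _ in counts.most_common(n)]

def keywords_background(text, background_counts, n=5, alpha=1.0):
    """Option (b): rank by page probability over smoothed background probability."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    bg_total = sum(background_counts.values())
    vocab = len(set(counts) | set(background_counts))
    scores = {}
    for word, count in counts.items():
        p_page = count / total
        # Additive smoothing so unseen background words do not divide by zero.
        p_bg = (background_counts.get(word, 0) + alpha) / (bg_total + alpha * vocab)
        scores[word] = p_page / p_bg
    return sorted(scores, key=scores.get, reverse=True)[:n]

page = "the iphone and the ipad and the mac and the iphone"
# Made-up background counts, standing in for a large general corpus.
background = Counter({"the": 1000, "and": 800, "mac": 5, "iphone": 1, "ipad": 1})
print(keywords_stoplist(page, n=1))
print(keywords_background(page, background, n=2))
```

Option (b) needs no hand-curated list, but its output is only as good as the background counts, which here are invented purely for the example.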

(Nov 02 '10 at 13:43) Alexandre Passos ♦

I wrote a blog post releasing an XML-RPC version of KEA, and it contains a list of alternate keyphrase extraction implementations. Five filters also has a list, with essentially the same answers.

I am more curious about Paco Nathan's implementation 'textrank' of Mihalcea et al.'s graph-based algorithm, which I hadn't seen until you linked to it. If you try it out, please post an answer or comment here to describe your reaction.

answered Nov 04 '10 at 03:17

Joseph Turian ♦♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.