KEA Keyphrase Extraction as an XML-RPC service (code release)

Summary

We release code written by Ali Afshar, which turns the KEA keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the New BSD License.


Background

Keyphrase extraction (AKA terminology mining, term extraction, term recognition, or glossary extraction) is the process of extracting multi-word phrases that summarize the meaning of a text passage.

For example, for the document entitled “The Growing Global Obesity Problem: Some Policy Options to Address It”, the keyphrases might be: “developing countries”, “food consumption”, “overweight”, “taxes”, “prices”, “price policies”, “fiscal policies”, “feeding habits”, “nutritional requirements”, “diet”, “nutrition policies”, and “food intake”.

These keyphrases are useful for summarizing the topic of the text. They are also useful in later NLP processing steps, where they are sometimes more informative and more disambiguating than the individual word tokens alone.

KEA is a great keyphrase extraction implementation. It is useful because it is open-source, is backed by solid research, comes with some annotated training data, and can extract keyphrases over unrestricted text without needing a vocabulary of possible keyphrases.

Other implementations of keyphrase extraction include:

  • Maui, a topic extractor from the same people that wrote KEA.
  • topia.termextract, a Python term extractor. It is relatively noisy and proposes many bogus keywords, but it is simple to use. This is my recommendation if you want something quick-and-dirty that works immediately out of the box.

API implementations include:

  • Termine by NaCTeM, a permissive term extractor I’ve used in the past. They will give you bulk access for research purposes. It is a UK web service that is also relatively noisy and proposes many bogus keywords. However, it appears to me to be slightly more accurate than topia.termextract. YMMV.
  • Alchemy’s term extractor.
  • The Yahoo term extraction API, which is now only available through YQL. It has low recall but high precision. In other words, it gives you a small number of high-quality terms, but misses many of the terms in your documents.
  • Five Filters, a web service version of topia’s term extractor (see above).
  • Maui on Appspot.
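
To make the precision/recall trade-off mentioned for the Yahoo API concrete, here is a small sketch in Python. The gold list reuses the example keyphrases from the obesity document above; the two extractor outputs (`cautious` and `noisy`) are invented for illustration and do not come from any of the services listed.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted keyphrases against a gold set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    # Precision: what fraction of the returned phrases are right?
    precision = correct / len(predicted) if predicted else 0.0
    # Recall: what fraction of the right phrases were returned?
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Gold keyphrases: the example list from the obesity document above.
gold = [
    "developing countries", "food consumption", "overweight", "taxes",
    "prices", "price policies", "fiscal policies", "feeding habits",
    "nutritional requirements", "diet", "nutrition policies", "food intake",
]

# Hypothetical extractor outputs, invented for illustration only.
cautious = ["food consumption", "taxes", "diet"]
noisy = gold[:10] + ["bogus term %d" % i for i in range(10)]

cautious_scores = precision_recall(cautious, gold)
noisy_scores = precision_recall(noisy, gold)
print("cautious: precision=%.2f recall=%.2f" % cautious_scores)
print("noisy:    precision=%.2f recall=%.2f" % noisy_scores)
```

The cautious extractor is the low-recall, high-precision case: everything it returns is correct, but it finds only 3 of the 12 gold phrases. The noisy one finds most of them, at the cost of many bogus terms.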

Peter Turney has done a lot of research on keyphrase extraction, and licenses his implementation.

There is a wide academic literature on term extraction, which I won’t summarize here. The best introductory papers are by Park, who is now at IBM:
“Automatic glossary extraction: beyond terminology identification” and
“Glossary extraction and utilization in the information search and delivery system for IBM technical support”. You can read more about how to roll your own termex implementation here.

More information about the topic is available on the Maui blog.


Code

KEA is normally run as a standalone program that reads its input from disk. For speed, one might instead want a resident service that keeps the model in memory. Additionally, one might want to call this service from non-Java languages. XML-RPC is a widely supported standard for implementing remote services.

We hereby release the KEA service written by Ali Afshar, which turns the KEA keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the New BSD License.

Also included in the documentation is a description of how this Java program was converted into an XML-RPC service.
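
To illustrate the general pattern (a resident process that loads a model once and answers XML-RPC calls), here is a minimal, self-contained sketch using only the Python standard library. It is not the released KEA service: the method name `extract_keyphrases`, its parameters, the toy frequency-based extractor, and the locally started server are all assumptions made for illustration. Check the documentation shipped with the code for the real method names and port.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def load_model():
    # Stand-in for loading a trained model. In a resident service this
    # (potentially slow) step happens once, at startup.
    return {"stopwords": {"the", "of", "a", "to", "and", "in"}}


MODEL = load_model()


def extract_keyphrases(text, max_phrases):
    """Toy extractor (hypothetical): most frequent non-stopword tokens."""
    counts = {}
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")
        if token and token not in MODEL["stopwords"]:
            counts[token] = counts.get(token, 0) + 1
    # Sort by descending frequency, then alphabetically for stable output.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:max_phrases]


def start_server(host="127.0.0.1", port=0):
    # Port 0 asks the OS for any free port; a real deployment would fix one.
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    server.register_function(extract_keyphrases, "extract_keyphrases")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


# Start the stand-in service, then call it the way any XML-RPC client would.
server = start_server()
host, port = server.server_address
client = ServerProxy("http://%s:%d/" % (host, port))
phrases = client.extract_keyphrases(
    "Food prices and food taxes shape the diet of a country.", 3
)
print(phrases)

server.shutdown()
server.server_close()
```

Only the client half (the `ServerProxy` call) would be needed against a real running service; most languages have an equivalent XML-RPC client library, which is what makes the service callable from outside Java.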


  • JF Richard

    Hello,

    I invite you to look at SynchroTerm, a bilingual term extraction tool I developed. http://www.terminotix.com/docs/factsheet_synchroterm_en.pdf
    JF Richard, Terminotix

  • MAX

    Hey, it seems that when I set up the XML-RPC service, the KEA stopwords functionality no longer runs (it works if I run KEA from the command line). Any idea about this? Thanks!

