Summary
We release code written by Ali Afshar, which turns the KEA keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the New BSD License.
Background
Keyphrase extraction (AKA terminology mining, term extraction, term recognition, or glossary extraction) is the process of extracting multi-word phrases that summarize the meaning of a text passage.
For example, in this document entitled “The Growing Global Obesity Problem: Some Policy Options to Address It”, the keyphrases might be: “developing countries”, “food consumption”, “overweight”, “taxes”, “prices”, “price policies”, “fiscal policies”, “feeding habits”, “nutritional requirements”, “diet”, “nutrition policies”, and “food intake”.
These keyphrases are useful for summarizing the topic of the text. They are also useful in later NLP processing steps, where they are sometimes more informative and better at disambiguating than the individual word tokens in the text.
KEA is a great keyphrase extraction implementation. It is useful because it is open-source, is backed by solid research, comes with some annotated training data, and can extract keyphrases from unrestricted text without needing a vocabulary of possible keyphrases.
Other implementations of keyphrase extraction include:
- Maui, a topic extractor from the same people that wrote KEA.
- topia.termextract, a Python term extractor. It is relatively noisy and proposes many bogus keywords, but it is simple to use. This is my recommendation if you want something quick-and-dirty that works immediately out of the box.
API implementations include:
- Termine by NaCTeM, a permissive term extractor I’ve used in the past. They will give you bulk access for research purposes. It is a UK web service that is also relatively noisy and proposes many bogus keywords. However, it appears to me to be slightly more accurate than topia.termextract. YMMV.
- Alchemy’s term extractor.
- The Yahoo term extraction API, which is now only available through YQL. It is low recall but high precision. In other words, it gives you a small number of high quality terms, but misses many of the terms in your documents.
- Five Filters, a web service version of topia’s term extractor (see above).
- Maui on Appspot.
Peter Turney has done a lot of research on keyphrase extraction, and licenses his implementation.
There is a wide academic literature on term extraction, which I won’t summarize here. The best introductory papers are by Park, who is now at IBM:
“Automatic glossary extraction: beyond terminology identification” and
“Glossary extraction and utilization in the information search and delivery system for IBM technical support”. You can read more about how to roll your own termex implementation here.
More information about the topic is available on the Maui blog.
Code
KEA normally runs as a standalone program that reads its input from disk. For speed, one might instead want a resident service that keeps the model in memory. Additionally, one might want to call this service from non-Java languages. XML-RPC is a widely supported standard for implementing remote services.
We hereby release the KEA service, written by Ali Afshar, which turns the KEA keyphrase extractor into an XML-RPC service. This allows you to use KEA as a service, calling it from a variety of different programming languages. The code is released under the New BSD License.
Also included in the documentation is a description of how this Java program was converted into an XML-RPC service.
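To illustrate the resident-service pattern described above, here is a minimal Python sketch using only the standard library. It is not the KEA service itself: the toy model, the method name extract_keyphrases, and the host and port are all assumptions made for illustration, so check the documentation shipped with the code for the real method names and endpoint. The server loads its "model" once and keeps it in memory, and any XML-RPC client can then call it.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


class ToyKeyphraseModel:
    """Hypothetical stand-in for a trained model; built once, kept in memory."""

    def __init__(self, known_phrases):
        self.known_phrases = [p.lower() for p in known_phrases]

    def extract(self, text):
        # Return every known phrase that occurs in the text (toy logic, not KEA's).
        lowered = text.lower()
        return [p for p in self.known_phrases if p in lowered]


def start_server(model, host="127.0.0.1", port=0):
    """Serve the model over XML-RPC in a background thread; return (server, url)."""
    # Port 0 asks the OS for any free port.
    server = SimpleXMLRPCServer((host, port), logRequests=False)
    # "extract_keyphrases" is an assumed method name, not the real service's API.
    server.register_function(model.extract, "extract_keyphrases")
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, f"http://{host}:{server.server_address[1]}/"


def demo():
    model = ToyKeyphraseModel(
        ["food consumption", "developing countries", "price policies"]
    )
    server, url = start_server(model)
    try:
        client = ServerProxy(url)
        return client.extract_keyphrases(
            "Food consumption is rising in developing countries."
        )
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Because the client side is plain XML-RPC, the same call could be made from any language with an XML-RPC library, which is the point of wrapping a Java program this way.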