|
Wikipedia is recognized as a workable lexical semantic resource. Although a weak semantic source in its form, its data includes many features which can be used to mine effective taxonomies and, more generally, to provide support for various NLP tasks such as Named Entity Recognition (NER) or multilingual mappings.

For illustration, but by no means fully representative of the trend, see how the theme of using Wikipedia as a semantic source is developed by the NLP group at the Heidelberg Institute for Theoretical Studies. For example, Deriving a Large Scale Taxonomy from Wikipedia (2007) by Simone Paolo Ponzetto and Michael Strube.

I have plenty of ideas (many probably misguided) on how to "drill into" Wikipedia to support NE disambiguation in an upcoming pipeline of mine. This question deals with the practical issue of creating a Wikipedia back-end for easy use during NLP processing.

What's the "best" way of importing Wikipedia data dumps for easy interrogation during NLP processes? I'm about to try the Java Wikipedia Library (JWPL) and its MySQL backend.
I'd like suggestions of alternative libraries and also any practical information about this process at large for example |
|
Why do you want to import it into a DB? Most of the time NLP tools work best with raw files. You can use a Solr / Lucene index if you want to do fast full-text and similarity queries. To work on small files you can use the Mahout wikipediaXMLSplitter to split the dump into roughly equal files (e.g. 100MB each).

You might also be interested in some work I have just started here: pignlproc. This is a bunch of Apache Pig / OpenNLP based utilities to mine the Wikipedia dumps on a Hadoop cluster (or a single machine) using Pig scripts. I plan to couple it with the DBpedia.org ontology and entity-type relationship dataset (and/or uberblic.org and/or freebase.com) to build a multilingual, fine-grained set of NER training corpora for OpenNLP and friends. I also plan to wrap MaltParser as a Pig function and try to do semantic relationship extraction too.

Please feel free to fork pignlproc and contribute new stuff :) One restriction though: I don't want any dependency with a non-ASF-compatible license, since I might want to contribute that to Apache OpenNLP or Apache Mahout at some point.

Merci, Olivier, for your suggestions. I'm quite sure I want to load to a relational or semi-structured back-end. The Wikipedia data is itself quite structured, with various relations, lists, etc. The text per se of WP articles may, in time, be a good source of features and indeed
(Dec 23 '10 at 17:50)
ecotone
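For anyone who wants to try the raw-file route from the answer above before committing to a database, here is a minimal sketch of streaming pages out of a MediaWiki XML export with Python's standard library. It is only an illustration, not how JWPL, Mahout or pignlproc do it. The `iter_pages` and `local_name` helpers and the tiny `SAMPLE` document are made up for this example, and the namespace URI in the sample is an assumption (real dumps declare their own export version).

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        elem.clear()  # release the finished page's contents


# Tiny hand-made sample standing in for a real dump file (hypothetical data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Alpha</title><revision><text>First [[article]].</text></revision></page>
  <page><title>Beta</title><revision><text>Second article.</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, len(text))
```

For a real dump you would pass an opened file instead of `io.BytesIO(SAMPLE)`; a compressed dump can be opened with `bz2.open(path, "rb")` from the standard library. From there the pairs can go to flat files, a Lucene index, or a relational back-end, whichever suits the pipeline.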
|
|
For getting a plain text dump (the additional question) there are two easy ways: (1) Use wikiprep, a large Perl script by Zemanta. It can use all your CPUs, which is handy because otherwise we are talking days (!). wikiprep.sf.net (2) Use Freebase's WEX: http://wiki.freebase.com/wiki/WEX |
I have a related question: what's the best way to get rid of all the MediaWiki markup and be left with plain text only? (It would also be nice to remove all the non-textual sections of the pages.)
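To give a rough idea of what stripping the markup involves, here is a naive regex-based sketch in Python. It is an assumption-laden illustration, not a real wikitext parser: the `strip_markup` function is made up for this example, it only handles templates, `<ref>` footnotes, simple tags, internal links, bold/italic quotes and headings, and it will mangle tables, math, and any text that legitimately contains `<` or `>`. The dedicated tools mentioned above (wikiprep, WEX) are far more thorough.

```python
import re


def strip_markup(wikitext):
    """Very rough wikitext-to-plain-text conversion (a sketch, not a parser)."""
    text = wikitext
    # Remove innermost {{...}} templates repeatedly so nested ones go too.
    template = re.compile(r"\{\{[^{}]*\}\}")
    while template.search(text):
        text = template.sub("", text)
    # Drop self-closing refs, <ref>...</ref> footnotes, then leftover tags.
    text = re.sub(r"<ref[^>]*/>", "", text)
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.DOTALL)
    text = re.sub(r"<[^>]+>", "", text)
    # [[target|label]] -> label, [[target]] -> target.
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]|]*)\]\]", r"\1", text)
    # Remove bold/italic quote runs and unwrap == heading == markers.
    text = re.sub(r"'{2,}", "", text)
    text = re.sub(r"^=+[ \t]*(.*?)[ \t]*=+[ \t]*$", r"\1", text,
                  flags=re.MULTILINE)
    # Collapse runs of spaces left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text).strip()


sample = ("'''Paris''' is the capital of [[France]].{{citation needed}} "
          "See [[Eiffel Tower|the tower]].<ref>Some source</ref>")
print(strip_markup(sample))
```

Removing whole non-textual sections (references, external links, infoboxes written as tables) would need extra rules keyed on heading names or table syntax, which this sketch does not attempt.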