
Wikipedia is recognized as a workable lexical semantic resource. Although it is a weak semantic source in its native form, its data includes many features that can be used to mine effective taxonomies and, more generally, to support various NLP tasks such as Named Entity Recognition (NER) or multilingual mapping.

For illustration, though by no means fully representative of the trend, see how the theme of using Wikipedia as a semantic source is developed by the NLP group at the Heidelberg Institute for Theoretical Studies. For example: Deriving a Large Scale Taxonomy from Wikipedia (2007) by Simone Paolo Ponzetto and Michael Strube.

I have plenty of ideas (many probably misguided) on how to "drill into" Wikipedia to support NE disambiguation in an upcoming pipeline of mine. This question deals with the practical issue of creating a Wikipedia back-end for easy use during NLP processing.

What's the "best" way of importing Wikipedia data dumps for easy interrogation during NLP processing?

I'm about to try the Java Wikipedia Library (JWPL) and its MySQL backend. I'd like suggestions of alternative libraries, and also any practical information about this process at large, for example:
  - What's a good filtered dump one could use for a dry run?
  - How long does the import of pages-articles.xml.bz2 for the enwikisource dump take? (Obviously system-dependent, but a rough idea would be nice.)
  - Any particular gotchas? (Issues with UTF-8 encoding, malformed XML...)
  - Ways of reducing the breadth or depth of the repository without losing significant NLP-usable patterns
  - ...

asked Dec 21 '10 at 15:11

ecotone

I have a related question: what's the best way to get rid of all the MediaWiki markup and keep only the plain text? (It would also be nice to remove all the non-textual sections of the pages.)

(Dec 24 '10 at 14:07) yoavg

2 Answers:

Why do you want to import it into a DB? Most of the time, NLP tools work best with raw files. You can use a Solr / Lucene index if you want to do fast full-text and similarity queries.
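As a rough illustration of the raw-file route, here is a minimal Python sketch that streams (title, wikitext) pairs straight out of a dump without loading it into a database. It assumes the standard MediaWiki XML export layout (`<page>` elements containing `<title>` and `<text>`); the function names `iter_pages` and `open_dump` are made up for this example, and the in-memory sample stands in for a real pages-articles file.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, wikitext) pairs from a MediaWiki XML export stream."""
    title, text = None, ""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, ""
            elem.clear()  # drop the page's children once they are consumed


def open_dump(path):
    """Open a dump file in binary mode, decompressing .bz2 on the fly."""
    return bz2.open(path, "rb") if path.endswith(".bz2") else open(path, "rb")


# Tiny in-memory sample (hypothetical content) so the sketch runs without a real dump.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.4/">
  <page><title>Alpha</title><revision><text>First ''page''.</text></revision></page>
  <page><title>Beta</title><revision><text>Second page.</text></revision></page>
</mediawiki>"""

pages = list(iter_pages(io.BytesIO(SAMPLE)))
for title, text in pages:
    print(title, len(text))
```

For a real dump, something like `iter_pages(open_dump("pages-articles.xml.bz2"))` should work the same way; stopping after the first few thousand pages gives a cheap dry run.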

To work on small files you can use the mahout wikipediaXMLSplitter to split into roughly equal files (e.g. 100MB each).
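If you would rather not pull in Mahout just for this, the same idea can be sketched in a few lines of Python. This is not the Mahout tool, only a stand-in under stated assumptions: it relies on `<page>` and `</page>` tags appearing in the line-oriented way the official dumps lay them out, it measures size in characters rather than bytes, and the names `chunk_pages` and `write_chunks` and the `chunk-0000.xml` file pattern are invented for the example.

```python
import os
import tempfile


def chunk_pages(lines, max_chars):
    """Group whole <page>...</page> blocks into chunks of roughly max_chars."""
    chunk, size, page = [], 0, None
    for line in lines:
        if "<page>" in line:
            page = []  # start collecting a new page
        if page is not None:
            page.append(line)
        if "</page>" in line and page is not None:
            block = "".join(page)
            page = None
            # Start a new chunk rather than splitting a page in two.
            if chunk and size + len(block) > max_chars:
                yield chunk
                chunk, size = [], 0
            chunk.append(block)
            size += len(block)
    if chunk:
        yield chunk


def write_chunks(lines, out_dir, max_chars=100_000_000):
    """Write each chunk as its own small XML file; return the paths."""
    paths = []
    for i, chunk in enumerate(chunk_pages(lines, max_chars)):
        path = os.path.join(out_dir, f"chunk-{i:04d}.xml")
        with open(path, "w", encoding="utf-8") as out:
            out.write("<mediawiki>\n")
            out.writelines(chunk)
            out.write("</mediawiki>\n")
        paths.append(path)
    return paths


# Hypothetical three-page sample; a real run would pass an open dump file.
SAMPLE_LINES = [
    "<mediawiki>\n",
    "  <page>\n", "    <title>A</title>\n", "  </page>\n",
    "  <page>\n", "    <title>B</title>\n", "  </page>\n",
    "  <page>\n", "    <title>C</title>\n", "  </page>\n",
    "</mediawiki>\n",
]

with tempfile.TemporaryDirectory() as tmp:
    written = [os.path.basename(p)
               for p in write_chunks(SAMPLE_LINES, tmp, max_chars=90)]
print(written)
```

Pages are never split across files, so a chunk can overshoot the limit when a single page is larger than `max_chars`.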

You might also be interested in some work I have just started here: pignlproc. This is a bunch of Apache Pig / OpenNLP based utilities to mine the Wikipedia dumps on a Hadoop cluster (or a single machine) using Pig scripts. I plan to couple it with the DBpedia.org ontology and entity-type relationship dataset (and/or uberblic.org and/or freebase.com) to build a multilingual, fine-grained set of NER training corpora for OpenNLP and friends.

I also plan to wrap MaltParser as a Pig function and try to do semantic relationship extraction too.

Please feel free to fork pignlproc and contribute new stuff :) One restriction though: I don't want any dependency with a non-ASF-compatible license, since I might want to contribute this to Apache OpenNLP or Apache Mahout at some point.

answered Dec 21 '10 at 16:26

ogrisel

edited Dec 21 '10 at 19:21

Merci, Olivier, for your suggestions. I'm quite sure I want to load into a relational or semi-structured back-end. The Wikipedia data is itself quite structured, with various relations, lists, etc. The text per se of WP articles may, in time, be a good source of features, and indeed Solr or some similar system with all the built-in text parsing and indexing could prove useful if I ever go that route. The hint about DBpedia was very useful; I'll positively learn and steal what I can from there! Also, I'll be sure to check pignlproc (Man, you're getting me distracted... good stuff, good stuff!). Happy holidays.

(Dec 23 '10 at 17:50) ecotone

For getting a plain text dump (the additional question) there are two easy ways:

(1) use wikiprep, a large Perl script by Zemanta. It can use all your CPUs, which is handy because otherwise we are talking days (!). wikiprep.sf.net

(2) use Freebase's WEX: http://wiki.freebase.com/wiki/WEX
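For quick experiments on a handful of pages, a crude regex pass is sometimes enough. The Python sketch below is neither wikiprep nor WEX and does not reproduce what they do; it only strips the most common constructs (templates, `<ref>` footnotes, internal links, bold/italic quotes, headings). The function names are invented for the example, and cases such as tables or links nested inside image captions are deliberately left unhandled.

```python
import re


def strip_templates(text):
    """Remove {{...}} templates, including nested ones, by tracking depth."""
    out, depth, i = [], 0, 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(text[i])  # keep only text outside any template
            i += 1
    return "".join(out)


def wikitext_to_plain(text):
    """Very rough wikitext-to-plain-text conversion (an approximation)."""
    text = strip_templates(text)
    text = re.sub(r"<ref[^>]*/>", "", text)                      # self-closing refs
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)   # footnotes
    text = re.sub(r"<[^>]+>", "", text)                          # leftover HTML tags
    text = re.sub(r"\[\[(?:File|Image|Category):[^\]]*\]\]", "", text)
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)  # [[a|b]] -> b
    text = re.sub(r"\[https?://\S+\s+([^\]]+)\]", r"\1", text)     # [url label] -> label
    text = re.sub(r"'{2,}", "", text)                            # bold/italic marks
    text = re.sub(r"^=+\s*(.*?)\s*=+\s*$", r"\1", text, flags=re.M)  # == Heading ==
    return re.sub(r"[ \t]+", " ", text).strip()


sample = ("'''Paris''' is the capital of [[France]].{{citation needed}} "
          "It lies on the [[Seine|River Seine]].<ref>Some source</ref>")
plain = wikitext_to_plain(sample)
print(plain)
```

Anything beyond a dry run is better served by the two tools above, since real articles contain far more markup variety than these patterns cover.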

answered Dec 27 '10 at 12:00

Jose Quesada

edited Dec 27 '10 at 12:01


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.