|
I'm trying to find a good NLP corpus for learning grammatical structures from sentences. Specifically, I'm doing completely supervised training using the annotated parse trees, and the testing procedure involves re-creating those same parse trees from a given sentence. So far I've trained and tested on a toy grammar (with good success), and then stepped up to NLTK's subset of the Penn Treebank. While I'm getting promising results from the Penn Treebank, I think its small size (5% of the total) is responsible for at least some of the error. I have the option to buy the rest of the Penn Treebank, but it's not clear from the description whether the other 95% carries the same annotations, and I'd rather avoid paying if good data is already available for free.

One option is to use the Stanford Parser to create the annotations for me (I'm more interested in learning the rules than in specifically human-annotated text). I have code to do this now, but I'm having trouble finding a good source of sentences. The books from Project Gutenberg appear to be in decent shape, although some filtering still has to happen -- I come across random things like 'underlined' words and 'random-hyphenations'. Also, because of the nature of the project, many of the books are older and contain much more complex sentences than I would like. For example, I tried two of Jane Austen's books (they are relatively large), but after examining their parse trees I decided to look for something simpler. I've also (briefly) looked at Wikipedia, but non-words often appear in the text, along with other kinds of errors. Again, this can be filtered, but if that work has already been done, I'd much rather use someone else's.

So my general question is: can anyone suggest a good source that has relatively simple sentence construction and is fairly clean? What corpora are other people using? I tried a basic literature search, but I'm having trouble finding any general corpora in use.

edit: oops, I'm an idiot. The Penn Treebank documentation does state the number of skeletal parses it contains:
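For context, here's a minimal sketch of the kind of supervised input I mean, using NLTK's bundled Treebank sample to collect production rules from the annotated trees (the counting is just illustrative, not my actual training code):

    from collections import Counter
    from nltk.corpus import treebank  # requires the 'treebank' data package: nltk.download('treebank')

    # Count the CFG productions appearing in the annotated parse trees of
    # NLTK's ~5% sample of the Penn Treebank.
    rule_counts = Counter()
    for tree in treebank.parsed_sents():
        rule_counts.update(tree.productions())

    # The most frequent productions give a rough picture of the grammar.
    for production, count in rule_counts.most_common(10):
        print(count, production)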
|
|
I strongly recommend and frequently use the WaCky corpora. They're totally free; you just have to register so that they know who's using them. They have a large collection of blog and news articles that have been heavily cleaned, so they're mostly free of noisy tokens, in English, French, Italian, and a few other languages. For English, they also provide dependency-parsed and part-of-speech-tagged versions of the main corpus and of a 2009 dump of Wikipedia. If you'd like to use Wikipedia, this is probably the easiest way to do so, since all the wiki markup is removed and everything has been tokenized. These are the biggest free dependency-parsed corpora that I know of, and they've generally been large enough to test out any ideas I've had. As for free corpora already tagged with phrase structures, I don't know of any, but it shouldn't be too hard to parse ukWaC or WaCkypedia (their name for Wikipedia) with the Stanford Parser, as you suggest, especially since it's already tagged and tokenized.

Thanks, it looks promising. I may just use the Wikipedia extractor they link to and apply the Stanford Parser to that. I'm not as familiar with the Malt parser, so I'm hesitant to introduce too many unknowns. I think I may first create a word list and then filter out sentences containing words that don't appear in the list (roughly as in the sketch after this comment).
(Jan 31 '12 at 16:06)
nop
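A rough sketch of the word-list filter mentioned above; the file names are placeholders and the whitespace tokenization is only for illustration:

    # Keep only sentences whose tokens all appear in a known word list.
    def load_word_list(path):
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def filter_sentences(sentences, vocabulary):
        for sentence in sentences:
            tokens = [t.strip('.,!?;:"()') for t in sentence.lower().split()]
            words = [t for t in tokens if t]          # drop pure punctuation
            if words and all(w in vocabulary for w in words):
                yield sentence

    vocab = load_word_list("wordlist.txt")            # placeholder path
    with open("sentences.txt", encoding="utf-8") as f:  # placeholder path
        kept = list(filter_sentences((line.strip() for line in f), vocab))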
That's also a good approach. Although, even if you don't want to use the Malt parse information, you can still download the parsed version and just use the part-of-speech tags and already-split tokens (something like the sketch below). Both were produced with fairly standard tools, and the XML format is really clean. From my experience, dealing with raw Wikipedia has been a tremendously tedious task.
(Jan 31 '12 at 18:55)
Keith Stevens
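Not from the corpus documentation, just a sketch of how one might pull tokens and POS tags out of a WaCky-style vertical file; I'm assuming a word/lemma/POS column order here, so check the actual files and adjust the indices if the layout differs:

    # One token per line, tab-separated columns, sentences wrapped in <s>...</s>.
    def read_tagged_sentences(path):
        sentence = []
        with open(path, encoding="utf-8", errors="replace") as f:
            for line in f:
                line = line.strip()
                if line.startswith("<s"):
                    sentence = []
                elif line.startswith("</s"):
                    if sentence:
                        yield sentence
                elif line and not line.startswith("<"):
                    cols = line.split("\t")
                    word = cols[0]
                    pos = cols[2] if len(cols) > 2 else "UNK"  # assumed column order
                    sentence.append((word, pos))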
|
|
The only problem is that your model will only be as good as the Stanford Parser; in that case, you might as well use the Stanford Parser directly. If you are doing dependency parsing, you can look at the corpora from the CoNLL shared tasks (a sketch of reading that format is below). Jenny Finkel has used the OntoNotes corpus for evaluating her parser. I'm not sure where to get it, but I think it's free.

My goal right now is to induce any type of grammar; whether it's human-compatible or not is less important. While in theory it would be nice to have a human-annotated corpus, I fear the number of training samples required would make that a difficult goal. Thanks for the pointers, I'll check them out.
(Jan 31 '12 at 23:35)
nop
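For reference, a minimal reader for the CoNLL-X tab-separated format used by those shared tasks (ten columns per token, blank line between sentences); this is only a sketch, not code from the tasks themselves:

    # Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
    def read_conll(path):
        sentence = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    if sentence:
                        yield sentence
                        sentence = []
                    continue
                cols = line.split("\t")
                sentence.append({
                    "id": int(cols[0]),
                    "form": cols[1],
                    "pos": cols[4],
                    "head": int(cols[6]),
                    "deprel": cols[7],
                })
        if sentence:
            yield sentence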
|
|
You can get code from the University of Pisa to extract plain text from Wikipedia dumps. The Stanford Parser takes about 2 seconds per sentence on this data, so parsing all of Wikipedia will probably take several weeks even on a multi-core machine. (IIRC, I had to leave it running for three weeks on an 8-core machine.) One caution about using data from Project Gutenberg: sentences tend to be a lot longer in older text, long enough to frequently choke the Stanford Parser. I guess you'll know better about this from your own trials. But how much data are you looking for? Do you have a particular kind of data in mind?

I found the same problems with Project Gutenberg. After a decent amount of searching, I couldn't find any corpus with short enough sentences. I'm trying to test an algorithm's ability to learn CFGs, so I wanted a variety of corpora to use as benchmarks. It requires a decent amount of data for training (currently I'm using 10,000 training samples), which rules out many of the shorter corpora. I'm currently using hand-made CFGs for diagnostic purposes, since that way I know the structure of the grammar and can produce as many training samples as I want (see the sketch after this comment).
(Jul 06 '12 at 10:25)
nop
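For what it's worth, a minimal sketch of that diagnostic setup: define a toy CFG in NLTK and enumerate sentences from it as training samples (the grammar here is a placeholder, much smaller than the ones I actually use):

    from nltk import CFG
    from nltk.parse.generate import generate

    # Toy grammar, purely illustrative.
    grammar = CFG.fromstring("""
        S  -> NP VP
        NP -> Det N | Det N PP
        VP -> V NP | V NP PP
        PP -> P NP
        Det -> 'the' | 'a'
        N  -> 'dog' | 'cat' | 'park'
        V  -> 'saw' | 'chased'
        P  -> 'in'
    """)

    # Enumerate up to 10,000 sentences (bounded depth keeps the output finite).
    samples = [" ".join(tokens) for tokens in generate(grammar, depth=6, n=10000)]
    print(len(samples), samples[:3])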
|