
I am looking for a current survey of natural language parsers. Which ones are the most accurate? Which ones are both good and fast? What kind of accuracy should I expect?

I will of course be running tests on my text corpus to see which one best suits my needs, but I was wondering which ones are worth testing. I don't want to miss out on any good but lesser-known parsers.

asked Jul 01 '10 at 07:26

Anshul

edited Sep 24 '10 at 17:11

Joseph Turian ♦♦

DeSR (Dependency Shift-Reduce) is a transition-based dependency parser with state-of-the-art accuracy and the fastest performance.

(Jul 15 '10 at 13:17) Giuseppe Attardi

9 Answers:

If you're interested in constituents, I'd vote for the Berkeley parser or the Charniak/Johnson one. I find the Berkeley parser to be somewhat more accurate on non-news text, but Charniak's is faster (use the -T50 option for a big speed gain with a minor loss in accuracy). Both are far more accurate than Stanford's. Even if you are interested in Stanford Dependencies, they can be extracted from the output of these other parsers, and they will be more accurate than those produced by the Stanford parser (see the recent report).
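
If it helps, that conversion can be driven from Python by shelling out to the converter class bundled with the Stanford parser. A rough sketch; the jar name, class name, and flags are from memory, so treat them as assumptions and check them against the Stanford Dependencies manual for your version:

```python
import subprocess

# Convert Penn Treebank-style constituency parses (e.g. Berkeley or
# Charniak/Johnson output) into Stanford Dependencies by calling the
# converter that ships with the Stanford parser. Jar path, class name,
# and flags below are best-effort recollections -- verify them.
def trees_to_stanford_dependencies(tree_file, jar="stanford-parser.jar"):
    cmd = [
        "java", "-cp", jar,
        "edu.stanford.nlp.trees.EnglishGrammaticalStructure",
        "-treeFile", tree_file,   # file of bracketed parse trees
        "-basic",                 # basic (tree-shaped) dependencies
    ]
    return subprocess.check_output(cmd).decode("utf-8")

if __name__ == "__main__":
    # "berkeley_output.mrg" is a hypothetical file of parses, one per line.
    print(trees_to_stanford_dependencies("berkeley_output.mrg"))
```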

For dependency parsing, I'd go for MSTParser (use the second-order model) or, if speed is important, my own EasyFirst dependency parser.

answered Jul 23 '10 at 00:50

yoavg

I think Combinatory Categorial Grammar (CCG) parsers are also well worth considering. The tools (can be found here) are very well supported and quite mature. As far as speed goes: I've heard they parsed all of English Wikipedia with a few tens of machines in under 3 hours (but perhaps the experts can give the exact numbers).

Update: I can't find a reference for the Wikipedia claim, so it might be incorrect. In "Linguistically Motivated Large-Scale NLP with C&C and Boxer" I did find that they parsed the Gigaword corpus using 18 machines in under 5 days.

answered Jul 15 '10 at 15:46

Jurgen

edited Jul 15 '10 at 16:19


Yes, we've parsed all of Wikipedia using an undergraduate lab's worth of computers. It was discussed at the JHU workshop in 2009. (I'm one of James Curran's postdocs/former students. I work on the parser.)

On modern machines you can expect about 30 sentences a second with the current version checked out from trunk, and about 20 sentences/second with version 1.02. Work presented at ACL '10 by Kummerfeld et al. takes the speed up to 70 sentences a second by training the supertagger on the parser's output. These models are not currently being distributed, simply because we do not have a good hosting solution for models that are a few hundred MB (bah, humbug!).

These speeds refer to a configuration where all sentences, no matter how long, are attempted. If you're willing to throw out 1-2% of your sentences you could parse two or three times faster, I expect.

Output is available in Stanford Dependencies or Briscoe and Carroll grammatical relations. The raw CCG dependencies should only be used by those with strong CCG fu. We believe the parser is roughly as accurate as the Berkeley parser, based on comparisons described by Curran and Clark at ACL 2009 and Fowler and Penn at ACL 2010.
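
For a rough sense of how those throughput figures line up with the Gigaword run mentioned above, here is a back-of-envelope check. The only number not taken from this thread is the size of Gigaword, which I'm assuming is on the order of 10^8 sentences:

```python
# Back-of-envelope check of the throughput figures quoted above.
# Assumption: Gigaword is on the order of 1e8 sentences (my estimate,
# not a figure from this thread).
SENTS_PER_SEC_PER_MACHINE = 20   # version 1.02 speed quoted above
MACHINES = 18                    # from the C&C/Boxer paper
DAYS = 5

sentences_parsed = SENTS_PER_SEC_PER_MACHINE * MACHINES * DAYS * 86_400
print(f"~{sentences_parsed / 1e6:.0f}M sentences "
      f"in {DAYS} days on {MACHINES} machines")
# -> roughly 155M sentences, which is consistent with Gigaword-scale text
```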

(Jul 18 '10 at 20:08) syllogism

The Berkeley parser is very accurate (more institutional bias, probably), but I actually use the Stanford parser, which is far easier to use, better documented, and has accurate dependency parsing as well.

The accuracy varies depending on how close your text is to the PTB, but they've got F1 scores in the high 80s, as far as I remember.
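
Those figures are usually labeled-bracket F1 (the PARSEVAL measure). A minimal sketch of how it is computed; the gold and predicted brackets below are made up purely for illustration, and real PARSEVAL scoring has a few extra conventions:

```python
# Minimal sketch of labeled-bracket F1, the metric behind the
# "high 80s" figures. Brackets are (label, start, end) spans.
def bracket_f1(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    matched = len(gold & predicted)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Made-up example: 3 of 4 brackets agree in each direction.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
pred = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
print(bracket_f1(gold, pred))  # 0.75
```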

answered Jul 07 '10 at 13:16

aditi

Well, "best" how... Most comparative studies I have seen report Charniak's or Collin's parsers as most accurate. (Although Connexor, a commercial system, usually -- but not always -- scores materially higher.) However, in a comparison of dependency features, Stanford's parser usually wins.

answered Jul 03 '10 at 17:08

sgs


Which Collins parser? His thesis parser hasn't been SOTA since Charniak's deprecated it, IIRC.

(Jul 03 '10 at 19:27) Joseph Turian ♦♦

Especially for biomedical text, there are Enju and Mogura, which are HPSG parsers. They come packaged together, but Enju is optimized for accuracy and Mogura for speed.

answered Jul 03 '10 at 10:54

Cory Giles

edited Jul 03 '10 at 10:55

There is also the Stanford Statistical Parser (GPL), with models for English, Chinese, German, and Arabic, and the OpenNLP maximum-entropy-based parser (English model only) under the Apache and LGPL licenses. Both of them are implemented in Java.

In the Python world, the NLTK project comes with a bunch of parsers too.

I don't know whether they are state of the art, though. I would be very interested in a comparative summary of their respective performance if someone here has tried some of them.
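
For what it's worth, the parsers bundled with NLTK are mostly grammar-driven rather than trained statistical models. A minimal sketch, assuming NLTK 3's API, with a toy grammar and sentence made up for illustration:

```python
import nltk

# Toy context-free grammar, made up for illustration; NLTK's bundled
# chart parser needs a grammar (or a trained model) to do anything.
grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det N | NP PP
    VP  -> V NP | VP PP
    PP  -> P NP
    Det -> 'the' | 'a'
    N   -> 'parser' | 'sentence' | 'corpus'
    V   -> 'parses'
    P   -> 'from'
""")

parser = nltk.ChartParser(grammar)
tokens = "the parser parses a sentence from the corpus".split()

# Prints every licensed parse, including the PP-attachment ambiguity.
for tree in parser.parse(tokens):
    print(tree)
```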

answered Jul 01 '10 at 13:51

ogrisel


I've used the Stanford Parser quite a bit in the past, and recently wrapped it in some Clojure code. From what I understand it does fairly well, and is considered pretty "state-of-the-art."

(Jul 01 '10 at 15:33) Hamilton Ulmer

Compared with many of these other parsers, the Stanford Parser is well-documented and has a beautiful API. The conversion to typed dependencies is nice too. But in my experience, it's quite slow, especially compared with some of the parsers implemented in C++. (Of course not nearly as slow as the NLTK parsers).

(Jul 03 '10 at 11:47) Cory Giles

I have been using the Stanford parser for a while now, and so far I can say it just works the way you need it to. The API is good, and development is hassle-free.

(Jul 09 '10 at 06:33) ArchieIndian

If you prefer to try a dependency parser (as opposed to phrase structure parsers like the ones recommended above), MaltParser and MSTParser are good choices. Each of them has achieved state-of-the-art accuracy on several languages (which one is better for you depends on the language and dataset, of course); see the format sketch below for what their input and output look like.
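
Both work with dependency analyses in CoNLL-style tab-separated columns. A minimal reading sketch; the columns here are simplified (real CoNLL-X files have more of them) and the sentence is made up for illustration:

```python
# Minimal sketch of reading a dependency parse in a simplified
# CoNLL-style format (ID, FORM, POS, HEAD, DEPREL). MaltParser and
# MSTParser use the full CoNLL-X column set; this is just the idea.
SAMPLE = """\
1\tI\tPRP\t2\tSBJ
2\tneed\tVBP\t0\tROOT
3\ta\tDT\t5\tNMOD
4\tfast\tJJ\t5\tNMOD
5\tparser\tNN\t2\tOBJ"""

def read_conll(block):
    tokens = []
    for line in block.splitlines():
        idx, form, pos, head, deprel = line.split("\t")
        tokens.append({"id": int(idx), "form": form, "pos": pos,
                       "head": int(head), "deprel": deprel})
    return tokens

for tok in read_conll(SAMPLE):
    head = "ROOT" if tok["head"] == 0 else tok["head"]
    print(f"{tok['form']} --{tok['deprel']}--> {head}")
```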

answered Jul 01 '10 at 10:43

cgomezr

edited Jul 01 '10 at 12:44

Might be institutional bias, but I highly recommend the Berkeley Parser, which is state of the art. I've found that its performance suffers a bit less on non-newswire text than other parsers'. Two small bonuses: (1) it reports parse-forest posteriors if you need them, and (2) it is multi-core out of the box.

Update: I have a wrapper around the Berkeley Parser in my NLP Clojure package Mochi. See parser.clj.

answered Jul 01 '10 at 09:30

aria42

edited Jul 03 '10 at 11:51

Charniak's lab has been releasing its parsers for a while, and they have always been at or near the state of the art. You should look at the relevant publications, but IIRC the self-trained parser is better than the reranking parser. You can also use their self-training method to adapt the parser to your text domain, if you don't mind training it.

Besides that, for a while the parser by Carreras et al. (2008, "TAG, Dynamic Programming, and the Perceptron for Efficient, Feature-rich Parsing") was the state of the art. I emailed Terry Koo and asked nicely, and he shared the code with me. Which just goes to show: it never hurts to ask.

answered Jul 01 '10 at 09:08

Joseph Turian ♦♦
