I want to use a sentence segmentation tool to pre-process my data. However, I find that the well-known open-source NLP tool OpenNLP doesn't work well for sentence segmentation.

What other options would you recommend? Thanks!

asked Aug 21 '11 at 02:01

charlie

I'm not sure what exactly you want from a sentence segmentation tool (segmenting text into sentences? segmenting sentences into chunks?), but have you looked at the Stanford NLP tools? http://nlp.stanford.edu/software/

(Aug 21 '11 at 08:48) Alexandre Passos ♦

3 Answers:

I do not know which one is actually the best, but when I needed a sentence splitter a fair few years ago, this was the best freely available one I could find at the time:

http://cogcomp.cs.illinois.edu/page/tools_view/2

It was not perfect but it was pretty good for my needs at the time.

More recently I have come across these:

splitta (Dan Gillick's splitter): "Includes proper tokenization and models for very high accuracy sentence boundary detection (English only for now). The models are trained from Wall Street Journal news combined with the Brown Corpus which is intended to be widely representative of written English. Error rates on test news data are near 0.25%."

However, I have not had the chance to use either of them. I think GATE also has some sentence splitters, but I have no idea how good they are.

answered Aug 21 '11 at 18:23

Daniel Mahler

edited Aug 21 '11 at 20:14

(Note: I didn't actually try any of these)

The splitter by Dan Gillick (splitta) mentioned in the answer by Daniel Mahler should be good, at least according to the paper describing it. However, as it is trained on WSJ+Brown, it might be overfitted and not the best for your domain.
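One way to check whether any splitter suits your domain is to hand-segment a small sample of your own text and score the splitter's boundaries against it. The sketch below is only an illustration under stated assumptions: the function names, the two-sentence gold sample, and the `naive_split` baseline are all hypothetical, and in practice you would swap in the output of the tool you are evaluating.

```python
import re


def boundary_offsets(sentences):
    """Sentence-end positions, counted in non-whitespace characters.

    The final end-of-text boundary is dropped because every splitter
    gets it right trivially.
    """
    offsets = []
    position = 0
    for sentence in sentences:
        position += len("".join(sentence.split()))
        offsets.append(position)
    return set(offsets[:-1])


def boundary_scores(gold_sentences, predicted_sentences):
    """Precision, recall and F1 of predicted boundaries against gold ones."""
    gold = boundary_offsets(gold_sentences)
    predicted = boundary_offsets(predicted_sentences)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def naive_split(text):
    """Hypothetical baseline: split after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


# Hypothetical hand-segmented sample from your own domain.
gold = [
    "Dr. Smith went to Washington.",
    "He arrived at 5 p.m. on Monday.",
]
predicted = naive_split(" ".join(gold))

precision, recall, f1 = boundary_scores(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Counting positions in non-whitespace characters means the score does not depend on how each tool handles spaces and newlines between sentences.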

Another alternative is the unsupervised algorithm by Kiss and Strunk (2006), which is implemented in NLTK's "punkt" module. This implementation comes with a pre-trained English model, but can also be trained on any other (un-annotated) text.
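As a rough sketch of the train-on-your-own-text route, assuming NLTK is installed: `PunktTrainer` and `PunktSentenceTokenizer` are the classes in `nltk.tokenize.punkt`, while the training text and sample sentence here are made-up placeholders. A real run would use a large amount of raw text from your domain, and results on a toy corpus like this one are not meaningful.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Placeholder for a large body of raw, un-annotated in-domain text
# (hypothetical; normally read from your own files).
training_text = (
    "Dr. Lee reviewed the results. Dr. Lee then wrote the report. "
    "The patient saw Dr. Lee on Monday. The follow-up is next week. "
) * 20

# Learn abbreviations, collocations and sentence starters from the raw text.
trainer = PunktTrainer()
trainer.train(training_text, finalize=True)

# Build a tokenizer from the learned parameters and apply it.
tokenizer = PunktSentenceTokenizer(trainer.get_params())

sample = "Dr. Lee was late. The meeting started without her."
sentences = tokenizer.tokenize(sample)
for sentence in sentences:
    print(sentence)
```

Training this way needs no model download. The pre-trained English model is distributed separately through NLTK's data downloader, and the resource name depends on the NLTK version.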

answered Aug 22 '11 at 12:22

yoavg

The LingPipe Sentence Extractor works pretty well.

answered Sep 26 '11 at 01:59

y2p


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.