I am looking for methods for splitting text into coherent, sentence-like fragments. Tools I have tried, such as the NLTK Punkt splitter, do quite well on well-edited, curated text like books and newsprint, but have a hard time with text from the web, which may lack capitalization or spaces after punctuation, may have no punctuation at all, and often has grammatical and spelling problems as well. In such situations one arguably does not have sentences as such, but it is usually still possible for human readers to identify more or less independent statements. I am not looking for chunkers or shallow parsers that identify phrases, which is too fine-grained for my purposes, though I would want to split excessively complex sentences into simpler sentence-like parts. I am looking for research and tools for this problem.
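
To make the failure cases concrete, here is a minimal sketch of a naive regex baseline for this kind of noisy text. It is not Punkt and not a proposed solution: `split_noisy` and both patterns are hypothetical names made up for illustration. It assumes a boundary after `.`, `!` or `?` whenever a letter follows (even lowercase, even with no space), and it treats line breaks as boundaries.

```python
import re

# Hypothetical baseline, not part of NLTK or any other library.
# Boundary after ., ! or ? when a letter follows, even if the space
# or the capital letter is missing. Digits are excluded so "3.5" stays whole.
_PUNCT_BOUNDARY = re.compile(r'(?<=[.!?])\s*(?=[^\W\d_])')

# Assumption: line breaks mark boundaries, since unpunctuated web text
# often relies on them.
_LINE_BREAK = re.compile(r'\s*\n+\s*')


def split_noisy(text):
    """Split noisy text into rough sentence-like fragments."""
    fragments = []
    for line in _LINE_BREAK.split(text.strip()):
        # Splitting on a possibly empty match requires Python 3.7 or later.
        for piece in _PUNCT_BOUNDARY.split(line):
            piece = piece.strip()
            if piece:
                fragments.append(piece)
    return fragments


if __name__ == "__main__":
    for fragment in split_noisy("i like it.works great!would buy again"):
        print(fragment)
```

A rule like this will wrongly split abbreviations such as "e.g." and bare domain names such as "example.com", and it does nothing for text with neither punctuation nor line breaks, which is exactly the gap the question is about.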

asked Apr 30 '14 at 17:04

Daniel Mahler

edited Apr 30 '14 at 17:08


One Answer:

Have a look at this post. It may be helpful for you.

answered May 01 '14 at 09:48

Midas


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.