|
I am looking for methods for splitting text into coherent, sentence-like fragments. Tools I have tried, such as the NLTK Punkt splitter, do quite well on well-edited or curated text like books and newsprint, but they struggle with text from the web, which may lack capitalization, omit spaces after punctuation, have no punctuation at all, and often contains grammatical and spelling errors as well. In such situations one arguably does not have sentences as such, but human readers can usually still identify more or less independent statements. I am not looking for chunkers or shallow parsers, which identify phrases and are too fine-grained for my purposes, although I would like to split excessively complex sentences into simpler sentence-like parts. I am looking for research and tools that address this problem. |
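To make the problem concrete, here is a minimal sketch of the kind of naive baseline I would like to improve on. It uses only the standard library; `naive_split` and its regular expression are hypothetical names made up for illustration, not part of NLTK or any other tool. It copes with missing spaces and missing capitalization after punctuation, but, as expected, it does nothing for text with no punctuation at all and it will wrongly split abbreviations such as "e.g.".

```python
import re

# Hypothetical baseline: treat ., ! or ? as a boundary even when no space
# or capital letter follows, but only if a letter comes next, so that
# decimal numbers such as "2.5" stay whole.
_BOUNDARY = re.compile(r"(?<=[.!?])\s*(?=[A-Za-z])")


def naive_split(text):
    """Split text into rough sentence-like fragments (illustrative only)."""
    return [part.strip() for part in _BOUNDARY.split(text) if part.strip()]


sample = "great phone.battery lasts 2.5 days!would buy again"
print(naive_split(sample))
# ['great phone.', 'battery lasts 2.5 days!', 'would buy again']

# Text without any punctuation comes back as a single fragment:
print(naive_split("no punctuation here at all"))
# ['no punctuation here at all']
```

Splitting on a pattern that can match the empty string requires Python 3.7 or later. The second call shows the case I care about most: a human reader might still see several statements there, but a punctuation-driven rule cannot.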