|
What is the best word tokenizer for unrestricted text? |
|
It depends on the task you're solving. Tokenization is so important that any serious industrial-strength application has to roll its own, with its own set of exceptions and hacks. I have never heard of any serious endeavour using an out-of-the-box algorithm (much less a single regular expression). Most likely, a complex application will use several tokenizers, as different parts of the processing pipeline will have different requirements. A mistake at such a fundamental level will naturally propagate into all higher levels and is very costly to correct later. TL;DR: there is no such thing as "the best word tokenizer for unrestricted text" :-) To be a little constructive, here are some questions to consider before settling on a solution, from the IR perspective:
|
|
This Bob Carpenter blog post might be of interest. I guess the main issue is what to do with punctuation, more specifically periods and hyphens (and maybe also numbers). I usually use \w+ for tokenizing unless I want something good, and then I use NLTK, as recommended by spinxl39. |
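To make the punctuation issue concrete, here is a rough sketch in Python using only the standard re module. The sample sentence and the second pattern are my own illustration, not something from the blog post or from NLTK. It shows what \w+ does to periods, hyphens and numbers, next to a slightly richer (and still very incomplete) pattern.

```python
import re

# Made-up sample sentence containing periods, hyphens and a decimal number.
text = "Dr. Smith paid $3.50 for a state-of-the-art e-reader in the U.S."

# Plain \w+: every run of letters, digits or underscores becomes a token,
# so all punctuation is dropped and hyphenated words and decimals are split.
simple = re.findall(r"\w+", text)

# Hypothetical richer pattern, for illustration only. Alternatives are tried
# left to right, so the more specific ones come first.
PATTERN = re.compile(r"""
    \d+(?:\.\d+)?          # numbers with an optional decimal part
  | (?:[A-Za-z]\.){2,}     # dotted abbreviations such as U.S.
  | \w+(?:-\w+)*           # words, optionally with internal hyphens
  | [^\w\s]                # any other single non-space character
""", re.VERBOSE)

richer = PATTERN.findall(text)

print(simple)
print(richer)
```

With \w+ you get 3 and 50 as separate tokens, state-of-the-art falls apart into four pieces, and U.S. becomes U and S. The richer pattern keeps those together but still splits Dr from its period, which is exactly the sort of exception that piles up once you start caring about quality.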