
What is the best word tokenizer for unrestricted text?

asked Aug 13 '10 at 19:06

Joseph Turian ♦♦


3 Answers:

It depends on the task you're solving. Tokenization is so important that any serious industrial-strength application has to roll its own, with its own set of exceptions and hacks. I have never heard of any serious endeavour that used an out-of-the-box algorithm (much less a single regular expression). Most likely, a complex application will use several tokenizers, since different parts of the processing pipeline have different requirements.

A mistake at such a fundamental level will naturally propagate to all higher levels, and is very costly to correct later.

TL;DR: I guess my answer boils down to: there is no such thing as "the best word tokenizer for unrestricted text" :-)


To be a little constructive, here are some questions to consider before settling on a solution, from the IR perspective:

  1. Is ambiguous spelling handled here, in post-processing (at index time), or at query resolution time? (Nokia N8 vs. Nokia N-8 vs. Nokia N 8; Windows 2000 vs. Windows 2k; don't vs. dont, ...)
  2. What meta-information should be kept (glue characters between tokens? character case?)
  3. How should encoding errors be handled (ignore, replace, or correct)?
  4. How should dates (12.4.2010), ranges (10-55), IP addresses, and compound tokens in general (including smileys such as :-) and <3) be handled?
  5. Unicode support, from both the technological and the practical side (what to do with non-English input? for Latin-based scripts, what to do with accents?)
  6. Extra-algorithmic questions such as performance (anything below 1 MB/s really hurts) and the ability to maintain, hack, and extend the code base.
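
To make question 4 concrete, here is a minimal Python sketch of one possible approach: an ordered regular-expression alternation in which the more specific patterns (IP address, date, range, smiley) are tried before the generic word pattern. The names TOKEN_RE and tokenize and every pattern are assumptions made for illustration only; a real application would need its own, much longer list of exceptions.

```python
import re

# Ordered alternation: specific patterns come first so they win over the
# generic word pattern. All patterns here are illustrative assumptions.
TOKEN_RE = re.compile(r"""
    (?P<ip>\d{1,3}(?:\.\d{1,3}){3})      # IPv4-like address
  | (?P<date>\d{1,2}\.\d{1,2}\.\d{4})    # date written as 12.4.2010
  | (?P<range>\d+-\d+)                   # numeric range written as 10-55
  | (?P<smiley>[:;]-?[()DP]|<3)          # a handful of smileys
  | (?P<word>\w+(?:['-]\w+)*)            # words, keeping inner ' and -
  | (?P<punct>[^\w\s])                   # any other single symbol
""", re.VERBOSE)


def tokenize(text):
    """Return (kind, token) pairs; whitespace between tokens is dropped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


for kind, token in tokenize("Nokia N-8, 12.4.2010, 10-55 :-) <3 don't"):
    print(kind, token)
```

Note that this sketch already takes a position on question 1 (N-8 and don't stay single tokens) and on question 2 (case is kept, glue characters are thrown away). Those are exactly the decisions that differ from one application to the next.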

answered Nov 06 '10 at 06:23

Radim

I think Freeling, Andrew McCallum's Rainbow, and the one in NLTK are pretty good.

answered Aug 13 '10 at 19:44

spinxl39

This Bob Carpenter blog post might be of interest.

I guess the main issue is what to do with punctuation, more specifically periods and hyphens (and maybe also numbers). I usually use \w+ for tokenizing unless I want something good, and then I use NLTK, as recommended by spinxl39.
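
For reference, a quick Python sketch of the \w+ approach, showing what it does to periods, hyphens, apostrophes, and numbers. The function name simple_tokenize is made up for this example.

```python
import re


def simple_tokenize(text):
    # \w+ keeps runs of letters, digits, and underscores;
    # every other character acts as a separator and is dropped.
    return re.findall(r"\w+", text)


print(simple_tokenize("Mr. Smith's state-of-the-art U.S. model costs 3.50."))
# ['Mr', 'Smith', 's', 'state', 'of', 'the', 'art', 'U', 'S', 'model', 'costs', '3', '50']
```

Abbreviations, hyphenated compounds, possessives, and decimal numbers all get split apart, which is fine for a quick bag-of-words experiment and not much else.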

answered Aug 13 '10 at 20:18

Alexandre Passos ♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.