Hi, I am working with the OpenNLP library for tokenizing, POS tagging, etc. It allows a variable called beam size to be set when using the POS tagger. Can someone enlighten me as to what effect this variable has on the analysis? The default value is '3', and I've played around with values from 0 to 20 with no noticeable change in output.
POS taggers are usually built as hidden Markov models, conditional random fields, or maximum-entropy Markov models. The probabilistic part of these models only scores how good a given sequence of POS tags is for a given text; to tag the text, you then have to search for the best-scoring sequence. The algorithm usually employed for this is the Viterbi algorithm, which, to perform exact inference, must keep a running score for every possible tag at every word in the sentence.

To speed things up, what is usually done is to keep only the top-scoring partial tag sequences at each word and pretend that's all there is. The number of hypotheses kept per word is the beam size. Why not just keep the single best at each point? Because information later in the sentence can sometimes help disambiguate earlier decisions, although in practice this doesn't help all that much. So, to set this parameter, just choose the smallest value that seems to work for your task.
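Here is a rough sketch, in plain Java, of what beam search over tag sequences looks like conceptually. It is not OpenNLP's actual code; the tag set and the score() function are made-up placeholders standing in for the model's real probabilities. The part that matters for the question is the pruning step: the beam size is the number of partial tag sequences kept alive at each word.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Conceptual sketch of beam search over POS-tag sequences.
// NOT OpenNLP's implementation; TAGS and score() are placeholders.
public class BeamSearchSketch {

    static final String[] TAGS = {"DT", "NN", "VB", "JJ"};

    // One partial hypothesis: the tags chosen so far and their cumulative log score.
    record Hypothesis(List<String> tags, double logScore) {}

    // Hypothetical scorer: a real tagger would return the model's probability
    // of `tag` for the word at position i, given the tags chosen so far.
    static double score(String[] words, int i, List<String> history, String tag) {
        return (Math.floorMod((words[i] + tag).hashCode(), 100) + 1) / 101.0;
    }

    static List<String> tag(String[] words, int beamSize) {
        List<Hypothesis> beam = List.of(new Hypothesis(List.of(), 0.0));

        for (int i = 0; i < words.length; i++) {
            List<Hypothesis> expanded = new ArrayList<>();
            // Extend every surviving hypothesis with every possible next tag.
            for (Hypothesis h : beam) {
                for (String t : TAGS) {
                    List<String> next = new ArrayList<>(h.tags());
                    next.add(t);
                    double s = h.logScore() + Math.log(score(words, i, h.tags(), t));
                    expanded.add(new Hypothesis(next, s));
                }
            }
            // Prune: keep only the `beamSize` best partial sequences.
            // This is exactly what the beam size parameter controls.
            expanded.sort(Comparator.comparingDouble(Hypothesis::logScore).reversed());
            beam = new ArrayList<>(expanded.subList(0, Math.min(beamSize, expanded.size())));
        }
        return beam.get(0).tags();   // best complete sequence found
    }

    public static void main(String[] args) {
        String[] sentence = {"the", "quick", "fox", "jumps"};
        // Beam size 1 is greedy tagging; 3 matches OpenNLP's default.
        System.out.println(tag(sentence, 1));
        System.out.println(tag(sentence, 3));
    }
}
```

With a beam size of 1 this degenerates into greedy tagging; larger values keep more alternatives alive at the cost of more work per word, which is why raising it beyond the default often changes nothing visible in the output on easy sentences.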
Excellent, that answers the question. I'll leave it at the default setting the OpenNLP team used, with the option to change it if required. Thanks!
(Lee H, Jul 01 '10 at 17:12)