Hi,

I am working with the OpenNLP library for tokenizing, POS tagging, etc. It exposes a variable called beam size when using the POS tagger. Can someone enlighten me as to what effect this variable has on the analysis? The default value is 3, and I've played around with values from 0 to 20 with no noticeable change in the output.

asked Jul 01 '10 at 17:01

Lee H


One Answer:

POS taggers are usually built as hidden Markov models, conditional random fields, or maximum-entropy Markov models. The probabilistic part of such a model only scores, for a given text, how good a given sequence of POS tags is; to tag the text you must then search for the highest-scoring sequence. The algorithm usually employed for this is the Viterbi algorithm, which, to perform exact inference, must keep a running score for every possible POS tag of each word in the sentence. The usual way to speed these models up is to store only the top-scoring tags for each word and pretend that's all there is. The number of tags stored per word is what's referred to as the beam size.
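To make the idea concrete, here is a minimal beam-search tagger sketch. The tag set and the transition/emission log-scores are made up for illustration; this is not OpenNLP's model or API, just the generic pruning technique described above:

```python
# Toy beam-search POS tagger. All tags and scores below are hypothetical
# log-probabilities invented for this example, not taken from OpenNLP.

TAGS = ["DT", "NN", "VB"]

# EMISSION[word][tag]: log-score of tagging `word` with `tag` (made-up numbers).
EMISSION = {
    "the":   {"DT": -0.1, "NN": -3.0, "VB": -4.0},
    "dog":   {"DT": -4.0, "NN": -0.3, "VB": -2.0},
    "barks": {"DT": -5.0, "NN": -1.5, "VB": -0.4},
}

# TRANSITION[prev][tag]: log-score of `tag` following `prev` ("<s>" = sentence start).
TRANSITION = {
    "<s>": {"DT": -0.5, "NN": -1.0, "VB": -2.0},
    "DT":  {"DT": -4.0, "NN": -0.2, "VB": -3.0},
    "NN":  {"DT": -3.0, "NN": -1.5, "VB": -0.5},
    "VB":  {"DT": -1.0, "NN": -1.2, "VB": -3.0},
}

def beam_tag(words, beam_size):
    """Tag `words`, keeping only the `beam_size` best partial tag sequences
    at each step instead of scoring all of them (exact Viterbi)."""
    # Each hypothesis is a pair (log_score, tag_sequence_so_far).
    beam = [(0.0, [])]
    for word in words:
        candidates = []
        for score, seq in beam:
            prev = seq[-1] if seq else "<s>"
            for tag in TAGS:
                new_score = score + TRANSITION[prev][tag] + EMISSION[word][tag]
                candidates.append((new_score, seq + [tag]))
        # Prune: keep only the top `beam_size` hypotheses for the next step.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return beam[0][1]

print(beam_tag(["the", "dog", "barks"], beam_size=3))  # → ['DT', 'NN', 'VB']
```

With a beam size of 1 this degenerates to greedy tagging; with a beam as wide as the number of possible tags per step it is exact. On easy sentences like this one, a small beam already finds the same answer as a large one, which is why varying the parameter often produces no visible change in output.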

Why not just keep the single best tag at each point? In some cases, information later in the sentence helps disambiguate earlier decisions, though in most cases this doesn't change the result much.

So, to set this parameter, just choose the smallest value that seems to work for your task.

This answer is marked "community wiki".

answered Jul 01 '10 at 17:09

Alexandre Passos ♦

Excellent, that answers the question. I'll leave it at the default setting the OpenNLP team used with the option to change it if required. Thanks!

(Jul 01 '10 at 17:12) Lee H

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.