|
I want code that allows me to measure the grammaticality of a sentence. Alternatively, I want a POS tagger or parser implementation that can easily output a score, so that if there are weird language glitches that make a sentence hard to tag or parse, the sentence receives a low score. [edit: I'm assuming that a POS tagger or parser will assign lower scores to constructions like "A number of would be happy." because there is a missing noun, but perhaps this assumption is not true. Your thoughts?] Can anyone point me to existing code that makes it simple to do this?
|
NLTK's HMM tagger can give you a sentence's log probability. NLTK comes with a trained model, but I couldn't find a generic way to load it. Regardless, it's easy to train one yourself on the treebank. Given the tagger, just use the log_probability method. Bear in mind that this log probability always goes down as the sentence gets longer, so account for that when using it (for example, by comparing per-token scores rather than raw totals). Something similar feels like it should be possible with NLTK's parser packages, since they use PCFGs, but I couldn't find an API for it. You can probably hack the Stanford Parser into giving you this information, however.

Is the log probability based only on the tags or the parse structure, and not on the words themselves? You don't want to confuse rare words/phrases with ungrammatical sentences.
(Nov 17 '11 at 15:24)
Rob Renaud
I think rare words should be OK, as these models have UNK tokens under the hood (and, arguably, rare words are less grammatical than common words)?
(Nov 17 '11 at 21:16)
Alexandre Passos ♦
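To make the length and UNK points from this thread concrete, here is a minimal sketch of the idea in plain Python. It is not NLTK's implementation: it is a hand-built toy HMM whose tags, vocabulary, and probabilities are all made up for illustration, and the names `sentence_log_prob` and `per_token_score` are hypothetical. It scores a sentence with the forward algorithm (summing over all tag sequences), maps out-of-vocabulary words to an `<UNK>` token, and divides by the sentence length so that scores for sentences of different lengths are roughly comparable.

```python
import math

# Toy HMM with made-up probabilities (illustration only, not a trained model).
UNK = "<UNK>"
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.6, "a": 0.3, UNK: 0.1},
    "NOUN": {"dog": 0.4, "cat": 0.4, UNK: 0.2},
    "VERB": {"sees": 0.5, "chases": 0.4, UNK: 0.1},
}
VOCAB = {w for dist in EMIT.values() for w in dist}
FLOOR = 1e-6  # crude smoothing for (tag, word) pairs the toy model does not list


def log_emit(tag, word):
    """Natural log of the emission probability, with a small floor."""
    return math.log(EMIT[tag].get(word, FLOOR))


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def sentence_log_prob(words):
    """Natural-log probability of the words, summed over all tag paths."""
    if not words:
        raise ValueError("need at least one token")
    # Rare or unseen words fall back to the UNK token.
    words = [w if w in VOCAB else UNK for w in words]
    # Forward algorithm in log space.
    alpha = {t: math.log(START[t]) + log_emit(t, words[0]) for t in TAGS}
    for w in words[1:]:
        alpha = {
            t: logsumexp([alpha[s] + math.log(TRANS[s][t]) for s in TAGS])
            + log_emit(t, w)
            for t in TAGS
        }
    return logsumexp(list(alpha.values()))


def per_token_score(words):
    """Length-normalized score: average log probability per token."""
    return sentence_log_prob(words) / len(words)


good = "the dog chases a cat".split()
scrambled = "chases the a dog cat".split()
rare = "the wug chases a cat".split()  # "wug" is out of vocabulary
for sent in (good, scrambled, rare):
    print(" ".join(sent), per_token_score(sent))
```

In this toy setup the scrambled word order scores well below the grammatical sentence, while the sentence containing an unseen word is penalized only mildly, which is the behavior the comments above are hoping for. Per-token averaging is only one way to account for length; with a real tagger you would substitute its own log probability for `sentence_log_prob`.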
|