|
I want to feed a sentence into a function and get back the part of speech of each word. So if I put in "tomato sauce", it should know that "tomato" is an adjective here. I am using OpenNLP right now, and with the English tokenizer/POS tagger at http://opennlp.sourceforge.net/models/english/ it thinks "tomato" is a noun here. I would prefer to work with a Clojure library, but Java is acceptable as well. Also, I am new to this field, so links/resources for further reading are appreciated, although I would also like a quick answer since I am trying to hit a deadline at the moment! |
|
Perhaps the problem is the POS tagger you have built in :) "Tomato" is a noun in "tomato sauce", at least by most common POS tag schemes. The trouble is that you're essentially duck-typing the words. To you, "tomato" is acting in a way that adjectives normally act, so you want to call it an adjective. To probe the true class of "tomato", you can try to put it in a few contexts that only take adjectives and cannot take nouns: 1) I feel tomato; 2) very tomato. These are both ungrammatical, showing that "tomato" cannot be an adjective. The key concept you're missing is the distinction between the form of a word and its function. Noun, verb, adjective, etc. are word-class categories that essentially say "constituents syntactically headed by this word will be of type X". But constituents of a given type can perform a variety of functions. Noun phrases can function adverbially ("I got it done last week"), and individual nouns can function as nominal modifiers in noun phrases (your "tomato" in "tomato sauce"). There are tools which assign grammatical function labels, but the label schemes are pretty complicated. You'd need to learn about a specific linguistic theory like HPSG or CCG to understand what the tagger is telling you. I doubt this is a practical solution, but I'm happy to make recommendations if it is. It's unfortunate that you need to know this sort of thing. Ideally, good tools should hide as much of their internals from users as possible. Syntactic annotation tools currently don't do this very well. To use them, you really have to know a lot about the annotations they're giving you back. As someone who works on such tools, I wish I had an answer to this. |
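If all you need on a deadline is to know which nouns are acting as modifiers, one rough option is to post-process the tagger output yourself. The sketch below is only an illustration of the form/function split described above, not part of OpenNLP: the class `ModifierLabeler`, the label `noun-modifier`, and the rule "a noun directly followed by another noun is a modifier" are all my own assumptions, and it assumes the tagger emits Penn Treebank tags (NN, NNS, NNP, NNPS for nouns).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not an OpenNLP class: relabels POS tagger output.
// Assumes Penn Treebank tags, where every noun tag starts with "NN".
class ModifierLabeler {

    // True for the Penn Treebank noun tags NN, NNS, NNP and NNPS.
    static boolean isNoun(String tag) {
        return tag.startsWith("NN");
    }

    // Returns one label per tag: "noun-modifier" for a noun that is
    // directly followed by another noun, otherwise the original tag.
    static List<String> label(String[] tags) {
        List<String> labels = new ArrayList<>();
        for (int i = 0; i < tags.length; i++) {
            boolean nextIsNoun = i + 1 < tags.length && isNoun(tags[i + 1]);
            labels.add(isNoun(tags[i]) && nextIsNoun ? "noun-modifier" : tags[i]);
        }
        return labels;
    }

    public static void main(String[] args) {
        // Tags as a tagger would typically give them for "tomato sauce".
        String[] tags = {"NN", "NN"};
        System.out.println(label(tags)); // prints [noun-modifier, NN]
    }
}
```

This keeps the word class (noun) and the function (modifier) separate, which is the point of the answer. It will mislabel cases where two adjacent nouns do not form a compound, so treat it as a quick heuristic only.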
|
There isn't a very good adjective/noun distinction in English. POS taggers consistently get this wrong even on in-domain data. What application are you interested in that requires this distinction? It is true that in newswire, prenominal modifiers are very typically nouns (think "company man"), but the issue is more difficult than that. If you want to get a sense of what words can be used as noun modifiers, why not just count the number of times a word precedes a terminal noun, where terminal means the next word is a non-noun? Also, Alex's suggestion of using a parse tree probably won't help, since the tagging from parse trees typically underperforms that of a CRF tagger.
I had the feeling that this should be a thorny issue, but I haven't really studied POS tagging in depth. I deleted my answer, since it added nothing and was slightly wrong.
(Jul 19 '10 at 18:49)
Alexandre Passos ♦
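The counting heuristic suggested in the answer above could be sketched roughly as follows. This is a minimal sketch under assumptions: the input is already tokenized and tagged with Penn Treebank tags, the class name `TerminalNounCounter` is hypothetical, and the end of a sentence is treated as a non-noun, which the answer does not specify.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: counts how often each word directly precedes a
// "terminal" noun, i.e. a noun whose following word is not a noun.
class TerminalNounCounter {

    // Assumes Penn Treebank tags, where every noun tag starts with "NN".
    static boolean isNoun(String tag) {
        return tag.startsWith("NN");
    }

    // Adds the counts from one tagged sentence into the shared map.
    static void count(String[] tokens, String[] tags, Map<String, Integer> counts) {
        for (int i = 0; i + 1 < tokens.length; i++) {
            boolean nextIsNoun = isNoun(tags[i + 1]);
            // Assumption: a noun at the end of the sentence counts as terminal.
            boolean nextIsTerminal = i + 2 >= tokens.length || !isNoun(tags[i + 2]);
            if (nextIsNoun && nextIsTerminal) {
                counts.merge(tokens[i].toLowerCase(), 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        count(new String[]{"I", "like", "tomato", "sauce", "."},
              new String[]{"PRP", "VBP", "NN", "NN", "."}, counts);
        count(new String[]{"the", "tomato", "soup", "is", "hot"},
              new String[]{"DT", "NN", "NN", "VBZ", "JJ"}, counts);
        System.out.println(counts.get("tomato")); // prints 2
    }
}
```

As written, this counts every word before a terminal noun, so determiners and adjectives will show up too in sentences like "the sauce". If you only care about nouns used as modifiers, you could add a check that the counted word itself has a noun tag.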
|
|
I have been using Stanford Parser. I will not say it is perfect. But it does a decent job. Try it out online here first. |