Are there any results that prove that named entity recognition helps statistical semantics (LSA, topic models)? I haven't seen any, but the literature I check is mostly from psychology, where NER is not that popular.

What I want is a paper showing that you can do some human task (e.g. TOEFL, ranking, categorization) better after you have performed NER. Basic, run-of-the-mill NER, such as the kind OpenNLP provides...

asked Jan 22 '11 at 09:52

Jose Quesada

Why should it help? And how would you use it?

(Jan 22 '11 at 10:09) Alexandre Passos ♦

I think it could help, since it is able to extract nonlinear features that might be significantly correlated with the supervised-category signal, while the underlying terms are less so due to ambiguity. I am not sure this is really important in practice. Maybe on short texts with very ambiguous names (but I don't have a practical example at hand).

Term extraction using POS tagging / chunking (to extract all noun-phrase occurrences) and IDF weighting might be a good way to extract good features for topic categorization in a smaller-dimensional space than taking all the raw 3-grams and letting the classifier sort out the garbage (a sketch follows this comment). However, since classifiers are now scalable enough to handle very wide, sparse problems (see glmnet, Vowpal Wabbit, and SGD models), I am not sure this is useful in practice either.

(Jan 22 '11 at 10:22) ogrisel
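As a concrete sketch of the pipeline ogrisel describes: chunk noun phrases with a simple POS-based grammar, then IDF-weight the extracted phrases. The chunking grammar, the toy documents, and the choice of NLTK plus scikit-learn are assumptions for illustration; nothing in the thread prescribes these tools.

```python
# Requires NLTK data: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

# A toy chunking grammar: optional adjectives followed by one or more nouns.
GRAMMAR = nltk.RegexpParser("NP: {<JJ>*<NN.*>+}")

def noun_phrases(doc):
    """Return each noun-phrase occurrence in `doc` as a single feature."""
    tagged = nltk.pos_tag(nltk.word_tokenize(doc))
    tree = GRAMMAR.parse(tagged)
    return [" ".join(word for word, _ in subtree.leaves())
            for subtree in tree.subtrees() if subtree.label() == "NP"]

docs = ["The new assembly approved the budget proposal.",
        "Soul music fans packed the old assembly hall."]

# IDF weighting over noun phrases instead of over all raw n-grams.
vectorizer = TfidfVectorizer(analyzer=noun_phrases)
X = vectorizer.fit_transform(docs)
print(vectorizer.get_feature_names_out())
```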

ogrisel: except that NER doesn't give you better features; all it does is mark phrases in the text as being this or that entity. Maybe full coreference could help, but I don't see how NER can, unless you want to train an LSA model of named entities (and I don't see how that would be useful).

(Jan 22 '11 at 10:28) Alexandre Passos ♦

I meant: if you have the text "Mr White declared to the assembly that he was happy" and apply an NER analysis to it, you can enrich your initial bag of words (unigrams for the sake of clarity), {white, declared, assembly, happy}, with a new artificial word "person_white", which might be strongly correlated with the topic classes you are trying to extract (e.g. soul music), while the single token "white" might very often be used in the context of colors and hence be too noisy a predictor for "soul music" texts. (A sketch of this augmentation follows below.)

(Jan 22 '11 at 11:42) ogrisel
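To make the "person_white" idea concrete, here is a minimal sketch that augments a unigram bag of words with entity-typed pseudo-tokens. It uses NLTK's off-the-shelf named-entity chunker as a stand-in for "basic run-of-the-mill NER"; the `type_entity` pseudo-token format is an assumption for illustration, and whether the chunker actually tags "White" as PERSON depends on the model.

```python
# Requires NLTK data: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger');
# nltk.download('maxent_ne_chunker'); nltk.download('words')
import nltk
from collections import Counter

def ner_augmented_bow(text):
    tokens = nltk.word_tokenize(text)
    tree = nltk.ne_chunk(nltk.pos_tag(tokens))
    # Plain unigram bag of words.
    bow = Counter(t.lower() for t in tokens if t.isalpha())
    # Add one pseudo-token per recognized entity, e.g. "person_white",
    # so the classifier can separate Mr White from the color white.
    for subtree in tree:
        if isinstance(subtree, nltk.Tree):
            entity = "_".join(w.lower() for w, _ in subtree.leaves())
            bow["%s_%s" % (subtree.label().lower(), entity)] += 1
    return bow

print(ner_augmented_bow("Mr White declared to the assembly that he was happy"))
```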

I agree with ogrisel; this is why I asked. One problem with statistical semantic representations is that they collapse all meanings of a word into a single vector. In ogrisel's example, Mr. White and the color white would share a single vector. By adding the named-entity type you may separate those out. This could be important given how many such cases there are in natural language. I think it may help, and I'd be surprised if this is not already out there.

(Jan 22 '11 at 11:50) Jose Quesada

I think that semi-supervised word and n-gram embeddings are a more promising and more generic approach (see the work by Collobert and Weston, for instance).

(Jan 22 '11 at 12:11) ogrisel