Are there any results that show that named entity recognition (NER) helps statistical semantics (LSA, topic models)? I haven't seen any, but the literature I follow is mostly from psychology, where NER is not that popular. What I want is a paper showing that you can do some human task (e.g. TOEFL, ranking, categorization) better after you have performed NER. Basic run-of-the-mill NER, such as the kind OpenNLP provides...
Why should it help? And how would you use it?
I think it could help, since NER can extract non-linear features that may be significantly correlated with the supervised category signal, while the underlying raw terms are less so due to ambiguity. I am not sure this matters much in practice. Maybe on short texts with very ambiguous names (but I don't have a practical example at hand).
Term extraction using POS tagging / chunking (to extract all noun-phrase occurrences) with IDF weighting might be a good way to get useful features for topic categorization in a smaller-dimensional space than taking all the raw 3-grams and letting the classifier sort out the garbage. However, since classifiers are now scalable enough to handle very wide, sparse problems (see glmnet, Vowpal Wabbit, and SGD models), I am not sure this is useful in practice either.
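The IDF-weighting idea above can be sketched in a few lines. This is a toy illustration, not anyone's actual pipeline: the noun phrases are hard-coded and stand in for the output of a POS tagger / chunker, which is assumed to have run already.

```python
import math
from collections import Counter

# Pretend these noun phrases were extracted by a chunker (assumed step).
docs = [
    ["topic model", "word vector", "topic model"],
    ["word vector", "neural network"],
    ["neural network", "neural network"],
]

# Document frequency: in how many documents each phrase appears.
n_docs = len(docs)
df = Counter(phrase for doc in docs for phrase in set(doc))

# IDF down-weights phrases that occur in many documents.
idf = {p: math.log(n_docs / df[p]) for p in df}

# TF-IDF weights for the first document.
tf = Counter(docs[0])
weights = {p: tf[p] * idf[p] for p in tf}
print(weights)
```

Phrases occurring in every document get an IDF of zero, so only discriminative noun phrases survive as strong features.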
ogrisel: except that NER doesn't give you better features; all it does is mark phrases in the text as being this or that entity. Maybe full coreference resolution could help, but I don't see how NER can, unless you want to train an LSA model of named entities (and I don't see how that would be useful).
I meant: if you have the text "Mr. White declared to the assembly that he was happy" and run NER on it, you can enrich your initial bag of words (unigrams for the sake of clarity), {white, declared, assembly, happy}, with a new artificial word "person_white". That token might be strongly correlated with the topic classes you are trying to extract (e.g. soul music), while the single token "white" is very often used in the context of colors and hence would be too noisy a predictor for "soul music" texts.
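The enrichment described above can be sketched as follows. Note the `entities` mapping is supplied by hand here purely for illustration; in practice it would come from an NER tagger (e.g. OpenNLP), and `enrich_with_ner` is a hypothetical helper, not an existing API.

```python
def enrich_with_ner(tokens, entities):
    """Augment a unigram bag of words with entity-typed tokens.

    tokens:   list of word tokens from the text
    entities: {surface_form: entity_type}, the (assumed) NER output
    """
    bag = [t.lower() for t in tokens]
    # Add an artificial "type_word" feature for each recognized entity,
    # so "White" the person is distinguishable from "white" the color.
    for surface, etype in entities.items():
        if surface.lower() in bag:
            bag.append(f"{etype}_{surface.lower()}")
    return bag

tokens = ["White", "declared", "assembly", "happy"]
entities = {"White": "person"}  # pretend output of an NER tagger
print(enrich_with_ner(tokens, entities))
# ['white', 'declared', 'assembly', 'happy', 'person_white']
```

A downstream classifier or LSA model then sees "person_white" as a separate dimension from the ambiguous "white".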
I agree with ogrisel; this is why I asked. One problem with statistical semantic representations is that they collapse all meanings of a word into a single vector. In ogrisel's example, Mr. White and the color white would share a single vector. By adding the named-entity type you may separate those out. This could be important given how many such cases there are in natural language. I think it may help, and I'd be surprised if this is not already out there.
I think that semi-supervised word and n-gram embeddings are more promising and more generic approaches (see the work by Collobert and Weston for instance).