|
One thing that has always unsettled me about many NLP methods (pLSI, LDA, many CRF-based taggers/parsers) is that they treat each word in a language as an atomic token given its own direction. If you approach any reasonably sized corpus of text this way you end up with ~10^4 - 10^5 unique words (fewer if you apply standard stemming algorithms), and to deal with this people frequently remove stop words and ignore infrequent words (which are likely to carry a lot of information precisely because of their rarity). I respect the pragmatism of this approach, but throwing away the most and least frequent words wastes a lot of relevant information. It is also clear from inspection that many of the rare words in large corpora are morphologically derived from more frequently occurring words.
Does anyone know of any work on reducing the dimensionality of word representations by using, e.g., a character-level model (which could exploit the morphological structure) and using these learned representations as an input to methods like LDA? A related technique I've been very impressed by is neural network language models (see http://www.scholarpedia.org/article/Neural_net_language_models for background and references), which represent words based on the distribution of contexts in which they occur. I would also love to hear any practical experience on what works or doesn't work with these methods.
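To make the character-level idea concrete, here is a minimal sketch (my own illustration, not any published system) of representing words by hashed character n-grams, so morphological variants share most of their features and the dimensionality stays fixed instead of growing with the vocabulary. The n-gram range and the 512-bucket size are arbitrary choices for the sketch; vectors like these could in principle replace atomic word ids as input to a downstream model.

```python
# Minimal sketch (not a published system): represent each word by its
# character n-grams hashed into a fixed number of buckets, so that
# morphological variants like "run", "runs", "running" share most features
# and the dimension is fixed (here 512) instead of |V| ~ 10^4 - 10^5.
import numpy as np

def char_ngram_vector(word, n_min=3, n_max=5, dim=512):
    """Bag of hashed character n-grams for a single word, L2-normalized."""
    padded = "<" + word.lower() + ">"          # mark word boundaries
    vec = np.zeros(dim)
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            vec[hash(padded[i:i + n]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

# Morphologically related words end up closer in this space than unrelated ones:
for a, b in [("run", "running"), ("run", "table")]:
    sim = float(char_ngram_vector(a) @ char_ngram_vector(b))
    print(a, b, round(sim, 3))
```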
|
|
Google 'morphological parser', find something you like, and try to use it in a real-world problem. If it seems to significantly improve your task's performance, let us know! :) Also, I once used LDA with a very large number of topics, and the topics seemed to capture the different forms of the same word (i.e. the top words of many topics were just variants of the same thing [run, runs, ran, etc.]). I don't know if it would do that consistently, but it was still interesting to see. Yes, LDA usually captures different uses of the same words. See the Griffiths and Steyvers probabilistic topic models paper for more information: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.9625&rep=rep1&type=pdf .
(Aug 28 '10 at 22:14)
Alexandre Passos ♦
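For anyone wanting to reproduce the observation in the comment above, a rough sketch using gensim's LDA implementation (gensim is my assumption here, not something the commenters mention) might look like the following; with a real corpus and a deliberately large number of topics you can then check whether individual topics collect variants of the same word.

```python
# Rough sketch of the experiment described above, using gensim's LDA.
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Toy corpus for illustration: in practice use a large tokenized corpus
# without stemming, so inflected variants remain distinct tokens.
docs = [
    "he runs every morning and ran a marathon last year".split(),
    "she is running a small business and runs it well".split(),
    "the cat sat on the mat while the cats played".split(),
]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# A deliberately large number of topics, as in the comment above.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, passes=10)

# Inspect whether the top words of individual topics are morphological variants.
for k in range(5):
    print(k, [w for w, _ in lda.show_topic(k, topn=5)])
```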
|
|
You are right, it is possible to use deep convolutional neural networks to find an embedding of word senses into a reduced-dimensional space based on the contexts in which they frequently occur, using windows of about 10 words. You should have a look at the work of Ronan Collobert and Jason Weston, "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning", and at the SENNA implementation, which is not open source but which you can test for research purposes. The performance numbers advertised on the SENNA homepage are worth a look.
Edit: I had not seen the character-level part of your question: SENNA uses complete tokens, not character n-grams. But the first-layer embedding is still partly addressing the same issue. |
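To give a feel for what that first-layer embedding does, here is a heavily simplified numpy sketch of the Collobert & Weston lookup-table idea (an illustration only, not SENNA and not their full architecture; all sizes are invented): each word id indexes a row of a shared embedding matrix, a window of rows is concatenated, and a small network scores the window, with real windows trained to outscore corrupted ones.

```python
# Simplified illustration of the lookup-table ("first layer embedding") idea
# from Collobert & Weston, not SENNA itself. All sizes are made up.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, window, hidden = 10_000, 50, 5, 100

E = rng.normal(scale=0.1, size=(vocab_size, emb_dim))        # shared word embeddings
W1 = rng.normal(scale=0.1, size=(window * emb_dim, hidden))  # hidden layer
w2 = rng.normal(scale=0.1, size=hidden)                      # scoring layer

def score(word_ids):
    """Score a window of word indices (higher = more plausible text)."""
    x = E[word_ids].reshape(-1)          # lookup + concatenate: (window*emb_dim,)
    h = np.tanh(x @ W1)                  # hidden representation
    return float(h @ w2)

# In the C&W ranking setup, a real window is trained to outscore a corrupted
# one where the centre word is replaced by a random word:
real = np.array([12, 45, 7, 300, 9])
corrupt = real.copy()
corrupt[2] = rng.integers(vocab_size)
print(score(real), score(corrupt))
```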
|
I think even with standard linear models for classification you find approximations to that. If I recall correctly, most CRFs used for chunking/tagging/named entity recognition use morphological features. They're usually two-, three-, or four-letter prefixes/suffixes and capitalization features (separating CamelCase from Capitalized from ALLCAPS from uncapitalized). This paper gives some examples. Also, on Daumé's blog there was a post on using gzipped or byte features in standard NLP algorithms. But yes, it would be interesting to study these neural network architectures more deeply, or to use richer morphological models for word features as well. |
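As a concrete example of the kind of prefix/suffix and capitalization features described in the answer above, here is a small sketch of a per-token feature function of the sort fed to CRF taggers; the feature names are made up, but the pattern matches what crfsuite-style taggers commonly use.

```python
# Sketch of per-token features for a CRF tagger/chunker: short prefixes and
# suffixes plus a capitalization "shape". Feature names here are invented.
def word_shape(token):
    if token.isupper():
        return "ALLCAPS"
    if token.istitle():
        return "Capitalized"
    if any(c.isupper() for c in token):
        return "CamelCase"
    return "lower"

def token_features(token):
    feats = {"word.lower": token.lower(), "shape": word_shape(token)}
    for n in (2, 3, 4):
        if len(token) >= n:
            feats[f"prefix{n}"] = token[:n].lower()
            feats[f"suffix{n}"] = token[-n:].lower()
    return feats

print(token_features("Running"))
print(token_features("iPhone"))
```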