One thing that has always unsettled me about many NLP methods (pLSI, LDA, many CRF-based taggers/parsers) is that they treat each word in a language as an atomic token that is given its own direction. If you approach any seriously sized corpus of text this way you end up with ~10^4 - 10^5 unique words (fewer if you apply standard stemming algorithms), and to deal with this people frequently do things like remove stop words and ignore infrequent words (which are likely to carry a lot of information precisely because of their rareness). I respect the pragmatism of this approach, but it seems that throwing away the most and least frequent words wastes a lot of relevant information.
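For concreteness, here is a rough sketch of the pruning pipeline being criticized, assuming a plain whitespace tokenizer and an ad-hoc stop-word list (both are placeholders, not taken from any particular system):

    from collections import Counter

    def build_pruned_vocab(documents, stop_words, min_count=5):
        """Count token frequencies, then drop stop words and rare words,
        as in the standard pipeline described above."""
        counts = Counter()
        for doc in documents:
            counts.update(doc.lower().split())  # naive whitespace tokenizer
        vocab = {w for w, c in counts.items()
                 if c >= min_count and w not in stop_words}
        print("kept %d of %d unique words" % (len(vocab), len(counts)))
        return vocab

    # toy usage; real corpora give vocabularies in the 10^4 - 10^5 range
    docs = ["the cat runs", "the cats ran", "a dog runs quickly"]
    build_pruned_vocab(docs, stop_words={"the", "a"}, min_count=1)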

It is also clear from inspection that many of the rare words in large corpora are morphologically derived from more frequently occurring words. Does anyone know of any work that reduces the dimensionality of word representations by using, e.g., a character-level model (which could exploit this morphological structure), and then feeds the learned representations into methods like LDA?
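As one illustration of what a character-level representation could look like, here is a minimal sketch that maps each word to its character n-grams, so that morphological variants share most of their features (the n-gram range is an arbitrary choice, not taken from any particular paper):

    def char_ngrams(word, n_min=3, n_max=5):
        """Map a word to its character n-grams (with boundary markers) so that
        morphological variants of the same root share most of their features."""
        padded = "<" + word + ">"
        grams = set()
        for n in range(n_min, n_max + 1):
            for i in range(len(padded) - n + 1):
                grams.add(padded[i:i + n])
        return grams

    # 'running' and 'runs' now share features such as '<ru', 'run', '<run'
    print(char_ngrams("running") & char_ngrams("runs"))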

A related class of techniques I've been very impressed by are the neural network language models (see http://www.scholarpedia.org/article/Neural_net_language_models for background and references), which represent words based on the distribution of contexts in which they occur. I would also love to hear about practical experience with what does and doesn't work with these methods.
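The idea of representing words by their context distributions can be illustrated at toy scale without a neural network. The sketch below builds a word-by-word co-occurrence matrix and reduces it with an SVD; window size and dimensionality are arbitrary choices, and a real corpus would need a sparse implementation:

    import numpy as np

    def cooccurrence_embeddings(documents, window=2, dim=50):
        """Represent each word by a low-rank factorization of its co-occurrence
        counts with other words -- a crude, non-neural stand-in for the
        context-based representations discussed above."""
        tokens = [doc.lower().split() for doc in documents]
        vocab = sorted(set(w for doc in tokens for w in doc))
        index = dict((w, i) for i, w in enumerate(vocab))

        counts = np.zeros((len(vocab), len(vocab)))
        for doc in tokens:
            for i, w in enumerate(doc):
                lo, hi = max(0, i - window), min(len(doc), i + window + 1)
                for j in range(lo, hi):
                    if j != i:
                        counts[index[w], index[doc[j]]] += 1

        # truncated SVD of the count matrix gives low-dimensional word vectors
        u, s, _ = np.linalg.svd(counts, full_matrices=False)
        k = min(dim, len(vocab))
        return dict((w, u[index[w], :k] * s[:k]) for w in vocab)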

This question is marked "community wiki".

asked Jul 05 '10 at 00:47 by jbowlan

edited Aug 28 '10 at 21:35 by Joseph Turian ♦♦

3 Answers:

Google 'morphological parser', find something you like, and try to use it on a real-world problem. If it seems to significantly improve your task's performance, let us know! :)
Personally I don't think this will help a lot, at least for English, where a lot of words don't have much connection to their etymological roots. Some domains like medicine, where there's a lot of complicated but logical taxonomy, might benefit.


Also, I once used LDA with a very large number of topics, and the topics seemed to capture the different forms of the same word (i.e., the top words of many topics were just variants of the same thing [run, runs, ran, etc.]). I don't know if it would do that consistently, but it was still interesting to see.
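For reference, a rough sketch of that kind of experiment, assuming the gensim library (the answer does not say which LDA implementation was actually used); the corpus, topic count, and pass count are placeholders:

    from gensim import corpora, models

    def fit_many_topic_lda(documents, num_topics=100):
        """Fit LDA with a large number of topics and print each topic's top
        words, to look for topics that collect variants of one word
        (run, runs, ran, ...)."""
        texts = [doc.lower().split() for doc in documents]
        dictionary = corpora.Dictionary(texts)
        corpus = [dictionary.doc2bow(text) for text in texts]
        lda = models.LdaModel(corpus, id2word=dictionary,
                              num_topics=num_topics, passes=10)
        for topic_id, words in lda.show_topics(num_topics=num_topics,
                                               num_words=5, formatted=False):
            print(topic_id, [w for w, _ in words])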

answered Jul 05 '10 at 02:08 by Aditya Mukherji

edited Jul 05 '10 at 02:16

Yes, LDA usually captures different uses of the same words. See the Griffiths and Steyvers probabilistic topic models paper for more on this: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.9625&rep=rep1&type=pdf

(Aug 28 '10 at 22:14) Alexandre Passos ♦

You are right: it is possible to use deep convolutional neural networks to embed word senses into a reduced-dimensional space based on the contexts in which they frequently occur, using windows of 10 words. Have a look at the work of Ronan Collobert and Jason Weston, "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning", and their implementation, SENNA (not open source, but you can test it for research purposes).
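To make the first-layer idea concrete, here is a rough numpy sketch of the lookup-table-plus-window scoring used in that style of architecture; the sizes and initialization are arbitrary, and the ranking-loss training against corrupted windows is omitted:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, embed_dim, window, hidden = 10000, 50, 10, 100

    # first layer: a lookup table of word embeddings, one row per word
    embeddings = rng.normal(scale=0.1, size=(vocab_size, embed_dim))
    W1 = rng.normal(scale=0.1, size=(hidden, window * embed_dim))
    w2 = rng.normal(scale=0.1, size=hidden)

    def score_window(word_ids):
        """Score a window of word ids: look up embeddings, concatenate them,
        and pass the result through a hidden layer."""
        x = embeddings[word_ids].reshape(-1)  # concatenated window embeddings
        h = np.tanh(W1 @ x)
        return float(w2 @ h)

    print(score_window(rng.integers(0, vocab_size, size=window)))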

I reproduce here the performance numbers advertised on the SENNA homepage:

  • Part-of-Speech tagging (POS) (Toutanova et al., 2003) (Accuracy) 97.29%
  • Chunking (CHK) CoNLL 2000 (F1) 94.32%
  • Named Entity Recognition (NER) CoNLL 2003 (F1) 89.59%
  • Semantic Role Labeling (SRL) CoNLL 2005 (F1) 75.49%

Edit: I had not seen the character-level part of your question: SENNA uses complete tokens, not character n-grams. But the first-layer embedding still partly addresses the same issue.

answered Jul 05 '10 at 03:55 by ogrisel

edited Jul 05 '10 at 03:57

I think even with standard linear models for classification you find approximations to that. If I recall correctly, most CRFs used for chunking/tagging/named entity recognition (for example) use morphological features. These are usually two-, three-, or four-letter prefixes/suffixes, plus capitalization features (separating CamelCase from Capitalized from ALLCAPS from uncapitalized). This paper gives some examples of this. Also, on Daume's blog, there was a post on using gzip or byte features in standard NLP algorithms.
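A small sketch of the kind of orthographic feature extractor described above; the feature names are illustrative, and real taggers use many more features:

    def orthographic_features(token):
        """Prefix/suffix and capitalization features of the kind commonly fed
        to CRF taggers and chunkers; the feature names are illustrative."""
        feats = {}
        low = token.lower()
        for n in (2, 3, 4):
            if len(token) >= n:
                feats["prefix%d=%s" % (n, low[:n])] = 1.0
                feats["suffix%d=%s" % (n, low[-n:])] = 1.0
        if token.isupper():
            feats["shape=ALLCAPS"] = 1.0
        elif token[:1].isupper() and any(c.isupper() for c in token[1:]):
            feats["shape=CamelCase"] = 1.0
        elif token[:1].isupper():
            feats["shape=Capitalized"] = 1.0
        else:
            feats["shape=lowercase"] = 1.0
        return feats

    print(orthographic_features("McDonald"))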

But, yes, it would be interesting to study these neural network architectures more deeply, or to make more use of morphological models for word features as well.

answered Jul 05 '10 at 07:45 by Alexandre Passos ♦
