Hello everyone,

My question is actually in two parts. To put it in context: I would like to train and test/compare several (neural) language models. In order to focus on the models rather than on data preparation, I chose to use the Brown corpus from NLTK and to train the n-gram model that ships with NLTK as a baseline (to compare the other LMs against).

So my first question is actually about a behaviour of NLTK's n-gram model that I find suspicious. Since the code is rather short, I pasted it in a gist, linked here: Corpus preparation
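In outline, the preparation and training look like the sketch below. It is written against the current nltk.lm API rather than the older NgramModel the gist uses, and the trigram order, lowercasing, Laplace smoothing and 95/5 split are illustrative choices, not necessarily exactly what the gist does:

    from nltk.corpus import brown  # requires nltk.download('brown')
    from nltk.lm import Laplace
    from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
    from nltk.util import ngrams

    n = 3  # trigram model

    # Lowercase the Brown sentences and hold out the last 5% for testing.
    sents = [[w.lower() for w in sent] for sent in brown.sents()]
    split = int(0.95 * len(sents))
    train_sents, test_sents = sents[:split], sents[split:]

    # Padded training n-grams plus the flat vocabulary stream, then fit.
    train_ngrams, vocab = padded_everygram_pipeline(n, train_sents)
    lm = Laplace(n)
    lm.fit(train_ngrams, vocab)

    # Perplexity over the padded test trigrams.
    test_ngrams = [ng for sent in test_sents
                   for ng in ngrams(pad_both_ends(sent, n=n), n)]
    print(lm.perplexity(test_ngrams))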

What I find very suspicious is that I get the following results:

    ... build
    ... train
    len(corpus) = 1161192, len(vocabulary) = 13817, len(train) = 1103132, len(test) = 58060
    perplexity(test) = 4.60298447026

With a perplexity of 4.6 it seems n-gram modeling is very good on that corpus. If my interpretation is correct, the model should be able to guess the correct word in roughly 5 tries on average, even though there are 13,817 possibilities. Could you share your experience of what perplexity values are typical here? I did not find any complaints about NLTK's n-gram model on the net (but maybe I am doing something wrong).
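For reference, perplexity is 2 to the cross-entropy, i.e. the geometric mean of the inverse per-token probabilities, which is why it can be read as an effective branching factor. A toy computation with made-up probabilities:

    import math

    # Hypothetical per-token probabilities a model assigns to a 4-token test text.
    probs = [0.25, 0.1, 0.5, 0.05]

    # Cross-entropy in bits per token; perplexity = 2 ** cross-entropy.
    h = -sum(math.log2(p) for p in probs) / len(probs)
    print(2 ** h)  # ~6.32: as uncertain as a uniform choice among ~6 words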

Anyway, my second question is about good practice for training and comparing LMs. Is there a recommended corpus or tokenizer? Which words should be considered rare? Is it better to keep the capitalisation of words? Actually, I'd welcome any advice you can share, or experience you have had doing something similar.
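To make the rare-word part concrete, this is the kind of preprocessing I have in mind; the min_count threshold of 2 and the <unk> token name are placeholders I made up:

    from collections import Counter

    def apply_unk(tokens, min_count=2, unk="<unk>"):
        """Replace tokens seen fewer than min_count times with an UNK symbol."""
        counts = Counter(tokens)
        return [t if counts[t] >= min_count else unk for t in tokens]

    tokens = "the cat sat on the mat".split()
    print(apply_unk(tokens))  # every singleton becomes <unk>; only "the" survives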

Thanks!

asked May 09 '13 at 13:55 by Arnaud
edited May 09 '13 at 14:34