|
What is the state of the art in language modeling? In most applications where language models (LMs) are needed, people just use simple 3-, 4- or 5-grams (estimated by count-and-normalize) that are smoothed in some way (Witten-Bell, Kneser-Ney, etc.) and maybe pruned (e.g. based on relative entropy or weighted difference). Do they use those simple LMs just for convenience, because they are easy to estimate, represent, and compute with? Or are those LMs also the best in performance (measured in perplexity or task-based performance)? How well do the alternative LMs perform, e.g. neural-net LMs, random-forest LMs, parsing-based LMs, locally or globally normalized maximum-entropy LMs?
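For concreteness, here is a minimal sketch of the "count-and-normalize plus smoothing" baseline described above, using simple linear interpolation as a stand-in for the smoothing methods mentioned (Witten-Bell, Kneser-Ney). The toy corpus and lambda weights are made up for illustration, not tuned values.

```python
# Minimal count-and-normalize trigram LM with linear interpolation smoothing.
# Corpus and interpolation weights are illustrative only.
from collections import Counter
import math

corpus = "the cat sat on the mat the dog sat on the rug".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
total = len(corpus)

def interp_prob(w1, w2, w3, lambdas=(0.6, 0.3, 0.1)):
    """P(w3 | w1, w2) as an interpolation of trigram, bigram and unigram MLEs."""
    p3 = trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    p2 = bigrams[(w2, w3)] / unigrams[w2] if unigrams[w2] else 0.0
    p1 = unigrams[w3] / total
    l3, l2, l1 = lambdas
    return l3 * p3 + l2 * p2 + l1 * p1

def perplexity(words):
    """Perplexity of a word sequence under the interpolated trigram model."""
    logp = sum(math.log(interp_prob(w1, w2, w3))
               for w1, w2, w3 in zip(words, words[1:], words[2:]))
    return math.exp(-logp / (len(words) - 2))

print(perplexity("the cat sat on the rug".split()))
```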
|
There are many different variants that are hard to compare:
Not all of these have been compared on all datasets, so you might find a bit of variability. Kneser-Ney still seems to be either state of the art or very close to it, however. In general, each of these papers claims to beat baselines and other models. The HLBL model fares better on perplexity than Kneser-Ney, IIRC, but all of these papers (and their follow-up work) have positive-looking evaluations against baselines and each other. You're better off reading their evaluation sections for more specific results.

The main thing that bugs me about n-gram-based language models is that they're useless for doing anything other than assigning probabilities to documents. IMO the most interesting thing about these neural language models is that having access to a language model can help you solve many unrelated tasks, for example by using the hidden units of the model as features, but also by re-training your language model in a multi-task sense for different tasks, etc. Part of this is due to the standard shallow linear algorithms, but a lot of it is also due to the fact that a smoothed n-gram table is really just a shallow representation of the documents. This doesn't match the usual intuition that knowing a language makes all sorts of tasks in that language easier.

The second main thing that bugs me is that these models mostly ignore morphology, which IME is usually a very reliable clue for many natural-language-related tasks (but maybe this is because my first language is Portuguese, which is a lot richer morphologically than English).
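To make the "hidden units as features" point concrete, here is a toy, untrained sketch of a Bengio-style feed-forward neural LM whose hidden layer could be reused as features for other tasks. The weights are random and the sizes arbitrary; this only illustrates the architecture, not a competitive model.

```python
# Toy feed-forward neural LM: concatenated context embeddings -> hidden layer
# -> softmax over the vocabulary. The hidden layer can be reused as features.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim, context, hidden_dim = 1000, 32, 3, 64

E = rng.normal(scale=0.1, size=(vocab_size, embed_dim))         # word embeddings
W_h = rng.normal(scale=0.1, size=(context * embed_dim, hidden_dim))
W_out = rng.normal(scale=0.1, size=(hidden_dim, vocab_size))

def hidden_features(context_ids):
    """Concatenate the context embeddings and map them to the hidden layer."""
    x = E[context_ids].reshape(-1)                               # (context * embed_dim,)
    return np.tanh(x @ W_h)                                      # (hidden_dim,)

def next_word_probs(context_ids):
    """Softmax over the vocabulary given the previous `context` word ids."""
    logits = hidden_features(context_ids) @ W_out
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

# The same hidden vector can be fed to an unrelated classifier (tagging,
# sentiment, ...) instead of the softmax, or fine-tuned in a multi-task setup.
feats = hidden_features(np.array([5, 17, 243]))
print(feats.shape, next_word_probs(np.array([5, 17, 243])).sum())
```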
We need a language-modeling shared task (e.g. at CoNLL) so the different approaches can be compared in a standardized way. The problem is that this could only evaluate on perplexity, but not on task-based measures like WER in speech recognition or BLEU in machine translation, since many LMs are too hard to integrate into such systems ...
(Aug 28 '10 at 11:45)
Frank
I think it'd be hard to get a fair evaluation for a language-modeling shared task unless you crippled the models. For language modeling there'd be a strong incentive to cheat and just use whatever corpora you can get to improve your results, so you'd end up evaluating skill at acquiring related data instead of model performance (and even if you don't actually use this extra data to train the model, you can do a lot of tuning of hyperparameters and similar things to improve perplexity). On the other hand, if you do something like take a corpus in an unspecified language and replace all word types by random numbers (to avoid the data-harvesting problem), you end up crippling the models and glossing over the fact that morphological cues are very important for unseen words (which someone could leverage to improve performance in such a task, especially if the language is not English and has declensions/conjugations). And there is also the perplexity vs. "real-world" loss issue you mentioned, although this could be testable if the shared task included a full pipeline for a translation/recognition system into which you can plug the language model and get dev-set results.
(Aug 28 '10 at 12:01)
Alexandre Passos ♦
|
|
To add to Alexandre's reply, in terms of scalability I'd suggest looking at the approaches proposed in this paper and this paper, both of which propose streaming-based solutions. The first paper uses a dynamic Bloomier filter, whereas the second is based on approximating the frequency counts over streaming data. Also, this paper http://www.icml2010.org/papers/549.pdf is, I think, the latest in this trend, and has an easier-to-implement method with roughly equal guarantees.
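As a rough illustration of the approximate-counting flavor of those streaming approaches (not the exact constructions from the cited papers), here is a count-min sketch that stores n-gram counts in fixed memory; the width and depth values are arbitrary.

```python
# Generic count-min sketch for approximate n-gram counts over a stream.
# Not the Bloomier-filter or ICML 2010 construction, just the basic idea.
import hashlib

class CountMinSketch:
    def __init__(self, width=2**16, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def query(self, key):
        # Over-estimates only; the error shrinks as width grows.
        return min(self.table[row][col] for row, col in self._buckets(key))

cms = CountMinSketch()
stream = "the cat sat on the mat the cat sat".split()
for trigram in zip(stream, stream[1:], stream[2:]):
    cms.add(" ".join(trigram))
print(cms.query("the cat sat"))  # approximately 2
```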
(Aug 28 '10 at 10:27)
Alexandre Passos ♦
|