Frank writes: "We need a language-modeling shared task (e.g. at CoNLL) so the different approaches can be compared in a standardized way."

I was trying to think of a comparative evaluation technique that is part of a real-world task but is hard to game. Following up on Frank's MT idea, how does the following MT reranking task sound?

You are given k sentences, each denoted e, and you are asked to assign a probability Pr(e) to each of them.

These k sentences are potential translations of the sentence f (which you are not told). The MT system has an estimate Pr(f|e) for each sentence e, and you are also not told these estimates.

The evaluation is to use your LM to pick one of the k candidates, the argmax of Pr(f|e)*Pr(e). Then, the score of your LM is an MT evaluation measure computed over your choice of e.
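To make the selection step concrete, here is a minimal sketch in Python, working in log space. Everything named here is hypothetical: `rerank`, the toy unigram LM, and its made-up probabilities stand in for a submitted model, and `tm_logprobs` stands in for the hidden log Pr(f|e) scores that only the organizers would hold.

```python
import math
from typing import Callable, Sequence


def rerank(
    candidates: Sequence[str],
    tm_logprobs: Sequence[float],
    lm_logprob: Callable[[str], float],
) -> int:
    """Return the index of the candidate e maximizing log Pr(f|e) + log Pr(e)."""
    if len(candidates) != len(tm_logprobs):
        raise ValueError("need one translation-model score per candidate")
    scores = [tm + lm_logprob(e) for e, tm in zip(candidates, tm_logprobs)]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy stand-in for a submitted LM: a unigram model with made-up probabilities.
TOY_UNIGRAMS = {"the": 0.4, "cat": 0.2, "sat": 0.2, "sits": 0.1}
OOV_PROB = 0.01  # made-up floor for words outside the toy vocabulary


def toy_lm_logprob(sentence: str) -> float:
    """Sum of per-word log probabilities under the toy unigram model."""
    return sum(math.log(TOY_UNIGRAMS.get(w, OOV_PROB)) for w in sentence.split())


candidates = ["the cat sat", "the cat sits", "cat the zzz"]
# With equal translation-model scores, the LM alone decides the winner.
best = rerank(candidates, [-1.0, -1.0, -1.0], toy_lm_logprob)
```

The MT evaluation measure would then be computed on `candidates[best]`, so the participant only ever supplies `lm_logprob`.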

Does this sound like a good way to compare language models? What objections do you see? One objection is that the model of Pr(f|e) could introduce bias into the evaluation. For example, if Pr(f|e) came from, say, a PBMT system trained with an n-gram language model, then this might unfairly favor n-gram language models.

asked Aug 28 '10 at 21:35

Joseph Turian ♦♦


One Answer:
-1

I have two objections:

  1. A language model is supposed to be generative, not discriminative, and this is clearly a discriminative setting: if we're only after the argmax, we can use margin methods and other non-probabilistic methods to estimate things that won't necessarily be densities. This also doesn't take into consideration the fact that there is more than one acceptable answer for this sort of question; it's more honest to frame this as a ranking problem (there is a true order among the sentences and you're trying to predict it) than as a classification problem, but that makes it harder to evaluate (Kendall's tau? NDCG? ERR?). Put more plainly, if we have access to the loss of choosing the wrong translation, we can optimize that directly (say, with an M³N) instead of building a consistent generative model. That might be better suited to this task, but it's not an actual language model (in the same sense that naive Bayes is a language model but logistic regression isn't).
  2. You might be able to solve this task better without a real language model (this is related to the ranking point above). If we can see that all the candidate sentences are similar in some ways (since they're presumably samples from a decoding model), we can probably hack together something that produces inconsistent probabilities (for example, preferring sentence A to B in some contexts but B to A in others) and that does better than a straight language model on this task, assuming we have access to training data.
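If the task were scored as a ranking problem, as the first objection suggests, Kendall's tau is the simplest of the measures mentioned. A rough sketch of the tau-a variant (no tie correction) follows. It assumes a gold score per candidate, such as a sentence-level MT quality score, which the original proposal does not specify; `kendall_tau` and both score lists are made up for illustration.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(gold: Sequence[float], pred: Sequence[float]) -> float:
    """Kendall's tau-a between two scorings of the same items (no tie correction)."""
    n = len(gold)
    if n != len(pred) or n < 2:
        raise ValueError("need two equal-length lists with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both scorings order it the same way.
        sign = (gold[i] - gold[j]) * (pred[i] - pred[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


gold_quality = [0.9, 0.5, 0.1]      # hypothetical per-candidate quality scores
lm_scores = [-10.0, -14.0, -12.0]   # hypothetical LM log-probabilities
tau = kendall_tau(gold_quality, lm_scores)
```

Note that this only checks the order the LM induces, so a non-probabilistic scorer would do just as well on it, which is exactly the concern raised in the first objection.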

But overall I like this idea. Maybe one could couple that with speech recognition and text compression (i.e., test-set perplexity) to get a fair evaluation.
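For the test-set perplexity part, the computation from per-sentence log-probabilities is short. This is a sketch that assumes natural-log probabilities and a token count fixed by the organizers; whether end-of-sentence symbols are counted is a convention the bakeoff would have to pin down. The function name and the numbers are illustrative only.

```python
import math
from typing import Sequence


def perplexity(sentence_logprobs: Sequence[float], num_tokens: int) -> float:
    """exp of the negative average natural-log probability per token."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return math.exp(-sum(sentence_logprobs) / num_tokens)


# Two hypothetical test sentences of 2 tokens each, every token given
# probability 0.25 by the model, so the perplexity should come out as 4.
logprobs = [2 * math.log(0.25), 2 * math.log(0.25)]
ppl = perplexity(logprobs, num_tokens=4)
```

Unlike the reranking score, this only makes sense for models whose outputs are properly normalized probabilities.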

Also there is the issue that BLEU might bias towards n-gram-based models.

Maybe the ideal way to run this bakeoff would be code-focused: specify a corpus format and an interface for querying the log-probability of a sentence, and perhaps release a toy subset of the real data so people can check that their code works correctly. Participants would then submit their code, and something like MLCOMP would run the pipeline on the corpus, the development sentences, and the test sentences, reporting results only on the development sentences. The real test results would be revealed before a conference workshop or something like that, where people get to present their methodology. What do you think?
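As a sketch of what such an interface could look like, assuming Python and one whitespace-tokenized sentence per line as the corpus format (both are my assumptions, not anything MLCOMP prescribes). `LanguageModel`, `AddOneUnigramLM`, and `run_pipeline` are hypothetical names; the add-one unigram model is just a baseline for checking the plumbing.

```python
import math
from collections import Counter
from typing import Iterable, List, Protocol


class LanguageModel(Protocol):
    """Hypothetical interface every submission would implement."""

    def train(self, corpus: Iterable[str]) -> None: ...

    def logprob(self, sentence: str) -> float: ...


class AddOneUnigramLM:
    """Baseline submission: add-one smoothed unigrams plus one unknown-word type."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()
        self.total = 0

    def train(self, corpus: Iterable[str]) -> None:
        for line in corpus:
            tokens = line.split()
            self.counts.update(tokens)
            self.total += len(tokens)

    def logprob(self, sentence: str) -> float:
        # The denominator reserves mass for every seen type and one unseen type.
        denom = self.total + len(self.counts) + 1
        return sum(math.log((self.counts[w] + 1) / denom) for w in sentence.split())


def run_pipeline(model: LanguageModel, corpus: Iterable[str], sentences: Iterable[str]) -> List[float]:
    """What the evaluation server would do: train, then query each sentence."""
    model.train(corpus)
    return [model.logprob(s) for s in sentences]


scores = run_pipeline(AddOneUnigramLM(), ["a b", "a"], ["a", "c"])
```

The server would keep the test sentences and the Pr(f|e) scores to itself, so submissions never see anything beyond the training corpus and the calls to `logprob`.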

answered Aug 28 '10 at 22:12

Alexandre Passos ♦

edited Aug 29 '10 at 14:05


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.