|
Frank writes: "We need a language-modeling shared task (e.g. at CoNLL) so the different approaches can be compared in a standardized way." I was trying to think of a comparative evaluation technique that is part of a real-world task but is hard to game.
Following up on Frank's MT idea, how does the following MT reranking task sound? You are given k sentences, each called e, and you are asked to assign each one a probability Pr(e). These k sentences are candidate translations of a source sentence f, which you are not told. The MT system has an estimate Pr(f|e) for each sentence e, and you are not told these estimates either. The evaluation uses your LM to pick one of the k candidates: the e that maximizes Pr(f|e)*Pr(e). The score of your LM is then an MT evaluation measure computed over your choice of e.
Does this sound like a good way to compare language models? What objections do you see? One objection is that the model of Pr(f|e) could introduce bias into the evaluation. For example, if Pr(f|e) came from a PBMT system trained with an n-gram language model, this might unfairly favor n-gram language models. |
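To make the proposed protocol concrete, here is a minimal Python sketch of the reranking step and the scoring loop. Everything named here (`rerank`, `evaluate`, `toy_lm_logprob`, `exact_match`, and the dictionary layout of an n-best list) is a hypothetical illustration, not part of any existing shared task. It works in log space, so the argmax of Pr(f|e)*Pr(e) becomes the argmax of log Pr(f|e) + log Pr(e), and it uses exact match as a stand-in for a real MT evaluation measure.

```python
from typing import Callable, Dict, List, Sequence


def rerank(candidates: Sequence[str],
           tm_logprobs: Sequence[float],
           lm_logprob: Callable[[str], float]) -> int:
    """Return the index of the candidate e maximizing log Pr(f|e) + log Pr(e).

    tm_logprobs holds the MT system's log Pr(f|e), which the organizers keep
    hidden from participants; lm_logprob is the submitted language model.
    """
    if len(candidates) != len(tm_logprobs):
        raise ValueError("need one translation-model score per candidate")
    scores = [tm + lm_logprob(e) for e, tm in zip(candidates, tm_logprobs)]
    return max(range(len(scores)), key=scores.__getitem__)


def evaluate(nbest_lists: List[Dict],
             lm_logprob: Callable[[str], float],
             metric: Callable[[str, str], float]) -> float:
    """Average an MT metric over the candidates the LM ends up selecting."""
    total = 0.0
    for item in nbest_lists:
        best = rerank(item["candidates"], item["tm_logprobs"], lm_logprob)
        total += metric(item["candidates"][best], item["reference"])
    return total / len(nbest_lists)


def toy_lm_logprob(sentence: str) -> float:
    """Hypothetical stand-in LM: fixed per-word log-probs, unknown words cost more."""
    word_logprobs = {"the": -1.0, "cat": -2.0, "sat": -2.0}
    return sum(word_logprobs.get(w, -10.0) for w in sentence.split())


def exact_match(hypothesis: str, reference: str) -> float:
    """Placeholder metric (assumption); a real bakeoff would use BLEU or similar."""
    return float(hypothesis == reference)
```

In this sketch the organizers would hold `tm_logprobs` and the references, and participants would supply only `lm_logprob`, which matches the constraint that neither f nor Pr(f|e) is revealed.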
|
I have two objections:
But overall I like this idea. Maybe it could be coupled with speech recognition and text compression (i.e., test-set perplexity) to get a fair evaluation. There is also the issue that BLEU might be biased towards n-gram-based models.
Maybe the ideal way to run this bakeoff is code-focused: a corpus format is specified, together with an interface for querying the log-probability of a sentence, and perhaps a toy subset of the real data is made public so people can check that their code works correctly. The evaluation then consists of people submitting their code, and something like MLCOMP runs the pipeline with the corpus, the development sentences, and the test sentences, and reports the results on the development sentences. Before a conference workshop or something similar, the real test results are revealed and people get to present their methodology.
What do you think? |
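As a rough illustration of what such a code-focused interface might look like, here is a Python sketch. The class and method names (`LanguageModel`, `train`, `logprob`), the tokenized-sentence input format, and the add-one unigram baseline are all assumptions made for the example; nothing here reflects an actual MLCOMP interface. The baseline ignores end-of-sentence events to stay short.

```python
import math
from abc import ABC, abstractmethod
from collections import Counter
from typing import Iterable, List


class LanguageModel(ABC):
    """Hypothetical submission interface: train on a corpus, score sentences."""

    @abstractmethod
    def train(self, corpus: Iterable[List[str]]) -> None:
        """Fit the model on an iterable of tokenized sentences."""

    @abstractmethod
    def logprob(self, sentence: List[str]) -> float:
        """Return the natural-log probability of a tokenized sentence."""


class AddOneUnigram(LanguageModel):
    """Toy baseline: unigram model with add-one smoothing and one unknown-word class."""

    def train(self, corpus: Iterable[List[str]]) -> None:
        self.counts = Counter(tok for sent in corpus for tok in sent)
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # +1 reserves mass for unseen words

    def logprob(self, sentence: List[str]) -> float:
        denom = self.total + self.vocab_size
        # Counter returns 0 for unseen tokens, so they get probability 1/denom.
        return sum(math.log((self.counts[tok] + 1) / denom) for tok in sentence)


def perplexity(model: LanguageModel, sentences: List[List[str]]) -> float:
    """Per-token perplexity of the model on held-out sentences."""
    log_sum = sum(model.logprob(s) for s in sentences)
    n_tokens = sum(len(s) for s in sentences)
    return math.exp(-log_sum / n_tokens)
```

With an interface like this, the same submitted object could be driven by a perplexity harness and by a reranking harness, so the toy public subset would mainly serve to check that `train` and `logprob` run end to end.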