Hello! We are two students at Chalmers University of Technology, Göteborg, Sweden, doing a master's thesis on single-document summarization. We have implemented a couple of different algorithms for creating summaries (by sentence extraction), which we are testing on the DUC 2002 dataset using the ROUGE system for evaluation. We would like to compare our results to others presented in the area. A few papers we have read report results as "ROUGE score – Ngram(1,1)" (e.g. "TextRank: Bringing Order into Texts"), so we are wondering if anyone could tell us which settings for the ROUGE script "ROUGE score – Ngram(1,1)" corresponds to. Currently we are using this terminal command:
Any help is greatly appreciated! /Jonatan and Christoffer
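For reference, a typical call to the ROUGE Perl script (here assumed to be ROUGE-1.5.5.pl) for this kind of unigram evaluation with stemming looks roughly like the sketch below. The data directory and the settings file name are placeholders, and apart from -m the flags should be read as an illustration rather than as our exact invocation:

    perl ROUGE-1.5.5.pl -e ROUGE-1.5.5/data -n 1 -m -l 100 -a rouge_settings.xml

Here -n 1 limits the n-gram scores to unigrams (ROUGE-1), -m applies Porter stemming, -l 100 truncates each summary to its first 100 words, -e points at the data directory shipped with ROUGE, and -a evaluates all peer systems listed in the settings file.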
|
The ROUGE Settings site was very helpful for figuring out the basic settings for ROUGE as well as the format of the summaries to be fed into the script. Our problem, however, is that we can't find the settings used in, for example, the TextRank paper. In that paper the results are simply given as "ROUGE score – Ngram(1,1)", with the variants "basic", "stemmed" and "stemmed no-stopwords" (since we use the option -m in the command displayed above, our result should correspond to "stemmed"). After reading the paper "Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics", we are under the impression that "ROUGE score – Ngram(1,1)" should correspond to ROUGE-1 (i.e. 1-grams). The thing is, when we compare the score reported in the TextRank paper for the baseline of taking the first sentences of each article to form a 100-word summary with the score of our own similar baseline, the results differ a lot: the "ROUGE score – Ngram(1,1)" result with the "stemmed" option for the baseline presented in the TextRank paper is 0.477, while the result for our similar baseline on the same dataset is 0.627.
Of course our baseline implementation could differ slightly from the one in the paper, but 0.477 versus 0.627 is quite a big difference. Also, the algorithms we have implemented and tested score in the same range as our baseline. So, can anyone see anything obvious that we have misunderstood or overlooked?
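For anyone reading along, our understanding (from the Lin and Hovy paper mentioned above) is that Ngram(1,1) is just unigram recall against the reference summaries, roughly:

    ROUGE-1 (recall) = (number of unigram tokens in the reference summaries that are also found in the candidate summary) / (total number of unigram tokens in the reference summaries)

with stemming applied to both candidate and references before counting for the "stemmed" variant.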
|
After a lot of bug hunting we have now found what we think was the root of our problem. Because we could not get the input type "SEE" to work (regardless of input we got a ROUGE score of 0), we had been using the input type label "SPL". Since we still tried to follow the file layout for "SEE" summaries presented on the ROUGE Settings site, that layout was probably not applicable to "SPL". The reason the "SEE" input type didn't work for us was that we didn't separate the sentences in the summary HTML files with newline characters. Since the files are in HTML format, we didn't think newline separation would make a difference, but apparently it does. After getting the "SEE" input type to work, the results of our ROUGE evaluation for our baseline implementation matched the one presented in the TextRank paper much better.
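In case it helps anyone else: the layout that finally worked for us follows the "SEE" sample files that come with the ROUGE distribution, with one summary sentence per line, each wrapped in its own anchor element. A simplified sketch (the title and sentences are placeholders, so check the bundled sample files for the exact markup):

    <html>
    <head><title>example peer summary</title></head>
    <body bgcolor="white">
    <!-- one sentence per line; the missing line breaks were what broke SEE for us -->
    <a name="1">[1]</a> <a href="#1" id=1>First extracted sentence of the summary.</a>
    <a name="2">[2]</a> <a href="#2" id=2>Second extracted sentence of the summary.</a>
    </body>
    </html>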