Hello!

We are two students at Chalmers University of Technology, Göteborg, Sweden, doing a master's thesis on single-document summarization. We have implemented a couple of different algorithms for creating summaries (by sentence extraction), which we are testing on the DUC 2002 dataset using the ROUGE system for evaluation. We would like to compare our results to others presented in the area. A few papers we have read present results for "ROUGE score – Ngram(1,1)" (e.g. "TextRank: Bringing Order into Texts"), so we are wondering whether anyone could tell us which settings for the ROUGE script "ROUGE score – Ngram(1,1)" corresponds to?

Currently we are using this terminal command:

./ROUGE-1.5.5.pl -e data -a -m -n 4 -w 1.2 -2 -3 -u data/settings.xml
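If "Ngram(1,1)" does mean plain ROUGE-1, our guess (unverified) is that the corresponding run would limit the n-gram order to 1 and drop the skip-bigram options, something like:

```shell
# Our guess, not a confirmed setting: ROUGE-1 only, stemmed (-m),
# evaluating all systems (-a) against the same settings file
./ROUGE-1.5.5.pl -e data -a -m -n 1 data/settings.xml
```

We would be happy to hear if this is actually what the papers used.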

Any help is greatly appreciated!

/Jonatan and Christoffer

asked Mar 29 '12 at 07:14


Jonatan Bengtsson

edited Mar 29 '12 at 07:16


3 Answers:

Maybe this link will shed more light on the meaning of some of the settings: Rouge Settings

answered Mar 30 '12 at 06:29


Svetoslav Marinov

The Rouge Settings site was very helpful for figuring out the basic settings for ROUGE as well as the format of the summaries to be fed into the script. Our problem, however, is that we can't find the settings used in, for example, the TextRank paper. In that paper the results are simply given for "ROUGE score – Ngram(1,1)", with additional options "basic", "stemmed" and "stemmed no-stopwords" (as we use the option -m in the command displayed above, our result should correspond to "stemmed").

After reading the paper Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics, we are under the impression that "ROUGE score – Ngram(1,1)" should correspond to "ROUGE-1" (for 1-grams). The thing is that when we compare the score for the baseline of just taking the first sentences of an article from the corpus to form a 100-word summary, presented in the TextRank paper, with that of our own similar baseline, the results differ a lot:
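As a sanity check, we also computed ROUGE-1 by hand. A minimal sketch of the unigram-overlap computation as we understand it from that paper (plain whitespace tokenization, no stemming or stop-word removal, so corresponding to "basic" rather than "stemmed"):

```python
from collections import Counter

def rouge_1(candidate, reference):
    """ROUGE-1 as clipped unigram overlap: returns (recall, precision, F1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Each candidate unigram counts at most as often as it occurs in the reference.
    overlap = sum((cand & ref).values())
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return recall, precision, f1
```

For example, `rouge_1("the cat sat", "the cat sat on the mat")` gives recall 0.5 (3 of 6 reference unigrams matched) and precision 1.0.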

The "ROUGE score – Ngram(1,1)" result with additional option "stemmed", for the baseline presented in the TextRank paper: 0.477

The result for our similar baseline on the same dataset:

C ROUGE-1 Average_F: 0.62692 (95%-conf.int. 0.62188 - 0.63199)

Of course our baseline implementation could differ from the one in the paper by some small amount, but 0.477 and 0.627 is quite a big difference. Also, the algorithms we have implemented and tested score in the same range as our baseline.

So can anyone see anything obvious that we have misunderstood or overlooked?

answered Mar 31 '12 at 09:32


Jonatan Bengtsson

After a lot of bug hunting we have now found what we think was the root of our problem. Because we could not get the input type "SEE" to work (regardless of input we got a ROUGE score of 0), we used the input type label "SPL". But since we still tried to follow the file layout for "SEE" summaries presented on the ROUGE Settings site, that format was probably not applicable to "SPL".

The reason why the "SEE" input type label didn't work for us was that we didn't separate the lines in the summary HTML files with a newline character. As the files are in an HTML format, we didn't think newline separation would make a difference, but apparently it did.
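For anyone hitting the same problem, a minimal SEE-style summary file as we understand the DUC layout from the ROUGE Settings site (title and sentence IDs here are just placeholders) looks roughly like this, with each sentence element on its own line:

```html
<html>
<head><title>summary.html</title></head>
<body bgcolor="white">
<a name="1">[1]</a> <a href="#1" id=1>First sentence of the summary.</a>
<a name="2">[2]</a> <a href="#2" id=2>Second sentence of the summary.</a>
</body>
</html>
```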

After getting the input type label "SEE" to work, the results of our ROUGE evaluation for our baseline implementation corresponded much better to the one presented in the TextRank paper.

answered Apr 18 '12 at 03:25


Jonatan Bengtsson

