Hello. I'm interested in re-implementing some of the models described in "Exploring Content Models for Multi-Document Summarization" by Haghighi and Vanderwende. All of the models are based on minimizing the KL-divergence between a proposed summary and a modeled distribution of a document collection. In the simplest case, for calculating KL(P||Q), the "true" distribution P is simply the unigram distribution of the document collection that we would like to summarize, and Q is the unigram distribution of the proposed summary. The paper notes that since exactly minimizing the KL-divergence is exponential in the total number of sentences in a document collection, a simple approximation was used instead, where "sentences are greedily added to a summary so long as they decrease KL-divergence".

I am trying to implement the simple version using this heuristic, but I can't get results anywhere close to what is reported in the paper, and I'm not sure whether I'm misunderstanding the heuristic. I've tried two approaches. The first is to go through a list of all the sentences in the document collection and progressively build up a summary: if adding the current sentence lowers the KL-divergence, we add it and move on, until we reach the maximum number of words. Note that the first sentence will always be added this way. The second is to go through every sentence in the document collection and add the one that lowers the KL-divergence the most, then repeat this step (adding whichever sentence results in the lowest KL-divergence between the summary so far and the document collection's unigram distribution) until we reach the limit. Both approaches give poor results. Does anyone see a problem with my logic? Thanks very much!
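For concreteness, here is a minimal Python sketch of the second approach (greedily adding whichever sentence most lowers KL(collection || summary) at each step). The function names, word budget, and smoothing constant are placeholder choices of mine, not anything from the paper:

```python
import math
from collections import Counter

def unigram_dist(words, vocab, delta=0.001):
    """Smoothed, normalized unigram distribution over a fixed vocabulary."""
    counts = Counter(words)
    total = sum(counts.values()) + delta * len(vocab)
    return {w: (counts[w] + delta) / total for w in vocab}

def kl_divergence(p, q):
    """KL(P || Q), with both distributions defined over the same vocabulary."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def greedy_klsum(sentences, max_words=250, delta=0.001):
    """sentences: list of token lists for the whole document collection."""
    collection_words = [w for s in sentences for w in s]
    vocab = set(collection_words)
    p = unigram_dist(collection_words, vocab, delta)

    summary, summary_words = [], []
    remaining = list(sentences)
    while remaining and len(summary_words) < max_words:
        # pick the sentence whose addition gives the lowest KL(collection || summary)
        best, best_kl = None, float("inf")
        for s in remaining:
            q = unigram_dist(summary_words + s, vocab, delta)
            kl = kl_divergence(p, q)
            if kl < best_kl:
                best, best_kl = s, kl
        summary.append(best)
        summary_words.extend(best)
        remaining.remove(best)
    return summary
```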
Your second approach seems to me to be what is done in the paper, which is like a more probabilistic version of SumBasic. Are you stemming, skipping stopwords, or throwing away very rare words? How are you smoothing the distributions? The smoothing coefficient (I'm assuming you're doing add-one smoothing) should affect performance, so you should treat it as a hyperparameter and optimize it accordingly. Are you doing an I-projection or an M-projection? For summarization you want an M-projection (i.e., minimizing KL(document-collection || summary), where the expectation over words is computed using the document collection). Also take care to normalize both probability distributions, and experiment with a few smoothing strategies (such as smooth-normalize and normalize-smooth-normalize, taking care not to re-smooth the sentences that have already been added); a small sketch of these options appears after the comments below. Another issue is sentence ordering, which was not at all clear from the paper.

Hi Alexandre. Thanks for your reply. I'm doing add-one smoothing with an M-projection (i.e. KL(document-collection || summary)). One problem I have noticed is that with the second approach, the first sentence added is almost always the longest sentence. I imagine that because the document collection is only so big, and the gradually built-up summary will be very small, a lot of the low KL-divergence is being obtained by having similar mass on words that occur once in the document-collection distribution and 0 or 1 times in the summary. So, if a sentence has several words, even if they're not very important, the contributions to the distribution from these words will closely match the contributions to the other distribution from the same unimportant words. On sentence ordering -- it shouldn't matter when just looking at ROUGE scores.
(Sep 13 '10 at 15:52)
Will Darling
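To make the projection direction and the two smoothing orderings mentioned in the answer above concrete, here is a small Python sketch of one reading of those suggestions; the function names and the delta value are placeholders, not taken from the paper or the answer:

```python
import math

def smooth_normalize(counts, vocab, delta=0.001):
    """Smooth-normalize: add delta to the raw counts, then normalize once."""
    total = sum(counts.values()) + delta * len(vocab)
    return {w: (counts.get(w, 0) + delta) / total for w in vocab}

def normalize_smooth_normalize(counts, vocab, delta=0.001):
    """Normalize the raw counts, add delta to the probabilities, then
    renormalize. Assumes counts is non-empty."""
    n = sum(counts.values())
    probs = {w: counts.get(w, 0) / n for w in vocab}
    total = 1.0 + delta * len(vocab)
    return {w: (probs[w] + delta) / total for w in vocab}

def m_projection_kl(p_collection, q_summary):
    """M-projection objective, KL(collection || summary): the expectation
    over words is taken under the document-collection distribution."""
    return sum(pw * math.log(pw / q_summary[w])
               for w, pw in p_collection.items() if pw > 0)
```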
Update: I've gotten to the point where my KLSum implementation outperforms the results reported in the original paper. Everything came down to pre-processing and smoothing. If anyone has any questions, please feel free to ask.
(Nov 15 '10 at 11:44)
Will Darling
Will -- I have re-implemented the TopicSum and HierSum models and would be interested in comparing results if you decide to re-implement those models as well.
(Nov 16 '10 at 14:51)
Rebecca Mason
I've heard back from the author of the paper, and it seems that the correct route is the latter approach I described in my original post: at each step, add the sentence that lowers the KL-divergence the most. However, the author also described an alternative approach that uses a local search technique: at every step you consider adding a sentence, removing a sentence, or adding one and removing another, and take whichever action achieves the greatest drop in KL-divergence.

The other important thing, of course, is smoothing. I had been using add-one smoothing, and it turns out that this is "way too much" smoothing; the author suggested something closer to 0.001 or 0.0001. It was also noted that the content and summary distributions should be "mixed", although I'm having a difficult time understanding what is meant by that.

In any case, my results are improving but are still far from those reported in the paper. I believe this also has to do with pre-processing that I haven't been performing (words are of course lower-cased, but I've yet to look into stemming or removing things such as datelines, bylines, etc.). I hope this helps someone else, and if there is more input or insight, please don't hesitate to post it here! Thanks.
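For concreteness, here is a rough Python sketch of how that local-search variant might look under my current understanding; the function names, word budget, and iteration cap are placeholders of mine, and the smoothing value just follows the suggested 0.001:

```python
import math
from collections import Counter

def smoothed_dist(words, vocab, delta=0.001):
    """Smoothed, normalized unigram distribution over a fixed vocabulary."""
    counts = Counter(words)
    total = sum(counts.values()) + delta * len(vocab)
    return {w: (counts[w] + delta) / total for w in vocab}

def kl(p, q):
    """KL(P || Q) over a shared vocabulary."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items() if pw > 0)

def local_search_klsum(sentences, max_words=250, delta=0.001, max_iters=1000):
    """sentences: list of token lists for the whole document collection."""
    vocab = {w for s in sentences for w in s}
    p = smoothed_dist([w for s in sentences for w in s], vocab, delta)

    def objective(summary):
        words = [w for s in summary for w in s]
        if not words or len(words) > max_words:
            return float("inf")       # empty or over-budget summaries are disallowed
        return kl(p, smoothed_dist(words, vocab, delta))

    summary, current = [], float("inf")
    for _ in range(max_iters):
        # candidate moves: add one sentence, remove one, or swap one for another
        adds = [summary + [s] for s in sentences if s not in summary]
        removes = [summary[:i] + summary[i + 1:] for i in range(len(summary))]
        swaps = [summary[:i] + summary[i + 1:] + [s]
                 for i in range(len(summary))
                 for s in sentences if s not in summary]
        best_move, best_score = None, current
        for cand in adds + removes + swaps:
            score = objective(cand)
            if score < best_score:
                best_move, best_score = cand, score
        if best_move is None:          # no move lowers the KL-divergence: stop
            break
        summary, current = best_move, best_score
    return summary
```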