What kinds of techniques are used in generating search engine snippets? By a snippet, I mean the short summary shown for each search result. For example, the snippet for the first Google result for "metaoptimize" is "Machine learning, natural language processing, predictive analytics, business intelligence, artificial intelligence, text analysis, information retrieval, ...". I'm guessing it's not so much summarization techniques, and more a mixture of looking within particular HTML sections plus scoring sentences for similarity to the query, but I'm just guessing.
Here's a paper describing how Yahoo (probably) did it: "Predicting the Readability of Short Web Summaries". Their algorithm is too complicated if you're looking for something quick and dirty; in that case, see the references or "related work" section of the paper, since they use several simpler scoring algorithms as features in their ML model.

"...and more a mixture of looking within particular HTML sections..."

Well, extracting text from HTML is the first, very basic thing that all web search engines have to do. How else would they build the inverted index? It's very reasonable to use the same extracted text for summarization that goes into indexing.

"...plus scoring sentences for similarity to the query..."

Yes, that's the basic idea: generate some candidates (not just sentences, but fragments of text, which could be parts of a sentence or span several sentences), and then score them. A minimal sketch of this appears after the comment below.

Awesome, thanks for the paper. I think there's more to snippet generation than readability, though? By looking within particular HTML sections, I meant that search engines may rank certain HTML parts differently (e.g., maybe title and header tags are more important in general, and maybe list tags are ranked higher when the search engine recognizes that the query is asking for a list of things). In other words, I'm guessing search engines take the HTML structure of the page into account as well.
(Oct 07 '10 at 14:10)
grautur
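A minimal sketch of the candidate-generation-and-scoring idea from the answer above. Everything here (the regex sentence splitter, the term-overlap score, the length cap) is an illustrative assumption, not what any search engine actually does:

    import re

    def candidate_fragments(text, max_window=2):
        # Naive sentence split; real systems use trained segmenters.
        sentences = re.split(r"(?<=[.!?])\s+", text)
        for i in range(len(sentences)):
            # Candidates are runs of 1..max_window consecutive sentences.
            for w in range(1, max_window + 1):
                if i + w <= len(sentences):
                    yield " ".join(sentences[i:i + w])

    def overlap_score(fragment, query_terms):
        # Count distinct query terms that appear in the fragment.
        return len(set(fragment.lower().split()) & query_terms)

    def best_snippet(text, query, max_chars=160):
        query_terms = set(query.lower().split())
        candidates = [f for f in candidate_fragments(text) if len(f) <= max_chars]
        return max(candidates, key=lambda f: overlap_score(f, query_terms), default="")

In practice the scoring function would combine many features (term proximity, fragment position, readability), but the candidates-then-scoring shape stays the same.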
I found that this solution works well enough, and I made my own version too. The idea is simply to strip away junk and then return the largest text block.
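As a rough illustration of that idea (a sketch under assumed tag lists, not the linked solution's actual code), using only the standard library:

    from html.parser import HTMLParser

    class LargestBlockExtractor(HTMLParser):
        # Tags whose contents we treat as junk; the list is an assumption.
        JUNK_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

        def __init__(self):
            super().__init__()
            self.blocks = []      # completed text blocks
            self.current = []     # text fragments in the block being built
            self.skip_depth = 0   # >0 while inside a junk tag

        def handle_starttag(self, tag, attrs):
            if tag in self.JUNK_TAGS:
                self.skip_depth += 1
            elif tag in ("p", "div", "td", "li"):
                self._flush()  # block-level tag starts a new block

        def handle_endtag(self, tag):
            if tag in self.JUNK_TAGS and self.skip_depth > 0:
                self.skip_depth -= 1

        def handle_data(self, data):
            if self.skip_depth == 0 and data.strip():
                self.current.append(data.strip())

        def _flush(self):
            if self.current:
                self.blocks.append(" ".join(self.current))
                self.current = []

        def largest_block(self):
            self._flush()
            return max(self.blocks, key=len, default="")

    def extract_main_text(html):
        parser = LargestBlockExtractor()
        parser.feed(html)
        return parser.largest_block()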
There were two CLEANEVAL competitions on cleaning the text of webpages and removing boilerplate text, navigation bars, etc. Unfortunately, only CLEANEVAL-1 has a webpage, but I believe the proceedings for both are online. NCleaner is a decent tool (HMM-based, if I'm not mistaken) for converting HTML to text and stripping non-body text. WebStemmer is another one, but I haven't used it. You can use one of these tools and just take the first k words. jReadability is Java code that extracts the article text from HTML; I have not used it.
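Taking the first k words of the cleaned text is then a one-liner; a minimal sketch (the function name and the ellipsis convention are just illustrative):

    def first_k_words(text, k=30):
        # Truncate cleaned body text to k words, appending "..." if cut off.
        words = text.split()
        snippet = " ".join(words[:k])
        return snippet + (" ..." if len(words) > k else "")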
I think the simplest way would be to obtain an LDA clustering, which gives you, for each topic, a distribution over words. So if you gathered all of metaoptimize's posts and ran LDA over them, you would get a number of "topics" with corresponding distributions over words. I would then take the most probable words belonging to each topic and concatenate them. That is how one could (naively) do it. LSI could achieve similar results too; it is the most basic IR technique and simpler to understand than LDA. But if you are looking at large corpora, there is of course more to it: one would need a pipeline, some evaluation metric, pseudo-relevance feedback, etc. to boost results.
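A minimal sketch of that idea using scikit-learn (an assumption; any LDA implementation would do), where `posts` is a hypothetical list of document strings:

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    def topic_keywords(posts, n_topics=5, top_n=8):
        # Bag-of-words counts, dropping English stopwords.
        vectorizer = CountVectorizer(stop_words="english")
        counts = vectorizer.fit_transform(posts)
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        lda.fit(counts)
        vocab = vectorizer.get_feature_names_out()  # sklearn >= 1.0
        # For each topic, concatenate its most probable words.
        return [" ".join(vocab[i] for i in topic.argsort()[::-1][:top_n])
                for topic in lda.components_]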
I don't think either LSA or LDA in this context should be called simple. I'd go with ivank's answer.
(Oct 06 '10 at 06:43)
Alexandre Passos ♦
@Alexandre Agreed! The simplest thing to do would be to use the top-N characters from the webpage. ivank's suggestions are in the right direction: score fragments for relevance. Of course, the exact method for doing this is a fairly open-ended text summarization question. LDA/LSI/tf-idf/etc. can be used, but all of these are specific tools to aid in the summarization task.
(Oct 06 '10 at 08:14)
Andrew Rosenberg
Either way, Andrew, you're still generating a bag-of-words (or, at best, an auto-tagging) of the HTML. The question is about text snippets, suggesting actual syntactically well-formed sentences and phrases.
(Oct 06 '10 at 11:25)
Joseph Turian ♦♦
@Joseph Turian: What Andrew suggested goes deeper than treating the whole HTML document as a bag of words. His suggestion is: represent each sentence as a bag of words, and choose the sentences closest to the query bag of words as the snippet. If you switch from independent sentence selection to selecting sentences that maximize the similarity between the query and the sentence set you have so far, you get nearly a state-of-the-art extraction-based summarization algorithm.
(Oct 06 '10 at 11:37)
Alexandre Passos ♦
Alexandre, I missed that. It actually sounds like a good idea, as long as you also weight earliness in the document more.
(Oct 06 '10 at 15:17)
Joseph Turian ♦♦
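A minimal sketch of the greedy set-selection idea from this thread, with the earliness weighting Joseph mentions. The cosine measure over raw word counts, the regex sentence splitter, and the position bonus are all illustrative assumptions:

    import math
    import re
    from collections import Counter

    def cosine(a, b):
        # Cosine similarity between two bag-of-words Counters.
        common = set(a) & set(b)
        num = sum(a[w] * b[w] for w in common)
        den = math.sqrt(sum(v * v for v in a.values())) * \
              math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def greedy_summary(text, query, n_sentences=2, position_weight=0.1):
        sentences = re.split(r"(?<=[.!?])\s+", text)
        q = Counter(query.lower().split())
        chosen = []
        for _ in range(n_sentences):
            best, best_gain = None, -1.0
            for idx, s in enumerate(sentences):
                if s in chosen:
                    continue
                # Score the whole selected set plus this candidate...
                bag = Counter(" ".join(chosen + [s]).lower().split())
                # ...with a small bonus for sentences early in the document.
                gain = cosine(q, bag) + position_weight / (idx + 1)
                if gain > best_gain:
                    best, best_gain = s, gain
            if best is not None:
                chosen.append(best)
        return " ".join(chosen)

Scoring the selected set as a whole (rather than each sentence independently) is what discourages redundant sentences: a near-duplicate of an already-chosen sentence barely changes the set's bag of words, so it adds little gain.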
Here is a link to a simple, quick-and-dirty approach to search engine snippet generation.
This is a great question, and it seems to come up a lot.