What would be a good way to extract content from a website? It seems easy to write a scraper using XPath or similar to extract this information from a single site, but I'm not sure what a more scalable solution would be if you're extracting from, say, 10,000 sites.

asked Nov 18 '10 at 13:02

Ben McCann

edited Nov 22 '10 at 16:41


10 Answers:

I am the author of the paper "Boilerplate Detection Using Shallow Text Features", presented at WSDM 2010.

The algorithms presented there are available as open source from http://code.google.com/p/boilerpipe/. There is also a web app that demonstrates the algorithms on arbitrary pages: http://boilerpipe-web.appspot.com/
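
To give a rough feel for what "shallow text features" can mean, here is a minimal sketch that splits a page into text blocks and keeps the blocks that have enough words and a low share of linked words. It is not boilerpipe's code or the paper's classifier: the block tags, the two features, and the thresholds `MIN_WORDS` and `MAX_LINK_DENSITY` are assumptions made up for this illustration.

```python
from html.parser import HTMLParser

# Hypothetical choices for this sketch only, not values from boilerpipe.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}
MIN_WORDS = 10
MAX_LINK_DENSITY = 0.33


class BlockCollector(HTMLParser):
    """Split a page into text blocks and count words inside links."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # (text, word_count, linked_word_count)
        self._words = []
        self._linked = 0
        self._in_link = 0
        self._skip = 0

    def _flush(self):
        # Close the current block, if it contains any words.
        if self._words:
            text = " ".join(self._words)
            self.blocks.append((text, len(self._words), self._linked))
        self._words, self._linked = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a":
            self._in_link += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS:
            self._flush()
        elif tag == "a" and self._in_link:
            self._in_link -= 1

    def handle_data(self, data):
        if self._skip:
            return
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._linked += len(words)

    def close(self):
        super().close()
        self._flush()


def extract_main_text(html):
    """Keep blocks that are long enough and not dominated by link text."""
    collector = BlockCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, n_words, n_linked in collector.blocks:
        link_density = n_linked / n_words
        if n_words >= MIN_WORDS and link_density <= MAX_LINK_DENSITY:
            kept.append(text)
    return "\n".join(kept)
```

Because the features are computed per block and do not depend on any site-specific markup, the same function can be applied to pages from many different sites, which is the appeal over hand-written per-site XPath rules.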

Commercial support and custom solutions are available through http://www.kohlschutter.com/

answered Nov 20 '10 at 05:15

Christian Kohlschütter

Boilerpipe is a great technology.

What would be the right interface for annotating many more webpages? Do you have a Mechanical Turk interface?

(Jan 17 '11 at 23:23) Joseph Turian ♦♦

The program you'd want if you need to annotate additional data is KrdWrd (krdwrd.org): It's a Firefox plugin that lets you mark DOM nodes as "boilerplate", "uninteresting" or "main text" and then store these classifications on a server.

(Mar 11 '11 at 02:57) sqrt17

The other day I watched an interesting video lecture that approaches this problem from the other direction:

Boilerplate Detection Using Shallow Text Features

answered Nov 19 '10 at 08:27

Scott Frye

See my comment referring to "boilerpipe" for the corresponding implementations :)

(Jan 08 '11 at 14:15) Christian Kohlschütter

Here is a list of resources on Tomaz Kovacic's blog.

On the same blog, here is an overview of approaches: Extracting text from html documents.

answered Mar 11 '11 at 02:21

Justin Bayer

edited Mar 12 '11 at 19:33

Joseph Turian ♦♦

I just read this blog post, and I have to say it is the most comprehensive summary I've read to date.

(Mar 12 '11 at 19:30) Joseph Turian ♦♦

This is a hard research problem in general. Some of the research solutions out there are Stalker and RoadRunner. A commercial solution is AgentBuilder from Fetch: http://www.fetch.com/products/

answered Nov 18 '10 at 13:36

Aman

You could also convert a page to text using lynx -dump url. It produces structured text which makes it nice for parsing.
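
If you want to call this from a script, a minimal sketch might look like the following. It assumes the `lynx` binary is installed and on your PATH; the function names `lynx_dump_command` and `page_to_text` are made up for this example.

```python
import subprocess


def lynx_dump_command(url):
    """Build the argument list for `lynx -dump <url>`."""
    return ["lynx", "-dump", url]


def page_to_text(url, timeout=30):
    """Run lynx and return its plain-text rendering of the page.

    Assumes lynx is installed; raises CalledProcessError if lynx
    exits with a non-zero status.
    """
    result = subprocess.run(
        lynx_dump_command(url),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

Passing the command as a list rather than a shell string avoids quoting problems when URLs contain characters such as `&` or `?`.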

answered Nov 18 '10 at 13:49

Melipone Moody

Have a look at Webstemmer (http://www.unixuser.org/~euske/python/webstemmer/). From its description: "Webstemmer is a web crawler and HTML layout analyzer that automatically extracts main text of a news site without having banners, ads and/or navigation links mixed up."

answered Nov 19 '10 at 16:29

Manish

You may want to consider this: http://code.google.com/p/arc90labs-readability/. Also, I found another paper of interest: http://portal.acm.org/citation.cfm?doid=1860559.1860590.

answered Mar 14 '11 at 16:53

Marty

There is an alternative for blogs and other sites with RSS feeds: the unofficial Google Reader API. With it you can retrieve the article text directly from a longer-term version of the feed, whereas the site's own feed gives you only the latest posts.

This week I published an article about this method specifically: Extraction of Main Text Content Using the Google Reader NoAPI
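
To illustrate the general idea of taking article text from a feed instead of scraping HTML, here is a minimal sketch that parses a standard RSS 2.0 document with the Python standard library. It does not use the Google Reader API; the sample feed and the function name `articles_from_rss` are invented for this example, and many real feeds put only a summary in `description`.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed, used only to show the expected structure.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>Full text of the first post.</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>Full text of the second post.</description>
    </item>
  </channel>
</rss>"""


def articles_from_rss(feed_xml):
    """Return one dict per <item> with its title, link, and text."""
    root = ET.fromstring(feed_xml)
    articles = []
    for item in root.iter("item"):
        articles.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "text": item.findtext("description", default=""),
        })
    return articles


for article in articles_from_rss(SAMPLE_FEED):
    print(article["title"], "->", article["text"])
```

The limitation mentioned above still applies: a plain site feed usually contains only the most recent items, so older posts need some archive of the feed.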

answered Aug 27 '11 at 20:42

DataBigBang

edited Aug 27 '11 at 21:24

http://code.google.com/p/boilerpipe/ looks nice. I also need the main image, and I need it to work for Chinese sites, so I grabbed goose

https://github.com/jiminoc/goose

and created my own simple Java library called snacktory:

https://github.com/karussell/snacktory

See snacktory in action on jetslide

There is also a blog post which lists a lot of readability clones: http://blog.arc90.com/2009/06/20/readability-now-available-in-three-delicious-flavors/

answered Aug 29 '11 at 11:26

Peter

edited Aug 29 '11 at 11:29

Diffbot provides an API. I ran a few qualitative tests on Diffbot's "Article API" and the results also look good, at least comparable in quality to boilerpipe.

I haven't gotten a chance to run a detailed or quantitative comparison.

There is some useful discussion on Hacker News.

answered Mar 11 '11 at 00:55

Joseph Turian ♦♦

I ran a single test with Diffbot on a page of the German news magazine Spiegel, and it broke down, saying "No news article found".

(Mar 11 '11 at 02:20) Justin Bayer

Interesting, do you mind posting the URL?

(Mar 14 '11 at 17:20) Joseph Turian ♦♦

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.