|
What would be a good way to extract content from a website? It seems easy to write a scraper using xpath or similar to extract this information from a single site, but I'm not sure of a more scalable solution if you're extracting from say 10,000 sites. |
|
I am the author of the paper "Boilerplate Detection Using Shallow Text Features", presented at WSDM 2010. The algorithms presented there are available as open source at http://code.google.com/p/boilerpipe/, and a web app at http://boilerpipe-web.appspot.com/ demonstrates them on arbitrary pages. Commercial support and custom solutions are available through http://www.kohlschutter.com/
Boilerpipe is a great technology. What would be the correct interface for annotating many more webpages? Do you have a Mechanical Turk interface?
(Jan 17 '11 at 23:23)
Joseph Turian ♦♦
The program you'd want if you need to annotate additional data is KrdWrd (krdwrd.org): It's a Firefox plugin that lets you mark DOM nodes as "boilerplate", "uninteresting" or "main text" and then store these classifications on a server.
(Mar 11 '11 at 02:57)
sqrt17
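For readers who want a feel for what "shallow text features" can mean in practice, here is a minimal sketch. It is not the boilerpipe implementation and not the algorithm from the paper: it only splits a page into text blocks with Python's standard html.parser and keeps blocks that have enough words and a low share of link text. The thresholds (min_words, max_link_density), the tag sets and all names are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags that start or end a text block.
BLOCK_TAGS = {"p", "div", "td", "li", "h1", "h2", "h3", "article", "section"}
SKIP_TAGS = {"script", "style"}


class BlockCollector(HTMLParser):
    """Splits HTML into text blocks and counts the words inside <a> per block."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # list of (text, word_count, linked_word_count)
        self._words = []
        self._linked = 0
        self._in_link = 0
        self._skip = 0

    def flush(self):
        # Close the current block, if it contains any words.
        if self._words:
            self.blocks.append(
                (" ".join(self._words), len(self._words), self._linked))
        self._words, self._linked = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and style contents
        words = data.split()
        self._words.extend(words)
        if self._in_link:
            self._linked += len(words)


def extract_main_text(html, min_words=10, max_link_density=0.33):
    """Keep blocks with at least min_words words and few linked words."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = [text for text, n, linked in parser.blocks
            if n >= min_words and linked / n <= max_link_density]
    return "\n".join(kept)
```

Navigation bars tend to fail the link-density check and short footers fail the word-count check, which is why two such cheap features already remove a fair amount of boilerplate. A real system would need thresholds tuned on annotated pages.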
|
|
http://code.google.com/p/boilerpipe/ looks nice. I also need the main image and need it to work for Chinese sites, so I grabbed goose (https://github.com/jiminoc/goose) and created my own simple Java library called snacktory: https://github.com/karussell/snacktory See snacktory in action on jetslide. There is also a blog post that lists a lot of Readability clones: http://blog.arc90.com/2009/06/20/readability-now-available-in-three-delicious-flavors/ |
|
There is an alternative for blogs or sites with RSS feeds: the unofficial Google Reader API. With it you can retrieve the article text directly from a longer-term feed, whereas the site's own feed gives you only the latest posts. This week I published an article about this method specifically: Extraction of Main Text Content Using the Google Reader NoAPI |
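The Google Reader endpoints are unofficial and are not described in this thread, so the sketch below shows only the general feed-based idea: read an RSS 2.0 document and take each item's description as the article text, with markup stripped. It uses only the Python standard library. The sample feed and all function names are made up for illustration, and many real feeds carry only a summary rather than the full text.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text nodes of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(fragment):
    """Return the fragment's text with tags removed and whitespace collapsed."""
    parser = _TextOnly()
    parser.feed(fragment)
    parser.close()
    return " ".join("".join(parser.parts).split())


def articles_from_rss(xml_text):
    """Return (title, text) pairs for every <item> in an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    articles = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        body = strip_tags(item.findtext("description") or "")
        articles.append((title, body))
    return articles


# Made-up feed used only to show the call; a real feed would be downloaded.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example blog</title>
<item><title>First post</title>
<description>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;.&lt;/p&gt;</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for title, text in articles_from_rss(SAMPLE_FEED):
        print(title, "->", text)
```

Atom feeds use different element names (entry, content, summary) and XML namespaces, so they would need a separate code path.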
|
You may want to consider this: http://code.google.com/p/arc90labs-readability/. Also, I found another paper of interest: http://portal.acm.org/citation.cfm?doid=1860559.1860590. |
|
Here is a list of resources on Tomaz Kovacic's blog. On the same blog, here is an overview of approaches: Extracting text from html documents.
I just read this blog post, and have to say that it is the most comprehensive summary I've read to date.
(Mar 12 '11 at 19:30)
Joseph Turian ♦♦
|
|
Diffbot provides an API. I ran a few qualitative tests on Diffbot's "Article API" and the results also look good, at least comparable in quality to boilerpipe. I haven't had a chance to run a detailed or quantitative comparison. There is some useful discussion on Hacker News.
I ran a single test with Diffbot on a page of the German news magazine Spiegel, and it broke down, saying "No news article found".
(Mar 11 '11 at 02:20)
Justin Bayer
Interesting, do you mind posting the URL?
(Mar 14 '11 at 17:20)
Joseph Turian ♦♦
|
|
Have a look at Webstemmer (http://www.unixuser.org/~euske/python/webstemmer/). "Webstemmer is a web crawler and HTML layout analyzer that automatically extracts main text of a news site without having banners, ads and/or navigation links mixed up" |
|
I watched an interesting video lecture about this problem the other day, from the other direction.
See my comment referring to "boilerpipe" for the corresponding implementations :)
(Jan 08 '11 at 14:15)
Christian Kohlschütter
|
|
You could also convert a page to text using lynx -dump url. It produces structured text which makes it nice for parsing. |
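A minimal sketch of how the lynx suggestion could be scripted from Python, assuming lynx is installed and on the PATH. The function names are hypothetical. The parsing step assumes the dump ends with a "References" heading followed by numbered link targets, which is how lynx normally lists links when a page contains any; other lynx versions or options may lay this out differently.

```python
import subprocess


def lynx_dump(url, timeout=30):
    """Return the plain-text rendering of url produced by `lynx -dump`."""
    result = subprocess.run(
        ["lynx", "-dump", url],
        capture_output=True, text=True, timeout=timeout, check=True)
    return result.stdout


def split_dump(dump):
    """Split a dump into (body_text, links).

    Assumption: links are listed after the last line that consists only
    of the word "References", one per line, as "<number>. <target>".
    """
    lines = dump.splitlines()
    body, links = lines, []
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() == "References":
            body = lines[:i]
            for line in lines[i + 1:]:
                number, _, target = line.strip().partition(". ")
                if number.isdigit() and target:
                    links.append(target)
            break
    return "\n".join(body).strip(), links
```

Calling split_dump(lynx_dump(url)) gives the rendered text and the list of link targets separately. Note that the dump still contains navigation and footer text, so this is a conversion step rather than main-content extraction.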
|
This is a hard research problem in general. Research solutions include Stalker and RoadRunner. A commercial solution is AgentBuilder from Fetch: http://www.fetch.com/products/ |
See also: http://metaoptimize.com/qa/questions/2815/how-are-search-engine-snippets-generated and Text extraction from HTML: Jericho vs. Boilerpipe