Revision history
Revision n. 1

Nov 18 '10 at 13:02

Ben McCann

Text extraction from HTML pages

What would be a good way to extract headlines, dates, and authors from news articles? It seems easy to write a scraper using XPath or similar to extract this information from a single site, but I'm not sure what a more scalable solution would be if you're extracting from, say, 10,000 sites.
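
To make the single-site case concrete, here is a minimal sketch of the kind of per-site XPath scraper described above. Everything in it is an assumption for illustration: the sample markup, the class names, and the `SITE_RULES` and `extract_fields` names are hypothetical. It uses the limited XPath support in Python's standard `xml.etree.ElementTree`, which only accepts well-formed markup; real pages usually need a forgiving HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for one site's article page.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="article">
      <h1 class="headline">Example headline</h1>
      <span class="byline">Jane Doe</span>
      <time datetime="2010-11-18">Nov 18, 2010</time>
      <p>Body text.</p>
    </div>
  </body>
</html>
"""

# Hand-written rules for this one layout (assumed paths, not a real site's).
SITE_RULES = {
    "headline": ".//h1[@class='headline']",
    "author": ".//span[@class='byline']",
    "date": ".//time",
}


def extract_fields(page, rules):
    """Return the text of the first node matching each rule, or None."""
    root = ET.fromstring(page)
    result = {}
    for field, path in rules.items():
        node = root.find(path)
        has_text = node is not None and node.text
        result[field] = node.text.strip() if has_text else None
    return result


fields = extract_fields(SAMPLE_PAGE, SITE_RULES)
print(fields)
```

The rules dictionary is the part that does not scale: each new site layout needs its own hand-written set of paths, which is the problem with 10,000 sites.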

Revision n. 2

Nov 22 '10 at 16:41

Ben McCann

Text extraction from HTML pages

What would be a good way to extract content from a website? It seems easy to write a scraper using XPath or similar to extract this information from a single site, but I'm not sure what a more scalable solution would be if you're extracting from, say, 10,000 sites.

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.