Revision history
Revision n. 1

Jul 13 '10 at 11:51


Phillip Mah

I actually wrote an article on using Python to scrape a sports (hockey) statistics site and do some simple analysis. Check it out at: http://www.arandomforest.com/?cat=4

I used lxml with XPath queries. This works well for structured sites like sports statistics, since the data usually comes in tables. One issue if you go down this path is that many older sports statistics sites also use tables for page layout, so the layout tables may be intertwined with the data tables.
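As a rough illustration of the layout-table problem, here is a minimal sketch using lxml and XPath. The page markup, the player names, and the `extract_stats` helper are all made up for the example and are not taken from the article. The idea is to pick the innermost table that has header cells, so the outer layout table is skipped.

```python
from lxml import html

# Hypothetical page: an outer layout table wraps the real statistics table.
PAGE = """
<html><body>
<table>
  <tr>
    <td>Site navigation</td>
    <td>
      <table class="stats">
        <tr><th>Player</th><th>Goals</th><th>Assists</th></tr>
        <tr><td>Player A</td><td>30</td><td>41</td></tr>
        <tr><td>Player B</td><td>22</td><td>35</td></tr>
      </table>
    </td>
  </tr>
</table>
</body></html>
"""


def extract_stats(page_source):
    """Return the rows of the innermost header-bearing table as dicts."""
    tree = html.fromstring(page_source)
    # Keep only tables that have header cells and contain no nested table;
    # this filters out the outer layout table.
    tables = tree.xpath("//table[.//th and not(.//table)]")
    if not tables:
        return []
    table = tables[0]
    header = [th.text_content().strip() for th in table.xpath(".//th")]
    rows = []
    # Rows with <td> children are data rows; the header row has only <th>.
    for tr in table.xpath(".//tr[td]"):
        cells = [td.text_content().strip() for td in tr.xpath("./td")]
        rows.append(dict(zip(header, cells)))
    return rows


rows = extract_stats(PAGE)
for row in rows:
    print(row)
```

On a real site the XPath predicate would likely need adjusting, for example by matching a class or id on the data table if one exists.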

If you're interested in time series of financial data, the Yahoo Finance API is your best bet for raw data. There is also a great R package, quantmod, that will automatically do charting and technical indicators, and will grab data from Yahoo or Google Finance.

If you have a general interest in web scraping, I can recommend Webbots, Spiders, and Screen Scrapers by Schrenk. It offers some interesting ideas and shows how they can be implemented with PHP.

If you're looking for software libraries, there are Scrapy (as Andrew mentioned), lxml, and Beautiful Soup. From what I remember, development of Beautiful Soup had stopped, but it seems to have been restarted. Scrapy is a framework for scraping, so if you're planning on doing a lot of scraping or data integration, it will probably have more useful tools. If you're simply scraping a set of pages, I find lxml simpler for prototyping.

Revision n. 2

Aug 26 '10 at 16:20


Joseph Turian

I actually wrote an article on using Python to scrape a sports (hockey) statistics site and do some simple analysis. Check it out here: http://www.arandomforest.com/?cat=4



User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.