Revision history
Revision n. 1

Jul 08 '10 at 03:44

Daniel Duckwoth

Practical Problem: Scraping Data from a large website

As Prof. Dan Klein (Stanford, U.C. Berkeley) once said during one of his lectures, "What we do in NLP is more constrained by our data than our ideas." This problem generalizes to many fields, and as much as I would love to work only on the theoretical aspects of AI, proper training data is invaluable for testing hypotheses. For many tasks, convenient data sets are already in place (Penn Treebank, WordNet, etc.), but sometimes I need a different kind of data that I know can be harvested from a public web database.

Does anyone have recommendations on tools or methods for gathering such data? Examples include movie names from IMDB, statistics from sports sites, and time series of stocks from various sources.

Revision n. 2

Jul 12 '10 at 06:33

Joseph Turian

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.