|
As Prof. Dan Klein (Stanford, U.C. Berkeley) once said during one of his lectures, "What we do in NLP is more constrained by our data than our ideas." This problem generalizes to many fields, and as much as I would love to work only on the theoretical aspects of AI, proper training data is invaluable for testing hypotheses. For many tasks, convenient data sets are already in place (Penn Treebank, WordNet, etc.), but sometimes I need something of a different type that I know can be harvested from a public web database. Does anyone have recommendations on tools or methods for gathering such data? Examples include movie names from IMDB, statistics from sports sites, and time series of stocks from various sources. |
|
I don't know if you have come across Poyozo (http://mypoyozo.com/#tour), which does some of what you're asking for. It's a new product that has just been launched, so I've only heard of it, not used it. I don't think they expose crawling functionality.
(Aug 26 '10 at 16:24)
Joseph Turian ♦♦
|