As Prof. Dan Klein (Stanford, U.C. Berkeley) once said during one of his lectures, "What we do in NLP is more constrained by our data than our ideas." This problem generalizes to many fields, and as much as I would love to work only on the theoretical aspects of AI, proper training data is invaluable for testing hypotheses. For many tasks, convenient data sets are already in place (Penn Treebank, WordNet, etc.), but sometimes I need data of a different type that I know can be harvested from a public web database.
Does anyone have recommendations on tools or methods for gathering such data? Examples include movie names from IMDB, statistics from sports sites, and stock time series from various sources.