I second the suggestion to use 80legs for simple computational crawling. You can use my code, py80legsformat, to parse their data from within Python.
get-theinfo is a great mailing list of data hoarders. Often, people on that list already have the data you need.
I also think there would be value in asking on this site where to find a specific dataset.