
As Prof. Dan Klein (Stanford, U.C. Berkeley) once said during one of his lectures, "What we do in NLP is more constrained by our data than our ideas." This problem generalizes to many fields, and as much as I would love to work purely on the theoretical aspects of AI, proper training data is invaluable for testing hypotheses. For many tasks, convenient data sets are already in place (Penn Treebank, WordNet, etc.), but sometimes I need something of a different type that I know can be harvested from a public web database.

Does anyone have any recommendations on tools or methods for gathering such data? Examples include movie names from IMDB, statistics from sports sites, and time series of stock prices from various sources.

asked Jul 08 '10 at 03:44

Daniel Duckwoth

edited Jul 12 '10 at 06:33

Joseph Turian ♦♦


10 Answers:

A very important thing in writing your own scraper is to cache everything. Sometimes, a nasty bug only happens after a week of harvesting data, or you suddenly remember you wanted some extra piece of information, and crawling all over again is really annoying.

The solution I use is to keep an on-disk cache of the content of every URL the scraper visits. With this it becomes very easy to test, debug, and enhance the code, and it is also hard for a crash to destroy what you've already got.
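A minimal sketch of that kind of cache (the directory name is arbitrary; on Python 3 use urllib.request and hash the URL as bytes):

    import hashlib
    import os
    import urllib2  # urllib.request on Python 3

    CACHE_DIR = "scrape_cache"  # arbitrary directory name

    def fetch(url):
        """Return the body of url, hitting the network only on a cache miss."""
        if not os.path.exists(CACHE_DIR):
            os.makedirs(CACHE_DIR)
        path = os.path.join(CACHE_DIR, hashlib.sha1(url).hexdigest())
        if os.path.exists(path):
            return open(path, "rb").read()      # cache hit: reuse the stored copy
        body = urllib2.urlopen(url).read()      # cache miss: fetch once...
        open(path, "wb").write(body)            # ...and keep it for next time
        return body

Rerunning the scraper after a crash or a bug fix then only costs you the URLs you haven't seen yet.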

answered Jul 18 '10 at 16:44

Alexandre Passos ♦

You can download the IMDB data as plain-text files here: http://www.imdb.com/interfaces#plain

answered Jul 14 '10 at 09:01

dirknbr

edited Jul 14 '10 at 09:01

I actually wrote an article on using Python to scrape a sports (hockey) statistics site and do some simple analysis. Check it out here.

I used lxml with XPath queries. This is good for structured sites like sports statistics, since the data usually comes in tables. One issue if you go down this path is that many of the older sports statistics sites use table-based layouts, so the layout tables may be intertwined with the data tables.
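For example, something like this (the URL and the table class name are placeholders):

    import urllib2   # urllib.request on Python 3
    import lxml.html

    # Placeholder stats page and class name.
    html = urllib2.urlopen("http://example.com/stats").read()
    doc = lxml.html.fromstring(html)

    # XPath fits nicely here because the data really is tabular.
    for row in doc.xpath('//table[@class="stats"]//tr'):
        cells = [c.text_content().strip() for c in row.xpath('./td')]
        if cells:               # skips header/layout rows that have no <td>
            print(cells)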

If you're interested in time series of financial data, the Yahoo Finance API is your best bet for raw data. There is also a great R package, quantmod, that will automatically do charting and technical indicators, and will grab data from Yahoo or Google Finance.
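The raw data usually comes back as CSV, so the Python side is tiny (the URL below is purely illustrative, not the real endpoint; check the current Yahoo Finance documentation for the actual query parameters):

    import csv
    import urllib2   # urllib.request on Python 3

    # Illustrative URL only -- substitute the real historical-quotes endpoint.
    url = "http://example.com/historical.csv?symbol=GOOG"
    rows = list(csv.reader(urllib2.urlopen(url)))

    header, data = rows[0], rows[1:]
    print(header)    # typically Date, Open, High, Low, Close, Volume, ...
    print(data[0])   # first data row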

If you have a general interest in web scraping, I can recommend Webbots, Spiders, and Screen Scrapers by Schrenk. It offers some interesting ideas and shows how they can be implemented in PHP.

If you're looking for software libraries, there are Scrapy (as Andrew mentioned), lxml, and Beautiful Soup. From what I remember, Beautiful Soup development had stopped, but it seems to have been restarted. Scrapy is a framework for scraping, so if you're planning on doing a lot of scraping or data integration, it will probably have more useful tools. If you're simply scraping a set of pages, I find lxml is simpler for prototyping.

answered Jul 13 '10 at 11:51

Phillip Mah

edited Aug 26 '10 at 16:20

Joseph Turian ♦♦

There is a Python library on Google Code that is released under the LGPL.

answered Jul 12 '10 at 08:16

Christian StadeSchuldt

Have you used this tool? I also mentioned it in my answer: http://metaoptimize.com/qa/questions/797/practical-problem-scraping-data-from-a-large-website#892 but I haven't gotten around to trying it yet.

(Aug 26 '10 at 16:20) Joseph Turian ♦♦

I may be old-fashioned, but I simply use an HTML parsing library with support for XPath expressions; I don't use Ruby but Hpricot is what all my Ruby friends seem to like for scraping. In order to ignore ad text and other non-content text, I typically look at the ratio of punctuation and stop words to text.
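A rough sketch of that heuristic (the stop-word list and threshold below are arbitrary; in practice use a real stop-word list, e.g. from NLTK):

    import re

    # Tiny illustrative stop-word list; substitute a real one in practice.
    STOP_WORDS = set("the a an and or of to in is it that for on with as".split())

    def looks_like_content(text, threshold=0.15):
        """Real prose contains plenty of stop words and sentence punctuation;
        ads and navigation chrome usually do not."""
        tokens = re.findall(r"[a-z']+", text.lower())
        if not tokens:
            return False
        stop_ratio = sum(t in STOP_WORDS for t in tokens) / float(len(tokens))
        punct_ratio = len(re.findall(r"[.,;:!?]", text)) / float(len(tokens))
        return stop_ratio + punct_ratio > threshold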

answered Jul 09 '10 at 09:49

aria42

I second the suggestion to use 80legs for simple computational crawling. You can use my code py80legsformat to grok their data from within Python.

get-theinfo is a great mailing list of data hoarders. Many times, the people on that list already have the data that you need.

I also think there would be value in asking on this site where you can get so-and-so specific dataset.

[edit: SiteScraper recently released their Python web-scraping code here as webscraping and sitescraper. Given that this is the author's primary freelancing specialty, I am interested in checking it out.]

answered Jul 09 '10 at 00:49

Joseph Turian ♦♦

edited Jul 12 '10 at 06:38


Loved your use of the word "grok" in this context :)

(Jul 11 '10 at 04:34) Tal Galili

get-theinfo is the mailing list for theinfo, which is another place to check. There's also infochimps, who link to a lot of publicly available data (the quality varies widely, though).

For the specific examples you cite, though: IMDB offers their data for download, and I believe the Yahoo/Google Finance APIs allow getting stock-price time series.

(Jul 13 '10 at 09:58) Srihari

thanks for linking to my work Joseph. Here are my thoughts on Scrapy, lxml and BeautifulSoup: http://blog.sitescraper.net/2010/08/why-not-just-use-scrapy.html

(Aug 29 '10 at 09:38) Richard Penman

I don't know if you have come across Poyozo (http://mypoyozo.com/#tour), which does some of what you're asking for. It's a new product that has just been launched, so I've only heard of it, not used it.

answered Jul 08 '10 at 22:41

New High Score

I don't think they expose crawling functionality.

(Aug 26 '10 at 16:24) Joseph Turian ♦♦

Check out Scrapy.

It's perfect for large-scale scraping tasks. We use it for all sorts of one-time scraping tasks at my startup, Parse.ly. It usually takes about an hour to write a scraper for a big site, and then the crawls run pretty quickly thanks to Python Twisted (an evented I/O framework). Plus, it comes with a nice web-based console for monitoring crawl jobs in progress.
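A minimal spider looks roughly like this (the spider name, URL, and XPath are placeholders; note this uses the current Scrapy API, which has changed somewhat since this was written):

    import scrapy

    class MovieSpider(scrapy.Spider):
        name = "movies"                              # placeholder spider name
        start_urls = ["http://example.com/movies"]   # placeholder start page

        def parse(self, response):
            # One item per listing; the XPath is illustrative.
            for title in response.xpath('//h2[@class="title"]/text()').getall():
                yield {"title": title.strip()}

    # Run with: scrapy runspider movie_spider.py -o movies.json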

answered Jul 08 '10 at 11:32

Andrew Montalenti

edited Jul 08 '10 at 11:33

Do you know how to distribute Scrapy? Some sites will throttle or block you if you crawl from a single IP.

(Aug 26 '10 at 16:19) Joseph Turian ♦♦

Try using the DOWNLOAD_DELAY setting to throttle the requests so that your IP doesn't get banned in the first place. If it continues being a problem, your question starts to move from the domain of Scrapy to that of extracting data out of a site in spite of the site owner's own policies over how their servers are used. Scrapy wasn't made for getting around this problem; if you need distributed crawling, try Nutch, http://nutch.apache.org/.
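For reference, that's a one-line change in the project's settings.py (the value here is arbitrary):

    # settings.py
    DOWNLOAD_DELAY = 2.0              # seconds between requests to the same site
    RANDOMIZE_DOWNLOAD_DELAY = True   # jitter the delay; this is the default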

(Aug 27 '10 at 11:16) Andrew Montalenti

Scraping web pages is best done with Python (or Jython) and the Beautiful Soup library. I've used this many times to collect datasets for NLP, say from message boards. Python in general is great because usually you only have to do this kind of thing once per site and then forget about it, or just rerun your previous scripts. Python also has urllib2, which reads directly from web addresses. There are also Flickr libraries that work well. These techniques exist in other languages too; an O'Reilly book I read a long time ago mentioned a way to do it with Microsoft ASP, and .NET, Ruby, etc. all have HTML parsing libraries (note that regexes may not be entirely helpful for this).
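A sketch of the kind of throwaway script I mean (the URL and the div class are placeholders; with Beautiful Soup 4 the import is from bs4 instead):

    import urllib2                              # urllib.request on Python 3
    from BeautifulSoup import BeautifulSoup     # from bs4 import BeautifulSoup for BS4

    # Placeholder message-board thread.
    html = urllib2.urlopen("http://example.com/forum/thread/123").read()
    soup = BeautifulSoup(html)

    # Grab the text of every post; the class name is illustrative.
    for post in soup.findAll("div", {"class": "post"}):
        print("".join(post.findAll(text=True)))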

answered Jul 08 '10 at 10:53

th0ma5


For me, lxml.html.fromstring has dealt better with malformed markup than beautifulsoup. It also lets you do xpath queries to extract specific parts of the markup.

(Jul 08 '10 at 10:58) Alexandre Passos ♦

Python also has the advantage of really simple multithreading, so you can shoot off a bunch of http requests without worrying about a slow server response hanging your script. (http://www.tutorialspoint.com/python/python_multithreading.htm)
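Something like this worker-pool pattern (the URLs are placeholders; on Python 3 the modules are queue and urllib.request):

    import threading
    import urllib2                   # urllib.request on Python 3
    from Queue import Queue, Empty   # "queue" on Python 3

    urls = Queue()
    for u in ["http://example.com/page1", "http://example.com/page2"]:  # placeholders
        urls.put(u)

    def worker():
        while True:
            try:
                url = urls.get_nowait()
            except Empty:
                return                          # nothing left to fetch
            try:
                body = urllib2.urlopen(url, timeout=10).read()
                print(url, len(body))
            except Exception as e:              # a slow or dead server costs one thread, not the run
                print(url, "failed:", e)

    threads = [threading.Thread(target=worker) for _ in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()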

(Jul 08 '10 at 11:10) Andrew Rosenberg

You can use services like 80legs.com for scraping/crawling websites. [No affiliation]

You need to be mindful of their TOS, though, and you should always follow robots.txt (a quick programmatic check is sketched further down). There are also issues like re-identification attacks that you must consider when scraping a public website.

E.g., scraping a Flickr-like website might seem harmless, but if you redistribute the data there might be several issues. Since changes in the original website are not reflected in your data set, it could be used to find out the past actions of certain users.

Certain websites allow researchers to scrape the data and whitelist their IPs for the purpose of scraping, but you need to seek their permission.
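Checking robots.txt programmatically is easy with the standard library (urllib.robotparser on Python 3; the site and user-agent string are placeholders):

    import robotparser   # urllib.robotparser on Python 3

    rp = robotparser.RobotFileParser()
    rp.set_url("http://example.com/robots.txt")   # placeholder site
    rp.read()

    # Only download pages that robots.txt allows for your crawler's user agent.
    if rp.can_fetch("my-research-bot", "http://example.com/some/page"):
        pass   # safe to fetch
    else:
        print("Disallowed by robots.txt; skipping")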

Finally, you can get several large NLP datasets on the web:
http://twitter.mpi-sws.org/ [not available yet]
http://theinfo.org/
Amazon Public Data Sets [hosted on Amazon EBS]

This answer is marked "community wiki".

answered Jul 08 '10 at 10:19

DirectedGraph
