Small (or even fairly large) datasets can often be included directly in a published article. The cost of transmitting the data is thus rolled into the cost of making the article available.

Larger datasets in machine learning and natural language processing have traditionally been distributed by university-based research groups or consortia (e.g. the UCI Machine Learning Repository, the Linguistic Data Consortium, and the Evaluations and Language resources Distribution Agency). In some cases the cost of transmitting the data is absorbed into a university budget. In other cases a fee is charged for access to the data.

Another distribution method is academic or government data competitions/joint evaluations (TREC, CLEF, NTCIR, KDD Cup, etc.). Here the cost of transmitting the data is borne by a university or government.

I'm curious about other distribution mechanisms. In particular, have any substantial research collections been distributed via BitTorrent or other peer-to-peer mechanisms? Any discussions of the practicality and economics of this?

asked Sep 25 '10 at 11:45

Dave Lewis

I think a problem with BitTorrent for distributing research datasets is that there's very little incentive to continue "seeding" a file once you've downloaded it, and most people are selfish. Hence, to actually save on bandwidth costs with BitTorrent, you need a continuous stream of downloaders (since downloaders are always seeding by default), and I think this is an unrealistic scenario for researchers.

(Sep 25 '10 at 14:03) Alexandre Passos ♦
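
To make the bandwidth argument in the comment above concrete, here is a rough back-of-the-envelope sketch. It is not a model of the BitTorrent protocol; the function name, the `peer_fraction` parameter, and all the numbers are assumptions chosen for illustration. It only shows that the original host saves bandwidth in proportion to the share of each download that other peers serve.

```python
def origin_upload_gb(dataset_gb, downloads, peer_fraction):
    """Estimate how much data the original host must upload, in GB.

    peer_fraction is an assumed share (0.0 to 1.0) of each download
    that is served by other peers rather than by the original host.
    """
    if not 0.0 <= peer_fraction <= 1.0:
        raise ValueError("peer_fraction must be between 0 and 1")
    if dataset_gb < 0 or downloads < 0:
        raise ValueError("dataset_gb and downloads must be non-negative")
    return dataset_gb * downloads * (1.0 - peer_fraction)


if __name__ == "__main__":
    # Hypothetical scenario: a 10 GB dataset downloaded 200 times.
    # Nobody seeds after finishing and downloads rarely overlap:
    print(origin_upload_gb(10, 200, 0.0))
    # Peers serve half of every download:
    print(origin_upload_gb(10, 200, 0.5))
```

With sporadic downloads and no lingering seeders, `peer_fraction` stays close to zero and the host pays for nearly every byte, which is the situation the comment describes.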

2 Answers:

There's BioTorrents, a bittorrent tracker site. It's described in: Langille MGI, Eisen JA (2010) BioTorrents: A File Sharing Service for Scientific Data. PLoS ONE 5(4): e10071. doi:10.1371/journal.pone.0010071.

answered Sep 26 '10 at 03:13

jackem

If I can put in a plug for my own company, take a look at this solution: http://blog.webservius.com/2010/09/14/introducing-amazon-simpledb-integration/ - essentially, here's what it can do:

  • You upload your dataset to an Amazon SimpleDB table (or tables)
  • You connect your SimpleDB account to a WebServius account and set any rules you want (e.g. anyone may download up to X rows of data per day for free and beyond that must pay $Y total or $Z per row, or anyone may download the entire dataset after verifying their email address)
  • WebServius turns your SimpleDB tables into a fully-managed, read-only REST API allowing access to the data according to the rules you defined above. If you defined any required payments, WebServius will remit these to you (minus its fee)
  • We will likely be adding other capabilities to this soon (e.g. bulk download of a customized subset of records) in addition to the REST API as a way of accessing the data
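
As an illustration of the kind of access rule described in the list above ("up to X rows per day for free, then $Z per row"), here is a minimal sketch of the arithmetic. It does not use or represent the WebServius or SimpleDB APIs; the function name, parameters, and example numbers are all hypothetical. Prices are in whole cents to avoid floating-point rounding.

```python
def daily_charge_cents(rows_requested, free_rows_per_day, cents_per_row):
    """Charge for one day's downloads under a 'free quota, then per-row' rule.

    All names and values here are hypothetical, not part of any real API.
    """
    if rows_requested < 0 or free_rows_per_day < 0 or cents_per_row < 0:
        raise ValueError("arguments must be non-negative")
    # Only rows beyond the free daily quota are billed.
    billable_rows = max(0, rows_requested - free_rows_per_day)
    return billable_rows * cents_per_row


if __name__ == "__main__":
    # Hypothetical rule: 1,000 free rows per day, 1 cent per extra row.
    print(daily_charge_cents(800, 1000, 1))   # within the free quota
    print(daily_charge_cents(2500, 1000, 1))  # 1,500 rows are billable
```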

answered Oct 24 '10 at 15:34

Eugene Osovetsky

User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.