|
Small (and even fairly large) datasets can often be included directly in a published article. The cost of transmitting the data is thus rolled into the cost of making the article available. Larger datasets in machine learning and natural language processing have traditionally been distributed by university-based research groups or consortia (e.g. UCI Machine Learning Repository, Linguistic Data Consortium, Evaluations and Language resources Distribution Agency). In some cases the cost of transmitting the data is absorbed into a university budget. In other cases a fee is charged for access to the data. Another distribution method is academic or government data competitions and joint evaluations (TREC, CLEF, NTCIR, KDD Cup, etc.). Here the cost of transmitting the data is borne by a university or government. I'm curious about other distribution mechanisms. In particular, have any substantial research collections been distributed via BitTorrent or other peer-to-peer mechanisms? Are there any discussions of the practicality and economics of this? |
|
There's BioTorrents, a BitTorrent tracker site. It's described in: Langille MGI, Eisen JA (2010) BioTorrents: A File Sharing Service for Scientific Data. PLoS ONE 5(4): e10071. doi:10.1371/journal.pone.0010071. |
|
If I can put in a plug for my own company, take a look at this solution: http://blog.webservius.com/2010/09/14/introducing-amazon-simpledb-integration/ - essentially, here's what it can do:
|
I think one problem with BitTorrent for distributing research datasets is that there's very little incentive to keep "seeding" a file once you've downloaded it, and most people are selfish. So to actually save on bandwidth costs with BitTorrent you need a continuous stream of downloaders (since downloaders always seed by default while they are downloading), and I think that's an unrealistic scenario for researchers.
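
To make the seeding argument concrete, here is a rough toy model rather than a real measurement. It assumes every downloader leaves as soon as the download finishes, every download takes the same time, and peers whose downloads overlap can share pieces perfectly, so the original host only uploads one full copy per group of overlapping downloads. The function name `origin_copies` and all the numbers are made up for illustration.

```python
def origin_copies(arrival_hours, download_hours):
    """Best-case count of full copies the original host must upload.

    Toy assumptions (not from any real tracker data):
    - each peer stays online only while downloading (no seeding afterwards)
    - peers whose downloads overlap in time share pieces perfectly,
      so the host uploads just one copy per overlapping group
    """
    copies = 0
    group_end = None  # time at which the last peer of the current group leaves
    for t in sorted(arrival_hours):
        if group_end is None or t >= group_end:
            # Nobody else is online to share with: the host serves a fresh copy.
            copies += 1
        group_end = t + download_hours
    return copies


# Hypothetical arrival patterns, in hours since the dataset was published.
sparse = [0, 24, 48, 72]   # one researcher per day
dense = [0, 1, 2, 3]       # a steady stream of downloaders

print(origin_copies(sparse, download_hours=2))
print(origin_copies(dense, download_hours=2))
```

With one downloader per day and a two-hour download, no downloads overlap, so the host uploads as many copies as it would over plain HTTP. Only when arrivals overlap does the host's share drop, which is the "continuous stream of downloaders" condition.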