
Does anyone know of the largest publicly available labeled datasets that can be used for testing large-scale binary (or perhaps multinomial) classification models?

Ideally it should be easy to get hold of, and it must be too large to fit in memory on a single node - preferably large enough to need, say, 4-6 processing nodes in an Amazon EC2 cluster.

Dimensionality is not that important, but it would be good if it were on the order of 100s - 1000s and sparse.

asked Nov 15 '11 at 04:45

MLnick

Thanks everyone for your responses. I will definitely look into these various data sources.

I recently came across this publicly available dataset, a web crawl of 5bn+ pages: http://www.commoncrawl.org/. The total size is 40+ TB.

Although it is again not labeled, one could definitely come up with a prediction / classification task based on it. Just thought I would let anyone else who is interested know.
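In case anyone wants to poke at it, here is a minimal Python sketch for listing crawl files from the public S3 bucket with boto3. The "commoncrawl" bucket is the public one, but the "crawl-data/" prefix is an assumption and may differ between crawls.

    # Sketch: list a few Common Crawl files from the public S3 bucket.
    # Anonymous (unsigned) access suffices; the prefix is an assumption.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    resp = s3.list_objects_v2(Bucket="commoncrawl",
                              Prefix="crawl-data/", MaxKeys=10)
    for obj in resp.get("Contents", []):
        print(obj["Key"], obj["Size"])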

(Nov 28 '11 at 04:48) MLnick

4 Answers:

ImageNet has a very large number of images, and there are a variety of different sets of classes you can consider. For instance, restricting yourself just to "animal" images gives nearly 3 million images over nearly 4,000 different fine-grained classes, with on average over 700 images per class. If you consider all the high-level categories, there are 14 million+ images. If you just want the ones with SIFT features extracted, there are 1.2 million. http://www.image-net.org/about-stats

The Pascal Large Scale Learning challenge has an OCR dataset with 3.5 million training cases and a "dna" dataset with 50 million. http://largescale.ml.tu-berlin.de/instructions/

In general, it is hard to find truly huge labeled datasets with high quality labels, especially if you want public ones. If you are willing to use "weakly supervised" data or unlabeled data you have a lot more options.

answered Nov 15 '11 at 16:02

gdahl ♦

Most representations of ImageNet (e.g. SIFT) are dense, though. I didn't know about the Large Scale Learning challenge - looks interesting.

(Nov 16 '11 at 03:35) Andreas Mueller

In "Fast Kernel Classifiers with Online and Active Learning" by Bordes et al. several large sparse datasets are used.References to the datasets are given in the paper. Examples are Forest with 521000 training examples. Look on page 1593. They also mention "adult" a very popular large dataset. Other large datasets are KDDCUP99 (used here) or COVTYPE, used for example here.

This answer is marked "community wiki".

answered Nov 15 '11 at 06:20

Andreas Mueller

The GraphLab webpage links to some larger problems. Unfortunately they are set up as matrix factorization rather than binary/multinomial classification, but it's conceivable that you could come up with a natural classification problem on top of them.

Another approach is to build your own dataset. For something like sentiment analysis it should be very easy to get labeled data in as large a quantity as you want - just mine Amazon reviews, for example (they have an API).
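For example, a minimal Python sketch of the labeling step, assuming the reviews have already been dumped to a JSON-lines file; the file name and the "stars"/"text" fields are hypothetical and depend on how you pull the data.

    # Sketch: turn star ratings into binary sentiment labels.
    # "reviews.jsonl" and the "stars"/"text" fields are hypothetical.
    import json

    def label_from_stars(stars):
        """Map star ratings to binary sentiment; skip neutral 3-star reviews."""
        if stars >= 4:
            return 1  # positive
        if stars <= 2:
            return 0  # negative
        return None   # ambiguous: drop

    examples = []
    with open("reviews.jsonl") as f:
        for line in f:
            review = json.loads(line)
            label = label_from_stars(review["stars"])
            if label is not None:
                examples.append((review["text"], label))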

answered Nov 15 '11 at 07:37

Alexandre Passos ♦

The eBird Reference Dataset is a large dataset with hundreds of dense features. There are many potential binary prediction tasks (namely, predicting whether bird species X is observed under given conditions). Some data preprocessing is required.
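A rough Python sketch of the kind of preprocessing involved; the file and column names below are hypothetical placeholders, not the dataset's actual schema.

    # Sketch: build a "was species X observed?" binary target.
    # The CSV name and column names are hypothetical placeholders.
    import pandas as pd

    df = pd.read_csv("ebird_checklists.csv")

    # Binary label: 1 if the species' count column is positive, else 0.
    y = (df["species_x_count"] > 0).astype(int).values

    # Example dense features: conditions under which the checklist was made.
    feature_cols = ["latitude", "longitude", "day_of_year", "effort_hours"]
    X = df[feature_cols].values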

Another large dataset is ClueWeb09, a web crawl of 1 billion pages that can be purchased from CMU. You will need to define the prediction task yourself, though.

answered Nov 15 '11 at 17:48

Art Munson
