
Imagine you want to implement a simple Machine Learning library (say in C++). One of the first steps is to define a class to represent a dataset. How would you do this? There are several problems, for example:

  • Your features/labels might have different types (real, integer, nominal, boolean)
  • You may or may not have labels (supervised versus unsupervised)
  • The data might not all fit in memory (one might need some buffering scheme)

Do you know of some good open source examples? I'm looking for some simple and clean abstractions.

asked Jul 06 '10 at 05:20

Hugo Penedones

Great question!

(Jul 06 '10 at 12:40) Joseph Turian ♦♦

10 Answers:

I mostly avoid thinking of a general "dataset" class/interface. As well as the issues you pointed out, you sometimes have to deal with predefined folds, online/stochastic learning (in which you might never have access to the dataset in its entirety), converting a structured learning dataset to the input of a stochastic classifier (in which you don't always want to generate the entire dataset), structured data in general (which might need a more complex tree- or graph-structured representation), etc. Generally I just use a dynamic language (Python), so I don't have to deal with types, and represent a dataset as a list (or data-generating function) of (label, item) pairs (where the label is null if it's unsupervised). But in the same learning problem there is usually more than one version of this, in different states of parsing/preprocessing.

Still, each program I write needs a different abstraction, and I hope I never chase generality far enough to end up with a representation as complex as Weka's.
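A minimal sketch of the representation described above, assuming Python: a dataset is anything that yields (label, item) pairs, either a list or a data-generating function. The names `in_memory_dataset`, `generated_dataset` and `count_labels` are made up for illustration.

```python
from collections import Counter
from typing import Any, Iterable, Iterator, Optional, Tuple

# A dataset is just an iterable of (label, item) pairs; label is None when unsupervised.
Pair = Tuple[Optional[str], Any]

# Variant 1: a plain list held in memory.
in_memory_dataset = [
    ("spam", {"length": 120, "has_link": True}),
    ("ham", {"length": 40, "has_link": False}),
    (None, {"length": 75, "has_link": True}),  # unlabeled example
]


def generated_dataset(n: int) -> Iterator[Pair]:
    """Variant 2: a data-generating function; pairs are produced on demand."""
    for i in range(n):
        label = "even" if i % 2 == 0 else "odd"
        yield label, {"value": i}


def count_labels(dataset: Iterable[Pair]) -> Counter:
    """Works on either variant because it only needs a single pass."""
    return Counter(label for label, _ in dataset)


list_counts = count_labels(in_memory_dataset)
stream_counts = count_labels(generated_dataset(5))
```

Code that only iterates, like `count_labels`, does not care which variant it gets, which is what makes it cheap to keep several versions of the same data at different preprocessing stages.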

answered Jul 06 '10 at 08:43

Alexandre Passos ♦

Orange has a relatively sophisticated class structure designed by Janez Demšar, based on the following classes:

  • Descriptions of attributes/features (Variable)
  • Descriptions of domains/sets of features (Domain)
  • Values of attributes/features (Value)
  • Data instances, cases, examples (Example)
  • Data sets, collections of instances (ExampleTable)
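To make the roles of these classes concrete, here is a rough Python sketch that mirrors the five concepts. It borrows the names from the list above but is not Orange's actual API; every field and method shown (`kind`, `accepts`, `append`, and so on) is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple, Union

Value = Union[float, str, None]  # None marks a missing value


@dataclass(frozen=True)
class Variable:
    """Description of one attribute/feature."""
    name: str
    kind: str                      # "continuous" or "discrete"
    values: Tuple[str, ...] = ()   # allowed values when discrete

    def accepts(self, value: Value) -> bool:
        if value is None:
            return True
        if self.kind == "discrete":
            return value in self.values
        return isinstance(value, (int, float))


@dataclass(frozen=True)
class Domain:
    """An ordered set of features plus an optional class variable."""
    attributes: Tuple[Variable, ...]
    class_var: Optional[Variable] = None


@dataclass
class Example:
    """One data instance, checked against its domain on construction."""
    domain: Domain
    values: List[Value]
    label: Value = None

    def __post_init__(self) -> None:
        if len(self.values) != len(self.domain.attributes):
            raise ValueError("wrong number of values")
        for var, value in zip(self.domain.attributes, self.values):
            if not var.accepts(value):
                raise ValueError(f"{value!r} is not valid for {var.name}")


@dataclass
class ExampleTable:
    """A collection of examples sharing one domain."""
    domain: Domain
    examples: List[Example] = field(default_factory=list)

    def append(self, values: List[Value], label: Value = None) -> None:
        self.examples.append(Example(self.domain, values, label))


colour = Variable("colour", "discrete", ("red", "green"))
size = Variable("size", "continuous")
table = ExampleTable(Domain((colour, size), Variable("ripe", "discrete", ("yes", "no"))))
table.append(["red", 3.5], "yes")
table.append(["green", None], "no")
```

Keeping the feature descriptions in a shared `Domain` means each example stores only values, and type checks happen in one place.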

answered Jul 11 '10 at 11:39

Aleks Jakulin

edited Jul 11 '10 at 11:41

The developers of Mallet seem to have come up with a reasonably good abstraction for instances, features, labels, etc. It is in Java, but the design can easily be implemented in any object-oriented language.

In particular, see the following packages:
cc.mallet.types
cc.mallet.pipe

answered Jul 06 '10 at 13:09

Delip Rao

One place to start looking is the Weka file format called ARFF. Documentation is here.

Essentially, it is a self-describing, text-based format, so it is human-readable. It has good support for sparse data and is used by a reasonably large group of people. It could be a good foundation to build a Dataset type on top of.
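To show what "self-describing" means in practice, here is a small Python sketch that reads a toy ARFF file. It handles only a dense subset of the format (comments, `@relation`, numeric and nominal `@attribute` lines, `@data`, and `?` for missing values); the sparse syntax, quoting, and string/date attributes are deliberately left out. `parse_arff` is a hypothetical helper, not part of Weka.

```python
from typing import List, Tuple

ARFF_TEXT = """\
% toy example
@relation weather
@attribute outlook {sunny, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
rainy,?,yes
"""


def parse_arff(text: str) -> Tuple[str, List[Tuple[str, object]], List[list]]:
    """Parse a small dense subset of ARFF: numeric and nominal attributes only."""
    relation = ""
    attributes: List[Tuple[str, object]] = []  # (name, "numeric" or list of nominal values)
    rows: List[list] = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        if in_data:
            cells = [c.strip() for c in line.split(",")]
            row = []
            for (_, kind), cell in zip(attributes, cells):
                if cell == "?":                # ARFF marks missing values with "?"
                    row.append(None)
                elif kind == "numeric":
                    row.append(float(cell))
                else:
                    row.append(cell)
            rows.append(row)
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, spec = rest.strip().partition(" ")
            spec = spec.strip()
            if spec.startswith("{"):
                kind = [v.strip() for v in spec.strip("{}").split(",")]
            else:
                kind = spec.lower()
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
```

The header gives everything a Dataset type needs to know about names and types before the first data row is read.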

answered Jul 06 '10 at 07:37

Jurgen

For the last two years I have been using the MVC programming model and databases. You might be familiar with MVC, but here is a quick summary: you define your data in your model files (the M in MVC), describing the properties of your data in a very abstract way. Then you read different sources of data and store your objects in the database using your controller. The view defines how you want to report your data. There are MVC libraries for almost every language (I use Python).

Over the course of two years I have found MVC to be at the core of every component I write. It really organizes my work when I do data mining or web crawling.

I hope this is in line with your question.
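As a rough illustration of the model/controller/view split applied to a dataset, here is a Python sketch using only the standard-library `sqlite3` module instead of a full MVC framework. The class and function names (`Sample`, `SampleStore`, `import_lines`, `label_report`) and the tab-separated input format are assumptions for the example.

```python
import sqlite3
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Sample:                      # Model: what one record looks like
    text: str
    label: Optional[str]


class SampleStore:                 # Model: persistence in a database
    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS sample (text TEXT, label TEXT)")

    def add(self, sample: Sample) -> None:
        self.db.execute("INSERT INTO sample VALUES (?, ?)", (sample.text, sample.label))

    def __iter__(self) -> Iterator[Sample]:
        # The cursor fetches rows incrementally, so the table need not fit in memory.
        for text, label in self.db.execute("SELECT text, label FROM sample"):
            yield Sample(text, label)


def import_lines(store: SampleStore, lines: Iterable[str]) -> None:
    """Controller: read a source of "label<TAB>text" lines and store the objects."""
    for line in lines:
        label, _, text = line.rstrip("\n").partition("\t")
        store.add(Sample(text, label or None))
    store.db.commit()


def label_report(store: SampleStore) -> str:
    """View: decide how the data is reported."""
    query = "SELECT label, COUNT(*) FROM sample GROUP BY label ORDER BY label"
    return "\n".join(f"{label}: {count}" for label, count in store.db.execute(query))


store = SampleStore()
import_lines(store, ["pos\tgreat paper", "neg\tbroken link", "pos\tnice result"])
report = label_report(store)
```

Because the store is iterable, a learner can consume it like any other dataset while the database takes care of buffering.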

answered Jul 06 '10 at 07:43

Mark Alen

I hadn't even thought about doing this before, but MVC for machine learning makes a lot of sense. Is there any place we can see this in action?

(Jul 06 '10 at 07:45) Jurgen

What MVC library do you use with Python?

(Jul 06 '10 at 11:52) Pedro Alcocer

MVC would give you a nice abstraction. I think the biggest thing it offers is the ability to use a third-party database to handle the third bullet point (data that does not fit in memory and needs some buffering scheme). The rest, handling different feature types specifically, still needs to be built by the user, but there's a real opportunity to use this model to build a very powerful abstraction for ML.

(Jul 06 '10 at 13:25) Andrew Rosenberg

I usually load Django and stay with their MVC model. I know there are easier models but Django enables me to do reporting over the web.

(Jul 06 '10 at 18:16) Mark Alen

Actually, this depends on the programming paradigm you are planning to use. I have programmed dataset abstractions in C, Java and Python. (I may not have the best solution for you, though, because my model is pretty structured.) I used this solution for reading data from Weka's ARFF file format.

Briefly, my C version looks like this:

I used four structs: one for attributes, one for instances, one for classes and one for the dataset. The attribute struct stores the names of the attributes and their types (boolean, numeric, nominal, etc.); the type codes are defined as macros. The instance struct stores the attribute values. The class struct stores the names and the number of classes found in the dataset. The dataset struct stores the instance-class pairs.

A general overview of my Java version:

This is similar to the C version, except done with classes. The technique is pretty simple, but you might want to extend it according to your needs. I use a class called Attributes for managing and storing the attributes, their types, etc. (the attribute types are stored in an enum). I have one abstract BaseInstance class, and there are other instance classes (SparseInstance, etc.) derived from it. I don't use a separate class for classes. I have a Dataset class, which is essentially the pairing of classes and instances. To fit large amounts of data in memory and fetch it efficiently, I used an LRU cache.

I'm not very experienced in Python, so I tried to do it in the easiest way I could, similar to the one described by Alexandre.
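The answer does not say how the LRU cache is wired in, so the following is only one plausible arrangement, sketched in Python rather than Java: instances are grouped into fixed-size chunks, and a small LRU cache decides which chunks stay in memory. `ChunkedDataset` and `fake_loader` are hypothetical names, and the loader stands in for real disk access.

```python
from collections import OrderedDict
from typing import Callable, List


class ChunkedDataset:
    """Serves instances by index while keeping only a few chunks in memory."""

    def __init__(self, size: int, chunk_size: int, max_chunks: int,
                 load_chunk: Callable[[int], List[list]]) -> None:
        self.size = size
        self.chunk_size = chunk_size
        self.max_chunks = max_chunks
        self.load_chunk = load_chunk            # hypothetical loader, e.g. reads from disk
        self.cache: "OrderedDict[int, List[list]]" = OrderedDict()
        self.loads = 0                          # number of cache misses, for inspection

    def __len__(self) -> int:
        return self.size

    def __getitem__(self, index: int) -> list:
        if not 0 <= index < self.size:
            raise IndexError(index)
        chunk_id, offset = divmod(index, self.chunk_size)
        if chunk_id in self.cache:
            self.cache.move_to_end(chunk_id)    # mark as most recently used
        else:
            self.cache[chunk_id] = self.load_chunk(chunk_id)
            self.loads += 1
            if len(self.cache) > self.max_chunks:
                self.cache.popitem(last=False)  # evict the least recently used chunk
        return self.cache[chunk_id][offset]


def fake_loader(chunk_id: int) -> List[list]:
    """Stand-in for disk access: chunk k holds instances 4k .. 4k+3."""
    return [[float(i), float(i) ** 2] for i in range(chunk_id * 4, chunk_id * 4 + 4)]


data = ChunkedDataset(size=12, chunk_size=4, max_chunks=2, load_chunk=fake_loader)
first = data[1]   # miss: loads chunk 0
_ = data[5]       # miss: loads chunk 1
_ = data[2]       # hit on chunk 0
_ = data[9]       # miss: loads chunk 2 and evicts chunk 1
```

Sequential scans and locally clustered accesses benefit most from this; fully random access over a huge dataset would miss on nearly every lookup.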

answered Jul 07 '10 at 16:29

cglr

edited Jul 10 '10 at 02:21

My typical architecture would look like the following:

I would have an Atomic class. This would basically represent a single measurement. It would be a generic, so in C# my types would usually be decimal, money, string, bool, int. I would typically associate a set of constraints with it. So if my alphabet is {g,t,c,a}, I would enforce that through a set of validators that I inject into the Atomic class.

I then typically need to put the atomic types into some sort of collection. I usually divide them into two groups, ordered and unordered. Ordered is as it sounds: I can place some sort of order on the objects (x < y). I would then usually place my Atomics in a proxy class that contains the index over which I order.

The Atomic Proxies are then placed in some sort of collection that implements an interface for selecting items with certain characteristics.
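A loose Python rendering of this architecture (the original is in C#) might look like the sketch below. The class names follow the description, but every signature, and the `select` method in particular, is an assumption about what such an interface could look like.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, List, Sequence, TypeVar

T = TypeVar("T")
Validator = Callable[[T], bool]


class Atomic(Generic[T]):
    """A single measurement guarded by injected validators."""

    def __init__(self, value: T, validators: Sequence[Validator] = ()) -> None:
        for check in validators:
            if not check(value):
                raise ValueError(f"invalid value: {value!r}")
        self.value = value


@dataclass(order=True)
class OrderedProxy:
    """Wraps an Atomic together with the index it is ordered by."""
    index: int
    atom: Atomic = field(compare=False)  # ordering uses the index only


class AtomicCollection:
    """Holds proxies in index order and selects items by a characteristic."""

    def __init__(self, proxies: List[OrderedProxy]) -> None:
        self.proxies = sorted(proxies)

    def select(self, predicate: Callable[[object], bool]) -> list:
        return [p.atom.value for p in self.proxies if predicate(p.atom.value)]


def in_dna_alphabet(base: str) -> bool:
    return base in {"g", "t", "c", "a"}


# Positions arrive out of order; the proxy index restores the sequence order.
reads = [(2, "c"), (0, "g"), (1, "a")]
collection = AtomicCollection(
    [OrderedProxy(i, Atomic(b, [in_dna_alphabet])) for i, b in reads]
)
sequence = "".join(p.atom.value for p in collection.proxies)
purines = collection.select(lambda b: b in {"a", "g"})
```

Injecting validators keeps the measurement class itself free of domain rules, so the same `Atomic` works for a DNA alphabet, a price range, or anything else.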

My methods are very structured, but that is a result of my environment. I do not work in an ad hoc environment; rather, my data is mapped to persistence layers through ORMs (Fluent NHibernate), and my tools are rather industrial.

answered Jul 08 '10 at 14:45

bearrito

NetCDF may be an option. It also has Python bindings.

answered Jul 06 '10 at 12:32

osdf

One thing I usually implement in my data classes is a sequence index. This was born from the need to keep data extracted from videos together per sequence. It is of paramount importance not to mix data from the same sequence into both training and test sets. My cross-validation routines respect this.

Of course, the sequence index is general enough that it can be used just as easily for other variables that need to stay grouped, such as data from different subjects or different locations.
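A sketch of what a cross-validation routine that respects a sequence index could look like in Python. `grouped_folds` is a hypothetical helper, and the balancing rule (largest group first, into the currently smallest fold) is just one simple choice, not necessarily what the author uses.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def grouped_folds(groups: Sequence[Hashable], k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) so that no group is split across the two."""
    members: Dict[Hashable, List[int]] = defaultdict(list)
    for index, group in enumerate(groups):
        members[group].append(index)
    if k > len(members):
        raise ValueError("more folds than groups")
    # Assign whole groups to folds, largest first, always to the smallest fold so far.
    folds: List[List[int]] = [[] for _ in range(k)]
    for group in sorted(members, key=lambda g: -len(members[g])):
        min(folds, key=len).extend(members[group])
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(groups)) if i not in held_out]
        yield train, sorted(test)


# One sequence id per frame: frames of the same video share an id.
sequence_ids = ["v1", "v1", "v1", "v2", "v2", "v3", "v3", "v3", "v4"]
splits = list(grouped_folds(sequence_ids, k=2))
```

The same function works unchanged if the ids denote subjects or recording locations instead of videos.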

answered Jul 07 '10 at 04:11

Michel Valstar

I prefer to abstract away from the concrete structure of the dataset and the feature vectors it represents, using templates and some expected operators or functions rather than defining a class or concrete type.

Often you don't really need to know the type of a feature; you just need to distinguish between different values and types and/or compare them.

Whenever you need access to the whole dataset, you can just imagine it as an iterator over feature vectors. If the dataset is really large, the only thing you need to do is develop another iterator that behaves like a "typed stream".
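The answer has C++ templates in mind; the sketch below shows the same idea in Python, where duck typing plays the role of the template parameter. `distinct_values` only assumes that feature values can be hashed and compared for equality, and `stream_vectors` is a hypothetical streaming iterator with `io.StringIO` standing in for a large file.

```python
import io
from typing import Hashable, Iterable, Iterator, List, Sequence, Set, TextIO


def distinct_values(vectors: Iterable[Sequence[Hashable]]) -> List[Set[Hashable]]:
    """Collect the distinct values per feature position in one pass.

    Only requires that feature values can be hashed and compared for
    equality; their concrete types are never inspected.
    """
    seen: List[Set[Hashable]] = []
    for vector in vectors:
        while len(seen) < len(vector):
            seen.append(set())
        for position, value in enumerate(vector):
            seen[position].add(value)
    return seen


def stream_vectors(handle: TextIO) -> Iterator[List[str]]:
    """A streaming iterator: yields one vector per line, never the whole file."""
    for line in handle:
        line = line.strip()
        if line:
            yield line.split(",")


# The same function serves an in-memory list of mixed-type tuples ...
in_memory = [(1.5, "red", True), (2.0, "red", False), (1.5, "blue", True)]
from_list = distinct_values(in_memory)

# ... and a stream (io.StringIO stands in for a large file on disk).
from_stream = distinct_values(stream_vectors(io.StringIO("a,x\nb,x\na,y\n")))
```

Swapping the in-memory list for the stream requires no change to the algorithm, which is the point of programming against the iterator rather than a dataset class.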

answered Jul 09 '10 at 06:17

Sergey Bartunov


User submitted content is under Creative Commons: Attribution - Share Alike; Other things copyright (C) 2010, MetaOptimize LLC.