
In my work, I find myself constantly creating new data files (e.g., training data for different types of classifiers). I often iterate over different versions of these data files, and it's easy to lose track of what I did for each file. Do you have practical tips for managing your data files? For example, here are a few methods I've found useful:

  • Put a README in each data directory that describes the files
  • When generating a new data file, always include the date of creation in the file name
  • Keep a journal of what I did each day

What practical tips do you have?

asked Dec 19 '11 at 01:41

Alec

edited Dec 19 '11 at 12:35


7 Answers:

I use Makefiles: I keep the original data files, and specify transformations as Makefile rules, e.g.:

build/data.normed.txt: data/data.txt
   cat $^ | norm-data-somehow > $@

build/data.transformed.txt: build/data.normed.txt etc/transformation.rules
   cat build/data.normed.txt  | transform-data-somehow -r etc/transformation.rules > $@

This has the following advantages:

  • I can regenerate my transformed files when the data changes (or I get more data)
  • I don't have to back up my transformed data files (e.g., I use the convention that everything generated from the original data files goes into the "build" subdirectory, and "build" is excluded from backup)
  • I always know how the transformed files were generated
  • When I change some step in the transformation chain, I only have to regenerate the files that depend on that step (this is an advantage over the "recipe"-style transformation scripts I've often seen people use, where a recipe script applies all the transformations from start to end)
  • The Makefile can be managed by version control

answered Dec 21 '11 at 08:14

paraba

edited Jan 04 '12 at 10:34

Thanks! This is a fantastic suggestion.

(Dec 23 '11 at 22:24) Alec

I recently discovered something even better than Makefiles: Ducttape, a "workflow management system for researchers who love Unix", written in Scala.

(Sep 17 '12 at 10:07) paraba

When it comes down to it, the problem is how best to store the metadata associated with your files. Examples might include which dataset you used to derive the model or intermediate data, the parameters you passed to the process, which version of the software you used, etc.

The most common way people approach this problem is by using the filename as a catch-all for the metadata. This works well, at least superficially and initially, but in the long run it is doomed. A simple example: let's assume you're generating a random forest model. At first you might only be varying the number of trees, so you might call your files "randomForest.1000Trees.model" and "randomForest.5000Trees.model".

However, after the next pass you might want to look at the number of features used in each tree, so now you have to use a format like "randomForest.1000Trees.300Features.model", etc., which includes going back and renaming your starting files. Now let's say you investigate whitening the data before fitting: "randomForest.1000Trees.300Features.PCA.model", "randomForest.5000Trees.1000Features.noWhiten.model".

You can clearly see that after a few iterations this approach is doomed. Some people try to fix the problem by using a directory structure as well. This may have a little more longevity, but in the long run it is also doomed.

So my approach is somewhat different. I have a script that, for a given directory, maintains a metadata file mapping random hashes to a CSV row of various fields. The hashes are generated using Unix's mktemp utility, which prevents collisions. So when I need to generate some data, instead of using insanely long file names I simply use my map function (a rough sketch of the idea appears at the end of this answer). The mapper script also takes command-line flags that correspond to the metadata columns, and the best part is that you can reuse the same flags for both your data generator and the mapper. So:

fitRandomForest --trees 1000 --features 300 --whiten PCA --output randomForest.1000Trees.300Features.PCA.model

turns into

fitArgs="--trees 1000 --features 300 --whiten PCA"
fitRandomForest $fitArgs --output $(mapSubInst $fitArgs)

This guarantees that the stored metadata actually matches what the file was generated with.

The script can also return matches using regular expressions and numeric relations. Let's say you wanted to run some operation on every model with 1000 or more trees that was PCA- or kernelPCA-whitened:

mapSubInst --trees ">=1000" --whiten "(kernel)*PCA" | xargs ensembleFit

Finally, the script also has the ability to add new metadata fields and assign a default value to all previous entries. So let's jump back and say you wanted to add the whiten field as before. In the filename-metadata situation you would have to manually rename all your pre-existing files to reflect their metadata. Using a mapper script, you can simply:

mapSubInst --addField whiten --default "noWhiten"

Now the metadata of all your pre-existing files has been updated to reflect the fact that they weren't whitened.
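For illustration, here is a minimal Python sketch of the kind of mapper described above. It is not the author's actual mapSubInst script: the metadata.csv layout, the function names, and the use of uuid instead of mktemp are my own assumptions, and the regex/numeric query and command-line handling are omitted.

    # metadata_mapper.py -- hypothetical sketch, not the author's actual mapper script
    import csv, os, uuid

    META = "metadata.csv"

    def _load():
        if not os.path.exists(META):
            return []
        with open(META, newline="") as f:
            return list(csv.DictReader(f))

    def _save(rows):
        fields = sorted({k for r in rows for k in r})
        with open(META, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fields)
            writer.writeheader()
            writer.writerows(rows)

    def map_sub_inst(**meta):
        """Return a short random file name for this metadata, adding an entry if it is new."""
        rows = _load()
        for row in rows:
            if all(str(row.get(k, "")) == str(v) for k, v in meta.items()):
                return row["hash"]
        name = uuid.uuid4().hex[:8]          # random, collision-resistant (mktemp in the original)
        rows.append({"hash": name, **meta})
        _save(rows)
        return name

    def add_field(name, default):
        """Add a new metadata column, backfilling a default value for existing entries."""
        rows = _load()
        for row in rows:
            row.setdefault(name, default)
        _save(rows)

With something like this, map_sub_inst(trees=1000, features=300, whiten="PCA") plays the role of the mapSubInst call in the shell examples above, and add_field("whiten", "noWhiten") backfills a new column across all existing entries.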

answered Sep 27 '12 at 15:51

Douglas Colkitt

There are numerous good answers here. I would like to highlight a couple of routes I haven't seen mentioned.

  1. Common Data Format (CDF). This format, and others like it, are designed to overcome dependence on a particular data format. The CDF FAQ explains it better than I can.

  2. NoSQL solutions. NoSQL is something of a buzzword, but essentially there are numerous projects such as MongoDB, Kyoto Cabinet, Cassandra, Hypertable, ... (the list goes on) that move away from the RDBMS architecture of products such as MySQL and PostgreSQL. I bring this up because if you are dealing with many large files with associated metadata (as is often the case in scientific applications), then a document-based DB such as MongoDB could be a good choice for you (see the sketch at the end of this answer). Building a full RDBMS schema for a scientific application is often somewhat awkward and unnatural, given that what you are really interested in is simply a persistent representation of your data that doesn't suffer from file system (and brain) disorganization.

Additionally, if your application does involve massive amounts of data, extremely high throughput, and/or very strict requirements for data integrity/persistence, then one of the above solutions is likely to perform better out of the box than an RDBMS.

  • I will reiterate what everyone else has said: keep a journal! Regardless of what else you do, your original data files are still going to be kicking around somewhere, and if someone ever needs to use them they will be very frustrated if it's not clear what the columns mean, what the types are, what binary format you're using, etc. Keeping notes is just good scientific practice, no matter what you're doing.
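To make the document-database suggestion concrete, here is a small hypothetical sketch using pymongo, the Python driver for MongoDB. The database, collection, and field names are invented for illustration.

    # model_metadata.py -- hypothetical sketch of keeping run metadata in MongoDB
    from pymongo import MongoClient

    client = MongoClient()                    # assumes a local MongoDB instance
    runs = client.experiments.model_runs      # database "experiments", collection "model_runs"

    # Record one model-fitting run together with its metadata.
    runs.insert_one({
        "path": "models/randomForest_a3f9c2.model",
        "algorithm": "randomForest",
        "trees": 1000,
        "features": 300,
        "whiten": "PCA",
        "software_version": "0.4.2",
    })

    # Later: find every PCA- or kernelPCA-whitened model with at least 1000 trees.
    for doc in runs.find({"trees": {"$gte": 1000},
                          "whiten": {"$regex": "^(kernel)?PCA$"}}):
        print(doc["path"])

Because documents are schemaless, adding a new metadata field later only affects new entries, and old ones can be backfilled with a single update_many call.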

answered Jan 04 '12 at 22:28

Kyzyl Herzog

I have collected, processed, and analyzed large and complicated data sets for 20+ years in both commercial and academic settings. I can tell you that the best data management techniques vary according to your project's needs, but the fundamentals never change; they are, as you alluded to:

Protected - Identifiable - Reproducible

The problem with dates or other codes in filenames, README files, comment logs, etc. is that they are all unverifiable claims. I have no way of knowing if I or someone else inadvertently modified rawdata_2012jan01. I have no way of checking that the files in the directory have actually been processed exactly as stated in the README file. Thinking again about the fundamentals, these methods make the data identifiable but not protected or reproducible. (You may disagree on the last point, but what if you follow all the instructions in the README and get something different?)

@paraba presents a good method using makefiles. Obviously his sets are reproducible, but notice they are also protected. Since the processing steps are in version control, he has visibility of any intended or unintended changes in the process. Any doubt that 2012jan01_foobarred.csv is what it claims to be? Run the script again after pulling the latest rule files from the DVCS.

Do not overlook the power of modern, open-source databases like Postgres, along with modern conveniences like a 500 GB USB hard drive the size of the palm of your hand for $100. There you can store your original/raw data, create trigger functions that log any modification, and write back-end code (PL/pgSQL, or even PL/Python) that returns processed data on demand. Assuming you place all code under version control, you are now protected, identifiable, and reproducible. (Please do not use the database merely to track file locations in a directory; if you think this is a good idea, have someone rename or move some of the files and report back.) Do not worry about the database's ability to handle your data with adequate speed; speed is a matter of database design and optimization that you can learn as you go. As for capacity, Postgres database sizes are unlimited; a single table is limited to 32 TB, a row to 1.6 TB, and a field to 1 GB. The rewards of using such a system are rich; in your analysis code you can do things like get_foobarred_data(using("raw_5jan2012")). You can program whatever system you can imagine. A hypothetical sketch of calling such a back-end function from analysis code follows below.
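Here is what calling such a back-end function from Python analysis code could look like with psycopg2; the database name and the get_foobarred_data function are assumptions (the function itself would be written in PL/pgSQL or PL/Python and kept under version control on the server).

    # analysis.py -- hypothetical sketch of pulling processed data from Postgres
    import psycopg2

    conn = psycopg2.connect("dbname=research")   # assumed local research database
    cur = conn.cursor()

    # Call an assumed server-side function that returns processed rows
    # derived from the named raw data set.
    cur.execute("SELECT * FROM get_foobarred_data(%s)", ("raw_5jan2012",))
    rows = cur.fetchall()

    # Modifications to the raw tables are logged by trigger functions on the
    # server side, so provenance checks live with the data, not in a README.
    cur.close()
    conn.close()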

Lastly, I want to add that a handwritten journal is a necessity. You (and others) still need to understand the big picture, and the journal is the best place to keep track. You can also vastly improve your efficiency, productivity, and effectiveness by spending the last 10-15 minutes of each day writing about your current obstacles and your plan for the next session.

answered Jan 04 '12 at 10:07

Pete

edited Jan 04 '12 at 10:09

I start from a folder structure:

0_data - initial data

1_processed - processed data (cleaning, standardization, etc.)

2_modeling - modeling

3_res_analysis - results analysis

If during processing I have generated several variants, then it goes like:

1_processed_1 // 1_processed_2 // 1_processed_3

and for modeling I use, respectively:

2_modeling_1 // 2_modeling_2 // 2_modeling_3

And yes, every folder contains the associated code and procedures.

answered Dec 24 '11 at 10:24

Vladimir Chupakhin

Regarding:

  • When generating a new data file, always include the date of creation in the file name
  • Keep a journal of what I did each day

Have you considered using a version control system? A very good option is git: http://git-scm.com/

answered Dec 19 '11 at 15:22

Phil Calçado

A VCS will not necessarily solve metadata issues, and it can make things more unwieldy. E.g., how do you properly delete a data set that is no longer needed from the repository?

(Dec 20 '11 at 05:27) Justin Bayer

You can use git-annex (http://git-annex.branchable.com/), which is a better fit for such large data files.

(Dec 20 '11 at 12:40) Hannes S

Justin, this is easy to do in most VCSs. Here's a quick tutorial on how to do so with git: http://bit.ly/vfMmwg and I am not sure why you say it won't resolve metadata issues (commit history is the perfect way to keep track of when and why something changed).

(Dec 21 '11 at 12:03) Phil Calçado

I think Justin wasn't referring to the metadata about what has happened to the file (which is exactly the problem a VCS solves), but rather the metadata relevant to the file that does not appear anywhere on or attached to the file, and thus will not be entered into the VCS. For example, if I have a set of binary files with no header, their source, byte format, internal organization, associated parameter files, author, etc. are all important bits of information, but a VCS doesn't help you track them.

(Jan 04 '12 at 22:32) Kyzyl Herzog

I mostly only save the original data.

Then I have a function that wraps all kinds of preprocessing that can be applied to my data: z-scores, PCA/whitening, feature construction, etc. I then cache that function's results on my hard disk (a minimal sketch of such a cache appears at the end of this answer).

For example

data = preprocess(pca_components=20, zscores=True, whiten=True, ...)

Good things about that:

  • serialization is transparent,
  • no moving around of files when going to a new computer,
  • fast most of the time.
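A minimal sketch of the on-disk cache idea, assuming the result is keyed by the preprocessing options; the pickle-based layout and the _run_preprocessing placeholder are my own illustration, not the author's code.

    # cached_preprocess.py -- hypothetical sketch of caching preprocessing results on disk
    import hashlib, os, pickle

    CACHE_DIR = "preprocess_cache"

    def preprocess(**options):
        """Return preprocessed data from the cache, computing and storing it on a miss."""
        os.makedirs(CACHE_DIR, exist_ok=True)
        key = hashlib.sha1(repr(sorted(options.items())).encode()).hexdigest()
        path = os.path.join(CACHE_DIR, key + ".pkl")
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        data = _run_preprocessing(**options)      # the real z-score/PCA/whitening pipeline
        with open(path, "wb") as f:
            pickle.dump(data, f)
        return data

    def _run_preprocessing(pca_components=None, zscores=False, whiten=False):
        # Placeholder for the actual pipeline described above.
        raise NotImplementedError

    # data = preprocess(pca_components=20, zscores=True, whiten=True)

Libraries such as joblib provide this pattern ready-made, but the hand-rolled version shows the idea: only the original data needs to be carried to a new machine, since the cache can always be regenerated.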

answered Dec 19 '11 at 01:54

Justin Bayer
