|
I have a machine learning algorithm with many parameters (maybe five or six), and I want to tune it for the best results on several datasets. I tune it by hand: set the parameters, run learning, run classification, evaluate. The problem is storing the results of the experiments. At the moment I use a file naming scheme, but by the third or fourth parameter the file names become a bloody mess, and it is inconvenient to analyse them. How do you solve this type of problem? Is there any system for storing this kind of data? Open source is great, free is fine too. |
|
Have a look at jobman. It was developed specifically to run and store results of models with many hyper-parameters, such as MLPs. It supports both a command line interface (results are then stored in a special directory structure) and a PostgreSQL backend for both scheduling and analyzing results. The documentation and link to the software can be found here: http://deeplearning.net/software/jobman/contents.html HTH |
|
I've tried several approaches to this problem, and I'm settling on NoSQL databases, especially MongoDB and CouchDB. http://www.mongodb.org http://couchdb.apache.org I prefer mongo because it is so easy to install, even when you do not have root access. There are other similar systems but I haven't tried them. These are simple document-based databases. The way to use them is to insert "documents" that are like report cards that describe an experiment's parameters and results. These databases don't support SQL, but do allow simple queries such as "find all results for dataset X" or "find all results for model Y using parameter Z". To analyze my results I use Python. The documents in these databases correspond to Python dictionaries, so it is easy to write little scripts to analyze results, show things with matplotlib, etc. |
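To make the "report card" idea concrete, here is a minimal sketch of the document model using only the Python standard library. It is a stand-in, not MongoDB's or CouchDB's API: each experiment is a dictionary written as one JSON line, and the queries are plain list comprehensions. The file name, the field names (`dataset`, `model`, `params`, `results`) and all values are made up for illustration.

```python
import json
import os
import tempfile


def save_experiment(path, doc):
    """Append one experiment document (a dict) as a JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(doc) + "\n")


def load_experiments(path):
    """Read every stored document back as a Python dictionary."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def find(docs, **criteria):
    """Return the documents whose top-level fields equal all the criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# Hypothetical store location and illustrative (invented) experiment records.
store = os.path.join(tempfile.mkdtemp(), "experiments.jsonl")
save_experiment(store, {"dataset": "X", "model": "Y",
                        "params": {"Z": 0.1}, "results": {"accuracy": 0.91}})
save_experiment(store, {"dataset": "X", "model": "Y",
                        "params": {"Z": 0.5}, "results": {"accuracy": 0.87}})
save_experiment(store, {"dataset": "W", "model": "V",
                        "params": {"depth": 3}, "results": {"accuracy": 0.80}})

docs = load_experiments(store)

# "find all results for dataset X"
on_x = find(docs, dataset="X")

# "find all results for model Y using parameter Z"
y_with_z = [d for d in find(docs, model="Y") if "Z" in d["params"]]
```

A real document database adds indexing, concurrent access and richer queries on top of this, but the shape of the data and the style of analysis script stay much the same.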
|
It was for this exact purpose that I wrote jug, which is useful for non-parallel things as well. It is Python based, but does exactly what you want. See this section in the docs for examples and explanations. |
|
Sumatra sounds like the tool you are looking for. Have a look at the getting started guide. If you work with Python and NumPy, it is probably best combined with joblib, which features a powerful memoizing cache decorator that avoids recomputing intermediate results that are not affected by the parameters you changed and would be wasteful to compute every time.
Looks fine, I will try to put my current results into it.
(Dec 05 '10 at 09:13)
helmsman
Sumatra has no ability to add already existing result files. It combines the process of producing a result file with storing it. I could code this, but I'm not a Python coder.
(Dec 06 '10 at 08:54)
helmsman
I would suggest contacting the author directly.
(Dec 06 '10 at 09:14)
ogrisel
|
|
I guess it fundamentally depends on what results you want to store. If the performance is just a single number, then keeping a CSV file with lines like "1st parameter, 2nd parameter, ..., nth parameter, result" does the job, and lets you query it any way you want with SQL (after loading it into sqlite) or something similar. You can also take a hybrid approach if you have more data that could be interesting: write code that adds a line to the CSV file after each experiment and dumps a file full of extra info in /tmp that you can glance at and then delete. scikits.learn has a very extensive cross-validation module, a grid search module that uses cross-validation to tune some parameters, and other nice related things. If all you want is parameter tuning, then adapting your code to be called by scikits can work. What I like to do when I'm writing a paper is to first run experiments in the REPL (interpreter prompt), not keeping track of results, as the algorithms are constantly changing. Once I have an idea of what results I'm going to want and how to present them (like, I'll want tables of metrics X, Y, and Z, and line plots of value V given parameter P, etc.), I change my code to output a LaTeX file with the tables and a gnuplot-readable file with the data to plot. This has the nice bonus that you can then switch from the development set to the test set blindly, without having to fiddle with anything. |
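The CSV-plus-sqlite idea can be sketched with the standard library alone. This is only an illustration: the column names (`learning_rate`, `hidden_units`, `dataset`, `accuracy`) and the numbers are invented, and the CSV is held in a string here where a real setup would read the file that each experiment appends to.

```python
import csv
import io
import sqlite3

# Hypothetical results file: one line per experiment, parameters then result.
CSV_TEXT = """learning_rate,hidden_units,dataset,accuracy
0.1,50,X,0.90
0.01,50,X,0.93
0.01,100,X,0.92
0.1,50,W,0.80
"""


def load_results(csv_file, conn):
    """Create a results table and fill it from an open CSV file."""
    conn.execute(
        "CREATE TABLE results ("
        "learning_rate REAL, hidden_units INTEGER, dataset TEXT, accuracy REAL)"
    )
    rows = (
        (float(r["learning_rate"]), int(r["hidden_units"]),
         r["dataset"], float(r["accuracy"]))
        for r in csv.DictReader(csv_file)
    )
    conn.executemany("INSERT INTO results VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def best_for_dataset(conn, dataset):
    """Return (learning_rate, hidden_units, accuracy) of the best run."""
    return conn.execute(
        "SELECT learning_rate, hidden_units, accuracy FROM results "
        "WHERE dataset = ? ORDER BY accuracy DESC LIMIT 1",
        (dataset,),
    ).fetchone()


conn = sqlite3.connect(":memory:")  # use a file path to keep the database
load_results(io.StringIO(CSV_TEXT), conn)
best_x = best_for_dataset(conn, "X")
run_count = conn.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

Once the rows are in a table, any other question ("average accuracy per hidden_units", "all runs above some threshold") is one more SQL query and needs no new parsing code.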
|
At one of my previous employers, we used a simple database schema in MySQL for that sort of thing. It was a naive Bayes classifier for binning web pages into various categories based on word content. I came onto the project later on, so I didn't participate in the schema design, but the short of it was essentially this: a new set of tables was created to hold the results, one new set for each test executed. For example, if you need 5 tables to describe test results, A B C D E, then test #1 would generate tables A1, B1, C1, D1, E1. -Brian
I know it is not hard to do. But before building a custom solution I want to look for a general one.
(Dec 05 '10 at 07:37)
helmsman
|
|
The important thing is to keep it as simple as possible. In experiments where I have dozens to hundreds of parameter specifications, I like to record everything in logfiles, then parse out the parameters and evaluation results afterwards into a CSV-ish file. |
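A rough sketch of that log-then-parse workflow might look like the following. The log format is an assumption made up for this example (lines starting with `PARAMS` and `RESULT` carrying `key=value` pairs, mixed with other output); a real log would need its own patterns.

```python
import csv
import io
import re

# Hypothetical log contents; the PARAMS/RESULT line format is invented here.
LOG_TEXT = """\
starting run
PARAMS learning_rate=0.1 hidden_units=50
epoch 1 done
RESULT accuracy=0.90 f1=0.88
PARAMS learning_rate=0.01 hidden_units=50
RESULT accuracy=0.93 f1=0.91
"""

PAIR = re.compile(r"(\w+)=(\S+)")


def parse_log(lines):
    """Pair each RESULT line with the most recent PARAMS line before it."""
    rows, current = [], {}
    for line in lines:
        if line.startswith("PARAMS"):
            current = dict(PAIR.findall(line))
        elif line.startswith("RESULT"):
            row = dict(current)
            row.update(PAIR.findall(line))
            rows.append(row)
    return rows


def write_csv(rows, out):
    """Write the parsed rows as CSV, with columns in sorted order."""
    fieldnames = sorted({key for row in rows for key in row})
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


rows = parse_log(LOG_TEXT.splitlines())
buffer = io.StringIO()  # replace with an open file to write to disk
write_csv(rows, buffer)
```

Keeping the raw logs means that a metric forgotten at run time can still be recovered later by adding one more pattern to the parser.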