I've noticed that many papers report results without any information about the variance across re-initializations of the model. I understand this is often infeasible for models that are computationally expensive to train, but what if the variance in performance across re-initializations is larger than the reported difference between models?

How do most researchers resolve this issue? Running several re-initializations may not be feasible: if the model takes 2-3 days to train, it would take months to gather statistically significant results. Using a subset of the dataset is also difficult if one wishes to compare against prior work.

asked Jul 10 '11 at 22:18


crdrn


2 Answers:

Usually if this is the case you either (1) describe a way of choosing (or testing) an initialization on the training data, (2) use a validation set to pick which initialization to apply to the test data (similar to what is often done in early stopping, where a held-out set decides when to stop), or (3) report performance averaged over many random initializations, ideally with a confidence interval.
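For option (3), computing the interval is cheap once you have the numbers. A minimal sketch in Python, with hypothetical accuracy values standing in for the real runs:

    import math

    # Hypothetical numbers: test-set accuracy from n independent
    # re-initializations, one training run per random seed.
    accuracies = [0.912, 0.905, 0.921, 0.908, 0.915]

    n = len(accuracies)
    mean = sum(accuracies) / n
    # Sample standard deviation (with Bessel's correction).
    std = math.sqrt(sum((a - mean) ** 2 for a in accuracies) / (n - 1))
    # Rough 95% interval for the mean; with this few runs, a
    # t-distribution critical value would be more honest than 1.96.
    half_width = 1.96 * std / math.sqrt(n)
    print(f"accuracy: {mean:.3f} +/- {half_width:.3f} (n={n})")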

If running many re-initializations is prohibitively expensive, use another method, or buy computational time on a large cluster where you can run them in parallel. Since all you need from each initialization is its performance on the test set (and perhaps the validation set), this is embarrassingly parallel and easy to do.
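A sketch of what that looks like locally, assuming a hypothetical train_and_evaluate function that trains one model per seed and returns its test score:

    from multiprocessing import Pool
    import random

    def train_and_evaluate(seed):
        # Hypothetical stand-in: in reality, train the model from a
        # fresh initialization drawn with this seed and return its
        # test-set performance.
        rng = random.Random(seed)
        return rng.gauss(0.91, 0.005)

    if __name__ == "__main__":
        seeds = range(20)
        # Runs are fully independent, so they map trivially onto
        # local cores or, with a scheduler, onto cluster nodes.
        with Pool() as pool:
            results = pool.map(train_and_evaluate, seeds)
        print(results)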

answered Jul 11 '11 at 03:17


Alexandre Passos ♦

I already use a validation set for early stopping and I suppose it would be possible to use the same validation results to choose initializations.

Thank you for reminding me of Amazon EC2; you're right about this being embarrassingly parallel.

(Jul 14 '11 at 10:30) crdrn

A good reviewer at a good journal should ask this question: if the method is non-deterministic, how can you be sure that the results are significantly better than those of previous methods?

That said, machine learning algorithms often compete on a pre-existing benchmark dataset, meaning that even tiny increases in performance can be publishable. I'm not a fan of this for non-deterministic algorithms, but it does demonstrate improvement.

In any case, if you want to do it yourself and you don't have the time to run thousands of iterations, consider what can be cached as an intermediate result. A trivial example is running k-means on a dataset thousands of times. If the procedure is

  1. Extract Features
  2. Cluster with k-means
  3. Repeat

then there is no reason to repeat step 1 every iteration. First drafts of code often do this kind of thing (i.e. run everything as one big algorithm) when you could be modularising and caching those intermediate results. Another example, again with k-means, is to precompute the squared norms of the data points and expand the squared Euclidean distance as X^2 - 2XY + Y^2 (where Y is the centroids). This drastically reduces the computation required, and if your features are the same every iteration, the X^2 term only needs to be computed once. There is a paper that shows the expansion, but it escapes me now.
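A sketch of that expansion with NumPy (the array shapes and names here are illustrative, not taken from the paper I mentioned):

    import numpy as np

    X = np.random.rand(10000, 64)   # data points (cached features)
    Y = np.random.rand(8, 64)       # current centroids

    # ||x||^2 depends only on the data, so compute it once and reuse
    # it across every k-means iteration / restart.
    X_sq = (X ** 2).sum(axis=1)

    def sq_distances(Y, X, X_sq):
        # ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2, with the first
        # term precomputed outside the loop.
        Y_sq = (Y ** 2).sum(axis=1)
        return X_sq[:, None] - 2.0 * X @ Y.T + Y_sq[None, :]

    D = sq_distances(Y, X, X_sq)     # shape (10000, 8)
    assignments = D.argmin(axis=1)   # nearest centroid per point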

answered Jul 11 '11 at 01:41


Robert Layton

I can't find the paper I'm referring to at the end, feel free to edit it if someone finds it.

(Jul 11 '11 at 01:42) Robert Layton

This is a good suggestion, but unfortunately the method I'm using operates on raw data and is non-deterministic from the start (it's a neural network).

(Jul 14 '11 at 10:35) crdrn