|
Hello everyone. Consider comparing two machine learning algorithms (A and B) on some dataset. The results (RMSE/F1) of both algorithms depend on a randomly generated initial approximation (parameters). Questions:
Relevant links are welcome! PS. I've seen papers in which the authors use a t-test and a p-value, but I'm not sure whether it is OK to use them in such a situation. |
|
Any test you make will only tell you that A outperforms B on the datasets you test it on. You might think that with enough different datasets you could derive a p-value, but remember that while individual data points can sometimes be assumed IID, different datasets most certainly can't. Proving that algorithm A is better than algorithm B in general is a lost cause, per the no free lunch theorem. You can, however, use learning-theoretic generalisation bounds to compare expected generalisation ability, although this is also problematic since the bounds are often incomparable or impractical. To say that algorithm A is better than algorithm B on a specific dataset, the trivial way is to split the data uniformly at random into a training set and a test set, train on the training set, and apply the test-set bound (essentially a confidence interval assuming a fixed but unknown error probability p for each classifier) on the test set, which can give you a p-value. |
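To make the test-set bound a bit more concrete, here is a minimal sketch in Python. It assumes a classification setting with 0/1 loss (so it does not apply directly to RMSE or F1), and it uses the exact binomial (Clopper-Pearson) interval as one common way to get such a bound. The function names and the error counts are made up for illustration, and the sketch ignores the extra variance that comes from random initialisation.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), computed in log-space for stability."""
    total = 0.0
    for i in range(k + 1):
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_pmf)
    return total

def _solve_decreasing(f, target, tol=1e-9):
    """Bisection for f(p) = target on (0, 1), where f is decreasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid  # p is still too small
        else:
            hi = mid
    return (lo + hi) / 2

def error_interval(k, n, delta=0.05):
    """Exact two-sided interval for the true error rate p, given k errors
    on n held-out test examples; holds with probability >= 1 - delta."""
    # Upper end: the p at which seeing <= k errors has probability delta / 2.
    upper = 1.0 if k == n else _solve_decreasing(
        lambda p: binom_cdf(k, n, p), delta / 2)
    # Lower end: the p at which seeing >= k errors has probability delta / 2.
    lower = 0.0 if k == 0 else _solve_decreasing(
        lambda p: binom_cdf(k - 1, n, p), 1 - delta / 2)
    return lower, upper

# Hypothetical counts: both classifiers evaluated on the same 500 test examples.
n_test = 500
errors_a, errors_b = 40, 85

# delta = 0.025 each, so both intervals hold jointly with probability >= 0.95.
a_lo, a_hi = error_interval(errors_a, n_test, delta=0.025)
b_lo, b_hi = error_interval(errors_b, n_test, delta=0.025)

print(f"A: true error in [{a_lo:.3f}, {a_hi:.3f}]")
print(f"B: true error in [{b_lo:.3f}, {b_hi:.3f}]")
print("A better than B on this dataset:", a_hi < b_lo)
```

If the two intervals do not overlap, you can claim that A has lower error than B on this particular data distribution at the chosen confidence level. Overlapping intervals do not show that the classifiers are equal, only that this test set is too small to separate them.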
If you are interested, there is some discussion here: http://stats.stackexchange.com/questions/4019/measuring-statistical-significance-of-machine-learning-algorithms-comparison