
Writing unit tests is relatively straightforward for any code, even ML code. What I want to do is functional testing for ML code, specifically for semantic models (LSA, topic models, LDA, random indexing). Wikipedia says:

Functional testing differs from system testing in that functional testing "verif[ies] a program by checking it against ... design document(s) or specification(s)"

That is, to test an implementation, I'd like to see the model doing something it's known to do well. That involves replicating published results.

I've seen some tests, for example in MDP, that use a solved example that is known to work: an artificial dataset that produces a known result, which the implementation is then tested against. In their case, they test PCA by generating data with known dimensionality and checking whether the PCA algorithm recovers it.
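
For concreteness, here is a minimal sketch of that kind of check (my own toy example, not taken from MDP's test suite): generate data with a known low-dimensional structure and assert that PCA finds it.

```python
import numpy as np

# Toy data with a known 3-dimensional structure embedded in 10 dimensions.
rng = np.random.RandomState(0)
latent = rng.randn(1000, 3)                  # 3 true degrees of freedom
mixing = rng.randn(3, 10)                    # linear embedding into 10 dims
data = latent.dot(mixing) + 1e-6 * rng.randn(1000, 10)  # tiny noise

# PCA via SVD of the centered data.
centered = data - data.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
explained = s ** 2 / np.sum(s ** 2)

# The first 3 components should carry essentially all of the variance.
assert explained[:3].sum() > 0.999
```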

However, for semantic models it is not easy to find such examples. The whole point of modeling language is that you won't get good results from a toy dataset.

One could run tests with large datasets replicating known results, for example solving the TOEFL synonym test using an encyclopedia. But then the test would involve building a space, and that takes time (minutes under the most optimistic conditions; even days for LDA)... not something to aim for in a test that must be run every time you change your code.

Still, there are some published small examples. The Landauer and Dumais LSA paper has a small example in the appendix.

What's your take on testing and ML/NLP? Have you found small, worked-out examples for, say, LDA, topic models, random indexing, etc.?

asked Dec 27 '10 at 13:20


Jose Quesada

You argue that functional testing takes a long time, but does this need to be done more than every month or two? As long as you unit test the components to ensure that they haven't changed, and perhaps add some assert statements to ensure data is correctly passed to functions, you only need to check once that your implementation works and is not later altered.

(May 24 '11 at 00:54) Robert Layton

3 Answers:

Usually the best bet is to try to recreate the experiment in the paper in which the algorithm was introduced. If you get similar numbers on the dataset they used, you probably implemented it correctly. For algorithms that have already been implemented elsewhere, test both implementations against some standard dataset; their results should largely agree.

answered Dec 27 '10 at 16:25


zaxtax ♦

But this is what I was afraid to hear. It involves: 1. finding the exact same dataset, which is sometimes not easy, since it may be proprietary and buried behind forms, CD shipping, and other fossils. Other times it involves web scraping...

2. replicating the parsing. Papers normally don't go into detail about what was done there, and a simple decision such as what to do with punctuation can change the results.

Any of these things may fail before you even get to replicate the results.

Plus, the replication may take hours of computing time, which is too much for a test you need to run often.

(Jan 03 '11 at 04:55) Jose Quesada

For lots of algorithms you can test the implementation (not the algorithm) on hand-crafted toy data. To test an LSA implementation, for example, you could sample a low-rank decomposition, multiply to get a matrix, perturb it with noise, and see if you recover it reasonably well. For most Bayesian models you can test your algorithm against small samples from the prior, and, modulo identifiability issues, you should always get something reasonable back.
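
A minimal sketch of that LSA-style check (my own toy sizes and noise level, not from any published test):

```python
import numpy as np

# Sample a rank-k "term-document" matrix, perturb it with noise, and check
# that a rank-k SVD (what an LSA implementation effectively computes)
# recovers the clean matrix up to roughly the noise level.
rng = np.random.RandomState(0)
n_terms, n_docs, k = 200, 100, 5
clean = rng.randn(n_terms, k).dot(rng.randn(k, n_docs))
noisy = clean + 0.01 * rng.randn(n_terms, n_docs)

u, s, vt = np.linalg.svd(noisy, full_matrices=False)
recovered = (u[:, :k] * s[:k]).dot(vt[:k, :])

# Relative error should be tiny compared to the signal.
rel_error = np.linalg.norm(recovered - clean) / np.linalg.norm(clean)
assert rel_error < 0.05
```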

answered Dec 27 '10 at 22:11


Alexandre Passos ♦

Software testing is a much-overlooked area in machine learning.

If you can replicate the results in a machine learning paper, that is a good sign. However, the implementation used in the paper often had bugs itself. Additionally, the experimental setup is often described too compactly to give enough information to replicate the experiment exactly. Few machine learning papers include error bars, so it is hard to control for factors like differing random seeds in Monte Carlo algorithms.

I would recommend thoroughly testing sub-routines in a classic unit-testing way. You should also have a stupidly simple, but possibly slow, implementation of the algorithm which you can compare against the efficient implementation.
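
As an illustration, a minimal sketch of this kind of naive-versus-fast comparison (toy cosine-similarity functions, not from any particular library):

```python
import numpy as np

def cosine_naive(X):
    # Obviously correct but slow: explicit double loop over row pairs.
    n = X.shape[0]
    out = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            out[i, j] = X[i].dot(X[j]) / (np.linalg.norm(X[i]) * np.linalg.norm(X[j]))
    return out

def cosine_fast(X):
    # Vectorized: normalize rows once, then a single matrix product.
    normed = X / np.linalg.norm(X, axis=1, keepdims=True)
    return normed.dot(normed.T)

X = np.random.RandomState(0).randn(50, 20)
assert np.allclose(cosine_naive(X), cosine_fast(X))
```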

Particular inference methods come with their own ways of testing them. For instance, with EM and variational methods you can check that the variational lower bound increases every iteration; if it doesn't, there is a bug. For MCMC methods, John Geweke has a collection of methods to test that they are implemented correctly. Derivatives can be checked with finite differences.
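
For example, a minimal sketch of a finite-difference derivative check (a toy least-squares loss, purely illustrative):

```python
import numpy as np

def loss(w, X, y):
    # Toy least-squares objective.
    return 0.5 * np.sum((X.dot(w) - y) ** 2)

def grad(w, X, y):
    # Analytic gradient of the objective above.
    return X.T.dot(X.dot(w) - y)

rng = np.random.RandomState(0)
X, y, w = rng.randn(30, 5), rng.randn(30), rng.randn(5)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.empty_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = eps
    numeric[i] = (loss(w + step, X, y) - loss(w - step, X, y)) / (2 * eps)

assert np.allclose(numeric, grad(w, X, y), atol=1e-4)
```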

answered Dec 29 '10 at 20:49


Ryan Turner
