Hi all!

I have the following setup: I'm able to generate training samples, so I can test the performance of a CRF by training it with different numbers of these examples.

Intuitively, using more training samples should give better results. That's the case when I compare the results of training with 10 samples versus 1000, but if I use more than 1000 samples the results get worse.

I'm training by minimizing the sum of the negative log-likelihoods of those samples.
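In symbols, with training sequences $(x^{(i)}, y^{(i)})$, $i = 1, \dots, N$, the objective being minimized is

$\mathcal{L}(\theta) = -\sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)}; \theta)$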

Any suggestions about the possible cause of this phenomenon?

Thanks!

asked May 30 '13 at 10:53 by JR Ruiz


One Answer:

If results get worse, this is either noise (so it shouldn't be consistent over, say, different subsamples of size 2000 out of the 10000 or however many you have), or the distributions of your training and test data are different. I don't think there is any other reasonable explanation.
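A quick way to check the noise hypothesis is to train on several random subsamples of the same size and see whether the test score swings a lot between trials (noise) or is consistently low (systematic). A minimal sketch in Python, assuming hypothetical train(samples) and f_score(model, test_set) helpers, since the thread doesn't name a training interface:

    import numpy as np

    def check_subsample_consistency(train_samples, test_set, size=2000,
                                    n_trials=5, seed=0):
        # Train on several random subsamples of the same size and compare
        # the resulting test scores. Large variation between trials points
        # to noise; consistently low scores point to a systematic problem.
        rng = np.random.RandomState(seed)
        scores = []
        for _ in range(n_trials):
            idx = rng.choice(len(train_samples), size=size, replace=False)
            model = train([train_samples[i] for i in idx])  # hypothetical trainer
            scores.append(f_score(model, test_set))         # hypothetical scorer
        return np.mean(scores), np.std(scores)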

answered May 31 '13 at 12:55 by Andreas Mueller

Hi,

In fact, my training distribution is a model of the real world, and the test data is real data. What seems strange to me is that the success increases up to 1000 samples, which is already a large number, and then goes down, getting worse each time I use more samples.

(May 31 '13 at 15:52) JR Ruiz

It sounds, then, like you are either not training your model properly or using an evaluation metric that is not statistically consistent. For example, if you are using a poor optimization algorithm, it might take longer to train with more than 1k examples, so you end up not training that model as well. What is your test-time evaluation metric?

(Jun 01 '13 at 06:17) Alexandre Passos ♦

To measure success I'm using the F measure (http://en.wikipedia.org/wiki/Precision_and_recall).
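(For reference, $F_1 = \frac{2 \cdot P \cdot R}{P + R}$, where $P$ is precision and $R$ is recall computed over the predicted labels.)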

To train, i.e. to optimize, I'm relying on the quasi-Newton limited-memory BFGS (L-BFGS) implementation in the UGM library (http://www.di.ens.fr/~mschmidt/Software/UGM.html), and the function being minimized is the sum of the negative log-likelihoods of the training samples.
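For illustration, here is roughly what that setup looks like with SciPy's L-BFGS-B in Python. This is only a sketch, not the UGM/MATLAB code; nll_and_grad is a hypothetical stand-in for the routine that returns the summed negative log-likelihood and its gradient:

    import numpy as np
    from scipy.optimize import minimize

    def nll_and_grad(theta, samples):
        # Hypothetical stand-in: return the summed negative log-likelihood of
        # `samples` under a CRF with parameters `theta`, plus its gradient.
        # In the thread this part is computed by UGM in MATLAB.
        raise NotImplementedError

    # train_samples: the list of training sequences (assumed defined elsewhere)
    n_params = 100                      # assumed number of CRF parameters
    theta0 = np.zeros(n_params)
    result = minimize(nll_and_grad, theta0, args=(train_samples,), jac=True,
                      method="L-BFGS-B",
                      # tightening these tolerances changes how early the
                      # optimizer declares convergence
                      options={"maxiter": 1000, "ftol": 1e-10, "gtol": 1e-6})
    theta_hat = result.x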

I've tested with 1000 training samples, and the optimization takes about 100 steps, but with 7000 it takes only 60. Is there an explanation for that? I can't figure it out...

(Jun 03 '13 at 06:54) JR Ruiz

It sounds like the convergence criterion for L-BFGS is wrong, and it's stopping too early.

Here's a test you can do: pick a very small learning rate and just run batch gradient descent on your objective, with a learning rate small enough that every step strictly decreases the objective value, and run for something like 1000 iterations. See which value you can reach with this method and which value you reach with L-BFGS. My bet is that L-BFGS will reach a worse value (a higher negative log-likelihood), which would explain the behavior you're seeing.
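A minimal sketch of that comparison in Python, assuming the same kind of hypothetical nll_and_grad(theta, samples) objective as above and a hand-tuned learning rate small enough that every step decreases the objective:

    import numpy as np
    from scipy.optimize import minimize

    def compare_gd_vs_lbfgs(nll_and_grad, theta0, samples, lr=1e-4, n_iters=1000):
        # Plain batch gradient descent with a small, fixed learning rate.
        theta = theta0.copy()
        for _ in range(n_iters):
            _, grad = nll_and_grad(theta, samples)
            theta = theta - lr * grad
        gd_value, _ = nll_and_grad(theta, samples)

        # L-BFGS from the same starting point on the same objective.
        res = minimize(nll_and_grad, theta0, args=(samples,), jac=True,
                       method="L-BFGS-B")
        # If res.fun is noticeably higher than gd_value, L-BFGS stopped too early.
        return gd_value, res.fun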

(Jun 03 '13 at 10:07) Alexandre Passos ♦