My research is on authorship analysis, and is mainly concerned with small amounts of data, e.g. given 100 tweets from each of 10 authors, accurately predict which author wrote a given document.

In practice there is far more data available than this, so I am in the position of having data I don't want to use in training.

When data is limited, it is often best to use k-fold cross-validation as a way of re-sampling datasets. Given the large amounts of data I have, I am thinking it may be better to just create an entirely new dataset for each iteration -- ensuring generalisability more than CV could. However, I can't find any justification for that in the literature.

My question is: should I still use CV when I have other data I'm not using, or should I just create an entirely new dataset for each iteration of testing?
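To make the second protocol concrete, here is a minimal sketch of what I mean by "a new dataset for each iteration". The pool, the split sizes, and the majority-class baseline are all hypothetical stand-ins for a real corpus and classifier:

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical pool of (document, author_id) pairs -- far more data
# than we actually want to train on in any one round.
pool = [(f"doc_{i}", i % 10) for i in range(100_000)]

def fresh_split(pool, n_train=1000, n_test=1000):
    """Draw a brand-new, disjoint train/test sample from the pool."""
    sample = random.sample(pool, n_train + n_test)
    return sample[:n_train], sample[n_train:]

def majority_baseline(train, test):
    """Trivial stand-in classifier: always predict the most common
    author in the training sample."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test) / len(test)

# Five independent evaluation rounds, each on fresh data --
# contrast with CV, which re-splits one fixed dataset.
scores = [majority_baseline(*fresh_split(pool)) for _ in range(5)]
```

Each round's score is computed on data never seen in any previous round, which is the generalisability argument I'm making above.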

asked Feb 13 '12 at 18:29


Robert Layton

edited Feb 13 '12 at 18:29

One Answer:

In the case where the amount of data is large enough that computation becomes a factor, a lot of general machine learning wisdom can be revisited to achieve better performance. The best guide I know to the tradeoffs in this setting is Leon Bottou's The Tradeoffs of Large-scale Learning.

In the online setting advocated by Bottou, the progressive validation loss (that is, the error your algorithm makes on each new data sample before updating on it, averaged over all recent enough samples) is an unbiased estimator of test-set loss (for obvious reasons) that is more stable than cross-validation loss.

You can find a justification for a simple train/test split (as is common in many benchmark datasets) in the test-set error bound; see John Langford's tutorial on practical prediction theory for classification, which as a bonus shows how to compute tight confidence intervals on the expected error (as long as you don't optimize on the test set).
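As an illustration of the flavour of such bounds (Langford's tutorial derives tighter exact-binomial bounds; the version below uses the looser but self-contained Hoeffding inequality):

```python
import math

def hoeffding_upper_bound(errors, n, delta=0.05):
    """Upper confidence bound on the true error rate from a held-out
    test set of n samples with `errors` observed mistakes.

    By Hoeffding's inequality, with probability at least 1 - delta:
        true_error <= empirical_error + sqrt(ln(1/delta) / (2n))
    Valid only if the test set was never used for training or tuning.
    """
    return errors / n + math.sqrt(math.log(1 / delta) / (2 * n))
```

For example, 50 errors on 1000 test samples gives an empirical error of 0.05 and a 95% upper bound of roughly 0.089, and the interval shrinks as 1/sqrt(n).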

What is standard in the neural networks and natural language processing communities (among others) is to use a three-way split of your data: a part is used for training, a second smaller part for hyperparameter tuning and model selection, and a third part is reserved for one last test the day before the paper deadline to make sure that you're definitely not overfitting.
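A sketch of that three-way split (the 80/10/10 proportions are a common convention, not a rule):

```python
import random

def three_way_split(data, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle and split into train / dev / test.

    - train: fitting model parameters
    - dev:   hyperparameter tuning and model selection
    - test:  touched exactly once, at the very end
    """
    data = list(data)
    random.Random(seed).shuffle(data)
    n_train = int(len(data) * train_frac)
    n_dev = int(len(data) * dev_frac)
    return (data[:n_train],
            data[n_train:n_train + n_dev],
            data[n_train + n_dev:])
```

The discipline matters more than the code: every tuning decision is scored on dev, so the test split still gives an unbiased final estimate.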

answered Feb 13 '12 at 19:25


Alexandre Passos ♦


