In general, when using k-fold cross validation, it seems to be the case that:

  • A larger k will produce an estimate with smaller bias but potentially higher variance (on top of being computationally expensive)
  • A smaller k will lead to smaller variance but may produce a biased estimate (your random split may be too easy or too hard)

(here, by estimator I mean the estimator of the true error rate, calculated by averaging the validation error across all k folds)

First off, is the above view accurate? Second, what if we pick a smaller k but repeat cross-validation m times on different random splits? Would we gain anything from having a small k but larger m versus having a large k repeated only once?

From a more practical standpoint, what would be a "sound" way of going about picking k and m for a given dataset size? For N=500, I have the vague intuition that k=10 may be too high, since that would leave only 50 instances per validation fold, but using a smaller k like 3 would increase the chance of an uneven split in terms of learnability... How would I know whether my chosen number of folds is appropriate?
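
For concreteness, a minimal sketch of the averaging described above, assuming scikit-learn and a synthetic N=500 dataset (the question itself is library-agnostic):

```python
# The CV estimate of the true error rate is the mean validation error over the k folds.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)  # stand-in for the real data
clf = LogisticRegression(max_iter=1000)

for k in (3, 10):
    errors = 1 - cross_val_score(clf, X, y, cv=KFold(n_splits=k, shuffle=True, random_state=0))
    print(f"k={k}: error estimate {errors.mean():.3f} (spread across folds {errors.std():.3f})")
```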

asked Feb 04 '11 at 15:07


yeastwars

edited Feb 04 '11 at 15:09


See the section "What are cross-validation and bootstrapping?" in Part 3 of the Usenet comp.ai.neural-nets FAQ, ftp://ftp.sas.com/pub/neural/FAQ3.html#A_cross, for a review of this subject, including references.

(Feb 04 '11 at 20:42) Will Dwinnell

2 Answers:

You can do leave-one-out CV very cheaply with some learning algorithms. For KNN it is obviously as easy as classifying the whole training set, and SVMs are also easy: after optimizing on the whole dataset, you classify non-support-vector points normally and classify each support vector according to the other support vectors.
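
To illustrate the KNN case, here is a sketch of the "LOO for free" idea, assuming scikit-learn and a synthetic dataset (the answer itself doesn't name any library):

```python
# Fit the neighbor index once on all N points, then classify each point by a majority
# vote over its k nearest neighbors excluding itself.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

X, y = make_classification(n_samples=500, random_state=0)  # synthetic stand-in dataset
k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # one "training" pass over the full dataset
_, idx = nn.kneighbors(X)                         # each point's k+1 neighbors, self included
neighbors = idx[:, 1:]                            # drop the query point itself
preds = np.array([np.bincount(y[nb]).argmax() for nb in neighbors])
print("LOO error estimate:", np.mean(preds != y))
```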

IIRC John Langford doesn't like leave-one-out cross-validation because it, funnily enough, can be very biased. See for example http://hunch.net/?p=29 and http://hunch.net/?p=22 (especially the comments) for discussions of pitfalls involving cross-validation.

I try to avoid doing cross-validation, and prefer to gather enough data (if possible) to have a held-out-until-paper-writing-day test set and a smaller development set. If the problem is easy, such as binary classification, a test-set error bound doesn't need that many examples to become very tight.

answered Feb 04 '11 at 17:23


Alexandre Passos ♦


JL's blog post (the first one you linked to) seems to argue that it's difficult to take the result of training on each fold of k-fold CV and build/train a single classifier along with tight bounds on the generalization error of that particular classifier. In my understanding, that's a slightly different problem than estimating the error on whatever classifier you end up with after training on an i.i.d. training set, which counterintuitively seems to be an easier thing to estimate, since it's exactly what you're doing when you do CV.

(Feb 04 '11 at 18:23) Kevin Canini

Yes, but I think this is relevant to the question because, as the blog shows, increasing K is not necessarily a good idea: in the worst case the CV estimator is only equivalent to a test-set bound on a test set of size m/K, which is not very good if K is large.

(Feb 04 '11 at 19:35) Alexandre Passos ♦

Thanks for the links. John Langford's blog is quite interesting.

I try to avoid doing cross-validation, and prefer to gather enough data (if possible) to have a held-out-until-paper-writing-day test set and a smaller development set. If the problem is easy, such as binary classification, a test-set error bound doesn't need that many examples to become very tight.

Isn't it risky to make a comparison on a single held out test set? What if you were unlucky and ended up with a very easy, very hard or very uneven test set?

(Feb 05 '11 at 19:05) yeastwars

Under i.i.d. assumptions, cross-validation gives an unbiased estimate of the true error rate, because you're sampling from exactly that distribution. (Note: there is a small pessimistic bias relative to training on the full dataset, because each model is trained on a subsample of the data, i.e. a smaller training set.) Increasing k decreases the variance, but only slightly. As an interesting side note, there is no unbiased estimator of the variance of cross-validation. For a thorough treatment, see: http://jmlr.csail.mit.edu/papers/volume6/markatou05a/markatou05a.pdf

The procedure you describe in your second question is quite common; it's called repeated random sub-sampling. I suspect that k-fold cross validation is a newer idea.
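
A sketch of both procedures, assuming scikit-learn (ShuffleSplit is one implementation of repeated random sub-sampling, and RepeatedKFold covers the "small k repeated m times" variant from the question):

```python
# Both report the mean validation error over all runs; only the splitting scheme differs.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
clf = LogisticRegression(max_iter=1000)

splitters = {
    "repeated sub-sampling (m=20)": ShuffleSplit(n_splits=20, test_size=0.2, random_state=0),
    "k=3 repeated m=10 times": RepeatedKFold(n_splits=3, n_repeats=10, random_state=0),
}
for name, cv in splitters.items():
    err = 1 - cross_val_score(clf, X, y, cv=cv)
    print(f"{name}: error estimate {err.mean():.3f}")
```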

I like to use leave-one-out k-fold CV because it has the lowest variance, but it's really slow, so you can only use it on smallish datasets. A value of k=10 is really common and scales well to large datasets. You're concerned about k=10 being too high because it only leaves 50 instances for validation, but keep in mind that you're repeating the CV procedure for all 10 sets of 50, so you're actually doing validation on every example.

answered Feb 04 '11 at 17:01


Kevin Canini

keep in mind that you're repeating the CV procedure for all 10 sets of 50, so you're actually doing validation on every example

That's true, but if you're doing model selection based on a discontinuous metric like the misclassification rate, then the number of different values your metric can take will be limited by the number of instances in the validation set. You'll effectively be picking models based on whether they classified a handful of instances better than another model. Even if you're averaging over multiple runs, this doesn't seem very robust (for lack of a better term). I can see this being less of a problem in regression or when using more continuous error functions like Mean Squared Error or Negative Log Likelihood.
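
A tiny illustration of that granularity point (hypothetical numbers; nothing here depends on a particular model):

```python
# With 50 validation instances, the misclassification rate can only take 51 distinct values
# (0/50, 1/50, ..., 50/50), so model comparisons move in whole 2% steps.
n_val = 50
possible_rates = [errors / n_val for errors in range(n_val + 1)]
print(len(possible_rates))   # 51 distinct values
print(possible_rates[:4])    # 0.0, 0.02, 0.04, 0.06
```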

(Feb 05 '11 at 18:59) yeastwars

Your comment "I like to use leave-one-out k-fold CV because it has the lowest variance" is at odds with these two answers: (1): http://stats.stackexchange.com/a/62080/2798 and (2): http://stats.stackexchange.com/questions/61546/number-of-folds-for-k-fold/61549?noredirect=1#61549

(Jul 11 '13 at 22:59) user815423426