I remember hearing that when a large number of people gather to guess the number of jellybeans in a jar, or the weight of a cow, and their guesses are averaged, the result comes out quite close to the true value. Is this an example of ensemble methods? Of bagging? What theory is behind this? As a side question, does this underlie or connect to the wisdom of crowds?
The key idea is that uncorrelated errors cancel. The graph that drives this point home for me the most is on page 12 of this paper: many of the classifiers in the Adaboost ensemble do very poorly, but the ensemble contains a lot of diversity and outperforms other less diverse but more individually accurate ensembles. Certainly the wisdom of crowds and ensemble methods both exploit this cancellation of uncorrelated errors. Bagging exploits it a bit by giving the ensemble members different samples of the same dataset. Random subspace methods like random forests exploit it further, by forcing members of the ensemble to make decisions based on different subsets of features.

@Rob Renaud, is there a simple example of why uncorrelated errors cancel? Something more visual, or a toy example?
(Sep 29 '11 at 10:36)
VassMan
It's just that uncorrelated things in general tend to cancel when added. If you flip a coin 10 times, you are much more likely to be around 5 heads than 10 heads: the 'excess' heads from one flip cancel the 'excess' tails of another. The cool tradeoff that some ensemble methods make is that they sacrifice some individual accuracy in ensemble members in exchange for big gains in uncorrelatedness. It's interesting, and unobvious a priori, that this would be a good strategy for building a classifier.
(Sep 29 '11 at 14:49)
Rob Renaud
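A minimal simulation sketch of the coin-flip point above (the helper name `mean_abs_deviation` and the trial counts are my own choices, not from the thread): as the number of flips grows, the fraction of heads hugs 0.5 more and more tightly, because excess heads on some flips cancel excess tails on others.

```python
import random

random.seed(1)

def mean_abs_deviation(n_flips, trials=2000):
    """Average |fraction of heads - 0.5| over many runs of n_flips flips."""
    total = 0.0
    for _ in range(trials):
        heads = sum(random.random() < 0.5 for _ in range(n_flips))
        total += abs(heads / n_flips - 0.5)
    return total / trials

# With more flips, the typical deviation from 0.5 shrinks:
# uncorrelated per-flip errors cancel in the sum.
dev_10 = mean_abs_deviation(10)      # roughly 0.12
dev_1000 = mean_abs_deviation(1000)  # roughly 0.01
```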
@Rob Renaud, is there any connection between this and the central limit theorem? That is, do the combined guesses form a normal distribution centered on the true value? Thanks.
(Sep 30 '11 at 12:51)
VassMan
Yes, the CLT is the basis for this. If you have a sum of i.i.d. variables (i.e. uncorrelated "errors") with finite mean and variance, their (suitably normalized) sum will tend to the normal distribution. Note that if each member of the crowd is biased in the same direction, then the crowd will also have this bias. That's why diversity is a good thing.
(Oct 10 '11 at 05:56)
Oscar Täckström
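The bias caveat in the comment above can be illustrated with a small sketch (the numbers `truth`, `shared_bias`, and `noise_sd` are hypothetical, chosen only for illustration): averaging washes out each guesser's independent noise, but a bias shared by every guesser survives in the crowd average.

```python
import random

random.seed(2)

truth = 100.0
shared_bias = 10.0   # hypothetical: every guesser overestimates by 10
noise_sd = 20.0

guesses = [truth + shared_bias + random.gauss(0, noise_sd)
           for _ in range(5000)]
crowd_average = sum(guesses) / len(guesses)

# The mean's spread shrinks like 1/sqrt(M) (CLT), so the independent
# noise nearly vanishes -- but the shared bias is untouched: the crowd
# average lands near truth + shared_bias, not near truth.
```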
The first answer does not quite address the right issue here. The question was (1) about averaging numbers, and (2) how this connects to the "wisdom of crowds". The answer is that yes, averaging a set of real-valued or integer-valued guesses, as described, is an example of an ensemble method, and the fact that the result is very close to the target is a consequence of Jensen's inequality.

The precise formulation is as follows. Suppose we have a set of guesses x_i (i = 1..M) for the number of jelly beans, and the true number of beans is x*. The average (mean) of the guesses is written:

    x̄ = (1/M) Σ_i x_i

Now using the squared distance as our measure, the error of the average guess is given by:

    (x̄ − x*)² = (1/M) Σ_i (x_i − x*)² − (1/M) Σ_i (x_i − x̄)²

Notice that the first term on the right-hand side is the average error of the individual guesses. Notice also that the second term is always non-negative. Hence the error of the average guess is guaranteed to be less than or equal to the average of the individual errors. This is not saying that the average will be closer than the best individual guess. The proof of the above can be seen by setting x* to zero, which shows this is a special case of Jensen's inequality. In the ensemble methods literature this is known as the Ambiguity decomposition, or more widely as "diversity" - see this paper for details.

When the numbers we combine are not real values but votes, as in Adaboost, the above theory does not apply. There is, however, substantial theory showing how a voting ensemble reduces error via large-sample properties of the binomial distribution, more eloquently described by Tom Dietterich in his classic 2001 paper.

The Wisdom of Crowds is the title of a book, not a formal mathematical theory. The book made reference to an event in history where Francis Galton (cousin of Charles Darwin) was at a country fair where a competition was set up to guess the weight of an ox. Galton amazed the crowd by applying the above principle for squared distance, and getting an answer that was better than most, but not all, of the crowd.

@Gavin Brown, why is this an inequality when there is an equals sign?
(Aug 13 '13 at 05:59)
VassMan
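The ambiguity decomposition in the answer above can be checked numerically. A minimal sketch (the jellybean count and the crowd's guesses here are made-up numbers for illustration): the ensemble's squared error equals the average individual error minus the diversity term, so it can never exceed the average individual error.

```python
true_count = 4500                          # hypothetical true jellybean count
guesses = [4000, 5200, 3800, 5000, 4600]   # made-up crowd guesses

M = len(guesses)
avg = sum(guesses) / M

ensemble_error = (avg - true_count) ** 2
avg_individual_error = sum((x - true_count) ** 2 for x in guesses) / M
diversity = sum((x - avg) ** 2 for x in guesses) / M

# Ambiguity decomposition: ensemble error = average error - diversity.
# Since diversity >= 0, the ensemble never does worse than the average
# individual (though it may do worse than the single best guesser).
assert abs(ensemble_error - (avg_individual_error - diversity)) < 1e-6
```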