
In my studies I have only been exposed to Bayesian methods and, partially, to frequentist statistics. I recently discovered a number of new methods after using consensus clustering to determine the right number of clusters for k-means.

The theory behind consensus clustering appears to be bagging. Related to bagging (adjacent? above? below?) there are other methods such as boosting, ensemble methods, random forests, weak/strong classifiers, etc. From a preliminary reading of the Wikipedia pages about them, I am confused about how to differentiate them: where they differ and where they overlap (which is maybe the basis for these methods ;).

I would like a structured overview of how these methods relate to each other and of the changes in the basic underlying theory that differentiate them.

asked Sep 23 '11 at 07:19

VassMan


One Answer:

Ensemble methods are the most general concept: they just mean building multiple variant models and aggregating over them somehow. Ensemble methods are usually divided into bagging and boosting, depending on whether the models are trained independently (bagging) or whether previous models are allowed to influence the training of subsequent models (boosting), usually by making each new model compensate for the errors of the previous ones. Boosting also used to be called "arcing", particularly by Breiman. Bagging originally had a narrower meaning: training all the variant models by applying the same algorithm to bootstrap samples of the training data, but in more recent literature it seems to be used in the broader sense given above.

Generally bagging primarily reduces variance error, while boosting/arcing reduces bias error. Boosting can be prone to overfitting, particularly in the presence of label noise; bagging is generally more robust, but boosting can learn faster if your data is clean. Bagging and boosting can also be combined in various ways to reduce both variance and bias error.
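To make this concrete, here is a minimal sketch of bagging in its original, narrow sense (bootstrap samples of one base learner, aggregated by majority vote). It assumes NumPy and scikit-learn are available, that X and y are NumPy arrays, and that the class labels are small non-negative integers; the function names are just for illustration.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagged_fit(X, y, n_models=25, seed=0):
        # Train each model independently on a bootstrap sample of the data.
        rng = np.random.default_rng(seed)
        models = []
        for _ in range(n_models):
            idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
            models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
        return models

    def bagged_predict(models, X):
        # Aggregate by majority vote; assumes integer class labels.
        votes = np.stack([m.predict(X) for m in models])  # shape (n_models, n_samples)
        return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

Boosting differs precisely in that loop: instead of independent bootstrap samples, each new model would be fit with sample weights (or residuals) derived from the errors of the models trained so far.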

Random forests are a kind of bagged trees. The trees are trained on (usually bootstrap) samples of the training data, but at each split in a tree only a random subset of the available features is considered for splitting on.
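For comparison, scikit-learn's RandomForestClassifier exposes exactly those two sources of randomness; the dataset here is synthetic and the parameter values are only illustrative.

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    forest = RandomForestClassifier(
        n_estimators=200,     # number of bagged trees
        bootstrap=True,       # each tree sees a bootstrap sample of the rows
        max_features="sqrt",  # each split considers a random subset of the columns
        random_state=0,
    ).fit(X, y)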

Ensemble methods have primarily been used in supervised settings (classification & regression), but the concept also makes sense in unsupervised, semi-supervised, ... settings; this is only being explored more recently. Consensus clustering thus seems to be primarily bagged clustering, usually bagged k-means. There are also unsupervised random forests, which let you learn a similarity measure between items in a way that makes no assumptions about the geometry of your feature space.
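As a rough sketch of what "bagged k-means" can look like in practice (one of several variants of consensus clustering, so treat the details as an assumption): run k-means many times on subsamples, record how often each pair of points ends up in the same cluster, and inspect the resulting co-association matrix.

    import numpy as np
    from sklearn.cluster import KMeans

    def coassociation(X, k, n_runs=50, frac=0.8, seed=0):
        # Fraction of runs in which each pair of points falls in the same k-means cluster.
        rng = np.random.default_rng(seed)
        n = len(X)
        together = np.zeros((n, n))
        counted = np.zeros((n, n))
        for _ in range(n_runs):
            idx = rng.choice(n, size=int(frac * n), replace=False)
            labels = KMeans(n_clusters=k, n_init=10,
                            random_state=int(rng.integers(1 << 31))).fit_predict(X[idx])
            same = (labels[:, None] == labels[None, :]).astype(float)
            counted[np.ix_(idx, idx)] += 1.0
            together[np.ix_(idx, idx)] += same
        return together / np.maximum(counted, 1.0)

A value of k for which this matrix is close to block-diagonal (pairs are either almost always or almost never clustered together) is the usual consensus signal that k is a good number of clusters.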

It is less clear how to adapt boosting to unsupervised methods, since the way models depend on each other in boosting is to compensate for each other's errors. In the absence of a training signal, I do not see what kind of dependency you would want. However, Google tells me that people seem to be doing something with unsupervised boosting, so what do I know :)

answered Sep 23 '11 at 13:52

Daniel Mahler

edited Sep 26 '11 at 13:25


Very nice and synthetic answer, Daniel, it's a pleasure to read.

(Sep 25 '11 at 10:04) Gael Varoquaux