A Random Forest is a collection of Decision Trees, each built from a random subset of the features (and sometimes a bagged sample of the training data). They apparently learn and generalize well. Has anybody done MCMC sampling of the decision-tree space, or compared it to Random Forests? I know it might be computationally more expensive to run the MCMC and save all the sampled trees, but I am interested in the theoretical properties of this model, not the computational costs. What I mean is something like this:

  1. Construct a random decision tree (it would probably perform horribly).
  2. Compute the posterior of the tree, something like P(Tree | Data) ∝ P(Data | Tree), or perhaps with an additional P_prior(Tree) term.
  3. Propose a random change to the tree and accept or reject it based on P(Tree | Data) (e.g., a Metropolis-Hastings step).
  4. Every N steps, save a copy of the current tree.
  5. Go back to step 3 for some large number of iterations, N*M in total.
  6. Use the collection of M saved trees for prediction (e.g., by averaging their votes).

Would this give performance similar to Random Forests? Note that here we are not throwing away good data or features at any step, unlike Random Forests. (A rough sketch of this loop is below.)
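To make the loop above concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the trees have a fixed depth, the only moves are jittering a split threshold or resampling a split feature (a real tree-space sampler, as in Bayesian CART, would also propose grow/prune/swap moves), the data term is a Laplace-smoothed Bernoulli model per leaf, and the prior over trees is flat.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy data: two features, binary label.
    X = rng.normal(size=(200, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

    DEPTH = 2  # complete binary tree; internal nodes hold (feature, threshold)
    N_INTERNAL, N_LEAVES = 2 ** DEPTH - 1, 2 ** DEPTH

    def random_tree():
        # Step 1: a random tree (it will probably perform horribly).
        return {"feat": rng.integers(0, X.shape[1], size=N_INTERNAL),
                "thr": rng.normal(size=N_INTERNAL)}

    def leaf_index(tree, x):
        # Route a single example to a leaf of the complete binary tree.
        node = 0
        for _ in range(DEPTH):
            node = 2 * node + (2 if x[tree["feat"][node]] > tree["thr"][node] else 1)
        return node - N_INTERNAL

    def log_likelihood(tree, X, y):
        # Step 2: log P(Data | Tree) with a Laplace-smoothed Bernoulli per leaf.
        leaves = np.array([leaf_index(tree, x) for x in X])
        ll = 0.0
        for leaf in range(N_LEAVES):
            yl = y[leaves == leaf]
            p = (yl.sum() + 1.0) / (len(yl) + 2.0)
            ll += np.sum(yl * np.log(p) + (1 - yl) * np.log(1 - p))
        return ll

    def propose(tree):
        # Step 3's random move: jitter one threshold or resample one split feature.
        new = {"feat": tree["feat"].copy(), "thr": tree["thr"].copy()}
        node = rng.integers(0, N_INTERNAL)
        if rng.random() < 0.5:
            new["thr"][node] += rng.normal(scale=0.3)
        else:
            new["feat"][node] = rng.integers(0, X.shape[1])
        return new

    # Steps 3-5: Metropolis-Hastings with a flat prior and symmetric proposals.
    tree = random_tree()
    ll = log_likelihood(tree, X, y)
    saved, N, M = [], 50, 20
    for step in range(N * M):
        cand = propose(tree)
        cand_ll = log_likelihood(cand, X, y)
        if np.log(rng.random()) < cand_ll - ll:
            tree, ll = cand, cand_ll
        if (step + 1) % N == 0:                      # step 4: thin the chain
            saved.append({"feat": tree["feat"].copy(), "thr": tree["thr"].copy()})

    def predict(x):
        # Step 6: average each saved tree's leaf vote, then threshold.
        votes = []
        for t in saved:
            leaves = np.array([leaf_index(t, xi) for xi in X])
            yl = y[leaves == leaf_index(t, x)]
            votes.append(yl.mean() if len(yl) else 0.5)
        return int(np.mean(votes) > 0.5)

    print("train accuracy:", np.mean([predict(xi) == yi for xi, yi in zip(X, y)]))

Averaging the saved trees' leaf votes in predict plays the role of the forest's majority vote.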

asked Aug 24 '11 at 15:18

probreasoning

edited Aug 24 '11 at 15:57

Decision trees usually don't compute P(data|tree), so I think this is unlikely.

(Aug 24 '11 at 15:23) Alexandre Passos ♦

@Alexandre, but we could compute P(data|tree) (under some model)

(Aug 24 '11 at 15:52) probreasoning

Sure, but this is not how decision trees are used: they are discriminative, not generative, and they are also non-probabilistic.

(Aug 24 '11 at 15:58) Alexandre Passos ♦

3 Answers:

Pretty much this exact idea has been done; the references below should get you started in the right place. I have absolutely no idea whether it actually works well: I have no personal experience with it and have not studied the literature closely enough to make an informed guess.

Chipman, George, and McCulloch (1998). Bayesian CART model search. Journal of the American Statistical Association 93(443):935-950.

Chipman, George, and McCulloch (2010). BART: Bayesian Additive Regression Trees. Annals of Applied Statistics 4(1):266-298.

A different Bayesian approach that DOES work is to define some convenient priors over tree structures and use reasonable search heuristics to look for trees with large posterior probability. Wray Buntine has some very nice work using this practical Bayesian approach to decision tree learning (note: no MCMC). I've used these kinds of trees on pretty hard learning tasks, and they work very well as base models in ensembles. The choice of prior is important, but there is a good default based on minimum description length encoding. You can search for an old C implementation from NASA Ames that goes under the name of the IND decision tree package; the tree style you want is MML (not strict MML). Unfortunately, the latest versions of gcc have gotten pickier, and the IND code no longer compiles without extensive source fiddling; it might be easier to use gcc 3.x. Anyway, here is the key Buntine reference (a generic sketch of the scoring idea follows it):

Buntine (1992). Learning classification trees. Statistics and Computing 2(2):63-73.
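The scoring idea behind this family of methods can be sketched generically: combine a prior over tree structures with the marginal likelihood of the data given the tree, and search (or sample) for trees where the product is large. The snippet below is only an illustration of that idea, not the actual scoring used in IND or in Buntine's paper; the per-leaf penalty and the Beta-Bernoulli leaf model are choices made for this sketch.

    import math

    def log_tree_prior(n_leaves, penalty_per_leaf=math.log(2)):
        # log P(Tree): pay roughly one bit per leaf (an MDL-flavoured penalty).
        return -penalty_per_leaf * n_leaves

    def log_marginal_likelihood(leaf_counts, alpha=1.0, beta=1.0):
        # log P(Data | Tree) for binary labels, with each leaf's Bernoulli rate
        # integrated out under a Beta(alpha, beta) prior.  leaf_counts is a list
        # of (positives, negatives) pairs, one pair per leaf.
        ll = 0.0
        for pos, neg in leaf_counts:
            ll += (math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
                   + math.lgamma(alpha + pos) + math.lgamma(beta + neg)
                   - math.lgamma(alpha + beta + pos + neg))
        return ll

    def log_posterior_score(leaf_counts):
        # Unnormalised log P(Tree | Data): the quantity a heuristic search (or an
        # MCMC sampler) over tree structures would try to increase.
        return log_tree_prior(len(leaf_counts)) + log_marginal_likelihood(leaf_counts)

    # A two-leaf tree with nearly pure leaves beats a single mixed "root-only"
    # tree, even after paying the extra leaf penalty.
    print(log_posterior_score([(40, 5), (5, 40)]))
    print(log_posterior_score([(45, 45)]))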

answered Aug 24 '11 at 20:01

Art Munson

What do you mean when you say that random forests "throw away good data or features"? Random forests consist of many trees that were constructed greedily to optimize classification given a subset of the features or data. You propose to use trees that are all highly correlated (since they come from a Markov chain) and that are all pretty bad. Correlation between trees is bad too (this has been shown in a strict mathematical sense).

Usually you have to limit the number of trees for computational reasons. Why do you think your construction would be better for a given number of trees?

@Alexandre: Why do you need a generative model? I think (hope) P(data|tree) is meant as P(y|tree,X). This is computed by a decision tree, right?

Maximizing P(y|tree, X) is what is (approximately) done in decision forest training, right? To do Bayesian model selection, you also need a prior on trees. In practice, this is done by restricting the size of the tree or setting some threshold on the information gain that a new leaf needs to have.

So I guess that by defining moves on a tree and setting a prior, you might get trees that fit the data better (since random moves can fit better than greedy selection) and have better control over the complexity of the model. Whether such a tree performs better than a standard decision tree is not really clear to me.
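To make the P(y | tree, X) interpretation above concrete, here is a minimal sketch using scikit-learn; the dataset, the tree depth, and the small smoothing constant (added only to avoid log(0) at pure leaves) are arbitrary choices for this illustration.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=5, random_state=0)
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

    proba = tree.predict_proba(X)               # class frequencies of each point's leaf
    proba = (proba + 1e-3) / (proba + 1e-3).sum(axis=1, keepdims=True)  # smooth pure leaves
    log_p = np.log(proba[np.arange(len(y)), y]).sum()   # log P(y | tree, X)
    print(log_p)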

It's not that closely related but you might be interested in the paper "MIForests". The authors construct a forest using deterministic annealing.

This answer is marked "community wiki".

answered Aug 24 '11 at 16:45

Andreas Mueller

I'm not so much arguing that MCMC sampling of trees would be better or more efficient; I am just trying to understand the differences and similarities, since sampling is used to compute expectations in a number of applications in fields other than decision trees. Should bagging and random selection of features be used in those fields as well, to construct a number of weak learners and take a majority/average vote? Can we argue or prove something either way?

(Aug 24 '11 at 21:51) probreasoning

I think that if you could show that some moves generate trees with "independent" predictions, you would be in good shape.

(Aug 25 '11 at 07:07) Andreas Mueller

Well, the idea is that one would sample at sufficiently long intervals, and take enough samples, to get a good representation of the space. This is a pervasive problem in MCMC methods: it leads to computational expense, and sometimes one doesn't know whether the space has been explored or not. MCMC sampling is still used, however. But you are right, random forests avoid this problem, so they may perform better in practice. I think it would be nice to see this in some experiments on real data.

(Aug 25 '11 at 13:55) probreasoning

In a similar vein, the Decision Tree Fields paper by Nowozin et al. from MSR Cambridge uses decision trees as the feature functions in CRFs for image recognition tasks, and they have some pretty good results.

answered Aug 24 '11 at 15:26

Alexandre Passos ♦
