|
What is the current state of the art in large-scale learning of a single decision tree on a single core? Is it feasible today to build an accurate decision tree over millions of examples with hundreds of real-valued features? Thanks! P.S. I know that ensembles of decision trees, learned in parallel, are widely used in industry, but that's another topic. I am interested in a single decision tree on a single core.
|
I did that using scikit-learn on ImageNet as a test run. That dataset has 1.2 million examples; I used 1000-dimensional features, and there are 1000 classes. So from a computing point of view it is definitely possible. But I'm pretty sure that you'll run into overfitting issues using a single tree.
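To give a sense of the API involved, here is a rough scikit-learn sketch; the random X and y and the parameter values below are placeholders, not what was actually run on ImageNet:

```python
# Rough sketch only: X and y are random stand-ins for real data, and the
# parameter values are illustrative, not recommendations.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.random((100_000, 100), dtype=np.float32)  # stand-in for a large dense feature matrix
y = rng.integers(0, 1000, size=100_000)           # stand-in for the 1000 class labels

tree = DecisionTreeClassifier(
    min_samples_leaf=50,   # limit leaf size instead of pruning
    max_depth=None,
    random_state=0,
)
tree.fit(X, y)
print(tree.get_depth(), tree.get_n_leaves())
```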
Thanks for the pointer! I hope I will be able to overcome the overfitting by not going too deep and/or by using pruning.
(Jan 19 '12 at 00:36)
dgp
AFAIK there is no pruning, but the minimum number of samples in a leaf and the maximum depth can be specified.
(Jan 19 '12 at 04:47)
Andreas Mueller
The minimum number of samples in a leaf is a decent heuristic, since it is a surrogate for how good your loss estimates are. I don't think maximum depth is useful, from my experience; it doesn't correspond to any useful criterion. The most principled thing you could do is to minimize the l1-regularized log-loss. Look at chapter 4 ("Learning") of my thesis, in which I provide the splitting criterion and line-search method for picking the leaf weights. I do it over an ensemble of decision trees, but you could induce a single tree and stop. The simplest thing would be to choose the minimum number of examples per leaf, and cross-validate to pick that hyperparameter, as sketched below.
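Something like the following would do it (the synthetic data and the grid values are just placeholders):

```python
# Sketch of picking min_samples_leaf by cross-validation; data and grid are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=20_000, n_features=50, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"min_samples_leaf": [10, 30, 100, 300, 1000]},
    cv=5,
    n_jobs=1,  # single core, per the original question
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```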
(Jan 26 '12 at 04:45)
Joseph Turian ♦♦
This seems to miss the point. I would argue that if the problem can fit into main memory, then the problem isn't large-scale.
(Feb 05 '12 at 16:33)
downer
|
|
Two pieces of work that might be useful.

1) The Hoeffding tree work by Pedro Domingos and Geoff Hulten. The idea is to grow a single tree from a stream of data points, so the training data do not need to be held in memory all at once. Pedro has a C implementation on his website.

2) YaDT: Yet Another Decision Tree builder, which claims to be fast. There is follow-on work that changes some of the algorithm design to exploit multicore. You can find the software on Ruggieri's home page. (I haven't used this software myself.)

Finally, the old IND tree software package contains a (poorly documented) approximation used to improve scalability, which I haven't seen used elsewhere. The idea is to use a random subsample of the data when computing information gain (or Gini, or whatever) whenever there are more than T data points at the current tree node. The default for T in IND was something like 1000. On large data with many numeric features, this approximation greatly reduced running time without appreciably altering model accuracy. My one worry would be how it might affect performance on problems with a skewed class distribution. A sketch of the idea is given after this answer.

I forgot to mention: the original Hoeffding tree paper ignores the problem of how to handle numeric features. You'd need to check the follow-up literature to see how people handle that. The implementation on Pedro's website does handle numeric features, but I've never dug into the code to learn which mechanism it uses.
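The subsampling trick is simple enough to sketch. In the code below, best_split_exact() is a hypothetical stand-in for whatever exact split-criterion computation the tree grower already uses, and T = 1000 mirrors IND's reported default:

```python
import numpy as np

T = 1000  # threshold above which a node is subsampled (IND's default was on this order)

def best_split_subsampled(X_node, y_node, rng=np.random.default_rng(0)):
    """Evaluate the split criterion on a random subsample when the node is large."""
    n = X_node.shape[0]
    if n > T:
        idx = rng.choice(n, size=T, replace=False)  # uniform subsample without replacement
        X_node, y_node = X_node[idx], y_node[idx]
    return best_split_exact(X_node, y_node)  # information gain / Gini on the (sub)sample
```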
(Feb 01 '12 at 14:23)
Art Munson
|
|
I wouldn't say that this is current, by any means, but you could look at my thesis. (It's several years old.) Why do you want a single decision tree instead of boosting an ensemble of them? I induced a boosted ensemble of decision trees on a dataset with 20M examples and 1M features, if I recall correctly. I minimize the l1-regularized loss, which creates sparse decision trees. Because I couldn't fit the entire feature matrix in memory, I used a sampling technique called Priority Sampling. It samples the examples and reweights them to give unbiased estimates of the loss I minimize. I would then keep the feature matrix for the sample in memory, induce an entire tree, reweight all the examples (because I'm boosting), and then repeat.
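For reference, the core of Priority Sampling (Duffield, Lund & Thorup) is only a few lines. The sketch below is just to illustrate the sample-and-reweight idea; w (the example weights from boosting) and m (the sample size) are placeholders:

```python
import numpy as np

def priority_sample(w, m, rng=np.random.default_rng(0)):
    """Sample m of len(w) weighted items; return their indices and unbiased weight estimates."""
    w = np.asarray(w, dtype=float)
    u = 1.0 - rng.random(w.shape[0])        # Uniform(0, 1], avoids division by zero
    priorities = w / u                      # priority q_i = w_i / u_i
    order = np.argsort(priorities)[::-1]    # largest priorities first
    keep = order[:m]                        # the m sampled items
    tau = priorities[order[m]]              # the (m+1)-th largest priority
    w_hat = np.maximum(w[keep], tau)        # unbiased estimates of the sampled weights
    return keep, w_hat
```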
|
A data set with a few million cases and a few hundred features is not that large by modern standards. It would fit in the memory of a larger Amazon EC2 instance, which means that one can train a model using the usual in-memory tools such as R (for example, see the R packages rpart, party, and gbm for very high-quality implementations of decision trees). Having said that, one should be careful with the maximum depth of the tree as well as with the split and stopping criteria, since the solution space for trees can grow exponentially as a function of some of these parameters (so it is easy to overwhelm any supercomputer by relaxing these parameters too much).
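As a quick sanity check of the "fits in memory" claim (the sizes below are arbitrary examples, not the asker's actual data):

```python
# Back-of-envelope memory estimate for a dense feature matrix.
n_examples = 3_000_000
n_features = 300
bytes_per_value = 8  # float64; halve for float32
gib = n_examples * n_features * bytes_per_value / 2**30
print(f"{gib:.1f} GiB")  # ~6.7 GiB for the matrix alone
```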
|
The JMLR paper "A Streaming Parallel Decision Tree Algorithm" allows sequential construction similar to the Hoeffding tree paper mentioned above. This can be done in parallel or on a single core.
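The central data structure in that paper is a fixed-size online histogram kept per feature at each node. A rough illustrative sketch of the update step (not the authors' implementation, and the bin limit is arbitrary) looks like this:

```python
class OnlineHistogram:
    """Fixed-size histogram in the spirit of Ben-Haim & Tom-Tov; illustrative only."""

    def __init__(self, max_bins=64):
        self.max_bins = max_bins
        self.bins = []  # sorted list of (center, count) pairs

    def update(self, x):
        self.bins.append((float(x), 1))
        self.bins.sort(key=lambda b: b[0])
        if len(self.bins) > self.max_bins:
            # merge the two adjacent bins whose centers are closest
            gaps = [self.bins[i + 1][0] - self.bins[i][0] for i in range(len(self.bins) - 1)]
            i = gaps.index(min(gaps))
            (c1, n1), (c2, n2) = self.bins[i], self.bins[i + 1]
            self.bins[i:i + 2] = [((c1 * n1 + c2 * n2) / (n1 + n2), n1 + n2)]
```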