The idea of a histogram approximation to the training examples seen by an algorithm is quite useful for online learning methods, where the training data arrives sequentially. The most standard implementation is a histogram with N ordered buckets that are filled one at a time. Each bucket corresponds to a range of values and is incremented as the algorithm encounters data falling within that range.

This presents two situations. When the data is bounded, each bucket has width RANGE/N, which yields an exceptionally simple O(1) algorithm for adding data to the bins. It becomes more difficult when the data is unbounded or its bounds are not known in advance.
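For concreteness, here is roughly what I mean by the bounded case (a minimal sketch; the names are just illustrative):

    class FixedHistogram:
        """N equal-width buckets over a known [lo, hi) range."""
        def __init__(self, lo, hi, n_bins):
            self.lo, self.n = lo, n_bins
            self.width = (hi - lo) / n_bins
            self.counts = [0] * n_bins

        def add(self, x):
            # O(1): compute the bucket index directly; clamp so that
            # x == hi falls into the last bucket instead of overflowing.
            i = min(int((x - self.lo) / self.width), self.n - 1)
            self.counts[i] += 1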

In this case, when a new data point exceeds the current histogram's bounds, we would want to resize the bins to cover it, merge the two bins with the smallest counts, and then add the new value. This is clearly more computationally expensive.
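One way to do this while keeping all widths equal is to repeatedly double the range, merging adjacent pairs of bins, until the new point fits. That is a variant of the smallest-count merge described above (pairwise merging is what preserves uniform widths); the sketch below assumes an even number of bins and growth on the upper side only, and the names are again illustrative:

    def add_unbounded(x, lo, width, counts):
        n = len(counts)
        while x >= lo + width * n:
            # N bins of width w become N/2 bins of width 2w; the freed
            # upper half of the array becomes empty new bins.
            counts = [counts[i] + counts[i + 1] for i in range(0, n, 2)]
            counts += [0] * (n - len(counts))
            width *= 2
        i = min(int((x - lo) / width), n - 1)
        counts[i] += 1
        return lo, width, counts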

Do any of you know of any really efficient implementations of, or discussions concerning, the design of histograms meeting these criteria?

One paper that discusses this appeared in JMLR in 2010: "A Streaming Parallel Decision Tree Algorithm" by Ben-Haim and Tom-Tov (Section 2.1). However, the biggest problem with this approach is that the histogram buckets are not uniform in size, and I'm not really sure how tractable it would be to alter their algorithm to handle this constraint.
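As I understand it, their histogram is a sorted list of (centroid, count) pairs with a fixed bin budget B: each insertion adds a unit bin, and when the budget is exceeded, the two bins with the closest centroids are merged into their weighted mean. A rough sketch of that update (mine, not their reference code):

    import bisect

    def bht_update(bins, x, B):
        # bins: list of (centroid, count) pairs, kept sorted by centroid.
        bisect.insort(bins, (x, 1))
        if len(bins) > B:
            # Merge the adjacent pair whose centroids are closest.
            i = min(range(len(bins) - 1),
                    key=lambda j: bins[j + 1][0] - bins[j][0])
            (c1, k1), (c2, k2) = bins[i], bins[i + 1]
            bins[i:i + 2] = [((c1 * k1 + c2 * k2) / (k1 + k2), k1 + k2)]
        return bins

The merging is what makes the bucket widths non-uniform, which is exactly the property I'd like to avoid.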

The reason I'm asking this question here is that I am using two histograms per feature, per class, per leaf node of a decision tree, to maintain a running approximation of the distribution of feature values among the samples belonging to each class. From this, calculating information gain and selecting the splitting feature (and split point) is trivial, as sketched below. Thus, it appears far simpler to use a histogram approximation that maintains equal bin sizes, although I'm not entirely sure how crucial this is.
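To illustrate why equal-width bins make the split search easy: with one count array per class, the class distribution on either side of any bin boundary is just a prefix sum, and information gain follows directly. A simplified sketch (not my exact code; it assumes the node has seen at least one sample):

    from math import log2

    def entropy(class_counts):
        total = sum(class_counts)
        if total == 0:
            return 0.0
        return -sum((c / total) * log2(c / total)
                    for c in class_counts if c > 0)

    def best_split(per_class_counts):
        # per_class_counts: class label -> list of equal-width bin counts.
        classes = list(per_class_counts)
        n_bins = len(per_class_counts[classes[0]])
        totals = {c: sum(per_class_counts[c]) for c in classes}
        n = sum(totals.values())
        parent = entropy(list(totals.values()))
        left = {c: 0 for c in classes}
        best_gain, best_boundary = 0.0, None
        for b in range(n_bins - 1):  # candidate split after bin b
            for c in classes:
                left[c] += per_class_counts[c][b]
            nl = sum(left.values())
            gain = (parent
                    - (nl / n) * entropy([left[c] for c in classes])
                    - ((n - nl) / n) * entropy([totals[c] - left[c]
                                                for c in classes]))
            if gain > best_gain:
                best_gain, best_boundary = gain, b
        return best_gain, best_boundary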

asked Nov 17 '11 at 15:42

kmore


One Answer:

My suggestion is to measure how much your current approximation method hurts the results in your application.

You don't mention building ensembles with your streaming trees, but I recommend trying that as long as interpretability isn't a requirement. For one thing, combining multiple tree models will probably recover any losses incurred by your approximations; the approximation error is just another source of randomness that gets averaged out. If you go down this path, you might want to check out "Online Bagging and Boosting" by Oza (2001).
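The core idea there, if I remember correctly, is that each base learner sees each incoming example k times with k drawn from Poisson(1), which approximates bootstrap resampling in the streaming setting. A rough sketch, assuming your trees expose some incremental update method:

    import math, random

    def poisson1():
        # Knuth's method for sampling from Poisson(lambda = 1).
        threshold, k, p = math.exp(-1.0), 0, 1.0
        while True:
            p *= random.random()
            if p <= threshold:
                return k
            k += 1

    def online_bagging_update(learners, x, y):
        # Each base learner sees the example Poisson(1) times.
        for model in learners:
            for _ in range(poisson1()):
                model.update(x, y)  # hypothetical incremental-update API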

answered Nov 18 '11 at 14:43

Art Munson
