In a sequential (on-line) tree algorithm, I'm trying to estimate class-label densities using a histogram. The algorithm grows a tree by generating test functions at each decision node, which correspond to the features in the data set. That is to say, each feature is used to create a test of the form g(x) > theta that decides the left/right propagation of samples down each node of the binary tree: if g(x) exceeds the threshold theta, the sample is propagated down the right branch, otherwise down the left branch. When training the tree, I have to choose the best test to use at each decision node, which is accomplished using a quality measure of the form

ΔL(R_j, s) = L(R_j) − (|R_jls| / |R_j|) · L(R_jls) − (|R_jrs| / |R_j|) · L(R_jrs),  with  L(R_j) = −Σ_{i=1..K} (p_i)^j log (p_i)^j
where R_jls and R_jrs are the left and right partitions made by test s (the data above and below the threshold theta), (p_i)^j is the label density of class i in node j (likewise jls/jrs for the label densities of class i in the left or right child node), K is the total number of classes, and |·| denotes the number of samples in a partition. In other words, the gain with respect to a test s = g(x) depends on the entropy of the class labels within that feature, which requires knowing the class-label density at each decision node.

My uncertainty lies in how this can be accomplished with a histogram. A histogram for each feature would have N bins of size range/N, where range is the difference between the max and min values that the feature takes. If we don't have a priori knowledge of this range, we can keep track of the max/min values as we get more training data. Such a histogram tracks the number of samples falling within each bin, but it tells us nothing about the class labels.

Another option is to have a histogram for each class, for each feature, for each node. That is, you'd keep track of the number of samples having a particular class label whose feature value falls within each bin. This solution seems more in line with the above equation, since we need the label densities for each class and for each node, but I just want to verify that I'm on the right track. For reference, see part 2.1.2 of this paper.
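The per-class, per-feature histogram idea can be sketched roughly like this (a minimal sketch for one feature at one node; the class names, bin count, and known-range assumption are mine, not from the paper):

```python
import math

class ClassHistograms:
    """Class-count histogram for a single feature at a single node.

    Assumes the feature range [lo, hi] is known up front; n_bins and all
    names here are illustrative.
    """
    def __init__(self, n_classes, lo, hi, n_bins=10):
        self.n_classes = n_classes
        self.lo, self.hi = lo, hi
        self.n_bins = n_bins
        # counts[b][k] = number of samples of class k that fell in bin b
        self.counts = [[0] * n_classes for _ in range(n_bins)]

    def bin_index(self, x):
        # uniform bins; clamp so values at/above hi land in the last bin
        b = int((x - self.lo) / (self.hi - self.lo) * self.n_bins)
        return min(max(b, 0), self.n_bins - 1)

    def update(self, x, label):
        # one online update per arriving (feature value, class label) pair
        self.counts[self.bin_index(x)][label] += 1

def entropy(class_counts):
    """L(R) = -sum_i p_i log p_i, from a list of raw class counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total)
                for c in class_counts if c > 0)
```

Summing `counts[b]` over the bins on each side of a candidate threshold gives the class counts for R_jls and R_jrs, from which the label densities (p_i)^jls and (p_i)^jrs follow directly.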
The standard approach is to keep a histogram of class counts, for each feature, in the node currently being split. In the online setting, you are correct that you must also maintain histograms for all nodes where the feature test is not yet finalized.

Thanks for the response. Another issue I'm having is keeping track of the max/min values: for a histogram with uniform bin sizes, it seems naive to assume the first couple of samples will provide reliable information about the range of subsequent feature values. Since I'm ostensibly the one designing the features, these max/min values are presumably already known (or can be reasonably guessed). That said, if I don't have a priori knowledge of the range, would it make sense to use non-uniform bin sizes, so that the available bins better approximate the distribution? The trade-off is speed: locating a sample's bin among non-uniform bins is O(n) with a linear scan (O(log n) with a binary search over the bin edges), whereas with uniform bins it's a constant-time arithmetic operation.

Additionally, in computing the threshold value, can it be thought of as the point at which the positive-class histogram (i.e., the histogram of class counts for class A) and the negative-class histogram (i.e., the histogram of counts for any class other than A) intersect? If so, that's just where the in-class and out-of-class counts are equal, but in a multi-modal histogram there could be many such points, so it might be better still to just generate a random threshold.
(Sep 07 '11 at 03:26)
kmore
